Weaviate Vector Database Destination

Ingest unstructured documents (PDFs, Word docs, text files, HTML) into Weaviate, an open-source vector database for RAG and semantic search applications. The pipeline extracts text from the document, chunks it using a configurable strategy, generates vector embeddings, and upserts the chunks with metadata into a Weaviate class.

Configuration

"source": {
    "fileAttributes": {
        "unstructuredAttributes": {
            "fileExtension": "pdf",
            "preserveFilename": true
        }
    }
},
"destination": {
    "weaviate": {
        "className": "FinancialDocuments",
        "chunking": {
            "strategy": "recursive",
            "chunkSize": 500,
            "chunkOverlap": 50
        },
        "metadata": {
            "company": "Apple Inc",
            "documentType": "10-Q",
            "filingDate": "2026-01-30"
        },
        "weaviateSecretName": "oss/weaviate"
    }
}

Field	Type	Default	Description
`className`	string	(required)	Weaviate class name. Must be PascalCase (e.g., `FinancialDocuments`). Auto-created if it doesn’t exist.
`chunking`	object	recursive, 500/50	Chunking strategy configuration (see below).
`metadata`	map	`{}`	Static key-value metadata stored on every chunk as Weaviate properties. Used for filtered search.
`embeddingSecretName`	string	server default	Optional override of the embedding Vault secret. Defaults to the server-level `ai.embedding.secretName` (`oss/embedding`), which is seeded automatically — set this only to point a single pipeline at a different embedding model.
`weaviateSecretName`	string	(required)	Vault secret name for Weaviate connection.
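
The two required fields and the PascalCase rule can be checked before a pipeline is submitted. The sketch below is only an illustration: `check_destination` is a hypothetical helper, and the regular expression is one assumed reading of "PascalCase" (the server's actual validation may be stricter or looser).

```python
import re

# Assumed reading of "PascalCase": an uppercase ASCII letter followed only by
# ASCII letters and digits. The server's real check may differ.
PASCAL_CASE = re.compile(r"^[A-Z][A-Za-z0-9]*$")


def check_destination(weaviate_cfg: dict) -> list[str]:
    """Return a list of problems found in a "weaviate" destination block."""
    problems = []
    # className and weaviateSecretName are the fields marked (required) above.
    for field in ("className", "weaviateSecretName"):
        if not weaviate_cfg.get(field):
            problems.append(f"missing required field: {field}")
    name = weaviate_cfg.get("className", "")
    if name and not PASCAL_CASE.match(name):
        problems.append(f"className is not PascalCase: {name}")
    return problems
```

An empty list means the block passed these two checks; it does not guarantee the server will accept the configuration.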

Supported File Types

Format	Description
PDF (`.pdf`)	Text extracted via Apache PDFBox
Word (`.doc`)	Text extracted via Apache POI (legacy format)
Word (`.docx`)	Text extracted via Apache POI (modern format)
PowerPoint (`.ppt`)	Text extracted via Apache POI (legacy format)
PowerPoint (`.pptx`)	Text extracted via Apache POI (modern format)
Excel (`.xls`, `.xlsx`)	Cell values extracted via Apache POI
HTML (`.html`, `.htm`)	Text extracted via JSoup (tags stripped)
RTF (`.rtf`)	Text extracted via javax.swing RTF parser
Email (`.msg`)	Subject, from, to, and body extracted via Apache POI
Email (`.eml`)	Subject, from, and body extracted via Jakarta Mail
EPUB (`.epub`)	XHTML content extracted and parsed via JSoup
Plain text (`.txt`, `.md`, `.csv`, `.json`, `.xml`)	Content used directly

CSV, JSON, and XML files are accepted here, but their content is embedded as plain text. To load them as structured data, use a database destination (PostgreSQL, MongoDB) instead.

Vault Secrets

Weaviate connection

vault kv put secret/oss/weaviate \
  host="host.docker.internal" \
  port="8079" \
  apiKey=""

Field	Description
`host`	Weaviate server hostname. Use `host.docker.internal` for local Weaviate.
`port`	REST API port (default 8079 to avoid conflict with pipeline’s 8080).
`scheme`	Protocol — `http` (default) or `https` for Weaviate Cloud.
`apiKey`	API key for Weaviate Cloud. Empty string for local instances.
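
As a rough illustration of how these secret fields combine, the sketch below builds a base URL and request headers from them. Both helper names are hypothetical, and the `Authorization: Bearer` header reflects Weaviate's API-key authentication; how the pipeline itself assembles the connection is not shown in this document.

```python
def weaviate_base_url(secret: dict) -> str:
    """Build the REST base URL from the Vault secret fields."""
    scheme = secret.get("scheme") or "http"  # http is the documented default
    host = secret["host"]
    port = secret.get("port")
    return f"{scheme}://{host}:{port}" if port else f"{scheme}://{host}"


def auth_headers(secret: dict) -> dict:
    """Return an Authorization header only when an API key is set."""
    api_key = secret.get("apiKey", "")
    return {"Authorization": f"Bearer {api_key}"} if api_key else {}
```

With the local secret shown above, this yields `http://host.docker.internal:8079` and no authentication header.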

Embedding API

The embedding secret is server-level (`ai.embedding.secretName`, default `oss/embedding`) and is seeded automatically by `docker/vault-init.sh`. See AI Configuration and the Qdrant docs for the full picture.

vault kv put secret/oss/embedding \
  provider="openai" \
  endpoint="https://api.openai.com/v1/embeddings" \
  model="text-embedding-3-small" \
  apiKey="sk-..."

For Anthropic-only deployments, the bundled TEI sidecar serves `bge-m3`, and `vault-init.sh` seeds the embedding secret to point at it, so no OpenAI key is required.
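
To make the role of each secret field concrete, here is a minimal sketch of an embedding request for the `provider="openai"` case. The function name is hypothetical, and the body shape (`model` plus `input`) follows the public OpenAI embeddings API; the request the pipeline actually sends, and the shape expected by other providers, may differ.

```python
import json
import urllib.request


def build_embedding_request(secret: dict, texts: list[str]) -> urllib.request.Request:
    """Prepare (but do not send) a POST to the configured embedding endpoint."""
    body = json.dumps({"model": secret["model"], "input": texts}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if secret.get("apiKey"):
        # OpenAI-style bearer token; omitted when no key is configured.
        headers["Authorization"] = f"Bearer {secret['apiKey']}"
    return urllib.request.Request(
        secret["endpoint"], data=body, headers=headers, method="POST"
    )
```

Passing the returned object to `urllib.request.urlopen` would send the request; the sketch stops short of that so it can be inspected without a network call.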

Chunking Strategies

Documents are split into chunks before embedding. Each chunk becomes a separate Weaviate object carrying the document’s metadata plus the `text`, `chunk_index`, `filename`, and `source_pipeline` properties.

"chunking": {
    "strategy": "recursive",
    "chunkSize": 500,
    "chunkOverlap": 50
}

Strategy	Description
`none`	No chunking — one embedding per document. Only for very short documents.
`fixed`	Split by character count. Fast but may cut mid-sentence.
`sentence`	Split on sentence boundaries (`.` `!` `?`). Preserves semantic units.
`paragraph`	Split on double newlines. Ideal for structured documents with clear sections.
`recursive`	Try `\n\n`, then `\n`, then `.`, then space — best general-purpose default.

`chunkSize` (default 500): maximum number of characters per chunk.
`chunkOverlap` (default 50): number of characters shared between consecutive chunks.
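
The `fixed` and `recursive` strategies can be sketched in a few lines. This is an illustration of the general technique only, not the pipeline's implementation: the function names are hypothetical, the recursive version ignores `chunkOverlap` for brevity, and separators that fall on a chunk boundary are dropped.

```python
SEPARATORS = ("\n\n", "\n", ".", " ")  # coarsest first, as in the table above


def fixed_chunks(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Split by character count; consecutive chunks share chunk_overlap chars."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks


def recursive_chunks(text: str, chunk_size: int = 500, separators=SEPARATORS) -> list[str]:
    """Split on the coarsest separator; recurse into pieces that are too long."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        return fixed_chunks(text, chunk_size, 0)  # last resort: hard cut
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            chunks.extend(recursive_chunks(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return [c for c in chunks if c.strip()]
```

The recursive version keeps paragraphs intact when they fit and only falls back to finer separators for oversized pieces, which is why it tends to be a reasonable general-purpose default.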

Metadata

Static metadata is attached to every chunk as Weaviate properties, enabling filtered semantic search:

"metadata": {
    "company": "Apple Inc",
    "documentType": "10-Q",
    "filingDate": "2026-01-30"
}

This allows queries like “find chunks about revenue from Apple 10-Q filings” by combining vector similarity with metadata filters. In addition to static metadata, every chunk automatically includes:

`text`: the chunk text
`chunk_index`: position of the chunk in the document
`filename`: original uploaded filename
`source_pipeline`: pipeline name
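
The resulting property set for one chunk can be pictured as a simple merge of the static metadata with the automatic fields. The helper below is hypothetical, and letting the automatic properties win on a key collision is an assumption; the document does not say how the pipeline resolves such a clash.

```python
def chunk_properties(text, chunk_index, filename, source_pipeline, metadata=None):
    """Combine static metadata with the automatic per-chunk properties."""
    props = dict(metadata or {})  # copy so the caller's dict is not mutated
    # Assumption: automatic properties override a metadata key of the same name.
    props.update(
        {
            "text": text,
            "chunk_index": chunk_index,
            "filename": filename,
            "source_pipeline": source_pipeline,
        }
    )
    return props
```

For the metadata shown above, each chunk would carry seven properties: the three static keys and the four automatic ones.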

How It Works

1. Upload: an unstructured file (PDF, DOC, DOCX, HTML, text) is uploaded via `POST /api/v1/pipeline/upload`.
2. Extract: text is extracted from the document (PDFBox for PDFs, Apache POI for Word, JSoup for HTML).
3. Chunk: text is split into chunks using the configured strategy.
4. Embed: each chunk is sent to the embedding API to generate a vector.
5. Upsert: vectors are upserted into the Weaviate class with metadata properties.
6. Notify: a pipeline notification is published on completion.

The class is auto-created on first upsert using HNSW indexing with cosine distance.
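
For orientation, a class definition with HNSW indexing and cosine distance looks roughly like the payload built below, which uses the field names of Weaviate's schema API. The helper is hypothetical, and several details are assumptions not stated in this document: `vectorizer` is set to `none` because vectors are generated outside Weaviate, and every metadata key is typed as `text`.

```python
def class_definition(class_name: str, metadata_keys: list[str]) -> dict:
    """Build a Weaviate class definition (not sent anywhere by this sketch)."""
    properties = [
        {"name": "text", "dataType": ["text"]},
        {"name": "chunk_index", "dataType": ["int"]},
        {"name": "filename", "dataType": ["text"]},
        {"name": "source_pipeline", "dataType": ["text"]},
    ]
    # Assumption: static metadata values are stored as text properties.
    properties += [{"name": key, "dataType": ["text"]} for key in metadata_keys]
    return {
        "class": class_name,
        "vectorizer": "none",  # assumption: embeddings are supplied by the pipeline
        "vectorIndexType": "hnsw",
        "vectorIndexConfig": {"distance": "cosine"},
        "properties": properties,
    }
```

Comparing this shape with the output of the schema endpoint is a quick way to confirm what the pipeline actually created.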

Running Weaviate

Weaviate is not part of the pipeline's Docker stack (the same applies to Qdrant and Ollama), so start it separately. To run it locally with Docker:

docker run -p 8079:8080 -p 50051:50051 cr.weaviate.io/semitechnologies/weaviate:latest

Port 8079 is used to avoid a conflict with the pipeline server on port 8080. Alternatively, install Weaviate by following weaviate.io/developers/weaviate/installation.

Verifying

# List all classes
curl http://localhost:8079/v1/schema

# Get class details
curl http://localhost:8079/v1/schema/FinancialDocuments

# Get objects
curl "http://localhost:8079/v1/objects?class=FinancialDocuments&limit=5"

# Get objects matching a metadata filter
curl -X POST http://localhost:8079/v1/graphql \
  -H "Content-Type: application/json" \
  -d '{
    "query": "{ Get { FinancialDocuments(limit: 5, where: {path: [\"company\"], operator: Equal, valueText: \"Apple Inc\"}) { text chunk_index filename company documentType } } }"
  }'
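
Hand-escaping quotes inside the GraphQL string is error-prone, so it can help to generate the request body programmatically. The sketch below rebuilds the same filtered query; `filtered_get_query` is a hypothetical helper, and it only covers the `Equal` operator on a text property.

```python
import json


def filtered_get_query(class_name, path, value, fields, limit=5):
    """Return the JSON body for a GraphQL Get query with one Equal filter."""
    # json.dumps produces correctly quoted and escaped string literals.
    where = (
        f"{{path: [{json.dumps(path)}], operator: Equal, "
        f"valueText: {json.dumps(value)}}}"
    )
    field_list = " ".join(fields)
    query = (
        f"{{ Get {{ {class_name}(limit: {limit}, where: {where}) "
        f"{{ {field_list} }} }} }}"
    )
    return json.dumps({"query": query})
```

The returned string can be passed to `curl -d` or sent with any HTTP client as the body of a POST to `/v1/graphql`.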
