Ingest unstructured documents (PDFs, Word docs, text files, HTML) into Weaviate, an open-source vector database for RAG and semantic search applications. The pipeline extracts text from the document, chunks it using a configurable strategy, generates vector embeddings, and upserts the chunks with metadata into a Weaviate class.

Configuration

"source": {
    "fileAttributes": {
        "unstructuredAttributes": {
            "fileExtension": "pdf",
            "preserveFilename": true
        }
    }
},
"destination": {
    "weaviate": {
        "className": "FinancialDocuments",
        "chunking": {
            "strategy": "recursive",
            "chunkSize": 500,
            "chunkOverlap": 50
        },
        "metadata": {
            "company": "Apple Inc",
            "documentType": "10-Q",
            "filingDate": "2026-01-30"
        },
        "embeddingSecretName": "oss/embedding",
        "weaviateSecretName": "oss/weaviate"
    }
}

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| className | string | (required) | Weaviate class name. Must be PascalCase (e.g., FinancialDocuments). Auto-created if it doesn’t exist. |
| chunking | object | recursive, 500/50 | Chunking strategy configuration (see below). |
| metadata | map | {} | Static key-value metadata stored on every chunk as Weaviate properties. Used for filtered search. |
| embeddingSecretName | string | (required) | Vault secret name for the embedding API configuration. |
| weaviateSecretName | string | (required) | Vault secret name for the Weaviate connection. |
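
The field rules above can be checked before a pipeline is submitted. The following is a minimal sketch, not the pipeline's own validation: the function name `validate_weaviate_destination` is hypothetical, and "PascalCase" is approximated as an uppercase ASCII letter followed by letters or digits.

```python
import re

# Assumption: PascalCase is approximated as an uppercase ASCII letter
# followed only by ASCII letters or digits.
PASCAL_CASE = re.compile(r"^[A-Z][A-Za-z0-9]*$")

# Fields marked "(required)" in the table above.
REQUIRED = ("className", "embeddingSecretName", "weaviateSecretName")


def validate_weaviate_destination(dest: dict) -> list[str]:
    """Return a list of problems in a destination.weaviate block (empty if none)."""
    problems = [
        f"missing required field: {name}" for name in REQUIRED if not dest.get(name)
    ]
    class_name = dest.get("className", "")
    if class_name and not PASCAL_CASE.match(class_name):
        problems.append(f"className is not PascalCase: {class_name!r}")
    return problems
```

An empty list means the block passed these checks; a name such as financial_documents would be reported because it starts with a lowercase letter and contains an underscore.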

Supported File Types

| Format | Description |
| --- | --- |
| PDF (.pdf) | Text extracted via Apache PDFBox |
| Word (.doc) | Text extracted via Apache POI (legacy format) |
| Word (.docx) | Text extracted via Apache POI (modern format) |
| PowerPoint (.ppt) | Text extracted via Apache POI (legacy format) |
| PowerPoint (.pptx) | Text extracted via Apache POI (modern format) |
| Excel (.xls, .xlsx) | Cell values extracted via Apache POI |
| HTML (.html, .htm) | Text extracted via JSoup (tags stripped) |
| RTF (.rtf) | Text extracted via javax.swing RTF parser |
| Email (.msg) | Subject, from, to, and body extracted via Apache POI |
| Email (.eml) | Subject, from, and body extracted via Jakarta Mail |
| EPUB (.epub) | XHTML content extracted and parsed via JSoup |
| Plain text (.txt, .md, .csv, .json, .xml) | Content used directly |

For structured data (CSV, JSON, XML), use database destinations (PostgreSQL, MongoDB) instead.

Vault Secrets

Weaviate connection

vault kv put secret/oss/weaviate \
  host="host.docker.internal" \
  port="8079" \
  apiKey=""

| Field | Description |
| --- | --- |
| host | Weaviate server hostname. Use host.docker.internal for local Weaviate. |
| port | REST API port (default 8079 to avoid conflict with the pipeline’s 8080). |
| scheme | Protocol: http (default) or https for Weaviate Cloud. |
| apiKey | API key for Weaviate Cloud. Empty string for local instances. |
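
To illustrate how these secret fields combine, here is a minimal sketch that turns them into a base URL and request headers. The helper names are hypothetical, and the pipeline's actual client code is not shown in this document. Sending the API key as a Bearer token is how Weaviate's API-key authentication works over REST.

```python
def weaviate_base_url(secret: dict) -> str:
    """Build the REST base URL from the Vault secret fields."""
    scheme = secret.get("scheme") or "http"  # documented default
    port = secret.get("port") or "8079"      # documented default
    return f"{scheme}://{secret['host']}:{port}"


def weaviate_headers(secret: dict) -> dict:
    """Build request headers; the API key is omitted when it is empty."""
    headers = {"Content-Type": "application/json"}
    if secret.get("apiKey"):
        headers["Authorization"] = f"Bearer {secret['apiKey']}"
    return headers
```

With the local secret shown above, the base URL resolves to http://host.docker.internal:8079 and no Authorization header is sent.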

Embedding API

The embedding API configuration is shared with other vector database destinations (e.g., Qdrant). See the Qdrant documentation for full details.
vault kv put secret/oss/embedding \
  endpoint="https://api.openai.com/v1/embeddings" \
  model="text-embedding-3-small" \
  apiKey="sk-..."
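
For orientation, the sketch below shows what a request to an OpenAI-compatible embeddings endpoint looks like when built from this secret. It assumes the endpoint accepts the OpenAI request shape (`model` plus a list of `input` strings) and returns a `data` array of indexed embeddings. The function names are hypothetical, and nothing here is taken from the pipeline's own code.

```python
import json
import urllib.request


def build_embedding_request(secret: dict, chunks: list[str]) -> urllib.request.Request:
    """Build (but do not send) a POST request for a batch of chunk texts."""
    body = json.dumps({"model": secret["model"], "input": chunks}).encode("utf-8")
    return urllib.request.Request(
        secret["endpoint"],
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {secret['apiKey']}",
        },
        method="POST",
    )


def parse_embeddings(response_json: dict) -> list[list[float]]:
    """Return one vector per input, ordered by the 'index' field of each item."""
    items = sorted(response_json["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]
```

The request can be sent with `urllib.request.urlopen(request)` and the decoded JSON body passed to `parse_embeddings`.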

Chunking Strategies

Documents are split into chunks before embedding. Each chunk becomes a separate Weaviate object with the document’s metadata plus chunk_index, filename, and source_pipeline properties.
"chunking": {
    "strategy": "recursive",
    "chunkSize": 500,
    "chunkOverlap": 50
}

| Strategy | Description |
| --- | --- |
| none | No chunking: one embedding per document. Only for very short documents. |
| fixed | Split by character count. Fast but may cut mid-sentence. |
| sentence | Split on sentence boundaries (. ! ?). Preserves semantic units. |
| paragraph | Split on double newlines. Ideal for structured documents with clear sections. |
| recursive | Try \n\n, then \n, then ., then space. Best general-purpose default. |

  • chunkSize (default 500): maximum characters per chunk
  • chunkOverlap (default 50): characters of overlap between consecutive chunks
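
The fixed and recursive strategies can be sketched as follows. This is an illustrative approximation under stated assumptions, not the pipeline's implementation: the function names are hypothetical, the recursive variant ignores chunkOverlap for brevity, and separators that fall on a chunk boundary are dropped.

```python
SEPARATORS = ("\n\n", "\n", ".", " ")  # order described for the recursive strategy


def chunk_fixed(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Split by character count, repeating chunk_overlap characters between chunks."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks


def chunk_recursive(text: str, chunk_size: int = 500, separators=SEPARATORS) -> list[str]:
    """Split on the coarsest separator first, falling back to finer ones."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        return chunk_fixed(text, chunk_size, 0)  # no separator left: hard cut
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging pieces while they fit
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            chunks.extend(chunk_recursive(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks
```

For example, `chunk_fixed("abcdefghij", 4, 1)` yields three chunks in which each chunk starts with the last character of the previous one. The recursive version keeps paragraphs together whenever they fit within chunk_size.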

Metadata

Static metadata is attached to every chunk as Weaviate properties, enabling filtered semantic search:
"metadata": {
    "company": "Apple Inc",
    "documentType": "10-Q",
    "filingDate": "2026-01-30"
}
This allows queries like “find chunks about revenue from Apple 10-Q filings” by combining vector similarity with metadata filters. In addition to static metadata, every chunk automatically includes:
  • text — the chunk text
  • chunk_index — position of the chunk in the document
  • filename — original uploaded filename
  • source_pipeline — pipeline name

How It Works

  1. Upload — an unstructured file (PDF, DOC, DOCX, HTML, text) is uploaded via POST /api/v1/pipeline/upload
  2. Extract — text is extracted from the document (PDFBox for PDFs, Apache POI for Word, JSoup for HTML)
  3. Chunk — text is split into chunks using the configured strategy
  4. Embed — each chunk is sent to the embedding API to generate a vector
  5. Upsert — vectors are upserted into the Weaviate class with metadata properties
  6. Notify — a pipeline notification is published on completion
The class is auto-created on first upsert using HNSW indexing with cosine distance.
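
A class definition with these index settings could look like the sketch below when created by hand through Weaviate's POST /v1/schema endpoint. The exact payload the pipeline sends is not shown in this document. Setting `vectorizer` to `none` (vectors are supplied by the pipeline) and typing every static metadata key as text are assumptions, and the helper name is hypothetical.

```python
def build_class_definition(class_name: str, metadata_keys=()) -> dict:
    """Build a Weaviate class definition using HNSW with cosine distance."""
    automatic = [
        ("text", "text"),
        ("chunk_index", "int"),
        ("filename", "text"),
        ("source_pipeline", "text"),
    ]
    properties = [{"name": name, "dataType": [dtype]} for name, dtype in automatic]
    # Assumption: static metadata values are stored as text properties.
    properties += [{"name": key, "dataType": ["text"]} for key in metadata_keys]
    return {
        "class": class_name,
        "vectorizer": "none",  # assumption: embeddings come from the external API
        "vectorIndexType": "hnsw",
        "vectorIndexConfig": {"distance": "cosine"},
        "properties": properties,
    }
```

Comparing this against the output of the schema endpoint for the class is a quick way to confirm the index type and distance metric that were actually applied.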

Running Weaviate

Weaviate is not bundled with the pipeline; like Qdrant and Ollama, it runs as a separate service. To run it locally with Docker:
docker run -p 8079:8080 -p 50051:50051 cr.weaviate.io/semitechnologies/weaviate:latest
The container's port 8080 is published on host port 8079 to avoid a conflict with the pipeline server on port 8080. Alternatively, install Weaviate by following weaviate.io/developers/weaviate/installation.

Verifying

# List all classes
curl http://localhost:8079/v1/schema

# Get class details
curl http://localhost:8079/v1/schema/FinancialDocuments

# Get objects
curl "http://localhost:8079/v1/objects?class=FinancialDocuments&limit=5"

# Get objects with specific metadata filter
curl -X POST http://localhost:8079/v1/graphql \
  -H "Content-Type: application/json" \
  -d '{
    "query": "{ Get { FinancialDocuments(limit: 5, where: {path: [\"company\"], operator: Equal, valueText: \"Apple Inc\"}) { text chunk_index filename company documentType } } }"
  }'
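
The same filtered query can be issued from a script, which avoids the nested quote escaping of the curl form. This is a minimal sketch using only the Python standard library. The helper names are hypothetical, and it assumes the filtered property is a text property, as in the curl example.

```python
import json
import urllib.request


def build_filtered_query(class_name, field, value, limit=5,
                         properties=("text", "chunk_index", "filename")) -> str:
    """Build a GraphQL Get query with an Equal filter on one text property."""
    path = json.dumps(field)    # quoted and escaped string literal
    match = json.dumps(value)
    where = f"{{path: [{path}], operator: Equal, valueText: {match}}}"
    fields = " ".join(properties)
    return f"{{ Get {{ {class_name}(limit: {limit}, where: {where}) {{ {fields} }} }} }}"


def run_query(base_url: str, query: str) -> dict:
    """POST the query to the /v1/graphql endpoint and return the decoded JSON."""
    request = urllib.request.Request(
        f"{base_url}/v1/graphql",
        data=json.dumps({"query": query}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())
```

For a local instance, `run_query("http://localhost:8079", build_filtered_query("FinancialDocuments", "company", "Apple Inc"))` sends the equivalent of the curl request above.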