Configuration
| Field | Type | Default | Description |
|---|---|---|---|
| className | string | (required) | Weaviate class name. Must be PascalCase (e.g., FinancialDocuments). Auto-created if it doesn’t exist. |
| chunking | object | recursive, 500/50 | Chunking strategy configuration (see below). |
| metadata | map | {} | Static key-value metadata stored on every chunk as Weaviate properties. Used for filtered search. |
| embeddingSecretName | string | (required) | Vault secret name for the embedding API configuration. |
| weaviateSecretName | string | (required) | Vault secret name for Weaviate connection. |
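The rules in the table above (three required fields, a PascalCase class name, and the recursive 500/50 chunking default) can be sketched as a small validation helper. This is an illustration only, not the pipeline's own code: the function name `normalize_config`, the `strategy` key inside `chunking`, and the regular expression used to approximate PascalCase are all assumptions.

```python
import re

# Defaults from the table above (recursive, 500/50).
# The "strategy" key name is an assumption; chunkSize and chunkOverlap
# are the parameter names used in the chunking section.
DEFAULT_CHUNKING = {"strategy": "recursive", "chunkSize": 500, "chunkOverlap": 50}
REQUIRED_FIELDS = ("className", "embeddingSecretName", "weaviateSecretName")

# Rough PascalCase check: starts with an uppercase letter, then letters or digits.
PASCAL_CASE = re.compile(r"^[A-Z][A-Za-z0-9]*$")


def normalize_config(raw: dict) -> dict:
    """Return a copy of raw with defaults applied; raise ValueError if invalid."""
    missing = [field for field in REQUIRED_FIELDS if not raw.get(field)]
    if missing:
        raise ValueError(f"missing required fields: {', '.join(missing)}")
    if not PASCAL_CASE.match(raw["className"]):
        raise ValueError(f"className must be PascalCase: {raw['className']!r}")
    config = dict(raw)
    # User-supplied chunking values override the defaults key by key.
    config["chunking"] = {**DEFAULT_CHUNKING, **raw.get("chunking", {})}
    config["metadata"] = dict(raw.get("metadata", {}))
    return config
```

Calling `normalize_config` with only the three required fields returns a configuration whose `chunking` entry holds the defaults and whose `metadata` entry is an empty map.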
Supported File Types
| Format | Description |
|---|---|
| PDF (.pdf) | Text extracted via Apache PDFBox |
| Word (.doc) | Text extracted via Apache POI (legacy format) |
| Word (.docx) | Text extracted via Apache POI (modern format) |
| PowerPoint (.ppt) | Text extracted via Apache POI (legacy format) |
| PowerPoint (.pptx) | Text extracted via Apache POI (modern format) |
| Excel (.xls, .xlsx) | Cell values extracted via Apache POI |
| HTML (.html, .htm) | Text extracted via JSoup (tags stripped) |
| RTF (.rtf) | Text extracted via javax.swing RTF parser |
| Email (.msg) | Subject, from, to, and body extracted via Apache POI |
| Email (.eml) | Subject, from, and body extracted via Jakarta Mail |
| EPUB (.epub) | XHTML content extracted and parsed via JSoup |
| Plain text (.txt, .md, .csv, .json, .xml) | Content used directly |
Vault Secrets
Weaviate connection
| Field | Description |
|---|---|
| host | Weaviate server hostname. Use host.docker.internal for local Weaviate. |
| port | REST API port (default 8079 to avoid conflict with pipeline’s 8080). |
| scheme | Protocol: http (default), or https for Weaviate Cloud. |
| apiKey | API key for Weaviate Cloud. Empty string for local instances. |
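As a rough illustration of how the four secret fields combine, the sketch below builds a base URL and request headers from a secret represented as a dictionary. The helper names `weaviate_base_url` and `auth_headers` are hypothetical, and sending the key as a Bearer token is an assumption about how the destination authenticates, so check it against your Weaviate deployment.

```python
def weaviate_base_url(secret: dict) -> str:
    """Build the REST base URL, applying the documented defaults."""
    scheme = secret.get("scheme") or "http"   # http is the documented default
    port = secret.get("port") or 8079         # 8079 is the documented default
    return f"{scheme}://{secret['host']}:{port}"


def auth_headers(secret: dict) -> dict:
    """Return auth headers; an empty apiKey (local instance) yields none."""
    api_key = secret.get("apiKey") or ""
    # Assumption: the API key is passed as a Bearer token.
    return {"Authorization": f"Bearer {api_key}"} if api_key else {}
```

For a local instance, a secret containing only `host` set to `host.docker.internal` resolves to `http://host.docker.internal:8079` with no auth headers.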
Embedding API
The embedding API is shared with other vector database destinations (e.g., Qdrant). See Qdrant documentation for full details.
Chunking Strategies
Documents are split into chunks before embedding. Each chunk becomes a separate Weaviate object with the document’s metadata plus chunk_index, filename, and source_pipeline properties.
| Strategy | Description |
|---|---|
| none | No chunking: one embedding per document. Only for very short documents. |
| fixed | Split by character count. Fast but may cut mid-sentence. |
| sentence | Split on sentence boundaries (. ! ?). Preserves semantic units. |
| paragraph | Split on double newlines. Ideal for structured documents with clear sections. |
| recursive | Try \n\n, then \n, then ., then space. Best general-purpose default. |
- chunkSize (default 500): maximum characters per chunk
- chunkOverlap (default 50): characters of overlap between consecutive chunks
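To make the fixed and recursive strategies concrete, here is a minimal sketch of both. It is not the pipeline's implementation: the function names are hypothetical, the recursive version ignores chunkOverlap for brevity, and it drops the separator at each chunk boundary (so a trailing period may be lost where a split occurs).

```python
def chunk_fixed(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list:
    """Split by character count; consecutive chunks share chunk_overlap characters."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


def chunk_recursive(text: str, chunk_size: int = 500,
                    separators: tuple = ("\n\n", "\n", ".", " ")) -> list:
    """Try the coarsest separator first; fall back to finer ones for long pieces."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        return chunk_fixed(text, chunk_size, 0)  # nothing left to split on
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces
            continue
        if current.strip():
            chunks.append(current)
        current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            chunks.extend(chunk_recursive(piece, chunk_size, finer))
    if current.strip():
        chunks.append(current)
    return chunks
```

With the defaults, `chunk_fixed` advances 450 characters per chunk (500 minus 50), so the last 50 characters of one chunk reappear at the start of the next.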
Metadata
Static metadata is attached to every chunk as Weaviate properties, enabling filtered semantic search. Each chunk also carries these properties:
- text: the chunk text
- chunk_index: position of the chunk in the document
- filename: original uploaded filename
- source_pipeline: pipeline name
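The sketch below shows one way the per-chunk properties could be assembled and how a filter on a static metadata key might look. The helper names are hypothetical, and letting the built-in properties win on a key collision is an assumption. The filter dictionary follows the general shape of a Weaviate `where` filter (`path`, `operator`, `valueText`); verify the exact form against the Weaviate version you run.

```python
def build_chunk_properties(chunk_text: str, chunk_index: int, filename: str,
                           pipeline_name: str, static_metadata: dict) -> dict:
    """Merge static metadata with the per-chunk properties listed above."""
    properties = dict(static_metadata)
    # Assumption: built-in properties take precedence over static metadata keys.
    properties.update({
        "text": chunk_text,
        "chunk_index": chunk_index,
        "filename": filename,
        "source_pipeline": pipeline_name,
    })
    return properties


def equal_filter(key: str, value: str) -> dict:
    """Build an equality filter on a text property for filtered search."""
    return {"path": [key], "operator": "Equal", "valueText": value}
```

For example, with static metadata `{"department": "finance"}`, every chunk carries a `department` property, and `equal_filter("department", "finance")` restricts a search to those chunks.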
How It Works
- Upload: an unstructured file (PDF, DOC, DOCX, HTML, text) is uploaded via POST /api/v1/pipeline/upload
- Extract: text is extracted from the document (PDFBox for PDFs, Apache POI for Word, JSoup for HTML)
- Chunk: text is split into chunks using the configured strategy
- Embed: each chunk is sent to the embedding API to generate a vector
- Upsert: vectors are upserted into the Weaviate class with metadata properties
- Notify: a pipeline notification is published on completion
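The order of the steps above can be sketched as a single function that takes each stage as a callable. Everything here is hypothetical: `run_ingest` and its parameters are illustrative names, and the real extract, embed, upsert, and notify stages are supplied by the pipeline rather than by user code.

```python
from typing import Callable, Sequence


def run_ingest(raw: bytes, filename: str, pipeline_name: str, metadata: dict,
               extract: Callable[[bytes], str],
               chunk: Callable[[str], Sequence[str]],
               embed: Callable[[str], Sequence[float]],
               upsert: Callable[[dict, Sequence[float]], None],
               notify: Callable[[str, int], None]) -> int:
    """Run extract, chunk, embed, upsert, and notify in order; return the chunk count."""
    text = extract(raw)                        # Extract
    chunks = list(chunk(text))                 # Chunk
    for index, chunk_text in enumerate(chunks):
        vector = embed(chunk_text)             # Embed: one vector per chunk
        properties = {
            **metadata,
            "text": chunk_text,
            "chunk_index": index,
            "filename": filename,
            "source_pipeline": pipeline_name,
        }
        upsert(properties, vector)             # Upsert: one object per chunk
    notify(filename, len(chunks))              # Notify once, after all chunks
    return len(chunks)
```

Passing the stages in as callables keeps the sketch independent of any particular embedding API or Weaviate client, and makes the ordering easy to exercise with stubs.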