Configuration
| Field | Type | Default | Description |
|---|---|---|---|
| collectionName | string | (required) | Milvus collection name. Auto-created if it doesn’t exist. |
| chunking | object | recursive, 500/50 | Chunking strategy configuration (see below). |
| metadata | map | {} | Static key-value metadata stored as dynamic fields on every chunk. |
| embeddingSecretName | string | (required) | Vault secret name for the embedding API configuration. |
| milvusSecretName | string | (required) | Vault secret name for the Milvus connection. |
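As an illustration, the fields above could be assembled and checked as in the following Python sketch. The example values, the `REQUIRED_FIELDS` tuple, and the `apply_defaults` helper are hypothetical, and the `strategy` key under `chunking` is an assumed name; only `chunkSize` and `chunkOverlap` are named in this page.

```python
# Hypothetical sketch: the "strategy" key name and all example values are
# assumptions, not taken from the pipeline itself.
REQUIRED_FIELDS = ("collectionName", "embeddingSecretName", "milvusSecretName")
DEFAULT_CHUNKING = {"strategy": "recursive", "chunkSize": 500, "chunkOverlap": 50}


def apply_defaults(config: dict) -> dict:
    """Reject configs missing a required field, then fill documented defaults."""
    missing = [name for name in REQUIRED_FIELDS if not config.get(name)]
    if missing:
        raise ValueError(f"missing required field(s): {', '.join(missing)}")
    resolved = dict(config)
    # User-supplied chunking keys override the defaults one by one.
    resolved["chunking"] = {**DEFAULT_CHUNKING, **config.get("chunking", {})}
    resolved["metadata"] = dict(config.get("metadata", {}))
    return resolved


config = apply_defaults({
    "collectionName": "product_docs",
    "embeddingSecretName": "embedding-api",
    "milvusSecretName": "milvus-local",
    "chunking": {"strategy": "sentence"},
    "metadata": {"team": "support"},
})
```

Here `config["chunking"]` keeps the chosen strategy and picks up the default size and overlap.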
Supported File Types
| Format | Description |
|---|---|
| PDF (.pdf) | Text extracted via Apache PDFBox |
| Word (.doc) | Text extracted via Apache POI (legacy format) |
| Word (.docx) | Text extracted via Apache POI (modern format) |
| PowerPoint (.ppt) | Text extracted via Apache POI (legacy format) |
| PowerPoint (.pptx) | Text extracted via Apache POI (modern format) |
| Excel (.xls, .xlsx) | Cell values extracted via Apache POI |
| HTML (.html, .htm) | Text extracted via JSoup (tags stripped) |
| RTF (.rtf) | Text extracted via javax.swing RTF parser |
| Email (.msg) | Subject, from, to, and body extracted via Apache POI |
| Email (.eml) | Subject, from, and body extracted via Jakarta Mail |
| EPUB (.epub) | XHTML content extracted and parsed via JSoup |
| Plain text (.txt, .md, .csv, .json, .xml) | Content used directly |
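The table above amounts to a lookup from file extension to extractor. A minimal Python sketch of that lookup follows. The `EXTRACTORS` mapping and `extractor_for` helper are hypothetical names, and case-insensitive matching of extensions is an assumption; the pipeline may select extractors differently.

```python
from pathlib import Path
from typing import Optional

# Extension-to-extractor table transcribed from the list above.
EXTRACTORS = {
    ".pdf": "Apache PDFBox",
    ".doc": "Apache POI", ".docx": "Apache POI",
    ".ppt": "Apache POI", ".pptx": "Apache POI",
    ".xls": "Apache POI", ".xlsx": "Apache POI",
    ".msg": "Apache POI",
    ".html": "JSoup", ".htm": "JSoup", ".epub": "JSoup",
    ".rtf": "javax.swing RTF parser",
    ".eml": "Jakarta Mail",
    ".txt": "direct", ".md": "direct", ".csv": "direct",
    ".json": "direct", ".xml": "direct",
}


def extractor_for(filename: str) -> Optional[str]:
    """Return the extractor label for a filename, or None if unsupported."""
    # Assumption: extensions are matched case-insensitively.
    return EXTRACTORS.get(Path(filename).suffix.lower())
```

A caller could use a `None` result to reject unsupported uploads before any extraction work starts.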
Vault Secrets
Milvus connection
| Field | Description |
|---|---|
| host | Milvus server hostname. Use host.docker.internal for local Milvus. |
| port | gRPC port (default 19530). |
| apiKey | API key / token for Milvus Cloud. Empty string for local instances. |
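To make the secret fields concrete, here is a small Python sketch that turns such a secret into connection arguments. The `milvus_connection_args` helper is hypothetical, and the `http://` scheme is an assumption (a cloud endpoint would likely need `https://`). The resulting `uri` and `token` keys match the keyword arguments of pymilvus's `MilvusClient`, should you want to test the secret from Python.

```python
def milvus_connection_args(secret: dict) -> dict:
    """Map the Vault secret fields (host, port, apiKey) to connection arguments."""
    host = secret["host"]
    # The port is optional in this sketch and falls back to the documented default.
    port = int(secret.get("port") or 19530)
    # Assumption: plain HTTP scheme, as is typical for a local instance.
    args = {"uri": f"http://{host}:{port}"}
    api_key = secret.get("apiKey", "")
    if api_key:
        # Local instances use an empty apiKey, so no token is sent for them.
        args["token"] = api_key
    return args


local_args = milvus_connection_args({"host": "host.docker.internal", "apiKey": ""})
```

For a local instance this yields only a `uri`; a non-empty `apiKey` adds a `token` entry.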
Embedding API
The embedding API is shared with other vector database destinations. See Qdrant documentation for full details.
Chunking Strategies
Documents are split into chunks before embedding. Each chunk becomes a separate entity in the Milvus collection with the document’s metadata plus chunk_index, filename, and source_pipeline fields.
| Strategy | Description |
|---|---|
| none | No chunking: one embedding per document. Only for very short documents. |
| fixed | Split by character count. Fast but may cut mid-sentence. |
| sentence | Split on sentence boundaries (. ! ?). Preserves semantic units. |
| paragraph | Split on double newlines. Ideal for structured documents with clear sections. |
| recursive | Try \n\n, then \n, then ., then space. Best general-purpose default. |
- chunkSize (default 500): maximum characters per chunk
- chunkOverlap (default 50): characters of overlap between consecutive chunks
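The following Python sketch shows one plausible reading of the fixed and recursive strategies. It is not the pipeline's actual implementation: the function names are hypothetical, the recursive variant ignores overlap for brevity, and separators consumed at a chunk boundary are dropped.

```python
from typing import List, Sequence

SEPARATORS = ("\n\n", "\n", ".", " ")  # tried in this order, as listed above


def fixed_chunks(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> List[str]:
    """Split by character count; consecutive chunks share chunk_overlap characters."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


def recursive_chunks(text: str, chunk_size: int = 500,
                     separators: Sequence[str] = SEPARATORS) -> List[str]:
    """Split on the coarsest separator present, recursing into oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    for i, sep in enumerate(separators):
        if sep not in text:
            continue
        chunks, current = [], ""
        for piece in text.split(sep):
            candidate = piece if not current else current + sep + piece
            if len(candidate) <= chunk_size:
                current = candidate  # keep packing pieces into the current chunk
                continue
            if current:
                chunks.append(current)
            if len(piece) > chunk_size:
                # Piece is still too large: retry with the finer separators.
                chunks.extend(recursive_chunks(piece, chunk_size, separators[i + 1:]))
                current = ""
            else:
                current = piece
        if current:
            chunks.append(current)
        return chunks
    # No separator left: fall back to a hard character split without overlap.
    return fixed_chunks(text, chunk_size, 0)
```

With the documented defaults, `fixed_chunks` advances 450 characters per chunk (500 minus 50), so each chunk repeats the last 50 characters of the previous one.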
Metadata
Static metadata is stored as dynamic fields on every chunk in the Milvus collection, alongside these fields:
- id: deterministic UUID (idempotent upserts)
- text: the chunk text
- chunk_index: position of the chunk in the document
- filename: original uploaded filename
- source_pipeline: pipeline name
- embedding: float vector for similarity search
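A minimal Python sketch of how one such entity could be assembled is shown below. The `build_entity` helper is hypothetical, and deriving the ID with UUIDv5 from the pipeline name, filename, and chunk index is an assumption; this page only states that the ID is deterministic.

```python
import uuid
from typing import Dict, List


def build_entity(text: str, embedding: List[float], chunk_index: int,
                 filename: str, source_pipeline: str,
                 static_metadata: Dict[str, str]) -> dict:
    """Combine the per-chunk fields with the configured static metadata."""
    # Assumption: the same (pipeline, filename, index) always hashes to the same
    # UUID, so re-processing a file overwrites its chunks instead of adding copies.
    key = f"{source_pipeline}/{filename}/{chunk_index}"
    entity = dict(static_metadata)  # copied first so core fields win on key clashes
    entity.update({
        "id": str(uuid.uuid5(uuid.NAMESPACE_URL, key)),
        "text": text,
        "chunk_index": chunk_index,
        "filename": filename,
        "source_pipeline": source_pipeline,
        "embedding": embedding,
    })
    return entity


first = build_entity("Hello", [0.1, 0.2], 0, "guide.pdf", "docs", {"team": "support"})
again = build_entity("Hello", [0.1, 0.2], 0, "guide.pdf", "docs", {"team": "support"})
```

Both calls produce the same `id`, which is the property that makes repeated upserts idempotent.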
How It Works
- Upload: an unstructured file is uploaded via POST /api/v1/pipeline/upload
- Extract: text is extracted from the document
- Chunk: text is split into chunks using the configured strategy
- Embed: each chunk is sent to the embedding API to generate a vector
- Upsert: vectors are inserted into the Milvus collection with metadata
- Notify: a pipeline notification is published on completion
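The stages after upload can be sketched as a single function that wires them together. Everything in this Python example is a stand-in: `run_pipeline` and its parameters are hypothetical, the embedding is a toy vector rather than a real API call, and an in-memory dict replaces the Milvus collection.

```python
from typing import Callable, Dict, List, Tuple


def run_pipeline(raw: bytes, filename: str,
                 extract: Callable[[bytes], str],
                 chunk: Callable[[str], List[str]],
                 embed: Callable[[str], List[float]],
                 store: Dict[str, dict],
                 notify: Callable[[str, int], None]) -> int:
    """Run extract, chunk, embed, upsert, and notify for one uploaded file."""
    text = extract(raw)                    # Extract
    chunks = chunk(text)                   # Chunk
    for index, piece in enumerate(chunks):
        vector = embed(piece)              # Embed
        # Upsert: a deterministic key means a re-upload overwrites existing rows.
        store[f"{filename}#{index}"] = {
            "text": piece, "chunk_index": index,
            "filename": filename, "embedding": vector,
        }
    notify(filename, len(chunks))          # Notify
    return len(chunks)


# Toy stand-ins so the sketch runs without external services.
store: Dict[str, dict] = {}
events: List[Tuple[str, int]] = []
stages = dict(
    extract=lambda raw: raw.decode("utf-8"),
    chunk=lambda text: [p for p in text.split("\n\n") if p.strip()],
    embed=lambda piece: [float(len(piece)), float(piece.count(" "))],
    store=store,
    notify=lambda name, count: events.append((name, count)),
)
run_pipeline(b"First paragraph.\n\nSecond paragraph.", "notes.txt", **stages)
run_pipeline(b"First paragraph.\n\nSecond paragraph.", "notes.txt", **stages)
```

Running the same upload twice leaves two entries in `store`, not four, while a notification is recorded for each run.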
Running Milvus
Milvus runs outside the pipeline’s Docker Compose (like Qdrant and Weaviate). Unlike simpler vector databases, Milvus requires etcd and MinIO as internal dependencies, so a simple docker run won’t work.
Option 1 — Milvus standalone script (recommended):
bash standalone_embed.sh stop
Option 2 — Milvus docker-compose: