## Configuration
| Field | Type | Default | Description |
|---|---|---|---|
| `collectionName` | string | (required) | Qdrant collection name. Auto-created if it doesn’t exist. |
| `chunking` | object | `recursive`, 500/50 | Chunking strategy configuration (see below). |
| `metadata` | map | `{}` | Static key-value metadata stored on every chunk as Qdrant payload. Used for filtered search. |
| `embeddingSecretName` | string | (required) | Vault secret name for the embedding API configuration. |
| `qdrantSecretName` | string | (required) | Vault secret name for the Qdrant connection. |
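Putting the fields together, a pipeline configuration might look like the following sketch. The `strategy` key inside `chunking` and all of the example values are illustrative assumptions, not confirmed field names:

```json
{
  "collectionName": "product-docs",
  "chunking": {
    "strategy": "recursive",
    "chunkSize": 500,
    "chunkOverlap": 50
  },
  "metadata": {
    "team": "platform",
    "docType": "manual"
  },
  "embeddingSecretName": "embedding-api",
  "qdrantSecretName": "qdrant-conn"
}
```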
## Supported File Types
| Format | Description |
|---|---|
| PDF (`.pdf`) | Text extracted via Apache PDFBox |
| Word (`.doc`) | Text extracted via Apache POI (legacy format) |
| Word (`.docx`) | Text extracted via Apache POI (modern format) |
| PowerPoint (`.ppt`) | Text extracted via Apache POI (legacy format) |
| PowerPoint (`.pptx`) | Text extracted via Apache POI (modern format) |
| Excel (`.xls`, `.xlsx`) | Cell values extracted via Apache POI |
| HTML (`.html`, `.htm`) | Text extracted via JSoup (tags stripped) |
| RTF (`.rtf`) | Text extracted via the `javax.swing` RTF parser |
| Email (`.msg`) | Subject, from, to, and body extracted via Apache POI |
| Email (`.eml`) | Subject, from, and body extracted via Jakarta Mail |
| EPUB (`.epub`) | XHTML content extracted and parsed via JSoup |
| Plain text (`.txt`, `.md`, `.csv`, `.json`, `.xml`) | Content used directly |
## Vault Secrets

### Qdrant connection
| Field | Description |
|---|---|
| `host` | Qdrant server hostname. Use `host.docker.internal` for local Qdrant. |
| `port` | gRPC port (default 6334). |
| `apiKey` | API key for Qdrant Cloud. Empty string for local instances. |
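A secret for a local Qdrant instance might look like this sketch (whether `port` is stored as a number or a string depends on how your Vault secrets are encoded):

```json
{
  "host": "host.docker.internal",
  "port": 6334,
  "apiKey": ""
}
```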
### Embedding API
The embedding API is configured separately from the pipeline’s AI provider, allowing you to use different services for embeddings vs. AI features.

| Field | Description |
|---|---|
| `endpoint` | Embedding API URL. Supports OpenAI and OpenAI-compatible APIs (Ollama). |
| `model` | Embedding model name. |
| `apiKey` | API key. Empty string for local Ollama. |
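For example, a secret pointing at a local Ollama instance could look like this (values taken from the provider table below; exact secret encoding is up to your Vault setup):

```json
{
  "endpoint": "http://host.docker.internal:11434/api/embeddings",
  "model": "nomic-embed-text",
  "apiKey": ""
}
```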
| Provider | Endpoint | Model | Dimensions |
|---|---|---|---|
| OpenAI | https://api.openai.com/v1/embeddings | text-embedding-3-small | 1536 |
| OpenAI | https://api.openai.com/v1/embeddings | text-embedding-3-large | 3072 |
| Ollama | http://host.docker.internal:11434/api/embeddings | nomic-embed-text | 768 |
## Chunking Strategies
Documents are split into chunks before embedding. Each chunk becomes a separate Qdrant point with the document’s metadata plus `chunk_index`, `filename`, and `source_pipeline` fields.
| Strategy | Description |
|---|---|
| `none` | No chunking — one embedding per document. Only for very short documents. |
| `fixed` | Split by character count. Fast but may cut mid-sentence. |
| `sentence` | Split on sentence boundaries (`.` `!` `?`). Preserves semantic units. |
| `paragraph` | Split on double newlines. Ideal for structured documents with clear sections. |
| `recursive` | Try `\n\n`, then `\n`, then `.`, then space — best general-purpose default. |
- `chunkSize` (default 500): maximum characters per chunk
- `chunkOverlap` (default 50): characters of overlap between consecutive chunks
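To make the `recursive` strategy concrete, here is a minimal sketch of the idea in Java — not the pipeline’s actual implementation. It tries each separator in order, greedily merges pieces up to `chunkSize`, recurses into pieces that are still too long, and finally prefixes each chunk with the tail of its predecessor to produce the overlap. (The sketch drops the separator at merge boundaries, which real implementations usually preserve.)

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class RecursiveChunker {
    // Separator cascade, coarsest first: paragraph, line, sentence, word.
    private static final String[] SEPS = {"\n\n", "\n", ". ", " "};

    public static List<String> chunk(String text, int chunkSize, int overlap) {
        List<String> pieces = new ArrayList<>();
        split(text.trim(), 0, chunkSize, pieces);
        // Overlap: prefix each chunk (after the first) with the tail of its predecessor.
        List<String> out = new ArrayList<>();
        for (int i = 0; i < pieces.size(); i++) {
            if (i == 0 || overlap <= 0) {
                out.add(pieces.get(i));
            } else {
                String prev = pieces.get(i - 1);
                out.add(prev.substring(Math.max(0, prev.length() - overlap)) + pieces.get(i));
            }
        }
        return out;
    }

    private static void split(String text, int sepIdx, int chunkSize, List<String> out) {
        if (text.length() <= chunkSize) {
            if (!text.isEmpty()) out.add(text);
            return;
        }
        if (sepIdx >= SEPS.length) {
            // No separator left: hard cut by character count.
            for (int i = 0; i < text.length(); i += chunkSize) {
                out.add(text.substring(i, Math.min(text.length(), i + chunkSize)));
            }
            return;
        }
        String sep = SEPS[sepIdx];
        StringBuilder current = new StringBuilder();
        for (String part : text.split(Pattern.quote(sep), -1)) {
            if (current.length() + sep.length() + part.length() <= chunkSize) {
                if (current.length() > 0) current.append(sep);
                current.append(part);
            } else {
                if (current.length() > 0) out.add(current.toString());
                current.setLength(0);
                if (part.length() <= chunkSize) {
                    current.append(part);
                } else {
                    // Piece still too big: retry with the next (finer) separator.
                    split(part, sepIdx + 1, chunkSize, out);
                }
            }
        }
        if (current.length() > 0) out.add(current.toString());
    }
}
```

With `chunkSize=40` and `chunkOverlap=5`, a three-paragraph input yields three chunks, each at most `chunkSize + chunkOverlap` characters long.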
## Metadata
Static metadata is attached to every chunk in the Qdrant payload, enabling filtered semantic search. Each chunk’s payload also includes:

- `text` — the chunk text
- `chunk_index` — position of the chunk in the document
- `filename` — original uploaded filename
- `source_pipeline` — pipeline name
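As an illustration, a Qdrant search filter restricting results to chunks from a particular pipeline could use the `source_pipeline` payload field like this (standard Qdrant `must`/`match` filter syntax; the pipeline name is a placeholder):

```json
{
  "filter": {
    "must": [
      { "key": "source_pipeline", "match": { "value": "product-docs-ingest" } }
    ]
  }
}
```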
## How It Works
1. Upload — an unstructured file (PDF, text) is uploaded via `POST /api/v1/pipeline/upload`
2. Extract — text is extracted from the document (PDFBox for PDFs, UTF-8 for text files)
3. Chunk — text is split into chunks using the configured strategy
4. Embed — each chunk is sent to the embedding API to generate a vector
5. Upsert — vectors are upserted into the Qdrant collection with metadata payload
6. Notify — a pipeline notification is published on completion
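The upload step might look like the following sketch with curl. The host/port and the multipart field name `file` are assumptions — only the endpoint path comes from this document:

```
curl -X POST http://localhost:8080/api/v1/pipeline/upload \
  -F "file=@manual.pdf"
```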