Documentation Index
Fetch the complete documentation index at: https://docs.datris.ai/llms.txt
Use this file to discover all available pages before exploring further.
Ingest unstructured documents (PDFs, text files, HTML) into Qdrant, a high-performance vector database for RAG and semantic search applications. The pipeline extracts text from the document, chunks it using a configurable strategy, generates vector embeddings, and upserts the chunks with metadata into a Qdrant collection.
Configuration
"source": {
"fileAttributes": {
"unstructuredAttributes": {
"fileExtension": "pdf",
"preserveFilename": true
}
}
},
"destination": {
"qdrant": {
"collectionName": "financial_documents",
"chunking": {
"strategy": "recursive",
"chunkSize": 500,
"chunkOverlap": 50
},
"metadata": {
"company": "Apple Inc",
"documentType": "10-Q",
"filingDate": "2026-01-30"
},
"qdrantSecretName": "oss/qdrant"
}
}
| Field | Type | Default | Description |
|---|---|---|---|
| collectionName | string | (required) | Qdrant collection name. Auto-created if it doesn’t exist. |
| chunking | object | recursive, 500/50 | Chunking strategy configuration (see below). |
| metadata | map | {} | Static key-value metadata stored on every chunk as Qdrant payload. Used for filtered search. |
| embeddingSecretName | string | server default | Optional override of the embedding Vault secret. Defaults to the server-level ai.embedding.secretName (oss/embedding), which is seeded automatically — set this only to point a single pipeline at a different embedding model. |
| qdrantSecretName | string | (required) | Vault secret name for Qdrant connection. |
Supported File Types
| Format | Description |
|---|---|
| PDF (.pdf) | Text extracted via Apache PDFBox |
| Word (.doc) | Text extracted via Apache POI (legacy format) |
| Word (.docx) | Text extracted via Apache POI (modern format) |
| PowerPoint (.ppt) | Text extracted via Apache POI (legacy format) |
| PowerPoint (.pptx) | Text extracted via Apache POI (modern format) |
| Excel (.xls, .xlsx) | Cell values extracted via Apache POI |
| HTML (.html, .htm) | Text extracted via JSoup (tags stripped) |
| RTF (.rtf) | Text extracted via javax.swing RTF parser |
| Email (.msg) | Subject, from, to, and body extracted via Apache POI |
| Email (.eml) | Subject, from, and body extracted via Jakarta Mail |
| EPUB (.epub) | XHTML content extracted and parsed via JSoup |
| Plain text (.txt, .md, .csv, .json, .xml) | Content used directly |
For structured data (CSV, JSON, XML), use database destinations (PostgreSQL, MongoDB) instead.
Vault Secrets
Qdrant connection
```shell
vault kv put secret/oss/qdrant \
  host="host.docker.internal" \
  port="6334" \
  apiKey=""
```
| Field | Description |
|---|---|
| host | Qdrant server hostname. Use host.docker.internal for local Qdrant. |
| port | gRPC port (default 6334). |
| apiKey | API key for Qdrant Cloud. Empty string for local instances. |
Embedding API
The embedding secret is server-level (ai.embedding.secretName, default oss/embedding) and is seeded automatically by docker/vault-init.sh. Each Vault secret is self-describing — the resolver reads provider, endpoint, model, apiKey, and (optionally) version from inside. See AI Configuration for the full picture.
```shell
vault kv put secret/oss/embedding \
  provider="openai" \
  endpoint="https://api.openai.com/v1/embeddings" \
  model="text-embedding-3-small" \
  apiKey="sk-..."
```
| Field | Description |
|---|---|
| provider | openai or ollama. (Anthropic has no embeddings API.) |
| endpoint | Embedding API URL. Must speak the OpenAI embeddings contract. |
| model | Embedding model name. |
| apiKey | API key. Omit for local Ollama. |
Common embedding configurations:
| Provider | Endpoint | Model | Dimensions |
|---|---|---|---|
| OpenAI | https://api.openai.com/v1/embeddings | text-embedding-3-small | 1536 |
| OpenAI | https://api.openai.com/v1/embeddings | text-embedding-3-large | 3072 |
| TEI (bundled sidecar) | http://tei:80/v1/embeddings | BAAI/bge-m3 | 1024 |
For Anthropic-only deployments, the bundled TEI sidecar serves bge-m3 and vault-init.sh seeds the embedding secret to point at it — no OpenAI key required.
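Whichever provider is configured, the pipeline talks to it over the OpenAI embeddings contract. As an illustrative sketch (not the pipeline's actual code), the request body it would send to the endpoint looks like this, using the model name from the example secret above:

```python
import json

def build_embedding_request(model: str, texts: list[str]) -> str:
    """Serialize an OpenAI-style embeddings request: one model name,
    a batch of input texts, one vector returned per text (in order)."""
    return json.dumps({"model": model, "input": texts})

body = build_embedding_request("text-embedding-3-small", ["first chunk", "second chunk"])
# The response body has the shape:
# {"data": [{"index": 0, "embedding": [...]}, {"index": 1, "embedding": [...]}], ...}
```

Because TEI and Ollama also accept this request shape at their `/v1/embeddings` routes, swapping providers only changes the secret, not the pipeline.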
Chunking Strategies
Documents are split into chunks before embedding. Each chunk becomes a separate Qdrant point with the document’s metadata plus chunk_index, filename, and source_pipeline fields.
"chunking": {
"strategy": "recursive",
"chunkSize": 500,
"chunkOverlap": 50
}
| Strategy | Description |
|---|---|
| none | No chunking — one embedding per document. Only for very short documents. |
| fixed | Split by character count. Fast but may cut mid-sentence. |
| sentence | Split on sentence boundaries (. ! ?). Preserves semantic units. |
| paragraph | Split on double newlines. Ideal for structured documents with clear sections. |
| recursive | Try \n\n, then \n, then ., then space — best general-purpose default. |
- chunkSize (default 500): maximum characters per chunk
- chunkOverlap (default 50): characters of overlap between consecutive chunks
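The recursive strategy can be sketched as a greedy loop: cut each chunk at the coarsest separator found before the size limit, fall back to finer separators, hard-cut at chunkSize if none match, and start the next chunk chunkOverlap characters before the cut. This is an illustrative reimplementation under those assumptions, not the pipeline's actual code:

```python
def recursive_chunks(text, chunk_size=500, overlap=50,
                     separators=("\n\n", "\n", ".", " ")):
    """Greedy sketch of recursive chunking."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            window = text[start:end]
            # Prefer the coarsest separator that appears in this window.
            for sep in separators:
                cut = window.rfind(sep)
                if cut > 0:
                    end = start + cut + len(sep)
                    break
        chunks.append(text[start:end])
        if end >= len(text):
            break
        # Back up by `overlap` so consecutive chunks share context.
        start = max(end - overlap, start + 1)
    return chunks
```

Every chunk stays at or under chunk_size, and each chunk after the first repeats the last `overlap` characters of its predecessor, which helps queries that land near a chunk boundary.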
Static metadata is attached to every chunk in the Qdrant payload, enabling filtered semantic search:
"metadata": {
"company": "Apple Inc",
"documentType": "10-Q",
"filingDate": "2026-01-30"
}
This allows queries like “find chunks about revenue from Apple 10-Q filings” by combining vector similarity with metadata filters.
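As a sketch of such a query against Qdrant's REST search endpoint (POST /collections/&lt;name&gt;/points/search), the request body combines a query vector — produced by embedding the query text with the same model used at ingest — with a payload filter in Qdrant's filter syntax; the helper name below is illustrative:

```python
def filtered_search_body(vector, company, limit=5):
    """Qdrant search request: vector similarity restricted to points
    whose payload field `company` matches the given value exactly."""
    return {
        "vector": vector,
        "limit": limit,
        "with_payload": True,
        "filter": {
            "must": [
                {"key": "company", "match": {"value": company}}
            ]
        },
    }

# Placeholder vector; a real query embeds the question text first
# (1536 dimensions for text-embedding-3-small).
body = filtered_search_body([0.1] * 1536, "Apple Inc")
```

The vector's dimension must match the collection's, so the query must use the same embedding model as the ingest pipeline.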
In addition to static metadata, every chunk automatically includes:
- text — the chunk text
- chunk_index — position of the chunk in the document
- filename — original uploaded filename
- source_pipeline — pipeline name
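Conceptually, each upserted point's payload is the static metadata merged with these automatic fields. A minimal sketch, with a hypothetical helper name and example values:

```python
def chunk_payload(static_metadata: dict, text: str, chunk_index: int,
                  filename: str, source_pipeline: str) -> dict:
    """Merge pipeline-level static metadata with per-chunk automatic fields."""
    return {
        **static_metadata,
        "text": text,
        "chunk_index": chunk_index,
        "filename": filename,
        "source_pipeline": source_pipeline,
    }

payload = chunk_payload(
    {"company": "Apple Inc", "documentType": "10-Q"},
    text="(chunk text here)",
    chunk_index=0,
    filename="aapl-10q.pdf",      # hypothetical upload name
    source_pipeline="financial-ingest",  # hypothetical pipeline name
)
```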
How It Works
- Upload — an unstructured file (PDF, text) is uploaded via POST /api/v1/pipeline/upload
- Extract — text is extracted from the document (PDFBox for PDFs, UTF-8 for text files)
- Chunk — text is split into chunks using the configured strategy
- Embed — each chunk is sent to the embedding API to generate a vector
- Upsert — vectors are upserted into the Qdrant collection with metadata payload
- Notify — a pipeline notification is published on completion
The collection is auto-created on first upsert using cosine distance and the embedding dimension detected from the model.
Running Qdrant
Qdrant runs outside the application's Docker Compose stack (like Ollama). Run it locally via Docker:
```shell
docker run -p 6333:6333 -p 6334:6334 qdrant/qdrant
```
Or install natively from qdrant.tech/documentation/quick-start.
Verifying
```shell
# List collections
curl http://localhost:6333/collections

# Get collection info
curl http://localhost:6333/collections/financial_documents

# Scroll points with payload
curl -X POST http://localhost:6333/collections/financial_documents/points/scroll \
  -H "Content-Type: application/json" \
  -d '{"limit": 5, "with_payload": true, "with_vector": false}'

# Scroll points matching a metadata filter
curl -X POST http://localhost:6333/collections/financial_documents/points/scroll \
  -H "Content-Type: application/json" \
  -d '{"limit": 5, "with_payload": true, "filter": {"must": [{"key": "company", "match": {"value": "Apple Inc"}}]}}'
```