Ingest unstructured documents (PDFs, text files, HTML) into Qdrant, a high-performance vector database for RAG and semantic search applications. The pipeline extracts text from the document, chunks it using a configurable strategy, generates vector embeddings, and upserts the chunks with metadata into a Qdrant collection.

Configuration

"source": {
    "fileAttributes": {
        "unstructuredAttributes": {
            "fileExtension": "pdf",
            "preserveFilename": true
        }
    }
},
"destination": {
    "qdrant": {
        "collectionName": "financial_documents",
        "chunking": {
            "strategy": "recursive",
            "chunkSize": 500,
            "chunkOverlap": 50
        },
        "metadata": {
            "company": "Apple Inc",
            "documentType": "10-Q",
            "filingDate": "2026-01-30"
        },
        "qdrantSecretName": "oss/qdrant"
    }
}

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| collectionName | string | (required) | Qdrant collection name. Auto-created if it doesn’t exist. |
| chunking | object | recursive, 500/50 | Chunking strategy configuration (see below). |
| metadata | map | {} | Static key-value metadata stored on every chunk as Qdrant payload. Used for filtered search. |
| embeddingSecretName | string | server default | Optional override of the embedding Vault secret. Defaults to the server-level ai.embedding.secretName (oss/embedding), which is seeded automatically. Set this only to point a single pipeline at a different embedding model. |
| qdrantSecretName | string | (required) | Vault secret name for the Qdrant connection. |

Supported File Types

| Format | Description |
| --- | --- |
| PDF (.pdf) | Text extracted via Apache PDFBox |
| Word (.doc) | Text extracted via Apache POI (legacy format) |
| Word (.docx) | Text extracted via Apache POI (modern format) |
| PowerPoint (.ppt) | Text extracted via Apache POI (legacy format) |
| PowerPoint (.pptx) | Text extracted via Apache POI (modern format) |
| Excel (.xls, .xlsx) | Cell values extracted via Apache POI |
| HTML (.html, .htm) | Text extracted via JSoup (tags stripped) |
| RTF (.rtf) | Text extracted via javax.swing RTF parser |
| Email (.msg) | Subject, from, to, and body extracted via Apache POI |
| Email (.eml) | Subject, from, and body extracted via Jakarta Mail |
| EPUB (.epub) | XHTML content extracted and parsed via JSoup |
| Plain text (.txt, .md, .csv, .json, .xml) | Content used directly |

For structured data (CSV, JSON, XML), use database destinations (PostgreSQL, MongoDB) instead.

Vault Secrets

Qdrant connection

vault kv put secret/oss/qdrant \
  host="host.docker.internal" \
  port="6334" \
  apiKey=""

| Field | Description |
| --- | --- |
| host | Qdrant server hostname. Use host.docker.internal for local Qdrant. |
| port | gRPC port (default 6334). |
| apiKey | API key for Qdrant Cloud. Empty string for local instances. |

Embedding API

The embedding secret is server-level (ai.embedding.secretName, default oss/embedding) and is seeded automatically by docker/vault-init.sh. Each Vault secret is self-describing — the resolver reads provider, endpoint, model, apiKey, and (optionally) version from inside. See AI Configuration for the full picture.
vault kv put secret/oss/embedding \
  provider="openai" \
  endpoint="https://api.openai.com/v1/embeddings" \
  model="text-embedding-3-small" \
  apiKey="sk-..."

| Field | Description |
| --- | --- |
| provider | openai or ollama. (Anthropic has no embeddings API.) |
| endpoint | Embedding API URL. Must speak the OpenAI embeddings contract. |
| model | Embedding model name. |
| apiKey | API key. Omit for local Ollama. |

Common embedding configurations:

| Provider | Endpoint | Model | Dimensions |
| --- | --- | --- | --- |
| OpenAI | https://api.openai.com/v1/embeddings | text-embedding-3-small | 1536 |
| OpenAI | https://api.openai.com/v1/embeddings | text-embedding-3-large | 3072 |
| TEI (bundled sidecar) | http://tei:80/v1/embeddings | BAAI/bge-m3 | 1024 |

For Anthropic-only deployments, the bundled TEI sidecar serves bge-m3 and vault-init.sh seeds the embedding secret to point at it — no OpenAI key required.
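
To make the "OpenAI embeddings contract" concrete, here is a minimal sketch of the request and response shapes an endpoint is expected to handle. The helper names (build_embedding_request, parse_embedding_response) are hypothetical and not part of Datris, and the sample response is hand-written with toy three-dimensional vectors rather than real model output.

```python
import json


def build_embedding_request(model: str, chunks: list[str]) -> str:
    """Serialize an OpenAI-style embeddings request body."""
    return json.dumps({"model": model, "input": chunks})


def parse_embedding_response(body: str) -> list[list[float]]:
    """Return one vector per input, ordered by the response's `index` field."""
    data = json.loads(body)["data"]
    return [item["embedding"] for item in sorted(data, key=lambda d: d["index"])]


# Hand-written sample response (assumption: toy vectors, not real embeddings).
sample_response = json.dumps({
    "object": "list",
    "data": [
        {"object": "embedding", "index": 1, "embedding": [0.4, 0.5, 0.6]},
        {"object": "embedding", "index": 0, "embedding": [0.1, 0.2, 0.3]},
    ],
    "model": "text-embedding-3-small",
})

request_body = build_embedding_request(
    "text-embedding-3-small", ["chunk one", "chunk two"]
)
vectors = parse_embedding_response(sample_response)
```

The length of each returned vector is what the Dimensions column refers to, for example 1536 for text-embedding-3-small.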

Chunking Strategies

Documents are split into chunks before embedding. Each chunk becomes a separate Qdrant point with the document’s metadata plus chunk_index, filename, and source_pipeline fields.
"chunking": {
    "strategy": "recursive",
    "chunkSize": 500,
    "chunkOverlap": 50
}

| Strategy | Description |
| --- | --- |
| none | No chunking: one embedding per document. Only for very short documents. |
| fixed | Split by character count. Fast but may cut mid-sentence. |
| sentence | Split on sentence boundaries (. ! ?). Preserves semantic units. |
| paragraph | Split on double newlines. Ideal for structured documents with clear sections. |
| recursive | Try \n\n, then \n, then ., then space. Best general-purpose default. |

  • chunkSize (default 500): maximum characters per chunk
  • chunkOverlap (default 50): characters of overlap between consecutive chunks
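
The interaction between chunkSize and chunkOverlap is easiest to see with the fixed strategy. The sketch below is an illustration only, not the Datris implementation: it assumes each chunk starts chunkSize minus chunkOverlap characters after the previous one, and the function name chunk_fixed is hypothetical.

```python
def chunk_fixed(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Split text into character windows that overlap by chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each new chunk advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text, so the
        # last chunk is never just a repeat of the previous chunk's tail.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


chunks = chunk_fixed("abcdefghij", chunk_size=4, chunk_overlap=1)
# chunks == ['abcd', 'defg', 'ghij']
```

With the defaults (500/50), each chunk repeats the last 50 characters of the one before it, which helps a sentence that straddles a boundary stay retrievable from either side.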

Metadata

Static metadata is attached to every chunk in the Qdrant payload, enabling filtered semantic search:
"metadata": {
    "company": "Apple Inc",
    "documentType": "10-Q",
    "filingDate": "2026-01-30"
}
This allows queries like “find chunks about revenue from Apple 10-Q filings” by combining vector similarity with metadata filters. In addition to static metadata, every chunk automatically includes:
  • text — the chunk text
  • chunk_index — position of the chunk in the document
  • filename — original uploaded filename
  • source_pipeline — pipeline name

How It Works

  1. Upload — an unstructured file (PDF, text) is uploaded via POST /api/v1/pipeline/upload
  2. Extract — text is extracted from the document (PDFBox for PDFs, UTF-8 for text files)
  3. Chunk — text is split into chunks using the configured strategy
  4. Embed — each chunk is sent to the embedding API to generate a vector
  5. Upsert — vectors are upserted into the Qdrant collection with metadata payload
  6. Notify — a pipeline notification is published on completion
The collection is auto-created on first upsert using cosine distance and the embedding dimension detected from the model.
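
The auto-create and upsert steps can be sketched as the request bodies Qdrant's REST API accepts for creating a collection and upserting points. This is an illustration under assumptions: the pipeline itself connects over gRPC (port 6334), the helper names are hypothetical, and random UUIDs are used as point IDs here only because the page does not say how IDs are assigned.

```python
import uuid


def collection_config(sample_vector: list[float]) -> dict:
    """Body for PUT /collections/{name}: dimension is read off a sample embedding."""
    return {"vectors": {"size": len(sample_vector), "distance": "Cosine"}}


def upsert_body(vectors: list[list[float]], payloads: list[dict]) -> dict:
    """Body for PUT /collections/{name}/points: one point per chunk."""
    if len(vectors) != len(payloads):
        raise ValueError("need exactly one payload per vector")
    return {
        "points": [
            # Qdrant point IDs must be unsigned integers or UUIDs.
            {"id": str(uuid.uuid4()), "vector": vector, "payload": payload}
            for vector, payload in zip(vectors, payloads)
        ]
    }


vectors = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]  # toy 3-dimensional embeddings
payloads = [
    {"text": "Revenue rose.", "chunk_index": 0},
    {"text": "Costs fell.", "chunk_index": 1},
]
config = collection_config(vectors[0])
body = upsert_body(vectors, payloads)
```

Because the vector size is fixed when the collection is created, switching to an embedding model with a different dimension generally means writing to a new collection.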

Running Qdrant

Qdrant runs outside the Datris Docker stack (like Ollama). To run it locally in its own container:
docker run -p 6333:6333 -p 6334:6334 qdrant/qdrant
Or install natively from qdrant.tech/documentation/quick-start.

Verifying

# List collections
curl http://localhost:6333/collections

# Get collection info
curl http://localhost:6333/collections/financial_documents

# Scroll points with payload
curl -X POST http://localhost:6333/collections/financial_documents/points/scroll \
  -H "Content-Type: application/json" \
  -d '{"limit": 5, "with_payload": true, "with_vector": false}'

# Scroll with a metadata filter (exact-match filter only, no vector similarity)
curl -X POST http://localhost:6333/collections/financial_documents/points/scroll \
  -H "Content-Type: application/json" \
  -d '{"limit": 5, "with_payload": true, "filter": {"must": [{"key": "company", "match": {"value": "Apple Inc"}}]}}'
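
The scroll calls above only filter on payload. To combine vector similarity with a metadata filter, a query vector is needed as well. The sketch below builds a body for Qdrant's POST /collections/{name}/points/search endpoint (newer Qdrant versions also offer a query endpoint). The helper names are hypothetical, and the toy query vector stands in for an embedding of the search text produced by the same model used at ingest time.

```python
def metadata_filter(conditions: dict) -> dict:
    """Build a Qdrant filter requiring an exact match on every key."""
    return {
        "must": [
            {"key": key, "match": {"value": value}}
            for key, value in conditions.items()
        ]
    }


def search_body(query_vector: list[float], conditions: dict, limit: int = 5) -> dict:
    """Body for a filtered similarity search."""
    return {
        "vector": query_vector,
        "limit": limit,
        "with_payload": True,
        "filter": metadata_filter(conditions),
    }


# Toy query vector (assumption); a real one has the collection's dimension.
body = search_body(
    [0.1, 0.2, 0.3],
    {"company": "Apple Inc", "documentType": "10-Q"},
)
```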