Ingest unstructured documents into Chroma, a lightweight, developer-friendly open-source vector database. The pipeline extracts text from the document, chunks it using a configurable strategy, generates vector embeddings, and upserts the chunks with metadata into a Chroma collection via its REST API. Chroma is the simplest vector database to set up — a single Docker container with no external dependencies.

Configuration

"source": {
    "fileAttributes": {
        "unstructuredAttributes": {
            "fileExtension": "pdf",
            "preserveFilename": true
        }
    }
},
"destination": {
    "chroma": {
        "collectionName": "financial_documents",
        "chunking": {
            "strategy": "recursive",
            "chunkSize": 500,
            "chunkOverlap": 50
        },
        "metadata": {
            "company": "Apple Inc",
            "documentType": "10-Q",
            "filingDate": "2026-01-30"
        },
        "embeddingSecretName": "oss/embedding",
        "chromaSecretName": "oss/chroma"
    }
}
Field                Type    Default            Description
collectionName       string  (required)         Chroma collection name. Auto-created with cosine distance if it doesn’t exist.
chunking             object  recursive, 500/50  Chunking strategy configuration (see below).
metadata             map     {}                 Static key-value metadata stored on every chunk. Used for filtered search.
embeddingSecretName  string  (required)         Vault secret name for the embedding API configuration.
chromaSecretName     string  (required)         Vault secret name for the Chroma connection.
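
Only collectionName and the two secret names are required. A minimal configuration that accepts the defaults (recursive chunking at 500/50, no static metadata) might look like this — a sketch, not taken verbatim from the pipeline:

```json
"destination": {
    "chroma": {
        "collectionName": "financial_documents",
        "embeddingSecretName": "oss/embedding",
        "chromaSecretName": "oss/chroma"
    }
}
```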

Supported File Types

Format                                     Description
PDF (.pdf)                                 Text extracted via Apache PDFBox
Word (.doc)                                Text extracted via Apache POI (legacy format)
Word (.docx)                               Text extracted via Apache POI (modern format)
PowerPoint (.ppt)                          Text extracted via Apache POI (legacy format)
PowerPoint (.pptx)                         Text extracted via Apache POI (modern format)
Excel (.xls, .xlsx)                        Cell values extracted via Apache POI
HTML (.html, .htm)                         Text extracted via JSoup (tags stripped)
RTF (.rtf)                                 Text extracted via javax.swing RTF parser
Email (.msg)                               Subject, from, to, and body extracted via Apache POI
Email (.eml)                               Subject, from, and body extracted via Jakarta Mail
EPUB (.epub)                               XHTML content extracted and parsed via JSoup
Plain text (.txt, .md, .csv, .json, .xml)  Content used directly
For structured data (CSV, JSON, XML), use database destinations (PostgreSQL, MongoDB) instead.

Vault Secrets

Chroma connection

vault kv put secret/oss/chroma \
  host="host.docker.internal" \
  port="8000"
Field  Description
host   Chroma server hostname. Use host.docker.internal for local Chroma.
port   REST API port (default 8000).
Chroma does not require authentication by default. For Chroma Cloud or authenticated instances, add token-based auth to the loader as needed.

Embedding API

The embedding API is shared with other vector database destinations. See Qdrant documentation for full details.
vault kv put secret/oss/embedding \
  endpoint="https://api.openai.com/v1/embeddings" \
  model="text-embedding-3-small" \
  apiKey="sk-..."
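
The secret’s three fields map directly onto an OpenAI-compatible embeddings request. The sketch below shows how such a request could be assembled from the secret values; it is illustrative of the API shape, not the loader’s actual code, and `build_embedding_request` is a hypothetical helper:

```python
import json
import urllib.request

def build_embedding_request(endpoint, model, api_key, texts):
    """Build an OpenAI-compatible POST to the embeddings endpoint.

    `texts` is a batch of chunk strings; the response carries one
    vector per input under data[i].embedding."""
    payload = {"model": model, "input": texts}
    return urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_embedding_request(
    "https://api.openai.com/v1/embeddings",
    "text-embedding-3-small",
    "sk-...",
    ["first chunk of text", "second chunk of text"],
)
```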

Chunking Strategies

Documents are split into chunks before embedding. Each chunk becomes a separate entry in the Chroma collection with the document’s metadata plus chunk_index, filename, and source_pipeline fields.
"chunking": {
    "strategy": "recursive",
    "chunkSize": 500,
    "chunkOverlap": 50
}
Strategy   Description
none       No chunking — one embedding per document. Only for very short documents.
fixed      Split by character count. Fast but may cut mid-sentence.
sentence   Split on sentence boundaries (. ! ?). Preserves semantic units.
paragraph  Split on double newlines. Ideal for structured documents with clear sections.
recursive  Try \n\n, then \n, then ., then space — best general-purpose default.
  • chunkSize (default 500): maximum characters per chunk
  • chunkOverlap (default 50): characters of overlap between consecutive chunks
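
The recursive strategy can be sketched as follows. This is an illustrative reimplementation, not the loader’s actual code: it splits on the coarsest separator first, accumulates pieces up to chunkSize, and recurses into oversized pieces with progressively finer separators.

```python
def recursive_chunk(text, chunk_size=500, chunk_overlap=50,
                    separators=("\n\n", "\n", ".", " ")):
    """Split text for embedding: coarse separators first, finer on recursion."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    for i, sep in enumerate(separators):
        if sep not in text:
            continue
        chunks, current = [], ""
        for piece in text.split(sep):
            candidate = current + sep + piece if current else piece
            if len(candidate) <= chunk_size:
                current = candidate
            else:
                if current:
                    chunks.append(current)
                if len(piece) > chunk_size:
                    # Piece alone is too big: recurse with the finer separators.
                    chunks.extend(recursive_chunk(piece, chunk_size, 0,
                                                  separators[i + 1:]))
                    current = ""
                else:
                    # Seed the next chunk with an overlapping tail of the last one.
                    tail = chunks[-1][-chunk_overlap:] if chunks and chunk_overlap else ""
                    current = tail + piece if len(tail) + len(piece) <= chunk_size else piece
        if current:
            chunks.append(current)
        return chunks
    # No separator applies: hard split by character count with overlap.
    step = max(chunk_size - chunk_overlap, 1)
    return [text[j:j + chunk_size] for j in range(0, len(text), step)]
```

Every returned chunk stays within chunkSize; the overlap carries trailing context from one chunk into the next so that sentences cut at a boundary remain searchable.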

Metadata

Static metadata is stored on every chunk in the Chroma collection:
"metadata": {
    "company": "Apple Inc",
    "documentType": "10-Q",
    "filingDate": "2026-01-30"
}
This allows filtered queries combining similarity with metadata predicates using Chroma’s where clause. Every entry automatically includes:
  • text — the chunk text (stored as Chroma document)
  • chunk_index — position of the chunk in the document
  • filename — original uploaded filename
  • source_pipeline — pipeline name
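
A filtered query combines a query vector with a where clause built from these metadata fields. The sketch below assembles such a payload using Chroma’s Mongo-style filter operators ($eq, $and); `build_filtered_query` is a hypothetical helper, and the payload would be POSTed to the collection’s query endpoint:

```python
def build_filtered_query(query_vector, n_results=5,
                         company=None, document_type=None):
    """Build a Chroma query payload: similarity search plus metadata predicates."""
    clauses = []
    if company:
        clauses.append({"company": {"$eq": company}})
    if document_type:
        clauses.append({"documentType": {"$eq": document_type}})
    payload = {
        "query_embeddings": [query_vector],
        "n_results": n_results,
        "include": ["documents", "metadatas", "distances"],
    }
    if len(clauses) == 1:
        payload["where"] = clauses[0]
    elif clauses:
        payload["where"] = {"$and": clauses}
    return payload

payload = build_filtered_query([0.1] * 1536,
                               company="Apple Inc", document_type="10-Q")
```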

How It Works

  1. Upload — an unstructured file is uploaded via POST /api/v1/pipeline/upload
  2. Extract — text is extracted from the document
  3. Chunk — text is split into chunks using the configured strategy
  4. Embed — each chunk is sent to the embedding API to generate a vector
  5. Upsert — vectors are upserted into the Chroma collection via REST API with metadata
  6. Notify — a pipeline notification is published on completion
The collection is auto-created on first upsert with cosine distance. No Java client library is needed — the loader communicates with Chroma’s REST API directly via HTTP.
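
The upsert step (5) amounts to one POST per batch of chunks, with ids, embeddings, documents, and metadatas as parallel arrays. The sketch below shows how such a body could be assembled; the `filename:index` id scheme and the helper itself are assumptions for illustration, not the loader’s actual internals:

```python
def build_upsert_payload(filename, pipeline, chunks, vectors, static_metadata):
    """Assemble parallel arrays for a Chroma collection upsert.

    Each chunk gets a deterministic id plus the configured static metadata
    and the automatic chunk_index / filename / source_pipeline fields."""
    assert len(chunks) == len(vectors), "one vector per chunk"
    return {
        "ids": [f"{filename}:{i}" for i in range(len(chunks))],
        "embeddings": vectors,
        "documents": chunks,
        "metadatas": [
            {**static_metadata,
             "chunk_index": i,
             "filename": filename,
             "source_pipeline": pipeline}
            for i in range(len(chunks))
        ],
    }

payload = build_upsert_payload(
    "10q.pdf", "sec-filings",
    ["chunk a", "chunk b"],
    [[0.1, 0.2], [0.3, 0.4]],
    {"company": "Apple Inc", "documentType": "10-Q"},
)
```

Because the ids are deterministic, re-running the pipeline on the same file overwrites the existing entries instead of duplicating them — one reason upsert is used rather than plain insert.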

Running Chroma

Chroma runs separately from the pipeline (like the other vector databases). Start it locally with a single command:
docker run -d --name chroma -p 8000:8000 chromadb/chroma:latest
That’s it — no external dependencies like etcd or MinIO. Chroma is the simplest vector database to run.

Verifying

# Check Chroma is running
curl http://localhost:8000/api/v2/heartbeat

# List collections
curl http://localhost:8000/api/v2/tenants/default_tenant/databases/default_database/collections

# Get collection details
curl http://localhost:8000/api/v2/tenants/default_tenant/databases/default_database/collections/financial_documents

# Count entries in collection (replace COLLECTION_ID with the actual UUID)
curl http://localhost:8000/api/v2/tenants/default_tenant/databases/default_database/collections/COLLECTION_ID/count