Ingest unstructured documents into PostgreSQL with the pgvector extension for retrieval-augmented generation (RAG) and semantic search. The pipeline extracts text from the document, chunks it using a configurable strategy, generates vector embeddings, and upserts the chunks with metadata into a PostgreSQL table with a vector column. pgvector runs inside your existing PostgreSQL infrastructure — no separate vector database server required. Standard SQL lets you combine vector similarity search with traditional filters.

Configuration

"source": {
    "fileAttributes": {
        "unstructuredAttributes": {
            "fileExtension": "pdf",
            "preserveFilename": true
        }
    }
},
"destination": {
    "pgvector": {
        "tableName": "financial_documents",
        "schemaName": "public",
        "chunking": {
            "strategy": "recursive",
            "chunkSize": 500,
            "chunkOverlap": 50
        },
        "metadata": {
            "company": "Apple Inc",
            "document_type": "10-Q",
            "filing_date": "2026-01-30"
        },
        "embeddingSecretName": "oss/embedding",
        "postgresSecretName": "oss/pgvector"
    }
}
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| tableName | string | (required) | PostgreSQL table name. Auto-created with a vector column if it doesn't exist. |
| schemaName | string | "public" | PostgreSQL schema name. Auto-created if it doesn't exist. |
| chunking | object | recursive, 500/50 | Chunking strategy configuration (see below). |
| metadata | map | {} | Static key-value metadata stored as columns on every chunk. Use snake_case for column names. |
| embeddingSecretName | string | (required) | Vault secret name for the embedding API configuration. |
| postgresSecretName | string | (required) | Vault secret name for the PostgreSQL connection. |

Supported File Types

| Format | Description |
| --- | --- |
| PDF (.pdf) | Text extracted via Apache PDFBox |
| Word (.doc) | Text extracted via Apache POI (legacy format) |
| Word (.docx) | Text extracted via Apache POI (modern format) |
| PowerPoint (.ppt) | Text extracted via Apache POI (legacy format) |
| PowerPoint (.pptx) | Text extracted via Apache POI (modern format) |
| Excel (.xls, .xlsx) | Cell values extracted via Apache POI |
| HTML (.html, .htm) | Text extracted via JSoup (tags stripped) |
| RTF (.rtf) | Text extracted via javax.swing RTF parser |
| Email (.msg) | Subject, from, to, and body extracted via Apache POI |
| Email (.eml) | Subject, from, and body extracted via Jakarta Mail |
| EPUB (.epub) | XHTML content extracted and parsed via JSoup |
| Plain text (.txt, .md, .csv, .json, .xml) | Content used directly |
For structured data (CSV, JSON, XML), use the standard PostgreSQL database destination instead.

Vault Secrets

PostgreSQL connection (for pgvector)

vault kv put secret/oss/pgvector \
  jdbcUrl="jdbc:postgresql://postgres:5432/datris" \
  username="postgres" \
  password="postgres"
| Field | Description |
| --- | --- |
| jdbcUrl | JDBC URL for the PostgreSQL database with the pgvector extension installed. |
| username | PostgreSQL username. |
| password | PostgreSQL password. |
This is separate from the pipeline’s standard oss/postgres secret so the vector store can target a different database or server.

Embedding API

The embedding API is shared with other vector database destinations (e.g., Qdrant, Weaviate). See Qdrant documentation for full details.
vault kv put secret/oss/embedding \
  endpoint="https://api.openai.com/v1/embeddings" \
  model="text-embedding-3-small" \
  apiKey="sk-..."

Chunking Strategies

Documents are split into chunks before embedding. Each chunk becomes a row in the PostgreSQL table with the document’s metadata columns plus chunk_index, filename, and source_pipeline.
"chunking": {
    "strategy": "recursive",
    "chunkSize": 500,
    "chunkOverlap": 50
}
| Strategy | Description |
| --- | --- |
| none | No chunking — one embedding per document. Only for very short documents. |
| fixed | Split by character count. Fast but may cut mid-sentence. |
| sentence | Split on sentence boundaries (. ! ?). Preserves semantic units. |
| paragraph | Split on double newlines. Ideal for structured documents with clear sections. |
| recursive | Try \n\n, then \n, then ., then space — best general-purpose default. |
  • chunkSize (default 500): maximum characters per chunk
  • chunkOverlap (default 50): characters of overlap between consecutive chunks
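The recursive strategy can be sketched in a few lines of Python. This is an illustration of the splitting order and overlap semantics, not the pipeline's actual implementation: separators are tried in priority order, and a cut is only accepted in the second half of the window so chunks don't degenerate into tiny fragments.

```python
def recursive_chunk(text: str, chunk_size: int = 500, chunk_overlap: int = 50,
                    separators: tuple = ("\n\n", "\n", ".", " ")) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Separators are tried in priority order (paragraph break, newline,
    sentence end, space). Consecutive chunks share roughly
    chunk_overlap characters of context.
    """
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            for sep in separators:
                # Accept a cut only in the second half of the window.
                cut = text.rfind(sep, start + chunk_size // 2, end)
                if cut != -1:
                    end = cut + len(sep)
                    break
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        if end >= len(text):
            break
        # Back up by the overlap, but always make forward progress.
        start = max(end - chunk_overlap, start + 1)
    return chunks
```

With the defaults (500/50), a 10-page filing yields a few dozen overlapping chunks, each small enough to embed as a single unit.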

Metadata

Static metadata is stored as dedicated columns in the PostgreSQL table:
"metadata": {
    "company": "Apple Inc",
    "document_type": "10-Q",
    "filing_date": "2026-01-30"
}
Use snake_case for metadata keys since they become PostgreSQL column names. This enables powerful combined queries:
SELECT text FROM financial_documents
WHERE company = 'Apple Inc' AND document_type = '10-Q'
ORDER BY embedding <=> '[query_vector]'
LIMIT 5;
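The [query_vector] placeholder above is a pgvector input literal: the embedding written as a bracketed, comma-separated list of floats. A small helper (hypothetical, not part of the pipeline) to format one:

```python
def to_pgvector_literal(embedding: list[float]) -> str:
    """Format a list of floats as a pgvector input literal,
    e.g. [0.1, -0.2] -> '[0.1,-0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"
```

From a driver such as psycopg you would typically bind the literal as a parameter and cast it, e.g. ORDER BY embedding <=> %s::vector, rather than interpolating it into the SQL string.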
Every row automatically includes:
  • id — deterministic UUID (idempotent upserts)
  • text — the chunk text
  • chunk_index — position of the chunk in the document
  • filename — original uploaded filename
  • source_pipeline — pipeline name
  • embedding — vector column for similarity search
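The docs do not specify how the deterministic id is derived, but a name-based (v5) UUID over the filename and chunk index behaves as described: re-ingesting the same file reproduces the same ids, so upserts overwrite rather than duplicate. A sketch, with a hypothetical namespace:

```python
import uuid

# Hypothetical namespace; the pipeline's actual id scheme is not documented here.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "pgvector-ingest")

def chunk_id(filename: str, chunk_index: int) -> uuid.UUID:
    """Name-based (v5) UUID: identical inputs always yield the same id."""
    return uuid.uuid5(NAMESPACE, f"{filename}:{chunk_index}")
```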

How It Works

  1. Upload — an unstructured file is uploaded via POST /api/v1/pipeline/upload
  2. Extract — text is extracted from the document
  3. Chunk — text is split into chunks using the configured strategy
  4. Embed — each chunk is sent to the embedding API to generate a vector
  5. Upsert — chunks are upserted into PostgreSQL with INSERT ... ON CONFLICT DO UPDATE
  6. Notify — a pipeline notification is published on completion
The pgvector extension and table are auto-created on first upsert. Vector dimension is detected from the embedding model.
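The idempotent upsert in step 5 can be sketched as SQL generation. This is a minimal illustration of the INSERT ... ON CONFLICT DO UPDATE shape, assuming id is the conflict target; the pipeline's actual statement may differ:

```python
def build_upsert_sql(table: str, columns: list[str]) -> str:
    """Build an INSERT ... ON CONFLICT (id) DO UPDATE statement with one
    placeholder per column; on id collision the row is overwritten."""
    cols = ", ".join(columns)
    placeholders = ", ".join(["%s"] * len(columns))
    # EXCLUDED refers to the row that failed to insert.
    updates = ", ".join(f"{c} = EXCLUDED.{c}" for c in columns if c != "id")
    return (f"INSERT INTO {table} ({cols}) VALUES ({placeholders}) "
            f"ON CONFLICT (id) DO UPDATE SET {updates}")
```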

Running PostgreSQL with pgvector

The standard PostgreSQL Docker image does not include pgvector. Use the pgvector image:
docker run -p 5433:5432 -e POSTGRES_PASSWORD=postgres pgvector/pgvector:pg16
Port 5433 avoids conflict with the pipeline’s PostgreSQL instance on 5432. To add pgvector to an existing PostgreSQL instance:
CREATE EXTENSION IF NOT EXISTS vector;

Verifying

# Check pgvector extension is installed
psql -h localhost -p 5433 -U postgres -d datris -c "SELECT extname FROM pg_extension WHERE extname = 'vector';"

# List chunks
psql -h localhost -p 5433 -U postgres -d datris -c "SELECT id, chunk_index, filename, company FROM public.financial_documents LIMIT 5;"

# Similarity search (replace [...] with your query vector)
psql -h localhost -p 5433 -U postgres -d datris -c "
  SELECT text, 1 - (embedding <=> '[...]') AS similarity
  FROM public.financial_documents
  ORDER BY embedding <=> '[...]'
  LIMIT 5;
"

# Combined filter + similarity search
psql -h localhost -p 5433 -U postgres -d datris -c "
  SELECT text, chunk_index
  FROM public.financial_documents
  WHERE company = 'Apple Inc'
  ORDER BY embedding <=> '[...]'
  LIMIT 5;
"
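pgvector's <=> operator computes cosine distance, which is why the earlier similarity query reports 1 - (embedding <=> '[...]') as the similarity score. The same quantity in pure Python, for reference:

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance as computed by pgvector's <=> operator:
    1 - (a . b) / (|a| * |b|). 0 = identical direction, 2 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm
```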

Advantages Over Dedicated Vector Databases

  • No separate server — uses your existing PostgreSQL infrastructure
  • Standard SQL — combine vector search with traditional WHERE clauses, JOINs, aggregations
  • ACID transactions — full transactional guarantees on vector data
  • Familiar tooling — use psql, pgAdmin, any PostgreSQL client
  • No new dependencies — uses the existing PostgreSQL JDBC driver