vector column.
pgvector uses your existing PostgreSQL infrastructure — no separate vector database server required. Standard SQL can be used to combine vector similarity search with traditional filters.
Configuration
| Field | Type | Default | Description |
|---|---|---|---|
| tableName | string | (required) | PostgreSQL table name. Auto-created with vector column if it doesn’t exist. |
| schemaName | string | "public" | PostgreSQL schema name. Auto-created if it doesn’t exist. |
| chunking | object | recursive, 500/50 | Chunking strategy configuration (see below). |
| metadata | map | {} | Static key-value metadata stored as columns on every chunk. Use snake_case for column names. |
| embeddingSecretName | string | (required) | Vault secret name for the embedding API configuration. |
| postgresSecretName | string | (required) | Vault secret name for PostgreSQL connection. |
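As a rough illustration of how the defaults in the table combine with user-supplied values, the sketch below resolves a configuration in Python. The dictionary layout, the `strategy` key inside `chunking`, the helper name `resolve_config`, and the example table and secret names are all assumptions made for illustration; only the field names and default values come from the table above.

```python
from copy import deepcopy

# Defaults as listed in the Configuration table. The "strategy" key name
# is an assumption; the table only states "recursive, 500/50".
DEFAULTS = {
    "schemaName": "public",
    "chunking": {"strategy": "recursive", "chunkSize": 500, "chunkOverlap": 50},
    "metadata": {},
}
REQUIRED = ("tableName", "embeddingSecretName", "postgresSecretName")


def resolve_config(user_config: dict) -> dict:
    """Check the required fields and fill in the documented defaults."""
    missing = [name for name in REQUIRED if not user_config.get(name)]
    if missing:
        raise ValueError(f"missing required fields: {', '.join(missing)}")
    resolved = deepcopy(DEFAULTS)
    for key, value in user_config.items():
        if key == "chunking":
            # Merge so that a partial chunking object keeps the other defaults.
            resolved["chunking"].update(value)
        else:
            resolved[key] = value
    return resolved


config = resolve_config({
    "tableName": "document_chunks",           # hypothetical table name
    "embeddingSecretName": "oss/embedding",   # hypothetical secret name
    "postgresSecretName": "oss/pgvector",     # hypothetical secret name
    "chunking": {"chunkSize": 800},
    "metadata": {"department": "legal"},
})
```

Here `schemaName` falls back to "public" and `chunkOverlap` to 50, while `chunkSize` is overridden to 800.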
Supported File Types
| Format | Description |
|---|---|
| PDF (.pdf) | Text extracted via Apache PDFBox |
| Word (.doc) | Text extracted via Apache POI (legacy format) |
| Word (.docx) | Text extracted via Apache POI (modern format) |
| PowerPoint (.ppt) | Text extracted via Apache POI (legacy format) |
| PowerPoint (.pptx) | Text extracted via Apache POI (modern format) |
| Excel (.xls, .xlsx) | Cell values extracted via Apache POI |
| HTML (.html, .htm) | Text extracted via JSoup (tags stripped) |
| RTF (.rtf) | Text extracted via javax.swing RTF parser |
| Email (.msg) | Subject, from, to, and body extracted via Apache POI |
| Email (.eml) | Subject, from, and body extracted via Jakarta Mail |
| EPUB (.epub) | XHTML content extracted and parsed via JSoup |
| Plain text (.txt, .md, .csv, .json, .xml) | Content used directly |
Vault Secrets
PostgreSQL connection (for pgvector)
| Field | Description |
|---|---|
| jdbcUrl | JDBC URL for the PostgreSQL database with pgvector extension installed. |
| username | PostgreSQL username. |
| password | PostgreSQL password. |
oss/postgres secret so the vector store can target a different database or server.
Embedding API
The embedding API is shared with other vector database destinations (e.g., Qdrant, Weaviate). See Qdrant documentation for full details.
Chunking Strategies
Documents are split into chunks before embedding. Each chunk becomes a row in the PostgreSQL table with the document’s metadata columns plus chunk_index, filename, and source_pipeline.
| Strategy | Description |
|---|---|
| none | No chunking: one embedding per document. Only for very short documents. |
| fixed | Split by character count. Fast but may cut mid-sentence. |
| sentence | Split on sentence boundaries (. ! ?). Preserves semantic units. |
| paragraph | Split on double newlines. Ideal for structured documents with clear sections. |
| recursive | Try \n\n, then \n, then ., then space. Best general-purpose default. |
- chunkSize (default 500): maximum characters per chunk
- chunkOverlap (default 50): characters of overlap between consecutive chunks
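To make the interaction between chunkSize and chunkOverlap concrete, here is a minimal Python sketch of the fixed strategy. It is not the product's implementation: the function name `fixed_chunks` and the exact boundary handling are assumptions. It only shows that each chunk starts `chunkSize - chunkOverlap` characters after the previous one.

```python
def fixed_chunks(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Split text by character count, repeating chunk_overlap characters
    at the start of each chunk after the first."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # distance between chunk start offsets
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
        start += step
    return chunks


# With the defaults, a 1200-character document yields chunks starting
# at offsets 0, 450, and 900.
example = fixed_chunks("a" * 1200)
```

The position of each element in the returned list corresponds to the chunk_index stored with the chunk.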
Metadata
Static metadata is stored as dedicated columns in the PostgreSQL table. Every row also has the following columns:
- id: deterministic UUID (idempotent upserts)
- text: the chunk text
- chunk_index: position of the chunk in the document
- filename: original uploaded filename
- source_pipeline: pipeline name
- embedding: vector column for similarity search
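The source does not say how the deterministic UUID is derived. One common way to get an ID that stays stable across re-uploads is a name-based (version 5) UUID over the fields that identify a chunk. The sketch below assumes the pipeline name, filename, and chunk index are the inputs; the namespace value and the function name `chunk_id` are hypothetical.

```python
import uuid

# Hypothetical namespace; any fixed UUID works as long as it never changes.
CHUNK_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.invalid/pgvector-chunks")


def chunk_id(source_pipeline: str, filename: str, chunk_index: int) -> uuid.UUID:
    """Derive the same UUID every time for the same (pipeline, file, index)."""
    name = f"{source_pipeline}/{filename}#{chunk_index}"
    return uuid.uuid5(CHUNK_NAMESPACE, name)


first = chunk_id("contracts", "nda.pdf", 0)
again = chunk_id("contracts", "nda.pdf", 0)
other = chunk_id("contracts", "nda.pdf", 1)
```

Because `first` and `again` are equal, re-processing the same file overwrites existing rows instead of adding duplicates, which is what makes the upsert idempotent.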
How It Works
- Upload: an unstructured file is uploaded via POST /api/v1/pipeline/upload
- Extract: text is extracted from the document
- Chunk: text is split into chunks using the configured strategy
- Embed: each chunk is sent to the embedding API to generate a vector
- Upsert: chunks are upserted into PostgreSQL with INSERT ... ON CONFLICT DO UPDATE
- Notify: a pipeline notification is published on completion
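The upsert step can be sketched as a statement builder. The example below is an assumption-laden illustration in Python, not the statement the product generates. It assumes `id` is the conflict target, uses `%s` placeholders in the style of the psycopg driver, and casts the embedding parameter to pgvector's `vector` type. The function name `build_upsert_sql` is hypothetical.

```python
import re

# Identifiers are interpolated into SQL, so restrict them to snake_case.
_IDENT = re.compile(r"^[a-z_][a-z0-9_]*$")


def build_upsert_sql(schema: str, table: str, metadata_columns: list[str]) -> str:
    """Build an INSERT ... ON CONFLICT (id) DO UPDATE statement for one chunk."""
    columns = ["id", "text", "chunk_index", "filename", "source_pipeline",
               *metadata_columns, "embedding"]
    for ident in (schema, table, *columns):
        if not _IDENT.match(ident):
            raise ValueError(f"invalid identifier: {ident!r}")

    # The embedding is passed as text such as '[0.1,0.2]' and cast to vector.
    placeholders = ", ".join("%s::vector" if c == "embedding" else "%s" for c in columns)
    # On conflict, overwrite every column except the key itself.
    updates = ", ".join(f"{c} = EXCLUDED.{c}" for c in columns if c != "id")
    return (
        f"INSERT INTO {schema}.{table} ({', '.join(columns)}) "
        f"VALUES ({placeholders}) "
        f"ON CONFLICT (id) DO UPDATE SET {updates}"
    )


upsert_sql = build_upsert_sql("public", "document_chunks", ["department"])
```

Combined with a deterministic id, running this statement twice for the same chunk leaves exactly one row.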
Running PostgreSQL with pgvector
The standard PostgreSQL Docker image does not include pgvector. Use the pgvector image instead.
Verifying
Advantages Over Dedicated Vector Databases
- No separate server — uses your existing PostgreSQL infrastructure
- Standard SQL — combine vector search with traditional WHERE clauses, JOINs, aggregations
- ACID transactions — full transactional guarantees on vector data
- Familiar tooling — use psql, pgAdmin, any PostgreSQL client
- No new dependencies — uses the existing PostgreSQL JDBC driver
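As an illustration of the "Standard SQL" point, the sketch below builds a query that filters on a metadata column and ranks the remaining rows by cosine distance using pgvector's `<=>` operator. The helper names, the `%s` placeholder style (psycopg), and the example table and column names are assumptions. Table and column names are interpolated directly, so they must come from trusted configuration, not user input.

```python
def to_vector_literal(values) -> str:
    """Format numbers in pgvector's text form, e.g. [0.1,0.2,1.0]."""
    return "[" + ",".join(str(float(v)) for v in values) + "]"


def build_search_query(table: str, query_embedding, filters: dict, limit: int = 5):
    """Return (sql, params) for a filtered nearest-neighbour search."""
    sql = (
        "SELECT filename, chunk_index, text, "
        "embedding <=> %s::vector AS distance "  # <=> is cosine distance
        f"FROM {table}"
    )
    if filters:
        # Ordinary equality filters combined with the vector ranking.
        sql += " WHERE " + " AND ".join(f"{col} = %s" for col in filters)
    sql += " ORDER BY distance LIMIT %s"
    # Parameter order matches placeholder order: vector, filters, limit.
    params = [to_vector_literal(query_embedding), *filters.values(), limit]
    return sql, params


search_sql, search_params = build_search_query(
    "public.document_chunks",        # hypothetical table
    [0.1, 0.2, 1],
    {"department": "legal"},         # hypothetical metadata column
    limit=3,
)
```

The resulting statement can be passed to any PostgreSQL client along with the parameter list; smaller distance values indicate closer matches.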