The pipeline server is configured via application.yaml (or application.properties). This page documents all available properties.
Changing Settings After Install
Datris settings fall into two buckets, and which bucket a setting is in determines how you change it after the first install. Editing the wrong place — or restarting the wrong way — is the most common source of “I changed it and nothing happened.”
Bucket 1 — Configuration tab (Vault-backed)
AI provider / model / API keys, the embedding slot, and all connection secrets (Postgres, MongoDB, MinIO, vector stores, API keys) are stored in Vault and edited in the Configuration tab.
.env seeds these only on the very first boot. After that, Vault persists on disk and the Configuration tab is the source of truth — editing the corresponding .env value later has no effect. Saves in the tab go straight to Vault and take effect right away (AI config hot-reloads in-process; other secrets are picked up on next use), with no restart needed.
To deliberately throw away the persisted config and re-seed from .env, wipe the Vault volume — see How Configuration Persists.
Bucket 2 — Container environment variables (.env, read every boot)
The deployment-level settings in the Container Environment Variables table — JAVA_OPTS, TAP_MAX_OUTPUT_MB, USE_USER_AUTH, USE_API_KEYS, CAPABILITY_ENFORCEMENT, the tap-runner vars, etc. — are not Vault-seeded. They’re plain container env vars read fresh on every boot, so changing them in .env after install works normally:
# Edit the value in .env, then recreate the container:
docker compose up -d datris
Running up -d recreates the datris container with the new environment. (If Compose doesn’t detect the change, force it with docker compose up -d --force-recreate datris.)
Do not use docker compose restart to apply an .env change. restart reuses the existing container with its old environment, so your edit is silently ignored. Always use up -d (which recreates the container) or a down / up.
Enabling USE_USER_AUTH is a clean toggle. A default admin user is pre-seeded on every startup, so after you recreate the container your first login (username admin, blank password) simply prompts you to set a password — there’s no separate provisioning step and no lockout risk. See User Authentication for the full walkthrough.
Full Reference
Spring Boot
| Property | Default | Description |
|---|---|---|
| spring.servlet.multipart.max-file-size | 1GB | Maximum upload file size |
| spring.servlet.multipart.max-request-size | 1GB | Maximum request size |
| spring.server.tomcat.connection-timeout | 600000 | Tomcat connection timeout (ms) |
Logging
| Property | Default | Description |
|---|---|---|
| logging.level.root | INFO | Root log level |
| logging.level.org.springframework.web | INFO | Spring web log level |
| logging.level.ai.datris | INFO | Pipeline log level |
Scheduling
| Property | Default | Description |
|---|---|---|
| schedule.checkFileNotifierQueue | 5000 | Polling interval for file notification queue (ms) |
| schedule.findJobsToStart | 5000 | Interval to check for queued jobs (ms) |
| schedule.checkDatabaseSourceQueries | 30000 | Interval to check for database pulls (ms) |
| schedule.checkTapSchedules | 30000 | Interval to check for taps with CRON schedules due to run (ms) |
Pipeline
| Property | Default | Description |
|---|---|---|
| environment | oss | Environment name. Used as the prefix for bucket names (oss-raw, oss-data, etc.) and table names |
| useApiKeys | false | Enable API key authentication. See Authentication and the API Keys guide |
| multiTenant | false | Enable per-request tenant resolution. When true, the postgres database is overridden per-request with the tenant name |
| sendPipelineNotifications | true | Enable pipeline event notifications |
| ttlFileNotifierQueueMessages | 60 | Days to retain processed message IDs for deduplication |
| tapScriptTimeoutSeconds | 300 | Maximum tap script execution time in seconds |
| tapMaxOutputMB | 100 | Hard cap on a tap script’s stdout, in megabytes. Runs that exceed the cap fail fast with an actionable error (the agent is told to chunk the source range smaller via run_tap params) before the JSON is parsed, so the whole batch never buffers in JVM heap and OOM-kills the server. Override per deployment with the TAP_MAX_OUTPUT_MB env var. |
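
The fail-fast behaviour described for tapMaxOutputMB can be sketched as follows. This is a minimal Python illustration, not the server's code: read_capped_json and TapOutputTooLarge are hypothetical names, and treating one megabyte as 1024 * 1024 bytes is an assumption. The point is only that the size check happens while reading, before any JSON parsing.

```python
import io
import json


class TapOutputTooLarge(Exception):
    """Hypothetical error raised when tap stdout exceeds the cap."""


def read_capped_json(stream, max_output_mb=100, chunk_size=64 * 1024):
    """Read a binary stream in chunks; abort as soon as the cap is crossed."""
    cap_bytes = int(max_output_mb * 1024 * 1024)  # assumption: 1 MB = 1024 * 1024 bytes
    buffered = bytearray()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffered.extend(chunk)
        if len(buffered) > cap_bytes:
            # Fail before json.loads ever sees the payload.
            raise TapOutputTooLarge(
                f"tap output exceeded {max_output_mb} MB; "
                "re-run with a smaller source range via run_tap params"
            )
    return json.loads(buffered.decode("utf-8"))


# Small output parses normally.
rows = read_capped_json(io.BytesIO(b'[{"id": 1}]'), max_output_mb=1)
```

Because the check runs per chunk, memory use stays bounded by roughly the cap plus one chunk, which is the property the setting exists to protect.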
Authentication
Two independent flags control access. Set them together — see the supported matrix — and use User Authentication and API Keys for the full setup walkthroughs.
| Env var | Default | Description |
|---|---|---|
| USE_USER_AUTH | false | Require login for the UI; gate Configuration and Users tabs to admin. See User Authentication |
| USE_API_KEYS | false | Require x-api-key on every REST and MCP request. Maps to useApiKeys above |
| MULTI_TENANT | false | Per-tenant database isolation; routes requests by API-key value. Hosted/managed deployments only |
| CAPABILITY_ENFORCEMENT | enforce | With the default enforce, scoped keys missing the required capability get HTTP 403. Set to log-only to emit would-deny log lines without rejecting; this is useful only when iterating on a new scope policy and you want a trial run before flipping enforcement on. Legacy *:* keys are unaffected either way |
With USE_USER_AUTH=true + USE_API_KEYS=true (recommended), the UI authenticates via session cookie — no x-api-key value to seed or paste. Issue programmatic-client keys from Configuration → API-Keys.
For USE_USER_AUTH=false + USE_API_KEYS=true (legacy paste-and-go), the seeded fallback value default-ui-key is what the user pastes into the Connect prompt on first load. Rotate it any time from Configuration → API-Keys.
CORS
Cross-Origin Resource Sharing controls which browser origins can call the Datris API directly. The default * allows any origin and is appropriate for local development. In production, lock this down to your real frontend origin(s).
| Property | Default | Description |
|---|---|---|
| cors.allowedOrigins | * | Comma-separated list of allowed origins. Use * for any origin (development only), or specific URLs like https://app.example.com,https://admin.example.com. Applied globally to all /api/** endpoints. |
In the deploy config, this reads from the CORS_ALLOWED_ORIGINS environment variable so you can change it without rebuilding the image:
cors:
  allowedOrigins: ${CORS_ALLOWED_ORIGINS:*}
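To make the placeholder semantics concrete, here is a rough Python sketch of how a value like ${CORS_ALLOWED_ORIGINS:*} resolves: the environment variable wins when it is defined, otherwise the default after the colon applies. The helper names are hypothetical, and trimming whitespace around each origin is an assumption rather than documented behaviour.

```python
import os


def resolve_allowed_origins(env=None):
    """Mimic ${CORS_ALLOWED_ORIGINS:*}: use the env var if defined, else "*"."""
    env = os.environ if env is None else env
    raw = env.get("CORS_ALLOWED_ORIGINS", "*")
    # Assumption: entries are comma-separated and surrounding spaces are ignored.
    return [origin.strip() for origin in raw.split(",") if origin.strip()]


def is_origin_allowed(origin, allowed):
    """A wildcard entry admits any origin; otherwise require an exact match."""
    return "*" in allowed or origin in allowed


dev = resolve_allowed_origins({})
prod = resolve_allowed_origins(
    {"CORS_ALLOWED_ORIGINS": "https://app.example.com, https://admin.example.com"}
)
```

With the production value, is_origin_allowed("https://evil.example.net", prod) is False, while the development default admits every origin.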
Date / Timezone
All display timestamps across the platform (pipeline status, tap run history, etc.) are formatted using these settings.
| Property | Default | Description |
|---|---|---|
| dateFormat | yyyy-MM-dd HH:mm:ss z | Java SimpleDateFormat pattern. Use z to print the timezone abbreviation (e.g., UTC, EDT, EST) |
| dateTimezone | America/New_York | IANA timezone ID (e.g., UTC, America/New_York, Europe/London). Daylight saving is handled automatically by the zone's rules; when the format includes z, the printed abbreviation switches with it (e.g., EDT vs. EST) |
Example — Eastern time with auto-DST:
dateFormat: "yyyy-MM-dd HH:mm:ss z"
dateTimezone: "America/New_York"
# Displays: 2026-04-05 14:30:00 EDT (daylight time) or 2026-11-05 14:30:00 EST (standard time)
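For readers who want to check the DST behaviour outside the JVM, the sketch below reproduces the same two timestamps with Python's zoneinfo. It is an approximation of the Java pattern: %Z plays the role of SimpleDateFormat's z, and format_display is a hypothetical helper. It assumes the host has the IANA timezone database installed.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def format_display(ts, tz_name="America/New_York"):
    """Convert an aware timestamp to the display zone and format it."""
    local = ts.astimezone(ZoneInfo(tz_name))
    # %Z prints the zone abbreviation, similar to "z" in the Java pattern.
    return local.strftime("%Y-%m-%d %H:%M:%S %Z")


april = datetime(2026, 4, 5, 18, 30, tzinfo=timezone.utc)
november = datetime(2026, 11, 5, 19, 30, tzinfo=timezone.utc)

print(format_display(april))     # 2026-04-05 14:30:00 EDT
print(format_display(november))  # 2026-11-05 14:30:00 EST
```

The UTC offset changes from four hours to five between the two dates, and the abbreviation follows without any change to the configured format.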
MinIO (Object Store)
| Property | Description |
|---|---|
| minio.server | MinIO endpoint URL (e.g., http://localhost:9000) |
MinIO credentials are stored in Vault under the secret specified by secrets.minIOSecretName:
{
  "accessKey": "minioadmin",
  "secretKey": "minioadmin"
}
AWS S3 (per-pipeline credentials)
S3 destinations don’t use a global secret. Each pipeline that writes to S3 references a Platform-tab secret by name via the objectStore config’s credentialsSecret field. Create the secret in the UI under Configuration → Secrets → Platform with the fields below:
{
  "accessKey": "AKIA…",
  "secretKey": "…",
  "region": "us-east-1",
  "sessionToken": "…"
}
Required: accessKey, secretKey, region. Optional: sessionToken (for temporary STS credentials). Field-name lookups are case-insensitive and accept the AWS_ACCESS_KEY / AWS_SECRET_KEY / AWS_REGION style as well.
Region lives in the credentials secret rather than on the pipeline config so the credential and its scope travel together — one place to rotate, no silent mismatches. See S3 Destination for the full pipeline-config shape.
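The field rules for the S3 secret can be illustrated with a small validator. This is only a sketch of the documented behaviour (case-insensitive names, AWS_* aliases for the three required fields, optional sessionToken); normalize_s3_secret and the alias table are hypothetical, and no alias for sessionToken is assumed because the text does not mention one.

```python
# Lower-cased field name -> canonical name used in the secret.
ALIASES = {
    "accesskey": "accessKey",
    "aws_access_key": "accessKey",
    "secretkey": "secretKey",
    "aws_secret_key": "secretKey",
    "region": "region",
    "aws_region": "region",
    "sessiontoken": "sessionToken",
}
REQUIRED = ("accessKey", "secretKey", "region")


def normalize_s3_secret(secret):
    """Map accepted spellings to canonical names and check required fields."""
    normalized = {}
    for key, value in secret.items():
        canonical = ALIASES.get(key.lower())
        if canonical is not None:
            normalized[canonical] = value
    missing = [name for name in REQUIRED if not normalized.get(name)]
    if missing:
        raise ValueError("S3 secret is missing required field(s): " + ", ".join(missing))
    return normalized


creds = normalize_s3_secret(
    {"AWS_ACCESS_KEY": "example-access", "aws_secret_key": "example-secret", "Region": "us-east-1"}
)
```

Keeping region in the same validated object mirrors the rationale above: a secret without a region is rejected as a whole instead of being paired with a region taken from somewhere else.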
Secrets (HashiCorp Vault)
| Property | Description |
|---|---|
| secrets.apiKeysSecretName | Vault path for API keys |
| secrets.postgresSecretName | Vault path for PostgreSQL credentials |
| secrets.minIOSecretName | Vault path for MinIO credentials |
| secrets.activeMQSecretName | Vault path for ActiveMQ credentials |
| secrets.mongoDbSecretName | Vault path for MongoDB credentials |
| secrets.kafkaProducerSecretName | Vault path for Kafka producer credentials |
| secrets.qdrantSecretName | Vault path for Qdrant connection |
| secrets.weaviateSecretName | Vault path for Weaviate connection |
| secrets.milvusSecretName | Vault path for Milvus connection |
| secrets.chromaSecretName | Vault path for Chroma connection |
| secrets.pgvectorSecretName | Vault path for pgvector PostgreSQL connection |
Vault connection is configured via environment variables:
VAULT_ADDR - Vault server URL (e.g., http://vault:8200)
VAULT_TOKEN - Authentication token
ActiveMQ (Queue & Notifications)
| Property | Description |
|---|---|
| activemq.server | ActiveMQ broker URL (e.g., tcp://localhost:61616) |
ActiveMQ credentials are stored in Vault under secrets.activeMQSecretName:
{
  "username": "admin",
  "password": "admin"
}
MongoDB (NoSQL Store)
| Property | Description |
|---|---|
| mongodb.connectionString | MongoDB connection URI (e.g., mongodb://localhost:27017) |
| mongodb.database | User-facing database name (default: datris). Pipelines write here, the UI Data tab reads here (semantic-search panel), and tap scripts get this as DATRIS_MONGODB_DATABASE. In multi-tenant mode the tenant environment name is used instead. |
| mongodb.internalDatabase | Platform-internal database name (default: oss). Holds pipeline/tap configs, run status, and job queues; it is never surfaced in the UI. Keep this distinct from mongodb.database so user data and platform state don’t mix. |
PostgreSQL
| Property | Default | Description |
|---|---|---|
| postgres.database | datris | Default database name used by /api/v1/query/postgres and /api/v1/metadata/postgres/* when no database parameter is supplied. Also injected into tap scripts as the DATRIS_POSTGRES_DATABASE environment variable. In multi-tenant mode this value is automatically overridden per-request with the tenant name. |
PostgreSQL connection details (username, password, jdbcUrl) are stored in Vault under secrets.postgresSecretName — see the Vault Secret Formats section below.
Kafka Consumer (Optional)
| Property | Default | Description |
|---|---|---|
| kafkaConsumer.enabled | false | Enable Kafka topic consumption |
| kafkaConsumer.bootstrapServers | | Kafka broker address |
| kafkaConsumer.groupId | | Consumer group ID |
| kafkaConsumer.topicPollingInterval | 500 | Topic polling interval (ms) |
| kafkaConsumer.topicPrefix | | Prefix for topic names |
AI (Required)
AI configuration is split into three independent slots, each pointing at its own self-describing Vault secret. The resolver reads whatever it finds in the secret — provider, endpoint, model, apiKey, and (optionally) version — so the YAML side never needs a provider field.
| Property | Default | Description |
|---|---|---|
| ai.enabled | true | Enable AI features (required for the platform to start) |
| ai.aiPrimary.secretName | oss/ai-primary | Vault secret for the main AI model used for general reasoning (NL→SQL, search answers, etc.). |
| ai.codegen.secretName | oss/codegen | Vault secret for the code-generation model (tap scripts, AI DQ, AI transformations, schema generation). Seeded with the strongest available model: Anthropic gets claude-opus-4-8, OpenAI gets gpt-5.5. |
| ai.embedding.secretName | oss/embedding | Vault secret for the embedding model used by vector destinations (Chroma, Qdrant, Milvus, Weaviate, pgvector) and search. For Anthropic-only deployments this is seeded to point at the bundled TEI sidecar serving BAAI/bge-m3 (1024-dim), so vector destinations work out of the box without an OpenAI key. |
Each Vault secret is self-describing and looks like:
vault kv put secret/oss/ai-primary \
  provider="anthropic" \
  endpoint="https://api.anthropic.com/v1/messages" \
  model="claude-sonnet-4-6" \
  apiKey="sk-ant-..." \
  version="2023-06-01"
docker/vault-init.sh seeds all three secrets automatically based on which key is present in .env (ANTHROPIC_API_KEY or OPENAI_API_KEY). For multi-tenant deployments, per-tenant override secrets live at {env}/ai-primary, {env}/codegen, {env}/embedding.
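One way to picture the per-tenant override paths is a two-step lookup. The sketch below is an illustration under stated assumptions, not the resolver's actual logic: resolve_ai_secret and read_secret are hypothetical, and falling back to the oss/ secret when a tenant has no override is an assumption inferred from the word "override".

```python
def resolve_ai_secret(slot, tenant_env, read_secret, default_env="oss"):
    """Return (path, secret) for a slot, preferring the tenant's own secret.

    read_secret is any callable that returns the secret dict for a path,
    or None when nothing is stored there (hypothetical interface).
    """
    for path in (f"{tenant_env}/{slot}", f"{default_env}/{slot}"):
        secret = read_secret(path)
        if secret is not None:
            return path, secret
    raise LookupError(f"no AI secret found for slot {slot!r}")


# In-memory stand-in for Vault, for illustration only.
vault = {
    "oss/ai-primary": {"provider": "anthropic", "model": "claude-sonnet-4-6"},
    "acme/codegen": {"provider": "openai", "model": "gpt-5.5"},
}

override_path, _ = resolve_ai_secret("codegen", "acme", vault.get)
fallback_path, _ = resolve_ai_secret("ai-primary", "acme", vault.get)
```

Because each secret is self-describing, the caller only needs the slot name; provider and model come from whichever secret the lookup returns.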
All AI calls (callAI, callAIWithSystem, callAIWithMessages) share a unified retry helper that automatically retries on transient 429 (rate limited), 503 (service unavailable), and 529 (overloaded) responses with linear backoff (5s, 10s, 15s, 20s, 25s) for up to 5 attempts. This applies uniformly across all configured providers.
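The retry policy can be sketched in a few lines. This is a hedged Python illustration, not the shared helper itself: call_with_retry and the send callable are hypothetical, and the sketch treats the five listed delays as five retries after the initial call, since the text does not say whether the first call counts toward the five attempts.

```python
import time

RETRYABLE_STATUSES = {429, 503, 529}   # rate limited, unavailable, overloaded
BACKOFF_SECONDS = (5, 10, 15, 20, 25)  # linear backoff schedule from the text


def call_with_retry(send, sleep=time.sleep):
    """Call send() and retry transient failures with linear backoff.

    send is a hypothetical callable returning (status_code, body).
    """
    status, body = send()
    for delay in BACKOFF_SECONDS:
        if status not in RETRYABLE_STATUSES:
            break
        sleep(delay)  # wait 5s, then 10s, and so on
        status, body = send()
    return status, body
```

Injecting sleep keeps the helper testable without real waiting; non-retryable statuses such as 400 or 401 are returned immediately.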
See AI Configuration for full setup details.
Vault Secret Formats
PostgreSQL
{
  "username": "postgres",
  "password": "password",
  "jdbcUrl": "jdbc:postgresql://localhost:5432"
}
MySQL
{
  "username": "root",
  "password": "password",
  "jdbcUrl": "jdbc:mysql://localhost:3306"
}
Kafka Producer
{
  "bootstrapServers": "kafka:9092",
  "username": null,
  "password": null
}
MongoDB
{
  "connectionString": "mongodb://localhost:27017"
}
Embedding
{
  "endpoint": "https://api.openai.com/v1/embeddings",
  "model": "text-embedding-3-small",
  "apiKey": "sk-...",
  "batchSize": "32",
  "maxTokens": "8192",
  "tokenizer": "openai",
  "tokensPerCharRatio": "2.0",
  "oversize": "split",
  "retryIndividualOnFailure": "false"
}
endpoint, model, and apiKey are required; batchSize defaults to 32. The remaining fields tune the server-side token-count guard that prevents oversized chunks from failing an embedding batch:
| Field | Default | Description |
|---|---|---|
| maxTokens | model-default | Per-model input cap. Built-in defaults cover common OpenAI / Cohere / Voyage / BAAI / Nomic / E5 / Mistral models; unknown models get 6000. A 90% safety margin is applied automatically. |
| tokenizer | auto | Token-count strategy. Auto picks openai (exact, via jtokkit) when the model name matches an OpenAI family, otherwise heuristic. Set explicitly to override. |
| tokensPerCharRatio | 2.0 | Characters per token assumed by the heuristic counter (despite the field name, a higher value means fewer estimated tokens). Lower is more conservative (more splits); raise to ~3.5 for predictable Latin prose. Ignored when tokenizer=openai. |
| oversize | split | What to do with chunks over the cap: split (lossless: fan one input into N sub-chunks, on token boundaries with openai and character boundaries with heuristic), truncate (lossy: drop the tail), or fail (raise an error naming the chunk index). |
| retryIndividualOnFailure | false | When the embedding API rejects a batch with a 4xx, retry each chunk one-at-a-time so a single poison chunk doesn’t lose the whole batch. Costs N HTTP calls in the failure path. |
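
The retryIndividualOnFailure behaviour can be sketched as a batch call with a per-chunk fallback. The names below (embed_with_fallback, EmbeddingClientError, embed_batch) are hypothetical stand-ins, and returning the indexes of rejected chunks is an illustrative choice, not documented output.

```python
class EmbeddingClientError(Exception):
    """Stand-in for a 4xx rejection from the embedding API."""


def embed_with_fallback(chunks, embed_batch, retry_individual=False):
    """Embed a batch; optionally retry chunk by chunk after a 4xx.

    embed_batch is a hypothetical callable: list of texts -> list of vectors.
    Returns (vectors, failed_indexes).
    """
    try:
        return embed_batch(chunks), []
    except EmbeddingClientError:
        if not retry_individual:
            raise  # default: the whole batch fails
    vectors, failed = [], []
    for index, chunk in enumerate(chunks):
        try:
            vectors.extend(embed_batch([chunk]))  # one HTTP call per chunk
        except EmbeddingClientError:
            failed.append(index)  # only the poison chunk is lost
    return vectors, failed
```

The fallback path costs one call per chunk, which is why the option is off by default and only pays off when batches occasionally contain a single bad input.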
The guard always runs. When maxChunkTokens is set on the destination’s chunking config (see Pipeline configuration), the chunker stops merging segments before they cross the token cap, leaving the guard as a true safety net rather than the primary defender.
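A rough sketch of the guard's heuristic path, for intuition only. guard_chunks is a hypothetical function; it assumes the 90% margin means the effective cap is 90% of maxTokens, that the heuristic estimate is character count divided by characters per token, and that split cuts on plain character boundaries. The exact openai tokenizer path is not modelled.

```python
import math


def guard_chunks(chunks, max_tokens=6000, chars_per_token=2.0, oversize="split"):
    """Apply the token cap to each chunk using the heuristic counter."""
    cap_tokens = max_tokens * 9 // 10             # assumption: 90% of maxTokens
    cap_chars = int(cap_tokens * chars_per_token)  # largest chunk that still fits
    guarded = []
    for index, text in enumerate(chunks):
        estimated = math.ceil(len(text) / chars_per_token)
        if estimated <= cap_tokens:
            guarded.append(text)
        elif oversize == "split":
            # Lossless: fan one input into N sub-chunks on character boundaries.
            guarded.extend(text[i:i + cap_chars] for i in range(0, len(text), cap_chars))
        elif oversize == "truncate":
            guarded.append(text[:cap_chars])       # lossy: drop the tail
        elif oversize == "fail":
            raise ValueError(f"chunk {index} is ~{estimated} tokens, over the cap of {cap_tokens}")
        else:
            raise ValueError(f"unknown oversize mode: {oversize}")
    return guarded
```

With the defaults for an unknown model (6000 tokens, ratio 2.0), the effective cap in this sketch is 5400 estimated tokens, or 10,800 characters per chunk.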
MinIO Buckets
The following buckets are created automatically by the minio-init container:
| Bucket | Purpose |
|---|---|
| {environment}-raw | File upload staging |
| {environment}-raw-plus | Processed file staging |
| {environment}-temp | Temporary processing files |
| {environment}-data | Object store destination output |
| {environment}-config | Configuration files (validation schemas) |
MongoDB Collections
| Collection | Purpose |
|---|---|
| {environment}-pipeline | Pipeline configurations |
| {environment}-pipeline-status | Job processing status |
| {environment}-archived-metadata | File ingestion metadata |
| {environment}-file-notifier-message | Processed message deduplication |
| {environment}-data-pull | Database pull scheduling state |
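
Both the bucket and collection tables follow the same {environment}- prefix rule, which a short sketch makes explicit. The function names are hypothetical; the suffix lists are copied from the two tables.

```python
BUCKET_SUFFIXES = ("raw", "raw-plus", "temp", "data", "config")
COLLECTION_SUFFIXES = (
    "pipeline",
    "pipeline-status",
    "archived-metadata",
    "file-notifier-message",
    "data-pull",
)


def bucket_names(environment="oss"):
    """MinIO bucket names for a given environment value."""
    return [f"{environment}-{suffix}" for suffix in BUCKET_SUFFIXES]


def collection_names(environment="oss"):
    """MongoDB collection names for a given environment value."""
    return [f"{environment}-{suffix}" for suffix in COLLECTION_SUFFIXES]


print(bucket_names())  # ['oss-raw', 'oss-raw-plus', 'oss-temp', 'oss-data', 'oss-config']
```

Changing the environment property therefore changes every name at once, so an existing deployment would look for a fresh set of buckets and collections.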
{environment}-data-pull | Database pull scheduling state |
Container Environment Variables
These environment variables are read by docker-compose.yml and passed into the datris container. They tune deployment-level concerns rather than business logic. Override in your .env file.
| Variable | Default | Description |
|---|---|---|
| JAVA_OPTS | -Xms512m -Xmx2g | JVM heap sizing for the datris-server process. The default fits an 8 GB host alongside the other bundled services. Bump on larger hosts (e.g. -Xms1g -Xmx8g for 24 GB). See Installation → JVM Heap Sizing for the full sizing table and Docker Desktop ceiling notes. |
| TAP_MAX_OUTPUT_MB | 100 | Override the tapMaxOutputMB cap. Applies to all tap scripts; affects when the platform fails fast on oversized output. Raise on hosts with larger heaps when single-shot backfills genuinely need more buffer; lower on memory-constrained deployments to fail earlier. |
| USE_USER_AUTH | false | Enable username/password login + admin gating. See User Authentication. |
| USE_API_KEYS | false | Enable per-tenant API-key gating on the REST/MCP API. See API Keys. |
| MULTI_TENANT | false | Switch to per-tenant database isolation (intended for hosted/managed deployments). |
| CAPABILITY_ENFORCEMENT | enforce | enforce returns HTTP 403 on policy violations by scoped keys. Set to log-only to emit would-deny log lines without rejecting, which is useful when iterating on a new scope policy. |
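
The two CAPABILITY_ENFORCEMENT modes can be summarised in a small decision function. This is a sketch of the documented outcomes only: authorize is a hypothetical name, the capability strings such as pipeline:read are invented for illustration, and the log message format is an assumption.

```python
import logging

log = logging.getLogger("capability")


def authorize(key_scopes, required, mode="enforce"):
    """Return the HTTP status a request would get: 200 (allowed) or 403."""
    if "*:*" in key_scopes or required in key_scopes:
        return 200  # legacy *:* keys pass in either mode
    if mode == "log-only":
        # Trial run: record what enforcement would have rejected.
        log.warning("would-deny: key lacks capability %s", required)
        return 200
    return 403  # enforce: scoped key is missing the capability


status = authorize({"pipeline:read"}, "pipeline:write")  # 403 under enforce
```

Running with log-only first and reviewing the would-deny lines shows which clients a new scope policy would break before enforcement is switched on.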