datris.ai
Ingest, validate, transform, store, and retrieve your data — whether you’re an AI agent talking through MCP or a developer writing config. One platform for both. Deploy on any cloud provider, on-premise, or locally — with no vendor lock-in. Define your entire data pipeline through simple JSON configuration, or extend it with AI instructions and preprocessors at every stage of the flow. Built entirely on open-source infrastructure, it runs anywhere Docker does.

Documentation Index
Fetch the complete documentation index at: https://docs.datris.ai/llms.txt
Use this file to discover all available pages before exploring further.
Agent-Ready: Built-In MCP Server
Your AI agents are first-class pipeline operators. Datris ships with a native MCP (Model Context Protocol) server — the first open-source data platform natively accessible to AI agents. Claude, Cursor, OpenClaw, and any MCP-compatible agent can register pipelines, upload files, trigger processing, monitor job status, profile data, run semantic searches across vector databases, and query PostgreSQL and MongoDB — all through natural conversation. Supports stdio and SSE transports.

AI-Powered Features
Intelligence at every stage — from ingestion to delivery, Datris makes data engineering accessible through natural language.
- MCP server (AI agent integration) - Built-in MCP server lets AI agents (Claude, Cursor, OpenClaw, custom frameworks) natively interact with the pipeline — register pipelines, upload files, trigger jobs, profile data, run semantic searches, and query databases. Supports stdio and SSE transports
- AI-powered data quality - Validate with plain-English rules via aiRule. Examples:
- “Validate that all email addresses are properly formatted and all phone numbers contain 7–15 digits”
- “Ensure price is positive, quantity is a whole number, and discount never exceeds price”
- “Check that start_date is before end_date and both are in YYYY-MM-DD format”
- “Verify that country codes are valid ISO 3166-1 alpha-2 codes”
- “Flag any row where revenue minus cost does not equal profit within a 0.01 tolerance”
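
A sketch of how such a rule might sit in a pipeline configuration. Only the aiRule key is taken from this page; the surrounding field names ("name", "dataQuality") are illustrative assumptions — the authoritative schema is in the Pipeline Configuration reference:

```json
{
  "name": "orders-pipeline",
  "dataQuality": {
    "aiRule": "Ensure price is positive, quantity is a whole number, and discount never exceeds price"
  }
}
```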
- AI transformations - Describe transformations in natural language. Examples:
- “Convert all dates to YYYY-MM-DD format and normalize phone numbers to E.164”
- “Categorize transactions as ‘small’ under 100, ‘medium’ 100–1000, and ‘large’ over 1000”
- “Extract city and state from the address column into separate columns”
- “Standardize company names — remove Inc, LLC, Corp suffixes and trim whitespace”
- “Convert all currency amounts from EUR to USD using a rate of 1.08”
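
A natural-language transformation might be attached to a pipeline in the same way. Both key names here ("transformation", "aiInstruction") are hypothetical placeholders; see the Pipeline Configuration and AI Transformation references for the real schema:

```json
{
  "name": "contacts-pipeline",
  "transformation": {
    "aiInstruction": "Convert all dates to YYYY-MM-DD format and normalize phone numbers to E.164"
  }
}
```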
- Discovery - One wizard, six steps: chat about a source (“yfinance daily prices for the S&P 500”), pick the datasets you want, and Datris generates the tap scripts, builds the pipelines, schedules them, and runs them — all grouped into a Data Catalog for organization. The fastest way to onboard a new external data source
- Taps - AI-generated Python scripts that fetch data from external sources (APIs, web scraping, databases) and push it into pipelines. Describe what data you want in plain English, and Datris generates the script. Includes AI diagnosis when scripts fail, CRON scheduling, and credentials management via Vault secrets
- Datris CLI - Command-line interface for ingesting data, running queries, and managing pipelines:
  datris ingest data.csv --ai-validate "prices > 0" --ai-transform "convert dates to YYYY/MM/DD"
- AI schema generation - Upload any data file and receive a complete, ready-to-register pipeline configuration — field names and types inferred automatically
- AI data profiling - Upload a data file and get summary statistics, quality issues, and suggested validation rules — all powered by AI analysis
- AI error explanation - When jobs fail, AI analyzes the error chain and explains the root cause in plain English. No more digging through stack traces
- AI providers - Anthropic Claude (choose any model), OpenAI (choose any model, plus embedding models), or local models via Ollama (Llama, Mistral, Phi). No vendor lock-in — switch providers without changing your pipeline config
RAG Pipeline
Full RAG pipeline built in. Extract, chunk, embed, and upsert documents into any major vector database — build retrieval-augmented generation workflows without leaving your pipeline.
- 5 vector databases - Qdrant, Weaviate, Milvus, Chroma, pgvector (PostgreSQL)
- Chunking strategies - Fixed-size, sentence, paragraph, recursive
- Embedding providers - OpenAI or Ollama (local models)
- Document extraction - PDF, Word, PowerPoint, Excel, HTML, email, EPUB, plain text
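
As a sketch, a vector-store destination could combine these pieces like so. The strategy and provider values come from the lists above, but every key name here is an illustrative assumption — consult the Qdrant Destination reference for the actual configuration:

```json
{
  "destination": {
    "type": "qdrant",
    "collection": "product-docs",
    "chunking": { "strategy": "recursive", "chunkSize": 512, "overlap": 64 },
    "embedding": { "provider": "ollama", "model": "nomic-embed-text" }
  }
}
```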
Key Features
- Configuration-driven - Define pipelines entirely via MCP, the Datris UI, or directly with JSON. Extend them with AI instructions for data quality and transformations, or plug in your own preprocessor via a REST endpoint
- Multiple ingestion methods - MCP, data upload API, MinIO bucket events, database polling, Kafka streaming
- Data quality - AI rules (LLM-generated), and/or JSON/XML schema validation
- Transformations - AI transformations (LLM-generated), plus destination schema (drop/rename/retype columns) for structured data
- Multiple destinations - Write to PostgreSQL, MongoDB, Kafka, ActiveMQ, REST endpoints, Qdrant, Weaviate, Milvus, Chroma, pgvector, or MinIO (Parquet/ORC) in parallel
- Event notifications - Subscribe to pipeline processing events via ActiveMQ topics
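
A custom preprocessor is just a REST endpoint the pipeline calls during processing. The sketch below assumes the pipeline POSTs a JSON array of records and expects a transformed JSON array back — that request/response shape is an illustrative assumption, and the exact contract is in the Preprocessor reference:

```python
# Minimal external preprocessor sketch (stdlib only). Assumes the pipeline
# POSTs a JSON array of records and expects a JSON array in response; check
# the Preprocessor reference for the contract Datris actually uses.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def preprocess(records):
    # Example cleanup: trim string fields and upper-case country codes.
    cleaned = []
    for rec in records:
        rec = {k: v.strip() if isinstance(v, str) else v for k, v in rec.items()}
        if isinstance(rec.get("country"), str):
            rec["country"] = rec["country"].upper()
        cleaned.append(rec)
    return cleaned

class PreprocessorHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        records = json.loads(self.rfile.read(length))
        body = json.dumps(preprocess(records)).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

def serve(port=8080):
    # Blocking call: run this, then point the pipeline's preprocessor URL
    # at http://<host>:8080/
    HTTPServer(("0.0.0.0", port), PreprocessorHandler).serve_forever()
```

Point the pipeline's preprocessor URL at wherever this service is reachable; the transformation logic in preprocess() is the only part you would change.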
Architecture
Push and pull — one platform, two interfaces. AI agents and humans ingest data through the pipeline, store it across databases and vector stores, and retrieve it back — via MCP or API. Self-hosted on proven open-source infrastructure — no proprietary services, no vendor lock-in, no surprise bills:

| Service | Purpose |
|---|---|
| MinIO | S3-compatible object store for file staging and data output |
| MongoDB | Configuration store, job status tracking, metadata |
| ActiveMQ | File notification queue, pipeline event notifications |
| HashiCorp Vault | Secrets management (database credentials, API keys) |
| Apache Kafka | Optional streaming source and destination |
| Apache Spark | Local Spark for writing Parquet/ORC to MinIO |
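
Since everything runs anywhere Docker does, the stack amounts to a handful of Compose services. This excerpt is illustrative only — the image choices are assumptions, and the actual docker-compose.yml ships with Datris (see Installation):

```yaml
# Illustrative excerpt; the real Compose file ships with Datris.
services:
  minio:
    image: minio/minio
  mongodb:
    image: mongo
  activemq:
    image: apache/activemq-artemis
  vault:
    image: hashicorp/vault
```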
Processing Flow
Retrieval Flow
Supported Data Formats
| Format | Input | Output |
|---|---|---|
| CSV | Configurable delimiter, header, encoding | Parquet, ORC, database, Kafka, ActiveMQ |
| JSON | Single object or NDJSON (one per line) | MongoDB, Kafka, REST |
| XML | Single document or one per line | Database, Kafka, REST |
| Excel (XLS) | Worksheet selection, auto-CSV conversion | Same as CSV |
| Unstructured | PDF, Word, PowerPoint, Excel, HTML, email, EPUB, text | Object store, Qdrant, Weaviate, pgvector |
| Archives | .zip, .tar, .gz, .jar | Extracted and processed individually |
Quick Links
- Installation - Get running with Docker Compose
- Quick Start - End-to-end walkthrough
- Discovery - Six-step wizard that goes from “I want this data” to running taps + pipelines
- Data Catalog - Organize related taps and pipelines into named groups
- Datris CLI - Command-line interface for ingesting, querying, and managing pipelines
- Pipeline Configuration - Full JSON configuration reference
- Preprocessor - External preprocessing via REST endpoints
- API Reference - REST API documentation
- AI Schema Generation - Generate pipeline configs from files using AI
- AI Configuration - Configure AI providers (Anthropic, OpenAI, Ollama)
- AI Data Quality - Natural language validation
- AI Transformation - Natural language transformation
- AI Data Profiling - Profile data files and get recommended rules
- AI Error Explanation - Automatic plain-English error analysis
- Qdrant Destination - Vector database for RAG with chunking, embeddings, and metadata
- Weaviate Destination - Vector database for RAG with chunking, embeddings, and metadata
- Milvus Destination - Scalable vector database for RAG
- Chroma Destination - Lightweight vector database for RAG — single container
- pgvector Destination - PostgreSQL vector database for RAG — no separate server required
- Query API - Query PostgreSQL and MongoDB via REST API
- Search API - Semantic search across vector databases via REST API
- MCP Server - AI agent integration via Model Context Protocol
- Example Applications - Vector store chat, Kafka loader, preprocessor, and more
