Agent-Ready: Built-In MCP Server
Your AI agents are first-class pipeline operators. Datris ships with a native MCP (Model Context Protocol) server — the first open-source data platform natively accessible to AI agents. Claude, Cursor, OpenClaw, and any MCP-compatible agent can register pipelines, upload files, trigger processing, monitor job status, profile data, run semantic searches across vector databases, and query PostgreSQL and MongoDB — all through natural conversation. Supports stdio and SSE transports.
AI-Powered Features
Intelligence at every stage — from ingestion to delivery, Datris makes data engineering accessible through natural language.
- MCP server (AI agent integration) - Built-in MCP server lets AI agents (Claude, Cursor, OpenClaw, custom frameworks) natively interact with the pipeline — register pipelines, upload files, trigger jobs, profile data, run semantic searches, and query databases. Supports stdio and SSE transports
- AI-powered data quality (CodeGen) - Validate with plain-English rules via aiRule. Datris generates a Python validation script from your instruction and runs it locally (~$0.003/rule). Works for CSV, JSON, and XML
- AI transformations (CodeGen) - Describe transformations in natural language — date format conversion, data categorization, phone number standardization, entity extraction. Datris generates a Python script and runs it locally
- Datris CLI - Command-line interface for ingesting data, running queries, and managing pipelines: `datris ingest data.csv --ai-validate "prices > 0" --ai-transform "convert dates to YYYY/MM/DD"`
- AI schema generation - Upload any CSV, JSON, or XML file and receive a complete, ready-to-register pipeline configuration — field names and types inferred automatically
- AI data profiling - Upload a file and get summary statistics, quality issues, and suggested validation rules — all powered by AI analysis
- AI error explanation - When jobs fail, AI analyzes the error chain and explains the root cause in plain English. No more digging through stack traces
- AI providers - Anthropic Claude (Opus 4.6, Sonnet 4.6, Haiku), OpenAI (GPT-5, GPT-4.1, o3, embedding models), or local models via Ollama (Llama, Mistral, Phi). No vendor lock-in — switch providers without changing your pipeline config
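To make the CodeGen idea concrete, here is a rough sketch of the kind of script an aiRule such as "prices > 0" could produce for a CSV input. The actual shape of Datris's generated code is not documented here; the column name `price` and the error format are assumptions for illustration only.

```python
# Hypothetical sketch of a CodeGen-style validation script for the
# plain-English rule "prices > 0". The real generated code's structure
# is Datris-internal; this only illustrates the concept.
import csv
import io

def validate_prices(csv_text: str) -> list[str]:
    """Return human-readable violations for rows where price <= 0."""
    errors = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        try:
            price = float(row["price"])
        except (KeyError, ValueError):
            errors.append(f"row {line_no}: missing or non-numeric price")
            continue
        if price <= 0:
            errors.append(f"row {line_no}: price {price} violates 'prices > 0'")
    return errors

sample = "sku,price\nA1,9.99\nA2,-3.00\nA3,0\n"
print(validate_prices(sample))
```

Because the rule compiles to local Python, it runs on every ingest without further LLM calls — only generating the script costs the ~$0.003 noted above.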
RAG Pipeline
Full RAG pipeline built in. Extract, chunk, embed, and upsert documents into any major vector database — build retrieval-augmented generation workflows without leaving your pipeline.
- 5 vector databases - Qdrant, Weaviate, Milvus, Chroma, pgvector (PostgreSQL)
- Chunking strategies - Fixed-size, sentence, paragraph, recursive
- Embedding providers - OpenAI or Ollama (local models)
- Document extraction - PDF, Word, PowerPoint, Excel, HTML, email, EPUB, plain text
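As a reference point for the chunking strategies above, a fixed-size chunker with character overlap can be sketched in a few lines. Datris's actual chunk-size and overlap parameters (and their defaults) are assumptions here, not its API.

```python
# Illustrative fixed-size chunking with overlap, as used before embedding.
# Parameter names and defaults are assumptions for the sketch.
def chunk_fixed(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into chunks of at most `size` chars, overlapping by `overlap`."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    chunks = []
    step = size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + size]
        if chunk:
            chunks.append(chunk)
    return chunks

doc = "word " * 100  # stand-in for extracted document text
pieces = chunk_fixed(doc, size=120, overlap=20)
print(len(pieces), max(len(p) for p in pieces))
```

The overlap preserves context across chunk boundaries so a retrieved chunk is less likely to cut a sentence off mid-thought; the sentence, paragraph, and recursive strategies trade that fixed geometry for boundaries that follow the text's own structure.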
Key Features
- Configuration-driven - Define pipelines entirely through JSON, or extend with AI instructions and preprocessors at every stage of the data flow
- Multiple ingestion methods - File upload API, MinIO bucket events, database polling, Kafka streaming
- Data quality - CodeGen AI rules (LLM-generated Python validation), JSON/XML schema validation
- Transformations - CodeGen AI transformations, destination schema (drop/rename/retype columns)
- Multiple destinations - Write to MinIO (Parquet/ORC), PostgreSQL, MongoDB, Kafka, ActiveMQ, REST endpoints, Qdrant, Weaviate, Milvus, Chroma, or pgvector in parallel
- Event notifications - Subscribe to pipeline processing events via ActiveMQ topics
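The destination-schema step mentioned above (drop/rename/retype columns) can be pictured as a simple row-by-row mapping. The key names `drop`, `rename`, and `retype` below are invented for this sketch — the real JSON keys are in the Pipeline Configuration reference.

```python
# Conceptual sketch of a destination-schema step: drop, rename, and
# retype columns before writing. Key names are assumptions, not the
# actual Datris configuration schema.
def apply_destination_schema(rows: list[dict], schema: dict) -> list[dict]:
    drop = set(schema.get("drop", []))
    rename = schema.get("rename", {})
    retype = schema.get("retype", {})   # maps original column -> converter
    out = []
    for row in rows:
        new_row = {}
        for key, value in row.items():
            if key in drop:
                continue
            if key in retype:
                value = retype[key](value)
            new_row[rename.get(key, key)] = value
        out.append(new_row)
    return out

rows = [{"id": "1", "price": "9.99", "tmp": "x"}]
schema = {"drop": ["tmp"], "rename": {"id": "sku"}, "retype": {"price": float}}
print(apply_destination_schema(rows, schema))
```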
Architecture
Push and pull — one platform, two interfaces. AI agents and humans ingest data through the pipeline, store it across databases and vector stores, and retrieve it back — via API or MCP. Self-hosted on proven open-source infrastructure — no proprietary services, no vendor lock-in, no surprise bills:

| Service | Purpose |
|---|---|
| MinIO | S3-compatible object store for file staging and data output |
| MongoDB | Configuration store, job status tracking, metadata |
| ActiveMQ | File notification queue, pipeline event notifications |
| HashiCorp Vault | Secrets management (database credentials, API keys) |
| Apache Kafka | Optional streaming source and destination |
| Apache Spark | Local Spark for writing Parquet/ORC to MinIO |
Processing Flow
Retrieval Flow
Supported Data Formats
| Format | Input | Output |
|---|---|---|
| CSV | Configurable delimiter, header, encoding | Parquet, ORC, database, Kafka, ActiveMQ |
| JSON | Single object or NDJSON (one per line) | MongoDB, Kafka, REST |
| XML | Single document or one per line | Database, Kafka, REST |
| Excel (XLS) | Worksheet selection, auto-CSV conversion | Same as CSV |
| Unstructured | PDF, Word, PowerPoint, Excel, HTML, email, EPUB, text | Object store, Qdrant, Weaviate, pgvector |
| Archives | .zip, .tar, .gz, .jar | Extracted and processed individually |
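The JSON row of the table distinguishes a single object from NDJSON (one object per line). How Datris detects the shape internally is not documented here; one plausible approach is to try parsing the whole payload first and fall back to line-by-line parsing:

```python
# Sketch of handling both JSON input shapes from the formats table:
# a single JSON object vs NDJSON (one object per line). This is an
# illustrative approach, not Datris's actual detection logic.
import json

def load_json_records(text: str) -> list[dict]:
    """Return a list of records from a single JSON object or NDJSON."""
    stripped = text.strip()
    try:
        obj = json.loads(stripped)  # whole payload parses: single document
        return obj if isinstance(obj, list) else [obj]
    except json.JSONDecodeError:
        pass
    # Fall back to NDJSON: one JSON object per non-empty line.
    return [json.loads(line) for line in stripped.splitlines() if line.strip()]

print(load_json_records('{"a": 1}'))
print(load_json_records('{"a": 1}\n{"a": 2}'))
```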
Quick Links
- Installation - Get running with Docker Compose
- Quick Start - End-to-end walkthrough
- Datris CLI - Command-line interface for ingesting, querying, and managing pipelines
- Pipeline Configuration - Full JSON configuration reference
- Preprocessor - External preprocessing via REST endpoints
- API Reference - REST API documentation
- AI Schema Generation - Generate pipeline configs from files using AI
- AI Configuration - Configure AI providers (Anthropic, OpenAI, Ollama)
- AI Data Quality (CodeGen) - Natural language validation — generates Python scripts
- AI Transformation (CodeGen) - Natural language transformation — generates Python scripts
- AI Data Profiling - Profile data files and get recommended rules
- AI Error Explanation - Automatic plain-English error analysis
- Qdrant Destination - Vector database for RAG with chunking, embeddings, and metadata
- Weaviate Destination - Vector database for RAG with chunking, embeddings, and metadata
- Milvus Destination - Scalable vector database for RAG
- Chroma Destination - Lightweight vector database for RAG — single container
- pgvector Destination - PostgreSQL vector database for RAG — no separate server required
- Query API - Query PostgreSQL and MongoDB via REST API
- Search API - Semantic search across vector databases via REST API
- OpenAPI Spec - OpenAPI 3.0 spec for Postman, code generation, and non-MCP integrations
- MCP Server - AI agent integration via Model Context Protocol
- Example Applications - Vector store chat, Kafka loader, preprocessor, and more