Documentation Index

Fetch the complete documentation index at: https://docs.datris.ai/llms.txt

Use this file to discover all available pages before exploring further.

datris.ai

Ingest, validate, transform, store, and retrieve your data — whether you’re an AI agent talking through MCP or a developer writing config. One platform for both. Deploy on any cloud provider, on-premise, or locally — with no vendor lock-in. Define your entire data pipeline through simple JSON configuration, or extend it with AI instructions and preprocessors at every stage of the flow. Built entirely on open-source infrastructure, it runs anywhere Docker does.

Agent-Ready: Built-In MCP Server

Your AI agents are first-class pipeline operators. Datris ships with a native MCP (Model Context Protocol) server — the first open-source data platform natively accessible to AI agents. Claude, Cursor, OpenClaw, and any MCP-compatible agent can register pipelines, upload files, trigger processing, monitor job status, profile data, run semantic searches across vector databases, and query PostgreSQL and MongoDB — all through natural conversation. Supports stdio and SSE transports.
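
Registration in an MCP client can follow the standard mcpServers config format used by Claude Desktop and Cursor. A minimal stdio sketch, with a placeholder launch command since neither the executable nor its arguments are specified on this page:

{
  "mcpServers": {
    "datris": {
      "command": "<command-that-launches-the-datris-mcp-server>",
      "args": []
    }
  }
}

MCP clients that support the SSE transport typically register the server's URL instead of a local command.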

AI-Powered Features

Intelligence at every stage — from ingestion to delivery, Datris makes data engineering accessible through natural language.
  • MCP server (AI agent integration) - Built-in MCP server lets AI agents (Claude, Cursor, OpenClaw, custom frameworks) natively interact with the pipeline — register pipelines, upload files, trigger jobs, profile data, run semantic searches, and query databases. Supports stdio and SSE transports
  • AI-powered data quality - Validate with plain English rules via aiRule. Examples (a config sketch follows this feature list):
    • “Validate that all email addresses are properly formatted and all phone numbers contain 7–15 digits”
    • “Ensure price is positive, quantity is a whole number, and discount never exceeds price”
    • “Check that start_date is before end_date and both are in YYYY-MM-DD format”
    • “Verify that country codes are valid ISO 3166-1 alpha-2 codes”
    • “Flag any row where revenue minus cost does not equal profit within a 0.01 tolerance”
  • AI transformations - Describe transformations in natural language. Examples:
    • “Convert all dates to YYYY-MM-DD format and normalize phone numbers to E.164”
    • “Categorize transactions as ‘small’ under $100, ‘medium’ $100–1000, or ‘large’ over $1000”
    • “Extract city and state from the address column into separate columns”
    • “Standardize company names — remove Inc, LLC, Corp suffixes and trim whitespace”
    • “Convert all currency amounts from EUR to USD using a rate of 1.08”
  • Discovery - One wizard, six steps: chat about a source (“yfinance daily prices for the S&P 500”), pick the datasets you want, and Datris generates the tap scripts, builds the pipelines, schedules them, and runs them — all grouped into a Data Catalog for organization. The fastest way to onboard a new external data source
  • Taps - AI-generated Python scripts that fetch data from external sources (APIs, web scraping, databases) and push it into pipelines. Describe what data you want in plain English, and Datris generates the script. Includes AI diagnosis when scripts fail, CRON scheduling, and credentials management via Vault secrets
  • Datris CLI - Command-line interface for ingesting data, running queries, and managing pipelines. For example: datris ingest data.csv --ai-validate "prices > 0" --ai-transform "convert dates to YYYY/MM/DD"
  • AI schema generation - Upload any data and receive a complete, ready-to-register pipeline configuration — field names and types inferred automatically
  • AI data profiling - Upload a data file and get summary statistics, quality issues, and suggested validation rules — all powered by AI analysis
  • AI error explanation - When jobs fail, AI analyzes the error chain and explains the root cause in plain English. No more digging through stack traces
  • AI providers - Anthropic Claude (choose any model), OpenAI (choose any model, plus embedding models), or local models via Ollama (Llama, Mistral, Phi). No vendor lock-in — switch providers without changing your pipeline config
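
As referenced in the data-quality item above, aiRule accepts plain-English rules directly in the pipeline configuration, and AI transformations are described the same way. The fragment below is an illustrative sketch: only the aiRule keyword comes from this page, while the surrounding key names (dataQuality, transformation, instruction) are placeholders, not the documented schema:

{
  "dataQuality": {
    "aiRule": "Ensure price is positive, quantity is a whole number, and discount never exceeds price"
  },
  "transformation": {
    "instruction": "Convert all dates to YYYY-MM-DD format and normalize phone numbers to E.164"
  }
}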

RAG Pipeline

Full RAG pipeline built in. Extract, chunk, embed, and upsert documents into any major vector database — build retrieval-augmented generation workflows without leaving your pipeline.
  • 5 vector databases - Qdrant, Weaviate, Milvus, Chroma, pgvector (PostgreSQL)
  • Chunking strategies - Fixed-size, sentence, paragraph, recursive
  • Embedding providers - OpenAI or Ollama (local models)
  • Document extraction - PDF, Word, PowerPoint, Excel, HTML, email, EPUB, plain text
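
A vector-store destination combines these options in configuration. The sketch below is illustrative only: the key names and numeric values are guesses, and just the option values (Qdrant, recursive chunking, OpenAI embeddings) come from the list above:

{
  "destination": {
    "type": "qdrant",
    "collection": "documents",
    "chunking": { "strategy": "recursive", "size": 512, "overlap": 50 },
    "embedding": { "provider": "openai" }
  }
}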

Key Features

  • Configuration-driven - Define pipelines entirely via MCP, the Datris UI, or directly with JSON, and extend them with AI instructions for data quality and transformations. You can also plug in your own preprocessor via a REST endpoint (see the sketch after this list)
  • Multiple ingestion methods - MCP, data upload API, MinIO bucket events, database polling, Kafka streaming
  • Data quality - AI rules (LLM-generated) and/or JSON/XML schema validation
  • Transformations - AI transformations (LLM-generated) and destination schema (drop/rename/retype columns) for structured data
  • Multiple destinations - Write to PostgreSQL, MongoDB, Kafka, ActiveMQ, REST endpoints, Qdrant, Weaviate, Milvus, Chroma, pgvector, or MinIO (Parquet/ORC) in parallel
  • Event notifications - Subscribe to pipeline processing events via ActiveMQ topics
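
The sketch referenced in the configuration-driven item above: one pipeline document combining a custom preprocessor endpoint with two parallel destinations. As with the earlier fragments, every key name here is a placeholder rather than the documented schema; the data-quality and transformation block shown earlier would sit in the same document:

{
  "pipeline": "orders_csv",
  "preprocessor": { "url": "https://example.com/preprocess" },
  "destinations": [
    { "type": "postgresql", "table": "orders" },
    { "type": "minio", "format": "parquet" }
  ]
}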

Architecture

Push and pull — one platform, two interfaces. AI agents and humans ingest data through the pipeline, store it across databases and vector stores, and retrieve it back — via MCP or API. Self-hosted on proven open-source infrastructure — no proprietary services, no vendor lock-in, no surprise bills:
Service | Purpose
MinIO | S3-compatible object store for file staging and data output
MongoDB | Configuration store, job status tracking, metadata
ActiveMQ | File notification queue, pipeline event notifications
HashiCorp Vault | Secrets management (database credentials, API keys)
Apache Kafka | Optional streaming source and destination
Apache Spark | Local Spark for writing Parquet/ORC to MinIO

Processing Flow

Source (Data Upload / MinIO Event / Database Pull / Kafka)
  |
  v
Preprocessor (optional REST endpoint)
  |
  v
Data Quality (AI rules, header validation, schema validation)
  |
  v
Transformation (AI transformations)
  |
  v
Destinations (executed in parallel)
  ├── Object Store (MinIO - Parquet, ORC, CSV)
  ├── PostgreSQL (COPY bulk insert)
  ├── MongoDB (document upsert)
  ├── Kafka (topic producer)
  ├── ActiveMQ (queue)
  ├── REST Endpoint (HTTP POST)
  ├── Qdrant (vector database - chunking, embeddings, RAG)
  ├── Weaviate (vector database - chunking, embeddings, RAG)
  ├── Milvus (vector database - chunking, embeddings, RAG)
  ├── Chroma (vector database - chunking, embeddings, RAG)
  └── pgvector (PostgreSQL vector database - chunking, embeddings, RAG)
  |
  v
Notifications (published to ActiveMQ topic)

Retrieval Flow

Client (AI Agent via MCP / Developer via REST API)
  |
  v
Interface
  ├── MCP Server (stdio or SSE)
  └── REST API (POST /api/v1/query/* and /api/v1/search/*)
  |
  v
Query Source
  ├── PostgreSQL (read-only SQL SELECT queries)
  ├── MongoDB (document queries with filters and projections)
  ├── Qdrant (semantic search)
  ├── Weaviate (semantic search)
  ├── Milvus (semantic search)
  ├── Chroma (semantic search)
  ├── pgvector (semantic search via PostgreSQL)
  ├── Pipeline Configurations (list/get)
  ├── Job Status (by pipeline token or pipeline name)
  ├── AI Schema Generation (from uploaded files)
  └── AI Data Profiling (statistics, quality issues, suggested rules)
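
Over REST, a semantic search is a POST to one of the /api/v1/search/ endpoints. The request body below is a hedged sketch; neither the exact path nor the field names appear on this page, so treat all of them as placeholders:

{
  "collection": "documents",
  "query": "quarterly revenue by region",
  "top_k": 5
}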

Supported Data Formats

Format | Input | Output
CSV | Configurable delimiter, header, encoding | Parquet, ORC, database, Kafka, ActiveMQ
JSON | Single object or NDJSON (one per line) | MongoDB, Kafka, REST
XML | Single document or one per line | Database, Kafka, REST
Excel (XLS) | Worksheet selection, auto-CSV conversion | Same as CSV
Unstructured | PDF, Word, PowerPoint, Excel, HTML, email, EPUB, text | Object store, Qdrant, Weaviate, pgvector
Archives | .zip, .tar, .gz, .jar | Extracted and processed individually