Documentation Index

Fetch the complete documentation index at: https://docs.datris.ai/llms.txt

Use this file to discover all available pages before exploring further.

datris.ai

Ingest, validate, transform, store, and retrieve your data — whether you’re an AI agent talking through MCP or a developer writing config. One platform for both. Deploy on any cloud provider, on-premise, or locally — with no vendor lock-in. Define your entire data pipeline through simple JSON configuration, or extend it with AI instructions and preprocessors at every stage of the flow. Built entirely on open-source infrastructure, it runs anywhere Docker does.

Agent-Ready: Built-In MCP Server

Your AI agents are first-class pipeline operators. Datris ships with a native MCP (Model Context Protocol) server — the first open-source data platform natively accessible to AI agents. Claude, Cursor, OpenClaw, and any MCP-compatible agent can register pipelines, upload files, trigger processing, monitor job status, profile data, run semantic searches across vector databases, and query PostgreSQL and MongoDB — all through natural conversation. Supports stdio and SSE transports.
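
Registration in an MCP client can follow the standard mcpServers config format used by Claude Desktop and Cursor. A minimal stdio sketch, with a placeholder launch command since neither the executable nor its arguments are specified on this page:

{
  "mcpServers": {
    "datris": {
      "command": "<command-that-launches-the-datris-mcp-server>",
      "args": []
    }
  }
}

MCP clients that support the SSE transport typically register the server's URL instead of a local command.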

AI-Powered Features

Intelligence at every stage — from ingestion to delivery, Datris makes data engineering accessible through natural language.
  • MCP server (AI agent integration) - Built-in MCP server lets AI agents (Claude, Cursor, OpenClaw, custom frameworks) natively interact with the pipeline — register pipelines, upload files, trigger jobs, profile data, run semantic searches, and query databases. Supports stdio and SSE transports
  • AI-powered data quality - Validate with plain English rules via aiRule. Examples (a config sketch follows this feature list):
    • “Validate that all email addresses are properly formatted and all phone numbers contain 7–15 digits”
    • “Ensure price is positive, quantity is a whole number, and discount never exceeds price”
    • “Check that start_date is before end_date and both are in YYYY-MM-DD format”
    • “Verify that country codes are valid ISO 3166-1 alpha-2 codes”
    • “Flag any row where revenue minus cost does not equal profit within a 0.01 tolerance”
  • AI transformations - Describe transformations in natural language. Examples:
    • “Convert all dates to YYYY-MM-DD format and normalize phone numbers to E.164”
    • “Categorize transactions as ‘small’ under $100, ‘medium’ $100–1000, or ‘large’ over $1000”
    • “Extract city and state from the address column into separate columns”
    • “Standardize company names — remove Inc, LLC, Corp suffixes and trim whitespace”
    • “Convert all currency amounts from EUR to USD using a rate of 1.08”
  • Discovery - One wizard, six steps: chat about a source (“yfinance daily prices for the S&P 500”), pick the datasets you want, and Datris generates the tap scripts, builds the pipelines, schedules them, and runs them — all grouped into a Data Catalog for organization. The fastest way to onboard a new external data source
  • Taps - AI-generated Python scripts that fetch data from external sources (APIs, web scraping, databases) and push it into pipelines. Describe what data you want in plain English, and Datris generates the script. Includes AI diagnosis when scripts fail, CRON scheduling, and credentials management via Vault secrets
  • Datris CLI - Command-line interface for ingesting data, running queries, and managing pipelines. For example: datris ingest data.csv --ai-validate "prices > 0" --ai-transform "convert dates to YYYY/MM/DD"
  • AI schema generation - Upload any data and receive a complete, ready-to-register pipeline configuration — field names and types inferred automatically
  • AI data profiling - Upload a data file and get summary statistics, quality issues, and suggested validation rules — all powered by AI analysis
  • AI error explanation - When jobs fail, AI analyzes the error chain and explains the root cause in plain English. No more digging through stack traces
  • AI providers - Anthropic Claude (choose any model), OpenAI (choose any model, plus embedding models), or local models via Ollama (Llama, Mistral, Phi). No vendor lock-in — switch providers without changing your pipeline config
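
As referenced in the data-quality item above, aiRule accepts plain-English rules directly in the pipeline configuration, and AI transformations are described the same way. The fragment below is an illustrative sketch: only the aiRule keyword comes from this page, while the surrounding key names (dataQuality, transformation, instruction) are placeholders, not the documented schema:

{
  "dataQuality": {
    "aiRule": "Ensure price is positive, quantity is a whole number, and discount never exceeds price"
  },
  "transformation": {
    "instruction": "Convert all dates to YYYY-MM-DD format and normalize phone numbers to E.164"
  }
}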

RAG Pipeline

Full RAG pipeline built in. Extract, chunk, embed, and upsert documents into any major vector database — build retrieval-augmented generation workflows without leaving your pipeline.
  • 5 vector databases - Qdrant, Weaviate, Milvus, Chroma, pgvector (PostgreSQL)
  • Chunking strategies - Fixed-size, sentence, paragraph, recursive
  • Embedding providers - OpenAI or Ollama (local models)
  • Document extraction - PDF, Word, PowerPoint, Excel, HTML, email, EPUB, plain text
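
A vector-store destination combines these options in configuration. The sketch below is illustrative only: the key names and numeric values are guesses, and just the option values (Qdrant, recursive chunking, OpenAI embeddings) come from the list above:

{
  "destination": {
    "type": "qdrant",
    "collection": "documents",
    "chunking": { "strategy": "recursive", "size": 512, "overlap": 50 },
    "embedding": { "provider": "openai" }
  }
}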

Key Features

  • Configuration-driven - Define pipelines entirely via MCP, the Datris UI, or directly with JSON, and extend them with AI instructions for data quality and transformations. You can also plug in your own preprocessor via a REST endpoint (see the sketch after this list)
  • Multiple ingestion methods - MCP, data upload API, MinIO bucket events, database polling, Kafka streaming
  • Data quality - AI rules (LLM-generated) and/or JSON/XML schema validation
  • Transformations - AI transformations (LLM-generated) and destination schema (drop/rename/retype columns) for structured data
  • Multiple destinations - Write to PostgreSQL, MongoDB, Kafka, ActiveMQ, REST endpoints, Qdrant, Weaviate, Milvus, Chroma, pgvector, or MinIO (Parquet/ORC) in parallel
  • Event notifications - Subscribe to pipeline processing events via ActiveMQ topics
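
The sketch referenced in the configuration-driven item above: one pipeline document combining a custom preprocessor endpoint with two parallel destinations. As with the earlier fragments, every key name here is a placeholder rather than the documented schema; the data-quality and transformation block shown earlier would sit in the same document:

{
  "pipeline": "orders_csv",
  "preprocessor": { "url": "https://example.com/preprocess" },
  "destinations": [
    { "type": "postgresql", "table": "orders" },
    { "type": "minio", "format": "parquet" }
  ]
}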

Architecture

Push and pull — one platform, two interfaces. AI agents and humans ingest data through the pipeline, store it across databases and vector stores, and retrieve it back — via MCP or API. Self-hosted on proven open-source infrastructure — no proprietary services, no vendor lock-in, no surprise bills:
Service | Purpose
MinIO | S3-compatible object store for file staging and data output
MongoDB | Configuration store, job status tracking, metadata
ActiveMQ | File notification queue, pipeline event notifications
HashiCorp Vault | Secrets management (database credentials, API keys)
Apache Kafka | Optional streaming source and destination
Apache Spark | Local Spark for writing Parquet/ORC to MinIO

Processing Flow

Source (Data Upload / MinIO Event / Database Pull / Kafka)
  |
  v
Preprocessor (optional REST endpoint)
  |
  v
Data Quality (AI rules, header validation, schema validation)
  |
  v
Transformation (AI transformations)
  |
  v
Destinations (executed in parallel)
  ├── Object Store (MinIO - Parquet, ORC, CSV)
  ├── PostgreSQL (COPY bulk insert)
  ├── MongoDB (document upsert)
  ├── Kafka (topic producer)
  ├── ActiveMQ (queue)
  ├── REST Endpoint (HTTP POST)
  ├── Qdrant (vector database - chunking, embeddings, RAG)
  ├── Weaviate (vector database - chunking, embeddings, RAG)
  ├── Milvus (vector database - chunking, embeddings, RAG)
  ├── Chroma (vector database - chunking, embeddings, RAG)
  └── pgvector (PostgreSQL vector database - chunking, embeddings, RAG)
  |
  v
Notifications (published to ActiveMQ topic)

Retrieval Flow

Client (AI Agent via MCP / Developer via REST API)
  |
  v
Interface
  ├── MCP Server (stdio or SSE)
  └── REST API (POST /api/v1/query/* and /api/v1/search/*)
  |
  v
Query Source
  ├── PostgreSQL (read-only SQL SELECT queries)
  ├── MongoDB (document queries with filters and projections)
  ├── Qdrant (semantic search)
  ├── Weaviate (semantic search)
  ├── Milvus (semantic search)
  ├── Chroma (semantic search)
  ├── pgvector (semantic search via PostgreSQL)
  ├── Pipeline Configurations (list/get)
  ├── Job Status (by pipeline token or pipeline name)
  ├── AI Schema Generation (from uploaded files)
  └── AI Data Profiling (statistics, quality issues, suggested rules)
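
Over REST, a semantic search is a POST to one of the /api/v1/search/ endpoints. The request body below is a hedged sketch; neither the exact path nor the field names appear on this page, so treat all of them as placeholders:

{
  "collection": "documents",
  "query": "quarterly revenue by region",
  "top_k": 5
}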

Supported Data Formats

Format | Input | Output
CSV | Configurable delimiter, header, encoding | Parquet, ORC, database, Kafka, ActiveMQ
JSON | Single object or NDJSON (one per line) | MongoDB, Kafka, REST
XML | Single document or one per line | Database, Kafka, REST
Excel (XLS) | Worksheet selection, auto-CSV conversion | Same as CSV
Unstructured | PDF, Word, PowerPoint, Excel, HTML, email, EPUB, text | Object store, Qdrant, Weaviate, pgvector
Archives | .zip, .tar, .gz, .jar | Extracted and processed individually