Documentation Index
Fetch the complete documentation index at: https://docs.datris.ai/llms.txt
Use this file to discover all available pages before exploring further.
Schemas define the structure of data flowing through the pipeline. Every pipeline requires a source schema that describes the incoming fields and their types. Optionally, a destination schema can override types or rename fields when writing to a target.
Defining Schemas
Schemas are declared in the source.schemaProperties.fields array of a pipeline configuration. Each field entry specifies a name and a data type.
```json
{
  "source": {
    "schemaProperties": {
      "fields": [
        { "name": "id", "type": "bigint" },
        { "name": "email", "type": "varchar(255)" },
        { "name": "signup_date", "type": "date" },
        { "name": "balance", "type": "decimal(12,2)" },
        { "name": "is_active", "type": "boolean" },
        { "name": "notes", "type": "string" }
      ]
    }
  }
}
```
Supported Data Types
When the UI or AI schema generation infers types, it uses these common types:
| Type | Description |
|---|---|
| boolean | True/false value |
| int | 32-bit signed integer |
| bigint | 64-bit signed integer |
| float | 32-bit floating point |
| double | 64-bit floating point |
| string | Variable-length text |
| date | Calendar date (no time component) |
| timestamp | Date and time with microsecond precision |
MCP Server (AI Agents)
When an AI agent creates a pipeline via the MCP server, all fields are set to string. This is intentional — it ensures data is always ingested successfully regardless of format inconsistencies. Type enforcement happens at the destination (e.g., PostgreSQL column types) rather than at ingestion time, avoiding failed jobs from unexpected values.
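As an illustrative fragment (not captured MCP server output), the fields from the earlier example would all arrive typed as `string`:

```json
{
  "source": {
    "schemaProperties": {
      "fields": [
        { "name": "id", "type": "string" },
        { "name": "email", "type": "string" },
        { "name": "signup_date", "type": "string" }
      ]
    }
  }
}
```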
Advanced Types
These types are supported by the server but are never auto-generated. Use them when writing pipeline JSON configuration manually:
| Type | Description |
|---|---|
| tinyint | 8-bit signed integer |
| smallint | 16-bit signed integer |
| decimal(p,s) | Fixed-precision decimal with p total digits and s scale digits |
| varchar(n) | Variable-length text with maximum length n |
| char(n) | Fixed-length text of exactly n characters |
Refer to data-types for type mappings to PostgreSQL and Spark.
Column Naming Rules
Column names in pipeline schemaProperties.fields must match [A-Za-z0-9_]+ — letters, digits, and underscores only. Spaces, parentheses, percent signs, hyphens, and other punctuation are rejected by the pipeline validator at registration time. Names are also case-insensitive for duplicate detection: Foo and foo collide.
Recommended convention: lowercase snake_case (e.g. customer_id, eps_estimate).
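A minimal sketch of the two validator checks described above (the pattern comes from this section; the function name is ours, not the platform's):

```python
import re

# Allowed pattern from the naming rules: letters, digits, underscores only
VALID_NAME = re.compile(r"[A-Za-z0-9_]+")

def check_field_names(names):
    """Reject disallowed characters and case-insensitive duplicates."""
    seen = set()
    for name in names:
        if not VALID_NAME.fullmatch(name):
            raise ValueError(f"invalid column name: {name!r}")
        lowered = name.lower()
        if lowered in seen:  # Foo and foo collide
            raise ValueError(f"duplicate column name (case-insensitive): {name!r}")
        seen.add(lowered)

check_field_names(["customer_id", "eps_estimate"])  # passes
```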
Tap auto-normalization
When a tap returns records whose keys contain disallowed characters (common with pandas DataFrames or APIs that preserve human-readable column titles), the platform automatically normalizes them in TapScriptRunner before they enter the pipeline.
The conversion runs in this order:
- Spell out semantically meaningful special characters as `_word_` tokens (see table below)
- Replace any remaining non-alphanumeric character with `_`
- Collapse runs of `_` into a single `_`
- Trim leading/trailing `_`
- Lowercase
Special character spell-out table
| Character | Replacement | Example |
|---|---|---|
| % | percent | Surprise(%) → surprise_percent |
| # | num | Order# → order_num |
| $ | dollars | Price$ → price_dollars |
| & | and | R&D Spending → r_and_d_spending |
| @ | at | Email@Domain → email_at_domain |
| + | plus | Type A+ → type_a_plus |
| = | equals | a=b → a_equals_b |
| < | lt | a<b → a_lt_b |
| > | gt | a>b → a_gt_b |
| / | per | miles/hour → miles_per_hour |
| * | star | count* → count_star |
| ^ | pow | x^2 → x_pow_2 |
| ~ | approx | ~Population → approx_population |
| ! | bang | Score! → score_bang |
Other special characters (spaces, hyphens, parentheses, brackets, dots, commas, colons, quotes, etc.) are simply replaced with _. So EPS Estimate becomes eps_estimate and Cost ($) becomes cost_dollars.
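The conversion order above can be sketched as a small Python function. This is an illustrative re-implementation, not the platform's actual TapScriptRunner code:

```python
import re

# Spell-out table from this section: semantically meaningful characters
# become _word_ tokens before the generic cleanup passes run.
SPELL_OUT = {
    "%": "percent", "#": "num", "$": "dollars", "&": "and", "@": "at",
    "+": "plus", "=": "equals", "<": "lt", ">": "gt", "/": "per",
    "*": "star", "^": "pow", "~": "approx", "!": "bang",
}

def normalize_column(name: str) -> str:
    # 1. Spell out special characters as _word_ tokens
    for ch, word in SPELL_OUT.items():
        name = name.replace(ch, f"_{word}_")
    # 2. Replace any remaining non-alphanumeric character with _
    name = re.sub(r"[^A-Za-z0-9]", "_", name)
    # 3. Collapse runs of _ into a single _
    name = re.sub(r"_+", "_", name)
    # 4. Trim leading/trailing _
    name = name.strip("_")
    # 5. Lowercase
    return name.lower()

normalize_column("Surprise(%)")   # → "surprise_percent"
normalize_column("EPS Estimate")  # → "eps_estimate"
```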
This normalization only applies to tabular results (dataType: csv). JSON and XML results destined for MongoDB are passed through unchanged in the _json field.
See Taps → Script Requirements for the script-side perspective.
Auto-Generating a Schema
If you have a representative CSV file, the pipeline can infer a schema automatically using AI. POST the file to the /api/v1/pipeline/generate endpoint:
```bash
curl -X POST "http://localhost:8080/api/v1/pipeline/generate" \
  -F "file=@sample.csv" \
  -F "pipeline=my_pipeline"
```
The AI analyzes the file content (up to 100 lines for CSV, or 10,000 characters for JSON/XML) and returns a complete pipeline configuration with inferred field names and data types. You can edit the output before saving it to a pipeline configuration.
In the UI, this happens automatically in Step 1 of the pipeline creation wizard when you upload a sample file and click “Analyze File”.
AI-Generated Validation Schemas
For JSON and XML pipelines, you can also generate validation schemas using AI:
- JSON Schema (Draft 4) — for validating JSON data against an Everit-compatible schema
- W3C XSD — for validating XML data against an XML Schema
In the UI (Step 4 — Data Quality), choose “Generate schema with AI”, enter a schema name, and provide sample data (or load it from your uploaded file). The AI generates a compliant schema that is stored in MinIO and referenced automatically in the pipeline config.
Via API:
```bash
curl -X POST "http://localhost:8080/api/v1/config/generate-schema" \
  -H "Content-Type: application/json" \
  -d '{
    "type": "json-schema",
    "name": "stock_prices_schema",
    "sampleData": "{\"symbol\": \"AAPL\", \"price\": 150.25}"
  }'
```
Valid types: json-schema (generates Draft 4 JSON Schema) and xsd (generates W3C XSD). The generated schema is stored at {environment}-config/validation-schema/{name}.json or {name}.xsd.
Schema Evolution
Datris supports additive schema evolution — when new columns appear in uploaded data, the pipeline schema automatically expands to include them. No manual intervention or pipeline recreation is needed.
How it works
- You create a pipeline with a 3-column CSV (e.g., `name, age, city`)
- Later, you upload a CSV with a 4th column (e.g., `name, age, city, email`)
- Datris detects the new `email` column, adds it to the pipeline schema as type `string`, and runs `ALTER TABLE` on the PostgreSQL destination to add the column
- The data is processed with all 4 columns
- Previous rows have `NULL` for the new column; new rows have the full data
Each schema change increments a schemaVersion counter in the pipeline config, so you can track how the schema has evolved over time.
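The additive rule can be sketched as a pure function over the stored field list and an incoming CSV header. The names here are illustrative; the real logic runs server-side:

```python
def evolve_schema(fields, csv_header):
    """Additive evolution: append unseen header columns as string fields.

    `fields` is a list of {"name": ..., "type": ...} dicts in the shape of
    source.schemaProperties.fields; returns (new_fields, added_names).
    """
    known = {f["name"] for f in fields}
    added = [col for col in csv_header if col not in known]
    # New columns default to string; existing columns are never changed
    new_fields = fields + [{"name": col, "type": "string"} for col in added]
    return new_fields, added

v1 = [{"name": n, "type": "string"} for n in ("name", "age", "city")]
v2, added = evolve_schema(v1, ["name", "age", "city", "email"])
# added == ["email"]; schemaVersion would be bumped and ALTER TABLE issued
```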
What triggers evolution
- New columns in the CSV header that are not in the stored schema → automatically added as `string` type
What does NOT trigger evolution
- Missing columns — if a CSV is missing a non-key column from the schema, the missing values are filled with empty strings (existing behavior). Key fields must always be present.
- Column renames — treated as a removal + addition. The old column gets empty values, and the new column is added.
- Type changes — new columns default to `string`. Changing an existing column’s type requires manual pipeline update.
- JSON/XML data — these use a single `_json` or `_xml` field, so column-level evolution does not apply.
Example
```bash
# Create pipeline with 3-column data
curl -X POST http://localhost:8080/api/v1/pipeline/upload \
  -F "file=@data_v1.csv" -F "pipeline=customers"
# data_v1.csv: name,age,city

# Later, upload data with a new column
curl -X POST http://localhost:8080/api/v1/pipeline/upload \
  -F "file=@data_v2.csv" -F "pipeline=customers"
# data_v2.csv: name,age,city,email
# → Schema evolves automatically, ALTER TABLE adds "email" column
```
Check the evolved schema:
```bash
curl http://localhost:8080/api/v1/pipeline?pipeline=customers
# source.schemaProperties.schemaVersion: 2
# source.schemaProperties.fields: [name, age, city, email]
```
Source vs Destination Schemas
A source schema describes the data as it arrives (columns, JSON keys, database columns). It is always required.
A destination schema describes the data as it should be written to the target system. It is optional. When omitted, the destination inherits the source schema unchanged.
Use a destination schema when you need to:
- Widen a type (e.g., `int` to `bigint`) for the target table.
- Rename a field between ingestion and storage.
- Drop fields that should not reach the destination.
```json
{
  "source": {
    "schemaProperties": {
      "fields": [
        { "name": "user_id", "type": "int" },
        { "name": "full_name", "type": "varchar(100)" }
      ]
    }
  },
  "destination": {
    "schemaProperties": {
      "fields": [
        { "name": "user_id", "type": "bigint" },
        { "name": "full_name", "type": "varchar(200)" }
      ]
    }
  }
}
```
When both schemas are present, the pipeline maps source fields to destination fields by position. Ensure the field count matches or use a transformation step to reconcile differences.
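The positional mapping can be sketched as follows (an illustrative helper, assuming the equal field counts the text requires; the function name is ours):

```python
def map_by_position(source_fields, dest_fields):
    """Pair source and destination fields positionally.

    Mirrors the documented behavior: field i of the source schema is
    written to field i of the destination schema, taking the
    destination's name and type.
    """
    if len(source_fields) != len(dest_fields):
        raise ValueError("source/destination field counts must match")
    return [
        {"from": s["name"], "to": d["name"], "type": d["type"]}
        for s, d in zip(source_fields, dest_fields)
    ]
```

Applied to the example above, `user_id` (int) maps to `user_id` (bigint) and `full_name` varchar(100) maps to `full_name` varchar(200).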