Defining Schemas
Schemas are declared in the source.schemaProperties.fields array of a pipeline configuration. Each field entry specifies a name and a data type.
Supported Data Types
When the UI or AI schema generation infers types, it uses these common types:
| Type | Description |
|---|---|
| boolean | True/false value |
| int | 32-bit signed integer |
| bigint | 64-bit signed integer |
| float | 32-bit floating point |
| double | 64-bit floating point |
| string | Variable-length text |
| date | Calendar date (no time component) |
| timestamp | Date and time with microsecond precision |
MCP Server (AI Agents)
When an AI agent creates a pipeline via the MCP server, all fields are set to string. This is intentional: it ensures data is always ingested successfully regardless of format inconsistencies. Type enforcement happens at the destination (e.g., PostgreSQL column types) rather than at ingestion time, avoiding failed jobs from unexpected values.
Advanced Types
These types are supported by the server but are never auto-generated. Use them when writing pipeline JSON configuration manually:
| Type | Description |
|---|---|
| tinyint | 8-bit signed integer |
| smallint | 16-bit signed integer |
| decimal(p,s) | Fixed-precision decimal with p total digits and s scale digits |
| varchar(n) | Variable-length text with maximum length n |
| char(n) | Fixed-length text of exactly n characters |
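A hand-written configuration can be checked against the type spellings in the two tables before it is submitted. The following is a minimal sketch, not the server's validator: the helper name is_supported_type, the field keys "name" and "type", the exact-spelling match (no spaces, lowercase only), and the rule that scale may not exceed precision are all assumptions made for illustration.

```python
import re

# Type names taken from the "Supported Data Types" and "Advanced Types" tables.
SIMPLE_TYPES = {
    "boolean", "int", "bigint", "float", "double",
    "string", "date", "timestamp", "tinyint", "smallint",
}
# Parameterized forms: decimal(p,s), varchar(n), char(n).
PARAM_TYPE = re.compile(r"decimal\((\d+),(\d+)\)|(?:varchar|char)\((\d+)\)")


def is_supported_type(type_name):
    """Return True if type_name is spelled like one of the documented types."""
    if type_name in SIMPLE_TYPES:
        return True
    match = PARAM_TYPE.fullmatch(type_name)
    if match is None:
        return False
    if match.group(1) is not None:
        precision, scale = int(match.group(1)), int(match.group(2))
        # Assumption: scale cannot exceed precision, as in most SQL dialects.
        return precision >= 1 and scale <= precision
    return int(match.group(3)) >= 1


# Hypothetical field entries; the key names "name" and "type" are assumed.
fields = [
    {"name": "customer_id", "type": "bigint"},
    {"name": "balance", "type": "decimal(12,2)"},
    {"name": "country_code", "type": "char(2)"},
]
unsupported = [f["name"] for f in fields if not is_supported_type(f["type"])]
```

Here unsupported ends up empty because every entry uses a documented spelling; a type such as text would be reported.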
Column Naming Rules
Column names in pipeline schemaProperties.fields must match [A-Za-z0-9_]+, that is, letters, digits, and underscores only. Spaces, parentheses, percent signs, hyphens, and other punctuation are rejected by the pipeline validator at registration time. Names are also case-insensitive for duplicate detection: Foo and foo collide.
Recommended convention: lowercase snake_case (e.g. customer_id, eps_estimate).
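The two naming rules can be checked locally before registering a pipeline. This is a minimal sketch of such a check; validate_column_names is a hypothetical helper and its messages are not the pipeline validator's actual output.

```python
import re

# Pattern quoted in the naming rule: letters, digits, and underscores only.
COLUMN_NAME = re.compile(r"[A-Za-z0-9_]+")


def validate_column_names(names):
    """Return a list of problems; an empty list means every name passes."""
    problems = []
    seen = {}  # lowercased name -> first spelling encountered
    for name in names:
        if COLUMN_NAME.fullmatch(name) is None:
            problems.append(f"invalid characters in {name!r}")
            continue
        key = name.lower()
        if key in seen:
            # Duplicate detection ignores case, so Foo and foo collide.
            problems.append(f"{name!r} collides with {seen[key]!r}")
        else:
            seen[key] = name
    return problems
```

For example, validate_column_names(["Foo", "foo"]) reports one collision, while ["customer_id", "eps_estimate"] returns an empty list.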
Tap auto-normalization
When a tap returns records whose keys contain disallowed characters (common with pandas DataFrames or APIs that preserve human-readable column titles), the platform automatically normalizes them in TapScriptRunner before they enter the pipeline.
The conversion runs in this order:
- Spell out semantically meaningful special characters as _word_ tokens (see table below)
- Replace any remaining non-alphanumeric character with _
- Collapse runs of _ into a single _
- Trim leading/trailing _
- Lowercase
Special character spell-out table
| Character | Replacement | Example |
|---|---|---|
| % | percent | Surprise(%) → surprise_percent |
| # | num | Order# → order_num |
| $ | dollars | Price$ → price_dollars |
| & | and | R&D Spending → r_and_d_spending |
| @ | at | Email@Domain → email_at_domain |
| + | plus | Type A+ → type_a_plus |
| = | equals | a=b → a_equals_b |
| < | lt | a<b → a_lt_b |
| > | gt | a>b → a_gt_b |
| / | per | miles/hour → miles_per_hour |
| * | star | count* → count_star |
| ^ | pow | x^2 → x_pow_2 |
| ~ | approx | ~Population → approx_population |
| ! | bang | Score! → score_bang |
Any other disallowed character, such as a space or a parenthesis, becomes _. So EPS Estimate becomes eps_estimate and Cost ($) becomes cost_dollars.
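The five steps and the spell-out table can be reproduced in a few lines, which is handy for predicting the column names a tap will produce. This is a sketch written from the description above, not the TapScriptRunner source; normalize_column_name is a hypothetical name, and treating "alphanumeric" as ASCII letters and digits is an assumption.

```python
import re

# Spell-out table: special character -> word.
SPELL_OUT = {
    "%": "percent", "#": "num", "$": "dollars", "&": "and", "@": "at",
    "+": "plus", "=": "equals", "<": "lt", ">": "gt", "/": "per",
    "*": "star", "^": "pow", "~": "approx", "!": "bang",
}


def normalize_column_name(raw):
    # 1. Spell out meaningful special characters as _word_ tokens.
    name = "".join(f"_{SPELL_OUT[ch]}_" if ch in SPELL_OUT else ch for ch in raw)
    # 2. Replace any remaining non-alphanumeric character with "_".
    #    Assumption: "alphanumeric" means ASCII letters and digits.
    name = re.sub(r"[^A-Za-z0-9]", "_", name)
    # 3. Collapse runs of "_" into a single "_".
    name = re.sub(r"_+", "_", name)
    # 4. Trim leading/trailing "_".
    name = name.strip("_")
    # 5. Lowercase.
    return name.lower()
```

Applied to the examples in the table, normalize_column_name("Surprise(%)") gives surprise_percent and normalize_column_name("Cost ($)") gives cost_dollars. A name made only of unlisted punctuation would normalize to an empty string; the sketch does not handle that case.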
This normalization only applies to tabular results (dataType: csv). JSON and XML results destined for MongoDB are passed through unchanged in the _json field.
See Taps → Script Requirements for the script-side perspective.
Auto-Generating a Schema
If you have a representative CSV file, the pipeline can infer a schema automatically using AI. POST the file to the /api/v1/pipeline/generate endpoint.
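A request to this endpoint might be assembled as sketched below, using only the Python standard library. Only the path /api/v1/pipeline/generate comes from this page. The multipart/form-data encoding, the form field name "file", the base URL, and the helper name build_generate_request are assumptions, and any authentication or extra parameters the server expects are not shown.

```python
import urllib.request
import uuid


def build_generate_request(base_url, csv_name, csv_bytes):
    """Build (but do not send) a multipart POST carrying one CSV file."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        # Assumption: the upload field is named "file".
        f'Content-Disposition: form-data; name="file"; filename="{csv_name}"\r\n'.encode(),
        b"Content-Type: text/csv\r\n\r\n",
        csv_bytes,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/v1/pipeline/generate",
        data=body,
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )


# Hypothetical host and sample data; nothing is sent at this point.
request = build_generate_request(
    "http://localhost:8080", "people.csv", b"name,age,city\nAda,36,London\n"
)
```

Passing request to urllib.request.urlopen would send it; the shape of the response is not described on this page, so it is left out here.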
AI-Generated Validation Schemas
For JSON and XML pipelines, you can also generate validation schemas using AI:
- JSON Schema (Draft 4): for validating JSON data against an Everit-compatible schema
- W3C XSD: for validating XML data against an XML Schema
The two schema type values are json-schema (generates Draft 4 JSON Schema) and xsd (generates W3C XSD). The generated schema is stored at {environment}-config/validation-schema/{name}.json or {name}.xsd.
Schema Evolution
Datris supports additive schema evolution: when new columns appear in uploaded data, the pipeline schema automatically expands to include them. No manual intervention or pipeline recreation is needed.
How it works
- You create a pipeline with a 3-column CSV (e.g., name,age,city)
- Later, you upload a CSV with a 4th column (e.g., name,age,city,email)
- Datris detects the new email column, adds it to the pipeline schema as type string, and runs ALTER TABLE on the PostgreSQL destination to add the column
- The data is processed with all 4 columns
- Previous rows have NULL for the new column; new rows have the full data
Each schema change increments the schemaVersion counter in the pipeline config, so you can track how the schema has evolved over time.
What triggers evolution
- New columns in the CSV header that are not in the stored schema → automatically added as string type
What does NOT trigger evolution
- Missing columns: if a CSV is missing a non-key column from the schema, the missing values are filled with empty strings (existing behavior). Key fields must always be present.
- Column renames: treated as a removal + addition. The old column gets empty values, and the new column is added.
- Type changes: new columns default to string. Changing an existing column’s type requires a manual pipeline update.
- JSON/XML data: these use a single _json or _xml field, so column-level evolution does not apply.
Example
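The additive behavior described in this section can be modeled in a few lines. This is an illustrative sketch, not Datris code. The function name evolve_schema, the field keys "name" and "type", the top-level placement and starting value of schemaVersion, the table name, and the mapping of string to a PostgreSQL TEXT column are all assumptions.

```python
import copy


def evolve_schema(pipeline, csv_header, table):
    """Return (updated config, ALTER TABLE statements) for unseen header columns."""
    updated = copy.deepcopy(pipeline)  # leave the caller's config untouched
    fields = updated["source"]["schemaProperties"]["fields"]
    known = {f["name"].lower() for f in fields}
    statements = []
    for column in csv_header:
        if column.lower() in known:
            continue
        # New columns are always added as type string.
        fields.append({"name": column, "type": "string"})
        known.add(column.lower())
        # Assumption: string maps to TEXT. Names are expected to have passed
        # the [A-Za-z0-9_]+ rule already, so they are not quoted here.
        statements.append(f"ALTER TABLE {table} ADD COLUMN {column} TEXT")
    if statements:
        # Assumption: the counter lives at the top level and starts at 1.
        updated["schemaVersion"] = updated.get("schemaVersion", 1) + 1
    return updated, statements


pipeline = {
    "schemaVersion": 1,
    "source": {"schemaProperties": {"fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "string"},
        {"name": "city", "type": "string"},
    ]}},
}
updated, statements = evolve_schema(pipeline, ["name", "age", "city", "email"], "people")
```

With the 4-column header, updated gains an email field of type string, the counter moves from 1 to 2, and statements holds a single ALTER TABLE line. A header with fewer columns than the schema produces no statements and leaves the counter alone.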
Source vs Destination Schemas
A source schema describes the data as it arrives (columns, JSON keys, database columns). It is always required. A destination schema describes the data as it should be written to the target system. It is optional. When omitted, the destination inherits the source schema unchanged. Use a destination schema when you need to:
- Widen a type (e.g., int to bigint) for the target table.
- Drop fields that should not reach the destination.
A destination field whose name has no match in the source schema is written as NULL. To carry a value under a different column name, rename or reshape the field in a transformation step before the destination schema is applied.
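One way to picture the relationship between the two schemas is as a projection of each source row onto the destination field list, matched by name. The sketch below rests on that reading and is not the platform's implementation: apply_destination_schema and the field keys are hypothetical, and the use of None for an unmatched destination field is an assumption drawn from the NULL behavior mentioned above.

```python
def apply_destination_schema(row, destination_fields):
    """Project one source row onto the destination fields, matching by name."""
    if destination_fields is None:
        # No destination schema: the source shape is inherited unchanged.
        return dict(row)
    # Source keys missing from destination_fields are dropped. Destination
    # names with no source key become None (assumed to be written as NULL).
    return {f["name"]: row.get(f["name"]) for f in destination_fields}


row = {"id": 7, "name": "Ada", "ssn": "000-00-0000"}
destination = [
    {"name": "id", "type": "bigint"},         # widened from int
    {"name": "name", "type": "string"},
    {"name": "full_name", "type": "string"},  # no source match
]
projected = apply_destination_schema(row, destination)
```

In projected, ssn is gone and full_name is None, which is why a rename belongs in a transformation step rather than in the destination schema.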