Schemas define the structure of data flowing through the pipeline. Every pipeline requires a source schema that describes the incoming fields and their types. Optionally, a destination schema can override types or rename fields when writing to a target.

Defining Schemas

Schemas are declared in the source.schemaProperties.fields array of a pipeline configuration. Each field entry specifies a name and a data type.
{
  "source": {
    "schemaProperties": {
      "fields": [
        { "name": "id", "type": "bigint" },
        { "name": "email", "type": "varchar(255)" },
        { "name": "signup_date", "type": "date" },
        { "name": "balance", "type": "decimal(12,2)" },
        { "name": "is_active", "type": "boolean" },
        { "name": "notes", "type": "string" }
      ]
    }
  }
}

Supported Data Types

When the UI or AI schema generation infers types, it uses these common types:
Type        Description
boolean     True/false value
int         32-bit signed integer
bigint      64-bit signed integer
float       32-bit floating point
double      64-bit floating point
string      Variable-length text
date        Calendar date (no time component)
timestamp   Date and time with microsecond precision

MCP Server (AI Agents)

When an AI agent creates a pipeline via the MCP server, all fields are set to string. This is intentional — it ensures data is always ingested successfully regardless of format inconsistencies. Type enforcement happens at the destination (e.g., PostgreSQL column types) rather than at ingestion time, avoiding failed jobs from unexpected values.
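
For illustration, a pipeline created through the MCP server might carry a source schema like the one below (the field names are hypothetical); typed columns are then enforced by the destination, for example the PostgreSQL table definition:
{
  "source": {
    "schemaProperties": {
      "fields": [
        { "name": "symbol", "type": "string" },
        { "name": "price", "type": "string" },
        { "name": "traded_at", "type": "string" }
      ]
    }
  }
}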

Advanced Types

These types are supported by the server but are never auto-generated. Use them when writing pipeline JSON configuration manually:
Type           Description
tinyint        8-bit signed integer
smallint       16-bit signed integer
decimal(p,s)   Fixed-precision decimal with p total digits and s scale digits
varchar(n)     Variable-length text with maximum length n
char(n)        Fixed-length text of exactly n characters
Refer to the data-types page for type mappings to PostgreSQL and Spark.

Column Naming Rules

Column names in pipeline schemaProperties.fields must match [A-Za-z0-9_]+ — letters, digits, and underscores only. Spaces, parentheses, percent signs, hyphens, and other punctuation are rejected by the pipeline validator at registration time. Names are also case-insensitive for duplicate detection: Foo and foo collide. Recommended convention: lowercase snake_case (e.g. customer_id, eps_estimate).
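
As an illustration of these rules (not the validator's actual implementation), the following Python sketch checks candidate column names for disallowed characters and case-insensitive duplicates:
import re

# Allowed pattern for pipeline field names: letters, digits, underscores only
FIELD_NAME_RE = re.compile(r"^[A-Za-z0-9_]+$")

def check_field_names(names):
    """Return a list of problems found in the candidate column names."""
    problems = []
    seen = {}  # lowercased name -> original spelling
    for name in names:
        if not FIELD_NAME_RE.match(name):
            problems.append(f"invalid characters in column name: {name!r}")
        lowered = name.lower()
        if lowered in seen:
            problems.append(f"duplicate column (case-insensitive): {name!r} vs {seen[lowered]!r}")
        else:
            seen[lowered] = name
    return problems

# 'EPS Estimate' fails the character rule; 'Foo' and 'foo' collide
print(check_field_names(["customer_id", "EPS Estimate", "Foo", "foo"]))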

Tap auto-normalization

When a tap returns records whose keys contain disallowed characters (common with pandas DataFrames or APIs that preserve human-readable column titles), the platform automatically normalizes them in TapScriptRunner before they enter the pipeline. The conversion runs in this order:
  1. Spell out semantically meaningful special characters as _word_ tokens (see table below)
  2. Replace any remaining non-alphanumeric character with _
  3. Collapse runs of _ into a single _
  4. Trim leading/trailing _
  5. Lowercase

Special character spell-out table

Character   Replacement   Example
%           percent       Surprise(%) → surprise_percent
#           num           Order# → order_num
$           dollars       Price$ → price_dollars
&           and           R&D Spending → r_and_d_spending
@           at            Email@Domain → email_at_domain
+           plus          Type A+ → type_a_plus
=           equals        a=b → a_equals_b
<           lt            a<b → a_lt_b
>           gt            a>b → a_gt_b
/           per           miles/hour → miles_per_hour
*           star          count* → count_star
^           pow           x^2 → x_pow_2
~           approx        ~Population → approx_population
!           bang          Score! → score_bang
Other special characters (spaces, hyphens, parentheses, brackets, dots, commas, colons, quotes, etc.) are simply replaced with _. So EPS Estimate becomes eps_estimate and Cost ($) becomes cost_dollars. This normalization only applies to tabular results (dataType: csv). JSON and XML results destined for MongoDB are passed through unchanged in the _json field. See Taps → Script Requirements for the script-side perspective.
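
For a concrete picture of the conversion order, here is a minimal Python sketch that reproduces the five steps above. It is an illustration of the rules only, not the TapScriptRunner implementation itself:
import re

# Step 1: spell-out map for semantically meaningful characters
SPELL_OUT = {
    "%": "percent", "#": "num", "$": "dollars", "&": "and", "@": "at",
    "+": "plus", "=": "equals", "<": "lt", ">": "gt", "/": "per",
    "*": "star", "^": "pow", "~": "approx", "!": "bang",
}

def normalize_column(name: str) -> str:
    # 1. Spell out special characters as _word_ tokens
    for char, word in SPELL_OUT.items():
        name = name.replace(char, f"_{word}_")
    # 2. Replace any remaining non-alphanumeric character with _
    name = re.sub(r"[^A-Za-z0-9_]", "_", name)
    # 3. Collapse runs of _ into a single _
    name = re.sub(r"_+", "_", name)
    # 4. Trim leading/trailing _
    name = name.strip("_")
    # 5. Lowercase
    return name.lower()

print(normalize_column("Surprise(%)"))   # surprise_percent
print(normalize_column("R&D Spending"))  # r_and_d_spending
print(normalize_column("EPS Estimate"))  # eps_estimate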

Auto-Generating a Schema

If you have a representative CSV file, the pipeline can infer a schema automatically using AI. POST the file to the /api/v1/pipeline/generate endpoint:
curl -X POST "http://localhost:8080/api/v1/pipeline/generate" \
  -F "file=@sample.csv" \
  -F "pipeline=my_pipeline"
The AI analyzes the file content (up to 100 lines for CSV, or 10,000 characters for JSON/XML) and returns a complete pipeline configuration with inferred field names and data types. You can edit the output before saving it to a pipeline configuration. In the UI, this happens automatically in Step 1 of the pipeline creation wizard when you upload a sample file and click “Analyze File”.

AI-Generated Validation Schemas

For JSON and XML pipelines, you can also generate validation schemas using AI:
  • JSON Schema (Draft 4) — for validating JSON data against an Everit-compatible schema
  • W3C XSD — for validating XML data against an XML Schema
In the UI (Step 4 — Data Quality), choose “Generate schema with AI”, enter a schema name, and provide sample data (or load it from your uploaded file). The AI generates a compliant schema that is stored in MinIO and referenced automatically in the pipeline config. Via API:
curl -X POST "http://localhost:8080/api/v1/config/generate-schema" \
  -H "Content-Type: application/json" \
  -d '{
    "type": "json-schema",
    "name": "stock_prices_schema",
    "sampleData": "{\"symbol\": \"AAPL\", \"price\": 150.25}"
  }'
Valid types: json-schema (generates Draft 4 JSON Schema) and xsd (generates W3C XSD). The generated schema is stored at {environment}-config/validation-schema/{name}.json or {name}.xsd.

Schema Evolution

Datris supports additive schema evolution — when new columns appear in uploaded data, the pipeline schema automatically expands to include them. No manual intervention or pipeline recreation is needed.

How it works

  1. You create a pipeline with a 3-column CSV (e.g., name, age, city)
  2. Later, you upload a CSV with a 4th column (e.g., name, age, city, email)
  3. Datris detects the new email column, adds it to the pipeline schema as type string, and runs ALTER TABLE on the PostgreSQL destination to add the column
  4. The data is processed with all 4 columns
  5. Previous rows have NULL for the new column; new rows have the full data
Each schema change increments a schemaVersion counter in the pipeline config, so you can track how the schema has evolved over time.

What triggers evolution

  • New columns in the CSV header that are not in the stored schema → automatically added as string type

What does NOT trigger evolution

  • Missing columns — if a CSV is missing a non-key column from the schema, the missing values are filled with empty strings (existing behavior). Key fields must always be present.
  • Column renames — treated as a removal + addition. The old column gets empty values, and the new column is added.
  • Type changes — new columns default to string. Changing an existing column’s type requires manual pipeline update.
  • JSON/XML data — these use a single _json or _xml field, so column-level evolution does not apply.

Example

# Create pipeline with 3-column data
curl -X POST http://localhost:8080/api/v1/pipeline/upload \
  -F "file=@data_v1.csv" -F "pipeline=customers"
# data_v1.csv: name,age,city

# Later, upload data with a new column
curl -X POST http://localhost:8080/api/v1/pipeline/upload \
  -F "file=@data_v2.csv" -F "pipeline=customers"
# data_v2.csv: name,age,city,email
# → Schema evolves automatically, ALTER TABLE adds "email" column
Check the evolved schema:
curl http://localhost:8080/api/v1/pipeline?pipeline=customers
# source.schemaProperties.schemaVersion: 2
# source.schemaProperties.fields: [name, age, city, email]

Source vs Destination Schemas

A source schema describes the data as it arrives (CSV columns, JSON keys, database columns). It is always required. A destination schema describes the data as it should be written to the target system. It is optional; when omitted, the destination inherits the source schema unchanged. Use a destination schema when you need to:
  • Widen a type (e.g., int to bigint) for the target table.
  • Rename a field between ingestion and storage.
  • Drop fields that should not reach the destination.
{
  "source": {
    "schemaProperties": {
      "fields": [
        { "name": "user_id", "type": "int" },
        { "name": "full_name", "type": "varchar(100)" }
      ]
    }
  },
  "destination": {
    "schemaProperties": {
      "fields": [
        { "name": "user_id", "type": "bigint" },
        { "name": "full_name", "type": "varchar(200)" }
      ]
    }
  }
}
When both schemas are present, the pipeline maps source fields to destination fields by position. Ensure the field count matches or use a transformation step to reconcile differences.
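Because the mapping is positional, renaming a field only requires changing the name at the same index in the destination schema. A hypothetical variant of the configuration above that renames full_name to customer_name on write:
{
  "source": {
    "schemaProperties": {
      "fields": [
        { "name": "user_id", "type": "int" },
        { "name": "full_name", "type": "varchar(100)" }
      ]
    }
  },
  "destination": {
    "schemaProperties": {
      "fields": [
        { "name": "user_id", "type": "bigint" },
        { "name": "customer_name", "type": "varchar(100)" }
      ]
    }
  }
}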