Schemas define the structure of data flowing through the pipeline. Every pipeline requires a source schema that describes the incoming fields and their types. Optionally, a destination schema can override types or rename fields when writing to a target.

Defining Schemas

Schemas are declared in the source.schemaProperties.fields array of a pipeline configuration. Each field entry specifies a name and a data type.
{
  "source": {
    "schemaProperties": {
      "fields": [
        { "name": "id", "type": "bigint" },
        { "name": "email", "type": "varchar(255)" },
        { "name": "signup_date", "type": "date" },
        { "name": "balance", "type": "decimal(12,2)" },
        { "name": "is_active", "type": "boolean" },
        { "name": "notes", "type": "string" }
      ]
    }
  }
}

Supported Data Types

When the UI or AI schema generation infers types, it uses these common types:
Type        Description
boolean     True/false value
int         32-bit signed integer
bigint      64-bit signed integer
float       32-bit floating point
double      64-bit floating point
string      Variable-length text
date        Calendar date (no time component)
timestamp   Date and time with microsecond precision

MCP Server (AI Agents)

When an AI agent creates a pipeline via the MCP server, all fields are set to string. This is intentional — it ensures data is always ingested successfully regardless of format inconsistencies. Type enforcement happens at the destination (e.g., PostgreSQL column types) rather than at ingestion time, avoiding failed jobs from unexpected values.
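
For illustration, a pipeline created through the MCP server might carry a source schema like the one below (the field names are hypothetical); typed columns are then enforced by the destination, for example the PostgreSQL table definition:
{
  "source": {
    "schemaProperties": {
      "fields": [
        { "name": "symbol", "type": "string" },
        { "name": "price", "type": "string" },
        { "name": "traded_at", "type": "string" }
      ]
    }
  }
}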

Advanced Types

These types are supported by the server but are never auto-generated. Use them when writing pipeline JSON configuration manually:
Type           Description
tinyint        8-bit signed integer
smallint       16-bit signed integer
decimal(p,s)   Fixed-precision decimal with p total digits and s scale digits
varchar(n)     Variable-length text with maximum length n
char(n)        Fixed-length text of exactly n characters
Refer to the data-types page for type mappings to PostgreSQL and Spark.

Column Naming Rules

Column names in pipeline schemaProperties.fields must match [A-Za-z0-9_]+ — letters, digits, and underscores only. Spaces, parentheses, percent signs, hyphens, and other punctuation are rejected by the pipeline validator at registration time. Names are also case-insensitive for duplicate detection: Foo and foo collide. Recommended convention: lowercase snake_case (e.g. customer_id, eps_estimate).
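
As an illustration of these rules (not the validator's actual implementation), the following Python sketch checks candidate column names for disallowed characters and case-insensitive duplicates:
import re

# Allowed pattern for pipeline field names: letters, digits, underscores only
FIELD_NAME_RE = re.compile(r"^[A-Za-z0-9_]+$")

def check_field_names(names):
    """Return a list of problems found in the candidate column names."""
    problems = []
    seen = {}  # lowercased name -> original spelling
    for name in names:
        if not FIELD_NAME_RE.match(name):
            problems.append(f"invalid characters in column name: {name!r}")
        lowered = name.lower()
        if lowered in seen:
            problems.append(f"duplicate column (case-insensitive): {name!r} vs {seen[lowered]!r}")
        else:
            seen[lowered] = name
    return problems

# 'EPS Estimate' fails the character rule; 'Foo' and 'foo' collide
print(check_field_names(["customer_id", "EPS Estimate", "Foo", "foo"]))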

Tap auto-normalization

When a tap returns records whose keys contain disallowed characters (common with pandas DataFrames or APIs that preserve human-readable column titles), the platform automatically normalizes them in TapScriptRunner before they enter the pipeline. The conversion runs in this order:
  1. Spell out semantically meaningful special characters as _word_ tokens (see table below)
  2. Replace any remaining non-alphanumeric character with _
  3. Collapse runs of _ into a single _
  4. Trim leading/trailing _
  5. Lowercase

Special character spell-out table

Character   Replacement   Example
%           percent       Surprise(%) → surprise_percent
#           num           Order# → order_num
$           dollars       Price$ → price_dollars
&           and           R&D Spending → r_and_d_spending
@           at            Email@Domain → email_at_domain
+           plus          Type A+ → type_a_plus
=           equals        a=b → a_equals_b
<           lt            a<b → a_lt_b
>           gt            a>b → a_gt_b
/           per           miles/hour → miles_per_hour
*           star          count* → count_star
^           pow           x^2 → x_pow_2
~           approx        ~Population → approx_population
!           bang          Score! → score_bang
Other special characters (spaces, hyphens, parentheses, brackets, dots, commas, colons, quotes, etc.) are simply replaced with _. So EPS Estimate becomes eps_estimate and Cost ($) becomes cost_dollars. This normalization only applies to tabular results (dataType: csv). JSON and XML results destined for MongoDB are passed through unchanged in the _json field. See Taps → Script Requirements for the script-side perspective.
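
For a concrete picture of the conversion order, here is a minimal Python sketch that reproduces the five steps above. It is an illustration of the rules only, not the TapScriptRunner implementation itself:
import re

# Step 1: spell-out map for semantically meaningful characters
SPELL_OUT = {
    "%": "percent", "#": "num", "$": "dollars", "&": "and", "@": "at",
    "+": "plus", "=": "equals", "<": "lt", ">": "gt", "/": "per",
    "*": "star", "^": "pow", "~": "approx", "!": "bang",
}

def normalize_column(name: str) -> str:
    # 1. Spell out special characters as _word_ tokens
    for char, word in SPELL_OUT.items():
        name = name.replace(char, f"_{word}_")
    # 2. Replace any remaining non-alphanumeric character with _
    name = re.sub(r"[^A-Za-z0-9_]", "_", name)
    # 3. Collapse runs of _ into a single _
    name = re.sub(r"_+", "_", name)
    # 4. Trim leading/trailing _
    name = name.strip("_")
    # 5. Lowercase
    return name.lower()

print(normalize_column("Surprise(%)"))   # surprise_percent
print(normalize_column("R&D Spending"))  # r_and_d_spending
print(normalize_column("EPS Estimate"))  # eps_estimate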

Auto-Generating a Schema

If you have a representative CSV file, the pipeline can infer a schema automatically using AI. POST the file to the /api/v1/pipeline/generate endpoint:
curl -X POST "http://localhost:8080/api/v1/pipeline/generate" \
  -F "file=@sample.csv" \
  -F "pipeline=my_pipeline"
The AI analyzes the file content (up to 100 lines for CSV, or 10,000 characters for JSON/XML) and returns a complete pipeline configuration with inferred field names and data types. You can edit the output before saving it to a pipeline configuration. In the UI, this happens automatically in Step 1 of the pipeline creation wizard when you upload a sample file and click “Analyze File”.

AI-Generated Validation Schemas

For JSON and XML pipelines, you can also generate validation schemas using AI:
  • JSON Schema (Draft 4) — for validating JSON data against an Everit-compatible schema
  • W3C XSD — for validating XML data against an XML Schema
In the UI (Step 4 — Data Quality), choose “Generate schema with AI”, enter a schema name, and provide sample data (or load it from your uploaded file). The AI generates a compliant schema that is stored in MinIO and referenced automatically in the pipeline config. Via API:
curl -X POST "http://localhost:8080/api/v1/config/generate-schema" \
  -H "Content-Type: application/json" \
  -d '{
    "type": "json-schema",
    "name": "stock_prices_schema",
    "sampleData": "{\"symbol\": \"AAPL\", \"price\": 150.25}"
  }'
Valid types: json-schema (generates Draft 4 JSON Schema) and xsd (generates W3C XSD). The generated schema is stored at {environment}-config/validation-schema/{name}.json or {name}.xsd.

Schema Evolution

Datris supports additive schema evolution — when new columns appear in uploaded data, the pipeline schema automatically expands to include them. No manual intervention or pipeline recreation is needed.

How it works

  1. You create a pipeline with a 3-column CSV (e.g., name, age, city)
  2. Later, you upload a CSV with a 4th column (e.g., name, age, city, email)
  3. Datris detects the new email column, adds it to the pipeline schema as type string, and runs ALTER TABLE on the PostgreSQL destination to add the column
  4. The data is processed with all 4 columns
  5. Previous rows have NULL for the new column; new rows have the full data
Each schema change increments a schemaVersion counter in the pipeline config, so you can track how the schema has evolved over time.

What triggers evolution

  • New columns in the CSV header that are not in the stored schema → automatically added as string type

What does NOT trigger evolution

  • Missing columns — if a CSV is missing a non-key column from the schema, the missing values are filled with empty strings (existing behavior). Key fields must always be present.
  • Column renames — treated as a removal + addition. The old column gets empty values, and the new column is added.
  • Type changes — new columns default to string. Changing an existing column’s type requires manual pipeline update.
  • JSON/XML data — these use a single _json or _xml field, so column-level evolution does not apply.

Example

# Create pipeline with 3-column data
curl -X POST http://localhost:8080/api/v1/pipeline/upload \
  -F "file=@data_v1.csv" -F "pipeline=customers"
# data_v1.csv: name,age,city

# Later, upload data with a new column
curl -X POST http://localhost:8080/api/v1/pipeline/upload \
  -F "file=@data_v2.csv" -F "pipeline=customers"
# data_v2.csv: name,age,city,email
# → Schema evolves automatically, ALTER TABLE adds "email" column
Check the evolved schema:
curl http://localhost:8080/api/v1/pipeline?pipeline=customers
# source.schemaProperties.schemaVersion: 2
# source.schemaProperties.fields: [name, age, city, email]

Source vs Destination Schemas

A source schema describes the data as it arrives (CSV columns, JSON keys, database columns). It is always required. A destination schema describes the data as it should be written to the target system. It is optional; when omitted, the destination inherits the source schema unchanged. Use a destination schema when you need to:
  • Widen a type (e.g., int to bigint) for the target table.
  • Rename a field between ingestion and storage.
  • Drop fields that should not reach the destination.
{
  "source": {
    "schemaProperties": {
      "fields": [
        { "name": "user_id", "type": "int" },
        { "name": "full_name", "type": "varchar(100)" }
      ]
    }
  },
  "destination": {
    "schemaProperties": {
      "fields": [
        { "name": "user_id", "type": "bigint" },
        { "name": "full_name", "type": "varchar(200)" }
      ]
    }
  }
}
When both schemas are present, the pipeline maps source fields to destination fields by position. Ensure the field count matches or use a transformation step to reconcile differences.
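Because the mapping is positional, renaming a field only requires changing the name at the same index in the destination schema. A hypothetical variant of the configuration above that renames full_name to customer_name on write:
{
  "source": {
    "schemaProperties": {
      "fields": [
        { "name": "user_id", "type": "int" },
        { "name": "full_name", "type": "varchar(100)" }
      ]
    }
  },
  "destination": {
    "schemaProperties": {
      "fields": [
        { "name": "user_id", "type": "bigint" },
        { "name": "customer_name", "type": "varchar(100)" }
      ]
    }
  }
}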