Data Quality
How ETLX models, executes, and observes data quality rules as first-class, metadata-driven validation processes.
Data Quality in ETLX
Data quality in ETLX is not an afterthought or an external process. It is a first-class execution model, designed to validate, observe, and optionally correct data using declarative, SQL-driven rules.
Rather than embedding ad-hoc checks inside transformations, ETLX treats data quality as:
- Executable metadata
- Auditable validation logic
- A governance primitive
This allows quality rules to be reasoned about, documented, versioned, and executed independently of ETL or ELT pipelines.
The DATA_QUALITY Execution Model
A data quality workflow is defined using a root Markdown block that declares its execution mode as DATA_QUALITY.
From this block, ETLX executes a deterministic sequence of validation rules, each modeled as an independent execution unit.
Conceptually, a DATA_QUALITY block answers three questions:
- What condition defines invalid data?
- How should violations be observed and reported?
- Can (and should) invalid data be fixed automatically?
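In a rule's metadata these questions map directly onto keys: the detection query defines what is invalid, the result column is what gets observed and reported, and an optional fix query states whether violations can be corrected. As a minimal sketch (the names quality_check_query and fix_quality_err_query refer to named SQL blocks, exactly as in the full example below):
```yaml
query: quality_check_query              # what condition defines invalid data
column: total_err                       # where the violation count is observed and reported
fix_quality_err: fix_quality_err_query  # optional: how invalid data is corrected
```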
Structure Overview
A DATA_QUALITY definition consists of:
Root block metadata
- Describes the purpose and activation state of the quality process
Validation rules
- Each rule is a level-two heading representing a single, focused quality check
SQL logic
- One query to detect violations
- An optional query to correct them
Data Quality Markdown Example
The example below illustrates a DATA_QUALITY workflow validating the TRIP_DATA dataset introduced in the Query Documentation section (NYC Yellow Taxi data).
# QUALITY_CHECK
```yaml
description: "Runs some queries to check quality / validate."
runs_as: DATA_QUALITY
active: true
```
## Rule0001
```yaml
name: Rule0001
description: "Check if the payment_type has only the values 0=Flex Fare, 1=Credit card, 2=Cash, 3=No charge, 4=Dispute, 5=Unknown, 6=Voided trip."
connection: "duckdb:"
before_sql: "ATTACH 'database/DB_EX_DGOV.db' AS DB (TYPE SQLITE)"
query: quality_check_query
fix_quality_err: fix_quality_err_query
column: total_err # Defaults to 'total'.
check_only: false # if true, runs only the quality check (no fix)
fix_only: false # if true, runs only the quality fix, when one is defined and applicable
after_sql: "DETACH DB"
active: true
```
```sql
-- quality_check_query
SELECT COUNT(*) AS "total_err"
FROM "TRIP_DATA"
WHERE "payment_type" NOT IN (0,1,2,3,4,5,6);
```
```sql
-- fix_quality_err_query
UPDATE "TRIP_DATA"
SET "payment_type" = 'default value'
WHERE "payment_type" NOT IN (0,1,2,3,4,5,6);
```
## Rule0002
```yaml
name: Rule0002
description: "Check if there is any trip with distance less than or equal to zero."
connection: "duckdb:"
before_sql: "ATTACH 'database/DB_EX_DGOV.db' AS DB (TYPE SQLITE)"
query: quality_check_query
fix_quality_err: null # no automated fixing for this
column: total_err # Defaults to 'total'.
after_sql: "DETACH DB"
active: true
```
```sql
-- quality_check_query
SELECT COUNT(*) AS "total_err"
FROM "TRIP_DATA"
WHERE NOT "trip_distance" > 0;
```
Rule Semantics
Each validation rule represents a contract that the data must satisfy.
A rule is defined by:
- Detection logic (`query`)
- Optional remediation logic (`fix_quality_err`)
- Execution controls (`check_only`, `fix_only`)
- Lifecycle hooks (`before_sql`, `after_sql`)
Rules are intentionally small and composable, encouraging one responsibility per rule.
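For instance, a check for missing pickup timestamps would be its own rule rather than an extra clause bolted onto Rule0002. The sketch below follows the same shape as the example above; the rule name is illustrative, and tpep_pickup_datetime is assumed to be the pickup-timestamp column of the NYC Yellow Taxi schema (adjust if your TRIP_DATA uses a different name):
## Rule0003
```yaml
name: Rule0003
description: "Check for trips with a missing pickup timestamp."
connection: "duckdb:"
before_sql: "ATTACH 'database/DB_EX_DGOV.db' AS DB (TYPE SQLITE)"
query: quality_check_query
column: total_err
after_sql: "DETACH DB"
active: true
```
```sql
-- quality_check_query
SELECT COUNT(*) AS "total_err"
FROM "TRIP_DATA"
WHERE "tpep_pickup_datetime" IS NULL;
```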
Execution Flow
For each active rule, ETLX executes the following steps:
Initialization
- Resolve connection
- Execute `before_sql`
Validation
- Execute the detection query
- Read the violation count from the configured column
Decision
- If violations = 0 → rule passes
- If violations > 0 → rule fails
Optional Fix
- If `fix_quality_err` is defined and allowed, execute the remediation SQL
Finalization
- Execute `after_sql`
- Record execution metadata
This flow is deterministic and fully observable.
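The `check_only` and `fix_only` keys from the rule metadata are what gate the Validation and Optional Fix steps. As an illustrative sketch (values only; the rest of the rule is unchanged):
```yaml
# Detection only: report violations but never change the data.
check_only: true
fix_only: false
```
Setting `fix_only: true` instead runs only the remediation step, provided `fix_quality_err` is defined for the rule.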
Query & Connection Resolution
Validation queries run using the rule's `connection`. If omitted, the DATA_QUALITY root connection is used (see the sketch below).
Queries can reference:
- Inline SQL
- Named SQL blocks
- Shared datasets such as `TRIP_DATA`
This allows quality rules to be applied consistently across:
- Extracted data
- Transformed outputs
- Final reporting tables
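As a sketch of the connection fallback, the connection can be declared once on the root block and omitted on individual rules; the rule below then inherits "duckdb:" from the root. This assumes the root block carries a `connection` key, as implied by the fallback described above:
```yaml
# Root block metadata: shared connection for all rules.
runs_as: DATA_QUALITY
connection: "duckdb:"
active: true
```
```yaml
# Rule metadata: no connection key, so the root's "duckdb:" is used.
name: Rule0001
query: quality_check_query
column: total_err
active: true
```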
Observability & Governance
Every DATA_QUALITY execution produces structured metadata, including:
- Rule name and description
- Execution timestamps and duration
- Number of violations detected
- Fix actions applied
- Errors and warnings
This metadata can be used to:
- Generate data quality reports
- Track SLA compliance
- Power dashboards
- Feed governance and audit systems
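For example, once execution metadata has been landed in a relational table, a quality report is just another query. The table and column names below (DQ_RESULTS, rule_name, total_errors) are placeholders for illustration, not a fixed ETLX schema:
```sql
-- Sketch: executions and total violations per rule from a hypothetical results table.
SELECT "rule_name",
       COUNT(*)            AS "executions",
       SUM("total_errors") AS "violations"
FROM "DQ_RESULTS"
GROUP BY "rule_name"
ORDER BY "violations" DESC;
```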
Design Principles
ETLX data quality rules are:
- Declarative — define intent, not control flow
- SQL-native — no DSLs, no abstractions hiding logic
- Deterministic — same input, same result
- Auditable — every action is recorded
- Composable — reusable across pipelines
What This Is (and Is Not)
This is:
- A data quality execution model
- A governance-friendly validation layer
- An observable contract enforcement mechanism
This is not:
- A black-box data quality tool
- A replacement for BI validation
- A rule engine detached from SQL
ETLX complements catalogs, observability platforms, and BI tools by enforcing quality at the source of execution.
Summary
In ETLX:
- Data quality is code, not comments
- Rules are first-class execution units
- SQL defines truth
- Metadata defines meaning
Data quality is not something you check once — it is something you continuously execute, observe, and govern.