DuckDB at the Core

ETLX is built around DuckDB as its execution engine. At its core, ETLX embraces a SQL-first philosophy, enabling powerful in-process analytics and transformations without requiring external compute engines or distributed systems.

DuckDB acts as the analytical backbone of the pipeline, executing transformations, validations, exports, and even orchestration-related logic using standard SQL.


🧠 Why DuckDB?

DuckDB is a modern analytical database designed for OLAP workloads, embedded directly into applications. This makes it a perfect fit for ETLX.

Key advantages:

  • Embedded & In-process – No external service to manage
  • Columnar execution – High performance for analytical queries
  • Rich SQL support – Window functions, CTEs, complex joins
  • Extensible – Load extensions (SQLite, Postgres, Excel, JSON, HTTP, Parquet, etc.)
  • Portable – Same SQL runs locally, in CI, or in production

ETLX leverages all of these features while keeping the workflow declarative and reproducible.

🔹 SQL-First Transformations

In ETLX, SQL is the primary transformation language.

This means:

  • Business logic is written in plain SQL
  • Transformations are self-documented
  • Pipelines are easier to review, audit, and version-control

Example:

  SELECT
  customer_id,
  SUM(amount) AS total_spent,
  COUNT(*) AS total_orders
FROM sales
GROUP BY customer_id;
  

No DSLs. No custom operators. Just SQL.

🔹 In-Process Analytics

Because DuckDB runs inside the ETLX process, analytics are:

  • Executed in-memory or on local disk
  • Free from network latency
  • Deterministic and easy to debug

This enables advanced use cases such as:

  • Ad-hoc analytics during ETL
  • Data quality checks
  • Profiling and statistics generation
  • Report aggregation
  • Metadata-driven validation

All without shipping data to an external engine.

🔹 One Engine, Multiple Data Sources

DuckDB allows ETLX to query multiple data sources using SQL:

  • Local files (CSV, Parquet, JSON, Excel)
  • SQLite databases
  • Postgres / MySQL (via extensions)
  • HTTP / S3-compatible storage

Example:

  SELECT *
FROM read_parquet('s3://bucket/data/*.parquet');
  

This enables federated analytics while keeping a single execution engine.

🔹 Foundation for ETLX Features

DuckDB powers almost every ETLX feature:

  • ETL / ELT transformations
  • DATA_QUALITY validations
  • MULTI_QUERIES aggregation
  • EXPORTS (CSV, Excel, templates)
  • LOGS persistence
  • SCRIPTS execution

By standardizing on DuckDB, ETLX ensures consistent behavior across all pipeline stages.

🎯 Design Philosophy

If it can be expressed in SQL, ETLX should execute it.

DuckDB enables ETLX to remain:

  • Minimal – fewer moving parts
  • Transparent – SQL is visible and inspectable
  • Performant – optimized analytical execution
  • Portable – runs anywhere

✅ Summary

  • DuckDB is the core execution engine of ETLX
  • Enables SQL-first, declarative pipelines
  • Provides in-process analytics with no external dependencies
  • Powers transformations, validations, exports, and observability

DuckDB is not just a dependency in ETLX — it is the foundation.

Last updated 05 Jan 2026, 15:03 -01 . history

Was this page helpful?