As someone deeply embedded in the Microsoft Fabric ecosystem, I’ve spent considerable time building scalable dataflows, orchestrating pipelines, and structuring data in Lakehouses. However, a recent project required me to pivot and explore the AWS data platform to evaluate how self-service ETL (Extract, Transform, Load) can be implemented using Amazon’s native services.
This experience gave me the opportunity to compare AWS’s offerings with Fabric’s, especially around self-service data preparation and enterprise-grade transformation. The result? A powerful set of tools that enable data engineers and citizen data analysts alike to design, manage, and visualize ETL pipelines without heavy code dependency.
Let’s take a deep dive into the architecture I implemented — shown in the diagram below — and break down each component:
🧱 Architectural Overview: Layered Data Lake + Glue + Athena

The architecture follows a bronze/silver/gold data lake pattern (often called the medallion architecture), supported entirely by AWS native services:
🔹 Phase 1: Data Extraction & Load (Raw Ingestion)
At the ingestion stage, data is sourced from multiple relational databases, such as:
- Oracle
- SQL Server
- PostgreSQL
- MySQL
Each source pushes untransformed data into Amazon S3, organized into the Bronze Layer. This layer serves as a raw, immutable landing zone, with data typically stored in CSV, JSON, or Parquet format. Keeping raw data separate is a critical design choice: it enables traceability and reprocessing when needed.
You can ingest this data via any of the following (a minimal sketch of the Lambda option follows this list):
- AWS Glue connectors
- AWS DMS (Database Migration Service)
- Custom Lambda jobs or Step Functions
- Third-party ELT tools like Fivetran or Matillion
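Here is that Lambda sketch. The bucket name, key layout, and event shape are assumptions for illustration, not prescribed by the architecture:

```python
import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

# Hypothetical bucket name; in practice this comes from configuration.
BRONZE_BUCKET = "my-datalake-bronze"

def lambda_handler(event, context):
    """Land a raw extract in the Bronze layer, untransformed.

    `event["source_system"]` and `event["records"]` are assumed to be
    supplied by an upstream extractor (e.g. a Step Functions state).
    """
    now = datetime.now(timezone.utc)
    # Partition the landing zone by source system and load date
    # so Bronze stays traceable and easy to reprocess.
    key = (
        f"{event['source_system']}/"
        f"load_date={now:%Y-%m-%d}/extract_{now:%H%M%S}.json"
    )
    s3.put_object(
        Bucket=BRONZE_BUCKET,
        Key=key,
        Body=json.dumps(event["records"]).encode("utf-8"),
    )
    return {"bronze_key": key}
```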
🔹 Phase 2: Data Transformation (Self-Service ETL)
Once raw data lands in S3, transformation begins in a controlled, repeatable fashion using AWS Glue DataBrew and AWS Glue ETL.
✅ AWS Glue DataBrew
Glue DataBrew is a no-code data preparation tool that enables users (data analysts, BI developers) to:
- Profile data with automated statistics
- Clean nulls, rename columns, detect data types
- Apply basic transformations through 250+ built-in operations
- Schedule jobs natively, or orchestrate and monitor them with AWS Step Functions
The output of DataBrew flows into the Silver Layer — cleaned and normalized datasets ready for enrichment or downstream processing.
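DataBrew runs can also be triggered programmatically. A minimal sketch, assuming a recipe job named clean-customers-to-silver (a placeholder) has already been authored against the Bronze dataset:

```python
import boto3

databrew = boto3.client("databrew")

# Placeholder job name; the recipe job itself (source dataset, recipe
# steps, Silver output location) is assumed to exist already.
JOB_NAME = "clean-customers-to-silver"

# Start a run; the job's configured output lands in the Silver layer.
run = databrew.start_job_run(Name=JOB_NAME)

# Inspect the run state (RUNNING, SUCCEEDED, FAILED, ...).
status = databrew.describe_job_run(Name=JOB_NAME, RunId=run["RunId"])
print(status["State"])
```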
✅ AWS Glue ETL
For more complex transformation logic involving joins, window functions, or custom Python/Scala code, Glue ETL comes into play.
Glue ETL offers:
- Serverless Spark-based execution
- Python or Scala scripting (PySpark, Spark SQL)
- Integration with AWS Glue Jobs and Workflows
Data processed via Glue ETL is stored in the Gold Layer, which holds enriched, query-optimized, and business-ready datasets. These datasets are often partitioned, compressed (Snappy, Gzip), and stored in columnar formats like Parquet for cost-efficient analytics.
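As an illustration, here is a trimmed-down Glue job script in PySpark. The database, table, column, and path names are hypothetical; it joins two Silver tables from the catalog and writes the result to the Gold layer as partitioned, Snappy-compressed Parquet:

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import Join
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Standard Glue job boilerplate.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read two cleaned Silver tables via the Glue Data Catalog
# (database/table names are placeholders).
orders = glue_context.create_dynamic_frame.from_catalog(
    database="silver_db", table_name="orders"
)
customers = glue_context.create_dynamic_frame.from_catalog(
    database="silver_db", table_name="customers"
)

# Enrich orders with customer attributes.
enriched = Join.apply(orders, customers, "customer_id", "customer_id")

# Persist to the Gold layer: partitioned by order_date,
# columnar Parquet with Snappy compression.
glue_context.write_dynamic_frame.from_options(
    frame=enriched,
    connection_type="s3",
    connection_options={
        "path": "s3://my-datalake-gold/orders_enriched/",
        "partitionKeys": ["order_date"],
    },
    format="parquet",
    format_options={"compression": "snappy"},
)
job.commit()
```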
💡 A common practice is to orchestrate both Glue DataBrew and Glue ETL pipelines using AWS Step Functions, ensuring sequential execution, conditional logic, and failure handling across multiple stages.
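As a sketch of what that orchestration might look like (the job names, state machine name, and IAM role are all placeholders), a two-state machine can run the DataBrew job and then the Glue job, using the synchronous `.sync` service integrations so each state waits for its job to finish and fails the execution if the job fails:

```python
import json

import boto3

sfn = boto3.client("stepfunctions")

# Run the (placeholder) DataBrew job, then the (placeholder) Glue job.
definition = {
    "StartAt": "CleanWithDataBrew",
    "States": {
        "CleanWithDataBrew": {
            "Type": "Task",
            "Resource": "arn:aws:states:::databrew:startJobRun.sync",
            "Parameters": {"Name": "clean-customers-to-silver"},
            "Next": "EnrichWithGlueETL",
        },
        "EnrichWithGlueETL": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "enrich-orders-to-gold"},
            "End": True,
        },
    },
}

sfn.create_state_machine(
    name="bronze-to-gold-pipeline",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/StepFunctionsEtlRole",  # placeholder
)
```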
🔹 Phase 3: Metadata & Query Layer
🧬 AWS Glue Data Catalog
As datasets evolve across the Bronze → Silver → Gold stages, managing metadata becomes crucial. This is where the AWS Glue Data Catalog provides immense value:
- Schema registry for S3 datasets
- Partition indexing
- Table definitions compatible with Athena, EMR, and Redshift Spectrum
Populated by crawlers that scan the S3 buckets, or by tables registered manually, the catalog provides a single source of truth for all datasets.
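For example, a crawler over the Silver prefix can be defined and started with a few boto3 calls; the crawler, role, database, and path names here are placeholders:

```python
import boto3

glue = boto3.client("glue")

# Crawl the Silver prefix and register/update tables in silver_db.
glue.create_crawler(
    Name="silver-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder
    DatabaseName="silver_db",
    Targets={"S3Targets": [{"Path": "s3://my-datalake-silver/"}]},
    # Re-crawl nightly so new partitions are indexed automatically.
    Schedule="cron(0 2 * * ? *)",
)
glue.start_crawler(Name="silver-crawler")
```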
🔹 Phase 4: Reporting & Analytics
The final component is Amazon Athena, a serverless SQL engine that queries data directly in S3, using the Glue Data Catalog for table definitions and supporting ANSI SQL.
Key benefits:
- No infrastructure to manage
- Pay-per-query model (based on data scanned)
- Fast performance with Parquet + partitioning
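The same tables are reachable programmatically as well. A minimal sketch, assuming a Gold database `gold_db`, a table `orders_enriched`, and a results bucket you own (all placeholder names):

```python
import time

import boto3

athena = boto3.client("athena")

query = athena.start_query_execution(
    QueryString=(
        "SELECT order_date, SUM(amount) AS revenue "
        "FROM orders_enriched "
        "GROUP BY order_date"
    ),
    QueryExecutionContext={"Database": "gold_db"},
    # Athena writes its result files to this S3 prefix.
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
execution_id = query["QueryExecutionId"]

# Simple polling loop; production code would back off and surface
# the failure reason on FAILED.
while True:
    status = athena.get_query_execution(QueryExecutionId=execution_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    result = athena.get_query_results(QueryExecutionId=execution_id)
    # The first row returned is the header row.
    print(f"Fetched {len(result['ResultSet']['Rows']) - 1} rows")
```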
Using Power BI, I was able to connect directly to Athena via the ODBC connector, enabling interactive dashboards and ad-hoc reporting on top of the transformed datasets — just like in a traditional SQL environment.
🔍 Why This Matters: Enabling Self-Service at Scale
This architecture represents a shift towards self-service data engineering in AWS. Similar to Microsoft Fabric’s Dataflows and Pipelines, AWS enables technical and semi-technical users to collaborate on ETL workloads with minimal DevOps overhead.
✅ Benefits of This Design:
- Separation of concerns via Bronze/Silver/Gold layers
- Reusability of data assets across teams
- Serverless architecture = zero infrastructure management
- Low-code/no-code options with Glue DataBrew
- Unified metadata with Glue Data Catalog
- Seamless reporting via Athena → Power BI
⚖️ Final Thoughts
While Microsoft Fabric excels with its seamless integration between OneLake, Power BI, and Pipelines, AWS offers an equally powerful modular approach. The key difference lies in how much abstraction you need: Fabric is opinionated and integrated; AWS is flexible and granular.
If your organization is already on AWS, embracing tools like Glue DataBrew, Glue ETL, and Athena can enable a highly scalable and user-friendly data platform — without needing to set up and maintain Spark clusters or Airflow DAGs manually.
Would I use it again? Absolutely. Especially for teams looking to democratize ETL without giving up enterprise-scale governance and performance.