Processing & Ingestion#

This section explores how data is acquired, moved, and prepared for analysis. We begin with a brief overview of key concepts—such as ETL/ELT workflows, streaming ingestion, and change data capture (CDC)—then dive into practical tools and solutions used across modern data platforms.

Topics include:

  • Power Query (M Language) for data transformation
  • Microsoft Fabric Data Factory pipeline design
  • Apache Spark and Flink for distributed processing
  • Apache Kafka and NiFi for real-time data ingestion

ETL/ELT#

ETL (Extract, Transform, Load) has long been a cornerstone of Business Intelligence (BI) solutions. Traditionally, transactional data is extracted from an OLTP (Online Transaction Processing) system, transformed into an OLAP-compatible schema, and then loaded into an OLAP (Online Analytical Processing) data store. In this model, a dedicated ETL server handles all three steps. For example, an Informatica server might extract transactional data from an Oracle database, clean and reshape it, and then load it into a SQL Server-based data mart for use with SSAS (SQL Server Analysis Services).

[Figure: ETL workflow — a dedicated ETL server extracts from the OLTP source, transforms, and loads into the OLAP data store]
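
As a concrete (if simplified) illustration, the sketch below implements the three ETL steps in Python with pandas and SQLAlchemy. The connection strings, table names, and transformation logic are all assumed placeholders, not a reference implementation of any particular stack:

```python
import pandas as pd
from sqlalchemy import create_engine

# Assumed connection strings; replace with real hosts and credentials.
OLTP_URL = "oracle+oracledb://etl_user:secret@oltp-host:1521/?service_name=ORCLPDB"
OLAP_URL = "mssql+pyodbc://etl_user:secret@olap-host/DataMart?driver=ODBC+Driver+18+for+SQL+Server"

def run_etl() -> None:
    oltp = create_engine(OLTP_URL)
    olap = create_engine(OLAP_URL)

    # Extract: pull raw transactional rows from the OLTP source.
    orders = pd.read_sql(
        "SELECT order_id, customer_id, amount, order_date FROM orders", oltp
    )

    # Transform: clean and reshape on the ETL server itself
    # (this is the step that can make the ETL server a bottleneck).
    orders = orders.dropna(subset=["customer_id"])
    orders["sales_date"] = pd.to_datetime(orders["order_date"]).dt.date
    daily = (
        orders.groupby(["customer_id", "sales_date"], as_index=False)["amount"]
        .sum()
        .rename(columns={"amount": "daily_total"})
    )

    # Load: write the OLAP-shaped result into the data mart.
    daily.to_sql("fact_daily_sales", olap, if_exists="append", index=False)

if __name__ == "__main__":
    run_etl()
```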

However, ETL servers—like Informatica in this case—can become performance bottlenecks, especially when heavy transformations are required. Meanwhile, OLAP servers are typically more powerful and better suited for intensive computation. This led to the rise of ELT (Extract, Load, Transform), where data is first extracted from the OLTP system and loaded directly into the OLAP server. The transformation step is then performed within the OLAP environment itself, often leveraging its superior processing capabilities.

In an ELT scenario, the standalone ETL server may be eliminated entirely. For instance, SSIS (SQL Server Integration Services) running on the OLAP server might extract data from Oracle, load it into staging tables, perform transformations to align with the OLAP schema, and finally load the cleaned data into the data mart.

[Figure: ELT workflow — data is loaded into staging on the OLAP server, then transformed in place]
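
The same pipeline reworked as ELT might look like the sketch below: extract and load move raw rows into a staging table, and the transformation runs as set-based SQL inside the OLAP engine. Again, the connection strings, the stg_orders staging table, and the fact_daily_sales target are illustrative assumptions:

```python
import pandas as pd
from sqlalchemy import create_engine, text

# Assumed connection strings; replace with real hosts and credentials.
OLTP_URL = "oracle+oracledb://etl_user:secret@oltp-host:1521/?service_name=ORCLPDB"
OLAP_URL = "mssql+pyodbc://etl_user:secret@olap-host/DataMart?driver=ODBC+Driver+18+for+SQL+Server"

def run_elt() -> None:
    oltp = create_engine(OLTP_URL)
    olap = create_engine(OLAP_URL)

    # Extract and Load: move raw rows straight into a staging table,
    # deferring all reshaping to the OLAP server.
    raw = pd.read_sql(
        "SELECT order_id, customer_id, amount, order_date FROM orders", oltp
    )
    raw.to_sql("stg_orders", olap, if_exists="replace", index=False)

    # Transform: run the reshaping as set-based SQL *inside* the OLAP
    # engine, where the compute capacity usually lives.
    with olap.begin() as conn:
        conn.execute(text("""
            INSERT INTO fact_daily_sales (customer_id, sales_date, daily_total)
            SELECT customer_id, CAST(order_date AS DATE), SUM(amount)
            FROM stg_orders
            WHERE customer_id IS NOT NULL
            GROUP BY customer_id, CAST(order_date AS DATE)
        """))

if __name__ == "__main__":
    run_elt()
```

Note that the heavy lifting (the GROUP BY and aggregation) now happens where the data already lives, which is the whole point of ELT.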

Data Ingestion#

In recent years, the term “data ingestion” has gained popularity, often replacing traditional references to ETL and ELT. This shift reflects a broader change in how data systems are designed and operated. Rather than focusing solely on transformation workflows, ingestion emphasizes the act of acquiring and moving data from diverse sources into centralized platforms—whether for immediate analysis, long-term storage, or downstream processing.

Modern data ingestion is often continuous, scalable, and real-time. It supports a wide variety of data types and formats, including structured, semi-structured, and unstructured data. Technologies like Apache Kafka, Apache NiFi, and cloud-native services such as Microsoft Fabric Data Factory enable ingestion from APIs, databases, logs, IoT devices, and more.
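
As a minimal example of event-based ingestion, the sketch below publishes a JSON event to a Kafka topic using the confluent-kafka Python client. The broker address and the iot-events topic name are assumptions for illustration:

```python
import json
from confluent_kafka import Producer

# Assumed local broker; adjust for your cluster.
producer = Producer({"bootstrap.servers": "localhost:9092"})

def delivery_report(err, msg):
    # Called once per message to confirm delivery or surface an error.
    if err is not None:
        print(f"Delivery failed: {err}")

# A sensor reading, standing in for any API, log, or IoT event.
event = {"device_id": "sensor-42", "temperature": 21.7, "ts": "2024-01-01T12:00:00Z"}

producer.produce(
    "iot-events",                       # assumed topic name
    key=event["device_id"].encode(),
    value=json.dumps(event).encode(),
    callback=delivery_report,
)
producer.flush()  # block until all queued messages are delivered
```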

Unlike traditional ETL/ELT pipelines, ingestion systems are designed to handle high-throughput, low-latency data flows. They often incorporate features like schema evolution, fault tolerance, and event-driven processing. This makes them ideal for streaming analytics, operational dashboards, and machine learning pipelines.
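
On the consuming side, an event-driven ingestion loop might look like the following sketch, which polls the assumed iot-events topic and hands each event to downstream processing. The broker, group id, and topic are placeholders:

```python
import json
from confluent_kafka import Consumer

# Assumed broker, consumer group, and topic; all are placeholders.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "dashboard-feed",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["iot-events"])

try:
    while True:
        msg = consumer.poll(timeout=1.0)  # event-driven: wait for new data
        if msg is None:
            continue
        if msg.error():
            print(f"Consumer error: {msg.error()}")
            continue
        event = json.loads(msg.value())
        # Hand off to a dashboard, feature store, or alerting rule here.
        print(f"{event['device_id']}: {event['temperature']}")
finally:
    consumer.close()
```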

Instead of tightly coupling ingestion with transformation, as traditional ETL/ELT does, modern systems treat ingestion as a standalone, scalable process. Raw data is collected from distributed sources and delivered into centralized platforms such as data warehouses, lakehouses, and data lakes, which serve as the foundation for analytics, machine learning, and operational intelligence. By decoupling ingestion from transformation, teams gain agility, fault tolerance, and the ability to evolve each stage of their data workflows independently, as the sketch below illustrates.
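
One minimal way to picture this decoupling: ingestion writes events to a raw landing zone untouched, and transformation jobs read from that zone on their own schedule. The local path below is a stand-in for a real data-lake location such as ADLS or S3, and the layout is purely an assumption:

```python
import json
from datetime import datetime, timezone
from pathlib import Path

# Assumed local path standing in for a data-lake landing zone.
LANDING_ZONE = Path("datalake/raw/iot-events")

def land_raw_event(event: dict) -> Path:
    """Ingestion step: persist the event exactly as received.

    No transformation happens here; downstream jobs read from the
    landing zone on their own schedule, so the two stages can scale
    and evolve independently.
    """
    now = datetime.now(timezone.utc)
    partition = LANDING_ZONE / f"date={now:%Y-%m-%d}"  # date-partitioned layout
    partition.mkdir(parents=True, exist_ok=True)
    path = partition / f"{now:%H%M%S%f}.json"
    path.write_text(json.dumps(event))
    return path

land_raw_event({"device_id": "sensor-42", "temperature": 21.7})
```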

In short, “data ingestion” has become the go-to term for the first stage of modern data architecture. It represents a flexible, technology-neutral approach that supports both batch and streaming pipelines.

ETL/ELT vs. Ingestion#

| Aspect | ETL / ELT | Data Ingestion |
|---|---|---|
| Focus | Transformation-centric | Movement-centric |
| Workflow Coupling | Tightly coupled (Extract → Transform → Load) | Decoupled from transformation |
| Processing Style | Mostly batch | Batch and streaming |
| Tools | Informatica, SSIS, Talend | Apache Kafka, Apache NiFi, Fabric Data Factory |
| Destination Platforms | Data warehouses, OLAP systems | Data lakes, lakehouses, warehouses |
| Scalability | Limited by ETL server | Horizontally scalable |
| Latency | Higher latency | Low-latency, real-time capable |
| Use Cases | BI reporting, historical analysis | Streaming analytics, ML pipelines, operational BI |