
Modernizing ETL: Informatica to Databricks

Written by Ashish Baghel | Jun 24, 2025 2:18:39 PM

PowerCenter data pipeline components

PowerCenter ETL processes consist of workflows, mappings, transformations, sources, and targets. Together these elements define how data is ingested, how it is transformed, and where it is delivered. For example, a workflow might extract customer data from an Oracle database, apply transformations to clean and enrich it, and then load the final output into a SQL Server-based reporting system.
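
To make this concrete, here is a minimal PySpark sketch of what that same workflow might look like once rebuilt on Databricks. The connection URLs, credentials, and table and column names below are placeholders, not values from a real migration.

```python
# Minimal PySpark sketch of the workflow described above: extract customer
# records from Oracle, clean and enrich them, and load the result into a
# SQL Server reporting table. All connection details and column names are
# hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("customer_etl").getOrCreate()

# Extract: read the source table over JDBC (the Oracle driver must be available).
customers = (
    spark.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@//oracle-host:1521/ORCL")  # placeholder
    .option("dbtable", "CRM.CUSTOMERS")                          # placeholder
    .option("user", "etl_user")
    .option("password", "etl_password")
    .load()
)

# Transform: basic cleansing and enrichment.
cleaned = (
    customers
    .dropna(subset=["customer_id"])                    # drop incomplete records
    .withColumn("email", F.lower(F.trim("email")))     # normalize email addresses
    .withColumn("load_date", F.current_date())         # add an audit column
)

# Load: write the result into the SQL Server reporting database.
(
    cleaned.write.format("jdbc")
    .option("url", "jdbc:sqlserver://sql-host:1433;databaseName=Reporting")  # placeholder
    .option("dbtable", "dbo.customers_clean")
    .option("user", "report_user")
    .option("password", "report_password")
    .mode("overwrite")
    .save()
)
```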

Workflows: In PowerCenter, a workflow is the process that manages how data is moved and transformed from source to destination. It defines the steps involved in extracting data (for instance, customer records), applying the necessary changes or clean-up, and then loading it into the final system, such as a reporting database. For a business user, think of it as the automated pipeline that ensures the right data is collected, processed, and delivered where it’s needed, accurately and on time.

Mappings: A mapping is the core logic of an ETL process: it defines what data is moved, how it is changed, and where it goes. For example, a mapping might take raw sales data from an input file, filter out incomplete records, calculate total sales, and then push the clean, enriched data into a sales dashboard.
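
Expressed in Databricks terms, that mapping might look roughly like the following DataFrame sketch; the file path, column names, and output table are illustrative assumptions.

```python
# Sketch of the mapping described above: read raw sales records, filter out
# incomplete rows, compute total sales per product, and publish the result
# for a dashboard. Paths and names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sales_mapping").getOrCreate()

raw_sales = (
    spark.read.option("header", "true")
    .option("inferSchema", "true")
    .csv("/data/raw/sales.csv")                       # placeholder input file
)

sales_summary = (
    raw_sales
    # Filter out incomplete records.
    .filter(F.col("quantity").isNotNull() & F.col("unit_price").isNotNull())
    # Calculate the sales amount per row, then total it per product.
    .withColumn("sale_amount", F.col("quantity") * F.col("unit_price"))
    .groupBy("product_id")
    .agg(F.sum("sale_amount").alias("total_sales"))
)

# Publish the clean, enriched result where the dashboard can read it.
sales_summary.write.mode("overwrite").saveAsTable("analytics.sales_summary")  # placeholder table
```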

Transformations: Transformations live inside mappings; each one applies a specific rule or instruction to the data as it flows through the pipeline. Transformations come in many types, including data cleansing, calculations, and record updates.
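
As a rough sketch of how these transformation types translate to Databricks, the example below combines simple cleansing, a calculated column, and a record update via a Delta Lake MERGE (Delta is built into Databricks; the table and column names here are hypothetical).

```python
# Sketch of common transformation types as DataFrame operations:
# data cleansing, a calculated column, and record updates via Delta MERGE.
# Table and column names are illustrative placeholders.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("transformations").getOrCreate()

orders = spark.table("staging.orders")  # placeholder source table

cleaned = (
    orders
    # Data cleansing: trim whitespace and standardize a status code.
    .withColumn("status", F.upper(F.trim("status")))
    # Data calculation: derive an order total from quantity and price.
    .withColumn("order_total", F.col("quantity") * F.col("unit_price"))
)

# Updating existing records: upsert into a Delta target keyed on order_id.
target = DeltaTable.forName(spark, "warehouse.orders")  # placeholder target table
(
    target.alias("t")
    .merge(cleaned.alias("s"), "t.order_id = s.order_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```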

Source: Every PowerCenter workflow starts with incoming data from sources such as flat files (for example, CSV), databases like Oracle, SQL Server, or Teradata, and mainframes.

Target: The target is usually a data warehouse, data mart, or operational system used for reporting and analysis.
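
For comparison, a minimal sketch of sources and targets in Databricks terms might look like this; the paths, JDBC URLs, and table names are placeholders.

```python
# Sketch of sources and targets: a flat file and a JDBC database on the way in,
# a governed warehouse table on the way out. All names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sources_and_targets").getOrCreate()

# Source: a flat file (CSV) ...
flat_file = spark.read.option("header", "true").csv("/data/landing/accounts.csv")

# ... or a relational database read over JDBC.
db_table = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://sql-host:1433;databaseName=Sales")  # placeholder
    .option("dbtable", "dbo.accounts")
    .option("user", "reader")
    .option("password", "secret")
    .load()
)

# Target: a table in the warehouse or data mart used for reporting.
flat_file.write.mode("append").saveAsTable("reporting.accounts")  # placeholder
```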

Challenges of Legacy PowerCenter Data Pipelines

Legacy Informatica PowerCenter pipelines were built in a time when data was mostly structured (like tables in databases), stored on-premise (in company data centers), and processed in batches (once a day or a few times a week). For many years, this setup worked well, especially for traditional reporting and dashboards based on historical data. However, several challenges have arisen in the last few years. 


  • High cost, since PowerCenter requires specialized skills, dedicated hardware, and licensing
  • Slow to change: adding new data sources or modifying pipeline logic can take weeks
  • Poor integration with modern platforms like Snowflake or real-time analytics systems
  • Batch-based processing and limited real-time support make it unsuitable for powering AI models
  • Legacy pipelines become bottlenecks as data volumes surge

The advent of the Databricks Lakehouse architecture

The Databricks Lakehouse architecture is built from the ground up to handle modern data challenges that legacy PowerCenter pipelines struggle with. It combines the best of data lakes and data warehouses into a single, unified platform, designed for the cloud, real-time analytics, and AI/ML workloads.

The Databricks Lakehouse architecture combines the best of:

  1. Data Lakes to store structured or unstructured, raw or refined data
  2. Data Warehouses to provide structure, reliability, and governance needed for reporting and analytics
  3. AI Platforms for enabling machine learning, data science, and real-time decisions
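
As a minimal sketch of how these three layers come together in practice, the example below lands raw data in a Delta table, queries it with SQL for reporting, and hands the curated result to a machine learning workflow. The paths and table names are illustrative, and the Delta format is assumed to be available (as it is by default on Databricks).

```python
# Minimal sketch of the lakehouse idea on Delta Lake: one copy of the data
# serves raw storage, SQL analytics, and ML feature preparation.
# Paths and table names are illustrative placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lakehouse_sketch").getOrCreate()

# 1. Data lake: land raw events as an open-format Delta table.
raw_events = spark.read.json("/data/landing/events/")          # placeholder path
raw_events.write.format("delta").mode("append").saveAsTable("bronze.events")

# 2. Data warehouse: the same table supports governed SQL analytics.
event_counts = spark.sql("""
    SELECT event_type, COUNT(*) AS events
    FROM bronze.events
    GROUP BY event_type
""")
event_counts.write.format("delta").mode("overwrite").saveAsTable("gold.event_counts")

# 3. AI/ML: the curated table can feed a training or data science workflow directly.
features = spark.table("gold.event_counts").toPandas()
```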