Common problems with data pipelines and the hydration of a data lake include:
- Fragmented Development: Individual pipelines are built in isolation, leading to inconsistencies and redundancy, and preventing organized growth and the fast implementation of new use cases.
- Centralized Bottleneck: Data practitioners often rely on data engineers to build and manage pipelines, creating a bottleneck and hindering agility.
- Limited Scalability and Flexibility: Scaling and adapting pipelines to new use cases can be challenging and time-consuming.
- Cost Inefficiency: Orchestration tools like Apache Airflow can be costly to operate at scale, so some organizations need cost-effective alternatives.
Data teams often end up with technical debt around CI/CD, infrastructure as code (IaC), observability, and the least privilege principle. Establishing a foundational data platform that proactively addresses these gaps would let teams concentrate their efforts on building their data pipelines.
Key parts of a pipeline
A platform addresses these challenges by introducing a paradigm shift in data platform development. Its core design principles are:
1. CI/CD and Parameter-Driven Approach:
Data Platform Automation (DPA) of the underlying (cloud) infrastructure leverages a multi-repository strategy, with dedicated repositories for:
- Orchestration Framework: Maintained by analytics engineers to provide seamless, extensible orchestration and execution infrastructure.
- Data Model: Directly used by end data practitioners to manage data models, schemas, and Dataplex metadata.
- Data Orchestration: Directly used by end data practitioners to define and deploy data pipelines using levels, threads, and steps.
- Data Transformation: Directly used by end data practitioners to define, store, and deploy data transformations.
This separation of concerns provides clear ownership of the different platform components and lets them scale independently. Each repository has its own CI/CD pipeline, enabling independent deployment and faster iteration cycles.
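For illustration, the parameter-driven approach could be expressed as a small pipeline definition that practitioners commit to the Data Orchestration repository. The sketch below is an assumption about how levels, threads, and steps might be modelled; the class and field names are hypothetical, not part of any specific DPA release.

```python
# Hypothetical sketch of a parameter-driven pipeline definition: a pipeline is
# a list of levels, each level holds threads that can run in parallel, and each
# thread runs its steps sequentially. Names are illustrative only.
from dataclasses import dataclass, field


@dataclass
class Step:
    name: str
    transformation: str  # reference to a file in the Data Transformation repository


@dataclass
class Thread:
    name: str
    steps: list[Step] = field(default_factory=list)  # executed sequentially


@dataclass
class Level:
    name: str
    threads: list[Thread] = field(default_factory=list)  # executed in parallel


# Example: one "staging" level with two independent threads.
pipeline = [
    Level(
        name="staging",
        threads=[
            Thread(name="orders", steps=[Step("load_orders", "sql/stg_orders.sql")]),
            Thread(name="customers", steps=[Step("load_customers", "sql/stg_customers.sql")]),
        ],
    )
]
```

Because the definition is plain data, the CI/CD pipeline of the Data Orchestration repository can validate and deploy it without touching the orchestration framework itself.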
2. Embracing the Analytics Engineering Concept:
DPA is built on the principles of analytics engineering, empowering data practitioners to independently build, organize, transform, and document data using software engineering best practices. This fosters a self-service data platform where data practitioners can create their own data products while adhering to a federated computational governance model.
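To make this concrete, the sketch below shows what a practitioner-owned transformation might look like when treated with software engineering best practices: typed, documented, and unit-tested. The function and column names are placeholders, not part of DPA.

```python
# Illustrative only: a small, versioned, testable transformation owned by a
# data practitioner. Column and function names are hypothetical.
import pandas as pd


def deduplicate_orders(orders: pd.DataFrame) -> pd.DataFrame:
    """Keep the latest record per order_id, based on the updated_at timestamp."""
    return (
        orders.sort_values("updated_at")
              .drop_duplicates(subset="order_id", keep="last")
              .reset_index(drop=True)
    )


def test_deduplicate_orders():
    df = pd.DataFrame({
        "order_id": [1, 1, 2],
        "updated_at": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-01"]),
    })
    out = deduplicate_orders(df)
    assert len(out) == 2
    assert out.loc[out["order_id"] == 1, "updated_at"].iloc[0] == pd.Timestamp("2024-01-02")
```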
3. Agnostic to Orchestration and Processing Tools:
DPA is designed to be agnostic to the orchestration tool and the data processing engine. While it provides sample orchestration code for Cloud Workflows and Cloud Composer, it can be integrated with other tools based on specific needs. This flexibility allows for seamless integration with existing systems and future-proofs the data platform.
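One common way to achieve this kind of agnosticism, sketched below as an assumption rather than a prescribed design, is a thin adapter interface that each orchestration backend implements, so the rest of the platform never depends on a specific tool.

```python
# Sketch of an orchestrator-agnostic adapter layer. Class and method names are
# illustrative; each backend (Cloud Workflows, Cloud Composer, ...) would
# provide its own implementation of the same contract.
from abc import ABC, abstractmethod


class OrchestratorAdapter(ABC):
    """Contract the platform codes against, independent of the backend."""

    @abstractmethod
    def deploy(self, pipeline_definition: dict) -> None:
        """Register or update the pipeline in the target orchestrator."""

    @abstractmethod
    def trigger(self, pipeline_name: str, parameters: dict) -> str:
        """Start a run and return an execution identifier."""


class CloudWorkflowsAdapter(OrchestratorAdapter):
    def deploy(self, pipeline_definition: dict) -> None:
        ...  # e.g. render the definition into Workflows YAML and deploy it

    def trigger(self, pipeline_name: str, parameters: dict) -> str:
        ...  # e.g. call the Workflows Executions API
        return "execution-id"
```

Swapping Cloud Workflows for Cloud Composer, or any other tool, then only means providing another adapter.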
4. Serverless-First Approach:
DPA prioritizes serverless technologies, leveraging the scalability, cost-effectiveness, and ease of use of services like Cloud Functions, BigQuery, and Cloud Workflows. This minimizes the need for long-running servers, reducing operational overhead and costs.
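As a minimal sketch of what a serverless execution step could look like, the example below uses an HTTP-triggered Cloud Function that pushes the actual work down to BigQuery; the request payload and the default query are placeholders, not a DPA interface.

```python
# Hypothetical serverless step: an HTTP-triggered Cloud Function (2nd gen)
# that runs a single BigQuery statement. No long-running servers are involved;
# BigQuery does the heavy lifting and the function scales to zero when idle.
import functions_framework
from google.cloud import bigquery


@functions_framework.http
def run_step(request):
    payload = request.get_json(silent=True) or {}
    query = payload.get("query", "SELECT 1")  # placeholder default

    client = bigquery.Client()
    job = client.query(query)  # submit the statement to BigQuery
    job.result()               # wait for completion

    return {"job_id": job.job_id, "state": job.state}
```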
5. Cost-Effectiveness:
By leveraging serverless technologies and providing a standardized framework for data pipeline development, DPA significantly reduces the overall cost of building and operating a data platform, making it accessible to a wider range of organizations.