02: The Data Engineering Lifecycle

The Data Engineering Lifecycle #


Components of the lifecycle #


Lifecycle:

  • Generation
  • Storage
  • Ingestion
  • Transformation
  • Serving data

Undercurrents of the lifecycle:

  • Security
  • Data management
  • DataOps
  • Data architecture
  • Orchestration
  • Software engineering

Generation #


Considerations for generation:

  • Type of data source (application/IoT/database)
  • Data generation rate
  • Data quality
  • Schema of the data
  • Data ingestion frequency
  • Impact on source system performance when reading data

Storage #


Considerations for storage:

  • Data characteristics such as volume, frequency of ingestion, and file format
  • Scaling capabilities including available storage, read/write rates, and throughput
  • Metadata capture for schema evolution, data lineage, and data flows
  • Storage solution type: object storage or cloud data warehouse
  • Schema management: schema-agnostic object storage, flexible schema with Cassandra, or enforced schema with a cloud data warehouse
  • Master data management, golden records, data quality, and data lineage for data governance
  • Regulatory compliance and data sovereignty considerations

Temperatures of data

  • hot data
  • lukewarm data
  • cold data

Ingestion #


Ingestion is usually where the biggest bottlenecks of the lifecycle occur. Source systems are normally outside your direct control and might randomly become unresponsive or provide data of poor quality.

Considerations for the ingestion:

  • Data availability and source reliability.
  • How will the sink handle the volume, format, and frequency of the incoming data?
  • Batch or streaming?
  • Push or Pull?

Batch ingestion: a convenient way of processing the incoming data stream in large chunks, for example handling a full day's worth of data in a single batch.

Streaming ingestion: provides data to downstream systems in a continuous, real-time fashion. Real-time (or near real-time) means that the data is available to a downstream system a short time after it is produced (e.g., less than one second later).

Micro-batching: an approach used in, for example, Spark Streaming, where incoming data is processed in small batches covering short windows (e.g., one second).
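
For illustration, here is a minimal, self-contained sketch of the micro-batching idea in plain Python (not Spark): events are accumulated for a fixed window and then handed off as one small batch. The `process_batch` handler is a hypothetical placeholder.

```python
import time

def process_batch(events):
    # Placeholder: in practice this might write to object storage or a warehouse.
    print(f"processed {len(events)} events")

def micro_batch_ingest(event_source, window_seconds=1.0):
    """Collect events for a short window, then process them as one small batch."""
    batch = []
    window_end = time.monotonic() + window_seconds
    for event in event_source:
        batch.append(event)
        if time.monotonic() >= window_end:
            process_batch(batch)
            batch = []
            window_end = time.monotonic() + window_seconds
    if batch:  # flush the final partial batch
        process_batch(batch)

micro_batch_ingest(({"sensor_id": i} for i in range(10)), window_seconds=0.5)
```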

Push model: the source system writes data out to a target, whether a database, object store, or filesystem. A standard ETL process is an example.

Pull model: data is retrieved from the source system by the ingestion process. Log-based change data capture (CDC) is an example.
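
A minimal sketch of the pull model, with an in-memory stand-in for the source system: the ingestion process periodically asks the source for rows newer than a watermark. All names and the watermark policy are illustrative.

```python
from datetime import datetime, timezone

# Hypothetical in-memory "source table"; a real source would be a database or an API.
SOURCE_TABLE = [
    {"id": 1, "updated_at": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"id": 2, "updated_at": datetime(2024, 1, 2, tzinfo=timezone.utc)},
]

def fetch_rows_since(watermark):
    """Pull: ask the source only for rows newer than the last seen watermark."""
    if watermark is None:
        return list(SOURCE_TABLE)
    return [row for row in SOURCE_TABLE if row["updated_at"] > watermark]

def pull_once(watermark):
    rows = fetch_rows_since(watermark)
    if rows:
        print(f"loading {len(rows)} rows into the sink")  # stand-in for a real load
        watermark = max(row["updated_at"] for row in rows)
    return watermark

watermark = pull_once(None)       # first pull fetches everything
watermark = pull_once(watermark)  # later pulls fetch only new rows
```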

Transformation #


Examples of transformations (a minimal sketch follows the list):

  • mapping data into correct types
  • transforming the data schema and applying normalization
  • large-scale aggregation for reporting
  • featurizing data for ML processes
  • enriching the data
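
A minimal sketch of a few of the transformations listed above (type mapping, enrichment, and aggregation), assuming pandas and a small, made-up orders dataset:

```python
import pandas as pd

raw = pd.DataFrame({
    "order_id": ["1", "2", "3"],
    "amount": ["10.50", "24.00", "7.25"],  # arrives as strings
    "country": ["pl", "de", "pl"],
    "ordered_at": ["2024-01-01", "2024-01-01", "2024-01-02"],
})

# Map data into correct types.
orders = raw.assign(
    amount=raw["amount"].astype(float),
    ordered_at=pd.to_datetime(raw["ordered_at"]),
)

# Enrich the data (here: a simple country-code lookup).
orders["country_name"] = orders["country"].map({"pl": "Poland", "de": "Germany"})

# Aggregate for reporting.
daily_revenue = orders.groupby("ordered_at", as_index=False)["amount"].sum()
print(daily_revenue)
```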

Reverse ETL takes processed data from the output side of the data engineering lifecycle and feeds it back into source systems. It allows us to take analytics, scored models, etc., and feed them back into production systems or SaaS platforms. Some engineers view it as an anti-pattern.
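
For illustration, a sketch of reverse ETL pushing churn scores back into a CRM over a REST API. The endpoint, token handling, and payload shape are hypothetical assumptions, not a specific vendor's API.

```python
import requests

# Hypothetical scored records coming out of the analytics/ML side.
scored_customers = [
    {"customer_id": "c-1", "churn_score": 0.82},
    {"customer_id": "c-2", "churn_score": 0.11},
]

CRM_URL = "https://crm.example.com/api/contacts"  # hypothetical SaaS endpoint
API_TOKEN = "replace-me"

for record in scored_customers:
    # Write the model output back into the operational/SaaS system.
    response = requests.patch(
        f"{CRM_URL}/{record['customer_id']}",
        json={"churn_score": record["churn_score"]},
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        timeout=10,
    )
    response.raise_for_status()
```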

Security #


Security good practices #

  • Principle of least privilege: give access only to the essential data and resources needed to perform an intended function (see the sketch after this list).
  • Create a culture of security.
  • Protect data from unwanted visibility using encryption, tokenization, data masking, obfuscation, and access controls.
  • Implement user and identity access management (IAM) roles, policies, groups, network security, password policies, and encryption.
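
As a concrete example of least privilege, here is a sketch of an AWS IAM-style policy expressed as a Python dictionary: it grants read-only access to a single, hypothetical bucket prefix instead of broad storage access.

```python
import json

# Least privilege: read-only access to one prefix of one (hypothetical) bucket.
read_only_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::analytics-raw-example",
                "arn:aws:s3:::analytics-raw-example/sales/*",
            ],
        }
    ],
}

print(json.dumps(read_only_policy, indent=2))
```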

Data Management #


Disciplines of Data Management:

  • Data governance, including data quality, integrity, security, discoverability and accountability
  • Data modeling and design
  • Metadata management
  • Data lineage
  • Storage and operations
  • Data integration and interoperability
  • Data lifecycle management
  • Data systems for advanced analytics and ML
  • Ethics and privacy

The Data Management Association International (DAMA) Data Management Body of Knowledge (DMBOK), which we consider to be the definitive book for enterprise data management, offers this definition:

Data management is the development, execution, and supervision of plans, policies, programs, and practices that deliver, control, protect, and enhance the value of data and information assets throughout their lifecycle.

Data governance #

According to Data Governance: The Definitive Guide, the definition of data governance is:

Data governance is, first and foremost, a data management function to ensure the quality, integrity, security, and usability of the data collected by an organization.

Master Data Management #

Master data is data about business entities such as employees, customers, products, and locations. As organizations grow larger, maintaining a consistent picture of these entities becomes more challenging. Master data management (MDM) is the practice of building consistent entity definitions known as golden records.
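
A small, illustrative sketch of the golden-record idea: merging duplicate customer records from different systems into one consistent entity. The merge policy here (the latest non-empty value per field wins) is just one possible choice.

```python
# Duplicate "customer" records from two hypothetical systems.
records = [
    {"email": "ada@example.com", "name": "Ada L.", "phone": "", "updated_at": "2024-01-05"},
    {"email": "ada@example.com", "name": "Ada Lovelace", "phone": "555-0100", "updated_at": "2024-03-01"},
]

def golden_record(duplicates):
    """Build a golden record by taking, per field, the latest non-empty value."""
    merged = {}
    for rec in sorted(duplicates, key=lambda r: r["updated_at"]):
        for field, value in rec.items():
            if value:  # later, non-empty values win
                merged[field] = value
    return merged

print(golden_record(records))
# {'email': 'ada@example.com', 'name': 'Ada Lovelace', 'phone': '555-0100', 'updated_at': '2024-03-01'}
```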

Data lineage #

Data lineage describes the recording of an audit trail of data through its lifecycle, tracking both the systems that process the data and the upstream data it depends on. Data lineage helps with error tracking, accountability, and debugging of data and the systems that process it.

Data integration and interoperability #

Data integration is becoming increasingly important as data engineers move away from single-stack analytics and towards a heterogeneous cloud environment. The process involves integrating data across various tools and processes.

Data privacy #

Data privacy and data retention laws such as the GDPR and the CCPA require data engineers to actively manage data destruction to respect users’ right to be forgotten.

Data engineers need to ensure:

  • that datasets mask personally identifiable information (PII) and other sensitive information (see the sketch after this list),
  • that data assets are compliant with a growing number of data regulations, such as the GDPR and CCPA.
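
A minimal sketch of one common masking technique: replacing a PII column with a salted hash (pseudonymization), so rows can still be joined on the masked value without exposing the raw identifier. The column and salt handling are illustrative; real keys belong in a secret store.

```python
import hashlib

SALT = "per-environment-secret"  # illustrative; manage real salts/keys in a secret store

def mask_pii(value: str) -> str:
    """Replace a PII value with a salted SHA-256 digest (pseudonymization)."""
    return hashlib.sha256((SALT + value).encode("utf-8")).hexdigest()

row = {"email": "ada@example.com", "order_total": 42.0}
masked_row = {**row, "email": mask_pii(row["email"])}
print(masked_row)
```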

DataOps #


DataOps is like DevOps, but for data products. It’s a set of practices that enable rapid innovation, high data quality, collaboration, and clear measurement and monitoring.

DataOps has three core technical elements:

  • automation,
  • monitoring and observability,
  • incident response.

Automation #

  • change management (environment, code, and data version control),
  • continuous integration/continuous deployment (CI/CD),
  • configuration as code,
  • monitor and maintain the reliability of technology and systems (data pipelines, orchestration, etc.), with the added dimension of checking for data quality, data/model drift, metadata integrity, and more.

Observability and monitoring #

  • monitoring,
  • logging,
  • alerting,
  • tracing.

These are all critical to getting ahead of any problems along the data engineering lifecycle.
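
As an illustration of monitoring and alerting on the data itself, a small freshness-check sketch that raises an alert when a table has not received new data within an expected window. The threshold and the alerting mechanism are assumptions.

```python
from datetime import datetime, timedelta, timezone

def check_freshness(last_loaded_at: datetime, max_lag: timedelta = timedelta(hours=2)):
    """Alert if the most recent load is older than the allowed lag."""
    lag = datetime.now(timezone.utc) - last_loaded_at
    if lag > max_lag:
        # Stand-in for a real alert (PagerDuty, Slack, email, ...).
        print(f"ALERT: data is stale by {lag - max_lag}")
    else:
        print(f"OK: data is {lag} old")

check_freshness(datetime.now(timezone.utc) - timedelta(hours=3))
```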

Incident response #

Incident response is about using the automation and observability capabilities mentioned previously to rapidly identify root causes of an incident and resolve it as reliably and quickly as possible.

Other concepts #


Orchestration #

Orchestration is the process of coordinating multiple jobs so that they run efficiently on a schedule. Orchestration systems provide high availability, job history, visualization, and alerting. Advanced engines can backfill new DAGs or individual tasks as they are added. Note that orchestration is strictly a batch concept.
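
A minimal orchestration sketch, assuming Apache Airflow 2.x (the DAG and task names are illustrative): two tasks run on a daily schedule with an explicit dependency between them.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull data from the source")

def load():
    print("write data to the warehouse")

with DAG(
    dag_id="daily_sales_pipeline",  # illustrative name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> load_task  # load runs only after extract succeeds
```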

Infrastructure as code (IaC) #

IaC (Infrastructure as Code) applies software engineering practices to managing infrastructure configuration. Data engineers use IaC frameworks to manage their infrastructure in a cloud environment, instead of manually setting up instances and installing software.
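
A hedged IaC sketch using Pulumi's Python SDK with the pulumi_aws provider (the resource name is illustrative): the bucket is declared in code and versioned alongside it, instead of being created manually in a console.

```python
import pulumi
from pulumi_aws import s3

# Declare storage for the raw zone of a (hypothetical) data lake in code,
# so the bucket is created and updated alongside the rest of the codebase.
raw_bucket = s3.Bucket("raw-zone")

pulumi.export("raw_bucket_name", raw_bucket.id)
```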
