The Future of Data Engineering #
The Data Engineering Lifecycle Isn’t Going Away #
The notion that simpler tools and practices will lead to the demise of data engineers is incorrect. Data engineers will remain crucial to designing and maintaining data systems, even as the tools become easier to use.
The Decline of Complexity and the Rise of Easy-to-Use Data Tools #
SaaS-managed services have made data engineering accessible to companies of all sizes by removing the need to understand and operate big data systems, which in the 2000s required a large team and significant resources to deploy. Off-the-shelf data connectors like Fivetran and Airbyte save data engineers time and resources.
The Cloud-Scale Data OS and Improved Interoperability #
The next evolution of cloud data operating systems will focus on higher levels of abstraction.
- Object storage in the cloud and new file formats like Parquet and Avro will play a significant role in cloud data interchange.
- A metadata catalog will be crucial in driving automation and simplification.
- Data orchestration platforms like Apache Airflow, Dagster, and Prefect will become significantly more data-aware.
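To make "data-aware" orchestration concrete, here is a minimal sketch in plain Python. It is not the API of Airflow, Dagster, or Prefect; the task names, dataset names, and the `plan` helper are all hypothetical. The idea it illustrates is that each task declares the datasets it reads and writes, and the run order is derived from that metadata rather than from hand-wired task-to-task edges.

```python
from graphlib import TopologicalSorter

# Hypothetical task registry: each task declares the datasets it reads and writes.
TASKS = {
    "ingest_orders": {"reads": [], "writes": ["raw.orders"]},
    "clean_orders": {"reads": ["raw.orders"], "writes": ["staging.orders"]},
    "ingest_customers": {"reads": [], "writes": ["raw.customers"]},
    "build_report": {
        "reads": ["staging.orders", "raw.customers"],
        "writes": ["marts.report"],
    },
}


def plan(tasks):
    """Derive a valid run order from dataset dependencies."""
    # Acts like a tiny metadata catalog: which task produces each dataset?
    producer = {ds: name for name, spec in tasks.items() for ds in spec["writes"]}
    # A task depends on the producers of every dataset it reads.
    graph = {
        name: {producer[ds] for ds in spec["reads"] if ds in producer}
        for name, spec in tasks.items()
    }
    return list(TopologicalSorter(graph).static_order())


order = plan(TASKS)
print(order)
```

Because dependencies are expressed through datasets, adding a new consumer of `staging.orders` only requires declaring what it reads; the ordering follows from the catalog lookup.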
Enterprisey Data Engineering #
Simplification and best practices will make data engineering more enterprisey, but only in the good sense of sound data management and governance. We are in a golden age of enterprisey data management tools: the technology is becoming more accessible, which gives data engineers the opportunity to focus on data management and DataOps.
Titles and Responsibilities Will Morph #
As simplicity moves up the stack, the boundaries between data engineering, software engineering, data science, and ML engineering are becoming blurred, and software engineers will need to acquire data engineering skills. Additionally, data engineers will be integrated into application development teams, and the barriers between application backend systems and data engineering tools will be lowered through deep integration via streaming and event-driven architectures.
The Live Data Stack #
The Live Data Stack, which fuses real-time analytics and ML into applications using streaming technologies, will be the successor to the Modern Data Stack. It will democratize real-time technologies, making them accessible to companies of all sizes as easy-to-use cloud-based offerings and opening up new possibilities for creating better user experiences and business value.
Streaming Pipelines and Real-Time Analytical Databases #
Streaming pipelines and real-time analytical databases will facilitate the move from the Modern Data Stack (MDS) to the Live Data Stack. Streaming technologies will continue to see extreme growth, with a focus on the business utility of streaming data. Real-time analytical databases enable fast ingestion and subsecond queries on data. In a back-to-the-future moment, data transformations will shift away from ELT toward a stream, transform, and load (STL) approach.
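The STL idea can be sketched with plain Python generators: events are transformed while in flight, and only the cleaned rows reach the destination, whereas ELT would load raw data first and transform it afterward. This is an illustrative sketch only. The event shape, the function names, and the in-memory list standing in for a real-time analytical database are all assumptions.

```python
import json


def stream(raw_lines):
    """Stream: yield parsed events one at a time (stand-in for a message queue)."""
    for line in raw_lines:
        yield json.loads(line)


def transform(events):
    """Transform: clean and reshape each event before it is stored."""
    for event in events:
        if event.get("amount") is None:
            continue  # drop incomplete events in flight
        yield {"user": event["user"], "amount_cents": round(event["amount"] * 100)}


def load(rows, sink):
    """Load: append transformed rows to the destination and return the row count."""
    count = 0
    for row in rows:
        sink.append(row)
        count += 1
    return count


# Hypothetical raw events, as they might arrive from an application.
RAW = [
    '{"user": "a", "amount": 1.5}',
    '{"user": "b", "amount": null}',
    '{"user": "c", "amount": 2.25}',
]

sink = []  # stand-in for a real-time analytical database table
loaded = load(transform(stream(RAW)), sink)
print(loaded, sink)
```

Each stage pulls from the previous one lazily, so no raw copy of the data is materialized before transformation.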
The Fusion of Data with Applications #
The fusion of application and data layers is the next revolution. Soon, applications will integrate real-time automation and decision making powered by streaming pipelines and ML. Emerging database technologies and feature stores may improve the experience of engineering the live data stack.
The Tight Feedback Between Applications and ML #
The Live Data Stack will fuse real-time analytics and ML into applications by using streaming technologies. ML will become integrated into most applications, creating a cycle of ever-smarter applications and increasing business value. Despite the rise of sophisticated data systems, spreadsheets remain widely used and handle complex analytics. We predict that a new class of tools will emerge, combining the interactive analytics capabilities of a spreadsheet with the backend power of cloud OLAP systems.
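The feedback cycle between applications and ML described in this section can be sketched as a loop in which the application asks a model for a decision, observes the user's response, and feeds that response straight back into the model. The example below is a toy under stated assumptions: `OnlineRateModel`, the banner names, and the simulated `user_clicks` function are hypothetical, and a smoothed running click rate with a greedy choice (no exploration) stands in for a real ML model.

```python
class OnlineRateModel:
    """Tracks a running click rate per option; a stand-in for a real ML model."""

    def __init__(self, options):
        self.shown = {option: 0 for option in options}
        self.clicked = {option: 0 for option in options}

    def score(self, option):
        # Laplace-smoothed click rate, so unseen options start at 0.5.
        return (self.clicked[option] + 1) / (self.shown[option] + 2)

    def choose(self):
        # Greedy choice: pick the option with the highest current score.
        return max(self.shown, key=self.score)

    def update(self, option, was_clicked):
        # Feedback from the application updates the model immediately.
        self.shown[option] += 1
        self.clicked[option] += int(was_clicked)


def user_clicks(option):
    """Simulated user behavior (an assumption): only banner_b gets clicked."""
    return option == "banner_b"


model = OnlineRateModel(["banner_a", "banner_b"])

for _ in range(5):
    choice = model.choose()                     # application asks the model
    model.update(choice, user_clicks(choice))   # outcome flows back into the model

print(model.choose(), model.shown)
```

In a production system the update would typically arrive through a streaming pipeline rather than a direct method call, but the shape of the loop is the same.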