Data Engineering Described #
Data Engineer’s goals #
- produce optimum ROI and reduce costs (financial and opportunity)
- reduce risk (security, data quality)
- maximize data value and utility
- constantly optimize along the axes of cost, agility, scalability, simplicity, reuse, and interoperability
History of data engineering #
- The term "data warehouse" was coined by Bill Inmon in 1989.
- IBM developed the relational database and SQL, and Oracle popularized the technology.
- MPP and relational databases dominated until internet companies sought new cost-effective, scalable, and reliable systems.
- Google’s GFS paper (2003) and MapReduce paper (2004) started the ultra-scalable data processing paradigm.
- Yahoo developed Hadoop in 2006, and AWS became the first popular public cloud.
- The Hadoop ecosystem, including MapReduce, YARN, and HDFS, was popular until Apache Spark rose to prominence around 2015.
- The current trend is toward simplification through managed and abstracted tools.
Data team #
Upstream stakeholders
- Data architects
- Software engineers
- DevOps engineers
Downstream stakeholders
- Data scientists
- Data analysts
- Machine learning engineers and AI researchers
Data maturity #
Data maturity is the progression toward higher data utilization, capabilities, and integration across the organization.
Three stages:
- starting with data
- scaling with data
- leading with data