Data Engineering Described #
Data Engineer’s goals #
- produce optimum ROI and reduce costs (financial and opportunity)
- reduce risk (security, data quality)
- maximize data value and utility
- must constantly optimize along the axes of cost, agility, scalability, simplicity, reuse, and interoperability
History of data engineering #
The term "data warehouse" was coined by Bill Inmon in 1989.
IBM developed the relational database and SQL, and Oracle popularized the technology.
MPP and relational databases dominated until internet companies sought new cost-effective, scalable, and reliable systems.
Google’s papers on the Google File System (2003) and MapReduce (2004) introduced an ultra-scalable data processing paradigm.
Yahoo developed Hadoop in 2006, and AWS became the first popular public cloud.
The Hadoop ecosystem, including Hadoop, YARN, and HDFS, was popular until Apache Spark rose to prominence in 2015.
The current trend is toward simplification through managed and abstracted tools.
Data team #
Upstream stakeholders
- Data architects
- Software engineers
- DevOps engineers
Downstream stakeholders
- Data scientists
- Data analysts
- Machine learning engineers and AI researchers
Data maturity #
Data maturity is the progression toward higher data utilization, capabilities, and integration across the organization.
Three stages:
- starting with data
- scaling with data
- leading with data