01: Data Engineering Described



Data engineer’s goals


  • produce optimum ROI and reduce costs (financial and opportunity)
  • reduce risk (security, data quality)
  • maximize data value and utility
  • constantly optimize along the axes of cost, agility, scalability, simplicity, reuse, and interoperability

History of data engineering


  • The term “data warehouse” was coined by Bill Inmon in 1989.

  • Engineers at IBM developed the relational database and SQL, and Oracle popularized the technology.

  • MPP and relational databases dominated until internet companies sought new cost-effective, scalable, and reliable systems.

  • Google’s papers on the Google File System (2003) and MapReduce (2004) introduced an ultra-scalable data-processing paradigm.

  • Engineers at Yahoo developed Hadoop, which was open-sourced in 2006; around the same time, AWS became the first popular public cloud.

  • The Hadoop ecosystem, including Hadoop, YARN, and HDFS, was popular until Apache Spark rose to prominence in 2015.

  • The current trend is simplification: a move toward managed and abstracted tools.

Data team


Upstream stakeholders

  • Data architects
  • Software engineers
  • DevOps engineers

Downstream stakeholders

  • Data scientists
  • Data analysts
  • Machine learning engineers and AI researchers

Data maturity


Data maturity is the progression toward higher data utilization, capabilities, and integration across the organization.

Three stages:

  • starting with data
  • scaling with data
  • leading with data