Data engineering isn’t sexy, yet it is a sine qua non to extract value from data.
I have had my fair share of data engineering projects that did not quite live up to expectations. Data engineering, or data warehousing, is quite easy to get wrong. How hard can it be to integrate data from different sources, joining them on relevant business keys, and you are ready to develop funky machine learning models for your client. Right? Wrong. In this blog post I will sum up the most important lessons I have stumbled upon in the past 10 years. It is work in progress, so stay tuned for updates.
Functional data engineering as a starting point.
In this excellent article (pdf included below), Maxime Beauchemin explains how the tenets of functional programming can be applied to data engineering. Particularly relevant are:
- using pure tasks in your data pipeline
- regard table partitions as immutable objects
- using a persistent and immutable staging areas
Google BigQuery as your datawarehouse storage platform
I am a big fan of BigQuery. You can throw structured and unstructured data at it. It performs well with little management overhead compared to e.g. PostgreSQL data warehouses I had to maintain. And pricing is attractive: the cost for data-at-rest is the lowest of all cloud platform (as far as I know), hence I don’t have no qualms to really ingest as much data as possible.
Google’s solution design of BigQuery for datawarehouse practitioners is a good starting point. With the functional engineering approach in mind, my current best practice for data warehousing which I have applied e.g. for the Happi project consists of the following components:
- Use BigQuery’s date partitioning to ingest data into your staging schemas. Doing so, you can make optimal use of the parallel processing of BigQuery. Data generated by apps can even by ingested real-time by pushing data in newline delimited json format to BigQuery.
- Use a lightweight workflow management library like Prefect to orchestrate and schedule all your idempotent tasks.
- Use templated SQL for efficiently and reliably generating datamarts. Given the underlying Dremmel storage engine, we don’t need to bother with creating dimension and fact tables per se, although we still apply Kimball’s concept of dimensional modeling).