Functional data engineering with Google BigQuery

Data engineering isn’t sexy, yet it is a sine qua non to extract value from data.

I have had my fair share of data engineering projects that did not quite live up to expectations. Data engineering, or data warehousing, is quite easy to get wrong. How hard can it be to integrate data from different sources, joining them on relevant business keys, and you are ready to develop funky machine learning models for your client. Right? Wrong. In this blog post I will sum up the most important lessons I have stumbled upon in the past 10 years. It is work in progress, so stay tuned for updates.

Functional data engineering as a starting point.

In this excellent article (pdf included below), Maxime Beauchemin explains how the tenets of functional programming can be applied to data engineering. Particularly relevant are:

  • using pure tasks in your data pipeline
  • regard table partitions as immutable objects
  • using a persistent and immutable staging areas

Google BigQuery as your datawarehouse storage platform

I am a big fan of BigQuery. You can throw structured and unstructured data at it. It performs well with little management overhead compared to e.g. PostgreSQL data warehouses I had to maintain. And pricing is attractive: the cost for data-at-rest is the lowest of all cloud platform (as far as I know), hence I don’t have no qualms to really ingest as much data as possible.

Google’s solution design of BigQuery for datawarehouse practitioners is a good starting point. With the functional engineering approach in mind, my current best practice for data warehousing which I have applied e.g. for the Happi project consists of the following components: