Data Engineering in 2022: Exploring LakeFS with Jupyter and PySpark
As part of my foray into the current world of data engineering, I wanted to get my hands dirty with some of the tools and technologies I’d been reading about. The vehicle for this was learning more about LakeFS, with some dabbling in PySpark and S3 (MinIO) along the way.
I’d forgotten how amazingly useful notebooks are. It’s been six years since I last wrote about them (which was also the last time I tried my hand at PySpark). This blog post is basically the notebook, with some more annotations.