Oct 5, 2023

Learning Apache Flink S01E03: Running my First Flink Cluster and Application

🎉 I just ran my first Apache Flink cluster and application on it 🎉

Oct 4, 2023

cd: string not in pwd

A brief diversion from my journey learning Apache Flink to document an interesting zsh oddity that briefly tripped me up:

cd: string not in pwd: flink-1.17.1

Oct 2, 2023

Learning Apache Flink S01E02: What is Flink?

My journey with Apache Flink begins with an overview of what Flink actually is.

What better place to start than the Apache Flink website itself:

Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink has been designed to run in all common cluster environments, perform computations at in-memory speed and at any scale.

Sep 29, 2023

Learning Apache Flink S01E01: Where Do I Start?

Like a fortunate child on Christmas Day, I’ve got a brand new toy! A brand new (to me) open-source technology to unwrap, learn, and perhaps even aspire to master parts of.

I joined Decodable two weeks ago, and since Decodable is built on top of Apache Flink it seems like a great time to learn it. After six years learning Apache Kafka and hearing about this “Flink” thing but—for better or worse—never investigating it, I now have the perfect opportunity to do so.

Sep 21, 2023

An Itch That Just Has to Be Scratched… (Or, Why Am I Joining Decodable?)

This week I joined Decodable. I’m grateful to my former colleagues at Treeverse for allowing me to join them on the journey with lakeFS - but something about the streaming world was too strong to resist 😁.

Jul 19, 2023

Blog Writing for Developers

Writing is one of the most powerful forms of communication, and it’s useful in a multitude of roles and contexts. As a blog-writing, documentation-authoring, twitter-shitposting DevEx engineer I spend a lot of my time writing. Recently, someone paid me a very nice compliment about a blog I’d written and asked how they could learn to write like me and what resources I’d recommend.

Never one to miss a chance to write and share something, here’s my response to this :)

May 23, 2023

What Does This DevEx Engineer Do?

This was originally titled more broadly “What Does A DevEx Engineer Do”, but that turned it into a far too tedious and long-winded etymological exploration of the discipline. Instead, I’m going to tell you what this particular instantiation of the entity does 😄

May 3, 2023

Authoring WordPress blogs in Markdown (with Google Docs for review)

WordPress still, to an extent, rules the blogging world. Its longevity is testament to…something about it ;) However, it’s not my favourite platform in which to write a blog by a long way. It doesn’t support Markdown to the extent that I want. Yes, I’ve tried the plugins; no, they didn’t do what I needed.

I like to write all my content in a structured format - ideally AsciiDoc, but I’ll settle for Markdown too. Here’s how I stayed [almost] sane whilst composing a blog in Markdown, reviewing it in Google Docs, and then publishing it in WordPress in a non-lossy way.

Apr 20, 2023

Building Better Docs - Automating Jekyll Builds and Link Checking for PRs

One of the most important ways that a project can help its developers is providing them good documentation. Actually, scratch that. Great documentation.

Apr 5, 2023

Using Delta from PySpark - `java.lang.ClassNotFoundException: delta.DefaultSource`

No great insights in this post, just something for folk who Google this error after me and don’t want to waste three hours chasing their tails… 😄

Mar 14, 2023

Quickly Convert CSV to Parquet with DuckDB

Here’s a neat little trick you can use with DuckDB to convert a CSV file into a Parquet file:

COPY (SELECT *
        FROM read_csv('~/data/source.csv', AUTO_DETECT=TRUE))
  TO '~/data/target.parquet' (FORMAT 'PARQUET', CODEC 'ZSTD');

Mar 3, 2023

Making the move from Alfred to Raycast

It all started with a tweet.

Mar 3, 2023

Aligning mismatched Parquet schemas in DuckDB

What do you do when you want to query over multiple parquet files but the schemas don’t quite line up? Let’s find out 👇🏻

Dec 9, 2022

Looking Forwards, and Looking Backwards

As we enter December and 2022 draws to a close, so does a significant chapter in my working career: later this month I’ll be leaving Confluent and moving on to pastures new.

It’s nearly six years since I wrote a 'moving on' blog entry, and as well as sharing what I’ll be working on next (and why), I also want to reflect on how much I’ve benefited from my time at Confluent and particularly the people with whom I worked.

Nov 8, 2022

Data Engineering in 2022: ELT tools

In my quest to bring myself up to date with where the data & analytics engineering world is at nowadays, I’m going to build on my exploration of the storage and access technologies and look at the tools we use for loading and transforming data.

Oct 24, 2022

Data Engineering in 2022: Wrangling the feedback data from Current 22 with dbt

I started my dbt journey by poking and pulling at the pre-built jaffle_shop demo running with DuckDB as its data store. Now I want to see if I can put it to use myself to wrangle the session feedback data that came in from Current 2022. I’ve analysed this already, but it struck me that a particular part of it would benefit from some tidying up - and be a good excuse to see what it’s like using dbt to do so.

Oct 20, 2022

Data Engineering in 2022: Exploring dbt with DuckDB

I’ve been wanting to try out dbt for some time now, and a recent long-haul flight seemed like the obvious opportunity to do so. Except many of the dbt tutorials that I found were based on using Cloud, and airplane WiFi is generally sucky or non-existent. Then I found the DuckDB-based demo of dbt, which seemed to fit the bill (🦆 geddit?!) perfectly, since DuckDB runs locally. In addition, DuckDB had appeared on my radar recently and I was keen to check it out.

Oct 14, 2022

Current 22 - Session Analysis with DuckDB and Jupyter Notebook

At Current 2022 the audience was given the option to submit ratings. Here’s some analysis I’ve done on the raw data. It’s interesting to poke about in it, and it also gave me an excuse to try using DuckDB in a notebook!

Oct 2, 2022

Data Engineering in 2022: Architectures & Terminology

This is one of those “you had to be there” moments. If you come into the world of data and analytics engineering today, ELT is just what it is and is pretty much universally understood. But if you’ve been around for …waves hands… longer than that, you might be confused by what people are calling ELT and ETL. Well, I was ✋.

Sep 26, 2022