Ephes Blog

Miscellaneous things. Mostly Weeknotes and links I stumbled upon.


Weeknotes 2024-12-09

, Jochen
The thing most people don't remember about Prometheus is that humans already had fire, but then Zeus locked it up behind a paywall, because Zeus was a petty little bitch. Prometheus stole it *back*. In this essay about DRM I will --jwz

I was just going through my weeknotes from last year and decided to take it a bit easier on myself with the weekly updates 😇. Not much else has changed though - I'm still juggling too many things at once. Despite being busy, I did manage to write a TIL post about including extra data in Python packages, and I published a quick hot take on why I think Big Data might be on its way out (which actually started as a comment in a discussion that I had Claude help me reshape into a blog post). Still working on the latest podcast episode, but on a brighter note, we got to take a lovely trip to Maastricht.

Stuff

Out of Context Images


Why Your Laptop Might Be Faster Than a Cluster: Rethinking Big Data in 2025

, Jochen

Modern laptops are significantly more powerful than many people realize, often capable of handling data processing tasks that were traditionally relegated to computing clusters. This perspective often surprises people, which in turn surprises me, because it isn't a new development. While certain tasks undeniably require immense computational power, more often than not we are limited by input/output (I/O) speed rather than raw processing power. The solid-state drives (SSDs) in today's laptops, such as those in recent MacBooks, are incredibly fast, making them highly efficient for data-intensive work.
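A quick back-of-envelope calculation makes the I/O point concrete. The throughput figures below are rough assumptions for illustration, not benchmarks: laptop NVMe SSDs sustain sequential reads on the order of a few gigabytes per second, while a cluster worker pulling from object storage often sees well under one.

```python
# Back-of-envelope: how long does a full sequential scan of the data take?
# All throughput figures are assumed round numbers, not measurements.

DATASET_GB = 300            # a few hundred gigabytes of Parquet
LAPTOP_SSD_GBPS = 5.0       # assumed NVMe sequential read throughput
S3_PER_NODE_GBPS = 0.5      # assumed effective S3 throughput per worker
CLUSTER_NODES = 10

laptop_scan_s = DATASET_GB / LAPTOP_SSD_GBPS
cluster_scan_s = DATASET_GB / (S3_PER_NODE_GBPS * CLUSTER_NODES)

print(f"laptop:  {laptop_scan_s:.0f} s")   # 60 s
print(f"cluster: {cluster_scan_s:.0f} s")  # 60 s
```

Under these assumptions a ten-node cluster only ties a single laptop on raw scan time, before paying anything for scheduling, shuffles, or serialization.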

Consider a common scenario: you have several hundred gigabytes of Parquet files and need to perform aggregations or simple computations that a tool like Polars, pandas, or NumPy could handle. In a cluster environment, these files are typically distributed across multiple machines using something like the Hadoop Distributed File System (HDFS), or stored in S3 buckets. Tools like Impala, Hive, or Spark then execute SQL-like queries, distributing the workload across numerous nodes. Those nodes, however, often spend a significant amount of time waiting for I/O from S3 or HDFS, which wastes much of the cluster's theoretical horsepower.
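The aggregation itself is usually trivial. The sketch below is a standard-library stand-in for the kind of group-by a cluster is often spun up for, run over synthetic in-memory rows; in practice you would let Polars or DuckDB scan the Parquet files lazily instead of building a Python list.

```python
import random
from collections import defaultdict

# Synthetic stand-in for rows read from Parquet: (store_id, amount).
random.seed(42)
rows = [(random.randrange(100), random.random()) for _ in range(1_000_000)]

# A plain group-by-key sum -- the shape of many "big data" queries.
totals = defaultdict(float)
for store_id, amount in rows:
    totals[store_id] += amount

print(len(totals))  # 100 distinct stores
```

A million rows like this aggregate in well under a second on a laptop; the hard part of such jobs is almost always getting the bytes off disk, not the arithmetic.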

Hannes Mühleisen, one of the creators of DuckDB, addressed this issue in a talk explaining why they chose not to build a distributed version of DuckDB. He pointed out that spreading the load across multiple machines rarely delivers a performance benefit that justifies the added complexity: in their tests, a single server node running DuckDB could match the performance of a 60-node Spark cluster, a trade-off that hardly seems worthwhile.

One of the original BigQuery developers echoed this sentiment in a blog post, arguing that most companies believe they have "big data" when, in reality, their data can be processed efficiently on a single machine.

The evolution of big data infrastructure meetings tells the story:

  • 2014: "We need a whole rack for this!"
  • 2019: "Maybe just one really beefy server?"
  • 2024: "Have you tried using your MacBook?"
  • Next year: "Just run it on your smart watch, bro"

My favorite quote about big data comes from Gary Bernhardt:

Consulting service: you bring your big data problems to me, I say "your data set fits in RAM", you pay me $10,000 for saving you $500,000.

The Takeaway

Before investing in complex and costly infrastructure, it's crucial to assess the actual requirements of your data processing tasks. Modern laptops are powerful tools that, when fully utilized, can handle substantial workloads efficiently. By rethinking our approach and leveraging the hardware we already have, we can achieve better performance without unnecessary complexity.


How to Add Extra Data to a Python Package

, Jochen

Today, I learned how to include external files in a Python package using uv and the hatchling build backend. My goal was to add a directory containing sample fixture data to the package. Here’s the resulting pyproject.toml file:

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[tool.hatch.build.targets.sdist]
packages = ["src/mypackage"]

[tool.hatch.build.targets.sdist.force-include]
"data/fixtures" = "mypackage/fixtures"  # Include the fixtures directory in the sdist

[tool.hatch.build.targets.wheel]
packages = ["src/mypackage"]

[tool.hatch.build.targets.wheel.force-include]
"data/fixtures" = "mypackage/fixtures"  # Include the fixtures directory in the wheel

This turned out to be surprisingly tricky to get right, and the TOML file doesn't make it obvious which side of a force-include entry is the source and which is the target: the left-hand side is the path in your project, the right-hand side is the destination inside the built sdist or wheel. Here's an overview of the directory structure:

project_root/
├── src/
│   └── mypackage/
│       ├── __init__.py
│       ├── module1.py
│       └── fixtures/           # Included as mypackage/fixtures by pyproject.toml
│           ├── fixture1.json
│           └── fixture2.json
├── data/
│   └── fixtures/
│       ├── fixture1.json
│       └── fixture2.json
├── pyproject.toml
└── README.md
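Once the package is installed, the fixtures can be read back with importlib.resources. The snippet below fakes an installed package in a temporary directory so it runs standalone; in real code, with the pyproject.toml above, you would only need the last three lines.

```python
import importlib.resources
import json
import sys
import tempfile
from pathlib import Path

# Simulate the installed layout so this example is self-contained;
# the force-include entries above produce the same structure on install.
tmp = Path(tempfile.mkdtemp())
pkg = tmp / "mypackage"
(pkg / "fixtures").mkdir(parents=True)
(pkg / "__init__.py").write_text("")
(pkg / "fixtures" / "fixture1.json").write_text(json.dumps({"name": "demo"}))
sys.path.insert(0, str(tmp))

# The part you would actually use in application code:
fixture = importlib.resources.files("mypackage") / "fixtures" / "fixture1.json"
data = json.loads(fixture.read_text())
print(data["name"])  # demo
```

Using importlib.resources.files() (Python 3.9+) instead of building paths from `__file__` keeps the lookup working even if the package is later shipped in a zip.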

Weeknotes 2024-12-02

, Jochen
When I die please mix my ashes with concrete, then turn me into a brick and throw me at an Amazon Warehouse --Erik Uden

Yep, it’s been a stressful week, just as expected. Somehow, we managed to record a podcast episode, but now I need to find the time to get it published.

Fediverse

Software

Videos


Weeknotes 2024-11-25

, Jochen
Don't use a big word when a singularly unloquacious and diminutive linguistic expression will satisfactorily accomplish the contemporary necessity. --Dgar

This week brought some improvements and a new release for podcast-transcripts. Overall, though, it was more work and less play than usual—a pattern I suspect will continue through the end of the year.

Articles

Weeknotes

Software

Videos

Out of Context Images