Ephes Blog

Miscellaneous things. Mostly Weeknotes and links I stumbled upon.


Why Your Laptop Might Be Faster Than a Cluster: Rethinking Big Data in 2025

Modern laptops are significantly more powerful than many realize, often capable of handling data processing tasks that were traditionally relegated to computing clusters. This perspective often surprises people, which in turn surprises me, because it's not a new development. While certain tasks undeniably require immense computational power, more often than not we are limited by input/output (I/O) speeds rather than raw processing power. The solid-state drives (SSDs) in today's laptops, such as those in MacBooks, are incredibly fast, making them highly efficient for data-intensive tasks.

Consider a common scenario: you have several hundred gigabytes of Parquet files and need to perform aggregations or simple computations that could be handled with a tool like Polars, pandas, or NumPy. In a cluster environment, these files are typically distributed across multiple machines using a system like the Hadoop Distributed File System (HDFS) or stored in S3 buckets. Tools like Impala, Hive, or Spark then execute SQL-like queries, distributing the workload across numerous nodes. However, these nodes often spend a significant amount of time waiting for I/O operations from S3 or HDFS, leading to inefficiencies.
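As a rough sketch (the file path and column names here are invented for illustration), Polars can scan all of those Parquet files lazily on a single machine and only read the columns the query actually needs:

import polars as pl

# Lazily scan every Parquet file; nothing is read into memory yet.
lazy_frame = pl.scan_parquet("data/*.parquet")

# A typical aggregation: total amount per customer.
result = (
    lazy_frame
    .group_by("customer_id")
    .agg(pl.col("amount").sum().alias("total_amount"))
    .collect()  # only now does Polars read and process the files
)
print(result)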

Hannes Mühleisen, one of the creators of DuckDB, addressed this issue in a talk, explaining why they chose not to create a distributed version of DuckDB. He pointed out that distributing the load across multiple machines rarely offers a performance benefit that justifies the added complexity. In their tests, a single server node running DuckDB could match the performance of a 60-node Spark cluster—a trade-off that hardly seems worthwhile.
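To make that concrete, the same kind of aggregation in DuckDB is a single SQL query over the Parquet files (again with invented names), no cluster required:

import duckdb

# DuckDB queries the Parquet files in place; there is no separate load step.
result = duckdb.sql("""
    SELECT customer_id, sum(amount) AS total_amount
    FROM 'data/*.parquet'
    GROUP BY customer_id
""")
print(result)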

This sentiment is echoed by one of the original BigQuery developers, who wrote in a blog post that most companies believe they have "big data" when, in reality, their data can be processed efficiently on a single machine.

The evolution of big data infrastructure meetings tells the story:

  • 2014: "We need a whole rack for this!"
  • 2019: "Maybe just one really beefy server?"
  • 2024: "Have you tried using your MacBook?"
  • Next year: "Just run it on your smart watch, bro"

My favorite quote about big data comes from Gary Bernhardt:

Consulting service: you bring your big data problems to me, I say "your data set fits in RAM", you pay me $10,000 for saving you $500,000.

The Takeaway

Before investing in complex and costly infrastructure, it's crucial to assess the actual requirements of your data processing tasks. Modern laptops are powerful tools that, when fully utilized, can handle substantial workloads efficiently. By rethinking our approach and leveraging the hardware we already have, we can achieve better performance without unnecessary complexity.


How to Add Extra Data to a Python Package

Today, I learned how to include external files in a Python package using uv and the hatchling build backend. My goal was to add a directory containing sample fixture data to the package. Here’s the resulting pyproject.toml file:

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[tool.hatch.build.targets.sdist]
packages = ["src/mypackage"]

[tool.hatch.build.targets.sdist.force-include]
"data/fixtures" = "mypackage/fixtures"  # Include the fixtures directory in the sdist

[tool.hatch.build.targets.wheel]
packages = ["src/mypackage"]

[tool.hatch.build.targets.wheel.force-include]
"data/fixtures" = "mypackage/fixtures"  # Include the fixtures directory in the wheel

Getting the force-include mappings right turned out to be surprisingly tricky, and the TOML file doesn't make it immediately clear which paths are the source and which are the target in the sdist and wheel. So, here's an overview of the directory structure:

project_root/
├── src/
│   └── mypackage/
│       ├── __init__.py
│       └── module1.py
├── data/
│   └── fixtures/           # Included as mypackage/fixtures by pyproject.toml
│       ├── fixture1.json
│       └── fixture2.json
├── pyproject.toml
└── README.md

In the force-include tables, the keys are paths relative to the project root (the source) and the values are paths inside the built archive (the target). So data/fixtures from the project root ends up as mypackage/fixtures, right next to module1.py in the installed package.
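Once the package is built and installed, the fixtures can be read as regular package resources. A minimal sketch, assuming Python 3.9+ and the layout above:

import json
from importlib.resources import files

# force-include placed the files at mypackage/fixtures, so they are
# importable package resources after installation.
fixture = files("mypackage").joinpath("fixtures/fixture1.json")
data = json.loads(fixture.read_text())
print(data)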

Weeknotes 2024-12-02


Weeknotes 2024-11-25

Don't use a big word when a singularly unloquacious and diminutive linguistic expression will satisfactorily accomplish the contemporary necessity. --Dgar

This week brought some improvements and a new release for podcast-transcript. Overall, though, it was more work and less play than usual—a pattern I suspect will continue through the end of the year.

Articles

Weeknotes

Software

Videos

Out of Context Images


🎙️ Introducing podcast-transcript: Audio Transcription Made Simple

Hey folks! I recently built a little command-line tool called podcast-transcript that turns audio into text. While it started as a podcast transcription project during the PyDDF autumn sprint, it works great for any speech audio. The coolest part? It can transcribe a 2-hour podcast in about 90 seconds!

Quick Start 🚀

pip install podcast-transcript  # or use pipx or uvx
transcribe https://d2mmy4gxasde9x.cloudfront.net/cast_audio/pp_53.mp3

Why Groq?

After trying different approaches, I landed on using the Groq API for transcription. Here's why:

  • It's blazing fast
  • Getting an API key is free and API usage is free (with reasonable limits: 8 hours of audio per day, 2 hours per hour)
  • The Whisper large-v3 model handles multiple languages well (especially noticeable for German content)

Technical Bits

The tool handles some interesting challenges under the hood:

  • Automatically resamples audio to 16kHz mono before upload (if you skip this, Groq does it server-side after upload anyway)
  • Splits files larger than 25MB into chunks and stitches the transcripts back together
  • Uses httpx for direct API calls to get detailed JSON responses, inspired by Simon Willison’s approach (see the sketch after this list)
  • Outputs in multiple formats: DOTe JSON, Podlove JSON, WebVTT, and plain text
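For the curious, here is a rough sketch of what such a direct call can look like. This is not the actual code from podcast-transcript; it assumes Groq's OpenAI-compatible transcription endpoint, the whisper-large-v3 model, and a GROQ_API_KEY environment variable:

import os
import httpx

# Endpoint and model name are assumptions; check the Groq docs for current values.
url = "https://api.groq.com/openai/v1/audio/transcriptions"
headers = {"Authorization": f"Bearer {os.environ['GROQ_API_KEY']}"}

with open("episode.mp3", "rb") as audio_file:
    response = httpx.post(
        url,
        headers=headers,
        files={"file": ("episode.mp3", audio_file, "audio/mpeg")},
        # verbose_json includes segment timestamps, which are needed
        # to build formats like WebVTT.
        data={"model": "whisper-large-v3", "response_format": "verbose_json"},
        timeout=httpx.Timeout(300.0),
    )
response.raise_for_status()
print(response.json()["text"])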

Future Plans

I'm planning to add support for local transcription using the OpenAI Whisper model. While Whisper v2 works well enough for English content, v3 shows notable improvements for other languages (especially German). I initially skipped local processing because of the PyTorch dependency, but it's on the roadmap! I also plan to add multitrack support for handling audio files with separate speaker tracks.

The code is open source and contributions are welcome. Let me know if you try it out!