Modern laptops are significantly more powerful than many people realize, often capable of handling data processing tasks that were traditionally relegated to computing clusters. This still surprises people, which in turn surprises me, because it's not a new development. While certain tasks undeniably require immense computational power, more often than not we are limited by input/output (I/O) speed rather than raw compute. The solid-state drives (SSDs) in today's laptops, such as those in MacBooks, can sustain read speeds of several gigabytes per second, making them highly efficient for data-intensive work.
Consider a common scenario: you have several hundred gigabytes of Parquet files and need to perform aggregations or simple computations that could be handled with a tool like polars/pandas/numpy. In a cluster environment, these files are typically distributed across multiple machines using a system like Hadoop Distributed File System (HDFS) or stored in S3 buckets. Tools like Impala, Hive, or Spark are then used to execute SQL-like queries, distributing the workload across numerous nodes. However, these nodes often spend a significant amount of time waiting for I/O operations from S3 or HDFS, leading to inefficiencies.
Hannes Mühleisen, co-creator of DuckDB, addressed this issue in a talk, explaining why they chose not to build a distributed version of DuckDB. He pointed out that distributing the load across multiple machines rarely offers a performance benefit that justifies the added complexity. In their tests, a single server node running DuckDB could match the performance of a 60-node Spark cluster, a trade-off that hardly seems worthwhile.
One of the original BigQuery engineers echoed this sentiment in a blog post, arguing that most companies believe they have "big data" when, in reality, their data can be processed comfortably on a single machine.
The evolution of big data infrastructure meetings tells the story:
- 2014: "We need a whole rack for this!"
- 2019: "Maybe just one really beefy server?"
- 2024: "Have you tried using your MacBook?"
- Next year: "Just run it on your smart watch, bro"
My favorite quote about big data comes from Gary Bernhardt:

> Consulting service: you bring your big data problems to me, I say "your data set fits in RAM", you pay me $10,000 for saving you $500,000.
The Takeaway
Before investing in complex and costly infrastructure, it's crucial to assess the actual requirements of your data processing tasks. Modern laptops are powerful tools that, when fully utilized, can handle substantial workloads efficiently. By rethinking our approach and leveraging the hardware we already have, we can achieve better performance without unnecessary complexity.