Disclaimer: I have no clue what I'm doing here. If you do: pls halp.
Setup
Last year I gave a talk about serving files with Django. For demonstration purposes I wrote a little benchmark tool called "will it saturate". Recently I played around with mealie a bit and noticed that it runs an additional caddy instance just to serve recipe images. When I asked why on Discord, I was told that caddy is faster at serving images than uvicorn / starlette. So I wondered how much faster it actually is and dusted off my old "will_it_saturate" project to find out.
My base assumption was that there shouldn't be a big difference between web servers serving static files, because there's not much for them to do: they just orchestrate operating system syscalls that do the real work, and whether you call those from C (nginx), Go (caddy) or Python (uvicorn) shouldn't matter that much. Well, it turns out this assumption might be wrong. Which is interesting. The web servers used are nginx, caddy and uvicorn. I included nginx because it represents the state of the art in serving static files when configured properly, though I'm not sure whether I managed to do that. Caddy and uvicorn are the two servers I actually wanted to benchmark against each other.
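To make that assumption concrete, here's a rough sketch of what "orchestrating syscalls" means: the whole job is basically accept, open, read and send. This is illustrative only, not the code of any of the benchmarked servers.

```python
# Rough sketch of serving a static file: a handful of syscalls glued
# together (accept, open, read, send). Illustrative only.
import socket

CHUNK = 64 * 1024

def serve_one_file(conn: socket.socket, path: str) -> None:
    with open(path, "rb") as f:                      # open(2)
        conn.sendall(b"HTTP/1.1 200 OK\r\nConnection: close\r\n\r\n")
        while chunk := f.read(CHUNK):                # read(2)
            conn.sendall(chunk)                      # send(2)

server = socket.create_server(("127.0.0.1", 8080))
while True:
    conn, _ = server.accept()                        # accept(2)
    conn.recv(4096)  # ignore the request, always serve the same file
    serve_one_file(conn, "files/00001.bin")
    conn.close()
```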
For the benchmark I created 12.5K files, each containing 100KB of random data (similar to a recipe image in mealie), so that downloading all of them saturates a gigabit link for about ten seconds. I already know it's possible to saturate a gigabit connection with concurrent file downloads served from uvicorn, so I didn't repeat that. The question I'm interested in this time is: how much faster than uvicorn is caddy? The main metric is transferred bytes per second. The bigger, the better.
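Generating the test data can be as simple as this (directory and file names are made up here, will_it_saturate does its own bookkeeping):

```python
# Sketch: 12,500 files of 100KB random data each, about 1.25GB in total,
# which is roughly ten seconds of traffic on a saturated gigabit link.
import os
from pathlib import Path

NUM_FILES = 12_500
FILE_SIZE = 100_000  # 100KB; shows up as 97.66KB (KiB) in the tables below

out = Path("files")
out.mkdir(exist_ok=True)
for i in range(NUM_FILES):
    (out / f"{i:05d}.bin").write_bytes(os.urandom(FILE_SIZE))
```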
Results
Here's a first result from running this notebook:
| server | client | elapsed (s) | file size | bytes/second | arch |
|---|---|---|---|---|---|
| nginx/minimal | wrk | 0.695651 | 97.66KB | 1.67GB | x86_64 |
| nginx/sendfile | wrk | 0.718180 | 97.66KB | 1.62GB | x86_64 |
| caddy | wrk | 0.880563 | 97.66KB | 1.32GB | x86_64 |
| fastAPI/uvicorn | wrk | 6.153709 | 97.66KB | 193.72MB | x86_64 |
Ok, well. Seems like caddy is a lot faster than uvicorn, wow. The server is an old Intel Xeon running Linux. The client is wrk, opening 20 concurrent connections (more than a browser usually does). Here are some points that surprised me:
- Turning on sendfile didn't make nginx faster. I think this is because there's some hard limit on SSD bandwidth or something like that.
- I had to use multiple workers with uvicorn; a single worker yields only 50MB/s (see the sketch after this list).
- Caddy is not much slower than nginx, even though nginx uses 4 workers and caddy only one. Probably another hint that there is some hardware bottleneck.
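For reference, here's roughly what the FastAPI/uvicorn side of such a benchmark looks like; this is my guess at a minimal setup, not necessarily what will_it_saturate does:

```python
# Minimal guess at the FastAPI/uvicorn endpoint under test.
# FileResponse reads the file from disk for every request.
from fastapi import FastAPI
from fastapi.responses import FileResponse

app = FastAPI()

@app.get("/files/{name}")
async def get_file(name: str) -> FileResponse:
    # no path sanitization, this is just a benchmark sketch
    return FileResponse(f"files/{name}", media_type="application/octet-stream")

# Started with multiple workers, e.g.:
#   uvicorn app:app --workers 4
```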
Let's see how the numbers change when I run the script on my MacBook Air running macOS:
| server | client | elapsed (s) | file size | bytes/second | arch |
|---|---|---|---|---|---|
| nginx/sendfile | wrk | 0.252434 | 97.66KB | 4.61GB | arm64 |
| nginx/minimal | wrk | 0.287506 | 97.66KB | 4.05GB | arm64 |
| caddy | wrk | 0.481639 | 97.66KB | 2.42GB | arm64 |
| fastAPI/uvicorn | wrk | 4.228977 | 97.66KB | 281.89MB | arm64 |
Ok, also interesting. My MacBook Air (M1) does not even get warm running the benchmark. Surprising details:
- nginx with multiple workers is now clearly ahead of single-worker caddy (so no hardware bottleneck this time?).
- Using sendfile is faster here - I didn't know macOS even had a sendfile syscall, weird.
Conclusion
There's indeed a big difference between nginx and caddy on one side and uvicorn on the other. And I don't know why, so further research is needed :). The reason I benchmarked nginx both with and without sendfile was to make sure I'm not just measuring kernel-level I/O / zero-copy TCP vs having to do all the work in userspace, because uvicorn lacks zero-copy send support at the moment. The results seem to suggest that it's possible to be very fast without sendfile. Which is good, because nowadays we usually serve files from some kind of object store rather than the file system, so we couldn't benefit from sendfile anyway (please correct me if I'm wrong).
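For context, this is the zero-copy path the sendfile runs exercise: the kernel pushes the file bytes straight to the socket and the data never enters the process. Again just a sketch, not code from any of the servers:

```python
# os.sendfile() asks the kernel to copy file bytes directly to the socket,
# in contrast to the userspace read()/sendall() loop in the sketch further
# up, which is the path uvicorn has to take for now.
import os
import socket

def send_file_zero_copy(conn: socket.socket, path: str) -> None:
    conn.sendall(b"HTTP/1.1 200 OK\r\nConnection: close\r\n\r\n")
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        offset = 0
        while offset < size:
            sent = os.sendfile(conn.fileno(), f.fileno(), offset, size - offset)
            if sent == 0:
                break
            offset += sent
```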
Still, those numbers won't make a big difference in practice, because most machines don't have network links faster than one gigabit. It will become much more relevant when this changes to 10Gb, because then uvicorn will be too slow to saturate the link.
The next thing I'll test is preloading the files into memory, to rule out aiofiles as the culprit for uvicorn being slow. What I also find really interesting is that I had to use wrk as the HTTP client for the benchmarks, because the Python HTTP clients I tried (httpx, aiohttp) were far too slow, with aiohttp being a little faster than httpx. They max out at about 80MB/s. Why is that?
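The preloading experiment could look something like this (made-up names, not the final benchmark code): read all files once at startup and serve plain bytes, so neither aiofiles nor the disk shows up in the request path.

```python
# Sketch of the planned "preload into memory" variant: the whole ~1.25GB
# is loaded once at startup and requests are answered from a dict.
from pathlib import Path

from fastapi import FastAPI, HTTPException
from fastapi.responses import Response

app = FastAPI()
CACHE: dict[str, bytes] = {p.name: p.read_bytes() for p in Path("files").iterdir()}

@app.get("/files/{name}")
async def get_file(name: str) -> Response:
    data = CACHE.get(name)
    if data is None:
        raise HTTPException(status_code=404)
    return Response(content=data, media_type="application/octet-stream")
```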