PyCon DE Day Three

April 13, 2022, Jochen

Got up early enough today and caught the right bus. Buying a 24h ticket exactly 24h after you previously bought one (on the same bus) seems to be a very strange use case, because it broke the BVG app. But I got to the bcc on time, which was important because the first talk was one of the talks I was most looking forward to: Python 3.11 in the Web Browser - A Journey by Christian Heimes. And it was as great as expected.

Another talk I was very eager to see was It is all about files and HTTP by Efe Öge, which was scheduled right after the keynote in the same room. It turned out to be more of a gentle introduction to file serving, not the hardcore "Why do we even need nginx? Let's do it all in Python with uvloop and asyncio! Let's bring zero-copy TCP to uvicorn!!1" talk I would have loved to hear. Maybe I'll have to give that talk myself someday. But not now, I don't have time. No, don't do it.

The next talk I attended was Squirrel - Efficient Data Loading for Large-Scale Deep Learning by Dr. Thomas Wollmann. It was about how to optimize your data ingestion so that your GPUs don't sit idle (which is probably a valid use case). Squirrel was said to be an implementation of a data mesh, a pattern that was proposed by Thoughtworks and McKinsey (I had never heard of it until now). I don't know if it's fair to reject the idea solely based on this evidence (the proponents). I have a podcast episode about data meshes in my inbox; maybe I'll use this as an excuse to finally listen to it. Meh.
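
To make the "idle GPU" problem a bit more concrete, here is a minimal sketch of the general idea of prefetching: a background thread loads the next batches while the consumer is still busy with the current one. This is not Squirrel's API. The `prefetch` helper and the `buffer_size` parameter are names I made up for illustration.

```python
import queue
import threading

_DONE = object()  # sentinel that marks the end of the stream


def prefetch(batches, buffer_size=2):
    """Yield items from `batches` while a background thread loads ahead.

    Hypothetical helper for illustration, not part of any library.
    """
    buffer = queue.Queue(maxsize=buffer_size)

    def producer():
        try:
            for batch in batches:
                buffer.put(batch)  # blocks while the buffer is full
        finally:
            buffer.put(_DONE)  # always signal the end, even on errors

    threading.Thread(target=producer, daemon=True).start()

    while True:
        item = buffer.get()
        if item is _DONE:
            return
        yield item


# Usage: the consumer sees the batches in their original order.
loaded = list(prefetch(iter(range(5))))
```

The bounded queue is the important design choice here: it lets the loader run ahead of the consumer without reading the whole dataset into memory.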

After that I listened to On Blocks, Copies and Views: updating pandas' internals by Joris Van den Bossche. This was a really great talk. Pandas is great, but it is often hard to say which operations produce a copy of the data. If you change data in a dataframe that you believe is a copy but isn't, bad things might happen. The rules for when an operation creates a copy, and when a change instead alters data that is referenced by multiple dataframes, are completely confusing. A clean solution would be some kind of "copy on write" mechanism for dataframes, where each operation yields what behaves like a copy, but an actual copy is only created when the data is changed. But this will probably break a lot of old code, so it's not easy to switch to that solution.
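
The copy-on-write idea can be sketched in a few lines of plain Python. This is a toy model of the concept only, not how pandas implements it. `CowColumn`, `view` and `shares_memory_with` are hypothetical names chosen for this example.

```python
class _Buffer:
    """Holds the actual values plus a count of columns referencing them."""

    def __init__(self, values):
        self.values = list(values)
        self.refs = 1


class CowColumn:
    """Toy column that shares its buffer until somebody writes to it."""

    def __init__(self, values):
        self._buf = _Buffer(values)

    def view(self):
        """Return a logical copy that still shares the underlying buffer."""
        other = CowColumn.__new__(CowColumn)
        other._buf = self._buf
        self._buf.refs += 1
        return other

    def __getitem__(self, index):
        return self._buf.values[index]

    def __setitem__(self, index, value):
        if self._buf.refs > 1:
            # Someone else references this buffer: detach before writing.
            self._buf.refs -= 1
            self._buf = _Buffer(self._buf.values)
        self._buf.values[index] = value

    def shares_memory_with(self, other):
        return self._buf is other._buf


prices = CowColumn([10, 20, 30])
subset = prices.view()  # cheap: no data is copied yet
shared_before_write = subset.shares_memory_with(prices)
subset[0] = 99  # the first write triggers the real copy
shared_after_write = subset.shares_memory_with(prices)
```

After the write, `prices[0]` is still 10 and only `subset` sees the 99, so every derived object behaves like an independent copy while reads stay cheap.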

After lunch I attended How to Find Your Way Through a Million Lines of Code by Jürgen Gmach, which was also really great. It was about things you can do to get more familiar with a large codebase. For example: if you don't know where to put a new test, set a breakpoint at the location where you want to change something and then run the tests. Then put your new test next to the semantically nearest test you found using this method.
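
Here is a small sketch of a programmatic variant of that trick. Instead of stopping at a breakpoint, a temporary probe in the function records which test functions are on the call stack. The function `apply_discount` and both tests are invented for this example. In a real project you would more likely just call `breakpoint()` at that spot and look at the stack in the debugger.

```python
import traceback

callers = set()  # (filename, test name) pairs that reached the probe


def apply_discount(price, percent):
    # Temporary probe: remember every test function on the call stack.
    for frame in traceback.extract_stack():
        if frame.name.startswith("test_"):
            callers.add((frame.filename, frame.name))
    return price * (100 - percent) / 100


def test_checkout_total():
    assert apply_discount(200, 10) == 180


def test_unrelated():
    assert 1 + 1 == 2


# Stand-in for a real test runner such as pytest.
test_checkout_total()
test_unrelated()

found = sorted(name for _, name in callers)
```

Only the tests that actually exercise `apply_discount` end up in `found`, and those are the candidates to put the new test next to.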

The last talk of the day was kind of a blast from the past: Transformer based clustering: Identifying product clusters for E-commerce by Sebastian Wanner and Christopher Lennan. I did something similar 15 years ago while working for billiger.de (using machine learning to solve this problem, not the transformer stuff, which hadn't been invented yet). Overall their approach seemed very similar to ours, and their numbers (0.84 F0.5 on shoe offers) looked really good. The room was packed and the questions from the audience were also very good. I would bet there were more people in the room working on the same problem. Too bad nobody cared about this problem back when I was working on it. I have to say I'm a little bit jealous now, *sigh*. Very interesting talk.
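
For readers who don't have the F0.5 score at hand: it is the F-beta score with beta = 0.5, which weights precision higher than recall. A minimal sketch of the standard formula follows. The precision and recall values in the usage line are made up for illustration and are not the numbers from the talk.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score: weighted harmonic mean of precision and recall.

    beta < 1 favours precision, beta > 1 favours recall.
    """
    if precision == 0 and recall == 0:
        return 0.0
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Hypothetical values: high precision, mediocre recall.
f_half = f_beta(0.9, 0.6, beta=0.5)
f_one = f_beta(0.9, 0.6, beta=1.0)
```

With these made-up inputs F0.5 comes out higher than F1, because the metric rewards the high precision more than it punishes the lower recall.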

So, back to Düsseldorf after three intense days. I think I had one warning in my Corona-Warn-App in the last two years, but now it's going crazy. Public transport, probably. But since I nearly always wore a mask, I'm not really worried.