Ephes Blog

Miscellaneous things. Not sure what to put here, yet.


Scatter plots with density quartiles with python

The other day I saw an interesting blog post about scatter plots with density quartiles using r.

I liked the idea, but wondered whether a simple kernel density plot would have the same effect. And if not, maybe how difficult it will be to adapt the approach to python. So I created this little jupyter notebook:

Writing my own blog engine: The database model

 Since I’m writing my own blog software, I’ve thought about how to lay out the models in the database. This is the layout I am currently using:


There's the main table of blogposts having a blog_id foreign key column pointing to the blog a blogpost belongs to. Blogposts are also asscociated with the user which created them in a column named author. I omitted all of the other user <-> model relationships to make the entity relationship diagram more simple.

The relationships between blogposts and media entities like images or videos are more interesting. At the moment I'm using many to many relationships for each media type. Galleries of images are considered as a different media type and have another many to many relationship to images. Probably there will be some more of those relationships like audio being added in the future.

Having a many to many relationship for each media type seems to be tedious. Just getting all of the blogposts including their related media models for the list of recent posts requires now complex sql queries. Is this really necessary? I don't now, but all approaches have their advantages and drawbacks and this seems to be the most general approach, so I'll use it until I know that I really don't need that generality. This might sound a little bit like premature optimization, but this is a fun project so it doesn't have to be efficient.

The approach I used before that was to have one many to many relationship between blogposts and a media model which then had a generic relation to the actual media model. It would have been easier to add new media types: Just add a new audio table and relate blogposts to audio content by using the generic foreign key from the media model table. This is also the approach used by generic tagging applications. They can't know in advance for which models tags will be created, so they have a generic relation that could be used for every model. But in my case there's a finite set of media types. There are not hundreds of possibilities but just five to ten. And using generic relations has some bad disadvantages which are nicely summarized in the article Avoid Django’s GenericForeignKey.

Another approach I've thought about but didn't implement is to have just simple one to many relationships between blogposts and media models. So for example you could have a blogpost with many images but not an image that belongs to more than one blogpost. In theory this is wrong, because it should be completely possible that one image appears in more than one blogpost. But this shouldn't happen often and if it happens, it's enough to duplicate the image row in the database, which isn't a problem. The image itself lives in the file system and we could use a hash of it's own content as a filename to avoid duplicate images in the file system. We would have then multiple image models in the database pointing to the same image in the file system, but would be really that bad? The database queries would get a lot simpler then.

And finally I'm not sure whether it's a good idea to have galleries in the database as their own models. It would also be possible to add a json field to blogpost and write information like which image belongs to which gallery to this json field.

It always suprises me how seemingly easy problems like how to model a blog engine turn out to be not so trivial at all on close examination. It seems that The Schainker Converse to Hoare's Law of Large Problem still holds true:  Inside every small problem is a larger problem struggling to get out.

Tartlets

Probably one of the last days this year we got to sit outside.

Using a CDN

Today I switched from serving media files directly from s3 to cloudfront. The main reason is that s3 only supports http/1.1 while cloudfront also does http/2 which is much faster because images etc. could be downloaded in parallel.

PyCon 2017

PyCon 2017 is over now and I liked it a lot. It was a really well organized event and the talks were great. I already knew the conference location because I lived only a few hundred meters away for a long time. So it didn't came as a surprise for me that it was awesome.



I don't attend conferences like this because of the talks, I have to say. But I always get a strange boost in motivation from events like this - I don't now why, but I suspect it has something to do with talking to a lot of likeminded people. Talking to my former collegues from billiger.de for example was a lot of fun. They did a lot of cool things since I moved on like having a new search service in pure python or migrating from relational databases to redis and now to elasticsearch for document storage.



At the social event we had some concerns that the beer supply could become a problem, since nearly all of the attendees showed up. Turned out that our worries were exaggerated.



Since I know Karlsruhe well, I did also met nice people that didn't have anything to do with pycon.



There was a funny installation at one of the exhibitions at the zkm.