LearnLang – a small Chrome extension for learning the German cases

I’ve been learning German for quite some time now. Some months ago, it came to the point where I was stuck – in order to progress I had to learn the German cases by heart.

The German Cases – By Touhidur Rahman – Own work, CC BY-SA 4.0

It’s not a lot of data, and understanding it is relatively straightforward. However, knowing it actively as part of a language takes practice.

My main sources of German practice are Duolingo, books and music. Both books and music contribute to passive knowledge rather than practice, and Duolingo just wasn’t focused enough. I decided to write something myself. It was a small itch I had to scratch!

Ideally, I wanted exercises in which, given a sentence, I would have to pick the correct form of der/das/die/den/dem/des wherever it appeared. The same should apply to ein/eine/eines/einer/einem/einen, dein/deine/…, mein/meine/… and so on. You get the point.

To achieve that, I wrote a small Chrome extension that processes a page, finds all the pieces of text to replace, and puts a bit of dropdown HTML in their place. You then pick an option in the dropdown: if it is right, it turns into the correct word with a green checkmark; otherwise you get a toast message saying you were wrong.

Since these days I have a full-time job plus two kids, I wrote this mostly during train rides and a couple of evenings. Doing this allowed me to learn how to write a Chrome extension (it’s really easy), but interestingly enough, there was a small challenge I didn’t expect: how do you regex-search through the text nodes of a given HTML document and replace each match with some HTML? The solution is apparently non-trivial.

If you decide to take the old text, add some tags and then do old_tag.innerHTML = modify(text_data), you are in for a nasty surprise. If text_data contained HTML tags as text, they would now be parsed as HTML. This is at best a bug, and at worst a security risk. It appears to work, until it doesn’t. Unfortunately, a lot of answers on StackOverflow suggest you do exactly that.

Well, as a lazy developer – I used somebody else’s answer, almost as is. It wasn’t even the selected answer – the selected answer used innerHTML :(
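The extension itself is JavaScript working on DOM text nodes, but the underlying principle can be sketched in a few lines of Python on plain strings: split the text around the regex matches, escape everything that is not a match, and only then splice in the new markup. The names ARTICLES, dropdown and to_exercise_html are made up for this illustration and are not taken from the extension.

```python
import html
import re

# Hypothetical word list for the illustration; the extension handles more forms.
ARTICLES = ["der", "die", "das", "den", "dem", "des"]
ARTICLE_RE = re.compile(r"\b(" + "|".join(ARTICLES) + r")\b", re.IGNORECASE)


def dropdown(answer):
    """Build a <select> element that remembers the correct answer."""
    options = "".join(f"<option>{html.escape(a)}</option>" for a in ARTICLES)
    return f'<select data-answer="{html.escape(answer.lower())}">{options}</select>'


def to_exercise_html(text):
    """Replace each article in plain text with a dropdown, escaping the rest."""
    parts = []
    last = 0
    for match in ARTICLE_RE.finditer(text):
        # Text between matches is data, not markup, so it gets escaped.
        parts.append(html.escape(text[last:match.start()]))
        parts.append(dropdown(match.group(1)))
        last = match.end()
    parts.append(html.escape(text[last:]))
    return "".join(parts)


print(to_exercise_html("<b>der</b> Wolf"))
```

The literal `<b>` in the input comes out as `&lt;b&gt;`, so it is displayed as text instead of being parsed as a tag, which is exactly what the innerHTML shortcut gets wrong.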

Here is the extension itself, you are welcome to try it out, e.g. on Rotkäppchen (AKA “Little Red Riding Hood”).

A demo of the extension
Posted in Programming

Writing a pandemic simulation

Over the last weekend I felt like programming something fun and easy, so I thought, why not write yet another pandemic/epidemic simulation.

A quick demo of simpandemic

So between helping a crying child and preparing lunch, I created simpandemic. It’s small, simplistic, but easy to play with and change parameters. As a toy project, it’s far from perfect. I implemented infection based on distance rather than collision detection, like some other simulations do, and optimized it using a grid and not a tree structure (e.g. rtree). However, it works, it is playable and very much tweak-able.
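For the curious, here is a minimal sketch of what distance-based infection with a grid can look like. It is not simpandemic's actual code: the function name close_pairs and the choice of a cell size equal to the infection radius are assumptions made for the illustration. Each point is put into a square cell, and only points in the same or adjacent cells are compared, instead of comparing every pair.

```python
import math
from collections import defaultdict


def close_pairs(positions, radius):
    """Return all index pairs (i, j), i < j, whose points are within radius."""
    # Bucket every point into a square cell whose side equals the radius.
    grid = defaultdict(list)
    for idx, (x, y) in enumerate(positions):
        grid[(math.floor(x / radius), math.floor(y / radius))].append(idx)

    pairs = set()
    for (cx, cy), members in grid.items():
        # Two points within radius can only sit in the same or adjacent cells.
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for j in grid.get((cx + dx, cy + dy), ()):
                    for i in members:
                        if i < j and math.dist(positions[i], positions[j]) <= radius:
                            pairs.add((i, j))
    return pairs


people = [(0, 0), (0.5, 0), (3, 3), (3.2, 3.1), (10, 10)]
print(sorted(close_pairs(people, 1.0)))
```

A tree structure would cope better with very uneven densities, but for points spread over a bounded area the grid is simpler and usually fast enough.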

Right now it depends on pygame, which is great fun, but a bit of a pain to get working on a Mac these days.

Feel free to download it, fork it, play with it, whatever. I’ll accept fun pull requests in case these actually come.

Stay healthy, stay safe, stay home!

Posted in Programming

How I learned to stop worrying and actually use StackOverflow

So apparently almost all of the developers in the world are using StackOverflow. However, many developers just use it to look up answers, and rarely ask their own questions. Answering other people’s questions is, of course, rarer still.

Up until recently I was the same: I wrote a few questions on StackOverflow, and even answered a few, but by and large I was using it to find existing answers.

This week something changed, something broke. In a way, I stopped caring. I had a problem, I didn’t find a solution fast enough, and decided, “what the heck, the solution is not obvious, I’ll just write a question”. Also, if the solution is obvious to someone else – that’s even better, I’ll learn something.

And so I asked my most recent questions, about distances between 2D segments, projections, etc. I’ll cover this subject in depth in a future blog post, as this one is about StackOverflow.

Writing a question on StackOverflow has a few advantages over not writing it. The most obvious one: you might actually get an answer! Here is a good example, my most recent question. The less obvious one is that you get to put your question down in writing, which, just like rubber duck debugging, helps you solve the problem and lets you practice the skill of asking the right questions.

Also important to mention – you have nothing to lose but a little bit of time. As long as your question is real and you are not clueless, asking a question will not reflect badly on you in any way, quite the opposite.

What actually surprised me is the gamification of StackOverflow – you get points for participating. I already knew about it, but I was surprised at how effective it is. Here is where I am at the time of writing this post:

My StackOverflow reputation as of 2020-03-12

Participating on SO is surprisingly addictive, and as a close friend told me there are additional advantages: once your reputation is high enough – you start getting job offers, and you can actually use that on your resume/CV (if using them is a thing you do :)

My advice to any developer reading this: you are already looking up answers on StackOverflow. If you don’t find an answer, don’t just move on. Before you do – write a question. Even if you do move on, you’ll get something valuable from it.

Posted in Programming

Back to writing

So apparently my last blog post was from 2012. That’s quite a bit of time.

Since then I’ve:

  • Had a son
  • Sold my startup Desti to HERE
  • Moved with my family to Boston
  • Moved back to Israel, joined Cymmetria, first as VP R&D and later as CTO
  • Had another son
  • Left Cymmetria and joined Flytrex as VP R&D

It’s not a long list, but it covers a lot of ground. Right now, coronavirus notwithstanding, I’m pretty excited about the work we do at Flytrex: we’re building a system for food delivery via autonomous drones.

Here is a short video that shows what we’re working on:

The video is by now 11 months old and the system has changed a lot since then. Our main challenge right now is getting this system working in the USA.

Learning from my experience, I want to start writing regularly. To achieve that, while I will write mostly about programming, I will also write about other areas of interest. Let’s see where this new adventure takes us. Onwards!

Posted in Personal

Two bugs don’t make a right

Three lefts roadsign
At my new startup, we are doing a little bit of reasoning using implications. One of the more curious forms of implication is the negative form. Consider the following exaggerated example:

  • a place being kid-friendly implies that it is not romantic.
  • a place being a strip club implies it is not kid-friendly

If we allow negative implications to be transitive, then it would follow that since being a strip club makes a place less kid-friendly, it makes it more romantic. We don’t want that. So I had to write some code to specifically ignore that situation. Before writing that, in the best tradition of TDD I wrote a test for two chained negative implications. I implemented the code, the test passed and I was happy.

For a while.

Fast forward a couple of weeks, and I’m trying out adding some negative implications, and the program doesn’t behave as expected. My code doesn’t work. I turn back to my test, check it out, and sure enough, all the things the test asserts as True are actually True, and the test does test the right thing.

Digging deeper, I discovered the issue. I had two bugs: the first was that the code handling chained negative implications wasn’t working right. The second was in my graph building algorithm – it seems that I was forgetting to add some edges. What made that second bug insidious was that it hid the effect of the first bug from the test – effectively making the test pass.

So the lesson for me was: two negative implications don’t mean a positive one, and two bugs don’t make a feature.
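To make the rule about negative implications concrete, here is a small sketch of one way to keep them from chaining. It is an illustration only, not the startup's code: the class ImplicationGraph and its method names are invented here. Positive implications are followed transitively, while a negative implication only rules a property out and is never followed any further.

```python
from collections import defaultdict


class ImplicationGraph:
    """Hypothetical graph of 'a implies b' and 'a implies not b' edges."""

    def __init__(self):
        self.positive = defaultdict(set)  # a -> {b}: a implies b
        self.negative = defaultdict(set)  # a -> {b}: a implies not b

    def add_positive(self, a, b):
        self.positive[a].add(b)

    def add_negative(self, a, b):
        self.negative[a].add(b)

    def consequences(self, start):
        """Return (properties that hold, properties that are ruled out)."""
        holds = {start}
        stack = [start]
        while stack:
            current = stack.pop()
            # Positive implications are transitive, so keep walking them.
            for nxt in self.positive.get(current, ()):
                if nxt not in holds:
                    holds.add(nxt)
                    stack.append(nxt)
        # Negative implications are applied once and never chained.
        ruled_out = set()
        for prop in holds:
            ruled_out |= self.negative.get(prop, set())
        return holds, ruled_out


graph = ImplicationGraph()
graph.add_negative("kid-friendly", "romantic")
graph.add_negative("strip club", "kid-friendly")
print(graph.consequences("strip club"))
```

With the two example rules, a strip club is ruled out as kid-friendly, and nothing at all is concluded about it being romantic.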

Posted in Programming | 1 Comment

Optimizing Django ORM / Postgres queries using left join

For the latest project I’m working on, we’re using Django with Postgres. I was writing some code that had to find a list of objects that weren’t processed yet. The way they were stored in the DB is like so:

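from django.db import models

class SomeObject(models.Model):
    # some data
    pass

class ProcessedObjectData(models.Model):
    # on_delete is mandatory since Django 2.0; CASCADE is an assumption here
    some_object = models.ForeignKey(SomeObject, on_delete=models.CASCADE, db_index=True)
    # some more data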

In this schema, SomeObject is the original object, and a ProcessedObjectData row is created as the result of the processing. You might argue that the two tables should be merged together to form a single table, but that is not right in our case: first, SomeObject “has standing on its own”. Second, we are interested in having more than one ProcessedObjectData per one SomeObject.

Given this situation, I was interested in finding all the SomeObject’s that don’t have a certain type of ProcessedObjectData. A relatively easy way to express it (in Python + Django ORM) would be:

SomeObject.objects.exclude(id__in = ProcessedObjectData.objects.filter(...some_filter...).values('some_object_id'))

Unfortunately, while this is reasonable enough for a few thousand rows (takes a few seconds), when you go above 10k and certainly for 100k objects, this starts running slowly. This is an example of a rule of mine:

Code is either fast or slow. Code that is “in the middle” is actually slow for a large enough data-set.

This might not be 100% true, but it usually is and in this case – very much so.

So, how to optimize that? First, you need to make sure that you’re optimizing the right thing. After a few calls to the profiler I was certain that it was this specific query that was taking all of the time. The next step was to write some hand-crafted SQL to solve that, using:

SomeObject.objects.raw(...Insert SQL here...)

As it turns out, Adi suggested that I use a left join. After reading about it a little bit and playing around with it, I came up with a solution: do a left join in an inner select, and use the outer select to keep only the rows with NULL, which indicates a missing ProcessedObjectData element. Here is a code example of how this could look:

SELECT id FROM (
    SELECT some_object.id AS id, processed_object_data.id AS other_id FROM
    some_object
    LEFT JOIN
    processed_object_data
    ON
    (some_object.id = processed_object_data.some_object_id) AND
    (...some FILTER ON processed_object_data...)
) AS inner_select 
WHERE 
inner_select.other_id IS NULL
LIMIT 100

That worked decently enough (a few seconds for 100k’s of rows), and I was satisfied. Now to handling the actual processing, and not the logistics required to operate it.
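If you want to play with the left-join trick without a Django project, here is a self-contained sketch using Python's built-in sqlite3 module. The table names follow the SQL above, but the kind column is a hypothetical stand-in for "some filter on processed_object_data", and the query is flattened into a single select, which gives the same rows as the nested form.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory database with a few objects and processing results."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE some_object (id INTEGER PRIMARY KEY);
        CREATE TABLE processed_object_data (
            id INTEGER PRIMARY KEY,
            some_object_id INTEGER NOT NULL REFERENCES some_object(id),
            kind TEXT NOT NULL  -- hypothetical column standing in for the filter
        );
        CREATE INDEX processed_by_object ON processed_object_data (some_object_id);
        INSERT INTO some_object (id) VALUES (1), (2), (3), (4);
        INSERT INTO processed_object_data (some_object_id, kind)
            VALUES (1, 'a'), (2, 'b'), (1, 'b');
    """)
    return conn


def find_unprocessed(conn, kind, limit=100):
    """Return ids of objects that have no processed row of the given kind."""
    query = """
        SELECT some_object.id
        FROM some_object
        LEFT JOIN processed_object_data
            ON some_object.id = processed_object_data.some_object_id
            AND processed_object_data.kind = ?
        WHERE processed_object_data.id IS NULL
        ORDER BY some_object.id
        LIMIT ?
    """
    return [row[0] for row in conn.execute(query, (kind, limit))]


conn = make_demo_db()
print(find_unprocessed(conn, "a"))
```

Keeping the filter inside the ON clause is the important part: moving it into WHERE would throw away the unmatched rows before the NULL check gets to see them.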

Posted in Databases, Optimization | 1 Comment

Collision: the story of the random bug

So here I was, trying to write some Django server-side code, when every once in a while, some test would fail.
Now, it is important to know that we are using any_model, a cute little library that allows you to specify only the fields you need when creating objects, and randomizes the rest (to help uncover more bugs).

In this particular instance, the test that was failing was trying to store objects on the server using an API, and then check that the new objects exist in the DB. Every once in a while, an object didn’t exist. It should be noted that the table with the missing rows had a Django ORM URLField.

So first things first, I changed the code to print the random seed it was using on every failure. Now the next time it failed (a day later), I had the random seed in hand.

I then proceeded to use that random seed – and now I had a reproducible bug – it failed every time, consistently.
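The seed trick is easy to reuse. Below is a minimal sketch of the idea, assuming the randomness comes from Python's global random module (I am not claiming this is how any_model is wired internally). The helper name run_with_seed is made up for the example.

```python
import random


def run_with_seed(test_func, seed=None):
    """Run test_func under a known random seed and report it on failure."""
    if seed is None:
        # Pick a fresh seed, but remember it so a failure can be replayed.
        seed = random.SystemRandom().randrange(2 ** 32)
    random.seed(seed)
    try:
        test_func()
    except Exception:
        print(f"test failed, random seed was {seed}")
        raise
    return seed


def sample_test():
    # Stand-in for a test that builds objects from random data.
    value = random.randrange(1000)
    assert 0 <= value < 1000


used_seed = run_with_seed(sample_test)
run_with_seed(sample_test, seed=used_seed)  # replays the exact same run
```

Once a failing seed shows up in the output, passing it back in turns the once-a-day failure into one that happens on every run.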

The next step was finding the cause of the bug. To cut a long story short, it turns out that the test looked for an object with a specific URL. Which URL? The URL created for the first object (we had two).

The bug was that the second object was getting the same url as the first. I remind you, these urls are generated randomly. The troublesome url was http://72.14.221.99

I leave you now to guess/check what the chances are for a collision here
(the correct way to do that would be to check any_model’s code for generating urls, and not just say 1 in 2^32… :)

So I made sure the second object got a new url, and all was well, and the land had rest for forty years. (or less).

Posted in Python | 2 Comments

Cheap language detection using NLTK

Some months ago, I was facing a problem of having to deal with large amounts of textual data from an external source. One of the problems was that I wanted only the English elements, but was getting tons of non-English ones. To solve that I needed some quick way of getting rid of non-English texts. A few days later, while in the shower, the idea came to me: using NLTK stopwords!

What I did was, for each language in nltk, count the number of stopwords in the given text. The nice thing about this is that it usually generates a pretty strong read about the language of the text. Originally I used it only for English/non-English detection, but after a little bit of work I made it specify which language it detected. Now, I needed a quick hack for my issue, so this code is not very rigorously tested, but I figure that it would still be interesting. Without further ado, here’s the code:

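import nltk

# Requires the stopwords corpus: run nltk.download('stopwords') once.
ENGLISH_STOPWORDS = set(nltk.corpus.stopwords.words('english'))
# words() with no argument returns the stopwords of every language.
NON_ENGLISH_STOPWORDS = set(nltk.corpus.stopwords.words()) - ENGLISH_STOPWORDS

# Map each language name (the corpus file ids) to its set of stopwords.
STOPWORDS_DICT = {lang: set(nltk.corpus.stopwords.words(lang)) for lang in nltk.corpus.stopwords.fileids()}

def get_language(text):
    """Return the language whose stopwords overlap most with the words in text."""
    words = set(nltk.wordpunct_tokenize(text.lower()))
    return max(STOPWORDS_DICT, key=lambda lang: len(words & STOPWORDS_DICT[lang]))


def is_english(text):
    """Return True if text has more English than non-English stopwords."""
    words = set(nltk.wordpunct_tokenize(text.lower()))
    return len(words & ENGLISH_STOPWORDS) > len(words & NON_ENGLISH_STOPWORDS)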
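One caveat with this kind of scoring is that a text containing no stopwords at all still gets assigned to some language, simply because max has to return something. Here is a sketch of a variant that returns None in that case. To keep it runnable without downloading any corpus, it takes the stopword sets as a parameter and tokenizes with a plain regex instead of NLTK; the function name detect_language and the tiny sample sets are made up for the illustration and are not NLTK's lists.

```python
import re


def detect_language(text, stopwords_by_language):
    """Return the best-matching language, or None if no stopword matches."""
    words = set(re.findall(r"\w+", text.lower()))
    scores = {
        lang: len(words & stopwords)
        for lang, stopwords in stopwords_by_language.items()
    }
    best = max(scores, key=scores.get, default=None)
    if best is None or scores[best] == 0:
        return None  # nothing matched, so refuse to guess
    return best


# Tiny illustrative sample; in practice, pass in something like STOPWORDS_DICT.
SAMPLE_STOPWORDS = {
    "english": {"the", "and", "is", "of"},
    "german": {"der", "und", "ist", "nicht"},
}

print(detect_language("Der Hund ist nicht hier", SAMPLE_STOPWORDS))
print(detect_language("12345", SAMPLE_STOPWORDS))
```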

The question to you: what other quick NLTK, or NLP hacks did you write?

Posted in Python | 6 Comments

Wikipedia Images

A few days ago a friend (x) of a friend (y) showed me and my friend (y) a small app he was developing, which had photos from Flickr and Picasa. We suggested adding photos from Wikipedia as well, but he (x) said that the photos were too big, and it was too much trouble resizing them.
Luckily for him I knew of Wikipedia’s “hidden” image resizing feature, and as it was useful to me and to someone else, I thought I’d share it here.

Let’s say you are looking to resize the following image of the Eiffel Tower: http://en.wikipedia.org/wiki/File:Tour_Eiffel_Wikimedia_Commons.jpg. Then the url to the image itself is:

http://upload.wikimedia.org/wikipedia/commons/a/a8/Tour_Eiffel_Wikimedia_Commons.jpg

To get the url to a resized image, just add ‘thumb/’ after ‘commons/’ and then add ‘/[%d]px-[filename]’ at the end, where %d is the new width in pixels. So for our image, the new url would be:

http://upload.wikimedia.org/wikipedia/commons/thumb/a/a8/Tour_Eiffel_Wikimedia_Commons.jpg/300px-Tour_Eiffel_Wikimedia_Commons.jpg
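The rule is mechanical enough to put in a tiny helper. This is just a sketch of the string manipulation described above; the function name wikimedia_thumb_url is made up, and it only handles urls that contain ‘/commons/’.

```python
def wikimedia_thumb_url(image_url, width):
    """Turn a full-size Wikimedia Commons image url into a resized-image url."""
    marker = "/commons/"
    head, found, tail = image_url.partition(marker)
    if not found:
        raise ValueError("expected a url containing '/commons/'")
    # The file name is the last path component, e.g. 'Tour_Eiffel_....jpg'.
    filename = tail.rsplit("/", 1)[-1]
    return f"{head}{marker}thumb/{tail}/{width}px-{filename}"


original = "http://upload.wikimedia.org/wikipedia/commons/a/a8/Tour_Eiffel_Wikimedia_Commons.jpg"
print(wikimedia_thumb_url(original, 300))
```

Running it on the Eiffel Tower url with a width of 300 reproduces the resized url shown above.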

That’s it, simple and quick. Have fun adding Wikipedia to your content!

Posted in Programming | 3 Comments

Python Module Usage Stats – Feb 2011

Here are the top 30 “base modules”, ordered by number of PyPI projects importing them. These results are based on 11,204 packages downloaded from PyPI. Explanations, full results and code to generate them are available below.

Results



Full results are available (see Methodology to understand what they mean exactly).

Discussion

Some interesting tidbits and comparisons:

  • It seems django has gained “some popularity”. Zope is very high up on the list, and plone is at 42 with 907 projects importing it.
  • The number of projects importing unittest is somewhat depressing, especially relative to setuptools which is impressive. That might be because setuptools is somewhat a prerequisite to appear on PyPI (practically speaking), while unittest is not. (Edit: corrected by Michael Foord in a comment)
  • optparse with 1875 vs. getopt with 515.
  • cPickle with 690 vs. pickle with 598.
  • simplejson with 760 vs. json with 593.

I invite you all to find out more interesting pieces of information by going over the results. I bet there’s a lot more knowledge to be gained from this.

Background

Back in 2007 I wrote a small script that counted module imports in python code. I used it to generate statistics for Python modules. A week or two ago I had an idea to repeat that experiment – and see the difference between 2007 and 2011. I also thought of a small hypothesis to test: since django became very popular, I’d expect it to be very high up on the list.

I started working with my old code, and decided that I should update it. Looking for imports in Python code is not as simple as it seems. I considered using the tokenize and parser modules, but decided against that. Using parser would make my code version dependent and by the time I thought of tokenize, I had the complicated part already worked out. By the complicated part I mean of course the big regexps I used ;)

Methodology

Input: PyPI and a source distribution of the Python 2.7 standard library. I wrote a small script (cheese_getter.py) to fetch python modules. It does so by reading the PyPI index page, and then using easy_install to fetch each module. Since currently there are a bit fewer than 13k modules in PyPI, this took some time.

Parsing: I wrote a relatively simple piece of code to find “import x” and “from x import y” statements in code. This is much trickier than it seems: statements such as “from x import a,b”, “from . import bla” and

from bla import \
               some_module, \
               some_module2

should all be supported. In order to achieve uniformity, I converted each import statement to a series of dotted modules. So for example, “import a.b” will yield “a” and “a.b”, and “from b import c,d” will yield “b”, “b.c”, and “b.d”.
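To illustrate the conversion to dotted modules, here is a short sketch. Note that it swaps in Python's ast module instead of the regexps used for the actual results, so it only parses code that is valid for the Python version running it, which is the version dependence mentioned in the Background section. The function names are made up, and skipping relative imports is a simplification for the example.

```python
import ast


def dotted_prefixes(name):
    """'a.b.c' -> ['a', 'a.b', 'a.b.c']"""
    parts = name.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]


def imported_modules(source):
    """List the dotted modules mentioned by the import statements in source."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                found.extend(dotted_prefixes(alias.name))
        elif isinstance(node, ast.ImportFrom):
            if node.level or node.module is None:
                continue  # relative import such as 'from . import bla'
            found.extend(dotted_prefixes(node.module))
            for alias in node.names:
                if alias.name != "*":
                    found.append(f"{node.module}.{alias.name}")
    return found


print(imported_modules("import a.b"))
print(imported_modules("from b import c,d"))
```

Backslash continuations and multiple names per statement come for free with this approach, since the parser has already normalized them away.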

Processing: I created three result types:

  1. total number of imports
  2. total number of packages importing the module
  3. total number of packages importing the module, only for the first module mentioned in a dotted module name, e.g. not “a.b”, only “a”.

I believe the third is the most informative, although there are interesting things to learn from the others as well.
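The three result types can be sketched with a few Counter objects. This is an illustration of the definitions above rather than the code that produced the published numbers; the function name import_stats and the input shape (a dict mapping each package to the list of dotted modules found in it) are assumptions.

```python
from collections import Counter


def import_stats(imports_by_package):
    """Compute the three result types from {package: [dotted module, ...]}."""
    total_imports = Counter()            # 1. every occurrence counts
    packages_importing = Counter()       # 2. each package counts once per module
    packages_importing_base = Counter()  # 3. same, but only the base module
    for modules in imports_by_package.values():
        total_imports.update(modules)
        packages_importing.update(set(modules))
        packages_importing_base.update({m.split(".")[0] for m in modules})
    return total_imports, packages_importing, packages_importing_base


example = {
    "pkg1": ["a", "a.b", "a", "c"],
    "pkg2": ["a.b", "a"],
}
total, per_package, base = import_stats(example)
print(total.most_common(), per_package.most_common(), base.most_common())
```

The third counter is the one that answers "how many projects use django at all", regardless of which submodules they import.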

Code: Full code is available. Peer reviews and independent reports are welcome :)

Posted in Python | 9 Comments