I am swamped with university homework, but I did manage to come up with a small, somewhat useful utility.
Do you know when you need a word that ends with a double letter and then an ‘n’? You would want to write it like this: “xxN$”. Well, I wrote a simple tool that translates a ‘word lookup syntax’ into regular expressions, and then looks up the words that match.
This raises two questions:
What exactly is the syntax?
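Small letters stand for any letter, and big letters stand for themselves. So for example, “bEEn” would match “been”, “seen” and “keep”, but not “boon”. A small letter that repeats must match the same letter each time, and different small letters must match different letters, which is why “xxN$” asks for a double letter followed by an ‘n’. Punctuation is left unchanged, so you can use $, ^, (?!xy) and so on. I didn’t work too much on it, so [] doesn’t translate well (the letters inside it will come out garbled).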
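To make the rules above concrete, here is a minimal sketch of such a translation. It is not the tool's actual code: the function name `translate` is hypothetical, only ASCII letters are handled, and the generated regexp is a little tidier than the one the real tool prints. The first time a small letter appears it becomes a named group, guarded by a negative lookahead that rules out every small-letter group and capital letter seen so far; later appearances become backreferences.

```python
import re
import string

def translate(pattern):
    """Translate the word lookup syntax into a regexp string (sketch only)."""
    bound = []   # small letters already bound to a named group, in order
    fixed = []   # capital letters seen so far, in order
    parts = []
    for ch in pattern:
        if ch in string.ascii_lowercase:
            if ch in bound:
                # Repeated small letter: must match the same letter again.
                parts.append("(?P=%s)" % ch)
            else:
                # New small letter: must differ from everything seen so far.
                excluded = ["(?P=%s)" % b for b in bound] + fixed
                if excluded:
                    parts.append("(?!%s)" % "|".join(excluded))
                parts.append("(?P<%s>.)" % ch)
                bound.append(ch)
        elif ch in string.ascii_uppercase:
            # Capital letter: stands for itself.
            if ch not in fixed:
                fixed.append(ch)
            parts.append(ch)
        else:
            # Punctuation such as ^ and $ passes through unchanged.
            parts.append(ch)
    return "".join(parts)

regex = translate("^bEEn$")
print(regex)
for word in ["BEEN", "SEEN", "KEEP", "BOON"]:
    print(word, bool(re.search(regex, word)))
```

The sketch assumes the candidate words are upper-case, so the capital letters in the pattern can be matched literally.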
What is the data for the lookup?
Well, I used the excellent Natural Language Toolkit (NLTK), obviously in Python. It is very easy to start using, and it comes with a nice tutorial. It’s nice knowing that such a tool is available. The source data is actually an extract from Project Gutenberg. The result words are sorted according to their frequency in the source data, and the 20 most frequent words are returned.
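The lookup step can be sketched roughly as follows. This is an illustration, not the tool's real code: `build_word_index` and `lookup` are hypothetical names, and a tiny hand-made token list stands in for the Gutenberg extract so that the example runs without NLTK installed. With NLTK and its Gutenberg corpus available, the tokens could presumably come from `nltk.corpus.gutenberg.words()` instead.

```python
import re
from collections import Counter

def build_word_index(tokens):
    """Count how often each alphabetic token occurs, ignoring case."""
    return Counter(t.upper() for t in tokens if t.isalpha())

def lookup(regex, index, limit=20):
    """Return up to `limit` words matching `regex`, most frequent first."""
    compiled = re.compile(regex)
    matches = [(word, count) for word, count in index.items()
               if compiled.search(word)]
    # Sort by descending frequency; break ties alphabetically.
    matches.sort(key=lambda item: (-item[1], item[0]))
    return [word for word, _ in matches[:limit]]

# Stand-in data (an assumption for this example, not the real corpus).
tokens = "there were these three trees where the seed had been seen there".split()
index = build_word_index(tokens)
print(lookup(r"^.EE.$", index))
print(lookup(r"E$", index, limit=2))
```

Scanning every distinct word for each query is linear in the vocabulary size, which is slow in principle but fine for interactive use.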
Without further ado, here are some code snippets. (The compiled regexp is printed. Note how ugly (yet simple) it is :)
(Note: I did cut out some of the output…)
    In [1]: fd = words.create_word_index()

    In [2]: words.find_words("^bEEn$",fd)
    ^(?P<b>.)EE(?!((?P=b))|(E))(?P<n>.)$
    Out[2]: ['BEEN', 'SEEN', 'KEEP', 'FEET', 'FEEL', 'SEEK', 'SEED', 'MEET', 'DEEP', 'NEED', 'SEEM', ...]

    In [3]: words.find_words("^jvfpf$",fd)
    ^(?P<j>.)(?!((?P=j)))(?P<v>.)(?!((?P=j))|((?P=v)))(?P<f>.)(?!((?P=j))|((?P=f))|((?P=v)))(?P<p>.)(?P=f)$
    Out[3]: ['THERE', 'THESE', 'WHERE', 'JESUS', 'MOSES', 'HOSTS', 'LINEN', 'PIECE', 'SCENE', 'WHOSO', 'POSTS', 'EPHAH', ...]
Note that if you find some of the result words a bit too uncommon for your taste (e.g. “ephah”), I would suggest removing the Bible from the source data… :)
The code (very ugly, uncommented, and still a draft) is available here. I apologize for its unseemliness, but I figured that it’s better to make it available in its current form than never to put it online for lack of time and will. You can also see that the code isn’t terribly efficient. It is, however, ‘good enough’ for interactive work.