
Cheap language detection using NLTK

Some months ago, I had to deal with large amounts of textual data from an external source. One of the problems was that I wanted only the English elements, but was getting tons of non-English ones, so I needed a quick way of getting rid of the non-English texts. A few days later, while in the shower, the idea came to me: use NLTK stopwords!

What I did was, for each language in NLTK, count how many of that language's stopwords appear in the given text. The nice thing about this is that it usually gives a pretty strong signal about the language of the text. Originally I used it only for English/non-English detection, but with a little more work I made it report which language it detected. I needed a quick hack for my problem, so this code is not rigorously tested, but I figure it is still interesting. Without further ado, here's the code:

import nltk

# Stopword lists shipped with NLTK (the 'stopwords' corpus must be downloaded first).
ENGLISH_STOPWORDS = set(nltk.corpus.stopwords.words('english'))
NON_ENGLISH_STOPWORDS = set(nltk.corpus.stopwords.words()) - ENGLISH_STOPWORDS

# Map every available language to its stopword set.
STOPWORDS_DICT = {lang: set(nltk.corpus.stopwords.words(lang)) for lang in nltk.corpus.stopwords.fileids()}


def get_language(text):
    # The language whose stopwords appear most often in the text wins.
    words = set(nltk.wordpunct_tokenize(text.lower()))
    return max(((lang, len(words & stopwords)) for lang, stopwords in STOPWORDS_DICT.items()), key=lambda x: x[1])[0]


def is_english(text):
    # More English stopwords than non-English ones => probably English.
    text = text.lower()
    words = set(nltk.wordpunct_tokenize(text))
    return len(words & ENGLISH_STOPWORDS) > len(words & NON_ENGLISH_STOPWORDS)
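
A quick sanity check looks something like this (assuming the stopwords corpus has already been downloaded with nltk.download('stopwords'); the sample sentences here are just illustrations):

print(get_language("this is a very short text and it should be detected as english"))
# expected: 'english'
print(is_english("le renard brun saute par-dessus le chien paresseux"))
# expected: False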

A question for you: what other quick NLTK or NLP hacks have you written?


Rhyme and Reason with Python

After reading xkcd’s blog, I went to LimerickDB.com. It looks and works just like bash.org, but for limericks. Quite amused by some of the limericks there, I wanted to try composing my own.
The first one came out fine, but I got stuck without a rhyme for the second one:

a man named guido once thought
he’d invent a language well wrought
indentation as tabs

Uhm, no luck there… “tabs” does not rhyme well. Then an idea came… I could create a rhyme index!
How to do it?
First, I took my already-written word index. The last time I used it was to solve a substitution cipher for my crypto homework.
Then I looked up the phonemes for each word using espeak, a text-to-speech program:

lorg@kild:~$ espeak -q -x hello
 h@l'oU
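
If you prefer to call espeak from Python rather than the shell, something along these lines should work (my sketch, not the script from the post):

import subprocess

def phonemes(word):
    # Ask espeak for the phoneme mnemonics of a word, without speaking it aloud.
    return subprocess.check_output(['espeak', '-q', '-x', word]).decode('utf-8').strip()

# phonemes('hello')  ->  "h@l'oU"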

After doing that for every word, I collected words with similar phoneme endings, and there it was: a rhyme index:

In [84]: words.get_word_rhymes("tabs", word_index, rhyme_index)[:10]
Out[84]:
['crabs', 'stabs', 'macnabs',
 'dabs', 'cabs', 'scabs',
 'slabs', 'arabs', 'snobs',
 'thebes']

Note that better rhymes are listed first, according to the length of the phoneme match.
To create this list I didn’t use stress information, although with some more work that could be added.
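
The rhyme index code itself isn’t shown here, but the idea can be sketched roughly like this (a simplified guess at the approach: key each word by several phoneme suffixes, and return longer-suffix matches first; phonemes() is the espeak wrapper sketched above):

import subprocess
from collections import defaultdict

def phonemes(word):
    return subprocess.check_output(['espeak', '-q', '-x', word]).decode('utf-8').strip()

def build_rhyme_index(words, suffix_lengths=(2, 3, 4)):
    # Map phoneme suffixes of several lengths to the words that end with them.
    index = defaultdict(set)
    for word in words:
        ph = phonemes(word)
        for n in suffix_lengths:
            if len(ph) >= n:
                index[ph[-n:]].add(word)
    return index

def get_rhymes(word, rhyme_index, suffix_lengths=(4, 3, 2)):
    # Longer shared phoneme suffixes come first, i.e. better rhymes first.
    ph = phonemes(word)
    seen, result = {word}, []
    for n in suffix_lengths:
        for candidate in rhyme_index.get(ph[-n:], ()):
            if candidate not in seen:
                seen.add(candidate)
                result.append(candidate)
    return result

Building such an index calls espeak once per word, which is slow; that is exactly why caching the indexes with cPickle (mentioned below) pays off.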

Back to our limerick: we can see that “tabs” does not have any really good rhymes. Maybe we’ll have better luck with “indentation”:

In [86]: words.get_word_rhymes("indentation", word_index, rhyme_index)[:20]
Out[86]:
['kishion', 'alleviation', 'expostulation',
 'machination', 'syrophenician', 'proposition',
 'infatuation', 'computation', 'mansion',
 'meditation', 'aggression', 'anticipation',
 'clarification', 'logician', 'gratulation',
 'discretion', 'quotation', 'presentation',
 'dissolution', 'deprecation']

And we get (not the best limerick I admit):

a man named guido once thought
he’d invent a language well wrought
with space indentation
and good presentation
we all got the language we sought

My original ending line was “it left all the Perl hackers distraught”… but it ain’t true, so I gave it up. ‘Distraught’ was found with my script as well.
If you want to look for your own rhymes, you can use my hacked-up script, although it isn’t written well enough for my taste (for example, don’t expect to see documentation :).
I’ll upload a better version once I refactor it.
To save you the trouble of generating the word index and the rhyme index from scratch, I used cPickle to store them, so you can use them for lookups even without the script.
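
Loading the pickled indexes then takes just a couple of lines (the file names below are made up; use whatever the archive actually contains):

import cPickle  # on Python 3, use the pickle module instead

with open('word_index.pickle', 'rb') as f:   # hypothetical file name
    word_index = cPickle.load(f)
with open('rhyme_index.pickle', 'rb') as f:  # hypothetical file name
    rhyme_index = cPickle.load(f)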


Simple word lookup for crosswords and such…

I am swamped with university homework, but I did manage to come up with a small, somewhat useful utility.

You know when you need a word that ends with a double letter and then an ‘n’? You would want to write it like this: “xxN$”. Well, I wrote a simple tool that translates this ‘word lookup syntax’ into a regular expression and then looks up matching words.

This raises two questions:

What exactly is the syntax?
Small letters stand for any letter, while capital letters stand for the specific letter they are. So for example, “bEEn” would match “been”, “seen” and “keep”, but not “boon”. Punctuation is left unchanged, so you can use $, ^, (?!xy) and so on. I didn’t work too much on it, so [] doesn’t translate well (the letters inside it will come out garbled).
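
The translator itself isn’t listed in the post, but the core idea can be sketched like this (a simplified version: unlike the generated patterns shown below, it doesn’t stop a wildcard from matching one of the literal capital letters):

import re

def pattern_to_regex(pattern):
    # Lowercase letters are wildcards: distinct letters must match distinct
    # characters, and repeating a letter reuses its earlier match.
    # Uppercase letters and punctuation pass through unchanged.
    out = []
    seen = []  # wildcards already turned into named groups
    for ch in pattern:
        if ch.islower():
            if ch in seen:
                out.append('(?P=%s)' % ch)   # repeated wildcard: back-reference
            else:
                if seen:                     # must differ from every earlier wildcard
                    out.append('(?!%s)' % '|'.join('(?P=%s)' % s for s in seen))
                out.append('(?P<%s>.)' % ch)
                seen.append(ch)
        else:
            out.append(ch)
    return ''.join(out)

# pattern_to_regex("^bEEn$")  ->  ^(?P<b>.)EE(?!(?P=b))(?P<n>.)$
# re.match(pattern_to_regex("^bEEn$"), "SEEN") matches; "BOON" does not.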

What is the data for the lookup?
Well, I used the excellent Natural Language Toolkit (NLTK, obviously in Python). It’s very easy to start using, it has a nice tutorial, and it’s nice knowing that such a tool is available. The source data is actually an extract from Project Gutenberg. The matching words are sorted by their frequency in the source text, and the 20 most frequent ones are returned.
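
The words module isn’t listed here either; a rough guess at what create_word_index() and find_words() do, using NLTK’s frequency distribution over its Project Gutenberg extract, would be something like:

import re
import nltk  # requires the corpus: nltk.download('gutenberg')

def create_word_index():
    # Frequency distribution of the Gutenberg words, uppercased to match the output below.
    return nltk.FreqDist(w.upper() for w in nltk.corpus.gutenberg.words() if w.isalpha())

def find_words(regex, fd, limit=20):
    # The `limit` most frequent words whose spelling matches the given regular expression;
    # combine with pattern_to_regex() above to accept the word-lookup syntax directly.
    pattern = re.compile(regex)
    return [w for w, _ in fd.most_common() if pattern.match(w)][:limit]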

Without further ado, here are some code snippets. (The compiled regexp is printed. Note how ugly (yet simple) it is :)
(Note: I did cut out some of the output…)

In [1]: fd = words.create_word_index()
 
In [2]: words.find_words("^bEEn$",fd)
^(?P<b>.)EE(?!((?P=b))|(E))(?P<n>.)$
Out[2]:
['BEEN',
 'SEEN',
 'KEEP',
 'FEET',
 'FEEL',
 'SEEK',
 'SEED',
 'MEET',
 'DEEP',
 'NEED',
 'SEEM',
 ...]
 
In [3]: words.find_words("^jvfpf$",fd)
^(?P<j>.)(?!((?P=j)))(?P<v>.)(?!((?P=j))|((?P=v)))(?P<f>.)(?!((?P=j))|((?P=f))|((?P=v)))(?P<p>.)(?P=f)$
Out[3]:
['THERE',
 'THESE',
 'WHERE',
 'JESUS',
 'MOSES',
 'HOSTS',
 'LINEN',
 'PIECE',
 'SCENE',
 'WHOSO',
 'POSTS',
 'EPHAH',
 ...]

Note that if you find some of the result words a bit too uncommon for your taste (e.g. “Ephah”), I would suggest removing the Bible from the source data… :)

The (very ugly, uncommented, draft) code is available here. I apologize for its unseemliness, but I figured it’s better to make it available in its current form than to never put it online for lack of time and will. As you can see, the code isn’t terribly efficient; it is, however, ‘good enough’ for interactive work.