Cheap language detection using NLTK

Some months ago, I was facing a problem of having to deal with large amounts of textual data from an external source. One of the problems was that I wanted only the english elements, but was getting tons of non-english ones. To solve that I needed some quick way of getting rid of non-english texts. A few days later, while in the shower, the idea came to me: using NLTK stopwords!

What I did was, for each language in nltk, count the number of stopwords in the given text. The nice thing about this is that it usually generates a pretty strong read about the language of the text. Originally I used it only for English/non-English detection, but after a little bit of work I made it specify which language it detected. Now, I needed a quick hack for my issue, so this code is not very rigorously tested, but I figure that it would still be interesting. Without further ado, here’s the code:

import nltk
ENGLISH_STOPWORDS = set(nltk.corpus.stopwords.words('english'))
NON_ENGLISH_STOPWORDS = set(nltk.corpus.stopwords.words()) - ENGLISH_STOPWORDS
STOPWORDS_DICT = {lang: set(nltk.corpus.stopwords.words(lang)) for lang in nltk.corpus.stopwords.fileids()}
def get_language(text):
    words = set(nltk.wordpunct_tokenize(text.lower()))
    return max(((lang, len(words & stopwords)) for lang, stopwords in STOPWORDS_DICT.items()), key = lambda x: x[1])[0]
def is_english(text):
    text = text.lower()
    words = set(nltk.wordpunct_tokenize(text))
    return len(words & ENGLISH_STOPWORDS) > len(words & NON_ENGLISH_STOPWORDS)

The question to you: what other quick NLTK, or NLP hacks did you write?

This entry was posted in Python and tagged , , , , . Bookmark the permalink.

3 Responses to Cheap language detection using NLTK

  1. It’s a very nice simple hack, and it might work on a good dataset. Did you evaluate the accuracy of the algorithm?

    I think you will be very interested in’s hack for the same problem: Minutes 25:00-27:00.

  2. lorg says:

    1. Thanks for the link!
    2. I did a quick evaluation on my dataset, saw that it was reasonable, and left it at that. For the English/non-English bit it had a few false-negatives, but mostly for very short texts, and for longer texts it had very few errors. As I wrote, the code is not too rigorously tested :)