Some months ago, I was facing the problem of dealing with large amounts of textual data from an external source. One of the issues was that I wanted only the English elements, but was getting tons of non-English ones. To solve that, I needed a quick way of getting rid of non-English texts. A few days later, while in the shower, the idea came to me: use NLTK stopwords!
What I did was, for each language in NLTK, count the number of that language's stopwords that appear in the given text. The nice thing about this is that it usually gives a pretty strong signal about the language of the text. Originally I used it only for English/non-English detection, but with a little more work I made it identify which language it detected. I needed a quick hack for my issue, so this code is not rigorously tested, but I figure it is still interesting. Without further ado, here's the code:
import nltk

# English stopwords vs. the stopwords of every other language NLTK ships with
ENGLISH_STOPWORDS = set(nltk.corpus.stopwords.words('english'))
NON_ENGLISH_STOPWORDS = set(nltk.corpus.stopwords.words()) - ENGLISH_STOPWORDS

# Per-language stopword sets, keyed by the NLTK corpus file id (e.g. 'english', 'swedish')
STOPWORDS_DICT = {lang: set(nltk.corpus.stopwords.words(lang))
                  for lang in nltk.corpus.stopwords.fileids()}

def get_language(text):
    # Pick the language whose stopword set overlaps the most with the text's words
    words = set(nltk.wordpunct_tokenize(text.lower()))
    return max(((lang, len(words & stopwords))
                for lang, stopwords in STOPWORDS_DICT.items()),
               key=lambda x: x[1])[0]

def is_english(text):
    # More English stopword hits than hits from all other languages combined
    text = text.lower()
    words = set(nltk.wordpunct_tokenize(text))
    return len(words & ENGLISH_STOPWORDS) > len(words & NON_ENGLISH_STOPWORDS)
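To give an idea of how it's used, here is a quick illustration (the example sentences are ones I made up for this post, and the outputs are what the heuristic should produce on typical input rather than a verified test):

print(get_language("The quick brown fox jumps over the lazy dog and then it runs away"))
# should print 'english': 'the', 'over', 'and', 'then', 'it' are all English stopwords

print(is_english("Dies ist ein kurzer deutscher Satz"))
# should print False: the words hit German stopwords ('dies', 'ist', 'ein') but no English ones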
The question for you: what other quick NLTK or NLP hacks have you written?
It’s a very nice simple hack, and it might work on a good dataset. Did you evaluate the accuracy of the algorithm?
I think you will be very interested in Bit.ly's hack for the same problem: http://devslovebacon.com/speakers/hilary-mason (minutes 25:00-27:00).
Nimrod:
1. Thanks for the link!
2. I did a quick evaluation on my dataset, saw that it was reasonable, and left it at that. For the English/non-English bit it had a few false negatives, mostly on very short texts; on longer texts it had very few errors. As I wrote, the code is not too rigorously tested :)
Short texts are the hardest. A naive Bayes character n-gram model works relatively well if the text is proper English and not full of names or internet abbreviations like 'lol' or 'omg'. It's very easy to code, but there are also ready-made libraries for this; see for instance Lingpipe's tutorial at http://alias-i.com/lingpipe/demos/tutorial/langid/read-me.html
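For the curious, here is a minimal sketch of that character n-gram idea (not the Lingpipe implementation; it assumes you supply a few training texts per language yourself and uses simple add-one smoothing):

from collections import Counter, defaultdict
import math

def char_ngrams(text, n=3):
    # Overlapping character n-grams of the lowercased text
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

class NgramLanguageId:
    def __init__(self, n=3):
        self.n = n
        self.counts = defaultdict(Counter)  # language -> n-gram counts
        self.totals = Counter()             # language -> total n-grams seen

    def train(self, lang, text):
        grams = char_ngrams(text, self.n)
        self.counts[lang].update(grams)
        self.totals[lang] += len(grams)

    def classify(self, text):
        # Pick the language maximizing the smoothed log-likelihood of the text's n-grams
        grams = char_ngrams(text, self.n)
        best_lang, best_score = None, float('-inf')
        for lang, counts in self.counts.items():
            vocab = len(counts) + 1
            score = sum(math.log((counts[g] + 1.0) / (self.totals[lang] + vocab))
                        for g in grams)
            if score > best_score:
                best_lang, best_score = lang, score
        return best_lang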
The stopwords approach has a lot of merit for real-world mixed-language texts.
I would suggest one improvement to the code: if there are 0 hits, then it should return ‘und’. (Currently it returns ‘sv’, Swedish, because it happens to be first in the list.)
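For what it's worth, one way that tweak might look, reusing STOPWORDS_DICT from the post (a quick sketch, not tested against the original data):

def get_language(text):
    words = set(nltk.wordpunct_tokenize(text.lower()))
    lang, hits = max(((lang, len(words & stopwords))
                      for lang, stopwords in STOPWORDS_DICT.items()),
                     key=lambda x: x[1])
    return lang if hits > 0 else 'und'  # 'und' = undetermined, per the suggestion above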
Lots of thanks. It's just what I was looking for to complete my program. <3