IMPORTANT UPDATE: The code I used to create these statistics had some bugs. The fixed statistics are available here.
After reading Doug Hellmann’s post about the Python stdlib modules he needs the documentation for, I commented there that I need the documentation for logging because I don’t use it very often. Later, I thought about it a little more, and wanted to check which modules are actually used often, and which rarely.
My first step was to “grep import *.py” on my Python scripts directory. Later I progressed to writing a simple script which basically does the same thing, but also ranks each imported module by frequency (it looks for “import <modules>” and “from <module> import bla”). The results weren’t interesting enough. Before going any further, I sat down and wrote my expected list of frequent stdlib modules:
- sys
- os
- re
- math
- thread
- urllib2
- time
- random
- struct
- socket
- itertools
- operator
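For readers who want to try something similar, here is a minimal sketch of such a counting script. It is not the original code: the regular expressions, the function names (`imported_modules`, `count_imports`) and the choice to count every import statement (rather than once per file or per project) are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical patterns for the two statement forms mentioned above.
IMPORT_RE = re.compile(r"^\s*import\s+(.+)$")
FROM_RE = re.compile(r"^\s*from\s+([\w.]+)\s+import\b")


def imported_modules(source):
    """Yield the module names imported by a piece of Python source text."""
    for line in source.splitlines():
        line = line.split("#", 1)[0]  # drop trailing comments
        match = FROM_RE.match(line)
        if match:
            name = match.group(1)
            if not name.startswith("."):  # skip relative imports
                yield name
            continue
        match = IMPORT_RE.match(line)
        if match:
            # "import os, sys" lists several modules; "import re as r" has an alias.
            for part in match.group(1).split(","):
                tokens = part.split()
                if tokens:
                    yield tokens[0]


def count_imports(sources):
    """Count how often each module is imported across many source texts."""
    counts = Counter()
    for source in sources:
        counts.update(imported_modules(source))
    return counts
```

Calling `count_imports(...).most_common()` on the text of every .py file gives a frequency-ranked list. A line-based approach like this will also pick up imports inside docstrings and miss multi-line imports, so treat the numbers as approximate.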
My next step was to try Google Codesearch. Some time ago I wrote a little script to harvest results from Google Codesearch, but enough time has passed that the script doesn’t work anymore. Parsing their output is messy, and their results don’t seem that helpful anyway (they don’t return all the results in a project). So I was thinking about using koders.com when I had a flash: I can use the Cheeseshop! easy_install must have a way to just download the source of a project. (For those who don’t know: you can run easy_install SQLObject, and easy_install will look for the package on PyPI, download it and then install it for you.) Indeed, easy_install has this option (-eb), and I was in luck. I quickly got a list of all the packages on PyPI, shuffled it, and picked 300 at random. (Actually, downloading them wasn’t that quick, and about 20 didn’t download right :)
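The sampling step can be sketched roughly as follows. The helper names (`pick_sample`, `download_command`) and the `seed` parameter are hypothetical, and the command is only built, not executed; the `-eb` flags are the ones mentioned above, and easy_install itself may no longer be available in a current setuptools.

```python
import random


def pick_sample(package_names, k=300, seed=None):
    """Pick up to k distinct package names at random.

    Passing a seed makes the sample reproducible, which helps when
    re-running the statistics later.
    """
    rng = random.Random(seed)
    names = sorted(set(package_names))  # de-duplicate, fix the order
    return rng.sample(names, min(k, len(names)))


def download_command(package_name, build_dir):
    """Build (but do not run) an easy_install call that only fetches sources."""
    return ["easy_install", "-eb", build_dir, package_name]
```

`random.sample` does the "shuffle, then take the first 300" step in a single call and never picks the same name twice.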
I ran my statistics script, but non-stdlib modules were also listed. I fixed some bugs, improved the script a little, and (partially) disallowed inner project imports using a little heuristic. These are the improved results:
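The heuristic itself isn’t spelled out here, so the following is only one possible sketch of such a filter, not the one actually used: an import is treated as internal when its top-level name matches a .py file or a package directory found in the same project. The function names are made up for the example.

```python
from pathlib import PurePosixPath


def local_module_names(project_files):
    """Collect names that look like modules or packages defined by the project.

    project_files is an iterable of relative paths such as "pkg/util.py".
    """
    names = set()
    for path in project_files:
        p = PurePosixPath(path)
        if p.suffix != ".py":
            continue
        if p.name == "__init__.py":
            if p.parent.name:  # a package directory
                names.add(p.parent.name)
        else:
            names.add(p.stem)  # a plain module file
    return names


def external_imports(imports, project_files):
    """Keep only imports whose top-level name is not defined inside the project."""
    local = local_module_names(project_files)
    return [name for name in imports if name.split(".")[0] not in local]
```

A filter like this errs on the side of removing too much: a project that ships its own string.py or types.py would have the stdlib module of the same name dropped from its counts.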
- sys,113
- os,107
- setuptools,57
- common,50
- unittest,50
- __future__,25
- distutils.core,24
- zope.interface,22
- re,19
- pygame,18
- time,13
- datetime,11
- string,10
- zope,10
- pyglet.gl,9
- types,8
- random,7
- pkg_resources,7
- gtk,6
- struct,6
- ez_setup,6
- zope.component,5
- math,5
- logging,5
- sqlalchemy,5
(Please note that I do not claim that these results are completely accurate, as my code might still contain problems, and it probably filters too much. However, I do claim that these results are representative.)
Well, some surprises… I didn’t expect setuptools and types to be there. Especially types… who needs it anymore (and anyway)? Seeing unittest so high in the list gives me a pretty good feeling about open source. __future__ being there is amusing… Seems that we just can’t wait for the next version. And indeed, logging is pretty low, but not as low as I thought it would be.
Another curiosity: the module frequencies seem (unsurprisingly) to obey Zipf’s law. Here is a small graph showing the results:
(plotted with Google Docs). The R squared value is 0.92. For the original results (where inner dependencies were allowed), it seemed that the law was even stronger.
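A Zipf-style check can be reproduced without a spreadsheet. The sketch below assumes the fit is an ordinary least-squares line through log(count) versus log(rank), which is the usual way to test for a power law but is not necessarily what Google Docs did; `zipf_fit` and `TOP_25` are names introduced for the example, and the counts are just the 25 values listed above rather than the full csv.

```python
import math

# The 25 counts from the list above, in rank order.
TOP_25 = [113, 107, 57, 50, 50, 25, 24, 22, 19, 18, 13, 11, 10,
          10, 9, 8, 7, 7, 6, 6, 6, 5, 5, 5, 5]


def zipf_fit(counts):
    """Fit log(count) = a + slope * log(rank); return (slope, r_squared).

    Needs at least two positive counts that are not all equal.
    """
    ordered = sorted(counts, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ordered) + 1)]
    ys = [math.log(count) for count in ordered]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, r_squared


slope, r_squared = zipf_fit(TOP_25)
print("slope = %.2f, R squared = %.2f" % (slope, r_squared))
```

A slope near -1 with R squared close to 1 is what a textbook Zipf distribution would give. Fitting only the top 25 entries will not necessarily reproduce the 0.92 reported for the full data.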
For those interested in the full results, here is a csv file containing them.
One final note: while these results hold for Python projects, they don’t represent usage in interactive mode. Since I use interactive mode a lot, and use os and re quite a lot while doing so, I bet the results would change if we could somehow count interactive mode usage.
That’s an interesting analysis, Imri. I like the idea of using PyPI for the research. Have you posted the code?
It would be interesting to run the same tool against the standard library to see which modules are reused internally, too.
Did you exclude setup.py files from your analysis? If not, that’s probably why setuptools, distutils.core, and ez_setup are over-represented in your results.
Doug: Nice idea, although I believe the results will be a little slanted, as parts of the stdlib are implemented in C. As for the code, I’ll release it in a few days.
Phillip: I didn’t exclude setup.py files, on purpose. I figured that there is probably one setup.py per project, so there should be one distutils import per project, while other imports may appear more than once. This way, I still got to know how common distutils usage is relative to other modules. The original post by Doug mentioned distutils as a ‘problematic’ module, so those statistics were relevant.
It’s not surprising to see that sys and os are the most used modules. For example, you need sys to get access to argv (command line arguments) and os (os.path) for path handling, and most modules should use os for OS-independent path handling :)
I agree. That’s why before I started acquiring these statistics, I sat down and wrote what I thought the results should be.
You *need* the stdlib, there’s no escaping that. os and sys weren’t surprises. types was (at least for me).
Also, I don’t believe sys.argv alone explains the high usage of the sys module. You’d think that sys.argv would be used once per project, in the main file, but if that were the only usage, the numbers would have been much lower (please check the updated numbers in the next post though).
It might be interesting to do statistics on module usage as well – just count the number of different strings after the dot, as in “module_name.usage”.
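One way this suggestion could be sketched is with the standard ast module: record which names are bound by plain “import” statements, then count every “name.attribute” access on them. The function name `attribute_usage` is made up for the example, “from x import y” forms are ignored for simplicity, and ast only parses source that is valid for the running interpreter, so older Python 2 code would need separate handling.

```python
import ast
from collections import Counter


def attribute_usage(source):
    """Count "module.attribute" accesses for modules bound by "import" statements."""
    tree = ast.parse(source)

    # Map each local name to the module it refers to ("import re as r" -> r: re).
    aliases = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.asname:
                    aliases[alias.asname] = alias.name
                else:
                    top = alias.name.split(".")[0]  # "import os.path" binds "os"
                    aliases[top] = top

    usage = Counter()
    for node in ast.walk(tree):
        # Only the first attribute after the module name is counted,
        # so os.path.join is recorded as "os.path".
        if isinstance(node, ast.Attribute) and isinstance(node.value, ast.Name):
            module = aliases.get(node.value.id)
            if module:
                usage[module + "." + node.attr] += 1
    return usage
```

The number of different strings after the dot for a given module is then the number of keys in the result that start with that module’s name followed by a dot.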