Posts Tagged ‘Python packages’

Must Have Python Packages

Friday, December 5th, 2008

Python 2.6 and 3.0 are upon us.
I helped a friend set up the new Python environment for research and development, and we thought about constructing a PythonPack: a compilation of packages every home should have lying around.
(more…)

Python module usage statistics – Cont.

Wednesday, November 21st, 2007

Well, very embarrassingly for me, turns out I had a bug in my original post and code. As per Doug’s suggestion, I tried running the script I wrote on the standard library, and got results I didn’t quite believe. So I checked them, opened socket.py, and there was an “import _socket”. However module “_socket” was not listed in my results. After some digging around, (and feeling my face getting red hot…) I found the bug. It was the smallest thing: I forgot to add re.MULTILINE as an argument to re.search, so the start of my regexp, “^[ \t]*import” didn’t match except on the beginning of the file. #@&*!!! Just no words to express how I felt at that moment.

So, I fixed the bug, tested my code a bit, fixed some more (very minor) bugs, and here are the top 25 of the (hopefully correct this time!) results:

  • sys,1426
  • os,1250
  • unittest,566
  • time,446
  • re,383
  • string,321
  • types,298
  • setuptools,264
  • pkg_resources,217
  • cStringIO,184
  • zope.interface,177
  • datetime,173
  • shutil,167
  • os.path,162
  • gtk,143
  • StringIO,143
  • random,136
  • tempfile,132
  • copy,131
  • threading,128
  • distutils.core,127
  • doctest,126
  • md5,125
  • setuptools.command.e,116
  • logging,116

Except the larger numbers, the arrangement pretty much stayed the same.

(This seems to at least confirm my claim that the results were representative)
Here are the results for the standard library (/usr/lib/python2.5/):

  • sys,1309
  • os,1065
  • ctypes,588
  • re,496
  • string,493
  • types,435
  • time,374
  • numpy,297
  • warnings,254
  • os.path,204
  • cStringIO,196
  • common,185
  • math,159
  • traceback,158
  • gettext,152
  • codecs,147
  • StringIO,147
  • copy,133
  • __future__,128
  • tempfile,126
  • random,119
  • threading,117
  • unittest,108
  • numpy.testing,105
  • errno,100

These results seem different. ctypes seems to be the biggest change. Note that these results might be slanted, as some stdlib modules are implemented in C, and I didn’t test a ‘clean’ Python installation (but rather one with all my non-default modules installed).

Here is a link to the new results, the new stdlib results, and the new fixed graph. I created it using the first 325 module. The fixed R squared value is now approx. 0.99. Here is the code.

I would like to apologize to anyone who was misled by the original results, and again state that it is quite possible that there are still more bugs there, but I still claim the results to be represntative.

Python module usage statistics

Monday, November 19th, 2007

IMPORTANT UPDATE: The code I used to create these statistics had some bugs. The fixed statistics are available here.

After reading Doug Hellman’s post about python stdlib modules he needs the documentation to, I commented there that I need the documentation for logging because I don’t use it too frequently. Later, I thought a little bit more about it, and I wanted to check which modules are actually used often, and which rarely.

My first step was to “grep import *.py” on my python scripts directory. Later I progressed to writing a simple script which basically does the same thing, but also rates each imported module according to frequency (it looks for “import <modules>” and “from <module> import bla”). The results weren’t interesting enough. Before going any further, I sat down, and wrote my expected list of frequent stdlib modules:

  • sys
  • os
  • re
  • math
  • thread
  • urllib2
  • time
  • random
  • struct
  • socket
  • itertools
  • operator

My next step was to try Google Codesearch. Some time ago I wrote a little script to harvest results from Google Codesearch, but enough time has passed, and the script doesn’t work anymore. Parsing their output is messy enough, and their results don’t seem that helpful anyway. (They don’t return all the results in a project). So I thought about using koders.com, when I had a flash. I can use Cheeseshop! easy_install must have a way to just download the source of a project! (For those who don’t know: you can do easy_install SQLObject, and easy_install will look for the module in the PyPI, download it and then install it for you). Indeed, easy_install had this option (-eb), and I was in luck. Quickly, I got a list of all the modules in PyPI, shuffled it, and picked 300 randomly. (Actually, downloading them wasn’t that quick, and about 20 didn’t download right :).

I ran my statistics script, but non-stdlib modules were also listed. I fixed some bugs, improved the script a little, and (partially) disallowed inner project imports using a little heuristic. These are the improved results:

  • sys,113
  • os,107
  • setuptools,57
  • common,50
  • unittest,50
  • __future__,25
  • distutils.core,24
  • zope.interface,22
  • re,19
  • pygame,18
  • time,13
  • datetime,11
  • string,10
  • zope,10
  • pyglet.gl,9
  • types,8
  • random,7
  • pkg_resources,7
  • gtk,6
  • struct,6
  • ez_setup,6
  • zope.component,5
  • math,5
  • logging,5
  • sqlalchemy,5

(Please note that I do not argue that these results are completely true, as my code might still contain problems, and it probably filters too much. However, I do claim that these results are representative).

Well, some surprises… I didn’t expect setuptools and types to be there. Especially types… who needs it anymore (and anyway)? Seeing unittest so high in the list gives me a pretty good feeling about opensource. __future__ being there is amusing… Seems that we just can’t wait for the next version. And indeed, logging is pretty low, but not as low as I thought it will be.

Another curiosity: the module frequencies seem (unsurprisingly) to obey Zipf’s law. Here is a small graph showing the results: module imports on a log-log graph

(plotted with Google Docs). The R squared value is 0.92. For the original results (where inner dependencies were allowed), it seemed that the law was even stronger.

For those interested in the full results, here is a csv file containing them.

One final note: while these results are true for projects in Python, they don’t represent usage in interactive mode. Since I use interactive mode a lot, and during interactive work I use os and re quite a lot, I bet the results would be changed if we somehow could count interactive mode usage.