Categories
Python

Python Module Usage Stats – Feb 2011

Here are the top 30 “base modules”, ordered by the number of PyPI projects importing them. These results are based on 11,204 packages downloaded from PyPI. Explanations, full results and the code used to generate them are available below.

Results


[Graph: the top 30 base modules, ranked by number of PyPI projects importing them]

Full results are available (see Methodology to understand what they mean exactly).

Discussion

Some interesting tidbits and comparisons:

  • It seems django has gained “some popularity”. Zope is very high up on the list, and plone is at number 42, with 907 projects importing it.
  • The number of projects importing unittest is somewhat depressing, especially relative to setuptools, which is impressive. That might be because setuptools is, practically speaking, a prerequisite for appearing on PyPI, while unittest is not. (Edit: corrected by Michael Foord in a comment)
  • optparse with 1875 vs. getopt with 515.
  • cPickle with 690 vs. pickle with 598.
  • simplejson with 760 vs. json with 593.

I invite you all to find out more interesting pieces of information by going over the results. I bet there’s a lot more knowledge to be gained from this.

Background

Back in 2007 I wrote a small script that counted module imports in Python code, and used it to generate statistics for Python modules. A week or two ago I had the idea to repeat that experiment and see the difference between 2007 and 2011. I also thought of a small hypothesis to test: since django has become very popular, I’d expect it to be very high up on the list.

I started working with my old code, and decided that I should update it. Looking for imports in Python code is not as simple as it seems. I considered using the tokenize and parser modules, but decided against both: using parser would make my code version-dependent, and by the time I thought of tokenize, I had already worked out the complicated part. By the complicated part I mean, of course, the big regexps I used ;)

Methodology

Input: PyPI and a source distribution of the Python 2.7 standard library. I wrote a small script (cheese_getter.py) to fetch Python packages. It does so by reading the PyPI index page and then using easy_install to fetch each package. Since there are currently a bit fewer than 13k packages on PyPI, this took some time.
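In spirit, cheese_getter.py boils down to something like the following simplified sketch (not the actual script; the scraping regex here is illustrative):

    import re
    import subprocess
    import urllib2  # this was still the Python 2 era

    PYPI_INDEX = "http://pypi.python.org/simple/"

    def list_packages():
        # The index page is essentially one long list of <a ...>name</a> links.
        html = urllib2.urlopen(PYPI_INDEX).read()
        return re.findall(r"<a[^>]*>([^<]+)</a>", html)

    def fetch_source(package, dest_dir="packages"):
        # -e tells easy_install to fetch a package's source without
        # installing it; -b sets the directory it is unpacked into.
        subprocess.call(["easy_install", "-eb", dest_dir, package])

    if __name__ == "__main__":
        for name in list_packages():
            fetch_source(name)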

Parsing: I wrote a relatively simple piece of code to find “import x” and “from x import y” statements in code. This is trickier than it seems: statements such as “from x import a,b”, “from . import bla” and

from bla import \
                some_module, \
                some_module2

should all be supported. In order to achieve uniformity, I converted each import statement to a series of dotted modules. So for example, “import a.b” will yield “a” and “a.b”, and “from b import c,d” will yield “b”, “b.c”, and “b.d”.
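To make that concrete, here is a simplified sketch of the parsing and normalization (the real regexps are hairier; aliases, relative imports and line continuations are glossed over here):

    import re

    IMPORT_RE = re.compile(r"^[ \t]*import[ \t]+(.+)", re.MULTILINE)
    FROM_RE = re.compile(r"^[ \t]*from[ \t]+([\w.]+)[ \t]+import[ \t]+(.+)",
                         re.MULTILINE)

    def dotted_prefixes(name):
        # 'a.b.c' -> ['a', 'a.b', 'a.b.c']
        parts = name.split(".")
        return [".".join(parts[:i + 1]) for i in range(len(parts))]

    def find_imports(source):
        modules = []
        for names in IMPORT_RE.findall(source):
            for name in names.split(","):
                modules.extend(dotted_prefixes(name.strip()))
        for base, names in FROM_RE.findall(source):
            modules.extend(dotted_prefixes(base))
            for name in names.split(","):
                modules.append(base + "." + name.strip())
        return modules

    # find_imports("import a.b\nfrom b import c,d")
    # -> ['a', 'a.b', 'b', 'b.c', 'b.d']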

Processing: I created three result types:

  1. total number of imports
  2. total number of packages importing the module
  3. total number of packages importing the module, counted only for the first component of a dotted module name (e.g. “a”, but not “a.b”)

I believe the third is the most informative, although there are interesting things to learn from the others as well.
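As a sketch of how the three counts differ (again illustrative, not the actual code; assume each package’s imports were extracted by something like find_imports above):

    from collections import Counter

    def tally(imports_by_package):
        # imports_by_package: package name -> list of imported dotted modules
        total_imports = Counter()       # result type 1
        importing_packages = Counter()  # result type 2
        importing_base = Counter()      # result type 3
        for modules in imports_by_package.values():
            total_imports.update(modules)
            importing_packages.update(set(modules))
            importing_base.update(set(m.split(".")[0] for m in modules))
        return total_imports, importing_packages, importing_base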

Code: Full code is available. Peer reviews and independent reports are welcome :)

Categories
Projects Statistics

The Art and Science of Pulling Numbers Out of Your Sleeve

About a year or so ago, I was reading R. V. Jones’ excellent book ‘Most Secret War’. One of the stories I remembered and told my colleagues about was how Jones estimated the rocket production capabilities of the Germans, after looking at an aerial photograph of a rocket fuel shed. One of my colleagues then told me of a course she took at the Hebrew University of Jerusalem that teaches how to make such estimates. I let it go at the time, and just remembered that there is such a course.

A few days ago, I met up with this colleague, and I was reminded of the course. You see, right now I’m collecting data for a project I’m doing with a friend, and I needed estimation know-how. So I asked her for the name of the course, and she gave it to me. Two Google searches later I had the English name of the course, “Order of Magnitude in Physics Problems”, and a textbook to look at. I just finished reading the introduction, and I know that I’m going to read the rest of it too. Not just for my current project – but because it’s such a good skill to have.

The book opens with a problem:

We dedicate the first example to physicists who need employment outside of physics. […] How much money is there in a fully loaded Brinks armored car?

The book then goes on to show how to answer such a question intelligently. I’m hooked.

Categories
computer science Math Programming Python Statistics Utility Functions

Solution for the Random Selection Challenge

A few days ago, I wrote up two small Python Challenges. Several people have presented solutions for the first challenge, and I also posted my solution in the comments there.

However, the second challenge remained unsolved, and I will present a solution for it in this post.

Categories
Challenges Math Programming Python

Small Python Challenge No. 3 – Random Selection

This time I’ll give two related problems, both not too hard.

Let’s warm up with the first:

You have a mapping between items and probabilities. You need to choose each item with its probability.

For example, consider the items [‘good’, ‘bad’, ‘ugly’], with probabilities of [0.5, 0.3, 0.2] respectively. Your solution should choose good with probability 50%, bad with 30%, and ugly with 20%.

I came to this challenge because just today I had to solve it, and it seems like a common problem. Hence, it makes sense to ask ‘what is the best way?’.
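To pin down the expected interface, here is one naive baseline, using cumulative sums and bisect; whether this is the best way is exactly the question:

    import bisect
    import random

    def weighted_choice(item_probs):
        # item_probs: mapping of item -> probability (assumed to sum to ~1)
        items = list(item_probs)
        cumulative = []
        total = 0.0
        for item in items:
            total += item_probs[item]
            cumulative.append(total)
        r = random.random() * total
        return items[bisect.bisect_right(cumulative, r)]

    # weighted_choice({'good': 0.5, 'bad': 0.3, 'ugly': 0.2})

Building the cumulative list on every call is wasteful if you draw many times from the same mapping; caching it makes each draw O(log n).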

The second problem is slightly harder:

Assume a bell-shaped function p(x) that you can ‘solve’. This means that given a value y, you can get all x such that p(x)=y. For example, sin(x)^2 in [0,pi] is such a function. Given a function such as Python’s random.random() that yields a uniform distribution of values in [0,1), write a function that yields a distribution proportional to p(x) in the appropriate interval.

For example, consider the function p(x) = e^(-x^2) in [-1,1]. Since p(0) = 1, and p(0.5)~0.779, the value 0 should be p(0)/p(0.5)~1.28 times more common than 0.5.
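For reference, one well-known way to build such a sampler is rejection sampling, sketched below; note that it needs only an upper bound on p, not the ability to solve it, so consider it a baseline rather than the elegant answer the challenge is after:

    import math
    import random

    def sample_proportional(p, lo, hi, p_max):
        # Draw x uniformly in [lo, hi]; keep it with probability p(x)/p_max.
        # Accepted values are distributed proportionally to p(x).
        while True:
            x = lo + (hi - lo) * random.random()
            if random.random() * p_max <= p(x):
                return x

    # Example: p(x) = e^(-x^2) on [-1, 1]; its maximum there is p(0) = 1.
    x = sample_proportional(lambda x: math.exp(-x * x), -1.0, 1.0, 1.0)

The flatter p is, the fewer rejections; for very peaked functions this gets wasteful, which is where being able to solve p(x)=y should help.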

As usual, the preferred solutions are the elegant ones. Go!

note: please post your solutions in the comments, using [ python]…[ /python] tags (but without the spaces in the tags).

Categories
Math Programming Python

Python module usage statistics – Cont.

Well, very embarrassingly for me, it turns out I had a bug in my original post and code. As per Doug’s suggestion, I tried running the script I wrote on the standard library, and got results I didn’t quite believe. So I checked them, opened socket.py, and there it was: “import _socket”. However, the module “_socket” was not listed in my results. After some digging around (and feeling my face getting red hot…), I found the bug. It was the smallest thing: I forgot to pass re.MULTILINE as an argument to re.search, so the start of my regexp, “^[ \t]*import”, matched only at the very beginning of a file. #@&*!!! Just no words to express how I felt at that moment.
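For anyone who hasn’t been bitten by this before, here is a minimal illustration of the bug (with a simplified stand-in for the real regexp):

    import re

    code = "import os\n    import sys\nx = 1\nimport re\n"

    # Without re.MULTILINE, '^' matches only at the start of the whole string:
    print(re.findall(r"^[ \t]*import \w+", code))
    # ['import os']

    # With re.MULTILINE, '^' matches at the start of every line:
    print(re.findall(r"^[ \t]*import \w+", code, re.MULTILINE))
    # ['import os', '    import sys', 'import re']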

So, I fixed the bug, tested my code a bit, fixed some more (very minor) bugs, and here are the top 25 of the (hopefully correct this time!) results:

  • sys,1426
  • os,1250
  • unittest,566
  • time,446
  • re,383
  • string,321
  • types,298
  • setuptools,264
  • pkg_resources,217
  • cStringIO,184
  • zope.interface,177
  • datetime,173
  • shutil,167
  • os.path,162
  • gtk,143
  • StringIO,143
  • random,136
  • tempfile,132
  • copy,131
  • threading,128
  • distutils.core,127
  • doctest,126
  • md5,125
  • setuptools.command.e,116
  • logging,116

Except for the larger numbers, the arrangement pretty much stayed the same.

(This seems to at least confirm my claim that the results were representative.)

Here are the results for the standard library (/usr/lib/python2.5/):

  • sys,1309
  • os,1065
  • ctypes,588
  • re,496
  • string,493
  • types,435
  • time,374
  • numpy,297
  • warnings,254
  • os.path,204
  • cStringIO,196
  • common,185
  • math,159
  • traceback,158
  • gettext,152
  • codecs,147
  • StringIO,147
  • copy,133
  • __future__,128
  • tempfile,126
  • random,119
  • threading,117
  • unittest,108
  • numpy.testing,105
  • errno,100

These results are noticeably different, ctypes being the biggest change. Note that they might be skewed, as some stdlib modules are implemented in C, and I didn’t test a ‘clean’ Python installation (but rather one with all my non-default modules installed).

Here is a link to the new results, the new stdlib results, and the new fixed graph. I created it using the first 325 modules. The fixed R squared value is now approx. 0.99. Here is the code.

I would like to apologize to anyone who was misled by the original results, and state again that while it is quite possible that more bugs are still lurking there, I still claim the results to be representative.

Categories
Math Programming Python

Python module usage statistics

IMPORTANT UPDATE: The code I used to create these statistics had some bugs. The fixed statistics are available here.

After reading Doug Hellmann’s post about the Python stdlib modules he needs the documentation for, I commented there that I need the documentation for logging because I don’t use it too frequently. Later, I thought a little bit more about it, and I wanted to check which modules are actually used often, and which rarely.

My first step was to “grep import *.py” in my Python scripts directory. Later I progressed to writing a simple script which basically does the same thing, but also rates each imported module according to frequency (it looks for “import <module>” and “from <module> import bla”). The results weren’t interesting enough. Before going any further, I sat down and wrote my expected list of frequent stdlib modules:

  • sys
  • os
  • re
  • math
  • thread
  • urllib2
  • time
  • random
  • struct
  • socket
  • itertools
  • operator

My next step was to try Google Codesearch. Some time ago I wrote a little script to harvest results from Google Codesearch, but enough time has passed, and the script doesn’t work anymore. Parsing their output is messy enough, and their results don’t seem that helpful anyway (they don’t return all the results in a project). So I was thinking about using koders.com when I had a flash: I can use the Cheeseshop! easy_install must have a way to just download the source of a project! (For those who don’t know: you can do easy_install SQLObject, and easy_install will look for the package on PyPI, download it and then install it for you.) Indeed, easy_install has this option (-eb), and I was in luck. Quickly, I got a list of all the packages on PyPI, shuffled it, and picked 300 randomly. (Actually, downloading them wasn’t that quick, and about 20 didn’t download right :)

I ran my statistics script, but non-stdlib modules were also listed. I fixed some bugs, improved the script a little, and (partially) filtered out intra-project imports using a little heuristic (one possible version is sketched below, after the results). These are the improved results:

  • sys,113
  • os,107
  • setuptools,57
  • common,50
  • unittest,50
  • __future__,25
  • distutils.core,24
  • zope.interface,22
  • re,19
  • pygame,18
  • time,13
  • datetime,11
  • string,10
  • zope,10
  • pyglet.gl,9
  • types,8
  • random,7
  • pkg_resources,7
  • gtk,6
  • struct,6
  • ez_setup,6
  • zope.component,5
  • math,5
  • logging,5
  • sqlalchemy,5

(Please note that I do not argue that these results are completely true, as my code might still contain problems, and it probably filters too much. However, I do claim that these results are representative).
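For illustration, one simple heuristic of this kind (not necessarily the exact one I used) is to treat any module whose name matches a .py file or a package directory inside the project tree as internal, and skip it:

    import os

    def local_modules(project_dir):
        # Names defined inside the project itself: .py files and packages.
        # Imports of these can be ignored when counting external usage.
        local = set()
        for dirpath, dirnames, filenames in os.walk(project_dir):
            for fn in filenames:
                if fn.endswith(".py"):
                    local.add(fn[:-len(".py")])
            for dn in dirnames:
                if os.path.exists(os.path.join(dirpath, dn, "__init__.py")):
                    local.add(dn)
        return local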

Well, some surprises… I didn’t expect setuptools and types to be there. Especially types… who needs it anymore (and anyway)? Seeing unittest so high on the list gives me a pretty good feeling about open source. __future__ being there is amusing… it seems we just can’t wait for the next version. And indeed, logging is pretty low, but not as low as I thought it would be.

Another curiosity: the module frequencies seem (unsurprisingly) to obey Zipf’s law. Here is a small graph showing the results: module imports on a log-log graph

(plotted with Google Docs). The R squared value is 0.92. For the original results (where inner dependencies were allowed), the law seemed to hold even more strongly.
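For anyone who wants to reproduce the fit without a spreadsheet, it amounts to a plain least-squares fit of log(frequency) against log(rank); a minimal sketch:

    import math

    def loglog_fit(counts):
        # counts: module frequencies, sorted in descending order.
        xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
        ys = [math.log(c) for c in counts]
        n = len(xs)
        mean_x = sum(xs) / n
        mean_y = sum(ys) / n
        sxx = sum((x - mean_x) ** 2 for x in xs)
        syy = sum((y - mean_y) ** 2 for y in ys)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        slope = sxy / sxx                    # the Zipf exponent (negative)
        r_squared = sxy * sxy / (sxx * syy)
        return slope, r_squared

    # loglog_fit([113, 107, 57, 50, 50, 25]) -> (slope, R^2)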

For those interested in the full results, here is a csv file containing them.

One final note: while these results hold for Python projects, they don’t represent usage in interactive mode. Since I use interactive mode a lot, and during interactive work I use os and re quite a lot, I bet the results would change if we could somehow count interactive-mode usage.