Well, very embarrassingly for me, it turns out I had a bug in my original post and code. As per Doug’s suggestion, I tried running the script I wrote on the standard library, and got results I didn’t quite believe. So I checked them: I opened socket.py, and there was an “import _socket”, yet the module “_socket” was not listed in my results. After some digging around (and feeling my face getting red hot…) I found the bug. It was the smallest thing: I forgot to pass re.MULTILINE to re.search, so the “^” at the start of my regexp, “^[ \t]*import”, matched only at the very beginning of the file. #@&*!!! Just no words to express how I felt at that moment.
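For anyone who wants to see the effect of the missing flag, here is a minimal sketch. It is not the actual script: the sample source string is made up, I use re.findall instead of re.search so that all matches are visible at once, and the capture group after “import” is an addition for illustration. Only the “^[ \t]*import” prefix comes from the post.

```python
import re

# Hypothetical three-line "file"; only the ^[ \t]*import prefix is from the post.
SOURCE = "import os\nimport _socket\n    import sys\n"

# Assumed extension of the pattern: capture the module name after "import".
PATTERN = r"^[ \t]*import[ \t]+(\w+)"

# Without re.MULTILINE, ^ anchors only at the very start of the string,
# so only an import on the first line can match.
without_flag = re.findall(PATTERN, SOURCE)

# With re.MULTILINE, ^ also matches immediately after each newline,
# so imports on later lines (indented or not) are found too.
with_flag = re.findall(PATTERN, SOURCE, re.MULTILINE)

print(without_flag)
print(with_flag)
```

In this sketch the first call finds only the import on the first line, while the second finds all three, including “_socket”.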
So, I fixed the bug, tested my code a bit, fixed some more (very minor) bugs, and here are the top 25 of the (hopefully correct this time!) results:
- sys,1426
- os,1250
- unittest,566
- time,446
- re,383
- string,321
- types,298
- setuptools,264
- pkg_resources,217
- cStringIO,184
- zope.interface,177
- datetime,173
- shutil,167
- os.path,162
- gtk,143
- StringIO,143
- random,136
- tempfile,132
- copy,131
- threading,128
- distutils.core,127
- doctest,126
- md5,125
- setuptools.command.e,116
- logging,116
Except for the larger numbers, the arrangement pretty much stayed the same.
(This seems to at least confirm my claim that the results were representative)
Here are the results for the standard library (/usr/lib/python2.5/):
- sys,1309
- os,1065
- ctypes,588
- re,496
- string,493
- types,435
- time,374
- numpy,297
- warnings,254
- os.path,204
- cStringIO,196
- common,185
- math,159
- traceback,158
- gettext,152
- codecs,147
- StringIO,147
- copy,133
- __future__,128
- tempfile,126
- random,119
- threading,117
- unittest,108
- numpy.testing,105
- errno,100
These results seem different. ctypes seems to be the biggest change. Note that these results might be slanted, as some stdlib modules are implemented in C, and I didn’t test a ‘clean’ Python installation (but rather one with all my non-default modules installed).
Here is a link to the new results, the new stdlib results, and the new fixed graph. I created it using the first 325 modules. The fixed R squared value is now approx. 0.99. Here is the code.
I would like to apologize to anyone who was misled by the original results, and again state that it is quite possible that there are still more bugs there, but I still claim the results to be representative.
Ah, the fact that you ran the test against /usr/lib/python2.5 instead of the Python source tree probably explains the numpy and common users, since those aren’t part of the standard library.
It seems like the only other surprising item on the list is ctypes, as you point out, and that could also be explained by the third party modules, since I only see a couple of occurrences of ctypes in the raw source for the standard library.
you mentioned yourself ‘struct’ in the expected list. but in the new results i don’t see it at all. nobody parses files ? :)
actually ‘struct’ is in the top 15 after taking a look in the csv file, which makes more sense. now i feel much better…