Algorithms Programming Philosophy Python

Proofreading and what's wrong with the heapq module

Consider the following problem:

you have a large book you would like to proofread, with many chapters (100+) and a few men (4) at your disposal. How would you distribute the chapters among the men, considering that each proofreader must get whole chapters?

Well, the obvious solution is just divide the number of chapters by the number of men, and give each proofreader an appropriate number of (randomly picked?) chapters. But what if the chapters or of varying length? Well, you could just distribute them randomly, but that just doesn’t feel right now does it?

I was asked by a friend to help him write a script to solve this, and quite quickly I came up with the solution:

Sort the chapters according to length (number of words), in descending order. Keep the proofreaders in a sorted order. While there are still chapters left – get the next longest chapter, and give it to the proofreader with the least total length. Rearrange the proofreaders.

This algorithm is a greedy algorithm and is also regarded as the straightforward way to solve this problem.

Well, this seems simple enough – In the case where there are more then a few proofreaders – well, the obvious solution is to use a minimum heap. Using a minimum heap in python should be easy – just import the heapq module, use heapify and various other functions, just as you would use sort, and you are home free.

Not quite…

Turns out, that using the heapq module isn’t so easy as it seems, or at least not as quick and simple as I would like it to be. While you can sort list using your own compare function, and also providing a key-function, the heapq module sorts items using Python’s standard operators. This means, that to use heapq properly to sort an object according to some property, you have to write an object wrapper that has __lt__ and similar functions defined. All this coding could be saved, if heapq had a key argument that could be used, or any other method to control the comparisons.

And what about the proofreading you ask? Well, it worked. It divided the chapters quite evenly, although we did end up using sort() repeatedly, as the number of proofreaders was small and we did not want to overly complicate matters.

This again emphasizes a something I’ve come to discover – being able to write something quickly is usually more important then later being able to run it quickly. If you write something in an environment that allows for quick coding – later if speed is required you’ll be able to optimize it quickly, or even change the design or the basic algorithm.



Default argument values in python are usually used in the following fashion:

def func(foo, bar=4):
    return foo*bar

However, if the default value for the function is a mutable object (such as a list), changing it will cause subsequent calls to the function behave differently. This behavior produced the following idiom:

def foo(arg = None):
    if arg is None:
        arg = []
    #modify arg...
    return arg

This idiom works quite well, unless None is a legitimate value for arg. consider the follwing:

class Vehicle(object):
    #other code goes here...
    def is_car(self, car_type = None):
        if car_type is None:
            return self._type == CAR
        return self._type == CAR and self.value == car_type

The method is_car has two different behaviors: If called without car_type, it will check whether self has _type CAR. Otherwise, it will also check that self has a value of the specific car_type. This will work quite well, but if None is a legitimate car_type, checking against it will produce surprising results – it will only check if self has _type CAR. To avoid this kind of bug, I propose using DefaultArg, a new singleton that will signify that a given argument is “the default argument” for the function. Here is the previous example, now using DefaultArg:

class Vehicle(object):
    #other code goes here...
    def is_car(self, car_type = DefaultArg):
        if car_type is DefaultArg:
            return self._type == CAR
        return self._type == CAR and self.value == car_type

This makes sense, because one should never set a variable to DefaultArg (unless through a function call). Also, tools like PyLint can easily check for this kind of mistakes. Note that DefaultArg addresses a specific problem – when the default argument changes the behavior of the function and that None is also a legitimate value for that argument. I would also like to note that DefaultArg was the solution for a real bug I encountred – and now I it use regularly.

Here is a simple implementation, using Daniel Brodie’s Singleton recipe:

class Singleton(type):
    def __init__(self, *args):
        type.__init__(self, *args)
        self._instances = {}

    def __call__(self, *args):
        if not args in self._instances:
            self._instances[args] = type.__call__(self, *args)
        return self._instances[args]

class _DefaultArg(object):
    def __init__(self, *args): pass

    def __eq__(self, other):
        return isinstance(other, _DefaultArg)

    def __ne__(self, other):
        return not self == other

DefaultArg = _DefaultArg()

For the interested reader – Tomer Filiba proposed a different solution (to a slightly different problem I belive), using the “copydefaults” decorator.