Proofreading and what's wrong with the heapq module

Consider the following problem:

you have a large book you would like to proofread, with many chapters (100+) and a few men (4) at your disposal. How would you distribute the chapters among the men, considering that each proofreader must get whole chapters?

Well, the obvious solution is just divide the number of chapters by the number of men, and give each proofreader an appropriate number of (randomly picked?) chapters. But what if the chapters or of varying length? Well, you could just distribute them randomly, but that just doesn’t feel right now does it?

I was asked by a friend to help him write a script to solve this, and quite quickly I came up with the solution:

Sort the chapters according to length (number of words), in descending order. Keep the proofreaders in a sorted order. While there are still chapters left – get the next longest chapter, and give it to the proofreader with the least total length. Rearrange the proofreaders.

This algorithm is a greedy algorithm and is also regarded as the straightforward way to solve this problem.

Well, this seems simple enough – In the case where there are more then a few proofreaders – well, the obvious solution is to use a minimum heap. Using a minimum heap in python should be easy – just import the heapq module, use heapify and various other functions, just as you would use sort, and you are home free.

Not quite…

Turns out, that using the heapq module isn’t so easy as it seems, or at least not as quick and simple as I would like it to be. While you can sort list using your own compare function, and also providing a key-function, the heapq module sorts items using Python’s standard operators. This means, that to use heapq properly to sort an object according to some property, you have to write an object wrapper that has __lt__ and similar functions defined. All this coding could be saved, if heapq had a key argument that could be used, or any other method to control the comparisons.

And what about the proofreading you ask? Well, it worked. It divided the chapters quite evenly, although we did end up using sort() repeatedly, as the number of proofreaders was small and we did not want to overly complicate matters.

This again emphasizes a something I’ve come to discover – being able to write something quickly is usually more important then later being able to run it quickly. If you write something in an environment that allows for quick coding – later if speed is required you’ll be able to optimize it quickly, or even change the design or the basic algorithm.

This entry was posted in Algorithms, Programming Philosophy, Python and tagged , , , . Bookmark the permalink.

2 Responses to Proofreading and what's wrong with the heapq module

  1. lorg says:

    Reading back, I can see that repeated calls to min() would obviously be of better complexity. still, when you have less then 7 proofreaders, this doesn’t make much difference :)