I came up with this challenge when I had to write a function to divide a sequence to percentiles. I needed this to calculate some statistics over trips for plnnr.com. “This sounds trivial” I thought, and reached for my simple blocks function:
def blocks(seq, block_len): """blocks(range(5),2) -> [[0, 1], [2, 3], [4]]""" seq_len = len(seq) if seq_len%block_len == 0: num_blocks = seq_len/block_len else: num_blocks = 1 + (seq_len/block_len) result = [[] for i in xrange(num_blocks)] for idx, obj in enumerate(seq): result[idx/block_len].append(obj) return result |
So I set block_len to len(seq)/10 and called blocks(seq, block_len). Unfortunately, according to the docs of blocks (which I wrote…), when there is extra data, a “partial” block is added – which is exactly what we don’t want when calculating percentiles.
Instead, the behavior we want is nblocks(seq, number_of_blocks), for example: nblocks(range(10), 3) -> [0, 1, 2], [3, 4, 5], [6, 7, 8, 9].
This is a surprisingly hard to write function, or rather, harder than you’d expect. I’ll be especially glad if someone writes it elegantly. My solution works well enough, but isn’t the prettiest.
So, you have the definition – let’s see if you can do better than me. Once enough solutions are presented, I will present my own.
IMPORTANT EDIT: the required signature is nblocks(seq, num_blocks). So for seq(range(10), 3), the return value should be 3 blocks, with the last one having an extra item. As a general rule, the extra items should be spread as evenly as possible.
First stab at it:
Then I did a recursive solution to see if I could make it more compact:
How’s this?
A quickie, I’m not sure if it’s elegant enough…
http://codepad.org/Di6Z3POC
Thank you all for your solutions!
Jeremy: I think I didn’t explain myself well enough: I wasn’t aiming for a better “blocks” function. I was aiming for a “nblocks” function, whose signature is nblocks(seq, num_blocks).
Lucas: your solution is problematic, as for nblocks(range(10), 3), the last element was dropped.
Yuv: try to write it as nblocks(seq, num_blocks) :)
I will add an edit to the blog post to make the requried signature more clear :)
Ok, let’s try…
Hmm, not really elegant. Having to know the length of the sequence seems unavoidable. And izip(*[seq] * block_len) looks promising, but i dunno how to make it work :).
Ephes:
Thanks for the solution! However, that’s not quite what I was looking for :) See again my update and my comment above: the signature should be nblocks(seq, num_blocks). So for the “usual” percentiles, it’d be nblocks(seq, 10).
Regarding the length of the sequence: you are right, it’s unavoidable to know it but that’s ok. (It’s avoidable only if you’re willing to do some slightly ugly hacks :)
Oh, I see I misread what you were trying to do. Will work on a better solution.
Is that the function you’re looking for?
I’m not sure what you mean by “As a general rule, the extra items should be spread as evenly as possible.”
Here’s my solution with my own interpretation of that statement:
def blocks(seq, block_len):
rest = seq
result = []
while len(rest) > block_len:
result.append(rest[:block_len])
rest = rest[block_len:]
#now distribute
for i, v in enumerate(rest):
result[-(i+1)].append(v)
return result
Oops. Let me try it again with the pre tag.
Breshenham to the rescue!
def nblocks( seq, block_len ):
assert seq
l = []
error = 0.0
deltaerr = float(block_len) / len(seq)
for item in seq:
l.append( item )
error += deltaerr
if error >= 1.0:
yield l
l = []
error -= 1.0
if l:
yield l
( Provides better distribution )
Shouldn’t have used the code tag… Trying again:
Breshenham to the rescue!
>>> list( try2.nblocks( range(10), 3) )
[[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
>>> list( try2.nblocks( range(9), 3) )
[[0, 1, 2], [3, 4, 5], [6, 7, 8]]
>>> map(len, try2.nblocks(range(17*9), 10))
[16, 15, 15, 16, 15, 15, 16, 15, 15, 15]
( Provides better distribution )
Everyone: It seems I wasn’t clear in my explanation of the challenge – it is to write a function that receives as parameter the *number of blocks*, and not the length of the block.
I tried to improve the situation in the blog post, by emphasizing the required signature, let me know if you think it’s still unclear.
Yeah, my bad. Rename block_len to block_count. It does what you asked, I just didn’t pay enough attention to the naming.
For instance:
here are my two cents :
the idea is to work on “lengths” first: average is length/nb_blocks, and extra, is to be dispatched.
Then I cumulate the lengths to turn them in indices
that’s the lblock function.
the nblock is then straightforward.
little improvement. I didn’t get the “even” part !
it uses rounding error to dispatch evenly the extras, this time.
oups, sorry, needed a little renaming to keep the code readable
Maybe I’m late to the party, but I wrote this last night:
Ah ok, hope to understand the problem better, now. A straightforward approach (thanks Thilo):
A generator of blocks:
So that seq doesn’t have to be a list:
Here is my solution:
I do believe that there were other solutions prettier than this.
What I really liked about this challenge is that it sounds really easy, and even once you define the problem, it takes more time than you expected.
I took a small liberty here and relaxed the requirements by saying that
nblocks
must return an iterator (which is not necessarily a list). This is the best I came up with:Quite magical solution that uses same idea as grouper example in itertools docs
http://docs.python.org/2/library/itertools.html#recipes
[PYTHON]
def superpercentile(l, n):
border = len(l) / n * n
args = [iter(l[:border])] * n
r = zip(*args)
r[-1] += tuple(l[border:])
return r
[/PYTHON]