Categories
Challenges Programming Python Statistics

Small programming challenge No. 6 – nblocks

I came up with this challenge when I had to write a function that divides a sequence into percentiles. I needed this to calculate some statistics over trips for plnnr.com. “This sounds trivial,” I thought, and reached for my simple blocks function:

def blocks(seq, block_len):
    """blocks(range(5), 2) -> [[0, 1], [2, 3], [4]]"""
    seq_len = len(seq)
    if seq_len % block_len == 0:
        num_blocks = seq_len // block_len
    else:
        num_blocks = 1 + (seq_len // block_len)

    result = [[] for i in range(num_blocks)]
    for idx, obj in enumerate(seq):
        result[idx // block_len].append(obj)
    return result

So I set block_len to len(seq)/10 and called blocks(seq, block_len). Unfortunately, according to the docs of blocks (which I wrote…), when there is extra data, a “partial” block is added – which is exactly what we don’t want when calculating percentiles.
Instead, the behavior we want is nblocks(seq, number_of_blocks), for example: nblocks(range(10), 3) -> [[0, 1, 2], [3, 4, 5], [6, 7, 8, 9]].

This function is surprisingly hard to write, or rather, harder than you’d expect. I’ll be especially glad if someone writes it elegantly. My solution works well enough, but it isn’t the prettiest.

So, you have the definition – let’s see if you can do better than me. Once enough solutions are presented, I will present my own.

IMPORTANT EDIT: the required signature is nblocks(seq, num_blocks). So for nblocks(range(10), 3), the return value should be 3 blocks, with the last one having an extra item. As a general rule, the extra items should be spread as evenly as possible.
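For reference, here is one compact way to meet this spec (a sketch, not the solution I will post; it assumes seq supports slicing):

def nblocks(seq, num_blocks):
    """nblocks(range(10), 3) -> [[0, 1, 2], [3, 4, 5], [6, 7, 8, 9]]"""
    n = len(seq)
    # The boundary of block i is (i * n) // num_blocks, which spreads
    # the leftover items as evenly as integer division allows.
    return [list(seq[(i * n) // num_blocks:((i + 1) * n) // num_blocks])
            for i in range(num_blocks)]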

Categories
Challenges computer science Programming Python

The mathematics behind the solution for Challenge No. 5

If you take a look at the various solutions people proposed for the last challenge of generating a specific permutation, you’ll see that they are very similar. Most of them are based on some form of div-mod usage. The reason is that all of these solutions are using the Factorial Base.

What does that mean?
Note that we usually encounter div-mods when we want to find the representation of a number in a certain base. That should already pique your interest. Now consider that a base’s digits need not have the same weight. For example, consider how we count the number of seconds since the start of the week:

seconds of the last minute, A (at most 60-1)
minutes of the last hour, B (at most 60-1)
hours of the last day, C (at most 24-1)
days of the last week, D (at most 7-1)

So given A, B, C, D, we would say that the number of seconds is:
A + 60*B + 60*60*C + 24*60*60*D. Note that each digit’s weight is the product of the ranges of all the digits below it. This certainly looks like a base transformation, just with mixed radices. To go back, we would use divmod.
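To make the divmod direction concrete, here is a small sketch (the function names are mine, for illustration only):

def to_week_digits(total_seconds):
    """Split a second count into (D, C, B, A) = (days, hours, minutes, seconds)."""
    total_seconds, A = divmod(total_seconds, 60)
    total_seconds, B = divmod(total_seconds, 60)
    D, C = divmod(total_seconds, 24)
    return D, C, B, A

def from_week_digits(D, C, B, A):
    return ((D * 24 + C) * 60 + B) * 60 + A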

The factorial base is just the same, with the radices n, n-1, … 1. Note that in the factorial base, you can only represent finitely many numbers – exactly n! of them. This should not be surprising – this is what we set out to do in the first place!
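In code, finding the representation is just repeated divmod (a small sketch; the function name is mine):

def factorial_base_digits(num, n):
    """The digits of num (0 <= num < n!) in the factorial base,
    least significant first; digit i ranges over 0..i."""
    digits = []
    for radix in range(1, n + 1):
        num, digit = divmod(num, radix)
        digits.append(digit)
    return digits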
The thing that I found really amazing about this is that all the people to whom I posed this challenge came up with almost the same “way” of solving it.

Other interesting curiosities regarding bases can be found in Knuth’s book, “The Art of Computer Programming”, volume 2, Section 4.1.

Categories
Challenges computer science Programming

Small Programming Challenge no. 5 – Generating a Permutation

I thought of this one quite a long time ago, and I believe that the idea behind it is pretty nice mathematically. I got the idea for it from Knuth’s “The Art of Computer Programming”.

The challenge is simple:
write a function that receives as arguments two numbers, n and num, such that 0 <= num < n!. This function needs to return an array (list) representing a permutation of the numbers 0..n-1. For each possible num, the function needs to return a different permutation, such that over all values of num, all possible permutations are generated. The order of permutations is up to you.

The function you write should do this in at most O(n) time & space (various O(n log n) solutions are also acceptable).

Write your solutions in the comments, in [ LANG ] [/ LANG ] blocks (without the spaces), where LANG is preferably Python :). I will post my solution in a few days. As usual, the most efficient & elegant solution wins. Go!
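For the impatient, here is one standard construction along these lines (a sketch, and not necessarily the solution I will post): decompose num in the factorial base, and let the digits drive a Fisher-Yates-style sequence of swaps.

def permutation(n, num):
    """Return a distinct permutation of 0..n-1 for each 0 <= num < n!,
    in O(n) time and space."""
    perm = list(range(n))
    for i in range(n - 1, 0, -1):
        num, j = divmod(num, i + 1)  # a factorial-base digit, 0 <= j <= i
        perm[i], perm[j] = perm[j], perm[i]
    return perm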

Categories
Challenges computer science Python

My solution to the counting sets challenge

A few days ago, I wrote up a challenge – to count the number of sets a given set is contained in.

In the comments, I touched briefly on the original problem from which the challenge was created, and I’ll describe it in more depth here.
In the problem, I am given an initial group of sets, and then an endless ‘stream of sets’. For each of the sets in the stream, I have to measure its uniqueness relative to the initial group of sets. A set that is contained in only one set from the initial group is very unique; one that is contained in ten – not so much.

So how to solve this problem? My original solution is somewhat akin to the classic “lion-in-the-desert” problem, but more like the “blood test” story. I didn’t find a link to the story, so I’ll give it as I remember it.

In an army somewhere, it was discovered that at least one of the soldiers was sick and so had to be put in isolation until he heals. It is only possible to check for the disease via a blood test, but tests are expensive, and they didn’t want to test all of the soldiers. What did they do?

They took a blood sample from each soldier. From each sample they took a little bit, and divided these bits into two groups. They mixed together the bits of each group, and tested the two mixed samples. If a mixed sample tested positive, they repeated the process recursively on the blood samples of all the soldiers in the matching group.

Now my solution is clear: let’s build a tree of set unions. At the bottom level are the original sets; above them, the unions of couples of sets; at the next level, unions of couples of couples of sets; and so on, until we end up with just two sets at the top, or even just one – if we are not sure the set is contained in any of the initial sets at all.

Testing is just like in the story. We’ll start at the two biggest unions and work our way down. There is an optimization though: if a set appears more than, say, 10 times, it’s not very unique, and its score is zeroed. In that case, we don’t have to go down all the way, but can stop as soon as we pass the 10 “positive results” mark.

Here’s the code:

import itertools

def union(sets):
    # The union of an iterable of sets.
    return set().union(*sets)

class SetGroup(object):
    def __init__(self, set_list):
        # blocks() is the chunking helper from the nblocks post above.
        cur_level = list(set_list)
        self.levels = []
        while len(cur_level) > 1:
            self.levels.append(cur_level)
            cur_level = [union(couple) for couple in blocks(cur_level, 2)]
        self.levels.reverse()

    def count(self, some_set, max_appear=None):
        # Walk the union tree top-down; node x at one level has
        # children 2*x and 2*x+1 at the next.
        indexes = [0]
        for level in self.levels:
            indexes = itertools.chain((2*x for x in indexes), (2*x+1 for x in indexes))
            indexes = (x for x in indexes if x < len(level))
            # Keep only the unions that still contain some_set.
            indexes = [x for x in indexes if some_set <= level[x]]
            if max_appear is not None and len(indexes) >= max_appear:
                return max_appear
        return len(indexes)

Here’s a link to the full code.
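For context, usage looks roughly like this (the variable names here are made up):

group = SetGroup(initial_sets)
for some_set in stream_of_sets:
    appearances = group.count(some_set, max_appear=10)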

I didn’t implement this solution right away. At first, I used the naive approach of checking against each set. Then, when it proved to be too slow, I tried implementing the solution outlined by Shenberg and Eric in the comments to the challenge. Unfortunately, their solution proved to be very slow as well. I believe it’s because some elements appear in almost all of the sets, and so computing the intersection for these elements takes a long time.
Although originally I thought that my solution would suffer from some serious drawbacks (can you see what they are?), the max_appear limit removed most of the issues.

Implementing this solution was a major part of taking the running time of the complete algorithm for the full problem I was solving down from about 2 days to about 15-20 minutes. That was one fun optimizing session :)

Categories
Challenges computer science

Small Python Challenge No. 4 – Counting Sets

This is a problem that I encountered a short while ago. It seems like it could be easily solved very efficiently, but it’s not as easy as it looks.
Let’s say that we are given N (finite) sets of integers – call them S. For now we won’t assume anything about them. We are also given another set, a. The challenge is to write an efficient algorithm that will count how many sets from S contain a (that is, how many sets from S a is a subset of).

Let’s call a single subset test a comparison. The naive algorithm is of course checking each of the sets, which means exactly N comparisons. The challenge: can you do better? When will your solution outperform the naive solution?
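For reference, the naive algorithm is a one-liner (a trivial sketch):

def count_containing(S, a):
    # One subset test ("comparison") per set in S.
    return sum(1 for s in S if a <= s)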

I will give my solution in a few days. Submit your solutions in the comments, preferably in Python. You can write readable code using [ python ] [ /python ] blocks, just without the spaces.

Categories
Assembly Challenges Programming Testing

Some Assembly Required No. 1

I’ve been working on some of the instruction tests in vial, and I wanted to test the implementation of LOOP variants. My objective was to make sure the vial version is identical to the real CPU version (as discussed here). To achieve this, I had to cover all of the essential behaviors of LOOP.

Well, using the framework Gil and I wrote, I hacked up some code that should cover the relevant cases:

code_template = """
mov edx, ecx ; control the start zf
mov ecx, eax ; number of iterations
mov eax, 0 ; will hold the result, also an iteration counter
loop_start:
 
    cmp eax, ebx    ; check if we need to change zf
    setz dh
    xor dh, dl      ; if required, invert zf
    inc eax         ; count the iteration
    cmp dh, 0       ; set zf
 
    loop%s loop_start
"""
for loop_kind in ['','z','nz']:
    code_text = code_template % loop_kind
    c = FuncObject(code_text)
    for start_zf_value in [0,1]:
        for num_iters in [1,4,10]:
            for when_zf_changes in [1,2,15]:
                c(num_iters, when_zf_changes, start_zf_value)
                c.check()

Note that c(…) executes the code both on vial’s VM and on the real CPU. c.check() compares their return values (EAX) and flags after the execution. I also wanted to avoid other kinds of jumps in this test.

To check that the loop ran the same number of times in both environments, I returned the iteration count in EAX.
All the games with edx are there to make sure that I’m testing different zf conditions.

The challenge for today:
Can you write a shorter assembly snippet that tests the same thing?

Categories
Challenges Math Programming Python

Small Python Challenge No. 3 – Random Selection

This time I’ll give two related problems, both not too hard.

Let’s warm up with the first:

You have a mapping between items and probabilities. You need to choose each item with its probability.

For example, consider the items [‘good’, ‘bad’, ‘ugly’], with probabilities [0.5, 0.3, 0.2] respectively. Your solution should choose ‘good’ with probability 50%, ‘bad’ with 30% and ‘ugly’ with 20%.

I came to this challenge because just today I had to solve it, and it seems like a common problem. Hence, it makes sense to ask ‘what is the best way?’.
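For reference, one common solution is to build the cumulative distribution and make a single uniform draw against it (a sketch; the function name is mine – and in modern Python, random.choices(items, weights=probabilities) does the same job):

import bisect
import random

def weighted_choice(items, probabilities):
    # Running totals of the probabilities form the cumulative distribution.
    cumulative = []
    total = 0.0
    for p in probabilities:
        total += p
        cumulative.append(total)
    # One uniform draw, then a binary search for the matching item.
    return items[bisect.bisect(cumulative, random.random() * total)]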

The second problem is slightly harder:

Assume a bell-shaped function p(x) that you can ‘solve’. This means that given a value y, you can find all x such that p(x)=y. For example, sin(x)^2 in [0,pi] is such a function. Given a function such as Python’s random.random() that yields a uniform distribution of values in [0,1), write a function that yields a distribution proportional to p(x) in the appropriate interval.

For example, consider the function p(x) = e^(-x^2) in [-1,1]. Since p(0) = 1, and p(0.5)~0.779, the value 0 should be p(0)/p(0.5)~1.28 times more common than 0.5.
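One standard technique that fits here is rejection sampling; note that it does not even need the ‘solve’ property, only a bound on p (a sketch, with names of my own choosing):

import math
import random

def sample_proportional(p, lo, hi, p_max):
    # Draw points uniformly in the box [lo, hi] x [0, p_max] and accept
    # x whenever the point falls under the curve; the accepted values
    # are distributed proportionally to p(x).
    while True:
        x = random.uniform(lo, hi)
        if random.random() * p_max <= p(x):
            return x

# For the example above: sample_proportional(lambda x: math.exp(-x * x), -1.0, 1.0, 1.0)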

As usual, the preferred solutions are the elegant ones. Go!

note: please post your solutions in the comments, using [ python]…[ /python] tags (but without the spaces in the tags).

Categories
Challenges Design Programming Python

LRU cache solution: a case for linked lists in Python

The reason I put up the LRU cache challenge was that I couldn’t think of a good solution to the problem without using linked lists. This was pointed out by Adam and Erez as well. Adam commented on it, and Erez’ solution to the problem was algorithmically identical to mine.

So how to solve the challenge? Here are the two possible solutions I thought about:

  • Use a dict for lookup, and each element’s age is indicated by its position in a linked list. This is the solution Erez and I implemented (a minimal sketch follows below).
  • Keep a ‘last time of use’ indicator for each element. This could be just a regular int, incremented by 1 for each lookup. Keep the elements in a min-heap, and when there are too many elements, pop the least recently used ones from the heap.
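For illustration, here is a minimal sketch of the first idea, built on collections.OrderedDict, which itself keeps its ordering in a linked list (the class and its interface are mine, not the implementation Erez and I wrote):

from collections import OrderedDict

class LRUCache(object):
    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key):
        value = self._data.pop(key)  # raises KeyError on a miss
        self._data[key] = value      # re-insert as the most recently used
        return value

    def put(self, key, value):
        self._data.pop(key, None)
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict the least recently used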

Generally, I consider the first solution more elegant. It doesn’t rely on an integer to work, so it could work ‘indefinitely’. Of course, the second solution can also be made to work indefinitely, with some upkeep from time to time. (The added time cost of the upkeep may be amortized over other actions.)

If you can think of some other, more elegant solution, I’ll be happy to hear about it.

So, given that a linked list solution is more elegant, we come to the crux of the problem: what to do in Python? The Python standard library does not contain a linked list implementation as far as I know. As a result, Python programmers are encouraged to use the list type, which is an array. This is just as well: for most intents and purposes, the list type is good enough.

I tried to think a little about other cases where a linked list would be more appropriate, and I didn’t come up with any. If you do come up with such a case, I’ll be happy to hear about it.

After looking for a public implementation, and not finding one that seemed good enough, I decided to go ahead and write my own.

Out of curiosity, I also did a small comparison of runtime speeds between my implementation of a linked list and the list data type. I tried a test where a linked list has an obvious advantage (complexity-wise): removing elements from the middle of the list. The Python list won up to somewhere in the thousands of elements. (Of course, list is implemented in C, and mine is in pure Python.)

What is my conclusion from all of this? The same as the conventional wisdom: use the list data type almost always. If you find yourself in need of a linked list, think long and hard (well, not too long) about your problem and solution. There’s a good chance that either you can use the built-in list with an equivalent solution, or that a regular list will still be faster, for most cases. Of course, if you see no other way – do what you think is best.

Your thoughts?

Categories
Challenges

Top 13 challenge sites

Statusreport in his new blog Ingeration has compiled a list of top challenge sites. I must admit, I already knew some of them, and he had told me about others earlier. Still, this time around I tried Electrica, and it’s really fun solving those puzzles with Python. I’m afraid to start the Python Challenge again after stopping a long time ago; I still have my homework to do!

Categories
Algorithms Challenges computer science Programming Python Utility Functions

Small Python Challenge No. 2 – LRU Cache

Caching is easy. Consider the cache I used to optimize the recursive spring:

class _NotInDict(object):
    pass
_NotInDict = _NotInDict()  # a unique sentinel meaning "key not in cache"
def cached(func):
    cache = {}
    def wrapper_func(*args):
        prev_result = cache.get(args, _NotInDict)
        if prev_result is _NotInDict:
            # First call with these args: compute and remember the result.
            result = func(*args)
            cache[args] = result
            return result
        return prev_result
    return wrapper_func

This kind of cache is simple and effective (especially for recursions), and may be used in all sorts of situations. However, sometimes you want a size-limited cache. In that case you have to decide on the criterion used to decide which items to throw away. There are many kinds of criteria in use; for further reading, check out Wikipedia.

For now I’d like to discuss the LRU cache though. LRU stands for Least Recently Used, which means that you throw away the items you haven’t used for a long time. Time in this case is measured by actions. I thought of this type of cache when I worked on the recursive spring. Since each step in the ‘recursivation’ used two samples of the previous step, caching was an obvious choice, and if I had to size-limit my cache, LRU would be the type to use, as you could be certain that the older samples would not be needed until the next drawing.

The challenge for the weekend is to write an LRU cache in Python. The cache has to be general – supporting hashable keys and any required cache size. It has to be efficient – in the size of the cache and in the time it takes for a lookup and an update. Once the standard requirements have been met, the big competition should be on elegance. I wrote a benchmark implementation, which is efficient, but not fast. Once I see some solutions, I’ll talk a bit about mine, which is an interesting case-study.