My solution to the counting sets challenge

A few days ago, I wrote up a challenge – to count the number of sets a given set is contained in.

In the comments, I touched briefly on the original problem from which the challenge was created, and I’ll describe it in more depth here.
In the problem, I am given an initial group of sets, and then an endless ‘stream of sets’. For each of the sets in the stream, I have to measure its uniqueness. relative to the initial group of sets. A set that is contained in only one set from the initial group is very unique, one that is contained in ten – not so much.

So how to solve this problem? My original solution is somewhat akin to the classic “lion-in-the-desert” problem, but more like the “blood test” story. I didn’t find a link to the story, so I’ll give it as I remember it.

In an army somewhere, it was discovered that at least one of the soldiers was sick and so had to be put in isolation until he heals. It is only possible to check for the disease via a blood test, but tests are expensive, and they didn’t want to test all of the soldiers. What did they do?

They took enough blood from each soldier. Now, from each sample they took a little bit, and divided the samples into two groups. They mixed together the samples of each group, and tested the mixed sample. If the sample was positive – they repeated the process for the blood samples of all the soldiers in the matching group.

Now my solution is clear: let’s build a tree of set unions. At bottom level will be the union of couples of sets. At the next level, unions of couples of couples of sets. So on, until we end up with just two sets, or even just one – if we are not sure the set is contained in any of the initial sets.

Testing is just like in the story. We’ll start at the two biggest unions, and work our way down. There is an optimization though – if a set appears more than say, 10 times, it’s not very unique, and its score is zeroed. In that case, we don’t have to go down all the way, but stop as soon as we pass the 10 “positive result” mark.

Here’s the code:

class SetGroup(object):
    def __init__(self, set_list):
        cur_level = list(set_list)
        self.levels = []
        while len(cur_level) > 1:
            self.levels.append(cur_level)
            cur_level = [union(couple) for couple in blocks(cur_level, 2)]
        self.levels.reverse()
 
    def count(self, some_set, max_appear = None):
        indexes = [0]
        for level in self.levels:
            indexes = itertools.chain((2*x for x in indexes), (2*x+1 for x in indexes))
            indexes = (x for x in indexes if x < len(level))
            indexes = [x for x in indexes if some_set <= level[x]]
            if max_appear is not None and len(indexes) >= max_appear:
                return max_appear
        return len(indexes)

Here’s a link to the full code.

I didn’t implement this solution right away. At first, I used the naive approach, of checking against each set. Then, when it proved to be too slow, I tried implementing the solution outlined by Shenberg and Eric in the comments to the challenge. Unfortunately, their solution proved to be very slow as well. I believe it’s because some elements appear in almost all of the sets, and so computing the intersection for these elements takes a long time.
Although originally I thought that my solution would suffer from some serious drawbacks (can you see what they are?), the max_appear limit removed most of the issues.

Implementing this solution was a major part of taking down the running time of the complete algorithm for the full problem I was solving from about 2 days, to about 15-20 minutes. That was one fun optimizing session :)

This entry was posted in Challenges, computer science, Python and tagged , , , . Bookmark the permalink.

5 Responses to My solution to the counting sets challenge

  1. Rani says:

    The problem you’re referring to is called Combinatorial Group Testing, and has efficient* solutions even when:
    (i) more than one soldier is sick (say, at most k out of n), and
    (ii) you want to decide in advance which samples get sent to the lab, instead of waiting log(n) roundtrips (this property is called non-adaptiveness).

    For the case of k=1 you can pick the unions to check as follows:
    union number i has all sets whose index j satisfies (j>>i)&1==1.

    * Thats O( k^2 log(n/k) ).

  2. rouli says:

    Actually, I pondered about this method, and dismissed it once I’ve thought about the scenario where the union of every two sets in the original list of sets equals the union of all the sets in the original list.
    In that case, you’ll have to go all the way down, doing more comparisons than in the naive algorithm.
    (Obviously it’s solved by introducing max_appear)

  3. lorg says:

    rouli:
    That was one of the reasons I didn’t implement this solution right away. If the given set is contained in all the sets, you’ll do 2*n-1 checks to discover this. Before I thought of ‘max_appear’, I thought of other checks you can do before you run it, and then maybe run a different algorithm, maybe even the naive one.

    I also tried running eric and Shenberg’s solution only for input sets of cardinality 1 and 2 (and for the rest run my solution), but that one intersection was enough to kill it. There’s still a possibility that there are a few implementation kinks I put there that can fix it, but I doubt it.

  4. lorg says:

    Rani:
    It’s good to see the proper name for the “soldier-testing” algorithm, thanks!
    I didn’t understand though if you can adapt/find a better solution for the challenge. I’ll be happy to hear what you think – especially if it reduces the running time of my (very real) program :)

  5. lorg says:

    rouli:
    Regarding every two sets’ union being equal to the union of all sets – well, fortunately that doesn’t happen in my input.
    I forgot to mention that before I implemented the algorithm I did various “dry tests”, of estimating the the number of containment tests. In the general/worst case (without ‘max_appear’), where the input set is also contained in many set, it really isn’t good. It’s very easy to kill this algorithm, but for ‘sane’ input it performs well enough.

    As I said – there are many other drawbacks. For example, the set group may be ordered in such a way that the tree built from it will be very inefficient later on. My (theoretical) cheap shot at this was to randomize the order. I thought of some more drawbacks, but enough for now. Especially since it works :)