A few days ago, I wrote up a challenge – to count the number of sets a given set is contained in.

In the comments, I touched briefly on the original problem from which the challenge was created, and I’ll describe it in more depth here.

In the problem, I am given an initial group of sets, and then an endless ‘stream of sets’. For each of the sets in the stream, I have to measure its uniqueness relative to the initial group of sets. A set that is contained in only one set from the initial group is very unique, one that is contained in ten – not so much.

So how to solve this problem? My original solution is somewhat akin to the classic “lion-in-the-desert” problem, but more like the “blood test” story. I didn’t find a link to the story, so I’ll give it as I remember it.

In an army somewhere, it was discovered that at least one of the soldiers was sick, and so had to be put in isolation until he recovered. The disease could only be detected with a blood test, but tests were expensive, and they didn’t want to test all of the soldiers. What did they do?

They drew enough blood from each soldier for several tests. Then they took a little bit from each sample, and divided the samples into two groups. They mixed together the samples of each group, and tested the mixed sample. If a mixed sample was positive, they repeated the process for the blood samples of all the soldiers in the matching group.
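The story can be sketched in a few lines of Python. This is only an illustration of the halving idea as I remember it, not anything from the original problem: the names `find_sick`, `is_sick` and `pooled_test` are made up for the sketch, and a "mixed sample" is simulated by checking whether any soldier in the group is sick.

```python
def find_sick(soldiers, is_sick):
    """Return (list of sick soldiers, number of pooled tests used)."""
    tests = 0

    def pooled_test(group):
        nonlocal tests
        tests += 1
        # A mixed sample is positive if any soldier in the group is sick.
        return any(is_sick(s) for s in group)

    def search(group):
        # A negative pool clears the whole group with a single test.
        if not pooled_test(group):
            return []
        if len(group) == 1:
            return list(group)
        mid = len(group) // 2
        return search(group[:mid]) + search(group[mid:])

    # We already know someone is sick, so start directly with the two halves.
    mid = len(soldiers) // 2
    found = search(soldiers[:mid]) + search(soldiers[mid:])
    return found, tests


soldiers = list(range(16))
found, tests = find_sick(soldiers, lambda s: s == 5)
print(found, tests)
```

When only a few soldiers are sick, most pools come back negative and whole groups are cleared at once, which is where the saving comes from.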

Now my solution is clear: let’s build a tree of set unions. At the bottom level will be the unions of couples of sets. At the next level, unions of couples of couples of sets. And so on, until we end up with just two sets, or even just one, if we are not sure the set is contained in any of the initial sets.

Testing is just like in the story. We’ll start at the two biggest unions, and work our way down. There is an optimization though: if a set appears more than, say, 10 times, it’s not very unique, and its score is zeroed. In that case, we don’t have to go down all the way, but can stop as soon as we pass the 10 “positive result” mark.

Here’s the code:

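import itertools

# union() and blocks() are helpers that are not shown in this snippet.

class SetGroup(object):
    def __init__(self, set_list):
        # Start from the original sets; each new level holds unions of couples
        # taken from the previous level.
        cur_level = list(set_list)
        self.levels = []
        while len(cur_level) > 1:
            self.levels.append(cur_level)
            cur_level = [union(couple) for couple in blocks(cur_level, 2)]
        # Biggest unions first, original sets last.
        self.levels.reverse()

    def count(self, some_set, max_appear=None):
        # Positions in the current level whose set contains some_set.
        indexes = [0]
        for level in self.levels:
            # Descend to both children of every positive node.
            indexes = itertools.chain((2*x for x in indexes), (2*x+1 for x in indexes))
            indexes = (x for x in indexes if x < len(level))
            indexes = [x for x in indexes if some_set <= level[x]]
            # Stop early once enough positive results have been seen.
            if max_appear is not None and len(indexes) >= max_appear:
                return max_appear
        return len(indexes)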

Here’s a link to the full code.
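The class relies on two helpers, `union` and `blocks`, that aren't shown in the snippet. Here is a minimal sketch of what they might look like, assuming `blocks` splits a list into consecutive chunks of a given size and `union` merges an iterable of sets. These definitions are my guess for illustration and may differ from the ones in the full code.

```python
def blocks(seq, n):
    """Split seq into consecutive chunks of n items; the last may be shorter."""
    return [seq[i:i + n] for i in range(0, len(seq), n)]


def union(sets):
    """Union of an iterable of sets (also works for a single leftover set)."""
    result = set()
    for s in sets:
        result |= s
    return result


# One step of building the tree: five sets become three unions of couples.
initial = [{1, 2}, {2, 3}, {3, 4}, {4, 5}, {5, 6}]
level = [union(couple) for couple in blocks(initial, 2)]
# level is now [{1, 2, 3}, {3, 4, 5}, {5, 6}]
```

With helpers like these in scope, `SetGroup(initial)` can be built and queried with `count`. Note that an odd set at the end of a level is simply carried up on its own.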

I didn’t implement this solution right away. At first, I used the naive approach, of checking against each set. Then, when it proved to be too slow, I tried implementing the solution outlined by Shenberg and Eric in the comments to the challenge. Unfortunately, their solution proved to be very slow as well. I believe it’s because some elements appear in almost all of the sets, and so computing the intersection for these elements takes a long time.
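For comparison, here is a rough sketch of the two slower approaches. `count_naive` is the straightforward check against every set. `count_by_index` is one plausible reading of the intersection-based idea: an inverted index from each element to the sets that contain it. It is written only as an illustration, not as the commenters' exact code, and all function names here are made up for the sketch.

```python
from collections import defaultdict


def count_naive(set_list, some_set):
    """Check some_set against every initial set: one subset test per set."""
    return sum(1 for s in set_list if some_set <= s)


def build_index(set_list):
    """Map each element to the indexes of the initial sets that contain it."""
    index = defaultdict(set)
    for i, s in enumerate(set_list):
        for element in s:
            index[element].add(i)
    return index


def count_by_index(index, num_sets, some_set):
    """Intersect the index entries of every element of some_set."""
    if not some_set:
        return num_sets  # the empty set is contained in every set
    # Start from the rarest element so the running intersection stays small.
    postings = sorted((index.get(e, set()) for e in some_set), key=len)
    candidates = set(postings[0])
    for p in postings[1:]:
        candidates &= p
        if not candidates:
            break
    return len(candidates)


sets = [{1, 2, 3}, {2, 3, 4}, {3, 4, 5}, {1, 2, 3, 4}]
index = build_index(sets)
print(count_naive(sets, {2, 3}), count_by_index(index, len(sets), {2, 3}))
```

When an element appears in almost every set, its index entry is nearly as large as the whole group, so intersecting it costs about as much as the naive scan.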

Although originally I thought that my solution would suffer from some serious drawbacks (can you see what they are?), the max_appear limit removed most of the issues.

Implementing this solution was a major part of taking down the running time of the complete algorithm for the full problem I was solving from about 2 days, to about 15-20 minutes. That was one fun optimizing session :)