This is a problem that I encountered a short while ago. It seems like it could be easily solved very efficiently, but it’s not as easy as it looks.

Let’s say that we are given N (finite) sets of integers – S. For now we won’t assume anything about them. We are also given another set, a. The challenge is to write an efficient algorithm that will count how many sets from S contain a (or how many sets from S a is a subset of).

Let’s call a single test a comparison. The naive algorithm is of course checking each of the sets, which means exactly N comparisons. The challenge – can you do better? When will your solution outperform the naive solution?

I will give my solution in a few days. Submit your solutions in the comments, preferably in Python. You can write readable code using [ python ] [ /python ] blocks, just without the spaces.

## 17 replies on “Small Python Challenge No. 4 – Counting Sets”

What other actions can you make and how much do they cost?

(or is it only “is set A a subset of set B”=1 ?)

You can do any standard set operation (intersection, union, difference, etc…). You can also assume sets are hashable (in Python, using frozensets) if it helps you.

Note that you can do any reasonable preparation work you want, as long as on a sequence of many runs you get better complexity.

To make it interesting, provide an interface that people need to implement.

No problem. I’ll even do that in Python:

[python]

class SetGroup(object):

def __init__(self, list_of_sets):

pass

def count(self, some_set):

“””return the number of sets x for which some_set <= x""" pass [/python]

if sets are hashable, and you have exponential time and memory for construction of a data structure, you could create an hash table whose keys are every possible subset in the union of S, and the entry for key K is the number of set in S that contain K. Then a query will take exactly 0 comparisons.

If I have enough time to prepare (I don’t know exactly what is : reasonnable) Il would make a big index of integers.

A List, L, that contains at index i, the list of indexes of sets that contains i.

L[i] is the list of j so that list_of_sets[j] contains i

Of course this list will be sparse, and I’ill use a linked list to get efficient.

Getting the answer is all about getting Intersect( L[a] )

Building L is costly, but getting Inter(L[a]) is not. The intersection can be computed incrementally (and we can lazily stop as soon as it gets empty).

It will cost A=card(a) intersection of arrays ( ordered, and of card lower than or equals to N), that is at worst A*N boolean test

the naive algorithm costs A*N*S tests (with in fact N*S beeing the sum of all card of all sets in list_of_sets ) .

I don’t think I’ve time to write it down in python right now, I’ll try it later (maybe).

[python]

def count(self, some_set):

I = L[some_set[0]]

for i in some_set:

I=I & L[i]

if len(I) ==0 : return 0

return size(I)

[/python]

(didn’t test it, and of course didn’t write the construction of L, that should be sets.

rouli:

I don’t consider exponential time & space requirements reasonable :)

eric:

That’s a very nice solution. I like how the complexity is dependent on the cardinality of a.

Well, I didn’t read anyone else’s comments except to make sure I’m the first to write some code, but the solution I chose is this:

create a dictionary of for each integer in one of the sets S, which sets it belongs to.

create a copy of S, let’s call it B.

Then, for every integer in the set a, B = intersect B with the set of sets to which it belongs (from the dictionary).

return the length of the resulting B

I hope my code can be more clear. This is almost entirely pasted out of my interpreter, so sorry about the tab issues.

[python]

def build_setdict(set_seq):

set_dict = {}

for some_set in set_seq:

for value in some_set:

try:

set_dict[value].add(some_set)

except KeyError:

set_dict[value] = set([some_set])

return set_dict

def get_containing_sets(sets, set_to_test):

sets = frozenset(map(frozenset, sets))

# integer->setlist mapping

setdict = build_setdict(sets)

remaining_sets = sets

try:

for i in set_to_test:

remaining_sets = remaining_sets.intersection(setdict[i])

except KeyError:

# Integer in no set

return set()

return remaining_sets

#example test case

>>> sets = [set(range(10)), set(range(10, 20)), set(range(20))]

>>> len(get_containing_sets(sets, range(5)))

2

>>> len(get_containing_sets(sets, range(5,15)))

1

>>> len(get_containing_sets(sets, range(5,25)))

0

>>> len(get_containing_sets(sets, range(11,16)))

2

[/python]

eric & Shenberg:

Both of your solutions are equivalent, and they both work best when the set to test is small. This means that it will probably work best when the set to test is common. When it is rare, it will take more time.

The solution I originally thought of has it the other way around, although it has it’s drawbacks. I need to do a more thorough complexity analysis of both algorithms to be able to compare them correctly.

Another note regarding the problem where I needed this – in that problem I had to score a set according to its rarity. Since very common sets score low, a possible improvement for the real-life problem might be to return TOO_MANY (equivalent to zero score) for sets that appear more times than some threshold.

Regardless, I’m still very curious to see more solutions to the full problem as stated in the post.

By the way, to all of you:

Thank you for your solutions!

Here’s an academic viewpoint of closely related problems.

http://arxiv.org/abs/0707.1532

Main theme: the complexity heavily depends on the width of the poset, that is, the size of the largest antichain. An antichain is a subset A of S such that no two sets in A are a subset of one another.

Rani:

That reminds me a very nice story – how to differentiate “mathematical programmers” from “practical programmers”. A few years back I had to sort a set of sets, and articles on ordering posets didn’t seem to do much help.

It turns out that when you ask “mathematical programmers” regarding this problem – they will also look at posets.

The amusing part comes when you figure out a “cheating solution” – for each set s compute (len(s), hash(s)), and sort according to that. If you do, it is guaranteed that if s1 <= s2, it will be before it in the ordering. The hash is there to keep the ordering well-defined for sets with the same cardinality. It turns out that less mathematically inclined programmers think of this solution much faster :)

True, but the notion of preprocessing the set S to get a chain data structure is one of the basic, recurring, ideas and it is indeed a powerful “trick”.

BTW, the paper I linked to has a practical algorithm that computes the height of *all* sets in S if this is what the original application required.

I’m looking forward to seeing your solution !

Yeah I think I must be misunderstanding something. Are set comparisons the only thing that costs time? If so, here’s my solution…

1. compute D[i] = a – S[i] for each S[i]

2. count = 0

3. for each D[i], if len(D[i]) = 0, count = count + 1

4. print count

Yeah that doesn’t really do any set comparisons. Probably not what you were going for. Posted for the lulz.

Anonymous:

1. Each computation of D[i] may be considered equivalent to a comparison.

2. Your solution does O(N) work for each given a.