This time I’ll give two related problems, both not too hard.

Let’s warm up with the first:

You have a mapping between items and probabilities. You need to choose each item with its probability.

For example, consider the items [‘good’, ‘bad’, ‘ugly’], with probabilities of [0.5, 0.3, 0.2] accordingly. Your solution should choose good with probability 50%, bad with 30% and ugly with 20%.

This challenge came up because I had to solve exactly this problem just today, and it seems common enough that it makes sense to ask: what is the best way?
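(A note for readers on modern Python: since version 3.6 the standard library solves this directly with random.choices, which accepts per-item weights. A quick sanity check, assuming repeated independent draws are all you need:)

```python
import random
from collections import Counter

random.seed(42)  # fixed seed so the check is repeatable

items = ['good', 'bad', 'ugly']
probabilities = [0.5, 0.3, 0.2]

# Draw 100,000 weighted samples and compare empirical frequencies
# against the target probabilities.
counts = Counter(random.choices(items, weights=probabilities, k=100_000))
for item, p in zip(items, probabilities):
    print(item, p, counts[item] / 100_000)
```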

The second problem is slightly harder:

Assume a bell shaped function p(x) that you can ‘solve’. This means that given a value y, you can get all x such that p(x)=y. For example, sin(x)^2 in [0,pi] is such a function. Given a function such as Python’s random.random() that yields a uniform distribution of values in [0,1), write a function that yields a distribution proportional to p(x) in the appropriate interval.

For example, consider the function p(x) = e^(-x^2) in [-1,1]. Since p(0) = 1, and p(0.5)~0.779, the value 0 should be p(0)/p(0.5)~1.28 times more common than 0.5.
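The quoted ratio is quick to verify:

```python
import math

p0 = math.exp(-0.0 ** 2)   # p(0) = 1.0
p05 = math.exp(-0.5 ** 2)  # p(0.5) ~ 0.779
print(p0 / p05)            # ~ 1.28
```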

As usual, the preferred solutions are the elegant ones. Go!

**note:** please post your solutions in the comments, using [ python]…[ /python] tags (but without the spaces in the tags).

## 12 replies on “Small Python Challenge No. 3 – Random Selection”

The first one is interesting to me as well. I have encountered the need for such functionality many times, but have had no really smart ideas so far. :-)

By the way, an obvious use case is generating a sequence of elements using a Markov Chain… That’s just what I was doing today.

The simplest way for #1 will be to create the actual population and then just use random.choice on it:

[python]
from operator import add
from random import choice

def create_population(items, probabilities):
    return reduce(add, map(lambda ip: [ip[0]] * int(ip[1] * 100),
                           zip(items, probabilities)))

def test():
    items = ['good', 'bad', 'ugly']
    probabilities = [0.5, 0.3, 0.2]
    selections = [0, 0, 0]
    population = create_population(items, probabilities)
    num_times = 10000
    for i in xrange(num_times):
        item = choice(population)
        selections[items.index(item)] += 1
    for item in items:
        index = items.index(item)
        wanted = probabilities[index]
        got = float(selections[index]) / num_times
        print "item %s: wanted %.4f got %.4f" % (item, wanted, got)

if __name__ == "__main__":
    test()
[/python]

Oh well, so I’m bored.

I’m quite sure there’s a better way to do it, but whateva.

[python]
import random

choices = ["good", "bad", "ugly"]
probabilities = [0.5, 0.3, 0.2]

# convert probabilities to cumulative ranges in [0, 1]
# assumption: the probabilities sum to 1
ranges = [0]
for i in xrange(len(probabilities)):
    ranges.append(probabilities[i] + ranges[i])

rand_choice = random.random()
print choices[len([choice for choice in ranges if rand_choice > choice]) - 1]
[/python]

Miki:

While it is the simplest approach and good enough for some cases, it is not the most elegant solution:

1. It doesn’t yield the exact distribution for every input, in particular when a probability is smaller than 0.01. A larger n (your 100 in create_population) reduces the error, but then:

2. it becomes wasteful in memory (especially for such a large n).

I must say though, I also thought of this solution at first. That was because in my use-case the probabilities were generated from a histogram of the items, so each item had a natural number of appearances. While this avoids the exact-distribution problem, it is still not the ‘right’ solution. Consider a Markov chain generated from a large body of text. To generate a sequence of words according to this chain, I would have to create such a population for each word in the chain.

A population created this way (whether using the original numbers, or just the percentages like you did) takes too much memory:

The Oxford dictionary contains entries for about 170,000 words. If we just look at 50,000 words, then generating such a population (of 100) for each word will require about 19 MB total. (Assuming 4-byte pointers to a collection of words.)
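The arithmetic behind that figure, assuming 50,000 words each carrying a 100-slot population of 4-byte pointers:

```python
words = 50_000
population_size = 100   # entries per word's population
pointer_bytes = 4       # assumed pointer size

total_bytes = words * population_size * pointer_bytes
print(total_bytes / 2 ** 20)  # ~ 19.07 MiB
```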

I did not look at other solutions before coding my two below. Of the two, I would check applicability first and, where possible, go with the simpler but memory-hungry probchoice2() rather than probchoice(), which is more than twice as long.

The test output given is not repeatable and must be checked by hand; in a production environment I would need to pick some delta and ensure the measured probabilities are within delta of the input probabilities.

Now looking at other comments, it’s nice to see that Miki gave tests too :-)

And looking at lorq’s comment to Miki, I had thought of precision and made my bin count selectable. I’m working on a PC with a gig of ram, and thought that working to three digits of precision as the default would be OK.

The prog:

[python]
'''\
Answer to http://www.algorithm.co.il/blogs/index.php/programming/python/small-python-challenge-no-3-random-selection/

Author Donald 'Paddy' McCarthy, Feb 2008, paddy3118-at-gmail-dot-com

"You have a mapping between items and probabilities.
You need to choose each item with its probability.
For example, consider the items ['good', 'bad', 'ugly'],
with probabilities of [0.5, 0.3, 0.2] accordingly.
Your solution should choose good with probability 50%,
bad with 30% and ugly with 20%."

Sample output:

##
## PROBCHOICE
##
Trials:  100000
Target probability:  0.500,0.300,0.200
Attained probability: 0.500,0.302,0.197

##
## PROBCHOICE2
##
Trials:  100000
Target probability:  0.500,0.300,0.200
Attained probability: 0.502,0.299,0.199

>>> it = probchoice2('good bad ugly'.split(), [0.5, 0.3, 0.2])
>>> for x in range(10): print it.next()
...
bad
bad
bad
ugly
good
bad
good
good
good
bad
>>>
'''

import random

def probchoice(items, probs):
    '''\
    Splits the interval 0.0-1.0 in proportion to probs
    then finds where each random.random() choice lies
    '''
    prob_accumulator = 0
    accumulator = []
    for p in probs:
        prob_accumulator += p
        accumulator.append(prob_accumulator)
    accumZitems = zip(accumulator, items)[:-1]
    last_item = items[-1]
    while True:
        r = random.random()
        for prob_accumulator, item in accumZitems:
            if r <= prob_accumulator:
                yield item
                break
        else:
            # last range handled by else clause
            yield last_item

def probchoice2(items, probs, bincount=1000):
    '''\
    Puts items in bins in proportion to probs
    then uses random.choice() to select items.
    Larger bincount for more memory use but
    higher accuracy (on average).
    '''
    bins = []
    for item, prob in zip(items, probs):
        bins += [item] * int(bincount * prob)
    while True:
        yield random.choice(bins)

def tester(func=probchoice, items='good bad ugly'.split(),
           probs=[0.5, 0.3, 0.2],
           trials=100000
           ):
    def problist2string(probs):
        '''\
        Turns a list of probabilities into a string
        Also rounds FP values
        '''
        return ",".join('%5.3f' % (p,) for p in probs)

    from collections import defaultdict
    counter = defaultdict(int)
    it = func(items, probs)
    for dummy in xrange(trials):
        counter[it.next()] += 1
    print "\n##\n## %s\n##" % func.func_name.upper()
    print "Trials: ", trials
    print "Target probability: ", problist2string(probs)
    print "Attained probability:", problist2string(
        counter[x] / float(trials) for x in items)

if __name__ == '__main__':
    tester()
    tester(probchoice2)
[/python]

Statusreport:

Like I wrote in my email to you, you got it right, but it could be better. (Faster, and more elegant.)

Paddy3118:

If you could rewrite probchoice to be fast, short and elegant, would you consider it the better solution?

And some general notes:

1. I like the tests.

2. No one yet tried the second challenge! Still open for grabs :)

Hi lorq,

No email?

But that aside, I’m currently adopting a measure of quality that is “quality meets the spec”, in which work to exceed a spec after it is already met detracts from quality. If a spec is vague then it is important to seek out a refinement of the spec, so that you can gauge the quality of your code.

I would tend to write and test, and be wary of both speed optimisations unless the code is slow, and elegance optimisations unless someone (such as yourself) points out that it’s hard to read. I do like chasing algorithms though, so I might waste effort on a flight of fancy in that way, after already having an algorithm that would suffice :-)

As for the extended challenge:

1: I couldn’t see myself finishing it in an evening.

2: Could you go into more detail? Maybe with sample input/output?

3: I don’t think I could make much use of any result – it would probably be too complex for training others (too much time spent explaining the task, too much time needed to go through a solution).

– Paddy.


Hey Paddy:

Generally I agree with you. If you’ve got a solution that works and meets the spec, that’s enough.

However, in this case that consideration is moot: this is a challenge, and the target is elegance and having fun solving it.

Without further ado, here’s my solution to the first challenge:

[python]
import numpy
import bisect
import random

def random_select(items, probs):
    probs = numpy.cumsum(probs)
    while True:
        yield items[bisect.bisect(probs, random.random())]

def test():
    items = ['good', 'bad', 'evil']
    probs = [0.5, 0.3, 0.2]
    x = random_select(items, probs)
    result = []
    for i in xrange(10000):
        result.append(x.next())
    for x, p in zip(items, probs):
        print x, p, result.count(x) / float(len(result))

if __name__ == '__main__':
    test()
[/python]

A few notes:

1. This solution is correct for any probability distribution.

2. It is O(n) set-up and O(log n) per call to next().

3. There is a single caveat: I am assuming the probabilities sum to exactly 1, which with floating point numbers might not always be the case. To be on the safe side, the last element could be appended again with cumulative probability 1, just to make sure there are no out-of-bounds references.

Numpy’s cumsum() (cumulative sum) is a good function, quite useful. Before I knew about it I wrote one of my own. About bisect though… I don’t like the name. I think it’s one of the least aptly named modules in the Python stdlib. Still very useful to be aware of. Probably should have been named ‘sorted_find’ or something similar.
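For what it’s worth, the same cumsum-plus-bisect idea works with nothing but the standard library: itertools.accumulate plays the role of numpy.cumsum. A sketch in Python 3 syntax, with a clamp that also covers the floating-point caveat from note 3:

```python
import bisect
import random
from itertools import accumulate

def random_select(items, probs):
    # Cumulative sums, e.g. [0.5, 0.3, 0.2] -> [0.5, 0.8, 1.0]
    cum = list(accumulate(probs))
    while True:
        # bisect finds the interval containing r in O(log n);
        # min() guards against float sums falling slightly short of 1.
        r = random.random()
        yield items[min(bisect.bisect(cum, r), len(items) - 1)]

gen = random_select(['good', 'bad', 'ugly'], [0.5, 0.3, 0.2])
print([next(gen) for _ in range(5)])
```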

Regarding the second challenge:

1. It is more of a mathematical challenge. When I first thought about it, it didn’t take me long to solve (under half an hour). The solution is simpler than it seems. It may also be solved generally for any distribution continuous in [a,b], but the generalization is more complicated, and I don’t know of a better way to solve it.

2. Regarding use-cases: I came up with this challenge independently. To find uses I did a Google Code Search on Python’s random.normalvariate() and found, for example, jitter delay in communication code. Another example was particle systems. I guess you could find other usages.

3. Here is an example run of my solution to the second problem. I don’t know whether the apparent error in the middle is because my solution is bad, or because my measurement is bad. I’ll be *very happy* to see a good solution and a good proof.
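For anyone who wants to attack the second challenge before peeking, one standard technique (not necessarily the intended solution here) is rejection sampling, which only needs to evaluate p(x), not solve it. A sketch for p(x) = e^(-x^2) on [-1, 1], assuming an upper bound p_max = 1 on the interval:

```python
import math
import random

def rejection_sample(p, a, b, p_max=1.0):
    """Draw x in [a, b] with density proportional to p(x).

    Pick uniform points (x, y) in the box [a, b] x [0, p_max] and
    accept x when the point falls under the curve y = p(x).
    """
    while True:
        x = random.uniform(a, b)
        if random.uniform(0.0, p_max) < p(x):
            return x

p = lambda x: math.exp(-x ** 2)
samples = [rejection_sample(p, -1.0, 1.0) for _ in range(50_000)]
# Values near 0 should be roughly 1.28 times as common as values near 0.5.
```

Rejection sampling wastes draws when the curve sits far below p_max, but for a bell shape on a short interval the acceptance rate is high.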

Here is a sketch of how I took the measurement. It is much more complicated than the actual solution :)

[python]
import numpy
import pylab
import random_select  # the module holding my random_dist() solution

def p(x):
    return numpy.exp(-x**2)

def solve_p(y):
    x = numpy.sqrt(-numpy.log(y))
    return [-x, x]

result = [random_select.random_dist(p, solve_p) for i in xrange(10000)]
d = [int(x*100) for x in result]
h = {}
for x in d:
    h[x] = h.get(x, 0) + 1.0/len(result)
yvals = [h[int(x*100)] for x in result]
xvals = result
xvals2 = numpy.arange(-3, 3, 0.1)
avg = sum(yvals)/len(yvals)
yvals2 = [p(x)*avg for x in xvals2]
pylab.plot(xvals, yvals, "r+")
pylab.plot(xvals2, yvals2, "b-")
[/python]

Paddy:

I almost forgot… about the email, well, statusreport originally emailed his solution instead of putting it in the comments.

Indeed I had a bug in the measurement code: I used int(x*100) instead of numpy.floor(x*100), which made the count at 0 twice what it should be.

Once that bug was out of the way, I could use max instead of an average, and here is the resulting graph.

Here is a solution which maps the probabilities to ranges and an empirical proof.

I am not sure how to submit code here, so I hope this is properly formatted!

[python]
import random

D = {}

def seed(d):
    global D
    # Invert the dictionary into a sorted (probability, word) list
    l = []
    for k, v in d.items():
        l.append((v, k))
    l.sort()
    minval = 0
    for prob, word in l:
        D[(minval, minval + int(prob*100))] = word
        minval += int(prob*100)

def getword():
    r = random.randint(0, 99)
    for t in D:
        if t[0] <= r < t[1]:
            return D[t]

def proof():
    counts = {'good': 0, 'bad': 0, 'ugly': 0}
    seed({'ugly': 0.2, 'bad': 0.3, 'good': 0.5})
    # Generate random words 1000000 times and print the counts
    for x in range(1000000):
        counts[getword()] += 1
    for word, occur in counts.items():
        print word, 'occurrence was %s' % str(occur*1.0/1000000.0)
[/python]

This prints something like…

ugly occurrence was 0.199944

bad occurrence was 0.299995

good occurrence was 0.500061