<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Small Python Challenge No. 3 &#8211; Random Selection</title>
	<atom:link href="http://www.algorithm.co.il/blogs/programming/python/small-python-challenge-no-3-random-selection/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.algorithm.co.il/blogs/challenges/small-python-challenge-no-3-random-selection/</link>
	<description>Algorithms, for the heck of it</description>
	<lastBuildDate>Tue, 21 Jun 2011 21:07:08 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1.3</generator>
	<item>
		<title>By: Anand B Pillai</title>
		<link>http://www.algorithm.co.il/blogs/challenges/small-python-challenge-no-3-random-selection/#comment-95</link>
		<dc:creator>Anand B Pillai</dc:creator>
		<pubDate>Wed, 12 Mar 2008 11:57:14 +0000</pubDate>
		<guid isPermaLink="false">http://www.algorithm.co.il/blogs/index.php/programming/python/small-python-challenge-no-3-random-selection/#comment-95</guid>
		<description>Here is a solution which maps the probabilities to ranges and an empirical proof.
I am not sure how to submit code here, so I hope this is properly formatted!

[python]
import random
D= {}

def seed(d):
    global D
    # Reverse the dictionary
    l = []
    for k,v in d.items():
        l.append((v, k))

    l.sort()
    minval = 0
    for prob, word in l:
        D[(minval, minval + int(prob*100))] = word
        minval += int(prob*100)

def getword2():
    r = random.randint(0, 99)
    for t in D:
        if r in range(t[0], t[1]):
            return D[t]

def proof():
    counts = {&#039;good&#039;: 0, &#039;bad&#039;: 0, &#039;ugly&#039;: 0}
    seed({&#039;ugly&#039;: 0.2, &#039;bad&#039;: 0.3, &#039;good&#039;: 0.5})

    # Generate random words 1000000 times and print the counts
    for x in range(1000000):
        counts[getword2()] += 1

    for word, occur in counts.items():
        print word,&#039;occurence was %s percentage&#039; % str(occur*1.0/1000000.0)
[/python]
This prints something like...


ugly occurence was 0.199944 percentage
bad occurence was 0.299995 percentage
good occurence was 0.500061 percentage</description>
		<content:encoded><![CDATA[<p>Here is a solution which maps the probabilities to ranges and an empirical proof.<br />
I am not sure how to submit code here, so I hope this is properly formatted!</p>
<p>[python]<br />
import random<br />
D= {}</p>
<p>def seed(d):<br />
    global D<br />
    # Reverse the dictionary<br />
    l = []<br />
    for k,v in d.items():<br />
        l.append((v, k))</p>
<p>    l.sort()<br />
    minval = 0<br />
    for prob, word in l:<br />
        D[(minval, minval + int(prob*100))] = word<br />
        minval += int(prob*100)</p>
<p>def getword2():<br />
    r = random.randint(0, 99)<br />
    for t in D:<br />
        if r in range(t[0], t[1]):<br />
            return D[t]</p>
<p>def proof():<br />
    counts = {'good': 0, 'bad': 0, 'ugly': 0}<br />
    seed({'ugly': 0.2, 'bad': 0.3, 'good': 0.5})</p>
<p>    # Generate random words 1000000 times and print the counts<br />
    for x in range(1000000):<br />
        counts[getword2()] += 1</p>
<p>    for word, occur in counts.items():<br />
        print word,'occurence was %s percentage' % str(occur*1.0/1000000.0)<br />
[/python]<br />
This prints something like&#8230;</p>
<p>ugly occurence was 0.199944 percentage<br />
bad occurence was 0.299995 percentage<br />
good occurence was 0.500061 percentage</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: lorg</title>
		<link>http://www.algorithm.co.il/blogs/challenges/small-python-challenge-no-3-random-selection/#comment-94</link>
		<dc:creator>lorg</dc:creator>
		<pubDate>Tue, 12 Feb 2008 20:48:07 +0000</pubDate>
		<guid isPermaLink="false">http://www.algorithm.co.il/blogs/index.php/programming/python/small-python-challenge-no-3-random-selection/#comment-94</guid>
		<description>Indeed I had a bug in the measurement code. I used int(x*100) instead of numpy.floor(x*100), which gave bin 0 twice the count it should have had.
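A small sketch of the difference, written for Python 3 with only the standard library (math.floor stands in for numpy.floor, and the sample values are made up for illustration): int() truncates toward zero, so samples just below zero and just above zero share bin 0, while floor keeps them apart.

```python
import math

samples = [-0.009, -0.004, 0.003, 0.008]   # hypothetical values straddling zero

truncated = [int(x * 100) for x in samples]        # rounds toward zero
floored = [math.floor(x * 100) for x in samples]   # rounds toward minus infinity

print(truncated)   # every sample lands in bin 0
print(floored)     # negative samples land in bin -1 instead
```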
Once that bug was out of the way, I could use max instead of an average, and &lt;a href=&quot;http://www.algorithm.co.il/sitecode/random_dist2.png&quot; rel=&quot;nofollow&quot;&gt;here is the resulting graph&lt;/a&gt;.</description>
		<content:encoded><![CDATA[<p>Indeed I had a bug in the measurement code. I used int(x*100) instead of numpy.floor(x*100), which gave bin 0 twice the count it should have had.<br />
Once that bug was out of the way, I could use max instead of an average, and <a href="http://www.algorithm.co.il/sitecode/random_dist2.png" rel="nofollow">here is the resulting graph</a>.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: lorg</title>
		<link>http://www.algorithm.co.il/blogs/challenges/small-python-challenge-no-3-random-selection/#comment-93</link>
		<dc:creator>lorg</dc:creator>
		<pubDate>Tue, 12 Feb 2008 14:06:00 +0000</pubDate>
		<guid isPermaLink="false">http://www.algorithm.co.il/blogs/index.php/programming/python/small-python-challenge-no-3-random-selection/#comment-93</guid>
		<description>Paddy:
I almost forgot... about the email, well, statusreport originally emailed his solution instead of putting it in the comments.</description>
		<content:encoded><![CDATA[<p>Paddy:<br />
I almost forgot&#8230; about the email, well, statusreport originally emailed his solution instead of putting it in the comments.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: lorg</title>
		<link>http://www.algorithm.co.il/blogs/challenges/small-python-challenge-no-3-random-selection/#comment-92</link>
		<dc:creator>lorg</dc:creator>
		<pubDate>Tue, 12 Feb 2008 13:24:06 +0000</pubDate>
		<guid isPermaLink="false">http://www.algorithm.co.il/blogs/index.php/programming/python/small-python-challenge-no-3-random-selection/#comment-92</guid>
		<description>Hey Paddy:
Generally I agree with you. If you&#039;ve got a solution that works and meets the spec, that&#039;s enough.
However, in this case this consideration is moot - we are talking about a challenge where the target is elegance and having fun solving it.

Without further ado, here&#039;s my solution to the first challenge:
[python]
import numpy
import bisect
import random

def random_select(items, probs):
    probs = numpy.cumsum(probs)
    while True:
        yield items[bisect.bisect(probs, random.random())]

def test():
    items = [&#039;good&#039;, &#039;bad&#039;, &#039;evil&#039;]
    probs = [0.5, 0.3, 0.2]
    x = random_select(items, probs)
    result = []
    for i in xrange(10000):
        result.append(x.next())
    for x, p in zip(items, probs):
        print x, p, result.count(x)/float(len(result))

if __name__ == &#039;__main__&#039;:
    test()
[/python]

A few notes:
1. This solution is correct for any probability distribution.
2. It is O(n) for set-up and O(log n) for each call to next().
3. There is a single caveat: I am assuming the probabilities sum to exactly 1, which with floating-point numbers might not always be the case. To be on the safe side, the last element could be replicated with a cumulative probability of 1, just to make sure there are no out-of-bounds references.
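The caveat in note 3 can be sketched as follows (Python 3, standard library only, with itertools.accumulate standing in for numpy.cumsum). The name random_select_safe, the rng parameter and the clamping approach are illustrative assumptions, not part of the solution above. Rather than replicating the last element, the index is clamped to the last valid position, which has the same effect.

```python
import bisect
import itertools
import random

def random_select_safe(items, probs, rng=random):
    # Cumulative sums; the last one may fall slightly short of 1.0.
    cumulative = list(itertools.accumulate(probs))
    last = len(items) - 1
    while True:
        index = bisect.bisect(cumulative, rng.random())
        # Clamp so a draw above the final cumulative sum still maps to the last item.
        yield items[min(index, last)]

picker = random_select_safe(['good', 'bad', 'evil'], [0.5, 0.3, 0.2])
print([next(picker) for _ in range(5)])
```

Passing an object with a random() method as rng makes the generator easy to test with fixed draws.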

Numpy&#039;s cumsum() (cumulative sum) is a good function, quite useful. Before I knew about it I wrote one of my own. About bisect though... I don&#039;t like the name. I think it&#039;s one of the least aptly named modules in the Python stdlib. Still very useful to be aware of. Probably should have been named &#039;sorted_find&#039; or something similar.

Regarding the second challenge:
1. It is more of a mathematical challenge. When I first thought about it, it didn&#039;t take me too long to solve it (&lt;0.5hr). The solution is simpler than it seems. It may also be solved generally for any distribution continuous in [a,b], but the generalization is more complicated, and I don&#039;t know of a better way to solve it.
2. Regarding use-cases. Well, I came up with this challenge independently. To find uses I did a Google codesearch on Python&#039;s random.normalvariate() and as an example, found jitter delay in communication code. Another example was particle systems. I guess you could find other usages.
3. Here is an &lt;a href=&quot;http://www.algorithm.co.il/sitecode/random_dist.png&quot; rel=&quot;nofollow&quot;&gt;example run&lt;/a&gt; of my solution to the second problem. I don&#039;t know if the apparent error in the middle is because my solution is bad, or because my measurement is bad. I&#039;ll be *very happy* to see a good solution and a good proof.
Here is a sketch of how I took the measurement. It is much more complicated than the actual solution :)
[python]
def p(x):
    return numpy.exp(-x**2)

def solve_p(y):
    x = numpy.sqrt(-numpy.log(y))
    return [-x, x]

result = [random_select.random_dist(p, solve_p) for i in xrange(10000)]
d = [int(x*100) for x in result]
h = {}
for x in d:
    h[x] = h.get(x,0)+1.0/len(result)
yvals = [h[int(x*100)] for x in result]
xvals = result
xvals2 = numpy.arange(-3,3,0.1)
avg = sum(yvals)/len(yvals)
yvals2 = [p(x)*avg for x in xvals2]
pylab.plot(xvals, yvals, &quot;r+&quot;)
pylab.plot(xvals2, yvals2, &quot;b-&quot;)
[/python]</description>
		<content:encoded><![CDATA[<p>Hey Paddy:<br />
Generally I agree with you. If you&#8217;ve got a solution that works and meets the spec, that&#8217;s enough.<br />
However, in this case this consideration is moot &#8211; we are talking about a challenge where the target is elegance and having fun solving it.</p>
<p>Without further ado, here&#8217;s my solution to the first challenge:<br />
[python]<br />
import numpy<br />
import bisect<br />
import random</p>
<p>def random_select(items, probs):<br />
    probs = numpy.cumsum(probs)<br />
    while True:<br />
        yield items[bisect.bisect(probs, random.random())]</p>
<p>def test():<br />
    items = ['good', 'bad', 'evil']<br />
    probs = [0.5, 0.3, 0.2]<br />
    x = random_select(items, probs)<br />
    result = []<br />
    for i in xrange(10000):<br />
        result.append(x.next())<br />
    for x, p in zip(items, probs):<br />
        print x, p, result.count(x)/float(len(result))</p>
<p>if __name__ == '__main__':<br />
    test()<br />
[/python]</p>
<p>A few notes:<br />
1. This solution is correct for any probability distribution.<br />
2. It is O(n) for set-up and O(log n) for each call to next().<br />
3. There is a single caveat: I am assuming the probabilities sum to exactly 1, which with floating-point numbers might not always be the case. To be on the safe side, the last element could be replicated with a cumulative probability of 1, just to make sure there are no out-of-bounds references.</p>
<p>Numpy&#8217;s cumsum() (cumulative sum) is a good function, quite useful. Before I knew about it I wrote one of my own. About bisect though&#8230; I don&#8217;t like the name. I think it&#8217;s one of the least aptly named modules in the Python stdlib. Still very useful to be aware of. Probably should have been named &#8216;sorted_find&#8217; or something similar.</p>
<p>Regarding the second challenge:<br />
1. It is more of a mathematical challenge. When I first thought about it, it didn&#8217;t take me too long to solve it (&lt;0.5hr). The solution is simpler than it seems. It may also be solved generally for any distribution continuous in [a,b], but the generalization is more complicated, and I don't know of a better way to solve it.<br />
2. Regarding use-cases. Well, I came up with this challenge independently. To find uses I did a Google codesearch on Python's random.normalvariate() and as an example, found jitter delay in communication code. Another example was particle systems. I guess you could find other usages.<br />
3. Here is an <a href="http://www.algorithm.co.il/sitecode/random_dist.png" rel="nofollow">example run</a> of my solution to the second problem. I don&#8217;t know if the apparent error in the middle is because my solution is bad, or because my measurement is bad. I&#8217;ll be *very happy* to see a good solution and a good proof.<br />
Here is a sketch of how I took the measurement. It is much more complicated than the actual solution :)<br />
[python]<br />
def p(x):<br />
    return numpy.exp(-x**2)</p>
<p>def solve_p(y):<br />
    x = numpy.sqrt(-numpy.log(y))<br />
    return [-x, x]</p>
<p>result = [random_select.random_dist(p, solve_p) for i in xrange(10000)]<br />
d = [int(x*100) for x in result]<br />
h = {}<br />
for x in d:<br />
    h[x] = h.get(x,0)+1.0/len(result)<br />
yvals = [h[int(x*100)] for x in result]<br />
xvals = result<br />
xvals2 = numpy.arange(-3,3,0.1)<br />
avg = sum(yvals)/len(yvals)<br />
yvals2 = [p(x)*avg for x in xvals2]<br />
pylab.plot(xvals, yvals, "r+")<br />
pylab.plot(xvals2, yvals2, "b-")<br />
[/python]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Paddy3118</title>
		<link>http://www.algorithm.co.il/blogs/challenges/small-python-challenge-no-3-random-selection/#comment-91</link>
		<dc:creator>Paddy3118</dc:creator>
		<pubDate>Tue, 12 Feb 2008 08:16:18 +0000</pubDate>
		<guid isPermaLink="false">http://www.algorithm.co.il/blogs/index.php/programming/python/small-python-challenge-no-3-random-selection/#comment-91</guid>
		<description>Hi lorg,
No email?
But that aside, I&#039;m currently adopting a measure of quality that is &quot;quality meets the spec&quot;, and in which work to exceed a spec after it is already met detracts from quality. If a spec is vague then it is important to seek out a refinement to the spec so that you can gauge the quality of your code.

I would tend to write, and test, and be wary of both speed optimisations unless it was slow, and elegance optimisations unless someone (such as yourself) pointed out that it&#039;s hard to read. I do like chasing algorithms though so I might waste effort on a flight of fancy in that way, after already having an algorithm that would suffice :-)

As for the extended challenge:
1: I couldn&#039;t see myself finishing it in an evening.
2: Could you go into more detail? Maybe with sample input/output?
3: I don&#039;t think I could make much use of any result - it would probably be too complex for training others (too much time spent explaining the task, too much time needed to go through a solution).

- Paddy.</description>
		<content:encoded><![CDATA[<p>Hi lorg,<br />
No email?<br />
But that aside, I&#8217;m currently adopting a measure of quality that is &#8220;quality meets the spec&#8221;, and in which work to exceed a spec after it is already met detracts from quality. If a spec is vague then it is important to seek out a refinement to the spec so that you can gauge the quality of your code.</p>
<p>I would tend to write, and test, and be wary of both speed optimisations unless it was slow, and elegance optimisations unless someone (such as yourself) pointed out that it&#8217;s hard to read. I do like chasing algorithms though so I might waste effort on a flight of fancy in that way, after already having an algorithm that would suffice :-)</p>
<p>As for the extended challenge:<br />
1: I couldn&#8217;t see myself finishing it in an evening.<br />
2: Could you go into more detail? Maybe with sample input/output?<br />
3: I don&#8217;t think I could make much use of any result &#8211; it would probably be too complex for training others (too much time spent explaining the task, too much time needed to go through a solution).</p>
<p>- Paddy.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: lorg</title>
		<link>http://www.algorithm.co.il/blogs/challenges/small-python-challenge-no-3-random-selection/#comment-90</link>
		<dc:creator>lorg</dc:creator>
		<pubDate>Mon, 11 Feb 2008 23:30:40 +0000</pubDate>
		<guid isPermaLink="false">http://www.algorithm.co.il/blogs/index.php/programming/python/small-python-challenge-no-3-random-selection/#comment-90</guid>
		<description>Statusreport:
Like I wrote in my email to you, you got it right, but it could be better. (Faster, and more elegant.)
Paddy3118:
If you could rewrite probchoice to be fast, short and elegant, would you consider it the better solution?

And some general notes:
1. I like the tests.
2. No one yet tried the second challenge! Still open for grabs :)</description>
		<content:encoded><![CDATA[<p>Statusreport:<br />
Like I wrote in my email to you, you got it right, but it could be better. (Faster, and more elegant.)<br />
Paddy3118:<br />
If you could rewrite probchoice to be fast, short and elegant, would you consider it the better solution?</p>
<p>And some general notes:<br />
1. I like the tests.<br />
2. No one yet tried the second challenge! Still open for grabs :)</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Paddy3118</title>
		<link>http://www.algorithm.co.il/blogs/challenges/small-python-challenge-no-3-random-selection/#comment-89</link>
		<dc:creator>Paddy3118</dc:creator>
		<pubDate>Mon, 11 Feb 2008 21:11:15 +0000</pubDate>
		<guid isPermaLink="false">http://www.algorithm.co.il/blogs/index.php/programming/python/small-python-challenge-no-3-random-selection/#comment-89</guid>
		<description>I did not look at other solutions before coding my two that are below. Of my two, I would like to check for applicability then go with the simpler, memory-hungry probchoice2() if possible over the more than twice as long probchoice().

The test info given is not repeatable and must be checked by hand - In a production environment I would need to select some delta and ensure calculated probabilities are within delta of the input probs.
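A minimal sketch of such a delta check, written for Python 3 with only the standard library. The helper name check_within_delta, the default delta of 0.01 and the fixed seed are assumptions chosen for illustration, and random.choices stands in for the selection function under test.

```python
import random
from collections import Counter

def check_within_delta(choose, items, probs, trials=100000, delta=0.01):
    # choose() must return one item per call; count how often each appears.
    counts = Counter(choose() for _ in range(trials))
    attained = [counts[item] / trials for item in items]
    # Largest deviation between attained and target probabilities.
    worst = max(abs(a - p) for a, p in zip(attained, probs))
    return worst, delta >= worst

items = ['good', 'bad', 'ugly']
probs = [0.5, 0.3, 0.2]
rng = random.Random(2008)   # fixed seed makes the run repeatable
worst, ok = check_within_delta(lambda: rng.choices(items, probs)[0], items, probs)
print(worst, ok)
```

Seeding a private random.Random instance keeps the check repeatable without touching the global generator.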

Now looking at other comments, it&#039;s nice to see that Miki gave tests too :-)
And looking at lorg&#039;s comment to Miki, I had thought of precision and made my bin count selectable. I&#039;m working on a PC with a gig of RAM, and thought that working to three digits of precision as the default would be OK.

The prog:

[python]

&#039;&#039;&#039;\
Answer to http://www.algorithm.co.il/blogs/index.php/programming/python/small-python-challenge-no-3-random-selection/

Author Donald &#039;Paddy&#039; McCarthy, Feb 2008, paddy3118-at-gmail-dot-com

&quot;You have a mapping between items and probabilities.
 You need to choose each item with its probability.
 For example, consider the items [’good’, ‘bad’, ‘ugly’],
 with probabilities of [0.5, 0.3, 0.2] accordingly.
 Your solution should choose good with probability 50%,
 bad with 30% and ugly with 20%.&quot;


Sample output:
  ##
  ## PROBCHOICE
  ##
  Trials:               100000
  Target probability:   0.500,0.300,0.200
  Attained probability: 0.500,0.302,0.197

  ##
  ## PROBCHOICE2
  ##
  Trials:               100000
  Target probability:   0.500,0.300,0.200
  Attained probability: 0.502,0.299,0.199

  &gt;&gt;&gt; it = probchoice2(&#039;good bad ugly&#039;.split(), [0.5, 0.3, 0.2])
  &gt;&gt;&gt; for x in range(10): print it.next()
  ...
  bad
  bad
  bad
  ugly
  good
  bad
  good
  good
  good
  bad
  &gt;&gt;&gt;

&#039;&#039;&#039;

import random

def probchoice(items, probs):
  &#039;&#039;&#039;\
  Splits the interval 0.0-1.0 in proportion to probs
  then finds where each random.random() choice lies
  &#039;&#039;&#039;

  prob_accumulator = 0
  accumulator = []
  for p in probs:
    prob_accumulator += p
    accumulator.append(prob_accumulator)

  accumZitems = zip(accumulator, items)[:-1]
  last_item = items[-1]
  while True:
    r = random.random()
    for prob_accumulator, item in accumZitems:
      if r &lt;= prob_accumulator:
        yield item
        break
    else:
      # last range handled by else clause
      yield last_item

def probchoice2(items, probs, bincount=1000):
  &#039;&#039;&#039;\
  Puts items in bins in proportion to probs
  then uses random.choice() to select items.

  Larger bincount for more memory use but
  higher accuracy (on average).
  &#039;&#039;&#039;

  prob_accumulator = 0
  bins = []
  for item,prob in zip(items, probs):
    bins += [item]*int(bincount*prob)
  while True:
    yield random.choice(bins)


def tester(func=probchoice, items=&#039;good bad ugly&#039;.split(),
                    probs=[0.5, 0.3, 0.2],
                    trials = 100000
                    ):
  def problist2string(probs):
    &#039;&#039;&#039;\
    Turns a list of probabilities into a string
    Also rounds FP values
    &#039;&#039;&#039;
    return &quot;,&quot;.join(&#039;%5.3f&#039; % (p,) for p in probs)

  from collections import defaultdict

  counter = defaultdict(int)
  it = func(items, probs)
  for dummy in xrange(trials):
    counter[it.next()] += 1
  print &quot;\n##\n## %s\n##&quot; % func.func_name.upper()
  print &quot;Trials:              &quot;, trials
  print &quot;Target probability:  &quot;, problist2string(probs)
  print &quot;Attained probability:&quot;, problist2string(
    counter[x]/float(trials) for x in items)

if __name__ == &#039;__main__&#039;:
  tester()
  tester(probchoice2)


[/python]</description>
		<content:encoded><![CDATA[<p>I did not look at other solutions before coding my two that are below. Of my two, I would like to check for applicability then go with the simpler, memory-hungry probchoice2() if possible over the more than twice as long probchoice().</p>
<p>The test info given is not repeatable and must be checked by hand &#8211; In a production environment I would need to select some delta and ensure calculated probabilities are within delta of the input probs.</p>
<p>Now looking at other comments, it&#8217;s nice to see that Miki gave tests too :-)<br />
And looking at lorg&#8217;s comment to Miki, I had thought of precision and made my bin count selectable. I&#8217;m working on a PC with a gig of RAM, and thought that working to three digits of precision as the default would be OK.</p>
<p>The prog:</p>
<p>[python]</p>
<p>'''\<br />
Answer to <a href="http://www.algorithm.co.il/blogs/index.php/programming/python/small-python-challenge-no-3-random-selection/" rel="nofollow">http://www.algorithm.co.il/blogs/index.php/programming/python/small-python-challenge-no-3-random-selection/</a></p>
<p>Author Donald &#8216;Paddy&#8217; McCarthy, Feb 2008, paddy3118-at-gmail-dot-com</p>
<p>&#8220;You have a mapping between items and probabilities.<br />
 You need to choose each item with its probability.<br />
 For example, consider the items [’good’, ‘bad’, ‘ugly’],<br />
 with probabilities of [0.5, 0.3, 0.2] accordingly.<br />
 Your solution should choose good with probability 50%,<br />
 bad with 30% and ugly with 20%.&#8221;</p>
<p>Sample output:<br />
  ##<br />
  ## PROBCHOICE<br />
  ##<br />
  Trials:               100000<br />
  Target probability:   0.500,0.300,0.200<br />
  Attained probability: 0.500,0.302,0.197</p>
<p>  ##<br />
  ## PROBCHOICE2<br />
  ##<br />
  Trials:               100000<br />
  Target probability:   0.500,0.300,0.200<br />
  Attained probability: 0.502,0.299,0.199</p>
<p>  &gt;&gt;&gt; it = probchoice2('good bad ugly'.split(), [0.5, 0.3, 0.2])<br />
  &gt;&gt;&gt; for x in range(10): print it.next()<br />
  ...<br />
  bad<br />
  bad<br />
  bad<br />
  ugly<br />
  good<br />
  bad<br />
  good<br />
  good<br />
  good<br />
  bad<br />
  &gt;&gt;&gt;</p>
<p>'''</p>
<p>import random</p>
<p>def probchoice(items, probs):<br />
  '''\<br />
  Splits the interval 0.0-1.0 in proportion to probs<br />
  then finds where each random.random() choice lies<br />
  '''</p>
<p>  prob_accumulator = 0<br />
  accumulator = []<br />
  for p in probs:<br />
    prob_accumulator += p<br />
    accumulator.append(prob_accumulator)</p>
<p>  accumZitems = zip(accumulator, items)[:-1]<br />
  last_item = items[-1]<br />
  while True:<br />
    r = random.random()<br />
    for prob_accumulator, item in accumZitems:<br />
      if r &lt;= prob_accumulator:<br />
        yield item<br />
        break<br />
    else:<br />
      # last range handled by else clause<br />
      yield last_item</p>
<p>def probchoice2(items, probs, bincount=1000):<br />
  '''\<br />
  Puts items in bins in proportion to probs<br />
  then uses random.choice() to select items.</p>
<p>  Larger bincount for more memory use but<br />
  higher accuracy (on average).<br />
  '''</p>
<p>  prob_accumulator = 0<br />
  bins = []<br />
  for item,prob in zip(items, probs):<br />
    bins += [item]*int(bincount*prob)<br />
  while True:<br />
    yield random.choice(bins)</p>
<p>def tester(func=probchoice, items='good bad ugly'.split(),<br />
                    probs=[0.5, 0.3, 0.2],<br />
                    trials = 100000<br />
                    ):<br />
  def problist2string(probs):<br />
    '''\<br />
    Turns a list of probabilities into a string<br />
    Also rounds FP values<br />
    '''<br />
    return ",".join('%5.3f' % (p,) for p in probs)</p>
<p>  from collections import defaultdict</p>
<p>  counter = defaultdict(int)<br />
  it = func(items, probs)<br />
  for dummy in xrange(trials):<br />
    counter[it.next()] += 1<br />
  print "\n##\n## %s\n##" % func.func_name.upper()<br />
  print "Trials:              ", trials<br />
  print "Target probability:  ", problist2string(probs)<br />
  print "Attained probability:", problist2string(<br />
    counter[x]/float(trials) for x in items)</p>
<p>if __name__ == '__main__':<br />
  tester()<br />
  tester(probchoice2)</p>
<p>[/python]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: lorg</title>
		<link>http://www.algorithm.co.il/blogs/challenges/small-python-challenge-no-3-random-selection/#comment-88</link>
		<dc:creator>lorg</dc:creator>
		<pubDate>Mon, 11 Feb 2008 18:21:00 +0000</pubDate>
		<guid isPermaLink="false">http://www.algorithm.co.il/blogs/index.php/programming/python/small-python-challenge-no-3-random-selection/#comment-88</guid>
		<description>Miki:
While being the simplest and good enough for some cases, it is not the most elegant solution:
1. It doesn&#039;t yield the exact distribution for various inputs, especially when a probability is less than 0.01. This might not be a problem for large n (your 100 in create_population), but
2. it is a bit too wasteful (especially for such large n).
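The first point can be seen in a short sketch (Python 3; the helper name bin_counts is hypothetical and simply mirrors the int(p * n) step in create_population). An item whose probability is below 1/n gets no slots at all, and floating-point rounding can shave a slot off other items as well.

```python
def bin_counts(probabilities, n=100):
    # Number of population slots each item receives when scaled by n and truncated.
    return [int(p * n) for p in probabilities]

print(bin_counts([0.995, 0.005]))          # the second item gets zero slots
print(bin_counts([0.29, 0.71]))            # 0.29 * 100 is slightly under 29
print(bin_counts([0.995, 0.005], n=1000))  # a larger n recovers the rare item
```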

I must say though, I also thought of this solution at first. That was because in my use-case the probabilities were generated from a histogram of the items, so each item had a natural number of appearances. While this removes the problem of the exact distribution, it is still not the &#039;right&#039; solution. Consider a Markov chain generated from a large body of text. Now, I would like to generate a sequence of words according to this chain. I would have to create such a population for each word in the chain.
A population created this way (whether using the original numbers, or just the percentages like you did) is still too much:
The Oxford dictionary contains entries for about 170,000 words. If we just want to look at 50,000 words, then generating such a population (of 100) for each word will require about 19 MB in total (assuming 4-byte pointers to a collection of words: 50,000 x 100 x 4 bytes = 20,000,000 bytes).</description>
		<content:encoded><![CDATA[<p>Miki:<br />
While being the simplest and good enough for some cases, it is not the most elegant solution:<br />
1. It doesn&#8217;t yield the exact distribution for various inputs, especially when a probability is less than 0.01. This might not be a problem for large n (your 100 in create_population), but<br />
2. it is a bit too wasteful (especially for such large n).</p>
<p>I must say though, I also thought of this solution at first. That was because in my use-case the probabilities were generated from a histogram of the items, so each item had a natural number of appearances. While this removes the problem of the exact distribution, it is still not the &#8216;right&#8217; solution. Consider a Markov chain generated from a large body of text. Now, I would like to generate a sequence of words according to this chain. I would have to create such a population for each word in the chain.<br />
A population created this way (whether using the original numbers, or just the percentages like you did) is still too much:<br />
The Oxford dictionary contains entries for about 170,000 words. If we just want to look at 50,000 words, then generating such a population (of 100) for each word will require about 19 MB in total (assuming 4-byte pointers to a collection of words: 50,000 x 100 x 4 bytes = 20,000,000 bytes).</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: StatusReport</title>
		<link>http://www.algorithm.co.il/blogs/challenges/small-python-challenge-no-3-random-selection/#comment-87</link>
		<dc:creator>StatusReport</dc:creator>
		<pubDate>Mon, 11 Feb 2008 18:00:55 +0000</pubDate>
		<guid isPermaLink="false">http://www.algorithm.co.il/blogs/index.php/programming/python/small-python-challenge-no-3-random-selection/#comment-87</guid>
		<description>Oh well, so I&#039;m bored.

I&#039;m quite sure there&#039;s a better way to do it, but whateva.

[python]
import random

choices = [&quot;good&quot;, &quot;bad&quot;, &quot;ugly&quot;]
probabilities = [0.5, 0.3, 0.2]
ranges = [0]

# convert probabilities to ranges from 0-1
# assumption: the probabilities sum to 1
for i in xrange(len(probabilities)):
    ranges.append(probabilities[i] + ranges[i])

rand_choice = random.random()

print choices[len([choice for choice in ranges if rand_choice &gt; choice]) - 1]
[/python]</description>
		<content:encoded><![CDATA[<p>Oh well, so I&#8217;m bored.</p>
<p>I&#8217;m quite sure there&#8217;s a better way to do it, but whateva.</p>
<p>[python]<br />
import random</p>
<p>choices = ["good", "bad", "ugly"]<br />
probabilities = [0.5, 0.3, 0.2]<br />
ranges = [0]</p>
<p># convert probabilities to ranges from 0-1<br />
# assumption: the probabilities sum to 1<br />
for i in xrange(len(probabilities)):<br />
    ranges.append(probabilities[i] + ranges[i])</p>
<p>rand_choice = random.random()</p>
<p>print choices[len([choice for choice in ranges if rand_choice &gt; choice]) - 1]<br />
[/python]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Miki</title>
		<link>http://www.algorithm.co.il/blogs/challenges/small-python-challenge-no-3-random-selection/#comment-86</link>
		<dc:creator>Miki</dc:creator>
		<pubDate>Mon, 11 Feb 2008 16:27:58 +0000</pubDate>
		<guid isPermaLink="false">http://www.algorithm.co.il/blogs/index.php/programming/python/small-python-challenge-no-3-random-selection/#comment-86</guid>
		<description>The simplest way for #1 will be to create the actual population and then just use random.choice on it:

[python]
from operator import add
from random import choice

def create_population(items, probabilities):
    return reduce(add, map(lambda ip: [ip[0]] * (int(ip[1] * 100)), \
                           zip(items, probabilities)))

def test():
    from collections import defaultdict
    items = [&#039;good&#039;, &#039;bad&#039;, &#039;ugly&#039;]
    probabilities = [0.5, 0.3, 0.2]

    selections = [0, 0, 0]
    population = create_population(items, probabilities)
    num_times = 10000
    for i in xrange(num_times):
        item = choice(population)
        selections[items.index(item)] += 1

    for item in items:
        index = items.index(item)
        wanted = probabilities[index]
        got = float(selections[index]) / num_times
        print &quot;item %s: wanted %.4f got %.4f&quot; % (item, wanted, got)

if __name__ == &quot;__main__&quot;:
    test()
[/python]</description>
		<content:encoded><![CDATA[<p>The simplest way for #1 will be to create the actual population and then just use random.choice on it:</p>
<p>[python]<br />
from operator import add<br />
from random import choice</p>
<p>def create_population(items, probabilities):<br />
    return reduce(add, map(lambda ip: [ip[0]] * (int(ip[1] * 100)), \<br />
                           zip(items, probabilities)))</p>
<p>def test():<br />
    from collections import defaultdict<br />
    items = ['good', 'bad', 'ugly']<br />
    probabilities = [0.5, 0.3, 0.2]</p>
<p>    selections = [0, 0, 0]<br />
    population = create_population(items, probabilities)<br />
    num_times = 10000<br />
    for i in xrange(num_times):<br />
        item = choice(population)<br />
        selections[items.index(item)] += 1</p>
<p>    for item in items:<br />
        index = items.index(item)<br />
        wanted = probabilities[index]<br />
        got = float(selections[index]) / num_times<br />
        print "item %s: wanted %.4f got %.4f" % (item, wanted, got)</p>
<p>if __name__ == "__main__":<br />
    test()<br />
[/python]</p>
]]></content:encoded>
	</item>
</channel>
</rss>

