<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Algorithm.co.il &#187; Design</title>
	<atom:link href="http://www.algorithm.co.il/blogs/category/programming/design/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.algorithm.co.il/blogs</link>
	<description>Algorithms, for the heck of it</description>
	<lastBuildDate>Tue, 21 Jun 2011 20:37:28 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1.3</generator>
		<item>
		<title>10 Python Optimization Tips and Issues</title>
		<link>http://www.algorithm.co.il/blogs/computer-science/10-python-optimization-tips-and-issues/</link>
		<comments>http://www.algorithm.co.il/blogs/computer-science/10-python-optimization-tips-and-issues/#comments</comments>
		<pubDate>Mon, 21 Sep 2009 21:37:38 +0000</pubDate>
		<dc:creator>lorg</dc:creator>
				<category><![CDATA[computer science]]></category>
		<category><![CDATA[Design]]></category>
		<category><![CDATA[Optimization]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[memory]]></category>
		<category><![CDATA[memory usage]]></category>
		<category><![CDATA[optimization]]></category>

		<guid isPermaLink="false">http://www.algorithm.co.il/blogs/?p=373</guid>
		<description><![CDATA[Following my previous post on Optimizing Javascript, I thought I&#8217;d write a similar post regarding Python optimization. Before going on to the more interesting stuff, there are a few issues that need to be addressed: 0. Basics Know the basics &#8230; <a href="http://www.algorithm.co.il/blogs/computer-science/10-python-optimization-tips-and-issues/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Following my previous post on <a href="http://www.algorithm.co.il/blogs/index.php/programming/javascript/javascript-optimization-tricks/">Optimizing Javascript</a>, I thought I&#8217;d write a similar post regarding Python optimization.<br />
<span id="more-373"></span><br />
Before going on to the more interesting stuff, there are a few issues that need to be addressed:</p>
<h4>0. Basics</h4>
<p>Know the <a href="http://wiki.python.org/moin/PythonSpeed/PerformanceTips">basics</a> &#8211; especially profiling!<br />
Just by looking at the profiling output, you can tell where the computing time goes. To get that information, I like to sort on cumulative time (i.e., the time taken by a given function and all functions called from it, over all of its calls).</p>
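<p>As a minimal sketch of that workflow, the snippet below profiles a toy workload with the standard cProfile and pstats modules and sorts the report by cumulative time. The functions slow_square and work are hypothetical stand-ins for real code.</p>

```python
import cProfile
import io
import pstats


def slow_square(n):
    # Deliberately wasteful: adds n to itself n times.
    return sum(n for _ in range(n))


def work():
    # Hypothetical workload standing in for the code being optimized.
    return [slow_square(i) for i in range(200)]


profiler = cProfile.Profile()
profiler.enable()
work()
profiler.disable()

# Collect the report in a string, sorted by cumulative time.
report = io.StringIO()
stats = pstats.Stats(profiler, stream=report)
stats.sort_stats("cumulative").print_stats(10)
print(report.getvalue())
```

<p>Functions near the top of such a report are the ones whose total cost, including everything they call, is highest.</p>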
<h4>0.5. Knowing your goal, and your enemy</h4>
<p>The kind of optimizations you do, and how far you&#8217;re willing to go, depend on your code&#8217;s users. If you&#8217;re writing batch processing software, your required running time might be a minute, an hour, or a day. So far, I&#8217;ve had to optimize cases ranging from weeks down to minutes for batch processing, and from seconds down to milliseconds for web-application UI.<br />
Your timing target should hold for typical input, and probably for your biggest probable input as well.</p>
<p>Create some simple benchmark you can test your code against. It&#8217;s important that your benchmark be typical &#8216;complexity-wise&#8217;, but smaller in size, so that running it and getting profiling results takes no more than a few seconds. You may even want multiple benchmarks, each one for a different size. That way, once you are more sure of yourself, you can run your code against the larger benchmark. If your benchmarks are real inputs &#8211; all the better.</p>
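<p>A rough sketch of such a benchmark harness follows. The process function and the synthetic data are placeholders I made up for illustration; real inputs are better when you have them.</p>

```python
import time


def process(items):
    # Placeholder for the code under test.
    return sorted(set(items))


def run_benchmark(size, repeats=3):
    # Synthetic input of the requested size (an assumption; use real data if available).
    data = [(i * 7919) % size for i in range(size)]
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        process(data)
        best = min(best, time.perf_counter() - start)
    return best


# Start with the small size while iterating, move to larger ones later.
for size in (1000, 10000, 100000):
    print(size, "items:", "%.4f s" % run_benchmark(size))
```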
<h4>1. Python vs. C and similar considerations</h4>
<p>In my line of work, I usually do research oriented development. That means that it&#8217;s harder to know upfront where the bottlenecks will be. As a result, the prevailing attitude is usually &#8220;let&#8217;s write it in Python, and later, when the need arises, convert the critical code to C&#8221;.<br />
So far, I haven&#8217;t had the chance to do that. Usually what happens is we write the code, it works well enough, and we figure that the flexibility of writing it in Python is more important than the Python to C conversion gains. Also, Python is not always the bottleneck &#8211; sometimes it&#8217;s a database, or some 3rd party API.<br />
Usually, adding &#8220;import psyco&#8221; and changing the code to allow parallel processing is cheaper and simpler than the conversion to C.</p>
<h4>2. The small time-eater</h4>
<p>A common problem is when a relatively trivial function is taking a lot of cumulative time. That&#8217;s usually a sign you&#8217;re doing something wrong. I had this issue when I used my <a href="http://www.algorithm.co.il/blogs/index.php/programming/python/various-small-python-helpers/">symbolic constants</a> for a new project. Consider the following:</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #ff7700;font-weight:bold;">def</span> SymbolInt<span style="color: black;">&#40;</span>value, name<span style="color: black;">&#41;</span>:
    <span style="color: #ff7700;font-weight:bold;">class</span> _SymbolInt<span style="color: black;">&#40;</span><span style="color: #008000;">int</span><span style="color: black;">&#41;</span>:
        <span style="color: #ff7700;font-weight:bold;">def</span> <span style="color: #0000cd;">__str__</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span><span style="color: black;">&#41;</span>:
            <span style="color: #ff7700;font-weight:bold;">return</span> name
        <span style="color: #ff7700;font-weight:bold;">def</span> <span style="color: #0000cd;">__repr__</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span><span style="color: black;">&#41;</span>:
            <span style="color: #ff7700;font-weight:bold;">return</span> <span style="color: #483d8b;">'SymbolInt(%d, &quot;%s&quot;)'</span> <span style="color: #66cc66;">%</span> <span style="color: black;">&#40;</span>value, name<span style="color: black;">&#41;</span>
        <span style="color: #ff7700;font-weight:bold;">def</span> <span style="color: #0000cd;">__eq__</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span>, other<span style="color: black;">&#41;</span>:
            <span style="color: #ff7700;font-weight:bold;">if</span> <span style="color: #008000;">isinstance</span><span style="color: black;">&#40;</span>other, <span style="color: #008000;">str</span><span style="color: black;">&#41;</span>:
                other = other.<span style="color: black;">lower</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
            <span style="color: #ff7700;font-weight:bold;">return</span> <span style="color: #008000;">int</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span><span style="color: black;">&#41;</span>==other <span style="color: #ff7700;font-weight:bold;">or</span> name.<span style="color: black;">lower</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span> == other
        <span style="color: #ff7700;font-weight:bold;">def</span> <span style="color: #0000cd;">__ne__</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span>, other<span style="color: black;">&#41;</span>:
            <span style="color: #ff7700;font-weight:bold;">return</span> <span style="color: #ff7700;font-weight:bold;">not</span> <span style="color: #008000;">self</span> == other
    <span style="color: #ff7700;font-weight:bold;">return</span> _SymbolInt<span style="color: black;">&#40;</span>value<span style="color: black;">&#41;</span></pre></div></div>

<p>This one is very nice for interactive interfaces. However, in the new project, I found out that __eq__ was taking *a lot* of time. Way more than it should, even when I wasn&#8217;t comparing SymbolInt-s to strings!<br />
It turned out that &#8216;or name.lower() == other&#8217; was very bad speed-wise: it runs every time the integer comparison fails, whether or not a string is involved. So for that project, I removed this subcondition, and voila! My code was fast!</p>
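<p>For comparison, here is one possible variant that keeps the string comparison but only pays for it when the other operand really is a string. This is a sketch, not the fix described above (which simply removed the subcondition); the precomputed lowered_name and the explicit __hash__ line are my additions, the latter needed on Python 3.</p>

```python
def SymbolInt(value, name):
    lowered_name = name.lower()  # computed once, not on every comparison

    class _SymbolInt(int):
        def __str__(self):
            return name

        def __repr__(self):
            return 'SymbolInt(%d, "%s")' % (value, name)

        def __eq__(self, other):
            # Only strings take the case-insensitive name comparison.
            if isinstance(other, str):
                return lowered_name == other.lower()
            return int(self) == other

        def __ne__(self, other):
            return not self == other

        # Defining __eq__ disables the inherited hash on Python 3.
        __hash__ = int.__hash__

    return _SymbolInt(value)


RED = SymbolInt(1, "RED")
print(RED == 1, RED == "red", RED == 2)
```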
<h4>3. The algorithm is critical</h4>
<p>In many cases I&#8217;ve worked on, the greatest reductions in running time were due to algorithm changes. That means that playing with issues such as variable lookups and so on should come after you&#8217;re mostly settled on your algorithm. The latest example that I can think of is the <a href="http://www.algorithm.co.il/blogs/index.php/computer-science/small-python-challenge-no-4-counting-sets/">set counting problem</a>, where using <a href="http://www.algorithm.co.il/blogs/index.php/programming/python/my-solution-to-the-counting-sets-challenge/">my solution</a> got me down from two weeks to 20 something minutes on my real input.<br />
Later I did some simpler optimizations that chopped off a few more minutes.</p>
<h4>4. Avoiding loops</h4>
<p>That one is easy. Everyone knows you should avoid loops, especially nested ones. Still, there are some cases where your code just has to have these loops &#8211; because that&#8217;s the essence of what your code is doing.</p>
<p>To make loop avoidance possible, specifically for cartesian-product kinds of loops, consider refactoring your code to use set intersections and unions. As a simple illustration, instead of this:</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #ff7700;font-weight:bold;">for</span> x <span style="color: #ff7700;font-weight:bold;">in</span> a:
    <span style="color: #ff7700;font-weight:bold;">for</span> y <span style="color: #ff7700;font-weight:bold;">in</span> b:
        <span style="color: #ff7700;font-weight:bold;">if</span> x == y:
            <span style="color: #ff7700;font-weight:bold;">yield</span> <span style="color: black;">&#40;</span>x,y<span style="color: black;">&#41;</span></pre></div></div>

<p>do this:</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #ff7700;font-weight:bold;">return</span> <span style="color: #008000;">set</span><span style="color: black;">&#40;</span>a<span style="color: black;">&#41;</span> <span style="color: #66cc66;">&amp;</span> <span style="color: #008000;">set</span><span style="color: black;">&#40;</span>b<span style="color: black;">&#41;</span></pre></div></div>

<p>Sometimes, applying this change doesn&#8217;t quite fit your algorithm. In that case, try to change your algorithm to accommodate it. For example, the set-based version might yield less accurate results. If so, aim for returning more than you need, and then do a second pass to filter out the bad ones. The time gains from avoiding the extra loops should still be worth it.</p>
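<p>A runnable version of the two fragments, as a sketch: the function names and sample lists are made up for illustration. Note that the set-based version returns the common items rather than pairs, and drops duplicates and ordering, which is the kind of mismatch you may need to accommodate.</p>

```python
def common_pairs_nested(a, b):
    # Compares every x with every y: len(a) * len(b) comparisons.
    for x in a:
        for y in b:
            if x == y:
                yield (x, y)


def common_items(a, b):
    # Hash-based: roughly len(a) + len(b) work for hashable items.
    return set(a).intersection(b)


a = [1, 2, 3, 4, 5]
b = [4, 5, 6, 7]
print(list(common_pairs_nested(a, b)))
print(sorted(common_items(a, b)))
```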
<p>(Note: this is similar to doing your computations in your database queries instead of in your code. Similar ideas apply.)</p>
<h4>5. Lookups</h4>
<p>If you spend time looking for something, use a dict. If that is not feasible, use any other data-structure that fits your problem. For example, let&#8217;s say you&#8217;re looking for given strings in a lot of files. You can build a small index beforehand, and instead of looking at the files each time, just look at this index.</p>
<p>(Note: this is similar to creating an index on the database column you are searching on.)</p>
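<p>The file-index idea could look roughly like this. It is a minimal sketch that indexes whole words only, and the documents dict is a hypothetical stand-in for reading real files from disk.</p>

```python
from collections import defaultdict


def build_index(documents):
    """Map each word to the set of document names that contain it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in text.split():
            index[word].add(name)
    return index


# Stand-in for file contents; a real version would read each file once.
documents = {
    "a.txt": "the quick brown fox",
    "b.txt": "the lazy dog",
    "c.txt": "quick thinking",
}
index = build_index(documents)

# Each lookup is now a dict access instead of a scan over all files.
print(sorted(index["quick"]))
print(sorted(index.get("cat", ())))
```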
<h4>6. Memory</h4>
<p>When dealing with large inputs, you&#8217;ll usually want to reduce your memory requirements. Consider an algorithm that requires O(n) memory, for n-sized inputs. All you need is a factor of 4, and 500 megs of input, and your code will choke on many current machines.<br />
Also, I&#8217;ve found that writing code in such a way as to use drastically less memory will sometimes force me to write it more time-efficiently as well.</p>
<p>There are a few techniques for dealing with the memory issue. The central idea is to keep only as much of your data available as you need at any given time.</p>
<h4>7. Generators</h4>
<p>Generator expressions are usually preferable to list comprehensions. Similarly, consider replacing this kind of function:</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #ff7700;font-weight:bold;">def</span> myfunc<span style="color: black;">&#40;</span>some_input<span style="color: black;">&#41;</span>:
    ...
    <span style="color: black;">result</span> = <span style="color: black;">&#91;</span><span style="color: black;">&#93;</span>
    <span style="color: #ff7700;font-weight:bold;">for</span> bla <span style="color: #ff7700;font-weight:bold;">in</span> foo:
        ...
        <span style="color: black;">result</span>.<span style="color: black;">append</span><span style="color: black;">&#40;</span>bar<span style="color: black;">&#41;</span>
    <span style="color: #ff7700;font-weight:bold;">return</span> result</pre></div></div>

<p>with the following idiom:</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #ff7700;font-weight:bold;">def</span> myfunc<span style="color: black;">&#40;</span>some_input<span style="color: black;">&#41;</span>:
    ...
    <span style="color: #ff7700;font-weight:bold;">for</span> bla <span style="color: #ff7700;font-weight:bold;">in</span> foo:
        ...
        <span style="color: #ff7700;font-weight:bold;">yield</span> bar</pre></div></div>

<p>This has the added advantage of simplifying myfunc, as its state is kept for you. On really big inputs and outputs, this one could save you from keeping all of your output in memory.<br />
If you are not familiar with generators, I suggest reading <a href="http://www.dabeaz.com/generators/index.html">David Beazley&#8217;s presentation</a> on the subject. It&#8217;s an excellent read, regardless of optimizations.</p>
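<p>To make the memory difference concrete, here is a small runnable sketch of both idioms. The functions squares_list and squares_gen are hypothetical stand-ins for myfunc, and the size comparison relies on CPython&#8217;s sys.getsizeof, which counts only the container itself.</p>

```python
import sys


def squares_list(n):
    # Builds the whole result in memory before returning it.
    result = []
    for i in range(n):
        result.append(i * i)
    return result


def squares_gen(n):
    # Produces one value at a time; nothing is accumulated.
    for i in range(n):
        yield i * i


n = 100000
list_size = sys.getsizeof(squares_list(n))
gen_size = sys.getsizeof(squares_gen(n))
print("list container bytes:", list_size)
print("generator bytes:", gen_size)
print("same total:", sum(squares_list(n)) == sum(squares_gen(n)))
```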
<h4>8. Outputs</h4>
<p>If your goal is to generate output, dump it to a file as soon as possible. This is made simple by the previous idiom:</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #ff7700;font-weight:bold;">for</span> bar <span style="color: #ff7700;font-weight:bold;">in</span> myfunc<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>:
    <span style="color: #808080; font-style: italic;">#process bar</span>
    ...
    <span style="color: black;">dump</span><span style="color: black;">&#40;</span>foobar<span style="color: black;">&#41;</span></pre></div></div>

<p>Just make sure that dump doesn&#8217;t keep your data around for too long.<br />
For example, I once had to insert a lot of data into a database. After I finished processing each record, I would insert it. The bottleneck was the database. I tried flushing only after several inserts (which meant inserts in chunks of N for various N), until I was introduced to the solution: <a href="http://www.algorithm.co.il/blogs/index.php/programming/bulk-inserts-ftw/">bulk inserts</a>.<br />
Then, my extraction script just dumped to a text file, which was lightning fast, and later I did the bulk insert.</p>
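<p>The dump-to-a-text-file step might be sketched as follows. The record layout, the tab-separated format and the temporary path are assumptions for illustration; the bulk-load step itself depends on your database and is not shown.</p>

```python
import os
import tempfile


def records():
    # Hypothetical generator of processed records.
    for i in range(5):
        yield {"id": i, "value": i * i}


def dump_records(path, rows):
    # Write each record as soon as it is produced; nothing is kept around.
    with open(path, "w") as out:
        for row in rows:
            out.write("%d\t%d\n" % (row["id"], row["value"]))


path = os.path.join(tempfile.mkdtemp(), "records.tsv")
dump_records(path, records())
print("wrote", path)
```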
<h4>9. Summing up</h4>
<p>Do</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #008000;">sum</span><span style="color: black;">&#40;</span>x <span style="color: #ff7700;font-weight:bold;">for</span> x <span style="color: #ff7700;font-weight:bold;">in</span> some_generator<span style="color: black;">&#41;</span></pre></div></div>

<p>Instead of</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #ff7700;font-weight:bold;">for</span> x <span style="color: #ff7700;font-weight:bold;">in</span> some_list:
    my_sum += x</pre></div></div>

<p>Kidding!<br />
Use your profiler, your head, psyco, and more experienced advice in the best order that suits you. As I&#8217;ve come to learn, getting advice from friends is an excellent way to avoid bashing your head against some mad bugger&#8217;s O(n<sup>2</sup>) wall.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.algorithm.co.il/blogs/computer-science/10-python-optimization-tips-and-issues/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>A Simple Race-Condition</title>
		<link>http://www.algorithm.co.il/blogs/programming/a-simple-race-condition/</link>
		<comments>http://www.algorithm.co.il/blogs/programming/a-simple-race-condition/#comments</comments>
		<pubDate>Sun, 28 Jun 2009 09:38:18 +0000</pubDate>
		<dc:creator>lorg</dc:creator>
				<category><![CDATA[Databases]]></category>
		<category><![CDATA[Design]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Bug]]></category>
		<category><![CDATA[Caching]]></category>
		<category><![CDATA[Multi-Threading]]></category>
		<category><![CDATA[Race-Condition]]></category>
		<category><![CDATA[web applications]]></category>

		<guid isPermaLink="false">http://www.algorithm.co.il/blogs/?p=263</guid>
		<description><![CDATA[Lately, I&#8217;ve mostly been working on my startup. It&#8217;s a web-application, and one of the first things I&#8217;ve written was a cache mechanism for some lengthy operations. Yesterday, I found a classic race-condition in that module. I won&#8217;t present the &#8230; <a href="http://www.algorithm.co.il/blogs/programming/a-simple-race-condition/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Lately, I&#8217;ve mostly been working on my startup. It&#8217;s a web-application, and one of the first things I wrote was a cache mechanism for some lengthy operations. Yesterday, I found a classic race-condition in that module. I won&#8217;t present the code itself here; instead, I&#8217;ll try to present the essence of the bug.</p>
<p>Consider a web application, required to do lengthy operations from time to time, either IO bound, or CPU bound. To save time, and maybe also bandwidth or CPU time, we are interested in caching the results of these operations.<br />
So, let&#8217;s say we create some database table, that has the following fields:</p>
<ul>
<li><strong>Unique key:</strong> input</li>
<li>output</li>
<li>use_count *</li>
<li>Last use date *</li>
</ul>
<p>I&#8217;ve marked with an asterisk the fields that are optional, and are related to managing the size of the cache. These are not relevant at the moment.<br />
Here is what code using the cache would look like (in pseudocode form):</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;">result = cache.<span style="color: #dc143c;">select</span><span style="color: black;">&#40;</span><span style="color: #008000;">input</span><span style="color: black;">&#41;</span>
<span style="color: #ff7700;font-weight:bold;">if</span> result:
    <span style="color: #ff7700;font-weight:bold;">return</span> result
result = compute<span style="color: black;">&#40;</span><span style="color: #008000;">input</span><span style="color: black;">&#41;</span>
cache.<span style="color: black;">insert</span><span style="color: black;">&#40;</span><span style="color: #008000;">input</span>, result<span style="color: black;">&#41;</span>
<span style="color: #ff7700;font-weight:bold;">return</span> result</pre></div></div>

<p>This code will work well under normal circumstances. However, in a multithreaded environment, or any environment where access to the database is shared, there is a race-condition: What happens if there are two requests for the same input at about the same time?<br />
Here&#8217;s a simple schedule that exposes the bug:</p>
<p>Thread1: result = cache.select(input). result is None<br />
Thread1: result = compute(input)<br />
Thread2: result = cache.select(input). result is None<br />
Thread2: result = compute(input)<br />
Thread2: cache.insert(input, result)<br />
Thread1: cache.insert(input, result) &#8211; exception &#8211; duplicate records for the unique key input!</p>
<p>This is a classic race condition. And here&#8217;s a small challenge: What&#8217;s the best way to solve it?</p>
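<p>For readers who want to see the failure happen, the sketch below replays that schedule by hand against an in-memory SQLite table, then shows one possible remedy (certainly not the only one, and not necessarily the best): treat a duplicate insert as harmless. The table layout, the compute stand-in and the helper names are all assumptions made for illustration.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cache (input TEXT PRIMARY KEY, output TEXT)")


def compute(value):
    # Stand-in for the lengthy operation.
    return value.upper()


def cache_select(value):
    row = conn.execute(
        "SELECT output FROM cache WHERE input = ?", (value,)
    ).fetchone()
    return row[0] if row else None


def cache_insert(value, result):
    conn.execute("INSERT INTO cache (input, output) VALUES (?, ?)", (value, result))


# Replay the schedule: both "threads" miss the cache before either inserts.
miss1 = cache_select("abc")        # Thread1: None
result1 = compute("abc")           # Thread1
miss2 = cache_select("abc")        # Thread2: None
result2 = compute("abc")           # Thread2
cache_insert("abc", result2)       # Thread2
try:
    cache_insert("abc", result1)   # Thread1: violates the unique key
    duplicate_error = None
except sqlite3.IntegrityError as exc:
    duplicate_error = exc
print("second insert failed:", duplicate_error)


def cache_insert_tolerant(value, result):
    # Assumes recomputing is harmless: if another writer got there first,
    # the cache already holds an equivalent result.
    try:
        cache_insert(value, result)
    except sqlite3.IntegrityError:
        pass


cache_insert_tolerant("abc", result1)
print(cache_select("abc"))
```

<p>This still lets both requests do the lengthy computation; avoiding the duplicate work as well needs some form of locking or an in-progress marker.</p>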
]]></content:encoded>
			<wfw:commentRss>http://www.algorithm.co.il/blogs/programming/a-simple-race-condition/feed/</wfw:commentRss>
		<slash:comments>14</slash:comments>
		</item>
		<item>
		<title>Actual Data Always Needs To Be Explicit</title>
		<link>http://www.algorithm.co.il/blogs/programming/actual-data-always-needs-to-be-explicit/</link>
		<comments>http://www.algorithm.co.il/blogs/programming/actual-data-always-needs-to-be-explicit/#comments</comments>
		<pubDate>Fri, 10 Apr 2009 21:01:10 +0000</pubDate>
		<dc:creator>lorg</dc:creator>
				<category><![CDATA[Databases]]></category>
		<category><![CDATA[Design]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[startup]]></category>
		<category><![CDATA[design patterns]]></category>

		<guid isPermaLink="false">http://www.algorithm.co.il/blogs/?p=247</guid>
		<description><![CDATA[This might seem obvious, but it wasn&#8217;t to me at first. Now I consider it a database design rule of thumb, or even a pattern. I&#8217;ll explain using an example. Consider an application where you also need automatic tagging of &#8230; <a href="http://www.algorithm.co.il/blogs/programming/actual-data-always-needs-to-be-explicit/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>This might seem obvious, but it wasn&#8217;t to me at first. Now I consider it a database design rule of thumb, or even a pattern.<br />
I&#8217;ll explain using an example. Consider an application where you also need automatic tagging of text. (As in generating keywords.) So you&#8217;ll have a table for objects that have textual fields (for instance, blog posts), and a table for tags.<br />
Now, you would need a many-to-many mapping between these two tables. Various ORMs might do this automatically for you, or you might add a PostTag table yourself, with foreign keys to the other tables.</p>
<p>You think this might be enough, as your smart tagging algorithm can add tags and attach tags to blog posts. If you want to change it manually, then no problem, you just modify any of these tables. For example, if the algorithm makes a mistake, you just erase the mapping and/or the tag.</p>
<p>The problems start when you want to run the algorithm more than once.<br />
First, the algorithm must not create duplicates on the second run. This is very easy to implement and doesn&#8217;t require any change to the DB. Now, let&#8217;s say that a taggable object (our blog post) has changed, and we want to update the tags accordingly. We might want to erase all the mappings we created for this object. No problem, also easy to do.</p>
<p>What about manual changes? Should these be erased as well? Probably not, at least not without alerting their creator. So we need to record the source of these mappings in an extra column of the mapping table, and use it to mark manually and algorithmically generated mappings differently.</p>
<p>How about deletions? What if the first time around, the algorithm made a mistake, and added a wrong tag, which was manually removed? Running the algorithm again will cause the tag to be added again. We need some way to mark <strong>&#8220;negative tags&#8221;</strong>, which are also pieces of information. The easiest way I found of doing this is adding a boolean &#8220;valid&#8221; column to the mapping table.</p>
<p>It&#8217;s important to note that this also applies to all mapping types and not just to many-to-many. So even when you don&#8217;t naturally need a separate table for a mapping, you should consider adding one, if the mapping is part of the actual data you keep. Also, if you need to keep extra data about the mapping itself, for example &#8220;relationship type&#8221; in a social network or &#8220;tag weight&#8221; as in our example, you would already have a separate table anyway.</p>
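<p>Here is a minimal sketch of such a mapping table, using an in-memory SQLite database. The table and column names (post, tag, post_tag, source, valid) and the helper functions are my own choices for illustration, not a prescribed schema.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE post (id INTEGER PRIMARY KEY, body TEXT);
CREATE TABLE tag (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE post_tag (
    post_id INTEGER REFERENCES post(id),
    tag_id INTEGER REFERENCES tag(id),
    source TEXT NOT NULL,              -- 'manual' or 'algorithm'
    valid INTEGER NOT NULL DEFAULT 1,  -- 0 marks a negative tag
    PRIMARY KEY (post_id, tag_id)
);
""")


def auto_tag(post_id, tag_id):
    # INSERT OR IGNORE leaves any existing row alone, including negative tags.
    conn.execute(
        "INSERT OR IGNORE INTO post_tag (post_id, tag_id, source) "
        "VALUES (?, ?, 'algorithm')",
        (post_id, tag_id),
    )


def reject_tag(post_id, tag_id):
    # A manual removal keeps the row and flips it to a negative tag.
    conn.execute(
        "UPDATE post_tag SET valid = 0, source = 'manual' "
        "WHERE post_id = ? AND tag_id = ?",
        (post_id, tag_id),
    )


def visible_tags(post_id):
    rows = conn.execute(
        "SELECT tag.name FROM post_tag JOIN tag ON tag.id = post_tag.tag_id "
        "WHERE post_tag.post_id = ? AND post_tag.valid = 1 ORDER BY tag.name",
        (post_id,),
    )
    return [name for (name,) in rows]


conn.execute("INSERT INTO post (id, body) VALUES (1, 'a post about python')")
conn.execute("INSERT INTO tag (id, name) VALUES (1, 'python'), (2, 'cooking')")
auto_tag(1, 1)
auto_tag(1, 2)      # the algorithm's mistake
reject_tag(1, 2)    # manual removal, recorded as a negative tag
auto_tag(1, 2)      # second run: the rejected tag stays rejected
print(visible_tags(1))
```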
<p>I encountered this issue when I implemented my <a href="http://www.algorithm.co.il/blogs/index.php/programming/design/database-design-problem/">multiple source db-design</a>. A reminder: I had data collected from various sources, and then combined into a final merged record. The combining was done automatically.<br />
My mistake was that I only considered the records as pieces of data, and didn&#8217;t consider that the actual grouping of raw data records is also part of the information I keep. As such, I should have represented the groupings in a separate table, with the added columns, as I outlined in this blog post.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.algorithm.co.il/blogs/programming/actual-data-always-needs-to-be-explicit/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Database Design Problem</title>
		<link>http://www.algorithm.co.il/blogs/startup/database-design-problem/</link>
		<comments>http://www.algorithm.co.il/blogs/startup/database-design-problem/#comments</comments>
		<pubDate>Sun, 22 Feb 2009 20:55:55 +0000</pubDate>
		<dc:creator>lorg</dc:creator>
				<category><![CDATA[Databases]]></category>
		<category><![CDATA[Design]]></category>
		<category><![CDATA[startup]]></category>
		<category><![CDATA[Anti patterns]]></category>
		<category><![CDATA[design patterns]]></category>
		<category><![CDATA[Entity-Attribute-Value model]]></category>
		<category><![CDATA[Harvesting]]></category>
		<category><![CDATA[Inner platform effect]]></category>

		<guid isPermaLink="false">http://www.algorithm.co.il/blogs/?p=178</guid>
		<description><![CDATA[A few weeks ago, I had to work out a database design for my startup. I had a bit of a hard time deciding on a design direction, but after thinking about it, I settled on a design I was &#8230; <a href="http://www.algorithm.co.il/blogs/startup/database-design-problem/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>A few weeks ago, I had to work out a database design for my startup. I had a bit of a hard time deciding on a design direction, but after thinking about it, I settled on a design I was happy with.</p>
<p>While I was still making up my mind, I discussed the problem with a couple of friends, and to better describe the problem and the proposed solutions I wrote up a short document describing them. I decided to publish this document along with my choice and considerations. Maybe someone else will benefit from my choice, or at least from the alternatives I listed.</p>
<p><strong>Problem description:</strong><br />
We want to have a table holding information collected from various sources.</p>
<p>For example, let&#8217;s say we want to collect information about paintings. For each painting we know about, we&#8217;d want the database to hold its dimensions, painter, description, link to an image file, etc. Since we collect this information from various sources (maybe harvesting information from multiple websites), we would like our application to display each field either from all sources, or from the best source available.<br />
(Note: in my original formulation, being able to display the value from the best source was enough).</p>
<p><span id="more-178"></span></p>
<p>Let&#8217;s say our desired, &#8220;regular&#8221; table looks like this:</p>
<p>{<br />
field_a<br />
field_b<br />
field_c<br />
}</p>
<p><strong>Proposed Solution 1:</strong><br />
Add a source column, like so:</p>
<p>{<br />
field_a<br />
field_b<br />
field_c<br />
source<br />
}</p>
<p>Keep multiple records with different sources, and combine them at query time.</p>
<p>Downside:<br />
Combining at query time, which is really not good.</p>
<p><strong>Proposed solution 1*:</strong><br />
final table<br />
{<br />
field_a<br />
field_b<br />
field_c<br />
}</p>
<p>source table<br />
{<br />
field_a<br />
field_b<br />
field_c<br />
source<br />
}<br />
The application queries only the final table, while a separate &#8220;source table&#8221; with a source column holds the raw records. Combination is now done at harvest time.</p>
<p>Downsides:<br />
Two separate, almost identical tables. Not good for DRY karma.</p>
<p><strong>Proposed Solution 1**:</strong></p>
<p>{<br />
field_a<br />
field_b<br />
field_c<br />
source<br />
}</p>
<p>Same as solution 1, except now a null source indicates a combined (&#8220;final&#8221;) record. Record combination is done during harvest time, while the application queries this table for records with a null source.</p>
<p>Downsides:<br />
Complicates queries a bit, keeps final data with raw data.</p>
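<p>A rough sketch of Solution 1** follows, again with SQLite. The <em>item_id</em> key and the merge rule (first non-null value in insertion order) are placeholders I assume for the example, and recombining on every harvest is a simplification; a real combination algorithm would likely be smarter and could run in batches.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE record (item_id INTEGER, field_a TEXT, field_b TEXT, source TEXT)"
)


def harvest(item_id, field_a, field_b, source):
    """Store one raw record, then rebuild the combined record for that item."""
    conn.execute(
        "INSERT INTO record VALUES (?, ?, ?, ?)", (item_id, field_a, field_b, source)
    )
    recombine(item_id)


def recombine(item_id):
    """Replace the NULL-source row with a fresh merge of all source rows."""
    conn.execute(
        "DELETE FROM record WHERE item_id = ? AND source IS NULL", (item_id,)
    )
    # Placeholder merge rule (an assumption): first non-null value in insertion order.
    rows = conn.execute(
        "SELECT field_a, field_b FROM record WHERE item_id = ? ORDER BY rowid",
        (item_id,),
    ).fetchall()
    field_a = next((a for a, _ in rows if a is not None), None)
    field_b = next((b for _, b in rows if b is not None), None)
    conn.execute(
        "INSERT INTO record VALUES (?, ?, ?, NULL)", (item_id, field_a, field_b)
    )


def final_record(item_id):
    """What the application runs online: only NULL-source rows are visible."""
    return conn.execute(
        "SELECT field_a, field_b FROM record WHERE item_id = ? AND source IS NULL",
        (item_id,),
    ).fetchone()
```

<p>The extra <em>source IS NULL</em> condition in every online query is the small complication mentioned in the downsides.</p>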
<p><strong>Proposed Solution 2:</strong><br />
{<br />
field_a<br />
field_a_source<br />
field_b<br />
field_b_source<br />
field_c<br />
field_c_source<br />
}</p>
<p>Downsides:<br />
Violates DRY, adds a lot of boilerplate code for handling each column. Doesn&#8217;t easily allow keeping unused data from different sources. For example, if source X was used for field_a, source Y&#8217;s field_a is discarded.<br />
Also, doesn&#8217;t cleanly allow the final field to be a combination of sources, for example, concatenation.</p>
<p><strong>Proposed solution 3:</strong><br />
Add another table, holding sources:</p>
<p>final table<br />
{<br />
field_a<br />
field_b<br />
field_c<br />
}</p>
<p>source table<br />
{<br />
field_name<br />
field_value<br />
field_source<br />
}</p>
<p>Queries are made against the final table, and record combination is done during harvest time.</p>
<p>Downsides:<br />
Implements a database in a database.<br />
For further reading on this solution see <a href="http://en.wikipedia.org/wiki/Entity-attribute-value_model">Entity-Attribute-Value Model</a>, and the <a href="http://en.wikipedia.org/wiki/Inner-Platform_Effect">Inner Platform Effect</a>.</p>
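<p>To illustrate why this counts as a database in a database, here is a small sketch of the source table from Solution 3. The table name <em>source_value</em> and the <em>item_id</em> key are assumptions for the example. Note that every value ends up in one generic text column, and the application has to pivot rows back into records by itself.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# item_id is a hypothetical key added so that rows can be grouped per item.
conn.execute(
    "CREATE TABLE source_value "
    "(item_id INTEGER, field_name TEXT, field_value TEXT, field_source TEXT)"
)
conn.executemany(
    "INSERT INTO source_value VALUES (?, ?, ?, ?)",
    [
        (1, "field_a", "a from x", "x"),
        (1, "field_a", "a from y", "y"),
        (1, "field_b", "b from y", "y"),
    ],
)


def build_final(item_id, preferred_sources):
    """Pivot the attribute rows back into one record, best source first."""
    final = {}
    for source in preferred_sources:
        rows = conn.execute(
            "SELECT field_name, field_value FROM source_value "
            "WHERE item_id = ? AND field_source = ?",
            (item_id, source),
        )
        for name, value in rows:
            # Keep the value from the first (most preferred) source that has one.
            final.setdefault(name, value)
    return final
```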
<p><strong>My Choice</strong><br />
I decided to use solution 1**. For some time I thought I should implement solution 3, but decided against it, mostly because of the inner platform effect, but also because of the other downsides listed (see references).<br />
I also added another column to solution 1**: for a given &#8220;source record&#8221; (a record with a non-null source), it indicates which record is its combined record. Now, given a final record, I can also query for all of its source records, which is useful for later data processing.<br />
I work with this solution in the following manner:<br />
Offline:<br />
1. Harvest from some source.<br />
2. Harvest from some other source.<br />
3. Run record combination algorithm.<br />
4. Run data analysis algorithm.<br />
5. Harvest from yet another source.<br />
6. Run record combination algorithm.<br />
7. Run data analysis algorithm.<br />
8. etc&#8230;</p>
<p>Online:<br />
Query only records with a null source.</p>
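<p>The extra column that links a source record to its combined record could look roughly like this. The column name <em>combined_id</em> and the sample rows are made up for the sketch.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE record ("
    "id INTEGER PRIMARY KEY, field_a TEXT, source TEXT, combined_id INTEGER)"
)
# One final record (NULL source) and two source records pointing at it.
conn.execute("INSERT INTO record VALUES (1, 'best a', NULL, NULL)")
conn.execute("INSERT INTO record VALUES (2, 'best a', 'x', 1)")
conn.execute("INSERT INTO record VALUES (3, 'other a', 'y', 1)")


def online_records():
    """Online queries see only the combined records."""
    return conn.execute(
        "SELECT id, field_a FROM record WHERE source IS NULL ORDER BY id"
    ).fetchall()


def source_records(final_id):
    """Offline processing can walk back from a final record to its sources."""
    return conn.execute(
        "SELECT source, field_a FROM record WHERE combined_id = ? ORDER BY id",
        (final_id,),
    ).fetchall()
```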
<p>For further reading see also <a href="http://www.pgcon.org/2008/schedule/events/97.en.html">Database Anti Patterns</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.algorithm.co.il/blogs/startup/database-design-problem/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Open Question No. 1: Persistent Predicates?</title>
		<link>http://www.algorithm.co.il/blogs/programming/open-question-no-1-persistent-predicates/</link>
		<comments>http://www.algorithm.co.il/blogs/programming/open-question-no-1-persistent-predicates/#comments</comments>
		<pubDate>Sat, 19 Jul 2008 09:30:24 +0000</pubDate>
		<dc:creator>lorg</dc:creator>
				<category><![CDATA[Design]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Expression Trees]]></category>
		<category><![CDATA[open question]]></category>
		<category><![CDATA[persistent predicates]]></category>

		<guid isPermaLink="false">http://www.algorithm.co.il/blogs/?p=112</guid>
		<description><![CDATA[Lately I&#8217;ve been developing a website. One issue that I&#8217;ll probably need to address in the near future is &#8220;persistent predicates&#8221;. By &#8220;persistent predicates&#8221; I mean the problem of treating predicates as data. Consider the following situation: you are developing &#8230; <a href="http://www.algorithm.co.il/blogs/programming/open-question-no-1-persistent-predicates/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Lately I&#8217;ve been developing a website. One issue that I&#8217;ll probably need to address in the near future is &#8220;persistent predicates&#8221;. By &#8220;persistent predicates&#8221; I mean the problem of treating predicates as data.</p>
<p>Consider the following situation: you are developing some big RSS reader/aggregator and you want to allow users to specify handling rules. How would you keep these rules in memory, and how would you keep them on disk?<br />
Obviously, this problem has been solved before. Just consider email filters, or even packet filters in Ethereal.</p>
<p>One way of approaching the problem is implementing simple predicate templates:<br />
&#8220;%field contains %s&#8221; where field is subject, or body, etc.<br />
Once that is accomplished, you can specify that a &#8220;filter&#8221; is some combination (for example logical and, or logical or) of multiple predicates. To store this, we&#8217;ll have an actual predicate table (or pickle) holding the predicates with their data, and a one-to-many mapping of filters to predicates.</p>
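<p>As a rough illustration of the template approach, the sketch below keeps predicates as plain data and looks up their behaviour by name. The template names, the filter layout and the use of JSON for storage are all assumptions for the example; a pickle or a pair of database tables would work along the same lines.</p>

```python
import json

# Hypothetical predicate templates: a name mapped to a function of (value, argument).
TEMPLATES = {
    "contains": lambda value, arg: arg in value,
    "equals": lambda value, arg: value == arg,
}

COMBINERS = {"and": all, "or": any}


def matches(filter_spec, item):
    """Evaluate a stored filter against an item (a dict of field values)."""
    results = (
        TEMPLATES[p["template"]](item[p["field"]], p["arg"])
        for p in filter_spec["predicates"]
    )
    return COMBINERS[filter_spec["combine"]](results)


# A filter is plain data, so it can be stored as JSON, a pickle, or table rows.
spam_filter = {
    "combine": "or",
    "predicates": [
        {"template": "contains", "field": "subject", "arg": "lottery"},
        {"template": "equals", "field": "author", "arg": "spammer"},
    ],
}
restored = json.loads(json.dumps(spam_filter))
```

<p>Because only names and arguments are stored, adding a new kind of predicate means adding one entry to <em>TEMPLATES</em>, while existing stored filters keep working.</p>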
<p>Another option is allowing just some very simple predicates, and a filter will just &#8220;point&#8221; to (have an id/name of) the required predicate, and the required data. In this option, all data is stored with the filter.</p>
<p>A more complicated solution is to implement some logical, serializable language (such as the expression trees I used for <a href="http://www.algorithm.co.il/blogs/index.php/tag/vial/">diStorm</a> or <a href="http://www.algorithm.co.il/blogs/index.php/programming/python/pykoan-the-logic-game/">PyKoan</a>). Using this language, the predicates can be very dynamic, and can be combined and manipulated programmatically. This solution might be overkill for many projects though.</p>
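<p>A tiny version of such a language might look like the sketch below. This is not the representation used in diStorm or PyKoan; the nested-list format and the operator names are assumptions chosen because they serialize directly to JSON.</p>

```python
import json


def evaluate(expr, item):
    """Recursively evaluate an expression tree against an item."""
    op, *args = expr
    if op == "and":
        return all(evaluate(arg, item) for arg in args)
    if op == "or":
        return any(evaluate(arg, item) for arg in args)
    if op == "not":
        return not evaluate(args[0], item)
    if op == "contains":
        field, text = args
        return text in item[field]
    raise ValueError("unknown operator: %r" % (op,))


def negate(expr):
    """Trees are data, so they can be rewritten programmatically."""
    return ["not", expr]


rule = ["and", ["contains", "subject", "python"], ["not", ["contains", "body", "spam"]]]
stored = json.dumps(rule)   # persist to disk or a database column
loaded = json.loads(stored)
```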
<p>An interesting issue regarding the handling of predicates is their application to constraint solving. However, this is an issue for a future post. Suffice it to say that when writing PyKoan I&#8217;m using a constraint solver. Since I&#8217;m representing predicates with expression trees, the ability to analyze and manipulate predicates is very handy.</p>
<p>Besides looking at existing solutions, I&#8217;m very curious to hear other people&#8217;s opinions. Feel free to write about your preferred solution in the comments.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.algorithm.co.il/blogs/programming/open-question-no-1-persistent-predicates/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Issues in writing a VM &#8211; Part 3 &#8211; State and Memory</title>
		<link>http://www.algorithm.co.il/blogs/programming/issues-in-writing-a-vm-part-3-state-and-memory/</link>
		<comments>http://www.algorithm.co.il/blogs/programming/issues-in-writing-a-vm-part-3-state-and-memory/#comments</comments>
		<pubDate>Wed, 09 Apr 2008 10:28:49 +0000</pubDate>
		<dc:creator>lorg</dc:creator>
				<category><![CDATA[Design]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Distorm]]></category>
		<category><![CDATA[memory]]></category>
		<category><![CDATA[sparse list]]></category>
		<category><![CDATA[VM]]></category>

		<guid isPermaLink="false">http://www.algorithm.co.il/blogs/index.php/programming/python/issues-in-writing-a-vm-part-3-state-and-memory/</guid>
		<description><![CDATA[When implementing the VM, I had to keep track of state. The state of the VM includes the registers, virtual variables and memory. Fortunately, keeping track of state information is pretty easy. Basically, it amounts to having a dict, where &#8230; <a href="http://www.algorithm.co.il/blogs/programming/issues-in-writing-a-vm-part-3-state-and-memory/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>When implementing the VM, I had to keep track of state. The state of the VM includes the registers, virtual variables and memory. Fortunately, keeping track of state information is pretty easy. Basically, it amounts to having a dict, where the keys are registers or variables, and the values are, well, their values.</p>
<p>Holding the state of registers is complicated a bit by the fact that registers may overlap, as I mentioned in a <a href="http://www.algorithm.co.il/blogs/index.php/programming/python/issues-in-writing-a-vm-part-2/">previous article</a>. To handle this, the class keeping track of the state information, upon seeing a request to change the value of a register, propagates this change to its parent and child registers.</p>
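<p>One possible way to do this propagation is sketched below. The class name and the layout table are assumptions for the example, not the VM&#8217;s actual code; the layout itself is the familiar x86 overlap of eax, ax, ah and al.</p>

```python
# Each register is described as (root register, bit offset, bit width).
LAYOUT = {
    "eax": ("eax", 0, 32),
    "ax": ("eax", 0, 16),
    "ah": ("eax", 8, 8),
    "al": ("eax", 0, 8),
}


class RegisterState:
    def __init__(self):
        # The state is a plain dict of register name to value.
        self.values = {name: 0 for name in LAYOUT}

    def set(self, name, value):
        root, offset, width = LAYOUT[name]
        mask = ((1 << width) - 1) << offset
        # Fold the write into the root register...
        root_value = (self.values[root] & ~mask) | ((value << offset) & mask)
        # ...then propagate it to every register that shares that root.
        for other, (other_root, other_offset, other_width) in LAYOUT.items():
            if other_root == root:
                self.values[other] = (root_value >> other_offset) & ((1 << other_width) - 1)

    def get(self, name):
        return self.values[name]
```

<p>Writing to any register in a family updates the whole family, so reads stay simple dict lookups.</p>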
<p>Holding memory might be different though. At first glance, it would seem that memory should be kept in some buffer. However, it is much easier to keep the memory in a Python dict as well, and treat that dict as if it were a sparse array. This implementation allows programs to write at address 10, and at address 10000, without requiring the VM to keep track of all the addresses in between. (Of course, we are not going to implement paging just now :)</p>
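<p>In its simplest form, the dict-as-sparse-array idea might look like this. The class and method names are assumptions for the sketch, and the data is written as Python 3 bytes.</p>

```python
class SparseMemory:
    """Byte-addressable memory backed by a dict of address to byte value."""

    def __init__(self):
        self.cells = {}

    def write(self, address, data):
        # Only the bytes actually written take up space.
        for offset, byte in enumerate(data):
            self.cells[address + offset] = byte

    def read(self, address, length):
        # Addresses that were never written read back as None.
        return [self.cells.get(address + i) for i in range(length)]
```

<p>Writing at address 10 and at address 10000 stores just those cells, with nothing allocated for the gap between them.</p>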
<p>I really liked this idea of treating dicts as sparse lists. In fact, it could work even better with <a href="http://newcenturycomputers.net/projects/rbtree.html">some tree</a> <a href="http://en.wikipedia.org/wiki/Binary_search_tree">data structure</a> instead of a hash table. With a tree data structure your keys would be sorted, and you could do slices.<br />
I went ahead and tried using tree data structures for this, but it seems to be more trouble than it&#8217;s worth, so I think that unless it becomes a real timing issue, I&#8217;ll let it go.</p>
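<p>For a feel of what sorted keys would buy, here is a sketch that fakes the tree with a dict plus a sorted key list maintained with the standard <em>bisect</em> module. It is a stand-in, not a real tree: inserting a new key costs O(n) here, where a balanced tree would take O(log n). The class and method names are assumptions.</p>

```python
import bisect


class SortedSparseList:
    """A dict plus a sorted key list, standing in for a tree-backed mapping."""

    def __init__(self):
        self.cells = {}
        self.keys = []  # kept sorted at all times

    def __setitem__(self, address, value):
        if address not in self.cells:
            bisect.insort(self.keys, address)  # O(n) insert, unlike a real tree
        self.cells[address] = value

    def __getitem__(self, address):
        return self.cells[address]

    def slice(self, start, stop):
        """Return (address, value) pairs from start (inclusive) to stop (exclusive)."""
        lo = bisect.bisect_left(self.keys, start)
        hi = bisect.bisect_left(self.keys, stop)
        return [(key, self.cells[key]) for key in self.keys[lo:hi]]
```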
<p>I did have some fun wrapping the memory in some class to abstract away the data structure, and to give it a few more capabilities:</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;">In <span style="color: black;">&#91;</span><span style="color: #ff4500;">2</span><span style="color: black;">&#93;</span>: <span style="color: #ff7700;font-weight:bold;">import</span> vm
&nbsp;
In <span style="color: black;">&#91;</span><span style="color: #ff4500;">3</span><span style="color: black;">&#93;</span>: m = vm.<span style="color: black;">VMMemory</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
&nbsp;
In <span style="color: black;">&#91;</span><span style="color: #ff4500;">4</span><span style="color: black;">&#93;</span>: m.<span style="color: #008000;">set</span><span style="color: black;">&#40;</span><span style="color: #ff4500;">0</span>, <span style="color: #483d8b;">&quot;hello world!<span style="color: #000099; font-weight: bold;">\r</span><span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: black;">&#41;</span>
&nbsp;
In <span style="color: black;">&#91;</span><span style="color: #ff4500;">5</span><span style="color: black;">&#93;</span>: m.<span style="color: #008000;">set</span><span style="color: black;">&#40;</span><span style="color: #ff4500;">51</span>, <span style="color: #483d8b;">&quot;foobar, foobar<span style="color: #000099; font-weight: bold;">\x</span>00&quot;</span><span style="color: black;">&#41;</span>
&nbsp;
In <span style="color: black;">&#91;</span><span style="color: #ff4500;">6</span><span style="color: black;">&#93;</span>: m.<span style="color: #008000;">set</span><span style="color: black;">&#40;</span><span style="color: #ff4500;">77</span>, <span style="color: #483d8b;">&quot;a&quot;</span><span style="color: #66cc66;">*</span><span style="color: #ff4500;">30</span><span style="color: black;">&#41;</span>
&nbsp;
In <span style="color: black;">&#91;</span><span style="color: #ff4500;">7</span><span style="color: black;">&#93;</span>: <span style="color: #ff7700;font-weight:bold;">print</span> m
0000: 68656c6c6f20776f726c64210d0a<span style="color: #66cc66;">????</span>  hello world<span style="color: #66cc66;">!</span>..<span style="color: #66cc66;">??</span>
&nbsp;
0033: 666f6f6261722c20666f6f62617200<span style="color: #66cc66;">??</span>  foobar, foobar.<span style="color: #66cc66;">?</span>
&nbsp;
004d: <span style="color: #ff4500;">61616161616161616161616161616161</span>  aaaaaaaaaaaaaaaa
005d: <span style="color: #ff4500;">6161616161616161616161616161</span><span style="color: #66cc66;">????</span>  aaaaaaaaaaaaaa<span style="color: #66cc66;">??</span></pre></div></div>
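<p>A dump routine producing output in the same shape as the session above could be sketched as follows. This is a guess at the behaviour based only on the printed output, not the actual <em>VMMemory</em> code: contiguous regions are printed separately, 16 bytes per row, with question marks padding out the last row of each region.</p>

```python
def dump(cells, width=16):
    """Format a sparse memory dict (address to byte value) as a hex dump."""
    lines = []
    addresses = sorted(cells)
    run_start = 0
    for i in range(1, len(addresses) + 1):
        # A run ends at the last address or where the next address is not adjacent.
        if i == len(addresses) or addresses[i] != addresses[i - 1] + 1:
            run = addresses[run_start:i]
            if lines:
                lines.append("")  # blank line between separate regions
            for row in range(0, len(run), width):
                chunk = [cells[a] for a in run[row:row + width]]
                pad = width - len(chunk)
                hex_part = "".join("%02x" % b for b in chunk) + "??" * pad
                text = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
                lines.append("%04x: %s  %s" % (run[row], hex_part, text + "?" * pad))
            run_start = i
    return "\n".join(lines)
```

<p>Here <em>cells</em> is a plain dict mapping addresses to integer byte values, so the function can be tried without the rest of the VM.</p>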

]]></content:encoded>
			<wfw:commentRss>http://www.algorithm.co.il/blogs/programming/issues-in-writing-a-vm-part-3-state-and-memory/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>LRU cache solution: a case for linked lists in Python</title>
		<link>http://www.algorithm.co.il/blogs/challenges/lru-cache-solution-a-case-for-linked-lists-in-python/</link>
		<comments>http://www.algorithm.co.il/blogs/challenges/lru-cache-solution-a-case-for-linked-lists-in-python/#comments</comments>
		<pubDate>Sun, 13 Jan 2008 18:13:32 +0000</pubDate>
		<dc:creator>lorg</dc:creator>
				<category><![CDATA[Challenges]]></category>
		<category><![CDATA[Design]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[linked-list]]></category>
		<category><![CDATA[LRU-cache]]></category>

		<guid isPermaLink="false">http://www.algorithm.co.il/blogs/index.php/programming/python/lru-cache-solution-a-case-for-linked-lists-in-python/</guid>
		<description><![CDATA[The reason I put the LRU cache challenge up was that I couldn&#8217;t think of a good solution to the problem without using linked lists. This has been pointed out by Adam and Erez as well. Adam commented on &#8230; <a href="http://www.algorithm.co.il/blogs/challenges/lru-cache-solution-a-case-for-linked-lists-in-python/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>The reason I put <a href="http://www.algorithm.co.il/blogs/index.php/programming/python/small-python-challenge-no-2-lru-cache/">the LRU cache challenge</a> up was that I couldn&#8217;t think of a good solution to the problem without using linked lists. This has been pointed out by Adam and Erez as well. Adam commented on this, and <a href="http://www.algorithm.co.il/sitecode/erez_lru.py">Erez&#8217; solution</a> to the problem was algorithmically identical to <a href="http://www.algorithm.co.il/sitecode/lorg_lru.py">mine</a>.</p>
<p>So how to solve the challenge? Here are the two possible solutions I thought about:</p>
<ul>
<li>Use a dict for lookup, and each element&#8217;s age is indicated by its position in a linked list. This is the solution Erez and I implemented.</li>
<li>Keep a &#8216;last time of use&#8217; indicator for each element. This could be just a regular int, incremented by 1 for each lookup. Keep the elements in a <a href="http://en.wikipedia.org/wiki/Min_heap">min heap</a>, and when there are too many elements, pop the least recently used ones off the heap.</li>
</ul>
<p>Generally, I consider the first solution more elegant. It doesn&#8217;t rely on an integer to work, so it could work &#8216;indefinitely&#8217;. Of course, the second solution can also be made to work indefinitely, with some upkeep from time to time. (The added time cost of the upkeep may be amortized over other actions.)</p>
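<p>For reference, the first solution can be sketched along these lines. This is not the code linked above; the class and method names are assumptions, and the sketch assumes a capacity of at least 1.</p>

```python
class _Node:
    __slots__ = ("key", "value", "prev", "next")

    def __init__(self, key=None, value=None):
        self.key, self.value = key, value
        self.prev = self.next = None


class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity  # assumed to be at least 1
        self.nodes = {}  # key -> node, for O(1) lookup
        # Sentinel head/tail nodes: head.next is the newest, tail.prev the oldest.
        self.head, self.tail = _Node(), _Node()
        self.head.next, self.tail.prev = self.tail, self.head

    def _unlink(self, node):
        node.prev.next, node.next.prev = node.next, node.prev

    def _push_front(self, node):
        node.prev, node.next = self.head, self.head.next
        self.head.next.prev = node
        self.head.next = node

    def get(self, key, default=None):
        node = self.nodes.get(key)
        if node is None:
            return default
        self._unlink(node)      # a lookup makes the element the youngest
        self._push_front(node)
        return node.value

    def put(self, key, value):
        node = self.nodes.get(key)
        if node is not None:
            node.value = value
            self._unlink(node)
        else:
            if len(self.nodes) >= self.capacity:
                oldest = self.tail.prev  # the list position encodes the age
                self._unlink(oldest)
                del self.nodes[oldest.key]
            node = self.nodes[key] = _Node(key, value)
        self._push_front(node)
```

<p>Both lookup and eviction are O(1): the dict finds the node, and the list order says which node is oldest without any counter.</p>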
<p>If you can think of some other, more elegant solution, I&#8217;ll be happy to hear about it.</p>
<p>So, given that a linked list solution is more elegant, we come to the crux of the problem: what to do in Python? The Python standard library does not contain a linked list implementation as far as I know. As a result, Python programmers are encouraged to use the <em>list</em> type, which is an array. This is just as well: for most intents and purposes, the list type is good enough.</p>
<p>I tried to think a little about other cases where a linked list was more appropriate, and I didn&#8217;t come up with any more such cases. If you come up with any such case, I&#8217;ll be happy to hear about it.</p>
<p>After looking for a public implementation, and not finding one that seemed good enough, I decided to go ahead and write my own.</p>
<p>Out of curiosity, I also did a small comparison of runtime speeds between my implementation of a linked list, and the <em>list</em> data type. I tried a test where a linked list has an obvious advantage (complexity-wise): removing elements from the middle of the list. The Python <em>list</em> won up to somewhere in the thousands of elements. (Of course, <em>list</em> is implemented in C, and mine is in pure Python.)</p>
<p>What is my conclusion from all of this? The same as the conventional wisdom: use the <em>list</em> data type almost always. If you find yourself in need of a linked list, think long and hard (well, not <em>too</em> long) about your problem and solution. There&#8217;s a good chance that either you <em>can</em> use the built-in list with an equivalent solution, or that a regular list will still be faster, for most cases. Of course, if you see no other way &#8211; do what you think is best.</p>
<p>Your thoughts?</p>
]]></content:encoded>
			<wfw:commentRss>http://www.algorithm.co.il/blogs/challenges/lru-cache-solution-a-case-for-linked-lists-in-python/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Browser visibility-security and invisibility-insecurity</title>
		<link>http://www.algorithm.co.il/blogs/computer-science/browser-visibility-security-and-invisibility-insecurity/</link>
		<comments>http://www.algorithm.co.il/blogs/computer-science/browser-visibility-security-and-invisibility-insecurity/#comments</comments>
		<pubDate>Fri, 03 Aug 2007 23:12:01 +0000</pubDate>
		<dc:creator>lorg</dc:creator>
				<category><![CDATA[computer science]]></category>
		<category><![CDATA[Design]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Programming Philosophy]]></category>
		<category><![CDATA[Security]]></category>

		<guid isPermaLink="false">http://www.algorithm.co.il/blogs/index.php/programming/browser-visibility-security-and-invisibility-insecurity/</guid>
		<description><![CDATA[Formal languages have a knack of giving some output, and then later doing something completely different. For example, take the &#8220;Halting Problem&#8221;, but this is probably too theoretical to be of any relevance&#8230; so read on for something a bit &#8230; <a href="http://www.algorithm.co.il/blogs/computer-science/browser-visibility-security-and-invisibility-insecurity/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Formal languages have a knack of giving some output, and then later doing something completely different. For example, take the &#8220;<a href="http://en.wikipedia.org/wiki/Halting_problem" title="The Halting Problem on Wikipedia">Halting Problem</a>&#8221;, but this is probably too theoretical to be of any relevance&#8230; so read on for something a bit more practical. We are going to go down the rabbit hole, to the &#8216;in-between&#8217; space&#8230;</p>
<p>My interest was first piqued when I encountered the following annoyance &#8211; some websites would use transparent layers to prevent you from:</p>
<ol>
<li>Marking and copying text.</li>
<li>Right-clicking on anything, including:
<ol>
<li>images, to save them,</li>
<li>the page itself, to view its source,</li>
</ol>
</li>
<li>and so on and so forth&#8230;</li>
</ol>
<p>Now I bet most intelligent readers would know how to get past these minor hurdles &#8211; but just having to take the extra steps is usually deterrent enough to stop the next lazy guy from doing anything. So I was thinking, why not write a browser &#8211; or just a Firefox plugin &#8211; that will allow us to view just the top level of any website?</p>
<p>This should be easy enough to do, but if it bothered enough sites (which it probably won&#8217;t), and they fought back, there would be a pretty standard escalation war. However, since the issue is not that major, I suspect it wouldn&#8217;t matter much.</p>
<p>Now comes the more interesting part. Unlike preventing someone from copying text, HTML (plus any &#8216;sub-languages&#8217; it may use) may be used to display one thing, and to be read as a different thing altogether. The most common example is with spam &#8211; displaying image spam instead of text. When that was countered by spam filters, animated gif files were used. Now you have it &#8211; your escalation war, par excellence. This property of HTML was also <a href="http://blogs.securiteam.com/index.php/archives/970">used by honeypots to filter comment-spam, as described in securiteam</a>. In that securiteam blog post by Aviram, the beginning of another escalation war is described. There are many more examples of this property of HTML.</p>
<p>All of these examples come from HTML&#8217;s basic ability to specify what to display, while being able to seem to display completely different things. There are actually two parsers at work here &#8211; one is the &#8216;filter&#8217; &#8211; its goal is to filter out some &#8216;bad&#8217; HTML, and the other is a bit more complicated &#8211; it is the person reading the browser&#8217;s output (it may be considered to be the &#8216;browser + person&#8217; parser). These two parsers operate on completely different levels of HTML. Now, I would like to point out that having two parsers reading the same language is a common insecurity pattern. HTML has a huge space between what is expressible, and what is visible. In that space &#8211; danger lies.</p>
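<p>A toy demonstration of the two parsers disagreeing, using Python&#8217;s standard <em>html.parser</em> module, is given below. It is a deliberately naive sketch: real visibility depends on stylesheets, scripts, images and much more, and the code only recognises an inline <em>display:none</em> style on well-nested markup without void elements. The class name and the sample page are made up for the example.</p>

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text, optionally skipping elements styled as display:none."""

    def __init__(self, skip_hidden):
        super().__init__()
        self.skip_hidden = skip_hidden
        self.hidden_stack = []  # one flag per open element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        style = (dict(attrs).get("style") or "").replace(" ", "").lower()
        inherited = bool(self.hidden_stack) and self.hidden_stack[-1]
        self.hidden_stack.append(inherited or "display:none" in style)

    def handle_endtag(self, tag):
        if self.hidden_stack:
            self.hidden_stack.pop()

    def handle_data(self, data):
        hidden = bool(self.hidden_stack) and self.hidden_stack[-1]
        if data.strip() and not (self.skip_hidden and hidden):
            self.chunks.append(data.strip())


def extract(html, skip_hidden):
    parser = TextExtractor(skip_hidden)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


PAGE = (
    '<div><p>Win a free prize now</p>'
    '<p style="display: none">Minutes of the gardening club meeting</p></div>'
)
```

<p>Calling <em>extract(PAGE, False)</em> plays the role of a filter that reads every text node, while <em>extract(PAGE, True)</em> is a crude approximation of what the person in front of the browser sees. The hidden paragraph exists only for the first reader.</p>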
<p>As another, simpler example, consider phishing sites. These are common enough nowadays. How does your browser decide if the site you are looking at is actually a phishing site? Among other things &#8211; by reading the code behind the site. However, this code can point to something completely different from what is being displayed. In this &#8216;invisible&#8217; space &#8211; any misleading code can live. In that way, the spammer may pretend to be a legitimate site for the filter, but your run-of-the-mill phishing site for the human viewer. This misleading code in the &#8216;invisible space&#8217; may be used for good &#8211; like a honeypot against some comment-spammer, or it may be used for different purposes &#8211; by the spammer himself.</p>
<p>Now comes the interesting part. The &#8220;what to do&#8221; part. For now let me just describe it theoretically, and later work on its practicality. I suggest using a &#8216;visibility browser&#8217;. This browser will use some popular browser (Internet Explorer, Firefox, Safari, Opera, etc.) as its lower level. This lower level browser will render the website to some buffer, instead of the screen. Now, our &#8216;visibility browser&#8217; will OCR all of the visible rendered data, and restructure it as valid HTML. This &#8216;purified&#8217; HTML may now be used to filter any &#8216;bad&#8217; sites &#8211; by whichever criterion you would like to use for &#8216;bad&#8217;.</p>
<p>I know, I know, this is not practical, it is computationally intensive, etc. etc&#8230; However, it does present a method to close down that nagging &#8216;space&#8217;, this place between readability and visibility, where bad code lies. I also know that the &#8216;visibility browser&#8217; itself may be targeted, and probably quite easily. Those attacks will have to rely on implementation faults of the software, or some other flaw, as yet un-thought-of. We all know there will always be bugs. But it seems to me that the &#8216;visibility browser&#8217; does close, or at least cover for a time, one nagging design flaw.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.algorithm.co.il/blogs/computer-science/browser-visibility-security-and-invisibility-insecurity/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Exception handling policy &#8211; use module exception hierarchies</title>
		<link>http://www.algorithm.co.il/blogs/programming/exception-handling-policy-use-module-exception-hierarchies/</link>
		<comments>http://www.algorithm.co.il/blogs/programming/exception-handling-policy-use-module-exception-hierarchies/#comments</comments>
		<pubDate>Wed, 18 Jul 2007 12:59:16 +0000</pubDate>
		<dc:creator>lorg</dc:creator>
				<category><![CDATA[C]]></category>
		<category><![CDATA[Design]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Exceptions]]></category>

		<guid isPermaLink="false">http://www.algorithm.co.il/blogs/index.php/programming/python/exception-handling-policy-use-module-exception-hierarchies/</guid>
		<description><![CDATA[While programming some bigger projects, and not some home-brew script, I used to wonder what to do with exceptions coming from lower level and library modules. The &#8216;home script&#8217; approach to exceptions is &#8220;let it rise&#8221; &#8211; usually because it &#8230; <a href="http://www.algorithm.co.il/blogs/programming/exception-handling-policy-use-module-exception-hierarchies/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>While programming some bigger projects, and not some home-brew script, I used to wonder what to do with exceptions coming from lower level and library modules. The &#8216;home script&#8217; approach to exceptions is &#8220;let it rise&#8221; &#8211; usually because it indicates an error anyway, and the script needs to shut down. If there is any cleanup to be done, it will happen in the __del__ methods and finally clauses as required.</p>
<p>After getting some experience with handling problems and writing applications that continue to work despite exceptions, I&#8217;ve come to the conclusion that the best practice is for each of the project&#8217;s modules to raise its own type of exception. That way, in higher level modules that catch the exception you can set a concrete policy about &#8216;what to do&#8217;.</p>
<p>An example: I have some low level module named &#8220;worker&#8221;. Let&#8217;s say this module&#8217;s job is to copy files around efficiently. The higher level module, &#8220;manager&#8221;, decides on &#8216;policy&#8217; about what to do with files (copy them, delete them, etc&#8230;). Now, suppose the &#8220;worker&#8221; module raises an exception. If this exception is an IndexError, the higher level module can&#8217;t really decide &#8211; was this because of a programming error, or an OS error? Even if &#8220;worker&#8221; did define an interface that allowed for throwing IndexErrors in some cases, a real programming error in the module would be indistinguishable from them, and would be missed. The correct way to go about it is to create an exception hierarchy for each module, which will be the &#8216;communication channel&#8217; for errors. This way, any unexpected programming error will percolate up as such, and not get missed &#8211; and any exception handling mechanism on the way can decide what to do about it (log, die, ignore, etc&#8230;).</p>
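<p>The worker/manager example could be sketched like this. All names are hypothetical, and a dict stands in for the file system so that the sketch is self-contained.</p>

```python
class WorkerError(Exception):
    """Base class for every error the hypothetical worker module reports."""


class CopyFailedError(WorkerError):
    """An expected, documented failure: the copy could not be completed."""


def copy_file(src, dst, files):
    """Stand-in for the worker's copy routine; 'files' is a dict acting as a disk."""
    try:
        files[dst] = files[src]
    except KeyError as exc:
        # Translate the low-level error into the module's own hierarchy.
        raise CopyFailedError("no such file: %s" % src) from exc


def manager_copy(src, dst, files, log):
    """The manager sets policy: log worker errors, let anything else propagate."""
    try:
        copy_file(src, dst, files)
        return True
    except WorkerError as exc:
        log.append(str(exc))
        return False
```

<p>An expected failure arrives as a <em>WorkerError</em> and is handled by policy, while a genuine bug (say, passing <em>None</em> instead of the dict) surfaces as a plain TypeError that the manager does not swallow.</p>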
<p>Every exception handling rule has exceptions: when writing &#8216;low-level&#8217; library modules yourself, such as a dict-like class, it makes sense to use the built-in exceptions (KeyError for a dict-like class, IndexError for a list-like one). Don&#8217;t get confused with other library modules though &#8211; if you are writing some communication library, it should follow the same rules described above for a module.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.algorithm.co.il/blogs/programming/exception-handling-policy-use-module-exception-hierarchies/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>

