<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Algorithm.co.il &#187; Databases</title>
	<atom:link href="http://www.algorithm.co.il/blogs/category/programming/databases/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.algorithm.co.il/blogs</link>
	<description>Algorithms, for the heck of it</description>
	<lastBuildDate>Tue, 21 Jun 2011 20:37:28 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1.3</generator>
		<item>
		<title>Bulk INSERTs FTW</title>
		<link>http://www.algorithm.co.il/blogs/programming/bulk-inserts-ftw/</link>
		<comments>http://www.algorithm.co.il/blogs/programming/bulk-inserts-ftw/#comments</comments>
		<pubDate>Fri, 17 Jul 2009 13:23:33 +0000</pubDate>
		<dc:creator>lorg</dc:creator>
				<category><![CDATA[Databases]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[bulk-insert]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[sql-server]]></category>
		<category><![CDATA[sqlalchemy]]></category>
		<category><![CDATA[sqlite]]></category>

		<guid isPermaLink="false">http://www.algorithm.co.il/blogs/?p=286</guid>
		<description><![CDATA[A short while ago, I had to research some API for a company I&#8217;m consulting for. This API yields very good quality data, but isn&#8217;t comfortable enough to process it for further research. The obvious solution was to dump this &#8230; <a href="http://www.algorithm.co.il/blogs/programming/bulk-inserts-ftw/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>A short while ago, I had to research some API for a company I&#8217;m consulting for. This API yields very good quality data, but it isn&#8217;t convenient enough for processing that data in further research.<br />
The obvious solution was to dump this data into some kind of database, and process it there.<br />
Our first attempt was pickle files. It worked nicely enough, but when the input data was 850 megs, it died horribly with a memory error.</p>
<p>(It should be mentioned that just starting to work with the API costs about 1.2 gigs of RAM.)</p>
<p>Afterwards, we tried sqlite, with similar results. After clearing it of memory errors, the code (sqlite + sqlalchemy + our code) was still not stable, and apart from that, dumping the data took too much time.</p>
<p>We decided that we needed some *real* database engine, and we arranged to get some nice sql-server with plenty of RAM and CPUs. We used the same sqlalchemy code, and for smaller inputs (a few megs) it worked very well. However, for our real input, the processing would have taken more than two weeks to finish, had it not died in a fiery MemoryError (again!).</p>
<p>(In my defense regarding the MemoryError, I&#8217;ll add that we added an id cache for records, to try to shorten the timings. We could have avoided this cache and the MemoryError, but the timings would have been worse. Not to mention that most of the memory was taken by the API&#8230;)</p>
<p>At this point, we asked for help from someone who knows *a little bit* more about databases than us, and he suggested <a href="http://msdn.microsoft.com/en-us/library/ms188365.aspx">bulk inserts</a>.</p>
<p>The recipe is simple: dump all your information into a csv file (tabs and newlines as delimiters).<br />
Then do BULK INSERT, and a short while later, you&#8217;ll have your information inside.<br />
We implemented the changes, and some tens of millions of records later, we had a database full of interesting stuff.</p>
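<p>The recipe above can be sketched roughly as follows. This is a minimal illustration, not the code we actually ran: the table name <code>dbo.records</code>, the file name and the helper names are made up, and it assumes no field contains a tab or a newline. The file must be readable from the machine running SQL Server, and row terminator handling differs between setups, so check the BULK INSERT documentation before relying on it.</p>

```python
import csv
import os
import tempfile


def dump_rows(path, rows):
    """Write rows as tab-delimited text, one record per line."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        # QUOTE_NONE keeps the file plain: no quoting, so a tab inside a
        # field raises csv.Error instead of silently corrupting the dump.
        writer = csv.writer(f, delimiter="\t", lineterminator="\r\n",
                            quoting=csv.QUOTE_NONE)
        writer.writerows(rows)


def bulk_insert_sql(table, path):
    """Build the T-SQL statement that loads the dumped file."""
    return (
        f"BULK INSERT {table} FROM '{path}' "
        "WITH (FIELDTERMINATOR = '\\t', ROWTERMINATOR = '\\n')"
    )


# Hypothetical sample data and target table.
rows = [(1, "alpha", 0.5), (2, "beta", 1.25)]
path = os.path.join(tempfile.mkdtemp(), "records.tsv")
dump_rows(path, rows)
statement = bulk_insert_sql("dbo.records", path)
```

<p>The resulting statement is then executed once over whatever connection you already have to the server, instead of issuing one INSERT per record.</p>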
<p>My suggestion: add FTW as a possible extension for the bulk insert syntax. It won&#8217;t do anything, but it will certainly fit.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.algorithm.co.il/blogs/programming/bulk-inserts-ftw/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>A Simple Race-Condition</title>
		<link>http://www.algorithm.co.il/blogs/programming/a-simple-race-condition/</link>
		<comments>http://www.algorithm.co.il/blogs/programming/a-simple-race-condition/#comments</comments>
		<pubDate>Sun, 28 Jun 2009 09:38:18 +0000</pubDate>
		<dc:creator>lorg</dc:creator>
				<category><![CDATA[Databases]]></category>
		<category><![CDATA[Design]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Bug]]></category>
		<category><![CDATA[Caching]]></category>
		<category><![CDATA[Multi-Threading]]></category>
		<category><![CDATA[Race-Condition]]></category>
		<category><![CDATA[web applications]]></category>

		<guid isPermaLink="false">http://www.algorithm.co.il/blogs/?p=263</guid>
		<description><![CDATA[Lately, I&#8217;ve mostly been working on my startup. It&#8217;s a web-application, and one of the first things I&#8217;ve written was a cache mechanism for some lengthy operations. Yesterday, I found a classic race-condition in that module. I won&#8217;t present the &#8230; <a href="http://www.algorithm.co.il/blogs/programming/a-simple-race-condition/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Lately, I&#8217;ve mostly been working on my startup. It&#8217;s a web-application, and one of the first things I wrote was a cache mechanism for some lengthy operations. Yesterday, I found a classic race-condition in that module. I won&#8217;t present the code itself here; instead, I&#8217;ll try to present the essence of the bug.</p>
<p>Consider a web application, required to do lengthy operations from time to time, either IO bound, or CPU bound. To save time, and maybe also bandwidth or CPU time, we are interested in caching the results of these operations.<br />
So, let&#8217;s say we create some database table, that has the following fields:</p>
<ul>
<li><strong>Unique key:</strong> input</li>
<li>output</li>
<li>use_count *</li>
<li>Last use date *</li>
</ul>
<p>I&#8217;ve marked with an asterisk the fields that are optional, and are related to managing the size of the cache. These are not relevant at the moment.<br />
Here is what code using the cache would look like (in pseudocode form):</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;">result = cache.<span style="color: #dc143c;">select</span><span style="color: black;">&#40;</span><span style="color: #008000;">input</span><span style="color: black;">&#41;</span>
<span style="color: #ff7700;font-weight:bold;">if</span> result:
    <span style="color: #ff7700;font-weight:bold;">return</span> result
result = compute<span style="color: black;">&#40;</span><span style="color: #008000;">input</span><span style="color: black;">&#41;</span>
cache.<span style="color: black;">insert</span><span style="color: black;">&#40;</span><span style="color: #008000;">input</span>, result<span style="color: black;">&#41;</span>
<span style="color: #ff7700;font-weight:bold;">return</span> result</pre></div></div>

<p>This code will work well under normal circumstances. However, in a multithreaded environment, or any environment where access to the database is shared, there is a race-condition: What happens if there are two requests for the same input at about the same time?<br />
Here&#8217;s a simple scheduling that will show the bug:</p>
<p>Thread1: result = cache.select(input). result is None<br />
Thread1: result = compute(input)<br />
Thread2: result = cache.select(input). result is None<br />
Thread2: result = compute(input)<br />
Thread2: cache.insert(input, result)<br />
Thread1: cache.insert(input, result) &#8211; exception &#8211; duplicate records for the unique key input!</p>
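<p>The schedule above can be replayed deterministically, without real threads. The sketch below is only an illustration: it assumes an in-memory SQLite table standing in for the cache, and the helper names <code>select</code>, <code>insert</code> and <code>compute</code> are stand-ins for the pseudocode, not the real module.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cache (input TEXT PRIMARY KEY, output TEXT)")


def select(key):
    row = conn.execute(
        "SELECT output FROM cache WHERE input = ?", (key,)
    ).fetchone()
    return row[0] if row else None


def insert(key, value):
    conn.execute("INSERT INTO cache (input, output) VALUES (?, ?)", (key, value))


def compute(key):
    return key.upper()  # stand-in for the lengthy operation


# Replay the schedule by hand; each "thread" is just its own local variable.
result1 = select("abc")       # Thread1: cache miss
result1 = compute("abc")      # Thread1: compute
result2 = select("abc")       # Thread2: cache miss, nothing inserted yet
result2 = compute("abc")      # Thread2: compute
insert("abc", result2)        # Thread2: insert succeeds
try:
    insert("abc", result1)    # Thread1: insert violates the unique key
    error = None
except sqlite3.IntegrityError as exc:
    error = exc
```

<p>The second insert raises <code>sqlite3.IntegrityError</code>, which is the duplicate-key exception from the last step of the schedule.</p>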
<p>This is a classic race condition. And here&#8217;s a small challenge: What&#8217;s the best way to solve it?</p>
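<p>One possible answer, offered as a sketch and not necessarily the best one, is to let the database arbitrate: make the insert tolerate an existing row. The example assumes SQLite, whose <code>INSERT OR IGNORE</code> does exactly that; other engines spell it differently. The table layout and function names are illustrative. Note that both requests still pay for the computation; only the exception goes away.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cache (input TEXT PRIMARY KEY, output TEXT)")


def compute(key):
    return key.upper()  # stand-in for the lengthy operation


def cached_compute(key):
    row = conn.execute(
        "SELECT output FROM cache WHERE input = ?", (key,)
    ).fetchone()
    if row is not None:
        return row[0]
    result = compute(key)
    # INSERT OR IGNORE is SQLite syntax: if another request already stored
    # this key, the insert becomes a no-op instead of raising.
    conn.execute(
        "INSERT OR IGNORE INTO cache (input, output) VALUES (?, ?)",
        (key, result),
    )
    return result


first = cached_compute("abc")
# Simulate the losing request: it computed on its own and inserts late.
conn.execute(
    "INSERT OR IGNORE INTO cache (input, output) VALUES (?, ?)",
    ("abc", compute("abc")),
)
row_count = conn.execute("SELECT COUNT(*) FROM cache").fetchone()[0]
```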
]]></content:encoded>
			<wfw:commentRss>http://www.algorithm.co.il/blogs/programming/a-simple-race-condition/feed/</wfw:commentRss>
		<slash:comments>14</slash:comments>
		</item>
		<item>
		<title>Actual Data Always Needs To Be Explicit</title>
		<link>http://www.algorithm.co.il/blogs/programming/actual-data-always-needs-to-be-explicit/</link>
		<comments>http://www.algorithm.co.il/blogs/programming/actual-data-always-needs-to-be-explicit/#comments</comments>
		<pubDate>Fri, 10 Apr 2009 21:01:10 +0000</pubDate>
		<dc:creator>lorg</dc:creator>
				<category><![CDATA[Databases]]></category>
		<category><![CDATA[Design]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[startup]]></category>
		<category><![CDATA[design patterns]]></category>

		<guid isPermaLink="false">http://www.algorithm.co.il/blogs/?p=247</guid>
		<description><![CDATA[This might seem obvious, but it wasn&#8217;t to me at first. Now I consider it a database design rule of thumb, or even a pattern. I&#8217;ll explain using an example. Consider an application where you also need automatic tagging of &#8230; <a href="http://www.algorithm.co.il/blogs/programming/actual-data-always-needs-to-be-explicit/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>This might seem obvious, but it wasn&#8217;t to me at first. Now I consider it a database design rule of thumb, or even a pattern.<br />
I&#8217;ll explain using an example. Consider an application where you also need automatic tagging of text. (As in generating keywords.) So you&#8217;ll have a table for objects that have textual fields (for instance, blog posts), and a table for tags.<br />
Now, you would need a many-to-many mapping between these two tables. Various ORMs might do this automatically for you, or you might add a PostTag table yourself, with foreign keys to the other tables.</p>
<p>You think this might be enough, as your smart tagging algorithm can add tags and attach tags to blog posts. If you want to change it manually, then no problem, you just modify any of these tables. For example, if the algorithm makes a mistake, you just erase the mapping and/or the tag.</p>
<p>The problems start when you want to run the algorithm more than once.<br />
First, the algorithm must not create duplicates on the second run. This is very easy to implement and doesn&#8217;t require any change to the DB. Now, let&#8217;s say that a taggable object (our blog post) has changed, and we want to update the tags accordingly. We might want to erase all the mappings we created for this object. No problem, also easy to do.</p>
<p>What about manual changes? Should these be erased as well? Probably not, at least not without alerting their creator. So we need to record the source of these mappings in an extra column of the mapping table, and use it to mark manually and algorithmically generated mappings differently.</p>
<p>How about deletions? What if the first time around, the algorithm made a mistake, and added a wrong tag, which was manually removed? Running the algorithm again will cause the tag to be added again. We need some way to mark <strong>&#8220;negative tags&#8221;</strong> , which are also pieces of information. The easiest way I found of doing this is adding a boolean &#8220;valid&#8221; column to the mapping table.</p>
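<p>Here is a rough sketch of a mapping table with both extra columns. It assumes SQLite, stores the tag as plain text instead of a foreign key to a tags table to keep the example short, and uses made-up names (<code>post_tag</code>, <code>retag</code>, <code>reject</code>); treat it as an illustration of the idea, not a finished schema.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE post_tag (
        post_id INTEGER NOT NULL,
        tag     TEXT    NOT NULL,
        source  TEXT    NOT NULL,            -- 'auto' or 'manual'
        valid   INTEGER NOT NULL DEFAULT 1,  -- 0 marks a negative tag
        PRIMARY KEY (post_id, tag)
    )
    """
)


def retag(post_id, suggested_tags):
    """Re-run the tagger without touching manual rows or negative tags."""
    # Drop only the valid mappings the algorithm itself created earlier.
    conn.execute(
        "DELETE FROM post_tag WHERE post_id = ? AND source = 'auto' AND valid = 1",
        (post_id,),
    )
    for tag in suggested_tags:
        # Rows that already exist (manual tags, negative tags) win.
        conn.execute(
            "INSERT OR IGNORE INTO post_tag (post_id, tag, source) "
            "VALUES (?, ?, 'auto')",
            (post_id, tag),
        )


def reject(post_id, tag):
    """Manually remove a tag by turning its mapping into a negative tag."""
    conn.execute(
        "UPDATE post_tag SET source = 'manual', valid = 0 "
        "WHERE post_id = ? AND tag = ?",
        (post_id, tag),
    )


def tags(post_id):
    rows = conn.execute(
        "SELECT tag FROM post_tag WHERE post_id = ? AND valid = 1 ORDER BY tag",
        (post_id,),
    )
    return [tag for (tag,) in rows]


retag(1, ["python", "cooking"])
reject(1, "cooking")                    # the algorithm got this one wrong
retag(1, ["python", "cooking", "sql"])  # the rejected tag does not come back
```

<p>After the second run, <code>tags(1)</code> contains only the tags that were never rejected, because the negative tag row blocks the re-insert.</p>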
<p>It&#8217;s important to note that this also applies to all mapping types and not just to many-to-many. So even when you don&#8217;t naturally need a separate table for a mapping, you should consider adding one, if the mapping is part of the actual data you keep. Also, if you need to keep extra data about the mapping itself, for example &#8220;relationship type&#8221; in a social network or &#8220;tag weight&#8221; as in our example, you would already have a separate table anyway.</p>
<p>I encountered this issue when I implemented my <a href="http://www.algorithm.co.il/blogs/index.php/programming/design/database-design-problem/">multiple source db-design</a>. A reminder: I had data collected from various sources, and then combined into a final merged record. The combining was done automatically.<br />
My mistake was that I only considered the records as pieces of data, and didn&#8217;t consider that the actual grouping of raw data records is also part of the information I keep. As such, I should have represented the groupings in a separate table, with the added columns, as I outlined in this blog post.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.algorithm.co.il/blogs/programming/actual-data-always-needs-to-be-explicit/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Database Design Problem</title>
		<link>http://www.algorithm.co.il/blogs/startup/database-design-problem/</link>
		<comments>http://www.algorithm.co.il/blogs/startup/database-design-problem/#comments</comments>
		<pubDate>Sun, 22 Feb 2009 20:55:55 +0000</pubDate>
		<dc:creator>lorg</dc:creator>
				<category><![CDATA[Databases]]></category>
		<category><![CDATA[Design]]></category>
		<category><![CDATA[startup]]></category>
		<category><![CDATA[Anti patterns]]></category>
		<category><![CDATA[design patterns]]></category>
		<category><![CDATA[Entity-Attribute-Value model]]></category>
		<category><![CDATA[Harvesting]]></category>
		<category><![CDATA[Inner platform effect]]></category>

		<guid isPermaLink="false">http://www.algorithm.co.il/blogs/?p=178</guid>
		<description><![CDATA[A few weeks ago, I had to work out a database design for my startup. I had a bit of a hard time deciding on a design direction, but after thinking about it, I settled on a design I was &#8230; <a href="http://www.algorithm.co.il/blogs/startup/database-design-problem/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>A few weeks ago, I had to work out a database design for my startup. I had a bit of a hard time deciding on a design direction, but after thinking about it, I settled on a design I was happy with.</p>
<p>While I was still making up my mind, I discussed the problem with a couple of friends, and to better describe the problem and the proposed solutions I wrote up a short document describing them. I decided to publish this document along with my choice and considerations. Maybe someone else will benefit from my choice, or at least from the alternatives I listed.</p>
<p><strong>Problem description:</strong><br />
We want to have a table with collected information from various sources.</p>
<p>For example, let&#8217;s say we want to collect information about paintings. We&#8217;d want to have a database holding, for each painting we know about, its dimensions, painter, description, link to an image file, etc. Since we collect this information from various sources (maybe harvest information from multiple websites), we would like our application to display each field either from all sources, or from the best source available.<br />
(Note: in my original formulation, being able to display the value from the best source was enough).</p>
<p><span id="more-178"></span></p>
<p>Let&#8217;s say our desired, &#8220;regular&#8221; table looks like this:</p>
<p>{<br />
field_a<br />
field_b<br />
field_c<br />
}</p>
<p><strong>Proposed Solution 1:</strong><br />
Add a source column, like so:</p>
<p>{<br />
field_a<br />
field_b<br />
field_c<br />
source<br />
}</p>
<p>Keep multiple records with different sources, and combine them at query time.</p>
<p>Downside:<br />
Combining at query time is really not good.</p>
<p><strong>Proposed solution 1*:</strong><br />
final table<br />
{<br />
field_a<br />
field_b<br />
field_c<br />
}</p>
<p>source table<br />
{<br />
field_a<br />
field_b<br />
field_c<br />
source<br />
}<br />
Application only queries the final table, while a &#8220;source table&#8221; with a source column is added. Now combination is done during harvest time.</p>
<p>Downsides:<br />
Two separate, almost identical tables. Not good for DRY karma.</p>
<p><strong>Proposed Solution 1**:</strong></p>
<p>{<br />
field_a<br />
field_b<br />
field_c<br />
source<br />
}</p>
<p>Same as solution 1, except now a null source indicates a combined (&#8220;final&#8221;) record. Record combination is done during harvest time, while the application queries this table for records with a null source.</p>
<p>Downsides:<br />
Complicates queries a bit, keeps final data with raw data.</p>
<p><strong>Proposed Solution 2:</strong><br />
{<br />
field_a<br />
field_a_source<br />
field_b<br />
field_b_source<br />
field_c<br />
field_c_source<br />
}</p>
<p>Downsides:<br />
Violates DRY, adds a lot of boilerplate code for handling each column. Doesn&#8217;t easily allow keeping unused data from different sources. For example, if source X was used for field_a, source Y&#8217;s field_a is discarded.<br />
Also, doesn&#8217;t cleanly allow the final field to be a combination of sources, for example, concatenation.</p>
<p><strong>Proposed solution 3:</strong><br />
Add another table, holding sources:</p>
<p>final table<br />
{<br />
field_a<br />
field_b<br />
field_c<br />
}</p>
<p>source table<br />
{<br />
field_name<br />
field_value<br />
field_source<br />
}</p>
<p>Queries are made against the final table, and record combination is done during harvest time.</p>
<p>Downsides:<br />
Implements a database in a database.<br />
For further reading on this solution see <a href="http://en.wikipedia.org/wiki/Entity-attribute-value_model">Entity-Attribute-Value Model</a>, and the <a href="http://en.wikipedia.org/wiki/Inner-Platform_Effect">Inner Platform Effect</a>.</p>
<p><strong>My Choice</strong><br />
I decided to use solution 1**. For some time I thought I should implement solution 3, but decided against it, mostly because of the inner platform effect, but also because of the other downsides listed (see references).<br />
I also added another column to solution 1, one that indicates for a given &#8220;source record&#8221; (a record with a non-null source) which is its combined record. Now, given a final record, I can also query for all its source records, which is desired for late data processing.<br />
I work with this solution in the following manner:<br />
Offline:<br />
1. Harvest from some source.<br />
2. Harvest from some other source.<br />
3. Run record combination algorithm.<br />
4. Run data analysis algorithm.<br />
5. Harvest from yet another source.<br />
6. Run record combination algorithm.<br />
7. Run data analysis algorithm.<br />
8. etc&#8230;</p>
<p>Online:<br />
Query only from records with a null source.</p>
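<p>As a rough sketch of this choice: one table, a nullable <code>source</code> column, and a <code>combined_id</code> column pointing each source record at its final record. Everything concrete here is an assumption made for illustration: SQLite, the <code>painting</code> table, matching records by <code>title</code>, and a combination rule that takes each field from the highest-priority source that has a value.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE painting (
        id          INTEGER PRIMARY KEY,
        title       TEXT,
        painter     TEXT,
        dimensions  TEXT,
        source      TEXT,     -- NULL marks a combined ("final") record
        combined_id INTEGER REFERENCES painting (id)
    )
    """
)


def harvest(source, title, painter, dimensions):
    """Offline: store one raw record from one source."""
    conn.execute(
        "INSERT INTO painting (title, painter, dimensions, source) "
        "VALUES (?, ?, ?, ?)",
        (title, painter, dimensions, source),
    )


def combine(source_priority):
    """Offline: rebuild the final records from the raw ones."""
    conn.execute("UPDATE painting SET combined_id = NULL")
    conn.execute("DELETE FROM painting WHERE source IS NULL")
    groups = {}
    for row in conn.execute(
        "SELECT id, title, painter, dimensions, source FROM painting"
    ).fetchall():
        groups.setdefault(row[1], []).append(row)
    for title, rows in groups.items():
        # Best source first; each field comes from the first source that has it.
        rows.sort(key=lambda r: source_priority.index(r[4]))
        painter = next((r[2] for r in rows if r[2] is not None), None)
        dimensions = next((r[3] for r in rows if r[3] is not None), None)
        final_id = conn.execute(
            "INSERT INTO painting (title, painter, dimensions) VALUES (?, ?, ?)",
            (title, painter, dimensions),
        ).lastrowid
        conn.executemany(
            "UPDATE painting SET combined_id = ? WHERE id = ?",
            [(final_id, r[0]) for r in rows],
        )


# Placeholder data: two sources describing the same painting.
harvest("site_a", "Painting X", "Painter A", None)
harvest("site_b", "Painting X", "P. A.", "50x70 cm")
combine(["site_a", "site_b"])

# Online: the application only looks at records with a null source.
final = conn.execute(
    "SELECT title, painter, dimensions FROM painting WHERE source IS NULL"
).fetchall()
```

<p>Going the other way, selecting the rows whose <code>combined_id</code> equals a final record&#8217;s id returns all of its source records.</p>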
<p>For further reading see also <a href="http://www.pgcon.org/2008/schedule/events/97.en.html">Database Anti Patterns</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.algorithm.co.il/blogs/startup/database-design-problem/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
	</channel>
</rss>

