Categories
Databases Design Programming startup

Actual Data Always Needs To Be Explicit

This might seem obvious, but it wasn’t to me at first. Now I consider it a database design rule of thumb, or even a pattern.
I’ll explain using an example. Consider an application that needs automatic tagging of text (as in generating keywords). So you’ll have a table for objects that have textual fields (for instance, blog posts), and a table for tags.
Now, you would need a many-to-many mapping between these two tables. Various ORMs might do this automatically for you, or you might add a PostTag table yourself, with foreign keys to the other tables.
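A minimal sketch of that starting point, using Python’s built-in sqlite3 module. The table names (post, tag, post_tag) and the helper functions are made up for illustration; an ORM would generate something similar.

```python
import sqlite3

# Hypothetical schema: posts, tags, and a plain many-to-many mapping table.
# Note: SQLite only enforces the REFERENCES clauses if foreign keys are
# switched on with "PRAGMA foreign_keys = ON"; here they are documentation.
SCHEMA = """
CREATE TABLE post (id INTEGER PRIMARY KEY, body TEXT NOT NULL);
CREATE TABLE tag  (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE post_tag (
    post_id INTEGER NOT NULL REFERENCES post(id),
    tag_id  INTEGER NOT NULL REFERENCES tag(id),
    PRIMARY KEY (post_id, tag_id)
);
"""


def open_db():
    """Create an in-memory database with the schema above."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def attach_tag(conn, post_id, name):
    """Create the tag if needed and map it to the post; safe to call twice."""
    with conn:
        conn.execute("INSERT OR IGNORE INTO tag (name) VALUES (?)", (name,))
        tag_id = conn.execute(
            "SELECT id FROM tag WHERE name = ?", (name,)
        ).fetchone()[0]
        conn.execute(
            "INSERT OR IGNORE INTO post_tag (post_id, tag_id) VALUES (?, ?)",
            (post_id, tag_id),
        )


def tags_for(conn, post_id):
    """Return the tag names attached to a post, sorted alphabetically."""
    rows = conn.execute(
        "SELECT tag.name FROM post_tag "
        "JOIN tag ON tag.id = post_tag.tag_id "
        "WHERE post_tag.post_id = ? ORDER BY tag.name",
        (post_id,),
    )
    return [row[0] for row in rows]
```

The primary key on (post_id, tag_id) together with INSERT OR IGNORE is what keeps a repeated run from creating duplicate mappings.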

You might think this is enough, as your smart tagging algorithm can add tags and attach them to blog posts. If you want to change something manually, then no problem, you just modify any of these tables. For example, if the algorithm makes a mistake, you just erase the mapping and/or the tag.

The problems start when you want to run the algorithm more than once.
First, the algorithm must not create duplicates on the second run. This is very easy to implement and doesn’t require any change to the DB. Now, let’s say that a taggable object (our blog post) has changed, and we want to update the tags accordingly. We might want to erase all the mappings we created for this object. No problem, also easy to do.

What about manual changes? Should these be erased as well? Probably not, at least not without alerting their creator. So we need to record the source of these mappings in an extra column of the mapping table, and use it to mark manually and algorithmically generated mappings differently.

How about deletions? What if the first time around, the algorithm made a mistake, and added a wrong tag, which was manually removed? Running the algorithm again will cause the tag to be added again. We need some way to mark “negative tags”, which are also pieces of information. The easiest way I found of doing this is adding a boolean “valid” column to the mapping table.
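Here is a rough sketch of a mapping table with both extra columns, again with sqlite3. To keep it short I store the tag name directly in the mapping table instead of a foreign key; the column values ('manual', 'auto') and the function names are my own assumptions for the example, not a fixed recipe.

```python
import sqlite3

# Hypothetical mapping table: "source" records who created the row,
# "valid" = 0 marks a negative tag (a tag that was explicitly rejected).
SCHEMA = """
CREATE TABLE post_tag (
    post_id INTEGER NOT NULL,
    tag     TEXT    NOT NULL,
    source  TEXT    NOT NULL CHECK (source IN ('manual', 'auto')),
    valid   INTEGER NOT NULL DEFAULT 1,
    PRIMARY KEY (post_id, tag)
);
"""


def open_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def retag(conn, post_id, suggested):
    """Re-run of the algorithm: drop its old mappings, keep manual rows."""
    with conn:
        conn.execute(
            "DELETE FROM post_tag WHERE post_id = ? AND source = 'auto'",
            (post_id,),
        )
        # OR IGNORE skips tags that already have a manual row,
        # including negative ones, so they are never overwritten.
        conn.executemany(
            "INSERT OR IGNORE INTO post_tag (post_id, tag, source) "
            "VALUES (?, ?, 'auto')",
            [(post_id, tag) for tag in suggested],
        )


def reject(conn, post_id, tag):
    """Manual removal: keep the row, but as a negative tag."""
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO post_tag (post_id, tag, source, valid) "
            "VALUES (?, ?, 'manual', 0)",
            (post_id, tag),
        )


def visible_tags(conn, post_id):
    """Tags to display: valid rows only, sorted by name."""
    rows = conn.execute(
        "SELECT tag FROM post_tag WHERE post_id = ? AND valid = 1 "
        "ORDER BY tag",
        (post_id,),
    )
    return [row[0] for row in rows]
```

With this layout, calling retag, then reject for a wrong tag, then retag again leaves the rejected tag hidden, because the manual negative row blocks the algorithm from re-adding it.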

It’s important to note that this also applies to all mapping types and not just to many-to-many. So even when you don’t naturally need a separate table for a mapping, you should consider adding one, if the mapping is part of the actual data you keep. Also, if you need to keep extra data about the mapping itself, for example “relationship type” in a social network or “tag weight” as in our example, you would already have a separate table anyway.

I encountered this issue when I implemented my multiple-source DB design. A reminder: I had data collected from various sources, which was then combined into a final merged record. The combining was done automatically.
My mistake was that I only considered the records as pieces of data, and didn’t consider that the actual grouping of raw data records is also part of the information I keep. As such, I should have represented the groupings in a separate table, with the added columns, as I outlined in this blog post.

Categories
Databases Design startup

Database Design Problem

A few weeks ago, I had to work out a database design for my startup. I had a bit of a hard time deciding on a design direction, but after thinking about it, I settled on a design I was happy with.

While I was still making up my mind, I discussed the problem with a couple of friends, and to better describe the problem and the proposed solutions I wrote up a short document describing them. I decided to publish this document along with my choice and considerations. Maybe someone else will benefit from my choice, or at least from the alternatives I listed.

Problem description:
We want to have a table with collected information from various sources.

For example, let’s say we want to collect information about paintings. We’d want a database that holds, for each painting we know about, its dimensions, painter, description, link to an image file, etc. Since we collect this information from various sources (maybe harvest information from multiple websites), we would like our application to display each field either from all sources, or from the best source available.
(Note: in my original formulation, being able to display the value from the best source was enough).
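To make the requirement concrete, here is a small Python sketch of the two display modes. It only illustrates the problem statement, not any particular database design; the source names, their ranking, and the sample values are all invented for the example.

```python
# Hypothetical ranking of sources: a lower number means a more trusted source.
SOURCE_RANK = {"museum_site": 0, "encyclopedia": 1, "forum": 2}


def merge_best(records, rank=SOURCE_RANK):
    """For each field, keep the value from the best-ranked source that has it.

    records is a list of (source_name, fields_dict) pairs.
    Returns {field: (value, source_name)}.
    """
    merged = {}
    for source, fields in sorted(records, key=lambda rec: rank[rec[0]]):
        for name, value in fields.items():
            # Missing values (None) never win; the first source seen wins.
            if value is not None and name not in merged:
                merged[name] = (value, source)
    return merged


def all_values(records, field):
    """Every known value of one field, paired with the source it came from."""
    return [
        (source, fields[field])
        for source, fields in records
        if fields.get(field) is not None
    ]


# Made-up raw records for a single painting.
records = [
    ("forum", {"painter": "A. Painter", "dimensions": "40 x 30 cm"}),
    ("museum_site", {"painter": "Anna Painter", "dimensions": None}),
]
merged = merge_best(records)
```

In this sample, the painter comes from the higher-ranked source, while the dimensions fall back to the lower-ranked one because the better source has no value for that field.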

Categories
Compilation Math Programming Python

Interesting links – 4

http://ivory.idyll.org/articles/advanced-swc/

“Intermediate and Advanced Software Carpentry in Python” is an excellent read by Titus Brown. If you feel you’re good with Python but want to improve, or if you are an experienced programmer who wants to get better, this is a good place to go. I liked it.

http://c2.com/cgi/wiki?AlternateHardAndSoftLayers

During the time I was working on my Compilation course, I was thinking about the challenge of writing yacc in yacc. Well, I went searching for “yacc in yacc” and stumbled across this page. It is a part of a very strange ‘pattern wiki’. It has fascinating discussions of the subjects in it, but I don’t like the old-style wiki navigation. Still worth a taste.

http://www.scottaaronson.com/writings/bignumbers.html

Do you know the kind of duel where two people need to name the biggest number? If you thought something along the lines of ‘hey, I’ll write something with a lot of nines’, well, you are in for a surprise. This article has a nice solution for the problem. Excellent read, especially if you are into computation theory. I really liked that one.