A few weeks ago, I had to work out a database design for my startup. I had a bit of a hard time deciding on a direction, but after thinking it over, I settled on a design I was happy with.
While I was still making up my mind, I discussed the problem with a couple of friends, and to describe it better I wrote up a short document covering the problem and the proposed solutions. I decided to publish that document along with my choice and considerations. Maybe someone else will benefit from my choice, or at least from the alternatives I listed.
Problem description:
We want to have a table with information collected from various sources.
For example, let’s say we want to collect information about paintings. We’d want a database that holds, for each painting we know about, its dimensions, painter, description, a link to an image file, etc. Since we collect this information from various sources (perhaps by harvesting multiple websites), we would like our application to display each field either from all sources, or from the best source available.
(Note: in my original formulation, being able to display the value from the best source was enough).
Let’s say our desired “regular” table looks like this:
{
field_a
field_b
field_c
}
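For the paintings example, a concrete version of this baseline might look like the following (table, column, and type names are illustrative, not my actual schema):

CREATE TABLE painting (
    id      INTEGER PRIMARY KEY,
    field_a TEXT,  -- e.g. painter
    field_b TEXT,  -- e.g. dimensions
    field_c TEXT   -- e.g. link to an image file
);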
Proposed Solution 1:
Add a source column, like so:
{
field_a
field_b
field_c
source
}
Keep multiple records with different sources, and combine them at query time.
Downside:
Combining at query time is really not good; the sketch below shows why.
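Assuming a painting_id column that identifies which painting a record describes (the schemas above omit it), and a made-up preference order between two hypothetical sources, the read query needs one correlated subquery per field:

SELECT p.painting_id,
       (SELECT field_a
          FROM painting
         WHERE painting_id = p.painting_id
           AND field_a IS NOT NULL
         ORDER BY CASE source
                    WHEN 'museum_site'  THEN 1
                    WHEN 'auction_site' THEN 2
                    ELSE 3
                  END
         LIMIT 1) AS field_a
       -- ...the same subquery repeated for field_b, field_c,
       -- and every other field
FROM (SELECT DISTINCT painting_id FROM painting) AS p;

Every read pays for all of these subqueries, on every field, every time.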
Proposed Solution 1*:
final table
{
field_a
field_b
field_c
}
source table
{
field_a
field_b
field_c
source
}
The application queries only the final table, while a “source table” with a source column holds the raw records. Combination is now done at harvest time.
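The gain over solution 1 is that the combination runs once per harvest instead of on every read. A minimal sketch, with hypothetical table names and MAX standing in for a real per-field combination rule:

DELETE FROM painting_final;

INSERT INTO painting_final (painting_id, field_a, field_b, field_c)
SELECT painting_id,
       MAX(field_a),  -- stand-in rule: any non-null value wins;
       MAX(field_b),  -- a real rule would rank sources, as above
       MAX(field_c)
FROM painting_source
GROUP BY painting_id;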
Downsides:
Two separate, almost identical tables. Not good for DRY karma.
Proposed Solution 1**:
{
field_a
field_b
field_c
source
}
Same as solution 1, except that a null source now indicates a combined (“final”) record. Record combination is done at harvest time, and the application queries this table only for records with a null source.
Downsides:
Complicates queries a bit, and keeps final data in the same table as the raw data.
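Concretely, the application’s read path under this scheme becomes (same illustrative table name as before):

SELECT field_a, field_b, field_c
FROM painting
WHERE source IS NULL;

If the database supports partial indexes, one covering only the rows with a null source keeps this query cheap.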
Proposed Solution 2:
{
field_a
field_a_source
field_b
field_b_source
field_c
field_c_source
}
Downsides:
Violates DRY and adds a lot of boilerplate code for handling each column. It also doesn’t easily allow keeping unused data from different sources: if source X was used for field_a, source Y’s field_a is discarded.
Also, it doesn’t cleanly allow a final field to be a combination of sources, for example a concatenation.
Proposed Solution 3:
Add another table, holding sources:
final table
{
field_a
field_b
field_c
}
source table
{
field_name
field_value
field_source
}
Queries are made against the final table, and record combination is done at harvest time.
Downsides:
Implements a database in a database.
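To make the “database in a database” point concrete: reconstructing even one record from the source table takes a self-join per field. Assuming a record_id column tying together the rows of one painting (the sketch above omits it):

SELECT a.field_value AS field_a,
       b.field_value AS field_b,
       c.field_value AS field_c
  FROM source_table AS a
  JOIN source_table AS b ON b.record_id = a.record_id
  JOIN source_table AS c ON c.record_id = a.record_id
 WHERE a.field_name = 'field_a'
   AND b.field_name = 'field_b'
   AND c.field_name = 'field_c';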
For further reading on this solution, see the Entity-Attribute-Value Model and the Inner Platform Effect.
My Choice
I decided to use solution 1**. For some time I thought I should implement solution 3, but decided against it, mostly because of the inner platform effect, and also because of the other downsides listed (see references).
I also added another column to solution 1**, one that indicates, for a given “source record” (a record with a non-null source), which combined record it belongs to. Now, given a final record, I can also query for all of its source records, which is desirable for later data processing.
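Put together, the table I ended up with looks roughly like this (combined_id being the extra column just described; names and types are illustrative):

CREATE TABLE painting (
    id          INTEGER PRIMARY KEY,
    field_a     TEXT,
    field_b     TEXT,
    field_c     TEXT,
    source      TEXT,     -- NULL marks a combined ("final") record
    combined_id INTEGER REFERENCES painting(id)
                          -- for source records: their final record
);

-- all source records behind final record 42 (an arbitrary example id):
SELECT * FROM painting WHERE combined_id = 42;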
I work with this solution in the following manner:
Offline:
1. Harvest from some source.
2. Harvest from some other source.
3. Run record combination algorithm (sketched after this list).
4. Run data analysis algorithm.
5. Harvest from yet another source.
6. Run record combination algorithm.
7. Run data analysis algorithm.
8. etc…
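The combination step itself, heavily simplified, could look like the following under the schema above. It assumes new source records have already been matched to their final record via combined_id, and reuses MAX as a stand-in rule; the matching, and any real per-field rule, is where the actual work happens:

UPDATE painting
   SET field_a = (SELECT MAX(s.field_a) FROM painting AS s
                   WHERE s.combined_id = painting.id),
       field_b = (SELECT MAX(s.field_b) FROM painting AS s
                   WHERE s.combined_id = painting.id),
       field_c = (SELECT MAX(s.field_c) FROM painting AS s
                   WHERE s.combined_id = painting.id)
 WHERE source IS NULL;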
Online:
Query only records with a null source.
For further reading see also Database Anti Patterns.
Why not –
source
{
id
type (website or … can be normalized to type_id)
name
… (timestamp or whatever else is source appropriate)
}
item
{
id
name
…(whatever is unique across all sources)
}
itemfromsource
{
id
source_id
item_id
}
@jack9:
First, thank you for your comment and proposed solution.
Regarding normalization, I simply left it out of the descriptions to keep the actual problem and proposed solutions clear.
Regarding your solution, since I ended up using solution 1**, I had a one-to-many mapping between items and sources, not many-to-many as you suggest. If I understand your idea correctly, you suggest keeping only the combined (final) data, and using a join table to keep track of all sources involved in creating a combined record.
There are two problems with this: (a) you don’t keep the source data for later analysis, and if you do, I don’t see a way to differentiate between source records and final records; (b) for a final record, there seems to be no way to tell, for a single field, which source it came from. You could change your solution by adding a field_name to itemfromsource (sketched below), but I don’t think the resulting solution is particularly elegant.
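For reference, this is how I read that amendment (column types are illustrative):

CREATE TABLE itemfromsource (
    id         INTEGER PRIMARY KEY,
    source_id  INTEGER REFERENCES source(id),
    item_id    INTEGER REFERENCES item(id),
    field_name TEXT  -- which field of the item came from this source
);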
Just something to consider, but this type of “non-normal” data structure is something that couchdb would be great at.
You could have a “generic entity” entry and then programmatically check for the existence of other document attributes and expand your behavior if they exist.
@Jay:
Actually, before I started working on the database, I considered using couchdb, and rejected it. I did so because:
1. It didn’t seem “ready enough” for me.
2. I did not find anyone I know who already worked with it (or knew about it, for that matter. I was the one telling them about it!).
3. My hosting doesn’t support it.
4. Relational databases are good enough.
Now, under other circumstances, being the first one “in my neighborhood” to use it could have been really fun, but I thought that for my startup it wasn’t the right way to go. It was an unnecessary risk, with not much to gain (see point no. 4).
So I decided that for this startup I’d stick to SQL, and for my next fun project I would acquaint myself with couchdb.
As for your actual suggestion: I’m not well versed in couchdb usage, but I’m not sure I like it, since when querying for records I usually don’t want the extra baggage of source records/data. I know I might be optimizing prematurely, but I think this solution is a bit too heavy. Also, it seems it would keep source data and final data in different forms, which would make them harder to process.