Lately, I’ve mostly been working on my startup. It’s a web-application, and one of the first things I’ve written was a cache mechanism for some lengthy operations. Yesterday, I found a classic race-condition in that module. I won’t present the code itself here, instead I’ll try to present the essence of the bug.
Consider a web application, required to do lengthy operations from time to time, either IO bound, or CPU bound. To save time, and maybe also bandwidth or CPU time, we are interested in caching the results of these operations.
So, let’s say we create some database table, that has the following fields:
- Unique key: input
- use_count *
- Last use date *
I’ve marked with an asterisk the fields that are optional, and are related to managing the size of the cache. These are not relevant at the moment.
Here is how code using the cache would look like (in pseudocode form):
result = cache.select(input) if result: return result result = compute(input) cache.insert(input, result) return result
This code will work well under normal circumstances. However, in a multithreaded environment, or any environment where access to the database is shared, there is a race-condtion: What happens if there are two requests for the same input at about the same time?
Here’s a simple scheduling that will show the bug:
Thread1: result = cache.select(input). result is None
Thread1: result = compute(input)
Thread2: result = cache.select(input) result is None
Thread2: result = compute(input)
Thread2: cache.insert(input, result)
Thread1: cache.insert(input, result) – exception – duplicate records for the unique key input!
This is a classic race condition. And here’s a small challenge: What’s the best way to solve it?
This might seem obvious, but it wasn’t to me it first. Now I consider it a database design rule of thumb, or even a patten.
I’ll explain using an example. Consider an application where you also need automatic tagging of text. (As in generating keywords.) So you’ll have a table for objects that have textual fields (for instance, blog posts), and a table for tags.
Now, you would need a many-to-many mapping between these two tables. Various ORMs might do this automatically for you, or you might add a PostTag table yourself, with foreign keys to the other tables.
You think this might be enough, as your smart tagging algorithm can add tags and attach tags to blog posts. If you want to change it manually, then no problem, you just modify any of these tables. For example, if the algorithm makes a mistake, you just erase the mapping and/or the tag.
The problems start when you want to run the algorithm more than once.
First, the algorithm must not create duplicates on the second run. This is very easy to implement and doesn’t require any change to the DB. Now, let’s say that a taggable object (our blog post) has changed, and we want to update the tags accordingly. We might want to erase all the mappings we created for this object. No problem, also easy to do.
What about manual changes? Should these be erased as well? Probably not, at least not without alerting their creator. So we need to record the source of these mappings in an extra column of the mapping table, and use it to mark manually and algorithmically generated mappings differently.
How about deletions? What if the first time around, the algorithm made a mistake, and added a wrong tag, which was manually removed? Running the algorithm again will cause the tag to be added again. We need some way to mark “negative tags” , which are also pieces of information. The easiest way I found of doing this is adding a boolean “valid” column to the mapping table.
It’s important to note that this also applies to all mapping types and not just to many-to-many. So even when you don’t naturally need a separate table for a mapping, you should consider adding one, if the mapping is part of the actual data you keep. Also, if you need to keep extra data about the mapping itself, for example “relationship type” in a social network or “tag weight” as in our example, you would already have a separate table anyway.
I encountered this issue when I implemented my multiple source db-design. A reminder: I had data collected from various sources, and then combined together to a final merged record. The combining was done automatically.
My mistake was that I only considered the records as pieces of data, and didn’t consider that the actual grouping of raw data records is also part of the information I keep. As such, I should have represented the groupings in a separate table, with the added columns, as I outlined in this blog post.
A few weeks ago, I had to work out a database design for my startup. I had a bit of a hard time deciding on a design direction, but after thinking about it, I settled on a design I was happy with.
While I was still making up my mind, I discussed the problem with a couple of friends, and to better describe the problem and the proposed solutions I wrote up a short document describing them. I decided to publish this document along with my choice and considerations. Maybe someone else will benefit from my choice, or at least from the alternatives I listed.
We want to to have a table with collected information from various sources.
For example, let’s say we want to collect information about paintings. We’d want to have a database holding for each painting we know about its dimensions, painter, description, link to an image file, etc. Since we collect this information from various sources (maybe harvest information from multiple websites), we would like our application to display each field either from all sources, or from the best source available.
(Note: in my original formulation, being able to display the value from the best source was enough).
Lately I’ve been developing a website. One issue that I’ll probably need to address in the near future is “persistent predicates”. By “persistent predicates” I mean the problem of treating predicates as data.
Consider the following situation: you are developing some big rss reader/aggregator and you want to allow users to specify handling rules. How would you keep these rules in memory, and how would you keep them on disk?
Obviously, this problem was solved before. Just consider email filters, or even packet filters in ethereal.
One way of approaching the problem is implementing simple predicate templates:
“%field contains %s” where field is subject, or body, etc.
Once that is accomplished, you can specify that a “filter” is some combination (for example logical and, or logical or) of multiple predicates. To store this, we’ll have an actual predicate table (or pickle) with their data, and a one-to-many mapping of filters to predicates.
Another option is allowing just some very simple predicates, and a filter will just “point” to (have an id/name of) the required predicate, and the required data. In this option, all data is stored with the filter.
A more complicated solution is to implement some logical serialize-able lanugage (such as the expression trees I used for diStorm or PyKoan). Using this language, the predicates can be very dynamic, and be combined and manipulated programmatically. This solution might be overkill for many projects though.
An interesting issue regarding handling of predicates, is their application to constraint solving. However, this is an issue for a future post. Suffice it to say, that when writing PyKoan I’m using a constraint solver. Since I’m representing predicates with expression trees, the ability to analyze and manipulate predicates is very handy.
Besides looking at existing solutions, I’m very curious to hear other peoples’ opinions. Feel free to write about your preferred solution in the comments.
When implementing the VM, I had to keep track of state. The state of the VM includes the registers, virtual variables and memory. Fortunately, keeping track of state information is pretty easy. Basically, it amounts to having a dict, where the keys are registers or variables, and the values are, well, their values.
Holding the state of registers is a bit involved by the fact that registers may overlap, as I mentioned in a previous article. To handle this, the class keeping track of the state information, upon seeing a request to change the value of a register, propagates this change to its parent and child registers.
Holding memory might be different though. At a first glance, it would seem that memory should be kept in some buffer. However, it is much easier to keep the memory in a python dict as well, and treat that dict as if it was a sparse array. This implementation allows programs to write at address 10, and at address 10000, without requiring the VM to keep track of all the addresses in between. (Of course, we are not going to implement paging just now :)
I really liked this idea of treating dicts as sparse lists. In fact, it could work even better with some tree data structure instead of a hash table. With a tree data structure your keys would be sorted, and you could do slices.
I went ahead and tried using tree data structures for this, but it seems to be more trouble than it’s worth, so I think that unless it becomes a real timing issue, I’ll let it go.
I did have some fun wrapping the memory in some class to abstract away the data structure, and to give it a few more capabilities:
In : import vm In : m = vm.VMMemory() In : m.set(0, "hello world!\r\n") In : m.set(51, "foobar, foobar\x00") In : m.set(77, "a"*30) In : print m 0000: 68656c6c6f20776f726c64210d0a???? hello world!..?? 0033: 666f6f6261722c20666f6f62617200?? foobar, foobar.? 004d: 61616161616161616161616161616161 aaaaaaaaaaaaaaaa 005d: 6161616161616161616161616161???? aaaaaaaaaaaaaa??
The reason I put up the LRU cache challenge up, was that I couldn’t think of a good solution to the problem without using linked lists. This has been pointed to by Adam and Erez as well. Adam commented on this, and Erez’ solution to the problem was algorithmically identical to mine.
So how to solve the challenge? Here are the two possible solutions I thought about:
- Use a dict for lookup, and each element’s age is indicated by its position in a linked list. This is the solution Erez and I implemented.
- Keep a ‘last time of use’ indicator for each element. This could be just a regular int, incremented by 1 for each lookup. Keep the elements in a min heap, and when there are too many elements, pop them using the minimum heap.
Generally, I consider the first solution more elegant. It doesn’t rely on an integer to work, so it could work ‘indefinitely’. Of course, the second solution can be also made to work indefinitely, with some upkeep from time to time. (The added time cost of the upkeep may be amortized over other actions.)
If you can think of some other, more elegant solution, I’ll be happy to hear about it.
So, given that a linked list solution is more elegant, we come to the crux of the problem: what to do in Python? The Python standard library does not contain a linked list implementation as far as I know. As a result, Python programmers are encouraged to use the list type, which is an array. This is just as well: for most intents and purposes, the list type is good enough.
I tried to think a little about other cases where a linked list was more appropriate, and I didn’t come up with any more such cases. If you come up with any such case, I’ll be happy to hear about it.
After looking for a public implementation, and not finding one that seemed good enough, I decided to go ahead and write my own.
Out of curiousity, I also did a small comparison of runtime speeds between my implementation of a linked list, and the list data type. I tried a test where a linked list has an obvious advantage (complexity wise)- removing elements from the middle of the list. The Python list won up to somewhere in the thousands of elements. (Of course, list is implemented in C, and mine is in pure Python).
What is my conclusion from all of this? The same as the conventional wisdom: use the list data type almost always. If you find yourself in need of a linked list, think long and hard (well, not too long) about your problem and solution. There’s a good chance that either you can use the built-in list with an equivalent solution, or that a regular list will still be faster, for most cases. Of course, if you see no other way – do what you think is best.
Formal languages have a knack of giving some output, and then later doing something completely different. For example, take the “Halting Problem“, but this is probably too theoretical to be of any relevance… so read on for something a bit more practical. We are going to go down the rabbit hole, to the ‘in-between’ space…
My interest was first piqued when I encountered the following annoyance – some websites would use transparent layers to prevent you from:
- Marking and copying text.
- Left-clicking on anything, including:
- images, to save them,
- just the website, to view its source –
- and so on and so forth…
Now I bet most intelligent readers would know how to pass these minor hurdles – but mostly just taking the steps is usually deterrent enough to prevent the next lazy guy from doing anything. So I was thinking, why not write a browser – or just a Firefox plugin, that will allow us to view just the top-level of any website?
This should be easy enough to do, but if it bothered enough sites (which it probably won’t), and they fought back, there would be a pretty standard escalation war. However, since the issue is not that major, I suspect it wouldn’t matter much.
Now comes the more interesting part. Unlike preventing someone from copying text, html (plus any ‘sub-languages’ it may use) may be used to display one thing, and to be read like a different thing altogether. The most common example is with spam – displaying image spam instead of text. When that was countered by spam filters, animated gif files were used. Now you have it – your escalation war, par excellence. This property of html was also used by honeypots to filter comment-spam, as described in securiteam. In this the securiteam blog post by Aviram, the beginning of another escalation war is described. There are many more examples of this property of html.
All of these examples come from html’s basic ability to specify what do display, and being able to seem to display completely different things. There are actually two parsers at work here – one is the ‘filter’ – its goal is to filter out some ‘bad’ html, and the other is a bit more complicated – it is the person reading the browser’s output (it may be considered to be the ‘browser + person’ parser) . These two parsers operate on completely different levels of html. Now, I would like to point out that having to parsers reading the same language is a common insecurity pattern. HTML has a huge space between what is expressible, and what is visible. In that space – danger lies.
As another, simpler example, consider phishing sites. These are common enough nowadays. How does your browser decide if the site you are looking at is actually a phishing site? Among other things – reading the code behind the site. However, this code can point to something completely different then what is being displayed. In this ‘invisible’ space – any misleading code can live. In that way, the spammer may pretend to be a legitimate site for the filter, but your run-of-the-mill phishing site for the human viewer. This misleading code in the ‘invisible space’ may be used to good – like a honeypot against some comment-spammer, or it may be used for different purposes – by the spammer himself.
Now comes the interesting part. The “what to do part”. For now let me just describe it theoretically, and later work on its practicality. I suggest using a ‘visibility browser’. This browser will use some popular browser (Internet Explorer, Firefox, Safari, Opera, etc.. ) as its lower level. This lower level browser will render the website to some buffer, instead of the screen. Now, our ‘visibility browser’ will OCR all of the visible rendered data, and restructure it as valid HTML. This ‘purified’ html may now be used to filter any ‘bad’ sites – whichever criterion you would like to use for ‘bad’.
I know, I know, this is not practical, it is computationally intensive etc etc… However, it does present a method to close down that nagging ‘space’, this place between readability and visibility, where bad code lies. I also know that the ‘visible browser’ itself may be targeted, and probably quite easily. Those attacks will have to rely on implementation faults of the software, or some other flaw, as yet un-thought-of. We all know there will always be bugs. But it seems to me that the ‘visibility browser’ does close, or at least cover for a time, one nagging design flaw.
While programming some bigger projects, and not some home-brew script, I used to wonder what to do with exceptions coming from lower level and library modules. The ‘home script’ approach to exceptions is “let it rise” – usually because it indicates an error anyway, and the script needs to shut-down. If there is any cleanup to be done, it will happen in the __del__ functions and finally clause as required.
After getting some experience with handling problems and writing applications that continue to work despite exceptions, I’ve come to the conclusion that the best practice approach is that each of the project’s module must raise its own type of exception. That way, in higher level modules that catch the exception you can set a concrete policy about ‘what to do’.
An example: I have some low level module named “worker”. Let’s say this module’s job is to copy files around efficiently. The higher level module, “manager” decides on ‘policy’ about what to do with files (copy them, delete them, etc…). Now, the “worker” module raised an exception. If this exception is an IndexError, the higher level module can’t really decide – was this because of a programming error, or an OS error? Even if “worker” did define an interface that allowed for throwing IndexError’s in some cases, if there is a real programming error in the module, it will be missed. The correct way to go about it is to create an exception hierarchy for each module, which will be the ‘communication channel’ for errors. This way, any unexpected programming error will percolate up as such, and not get missed – and any exception handling mechanism on the way can decide what do to about it (log, die, ignore, etc…).
Every exception handling rule has exceptions: when writing ‘low-level’ library modules yourself, such as a dict-like class, it makes sense to use IndexError. Don’t get confused with other library modules though – If you are writing some communication library, it should follow the same rules described above for a module.