Two bugs don’t make a right

[Image: three lefts road sign]
At my new startup, we do a little bit of reasoning using implications. One of the more curious forms of implication is the negative form. Consider the following exaggerated example:

  • a place being kid-friendly implies that it is not romantic.
  • a place being a strip club implies it is not kid-friendly

If we allow negative implications to be transitive, then it would follow that since being a strip club makes a place less kid-friendly, it makes it more romantic. We don’t want that. So I had to write some code to specifically ignore that situation. Before writing that, in the best tradition of TDD I wrote a test for two chained negative implications. I implemented the code, the test passed and I was happy.
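The rule can be sketched roughly as follows. This is only an illustration: the function name `derive` and the edge-list representation are made up here and are not the startup's actual code. Positive implications chain freely, a negative implication may only end a chain, and nothing is derived from a negated property.

```python
from collections import defaultdict


def derive(positive, negative, start):
    """Return (implied, excluded) properties for `start`.

    positive: list of (a, b) pairs meaning "a implies b"
    negative: list of (a, b) pairs meaning "a implies not b"
    (Both representations are assumptions made for this sketch.)
    """
    pos = defaultdict(set)
    neg = defaultdict(set)
    for a, b in positive:
        pos[a].add(b)
    for a, b in negative:
        neg[a].add(b)

    # Positive implications are transitive: follow them as far as they go.
    implied = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in pos[node]:
            if nxt not in implied:
                implied.add(nxt)
                stack.append(nxt)

    # Take a single negative step from anything implied, then stop.
    # "not kid-friendly" tells us nothing about "romantic".
    excluded = set()
    for node in implied:
        excluded |= neg[node]
    return implied - {start}, excluded


negative = [("kid-friendly", "romantic"), ("strip club", "kid-friendly")]
implied, excluded = derive([], negative, "strip club")
print("implied:", implied)
print("excluded:", excluded)
```

With the two example rules, a strip club is excluded from being kid-friendly, and nothing at all is concluded about romance.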

For a while.

Fast forward a couple of weeks, and I try adding some negative implications, and the program doesn’t behave as expected. My code doesn’t work. I turn back to my test, check it out, and sure enough, all the things the test asserts as True are actually True, and the test does test the right thing.

Digging deeper, I discovered the issue. I had two bugs: the first was that the code handling chained negative implications wasn’t working right. The second was in my graph building algorithm – it seems that I was forgetting to add some edges. What made that second bug insidious was that it hid the effect of the first bug from the test – effectively making the test pass.

So – for me it was – two negative implications don’t mean a positive one, and two bugs don’t make a feature.


Collision: the story of the random bug

So here I was, trying to write some Django server-side code, when every once in a while, some test would fail.
Now, it is important to know that we are using any_model, a cute little library that allows you to specify only the fields you need when creating objects, and randomizes the rest (to help uncover more bugs).

In this particular instance, the test that was failing was trying to store objects on the server using an API, and then check that the new objects exist in the DB. Every once in a while, an object didn’t exist. It should be noted that the table with the missing rows had a Django ORM URLField.

So first things first, I changed the code to print the random seed it was using on every failure. Now the next time it failed (a day later), I had the random seed in hand.

I then proceeded to use that random seed – and now I had a reproducible bug – it failed every time, consistently.
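Here is a minimal sketch of that idea in plain Python. The helper name `run_with_seed` is hypothetical and has nothing to do with any_model's own API: it picks a seed, seeds the `random` module, and prints the seed only if the test fails, so the failing run can be replayed later.

```python
import random


def run_with_seed(test_func, seed=None):
    """Run test_func under a known random seed and report it on failure.

    Hypothetical helper for illustration. Pass `seed` to replay a failure.
    """
    if seed is None:
        seed = random.randrange(2 ** 32)
    random.seed(seed)
    try:
        test_func()
    except Exception:
        print("test failed with random seed %r" % seed)
        raise
    return seed


# Usage: the same seed gives the same "random" data on every run.
seen = []


def sample_test():
    seen.append(random.random())


used_seed = run_with_seed(sample_test)
run_with_seed(sample_test, seed=used_seed)
print(seen[0] == seen[1])
```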

The next step was finding the cause of the bug. To cut a long story short, it turns out that the test looked for an object with a specific URL. Which URL? The URL created for the first object (we had two).

The bug was that the second object was getting the same url as the first. I remind you, these urls are generated randomly. The troublesome url was

I leave it to you to guess (or check) what the chances of such a collision are
(the correct way to do that would be to check any_model’s code for generating urls, and not just say 1 in 2^32… :)

So I made sure the second object got a new url, and all was well, and the land had rest for forty years. (or less).


Fuzz-Testing With Nose

A few days ago, I found a bug in my website. The bug was in a new feature I added to the algorithm. The first thing I did was write a small unit-test to reproduce the bug. With that unit-test in hand, I then worked to fix the bug, and got this unit-test to pass.

As I had previously presumed this feature to be (relatively :) bug free, I decided that more testing was in order. This time however, a single test-case would not be enough – I needed to make sure that the trip-generation algorithm works in many cases. Enter fuzzing. The website generates trips according to trip preferences. Why not generate the trip preferences with a fuzzer, and then check if the planning algorithm chokes on them? While fuzzing is usually used to generate invalid input with the goal of causing the program to crash, in this case I’m generating valid input with the goal of causing the planning algorithm to fail.

Usually fuzzing is done with one of two techniques: exhaustive fuzzing, which goes systematically (possibly selectively) over the input space, and random fuzzing, which picks inputs at random, or “somewhat” randomly. In my case, the input space consists of “world data” (locations of attractions, restaurants, etc.) and trip preferences (intensity, required attractions, and so on). Since the input space is so large and “unstructured”, I found it much easier to go with random fuzzing.

In each test-case, I will generate a “random world”, and random trip preferences for that world.
Here is some sample code that shows how this might look:

trip_prefs.num_days = random.randint(0, 5)
trip_prefs.intensity = random.randint(0, 5)
if randbit():
    trip_prefs.schedule_lunch = True

Where randbit is defined like so:

def randbit(prob=0.5):
    # True with probability prob
    return random.random() < prob

This is all very well, but tests need to be reproducible. If a fuzzer-generated test case fails and I can’t recreate it to analyze the error and later verify that it is fixed, it isn’t of much use. To solve this issue, the input generation function receives some value, and sets the random seed with this parameter. Now, generating test cases is just a matter of generating a sequence of random values. Here is my code to do that:

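import random

NUM_FUZZ_TESTS = 100  # placeholder; the value I actually used isn't shown here

class FuzzTestBase(object):
    __test__ = False  # keep nose from collecting the base class itself
    def run_single_fuzz(self, random_seed):
        # overridden by each actual test class
        raise NotImplementedError
    def fuzz_test(self):
        random_seeds = [str(random.random()) for i in range(NUM_FUZZ_TESTS)]
        for seed in random_seeds:
            yield self.run_single_fuzz, seed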

FuzzTestBase is a base-class for actual test classes. Each test class just needs to define its own version of run_single_fuzz, and in it call random.seed(random_seed) and log random_seed.

This code uses nose’s ability to test generators: it assumes that a test generator yields test functions and their parameters.
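To make the pattern concrete, here is a sketch of what a test class built on FuzzTestBase might look like. The names `plan_trip` and `TestPlannerFuzz`, the value of NUM_FUZZ_TESTS and the asserted property are all made up for illustration; the real planner and its checks are not shown in this post. The loop at the end drives the generator by hand, which is roughly what nose does when it collects a generator test.

```python
import random

NUM_FUZZ_TESTS = 5  # hypothetical value for this sketch


class FuzzTestBase(object):
    __test__ = False

    def run_single_fuzz(self, random_seed):
        raise NotImplementedError

    def fuzz_test(self):
        # Seeds are drawn up front, before any test reseeds the generator.
        random_seeds = [str(random.random()) for i in range(NUM_FUZZ_TESTS)]
        for seed in random_seeds:
            yield self.run_single_fuzz, seed


def randbit(prob=0.5):
    return random.random() < prob


def plan_trip(num_days, intensity, schedule_lunch):
    # Hypothetical stand-in for the real planning algorithm.
    return ["day %d" % (i + 1) for i in range(num_days)]


class TestPlannerFuzz(FuzzTestBase):
    __test__ = True  # this one should be collected

    def run_single_fuzz(self, random_seed):
        random.seed(random_seed)
        print("random seed: %s" % random_seed)  # log it for later replay
        num_days = random.randint(0, 5)
        intensity = random.randint(0, 5)
        schedule_lunch = randbit()
        trip = plan_trip(num_days, intensity, schedule_lunch)
        assert len(trip) == num_days


# Drive the generator manually, one (function, seed) pair at a time.
for func, seed in TestPlannerFuzz().fuzz_test():
    func(seed)
```

To replay a failure, call run_single_fuzz directly with the seed string from the log.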

A few interesting issues:
* I generate the random seeds beforehand, so that calling random.seed() in the actual test case doesn’t affect the seed sequence.
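* Originally I used just random.random() as a seed instead of str(random.random()). The problem is that a seed read back from the log is then not the seed that was used, so the run is not reproducible. random.random() returns a floating point value x, for which usually x != eval(str(x)) (in Python 2, str() of a float keeps only 12 significant digits):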

In [10]: x = random.random()
In [11]: x == eval(str(x))
Out[11]: False

Even though x == eval(repr(x)) for that case, there’s still room for error. Unlike floating point numbers, it’s harder to go wrong with string equality. So str(random.random()) is just a cheap way to generate random strings.
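A small sketch of both points. It assumes that "%.12g" formatting is a fair stand-in for what Python 2's str() did to floats; in Python 3, str() and repr() of a float agree, so the formatting is done explicitly here. The helper `first_values` and the sample value are illustrative only.

```python
import random

# A float logged with 12 significant digits does not come back as the same float.
x = 0.1234567890123456
logged_float = "%.12g" % x
print(logged_float, float(logged_float) == x)

# A string seed survives a trip through a log file unchanged.
seed = str(random.random())


def first_values(seed, n=3):
    """Seed the generator and return the first n values it produces."""
    random.seed(seed)
    return [random.random() for _ in range(n)]


copied_from_log = "%s" % seed
print(first_values(seed) == first_values(copied_from_log))
```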

I’d recommend that if your testing mostly consists of selected test cases based on what you think is possible user behavior, you might want to add some fuzzed inputs. I originally started the fuzz-testing described in this blog-post to better test for a specific bug. After adding the fuzz-testing, I found another bug I didn’t know was there. This just goes to show how useful fuzzing is as a testing tool. The fact that it’s so easy to implement is just a bonus.


My Bad Memory, High Load, and Python

About a month ago the new Ubuntu 8.04 was released and I wanted a clean install. I downloaded an image and burned it to a CD. Just before installing, I tried “check CD for defects” and found a few. Turns out (*) this was because of bad memory – and memtest confirmed it.
So I went to the shop, replaced the bad memory, and also bought two new sticks. I went home to install the new Ubuntu, and after the installation, Firefox crashed. After rebooting back to memtest, I saw this:

[Image: memory errors in memtest]

Back at the computer shop, they asked me to reproduce the errors. Just firing up the computer and booting directly into memtest didn’t seem to do the trick, so I suspected that I had to overwork my computer a bit to reproduce this.

Since I was at the lab, I didn’t want to muck around too much.
So I thought, “what’s the quickest way to give your CPU a run around the block?”
That’s right – a tight loop:

while True:
    pass

However, this snippet doesn’t really play with the memory.

The next simplest thing to do, that also jiggles some ram is the following (and one of my favorites):

In [1]: x = 4**(4**4)
In [2]: y = 4**x

I will talk about this peculiar piece of code in a later post.

In any case, this snippet also didn’t reproduce the error. It is also quite unwieldy, as it raises a MemoryError after some time. Later at home I tried two more scripts.
The first is a variation on the one above:

x = 4**(4**4)
while True:
    try:
        y = 4**x
    except MemoryError:
        pass

I ran a few of those in parallel. However, my Ubuntu machine actually killed the processes running this one by one.

The second is smarter. It allocates some memory and then just copies it around:

import sys
import copy
megabytes = int(sys.argv[1])
x1 = [["a"*1000 + str(i) for i in range(1000)] for j in range(megabytes)]
while True:
    x2 = copy.deepcopy(x1)

Neither of these scripts reproduced the problem, which kept showing up at random, so I returned the computer to the lab. It turned out that the two replacement sticks and the two new sticks weren’t exactly identical, and that was the cause of the problem. So now my memory is well again.

As for the scripts above, I once wrote a similar script at work. I was asked to help stress-test some software, and the goal was to simulate a heavily used computer. A few lines of Python later and the testing environment was ready.

(*) – Finding out that it was a memory issue wasn’t as easy as it sounds. I didn’t think of running memtest. I checked the image on my HD with md5, and the hash didn’t match. I downloaded a second image, and again the hash didn’t match. I checked twice.
At this point I was really surprised: not only did the second check not match the published md5, it also didn’t match the first check. Some hours and plenty of voodoo later, a friend suggested running memtest, and the culprit was found.
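For the record, the same check can be done from Python. This is just a sketch using the standard hashlib module; the function name `md5_of_file` and the chunk size are arbitrary choices, not what I ran at the time.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Hashing the same file twice should always give the same digest. When it doesn’t, as happened to me, the file is probably fine and the hardware is not.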


Some Assembly Required No. 1

I’ve been working on some of the instruction tests in vial, and I wanted to test the implementation of LOOP variants. My objective was to make sure the vial version is identical to the real CPU version (as discussed here). To achieve this, I had to cover all of the essential behaviors of LOOP.

Well, using the framework Gil and I wrote, I hacked up some code that should cover the relevant cases:

code_template = """
mov edx, ecx ; control the start zf
mov ecx, eax ; number of iterations
mov eax, 0 ; will hold the result, also an iteration counter
loop_start:
    cmp eax, ebx    ; check if we need to change zf
    setz dh
    xor dh, dl      ; if required, invert zf
    inc eax         ; count the iteration
    cmp dh, 0       ; set zf
    loop%s loop_start
"""

for loop_kind in ['','z','nz']:
    code_text = code_template % loop_kind
    c = FuncObject(code_text)
    for start_zf_value in [0,1]:
        for num_iters in [1,4,10]:
            for when_zf_changes in [1,2,15]:
                c(num_iters, when_zf_changes, start_zf_value)

Note that c(…) executes the code both on vial’s VM, and on the real cpu. c.check() compares their return value (EAX) and flags after the execution. I also wanted to avoid other kinds of jumps in this test.

To check that the code ran the same number of times, I returned EAX as the number of iterations.
All the games with edx are there to make sure that I’m testing different zf conditions.
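To see what the snippet should return, here is a rough pure-Python model of it. It is a sketch and not part of vial: the function name `simulate_loop` is made up, and it assumes the three call arguments land in eax, ebx and ecx in that order, as the comments in the assembly suggest. The LOOP semantics modeled are the documented ones: decrement ECX, then jump if ECX is non-zero (LOOPZ additionally requires ZF set, LOOPNZ requires ZF clear).

```python
def simulate_loop(loop_kind, num_iters, when_zf_changes, start_zf_value):
    """Model the assembly snippet and return the final EAX (iterations run).

    loop_kind is '', 'z' or 'nz', matching loop, loopz and loopnz.
    num_iters must be at least 1, as in the test above.
    """
    dl = start_zf_value          # mov edx, ecx
    ecx = num_iters              # mov ecx, eax
    eax = 0                      # mov eax, 0
    while True:
        dh = 1 if eax == when_zf_changes else 0   # cmp eax, ebx / setz dh
        dh ^= dl                                  # xor dh, dl
        eax += 1                                  # inc eax
        zf = (dh == 0)                            # cmp dh, 0
        ecx = (ecx - 1) & 0xFFFFFFFF              # loop decrements ecx first
        if ecx == 0:
            break
        if loop_kind == 'z' and not zf:
            break
        if loop_kind == 'nz' and zf:
            break
    return eax


for loop_kind in ['', 'z', 'nz']:
    for start_zf_value in [0, 1]:
        print(repr(loop_kind), start_zf_value,
              simulate_loop(loop_kind, 4, 1, start_zf_value))
```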

The challenge for today:
Can you write a shorter assembly snippet that tests the same thing?


Issues in writing a VM – Part 1

Arkon and I decided to write a VM for vial. First though, a short explanation of what vial is:
vial is a project aimed at writing a general disassembler that outputs expression trees instead of text. On top of vial, we intend to write various code-analysis tools. The expression trees in the output should be an accurate description of all of the code’s actions.
(note: the x86 disassembler behind vial is Arkon’s diStorm.)

So why do we need a VM? Apart from it being ‘nice and all’, it is critical for testing.

Some time ago, I described writing a VM to test a compiler I wrote as university homework. It is a similar issue here.
The disassembler is written according to the x86 specification. If we just check its output against this specification, we are not doing much to verify the code’s correctness. This is evident when you try to implement such a testing module – you end up writing another disassembler, and testing it against the original one. There has to be a different test method, one that does not directly rely on the specification.

Enter the VM. If you write a program, you can disassemble it, and then try to execute the disassembly. If it yields the same output as the original program – your test passed.
This is a good testing method, because it can be easily automated, reach good code coverage, and it tests against known values.
Consider the following illustration:

[Figure: Testing Process]

Here we are testing a complete process, on the left-hand side, against a known valid value, the original program’s output, on the right-hand side. All of the boxes on the left-hand side are tested along the way. Of course, one test may miss. For example, both the VM and the disassembler may generate wrong output for register overflows. We can try to cover as many such cases as possible by writing good tests for this testing framework. In this case, good tests are either C programs, or binary programs. This is essentially what I was doing when I manually fuzzed my own compiler.
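The general shape of such a test can be sketched in a few lines of Python. Everything here is a toy: `differential_test`, `add32_reference` and `add32_buggy` are invented for illustration and stand in for "the original program" and "disassemble, then run on the VM". The point is that a known-good result is compared with the result of the process under test, and that a bug such as a missing 32-bit wraparound shows up only if some input actually triggers it.

```python
def differential_test(reference, candidate, inputs):
    """Run both implementations on each input and collect disagreements."""
    mismatches = []
    for args in inputs:
        expected = reference(*args)
        actual = candidate(*args)
        if expected != actual:
            mismatches.append((args, expected, actual))
    return mismatches


def add32_reference(a, b):
    # What a 32-bit register would hold after an ADD.
    return (a + b) & 0xFFFFFFFF


def add32_buggy(a, b):
    # Toy implementation under test: forgets to wrap on overflow.
    return a + b


print(differential_test(add32_reference, add32_buggy, [(1, 2), (10, 20)]))
print(differential_test(add32_reference, add32_buggy, [(1, 2), (0xFFFFFFFF, 1)]))
```

The first call finds nothing, because none of its inputs overflow. That is exactly the kind of miss that good test programs are meant to prevent.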

Once the VM is finished, we can start writing various optimizations for the disassembler’s generated output. We can test these optimizations by checking the VM’s output on the optimized code against the output on the original code. This makes the VM a critical milestone on the road ahead.


Manually fuzzing my own compiler

As I mentioned before, I had to write a compiler for simplified CPL. An obvious requirement was that the compiler generate correct code. A less obvious requirement, but important nonetheless, was that after a syntax error, the compiler keep parsing the source program.

Now, the default behavior of a parser generated by Bison for a syntax error is to return from the parsing function, yyparse. You may of course call yyparse again, but this would be mostly meaningless, since you have lost all your context. A simple example would be a language that has to have ‘program’ as the first word. Once you are past that token, you will not be able to parse the source program again, because your first production (which might look like this):

program: TOK_PROGRAM declarations statements TOK_END

won’t parse.

This is solved in Bison by adding error productions. For example:

expression: '(' error ')'

This production means that an error encountered within parentheses may be considered a valid expression for the purposes of parsing. A proper semantic action for that error (the code that runs when the production is parsed) will add an error message to some error list, and maybe do some general book-keeping.

So where does the fuzzing come in?
Well, my compiler was mostly working, but it still had no error recovery. That means that any syntax error would cause it to exit, with just that one error. Imagine your favorite compiler (such as gcc) exiting on the first missing semicolon. This is just not acceptable. So I added my basic error recovery and was pretty satisfied.
Then, I had to test the newly written error handling. So I wrote a CPL program in which my goal was to try and kill my own compiler. That’s a fun way to test your code. This also happens to be a ‘fuzzing mindset’. I actually managed to find some holes I didn’t think about, and closed them. Of course, these were not security holes, just ‘compilation holes’.
Here is an excerpt from one of the programs I wrote:

for (x=1; x<100; x=x+1; { x = x; } else () { for (x=1; x<100; x=x+1;) { x = -1; } }

It goes on like this for a bit, so I’ll save you the trouble.

Later I did some deep testing for the code generation. I wrote a test-case for every possible operator in the language (there aren’t that many) and for each type (real and int. Did I mention it was simplified CPL?). Since the language doesn’t support strings, each of these test cases printed 1 for success and 0 for failure. I ran the compiled output with my VM script and then had a simple Python script collect all the results and return failure if any of them failed. I later also tested control flow structures using the same method.
I had all of these tests available in my makefile, and after each change all I had to do was ‘make test’. Which I did, often.
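The collecting script was along these lines. This is a reconstruction from memory of the idea, not the original script, and the names `failed_tests`, `exit_code` and the sample test names are made up. It takes the output each compiled test program printed and treats anything other than 1 as a failure.

```python
def failed_tests(results):
    """results maps a test name to the text that test printed.

    A test passes only if it printed exactly "1" (ignoring whitespace).
    """
    return sorted(name for name, output in results.items()
                  if output.strip() != "1")


def exit_code(results):
    """Return 0 if everything passed, 1 otherwise, for use with make."""
    return 1 if failed_tests(results) else 0


# Hypothetical outputs, as they might be captured from the VM runs.
sample = {"add_int": "1\n", "add_real": "1\n", "div_int": "0\n"}
print("failed:", failed_tests(sample))
print("exit code:", exit_code(sample))
```

In a makefile, the script would end by passing that code to sys.exit(), so that ‘make test’ stops on a failure.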

Looking back after finishing the assignment, it occurred to me that I could have also written an actual ‘fuzzer’ for source programs. However, in the context of such a university assignment this seems like overkill. Had I been writing a real compiler, there’s a very good chance I’d have written a ‘source fuzzer’.

All in all, after I got over the nastiness which is Bison, it was pretty fun. It has been a long time since I wrote C or C++, and it’s good to clean some of the rust away from time to time.


Testing, 1 2 3, Testing

Finally, I’m done with the test in complex functions. Not that it went as well as I wanted, but hey, you can’t have it all. Maybe gonna try again sometime later. In the meantime, I got this “today is the first day of the rest of your life” feeling. After finishing with my previous work, and the previous semester, I’m going to move to Haifa (with my girlfriend) in less than a month, start the next (and last) semester soon, and also start real work on some real projects (one of those being diStorm). Here’s to starting and ending things.

On the flipside, after finishing the test, I had some testing issues to work out. See – let’s say you have a piece of code you’ve changed. Well, the obvious thing to do is make sure it works. It is also good to find all the places that reference that piece of code, and make sure these are updated, and test those as well. It is even better if you also have some unit-testing code available (built using the easy-to-use unittest module). It’s even better if your test code actually checks the relevant pieces of code.

But alas, it can all fail due to human stupidity. In that particular instance, mine. After all this work, not running the damn unit-test code is just plain stupid. So, I hope I’ve learned my lesson. Again. Never commit untested code.

And while we are on the subject of testing, check out coverage, over at Ned Batchelder’s place (I also happen to read and like his blog). Coverage is an excellent tool to improve your test code. Just run your test code with it, like so: -x

and then run: -r

and get some nice looking, informative results:

Name                                        Stmts   Exec  Cover
exptree                                        65     33    50%
exputils                                       85     48    56%
template_gen                                  100     67    67%
test_exputils                                  41     41   100%

Excellent. You can also get an annotated version of the source, telling you which lines were run, and which weren’t. It just doesn’t get any more useful than that. So happy testing. I hope you fare better than me.