Manually fuzzing my own compiler

As I mentioned before, I had to write a compiler for simplified CPL. An obvious requirement was that the compiler generate correct code. A less obvious requirement, but important none-the-less, was that after a syntax error, the compiler will keep parsing the source program.

Now, the default behavior of a parser generated by Bison for a synatx error is to return from the parsing function, yyparse. You may of-course call yyparse again, but this would be mostly meaningless – you lost all your context. A simple example would be a language that has to have ‘program’ as the first word. Once you are past that token, you will not be able to parse the source-program again, because your first production (which might look like this):

program: TOK_PROGRAM declarations statements TOK_END

won’t parse.

This is solved in Bison by adding error productions. For example:

expression: '(' error ')'

This production means that an error encountered within parenthesis may be considered a valid expression for the purposes of parsing. A proper semantic action for that error (the code that runs when the production is parsed) will add an error message to some error list, and maybe do some general book-keeping.

So where does the fuzzing come in?
Well, my compiler was mostly working, but it still had no error recovery. That means that any syntax error would cause it to exit, with just that one error. Consider your favorite compiler, (such as gcc), exiting on the first missing semicolon. This is just not acceptable. So I added my basic error recovery and was pretty satisfied.
Then, I had to test the newly written error handling. So I wrote a CPL program in which my goal was to try and kill my own compiler. Thats a fun way to test your code. This also happens to be a ‘fuzzing mindset’. I actually managed to find some holes I didn’t think about, and closed them. Of course, these were not security holes, just ‘compilation holes’.
Here is an excerpt from one of the programs I wrote:
[c]
x=x++++++++++++++100;
x=x=3;
}
for (x=1; x<100; x=x+1; {
x = x;
}
else () {
for (x=1; x<100; x=x+1;) {
x = -1;
}
}
[/c]

It goes on like this for a bit, so I'll save you the trouble. Later I did some deep testing for the code generation. I wrote a test-case for every possible operator in the language (there aren't that many) and for each type (real and int. Did I mention it was simplified cpl?). Since the language doesn’t support strings, each of these test cases printed 1 for success and 0 for failure. I ran the compiled output with my VM script and then had a simple Python script collect all the results and return failure if any of them failed. I later also tested control flow structures using the same method.
I had all of these tests available in my makefile, and after each change all I had to do was ‘make test’. Which I did, often.

Looking back after finishing the assignment, it occurred to me that I could have also written an actual ‘fuzzer’ for source programs. However, in the context of such a university assignment this seems like overkill. Had I been writing a real compiler, there’s a very good chance I’d have written a ‘source fuzzer’.

All in all, after I got over the nastiness which is Bison, it was pretty fun. It has been a long time since I wrote C or C++, and it’s good to clean some of the rust away from time to time.

This entry was posted in Compilation, computer science, Programming, Projects, Python and tagged , , , , . Bookmark the permalink.