Categories: Projects, Statistics

The Art and Science of Pulling Numbers Out of Your Sleeve

About a year or so ago, I was reading R. V. Jones’ excellent book ‘Most Secret War’. One of the stories I remembered and told my colleagues about was how Jones estimated the rocket production capabilities of the Germans after looking at an aerial photograph of a rocket fuel shed. One of my colleagues then told me of a course she took at the Hebrew University of Jerusalem that teaches how to make such estimates. I let it go at the time, and just remembered that such a course exists.

A few days ago, I met up with this colleague and was reminded of the course. You see, right now I’m collecting data for a project I’m doing with a friend, and I needed estimation know-how. So I asked her for the name of the course, and she gave it to me. Two Google searches later I had the English name of the course, “Order of Magnitude in Physics Problems”, and a textbook to look at. I just finished reading the introduction, and I know that I’m going to read the rest of it too, not just for my current project, but because it’s such a good skill to have.

The book opens with a problem:

We dedicate the first example to physicists who need employment outside of physics. […] How much money is there in a fully loaded Brinks armored car?

The book then goes on to show how to answer such a question intelligently. I’m hooked.

Categories: Compilation, Computer Science, Programming, Projects, Python

Manually fuzzing my own compiler

As I mentioned before, I had to write a compiler for simplified CPL. An obvious requirement was that the compiler generate correct code. A less obvious, but nonetheless important, requirement was that after a syntax error, the compiler keep parsing the source program.

Now, the default behavior of a parser generated by Bison upon a syntax error is to return from the parsing function, yyparse. You may of course call yyparse again, but this would be mostly meaningless: you have lost all your context. A simple example would be a language that must have ‘program’ as its first word. Once you are past that token, you will not be able to parse the source program again, because your first production (which might look like this):

program: TOK_PROGRAM declarations statements TOK_END

won’t parse.

This is solved in Bison by adding error productions. For example:

expression: '(' error ')'

This production means that an error encountered within parentheses may be considered a valid expression for the purposes of parsing. A proper semantic action for that error (the code that runs when the production is parsed) will add an error message to some error list, and maybe do some general bookkeeping.

So where does the fuzzing come in?
Well, my compiler was mostly working, but it still had no error recovery. That meant that any syntax error caused it to exit with just that one error. Imagine your favorite compiler (such as gcc) exiting on the first missing semicolon. That is just not acceptable. So I added basic error recovery and was pretty satisfied.
Then I had to test the newly written error handling. So I wrote a CPL program with the goal of trying to kill my own compiler. That’s a fun way to test your code, and it also happens to be a ‘fuzzing mindset’. I actually managed to find some holes I hadn’t thought about, and closed them. Of course, these were not security holes, just ‘compilation holes’.
Here is an excerpt from one of the programs I wrote:
x=x++++++++++++++100;
x=x=3;
}
for (x=1; x<100; x=x+1; { x = x; } else () { for (x=1; x<100; x=x+1;) { x = -1; } }

It goes on like this for a bit, so I'll save you the trouble.

Later I did some deep testing of the code generation. I wrote a test case for every possible operator in the language (there aren't that many) and for each type (real and int. Did I mention it was simplified CPL?). Since the language doesn't support strings, each of these test cases printed 1 for success and 0 for failure. I ran the compiled output with my VM script and then had a simple Python script collect all the results and return failure if any of them failed. I later also tested control flow structures using the same method.
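That Python results collector was conceptually trivial. Here is a minimal sketch of the idea (the VM wrapper name, the test file layout and the exact invocation are assumptions for illustration, not the real script):

import glob
import subprocess
import sys

# Hypothetical names: the VM wrapper script and the location of the
# compiled tests stand in for the real ones used in the project.
VM_SCRIPT = "./run_vm.sh"
COMPILED_TESTS = glob.glob("tests/*.out")

def run_test(path):
    # Run the compiled test through the VM and capture its output.
    result = subprocess.run([VM_SCRIPT, path], capture_output=True, text=True)
    # Each test prints 1 for success and 0 for failure (the language has no strings).
    return result.stdout.strip() == "1"

failures = [path for path in COMPILED_TESTS if not run_test(path)]

for path in failures:
    print("FAILED: %s" % path)

# Exit non-zero if any test failed, so that 'make test' fails too.
sys.exit(1 if failures else 0)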
I had all of these tests available in my makefile, and after each change all I had to do was ‘make test’. Which I did, often.

Looking back after finishing the assignment, it occurred to me that I could have also written an actual ‘fuzzer’ for source programs. However, in the context of such a university assignment this seemed like overkill. Had I been writing a real compiler, there’s a very good chance I’d have written a ‘source fuzzer’.
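Such a source fuzzer would not have to be sophisticated. A minimal sketch of the idea in Python (the token pool is made up for illustration; nothing like this existed in the assignment):

import random

# An illustrative token pool for a toy CPL-like language; a real fuzzer
# would let the grammar drive a smarter choice of tokens.
TOKENS = ["program", "end", "int", "real", "if", "else", "for", "{", "}",
          "(", ")", ";", "=", "+", "<", "x", "y", "1", "100"]

def random_program(length=200, seed=None):
    # Build a random stream of tokens. Most outputs will be syntactically
    # invalid, which is the point: the compiler must survive all of them.
    rng = random.Random(seed)
    return " ".join(rng.choice(TOKENS) for _ in range(length))

if __name__ == "__main__":
    print(random_program(seed=42))

Feeding a few thousand of these to the compiler and checking that it always exits with error messages instead of crashing would cover a lot of ground with very little code.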

All in all, after I got over the nastiness that is Bison, it was pretty fun. It had been a long time since I last wrote C or C++, and it’s good to clean off some of the rust from time to time.

Categories: Linux, Programming, Projects, Python

Distorm3 progress and SVN

Finally, Distorm3 is progressing. With heaps of work done by Gil, and some more by me, the project will soon be on its feet. One thing that really gave us a feeling of progress was setting up Subversion. We thought of setting up our own server, but finally decided on using Assembla, after a good friend recommended it. There are other options, but Assembla seems really good, although it does have its drawbacks (your code goes in clear text, for one).

When I was using Windows I worked with TortoiseSVN, but now on Ubuntu I’m using RapidSVN, with meld as my diff tool. Although Rapid is cool, I liked Tortoise better. It was faster and much more intuitive, and I liked its commit window better: it let me choose what to commit, and I could (from the commit window) run a diff on each file I changed and use that diff to write a short note about my changes in the commit comment. While RapidSVN obviously also allows commit comments, I have to do all that work beforehand. It’s a bit more cumbersome.

Another thing: I’m working on unit testing using the excellent unittest module. Using coverage.py along with the tests makes my code so much better, and me so much happier. Since we have a little bit of C-code generation, one of the (slightly hackish) tests I wrote runs gcc on some sample output and makes sure there are no errors or warnings. Fun. One of my next to-dos is compiling a few small executables and making sure that they run without errors, all from within the unit tests of the code generation module.
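In spirit, that gcc test looks something like the sketch below (the codegen module and its generate_c_code function are placeholders for the real code generation API, which I’m not showing here):

import os
import subprocess
import tempfile
import unittest

# Hypothetical import: a stand-in for the real C-code generation module.
from codegen import generate_c_code

class TestGeneratedCode(unittest.TestCase):
    def test_output_compiles_without_warnings(self):
        # Write some generated C code to a temporary file.
        fd, path = tempfile.mkstemp(suffix=".c")
        try:
            with os.fdopen(fd, "w") as f:
                f.write(generate_c_code())
            # -Wall -Werror: any gcc warning fails the test.
            result = subprocess.run(
                ["gcc", "-Wall", "-Werror", "-c", path, "-o", os.devnull],
                capture_output=True, text=True)
            self.assertEqual(result.returncode, 0, result.stderr)
        finally:
            os.remove(path)

if __name__ == "__main__":
    unittest.main()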

All in all, it’s good to finally be working in an organized fashion.