Design Programming Python

Issues in writing a VM – Part 3 – State and Memory

When implementing the VM, I had to keep track of state. The state of the VM includes the registers, virtual variables and memory. Fortunately, keeping track of state information is pretty easy. Basically, it amounts to having a dict, where the keys are registers or variables, and the values are, well, their values.

Holding the state of registers is a bit involved by the fact that registers may overlap, as I mentioned in a previous article. To handle this, the class keeping track of the state information, upon seeing a request to change the value of a register, propagates this change to its parent and child registers.

Holding memory might be different though. At a first glance, it would seem that memory should be kept in some buffer. However, it is much easier to keep the memory in a python dict as well, and treat that dict as if it was a sparse array. This implementation allows programs to write at address 10, and at address 10000, without requiring the VM to keep track of all the addresses in between. (Of course, we are not going to implement paging just now :)

I really liked this idea of treating dicts as sparse lists. In fact, it could work even better with some tree data structure instead of a hash table. With a tree data structure your keys would be sorted, and you could do slices.
I went ahead and tried using tree data structures for this, but it seems to be more trouble than it’s worth, so I think that unless it becomes a real timing issue, I’ll let it go.

I did have some fun wrapping the memory in some class to abstract away the data structure, and to give it a few more capabilities:

In [2]: import vm
In [3]: m = vm.VMMemory()
In [4]: m.set(0, "hello world!\r\n")
In [5]: m.set(51, "foobar, foobar\x00")
In [6]: m.set(77, "a"*30)
In [7]: print m
0000: 68656c6c6f20776f726c64210d0a????  hello world!..??
0033: 666f6f6261722c20666f6f62617200??  foobar, foobar.?
004d: 61616161616161616161616161616161  aaaaaaaaaaaaaaaa
005d: 6161616161616161616161616161????  aaaaaaaaaaaaaa??
Algorithms Assembly decompilation Programming

First Code Transformation: Removing Flag Computations

Short introduction to code transformations

If you intend to write a decompiler, you’ll find yourself writing code transformations.
In our context, code transformations are operations that take as input an expression tree, and return an equivalent but different expression tree.

An equivalent expression tree is one that does the same thing as the original. More formally, it’s an expression tree that when run, will reach the same final state as the original expression tree.

Here’s a simple example, converting a while loop to a do-while loop:


while cond:


if cond:
    while cond

Code transformations are a very tricky business. You may find yourself changing the meaning of your expression trees without noticing. That’s one of the reasons we wrote the VM: we wanted to make sure that our code transformations will be correct.

Generally, you’ll want to transform your code to a somewhat ‘better’ version. ‘Better’ in what sense?
Well, the code may be:

  • More readable. A good purpose, as we are writing a decompiler for human beings. Unfortunately, this goal is not well defined. ‘More readable’ will have to be defined by heuristics.
  • Shorter/smaller. An easier goal to define, smaller code has the following advantages:
    • It might be faster to execute.
    • It might faster to process by other algorithms – after all, it is shorter.
    • It might be more readable.
  • Simpler, in the sense that it uses less types of operations. For example, take transforming a while to a do-while (shown above). If we wanted to analyze loops, it will allow us to write code to handle just do-while loops, instead of handling both cases.

There are many more possible goals for code transformations. The bottom line is that code transformations may do any of the following:
* Make the life of the human reader easier.
* Make the life of the decompiler programmer easier.
* Make the life of the decompiler easier.

An important aspect of code transformations is when to do them. Here is a good rule of thumb:

If it makes it harder for the programmer or the decompiler – it should be done as late as possible.
Otherwise – it should be done as early as possible.

Removing Flag Computations

This is the first code transformation we intend to write. Consider the following case:
sub eax, ebx
sub ebx, ecx

The equivalent expression tree for these two instructions looks something like this:

eax = eax - ebx
zf = eax == 0
ebx = ebx - ecx
zf = ebx == 0

Here I listed only the changes to zf and the actual registers. All the other flags (OF,CF,…) are changed as well. In a sequence of such instructions, all the modifications to the flags in the middle are actually dead code, and may be removed. This transformation will make the code shorter, faster to execute on our VM and faster to analyze by later stages.

This is a special case of dead code removal. This raises a valid question: why not let some general dead code removal transformation handle the flags?

There are a few reasons:

  • A general dead code removal might be more complex to write. Just writing exactly the same algorithm for general purpose registers will be more complicated, because of overlapping registers.
  • This transformation may remove up to 80% of many of the disassembled instructions. Flag calculations take up a lot nodes in the expression trees!
  • This transformation may be run as soon as the disassembly stage itself, while a general dead code removal transformation will not. This is so because of some assumptions we are allowed to make on the way flags behave. For example, an instruction template (“%0=%0-%1; ZF=%0==0” for sub) might have its parameters changed, but the flags affected stay the same.

The implementation of this transformation is not yet complete. When it is, I’ll talk a bit about the mechanics behind it. It is much more complicated than it seems!

Algorithms Assembly computer science Programming Projects Python

Issues in Writing a VM – Part 2

The Incredible Machine

Writing a VM capable of executing expression trees is different from writing a VM for executing assembly instructions. Here I’ll cover several issues stemming from this difference.

The first group of issues involve generality. Supporting a specific instruction set is a straightforward job, even if a hard one. Vial has to support multiple platforms and that’s a little tricky. These are just a few of the problems:

  1. Different registers for each instruction set. This one is easy enough to solve. Just have a node in the expression tree with some value to indicate the register.
  2. Register overlaps. A change to one register implies a change to its parents and its children. Consider RAX->EAX->AX->AH,AL. Changing EAX will affect all of the registers in the hierarchy.
    To handle this, we wrote a CPU class to keep all the info about the platform, including register overlaps.
  3. Delayed branches. Some platforms have branch delay slots. This means that after any branch instruction, right before the branch is taken, instructions in the delayed slots are executed anyway. For instance, SPARC has three delay slots, while MIPS has just one. This isn’t an easy issue to solve, and for now we didn’t tackle it. We’ve got a few ideas though.

To make sure that our implementation is generic enough, we decided to write a skeleton disassembler implementation for MIPS as well.

The second group of issues involve the nature of expression trees versus instructions:

  1. Stepping over statements or instructions? Each expression tree for an instruction usually holds more than one statement. For example, dec eax changes eax as well as the zero flag. Since some instructions like rep stosd may contain a long loop, being able to step over statements instead of expressions is preferable.

    The problem is, executing expression trees is done with a DFS-like walk. If implemented with recursion it makes pausing the walk for each step a bit complicated. However, recursion is the clearest way to write the execution code, and I’d rather not give it up.

    My solution was to use Python generators. Each time a statement was executed, the function would yield, thus returning control to the calling function, while keeping its state intact.

  2. Instruction Pointer changes. The lower-level disassembler returns expression trees for each instruction. However, it does not return the jump to the next instruction as well. While this is the required behavior, it means that the VM should change the instruction pointer after executing each instruction.

    This is easier said than done: What should we do after jumps? Several possible solutions emerged.

    The first was to append an IP change to each instruction’s expression. Those changes will have to be appended only where needed.
    A second solution was to check if the IP was changed, and if it was not, change it. This solution however will not support instructions that jump to themselves.
    The last and our preferred solution was to check if the IP was touched. This solution is simple and very straightforward.

There are many more issues I didn’t write about, for example, self modifying code. I’ll leave those issues for future articles.

Assembly computer science Programming Projects Testing

Issues in writing a VM – Part 1

Arkon and I decided to write a VM for vial. First though, a short explanation on what is vial:
vial is a project aimed at writing a general disassembler that outputs expression trees instead of text. On top of vial, we intend to write various code-analysis tools. The expression trees in the output should be an accurate description of the all of the code’s actions.
(note: the x86 disassembler behind vial is Arkon’s diStorm.)

So why do we need a VM? Apart from it being ‘nice and all’, it is critical for testing.

Some time ago, I described writing a VM to test a compiler I wrote as university homework. It is a similar issue here.
The disassembler is written according to the x86 specification. If we just check its output against this specification, we are not doing much to verify the code’s correctness. This is evident when you try to implement such a testing module – you end up writing another disassembler, and testing it against the original one. There has to be a different test method, one that does not directly rely on the specification.

Enter the VM. If you write a program, you can disassemble it, and then try to execute the disassembly. If it yields the same output as the original program – your test passed.
This is a good testing method, because it can be easily automated, reach good code coverage, and it tests against known values.
Consider the following illustration:

Testing Process

We are testing here a complete process on the left hand, against a known valid value, the original program’s output, on the right hand. All of the boxes on the left hand are tested along the way. Of course, one test may miss. For example, both the VM and the disassembler may generate wrong output for register overflows. We can try to cover as many such cases as possible by writing good tests for this testing framework. In this case, good tests are either c programs, or binary programs. This is essentially what I was doing when I manually fuzzed my own compiler.

Once the VM is finished, we can start writing various optimizations for the disassembler’s generated output. We can test these optimizations by checking the VM’s output on the optimized code against the output on the original code. This makes the VM a critical milestone on the road ahead.

Linux Programming Projects Python

Distorm3 progress and SVN

Finally, Distorm3 is progressing. With heaps of work done by Gil, and some more by me, the project will soon be on its feet. One thing that really gave us a feeling of progress, is setting up Subversion. We thought of setting our own, but finally decided on using Assembla, after a good friend recommended it. There are other options, but Assembla seems really good, although it does have its drawbacks. (Your code goes in clear-text for one).

When I was using windows I worked with TortoiseSVN, but now on my Ubuntu, I’m using RapidSVN, with meld as my diff tool. Although Rapid is cool, I liked Tortoise better. It was faster, much more intuitive, and I liked the commit window better. It gave me a choice on what to commit, and I could (from the commit window) run a diff on each file I changed, and using the diff write a short comment on my changes in the commit comment. While RapidSVN obviously also allows for commit comments, I have to do all the work beforehand. It’s a bit more cumbersome.

Another thing – I’m working on unit-testing using the excellent unittest module. Using along with the testing makes my code so much better, and me so much happier. Since we have a little bit of c-code generation, one of the (a little bit hackish) tests I wrote was running gcc on some sample output, and making sure there were no errors or warnings. Fun. One of my next todos is compiling a few small executables, and making sure that they run without errors, all from within the unit-testing of the code generation module.

All in all, it’s good to finally be working organized.