The BEAM Dispatcher

The instruction decoder in Beam is implemented with a technique called directly threaded code. In this context the word threaded has nothing to do with OS threads, concurrency or parallelism. It is the execution path which is threaded through the virtual machine itself.

Imagine a simple language of arithmetic expressions, where we can write expressions like "8 + 17 * 2.":


Statement :=  Expression '.'
Expression := Number
           |  Expression Op Expression
Op         := '+' | '*'
Number     := [0..9]
           |  [0..9] Number

We can use the Erlang compiler to generate code for a simple stack machine which can evaluate these expressions.
(See full code on github)


compile(String) ->
    [ParseTree] = element(2,
			  erl_parse:parse_exprs(
			    element(2,
				    erl_scan:string(String)))),
    generate_code(ParseTree).

generate_code({op, _Line, '+', Arg1, Arg2}) -> 
    generate_code(Arg1) ++ generate_code(Arg2) ++ [add];
generate_code({op, _Line, '*', Arg1, Arg2}) -> 
    generate_code(Arg1) ++ generate_code(Arg2) ++ [multiply];
generate_code({integer, _Line, I}) -> [push, I].

And now we can write an simplistic virtual stack machine:


interpret(Code) -> interpret(Code, []).

interpret([push, I |Rest], Stack)              -> interpret(Rest, [I|Stack]);
interpret([add     |Rest], [Arg2, Arg1|Stack]) -> interpret(Rest, [Arg1+Arg2|Stack]);
interpret([multiply|Rest], [Arg2, Arg1|Stack]) -> interpret(Rest, [Arg1*Arg2|Stack]);
interpret([],              [Res|_])            -> Res.

And a quick test run gives us the answer:


1> stack_machine:interpret(stack_machine:compile("8 + 17 * 2.")).
42

Great, you have built your first virtual machine! Handling subtraction, division and the rest of the Erlang language is left as an exercise for the reader.

If we take a look at our naive stack machine for arithmetic expressions we see that we use Erlang atoms and pattern matching to decode which instruction to execute. This is a very heavy machinery to just decode machine instructions. In a real machine we would code each instruction as a “machine word” integer.

We can rewrite our stack machine to be a byte code machine implemented in C. First we rewrite the compiler so that it produces byte codes. This is pretty straight forward, just replace each instruction encoded as an atom with a byte representing the instruction. To be able to handle integers larger than 255 we encode integers with a size byte followed by the integer encoded in bytes.

(See full code on github)


compile(Expression, FileName) ->
  [ParseTree] = element(2, erl_parse:parse_exprs(
                  element(2, erl_scan:string(Expression)))),
  file:write_file(FileName, generate_code(ParseTree) ++ [stop()]).

generate_code({op, _Line, '+', Arg1, Arg2}) ->
  generate_code(Arg1) ++ generate_code(Arg2) ++ [add()];
generate_code({op, _Line, '*', Arg1, Arg2}) ->
  generate_code(Arg1) ++ generate_code(Arg2) ++ [multiply()];
generate_code({integer, _Line, I}) -> [push(), integer(I)].

stop()     -> 0.
add()      -> 1.
multiply() -> 2.
push()     -> 3.
integer(I) ->
  L = binary_to_list(binary:encode_unsigned(I)), 
  [length(L) | L].

Now lets write a simple virtual machine in C.


#define STOP 0
#define ADD 1
#define MUL 2
#define PUSH 3

#define pop() (stack[--sp])
#define push(X) (stack[sp++] = X)

int run(char *code) {
  int stack[1000];
  int sp = 0, size = 0, val = 0;
  char *ip = code;

  while (*ip != STOP) {
    switch (*ip++) {
      case ADD: push(pop() + pop()); break;
      case MUL: push(pop() * pop()); break;
      case PUSH:
        size = *ip++;
        val = 0;
        while (size--) { val = val * 256 + *ip++; }
        push(val);
        break;
    }
  }
  return pop();
}

You see, a virtual machine written in C does not need to be very complicated. This machine is just a loop checking the byte code at each instruction by looking at the value
pointed to by the instruction pointer (ip). For each byte code instruction it will switch on the instruction byte code and jump to the case which executes the instruction. This requires a decoding of the instruction and then a jump to the correct code. If we look at the assembly for vsm.c (gcc -S vsm.c) we see the inner loop of the decoder:


L11:
    movl   -16(%ebp), %eax
    movzbl    (%eax), %eax
    movsbl       %al, %eax
    addl          $1, -16(%ebp)
    cmpl          $2, %eax
    je            L7
    cmpl          $3, %eax
    je            L8
    cmpl          $1, %eax
    jne           L5

It has to compare the byte code with each instruction code and then do a conditional jump. In a real machine with many instructions this can become quite expensive.

A better solution would be to have a table with the address of the code then we could just use an index into the table to load the address and jump without the need to do a compare. This technique is sometimes called token threaded code. Taking this a step further we can actually store the address of the function implementing the instruction in the code memory. This is called subroutine threaded code.

This approach will make the decoding simpler at runtime, but it makes the whole VM more complicated by requiring a loader. The loader replaces the byte code instructions with addresses to functions implementing the instructions.

A loader might look like:

(See full code on github)


typedef void (*instructionp_t)(void);

instructionp_t *read_file(char *name) {
  FILE *file;
  instructionp_t *code;
  instructionp_t *cp;
  long size;
  char ch;
  unsigned int val;

  file = fopen(name, "r");

  if(file == NULL) exit(1);

  fseek(file, 0L, SEEK_END);
  size = ftell(file);
  code = calloc(size, sizeof(instructionp_t));
  if(code == NULL) exit(1);
  cp = code;

  fseek(file, 0L, SEEK_SET);
  while ( ( ch = fgetc(file) ) != EOF )
  {
    switch (ch) {
      case ADD: *cp++ = &add; break;
      case MUL: *cp++ = &mul; break;
      case PUSH:
        *cp++ = &pushi;
        ch = fgetc(file);
        val = 0;
        while (ch--) { val = val * 256 + fgetc(file); }
        *cp++ = (instructionp_t) val;
        break;
    }
  }
  *cp = &stop;

  fclose(file);
  return code;
}

As we can see, we do more work at load time here, including the decoding of integers larger than 255. (Yes, I know, the code is not safe for very large integers.)

The decode and dispatch loop of the VM becomes quite simple though:


int run() {
  sp = 0;
  running = 1;

  while (running) (*ip++)();

  return pop();
}

Then we just need to implement the instructions:


void add() { int x,y; x = pop(); y = pop(); push(x + y); }
void mul() { int x,y; x = pop(); y = pop(); push(x * y); }
void pushi(){ int x; x = (int)*ip++; push(x); }
void stop() { running = 0; }

In Beam this concept is taken one step further, and beam uses directly threaded code (sometimes called only thread code). In directly threaded code the call and return sequence is replaced by direct jumps to the implementation of the next instruction. In order
to implement this in C, Beam uses the GCC extension “labels as values”.

We can see how the Beam dispatcher is implemented by looking at the add instruction in beam_emu.c. The STORE_ARITH_RESULT macro actually hides the dispatch function which looks something like: I += 4; Goto(*I);.


#define OpCase(OpCode) lb_##OpCode
#define Goto(Rel) goto *(Rel)

...

OpCase(i_plus_jId):
{
  Eterm result;

  if (is_both_small(tmp_arg1, tmp_arg2)) {
    Sint i = signed_val(tmp_arg1) + signed_val(tmp_arg2);
    ASSERT(MY_IS_SSMALL(i) == IS_SSMALL(i));
    if (MY_IS_SSMALL(i)) {
      result = make_small(i);
      STORE_ARITH_RESULT(result);
    }

  }
  arith_func = ARITH_FUNC(mixed_plus);
  goto do_big_arith2;
}

I will talk about the Beam virtual machine at EUC, covering how to look at compiled Beam code, how preemption is handled and some of the Beam instructions are implemented.

Hope to see you there.