When is the Book Coming Out?

”So Erik, When is the book coming out?”

Last week I attended and spoke at Erlang Factory San Francisco.

The number one question that I got, and I got it from basically everyone I have ever met before, was ”When is the book coming out?”.

It was really nice to see that there was such an interest for the book.

The sad part is that the answer, at least in part, is: ”I don’t know”.

It is unfortunately late already. I was planning for a release in early 2014, and clearly that didn’t happen. I have only written about one third of the content. None of the chapters I have written has been edited, proofread, rewritten or even really finished. 

There is a huge amount of work left.

In the summer and fall of 2013 I again ended up as a manager at Klarna with the responsibility to get ”Kred” or ”Klarna Online”, our (well Klarna’s) OLTP system, to survive the Christmas load. The system was in a slightly unstable state at the beginning, but the Kred Core team managed to get it stable, responsive and really performing well straight through the Christmas rush. Unfortunately this took all of my time and energy, and the book writing got put on hold.

Fortunately, this led me to finally take the step and resign

The plan now is to take a week of vacation here in California, and then I will work full time on the book. If all goes well I should be able to finish most of the writing by early summer. Then perhaps I can get reviewer comments by the end of the summer and have a ”final” version ready by September or so.

From now on I will also try to keep you posted on the progress here on the blog.

I'm writing a book, I got the page numbers done.

The BEAM Dispatcher

The instruction decoder in Beam is implemented with a technique called directly threaded code. In this context the word threaded has nothing to do with OS threads, concurrency or parallelism. It is the execution path which is threaded through the virtual machine itself.

Imagine a simple language of arithmetic expressions, where we can write expressions like "8 + 17 * 2.":


Statement :=  Expression '.'
Expression := Number
           |  Expression Op Expression
Op         := '+' | '*'
Number     := [0..9]
           |  [0..9] Number

We can use the Erlang compiler to generate code for a simple stack machine which can evaluate these expressions.
(See full code on github)


compile(String) ->
    [ParseTree] = element(2,
			  erl_parse:parse_exprs(
			    element(2,
				    erl_scan:string(String)))),
    generate_code(ParseTree).

generate_code({op, _Line, '+', Arg1, Arg2}) -> 
    generate_code(Arg1) ++ generate_code(Arg2) ++ [add];
generate_code({op, _Line, '*', Arg1, Arg2}) -> 
    generate_code(Arg1) ++ generate_code(Arg2) ++ [multiply];
generate_code({integer, _Line, I}) -> [push, I].

And now we can write an simplistic virtual stack machine:


interpret(Code) -> interpret(Code, []).

interpret([push, I |Rest], Stack)              -> interpret(Rest, [I|Stack]);
interpret([add     |Rest], [Arg2, Arg1|Stack]) -> interpret(Rest, [Arg1+Arg2|Stack]);
interpret([multiply|Rest], [Arg2, Arg1|Stack]) -> interpret(Rest, [Arg1*Arg2|Stack]);
interpret([],              [Res|_])            -> Res.

And a quick test run gives us the answer:


1> stack_machine:interpret(stack_machine:compile("8 + 17 * 2.")).
42

Great, you have built your first virtual machine! Handling subtraction, division and the rest of the Erlang language is left as an exercise for the reader.

If we take a look at our naive stack machine for arithmetic expressions we see that we use Erlang atoms and pattern matching to decode which instruction to execute. This is a very heavy machinery to just decode machine instructions. In a real machine we would code each instruction as a “machine word” integer.

We can rewrite our stack machine to be a byte code machine implemented in C. First we rewrite the compiler so that it produces byte codes. This is pretty straight forward, just replace each instruction encoded as an atom with a byte representing the instruction. To be able to handle integers larger than 255 we encode integers with a size byte followed by the integer encoded in bytes.

(See full code on github)


compile(Expression, FileName) ->
  [ParseTree] = element(2, erl_parse:parse_exprs(
                  element(2, erl_scan:string(Expression)))),
  file:write_file(FileName, generate_code(ParseTree) ++ [stop()]).

generate_code({op, _Line, '+', Arg1, Arg2}) ->
  generate_code(Arg1) ++ generate_code(Arg2) ++ [add()];
generate_code({op, _Line, '*', Arg1, Arg2}) ->
  generate_code(Arg1) ++ generate_code(Arg2) ++ [multiply()];
generate_code({integer, _Line, I}) -> [push(), integer(I)].

stop()     -> 0.
add()      -> 1.
multiply() -> 2.
push()     -> 3.
integer(I) ->
  L = binary_to_list(binary:encode_unsigned(I)), 
  [length(L) | L].

Now lets write a simple virtual machine in C.


#define STOP 0
#define ADD 1
#define MUL 2
#define PUSH 3

#define pop() (stack[--sp])
#define push(X) (stack[sp++] = X)

int run(char *code) {
  int stack[1000];
  int sp = 0, size = 0, val = 0;
  char *ip = code;

  while (*ip != STOP) {
    switch (*ip++) {
      case ADD: push(pop() + pop()); break;
      case MUL: push(pop() * pop()); break;
      case PUSH:
        size = *ip++;
        val = 0;
        while (size--) { val = val * 256 + *ip++; }
        push(val);
        break;
    }
  }
  return pop();
}

You see, a virtual machine written in C does not need to be very complicated. This machine is just a loop checking the byte code at each instruction by looking at the value
pointed to by the instruction pointer (ip). For each byte code instruction it will switch on the instruction byte code and jump to the case which executes the instruction. This requires a decoding of the instruction and then a jump to the correct code. If we look at the assembly for vsm.c (gcc -S vsm.c) we see the inner loop of the decoder:


L11:
    movl   -16(%ebp), %eax
    movzbl    (%eax), %eax
    movsbl       %al, %eax
    addl          $1, -16(%ebp)
    cmpl          $2, %eax
    je            L7
    cmpl          $3, %eax
    je            L8
    cmpl          $1, %eax
    jne           L5

It has to compare the byte code with each instruction code and then do a conditional jump. In a real machine with many instructions this can become quite expensive.

A better solution would be to have a table with the address of the code then we could just use an index into the table to load the address and jump without the need to do a compare. This technique is sometimes called token threaded code. Taking this a step further we can actually store the address of the function implementing the instruction in the code memory. This is called subroutine threaded code.

This approach will make the decoding simpler at runtime, but it makes the whole VM more complicated by requiring a loader. The loader replaces the byte code instructions with addresses to functions implementing the instructions.

A loader might look like:

(See full code on github)


typedef void (*instructionp_t)(void);

instructionp_t *read_file(char *name) {
  FILE *file;
  instructionp_t *code;
  instructionp_t *cp;
  long size;
  char ch;
  unsigned int val;

  file = fopen(name, "r");

  if(file == NULL) exit(1);

  fseek(file, 0L, SEEK_END);
  size = ftell(file);
  code = calloc(size, sizeof(instructionp_t));
  if(code == NULL) exit(1);
  cp = code;

  fseek(file, 0L, SEEK_SET);
  while ( ( ch = fgetc(file) ) != EOF )
  {
    switch (ch) {
      case ADD: *cp++ = &add; break;
      case MUL: *cp++ = &mul; break;
      case PUSH:
        *cp++ = &pushi;
        ch = fgetc(file);
        val = 0;
        while (ch--) { val = val * 256 + fgetc(file); }
        *cp++ = (instructionp_t) val;
        break;
    }
  }
  *cp = &stop;

  fclose(file);
  return code;
}

As we can see, we do more work at load time here, including the decoding of integers larger than 255. (Yes, I know, the code is not safe for very large integers.)

The decode and dispatch loop of the VM becomes quite simple though:


int run() {
  sp = 0;
  running = 1;

  while (running) (*ip++)();

  return pop();
}

Then we just need to implement the instructions:


void add() { int x,y; x = pop(); y = pop(); push(x + y); }
void mul() { int x,y; x = pop(); y = pop(); push(x * y); }
void pushi(){ int x; x = (int)*ip++; push(x); }
void stop() { running = 0; }

In Beam this concept is taken one step further, and beam uses directly threaded code (sometimes called only thread code). In directly threaded code the call and return sequence is replaced by direct jumps to the implementation of the next instruction. In order
to implement this in C, Beam uses the GCC extension “labels as values”.

We can see how the Beam dispatcher is implemented by looking at the add instruction in beam_emu.c. The STORE_ARITH_RESULT macro actually hides the dispatch function which looks something like: I += 4; Goto(*I);.


#define OpCase(OpCode) lb_##OpCode
#define Goto(Rel) goto *(Rel)

...

OpCase(i_plus_jId):
{
  Eterm result;

  if (is_both_small(tmp_arg1, tmp_arg2)) {
    Sint i = signed_val(tmp_arg1) + signed_val(tmp_arg2);
    ASSERT(MY_IS_SSMALL(i) == IS_SSMALL(i));
    if (MY_IS_SSMALL(i)) {
      result = make_small(i);
      STORE_ARITH_RESULT(result);
    }

  }
  arith_func = ARITH_FUNC(mixed_plus);
  goto do_big_arith2;
}

I will talk about the Beam virtual machine at EUC, covering how to look at compiled Beam code, how preemption is handled and some of the Beam instructions are implemented.

Hope to see you there.

The Erlang Tagging Scheme

Erlang is dynamically typed. That is, types will be checked at runtime and if a type error occurs an exception is thrown. The compiler does not check the types at compile time, unlike in a statically typed language like C or Java where you can get a type error during compilation.

These aspects of the Erlang type system, strongly statically typed with an order on the types puts some constraints on the implementation of the language. In order to be able to check and compare types at runtime each Erlang term has to carry its type with it. This is solved by tagging the terms.

In the memory representation of an Erlang term a few bits are reserved for a type tag. For performance reasons the terms are divided into immediate and boxed terms. An immediate term can fit into a machine word, that is, in a register or on a stack slot. A boxed term consists of two parts: a tagged pointer and a number of words stored on the process heap. The boxes stored on the heap have a header and a body, unless it is a list. Currently ERTS uses a staged tag scheme, the history and reasoning behind the this scheme is explained in a technical report from the HiPE group. (See http://www.it.uu.se/research/publications/reports/2000-029/)

The tagging scheme is implemented in erl_term.h. The basic idea is to use the least significant bits for tags. Since most modern CPU architectures aligns 32- and 64-bit words, there are at least two bits that are “unused” for pointers. These bits can be used as tags instead. Unfortunately those two bits are not enough for all the types in Erlang, more bits are therefore used as needed.

The current (R15-R16) tagging scheme looks like this:

 aaaaaaaaaaaaaaaaaaaaaaaaaatttt00 HEADER (see below)
 pppppppppppppppppppppppppppppp01 CONS
 pppppppppppppppppppppppppppppp10 BOXED (pointer to header)
 iiiiiiiiiiiiiiiiiiiiiiiiiiii0011 PID
 iiiiiiiiiiiiiiiiiiiiiiiiiiii0111 PORT
 iiiiiiiiiiiiiiiiiiiiiiiiii001011 ATOM
 iiiiiiiiiiiiiiiiiiiiiiiiii011011 CATCH
 iiiiiiiiiiiiiiiiiiiiiiiiii111011 NIL (i always zero...)
 iiiiiiiiiiiiiiiiiiiiiiiiiiii1111 SMALL_INT

 aaaaaaaaaaaaaaaaaaaaaaaaaa000000 ARITYVAL (Tuple)
 vvvvvvvvvvvvvvvvvvvvvvvvvv000100 BINARY_AGGREGATE       |
 vvvvvvvvvvvvvvvvvvvvvvvvvv001x00 BIGNUM with sign bit   |
 vvvvvvvvvvvvvvvvvvvvvvvvvv010000 REF                    |
 vvvvvvvvvvvvvvvvvvvvvvvvvv010100 FUN                    | THINGS
 vvvvvvvvvvvvvvvvvvvvvvvvvv011000 FLONUM                 |
 vvvvvvvvvvvvvvvvvvvvvvvvvv011100 EXPOR                  |
 vvvvvvvvvvvvvvvvvvvvvvvvvv100000 REFC_BINARY |          |
 vvvvvvvvvvvvvvvvvvvvvvvvvv100100 HEAP_BINARY | BINARIES |
 vvvvvvvvvvvvvvvvvvvvvvvvvv101000 SUB_BINARY  |          |
 vvvvvvvvvvvvvvvvvvvvvvvvvv101100 Not used
 vvvvvvvvvvvvvvvvvvvvvvvvvv110000 EXTERNAL_PID  |        |
 vvvvvvvvvvvvvvvvvvvvvvvvvv110100 EXTERNAL_PORT | EXT    |
 vvvvvvvvvvvvvvvvvvvvvvvvvv111000 EXTERNAL_REF  |        | 
 vvvvvvvvvvvvvvvvvvvvvvvvvv111100 Not used