Erlang is dynamically typed. That is, types will be checked at runtime and if a type error occurs an exception is thrown. The compiler does not check the types at compile time, unlike in a statically typed language like C or Java where you can get a type error during compilation.
These aspects of the Erlang type system, strongly statically typed with an order on the types puts some constraints on the implementation of the language. In order to be able to check and compare types at runtime each Erlang term has to carry its type with it. This is solved by tagging the terms.
In the memory representation of an Erlang term a few bits are reserved for a type tag. For performance reasons the terms are divided into immediate and boxed terms. An immediate term can fit into a machine word, that is, in a register or on a stack slot. A boxed term consists of two parts: a tagged pointer and a number of words stored on the process heap. The boxes stored on the heap have a header and a body, unless it is a list. Currently ERTS uses a staged tag scheme, the history and reasoning behind the this scheme is explained in a technical report from the HiPE group. (See http://www.it.uu.se/research/publications/reports/2000-029/)
The tagging scheme is implemented in erl_term.h. The basic idea is to use the least significant bits for tags. Since most modern CPU architectures aligns 32- and 64-bit words, there are at least two bits that are “unused” for pointers. These bits can be used as tags instead. Unfortunately those two bits are not enough for all the types in Erlang, more bits are therefore used as needed.
The current (R15-R16) tagging scheme looks like this:
aaaaaaaaaaaaaaaaaaaaaaaaaatttt00 HEADER (see below) pppppppppppppppppppppppppppppp01 CONS pppppppppppppppppppppppppppppp10 BOXED (pointer to header) iiiiiiiiiiiiiiiiiiiiiiiiiiii0011 PID iiiiiiiiiiiiiiiiiiiiiiiiiiii0111 PORT iiiiiiiiiiiiiiiiiiiiiiiiii001011 ATOM iiiiiiiiiiiiiiiiiiiiiiiiii011011 CATCH iiiiiiiiiiiiiiiiiiiiiiiiii111011 NIL (i always zero...) iiiiiiiiiiiiiiiiiiiiiiiiiiii1111 SMALL_INT aaaaaaaaaaaaaaaaaaaaaaaaaa000000 ARITYVAL (Tuple) vvvvvvvvvvvvvvvvvvvvvvvvvv000100 BINARY_AGGREGATE | vvvvvvvvvvvvvvvvvvvvvvvvvv001x00 BIGNUM with sign bit | vvvvvvvvvvvvvvvvvvvvvvvvvv010000 REF | vvvvvvvvvvvvvvvvvvvvvvvvvv010100 FUN | THINGS vvvvvvvvvvvvvvvvvvvvvvvvvv011000 FLONUM | vvvvvvvvvvvvvvvvvvvvvvvvvv011100 EXPOR | vvvvvvvvvvvvvvvvvvvvvvvvvv100000 REFC_BINARY | | vvvvvvvvvvvvvvvvvvvvvvvvvv100100 HEAP_BINARY | BINARIES | vvvvvvvvvvvvvvvvvvvvvvvvvv101000 SUB_BINARY | | vvvvvvvvvvvvvvvvvvvvvvvvvv101100 Not used vvvvvvvvvvvvvvvvvvvvvvvvvv110000 EXTERNAL_PID | | vvvvvvvvvvvvvvvvvvvvvvvvvv110100 EXTERNAL_PORT | EXT | vvvvvvvvvvvvvvvvvvvvvvvvvv111000 EXTERNAL_REF | | vvvvvvvvvvvvvvvvvvvvvvvvvv111100 Not used
There is a comment buried in erl_term.h (not sure of its exact location) which explains why ERTS doesn’t use 00 for the fixnum tag. Do you think it would be hard to change that?
Well, I do not think it would be that hard to change it but I don’t think the gain would be that big. (I think we did some measurements on this back in the days but I can’t find any numbers, so I guess one could try it and measure it again.)
We do want to store immediates on the stack, and on the stack (especially the native hipe stack) we really want 00 for cp (return addresses) so that we can use the address directly without untagging. Also, the tag used now means that in most cases we can do integer arithmetic by just untagging one of the arguments. I’ll try to go into the details in the book. (And maybe do some measurements.) Not promising anything yet though.