Elixir : Internal data representation of heap binaries
Binaries in elixir are internally represented by four different types of implementations such as heap binary, refc binary, sub binary and match context . Heap binaries are containers present in the individual process heap, used to represent binaries that are of size, less or than equal to 64 bytes. In order to understand the internal working of erlang’s data representation, please go through this article that gives you an overview of the erlang’s tagging system, which is required to understand this article. This article explains the internal data representation of heap binaries in elixir.
Boxed term
Heap binaries fall under the boxed term category, where the actual data is stored in the heap. The boxed term takes up a word and it contains the value 10
in its least two significant bits, denoting the primary tag. The rest of the bits contain a pointer to a header word in the heap. This boxed term can be present either in the stack defined as a local variable in a function, or it can be found in the heap when the heap binary is part of another data structure.
Header word
The header word in the heap, pointed to by the pointer in a boxed term, takes up a word and contains 00
as its least two significant bits, denoting the primary tag for the header word. The next 4 least significant bits contain the value 1001
, which is the secondary header tag for heap binaries. The rest of the bits in the header are used for representing arity which denotes the number of data words that contiguously follow the header word. The arity value for heap binaries range from 1 to 9 in a 64 bit system.
Size word
The first data word stored after the header word is the size word that denotes the number of bytes present in the binary, represented as a binary value. This could range from 0 to 64 for a heap binary, where 0 indicates an empty binary and 64 indicates that the binary has the maximum of 64 bytes of data. Since the byte size of the binary is precomputed and stored explicitly, the kernel guard function byte_size/1
is a constant time operation. Please note that unlike the byte_size/1
function, the String.length/1
is a linear time operation since the raw bytes have to be traversed and decoded into codepoints.
Data words
The data words followed by the size word represent the actual bytes present in the heap binary. If the heap binary is an empty binary with no data, there will not be any data word after the size word. The first byte of the binary will always be present in the highest address of the first word i.e. the least significant byte of the word.
"Hello-world"
0x100 0x101 0x102 0x103 0x104 0x105 0x106 1x107
word 1 - o w - o l l e H
0x108 0x109 0x110 0x111 0x112 0x113 0x114 0x115
word 2 - 0 0 0 0 0 d l r
Example bit representation
Let us take the string “hello world” whose internal binary representation encoded using utf8 is <<104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100>>. The byte size of the string is 11 which falls under the heap binary category. Let us represent the above string as a heap binary in a 64 bit system. The boxed term for the above binary takes up a word of 64 bits where the least two significant bits are denoted by primary tag 10
and the rest of the bits contain a pointer to a header word in the heap.
The header word in the heap, taking up 64 bits, contains 00
as the least two significant bits, denoting the primary tag for a header word. The next four least significant bits will contain the value 1001
denoting the secondary header tag for heap binaries. The rest of the bits in the word contain the arity 3 denoted as the binary value 11
. The arity 3 includes one size word and the two data words required to represent the original bytes of the binary.
The size word stored right next to the header word will contain the decimal value 11 denoted in binary as 1011
, which is the byte size of the binary that is being represented.
The first data word followed by the size word will contain the first 8 bytes of the binary, with the first byte 104 stored in the least significant byte and the eighth byte 111 stored in the most significant byte.
The last data word will contain the rest of the 3 bytes in the binary stored in a way where the first byte of the three bytes, 114 stored in the least significant byte and the last byte 100 stored in the third least significant byte. The rest of the unused bytes in the data word will be padded with the value 0.