Elixir : Internal data representation of Refc binaries
Binaries in elixir are internally represented by four different types of implementations such as heap binary, refc binary, sub binary and match context . Reference counted(refc) binaries are containers used to represent binaries that are of size more than 64 bytes. It enables sharing of large binaries among multiple processes. In order to understand the internal working of erlang’s data representation, please go through this article that gives you an overview of the erlang’s tagging system, which is required to understand this article. This article explains the internal data representation of refc binaries in elixir.
The reference counted binary consists of two main parts such as the binary object and the ProcBin object.
Binary object
The binary object also called as the off-heap binary is present in a common shared memory area called binary heap which consists of the original byte data of the large binary and its meta data. It consists of data such as flags, reference count, byte size and the actual byte data.
Flag word
The first word of the binary object in the shared memory area consists of flag bits used internally by different parts of erlang.
Reference count word
The next word followed by the flags contains the reference count integer represented as binary value. This count indicates the total number of references that are pointing to this particular binary object. This count is incremented after new references are created to this binary object and is decremented after a reference to this binary object is no longer alive. A binary object will be eligible for garbage collection once its reference count gets to 0.
Size word
The next word followed by the reference count contains the total byte size of the represented binary. For e.g. if the byte size of the binary is 65, the word will contain the binary representation 1000001
.
Data words
The next words followed by the size word will represent each byte of the actual binary. The first byte of the binary will always be present in the highest address of the first word i.e. the least significant byte of the word.
Byte data representation of <<1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18....>>
x x x x x x x x - Size word
8 7 6 5 4 3 2 1 - Data word 1
16 15 14 13 12 11 10 9 - Data word 2
. . . 18 17 - Data word 3
ProcBin Object
The ProcBin object is present in the process heap and it serves as a reference to the actual binary object in the binary heap. Whenever a refc binary is sent to another process through messages, only the ProcBin object will be created in the new process and both the ProcBin objects in both processes will refer to the same binary object in the binary heap. The ProcBin consists of a boxed term, header word, size word, next ProcBin pointer word, binary object pointer word, actual data bytes pointer word and a flag word.
Boxed term
The boxed term of the ProcBin object takes up a word and it contains the value 10
in its least two significant bits, denoting the primary tag. The rest of the bits contain a pointer to a header word in the heap. This boxed term can be present either in the stack defined as a local variable in a function, or it can be found in the heap when the binary is part of another data structure.
Header word
The header word in the heap, pointed to by the pointer in a boxed term, takes up a word and contains 00
as its least two significant bits, denoting the primary tag for the header word. The next 4 least significant bits contain the value 1000
, which is the secondary header tag for refc binaries. The rest of the bits in the header are used for representing arity which denotes the number of data words that contiguously follow the header word. The arity value in the header word for a refc binary will always be 5, denoting the size word, next ProcBin pointer word, binary object pointer word, actual data bytes pointer word and the flag word.
Size word
The first data word stored after the header word is the size word that denotes the number of bytes present in the off-heap binary, represented as a binary value. For e.g. if the byte size of the binary is 65 the word will contain the binary representation 1000001
.
Next ProcBin pointer word
In order to keep track of all the ProcBin references in a particular process, all of the ProcBin objects are connected together using a linked list. This linked list, also called as the mso list(mark and sweep object list) aids in updating the reference counter of binary objects in the binary heap after every garbage collection in the process. Newly created ProcBin objects will be prepended to the mso list and the location of the head of the mso list will be stored in the ErlOffHeap data in the process control block memory area. The process control block(PCB) is a memory area present in each process just like the stack and the heap. It holds information about the process such as the PID, total heap size, its registered name, status etc. Thus the ErlOffHeap data in the PCB will always contain the pointer to the latest ProcBin object in the process and every ProcBin object will in turn contain a pointer to the next ProcBin object present in the mso list.
Binary object pointer word
The next word in the ProcBin object contains a pointer to the referenced binary object present in the binary heap. This pointer points exactly to the beginning of the binary object i.e. to the flag word of the binary object.
Actual data bytes pointer word
The next word in the ProcBin object contains a pointer to the actual byte data of the off-heap binary. This pointer points exactly to the first data word present in the referenced binary object.
Flag word
The last word of the ProcBin object contains flag bits related to various binary operation optimisations such as the binary append optimisation.
Example bit representation
Let us represent the binary <<1,2,3,4,5,6,7…..64,65>> in a 64 bit system. The total bytes in the binary is 65 which is above the heap binary limit of 64. Hence this binary will be created as a refc binary.
The binary object will be created in the binary heap with its first word initialised to 0x0 as flag bits. The next word will contain the initial value 1 as reference count for the binary object, since a ProcBin object will refer to this binary object. The next word will contain the byte size of the actual data which is 65, represented as 1000001
. The next 9 words will be the data words required to store the actual data bytes of the binary. The bytes will be represented in each word in the same order as explained above in the binary object section.
The ProcBin object will require a boxed term in the stack that has 10
as its primary tag and the rest of the bits will contain a pointer to the header word in the heap. The header word in the heap will contain 00
as primary tag, 1000
as secondary tag and 5 as the arity, represented as 101
in the rest of the bits. The next word in the ProcBin is the size word that represents 65 as 1000001
. The next word will be the next ProcBin pointer for the mso list. Let us assume that this is the first ProcBin created in this process and hence the pointer will just contain zeros denoting the end of the mso list. The ProcBin’s header word’s address will in turn be stored in the PCB’s ErlOffHeap data as the head of the mso list. The next word will be a pointer to the flag word of the binary object. The next word will be pointer to the first data word of the binary object. The final word will be the flag word that will be filled with zeros. This word’s flag bits will be mutated and used when this particular ProcBin is involved in an optimisation-supported operations such as the append operation.