Elixir : Internal data representation of sub binaries
Binaries in elixir are internally represented by four different types of implementations such as heap binary, refc binary, sub binary and match context. Sub binaries are just references that refer to a particular part of a binary container such as a heap binary or a refc binary. In order to understand the internal working of erlang’s data representation, please go through this article that gives you an overview of the erlang’s tagging system, which is required to understand this article. This article explains the internal data representation of sub binaries in elixir.
A sub binary can be explicitly created when you define a bitstring whose size in bits is not divisible by 8. Sub binaries can also be implicitly created in order to facilitate efficient handling of binaries during operations such as recursive appending and binary pattern matching etc. This implicit creation of sub binaries will only refer to parts of the original binary and will eliminate the need for creating new copies of the original heap or refc binaries during recursive calls.
defmodule Sub do
def read_byte(<<>>), do: :ok
def read_byte(<<f, rest::bytes>>) do
IO.puts("#{f}, Remaining bytes - #{byte_size(rest)}")
read_byte(rest)
end
end
In the above function a binary is read byte by byte using recursion. In each iteration the first byte is extracted and the remaining bytes are bound to the variable rest
and recursed. In this case creating a new copy of rest as a heap or refc binary in each recursive function call is inefficient. Instead, the implicitly created sub binaries are more efficient, especially when reading large binaries. Please note that a match context is even more efficient than sub binaries in these cases which will be discussed in another article.
Boxed term
A sub binary falls under the boxed term category. The boxed term of a sub binary takes up a word and it contains the value10
in its least two significant bits, denoting the primary tag. The rest of the bits contain a pointer to a header word in the heap.
Header word
The header word in the heap, pointed to by the pointer in a boxed term, takes up a word and contains 00
as its least two significant bits, denoting the primary tag for the header word. The next 4 least significant bits contain the value 1010
, which is the secondary header tag for sub binaries. The rest of the bits in the header are used for representing arity which denotes the number of data words that contiguously follow the header word. The arity value in the header word for a sub binary will be 3, denoting the byte size word, byte offset word and bit info word. The 3 data words will be followed by a boxed term that points to the header word of either a heap binary or a ProcBin object.
Byte size word
The first data word stored after the header word is the byte size word that denotes the number of complete bytes present in the part of the original binary that it is referring to. This value will not include the last partial byte if any. For e.g. if the sub binary is referring to a bit string with a partial byte <<15, 15::4>>, the byte size word will contain the value 1, denoting the one complete byte and not including the last partial byte.
Byte offset word
The next data word following the byte size word is the byte offset word. Sub binaries refer to a particular part of another binary and this part can begin and end anywhere within the original binary. The byte offset value represents the starting byte of the sub binary in the referred original binary. An offset of value 0 indicates that the sub binary starts with the first byte of the original binary. For e.g. if the original binary is <<11,12,13,14,15,16,17>> and the sub binary starts with the byte 13, then the offset value in the byte offset word will be 2 represented as 10.
Bit info word
The next word followed by the byte offset word comprises three types of data. The least significant byte of this word indicates the bit size of the partial byte if any. The second least significant byte indicates the bit offset of the first byte of the sub binary and the third least significant byte of the word is used as a flag called is_writable flag, used for optimisation of operations like binary append. The bit size and the bit offset bytes can have values from 0 to 7 and the is_writable flag can have the value 0 or 1.
For e.g. if the sub binary is referring to this bit string <<15, 10::4>> the bit size byte will contain the value 4, represented as 100
, indicating that the last partial byte has the bit size of 4. In the original binary the last partial byte will be represented as 10100000
, where the first 4 bits will be read as the integer 10.
Since binaries in elixir give you a bit level control, when a sub binary refers to a part of another binary, this part can not only start from a byte but also from a particular bit of the byte. This is where the bit offset comes into play, pinpointing exactly at which bit of the first byte, does this sub binary begin from. If the bit offset value is zero then it indicates that the sub binary begins with the first bit and will include the whole first byte.
For e.g. if an original binary is pattern matched into a sub binary as
<<10, int::4, rest::bits>> = <<10, 15, 10>>, the int variable will take up the first four bits of the byte 15 and the rest
sub binary will start at the 5th bit of the second byte 15. In this case the byte offset word for rest
will contain 1 and the bit offset byte of the bit info word will contain the value 4 indicating that the sub binary starts from the 5th bit of the second byte of the original binary.
Original boxed term
Followed by all of the data words will be a boxed term with 10
as its primary tag and a pointer value in the rest of the bits. This pointer will point to the header word of the original binary which could be either a heap binary or the ProcBin object of a refc binary. This word will not be included in the arity of the sub binary’s header word for simplifying the garbage collection process.
Example bit representation
Let us represent the bit string <<10, 15::4>> in a 64 bit system. For every sub binary there must exist an original heap or refc binary that the sub binary can point to. In case of bit strings with partial bytes, a new heap or refc binary will be created first based on the size and then a new sub binary will be created to point to the newly created original binary. In our case, a heap binary will be created since the size of the binary is less than 65 bytes. When a bit string with more than 64 bytes is defined, then the original binary will instead be created as a refc binary. In our case, the original heap binary will contain the actual data bytes as 11110000_0001010 in its data word where the least significant byte will represent the value 10 and the next least significant byte will represent 15 in the first 4 bits. The rest of the bits in a partial byte will be filled with zeros.
The sub binary will start with a boxed term that will contain 10
as its primary tag and a pointer in the rest of the bits pointing to a header word in the heap.
The header word will contain 00
as its primary tag, 1010
as its secondary tag and an arity value of 3.
The first data word is the byte size word which will contain the value 1 denoting the one complete byte present in the sub binary.
The second data word will be the byte offset word which will contain the value 0. This is because for explicitly created bit strings, the original binary is newly created for this sub binary and hence the sub binary will refer to the whole data of the original binary. Hence both the byte offset and the bit offset will be zero for explicitly created sub binaries for bitstrings.
The third word will contain the value 4 as bit size in the least significant byte, indicating that the last partial byte present in the original binary, takes up 4 bits. The next least significant byte of the third word will contain the value 0 as bit offset since this is an explicitly created sub binary for bit strings. The third least significant byte of the third word denoting the is_writable flag, will be having the value 0 . This value will be mutated when a sub binary is involved in append optimisation.
The final word after the data words is a boxed term with 10
as its primary tag and a pointer that points to the header word of the newly created heap binary.
Let us now represent an implicitly created sub binary. Let us assume that a part of an original binary is created as a sub binary. Let us take the heap binary <<10, 15, 10>> out of which the last 12 bits denoted as rest
are created as a sub binary. <<int::12, rest::bits>> = <<10, 15, 10>>. In this case the original heap binary is already present and hence only a sub binary will be created. This sub binary will be created with a boxed term, a header word with arity 3, byte size word containing the value 2, byte offset containing the value 1, bit size byte containing the value 0 since the last byte of the sub binary is not a partial byte, bit offset byte value as 4, is_writable flag byte value as 0 and finally a boxed term pointing to the header word of the original heap binary.