Elixir : Internal data representation of match contexts
Binaries in elixir are internally represented by four different types of implementations such as heap binary, refc binary, sub binary and match context. Match contexts are just references that refer to a particular part of a binary container such as a heap binary or a refc binary. In order to understand the internal working of erlang’s data representation, please go through this article that gives you an overview of the erlang’s tagging system, which is required to understand this article. This article explains the internal data representation of match contexts in elixir.
Match contexts are internal data structures produced by a compiler optimisation, that enables you to efficiently perform binary matching. Match contexts are created only under certain conditions and they are used as an efficient alternative for creating multiple sub binaries during binary matching. Unlike sub binaries which are immutable and are created newly for each iteration, match contexts are mutable. Hence, a single match context can be created once, mutated and passed around for the whole binary matching process. The match contexts hold data for the current position in the binary and this meta data will be updated directly after each iteration, thus efficiently using constant space to traverse through the binary. Match contexts are only created when the compiler knows that the matched part is not shared and is only used for the successive matching process.
defmodule Sub do
def read_byte(<<>>), do: :ok
def read_byte(<<f, rest::bytes>>) do
IO.puts(f)
read_byte(rest)
end
end
In the above code, the binary is read byte by byte, and in each iteration, the rest of the binary is only passed on to the next iteration and is not used or accessed anywhere else. These conditions will enable the creation of a match context instead of sub binaries. In order to obtain more information about where and why your code has not been optimised for match context creation, you can use the bin_opt_info compiler option while compiling.
# Windows Command Prompt
set ERL_COMPILER_OPTIONS=bin_opt_info && mix compile --force
Compiling 1 file (.ex)
warning: OPTIMIZED: creation of sub binary delayed
lib/test.ex:4
warning: NOT OPTIMIZED: sub binary is used or returned
lib/test.ex:8
warning: NOT OPTIMIZED: called function test_bin/2 does not begin with a suitable binary matching instruction
lib/test.ex:12
Generated test app
In order to understand more about efficient handling of binaries, check out these articles.
Optimizing Erlang Binary Matching
Erlang efficiency guide
Internal representation
Header word
The match contexts like other boxed terms, has a header word that has 00 as its primary tag, 0001 as its secondary header tag and an arity value in the rest of its bits. The arity can have a minimum value of 4.
Original binary word
The first data word will be the original binary word, which is a boxed term that contains a pointer to the header word of the original heap binary or the ProcBin object.
Base pointer word
The second data word is the base word that contains a pointer to a byte in the original binary, denoting the current base position in the binary.
Offset word
The third data word is the offset word that contains the total offset in bits relative to the current base position pointed to by the base pointer word. These offset bits can be mutated to change the position of the match context over the binary matching process.
Size word
The fourth data word is the size word that contains the total size of the original binary’s actual data in bits.
Saved offsets
After the fourth data word, the match contexts may or may not contain a variable length of words used for saving offsets. In each word, a previously visited position can be saved as a bit offset. These saved offsets can then be accessed later to efficiently backtrack or to move the current position of the match context to another position.