Elixir : Internal data representation of small maps
Maps in elixir are internally represented by two different implementations based on their size. Maps having key-value pairs less than or equal to 32 are called small maps or flat maps and they are implemented based on a sorted tuple structure. In order to understand the internal working of erlang’s data representation, please go through this article that gives you an overview of the erlang’s tagging system, which is required to understand this article. This article explains the internal data representation of small maps in elixir.
Boxed term
Maps fall under the boxed term category, where the actual data is stored in the heap. The boxed term takes up a word and it contains the value 10
in its least two significant bits, denoting the primary tag. The rest of the bits contain a pointer to a header word in the heap. This boxed term can be present either in the stack defined as a local variable in a function, or it can be found in the heap when the map is part of another data structure.
Header word
The header word in the heap, pointed to by the pointer in a boxed term, takes up a word and contains 00
as its least two significant bits, denoting the primary tag for the header word. The next 4 least significant bits contain the value 1111
, which is the secondary header tag for maps. The next two bits are used as the map sub tag to distinguish the flat map and large map(hash map) types. For the small map, the sub tag bits will contain 00
. The next 8 bits are used for the arity which will always be 1
for a small map and the rest of the 16 bits which are used for value, always contain 0
for small maps.
Data word 1
The first data word following the header word will contain the binary value of the total size of the map. If the map has a size of 5, indicating 5 key-value pairs, then the binary equivalent of 5, 101
will be present in the first data word. Since the size is precomputed and stored as metadata, getting the size of a small map will be a constant time, performant operation.
Data word 2
The second data word contains a boxed term for the key tuple, that contains a pointer to another header word in the heap. This tuple will have data words associated with all the keys in the map. If the map has a size of 10, all the 10 keys will be sorted and stored in this tuple. The sorting will be based on the data type, values and not on the insertion order.
Data type ordering in elixir
integer < float < atom < reference < function < port < pid < tuple < map
< list < bitstring
Data word 3..n
The next data words contain the values associated with each key. They will be stored contiguously in the same order as the sorted keys in the tuple. If the map has a size of 10, then there will be 10 data words stored contiguously right after the tuple boxed term, each representing the value associated with the keys in the same order.
Example bit representation
Let us take the map %{:one => 1, 2 => 2}
in a 32-bit system. The boxed term for the above map, taking up 32 bits, will contain the primary tag, 10
as the least two significant bits. The rest of the bits will contain a pointer to a header word in the heap.
The header word will contain 00
as its least two significant bits, denoting the primary tag. The next four least significant bits will contain 1111
as the secondary header tag for maps. The next two map sub-tag bits will contain 00
denoting the flat map type. The next 8 bits will contain the arity 1
and the rest of the 16 bits will contain the value 0
.
The next data word will contain the size of the tuple, which is 2 in our case. Hence the 32 bit data word will contain the binary representation of the number 2, 10
.
Followed by the size data word, will be the boxed term for the key tuple. This boxed term will again have 10
as the primary tag and a pointer to a header word in the heap. The header word will in turn contain the representation for storing the sorted key tuple. It will have 00
as its primary tag, 0000
as its secondary header tag for tuple. The rest of the 26 bits will contain the arity of data words, which in our case will be 2, each denoting a key in the map. The keys will be sorted based on the data type and the value. In our case, the keys are the atom :one
and the integer 2
. According to Elixir’s data type ordering, the integer comes before the atom and hence the key order will be 2, :one. If there are multiple keys of the same data type, then they will be sorted according to the sorting rules of the data type.
Followed by the key tuple’s boxed term, will be the data words that will contiguously represent the values in the same order as the keys in the key tuple. In our case we have two values and so there will be two more data words that are stored after the key tuple’s boxed term. The keys are sorted as 2, :one and hence the values will be sorted in the same order. The first value data word will represent the small integer term 2, associated with the key 2 and the second value data word will contain the small integer term 1, associated with the atom :one key.
Hence the map %{:one => 1, 2 => 2}
takes up a total of 9 words.
Accessing value by key - Linear search
Whenever a value is accessed using a key, all of the keys in the key tuple will be iterated one by one until the provided key is equal to a particular key in the key tuple. Since both keys and values are stored in the same order, the index of the accessed key will be used to access the value directly from the data words that store the values. The === operation will be used to check if the provided key is equal to any of the keys in the key tuple. Since the keys are iterated one by one, accessing values of the keys present in the beginning of the key tuple is faster than accessing the values for keys present in the far end of the key tuple. This can be visualised from the following benchmark. The performance of small maps also depends on the complexity of the key terms used, as they are compared with the provided key. A map containing immediate terms such as atoms, small numbers etc as keys is much more performant than a map containing a complex structure such as a list, tuple or another map as key. This is because comparison of immediate terms involves just comparing their values, while comparison of structures like lists may involve complex operations such as calculating and comparing their sizes, comparing their each element etc.
Map modification
Since elixir’s data is immutable, every time a map is updated, a new version of the map will be created, which will share data as much as it can from the previous version. Sharing of the key tuple among different versions of the map is efficient and possible when you know all the keys of your map beforehand. In this case, you can create the map with all the keys and default values which can be later updated as required. This enables sharing of the key tuple among all the future versions of the map as long as the keys are not modified.