Elixir : Internal data representation of Large Maps - HAMT

Arunmuthuram M
13 min readOct 7, 2023

--

Maps in elixir are internally represented by two different implementations based on their size. Maps having key-value pairs greater than 32 are called large maps or hash maps and they are implemented based on a persistent Hash Array Mapped Trie data structure. In order to understand the internal working of erlang’s data representation, please go through this article that gives you an overview of the erlang’s tagging system, which is required to understand this article. This article explains the internal data representation of large maps in elixir. Here are some of the resources that can explain what HAMTs are and their advantages.
- Hash Array Mapped Tries
- The magic behind Immutable maps
- Why you should love Hash Array Mapped Tries

Boxed term

Maps fall under the boxed term category, where the actual data is stored in the heap. The boxed term takes up a word and it contains the value 10 in its least two significant bits, denoting the primary tag. The rest of the bits contain a pointer to a header word in the heap. This boxed term can be present either in the stack defined as a local variable in a function, or it can be found in the heap when the map is part of another data structure.

Header term: Root bitmap header node

The header word in the heap, pointed to by the pointer in a boxed term, denotes the root bitmap header node, which takes up a word, and contains 00 as its least two significant bits, denoting the primary tag for the header word. The next 4 least significant bits contain the value 1111, which is the secondary header tag for maps. The next two bits are used as the map sub tag to distinguish the type of the map node. For the root bitmap header node, the map sub tag bits will contain 11. The next 8 bits are used for the arity which will always be 1 for a root bitmap header node and the rest of the 16 bits which are used for value, always contain 16-bit bitmap values, which we will be discussing later.

Size word

The arity of 1 in a root bitmap header node denotes the size word that will be stored right next to the root bitmap header node. The size will contain the binary representation of the total map size i.e. no of key value pairs present in the map. If the map has a size of 33, then the size word will contain its binary equivalent 100001 as its value.

Inner nodes

Followed by the size word, the large maps will contain contiguously stored inner nodes. The total number of inner nodes will be determined by the bitmap value in the root bitmap header node. The total number of 1 bits in the 16-bit bitmap equals the total number of inner nodes present after the size word. Each 1 bit in the root bitmap node’s bitmap value corresponds to an inner node stored contiguously after the size word. If the bitmap value of the root bitmap node is 10001000_11100011, then the total number of inner nodes will be 7 which is equal to the number of 1 bits present in the 16-bit bitmap value. The least significant 1 bit in the bitmap value corresponds to the first inner node and the most significant 1 bit in the bitmap corresponds to the last inner node. These inner nodes can be of two types such as a leaf node or a boxed node.

Leaf node

The leaf node contains the same representation of a list term with 01 as its least two significant bits, denoting the primary tag for a list and the rest of the bits would contain a pointer. This pointer will point to the word denoting a key, followed by another word denoting the respective value for the key. The main difference between a list term and the leaf node is that in lists, the tail of a list node will in turn again be a list term pointing to the second element and the last element will have an immediate nil term to denote the end of the list. But in this leaf node, the key and value will be stored right next to each other without having a second list term as tail and the immediate nil term.

Boxed node

The boxed node contains the same representation of a boxed term, having 10 as its least two significant bits, denoting the primary tag of a boxed term. The rest of the bits in the boxed node will contain a pointer which will either point to an inner bitmap header node or an array header node.

Inner bitmap header node

The inner bitmap header node resembles the root bitmap header node. It is a header word with 00 as its primary tag, 1111 as its secondary tag, 01 as its map sub tag, 8 bits for arity which will always be 0 for an inner bitmap header node and then a 16 bit value containing a bitmap. Unlike the root bitmap node, the inner bitmap node will not contain a size word. The inner bitmap header node will be followed directly by the inner nodes similar to the root bitmap header node.

Array header node

The array header node is used when more than one keys have the same hash value leading to a hash collision. The array header node will contain 00 as its primary tag, 1111 as its secondary tag and 10 as its map sub tag. The next 8 bits will denote the arity value equal to the number of key-value pairs having the same hash value. If two keys have the same hash, then the arity will be two. The next 16 bits reserved for the value will always be filled with 1s for an array header node. Followed by the array header node, there will be leaf nodes equal to the arity. These leaf nodes will in turn point to its respective key value pairs.

Bitmap calculation

Now that we have seen the different types of nodes present in a large map, let us see how the 16 bit bitmap value present in the bitmap nodes are created. The 1s and 0s present in the bitmap value directly depend on the hash value of all the keys present in the large map. In erlang, any term can be hashed to produce a hash value and the same hash value is returned every time the term is hashed. Erlang uses a 64 bit hash value for computing the bitmap values. For e.g. if we are using the small integer 1 as a key, when hashing the small number 1, we may get a hash value such as 7938660269479274665. The 64 bit binary representation of this hash value is as follows

01101110_00101011_11001001_01110100_01001010_11010001_10101000_10101001

Now let us group the bits by 4 and represent the hash value as a base 16 number.

0110_1110_0010_1011_1100_1001_0111_0100_0100_1010_1101_0001_1010_1000_1010_1001

which can be written as 6_14_2_11_12_9_7_4_4_10_13_1_10_8_10_9, which can in turn be written as the base 16 number 6E2BC9744AD1A8A9. The minimum possible digit in this 16 digit hash value is 0 and the maximum possible digit in this 16 digit hash value is F, which equals 15. These 16 digits correspond to the 16-bit bitmap value used in the bitmap nodes. Every digit in this 16 bit hash is used to construct the bitmap at different levels. The last digit of the hash will be used to construct the root bitmap node’s bitmap value and the next last digit will be used to construct the second level inner bitmap node’s bitmap value and so on. The first digit of the 16 digit hash will be used to construct the 16th - the last level’s bitmap value of the inner most bitmap node.

Let us now construct the root node’s bitmap for the above hash value. For the root bitmap, the last digit of the hashes are used. The last digit for the above hash value is 9. Now, starting from the least end of the 16-bit bitmap, we have to find the bit corresponding to the value 9 and set it as 1. This means that the 10th bit from the least end will be a 1 bit in the bitmap value of the root bitmap node.

0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0
15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00

A large map contains more than 32 keys and so the last digit of all the keys’ base 16 hashes will be used to construct the bitmap value of the root bitmap node using the above technique.

Constructing the inner nodes

Now that the 16-bit bitmap value has been constructed for the root bitmap header node, we can move on to the construction of the inner nodes. We have already seen that the number of inner nodes constructed, will be equal to the number of 1 bits in the bitmap value, and each 1 bit in the bitmap value will correspond to a single inner node stored below. The least 1 bit in the bitmap will correspond to the first inner node and the most significant 1 bit in the bitmap will correspond to the last inner node.

Now, we have to determine the type of inner node that each 1 bit in the bitmap corresponds to. There are two types of inner nodes such as the leaf node and the boxed node. When constructing the bitmap with a hash digit of all the keys, more than one key can have the same hash digit. If multiple keys share the same hash digit used in this level, then the respective inner node will be a boxed node that will point to another inner bitmap node. If a key’s hash value has a hash digit that is not shared by any other keys in the same level, then the inner node will be a leaf node that directly points to a key value pair. Hence, the next level of the map is only constructed if the current level’s hash digit is the same for multiple keys. In case of a hash collision where more than one key has the same hash digit in all the 16 levels, then the boxed node in the last level will point to an array header node that holds all the key value pairs corresponding to the same hash value.

Example representation

The representation of a large map is complex and a large map with a minimum size of 33 would take up more than a 100 words. Hence for simplicity, let us represent a large map with just 10 keys. Let us take the map %{1=>1, 2=>2, 3=>3, 4=>4, 5=>5, 6=>6, 7=>7, 8=>8, 9=>9, 10=>10}. The first thing that we have to do is generate 64 bit hash values for all the keys and convert them into base 16 numbers. The original internal hash values for these numbers are different and the hash values used here are just for the sake of representation.

1 -> 6E2BC9744AD1A8A9
2 -> 31060820874A0533
3 -> 8ABCB2AF1B4948A3
4 -> 8373AC4DBDC90B70
5 -> 928BAAC9B4D9DB1D
6 -> 5AE8895DD431346A
7 -> 8ABCB2AF1B4948A3
8 -> 7E0864650969D0C5
9 -> 9BF2D882EFABD7E6
10-> 2A6D351865CCA326

Let us now construct the bitmap value for the root bitmap node. Let us take the last digit of all the 10 hash values of the keys. 9,3,3,0,D,A,3,5,6,6
Let us assign the value 1 for all the places in the bitmap that contain the digits above.

0 0 1 0 0 1 1 0 0 1 1 0 1 0 0 1
F E D C B A 9 8 7 6 5 4 3 2 1 0

Now that we have constructed the root bitmap node’s bitmap value, we have to create inner nodes for all the 1 bits in the bitmap. The number of 1 bits in the root bitmap is 7 and hence there will be 7 inner nodes created after the size word. Let us now determine what the type of the inner node will be. This is done by checking how many keys share the same last digit. The first inner node will correspond to the 1 bit in the least end of the bitmap. For us the first 1 bit in the least end is for the position 0. If we check the number of hashes having 0 as the last digit, it is only one key, 4. Hence the inner node corresponding to the 0th position 1 bit, will be a leaf node which will point to the key value pair [4 | 4].
Let us take the second 1 bit in the least end which is in the position 3. If we check the number of hashes sharing the digit 3 as its last digit, we will find that there are three keys, 2, 3 and 7. Since the number of keys sharing the hash digit is more than one, the second inner node corresponding to the 1 bit at position 3 will be a boxed node that will point to an inner bitmap node. This inner bitmap node will in turn contain another 16 bit bitmap value corresponding to the second level of the map. Unlike the root bitmap, this second level bitmap is only constructed with the hashes that collided in the previous level. Hence the second last digit of the hashes of the keys 2, 3 and 7 are used to construct the bitmap of this second level inner bitmap node.

2 -> 31060820874A0533
3 -> 8ABCB2AF1B4948A3
7 -> 8ABCB2AF1B4948A3

The second digits from the least end are 3,A,A

0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0
F E D C B A 9 8 7 6 5 4 3 2 1 0

This inner bitmap will hence contain two inner nodes. The node corresponding to the digit 3 is a leaf node and will point to the key value pair [2 | 2]. The second inner node corresponding to the position A in the bitmap has two keys 3 and 7 sharing the hash digit. Hence its corresponding node will again be a boxed node that points to another inner bitmap node for the third level of the map. As you can see the two hash values of the keys 3 and 7 are the same and this will lead to a hash collision. The inner bitmap nodes are created for these two keys all the way up to the 16th level, after which the final inner node of the 16th level will point to an array header node that is used for handling collisions. This array header node contains the arity 2 and two contiguously stored leaf nodes pointing to the key value pairs [3 | 3] and [7 | 7]. Similarly the whole large map is constructed for all the keys. Please note that a hash collision occurs very rarely in real life large maps.

Representation of a large map

Reading values by keys

Let us now see various reading scenarios in the above constructed large map. Whenever a value is accessed by a key in a large map, the base 16 hash value for the key is first generated. The last digit of the key is taken and from the root bitmap value, the bit value of the respective position of the last digit is checked. If the bit value is zero, then the particular key is not present in the map and so nil is returned.

Let the key be 11, whose base 16 hash value could be 2A98945BC87687D1. The last digit is 1 and the 1st position of the root bitmap value contains the bit value 0. Hence the key is not present.

Let the key be 12, whose hash value may be 5E98945AB87383C0. The last digit is 0 and the 0th position of the root bitmap value contains the bit value 1. Hence we access the corresponding inner node for the 0th position. The index of the inner node can be determined by a bit count operation where the number of 1 bits to the right of the current position’s 1 bit, determines the index of the corresponding inner node. In our case, the 1 bit is in the 0th position. There are no 1 bits to the right of the 0th position and hence the index of the corresponding inner node is 0. We now access the inner node at index 0 which is a leaf node. Hence we follow the respective key pointed by the leaf node and check if the key is the same as the requested key. The stored key is 4 and the requested key is 12. Hence the key is not present and nil is returned.

Let the key be 9, whose base 16 hash value is 9BF2D882EFABD7E6. The last digit is 6 and the 6th position of the root bitmap value contains the bit value 1. Hence we access the corresponding inner node for the 6th position. The index for the inner node is 3 since we have three 1 bits to the right of 6th position in the bitmap. The inner node at index 3 is a boxed node and so we follow the pointer in the boxed node to an inner bitmap header node. Now we are at the second level of the map. We need to use the second last digit of our hash which is E. Now we check the Eth position in the second level inner bitmap header node’s bitmap value, which is a 1 bit. Hence we find the index by checking the number of 1 bits to the right of Eth position, which is 1. We access the inner node at index 1 which is a leaf node that points to the key 9. Since the requested key and the stored key are equal, we return the value stored next to the key, which is 9.

This is the technique used to read a large map by using keys. Hence, no matter how big the size of the large map is, any value can be accessed by traversing within 16 levels of the map.

Updating the large map

Since data in elixir is immutable, whenever a large map is updated, new words will be created to accommodate the changes. However, only the path or branch, starting from the root to the updated key is newly created and the words corresponding to the rest of the paths and branches of the old version of the large map will be reused in the new version of the large map. This is what gives the immutable maps a bit of an improvement in terms of memory efficiency and performance.

--

--

No responses yet