Elixir : Internal data representation of Tuples
Tuples are a built-in collection data type in elixir which are implemented based on the array data structure. In order to understand the internal working of elixir’s data representation, please go through this article that gives you an overview of the underlying erlang’s tagging system, which is required to understand this article. This article explains the internal data representation of the tuple data type in elixir.
Boxed term
Tuples fall under the boxed term category, where the actual data and metadata is stored in the heap. The tuples are represented by a boxed term which takes up one word. The least two significant bits of the boxed term contain 10
as the primary tag and the rest of the bits contain a pointer that points to a header word in the heap. The boxed term can be present either in the stack when defined as a local variable in a function, or in the heap if the tuple is part of another data structure.
Header term
The header term present in the heap will be pointed to by the pointer in the boxed term. The header word contains 00
as the primary tag in its least two significant bits, denoting the header term. The secondary tag 0000
will be present in the next four least significant bits, denoting the tuple data type. The rest of the bits contain the binary value of the arity of the data words. If the tuple has 3 elements, then the arity would be its equivalent binary representation 11
. The arity bits correspond to the metadata that stores the precomputed size of the tuple. Please note that for empty tuples, a recent optimisation involves using a global shared header word(arity 0) present in a global memory area instead of creating header words individually within the respective process heap, every time an empty tuple is defined. The boxed terms for all empty tuples will contain a pointer that points to this global empty tuple header word.
Data words
Followed by the header word, each element present in the tuple will be represented by a separate data word associated with the element’s value. These words will be contiguously stored in memory right after the header word, thus following an array like structure, that enables the fast index based data access in the tuples. If the arity bits of the header word is 3, then it means that there are three elements in the tuple. So there will be 3 data words stored contiguously right after the header word, each corresponding to the respective element.
Example bit representation
Let us take a tuple {100, :atom1, 1.5} in a 32 bit system. This tuple will have a boxed term that has 10
as the primary tag bits and will contain a pointer in the rest of the 30 bits available in the boxed term’s word. This pointer will point to the header word in the heap.
The header word in the heap will contain 00
(primary tag for header word) as the least two bits, 0000
(secondary tag for tuple) as the next four bits in the least significant end. The rest of the 32 - (2 + 4) = 26 bits will contain the arity of the data words. This tuple has 3 elements and hence the arity is 3. The binary equivalent of 3 is 11
which will be stored in the 26 arity bits in the header word.
Followed by the header word, there will be three data words stored contiguously in memory with each representing the elements, in the same defined order. The first data word will contain the term associated with the first element which is 100. 100 falls under the small integer category and hence the first data word will contain the immediate term representing the small integer 100. The second data word will contain the immediate term representing the atom, :atom1. The last element is a float and hence the last data word will contain a boxed term which contains a pointer to a header word in the heap, representing the float value 1.5.
Please note that the above representation does not include the header and data words required to represent the float value 1.5. Hence the tuple {100, :atom1, 1.5} takes up a total of 8 words in a 32 bit system including the header word and 2 data words required to represent the value 1.5.
Tuple efficiency
Tuples are very efficient for index based data access and for obtaining the size of the tuple. The reason why index based access is efficient and performant(O(1)) in tuples is because of the fact that the elements are stored contiguously. This enables directly calculating the memory address of an element based on its index and the base address of the first element, instead of traversing the nodes linearly like in the lists until the index is reached.
On the other hand, tuples are not efficient for modification. Elixir’s data is immutable and so whenever any function operates on any datatype, a new version of the data is returned and the original version of the data is maintained. Elixir does this efficiently by reusing as much as it can and only creating new words if required. Unlike the lists, for which the position of the node being modified determines the overall performance of the operation, the tuples require shallow copying of all the words irrespective of the position of the element being modified. In case an element 4 is added at index 0 in the tuple {1,2,3}, a new boxed term, header term, and 4 data words will be created in addition to the previously existing words. In case of elements such as floats, only the boxed term will be duplicated and the existing header and data words representing the float will be reused when shallow copying.