Elixir : Internal data representation of Lists
Lists are a commonly used built-in collection datatype in elixir which are implemented based on the singly linked list structure. In order to understand the internal working of elixir’s data representation, please go through this article that gives you an overview of the underlying erlang’s tagging system, which is required to understand this article. This article explains the internal data representation of the list datatype in elixir.
List term
Every list with elements, starts with the root list term, which takes up one word. The list term’s primary tag comprises the binary value 01
as the least two significant bits. The rest of the bits contain a pointer that points to a node in the heap. The list term can be present in the stack as a local variable defined within a function, or in the heap, if it is part of another list node or a part of another data structure.
NIL term
The nil term is an immediate term that is used to denote an empty list or the end of a list. The nil term contains the primary tag value of 11
as its least two significant bits. The next four least significant bits contain the value 1110
as the secondary immediate tags, denoting the nil term. The rest of the bits in the word are filled with either 1s or with 0s. Hence when an empty list is defined, only the nil immediate term is used to represent the list and not the list term.
Node
A node comprises two contiguously stored words in the heap. The first word of the node denotes the head and will contain the term associated with the head element’s value. The second word of the node denotes the tail and may contain either another list term pointing to the next node or a nil term indicating the end of the list. Every element in a list is associated with a node in the heap and hence the number of nodes equals the number of elements in the list. Every list node’s first word will be pointed to by the pointer in the root list term or the pointer in the previous node’s list term.
Example bit representation
Let us take a list [100, :atom1, 1.5] in a 32 bit system. This list will have a root list term that has 01
as its primary tag bits and will contain a pointer in the rest of the 30 bits available in the list term’s word. This pointer will point to the first word of the first node in the heap.
The first word of the first node will contain the term associated with the first element which is 100. 100 falls under the small integer category and hence the first word of the first node will contain the immediate term representing the small integer 100. The second word of the first node will again contain a list term with a pointer that points to the next node.
The second node’s first word will contain the immediate term representing the atom, :atom1. The second node’s second word will in turn contain another list term pointing to the last node’s first word.
The last element is a float and hence the last node’s first word will contain a boxed term which contains a pointer to a header word in the heap, representing the float value 1.5. The last node’s second word will contain the bit representation of the nil term indicating the end of the list.
Please note that the above representation does not include the header and data words required to represent the float value 1.5. Hence the list [100, :atom1, 1.5] takes up a total of 10 words in a 32 bit system including the header word and 2 data words required to represent the value 1.5. With the above representation of the list, it is easier to visualise the head and tail notation that is widely used in lists. Every first word in a node will contain the head element and every second word can be visualised as a separate list, forming the tail. Just imagine that the cons operator is like a boundary between the first word and the second word in a list node.
List modification
Elixir’s data is immutable and this means that once data is stored in the memory area, the only time it gets altered is when it gets removed. In the heap, this removal happens during the garbage collection or when the memory is reclaimed once the process is stopped or terminated. So whenever any function operates on any datatype, a new version of the data is returned and the original version of the data is maintained. Elixir does this efficiently by reusing as much as it can and only creating new words if required. When it comes to lists in elixir, since they are implemented based on the linked list structure, random index based access is not as efficient as the array based tuple structure. The list has to be traversed all the way to the respective index. But the one thing that they are advantageous over the array based tuple, is that the modification of elements is more efficient in this case.
Appending vs Prepending
Prepending elements to the list is much more efficient in terms of performance and memory usage than appending elements to the list. In fact, any kind of operation in the head end of the list is more efficient than in the tail end of the list. When we are prepending an element to a list, all we have to do is create a new root list term, create a new node with the new element and point the new node to the existing list’s first node. Let’s prepend the element 3 to the list [1,2].
But, this is not the case with appending 3 to the list [1,2]. All of the existing nodes will have to be created with the new pointer data in order to achieve this.
If the elements in the original lists are boxed terms such as floats, the header and data words in the heap will be reused and only the boxed terms will be duplicated. The position where the node is modified plays a direct role in the performance of the operation. The lesser the number of nodes before the modified node’s position, the faster the operation will be. All the list nodes before the position of the modified node will have to be created again with the altered pointer data.
In case of prepending, the position of modification is essentially the front and so none of the existing nodes get modified. But in case of appending, the node being modified(created in this case) is present at the very end of the list and hence this requires creation of all of the nodes before it. This is the same for inserting or deleting a node. Hence modification done in the head end of the list is more efficient when compared to the modifications done in the tail end of the list.