Elixir : Internal data representation.
Elixir runs on top of erlang, and in order to understand the internal representation of elixir data types, we need to learn how the erlang VM operates and manages memory. All of the data types in elixir translate to the underlying erlang’s type system either directly or with some transformation. This article gives a general overview of how erlang represents its various datatypes internally. Let’s understand some of the terms used in erlang which are required to understand the internal representation of its data types.
Term
A term in Erlang/Elixir indicates any form of data or value. All of the data types such as integer, float, atom, list, tuple, functions, map, binary etc are terms.
Word
A word is a basic unit of memory with fixed size of bits that is based on the system architecture. In Erlang, all of its terms’ sizes are calculated in terms of words. For a 32-bit system, a word would be equal to 4 bytes or 32 bits, and for a 64-bit system, a word equals 8 bytes or 64 bits.
Erlang’s tagging system
Erlang uses a tagging system as a way to attach metadata to the values stored in words. This metadata would include information about the data type of the term, its size and information about how the actual data is stored or distributed in memory. This metadata is stored as specific tags in the least significant bits of the words. These terms which are tagged are generally denoted in erlang as an Eterm data type. The least significant two bits of an Eterm would determine its primary data type and are called as the primary tags. Erlang consists of four main primary tag types such as the immediate, boxed, list and the header, each representing a particular term type being stored. Out of these primary tag types, the immediate and boxed terms have more subsequent tags as bits that can be used to further classify the exact data type of the term being stored.
Immediate
The immediate term’s primary tag comprises of the binary value 11
as its least two significant bits. The immediate term takes up the size of one word whose bits can hold both the actual data and the tags within the same word. The next two or four least significant bits constitute more tags that can be used to further pinpoint the specific data type of the term. The major types that come under the immediate terms are local PID, local port, small integer, atom, NIL etc. The immediate terms can be present in both stack and heap memory locations.
Boxed
The boxed term’s primary tag comprises the binary value 10
as its least two significant bits. In the remaining available bits, the boxed terms always contain a pointer that points to a header term stored in the heap. Boxed terms are used when the actual data and the respective tags cannot be stored within a single word. The specific term type and its details can be obtained from the header term in the heap that it points to. The boxed term, similar to an immediate term, can be found in both stack and heap.
Header
The header term’s primary tag comprises the binary value 00
as its least two significant bits. The header word uses four more bits in the least significant end to identify the exact data type it holds. Followed by the secondary tag bits is the arity of the actual data in binary representation i.e. how many more words contain the actual data. The actual data words are always stored contiguously after the header term. Hence, the arity information from the header term can be used to get actual data from the data words present right next to the header term. A header term is only found in the heap and a boxed term containing a pointer always points to a respective header term. The various erlang subtypes of header term would constitute big integer, float, tuple, binary, reference, external PID, external port, external reference, fun table etc
List
The list term’s primary tag comprises the binary value 01
as its least two significant bits. The rest of the bits contain a pointer that points to a node in the heap. A node in the heap contains two contiguous words with the first word containing the first element(CAR) of the list and the second word containing another list term that represents the tail(CDR) of the list. Note that for an empty list, it would just contain the immediate value NIL and not the list term. In fact the immediate value NIL will be stored in the last node’s CDR to denote the end of the list.
Internal data representation of specific data types
Internal data representation of small integers.
Internal data representation of big integers.
Internal data representation of float.
Internal data representation of atoms.
Internal data representation of lists.
Internal data representation of tuples.
Internal data representation of small maps.
Internal data representation of large maps.
Internal data representation of heap binaries.
Internal data representation of refc binaries.
Internal data representation of sub binaries.
Internal data representation of match contexts.