Elixir : Internal data representation of atoms
Atoms are one of the most widely used datatypes in elixir and erlang. In order to understand the internal working of elixir’s data representation, please go through this article that gives you an overview of the underlying erlang’s tagging system, which is required to understand this article. This article explains the internal data representation of the atom datatype in elixir.
Immediate term
Atoms in elixir fall under the immediate term category, which means that the data and the tag bits can be fit into a single word. The bit representation of atoms has a primary tag value of 11
as their least two significant bits, indicating that they are immediate terms. The next two least significant bits 10
are used as a secondary immediate tag to identify the immediate category. The next two least significant bits containing the value 00
indicate the atom data type. The rest of the bits in the word contain the actual data which in our case, is the index value referring to the atom stored in the global atom table. The immediate term for atom can be present in the stack as a local variable defined within a function, or in the heap, if the atom is part of another data structure such as tuple, list, map etc.
Atom table
The atom table is a global memory area in the beam VM that stores the atom content of all the atoms created. The atom table stores details such as the total number of atoms created so far, the index value of each atom, length of each atom’s content or name, the actual content or name of the atom and other details. Indexes are just unique numbers associated with each atom’s content on a one-to-one basis. The index values start from 0 and increase all the way to how many ever atoms have been created. The internal atom :false has the index 0 and the internal atom :true has the index 1 and this value keeps incrementing. Whenever a new atom is created, the index is bumped up to one more than the previous index and the new atom’s content is associated with it. The internal structure of the atom table and its bit representation of data allows fast lookup of the index for a particular atom content and vice versa. Please note that the global atom table also has a default finite capacity of 1,048,576, after which the VM would crash.
When a new atom is defined, the first thing that happens is a lookup, which is done to check if the atom with the same content is already present in the atom table. If the created atom is not present in the atom table, then a new entry containing all the information required for the new atom’s content is created in the atom table and the newly associated index value is returned and used as the data part for the atom’s immediate term. If the atom content is already present in the atom table, then the corresponding index of the atom content is looked up, returned and is used as the data part for the immediate term. This essentially means that no matter how many times an atom with the same content is defined within the elixir node, the atom content and its details are created and stored only once in the atom table and then reused. This act of interning the atom content in the atom table makes it more efficient in terms of memory when compared to strings. Since the atom table is a global memory area, all of the processes in the current running beam VM instance share this memory area. This means an atom content will have a unique index associated with it, across all the processes running within the same elixir node.
Example bit representation
Let’s take the atom :atom_one
which is being defined for the first time in a 32-bit system . The first thing that happens is that the content of the atom, atom_one
is looked up in the atom table. The lookup procedure of the atom content in the atom table involves content hashing and other internal ops. In our case, since the atom is being created for the first time, there will be no existing entry associated with our atom content and so a new entry along with other details will be created in the atom table. Let’s just say that the atom table already contains 21000 atoms. The new index for our atom content would equal previous atom index + 1. In our case, the previous index value is 20999, since it is a zero based index. Hence, the new index for our new atom content, atom_one would be 20999 + 1 = 21000. This 21000 is stored as index value for our atom content in the atom table and is also returned to be stored as the actual data for the immediate term. Hence our immediate term containing 32 bits(word size) will contain 11
as its least two significant bits, 0010
as the secondary tag bits. This leaves us with the remaining 26 bits for actual data, which will contain the binary representation of the index value 21000, 101001000001000
.
So the next time the same atom, :atom_one
is defined, the atom’s content will be looked up in the atom table and since we already created an entry for the atom’s content, the associated index for the content atom_one will be looked up and 21000 will be returned, which will be used as the data part of the immediate term. This is why two atoms having the same content will always be same and equal, having the same bit representation for their immediate terms.
In order to decode the immediate atom, we have 11
as the primary tag bits, which indicate an immediate term. Then we look at the secondary tag bits 0010
which indicate the atom data type. Hence, the rest of the bits contain the binary representation of an index value, which is 21000 in our case. This index value is then used to perform a lookup in the global atom table to find the associated atom content which would be atom_one in our case.
Efficiency of atom comparison
Atoms are much more efficient than strings. When two atoms are compared with each other, only the index value in the immediate term is compared as opposed to comparing the whole content of the atom. If the index values in the immediate terms are equal, then it essentially means that the two atoms being compared have the same atom content. This is much more efficient(O(1)) than the content comparison in strings. And this is why atoms are used widely as tags and keys in structures such as tuples and maps, thus enabling a very fast comparison when pattern matching different structures.