Elixir : Internal data representation of Big Integers
Integers in elixir are internally represented as two types such as small integers and big integers based on the size that they would take up for representation. Elixir’s Integer datatype follows Arbitrary-precision arithmetic to determine the maximum value that can be stored, same as the underlying mechanism in erlang. In order to understand the internal working of erlang’s data representation, please go through this article that gives you an overview of the erlang’s tagging system, which is required to understand this article. This article explains the internal data representation of Big Integers in elixir.
Big Integers
Big integers are whole numbers whose values take up space that are not small enough to be stored inside a single word. Big integers are represented as boxed terms, where the actual data and its metadata is stored in the heap and can take up to any number of words to accommodate the value, as long as there is memory to spare. This is how erlang implements Arbitrary-precision arithmetic. Big integers, unlike the small integers, do not use 2’s complement to represent the signed integers. Instead they use a sign-magnitude approach where the magnitude of the number is stored directly as the binary representation, irrespective of the sign. The sign of the number is identified from the secondary header tag in the header word.
Bit representation
Whenever a signed integer falling outside the small integer range is defined, a boxed term is created with the primary tag 10
as its two least significant bits. The rest of the bits in the boxed term contain a pointer to a header word created in the heap. The header word contains its assigned primary tag value 00
in the least two significant bits. The next four least significant bits would contain the secondary header tag that determines the type of data represented. In our case, 001x
where x is the sign bit. For a positive big integer the secondary tag will be 0010
and for a negative big integer the tag would be 0011
. The rest of the bits contain the binary representation of the arity of data words that it requires to store the magnitude of the number.
The header word will be followed by the data words which are contiguously stored in memory right after one another. The arity value present in the header will directly determine how many data words contain the actual data. The magnitude of the integer will be represented as its binary value with its bits distributed across the data words. The magnitude’s least significant word will come first followed by its most significant words.
Positive big integer representation
Let us take the number 280_379_743_316_650 which falls outside the range of small integers for a 32-bit system. The binary value of the above number would be 11111111 00000000 11111111 00000000 10101010 10101010
which takes up 48 bits. Hence, it would take 2 words(64 bits) to store the digits of the binary value of the magnitude. The least significant 4 bytes will be stored in the same order in the first data word, followed by the next 4 bytes in the next word and so on. In our case the least significant 4 bytes containing the bits 11111111 00000000 10101010 10101010
, will be stored in the first data word and the next 2 remaining bytes 11111111 00000000
will be stored in the next data word, thus distributing all the digit bits in the two data words.
Negative big integer representation
Let us take the same number, but negative, -280_379_743_316_650. Since, big integers use the sign-magnitude approach, the only difference in the bit representation would be the secondary header tag, which would be 0011
, denoting a negative big integer. The rest of the bits would be the same as the positive big integer of the same magnitude displayed above.
In order to decode this bit representation back to the decimal value, first, you have the boxed term with the primary tag 10
. The boxed term has a pointer in the rest of its bits, that points to a header word in the heap. Once you get to the header word that has 00
as the primary tag, you check the four bit secondary tag for the sign of the big integer. If the tag is 0010
, then the number is positive and if the tag is 0011
, then the number is negative. Then, you check the rest of the bits in the header word to find the arity. The arity determines how many data words containing the digit bits of the magnitude, are stored contiguously after the header word. Once you get the arity, you combine the digit bits of all the data words present, by considering the fact that the most significant word would be at the end. Once you have combined and arranged the digit bits from all the data words, you can convert the binary representation into decimal to get the magnitude of the big integer.
A big integer in erlang takes up a minimum of 3 words of memory i.e. a boxed term, a header word and a data word. Please note that the boxed term pointing to the header word can be present in either the stack or the heap. When the big integer is directly bound to a local variable in a function, the boxed term will be created in the stack. But, when a big integer is a part of or an element of another data structure such as a list or a tuple, the boxed term for the big integer will be created in the heap.