Elixir : Basics of Bitstring datatype
The bitstring data type in Elixir represents a contiguous sequence of bits and can be used to store many kinds of data such as integers, floats and strings, since all of these are internally just a bunch of bits. This is why bitstrings are widely used to store and process complex data such as encoded and raw file data. Bitstrings give you fine control over data and the ability to construct, deconstruct and manipulate it at the bit level, which is a very powerful feature. This article explains the basics of the bitstring data type in Elixir.
Syntax
Bitstrings can be constructed using the <<segment, segment::specifier>> syntax, which consists of comma separated segments. Each segment can be associated with a specifier using the segment::specifier syntax. The specifier determines the properties of the bits in the segment, such as the data type by which the bits will be read, the size that determines the number of bits in the segment, and attributes like the endianness that determines the order in which the bytes will be read. Every attribute has a default, which is used when no specific option is provided for a segment. A specifier can have multiple options or modifiers, separated by hyphens, such as <<segment::option1-option2-option3, segment>>. Each option relates to a particular attribute of the segment such as its size, type or endianness.
<<4>> # bitstring comprising an integer 4
<<4.5>> # bitstring comprising a float 4.5
<<"hello">> # bitstring comprising a string "hello"
<<4, 4.5, "hello">> # bitstring comprising 3 segments: 4, 4.5 and "hello"
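As a small sketch of how several options combine in one specifier, the snippet below packs two 4-bit integers and a string into a bitstring and then takes it apart again. The variable names (packet, version, flags, body) are made up for illustration and do not come from any real protocol.

```elixir
# Two 4-bit integer segments followed by a binary segment.
packet = <<1::integer-size(4), 2::integer-size(4), "ok"::binary>>

# The two 4-bit segments share the first byte: 0001_0010 = 18.
IO.inspect(packet, binaries: :as_binaries)  # <<18, 111, 107>>

# The same specifiers are used to deconstruct the bitstring.
<<version::integer-size(4), flags::integer-size(4), body::binary>> = packet
IO.inspect({version, flags, body})  # {1, 2, "ok"}
```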
Types
The various types supported inside a bitstring construct are integer, float, bits or bitstring, bytes or binary, utf8, utf16 and utf32. Each of these types can be used as an option in the specifier and each type has its own defaults such as size and units. Certain options or modifiers are specific to a particular type and cannot be used with other types. Some of the specifiers can only be used when deconstructing a bitstring using pattern matching.
Integer
The bitstring construct supports both positive and negative integers. The default size for an integer segment is 8 bits or a single byte. If a segment contains an integer that cannot be represented in 8 bits and no size option is given in the specifier, the value is truncated to its 8 least significant bits. To accommodate larger integers, the required size must be provided explicitly. Conversely, a size smaller than 8 bits can be given to store small integers in fewer bits.
<<8,4,5>>
The above bitstring comprises three positive integers. It takes a total of 3*8 = 24 bits.
<<1000,8>> # <<232,8>>
The above bitstring contains two segments. Each segment has an integer that takes up a default space of 8 bits. But the integer 1000 cannot be represented using 8 bits, hence it will be truncated to 232.
1000 -> 0b11_1110_1000, which takes up 10 bits. Since the default size of an integer segment is 8 bits, only the 8 least significant bits are stored, producing 232 -> 0b1110_1000. This can be avoided by providing the size option in the specifier for the segment.
<<1000::size(10),8>> # <<250, 2, 0::size(2)>>
The reason you see <<250, 2, 0::size(2)>> in iex when you create the above bitstring, or when you call IO.inspect on it, is that the data is read as bytes from the most significant end and each byte is printed as its integer value. Any partial bits at the end are expressed using the size specifier option. The binary representation of the above bitstring combines 1000 in 10 bits and 8 in 8 bits, which gives 11_11101000_00001000. Reading it byte by byte from the most significant end gives 11111010_00000010_00, which equals <<250, 2, 0::size(2)>>.
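One way to check the bit layout described above is to walk over the bitstring one bit at a time with a comprehension. This is only an illustrative sketch; the variable names are arbitrary.

```elixir
bits = <<1000::size(10), 8>>

# A bitstring generator with a 1-bit segment yields every bit in order.
bit_list = for <<bit::1 <- bits>>, do: bit
IO.inspect(bit_list)
# [1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]

# Without an explicit size only the low 8 bits of 1000 are kept.
IO.inspect(<<1000>> == <<rem(1000, 256)>>)  # true
IO.inspect(bit_size(bits))  # 18
```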
<<4::size(3)>> # <<4::size(3)>>
The above bitstring holds the 3 bits 100 representing the integer 4, instead of using the default 8 bits.
<<-1>> # <<255>>
Negative integers in bitstrings are stored using the 2’s complement, which is computed by inverting all the bits of the number’s magnitude and then adding 1. The binary representation of 1 in 8 bits is 0000_0001. Inverting the bits gives 1111_1110, and adding 1 gives 1111_1111, which equals 255, the value shown when the bitstring is printed. But this 255 could mean either the positive integer 255 or the negative integer -1. The ambiguity is resolved when reading the number by using the sign options signed and unsigned in the specifier. The default sign option when reading an integer is unsigned. So, to read the integer as a signed value, the signed option must be used explicitly in the pattern match, which then interprets the bits as 2’s complement to recover the sign and magnitude of the stored integer.
a = <<-1>> # <<255>>
<<int::integer>> = a # int = 255
<<int::integer-signed>> = a # int = -1
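The invert-and-add-one rule can be reproduced with the Bitwise module, as a rough sketch. The variable names here are illustrative only.

```elixir
import Bitwise

# 2's complement of 1 in 8 bits: invert the bits, add 1, keep the low byte.
twos = (bnot(1) + 1) &&& 0xFF
IO.inspect(twos)  # 255

# The same byte read with both sign options.
<<as_unsigned::unsigned-8>> = <<-1>>
<<as_signed::signed-8>> = <<-1>>
IO.inspect({as_unsigned, as_signed})  # {255, -1}

# For an 8-bit value the two readings differ by 2^8 when the top bit is set.
IO.inspect(as_unsigned - 256 == as_signed)  # true
```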
Similar to the sign options, the endianness options big, little and native can be used when reading or pattern matching an integer that spans more than one byte. Endianness is the byte order in which the data is read. In big endian order the bytes are read from left to right, i.e. the first byte is the most significant and the last byte is the least significant. This is the default order used in bitstrings. In little endian order the bytes are read from right to left, i.e. the first byte is the least significant. The endian options only matter when the segment takes up more than one byte. Even if there are multiple segments with partial bits, the endian order works on the bytes and not on the segments. The native option uses the endian order of the machine the code runs on, which could be either big or little. Endianness is essential when reading data written by other systems: data written in little endian order will not make sense if it is read in big endian order.
<<int::integer-size(16)>> = <<0,1>> # int = 1
<<int::integer-little-size(16)>> = <<0,1>> # int = 256
The bitstring on the right side has two integers, each taking up the default 8 bits: 00000000_00000001. In the first pattern match both bytes are read in big endian order, which gives 1. In the second pattern match the order is little, so the bytes are read from right to left, which produces 00000001_00000000 = 256.
<<int::integer-size(3)>> = <<1::size(1),0::size(1),0::size(1)>> # int = 4
<<int::integer-little-size(3)>> = <<1::size(1),0::size(1),0::size(1)>> # int = 4
In the above code, both the big and little endian order produce the same result. This is because the endian order works on a byte by byte basis and the above bitstring has only 3 bits, which are treated as a single byte 00000100. There is no second byte to reorder and hence int is bound to 4 in both cases.
<<int::integer-size(9)>> = <<1::size(2),0::size(3),0::size(3),1::size(1)>> #int = 129
<<int::integer-size(9)-little>> = <<1::size(2),0::size(3),0::size(3),1::size(1)>> # int = 320
In the above code there are four segments in the bitstring, which are laid out as 01_000_000_1 and, when grouped into bytes, as 01000000_1. The first pattern match reads the integer in big endian order, which gives 129. The second pattern match reads the integer in little endian order, where the bytes are read from right to left, giving 1_01000000, which equals 320. Please note that the endian order does not affect the order of bits within a single byte; it only affects the reading order of the bytes as a whole.
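The endian options apply when constructing a bitstring as well as when reading one. The following sketch builds the same 16-bit integer in both byte orders and cross-checks the result with Erlang's :binary.decode_unsigned/2; the variable names are arbitrary.

```elixir
# Building the same 16-bit integer in both byte orders.
big = <<1::big-16>>
little = <<1::little-16>>
IO.inspect({big, little})  # {<<0, 1>>, <<1, 0>>}

# Reading back with the matching option recovers the original value.
<<a::big-16>> = big
<<b::little-16>> = little
IO.inspect({a, b})  # {1, 1}

# :binary.decode_unsigned/2 applies the same rule to a whole binary.
IO.inspect(:binary.decode_unsigned(<<0, 1>>, :big))     # 1
IO.inspect(:binary.decode_unsigned(<<0, 1>>, :little))  # 256
```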
Another thing about the size specifier option is that a short syntax can be used directly without the size() syntax.
<<int::integer-size(16)>> can also be written as <<int::integer-16>>.
<<int::size(16)>> can also be written as <<int::16>>. Please note that integer is the default type for segments when no specific type is provided as a specifier option. A variable bound to an integer can also be used to indicate the size by using the variable inside the size() specifier option.
a = 16
<<int::size(^a)>> = <<0,1>> # int = 1
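A common use of a variable size is a length-prefixed field, where the size of one segment is read from an earlier segment of the same pattern. The frame layout below is hypothetical and only meant as a sketch.

```elixir
# Hypothetical frame: one length byte, then that many payload bytes.
frame = <<3, "abc", "tail">>

# len is bound by the first segment and reused as the size of the second.
<<len::8, payload::binary-size(len), rest::binary>> = frame
IO.inspect({len, payload, rest})  # {3, "abc", "tail"}
```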
Float
A float literal such as 4.5 can be placed in a bitstring directly, without a type. This works only for positive literals: a negative float such as -4.5 is an expression rather than a literal, so the segment falls back to the default integer type and construction fails unless the float type is given explicitly, as in <<-4.5::float>>. The default size of a float segment is 64 bits or 8 bytes, which represents an IEEE 754 double precision float. Besides the default size of 64, either 16 or 32 bits can be used, which represent IEEE 754 half precision and single precision floating point numbers respectively. Hence the only values that the size option supports for floats are 64, 32 and 16. Anything else leads to an error.
<<4.5>> # <<64, 18, 0, 0, 0, 0, 0, 0>>
<<4.5::32>> # <<64, 144, 0, 0>>
<<4.5::16>> # <<68, 128>>
<<-4.5>>
** (ArgumentError) construction of binary failed: segment 1 of type 'integer': expected an integer but got: -4.5
<<4.5::8>>
error: float requires size*unit to be 16, 32, or 64 (default), got: 8
All of the first three expressions are valid and the printed bitstring is just made up of the byte by byte representation of its respective IEEE 754 bit representation. In order to understand how float is represented by IEEE 754 bit representation, check out this article and this tool for playing around with float values.
Similar to the integer segments, the endian order specifier options, big, little and native can be used for float segments while reading a float value. The default endian specifier option for float is big.
<<f::float>> = <<64, 18, 0, 0, 0, 0, 0, 0>> # f = 4.5
<<f::float-little>> = <<64, 18, 0, 0, 0, 0, 0, 0>> # f = 2.3083e-320
When reading floats from a bitstring using pattern matching, no sign option is needed: the sign of the float is read directly from the sign bit of its IEEE 754 representation.
<<f::float>> = <<192, 18, 0, 0, 0, 0, 0, 0>> # f = -4.5
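The sketch below shows a negative float being constructed with an explicit float type, and what happens to precision when a value is stored in fewer bits. The 16-bit case assumes a runtime that supports half precision floats, as the earlier examples already do.

```elixir
# A negative float is an expression, so the float type is given explicitly.
neg = <<-4.5::float>>
IO.inspect(neg)  # <<192, 18, 0, 0, 0, 0, 0, 0>>

# Narrower floats lose precision: 0.1 does not survive a 32-bit round trip.
<<single::float-32>> = <<0.1::float-32>>
IO.inspect(single == 0.1)               # false
IO.inspect(abs(single - 0.1) < 1.0e-7)  # true

# 4.5 is exactly representable, so it round trips even in 16 bits.
<<half::float-16>> = <<4.5::float-16>>
IO.inspect(half)  # 4.5
```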
Bits or Bitstring
The type bits or bitstring lets you deal with bits in general. A segment of this type is matched as a bitstring of raw bits and is not interpreted as an integer, float or any other data type. It is used mainly in pattern matching to denote the last segment of variable size, similar to the [hd | tl] deconstruction of a list, where you don’t need to know the length of the list. Unless it is the last segment, the bits type must be associated with a size specifier option.
<<f::bits-1, rest::bits>> = <<64,8>> # f = <<0::size(1)>>, rest = <<128, 8::size(7)>>
<<f::bits-4, rest::bits>> = <<64,8>> # f = <<4::size(4)>>, rest = <<0, 8::size(4)>>
<<f::integer, rest::bits>> = <<64,8,7>> # f = 64, rest = <<8,7>>
<<flt::float-16, int::integer, rest::bits>> = <<68, 128, 64, 8, 5>> # flt = 4.5, int = 64, rest = <<8, 5>>
<<int::integer, rest::bits>> = <<68>> # int = 68, rest = <<>>
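Because a trailing bits segment matches whatever is left, it pairs naturally with recursion, much like [hd | tl] does for lists. The BitCount module below is a hypothetical helper written for illustration, not part of any library.

```elixir
defmodule BitCount do
  # Hypothetical helper: counts the bits that are set in any bitstring.
  def ones(<<>>), do: 0
  def ones(<<bit::1, rest::bits>>), do: bit + ones(rest)
end

IO.inspect(BitCount.ones(<<64, 8>>))       # 2
IO.inspect(BitCount.ones(<<7::size(3)>>))  # 3

# Splitting one byte into two 4-bit halves with the bits type.
<<high::bits-4, low::bits-4>> = <<0xAB>>
IO.inspect({high, low})  # {<<10::size(4)>>, <<11::size(4)>>}
```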
Bytes or Binary
The bytes or binary type is similar to the bits type; the only difference is that it deals with complete bytes and not with partial bytes, i.e. the total number of bits in a binary segment must be divisible by 8. Unlike the other types such as integer, float and bits, the unit for a binary segment’s size is 8 bits, i.e. whenever the size option is given for the binary type, it counts bytes and not bits. <<bin::binary-1>> means a binary segment of 1 byte, which equals 8 bits. Similar to the bits type, binary can also be used without the size option for the last segment of the bitstring when pattern matching, but that last segment of variable length must consist of complete bytes and must not contain partial bits.
<<f::binary-1,rest::binary>> = <<64,8,9>> # f = <<64>>, rest = <<8,9>>
<<f, s, rest::binary>> = <<64,8,9>> # f = 64, s = 8, rest = <<9>>
<<f::bits-8, rest::binary>> = <<64,8,9>> # f = <<64>>, rest = <<8,9>>
<<f::bits-1, rest::binary>> = <<64,8,9>> # rest would be 23 bits, which is not a whole number of bytes
** (MatchError) no match of right hand side value
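Since the size of a binary segment counts bytes, it is convenient for cutting data into fixed-size records. The following is a minimal sketch with made-up data; note that the comprehension silently stops at the first position where the pattern no longer matches.

```elixir
data = <<64, 8, 1, 2, 3>>

# Each step takes 2 whole bytes; the trailing odd byte does not match
# the pattern and is left out of the result.
pairs = for <<pair::binary-size(2) <- data>>, do: pair
IO.inspect(pairs)  # [<<64, 8>>, <<1, 2>>]

# size(1) on a binary segment means 1 byte, i.e. 8 bits.
<<head::binary-size(1), _rest::binary>> = data
IO.inspect(bit_size(head))  # 8
```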
Some bitstrings are printed in iex as strings with characters instead of the byte by byte representation. The reason is that strings in Elixir are internally utf-8 encoded binaries. If a binary is valid utf-8 and every character in it is printable, iex prints the binary as a string instead of the byte by byte representation. We will be looking into this in detail in the later part of the article.
<<65,66,67>> # prints "ABC" since the codepoint 65 = A, 66 = B and 67 = C and all are printable
<<0,65,66>> # prints <<0,65,66>> since 0 is not a printable codepoint
"ABC" == <<65,66,67>> # true
Most of the time when we are dealing with bitstrings, we only deal with complete bytes such as in the string data type. And because of this, binaries are regarded as a special data type in Elixir. Whenever the word binary is used in Elixir, it just means bitstrings whose total bit sizes are divisible by 8. There are certain functions and operations that can be performed only on binaries and not on bitstrings. All binaries are bitstrings but not all bitstrings are binaries.
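The printing rule for binaries described earlier in this section can be probed with functions from the String module, and the byte by byte form can be requested explicitly through an inspect option. This is a short sketch using only standard library calls.

```elixir
# "ABC" is valid UTF-8 and printable, so it is shown as a string.
IO.inspect(String.printable?(<<65, 66, 67>>))  # true

# The null byte is valid UTF-8 but not printable, so raw bytes are shown.
IO.inspect(String.valid?(<<0, 65, 66>>))      # true
IO.inspect(String.printable?(<<0, 65, 66>>))  # false

# The :binaries option forces the byte by byte form for any binary.
IO.inspect("ABC", binaries: :as_binaries)  # <<65, 66, 67>>
```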
utf8, utf16 and utf32
utf8, utf16 and utf32 are encoding formats used for strings. Let us first understand the basics of these encoding formats. It all began with ASCII, which mapped 128 characters and symbols to numbers; strings were encoded as these numbers before being stored in computers, and the numbers were later decoded back to their respective characters to produce the strings. ASCII was then succeeded by Unicode, which includes not only the English alphabet and common symbols but characters from all languages, emojis etc., with a total of 149,878 character mappings in its recent version. Each character is mapped to an integer, and this integer is called the codepoint. In Elixir the codepoint of any character can be obtained using the ?character syntax.
?A # 65
?a # 97
?á # 225
?😂 # 128514
So whenever a string is stored in a computer, the codepoints of all the characters in the string are stored as integers. Now let us dive into the size of these codepoints. Some codepoints, such as 65 for the letter A, fit within a byte, while codepoints such as 128514 for the emoji 😂 need more than a byte. This is where the utf formats come in. The utf8 format uses a variable length to store codepoints: 1, 2, 3 or 4 bytes depending on the size of the codepoint. The utf16 format is also variable length but uses either 2 bytes or 4 bytes per codepoint, and utf32 is fixed length and uses 4 bytes for every codepoint. utf8 is usually the most compact of the three, especially for text that is mostly ASCII, but it needs additional computation while encoding and decoding to determine how many bytes each codepoint uses. utf32 takes up the most memory, but since every codepoint has the same width, no such computation is needed to tell characters apart while encoding and decoding.
"aá😂" # <<97, 195, 161, 240, 159, 152, 130>>
# where a = <<97>>, á = <<195, 161>>, 😂 = <<240, 159, 152, 130>>
<<"aá😂"::utf16>> # <<0, 97, 0, 225, 216, 61, 222, 2>>
# where a = <<0, 97>>, á = <<0, 225>>, 😂 = <<216, 61, 222, 2>>
<<"aá😂"::utf32>> # <<0, 0, 0, 97, 0, 0, 0, 225, 0, 1, 246, 2>>
# where a = <<0, 0, 0, 97>>, á = <<0, 0, 0, 225>>, 😂 = <<0, 1, 246, 2>>
Please note that in utf8’s 2, 3 and 4 byte encodings and utf16’s 4 byte encodings, the byte values do not directly correspond to the codepoint, because these formats add tag bits and apply an additional transformation for codepoints of different sizes. The transformation is reversed on these multibyte encodings in order to derive the actual codepoint. E.g. for two byte characters in utf8, the first byte starts with 110 and the second byte starts with 10. The remaining bits in both bytes are used to represent the codepoint. The tag bits and transformation technique vary for the other encodings, i.e. the 3 byte utf8, 4 byte utf8 and 4 byte utf16.
utf8 -> <<195, 161>> = ?á = 225
195 = 110_00011
161 = 10_100001
After removal of tag bits we get 00011_100001 = 225
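The removal of the tag bits shown above can itself be written as a bitstring pattern. The sketch below covers only the two byte utf8 case, and the variable names are illustrative.

```elixir
# Match the tag bits 110 and 10 literally and capture the payload bits.
<<0b110::3, high::5, 0b10::2, low::6>> = <<195, 161>>
IO.inspect({high, low})  # {3, 33}

# Joining the 5 + 6 payload bits gives the 11-bit codepoint.
<<codepoint::11>> = <<high::5, low::6>>
IO.inspect(codepoint)  # 225

# The utf8 type performs the same decoding.
<<decoded::utf8>> = <<195, 161>>
IO.inspect(decoded == codepoint)  # true
```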
When reading utf types in bitstrings using pattern matching, they do not support size specifier options like the other types and the data is read as the codepoint integer for a single character. utf16 and utf32 also support the endian order specifier options when reading bitstrings using pattern matching.
<<f::utf8, s::utf8, rest::binary>> = "aá😂aa" # f = 97, s = 225, rest = "😂aa"
<<f::utf16, s::utf16, rest::binary>> = <<"aá😂aa"::utf16>> # f = 97, s = 225, rest = <<216, 61, 222, 2, 0, 97, 0, 97>>
<<f::utf32, s::utf32, rest::binary>> = <<"aá😂aa"::utf32>> # f = 97, s = 225, rest = <<0, 1, 246, 2, 0, 0, 0, 97, 0, 0, 0, 97>>
<<f::utf16-little>> = <<97, 0>> # f = 97
<<f::utf16>> = <<0, 97>> # f = 97
<<f::utf32-little>> = <<97, 0, 0, 0>> # f = 97
<<f::utf32>> = <<0, 0, 0, 97>> # f = 97
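To compare the space taken by the three formats, a string can be re-encoded codepoint by codepoint with a comprehension. This is a rough sketch; the variable names are arbitrary, and the last two lines show that the endian options work on construction as well.

```elixir
text = "aá😂"

# Encode every codepoint of the string in each format.
utf8 = for <<cp::utf8 <- text>>, into: <<>>, do: <<cp::utf8>>
utf16 = for <<cp::utf8 <- text>>, into: <<>>, do: <<cp::utf16>>
utf32 = for <<cp::utf8 <- text>>, into: <<>>, do: <<cp::utf32>>
IO.inspect({byte_size(utf8), byte_size(utf16), byte_size(utf32)})  # {7, 8, 12}

# The endian option also applies when constructing utf16 and utf32 segments.
IO.inspect(<<?a::utf16-little>>)  # <<97, 0>>
IO.inspect(<<?a::utf32-little>>)  # <<97, 0, 0, 0>>
```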
Reference modules
The Kernel module has some useful functions that work on bitstrings and binaries. The bit_size/1 and byte_size/1 guard functions return the total number of bits and bytes used by a bitstring. The binary_part/3 guard function extracts a part of a binary. The is_bitstring/1 and is_binary/1 guard functions return a boolean. All strings are internally binaries. The concatenation operator <> joins two binaries into one. The String module is meant for utf8 encoded binaries. The :binary module of Erlang provides useful functions that operate on binaries, such as bin_to_list/1. Any term in Elixir can be converted into a binary and back using the :erlang.term_to_binary/1 and :erlang.binary_to_term/1 functions, which lets you read and manipulate any term in Elixir with bit and byte level precision using bitstrings.
<<1,2,3>> <> <<1,2,3>> # <<1,2,3,1,2,3>>
"hello" <> "world" # "helloworld"
"abc" <> <<0,1,2>> # <<97, 98, 99, 0, 1, 2>>
<<1::2>> <> <<1>> # concatenation is not operable on non binaries
error: expected <<1::integer-size(2)>> to be a binary but its number of bits is not divisible by 8
bit_size(<<1::2,1>>) # 10
byte_size(<<1::2,1>>) # 2 - the partial bits are rounded up to a full byte
binary_part("foo", 1, 2) # "oo" - (binary, start_index(zero based index), length)
binary_part(<<0,1,2,3>>, 0, 3) # <<0,1,2>>
is_bitstring(<<1::2,1>>) # true
is_bitstring("hello") # true
is_binary(<<1::2,1>>) # false
is_binary(<<1,2>>) # true
is_binary("hello") # true
:binary.bin_to_list(<<1,2,3>>) # [1,2,3]
:erlang.term_to_binary([1,2,3]) # <<131, 107, 0, 3, 1, 2, 3>>
:erlang.binary_to_term(<<131, 107, 0, 3, 1, 2, 3>>) # [1,2,3]
The rules by which the Elixir terms are converted into binaries are based on the external term format. This gives you the ability to easily store and transfer any Elixir term across systems and networks by making use of the binary data type.
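Since is_binary/1 and is_bitstring/1 are guards, they can be used to dispatch on the kind of data in function heads. The Kind module below is a hypothetical helper for illustration, followed by a round trip through the external term format.

```elixir
defmodule Kind do
  # Hypothetical helper built on the Kernel guards described above.
  # The is_binary/1 clause comes first, since every binary is a bitstring.
  def of(value) when is_binary(value), do: :binary
  def of(value) when is_bitstring(value), do: :bitstring
  def of(_value), do: :other
end

IO.inspect(Kind.of("hello"))      # :binary
IO.inspect(Kind.of(<<1::2, 1>>))  # :bitstring
IO.inspect(Kind.of(42))           # :other

# A term survives a round trip through the external term format.
term = %{id: 1, tags: ["a", "b"]}
encoded = :erlang.term_to_binary(term)
<<version, _rest::binary>> = encoded
IO.inspect(version)  # 131
IO.inspect(:erlang.binary_to_term(encoded) == term)  # true
```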