Elixir: Basics of the String datatype
Strings in Elixir are UTF-8 encoded binaries. They can be written with the "string" literal syntax or with the bitstring syntax, for example <<67,68,69>>, where the byte values follow the UTF-8 encoding. String literals support escape sequences such as \n, \t and \r, and double quotes inside a string must be escaped. The \u syntax inserts a character by the hexadecimal value of its Unicode codepoint.
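The equivalence of the two syntaxes can be checked directly. The following is a minimal sketch meant for iex or an .exs script; the variable names are only illustrative.

```elixir
# Both literals build the same three-byte binary.
same = "CDE" == <<67, 68, 69>>
IO.inspect(same)                     # true
IO.inspect(is_binary("CDE"))         # true

# \uXXXX takes four hex digits; \u{...} accepts longer codepoints.
escaped = "\u0061\u0062\u{1F602}"
IO.inspect(escaped == <<97, 98, 240, 159, 152, 130>>)   # true

# Appending a zero byte makes inspect print the raw bytes instead of text.
raw = "abc" <> <<0>>
IO.inspect(raw)                      # <<97, 98, 99, 0>>
```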
"string"
"string with \"quotes\" and line feed \n"
"\u0061\u0062\u0063" # "abc"
Concatenation
Strings in Elixir are concatenated with the <> operator, the same operator used for binaries.
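Because <> is the binary operator, it also works in pattern matches and with raw bytes. A small sketch, with illustrative variable names:

```elixir
greeting = "hello" <> " " <> "world"
IO.inspect(greeting)                 # "hello world"

# <> can appear in a pattern when the left side is a literal prefix.
"hello " <> rest = greeting
IO.inspect(rest)                     # "world"

# Since strings are binaries, raw bytes can be appended the same way.
abc = "AB" <> <<67>>
IO.inspect(abc)                      # "ABC"
```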
"hello" <> "world" # "helloworld"
Interpolation
Strings in Elixir support interpolation with the #{} syntax. The result of the expression inside #{} is converted to a string and inserted at that position, provided that the resulting term implements the String.Chars protocol, which is also used by the IO.puts function. If the term does not implement the String.Chars protocol, interpolation raises a Protocol.UndefinedError.
a = 5
"hello #{a * a}" # "hello 25"
"hello #{%{1=>1}}" # error
** (Protocol.UndefinedError) protocol String.Chars not implemented for %{1 => 1} of type Map
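One common way around the error is to call inspect/1 inside the interpolation, since it returns a string for any term. The sketch below shows that fallback and also rescues the error to confirm which protocol was missing; the variable names are illustrative.

```elixir
# Integers, floats and atoms implement String.Chars, so they interpolate directly.
basic = "int: #{42}, float: #{1.5}, atom: #{:ok}"
IO.puts(basic)                       # int: 42, float: 1.5, atom: ok

# inspect/1 returns a string for any term, so it works as a fallback for maps.
term = %{1 => 1}
with_map = "hello #{inspect(term)}"
IO.puts(with_map)                    # hello %{1 => 1}

# Without inspect/1 the interpolation raises, and the error can be rescued.
outcome =
  try do
    "hello #{term}"
  rescue
    e in Protocol.UndefinedError -> {:error, e.protocol}
  end

IO.inspect(outcome)                  # {:error, String.Chars}
```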
Codepoints and graphemes
A codepoint is the integer assigned to a particular character by the Unicode standard. UTF-8 stores each codepoint in 1, 2, 3 or 4 bytes, and it adds tag bits to every byte so that a decoder can tell where a codepoint starts and how many bytes it spans. The encoding rules for UTF-8 are as follows:
- ASCII characters (0–127): ASCII characters are represented directly using a single byte where the most significant bit is always 0.
- Two-byte characters (128–2047): The first byte starts with 110, and the second byte starts with 10 as tag bits. The remaining bits in both bytes are combined to represent the codepoint.
- Three-byte characters (2048–65535): The first byte starts with 1110, and the second and third bytes start with 10 as tag bits. The remaining bits in each byte are combined to represent the codepoint.
- Four-byte characters (65536–1114111): The first byte starts with 11110, and the second, third and fourth bytes start with 10 as tag bits. The remaining bits in each byte are combined to represent the codepoint.
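The four rules above can be sketched with Elixir's bitstring syntax. Utf8Sketch is a hypothetical module written only for illustration; in real code the built-in <<cp::utf8>> segment does this work, and unlike this sketch it also rejects the surrogate range.

```elixir
defmodule Utf8Sketch do
  # Hypothetical helper: encodes one codepoint by hand using the tag bits above.
  def encode(cp) when cp in 0..0x7F, do: <<0::1, cp::7>>

  def encode(cp) when cp in 0x80..0x7FF do
    # Split the 11 payload bits into 5 + 6 and prefix each group with its tag.
    <<a::5, b::6>> = <<cp::11>>
    <<0b110::3, a::5, 0b10::2, b::6>>
  end

  def encode(cp) when cp in 0x800..0xFFFF do
    <<a::4, b::6, c::6>> = <<cp::16>>
    <<0b1110::4, a::4, 0b10::2, b::6, 0b10::2, c::6>>
  end

  def encode(cp) when cp in 0x10000..0x10FFFF do
    <<a::3, b::6, c::6, d::6>> = <<cp::21>>
    <<0b11110::5, a::3, 0b10::2, b::6, 0b10::2, c::6, 0b10::2, d::6>>
  end
end

# Compare the hand-rolled encoding with the built-in utf8 segment.
for cp <- [67, 256, 39322, 128514] do
  IO.inspect({cp, Utf8Sketch.encode(cp) == <<cp::utf8>>})
end
```

Each tuple printed by the loop should end in true, one per byte-length class.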
You can get the codepoint of a single character as an integer with the ?character syntax, or the codepoints of a whole string as a list of integers with the String.to_charlist/1 function. The String.codepoints/1 function returns the codepoints as individual character strings. Either list can be converted back to a string with the List.to_string/1 function.
"C" # codepoint - 67, utf8 - <<67>> -> <<0b01000011>>
"Ā" # codepoint - 256, utf8 - <<196, 128>> -> <<0b110_00100, 0b10_000000>>
"馚" # codepoint - 39322, utf8 - <<233, 166, 154>> -> <<0b1110_1001, 0b10_100110, 0b10_011010>>
"😂" # codepoint - 128514, utf8 - <<240, 159, 152, 130>> -> <<0b11110_000, 0b10_011111, 0b10_011000, 0b10_000010>>
?C # 67
?😂 # 128514
String.to_charlist("abc😂") # [97, 98, 99, 128514]
String.codepoints("abc😂") # ["a", "b", "c", "😂"]
List.to_string([97, 98, 99, 128514]) # "abc😂"
List.to_string(["a", "b", "c", "😂"]) # "abc😂"
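Codepoints can also be read straight from the binary with the utf8 segment type in a pattern match. CodepointSketch below is a hypothetical module for illustration; for valid UTF-8 input it should return the same list as String.to_charlist/1, and it raises a FunctionClauseError on invalid input.

```elixir
defmodule CodepointSketch do
  # Hypothetical helper: decodes one codepoint at a time from the front.
  def to_codepoints(<<cp::utf8, rest::binary>>), do: [cp | to_codepoints(rest)]
  def to_codepoints(<<>>), do: []
end

decoded = CodepointSketch.to_codepoints("abc😂")
# charlists: :as_lists forces integer output even for printable lists.
IO.inspect(decoded, charlists: :as_lists)    # [97, 98, 99, 128514]

# The same segment type builds a string from an integer codepoint.
rebuilt = <<128514::utf8>>
IO.inspect(rebuilt == "😂")                  # true
```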
A grapheme is what a reader perceives as a single printable character. It may be one codepoint or several existing codepoints combined. For example, the three codepoints man emoji, zero width joiner and fire engine emoji combine to form the man firefighter emoji. In the case of such graphemes, both String.to_charlist/1 and String.codepoints/1 return the separate codepoints that make up the grapheme. To get each grapheme as a single unit, use the String.graphemes/1 function. Graphemes allow you to compose new characters and provide variations of characters, such as different skin tones for the thumbs up emoji.
"👨\u200D🚒" # man, zero width joiner, fire engine: renders as one man firefighter emoji
String.to_charlist("👨\u200D🚒") # [128104, 8205, 128658]
String.codepoints("👨\u200D🚒") # ["👨", "\u200D", "🚒"]
String.graphemes("hello👨\u200D🚒") # ["h", "e", "l", "l", "o", "👨\u200D🚒"]
String.to_charlist("👍") # [128077]
String.to_charlist("👍🏼") # [128077, 127996]
String.to_charlist("👍🏾") # [128077, 127998]
String.codepoints("e\u0301") # ["e", "\u0301"] (e followed by a combining acute accent)
String.to_charlist("e\u0301") # [101, 769]
String.graphemes("e\u0301") # ["e\u0301"] (a single grapheme, displayed as é)
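The difference between bytes, codepoints and graphemes shows up clearly when counting them for the same string. The sketch below writes the characters with escapes so that the invisible joiner and the combining accent cannot get lost in copy and paste; the variable names are illustrative.

```elixir
# man + zero width joiner + fire engine
firefighter = "\u{1F468}\u200D\u{1F692}"

IO.inspect(byte_size(firefighter))                   # 11
IO.inspect(length(String.codepoints(firefighter)))   # 3
IO.inspect(String.length(firefighter))               # 1

# "e" + combining acute accent versus the precomposed U+00E9.
decomposed = "e\u0301"
composed = "\u00E9"

IO.inspect(decomposed == composed)                           # false
IO.inspect(String.normalize(decomposed, :nfc) == composed)   # true
IO.inspect(String.equivalent?(decomposed, composed))         # true
```

Two strings can therefore display identically and still compare as different binaries, which is why normalizing before comparison is worth considering for user input.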
Heredocs
Heredocs are multiline strings written with the """ syntax. Every line, including the last one, ends with a newline character in the resulting string. This syntax is commonly used for documentation.
"""
hello
world
"""
# "hello\nworld\n"
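Heredocs also support interpolation, and the indentation of the closing """ is removed from every line, which keeps indented code readable. HeredocSketch is a hypothetical module that exists only to give the heredoc some indentation.

```elixir
defmodule HeredocSketch do
  # The closing delimiter sits at four spaces, so four leading spaces
  # are stripped from each line of the content.
  def text(name) do
    """
    hello
      #{name}
    """
  end
end

IO.inspect(HeredocSketch.text("world"))   # "hello\n  world\n"
```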
Reference modules
The String module provides numerous functions that operate on strings, such as at/2, length/1, upcase/1, trim/1 and slice/3. Since strings are binaries, any function that operates on binary data can also operate on strings. Kernel functions such as binary_part/3, byte_size/1 and is_binary/1, which are also allowed in guards, can be used on strings as well. Please note that functions that operate on binaries work byte by byte, while functions in the String module work on Unicode characters (graphemes or codepoints). Hence they may not produce the same result unless every character in the string is made up of a single byte. For index-based access to a byte value, you can use the :binary.at/2 function, which is more efficient than the String.at/2 function since it does not need to decode the bytes as UTF-8.
String.length("hello👍") # 6
byte_size("hello👍") # 9
String.at("hello👍", 5) # "👍"
:binary.at("hello👍", 5) # 240
:binary.at("hello👍", 6) # 159
String.trim(" hello ") # "hello"
String.upcase("hello") # "HELLO"
is_binary("hello") # true
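The byte versus character distinction matters most when slicing. The sketch below contrasts binary_part/3 with String.slice/3 and shows the guard-safe functions in a function head; GuardSketch and fits?/2 are hypothetical names used only for illustration.

```elixir
defmodule GuardSketch do
  # byte_size/1 and is_binary/1 are allowed in guards.
  def fits?(s, max_bytes) when is_binary(s) and byte_size(s) <= max_bytes, do: true
  def fits?(s, _max_bytes) when is_binary(s), do: false
end

text = "hello👍"

# Byte-level slicing can cut a character in half and leave invalid UTF-8.
partial = binary_part(text, 5, 2)
IO.inspect(partial)                      # <<240, 159>>
IO.inspect(String.valid?(partial))       # false

# String-level slicing counts graphemes and keeps the character whole.
whole = String.slice(text, 5, 1)
IO.inspect(whole)                        # "👍"

IO.inspect(GuardSketch.fits?(text, 6))   # false, the string is 9 bytes long
IO.inspect(GuardSketch.fits?(text, 9))   # true
```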