Elixir : Basics of String datatype

Arunmuthuram M
Oct 21, 2023


Strings in Elixir are UTF-8 encoded binaries. They can be written with the "string" literal syntax or with the underlying bitstring syntax, for example <<67, 68, 69>>, where the bytes form valid UTF-8 sequences. String literals support escape sequences such as \n, \t and \r, and a double quote inside a string must be escaped as \". The \u escape inserts a Unicode codepoint by its hexadecimal value: \uXXXX takes exactly four hex digits, and \u{...} takes one to six.

"string"
"string with \"quotes\" and line feed \n"
"\u0061\u0062\u0063" # "abc"
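As a quick check of the claims above, the following sketch compares the two notations and tries the braced \u{...} form. The variable names cde and laugh are only illustrative.

```elixir
# Both notations build the same UTF-8 encoded binary.
cde = <<67, 68, 69>>
IO.inspect(cde == "CDE")       # true
IO.inspect(is_binary("CDE"))   # true

# \uXXXX needs exactly four hex digits; \u{...} accepts one to six,
# which is required for codepoints above 0xFFFF.
laugh = "\u{1F602}"
IO.inspect(laugh == "😂")      # true

# A binary that is not valid UTF-8 is still a binary, but not a valid string.
IO.inspect(String.valid?(<<255>>))   # false
```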

Concatenation

Strings in Elixir are concatenated with the <> operator, the same operator used to concatenate binaries.

"hello" <> "world" # "helloworld"
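Because a string is a binary, <> can mix string literals with raw bytes, and it can also appear on the left side of a match when the prefix is a literal. A small sketch, with illustrative variable names:

```elixir
greeting = "hello" <> " " <> "world"
IO.inspect(greeting)   # "hello world"

# <<33>> is the single byte for "!", so it concatenates like any string.
shout = greeting <> <<33>>
IO.inspect(shout)      # "hello world!"

# In a pattern, <> splits off a literal prefix and binds the remainder.
"hello " <> rest = greeting
IO.inspect(rest)       # "world"
```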

Interpolation

Strings in Elixir support interpolation with the #{} syntax. The expression inside #{} is evaluated, converted to a string through the String.Chars protocol (the same protocol that IO.puts uses), and inserted at that position. If the resulting term does not implement the String.Chars protocol, a Protocol.UndefinedError is raised.

a = 5
"hello #{a * a}" # "hello 25"

"hello #{%{1=>1}}" # raises an error
# ** (Protocol.UndefinedError) protocol String.Chars not implemented for %{1 => 1} of type Map
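There are two common ways around this error: pass the term through inspect/1, which works for any term, or implement String.Chars for your own data type. The sketch below assumes a hypothetical Point struct and a hypothetical InterpolationDemo module. Neither comes from the standard library.

```elixir
# Hypothetical struct used only for this illustration.
defmodule Point do
  defstruct x: 0, y: 0
end

# Implementing String.Chars makes Point usable inside #{}.
defimpl String.Chars, for: Point do
  def to_string(%Point{x: x, y: y}), do: "(#{x}, #{y})"
end

defmodule InterpolationDemo do
  # inspect/1 returns a printable string for any term, maps included.
  def with_inspect, do: "hello #{inspect(%{1 => 1})}"

  # Interpolation calls the String.Chars implementation defined above.
  def with_protocol, do: "point: #{%Point{x: 1, y: 2}}"
end

IO.puts(InterpolationDemo.with_inspect())    # hello %{1 => 1}
IO.puts(InterpolationDemo.with_protocol())   # point: (1, 2)
```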

Codepoints and graphemes

A codepoint is the integer that the Unicode standard assigns to a character. UTF-8 stores each codepoint in 1, 2, 3 or 4 bytes and marks those bytes with tag bits, so that a decoder can tell where a codepoint starts and how many bytes it spans. The encoding rules are as follows:

  • ASCII characters (0–127): ASCII characters are represented directly using a single byte where the most significant bit is always 0.
  • Two-byte characters (128–2047): The first byte starts with 110, and the second byte starts with 10 as tag bits. The remaining bits in both bytes are combined to represent the code point.
  • Three-byte characters (2048–65535): The first byte starts with 1110, and the subsequent second and third bytes start with 10 as tag bits. The remaining bits in each byte are combined to represent the code point.
  • Four-byte characters (65536–1114111): The first byte starts with 11110, and the subsequent second, third and fourth bytes start with 10 as tag bits. The remaining bits in each byte are combined to represent the code point.
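The four rules above can be sketched by hand with bitstring segments. Utf8Sketch is a hypothetical module written for this illustration. In real code the built-in <<cp::utf8>> segment does the same job, and unlike this sketch it also rejects the surrogate range 0xD800–0xDFFF.

```elixir
defmodule Utf8Sketch do
  # One byte: a leading 0 bit followed by the 7 codepoint bits.
  def encode(cp) when cp in 0..0x7F, do: <<0::1, cp::7>>

  # Two bytes: 11 codepoint bits split 5 + 6 behind the 110 and 10 tags.
  def encode(cp) when cp in 0x80..0x7FF do
    <<a::5, b::6>> = <<cp::11>>
    <<0b110::3, a::5, 0b10::2, b::6>>
  end

  # Three bytes: 16 codepoint bits split 4 + 6 + 6.
  def encode(cp) when cp in 0x800..0xFFFF do
    <<a::4, b::6, c::6>> = <<cp::16>>
    <<0b1110::4, a::4, 0b10::2, b::6, 0b10::2, c::6>>
  end

  # Four bytes: 21 codepoint bits split 3 + 6 + 6 + 6.
  def encode(cp) when cp in 0x10000..0x10FFFF do
    <<a::3, b::6, c::6, d::6>> = <<cp::21>>
    <<0b11110::5, a::3, 0b10::2, b::6, 0b10::2, c::6, 0b10::2, d::6>>
  end
end

IO.inspect(Utf8Sketch.encode(256))                       # <<196, 128>>
IO.inspect(Utf8Sketch.encode(128514) == <<128514::utf8>>) # true
```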

You can get the codepoint of a single character as an integer with the ?character syntax, or the codepoints of a whole string as a list of integers with the String.to_charlist/1 function. The String.codepoints/1 function returns the codepoints as single-character strings instead. Either list can be converted back to a string with the List.to_string/1 function.

"C" # codepoint - 67, utf8 - <<67>> -> <<0b01000011>>
"Ā" # codepoint - 256, utf8 - <<196, 128>> -> <<0b110_00100, 0b10_000000>>
"馚" # codepoint - 39322, utf8 - <<233, 166, 154>> -> <<0b1110_1001, 0b10_100110, 0b10_011010>>
"😂" # codepoint - 128514, utf8 - <<240, 159, 152, 130>> -> <<0b11110_000, 0b10_011111, 0b10_011000, 0b10_000010>>

?C # 67
?😂 # 128514

String.to_charlist("abc😂") # [97, 98, 99, 128514]
String.codepoints("abc😂") # ["a", "b", "c", "😂"]
List.to_string([97, 98, 99, 128514]) # "abc😂"
List.to_string(["a", "b", "c", "😂"]) # "abc😂"
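Codepoints can also be read directly from the binary with the utf8 segment type in a pattern match. The following is a minimal sketch of walking a string one codepoint at a time. CodepointWalk is a hypothetical module name, and for valid strings its result should match String.to_charlist/1.

```elixir
defmodule CodepointWalk do
  # <<cp::utf8, rest::binary>> decodes one codepoint (1 to 4 bytes)
  # from the front of the binary and binds the remaining bytes.
  def to_list(<<cp::utf8, rest::binary>>), do: [cp | to_list(rest)]
  def to_list(<<>>), do: []
end

IO.inspect(CodepointWalk.to_list("abc😂"))
# [97, 98, 99, 128514]

# The same segment type builds a string from a codepoint.
IO.inspect(<<128514::utf8>>)   # "😂"
```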

A grapheme is what a reader perceives as a single character. It may consist of one codepoint or of several codepoints combined. E.g. the three codepoints man emoji, zero width joiner and fire engine emoji combine to form the man firefighter emoji. For such multi-codepoint graphemes, both String.to_charlist/1 and String.codepoints/1 return the separate codepoints that make up the grapheme. To get each grapheme as a single unit, use the String.graphemes/1 function. Combining codepoints is also how variations of a character are expressed, such as the different skin tones for the thumbs up emoji or an accent on a letter.

"👨\u200D🚒" # "👨‍🚒"
String.to_charlist("👨‍🚒") # [128104, 8205, 128658]
String.codepoints("👨‍🚒") # ["👨", "‍", "🚒"]
String.graphemes("hello👨‍🚒") # ["h", "e", "l", "l", "o", "👨‍🚒"]

String.to_charlist("👍") # [128077]
String.to_charlist("👍🏼") # [128077,127996]
String.to_charlist("👍🏾") # [128077,127998]

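# "e\u0301" is the letter e followed by the combining acute accent (codepoint 769)
String.codepoints("e\u0301") # ["e", "́"]
String.to_charlist("e\u0301") # [101, 769]
String.graphemes("e\u0301") # ["é"]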
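The gap between bytes, codepoints and graphemes shows up clearly when the same string is measured three ways. A small sketch with illustrative variable names. The grapheme counts assume a reasonably recent Elixir version whose Unicode tables treat zero width joiner emoji sequences as one grapheme.

```elixir
firefighter = "👨\u200D🚒"
accented = "e\u0301"

# Bytes: 4 (man) + 3 (zero width joiner) + 4 (fire engine).
IO.inspect(byte_size(firefighter))                 # 11
# Codepoints: the three parts listed above.
IO.inspect(length(String.codepoints(firefighter))) # 3
# Graphemes: String.length/1 counts what the reader sees.
IO.inspect(String.length(firefighter))             # 1

IO.inspect(byte_size(accented))                    # 3
IO.inspect(length(String.codepoints(accented)))    # 2
IO.inspect(String.length(accented))                # 1
```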

Heredocs

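Heredocs are multiline strings written between triple double quotes ("""). The opening """ must be followed by a line break, and every line of the content, including the last one, ends with a newline character in the resulting string. This syntax is commonly used for documentation.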

"""
hello
world
"""
# "hello\nworld\n"
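Heredocs also support interpolation, and the indentation of the closing """ is stripped from every line, which keeps them readable inside indented code. The sketch below uses a hypothetical Greeter module to show both the documentation use and the indentation rule.

```elixir
defmodule Greeter do
  @moduledoc """
  Hypothetical module used to illustrate heredocs.
  Documentation attributes are the most common place to see them.
  """

  # The closing delimiter sits at four spaces, so four spaces are
  # removed from each line. The extra two before the interpolation stay.
  def banner(name) do
    """
    hello
      #{name}
    """
  end
end

IO.inspect(Greeter.banner("world"))   # "hello\n  world\n"
```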

Reference modules

The String module provides many functions that operate on strings, such as at/2, length/1, upcase/1, trim/1 and slice/3. Since strings are binaries, any function that operates on binary data also works on strings. This includes the Kernel functions binary_part/3, byte_size/1 and is_binary/1, which are also allowed in guards. Please note that functions that operate on binaries work byte by byte, while String functions such as length/1 and at/2 work grapheme by grapheme. Hence the two produce the same result only when every character in the string is made up of a single byte. For index-based access to a byte value, you can use the :binary.at/2 function, which is more efficient than String.at/2 because it does not need to decode the UTF-8 bytes that precede the index.

String.length("hello👍") # 6
byte_size("hello👍") # 9

String.at("hello👍", 5) # "👍"
:binary.at("hello👍", 5) # 240
:binary.at("hello👍", 6) # 159

String.trim(" hello ") # "hello"
String.upcase("hello") # "HELLO"

is_binary("hello") # true
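The difference between byte-based and grapheme-based functions matters most when slicing. A short sketch using the same "hello👍" string, with illustrative variable names: cutting by bytes in the middle of a multi-byte character yields a binary that is no longer a valid string.

```elixir
s = "hello👍"

# String.slice/3 counts graphemes: start at index 5, take 1.
by_grapheme = String.slice(s, 5, 1)
IO.inspect(by_grapheme)                # "👍"

# binary_part/3 counts bytes: one byte is only the first quarter of 👍.
one_byte = binary_part(s, 5, 1)
IO.inspect(one_byte)                   # <<240>>
IO.inspect(String.valid?(one_byte))    # false

# Taking all four bytes of the emoji gives the complete character again.
four_bytes = binary_part(s, 5, 4)
IO.inspect(four_bytes == by_grapheme)  # true
```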
