Character sets and encodings

Character set

A character set is the set of characters an operating system can represent, together with the way they are encoded as numbers. There are two types: single-byte character sets and multi-byte character sets. Examples of single-byte character sets are ASCII, ISO-8859-N (where N selects a geographic or cultural region), and the WINDOWS-125x code pages (where, much like N in ISO-8859, the last digit selects a regional variant). In these single-byte encodings, each character is encoded as one byte, so the byte 65 is the numeric code (or, more precisely, the code point) of the character ‘A’ in ASCII and in the encodings derived from it, such as ISO-8859, the Windows code pages, and even CP-NNN. Since only one byte is used, these sets can represent a maximum of 256 characters in theory; in practice, some of those are “non-printable” characters, such as the literal ‘\n’ (code point 10), ‘\r’ (13), ‘\t’ (9), and the NUL character with code point 0.
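
As a quick illustration, here is a minimal Python sketch (the standard library ships codecs for all the encodings discussed here):

print(bytes([65]).decode('ascii'))   # 'A': byte 65 is the code point of 'A'
print(bytes([65]).decode('cp1252'))  # 'A': same byte, same character in WINDOWS-1252
print(bytes([10]) == b'\n')          # True: code point 10 is the non-printable newline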

Windows

Windows has TWO single-byte character sets and one multi-byte character set. For the US, there’s what Microsoft calls the “ANSI” character set; for the US and Western Europe it is WINDOWS-1252 (WINDOWS-1251 for Russian). This character set agrees with ASCII for code points 0 to 127. However, ASCII uses only 7 of the 8 bits in a byte, which leaves an additional 128 values available. WINDOWS-1252 extends the ASCII set with characters in this upper range.

If we look at ISO-8859-1 (also known as the “Latin-1” character set), we see strong similarities. The difference is that ISO-8859-1 assigns no printable characters to the code points between 128 (0x80) and 159 (0x9F), reserving them for control codes, while WINDOWS-1252 places printable characters (such as ‘€’ and the typographic quotes) there.
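
This is easy to observe in Python (a minimal sketch; latin-1 and cp1252 are standard library codecs):

b = bytes([0x80])
print(b.decode('cp1252'))   # '€':    a printable character in WINDOWS-1252
print(b.decode('latin-1'))  # '\x80': only a control code in ISO-8859-1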

The second single-byte Windows character set, CP-437 (CP-866 for Russian), is used primarily in the terminal to maintain compatibility with older versions of MS-DOS. CP-437 characters from 0x80 to 0xFF differ from WINDOWS-1252. For example, code point 160 is a “non-breaking space” in WINDOWS-1252 and ISO-8859-1, but it is ‘á’ in CP-437. In GUI applications you’ll use WINDOWS-1252, but when writing terminal applications you’ll use CP-437.
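
Decoding the same byte under each code page shows the divergence (again a Python sketch using standard library codecs):

b = bytes([160])
print(repr(b.decode('cp1252')))  # '\xa0': the non-breaking space
print(repr(b.decode('cp437')))   # 'á':    a completely different character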

Unicode

Unicode is a multi-byte character set whose code points fit into a 32-bit value. With room for about 4 billion possible characters, it can represent all the symbols we need. In fact, only a small subset of the code points is currently assigned, which is already enough to represent Latin, Cyrillic, hieroglyphs, even Babylonian cuneiform symbols, Aramaic, and more. There is still plenty of room left for emojis and technical symbols.

Because Unicode code points are up to 32 bits long, they are typically written in hexadecimal and prefixed with U+. So, ‘A’ is U+00000041. Since only a subset is used, the hexadecimal value is typically written with 4–6 digits, for example, U+0041. Some characters require more digits, such as ‘𝈡’ (a Greek musical notation symbol), which has the code point U+1D221.
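
In Python, ord() and chr() convert between a character and its code point, which makes the U+ notation easy to inspect:

print(hex(ord('A')))   # 0x41:    i.e. U+0041
print(hex(ord('𝈡')))   # 0x1d221: i.e. U+1D221
print(chr(0x1D221))    # '𝈡'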

You might think that a single character would then always require 4 bytes, but there are different encodings for the same code point: it is possible to “transform” a code point into a shorter representation. These transformations are called UTF (short for Unicode Transformation Format). UNIX systems currently use UTF-8 by default; Windows uses UTF-16 as its multi-byte encoding.
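
The effect is easy to see by encoding the same code point with different UTFs (a Python sketch; all three codecs are in the standard library):

print('A'.encode('utf-8').hex())      # 41:       1 byte
print('A'.encode('utf-16-le').hex())  # 4100:     2 bytes
print('A'.encode('utf-32-le').hex())  # 41000000: 4 bytes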

UTF-8

In this 8-bit transformation format, an ASCII code point (7 bits) is stored in a single byte that maps directly to its ASCII value. So, ‘A’ is 0x41 in both ASCII and UTF-8. This works well for languages that use the plain Latin alphabet without additional characters such as the diacritics of French or the umlauts of German. For any code point above 0x7F, however, 2 or more bytes are used to represent a single character. The rules for this encoding are simple:

  • If a code point is only 7 bits long (ASCII), bit 7 is 0, so 0b0xxxxxxx is a single-byte character.
  • For code points of 8 bits or more, the leading bits of the first byte indicate how many bytes are used (a run of high-order 1 bits, one per byte, followed by a 0 bit). The remaining bits of the first byte hold the upper bits of the code point; each continuation byte has the form 0b10xxxxxx and holds the next 6 bits. A sketch of these rules in code follows below.
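
These rules translate almost line by line into code. Here is a minimal Python sketch (the function name utf8_encode is purely illustrative, not a standard API):

def utf8_encode(cp):
    # Encode one Unicode code point by hand, following the rules above.
    # (Error handling and the reserved surrogate range are ignored for brevity.)
    if cp < 0x80:                                  # 7 bits: a plain ASCII byte
        return bytes([cp])
    if cp < 0x800:                                 # up to 11 bits: 2 bytes
        return bytes([0b11000000 | (cp >> 6),
                      0b10000000 | (cp & 0x3F)])
    if cp < 0x10000:                               # up to 16 bits: 3 bytes
        return bytes([0b11100000 | (cp >> 12),
                      0b10000000 | ((cp >> 6) & 0x3F),
                      0b10000000 | (cp & 0x3F)])
    return bytes([0b11110000 | (cp >> 18),         # up to 21 bits: 4 bytes
                  0b10000000 | ((cp >> 12) & 0x3F),
                  0b10000000 | ((cp >> 6) & 0x3F),
                  0b10000000 | (cp & 0x3F)])

print(utf8_encode(0x41))        # b'A': ASCII passes through unchanged
print(utf8_encode(0xE9).hex())  # c3a9: 'é' (U+00E9) needs two bytes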

For example, let’s look at the encoding of the symbol ‘𝈡’, which has the code point U+1D221. In decimal this is 119329, and in binary:

0001 1101 0010 0010 0001

So, this code point is 17 bits long. How many bytes are needed to carry this value? You might assume 3 bytes would be enough (3 bytes = 24 bits, which is more than 17 bits). However, in the 3-byte form we get only 4 payload bits in the first byte and 6 in each of the two continuation bytes, for a total of 16 bits: 1 bit short. The format for 4-byte UTF-8 is:

0b11110xxx 0b10xxxxxx 0b10xxxxxx 0b10xxxxxx

where the x positions are filled with the bits of the code point.

With 4 bytes we can encode code points up to 21 bits long, so the 4 most significant bits will be set to zero. The character U+1D221 will be encoded as:

0b11110000, 0b10011101, 0b10001000, 0b10100001

which in hexadecimal notation corresponds to the sequence:

0xF0, 0x9D, 0x88, 0xA1

Let’s take a closer look at how the bits are distributed in UTF-8. We need to spread the 21 bits of the code point across the following pattern:

0b11110xxx 0b10xxxxxx 0b10xxxxxx 0b10xxxxxx
  • First byte: 11110 + 3 bits
  • Second byte: 10 + 6 bits
  • Third byte: 10 + 6 bits
  • Fourth byte: 10 + 6 bits

Let’s break the binary representation 0001 1101 0010 0010 0001 into the corresponding parts, starting from the least significant bit:

  • Last 6 bits: 100001
  • Next 6 bits: 001000
  • Next 6 bits: 011101
  • The remaining 3 bits: 000 (the code point has only 17 significant bits, so we pad the 21-bit pattern with zeros in front)

Forming the UTF-8 bytes:

  • Byte 1: 11110000 (hexadecimal: F0)
  • Byte 2: 10011101 (hexadecimal: 9D)
  • Byte 3: 10001000 (hexadecimal: 88)
  • Byte 4: 10100001 (hexadecimal: A1)

Thus, the character ‘𝈡’ (U+1D221) is encoded in UTF-8 as the byte sequence 0xF0 0x9D 0x88 0xA1.

To check this on Linux, we can use the following command:

$ echo -n "𝈡" | hd
00000000 f0 9d 88 a1 |....|
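
The same check can be done with Python’s built-in codec, which agrees with the hand calculation above:

print('𝈡'.encode('utf-8').hex())  # f09d88a1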

UTF-16

UTF-16 is the 16-bit variant of this transformation and works similarly. If a code point fits into 16 bits, it is stored as a single 16-bit word; otherwise two 16-bit words (a so-called surrogate pair) are required. There are some additional rules:

  • Code points U+0000 to U+D7FF and U+E000 to U+FFFF can be encoded directly using only one word.
  • To encode code points from U+10000 to U+10FFFF, two 16-bit words are used: 0b110110xxxxxxxxxx and 0b110111yyyyyyyyyy, where 0bxxxxxxxxxxyyyyyyyyyy is the 20-bit value of the code point minus 0x10000 (see the sketch after this list).
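
Here is the surrogate-pair calculation for U+1D221 as a minimal Python sketch (hi and lo are just illustrative variable names):

cp = 0x1D221
v = cp - 0x10000                       # 20-bit offset: 0xD221
hi = 0xD800 | (v >> 10)                # leading word:  0b110110xxxxxxxxxx
lo = 0xDC00 | (v & 0x3FF)              # trailing word: 0b110111yyyyyyyyyy
print(hex(hi), hex(lo))                # 0xd834 0xde21
print('𝈡'.encode('utf-16-be').hex())   # d834de21: the built-in codec agrees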

This comes with a few limitations:

  • Code points U+D800 to U+DFFF are reserved in Unicode because of UTF-16 and never denote characters on their own.
  • The surrogate mechanism can encode only a 20-bit offset, which is why the Unicode code space is capped at U+10FFFF.
  • Because this encoding uses WORDS (16-bit units) instead of bytes, it introduces a byte-ordering issue, so the byte order must be specified for the transformation: UTF-16 typically uses UTF-16LE (little-endian) ordering, but UTF-16BE (big-endian) ordering can also be used, as the sketch below shows.
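
Encoding the same character with both byte orders makes the difference visible (a Python sketch; utf-16-le and utf-16-be are standard codecs):

print('A'.encode('utf-16-le').hex())  # 4100: low-order byte first
print('A'.encode('utf-16-be').hex())  # 0041: high-order byte first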

 

