ASCII & Unicode

Control (0–31, 127)

Space (32)

Digits (48–57)

Uppercase (65–90)

Lowercase (97–122)

Punctuation & symbols


1,114,112 total code points · U+0000 to U+10FFFF

What is Unicode?

Unicode is a universal standard that assigns a unique number, called a code point, to each character, with the aim of covering every human writing system. U+0041 is always the letter A. U+4E2D is always 中. U+1F600 is always 😀. There are 1,114,112 possible code points (U+0000 to U+10FFFF), of which around 150,000 are currently assigned to characters. A code point is just a number; it says nothing about how to store it as bytes. That is the job of an encoding such as UTF-8.
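The split between code point and encoding can be seen with Python's built-in ord() and str.encode(). This is a small sketch; the helper name describe is made up for illustration, and UTF-16 appears only as a second encoding to contrast with UTF-8.

```python
def describe(char):
    """Return the U+ label of a character and its bytes under two encodings."""
    code_point = ord(char)                 # the abstract number
    label = f"U+{code_point:04X}"          # conventional U+ notation
    utf8_bytes = char.encode("utf-8")      # one way to store that number
    utf16_bytes = char.encode("utf-16-be") # another way to store the same number
    return label, utf8_bytes, utf16_bytes


for ch in "A中😀":
    label, utf8_bytes, utf16_bytes = describe(ch)
    print(label, utf8_bytes.hex(" "), "|", utf16_bytes.hex(" "))
```

The code point stays the same in each row; only the bytes change with the encoding chosen.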

Why it was introduced

By the late 1980s, the world had hundreds of incompatible 8-bit code pages and multi-byte encodings (CP437, Windows-1252, Latin-1, KOI8-R, Shift-JIS, Big5, GB2312), each covering a different region. The same byte value meant something completely different depending on which code page the software assumed. A document created on one system displayed as garbage on another. Email and the early web were broken by design. Unicode set out to replace all of them with a single standard: one number per character, agreed globally, forever.
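The ambiguity is easy to reproduce with the legacy codecs that still ship with Python. The sketch below decodes one byte under three of the code pages named above; the helper name decode_under and the choice of byte 0xE9 are arbitrary, chosen only for illustration.

```python
def decode_under(raw, codec_names=("latin-1", "cp437", "koi8_r")):
    """Decode the same bytes under several legacy code pages."""
    return {name: raw.decode(name) for name in codec_names}


# One byte, three interpretations: nothing in the byte says which is right.
for name, text in decode_under(bytes([0xE9])).items():
    print(f"{name:8} -> {text}")
```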

UTF-8 encoding: how bits are shared between signalling and data

UTF-8 is variable-length: each character uses 1 to 4 bytes depending on its code point. But how does the decoder know how many bytes to read? The leading bits of the first byte carry the length signal, which means fewer bits are available for the actual character data.
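A minimal sketch of how a decoder could read that length signal from the first byte. The function name utf8_sequence_length is hypothetical, and the check is deliberately simplified: it looks only at the leading bit pattern and does not reject lead bytes that strict UTF-8 forbids (0xC0, 0xC1, 0xF5 and above).

```python
def utf8_sequence_length(first_byte):
    """Return how many bytes the sequence starting with first_byte occupies."""
    if (first_byte & 0b10000000) == 0b00000000:   # 0xxxxxxx
        return 1
    if (first_byte & 0b11100000) == 0b11000000:   # 110xxxxx
        return 2
    if (first_byte & 0b11110000) == 0b11100000:   # 1110xxxx
        return 3
    if (first_byte & 0b11111000) == 0b11110000:   # 11110xxx
        return 4
    # 10xxxxxx is a continuation byte; 11111xxx is never used by UTF-8.
    raise ValueError(f"0x{first_byte:02X} cannot start a UTF-8 sequence")


for ch in "Aé中😀":
    encoded = ch.encode("utf-8")
    print(ch, utf8_sequence_length(encoded[0]), len(encoded))
```

For each character, the length read from the first byte should match the number of bytes Python's own encoder produced.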

U+0000 – U+007F

byte 1: 0xxxxxxx

7 data bits · 128 characters · identical to ASCII · e.g. A = 0x41

U+0080 – U+07FF

byte 1: 110xxxxx
byte 2: 10xxxxxx

5 + 6 = 11 data bits · 1,920 characters · e.g. é = 0xC3 0xA9

U+0800 – U+FFFF

byte 1: 1110xxxx
byte 2: 10xxxxxx
byte 3: 10xxxxxx

4 + 6 + 6 = 16 data bits · 61,440 characters (the 2,048 surrogate code points U+D800 to U+DFFF are excluded) · e.g. 中 = 0xE4 0xB8 0xAD

U+10000 – U+10FFFF

byte 1: 11110xxx
byte 2: 10xxxxxx
byte 3: 10xxxxxx
byte 4: 10xxxxxx

3 + 6 + 6 + 6 = 21 data bits · emoji & rare scripts · e.g. 😀 = 0xF0 0x9F 0x98 0x80

signalling bits (the 0s and 1s above): consumed by the encoding, not available for data

data bits (the x positions above): carry the actual code point value
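The four layouts above can be turned into a hand-written encoder. This is an illustrative sketch rather than a production implementation (the name encode_utf8 is made up here, and in practice str.encode("utf-8") does this job); it places the signalling bits with OR and fills the data bits by shifting and masking the code point.

```python
def encode_utf8(code_point):
    """Encode one code point into UTF-8 bytes using the bit layouts above."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point out of range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points are not encodable in UTF-8")

    def continuation(shift):
        # 10xxxxxx: signalling bits 10, then 6 data bits taken at `shift`.
        return 0b10000000 | ((code_point >> shift) & 0b00111111)

    if code_point <= 0x7F:        # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:       # 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (code_point >> 6), continuation(0)])
    if code_point <= 0xFFFF:      # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (code_point >> 12),
                      continuation(6), continuation(0)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0b11110000 | (code_point >> 18),
                  continuation(12), continuation(6), continuation(0)])


for ch in "Aé中😀":
    print(ch, encode_utf8(ord(ch)).hex(" "))
```

Comparing the result with Python's built-in encoder for the four example characters is a quick sanity check on the shifts and masks.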

A continuation byte always begins with 10, which means it can never be mistaken for a leading byte. Any byte that does not start with 10 is the start of a new character. This lets a decoder resynchronise quickly after a corrupted byte: it simply skips forward until it finds a non-continuation byte.
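The resynchronisation rule can be sketched in a few lines. The helper names is_continuation and next_boundary are hypothetical, and the example simulates damage by dropping the leading byte of é so that the decoder lands on a stray continuation byte.

```python
def is_continuation(byte):
    """True if the byte has the form 10xxxxxx."""
    return (byte & 0b11000000) == 0b10000000


def next_boundary(data, pos):
    """Skip forward from pos to the next byte that can start a character."""
    while pos < len(data) and is_continuation(data[pos]):
        pos += 1
    return pos


original = "aé中".encode("utf-8")        # 61 | C3 A9 | E4 B8 AD
damaged = original[:1] + original[2:]    # lead byte C3 lost: 61 | A9 | E4 B8 AD

start = next_boundary(damaged, 1)        # position 1 holds the stray A9
print(start, damaged[start:].decode("utf-8"))
```

Only the damaged character is lost; everything from the next leading byte onward decodes normally.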