ASCII & Unicode

Control (0–31, 127)

Space (32)

Digits (48–57)

Uppercase (65–90)

Lowercase (97–122)

Punctuation & symbols


1,114,112 total code points · U+0000 to U+10FFFF

What is Unicode?

Unicode is a universal standard that assigns a unique number, called a code point, to each character it covers, with the aim of covering every human writing system. U+0041 is always the letter A. U+4E2D is always 中. U+1F600 is always 😀. There are 1,114,112 possible code points (U+0000 to U+10FFFF), of which around 150,000 are currently assigned to characters. A code point is just a number: it says nothing about how to store it as bytes. That is the job of an encoding such as UTF-8.
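To make the split between code point and encoding concrete, here is a minimal Python sketch. The helper name `describe` is made up for this illustration; `ord` and `str.encode` are standard Python. It shows that one code point can be stored as different byte sequences depending on the encoding chosen.

```python
def describe(ch: str) -> tuple[str, str, str]:
    """Return (code point, UTF-8 bytes, UTF-16 big-endian bytes) as text."""
    code_point = ord(ch)                      # the code point as a plain integer
    utf8 = ch.encode("utf-8").hex(" ")        # one way to store it as bytes
    utf16 = ch.encode("utf-16-be").hex(" ")   # a different way, same code point
    return (f"U+{code_point:04X}", utf8, utf16)


for ch in ["A", "中", "😀"]:
    label, utf8, utf16 = describe(ch)
    print(f"{label:8}  utf-8: {utf8:12}  utf-16: {utf16}")
```

The code point column never changes; only the byte columns depend on the encoding.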

Why it was introduced

By the late 1980s, the world had hundreds of incompatible character encodings: single-byte code pages such as CP437, Windows-1252, Latin-1 and KOI8-R, and multi-byte schemes such as Shift-JIS, Big5 and GB2312, each covering a different region or script. The same byte value meant something completely different depending on which encoding the software assumed. A document created on one system could display as garbage on another, and email and early web pages were routinely garbled as a result. Unicode set out to replace all of them with a single standard: one number per character, agreed globally, and never reassigned once published.
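The ambiguity is easy to reproduce. The sketch below decodes the single byte 0xE9 with three of the code pages named above, using Python's built-in codecs. The byte value and the helper name `interpretations` are chosen only for illustration.

```python
import unicodedata

CODECS = ("latin-1", "cp437", "koi8-r")   # Python codec names for three code pages


def interpretations(byte_value: int) -> dict[str, str]:
    """Decode one byte under each legacy code page in CODECS."""
    raw = bytes([byte_value])
    return {codec: raw.decode(codec) for codec in CODECS}


for codec, ch in interpretations(0xE9).items():
    # Print the code point and its official name rather than the glyph itself.
    print(f"{codec:8} U+{ord(ch):04X} {unicodedata.name(ch)}")
```

The same byte comes back as a Latin, a Greek, and a Cyrillic letter, depending only on which table the reader assumed.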

UTF-8 encoding — how bits are shared between signalling and data

UTF-8 is variable-length: each character uses 1 to 4 bytes depending on its code point. But how does the decoder know how many bytes to read? The leading bits of the first byte carry the length signal — which means fewer bits are available for the actual character data.
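The length signal can be read from the first byte with a few shifts. This is a minimal sketch, not a full validator: the function name `utf8_sequence_length` is an assumption, and it checks only the prefix pattern, so it does not reject lead bytes that can never occur in valid UTF-8 (such as 0xC0 or 0xF5).

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return the sequence length signalled by a UTF-8 lead byte."""
    if first_byte >> 7 == 0b0:         # 0xxxxxxx: single byte (ASCII)
        return 1
    if first_byte >> 5 == 0b110:       # 110xxxxx: two-byte sequence
        return 2
    if first_byte >> 4 == 0b1110:      # 1110xxxx: three-byte sequence
        return 3
    if first_byte >> 3 == 0b11110:     # 11110xxx: four-byte sequence
        return 4
    # Everything else here is 10xxxxxx (a continuation byte) or 11111xxx.
    raise ValueError(f"0x{first_byte:02X} is not a lead byte")


for ch in ["A", "é", "中", "😀"]:
    first = ch.encode("utf-8")[0]
    print(f"lead byte 0x{first:02X} -> {utf8_sequence_length(first)} byte(s)")
```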

U+0000 – U+007F

byte 1: 0xxxxxxx

7 data bits · 128 characters · identical to ASCII · e.g. A = 0x41

U+0080 – U+07FF

byte 1: 110xxxxx · byte 2: 10xxxxxx

5 + 6 = 11 data bits · 1,920 characters · e.g. é = 0xC3 0xA9
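As a worked example of the two-byte layout, the sketch below splits the code point of é (U+00E9) into its 5 high and 6 low data bits and attaches the signalling prefixes by hand. The variable names are illustrative only.

```python
code_point = ord("é")                  # 0xE9, which fits in 11 data bits

high5 = code_point >> 6                # top 5 data bits go into byte 1
low6 = code_point & 0b11_1111          # low 6 data bits go into byte 2

byte1 = 0b1100_0000 | high5            # prefix 110 + 5 data bits
byte2 = 0b1000_0000 | low6             # prefix 10  + 6 data bits

encoded = bytes([byte1, byte2])
print(encoded.hex(" "))                # c3 a9
```

The result matches what Python's own `"é".encode("utf-8")` produces.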

U+0800 – U+FFFF

byte 1: 1110xxxx · byte 2: 10xxxxxx · byte 3: 10xxxxxx

4 + 6 + 6 = 16 data bits · 61,440 characters (excluding the 2,048 surrogate code points U+D800 to U+DFFF) · e.g. 中 = 0xE4 0xB8 0xAD
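The character counts quoted for each range follow from simple arithmetic on the range boundaries. The short check below reproduces them, on the assumption that the surrogate code points U+D800 to U+DFFF (reserved for UTF-16) are left out of the three-byte count. The variable names are illustrative.

```python
SURROGATES = range(0xD800, 0xE000)     # 2,048 code points reserved for UTF-16

one_byte = 0x7F - 0x00 + 1                            # U+0000  to U+007F
two_byte = 0x7FF - 0x80 + 1                           # U+0080  to U+07FF
three_byte = 0xFFFF - 0x800 + 1 - len(SURROGATES)     # U+0800  to U+FFFF
four_byte = 0x10FFFF - 0x10000 + 1                    # U+10000 to U+10FFFF

for name, count in [("1 byte", one_byte), ("2 bytes", two_byte),
                    ("3 bytes", three_byte), ("4 bytes", four_byte)]:
    print(f"{name:8} {count:>9,}")
```

Adding the 2,048 surrogates back to the four counts gives the full 1,114,112 code points.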

U+10000 – U+10FFFF

byte 1: 11110xxx · byte 2: 10xxxxxx · byte 3: 10xxxxxx · byte 4: 10xxxxxx

3 + 6 + 6 + 6 = 21 data bits · emoji & rare scripts · e.g. 😀 = 0xF0 0x9F 0x98 0x80
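Going the other way, a decoder masks off the signalling bits and reassembles the 21 data bits. This is a minimal sketch for the four-byte example only; it assumes the input is already known to be a well-formed four-byte sequence and does no validation.

```python
data = bytes([0xF0, 0x9F, 0x98, 0x80])

code_point = (
    (data[0] & 0b0000_0111) << 18      # 3 data bits from byte 1
    | (data[1] & 0b0011_1111) << 12    # 6 data bits from byte 2
    | (data[2] & 0b0011_1111) << 6     # 6 data bits from byte 3
    | (data[3] & 0b0011_1111)          # 6 data bits from byte 4
)

print(f"U+{code_point:04X}")           # U+1F600
```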

Fixed 0 and 1 bits are signalling bits: consumed by the encoding, not available for data.

Bits marked x are data bits: they carry the actual code point value.
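Putting the four layouts together, an encoder only needs to pick the right range and distribute the data bits. The function below is a hedged sketch written for this page (the name `encode_utf8` is an assumption), not how any particular library implements it. In practice Python's built-in `str.encode("utf-8")` does the same job.

```python
def encode_utf8(code_point: int) -> bytes:
    """Encode one Unicode scalar value as UTF-8 using the four layouts."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded in UTF-8")

    if code_point <= 0x7F:                       # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:                      # 110xxxxx 10xxxxxx
        return bytes([
            0xC0 | (code_point >> 6),
            0x80 | (code_point & 0x3F),
        ])
    if code_point <= 0xFFFF:                     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (code_point >> 12),
            0x80 | ((code_point >> 6) & 0x3F),
            0x80 | (code_point & 0x3F),
        ])
    return bytes([                               # 11110xxx + three continuations
        0xF0 | (code_point >> 18),
        0x80 | ((code_point >> 12) & 0x3F),
        0x80 | ((code_point >> 6) & 0x3F),
        0x80 | (code_point & 0x3F),
    ])


for ch in ["A", "é", "中", "😀"]:
    print(f"U+{ord(ch):04X} -> {encode_utf8(ord(ch)).hex(' ')}")
```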

A continuation byte always begins with 10, which means it can never be mistaken for a leading byte. In valid UTF-8, any byte that does not start with 10 is the start of a new character. This lets a decoder resynchronise quickly after a corrupted or missing byte: it simply skips forward until it finds a non-continuation byte.
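The resynchronisation rule can be sketched in a few lines. The helper name `next_boundary` is hypothetical, and the example assumes the only damage is a stream that starts in the middle of a character; a real decoder would also have to handle invalid lead bytes and truncated sequences.

```python
def next_boundary(data: bytes, pos: int = 0) -> int:
    """Skip continuation bytes (10xxxxxx) from pos; return the index of the
    next non-continuation byte, or len(data) if there is none."""
    while pos < len(data) and (data[pos] & 0b1100_0000) == 0b1000_0000:
        pos += 1
    return pos


stream = "é中".encode("utf-8")          # c3 a9 e4 b8 ad
damaged = stream[1:]                    # first byte lost: starts mid-character

start = next_boundary(damaged)          # skips the orphaned continuation byte
recovered = damaged[start:].decode("utf-8")
print(start, ascii(recovered))
```

Only the character whose lead byte was lost is dropped; everything after the next boundary decodes normally.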