ASCII & Unicode
Control (0–31, 127)
Space (32)
Digits (48–57)
Uppercase (65–90)
Lowercase (97–122)
Punctuation & symbols (33–47, 58–64, 91–96, 123–126)
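The ranges in the chart above can be turned into a small classifier. This is a minimal sketch; the function name `ascii_category` and the category labels are illustrative choices, not part of any standard.

```python
def ascii_category(code: int) -> str:
    """Return the chart category for a 7-bit ASCII code (0-127)."""
    if not 0 <= code <= 127:
        raise ValueError("not a 7-bit ASCII code")
    if code <= 31 or code == 127:
        return "control"
    if code == 32:
        return "space"
    if 48 <= code <= 57:
        return "digit"
    if 65 <= code <= 90:
        return "uppercase"
    if 97 <= code <= 122:
        return "lowercase"
    # Everything left over: 33-47, 58-64, 91-96, 123-126
    return "punctuation or symbol"


# Count how many codes fall into each category.
counts = {}
for code in range(128):
    category = ascii_category(code)
    counts[category] = counts.get(category, 0) + 1

for category, n in counts.items():
    print(f"{category}: {n}")
```

The six counts add up to 128, the full 7-bit range.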
Extended ASCII adds 128 characters (codes 128–255) on top of standard ASCII. These codes are not universal: different vendors and standards bodies assigned them differently. The same byte value therefore maps to completely different characters depending on the encoding. For example, byte 0xE9 is é in Latin-1 and Windows-1252 but Θ in CP437.
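The effect is easy to reproduce with Python's built-in codecs. The sketch below decodes one byte under several code pages; the choice of byte (0xE9) and the list of codecs are only for illustration.

```python
# One byte from the "extended" range, decoded under different code pages.
raw = bytes([0xE9])

decoded = {}
for codec in ("latin-1", "cp1252", "cp437", "koi8_r"):
    decoded[codec] = raw.decode(codec)
    print(f"{codec:8} 0x{raw[0]:02X} -> {decoded[codec]}")
```

Latin-1 and Windows-1252 agree on this byte, while CP437 and KOI8-R each map it to an unrelated character.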
1,114,112 total code points · U+0000 to U+10FFFF
What is Unicode?
Unicode is a universal standard that assigns a unique number, called a code point, to each character it covers, with the aim of covering every human writing system. U+0041 is always the letter A. U+4E2D is always 中. U+1F600 is always 😀. There are 1,114,112 possible code points (U+0000 to U+10FFFF, which is 17 planes of 65,536 each), of which upwards of 150,000 are currently assigned to characters; the exact count grows with each Unicode version. A code point is just a number: it says nothing about how to store it as bytes. That is the job of an encoding such as UTF-8.
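The difference between a code point and its stored bytes can be seen directly in Python, where `ord` gives the code point and `str.encode` gives the bytes. This is a small sketch assuming Python 3.8 or later (for `bytes.hex` with a separator); the helper name `describe` is made up for the example.

```python
def describe(ch: str) -> tuple:
    """Return (code point, UTF-8 bytes, UTF-16BE bytes) for one character."""
    code_point = f"U+{ord(ch):04X}"
    utf8 = ch.encode("utf-8").hex(" ")
    utf16 = ch.encode("utf-16-be").hex(" ")
    return code_point, utf8, utf16


for ch in ("A", "中", "😀"):
    code_point, utf8, utf16 = describe(ch)
    print(f"{ch}  {code_point:8}  UTF-8: {utf8:12}  UTF-16BE: {utf16}")

# The size of the code space: 17 planes of 65,536 code points each.
total_code_points = 0x10FFFF + 1
print(total_code_points)
```

The code point stays the same in every row; only the byte sequence changes with the encoding.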
Why it was introduced
By the late 1980s and early 1990s, the world had hundreds of incompatible encodings: single-byte code pages such as CP437, Windows-1252, Latin-1 and KOI8-R, and multi-byte East Asian encodings such as Shift-JIS, Big5 and GB2312, each covering a different region. The same byte value meant something completely different depending on which encoding the software assumed. A document created on one system could display as garbage on another, and email and the early web suffered from this routinely. Unicode was designed to replace all of them with a single standard: one number per character, agreed globally, and never reassigned.
UTF-8 encoding — how bits are shared between signalling and data
UTF-8 is variable-length: each character uses 1 to 4 bytes depending on its code point. But how does the decoder know how many bytes to read? The leading bits of the first byte carry the length signal — which means fewer bits are available for the actual character data.
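As a rough illustration of that length signal, the sketch below inspects only the leading bits of a first byte. The function name `utf8_sequence_length` is hypothetical, and the check is deliberately simplified: it does not reject first bytes that match a pattern but are never valid in UTF-8 (0xC0, 0xC1, 0xF5 to 0xF7).

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return how many bytes the sequence starting with first_byte claims to use."""
    if first_byte >> 7 == 0b0:       # 0xxxxxxx
        return 1
    if first_byte >> 5 == 0b110:     # 110xxxxx
        return 2
    if first_byte >> 4 == 0b1110:    # 1110xxxx
        return 3
    if first_byte >> 3 == 0b11110:   # 11110xxx
        return 4
    # 10xxxxxx is a continuation byte; 11111xxx is not used by UTF-8.
    raise ValueError(f"0x{first_byte:02X} cannot start a UTF-8 sequence")


for ch in ("A", "é", "中", "😀"):
    encoded = ch.encode("utf-8")
    print(ch, hex(encoded[0]), utf8_sequence_length(encoded[0]))
```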
U+0000 – U+007F
byte 1: 0xxxxxxx
7 data bits · 128 characters · identical to ASCII · e.g. A = 0x41
U+0080 – U+07FF
byte 1: 110xxxxx
byte 2: 10xxxxxx
5 + 6 = 11 data bits · 1,920 characters · e.g. é = 0xC3 0xA9
U+0800 – U+FFFF
byte 1: 1110xxxx
byte 2: 10xxxxxx
byte 3: 10xxxxxx
4 + 6 + 6 = 16 data bits · 61,440 characters · e.g. 中 = 0xE4 0xB8 0xAD
U+10000 – U+10FFFF
byte 1: 11110xxx
byte 2: 10xxxxxx
byte 3: 10xxxxxx
byte 4: 10xxxxxx
3 + 6 + 6 + 6 = 21 data bits · emoji & rare scripts · e.g. 😀 = 0xF0 0x9F 0x98 0x80
signalling bits (the fixed 0s and 1s): consumed by the encoding, not available for data
data bits (each x): carry the actual code point value
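The four layouts above translate almost line for line into shifts and masks. The following is a minimal sketch of an encoder written for illustration (the name `utf8_encode` is hypothetical); in practice Python's built-in `str.encode("utf-8")` does this job. It rejects the surrogate range U+D800 to U+DFFF, which is why the three-byte row counts 61,440 characters rather than 63,488.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point as UTF-8 by filling the x bits of each layout."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"U+{cp:04X} cannot be encoded in UTF-8")
    if cp <= 0x7F:                       # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:                      # 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (cp >> 6),
                      0b10000000 | (cp & 0b111111)])
    if cp <= 0xFFFF:                     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (cp >> 12),
                      0b10000000 | ((cp >> 6) & 0b111111),
                      0b10000000 | (cp & 0b111111)])
    return bytes([0b11110000 | (cp >> 18),   # 11110xxx + three continuation bytes
                  0b10000000 | ((cp >> 12) & 0b111111),
                  0b10000000 | ((cp >> 6) & 0b111111),
                  0b10000000 | (cp & 0b111111)])


for ch in ("A", "é", "中", "😀"):
    encoded = utf8_encode(ord(ch))
    print(f"U+{ord(ch):04X}", " ".join(f"0x{b:02X}" for b in encoded))
```

Taking é as a worked case: U+00E9 is 00011101001 in 11 bits, which splits into 00011 and 101001, giving 110 00011 (0xC3) and 10 101001 (0xA9).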
A continuation byte always begins with the bits 10, so it can never be mistaken for a leading byte. In valid UTF-8, any byte that does not start with 10 is the start of a new character. This lets a decoder resynchronise quickly after a corrupted byte: it simply skips forward until it finds a non-continuation byte.
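The skip-forward rule can be sketched in a few lines. The helper `next_boundary` below is a hypothetical name, and the example simulates landing in the middle of a character rather than modelling any particular decoder.

```python
def next_boundary(data: bytes, pos: int) -> int:
    """Return the first index at or after pos that is not a continuation byte."""
    # A continuation byte has the top two bits 10.
    while pos < len(data) and (data[pos] & 0b11000000) == 0b10000000:
        pos += 1
    return pos


data = "é中".encode("utf-8")          # c3 a9 e4 b8 ad
start = next_boundary(data, 1)        # index 1 is the continuation byte a9
print(start, data[start:].decode("utf-8"))
```

Starting from the middle of é, the scan skips one byte and decoding resumes cleanly at 中.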
© 2026 Neil Kendall · More @ www.korovatron.co.uk