Unicode: Concepts, Planes & 5-Layer Architecture (ACR/CCS/CEF/CES/TES)
Unicode and Character Encoding
Unicode is the global standard that unifies all text characters across languages and writing systems. It assigns every symbol — letters, digits, punctuation, emojis, and control marks — a unique number called a *code point*. This allows all languages to coexist in a single consistent system.
Why Unicode was needed
Before Unicode, computers used many different and incompatible encodings such as ASCII, ISO-8859, and Windows code pages. Each could only represent a subset of all written languages. As a result:
- Text from one system often appeared garbled on another.
- Mixing multiple languages in one document was difficult and often impossible.
- Software had to guess or specify which encoding a file used.
Unicode solved this by defining a single, universal character set.
Unicode basics
- **Code point** – The unique number assigned to each character (e.g. “A” = U+0041, “Ω” = U+03A9, “你” = U+4F60).
- **Glyph** – The visible shape of a character (how it is drawn in a font).
- **Encoding form** – The way code points are mapped to sequences of fixed-size code units (UTF-8, UTF-16, UTF-32), which are then stored as bytes.
- **Plane** – A group of 65,536 code points. Unicode currently defines 17 planes (Plane 0 = Basic Multilingual Plane).
As of version 16.0, Unicode defines roughly 155,000 characters, including scripts, emojis, and special symbols.
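The relationship between a character and its code point can be checked directly in Python. The following is a minimal sketch using only built-ins; the helper name `code_point_label` is an assumption made for illustration.

```python
import unicodedata


def code_point_label(ch: str) -> str:
    """Return the conventional U+XXXX label for a single character."""
    return f"U+{ord(ch):04X}"


for ch in ["A", "Ω", "你"]:
    # ord() gives the code point as an integer; chr() is the inverse.
    # unicodedata.name() looks up the character's official Unicode name.
    print(ch, code_point_label(ch), unicodedata.name(ch))

# Round trip: code point -> character -> code point
assert chr(0x03A9) == "Ω"
assert ord(chr(0x4F60)) == 0x4F60
```

Note that the glyph you see when this prints depends on the terminal font, while the code point does not.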
The five-layer Unicode model
The Unicode Character Encoding Model (Unicode Technical Report #17) describes text encoding as a stack of conceptual layers:
- **Abstract Character Repertoire (ACR):**
Defines which characters exist — for example “A”, “é”, “Ω”, “你”.
- **Coded Character Set (CCS):**
Assigns each character a numeric code point (e.g., “A” → 65 or U+0041).
- **Character Encoding Form (CEF):**
Specifies how code points are stored using code units (8, 16, or 32 bits).
- **Character Encoding Scheme (CES):**
Describes how code units become a byte stream (e.g., UTF-8, UTF-16BE, UTF-16LE).
- **Transfer Encoding Syntax (TES):**
Optional layer for transmission (e.g., Base64, gzip compression).
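To make the layers concrete, the sketch below walks one character through each of them, using UTF-16 as the encoding form and Base64 as the transfer syntax. The function name `layers` and the dictionary keys are assumptions chosen for illustration, not standard terminology.

```python
import base64
import struct


def layers(ch: str) -> dict:
    """Trace a single character through the encoding layers (UTF-16 path)."""
    code_point = ord(ch)                      # CCS: character -> code point
    bytes_be = ch.encode("utf-16-be")         # CES: big-endian byte stream
    bytes_le = ch.encode("utf-16-le")         # CES: little-endian byte stream
    # CEF: recover the 16-bit code units from the big-endian bytes.
    code_units = list(struct.unpack(f">{len(bytes_be) // 2}H", bytes_be))
    transfer = base64.b64encode(bytes_be)     # TES: optional transport wrapper
    return {
        "character": ch,                      # ACR: the abstract character
        "code_point": code_point,
        "code_units": code_units,
        "bytes_be": bytes_be,
        "bytes_le": bytes_le,
        "base64": transfer,
    }


info = layers("你")
print(f"U+{info['code_point']:04X}", [hex(u) for u in info["code_units"]],
      info["bytes_be"].hex(" "), info["bytes_le"].hex(" "), info["base64"])
```

The same code units yield different byte streams for the two byte orders, which is exactly the distinction between the CEF and CES layers.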
Unicode planes
Unicode divides its code space into planes, each containing 65,536 (2¹⁶) code points:
| Plane | Name | Range | Examples |
|-------|------|-------|----------|
| 0 | Basic Multilingual Plane (BMP) | U+0000–U+FFFF | Latin, Cyrillic, Arabic, Chinese |
| 1 | Supplementary Multilingual Plane (SMP) | U+10000–U+1FFFF | Ancient scripts, musical notation |
| 2 | Supplementary Ideographic Plane (SIP) | U+20000–U+2FFFF | Chinese/Japanese/Korean extensions |
| 15–16 | Supplementary Private Use Areas | U+F0000–U+10FFFF | Reserved for custom use |
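Because each plane holds exactly 2¹⁶ code points, the plane number is simply the code point shifted right by 16 bits. A minimal sketch (the helper name `plane_of` is an assumption for illustration):

```python
def plane_of(code_point: int) -> int:
    """Return the Unicode plane (0-16) that contains the given code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {code_point:#x}")
    # Each plane spans 0x10000 code points, so the plane is the high bits.
    return code_point >> 16


for ch in ["A", "你", "\U0001F600", "\U00020000"]:
    print(f"U+{ord(ch):04X} is in plane {plane_of(ord(ch))}")
```

Anything with a plane number above 0 lies outside the BMP and needs a surrogate pair in UTF-16 and four bytes in UTF-8.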
UTF encodings
Unicode data must be stored or transmitted as bytes. Three major encoding forms exist:
| Encoding | Unit size | Variable length | Description |
|----------|-----------|-----------------|-------------|
| UTF-8 | 8 bits | Yes (1–4 bytes) | Compact for ASCII, dominant in web & Unix systems |
| UTF-16 | 16 bits | Yes (2 or 4 bytes using surrogate pairs) | Used in Windows, Java, .NET |
| UTF-32 | 32 bits | No (fixed width) | Simple but inefficient in size |
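The surrogate pairs mentioned for UTF-16 follow a fixed arithmetic rule: subtract 0x10000 from the code point, then split the remaining 20 bits into two 10-bit halves. The sketch below illustrates that rule; the function name `to_surrogates` is an assumption for illustration.

```python
def to_surrogates(code_point: int) -> tuple[int, int]:
    """Split a supplementary-plane code point into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP use surrogate pairs")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits -> high (lead) surrogate
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits -> low (trail) surrogate
    return high, low


high, low = to_surrogates(0x1F600)         # U+1F600 GRINNING FACE
print(f"{high:04X} {low:04X}")

# Cross-check against Python's own UTF-16 encoder.
assert "\U0001F600".encode("utf-16-be") == high.to_bytes(2, "big") + low.to_bytes(2, "big")
```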
Example (UTF-16 and UTF-32 shown in big-endian byte order):
- “A” (U+0041): UTF-8 → 0x41, UTF-16 → 0x00 0x41, UTF-32 → 0x00 0x00 0x00 0x41
- “你” (U+4F60): UTF-8 → 0xE4 0xBD 0xA0, UTF-16 → 0x4F 0x60, UTF-32 → 0x00 0x00 0x4F 0x60
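These byte sequences can be reproduced with Python's built-in codecs. This is a small sketch; the helper name `hex_bytes` is an assumption, and the `-be` codec names are used so that no byte order mark is prepended.

```python
def hex_bytes(text: str, encoding: str) -> str:
    """Encode text and return the bytes as space-separated hex."""
    return text.encode(encoding).hex(" ")


for ch in ["A", "你"]:
    for enc in ["utf-8", "utf-16-be", "utf-32-be"]:
        print(f"{ch} {enc:10} {hex_bytes(ch, enc)}")
```

Using the plain `"utf-16"` or `"utf-32"` codec names instead would add a BOM and use the platform's native byte order, so the output would differ.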
Byte order and BOM
UTF-16 and UTF-32 use code units wider than one byte, so byte order (endianness) matters:
- **UTF-16BE** – Big endian (most significant byte first)
- **UTF-16LE** – Little endian (least significant byte first)
- **BOM (Byte Order Mark):** The code point U+FEFF placed at the start of a stream; the order in which its bytes appear reveals the endianness.
Example (UTF-16):
- Big endian BOM → 0xFE 0xFF
- Little endian BOM → 0xFF 0xFE
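A reader can sniff the BOM to pick a decoder. The sketch below uses the BOM constants from Python's standard `codecs` module; the function name `sniff_bom` is an assumption, and the UTF-8 and UTF-32 BOMs are included for completeness even though the text above only lists UTF-16.

```python
import codecs

# Longer BOMs first: the UTF-32LE BOM (FF FE 00 00) begins with the
# UTF-16LE BOM (FF FE), so order matters.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


print(sniff_bom(b"\xfe\xff\x00A"))   # big-endian UTF-16 "A" with BOM
print(sniff_bom(b"plain ASCII"))     # no BOM present
```

This is only a heuristic: a UTF-16LE stream whose first character after the BOM is U+0000 would be misread as UTF-32LE, and a missing BOM tells you nothing about the encoding.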
UTF-8 in detail
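UTF-8 is the most widely used encoding today because:
- It is compatible with ASCII: every ASCII file is already valid UTF-8.
- It is self-synchronizing: lead bytes and continuation bytes have distinct bit patterns, so a decoder can find the next character boundary after a damaged or truncated byte.
- It is efficient for ASCII-heavy text such as English, markup, and source code (1 byte per character).
- It supports all Unicode characters.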
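Self-synchronization can be sketched in a few lines: continuation bytes always have the bit pattern `10xxxxxx`, so a reader that lands in the middle of a character can skip forward until it reaches a byte that does not match. The helper names below are assumptions for illustration.

```python
def is_continuation(byte: int) -> bool:
    """True if the byte has the form 10xxxxxx (a UTF-8 continuation byte)."""
    return (byte & 0xC0) == 0x80


def next_boundary(data: bytes, start: int) -> int:
    """Return the index of the first character boundary at or after `start`."""
    i = start
    while i < len(data) and is_continuation(data[i]):
        i += 1
    return i


data = "你A".encode("utf-8")       # e4 bd a0 41
i = next_boundary(data, 1)         # index 1 is in the middle of "你"
print(i, data[i:].decode("utf-8"))
```

At most three bytes are ever skipped, because no UTF-8 sequence has more than three continuation bytes.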
Byte patterns:
- 1-byte: 0xxxxxxx (ASCII)
- 2-byte: 110xxxxx 10xxxxxx
- 3-byte: 1110xxxx 10xxxxxx 10xxxxxx
- 4-byte: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
Example: The letter “Ä” (U+00C4) → 0xC3 0x84 in UTF-8.
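The byte patterns above translate directly into bit shifts and masks. The following hand-written encoder is a sketch for illustration only (real code should call `str.encode("utf-8")`); the function name `utf8_encode` is an assumption.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point to UTF-8 using the 1- to 4-byte patterns."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode scalar value")
    if cp < 0x80:                       # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:                    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


print(utf8_encode(0x00C4).hex(" "))     # "Ä"

# Cross-check against the built-in encoder for one sample per length.
for ch in ["A", "Ä", "你", "\U0001F600"]:
    assert utf8_encode(ord(ch)) == ch.encode("utf-8")
```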
Common pitfalls
- **Mojibake:** Garbled text caused by interpreting bytes in the wrong encoding (e.g., UTF-8 bytes read as Latin-1 turn “Zürich” into “ZÃ¼rich”).
- **Mixed encodings:** Some old files contain both UTF-8 and legacy code page bytes.
- **Hidden control characters:** Different newline conventions (CR, LF, CRLF) and an unexpected leading BOM can cause cross-platform issues.
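Mojibake is easy to reproduce and, when the wrong decoder is known, to reverse. The sketch below also shows one common (and imperfect) way to cope with legacy files: try UTF-8 first and fall back to a code page. The function name `decode_with_fallback` and the choice of `cp1252` as the fallback are assumptions for illustration.

```python
original = "Zürich"

# Produce mojibake: encode as UTF-8, then decode with the wrong codec.
garbled = original.encode("utf-8").decode("latin-1")
print(garbled)

# Repair: reverse the wrong step, then decode correctly.
repaired = garbled.encode("latin-1").decode("utf-8")
assert repaired == original


def decode_with_fallback(data: bytes, legacy: str = "cp1252") -> str:
    """Try strict UTF-8 first; fall back to a legacy code page on failure."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: the file came from a Western European Windows system.
        return data.decode(legacy, errors="replace")


print(decode_with_fallback(b"Z\xc3\xbcrich"))   # valid UTF-8
print(decode_with_fallback(b"Z\xfcrich"))       # legacy single-byte "ü"
```

The fallback works because legacy bytes above 0x7F rarely form valid UTF-8 sequences, but it decodes a whole buffer one way or the other, so it does not rescue a file that mixes both encodings.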
Why UTF-8 won
- Backward compatible with ASCII
- Works across languages and systems
- The default or required encoding for HTML, XML, and JSON, and the usual default on Linux and the web
- Supported natively in most modern programming languages
Summary
- Unicode assigns unique code points to all characters.
- UTF-8, UTF-16, and UTF-32 are ways to store those code points.
- UTF-8 is compact, universal, and dominant today.
- Encoding mismatches still occur, but standards like UTF-8 largely solved global text interoperability.