Unicode: Concepts, Planes & 5-Layer Architecture (ACR/CCS/CEF/CES/TES)

Unicode and Character Encoding

Unicode is the global standard that unifies all text characters across languages and writing systems. It assigns every symbol — letters, digits, punctuation, emojis, and control marks — a unique number called a *code point*. This allows all languages to coexist in a single consistent system.

Why Unicode was needed

Before Unicode, computers used many different and incompatible encodings such as ASCII, ISO-8859, and Windows code pages. Each could only represent a subset of all written languages. As a result:

  • Text from one system often appeared garbled on another.
  • Mixing multiple languages in one document was impossible.
  • Software had to guess or specify which encoding a file used.

Unicode solved this by defining a single, universal character set.

Unicode basics

  • **Code point** – The unique number assigned to each character (e.g. “A” = U+0041, “Ω” = U+03A9, “你” = U+4F60).
  • **Glyph** – The visible shape of a character (how it is drawn in a font).
  • **Encoding form** – The way code points are represented in memory as bytes (UTF-8, UTF-16, UTF-32).
  • **Plane** – A block of 65,536 consecutive code points. The code space is divided into 17 planes (Plane 0 = Basic Multilingual Plane).

Unicode defines more than 150,000 characters, covering hundreds of scripts as well as emoji and special symbols.
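
In Python, the mapping between characters and code points can be inspected directly with the built-in `ord()` and `chr()` functions and the `unicodedata` module. A minimal sketch, using the example characters from the list above:

```python
import unicodedata

# Inspect the code point and, where the character database provides one,
# the official name of each example character.
for ch in ["A", "Ω", "你"]:
    cp = ord(ch)                                   # numeric code point
    name = unicodedata.name(ch, "(no name in database)")
    print(f"{ch}  U+{cp:04X}  {name}")
# A is U+0041 (LATIN CAPITAL LETTER A), Ω is U+03A9 (GREEK CAPITAL LETTER OMEGA),
# 你 is U+4F60.

# chr() is the inverse mapping: code point -> character
assert chr(0x4F60) == "你"
```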

The five-layer Unicode model

Unicode defines text processing in several conceptual layers:

  1. **Abstract Character Repertoire (ACR):**
  Defines which characters exist — for example “A”, “é”, “Ω”, “你”.
  2. **Coded Character Set (CCS):**
  Assigns each character a numeric code point (e.g., “A” → 65 or U+0041).
  3. **Character Encoding Form (CEF):**
  Specifies how code points are represented using code units (8, 16, or 32 bits).
  4. **Character Encoding Scheme (CES):**
  Describes how code units are serialized into a byte stream (e.g., UTF-8, UTF-16BE, UTF-16LE).
  5. **Transfer Encoding Syntax (TES):**
  Optional layer for transmission (e.g., Base64, gzip compression).
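
As a rough illustration of these layers, the following Python sketch traces a two-character string from abstract characters down to a transfer encoding (the choice of characters and of Base64 as the TES example is illustrative):

```python
import base64

text = "Ωé"                                   # ACR: the abstract characters

# CCS: each character is identified by its numeric code point
code_points = [ord(c) for c in text]          # [0x03A9, 0x00E9]

# CEF/CES: the code points serialized into concrete byte streams
utf8_bytes    = text.encode("utf-8")          # b'\xce\xa9\xc3\xa9'
utf16be_bytes = text.encode("utf-16-be")      # b'\x03\xa9\x00\xe9'

# TES: an optional transfer encoding applied to the byte stream
transfer = base64.b64encode(utf8_bytes)       # b'zqnDqQ=='

print([f"U+{cp:04X}" for cp in code_points])
print(utf8_bytes.hex(" "), utf16be_bytes.hex(" "), transfer.decode("ascii"))
```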

Unicode planes

Unicode divides its code space into planes, each containing 65,536 (2¹⁶) code points:

Plane | Name | Range | Examples
------|------|-------|---------
0 | Basic Multilingual Plane (BMP) | U+0000–U+FFFF | Latin, Cyrillic, Arabic, Chinese
1 | Supplementary Multilingual Plane (SMP) | U+10000–U+1FFFF | Ancient scripts, musical notation
2 | Supplementary Ideographic Plane (SIP) | U+20000–U+2FFFF | Chinese/Japanese/Korean extensions
15–16 | Supplementary Private Use Areas | U+F0000–U+10FFFF | Reserved for custom use
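
Since every plane spans exactly 0x10000 code points, the plane of any character can be computed from its code point with an integer shift, as in this small Python sketch:

```python
def plane_of(cp: int) -> int:
    # Each plane holds 0x10000 code points, so the plane index is cp // 0x10000.
    return cp >> 16

# Latin letter (BMP), musical G clef (SMP), first CJK Extension B ideograph (SIP)
for ch in ["A", "\U0001D11E", "\U00020000"]:
    cp = ord(ch)
    print(f"U+{cp:05X} -> plane {plane_of(cp)}")
# U+00041 -> plane 0, U+1D11E -> plane 1, U+20000 -> plane 2
```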

UTF encodings

Unicode data must be stored or transmitted as bytes. Three major encoding forms exist:

Encoding | Unit size | Variable length | Description
---------|-----------|-----------------|------------
UTF-8 | 8 bits | Yes (1–4 bytes) | Compact for ASCII, dominant in web & Unix systems
UTF-16 | 16 bits | Yes (2 or 4 bytes using surrogate pairs) | Used in Windows, Java, .NET
UTF-32 | 32 bits | No (fixed width) | Simple but inefficient in size

Example:

  • “A” (U+0041): UTF-8 → 0x41, UTF-16BE → 0x00 0x41, UTF-32BE → 0x00 0x00 0x00 0x41
  • “你” (U+4F60): UTF-8 → 0xE4 0xBD 0xA0, UTF-16BE → 0x4F 0x60, UTF-32BE → 0x00 0x00 0x4F 0x60
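
These byte sequences can be reproduced with Python's built-in codecs; the big-endian codec names are used so that no BOM is prepended (`bytes.hex()` with a separator requires Python 3.8+):

```python
for ch in ["A", "你"]:
    print(f"{ch} (U+{ord(ch):04X}):",
          "UTF-8 =", ch.encode("utf-8").hex(" "),
          "| UTF-16 =", ch.encode("utf-16-be").hex(" "),
          "| UTF-32 =", ch.encode("utf-32-be").hex(" "))
# A (U+0041): UTF-8 = 41 | UTF-16 = 00 41 | UTF-32 = 00 00 00 41
# 你 (U+4F60): UTF-8 = e4 bd a0 | UTF-16 = 4f 60 | UTF-32 = 00 00 4f 60
```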

Byte order and BOM

UTF-16 and UTF-32 use multi-byte sequences, so byte order (endianness) matters:

  • **UTF-16BE** – Big endian (most significant byte first)
  • **UTF-16LE** – Little endian
  • **BOM (Byte Order Mark):** Special prefix U+FEFF to indicate endianness.

Example:

  • Big endian BOM → 0xFE 0xFF
  • Little endian BOM → 0xFF 0xFE
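
Python's `codecs` module exposes the BOM byte sequences as constants, which makes a simple BOM check easy to sketch (illustrative only; real-world files may omit the BOM entirely):

```python
import codecs

print(codecs.BOM_UTF16_BE.hex(" "))   # fe ff
print(codecs.BOM_UTF16_LE.hex(" "))   # ff fe

def guess_utf16_endianness(data: bytes) -> str:
    # Decide the byte order from a leading BOM, if one is present.
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    return "unknown (no BOM)"

# The plain "utf-16" codec writes a BOM in the platform's native byte order.
sample = "Ä".encode("utf-16")
print(guess_utf16_endianness(sample), sample.hex(" "))
```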

UTF-8 in detail

UTF-8 is the most widely used encoding today because it is:

  • Backward compatible with ASCII
  • Self-synchronizing (a decoder can resynchronize at the next character boundary after an error)
  • Efficient for Western text
  • Able to represent all Unicode characters

Byte patterns:

  • 1-byte: 0xxxxxxx (ASCII)
  • 2-byte: 110xxxxx 10xxxxxx
  • 3-byte: 1110xxxx 10xxxxxx 10xxxxxx
  • 4-byte: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Example: The letter “Ä” (U+00C4) → 0xC3 0x84 in UTF-8.
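
The bit patterns can be turned directly into a small hand-rolled encoder; the sketch below is purely illustrative and checks its output against Python's built-in UTF-8 codec:

```python
def utf8_bytes(cp: int) -> bytes:
    if cp < 0x80:                           # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                          # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | cp >> 6, 0x80 | cp & 0x3F])
    if cp < 0x10000:                        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | cp >> 12, 0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | cp >> 18, 0x80 | cp >> 12 & 0x3F,
                  0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])

for ch in ["A", "Ä", "你", "😀"]:
    assert utf8_bytes(ord(ch)) == ch.encode("utf-8")
    print(ch, utf8_bytes(ord(ch)).hex(" "))
# “Ä” (U+00C4) prints c3 84, matching the example above.
```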

Common pitfalls

  • **Mojibake:** Garbled text caused by interpreting bytes in the wrong encoding (e.g., the UTF-8 bytes of “Zürich” read as Latin-1 display as “ZÃ¼rich”).
  • **Mixed encodings:** Some old files contain both UTF-8 and legacy code page bytes.
  • **Hidden control characters:** Different newline conventions (CR, LF, CRLF) can cause cross-platform issues.
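
The mojibake round trip can be demonstrated in a few lines of Python (here decoding UTF-8 bytes with the Latin-1 codec):

```python
# UTF-8 bytes decoded with the wrong (Latin-1) codec produce mojibake.
garbled = "Zürich".encode("utf-8").decode("latin-1")
print(garbled)                                      # ZÃ¼rich

# The damage is reversible here because Latin-1 maps every byte 1:1
# to the code points U+0000–U+00FF.
print(garbled.encode("latin-1").decode("utf-8"))    # Zürich
```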

Why UTF-8 won

  • Backward compatible with ASCII
  • Works across languages and systems
  • Standard for HTML, XML, JSON, Linux, and the web
  • Supported natively in most modern programming languages

Summary

  • Unicode assigns unique code points to all characters.
  • UTF-8, UTF-16, and UTF-32 are ways to store those code points.
  • UTF-8 is compact, universal, and dominant today.
  • Encoding mismatches still occur, but standards like UTF-8 largely solved global text interoperability.