Unicode: Concepts, Planes & 5-Layer Architecture (ACR/CCS/CEF/CES/TES)

Unicode and Character Encoding

Unicode is the global standard that unifies all text characters across languages and writing systems. It assigns every symbol — letters, digits, punctuation, emojis, and control marks — a unique number called a *code point*. This allows all languages to coexist in a single consistent system.

Why Unicode was needed

Before Unicode, computers used many different and incompatible encodings such as ASCII, ISO-8859, and Windows code pages. Each could only represent a subset of all written languages. As a result:

  • Text from one system often appeared garbled on another.
  • Mixing multiple languages in one document was impossible.
  • Software had to guess or specify which encoding a file used.

Unicode solved this by defining a single, universal character set.

Unicode basics

  • **Code point** – The unique number assigned to each character (e.g. “A” = U+0041, “Ω” = U+03A9, “你” = U+4F60).
  • **Glyph** – The visible shape of a character (how it is drawn in a font).
  • **Encoding form** – The way code points are represented in memory as bytes (UTF-8, UTF-16, UTF-32).
  • **Plane** – A block of 65,536 consecutive code points. The code space is divided into 17 planes (Plane 0 = Basic Multilingual Plane).

Unicode defines more than 150,000 characters, covering hundreds of scripts as well as emoji and special symbols.
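
In Python, the mapping between characters and code points can be inspected directly with the built-in `ord()` and `chr()` functions and the `unicodedata` module. A minimal sketch, using the example characters from the list above:

```python
import unicodedata

# Inspect the code point and, where the character database provides one,
# the official name of each example character.
for ch in ["A", "Ω", "你"]:
    cp = ord(ch)                                   # numeric code point
    name = unicodedata.name(ch, "(no name in database)")
    print(f"{ch}  U+{cp:04X}  {name}")
# A is U+0041 (LATIN CAPITAL LETTER A), Ω is U+03A9 (GREEK CAPITAL LETTER OMEGA),
# 你 is U+4F60.

# chr() is the inverse mapping: code point -> character
assert chr(0x4F60) == "你"
```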

The five-layer Unicode model

Unicode defines text processing in several conceptual layers:

  1. **Abstract Character Repertoire (ACR):**
  Defines which characters exist — for example “A”, “é”, “Ω”, “你”.
  2. **Coded Character Set (CCS):**
  Assigns each character a numeric code point (e.g., “A” → 65 or U+0041).
  3. **Character Encoding Form (CEF):**
  Specifies how code points are represented using code units (8, 16, or 32 bits).
  4. **Character Encoding Scheme (CES):**
  Describes how code units are serialized into a byte stream (e.g., UTF-8, UTF-16BE, UTF-16LE).
  5. **Transfer Encoding Syntax (TES):**
  Optional layer for transmission (e.g., Base64, gzip compression).
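
As a rough illustration of these layers, the following Python sketch traces a two-character string from abstract characters down to a transfer encoding (the choice of characters and of Base64 as the TES example is illustrative):

```python
import base64

text = "Ωé"                                   # ACR: the abstract characters

# CCS: each character is identified by its numeric code point
code_points = [ord(c) for c in text]          # [0x03A9, 0x00E9]

# CEF/CES: the code points serialized into concrete byte streams
utf8_bytes    = text.encode("utf-8")          # b'\xce\xa9\xc3\xa9'
utf16be_bytes = text.encode("utf-16-be")      # b'\x03\xa9\x00\xe9'

# TES: an optional transfer encoding applied to the byte stream
transfer = base64.b64encode(utf8_bytes)       # b'zqnDqQ=='

print([f"U+{cp:04X}" for cp in code_points])
print(utf8_bytes.hex(" "), utf16be_bytes.hex(" "), transfer.decode("ascii"))
```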

Unicode planes

Unicode divides its code space into planes, each containing 65,536 (2¹⁶) code points:

Plane | Name | Range | Examples
------|------|-------|---------
0 | Basic Multilingual Plane (BMP) | U+0000–U+FFFF | Latin, Cyrillic, Arabic, Chinese
1 | Supplementary Multilingual Plane (SMP) | U+10000–U+1FFFF | Ancient scripts, musical notation
2 | Supplementary Ideographic Plane (SIP) | U+20000–U+2FFFF | Chinese/Japanese/Korean extensions
15–16 | Supplementary Private Use Areas | U+F0000–U+10FFFF | Reserved for custom use
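
Since every plane spans exactly 0x10000 code points, the plane of any character can be computed from its code point with an integer shift, as in this small Python sketch:

```python
def plane_of(cp: int) -> int:
    # Each plane holds 0x10000 code points, so the plane index is cp // 0x10000.
    return cp >> 16

# Latin letter (BMP), musical G clef (SMP), first CJK Extension B ideograph (SIP)
for ch in ["A", "\U0001D11E", "\U00020000"]:
    cp = ord(ch)
    print(f"U+{cp:05X} -> plane {plane_of(cp)}")
# U+00041 -> plane 0, U+1D11E -> plane 1, U+20000 -> plane 2
```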

UTF encodings

Unicode data must be stored or transmitted as bytes. Three major encoding forms exist:

Encoding | Unit size | Variable length | Description
---------|-----------|-----------------|------------
UTF-8 | 8 bits | Yes (1–4 bytes) | Compact for ASCII, dominant in web & Unix systems
UTF-16 | 16 bits | Yes (2 or 4 bytes using surrogate pairs) | Used in Windows, Java, .NET
UTF-32 | 32 bits | No (fixed width) | Simple but inefficient in size

Example:

  • “A” (U+0041): UTF-8 → 0x41, UTF-16BE → 0x00 0x41, UTF-32BE → 0x00 0x00 0x00 0x41
  • “你” (U+4F60): UTF-8 → 0xE4 0xBD 0xA0, UTF-16BE → 0x4F 0x60, UTF-32BE → 0x00 0x00 0x4F 0x60
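
These byte sequences can be reproduced with Python's built-in codecs; the big-endian codec names are used so that no BOM is prepended (`bytes.hex()` with a separator requires Python 3.8+):

```python
for ch in ["A", "你"]:
    print(f"{ch} (U+{ord(ch):04X}):",
          "UTF-8 =", ch.encode("utf-8").hex(" "),
          "| UTF-16 =", ch.encode("utf-16-be").hex(" "),
          "| UTF-32 =", ch.encode("utf-32-be").hex(" "))
# A (U+0041): UTF-8 = 41 | UTF-16 = 00 41 | UTF-32 = 00 00 00 41
# 你 (U+4F60): UTF-8 = e4 bd a0 | UTF-16 = 4f 60 | UTF-32 = 00 00 4f 60
```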

Byte order and BOM

UTF-16 and UTF-32 use multi-byte sequences, so byte order (endianness) matters:

  • **UTF-16BE** – Big endian (most significant byte first)
  • **UTF-16LE** – Little endian
  • **BOM (Byte Order Mark):** Special prefix U+FEFF to indicate endianness.

Example:

  • Big endian BOM → 0xFE 0xFF
  • Little endian BOM → 0xFF 0xFE
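
Python's `codecs` module exposes the BOM byte sequences as constants, which makes a simple BOM check easy to sketch (illustrative only; real-world files may omit the BOM entirely):

```python
import codecs

print(codecs.BOM_UTF16_BE.hex(" "))   # fe ff
print(codecs.BOM_UTF16_LE.hex(" "))   # ff fe

def guess_utf16_endianness(data: bytes) -> str:
    # Decide the byte order from a leading BOM, if one is present.
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    return "unknown (no BOM)"

# The plain "utf-16" codec writes a BOM in the platform's native byte order.
sample = "Ä".encode("utf-16")
print(guess_utf16_endianness(sample), sample.hex(" "))
```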

UTF-8 in detail

UTF-8 is the most widely used encoding today because it is:

  • Backward compatible with ASCII
  • Self-synchronizing (a decoder can resynchronize at the next character boundary after an error)
  • Efficient for Western text
  • Able to represent all Unicode characters

Byte patterns:

  • 1-byte: 0xxxxxxx (ASCII)
  • 2-byte: 110xxxxx 10xxxxxx
  • 3-byte: 1110xxxx 10xxxxxx 10xxxxxx
  • 4-byte: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Example: The letter “Ä” (U+00C4) → 0xC3 0x84 in UTF-8.
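
The bit patterns can be turned directly into a small hand-rolled encoder; the sketch below is purely illustrative and checks its output against Python's built-in UTF-8 codec:

```python
def utf8_bytes(cp: int) -> bytes:
    if cp < 0x80:                           # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                          # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | cp >> 6, 0x80 | cp & 0x3F])
    if cp < 0x10000:                        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | cp >> 12, 0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | cp >> 18, 0x80 | cp >> 12 & 0x3F,
                  0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])

for ch in ["A", "Ä", "你", "😀"]:
    assert utf8_bytes(ord(ch)) == ch.encode("utf-8")
    print(ch, utf8_bytes(ord(ch)).hex(" "))
# “Ä” (U+00C4) prints c3 84, matching the example above.
```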

Common pitfalls

  • **Mojibake:** Garbled text caused by interpreting bytes in the wrong encoding (e.g., the UTF-8 bytes of “Zürich” read as Latin-1 display as “ZÃ¼rich”).
  • **Mixed encodings:** Some old files contain both UTF-8 and legacy code page bytes.
  • **Hidden control characters:** Different newline conventions (CR, LF, CRLF) can cause cross-platform issues.
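
The mojibake round trip can be demonstrated in a few lines of Python (here decoding UTF-8 bytes with the Latin-1 codec):

```python
# UTF-8 bytes decoded with the wrong (Latin-1) codec produce mojibake.
garbled = "Zürich".encode("utf-8").decode("latin-1")
print(garbled)                                      # ZÃ¼rich

# The damage is reversible here because Latin-1 maps every byte 1:1
# to the code points U+0000–U+00FF.
print(garbled.encode("latin-1").decode("utf-8"))    # Zürich
```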

Why UTF-8 won

  • Backward compatible with ASCII
  • Works across languages and systems
  • Standard for HTML, XML, JSON, Linux, and the web
  • Supported natively in most modern programming languages

Summary

  • Unicode assigns unique code points to all characters.
  • UTF-8, UTF-16, and UTF-32 are ways to store those code points.
  • UTF-8 is compact, universal, and dominant today.
  • Encoding mismatches still occur, but standards like UTF-8 largely solved global text interoperability.