Legacy Encodings & Code Pages (EBCDIC, ISO-8859, CP1252)
ASCII was revolutionary but limited to English. As computing spread internationally, new encodings extended ASCII to support more characters. This page explains these historical encodings, how they evolved, and why Unicode replaced them.
The problem with ASCII
ASCII uses 7 bits per character, giving only 128 symbols. This excludes letters with accents (é, ä, ñ), non-Latin alphabets (Cyrillic, Greek, Hebrew), and scripts like Arabic or Thai.
When 8-bit bytes became standard, the extra bit was used to add 128 new symbols (128–255). Unfortunately, there was no single global standard — every region and vendor created its own **code page**.
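The arithmetic behind the 7-bit limit can be checked in a few lines. This is only an illustrative sketch; the helper name `fits_in_ascii` is made up for this example and is not part of any standard.

```python
# 7 bits give 2**7 = 128 code points, numbered 0 through 127.
ASCII_SIZE = 2 ** 7


def fits_in_ascii(text: str) -> bool:
    """Return True if every character has a code point below 128."""
    return all(ord(ch) < ASCII_SIZE for ch in text)


print(fits_in_ascii("plain text"))   # True
print(fits_in_ascii("caf\u00e9"))    # False: U+00E9 is outside the 7-bit range

# An 8-bit byte holds 2**8 = 256 values, leaving 128 extra slots (128-255).
EXTRA_SLOTS = 2 ** 8 - ASCII_SIZE
print(EXTRA_SLOTS)                   # 128
```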
EBCDIC
**EBCDIC (Extended Binary Coded Decimal Interchange Code)** was developed by IBM in 1963. It was used primarily on IBM mainframes and with punched cards.
- 8-bit encoding, unrelated to ASCII
- Designed to minimize long runs of punched holes
- Not compatible with ASCII — letter codes differ completely
- Still occasionally found in legacy financial and government systems
Example (partial comparison):
| Character | ASCII (decimal) | EBCDIC (decimal) |
|---|---|---|
| A | 65 | 193 |
| B | 66 | 194 |
| a | 97 | 129 |
| 0 | 48 | 240 |
EBCDIC’s incompatibility made text exchange difficult between IBM systems and others.
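The values in the table can be reproduced with Python's built-in codecs. The sketch below assumes `cp037`, one of several EBCDIC code pages that ship with Python; other EBCDIC variants agree on these letters and digits but differ on some punctuation. The helper `code_values` is a name invented for this example.

```python
# cp037 is one EBCDIC variant available in Python's standard codecs (an
# assumption of this sketch; the text above does not name a specific variant).
EBCDIC_CODEC = "cp037"


def code_values(ch: str) -> tuple[int, int]:
    """Return (ASCII value, EBCDIC value) for a single ASCII character."""
    return ch.encode("ascii")[0], ch.encode(EBCDIC_CODEC)[0]


for ch in "ABa0":
    ascii_val, ebcdic_val = code_values(ch)
    print(f"{ch!r}: ASCII {ascii_val:3d}  EBCDIC {ebcdic_val:3d}")
```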
ISO-8859 series (Latin-1 and friends)
To address regional diversity, ISO standardized several 8-bit extensions to ASCII. Each supported specific language groups.
- Positions 0–127 identical to ASCII
- Positions 128–255 defined for regional characters
Common variants:
| Code Page | Alias | Region / Languages |
|---|---|---|
| ISO-8859-1 | Latin-1 | Western Europe (English, French, German, Spanish) |
| ISO-8859-2 | Latin-2 | Central/Eastern Europe (Czech, Polish, Hungarian) |
| ISO-8859-3 | Latin-3 | Southern Europe (Turkish, Maltese, Esperanto) |
| ISO-8859-4 | Latin-4 | Northern Europe (Estonian, Latvian, Lithuanian) |
| ISO-8859-5 | Cyrillic | Russian, Bulgarian, Serbian |
| ISO-8859-7 | Greek | Greek alphabet |
| ISO-8859-8 | Hebrew | Hebrew script |
| ISO-8859-11 | Thai | Thai alphabet |
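To make the shared lower half and the differing upper half concrete, the sketch below decodes the same two bytes under four parts of the series. The byte values and the choice of parts are arbitrary picks for illustration; the codec names are the ones Python's standard library uses.

```python
# 0x41 lies in the ASCII range (0-127); 0xF1 lies in the upper half (128-255).
RAW = bytes([0x41, 0xF1])

CODECS = ["iso8859_1", "iso8859_2", "iso8859_5", "iso8859_7"]

decoded = {name: RAW.decode(name) for name in CODECS}
for name, text in decoded.items():
    # The first character is "A" in every part; the second depends on the part.
    print(f"{name}: {text}")
```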
Windows Code Pages (CP125x)
Microsoft extended the ISO series with its own encodings called **Windows Code Pages**. The most common was **CP1252 (Windows Latin-1)**, which added extra characters such as typographic quotes and the Euro symbol.
Differences:
- ISO-8859-1 leaves positions 128–159 unused (reserved for control codes)
- CP1252 uses these for printable symbols such as `€`, `–`, `‘`, `’`, `“`, and `”`.
CP1252 example (hex → character):
```
0x80 €
0x91 ‘
0x92 ’
0x93 “
0x94 ”
0x95 •
0x96 –
0x97 —
```
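The difference between the two encodings in the 128–159 range can be observed directly. This is a small sketch using Python's `cp1252` and `iso8859_1` codecs; the variable names are illustrative only.

```python
# The eight byte values listed in the example above.
CP1252_EXTRAS = bytes([0x80, 0x91, 0x92, 0x93, 0x94, 0x95, 0x96, 0x97])

as_cp1252 = CP1252_EXTRAS.decode("cp1252")     # printable typographic symbols
as_latin1 = CP1252_EXTRAS.decode("iso8859_1")  # code points in the control range

print(as_cp1252)
print([f"U+{ord(c):04X}" for c in as_latin1])
print(any(c.isprintable() for c in as_latin1))  # False: none are printable
```

This is why text with curly quotes written as CP1252 but labeled ISO-8859-1 often shows up with missing or invisible characters.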
Code page chaos
Because every region used different mappings for the same byte values:
- A document encoded in CP1252 might display gibberish if opened as ISO-8859-2.
- Multi-language text (e.g., French + Russian) could not be stored safely in one file, because no single code page contained both sets of letters.
- File exchange required explicit knowledge of the code page used.
This lack of interoperability led to **encoding mismatches** and the need for a universal standard.
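A mismatch of this kind is easy to reproduce. The following sketch writes text as CP1252, reads it back with the wrong code page, and then tries to mix accented Latin letters with Cyrillic in one 8-bit encoding. The sample strings are arbitrary examples chosen for this illustration.

```python
original = "Se\u00f1or"                  # contains U+00F1, byte 0xF1 in CP1252
raw = original.encode("cp1252")          # bytes as the sender wrote them

garbled = raw.decode("iso8859_2")        # receiver assumes the wrong code page
print(original, "->", garbled)

# The bytes are intact; only the interpretation was wrong.
recovered = raw.decode("cp1252")
print(recovered == original)             # True

# One 8-bit code page cannot hold accented Latin letters and Cyrillic together.
try:
    "na\u00efve \u041f\u0440\u0438\u0432\u0435\u0442".encode("cp1252")
    mixed_ok = True
except UnicodeEncodeError:
    mixed_ok = False
print(mixed_ok)                          # False
```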
Transition to Unicode
Unicode unified all scripts under one standard:
- Every character has a unique code point (e.g. U+00E9 for “é”).
- Unicode includes ASCII (0–127) and can represent characters from all code pages simultaneously.
- UTF-8 became the dominant encoding for the web, files, and network protocols.
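The relationship between code points and UTF-8 bytes can be sketched as follows. The sample characters are arbitrary; `unicodedata` is part of Python's standard library.

```python
import unicodedata

ch = "\u00e9"
print(f"U+{ord(ch):04X}", unicodedata.name(ch))  # U+00E9 LATIN SMALL LETTER E WITH ACUTE

utf8 = ch.encode("utf-8")
print(utf8.hex())                                # c3a9: two bytes in UTF-8

# ASCII text is byte-for-byte identical in UTF-8.
ascii_same = "ASCII".encode("utf-8") == "ASCII".encode("ascii")
print(ascii_same)                                # True

# Latin, Cyrillic, Greek, and the euro sign coexist in one string.
mixed = "\u00e9 \u0436 \u03c9 \u20ac"
round_trip = mixed.encode("utf-8").decode("utf-8")
print(round_trip == mixed)                       # True
```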
Today:
- Legacy code pages persist in old software and file archives.
- Modern systems default to UTF-8 or UTF-16.
- Tools like `iconv` or `recode` can convert between encodings.
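The same kind of conversion can be done in a few lines of Python. This is a minimal sketch, assuming the source encoding is already known; the function name `convert_to_utf8` and the file names are hypothetical, and the demo writes only to a temporary directory.

```python
import tempfile
from pathlib import Path


def convert_to_utf8(src: Path, dst: Path, source_encoding: str) -> None:
    """Re-encode a text file from a known legacy code page to UTF-8."""
    # decode() raises UnicodeDecodeError if the bytes are invalid for the codec.
    text = src.read_bytes().decode(source_encoding)
    dst.write_bytes(text.encode("utf-8"))


with tempfile.TemporaryDirectory() as tmp:
    legacy = Path(tmp) / "legacy.txt"
    modern = Path(tmp) / "modern.txt"
    sample = "\u201ccaf\u00e9\u201d costs 5 \u20ac"
    legacy.write_bytes(sample.encode("cp1252"))   # simulate an old CP1252 file

    convert_to_utf8(legacy, modern, "cp1252")
    result = modern.read_text(encoding="utf-8")
    print(result == sample)                       # True
```

On the command line, `iconv -f CP1252 -t UTF-8` performs the equivalent step. In both cases the source encoding has to be supplied by the user, since the bytes alone do not reliably identify their code page.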
Summary
- EBCDIC was IBM’s early alternative to ASCII.
- ISO-8859 extended ASCII for regional languages.
- Windows CP125x added additional characters for specific markets.
- Code pages were incompatible with each other.
- Unicode and UTF-8 replaced them with a universal, consistent model.