ISO 8859
A family of 8-bit single-byte encodings for different language groups; ISO 8859-1 (Latin-1) is the basis for the first 256 code points of Unicode.
What is ISO 8859?
ISO 8859 is a family of 15 8-bit single-byte character encoding standards published by the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC). Each standard in the family covers a specific language group or region, extending the 128-character ASCII base into the upper 128 positions (0x80–0xFF) with characters needed for that region's scripts and languages.
The ISO 8859 family was the dominant encoding infrastructure for non-ASCII text on the internet and personal computers from the late 1980s through the 1990s. Even today, understanding ISO 8859 is essential for working with legacy data, email systems, and pre-Unicode content.
The Family Members
| Standard | Name | Languages Covered |
|---|---|---|
| ISO 8859-1 | Latin-1 | Western European (French, German, Spanish, Portuguese, Italian) |
| ISO 8859-2 | Latin-2 | Central European (Czech, Polish, Hungarian, Croatian) |
| ISO 8859-3 | Latin-3 | Southern European (Turkish, Maltese, Esperanto) |
| ISO 8859-4 | Latin-4 | Northern European (Estonian, Latvian, Lithuanian) |
| ISO 8859-5 | Cyrillic | Russian, Bulgarian, Serbian, Macedonian |
| ISO 8859-6 | Arabic | Arabic |
| ISO 8859-7 | Greek | Modern Greek |
| ISO 8859-8 | Hebrew | Hebrew |
| ISO 8859-9 | Latin-5 | Turkish (Latin-1 variant) |
| ISO 8859-10 | Latin-6 | Nordic languages |
| ISO 8859-11 | Thai | Thai (essentially TIS 620) |
| ISO 8859-13 | Latin-7 | Baltic languages |
| ISO 8859-14 | Latin-8 | Celtic languages (Irish, Welsh) |
| ISO 8859-15 | Latin-9 | Western European + Euro sign |
| ISO 8859-16 | Latin-10 | South-Eastern European |
Note: ISO 8859-12 was proposed for Devanagari but never finalized.
How ISO 8859 Works
Every member of the family shares the same structure:
- 0x00–0x1F: C0 control characters (identical to ASCII)
- 0x20–0x7E: Printable ASCII characters (identical across all members)
- 0x7F: DEL control character
- 0x80–0x9F: C1 control characters (defined but rarely used in practice)
- 0xA0–0xFF: Region-specific printable characters
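The layout above can be expressed as a small lookup. This is a minimal sketch only; the function name `classify_byte` and the region labels are hypothetical, chosen for illustration.

```python
def classify_byte(b: int) -> str:
    """Return the ISO 8859 structural region that a byte value falls into."""
    if not 0 <= b <= 0xFF:
        raise ValueError("ISO 8859 is a single-byte encoding: expected 0..255")
    if b <= 0x1F:
        return "C0 control"
    if b <= 0x7E:
        return "printable ASCII"
    if b == 0x7F:
        return "DEL"
    if b <= 0x9F:
        return "C1 control"
    return "region-specific"


for value in (0x0A, 0x41, 0x7F, 0x85, 0xE9):
    print(f"0x{value:02X}: {classify_byte(value)}")
```

Only bytes in the last region need to be interpreted differently depending on which part of the standard is in use.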
The region-specific characters in 0xA0–0xFF are what differ between standards. For example, byte 0xE9 means:
- ISO 8859-1: é (Latin small letter e with acute)
- ISO 8859-5: щ (Cyrillic small letter shcha)
- ISO 8859-7: ι (Greek small letter iota)
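The same comparison can be reproduced with Python's built-in codecs. This is a short sketch; the variable names are illustrative, while the codec names are those shipped with the standard library.

```python
raw = b"\xe9"
codecs_to_try = ("iso-8859-1", "iso-8859-5", "iso-8859-7")

# One byte, three interpretations, depending on the declared part.
for codec in codecs_to_try:
    char = raw.decode(codec)
    print(f"{codec}: {char} (U+{ord(char):04X})")

# The lower half is shared: ASCII bytes decode identically in every part.
ascii_bytes = b"ISO 8859"
decoded = {ascii_bytes.decode(codec) for codec in codecs_to_try}
print(decoded)
```

Because the byte itself carries no information about which part was used, the encoding has to be declared or known out of band.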
ISO 8859-1 and Its Importance
ISO 8859-1 (Latin-1) is the most widely used family member. It covers the characters needed for Western European languages and was adopted as:
- The default encoding of HTTP/1.0 (text/html; charset=ISO-8859-1)
- The lower 256 code points of Unicode (U+0000–U+00FF map exactly to Latin-1)
- The basis for Windows-1252
This Unicode correspondence means that converting a Latin-1 string to Unicode is trivial: each byte value directly becomes the Unicode code point.
```python
# ISO 8859-1 to Unicode: byte values are identical to code points
b'\xe9'.decode('iso-8859-1')    # 'é' (U+00E9)
b'\xe9'.decode('latin-1')       # same result; latin-1 is an alias
try:
    b'\xe9'.decode('utf-8')     # a lone 0xE9 is not valid UTF-8
except UnicodeDecodeError as exc:
    print(exc)

# The difference between Latin-1 and Windows-1252
b'\x80'.decode('iso-8859-1')    # '\x80', a C1 control character
b'\x80'.decode('windows-1252')  # '€', the Euro sign
```
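The byte-to-code-point identity can also be checked exhaustively rather than for a single byte. This is a quick sketch using Python's built-in `latin-1` codec; the variable names are illustrative.

```python
all_bytes = bytes(range(256))
text = all_bytes.decode("latin-1")

# Every byte maps to the code point with the same numeric value.
assert [ord(ch) for ch in text] == list(range(256))

# The mapping is lossless, so the round trip restores the original bytes.
assert text.encode("latin-1") == all_bytes
```

One consequence is that decoding as Latin-1 never fails: every possible byte sequence is accepted, whether or not the data was really Latin-1.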
ISO 8859-15: Latin-9
ISO 8859-15 is a revision of Latin-1 that replaced 8 rarely used characters with more useful ones, most notably adding the Euro sign (€) at 0xA4. Latin-1 was defined in 1987, before the Euro was introduced in 1999. Latin-9 also added characters for French (Œ, œ, Ÿ) and Finnish (Š, š, Ž, ž).
Despite these additions, ISO 8859-15 saw limited adoption because most systems had already standardized on Latin-1 or migrated to UTF-8.
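The positions where Latin-9 departs from Latin-1 can be listed by decoding each upper-range byte with both codecs. This is a minimal sketch that relies on Python's built-in `iso-8859-1` and `iso-8859-15` codecs; the variable names are illustrative.

```python
differences = {}
for value in range(0xA0, 0x100):
    latin1 = bytes([value]).decode("iso-8859-1")
    latin9 = bytes([value]).decode("iso-8859-15")
    if latin1 != latin9:
        differences[value] = (latin1, latin9)

# Print each changed position as: byte, Latin-1 character, Latin-9 character.
for value, (old, new) in differences.items():
    print(f"0x{value:02X}: {old} -> {new}")
```

Text that avoids these positions decodes identically under both encodings, which is one reason a mislabeled Latin-9 document often goes unnoticed.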
Quick Facts
| Property | Value |
|---|---|
| Standards body | ISO/IEC JTC 1 |
| Number of parts | 15 (no ISO 8859-12) |
| Bytes per character | 1 (single-byte) |
| Characters per standard | 256 (191–192 printable) |
| ASCII compatible | Yes (0x00–0x7F identical) |
| Unicode range of Latin-1 | U+0000–U+00FF exactly |
| Status | Legacy — superseded by Unicode/UTF-8 |
Common Pitfalls
Confusing Latin-1 with Windows-1252. Windows-1252 adds printable characters in 0x80–0x9F (the C1 control range of Latin-1), including the Euro sign, smart quotes, and em-dashes. Many web browsers historically treated ISO-8859-1 declarations as windows-1252, creating a widespread discrepancy between declared and actual encoding.
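A lenient decoder in the spirit of that browser behavior might look like the sketch below. The function name `decode_declared_latin1` is hypothetical. Note that Python's `cp1252` codec leaves five bytes undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D); passing those through as C1 control characters is an assumption made here for illustration.

```python
def decode_declared_latin1(data: bytes) -> str:
    """Decode bytes labeled ISO-8859-1 as Windows-1252, byte by byte."""
    chars = []
    for value in data:
        try:
            chars.append(bytes([value]).decode("cp1252"))
        except UnicodeDecodeError:
            # Undefined in Python's cp1252 table: keep the Latin-1 reading.
            chars.append(chr(value))
    return "".join(chars)


sample = b"\x93quoted\x94 costs 5\x80"
print(decode_declared_latin1(sample))  # smart quotes and a Euro sign
print(repr(sample.decode("iso-8859-1")))  # C1 control characters instead
```

A strict Latin-1 decode of the same bytes succeeds without error, so the mismatch only shows up as invisible control characters in the output.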
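Assuming all European text is Latin-1. Polish (ISO 8859-2), Turkish (ISO 8859-9), and Greek (ISO 8859-7) require different standards. A Polish document that is encoded in ISO 8859-2 but declares charset=iso-8859-1 will display ą, ę, and ł as ±, ê, and ³. Letters that occupy the same position in both standards, such as ó at 0xF3, still display correctly, which can hide the problem in short samples.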
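The mismatch is easy to reproduce, and to undo when the real encoding is known. This is a short sketch using Python's built-in codecs; the sample string and variable names are illustrative.

```python
original = "ąęłó"

# Bytes as written by a system using ISO 8859-2.
raw = original.encode("iso-8859-2")

# The same bytes read by a consumer that trusts a Latin-1 declaration.
misread = raw.decode("iso-8859-1")
print(misread)

# Latin-1 decoding is lossless, so the damage can be reversed.
repaired = misread.encode("iso-8859-1").decode("iso-8859-2")
print(repaired)
```

The repair works only because no bytes were lost along the way; once mis-decoded text has been re-encoded with replacement characters, the original cannot be recovered.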
Multi-byte East Asian languages. ISO 8859 standards are single-byte encodings and cannot represent Chinese, Japanese, or Korean characters, which require multi-byte encodings like Shift-JIS, GB2312, or Big5.
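The limitation shows up as an encoding error as soon as such text is written out. This is a minimal sketch using Python's built-in `iso-8859-1` and `shift_jis` codecs; the sample string and variable names are illustrative.

```python
text = "日本語"

# No ISO 8859 part has positions for these characters.
try:
    text.encode("iso-8859-1")
    encodable = True
except UnicodeEncodeError:
    encodable = False
print(encodable)

# A multi-byte encoding spends two bytes on each of these characters.
sjis = text.encode("shift_jis")
print(len(text), len(sjis))
```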