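What is character encoding?

A system that maps characters to byte sequences for digital storage and transmission. Every text file has an encoding; what matters is whether that encoding is declared correctly.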

What is Windows-1252?

Microsoft's superset of ISO 8859-1, which adds curly quotes, dashes, and the Euro sign in the 0x80–0x9F range. It is the most common legacy Latin encoding.

Encoding

ISO 8859

A family of 8-bit single-byte encodings for different language groups. ISO 8859-1 (Latin-1) is the basis for the first 256 Unicode code points.

2021-03-10 · Updated 2024-09-12

What is ISO 8859?

ISO 8859 is a family of fifteen 8-bit single-byte character encoding standards published by the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC). Each standard in the family covers a specific language group or region, extending the 128-character ASCII base into the upper 128 positions (0x80–0xFF) with characters needed for that region's scripts and languages.

The ISO 8859 family was the dominant encoding infrastructure for non-ASCII text on the internet and personal computers from the late 1980s through the 1990s. Even today, understanding ISO 8859 is essential for working with legacy data, email systems, and pre-Unicode content.

The Family Members

Standard	Name	Languages Covered
ISO 8859-1	Latin-1	Western European (French, German, Spanish, Portuguese, Italian)
ISO 8859-2	Latin-2	Central European (Czech, Polish, Hungarian, Croatian)
ISO 8859-3	Latin-3	Southern European (Turkish, Maltese, Esperanto)
ISO 8859-4	Latin-4	Northern European (Estonian, Latvian, Lithuanian)
ISO 8859-5	Cyrillic	Russian, Bulgarian, Serbian, Macedonian
ISO 8859-6	Arabic	Arabic
ISO 8859-7	Greek	Modern Greek
ISO 8859-8	Hebrew	Hebrew
ISO 8859-9	Latin-5	Turkish (Latin-1 variant)
ISO 8859-10	Latin-6	Nordic languages
ISO 8859-11	Thai	Thai (essentially TIS 620)
ISO 8859-13	Latin-7	Baltic languages
ISO 8859-14	Latin-8	Celtic languages (Irish, Welsh)
ISO 8859-15	Latin-9	Western European + Euro sign
ISO 8859-16	Latin-10	South-Eastern European

Note: ISO 8859-12 was proposed for Devanagari but never finalized.
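The gap at part 12 is visible in most codec registries. As a small sketch, assuming CPython's standard library codec names of the form iso8859_N (the helper name available_parts is made up for this example), the following probes parts 1 through 16 and reports which ones exist:

```python
import codecs


def available_parts() -> list[int]:
    """Return the ISO 8859 part numbers that this Python build can decode."""
    found = []
    for n in range(1, 17):
        try:
            # codecs.lookup raises LookupError for unknown encoding names.
            codecs.lookup(f"iso8859_{n}")
        except LookupError:
            continue
        found.append(n)
    return found


parts = available_parts()
print(parts)
```

On a standard CPython installation this should list fifteen parts, with 12 missing.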

How ISO 8859 Works

Every member of the family shares the same structure:

0x00–0x1F: C0 control characters (identical to ASCII)
0x20–0x7E: Printable ASCII characters (identical across all members)
0x7F: DEL control character
0x80–0x9F: C1 control characters (defined but rarely used in practice)
0xA0–0xFF: Region-specific printable characters

The region-specific characters in 0xA0–0xFF are what differ between standards. For example, byte 0xE9 means:

ISO 8859-1: é (Latin small letter e with acute)
ISO 8859-5: щ (Cyrillic small letter shcha)
ISO 8859-7: ι (Greek small letter iota)
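The shared lower half and the diverging upper half can be checked directly. This is a minimal sketch that assumes Python's built-in iso8859_N codecs faithfully follow the published mapping tables; the variable names are illustrative only:

```python
# Part numbers with a codec in CPython's standard library (there is no part 12).
PARTS = [n for n in range(1, 17) if n != 12]

ascii_half = bytes(range(0x80))  # 0x00-0x7F
upper_byte = b"\xe9"             # one byte from the region-specific range

# Decode the ASCII half with every member and collect the distinct results.
ascii_texts = {ascii_half.decode(f"iso8859_{n}") for n in PARTS}

# Decode the same upper-half byte with three different members.
meanings = {n: upper_byte.decode(f"iso8859_{n}") for n in (1, 5, 7)}

print(len(ascii_texts))  # one distinct result means the lower half is shared
print(meanings)
```

The set of decoded ASCII halves collapses to a single string, while byte 0xE9 yields a different character under each of the three parts.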

ISO 8859-1 and Its Importance

ISO 8859-1 (Latin-1) is the most widely used family member. It covers the characters needed for Western European languages and was adopted as:

The default encoding of HTTP/1.0 (text/html; charset=ISO-8859-1)
The lower 256 code points of Unicode (U+0000–U+00FF map exactly to Latin-1)
The basis for Windows-1252

This Unicode correspondence means that converting a Latin-1 string to Unicode is trivial: each byte value directly becomes the Unicode code point.

# ISO 8859-1 to Unicode: byte values are identical to code points
b'\xe9'.decode('iso-8859-1')    # 'é' — U+00E9
b'\xe9'.decode('latin-1')       # same (latin-1 is an alias)
b'\xe9'.decode('utf-8')         # raises UnicodeDecodeError!

# The difference between Latin-1 and Windows-1252
b'\x80'.decode('iso-8859-1')    # '\x80' — a C1 control character
b'\x80'.decode('windows-1252')  # '€' — Euro sign
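The identity mapping holds for all 256 byte values, not just 0xE9. A short sketch, again assuming Python's latin-1 codec, that checks every byte and confirms the round trip is lossless:

```python
all_bytes = bytes(range(256))

# Decoding Latin-1 never fails: every byte value is assigned.
text = all_bytes.decode("latin-1")

# Each decoded character's code point equals the original byte value.
code_points = [ord(ch) for ch in text]

# Encoding back gives the original bytes, so the round trip is lossless.
round_trip = text.encode("latin-1")

print(code_points == list(range(256)), round_trip == all_bytes)
```

Because the round trip never fails, Latin-1 is sometimes used as a pass-through codec for bytes of unknown encoding, though the resulting text is only meaningful if the data really was Latin-1.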

ISO 8859-15: Latin-9

ISO 8859-15 is a revision of Latin-1 that replaces eight rarely used characters with more useful ones, most notably the Euro sign (€) at 0xA4. Latin-1 was defined in 1987, long before the Euro was introduced in 1999. Latin-9 also adds letters needed for French (Œ, œ, Ÿ) and for Finnish and Estonian (Š, š, Ž, ž).

Although it fixed these gaps, ISO 8859-15 saw limited adoption: most systems had already standardized on Latin-1 or migrated to UTF-8.
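The positions where the two encodings disagree can be listed by decoding every byte both ways. This is a sketch that relies on Python's latin-1 and iso8859_15 codecs; the dictionary name changed is illustrative:

```python
# Map each differing byte to its (Latin-1, Latin-9) characters.
changed = {}
for b in range(0x100):
    raw = bytes([b])
    old = raw.decode("latin-1")
    new = raw.decode("iso8859_15")
    if old != new:
        changed[b] = (old, new)

for b, (old, new) in changed.items():
    print(f"0x{b:02X}: {old!r} -> {new!r}")
```

The loop reports eight positions, starting with 0xA4, where the generic currency sign ¤ gives way to €.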

Quick Facts

Property	Value
Standards body	ISO/IEC JTC 1
Number of parts	15 (no ISO 8859-12)
Bytes per character	1 (single-byte)
Code positions per standard	256 (at most 191 printable: 95 from ASCII plus up to 96 in 0xA0–0xFF)
ASCII compatible	Yes (0x00–0x7F identical)
Unicode range of Latin-1	U+0000–U+00FF exactly
Status	Legacy — superseded by Unicode/UTF-8

Common Pitfalls

Confusing Latin-1 with Windows-1252. Windows-1252 places printable characters in 0x80–0x9F (the C1 control range of Latin-1), including the Euro sign, curly quotes, and dashes. Web browsers treat an ISO-8859-1 declaration as windows-1252, a behavior now codified in the WHATWG Encoding Standard, so content labelled ISO-8859-1 frequently contains Windows-1252 bytes.
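To see the discrepancy in practice, the sketch below decodes one invented byte string both ways. It assumes Python's cp1252 codec, which is the standard library name for Windows-1252:

```python
# Sample bytes using 0x80-0x9F the way Windows-1252 text typically does.
data = b"\x93caf\xe9\x94 \x96 5\x80"

as_latin1 = data.decode("latin-1")  # 0x80-0x9F become invisible C1 controls
as_cp1252 = data.decode("cp1252")   # 0x80-0x9F become quotes, dash, Euro sign

print(repr(as_latin1))
print(as_cp1252)
```

Note that Python's cp1252 codec raises UnicodeDecodeError for the five bytes Windows-1252 leaves unassigned (0x81, 0x8D, 0x8F, 0x90, 0x9D), so real-world code may need an errors= policy.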

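Assuming all European text is Latin-1. Polish (ISO 8859-2), Turkish (ISO 8859-9), and Greek (ISO 8859-7) require different standards. A Polish ISO 8859-2 document claiming charset=iso-8859-1 will display ą, ę, and ł as ±, ê, and ³, while letters that share a position in both standards, such as ó, survive and make the corruption easy to miss.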
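A mislabelled Polish string can be simulated in a few lines. This sketch assumes Python's iso8859_2 and latin-1 codecs and uses an invented four-letter sample; it also shows that the damage is reversible as long as the garbled text has not been altered further:

```python
polish = "ąęłó"

raw = polish.encode("iso8859_2")  # the bytes actually stored: B1 EA B3 F3
garbled = raw.decode("latin-1")   # what a reader sees under the wrong label

# Undo the mistake: recover the original bytes, then decode them correctly.
repaired = garbled.encode("latin-1").decode("iso8859_2")

print(garbled)
print(repaired)
```

Three of the four letters turn into unrelated symbols, while ó comes through unchanged because it occupies 0xF3 in both standards.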

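Expecting ISO 8859 to cover East Asian languages. ISO 8859 standards are single-byte encodings with at most 96 region-specific characters, so they cannot represent Chinese, Japanese, or Korean text, which needs thousands of characters and therefore multi-byte encodings such as Shift-JIS, GB2312, Big5, or EUC-KR.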
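The limitation shows up as an encoding error. In this sketch, can_encode is a hypothetical helper written for the example, and the codec names are those of Python's standard library:

```python
def can_encode(text: str, codec: str) -> bool:
    """Return True if every character in text is representable in codec."""
    try:
        text.encode(codec)
    except UnicodeEncodeError:
        return False
    return True


japanese = "日本語"

print(can_encode(japanese, "latin-1"))    # no ISO 8859 part has these characters
print(can_encode(japanese, "shift_jis"))  # a multi-byte legacy encoding does
print(len(japanese.encode("shift_jis")))  # two bytes per character here
```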