ISO 8859

A family of 8-bit single-byte encodings for different language groups. ISO 8859-1 (Latin-1) served as the basis for the first 256 code points of Unicode.

What is ISO 8859?

ISO 8859 (formally ISO/IEC 8859) is a family of fifteen 8-bit, single-byte character encoding standards published jointly by the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC). Each part covers a specific language group or region: it keeps the 128-character ASCII base unchanged and uses the upper 128 positions (0x80–0xFF) for the characters needed by that region's scripts and languages.

The ISO 8859 family was the dominant way to encode non-ASCII text on the internet and on personal computers from the late 1980s through the 1990s. Understanding it is still essential when working with legacy data, email systems, and pre-Unicode content.

The Family Members

Standard      Name      Languages covered
ISO 8859-1    Latin-1   Western European (French, German, Spanish, Portuguese, Italian)
ISO 8859-2    Latin-2   Central European (Czech, Polish, Hungarian, Croatian)
ISO 8859-3    Latin-3   Southern European (Turkish, Maltese, Esperanto)
ISO 8859-4    Latin-4   Northern European (Estonian, Latvian, Lithuanian)
ISO 8859-5    Cyrillic  Russian, Bulgarian, Serbian, Macedonian
ISO 8859-6    Arabic    Arabic
ISO 8859-7    Greek     Modern Greek
ISO 8859-8    Hebrew    Hebrew
ISO 8859-9    Latin-5   Turkish (Latin-1 variant)
ISO 8859-10   Latin-6   Nordic languages
ISO 8859-11   Thai      Thai (essentially TIS 620)
ISO 8859-13   Latin-7   Baltic languages
ISO 8859-14   Latin-8   Celtic languages (Irish, Welsh)
ISO 8859-15   Latin-9   Western European + Euro sign
ISO 8859-16   Latin-10  South-Eastern European

Note: ISO 8859-12 was proposed for Devanagari but never finalized.
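
One way to see the gap at part 12 is to probe a codec registry for each part number. The sketch below assumes CPython's bundled codecs, which are registered under names of the form iso8859_N; the helper name available_parts is made up for this illustration, and other runtimes may ship a different set of codecs.

```python
import codecs


def available_parts(max_part: int = 16) -> list[int]:
    """Return the ISO 8859 part numbers that have a registered codec."""
    found = []
    for part in range(1, max_part + 1):
        try:
            codecs.lookup(f"iso8859_{part}")
        except LookupError:
            continue  # no codec under this name (expected for part 12)
        found.append(part)
    return found


parts = available_parts()
print(parts)
print("part 12 present:", 12 in parts)
```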

How ISO 8859 Works

Every member of the family shares the same structure:

  • 0x00–0x1F: C0 control characters (identical to ASCII)
  • 0x20–0x7E: Printable ASCII characters (identical across all members)
  • 0x7F: DEL control character
  • 0x80–0x9F: C1 control characters (defined but rarely used in practice)
  • 0xA0–0xFF: Region-specific printable characters
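
The layout above can be written as a small lookup. This is only a sketch of the ranges listed here; the function name classify_byte and the label strings are invented for illustration.

```python
def classify_byte(b: int) -> str:
    """Name the ISO 8859 range that a byte value falls into."""
    if not 0 <= b <= 0xFF:
        raise ValueError("a single byte must be in 0x00..0xFF")
    if b <= 0x1F:
        return "C0 control"
    if b <= 0x7E:
        return "printable ASCII"
    if b == 0x7F:
        return "DEL"
    if b <= 0x9F:
        return "C1 control"
    return "region-specific"


for sample in (0x0A, 0x41, 0x7F, 0x85, 0xE9):
    print(f"0x{sample:02X}: {classify_byte(sample)}")
```

Only bytes in the last range need to know which part of the family is in use; everything below 0xA0 means the same thing in every part.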

The region-specific characters in 0xA0–0xFF are what differ between standards. For example, byte 0xE9 means:

  • ISO 8859-1: é (Latin small letter e with acute)
  • ISO 8859-5: щ (Cyrillic small letter shcha)
  • ISO 8859-7: ι (Greek small letter iota)
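
The same comparison can be reproduced by decoding the single byte under each codec. This sketch assumes Python's standard codec names (iso8859_1, iso8859_5, iso8859_7) and uses unicodedata only to print the official character names.

```python
import unicodedata

BYTE = b"\xe9"
CODECS = ("iso8859_1", "iso8859_5", "iso8859_7")

# Decode the same byte under each member of the family.
decoded = {codec: BYTE.decode(codec) for codec in CODECS}

for codec, ch in decoded.items():
    print(f"{codec}: {ch} U+{ord(ch):04X} {unicodedata.name(ch)}")
```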

ISO 8859-1 and Its Importance

ISO 8859-1 (Latin-1) is the most widely used family member. It covers the characters needed for Western European languages and was adopted as:

  • The default encoding of HTTP/1.0 (text/html; charset=ISO-8859-1)
  • The lower 256 code points of Unicode (U+0000–U+00FF map exactly to Latin-1)
  • The basis for Windows-1252

This Unicode correspondence means that converting a Latin-1 string to Unicode is trivial: each byte value directly becomes the Unicode code point.

# ISO 8859-1 to Unicode: byte values are identical to code points
print(b'\xe9'.decode('iso-8859-1'))    # 'é', U+00E9
print(b'\xe9'.decode('latin-1'))       # same ('latin-1' is an alias)
try:
    b'\xe9'.decode('utf-8')            # a lone 0xE9 is not valid UTF-8
except UnicodeDecodeError as exc:
    print('utf-8 failed:', exc.reason)

# The difference between Latin-1 and Windows-1252
print(repr(b'\x80'.decode('iso-8859-1')))  # '\x80', a C1 control character
print(b'\x80'.decode('windows-1252'))      # '€', Euro sign

ISO 8859-15: Latin-9

ISO 8859-15 is a revision of Latin-1 that replaced 8 rarely used characters with more useful ones, most notably putting the Euro sign (€) at 0xA4 in place of the generic currency sign (¤). Latin-1 was defined in 1987, before the Euro was introduced in 1999. The other replacements added letters needed for French (Œ, œ, Ÿ) and for Finnish and Estonian (Š, š, Ž, ž).
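
The replaced positions can be listed by decoding every byte under both codecs and keeping the ones that disagree. This is a sketch that relies on Python's iso8859_1 and iso8859_15 codecs; the helper name differing_bytes is illustrative.

```python
def differing_bytes(codec_a: str, codec_b: str) -> dict[int, tuple[str, str]]:
    """Map each byte value to its two decodings where the codecs disagree."""
    diffs = {}
    for b in range(256):
        raw = bytes([b])
        a, c = raw.decode(codec_a), raw.decode(codec_b)
        if a != c:
            diffs[b] = (a, c)
    return diffs


diffs = differing_bytes("iso8859_1", "iso8859_15")
for b, (old, new) in diffs.items():
    print(f"0x{b:02X}: {old} -> {new}")
```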

Despite covering these languages more completely, ISO 8859-15 saw limited adoption: most systems had already standardized on Latin-1 or migrated to UTF-8.

Quick Facts

Property                  Value
Standards body            ISO/IEC JTC 1
Number of parts           15 (no ISO 8859-12)
Bytes per character       1 (single-byte)
Characters per standard   256 code positions (up to 191 printable: 95 ASCII + up to 96 regional)
ASCII compatible          Yes (0x00–0x7F identical)
Unicode range of Latin-1  U+0000–U+00FF exactly
Status                    Legacy, superseded by Unicode/UTF-8
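
The printable count can be checked empirically by counting how many positions in 0xA0–0xFF a codec actually assigns. The sketch below assumes Python's codec tables, which raise UnicodeDecodeError for positions a part leaves unassigned; the helper name upper_half_assigned is made up for this example.

```python
def upper_half_assigned(codec: str) -> int:
    """Count the byte values in 0xA0..0xFF that the codec can decode."""
    count = 0
    for b in range(0xA0, 0x100):
        try:
            bytes([b]).decode(codec)
        except UnicodeDecodeError:
            continue  # position is unassigned in this part
        count += 1
    return count


ASCII_PRINTABLE = 0x7E - 0x20 + 1  # 95 characters, shared by every part

for codec in ("iso8859_1", "iso8859_6"):
    regional = upper_half_assigned(codec)
    print(codec, "regional:", regional, "printable total:", ASCII_PRINTABLE + regional)
```

Latin-1 fills all 96 regional positions, while parts such as ISO 8859-6 (Arabic) leave some of them unassigned.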

Common Pitfalls

Confusing Latin-1 with Windows-1252. Windows-1252 adds printable characters in 0x80–0x9F (the C1 control range of Latin-1), including the Euro sign, smart quotes, and em-dashes. Many web browsers historically treated ISO-8859-1 declarations as windows-1252, creating a widespread discrepancy between declared and actual encoding.
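
A rough approximation of that lenient behaviour: decode each byte as Windows-1252, and fall back to Latin-1 for the few bytes Python's cp1252 codec leaves undefined. The function name decode_like_browser is hypothetical, and this is a sketch rather than a description of how any particular browser implements it.

```python
def decode_like_browser(data: bytes) -> str:
    """Decode bytes labeled ISO-8859-1 leniently, as Windows-1252."""
    out = []
    for b in data:
        raw = bytes([b])
        try:
            out.append(raw.decode("cp1252"))
        except UnicodeDecodeError:
            # Python's cp1252 leaves a few bytes (such as 0x81) undefined;
            # keep them as the matching C1 control character.
            out.append(raw.decode("latin-1"))
    return "".join(out)


sample = b"\x93Caf\xe9\x94 \x96 5\x80"
print(repr(sample.decode("latin-1")))  # strict Latin-1: C1 controls appear
print(decode_like_browser(sample))     # lenient: quotes, dash, Euro sign
```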

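Assuming all European text is Latin-1. Polish (ISO 8859-2), Turkish (ISO 8859-9), and Greek (ISO 8859-7) require different standards. A Polish document encoded in ISO 8859-2 but declared as charset=iso-8859-1 will display ą, ę, and ł as the wrong characters (±, ê, and ³). Letters that sit at the same position in both parts, such as ó at 0xF3, survive, which can make the corruption easy to overlook.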
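
The effect is easy to reproduce: encode Polish text as ISO 8859-2, then decode the same bytes as Latin-1. The sample word below ("łąkę") is chosen only for illustration.

```python
polish = "łąkę"

raw = polish.encode("iso8859_2")    # bytes as a Polish system would store them
misread = raw.decode("iso8859_1")   # the same bytes read with the wrong label

print(raw.hex(" "))
print(misread)

# Decoding with the correct part restores the original text.
restored = raw.decode("iso8859_2")
print(restored)
```

No byte is lost in the misreading, so the damage is reversible as long as the text has not been re-encoded in between.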

Using ISO 8859 for East Asian text. ISO 8859 standards are single-byte encodings and cannot represent Chinese, Japanese, or Korean characters, which require multi-byte encodings such as Shift JIS, GB2312, or Big5.
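
A short sketch of that limit, assuming Python's built-in shift_jis codec: encoding Japanese text as Latin-1 fails outright, while a multi-byte encoding spends two or more bytes per character.

```python
text = "日本語"

latin1_error = None
try:
    text.encode("iso8859_1")
except UnicodeEncodeError as exc:
    latin1_error = exc  # no single-byte slot exists for these characters
    print("Latin-1 cannot encode this text:", exc.reason)

sjis = text.encode("shift_jis")  # two bytes per kanji
utf8 = text.encode("utf-8")      # three bytes per kanji

print(len(text), "characters")
print(len(sjis), "bytes in Shift JIS")
print(len(utf8), "bytes in UTF-8")
```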
