ISO 8859

A family of 8-bit single-byte encodings for different language groups. ISO 8859-1 (Latin-1) was the basis for the first 256 code points of Unicode.

What is ISO 8859?

ISO 8859 is a family of fifteen 8-bit, single-byte character encoding standards published jointly by the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC). Each part covers a specific language group or region. All parts keep the 128-character ASCII set in the lower half (0x00–0x7F) and use the upper 128 positions (0x80–0xFF) for the additional characters that region's scripts and languages need.

The ISO 8859 family was the dominant way to encode non-ASCII text on the internet and on personal computers from the late 1980s through the 1990s. It still matters today for anyone working with legacy data, email systems, and pre-Unicode content.

The Family Members

Standard      Name      Languages covered
ISO 8859-1    Latin-1   Western European (French, German, Spanish, Portuguese, Italian)
ISO 8859-2    Latin-2   Central European (Czech, Polish, Hungarian, Croatian)
ISO 8859-3    Latin-3   Southern European (Turkish, Maltese, Esperanto)
ISO 8859-4    Latin-4   Northern European (Estonian, Latvian, Lithuanian)
ISO 8859-5    Cyrillic  Russian, Bulgarian, Serbian, Macedonian
ISO 8859-6    Arabic    Arabic
ISO 8859-7    Greek     Modern Greek
ISO 8859-8    Hebrew    Hebrew
ISO 8859-9    Latin-5   Turkish (Latin-1 variant)
ISO 8859-10   Latin-6   Nordic languages
ISO 8859-11   Thai      Thai (essentially TIS 620)
ISO 8859-13   Latin-7   Baltic languages
ISO 8859-14   Latin-8   Celtic languages (Irish, Welsh)
ISO 8859-15   Latin-9   Western European + Euro sign
ISO 8859-16   Latin-10  South-Eastern European

Note: ISO 8859-12 was proposed for Devanagari but never finalized.
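
As a quick sanity check of the part numbering, the sketch below asks Python's codec registry which parts it knows about. It assumes a standard CPython installation, whose bundled codecs are named iso8859_1 through iso8859_16; the helper name available_parts is only illustrative.

```python
import codecs


def available_parts() -> list[int]:
    """Return the ISO 8859 part numbers for which this Python has a codec."""
    parts = []
    for n in range(1, 17):
        try:
            codecs.lookup(f"iso8859_{n}")
        except LookupError:
            # No codec registered under this name (expected for part 12).
            continue
        parts.append(n)
    return parts


parts = available_parts()
print(parts)
```

On a typical CPython build the gap at part 12 shows up here as a missing codec, which matches the table above.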

How ISO 8859 Works

Every member of the family shares the same structure:

  • 0x00–0x1F: C0 control characters (identical to ASCII)
  • 0x20–0x7E: Printable ASCII characters (identical across all members)
  • 0x7F: DEL control character
  • 0x80–0x9F: C1 control characters (defined but rarely used in practice)
  • 0xA0–0xFF: Region-specific printable characters
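
The layout above can be sketched as a small lookup function. This is only an illustration of the ranges listed; the function name classify_byte and the label strings are assumptions, not part of any standard API.

```python
def classify_byte(value: int) -> str:
    """Name the ISO 8859 range that a single byte value falls into."""
    if not 0 <= value <= 0xFF:
        raise ValueError("ISO 8859 is single-byte: value must be 0..255")
    if value <= 0x1F:
        return "C0 control"
    if value <= 0x7E:
        return "printable ASCII"
    if value == 0x7F:
        return "DEL"
    if value <= 0x9F:
        return "C1 control"
    return "region-specific"


for sample in (0x0A, 0x41, 0x7F, 0x85, 0xE9):
    print(f"0x{sample:02X}: {classify_byte(sample)}")
```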

The region-specific characters in 0xA0–0xFF are what differ between standards. For example, byte 0xE9 means:

  • ISO 8859-1: é (Latin small letter e with acute)
  • ISO 8859-5: щ (Cyrillic small letter shcha)
  • ISO 8859-7: ι (Greek small letter iota)
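
To see this in practice, the following sketch decodes the same byte with three family members using Python's built-in codecs (iso8859_1, iso8859_5, iso8859_7) and looks up the Unicode name of each result. The variable names are illustrative.

```python
import unicodedata

raw = b"\xe9"
decoded = {}
for codec in ("iso8859_1", "iso8859_5", "iso8859_7"):
    ch = raw.decode(codec)
    decoded[codec] = ch
    # Show the character, its code point, and its official Unicode name.
    print(f"{codec}: {ch} U+{ord(ch):04X} {unicodedata.name(ch)}")
```

The byte never changes; only the table used to interpret it does, which is why the declared charset has to travel with the data.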

ISO 8859-1 and Its Importance

ISO 8859-1 (Latin-1) is the most widely used family member. It covers the characters needed for Western European languages and was adopted as:

  • The default charset for text media types in HTTP/1.0 and HTTP/1.1 (for example text/html; charset=ISO-8859-1), a default that RFC 7231 later removed
  • The lower 256 code points of Unicode (U+0000–U+00FF map exactly to Latin-1)
  • The basis for Windows-1252

This Unicode correspondence means that converting a Latin-1 string to Unicode is trivial: each byte value directly becomes the Unicode code point.

# ISO 8859-1 to Unicode: byte values are identical to code points
b'\xe9'.decode('iso-8859-1')    # 'é' — U+00E9
b'\xe9'.decode('latin-1')       # same (latin-1 is an alias)
b'\xe9'.decode('utf-8')         # raises UnicodeDecodeError!

# The difference between Latin-1 and Windows-1252
b'\x80'.decode('iso-8859-1')    # '\x80' — a C1 control character
b'\x80'.decode('windows-1252')  # '€' — Euro sign
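
A slightly broader check, sketched under the assumption that Python's latin-1 and cp1252 codecs are faithful to the respective tables: it verifies the byte-equals-code-point property for all 256 values and collects every byte where windows-1252 disagrees with Latin-1. Bytes that Python's cp1252 codec leaves undefined are decoded with errors="replace", so they count as differences.

```python
# Every Latin-1 byte value should equal the code point it decodes to.
all_bytes = bytes(range(256))
latin1_text = all_bytes.decode("latin-1")
identity_holds = all(ord(ch) == b for b, ch in zip(all_bytes, latin1_text))

# Collect the byte values that windows-1252 interprets differently.
differing = []
for b in range(256):
    one = bytes([b])
    as_latin1 = one.decode("latin-1")
    as_cp1252 = one.decode("cp1252", errors="replace")
    if as_latin1 != as_cp1252:
        differing.append(b)

print(identity_holds)
print(f"0x{differing[0]:02X}..0x{differing[-1]:02X}, {len(differing)} bytes")
```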

ISO 8859-15: Latin-9

ISO 8859-15 is a revision of Latin-1 that replaces 8 rarely used characters with more useful ones, most notably the Euro sign (€) at 0xA4. Latin-1 was defined in 1987, well before the Euro was introduced in 1999. Latin-9 also adds letters needed for French (Œ, œ, Ÿ) and Finnish (Š, š, Ž, ž).

Despite being technically superior, ISO 8859-15 saw limited adoption: most systems had already standardized on Latin-1 or were migrating to UTF-8.
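
The positions that Latin-9 changed can be found by comparing the two codecs byte by byte. This is a minimal sketch using Python's iso8859_1 and iso8859_15 codecs; the dictionary name changed is illustrative.

```python
# Map each differing byte value to (Latin-1 character, Latin-9 character).
changed = {}
for b in range(0xA0, 0x100):
    one = bytes([b])
    old = one.decode("iso8859_1")
    new = one.decode("iso8859_15")
    if old != new:
        changed[b] = (old, new)

for b, (old, new) in changed.items():
    print(f"0x{b:02X}: {old} -> {new}")
```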

Quick Facts

Property                  Value
Standards body            ISO/IEC JTC 1
Number of parts           15 (no ISO 8859-12)
Bytes per character       1 (single-byte)
Characters per standard   256 code positions (up to 191 graphic characters)
ASCII compatible          Yes (0x00–0x7F identical)
Unicode range of Latin-1  U+0000–U+00FF exactly
Status                    Legacy, superseded by Unicode/UTF-8

Common Pitfalls

Confusing Latin-1 with Windows-1252. Windows-1252 places printable characters in 0x80–0x9F (the C1 control range of Latin-1), including the Euro sign, smart quotes, and em-dashes. Web browsers have long treated ISO-8859-1 declarations as windows-1252, and the WHATWG Encoding Standard now specifies that behavior, so the declared and the actual encoding of web content frequently differ.
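
When reading content labelled ISO-8859-1 from the web, one pragmatic approach is to decode it the way browsers do. The sketch below is an assumption-laden illustration, not a standard API: the function name decode_like_browser is hypothetical, and the fallback to C1 controls for the five bytes that Python's cp1252 codec leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D) is a design choice made here.

```python
def decode_like_browser(data: bytes) -> str:
    """Decode bytes labelled ISO-8859-1 as windows-1252, byte by byte."""
    chars = []
    for b in data:
        one = bytes([b])
        try:
            chars.append(one.decode("cp1252"))
        except UnicodeDecodeError:
            # Undefined in Python's cp1252 table: keep the Latin-1 C1 control.
            chars.append(one.decode("latin-1"))
    return "".join(chars)


sample = b"\x93quoted\x94 costs \x80 5"
print(decode_like_browser(sample))
```

Decoding per byte is slow for large inputs; it is used here only to keep the fallback logic easy to follow.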

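Assuming all European text is Latin-1. Polish (ISO 8859-2), Turkish (ISO 8859-9), and Greek (ISO 8859-7) require different standards. A Polish document encoded in ISO 8859-2 but declared as charset=iso-8859-1 will display ą, ę, ł as ±, ê, ³. Letters that happen to share a position in both standards, such as ó, still display correctly, which can hide the problem.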
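
The mislabelling effect is easy to reproduce. This sketch encodes a few Polish letters with Python's iso8859_2 codec and then decodes the same bytes under the wrong and the right label; the variable names are illustrative.

```python
polish = "ąęłó"
wire_bytes = polish.encode("iso8859_2")

# Decoding with the wrong label silently yields different characters.
misread = wire_bytes.decode("iso8859_1")
# Decoding with the correct label round-trips the text.
correct = wire_bytes.decode("iso8859_2")

print(misread)
print(correct)
```

No exception is raised in either case, because every byte is valid in every ISO 8859 codec that fills the position; the error is only visible to a reader.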

Multi-byte East Asian languages. ISO 8859 standards are single-byte encodings and cannot represent Chinese, Japanese, or Korean characters, which require multi-byte encodings like Shift-JIS, GB2312, or Big5.
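
A small check can tell whether a given string fits into a particular encoding before writing it out. The helper name fits_in is hypothetical; it relies only on str.encode raising UnicodeEncodeError for unrepresentable characters.

```python
def fits_in(text: str, codec: str) -> bool:
    """Return True if every character of text can be encoded with codec."""
    try:
        text.encode(codec)
    except UnicodeEncodeError:
        return False
    return True


japanese = "日本語"
print(fits_in(japanese, "iso8859_1"))   # Latin-1 has no CJK characters
print(fits_in(japanese, "shift_jis"))   # a multi-byte encoding can hold them
print(len(japanese.encode("shift_jis")), "bytes for", len(japanese), "characters")
```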
