IANA Charset

The official registry of character encoding names maintained by IANA, used in HTTP Content-Type and MIME headers (for example, charset=utf-8).

2021-06-03 · Updated 2024-03-11

What is IANA Charset?

IANA charset names are the official, standardized names for character encoding schemes registered with the Internet Assigned Numbers Authority (IANA). These names are used in internet protocols — HTTP Content-Type headers, MIME email headers, XML declarations, HTML <meta charset> tags, and many other contexts — to unambiguously identify which character encoding is in use.

The IANA Character Sets registry ensures that when a web server says charset=utf-8 and a browser receives that header, both sides agree on exactly what "utf-8" means. Without this standardization, the same encoding could go by a dozen different names across systems, making interoperability unreliable.

How IANA Charset Names Work

IANA maintains a registry at https://www.iana.org/assignments/character-sets/ that lists every registered charset. Each entry includes:

Name: The preferred (canonical) IANA name (e.g., UTF-8)
Aliases: Alternative names that should be treated as equivalent (e.g., csUTF8 for UTF-8)
MIBenum: A numeric identifier (useful for protocols that prefer numbers over strings)
References: The standards documents that define the encoding
Notes: Remarks on usage, such as which of several names is preferred for MIME

The names are case-insensitive per the IANA registry rules: UTF-8, utf-8, and Utf-8 are all valid references to the same charset.
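
The registry fields and the case-insensitive matching rule can be sketched in a few lines of Python. This is only an illustration, not IANA's data format: `CharsetEntry`, `REGISTRY`, and `resolve` are hypothetical names, and the three entries are a hand-copied excerpt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CharsetEntry:
    """One registry row (hypothetical structure, for illustration only)."""
    name: str       # preferred name
    aliases: tuple  # names to be treated as equivalent
    mibenum: int    # numeric identifier


# A tiny hand-copied excerpt; the real registry has hundreds of entries.
REGISTRY = (
    CharsetEntry("UTF-8", ("csUTF8",), 106),
    CharsetEntry("US-ASCII", ("ANSI_X3.4-1968", "iso-ir-6"), 3),
    CharsetEntry("Shift_JIS", ("MS_Kanji",), 17),
)

# Index every name and alias by its lowercased form so lookups ignore case.
_INDEX = {
    label.lower(): entry
    for entry in REGISTRY
    for label in (entry.name, *entry.aliases)
}


def resolve(label: str) -> CharsetEntry:
    """Return the entry for a name or alias, ignoring case."""
    try:
        return _INDEX[label.strip().lower()]
    except KeyError:
        raise LookupError(f"unregistered charset: {label!r}") from None


print(resolve("utf-8").name)      # UTF-8
print(resolve("CSUTF8").mibenum)  # 106
print(resolve("ms_kanji").name)   # Shift_JIS
```

Lowercasing both the stored labels and the query keeps the lookup a single dictionary access while still honoring the case-insensitivity rule.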

Common IANA Charset Names

IANA Name	Aliases (registered and commonly accepted)	Notes
`UTF-8`	`utf8`, `csUTF8`	Web default, strongly preferred
`UTF-16`	`UTF16`	With BOM
`UTF-16BE`	`csUTF16BE`	Big endian, no BOM
`UTF-16LE`	`csUTF16LE`	Little endian, no BOM
`UTF-32`	—	With BOM
`ISO-8859-1`	`latin1`, `latin-1`, `ISO_8859-1`	Latin-1
`ISO-8859-15`	`latin9`, `latin-9`	Latin-1 + Euro sign
`windows-1252`	`cp1252`, `x-cp1252`	ANSI Western European
`Shift_JIS`	`SJIS`, `MS_Kanji`	Japanese
`EUC-JP`	`csEUCPkdFmtJapanese`	Japanese (Unix)
`EUC-KR`	`csEUCKR`	Korean (Unix)
`Big5`	`csBig5`	Traditional Chinese
`GB2312`	`csGB2312`	Simplified Chinese
`KOI8-R`	`csKOI8R`	Russian (Unix)
`US-ASCII`	`ASCII`, `ANSI_X3.4-1968`, `iso-ir-6`	7-bit ASCII

IANA Charsets in Web Contexts

HTTP and HTML both use IANA charset names:

Content-Type: text/html; charset=UTF-8
Content-Type: text/plain; charset=windows-1252
Content-Type: application/json; charset=utf-8
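
On the receiving side, a client has to pull the charset parameter out of a header like the ones above before it can decode the body. A minimal sketch using `email.message.Message` from Python's standard library to parse the header value; the helper name `charset_from_content_type` and the UTF-8 fallback are assumptions made for illustration, not something HTTP mandates.

```python
from email.message import Message


def charset_from_content_type(header_value: str, default: str = "utf-8") -> str:
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the parameter lowercased, or None if absent.
    return msg.get_content_charset() or default


print(charset_from_content_type("text/html; charset=UTF-8"))          # utf-8
print(charset_from_content_type("text/plain; charset=windows-1252"))  # windows-1252
print(charset_from_content_type("application/json"))                  # utf-8 (fallback)
```

Because the parser lowercases the value, later comparisons against known charset names do not need their own case handling.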

<!-- HTML5 shorthand (preferred): -->
<meta charset="utf-8">

<!-- Legacy HTML4 form: -->
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

<!-- XML declaration: -->
<?xml version="1.0" encoding="UTF-8"?>

WHATWG Encoding Standard and IANA

For web browsers, the authoritative reference is the WHATWG Encoding Standard (encoding.spec.whatwg.org), which defines a specific subset of IANA charsets and their aliases. The WHATWG standard is intentionally restrictive — it excludes obscure or harmful encodings and provides precise byte-level semantics for the encodings it does support.

Notably, the WHATWG standard defines ISO-8859-1 as an alias for windows-1252 (see the windows-1252 entry), which diverges from the strict IANA definition but reflects real-world browser behavior.
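
The practical effect of this divergence shows up in bytes 0x80 to 0x9F, which strict ISO-8859-1 maps to C1 control characters and windows-1252 maps mostly to printable characters. A small sketch with Python's built-in codecs, which follow the strict definitions rather than the WHATWG mapping (the sample bytes are chosen for illustration):

```python
raw = b"\x80\x93quoted\x94"

# Strict ISO-8859-1: each byte becomes the code point with the same number,
# so 0x80, 0x93 and 0x94 turn into invisible C1 control characters.
strict = raw.decode("iso-8859-1")

# windows-1252: the same bytes are the euro sign and curly quotes, which is
# what a browser shows for these bytes even if the page is labeled ISO-8859-1.
browser = raw.decode("windows-1252")

print(repr(strict))  # '\x80\x93quoted\x94'
print(browser)       # €“quoted”
```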

# Python: IANA charset names and Python codec names
# Python often uses its own names but accepts many IANA aliases

import codecs

# Look up codec information by IANA-style name
codec = codecs.lookup('utf-8')
print(codec.name)       # 'utf-8'

codec2 = codecs.lookup('shift_jis')
print(codec2.name)      # 'shift_jis'

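# Python accepts many IANA names and aliases; matching ignores case and
# treats hyphens and underscores alike
print('hello'.encode('US-ASCII'))        # b'hello'
print('hello'.encode('ANSI_X3.4-1968'))  # b'hello' (IANA alias of US-ASCII)
print('hello'.encode('iso-8859-1'))      # b'hello'
print('hello'.encode('latin-1'))         # b'hello' (Python spelling of the IANA alias latin1)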

MIBenum: Numeric Identifiers

IANA assigns a numeric MIBenum to each charset for protocols that prefer numbers. This is used in some SNMP, LDAP, and telnet applications:

MIBenum	IANA Name
3	US-ASCII
4	ISO-8859-1
17	Shift_JIS
36	KS_C_5601-1987 (EUC-KR basis)
106	UTF-8
1013	UTF-16BE
1014	UTF-16LE
1015	UTF-16
1017	UTF-32
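
A protocol that carries a MIBenum instead of a name needs a lookup step before decoding. A minimal sketch, assuming a hand-copied subset of the table above; `MIBENUM_TO_NAME` and `decode_with_mibenum` are hypothetical names, and the mapping relies on Python's codec machinery accepting these IANA names as-is.

```python
# Subset of the MIBenum table, limited to names Python can decode directly.
MIBENUM_TO_NAME = {
    3: "US-ASCII",
    4: "ISO-8859-1",
    17: "Shift_JIS",
    106: "UTF-8",
    1013: "UTF-16BE",
    1014: "UTF-16LE",
    1015: "UTF-16",
    1017: "UTF-32",
}


def decode_with_mibenum(data: bytes, mibenum: int) -> str:
    """Decode bytes whose charset is identified by a numeric MIBenum."""
    try:
        name = MIBENUM_TO_NAME[mibenum]
    except KeyError:
        raise LookupError(f"unknown MIBenum: {mibenum}") from None
    return data.decode(name)


print(decode_with_mibenum(b"caf\xc3\xa9", 106))   # café
print(decode_with_mibenum(b"\x00h\x00i", 1013))   # hi
```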

Quick Facts

Property	Value
Registry	IANA (Internet Assigned Numbers Authority)
Registry URL	iana.org/assignments/character-sets/
Case sensitivity	Case-insensitive
Preferred web charset	UTF-8
Numeric IDs	MIBenum
Web browser reference	WHATWG Encoding Standard
Python codec lookup	`codecs.lookup(name)`

Common Pitfalls

Using non-IANA names in HTTP headers. Some servers emit charset=UTF8 (no hyphen) or charset=utf_8 (underscore). While many clients accept these, they are not canonical IANA names. The correct form is charset=utf-8 or charset=UTF-8.
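
One way to avoid emitting such labels is to normalize them before writing the header. The sketch below leans on Python's `codecs.lookup` to resolve spelling variants, then maps Python's own codec name back to an IANA-style name. `PYTHON_TO_IANA` and `header_charset` are hypothetical, and the table deliberately covers only a few codecs.

```python
import codecs

# Python's canonical codec names differ from IANA names, so map them back.
# Hypothetical, deliberately incomplete table.
PYTHON_TO_IANA = {
    "utf-8": "utf-8",
    "ascii": "us-ascii",
    "iso8859-1": "iso-8859-1",
    "cp1252": "windows-1252",
}


def header_charset(label: str) -> str:
    """Return a header-safe charset name for a loosely spelled label."""
    python_name = codecs.lookup(label).name  # raises LookupError if unknown
    try:
        return PYTHON_TO_IANA[python_name]
    except KeyError:
        raise LookupError(f"no IANA mapping for codec {python_name!r}") from None


for label in ("UTF8", "utf_8", "Utf-8", "latin1", "cp1252"):
    print(f"{label!r} -> charset={header_charset(label)}")
```

Failing loudly on an unmapped codec is a design choice here: it is safer to reject a label than to emit one that clients may not recognize.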

The IANA vs. WHATWG divergence. IANA lists many encodings that modern browsers no longer support or that map to different behavior. For web development, use the WHATWG Encoding Standard as the authoritative reference, not raw IANA data.

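Deprecated charsets. Some IANA-registered charsets are unsuitable for the web or have known security issues. For example, UTF-7 (MIBenum 1012) is not part of the WHATWG Encoding Standard and has been used in cross-site scripting (XSS) attacks, because markup encoded in UTF-7 contains no literal < or > characters and can slip past filters that look for them. Never use UTF-7 on the web. Similarly, BOCU-1 and SCSU are IANA-registered but not web-safe.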
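
The UTF-7 risk is easy to demonstrate with Python's built-in `utf-7` codec. The sketch below also shows a simple blocklist check; `UNSAFE_CHARSETS` and `is_web_safe` are hypothetical helpers that cover only the three charsets named above, not a complete policy.

```python
# Charsets to refuse outright; hypothetical and limited to the ones named above.
UNSAFE_CHARSETS = frozenset({"utf-7", "bocu-1", "scsu"})


def is_web_safe(label: str) -> bool:
    """Reject charset labels on the blocklist, ignoring case and whitespace."""
    return label.strip().lower() not in UNSAFE_CHARSETS


# Pure ASCII bytes with no '<' or '>' in them...
payload = b"+ADw-script+AD4-alert(1)+ADw-/script+AD4-"
print(payload.decode("ascii"))  # +ADw-script+AD4-alert(1)+ADw-/script+AD4-

# ...that turn into markup once something decodes them as UTF-7.
print(payload.decode("utf-7"))  # <script>alert(1)</script>

print(is_web_safe("UTF-7"))     # False
print(is_web_safe("utf-8"))     # True
```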

Charset vs. encoding vs. codec. These three terms are often used interchangeably in web and programming contexts, but they are technically distinct. "Charset" is the IANA term for a registered encoding specification. "Encoding" is the general term for a scheme that maps characters to bytes. "Codec" is a software implementation of an encoding (encoder + decoder).

Related Terms

Character encoding, UTF-8

More in Encoding

ASCII

American Standard Code for Information Interchange. A 7-bit encoding covering 128 characters (0–127): …

ASCII Art

Visual art created from text characters, originally limited to the 95 printable …

Base64

Binary-to-text encoding that represents binary data using 64 ASCII characters (A–Z, a–z, …

Big5

A Traditional Chinese encoding used mainly in Taiwan and Hong Kong, encoding …

EBCDIC

Extended Binary Coded Decimal Interchange Code. An IBM mainframe encoding with non-contiguous ranges …

EUC-KR

A Korean encoding based on KS X 1001, mapping Hangul syllables and Hanja …

GB2312 / GB18030

A family of Simplified Chinese encodings: GB2312 (6,763 characters) evolved into GBK, then into …

ISO 8859

A family of 8-bit single-byte encodings for different language groups. ISO 8859-1 (Latin-1) served …

Shift JIS

A Japanese encoding combining single-byte ASCII/JIS Roman with double-byte JIS X 0208 kanji. …

UCS-2

A legacy fixed-width 2-byte encoding covering only the BMP (U+0000–U+FFFF). The predecessor of UTF-16, unable to …
