Cyrillic Script
Cyrillic is one of the most widely used scripts in the world, serving as the writing system for Russian, Ukrainian, Bulgarian, Serbian, Kazakh, Mongolian, and over 50 other languages across Eastern Europe and Central Asia. With more than 250 million native users and official status in numerous countries, Cyrillic is a script that any developer working with international text will encounter. Its visual similarity to Latin letters makes it a prime vector for homoglyph attacks, while its language-specific variants demand careful font and locale handling. This guide explores Cyrillic's Unicode encoding, its history, and the technical challenges of supporting the full range of Cyrillic-using languages.
History and Origins
The Cyrillic script was developed in the First Bulgarian Empire in the 9th century CE, traditionally attributed to the disciples of Saints Cyril and Methodius. (Despite the name, the "Cyrillic" script was likely created by Clement of Ohrid; Saints Cyril and Methodius created the earlier Glagolitic script.) Cyrillic was derived primarily from the Greek uncial script, with additional letters for Slavic sounds not found in Greek.
The script spread rapidly through the Orthodox Christian world:
| Century | Event |
|---|---|
| 9th | Created in Bulgaria, based on Greek uncial |
| 10th | Adopted by Kievan Rus' (precursor to Russia, Ukraine) |
| 14th | Spread to Serbia, Romania (until 19th c.) |
| 18th | Peter the Great reforms Russian Cyrillic (Civil Script) |
| 20th | Soviet Union extends Cyrillic to Central Asian languages |
| 21st | Kazakhstan begins transition from Cyrillic to Latin |
Peter the Great's reform of 1708 simplified the letter shapes and removed some archaic characters, creating the civil script (гражданский шрифт) that is the basis of modern Russian Cyrillic. Other Cyrillic-using languages underwent their own reforms at various points.
Languages Using Cyrillic
Cyrillic serves an extraordinarily diverse set of languages, each with its own subset of letters:
| Language | Speakers | Letters | Notable Characters |
|---|---|---|---|
| Russian | 258M | 33 | Ё ё, Ъ ъ, Ы ы |
| Ukrainian | 45M | 33 | Ї ї, І і, Ґ ґ, Є є |
| Bulgarian | 8M | 30 | Same base as Russian minus some |
| Serbian | 12M | 30 | Љ љ, Њ њ, Ћ ћ, Ђ ђ, Џ џ |
| Macedonian | 2M | 31 | Ќ ќ, Ѓ ѓ, Ѕ ѕ |
| Belarusian | 5M | 32 | Ў ў (short U) |
| Kazakh | 13M | 42 | Extended with diacritics |
| Mongolian | 6M | 35 | Extended letters |
| Tajik | 9M | 35 | Extended with diacritics |
| Kyrgyz | 5M | 36 | Extended letters |
Unicode Blocks for Cyrillic
| Block | Range | Characters | Purpose |
|---|---|---|---|
| Cyrillic | U+0400–U+04FF | 256 | Core Slavic + common extensions |
| Cyrillic Supplement | U+0500–U+052F | 48 | Minority languages (Komi, Abkhaz) |
| Cyrillic Extended-A | U+2DE0–U+2DFF | 32 | Old Church Slavonic diacritics |
| Cyrillic Extended-B | U+A640–U+A69F | 96 | Historic and Old Slavonic letters |
| Cyrillic Extended-C | U+1C80–U+1C8F | 9 | Old Slavonic letters |
| Cyrillic Extended-D | U+1E030–U+1E08F | 63 | Superscript and subscript modifier letters for phonetic transcription |
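The ranges in the table can be turned into a small lookup. The following is a minimal sketch, not a library API: `CYRILLIC_BLOCKS` and `cyrillic_block` are hypothetical names introduced here, and the ranges are copied from the table above.

```python
from typing import Optional

# Ranges copied from the table above; names are the Unicode block names.
CYRILLIC_BLOCKS = [
    (0x0400, 0x04FF, "Cyrillic"),
    (0x0500, 0x052F, "Cyrillic Supplement"),
    (0x1C80, 0x1C8F, "Cyrillic Extended-C"),
    (0x2DE0, 0x2DFF, "Cyrillic Extended-A"),
    (0xA640, 0xA69F, "Cyrillic Extended-B"),
    (0x1E030, 0x1E08F, "Cyrillic Extended-D"),
]


def cyrillic_block(ch: str) -> Optional[str]:
    """Return the name of the Cyrillic block containing ch, or None."""
    cp = ord(ch)
    for start, end, name in CYRILLIC_BLOCKS:
        if start <= cp <= end:
            return name
    return None


print(cyrillic_block("\u0416"))  # Ж: Cyrillic
print(cyrillic_block("\u0500"))  # Cyrillic Supplement
print(cyrillic_block("Z"))       # None
```

A block lookup only says where a code point lives; it does not say whether the code point is assigned or which language uses it.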
The main Cyrillic block (U+0400–U+04FF) contains everything needed for all major modern languages:
Russian Alphabet (33 Letters)
А Б В Г Д Е Ж З И Й К Л М Н О П
а б в г д е ж з и й к л м н о п
Р С Т У Ф Х Ц Ч Ш Щ Ъ Ы Ь Э Ю Я
р с т у ф х ц ч ш щ ъ ы ь э ю я
Ё ё (U+0401 / U+0451) — often treated specially
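One reason Ё needs special treatment is visible in the code points themselves: А through Я form one contiguous run, while Ё sits outside it. A small sketch that rebuilds the 33-letter alphabet from code points (the variable names are illustrative only):

```python
# А..Я occupy U+0410..U+042F contiguously; Ё (U+0401) sits outside that run.
base = [chr(cp) for cp in range(0x0410, 0x0430)]
yo_index = base.index("\u0415") + 1  # Ё follows Е in alphabetical order
alphabet = base[:yo_index] + ["\u0401"] + base[yo_index:]

print(len(alphabet))                    # 33
print("".join(alphabet))
# Plain code-point sorting puts Ё first, ahead of А
print(sorted(alphabet)[0] == "\u0401")  # True
```

This is why sorting Russian strings by raw code point misplaces words that begin with or contain Ё.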
Key Code Points
| Character | Unicode | Name |
|---|---|---|
| А а | U+0410 / U+0430 | A |
| Б б | U+0411 / U+0431 | BE |
| В в | U+0412 / U+0432 | VE |
| Г г | U+0413 / U+0433 | GHE |
| Д д | U+0414 / U+0434 | DE |
| Е е | U+0415 / U+0435 | IE |
| Ё ё | U+0401 / U+0451 | IO |
| Ж ж | U+0416 / U+0436 | ZHE |
| З з | U+0417 / U+0437 | ZE |
| И и | U+0418 / U+0438 | I |
| К к | U+041A / U+043A | KA |
| Л л | U+041B / U+043B | EL |
| М м | U+041C / U+043C | EM |
| Н н | U+041D / U+043D | EN |
| О о | U+041E / U+043E | O |
| П п | U+041F / U+043F | PE |
| Р р | U+0420 / U+0440 | ER |
| С с | U+0421 / U+0441 | ES |
| Т т | U+0422 / U+0442 | TE |
| Ш ш | U+0428 / U+0448 | SHA |
| Щ щ | U+0429 / U+0449 | SHCHA |
| Ъ ъ | U+042A / U+044A | HARD SIGN |
| Ы ы | U+042B / U+044B | YERU |
| Ь ь | U+042C / U+044C | SOFT SIGN |
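The table shows a regular pattern: for А through Я the small letter is the capital plus 0x20, while Ё/ё differ by 0x50. The sketch below checks that pattern against Python's own case mapping. `to_lower_by_offset` is a hypothetical helper for illustration; real code should simply call `str.lower()`.

```python
def to_lower_by_offset(ch: str) -> str:
    """Illustrative only: lowercase a Russian capital using fixed offsets."""
    cp = ord(ch)
    if 0x0410 <= cp <= 0x042F:  # А..Я map to а..я
        return chr(cp + 0x20)
    if cp == 0x0401:            # Ё maps to ё
        return chr(cp + 0x50)
    return ch


# The offsets agree with Python's built-in case mapping
for ch in "\u0410\u042F\u0401":
    print(ch, to_lower_by_offset(ch), to_lower_by_offset(ch) == ch.lower())
```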
The Ё Problem
Russian letter Ё (ё) is notoriously under-used. Most Russian text — including newspapers, books, and websites — replaces ё with е (without the dieresis). This creates challenges for:
- Search: Users may search for "ещё" or "еще" — both should match
- Sorting: Ё's position in sort order varies by convention
- NLP: Word frequency counts are skewed when ё and е are conflated
# Handle Russian Ё/Е equivalence in search
def normalize_russian(text: str) -> str:
    """Map ё/Ё to е/Е so both spellings compare equal (case is preserved)."""
    return text.replace("\u0451", "\u0435").replace("\u0401", "\u0415")

# Both spellings normalize to the same string
query = normalize_russian("\u0435\u0449\u0451")  # ещё → еще
text = normalize_russian("\u0435\u0449\u0435")   # еще → еще
assert query == text
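Two of the problems listed above can be sketched a little further. First, ё may arrive in decomposed form (е followed by U+0308 COMBINING DIAERESIS), which a plain character replacement would miss; normalizing to NFC first handles that. Second, a simple sort key can keep ё next to е rather than after я. Both helpers below are hypothetical, and the tie-breaking rule (plain е first) is an assumption rather than a statement of any official collation.

```python
import unicodedata


def fold_yo(text: str) -> str:
    """Casefold and map ё to е, also handling decomposed е + U+0308."""
    composed = unicodedata.normalize("NFC", text)
    return composed.casefold().replace("\u0451", "\u0435")


def russian_sort_key(word: str) -> tuple[str, str]:
    """Sort ё together with е; the unfolded form breaks ties (е first)."""
    folded = unicodedata.normalize("NFC", word).casefold()
    return (folded.replace("\u0451", "\u0435"), folded)


decomposed = "\u0435\u0449\u0435\u0308"  # ещё typed as е + combining diaeresis
print(fold_yo(decomposed) == fold_yo("\u0435\u0449\u0451"))  # True

# ёлка, ель, еда, ёж
words = ["\u0451\u043B\u043A\u0430", "\u0435\u043B\u044C",
         "\u0435\u0434\u0430", "\u0451\u0436"]
print(sorted(words))                        # code-point order: еда, ель, ёж, ёлка
print(sorted(words, key=russian_sort_key))  # еда, ёж, ёлка, ель
```

For production sorting, a locale-aware collator is usually the better choice; this key only illustrates the idea.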
Cyrillic–Latin Confusables
The most significant security issue with Cyrillic is that many uppercase and lowercase letters are visually identical to Latin characters:
| Cyrillic | Latin | Code Points |
|---|---|---|
| А а | A a | U+0410/U+0430 vs U+0041/U+0061 |
| В | B | U+0412 vs U+0042 |
| Е е | E e | U+0415/U+0435 vs U+0045/U+0065 |
| К | K | U+041A vs U+004B |
| М | M | U+041C vs U+004D |
| Н | H | U+041D vs U+0048 |
| О о | O o | U+041E/U+043E vs U+004F/U+006F |
| Р р | P p | U+0420/U+0440 vs U+0050/U+0070 |
| С с | C c | U+0421/U+0441 vs U+0043/U+0063 |
| Т | T | U+0422 vs U+0054 |
| Х х | X x | U+0425/U+0445 vs U+0058/U+0078 |
| у | y | U+0443 vs U+0079 |
This means an attacker can register a domain like аррӏе.com (using Cyrillic а, р, р, ӏ, and е) that looks identical to apple.com in many fonts. This is called a homoglyph attack or IDN homograph attack.
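To see what such a label looks like once it is converted to ASCII, the sketch below uses Python's built-in `punycode` codec. That codec implements only the RFC 3492 encoding step; adding the `xn--` prefix by hand, as done here, skips the validation and mapping that a real IDNA implementation performs, so treat it as an illustration only.

```python
# The spoofed label from the example: Cyrillic а р р ӏ е
label = "\u0430\u0440\u0440\u04CF\u0435"

# Encode with the raw Punycode algorithm and add the ACE prefix manually.
ace = "xn--" + label.encode("punycode").decode("ascii")

print(label == "apple")  # False: same look, different code points
print(ace)               # an ASCII-only label starting with xn--

# Decoding the part after the prefix restores the original Cyrillic label
print(ace[4:].encode("ascii").decode("punycode") == label)  # True
```

Showing the ASCII form instead of the Unicode form is what exposes the spoof to the user.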
Defenses
- Browsers display Punycode (xn--...) for mixed-script domains
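- IDNA 2008 limits which code points may appear in domain labels; whole-script confusables are left largely to registry and browser policy
- Unicode Technical Standard #39 (Unicode Security Mechanisms) provides confusable character data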
- Application-level checks should detect mixed Cyrillic/Latin text
def detect_mixed_scripts(text: str) -> set[str]:
    """Return the scripts (Latin, Cyrillic) found among the letters of text."""
    scripts: set[str] = set()
    for char in text:
        if char.isalpha():
            cp = ord(char)
            if 0x0400 <= cp <= 0x052F:    # Cyrillic + Cyrillic Supplement
                scripts.add("Cyrillic")
            elif 0x0041 <= cp <= 0x024F:  # Basic Latin through Latin Extended-B
                scripts.add("Latin")
    return scripts

# Suspicious: looks like "apple" but is entirely Cyrillic
fake = "\u0430\u0440\u0440\u04CF\u0435"
print(detect_mixed_scripts(fake))  # {'Cyrillic'}: single-script, so this check alone misses it

# Mixed-script spoof: Latin a, Cyrillic рр, Latin le
mixed = "a\u0440\u0440le"
print(sorted(detect_mixed_scripts(mixed)))  # ['Cyrillic', 'Latin']: mixed, flag this
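A mixed-script check cannot catch a label written entirely in Cyrillic. One complementary approach is to map lookalike letters to a common "skeleton" and compare that against names you want to protect. The sketch below is a heavily simplified stand-in for the UTS #39 skeleton algorithm: the mapping covers only the pairs from the confusables table above plus the palochka (U+04CF), and `latin_skeleton` and `spoofs` are hypothetical names.

```python
# Cyrillic -> Latin lookalikes, taken from the confusables table above
# plus the palochka (U+04CF) used in the spoofed "apple" example.
CYRILLIC_TO_LATIN = str.maketrans({
    "\u0410": "A", "\u0430": "a", "\u0412": "B", "\u0415": "E", "\u0435": "e",
    "\u041A": "K", "\u041C": "M", "\u041D": "H", "\u041E": "O", "\u043E": "o",
    "\u0420": "P", "\u0440": "p", "\u0421": "C", "\u0441": "c", "\u0422": "T",
    "\u0425": "X", "\u0445": "x", "\u0443": "y", "\u04CF": "l",
})


def latin_skeleton(text: str) -> str:
    """Replace known Cyrillic lookalikes with their Latin counterparts."""
    return text.translate(CYRILLIC_TO_LATIN)


def spoofs(candidate: str, protected: set[str]) -> bool:
    """True if candidate is not a protected name but looks like one."""
    return candidate not in protected and latin_skeleton(candidate) in protected


protected_names = {"apple"}
print(spoofs("\u0430\u0440\u0440\u04CF\u0435", protected_names))  # True: all-Cyrillic lookalike
print(spoofs("a\u0440\u0440le", protected_names))                 # True: mixed-script lookalike
print(spoofs("apple", protected_names))                           # False: the genuine name
```

A production system would use the full confusables data published with UTS #39 rather than a hand-written table.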
Language-Specific Variants
A critical challenge with Cyrillic is that the same code point may need to render differently depending on the language. The best-known examples:
Italic Forms
In Russian italic, the lowercase letters г, д, и, п, т take shapes quite unlike their upright forms: to readers unfamiliar with Cyrillic they can resemble a reversed Latin s, a g (or ∂), u, n, and m respectively. Serbian and Macedonian italics differ even further.
Serbian vs Russian
Serbian Cyrillic uses the same Unicode code points as Russian for shared letters, but several glyphs should render differently:
| Letter | Russian form | Serbian form | Code Point |
|---|---|---|---|
| б | б (curved top) | б (flat top in italic) | U+0431 |
| г | г (like Γ) | г (like ι in italic) | U+0433 |
| д | д (like Δ) | д (like g in italic) | U+0434 |
| п | п (like Π) | п (like и with a macron in italic) | U+043F |
| т | т (like T) | т (like ш with a macron in italic) | U+0442 |
This means font selection and the lang attribute are critical:
<!-- Russian text uses Russian glyph variants -->
<p lang="ru">Београд</p>
<!-- Serbian text uses Serbian glyph variants -->
<p lang="sr">Београд</p>
/* Ensure proper Serbian rendering */
:lang(sr) {
  font-family: "Noto Serif", serif;
  /* The font must include Serbian locl features */
}
Ukrainian
Ukrainian has four letters not in the Russian alphabet:
| Letter | Unicode | Name | Russian Equivalent |
|---|---|---|---|
| Ґ ґ | U+0490 / U+0491 | GHE WITH UPTURN | (no equivalent) |
| Є є | U+0404 / U+0454 | UKRAINIAN IE | Е (similar sound) |
| І і | U+0406 / U+0456 | BYELORUSSIAN-UKRAINIAN I | И |
| Ї ї | U+0407 / U+0457 | YI | (no equivalent) |
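Because these four letters never appear in Russian, and Russian ё, ъ, ы, э never appear in Ukrainian, their presence gives a cheap hint for choosing a `lang` value. The sketch below is a rough heuristic under that assumption, not a language detector: `guess_ru_or_uk` is a hypothetical helper, it ignores other Cyrillic languages (Belarusian, for example, uses both і and ы), and short texts often contain none of the distinguishing letters.

```python
UKRAINIAN_ONLY = set("\u0491\u0454\u0456\u0457")  # ґ є і ї
RUSSIAN_ONLY = set("\u0451\u044A\u044B\u044D")    # ё ъ ы э


def guess_ru_or_uk(text: str) -> str:
    """Return 'uk', 'ru', or 'unknown' based on distinguishing letters."""
    letters = set(text.lower())
    has_uk = bool(letters & UKRAINIAN_ONLY)
    has_ru = bool(letters & RUSSIAN_ONLY)
    if has_uk and not has_ru:
        return "uk"
    if has_ru and not has_uk:
        return "ru"
    return "unknown"


print(guess_ru_or_uk("\u041A\u0438\u0457\u0432"))              # Київ: uk
print(guess_ru_or_uk("\u041E\u0431\u044A\u0435\u043A\u0442"))  # Объект: ru
print(guess_ru_or_uk("\u041C\u043E\u0441\u043A\u0432\u0430"))  # Москва: unknown
```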
Working with Cyrillic in Code
Python
import unicodedata

# Russian text
text = "\u041F\u0440\u0438\u0432\u0435\u0442"  # Привет (Hello)
print(text.upper())  # ПРИВЕТ
print(text.lower())  # привет

# Inspect each character's code point, Unicode name, and general category
for ch in text:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} "
          f"cat={unicodedata.category(ch)}")

# Case-insensitive comparison with Ё handling
def russian_casefold(s: str) -> str:
    """Case-fold Russian text, treating Ё as Е."""
    return s.casefold().replace("\u0451", "\u0435")
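The standard library has no direct script property, but Unicode character names can serve as a rough substitute: names of Cyrillic characters contain the word CYRILLIC regardless of which block they sit in. The sketch below relies on that naming convention. `is_cyrillic` and `cyrillic_ratio` are hypothetical helpers, and characters missing from the interpreter's Unicode database are treated as non-Cyrillic.

```python
import unicodedata


def is_cyrillic(ch: str) -> bool:
    """Heuristic: true if the character's Unicode name mentions CYRILLIC."""
    return "CYRILLIC" in unicodedata.name(ch, "")


def cyrillic_ratio(text: str) -> float:
    """Fraction of alphabetic characters in text that are Cyrillic."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(is_cyrillic(ch) for ch in letters) / len(letters)


print(is_cyrillic("\u0436"))  # ж: True
print(is_cyrillic("z"))       # False
print(cyrillic_ratio("\u041F\u0440\u0438\u0432\u0435\u0442, world"))  # 6 of 11 letters
```

Unlike a fixed range check, this also recognizes letters in the Supplement and Extended blocks without listing their ranges.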
JavaScript
// Match any character whose Unicode script is Cyrillic
const cyrillicPattern = /\p{Script=Cyrillic}/u;
const text = "\u041F\u0440\u0438\u0432\u0435\u0442"; // Привет
console.log(cyrillicPattern.test(text)); // true

// Locale-aware sorting (Russian vs Ukrainian collation differs)
const russianCollator = new Intl.Collator("ru");
const ukrainianCollator = new Intl.Collator("uk");
const words = ["\u0430", "\u0431", "\u0432", "\u0491"]; // а, б, в, ґ
// sort() mutates its array, so sort a copy for each locale
console.log([...words].sort(russianCollator.compare));
console.log([...words].sort(ukrainianCollator.compare));
Transliteration
Cyrillic text is often transliterated to Latin for URLs, filenames, and systems that do not support Cyrillic. Several standards exist:
| Standard | Ш | Щ | Ж | Ч | Purpose |
|---|---|---|---|---|---|
| ISO 9 | š | ŝ | ž | č | Scholarly (1:1 reversible) |
| BGN/PCGN | sh | shch | zh | ch | US/UK geographic names |
| Passport (Russian) | sh | shch | zh | ch | Russian passports |
| Scientific | š | šč | ž | č | Academic transliteration |
# Simple Russian transliteration (informal scheme; output is lowercased)
TRANSLIT_MAP: dict[str, str] = {
    "\u0430": "a", "\u0431": "b", "\u0432": "v", "\u0433": "g",
    "\u0434": "d", "\u0435": "e", "\u0451": "yo", "\u0436": "zh",
    "\u0437": "z", "\u0438": "i", "\u0439": "y", "\u043A": "k",
    "\u043B": "l", "\u043C": "m", "\u043D": "n", "\u043E": "o",
    "\u043F": "p", "\u0440": "r", "\u0441": "s", "\u0442": "t",
    "\u0443": "u", "\u0444": "f", "\u0445": "kh", "\u0446": "ts",
    "\u0447": "ch", "\u0448": "sh", "\u0449": "shch", "\u044A": "",
    "\u044B": "y", "\u044C": "", "\u044D": "e", "\u044E": "yu",
    "\u044F": "ya",
}

def transliterate_russian(text: str) -> str:
    """Transliterate lowercased text; characters not in the map pass through."""
    return "".join(TRANSLIT_MAP.get(ch, ch) for ch in text.lower())

print(transliterate_russian("\u041F\u0440\u0438\u0432\u0435\u0442"))
# privet
Summary
Cyrillic is a globally significant script whose Unicode support requires attention to language-specific details. Key takeaways:
- Cyrillic–Latin confusables are the single biggest security concern — always detect and flag mixed-script text in security-sensitive contexts
- Language-specific glyph variants (especially Serbian vs Russian italic forms) require correct lang attributes and OpenType locl features
- The Ё problem in Russian means search and comparison logic must treat Ё/ё and Е/е as equivalent
- Ukrainian, Serbian, Macedonian, and other languages have unique letters that extend beyond the Russian alphabet — do not assume Cyrillic = Russian
- Transliteration has no single standard — choose the appropriate scheme for your use case (scholarly, geographic, or passport)
- Multiple Unicode blocks cover Cyrillic: the main block (U+0400–U+04FF) covers the major modern languages, while some minority-language and historic text needs the Supplement or Extended blocks