📜 Script Stories

Cyrillic Script

Cyrillic is used to write Russian, Ukrainian, Bulgarian, Serbian, and over 50 other languages, making it one of the most widely used scripts in the world. Its characters are spread across several Unicode blocks. This guide explores the Cyrillic script's history, its Unicode representation, and the challenges of supporting the full range of Cyrillic-using languages.

Cyrillic is one of the most widely used scripts in the world, serving as the writing system for Russian, Ukrainian, Bulgarian, Serbian, Kazakh, Mongolian, and over 50 other languages across Eastern Europe and Central Asia. With more than 250 million native users and official status in numerous countries, Cyrillic is a script that any developer working with international text will encounter. Its visual similarity to Latin letters makes it a prime vector for homoglyph attacks, while its language-specific variants demand careful font and locale handling. This guide explores Cyrillic's Unicode encoding, its history, and the technical challenges of supporting the full range of Cyrillic-using languages.

History and Origins

The Cyrillic script was developed in the First Bulgarian Empire in the 9th century CE by disciples of Saints Cyril and Methodius. Despite the name, Cyril did not create it: he and Methodius devised the earlier Glagolitic script, and Cyrillic was likely the work of their disciple Clement of Ohrid. Cyrillic was derived primarily from the Greek uncial script, with additional letters for Slavic sounds not found in Greek.

The script spread rapidly through the Orthodox Christian world:

| Century | Event |
| --- | --- |
| 9th | Created in Bulgaria, based on Greek uncial |
| 10th | Adopted by Kievan Rus' (precursor to Russia, Ukraine) |
| 14th | Spread to Serbia, Romania (until 19th c.) |
| 18th | Peter the Great reforms Russian Cyrillic (Civil Script) |
| 20th | Soviet Union extends Cyrillic to Central Asian languages |
| 21st | Kazakhstan begins transition from Cyrillic to Latin |

Peter the Great's reform of 1708 simplified the letter shapes and removed some archaic characters, creating the civil script (гражданский шрифт) that is the basis of modern Russian Cyrillic. Other Cyrillic-using languages underwent their own reforms at various points.

Languages Using Cyrillic

Cyrillic serves an extraordinarily diverse set of languages, each with its own subset of letters:

| Language | Speakers | Letters | Notable Characters |
| --- | --- | --- | --- |
| Russian | 258M | 33 | Ё ё, Ъ ъ, Ы ы |
| Ukrainian | 45M | 33 | Ї ї, І і, Ґ ґ, Є є |
| Bulgarian | 8M | 30 | Russian alphabet minus Ё, Ы, Э |
| Serbian | 12M | 30 | Љ љ, Њ њ, Ћ ћ, Ђ ђ, Џ џ |
| Macedonian | 2M | 31 | Ќ ќ, Ѓ ѓ, Ѕ ѕ |
| Belarusian | 5M | 32 | Ў ў (short U) |
| Kazakh | 13M | 42 | Extended with diacritics |
| Mongolian | 6M | 35 | Extended letters |
| Tajik | 9M | 35 | Extended with diacritics |
| Kyrgyz | 5M | 36 | Extended letters |

Unicode Blocks for Cyrillic

| Block | Range | Characters | Purpose |
| --- | --- | --- | --- |
| Cyrillic | U+0400–U+04FF | 256 | Core Slavic + common extensions |
| Cyrillic Supplement | U+0500–U+052F | 48 | Minority languages (Komi, Abkhaz) |
| Cyrillic Extended-A | U+2DE0–U+2DFF | 32 | Old Church Slavonic diacritics |
| Cyrillic Extended-B | U+A640–U+A69F | 96 | Historic and Old Slavonic letters |
| Cyrillic Extended-C | U+1C80–U+1C8F | 9 | Old Slavonic letters |
| Cyrillic Extended-D | U+1E030–U+1E08F | 63 | Modifier letters for phonetic transcription |
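
As a rough illustration of how the block table can be used in code, the sketch below maps a character to its Cyrillic block by range lookup. The name `cyrillic_block` is hypothetical, and the ranges are copied from the table above rather than read from the Unicode Character Database.

```python
from typing import Optional

# (start, end, block name), taken from the block table above
CYRILLIC_BLOCKS = [
    (0x0400, 0x04FF, "Cyrillic"),
    (0x0500, 0x052F, "Cyrillic Supplement"),
    (0x1C80, 0x1C8F, "Cyrillic Extended-C"),
    (0x2DE0, 0x2DFF, "Cyrillic Extended-A"),
    (0xA640, 0xA69F, "Cyrillic Extended-B"),
    (0x1E030, 0x1E08F, "Cyrillic Extended-D"),
]

def cyrillic_block(ch: str) -> Optional[str]:
    """Return the Cyrillic block containing a single character, or None."""
    cp = ord(ch)
    for start, end, name in CYRILLIC_BLOCKS:
        if start <= cp <= end:
            return name
    return None

print(cyrillic_block("\u042F"))  # Я -> Cyrillic
print(cyrillic_block("\u0500"))  # -> Cyrillic Supplement
print(cyrillic_block("A"))       # Latin A -> None
```

A range check like this says which block a code point falls in, not whether the code point is actually assigned.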

The main Cyrillic block (U+0400–U+04FF) contains everything needed for all major modern languages:

Russian Alphabet (33 Letters)

А Б В Г Д Е Ж З И Й К Л М Н О П
а б в г д е ж з и й к л м н о п

Р С Т У Ф Х Ц Ч Ш Щ Ъ Ы Ь Э Ю Я
р с т у ф х ц ч ш щ ъ ы ь э ю я

Ё ё (U+0401 / U+0451) — often treated specially
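
One way to see why Ё is treated specially is to build the alphabet from code points. In this minimal sketch (the variable names are illustrative), А through Я form one contiguous run, while Ё has to be inserted by hand because it sits outside that run.

```python
# А..Я occupy U+0410..U+042F contiguously (32 letters)
upper = [chr(cp) for cp in range(0x0410, 0x0430)]

# Ё (U+0401) lives outside that run; insert it after Е (U+0415)
upper.insert(upper.index("\u0415") + 1, "\u0401")

RUSSIAN_UPPER = "".join(upper)
RUSSIAN_LOWER = RUSSIAN_UPPER.lower()

print(len(RUSSIAN_UPPER))  # 33
print(RUSSIAN_UPPER)
print(RUSSIAN_LOWER)

# The upper-to-lower offset is 0x20 in the main run but 0x50 for Ё
print(hex(ord("\u0430") - ord("\u0410")))  # 0x20
print(hex(ord("\u0451") - ord("\u0401")))  # 0x50
```

Because the offsets differ, arithmetic case conversion is fragile; `str.upper()` and `str.lower()` handle both cases correctly.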

Key Code Points

| Character | Code points (upper / lower) | Name |
| --- | --- | --- |
| А а | U+0410 / U+0430 | A |
| Б б | U+0411 / U+0431 | BE |
| В в | U+0412 / U+0432 | VE |
| Г г | U+0413 / U+0433 | GHE |
| Д д | U+0414 / U+0434 | DE |
| Е е | U+0415 / U+0435 | IE |
| Ё ё | U+0401 / U+0451 | IO |
| Ж ж | U+0416 / U+0436 | ZHE |
| З з | U+0417 / U+0437 | ZE |
| И и | U+0418 / U+0438 | I |
| К к | U+041A / U+043A | KA |
| Л л | U+041B / U+043B | EL |
| М м | U+041C / U+043C | EM |
| Н н | U+041D / U+043D | EN |
| О о | U+041E / U+043E | O |
| П п | U+041F / U+043F | PE |
| Р р | U+0420 / U+0440 | ER |
| С с | U+0421 / U+0441 | ES |
| Т т | U+0422 / U+0442 | TE |
| Ш ш | U+0428 / U+0448 | SHA |
| Щ щ | U+0429 / U+0449 | SHCHA |
| Ъ ъ | U+042A / U+044A | HARD SIGN |
| Ы ы | U+042B / U+044B | YERU |
| Ь ь | U+042C / U+044C | SOFT SIGN |

The Ё Problem

Russian letter Ё (ё) is notoriously under-used. Most Russian text — including newspapers, books, and websites — replaces ё with е (without the dieresis). This creates challenges for:

  • Search: Users may search for "ещё" or "еще" — both should match
  • Sorting: Ё's position in sort order varies by convention
  • NLP: Word frequency counts are skewed when ё and е are conflated

```python
# Handle Russian Ё/Е equivalence in search

def normalize_russian(text: str) -> str:
    # Fold ё → е and Ё → Е so both spellings compare equal (case is preserved).
    return text.replace("\u0451", "\u0435").replace("\u0401", "\u0415")

# Both should match
query = normalize_russian("\u0435\u0449\u0451")   # ещё → еще
text = normalize_russian("\u0435\u0449\u0435")     # еще → еще
assert query == text
```

Cyrillic–Latin Confusables

The most significant security issue with Cyrillic is that many uppercase and lowercase letters are visually identical to Latin characters:

| Cyrillic | Latin | Code Points (Cyrillic vs Latin) |
| --- | --- | --- |
| А а | A a | U+0410/U+0430 vs U+0041/U+0061 |
| В | B | U+0412 vs U+0042 |
| Е е | E e | U+0415/U+0435 vs U+0045/U+0065 |
| К | K | U+041A vs U+004B |
| М | M | U+041C vs U+004D |
| Н | H | U+041D vs U+0048 |
| О о | O o | U+041E/U+043E vs U+004F/U+006F |
| Р р | P p | U+0420/U+0440 vs U+0050/U+0070 |
| С с | C c | U+0421/U+0441 vs U+0043/U+0063 |
| Т | T | U+0422 vs U+0054 |
| Х х | X x | U+0425/U+0445 vs U+0058/U+0078 |
| у | y | U+0443 vs U+0079 |

This means an attacker can register a domain like аррӏе.com (every letter is Cyrillic: а, р, р, ӏ, and е) that looks identical to apple.com in many fonts. This is called a homoglyph attack or IDN homograph attack.

Defenses

  1. Browsers display Punycode (xn--...) for mixed-script domains
  2. IDNA 2008 restricts which code points may appear in domain labels; registries add their own script-mixing and confusable policies on top
  3. Unicode Technical Standard #39 (Unicode Security Mechanisms) provides confusable character data
  4. Application-level checks should detect mixed Cyrillic/Latin text
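
To make the first defense concrete, the sketch below shows what the xn-- form of a label looks like. It uses Python's raw `punycode` codec and deliberately skips the IDNA mapping and validation steps, so treat the hypothetical `to_ace_label` as an inspection aid rather than a conforming IDNA implementation.

```python
def to_ace_label(label: str) -> str:
    """Return the ASCII-compatible ("xn--") form of one domain label.

    Assumption: no IDNA mapping or validation is applied; this only
    Punycode-encodes the label as given.
    """
    if label.isascii():
        return label
    return "xn--" + label.encode("punycode").decode("ascii")

spoof = "\u0430\u0440\u0440\u04CF\u0435"  # Cyrillic lookalike of "apple"

print(to_ace_label("apple"))  # ASCII labels pass through unchanged
print(to_ace_label(spoof))    # an xn-- label that no longer resembles "apple"
```

Showing the encoded form removes the visual ambiguity: the two labels render alike as Unicode but are plainly different as ASCII.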

```python
def detect_mixed_scripts(text: str) -> set[str]:
    # Report which of Latin and Cyrillic appear in the text; finding both
    # suggests a potential homoglyph attack.
    scripts: set[str] = set()
    for char in text:
        if char.isalpha():
            cp = ord(char)
            # Main Cyrillic block plus Cyrillic Supplement
            if 0x0400 <= cp <= 0x052F:
                scripts.add("Cyrillic")
            # Basic Latin through Latin Extended-B (a coarse range check)
            elif 0x0041 <= cp <= 0x024F:
                scripts.add("Latin")
    return scripts

# Suspicious: looks like "apple" but uses Cyrillic
fake = "\u0430\u0440\u0440\u04CF\u0435"
result = detect_mixed_scripts(fake)
print(sorted(result))  # ['Cyrillic']: pure Cyrillic, but visually looks Latin

# Real attack: mix Cyrillic and Latin
mixed = "a\u0440\u0440le"  # Latin a, Cyrillic рр, Latin le
result = detect_mixed_scripts(mixed)
print(sorted(result))  # ['Cyrillic', 'Latin']: mixed! Flag this.
```
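
A mixed-script check does not flag the all-Cyrillic spoof, since only one script is present. A complementary idea is to reduce a string to a Latin "skeleton" and compare it against names worth protecting. The sketch below uses a small hand-written map based on the confusables table above plus U+04CF; the names `LOOKALIKES`, `latin_skeleton`, and `is_spoof_of` are hypothetical, and a real system would load the much larger confusables data published with UTS #39.

```python
# Cyrillic -> Latin lookalikes from the table above, plus palochka (U+04CF)
LOOKALIKES = {
    "\u0410": "A", "\u0430": "a", "\u0412": "B", "\u0415": "E", "\u0435": "e",
    "\u041A": "K", "\u041C": "M", "\u041D": "H", "\u041E": "O", "\u043E": "o",
    "\u0420": "P", "\u0440": "p", "\u0421": "C", "\u0441": "c", "\u0422": "T",
    "\u0425": "X", "\u0445": "x", "\u0443": "y", "\u04CF": "l",
}

def latin_skeleton(text: str) -> str:
    # Replace each known lookalike with its Latin twin; leave the rest as-is.
    return "".join(LOOKALIKES.get(ch, ch) for ch in text)

def is_spoof_of(candidate: str, protected: str) -> bool:
    # A spoof differs from the protected name but has the same skeleton.
    return candidate != protected and latin_skeleton(candidate) == protected

print(is_spoof_of("\u0430\u0440\u0440\u04CF\u0435", "apple"))  # True (all Cyrillic)
print(is_spoof_of("a\u0440\u0440le", "apple"))                 # True (mixed)
print(is_spoof_of("apple", "apple"))                           # False (the real name)
```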

Language-Specific Variants

A critical challenge with Cyrillic is that the same code point may render differently depending on the language. The most well-known examples:

Italic Forms

In Russian italic, the lowercase letters г, д, и, п, and т take shapes quite unlike their upright forms. To readers unfamiliar with Cyrillic, italic д, и, п, and т resemble Latin italic g, u, n, and m, while italic г resembles a mirrored s. Serbian and Macedonian italics differ even further.

Serbian vs Russian

Serbian Cyrillic uses the same Unicode code points as Russian for shared letters, but several glyphs should render differently:

| Letter | Russian form (upright) | Serbian form | Code Point |
| --- | --- | --- | --- |
| б | curved top | flat top in italic | U+0431 |
| г | like Γ | like i with a bar above, in italic | U+0433 |
| д | like Δ | like g in italic | U+0434 |
| п | like Π | like u with a bar above, in italic | U+043F |
| т | like T | like ш with a bar above, in italic | U+0442 |

This means font selection and the lang attribute are critical:

```html
<!-- Russian text uses Russian glyph variants -->
<p lang="ru">Београд</p>

<!-- Serbian text uses Serbian glyph variants -->
<p lang="sr">Београд</p>
```

```css
/* Ensure proper Serbian rendering */
:lang(sr) {
    font-family: "Noto Serif", serif;
    /* The font must include Serbian locl features */
}
```

Ukrainian

Ukrainian has four letters not in the Russian alphabet:

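| Letter | Unicode | Name | Russian Equivalent |
| --- | --- | --- | --- |
| Ґ ґ | U+0490 / U+0491 | GHE WITH UPTURN | Г (same hard g sound; Ukrainian Г is pronounced differently) |
| Є є | U+0404 / U+0454 | UKRAINIAN IE | Е (similar sound) |
| І і | U+0406 / U+0456 | BYELORUSSIAN-UKRAINIAN I | И |
| Ї ї | U+0407 / U+0457 | YI | (no equivalent) |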
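
These letter differences can serve as a crude signal for telling Russian from Ukrainian text. The heuristic below is only a sketch: `guess_ru_or_uk` is a hypothetical name, the Russian-only set (Ё, Ъ, Ы, Э) is an assumption about which letters the modern Ukrainian alphabet lacks, and letters such as І also occur in Belarusian and Kazakh, so real language identification needs a statistical model.

```python
UKRAINIAN_ONLY = set("\u0490\u0491\u0404\u0454\u0406\u0456\u0407\u0457")  # Ґґ Єє Іі Її
RUSSIAN_ONLY = set("\u0401\u0451\u042A\u044A\u042B\u044B\u042D\u044D")    # Ёё Ъъ Ыы Ээ

def guess_ru_or_uk(text: str) -> str:
    """Guess 'uk', 'ru', or 'unknown' from alphabet-specific letters."""
    uk = sum(ch in UKRAINIAN_ONLY for ch in text)
    ru = sum(ch in RUSSIAN_ONLY for ch in text)
    if uk > ru:
        return "uk"
    if ru > uk:
        return "ru"
    return "unknown"

print(guess_ru_or_uk("\u041A\u0438\u0457\u0432"))  # Київ -> uk
print(guess_ru_or_uk("\u042D\u0442\u043E"))        # Это  -> ru
print(guess_ru_or_uk("\u043C\u0430\u043C\u0430"))  # мама -> unknown
```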

Working with Cyrillic in Code

Python

```python
import unicodedata

# Russian text
text = "\u041F\u0440\u0438\u0432\u0435\u0442"  # Привет (Hello)
print(text.upper())  # ПРИВЕТ
print(text.lower())  # привет

# Inspect each character's code point, name, and general category
for ch in text:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} "
          f"cat={unicodedata.category(ch)}")

# Case-insensitive comparison with Ё handling
def russian_casefold(s: str) -> str:
    # Case-fold Russian text, treating Ё as Е.
    # casefold() lowercases first, so only ё needs replacing.
    return s.casefold().replace("\u0451", "\u0435")
```
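
Python's standard `unicodedata` module does not expose the Script property, so a script test has to be approximated. One workable proxy, sketched below with the hypothetical helper `is_cyrillic`, is the character name prefix. It misses characters whose names do not start with "CYRILLIC", such as the combining letters in Cyrillic Extended-A.

```python
import unicodedata

def is_cyrillic(ch: str) -> bool:
    # Approximation: most Cyrillic letters have names beginning with "CYRILLIC".
    # Unnamed or unassigned code points fall back to "" and return False.
    return unicodedata.name(ch, "").startswith("CYRILLIC")

def cyrillic_letters(text: str) -> list:
    # Collect the characters that the name-based test accepts.
    return [ch for ch in text if is_cyrillic(ch)]

print(is_cyrillic("\u041F"))  # П -> True
print(is_cyrillic("P"))       # Latin P -> False
print(cyrillic_letters("a\u0440\u0440le"))  # the two Cyrillic р characters
```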

JavaScript

```javascript
// Cyrillic regex
const cyrillicPattern = /\p{Script=Cyrillic}/u;
const text = "\u041F\u0440\u0438\u0432\u0435\u0442"; // Привет
console.log(cyrillicPattern.test(text)); // true

// Locale-aware sorting (Russian vs Ukrainian collation differs)
const russianCollator = new Intl.Collator("ru");
const ukrainianCollator = new Intl.Collator("uk");

const words = ["\u0430", "\u0431", "\u0432", "\u0491"]; // а, б, в, ґ
// Copy before sorting: Array.prototype.sort mutates the array in place
console.log([...words].sort(russianCollator.compare));
console.log([...words].sort(ukrainianCollator.compare));
```

Transliteration

Cyrillic text is often transliterated to Latin for URLs, filenames, and systems that do not support Cyrillic. Several standards exist:

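| Standard | Ш | Щ | Ж | Ч | Purpose |
| --- | --- | --- | --- | --- | --- |
| ISO 9 | š | ŝ | ž | č | Scholarly (1:1 reversible) |
| BGN/PCGN | sh | shch | zh | ch | US/UK geographic names |
| Passport (Russian) | sh | shch | zh | ch | Russian passports |
| Scientific | š | šč | ž | č | Academic transliteration |
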
```python
# Simple Russian transliteration
TRANSLIT_MAP: dict[str, str] = {
    "\u0430": "a", "\u0431": "b", "\u0432": "v", "\u0433": "g",
    "\u0434": "d", "\u0435": "e", "\u0451": "yo", "\u0436": "zh",
    "\u0437": "z", "\u0438": "i", "\u0439": "y", "\u043A": "k",
    "\u043B": "l", "\u043C": "m", "\u043D": "n", "\u043E": "o",
    "\u043F": "p", "\u0440": "r", "\u0441": "s", "\u0442": "t",
    "\u0443": "u", "\u0444": "f", "\u0445": "kh", "\u0446": "ts",
    "\u0447": "ch", "\u0448": "sh", "\u0449": "shch", "\u044A": "",
    "\u044B": "y", "\u044C": "", "\u044D": "e", "\u044E": "yu",
    "\u044F": "ya",
}

def transliterate_russian(text: str) -> str:
    # Lowercases the input; unmapped characters pass through unchanged.
    return "".join(TRANSLIT_MAP.get(ch, ch) for ch in text.lower())

print(transliterate_russian("\u041F\u0440\u0438\u0432\u0435\u0442"))
# "privet"
```
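
Because the standards disagree, it can help to make the scheme an explicit parameter. The sketch below covers only the four letters from the comparison table (Ш, Щ, Ж, Ч) and leaves all other characters untouched; `SCHEMES` and `transliterate_sibilants` are hypothetical names, and a complete implementation would need a full table per standard.

```python
# Per-scheme mappings for ш, щ, ж, ч, taken from the comparison table above
SCHEMES = {
    "iso9":       {"\u0448": "\u0161", "\u0449": "\u015D", "\u0436": "\u017E", "\u0447": "\u010D"},
    "bgn_pcgn":   {"\u0448": "sh", "\u0449": "shch", "\u0436": "zh", "\u0447": "ch"},
    "scientific": {"\u0448": "\u0161", "\u0449": "\u0161\u010D", "\u0436": "\u017E", "\u0447": "\u010D"},
}

def transliterate_sibilants(text: str, scheme: str) -> str:
    # str.maketrans accepts single-character keys mapped to replacement strings
    table = str.maketrans(SCHEMES[scheme])
    return text.lower().translate(table)

word = "\u0449\u0438"  # щи
print(transliterate_sibilants(word, "iso9"))        # ŝи
print(transliterate_sibilants(word, "bgn_pcgn"))    # shchи
print(transliterate_sibilants(word, "scientific"))  # ščи
```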

Summary

Cyrillic is a globally significant script whose Unicode support requires attention to language-specific details. Key takeaways:

  1. Cyrillic–Latin confusables are the single biggest security concern — always detect and flag mixed-script text in security-sensitive contexts
  2. Language-specific glyph variants (especially Serbian vs Russian italic forms) require correct lang attributes and OpenType locl features
  3. The Ё problem in Russian means search and comparison logic must treat Ё/ё and Е/е as equivalent
  4. Ukrainian, Serbian, Macedonian, and other languages have unique letters that extend beyond the Russian alphabet — do not assume Cyrillic = Russian
  5. Transliteration has no single standard — choose the appropriate scheme for your use case (scholarly, geographic, or passport)
  6. Multiple Unicode blocks cover Cyrillic: the main block (U+0400–U+04FF) suffices for the major modern languages, while some minority languages need Cyrillic Supplement and historic text may need the Extended blocks
