Cyrillic Script

Cyrillic is used to write Russian, Ukrainian, Bulgarian, Serbian, and over 50 other languages, making it one of the most widely used scripts in Unicode with characters spread across several Cyrillic blocks. This guide explores the Cyrillic script's history, its Unicode representation, and the challenges of supporting the full range of Cyrillic-using languages.

Published 2023-06-19 · Updated 2024-10-07

Cyrillic is one of the most widely used scripts in the world, serving as the writing system for Russian, Ukrainian, Bulgarian, Serbian, Kazakh, Mongolian, and over 50 other languages across Eastern Europe and Central Asia. With more than 250 million native users and official status in numerous countries, Cyrillic is a script that any developer working with international text will encounter. Its visual similarity to Latin letters makes it a prime vector for homoglyph attacks, while its language-specific variants demand careful font and locale handling. This guide explores Cyrillic's Unicode encoding, its history, and the technical challenges of supporting the full range of Cyrillic-using languages.

History and Origins

The Cyrillic script was developed in the First Bulgarian Empire in the 9th century CE, traditionally attributed to the disciples of Saints Cyril and Methodius. (Despite the name, the "Cyrillic" script was likely created by Clement of Ohrid; Saints Cyril and Methodius created the earlier Glagolitic script.) Cyrillic was derived primarily from the Greek uncial script, with additional letters for Slavic sounds not found in Greek.

The script spread rapidly through the Orthodox Christian world:

Century	Event
9th	Created in Bulgaria, based on Greek uncial
10th	Adopted by Kievan Rus' (precursor to Russia, Ukraine)
14th	Spread to Serbia, Romania (until 19th c.)
18th	Peter the Great reforms Russian Cyrillic (Civil Script)
20th	Soviet Union extends Cyrillic to Central Asian languages
21st	Kazakhstan begins transition from Cyrillic to Latin

Peter the Great's reform of 1708 simplified the letter shapes and removed some archaic characters, creating the civil script (гражданский шрифт) that is the basis of modern Russian Cyrillic. Other Cyrillic-using languages underwent their own reforms at various points.

Languages Using Cyrillic

Cyrillic serves an extraordinarily diverse set of languages, each with its own subset of letters:

Language	Speakers	Letters	Notable Characters
Russian	258M	33	Ё ё, Ъ ъ, Ы ы
Ukrainian	45M	33	Ї ї, І і, Ґ ґ, Є є
Bulgarian	8M	30	Russian set without Ё, Ы, Э
Serbian	12M	30	Љ љ, Њ њ, Ћ ћ, Ђ ђ, Џ џ
Macedonian	2M	31	Ќ ќ, Ѓ ѓ, Ѕ ѕ
Belarusian	5M	32	Ў ў (short U)
Kazakh	13M	42	Extended with diacritics
Mongolian	6M	35	Extended letters
Tajik	9M	35	Extended with diacritics
Kyrgyz	5M	36	Extended letters

Unicode Blocks for Cyrillic

Block	Range	Characters	Purpose
Cyrillic	U+0400–U+04FF	256	Core Slavic + common extensions
Cyrillic Supplement	U+0500–U+052F	48	Minority languages (Komi, Abkhaz)
Cyrillic Extended-A	U+2DE0–U+2DFF	32	Old Church Slavonic diacritics
Cyrillic Extended-B	U+A640–U+A69F	96	Historic and Old Slavonic letters
Cyrillic Extended-C	U+1C80–U+1C8F	11 (9 before Unicode 16.0)	Historic letter variants, plus a letter pair added in Unicode 16.0
Cyrillic Extended-D	U+1E030–U+1E08F	63	Modifier letters for phonetic transcription
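
As a rough illustration of how the block ranges in this table might be used, the sketch below maps a character to the Cyrillic block that contains it. The ranges are copied from the table; the names CYRILLIC_BLOCKS and cyrillic_block are hypothetical helpers, not part of any library.

```python
from typing import Optional

# (start, end, block name), with ranges copied from the table above.
CYRILLIC_BLOCKS = [
    (0x0400, 0x04FF, "Cyrillic"),
    (0x0500, 0x052F, "Cyrillic Supplement"),
    (0x1C80, 0x1C8F, "Cyrillic Extended-C"),
    (0x2DE0, 0x2DFF, "Cyrillic Extended-A"),
    (0xA640, 0xA69F, "Cyrillic Extended-B"),
    (0x1E030, 0x1E08F, "Cyrillic Extended-D"),
]

def cyrillic_block(char: str) -> Optional[str]:
    """Return the name of the Cyrillic block containing char, or None."""
    cp = ord(char)
    for start, end, name in CYRILLIC_BLOCKS:
        if start <= cp <= end:
            return name
    return None

print(cyrillic_block("\u0416"))  # Cyrillic (the letter Ж)
print(cyrillic_block("\u0512"))  # Cyrillic Supplement
print(cyrillic_block("Z"))       # None
```

A block lookup only says where a code point lives. It does not say whether the code point is assigned or which language uses it.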

The main Cyrillic block (U+0400–U+04FF) contains everything needed for all major modern languages:

Russian Alphabet (33 Letters)

А Б В Г Д Е Ж З И Й К Л М Н О П
а б в г д е ж з и й к л м н о п

Р С Т У Ф Х Ц Ч Ш Щ Ъ Ы Ь Э Ю Я
р с т у ф х ц ч ш щ ъ ы ь э ю я

Ё ё (U+0401 / U+0451) — often treated specially

Key Code Points

Character	Unicode	Name
А а	U+0410 / U+0430	A
Б б	U+0411 / U+0431	BE
В в	U+0412 / U+0432	VE
Г г	U+0413 / U+0433	GHE
Д д	U+0414 / U+0434	DE
Е е	U+0415 / U+0435	IE
Ё ё	U+0401 / U+0451	IO
Ж ж	U+0416 / U+0436	ZHE
З з	U+0417 / U+0437	ZE
И и	U+0418 / U+0438	I
К к	U+041A / U+043A	KA
Л л	U+041B / U+043B	EL
М м	U+041C / U+043C	EM
Н н	U+041D / U+043D	EN
О о	U+041E / U+043E	O
П п	U+041F / U+043F	PE
Р р	U+0420 / U+0440	ER
С с	U+0421 / U+0441	ES
Т т	U+0422 / U+0442	TE
Ш ш	U+0428 / U+0448	SHA
Щ щ	U+0429 / U+0449	SHCHA
Ъ ъ	U+042A / U+044A	HARD SIGN
Ы ы	U+042B / U+044B	YERU
Ь ь	U+042C / U+044C	SOFT SIGN
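
The code points in this table follow a regular pattern: each lowercase letter in the А–Я run sits 0x20 above its uppercase form, while Ё/ё (like the rest of U+0400–U+040F) differ by 0x50. The sketch below spells that out and cross-checks it against Python's built-in case mapping. The function name cyrillic_lower is a hypothetical helper; real code should simply call str.lower().

```python
def cyrillic_lower(cp: int) -> int:
    """Map an uppercase code point in U+0400..U+042F to its lowercase pair."""
    if 0x0410 <= cp <= 0x042F:   # А..Я map to а..я
        return cp + 0x20
    if 0x0400 <= cp <= 0x040F:   # Ѐ..Џ (including Ё) map to ѐ..џ
        return cp + 0x50
    return cp                    # anything else is left unchanged

# Cross-check the offsets against Python's own case mapping.
for cp in range(0x0400, 0x0430):
    assert chr(cyrillic_lower(cp)) == chr(cp).lower()

print(hex(cyrillic_lower(0x0401)))  # 0x451 (Ё to ё)
print(hex(cyrillic_lower(0x0410)))  # 0x430 (А to а)
```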

The Ё Problem

Russian letter Ё (ё) is notoriously under-used. Most Russian text — including newspapers, books, and websites — replaces ё with е (without the dieresis). This creates challenges for:

Search: Users may search for "ещё" or "еще" — both should match
Sorting: Ё's position in sort order varies by convention
NLP: Word frequency counts are skewed when ё and е are conflated

# Handle Russian Ё/Е equivalence in search
def normalize_russian(text: str) -> str:
    # Fold ё → е and Ё → Е so both spellings compare equal.
    # Case is preserved; apply casefold() separately for case-insensitive search.
    return text.replace("\u0451", "\u0435").replace("\u0401", "\u0415")

# Both should match
query = normalize_russian("\u0435\u0449\u0451")   # ещё → еще
text = normalize_russian("\u0435\u0449\u0435")     # еще → еще
assert query == text
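
The sorting concern can be sketched in the same spirit. Plain code point order puts ё (U+0451) after я (U+044F), which matches no dictionary convention. The hypothetical yo_sort_key below assumes one common convention: ё sorts alongside е, and е comes first when two words are otherwise identical. It also relies on а–я being in alphabetical order by code point, which holds for Russian but not for every Cyrillic alphabet.

```python
def yo_sort_key(word: str) -> tuple[str, str]:
    """Sort key: compare with ё folded to е, then break ties on the original."""
    lowered = word.casefold()
    return (lowered.replace("\u0451", "\u0435"), lowered)

# ёж, ель, еж, дом
words = ["\u0451\u0436", "\u0435\u043B\u044C", "\u0435\u0436", "\u0434\u043E\u043C"]

print(sorted(words))                   # ['дом', 'еж', 'ель', 'ёж']
print(sorted(words, key=yo_sort_key))  # ['дом', 'еж', 'ёж', 'ель']
```

For production use, a locale-aware collator (ICU, for example) is a safer choice than a hand-written key.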

Cyrillic–Latin Confusables

The most significant security issue with Cyrillic is that many uppercase and lowercase letters are visually identical to Latin characters:

Cyrillic	Latin	Code Points
А а	A a	U+0410/U+0430 vs U+0041/U+0061
В	B	U+0412 vs U+0042
Е е	E e	U+0415/U+0435 vs U+0045/U+0065
К	K	U+041A vs U+004B
М	M	U+041C vs U+004D
Н	H	U+041D vs U+0048
О о	O o	U+041E/U+043E vs U+004F/U+006F
Р р	P p	U+0420/U+0440 vs U+0050/U+0070
С с	C c	U+0421/U+0441 vs U+0043/U+0063
Т	T	U+0422 vs U+0054
Х х	X x	U+0425/U+0445 vs U+0058/U+0078
у	y	U+0443 vs U+0079

This means an attacker can register a domain like аррӏе.com (using Cyrillic а, р, р, ӏ, and е) that looks identical to apple.com in many fonts. This is called a homoglyph attack or IDN homograph attack.
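
To make the spoofed label concrete, the sketch below converts it to the ASCII form in which internationalized labels travel through DNS (Punycode with an xn-- prefix), using Python's built-in punycode codec. The helper names to_ascii_label and from_ascii_label are hypothetical, and the code skips the mapping and validation steps that a full IDNA implementation performs.

```python
def to_ascii_label(label: str) -> str:
    """Encode one domain label in its ASCII-compatible (xn--) form."""
    if label.isascii():
        return label
    return "xn--" + label.encode("punycode").decode("ascii")

def from_ascii_label(label: str) -> str:
    """Decode an xn-- label back to Unicode; other labels pass through."""
    if label.startswith("xn--"):
        return label[4:].encode("ascii").decode("punycode")
    return label

fake = "\u0430\u0440\u0440\u04CF\u0435"  # Cyrillic look-alike of "apple"
encoded = to_ascii_label(fake)

print(encoded)                            # an ASCII string starting with "xn--"
print(from_ascii_label(encoded) == fake)  # True
print(to_ascii_label("apple"))            # apple (already ASCII, unchanged)
```

Showing the encoded form to the user is one way to make a look-alike label visibly different from the genuine one.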

Defenses

Browsers display Punycode (xn--...) for mixed-script domains
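IDNA 2008 restricts which code points are valid in domain labels; policies on confusable labels are left to registries and browsers
Unicode Technical Standard #39 (Unicode Security Mechanisms) provides confusable character data and mixed-script detection rules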
Application-level checks should detect mixed Cyrillic/Latin text

def detect_mixed_scripts(text: str) -> set[str]:
    # Report which of Latin and Cyrillic appear in text.
    # Both together may indicate a homoglyph attack.
    scripts: set[str] = set()
    for char in text:
        if char.isalpha():
            cp = ord(char)
            if 0x0400 <= cp <= 0x052F:    # Cyrillic + Cyrillic Supplement
                scripts.add("Cyrillic")
            elif 0x0041 <= cp <= 0x024F:  # Basic Latin through Latin Extended-B
                scripts.add("Latin")
    return scripts

# Suspicious: looks like "apple" but uses only Cyrillic letters
fake = "\u0430\u0440\u0440\u04CF\u0435"
result = detect_mixed_scripts(fake)
print(sorted(result))  # ['Cyrillic']: a single script, so this check alone misses it

# Mixed-script spoof: Latin a, Cyrillic рр, Latin le
mixed = "a\u0440\u0440le"
result = detect_mixed_scripts(mixed)
print(sorted(result))  # ['Cyrillic', 'Latin']: mixed, flag this
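
A mixed-script check does not catch a label written entirely in Cyrillic look-alikes. One complementary idea, sketched below, is to collapse look-alike letters to a common "skeleton" and compare the result with the names you want to protect. The mapping covers only the pairs from the confusables table above plus the palochka (U+04CF) from the spoofed example. It is an illustrative subset, not the UTS #39 data, and the names latin_skeleton and spoofs are hypothetical.

```python
# Cyrillic look-alikes mapped to Latin counterparts (subset, for illustration).
CYRILLIC_TO_LATIN = str.maketrans({
    "\u0410": "A", "\u0430": "a", "\u0412": "B", "\u0415": "E",
    "\u0435": "e", "\u041A": "K", "\u041C": "M", "\u041D": "H",
    "\u041E": "O", "\u043E": "o", "\u0420": "P", "\u0440": "p",
    "\u0421": "C", "\u0441": "c", "\u0422": "T", "\u0425": "X",
    "\u0445": "x", "\u0443": "y", "\u04CF": "l",
})

def latin_skeleton(text: str) -> str:
    """Replace known Cyrillic look-alikes with their Latin counterparts."""
    return text.translate(CYRILLIC_TO_LATIN)

def spoofs(candidate: str, protected: str) -> bool:
    """True if candidate differs from protected but has the same skeleton."""
    return (candidate != protected
            and latin_skeleton(candidate) == latin_skeleton(protected))

print(spoofs("\u0430\u0440\u0440\u04CF\u0435", "apple"))  # True: all-Cyrillic spoof
print(spoofs("a\u0440\u0440le", "apple"))                 # True: mixed-script spoof
print(spoofs("apple", "apple"))                           # False: the genuine name
```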

Language-Specific Variants

A critical challenge with Cyrillic is that the same code point may render differently depending on the language. The most well-known examples:

Italic Forms

In Russian italic, the lowercase letters г, д, и, п, т take shapes quite unlike their upright forms: to non-Russian readers, г resembles a reversed s, д a Latin g or ∂, и a Latin u, п a Latin n, and т a Latin m. Serbian and Macedonian italics use different shapes again for several of these letters.

Serbian vs Russian

Serbian Cyrillic uses the same Unicode code points as Russian for shared letters, but several glyphs should render differently:

Letter	Russian form	Serbian form	Code Point
б	б (curved top)	б (flat top in italic)	U+0431
г	г (like Γ)	г (like ī in italic)	U+0433
д	д (like Δ)	д (like g in italic)	U+0434
п	п (like Π)	п (like ū in italic)	U+043F
т	т (like T)	т (like ш with an overbar in italic)	U+0442

This means font selection and the lang attribute are critical:

<!-- Russian text uses Russian glyph variants -->
<p lang="ru">Београд</p>

<!-- Serbian text uses Serbian glyph variants -->
<p lang="sr">Београд</p>

/* Ensure proper Serbian rendering */
:lang(sr) {
    font-family: "Noto Serif", serif;
    /* The font must include Serbian locl features */
}

Ukrainian

Ukrainian has four letters not in the Russian alphabet:

Letter	Unicode	Name	Russian Equivalent
Ґ ґ	U+0490 / U+0491	GHE WITH UPTURN	(no equivalent)
Є є	U+0404 / U+0454	UKRAINIAN IE	Е (similar sound; Э only looks similar)
І і	U+0406 / U+0456	BYELORUSSIAN-UKRAINIAN I	И
Ї ї	U+0407 / U+0457	YI	(no equivalent)
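
Because these four letters never occur in Russian, and Russian Ё, Ъ, Ы, Э never occur in Ukrainian, their presence can serve as a cheap hint about which language a string is in. The sketch below is a heuristic under that assumption, with the hypothetical name guess_ru_or_uk. It cannot tell short strings apart when none of the distinguishing letters appear, and І also occurs in Belarusian and Kazakh.

```python
UKRAINIAN_ONLY = frozenset("\u0490\u0491\u0404\u0454\u0406\u0456\u0407\u0457")  # Ґґ Єє Іі Її
RUSSIAN_ONLY = frozenset("\u0401\u0451\u042A\u044A\u042B\u044B\u042D\u044D")    # Ёё Ъъ Ыы Ээ

def guess_ru_or_uk(text: str) -> str:
    """Return 'uk', 'ru', or 'unknown' based on distinguishing letters only."""
    chars = set(text)
    has_uk = bool(chars & UKRAINIAN_ONLY)
    has_ru = bool(chars & RUSSIAN_ONLY)
    if has_uk and not has_ru:
        return "uk"
    if has_ru and not has_uk:
        return "ru"
    return "unknown"  # no distinguishing letters, or a mix of both

print(guess_ru_or_uk("\u041A\u0438\u0457\u0432"))              # uk (Київ)
print(guess_ru_or_uk("\u042D\u0442\u043E"))                    # ru (Это)
print(guess_ru_or_uk("\u041F\u0440\u0438\u0432\u0435\u0442"))  # unknown (Привет)
```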

Working with Cyrillic in Code

Python

import unicodedata

# Russian text
text = "\u041F\u0440\u0438\u0432\u0435\u0442"  # Привет (Hello)
print(text.upper())  # ПРИВЕТ
print(text.lower())  # привет

# Inspect each character's code point, name, and general category
for ch in text:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} "
          f"cat={unicodedata.category(ch)}")

# Case-insensitive comparison with Ё handling
def russian_casefold(s: str) -> str:
    # Case-fold Russian text, treating Ё as Е.
    return s.casefold().replace("\u0451", "\u0435")
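
Python's standard library does not expose the Unicode Script property directly. A common workaround, sketched below, is to look for the word CYRILLIC in the character name, which covers every Cyrillic block without hard-coding ranges. The helper names is_cyrillic and is_all_cyrillic are hypothetical, and the result depends on the Unicode version bundled with the interpreter.

```python
import unicodedata

def is_cyrillic(char: str) -> bool:
    """Heuristic script test: the Unicode character name mentions CYRILLIC."""
    return "CYRILLIC" in unicodedata.name(char, "")

def is_all_cyrillic(text: str) -> bool:
    """True if text has at least one letter and every letter is Cyrillic."""
    letters = [ch for ch in text if ch.isalpha()]
    return bool(letters) and all(is_cyrillic(ch) for ch in letters)

print(is_cyrillic("\u0416"))  # True (Ж)
print(is_cyrillic("Z"))       # False
print(is_all_cyrillic("\u041F\u0440\u0438\u0432\u0435\u0442, \u043C\u0438\u0440"))  # True (Привет, мир)
print(is_all_cyrillic("a\u0440\u0440le"))  # False (contains Latin letters)
```

Where the third-party regex module or ICU bindings are available, a true Script property test is preferable to name matching.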

JavaScript

// Cyrillic regex
const cyrillicPattern = /\p{Script=Cyrillic}/u;
const text = "\u041F\u0440\u0438\u0432\u0435\u0442";
console.log(cyrillicPattern.test(text)); // true

// Locale-aware sorting (Russian and Ukrainian collation rules can differ)
const russianCollator = new Intl.Collator("ru");
const ukrainianCollator = new Intl.Collator("uk");

const words = ["\u0430", "\u0431", "\u0432", "\u0491"];
// sort() mutates the array in place, so sort copies to keep the results independent
console.log([...words].sort(russianCollator.compare));
console.log([...words].sort(ukrainianCollator.compare));

Transliteration

Cyrillic text is often transliterated to Latin for URLs, filenames, and systems that do not support Cyrillic. Several standards exist:

Standard	Ш	Щ	Ж	Ч	Purpose
ISO 9	š	ŝ	ž	č	Scholarly (1:1 reversible)
BGN/PCGN	sh	shch	zh	ch	US/UK geographic names
Passport (Russian)	sh	shch	zh	ch	Russian passports
Scientific	š	šč	ž	č	Academic transliteration

# Simple Russian transliteration (informal, close to BGN/PCGN)
# Note: input is lowercased, and the hard and soft signs are dropped
TRANSLIT_MAP: dict[str, str] = {
    "\u0430": "a", "\u0431": "b", "\u0432": "v", "\u0433": "g",
    "\u0434": "d", "\u0435": "e", "\u0451": "yo", "\u0436": "zh",
    "\u0437": "z", "\u0438": "i", "\u0439": "y", "\u043A": "k",
    "\u043B": "l", "\u043C": "m", "\u043D": "n", "\u043E": "o",
    "\u043F": "p", "\u0440": "r", "\u0441": "s", "\u0442": "t",
    "\u0443": "u", "\u0444": "f", "\u0445": "kh", "\u0446": "ts",
    "\u0447": "ch", "\u0448": "sh", "\u0449": "shch", "\u044A": "",
    "\u044B": "y", "\u044C": "", "\u044D": "e", "\u044E": "yu",
    "\u044F": "ya",
}

def transliterate_russian(text: str) -> str:
    result = []
    for ch in text.lower():
        result.append(TRANSLIT_MAP.get(ch, ch))
    return "".join(result)

print(transliterate_russian("\u041F\u0440\u0438\u0432\u0435\u0442"))
# "privet"
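
Since URLs are one of the use cases named above, here is a minimal sketch of a slug generator that builds on the same idea. It repeats the letter choices from TRANSLIT_MAP so that it runs on its own. The name slugify_russian and the rule of collapsing everything outside a–z and 0–9 into a hyphen are assumptions for illustration, not a standard.

```python
import re

# Same letter-to-Latin choices as TRANSLIT_MAP above (an informal scheme).
_CYRILLIC = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"
_LATIN = [
    "a", "b", "v", "g", "d", "e", "yo", "zh", "z", "i", "y",
    "k", "l", "m", "n", "o", "p", "r", "s", "t", "u", "f",
    "kh", "ts", "ch", "sh", "shch", "", "y", "", "e", "yu", "ya",
]
assert len(_CYRILLIC) == len(_LATIN) == 33
_TABLE = str.maketrans(dict(zip(_CYRILLIC, _LATIN)))

def slugify_russian(title: str) -> str:
    """Transliterate a Russian title and reduce it to a URL-safe slug."""
    latin = title.lower().translate(_TABLE)
    # Collapse every run of non-alphanumeric characters into one hyphen.
    return re.sub(r"[^a-z0-9]+", "-", latin).strip("-")

print(slugify_russian("Привет, мир!"))  # privet-mir
print(slugify_russian("Ещё раз"))       # eshchyo-raz
```

Using str.translate with a prebuilt table avoids the per-character dictionary lookup loop, and the result is lossy by design: it cannot be converted back to Cyrillic.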

Summary

Cyrillic is a globally significant script whose Unicode support requires attention to language-specific details. Key takeaways:

Cyrillic–Latin confusables are the single biggest security concern — always detect and flag mixed-script text in security-sensitive contexts
Language-specific glyph variants (especially Serbian vs Russian italic forms) require correct lang attributes and OpenType locl features
The Ё problem in Russian means search and comparison logic must treat Ё/ё and Е/е as equivalent
Ukrainian, Serbian, Macedonian, and other languages have unique letters that extend beyond the Russian alphabet — do not assume Cyrillic = Russian
Transliteration has no single standard — choose the appropriate scheme for your use case (scholarly, geographic, or passport)
Multiple Unicode blocks cover Cyrillic: the main block (U+0400–U+04FF) suffices for all major modern languages, while some minority languages and historic text need the Supplement and Extended blocks