Writing Systems of the World · Chapter 8
Cyrillic: The Script That Spans Continents
Used by over 250 million people across Russia, Eastern Europe, and Central Asia, Cyrillic has many national variations. This chapter explores its linguistic diversity and the security implications of Latin-Cyrillic homoglyphs.
From the cobblestoned streets of Moscow to the steppe cities of Kazakhstan, from the Orthodox churches of Serbia to the mountain villages of Bulgaria, one script serves as the common writing system for a vast sweep of Eurasia: Cyrillic. Named after Saint Cyril — a 9th-century Byzantine scholar who may or may not have personally designed it — the Cyrillic alphabet today serves over a dozen official languages across Russia, Eastern Europe, the Balkans, and Central Asia, written by approximately 250 million people as their primary script. In Unicode, Cyrillic is one of the most complex scripts to encode correctly, partly because of its sheer breadth of languages and variants, and partly because of the security implications of its visual similarity to the Latin alphabet.
The Saints and the Script
In 862 CE, the Byzantine emperor Michael III sent two scholars — brothers Cyril (born Constantine) and Methodius — to the Great Moravian Empire (roughly modern Czech Republic and Slovakia) to introduce Christianity in the local Slavic language. To do this, they needed to translate the Bible and liturgical texts, which in turn required a writing system for Old Church Slavonic.
The script Cyril is said to have created was actually Glagolitic — a highly distinctive, curvilinear alphabet with no clear parallels to existing scripts. After Cyril's death in 869, his disciples in Bulgaria developed Cyrillic based primarily on the Greek alphabet, with additional letters created for sounds in Slavic languages that Greek lacked. This script, named in Cyril's honor, spread rapidly through the Orthodox Slavic world and ultimately supplanted Glagolitic almost everywhere.
The Cyrillic alphabet's Greek foundations are immediately apparent: А, В, Г, Д, Е, З, И, К, Л, М, Н, О, П, Р, С, Т, У, Ф, Х — many of these letters are visually identical or near-identical to their Greek counterparts. Letters like Б, Ж, Ц, Ч, Ш, Щ, Ъ, Ы, Ь, Э, Ю, Я were created for Slavic sounds absent from Greek.
Cyrillic Across Languages
As Cyrillic spread through the Russian Empire and later the Soviet Union, it was adapted for dozens of non-Slavic languages. The Soviet language policy of the 1930s–1940s systematically cyrillicized the writing systems of Central Asian and Caucasian peoples, often replacing existing Arabic, Latin, or indigenous scripts. The result was an explosion of Cyrillic variants:
| Language | Unique Cyrillic Letters | Notes |
|---|---|---|
| Russian | — (standard Cyrillic) | 33 letters |
| Ukrainian | Ї, І, Є, Ґ | Four letters absent from Russian Cyrillic |
| Belarusian | Ў | Short U for /w/ sound |
| Bulgarian | — | Ъ is a full vowel letter (/ɤ/); Ы, Э, and Ё are not used |
| Serbian | Ђ, Ј, Љ, Њ, Ћ, Џ | One letter per phoneme; Ж and Ч are shared with standard Cyrillic |
| Macedonian | Ѓ, Ѕ, Ј, Љ, Њ, Ќ, Џ | Distinct from Serbian Cyrillic |
| Mongolian | Ө, Ү | Russian alphabet plus two vowel letters; Cyrillic in use since the 1940s |
| Kazakh | 9 unique letters | Transitioning to Latin |
| Bashkir, Tatar | Multiple unique | Tatar includes Arabic-influenced sounds |
| Chuvash | Ӑ, Ӗ, Ҫ, Ӳ | Distinct Turkic sounds |
The Unicode Cyrillic block (U+0400–U+04FF) and its extensions handle this diversity:
| Block | Range | Count | Content |
|---|---|---|---|
| Cyrillic | U+0400–U+04FF | 256 | Modern + most extended Cyrillic |
| Cyrillic Supplement | U+0500–U+052F | 48 | Languages of the Russian Federation |
| Cyrillic Extended-A | U+2DE0–U+2DFF | 32 | Old Church Slavonic, historical |
| Cyrillic Extended-B | U+A640–U+A69F | 96 | Old Cyrillic, extended Old Slavic |
| Cyrillic Extended-C | U+1C80–U+1C8F | 16 | Historical letterform variants (9 characters assigned when the block was added) |
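As a minimal sketch of how software might use these ranges, the Python lookup below maps a character to the Cyrillic block that contains it. The ranges are copied from the table above; the names `CYRILLIC_BLOCKS` and `cyrillic_block` are illustrative and do not come from any library.

```python
from typing import Optional

# (block name, first code point, last code point), taken from the table above.
CYRILLIC_BLOCKS = [
    ("Cyrillic", 0x0400, 0x04FF),
    ("Cyrillic Supplement", 0x0500, 0x052F),
    ("Cyrillic Extended-C", 0x1C80, 0x1C8F),
    ("Cyrillic Extended-A", 0x2DE0, 0x2DFF),
    ("Cyrillic Extended-B", 0xA640, 0xA69F),
]


def cyrillic_block(ch: str) -> Optional[str]:
    """Return the name of the Cyrillic block containing ch, or None."""
    cp = ord(ch)
    for name, start, end in CYRILLIC_BLOCKS:
        if start <= cp <= end:
            return name
    return None


print(cyrillic_block("\u0416"))  # Ж, in the main block -> Cyrillic
print(cyrillic_block("a"))       # Latin letter -> None
```

A range check like this only says where a code point lives. It does not say whether the code point is assigned or which language uses it.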
Russian Orthographic Reform
Pre-revolutionary Russian (before 1918) used four additional letters that were abolished by Soviet decree: Ѣ (yat, U+0462), Ѳ (fita, U+0472), І (decimal i, U+0406), and Ѵ (izhitsa, U+0474). These appear in historical texts, pre-revolutionary reprints, and Church Slavonic documents. Unicode encodes all four, in both cases, in the main Cyrillic block (U+0400–U+04FF), ensuring that digitized pre-1918 Russian texts can be represented faithfully.
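One way a digitization pipeline might flag pre-reform spelling is to scan for these four letters. The sketch below is an assumption about how such a check could look, not an established tool; `PRE_REFORM` and `pre_reform_letters` are hypothetical names. Both the capital and small code points are listed. Because І/і is also a living letter in Ukrainian, Belarusian, and Kazakh, the result is only meaningful for text already known to be Russian.

```python
# Capital and small forms of the four letters dropped in the 1918 reform.
PRE_REFORM = {
    "\u0462": "yat", "\u0463": "yat",
    "\u0472": "fita", "\u0473": "fita",
    "\u0406": "decimal i", "\u0456": "decimal i",
    "\u0474": "izhitsa", "\u0475": "izhitsa",
}


def pre_reform_letters(text: str) -> list:
    """Return the sorted names of pre-1918 letters that occur in text."""
    return sorted({PRE_REFORM[ch] for ch in text if ch in PRE_REFORM})


# "міръ" is the pre-reform spelling of the word for "world".
print(pre_reform_letters("\u043c\u0456\u0440\u044a"))  # ['decimal i']
print(pre_reform_letters("\u043c\u0438\u0440"))        # []
```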
The Confusables Problem
The visual overlap between Cyrillic and Latin letters is even more extensive than the Greek-Latin overlap, creating significant security concerns:
| Cyrillic | Unicode | Latin | Unicode | Appearance |
|---|---|---|---|---|
| а | U+0430 | a | U+0061 | Identical (lowercase) |
| е | U+0435 | e | U+0065 | Identical |
| о | U+043E | o | U+006F | Identical |
| р | U+0440 | p | U+0070 | Identical |
| с | U+0441 | c | U+0063 | Identical |
| у | U+0443 | y | U+0079 | Identical |
| х | U+0445 | x | U+0078 | Identical |
| В | U+0412 | B | U+0042 | Identical (uppercase) |
| Е | U+0415 | E | U+0045 | Identical |
| М | U+041C | M | U+004D | Identical |
| Н | U+041D | H | U+0048 | Identical |
| О | U+041E | O | U+004F | Identical |
| Р | U+0420 | P | U+0050 | Identical |
| С | U+0421 | C | U+0043 | Identical |
| Т | U+0422 | T | U+0054 | Identical |
| Х | U+0425 | X | U+0058 | Identical |
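The pairs in this table can be turned into a small lookup that folds Cyrillic look-alikes onto their Latin twins. The sketch below is loosely modeled on the "skeleton" idea from Unicode's security guidance, but it is a deliberate simplification that covers only the rows above; `CYRILLIC_TO_LATIN`, `latin_skeleton`, and `looks_like` are illustrative names.

```python
# Cyrillic -> Latin look-alikes, limited to the pairs listed in the table.
CYRILLIC_TO_LATIN = {
    "\u0430": "a", "\u0435": "e", "\u043e": "o", "\u0440": "p",
    "\u0441": "c", "\u0443": "y", "\u0445": "x",
    "\u0412": "B", "\u0415": "E", "\u041c": "M", "\u041d": "H",
    "\u041e": "O", "\u0420": "P", "\u0421": "C", "\u0422": "T",
    "\u0425": "X",
}


def latin_skeleton(text: str) -> str:
    """Replace each listed Cyrillic look-alike with its Latin counterpart."""
    return "".join(CYRILLIC_TO_LATIN.get(ch, ch) for ch in text)


def looks_like(candidate: str, target: str) -> bool:
    """True if the strings differ but fold to the same skeleton."""
    return candidate != target and latin_skeleton(candidate) == latin_skeleton(target)


spoof = "\u0440aypal"  # Cyrillic р followed by Latin "aypal"
print(latin_skeleton(spoof))        # paypal
print(looks_like(spoof, "paypal"))  # True
```

A production check would more plausibly draw on the full confusables data than on a hand-built table like this one.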
This overlap is not coincidental — both scripts ultimately derive from the same Greek ancestor. But it creates real security vulnerabilities:
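IDN Homograph Attacks: A domain like рaypal.com (Cyrillic р instead of Latin p) is visually indistinguishable from paypal.com in many fonts. In 2017, a security researcher registered a domain that spelled "apple" entirely in Cyrillic look-alike letters and demonstrated how convincingly some browsers rendered it as apple.com. Because every letter came from one script, the label contained no script mixing for a mixed-script check to detect.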
Mitigations: ICANN's guidelines restrict mixing scripts in a single domain label. Modern browsers display punycode (the ACE encoding of IDN labels) when mixed-script domains are detected. Unicode's confusables data (in confusables.txt) documents character pairs and groups that are visually similar, providing a reference for security-sensitive applications.
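The mixed-script check described above can be approximated in a few lines of Python. This is a rough sketch, not how any particular browser works: it infers script from the Unicode character name rather than the Script property, and the function names are hypothetical. The `idna` codec is Python's built-in IDNA 2003 encoder, used here only to show the ASCII-compatible form of a label.

```python
import unicodedata


def scripts_in(label: str) -> set:
    """Approximate the scripts in a label from Unicode character names."""
    found = set()
    for ch in label:
        name = unicodedata.name(ch, "")
        if name.startswith("CYRILLIC"):
            found.add("Cyrillic")
        elif name.startswith("LATIN"):
            found.add("Latin")
    return found  # digits and hyphens contribute nothing


def is_mixed_script(label: str) -> bool:
    """True if the label combines Latin and Cyrillic letters."""
    return len(scripts_in(label)) > 1


def to_ace(label: str) -> str:
    """Encode a single label in its ASCII-compatible (punycode) form."""
    return label.encode("idna").decode("ascii")


spoof = "\u0440aypal"              # Cyrillic р + Latin "aypal"
print(is_mixed_script(spoof))      # True
print(is_mixed_script("paypal"))   # False
print(to_ace(spoof))               # an "xn--" label instead of the look-alike text
```

A label written entirely in Cyrillic passes this check, which is why mixed-script detection on its own does not stop whole-script spoofs.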
Cyrillic for Non-Slavic Languages
Some of the most phonologically interesting Cyrillic letters were created for non-Slavic Central Asian and Siberian languages:
- Ғ (U+0492): Used in Kazakh, Uzbek, Tajik — a voiced uvular fricative
- Қ (U+049A): Voiceless uvular stop, common in Turkic languages
- Ң (U+04A2): Velar nasal, for -ng- sounds
- Ү (U+04AE): Front rounded vowel in Turkic languages such as Kazakh and Tatar; also used in Mongolian
- Ӑ (U+04D0): Short (reduced) A in Chuvash
- Ӡ (U+04E0): Abkhazian dze, an affricate letter used in Abkhaz
- Ꚑ (U+A690): Tsse, one of the Cyrillic Extended-B letters from older Abkhaz orthography
The Post-Soviet Script Shifts
Several Soviet and post-Soviet republics have shifted away from Cyrillic since the late 1980s, motivated by desires to distance their national identity from Russian cultural influence and to improve compatibility with the Latin-dominant internet:
- Moldova: Switched from Cyrillic back to Latin (Romanian uses Latin) in 1989
- Azerbaijan: Switched to Latin in 1991, fully by 2001
- Uzbekistan: Gradually transitioning to Latin since 1993 (still ongoing)
- Turkmenistan: Switched to Latin in 1993
- Kazakhstan: Announced transition to Latin in 2017, ongoing implementation
Mongolia, though not a former Soviet republic, still uses Soviet-introduced Cyrillic for standard Mongolian. The traditional Mongolian script (vertical, encoded in Unicode at U+1800–U+18AF) has official status alongside Cyrillic and is seeing a revival.
These geopolitical transitions create encoding challenges for digital archives, legacy databases, and historical documents. Unicode encodes all the necessary characters for both the Cyrillic and Latin forms of these languages, but accurate representation requires knowing which orthographic era a document comes from.
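As a first, rough signal of which orthographic era a record comes from, an archive tool might simply count letters by script. The sketch below assumes that a Cyrillic or Latin majority is a good enough hint. It cannot date a document and it ignores Arabic-script material entirely. `dominant_script` is a hypothetical helper name.

```python
import unicodedata
from collections import Counter
from typing import Optional


def dominant_script(text: str) -> Optional[str]:
    """Return 'Cyrillic' or 'Latin', whichever has more letters, or None."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, spaces, punctuation
        name = unicodedata.name(ch, "")
        if name.startswith("CYRILLIC"):
            counts["Cyrillic"] += 1
        elif name.startswith("LATIN"):
            counts["Latin"] += 1
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# "Kazakhstan" in Kazakh Cyrillic, then in a Latin spelling.
print(dominant_script("\u049a\u0430\u0437\u0430\u049b\u0441\u0442\u0430\u043d"))  # Cyrillic
print(dominant_script("Qazaqstan"))                                               # Latin
```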