Writing Systems of the World · Chapter 8

Cyrillic: The Script That Spans Continents

Used by over 250 million people across Russia, Eastern Europe, and Central Asia, Cyrillic has many national variations. This chapter explores its linguistic diversity and the security implications of Latin-Cyrillic homoglyphs.

~3,500 words · ~14 min read

From the cobblestoned streets of Moscow to the steppe cities of Kazakhstan, from the Orthodox churches of Serbia to the mountain villages of Bulgaria, one script serves as the common writing system for a vast sweep of Eurasia: Cyrillic. Named after Saint Cyril, a 9th-century Byzantine scholar who may or may not have personally designed it, the Cyrillic alphabet today serves over a dozen official languages across Russia, Eastern Europe, the Balkans, and Central Asia, and is the primary script of approximately 250 million people. In Unicode, Cyrillic is harder to handle correctly than its simple alphabetic structure suggests, partly because of the sheer breadth of languages and variants it covers, and partly because of the security implications of its visual similarity to the Latin alphabet.

The Saints and the Script

In 862 CE, the Byzantine emperor Michael III sent two scholars — brothers Cyril (born Constantine) and Methodius — to the Great Moravian Empire (roughly modern Czech Republic and Slovakia) to introduce Christianity in the local Slavic language. To do this, they needed to translate the Bible and liturgical texts, which in turn required a writing system for Old Church Slavonic.

The script Cyril is said to have created was actually Glagolitic — a highly distinctive, curvilinear alphabet with no clear parallels to existing scripts. After Cyril's death in 869, his disciples in Bulgaria developed Cyrillic based primarily on the Greek alphabet, with additional letters created for sounds in Slavic languages that Greek lacked. This script, named in Cyril's honor, spread rapidly through the Orthodox Slavic world and ultimately supplanted Glagolitic almost everywhere.

The Cyrillic alphabet's Greek foundations are immediately apparent: А, В, Г, Д, Е, З, И, К, Л, М, Н, О, П, Р, С, Т, У, Ф, Х — many of these letters are visually identical or near-identical to their Greek counterparts. Letters like Б, Ж, Ц, Ч, Ш, Щ, Ъ, Ы, Ь, Э, Ю, Я were created for Slavic sounds absent from Greek.

Cyrillic Across Languages

As Cyrillic spread through the Russian Empire and later the Soviet Union, it was adapted for dozens of non-Slavic languages. The Soviet language policy of the 1930s–1940s systematically cyrillicized the writing systems of Central Asian and Caucasian peoples, often replacing existing Arabic, Latin, or indigenous scripts. The result was an explosion of Cyrillic variants:

| Language | Letters beyond the Russian set | Notes |
|---|---|---|
| Russian | None (standard Cyrillic) | 33 letters |
| Ukrainian | Ї, І, Є, Ґ | Four letters absent from Russian Cyrillic |
| Belarusian | Ў | Short U for the /w/ sound |
| Bulgarian | None | Ъ is used as a vowel; some letters are used differently than in Russian |
| Serbian | Ђ, Ј, Љ, Њ, Ћ, Џ | Strictly phonemic orthography, one letter per sound |
| Macedonian | Ѓ, Ѕ, Ј, Љ, Њ, Ќ, Џ | Ѓ, Ѕ, and Ќ distinguish it from Serbian Cyrillic |
| Mongolian | Ө, Ү (shared with other languages) | Uses Cyrillic since the 1940s |
| Kazakh | 9 additional letters | Transitioning to Latin |
| Bashkir, Tatar | Several additional letters | Tatar includes Arabic-influenced sounds |
| Chuvash | Ӑ, Ӗ, Ҫ, Ӳ | Distinct Turkic sounds |

The Unicode Cyrillic block (U+0400–U+04FF) and its extensions handle this diversity:

| Block | Range | Code points | Content |
|---|---|---|---|
| Cyrillic | U+0400–U+04FF | 256 | Modern + most extended Cyrillic |
| Cyrillic Supplement | U+0500–U+052F | 48 | Languages of the Russian Federation |
| Cyrillic Extended-A | U+2DE0–U+2DFF | 32 | Old Church Slavonic, historical |
| Cyrillic Extended-B | U+A640–U+A69F | 96 | Old Cyrillic, extended Old Slavic |
| Cyrillic Extended-C | U+1C80–U+1C8F | 16 (9 assigned when the block was introduced) | Lowercase forms of historical letters |
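As a rough illustration of how these ranges can be used in practice, the following Python sketch maps a character to the Cyrillic block that contains it. The ranges are copied from the table above; the names `CYRILLIC_BLOCKS` and `cyrillic_block` are hypothetical and exist only for this example.

```python
from typing import Optional

# Block ranges taken from the table above (inclusive on both ends).
CYRILLIC_BLOCKS = [
    ("Cyrillic", 0x0400, 0x04FF),
    ("Cyrillic Supplement", 0x0500, 0x052F),
    ("Cyrillic Extended-C", 0x1C80, 0x1C8F),
    ("Cyrillic Extended-A", 0x2DE0, 0x2DFF),
    ("Cyrillic Extended-B", 0xA640, 0xA69F),
]


def cyrillic_block(ch: str) -> Optional[str]:
    """Return the name of the Cyrillic block containing ch, or None."""
    cp = ord(ch)
    for name, first, last in CYRILLIC_BLOCKS:
        if first <= cp <= last:
            return name
    return None


if __name__ == "__main__":
    for ch in ["Ж", "\u0500", "\ua640", "a"]:
        print(f"U+{ord(ch):04X}", cyrillic_block(ch))
```

A block lookup like this only says where a code point lives. It does not say whether that code point is assigned or which language uses it.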

Russian Orthographic Reform

Pre-revolutionary Russian (before 1918) used four additional letters that were abolished by the Soviet orthographic reform: Ѣ (yat, U+0462), Ѳ (fita, U+0472), І (decimal i, U+0406), and Ѵ (izhitsa, U+0474). These appear in historical texts, pre-revolutionary reprints, and Church Slavonic documents. Unicode encodes all four, in both upper and lower case, in the main Cyrillic block (U+0400–U+04FF), so digitized pre-1918 Russian texts can be represented faithfully.
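The code points of these letters can be checked with Python's standard `unicodedata` module. The sketch below lists the capital form of each letter with its Unicode name and the code point of its lowercase partner; `OLD_LETTERS` and `describe` are illustrative names, not part of any library.

```python
import unicodedata

# Capital forms of the four pre-1918 letters: yat, fita, decimal i, izhitsa.
OLD_LETTERS = ["\u0462", "\u0472", "\u0406", "\u0474"]


def describe(ch: str) -> tuple:
    """Return (code point, Unicode name, lowercase code point) for ch."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        f"U+{ord(ch.lower()):04X}",
    )


if __name__ == "__main__":
    for ch in OLD_LETTERS:
        print(ch, *describe(ch))
```

Note that the capital decimal i is U+0406 and its lowercase form is U+0456. The same code points serve the modern Ukrainian and Belarusian letter.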

The Confusables Problem

The visual overlap between Cyrillic and Latin letters is even more extensive than the Greek-Latin overlap, creating significant security concerns:

| Cyrillic | Unicode | Latin | Unicode | Appearance |
|---|---|---|---|---|
| а | U+0430 | a | U+0061 | Identical (lowercase) |
| е | U+0435 | e | U+0065 | Identical |
| о | U+043E | o | U+006F | Identical |
| р | U+0440 | p | U+0070 | Identical |
| с | U+0441 | c | U+0063 | Identical |
| у | U+0443 | y | U+0079 | Identical |
| х | U+0445 | x | U+0078 | Identical |
| В | U+0412 | B | U+0042 | Identical (uppercase) |
| Е | U+0415 | E | U+0045 | Identical |
| М | U+041C | M | U+004D | Identical |
| Н | U+041D | H | U+0048 | Identical |
| О | U+041E | O | U+004F | Identical |
| Р | U+0420 | P | U+0050 | Identical |
| С | U+0421 | C | U+0043 | Identical |
| Т | U+0422 | T | U+0054 | Identical |
| Х | U+0425 | X | U+0058 | Identical |
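To make the overlap concrete, here is a minimal Python sketch that folds the Cyrillic letters from the table above onto their Latin lookalikes and compares the results. The mapping is only the small subset listed in the table, not Unicode's full confusables data, and `latin_skeleton` and `looks_like` are hypothetical helper names.

```python
# Cyrillic -> Latin lookalikes, copied from the table above.
# A tiny illustrative subset; Unicode's confusables.txt is far larger.
CYRILLIC_TO_LATIN = {
    "\u0430": "a", "\u0435": "e", "\u043E": "o", "\u0440": "p",
    "\u0441": "c", "\u0443": "y", "\u0445": "x",
    "\u0412": "B", "\u0415": "E", "\u041C": "M", "\u041D": "H",
    "\u041E": "O", "\u0420": "P", "\u0421": "C", "\u0422": "T",
    "\u0425": "X",
}


def latin_skeleton(text: str) -> str:
    """Replace each mapped Cyrillic letter with its Latin lookalike."""
    return "".join(CYRILLIC_TO_LATIN.get(ch, ch) for ch in text)


def looks_like(candidate: str, target: str) -> bool:
    """True if the strings differ in code points but fold to the same text."""
    return candidate != target and latin_skeleton(candidate) == latin_skeleton(target)


if __name__ == "__main__":
    spoof = "\u0440aypal"  # Cyrillic er followed by Latin "aypal"
    print(looks_like(spoof, "paypal"))
    print(looks_like("paypal", "paypal"))
```

Comparing folded forms rather than raw strings is the general idea behind skeleton-based confusable detection. A production check would rely on the full Unicode data rather than a hand-written table.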

This overlap is not coincidental — both scripts ultimately derive from the same Greek ancestor. But it creates real security vulnerabilities:

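IDN Homograph Attacks: A domain like рaypal.com (Cyrillic р instead of Latin p) is visually indistinguishable from paypal.com in many fonts. In 2017, a security researcher registered аррӏе.com, spelled entirely with Cyrillic letters (а, р, е, and the palochka ӏ standing in for l), and demonstrated how convincing such an attack could be. Because every character came from one script, browser checks that looked only for mixed scripts did not flag it.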

Mitigations: ICANN's guidelines restrict mixing scripts in a single domain label. Modern browsers display punycode (the ACE encoding of IDN labels) when mixed-script domains are detected. Unicode's confusables data (in confusables.txt) documents character pairs and groups that are visually similar, providing a reference for security-sensitive applications.
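The mixed-script check and the fallback to the ACE form can be sketched in a few lines of Python. This is a simplified illustration, not how any particular browser implements it: the script of each letter is guessed from the first word of its Unicode character name, whereas real implementations use the Unicode Script property, and `to_ace` skips the normalization and validation steps that full IDNA processing performs. The function names are hypothetical.

```python
import unicodedata


def scripts_in(label: str) -> set:
    """Guess the scripts in a label from each letter's Unicode name.

    Heuristic for illustration only: 'CYRILLIC SMALL LETTER ER' -> 'CYRILLIC'.
    """
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in label
        if ch.isalpha()
    }


def is_mixed_script(label: str) -> bool:
    """True if the letters of the label come from more than one script."""
    return len(scripts_in(label)) > 1


def to_ace(label: str) -> str:
    """Return the ASCII-compatible form of one label (no IDNA validation)."""
    if label.isascii():
        return label
    return "xn--" + label.encode("punycode").decode("ascii")


if __name__ == "__main__":
    label = "\u0440aypal"  # Cyrillic er + Latin "aypal"
    shown = to_ace(label) if is_mixed_script(label) else label
    print(shown)
```

A check like this catches mixed labels only. A label written wholly in one script still needs a comparison against confusables data.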

Cyrillic for Non-Slavic Languages

Some of the most phonologically interesting Cyrillic letters were created for non-Slavic Central Asian and Siberian languages:

  • Ғ (U+0492): Voiced uvular fricative, used in Kazakh, Uzbek, and Tajik
  • Қ (U+049A): Voiceless uvular stop, common in Turkic languages
  • Ң (U+04A2): Velar nasal, for -ng- sounds
  • Ү (U+04AE): Front rounded vowel in Turkic languages such as Kazakh and Tatar
  • Ӑ (U+04D0): Short A (A with breve), used in Chuvash
  • Ӡ (U+04E0): Abkhaz dze
  • ꚑ (U+A691): Small letter tsse, one of the Cyrillic Extended-B letters for Caucasian languages

The Post-Soviet Script Shifts

Several former Soviet republics have shifted away from Cyrillic since 1991, motivated by desires to distance their national identity from Russian cultural influence and to improve compatibility with the Latin-dominant internet:

  • Moldova: Switched from Cyrillic back to Latin (Romanian uses Latin) in 1989
  • Azerbaijan: Switched to Latin in 1991, fully by 2001
  • Uzbekistan: Gradually transitioning to Latin since 1993 (still ongoing)
  • Turkmenistan: Switched to Latin in 1993
  • Kazakhstan: Announced transition to Latin in 2017, ongoing implementation

Mongolia, though never a Soviet republic, still uses Soviet-introduced Cyrillic for standard Mongolian. The traditional Mongolian script (written vertically, encoded in Unicode at U+1800–U+18AF) has official status alongside it and is seeing a revival.

These geopolitical transitions create encoding challenges for digital archives, legacy databases, and historical documents. Unicode encodes all the necessary characters for both the Cyrillic and Latin forms of these languages, but accurate representation requires knowing which orthographic era a document comes from.
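A first step in sorting such archives is simply finding out which script a given document is written in. The Python sketch below counts letters by script and reports the most common one. It is a rough heuristic under stated assumptions: the script is guessed from the first word of each character's Unicode name, and `dominant_script` is a hypothetical helper, not a library function. It says nothing about which Latin or Cyrillic orthography is in use, only which script dominates.

```python
import unicodedata
from collections import Counter
from typing import Optional


def dominant_script(text: str) -> Optional[str]:
    """Return the most common script among the letters of text, or None."""
    counts = Counter(
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in text
        if ch.isalpha()
    )
    if not counts:
        return None
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    print(dominant_script("Қазақстан Республикасы"))
    print(dominant_script("Qazaqstan Respublikasy"))
```

Documents from a transition period often mix both scripts, so in practice the full counts are more informative than the single top result.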