Unicode 指南
横跨 10 个系列的 150 篇深度指南——从编码基础到安全深度解析。
Unicode Fundamentals
Core concepts of Unicode and character encoding
Unicode is the universal character encoding standard that assigns a unique number to every character in every language. This complete …
UTF-8 is the dominant character encoding on the web, capable of representing every Unicode character using one to four bytes. …
UTF-8, UTF-16, and UTF-32 are three encodings of Unicode, each with different trade-offs in size, speed, and compatibility. Learn the …
A Unicode code point is the unique number assigned to each character in the Unicode standard, written in the form …
Unicode is divided into 17 planes, each containing up to 65,536 code points, with Plane 0 known as the Basic …
The Byte Order Mark (BOM) is a special Unicode character used at the start of a text stream to signal …
Surrogate pairs are a mechanism in UTF-16 that allows code points outside the BMP to be represented using two 16-bit …
ASCII defined 128 characters for the English alphabet and was the foundation of text computing, but it could not handle …
The same visible character can be represented by multiple different byte sequences in Unicode, which causes silent bugs in string …
The Unicode Bidirectional Algorithm (UBA) determines how text containing a mix of left-to-right and right-to-left scripts is displayed on screen. …
Every Unicode character belongs to a general category such as Letter, Number, Punctuation, or Symbol, which determines how it behaves …
Unicode blocks are contiguous ranges of code points grouped by script or character type, making it easier to find and …
Unicode assigns every character to a script property that identifies the writing system it belongs to, such as Latin, Arabic, …
Combining characters are Unicode code points that attach to a preceding base character to create accented letters, diacritics, and other …
A single visible character on screen — called a grapheme cluster — can be made up of multiple Unicode code …
Unicode confusables are characters that look identical or nearly identical to others, enabling homograph attacks where a malicious URL or …
Zero-width characters are invisible Unicode code points that affect text layout, joining, and direction without occupying any visible space. This …
Unicode defines over two dozen whitespace characters beyond the ordinary space, including non-breaking spaces, thin spaces, and various-width spaces used …
Unicode began in 1987 as a collaboration between engineers at Apple and Xerox who wanted to replace hundreds of incompatible …
Unicode has released major versions regularly since 1.0 in 1991, with each release adding thousands of new characters, emoji, and …
Unicode in Code
Language-specific guides for working with Unicode
Python 3 uses Unicode strings by default, but correctly handling encoding, decoding, normalization, and grapheme clusters still requires careful attention. …
JavaScript uses UTF-16 internally, which means characters outside the Basic Multilingual Plane — including most emoji — are stored as …
Java's char type is a 16-bit UTF-16 code unit, not a full Unicode character, which creates subtle bugs when working …
Go's string type is a sequence of bytes, and its rune type represents a single Unicode code point, making it …
Rust's str and String types are guaranteed to be valid UTF-8, making it one of the safest languages for Unicode …
C and C++ have historically poor Unicode support, with char being a single byte and wchar_t having platform-dependent width, leading …
Ruby strings carry an explicit encoding, with UTF-8 being the default since Ruby 2.0, allowing developers to work with international …
PHP's built-in string functions operate on bytes rather than Unicode characters, which means strlen, substr, and similar functions produce wrong …
Swift's String type is designed with Unicode correctness as a first-class concern, representing characters as extended grapheme clusters rather than …
HTML and CSS support Unicode characters directly and through escape sequences, allowing developers to embed any character in web pages …
Unicode-aware regular expressions let you match characters by script, category, or property rather than just explicit byte ranges, making patterns …
SQL databases store text in encodings and collations that determine how characters are saved, compared, and sorted, with UTF-8 and …
URLs are technically restricted to ASCII characters, so non-ASCII text must be percent-encoded as UTF-8 bytes before being included in …
Every major programming language has its own syntax for embedding Unicode characters as escape sequences in string literals, from \u0041 …
JSON is defined as Unicode text and must be encoded in UTF-8, UTF-16, or UTF-32, but many real-world APIs still …
Symbol Reference
Comprehensive guides to specific types of Unicode symbols
Unicode contains hundreds of arrow symbols spanning simple directional arrows, double arrows, curved arrows, and specialized mathematical arrows. This complete …
Unicode provides multiple check mark and tick symbols ranging from the classic ✓ (U+2713) to heavy check marks, ballot boxes, …
Unicode includes a rich collection of star shapes — from the simple five-pointed star ★ to outlined stars, six-pointed stars, …
Unicode contains dozens of heart symbols including the classic ♥, black and white hearts, suit hearts, decorative hearts, and a …
Unicode's Currency Symbols block and surrounding areas contain dedicated characters for over 60 world currencies, from the Dollar $ and …
Unicode has dedicated blocks for mathematical operators, arrows, letterlike symbols, and alphanumeric characters covering virtually every symbol used in modern …
Beyond the ASCII parentheses and square brackets, Unicode includes angle brackets, curly brackets, fullwidth brackets, double brackets, and dozens of …
Unicode offers a wide variety of bullet point characters beyond the standard •, including triangular bullets, dashes, arrows, and decorative …
Unicode's Box Drawing block contains 128 characters for drawing lines, corners, intersections, and borders in monospaced text environments like terminals …
Unicode includes musical note symbols such as ♩♪♫♬ in the Miscellaneous Symbols block, as well as a dedicated Musical Symbols …
Unicode includes precomposed fraction characters for common fractions like ½ ¼ ¾ ⅓ and many others, allowing fractions to be …
Unicode provides precomposed superscript and subscript digits and letters — such as ¹ ² ³ and ₁ ₂ ₃ — …
Unicode contains dozens of circle symbols including filled circles, outlined circles, circles with dots, circled letters, and the various combining …
Unicode includes filled squares, outlined squares, small squares, medium squares, dashed squares, and rectangle variants spread across the Geometric Shapes …
Unicode provides a comprehensive set of triangle symbols in all orientations — up, down, left, right — in both filled …
Unicode includes filled and outline diamond shapes, lozenge characters, and playing card suit diamonds across several blocks including Geometric Shapes …
Unicode provides various cross and X mark characters including the heavy ballot X ✗, multiplication signs, plus signs, and decorative …
The hyphen-minus on your keyboard is just one of Unicode's many dash characters, which also includes the en dash, em …
Unicode defines typographic quotation marks — curly quotes — for dozens of languages, including single and double guillemets, low-9 quotation …
Unicode includes dedicated characters for the copyright symbol ©, registered trademark ®, trademark ™, service mark ℠, and other legal …
The degree symbol ° (U+00B0) and dedicated Celsius ℃ and Fahrenheit ℉ characters are among the most commonly searched Unicode …
Unicode's Enclosed Alphanumerics block provides circled numbers ①②③, parenthesized numbers ⑴⑵⑶, and other enclosed digit forms used in lists, annotations, …
Unicode includes a Number Forms block with precomposed Roman numeral characters such as Ⅰ Ⅱ Ⅲ Ⅳ, distinct from the …
Greek letters like α β γ δ π Σ Ω are widely used as mathematical and scientific symbols and are …
The Unicode Dingbats block (U+2700–U+27BF) contains 192 decorative symbols originally from the Zapf Dingbats typeface, including scissors, pencils, fleurons, and …
Unicode includes a Playing Cards block with characters for all 52 standard playing cards plus jokers, card backs, and the …
Unicode provides characters for all six chess piece types in both white and black variants — ♔♕♖♗♘♙ and ♚♛♜♝♞♟ — …
Unicode's Miscellaneous Symbols block includes the 12 zodiac signs ♈♉♊♋♌♍♎♏♐♑♒♓, planetary symbols, and other astrological characters used in astrology and …
Unicode's Braille Patterns block (U+2800–U+28FF) encodes all 256 possible combinations of the 8-dot braille cell, enabling braille text to be …
Unicode's Geometric Shapes block contains 96 characters covering circles, squares, triangles, diamonds, and other polygons in filled, outlined, and various …
The Unicode Letterlike Symbols block contains mathematical and technical symbols derived from letters, such as ℃ ℉ ℕ ℤ ℝ …
Unicode's Miscellaneous Technical block contains symbols from computing, electronics, and engineering, including keyboard keys, optical character recognition characters, and control …
Diacritics are accent marks and other marks that attach to letters to indicate pronunciation or meaning, represented in Unicode as …
Unicode defines dozens of invisible characters beyond the ordinary space, including zero-width spaces, word joiners, soft hyphens, and various format …
Unicode includes warning and hazard symbols such as the universal caution ⚠ sign, biohazard ☣, radiation ☢, and skull and …
Unicode's Miscellaneous Symbols block includes sun ☀, cloud ☁, rain ☂, snow ❄, lightning ⚡, and other weather symbols used …
Unicode includes symbols for many of the world's major religions including the cross ✝, crescent and star ☪, Star of …
Unicode includes the traditional male ♂ and female ♀ symbols from astronomy, as well as newer gender-related characters and the …
Apple's macOS uses Unicode characters for keyboard modifier keys such as ⌘ Command, ⌥ Option, ⇧ Shift, and ⌃ Control, …
Unicode symbols like ▶ ◀ ► ★ ✦ ⚡ ✈ and hundreds of others are widely used in social media …
Block Explorer
Deep dives into notable Unicode blocks
The Basic Latin block (U+0000–U+007F) is the first Unicode block and covers the 128 original ASCII characters, including the English …
The Latin-1 Supplement block (U+0080–U+00FF) extends ASCII with accented Latin characters for Western European languages and matches the original ISO …
The General Punctuation block (U+2000–U+206F) contains typographic spaces, dashes, quotation marks, and various punctuation characters used in professional typography across …
The Mathematical Operators block (U+2200–U+22FF) contains 256 symbols covering set theory, logic, calculus, and other areas of mathematics used in …
The Arrows block (U+2190–U+21FF) contains 112 arrow characters including simple directional arrows, double-headed arrows, curved arrows, and arrows with various …
The Dingbats block (U+2700–U+27BF) was created to encode the Zapf Dingbats typeface and contains 192 decorative ornaments, check marks, arrows, …
The Miscellaneous Symbols block (U+2600–U+26FF) is one of Unicode's most eclectic, containing weather symbols, card suits, zodiac signs, religious symbols, …
The CJK Unified Ideographs block (U+4E00–U+9FFF) is one of the largest Unicode blocks, containing over 20,000 Chinese, Japanese, and Korean …
The Hangul Syllables block (U+AC00–U+D7A3) contains 11,172 precomposed Korean syllable blocks algorithmically derived from 19 initial consonants, 21 vowels, and …
Emoji in Unicode span multiple blocks across the Supplementary Multilingual Plane, including Emoticons, Miscellaneous Symbols and Pictographs, and Transport and …
The Currency Symbols block (U+20A0–U+20CF) contains dedicated Unicode characters for currencies that lack an ASCII representation, such as the Euro …
The Box Drawing block (U+2500–U+257F) and Block Elements block (U+2580–U+259F) provide characters for drawing text-based user interfaces and terminal art …
The Enclosed Alphanumerics block (U+2460–U+24FF) contains circled numbers, parenthesized numbers and letters, and various other enclosed alphanumeric forms used in …
The Geometric Shapes block (U+25A0–U+25FF) and related blocks contain squares, circles, triangles, and other polygons in filled, outlined, and partial …
The Musical Symbols block (U+1D100–U+1D1FF) is a Supplementary Multilingual Plane block containing 233 characters for full western music notation, including …
Script Stories
Writing system deep dives with history and culture
Arabic is the third most widely used writing system in the world, written right-to-left with letters that change shape depending …
Devanagari is an abugida script used to write Hindi, Sanskrit, Marathi, and many other South Asian languages, with complex conjunct …
Greek is one of the oldest alphabetic writing systems and gave Unicode many of its mathematical symbols, with the Greek …
Cyrillic is used to write Russian, Ukrainian, Bulgarian, Serbian, and over 50 other languages, making it one of the most …
Hebrew is an abjad script written right-to-left, used for Biblical Hebrew, Modern Hebrew, and Yiddish, with optional vowel diacritics called …
Thai is an abugida script with no spaces between words, complex vowel placement above and below consonants, and tone marks …
Japanese is unique in using three scripts simultaneously — Hiragana, Katakana, and Kanji (CJK ideographs) — alongside Latin text, making …
Hangul was invented in 1443 by King Sejong as a scientific alphabet where syllable blocks are algorithmically composed from individual …
Bengali is an abugida script with over 300 million speakers, used for Bengali and Assamese, featuring complex conjunct consonant forms …
Tamil is one of the oldest living writing systems, with a literary tradition spanning over 2,000 years, and its script …
The Armenian alphabet was created in 405 AD by the monk Mesrop Mashtots and has remained largely unchanged for over …
Georgian has three distinct historical scripts — Mkhedruli, Asomtavruli, and Nuskhuri — all encoded in Unicode, with modern Georgian using …
The Ethiopic script (Ge'ez) is an abugida used to write Amharic, Tigrinya, Oromo, and many other languages of the Horn …
Unicode encodes dozens of historic and extinct scripts — from Cuneiform and Egyptian Hieroglyphs to Linear B and Gothic — …
There are hundreds of writing systems in use around the world today, from alphabets and syllabaries to abjads and abugidas, …
Practical Unicode
How-to guides for real-world Unicode tasks
Windows provides several methods for typing special characters and Unicode symbols, including Alt codes, the Character Map app, keyboard shortcuts, …
macOS makes it easy to type special characters and Unicode symbols through the Character Viewer, Option key shortcuts, the emoji …
Linux offers multiple ways to insert Unicode characters, including Ctrl+Shift+U followed by a hex code point, compose key sequences, IBus …
Typing special Unicode characters on smartphones requires different techniques than on desktop — long-pressing keys, using third-party keyboards, or copy-pasting …
Mojibake is the garbled text you see when a file encoded in one character set is interpreted as another, producing …
Storing Unicode text in a database requires choosing the right charset, collation, and column type — a wrong choice can …
Modern operating systems support Unicode filenames, but different filesystems use different encodings and normalization forms, causing cross-platform file sharing problems …
Email evolved from ASCII-only systems, and supporting Unicode in email subjects, bodies, and addresses requires encoding schemes like Quoted-Printable, Base64, …
Internationalized Domain Names (IDNs) allow domain names to contain non-ASCII characters from any Unicode script, converted to ASCII-compatible encoding using …
Using Unicode symbols, special characters, and emoji in web content has important accessibility implications for screen readers, which may announce …
Unicode supports both left-to-right and right-to-left text through the bidirectional algorithm and explicit directional control characters, enabling correct display of …
A font file only contains glyphs for a subset of Unicode characters, which means characters outside that subset fall back …
Finding the exact Unicode character you need can be challenging given over 140,000 characters spread across 150+ scripts and dozens …
Copying and pasting text between applications can introduce invisible characters, change normalization forms, strip or mangle Unicode characters, and cause …
Unicode's Mathematical Alphanumeric Symbols block and other areas contain bold, italic, script, Fraktur, monospace, and double-struck variants of Latin letters …
Platform Guides
Unicode usage in specific platforms and applications
Microsoft Word supports the full Unicode character set and provides several methods for inserting special characters, including Alt+X code point …
Google Docs and Sheets use UTF-8 internally and provide a Special Characters panel for inserting Unicode symbols by drawing, searching …
Modern terminals support Unicode and UTF-8, but correctly displaying all Unicode characters requires a compatible terminal emulator, the right locale …
PDF supports Unicode text through embedded fonts and ToUnicode maps, but many PDFs created from scans or older tools produce …
Microsoft Excel stores text in Unicode but has historically struggled with non-Latin characters in CSV imports, RTL text layout, and …
Social media platforms handle Unicode text with varying degrees of support, affecting how emoji, RTL text, special characters, and invisible …
Both XML and JSON are defined to use Unicode text, but each has its own rules for encoding characters, escaping …
Natural language processing and data science pipelines frequently encounter Unicode issues including encoding errors, normalization mismatches, invisible characters, and language …
QR codes can encode Unicode text using UTF-8, but many QR code generators and scanners default to ISO 8859-1, causing …
Allowing Unicode characters in passwords increases the keyspace and can improve security, but it also introduces normalization ambiguity, where the …
Unicode History & Culture
Historical and cultural stories about Unicode
ASCII was created in 1963 by the American Standards Association to standardize communication between different computers and telegraph systems using …
EBCDIC (Extended Binary Coded Decimal Interchange Code) was IBM's character encoding used on its mainframe computers, incompatible with ASCII and …
The Unicode Consortium is the non-profit organization responsible for developing and maintaining the Unicode standard, with members including Apple, Google, …
Adding a new character to Unicode requires submitting a detailed proposal to the Unicode Consortium that demonstrates the character's need, …
Getting a new emoji into Unicode requires a formal proposal to the Emoji Subcommittee that must meet strict selection criteria …
CJK unification was Unicode's decision to assign the same code points to Chinese, Japanese, and Korean characters with the same …
Mojibake — Japanese for 'character transformation' — is the garbled text that results from encoding mismatches, a problem that plagued …
From the first Unicode draft in 1988 to the addition of emoji, the surpassing of 100,000 characters, and UTF-8 becoming …
Before Unicode became universal, the web was fragmented by incompatible national encodings that made international websites unreliable and cross-language communication …
Unicode is full of surprising, obscure, and occasionally humorous characters — from the interrobang ‽ to the snowman ☃ with …
Unicode Security
Security implications of Unicode
Unicode's vast character set introduces a range of security vulnerabilities including homograph attacks, bidirectional text spoofing, normalization exploits, and invisible …
IDN homograph attacks use look-alike Unicode characters to register domain names that appear identical to legitimate ones — for example, …
Zero-width and other invisible Unicode characters can be used to fingerprint text for tracking, hide malicious payloads in code, or …
Unicode passwords introduce normalization ambiguity that can cause authentication failures or allow password bypasses when different normalization forms produce different …
Phishing attacks increasingly exploit Unicode confusables, bidirectional overrides, and invisible characters to create deceptive URLs, spoofed sender addresses, and misleading …
Advanced Topics
Deep technical topics for experienced developers
Sorting text correctly across languages requires the Unicode Collation Algorithm (UCA), which defines a multi-level comparison scheme that handles accents, …
ICU (International Components for Unicode) is the reference implementation library for Unicode and internationalization, providing collation, date/number formatting, transliteration, and …
Unicode 16.0 added thousands of new characters, but there are still hundreds of scripts, historic languages, and symbols waiting for …
Modern programming language designers must decide how to handle Unicode in identifiers, string literals, source file encoding, and operator symbols …
Unicode normalization must often be applied at scale in search engines, databases, and text processing pipelines, where the performance cost …