Unicode Guides

150 in-depth guides across 10 series, from encoding fundamentals to security deep dives.

📚

Unicode Fundamentals

Core concepts of Unicode and character encoding

What is Unicode? A Complete Guide
high

Unicode is the universal character encoding standard that assigns a unique number to every character in every language. This complete …

UTF-8 Encoding Explained
high

UTF-8 is the dominant character encoding on the web, capable of representing every Unicode character using one to four bytes. …
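
As a quick illustration of the one-to-four-byte claim above, here is a minimal Python sketch. The helper name `utf8_length` is made up for this example; the only real API it relies on is the built-in `str.encode`.

```python
def utf8_length(ch: str) -> int:
    """Return how many bytes a single character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


# One sample character per UTF-8 length class (written as escapes so the
# source file's own encoding cannot affect the result).
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]  # A, e-acute, euro sign, grinning face

for ch in samples:
    print(f"U+{ord(ch):04X} -> {utf8_length(ch)} byte(s)")
# U+0041 -> 1 byte(s)
# U+00E9 -> 2 byte(s)
# U+20AC -> 3 byte(s)
# U+1F600 -> 4 byte(s)
```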

UTF-8 vs UTF-16 vs UTF-32: When to Use Each
high

UTF-8, UTF-16, and UTF-32 are three encodings of Unicode, each with different trade-offs in size, speed, and compatibility. Learn the …

What is a Unicode Code Point?
mid

A Unicode code point is the unique number assigned to each character in the Unicode standard, written in the form …

Unicode Planes and the BMP
mid

Unicode is divided into 17 planes, each containing up to 65,536 code points, with Plane 0 known as the Basic …
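
Because each plane holds 65,536 (2^16) code points, the plane number is simply the code point shifted right by 16 bits. The sketch below assumes a hypothetical helper named `plane_of`; it is an illustration, not part of any library.

```python
PLANE_SIZE = 0x10000          # 65,536 code points per plane
MAX_CODE_POINT = 0x10FFFF     # last code point of plane 16


def plane_of(code_point: int) -> int:
    """Return the plane number (0-16) that contains the given code point."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError(f"not a Unicode code point: {code_point:#x}")
    return code_point >> 16


print(plane_of(ord("A")))           # 0  (Basic Multilingual Plane)
print(plane_of(0x1F600))            # 1  (Supplementary Multilingual Plane)
print(plane_of(MAX_CODE_POINT))     # 16
print(17 * PLANE_SIZE)              # 1114112 code points in total
```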

Understanding Byte Order Mark (BOM)
mid

The Byte Order Mark (BOM) is a special Unicode character used at the start of a text stream to signal …

Surrogate Pairs Explained
mid

Surrogate pairs are a mechanism in UTF-16 that allows code points outside the BMP to be represented using two 16-bit …
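
The arithmetic behind a surrogate pair can be sketched in a few lines of Python. The function name `to_surrogate_pair` is an assumption made for this example; the result is cross-checked against Python's built-in `utf-16-be` codec.

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a supplementary code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP need a surrogate pair")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F600)     # U+1F600 GRINNING FACE
print(f"{high:04X} {low:04X}")             # D83D DE00

# Cross-check against the standard library's UTF-16 encoder.
assert "\U0001F600".encode("utf-16-be") == bytes([0xD8, 0x3D, 0xDE, 0x00])
```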

ASCII to Unicode: The Evolution of Character Encoding
mid

ASCII defined 128 characters for the English alphabet and was the foundation of text computing, but it could not handle …

Unicode Normalization: NFC, NFD, NFKC, NFKD
high

The same visible character can be represented by multiple different code point sequences in Unicode, which causes silent bugs in string …
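
A small Python sketch of the problem described above, using the standard `unicodedata` module. The helper name `equal_after_nfc` is hypothetical; comparing under NFC is one reasonable choice, not the only one.

```python
import unicodedata

composed = "\u00e9"        # e-acute as one precomposed code point
decomposed = "e\u0301"     # "e" followed by COMBINING ACUTE ACCENT


def equal_after_nfc(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(composed == decomposed)                  # False: different code points
print(equal_after_nfc(composed, decomposed))   # True: same text after NFC

# NFD goes the other way and splits the precomposed form apart.
print(len(unicodedata.normalize("NFD", composed)))   # 2

# The compatibility forms (NFKC/NFKD) also fold characters such as ligatures.
print(unicodedata.normalize("NFKC", "\ufb01"))       # fi
```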

The Unicode Bidirectional Algorithm
low

The Unicode Bidirectional Algorithm (UBA) determines how text containing a mix of left-to-right and right-to-left scripts is displayed on screen. …

Unicode General Categories Explained
mid

Every Unicode character belongs to a general category such as Letter, Number, Punctuation, or Symbol, which determines how it behaves …

Understanding Unicode Blocks
mid

Unicode blocks are contiguous ranges of code points grouped by script or character type, making it easier to find and …

Unicode Scripts: How Writing Systems are Organized
mid

Unicode assigns every character to a script property that identifies the writing system it belongs to, such as Latin, Arabic, …

What are Combining Characters?
mid

Combining characters are Unicode code points that attach to a preceding base character to create accented letters, diacritics, and other …

Grapheme Clusters vs Code Points
mid

A single visible character on screen — called a grapheme cluster — can be made up of multiple Unicode code …

Unicode Confusables: A Security Guide
high

Unicode confusables are characters that look identical or nearly identical to others, enabling homograph attacks where a malicious URL or …

Zero Width Characters: What They Are and Why They Matter
high

Zero-width characters are invisible Unicode code points that affect text layout, joining, and direction without occupying any visible space. This …

Unicode Whitespace Characters Guide
mid

Unicode defines over two dozen whitespace characters beyond the ordinary space, including non-breaking spaces, thin spaces, and various-width spaces used …

History of Unicode
low

Unicode began in 1987 as a collaboration between engineers at Apple and Xerox who wanted to replace hundreds of incompatible …

Unicode Versions Timeline
low

Unicode has released major versions regularly since 1.0 in 1991, with each release adding thousands of new characters, emoji, and …

💻

Unicode in Code

Language-specific guides for working with Unicode

Unicode in Python
high

Python 3 uses Unicode strings by default, but correctly handling encoding, decoding, normalization, and grapheme clusters still requires careful attention. …

Unicode in JavaScript
high

JavaScript uses UTF-16 internally, which means characters outside the Basic Multilingual Plane — including most emoji — are stored as …

Unicode in Java
mid

Java's char type is a 16-bit UTF-16 code unit, not a full Unicode character, which creates subtle bugs when working …

Unicode in Go
mid

Go's string type is a sequence of bytes, and its rune type represents a single Unicode code point, making it …

Unicode in Rust
mid

Rust's str and String types are guaranteed to be valid UTF-8, making it one of the safest languages for Unicode …

Unicode in C/C++
mid

C and C++ have historically poor Unicode support, with char being a single byte and wchar_t having platform-dependent width, leading …

Unicode in Ruby
low

Ruby strings carry an explicit encoding, with UTF-8 being the default since Ruby 2.0, allowing developers to work with international …

Unicode in PHP
low

PHP's built-in string functions operate on bytes rather than Unicode characters, which means strlen, substr, and similar functions produce wrong …

Unicode in Swift
low

Swift's String type is designed with Unicode correctness as a first-class concern, representing characters as extended grapheme clusters rather than …

Unicode in HTML & CSS
high

HTML and CSS support Unicode characters directly and through escape sequences, allowing developers to embed any character in web pages …

Unicode in Regular Expressions
high

Unicode-aware regular expressions let you match characters by script, category, or property rather than just explicit character ranges, making patterns …

Unicode in SQL
mid

SQL databases store text in encodings and collations that determine how characters are saved, compared, and sorted, with UTF-8 and …

Unicode in URLs
mid

URLs are technically restricted to ASCII characters, so non-ASCII text must be percent-encoded as UTF-8 bytes before being included in …
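
Percent-encoding as described above can be tried with Python's standard `urllib.parse` module. The sample path is made up for illustration.

```python
from urllib.parse import quote, unquote

path = "/caf\u00e9"                 # "/café", a hypothetical URL path

# quote() encodes the text as UTF-8 and percent-escapes every non-ASCII byte;
# "/" is left alone because it is in the default `safe` set.
encoded = quote(path)
print(encoded)                      # /caf%C3%A9

# unquote() reverses the process, decoding the bytes as UTF-8 by default.
print(unquote(encoded) == path)     # True
```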

Unicode Escape Sequences: Cross-Language Reference
high

Every major programming language has its own syntax for embedding Unicode characters as escape sequences in string literals, from \u0041 …

How to Handle Unicode in APIs and JSON
mid

JSON is defined as Unicode text, and the current specification (RFC 8259) requires UTF-8 for JSON exchanged between systems, but many real-world APIs still …

🔣

Symbol Reference

Comprehensive guides to specific types of Unicode symbols

Complete Arrow Symbols List
high

Unicode contains hundreds of arrow symbols spanning simple directional arrows, double arrows, curved arrows, and specialized mathematical arrows. This complete …

All Check Mark and Tick Symbols
high

Unicode provides multiple check mark and tick symbols ranging from the classic ✓ (U+2713) to heavy check marks, ballot boxes, …

Star and Asterisk Symbols
high

Unicode includes a rich collection of star shapes — from the simple five-pointed star ★ to outlined stars, six-pointed stars, …

Heart Symbols Complete Guide
high

Unicode contains dozens of heart symbols including the classic ♥, black and white hearts, suit hearts, decorative hearts, and a …

Currency Symbols Around the World
high

Unicode's Currency Symbols block and surrounding areas contain dedicated characters for over 60 world currencies, from the Dollar $ and …

Mathematical Symbols and Operators
high

Unicode has dedicated blocks for mathematical operators, arrows, letterlike symbols, and alphanumeric characters covering virtually every symbol used in modern …

Bracket and Parenthesis Symbols
mid

Beyond the ASCII parentheses and square brackets, Unicode includes angle brackets, curly brackets, fullwidth brackets, double brackets, and dozens of …

Bullet Point Symbols
high

Unicode offers a wide variety of bullet point characters beyond the standard •, including triangular bullets, dashes, arrows, and decorative …

Line and Box Drawing Characters
mid

Unicode's Box Drawing block contains 128 characters for drawing lines, corners, intersections, and borders in monospaced text environments like terminals …

Musical Note Symbols
mid

Unicode includes musical note symbols such as ♩♪♫♬ in the Miscellaneous Symbols block, as well as a dedicated Musical Symbols …

Fraction Symbols Guide
mid

Unicode includes precomposed fraction characters for common fractions like ½ ¼ ¾ ⅓ and many others, allowing fractions to be …

Superscript and Subscript Characters
mid

Unicode provides precomposed superscript and subscript digits and letters — such as ¹ ² ³ and ₁ ₂ ₃ — …

Circle Symbols
mid

Unicode contains dozens of circle symbols including filled circles, outlined circles, circles with dots, circled letters, and the various combining …

Square and Rectangle Symbols
mid

Unicode includes filled squares, outlined squares, small squares, medium squares, dashed squares, and rectangle variants spread across the Geometric Shapes …

Triangle Symbols
mid

Unicode provides a comprehensive set of triangle symbols in all orientations — up, down, left, right — in both filled …

Diamond Symbols
low

Unicode includes filled and outline diamond shapes, lozenge characters, and playing card suit diamonds across several blocks including Geometric Shapes …

Cross and X Mark Symbols
mid

Unicode provides various cross and X mark characters including the ballot X ✗, multiplication signs, plus signs, and decorative …

Dash and Hyphen Symbols Guide
high

The hyphen-minus on your keyboard is just one of Unicode's many dash characters, which also include the en dash, em …

Quotation Mark Symbols Complete Guide
mid

Unicode defines typographic quotation marks — curly quotes — for dozens of languages, including single and double guillemets, low-9 quotation …

Copyright, Trademark & Legal Symbols
mid

Unicode includes dedicated characters for the copyright symbol ©, registered trademark ®, trademark ™, service mark ℠, and other legal …

Degree and Temperature Symbols
high

The degree symbol ° (U+00B0) and dedicated Celsius ℃ and Fahrenheit ℉ characters are among the most commonly searched Unicode …

Circled and Enclosed Number Symbols
mid

Unicode's Enclosed Alphanumerics block provides circled numbers ①②③, parenthesized numbers ⑴⑵⑶, and other enclosed digit forms used in lists, annotations, …

Roman Numeral Symbols
mid

Unicode includes a Number Forms block with precomposed Roman numeral characters such as Ⅰ Ⅱ Ⅲ Ⅳ, distinct from the …

Greek Alphabet Symbols for Math and Science
high

Greek letters like α β γ δ π Σ Ω are widely used as mathematical and scientific symbols and are …

Decorative Dingbats
low

The Unicode Dingbats block (U+2700–U+27BF) contains 192 decorative symbols originally from the Zapf Dingbats typeface, including scissors, pencils, fleurons, and …

Playing Card Symbols
low

Unicode includes a Playing Cards block with characters for all 52 standard playing cards plus jokers, card backs, and the …

Chess Piece Symbols
low

Unicode provides characters for all six chess piece types in both white and black variants — ♔♕♖♗♘♙ and ♚♛♜♝♞♟ — …

Zodiac and Astrological Symbols
mid

Unicode's Miscellaneous Symbols block includes the 12 zodiac signs ♈♉♊♋♌♍♎♏♐♑♒♓, planetary symbols, and other astrological characters used in astrology and …

Braille Pattern Characters
low

Unicode's Braille Patterns block (U+2800–U+28FF) encodes all 256 possible combinations of the 8-dot braille cell, enabling braille text to be …

Geometric Shapes Complete Guide
mid

Unicode's Geometric Shapes block contains 96 characters covering circles, squares, triangles, diamonds, and other polygons in filled, outlined, and various …

Letterlike Symbols
low

The Unicode Letterlike Symbols block contains mathematical and technical symbols derived from letters, such as ℃ ℉ ℕ ℤ ℝ …

Technical Symbols Guide
low

Unicode's Miscellaneous Technical block contains symbols from computing, electronics, and engineering, including keyboard keys, optical character recognition characters, and control …

Combining Characters and Diacritics Guide
mid

Diacritics are accent marks and other marks that attach to letters to indicate pronunciation or meaning, represented in Unicode as …

Whitespace and Invisible Characters Guide
high

Unicode defines dozens of invisible characters beyond the ordinary space, including zero-width spaces, word joiners, soft hyphens, and various format …

Warning and Hazard Signs
mid

Unicode includes warning and hazard symbols such as the universal caution ⚠ sign, biohazard ☣, radiation ☢, and skull and …

Weather Symbols Guide
mid

Unicode's Miscellaneous Symbols and Dingbats blocks include sun ☀, cloud ☁, umbrella ☂, snowflake ❄, high voltage ⚡, and other weather symbols used …

Religious Symbols in Unicode
low

Unicode includes symbols for many of the world's major religions including the cross ✝, crescent and star ☪, Star of …

Gender and Identity Symbols
low

Unicode includes the traditional male ♂ and female ♀ symbols from astronomy, as well as newer gender-related characters and the …

Keyboard Shortcut Symbols Guide
mid

Apple's macOS uses Unicode characters for keyboard modifier keys such as ⌘ Command, ⌥ Option, ⇧ Shift, and ⌃ Control, …

Symbols for Social Media Bios
high

Unicode symbols like ▶ ◀ ► ★ ✦ ⚡ ✈ and hundreds of others are widely used in social media …

🧱

Block Explorer

Deep dives into notable Unicode blocks

Basic Latin (ASCII) Block
mid

The Basic Latin block (U+0000–U+007F) is the first Unicode block and covers the 128 original ASCII characters, including the English …

Latin-1 Supplement Block
low

The Latin-1 Supplement block (U+0080–U+00FF) extends ASCII with accented Latin characters for Western European languages and matches the original ISO …

General Punctuation Block
mid

The General Punctuation block (U+2000–U+206F) contains typographic spaces, dashes, quotation marks, and various punctuation characters used in professional typography across …

Mathematical Operators Block
mid

The Mathematical Operators block (U+2200–U+22FF) contains 256 symbols covering set theory, logic, calculus, and other areas of mathematics used in …

Arrows Block
mid

The Arrows block (U+2190–U+21FF) contains 112 arrow characters including simple directional arrows, double-headed arrows, curved arrows, and arrows with various …

Dingbats Block
low

The Dingbats block (U+2700–U+27BF) was created to encode the Zapf Dingbats typeface and contains 192 decorative ornaments, check marks, arrows, …

Miscellaneous Symbols Block
low

The Miscellaneous Symbols block (U+2600–U+26FF) is one of Unicode's most eclectic, containing weather symbols, card suits, zodiac signs, religious symbols, …

CJK Unified Ideographs Overview
mid

The CJK Unified Ideographs block (U+4E00–U+9FFF) is one of the largest Unicode blocks, containing over 20,000 Chinese, Japanese, and Korean …

Hangul Block
mid

The Hangul Syllables block (U+AC00–U+D7A3) contains 11,172 precomposed Korean syllable blocks algorithmically derived from 19 initial consonants, 21 vowels, and …
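
The algorithmic layout of the block can be sketched as follows. The function name `compose_hangul` and the index-based interface are assumptions for this example. The 28 final slots (27 final consonants plus "no final") are what make 19 × 21 × 28 equal 11,172.

```python
S_BASE = 0xAC00      # first precomposed syllable
L_COUNT = 19         # initial consonants
V_COUNT = 21         # vowels
T_COUNT = 28         # final consonants, including "none" at index 0


def compose_hangul(initial: int, vowel: int, final: int = 0) -> str:
    """Build a precomposed syllable from zero-based jamo indices."""
    if not (0 <= initial < L_COUNT and 0 <= vowel < V_COUNT and 0 <= final < T_COUNT):
        raise ValueError("jamo index out of range")
    return chr(S_BASE + (initial * V_COUNT + vowel) * T_COUNT + final)


print(L_COUNT * V_COUNT * T_COUNT)               # 11172
print(f"U+{ord(compose_hangul(0, 0)):04X}")      # U+AC00, the first syllable
print(f"U+{ord(compose_hangul(18, 0, 4)):04X}")  # U+D55C
print(f"U+{ord(compose_hangul(18, 20, 27)):04X}")  # U+D7A3, the last syllable
```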

Emoji Blocks Overview
mid

Emoji in Unicode span multiple blocks across the Supplementary Multilingual Plane, including Emoticons, Miscellaneous Symbols and Pictographs, and Transport and …

Currency Symbols Block
low

The Currency Symbols block (U+20A0–U+20CF) contains dedicated Unicode characters for currencies that lack an ASCII representation, such as the Euro …

Box Drawing & Block Elements Blocks
low

The Box Drawing block (U+2500–U+257F) and Block Elements block (U+2580–U+259F) provide characters for drawing text-based user interfaces and terminal art …

Enclosed Alphanumerics Block
low

The Enclosed Alphanumerics block (U+2460–U+24FF) contains circled numbers, parenthesized numbers and letters, and various other enclosed alphanumeric forms used in …

Geometric Shapes Blocks
low

The Geometric Shapes block (U+25A0–U+25FF) and related blocks contain squares, circles, triangles, and other polygons in filled, outlined, and partial …

Musical Symbols Block
low

The Musical Symbols block (U+1D100–U+1D1FF) is a Supplementary Multilingual Plane block containing 233 characters for full western music notation, including …

📜

Script Stories

Writing system deep dives with history and culture

Arabic Script Deep Dive
mid

Arabic is the third most widely used writing system in the world, written right-to-left with letters that change shape depending …

Devanagari Script Deep Dive
mid

Devanagari is an abugida script used to write Hindi, Sanskrit, Marathi, and many other South Asian languages, with complex conjunct …

Greek and Coptic
mid

Greek is one of the oldest alphabetic writing systems and gave Unicode many of its mathematical symbols, with the Greek …

Cyrillic Script
mid

Cyrillic is used to write Russian, Ukrainian, Bulgarian, Serbian, and over 50 other languages, making it one of the most …

Hebrew Script
low

Hebrew is an abjad script written right-to-left, used for Biblical Hebrew, Modern Hebrew, and Yiddish, with optional vowel diacritics called …

Thai Script
low

Thai is an abugida script with no spaces between words, complex vowel placement above and below consonants, and tone marks …

Japanese Writing Systems
mid

Japanese is unique in using three scripts simultaneously — Hiragana, Katakana, and Kanji (CJK ideographs) — alongside Latin text, making …

Korean Hangul System
mid

Hangul was invented in 1443 by King Sejong as a scientific alphabet where syllable blocks are algorithmically composed from individual …

Bengali Script
low

Bengali is an abugida script with over 300 million speakers, used for Bengali and Assamese, featuring complex conjunct consonant forms …

Tamil Script
low

Tamil is one of the oldest living writing systems, with a literary tradition spanning over 2,000 years, and its script …

Armenian Script
low

The Armenian alphabet was created in 405 AD by the monk Mesrop Mashtots and has remained largely unchanged for over …

Georgian Script
low

Georgian has three distinct historical scripts — Mkhedruli, Asomtavruli, and Nuskhuri — all encoded in Unicode, with modern Georgian using …

Ethiopic Script
low

The Ethiopic script (Ge'ez) is an abugida used to write Amharic, Tigrinya, Oromo, and many other languages of the Horn …

Dead Scripts in Unicode
low

Unicode encodes dozens of historic and extinct scripts — from Cuneiform and Egyptian Hieroglyphs to Linear B and Gothic — …

Writing Systems of the World
mid

There are hundreds of writing systems in use around the world today, from alphabets and syllabaries to abjads and abugidas, …

🔧

Practical Unicode

How-to guides for real-world Unicode tasks

How to Type Special Characters on Windows
high

Windows provides several methods for typing special characters and Unicode symbols, including Alt codes, the Character Map app, keyboard shortcuts, …

How to Type Special Characters on Mac
high

macOS makes it easy to type special characters and Unicode symbols through the Character Viewer, Option key shortcuts, the emoji …

How to Type Special Characters on Linux
mid

Linux offers multiple ways to insert Unicode characters, including Ctrl+Shift+U followed by a hex code point, compose key sequences, IBus …

Special Characters on Mobile (iOS/Android)
mid

Typing special Unicode characters on smartphones requires different techniques than on desktop — long-pressing keys, using third-party keyboards, or copy-pasting …

How to Fix Mojibake (Garbled Text)
mid

Mojibake is the garbled text you see when a file encoded in one character set is interpreted as another, producing …

Unicode in Databases
mid

Storing Unicode text in a database requires choosing the right charset, collation, and column type — a wrong choice can …

Unicode in Filenames
mid

Modern operating systems support Unicode filenames, but different filesystems use different encodings and normalization forms, causing cross-platform file sharing problems …

Unicode in Email
low

Email evolved from ASCII-only systems, and supporting Unicode in email subjects, bodies, and addresses requires encoding schemes like Quoted-Printable, Base64, …

Unicode in Domain Names (IDN)
mid

Internationalized Domain Names (IDNs) allow domain names to contain non-ASCII characters from any Unicode script, converted to ASCII-compatible encoding using …
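
For a rough feel of the ASCII-compatible form, Python's built-in `idna` codec can be used. Note that it implements the older IDNA 2003 rules, so treat this as a sketch rather than a production-grade conversion. The helper name `to_ascii` and the sample domain are illustrative assumptions.

```python
def to_ascii(domain: str) -> str:
    """Convert a Unicode domain name to its ASCII-compatible form.

    Uses Python's built-in "idna" codec, which follows IDNA 2003.
    """
    return domain.encode("idna").decode("ascii")


name = "m\u00fcnchen.de"                  # "münchen.de"
ascii_name = to_ascii(name)
print(ascii_name)                         # xn--mnchen-3ya.de

# Decoding with the same codec restores the Unicode form.
print(ascii_name.encode("ascii").decode("idna") == name)   # True
```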

Unicode for Accessibility
mid

Using Unicode symbols, special characters, and emoji in web content has important accessibility implications for screen readers, which may announce …

Unicode Text Direction: LTR vs RTL
mid

Unicode supports both left-to-right and right-to-left text through the bidirectional algorithm and explicit directional control characters, enabling correct display of …

Unicode Fonts: How Characters Get Rendered
mid

A font file only contains glyphs for a subset of Unicode characters, which means characters outside that subset fall back …

How to Find Any Unicode Character
high

Finding the exact Unicode character you need can be challenging given over 140,000 characters spread across 150+ scripts and dozens …

Unicode Copy and Paste Best Practices
mid

Copying and pasting text between applications can introduce invisible characters, change normalization forms, strip or mangle Unicode characters, and cause …

How to Create Fancy Text with Unicode
high

Unicode's Mathematical Alphanumeric Symbols block and other areas contain bold, italic, script, Fraktur, monospace, and double-struck variants of Latin letters …

🖥️

Platform Guides

Unicode usage in specific platforms and applications

Unicode in Microsoft Word
mid

Microsoft Word supports the full Unicode character set and provides several methods for inserting special characters, including Alt+X code point …

Unicode in Google Docs & Sheets
mid

Google Docs and Sheets use UTF-8 internally and provide a Special Characters panel for inserting Unicode symbols by drawing, searching …

Unicode in Terminal / Command Line
mid

Modern terminals support Unicode and UTF-8, but correctly displaying all Unicode characters requires a compatible terminal emulator, the right locale …

Unicode in PDF Documents
low

PDF supports Unicode text through embedded fonts and ToUnicode maps, but many PDFs created from scans or older tools produce …

Unicode in Excel
mid

Microsoft Excel stores text in Unicode but has historically struggled with non-Latin characters in CSV imports, RTL text layout, and …

Unicode in Social Media
mid

Social media platforms handle Unicode text with varying degrees of support, affecting how emoji, RTL text, special characters, and invisible …

Unicode in XML and JSON
mid

Both XML and JSON are defined to use Unicode text, but each has its own rules for encoding characters, escaping …

Unicode in Data Science and NLP
low

Natural language processing and data science pipelines frequently encounter Unicode issues including encoding errors, normalization mismatches, invisible characters, and language …

Unicode in QR Codes
low

QR codes can encode Unicode text using UTF-8, but many QR code generators and scanners default to ISO 8859-1, causing …

Unicode in Passwords: Security Implications
low

Allowing Unicode characters in passwords increases the keyspace and can improve security, but it also introduces normalization ambiguity, where the …

📖

Unicode History & Culture

Historical and cultural stories about Unicode

The Birth of ASCII (1963)
low

ASCII was created in 1963 by the American Standards Association to standardize communication between different computers and telegraph systems using …

EBCDIC: IBM's Alternative
low

EBCDIC (Extended Binary Coded Decimal Interchange Code) was IBM's character encoding used on its mainframe computers, incompatible with ASCII and …

The Unicode Consortium: Who Decides?
low

The Unicode Consortium is the non-profit organization responsible for developing and maintaining the Unicode standard, with members including Apple, Google, …

How New Characters Get Added to Unicode
mid

Adding a new character to Unicode requires submitting a detailed proposal to the Unicode Consortium that demonstrates the character's need, …

The Emoji Proposal Process
low

Getting a new emoji into Unicode requires a formal proposal to the Emoji Subcommittee that must meet strict selection criteria …

CJK Unification: Controversy and Compromise
low

CJK unification was Unicode's decision to assign the same code points to Chinese, Japanese, and Korean characters with the same …

The Mojibake Problem: A History
low

Mojibake — Japanese for 'character transformation' — is the garbled text that results from encoding mismatches, a problem that plagued …

Unicode Milestones
low

From the first Unicode draft in 1988 to the addition of emoji, the surpassing of 100,000 characters, and UTF-8 becoming …

How Unicode Changed the Internet
low

Before Unicode became universal, the web was fragmented by incompatible national encodings that made international websites unreliable and cross-language communication …

Fun Unicode Facts and Easter Eggs
low

Unicode is full of surprising, obscure, and occasionally humorous characters — from the interrobang ‽ to the snowman ☃ with …

🔒

Unicode Security

Security implications of Unicode

🎓

Advanced Topics

Deep technical topics for experienced developers