Abstract Character
A unit of information used to organize, control, or represent textual data: the conceptual entity that exists before a code point is assigned to it.
What is an Abstract Character?
An abstract character is a unit of information used for the organization, control, or representation of textual data — independent of any particular representation in digital form. The Unicode Standard uses this term to describe what characters fundamentally are: abstract entities defined by their meaning and usage, not by their visual appearance or their encoding.
The abstract character "LATIN SMALL LETTER A" is a single concept shared by its italicized form, its bold form, its Times New Roman rendering, and its Arial rendering. It does not matter which font, size, or style is used — all these are representations of the same abstract character.
Abstract Character vs Glyph vs Code Point
Unicode carefully distinguishes three related but distinct concepts:
| Concept | Definition | Example for "a" |
|---|---|---|
| Abstract character | The meaning/identity of the character | The letter "a" as a concept |
| Code point | The integer assigned to represent this character | U+0061 |
| Glyph | The visual shape as rendered by a font | The pixel pattern for "a" in Arial |
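In most cases, one abstract character corresponds to exactly one code point. There are exceptions: some abstract characters can only be represented by a sequence of code points, and a few are encoded at more than one code point (for example, U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE and U+212B ANGSTROM SIGN). One abstract character may also have many glyphs across different fonts, sizes, and rendering contexts.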
The reverse is also interesting: sometimes multiple abstract characters share visually identical glyphs (homoglyphs). U+006C (l), U+0049 (I), and U+007C (|) can look identical in some fonts — but they are three distinct abstract characters.
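The distinction can be sketched in Python with the standard `unicodedata` module. The helper name `describe` is hypothetical and introduced only for illustration; the sketch prints the code point and character name for the three look-alike characters, then shows a case where two code points stand for the same abstract character.

```python
import unicodedata

def describe(ch: str) -> str:
    """Return 'U+XXXX NAME' for a single code point (illustrative helper)."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

# Three glyphs that can look alike, three distinct abstract characters.
for ch in ("l", "I", "|"):
    print(describe(ch))
# U+006C LATIN SMALL LETTER L
# U+0049 LATIN CAPITAL LETTER I
# U+007C VERTICAL LINE

# Two code points for one abstract character: U+212B is canonically
# equivalent to U+00C5, so normalization maps it there.
angstrom = "\u212B"
print(describe(angstrom))                               # U+212B ANGSTROM SIGN
print(describe(unicodedata.normalize("NFC", angstrom)))  # U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE
```

Nothing in this output says anything about glyphs: how each character looks is decided by the font at rendering time, not by the code point or its name.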
Abstract Characters and Grapheme Clusters
Complicating the picture slightly: what a user perceives as a single "character" (a grapheme cluster) may correspond to multiple Unicode code points, each representing a separate abstract character.
For example:
- The Swedish letter "ä" can be encoded as a single code point U+00E4 (LATIN SMALL LETTER A WITH DIAERESIS) or as two code points: U+0061 (a) + U+0308 (COMBINING DIAERESIS).
- In both cases, there is one user-perceived character (grapheme cluster).
- In the two-code-point form, there are two abstract characters: the letter "a" and the combining diaeresis.
This is the basis for Unicode normalization: NFD (canonical decomposition) breaks precomposed characters into a base character followed by combining marks; NFC (canonical decomposition followed by canonical composition) recombines them wherever a precomposed form exists.
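As a rough illustration of the counts involved, the following Python sketch lists the code points in each form of "ä". The helper `code_points` is a hypothetical name used only here. Note that Python's `len` counts code points, not grapheme clusters; the standard library does not segment grapheme clusters, so counting user-perceived characters would need a separate segmentation library.

```python
import unicodedata

composed = "\u00E4"     # ä as one precomposed code point
decomposed = "a\u0308"  # a followed by COMBINING DIAERESIS

def code_points(text: str) -> list:
    """List each code point in text as 'U+XXXX NAME' (illustrative helper)."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]

print(len(composed), code_points(composed))
# 1 ['U+00E4 LATIN SMALL LETTER A WITH DIAERESIS']

print(len(decomposed), code_points(decomposed))
# 2 ['U+0061 LATIN SMALL LETTER A', 'U+0308 COMBINING DIAERESIS']
```

Both strings display as one user-perceived character, yet they differ in length, which is why text should be normalized to one form before comparing or counting.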
Abstract Characters and Encoding
The Unicode Standard defines abstract characters first, then assigns code points to represent them. This ordering is philosophically important: code points are labels for abstract characters, not the characters themselves. The same abstract character could theoretically be assigned a different code point in a hypothetical redesign (though Unicode's stability policies prevent any such change).
```python
import unicodedata

# Two encodings of the same user-perceived character "ä" (a with diaeresis)
composed = "\u00E4"          # One code point: ä
decomposed = "\u0061\u0308"  # Two code points: a + combining diaeresis

print(composed == decomposed)  # False: different code point sequences
print(unicodedata.normalize("NFC", decomposed) == composed)  # True: NFC composes
print(unicodedata.normalize("NFD", composed) == decomposed)  # True: NFD decomposes
```
Named vs Unnamed Abstract Characters
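Most abstract characters in Unicode have individually assigned official names (e.g., LATIN SMALL LETTER A). However, some categories of code points are handled differently: their names are derived by rule rather than listed one by one, or they have no formal name at all:
- CJK Unified Ideographs: Named algorithmically from the code point (CJK UNIFIED IDEOGRAPH-4E00)
- Hangul Syllables: Named algorithmically from their component jamo (HANGUL SYLLABLE GA)
- Control characters: Have no formal character name (the code charts label them `<control>`); function names such as NULL and LINE FEED are recorded as formal aliases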
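These cases can be checked with Python's `unicodedata` module, as a small sketch. It relies on `unicodedata.name` reporting the Unicode Name property, which is empty for control characters, and on `unicodedata.lookup` resolving name aliases (supported since Python 3.3). The placeholder string `"<no name>"` is an arbitrary default chosen for this example.

```python
import unicodedata

# Algorithmically derived names
print(unicodedata.name("\u4E00"))  # CJK UNIFIED IDEOGRAPH-4E00
print(unicodedata.name("\uAC00"))  # HANGUL SYLLABLE GA

# Control characters have no Name property value, so name() would raise
# ValueError; passing a default avoids the exception.
print(unicodedata.name("\u0000", "<no name>"))  # <no name>

# The familiar control names are aliases, which lookup() resolves.
print(unicodedata.lookup("NULL") == "\u0000")  # True
```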
Historical Significance
The concept of abstract character was central to justifying Unicode's approach to CJK unification. Japanese kanji, Chinese hànzì, and Korean hanja often share the same abstract character (the same ideograph with the same meaning and origin), even when they are rendered differently in national typefaces. Unicode assigns a single code point to each abstract CJK character, regardless of its rendering variants — a decision called Han unification.
Quick Facts
| Property | Value |
|---|---|
| Defined by | Unicode Standard Chapter 3 |
| Relationship to code point | 1:1 (usually) |
| Relationship to glyph | 1:many (one character, many fonts/renderings) |
| Relationship to grapheme cluster | Many abstract characters can form 1 cluster |
| Normalization connection | NFD decomposes to sequences of abstract characters |
| Han unification | Multiple CJK writing systems share abstract characters |