What is a Code Point?

A code point is a numeric value in the Unicode codespace (U+0000 to U+10FFFF), written in the form U+XXXX. Not all code points are assigned to characters.

A glyph is the way a character is visually represented by a font. A character can have more than one glyph (ligatures, contextual forms); a glyph can represent more than one character.

Unicode Standard

Abstract Character

A unit of information used for the organization, control, or representation of textual data: the conceptual entity before it is assigned a code point.

2021-11-15 · Updated 2024-06-24

What is an Abstract Character?

An abstract character is a unit of information used for the organization, control, or representation of textual data — independent of any particular representation in digital form. The Unicode Standard uses this term to describe what characters fundamentally are: abstract entities defined by their meaning and usage, not by their visual appearance or their encoding.

The abstract character "LATIN SMALL LETTER A" is a single concept shared by its italicized form, its bold form, its Times New Roman rendering, and its Arial rendering. It does not matter which font, size, or style is used — all these are representations of the same abstract character.

Abstract Character vs Glyph vs Code Point

Unicode carefully distinguishes three related but distinct concepts:

Concept	Definition	Example for "a"
Abstract character	The meaning/identity of the character	The letter "a" as a concept
Code point	The integer assigned to represent this character	U+0061
Glyph	The visual shape as rendered by a font	The pixel pattern for "a" in Arial

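In most cases, one abstract character is encoded as exactly one code point. There are exceptions in both directions: a few abstract characters are encoded more than once (for example, U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE and U+212B ANGSTROM SIGN), and some can only be represented by a sequence of code points. One abstract character may also have many glyphs across different fonts, sizes, and rendering contexts.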

The reverse is also interesting: sometimes multiple abstract characters share visually identical glyphs (homoglyphs). U+006C (l), U+0049 (I), and U+007C (|) can look identical in some fonts — but they are three distinct abstract characters.
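
As a small illustration of the look-alike case, the following Python sketch prints the code point and official name of each of the three characters. The helper name describe is hypothetical and chosen only for this example; it relies on the standard library's unicodedata module.

```python
import unicodedata

# Three distinct abstract characters whose glyphs can look alike in some fonts.
lookalikes = ["\u006C", "\u0049", "\u007C"]


def describe(ch: str) -> str:
    """Return 'U+XXXX NAME' for a string holding a single code point."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


descriptions = [describe(ch) for ch in lookalikes]
for line in descriptions:
    print(line)
# U+006C LATIN SMALL LETTER L
# U+0049 LATIN CAPITAL LETTER I
# U+007C VERTICAL LINE
```

However similar the glyphs appear on screen, the code points and names differ, which is what makes them separate abstract characters.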

Abstract Characters and Grapheme Clusters

Complicating the picture slightly: what a user perceives as a single "character" (a grapheme cluster) may correspond to multiple Unicode code points, each representing a separate abstract character.

For example:

- The Swedish letter "ä" can be encoded as a single code point U+00E4 (LATIN SMALL LETTER A WITH DIAERESIS) or as two code points: U+0061 (a) + U+0308 (COMBINING DIAERESIS).
- In both cases, there is one user-perceived character (grapheme cluster).
- In the two-code-point form, there are two abstract characters: the letter "a" and the combining diaeresis.

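This is the basis for Unicode normalization: NFD (Normalization Form D, canonical decomposition) breaks precomposed characters into sequences of abstract characters; NFC (Normalization Form C, canonical decomposition followed by canonical composition) recombines them wherever a precomposed form exists.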
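
To make the decomposition visible, this Python sketch lists the code points and names that "ä" consists of before and after normalization. The helper code_point_names is a hypothetical name used only here, and the type hint assumes Python 3.9 or later.

```python
import unicodedata


def code_point_names(text: str) -> list[str]:
    """List 'U+XXXX NAME' for every code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


composed = "\u00E4"                          # precomposed ä
nfd = unicodedata.normalize("NFD", composed)  # canonical decomposition
nfc = unicodedata.normalize("NFC", nfd)       # decomposition, then recomposition

print(code_point_names(composed))  # ['U+00E4 LATIN SMALL LETTER A WITH DIAERESIS']
print(code_point_names(nfd))       # ['U+0061 LATIN SMALL LETTER A', 'U+0308 COMBINING DIAERESIS']
print(len(composed), len(nfd), len(nfc))  # 1 2 1
```

Note that len() counts code points, not grapheme clusters: both forms are one user-perceived character, but the NFD form has length 2.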

Abstract Characters and Encoding

The Unicode Standard defines abstract characters first, then assigns code points to represent them. This ordering is philosophically important: code points are labels for abstract characters, not the characters themselves. The same abstract character could theoretically be assigned a different code point in a hypothetical redesign (though Unicode's stability policies prevent any such change).

import unicodedata

# Two representations of the same user-perceived character "ä" (a with diaeresis)
composed = "\u00E4"          # Single code point: ä
decomposed = "\u0061\u0308"  # Two code points: a + combining diaeresis

print(composed == decomposed)  # False: the code point sequences differ
print(unicodedata.normalize("NFC", decomposed) == composed)  # True: NFC composes to U+00E4
print(unicodedata.normalize("NFD", composed) == decomposed)  # True: NFD decomposes to a + U+0308
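
The idea that a code point is only a label can be pushed one level further: the label itself can be serialized in more than one way. The sketch below is an illustration under assumptions, not something the glossary entry prescribes; the choice of U+00E4 and of the three encoding forms is arbitrary.

```python
# One abstract character, one code point, several byte-level representations.
ch = "\u00E4"  # LATIN SMALL LETTER A WITH DIAERESIS

encodings = ["utf-8", "utf-16-be", "utf-32-be"]
encoded = {enc: ch.encode(enc) for enc in encodings}

for enc, data in encoded.items():
    print(f"{enc:10} {data.hex()}")
# utf-8      c3a4
# utf-16-be  00e4
# utf-32-be  000000e4

# Every byte sequence decodes back to the same code point.
round_trips = all(data.decode(enc) == ch for enc, data in encoded.items())
print(round_trips)  # True
```

The bytes differ in every form, yet each decodes to U+00E4, so neither the character's identity nor its code point depends on how it is stored.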

Named vs Unnamed Abstract Characters

Most abstract characters in Unicode have individually assigned official names (e.g., LATIN SMALL LETTER A). However, some categories of code points get their names by rule or have no formal name at all:

CJK Unified Ideographs: Named algorithmically from the code point (CJK UNIFIED IDEOGRAPH-4E00)
Hangul Syllables: Named algorithmically from their component jamo (HANGUL SYLLABLE GA)
Control characters: Have no formal character name; they are identified by aliases describing their function (NULL, LINE FEED, etc.)
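
These categories can be checked from Python's unicodedata module, as in the sketch below. The sample characters and dictionary labels are chosen only for this example. Passing None as the default to unicodedata.name avoids the ValueError that Python raises for control characters, which carry no formal name in the character database; labels such as LINE FEED are resolved as aliases by unicodedata.lookup (Python 3.3 or later).

```python
import unicodedata

samples = {
    "individually named": "\u0061",
    "CJK ideograph": "\u4E00",
    "Hangul syllable": "\uAC00",
    "control character": "\u000A",
}

# name() returns the default (None here) when no formal name exists.
names = {label: unicodedata.name(ch, None) for label, ch in samples.items()}
for label, name in names.items():
    print(f"{label:20} {name}")
# individually named   LATIN SMALL LETTER A
# CJK ideograph        CJK UNIFIED IDEOGRAPH-4E00
# Hangul syllable      HANGUL SYLLABLE GA
# control character    None

# The alias still resolves to the control character's code point.
line_feed = unicodedata.lookup("LINE FEED")
print(f"U+{ord(line_feed):04X}")  # U+000A
```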

Historical Significance

The concept of abstract character was central to justifying Unicode's approach to CJK unification. Japanese kanji, Chinese hànzì, and Korean hanja often share the same abstract character (the same ideograph with the same meaning and origin), even when they are rendered differently in national typefaces. Unicode assigns a single code point to each abstract CJK character, regardless of its rendering variants — a decision called Han unification.

Quick Facts

Property	Value
Defined by	Unicode Standard Chapter 3
Relationship to code point	1:1 (usually)
Relationship to glyph	1:many (one character, many fonts/renderings)
Relationship to grapheme cluster	Many abstract characters can form 1 cluster
Normalization connection	NFD decomposes to sequences of abstract characters
Han unification	Multiple CJK writing systems share abstract characters

Related Terms

Code point, Glyph

More in Unicode Standard

Unassigned code point

A code point that has not yet been assigned to a character in any version of Unicode, as Cn (Unassigned) …

Assigned character

A code point that has been assigned a character in a version of Unicode. As of Unicode 16.0, 1,114,112 …

Reserved code point

A code point reserved for future standardization; from permanently reserved noncharacters and user …

Basic Multilingual Plane (BMP)

Plane 0 (U+0000–U+FFFF), including Latin, Greek, Cyrillic, CJK, Arabic, and most symbols …

CJK

Chinese, Japanese, and Korean: the unified Han ideograph block in Unicode and related …

Plane

A contiguous block of 65,536 code points. Unicode has 17 planes (0–16): Plane …

Supplementary plane

Planes 1–16 (U+10000–U+10FFFF), containing emoji, historic scripts, CJK extensions, and musical notation. …

Han Unification

The process of mapping Chinese, Japanese, and Korean ideographs that share a …

Hangul Jamo

The individual consonant and vowel components (jamo) of the Korean Hangul writing …

ISO 10646 / Universal Character Set

Synchronized with Unicode, defining the same character repertoire and code points, but …
