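Noncharacter
A code point permanently reserved for internal use (66 in total): U+FDD0 through U+FDEF, plus U+nFFFE and U+nFFFF in each plane. Noncharacters are valid in text but should not be used for external interchange.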
What is a Noncharacter?
A noncharacter is one of 66 specific Unicode code points that are permanently reserved and will never be assigned to any character. Unlike unassigned code points (which are available for future use), noncharacters are deliberately and irrevocably excluded from representing text data.
The Unicode Standard describes noncharacters as code points that are permanently reserved for internal use and that are not recommended for open interchange of Unicode text data. This is a recommendation about interchange, not a prohibition on their appearance in text.
The 66 Noncharacters
The 66 noncharacters fall into two groups:
Group 1, inside the Arabic Presentation Forms-A block (32 values):
U+FDD0–U+FDEF (32 code points)
Group 2, the last two positions of each plane (34 values):
U+FFFE, U+FFFF (BMP, Plane 0)
U+1FFFE, U+1FFFF (Plane 1)
U+2FFFE, U+2FFFF (Plane 2)
...
U+10FFFE, U+10FFFF (Plane 16)
Each of the 17 planes contributes 2 noncharacters at its very end: 0xFFFE and 0xFFFF in the
low 16 bits. This gives 17 × 2 = 34 noncharacters. Plus the 32 in U+FDD0–U+FDEF = 66 total.
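The arithmetic above can be checked by enumerating the code points directly. The following is a minimal sketch; the function name `all_noncharacters` is chosen here for illustration and does not come from any library.

```python
def all_noncharacters() -> list[int]:
    """Return every noncharacter code point in ascending order."""
    # The contiguous run of 32 code points U+FDD0..U+FDEF.
    block = list(range(0xFDD0, 0xFDEF + 1))
    # The last two code points of each of the 17 planes (0 through 16).
    plane_ends = [
        (plane << 16) | low
        for plane in range(17)
        for low in (0xFFFE, 0xFFFF)
    ]
    return sorted(block + plane_ends)


noncharacters = all_noncharacters()
print(len(noncharacters))             # 66
print(f"U+{noncharacters[0]:04X}")    # U+FDD0
print(f"U+{noncharacters[-1]:04X}")   # U+10FFFF
```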
Why Noncharacters Exist
The most important noncharacters are U+FFFE and U+FFFF:
U+FFFE is the byte-swapped image of U+FEFF (the Byte Order Mark). A UTF-16 stream that opens
with bytes FF FE is little-endian, and one that opens with FE FF is big-endian; read in the
correct order, the first code unit is U+FEFF in both cases. A reader that assumes the wrong
order sees U+FFFE instead, which can never be a real character, so it knows to swap the bytes.
Because the scheme depends on U+FFFE never appearing as text, it was never available for character use.
U+FFFF was historically used as a "no character" sentinel in character arrays and file
reading loops, similar to how -1 (EOF) is used in C's fgetc(). Reserving it prevents
confusion between the sentinel and actual text.
The FDD0–FDEF range and the per-plane pairs are reserved to give implementations maximum flexibility for internal sentinels and markers on all planes.
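The byte order check that U+FFFE makes possible can be sketched as follows. This is an illustrative example, not a complete encoding detector: the function name `detect_utf16_byte_order` is hypothetical, and the sketch assumes the input is already known to be UTF-16 (the bytes FF FE 00 00 would also match a UTF-32 little-endian BOM).

```python
def detect_utf16_byte_order(data: bytes) -> str:
    """Guess the byte order of a UTF-16 stream from its first code unit."""
    if len(data) < 2:
        return "unknown"
    # Read the first two bytes as if the stream were big-endian.
    first_unit = int.from_bytes(data[:2], "big")
    if first_unit == 0xFEFF:
        # Bytes FE FF: the BOM reads correctly, so the guess was right.
        return "big-endian"
    if first_unit == 0xFFFE:
        # Bytes FF FE: the BOM appears byte-swapped as the noncharacter
        # U+FFFE, so the stream must be little-endian.
        return "little-endian"
    return "unknown"  # no BOM present


print(detect_utf16_byte_order(b"\xff\xfeA\x00"))  # little-endian
print(detect_utf16_byte_order(b"\xfe\xff\x00A"))  # big-endian
print(detect_utf16_byte_order(b"\x00A"))          # unknown
```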
Conformance Requirements
Unicode conformance has specific rules about noncharacters:
- Conformant applications may use noncharacters internally
- Conformant applications should not generate noncharacters in open data interchange
- Conformant applications may accept and pass through noncharacters without error (they are not required to reject them)
This is subtler than it sounds: noncharacters are not invalid UTF-8 or UTF-16 sequences — they are valid encoded values of code points that are simply not intended for data interchange.
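One way an application might honor these rules is to use noncharacters freely in memory and replace them at the point where text leaves the process. The sketch below assumes a policy of substituting U+FFFD (REPLACEMENT CHARACTER); that policy and the function name `prepare_for_interchange` are illustrative choices, since the rules above also permit passing noncharacters through unchanged.

```python
def is_noncharacter(cp: int) -> bool:
    """Return True if cp is one of the 66 Unicode noncharacters."""
    if not 0 <= cp <= 0x10FFFF:
        return False
    # U+FDD0..U+FDEF, or low 16 bits equal to FFFE or FFFF.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def prepare_for_interchange(text: str, replacement: str = "\uFFFD") -> str:
    """Replace each noncharacter in text before sending it elsewhere."""
    return "".join(
        replacement if is_noncharacter(ord(ch)) else ch for ch in text
    )


# U+FDD0 used internally as a private field separator (hypothetical use).
internal = "key\uFDD0value"
print(prepare_for_interchange(internal) == "key\uFFFDvalue")  # True
print(prepare_for_interchange("plain text"))                  # plain text
```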
Noncharacters vs Surrogates
| Property | Noncharacters | Surrogates |
|---|---|---|
| Code range | U+FDD0–U+FDEF, U+xFFFE/U+xFFFF | U+D800–U+DFFF |
| Count | 66 | 2,048 |
| Valid in UTF-8? | Yes (encodable) | No (ill-formed) |
| Valid in UTF-32? | Yes | No (ill-formed, though some implementations tolerate them) |
| Can be exchanged? | Not recommended | No |
| Purpose | Internal sentinels | UTF-16 supplementary encoding |
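The "Valid in UTF-8?" row can be observed with Python's built-in codecs, which encode noncharacters normally but refuse lone surrogates. This is a small sketch; the helper name `try_encode` is made up for the example, and `bytes.hex(" ")` assumes Python 3.8 or later.

```python
def try_encode(cp: int, encoding: str = "utf-8") -> str:
    """Return the encoded bytes of a code point as hex, or a note if refused."""
    try:
        return chr(cp).encode(encoding).hex(" ")
    except UnicodeEncodeError:
        return "ill-formed (encoder refused)"


print(try_encode(0xFFFF))  # ef bf bf
print(try_encode(0xFDD0))  # ef b7 90
print(try_encode(0xD800))  # ill-formed (encoder refused)
```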
Detecting Noncharacters
def is_noncharacter(cp: int) -> bool:
    """Return True if cp is one of the 66 Unicode noncharacters."""
    # Reject values outside the Unicode code space; a negative value
    # such as -2 would otherwise pass the mask test below.
    if not 0 <= cp <= 0x10FFFF:
        return False
    # FDD0-FDEF range
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # xFFFE and xFFFF for each plane
    last_two = cp & 0xFFFF
    return last_two in (0xFFFE, 0xFFFF)

# Examples
print(is_noncharacter(0xFDD0))    # True
print(is_noncharacter(0xFFFF))    # True
print(is_noncharacter(0x1FFFF))   # True
print(is_noncharacter(0x10FFFF))  # True
print(is_noncharacter(0x0041))    # False (LATIN CAPITAL LETTER A)
Quick Facts
| Property | Value |
|---|---|
| Total count | 66 |
| First range | U+FDD0–U+FDEF (32 values) |
| Per-plane pairs | xFFFE and xFFFF for each of 17 planes (34 values) |
| General category | Cn (Unassigned) |
| Permanently reserved? | Yes — will never be assigned characters |
| Valid UTF-8? | Yes (they can be encoded) |
| Intended for interchange? | No — internal use only |
| Most notable | U+FFFE (byte order detection), U+FFFF (sentinel) |
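The general category listed above can be read from Python's standard `unicodedata` module. The sketch below uses a helper named `describe`, introduced only for this example. Note that Cn is also the category of ordinary unassigned code points, so the category alone cannot identify a noncharacter; a range check is still needed.

```python
import unicodedata


def describe(cp: int) -> tuple[str, str]:
    """Return (general category, name or placeholder) for a code point."""
    ch = chr(cp)
    # Noncharacters have no character name, so supply a default.
    return unicodedata.category(ch), unicodedata.name(ch, "<no name>")


for cp in (0xFDD0, 0xFFFE, 0x10FFFF):
    category, name = describe(cp)
    print(f"U+{cp:04X} {category} {name}")
# U+FDD0 Cn <no name>
# U+FFFE Cn <no name>
# U+10FFFF Cn <no name>
```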