What is a Private Use Area?

Reserved ranges in which organizations can assign their own characters: the BMP PUA (U+E000–U+F8FF) and the supplementary PUAs in planes 15 and 16.

Unicode Standard

Noncharacter

Code points permanently reserved for internal use (66 in total): U+FDD0–U+FDEF and U+nFFFE/U+nFFFF for each plane. Valid in text, but they should not be exchanged externally.

2021-08-18 · Updated 2024-04-15

What is a Noncharacter?

A noncharacter is one of 66 specific Unicode code points that are permanently reserved and will never be assigned to any character. Unlike other unassigned code points, which remain available for future assignment, noncharacters are deliberately and irrevocably excluded from representing text data.

The Unicode Standard describes noncharacters as code points that are permanently reserved for internal use and are not recommended for use in open interchange of Unicode text data. Corrigendum #9 clarified that this is a recommendation, not a prohibition.

The 66 Noncharacters

The 66 noncharacters fall into two groups:

Group 1: a contiguous range in the Arabic Presentation Forms-A block (32 values):

U+FDD0–U+FDEF  (32 code points)

Group 2: the last two code points of each plane (34 values):

U+FFFE,  U+FFFF   (BMP, Plane 0)
U+1FFFE, U+1FFFF  (Plane 1)
U+2FFFE, U+2FFFF  (Plane 2)
...
U+10FFFE, U+10FFFF (Plane 16)

Each of the 17 planes contributes 2 noncharacters at its very end: 0xFFFE and 0xFFFF in the low 16 bits. This gives 17 × 2 = 34 noncharacters. Plus the 32 in U+FDD0–U+FDEF = 66 total.
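The arithmetic above can be checked by enumerating both groups. This is a minimal sketch; the variable names `contiguous`, `plane_ends`, and `noncharacters` are chosen here for illustration only.

```python
# Group 1: the contiguous range U+FDD0..U+FDEF (inclusive).
contiguous = list(range(0xFDD0, 0xFDEF + 1))

# Group 2: the last two code points of each of the 17 planes (0..16).
plane_ends = [
    (plane << 16) | low
    for plane in range(17)
    for low in (0xFFFE, 0xFFFF)
]

noncharacters = contiguous + plane_ends

print(len(contiguous))             # 32
print(len(plane_ends))             # 34
print(len(noncharacters))          # 66
print(f"U+{plane_ends[0]:04X}")    # U+FFFE
print(f"U+{plane_ends[-1]:04X}")   # U+10FFFF
```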

Why Noncharacters Exist

The most important noncharacters are U+FFFE and U+FFFF:

U+FFFE is the byte-swapped form of U+FEFF (the Byte Order Mark). A UTF-16 stream that opens with bytes FE FF is big-endian; one that opens with FF FE is little-endian. A reader that assumes the wrong byte order decodes the BOM as U+FFFE instead of U+FEFF, and because U+FFFE can never be a character, the reader knows it must swap the bytes. This detection only works because U+FFFE is never assigned to a character.
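The byte order check can be sketched as follows. This is a minimal illustration rather than a full UTF-16 decoder, and the helper names `detect_utf16_byte_order` and `first_code_unit` are hypothetical.

```python
def detect_utf16_byte_order(data: bytes) -> str:
    """Return 'little', 'big', or 'unknown' based on a leading BOM."""
    head = data[:2]
    if head == b"\xff\xfe":
        return "little"   # U+FEFF written least significant byte first
    if head == b"\xfe\xff":
        return "big"      # U+FEFF written most significant byte first
    return "unknown"      # no BOM present


def first_code_unit(data: bytes, assumed_order: str) -> int:
    """Decode the first 16-bit code unit under an assumed byte order."""
    return int.from_bytes(data[:2], assumed_order)


# A little-endian stream that starts with a BOM: bytes FF FE ...
stream = "\ufeffhi".encode("utf-16-le")

print(detect_utf16_byte_order(stream))               # little
print(f"U+{first_code_unit(stream, 'little'):04X}")  # U+FEFF (correct guess)
print(f"U+{first_code_unit(stream, 'big'):04X}")     # U+FFFE (wrong guess)
```

Seeing U+FFFE as the first code unit tells the reader that its assumed byte order is the opposite of the one used by the writer.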

U+FFFF was historically used as a "no character" sentinel in character arrays and file reading loops, similar to how -1 (EOF) is used in C's fgetc(). Reserving it prevents confusion between the sentinel and actual text.

The FDD0–FDEF range and the per-plane pairs are reserved to give implementations maximum flexibility for internal sentinels and markers on all planes.

Conformance Requirements

Unicode conformance has specific rules about noncharacters:

Conformant applications may use noncharacters internally
Conformant applications should not generate noncharacters in open data interchange
Conformant applications may accept and pass through noncharacters without error (they are not required to reject them)

This is subtler than it sounds: noncharacters are not invalid UTF-8 or UTF-16 sequences — they are valid encoded values of code points that are simply not intended for data interchange.
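As a rough illustration, Python's built-in codecs encode and decode a noncharacter without complaint. The `replace_noncharacters` helper below is a hypothetical sketch of one possible policy for outgoing data (substituting U+FFFD); it is not something the standard requires.

```python
NONCHAR = "\uffff"

# The noncharacter is a well-formed UTF-8 sequence and round-trips intact.
utf8 = NONCHAR.encode("utf-8")
print(utf8)                              # b'\xef\xbf\xbf'
print(utf8.decode("utf-8") == NONCHAR)   # True


def replace_noncharacters(text: str, replacement: str = "\ufffd") -> str:
    """Hypothetical interchange policy: swap noncharacters for U+FFFD."""
    def is_nonchar(cp: int) -> bool:
        # (cp & 0xFFFE) == 0xFFFE matches xFFFE and xFFFF on every plane.
        return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

    return "".join(replacement if is_nonchar(ord(ch)) else ch for ch in text)


print(replace_noncharacters("a\uffffb\ufdd0") == "a\ufffdb\ufffd")  # True
```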

Noncharacters vs Surrogates

Property	Noncharacters	Surrogates
Code range	U+FDD0–U+FDEF, U+xFFFE/U+xFFFF	U+D800–U+DFFF
Count	66	2,048
Valid in UTF-8?	Yes (encodable)	No (ill-formed)
Valid in UTF-32?	Yes	No (ill-formed)
Can be exchanged?	Not recommended	No
Purpose	Internal sentinels	UTF-16 supplementary encoding
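The encoding rows of the table can be observed with Python's strict codecs, which accept noncharacters but reject lone surrogates. This is a small sketch; `can_encode` is a hypothetical helper, and the behavior shown is that of CPython's built-in codecs, not a statement about every implementation.

```python
def can_encode(cp: int, encoding: str) -> bool:
    """Return True if the code point encodes under the strict codec."""
    try:
        chr(cp).encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


print(can_encode(0xFFFE, "utf-8"))    # True  (noncharacter)
print(can_encode(0xFDD0, "utf-8"))    # True  (noncharacter)
print(can_encode(0xD800, "utf-8"))    # False (surrogate)
print(can_encode(0xFFFE, "utf-32"))   # True  (noncharacter)
print(can_encode(0xD800, "utf-32"))   # False (surrogate)
```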

Detecting Noncharacters

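def is_noncharacter(cp: int) -> bool:
    """Return True if cp is one of the 66 Unicode noncharacters."""
    # Reject values outside the Unicode codespace (including negatives,
    # whose low 16 bits would otherwise match the mask below)
    if not 0 <= cp <= 0x10FFFF:
        return False
    # FDD0-FDEF range
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # xFFFE and xFFFF at the end of each plane
    return (cp & 0xFFFF) in (0xFFFE, 0xFFFF)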

# Examples
print(is_noncharacter(0xFDD0))    # True
print(is_noncharacter(0xFFFF))    # True
print(is_noncharacter(0x1FFFF))   # True
print(is_noncharacter(0x10FFFF))  # True
print(is_noncharacter(0x0041))    # False (LATIN CAPITAL LETTER A)

Quick Facts

Property	Value
Total count	66
First range	U+FDD0–U+FDEF (32 values)
Per-plane pairs	xFFFE and xFFFF for each of 17 planes (34 values)
General category	Cn (Unassigned)
Permanently reserved?	Yes — will never be assigned characters
Valid UTF-8?	Yes (they can be encoded)
Intended for interchange?	No — internal use only
Most notable	U+FFFE (byte order detection), U+FFFF (sentinel)
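The general category listed above can be confirmed with Python's standard `unicodedata` module. This is a quick check against whatever Unicode database version ships with the interpreter; the sample code points are picked for illustration.

```python
import unicodedata

# Noncharacters from both groups, including the last code point of plane 16.
samples = (0xFDD0, 0xFDEF, 0xFFFE, 0xFFFF, 0x1FFFF, 0x10FFFF)
categories = {f"U+{cp:04X}": unicodedata.category(chr(cp)) for cp in samples}
print(categories)   # every value is 'Cn'

# For contrast: private use, surrogate, and an ordinary letter.
print(unicodedata.category(chr(0xE000)))   # Co
print(unicodedata.category(chr(0xD800)))   # Cs
print(unicodedata.category("A"))           # Lu
```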

More in Unicode Standard

Abstract Character

A unit of information used for the organization, control, or representation of textual data; the conceptual …

Basic Multilingual Plane (BMP)

Plane 0 (U+0000–U+FFFF), containing the most commonly used characters, including Latin, Greek, …

CJK

Chinese, Japanese, and Korean: the collective term for the unified Han ideograph block and …

Code Unit

The smallest unit of encoding: an 8-bit byte in UTF-8, a 16-bit word in UTF-16, a …

Code Point

A numeric value in the Unicode codespace (U+0000 to U+10FFFF), written as U+XXXX. Not …

Codespace

The full range of possible Unicode code points: U+0000 to U+10FFFF (1,114,112 in total), divided into …

Plane

A contiguous block of 65,536 code points. Unicode has 17 planes (0–16): Plane …

Supplementary Plane

Planes 1–16 (U+10000–U+10FFFF), containing emoji, historic scripts, CJK extensions, and musical notation. Requires surrogate pairs …

Surrogate

Code points U+D800–U+DFFF, reserved exclusively for UTF-16 surrogate pairs. Not valid Unicode scalar values and must never …

Han Unification

The process of mapping Chinese, Japanese, and Korean ideographs that share a …
