Replacement Character
U+FFFD (�). Displayed when a decoder encounters invalid byte sequences: the universal symbol for "something went wrong while decoding".
What Is the Replacement Character?
The Replacement Character is U+FFFD (�, written as \ufffd in many programming languages and usually drawn as a diamond containing a question mark). It is the designated substitute that decoders and encoding converters insert when they encounter a byte sequence that cannot be decoded, is invalid for the declared encoding, or has no valid mapping to a code point.
The character was chosen because its code point 0xFFFD is in the Specials block and has no other assigned meaning — it exists solely as an error sentinel.
When It Appears
- Invalid UTF-8 sequences: A lead byte that announces a multi-byte sequence but is not followed by the correct continuation bytes, or a byte such as 0xFF that can never occur in UTF-8.
- Truncated sequences: A multi-byte UTF-8 sequence cut off at the end of a buffer.
- Surrogates in UTF-8: Lone or paired surrogates (U+D800–U+DFFF) encoded in UTF-8, which is technically invalid.
- Out-of-range code points: Byte sequences that would decode to a value above U+10FFFF, the upper limit of the Unicode code space.
- Unmappable characters: When converting between encodings and a character has no equivalent.
Python
# UTF-8 decoding with errors="replace"
b"\xff\xfe".decode("utf-8", errors="replace") # "\ufffd\ufffd"
b"\xe4\xb8".decode("utf-8", errors="replace") # "\ufffd" (truncated CJK)
b"\xed\xa0\x80".decode("utf-8", errors="replace") # "\ufffd\ufffd\ufffd" (surrogate in UTF-8)
# str: U+FFFD as a Python character
REPLACEMENT = "\uFFFD"
REPLACEMENT == "�" # Same character
ord(REPLACEMENT) # 65533
# Check for replacement characters in decoded text
def has_decoding_errors(text: str) -> bool:
    return "\uFFFD" in text
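Truncated sequences often come from reading a stream in chunks rather than from bad data. The sketch below assumes the input arrives as an iterable of byte chunks and uses the standard library's incremental decoder so that a character split across two chunks is not replaced. The helper name `decode_chunks` is illustrative only.

```python
import codecs


def decode_chunks(chunks, encoding="utf-8"):
    """Decode an iterable of byte chunks, buffering sequences split across chunks."""
    decoder = codecs.getincrementaldecoder(encoding)(errors="replace")
    parts = [decoder.decode(chunk) for chunk in chunks]
    # final=True flushes the decoder: a sequence that is still incomplete
    # at this point is genuinely truncated and becomes U+FFFD.
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)


# "中" is b"\xe4\xb8\xad" in UTF-8; split it across two chunks.
chunks = [b"\xe4\xb8", b"\xad"]

# Decoding each chunk on its own treats both halves as invalid.
naive = "".join(chunk.decode("utf-8", errors="replace") for chunk in chunks)

# The incremental decoder holds the partial sequence until it is complete.
streamed = decode_chunks(chunks)
```

Here `naive` contains only replacement characters, while `streamed` recovers the original character. Replacement still happens when the stream really does end mid-sequence.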
JavaScript
const decoder = new TextDecoder("utf-8"); // fatal=false by default → uses U+FFFD
const badBytes = new Uint8Array([0xFF, 0xFE]);
decoder.decode(badBytes); // "\ufffd\ufffd"
// Fatal mode: throws instead of replacing
const strictDecoder = new TextDecoder("utf-8", { fatal: true });
try {
  strictDecoder.decode(badBytes); // TypeError: The encoded data was not valid
} catch (e) {
  console.log("Invalid bytes");
}
// Check for replacement character
"\ufffd".codePointAt(0); // 65533
"\ufffd" === "\u{FFFD}"; // true
HTML Rendering
In HTML, the numeric character references &#xFFFD; and &#65533; both render as the replacement character glyph. Most fonts draw it as a black diamond containing a white question mark; others show a question mark in a box, or just ?.
<p>Invalid sequence: &#xFFFD;</p>
<!-- Browsers display the replacement character glyph -->
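The two references can also be checked outside a browser. This is a minimal sketch using Python's standard-library `html.unescape`; the variable names are illustrative only.

```python
import html

# Both numeric character references name the same code point, U+FFFD.
hex_ref = html.unescape("&#xFFFD;")
dec_ref = html.unescape("&#65533;")
print(hex_ref == dec_ref == "\ufffd")

# A reference to a surrogate code point does not name a valid character,
# so html.unescape substitutes U+FFFD for it.
surrogate_ref = html.unescape("&#xD800;")
print(surrogate_ref == "\ufffd")
```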
Database and File Handling
Replacement characters in stored data indicate an encoding problem that already occurred. They cannot be "fixed" because the original bytes are gone — the information was lost at decode time:
# Once decoded with errors="replace", the original byte is unrecoverable
bad = b"\x80"
replaced = bad.decode("utf-8", errors="replace") # "\ufffd"
# You cannot go back to b"\x80" from "\ufffd" alone
# Solution: store original bytes if you need to recover them
import base64
preserved = base64.b64encode(bad).decode() # "gA==" — recoverable
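If the data has to stay in a `str`, one alternative is Python's `surrogateescape` error handler, which maps each undecodable byte to a lone surrogate instead of U+FFFD so the bytes can be restored later. The sketch below assumes the text is later re-encoded with the same handler; the sample bytes are made up for illustration.

```python
raw = b"caf\xe9"  # Latin-1 bytes for "café", not valid UTF-8

# Lossy: the offending byte collapses into U+FFFD.
lossy = raw.decode("utf-8", errors="replace")

# Lossless: the offending byte is smuggled through as a lone surrogate.
lossless = raw.decode("utf-8", errors="surrogateescape")

# Re-encoding the lossy text yields the UTF-8 bytes of U+FFFD, not the input.
lossy_roundtrip = lossy.encode("utf-8")

# Re-encoding with the same handler restores the original bytes exactly.
lossless_roundtrip = lossless.encode("utf-8", errors="surrogateescape")
```

Strings produced this way contain lone surrogates, so they fail a plain `encode("utf-8")` and are best kept away from code that expects well-formed text.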
Normalization and Filtering
In data pipelines, you should decide deliberately whether to keep or remove replacement characters:
# Filter replacement characters (data was corrupted — remove noise)
def clean_text(text: str) -> str:
    return text.replace("\uFFFD", "")
# Count corruption severity
def corruption_ratio(text: str) -> float:
    if not text:
        return 0.0
    return text.count("\uFFFD") / len(text)
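Before deciding to keep or drop replacement characters, it can help to see where they occur. The following is a minimal sketch of a locator that reports each position with a little surrounding text; `find_replacements` and its `context` parameter are hypothetical names, not part of any library.

```python
REPLACEMENT = "\ufffd"


def find_replacements(text, context=10):
    """Yield (index, snippet) for every U+FFFD in text."""
    index = text.find(REPLACEMENT)
    while index != -1:
        start = max(0, index - context)
        # The snippet covers up to `context` characters on each side.
        yield index, text[start:index + 1 + context]
        index = text.find(REPLACEMENT, index + 1)


hits = list(find_replacements("ab\ufffdcd\ufffd", context=1))
```

Each hit gives enough context to judge whether the damage is isolated, which would suggest filtering, or systematic, which would suggest re-decoding the source with the right encoding.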
Quick Facts
| Property | Value |
|---|---|
| Code point | U+FFFD |
| Name | REPLACEMENT CHARACTER |
| Block | Specials (U+FFF0–U+FFFF) |
| HTML entity | &#xFFFD; or &#65533; |
| Python literal | "\uFFFD" |
| Decimal code point | 65533 |
| Appears when | Invalid/undecodable byte sequences encountered |
| Information recovery | Impossible — original byte is lost |
| Prevention | Strict decoding at input boundaries; validate encoding |
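The prevention advice in the table can be sketched as a small wrapper that decodes strictly at the input boundary and reports where the data went wrong. `decode_strict` is a hypothetical helper; it relies only on Python's default strict error handling and the attributes of `UnicodeDecodeError`.

```python
def decode_strict(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode raw bytes, rejecting invalid input instead of inserting U+FFFD."""
    try:
        # errors="strict" is the default, so invalid bytes raise.
        return raw.decode(encoding)
    except UnicodeDecodeError as exc:
        bad = raw[exc.start:exc.end]
        raise ValueError(
            f"invalid {encoding} input at byte offset {exc.start}: {bad!r}"
        ) from exc


try:
    decode_strict(b"a\x80")
    message = ""
except ValueError as err:
    message = str(err)
```

Rejecting the input here keeps the original bytes available for a retry with a different encoding, which is no longer possible once U+FFFD has been stored.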