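What is mojibake?

Garbled text produced when bytes are decoded with the wrong encoding. The term comes from Japanese (文字化け). Example: 'café' saved as UTF-8 and read as Latin-1 becomes 'cafÃ©'.

What is character encoding?

A system that maps characters to byte sequences for digital storage and transmission. Every text file has an encoding; what matters is whether it is declared correctly.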

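Programming & Development

Replacement Character

U+FFFD (�). The character shown when a decoder encounters an invalid byte sequence: a generic symbol for "something went wrong while decoding".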

2024-05-28 · Updated 2025-01-15

What Is the Replacement Character?

The Replacement Character is U+FFFD (�, written as \ufffd in many programming languages and usually drawn as a diamond containing a question mark). It is the designated substitute that decoders and converters insert when they encounter a byte sequence that cannot be decoded, is invalid for the declared encoding, or represents a character with no mapping in the target encoding.

Its code point, U+FFFD, sits in the Specials block and has no other assigned meaning; it exists solely as an error sentinel.

When It Appears

Invalid UTF-8 sequences: A byte in a range reserved for multi-byte sequences but not followed by the correct continuation bytes.
Truncated sequences: A multi-byte UTF-8 sequence cut off at the end of a buffer.
Surrogates in UTF-8: Lone or paired surrogates (U+D800–U+DFFF) encoded as UTF-8 bytes, which the encoding does not allow.
Out-of-range code points: Byte sequences that would decode to a value above U+10FFFF, the upper limit of Unicode.
Unmappable characters: When converting between encodings and a character has no equivalent.

Python

# UTF-8 decoding with errors="replace"
b"\xff\xfe".decode("utf-8", errors="replace")  # "\ufffd\ufffd"
b"\xe4\xb8".decode("utf-8", errors="replace")  # "\ufffd" (truncated CJK)
b"\xed\xa0\x80".decode("utf-8", errors="replace")  # "\ufffd\ufffd\ufffd" (surrogate in UTF-8)

# str: U+FFFD as a Python character
REPLACEMENT = "\uFFFD"
REPLACEMENT == "\ufffd"    # True: hex digits in escapes are case-insensitive
ord(REPLACEMENT)           # 65533

# Check for replacement characters in decoded text
def has_decoding_errors(text: str) -> bool:
    return "\uFFFD" in text
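
The "truncated sequences" case above often comes from decoding a stream one chunk at a time. As a minimal sketch (the helper name `decode_chunks` is hypothetical, not from the original), Python's incremental decoder can hold back an incomplete sequence until the next chunk arrives, so U+FFFD is emitted only for bytes that are actually invalid:

```python
import codecs

def decode_chunks(chunks, encoding="utf-8"):
    """Decode an iterable of byte chunks, buffering incomplete sequences."""
    decoder = codecs.getincrementaldecoder(encoding)(errors="replace")
    parts = [decoder.decode(chunk) for chunk in chunks]
    # final=True flushes the buffer; leftover partial bytes become U+FFFD
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)

data = "中".encode("utf-8")       # three bytes: e4 b8 ad
chunks = [data[:2], data[2:]]     # split in the middle of the character

# Decoding each chunk on its own corrupts the character
naive = "".join(c.decode("utf-8", errors="replace") for c in chunks)

# The incremental decoder reassembles it
streamed = decode_chunks(chunks)
```

Here `naive` contains only replacement characters, while `streamed` recovers the original character. A stream that really does end mid-sequence still yields U+FFFD on the final flush.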

JavaScript

const decoder = new TextDecoder("utf-8");  // fatal=false by default → uses U+FFFD
const badBytes = new Uint8Array([0xFF, 0xFE]);
decoder.decode(badBytes);  // "\ufffd\ufffd"

// Fatal mode: throws instead of replacing
const strictDecoder = new TextDecoder("utf-8", { fatal: true });
try {
  strictDecoder.decode(badBytes);  // throws a TypeError (message text varies by engine)
} catch (e) {
  console.log("Invalid bytes");
}

// Check for replacement character
"\ufffd".codePointAt(0);     // 65533
"\ufffd" === "\u{FFFD}";     // true

HTML Rendering

In HTML, &#xFFFD; and &#65533; render as the replacement character glyph. Some fonts render it as a black diamond containing a question mark, others as a question mark in a box, or just ?.

<p>Invalid sequence: &#xFFFD;</p>
<!-- Browsers display the replacement character glyph -->
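
To check the two numeric references from Python, a small sketch using the standard library's `html.unescape` is shown below. The last line relies on that function following HTML5 parsing rules, under which a reference to a surrogate code point is itself replaced by U+FFFD:

```python
import html

# Hexadecimal and decimal references resolve to the same character
from_hex = html.unescape("&#xFFFD;")
from_dec = html.unescape("&#65533;")

# A numeric reference to a surrogate is not a valid character,
# so html.unescape substitutes U+FFFD for it
from_surrogate = html.unescape("&#xD800;")
```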

Database and File Handling

Replacement characters in stored data indicate an encoding problem that already occurred. They cannot be "fixed" because the original bytes are gone — the information was lost at decode time:

# Once decoded with errors="replace", the original byte is unrecoverable
bad = b"\x80"
replaced = bad.decode("utf-8", errors="replace")  # "\ufffd"
# You cannot go back to b"\x80" from "\ufffd" alone

# Solution: store original bytes if you need to recover them
import base64
preserved = base64.b64encode(bad).decode()  # "gA==" — recoverable
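
If storing a separate copy of the bytes is not practical, one alternative (an addition here, not part of the original advice) is Python's `surrogateescape` error handler. It maps each undecodable byte to a lone surrogate instead of U+FFFD, so the exact bytes can be rebuilt later. A sketch, with an invented sample value:

```python
raw = b"caf\x80"  # sample input: 0x80 is not valid UTF-8 here

# Each undecodable byte 0xNN becomes the lone surrogate U+DCNN
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly
restored = text.encode("utf-8", errors="surrogateescape")

# The escaped text is not valid Unicode: a strict encode fails
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```

The trade-off is that `text` contains lone surrogates, so it should be treated as an internal carrier for the bytes and not written out or displayed as ordinary text.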

Normalization and Filtering

In data pipelines, you should decide deliberately whether to keep or remove replacement characters:

# Filter replacement characters (data was corrupted — remove noise)
def clean_text(text: str) -> str:
    return text.replace("\uFFFD", "")

# Count corruption severity
def corruption_ratio(text: str) -> float:
    if not text:
        return 0.0
    return text.count("\uFFFD") / len(text)
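
One way to make the keep-or-remove decision explicit is a small triage step built on the same count. The function name `triage` and the 1% threshold below are illustrative assumptions, not values from the original; pick a limit that suits your data:

```python
REPLACEMENT = "\uFFFD"

def triage(text: str, max_ratio: float = 0.01) -> str:
    """Classify text by how much of it is replacement characters.

    max_ratio is a hypothetical threshold chosen for illustration.
    """
    count = text.count(REPLACEMENT)
    if count == 0:
        return "clean"
    ratio = count / len(text)
    # Heavily corrupted input is rejected; light corruption is flagged
    return "reject" if ratio > max_ratio else "flag"

triage("hello")                    # "clean"
triage("a" * 199 + REPLACEMENT)    # "flag"   (1 of 200 characters)
triage("a" + REPLACEMENT)          # "reject" (1 of 2 characters)
```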

Quick Facts

Property	Value
Code point	U+FFFD
Name	REPLACEMENT CHARACTER
Block	Specials (U+FFF0–U+FFFF)
HTML entity	`&#xFFFD;` or `&#65533;`
Python literal	`"\uFFFD"`
Decimal code point	65533
Appears when	Invalid/undecodable byte sequences encountered
Information recovery	Impossible — original byte is lost
Prevention	Strict decoding at input boundaries; validate encoding
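
The "strict decoding at input boundaries" entry can be sketched as follows. The wrapper name `decode_strict` is hypothetical; it relies only on the default strict error handling of `bytes.decode` and the `start`/`end` attributes of `UnicodeDecodeError`:

```python
def decode_strict(data: bytes, encoding: str = "utf-8") -> str:
    """Decode or fail loudly, reporting where the bad bytes are."""
    try:
        return data.decode(encoding)  # errors="strict" is the default
    except UnicodeDecodeError as exc:
        bad = data[exc.start:exc.end]
        raise ValueError(
            f"invalid {encoding} at byte offset {exc.start}: {bad!r}"
        ) from exc

try:
    decode_strict(b"ab\xff")
    message = ""
except ValueError as err:
    message = str(err)
```

Failing at the boundary keeps the original bytes available for a retry with another encoding, which is no longer possible once U+FFFD has been written into stored data.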

More terms in Programming & Development

Java Unicode

Java strings use UTF-16 internally. char is 16-bit (only BMP). For supplementary …

Python Unicode

Python 3 uses Unicode strings by default (str = UTF-8 internally via …

Rust Unicode

Rust strings (str/String) are guaranteed valid UTF-8. char type represents a Unicode …

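Unicode Escape Sequences

Syntax for representing Unicode characters in source code. It varies by language: \u2713 (Python/Java/JS), \u{2713} (JS/Ruby/Rust), \U00012345 (Python/C).

Unicode Regular Expressions

Regular expression patterns that use Unicode properties: \p{L} (any letter), \p{Script=Greek} (Greek script), \p{Emoji}. Support varies by language and regex engine.

Encoding / Decoding

Encoding converts characters to bytes (str.encode('utf-8')); decoding converts bytes to characters (bytes.decode('utf-8')). Doing both correctly prevents mojibake.

Surrogate Pair

Two 16-bit code units used in UTF-16 to encode a supplementary character (a high surrogate U+D800–U+DBFF followed by a low surrogate U+DC00–U+DFFF). 😀 = D83D DE00.

Null Character

U+0000 (NUL). The first Unicode/ASCII character, used as the string terminator in C/C++. Security risk: null-byte injection can truncate strings in vulnerable systems.

Invisible Characters

Characters with no visible glyph: whitespace, zero-width characters, control characters, and format characters. They can cause security problems such as spoofing and text smuggling.

String

A sequence of characters in a programming language. Internal representations vary: UTF-8 (Go, Rust, newer Python), UTF-16 (Java, JavaScript, C#), UTF-32 (Python).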
