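What is a zero-width character?

A character whose advance width is zero: invisible when rendered, yet it affects how text behaves. Examples include ZWSP (word-break opportunity), ZWJ (joining), ZWNJ (join prevention), and WJ (line-break prevention).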

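What is a control character?

A non-printing character that controls text processing. C0 (U+0000–U+001F) includes NUL, TAB, LF, CR, and ESC. C1 (U+0080–U+009F) is rarely used in modern Unicode text. General category: Cc.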

Programming & Development

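Invisible Characters

Characters with no visible glyph: whitespace, zero-width characters, control characters, and format characters. They can cause security problems such as spoofing and text smuggling.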

2024-06-17 · Updated 2025-05-06

What Are Invisible Characters?

Invisible characters are Unicode code points that have no visible glyph. They normally render as nothing, yet they still count toward a string's length and can affect text layout, rendering, and processing. They include format characters, zero-width characters, and various control or separator code points.

Invisible characters are legitimate and useful in proper typography and internationalization, but they are also exploited for obfuscation, invisible text, and bypassing filters.
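As a small illustration of the point above, the following sketch compares two strings that look identical on screen. The helper name `describe` is made up for this example; it only uses built-in Python functions.

```python
visible = "cat"
hidden = "c\u200Ba\u200Bt"  # ZERO WIDTH SPACE between the letters

def describe(s: str) -> list[str]:
    """Return one 'U+XXXX' label per code point so hidden characters show up."""
    return [f"U+{ord(c):04X}" for c in s]

print(len(visible), len(hidden))  # 3 5
print(visible == hidden)          # False
print(describe(hidden))
# ['U+0063', 'U+200B', 'U+0061', 'U+200B', 'U+0074']
```

Printing code points like this is often the quickest way to confirm whether a surprising comparison failure is caused by hidden characters.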

Categories of Invisible Characters

Zero-Width Characters

These have no advance width in text layout:

Code Point	Name	Abbreviation	Purpose
U+200B	Zero Width Space	ZWSP	Allows line break without visible space
U+200C	Zero Width Non-Joiner	ZWNJ	Prevents ligature/cursive joining
U+200D	Zero Width Joiner	ZWJ	Joins emoji; enables cursive joining
U+2060	Word Joiner	WJ	Prevents line break, zero width
U+FEFF	Zero Width No-Break Space	ZWNBSP / BOM	Historical no-break use (superseded by U+2060); now mainly a byte order mark
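U+FEFF most often turns up at the start of files saved with a byte order mark. The sketch below shows how Python's standard `utf-8` and `utf-8-sig` codecs treat it; `strip_leading_bom` is a hypothetical helper for text that has already been decoded.

```python
# "utf-8-sig" writes a leading BOM (bytes EF BB BF) when encoding.
raw = "hello".encode("utf-8-sig")

as_utf8 = raw.decode("utf-8")      # BOM survives as U+FEFF at index 0
as_sig = raw.decode("utf-8-sig")   # BOM is consumed by the codec

def strip_leading_bom(s: str) -> str:
    """Drop a single leading U+FEFF if present (hypothetical helper)."""
    return s[1:] if s.startswith("\ufeff") else s

print(repr(as_utf8))                     # '\ufeffhello'
print(repr(as_sig))                      # 'hello'
print(repr(strip_leading_bom(as_utf8)))  # 'hello'
```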

Format and Separator Characters

Most of these are in general category Cf (Format); the line and paragraph separators have their own categories, Zl and Zp.

Code Point	Name	Category	Effect
U+00AD	Soft Hyphen	Cf	Suggested break point; visible only when a line breaks there
U+2028	Line Separator	Zl	Forces line break
U+2029	Paragraph Separator	Zp	Forces paragraph break
U+200E	Left-to-Right Mark	Cf	LRM: influences bidi algorithm
U+200F	Right-to-Left Mark	Cf	RLM: influences bidi algorithm
U+202A–U+202E	Bidi embedding/override characters	Cf	Control text direction
U+2061–U+2064	Invisible mathematical operators	Cf	Function application, invisible times, invisible separator, invisible plus
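A quick way to check which general category each of these code points has is Python's `unicodedata.category`. The sample list below is just one code point per table row; note that the two separators report Zl and Zp rather than Cf.

```python
import unicodedata

# One representative code point per row of the table above.
samples = ["\u00AD", "\u2028", "\u2029", "\u200E", "\u202E", "\u2061"]

categories = {f"U+{ord(c):04X}": unicodedata.category(c) for c in samples}

for code_point, category in categories.items():
    print(code_point, category)
```

A filter that matches only Cf will therefore leave U+2028 and U+2029 in place.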

Non-Printing Control Characters

U+0000–U+001F (C0 controls), U+007F (DEL), and U+0080–U+009F (C1 controls) all have general category Cc. They have no standard glyph, although some, such as TAB, LF, and CR, visibly affect layout.
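One way to make control characters visible for debugging is to map them onto the Unicode Control Pictures block (U+2400–U+2421). The function below is a minimal sketch; `show_controls` and the `<U+XXXX>` notation for C1 controls are choices made for this example, not a standard.

```python
def show_controls(text: str) -> str:
    """Replace control characters with visible stand-ins for debugging."""
    out = []
    for c in text:
        cp = ord(c)
        if cp <= 0x1F:
            out.append(chr(0x2400 + cp))  # U+2400..U+241F picture the C0 controls
        elif cp == 0x7F:
            out.append("\u2421")          # SYMBOL FOR DELETE
        elif 0x80 <= cp <= 0x9F:
            out.append(f"<U+{cp:04X}>")   # C1 has no picture block; use a label
        else:
            out.append(c)
    return "".join(out)

print(show_controls("a\tb\n"))  # a␉b␊
```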

Detecting Invisible Characters

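import unicodedata

# General categories treated as invisible here:
# Cf = Format, Cc = Control, Zl = Line Separator, Zp = Paragraph Separator
INVISIBLE_CATEGORIES = {"Cf", "Cc", "Zl", "Zp"}

def is_invisible(char: str) -> bool:
    # Combining marks are deliberately excluded: they render visibly on their base character.
    return unicodedata.category(char) in INVISIBLE_CATEGORIES

def find_invisible(text: str) -> list[tuple[int, str, str]]:
    # Returns (index, hex code point, Unicode name) for each invisible character.
    # Control characters have no Unicode name, so they are reported as "UNKNOWN".
    return [
        (i, hex(ord(c)), unicodedata.name(c, "UNKNOWN"))
        for i, c in enumerate(text)
        if is_invisible(c)
    ]

text = "Hello\u200BWorld\u200D!"
find_invisible(text)
# [(5, '0x200b', 'ZERO WIDTH SPACE'),
#  (11, '0x200d', 'ZERO WIDTH JOINER')]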

# Stripping format (Cf) characters only; Cc, Zl, and Zp characters are left in place
import regex  # pip install regex
def strip_invisible(text: str) -> str:
    return regex.sub(r"\p{Cf}", "", text)

strip_invisible("Hello\u200BWorld")  # "HelloWorld"

JavaScript Detection

// Detect zero-width and format characters
function findInvisible(text) {
  const results = [];
  for (const [i, char] of [...text].entries()) {
    const cp = char.codePointAt(0);
    if (
      (cp >= 0x200B && cp <= 0x200F) ||  // ZW space, joiners, marks
      (cp >= 0x202A && cp <= 0x202E) ||  // bidi controls
      cp === 0x2060 || cp === 0xFEFF ||
      (cp >= 0x2061 && cp <= 0x2064)
    ) {
      results.push({ index: i, codePoint: cp.toString(16), char });
    }
  }
  return results;
}

// Strip format characters using a Unicode property escape (the u flag is required)
function stripFormat(text) {
  return text.replace(/\p{Cf}/gu, "");
}

Legitimate Uses

# ZWJ in emoji sequences (family emoji)
family = "👨\u200D👩\u200D👧"   # 👨‍👩‍👧 one grapheme

# ZWNJ in Persian (prevents cursive joining across a morpheme boundary)
correct = "می\u200Cکنم"          # prefix and verb stay unjoined without a visible space

# LRM/RLM for bidi text
mixed = "Hello \u200Eمرحبا"       # force LTR context around Arabic
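Because these uses are legitimate, blanket removal of Cf characters can damage valid text. The sketch below contrasts a strip-everything filter with one that keeps ZWJ and ZWNJ; both function names and the `KEEP` set are assumptions made for illustration, and the right allowlist depends on the application.

```python
import unicodedata

KEEP = {"\u200C", "\u200D"}  # ZWNJ and ZWJ (hypothetical allowlist)

def strip_cf(text: str) -> str:
    """Remove every character in general category Cf."""
    return "".join(c for c in text if unicodedata.category(c) != "Cf")

def strip_cf_keep_joiners(text: str) -> str:
    """Remove Cf characters except those in KEEP."""
    return "".join(
        c for c in text if c in KEEP or unicodedata.category(c) != "Cf"
    )

family = "👨\u200D👩\u200D👧"

print(len(family))                         # 5 code points, one grapheme
print(len(strip_cf(family)))               # 3: now three separate emoji
print(len(strip_cf_keep_joiners(family)))  # 5: sequence preserved
```

Keeping the joiners preserves emoji and Persian text, at the cost of leaving two characters that can still be used for obfuscation.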

Security Considerations

Invisible characters are used for:

Text fingerprinting/watermarking: Embedding hidden patterns to track document leaks.
Bypassing content filters: c\u200Ba\u200Bt to write "cat" while evading text matching.
Bidi spoofing: Hidden bidi overrides can reverse the displayed order of text in filenames, URLs, or source code.
Obfuscating malicious strings: Zero-width characters interspersed in code.
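The bidi case can be checked mechanically. The following sketch flags the explicit bidi controls U+202A–U+202E along with the isolate controls U+2066–U+2069, which are not listed in the tables above but are commonly checked together with them. The function name and the sample filename are made up for illustration.

```python
# Explicit bidi embedding, override, and isolate controls with their short names.
BIDI_CONTROLS = {
    "\u202A": "LRE", "\u202B": "RLE", "\u202C": "PDF",
    "\u202D": "LRO", "\u202E": "RLO",
    "\u2066": "LRI", "\u2067": "RLI", "\u2068": "FSI", "\u2069": "PDI",
}

def find_bidi_controls(text: str) -> list[tuple[int, str]]:
    """Return (index, short name) for each bidi control found in text."""
    return [(i, BIDI_CONTROLS[c]) for i, c in enumerate(text) if c in BIDI_CONTROLS]

# With the override applied, this name is displayed as "invoiceexe.pdf".
filename = "invoice\u202Efdp.exe"
print(find_bidi_controls(filename))  # [(7, 'RLO')]
```

Rejecting or escaping any input where this list is non-empty is a conservative policy for filenames and identifiers, though it would be too strict for free-form multilingual text.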

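# Security: normalize input by stripping Cf characters.
# Caveat: this also removes ZWJ and ZWNJ, which breaks emoji sequences and
# some Persian/Arabic text, so it fits comparison keys better than display text.
import unicodedata
def sanitize(text: str) -> str:
    # Remove format characters
    cleaned = "".join(c for c in text if unicodedata.category(c) != "Cf")
    # NFC normalize
    return unicodedata.normalize("NFC", cleaned)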
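To show how this kind of sanitizing defeats the filter-bypass trick, here is a minimal sketch of a blocklist check that compares normalized keys instead of raw input. `match_key`, `is_blocked`, and the `BLOCKED` set are hypothetical names, and the added `casefold()` step is an assumption that goes beyond the function above.

```python
import unicodedata

def match_key(text: str) -> str:
    """Build a comparison key: drop Cf characters, NFC-normalize, casefold."""
    no_format = "".join(c for c in text if unicodedata.category(c) != "Cf")
    return unicodedata.normalize("NFC", no_format).casefold()

BLOCKED = {"cat"}  # hypothetical blocklist, stored as match keys

def is_blocked(word: str) -> bool:
    return match_key(word) in BLOCKED

print(is_blocked("cat"))              # True
print(is_blocked("c\u200Ba\u200Bt"))  # True: zero-width spaces no longer hide it
print(is_blocked("dog"))              # False
```

The original input is left untouched; only the key used for matching is stripped, so legitimate joiners in user-visible text are not lost.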

Quick Facts

Property	Value
Most common invisible chars	U+200B, U+200C, U+200D, U+2060, U+FEFF
Unicode category	Cf (Format), Cc (Control)
Emoji ZWJ	U+200D — joins emoji into multi-person sequences
Python detection	`unicodedata.category(c) == "Cf"`
JS regex removal	`text.replace(/\p{Cf}/gu, "")`
Security risk	Bidi spoofing, filter bypass, text fingerprinting
Legitimate uses	Bidi control, emoji sequences, typography, cursive joining

More Terms in Programming & Development

Java Unicode

Java strings use UTF-16 internally. char is 16-bit (only BMP). For supplementary …

Python Unicode

Python 3 uses Unicode strings by default (str = UTF-8 internally via …

Rust Unicode

Rust strings (str/String) are guaranteed valid UTF-8. char type represents a Unicode …

Unicode Escape Sequences

Syntax for writing Unicode characters in source code. It varies by language: \u2713 (Python/Java/JS), \u{2713} (JS/Ruby/Rust), \U00012345 (Python/C).

Unicode Regular Expressions

Regular-expression patterns that use Unicode properties: \p{L} (any letter), \p{Script=Greek} (Greek script), \p{Emoji}. Support varies by language and regex engine.

Encoding / Decoding

Encoding converts characters to bytes (str.encode('utf-8')); decoding converts bytes to characters (bytes.decode('utf-8')). Done correctly, this prevents mojibake (garbled text).

Surrogate Pair

Two 16-bit code units used in UTF-16 to encode a supplementary character: a high surrogate (U+D800–U+DBFF) followed by a low surrogate (U+DC00–U+DFFF). 😀 = D83D DE00.

Null Character

U+0000 (NUL). The first Unicode/ASCII character, used as the string terminator in C/C++. Security risk: null-byte injection can truncate strings in vulnerable systems.

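String

A sequence of characters in a programming language. Internal representations vary: UTF-8 (Go, Rust), UTF-16 (Java, JavaScript, C#), and a flexible layout of 1, 2, or 4 bytes per code point (CPython 3.3 and later).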

String Length Ambiguity

The "length" of a Unicode string depends on the unit: code units (JavaScript .length), code points (Python len()), or grapheme clusters. 👨‍👩‍👧‍👦 = 7 code points, 1 grapheme.
