حرف غير مرئي
أي حرف بدون شكل رسومي مرئي: مسافات بيضاء، أحرف بعرض صفري، أحرف تحكم، وأحرف تنسيق. قد يسبب مشاكل أمنية مثل الانتحال وتهريب النصوص.
What Are Invisible Characters?
Invisible characters are Unicode code points that have no visible glyph — they render as nothing in normal circumstances, yet they occupy space in a string and can affect text layout, rendering, and processing. They include format characters, zero-width characters, and various control or separator code points.
Invisible characters are legitimate and useful in proper typography and internationalization, but they are also exploited for obfuscation, invisible text, and bypassing filters.
Categories of Invisible Characters
Zero-Width Characters
These have no advance width in text layout:
| Code Point | Name | Abbreviation | Purpose |
|---|---|---|---|
| U+200B | Zero Width Space | ZWSP | Allows line break without visible space |
| U+200C | Zero Width Non-Joiner | ZWNJ | Prevents ligature/cursive joining |
| U+200D | Zero Width Joiner | ZWJ | Joins emoji; enables cursive joining |
| U+2060 | Word Joiner | WJ | Prevents line break, zero width |
| U+FEFF | Zero Width No-Break Space | BOM | Historical no-break; now mainly a BOM |
Format Characters (Cf)
| Code Point | Name | Effect |
|---|---|---|
| U+00AD | Soft Hyphen | Suggested break point; only visible when line breaks |
| U+2028 | Line Separator | Forces line break |
| U+2029 | Paragraph Separator | Forces paragraph break |
| U+200E | Left-to-Right Mark | LRM: influences bidi algorithm |
| U+200F | Right-to-Left Mark | RLM: influences bidi algorithm |
| U+202A–202E | Bidi embedding/override chars | Control text direction |
| U+2061–2064 | Mathematical operators | Invisible function application, etc. |
Non-Printing Control Characters
U+0000–U+001F (C0 controls) and U+007F–U+009F (C1 controls) are mostly invisible and have no standard rendering.
Detecting Invisible Characters
import unicodedata
def is_invisible(char: str) -> bool:
cat = unicodedata.category(char)
# Cf = Format, Cc = Control, Cs = Surrogate
return cat in ("Cf", "Cc") or unicodedata.combining(char) != 0
def find_invisible(text: str) -> list[tuple[int, str, str]]:
return [
(i, hex(ord(c)), unicodedata.name(c, "UNKNOWN"))
for i, c in enumerate(text)
if is_invisible(c)
]
text = "Hello\u200BWorld\u200D!"
find_invisible(text)
# [(5, "0x200b", "ZERO WIDTH SPACE"),
# (11, "0x200d", "ZERO WIDTH JOINER")]
# Stripping all invisible characters
import regex # pip install regex
def strip_invisible(text: str) -> str:
return regex.sub(r"\p{Cf}", "", text)
strip_invisible("Hello\u200BWorld") # "HelloWorld"
JavaScript Detection
// Detect zero-width and format characters
function findInvisible(text) {
const results = [];
for (const [i, char] of [...text].entries()) {
const cp = char.codePointAt(0);
if (
(cp >= 0x200B && cp <= 0x200F) || // ZW space, joiners, marks
(cp >= 0x202A && cp <= 0x202E) || // bidi controls
cp === 0x2060 || cp === 0xFEFF ||
(cp >= 0x2061 && cp <= 0x2064)
) {
results.push({ index: i, codePoint: cp.toString(16), char });
}
}
return results;
}
// Strip format characters using Unicode property
const stripped = text.replace(/\p{Cf}/gu, "");
Legitimate Uses
# ZWJ in emoji sequences (family emoji)
family = "👨\u200D👩\u200D👧" # 👨👩👧 one grapheme
# ZWNJ for Persian/Arabic (prevent unwanted ligature)
correct = "می\u200Cکنم" # correct word separation in Persian
# LRM/RLM for bidi text
mixed = "Hello \u200Eمرحبا" # force LTR context around Arabic
Security Considerations
Invisible characters are used for:
- Text fingerprinting/watermarking: Embedding hidden patterns to track document leaks.
- Bypassing content filters:
c\u200Ba\u200Btto write "cat" while evading text matching. - Homograph attacks: Hidden bidi overrides can reverse text direction in filenames or URLs.
- Obfuscating malicious strings: Zero-width characters interspersed in code.
# Security: normalize input by stripping Cf characters
import unicodedata
def sanitize(text: str) -> str:
# Remove format characters
cleaned = "".join(c for c in text if unicodedata.category(c) != "Cf")
# NFC normalize
return unicodedata.normalize("NFC", cleaned)
Quick Facts
| Property | Value |
|---|---|
| Most common invisible chars | U+200B, U+200C, U+200D, U+2060, U+FEFF |
| Unicode category | Cf (Format), Cc (Control) |
| Emoji ZWJ | U+200D — joins emoji into multi-person sequences |
| Python detection | unicodedata.category(c) == "Cf" |
| JS regex removal | text.replace(/\p{Cf}/gu, "") |
| Security risk | Homograph attacks, filter bypass, text fingerprinting |
| Legitimate uses | Bidi control, emoji sequences, typography, cursive joining |
المصطلحات ذات الصلة
المزيد في البرمجة والتطوير
Java strings use UTF-16 internally. char is 16-bit (only BMP). For supplementary …
نص مشوّه ناتج عن فك تشفير البايتات بترميز خاطئ. مصطلح ياباني (文字化け). …
Python 3 uses Unicode strings by default (str = UTF-8 internally via …
Rust strings (str/String) are guaranteed valid UTF-8. char type represents a Unicode …
الترميز يحوّل الأحرف إلى بايتات (str.encode('utf-8'))؛ فك الترميز يحوّل البايتات إلى أحرف …
أنماط Regex باستخدام خصائص Unicode: \p{L} (أي حرف)، \p{Script=Greek} (نص يوناني)، \p{Emoji}. …
صيغة لتمثيل أحرف Unicode في الكود المصدري. تختلف حسب اللغة: \u2713 (Python/Java/JS)، …
U+FFFD (�). يُعرض عندما يواجه فاك التشفير تسلسلات بايتات غير صالحة — …
U+0000 (NUL). أول حرف في Unicode/ASCII، يُستخدم كمُنهٍ للنصوص في C/C++. خطر …
وحدتا ترميز 16-بت (بديل علوي U+D800–U+DBFF + بديل سفلي U+DC00–U+DFFF) يُمثلان معاً …