Unicode Copy and Paste Best Practices

Copying and pasting text between applications can introduce invisible characters, change normalization forms, strip or mangle Unicode characters, and cause subtle bugs that are hard to diagnose. This guide explains what happens to Unicode text during copy-paste operations and the best practices for preserving character integrity across applications.

Published 2024-06-10 · Updated 2025-02-07

Copy and paste is one of the most fundamental computer interactions, yet it is also one of the most common vectors for invisible Unicode problems. When you paste text from a web page, email, PDF, or spreadsheet into your application, you may be unknowingly introducing invisible characters, wrong encodings, look-alike substitutions, and normalization mismatches that cause bugs, security vulnerabilities, and data corruption. This guide catalogs the most common Unicode copy-paste problems, explains why they happen, and shows you how to detect and fix them.

The Anatomy of a Paste

When you press Ctrl+C (or Cmd+C), the operating system captures the selected content and places it on the clipboard in one or more formats:

Format	Description	Unicode Handling
Plain text	Raw character sequence	UTF-16 (Windows), UTF-8 (macOS/Linux)
Rich text (RTF)	Formatted text with styles	Encoding specified in header
HTML	HTML fragment	Usually UTF-8
Application-specific	Word, Excel, etc.	Varies by application

When you paste, the receiving application chooses which format to accept. A plain text editor takes the plain text; a rich text editor may take the HTML. Each conversion step is an opportunity for characters to be lost, transformed, or silently substituted.
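
To make the plain-text row of the table concrete, the sketch below encodes one string both ways. It does not touch the real clipboard; the sample string is an arbitrary assumption chosen to include an accented letter and an emoji outside the Basic Multilingual Plane.

```python
# Illustration only: the same text as UTF-16 (little-endian) and as UTF-8.
sample = "caf\u00e9 \U0001F600"  # "café", a space, and one emoji

utf16 = sample.encode("utf-16-le")
utf8 = sample.encode("utf-8")

# 6 code points, 14 bytes in UTF-16, 10 bytes in UTF-8
print(len(sample), len(utf16), len(utf8))

# Both byte sequences decode to the same string when the right codec is used.
print(utf16.decode("utf-16-le") == utf8.decode("utf-8") == sample)  # True
```

The bytes differ completely, so a receiver that guesses the wrong encoding for a clipboard format will corrupt the text even though nothing was "changed" by the sender.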

Problem 1: Invisible Characters

The most insidious copy-paste problem is invisible characters — code points that have no visible glyph but affect text processing:

Character	Code Point	Name	Common Source
(invisible)	U+200B	Zero-Width Space	Web pages, CMS editors
(invisible)	U+200C	Zero-Width Non-Joiner	Persian/Arabic text
(invisible)	U+200D	Zero-Width Joiner	Emoji sequences, Indic scripts
(invisible)	U+FEFF	Byte Order Mark (BOM)	File headers, copy from files
(invisible)	U+00AD	Soft Hyphen	Hyphenated web content
(invisible)	U+2060	Word Joiner	Line break prevention
(invisible)	U+200E	Left-to-Right Mark	Bidi text
(invisible)	U+200F	Right-to-Left Mark	Bidi text
(invisible)	U+00A0	No-Break Space	Web pages (&nbsp;)
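(invisible)	U+2028	Line Separator	Rare; a line terminator in JavaScript (invalid inside string literals before ES2019)
(invisible)	U+2029	Paragraph Separator	Rare; a line terminator in JavaScript (invalid inside string literals before ES2019)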

Real-World Scenario

A developer copies a code snippet from a blog post:

# Looks correct:
name = "hello"
print(name)

# But the pasted version actually contains:
# name = "hello"    (ZWSP after "name")
# print(name)       (soft hyphen in "name")

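The first line has a Zero-Width Space between name and =. Depending on the language, the parser may reject it or silently tolerate it. The second line has a soft hyphen inside the variable name. Where a language accepts that character in identifiers, the result is a different identifier entirely; otherwise it is another syntax error. Python rejects both.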
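
For Python specifically, the scenario can be checked with the built-in compile() and str.isidentifier(). This is a minimal sketch; the helper name compiles and the sample strings are assumptions made for illustration.

```python
# Check pasted source for characters Python will not accept in code.
pasted_line = 'name\u200b = "hello"'   # ZWSP after "name"
pasted_name = "na\u00adme"              # soft hyphen inside the identifier

def compiles(source: str) -> bool:
    """Return True if Python can compile the source, False on SyntaxError."""
    try:
        compile(source, "<pasted>", "exec")
        return True
    except SyntaxError:
        return False

print(compiles('name = "hello"'))   # True
print(compiles(pasted_line))        # False
print(pasted_name.isidentifier())   # False
```

Other languages draw the line differently, so the same check has to be repeated with each language's own parser.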

Detection in Python

import unicodedata

def find_invisible_chars(text: str) -> list[tuple[int, str, str]]:
    """Find invisible characters and their positions."""
    invisible_categories = {"Cf", "Zs", "Zl", "Zp", "Cc"}
    results = []
    for i, char in enumerate(text):
        cat = unicodedata.category(char)
        if cat in invisible_categories and char not in ("\n", "\r", "\t", " "):
            name = unicodedata.name(char, f"U+{ord(char):04X}")
            results.append((i, f"U+{ord(char):04X}", name))
    return results

# Example
text = "hello\u200bworld"
find_invisible_chars(text)
# [(5, 'U+200B', 'ZERO WIDTH SPACE')]

Detection in JavaScript

function findInvisibleChars(text) {
  const invisibles = [];
  const pattern = /[\u200B-\u200F\u2028-\u202F\u2060-\u206F\uFEFF\u00AD\u00A0]/g;
  let match;
  while ((match = pattern.exec(text)) !== null) {
    invisibles.push({
      position: match.index,
      codePoint: `U+${match[0].codePointAt(0).toString(16).toUpperCase().padStart(4, '0')}`,
      char: match[0]
    });
  }
  return invisibles;
}

Problem 2: Look-Alike Character Substitutions

Many characters from different Unicode blocks look identical or nearly identical to common ASCII characters. Copy-paste from certain sources silently substitutes these look-alikes:

Typographic Replacements

Word processors and "smart" text editors automatically replace ASCII characters with their typographic Unicode counterparts:

ASCII	Replaced With	Code Point	Name
"	“ or ”	U+201C / U+201D	Left/Right Double Quotation Mark
'	‘ or ’	U+2018 / U+2019	Left/Right Single Quotation Mark
-	– or —	U+2013 / U+2014	En Dash / Em Dash
...	…	U+2026	Horizontal Ellipsis
(space)	(invisible)	U+00A0	No-Break Space

This is catastrophic for code. Pasting a code snippet from Word or Google Docs into a terminal or IDE can produce baffling errors:

# This looks correct but uses smart quotes:
name = “hello”    # SyntaxError: invalid character '“' (U+201C)

Confusable Characters Across Scripts

ASCII	Look-alike	Script	Code Point
A	А	Cyrillic	U+0410
B	В	Cyrillic	U+0412
C	С	Cyrillic	U+0421
H	Н	Cyrillic	U+041D
O	О	Cyrillic	U+041E
P	Р	Cyrillic	U+0420
o	ο	Greek	U+03BF
a	а	Cyrillic	U+0430
e	е	Cyrillic	U+0435

These substitutions can come from copy-pasting text originally composed in a Cyrillic or Greek keyboard layout. The result looks correct but fails string comparisons:

# Visually identical, but different code points:
# Cyrillic Н (U+041D), е (U+0435), and о (U+043E) replace Latin H, e, and o
"Hello" == "\u041d\u0435ll\u043e"  # False

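One way to catch such substitutions is to look for words that mix scripts. Python's standard library does not expose the Unicode Script property, so the sketch below uses a rough heuristic instead: it treats the first word of each letter's Unicode name as its script. The function name scripts_in is an assumption, and the heuristic mislabels some characters (for example, fullwidth letters report "FULLWIDTH").

```python
import unicodedata

def scripts_in(text: str) -> set[str]:
    """Approximate the scripts used by the letters in text.

    Heuristic: the first word of a letter's Unicode name ("LATIN",
    "CYRILLIC", "GREEK", ...). This is not the real Script property.
    """
    scripts = set()
    for char in text:
        if char.isalpha():
            name = unicodedata.name(char, "")
            if name:
                scripts.add(name.split()[0])
    return scripts

print(sorted(scripts_in("Hello")))                    # ['LATIN']
print(sorted(scripts_in("\u041d\u0435ll\u043e")))     # ['CYRILLIC', 'LATIN']
```

A single word that reports more than one script is worth flagging for review, although legitimate mixed-script text exists, so this is a warning signal rather than proof of a problem.
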
Fullwidth vs. Halfwidth

CJK text input methods sometimes produce fullwidth versions of ASCII characters:

Halfwidth	Fullwidth	Code Point
A	Ａ	U+FF21
1	１	U+FF11
(	（	U+FF08
+	＋	U+FF0B
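
Fullwidth forms carry compatibility decompositions to their ASCII counterparts, so NFKC normalization folds them back. The sketch below shows this with the table's four characters, plus one example of the side effect: NFKC also rewrites other compatibility characters, which may not be wanted in every field.

```python
import unicodedata

fullwidth = "\uff21\uff11\uff08\uff0b"   # fullwidth A, 1, (, +
folded = unicodedata.normalize("NFKC", fullwidth)
print(folded)                            # A1(+

# NFKC folds other compatibility characters too (ligature fi, superscript 2):
print(unicodedata.normalize("NFKC", "\ufb01 x\u00b2"))   # fi x2
```

NFC, by contrast, leaves fullwidth characters untouched, so applying NFC alone does not address this problem.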

Problem 3: Normalization Mismatches

The same visual character can have multiple Unicode representations. Copy-paste can silently switch between them:

Visual	Form	Code Points	Bytes (UTF-8)
é	NFC (precomposed)	U+00E9	2 bytes
é	NFD (decomposed)	U+0065 U+0301	3 bytes

macOS is particularly prone to this: HFS+ stores filenames in a variant of NFD, and APFS preserves whatever form a name was created with, so decomposed names are common there, while most other systems produce NFC. Copying a filename from Finder and pasting it into a terminal or web form can produce an NFD string that fails equality checks against an NFC version.

import unicodedata

# These look identical but are different byte sequences
nfc = "caf\u00e9"                          # NFC: U+00E9
nfd = "cafe\u0301"                         # NFD: U+0065 U+0301

nfc == nfd                                  # False!
len(nfc), len(nfd)                          # (4, 5)

# Fix: normalize before comparing
unicodedata.normalize("NFC", nfc) == unicodedata.normalize("NFC", nfd)  # True

Problem 4: Encoding Misinterpretation

When text is copied from a source with one encoding and pasted into a context expecting another, mojibake occurs:

Original	Intended	Displayed As
café (UTF-8: `63 61 66 C3 A9`)	café	cafÃ© (Latin-1 interpretation)
üöä (UTF-8)	üöä	Ã¼Ã¶Ã¤ (double encoding)
中文 (UTF-8: `E4 B8 AD E6 96 87`)	中文	ä¸æ–‡ (Windows-1252; byte AD becomes an invisible soft hyphen)

The classic pattern is double encoding: text is encoded as UTF-8, that byte sequence is then misread as Latin-1, and the result is re-encoded as UTF-8:

original = "caf\u00e9"

# Double encoding (the bug)
broken = original.encode("utf-8").decode("latin-1").encode("utf-8")
# b'caf\xc3\x83\xc2\xa9'

# Fix: reverse the double encoding
fixed = broken.decode("utf-8").encode("latin-1").decode("utf-8")
# 'caf\u00e9'
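
When the damage has already reached a string (rather than bytes), the same reversal can be attempted defensively. The helper below is a hypothetical sketch, not a general mojibake repair tool: it undoes a single round of UTF-8 read as Latin-1 and returns the input unchanged whenever that reversal does not produce valid UTF-8. Text mangled through Windows-1252 usually contains characters Latin-1 cannot encode, so it is left alone here.

```python
def try_fix_mojibake(text: str) -> str:
    """Undo one round of UTF-8-read-as-Latin-1 if that yields valid UTF-8.

    Hypothetical helper: returns the input unchanged when the reversal
    does not apply.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text

print(try_fix_mojibake("caf\u00c3\u00a9"))   # café
print(try_fix_mojibake("caf\u00e9"))         # café (already correct, left alone)
print(try_fix_mojibake("plain ascii"))       # plain ascii
```

The heuristic can misfire on text that legitimately contains sequences such as "Ã©", so it is safer to apply it only to data already known to be damaged.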

Problem 5: Line Endings and Whitespace

Different platforms use different line ending conventions, and copy-paste between them can introduce mixed line endings:

Platform	Line Ending	Code Points
Unix/macOS	LF	U+000A
Windows	CRLF	U+000D U+000A
Classic Mac OS	CR	U+000D
Unicode	LS	U+2028 (Line Separator)
Unicode	PS	U+2029 (Paragraph Separator)

Mixed line endings in a single file can cause parsing failures, diff noise, and version control conflicts.
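
Mixed line endings are easy to detect before they cause trouble. The sketch below counts each convention from the table in a string; the function name count_line_endings and the sample text are assumptions for illustration.

```python
import re
from collections import Counter

def count_line_endings(text: str) -> Counter:
    """Count CRLF, lone CR, lone LF, LS, and PS occurrences."""
    labels = {"\r\n": "CRLF", "\r": "CR", "\n": "LF",
              "\u2028": "LS", "\u2029": "PS"}
    # CRLF is listed first so it is matched as one unit, not as CR then LF.
    endings = re.findall("\r\n|\r|\n|\u2028|\u2029", text)
    return Counter(labels[e] for e in endings)

counts = count_line_endings("one\r\ntwo\nthree\rfour\u2028five")
print(dict(counts))      # {'CRLF': 1, 'LF': 1, 'CR': 1, 'LS': 1}
print(len(counts) > 1)   # True: more than one convention is in use
```

Note that Python's str.splitlines() also splits on U+2028 and U+2029, which can surprise code that expects only LF and CRLF.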

A Practical Cleaning Pipeline

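Here is a function that addresses the most common copy-paste issues. It is deliberately aggressive and best suited to code, identifiers, and ASCII-oriented fields: step 2 strips ZWJ and ZWNJ, which emoji sequences and Persian, Arabic, and Indic text rely on, so relax that step for natural-language content: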

import unicodedata
import re

def clean_pasted_text(text: str) -> str:
    """Clean text that was pasted from external sources."""
    # 1. Normalize to NFC
    text = unicodedata.normalize("NFC", text)

    # 2. Remove zero-width characters
    text = re.sub("[\u200b\u200c\u200d\u2060\ufeff]", "", text)

    # 3. Replace typographic quotes with ASCII
    replacements = {
        "\u201c": '"', "\u201d": '"',  # smart double quotes
        "\u2018": "'", "\u2019": "'",  # smart single quotes
        "\u00ab": '"', "\u00bb": '"',  # guillemets
        "\u2013": "-", "\u2014": "-",  # en/em dash
        "\u2026": "...",                 # ellipsis
        "\u00a0": " ",                   # no-break space
    }
    for old, new in replacements.items():
        text = text.replace(old, new)

    # 4. Normalize line endings to LF
    text = text.replace("\r\n", "\n").replace("\r", "\n")

    # 5. Remove soft hyphens
    text = text.replace("\u00ad", "")

    return text

JavaScript Equivalent

function cleanPastedText(text) {
  // 1. Normalize to NFC
  text = text.normalize('NFC');

  // 2. Remove zero-width characters
  text = text.replace(/[\u200B\u200C\u200D\u2060\uFEFF]/g, '');

  // 3. Replace typographic characters
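  text = text
    .replace(/[\u201C\u201D\u00AB\u00BB]/g, '"')  // smart double quotes, guillemets
    .replace(/[\u2018\u2019]/g, "'")              // smart single quotes
    .replace(/[\u2013\u2014]/g, '-')              // en/em dash
    .replace(/\u2026/g, '...')                    // ellipsis
    .replace(/\u00A0/g, ' ');                     // no-break space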

  // 4. Normalize line endings
  text = text.replace(/\r\n?/g, '\n');

  // 5. Remove soft hyphens
  text = text.replace(/\u00AD/g, '');

  return text;
}

Preventing Copy-Paste Issues

For Content Authors

Paste as plain text (Ctrl+Shift+V / Cmd+Shift+V) to strip formatting and reduce hidden character contamination
Use a code editor for code — not Word, Google Docs, or web-based note tools
Run a linter after pasting code from external sources
Check encoding when copying from PDFs — PDF text extraction is notoriously lossy

For Developers

Normalize input — apply NFC normalization to all user input at the application boundary
Validate character ranges — reject or strip characters outside expected Unicode blocks
Display invisible characters in debug mode — show code points for zero-width and format characters
Use clipboard sanitization — strip invisible characters from paste events in web apps:

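document.addEventListener('paste', (event) => {
  event.preventDefault();
  const raw = event.clipboardData.getData('text/plain');
  const clean = cleanPastedText(raw);
  // execCommand is deprecated but still widely supported, and it keeps the
  // insertion in the browser's undo history.
  document.execCommand('insertText', false, clean);
});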
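
The "display invisible characters in debug mode" advice can be sketched on the server side as well. The function below, whose name reveal_invisibles is an assumption, replaces format, separator, and control characters (and any space other than U+0020) with a visible code point tag, leaving ordinary newlines and tabs alone.

```python
import unicodedata

def reveal_invisibles(text: str) -> str:
    """Replace format, separator, and control characters with <U+XXXX> tags."""
    hidden = {"Cf", "Zl", "Zp", "Cc"}
    out = []
    for char in text:
        cat = unicodedata.category(char)
        is_odd_space = cat == "Zs" and char != " "
        if (cat in hidden and char not in "\n\t") or is_odd_space:
            out.append(f"<U+{ord(char):04X}>")
        else:
            out.append(char)
    return "".join(out)

print(reveal_invisibles("hello\u200bworld\u00a0!"))
# hello<U+200B>world<U+00A0>!
```

Carriage returns are tagged on purpose here, since a stray CR is one of the things a debug view is meant to expose.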

For Database Administrators

Normalize on insert — apply NFC normalization in a database trigger or application layer
Index on normalized form — ensure text searches match regardless of normalization
Audit for invisible characters — periodically scan text columns for zero-width and format characters
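
A periodic audit can be as simple as scanning a text column with a regular expression. The sketch below uses an in-memory SQLite database so that it is self-contained; the users table, its name column, and the audit_column helper are all hypothetical, and a production audit would more likely run inside the database or in batches.

```python
import re
import sqlite3

# Zero-width, bidi-mark, separator, soft-hyphen, and no-break-space characters.
INVISIBLE = re.compile("[\u200b-\u200f\u2028\u2029\u2060\ufeff\u00ad\u00a0]")

def audit_column(conn: sqlite3.Connection, table: str, column: str) -> list[tuple[int, str]]:
    """Return (rowid, escaped value) for rows containing invisible characters.

    table and column are interpolated into the SQL text, so pass trusted
    names only.
    """
    rows = conn.execute(f"SELECT rowid, {column} FROM {table} ORDER BY rowid")
    return [
        (rowid, value.encode("unicode_escape").decode("ascii"))
        for rowid, value in rows
        if isinstance(value, str) and INVISIBLE.search(value)
    ]

# Hypothetical demo data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)",
                 [("alice",), ("bob\u200b",), ("carol\u00a0smith",)])

for rowid, escaped in audit_column(conn, "users", "name"):
    print(rowid, escaped)
```

Printing the escaped form makes the offending character visible in the report, which is the whole point of the audit.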

Key Takeaways

Copy-paste is not transparent — it can introduce invisible characters, change normalization forms, substitute look-alike characters, and corrupt encodings
Invisible characters (ZWSP, BOM, soft hyphen, no-break space) are the most common and hardest to detect without tooling
Smart quotes and typographic replacements from word processors break code and string comparisons
Normalization mismatches (NFC vs. NFD) cause identical-looking strings to fail equality checks
Always sanitize pasted text at the application boundary — normalize, strip invisible characters, and validate character ranges
"Paste as plain text" is the single most effective habit for avoiding copy-paste Unicode problems
