Unicode Copy and Paste Best Practices

Copying and pasting text between applications can introduce invisible characters, change normalization forms, strip or mangle Unicode characters, and cause subtle bugs that are hard to diagnose. This guide explains what happens to Unicode text during copy-paste operations and the best practices for preserving character integrity across applications.

Copy and paste is one of the most fundamental computer interactions, yet it is also one of the most common vectors for invisible Unicode problems. When you paste text from a web page, email, PDF, or spreadsheet into your application, you may be unknowingly introducing invisible characters, wrong encodings, look-alike substitutions, and normalization mismatches that cause bugs, security vulnerabilities, and data corruption. This guide catalogs the most common Unicode copy-paste problems, explains why they happen, and shows you how to detect and fix them.

The Anatomy of a Paste

When you press Ctrl+C (or Cmd+C), the operating system captures the selected content and places it on the clipboard in one or more formats:

Format Description Unicode Handling
Plain text Raw character sequence UTF-16 (Windows), UTF-8 (macOS/Linux)
Rich text (RTF) Formatted text with styles Encoding specified in header
HTML HTML fragment Usually UTF-8
Application-specific Word, Excel, etc. Varies by application

When you paste, the receiving application chooses which format to accept. A plain text editor takes the plain text; a rich text editor may take the HTML. Each conversion step is an opportunity for characters to be lost, transformed, or silently substituted.
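To make the plain-text row of the table concrete, the sketch below encodes one string in the two encodings listed there. The function name `plain_text_payloads` is hypothetical and no real clipboard API is called; the point is only that the same characters travel as different byte sequences, so every hand-off involves a decode and re-encode step.

```python
def plain_text_payloads(text: str) -> dict[str, bytes]:
    """Encode text in the two plain-text clipboard encodings from the table."""
    return {
        "utf-16-le": text.encode("utf-16-le"),  # typical on Windows
        "utf-8": text.encode("utf-8"),          # typical on macOS/Linux
    }


sample = "caf\u00e9 \U0001F600"  # 'café', a space, and an emoji outside the BMP
payloads = plain_text_payloads(sample)

for encoding, data in payloads.items():
    # Same six code points, different byte counts; both round-trip cleanly
    print(encoding, len(data), data.decode(encoding) == sample)
```

Problems appear when the receiving side decodes with a different encoding than the sender used, which is the subject of Problem 4.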

Problem 1: Invisible Characters

The most insidious copy-paste problem is invisible characters — code points that have no visible glyph but affect text processing:

Character Code Point Name Common Source
(invisible) U+200B Zero-Width Space Web pages, CMS editors
(invisible) U+200C Zero-Width Non-Joiner Persian/Arabic text
(invisible) U+200D Zero-Width Joiner Emoji sequences, Indic scripts
(invisible) U+FEFF Byte Order Mark (BOM) File headers, copy from files
(invisible) U+00AD Soft Hyphen Hyphenated web content
(invisible) U+2060 Word Joiner Line break prevention
(invisible) U+200E Left-to-Right Mark Bidi text
(invisible) U+200F Right-to-Left Mark Bidi text
(invisible) U+00A0 No-Break Space Web pages (&nbsp;)
(invisible) U+2028 Line Separator Rare; a syntax error inside JavaScript string literals before ES2019
(invisible) U+2029 Paragraph Separator Rare; a syntax error inside JavaScript string literals before ES2019

Real-World Scenario

A developer copies a code snippet from a blog post:

# Looks correct:
name = "hello"
print(name)

# But the pasted version actually contains:
# name​ = "hello"    (ZWSP after "name")
# print(na­me)       (soft hyphen in "name")

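The first line has a Zero-Width Space between name and =. Whether that is a syntax error depends on the language; Python rejects it with a SyntaxError. The second line has a soft hyphen inside the identifier. Python rejects that as well, but in contexts that accept arbitrary characters (string literals, dictionary keys, configuration values, URLs) the pasted text silently becomes a different string from name.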
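One way to check how Python itself treats such a snippet is to pass the pasted source to the built-in `compile()` without executing it. This is a minimal sketch: the helper name `compiles` is made up for illustration, and other languages may behave differently.

```python
def compiles(source: str) -> bool:
    """Return True if Python accepts the source, False on SyntaxError."""
    try:
        compile(source, "<pasted>", "exec")  # parse only, nothing is executed
        return True
    except SyntaxError:
        return False


clean = 'name = "hello"\nprint(name)\n'
with_zwsp = 'name\u200b = "hello"\n'              # ZWSP after the identifier
with_shy = 'name = "hello"\nprint(na\u00adme)\n'  # soft hyphen inside it

print(compiles(clean))      # True
print(compiles(with_zwsp))  # False
print(compiles(with_shy))   # False

# Inside a string literal the soft hyphen is accepted and changes the value
print("na\u00adme" == "name")  # False
```

The last line is the more dangerous case: nothing fails at parse time, and the mismatch only surfaces later as a failed lookup or comparison.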

Detection in Python

import unicodedata

def find_invisible_chars(text: str) -> list[tuple[int, str, str]]:
    """Find invisible characters and their positions."""
    invisible_categories = {"Cf", "Zs", "Zl", "Zp", "Cc"}
    results = []
    for i, char in enumerate(text):
        cat = unicodedata.category(char)
        if cat in invisible_categories and char not in ("\n", "\r", "\t", " "):
            name = unicodedata.name(char, f"U+{ord(char):04X}")
            results.append((i, f"U+{ord(char):04X}", name))
    return results

# Example
text = "hello\u200bworld"
find_invisible_chars(text)
# [(5, 'U+200B', 'ZERO WIDTH SPACE')]

Detection in JavaScript

function findInvisibleChars(text) {
  const invisibles = [];
  const pattern = /[\u200B-\u200F\u2028-\u202F\u2060-\u206F\uFEFF\u00AD\u00A0]/g;
  let match;
  while ((match = pattern.exec(text)) !== null) {
    invisibles.push({
      position: match.index,
      codePoint: `U+${match[0].codePointAt(0).toString(16).toUpperCase().padStart(4, '0')}`,
      char: match[0]
    });
  }
  return invisibles;
}

Problem 2: Look-Alike Character Substitutions

Many characters from different Unicode blocks look identical or nearly identical to common ASCII characters. Copy-paste from certain sources silently substitutes these look-alikes:

Typographic Replacements

Word processors and "smart" text editors automatically replace ASCII characters with their typographic Unicode counterparts:

ASCII Replaced With Code Point Name
" “ or ” U+201C / U+201D Left/Right Double Quotation Mark
' ‘ or ’ U+2018 / U+2019 Left/Right Single Quotation Mark
- – or — U+2013 / U+2014 En Dash / Em Dash
... … U+2026 Horizontal Ellipsis
(space)   U+00A0 No-Break Space

This is catastrophic for code. Pasting a code snippet from Word or Google Docs into a terminal or IDE can produce baffling errors:

# This looks correct but uses smart quotes:
name = “hello”    # SyntaxError: invalid character '“' (U+201C)

Confusable Characters Across Scripts

ASCII Look-alike Script Code Point
A А Cyrillic U+0410
B В Cyrillic U+0412
C С Cyrillic U+0421
H Н Cyrillic U+041D
O О Cyrillic U+041E
P Р Cyrillic U+0420
o ο Greek U+03BF
a а Cyrillic U+0430
e е Cyrillic U+0435
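Look-alikes such as those in the table above can often be flagged by checking whether a single word mixes scripts. Python's standard library does not expose the Unicode Script property, so this sketch approximates it with the first word of each letter's character name. That is a heuristic, and `scripts_used` and `is_mixed_script` are hypothetical helper names.

```python
import unicodedata


def scripts_used(text: str) -> set[str]:
    """Approximate the scripts in text from the first word of each letter's name."""
    scripts = set()
    for char in text:
        if char.isalpha():
            name = unicodedata.name(char, "")
            if name:
                scripts.add(name.split()[0])  # e.g. 'LATIN', 'CYRILLIC', 'GREEK'
    return scripts


def is_mixed_script(text: str) -> bool:
    """True when letters from more than one (approximate) script appear."""
    return len(scripts_used(text)) > 1


print(scripts_used("Hello"))               # {'LATIN'}
print(is_mixed_script("\u041d\u0435llo"))  # True: Cyrillic Н and е, Latin llo
```

Legitimately multilingual text also mixes scripts, so a check like this is better suited to identifiers, usernames, and domain labels than to free-form prose.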

These substitutions can come from copy-pasting text originally typed with a Cyrillic or Greek keyboard layout. The result looks correct but fails string comparisons:

# Visually identical, but different code points:
# U+041D (Cyrillic Н), U+0435 (Cyrillic е), U+043E (Cyrillic о)
"Hello" == "\u041d\u0435ll\u043e"  # False

Fullwidth vs. Halfwidth

CJK text input methods sometimes produce fullwidth versions of ASCII characters:

Halfwidth Fullwidth Code Point
A Ａ U+FF21
1 １ U+FF11
( （ U+FF08
+ ＋ U+FF0B
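Fullwidth forms carry a compatibility mapping to their ASCII counterparts, so NFKC normalization folds them back. A minimal sketch follows; `fold_fullwidth` is a hypothetical wrapper name.

```python
import unicodedata


def fold_fullwidth(text: str) -> str:
    """Fold compatibility characters, including fullwidth forms, via NFKC."""
    return unicodedata.normalize("NFKC", text)


fullwidth = "\uff21\uff11\uff08\uff0b"  # the four fullwidth characters from the table
print(fold_fullwidth(fullwidth))        # A1(+
```

NFKC is broader than fullwidth folding: it also rewrites ligatures, superscripts, and other compatibility characters, so it suits identifiers and search keys better than text that must be preserved exactly.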

Problem 3: Normalization Mismatches

The same visual character can have multiple Unicode representations. Copy-paste can silently switch between them:

Visual Form Code Points Bytes (UTF-8)
é NFC (precomposed) U+00E9 2 bytes
é NFD (decomposed) U+0065 U+0301 3 bytes

macOS is particularly prone to this: the older HFS+ file system stores filenames in a variant of NFD, and APFS preserves whatever form it receives, so names migrated from HFS+ volumes stay decomposed, while most other systems produce NFC. Copying a filename from Finder and pasting it into a terminal or web form can produce an NFD string that fails equality checks against an NFC version.

import unicodedata

# These look identical but are different byte sequences
nfc = "caf\u00e9"                          # NFC: U+00E9
nfd = "cafe\u0301"                         # NFD: U+0065 U+0301

nfc == nfd                                  # False!
len(nfc), len(nfd)                          # (4, 5)

# Fix: normalize before comparing
unicodedata.normalize("NFC", nfc) == unicodedata.normalize("NFC", nfd)  # True

Problem 4: Encoding Misinterpretation

When text is copied from a source with one encoding and pasted into a context expecting another, mojibake occurs:

Original Intended Displayed As
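café (UTF-8: 63 61 66 C3 A9) café cafÃ© (Latin-1 interpretation)
üöä (UTF-8) üöä Ã¼Ã¶Ã¤ (Latin-1 interpretation, the first step of double encoding)
中文 (UTF-8: E4 B8 AD E6 96 87) 中文 ä¸æ–‡ (Latin-1/Windows-1252 interpretation; byte AD becomes an invisible soft hyphen after ¸)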

The classic pattern is double encoding: text is encoded as UTF-8, then that byte sequence is misread as Latin-1 and the resulting characters are encoded as UTF-8 a second time:

original = "caf\u00e9"

# Double encoding (the bug)
broken = original.encode("utf-8").decode("latin-1").encode("utf-8")
# b'caf\xc3\x83\xc2\xa9'

# Fix: reverse the double encoding
fixed = broken.decode("utf-8").encode("latin-1").decode("utf-8")
# 'caf\u00e9'
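When the damaged text has already been decoded to a string, the same reversal can be attempted defensively. The sketch below assumes exactly one round of UTF-8 read as Latin-1; the helper name `try_fix_mojibake` is hypothetical, and text that does not fit the pattern is returned unchanged.

```python
def try_fix_mojibake(text: str) -> str:
    """Undo one round of UTF-8-read-as-Latin-1, or return text unchanged."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in Latin-1, or not valid UTF-8 afterwards:
        # the text was probably not mojibake of this kind.
        return text


print(try_fix_mojibake("caf\u00c3\u00a9"))  # café (repaired)
print(try_fix_mojibake("caf\u00e9"))        # café (already correct, left alone)
print(try_fix_mojibake("\u4e2d\u6587"))     # 中文 (left alone)
```

This remains a heuristic. A string that legitimately contains a sequence like Ã© would be altered, and text misread as Windows-1252 rather than Latin-1 needs that codec instead.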

Problem 5: Line Endings and Whitespace

Different platforms use different line ending conventions, and copy-paste between them can introduce mixed line endings:

Platform Line Ending Code Points
Unix/macOS LF U+000A
Windows CRLF U+000D U+000A
Classic Mac OS CR U+000D
Unicode LS U+2028 (Line Separator)
Unicode PS U+2029 (Paragraph Separator)

Mixed line endings in a single file can cause parsing failures, diff noise, and version control conflicts.
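Before normalizing, it can help to know which conventions a pasted block actually contains. The following sketch counts each kind of line break from the table; the names `count_line_endings` and `has_mixed_line_endings` are illustrative, not from any library.

```python
import re
from collections import Counter

# CRLF must come first so it is not counted as a separate CR and LF
_LINE_BREAK = re.compile("\r\n|\r|\n|\u2028|\u2029")
_NAMES = {"\r\n": "CRLF", "\r": "CR", "\n": "LF", "\u2028": "LS", "\u2029": "PS"}


def count_line_endings(text: str) -> Counter:
    """Count each line-break convention that occurs in text."""
    return Counter(_NAMES[m.group()] for m in _LINE_BREAK.finditer(text))


def has_mixed_line_endings(text: str) -> bool:
    """True when more than one convention is present."""
    return len(count_line_endings(text)) > 1


pasted = "first\r\nsecond\nthird\rfourth\n"
print(dict(count_line_endings(pasted)))
print(has_mixed_line_endings(pasted))  # True
```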

A Practical Cleaning Pipeline

Here is a comprehensive function that addresses the most common copy-paste issues:

import unicodedata
import re

def clean_pasted_text(text: str) -> str:
    """Clean text that was pasted from external sources."""
    # 1. Normalize to NFC
    text = unicodedata.normalize("NFC", text)

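    # 2. Remove zero-width characters
    # Caution: U+200C/U+200D are meaningful in Persian, Indic scripts, and
    # emoji sequences; drop them from this set if such text is expected.
    text = re.sub("[\u200b\u200c\u200d\u2060\ufeff]", "", text)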

    # 3. Replace typographic quotes with ASCII
    replacements = {
        "\u201c": '"', "\u201d": '"',  # smart double quotes
        "\u2018": "'", "\u2019": "'",  # smart single quotes
        "\u00ab": '"', "\u00bb": '"',  # guillemets
        "\u2013": "-", "\u2014": "-",  # en/em dash
        "\u2026": "...",                 # ellipsis
        "\u00a0": " ",                   # no-break space
    }
    for old, new in replacements.items():
        text = text.replace(old, new)

    # 4. Normalize line endings to LF
    text = text.replace("\r\n", "\n").replace("\r", "\n")

    # 5. Remove soft hyphens
    text = text.replace("\u00ad", "")

    return text

JavaScript Equivalent

function cleanPastedText(text) {
  // 1. Normalize to NFC
  text = text.normalize('NFC');

  // 2. Remove zero-width characters
  text = text.replace(/[\u200B\u200C\u200D\u2060\uFEFF]/g, '');

  // 3. Replace typographic characters
  text = text
    .replace(/[\u201C\u201D\u00AB\u00BB]/g, '"')
    .replace(/[\u2018\u2019]/g, "'")
    .replace(/[\u2013\u2014]/g, '-')
    .replace(/\u2026/g, '...')
    .replace(/\u00A0/g, ' ');

  // 4. Normalize line endings
  text = text.replace(/\r\n?/g, '\n');

  // 5. Remove soft hyphens
  text = text.replace(/\u00AD/g, '');

  return text;
}

Preventing Copy-Paste Issues

For Content Authors

  1. Paste as plain text (Ctrl+Shift+V or Cmd+Shift+V in many applications; the shortcut varies) to strip formatting and reduce hidden character contamination
  2. Use a code editor for code — not Word, Google Docs, or web-based note tools
  3. Run a linter after pasting code from external sources
  4. Check encoding when copying from PDFs — PDF text extraction is notoriously lossy

For Developers

  1. Normalize input — apply NFC normalization to all user input at the application boundary
  2. Validate character ranges — reject or strip characters outside expected Unicode blocks
  3. Display invisible characters in debug mode — show code points for zero-width and format characters
  4. Use clipboard sanitization — strip invisible characters from paste events in web apps:
document.addEventListener('paste', (event) => {
  event.preventDefault();
  const raw = event.clipboardData.getData('text/plain');
  const clean = cleanPastedText(raw);
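  // execCommand is deprecated but still widely supported; it is commonly
  // used here because the insertion stays on the editor's undo stack.
  document.execCommand('insertText', false, clean);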
});
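For item 3 in the list above, a debug view can replace each hidden character with a visible tag. This is a minimal sketch, assuming that format characters (Cf), line and paragraph separators, and any space other than U+0020 should be revealed; `reveal_invisibles` is a hypothetical name.

```python
import unicodedata


def reveal_invisibles(text: str) -> str:
    """Replace format and unusual-space characters with visible <U+XXXX> tags."""
    out = []
    for char in text:
        cat = unicodedata.category(char)
        hidden = cat in {"Cf", "Zl", "Zp"} or (cat == "Zs" and char != " ")
        out.append(f"<U+{ord(char):04X}>" if hidden else char)
    return "".join(out)


print(reveal_invisibles("hello\u200bworld\u00a0!"))
# hello<U+200B>world<U+00A0>!
```

Logging this form alongside the raw value makes a stray zero-width space obvious in a bug report without changing the stored data.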

For Database Administrators

  1. Normalize on insert — apply NFC normalization in a database trigger or application layer
  2. Index on normalized form — ensure text searches match regardless of normalization
  3. Audit for invisible characters — periodically scan text columns for zero-width and format characters
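The first and third points can be sketched with Python's built-in `sqlite3` module. Everything specific here is an assumption made for illustration: the `notes` table, its columns, and the helper names are hypothetical, and SQLite stands in for whatever database is actually in use. Normalization happens in the application layer rather than in a trigger.

```python
import re
import sqlite3
import unicodedata

# Soft hyphen, zero-width and bidi marks, LS/PS, word joiner, BOM
INVISIBLE = re.compile("[\u00ad\u200b-\u200f\u2028\u2029\u2060\ufeff]")


def insert_note(conn: sqlite3.Connection, body: str) -> None:
    """Normalize to NFC at the boundary, then store."""
    normalized = unicodedata.normalize("NFC", body)
    conn.execute("INSERT INTO notes (body) VALUES (?)", (normalized,))


def audit_notes(conn: sqlite3.Connection) -> list[int]:
    """Return ids of rows whose body contains an invisible character."""
    rows = conn.execute("SELECT id, body FROM notes ORDER BY id")
    return [row_id for row_id, body in rows if INVISIBLE.search(body)]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT)")
insert_note(conn, "cafe\u0301")        # NFD input, stored as NFC
insert_note(conn, "hello\u200bworld")  # contains a zero-width space
insert_note(conn, "plain text")

print(audit_notes(conn))  # [2]
```

On large tables a scan like this would normally run in batches or as a scheduled job, since it reads every row.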

Key Takeaways

  1. Copy-paste is not transparent — it can introduce invisible characters, change normalization forms, substitute look-alike characters, and corrupt encodings
  2. Invisible characters (ZWSP, BOM, soft hyphen, no-break space) are the most common and hardest to detect without tooling
  3. Smart quotes and typographic replacements from word processors break code and string comparisons
  4. Normalization mismatches (NFC vs. NFD) cause identical-looking strings to fail equality checks
  5. Always sanitize pasted text at the application boundary — normalize, strip invisible characters, and validate character ranges
  6. "Paste as plain text" is the single most effective habit for avoiding copy-paste Unicode problems
