Unicode in Filenames

Modern operating systems support Unicode filenames, but filesystems differ in how they encode and normalize them. This causes cross-platform file-sharing problems, especially between macOS (which tends to produce NFD names) and Linux (which stores whatever bytes it is given, usually NFC). This guide explains how Unicode filenames work on Windows, macOS, and Linux, and how to safely handle them in Python, shell scripts, and other tools.

Unicode filenames are a source of subtle, maddening bugs. A file created on macOS may be invisible to a script on Linux. A ZIP archive from Windows may extract with garbled names on other platforms. Two files that look identical in a directory listing may actually have different byte sequences in their names. This guide explains why, and how to handle Unicode filenames safely across operating systems.

Filesystem Encoding Basics

Every filesystem stores filenames as byte sequences. The question is: what encoding maps those bytes to characters?

Filesystem  OS              Encoding                    Normalization
NTFS        Windows         UTF-16LE                    None (preserves as-is)
APFS        macOS           UTF-8                       Preserved as given; lookups ignore NFC/NFD differences
HFS+        macOS (legacy)  UTF-16 (UTF-8 via POSIX)    Apple's variant of NFD (not strict NFD)
ext4        Linux           Bytes (usually UTF-8)       None
Btrfs       Linux           Bytes (usually UTF-8)       None
ZFS         Linux/BSD       Bytes (usually UTF-8)       Optional (via normalization property)
FAT32       All             OEM code page + UTF-16 LFN  None
exFAT       All             UTF-16LE                    None

The critical distinction is between filesystems that enforce an encoding (NTFS, APFS) and those that treat filenames as opaque byte sequences (ext4, Btrfs).
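
To see what "opaque byte sequences" means in practice, here is a minimal sketch, assuming Python 3.9+ on a POSIX system. Passing a bytes path to os.listdir returns the names exactly as the filesystem reports them, and os.fsdecode turns them into str using the filesystem encoding. The helper name raw_names is illustrative and not part of any library.

```python
import os


def raw_names(directory: str) -> list[bytes]:
    """Return directory entries as the raw bytes the filesystem reports."""
    # A bytes argument makes os.listdir return bytes instead of str.
    return sorted(os.listdir(os.fsencode(directory)))


if __name__ == "__main__":
    for raw in raw_names("."):
        # On POSIX, os.fsdecode keeps undecodable bytes as surrogate escapes,
        # so the result round-trips through os.fsencode. ascii() makes the
        # code points visible and is safe to print on any terminal.
        print(raw.hex(" "), "->", ascii(os.fsdecode(raw)))
```

Comparing the hex column for two names that look the same is usually the quickest way to tell an NFC name from an NFD one.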

The NFC vs NFD Problem

The most common Unicode filename issue involves normalization forms. The character "é" (e with acute accent) can be represented two ways:

Form              Code Points    Bytes (UTF-8)       Description
NFC (Composed)    U+00E9         C3 A9 (2 bytes)     Single precomposed character
NFD (Decomposed)  U+0065 U+0301  65 CC 81 (3 bytes)  Base letter + combining accent

Both look identical on screen, but they are different byte sequences. This matters because each platform treats them differently:

macOS (APFS / HFS+)

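macOS tends to produce filenames in a decomposed form similar to NFD. On HFS+ (legacy macOS), the filesystem itself converts names: when you create a file named café.txt in NFC (caf\u00e9.txt, 2 bytes for é), HFS+ stores it decomposed (cafe\u0301.txt, 3 bytes for e + combining accent).

The decomposition HFS+ applies is not strict NFD. Apple calls it "canonically decomposed", but it is based on an older Unicode version and leaves certain character ranges undecomposed. The practical impact is small but can cause issues with edge cases.

On APFS (macOS 10.13+), the filesystem keeps a name in the form it was given, but treats the NFC and NFD spellings as the same file when looking names up. Decomposed names can still appear in practice, because some higher-level macOS APIs hand paths to the filesystem in decomposed form.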

Linux (ext4)

Linux stores whatever bytes you give it. If you create café.txt in NFC (caf\u00e9.txt), it stays NFC. If you create it in NFD (cafe\u0301.txt), it stays NFD. Linux does not normalize, so both spellings can exist side by side in the same directory.

Windows (NTFS)

NTFS preserves filenames exactly as created, with no normalization. It stores filenames as UTF-16LE.

The Cross-Platform Problem

Consider this scenario:

  1. On macOS, you create café.txt, and the name ends up in NFD form (cafe\u0301.txt).
  2. You push it to a Git repository.
  3. On Linux, you clone the repo. The filename on disk is still in NFD form: cafe\u0301.txt.
  4. A Python script calls os.path.exists("caf\u00e9.txt"), using the NFC spelling -- this returns False on Linux because the bytes do not match.

The same file, the same visible name, but different bytes -- and the comparison fails.
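
The scenario above can be reproduced in a temporary directory. This is a sketch under stated assumptions: the helper resolve_name is hypothetical, and the result of the exact lookup depends on the filesystem (it is expected to fail on filesystems that do not normalize, such as ext4, and to succeed on normalization-insensitive ones such as APFS). The helper returns the name as stored, so the file can then be opened with the spelling the filesystem actually has.

```python
import os
import tempfile
import unicodedata
from typing import Optional


def resolve_name(directory: str, wanted: str) -> Optional[str]:
    """Return the stored entry whose NFC form equals NFC(wanted), if any."""
    target = unicodedata.normalize("NFC", wanted)
    for entry in os.listdir(directory):
        if unicodedata.normalize("NFC", entry) == target:
            return entry
    return None


if __name__ == "__main__":
    nfd_name = "cafe\u0301.txt"  # as it might arrive from a macOS checkout
    nfc_name = "caf\u00e9.txt"   # as a script author would likely type it

    with tempfile.TemporaryDirectory() as tmp:
        open(os.path.join(tmp, nfd_name), "w").close()

        # Byte-for-byte lookup: result depends on the filesystem.
        print("exact lookup:", os.path.exists(os.path.join(tmp, nfc_name)))

        # Normalization-aware lookup: finds the stored name in either form.
        print("resolved to:", ascii(resolve_name(tmp, nfc_name)))
```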

Detecting Normalization Issues

Python

import unicodedata
import os

filename = "cafe\u0301.txt"  # NFD spelling: e followed by U+0301

# Compute both normalization forms of the same name
nfc = unicodedata.normalize("NFC", filename)
nfd = unicodedata.normalize("NFD", filename)

print(f"Original bytes: {filename.encode('utf-8').hex()}")
print(f"NFC bytes:      {nfc.encode('utf-8').hex()}")
print(f"NFD bytes:      {nfd.encode('utf-8').hex()}")
print(f"NFC == NFD: {nfc == nfd}")  # False!
print(f"Length NFC: {len(nfc)}, Length NFD: {len(nfd)}")

# Safe file existence check: normalize before comparing
def safe_exists(path: str) -> bool:
    directory = os.path.dirname(path) or "."
    target = unicodedata.normalize("NFC", os.path.basename(path))
    for entry in os.listdir(directory):
        if unicodedata.normalize("NFC", entry) == target:
            return True
    return False

Command Line

# Show the actual bytes of a filename
# (the glob stops at "caf" so it matches both the NFC and the NFD spelling)
ls caf*.txt | xxd | head

# Python snippet to list names in the current directory that are not NFC
python3 -c "
import os, unicodedata
for f in os.listdir('.'):
    nfc = unicodedata.normalize('NFC', f)
    if f != nfc:
        print(f'Not NFC: {f!r}')
"

Git and Unicode Filenames

Git has a configuration option that controls how it handles Unicode normalization in filenames:

# Check current setting
git config core.precomposeunicode

# macOS: set to true (the default for repositories created on macOS)
# Git then converts the decomposed names macOS reports back to NFC,
# so the repository records NFC filenames
git config --global core.precomposeunicode true

# Linux: the option has no effect (Git preserves whatever is in the repo)

The Git Normalization Problem

If a macOS user commits a file with an NFD filename and a Linux user clones the repo, the Linux user gets the NFD filename. This can cause:

  • Build scripts that reference the NFC filename to fail
  • import statements in Python to fail if the module filename is NFD
  • Duplicate-looking files if someone creates a new file with the NFC name
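
Duplicate-looking files can be found by grouping names on their NFC form. The following is a rough sketch, assuming Python 3.9+; the function name normalization_collisions is made up for this example.

```python
import os
import unicodedata
from collections import defaultdict
from typing import Iterable


def normalization_collisions(names: Iterable[str]) -> dict[str, list[str]]:
    """Group names that differ as stored but are equal after NFC."""
    groups: dict[str, list[str]] = defaultdict(list)
    for name in names:
        groups[unicodedata.normalize("NFC", name)].append(name)
    # Keep only NFC names that are claimed by more than one stored name.
    return {nfc: sorted(found) for nfc, found in groups.items() if len(found) > 1}


if __name__ == "__main__":
    for nfc, found in normalization_collisions(os.listdir(".")).items():
        print(ascii(nfc), "<-", [ascii(n) for n in found])
```

Taking a list of names rather than a directory means the same check can also be run on other sources, such as the file list of a repository.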

Prevention

Normalize filenames before committing:

# Install convmv (filename encoding converter)
sudo apt install convmv   # Debian/Ubuntu
brew install convmv        # macOS

# Preview NFD-to-NFC conversion (convmv only prints what it would do by default)
convmv -r -f utf-8 -t utf-8 --nfc .

# Actually rename (--notest disables the dry run)
convmv -r -f utf-8 -t utf-8 --nfc --notest .
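
Where convmv is not available, the same idea can be sketched in Python. This is an illustration rather than a replacement for convmv: the function names are hypothetical, and it assumes a filesystem that keeps names byte-for-byte (such as ext4). On a normalization-insensitive filesystem the NFC target would already appear to exist, so every rename would be skipped.

```python
import os
import unicodedata


def plan_nfc_renames(root: str) -> list[tuple[str, str]]:
    """List (old_path, new_path) pairs for entries whose names are not NFC."""
    plan = []
    # Bottom-up walk so children are renamed before their parent directory.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in dirnames + filenames:
            nfc = unicodedata.normalize("NFC", name)
            if nfc != name:
                plan.append((os.path.join(dirpath, name), os.path.join(dirpath, nfc)))
    return plan


def apply_renames(plan: list[tuple[str, str]]) -> None:
    """Carry out a plan, skipping any rename whose target already exists."""
    for old, new in plan:
        if os.path.lexists(new):
            print(f"skipped (target exists): {new!a}")
            continue
        os.rename(old, new)


if __name__ == "__main__":
    # Dry run: print the plan without touching anything.
    for old, new in plan_nfc_renames("."):
        print(f"{old!a} -> {new!a}")
```

Keeping the plan separate from the rename step mirrors the dry-run-first workflow of convmv: inspect the plan, then pass it to apply_renames.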

Platform-Specific Filename Restrictions

Beyond encoding, each OS has different rules for valid characters in filenames:

Windows (NTFS)

Restriction           Details
Forbidden characters  < > : " / \ | ? *
Forbidden names       CON, PRN, AUX, NUL, COM1-9, LPT1-9
Trailing dots/spaces  Silently stripped
Max path length       260 characters (default), 32,767 with long paths enabled
Case                  Preserving but insensitive
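
The rules in this table can be turned into a small checker that reports problems instead of silently rewriting the name. This is a sketch, not an exhaustive validator: the function name is hypothetical, and rejecting control characters (0x00-0x1F) is an addition that goes beyond the table.

```python
import re

# Characters from the table above, plus control characters (an assumption
# of this sketch; they are also rejected by the Win32 API).
_FORBIDDEN = re.compile(r'[<>:"/\\|?*\x00-\x1f]')
_RESERVED = {"CON", "PRN", "AUX", "NUL"} | {
    f"{dev}{n}" for dev in ("COM", "LPT") for n in range(1, 10)
}


def windows_name_problems(name: str) -> list[str]:
    """Return human-readable reasons why a filename is risky on Windows."""
    problems = []
    if _FORBIDDEN.search(name):
        problems.append("contains a forbidden or control character")
    # Reserved device names apply regardless of case or extension.
    if name.split(".")[0].rstrip(" ").upper() in _RESERVED:
        problems.append("uses a reserved device name")
    if name != name.rstrip(". "):
        problems.append("ends with a dot or space")
    return problems
```

An empty list means none of these rules were triggered; it does not guarantee the name is valid in every Windows context (path length, for instance, is not checked here).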

macOS (APFS)

Restriction           Details
Forbidden characters  / and NUL at the POSIX layer; : cannot be typed in Finder, which displays a stored : as /
Max filename length   255 UTF-8 bytes
Max path length       1024 bytes
Case                  Preserving, insensitive by default (configurable)

Linux (ext4)

Restriction           Details
Forbidden characters  / and NUL only
Max filename length   255 bytes
Max path length       4096 bytes
Case                  Sensitive

The 255-Byte Limit

The 255-byte limit on ext4 and APFS is measured in bytes, not characters. A filename using only ASCII characters can be 255 characters long. A filename using 4-byte emoji can only be 63 characters long before hitting the byte limit.

# Check if a filename is safe for ext4/APFS
def is_filename_safe(name: str) -> bool:
    encoded = name.encode("utf-8")
    return len(encoded) <= 255
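
When a name is too long, it has to be shortened on a byte budget without cutting a multi-byte character in half. A minimal sketch follows, with a hypothetical helper name; it keeps the extension and trims only the stem.

```python
import os


def truncate_to_bytes(name: str, limit: int = 255) -> str:
    """Shorten name so its UTF-8 encoding fits in limit bytes, keeping the extension."""
    if len(name.encode("utf-8")) <= limit:
        return name
    stem, ext = os.path.splitext(name)
    budget = limit - len(ext.encode("utf-8"))
    if budget <= 0:
        raise ValueError("extension alone exceeds the byte limit")
    # Cutting bytes may split a multi-byte character; errors="ignore"
    # drops the incomplete tail instead of producing invalid UTF-8.
    short_stem = stem.encode("utf-8")[:budget].decode("utf-8", errors="ignore")
    return short_stem + ext
```

For example, a name made of 100 four-byte emoji followed by .txt comes back as 62 emoji plus .txt (252 bytes). The cut works on code points, so it can still separate a base letter from a combining accent in an NFD name; normalizing to NFC first reduces that risk.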

ZIP Archives and Unicode

ZIP archives have a long history of encoding problems because the original ZIP format (1989) predates Unicode adoption. Filename encoding depends on the ZIP creator:

Creator                    Filename Encoding
Windows Explorer (modern)  UTF-8 with language encoding flag (bit 11)
Windows Explorer (old)     OEM code page (CP437 or system locale)
macOS Archive Utility      UTF-8
Info-ZIP (Linux)           UTF-8 (usually)
7-Zip                      UTF-8 with flag
WinRAR                     UTF-8 with flag
Java ZipOutputStream       UTF-8 by default (another charset can be passed to the constructor since Java 7)

Handling ZIP Encoding in Python

import zipfile

# Reading: Python 3 decodes names as UTF-8 when the flag is set and as
# CP437 otherwise (3.11+ also accepts a metadata_encoding argument)
with zipfile.ZipFile("archive.zip", "r") as zf:
    for info in zf.infolist():
        # Check if UTF-8 flag (bit 11) is set
        if info.flag_bits & 0x800:
            # Filename is UTF-8
            print(f"UTF-8: {info.filename}")
        else:
            # zipfile decoded the name as CP437; re-encode to recover the raw
            # bytes, then try UTF-8 first and fall back to a legacy code page
            raw = info.filename.encode("cp437")
            try:
                name = raw.decode("utf-8")
            except UnicodeDecodeError:
                # cp949 (Korean) is only an example; use the code page you expect
                name = raw.decode("cp949", errors="replace")
            print(f"Recoded: {name}")
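
On the writing side, Python's zipfile sets bit 11 on its own whenever a name cannot be encoded as ASCII. The sketch below checks this with an in-memory archive; the function name utf8_flag_report is made up for the example.

```python
import io
import zipfile


def utf8_flag_report(names: list[str]) -> dict[str, bool]:
    """Write names into an in-memory ZIP and report which entries got bit 11."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        for name in names:
            zf.writestr(name, b"")  # empty entries are enough for this check
    with zipfile.ZipFile(buffer, "r") as zf:
        return {info.filename: bool(info.flag_bits & 0x800) for info in zf.infolist()}


if __name__ == "__main__":
    for name, flagged in utf8_flag_report(["plain.txt", "caf\u00e9.txt"]).items():
        print(ascii(name), "UTF-8 flag set" if flagged else "no flag (ASCII name)")
```

Archives written this way should extract with correct names in any tool that honors the flag, which is why the flag column in the table above matters more than the creator's platform.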

Cross-Platform Safe Filenames

If you need filenames that work everywhere, follow these rules:

import re
import unicodedata

def safe_filename(name: str) -> str:
    # Normalize to NFC
    name = unicodedata.normalize("NFC", name)
    # Remove characters forbidden on Windows
    name = re.sub(r'[<>:"/\\|?*]', "_", name)
    # Remove control characters
    name = re.sub(r"[\x00-\x1f\x7f]", "", name)
    # Strip trailing dots and spaces (Windows issue)
    name = name.rstrip(". ")
    # Avoid Windows reserved names
    reserved = {"CON", "PRN", "AUX", "NUL"}
    reserved.update(f"COM{i}" for i in range(1, 10))
    reserved.update(f"LPT{i}" for i in range(1, 10))
    stem = name.split(".")[0].upper()
    if stem in reserved:
        name = f"_{name}"
    # Ensure the UTF-8 byte length is within limits
    encoded = name.encode("utf-8")
    if len(encoded) > 240:  # Leave margin below 255
        name = encoded[:240].decode("utf-8", errors="ignore")
    return name or "unnamed"

Best Practices Summary

Practice                                 Why
Normalize to NFC on creation             Prevents NFC/NFD mismatch across platforms
Normalize to NFC on comparison           Ensures caf\u00e9 matches cafe\u0301 regardless of source
Limit to ASCII in automated pipelines    Avoids all encoding/normalization issues
Set core.precomposeunicode true in Git   Keeps filenames committed from macOS in NFC, consistent with Linux
Check byte length, not character length  The 255 limit on ext4/APFS counts UTF-8 bytes, not characters
Use UTF-8 everywhere                     Consistent encoding across all platforms
Test with accented characters and emoji  Catches normalization and 4-byte issues early
