
Unicode in PDF Documents

PDF supports Unicode text through embedded fonts and ToUnicode maps, but many PDFs created from scans or by older tools yield garbled output or missing characters when text is copied out of them. This guide explains how Unicode is stored in PDF files, how to diagnose text extraction problems, and best practices for creating accessible Unicode PDFs.

PDF (Portable Document Format) is the world's standard for fixed-layout documents. Unlike HTML or plain text, PDF embeds precise instructions for rendering text at exact positions on a page. This precision comes at a cost: PDF's relationship with Unicode is complex, involving font embedding, CMap tables, and ToUnicode mappings. Understanding how Unicode works in PDF is essential for anyone generating multilingual documents, extracting text from PDFs, or debugging rendering issues. This guide explains the internals of Unicode in PDF and provides practical guidance for both creating and consuming PDF documents.

How PDF Stores Text

PDF does not store text as a sequence of Unicode code points. Instead, it stores glyph references — numeric IDs that point to specific shapes in an embedded font. The text-rendering pipeline looks like this:

Unicode text ("Hello")
    |
    v
PDF Writer (maps code points to glyph IDs)
    |
    v
PDF File (stores glyph IDs + font program)
    |
    v
PDF Reader (renders glyphs from font program)

A PDF string like <0048 0065 006C 006C 006F> contains glyph IDs (or character codes), not necessarily Unicode code points. The mapping between these IDs and Unicode is stored separately in optional structures called CMap and ToUnicode tables.
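
To make the distinction concrete, the following minimal sketch splits the hex string above into integer codes. The function name `split_hex_codes` is hypothetical, and the sketch assumes 2-byte codes (as with Identity-H); in a real PDF the code width is determined by the font's CMap.

```python
def split_hex_codes(hex_string: str, width: int = 2) -> list[int]:
    """Split a PDF hex string such as '<0048 0065>' into integer codes.

    `width` is the number of bytes per code (assumed to be 2 here).
    """
    # Whitespace inside a PDF hex string is ignored.
    digits = "".join(hex_string.strip().strip("<>").split())
    if len(digits) % 2:
        digits += "0"  # an odd final digit is treated as if followed by 0
    raw = bytes.fromhex(digits)
    return [int.from_bytes(raw[i:i + width], "big") for i in range(0, len(raw), width)]


codes = split_hex_codes("<0048 0065 006C 006C 006F>")
print([f"{code:04X}" for code in codes])
```

These five codes happen to equal the Unicode values of "Hello", but nothing in the file format guarantees that. With a subsetted font the same word could just as well be stored as <0001 0002 0003 0003 0004>.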

Font Embedding

For text to display correctly in a PDF, the font (or a subset of it) must be embedded in the file. PDF supports several font types:

Font type | Description | Unicode support
Type 1 | PostScript fonts (legacy) | Simple font, limited to 256 codes per font
TrueType | Common system fonts (.ttf) | Full Unicode via the font's cmap table
OpenType/CFF | Modern fonts (.otf) | Full Unicode via the font's cmap table
Type 0 (Composite) | CID-keyed fonts, used for CJK and other large glyph sets | Full Unicode via CMap
Type 3 | User-defined glyphs drawn with PDF operators (often bitmaps) | No standard Unicode mapping

Font subsetting

Most PDF generators subset the font — they include only the glyphs actually used in the document rather than the entire font file. This dramatically reduces file size (a full CJK font can be 10-20 MB, but a subset might be 50 KB).

The trade-off: subsetting complicates text extraction because glyph IDs in a subsetted font are remapped and may not correspond to any standard encoding.
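
The remapping can be illustrated with a toy model. This is only a sketch of the idea, not how any particular subsetter works: the function `subset_glyph_ids`, the sample glyph IDs, and the policy of assigning new IDs in order of first use are all assumptions made for illustration.

```python
def subset_glyph_ids(text: str, cmap: dict[int, int]) -> tuple[dict[int, int], dict[int, int]]:
    """Return (old_gid -> new_gid, new_gid -> code point) for the glyphs `text` uses.

    `cmap` is a hypothetical code point -> original glyph ID table.
    """
    remap = {0: 0}  # glyph 0 (.notdef) is conventionally kept
    to_unicode = {}
    for ch in text:
        old_gid = cmap[ord(ch)]
        if old_gid not in remap:
            remap[old_gid] = len(remap)  # next free ID in the subset
            to_unicode[remap[old_gid]] = ord(ch)
    return remap, to_unicode


# Made-up glyph IDs standing in for a full font's cmap table.
full_font_cmap = {ord("H"): 943, ord("e"): 1021, ord("l"): 1102, ord("o"): 1187}
remap, to_unicode = subset_glyph_ids("Hello", full_font_cmap)
print(remap)
print(to_unicode)
```

After subsetting, the content stream refers to glyphs 1 to 4, which carry no meaning outside this file. Only the second dictionary, the information a ToUnicode map records, lets an extractor turn them back into "Hello".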

CMap Tables

A CMap (Character Map) defines the mapping from character codes in the PDF content stream to character IDs (CIDs), which the font then resolves to glyph IDs. PDF uses two kinds of CMaps:

Predefined CMaps (for CJK)

PDF defines a set of predefined CMaps, mostly for Chinese, Japanese, and Korean:

CMap name | Language | Mapping
UniGB-UCS2-H | Chinese (Simplified) | Unicode (UCS-2) → Adobe-GB1 CIDs
UniCNS-UCS2-H | Chinese (Traditional) | Unicode (UCS-2) → Adobe-CNS1 CIDs
UniJIS-UCS2-H | Japanese | Unicode (UCS-2) → Adobe-Japan1 CIDs
UniKS-UCS2-H | Korean | Unicode (UCS-2) → Adobe-Korea1 CIDs
Identity-H | Any | Identity mapping (2-byte code = CID)

The Identity-H CMap is the most common in modern PDFs. It maps each 2-byte code to the CID with the same value. When the embedded font also uses an identity CID-to-glyph mapping, the PDF content stream effectively contains glyph indices rather than character codes. This is efficient for rendering but useless for text extraction without a ToUnicode map.

Custom CMaps

PDF generators can embed custom CMaps that define arbitrary mappings. These are necessary when using fonts with non-standard encodings.

The ToUnicode Map

The ToUnicode map is an optional (but critical) structure in a PDF that maps the character codes used in the content stream (often glyph IDs) back to Unicode code points. It is what makes text extraction (copy-paste, search, accessibility) possible.

Glyph rendering:  PDF content → CMap → Glyph ID → Font → Rendered glyph
Text extraction:  PDF content → ToUnicode → Unicode code points → Searchable text

ToUnicode CMap format

A ToUnicode map is written in PostScript-like CMap syntax:

/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
3 beginbfchar
<0003> <0020>
<0013> <0048>
<0024> <0065>
endbfchar
endcmap
CMapName currentdict /CMap defineresource pop
end
end

In this example:

  • Code 0003 (a glyph ID under Identity-H) maps to Unicode U+0020 (space)
  • Code 0013 maps to Unicode U+0048 (H)
  • Code 0024 maps to Unicode U+0065 (e)

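For fonts that use Identity-H or a custom encoding, PDF readers cannot extract meaningful text without a ToUnicode map: you get raw codes that look like random characters. Fonts with a standard encoding or a Unicode-based predefined CMap can sometimes be decoded without one, but that should not be relied on.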
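
The lookup an extractor performs can be sketched in a few lines of Python. This is a minimal illustration that handles only the `beginbfchar` entries shown above; real ToUnicode maps also use `beginbfrange` sections, which this sketch ignores. The helper names `parse_bfchar` and `decode_codes` are hypothetical.

```python
import re

TOUNICODE_CMAP = """
3 beginbfchar
<0003> <0020>
<0013> <0048>
<0024> <0065>
endbfchar
"""


def parse_bfchar(cmap_text: str) -> dict[int, str]:
    """Collect code -> Unicode string pairs from beginbfchar/endbfchar sections."""
    mapping = {}
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", section):
            # Destination values are UTF-16BE, so one code may map to several characters.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


def decode_codes(codes: list[int], mapping: dict[int, str]) -> str:
    # Unmapped codes become U+FFFD so that gaps stay visible.
    return "".join(mapping.get(code, "\ufffd") for code in codes)


mapping = parse_bfchar(TOUNICODE_CMAP)
print(repr(decode_codes([0x0013, 0x0024, 0x0003], mapping)))
```

Because each destination is a UTF-16BE string rather than a single code point, a ligature glyph can map to several characters (for example `<FB01>`-style glyphs mapping to `<00660069>`, "fi").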

Common Problems

Problem 1: Text extraction returns garbage

Cause: The PDF lacks a ToUnicode map, or the map is incorrect.
Symptoms: Copy-pasting from the PDF produces random characters or blanks.
Solution: Re-generate the PDF with a tool that writes proper ToUnicode maps.
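
One way to detect this problem automatically is to score extracted text for signs of unmapped glyphs. The sketch below is a heuristic under stated assumptions, not a definitive test: it counts `(cid:N)` placeholders (the form pdfminer.six is commonly seen to emit for unmapped codes), U+FFFD, and private-use, unassigned, or control characters. The function names and the 0.2 threshold are arbitrary choices for illustration.

```python
import re
import unicodedata

CID_PLACEHOLDER = re.compile(r"\(cid:\d+\)")


def garbled_ratio(text: str) -> float:
    """Estimate the share of extracted text that looks unmapped (0.0 to 1.0)."""
    placeholders = CID_PLACEHOLDER.findall(text)
    stripped = CID_PLACEHOLDER.sub("", text)
    visible = [ch for ch in stripped if not ch.isspace()]
    suspicious = sum(
        1
        for ch in visible
        # U+FFFD, private-use (Co), unassigned (Cn), and control (Cc) characters
        if ch == "\ufffd" or unicodedata.category(ch) in {"Co", "Cn", "Cc"}
    )
    total = len(visible) + len(placeholders)
    return (suspicious + len(placeholders)) / total if total else 0.0


def looks_garbled(text: str, threshold: float = 0.2) -> bool:
    # The threshold is an assumption; tune it against your own documents.
    return garbled_ratio(text) >= threshold
```

A score near zero does not prove the text is correct, since a wrong ToUnicode map can produce perfectly valid but incorrect characters. A high score is a reasonable signal that the file needs regenerating or OCR.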

Problem 2: Search does not find words in the PDF

Cause: Same as Problem 1. Without ToUnicode, the PDF reader cannot match search queries to glyph sequences.
Solution: If you cannot regenerate the PDF, use OCR (Optical Character Recognition) to add a searchable text layer.

Problem 3: CJK text displays as boxes or tofu

Cause: The CJK font was not embedded, and the PDF reader cannot find a substitute.
Solution: Ensure the fonts are embedded (a subset is sufficient). In many PDF generators, CJK font embedding must be explicitly enabled because the font files are large.

Problem 4: Right-to-left text is reversed

Cause: Extraction tools generally return characters in the order their codes appear in the content stream. If the PDF generator wrote right-to-left runs in logical (reading) order, the text extracts correctly. If it wrote them in visual order (left to right across the page), the extracted text comes out reversed.
Solution: Use an extraction tool that applies BiDi reordering, or reorder the extracted lines yourself. pdfminer.six with layout analysis is one option to try, but library support for right-to-left text varies, so check the result against a known RTL sample.
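
For lines that consist almost entirely of right-to-left text, a crude post-processing step can undo the reversal. The sketch below is a naive heuristic and explicitly not an implementation of the Unicode Bidirectional Algorithm: it reverses the whole line, so embedded numbers or Latin words would end up reversed too. The function names are hypothetical.

```python
import unicodedata


def is_mostly_rtl(line: str) -> bool:
    """True if strong RTL characters outnumber strong LTR characters."""
    classes = [unicodedata.bidirectional(ch) for ch in line]
    rtl = sum(1 for c in classes if c in {"R", "AL"})  # Hebrew, Arabic letters
    ltr = sum(1 for c in classes if c == "L")
    return rtl > ltr


def visual_to_logical_naive(line: str) -> str:
    """Reverse a line stored in visual order if it is predominantly RTL."""
    return line[::-1] if is_mostly_rtl(line) else line


# The Hebrew word "shalom" as it would be extracted from visual-order storage.
visual = "\u05dd\u05d5\u05dc\u05e9"
logical = visual_to_logical_naive(visual)
print(logical == "\u05e9\u05dc\u05d5\u05dd")
```

For mixed-direction content, a full BiDi implementation is the safer choice; this heuristic is only useful as a quick diagnostic for whether reversed storage is the cause.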

Generating Unicode-Correct PDFs

Libraries and tools

Tool | ToUnicode | CJK support | Notes
ReportLab | Yes | Yes (CID fonts) | Python; commercial + open source
WeasyPrint | Yes | Yes (via system fonts) | Python; HTML/CSS to PDF
fpdf2 | Yes | Yes | Python; lightweight; needs a Unicode TrueType font to be added
Typst | Yes | Yes | Modern alternative to LaTeX
LaTeX (pdflatex) | Partial | Partial | Use LuaLaTeX or XeLaTeX for full Unicode
LaTeX (lualatex) | Yes | Yes | Best Unicode support in LaTeX
wkhtmltopdf | Yes | Depends on system fonts | HTML to PDF via WebKit

Best practices for PDF generation

  1. Always embed fonts: Never rely on the reader having the font installed.
  2. Include ToUnicode maps: Every font in the PDF should have a ToUnicode map. Most modern libraries do this automatically, but verify.
  3. Use Unicode-aware tools: LuaLaTeX instead of pdfLaTeX, WeasyPrint instead of older HTML-to-PDF converters.
  4. Test text extraction: After generating a PDF, extract its text (e.g., with pdftotext or pdfminer) and verify it matches the original.
  5. Subset fonts carefully: Subsetting is fine for file size, but ensure the subsetting tool preserves ToUnicode mappings.
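
The extraction test in step 4 can be automated with a small comparison helper. This is a sketch under stated assumptions: it takes the already-extracted text as a plain string (from pdftotext, pdfminer, or any other tool), compares at word level, and normalizes with NFKC so that ligatures and decomposed accents do not cause false alarms. The function names are hypothetical, and word-level comparison is a poor fit for scripts written without spaces.

```python
import unicodedata


def normalize_for_comparison(text: str) -> str:
    # NFKC folds ligatures such as U+FB01 into "fi"; split() collapses
    # runs of whitespace and line breaks introduced by page layout.
    return " ".join(unicodedata.normalize("NFKC", text).split())


def missing_words(source_text: str, extracted_text: str) -> list[str]:
    """Return source words (in order, without repeats) absent from the extraction."""
    extracted = set(normalize_for_comparison(extracted_text).split())
    missing = []
    for word in normalize_for_comparison(source_text).split():
        if word not in extracted and word not in missing:
            missing.append(word)
    return missing


source = "Hello \u043c\u0438\u0440"           # "Hello" plus a Cyrillic word
extracted = "Hello \ufffd\ufffd\ufffd"        # Cyrillic glyphs lost their mapping
print(missing_words(source, extracted))
```

An empty result suggests the round trip preserved the text. A non-empty one points at the scripts or fonts whose ToUnicode mappings went missing.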

PDF/A and Accessibility

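PDF/A is an ISO-standardized subset of PDF designed for long-term archiving. Every conformance level requires embedded fonts. The accessible level (PDF/A-1a and its successors PDF/A-2a and PDF/A-3a) requires all of the following: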

Requirement | Purpose
All fonts embedded | Ensures rendering without external dependencies
ToUnicode map for all text | Ensures text extractability
Structure tags | Enables screen reader access
Natural language specification | Declares the document's language

If your PDFs need to be accessible (screen readers, text-to-speech), Unicode correctness is not optional: accessibility regulations in many jurisdictions (for example Section 508 in the US and EN 301 549 in the EU, both of which build on WCAG) effectively require it.

Extracting Text from PDFs (Python)

# Using pdfminer.six (pip install pdfminer.six)
from pdfminer.high_level import extract_text

text = extract_text("document.pdf")
print(text)
# If ToUnicode is present, this returns proper Unicode text.
# If not, expect placeholders such as (cid:123) or garbage characters.

# Using PyMuPDF (pip install pymupdf; imported as fitz)
import fitz

with fitz.open("document.pdf") as doc:
    for page in doc:
        # get_text() with no arguments returns the page's plain text
        print(page.get_text())

For PDFs without ToUnicode maps, OCR is your best option:

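# Using pytesseract + pdf2image
# Requires the Tesseract and Poppler binaries, plus Tesseract language
# data for every language named in `lang`.
from pdf2image import convert_from_path
import pytesseract

# Render each page to an image; a higher DPI usually helps recognition.
images = convert_from_path("document.pdf", dpi=300)
for img in images:
    text = pytesseract.image_to_string(img, lang="eng+jpn+kor")
    print(text)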

Key Takeaways

  • PDF stores glyph IDs, not Unicode code points. The ToUnicode map bridges the gap between visual rendering and text extraction.
  • Without a ToUnicode map, copy-paste and search in PDFs produce garbage. Always verify text extraction when generating PDFs.
  • Font embedding is essential. A PDF that references fonts without embedding them will fail on systems where those fonts are not installed.
  • For CJK text, use CID-keyed fonts with proper CMap and ToUnicode mappings.
  • PDF/A mandates font embedding at every conformance level, and its accessible level (PDF/A-1a, 2a, 3a) also mandates Unicode mappings, making it the gold standard for accessible, archivable Unicode documents.
  • When extracting text from poorly-formed PDFs, fall back to OCR with language-specific Tesseract models.
