Unicode in PDF Documents
PDF supports Unicode text through embedded fonts and ToUnicode maps, but many PDFs created from scans or older tools produce files where copy-pasting text yields garbled output or missing characters. This guide explains how Unicode is stored in PDF files, how to diagnose text extraction problems, and best practices for creating accessible Unicode PDFs.
PDF (Portable Document Format) is the world's standard for fixed-layout documents. Unlike HTML or plain text, PDF embeds precise instructions for rendering text at exact positions on a page. This precision comes at a cost: PDF's relationship with Unicode is complex, involving font embedding, CMap tables, and ToUnicode mappings. Understanding how Unicode works in PDF is essential for anyone generating multilingual documents, extracting text from PDFs, or debugging rendering issues. This guide explains the internals of Unicode in PDF and provides practical guidance for both creating and consuming PDF documents.
How PDF Stores Text
PDF does not store text as a sequence of Unicode code points. Instead, it stores glyph references — numeric IDs that point to specific shapes in an embedded font. The text-rendering pipeline looks like this:
```
Unicode text ("Hello")
        |
        v
PDF Writer (maps code points to glyph IDs)
        |
        v
PDF File (stores glyph IDs + font program)
        |
        v
PDF Reader (renders glyphs from font program)
```
A PDF string like <0048 0065 006C 006C 006F> contains glyph IDs (or character codes),
not necessarily Unicode code points. The mapping between these IDs and Unicode is stored
separately in optional structures called CMap and ToUnicode tables.
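To make the distinction concrete, here is a small Python sketch that decodes such a 2-byte-per-code hex string under the assumption that the codes happen to equal Unicode code points (true for this example, but not guaranteed in general):

```python
# Decode a PDF hex string assuming each 2-byte code equals a Unicode
# code point. This happens to hold here, but under Identity-H the
# codes are glyph IDs and this naive decoding produces garbage --
# which is exactly why ToUnicode maps exist.
hex_string = "00480065006C006C006F"  # <0048 0065 006C 006C 006F>

# Split into 2-byte (4 hex digit) codes
codes = [int(hex_string[i:i + 4], 16) for i in range(0, len(hex_string), 4)]
print(codes)  # [72, 101, 108, 108, 111]

text = "".join(chr(c) for c in codes)
print(text)   # Hello
```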
Font Embedding
For text to display correctly in a PDF, the font (or a subset of it) must be embedded in the file. PDF supports several font types:
| Font Type | Description | Unicode Support |
|---|---|---|
| Type 1 | PostScript fonts (legacy) | Limited to 256 glyphs |
| TrueType | Common system fonts (.ttf) | Full Unicode via cmap table |
| OpenType/CFF | Modern fonts (.otf) | Full Unicode via cmap table |
| Type 0 (Composite) | CID-keyed fonts for CJK | Full Unicode via CMap |
| Type 3 | User-defined bitmap fonts | No standard Unicode mapping |
Font subsetting
Most PDF generators subset the font — they include only the glyphs actually used in the document rather than the entire font file. This dramatically reduces file size (a full CJK font can be 10-20 MB, but a subset might be 50 KB).
The trade-off: subsetting complicates text extraction because glyph IDs in a subsetted font are remapped and may not correspond to any standard encoding.
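The remapping can be illustrated with a toy sketch (the glyph IDs below are invented for illustration; real subsetting tools such as fontTools operate on the font's actual glyph tables):

```python
# Toy illustration of subsetting: glyph IDs in the subset font are
# renumbered and no longer match the original font's IDs.
full_font_gids = {"H": 43, "e": 70, "l": 79, "o": 82}  # hypothetical originals

used_chars = sorted(set("Hello"))  # glyphs actually used in the document
subset_gids = {ch: gid for gid, ch in enumerate(used_chars, start=1)}
print(subset_gids)  # {'H': 1, 'e': 2, 'l': 3, 'o': 4}

# Without a ToUnicode map recording this remapping, an extractor
# seeing glyph ID 1 has no way to know it means "H".
```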
CMap Tables
A CMap (Character Map) defines the mapping from character codes in the PDF content stream to glyph IDs in the font. PDF uses two types of CMaps:
Predefined CMaps (for CJK)
PDF includes predefined CMaps for Chinese, Japanese, and Korean:
| CMap Name | Language | Mapping |
|---|---|---|
| UniGB-UCS2-H | Chinese (Simplified) | Unicode (UCS-2) → Adobe-GB1 CIDs |
| UniCNS-UCS2-H | Chinese (Traditional) | Unicode (UCS-2) → Adobe-CNS1 CIDs |
| UniJIS-UCS2-H | Japanese | Unicode (UCS-2) → Adobe-Japan1 CIDs |
| UniKS-UCS2-H | Korean | Unicode (UCS-2) → Adobe-Korea1 CIDs |
| Identity-H | Any | Identity mapping (code = CID) |
The Identity-H CMap is the most common in modern PDFs. It maps character codes
directly to glyph IDs, which means the PDF content stream contains glyph indices rather
than character codes. This is efficient for rendering but useless for text extraction
without a ToUnicode map.
Custom CMaps
PDF generators can embed custom CMaps that define arbitrary mappings. These are necessary when using fonts with non-standard encodings.
The ToUnicode Map
The ToUnicode map is an optional (but critical) structure in a PDF that maps glyph IDs back to Unicode code points. It is what makes text extraction (copy-paste, search, accessibility) possible.
```
Glyph rendering:  PDF content → CMap → Glyph ID → Font → Rendered glyph
Text extraction:  PDF content → ToUnicode → Unicode code points → Searchable text
```
ToUnicode CMap format
A ToUnicode map is written in PostScript-like CMap syntax:
```
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
3 beginbfchar
<0003> <0020>
<0013> <0048>
<0024> <0065>
endbfchar
endcmap
CMapName currentdict /CMap defineresource pop
end
end
```
In this example:
- Glyph ID 0003 maps to Unicode U+0020 (space)
- Glyph ID 0013 maps to Unicode U+0048 (H)
- Glyph ID 0024 maps to Unicode U+0065 (e)
Without a ToUnicode map, PDF readers cannot extract meaningful text — you get glyph IDs that look like random characters.
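A ToUnicode stream like the one above can be assembled programmatically. A minimal sketch (the `build_tounicode` helper is hypothetical, not part of any PDF library; it covers only single-glyph `bfchar` entries in the BMP):

```python
def build_tounicode(mapping):
    """Build a minimal ToUnicode CMap from {glyph_id: unicode_char}.

    Emits one bfchar entry per glyph; real generators also use
    bfrange entries and handle code points above U+FFFF.
    """
    header = [
        "/CIDInit /ProcSet findresource begin",
        "12 dict begin",
        "begincmap",
        "/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def",
        "/CMapName /Adobe-Identity-UCS def",
        "/CMapType 2 def",
        "1 begincodespacerange",
        "<0000> <FFFF>",
        "endcodespacerange",
        f"{len(mapping)} beginbfchar",
    ]
    body = [f"<{gid:04X}> <{ord(ch):04X}>" for gid, ch in sorted(mapping.items())]
    footer = ["endbfchar", "endcmap",
              "CMapName currentdict /CMap defineresource pop", "end", "end"]
    return "\n".join(header + body + footer)

cmap = build_tounicode({0x0003: " ", 0x0013: "H", 0x0024: "e"})
print(cmap)
```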
Common Problems
Problem 1: Text extraction returns garbage
Cause: The PDF lacks a ToUnicode map, or the map is incorrect. Symptoms: Copy-pasting from the PDF produces random characters or blanks. Solution: Re-generate the PDF with a tool that writes proper ToUnicode maps.
Problem 2: Search does not find words in the PDF
Cause: Same as above — without ToUnicode, the PDF reader cannot match search queries to glyph sequences. Solution: If you cannot regenerate the PDF, use OCR (Optical Character Recognition) to add a searchable text layer.
Problem 3: CJK text displays as boxes or tofu
Cause: The CJK font was not embedded, and the PDF reader cannot find a substitute. Solution: Ensure fonts are fully embedded. In many PDF generators, CJK font embedding must be explicitly enabled because the font files are large.
Problem 4: Right-to-left text is reversed
Cause: The PDF stores glyphs in visual order (left-to-right on the page), but the
ToUnicode map should return logical order (reading order). If the PDF generator wrote
logical-order glyph codes, the text extracts correctly. If it wrote visual-order codes,
the extracted text may be reversed.
Solution: Use a PDF library that handles BiDi reordering during extraction
(e.g., pdfminer.six with layout analysis).
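The visual-vs-logical distinction can be seen in a simplified sketch (a single Hebrew run; real extraction must apply the full Unicode BiDi algorithm to handle mixed-direction text):

```python
# Simplified illustration of visual- vs logical-order storage.
logical = "\u05e9\u05dc\u05d5\u05dd"  # Hebrew "shalom" in reading order
visual = logical[::-1]                # glyphs as placed left-to-right on the page

# A PDF that stored visual-order codes extracts reversed:
extracted = visual
print(extracted == logical)           # False

# Re-reversing a pure RTL run recovers logical order; mixed LTR/RTL
# text needs proper BiDi reordering instead of a blind reverse.
print(extracted[::-1] == logical)     # True
```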
Generating Unicode-Correct PDFs
Libraries and tools
| Library | ToUnicode | CJK Support | Notes |
|---|---|---|---|
| ReportLab | Yes | Yes (CID fonts) | Commercial + open source |
| WeasyPrint | Yes | Yes (via system fonts) | HTML/CSS to PDF |
| FPDF2 | Yes | Yes | Lightweight |
| Typst | Yes | Yes | Modern alternative to LaTeX |
| LaTeX (pdflatex) | Partial | Partial | Use LuaLaTeX or XeLaTeX for full Unicode |
| LaTeX (lualatex) | Yes | Yes | Best Unicode support in LaTeX |
| wkhtmltopdf | Yes | Depends on system fonts | HTML to PDF via WebKit |
Best practices for PDF generation
- Always embed fonts: Never rely on the reader having the font installed.
- Include ToUnicode maps: Every font in the PDF should have a ToUnicode map. Most modern libraries do this automatically, but verify.
- Use Unicode-aware tools: LuaLaTeX instead of pdfLaTeX, WeasyPrint instead of older HTML-to-PDF converters.
- Test text extraction: After generating a PDF, extract its text (e.g., with pdftotext or pdfminer) and verify it matches the original.
- Subset fonts carefully: Subsetting is fine for file size, but ensure the subsetting tool preserves ToUnicode mappings.
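That extraction check can be automated with a simple round-trip comparison (a sketch; `verify_extraction` is a hypothetical helper, and collapsing whitespace is a deliberate simplification to tolerate line breaks introduced by PDF layout):

```python
import re


def verify_extraction(original: str, extracted: str) -> bool:
    """Compare source text to extracted text, ignoring whitespace
    differences introduced by PDF layout (line breaks, justification)."""
    def normalize(s: str) -> str:
        return re.sub(r"\s+", " ", s).strip()
    return normalize(original) == normalize(extracted)


print(verify_extraction("Hello, world", "Hello,\nworld  "))  # True
print(verify_extraction("Hello, world", "\ue001\ue002"))     # False
```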
PDF/A and Accessibility
PDF/A is an ISO-standardized subset of PDF designed for long-term archiving. PDF/A-1a and later versions require:
| Requirement | Purpose |
|---|---|
| All fonts embedded | Ensures rendering without external dependencies |
| ToUnicode map for all text | Ensures text extractability |
| Structure tags | Enables screen reader access |
| Natural language specification | Declares the document's language |
If your PDFs need to be accessible (screen readers, text-to-speech), Unicode correctness is not optional — it is a legal requirement in many jurisdictions (WCAG 2.1, Section 508, EN 301 549).
Extracting Text from PDFs (Python)
```python
# Using pdfminer.six
from pdfminer.high_level import extract_text

text = extract_text("document.pdf")
print(text)
# If ToUnicode is present, this returns proper Unicode text;
# if not, you get glyph IDs or garbage.
```
```python
# Using PyMuPDF (fitz)
import fitz

doc = fitz.open("document.pdf")
for page in doc:
    text = page.get_text()
    print(text)
```
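A rough heuristic can flag the garbage case automatically (a sketch; `looks_garbled` is a hypothetical helper and the 30% threshold is arbitrary):

```python
import unicodedata


def looks_garbled(text: str, threshold: float = 0.3) -> bool:
    """Flag extraction output dominated by control, private-use, or
    unassigned code points -- a common symptom of a missing ToUnicode map."""
    if not text.strip():
        return True
    suspicious = sum(
        1 for ch in text
        if unicodedata.category(ch) in ("Cc", "Co", "Cn") and ch not in "\n\r\t"
    )
    return suspicious / len(text) > threshold


print(looks_garbled("Hello, world"))        # False
print(looks_garbled("\ue000\ue001\ue002"))  # True
```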
For PDFs without ToUnicode maps, OCR is your best option:
```python
# Using pytesseract + pdf2image
from pdf2image import convert_from_path
import pytesseract

images = convert_from_path("document.pdf")
for img in images:
    text = pytesseract.image_to_string(img, lang="eng+jpn+kor")
    print(text)
```
Key Takeaways
- PDF stores glyph IDs, not Unicode code points. The ToUnicode map bridges the gap between visual rendering and text extraction.
- Without a ToUnicode map, copy-paste and search in PDFs produce garbage. Always verify text extraction when generating PDFs.
- Font embedding is essential. A PDF that references fonts without embedding them will fail on systems where those fonts are not installed.
- For CJK text, use CID-keyed fonts with proper CMap and ToUnicode mappings.
- PDF/A mandates font embedding and ToUnicode maps, making it the gold standard for accessible, archivable Unicode documents.
- When extracting text from poorly-formed PDFs, fall back to OCR with language-specific Tesseract models.