Unicode in QR Codes
QR codes can encode Unicode text using UTF-8, but many QR code generators and scanners default to ISO 8859-1, causing non-Latin characters to appear garbled when scanned. This guide explains how QR codes handle Unicode, how to generate QR codes with correct Unicode encoding, and how to verify that your QR code encodes non-ASCII text properly.
QR codes are everywhere — on product packaging, restaurant menus, business cards, and advertising. They were originally designed for tracking automotive parts in Japanese factories (by Denso Wave in 1994), but their ability to encode text has made them a universal data carrier. What many people do not realize is that QR codes can encode full Unicode text, enabling multilingual content, emoji, and characters from any writing system. This guide explains how Unicode works in QR codes, the encoding modes available, capacity trade-offs, and best practices for generating and scanning Unicode QR codes.
QR Code Encoding Modes
QR codes support four encoding modes, each optimized for different types of data:
| Mode | Name | Characters | Bits per Character |
|---|---|---|---|
| 0001 | Numeric | 0-9 | 3.33 (10 bits per 3 digits) |
| 0010 | Alphanumeric | 0-9, A-Z, space, $%*+-./: | 5.5 (11 bits per 2 chars) |
| 0100 | Byte (8-bit) | Any byte (0x00-0xFF) | 8 |
| 1000 | Kanji | Shift JIS double-byte characters | 13 |
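The bit costs in the table can be turned into a rough estimator. The following is a minimal sketch, not a full encoder: `pick_mode` and `payload_bits` are hypothetical helper names, the whole string is assumed to use a single mode, and only data bits are counted (the mode indicator and character-count field are ignored).

```python
# Characters allowed in Alphanumeric mode (45 symbols).
ALPHANUMERIC = set("0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ $%*+-./:")
DIGITS = set("0123456789")


def pick_mode(text: str) -> str:
    """Return the most compact single mode that can hold `text`."""
    if text and all(ch in DIGITS for ch in text):
        return "numeric"
    if text and all(ch in ALPHANUMERIC for ch in text):
        return "alphanumeric"
    return "byte"


def payload_bits(text: str) -> int:
    """Data bits needed for `text`, excluding mode and length headers."""
    mode = pick_mode(text)
    n = len(text)
    if mode == "numeric":
        # 10 bits per 3 digits; a trailing group of 2 takes 7 bits, of 1 takes 4.
        return (n // 3) * 10 + (0, 4, 7)[n % 3]
    if mode == "alphanumeric":
        # 11 bits per pair; a trailing single character takes 6 bits.
        return (n // 2) * 11 + (n % 2) * 6
    # Byte mode: 8 bits per UTF-8 byte.
    return len(text.encode("utf-8")) * 8


for sample in ("12345678", "HELLO WORLD", "héllo"):
    print(sample, pick_mode(sample), payload_bits(sample))
```

A single lowercase or accented letter is enough to push the whole string into Byte mode, which is why mixed-case URLs cost 8 bits per character.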
Where does Unicode fit?
Unicode text is encoded using Byte mode: the QR code stores raw bytes, and the convention is to encode those bytes as UTF-8. This means any Unicode character can be stored in a QR code, but it consumes more capacity than the Numeric and Alphanumeric modes, which pack digits and uppercase ASCII into fewer bits per character.
The flow for Unicode text:
Unicode text ("Hello, world")
|
v
UTF-8 encoding (bytes)
|
v
QR code Byte mode (each byte = 8 bits in the QR symbol)
|
v
Scanner reads bytes
|
v
UTF-8 decoding -> Unicode text
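The flow above amounts to a UTF-8 round trip around the QR symbol. As a minimal sketch (the function names `to_qr_payload` and `from_qr_payload` are hypothetical, and the QR symbol itself is left out), the byte-level part looks like this:

```python
def to_qr_payload(text: str) -> bytes:
    """Encode text as the UTF-8 bytes a Byte mode segment would carry."""
    return text.encode("utf-8")


def from_qr_payload(payload: bytes) -> str:
    """Decode scanned Byte mode bytes; raises UnicodeDecodeError if not valid UTF-8."""
    return payload.decode("utf-8")


text = "Grüße, 世界"
payload = to_qr_payload(text)
print(len(text), "characters ->", len(payload), "bytes")  # 9 characters -> 15 bytes
print(from_qr_payload(payload) == text)  # True
```

The character count and the byte count differ as soon as the text leaves ASCII, and it is the byte count that consumes QR capacity.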
The Kanji mode exception
Kanji mode is a special case for Japanese. It encodes Shift JIS double-byte characters at 13 bits per character, which is more efficient than Byte mode (16 bits for a two-byte Shift JIS character). However, Kanji mode is limited to the Shift JIS character set and cannot encode arbitrary Unicode. For modern multilingual use, Byte mode with UTF-8 is preferred.
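To make the 13-bit versus UTF-8 comparison concrete, here is a small sketch that checks whether a string qualifies for Kanji mode and compares the data bits. The helper names are hypothetical, and the double-byte ranges (0x8140 to 0x9FFC and 0xE040 to 0xEBBF) are the ones commonly cited for Kanji mode; treat them as an assumption to verify against the specification.

```python
def kanji_mode_bits(text: str):
    """Return the Kanji mode data bits for `text`, or None if it does not qualify."""
    try:
        raw = text.encode("shift_jis")
    except UnicodeEncodeError:
        return None  # A character is outside Shift JIS entirely.
    if len(raw) != 2 * len(text):
        return None  # At least one character is single-byte (ASCII, half-width kana).
    for i in range(0, len(raw), 2):
        value = (raw[i] << 8) | raw[i + 1]
        # Double-byte ranges that Kanji mode can represent.
        if not (0x8140 <= value <= 0x9FFC or 0xE040 <= value <= 0xEBBF):
            return None
    return 13 * len(text)


def utf8_byte_mode_bits(text: str) -> int:
    """Data bits for the same text in Byte mode with UTF-8."""
    return 8 * len(text.encode("utf-8"))


sample = "日本語"
print(kanji_mode_bits(sample), utf8_byte_mode_bits(sample))  # 39 72
```

For pure Japanese text the saving is large, but a single Latin letter or emoji disqualifies the string in this sketch; a real encoder could instead mix segments of different modes.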
Capacity and Unicode
QR code capacity depends on the version (size), error correction level, and encoding mode. Here is the maximum data capacity for the largest QR code (Version 40, 177x177 modules) at the lowest error correction (Level L):
| Mode | Maximum Capacity (Version 40, Level L) |
|---|---|
| Numeric | 7,089 digits |
| Alphanumeric | 4,296 characters |
| Byte | 2,953 bytes |
| Kanji | 1,817 characters |
UTF-8 byte cost per character
Since Unicode in QR codes uses Byte mode with UTF-8 encoding, the cost varies by script:
| Character Type | UTF-8 Bytes | QR Bits (Byte mode) | Example |
|---|---|---|---|
| ASCII (U+0000-U+007F) | 1 | 8 | A, 1, @ |
| Accented Latin, Greek, Cyrillic, Hebrew, Arabic (U+0080-U+07FF) | 2 | 16 | é, Cyrillic, Arabic |
| CJK, most scripts (U+0800-U+FFFF) | 3 | 24 | Chinese, Japanese, Korean, Thai |
| Emoji, rare scripts (U+10000-U+10FFFF) | 4 | 32 | Emoji, musical symbols |
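The byte costs in the table are easy to check directly. This short sketch (the name `utf8_cost` is hypothetical) reports the UTF-8 size of each code point in a string:

```python
def utf8_cost(text: str):
    """Return (character, UTF-8 byte count) pairs for each code point."""
    return [(ch, len(ch.encode("utf-8"))) for ch in text]


for ch, n in utf8_cost("Aé中😀"):
    print(f"U+{ord(ch):04X} {ch!r}: {n} byte(s), {n * 8} bits")

# Emoji built from several code points cost more than 4 bytes:
# three 4-byte emoji joined by two 3-byte zero-width joiners.
family = "👨\u200d👩\u200d👧"
print(len(family.encode("utf-8")))  # 18
```

When budgeting capacity, count encoded bytes rather than visible characters, since one visible emoji may be a sequence of several code points.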
Practical capacity examples
For a Version 10 QR code (57x57 modules) with medium error correction (Level M), Byte mode holds up to 213 bytes. The table shows roughly how many characters of each kind fit within that budget:
| Content | UTF-8 Bytes per Character | Approximate Maximum |
|---|---|---|
| English URL (ASCII) | 1 | 213 characters |
| Korean text | 3 | ~71 characters |
| Chinese text | 3 | ~71 characters |
| Emoji message | 4 | ~53 single-code-point emoji |
| Arabic text | 2 | ~106 characters |
| Mixed English + emoji | Varies | Calculate per character |
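For mixed content, the per-character calculation is simply the UTF-8 length of the whole string compared with the byte budget. A minimal sketch, assuming the 213-byte Version 10, Level M budget used above (the helper names are hypothetical):

```python
V10_M_BYTE_CAPACITY = 213  # Byte mode capacity of Version 10 at Level M.


def fits(text: str, capacity: int = V10_M_BYTE_CAPACITY) -> bool:
    """True if the UTF-8 encoding of `text` fits in `capacity` bytes."""
    return len(text.encode("utf-8")) <= capacity


def max_chars(bytes_per_char: int, capacity: int = V10_M_BYTE_CAPACITY) -> int:
    """How many characters of a fixed UTF-8 width fit in `capacity` bytes."""
    return capacity // bytes_per_char


print(max_chars(2), max_chars(3), max_chars(4))  # 106 71 53
print(fits("한" * 71), fits("한" * 72))  # True False
```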
Error correction levels
| Level | Recovery | Overhead | Use Case |
|---|---|---|---|
| L (Low) | ~7% | Least | Clean environments, screens |
| M (Medium) | ~15% | Moderate | General purpose |
| Q (Quartile) | ~25% | High | Moderate damage expected |
| H (High) | ~30% | Most | Harsh environments, printed labels |
Higher error correction reduces data capacity. For Unicode-heavy QR codes where capacity is tight, Level L or M is recommended.
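The trade-off can be sketched as a lookup: given a fixed version, pick the strongest level whose Byte mode capacity still holds the text. The Version 10 capacities below (271, 213, 151, 119 bytes) are quoted from the standard capacity tables as an assumption; verify them against the specification or your library before relying on them. `strongest_level` is a hypothetical helper.

```python
# Assumed Byte mode capacities for Version 10, strongest level first.
V10_BYTE_CAPACITY = {"H": 119, "Q": 151, "M": 213, "L": 271}


def strongest_level(text: str, capacities=V10_BYTE_CAPACITY):
    """Return the strongest error correction level that fits, or None."""
    size = len(text.encode("utf-8"))
    for level, capacity in capacities.items():  # Dicts keep insertion order.
        if size <= capacity:
            return level
    return None


print(strongest_level("中" * 60))   # 180 bytes -> "M"
print(strongest_level("a" * 119))  # "H"
print(strongest_level("a" * 272))  # None: needs a larger version
```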
ECI (Extended Channel Interpretation)
The QR code specification includes ECI (Extended Channel Interpretation) — a mechanism to declare the encoding of Byte mode data:
| ECI Value | Encoding |
|---|---|
| 000003 | ISO-8859-1 (Latin-1) |
| 000020 | Shift JIS |
| 000025 | UTF-16 Big Endian |
| 000026 | UTF-8 |
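At the bit level, an ECI declaration is a short header placed before the Byte mode segment. The sketch below illustrates the layout as commonly described: ECI mode indicator `0111`, then the assignment number in 8, 16, or 24 bits depending on its size, then the Byte mode indicator `0100` and an 8-bit character count (the count width used by Versions 1 to 9). The function names are hypothetical, and padding, error correction, and module placement are omitted.

```python
def eci_header_bits(assignment: int) -> str:
    """Bit string for an ECI header: mode indicator 0111 plus the assignment number."""
    if assignment < 0 or assignment > 999999:
        raise ValueError("ECI assignment number out of range")
    if assignment <= 127:
        body = format(assignment, "08b")            # 0xxxxxxx
    elif assignment <= 16383:
        body = "10" + format(assignment, "014b")    # 10xxxxxx xxxxxxxx
    else:
        body = "110" + format(assignment, "021b")   # 110xxxxx plus two more bytes
    return "0111" + body


def utf8_segment_bits(text: str) -> str:
    """ECI 26 header followed by a Byte mode segment (Versions 1-9 layout)."""
    data = text.encode("utf-8")
    if len(data) > 255:
        raise ValueError("8-bit character count allows at most 255 bytes")
    header = eci_header_bits(26) + "0100" + format(len(data), "08b")
    return header + "".join(format(byte, "08b") for byte in data)


print(eci_header_bits(26))  # 011100011010
print(utf8_segment_bits("é"))
```

The declaration costs only 12 bits for UTF-8, so capacity is rarely the reason to leave it out.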
The ECI problem
In theory, ECI 26 (UTF-8) should be included in any QR code containing UTF-8 data. In practice:
| Scenario | Reality |
|---|---|
| Most QR generators | Do not include ECI |
| Most QR scanners | Assume UTF-8 for Byte mode |
| Specification | ECI recommended but not required |
| Cross-scanner compatibility | Often better without ECI (some old scanners ignore or mishandle it) |
Current editions of the QR code specification define ISO 8859-1 as the default interpretation of Byte mode data, but the industry has converged on an informal standard: Byte mode data is treated as UTF-8 unless an ECI indicator says otherwise. Most modern scanners (smartphone cameras, barcode scanner apps) handle UTF-8 correctly without an ECI declaration.
However, if you are encoding text for high-reliability applications (logistics, medical), including ECI 26 is a safer choice. Test with your target scanners.
Generating Unicode QR Codes
Python (qrcode library)
```python
import qrcode

# String input is encoded as UTF-8 by default
qr = qrcode.QRCode(
    version=None,  # Auto-size: the smallest version that fits is chosen
    error_correction=qrcode.constants.ERROR_CORRECT_M,
    box_size=10,  # Pixels per module
    border=4,  # Quiet zone width in modules
)
qr.add_data("Unicode text here")
qr.make(fit=True)

img = qr.make_image(fill_color="black", back_color="white")
img.save("unicode_qr.png")
```
JavaScript (qrcode.js)
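Browser (qrcode.js):
```javascript
// Renders into an existing element such as <div id="qrcode"></div>
const qr = new QRCode(document.getElementById("qrcode"), {
  text: "Unicode text here",
  width: 256,
  height: 256,
  correctLevel: QRCode.CorrectLevel.M
});
```
Node.js (qrcode package):
```javascript
const QRCode = require('qrcode');

// With no callback, toFile returns a Promise
QRCode.toFile('unicode_qr.png', 'Unicode text here', { errorCorrectionLevel: 'M' })
  .catch(console.error);
```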
Handling capacity overflow
If your Unicode text exceeds the capacity of a given QR version:
- Increase QR version: Version 1 (21x21) to Version 40 (177x177)
- Lower error correction: From H to M or L
- Use a URL shortener: Encode a short URL that redirects to the full content
- Compress the text: For structured data, use a compact format
- Split across multiple QR codes: The QR specification's Structured Append feature links up to 16 symbols into a single message, although not every scanner supports it
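If you do split content across several codes, the split has to fall on character boundaries, otherwise a multi-byte UTF-8 sequence is cut in half and neither code decodes cleanly. The following is a minimal sketch of such a splitter; `split_utf8` is a hypothetical helper, it works per code point (so it can still separate a multi-code-point emoji sequence), and it does not produce Structured Append headers.

```python
def split_utf8(text: str, max_bytes: int) -> list[str]:
    """Split `text` into chunks whose UTF-8 encoding is at most `max_bytes` each."""
    chunks, current, used = [], [], 0
    for ch in text:
        size = len(ch.encode("utf-8"))
        if size > max_bytes:
            raise ValueError("max_bytes is smaller than a single character")
        if used + size > max_bytes:
            # Close the current chunk before it would overflow.
            chunks.append("".join(current))
            current, used = [], 0
        current.append(ch)
        used += size
    if current:
        chunks.append("".join(current))
    return chunks


parts = split_utf8("日本語テキスト", 7)
print(parts)  # ['日本', '語テ', 'キス', 'ト']
```

Each chunk can then be passed to the QR generator separately, with `max_bytes` set to the Byte mode capacity of the chosen version and error correction level.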
Scanning and Decoding
How scanners handle Unicode
Modern QR code scanners (smartphone cameras, Google Lens, dedicated apps) typically:
- Detect Byte mode data
- Check for ECI indicator (if present, use declared encoding)
- If no ECI, attempt UTF-8 decoding
- If UTF-8 fails, fall back to ISO-8859-1 or system locale
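The decoding steps listed above can be sketched in a few lines. This is an illustration of the general strategy, not the logic of any particular scanner; `decode_payload` and the `ECI_CODECS` mapping are hypothetical names, and the fallback here is fixed to ISO-8859-1 rather than a system locale.

```python
# ECI assignment numbers mapped to Python codec names.
ECI_CODECS = {3: "iso-8859-1", 20: "shift_jis", 25: "utf-16-be", 26: "utf-8"}


def decode_payload(data: bytes, eci=None) -> str:
    """Decode Byte mode data, honouring an ECI value when one was declared."""
    if eci is not None:
        codec = ECI_CODECS.get(eci)
        if codec is None:
            raise ValueError(f"unsupported ECI assignment: {eci}")
        return data.decode(codec)
    try:
        # No ECI: try strict UTF-8 first.
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte, so this fallback never fails.
        return data.decode("iso-8859-1")


print(decode_payload("é".encode("utf-8")))            # é
print(decode_payload(b"\xe9"))                        # é (Latin-1 fallback)
print(decode_payload("日".encode("shift_jis"), 20))   # 日
```

Trying UTF-8 first works well because bytes produced by other encodings rarely form valid UTF-8, so a successful strict decode is strong evidence the guess was right.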
Scanner compatibility for Unicode
| Scanner | UTF-8 Support | ECI Support |
|---|---|---|
| iOS Camera | Excellent | Yes |
| Android Camera (Google) | Excellent | Yes |
| Google Lens | Excellent | Yes |
| WeChat (built-in scanner) | Excellent | Yes |
| Dedicated barcode apps | Usually good | Varies |
| Older industrial scanners | May need ECI | Partial |
| ZXing library | Excellent | Yes |
Common scanning issues
| Problem | Cause | Fix |
|---|---|---|
| Garbled text after scan | Scanner assumed Latin-1 instead of UTF-8 | Add ECI indicator or test with different scanner |
| Partial text | QR too small / too much data | Increase QR version or reduce content |
| Emoji not displaying | Scanner decoded correctly but display font lacks emoji | Scanner/OS issue, not QR issue |
| Mixed script text broken | BiDi rendering issue in display app | Not a QR encoding problem |
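The "garbled text" row is the classic mojibake pattern: UTF-8 bytes displayed as if they were Latin-1. A short sketch (hypothetical helper names, and assuming the scanner used ISO-8859-1 rather than a variant such as Windows-1252) reproduces the symptom and reverses it, which is handy when diagnosing a scan result:

```python
def misread_as_latin1(text: str) -> str:
    """Simulate a scanner that decodes UTF-8 bytes as ISO-8859-1."""
    return text.encode("utf-8").decode("iso-8859-1")


def repair(garbled: str) -> str:
    """Undo the misreading if possible; otherwise return the input unchanged."""
    try:
        return garbled.encode("iso-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return garbled


print(misread_as_latin1("café"))  # cafÃ©
print(repair("cafÃ©"))            # café
```

If scanned text shows pairs like "Ã©" where an accented letter should be, the QR code itself is usually fine and the scanner chose the wrong decoding.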
Use Cases for Unicode QR Codes
| Use Case | Content | Script |
|---|---|---|
| Restaurant menu | Menu items in local language | CJK, Thai, Arabic, etc. |
| Business card (vCard) | Name, company in native script | Any |
| Wi-Fi login | SSID with Unicode characters | Any |
| Product label | Multilingual product info | Multiple scripts |
| Event ticket | Attendee name in native script | Any |
| Cryptocurrency | Address + memo in local language | Any |
| Tourism | Landmark description in visitor's language | Multiple |
Wi-Fi QR code with Unicode SSID
The Wi-Fi QR code format supports Unicode SSIDs:
WIFI:T:WPA;S:MyNetwork;P:password123;;
If the SSID contains Unicode characters, they are encoded as UTF-8 bytes in Byte mode. Most smartphone Wi-Fi QR scanners handle this correctly.
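When building the string programmatically, SSIDs and passwords that contain the format's own delimiters need escaping. The sketch below follows the widely used convention of backslash-escaping `\`, `;`, `,`, `:` and `"`; that convention and the helper names are assumptions here, so confirm them against the scanners you target.

```python
def escape_wifi_field(value: str) -> str:
    """Backslash-escape characters that act as delimiters in the WIFI: format."""
    out = []
    for ch in value:
        if ch in '\\;,:"':
            out.append("\\")
        out.append(ch)
    return "".join(out)


def wifi_payload(ssid: str, password: str, auth: str = "WPA") -> str:
    """Build the text to encode in the QR code (as UTF-8, in Byte mode)."""
    return f"WIFI:T:{auth};S:{escape_wifi_field(ssid)};P:{escape_wifi_field(password)};;"


print(wifi_payload("MyNetwork", "password123"))
# WIFI:T:WPA;S:MyNetwork;P:password123;;
print(wifi_payload("Café;Bar", "p:w"))
# WIFI:T:WPA;S:Café\;Bar;P:p\:w;;
```

The resulting string can be passed unchanged to a QR generator; non-ASCII characters in the SSID are left as-is and become UTF-8 bytes at encoding time.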
Key Takeaways
- QR codes store Unicode text via Byte mode with UTF-8 encoding. Any Unicode character — CJK, Arabic, emoji, and more — can be encoded in a QR code.
- Capacity decreases with character complexity: ASCII uses 1 byte per character, CJK uses 3, and emoji use 4. Plan your QR version and error correction level accordingly.
- ECI indicators formally declare UTF-8 encoding but are optional in practice. Most modern scanners assume UTF-8 by default.
- For maximum compatibility, keep QR content short, use error correction level M, and test with multiple scanners (iOS, Android, dedicated apps).
- When Unicode content exceeds QR capacity, use a URL shortener to redirect to the full content rather than trying to encode everything directly.