Unicode in Email
Email evolved from ASCII-only systems, and supporting Unicode in email subjects, bodies, and addresses requires encoding schemes like Quoted-Printable, Base64, and the newer Internationalized Email (EAI) standard. This guide explains how Unicode works in email, covering MIME encoding, emoji in subjects, and international email addresses.
Email is one of the oldest Internet protocols, and its relationship with Unicode is a story of gradual evolution from pure ASCII to full international support. The original email standards (RFC 822, 1982) assumed all text would be 7-bit ASCII -- no accents, no CJK characters, no emoji. Over four decades, a series of RFCs have layered Unicode support on top of this ASCII foundation, and understanding these layers is essential for anyone building email systems today.
The ASCII Foundation
The original email system, defined by RFC 822 (1982) and later RFC 2822 (2001), was designed for 7-bit ASCII only. This means:
- Headers (From, To, Subject) could only contain ASCII characters (code points 0-127)
- Body could only contain lines of ASCII text, each no longer than 998 characters
- Email addresses could only use ASCII in both the local part and domain
Any byte with the high bit set (values 128-255) was technically illegal in an email message. This was fine for English but completely inadequate for the rest of the world.
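The two constraints above (7-bit bytes only, lines of at most 998 characters) can be checked mechanically. The following is a minimal sketch; the helper name `is_rfc822_clean` is hypothetical and not part of any library.

```python
def is_rfc822_clean(raw: bytes, max_line: int = 998) -> bool:
    """Return True if raw is pure 7-bit ASCII and no line exceeds max_line."""
    # Any byte above 127 has the high bit set and is not allowed.
    if any(byte > 127 for byte in raw):
        return False
    # Lines are separated by CRLF on the wire; the limit excludes the CRLF.
    return all(len(line) <= max_line for line in raw.split(b"\r\n"))


print(is_rfc822_clean(b"Subject: Hello\r\n\r\nPlain text\r\n"))  # True
print(is_rfc822_clean("Subject: Café\r\n".encode("utf-8")))      # False
```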
MIME: The First Unicode Layer (1996)
MIME (Multipurpose Internet Mail Extensions), defined in RFC 2045-2049, added support for non-ASCII content in the email body by introducing:
Content-Type and Charset
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: base64
The charset parameter tells the email client which encoding to use when interpreting the body bytes. Common values include:
| Charset | Description | Status |
|---|---|---|
| us-ascii | 7-bit ASCII only | Default if not specified |
| iso-8859-1 | Latin-1 (Western Europe) | Legacy, common in old emails |
| windows-1252 | Windows Latin (superset of Latin-1) | Legacy, very common |
| iso-2022-jp | Japanese (7-bit compatible) | Legacy, still used in Japan |
| utf-8 | Full Unicode | Recommended for all new email |
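To see why the declared charset matters, the short sketch below decodes the same UTF-8 bytes under three of the charsets from the table. Only the first result is the intended text; the sample word is an illustration, not taken from a real message.

```python
raw = "Café".encode("utf-8")  # b'Caf\xc3\xa9'

print(raw.decode("utf-8"))         # Café
print(raw.decode("iso-8859-1"))    # CafÃ©
print(raw.decode("windows-1252"))  # CafÃ©

# us-ascii cannot represent bytes above 127 at all.
try:
    raw.decode("us-ascii")
except UnicodeDecodeError as exc:
    print("us-ascii failed:", exc.reason)
```

A client that guesses the wrong charset produces exactly this kind of garbled output, which is why an explicit `charset="utf-8"` is worth declaring.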
Content-Transfer-Encoding
Since the email transport layer is 7-bit, MIME provides encoding schemes to safely transmit 8-bit content:
| Encoding | Description | Overhead | Best For |
|---|---|---|---|
| 7bit | No encoding (ASCII only) | 0% | Plain ASCII text |
| quoted-printable | Encodes non-ASCII as =XX | Low for mostly-ASCII | European languages |
| base64 | Encodes all bytes as ASCII chars | ~33% | CJK, binary, attachments |
| 8bit | Raw 8-bit (requires SMTP extension) | 0% | Modern servers only |
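The overhead column can be measured directly. This sketch uses the standard `quopri` and `base64` modules to compare both encodings on a mostly-ASCII sentence and on a Japanese phrase; the function name `overhead` is hypothetical, and the figures ignore the line breaks that real MIME bodies insert every 76 characters.

```python
import base64
import quopri


def overhead(text: str) -> dict[str, float]:
    """Return the size increase of each encoding relative to raw UTF-8."""
    raw = text.encode("utf-8")
    qp = quopri.encodestring(raw)
    b64 = base64.b64encode(raw)
    return {
        "quoted-printable": len(qp) / len(raw) - 1,
        "base64": len(b64) / len(raw) - 1,
    }


for sample in ("Café is a French word.", "日本語のメール"):
    result = overhead(sample)
    print(sample, {name: f"{value:.0%}" for name, value in result.items()})
```

Quoted-printable wins when only a few bytes need escaping, while base64 wins when nearly every byte is non-ASCII, because quoted-printable triples the size of each escaped byte.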
Quoted-Printable Example
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
Caf=C3=A9 is a French word.
Gr=C3=BC=C3=9Fe from Germany.
The sequence =C3=A9 represents the two UTF-8 bytes 0xC3 0xA9, which decode to é (U+00E9). Each non-ASCII byte is encoded as = followed by two hex digits.
Base64 Example
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: base64
5pel5pys6Kqe44Gu44Oh44O844Or
This base64 string decodes to the UTF-8 bytes for "日本語のメール" (Japanese for "Japanese email").
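Both example bodies can be decoded with Python's standard `email` package, which reads the Content-Type and Content-Transfer-Encoding headers and reverses the two layers in order. This is a sketch: the helper `body_text` and the two sample messages are assembled here for illustration.

```python
from email import message_from_string, policy

b64_message = (
    "MIME-Version: 1.0\n"
    'Content-Type: text/plain; charset="utf-8"\n'
    "Content-Transfer-Encoding: base64\n"
    "\n"
    "5pel5pys6Kqe44Gu44Oh44O844Or\n"
)

qp_message = (
    "MIME-Version: 1.0\n"
    'Content-Type: text/plain; charset="utf-8"\n'
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    "Caf=C3=A9 is a French word.\n"
)


def body_text(raw: str) -> str:
    """Undo the transfer encoding, then decode the bytes with the charset."""
    msg = message_from_string(raw, policy=policy.default)
    return msg.get_content()


print(body_text(b64_message))
print(body_text(qp_message).strip())
```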
Encoded Words: Unicode in Headers (RFC 2047)
MIME solved the body encoding problem, but headers (Subject, From display name, etc.) remained ASCII-only. RFC 2047 (1996) introduced encoded words to embed non-ASCII text in headers.
Syntax
=?charset?encoding?encoded_text?=
Where encoding is either B (base64) or Q (quoted-printable-like).
Examples
Subject: =?utf-8?B?5pel5pys6Kqe44Gu44Oh44O844Or?=
This decodes to: 日本語のメール (Japanese email)
Subject: =?utf-8?Q?Caf=C3=A9_Menu?=
This decodes to: Café Menu
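The syntax is simple enough to build by hand, which can help when debugging. The sketch below reproduces the two Subject values shown above; `encode_word_b` and `encode_word_q` are hypothetical helpers, and they deliberately ignore the length limit on encoded words. The Q variant escapes every byte that is not an ASCII letter or digit, which is stricter than required but always safe.

```python
import base64


def encode_word_b(text: str, charset: str = "utf-8") -> str:
    """Build a B (base64) encoded word."""
    payload = base64.b64encode(text.encode(charset)).decode("ascii")
    return f"=?{charset}?B?{payload}?="


def encode_word_q(text: str, charset: str = "utf-8") -> str:
    """Build a Q encoded word: space becomes _, other bytes become =XX."""
    out = []
    for byte in text.encode(charset):
        ch = chr(byte)
        if ch == " ":
            out.append("_")
        elif ch.isascii() and ch.isalnum():
            out.append(ch)
        else:
            out.append(f"={byte:02X}")
    return f"=?{charset}?Q?{''.join(out)}?="


print(encode_word_b("日本語のメール"))  # =?utf-8?B?5pel5pys6Kqe44Gu44Oh44O844Or?=
print(encode_word_q("Café Menu"))      # =?utf-8?Q?Caf=C3=A9_Menu?=
```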
RFC 2047 Rules
- Encoded words can appear in Subject, in comments, and in the display-name portion of address headers such as From and To.
- They cannot appear in the address portion of From/To (the user@domain part).
- Multiple encoded words separated by whitespace are concatenated after decoding.
- Maximum encoded word length is 75 characters. Longer strings must be split across multiple encoded words.
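The splitting and concatenation rules can be illustrated together. The following sketch splits a long string into several base64 encoded words of at most 75 characters each, never cutting a character in half, and then confirms that the standard library decodes the sequence back to the original text. `encode_words` is a hypothetical helper written for this example; it is not how any particular library implements the split.

```python
import base64
from email.header import decode_header, make_header

MAX_WORD = 75  # maximum length of one encoded word


def encode_words(text: str, charset: str = "utf-8") -> list[str]:
    """Split text into B-encoded words that each fit within MAX_WORD."""
    chrome = len(f"=?{charset}?B??=")            # delimiters around the payload
    max_bytes = (MAX_WORD - chrome) // 4 * 3     # raw bytes that fit per word
    chunks, current = [], b""
    for ch in text:
        encoded = ch.encode(charset)
        # Start a new word rather than splitting a multi-byte character.
        if current and len(current) + len(encoded) > max_bytes:
            chunks.append(current)
            current = b""
        current += encoded
    if current:
        chunks.append(current)
    return [
        f"=?{charset}?B?{base64.b64encode(chunk).decode('ascii')}?="
        for chunk in chunks
    ]


subject = "日本語のメール " * 7 + "終わり"
words = encode_words(subject)
header_value = " ".join(words)  # whitespace between words is dropped on decode

print(len(words), max(len(word) for word in words))
print(str(make_header(decode_header(header_value))) == subject)
```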
Generating Encoded Headers in Code
from email.header import Header
# Python handles RFC 2047 encoding automatically
subject = Header("Caf\u00e9 Menu \u2014 \u65e5\u672c\u8a9e", "utf-8")
print(subject.encode())
# =?utf-8?b?Q2Fmw6kgTWVudSDigJQg5pel5pys6Kqe?=
from email.mime.text import MIMEText
msg = MIMEText("\u3053\u3093\u306b\u3061\u306f\u4e16\u754c", "plain", "utf-8")
msg["Subject"] = Header("\u65e5\u672c\u8a9e\u306e\u30e1\u30fc\u30eb", "utf-8")
msg["From"] = "=?utf-8?b?5bGx55Sw5aSq6YOO?= <[email protected]>"
msg["To"] = "[email protected]"
print(msg.as_string())
UTF-8 Headers: The Modern Approach (RFC 6532)
RFC 2047 encoded words are ugly, hard to parse, and limited to 75 characters per encoded word. RFC 6532 (2012) introduced a cleaner approach: allow raw UTF-8 bytes directly in email headers.
Before (RFC 2047)
From: =?utf-8?b?5bGx55Sw5aSq6YOO?= <[email protected]>
Subject: =?utf-8?b?5pel5pys6Kqe44Gu44Oh44O844Or?=
After (RFC 6532)
From: 山田太郎 <[email protected]>
Subject: 日本語のメール
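In Python's `email` package the two styles correspond to two serialization policies. The sketch below serializes one message both ways; the address `taro@example.com` is a placeholder invented for this example.

```python
from email import policy
from email.headerregistry import Address
from email.message import EmailMessage

msg = EmailMessage()
msg["From"] = Address("山田太郎", "taro", "example.com")  # placeholder address
msg["Subject"] = "日本語のメール"
msg.set_content("Hello")

# Default policy (utf8=False): non-ASCII header text becomes encoded words.
legacy = msg.as_bytes()

# SMTPUTF8 policy (utf8=True): header text is written as raw UTF-8.
modern = msg.as_bytes(policy=policy.SMTPUTF8)

print(legacy.decode("ascii"))
print(modern.decode("utf-8"))
```

The first form is safe to hand to any server; the second is only valid on a connection where SMTPUTF8 has been negotiated.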
Requirements
RFC 6532 requires:
- The sending server supports the SMTPUTF8 SMTP extension (RFC 6531)
- The receiving server also supports SMTPUTF8
- Both servers negotiate UTF-8 support during the SMTP handshake
EHLO sender.example.com
250-receiver.example.com
250-SMTPUTF8 <-- UTF-8 support advertised
250 OK
MAIL FROM:<[email protected]> SMTPUTF8
250 OK
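A client decides whether it may use UTF-8 headers by inspecting the keywords in the EHLO reply. The sketch below parses a reply by hand to show the idea; `advertised_extensions` and the sample reply are hypothetical.

```python
def advertised_extensions(ehlo_reply: str) -> set[str]:
    """Collect the extension keywords from a multi-line EHLO reply."""
    extensions = set()
    lines = ehlo_reply.splitlines()
    for line in lines[1:]:  # the first line is the server greeting
        if line.startswith("250"):
            parts = line[4:].split()  # skip the code and the '-' or ' '
            if parts:
                extensions.add(parts[0].upper())
    return extensions


reply = (
    "250-receiver.example.com\r\n"
    "250-8BITMIME\r\n"
    "250-SMTPUTF8\r\n"
    "250 SIZE 35882577"
)

print("SMTPUTF8" in advertised_extensions(reply))  # True
```

With a live connection, `smtplib.SMTP.has_extn("smtputf8")` performs the same lookup after `ehlo()` has been called.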
Adoption Status
As of 2025, SMTPUTF8 support is widespread among major providers:
| Provider | SMTPUTF8 Support |
|---|---|
| Gmail | Yes (sending and receiving) |
| Outlook.com | Yes (receiving), partial (sending) |
| Yahoo Mail | Yes (receiving) |
| Apple Mail (iCloud) | Yes |
| Fastmail | Yes |
| Postfix | Yes (since 3.0) |
| Exim | Yes (since 4.86) |
However, many smaller mail servers and corporate mail systems still do not support SMTPUTF8. Email software should fall back to RFC 2047 encoding when SMTPUTF8 is not available.
Internationalized Email Addresses (EAI)
The most ambitious Unicode-in-email standard is EAI (Email Address Internationalization), defined in RFC 6530-6533. EAI allows Unicode characters in both the local part and the domain part of email addresses.
Before EAI
[email protected] -- ASCII only
[email protected] -- ASCII only
With EAI
太郎@日本.jp -- Japanese local part + IDN domain
почта@пример.рф -- Russian local part + Russian domain
user@äöü.example -- ASCII local part + Unicode domain
How It Works
| Part | Standard | Encoding |
|---|---|---|
| Local part | RFC 6531 | Raw UTF-8 (requires SMTPUTF8) |
| Domain part | IDN (RFC 5891) | Punycode (xn--) in DNS, displayed as Unicode |
The domain part uses Internationalized Domain Names (IDN), which have been supported since 2003. IDN domains are stored in DNS as Punycode (ASCII-compatible encoding):
| Display Form | Punycode (DNS) |
|---|---|
| 日本.jp | xn--wgv71a.jp |
| пример.рф | xn--e1afmkfd.xn--p1ai |
| äöü.example | xn--4ca0bs.example |
| 한국.kr | xn--3e0b707e.kr |
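The conversions in the table can be reproduced with Python's built-in `idna` codec. This is a sketch with hypothetical helper names; note that the built-in codec follows the older IDNA 2003 rules (RFC 3490), so results may differ from IDNA 2008 for some edge-case domains.

```python
def to_dns_form(domain: str) -> str:
    """Convert a Unicode domain to its ASCII (Punycode) form."""
    return domain.encode("idna").decode("ascii")


def to_display_form(domain: str) -> str:
    """Convert an ASCII (Punycode) domain back to Unicode."""
    return domain.encode("ascii").decode("idna")


for name in ("日本.jp", "пример.рф", "äöü.example", "한국.kr"):
    print(name, "->", to_dns_form(name))

print(to_display_form("xn--wgv71a.jp"))  # 日本.jp
```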
Sending Email to EAI Addresses
import smtplib
from email.message import EmailMessage
msg = EmailMessage()
msg["From"] = "[email protected]"
msg["To"] = "\u592a\u90ce@\u65e5\u672c.jp"
msg["Subject"] = "Test EAI"
msg.set_content("Hello from EAI!")
with smtplib.SMTP("smtp.example.com", 587) as server:
    server.starttls()
    server.login("user", "password")
    # Use send_message which handles SMTPUTF8 automatically
    server.send_message(msg)
EAI Challenges
Despite the standards being in place, EAI adoption faces practical challenges:
- Fallback: If the receiving server does not support SMTPUTF8, the message bounces. There is no graceful ASCII fallback for the local part.
- Web forms: Many websites still validate email addresses with ASCII-only regex patterns like [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}, rejecting valid internationalized addresses.
- Database storage: Email address columns must support UTF-8, and unique constraints should account for Unicode normalization.
- Display: Not all email clients render internationalized addresses correctly.
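The database storage point is easy to overlook: the same visible address can be stored as two different code point sequences. The sketch below normalizes addresses to NFC before comparison. `canonical_address` is a hypothetical helper, and whether to normalize the local part at all is a policy choice assumed here for illustration.

```python
import unicodedata


def canonical_address(address: str) -> str:
    """Return a form suitable for uniqueness checks (NFC, lowercase domain)."""
    local, _, domain = address.rpartition("@")
    local = unicodedata.normalize("NFC", local)  # case is left untouched
    domain = unicodedata.normalize("NFC", domain).lower()
    return f"{local}@{domain}"


composed = "jos\u00e9@example.com"     # é as one code point
decomposed = "jose\u0301@example.com"  # e followed by a combining accent

print(composed == decomposed)                                        # False
print(canonical_address(composed) == canonical_address(decomposed))  # True
```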
Emoji in Email
Emoji in email bodies work through standard MIME encoding (UTF-8 + base64 or quoted-printable). Emoji in subject lines work via RFC 2047 encoded words or RFC 6532 raw UTF-8.
Subject Line Emoji
from email.header import Header
# Emoji in subject (RFC 2047 fallback)
subject = Header("\U0001f680 Launch Alert!", "utf-8")
# Encodes to: =?utf-8?b?8J+agCBMYXVuY2ggQWxlcnQh?=
Rendering Differences
Emoji in email subjects render differently across clients:
| Client | Emoji Display |
|---|---|
| Gmail (web) | Google's emoji set (Noto Color Emoji) |
| Apple Mail | Apple's emoji set |
| Outlook (desktop) | Segoe UI Emoji (Windows), sometimes black and white |
| Outlook (web) | Platform-dependent |
| Thunderbird | Platform-dependent |
Some email clients render emoji as color images; others show them in black and white or as empty boxes. For marketing emails, test emoji display across target clients before relying on them.
HTML Email and Unicode
HTML emails add another encoding layer. The HTML document within the MIME part has its own charset declaration:
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
</head>
<body>
<p>Bonjour de Paris! 🇫🇷</p>
<p>Price: €19.99</p>
</body>
</html>
The MIME header charset and the HTML meta charset should match. If they disagree, behavior is undefined and client-dependent. Always use UTF-8 for both.
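One way to keep the two declarations aligned is to let the library set the MIME charset while the HTML itself declares UTF-8. The sketch below builds a message with a plain text part and an HTML alternative, then lists the charset of each text part; the subject and body strings are sample values.

```python
from email.message import EmailMessage

html_body = """<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
</head>
<body>
<p>Bonjour de Paris! 🇫🇷</p>
<p>Price: €19.99</p>
</body>
</html>
"""

msg = EmailMessage()
msg["Subject"] = "Bonjour"
msg.set_content("Bonjour de Paris! 🇫🇷\nPrice: €19.99")  # text/plain part
msg.add_alternative(html_body, subtype="html")            # text/html part

# Both text parts should report utf-8 as their MIME charset.
charsets = [
    part.get_content_charset()
    for part in msg.walk()
    if part.get_content_maintype() == "text"
]
print(charsets)
```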
HTML Entities vs Direct Unicode
You can represent Unicode characters in HTML email three ways:
| Method | Example | Result |
|---|---|---|
| Direct UTF-8 | é | é |
| Numeric entity | &#233; | é |
| Named entity | &eacute; | é |
Direct UTF-8 is preferred because:
- It is more compact
- It works in plain text parts as well
- Some email clients have incomplete HTML entity support
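The equivalence of these forms can be confirmed with the standard `html` module, which resolves entities the way a compliant HTML parser would. This sketch also includes the hexadecimal numeric form as an extra illustration.

```python
import html

forms = ["é", "&#233;", "&#xE9;", "&eacute;"]
decoded = [html.unescape(form) for form in forms]

print(decoded)
print(all(text == "\u00e9" for text in decoded))  # True
```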
Best Practices for Email Developers
1. Always Use UTF-8
from email.mime.text import MIMEText
# Always specify UTF-8
msg = MIMEText(body_text, "plain", "utf-8")
2. Handle RFC 2047 Decoding
When reading email, decode RFC 2047 encoded words:
from email.header import decode_header
raw_subject = "=?utf-8?B?5pel5pys6Kqe44Gu44Oh44O844Or?="
parts = decode_header(raw_subject)
subject = ""
for data, charset in parts:
    if isinstance(data, bytes):
        subject += data.decode(charset or "utf-8")
    else:
        subject += data
print(subject)  # 日本語のメール
3. Validate Email Addresses Properly
Do not use ASCII-only regex for email validation. At minimum, allow UTF-8 in the domain part (IDN). For full EAI support, allow UTF-8 in the local part as well:
from email.utils import parseaddr

def validate_email(address: str) -> bool:
    # Parse the address, discarding any display name
    _name, addr = parseaddr(address)
    if not addr or "@" not in addr:
        return False
    local, domain = addr.rsplit("@", 1)
    # Local part: at least 1 character (Unicode is allowed for EAI)
    if not local:
        return False
    # Domain: non-empty, IDN or ASCII, with no empty labels
    if not domain or ".." in domain:
        return False
    return True
4. Test with International Content
Always test your email system with:
- Subject: Mix of ASCII, accented Latin, CJK, and emoji
- Body: Multi-script content (Latin + CJK + Arabic)
- Sender name: Non-ASCII display names
- Recipient: IDN domain addresses at minimum
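A round-trip test is one way to exercise this checklist without a mail server: build a message, serialize it, parse the bytes back, and compare. The sketch below uses placeholder addresses and sample strings, and passes the recipient's IDN domain in its Punycode form.

```python
from email import message_from_bytes, policy
from email.headerregistry import Address
from email.message import EmailMessage

SUBJECT = "Launch: Café-日本語-🚀"
BODY = "Hello, 世界! مرحبا"
IDN_DOMAIN = "日本.jp".encode("idna").decode("ascii")  # xn--wgv71a.jp


def build_test_message() -> EmailMessage:
    """Create a message covering subject, body, sender name, and IDN domain."""
    msg = EmailMessage()
    msg["From"] = Address("山田太郎", "sender", "example.com")  # placeholder
    msg["To"] = Address("Recipient", "user", IDN_DOMAIN)        # placeholder
    msg["Subject"] = SUBJECT
    msg.set_content(BODY)
    return msg


raw = build_test_message().as_bytes()
parsed = message_from_bytes(raw, policy=policy.default)

print(parsed["Subject"] == SUBJECT)
print(parsed["From"].addresses[0].display_name)
print(parsed.get_content().rstrip("\n") == BODY)
```

This only verifies local encoding and decoding. Delivery through real servers and rendering in real clients still need to be tested separately.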
5. Fall Back Gracefully
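import smtplib
from email.utils import formataddr, parseaddr

def to_ascii_address(header_value):
    # Assumes one address per header. formataddr applies RFC 2047 to the
    # display name; the IDN domain is converted to Punycode.
    name, addr = parseaddr(header_value)
    local, _, domain = addr.rpartition("@")
    local.encode("ascii")  # a non-ASCII local part has no ASCII fallback
    return formataddr((name, local + "@" + domain.encode("idna").decode("ascii")))

def send_email(msg):
    with smtplib.SMTP("smtp.example.com", 587) as server:
        server.starttls()
        server.login("user", "password")
        try:
            # send_message uses SMTPUTF8 when an address requires it
            server.send_message(msg)
        except smtplib.SMTPNotSupportedError:
            # Server lacks SMTPUTF8: retry with ASCII-only addresses.
            # Encoded words cannot be used inside the address itself, so
            # this raises UnicodeEncodeError if a local part is non-ASCII.
            for header in ("From", "To"):
                if msg[header]:
                    value = to_ascii_address(str(msg[header]))
                    del msg[header]
                    msg[header] = value
            server.send_message(msg)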
Timeline of Unicode in Email Standards
| Year | RFC | What It Added |
|---|---|---|
| 1982 | RFC 822 | Original email format (ASCII only) |
| 1996 | RFC 2045-2049 | MIME: charset, base64, quoted-printable |
| 1996 | RFC 2047 | Encoded words in headers |
| 2001 | RFC 2822 | Updated email format (still ASCII headers) |
| 2003 | RFC 3490 | Internationalized Domain Names (IDN) |
| 2008 | RFC 5321/5322 | Current SMTP and email format standards |
| 2012 | RFC 6530-6533 | EAI: full UTF-8 in SMTP, headers, addresses |
| 2013 | RFC 6855-6858 | UTF-8 support in IMAP and POP3 |
The journey from ASCII-only email to full Unicode support took 30 years. Today, the standards are complete, but adoption continues to be a work in progress -- especially for internationalized email addresses, which remain the final frontier of Unicode in email.