XML Character Reference
The XML version of a numeric character reference: &#10003; or &#x2713; (both produce ✓). XML has only 5 named entities (&amp; &lt; &gt; &quot; &apos;), whereas HTML5 has 2,231.
What Are XML Character References?
XML character references are escape sequences that represent Unicode characters by code point within XML documents. The XML specification defines two forms:
- Decimal: &#N; (e.g., &#169; for ©)
- Hexadecimal: &#xH; (e.g., &#xA9; for ©)
Unlike HTML, XML does not support named references beyond the five predefined entities (&amp;, &lt;, &gt;, &quot;, &apos;). All other characters must be referenced by number or defined as custom entities in the document's DTD.
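The two forms differ only in how the code point is written. A minimal sketch of generating both for a given character; the helper name char_refs is hypothetical, chosen for illustration:

```python
def char_refs(ch: str) -> tuple[str, str]:
    """Return the (decimal, hexadecimal) XML character references for one character."""
    if len(ch) != 1:
        raise ValueError("expected exactly one character")
    cp = ord(ch)  # Unicode code point of the character
    # The 'x' marker must be lowercase in XML; the hex digits may be either case.
    return f"&#{cp};", f"&#x{cp:X};"


print(char_refs("©"))   # ('&#169;', '&#xA9;')
print(char_refs("😀"))  # ('&#128512;', '&#x1F600;')
```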
XML vs. HTML Character References
HTML uses the same numeric reference syntax as XML but extends it with over two thousand named references defined in the HTML specification. XML keeps a stricter, self-contained model: a parser needs no external lookup table to handle character references, only the rules for decimal and hex integers.
<!-- Valid XML character references -->
&#65; A
&#x41; A
&#169; ©
&#xA9; ©
&#x1F600; 😀
<!-- Named references in XML: only these 5 are built-in -->
&amp; &
&lt; <
&gt; >
&quot; "
&apos; '
<!-- All others require a DTD declaration, e.g.: -->
<!DOCTYPE doc [
<!ENTITY copy "&#169;">
]>
<doc>Copyright &copy; 2024</doc>
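The effect of the DTD declaration can be checked with Python's standard xml.etree.ElementTree module. The sketch below assumes the default expat-based parser, which expands entities declared in the internal DTD subset; the helper parse_text and the two sample strings are hypothetical names used only for this illustration:

```python
import xml.etree.ElementTree as ET

# The same document body, once without and once with an internal DTD subset.
WITHOUT_DTD = "<doc>Copyright &copy; 2024</doc>"
WITH_DTD = (
    '<!DOCTYPE doc [<!ENTITY copy "&#169;">]>'
    "<doc>Copyright &copy; 2024</doc>"
)


def parse_text(xml_str: str) -> str:
    """Return the root element's text, or a short marker if parsing fails."""
    try:
        return ET.fromstring(xml_str).text
    except ET.ParseError as exc:
        return f"ParseError: {exc}"


print(parse_text(WITHOUT_DTD))  # a ParseError message: &copy; is undefined
print(parse_text(WITH_DTD))     # Copyright © 2024
```

Parsers hardened against entity expansion may be configured to reject DTDs entirely, in which case the second call would fail as well.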
Well-Formedness Rules
For XML to be well-formed, character references must:
1. Reference a legal XML character — code points U+0009, U+000A, U+000D, U+0020–U+D7FF, U+E000–U+FFFD, and U+10000–U+10FFFF.
2. End with a semicolon ; — the semicolon is always mandatory in XML (no legacy exceptions).
3. Not reference surrogates (U+D800–U+DFFF) or the null character (U+0000).
An XML parser that encounters an invalid reference is required to raise a fatal error and halt parsing.
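The rules above can be sketched in Python. The function is_legal_xml_char transcribes the legal ranges from rule 1, and is_well_formed asks the standard library parser whether a document is accepted; both names are hypothetical helpers for this example:

```python
import xml.etree.ElementTree as ET


def is_legal_xml_char(cp: int) -> bool:
    """True if the code point is in the legal ranges listed in rule 1."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def is_well_formed(xml_str: str) -> bool:
    """True if ElementTree parses the string without raising ParseError."""
    try:
        ET.fromstring(xml_str)
        return True
    except ET.ParseError:
        return False


# One valid reference, one missing semicolon, one null, one surrogate.
for sample in ("<a>&#65;</a>", "<a>&#65</a>", "<a>&#0;</a>", "<a>&#xD800;</a>"):
    print(sample, is_well_formed(sample))
```

Only the first sample should be accepted; the other three each break one of the rules.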
Encoding Interaction
XML documents declare their encoding in the XML declaration:
<?xml version="1.0" encoding="UTF-8"?>
Character references bypass encoding: &#x20AC; always means U+20AC (€) regardless of the document's encoding. This makes character references useful in legacy encodings that cannot natively represent the desired characters.
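A short Python sketch of both directions, assuming an ASCII or ISO-8859-1 target that has no euro sign. The built-in xmlcharrefreplace error handler writes unencodable characters as decimal references, and the parser resolves a reference to the same code point whatever encoding the document declares:

```python
import xml.etree.ElementTree as ET

# Writing: characters the target encoding cannot hold become references.
text = "Price: €10"
ascii_bytes = text.encode("ascii", errors="xmlcharrefreplace")
print(ascii_bytes)  # b'Price: &#8364;10'

# Reading: the document is declared as ISO-8859-1, which has no euro sign,
# yet the reference still resolves to U+20AC.
latin1_doc = (
    b'<?xml version="1.0" encoding="ISO-8859-1"?>'
    b"<item>Price: &#x20AC;10</item>"
)
print(ET.fromstring(latin1_doc).text)  # Price: €10
```

Note that xmlcharrefreplace only handles characters the codec cannot encode; it does not escape & or <, which still need separate handling.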
Parsing in Python
import xml.etree.ElementTree as ET
xml_str = "<item>Price: &#x20AC;10 &amp; shipping</item>"
root = ET.fromstring(xml_str)
print(root.text) # "Price: €10 & shipping"
# Python's xml module decodes references automatically
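Going the other way, text has to be escaped before it is placed into markup. A minimal sketch using the standard library's xml.sax.saxutils helpers; the sample string raw is made up for illustration:

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

raw = "Q&A: a < b"

# escape() replaces &, < and > with predefined entity references.
print(escape(raw))     # Q&amp;A: a &lt; b

# quoteattr() escapes the value and wraps it in quotes for use as an attribute.
print(quoteattr(raw))  # "Q&amp;A: a &lt; b"

# Round trip: the parser decodes the escaped text back to the original.
print(ET.fromstring(f"<t>{escape(raw)}</t>").text == raw)  # True
```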
SVG and XML-Based Formats
SVG, MathML, XHTML, RSS, and Atom are all XML vocabularies. Character references work identically in all of them:
<!-- SVG text with Unicode arrows -->
<text>&#x2190; Left &#x2192; Right</text>
<!-- Atom feed with special characters -->
<title>Q&amp;A: Unicode &#x2014; Explained</title>
CDATA Sections
Inside a CDATA section (<![CDATA[ ... ]]>), character references are not processed — the text is treated as raw character data. This is useful for embedding code samples in XML:
<code><![CDATA[
if (a < b && c > d) { return "&#169;"; }
]]></code>
<!-- The & and < above are literal characters, and &#169; is not expanded -->
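The difference is easy to observe with Python's xml.etree.ElementTree. In this sketch (the two sample documents are made up for illustration), the same reference text is parsed once inside a CDATA section and once as ordinary character data:

```python
import xml.etree.ElementTree as ET

# Inside CDATA, '<', '&' and the reference-like text are taken literally.
cdata_doc = '<code><![CDATA[if (a < b && c > d) { return "&#169;"; }]]></code>'

# Outside CDATA, the same reference is expanded by the parser.
plain_doc = "<code>&#169;</code>"

print(ET.fromstring(cdata_doc).text)  # if (a < b && c > d) { return "&#169;"; }
print(ET.fromstring(plain_doc).text)  # ©
```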
Quick Facts
| Property | Value |
|---|---|
| Decimal syntax | &#N; |
| Hex syntax | &#xH; (lowercase x required) |
| Named entities (built-in) | Only 5: &amp; &lt; &gt; &quot; &apos; |
| Semicolon | Always mandatory in XML |
| Null character U+0000 | Forbidden |
| Surrogates | Forbidden |
| CDATA sections | References are not expanded inside <![CDATA[...]]> |
| Error handling | Fatal error on invalid reference (unlike HTML's lenient parsing) |