7. International character sets
• Default character set: Unicode
• Characters correspond to numbers.
• Different encodings exist for these numbers.
• On Web servers, HTTP metadata specifies the encoding (as the charset parameter of the so-called MIME media type), e.g.
  HTTP/1.1 ... Content-Type: text/xml; charset=ISO-8859-1
• If the MIME media type is application/xml, then the parser tries to guess the character set from the first few bytes of the document, i.e. from the XML declaration: <?xml ... encoding=...?>
XML-7 J. Teuhola 2013 109
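As a small illustration of reading the charset from HTTP metadata, the sketch below parses the Content-Type value from the slide with Python's standard email.message module. The variable names are chosen here for illustration only; a real HTTP client would normally expose this value through its own API.

```python
from email.message import Message

# Build a header container holding the Content-Type value shown on the slide.
headers = Message()
headers["Content-Type"] = "text/xml; charset=ISO-8859-1"

# The media type and the charset parameter are parsed separately.
media_type = headers.get_content_type()    # "text/xml"
charset = headers.get_content_charset()    # returned in lower case

print(media_type, charset)
```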
Character sets in external entities
• External entities (to be included in a document) may have a different encoding.
• An external parsed entity should start with a text declaration (similar to the XML declaration, but the version may be omitted):
  <?xml encoding="KOI8-R"?> [denoting Russian characters]
• The same holds for external DTD subsets.
ISO character sets
• 16 different character sets (the ISO-8859 family), each with character numbers 0..255:
  – 0..127: Normal ASCII
  – 128..159: Control characters
  – 160..255: Language-specific characters
• Examples:
  – ISO-8859-1 (= Latin-1): Language-specific characters for Danish, Dutch, Finnish, French, German, Spanish, Swedish, ...
  – ISO-8859-2 (= Latin-2): Eastern-European characters
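The practical consequence of this layout is that the same byte above 127 means different characters in different ISO-8859 sets, while bytes 0..127 always decode as ASCII. A minimal sketch using Python's built-in codecs (the byte value 0xF8 is simply an example picked here):

```python
# One byte from the language-specific range 160..255.
raw = bytes([0xF8])

as_latin1 = raw.decode("iso-8859-1")   # Latin-1: "ø" (U+00F8)
as_latin2 = raw.decode("iso-8859-2")   # Latin-2: "ř" (U+0159)

# Bytes in the range 0..127 are plain ASCII in every ISO-8859 set.
ascii_same = b"XML".decode("iso-8859-1") == b"XML".decode("iso-8859-2")

print(as_latin1, as_latin2, ascii_same)
```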
Unicode
• International character set for almost all languages (English, Greek, Cyrillic, Han Chinese, Arabic, Hebrew, Thai, Bengali, ...)
• Unique numbers ('code points') for all characters
• Version 6.0 (2010) contains 109242 graphic characters
• Codes are divided into 17 planes of 65536 code points each = 1114112 code points altogether. The first plane (Basic Multilingual Plane) covers most characters used in practice.
• Five ways of encoding the numbers: UTF-8, UTF-16, UTF-32, UCS-2, UCS-4
• XML parsers are required to understand UTF-8 and UTF-16, but are allowed to understand others, such as ISO-8859-1.
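The plane arithmetic can be checked directly: with 65536 code points per plane, the plane number of a code point is its value divided by 65536. The helper name plane_of below is hypothetical, introduced only for this sketch.

```python
PLANE_SIZE = 65536
PLANE_COUNT = 17

def plane_of(code_point: int) -> int:
    """Return the plane number (0..16) that a code point belongs to."""
    return code_point // PLANE_SIZE

total_code_points = PLANE_COUNT * PLANE_SIZE   # 1114112

greek_alpha_plane = plane_of(ord("α"))         # plane 0, the Basic Multilingual Plane
last_plane = plane_of(total_code_points - 1)   # highest code point, plane 16
```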
Variable-length Unicode encodings
• UTF-8 (UCS Transformation Format, 8-bit):
  – Default for XML processors
  – Characters 0..127 are encoded with 1 byte = ASCII
  – Characters 128..2047 are encoded with 2 bytes
  – Characters 2048..65535 are encoded with 3 bytes
  – Characters 65536..1114111 are encoded with 4 bytes
• UTF-16:
  – Extended from UCS-2 (incl. big-endian/little-endian options)
  – Characters above 65535 are encoded as so-called surrogate pairs: two 16-bit codes from a reserved UCS-2 range, i.e. 32 bits in total.
• UTF-32:
  – Extended from UCS-4; now these two are identical
  – Fixed-length 4-byte codes
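The UTF-8 length ranges listed above can be verified at their boundaries with Python's built-in codec. This is only a check of the table, not an encoder implementation; the helper name utf8_length is made up for this sketch.

```python
def utf8_length(code_point: int) -> int:
    """Number of bytes UTF-8 needs for one code point."""
    return len(chr(code_point).encode("utf-8"))

# First and last code point of each range from the slide.
BOUNDARIES = [0, 127, 128, 2047, 2048, 65535, 65536, 1114111]
lengths = {cp: utf8_length(cp) for cp in BOUNDARIES}

# In UTF-16, the first code point beyond 65535 becomes a surrogate pair
# (two 16-bit codes, 4 bytes in total).
surrogate_pair = chr(65536).encode("utf-16-be")

print(lengths)
print(surrogate_pair.hex())
```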
Obsolete encodings of Unicode
• UCS-2 (Universal Character Set 2):
  – 2-byte unsigned integer 0..65535, e.g. "A" = 00000000 01000001 = 65 (decimal) = #x0041 (hex)
  – Drawbacks: Twice the size of ASCII, not compatible with ASCII, 65536 characters are not enough
  – Versions: big-endian (most significant byte first), little-endian (least significant byte first)
• UCS-4:
  – 4 bytes (32 bits) per char
  – Wasteful for small character sets
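The "A" example and the two byte orders can be reproduced as follows. Python has no codec named UCS-2, so this sketch uses the UTF-16 codecs as a stand-in, which is an assumption that holds only because UTF-16 and UCS-2 agree for characters in the range 0..65535.

```python
# "A" as a 2-byte code, in both byte orders.
big_endian = "A".encode("utf-16-be")      # most significant byte first
little_endian = "A".encode("utf-16-le")   # least significant byte first

# Both byte sequences represent the same number, 65 = 0x0041.
value_be = int.from_bytes(big_endian, "big")
value_le = int.from_bytes(little_endian, "little")

# Not ASCII-compatible: the ASCII encoding of "A" is a single byte.
ascii_bytes = "A".encode("ascii")

print(big_endian.hex(), little_endian.hex(), value_be, ascii_bytes.hex())
```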
Miscellaneous
• Conversion tools between character sets:
  – See e.g.
    http://dataconv.org/apps_unicode_utf8.html
    http://download.oracle.com/javase/1.5.0/docs/tooldocs/solaris/native2ascii.html
• Character sets supported by Java JDK 5.0:
  http://download.oracle.com/javase/1.5.0/docs/guide/intl/encoding.doc.html
• Platform-dependent character sets:
  – Invented by vendors like Microsoft and Apple, e.g. Cp1252 ("Windows ANSI"), MacRoman, MacGreek
  – To be used only within a single system, not in transported data
How does the parser find out the character set?
if external meta-information exists then
    use it
else if the first 4 bytes are "<?xm" (= #x3C3F786D in ASCII) then
    the encoding is (a superset of) ASCII, and the exact encoding can be read from the encoding declaration on the first line (which is pure ASCII and therefore coded identically in UTF-8).
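The two branches above can be sketched in Python as follows. The function name guess_xml_encoding and its parameters are hypothetical. Falling back to UTF-8 when the declaration names no encoding is an assumption based on UTF-8 being the XML default, and cases the slide does not cover (for example UTF-16 input) are simply reported as undecided.

```python
import re

# Matches encoding="..." or encoding='...' inside the XML declaration.
_ENCODING_RE = re.compile(rb'encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']')

def guess_xml_encoding(data: bytes, external_charset=None):
    """Return the encoding name, or None if these two rules cannot decide."""
    # Rule 1: external meta-information (e.g. an HTTP charset) wins.
    if external_charset:
        return external_charset
    # Rule 2: "<?xm" in ASCII means the declaration is readable as ASCII.
    if data[:4] == b"\x3c\x3f\x78\x6d":
        end = data.find(b"?>")
        declaration = data[:end] if end != -1 else data[:200]
        match = _ENCODING_RE.search(declaration)
        if match:
            return match.group(1).decode("ascii")
        return "UTF-8"  # assumption: no encoding declared, use the XML default
    return None  # not covered by the two rules sketched here

print(guess_xml_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>'))
```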
How to type characters outside ASCII
• Character references to numeric values, e.g. Greek α (code point 945 = #x3B1):
  – Either decimal: &#945;
  – or hexadecimal: &#x3B1;
• Character references may be used in element contents, attribute values, comments (where they remain literal text and are not expanded), DTD attribute defaults, DTD entity replacement text.
• Character references may not be used in element and attribute names, processing instruction targets, XML keywords.
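Both reference forms can be derived from a character's code point. In the sketch below the helper names are made up for illustration; the last line uses Python's standard "xmlcharrefreplace" error handler, which produces decimal references for everything outside ASCII.

```python
def decimal_reference(ch: str) -> str:
    """Decimal character reference for a single character."""
    return f"&#{ord(ch)};"

def hex_reference(ch: str) -> str:
    """Hexadecimal character reference for a single character."""
    return f"&#x{ord(ch):X};"

alpha_decimal = decimal_reference("α")
alpha_hex = hex_reference("α")

# Replace every non-ASCII character of a string by a decimal reference.
escaped = "αβγ".encode("ascii", "xmlcharrefreplace").decode("ascii")

print(alpha_decimal, alpha_hex, escaped)
```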
Examples
<αβγ>&#x3B1;&#x3B2;&#x3B3;</αβγ>
  Legal, if α, β and γ can be included natively
<&#x3B1;&#x3B2;&#x3B3;>&#x3B1;&#x3B2;&#x3B3;</&#x3B1;&#x3B2;&#x3B3;>
  Illegal
• Codes, see http://www.unicode.org/charts/
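The legal and illegal cases can be tried out with a real parser. This is a sketch using Python's standard xml.etree.ElementTree, which is just one possible parser; the illegal sample is shortened to a single reference in the element name.

```python
import xml.etree.ElementTree as ET

# References in element content are expanded; the name is typed natively.
legal = "<αβγ>&#945;&#x3B2;&#x3B3;</αβγ>"
element = ET.fromstring(legal)
tag, text = element.tag, element.text

# A character reference inside an element name is not well-formed XML.
illegal = "<&#x3B1;>text</&#x3B1;>"
try:
    ET.fromstring(illegal)
    illegal_rejected = False
except ET.ParseError:
    illegal_rejected = True

print(tag, text, illegal_rejected)
```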
Character entities
• Character references can be given entity names in DTDs, e.g.
  <!ENTITY alpha "&#x3B1;">
  <!ENTITY beta "&#x3B2;">
• Usage: &alpha; &beta;
• Some DTDs contain only entities (file *.ent). They can be included in the actual DTD as external parameter entities.
• Predefined ent-files exist for Latin-1, Greek, etc. and can be included as PUBLIC declarations.
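A minimal sketch of such entity declarations in use, placed in an internal DTD subset so that the example is self-contained. The document element name greek is invented for this illustration, and xml.etree.ElementTree is again only one possible parser.

```python
import xml.etree.ElementTree as ET

# Entity names are declared in the internal DTD subset and used in content.
document = """<!DOCTYPE greek [
  <!ENTITY alpha "&#x3B1;">
  <!ENTITY beta "&#x3B2;">
]>
<greek>&alpha; &beta;</greek>"""

root = ET.fromstring(document)
text = root.text   # the entity references are replaced by the characters

print(text)
```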
Multilingual documents
• An element may have an attribute xml:lang. It specifies the language (not the character set) used within the element.
• The language is useful information to the processor (e.g. a spell-checker).
• Declaration needed in the DTD: <!ATTLIST elem xml:lang NMTOKEN #IMPLIED>
• Language codes of 2-4 letters, defined in ISO-639, altogether 7589 languages (ISO-639-3)
• 2-letter example codes (English = "EN", Finnish = "FI", Swedish = "SV", Greek = "EL", ...)
• Subcodes may be defined for dialects.
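A processor can read xml:lang like any other attribute, keeping in mind that it applies to everything within the element unless a nested element overrides it. The sketch below uses Python's xml.etree.ElementTree, where the xml: prefix appears as a fixed namespace URI; the sample document and the helper name languages are invented for illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the reserved xml: prefix as this namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

document = '<doc xml:lang="en"><p>Hello</p><p xml:lang="fi">Hei</p></doc>'

def languages(element, inherited=None):
    """Yield (tag, language) pairs; children inherit the enclosing language."""
    lang = element.get(XML_LANG, inherited)
    yield element.tag, lang
    for child in element:
        yield from languages(child, lang)

result = list(languages(ET.fromstring(document)))
print(result)
```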