7. International character sets

• Default character set: Unicode
• Characters correspond to numbers (code points).
• Different encodings exist for these numbers.
• On the Web, HTTP metadata specifies the encoding: the charset parameter of the Content-Type header, which also carries the MIME media type, e.g.
    HTTP/1.1 ...
    Content-Type: text/xml; charset=ISO-8859-1
• If the MIME media type is application/xml and no charset parameter is given, the parser tries to determine the character set from the first few bytes of the document, i.e. from the XML declaration: <?xml ... encoding=...?>
XML-7, J. Teuhola, 2013
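A minimal sketch of how the charset parameter could be read from a Content-Type value, using Python's standard email.message module. The function name charset_from_content_type is an illustrative assumption, not part of the slides.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return (media type, charset or None) for a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() lowercases the value and returns None if absent.
    return msg.get_content_type(), msg.get_content_charset()


print(charset_from_content_type("text/xml; charset=ISO-8859-1"))
print(charset_from_content_type("application/xml"))
```

When the charset comes back as None, the parser has to fall back on the XML declaration inside the document.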

Character sets in external entities
• External entities (to be included in a document) may use a different encoding than the document itself.
• An external parsed entity should start with a text declaration (similar to the XML declaration, but the version may be omitted):
    <?xml encoding="KOI8-R"?> [KOI8-R is an encoding for Russian (Cyrillic) characters]
• The same holds for external DTD subsets.

ISO character sets
• The ISO-8859 family: 15 character sets (parts 1 to 16; part 12 was never published), each with character numbers 0..255:
  – 0..127: Normal ASCII
  – 128..159: Control characters
  – 160..255: Language-specific characters
• Examples:
  – ISO-8859-1 (= Latin-1): language-specific characters for Danish, Dutch, Finnish, French, German, Spanish, Swedish, ...
  – ISO-8859-2 (= Latin-2): Eastern-European characters
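The split between the shared ASCII range and the language-specific range can be illustrated with a short sketch. The byte values chosen here are only an example.

```python
# One ASCII byte (0x41) and one byte from the language-specific range (0xF5).
raw = bytes([0x41, 0xF5])

latin1 = raw.decode("iso-8859-1")  # 0xF5 is U+00F5 (o with tilde)
latin2 = raw.decode("iso-8859-2")  # 0xF5 is U+0151 (o with double acute)

# The ASCII byte decodes identically; the high byte does not.
print(latin1[0] == latin2[0], hex(ord(latin1[1])), hex(ord(latin2[1])))
```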

Unicode
• International character set for almost all languages (English, Greek, Cyrillic, Han Chinese, Arabic, Hebrew, Thai, Bengali, ...)
• Unique numbers ("code points") for all characters
• Version 6.0 (2010) contains 109242 graphic characters
• Codes are divided into 17 planes of 65536 code points each = 1114112 code points altogether. The first plane (Basic Multilingual Plane) covers most characters in common use.
• Five ways of encoding the numbers: UTF-8, UTF-16, UTF-32, UCS-2, UCS-4
• XML parsers are required to understand UTF-8 and UTF-16, but are allowed to understand others, such as ISO-8859-1.
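The plane arithmetic can be checked with a few lines of Python. The helper plane_of is an illustrative name; it relies only on the fact that each plane holds 65536 = 2^16 code points.

```python
import sys

PLANES = 17
PLANE_SIZE = 65536  # 2**16 code points per plane


def plane_of(ch):
    """Return the plane number (0..16) of a single character."""
    return ord(ch) >> 16


print(PLANES * PLANE_SIZE)     # size of the whole code space
print(sys.maxunicode + 1)      # the same figure, as Python reports it
print(plane_of("α"), plane_of("\U0001F600"))
```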

Variable-length Unicode encodings
• UTF-8 (UCS Transformation Format 8):
  – Default for XML processors
  – Characters 0..127 are encoded with 1 byte = ASCII
  – Characters 128..2047 are encoded with 2 bytes
  – Characters 2048..65535 are encoded with 3 bytes
  – Characters 65536..1114111 are encoded with 4 bytes
• UTF-16:
  – Extended from UCS-2 (incl. big-endian/little-endian options)
  – Characters above 65535 are encoded as so-called surrogate pairs: two reserved 16-bit codes that together form one 32-bit encoding.
• UTF-32:
  – Extended from UCS-4; now these two are identical
  – Fixed-length 4-byte codes (listed here for comparison)
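The byte ranges and the surrogate-pair construction above can be sketched as follows. The function names are illustrative; the surrogate formula is the standard UTF-16 one (subtract #x10000, then split the remaining 20 bits into two halves of 10 bits).

```python
def utf8_length(cp):
    """Number of bytes UTF-8 uses for code point cp (not a surrogate)."""
    return len(chr(cp).encode("utf-8"))


def surrogate_pair(cp):
    """Split a code point above 65535 into its two 16-bit UTF-16 codes."""
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)    # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # lower 10 bits
    return high, low


for cp in (0x41, 0x3B1, 0x20AC, 0x1F600):
    print(hex(cp), utf8_length(cp))

print([hex(unit) for unit in surrogate_pair(0x1F600)])
```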

Obsolete encodings of Unicode
• UCS-2 (Universal Character Set 2):
  – 2-byte unsigned integer 0..65535, e.g. "A" = 00000000 01000001 (binary) = 65 (decimal) = #x0041 (hex)
  – Drawbacks: twice the size of ASCII, not compatible with ASCII, 65536 characters are not enough
  – Versions: big-endian (most significant byte first), little-endian (least significant byte first)
• UCS-4:
  – 4 bytes (32 bits) per char
  – Wasteful for small character sets
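A small sketch of the two byte orders. Python has no separate UCS-2 codec, so this assumes the UTF-16 codecs as a stand-in; for characters in the range 0..65535 (outside the surrogate area) they produce the same bytes as UCS-2.

```python
text = "A"

big = text.encode("utf-16-be")     # most significant byte first
little = text.encode("utf-16-le")  # least significant byte first
four_bytes = text.encode("utf-32-be")  # UCS-4 / UTF-32 form

print(big.hex(), little.hex(), four_bytes.hex())
print(int.from_bytes(big, "big"))  # the code point as a decimal number
```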

Miscellaneous
• Conversion tools between character sets, see e.g.
  – http://dataconv.org/apps_unicode_utf8.html
  – http://download.oracle.com/javase/1.5.0/docs/tooldocs/solaris/native2ascii.html
• Character sets supported by Java JDK 5.0: http://download.oracle.com/javase/1.5.0/docs/guide/intl/encoding.doc.html
• Platform-dependent character sets:
  – Invented by vendors like Microsoft and Apple, e.g. Cp1252 ("Windows ANSI"), MacRoman, MacGreek
  – To be used only within a single system, not in transported data
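Why platform-dependent character sets are risky in transported data can be seen from a single byte. This sketch uses Python's built-in codecs cp1252 and mac_roman; the byte #x80 is only an example.

```python
raw = b"\x80"

as_cp1252 = raw.decode("cp1252")         # euro sign on Windows
as_macroman = raw.decode("mac_roman")    # A with diaeresis on classic Mac OS
as_latin1 = raw.decode("iso-8859-1")     # a control character in ISO-8859-1

# The same byte yields three different characters.
print(hex(ord(as_cp1252)), hex(ord(as_macroman)), hex(ord(as_latin1)))
```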

How does the parser find out the character set?
  if external meta-information exists then
      use it
  else if the first 4 bytes are "<?xm" (= #x3C3F786D in ASCII) then
      the encoding is (a superset of) ASCII, and the exact encoding can be read from the encoding declaration on the first line (which is pure ASCII and therefore coded identically in UTF-8).
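The procedure above could be sketched roughly as follows. This is a simplified illustration, not a complete detector: the function name guess_xml_encoding and the regular expression are assumptions, the byte-order-mark check is an addition not shown on the slide, and falling back to UTF-8 reflects its role as the XML default.

```python
import re

# Matches the encoding pseudo-attribute of an XML declaration (ASCII only).
_ENCODING_RE = re.compile(
    rb"""^<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def guess_xml_encoding(data, external_charset=None):
    """Guess the encoding of an XML document given as bytes."""
    if external_charset:                  # e.g. the HTTP charset parameter
        return external_charset
    if data[:4] == b"<?xm":               # bytes 3C 3F 78 6D: ASCII-compatible
        match = _ENCODING_RE.match(data)
        if match:
            return match.group(1).decode("ascii")
        return "UTF-8"                    # declaration without encoding
    if data[:2] in (b"\xfe\xff", b"\xff\xfe"):
        return "UTF-16"                   # addition: UTF-16 byte order mark
    return "UTF-8"                        # no declaration at all


doc = b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>'
print(guess_xml_encoding(doc))
print(guess_xml_encoding(doc, external_charset="KOI8-R"))
```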

How to type characters outside ASCII
• Character references to numeric values, e.g. Greek α:
  – Either decimal: &#945;
  – or hexadecimal: &#x3B1;
• Character references may be used in element contents, attribute values, DTD attribute defaults, and DTD entity replacement text. They may also appear in comments, but there they are kept as literal text and not expanded.
• Character references may not be used in element and attribute names, processing instruction targets, or XML keywords.
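A short sketch that builds both forms of reference for a character and checks that a parser expands them in element content and in an attribute value. The helper char_refs and the element name g are illustrative; the parser is Python's standard xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET


def char_refs(ch):
    """Return the decimal and hexadecimal character reference for ch."""
    cp = ord(ch)
    return f"&#{cp};", f"&#x{cp:X};"


decimal_ref, hex_ref = char_refs("α")
print(decimal_ref, hex_ref)

# Both references are expanded to the same character by the parser.
elem = ET.fromstring(f'<g name="{hex_ref}">{decimal_ref}</g>')
print(elem.text == elem.get("name"))
```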

Examples
• <αβγ>&#x3B1;&#x3B2;&#x3B3;</αβγ>
  Legal, if α, β and γ can be included natively (i.e. the document encoding contains them)
• <&#x3B1;&#x3B2;&#x3B3;>&#x3B1;&#x3B2;&#x3B3;</&#x3B1;&#x3B2;&#x3B3;>
  Illegal
• For the codes, see http://www.unicode.org/charts/
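The two examples can be tried against a real parser. This sketch assumes Python's xml.etree.ElementTree; is_well_formed is an illustrative helper name.

```python
import xml.etree.ElementTree as ET


def is_well_formed(doc):
    """Return True if doc parses as XML, False on a parse error."""
    try:
        ET.fromstring(doc)
    except ET.ParseError:
        return False
    return True


legal = "<αβγ>&#x3B1;&#x3B2;&#x3B3;</αβγ>"
illegal = "<&#x3B1;&#x3B2;&#x3B3;>&#x3B1;&#x3B2;&#x3B3;</&#x3B1;&#x3B2;&#x3B3;>"

print(is_well_formed(legal), is_well_formed(illegal))
```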

Character entities
• Character references can be given entity names in DTDs, e.g.
    <!ENTITY alpha "&#x3B1;">
    <!ENTITY beta "&#x3B2;">
• Usage: &alpha; &beta;
• Some DTDs contain only entities (file *.ent). They can be included in the actual DTD as external parameter entities.
• Predefined ent-files exist for Latin-1, Greek, etc. and can be included as PUBLIC declarations.
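A minimal sketch of such entity declarations in use, here placed in an internal DTD subset so that the example is self-contained. The root element name greek is an assumption; the parser is Python's standard xml.etree.ElementTree, which expands internally declared entities.

```python
import xml.etree.ElementTree as ET

DOC = """<!DOCTYPE greek [
  <!ENTITY alpha "&#x3B1;">
  <!ENTITY beta "&#x3B2;">
]>
<greek>&alpha; &beta;</greek>"""

root = ET.fromstring(DOC)
print(root.text)  # the entity references replaced by the Greek letters
```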

Multilingual documents
• An element may have an attribute xml:lang. It specifies the language (not the character set) used within the element.
• The language is useful information to the processor (e.g. a spell-checker).
• In a valid document the attribute must be declared: <!ATTLIST elem xml:lang NMTOKEN #IMPLIED>
• Language codes 2-4 letters, defined in ISO 639, altogether 7589 languages (ISO 639-3)
• 2-letter example codes (English = "EN", Finnish = "FI", Swedish = "SV", Greek = "EL", ...)
• Subcodes may be defined for dialects.
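A sketch of how a processor might determine the language of each element, assuming Python's xml.etree.ElementTree. The sample document and the helper languages are illustrative. An xml:lang value applies to the element's content, including child elements, unless a child overrides it.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def languages(elem, inherited=None):
    """Yield (tag, language) for elem and all its descendants."""
    lang = elem.get(XML_LANG, inherited)
    yield elem.tag, lang
    for child in elem:
        yield from languages(child, lang)


DOC = '<doc xml:lang="en"><p>Hello</p><p xml:lang="fi">Hei</p></doc>'
result = list(languages(ET.fromstring(DOC)))
print(result)
```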
