
Unicode 4.0 In Common Lisp



  1. (defun the-थचऊॉ-old-صﻩطب-fn (aגףפd) (if (eql aגףפd 2) 2 (* aגףפd (the-थचऊॉ-old-صﻩطب-fn (1- aגףפd)))))
     Unicode 4.0 In Common Lisp
     Adoption of Unicode In CLforJava
     Jerry Boetje
     ILC 2005
     boetjeg@cofc.edu

  2. ASCII Legacy
     • In the beginning (1983), there was
       • ASCII (universally recognized)
       • Everything else - mostly 8-bit encodings
         • ISO-8859-x
         • Code Pages (IBM PC)
         • JIS and some Chinese encodings (16 bit)
     • Couldn’t mix encodings
       • Doc in Hebrew, Kanji, and Serbo-Croatian
     CLforJava

  3. Lisp Response
     • Agree on a subset of ASCII that works everywhere (standard char)
     • Add font and bits attributes to characters (later dropped)
     • Fuzzy distinction between types of chars
     • Non-portable method for specifying file encoding
     • Define functions that would work with ASCII

  4. Pretty Good For Its Time

  5. The Rest of the World’s Response
     • Define a uniform encoding for all characters on Earth
     • Deal with the hard issues
       • Collation
       • Line breaks
       • Equivalence
       • Composition
       • etc.
     Unicode

  6. 20 Years Later
     • Globalization requires speaking all languages
     • Many vendor-specific solutions
     • Unicode version 4 has answers to many of the issues evoked by Common Lisp - and then some
     • It’s time to formally integrate Unicode into the Common Lisp Standard
     • But it’s not going to be easy!

  7. Unicode 4 in Brief

  8. Nature of Characters
     • It’s not enough to assign a number to a char
     • Characters are no longer atomic
       • A run of chars may be equivalent to one char
     • Some provide information but not content
       • Direction
       • Formatting

  9. Nature of Characters
     • Never confuse the encoding with an ordering
     • Collation is entirely context-dependent
       • Does ‘o’ come before, after, or the same as ‘ö’?
       • Different if you’re German or Swedish
     • Chars have a rich set of properties
       • Simple - digit?, whitespace?
       • Complex - composition, direction, mirrored?
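The simple and complex properties listed on this slide can be queried directly in Java, the host language of CLforJava. The following is a minimal sketch using only `java.lang.Character`; the class name `CharProperties` and the method `describe` are hypothetical and are not taken from the CLforJava sources.

```java
final class CharProperties {
    /** Summarizes a few Unicode properties of one code point (illustrative helper). */
    static String describe(int codePoint) {
        StringBuilder sb = new StringBuilder(String.format("U+%04X", codePoint));
        if (Character.isDigit(codePoint)) {
            // digit? also works for non-ASCII decimal digits
            sb.append(" digit=").append(Character.digit(codePoint, 10));
        }
        if (Character.isWhitespace(codePoint)) {
            sb.append(" whitespace");
        }
        if (Character.isMirrored(codePoint)) {
            // mirrored? matters when the character is displayed in right-to-left text
            sb.append(" mirrored");
        }
        if (Character.getType(codePoint) == Character.FORMAT) {
            // format characters carry information but no visible content
            sb.append(" format");
        }
        if (Character.getDirectionality(codePoint) == Character.DIRECTIONALITY_RIGHT_TO_LEFT) {
            sb.append(" right-to-left");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe(0x0663)); // ARABIC-INDIC DIGIT THREE
        System.out.println(describe('('));    // LEFT PARENTHESIS
        System.out.println(describe(0x200F)); // RIGHT-TO-LEFT MARK
        System.out.println(describe(0x05D0)); // HEBREW LETTER ALEF
    }
}
```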

  10. Encoding
     • Number assignments are called ‘code points’
     • Range #x0000 to #x10FFFF (21 bits)
     • ASCII range is the same in Unicode
     • Chars grouped into named ‘blocks’
       • E.g. Tamil, Arabic, Number Forms
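As a rough illustration of code points and blocks, the sketch below looks up the block of a code point and checks the 21-bit range using standard `java.lang.Character` calls. The class `CodePoints` and its methods are hypothetical names chosen for this example.

```java
final class CodePoints {
    /** Returns the named block a code point belongs to, or null if it is in no block. */
    static Character.UnicodeBlock blockOf(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a code point: " + codePoint);
        }
        return Character.UnicodeBlock.of(codePoint);
    }

    /** Number of bits needed to represent the largest code point. */
    static int bitsNeeded() {
        return 32 - Integer.numberOfLeadingZeros(Character.MAX_CODE_POINT);
    }

    public static void main(String[] args) {
        System.out.println(blockOf(0x0B85)); // TAMIL LETTER A
        System.out.println(blockOf(0x0627)); // ARABIC LETTER ALEF
        System.out.println(blockOf(0x2153)); // VULGAR FRACTION ONE THIRD
        System.out.println(blockOf('a'));    // the ASCII range keeps its values
        System.out.println(bitsNeeded());
    }
}
```

Note that Java strings store UTF-16 code units, so a code point above #xFFFF occupies two `char` values; `Character.charCount` reports this.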

  11. Composition / Normalization
     • Some chars are composed of others
       • E.g. ‘Ä’ decomposes to ‘A’ and ‘ ̈ ’
     • 2 chars are equivalent iff their decomposed, binary forms are identical
     • But some chars are really “the same” even if they’re different
       • E.g. some Katakana full and half-width chars
     • There are 2 definitions of equivalence
       • Canonical and Compatibility
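The two kinds of equivalence can be sketched with `java.text.Normalizer`: two strings are compared after decomposing both, using the canonical form (NFD) or the compatibility form (NFKD). The class `Equivalence` and its method names are hypothetical; this illustrates the idea and is not CLforJava's implementation.

```java
import java.text.Normalizer;

final class Equivalence {
    /** True if both strings have the same canonical decomposition (NFD). */
    static boolean canonicallyEquivalent(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFD)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFD));
    }

    /** True if both strings have the same compatibility decomposition (NFKD). */
    static boolean compatibilityEquivalent(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFKD)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFKD));
    }

    public static void main(String[] args) {
        String precomposed = "\u00C4";   // A WITH DIAERESIS as one code point
        String decomposed = "A\u0308";   // A followed by COMBINING DIAERESIS
        String halfWidthKa = "\uFF76";   // HALFWIDTH KATAKANA LETTER KA
        String fullWidthKa = "\u30AB";   // KATAKANA LETTER KA

        System.out.println(canonicallyEquivalent(precomposed, decomposed));
        System.out.println(canonicallyEquivalent(halfWidthKa, fullWidthKa));
        System.out.println(compatibilityEquivalent(halfWidthKa, fullWidthKa));
    }
}
```

The half-width and full-width Katakana letters are equivalent only under the compatibility definition, which is why the two definitions have to be kept apart.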

  12. Collation
     • Context-dependent (locales)
     • Unicode defines a table-driven mechanism
       • Very configurable (originally from IBM)
       • Specifically not required
     • Other mechanisms ok if equivalent results
       • Sun/Java uses a rule-based system
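To make the rule-based approach concrete, here is a small sketch with `java.text.RuleBasedCollator`. The two rule strings are toy rules written for this example (they cover only four letters and are not the real German or Swedish tailorings); the class name `ToyCollation` is likewise hypothetical.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.Arrays;

final class ToyCollation {
    // Toy rule: o-umlaut differs from o only at the secondary level, so it sorts before p.
    static final String GERMAN_LIKE = "< o ; \u00F6 < p < z";
    // Toy rule: o-umlaut is a separate letter that sorts after z.
    static final String SWEDISH_LIKE = "< o < p < z < \u00F6";

    /** Returns a sorted copy of the words under the given collation rules. */
    static String[] sort(String rules, String... words) {
        try {
            RuleBasedCollator collator = new RuleBasedCollator(rules);
            String[] copy = words.clone();
            Arrays.sort(copy, collator);
            return copy;
        } catch (ParseException e) {
            throw new IllegalArgumentException("bad collation rules: " + rules, e);
        }
    }

    public static void main(String[] args) {
        String[] words = {"z", "\u00F6", "p", "o"};
        System.out.println(Arrays.toString(sort(GERMAN_LIKE, words)));
        System.out.println(Arrays.toString(sort(SWEDISH_LIKE, words)));
    }
}
```

In real code one would normally ask `Collator.getInstance(locale)` for a locale's rules instead of writing them by hand; explicit rules are used here so that the example does not depend on which locale data is installed.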

  13. Bi-directional Algorithm
     • Unicode specifies algorithm to handle nested changes in direction (R to L, L to R)
     • Locale-dependent
     • Very important with mixed languages
     • Impacts the printer
       • Characters not printed in memory order
       • Some characters are mirrored
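The effect on a printer can be sketched with `java.text.Bidi`, which splits a string stored in memory order into runs with an embedding level (even levels are left-to-right, odd levels right-to-left). The class `BidiRuns` and the output format of `runLevels` are assumptions made for this example.

```java
import java.text.Bidi;

final class BidiRuns {
    /** Formats each directional run as start-limit:level, separated by spaces. */
    static String runLevels(String text) {
        Bidi bidi = new Bidi(text, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bidi.getRunCount(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(bidi.getRunStart(i)).append('-').append(bidi.getRunLimit(i))
              .append(':').append(bidi.getRunLevel(i));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Latin text, three Hebrew letters, Latin text again (memory order).
        String mixed = "abc \u05D0\u05D1\u05D2 def";
        System.out.println(runLevels(mixed)); // → 0-4:0 4-7:1 7-11:0
    }
}
```

A printer that honors these runs has to emit the level-1 run in reverse, which is why characters are not printed in memory order.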

  14. Line Break Algorithm
     • Unicode specifies algorithm to determine possible line breaks
     • Handles the <cr>, <lf>, <crlf> problem
     • Locale-dependent
     • Very important with mixed languages
     • Impacts the pretty printer
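A pretty printer needs the offsets at which a line may be broken. As a minimal sketch, Java's `java.text.BreakIterator` exposes such offsets for a given locale; the class `LineBreaks` and the method `breakOffsets` are hypothetical names for this example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class LineBreaks {
    /** Returns every offset at which the text may be broken, including 0 and the end. */
    static List<Integer> breakOffsets(String text, Locale locale) {
        BreakIterator it = BreakIterator.getLineInstance(locale);
        it.setText(text);
        List<Integer> offsets = new ArrayList<>();
        for (int pos = it.first(); pos != BreakIterator.DONE; pos = it.next()) {
            offsets.add(pos);
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Break opportunities fall after the trailing space of each word.
        System.out.println(breakOffsets("Hello brave world", Locale.ENGLISH));
    }
}
```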

  15. Implies Pervasive Changes to Several Lisp Components

  16. CLforJava Implementation

  17. CLforJava Project
     • Capstone software engineering course
     • Multi-semester undergraduate project
     • Gives students a “real world” experience
     • New, original implementation of Common Lisp
     • Written in Java and Lisp
     • See “Common Lisp for Java: A New Implementation Intertwined with Java”, Wed 11 am

  18. Character Types
     • CL standard defines
       • Standard-Char - 96 ASCII chars
       • Base-char, Extended-char - up to the impl
     • CLforJava defines
       • Standard-Char - same as standard
       • Base-char - Unicode definition of base character
         • Can’t be composed with char to the left
       • Extended-char - all the rest
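One way to read this three-way split is sketched below. It is an interpretation, not CLforJava's code: the class `CharTypes`, the enum `Kind`, and the rule "a combining mark is an extended char, everything else is a base char" are assumptions made for illustration, and the method returns only the most specific type.

```java
final class CharTypes {
    enum Kind { STANDARD_CHAR, BASE_CHAR, EXTENDED_CHAR }

    /** Classifies a code point under the simplified reading described above. */
    static Kind classify(int codePoint) {
        // The 96 standard characters: 95 printable ASCII characters plus newline.
        if (codePoint == '\n' || (codePoint >= 0x20 && codePoint <= 0x7E)) {
            return Kind.STANDARD_CHAR;
        }
        switch (Character.getType(codePoint)) {
            case Character.NON_SPACING_MARK:
            case Character.COMBINING_SPACING_MARK:
            case Character.ENCLOSING_MARK:
                // Combining marks attach to the character on their left (assumption:
                // these are the "rest" that the slide calls extended chars).
                return Kind.EXTENDED_CHAR;
            default:
                return Kind.BASE_CHAR;
        }
    }

    public static void main(String[] args) {
        System.out.println(classify('a'));    // STANDARD_CHAR
        System.out.println(classify(0x00C4)); // BASE_CHAR (A WITH DIAERESIS)
        System.out.println(classify(0x0308)); // EXTENDED_CHAR (COMBINING DIAERESIS)
    }
}
```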

  19. Character Naming
     • Official names - LATIN SMALL LETTER A
     • Unofficial names - a
     • Lispified names - LATIN-SMALL-LETTER-A
     • #\a , #\|LATIN SMALL LETTER A| , #\LATIN-SMALL-LETTER-A
     • Lisp names - RETURN, LINEFEED
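The step from an official name to a Lispified one can be sketched as a simple substitution of hyphens for spaces. `Character.getName` is the standard Java call for the official name; the class `CharNames` and the method `lispName` are hypothetical and only illustrate the naming scheme on the slide.

```java
final class CharNames {
    /** Returns the official Unicode name with spaces replaced by hyphens, or null if unassigned. */
    static String lispName(int codePoint) {
        String official = Character.getName(codePoint);
        return official == null ? null : official.replace(' ', '-');
    }

    public static void main(String[] args) {
        System.out.println(Character.getName('a')); // official name
        System.out.println(lispName('a'));          // Lispified name
    }
}
```

Control characters have no official Unicode name of their own, which is one reason the traditional Lisp names such as RETURN and LINEFEED are still needed alongside this scheme.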

  20. Character Naming in Java
     • 4 interfaces
       • lisp.common.type.Character
       • lisp.common.type.BaseChar
       • lisp.common.type.StandardChar
       • lisp.common.type.ExtendedChar
     • Standard chars available as static fields in StandardChar
       • public static final Character a;
       • public static final Character slash;
