Unicode 4.0 In Common Lisp
Adoption of Unicode In CLforJava
Jerry Boetje
ILC 2005
boetjeg@cofc.edu

    (defun the-थचऊॉ-old-صﻩطب-fn (aגףפd)
      (if (eql aגףפd 2)
          2
          (* aגףפd (the-थचऊॉ-old-صﻩطب-fn (1- aגףפd)))))

ASCII Legacy
• In the beginning (1983), there was
  • ASCII (universally recognized)
  • Everything else - mostly 8-bit encodings
    • ISO-8859-x
    • Code Pages (IBM PC)
    • JIS and some Chinese encodings (16 bit)
• Couldn’t mix encodings
  • Doc in Hebrew, Kanji, and Serbo-Croatian
CLforJava

Lisp Response
• Agree on a subset of ASCII that works everywhere (standard char)
• Add font and bits attributes to characters (later dropped)
• Fuzzy distinction between types of chars
• Non-portable method for specifying file encoding
• Define functions that would work with ASCII

Pretty Good For Its Time

The Rest of the World’s Response
• Define a uniform encoding for all characters on Earth
• Deal with the hard issues
  • Collation
  • Line breaks
  • Equivalence
  • Composition
  • etc.
Unicode

20 Years Later
• Globalization requires speaking all languages
• Many vendor-specific solutions
• Unicode version 4 has answers to many of the issues evoked by Common Lisp - and then some
• It’s time to formally integrate Unicode into the Common Lisp Standard
• But it’s not going to be easy!

Unicode 4 in Brief

Nature of Characters
• It’s not enough to assign a number to a char
• Characters are no longer atomic
  • A run of chars may be equivalent to one char
• Some provide information but not content
  • Direction
  • Formatting

Nature of Characters
• Never confuse the encoding with an ordering
• Collation is entirely context-dependent
  • Does ‘o’ come before, after, or the same as ‘ö’?
  • Different if you’re German or Swedish
• Chars have a rich set of properties
  • Simple - digit?, whitespace?
  • Complex - composition, direction, mirrored?

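The simple and complex properties named on this slide can be queried from Java's built-in Unicode tables. The sketch below is illustrative only: the class name `CharProperties` and its method are hypothetical and are not part of CLforJava, and the property data is whatever Unicode version the running JDK ships.

```java
// Hypothetical helper (not CLforJava code): reports a few Unicode properties
// of a single code point using java.lang.Character.
final class CharProperties {
    static String describe(int codePoint) {
        boolean rightToLeft =
                Character.getDirectionality(codePoint) == Character.DIRECTIONALITY_RIGHT_TO_LEFT;
        return String.format("U+%04X %s digit=%b whitespace=%b mirrored=%b rtl=%b",
                codePoint,
                Character.getName(codePoint),      // official Unicode name, or null
                Character.isDigit(codePoint),      // simple property
                Character.isWhitespace(codePoint), // simple property
                Character.isMirrored(codePoint),   // complex property
                rightToLeft);                      // complex property (direction)
    }

    public static void main(String[] args) {
        System.out.println(describe('('));     // a mirrored punctuation character
        System.out.println(describe(0x0663));  // ARABIC-INDIC DIGIT THREE
        System.out.println(describe(0x05D0));  // HEBREW LETTER ALEF
    }
}
```

Note that the digit test succeeds for digits outside ASCII, which is exactly where ASCII-only definitions of the standard character predicates stop being enough.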
Encoding
• Number assignments are called ‘code points’
  • Range #x0000 to #x10FFFF (21 bits)
  • ASCII range is the same in Unicode
• Chars grouped into named ‘blocks’
  • E.g. Tamil, Arabic, Number Forms

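As a rough illustration of code points and blocks, the following sketch uses only `java.lang.Character`. The class name `CodePoints` and its methods are hypothetical names chosen for this example; block assignments come from the Unicode data bundled with the JDK.

```java
// Hypothetical helper (not CLforJava code): code point range and block lookup.
final class CodePoints {
    /** True for any integer in the range #x0000 to #x10FFFF. */
    static boolean isCodePoint(int n) {
        return Character.isValidCodePoint(n);
    }

    /** Named block containing the code point, or null if it lies in no block. */
    static Character.UnicodeBlock blockOf(int codePoint) {
        return Character.UnicodeBlock.of(codePoint);
    }

    /** Number of 16-bit Java chars needed to hold the code point (1 or 2). */
    static int utf16Length(int codePoint) {
        return Character.charCount(codePoint);
    }

    public static void main(String[] args) {
        System.out.println(isCodePoint(0x10FFFF));   // upper end of the range
        System.out.println(blockOf(0x0B85));         // TAMIL LETTER A
        System.out.println(blockOf(0x2153));         // VULGAR FRACTION ONE THIRD
        System.out.println(utf16Length(0x10400));    // beyond the 16-bit range
    }
}
```

Because the range needs 21 bits, a Java `char` cannot hold every code point, so the sketch passes code points as `int`.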
Composition / Normalization
• Some chars are composed of others
  • E.g. ‘Ä’ decomposes to ‘A’ and ‘ ̈ ’
• 2 chars are equivalent iff their decomposed, binary forms are identical
• But some chars are really “the same” even if they’re different
  • E.g. some Katakana full and half-width chars
• There are 2 definitions of equivalence
  • Canonical and Compatibility

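The two kinds of equivalence can be checked by comparing decomposed forms, which is one way to read the "decomposed, binary forms are identical" rule. This is a minimal sketch using the standard `java.text.Normalizer`; the class `Equivalence` and its method names are hypothetical and do not describe how CLforJava implements the comparison.

```java
import java.text.Normalizer;

// Hypothetical helper (not CLforJava code): equivalence via decomposition.
final class Equivalence {
    /** Canonical equivalence: identical after canonical decomposition (NFD). */
    static boolean canonicallyEquivalent(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFD)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFD));
    }

    /** Compatibility equivalence: identical after compatibility decomposition (NFKD). */
    static boolean compatibilityEquivalent(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFKD)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFKD));
    }

    public static void main(String[] args) {
        // Precomposed A-with-diaeresis versus 'A' followed by COMBINING DIAERESIS.
        System.out.println(canonicallyEquivalent("\u00C4", "A\u0308"));
        // Half-width versus full-width KATAKANA LETTER KA.
        System.out.println(canonicallyEquivalent("\uFF76", "\u30AB"));
        System.out.println(compatibilityEquivalent("\uFF76", "\u30AB"));
    }
}
```

The half-width and full-width Katakana pair is equivalent only under the compatibility definition, which is why both definitions are needed.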
Collation
• Context-dependent (locales)
• Unicode defines a table-driven mechanism
  • Very configurable (originally from IBM)
  • Specifically not required
• Other mechanisms ok if equivalent results
  • Sun/Java uses a rule-based system

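To make the difference between encoding order and collation order concrete, the sketch below compares two strings first by raw UTF-16 code units and then with a locale-specific `java.text.Collator`. The class `LocaleOrdering` is a hypothetical name for this example, and the results of any locale-specific comparison depend on the locale data installed with the Java runtime.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical helper (not CLforJava code): encoding order versus collation.
final class LocaleOrdering {
    /** Sign of the comparison under the collation rules of the given locale. */
    static int compareIn(Locale locale, String a, String b) {
        Collator collator = Collator.getInstance(locale);
        return Integer.signum(collator.compare(a, b));
    }

    /** Sign of the comparison by UTF-16 code unit values (no linguistic meaning). */
    static int compareByCodeUnit(String a, String b) {
        return Integer.signum(a.compareTo(b));
    }

    public static void main(String[] args) {
        String oUmlaut = "\u00F6"; // LATIN SMALL LETTER O WITH DIAERESIS
        System.out.println(compareByCodeUnit(oUmlaut, "z"));
        System.out.println(compareIn(Locale.GERMAN, oUmlaut, "z"));
        // Swedish treats this letter as a separate letter at the end of the
        // alphabet; the outcome assumes Swedish locale data is available.
        System.out.println(compareIn(Locale.forLanguageTag("sv"), oUmlaut, "z"));
    }
}
```

By code unit value the umlauted letter sorts after ‘z’, while a German collator sorts it alongside ‘o’, so a Lisp `char<` defined on code points cannot serve as a collation function.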
Bi-directional Algorithm
• Unicode specifies algorithm to handle nested changes in direction (R to L, L to R)
• Locale-dependent
• Very important with mixed languages
• Impacts the printer
  • Characters not printed in memory order
  • Some characters are mirrored

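The effect of the algorithm on mixed-direction text can be seen by asking for the directional runs of a string. This sketch uses the JDK's `java.text.Bidi` class; the wrapper class `BidiRuns` is a hypothetical name, and it only reports embedding levels rather than reordering text for display.

```java
import java.text.Bidi;

// Hypothetical helper (not CLforJava code): directional runs of a string.
final class BidiRuns {
    /**
     * Returns the embedding level of each directional run, in memory order.
     * Even levels are left-to-right, odd levels are right-to-left.
     */
    static int[] runLevels(String text) {
        Bidi bidi = new Bidi(text, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
        int[] levels = new int[bidi.getRunCount()];
        for (int i = 0; i < levels.length; i++) {
            levels[i] = bidi.getRunLevel(i);
        }
        return levels;
    }

    public static void main(String[] args) {
        // Latin text followed by three Hebrew letters (alef, bet, gimel).
        String mixed = "abc \u05D0\u05D1\u05D2";
        System.out.println(java.util.Arrays.toString(runLevels(mixed)));
    }
}
```

A printer that honors these levels has to emit the right-to-left run in reverse of its memory order, which is the sense in which characters are "not printed in memory order".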
Line Break Algorithm
• Unicode specifies algorithm to determine possible line breaks
• Handles the <cr>, <lf>, <crlf> problem
• Locale-dependent
• Very important with mixed languages
• Impacts the pretty printer

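A pretty printer needs the set of positions where a line may be broken. The sketch below collects those positions with the JDK's `java.text.BreakIterator`, which is the Java platform's own line-breaking facility and is not guaranteed to match the Unicode algorithm in every detail. The class `LineBreaks` is a hypothetical name for this example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper (not CLforJava code): possible line break positions.
final class LineBreaks {
    /** Offsets (in chars) at which the text may be broken, including start and end. */
    static List<Integer> opportunities(String text, Locale locale) {
        BreakIterator iterator = BreakIterator.getLineInstance(locale);
        iterator.setText(text);
        List<Integer> positions = new ArrayList<>();
        for (int pos = iterator.first(); pos != BreakIterator.DONE; pos = iterator.next()) {
            positions.add(pos);
        }
        return positions;
    }

    public static void main(String[] args) {
        System.out.println(opportunities("Hello world", Locale.ENGLISH));
        // A <crlf> pair inside the text; the positions show how it is treated.
        System.out.println(opportunities("one\r\ntwo", Locale.ENGLISH));
    }
}
```

The locale argument is part of the call because, as the slide notes, break rules are locale-dependent.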
Implies Pervasive Changes to Several Lisp Components

CLforJava Implementation

CLforJava Project
• Capstone software engineering course
  • Multi-semester undergraduate project
  • Gives students a “real world” experience
• New, original implementation of Common Lisp
  • Written in Java and Lisp
• See “Common Lisp for Java: A New Implementation Intertwined with Java” Wed 11 am

Character Types
• CL standard defines
  • Standard-Char - 96 ASCII chars
  • Base-char, Extended-char - up to the impl
• CLforJava defines
  • Standard-Char - same as standard
  • Base-char - Unicode definition of base character
    • Can’t be composed with char to the left
  • Extended-char - all the rest

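One way to read this three-way split is as a set of predicates over code points. The sketch below is an interpretation, not CLforJava's actual code: the class `CharTypes` and its methods are hypothetical, and "base character" is simplified here to "not a combining mark", with standard chars always counted as base chars so that the Common Lisp subtype relationship holds. Unicode's own definition also restricts base characters to graphic characters.

```java
// Hypothetical predicates (not CLforJava code) for the three character types.
final class CharTypes {
    /** The 96 standard chars: 95 graphic ASCII chars (space to tilde) plus newline. */
    static boolean isStandardChar(int cp) {
        return cp == '\n' || (cp >= 0x20 && cp <= 0x7E);
    }

    /** True for code points in a combining-mark category (Mn, Mc, Me). */
    static boolean isCombiningMark(int cp) {
        int type = Character.getType(cp);
        return type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK
                || type == Character.ENCLOSING_MARK;
    }

    /** Simplified base-char test: does not combine with the char to its left. */
    static boolean isBaseChar(int cp) {
        return Character.isValidCodePoint(cp)
                && (isStandardChar(cp) || !isCombiningMark(cp));
    }

    /** Extended-char: every valid code point that is not a base-char. */
    static boolean isExtendedChar(int cp) {
        return Character.isValidCodePoint(cp) && !isBaseChar(cp);
    }

    public static void main(String[] args) {
        System.out.println(isBaseChar('A'));         // ordinary letter
        System.out.println(isExtendedChar(0x0308));  // COMBINING DIAERESIS
    }
}
```

Under this reading, the combining diaeresis from the composition example is an extended-char, while both ‘A’ and precomposed ‘Ä’ are base-chars.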
Character Naming
• Official names - LATIN SMALL LETTER A
• Unofficial names - a
• Lispified names - LATIN-SMALL-LETTER-A
• #\a , #\|LATIN SMALL LETTER A| , #\LATIN-SMALL-LETTER-A
• Lisp names - RETURN, LINEFEED

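The mapping from an official name to a Lispified name can be sketched with `Character.getName`, which returns the official Unicode name of a code point. The class `CharNames` and its methods are hypothetical, and replacing every space with a hyphen is an assumption about the Lispification rule based only on the example on this slide.

```java
// Hypothetical helper (not CLforJava code): official and Lispified char names.
final class CharNames {
    /** Official Unicode name, or null if the code point is unassigned. */
    static String officialName(int codePoint) {
        return Character.getName(codePoint);
    }

    /** Official name with spaces replaced by hyphens (assumed Lispification rule). */
    static String lispifiedName(int codePoint) {
        String name = Character.getName(codePoint);
        return name == null ? null : name.replace(' ', '-');
    }

    /** Reader syntax for the character using its Lispified name. */
    static String readerSyntax(int codePoint) {
        return "#\\" + lispifiedName(codePoint);
    }

    public static void main(String[] args) {
        System.out.println(officialName('a'));
        System.out.println(lispifiedName('a'));
        System.out.println(readerSyntax('a'));
    }
}
```

Some official names already contain hyphens, so turning a Lispified name back into an official one would need a lookup rather than a simple character substitution.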
Character Naming in Java
• 4 interfaces
  • lisp.common.type.Character
  • lisp.common.type.BaseChar
  • lisp.common.type.StandardChar
  • lisp.common.type.ExtendedChar
• Standard chars available as static fields in StandardChar
  • public static final Character a;
  • public static final Character slash;
