The Multilingual Lion: TeX learns to speak Unicode
Jonathan Kew, SIL International
April 7, 2005
27th Internationalization and Unicode Conference, Berlin, Germany, April 2005

Background
• TeX: free typesetting system with a 25-year history
  • stable, reliable, flexible, widely implemented
  • experienced user community
  • rich collection of supporting tools
• Originally designed for English typesetting
  • support for accents and other European characters
  • language support extended via custom fonts, macros, and preprocessors

Traditional TeX input conventions
• Input text is ASCII (or 8-bit codepage)

Source text      Typeset output   Notes
\'{a}            á                typical accent command
\c{c}            ç
\aa              å
---              —                ligature in typical TeX fonts
$\alpha$         α                math mode symbol
{\dn acchaa}     अच्छा              using custom preprocessor
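
To make the contrast with direct Unicode input concrete, the following Python sketch maps a few of the escape sequences in the table above to the Unicode characters they stand for. The function name `tex_to_unicode` and the two lookup tables are hypothetical and cover only a handful of commands; this is an illustration of the idea, not how TeX or any preprocessor actually works.

```python
import re
import unicodedata

# Hypothetical lookup: a few TeX accent commands and the combining mark each implies.
COMBINING = {"'": "\u0301", "`": "\u0300", '"': "\u0308", "c": "\u0327"}

# Hypothetical lookup: escapes that stand for one character on their own.
SYMBOLS = {r"\aa": "\u00e5", "---": "\u2014"}

# Matches accent commands of the form \'{a} or \c{c}.
ACCENT = re.compile(r"""\\(['`"c])\{(\w)\}""")


def tex_to_unicode(source: str) -> str:
    """Replace a small set of TeX escapes with precomposed Unicode characters."""

    def compose(match):
        base, mark = match.group(2), COMBINING[match.group(1)]
        # NFC normalization merges base letter + combining mark where possible.
        return unicodedata.normalize("NFC", base + mark)

    text = ACCENT.sub(compose, source)
    for escape, char in SYMBOLS.items():
        text = text.replace(escape, char)
    return text


print(tex_to_unicode(r"\'{a} \c{c} \aa"))
```

With Unicode input, none of this translation is needed: the characters appear in the source file as themselves.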

Multilingual typesetting with TeX
• Text input
  • Escape sequences for non-ASCII characters
  • Multiple 8-bit codepages
  • Preprocessors for complex scripts
• Font support
  • Fonts limited to 256 glyphs
  • Custom-encoded fonts with specific glyph sets
• All tied together via complex TeX macros
  • Difficult to understand and extend
  • Difficult to integrate with other packages

Towards a cleaner solution
• Unicode: all required characters directly represented
  • no need for “escape sequences” to access characters not included in the current codepage
  • no need to switch between codepages according to the language/script being typeset
  • characters rendered via standard access codes
• Character/glyph model and modern font rendering technologies
  • complex script handling moved out of the domain of the text data stream

Typesetting Unicode text with XeTeX
• Accented characters

Source:
    \halign{#\hfil\quad& #\hfil\cr
    dan&    dan\cr
    dubok&  dubok\cr
    džabe&  đak\cr
    džin&   džabe\cr
    Džin&   džin\cr
    đak&    Džin\cr
    Evropa& Evropa\cr}

Typeset output:
    dan      dan
    dubok    dubok
    džabe    đak
    džin     džabe
    Džin     džin
    đak      Džin
    Evropa   Evropa

Typesetting Unicode text with XeTeX
• CJK ideographs

Source:
    \font\han="STSong" at 16pt
    \font\rom="Gentium" at 8pt
    \def\hc#1#2{\vtop{\hbox{\han #1}
      \hbox{\kern10pt\rom #2}}}
    \vtop{\hc{書く}{ka-ku}
      \hc{最も}{motto-mo}
      \hc{最後}{sai-go}
      \hc{働く}{hatara-ku}
      \hc{海}{umi}}

Typeset output:
    書く
      ka-ku
    最も
      motto-mo
    最後
      sai-go
    働く
      hatara-ku
    海
      umi

Typesetting Unicode text with XeTeX
• Complex scripts
[Example: Arabic-script source text marked up with \c, \s, \p, and \v tags, shown alongside its typeset output. The sample text did not survive extraction.]

Key changes from TeX to XeTeX
• Unicode as the text encoding
  • directly use Unicode input text, Unicode-encoded fonts
• Fonts and rendering technologies
  • use any fonts available on the host computer
  • use existing smart-font rendering systems
• Additional features for multilingual typesetting
  • optional font features
  • line breaking for Asian scripts
• Backward compatibility issues
  • support for legacy TeX fonts and documents

From 8 to 16 bits…
• Character type in TeX code was an 8-bit value
  • one option: process text as UTF-8
• Character codes used to index a number of tables
  • character category, case pairs, etc.
• Decision to use 16-bit character codes
  • all 256-element tables enlarged to 65,536 elements to match the extended character set
  • extended TeX commands that refer to character codes
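
The effect of widening the character-indexed tables can be sketched in Python. The helper `make_catcode_table` and its rule of marking every alphabetic code as a letter are assumptions made for illustration (TeX's real initial category assignments are narrower); the category values 11 (letter) and 12 (other) are TeX's own.

```python
LETTER, OTHER = 11, 12  # TeX category codes for letters and for other characters


def make_catcode_table(size):
    """Build a category table indexed directly by character code.

    Assumption for illustration: every alphabetic code point is a letter.
    """
    table = [OTHER] * size
    for code in range(size):
        if chr(code).isalpha():
            table[code] = LETTER
    return table


table8 = make_catcode_table(256)      # traditional 8-bit table
table16 = make_catcode_table(65536)   # enlarged 16-bit table

code = ord("\u017e")  # LATIN SMALL LETTER Z WITH CARON, code 382
print(code < len(table8))          # False: the 8-bit table has no slot for it
print(table16[code] == LETTER)     # True: the 16-bit table can classify it
```

Direct indexing stays a single lookup in both cases; only the table length changes, which is why the enlargement required no restructuring of the lookups themselves.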

From 8 to 16 bits… and beyond?
• Unicode does not fit in 16 bits either!
• XeTeX handles non-BMP characters as UTF-16 surrogate pairs
  • properties of individual characters cannot be set
  • unlikely to matter for typesetting usage: all surrogate codes can be treated as simple printable characters
  • keeps size of internal tables moderate, without extensive restructuring
• Using UTF-16 happens to match the font rendering APIs that XeTeX uses
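
The surrogate-pair mechanism mentioned above follows the standard UTF-16 encoding rule, sketched here in Python. The function name `to_utf16_units` is hypothetical; the arithmetic is the one defined by UTF-16 itself, and the last line cross-checks it against Python's built-in encoder.

```python
def to_utf16_units(text):
    """Return the list of 16-bit code units that UTF-16 uses for `text`."""
    units = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            # BMP character: one 16-bit unit, equal to the code point.
            units.append(cp)
        else:
            # Non-BMP character: split the 20 remaining bits across two units.
            offset = cp - 0x10000
            units.append(0xD800 + (offset >> 10))    # high (lead) surrogate
            units.append(0xDC00 + (offset & 0x3FF))  # low (trail) surrogate
    return units


units = to_utf16_units("A\U0001D11E")  # 'A' followed by U+1D11E MUSICAL SYMBOL G CLEF
print([hex(u) for u in units])

# Cross-check against the standard library's UTF-16 encoder.
encoded = "A\U0001D11E".encode("utf-16-be")
print(units == [int.from_bytes(encoded[i:i + 2], "big") for i in range(0, len(encoded), 2)])
```

Each surrogate unit falls in the range 0xD800 to 0xDFFF, so a 65,536-element table still has a slot for it, but that slot is shared by every non-BMP character using the same surrogate. This is why per-character properties cannot be set for characters outside the BMP.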

Implementing the character/glyph model
• Required for support of complex scripts in Unicode
• Significant change from traditional TeX model
  • TeX regards “a specific character code in a specific font” as the fundamental unit of text to be typeset
  • assumes such a character has known, fixed dimensions
  • provision for ligatures by character substitutions
  • a paragraph consists of a sequence of “character” nodes, to be precisely placed, and intervening “glue” nodes
• A Unicode character may not map to a single, known glyph
  • many scripts require contextual selection of glyphs
  • must measure characters in context, not in isolation

Implementing the character/glyph model
• Initial implementation using ATSUI on Mac OS X
  • typesetting process collects runs of characters (words)
  • calls ATSUI text layout APIs to measure width
  • a XeTeX paragraph consists of a sequence of “word” nodes separated by “glue”
• Typesetting engine positions words, not glyphs
  • positioning the glyphs within each word is the job of the font rendering engine
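
The difference between the two paragraph models can be sketched in Python. All names here are hypothetical, and `toy_measure` is only a stand-in for a call into a text layout API such as ATSUI: it charges 5 units per character and pretends an "fi" ligature saves 2 units, to show why a run has to be measured as a whole.

```python
def tex_nodes(text):
    """Traditional model: one 'char' node per character, glue between words."""
    nodes = []
    for word in text.split():
        if nodes:
            nodes.append(("glue",))
        nodes.extend(("char", ch) for ch in word)
    return nodes


def xetex_nodes(text, measure):
    """Word model: one 'word' node per run, with its width measured in context."""
    nodes = []
    for word in text.split():
        if nodes:
            nodes.append(("glue",))
        nodes.append(("word", word, measure(word)))
    return nodes


def toy_measure(run):
    """Stand-in for the layout engine: 5 units per character, 'fi' saves 2."""
    return 5 * len(run) - 2 * run.count("fi")


print(tex_nodes("the fig"))
print(xetex_nodes("the fig", toy_measure))
```

In the word model the typesetting engine only sees one width per run; how that width arises from ligatures, contextual forms, or reordering stays inside the measuring function.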

Implementing the character/glyph model
[Figure: nodes in a TeX paragraph compared with the corresponding nodes in XeTeX. The TeX paragraph is a sequence of one “char” node per character, with “glue: word space” nodes between words; the XeTeX paragraph has one “word” node per word, separated by the same “glue: word space” nodes.]

Implementing the character/glyph model
• OpenType Layout support using ICU library
  • alternative font layout engine
  • provides support for OpenType features in Latin fonts
  • supports a number of complex (Indic/Asian) scripts
• XeTeX uses either ATSUI or ICU according to layout tables found in fonts
  • overall typesetting process is independent of font technology in use
  • distinction required only at lowest level of measuring a run of text in a given font
  • documents may freely mix AAT and OT fonts
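
Choosing an engine from a font's layout tables might look like the following Python sketch. The table tags are real ('mort' and 'morx' are AAT glyph metamorphosis tables; 'GSUB' and 'GPOS' are OpenType layout tables), but the function `choose_engine`, the precedence given to AAT when both kinds are present, and the `None` result for fonts with neither are assumptions for illustration, not XeTeX's documented logic.

```python
AAT_TABLES = {"mort", "morx"}       # AAT glyph metamorphosis tables
OPENTYPE_TABLES = {"GSUB", "GPOS"}  # OpenType substitution and positioning tables


def choose_engine(font_tables):
    """Pick a layout engine name from the set of table tags present in a font.

    Assumption: AAT tables take precedence when a font carries both kinds.
    """
    tags = set(font_tables)
    if tags & AAT_TABLES:
        return "ATSUI"
    if tags & OPENTYPE_TABLES:
        return "ICU"
    return None  # no layout tables found; the caller decides what to do


print(choose_engine({"cmap", "glyf", "morx"}))
print(choose_engine({"cmap", "glyf", "GSUB", "GPOS"}))
```

Because the decision depends only on the font, it can be made once per font and hidden behind the run-measuring step, which is what lets a document mix AAT and OpenType fonts freely.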