The Multilingual Lion: TeX learns to speak Unicode
Jonathan Kew, SIL International
April 7, 2005
27th Internationalization and Unicode Conference, Berlin, Germany, April 2005

Background
• TeX: free typesetting system with a 25-year history
  • stable, reliable, flexible, widely implemented
  • experienced user community
  • rich collection of supporting tools
• Originally designed for English typesetting
  • support for accents and other European characters
  • language support extended via custom fonts, macros, and preprocessors

Traditional TeX input conventions
• Input text is ASCII (or 8-bit codepage)

Source text      Typeset output   Notes
\'{a}            á                typical accent command
\c{c}            ç
\aa              å
---              —                ligature in typical TeX fonts
$\alpha$         α                math mode symbol
{\dn acchaa}     अच्छा              using custom preprocessor
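The relationship between an accent command and the character it stands for can be sketched in Python. This is only an illustration of the idea, not how TeX itself processes accents: the `COMBINING` table and the `accent_to_unicode` helper are hypothetical names, and the sketch relies on Unicode normalization (NFC) to compose a base letter and a combining mark into a single precomposed character.

```python
import unicodedata

# Hypothetical lookup: TeX accent command name -> Unicode combining mark.
COMBINING = {
    "'": "\u0301",  # \'  acute accent
    "c": "\u0327",  # \c  cedilla
}

def accent_to_unicode(command: str, base: str) -> str:
    """Compose a base letter and its accent into one precomposed character."""
    return unicodedata.normalize("NFC", base + COMBINING[command])

a_acute = accent_to_unicode("'", "a")    # corresponds to \'{a}
c_cedilla = accent_to_unicode("c", "c")  # corresponds to \c{c}
```

With Unicode input, the author types the precomposed character directly and no such mapping step is needed.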

Multilingual typesetting with TeX
• Text input
  • Escape sequences for non-ASCII characters
  • Multiple 8-bit codepages
  • Preprocessors for complex scripts
• Font support
  • Fonts limited to 256 glyphs
  • Custom-encoded fonts with specific glyph sets
• All tied together via complex TeX macros
  • Difficult to understand and extend
  • Difficult to integrate with other packages

Towards a cleaner solution
• Unicode: all required characters directly represented
  • no need for “escape sequences” to access characters not included in the current codepage
  • no need to switch between codepages according to the language/script being typeset
  • characters rendered via standard access codes
• Character/glyph model and modern font rendering technologies
  • complex script handling moved out of the domain of the text data stream

Typesetting Unicode text with XeTeX
• Accented characters

Source text:
    \halign{#\hfil\quad& #\hfil\cr
    dan& dan\cr
    dubok& dubok\cr
    džabe& đak\cr
    džin& džabe\cr
    Džin& džin\cr
    đak& Džin\cr
    Evropa& Evropa\cr}

Typeset output:
    dan      dan
    dubok    dubok
    džabe    đak
    džin     džabe
    Džin     džin
    đak      Džin
    Evropa   Evropa

Typesetting Unicode text with XeTeX
• CJK ideographs

Source text:
    \font\han="STSong" at 16pt
    \font\rom="Gentium" at 8pt
    \def\hc#1#2{\vtop{\hbox{\han #1}
      \hbox{\kern10pt\rom #2}}}
    \vtop{\hc{書く}{ka-ku}
      \hc{最も}{motto-mo}
      \hc{最後}{sai-go}
      \hc{働く}{hatara-ku}
      \hc{海}{umi}}

Typeset output:
    書く   ka-ku
    最も   motto-mo
    最後   sai-go
    働く   hatara-ku
    海     umi

Typesetting Unicode text with XeTeX
• Complex scripts

[Example: Arabic-script source text marked up with \c 1, \s, \p, and \v 1 to \v 3 commands, shown beside its typeset output with verse numbers. The sample text itself did not survive extraction.]

Key changes from TeX to XeTeX
• Unicode as the text encoding
  • directly use Unicode input text, Unicode-encoded fonts
• Fonts and rendering technologies
  • use any fonts available in the host computer
  • use existing smart-font rendering systems
• Additional features for multilingual typesetting
  • optional font features
  • line breaking for Asian scripts
• Backward compatibility issues
  • support for legacy TeX fonts and documents

From 8 to 16 bits…
• Character type in TeX code was an 8-bit value
  • one option: process text as UTF-8
• Character codes used to index a number of tables
  • character category, case pairs, etc.
• Decision to use 16-bit character codes
  • all 256-element tables enlarged to 65,536 elements to match the extended character set
  • extended TeX commands that refer to character codes
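The trade-off between processing UTF-8 bytes and widening the character type can be illustrated with a small Python sketch. The table names are hypothetical, and the category values 11 (letter) and 12 (other) follow TeX's usual category codes; the sketch only shows why a per-character table indexed by code needs 65,536 entries once codes are 16 bits wide.

```python
LETTER, OTHER = 11, 12  # TeX category codes for "letter" and "other"

alpha = "α"  # U+03B1
utf8_bytes = alpha.encode("utf-8")                   # two separate 8-bit values
utf16_units = len(alpha.encode("utf-16-be")) // 2    # one 16-bit value

# An 8-bit engine can only index per-character tables with values 0..255,
# so under UTF-8 each byte of alpha would be looked up on its own.
table_8bit = [OTHER] * 256

# With 16-bit character codes the same table grows to 65,536 entries,
# and alpha gets a single entry of its own.
table_16bit = [OTHER] * 65536
table_16bit[ord(alpha)] = LETTER
```

Here `ord(alpha)` is 945, which lies outside the 256-entry table but indexes the enlarged one directly.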

From 8 to 16 bits… and beyond?
• Unicode does not fit in 16 bits either!
• XeTeX handles non-BMP characters as UTF-16 surrogate pairs
  • properties of individual characters cannot be set
  • unlikely to matter for typesetting usage: all surrogate codes can be treated as simple printable characters
  • keeps size of internal tables moderate, without extensive restructuring
• Using UTF-16 happens to match the font rendering APIs that XeTeX uses
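How a non-BMP character becomes a surrogate pair can be shown with the standard UTF-16 arithmetic. The helper name `to_surrogates` is hypothetical, and U+1D11E (MUSICAL SYMBOL G CLEF) is chosen here only as a convenient example of a character beyond the BMP.

```python
def to_surrogates(code_point: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into a UTF-16 high/low surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only non-BMP code points need surrogates")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low

high, low = to_surrogates(0x1D11E)

# Cross-check against Python's own UTF-16 encoder.
encoded = chr(0x1D11E).encode("utf-16-be")
```

Each half of the pair fits in a 16-bit character code, which is why an engine with 65,536-entry tables can carry such characters through without being able to assign properties to the character as a whole.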

Implementing the character/glyph model
• Required for support of complex scripts in Unicode
• Significant change from traditional TeX model
  • TeX regards “a specific character code in a specific font” as the fundamental unit of text to be typeset
  • assumes such a character has known, fixed dimensions
  • provision for ligatures by character substitutions
  • a paragraph consists of a sequence of “character” nodes, to be precisely placed, and intervening “glue” nodes
• A Unicode character may not map to a single, known glyph
  • many scripts require contextual selection of glyphs
  • must measure characters in context, not in isolation
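The point that one character may correspond to several glyphs can be made concrete with Arabic. As a rough stand-in for glyphs, this sketch uses Unicode's compatibility presentation forms of the letter BEH; real fonts select contextual glyphs through their layout tables rather than through these code points, so this is only an illustration.

```python
import unicodedata

BEH = "\u0628"  # ARABIC LETTER BEH, the single character stored in the text

# Four positional shapes of the same letter, encoded as presentation forms.
forms = ["\uFE8F", "\uFE90", "\uFE91", "\uFE92"]
names = [unicodedata.name(form) for form in forms]

# Compatibility normalization folds every shape back to the one character.
folded = {unicodedata.normalize("NFKC", form) for form in forms}
```

The four names describe isolated, final, initial, and medial forms, and `folded` contains only `BEH`: the text holds one character, while the shape (and therefore the width) depends on its neighbours.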

Implementing the character/glyph model
• Initial implementation using ATSUI on Mac OS X
  • typesetting process collects runs of characters (words)
  • calls ATSUI text layout APIs to measure width
  • a XeTeX paragraph consists of a sequence of “word” nodes separated by “glue”
• Typesetting engine positions words, not glyphs
  • this is the job of the font rendering engine
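The word-node idea can be sketched as a small data model. Everything named here (`WordNode`, `GlueNode`, `measure_run`, `build_paragraph`) is hypothetical, and `measure_run` is a fake stand-in for the call into the font rendering engine: it simply charges half an em per character, which a real layout engine would not do.

```python
from dataclasses import dataclass

@dataclass
class WordNode:
    text: str      # the whole run of characters, kept together
    width: float   # measured once, for the run as a whole

@dataclass
class GlueNode:
    pass           # stretchable space between words

def measure_run(text: str, font_size: float) -> float:
    """Placeholder for the layout engine: 0.5 em per character (assumption)."""
    return 0.5 * font_size * len(text)

def build_paragraph(text: str, font_size: float = 10.0) -> list:
    """Turn text into alternating word and glue nodes."""
    nodes = []
    for index, word in enumerate(text.split()):
        if index > 0:
            nodes.append(GlueNode())
        nodes.append(WordNode(word, measure_run(word, font_size)))
    return nodes

nodes = build_paragraph("TeX learns Unicode")
```

The paragraph builder never looks inside a word; it only needs each run's width, which is what lets glyph selection stay in the rendering engine.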

Implementing the character/glyph model

[Figure: nodes in a TeX paragraph (individual character nodes and glue) beside the corresponding nodes in XeTeX (word nodes and glue). The node labels did not survive extraction.]

Implementing the character/glyph model
• OpenType Layout support using ICU library
  • alternative font layout engine
  • provides support for OpenType features in Latin fonts
  • supports a number of complex (Indic/Asian) scripts
• XeTeX uses either ATSUI or ICU according to layout tables found in fonts
  • overall typesetting process is independent of font technology in use
  • distinction required only at lowest level of measuring a run of text in a given font
  • documents may freely mix AAT and OT fonts
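Choosing a layout engine from the tables present in a font might look roughly like the following. The function name `choose_engine` is hypothetical, and both the precedence (AAT checked first) and the fallback are assumptions made for this sketch; the slides do not state which engine wins when a font carries both kinds of table. The table tags themselves (`morx`/`mort` for AAT, `GSUB`/`GPOS` for OpenType Layout) are the standard ones.

```python
AAT_TABLES = {"morx", "mort"}        # Apple Advanced Typography layout tables
OPENTYPE_TABLES = {"GSUB", "GPOS"}   # OpenType Layout tables

def choose_engine(font_tables) -> str:
    """Pick a layout engine name from the table tags found in a font."""
    tags = set(font_tables)
    if tags & AAT_TABLES:        # assumption: AAT takes precedence
        return "ATSUI"
    if tags & OPENTYPE_TABLES:
        return "ICU"
    return "ATSUI"               # assumption: plain fonts go to the system engine

aat_choice = choose_engine(["cmap", "glyf", "morx"])
ot_choice = choose_engine(["cmap", "glyf", "GSUB", "GPOS"])
```

Only this one decision depends on the font technology; the rest of the pipeline sees a measured run of text either way.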
