Graphemic Standardisation and Human Writing Systems Workshop


  1. Graphemic Standardisation and Human Writing Systems Workshop
     Victor Zimmermann
     2019-06-22
     Department of Computational Linguistics, Heidelberg University

  2. Writing: A Human Invention
     Human writing systems have been invented independently about five times in human history:
     • Indus Valley (Harappan Script)
     • Sumer (Cuneiform)
     • Egypt (Hieroglyphs)
     • Huang He valley (Chinese)
     • Central America (Maya Script)
     This sets writing apart from speech, which exists in every human civilisation. There is no known case of a child acquiring writing by itself. Writing is a technology, just like the wheel or the iPhone.
     tacos 29 | unicode

  3. Scripts vs. Character Sets
     • Script / Writing System: Method of visualising verbal communication.
     • Character set: Set of symbols used in a writing system.
     • Alphabet: Set of characters representing phonemes.
     • Abjad: Set of characters representing consonants.
     • Abugida: Set of characters representing consonants with vowel notations.
     • Syllabary: Set of characters representing syllables.
     • Logography: Set of characters representing semantic units.

  4. Writing Systems around the World

  5. Graphemic Standardisation
     “When he wished to print, he took an iron frame and set it on the iron plate. In this he placed the types, set close together. When the frame was full, the whole made one solid block of type. He then placed it near the fire to warm it. When the paste [at the back] was slightly melted, he took a smooth board and pressed it over the surface, so that the block of type became as even as a whetstone.”
     - Shen Kuo (1031–1095)

  6. Graphemic Standardisation
     • Early printing standards were tightly linked to typography.
     • What could be printed depended on the physical movable type that was available.
     • The early English printing presses used German metal type, which lacked characters like thorn (þ) and eth (ð). This contributed to their removal from the English alphabet.

  7. Digital Standards
     • Early codes like Morse code or the Baudot code already represented characters as bit-like sequences of signals.
     • The American 7-bit ASCII encoding was used throughout the Latin-script world.
     • Various 8-bit extensions of ASCII emerged, e.g. adding diacritic variants for Northern European countries.
     • IBM released its own 8-bit standard, the Extended Binary Coded Decimal Interchange Code (EBCDIC).
     • Many competing standards for Japanese led to mojibake (garbled text) when moving data between companies.
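The effect of competing standards can be reproduced in a few lines. The following is a minimal sketch, assuming Shift JIS and EUC-JP as the two competing Japanese encodings and an arbitrary sample word; neither choice comes from the slides.

```python
# Illustrative only: the sample word and the codec pair are assumptions.
text = "文字化け"  # the word "mojibake" written in Japanese

# One system writes the text using Shift JIS ...
raw = text.encode("shift_jis")

# ... and another system reads the same bytes as EUC-JP.
# errors="replace" substitutes U+FFFD for byte sequences that are
# invalid in the wrong codec instead of raising an exception.
garbled = raw.decode("euc_jp", errors="replace")

# Decoding with the matching codec recovers the original text.
restored = raw.decode("shift_jis")

print("garbled: ", garbled)
print("restored:", restored)
```

The bytes themselves are never damaged. Only the interpretation differs, which is why a single agreed standard removes the problem.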

  8. Workshop: Canadian Aboriginal Syllabics
     Figure 1: Evans’ script, as published in 1841

  9. Workshop: Cyrillic Scripts

  10. Workshop: Cyrillic Scripts

  11. Workshop: Hangul

  12. Workshop: Hangul

  13. Workshop Task
      • Are there any cultural aspects you need to consider when creating a standard? How should they influence your design?
      • How would a standard deal with the syllabic blocks, ligatures or diacritics present in your language?
      • Is there a particularly efficient way to represent this script?
      • Are there things you wish to include in your standard that go beyond character sets?
      • Does a program using your encoding need special instructions to represent your characters? What are they?
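As one point of comparison for the question about syllabic blocks and diacritics: Unicode offers both precomposed and decomposed forms, related by normalisation. The sketch below uses Python's standard `unicodedata` module. The sample characters are chosen for illustration and are not part of the workshop material.

```python
import unicodedata

# A precomposed Hangul syllable block (U+D55C).
syllable = "\uD55C"

# NFD splits the block into its conjoining jamo (lead, vowel, tail).
jamo = unicodedata.normalize("NFD", syllable)
print([f"U+{ord(c):04X}" for c in jamo])

# NFC composes the jamo back into the single syllable block.
recomposed = unicodedata.normalize("NFC", jamo)

# The same mechanism covers diacritics: U+00E9 decomposes into
# a base letter "e" plus U+0301 COMBINING ACUTE ACCENT.
e_acute = "\u00E9"
decomposed = unicodedata.normalize("NFD", e_acute)
print([f"U+{ord(c):04X}" for c in decomposed])
```

A home-made standard has to make the same choice: encode every block as its own character, encode only the components, or allow both and define how they are treated as equivalent.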

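  14. UTF-32, UTF-16, UTF-8
      UTF-32: Every code point is encoded in 32 bits.
      UTF-16: Every code point is encoded in 16 or 32 bits (one 16-bit unit or a surrogate pair of two units).
      Big Endian: the most significant byte comes first. Little Endian: the least significant byte comes first. Only the order of the bytes changes, not the order of the bits within a byte.
      UTF-8: Variable-width encoding in 8-bit chunks (1 to 4 bytes per code point), using a lead byte followed by continuation bytes.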
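The three encoding forms can be sketched by hand in a few lines. This is a teaching sketch rather than a replacement for the built-in codecs; the function names `utf8_encode`, `utf16_units` and `utf16_encode` are made up for this example.

```python
def _check(cp: int) -> None:
    # Valid scalar values are U+0000..U+10FFFF without the surrogate range.
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")


def utf8_encode(cp: int) -> bytes:
    """Encode one code point as 1 to 4 bytes: a lead byte plus continuation bytes."""
    _check(cp)
    if cp < 0x80:        # 0xxxxxxx (plain ASCII)
        return bytes([cp])
    if cp < 0x800:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def utf16_units(cp: int) -> list[int]:
    """Return one 16-bit unit, or a surrogate pair for code points above U+FFFF."""
    _check(cp)
    if cp < 0x10000:
        return [cp]
    v = cp - 0x10000     # 20 bits, split into two halves of 10 bits
    return [0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)]


def utf16_encode(cp: int, byteorder: str) -> bytes:
    """Serialise the 16-bit units; byteorder is "big" or "little"."""
    return b"".join(u.to_bytes(2, byteorder) for u in utf16_units(cp))


taco = 0x1F32E
print(utf8_encode(taco).hex(" "))
print(utf16_encode(taco, "big").hex(" "))
print(utf16_encode(taco, "little").hex(" "))
print(taco.to_bytes(4, "big").hex(" "))  # UTF-32 is the code point itself
```

Endianness only enters at the final serialisation step: the 16-bit units are identical in both byte orders, and only the two bytes inside each unit swap places.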

  15. Examples
      TACO (U+1F32E)
      UTF-8:     11110000 10011111 10001100 10101110
      UTF-16 BE: 11011000 00111100 11011111 00101110
      UTF-16 LE: 00111100 11011000 00101110 11011111
      UTF-32 BE: 00000000 00000001 11110011 00101110
      LATIN CAPITAL LETTER A (U+0041)
      UTF-8:     01000001
      UTF-16 BE: 00000000 01000001
      UTF-16 LE: 01000001 00000000
      UTF-32 BE: 00000000 00000000 00000000 01000001
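Bit patterns for the two example characters can be generated with Python's built-in codecs, which is a convenient way to check such tables. The helper names `bits` and `show` are made up for this sketch, and UTF-32 is shown in big-endian byte order as an assumption.

```python
def bits(data: bytes) -> str:
    """Render bytes as space-separated groups of 8 bits."""
    return " ".join(f"{b:08b}" for b in data)


def show(ch: str) -> dict[str, str]:
    """Map each encoding name to the bit pattern of ch in that encoding."""
    codecs = ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be")
    return {codec: bits(ch.encode(codec)) for codec in codecs}


for ch in ("\U0001F32E", "A"):  # TACO and LATIN CAPITAL LETTER A
    print(f"U+{ord(ch):04X}")
    for codec, pattern in show(ch).items():
        print(f"  {codec:<10} {pattern}")
```

The explicit `-be` and `-le` codec names are used on purpose: the plain `utf-16` and `utf-32` codecs prepend a byte order mark, which would add extra bytes to the output.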

  16. History of Emoji
      The history of Unicode is the history of compromise. UTF-8 came to be because ASCII users did not want to move from 7-bit to 16-bit or even 32-bit systems. Because of its logographic nature, Japanese uses a lot of code points. Since every additional bit doubles the number of available code points, Japanese encodings were left with a lot of unassigned code points. Some Japanese phone companies used this space for symbols like emoticons or the poop emoji.

  17. History of Emoji
      In 2010, emoji were incorporated into the Unicode Standard to allow compatibility with Japanese phones. iPhone users found out. Poop emoji appeared in messages around the world soon after. Today, emoji are spread over multiple Unicode blocks: the Japanese blocks and a special miscellaneous block. The addition of new emoji is governed by the Unicode Consortium (not Apple).

  18. References

  19. References
      [Uni19] The Unicode Consortium. The Unicode Standard, Version 12.0 - Core Specification. Vol. 1. Mountain View, CA: The Unicode Consortium, 2019. ISBN: 9781936213238.

  20. Thank you!
      That’s it! Thank you for coming! Have fun at TaCoS 29!
      en.axtimhaus.eu
      zimmermann@cl.uni-heidelberg.de
