A note on the letter-sound correspondence

◮ Alphabets use letters to encode sounds (consonants, vowels).
◮ But the correspondence between spelling and pronunciation in many languages is quite complex, i.e., not a simple one-to-one correspondence.
◮ Example: English
  ◮ same spelling – different sounds: ought, cough, tough, through, though, hiccough
  ◮ silent letters: knee, knight, knife, debt, psychology, mortgage
  ◮ one letter – multiple sounds: exit, use
  ◮ multiple letters – one sound: the, revolution
  ◮ alternate spellings: jail or gaol; but chef does not have an alternative spelling seagh (despite s as in sure, ea as in dead, gh as in laugh)

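The "same spelling – different sounds" point can be made concrete with a small sketch. The respellings below are informal and hypothetical (they are not IPA and not from the slides); they only serve to show that a single letter sequence cannot be mapped to a single sound.

```python
# Hypothetical mini-lexicon: informal respellings (not IPA) of the "ough"
# portion of each word, chosen only to show that they all differ.
OUGH_SOUNDS = {
    "ought": "aw",
    "cough": "off",
    "tough": "uff",
    "through": "oo",
    "though": "oh",
    "hiccough": "up",
}


def distinct_sounds(lexicon):
    """Return the set of distinct sounds that one spelling maps to."""
    return set(lexicon.values())


if __name__ == "__main__":
    sounds = distinct_sounds(OUGH_SOUNDS)
    print(f"'ough' is pronounced {len(sounds)} different ways "
          f"in {len(OUGH_SOUNDS)} words")
```

A lookup keyed on the spelling "ough" alone would have to pick one of these values, which is why text-to-speech systems generally need a pronunciation lexicon or context-sensitive rules rather than a letter-by-letter table.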
More examples for non-transparent letter-sound correspondences

French
(1) a. Versailles → [veRsai]
    b. été, étais, était, étaient → [ete]

Irish
(2) a. Baile Átha Cliath (Dublin) → [bl'a: kli uh]
    b. samhradh (summer) → [sauruh]
    c. scríobhaim (I write) → [shgri:m]

What is the notation used within the [ ]?

The International Phonetic Alphabet (IPA)

◮ Several special alphabets for representing sounds have been developed, the best known being the International Phonetic Alphabet (IPA).
◮ The phonetic symbols are unambiguous:
  ◮ designed so that each speech sound gets its own symbol,
  ◮ eliminating the need for
    ◮ multiple symbols used to represent simple sounds
    ◮ one symbol being used for multiple sounds.
◮ Interactive example chart: http://web.uvic.ca/ling/resources/ipa/charts/IPAlab/IPAlab.htm

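As a rough illustration of "each speech sound gets its own symbol", the sketch below compares the number of letters in a spelling with the number of symbols in a broad IPA transcription, and looks up each symbol in Python's standard unicodedata module. The two transcriptions are illustrative assumptions, not taken from the slides, and the function name describe is hypothetical.

```python
import unicodedata

# Illustrative pairs (assumed, not from the slides): spelling -> broad IPA.
EXAMPLES = {
    "the": "\u00f0\u0259",     # eth + schwa: three letters, two sounds
    "ship": "\u0283\u026ap",   # esh + small capital i + p
}


def describe(ipa):
    """List (code point, Unicode name) for each symbol in a transcription."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in ipa]


if __name__ == "__main__":
    for spelling, ipa in EXAMPLES.items():
        print(f"{spelling}: {len(spelling)} letters, {len(ipa)} IPA symbols")
        for code_point, name in describe(ipa):
            print("   ", code_point, name)
```

Because every IPA symbol is an ordinary Unicode character, a transcription can be stored and processed like any other string.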
Syllabic systems

Syllabic alphabets (Alphasyllabaries)
◮ writing systems with symbols that represent a consonant with a vowel, but the vowel can be changed by adding a diacritic (= a symbol added to the letter).
◮ Examples: Balinese, Javanese, Tibetan, Tamil, Thai, Tagalog
  (cf. also: http://www.omniglot.com/writing/syllabic.htm)

Syllabaries
◮ writing systems with separate symbols for each syllable of a language
◮ Examples: Cherokee, Ethiopic, Cypriot, Ojibwe, Hiragana (Japanese)
  (cf. also: http://www.omniglot.com/writing/syllabaries.htm#syll)

Syllabary example: Cypriot

The Cypriot syllabary or Cypro-Minoan writing is thought to have developed from the Linear A, or possibly the Linear B, script of Crete, though its exact origins are not known. It was used from about 800 to 200 BC.

[Figure: Cypriot syllabary]
(from: http://www.omniglot.com/writing/cypriot.htm)

Syllabic alphabet example: Lao

Script developed in the 14th century to write the Lao language, based on an early version of the Thai script, which was developed from the Old Khmer script, which was itself based on Mon scripts.

Example for vowel diacritics around the letter k:

[Figure: Lao letter k with vowel diacritics]
(from: http://www.omniglot.com/writing/lao.htm)

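On a computer, the "consonant plus vowel diacritic" idea shows up directly in Unicode: the consonant is one character and the vowel sign is a separate combining character attached to it. The sketch below uses the Lao letter k with two vowel signs. The choice of these two signs and the helper name with_vowel are assumptions made for illustration.

```python
import unicodedata

KO = "\u0e81"                        # LAO LETTER KO (the consonant k)
VOWEL_SIGNS = ["\u0eb4", "\u0eb8"]   # LAO VOWEL SIGN I, LAO VOWEL SIGN U


def with_vowel(consonant, sign):
    """Attach a combining vowel sign to a base consonant."""
    return consonant + sign


if __name__ == "__main__":
    for sign in VOWEL_SIGNS:
        syllable = with_vowel(KO, sign)
        names = [unicodedata.name(ch) for ch in syllable]
        # Category "Mn" marks a nonspacing (combining) mark.
        print(syllable, names, unicodedata.category(sign))
```

Each result is displayed as one written unit but is stored as two code points, which matters when counting characters or moving a cursor through text.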
Logographic writing systems

◮ Logographs (also called Logograms):
  ◮ Pictographs (Pictograms): originally pictures of things, now stylized and simplified.
    Example: development of Chinese character horse:
    [Figure: development of the Chinese character for horse]
  ◮ Ideographs (Ideograms): representations of abstract ideas
  ◮ Compounds: combinations of two or more logographs
  ◮ Semantic-phonetic compounds: symbols with a meaning element (hints at meaning) and a phonetic element (hints at pronunciation).
◮ Examples: Chinese (Zhōngwén), Japanese (Nihongo), Mayan, Vietnamese, Ancient Egyptian

Logograph writing system example: Chinese

Pictographs
[Figure: examples of pictographs]

Ideographs
[Figure: examples of ideographs]

Compounds of Pictographs/Ideographs
[Figure: examples of compounds]

(from: http://www.omniglot.com/writing/chinese_types.htm)

Semantic-phonetic compounds

[Figure: examples of semantic-phonetic compounds]

An example from Ancient Egyptian
[Figure: Ancient Egyptian example]
(from: http://www.omniglot.com/writing/egyptian.htm)

Two writing systems with unusual realization

Tactile
◮ Braille is a writing system that makes it possible to read and write through touch; it is primarily used by people who are blind or partially sighted.
◮ It uses patterns of raised dots arranged in cells of up to six dots in a 3 × 2 configuration.
◮ Each pattern represents a character, but some frequent words and letter combinations have their own pattern.

Chromatographic
◮ The Benin and Edo people in southern Nigeria have developed a system of writing based on different color combinations and symbols.
  (cf. http://www.library.cornell.edu/africana/Writing_Systems/Chroma.html)

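The six-dot cell lends itself to a compact encoding: each dot is either raised or not, so a cell is a 6-bit number with 2^6 = 64 possible patterns. The sketch below uses the Unicode Braille Patterns block, where the character for a cell is U+2800 plus one bit per raised dot (dots 1 to 3 run down the left column, dots 4 to 6 down the right). The function name cell and the small letter table are illustrative choices, not part of the slides.

```python
BRAILLE_BASE = 0x2800  # first code point of the Unicode Braille Patterns block


def cell(dots):
    """Return the Unicode Braille character for the given raised dots (1-6)."""
    code = BRAILLE_BASE
    for dot in dots:
        if not 1 <= dot <= 6:
            raise ValueError(f"six-dot Braille has no dot {dot}")
        code |= 1 << (dot - 1)  # dot n is stored in bit n-1
    return chr(code)


# Dot patterns for the first three letters of the Braille alphabet.
LETTERS = {"a": (1,), "b": (1, 2), "c": (1, 4)}

if __name__ == "__main__":
    for letter, dots in LETTERS.items():
        print(letter, dots, cell(dots))
    print("possible six-dot patterns:", 2 ** 6)
```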
Braille alphabet

[Figure: Braille alphabet]

Chromatographic system

[Figure: chromatographic writing system]

Relating writing systems to languages

◮ There is not a simple correspondence between a writing system and a language.
◮ For example, English uses the Roman alphabet, but Arabic numerals (e.g., 3 and 4 instead of III and IV).
◮ We'll look at three other examples:
  ◮ Japanese
  ◮ Korean
  ◮ Azeri

Japanese

Japanese: logographic system kanji, syllabary katakana, syllabary hiragana

◮ kanji: 5,000-10,000 borrowed Chinese characters
◮ katakana
  ◮ used mainly for non-Chinese loan words, onomatopoeic words, foreign names, and for emphasis
◮ hiragana
  ◮ originally used only by women (10th century), but codified in 1946 with 48 syllables
  ◮ used mainly for word endings, kids' books, and for words with obscure kanji symbols
◮ romaji: Roman characters

Japanese example

The example uses kanji (red), hiragana (black), and katakana (blue):

[Figure: Japanese text sample]

Translation:
Capsule Hotel
A simple hotel where each room is capsule-shaped. When businessmen miss the last train home, they can stay overnight very cheaply instead of paying a lot of money to go home by taxi.

(from: http://www.omniglot.com/writing/japanese.htm#origin)

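The color coding in the example can be approximated in software, because the three scripts occupy different Unicode blocks. The sketch below is a simplification under stated assumptions: it checks only the main Hiragana, Katakana, and CJK Unified Ideographs blocks and ignores extension blocks and half-width forms. The sample phrase (roughly "capsule hotels are cheap") is an illustrative string, not the text from the slide.

```python
from collections import Counter


def script_of(ch):
    """Classify one character by Unicode block (main blocks only)."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x309F:
        return "hiragana"
    if 0x30A0 <= cp <= 0x30FF:
        return "katakana"
    if 0x4E00 <= cp <= 0x9FFF:
        return "kanji"
    return "other"


def script_counts(text):
    """Count how many characters of each script a string contains."""
    return Counter(script_of(ch) for ch in text)


if __name__ == "__main__":
    sample = "カプセルホテルは安い"  # illustrative phrase, not from the slide
    print(script_counts(sample))
```

A single sentence mixing all three scripts is normal in Japanese, so software that handles Japanese text cannot assume one script per language.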
Korean

"Korean writing is an alphabet, a syllabary and logographs all at once." (http://home.vicnet.net.au/~ozideas/writkor.htm)

◮ The hangul system was developed in 1444 during King Sejong's reign.
◮ There are 24 letters: 14 consonants and 10 vowels
◮ But the letters are grouped into syllables, i.e. the letters in a syllable are not written separately as in the English system, but together form a single character.
  E.g., "Hangeul" (from: http://www.omniglot.com/writing/korean.htm):
  [Figure: the word "Hangeul" written in hangul]
◮ In South Korea, hanja (logographic Chinese characters) are also used.

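The grouping of letters into a single syllable character is mirrored in Unicode, which assigns every precomposed hangul syllable a code point computed from its letters. The sketch below uses the standard arithmetic: 0xAC00 + (lead × 21 + vowel) × 28 + tail. Note that Unicode indexes 19 leading consonants, 21 vowels, and 27 final consonants (plus "none"), because it counts doubled and compound letters separately from the 24 basic letters mentioned above. The function names compose and decompose are illustrative.

```python
import unicodedata

S_BASE = 0xAC00   # code point of the first precomposed hangul syllable
V_COUNT = 21      # number of vowel indices used by Unicode
T_COUNT = 28      # number of final-consonant indices, including "none" (0)


def compose(lead, vowel, tail=0):
    """Build one syllable character from letter indices."""
    return chr(S_BASE + (lead * V_COUNT + vowel) * T_COUNT + tail)


def decompose(syllable):
    """Split a syllable character into its individual letters (jamo)."""
    return list(unicodedata.normalize("NFD", syllable))


if __name__ == "__main__":
    word = compose(18, 0, 4) + compose(0, 18, 8)  # h-a-n + g-eu-l
    print(word)
    for syllable in word:
        print(syllable, [unicodedata.name(j) for j in decompose(syllable)])
```

The round trip shows both views at once: two characters on the page, six letters underneath.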
Azeri

A Turkic language with speakers in Azerbaijan, northwest Iran, and (former Soviet) Georgia

◮ 7th century until 1920s: Arabic scripts. Three different Arabic scripts used
◮ 1929: Latin alphabet enforced by Soviets to reduce Islamic influence.
◮ 1939: Cyrillic alphabet enforced by Stalin
◮ 1991: Back to Latin alphabet, but slightly different than before.
  → Latin typewriters and computer fonts were in great demand in 1991

Language and Comparison of writing systems Computers Topic 1: Text and Speech Encoding Writing systems What are the pros and cons of each type of system? Alphabetic Syllabic Logographic ◮ accuracy: Can every word be written down accurately? Systems with unusual realization ◮ learnability: How long does it take to learn the system? Relation to language Comparison of systems ◮ cognitive ability: Are some systems unnatural? (e.g. Encoding written language Does dyslexia show that alphabets are unnatural?) ASCII Unicode ◮ language-particular differences: English has thousands Typing it in Spoken language of possible syllables; Japanese has very few in Transcription Why speech is hard to comparison represent Articulation Acoustics Relating written and spoken language From Speech to Text From Text to Speech 25 / 59
Comparison of writing systems

What are the pros and cons of each type of system?

◮ accuracy: Can every word be written down accurately?
◮ learnability: How long does it take to learn the system?
◮ cognitive ability: Are some systems unnatural? (e.g. Does dyslexia show that alphabets are unnatural?)
◮ language-particular differences: English has thousands of possible syllables; Japanese has very few in comparison.
◮ connection to history/culture: Will changing a writing system have social consequences?
Encoding written language

◮ Information on a computer is stored in bits.
◮ A bit is either on (= 1, yes) or off (= 0, no).
◮ A list of 8 bits makes up a byte, e.g., 01001010
◮ Just like with the base 10 numbers we’re used to, the order of the bits in a byte matters:
  ◮ Big Endian: most significant bit is leftmost (the standard way of doing things)
    ◮ The positions in a byte thus encode: 128 64 32 16 8 4 2 1
    ◮ “There are 10 kinds of people in the world; those who know binary and those who don’t” (from: http://www.wlug.org.nz/LittleEndian)
  ◮ Little Endian: most significant bit is rightmost (used, for example, on Intel machines)
    ◮ The positions in a byte thus encode: 1 2 4 8 16 32 64 128
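The place values listed above can be checked with a few lines of Python. This is only a sketch: the helper name `byte_value` and its `msb_first` parameter are made up for illustration and do not come from the slides. It adds up the place value of every bit that is on, reading the byte either with the most significant bit leftmost or rightmost.

```python
def byte_value(bits: str, msb_first: bool = True) -> int:
    """Return the number encoded by a string of 8 bits.

    msb_first=True reads the positions as 128 64 32 16 8 4 2 1,
    msb_first=False reads them as 1 2 4 8 16 32 64 128.
    """
    if len(bits) != 8 or set(bits) - {"0", "1"}:
        raise ValueError("expected exactly 8 characters, each 0 or 1")
    weights = [128, 64, 32, 16, 8, 4, 2, 1]
    if not msb_first:
        weights.reverse()
    # Add the place value of every position whose bit is on.
    return sum(w for w, b in zip(weights, bits) if b == "1")


print(byte_value("01001010"))                   # 64 + 8 + 2 = 74
print(byte_value("01001010", msb_first=False))  # 2 + 16 + 64 = 82
```

The same eight bits give two different numbers, which is why both sides have to agree on the ordering.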
Converting decimal numbers to binary - Tabular Method

Using the first 4 bits, we want to know how to write 10 in bit (or binary) notation.

8          4                  2                  1
?          ?                  ?                  ?
8 < 10     ?                  ?                  ?
1          8 + 4 = 12 > 10    ?                  ?
1          0                  8 + 2 = 10 = 10    ?
1          0                  1                  0
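The tabular method can be written as a short loop. The following is a minimal sketch, assuming a hypothetical function name `to_binary_tabular` and a `width` parameter for the number of bits; neither appears in the slides. It walks the place values from largest to smallest and switches a bit on only if the running total stays at or below the target.

```python
def to_binary_tabular(n: int, width: int = 4) -> str:
    """Convert n to binary by testing place values from left to right."""
    if not 0 <= n < 2 ** width:
        raise ValueError(f"{n} does not fit in {width} bits")
    bits = []
    total = 0
    for position in range(width - 1, -1, -1):
        value = 2 ** position          # 8, 4, 2, 1 for width 4
        if total + value <= n:         # e.g. 8 <= 10, so this bit is 1
            total += value
            bits.append("1")
        else:                          # e.g. 8 + 4 = 12 > 10, so this bit is 0
            bits.append("0")
    return "".join(bits)


print(to_binary_tabular(10))  # 1010
```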
Converting decimal numbers to binary - Division Method

Decimal      Remainder?   Binary
10/2 = 5     no           0
5/2 = 2      yes          10
2/2 = 1      no           010
1/2 = 0      yes          1010
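The division method translates directly into a loop. This is a sketch under the assumption that a function called `to_binary_division` (a made-up name) should reproduce the "Binary" column of the table: each division by 2 contributes its remainder as the next digit on the left.

```python
def to_binary_division(n: int) -> str:
    """Convert a non-negative integer to binary by repeated division by 2."""
    if n < 0:
        raise ValueError("only non-negative integers are handled")
    if n == 0:
        return "0"
    digits = ""
    while n > 0:
        n, remainder = divmod(n, 2)
        # A remainder ("yes") gives a 1, no remainder gives a 0;
        # each new digit goes to the left of the digits found so far.
        digits = str(remainder) + digits
    return digits


print(to_binary_division(10))  # 1010
```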
Using bytes to store characters

With 8 bits (a single byte), you can represent 256 different characters. Why would we want so many?

◮ If you look at a keyboard, you will find lots of non-English characters.
◮ With 256 possible characters, we can store every single letter used in English, plus all the things like commas, periods, space bar, percent sign (%), backspace, and so on.
An encoding standard: ASCII

◮ ASCII = the American Standard Code for Information Interchange
◮ 7-bit code for storing English text
◮ 7 bits = 128 possible characters.
◮ The numeric order reflects alphabetic ordering.
The ASCII chart

Codes 0–31 are used for control characters (backspace, line feed, tab, ...).

 32 (space)    48 0    65 A    82 R     97 a    114 r
 33 !          49 1    66 B    83 S     98 b    115 s
 34 "          50 2    67 C    84 T     99 c    116 t
 35 #          51 3    68 D    85 U    100 d    117 u
 36 $          52 4    69 E    86 V    101 e    118 v
 37 %          53 5    70 F    87 W    102 f    119 w
 38 &          54 6    71 G    88 X    103 g    120 x
 39 '          55 7    72 H    89 Y    104 h    121 y
 40 (          56 8    73 I    90 Z    105 i    122 z
 41 )          57 9    74 J    91 [    106 j    123 {
 42 *          58 :    75 K    92 \    107 k    124 |
 43 +          59 ;    76 L    93 ]    108 l    125 }
 44 ,          60 <    77 M    94 ^    109 m    126 ~
 45 -          61 =    78 N    95 _    110 n    127 DEL
 46 .          62 >    79 O    96 `    111 o
 47 /          63 ?    80 P            112 p
               64 @    81 Q            113 q
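The chart can be explored with Python's built-in `ord` and `chr`, which convert between a character and its code. The sketch below is only an illustration; the variable name `printable` is made up here. It also shows what "numeric order reflects alphabetic ordering" means in practice: comparing strings compares codes, so all uppercase letters sort before all lowercase letters.

```python
# Character to code and back again.
print(ord("A"), ord("Z"))        # 65 90
print(chr(97))                   # a

# Upper- and lowercase versions of a letter are always 32 apart.
print(ord("a") - ord("A"))       # 32

# Sorting compares codes: 'C' (67) comes before 'a' (97).
print(sorted(["b", "a", "C"]))   # ['C', 'a', 'b']

# Codes 32 to 126 are the printable characters shown in the chart.
printable = [chr(code) for code in range(32, 127)]
print(len(printable))            # 95
```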
E-mail issues

◮ Have you ever had something like the following at the top of an e-mail sent to you?

[The following text is in the "ISO-8859-1" character set.]
[Your display is set for the "US-ASCII" character set. ]
[Some characters may be displayed incorrectly. ]

◮ Mail sent on the internet used to only be able to transfer 7-bit ASCII messages. But now we can detect the incoming character set and adjust the input.
◮ Note that this is an example of meta-information = information which is printed as part of the regular message, but tells us something about that message.
Multipurpose Internet Mail Extensions (MIME)

MIME provides meta-information on the text, which tells us:

◮ which version of MIME is being used
◮ what the character set is
◮ if that character set was altered, how it was altered

Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
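A program can read this meta-information before deciding how to display the text. The sketch below uses Python's standard `email` package on a made-up message: the three header lines are the ones from the slide, while the body text "Hello, world." is invented for the example.

```python
from email import message_from_string

# The three header lines from the slide, followed by an invented body.
raw = (
    "Mime-Version: 1.0\n"
    "Content-Type: text/plain; charset=US-ASCII\n"
    "Content-Transfer-Encoding: 7bit\n"
    "\n"
    "Hello, world.\n"
)

msg = message_from_string(raw)

print(msg["Mime-Version"])               # which version of MIME is used
print(msg.get_content_type())            # the kind of content
print(msg.get_content_charset())         # the character set, lowercased
print(msg["Content-Transfer-Encoding"])  # how the text was altered for transfer
```

A mail reader would use the reported character set to choose how to turn the bytes of the body into characters on screen.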
Different coding systems

But wait, didn’t we want to be able to encode all languages? There are ways ...

◮ Extend the ASCII system with various other systems, for example:
  ◮ ISO 8859-1: includes extra letters needed for French, German, Spanish, etc.
  ◮ ISO 8859-7: Greek alphabet
  ◮ ISO 8859-8: Hebrew alphabet
  ◮ JIS X 0208: Japanese characters
◮ Have one system for everything → Unicode
Unicode

Problems with having multiple encoding systems:

◮ Conflicts: two encodings can use the same number for two different characters and use different numbers for the same character.
◮ Hassle: have to install many, many systems if you want to be able to deal with various languages

Unicode tries to fix that by having a single representation for every possible character.

“Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.” (www.unicode.org)
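The conflict problem is easy to reproduce. In the sketch below, the byte value 228 (hexadecimal E4) and the choice of codecs are picked purely for illustration; `iso8859_1`, `iso8859_7`, `iso8859_8` and `cp437` are codec names from Python's standard library, and `cp437` (an old PC code page) is not mentioned in the slides.

```python
# Same number, different characters: one byte read under three encodings.
byte = bytes([228])
for name in ("iso8859_1", "iso8859_7", "iso8859_8"):
    print(name, byte.decode(name))
# iso8859_1 gives a Latin letter (ä), iso8859_7 a Greek letter (δ),
# and iso8859_8 a Hebrew letter (ה).

# Same character, different numbers: ä under two encodings.
latin1_code = "ä".encode("iso8859_1")[0]
cp437_code = "ä".encode("cp437")[0]
print(latin1_code, cp437_code, latin1_code == cp437_code)
```

Without knowing which encoding was used, a program cannot tell which of these characters the byte was meant to be.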
How big is Unicode?

Version 3.2 has codes for 95,221 characters from alphabets, syllabaries and logographic systems.

◮ Uses 32 bits, meaning we can store 2^32 = 4,294,967,296 characters.
◮ 4 billion possibilities for each character? That takes a lot of space on the computer!
Compact encoding of Unicode characters

◮ Unicode has three encoding forms
  ◮ UTF-32 (32 bits): direct representation
  ◮ UTF-16 (16 bits): 2^16 = 65,536
  ◮ UTF-8 (8 bits): 2^8 = 256
◮ How is it possible to encode 2^32 possibilities in 8 bits (UTF-8)?
  ◮ Several bytes are used to represent one character.
  ◮ Use the highest bit as flag:
    ◮ highest bit 0: single character
    ◮ highest bit 1: part of a multi-byte character
◮ Nice consequence: ASCII text is already valid UTF-8.
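The size difference and the highest-bit flag can both be observed directly. This is a small sketch; the three sample characters and the variable names (`sizes`, `flags`) are chosen here for illustration. It encodes each character in all three forms and prints the highest bit of every UTF-8 byte.

```python
sizes = {}
flags = {}
for ch in ("A", "é", "€"):
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-be")   # "-be" fixes the byte order, no marker added
    utf32 = ch.encode("utf-32-be")
    sizes[ch] = (len(utf8), len(utf16), len(utf32))
    # Shift each byte right by 7 places to keep only its highest bit.
    flags[ch] = [b >> 7 for b in utf8]
    print(ch, sizes[ch], flags[ch])

# ASCII text needs no conversion: its bytes are already valid UTF-8.
print(b"plain ASCII".decode("utf-8"))
```

The ASCII letter takes one byte with the flag off, while the other two characters take two and three bytes, each with the flag on. UTF-32 always spends four bytes per character.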
How do we type everything in?

◮ Use a keyboard tailored to your specific language
  e.g. It is highly noticeable how much slower your English typing is when using a Danish-designed keyboard.
◮ Use a processor that allows you to switch between different character systems.
  e.g. Type in Cyrillic characters on your English keyboard.
◮ Use combinations of characters.
  An e followed by an ' might result in an é
◮ Pick and choose from a table of characters.

So, now we can encode every language, as long as it’s written.
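The "combinations of characters" option can be imitated in a few lines. This is only a sketch of the idea, not how any particular keyboard driver works: the table `ACCENT_KEYS` and the function `compose` are hypothetical. It relies on the fact that Unicode contains combining accent marks and that Python's `unicodedata.normalize` can merge a letter and a following accent into a single character.

```python
import unicodedata

# Hypothetical table: which typed key stands for which combining accent.
ACCENT_KEYS = {
    "'": "\u0301",  # combining acute accent
    "`": "\u0300",  # combining grave accent
}


def compose(base: str, accent_key: str) -> str:
    """Merge a letter and an accent key into one accented character."""
    combining = ACCENT_KEYS[accent_key]
    # NFC normalization replaces letter + combining mark by a single
    # precomposed character whenever Unicode defines one.
    return unicodedata.normalize("NFC", base + combining)


print(compose("e", "'"))  # é
print(compose("a", "`"))  # à
```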
Unwritten languages

Many languages have never been written down. Of the 6,700 languages spoken, 3,000 have never been written down.

◮ Salar, a Turkic language in China.
◮ Gugu Badhun, a language in Australia.
◮ Southeastern Pomo, a language in California.
The need for speech

◮ What if we want to work with an unwritten language?
◮ What if we want to examine the way someone talks and don’t have time to write it down?

Many applications for encoding speech:

◮ Building spoken dialogue systems, i.e. speak with a computer (and have it speak back).
◮ Helping people sound like native speakers of a foreign language.
◮ Helping speech pathologists diagnose problems.