Tokenization and Word Segmentation
Daniel Zeman, Rudolf Rosa
March 6, 2020
NPFL120 Multilingual Natural Language Processing
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics
Tokenization and Word Segmentation
• IMPORTANT because:
  • Training tokenization ≠ test tokenization
  • ⇒ accuracy goes down
• Not always trivial
• May interact with morphology
• May include normalization (character-level)
Tokenization
Example: «¡María, te amo!», exclamó Juan. (“María, I love you!” Juan exclaimed.)
[Figure: the sentence tagged twice. Split on whitespace only: «¡María,/X te/PRON amo!»,/X exclamó/VERB Juan./X. After tokenization, the quoted part becomes «/PUNCT ¡/PUNCT María/PROPN ,/PUNCT te/PRON amo/VERB !/PUNCT »/PUNCT ,/PUNCT.]
• Classic tokenization:
  • Separate punctuation from words
  • Recognize certain clusters of symbols like “...”
  • Perhaps keep together things like user@mail.x.edu
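The three steps of classic tokenization can be sketched with a single regular expression. This is a minimal illustration, not the tokenizer used in the course: the function name `classic_tokenize` and the simplified e-mail pattern are assumptions made for this example.

```python
import re

# Alternatives are tried left to right, so the more specific patterns come first.
TOKEN_RE = re.compile(
    r"""
    [\w.+-]+@[\w-]+(?:\.[\w-]+)+   # e-mail-like strings kept together (simplified, hypothetical pattern)
  | \.{3}                          # three dots recognized as one cluster
  | \w+                            # a run of letters, digits or underscores
  | [^\w\s]                        # any other single non-space symbol
    """,
    re.VERBOSE,
)

def classic_tokenize(text):
    """Return the list of tokens found in text (whitespace is discarded)."""
    return TOKEN_RE.findall(text)

print(classic_tokenize("«¡María, te amo!», exclamó Juan."))
print(classic_tokenize("Wait... mail user@mail.x.edu, please!"))
```

Putting the e-mail and "..." alternatives before the generic ones is what keeps those clusters intact; any cluster not listed falls through to the single-symbol case.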
Using Unicode Character Categories
• https://perldoc.perl.org/perlunicode.html

    # Put spaces around every Unicode punctuation character (general category P).
    $text =~ s/(\pP)/ $1 /g;
    # Trim leading and trailing whitespace.
    $text =~ s/^\s+//;
    $text =~ s/\s+$//;

• Optionally recombine email addresses, URLs etc.
• Some problems
  • haven ’ t (English; should be have n’t)
  • instal · lació (Catalan; should be 1 token)
  • single quote (punctuation) misspelled as acute accent (modifier letter)
  • writing systems without spaces
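The same idea can be sketched in Python. Python's built-in `re` module does not support `\pP`, so this version checks each character's general category with `unicodedata.category` instead. The function name `tokenize` is an assumption for this example, and unlike the Perl snippet it returns a list of tokens rather than a spaced string.

```python
import unicodedata

def tokenize(text):
    """Surround every punctuation character with spaces, then split on whitespace."""
    spaced = []
    for ch in text:
        # General categories starting with "P" (Pc, Pd, Ps, Pe, Pi, Pf, Po)
        # are the ones matched by Perl's \pP.
        if unicodedata.category(ch).startswith("P"):
            spaced.append(f" {ch} ")
        else:
            spaced.append(ch)
    # str.split() with no argument also drops leading and trailing whitespace.
    return "".join(spaced).split()

print(tokenize("«¡María, te amo!», exclamó Juan."))
print(tokenize("haven\u2019t"))               # right single quotation mark: split off
print(tokenize("haven\u00b4t"))               # acute accent is not category P: left alone
print(tokenize("instal\u00b7laci\u00f3"))     # Catalan middle dot: word is split
```

The last three calls reproduce the problems listed above: the apostrophe is split in the wrong place, the acute accent used instead of a quote is not recognized as punctuation at all, and the Catalan word falls apart at the middle dot.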
Normalization
• Often part of tokenization
• Decimal comma to decimal point; separator of thousands
• Unicode directed quotes and long hyphens to undirected ASCII
  • “English”, ‘English’, „česky“, ‚česky‘, « français », ‹ français ›, „magyar”, »magyar«, ’magyar’
  • Sometimes mistaken for ACUTE ACCENT, PRIME (math) etc.
  • TeX-like ASCII directed quotes `` and '' and hyphens -- and ---
• English/ASCII punctuation in foreign writing systems
  • 「你看過《三國演義》嗎?」他問我。
  • “你看過‘三國演義’嗎?”他問我.
• European/ASCII digits in Arabic, Devanagari etc.
  • 0 1 2 3 4 5 6 7 8 9 (Western Arabic/European)
  • ٠ ١ ٢ ٣ ٤ ٥ ٦ ٧ ٨ ٩ (Eastern Arabic)
  • ० १ २ ३ ४ ५ ६ ७ ८ ९ (Devanagari)
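Two of these normalizations, quotes and dashes to ASCII and native digits to European digits, can be sketched as follows. The mapping table is an illustrative assumption and covers only some of the characters shown above; decimal commas, thousands separators and CJK punctuation are deliberately left out because they need language-specific rules.

```python
import unicodedata

# Hypothetical mapping table: directed quotes and long dashes to undirected ASCII.
QUOTE_DASH_MAP = {
    "\u201c": '"', "\u201d": '"', "\u201e": '"',   # double quotes: left, right, low-9
    "\u00ab": '"', "\u00bb": '"',                  # double angle quotes (guillemets)
    "\u2018": "'", "\u2019": "'", "\u201a": "'",   # single quotes: left, right, low-9
    "\u2039": "'", "\u203a": "'",                  # single angle quotes
    "\u2013": "-", "\u2014": "-",                  # EN DASH, EM DASH
}
_TABLE = str.maketrans(QUOTE_DASH_MAP)

def normalize(text):
    """Map quotes and dashes to ASCII and any decimal digit to 0-9."""
    text = text.translate(_TABLE)
    out = []
    for ch in text:
        # Category Nd covers decimal digits of every script,
        # including Eastern Arabic and Devanagari.
        if unicodedata.category(ch) == "Nd":
            out.append(str(unicodedata.decimal(ch)))
        else:
            out.append(ch)
    return "".join(out)

print(normalize("\u201e\u010desky\u201c"))     # Czech-style quotes
print(normalize("\u0660\u0661\u0662\u0663"))   # Eastern Arabic digits
print(normalize("\u0966\u0967\u0968"))         # Devanagari digits
```

Whatever mapping is chosen, it has to be applied identically to training and test data; otherwise the mismatch described at the start of the lecture reappears.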
Word Segmentation
Example: Vámonos al mar. (“Let’s go to the sea.”)
  Orthographic words: Vámonos/VERB? al/X mar/NOUN ./PUNCT
  Syntactic words:    Vamos/VERB nos/PRON a/ADP el/DET mar/NOUN ./PUNCT
• Syntactic word vs. orthographic word
• Multi-word tokens
• Two-level scheme:
  • Tokenization (low level, punctuation, concatenative)
  • Word segmentation (higher level, not necessarily concatenative)
Word Segmentation
• Lexicalist hypothesis:
  • Words (not morphemes) are the basic units in syntax
  • Words enter into dependency relations
  • Words are forms of lemmas and have morphological features
• Orthographic vs. syntactic word
  • Syntactically autonomous part of orthographic word
  • Contractions (al = a + el)
  • Clitics (vámonos = vamos + nos)
    • ¿A qué hora nos vamos mañana? (“What time are we leaving tomorrow?”)
    • Nos despertamos a las cinco. (“We wake up at five.”)
    • Nuestro guía nos despierta a las cinco. (“Our guide wakes us up at five.”)
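Splitting orthographic words into syntactic words can be sketched with a lookup table. This is only a toy illustration: the table `MULTIWORD_TOKENS` and the function `segment_words` are hypothetical names, and a real system would need a lexicon or a trained model, because the split is not always a simple cut of the surface string (vámonos is not vamos + nos concatenated).

```python
# Toy table of multi-word tokens (hypothetical; contains only the slide's examples).
MULTIWORD_TOKENS = {
    "al": ["a", "el"],               # contraction
    "vámonos": ["vamos", "nos"],     # verb + clitic, non-concatenative
}

def segment_words(tokens):
    """Map each surface token to its list of syntactic words.

    Both levels are kept, so the original text can still be recovered
    from the tokens while syntax operates on the words.
    Note: the looked-up words are lowercase; original casing is not restored.
    """
    result = []
    for token in tokens:
        words = MULTIWORD_TOKENS.get(token.lower(), [token])
        result.append((token, words))
    return result

for token, words in segment_words(["Vámonos", "al", "mar", "."]):
    print(token, "->", " + ".join(words))
```

Returning pairs rather than a flat word list mirrors the two-level scheme: tokenization stays recoverable, and word segmentation is layered on top of it.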
Contractions in Arabic
يتنازل عن العرش لابنه بودوان
“He abdicated in favour of his son Baudouin.”

  Transliteration  UPOS           Gloss
  yatanāzalu       VERB           surrendered
  ʿan              ADP            on
  al-ʿarši         NOUN           the throne
  li+ibni+hi       ADP+NOUN+PRON  to + son + his
  būdūān           PROPN          Baudouin
Segmentation as Part of Morphological Analysis
• Arabic
  • ElixirFM: http://lindat.mff.cuni.cz/services/elixirfm/run.php
  • Enter “لابنه” (labnh)
• Sanskrit
  • Sanskrit Reader Companion: http://sanskrit.inria.fr/DICO/reader.fr.html
  • Select Input convention = Devanagari
  • Enter “सकलार्थशास्त्रसारं जगति समालोक्य विष्णुशर्मेदम्” (sakalārthaśāstrasāraṁ jagati samālokya viṣṇuśarmedam)
• German compound splitting (unsupervised)
Chinese Word Segmentation
Unsegmented: 現在我們在瓦倫西亞。
  Xiàn zài wǒ men zài wǎ lún xī yǎ.
  “We are now in Valencia.”
Segmented:

  Word      Pinyin     Gloss     UPOS
  現在      Xiànzài    Now       ADV
  我們      wǒmen      we        PRON
  在        zài        in        ADP
  瓦倫西亞  Wǎlúnxīyǎ  Valencia  PROPN
  。        .          .         PUNCT
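The slide does not name a segmentation algorithm. As one simple baseline, greedy forward maximum matching against a word list is sketched below; it is swapped in here purely for illustration. The lexicon, the `max_len` limit and the function name `max_match` are assumptions, and real segmenters are usually trained statistical or neural models.

```python
def max_match(text, lexicon, max_len=4):
    """Greedy forward maximum matching.

    At each position, take the longest substring (up to max_len characters)
    that is in the lexicon; if none matches, emit a single character.
    """
    words = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                i += length
                break
    return words

# Toy lexicon containing only the words of the example sentence.
LEXICON = {"現在", "我們", "在", "瓦倫西亞"}
print(max_match("現在我們在瓦倫西亞。", LEXICON))
```

With this lexicon the sentence comes out as the five words shown in the segmented version; with an incomplete lexicon the method falls back to single characters, which is one reason it serves only as a baseline.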
Words in Japanese
“I went to the beauty salon of Kyōdō [, Beyond-R.]”

  Token   Romanization  Lemma   Gloss         UPOS   Relation
  経堂    Kyōdō         経堂    Kyōdō         PROPN  nmod
  の      no            の      of            ADP    case
  美容室  biyōshitsu    美容室  beauty-salon  NOUN   obl
  に      ni            に      to            ADP    case
  行っ    it            行く    go            VERB   (root)
  て      te            て      CONV          SCONJ  mark
  き      ki            来る    come          AUX    aux
  まし    mashi         ます    will          AUX    aux
  た      ta            た      PAST          AUX    aux
Words in Japanese
“I went to the beauty salon of Kyōdō [, Beyond-R.]”

  Token     Romanization  Lemma   Gloss         UPOS   Relation  Features
  経堂      Kyōdō         経堂    Kyōdō         PROPN  nmod
  の        no            の      of            ADP    case
  美容室    biyōshitsu    美容室  beauty-salon  NOUN   obl
  に        ni            に      to            ADP    case
  行って    itte          行く    going         VERB   advcl     VerbForm=Conv
  きました  kimashita     来る    come          VERB   (root)    VerbForm=Fin, Tense=Past, Polite=Form