The Tagged Corpus (SYN2010) as a Help and a Pitfall in Word-formation Research. Klára Osolsobě, ÚČJ FF MU Brno, osolsobe@phil.muni.cz. DERIMO 2019, 25.09.2019
Goals: SAUČ, a corpus-based linguistic manual; three steps of the automatic analysis; conclusion.
Pitfall No 1: Tokenization. The affixes described in SAUČ are usually graphically part of a single lexeme. MWE (multiword expression)
Tokenization: circumfix na- -o. natvrdo × na tvrdo: two ways of writing. A preposition that is not graphically joined to the newly created adverb is an independent unit tagged as a preposition, and its nominal part is very often not identified. Only the variants written together are therefore included in the frequency report.
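A minimal sketch of why the spelling variant matters for a single-token query. The two phrases and the helper `count_single_token_hits` are invented for illustration, and whitespace splitting only stands in for the corpus tokenizer; this is not the query used for SAUČ.

```python
import re

def count_single_token_hits(text, pattern):
    """Count whitespace-separated tokens that match the pattern as a whole."""
    return sum(1 for tok in text.split() if re.fullmatch(pattern, tok))

# Invented example phrases: the same adverb written together and apart.
joined = "vejce natvrdo"
split = "vejce na tvrdo"

pattern = r"na.+o"  # a token beginning with na- and ending in -o

hits_joined = count_single_token_hits(joined, pattern)
hits_split = count_single_token_hits(split, pattern)
print(hits_joined, hits_split)
```

The variant written apart yields no hit, because neither "na" nor "tvrdo" matches the pattern on its own.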
Tokenization: prefix + reflexive particle se (za- se). zamyslel se nad tím ("he thought about it") × on se nad tím asi ani pořádně nezamyslel ("he probably did not even think about it properly"). Only the two most frequent word-order variants (se at position -1 or 1 relative to the verb) enter the frequency report.
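The window restriction can be sketched as follows, assuming whitespace tokenization and a hypothetical helper `has_adjacent_se`; the corpus query itself is not given on the slide, so this only mirrors the idea of checking positions -1 and 1.

```python
def has_adjacent_se(tokens, verb_index):
    """True if 'se' stands directly before or directly after the verb."""
    neighbours = []
    if verb_index > 0:
        neighbours.append(tokens[verb_index - 1])
    if verb_index + 1 < len(tokens):
        neighbours.append(tokens[verb_index + 1])
    return any(tok.lower() == "se" for tok in neighbours)

# The two examples from the slide.
adjacent = "zamyslel se nad tím".split()
distant = "on se nad tím asi ani pořádně nezamyslel".split()

found_adjacent = has_adjacent_se(adjacent, 0)                # verb is the first token
found_distant = has_adjacent_se(distant, len(distant) - 1)   # verb is the last token
print(found_adjacent, found_distant)
```

The second sentence is missed because the particle stands six tokens before the verb, outside the window.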
Pitfall No 2: Assigning lemma + tag. The interpretation is based on the morphological dictionary (MorfFlex CZ, LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics, Charles University in Prague, http://hdl.handle.net/11858/00-097C-0000-0015-A780-9). Productivity measurement is therefore dictionary-dependent.
Gaps in the dictionary: circumfix na- -o. na-prav-o, na-lev-o × na-těsn-o, na-kratičk-o. Queries: [lemma="na.*o" & tag="D.*"] and [lemma="na.*o" & tag="X.*"]. The low-frequency words (unrecognised by the automatic morphological analysis) conform to the model of this type of compound adverb in Czech and demonstrate its productivity. The picture of productivity based on the results of automatic analysis is therefore inaccurate.
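The effect of the two queries can be imitated on a toy list, as a sketch only. The (lemma, tag) pairs below are hypothetical: the lemmas come from the slide, but the tags are shortened placeholders, not actual SYN2010 output, and the helper `query` merely approximates full-string regular-expression matching on each attribute.

```python
import re

# Hypothetical (lemma, tag) pairs; tags are shortened placeholders.
tokens = [
    ("napravo", "Db"),     # recognised adverb
    ("nalevo", "Db"),      # recognised adverb
    ("natěsno", "X@"),     # unrecognised by the dictionary
    ("nakratičko", "X@"),  # unrecognised by the dictionary
    ("nad", "RR"),         # preposition, should match neither query
]

def query(tokens, lemma_re, tag_re):
    """Return lemmas whose lemma and tag both match in full."""
    return [lemma for lemma, tag in tokens
            if re.fullmatch(lemma_re, lemma) and re.fullmatch(tag_re, tag)]

recognised = query(tokens, r"na.*o", r"D.*")
unrecognised = query(tokens, r"na.*o", r"X.*")
print(recognised, unrecognised)
```

A count based on the first query alone would see only half of the adverbs in this toy list.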
Gaps in the dictionary: suffix -oš. Mil-oš, Jug-oš × Káj-oš, Tal-oš. The query [lemma=".*oš" & tag="NN[MI].*"] gives 125 lemmata, of which 69 are relevant. The query [lemma="(.*oš)|(.*oš[eiů])|(.*oších)|(.*ošům)" & tag="X.*"] gives 282 words, which contain 36 relevant lemmata. The examples found by the second query indicate that, without it, the picture of productivity would be significantly skewed.
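A small worked calculation with the figures from the slide gives a sense of the skew. It rests on two assumptions not stated in the source: that the relevant lemmata of the two queries do not overlap, and that the ratio for the second query may be read loosely even though it sets lemmata against word forms.

```python
# Figures reported on the slide.
hits_known, relevant_known = 125, 69       # query on tag NN[MI].*
hits_unknown, relevant_unknown = 282, 36   # query on tag X.* (hits are word forms)

precision_known = relevant_known / hits_known
precision_unknown = relevant_unknown / hits_unknown

# Assumption: the two sets of relevant lemmata are disjoint.
total_relevant = relevant_known + relevant_unknown
share_unrecognised = relevant_unknown / total_relevant

print(round(precision_known, 3))     # 0.552
print(round(precision_unknown, 3))   # 0.128
print(round(share_unrecognised, 3))  # 0.343
```

Under these assumptions, roughly a third of the relevant -oš lemmata would be invisible to a query restricted to forms the dictionary recognises.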
Pitfall No 3: Disambiguation, the process of identifying which interpretation of a word is used in context. The biggest problem here is homonymy (it affects cases of part-of-speech conversion, polyfunctional affixes, and overgeneration by formal queries). Corpus analysis results are "disambiguation-dependent".
-cí: vedou-cí (leader/leading). vedoucí (↖1/2/3) 8.348: (1) gerund (e.g. Slepý vedoucí slepého je nebezpečný. = "The blind leading the blind is dangerous."), (2) adjective (Vedoucí disidenti dostali dlouhé tresty. = "The leading dissidents had received long prison sentences.") and (3) noun (profesionální vedoucí = "professional leader"). (1) and (2) are not distinguished by the automatic morphological analysis. (3) is tagged, but the results of the disambiguation are far from satisfactory.
cestují-cí (travelling/traveller)
Disambiguation: sou- -í (overgeneration). soutěžení, souručenství, soukromí. Lemmas that end in the given string of characters but do not correspond to words created by the affix were excluded. sou- -í: soustřed-i-t se → soustřed-ěn-í (concentration), ží-t → sou-ži-t-í (coexistence), soutěž-i-t → soutěž-en-í (competition), souž-i-t → souž-en-í (suffering/problem), soused → soused-ství (neighborhood), soukromí (privacy)
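The overgeneration of a purely formal query can be sketched with the derivation pairs listed above. The filter used here, discarding a hit when its base already begins with sou-, is an illustrative heuristic assumed for this sketch; the slide says only that non-matching lemmas were excluded, not how.

```python
import re

# (base, derived lemma) pairs as given on the slide.
pairs = [
    ("soustředit se", "soustředění"),
    ("žít", "soužití"),
    ("soutěžit", "soutěžení"),
    ("soužit", "soužení"),
    ("soused", "sousedství"),
]

formal_query = re.compile(r"sou.*í")  # purely formal: begins with sou, ends in í

hits = [(base, lemma) for base, lemma in pairs if formal_query.fullmatch(lemma)]

# Assumed heuristic: if the base already starts with "sou", the circumfix did not add it.
genuine = [lemma for base, lemma in hits if not base.startswith("sou")]
overgenerated = [lemma for base, lemma in hits if base.startswith("sou")]

print(genuine)
print(overgenerated)
```

All five lemmas match the formal query, but only soužití has a base without the initial sou-.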
Conclusion: Working with the results of automatic part-of-speech tagging has its limits. The method of data mining is described in sufficient detail at the beginning of each frequency report (the corpus query). Despite the simplifications mentioned above, it is beyond dispute that without the results of automatic tagging, any way of creating the Dictionary of Affixes Used in Czech would be a) incomparably more time-consuming, b) more expensive and c) less objective in its results. A detailed morphological description of word forms based on the data gained during the work on SAUČ is reflected in the NovaMorf project (Osolsobě et al. 2017).
NOVAMORF (https://sites.google.com/site/koncepcenovamorf/)
Thank you for your attention. "Více soch je sou-soš-í, více žen je s-ouž-en-í / sou-žen-í." ("Several sculptures create a sculptural group, several women create a problem.")