Temporal classification for historical Romanian texts

http://nlp.unibuc.ro Temporal classification for historical Romanian texts. Alina Maria Ciobanu, Anca Dinu, Liviu P. Dinu, Vlad Niculae, Octavia-Maria Şulea. Center for Computational Linguistics, University of Bucharest. August 2013.


  1. Temporal classification for historical Romanian texts
     Alina Maria Ciobanu, Anca Dinu, Liviu P. Dinu, Vlad Niculae, Octavia-Maria Şulea
     Center for Computational Linguistics, University of Bucharest
     http://nlp.unibuc.ro
     August 2013

  2. Temporal text classification
     ▶ Classifying texts by the time frame in which they were written
     ▶ Coarseness level: century
     ▶ Supervised classification approach
     ▶ Romanian texts from the 16th to the 20th century

  3. Historical texts (16th century)
     Beginning of written Romanian, first printed books. Religious texts and translations.
     ▶ Codicele Todorescu
     ▶ Codicele Martian
     ▶ Coresi, Evanghelia cu învățătură
     ▶ Coresi, Lucrul apostolesc
     ▶ Coresi, Psaltirea slavo-română
     ▶ Coresi, Tâlcul evangheliilor
     ▶ Coresi, Tetraevanghelul
     ▶ Manuscrisul de la Ieud
     ▶ Palia de la Orăștie
     ▶ Psaltirea Hurmuzaki

  5. Historical texts (17th century)
     Social, economic, cultural and political chronicles of Moldavia
     ▶ The Bible
     ▶ Miron Costin, Letopisețul Țării Moldovei
     ▶ Miron Costin, De neamul moldovenilor
     ▶ Grigore Ureche, Letopisețul Țării Moldovei
     ▶ Dosoftei, Viața și petreacerea sfinților
     ▶ Varlaam Motoc, Cazania
     ▶ Varlaam Motoc, Răspunsul împotriva Catehismului calvinesc

  6. Historical texts (18th century)
     More chronicles, beginning of literature
     ▶ Antim Ivireanul, Opere
     ▶ Axinte Uricariul, Letopisețul Țării Românești și al Țării Moldovei
     ▶ Ioan Canta, Letopisețul Țării Moldovei
     ▶ Dimitrie Cantemir, Istoria ieroglifică
     ▶ Dimitrie E. Brașoveanul, Gramatica românească
     ▶ Ion Neculce, O samă de cuvinte

  7. Historical texts (19th and 20th century)
     19th century:
     ▶ Mihai Eminescu, Opere (journalistic works), vol. IX-XIII
     20th century: literature
     ▶ Eugen Barbu, Groapa
     ▶ Mircea Cărtărescu, Orbitor
     ▶ Marin Preda, Cel mai iubit dintre pământeni

  8. Preprocessing
     ▶ All texts had already been digitized and transcribed into the Latin alphabet.
     ▶ Removed: numbers, references and annotations
     ▶ Tokenized: on whitespace and punctuation
     ▶ Split: 500-sentence chunks
     ▶ Train-test split with ratio 1/4
     ▶ 3-fold cross-validation for model selection
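
The preprocessing steps on slide 8 could be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the square-bracket annotation format, the naive sentence splitter and all function names are assumptions made for the example.

```python
import re

def clean(text):
    """Remove annotations and numbers before tokenizing."""
    # Assumption: annotations and references appear in square brackets.
    text = re.sub(r"\[[^\]]*\]", " ", text)
    # Drop any run of digits.
    return re.sub(r"\d+", " ", text)

def split_sentences(text):
    """Naive splitter: a sentence ends at '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def tokenize(sentence):
    """Split on whitespace and punctuation, keeping lowercase word tokens."""
    return re.findall(r"\w+", sentence.lower())

def chunk(sentences, size=500):
    """Group consecutive sentences into chunks of at most `size` sentences."""
    return [sentences[i:i + size] for i in range(0, len(sentences), size)]
```

Each chunk would then be one classification instance labelled with the century of its source text.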

  9. Features
     ▶ lengths (avg. characters per word, avg. words per sentence)
     ▶ stopwords (50 most common words)
     ▶ endings (suffixes of length 1 to 3)
     ▶ dictionary (unambiguous matches in DexOnline)
     ▶ obsolete marker (all dictionaries)
     ▶ dictionaries of archaisms (2 dictionaries)
     ▶ published before 1975 (7 dictionaries)
     ▶ published after 1975 (31 dictionaries)
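
As an illustration of the first three feature groups on slide 9, here is a small sketch. It assumes the stopword list is simply the most frequent tokens in the corpus and that endings are counted as raw word-final character sequences; the slides do not specify either detail, and the function names are hypothetical. The dictionary features are omitted because they depend on DexOnline data.

```python
from collections import Counter

def length_features(sentences):
    """sentences: list of token lists.
    Returns (avg. characters per word, avg. words per sentence)."""
    words = [w for s in sentences for w in s]
    if not words:
        return 0.0, 0.0
    return sum(len(w) for w in words) / len(words), len(words) / len(sentences)

def top_stopwords(all_tokens, k=50):
    """Assumption: stopwords are approximated by the k most frequent tokens."""
    return [w for w, _ in Counter(all_tokens).most_common(k)]

def stopword_features(tokens, stopwords):
    """Relative frequency of each stopword within one chunk."""
    counts = Counter(tokens)
    n = len(tokens) or 1
    return [counts[w] / n for w in stopwords]

def ending_counts(tokens, max_len=3):
    """Count word-final character sequences of length 1..max_len."""
    counts = Counter()
    for w in tokens:
        for n in range(1, max_len + 1):
            if len(w) >= n:
                counts[w[-n:]] += 1
    return counts
```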

  10. Results
      Table columns: lengths, stopwords, endings, dictionary (✓ when the feature group is used), RF, SVM.
      The ✓ marks are not aligned to rows in this transcript, so only the RF and SVM values are listed, in their original order. The * marks are kept where they appear in the source.

         RF       SVM
         25.38    25.38
         86.58    79.87
         98.51    95.16
         97.76    97.02
         98.51    96.27
         98.51    94.78
         98.88  * 98.14
         98.51    97.77
         68.27    22.01
         92.92    23.13
         98.14    23.89
         98.50    23.14
         98.14    23.53
         98.51    25.00
         98.88    23.14
       * 99.25    22.75

  11. Test results
      ▶ Linear SVM, C = 10^4: 98.8% accuracy, confusion: 17th and 20th century
      ▶ Random forest, 50 trees: 97.7% accuracy, confusion: 16th and 17th century
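
Slide 11 reports an accuracy and the pair of centuries each classifier confuses. A minimal sketch of how both numbers can be read off a list of predictions is shown below; the function names are hypothetical and no data from the study is reproduced here.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of chunks whose predicted century matches the true one."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def most_confused_pair(y_true, y_pred):
    """Return the unordered pair of labels with the most misclassifications,
    or None if every prediction is correct."""
    errors = Counter(
        frozenset((t, p)) for t, p in zip(y_true, y_pred) if t != p
    )
    if not errors:
        return None
    pair, _ = errors.most_common(1)[0]
    return tuple(sorted(pair))
```

Counting errors over unordered pairs treats "17th predicted as 20th" and "20th predicted as 17th" as the same confusion, which matches how the slide phrases it.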

  12. χ² feature selection
      $$\chi^2(f) = \sum_{f,y} \frac{(N_{f,y} - E_{f,y})^2}{E_{f,y}}$$
      [Figure: ten panels, one per selected word (amu, au, care, cari, de, derept, lu, pe, pre, se); x-axis: century (16 to 20); y-axis: 0.0 to 1.0]
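
The χ² score on slide 12 can be computed from a contingency table. The sketch below assumes the usual reading of the formula: N_{f,y} is the observed number of chunks with feature value f (here, word present or absent) and class y, and E_{f,y} is the count expected if feature and class were independent. The slide does not define these terms, so this reading and the binary presence encoding are assumptions.

```python
from collections import Counter

def chi2(feature_present, labels):
    """feature_present: one bool per chunk; labels: one class per chunk.
    Sums (N - E)^2 / E over every (feature value, class) cell."""
    n = len(labels)
    observed = Counter(zip(feature_present, labels))  # N_{f,y}
    f_totals = Counter(feature_present)
    y_totals = Counter(labels)
    score = 0.0
    for f in f_totals:
        for y in y_totals:
            expected = f_totals[f] * y_totals[y] / n  # E_{f,y} under independence
            score += (observed[(f, y)] - expected) ** 2 / expected
    return score
```

A word that occurs only in one century gets a high score, while a word spread evenly across centuries scores close to zero.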

  13. Results and sanity check
      ▶ RF from last slides: 98.8%
      ▶ NB predicting century: 90.1% accuracy
      ▶ RF predicting century (20 trees): 100%
      ▶ RF predicting source document: 72.1%
      ▶ RF predicting document, evaluated for century: 98.1%
      ▶ > 95% confidence on 20th century novels set in the past
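
The "predicting document, evaluated for century" check can be illustrated as follows: a document-level prediction counts as correct whenever the predicted document belongs to the same century as the true one. This is a sketch under that assumption; the mapping below lists only a few titles from the slides, and the function name is hypothetical.

```python
# Century of a few source documents, taken from the corpus slides.
DOC_CENTURY = {
    "Psaltirea Hurmuzaki": 16,
    "Cazania": 17,
    "Groapa": 20,
    "Orbitor": 20,
}

def century_accuracy(doc_true, doc_pred, doc_to_century=DOC_CENTURY):
    """Score document predictions by whether they land in the right century."""
    hits = sum(
        doc_to_century[t] == doc_to_century[p]
        for t, p in zip(doc_true, doc_pred)
    )
    return hits / len(doc_true)
```

Confusing two documents from the same century costs nothing under this measure, which is why it can be much higher than plain document accuracy.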
