
CHLT Project (IST-2001-32745) Workpackage 5. Neo-Latin



  1. CHLT Project (IST-2001-32745)
     Workpackage 5. Neo-Latin Morphological Analyser
     C.N.R. Istituto di Linguistica Computazionale
     Andrea Bozzi, Giuseppe Cappelli, Marco Passarotti, Paolo Ruffolo
     Bozzi, Passarotti, CHLT LEMLAT

  2. 1. LEMLAT: a Latin morphological analyser for CHLT

  3. LEMLAT
     • Collated lexical sources:
       – Georges
       – Gradenwitz
       – Oxford Latin Dictionary
       – TLL (partially)
     • Number of entries:
       – 58147 LES (invariable parts of the inflected forms)

  4. The LEMLAT dictionary structure

     ======= ============ =======
     ID Num. LES          COD LES
     ======= ============ =======
     A0014   ABALIENATION N31
     A0015 V ABALIEN      V1
     A0015   ABALEN       V1
     A0016   ABALIUD      I
     A0017   ABALTERUTRUM I
     A0018   ABAMBUL      V1I
     A0019   ABAMIT       N1
     A0020   ABANTE       I
     A0021 V ABARC        V2
     A0021   ABERC        V2
     ======= ============ =======

     Different LES receive the same ID number if they share a common lemma, which is generated by the LES registered with the V code:

     A0015 V ABALIEN V1
     A0015   ABALEN  V1
     Lemma: abalieno
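The grouping rule on this slide can be sketched in a few lines of Python. This is only an illustration of the data structure, not LEMLAT's actual storage format: the tuple layout, the `group_by_id` helper and the `lemma_source` key are hypothetical names, and only some of the slide's rows are reproduced.

```python
from collections import defaultdict

# Rows copied from the slide: (ID number, V flag, LES, COD LES).
# The tuple layout is an assumption made for this sketch.
ROWS = [
    ("A0014", False, "ABALIENATION", "N31"),
    ("A0015", True, "ABALIEN", "V1"),
    ("A0015", False, "ABALEN", "V1"),
    ("A0016", False, "ABALIUD", "I"),
    ("A0021", True, "ABARC", "V2"),
    ("A0021", False, "ABERC", "V2"),
]


def group_by_id(rows):
    """Group LES by ID number and record which LES carries the V flag."""
    groups = defaultdict(lambda: {"les": [], "lemma_source": None})
    for id_num, v_flag, les, cod_les in rows:
        groups[id_num]["les"].append((les, cod_les))
        if v_flag:
            # The slide says the lemma is generated by the V-marked LES.
            groups[id_num]["lemma_source"] = les
    return dict(groups)


groups = group_by_id(ROWS)
print(groups["A0015"])
```

Entries with a single LES and no V flag keep `lemma_source` as `None` here, because the slide does not say how their lemma is generated.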

  5. The LEMLAT morphological analysis
     (Diagram; labels: Input form, Lemma, Segmentation attempts, COD LEM, ID Num.)

  6. LEMLAT tests
     • Checking the lemmatization of the Decretum Gratiani
     • Production of a lexical index of Leibniz texts for LIE (Lessico Intellettuale Europeo, Roma)
     • Lemmatization of the Latin Grammarians Corpus (not published)
     • Lexical resource for the Olissipo Project (University of Lisbon)

  7. Why LEMLAT for CHLT?
     • Lexical quantity
     • Management of graphical variants
     • Usable open-source tool

  8. Comparison of LEMLAT with other Latin morphological analysers (1)
     • Compared analysers:
       – Words: version 1.97, by William Whitaker (http://www.erols.com/whitaker/words.htm)
       – Nomen: by Paravia (Italian publishing house)
       – Perseus Latin Morphological Analysis: by the Perseus Project

  9. Comparison of LEMLAT with other Latin morphological analysers (2)
     • Lexical quantity:
       – LEMLAT: 58147 LES
       – Words: 48698 stems
       – Nomen: 31903 lemmas
       – Perseus: ?
     • Example: pardalios
       – LEMLAT: analysed
       – Words: not analysed
       – Nomen: not analysed
       – Perseus: not analysed

  10. Comparison of LEMLAT with other Latin morphological analysers (3)
      • Management of graphical variants
        – vies (form of via: abl. pl. in Corp. Inscr. Lat. 4, 1410)
          • LEMLAT: lemmatized as a form of via, vieo and vio
          • Words: lemmatized as a form of vieo
          • Nomen: lemmatized as a form of vieo and vio
          • Perseus: lemmatized as a form of vio

  11. 2. What has to be done on LEMLAT to meet CHLT requirements: aims, means and problems

  12. Aims
      • Complement LEMLAT's synthetic morphological analysis with an analytical one by adding the following items to the LEMLAT lemmatization results:
        – new morphological information, e.g. aquai
          • LEMLAT: aqu-ai (segmented form), aqua (lemma), n1 (COD LEM)
          • CHLT LEMLAT: aqua (lemma); Common, Noun, I Decl., Gen., Sing., Fem.
        – new stylistic and historical-linguistic information, e.g. aquai
          • CHLT LEMLAT: aqua (lemma); Common, Noun, I Decl., Gen., Sing., Fem., Poetic., Arch.

  13. How to achieve these aims
      • New coding of the basic wordform segments (morphemes) recognized by the LEMLAT segmentation module:
        – LES: antiqu-
        – SM (paradigmatic suffixes): -issim-
        – SF (endings): -orum
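To make the three segment types concrete, here is a toy segmenter that splits a wordform into LES, optional SM and SF. It is a sketch under strong assumptions, not the LEMLAT segmentation module: the three lookup sets contain only the examples from the slide, and the brute-force search over split points is a stand-in for whatever strategy LEMLAT really uses.

```python
# Toy lookup tables holding only the segments shown on the slide.
LES = {"antiqu"}
SM = {"issim"}   # paradigmatic suffixes
SF = {"orum"}    # endings


def segment(form):
    """Return every (LES, SM, SF) split of form; SM is '' when absent."""
    results = []
    for i in range(1, len(form)):
        les, rest = form[:i], form[i:]
        if les not in LES:
            continue
        # Case 1: LES followed directly by an ending.
        if rest in SF:
            results.append((les, "", rest))
        # Case 2: LES + paradigmatic suffix + ending.
        for j in range(1, len(rest)):
            sm, sf = rest[:j], rest[j:]
            if sm in SM and sf in SF:
                results.append((les, sm, sf))
    return results


print(segment("antiquissimorum"))
print(segment("antiquorum"))
```

Once a form is split this way, each segment can carry its own code, which is what makes the analytical analysis possible.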

  14. Type of codes
      • Codes are defined according to the morphological coding conventions developed by the EAGLES project (Expert Advisory Group on Language Engineering Standards)
      • Advantages of EAGLES coding:
        – accepted standard
        – extensively tested on a number of languages
        – flexibility and personalization (useful for this first application to a dead language)

  15. SF coding: code positions and their attributes

      ======== ================
      Position Attribute
      ======== ================
      1        PoS
      2        Type
      3        Flexive Category
      4        Mood
      5        Tense
      6        Case
      7        Gender
      8        Number
      9        Person
      10       Degree
      ======== ================

  16. Example. Third position: values and codes

      = ================ =================== =
      P ATTRIBUTE        VALUE               C
      = ================ =================== =
      3 Flexive Category I decl.             A
                         II decl.            B
                         III decl.           C
                         IV decl.            D
                         V decl.             E
                         I conjug.           F
                         II conjug.          G
                         III conjug.         H
                         IV conjug.          L
                         Conjug e/i          M
                         Exceptional Conjug. N
                         No Flexive Category -
      = ================ =================== =

  17. Coding samples

      ==== =========== =========== =========
      SF   LEMLAT Cod. EAGLES Cod. Examples
      ==== =========== =========== =========
      a    n1          NcA--bfs--  ros-a
      a    n1          NcA--bms--  pirat-a
      a    n1          NcA--nfs--  ros-a
      a    n1          NcA--nms--  pirat-a
      a    n1          NcA--vfs--  ros-a
      a    n1          NcA--vms--  pirat-a
      a    n1e         NcA--bfs--  plastic-a
      a    n1e         NcA--bms--  poet-a
      a    n1e         NcA--nfs--  plastic-a
      a    n1e         NcA--nms--  poet-a
      a    n1e         NcA--vfs--  plastic-a
      a    n1e         NcA--vms--  poet-a
      abus n1e         NcA--bfp--  de-abus
      abus n1e         NcA--dfp--  de-abus
      ==== =========== =========== =========
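A positional code such as NcA--bfs-- can be read by pairing each character with the attribute for its position. The sketch below uses the ten positions from slide 15 and the third-position values from slide 16. The values of the other positions are not listed on these slides, so they are returned as raw letters rather than guessed. The names `POSITIONS`, `FLEXIVE_CATEGORY` and `decode` are hypothetical.

```python
# Attribute for each of the ten code positions (slide 15).
POSITIONS = [
    "PoS", "Type", "Flexive Category", "Mood", "Tense",
    "Case", "Gender", "Number", "Person", "Degree",
]

# Third-position values and codes (slide 16).
FLEXIVE_CATEGORY = {
    "A": "I decl.", "B": "II decl.", "C": "III decl.",
    "D": "IV decl.", "E": "V decl.",
    "F": "I conjug.", "G": "II conjug.", "H": "III conjug.",
    "L": "IV conjug.", "M": "Conjug e/i",
    "N": "Exceptional Conjug.", "-": "No Flexive Category",
}


def decode(code):
    """Map a ten-character code to its attributes.

    Only position 3 is expanded to a readable value; the other
    positions keep their raw letter because their value tables
    are not given on the slides.
    """
    if len(code) != len(POSITIONS):
        raise ValueError(f"expected {len(POSITIONS)} positions, got {len(code)}")
    fields = dict(zip(POSITIONS, code))
    fields["Flexive Category"] = FLEXIVE_CATEGORY[fields["Flexive Category"]]
    return fields


print(decode("NcA--bfs--"))
```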

  18. A coding problem
      • LEMLAT lemmatizes the following kinds of forms without segmentation:
        – FE (exceptional forms): registered as such in the look-up table, with COD LES FE (e.g. amassint)
            A1705   AMASSINT FE
            A1705 V AM       V1
        – LE (exceptional lemmas): generated through special information registered in the fourth field of the look-up table (e.g. agape)
            A1128 AGAP N1E -E
        – I (invariable forms): registered as such in the look-up table, with COD LES I (e.g. assultim)
            A3200 ASSULTIM I

  19. Why is this a problem?
      • Remember: the analytical morphological analysis we need derives from the coding of the wordform segments (LES/SM/SF)
      • If the input wordform is not segmented, its segments are not recognized
      • If its segments are not recognized, the wordform receives no analytical morphological analysis

  20. Problem solution 1: FE and I
      • Every single FE and I will be manually coded in an ad hoc file listing all FE and I:
          AMASSINT FE VmFa6--p3-
          ASSULTIM I  Ri-------
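One way to picture the ad hoc file is as a plain lookup keyed by the unsegmented form. The sketch below assumes a layout of three whitespace-separated columns (form, COD LES, code), which simply mirrors how the two entries are printed on the slide. The real file format is not specified, and `load_exceptions` is a hypothetical helper.

```python
# The two entries from the slide, in an assumed three-column layout.
AD_HOC_FILE = """\
AMASSINT FE VmFa6--p3-
ASSULTIM I Ri-------
"""


def load_exceptions(text):
    """Build a form -> (COD LES, code) table from the ad hoc file text."""
    table = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        form, cod_les, code = line.split()
        table[form.lower()] = (cod_les, code)
    return table


exceptions = load_exceptions(AD_HOC_FILE)
print(exceptions["amassint"])
```

With such a table, an FE or I form gets its code by direct lookup, so no segmentation is needed.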

  21. Problem solution 2: LE
      • Every LE will receive its morphological analysis according to:
        – the COD LES of the LE's LES
        – the information registered in the fourth field of that LES's row in the look-up table; LEMLAT adds this information to the LES to generate the LE
      • Example:
          A1128 AGAP N1E -E
        – LE: agape (AGAP plus -E); the wordform is not segmented
        – COD LES N1E plus fourth field -E give the morphological analysis:
          • Common, Noun, I Decl., Nomin., Sing., Fem.
          • Common, Noun, I Decl., Voc., Sing., Fem.
          • Common, Noun, I Decl., Abl., Sing., Fem.
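The LE rule can be sketched as two small steps: join the LES with the fourth field to obtain the exceptional lemma, then look up the analyses by the pair (COD LES, fourth field). The function and table names below are hypothetical, and the table holds only the one pair shown on the slide.

```python
def exceptional_lemma(les, fourth_field):
    """Join an LES with the ending stored in the fourth field (e.g. '-E')."""
    return (les + fourth_field.lstrip("-")).lower()


# Analyses listed on the slide for COD LES N1E with fourth field -E.
LE_ANALYSES = {
    ("N1E", "-E"): [
        "Common, Noun, I Decl., Nomin., Sing., Fem.",
        "Common, Noun, I Decl., Voc., Sing., Fem.",
        "Common, Noun, I Decl., Abl., Sing., Fem.",
    ],
}


def analyse_le(les, cod_les, fourth_field):
    """Return the exceptional lemma and its analyses for one look-up row."""
    lemma = exceptional_lemma(les, fourth_field)
    return lemma, LE_ANALYSES[(cod_les, fourth_field)]


lemma, analyses = analyse_le("AGAP", "N1E", "-E")
print(lemma)
for analysis in analyses:
    print(analysis)
```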

  22. 3. The future: next steps and CHLT LEMLAT developments and applications

  23. Next steps 1.
      • Add gender codes to every single nominal LES in the look-up table (partially automatic operation):
          A0019f ABAMIT N1
      • Example
        – Input form: abamitas
        – Segmentation: abamit-as
        – SF: -as n1
            as n1 NcA--afp-
            as n1 NcA--amp--
        – LES: abamit-
            A0019f ABAMIT N1
        – Selected SF:
            as n1 NcA--afp-
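The selection step in this example amounts to filtering the candidate SF codes by the gender letter appended to the LES ID. A minimal sketch follows, assuming the gender is the last character of the ID (as in A0019f) and sits at position 7 of the code (per the positions table on slide 15). The candidate codes are written here with ten positions to match the coding samples on slide 17, which is also an assumption. `select_sf` is a hypothetical helper.

```python
GENDER_POSITION = 7  # 1-based position of Gender in the code (slide 15)


def select_sf(id_num, candidate_codes):
    """Keep only the SF codes whose gender matches the LES gender letter."""
    gender = id_num[-1]  # assumed: gender letter is appended to the ID
    return [
        code for code in candidate_codes
        if code[GENDER_POSITION - 1] == gender
    ]


# Candidate codes for SF -as with LEMLAT code n1, padded to ten positions.
candidates = ["NcA--afp--", "NcA--amp--"]
print(select_sf("A0019f", candidates))
```

For abamitas, only the feminine candidate survives, which matches the selected SF on the slide.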

  24. Next steps 2.
      • Code SM
      • Code FE and I
      • Code stylistic and historical-linguistic information
      • Software:
        – Choose an RDBMS (Relational Database Management System) among the available open-source systems
        – Use the chosen RDBMS in LEMLAT
        – Develop software to implement the new features

  25. Next steps 3. (additional funds are needed)
      • Add proper nouns (Onomasticon) to the look-up table
      • Add late Latin items (from Humanism and the Renaissance) to the look-up table

  26. Future CHLT LEMLAT developments and applications
      Proposal for the EU Sixth Framework, 2003
      • Latin Lexical Database for content extraction
        – To be added to lemmas:
          • Encyclopedic and dictionary information
          • Etymological information
          • Information about people, places and things
          • Images
          • Movies and sounds
      • Syntactic analysis (syntactic disambiguator)
      • Metric structure analyser
        – Metric reading through a multimedia tool (text-to-speech and sound reproduction)
