CHLT Project (IST-2001-32745)
Workpackage 5. Neo-Latin Morphological Analyser
C.N.R. Istituto di Linguistica Computazionale
Andrea Bozzi, Giuseppe Cappelli, Marco Passarotti, Paolo Ruffolo
Bozzi, Passarotti, CHLT LEMLAT

1. LEMLAT
A Latin morphological analyser for CHLT

LEMLAT
• Collated lexical sources:
  – Georges
  – Gradenwitz
  – Oxford Latin Dictionary
  – TLL (partially)
• Number of entries
  – 58147 LES (invariable parts of the inflected forms)

The LEMLAT dictionary structure

ID Num.     LES            COD LES
A0014       ABALIENATION   N31
A0015   V   ABALIEN        V1
A0015       ABALEN         V1
A0016       ABALIUD        I
A0017       ABALTERUTRUM   I
A0018       ABAMBUL        V1I
A0019       ABAMIT         N1
A0020       ABANTE         I
A0021   V   ABARC          V2
A0021       ABERC          V2

Different LES receive the same ID number if they share a lemma, which is generated from the LES registered with the V code:

A0015   V   ABALIEN        V1
A0015       ABALEN         V1
Lemma: abalieno

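The grouping rule above can be sketched in a few lines of Python. This is only an illustration, not LEMLAT's actual storage format: the tuple layout and the function names `group_by_id` and `lemma_source` are hypothetical, and treating a lone LES as its own lemma source is an assumption the slide does not state.

```python
from collections import defaultdict

# Rows copied from the slide: (ID number, V flag, LES, COD LES).
ROWS = [
    ("A0014", False, "ABALIENATION", "N31"),
    ("A0015", True, "ABALIEN", "V1"),
    ("A0015", False, "ABALEN", "V1"),
    ("A0016", False, "ABALIUD", "I"),
    ("A0021", True, "ABARC", "V2"),
    ("A0021", False, "ABERC", "V2"),
]


def group_by_id(rows):
    """Collect all LES that share an ID number (hypothetical helper)."""
    groups = defaultdict(list)
    for id_num, v_flag, les, cod in rows:
        groups[id_num].append((les, cod, v_flag))
    return dict(groups)


def lemma_source(group):
    """Return the LES flagged with V, i.e. the one the lemma is generated from.

    Assumption: a group with a single, unflagged LES uses that LES.
    """
    flagged = [les for les, _cod, v_flag in group if v_flag]
    if flagged:
        return flagged[0]
    return group[0][0] if len(group) == 1 else None


groups = group_by_id(ROWS)
print(groups["A0015"])                # both LES of the lemma abalieno
print(lemma_source(groups["A0015"]))  # the V-flagged LES
```

Keeping the V flag as a separate field, rather than folding it into the ID, makes it easy to list every graphical variant of a lemma while still knowing which LES the lemma is built from.
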
The LEMLAT morphological analysis
[Figure: sample analyser output, with labels: Input form, Lemma, Segmentation attempts, COD LEM, ID Num.]

LEMLAT tests
• Checking of the lemmatization of the Decretum Gratiani
• Production of a lexical index of Leibniz texts for LIE (Lessico Intellettuale Europeo, Roma)
• Lemmatization of the Latin Grammarians Corpus (not published)
• Lexical resource for the Olissipo Project (University of Lisbon)

Why LEMLAT for CHLT?
• Lexical quantity
• Graphical variants management
• Open-source usable tool

Comparison: LEMLAT and other Latin morphological analysers 1.
• Compared analysers
  – Words: Version 1.97 by William Whitaker
    http://www.erols.com/whitaker/words.htm
  – Nomen: by Paravia (Italian publishing house)
  – Perseus Latin Morphological Analysis: by Perseus Project

Comparison: LEMLAT and other Latin morphological analysers 2.
• Lexical quantity
  – LEMLAT: 58147 LES
  – Words: 48698 stems
  – Nomen: 31903 lemmas
  – Perseus: ?
Example
  – pardalios
    • LEMLAT: analysed
    • Words: not analysed
    • Nomen: not analysed
    • Perseus: not analysed

Comparison: LEMLAT and other Latin morphological analysers 3.
• Graphical variants management
  – vies (form of via: abl. pl., in Corp. Inscr. Lat. 4, 1410)
    • LEMLAT: lemmatized as a form of via, vieo and vio
    • Words: lemmatized as a form of vieo
    • Nomen: lemmatized as a form of vieo and vio
    • Perseus: lemmatized as a form of vio

2. What has to be done on LEMLAT to meet CHLT requirements
Aims, means and problems

Aims
• To complement LEMLAT's synthetic morphological analysis with an analytical one, by adding the following items to the LEMLAT lemmatization results:
  – new morphological information
    aquai
    • LEMLAT: aqu-ai (segmented form), aqua (lemma), n1 (COD LEM)
    • CHLT LEMLAT: aqua (lemma) Common, Noun, I Decl., Gen., Sing., Fem.
  – new stylistic and historical-linguistic information
    aquai
    • CHLT LEMLAT: aqua (lemma) Common, Noun, I Decl., Gen., Sing., Fem., Poetic., Arch.

How to achieve these aims
• New coding of the basic LEMLAT wordform segments (morphemes) recognized by the segmentation module:
  – LES: antiqu-
  – SM (paradigmatic suffixes): -issim-
  – SF (endings): -orum

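As a rough illustration of the three segment types, the sketch below splits a wordform into LES, optional SM and SF by trying every ending and suffix that fits. It is not LEMLAT's segmentation module: the tiny tables and the function name `segment` are hypothetical, and the real analyser also checks that the codes of the segments are compatible, which is omitted here.

```python
# Toy segment tables (assumption: real tables are far larger and carry codes).
LES = {"antiqu", "ros"}   # invariable parts of the inflected forms
SM = {"issim"}            # paradigmatic suffixes
SF = {"orum", "a"}        # endings


def segment(form):
    """Return every (LES, SM, SF) split of `form` licensed by the toy tables.

    SM is optional, so the empty string is always tried as a candidate.
    """
    results = []
    for sf in sorted(SF):
        if not form.endswith(sf):
            continue
        stem = form[: len(form) - len(sf)]
        for sm in [""] + sorted(SM):
            if not stem.endswith(sm):
                continue
            les = stem[: len(stem) - len(sm)]
            if les in LES:
                results.append((les, sm, sf))
    return results


print(segment("antiquissimorum"))
print(segment("rosa"))
```

Returning a list rather than a single split reflects that a form may admit several segmentations, each of which leads to its own analysis.
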
Type of codes
• Definition of codes according to the morphological coding conventions developed by the EAGLES project (Expert Advisory Group on Language Engineering Standards)
EAGLES coding advantages:
  – accepted standard
  – largely tested on a number of languages
  – flexibility and personalization (useful for this first application to a dead language)

SF Coding
Code positions and their attributes

======  ==================
Code P  ATTRIBUTE
======  ==================
1       PoS
2       Type
3       Flexive Category
4       Mood
5       Tense
6       Case
7       Gender
8       Number
9       Person
10      Degree
======  ==================

Example
Third position: values and codes

=  =====================  =====================  =
P  ATTRIBUTE              VALUE                  C
=  =====================  =====================  =
3  Flexive Category       I decl.                A
                          II decl.               B
                          III decl.              C
                          IV decl.               D
                          V decl.                E
                          I conjug.              F
                          II conjug.             G
                          III conjug.            H
                          IV conjug.             L
                          Conjug e/i             M
                          Exceptional Conjug.    N
                          No Flexive Category    -
=  =====================  =====================  =

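Because each character of a code stands for one attribute, a code can be read by position. The sketch below splits a ten-character code into the attributes listed above and expands only the third position, whose values are the ones given on this slide. The function name `decode` is hypothetical, and the sample code NcA--bfs-- is one of the SF codes used for ros-a.

```python
# Attribute names for positions 1..10, as listed on the slide.
POSITIONS = [
    "PoS", "Type", "Flexive Category", "Mood", "Tense",
    "Case", "Gender", "Number", "Person", "Degree",
]

# Values of the third position, copied from the slide.
FLEXIVE_CATEGORY = {
    "A": "I decl.", "B": "II decl.", "C": "III decl.", "D": "IV decl.",
    "E": "V decl.", "F": "I conjug.", "G": "II conjug.", "H": "III conjug.",
    "L": "IV conjug.", "M": "Conjug e/i", "N": "Exceptional Conjug.",
    "-": "No Flexive Category",
}


def decode(code):
    """Map a positional code to {attribute: value} (hypothetical helper).

    Only the third position is expanded; the other positions keep their raw
    character because their value tables are not given here.
    """
    if len(code) != len(POSITIONS):
        raise ValueError(f"expected {len(POSITIONS)} positions, got {len(code)}")
    fields = dict(zip(POSITIONS, code))
    fields["Flexive Category"] = FLEXIVE_CATEGORY[fields["Flexive Category"]]
    return fields


print(decode("NcA--bfs--"))
```

A fixed-width code keeps unused attributes explicit as "-", so every code can be compared position by position without parsing.
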
Coding samples

SF    LEMLAT Cod.  EAGLES Cod.  Examples
a     n1           NcA--bfs--   ros-a
a     n1           NcA--bms--   pirat-a
a     n1           NcA--nfs--   ros-a
a     n1           NcA--nms--   pirat-a
a     n1           NcA--vfs--   ros-a
a     n1           NcA--vms--   pirat-a
a     n1e          NcA--bfs--   plastic-a
a     n1e          NcA--bms--   poet-a
a     n1e          NcA--nfs--   plastic-a
a     n1e          NcA--nms--   poet-a
a     n1e          NcA--vfs--   plastic-a
a     n1e          NcA--vms--   poet-a
abus  n1e          NcA--bfp--   de-abus
abus  n1e          NcA--dfp--   de-abus

A coding problem
• The following kinds of forms are lemmatized by LEMLAT with no segmentation:
  – FE (exceptional forms): registered as such in the look-up table, with COD LES FE (e.g. amassint)
      A1705       AMASSINT   FE
      A1705   V   AM         V1
  – LE (exceptional lemmas): generated through special information registered in the fourth field of the look-up table (e.g. agape)
      A1128       AGAP       N1E   -E
  – I (invariable forms): registered as such in the look-up table, with COD LES I (e.g. assultim)
      A3200       ASSULTIM   I

Why is this a problem?
• Remember! The analytical morphological analysis we need derives from the coding of wordform segments (LES/SM/SF)
• No segmentation of the input wordform means no recognition of its segments
• No recognition of the input wordform's segments means no analytical morphological analysis of that wordform

Problem solution 1. FE and I
• Every single FE and I will be manually coded in an ad hoc file, where all FE and I are listed

AMASSINT   FE   VmFa6--p3-
ASSULTIM   I    Ri--------

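A minimal sketch of how such an ad hoc file could be consulted, assuming it is loaded into a dictionary keyed by the form. The names `EXCEPTIONAL` and `lookup_unsegmented` are hypothetical, and the codes are written with ten characters, one per position of the coding scheme, which is an assumption about their intended spelling.

```python
# Hand-coded analyses for forms that LEMLAT lemmatizes without segmentation.
# Form -> (COD LES, positional code). Ten-character codes are an assumption.
EXCEPTIONAL = {
    "amassint": ("FE", "VmFa6--p3-"),
    "assultim": ("I", "Ri--------"),
}


def lookup_unsegmented(form):
    """Return (COD LES, code) for an FE or I form, or None if it is not listed."""
    return EXCEPTIONAL.get(form.lower())


print(lookup_unsegmented("AMASSINT"))
print(lookup_unsegmented("rosa"))  # not an FE or I form
```

A form missing from the table simply returns None, so the caller can fall back to the ordinary segment-based analysis.
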
Problem solution 2. LE
• Every LE will receive its morphological analysis according to:
  – the COD LES of the LE's LES
  – the kind of information registered in the fourth field of that LES's row in the look-up table: LEMLAT adds this information to the LES to generate the LE

A1128   AGAP   N1E   -E
LE: agape (AGAP plus -E): no segmented wordform!

COD LES: N1E + Fourth field: -E
Morphological analysis:
  Common, Noun, I Decl., Nomin., Sing., Fem.
  Common, Noun, I Decl., Voc., Sing., Fem.
  Common, Noun, I Decl., Abl., Sing., Fem.

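The LE rule can be sketched as a table keyed by the pair (COD LES, fourth field). This is an illustration under assumptions: the names `LE_ANALYSES` and `analyse_le` are hypothetical, and the table holds only the single pair shown on the slide.

```python
# Analyses assigned to an exceptional lemma, keyed by (COD LES, fourth field).
# Only the pair from the slide is listed; a real table would need one entry
# per combination found in the look-up table.
LE_ANALYSES = {
    ("N1E", "-E"): [
        "Common, Noun, I Decl., Nomin., Sing., Fem.",
        "Common, Noun, I Decl., Voc., Sing., Fem.",
        "Common, Noun, I Decl., Abl., Sing., Fem.",
    ],
}


def analyse_le(les, cod_les, fourth_field):
    """Build the LE from LES plus fourth field and return (LE, analyses)."""
    lemma = (les + fourth_field.lstrip("-")).lower()
    return lemma, LE_ANALYSES.get((cod_les, fourth_field), [])


lemma, analyses = analyse_le("AGAP", "N1E", "-E")
print(lemma)
for analysis in analyses:
    print(" ", analysis)
```

Keying on both fields matters because the fourth field alone does not determine the paradigm; the COD LES supplies the declension.
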
3. The future
Next steps and CHLT LEMLAT developments and applications

Next steps 1.
• To add gender codes to every single nominal LES in the look-up table (a partially automatic operation)
    A0019f   ABAMIT   N1
  – Input form: abamitas
    Segmentation: abamit-as
    • SF: -as
        as   n1   NcA--afp--
        as   n1   NcA--amp--
    • LES: abamit-
        A0019f   ABAMIT   N1
    Selected SF: as   n1   NcA--afp--

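The selection step in this example amounts to keeping only the SF codes whose gender position matches the gender letter of the LES. A minimal sketch, assuming the gender code is the last character of the ID number (as in A0019f) and that SF codes have ten positions with gender in the seventh; the names `SF_CODES` and `select_sf` are hypothetical.

```python
# Candidate codes for an ending, keyed by (SF, LEMLAT code).
# Ten-character codes are an assumption, consistent with ten code positions.
SF_CODES = {
    ("as", "n1"): ["NcA--afp--", "NcA--amp--"],
}

GENDER_INDEX = 6  # position 7 (Gender), counted from zero


def select_sf(sf, cod_les, les_id):
    """Keep the SF codes whose gender matches the gender letter of the LES ID."""
    gender = les_id[-1]
    candidates = SF_CODES.get((sf, cod_les.lower()), [])
    return [code for code in candidates if code[GENDER_INDEX] == gender]


# abamitas -> abamit-as, LES row: A0019f ABAMIT N1
print(select_sf("as", "N1", "A0019f"))
```

Without the gender letter on the LES, both candidate codes would survive and abamitas would be reported as both feminine and masculine.
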
Next steps 2.
• To code SM
• To code FE and I
• To code stylistic and historical-linguistic information
• Software
  – To choose an RDBMS (Relational Database Management System) among the available open-source systems
  – To use the chosen RDBMS in LEMLAT
  – To develop software implementing the new features

Next steps 3. (but additional funds are needed)
• To add proper nouns (Onomasticon) to the look-up table
• To add Late Latin items (from Humanism and the Renaissance) to the look-up table

Future CHLT LEMLAT developments and applications
Proposal for EU Sixth Framework, 2003
• Latin Lexical Database for content extraction
  – To be added to lemmas:
    • Encyclopedic and dictionary information
    • Etymological information
    • Information about people, places and things
    • Images
    • Movies and sounds
• Syntactic analysis (syntactic disambiguator)
• Metric structure analyser
  – Metric reading through a multimedia tool (text-to-speech and sound reproduction)