Framework for Supporting Multilingual Resource Development at Expert System Jose Manuel Gomez-Perez jmgomez@expertsystem.com META-FORUM 2016, July 5th, 2016
Expert System – About us
Expert System’s COGITO
• COGITO interprets text to empower better, more informed decision making
• Based on Sensigrafo, a monolingual representation of knowledge that is both deep and wide
• Sensigrafo contains millions of word definitions, related concepts and linguistic information
• Several person-years each
• COGITO leverages context information for disambiguation based on Sensigrafo
• Document categorization and information extraction encoded on top of Sensigrafo in rule-based categorization and extraction languages
• Rule modeling supported by COGITO Studio
Expert System Today
• 14 languages natively supported
Challenges and Opportunities
• Due to Expert System's rapid expansion in the European market, the company faced the challenge of creating new monolingual resources from scratch, or…
• Achieve native multilinguality in a cost-effective manner, while maintaining high accuracy and reducing time to market
• Generalized MT is not the solution: the resulting accuracy drops by at least 10% on average
• Many of the projects in the new countries are conceptually similar to previous projects in other languages
• Enable reuse of existing semantic and linguistic resources, including monolingual rule bases, across languages
Approach
• The goal is not to automate the whole process, rather:
• Bootstrap resources, providing knowledge engineers with a solid base and alleviating the blank page syndrome, particularly for rule development
• Leverage context information, both in text and in the monolingual Sensigrafos, to improve translation quality
• Provide confidence values to guide validation efforts
• Focus on the existing monolingual rule bases
[Diagram labels: No previous rule base to reuse; Large document corpus available; Rule Learning; Word & Sense Embeddings; Context-based mapping identification; Automatic Rule Translation; Reusable rule base exists (in a different language)]
Automatic Rule Translation
• Transform rules in the original language into Abstract Syntax Trees (ASTs). Main nodes include concepts (word senses), lemmas, and keywords
• The AST translator replicates ASTs, modifying or replacing nodes from the source language to the target language
• Each node and operator type is handled differently, relying on the concept mapping between source and target Sensigrafos
• Applied to 90K rules in IPTC, EUROVOC, etc. and the language pairs IT-ES, IT-FR, EN-DE
✓ 99.9% of rules translated
✓ 55% to 70% accuracy
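The node-by-node translation described above can be illustrated with a minimal sketch. It assumes a toy AST with concept, lemma, keyword and operator nodes. The class names, the sense identifiers and the two lookup tables are hypothetical and stand in for the COGITO rule language and the Sensigrafo concept mapping, which the slides do not specify.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union


# Hypothetical node types; the real rule languages are not described in the slides.
@dataclass(frozen=True)
class Concept:
    sense_id: str  # identifier of a word sense in a Sensigrafo (made up here)


@dataclass(frozen=True)
class Lemma:
    text: str


@dataclass(frozen=True)
class Keyword:
    text: str


@dataclass(frozen=True)
class Op:
    name: str  # operator name, e.g. "AND" or "OR"
    children: Tuple["Node", ...]


Node = Union[Concept, Lemma, Keyword, Op]


def translate(node: Node, concept_map: Dict[str, str],
              lexicon: Dict[str, str], unmapped: List[Node]) -> Node:
    """Replicate the tree, replacing each leaf with its target-language
    counterpart. Leaves without a mapping are kept unchanged and collected
    in `unmapped` so that a knowledge engineer can review them."""
    if isinstance(node, Op):
        # Operators are copied; their children are translated recursively.
        return Op(node.name, tuple(
            translate(child, concept_map, lexicon, unmapped)
            for child in node.children))
    if isinstance(node, Concept):
        # Concepts go through the source-to-target concept mapping.
        target = concept_map.get(node.sense_id)
        if target is None:
            unmapped.append(node)
            return node
        return Concept(target)
    # Lemmas and keywords go through a bilingual lexicon (an assumption).
    target = lexicon.get(node.text)
    if target is None:
        unmapped.append(node)
        return node
    return type(node)(target)


# Toy IT -> ES example with invented identifiers and entries.
rule = Op("AND", (Concept("it:123"),
                  Op("OR", (Lemma("banca"), Keyword("BCE")))))
unmapped: List[Node] = []
translated = translate(rule, {"it:123": "es:456"}, {"banca": "banco"}, unmapped)
print(translated)
print(unmapped)
```

Collecting unmapped nodes instead of failing keeps the translated rule usable as a starting point, in line with the stated goal of bootstrapping rather than fully automating.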
Word and Sense Embeddings
• Suggest missing links between Sensigrafos using context for word sense disambiguation
• Builds on MT work to infer missing dictionary entries (Mikolov et al)
• Learn monolingual models and a linear projection between them
• Learnt relations display several degrees of relatedness with different confidence values, e.g. equivalence, similarity, co-occurrence, etc.
• Pleno (ES) -> full, plenary, partsession, Hortefeux, approve, summarize (EN)
• Tokenized, lemmatized and normalized the EUROPARL parallel corpora using COGITO
• Skip-gram model, window size 10, vector dimensionality 400
• Linear projection learnt from a dictionary with the 5,000 most frequent terms in the source language and their MT equivalent in the target
• Translation matrix code in Java available in GitHub: https://github.com/josemanuelgp/word2vec_vector-translation-java
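The linear projection between two monolingual embedding spaces can be sketched as follows. This is not the Java code referenced above. It is a minimal NumPy illustration that assumes the projection is fitted by ordinary least squares on a seed dictionary. The vectors are random toy data of dimension 4 (the slide uses 400), and the word lists are invented for the example.

```python
import numpy as np


def learn_projection(source_vecs: np.ndarray, target_vecs: np.ndarray) -> np.ndarray:
    """Fit W minimizing ||source_vecs @ W - target_vecs||^2 over the seed
    dictionary (row i of each matrix is one translation pair)."""
    W, *_ = np.linalg.lstsq(source_vecs, target_vecs, rcond=None)
    return W


def nearest_targets(vec, target_vecs, target_words, k=3):
    """Rank target words by cosine similarity to a projected source vector.
    The similarity can serve as a rough confidence value."""
    unit_targets = target_vecs / np.linalg.norm(target_vecs, axis=1, keepdims=True)
    sims = unit_targets @ (vec / np.linalg.norm(vec))
    order = np.argsort(-sims)[:k]
    return [(target_words[i], float(sims[i])) for i in order]


# Toy seed dictionary: random vectors stand in for trained skip-gram embeddings.
rng = np.random.default_rng(0)
source_words = ["pleno", "votar", "informe", "comisión", "debate", "acuerdo"]
target_words = ["plenary", "vote", "report", "commission", "debate", "agreement"]
X = rng.normal(size=(6, 4))        # source-language vectors
W_true = rng.normal(size=(4, 4))   # hidden mapping used to build the toy targets
Z = X @ W_true                     # target-language vectors

W = learn_projection(X, Z)
candidates = nearest_targets(X[0] @ W, Z, target_words)
best_word = candidates[0][0]
print(candidates)
```

With real embeddings the fit is only approximate, so several candidates with different similarity scores come back for each source word, as in the Pleno example on the slide.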
Rule Learning
• Automatically bootstrap a rule base starting from a targeted […]
• Focus on beginner’s rules rather than perfect rules
• Two main approaches, based on tf-idf and decision trees
✓ Precision >34%
✓ Recall >65%
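The slides do not say how tf-idf is turned into rules, so the following is only one plausible sketch. It assumes a small labelled corpus, treats all documents of a category as a single bag of words, and proposes the highest-scoring terms as a beginner's keyword rule. The function name, the corpus and the `top_k` parameter are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def bootstrap_rules(docs_by_category: Dict[str, List[str]],
                    top_k: int = 2) -> Dict[str, List[str]]:
    """Propose the top_k tf-idf terms of each category as candidate keywords."""
    # Term counts per category (whitespace tokenization, lowercased).
    counts = {
        cat: Counter(tok for doc in docs for tok in doc.lower().split())
        for cat, docs in docs_by_category.items()
    }
    n_categories = len(counts)
    # Number of categories in which each term occurs.
    df = Counter(term for c in counts.values() for term in c)

    rules = {}
    for cat, c in counts.items():
        total = sum(c.values())
        scores = {
            term: (freq / total) * math.log(n_categories / df[term])
            for term, freq in c.items()
        }
        # Sort by descending score; break ties alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        rules[cat] = ranked[:top_k]
    return rules


# Invented toy corpus.
corpus = {
    "economy": ["the bank raises rates", "the central bank cuts rates"],
    "sport": ["the team wins the match", "the team loses the match"],
}
rules = bootstrap_rules(corpus)
print(rules)
```

Terms shared by every category, such as "the", get a zero score and drop out, which leaves category-specific words for a knowledge engineer to refine.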
Come see our poster!
Jose Manuel Gomez-Perez jmgomez@expertsystem.com