CHIST-ERA Conference 2011
A Micro-Systemic Approach for Dependable Natural Language Processing
Sylviane CARDEY & Peter GREENFIELD
Centre Tesnière, Université de Franche-Comté, France
http://tesniere.univ-fcomte.fr
sylviane.cardey@univ-fcomte.fr
i) Today’s Starting Point
Natural Language Processing Today
Among the heterogeneous data addressed by From Data to New Knowledge (D2K), data in the form of natural language is often the weakest link in complex systems connecting natural and artificial elements (e.g. the 1977 Tenerife airport disaster, 583 fatalities). Compounding this, and with the exception of controlled languages, natural language processing is notorious for defying even elementary engineering practices, in which quality relies on norms and without which reliable interoperability is impossible.
NLP Reliability Today
When we look at NLP applications, what strikes us first is their unreliability:
• Machine translation exists; reliable machine translation does not
• Information searching returns noise and ignores weak signals
Why is this so?
Why is NLP So Unreliable Today?
Many would say:
• The basic precepts of engineering practice are ignored (normalisation, case-based testing, traceability, …)
• Evaluation/tuning counts for more than fundamental research
• Corpus linguistics approaches are favoured too heavily, even though they are limited by being performance based (rather than competence based) and sample based (rarely exhaustive)
But why is this so? Why these impasses?
What Really is the Problem with NLP Today?
• We contend that the regrettable state of NLP today is at least in part because one has forgotten (or one cannot admit) that natural language per se is natural; it is not an artefact.
• We contend too that web semantics (e.g. RDF and SPARQL), taxonomies and suchlike, which are suitable for artefacts, are not part of the practice of NLP (which is not to say that they cannot be interfaced with NLP).
Complexity & Society
• Natural language is very complex: one is confronted with the well-known (to linguists) language-inherent phenomena such as openness (neologisms...), ambiguity, homophony, homonymy, synonymy, anaphora, ‘levels’ (phonology, lexis, syntax, semantics...) etc.
• Natural language is a social phenomenon: one has to contend with (normalise or exploit) ‘real and authentic’ language as practised by real human beings (slang, ‘errors’, dialects…).
it deosn't mttaer in waht oredr
it deosn't mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteer be at the rghit pclae. The rset can be a total mses and you can sitll raed it wouthit a porbelm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the ...
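The claim in the scrambled text (first and last letters fixed, inner letters in any order) can be stated as a small check. The following is a minimal sketch in Python; the function name `is_inner_permutation` is hypothetical and is not part of the presentation.

```python
def is_inner_permutation(scrambled: str, word: str) -> bool:
    """Return True if `scrambled` keeps the first and last letters of `word`
    and only reorders the letters in between."""
    if len(scrambled) != len(word):
        return False
    if len(word) < 2:
        return scrambled == word
    return (
        scrambled[0] == word[0]
        and scrambled[-1] == word[-1]
        # Same multiset of inner letters, in any order.
        and sorted(scrambled[1:-1]) == sorted(word[1:-1])
    )


# Pairs taken from the slide text (scrambled form, intended word).
pairs = [("mttaer", "matter"), ("oredr", "order"), ("iprmoetnt", "important")]
for scrambled, word in pairs:
    print(scrambled, word, is_inner_permutation(scrambled, word))
```

The first two pairs satisfy the stated rule; the third does not, because "iprmoetnt" contains an "e" where "important" has an "a". A human reader still recovers the word, whereas a strict rule of this kind rejects it.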
Information
• How can we filter and interpret information and how can we construct it and translate it?
• A message which is malformed or incorrectly interpreted or not interpreted can provoke serious catastrophes.
French morphological system
[Diagram: open system; labels: inflexion, etymology, formative elements, root, derivation-composition, radical]
How do you spell …?
• model + ing?
• model + er?
• distil + ing?
• frolic + ing?
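The slide leaves these questions open. As a minimal sketch only, the four cases can be covered by two small spelling rules, assuming British English conventions (American usage differs, e.g. "modeling"). The function `attach_suffix` and both rules are illustrative assumptions, not the authors' analysis, and they are far from exhaustive.

```python
VOWELS = set("aeiou")


def attach_suffix(stem: str, suffix: str) -> str:
    """Attach a suffix using two illustrative British-English rules."""
    if suffix and suffix[0] in VOWELS:
        # Rule 1 (assumed): a stem ending in "c" takes a "k" before a
        # vowel-initial suffix (frolic -> frolicking).
        if stem.endswith("c"):
            return stem + "k" + suffix
        # Rule 2 (assumed): a final "l" preceded by a single vowel letter
        # is doubled (model -> modelling, distil -> distilling).
        if (
            len(stem) >= 3
            and stem[-1] == "l"
            and stem[-2] in VOWELS
            and stem[-3] not in VOWELS
        ):
            return stem + "l" + suffix
    return stem + suffix


for stem, suffix in [("model", "ing"), ("model", "er"),
                     ("distil", "ing"), ("frolic", "ing")]:
    print(stem, "+", suffix, "=", attach_suffix(stem, suffix))
```

Each rule is small enough to be checked by hand, yet real coverage would need many more rules and exceptions.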
Polycategories
In the French sentence 'la méchante rigole car le petit est malade' (the nasty woman laughs because the little boy is ill), out of context, all the lexical units are ambiguous...

Lexical unit   Categories
la             {Article, Noun, Personal pronoun}
méchante       {Noun, Adjective}
rigole         {Noun, Conjugated verb}
car            {Noun, Conjunction}
le             {Article, Personal pronoun}
petit          {Noun, Adjective}
est            {Noun, Conjugated verb}
malade         {Noun, Adjective}
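To give a sense of scale, the number of category sequences that an out-of-context analysis would have to consider can be counted directly from the slide's table. This is a minimal sketch; the names `LEXICON` and `count_readings` are hypothetical, and the category labels are rendered in English.

```python
from math import prod

# Categories per lexical unit, copied from the slide's table.
LEXICON = {
    "la": {"Article", "Noun", "Personal pronoun"},
    "méchante": {"Noun", "Adjective"},
    "rigole": {"Noun", "Conjugated verb"},
    "car": {"Noun", "Conjunction"},
    "le": {"Article", "Personal pronoun"},
    "petit": {"Noun", "Adjective"},
    "est": {"Noun", "Conjugated verb"},
    "malade": {"Noun", "Adjective"},
}


def count_readings(tokens: list[str]) -> int:
    """Number of category sequences when each token is tagged independently."""
    return prod(len(LEXICON[token]) for token in tokens)


sentence = "la méchante rigole car le petit est malade".split()
print(count_readings(sentence))  # 3 * 2**7 = 384
```

Only one of these 384 sequences corresponds to the intended reading, so context has to do the disambiguating work.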
Logic, Mathematics and Poetry
'Twas brillig, and the slithy toves
Did gyre and gimble in the wabe;
All mimsy were the borogoves,
And the mome raths outgrabe …
"Jabberwocky", Lewis Carroll, Through the Looking-Glass, and What Alice Found There (1872).
Keywords
Important words and non-important ones. The problem is what is an important word? The main question is what is a word?
He is a has been, he has been working on the same methodology for too long.
The product ought to be perfect.
The consumer is really saying: The product ought to be perfect but it is not.
• For some months this product is no longer as it was before.
• The product would be very good without garlic.
Confusions
• perdre de l’altitude / prendre de l’altitude
• dessous / dessus (especially for Anglophones)
• altitude / attitude / latitude
• uplocked / unlocked
NLP & Dependability Compliance for Life/Safety-Critical Applications
• Dependability compliance is not possible with statistical, keyword and other non-systems-based approaches.
• However, though dependability compliance is potentially possible with systems (‘rule’) based approaches, in reality this is not the case. One cannot analyse a language (or languages) in its entirety due to its complexity.
Machine Learning & Other ‘Short-Cuts’
Machine learning approaches, existing resources and ‘standards’ concerning natural language, used as a substitute for manual linguistic analysis (“language is too complex/too much work/too… so this is why I use a ‘shortcut’”), are not a panacea.
• “But the corpus did not include this case…”
• “But the tags in the ‘standard’ tag set make no sense for this application…”
• “But the (pre-ordained) annotation effort was at least as much in person-hours as a linguist’s in-depth analysis… And the results are not reusable…”
• “But we do not have an exhaustive case-based benchmark…”
• “But the dictionary (in extension) cannot handle neologisms…”
• …
So, what can we do?
Can we put Natural Language Processing (NLP) on a firm footing, and admit NLP to the world of dependability engineering for life/safety-critical applications?
If we can meet this challenge, then NLP for less demanding applications, today’s and the future’s, can surely benefit.
ii) Future Trends
NLP must provide reliable applications
We contend that future trends in information technology, such as conformity to mandatory regulations concerning dependability, will require NLP to provide reliable applications. Thus we will have to:
1. admit that natural language is very complex and that natural language is a social phenomenon.
2. devise the appropriate conforming analysis techniques leading to reliable NLP applications.
Centre Tesnière has been and is working in this direction.
A Micro-Systemic Approach for Dependable NLP
Centre Tesnière’s micro-systemic linguistic analysis approach proposes that, to be processed safely, languages have to be decomposed into systems that are small enough to be analysed by a human being and by machine, yet complete enough to work together as a unified system. In addition, the systems so delimited can interact with other such systems, and this interaction is a property of language. Nothing is independent: lexis, morphology and syntax are linked.
French morphological system
[Diagram: open system; labels: inflexion, etymology, formative elements, root, derivation-composition, radical. Example: beau (adjective), belle (feminine, by inflexion), bellement (adverb, by suffix)]
Our Model
We have developed, using discrete constructive mathematics, a stable (zero obsolescence) abstract core model-theoretic model. During the analysis for a given application, the ensuing process of instantiating this model favours:
• exhaustive analyses
• fine analyses
• compositional analyses
• normalisation to promote intra/interoperability
• generalisation by the linguist (competence)