Grammatical inference: an introduction
Colin de la Higuera, University of Nantes
Nantes (photo: Wikipedia)
Colin de la Higuera, Nantes 2013
Acknowledgements
• Pieter Adriaans, Hasan Ibne Akram, Anne-Muriel Arigon, Leo Becerra-Bonache, Cristina Bibire, Alex Clark, Rafael Carrasco, Paco Casacuberta, Pierre Dupont, Rémi Eyraud, Philippe Ezequel, Henning Fernau, Jeffrey Heinz, Jean-Christophe Janodet, Satoshi Kobayashi, Laurent Miclet, Thierry Murgue, Tim Oates, Jose Oncina, Frédéric Tantini, Franck Thollard, Sicco Verwer, Enrique Vidal, Menno van Zaanen,...
http://pagesperso.lina.univ-nantes.fr/~cdlh/
http://videolectures.net/colin_de_la_higuera/
Practical information
• Grammatical Inference is module X9IT050
• 18 hours
• http://pagesperso.lina.univ-nantes.fr/~cdlh/X9IT050.html
• Exam: to be decided
Some useful links
• Grammatical Inference Software: the Repository https://logiciels.lina.univ-nantes.fr/redmine/projects/gisr/wiki
• Talks on http://videolectures.net
• A book
• Articles
• Start here: http://pagesperso.lina.univ-nantes.fr/~cdlh/X9IT050.html
What I plan to talk about
1. 11/9/2013 An introduction to grammatical inference. About what learning a language means, how we can measure success
2. 18/9/2013 An introduction to grammatical inference. A motivating example
3. 25/9/2013 Learning: identifying or approximating?
4. 2/10/2013 Learning from text
5. 9/10/2013 Learning from text: the window languages
6. 16/10/2013 Learning from an informant: the RPNI algorithm and variants
7. 23/10/2013 Learning distributions: why? How should we measure success? About distances between distributions
8. 6/11/2013 Learning distributions: learning the weights given a structure. EM, Gibbs sampling and the spectral methods
9. 13/11/2013 Learning distributions: state merging techniques
10. 20/11/2013 Active learning 1: about active learning
11. 27/11/2013 Active learning 2: the MAT algorithm
12. 4/12/2013 Learning transducers
13. 11/12/2013 Learning probabilistic transducers
14. 18/12/2013 Exam
Outline (of this first talk)
1. What is grammatical inference about?
2. Why is it a difficult task?
3. Why is it a useful task?
4. Validation issues
5. Some criteria
1 Grammatical inference is about learning a grammar given information about a language
• Information is strings, trees or graphs
• Information can be (typically)
  - Text: only positive information
  - Informant: labelled data
  - Actively sought (query learning, teaching)
These lists are not exhaustive
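The two presentation modes above can be sketched in a few lines of Python. This is a minimal illustration, not from the slides: the target language (strings over {a, b} containing the substring "ab") and all names are assumptions chosen for the example.

```python
# Illustrative target language: strings containing "ab".
# (An invented example; the learner, of course, does not know this.)
def in_target(w):
    return "ab" in w

# Text presentation: positive examples only.
text_sample = ["ab", "aab", "bab"]

# Informant presentation: labelled examples (string, belongs-to-language?).
informant_sample = [("ab", True), ("ba", False), ("", False), ("abb", True)]

def consistent(hypothesis, sample):
    """Does a hypothesis agree with every labelled example of an informant?"""
    return all(hypothesis(w) == label for w, label in sample)
```

With an informant, a candidate grammar can at least be checked for consistency with the data; with text alone, even that check only rules out hypotheses that reject a positive example.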
The functions/goals
• Languages and grammars from the Chomsky hierarchy
• Probabilistic automata and context-free grammars
• Hidden Markov Models
• Patterns
• Transducers
The Chomsky hierarchy
Regular languages ⊂ Context-free languages ⊂ Context-sensitive languages ⊂ Recursively enumerable languages
The Chomsky hierarchy revisited
• Regular languages
  - Recognized by DFAs, NFAs
  - Generated by regular grammars
  - Described by regular expressions
• Context-free languages
  - Generated by context-free grammars
  - Recognized by pushdown (stack) automata
• Context-sensitive languages
  - Generated by context-sensitive grammars (parsing is not in P)
• Recursively enumerable languages
  - Recognized by Turing machines (parsing is undecidable)
Other formalisms
• Topological formalisms
• Semilinear languages
• Hyperplanes
• Balls of strings
Distributions of strings
• A probabilistic automaton defines a distribution over strings
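A minimal sketch of how such a distribution assigns a probability to each string, assuming a one-state deterministic probabilistic automaton with per-state stopping probabilities (the automaton and its numbers are invented for the example):

```python
# Hypothetical deterministic probabilistic automaton over {a, b}.
# Each state has transition probabilities plus a stopping probability;
# here 0.3 + 0.2 + 0.5 = 1 in the single state, so the probabilities
# of all strings sum to 1 and the automaton is a string generator.
pfa = {
    "start": 0,
    "stop": {0: 0.5},                                   # P(stop | state 0)
    "delta": {(0, "a"): (0.3, 0), (0, "b"): (0.2, 0)},  # (prob, next state)
}

def string_probability(pfa, w):
    """Probability that the automaton generates exactly the string w."""
    state = pfa["start"]
    p = 1.0
    for symbol in w:
        prob, state_next = pfa["delta"][(state, symbol)]
        p *= prob
        state = state_next
    return p * pfa["stop"][state]
```

For instance the empty string gets probability 0.5 and "ab" gets 0.3 × 0.2 × 0.5 = 0.03; learning such a device means learning both the structure and these weights.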
Fuzzy automata
• An automaton will say that string w belongs to the language with probability p
• The difference with probabilistic automata is that
  - The total sum of probabilities may be different from 1 (it may even be infinite)
  - A fuzzy automaton cannot be used as a generator of strings
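One way to make this concrete is the classic max-min semantics: a string's membership degree is the best path's weakest transition. Fuzzy automata admit several semantics, so this choice, like the example automaton itself, is an assumption for illustration only:

```python
# Illustrative fuzzy automaton: each string gets a membership degree in
# [0, 1], but the degrees over all strings need not sum to 1 (here every
# string of a's scores at least 0.6), so unlike a probabilistic automaton
# it cannot serve as a string generator.
fa = {
    "start": 0,
    "final": {1: 1.0},                                   # final-state degrees
    "delta": {(0, "a"): [(0.8, 1)], (1, "a"): [(0.6, 1)]},
}

def membership(fa, w):
    """Max over paths of the min transition weight (max-min semantics)."""
    current = {fa["start"]: 1.0}          # state -> best degree reaching it
    for symbol in w:
        nxt = {}
        for state, deg in current.items():
            for weight, target in fa["delta"].get((state, symbol), []):
                d = min(deg, weight)
                if d > nxt.get(target, 0.0):
                    nxt[target] = d
        current = nxt
    return max(
        (min(deg, fa["final"].get(state, 0.0)) for state, deg in current.items()),
        default=0.0,
    )
```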
The data: examples of strings
A string in Gaelic and its translation to English:
• Tha thu cho duaichnidh ri èarr àirde de a' coisich deas damh
• You are as ugly as the north end of a southward traveling ox
http://www.flickr.com/photos/popfossa/3992549630/
Time series pose the problem of the alphabet:
• An infinite alphabet?
• Discretizing?
• An ordered alphabet?
GIORGIO BERNARDI, REGINA GOURSOT, EDDA RAYKO, RENÉ GOURSOT, BAYA CHERIF-ZAHAR, AND ROBERTA MELIS
http://www.scopenvironment.org/downloadpubs/scope44/chapter05.html
>A BAC=41M14 LIBRARY=CITB_978_SKB
AAGCTTATTCAATAGTTTATTAAACAGCTTCTTAAATAGGATATAAGGCAGTGCCATGTA
GTGGATAAAAGTAATAATCATTATAATATTAAGAACTAATACATACTGAACACTTTCAAT
GGCACTTTACATGCACGGTCCCTTTAATCCTGAAAAAATGCTATTGCCATCTTTATTTCA
GAGACCAGGGTGCTAAGGCTTGAGAGTGAAGCCACTTTCCCCAAGCTCACACAGCAAAGA
CACGGGGACACCAGGACTCCATCTACTGCAGGTTGTCTGACTGGGAACCCCCATGCACCT
GGCAGGTGACAGAAATAGGAGGCATGTGCTGGGTTTGGAAGAGACACCTGGTGGGAGAGG
GCCCTGTGGAGCCAGATGGGGCTGAAAACAAATGTTGAATGCAAGAAAAGTCGAGTTCCA
GGGGCATTACATGCAGCAGGATATGCTTTTTAGAAAAAGTCCAAAAACACTAAACTTCAA
CAATATGTTCTTTTGGCTTGCATTTGTGTATAACCGTAATTAAAAAGCAAGGGGACAACA
CACAGTAGATTCAGGATAGGGGTCCCCTCTAGAAAGAAGGAGAAGGGGCAGGAGACAGGA
TGGGGAGGAGCACATAAGTAGATGTAAATTGCTGCTAATTTTTCTAGTCCTTGGTTTGAA
TGATAGGTTCATCAAGGGTCCATTACAAAAACATGTGTTAAGTTTTTTAAAAATATAATA
AAGGAGCCAGGTGTAGTTTGTCTTGAACCACAGTTATGAAAAAAATTCCAACTTTGTGCA
TCCAAGGACCAGATTTTTTTTAAAATAAAGGATAAAAGGAATAAGAAATGAACAGCCAAG
TATTCACTATCAAATTTGAGGAATAATAGCCTGGCCAACATGGTGAAACTCCATCTCTAC
TAAAAATACAAAAATTAGCCAGGTGTGGTGGCTCATGCCTGTAGTCCCAGCTACTTGCGA
GGCTGAGGCAGGCTGAGAATCTCTTGAACCCAGGAAGTAGAGGTTGCAGTAGGCCAAGAT
GGCGCCACTGCACTCCAGCCTGGGTGACAGAGCAAGACCCTATGTCCAAAAAAAAAAAAA
AAAAAAAGGAAAAGAAAAAGAAAGAAAACAGTGTATATATAGTATATAGCTGAAGCTCCC
TGTGTACCCATCCCCAATTCCATTTCCCTTTTTTGTCCCAGAGAACACCCCATTCCTGAC
TAGTGTTTTATGTTCCTTTGCTTCTCTTTTTAAAAACTTCAATGCACACATATGCATCCA
TGAACAACAGATAGTGGTTTTTGCATGACCTGAAACATTAATGAAATTGTATGATTCTAT
http://bandelestudio.com/tutoriel-mao-sur-la-creation-musicale/
http://fr.wikipedia.org/wiki/Philippe_VI_de_France
<book>
  <part>
    <chapter>
      <sect1/>
      <sect1>
        <orderedlist numeration="arabic">
          <listitem/>
          <f:fragbody/>
        </orderedlist>
      </sect1>
    </chapter>
  </part>
</book>
<?xml version="1.0"?>
<?xml-stylesheet href="carmen.xsl" type="text/xsl"?>
<?cocoon-process type="xslt"?>
<!DOCTYPE pagina [
<!ELEMENT pagina (titulus?, poema)>
<!ELEMENT titulus (#PCDATA)>
<!ELEMENT auctor (praenomen, cognomen, nomen)>
<!ELEMENT praenomen (#PCDATA)>
<!ELEMENT nomen (#PCDATA)>
<!ELEMENT cognomen (#PCDATA)>
<!ELEMENT poema (versus+)>
<!ELEMENT versus (#PCDATA)>
]>
<pagina>
<titulus>Catullus II</titulus>
<auctor>
<praenomen>Gaius</praenomen>
<nomen>Valerius</nomen>
<cognomen>Catullus</cognomen>
</auctor>
And also
• Business processes
• Bird songs
• Images (contours and shapes)
• Robot moves
• Web services
• Malware
• …
2 What does learning mean?
• Suppose we write a program that can learn grammars… are we done?
• A first question is: "why bother?"
• If my program works, why do anything more about it?
• Why should we do something when other researchers in Machine Learning are not?
Motivating reflection #1
• Is 17 a random number?
• Is 0110110110110101011000111101 a random sequence?
• (Is grammar G the correct grammar for a given sample S?)
Motivating reflection #2
• In the case of languages, learning is an ongoing process
• Is there a moment where we can say we have learnt a language?
Motivating reflection #3
• The statement "I have learnt" does not make sense
• The statement "I am learning" makes sense
• At least when learning over infinite spaces
What is usually called "having learnt"
• That the grammar/automaton is the smallest, or the best with respect to some score
  - A combinatorial characterisation
• That some optimisation problem has been solved
• That the "learning" algorithm has converged (EM)
What is not said
• That, having solved some complex combinatorial question, we have an Occam / compression / MDL / Kolmogorov-complexity style argument which gives us some guarantee with respect to the future
• Computational learning theory has such results