Automatic Learning of a Morphological Model: Theory and Unsupervised Approaches
Unsupervised Learning - ToC

Foundations
Problem description
General architecture
Papers:
– Goldsmith 2001 and 2006
– Goldsmith and Hu 2004
Unsupervised Models - Foundations

Saffran et al. 1996: adults are capable of rapidly discovering word units in a stream of a nonsense language, without any connection to meaning.
Creutz and Lagus 2007: this suggests that humans use distributional cues, such as transition probabilities between sounds, in language learning, and that such statistical patterns in language data can be successfully exploited by appropriately designed algorithms.
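To make the distributional-cue idea concrete, here is a minimal Python sketch (not from the papers) that estimates transition probabilities between adjacent syllables in a Saffran-style nonsense stream; the stream contents and the helper name are illustrative assumptions.

```python
from collections import Counter

def transition_probs(syllables):
    """Estimate P(next syllable | current syllable) from adjacent pairs."""
    pair_counts = Counter(zip(syllables, syllables[1:]))
    first_counts = Counter(syllables[:-1])
    return {(a, b): n / first_counts[a] for (a, b), n in pair_counts.items()}

# An illustrative nonsense stream built from the "words" bidaku, padoti, golabu:
stream = "bidakupadotigolabubidakugolabupadoti"
syllables = [stream[i:i + 2] for i in range(0, len(stream), 2)]
probs = transition_probs(syllables)
# Word-internal transitions such as bi -> da always occur (probability 1),
# while cross-boundary transitions such as ku -> pa are less predictable;
# dips in transition probability hint at word boundaries.
```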
Unsupervised Models - Problem description

Input:
– Untagged text in orthographic form; words are separated.
– No syntactic, semantic, or phonological information is given.
Output:
– A lexicon of morphemes (stems and affixes). Frequent unsegmented words might also be included (a mixed lexicon; see Creutz and Lagus).
– An analysis of each word in the corpus by segmentation into morphemes.
Results evaluation:
– The analysis should be as close as possible to the gold standard, obtained by manual segmentation.
– The data should be represented efficiently.
Unsupervised Models – Components

Bootstrapping heuristic: creates the initial morphological model.
Incremental heuristics: create an improved morphological model based on the existing one.
Evaluation model: compares morphological models and tells whether a significant improvement was achieved.
Unsupervised Models – Flow diagram

[Flow diagram] Corpus → Bootstrapping Heuristic → Model of Morphology → Incremental Heuristics → New Model of Morphology → Evaluate (Evaluation Model: MDL, MAP):
– If the new model is (much) better: replace the old model and continue.
– If there is not much improvement: stop.
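The flow above amounts to a generic search loop. The following Python sketch assumes hypothetical interfaces: bootstrap, heuristics, and description_length are placeholders, and the improvement threshold EPSILON is our own addition.

```python
EPSILON = 1.0  # hypothetical threshold, in bits, for a "significant" improvement

def learn_morphology(corpus, bootstrap, heuristics, description_length):
    """Generic learning loop: bootstrap a model, then repeatedly apply
    incremental heuristics, keeping a candidate only when the evaluation
    model (an MDL/MAP score; lower is better) improves significantly."""
    model = bootstrap(corpus)
    best = description_length(corpus, model)
    improved = True
    while improved:
        improved = False
        for heuristic in heuristics:
            candidate = heuristic(model, corpus)
            score = description_length(corpus, candidate)
            if score < best - EPSILON:    # new model is (much) better
                model, best = candidate, score
                improved = True
    return model                          # not much improvement: stop
```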
Unsupervised Models – Preface

Goldsmith 2001/2006:
– Recursive segmentation into stem+suffix / prefix+stem.
– Evaluation in terms of Minimum Description Length.
Goldsmith and Hu 2004:
– An NFSA holds layered morpheme compositions.
– Evaluation in terms of Minimum Description Length.
Creutz and Lagus 2005/2007:
– An HMM of categories emitting morphemes.
– Evaluation in terms of Maximum A Posteriori.
Unsupervised Models - Methodologies - 1

Tradeoff between restrictiveness and flexibility:
– An overly restrictive model may exclude all optimal and near-optimal models, making it impossible to learn a good model regardless of how much data and computation time is spent.
– An overly flexible model is very hard to learn, as it requires impractical amounts of data and computation.
Unsupervised Models - Methodologies - 2

Minimum Description Length (Rissanen 1989):
– Main ideas: every regularity in data may be used to compress that data; learning can be equated with finding regularities in data.
– A formalization of Occam's Razor: the best hypothesis for a given set of data is the one that leads to the largest compression of the data.
– Choose the best model by simultaneously considering model accuracy and model complexity.
– Simpler models are favored over complex ones. This generally improves generalization capacity by inhibiting overlearning (overfitting).
Unsupervised Models - Methodologies - 3

The MDL formulation used in Goldsmith's papers:

$$M^{*} = \operatorname*{arg\,min}_{M} \mathrm{DescriptionLength}(C, M) = \operatorname*{arg\,min}_{M} \left[ \mathrm{length}(M) + \log_2 \frac{1}{P(C \mid M)} \right]$$

An alternative formulation, used in the papers by Creutz and Lagus, is Maximum A Posteriori (MAP):

$$\mathrm{Lexicon}^{*} = \operatorname*{arg\,max}_{\mathrm{Lexicon}} P(\mathrm{Lexicon} \mid \mathrm{Corpus})$$

Chen 1996: the two approaches are equivalent with respect to the task discussed.
Unsupervised Models – Goldsmith 2001/6

The signature concept:
– The list of affixes appearing with a stem, e.g. jump: NULL.ed.ing.s
– Each stem has a unique signature.
The structure of the morphological model:
– A list of stems, e.g. { cat, jump, laugh, hat, walk, sav }
– A list of affixes, e.g. { NULL, ed, ing, s, e, es }
– A list of signatures and their associated stems; each signature holds pointers into the stem and affix lists, e.g. stems { ptr(jump), ptr(walk), ptr(laugh) } with affixes { ptr(NULL), ptr(ed), ptr(ing), ptr(s) }.
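A signature can be represented directly as a pair of pointer sets. A minimal sketch (the class and field names are ours, not Goldsmith's):

```python
from dataclasses import dataclass, field

@dataclass
class Signature:
    """A signature: the set of affixes shared by a set of stems."""
    affixes: frozenset               # e.g. frozenset({"NULL", "ed", "ing", "s"})
    stems: set = field(default_factory=set)

# The slide's example signature NULL.ed.ing.s with three stems:
sig = Signature(frozenset({"NULL", "ed", "ing", "s"}),
                {"jump", "walk", "laugh"})
```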
Goldsmith 2001/6 - Samples

Some signatures from The Adventures of Tom Sawyer:

Signature        Sample of tokens                        # Stems   # Tokens
NULL.ed.ing      betray, betrayed, betraying             69        816
NULL.ed.ing.s    remain, remained, remaining, remains    14        516
NULL.s           cow, cows                               253       3414
e.ed.es.ing      notice, noticed, notices, noticing      4         62
Goldsmith 2001/6 - Bootstrapping heuristic - 1

Creates the initial (rough) morphological model: cuts words into morphemes (stem + affix) and builds the lists.
The most effective method is based on an early proposal by Harris (1955, 1967), called successor frequency.
The idea: a word should be cut at the point where the succeeding character is least predictable.
Goldsmith 2001/6 - Bootstrapping heuristic - 2

An example of successor frequency:
– Consider the word: government
– Assume that empirically: { n } follows gover; { e, i, m, o, s, # } follows govern; { e } follows governm
– We get the successor frequencies: g o v e r [1] n [6] m [1] e n t
– The peak successor frequency tells us to cut the word into govern + ment.
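A direct way to compute successor frequencies over a vocabulary, as a sketch (the function name and the '#' end-of-word marker convention are ours):

```python
def successor_frequencies(word, vocabulary):
    """For each proper prefix of `word`, count the distinct characters
    that follow that prefix across the vocabulary ('#' marks word end)."""
    freqs = []
    for i in range(1, len(word)):
        prefix = word[:i]
        successors = {w[i] if len(w) > i else "#"
                      for w in vocabulary if w.startswith(prefix)}
        freqs.append((prefix, len(successors)))
    return freqs

# With a vocabulary matching the slide's assumptions, the entry for the
# prefix "govern" has frequency 6 ({e, i, m, o, s, #}), so the peak
# suggests the cut govern + ment.
```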
Goldsmith 2001/6 - Bootstrapping heuristic - 3

Difficulties with successor frequency:
– Consider: c [9] o [18] n [11] s [6] e [4] r [1] v [2] a [1] t [1] i [2] v [1] e [1] s, with local peaks after co, conserv, and conservati.
– This may lead to under-cutting or over-cutting. How could one decide?
Set constraints (higher precision, lower recall):
– The stem length must be at least 5.
– The number of stems in a signature must be at least 5.
– Absolute peaks only: the successor frequencies around the cut must form the pattern 1, N, 1 (a single peak N flanked by 1s).
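These constraints can be layered on top of the successor-frequency computation above. A sketch: the 1, N, 1 absolute-peak test and the minimum stem length come from the slide, while the helper name and default value are ours; the per-signature stem-count constraint applies later, once signatures are built.

```python
def absolute_peak_cuts(word, vocabulary, min_stem_len=5):
    """Propose cut points only at absolute peaks: a successor frequency
    N > 1 flanked by frequencies of exactly 1 (the 1, N, 1 pattern),
    with a stem of at least `min_stem_len` letters."""
    freqs = [n for _, n in successor_frequencies(word, vocabulary)]
    cuts = []
    for i in range(1, len(freqs) - 1):
        stem_len = i + 1                 # freqs[i] follows word[:i + 1]
        if (freqs[i - 1] == 1 and freqs[i] > 1 and freqs[i + 1] == 1
                and stem_len >= min_stem_len):
            cuts.append(stem_len)        # cut into word[:stem_len] + rest
    return cuts
```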
Goldsmith 2001/6 - Incremental Heuristics - 1

General idea:
– Try to reorganize the lists in the model.
– Evaluate the model length with and without the change, and proceed accordingly.
Create signatures by a loose-fit strategy (a sketch follows this slide):
– For every known suffix F, for every word that can be split into S+F: collect all the suffixes of S and suggest a new signature.
The Check Signatures function:
– Move letters from stems to suffixes (slide the boundary left): examine each signature and suggest moving letters.
– Example 1: consider the i in -ion or -ive.
– Example 2: consider the words ending with -able and -ible.
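A minimal sketch of the loose-fit collection step (the names are ours; accepting a candidate signature still requires the description-length check):

```python
from collections import defaultdict

def loose_fit_candidates(words, known_suffixes):
    """For each known suffix f and each word w = s + f, record f under
    the stem s; a stem seen with several suffixes yields a candidate
    signature."""
    suffixes_of = defaultdict(set)
    for w in set(words):
        suffixes_of[w].add("NULL")       # every word is also stem + NULL
        for f in known_suffixes:
            if f != "NULL" and len(f) < len(w) and w.endswith(f):
                suffixes_of[w[:-len(f)]].add(f)
    return {s: frozenset(fs) for s, fs in suffixes_of.items() if len(fs) > 1}
```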
Goldsmith 2001/6 - Incremental Heuristics - 2

Extending to unanalyzed words:
– Recall the conservatism of the bootstrapping successor-frequency peak constraint (1, N, 1).
– No account was given for words like derivation and derivative (they could not be analyzed as deriv-ation and deriv-ative).
– Heuristic: for every unanalyzed word, suggest a cut into a known stem + suffix (prefer the most common stem).
Slide the stem-suffix boundary right:
– Consider signatures whose suffixes all share the same prefix, e.g. te.ting.ts.
– Suggest sliding the boundary right, thus creating e.ing.s (see the sketch below).
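The slide-right operation reduces to a common-prefix computation over a signature's suffixes. A sketch (the function name is ours; os.path.commonprefix computes character-wise common prefixes of any string list):

```python
import os

def slide_boundary_right(affixes):
    """If all affixes share a leading string (e.g. te.ting.ts share 't'),
    move it into the stem, yielding e.ing.s; an affix consumed entirely
    becomes NULL."""
    common = os.path.commonprefix(list(affixes))
    if not common:
        return "", set(affixes)
    return common, {a[len(common):] or "NULL" for a in affixes}

# slide_boundary_right({"te", "ting", "ts"}) -> ("t", {"e", "ing", "s"})
```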
Goldsmith 2001/6 – Description Length Evaluation 1

Evaluating the description length in MDL. Recall the formula:

$$\mathrm{DL} = \mathrm{length}(M) + \log_2 \frac{1}{P(C \mid M)}$$

Evaluating the model length:

$$\mathrm{length}(M) = \mathrm{length}(\mathrm{stems}) + \mathrm{length}(\mathrm{affixes}) + \mathrm{length}(\mathrm{signatures})$$
Goldsmith 2001/6 – Description Length Evaluation 2

Stems list structure: a count N of stems, followed by a pointer and the data for each stem (Stem 1 … Stem N).

The stems list (T) length, with bits for the item count, bits for each stem's letters, and bits for each stem's pointer ([t in W] is t's token count in the corpus W):

$$\mathrm{length}(T) = \log_2 |T| + \sum_{t \in T} \left( \log_2 26 \cdot \mathrm{length}(t) + \log_2 \frac{|W|}{[t\ \mathrm{in}\ W]} \right)$$

The affixes list (F) length, analogously:

$$\mathrm{length}(F) = \log_2 |F| + \sum_{f \in F} \left( \log_2 26 \cdot \mathrm{length}(f) + \log_2 \frac{|W|}{[f\ \mathrm{in}\ W]} \right)$$
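The stem and affix list lengths translate directly into code. A sketch, assuming token_counts maps each item to its corpus token count [t in W] and total_tokens is |W|:

```python
import math

def list_length(items, token_counts, total_tokens, alphabet_size=26):
    """Bits to encode a stem or affix list: the item count, each item's
    letters, and each item's frequency-based pointer cost."""
    bits = math.log2(len(items))                              # item count
    for item in items:
        bits += math.log2(alphabet_size) * len(item)          # the letters
        bits += math.log2(total_tokens / token_counts[item])  # pointer
    return bits
```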
Goldsmith 2001/6 – Description Length Evaluation 3

Evaluating the model length (cont'). The signatures list (S) length:
– bits for the signature count: $\log_2 |S|$
– bits for the signature pointers: $\sum_{\sigma \in S} \log_2 \frac{|W|}{[\sigma]}$
– for each signature $\sigma$, bits for its stem count and stem pointers: $\log_2 |\mathrm{Stems}(\sigma)| + \sum_{t \in \mathrm{Stems}(\sigma)} \log_2 \frac{|W|}{[t]}$
– and bits for its affix count and affix pointers: $\log_2 |\mathrm{Affixes}(\sigma)| + \sum_{f \in \mathrm{Affixes}(\sigma)} \log_2 \frac{[\sigma]}{[f\ \mathrm{in}\ \sigma]}$
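The signature-list term, with the same conventions; the `counts` object (exposing signature, stem, and affix-within-signature token counts) is a hypothetical stand-in for the model's bookkeeping:

```python
import math

def signature_list_length(signatures, counts, total_tokens):
    """Bits for the signature list, following the slide's formula."""
    bits = math.log2(len(signatures))                     # signature count
    for s in signatures:
        bits += math.log2(total_tokens / counts.sig(s))   # pointer to s
        bits += math.log2(len(s.stems))                   # stem count in s
        bits += sum(math.log2(total_tokens / counts.stem(t))
                    for t in s.stems)                     # stem pointers
        bits += math.log2(len(s.affixes))                 # affix count in s
        bits += sum(math.log2(counts.sig(s) / counts.affix_in_sig(s, f))
                    for f in s.affixes)                   # affix pointers
    return bits
```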
Goldsmith 2001/6 – Description Length Evaluation 4

Evaluating the probability of the corpus decomposition:

$$\log_2 \frac{1}{P(C \mid M)} = \sum_{w \in C} \log_2 \frac{1}{P(w \mid M)} = -\sum_{w \in C} \log_2 P(w \mid M)$$

Evaluating the probability of the decomposition of each word w = t + f:

$$P(w \mid M) = P(w = t + f \mid M) = P(\mathrm{sig}(w) \mid M) \cdot P(t \mid \mathrm{sig}(w), M) \cdot P(f \mid \mathrm{sig}(w), M)$$

$$-\log_2 P(w \mid M) = \log_2 \frac{|W|}{[\mathrm{sig}(w)]} + \log_2 \frac{[\mathrm{sig}(w)]}{[\mathrm{stem}(w)\ \mathrm{in}\ \mathrm{sig}(w)]} + \log_2 \frac{[\mathrm{sig}(w)]}{[\mathrm{affix}(w)\ \mathrm{in}\ \mathrm{sig}(w)]}$$
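And the compressed-corpus term, summing the three pointer costs per word token; `analyses` maps each word to its (signature, stem, affix) triple, and `counts` is the same hypothetical interface as above:

```python
import math

def corpus_cost(corpus_words, analyses, counts, total_tokens):
    """-log2 P(C|M): for each word token, the cost of pointing to its
    signature, then to its stem and affix within that signature."""
    bits = 0.0
    for w in corpus_words:
        s, t, f = analyses[w]
        bits += math.log2(total_tokens / counts.sig(s))
        bits += math.log2(counts.sig(s) / counts.stem_in_sig(s, t))
        bits += math.log2(counts.sig(s) / counts.affix_in_sig(s, f))
    return bits
```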
Goldsmith 2001/6 – An Example of Loose-fit - 1

Assume that the corpus contains: { act, acted, action, acts, acting }.
The bootstrapping heuristic adds all five words to the stems list and also { NULL, ed, ion, s, ing } to the affixes list.
Loose fit: consider adding the signature NULL.ed.ion.s.ing (with the single stem act) instead of the separate word entries:
– Evaluate the cost of each alternative.
– Adopt the change only if the new alternative is cheaper.
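Running the loose-fit sketch from earlier on this corpus produces exactly this candidate:

```python
words = ["act", "acted", "action", "acts", "acting"]
suffixes = ["NULL", "ed", "ion", "s", "ing"]
loose_fit_candidates(words, suffixes)
# -> {"act": frozenset({"NULL", "ed", "ion", "s", "ing"})}
```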