Multiple Word Alignment with Profile Hidden Markov Models

Multiple Word Alignment with Profile Hidden Markov Models Aditya Bhargava and Grzegorz Kondrak Department of Computing Science University of Alberta {abhargava,kondrak}@cs.ualberta.ca

2 Multiple word alignment • Given multiple words, align them all to each other • Our approach: Profile HMMs, used in biological sequence analysis • Use match, insert, and delete states to model changes • Evaluate on cognate set matching ▫ Beat baselines of average and minimum edit distance

3 What you can expect • Introduction: word alignment • Profile hidden Markov models ▫ For bioinformatics ▫ For words? • Experiments • Conclusions & future work

4 Introduction • Multiple word alignment: ▫ Take a set of words ▫ Generate some alignment of these words ▫ Similar and equivalent characters should be aligned together • Pairwise alignment gets us: ▫ String similarity and word distances ▫ Cognate identification ▫ Comparative reconstruction

5 Introduction • Extending to multiple words gets us: ▫ String similarity with multiple words ▫ Better-informed cognate identification ▫ Better-informed comparative reconstruction • We propose Profile HMMs for multiple alignment ▫ Test on cognate set matching

6 Profile hidden Markov models

7 Profile hidden Markov models • Match states are the “defaults” • Insert states represent symbols inserted between match positions • Delete states represent the absence of a symbol at a match position

8 Profile hidden Markov models
    MMIIIM
    AG...C
    A-AG.C
    AG.AA-
    --AAAC
    AG...C
• In this sample DNA alignment, dashes represent deletes and periods represent skipped inserts


16 Profile hidden Markov models • To construct a Profile HMM from aligned sequences: ▫ Determine which columns are match columns and which are insert columns, then estimate transition and emission probabilities directly from counts • To construct a Profile HMM from unaligned sequences: ▫ Choose a model length, initialize the model, then train it to the sequences using Baum-Welch
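The count-based construction from aligned sequences can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `count_profile`, the state labels `("M", j)`, `("I", j)`, `("D", j)`, and the Begin/End markers `"B"` and `"E"` are assumptions made for this example. The input is the sample DNA alignment from slide 8, with its match/insert marking row.

```python
from collections import Counter

def count_profile(marking, rows):
    """Count emissions and transitions implied by a marked alignment.

    marking: one 'M' (match column) or 'I' (insert column) per column.
    rows:    aligned sequences; '-' is a delete, '.' a skipped insert.
    State labels such as ("M", 2) are a convention chosen for this sketch.
    """
    emissions = Counter()    # (state, symbol) -> count
    transitions = Counter()  # (from_state, to_state) -> count
    for row in rows:
        prev, j = ("B", 0), 0          # Begin state, no match column seen yet
        for kind, ch in zip(marking, row):
            if kind == "M":
                j += 1                 # advance to the next match position
                state = ("D", j) if ch == "-" else ("M", j)
            elif ch in ".-":
                continue               # this row does not use the insert column
            else:
                state = ("I", j)       # insert after match position j
            if state[0] != "D":        # delete states emit nothing
                emissions[state, ch] += 1
            transitions[prev, state] += 1
            prev = state
        transitions[prev, ("E", j + 1)] += 1
    return emissions, transitions

marking = "MMIIIM"
rows = ["AG...C", "A-AG.C", "AG.AA-", "--AAAC", "AG...C"]
emissions, transitions = count_profile(marking, rows)
print(emissions[("M", 1), "A"])   # 4: four of the five rows have A in the first match column
```

Dividing each count by the total for its state gives the maximum-likelihood probabilities; in practice pseudocounts are added first so that unseen symbols and transitions do not get probability zero.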

17 Profile hidden Markov models • Evaluating a sequence for membership in a family ▫ Use the forward algorithm to get the probability ▫ Use Viterbi to align the sequence • Multiple alignment of unaligned sequences ▫ Construct & train a Profile HMM ▫ Use Viterbi to align the sequences
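As a rough sketch of the membership score, the forward algorithm for a profile HMM sums the probability of every state path that emits the sequence. The data layout below (`match_emit`, `insert_emit`, `trans`, and the two-letter transition keys such as `"MI"`) is an assumption made for this example, and a real implementation would work in log space to avoid underflow on long sequences.

```python
def forward_probability(seq, match_emit, insert_emit, trans):
    """P(seq | model) for a profile HMM with L = len(match_emit) match states.

    match_emit[j-1][c]: emission probability of symbol c in match state M_j.
    insert_emit[c]:     emission probability of c, shared by all insert states.
    trans[j]["XY"]:     transition probability from a state of type X at
                        position j to type Y. "M" and "D" targets sit at
                        position j+1, "I" targets at position j. Position 0
                        is Begin, and an "M" target from position L is End.
    (This layout is hypothetical, chosen to keep the sketch short.)
    """
    L, n = len(match_emit), len(seq)
    # f?[j][i]: probability of emitting seq[:i] and being in that state at position j
    fM = [[0.0] * (n + 1) for _ in range(L + 1)]
    fI = [[0.0] * (n + 1) for _ in range(L + 1)]
    fD = [[0.0] * (n + 1) for _ in range(L + 1)]
    fM[0][0] = 1.0  # Begin is treated as M_0 and emits nothing
    for j in range(L + 1):
        for i in range(n + 1):
            if j > 0 and i > 0:   # enter M_j by emitting seq[i-1]
                t = trans[j - 1]
                reach = (fM[j - 1][i - 1] * t["MM"] + fI[j - 1][i - 1] * t["IM"]
                         + fD[j - 1][i - 1] * t["DM"])
                fM[j][i] = match_emit[j - 1].get(seq[i - 1], 0.0) * reach
            if j > 0:             # enter D_j without emitting
                t = trans[j - 1]
                fD[j][i] = (fM[j - 1][i] * t["MD"] + fI[j - 1][i] * t["ID"]
                            + fD[j - 1][i] * t["DD"])
            if i > 0:             # enter I_j by emitting seq[i-1]
                t = trans[j]
                reach = (fM[j][i - 1] * t["MI"] + fI[j][i - 1] * t["II"]
                         + fD[j][i - 1] * t["DI"])
                fI[j][i] = insert_emit.get(seq[i - 1], 0.0) * reach
    t = trans[L]                  # final step into End
    return fM[L][n] * t["MM"] + fI[L][n] * t["IM"] + fD[L][n] * t["DM"]
```

Replacing each sum with a maximum and recording which term won gives the Viterbi variant, whose best path is the alignment of the sequence to the model.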

18 Profile hidden Markov models • Profile HMMs are generalizations of Pair HMMs ▫ Pair HMMs have been used for word similarity and cognate identification • Unlike Pair HMMs, Profile HMMs are position-specific ▫ Each model is constructed from a specific family of sequences ▫ Pair HMMs are trained over many pairs of words

19 Profile HMMs for words • Words are also sequences! • Similar to their use for biological sequences, we apply Profile HMMs to multiple word alignment • We also test Profile HMMs on matching words to cognate sets • We made our own implementation and investigated several parameters

20 Profile HMMs: parameters • Favour match states? • Pseudocount methods ▫ Constant-value, background frequency, substitution matrix • Pseudocount weight • Pseudocounts added during Baum-Welch
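The slides do not give formulas for the three pseudocount methods, so the sketch below assumes the standard formulations from the biological sequence analysis literature. In this example, "constant" adds `weight` to every count, "background" adds `weight` times a background frequency, and "substitution" spreads `weight` according to how likely each symbol is to replace the symbols actually observed. The function name `emission_probs` and its parameters are hypothetical.

```python
def emission_probs(counts, alphabet, weight, method="constant",
                   background=None, subst=None):
    """Turn observed symbol counts for one state into smoothed probabilities.

    counts:     {symbol: observed count} for a single match or insert state.
    weight:     the pseudocount weight.
    background: {symbol: frequency}, used by method="background".
    subst:      subst[b][a] = P(a | b), used by method="substitution".
    All three formulas are assumptions, not taken from the slides.
    """
    total = sum(counts.get(a, 0) for a in alphabet)
    pseudo = {}
    for a in alphabet:
        if method == "constant":
            pseudo[a] = weight
        elif method == "background":
            pseudo[a] = weight * background[a]
        elif method == "substitution":
            # Observed frequency of b, or uniform if the state saw no symbols
            pseudo[a] = weight * sum(
                (counts.get(b, 0) / total if total else 1 / len(alphabet))
                * subst[b][a]
                for b in alphabet)
        else:
            raise ValueError(f"unknown pseudocount method: {method}")
    norm = total + sum(pseudo.values())
    return {a: (counts.get(a, 0) + pseudo[a]) / norm for a in alphabet}
```

With a substitution matrix, a column that has only ever seen one symbol still gives more probability to symbols that commonly substitute for it than to unrelated ones, which is the behaviour the alignment slide attributes to this method.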

21 Experiments: Data • Comparative Indoeuropean Data Corpus ▫ Cognation data for words in 95 languages corresponding to 200 meanings • Each meaning reorganized into disjoint cognate sets

22 Experiments: Multiple cognate alignment • Parameters determined from cognate set matching experiments (later) • Pseudocount weight set to 100 to bias the model using a substitution matrix • Highly-conserved columns are aligned correctly • Similar-sounding characters are also aligned correctly, thanks to the substitution matrix method • Insert columns should not be considered aligned • Problems with multi-character phonemes ▫ An expected problem when using the English alphabet instead of e.g. IPA
    MIIMIIMI
    D--E--N-
    D--E--NY
    Z--E--N-
    DZ-E--N-
    DZIE--N-
    D--A--N-
    DI-E--NA
    D--E--IZ
    D--E----
    D--Y--DD
    D--I--A-
    D--I--E-
    D-----I-
    Z-----I-
    Z--U--E-
    Z-----U-
    J--O--UR
    DJ-O--U-
    J--O--UR
    G--IORNO

23 Experiments: Cognate set matching • How can we evaluate the alignments in a principled way? There is no gold standard! • We emulate the biological sequence analysis task of matching a sequence to a family; we match a word to a cognate set • The task is to correctly identify the cognate set to which a word belongs given a number of cognate sets having the same meaning as the word; we choose the model yielding the highest score

24 Experiments: Cognate set matching • Development set of 10 meanings (~5% of the data) • Substitution matrix derived from Pair HMM method • Best parameters: ▫ Favour match states ▫ Use substitution matrix pseudocount ▫ Use 0.5 for pseudocount weight ▫ Add pseudocounts during Baum-Welch

25 Experiments: Cognate set matching • [Bar chart: accuracy by method] ▫ Average Edit Distance: 77.0% ▫ Minimum Edit Distance: 91.0% ▫ Profile HMM: 93.2%

26 Experiments: Cognate set matching • Accuracy better than both average and minimum edit distance • Why so close to MED? ▫ Many sets had duplicate words (same orthographic representation for different languages)
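The two edit distance baselines can be sketched as follows, assuming plain Levenshtein distance with unit costs (the slides do not specify the cost scheme). The words come from the alignment on slide 22 with gaps removed, but the grouping into `set_a` and `set_b` is invented purely for illustration and is not the corpus's cognate classification.

```python
from statistics import mean

def edit_distance(a, b):
    """Levenshtein distance with unit insert, delete, and substitute costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]

def match_word(word, cognate_sets, aggregate=min):
    """Return the name of the set whose aggregated distance to word is lowest.

    aggregate=min gives the minimum edit distance baseline,
    aggregate=mean the average edit distance baseline.
    """
    return min(cognate_sets,
               key=lambda name: aggregate(
                   [edit_distance(word, w) for w in cognate_sets[name]]))

# Hypothetical grouping, for illustration only
toy_sets = {"set_a": ["DEN", "DZIEN", "DIENA"], "set_b": ["JOUR", "GIORNO"]}
print(match_word("ZEN", toy_sets, aggregate=min))    # set_a
print(match_word("JOUR", toy_sets, aggregate=mean))  # set_b
```

When the query word also appears verbatim in its own set, as "JOUR" does here, the minimum distance is 0 and the minimum edit distance baseline cannot miss, which is consistent with the slide's explanation of why MED scores so close to the Profile HMM.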

27 Conclusions • Profile HMMs can work for word-related tasks • Multiple alignments are reasonable • Cognate set matching performance exceeds minimum and average edit distance • If multiple words need to be considered, Profile HMMs present a viable method

28 Future work • Better model construction from aligned sequences • Better initial models for unaligned sequences • Better pseudocount methods • N-gram output symbols
