A Stochastic Memoizer for Sequence Data

Frank Wood (Gatsby), Cédric Archambeau (UCL), Jan Gasthaus (Gatsby), Lancelot James (HKUST), Yee Whye Teh (Gatsby)
Overview

• Model
  – Smoothing Markov model of discrete sequences
  – Extension of the hierarchical Pitman-Yor process [Teh 2006]
    • Unbounded depth (context length)
• Algorithms and estimation
  – Linear-time suffix-tree graphical model identification and construction
  – Standard Chinese restaurant franchise sampler
• Results
  – Maximum contextual information used during inference
  – Competitive language modelling results
    • Limit of the n-gram language model as n → ∞
  – Same computational cost as a Bayesian interpolating 5-gram language model
The Sequence Memoizer

• Uses
  – Any situation in which a low-order Markov model of discrete sequences is insufficient
  – Drop-in replacement for a smoothing Markov model
• Name?
  – ‘‘A Stochastic Memoizer for Sequence Data’’ → Sequence Memoizer (SM)
  – Describes posterior inference [Goodman et al ‘08]
Constructing Markov models of sequences

• Sequence Markov models are usually constructed by treating a sequence as a set of (exchangeable) observations in fixed-length contexts

  oacac →   unigram     bigram     trigram     4-gram
            o | []
            a | []      a | o
            c | []      c | a      c | ao
            a | []      a | c      a | ca      a | cao
            c | []      c | a      c | ac      c | aca

  (contexts are written most-recent-symbol-first)

  Increasing context length / order of Markov model
  Decreasing number of observations
  Increasing number of conditional distributions to estimate (indexed by context)
  Increasing power of model
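A minimal Python sketch (not from the slides) of the enumeration above: it lists the (observation | context) pairs of "oacac" for each model order, writing contexts most-recent-symbol-first as on the slide.

    # Sketch: enumerate (observation | context) pairs for "oacac" at several Markov orders.
    def contexts(seq, order):
        """Yield (symbol, context) pairs; context = previous (order-1) symbols, reversed."""
        n = order - 1                      # context length
        for i in range(n, len(seq)):       # only positions with a full-length context
            ctx = seq[i - n:i][::-1]       # most recent symbol first, as on the slide
            yield seq[i], ctx or "[]"

    for order, name in [(1, "unigram"), (2, "bigram"), (3, "trigram"), (4, "4-gram")]:
        pairs = list(contexts("oacac", order))
        print(name, ["%s|%s" % p for p in pairs])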
Finite-order Markov model

  P(x_{1:N}) = ∏_{i=1}^{N} P(x_i | x_1, ..., x_{i−1})
             ≈ ∏_{i=1}^{N} P(x_i | x_{i−n+1}, ..., x_{i−1}),    n = 2
             = P(x_1) P(x_2 | x_1) P(x_3 | x_2) P(x_4 | x_3) ...

• Example
  P(oacac) = P(o) P(a | o) P(c | a) P(a | c) P(c | a)
           = G_[](o) G_[o](a) G_[a](c) G_[c](a) G_[a](c)
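As an illustration of this factorization, here is a small sketch that evaluates an n-gram probability from a table of conditional distributions. Only the structure follows the slide; the numerical values in G are invented for the example.

    # Sketch: evaluate P(x_{1:N}) under an n-gram factorization, given (hypothetical)
    # conditional probability tables G[context][symbol]. Contexts are plain strings here.
    def ngram_prob(seq, G, n=2):
        p = 1.0
        for i, s in enumerate(seq):
            ctx = seq[max(0, i - (n - 1)):i]   # truncated context near the start
            p *= G[ctx][s]
        return p

    # Hypothetical bigram conditionals over the alphabet {o, a, c}:
    G = {"":  {"o": 0.3, "a": 0.4, "c": 0.3},
         "o": {"o": 0.1, "a": 0.6, "c": 0.3},
         "a": {"o": 0.2, "a": 0.2, "c": 0.6},
         "c": {"o": 0.2, "a": 0.5, "c": 0.3}}
    print(ngram_prob("oacac", G))   # = G[""](o) G["o"](a) G["a"](c) G["c"](a) G["a"](c)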
Estimating conditional distributions

• Discrete distribution ↔ vector of parameters
  G_[] = [π_1, ..., π_K],    K = |Σ|
• Counting / maximum-likelihood estimation
  – Training sequence x_{1:N}
    Ĝ_[](X = k) = π̂_k = #{i : x_i = k} / N
• Predictive inference
  P(X_{N+1} | x_1 ... x_N) = Ĝ_[](X_{N+1})
• Example
  – Non-smoothed unigram model (u = ε)
  (graphical model: x_i ∼ G_[], i = 1, ..., N)
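A one-function sketch of the count-based estimate above; "oacac" and the alphabet come from the running example, the code itself is illustrative.

    # Sketch: maximum-likelihood (relative-frequency) unigram estimate, pi_k = #{i : x_i = k} / N.
    from collections import Counter

    def ml_unigram(seq, alphabet):
        counts = Counter(seq)
        N = len(seq)
        return {k: counts[k] / N for k in alphabet}

    G_hat = ml_unigram("oacac", "oac")
    print(G_hat)          # {'o': 0.2, 'a': 0.4, 'c': 0.4}
    print(G_hat["o"])     # predictive P(X_{N+1} = o | x_{1:N}) under the ML estimate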
Bayesian modelling

• Estimation
  P(G_[] | x_{1:N}) ∝ P(x_{1:N} | G_[]) P(G_[])
• Predictive inference
  P(X_{N+1} | x_{1:N}) = ∫ P(X_{N+1} | G_[]) P(G_[] | x_{1:N}) dG_[]
• Priors over distributions
  G_[] ∼ Dirichlet(U),    G_[] ∼ PY(d, c, U)
• Net effect
  – Inference is “smoothed” w.r.t. uncertainty about the unknown distribution
• Example
  – Smoothed unigram (u = ε)
  (graphical model: U → G_[] → x_i, i = 1, ..., N)
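A sketch of the smoothed predictive under a symmetric Dirichlet prior (one of the two priors mentioned above); the posterior predictive is available in closed form, which replaces the integral. The concentration value below is arbitrary.

    # Sketch: Dirichlet-smoothed unigram predictive,
    # P(X_{N+1} = k | x_{1:N}) = (n_k + alpha * U(k)) / (N + alpha), with uniform base U.
    from collections import Counter

    def dirichlet_predictive(seq, alphabet, alpha=1.0):
        counts = Counter(seq)
        N = len(seq)
        U = 1.0 / len(alphabet)                       # uniform base measure
        return {k: (counts[k] + alpha * U) / (N + alpha) for k in alphabet}

    print(dirichlet_predictive("oacac", "oac"))       # every symbol now gets nonzero mass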
"�#������������������������� ������ discount concentration G � � � ∼ PY( d, c, G � σ � � �� ) ∼ G � � � base distribution x i • Tool�for�tying�together�related�distributions�in�hierarchical�models • Measure�over�measures • Base�measure�is�the� “ mean ” measure E [ G � � � ( dx )] = G � σ � � �� ( dx ) • A�distribution�drawn�from�a�Pitman�Yor�process�is�related�to�its base� distribution� – (equal�when� � =� ∞ or� �� =�1) ���������������� ’ ���
Pitman-Yor process posterior

• Generalization of the Dirichlet process (d = 0)
  – Different (power-law) properties
  – Better for text [Teh, 2006] and images [Sudderth and Jordan, 2009]
• Posterior predictive distribution (can’t actually do this integral this way)

  P(X_{N+1} | x_{1:N}; c, d) ≈ ∫ P(X_{N+1} | G_[u]) P(G_[u] | x_{1:N}; c, d) dG_[u]
    = Σ_{k=1}^{K} (m_k − d)/(c + N) · 1[φ_k = X_{N+1}] + (c + dK)/(c + N) · G_[σ(u)](X_{N+1})

• Forms the basis for straightforward, simple samplers
• Rule for stochastic memoization
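A direct transcription of the predictive rule above as code; the counts are made up, and for simplicity each observed symbol is given a single table (φ_k, m_k).

    # Sketch: Pitman-Yor posterior predictive given a seating arrangement.
    def py_predictive(x, tables, d, c, base):
        """P(X_{N+1}=x) = sum_k (m_k - d)/(c + N) [phi_k = x] + (c + d*K)/(c + N) * base(x)."""
        N = sum(tables.values())
        K = len(tables)
        existing = sum(m - d for atom, m in tables.items() if atom == x)
        return existing / (c + N) + (c + d * K) * base(x) / (c + N)

    tables = {"a": 2, "c": 2, "o": 1}          # hypothetical counts m_k per atom phi_k
    print(py_predictive("a", tables, d=0.5, c=1.0, base=lambda s: 1.0 / 3))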
Hierarchical Bayesian modelling

• Estimation
  Θ = {G_[u], G_[u′], G_[u′′]},    u = σ(u′) = σ(u′′)
  P(Θ | x_{1:N}) ∝ P(x_{1:N} | Θ) P(Θ)
• Predictive inference
  P(X_{N+1} | x_{1:N}) = ∫ P(X_{N+1} | Θ) P(Θ | x_{1:N}) dΘ
• Naturally related distributions tied together
  G_[the United States] ∼ PY(d, c, G_[United States])
• Net effect
  – Observations in one context affect inference in other contexts
  – Statistical strength is shared between similar contexts
• Example
  – Smoothing bigram (u = ε or u = s′, s′ ∈ Σ)
  (graphical model: U at the root; G_[u′] and G_[u′′] tied through their shared ancestor G_[u]; observations x_i, i = 1, ..., N_{u′}, and x_j, j = 1, ..., N_{u′′}, in the two context plates)
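A sketch of the tying illustrated above: the predictive for the more specific context uses the more general context's predictive as its base distribution, so an unseen symbol still gets sensible mass. All counts and the three-word vocabulary are invented for illustration, and each symbol is treated as a single table.

    # Sketch: hierarchical back-off, one-table-per-symbol simplification of the PY predictive.
    def py_predictive(x, counts, d, c, base):
        N, K = sum(counts.values()), len(counts)
        seen = (counts[x] - d) if x in counts else 0.0
        return seen / (c + N) + (c + d * K) * base(x) / (c + N)

    uniform = lambda x: 1.0 / 3                                  # toy 3-word vocabulary
    general_counts  = {"of": 3, "is": 1}                         # after "United States"
    specific_counts = {"of": 1}                                  # after "the United States"

    general  = lambda x: py_predictive(x, general_counts, 0.5, 1.0, uniform)
    specific = lambda x: py_predictive(x, specific_counts, 0.5, 1.0, general)  # base = parent
    print(specific("is"))   # nonzero: strength is shared from the more general context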
HPYP graphical model

(Figures, animation steps: columns show observations, conditional distributions, and posterior predictive probabilities. The base distribution U and the chain of conditional distributions G_[], G_[o], G_[oa], G_[oac] are shown; in successive steps the distributions are replaced by their ‘CP’ representations used for posterior predictive inference.)
HPYP: sharing statistical strength

• Share statistical strength between sequentially related predictive conditional distributions
  – Estimates of highly specific conditional distributions
  – are coupled with others that are related
  – through a single common, more general shared ancestor
• Corresponds intuitively to back-off

(Figure: a tree of context-conditional distributions G_[u], each linked to the more general distribution for its shortened context)
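The back-off intuition can be made explicit with the standard HPYP predictive recursion in the Chinese restaurant franchise representation (cf. [Teh ‘06]); this is not printed on the slide, and the notation is mine: c_{us} customers eating dish s in restaurant u, t_{us} tables serving s, d_{|u|} the discount, θ the concentration (set to 0 in the model on the next slide), and σ(u) the context with its most distant symbol dropped.

    \[
    P(s \mid u) \;=\; \frac{c_{us} - d_{|u|}\, t_{us}}{\theta + c_{u\cdot}}
    \;+\; \frac{\theta + d_{|u|}\, t_{u\cdot}}{\theta + c_{u\cdot}}\; P(s \mid \sigma(u))
    \]

Unseen or rare events in a specific context u thus inherit probability from the shorter context σ(u), recursively down to the base distribution.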
Hierarchical Pitman-Yor process

  G_[] | d_0, U ∼ PY(d_0, 0, U)
  G_[u] | d_{|u|}, G_[σ(u)] ∼ PY(d_{|u|}, 0, G_[σ(u)]),    for all contexts u, 1 ≤ |u| ≤ n−1
  x_i | x_{i−n+1:i−1} = u ∼ G_[u],    i = 1, ..., T

• Bayesian generalization of the smoothing n-gram Markov model
• Language model: outperforms interpolated Kneser-Ney (KN) smoothing
• Efficient inference algorithms exist
  – [Goldwater et al ‘05; Teh ‘06; Teh, Kurihara, Welling ‘08]
• Sharing between contexts that differ in the most distant symbol only
• Finite depth

A different view of sequence data

• A sequence can be characterized by a set of single observations in unique contexts of growing length

  oacac →   o | []
            a | o        Increasing context length
            c | ao       Always a single observation
            a | cao
            c | acao

• Foreshadowing: the contexts are all suffixes of the string “cacao”
‘‘Non-Markov’’ model

  P(x_{1:N}) = ∏_{i=1}^{N} P(x_i | x_1, ..., x_{i−1})
             = P(x_1) P(x_2 | x_1) P(x_3 | x_1, x_2) P(x_4 | x_1, ..., x_3) ...

• Example
  P(oacac) = P(o) P(a | o) P(c | oa) P(a | oac) P(c | oaca)
• Smoothing essential
  – Only one observation in each context!
• Solution
  – Hierarchical sharing à la HPYP