Flexible Priors for Deep Hierarchies
Jacob Steinhardt
Wednesday, November 9, 2011
Hierarchical Modeling
• many data are well-modeled by an underlying tree
[Figure: a tree over 44 Indo-European languages, with leaves labeled by family: Celtic (Irish, Gaelic, Welsh, Cornish, Breton), Iranian (Tajik, Persian, Kurdish, Pashto, Ossetic), Germanic (German, Dutch, English, Icelandic, Swedish, Norwegian, Danish), Romance (French, Spanish, Romanian, Portuguese, Italian, Catalan), Slavic (Bulgarian, Polish, Slovene, Serbian-Croatian, Ukrainian, Russian, Czech), Baltic (Lithuanian, Latvian), Indic (Panjabi, Hindi, Kashmiri, Sinhala, Nepali, Maithili, Marathi, Bengali), plus Greek, Albanian, and Armenian]
Hierarchical Modeling
• advantages of hierarchical modeling:
  • captures both broad and specific trends
  • facilitates transfer learning
• issues:
  • the underlying tree may not be known
  • predictions in deep hierarchies can be strongly influenced by the prior
Learning the Tree
• major approaches for choosing a tree:
  • agglomerative clustering
  • Bayesian methods (place a prior over trees)
  • stochastic branching processes
  • nested random partitions
Agglomerative Clustering
• start with each datum in its own subtree
• iteratively merge subtrees based on a similarity metric (see the sketch below)
• issues:
  • can't add new data incrementally
  • can't form hierarchies over latent parameters
  • difficult to incorporate structured domain knowledge
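A minimal sketch of the merge procedure using scipy's `linkage` (the toy data and the choice of average linkage are illustrative, not from the slides):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# toy data: 10 random 2-D points (illustrative only)
rng = np.random.default_rng(0)
points = rng.normal(size=(10, 2))

# start with each datum in its own subtree, then repeatedly merge the
# closest pair of subtrees under average-linkage distance
merge_tree = linkage(points, method="average")
print(merge_tree)  # each row: indices merged, merge distance, new subtree size
```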
Stochastic Branching Processes
• fully Bayesian model
• data start at the top of the tree and branch off according to an arrival process (Dirichlet diffusion trees)
• can also start at the bottom and merge (Kingman coalescents)
Stochastic Branching Processes
• many nice properties:
  • infinitely exchangeable
  • complexity of the tree grows with the data
• issues:
  • latent parameters must undergo a continuous-time diffusion process
  • unclear how to construct such a process for models over discrete data
Random Partitions
• stick-breaking process: a way to partition the unit interval into countably many masses π_1, π_2, ...
  • draw β_k ~ Beta(1, γ)
  • let π_k = β_k (1 − β_1) ··· (1 − β_{k−1})
• the distribution over the π_k is called a Dirichlet process
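A minimal sketch of a truncated stick-breaking draw (the function name and the finite truncation level are my own; an exact draw has countably many sticks):

```python
import numpy as np

def stick_breaking(gamma, num_sticks, rng=None):
    """Truncated stick-breaking: beta_k ~ Beta(1, gamma) and
    pi_k = beta_k * (1 - beta_1) * ... * (1 - beta_{k-1})."""
    rng = np.random.default_rng() if rng is None else rng
    betas = rng.beta(1.0, gamma, size=num_sticks)
    # mass remaining before stick k is prod_{j<k} (1 - beta_j)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas[:-1])))
    return betas * remaining

pi = stick_breaking(gamma=1.0, num_sticks=1000)
print(pi[:5], pi.sum())  # masses sum to just under 1 at any finite truncation
```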
Random Partitions
• suppose {π_k}, k = 1, ..., ∞, are drawn from a Dirichlet process
• for n = 1, ..., N, let X_n ~ Multinomial({π_k})
• this induces a distribution over partitions of {1, ..., N}
• given a partition of {1, ..., N}, add X_{N+1} to a part of size s with probability s/(N+γ) and to a new part with probability γ/(N+γ)
• this is the Chinese restaurant process
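A sketch of the sequential (Chinese restaurant) view of this partition, with illustrative names:

```python
import numpy as np

def crp(num_customers, gamma, rng=None):
    """Sequential CRP: the (n+1)-st customer joins an existing part of
    size s with probability s/(n + gamma), or starts a new part with
    probability gamma/(n + gamma)."""
    rng = np.random.default_rng() if rng is None else rng
    sizes = []        # sizes of existing parts
    labels = []       # part assignment of each customer
    for n in range(num_customers):
        probs = np.array(sizes + [gamma], dtype=float) / (n + gamma)
        k = rng.choice(len(probs), p=probs)
        if k == len(sizes):
            sizes.append(1)   # start a new part
        else:
            sizes[k] += 1
        labels.append(k)
    return labels

print(crp(20, gamma=1.0))
```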
Nested Random Partitions
• a tree is equivalent to a collection of nested partitions
• random tree <=> nested random partitions
• the partition at each node is given by a Chinese restaurant process
• issue: when to stop recursing? (see the depth-truncated sketch below)
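A depth-truncated sketch of this construction (names are mine; truncation sidesteps rather than answers the stopping question): each datum follows a root-to-leaf path, choosing a branch at every node via an independent CRP over the data that reached that node.

```python
import numpy as np

def ncrp_paths(num_data, max_depth, gamma, rng=None):
    """Depth-truncated nested CRP: each datum gets a root-to-leaf path;
    the partition at each node is an independent CRP."""
    rng = np.random.default_rng() if rng is None else rng
    paths = [[] for _ in range(num_data)]
    groups = [list(range(num_data))]          # data indices per node, current level
    for _ in range(max_depth):
        next_groups = []
        for group in groups:
            children = []                     # member lists, one per branch
            for n, idx in enumerate(group):
                sizes = [len(c) for c in children]
                probs = np.array(sizes + [gamma]) / (n + gamma)
                k = rng.choice(len(probs), p=probs)
                if k == len(children):
                    children.append([])       # open a new branch
                children[k].append(idx)
                paths[idx].append(k)
            next_groups.extend(children)
        groups = next_groups
    return paths

print(ncrp_paths(num_data=8, max_depth=3, gamma=1.0))
```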
Martingale Property
• martingale property: E[f(θ_child) | θ_parent] = f(θ_parent)
• implies E[f(θ_v) | θ_u] = f(θ_u) for any ancestor u of v (by iterating expectations)
• says that learning about a child does not change our beliefs in expectation
Doob's Theorem
• let θ_1, θ_2, ... be a sequence of random variables such that E[f(θ_{n+1}) | θ_n] = f(θ_n) and sup_n E[|f(θ_n)|] < ∞
• then lim_{n→∞} f(θ_n) exists with probability 1
• intuition: each new random variable reveals more information about f(θ) until it is completely determined
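A quick illustration of the theorem outside the tree setting (a standard textbook example, not from the slides): the fraction of red balls in a Pólya urn is a bounded martingale, so Doob's theorem says it converges almost surely to a (random) limit.

```python
import numpy as np

def polya_urn_fractions(steps, rng=None):
    """Polya urn: draw a ball, return it plus one more of the same color.
    The red fraction is a bounded martingale, hence converges a.s."""
    rng = np.random.default_rng() if rng is None else rng
    red, total = 1, 2                 # start with one red, one black ball
    fractions = []
    for _ in range(steps):
        if rng.random() < red / total:
            red += 1
        total += 1
        fractions.append(red / total)
    return fractions

f = polya_urn_fractions(100000)
print(f[99], f[9999], f[-1])          # later values barely move
```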
Doob's Theorem
• use Doob's theorem to build an infinitely deep hierarchy
• data are associated with infinite paths v_1, v_2, ... down the tree
• each datum is drawn from a distribution parameterized by lim_n f(θ_{v_n})
Doob's Theorem
• all data have infinite depth
• can think of the effective depth of a datum as the first point at which it lies in its own unique subtree (computed in the sketch below)
• the effective depth is O(log N)
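A sketch of computing effective depth from sampled paths (`effective_depths` is a hypothetical helper; the paths are hard-coded for illustration):

```python
def effective_depths(paths):
    """Effective depth of each datum: the first level at which its
    path prefix is shared with no other datum."""
    depths = []
    for path in paths:
        depth = None                  # None: never unique within this truncation
        for d in range(1, len(path) + 1):
            prefix = path[:d]
            if sum(1 for p in paths if p[:d] == prefix) == 1:
                depth = d
                break
        depths.append(depth)
    return depths

paths = [[0, 0, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0]]
print(effective_depths(paths))        # [3, 3, 2, 1]
```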
Letting the Complexity Grow with the Data
[Figure: maximum depth and average depth as a function of the number of data points (0 to 3000), comparing the nCRP against tree-structured stick-breaking (TSSB) processes with several hyperparameter settings: TSSB-10-0.5, TSSB-20-1.0, TSSB-50-0.5, TSSB-100-0.8]
Hierarchical Beta Processes
• θ_v lies in [0,1]^D
• θ_{v,d} | θ_{p(v),d} ~ Beta(c θ_{p(v),d}, c(1 − θ_{p(v),d}))
• martingale property holds for f(θ_v) = θ_v
• let θ denote the limit along an infinite path
• X_d | θ_d ~ Bernoulli(θ_d)
• note that X_d | θ_{v,d} ~ Bernoulli(θ_{v,d}) as well
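A sketch of sampling θ down a single root-to-leaf path (names are mine; the clipping is purely to keep the Beta parameters numerically valid, foreshadowing the convergence issue on the next slide):

```python
import numpy as np

def hbp_path(theta_root, c, depth, rng=None):
    """theta_child ~ Beta(c * theta_parent, c * (1 - theta_parent)),
    coordinate-wise; E[theta_child | theta_parent] = theta_parent,
    so f(theta) = theta is a martingale down the path."""
    rng = np.random.default_rng() if rng is None else rng
    theta = np.asarray(theta_root, dtype=float)
    for _ in range(depth):
        # clip to keep Beta parameters strictly positive despite underflow
        theta = np.clip(theta, 1e-12, 1 - 1e-12)
        theta = rng.beta(c * theta, c * (1.0 - theta))
    return theta

print(hbp_path(theta_root=np.full(5, 0.5), c=2.0, depth=30))
# coordinates are typically pinned near 0 or 1 by depth 30
```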
Priors for Deep Hierarchies
• for the HBP, θ_{v,d} converges to 0 or 1 as depth increases
• rate of convergence: a tower of exponentials, e^(e^(e^(···)))
• numerically problematic, and philosophically troubling
Priors for Deep Hierarchies
• inverse Wishart time series: Σ_{n+1} | Σ_n ~ InvWishart(Σ_n)
• converges to 0 with probability 1
• becomes singular to numerical precision
• rate again given by a tower of exponentials
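A sketch of this chain (an assumption on my part: the scale is set to (ν − p − 1)Σ_n so that E[Σ_{n+1} | Σ_n] = Σ_n, since the slide's InvWishart(Σ_n) leaves the parameterization implicit; requires scipy):

```python
import numpy as np
from scipy.stats import invwishart

def invwishart_chain(sigma0, df, steps, rng=None):
    """Iterate Sigma_{n+1} ~ InvWishart(df, (df - p - 1) * Sigma_n),
    whose mean is Sigma_n, making the chain a matrix martingale."""
    rng = np.random.default_rng() if rng is None else rng
    p = sigma0.shape[0]
    assert df > p + 1, "the inverse Wishart mean exists only for df > p + 1"
    sigma, dets = sigma0, [np.linalg.det(sigma0)]
    for _ in range(steps):
        sigma = invwishart.rvs(df=df, scale=(df - p - 1) * sigma, random_state=rng)
        dets.append(np.linalg.det(sigma))
    return sigma, dets

sigma, dets = invwishart_chain(np.eye(3), df=10, steps=50)
print(dets[::10])  # determinants typically collapse toward 0 down the chain
```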