Low-Rank Tensors for Scoring Dependency Structures
Tao Lei, Yu Xin, Yuan Zhang, Regina Barzilay, Tommi Jaakkola
CSAIL, MIT
Dependency Parsing
Example: ROOT — I ate cake with a fork today (PRON VB NN IN DT NN NN)
• Dependency parsing as a maximization problem: ŷ = argmax_{y ∈ T(x)} S(x, y; θ)
• Key aspects of a parsing system:
  1. An accurate scoring function S(x, y; θ)  — our goal
  2. An efficient decoding procedure for the argmax
Finding Expressive Feature Set
Traditional view: requires a rich, expressive set of manually crafted feature templates.
Example: ROOT — I ate cake with a fork today (PRON VB NN IN DT NN NN)
• Each arc fires features from templates, e.g.
  – head POS ∧ modifier POS ∧ length  →  "VB ∧ NN ∧ 2"
  – head word ∧ modifier word  →  "ate ∧ cake"
• These features populate a high-dimensional sparse vector φ(x, y) ∈ ℝ^n (e.g. … 1 0 2 1 2 0 0 0 …)
• Scoring uses a parameter vector θ ∈ ℝ^n (e.g. … 0.1 0.3 2.2 1.1 0 0.1 0.9 0 …):
  S_θ(x, y) = ⟨θ, φ(x, y)⟩
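To make the vector-based scoring concrete, here is a minimal sketch of S_θ(x, y) = ⟨θ, φ(x, y)⟩; the feature strings and the dict-based sparse representation are illustrative assumptions, not the actual feature set used in the paper.

```python
# Minimal sketch of traditional vector-based scoring S_theta(x, y) = <theta, phi(x, y)>.
# Feature strings and the dict-based sparse representation are illustrative assumptions.
from collections import Counter

def arc_features(head_word, head_pos, mod_word, mod_pos, length):
    """Instantiate a few hand-picked feature templates for one head -> modifier arc."""
    return [
        f"HW_MW={head_word}^{mod_word}",              # head word ^ modifier word
        f"HP_MP_LEN={head_pos}^{mod_pos}^{length}",   # head POS ^ modifier POS ^ length
    ]

def score(theta, tree_arcs):
    """Sum of feature weights over all arcs of a candidate tree."""
    phi = Counter(f for arc in tree_arcs for f in arc_features(*arc))
    return sum(theta.get(f, 0.0) * count for f, count in phi.items())

theta = {"HW_MW=ate^cake": 1.1, "HP_MP_LEN=VB^NN^2": 0.3}
print(score(theta, [("ate", "VB", "cake", "NN", 2)]))  # 1.4
```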
Traditional Scoring Revisited
• Features and templates are manually selected concatenations of atomic features in traditional vector-based scoring.
Example arc: ate → cake in "ROOT I ate cake with a fork today" (PRON VB NN IN DT NN NN)
Atomic features of the arc:
                Head      Modifier
  Word:         ate       cake
  POS:          VB        NN
  POS+Word:     VB+ate    NN+cake
  Left POS:     PRON      VB
  Right POS:    NN        IN
  (plus arc features such as the attachment length, here 2)
Concatenated arc features:
  HW_MW_LEN:   ate ∧ cake ∧ 2
  HW_MW:       ate ∧ cake
  HP_MP_LEN:   VB ∧ NN ∧ 2
  HP_MP:       VB ∧ NN
  …
Traditional Scoring Revisited
• Problem: it is very difficult to pick the best subset of concatenations.
  – Too few templates → lose performance
  – Too many templates → too many parameters to estimate; features are correlated
  – Searching for the best set? → the number of choices is exponential
• Our approach: use a low-rank tensor (i.e., a multi-way array) to
  – capture a whole range of feature combinations, and
  – keep the parameter estimation problem under control.
Low-Rank Tensor Scoring: Formulation
• Formulate ALL possible concatenations as a rank-1 tensor
  φ_h ⊗ φ_m ⊗ φ_{h,m} ∈ ℝ^{n×n×d}
  where φ_h is the atomic head feature vector, φ_m the atomic modifier feature vector, and φ_{h,m} the atomic arc feature vector
  (e.g. head: ate, VB, VB+ate, PRON, NN; modifier: cake, NN, NN+cake, VB, IN; arc: attachment length, …).
• Tensor product: [x ⊗ y ⊗ z]_{ijk} = x_i · y_j · z_k
  Each entry indicates the occurrence of one feature concatenation.
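A tiny numpy check of the rank-1 tensor definition above; the dimensions and vectors are toy values chosen here for illustration.

```python
# Toy check of the rank-1 tensor definition [x (x) y (x) z]_{ijk} = x_i * y_j * z_k.
# Real phi_h, phi_m, phi_{h,m} are high-dimensional sparse indicator vectors.
import numpy as np

phi_h  = np.array([1.0, 0.0, 1.0])        # atomic head features (e.g. word=ate, POS=VB)
phi_m  = np.array([0.0, 1.0, 1.0, 0.0])   # atomic modifier features
phi_hm = np.array([1.0, 0.0])             # atomic arc features (e.g. length, direction)

T = np.einsum('i,j,k->ijk', phi_h, phi_m, phi_hm)     # rank-1 tensor of all concatenations
assert T.shape == (3, 4, 2)
assert T[0, 1, 0] == phi_h[0] * phi_m[1] * phi_hm[0]  # one entry = one feature concatenation
```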
Low-Rank Tensor Scoring: Formulation
• Formulate the parameters as a tensor as well:
  θ ∈ ℝ^n:         S_θ(h → m) = ⟨θ, φ_{h→m}⟩                        (vector-based)
  A ∈ ℝ^{n×n×d}:   S_tensor(h → m) = ⟨A, φ_h ⊗ φ_m ⊗ φ_{h,m}⟩       (tensor-based)
• The tensor score involves features not in φ, but A can be huge: on English, n × n × d ≈ 10^11.
Low-Rank Tensor Scoring: Formulation
• Formulate the parameters as a low-rank tensor:
  U, V ∈ ℝ^{r×n}, W ∈ ℝ^{r×d}:   A = Σ_{i=1}^r U(i) ⊗ V(i) ⊗ W(i)
  i.e., A is a sum of r rank-1 tensors (a low-rank tensor).
Low-Rank Tensor Scoring: Formulation
S_tensor(h → m) = ⟨A, φ_h ⊗ φ_m ⊗ φ_{h,m}⟩   with   A = Σ_{i=1}^r U(i) ⊗ V(i) ⊗ W(i)
  ⟹  S_tensor(h → m) = Σ_{i=1}^r [Uφ_h]_i · [Vφ_m]_i · [Wφ_{h,m}]_i
• Dense low-dimensional representations: Uφ_h, Vφ_m, Wφ_{h,m} ∈ ℝ^r (dense matrix × sparse vector)
• Take element-wise products of the three projections, then sum over the r components.
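A minimal numpy sketch of this low-rank scoring with toy sizes and random factors; it checks that the dense-projection formula matches the explicit tensor inner product. This is an illustration, not the released implementation.

```python
# Low-rank tensor scoring: S_tensor(h->m) = sum_i [U phi_h]_i [V phi_m]_i [W phi_hm]_i.
# U, V, W are random toy matrices here; in the parser they are learned parameters.
import numpy as np

n, d, r = 6, 3, 4
rng = np.random.default_rng(0)
U, V, W = rng.normal(size=(r, n)), rng.normal(size=(r, n)), rng.normal(size=(r, d))

phi_h  = np.zeros(n); phi_h[[0, 2]] = 1.0     # sparse atomic head features
phi_m  = np.zeros(n); phi_m[[1, 5]] = 1.0     # sparse atomic modifier features
phi_hm = np.zeros(d); phi_hm[0] = 1.0         # sparse atomic arc features

# Efficient form: three dense r-dim projections, element-wise product, sum.
s_fast = np.sum((U @ phi_h) * (V @ phi_m) * (W @ phi_hm))

# Explicit form: build A = sum_i U(i) (x) V(i) (x) W(i), then <A, phi_h (x) phi_m (x) phi_hm>.
A = np.einsum('ia,ib,ic->abc', U, V, W)
s_slow = np.einsum('abc,a,b,c->', A, phi_h, phi_m, phi_hm)
assert np.isclose(s_fast, s_slow)
```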
Intuition and Explanations
Example: collaborative filtering — approximate user ratings via a low-rank factorization.
  U ∈ ℝ^{2×n}: users' hidden preferences (over "price" and "quality")
  V ∈ ℝ^{2×m}: items' hidden properties ("price", "quality")
  A: the sparse user-rating matrix
  – Ratings are not completely independent
  – Items share hidden properties ("price" and "quality")
  – Users have hidden preferences over those properties
Intuition and Explanations
Example: collaborative filtering — approximate user ratings via a low-rank factorization:
  A ≈ U^T V = Σ_i U(i) ⊗ V(i) = U(1) ⊗ V(1) + ⋯ + U(r) ⊗ V(r)
  Number of parameters: n × m  →  r(n + m)
Intuition: data and parameters can be approximately characterized by a small number of hidden factors.
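The collaborative-filtering analogy in a few lines of numpy; the sizes and random factors are made up for illustration.

```python
# Low-rank view of a user-item rating matrix: A = U^T V = sum_i U(i) (x) V(i).
# With r hidden factors, r*(n+m) parameters describe an n*m matrix.
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, r = 100, 80, 2          # two hidden factors, e.g. "price" and "quality"
U = rng.normal(size=(r, n_users))         # users' hidden preferences
V = rng.normal(size=(r, n_items))         # items' hidden properties
A = U.T @ V                               # full rating matrix implied by the factors

print(A.size, "matrix entries vs", U.size + V.size, "parameters")  # n*m  vs  r*(n+m)
```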
Intuition and Explanations
Our case: approximate the parameters (feature weights) via a low-rank parameter tensor A.
[Figure: slices of A for words such as "apple" and "banana" have similar values because the two words have similar syntactic behavior.]
  A = Σ_i U(i) ⊗ V(i) ⊗ W(i)
  – Hidden properties are associated with each word
  – Parameter values are shared via those hidden properties
Low-Rank Tensor Scoring: Summary
• Naturally captures the full feature expansion (concatenations)
  – without manually specifying a long list of feature templates
• Feature expansion is controlled by the low rank (small r)
  – better feature tuning and optimization
• Easily adds and utilizes new, auxiliary features
  – simply append them as atomic features (e.g. head atomic features: ate, VB, VB+ate, PRON, NN, person:I, number:singular, Emb[1]: -0.0128, Emb[2]: 0.5392); see the sketch below.
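For instance, appending auxiliary values (here, two made-up embedding dimensions) to the atomic head features might look like this; the feature names and values are hypothetical.

```python
# Auxiliary signals become extra coordinates of the atomic feature vectors;
# the tensor scoring then combines them with every other atomic feature automatically.
# Feature names and embedding values below are made up for illustration.
atomic_head = {"word=ate": 1.0, "pos=VB": 1.0, "pos+word=VB+ate": 1.0}
atomic_head.update({"emb[1]": -0.0128, "emb[2]": 0.5392})  # appended; no new templates needed
```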
Combined Scoring
• Combine traditional and tensor scoring in S_γ(x, y):
  S_γ(x, y) = γ · S_θ(x, y) + (1 − γ) · S_tensor(x, y),   γ ∈ [0, 1]
  – S_θ: set of manually selected features
  – S_tensor: full feature expansion controlled by the low rank
  A similar "sparse + low-rank" idea appears in matrix decomposition: Tao and Yuan, 2011; Zhou and Tao, 2011; Waters et al., 2011; Chandrasekaran et al., 2011.
• Final maximization problem given parameters θ, U, V, W:
  ŷ = argmax_{y ∈ T(x)} S_γ(x, y; θ, U, V, W)
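The interpolation itself is a one-liner; the default γ below is an arbitrary illustrative choice, not a value prescribed by the slides.

```python
def combined_arc_score(s_theta, s_tensor, gamma=0.3):
    """Interpolate the traditional (manual-feature) score with the low-rank tensor score."""
    return gamma * s_theta + (1.0 - gamma) * s_tensor
```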
Learning Problem
• Given a training set D = {(x̂_i, ŷ_i)}_{i=1}^N
• Search for parameter values that score the gold trees higher than all others:
  ∀y ∈ T(x̂_i):  S(x̂_i, ŷ_i) ≥ S(x̂_i, y) + ‖ŷ_i − y‖ − ξ_i,   with non-negative slack (loss) ξ_i ≥ 0
• Training objective: unsatisfied constraints are penalized:
  min_{θ, U, V, W, ξ ≥ 0}   Σ_i ξ_i  +  ‖θ‖² + ‖U‖² + ‖V‖² + ‖W‖²
  (training loss + regularization)
• Calculating the loss requires solving the expensive maximization problem; following common practice, we adopt an online learning framework.
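The margin constraints above correspond to a structured hinge loss. Below is a minimal sketch of one online update in that spirit — a plain subgradient step on θ only, with an assumed `decode` and feature extractor `phi`; it is not the authors' actual joint update of θ, U, V, W.

```python
# One online step for the constraint S(x_i, y_i) >= S(x_i, y) + ||y_i - y|| - xi_i.
# 'decode' and 'phi' (dict of feature -> count) are assumed helpers; the update is a
# simple structured-hinge subgradient step, not the paper's exact algorithm.
def theta_score(theta, feats):
    return sum(theta.get(f, 0.0) * v for f, v in feats.items())

def online_step(theta, phi, decode, x, gold_tree, lr=0.1):
    pred_tree = decode(x, theta)                    # loss-augmented decoding in practice
    cost = sum(1 for a, b in zip(gold_tree, pred_tree) if a != b)   # Hamming-style margin
    loss = theta_score(theta, phi(x, pred_tree)) + cost - theta_score(theta, phi(x, gold_tree))
    if loss > 0:                                    # constraint violated: move theta
        for f, v in phi(x, gold_tree).items():
            theta[f] = theta.get(f, 0.0) + lr * v   # reward gold-tree features
        for f, v in phi(x, pred_tree).items():
            theta[f] = theta.get(f, 0.0) - lr * v   # penalize predicted-tree features
    return theta
```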