
Learning Effective and Interpretable Semantic Models using Non-Negative Sparse Embedding (NNSE) - PowerPoint PPT Presentation



  1. Learning Effective and Interpretable Semantic Models using Non-Negative Sparse Embedding (NNSE). Brian Murphy, Partha Talukdar, Tom Mitchell. Machine Learning Department, Carnegie Mellon University

  2. Distributional Semantic Modeling
     • Words are represented in a high-dimensional vector space
     • Long history: (Deerwester et al., 1990), (Lin, 1998), (Turney, 2006), ...
     • Although effective, these models are often not interpretable (examples: top 5 words from 5 randomly chosen dimensions of SVD-300)

  3. Why interpretable dimensions? Semantic Decoding (Mitchell et al., Science 2008)
     • Input stimulus word: “motorbike”
     • Semantic representation: (0.87, ride), (0.29, see), ..., (0.00, rub), (0.00, taste)
     • A mapping learned from fMRI data turns this representation into the predicted activity for “motorbike”
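The decoding pipeline sketched on this slide (stimulus word, its semantic feature vector, a mapping learned from fMRI data, predicted brain activity) is essentially a linear regression. The sketch below is illustrative only: the data is synthetic, and the ridge solver and all sizes (the feature count of 25 merely echoes the 25-verb representation) are stand-ins, not the study's actual setup.

```python
# A minimal sketch of the decoding idea: learn a linear mapping from a word's
# semantic feature vector (e.g., co-occurrence with verbs like "ride", "see")
# to brain-activity values. All data here is synthetic and illustrative.
import numpy as np

rng = np.random.default_rng(0)
n_words, n_features, n_voxels = 40, 25, 500

semantic = rng.random((n_words, n_features))        # one row per stimulus word
true_map = rng.normal(size=(n_features, n_voxels))  # unknown mapping to recover
activity = semantic @ true_map + 0.01 * rng.normal(size=(n_words, n_voxels))

# Ridge-regularized least squares: W = (S^T S + lam I)^-1 S^T Y
lam = 1e-3
W = np.linalg.solve(semantic.T @ semantic + lam * np.eye(n_features),
                    semantic.T @ activity)

# Predict activity for a held-out word from its semantic features alone
new_word = rng.random(n_features)
predicted_activity = new_word @ W
print(predicted_activity.shape)  # (500,)
```

Because the mapping is linear, each semantic feature contributes its own additive activation pattern, which is what makes interpretable features valuable here.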

  4. Why interpretable dimensions? (Mitchell et al., Science 2008)
     • Interpretable dimensions reveal insightful brain activation patterns!
     • But features in the semantic representation were based on 25 hand-selected verbs
       ‣ can’t represent arbitrary concepts
       ‣ need data-driven, broad-coverage semantic representations

  5. What is an Interpretable, Cognitively-plausible Representation?
     • A words × features matrix, where the row for a word w looks like: ... 0 0 0 ... 1.2 ...
     1. Compact representation: Sparse, many zeros
     2. Uneconomical to store negative (or inferable) characteristics: Non-Negative
     3. Meaningful dimensions: Coherent
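The first two properties on this slide are simple numeric checks on a word's row vector. A quick illustration on a made-up vector like the one shown:

```python
# Illustrative check of two properties of a word vector from the slide:
# sparsity (mostly zeros) and non-negativity. The vector itself is made up.
import numpy as np

w = np.array([0.0, 0.0, 0.0, 1.2, 0.0, 0.3, 0.0, 0.0])

sparsity = np.mean(w == 0)          # fraction of zero entries
non_negative = bool(np.all(w >= 0))
print(sparsity, non_negative)       # 0.75 True
```

The third property, coherence, is about the dimensions themselves (do the top words on a dimension form a meaningful group?) and cannot be read off a single vector.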

  6. Properties of Different Semantic Representations
     • Representations compared: Hand Tuned; Corpus-derived (existing); NNSE (our proposal)
     • Properties compared: Effective, Broad Coverage, Sparse, Non-Negative, Coherent
     • Prediction accuracy on Neurosemantic Decoding (Murphy, Talukdar, Mitchell, *SEM 2012): Hand Coded 83.5, Corpus-derived 83.1

  7. Non-Negative Sparse Embedding (NNSE)
     • Factorize the input matrix X (corpus co-occurrence statistics, reduced with SVD) as X ≈ A × D
     • Row of X: input representation for word w_i; corresponding row of A: NNSE representation for word w_i; D: the basis
     • Matrix A is non-negative
     • Sparsity penalty on the rows of A
     • Alternating minimization between A and D, using the SPAMS package
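The factorization and constraints above can be sketched with a toy alternating scheme. The paper uses the SPAMS package; the projected-gradient A-step and least-squares D-step below are simplified stand-ins for it, and all dimensions, the step size, and the penalty weight are illustrative.

```python
# Toy alternating minimization for X ~ A D with A >= 0, an L1 penalty on the
# rows of A, and bounded-norm rows of D. NOT the actual SPAMS-based solver
# from the paper; a simplified stand-in to show the structure of the updates.
import numpy as np

rng = np.random.default_rng(1)
n_words, n_input, n_latent = 50, 20, 30
X = rng.random((n_words, n_input))          # input SVD representations

A = rng.random((n_words, n_latent))         # sparse, non-negative codes
D = rng.normal(size=(n_latent, n_input))    # dictionary (basis)
lam, step = 0.1, 1e-3

for _ in range(200):
    # A-step: gradient of ||X - A D||^2 + lam*||A||_1, then project onto A >= 0
    grad_A = (A @ D - X) @ D.T + lam
    A = np.maximum(A - step * grad_A, 0.0)
    # D-step: least-squares fit, then shrink any row with norm > 1 back to 1
    D = np.linalg.lstsq(A, X, rcond=None)[0]
    D /= np.maximum(np.linalg.norm(D, axis=1, keepdims=True), 1.0)

reconstruction = np.mean((X - A @ D) ** 2)
sparsity = np.mean(A == 0)
print(reconstruction, sparsity)
```

The non-negativity projection is what produces exact zeros in A, so sparsity falls out of the optimization rather than being imposed by thresholding afterwards.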

  8. NNSE Optimization: solve for A and D such that X ≈ A × D (diagram)
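Written out, this factorization corresponds to the standard non-negative sparse coding objective; the notation below (including the sparsity weight λ) is reconstructed, not taken from the slide itself:

```latex
\min_{A,\,D}\ \sum_{i=1}^{m} \left\lVert X_{i,:} - A_{i,:} D \right\rVert_2^2
  \;+\; \lambda \left\lVert A_{i,:} \right\rVert_1
\quad \text{s.t.}\quad
D_{j,:} D_{j,:}^{\top} \le 1\ \ \forall j,
\qquad A_{ij} \ge 0\ \ \forall i, j .
```

The norm bound on the rows of D prevents the trivial escape of shrinking A while inflating D, and the L1 term plus the non-negativity constraint yield the sparse codes.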

  9. Experiments
     • Three main questions:
       1. Are NNSE representations effective in practice?
       2. What is the degree of sparsity of NNSE?
       3. Are NNSE dimensions coherent?
     • Setup:
       ‣ partial ClueWeb09: 16bn tokens, 540m sentences, 50m documents
       ‣ dependency parsed using the Malt parser

  10. Baseline Representation: SVD
      • For about 35k words (~adult vocabulary), extract:
        ‣ document co-occurrence
        ‣ dependency features from the parsed corpus
      • Reduce dimensionality using SVD; subsets of this reduced-dimensional space are the baseline
      • This is also the input (X) to NNSE
      • Other representations were also compared (e.g., LDA, Collobert and Weston); details in the paper
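The baseline construction can be sketched end-to-end on a toy corpus. The three sentences, the sentence-level co-occurrence counts, and k = 3 are illustrative stand-ins for ClueWeb09 documents, dependency features, and the real dimensionality.

```python
# A minimal sketch of the baseline: build a word-by-word co-occurrence matrix
# and reduce it with a truncated SVD. The tiny corpus is an illustrative
# stand-in for ClueWeb09 documents and dependency features.
import numpy as np

corpus = [
    "the dog chased the cat",
    "the cat chased the mouse",
    "the mouse ate the cheese",
]
vocab = sorted({w for sent in corpus for w in sent.split()})
index = {w: i for i, w in enumerate(vocab)}

# Co-occurrence counts within each sentence (document-level co-occurrence)
counts = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    words = sent.split()
    for a in words:
        for b in words:
            if a != b:
                counts[index[a], index[b]] += 1

# Truncated SVD: keep the top-k left singular vectors, scaled by their
# singular values, as the reduced word representations (the input X to NNSE)
k = 3
U, s, Vt = np.linalg.svd(counts, full_matrices=False)
X = U[:, :k] * s[:k]
print(X.shape)  # one k-dimensional row per vocabulary word
```

These reduced rows play both roles from the slide: subsets of them serve as the SVD baseline, and the full matrix X is what NNSE then factorizes.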

  11. Is NNSE effective in Neurosemantic Decoding?
