

  1. 
 Deconstructing Data Science David Bamman, UC Berkeley 
 Info 290 
 Lecture 5: Clustering overview Jan 31, 2016

  2. Clustering • Clustering (and unsupervised learning more generally) finds structure in data, using just X • X = a set of skyscrapers

  3. Unsupervised Learning Le et al. (2012), “Building High-level Features Using Large Scale Unsupervised Learning” (ICML)

  4. Netflix Amazon Twitter New York Times

  5. Unsupervised Learning • Matrix completion (e.g., user recommendations on Netflix, Amazon): a partially observed users × movies ratings matrix over Ann, Bob, Chris, David, Erik — Star Wars: 5 5 4 5 3; Bridget Jones: 4 4 1; Rocky: 3 5; Rambo: ? 2 5 — where the goal is to predict the missing entries (?)

  6. 𝓨 | task
 • set of skyscrapers | learn patterns that define architectural styles
 • set of books | learn patterns that define genre
 • customer data | learn patterns that suggest “types” of customer behavior

  7. Methods differ in the kind of structure learned Deep learning Probabilistic graphical models Networks Topic models K-means clustering Hierarchical clustering

  8. Hierarchical Clustering • Hierarchical order among the elements being clustered

  9. Dendrogram Shakespeare’s plays Witmore (2009) 
 http://winedarksea.org/? p=519

  10. Bottom-up clustering

  11. Similarity • A similarity function maps a pair of objects to a real number: P(X) × P(X) → ℝ • What are you comparing? • How do you quantify the similarity/difference of those things?

  12. Probability [Bar chart: a probability distribution over the words “the, a, dog, cat, runs, to, store”]

  13. Unigram probability [Two bar charts: unigram probabilities (0.00–0.12) over the words “the, a, of, love, sword, poison, hamlet, romeo, king, capulet, be, woe, him, most”]
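A unigram probability is just a word’s relative frequency in a text. A minimal sketch (the toy token list below is illustrative, not the actual play texts):

```python
from collections import Counter

def unigram_probs(tokens):
    """Map each word to its relative frequency in the token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

tokens = "the king and the queen love the king".split()
probs = unigram_probs(tokens)
# "the" appears 3 times out of 8 tokens
print(probs["the"])  # 0.375
```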

  14. Similarity • Euclidean distance between the two unigram distributions:

 Euclidean = √( Σ_{i ∈ vocab} (Pᵢ^Hamlet − Pᵢ^Romeo)² )

 • Other options: cosine similarity, Jensen–Shannon divergence, …
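Both distance measures on the slide operate on two probability distributions over a shared vocabulary. A minimal sketch (the tiny two-word distributions are hypothetical, not real Hamlet/Romeo counts):

```python
import math

def euclidean(p, q, vocab):
    """Euclidean distance between two word distributions (dicts)."""
    return math.sqrt(sum((p.get(w, 0.0) - q.get(w, 0.0)) ** 2 for w in vocab))

def cosine_similarity(p, q, vocab):
    """Cosine of the angle between the two distributions as vectors."""
    dot = sum(p.get(w, 0.0) * q.get(w, 0.0) for w in vocab)
    norm_p = math.sqrt(sum(p.get(w, 0.0) ** 2 for w in vocab))
    norm_q = math.sqrt(sum(q.get(w, 0.0) ** 2 for w in vocab))
    return dot / (norm_p * norm_q)

p_hamlet = {"the": 0.5, "poison": 0.5}
p_romeo = {"the": 0.5, "love": 0.5}
vocab = {"the", "poison", "love"}
print(euclidean(p_hamlet, p_romeo, vocab))         # sqrt(0.5) ≈ 0.707
print(cosine_similarity(p_hamlet, p_romeo, vocab))  # 0.5
```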

  15. Cluster similarity

  16. Cluster similarity • Single link: two most similar elements • Complete link: two least similar elements • Group average: average of all members
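The three linkage criteria above differ only in how they aggregate pairwise distances between members of two clusters. A minimal sketch with a toy 1-D distance function:

```python
def single_link(c1, c2, dist):
    # Distance between the two most similar cross-cluster elements.
    return min(dist(a, b) for a in c1 for b in c2)

def complete_link(c1, c2, dist):
    # Distance between the two least similar cross-cluster elements.
    return max(dist(a, b) for a in c1 for b in c2)

def group_average(c1, c2, dist):
    # Average distance over all cross-cluster pairs.
    return sum(dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

d = lambda a, b: abs(a - b)
c1, c2 = [1, 2], [4, 8]
print(single_link(c1, c2, d))    # 2
print(complete_link(c1, c2, d))  # 7
print(group_average(c1, c2, d))  # 4.5
```

In bottom-up (agglomerative) clustering, the chosen criterion decides which pair of clusters to merge at each step.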

  17. Flat Clustering • Partitions the data into a set of K clusters [diagram: clusters A, B, C]

  18. Flat Clustering • Partitions the data into a set of K clusters

  19. K-means

  20. K-means
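K-means (Lloyd’s algorithm) alternates two steps: assign each point to its nearest centroid, then move each centroid to the mean of its assigned points. A minimal pure-Python sketch (the toy 2-D points are illustrative):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Lloyd's algorithm over a list of equal-length tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize from random data points
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda j: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[j])))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for j, members in enumerate(clusters):
            if members:
                centroids[j] = tuple(sum(dim) / len(members)
                                     for dim in zip(*members))
    return centroids, clusters

points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.8)]
centroids, clusters = kmeans(points, k=2)
```

With two well-separated groups like these, the algorithm converges to a 2/2 split in a few iterations.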

  21. Representation • x ∈ ℝ^F [x is a data point characterized by F real numbers, one for each feature] • This is a huge decision that impacts what you can learn

  22. Voting behavior:
 • Yes on abortion access → 1
 • Yes on expanding gun rights → 0
 • Yes on tax breaks → 0
 • Yes on ACA → 1
 • Yes on abolishing IRS → 0
 x ∈ ℝ⁵

  23. First letter of last name:
 • Last name starts with < “A” → 0
 • Last name starts with < “B” → 0
 • Last name starts with < “C” → 1
 • Last name starts with < “D” → 1
 • …
 • Last name starts with < “Z” → 1
 x ∈ ℝ²⁶
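The voting-behavior representation on slide 22 is just a binary encoding of yes/no answers. A minimal sketch (the feature names and the `to_feature_vector` helper are hypothetical, not from the lecture code):

```python
def to_feature_vector(record, features):
    """Map a dict of yes/no answers to a binary vector, one slot per feature."""
    return [1 if record.get(f) == "yes" else 0 for f in features]

features = ["abortion access", "expand gun rights", "tax breaks",
            "ACA", "abolish IRS"]
legislator = {"abortion access": "yes", "ACA": "yes"}
print(to_feature_vector(legislator, features))  # [1, 0, 0, 1, 0]
```

The point of the slides stands out here: swapping in a different feature list (say, first-letter-of-last-name indicators) changes what any downstream clustering can possibly learn.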

  24. Representation — 𝓨 | task
 • set of skyscrapers | learn patterns that define architectural styles
 • set of books | learn patterns that define genre
 • customer data | learn patterns that suggest “types” of customer behavior

  25. Evaluation • Much more complex than supervised learning since there’s often no notion of “truth”

  26. Internal criteria • Elements within clusters should be more similar to each other • Elements in different clusters should be less similar to each other

  27. External criteria • How closely does your clustering reproduce another (“gold standard”) clustering?

  28. [Diagram: learned clusters A, B, C compared against external (“comparison”) clusters]

  29. Evaluation: Purity • Learned clusters G = {g₁ … g_K} (as learned by our algorithm) • External clusters C = {c₁ … c_J} (from some external source)

 Purity = (1/N) Σₖ maxⱼ |gₖ ∩ cⱼ|
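The purity formula can be computed directly from two label assignments: each learned cluster “votes” for its best-matching external cluster. A minimal sketch (the toy labels below are illustrative, not the actual slide data):

```python
from collections import Counter

def purity(learned, external):
    """Purity of a learned clustering against an external one.

    learned, external: lists of cluster labels, aligned by data point.
    """
    total = 0
    for g in set(learned):
        # External labels of the points in learned cluster g;
        # count only the most common one (the max_j |g_k ∩ c_j| term).
        members = [c for gl, c in zip(learned, external) if gl == g]
        total += Counter(members).most_common(1)[0][1]
    return total / len(learned)

learned  = ["g1", "g1", "g2", "g2", "g2", "g3", "g3"]
external = ["A",  "B",  "A",  "A",  "C",  "B",  "C"]
print(purity(learned, external))  # 4/7 ≈ 0.571
```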

  30.–33. [Diagrams: learned clusters (G: A, B, C) vs. external clusters (C), stepping through Purity = (1/N) Σₖ maxⱼ |gₖ ∩ cⱼ| one learned cluster at a time]

  34. [Diagram: learned (G) vs. external (C)] Purity = (1 + 1 + 2) / 7 = .57

  35. Evaluation: Rand Index Every pair of data points is either in the same external cluster, or it’s not. = binary classification

  36. Rand Index — same cluster?
 • Rubio, Paul → 1
 • Rubio, Cruz → 1
 • Rubio, Trump → 0
 • Rubio, Fiorina → 0
 • Rubio, Clinton → 0
 • Rubio, Sanders → 0
 • Paul, Cruz → 1
 • Paul, Trump → 0

  37. Rand Index • Each pair of points is one decision in a 2×2 confusion matrix: rows are True (y) same/different cluster, columns are Predicted (ŷ) same/different cluster • With N = 7 points, that is N(N − 1)/2 = 21 decisions

  38. [The same 2×2 matrix, with “Learned” labeling the Predicted (ŷ) axis and “External” labeling the True (y) axis]

  39. Rand Index • From the confusion matrix, we can calculate standard measures from binary classification:

                            Predicted (ŷ)
                            same cluster | different cluster
 True (y)  same cluster          1       |        4
 True (y)  different cluster     4       |       12

 • The Rand Index = accuracy = (1 + 12) / 21 = .619
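The Rand Index is pair-level accuracy: a pair counts as correct when both clusterings agree on whether its two points belong together. A minimal sketch (the toy labels are illustrative, not the candidate data from the slides):

```python
from itertools import combinations

def rand_index(learned, external):
    """Fraction of point pairs on which the two clusterings agree."""
    agree = 0
    pairs = list(combinations(range(len(learned)), 2))
    for i, j in pairs:
        same_learned = learned[i] == learned[j]
        same_external = external[i] == external[j]
        # Agreement = true positive (together in both)
        # or true negative (apart in both).
        if same_learned == same_external:
            agree += 1
    return agree / len(pairs)

learned  = ["g1", "g1", "g2", "g2", "g2", "g3", "g3"]
external = ["A",  "B",  "A",  "A",  "C",  "B",  "C"]
print(rand_index(learned, external))  # 13/21 ≈ 0.619
```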

  40. Example Clustering characters into distinct types

  41. The Villain • Does (agent): kill, hunt, severs, chokes • Has done to them (patient): fights, defeats, refuses • Is described as (attribute): evil, frustrated, lord

  42. The Villain • Is character in the movie “Star Wars” • Science Fiction, Adventure, Space Opera, Fantasy, Family Film, Action • Is played by David Prowse • Male • 42 years old in 1977

  43. Task: learning character types from textual descriptions of characters. Data | Source: 42,306 movie plot summaries | Wikipedia; 15,099 English novels (1700–1899) | HathiTrust

  44. Personas • attribute: dark, major, henchman, warrior, sergeant • agent: shoot, aim, overpower, interrogate, kill • Highest weighted features: Male • Action • War film • Jason Bourne, The Bourne Supremacy

  45. Personas • patient: capture, corner, transport, imprison, trap • agent: infiltrate, deduce, leap, evade, obtain, flee, escape, swim, hide, manage • Highest weighted features: Female • Action • Adventure • Ginormica (Monsters vs. Aliens)

  46. Evaluation I: Names • Gold clusters: characters with the same name (sequels, remakes) • Noise: “street thug” • 970 unique character names used twice in the data; n=2,666

  47. Evaluation II: TV Tropes • Gold clusters: manually clustered characters from www.tvtropes.com • “The Surfer Dude” • “Arrogant Kung-Fu Guy” • “Hardboiled Detective” • “The Klutz” • “The Valley Girl” • 72 character tropes containing 501 characters

  48. Purity: Names [Bar chart: purity (0–70) for Persona Regression vs. Dirichlet Persona across model sizes 25x25, 25x50, 25x100, 50x25, 50x50, 50x100]

  49. Purity: TV Tropes [Bar chart: purity (0–70) for Persona Regression vs. Dirichlet Persona across model sizes 25x25, 25x50, 25x100, 50x25, 50x50, 50x100]

  50. Evaluation — 𝓨 | task
 • set of skyscrapers | learn patterns that define architectural styles
 • set of books | learn patterns that define genre
 • customer data | learn patterns that suggest “types” of customer behavior

  51. Digital Humanities • Marche (2012), Literature Is not Data: Against Digital Humanities • Underwood (2015), Seven ways humanists are using computers to understand text.

  52. Text visualization

  53. Characteristic vocabulary Characteristic words by William Wordsworth (in comparison to other contemporary poets) [Underwood 2015]

  54. Finding and organizing texts • e.g., finding all examples of a complex literary form (Haiku). • Supplement traditional searches: book catalogues, search engines.

  55. Modeling literary forms • What features of a text are predictive of Haiku?

  56. Modeling social boundaries Predicting reviewed texts [Underwood and Sellers (2015)]

  57. Unsupervised modeling
