Semi-Supervised Learning and Text Analysis (Machine Learning 10-701)


  1. Semi-Supervised Learning and Text Analysis. Machine Learning 10-701, November 29, 2005. Tom M. Mitchell, Carnegie Mellon University.

  2. Document Classification: Bag of Words Approach. Example word-count vector: aardvark 0, about 2, all 2, Africa 1, apple 0, anxious 0, ..., gas 1, ..., oil 1, ..., Zaire 0.
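
A minimal sketch of this representation (illustrative code, not from the slides; the vocabulary and document are invented):

```python
import re
from collections import Counter

# Fixed vocabulary; in practice it is built from the training corpus.
vocab = ["aardvark", "about", "all", "africa", "apple", "anxious", "gas", "oil", "zaire"]

def bag_of_words(text, vocab):
    """Map a document to a vector of word counts over a fixed vocabulary."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

doc = "About all about Africa: gas and oil."
print(bag_of_words(doc, vocab))  # [0, 2, 1, 1, 0, 0, 1, 1, 0]
```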

  3. For code, see www.cs.cmu.edu/~tom/mlbook.html (click on “Software and Data”).

  4. Supervised Training for Document Classification
     • Common algorithms:
       – Logistic regression, support vector machines, Bayesian classifiers
     • Quite successful in practice:
       – Email classification (spam, foldering, ...)
       – Web page classification (product description, publication, ...)
       – Intranet document organization
     • Research directions:
       – More elaborate, domain-specific classification models (e.g., for email)
       – Using unlabeled data too → semi-supervised methods
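
For concreteness, a minimal supervised sketch in the spirit of this slide, using scikit-learn's logistic regression on bag-of-words counts (the library choice and the toy spam corpus are my assumptions, not from the slides):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Invented toy corpus: 1 = spam, 0 = not spam.
docs = ["cheap oil stocks, buy now", "meeting moved to 3pm",
        "win a free trip today", "draft contract attached"]
labels = [1, 0, 1, 0]

vec = CountVectorizer()                  # documents -> bag-of-words count matrix
X = vec.fit_transform(docs)
clf = LogisticRegression().fit(X, labels)
print(clf.predict(vec.transform(["free oil stocks today"])))  # expect [1]
```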

  5. EM for Semi-supervised document classification

  6. Using Unlabeled Data to Help Train a Naïve Bayes Classifier. Learn P(Y|X) from examples, some with Y observed and some with Y missing:

     Y    X1   X2   X3   X4
     1    0    0    1    1
     0    0    1    0    0
     0    0    0    1    0
     ?    0    1    1    0
     ?    0    1    0    1

  7. From [Nigam et al., 2000]

  8. $w_t$ is the t-th word in the vocabulary; $N(w_t, d_i)$ is its count in document $d_i$.

     M step: $\hat{\theta}_{w_t|c_j} = \frac{1 + \sum_i N(w_t,d_i)\,P(c_j|d_i)}{|V| + \sum_{s=1}^{|V|}\sum_i N(w_s,d_i)\,P(c_j|d_i)}$

     E step: $P(c_j|d_i;\hat{\theta}) = \frac{P(c_j|\hat{\theta})\,\prod_t P(w_t|c_j;\hat{\theta})^{N(w_t,d_i)}}{\sum_r P(c_r|\hat{\theta})\,\prod_t P(w_t|c_r;\hat{\theta})^{N(w_t,d_i)}}$
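
A compact sketch of this EM loop for semi-supervised multinomial Naïve Bayes, following the Nigam et al. scheme (illustrative code; the function and variable names, the uniform initialization, and the count-matrix input format are my assumptions):

```python
import numpy as np

def semisup_nb_em(Xl, yl, Xu, n_classes, n_iters=10):
    """EM for semi-supervised multinomial Naive Bayes.
    Xl: (l, V) word-count matrix for labeled docs; yl: labels in {0..C-1};
    Xu: (u, V) word-count matrix for unlabeled docs."""
    V = Xl.shape[1]
    R_l = np.eye(n_classes)[yl]                               # fixed one-hot responsibilities
    R_u = np.full((Xu.shape[0], n_classes), 1.0 / n_classes)  # initial uniform guesses
    X = np.vstack([Xl, Xu])
    for _ in range(n_iters):
        R = np.vstack([R_l, R_u])
        # M step: Laplace-smoothed word probabilities and class priors
        counts = R.T @ X                                      # (C, V) expected word counts
        theta_w = (1 + counts) / (V + counts.sum(axis=1, keepdims=True))
        prior = (1 + R.sum(axis=0)) / (n_classes + R.shape[0])
        # E step (unlabeled docs only): posterior class membership
        log_p = np.log(prior) + Xu @ np.log(theta_w).T
        log_p -= log_p.max(axis=1, keepdims=True)             # for numerical stability
        R_u = np.exp(log_p)
        R_u /= R_u.sum(axis=1, keepdims=True)
    return prior, theta_w
```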

  9. Elaboration 1: downweight the influence of unlabeled examples by a factor λ, chosen by cross-validation. New M step:

     $\hat{\theta}_{w_t|c_j} = \frac{1 + \sum_{i \in L} N(w_t,d_i)\,P(c_j|d_i) + \lambda \sum_{i \in U} N(w_t,d_i)\,P(c_j|d_i)}{|V| + \sum_{s=1}^{|V|}\left[\sum_{i \in L} N(w_s,d_i)\,P(c_j|d_i) + \lambda \sum_{i \in U} N(w_s,d_i)\,P(c_j|d_i)\right]}$
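
In the sketch above, this elaboration amounts to scaling the unlabeled responsibilities by λ before the M step (a hypothetical helper, assuming the same matrices as before):

```python
import numpy as np

def weighted_responsibilities(R_l, R_u, lam):
    """EM-lambda: downweight unlabeled docs' responsibilities in the M step."""
    return np.vstack([R_l, lam * R_u])  # labeled examples keep weight 1.0
```

Substituting this for `R = np.vstack([R_l, R_u])` in the earlier sketch reproduces the new M step; λ = 0 ignores the unlabeled data entirely, and λ = 1 recovers standard EM.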

  10. Using one labeled example per class

  11. 20 Newsgroups

  12. 20 Newsgroups

  13. EM for Semi-Supervised Doc Classification
     • If all data is labeled, corresponds to the Naïve Bayes classifier
     • If all data is unlabeled, corresponds to mixture-of-multinomial clustering
     • If both labeled and unlabeled data, it helps if and only if the mixture-of-multinomial modeling assumption is correct
     • Of course we could extend this to Bayes net models other than Naïve Bayes (e.g., a TAN tree)

  14. Bags of Words, or Bags of Topics?

  15. LDA: Generative Model for Documents [Blei, Ng, Jordan 2003]. Also extended to the case where the number of topics is not known in advance (hierarchical Dirichlet processes, [Blei et al., 2004]).

  16. Clustering Words into Topics with Hierarchical Topic Models (unknown number of clusters) [Blei, Ng, Jordan 2003]
     Probabilistic model for generating document D:
     1. Pick a distribution P(z|θ) over topics according to P(θ|α)
     2. For each word w in D:
        • Pick topic z from P(z|θ)
        • Pick word w from P(w|z, φ)
     Training this model defines the topics, i.e., φ, which defines P(W|Z).
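
A sketch of this generative process in NumPy (the hyperparameters, sizes, and a fixed φ are invented for illustration; this samples one document, it does not train the model):

```python
import numpy as np

rng = np.random.default_rng(0)
n_topics, vocab_size, doc_len = 3, 8, 20
alpha = np.ones(n_topics)                                 # Dirichlet prior over topics
phi = rng.dirichlet(np.ones(vocab_size), size=n_topics)   # P(w | z), one row per topic

theta = rng.dirichlet(alpha)                 # 1. pick topic distribution theta ~ P(theta|alpha)
doc = []
for _ in range(doc_len):
    z = rng.choice(n_topics, p=theta)        # 2a. pick topic z from P(z | theta)
    w = rng.choice(vocab_size, p=phi[z])     # 2b. pick word w from P(w | z, phi)
    doc.append(w)
print(doc)                                   # word indices of one sampled document
```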

  17. Example topics induced from a large collection of text. Top words for eight induced topics (one column per topic in the original table, each headed by its most probable word):
     • JOB, WORK, JOBS, CAREER, EXPERIENCE, EMPLOYMENT, OPPORTUNITIES, WORKING, TRAINING, SKILLS, CAREERS, POSITIONS, FIND, POSITION, FIELD, OCCUPATIONS, REQUIRE, OPPORTUNITY, EARN, ABLE
     • SCIENCE, STUDY, SCIENTISTS, SCIENTIFIC, KNOWLEDGE, WORK, RESEARCH, CHEMISTRY, TECHNOLOGY, MANY, MATHEMATICS, BIOLOGY, FIELD, PHYSICS, LABORATORY, STUDIES, WORLD, SCIENTIST, STUDYING, SCIENCES
     • BALL, GAME, TEAM, FOOTBALL, BASEBALL, PLAYERS, PLAY, FIELD, PLAYER, BASKETBALL, COACH, PLAYED, PLAYING, HIT, TENNIS, TEAMS, GAMES, SPORTS, BAT, TERRY
     • FIELD, MAGNETIC, MAGNET, WIRE, NEEDLE, CURRENT, COIL, POLES, IRON, COMPASS, LINES, CORE, ELECTRIC, DIRECTION, FORCE, MAGNETS, BE, MAGNETISM, POLE, INDUCED
     • STORY, STORIES, TELL, CHARACTER, CHARACTERS, AUTHOR, READ, TOLD, SETTING, TALES, PLOT, TELLING, SHORT, FICTION, ACTION, TRUE, EVENTS, TELLS, TALE, NOVEL
     • DISEASE, BACTERIA, DISEASES, GERMS, FEVER, CAUSE, CAUSED, SPREAD, VIRUSES, INFECTION, VIRUS, MICROORGANISMS, PERSON, INFECTIOUS, COMMON, CAUSING, SMALLPOX, BODY, INFECTIONS, CERTAIN
     • MIND, WORLD, DREAM, DREAMS, THOUGHT, IMAGINATION, MOMENT, THOUGHTS, OWN, REAL, LIFE, IMAGINE, SENSE, CONSCIOUSNESS, STRANGE, FEELING, WHOLE, BEING, MIGHT, HOPE
     • WATER, FISH, SEA, SWIM, SWIMMING, POOL, LIKE, SHELL, SHARK, TANK, SHELLS, SHARKS, DIVING, DOLPHINS, SWAM, LONG, SEAL, DIVE, DOLPHIN, UNDERWATER
     [Tenenbaum et al]

  18. Example topics induced from a large collection of text (same table as the previous slide). Significance:
     • Learned topics reveal hidden, implicit semantic categories in the corpus
     • In many cases, we can represent documents with ~10^2 topics instead of ~10^5 words
     • Especially important for short documents (e.g., emails): topics overlap when words don't!
     [Tenenbaum et al]

  19. Can we analyze roles and relationships between people by analyzing email word or topic distributions?

  20. Author-Recipient-Topic Model for Email. Graphical models compared: Latent Dirichlet Allocation (LDA) [Blei, Ng, Jordan, 2003] vs. the Author-Recipient-Topic model (ART) [McCallum, Corrada, Wang, 2004].

  21. Enron Email Corpus
     • 250k email messages
     • 23k people

     Date: Wed, 11 Apr 2001 06:56:00 -0700 (PDT)
     From: debra.perlingiere@enron.com
     To: steve.hooser@enron.com
     Subject: Enron/TransAltaContract dated Jan 1, 2001

     Please see below. Katalin Kiss of TransAlta has requested an electronic copy of our final draft? Are you OK with this? If so, the only version I have is the original draft without revisions. DP

     Debra Perlingiere, Enron North America Corp., Legal Department, 1400 Smith Street, EB 3885, Houston, Texas 77002, dperlin@enron.com

  22. Topics and prominent senders/receivers discovered by ART [McCallum et al, 2004]. (Figure: top words within each topic; top author-recipient pairs exhibiting each topic.)

  23. Topics and prominent senders/receivers discovered by ART
     Beck = Chief Operations Officer
     Dasovich = Government Relations Executive
     Shapiro = Vice President of Regulatory Affairs
     Steffes = Vice President of Government Affairs

  24. Discovering Role Similarity. connection strength(A, B) =
     • Traditional SNA: similarity in the recipients they sent email to
     • ART: similarity in authored topics, conditioned on recipient
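
A toy sketch of the contrast (all names and counts invented): traditional SNA compares whom A and B send mail to, while ART compares the topic distributions of what they author.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two nonzero vectors."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Invented toy data for two people A and B.
recipients_A = np.array([5.0, 0.0, 2.0, 0.0])  # messages sent to each of 4 colleagues
recipients_B = np.array([0.0, 4.0, 1.0, 0.0])
topics_A = np.array([0.7, 0.2, 0.1])           # authored-topic distribution (from ART)
topics_B = np.array([0.6, 0.3, 0.1])

print("SNA similarity:", cosine(recipients_A, recipients_B))  # low: different recipients
print("ART similarity:", cosine(topics_A, topics_B))          # high: similar roles/topics
```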

  25. Co-Training for Semi-Supervised Document Classification. Idea: take advantage of *redundancy*.

  26. Redundantly Sufficient Features: the hyperlink text (“my advisor”) and the linked page (Professor Faloutsos) each carry enough information to classify the page.

  27. Redundantly Sufficient Features: “my advisor” / Professor Faloutsos.

  28. Redundantly Sufficient Features

  29. Redundantly Sufficient Features: “my advisor” / Professor Faloutsos.

  30. Co-Training. Key idea: Classifier 1 and Classifier 2 must:
     1. Correctly classify the labeled examples
     2. Agree on the classification of the unlabeled examples
     (Diagram: Classifier 1 → Answer 1, Classifier 2 → Answer 2)

  31. CoTraining Algorithm #1 [Blum & Mitchell, 1998]
     Given: labeled data L, unlabeled data U
     Loop:
       Train g1 (hyperlink classifier) using L
       Train g2 (page classifier) using L
       Allow g1 to label p positive, n negative examples from U
       Allow g2 to label p positive, n negative examples from U
       Add these self-labeled examples to L
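
A minimal sketch of this loop in Python (assumptions: Naïve Bayes as the base classifiers, p = n = 1, binary 0/1 labels with both classes present in L, and a predict-proba confidence heuristic; none of these specifics come from the slide):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def cotrain(L1, L2, y, U, rounds=10):
    """Co-training over two feature views.
    L1, L2: lists of count vectors for the two views of the labeled data
    (e.g., hyperlink words / page words); y: 0-1 labels; U: (x1, x2) pairs."""
    L1, L2, y, U = list(L1), list(L2), list(y), list(U)
    g1 = g2 = None
    for _ in range(rounds):
        g1 = MultinomialNB().fit(np.array(L1), y)   # hyperlink-view classifier
        g2 = MultinomialNB().fit(np.array(L2), y)   # page-view classifier
        for g, view in ((g1, 0), (g2, 1)):
            if not U:
                return g1, g2
            probs = g.predict_proba(np.array([u[view] for u in U]))
            i = int(np.argmax(probs[:, 1]))         # most confident positive
            x1, x2 = U.pop(i)
            L1.append(x1); L2.append(x2); y.append(1)
            if U:
                probs = np.delete(probs, i, axis=0) # keep indices aligned with U
                i = int(np.argmax(probs[:, 0]))     # most confident negative
                x1, x2 = U.pop(i)
                L1.append(x1); L2.append(x2); y.append(0)
    return g1, g2
```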

  32. CoTraining: Experimental Results
     • Begin with 12 labeled web pages (academic course)
     • Provide 1,000 additional unlabeled web pages
     • Average error learning from labeled data: 11.1%
     • Average error with co-training: 5.0%
     (Plot: typical run)

  33. Co-Training for Named Entity Extraction (i.e., classifying which strings refer to people, places, dates, etc.) [Riloff & Jones 98; Collins et al., 98; Jones 05]
     Example: “I flew to New York today.” Classifier 1 sees the candidate string itself (“New York” → Answer 1); Classifier 2 sees its context (“I flew to ____ today” → Answer 2).

  34. CoTraining setting:
     • Wish to learn f: X → Y, given L and U drawn from P(X)
     • Features describing X can be partitioned (X = X1 × X2) such that f can be computed from either X1 or X2
     One result [Blum & Mitchell 1998]:
     • If X1 and X2 are conditionally independent given Y, and f is PAC learnable from noisy labeled data,
     • Then f is PAC learnable from a weak initial classifier plus unlabeled data

  35. Co-Training Rote Learner. (Figure: bipartite graph of pages and hyperlinks, with a few examples labeled +/− and the hyperlink “my advisor”.)

  36. Co-Training Rote Learner. (Figure: the same graph with labels propagated to more pages and hyperlinks.)

  37. Co-Training Rote Learner. (Figure: labels propagated throughout the connected components.)

  38. Expected rote co-training error given m examples.
     CoTraining setting: learn f : X → Y, where X = X1 × X2, x is drawn from an unknown distribution, and ∃ g1, g2 such that ∀x: g1(x1) = g2(x2) = f(x).

     $E[\mathrm{error}] = \sum_j P(x \in g_j)\,\left(1 - P(x \in g_j)\right)^m$

     where $g_j$ is the j-th connected component of the graph of L + U, and m is the number of labeled examples.
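
A small numeric illustration of this bound (the component probabilities are invented): a component $g_j$ is missed by all m labeled examples with probability $(1 - P(x \in g_j))^m$, and instances in missed components are exactly the ones the rote learner cannot label.

```python
def expected_rote_cotrain_error(component_probs, m):
    """E[error] = sum_j P(x in g_j) * (1 - P(x in g_j))**m."""
    return sum(p * (1 - p) ** m for p in component_probs)

# Invented example: four connected components covering the instance space.
probs = [0.4, 0.3, 0.2, 0.1]
for m in (1, 5, 20):
    print(m, expected_rote_cotrain_error(probs, m))  # error shrinks as m grows
```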
