Web User Profiling using Data Redundancy


  1. Web User Profiling using Data Redundancy http://aminer.org/profiling Xiaotao Gu, Hong Yang, Jie Tang, Jing Zhang Tsinghua University

  2. Web User Profiling using Data Redundancy • Introduction • Traditional Way • Basic Idea • MagicFG • Experiments • Conclusion

  3. Profile attributes: Position, Phone & Fax, Email, Homepage, Affiliation, Address. Applications: • Expert Finding • Recommendation • Getting in Touch • …

  5. Traditional Way: Two-Step • Source Finding • Extraction

  6. Traditional Way: Two-Step • Source Finding • Extraction • Typical models: SVM, LR, CRF

  7. Traditional Way: Two-Step • Low Recall – single data source • Low Precision – error propagation

  8. Traditional Way: Two-Step • Low Recall – single data source • Low Precision – error propagation: Homepage Finding (90%) × Profile Extraction (90%) = Result (81%)
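
The error-propagation arithmetic on this slide can be checked with a minimal sketch. It assumes the two steps fail independently and that the final profile is correct only when both steps succeed; the function name `pipeline_accuracy` is hypothetical and not part of the original slides.

```python
import math


def pipeline_accuracy(step_accuracies):
    """Accuracy of a sequential pipeline.

    Assumption: step errors are independent, and the final result is
    correct only if every step is correct, so the accuracies multiply.
    """
    return math.prod(step_accuracies)


# Homepage finding (90%) followed by profile extraction (90%).
two_step = pipeline_accuracy([0.90, 0.90])
print(f"{two_step:.0%}")  # prints 81%
```

Each extra step in the chain multiplies in another factor below 1, which is the motivation the slides give for a one-step framework.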

  10. Basic Idea • A Uniform Framework ✓ All in one step, avoiding error propagation ✓ Incorporates information from different data sources: Homepage, Google Scholar, Twitter, LinkedIn, Facebook, etc.

  11. Basic Idea • A Uniform Framework • Search Engine as the Data Source

  12. Basic Idea • Search Engine as Data Source

  13. Basic Idea - Search Engine as Data Source: Why snippets? ✓ Efficient • Unlike traditional methods that crawl each relevant page, working from snippets is much faster and more stable, because the servers hosting the relevant pages can have very different network speeds. ✓ Effective • With the constructed “smart” queries, more than 90% of the profile attributes are already contained in the snippets returned by the search engine. ✓ Economical • There is no need to maintain a large database recording all relevant pages for every queried person. This matters because AMiner, for example, has more than 130,000,000 researchers, and maintaining such a database for all of them is itself a challenging task.

  14. Basic Idea • A Uniform Framework • Search Engine as the Data Source • Smart Query Construction. Profile Attributes: Categorical (Gender, Position, Country…); Non-Categorical (Email, Affiliation, Address…)

  15. Query Construction (Non-Categorical): Query = Person_Name + Attribute_Name, e.g. “Phillip S. Yu email”

  16. Query Construction (Categorical): Query = Person_Name + Representative Words, e.g. “Phillip S. Yu his OR her”

  17. Representative Words. Male: “his”, “he”, “…”. Female: “her”, “she”, “…”. Query = “Phillip S. Yu his OR her”
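
The two query patterns from the slides above can be sketched as follows. This is only an illustration: the function names are hypothetical, and the slides do not say how the authors' code assembles or escapes queries.

```python
def build_noncategorical_query(person_name, attribute_name):
    """Non-categorical attribute (e.g. email): Person_Name + Attribute_Name."""
    return f"{person_name} {attribute_name}"


def build_categorical_query(person_name, representative_words):
    """Categorical attribute (e.g. gender): Person_Name + representative
    words, joined with OR so that a snippet matching any of them is returned."""
    return f"{person_name} {' OR '.join(representative_words)}"


print(build_noncategorical_query("Phillip S. Yu", "email"))
# prints: Phillip S. Yu email
print(build_categorical_query("Phillip S. Yu", ["his", "her"]))
# prints: Phillip S. Yu his OR her
```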

  18. Basic Idea • A Uniform Framework • Search Engine as the Data Source • Smart Query Construction • Basic Classification

  19. Feature Definition. Email: • First name in prefix • Last name in prefix • Initials in prefix • … Gender: • How many “his” • How many “her” • …
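
The listed features could be computed roughly as in the sketch below. The tokenization, the definition of "initials" (first letter of each name token), and the function names are all assumptions made for illustration; the slides only name the features.

```python
import re


def email_prefix_features(person_name, email):
    """Binary features relating a candidate email's prefix to the person's name."""
    # Assumption: name tokens are whitespace-separated; trailing dots are dropped.
    tokens = [t.strip(".").lower() for t in person_name.split()]
    tokens = [t for t in tokens if t]
    prefix = email.split("@", 1)[0].lower()
    initials = "".join(t[0] for t in tokens)
    return {
        "first_name_in_prefix": tokens[0] in prefix,
        "last_name_in_prefix": tokens[-1] in prefix,
        "initials_in_prefix": initials in prefix,
    }


def gender_word_counts(snippets, words=("his", "her")):
    """Count whole-word occurrences of each representative word in the snippets."""
    text = " ".join(snippets).lower()
    return {w: len(re.findall(rf"\b{re.escape(w)}\b", text)) for w in words}


print(email_prefix_features("Phillip S. Yu", "psyu@cs.uic.edu"))
print(gender_word_counts(["He presented his work with his students.",
                          "Her talk was well received."]))
```

For `psyu@cs.uic.edu` the last name and the initials both appear in the prefix while the full first name does not, which is the kind of signal a classifier can weigh.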

  20. Basic Classification (Email, Gender): uniformly outperforms the baselines (TCRF, FGNL)

  22. MagicFG - Markov Logic Factor Graph: Data Redundancy → Logic Factors → More Accurate Classification

  23. Why logic factors? ✓ Depict and utilize correlations between possible candidates from redundant data. ✓ Incorporate human knowledge to guide and amend the classification model. [Figure: factor graph over candidate labels y1 to y5, with attribute factors f(y_i, x_i) linking each label to its observation and logic factors g(y_1, y_2), g(y_2, y_4), g(y_4, y_5) linking labels; the logic factors are annotated as Complete Consistency, Partial Consistency, and Prior Knowledge.]

  24. Logic Factors • Complete Consistency: two identical vertices must share the same label. e.g. two occurrences of psyu@cs.uic.edu are either both True or both False.

  25. Logic Factors • Partial Consistency: two similar vertices probably share the same (preferred) label. e.g. two emails sharing the same prefix, such as psyu@cs.uic.edu and psyu@uic.edu, are probably both True (credible for the target user).

  26. Logic Factors • Prior Knowledge: some prior knowledge can be converted to logic factors. e.g. some email addresses are modified (blocked) for some reason, but their domains remain visible and credible, so emails with the same domain as a blocked one are probably valid. Example: email@cs.uic.edu (Blocked), psyu@uic.edu (probably True).
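
The three kinds of logic factors can be illustrated with a small sketch that decides which factor, if any, links two email candidates. Everything here is an assumption for illustration: the `Candidate` type, the `blocked` flag, and the rule that a domain "matches" when it is equal to or a subdomain of the other (chosen so that the slide's cs.uic.edu / uic.edu example is linked). The slides do not specify how MagicFG detects these relations.

```python
from itertools import combinations
from typing import NamedTuple


class Candidate(NamedTuple):
    address: str
    blocked: bool = False  # hypothetical flag: the prefix was obfuscated in the snippet


def split_email(address):
    prefix, _, domain = address.lower().partition("@")
    return prefix, domain


def related_domains(a, b):
    # Assumption: equal domains, or one is a subdomain of the other.
    return a == b or a.endswith("." + b) or b.endswith("." + a)


def logic_factor(a, b):
    """Return the kind of logic factor linking two candidates, or None."""
    prefix_a, domain_a = split_email(a.address)
    prefix_b, domain_b = split_email(b.address)
    if a.blocked != b.blocked:
        # Prior knowledge: a visible email sharing a domain with a blocked one.
        return "prior_knowledge" if related_domains(domain_a, domain_b) else None
    if a.blocked and b.blocked:
        return None
    if (prefix_a, domain_a) == (prefix_b, domain_b):
        return "complete_consistency"  # identical vertices must share a label
    if prefix_a == prefix_b:
        return "partial_consistency"   # same prefix: probably the same label
    return None


def build_logic_factors(candidates):
    """List (i, j, kind) for every candidate pair linked by a logic factor."""
    edges = []
    for (i, a), (j, b) in combinations(enumerate(candidates), 2):
        kind = logic_factor(a, b)
        if kind is not None:
            edges.append((i, j, kind))
    return edges


candidates = [
    Candidate("psyu@cs.uic.edu"),
    Candidate("psyu@cs.uic.edu"),
    Candidate("psyu@uic.edu"),
    Candidate("email@cs.uic.edu", blocked=True),
]
for edge in build_logic_factors(candidates):
    print(edge)
```

In a factor graph, each returned edge would become a factor g(y_i, y_j) between the two candidate labels, with a weight learned per factor kind.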

  27. Markov Logic Factor Graph • Attribute factor function • Logic factor function • Log-likelihood function • Target parameter

  28. Markov Logic Factor Graph • Training: Gradient Ascent • Gradient: • Learning: • Classification:
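
The formulas for the items on the two slides above did not survive in this transcript. As a reading aid only, the block below writes out a generic log-linear factor graph of the kind these items usually refer to. It is an assumed standard form, not the authors' exact definitions: the feature functions \phi and \psi, the weights \alpha and \beta, the edge set E, the learning rate \eta, and the fully labeled training set Y^L are all introduced here for illustration.

```latex
% Assumed attribute and logic factor functions (log-linear form)
f(y_i, x_i) = \exp\!\big(\alpha^{\top} \phi(y_i, x_i)\big), \qquad
g(y_i, y_j) = \exp\!\big(\beta^{\top} \psi(y_i, y_j)\big)

% Joint distribution over all candidate labels, with normalizer Z
P_\theta(Y \mid X) = \frac{1}{Z} \prod_{i} f(y_i, x_i) \prod_{(i,j) \in E} g(y_i, y_j)

% Log-likelihood of the labeled data, with target parameter \theta = (\alpha, \beta)
\mathcal{O}(\theta) = \sum_{i} \alpha^{\top} \phi(y_i^{L}, x_i)
  + \sum_{(i,j) \in E} \beta^{\top} \psi(y_i^{L}, y_j^{L}) - \log Z

% Gradient: observed feature sums minus their expectation under the model,
% where S(Y, X) stacks the summed \phi and \psi features
\frac{\partial \mathcal{O}}{\partial \theta}
  = S(Y^{L}, X) - \mathbb{E}_{P_\theta(Y \mid X)}\big[ S(Y, X) \big]

% Learning: gradient ascent update
\theta \leftarrow \theta + \eta \, \frac{\partial \mathcal{O}}{\partial \theta}

% Classification: most likely joint labeling
Y^{*} = \arg\max_{Y} P_\theta(Y \mid X)
```

Because Z sums over all joint labelings, the expectation in the gradient is typically approximated, for example with loopy belief propagation; which approximation the authors use is not stated in the slides.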

  30. Accuracy Performance • Comparison between MagicFG and state-of-the-art methods for Email and Gender extraction [Figure: two bar charts, one for Gender and one for Email, showing Precision, Recall, and F1-score for MagicFG against the TCRF and FGNL baselines.]

  31. Accuracy Performance: Logic factors do help! [Figure: two bar charts, one for Gender and one for Email, showing Precision, Recall, and F1-score for the variants Basic, Basic+CC, Basic+CC+PC, and Basic+CC+PC+PK, where CC = Complete Consistency, PC = Partial Consistency, PK = Prior Knowledge.]

  33. Conclusion • Motivation: solve the low recall and error propagation of traditional two-step methods. • Basic Idea: use the search engine as the data source. • MagicFG: utilize correlations in redundant data and incorporate human knowledge.

  34. Thank you! Code & Data: http://aminer.org/profiling
