3/19/10

Pucktada Treeratpituk (Puck)
PhD Student
College of Information Sciences & Technology
Penn State University

Background
• Education
  – 3rd-year PhD, Information Sciences & Technology, Penn State
    • CiteSeerX digital library (http://citeseerx.ist.psu.edu)
    • Advisor: Dr. C. Lee Giles
      – Intelligent Information Systems Research Lab
  – MS, Language Technology, Carnegie Mellon University
  – MS, Computer Science, Stanford University
  – BS, Computer Science, Mathematics and Economics, Carnegie Mellon University
• General Research Interest
  – Data Mining, Information Retrieval, Information Extraction, Natural Language Processing and Digital Library

Current Research
• Author Disambiguation in Digital Libraries
  – using machine learning techniques to resolve name ambiguity
  – http://citeseerx.ist.psu.edu
• Key Phrase Extraction from Scholarly Work
  – mining & identifying important topics in each academic paper
• Automatic Expertise Extraction
  – identifying expertise & research interests for each author based on the content of their publication records
  – http://singularity.ist.psu.edu/expert
• Expert Finding in Digital Library
  – computing query-dependent expert ranking

Author Disambiguation in CiteSeerX

Automatic Expertise Extraction
http://singularity.ist.psu.edu/expert

Author Disambiguation in CiteSeerX
• Since CiteSeerX crawls the web for scientific papers (pdf/ps), we have to rely on the metadata we extract from the papers for disambiguation.
• Approaches
  – Learn to estimate the likelihood that two author names from two different papers refer to the same person, based on metadata such as affiliation, paper title, coauthors, etc., then cluster the author names.
  – Previous Work
    • SVM + DBSCAN (Huang et al., PKDD’06)
    • Topic Model (Song et al., JCDL’07)
    • Random Forest (Treeratpituk et al., JCDL’09) (for MEDLINE)
• Also do other types of record matching, such as citations
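
The two-step idea on the slide above (score pairs of author records, then cluster) can be sketched roughly as follows. This is a minimal illustration only, not the method of any of the cited papers: `pair_score` is a hypothetical hand-weighted stand-in for a learned model (SVM, random forest), the weights and the threshold are arbitrary assumptions, and the clustering step links pairs above the threshold and takes connected components, which is simpler than DBSCAN. The sample records are made up.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard overlap of two collections; 0.0 when both are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pair_score(r1, r2):
    """Hypothetical similarity score standing in for a learned pairwise model.

    The weights below are arbitrary assumptions chosen for illustration.
    """
    coauthor = jaccard(r1["coauthors"], r2["coauthors"])
    title = jaccard(r1["title"].lower().split(), r2["title"].lower().split())
    same_affil = bool(r1["affiliation"]) and r1["affiliation"] == r2["affiliation"]
    return 0.5 * coauthor + 0.2 * title + 0.3 * (1.0 if same_affil else 0.0)


def cluster(records, threshold=0.3):
    """Link record pairs scoring >= threshold; return connected components."""
    parent = list(range(len(records)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(records)), 2):
        if pair_score(records[i], records[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(records)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Made-up records that all share the ambiguous name "J. Smith".
records = [
    {"name": "J. Smith", "affiliation": "Example University",
     "title": "Author disambiguation in digital libraries",
     "coauthors": {"A. Kim"}},
    {"name": "J. Smith", "affiliation": "Example University",
     "title": "Name disambiguation with random forests",
     "coauthors": {"A. Kim", "B. Lee"}},
    {"name": "J. Smith", "affiliation": "Example Medical School",
     "title": "Protein folding kinetics",
     "coauthors": {"C. Chen"}},
]

print(cluster(records))
```

With these sample records, the first two share an affiliation and a coauthor and end up in one cluster, while the third stays on its own. In a real system the score would come from a classifier trained on labeled pairs, and comparing all pairs would be restricted to records with compatible names to keep the cost manageable.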

What’s Next
• Iterative approach
  – Right now, disambiguation is done in batch.
  – New documents are added every week, so disambiguation should also be done iteratively.
• Interactive Mode
  – Automatic disambiguation will never be 100% perfect.
  – Allow user correction (merge/split).
  – Provide suggestions.

Thanks…