Taming the Beast: Sparse Machine Learning for Large Text Corpora

Laurent El Ghaoui
Berkeley Center for New Media & EECS Dept., UC Berkeley
with help from Guan-Cheng Li, Vu Pham, Viet-An Duong, Xinyu Dai

New Directions in Management Science and Engineering Lecture
MS&E Department, Stanford University, May 15, 2012

Outline

Information Overload
Topic imaging
    Predictive approach
    Visualizations
    Beyond co-occurrence
    Examples
Research Agenda
    Sparse PCA
    SAFE for LASSO
    Contextual applications

Information Overload

Avalanche of “information” in text format, e.g.:
◮ News articles, press releases, RSS feeds, TV captioning data.
◮ 10-K filings, marketing brochures, financial analyst reports, and other company-related documents.
◮ Consumer reviews, blogs, emails, and other social media content.
◮ Scientific papers, patents, law documents, bills, literature.

The top 20 most important news sources generated ∼40,000 news articles yesterday.

What might be useful?

◮ Summarize large text databases.
◮ Detect and visualize trends in term usage.
◮ Compare how topics of interest are treated across different sources.
◮ Allow for quick translation of summaries if the original data is in a foreign language.
◮ Cluster text documents.
◮ Provide interpretable visualizations.

Approach: sparse machine learning tools to help in these tasks.

Example: Discovery of emerging issues in flight security

After each commercial flight in the US, pilots generate “ASRS reports” to document flight-related issues.

Key problem: detect emerging issues that are not being classified into existing categories, e.g.:
◮ “Wake vortex” problem of the Boeing 757.
◮ Increased number of runway incursions at LAX.

Don’t search for a needle; picture the haystack!

StatNews project: Statistical Analysis of News

Project started in 2007, with collaborators:
◮ In statistics, optimization: Bin Yu (Stat, UCB), Alexandre d’Aspremont (Ecole Polytechnique), Francis Bach (INRIA).
◮ In social sciences: Lee Fleming (IEOR), Sophie Clavier (International Relations, SFSU).

Sponsors: NSF, Google, CITRIS and INRIA.

StatNews web site: Data

◮ Archives:
    ◮ New York Times, 1987-2007 (2.5 million articles).
    ◮ NYT headlines from 1851 to present.
    ◮ Headlines from 5 other sources since 1996.
◮ English-language current news (from April 2011 to present): BBC, Ha’aretz, Moscow Times, Reuters, USA Today, Associated Press, The Australian, China Daily, CNN, Financial Times, The Guardian, India Times, Jerusalem Post, New York Times, Russian Times, Washington Post.
◮ Chinese-language current news (People’s Daily).

StatNews project: Goals

◮ Occurrence analysis: picture the relative weight (frequency) given to different topics over time.
◮ Visualize the image (statistical associations) of a word or term as painted in the news, and visualize the evolution of the image over time.
◮ Visualize news sources relative to each other, the propagation of concepts across news sources, and its dynamics.
◮ Provide a web-based service to analyze our text data, and allow users to upload their own (medium-size) databases.

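The occurrence-analysis goal above can be illustrated with a minimal sketch. It assumes the corpus is available as (year, text) pairs and measures a topic's weight as the share of each year's documents that mention a given term. The function name `relative_frequency_by_year`, the whitespace tokenization, and the toy headlines are all illustrative assumptions, not part of the StatNews system.

```python
from collections import defaultdict


def relative_frequency_by_year(docs, term):
    """Return {year: share of that year's documents containing `term`}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for year, text in docs:
        totals[year] += 1
        # Naive tokenization (assumption): lowercase and split on whitespace.
        if term.lower() in text.lower().split():
            hits[year] += 1
    return {year: hits[year] / totals[year] for year in sorted(totals)}


# Toy corpus of (year, headline) pairs, invented for illustration only.
docs = [
    (2006, "markets rally as oil prices fall"),
    (2006, "election results surprise analysts"),
    (2007, "oil prices climb again"),
    (2007, "oil exporters meet in vienna"),
]

trend = relative_frequency_by_year(docs, "oil")
print(trend)  # {2006: 0.5, 2007: 1.0}
```

Normalizing by the number of documents per period, rather than using raw counts, keeps the trend comparable across periods with different publication volumes.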
Topic imaging

Task: topic imaging (subject-specific summarization) in a given corpus.
◮ Sparse statistical prediction as surrogate.
◮ Human experiments to validate and find robust pre-processing schemes.

Result: a short list of terms that summarizes the topic as treated in the corpus.

What is topic imaging?

Topic image: a small set of terms that are semantically related to a given topic (“the query”).

As a predictive problem: predict the appearance of the query term in a document, given the use of the other terms in that document.
◮ The predictive model must be interpretable: the number of predictors (other terms) must be small (sparse modeling).
◮ The model must be obtained fast.

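The predictive formulation above can be sketched in a few lines. The slide does not say which sparse model is used; this sketch assumes an L1-penalized least-squares (LASSO) fit solved by plain cyclic coordinate descent. The names `topic_image` and `lasso_cd`, the ±1 labels, the binary term-presence features, the penalty value, and the toy corpus are all illustrative assumptions.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t (proximal step for the L1 penalty)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def center(values):
    """Subtract the mean, so no explicit intercept is needed."""
    m = sum(values) / len(values)
    return [v - m for v in values]


def lasso_cd(X, y, lam, n_iter=100):
    """Minimize (1/2n)||y - Xw||^2 + lam*||w||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    resid = list(y)  # residual y - Xw; equals y because w starts at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the residual that excludes w[j].
            rho = sum(X[i][j] * (resid[i] + X[i][j] * w[j]) for i in range(n)) / n
            new_wj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_wj - w[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                w[j] = new_wj
    return w


def topic_image(docs, query, lam):
    """Return (term, weight) pairs with nonzero weight, largest magnitude first."""
    token_sets = [set(doc.lower().split()) for doc in docs]
    # Predictors are all terms except the query itself.
    vocab = sorted({t for toks in token_sets for t in toks if t != query})
    # Response: +1 if the query term appears in the document, -1 otherwise.
    y = center([1.0 if query in toks else -1.0 for toks in token_sets])
    cols = [center([1.0 if t in toks else 0.0 for toks in token_sets]) for t in vocab]
    X = [list(row) for row in zip(*cols)]
    w = lasso_cd(X, y, lam)
    selected = [(t, wj) for t, wj in zip(vocab, w) if wj != 0.0]
    return sorted(selected, key=lambda tw: -abs(tw[1]))


# Toy corpus, invented for illustration only.
docs = [
    "wake vortex behind boeing",
    "wake turbulence behind boeing",
    "wake encounter boeing approach",
    "runway incursion lax",
    "runway closed lax approach",
    "taxiway confusion approach",
]

image = topic_image(docs, "wake", lam=0.4)
print([term for term, _ in image])  # ['boeing']
```

With this penalty only one of the twelve candidate terms keeps a nonzero weight, which is the interpretability property the slide asks for. Lowering `lam` admits more terms into the image, so the penalty acts as a dial between a shorter and a longer summary.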