Visual Search and Analysis in Textual and Non-Textual Document Repositories: Approaches, Applications, and Research Challenges
Tobias Schreck
Visual Analytics Group, Computer and Information Science, University of Konstanz, Germany
CLEF 2012: Conference and Labs of the Evaluation Forum, 19.09.2012
1. Need for Search and Analysis in Large Data
Technological progress
– Acquisition, production, storage
– Data integration, data mining
Large and increasing amounts of data
Data-intensive application domains
– Business
– Research
– Engineering
Need for new technologies
– "… to unite the seemingly conflicting requirements of scalability and usability in making sense of the data" [VisMaster 2010]
Information Overload
Share of digital information
– 2000: 25%
– 2002: 50% (Begin Digital Age)
– 2007: 94% (300 Exabyte)
Estimated growth rates (1986-2007)
– Storage: 23%
– Network: 28%
– Compute: 56%
Source: Science, according to [F&L 3/2011]
1. Data Examples
Textual Data Repositories
– Digital Libraries
– Web
– Social Media
Non-textual Data Repositories
– Image repositories
– 3D Object repositories
– Data repositories
Figures: Digital Libraries; Customer Reviews (Amazon.com); www.facebook.com; www.twitter.com; Victoria State Library Image Collection (http://www.slv.vic.gov.au/); Sloan Digital Sky Survey (http://www.sdss.org/); PROBADO3D Archive (http://www.probado.de/3d.html)
1. How to Make Use of Large Data Repositories?
Searching
– Find information entities of interest
– Reuse, comparison
– Based on specification of queries
Analyzing
– Find structures and abstractions ("understand" the data set as a whole)
– Check hypotheses
– Make interesting, actionable observations
Interdependence
– Cycles of searching and analyzing
1. Visual Search and Analysis
Visual representation of the search and analysis process [Shneiderman 1996]
Goals of Visual Information Systems
– Intuitive access, direct manipulation
– Leverage human visual perception
– Encourage exploration
Classic visual search systems
– Filmfinder [Ahlberg and Shneiderman 1994]
– Time Searcher [Hochheiser and Shneiderman 2004]
Classic visual analysis systems
– Spire/In-Spire [Wise et al 1995]
– Visual decision tree construction and analysis [Teoh and Ma 2003]
Figures: [Ahlberg and Shneiderman 1994], [Wise et al 1995]
Propositions of this Talk
1. Emerging large, complex data sources pose new challenges to Information Retrieval and Understanding
2. Visual-interactive methods are useful to support retrieval and data understanding
3. Promising research opportunities at the intersection of visualization, information retrieval, and evaluation
Outline
1. Introduction
2. Overview Visualization for Large Text
  2.1 Feature-based Text Visualization
  2.2 Attribute-based Text Visualization
  2.3 Visual Document Summarization
  2.4 Geo-referenced Micro Blogging Text
3. Visual Search in Non-Textual Data
4. Promising Research Opportunities
5. Conclusions
2.1 Sentiment Analysis
• Opinion score derived from adjectives, nouns, and verbs
• Identifies positive and negative sections
Overview over large document corpora
Find articles which suit the mood of the reader [Keim, Mansmann et al., 2008]
2.1 Sentiment Analysis: News Overview
2.1 Pixel-based Approach
Feature: average sentence length [Oelke et al., 2008]
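
The feature named on this slide can be illustrated with a minimal sketch that computes the average number of words per sentence for a block of text. The regex-based sentence splitter and the function name `average_sentence_length` are assumptions made for illustration; the slide does not say how [Oelke et al., 2008] segment sentences or words.

```python
import re


def average_sentence_length(text):
    """Return the mean number of words per sentence (0.0 for empty text)."""
    # Naive splitter (assumption): a sentence ends at '.', '!' or '?'.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    # Words are whitespace-separated tokens (assumption).
    word_counts = [len(s.split()) for s in sentences]
    return sum(word_counts) / len(word_counts)
```

In a pixel-based view, one such value per text block (for example per paragraph or page) could be mapped to the color of one pixel or cell.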
2.1 Readability Features [Oelke, Spretke et al., 2010]
2.1 Readability Features: Vocabulary Difficulty of 2009 German Election Programs
Feature: Vocabulary Difficulty
Figure panels: Die Linke, Piraten [Oelke, Spretke et al., 2010]
2.2 Attribute-based: Story, Character Complexity
Figure panels: King's IT, Rowling's Harry Potter [Wanner, Fuchs et al., 2011]
2.2 Attribute-based: Visual Review Analysis
• User opinions abundantly available
– Forums, Blogs
– E-commerce
– …
• Many application possibilities
– Product reviews for customers
– Market analysis
– Customer relationship management
Figure: Amazon customer reviews (amazon.com)
2.2 Attribute-based: Visual Review Analysis
• Basic method
– Identify product attributes
– Identify positive/negative opinions
– Calculate weighted attribute vector
• Visual comparison of sets of reviews
– Glyph matrix approach
– Cluster analysis
• Applied to printer product reviews
Figure (example attribute vector): attributes cartridge, paper, price, printer, scanner, software, tray; values 0, -1, 0, +1, 0, +1
[Oelke, Hao et al., 2009]
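
The three steps of the basic method could be sketched roughly as follows. This is a toy illustration, not the method of [Oelke, Hao et al., 2009]: the opinion lexicon `POLARITY`, the sentence-level co-occurrence rule, and the clamping to -1/0/+1 are all assumptions, since the slide does not specify the weighting scheme. Only the attribute names are taken from the slide.

```python
import re

# Product attributes as listed on the slide.
ATTRIBUTES = ["cartridge", "paper", "price", "printer",
              "scanner", "software", "tray"]

# Toy opinion lexicon (assumption, for illustration only).
POLARITY = {"great": 1, "good": 1, "fast": 1,
            "bad": -1, "expensive": -1, "slow": -1}


def attribute_vector(review):
    """Return one score in {-1, 0, +1} per attribute in ATTRIBUTES."""
    scores = {attribute: 0 for attribute in ATTRIBUTES}
    for sentence in re.split(r"[.!?]+", review.lower()):
        words = re.findall(r"[a-z]+", sentence)
        # Sum the polarity of all opinion words in the sentence.
        polarity = sum(POLARITY.get(word, 0) for word in words)
        # Credit that polarity to every attribute the sentence mentions.
        for attribute in ATTRIBUTES:
            if attribute in words:
                scores[attribute] += polarity
    # Clamp to -1/0/+1 (assumption; the slide's weighting is not given).
    return [max(-1, min(1, scores[a])) for a in ATTRIBUTES]
```

Vectors computed this way for many reviews could then be drawn as rows of a glyph matrix or passed to a clustering step.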
2.2 Attribute-based: Visual Review Analysis [Oelke, Hao et al., 2009]
2.2 Attribute-based: Customer Segmentation [Oelke, Hao et al., 2009]
2.3 Visual Content Overviewing
• Visual abstract for scientific articles
– Extraction of important figures and keywords
– Layout of elements in generalized word cloud
• Overviewing
• Navigation
• Comparison
[Strobelt, Oelke et al., 2009]
2.3 Visual Content Overviewing (figures) [Strobelt, Oelke et al., 2009]
2.4 Georeferenced Microblogging Text
• Microblogging Text (e.g., Twitter)
– Short text messages
– Time stamp
– GPS position
• Potential analytic use
– Trend analysis
– Marketing, Reputation monitoring
– Situational awareness for civil defense or crisis management
Figure (example messages on a map, [www.google.com]): "Nice view, all fine …"; "Stuck in a jam after traffic accident …"
2.4 SensePlace2 Tool [MacEachren, Jaiswal et al., 2011]
2.4 VAST Micro Blogging Challenge
• VAST Challenge 2011 [http://hcil.cs.umd.edu/localphp/hcil/vast11/]
– Fictitious city including street network and POIs
– 1 million microblogging messages over 20 days, incl. spatial position
– Fictitious hidden epidemic scenario
• Task
– Find possible epidemics and their characteristics
2.4 VAST Micro Blogging Challenge [Bertini, Buchmüller et al., 2011]
2.4 Concentration on Bridges
2.4 Concentration in Hospitals
2.4 Message Distribution (19.05.) – Filtered for Symptom Keywords
2.4 VAST Micro Blogging Challenge
Remainder of this Talk
1. Introduction
2. Overview Visualization for Large Text
  2.1 Feature-based Text Visualization
  2.2 Attribute-based Text Visualization
  2.3 Visual Document Summarization
  2.4 Geo-referenced Micro Blogging Text
3. Visual Search in Non-Textual Data
  3.1 Sketch-based 3D Object Retrieval
  3.2 Retrieval in Bivariate Measurement Data
4. Promising Research Opportunities
5. Conclusions
3. Visual Search in Non-Textual Data
Multitude of complex document types
– Images
– Video
– 3D Objects
– Multivariate Research Data
– Etc.
Research questions to address
– Similarity functions?
– Query types to support?
– How to evaluate?
Figures: PROBADO3D Archive (http://www.probado.de/3d.html); Victoria State Library Image Collection (http://www.slv.vic.gov.au/); Sloan Digital Sky Survey (http://www.sdss.org/)
3.1 Query-by-Example and Sketch-Based Retrieval
Problems:
1. How to compare structurally different views?
2. How to evaluate different sketching styles?
3.1 Gradient Features, Suggestive Contours [DeCarlo et al., 2003] [Yoon et al., 2010]
3.1 Sketch-Based 3D Object Retrieval
14-class subset of Princeton Shape Benchmark [Shilane et al 2004]
Collection of 20 user sketches per class
Evaluation of retrieval performance (per class, given user sketch) [Yoon et al., 2010]
3.1 SHREC'12 Track: Sketch-Based 3D Retrieval [SHREC 2012 Sketch-based 3D Retrieval Track]
3.1 Large-Scale Sketch Benchmark
Crowd-sourced approach of [Eitz et al., 2012a]
• 20,000 sketches from 1300 users
• 250 representative object categories
• Basis for improved benchmarking study [Eitz et al., 2012b]
Recognition experiment [Eitz et al., 2012a]
• Avg. human accuracy: 73%
• Avg. automatic accuracy: 56%
3.2 Visual Search in Bivariate (Research) Data
• Jim Gray's Fourth Paradigm and emerging research data repositories [Hey, Tansley, Tolle 2009]
• Prominent type of quantitative data: bivariate and multivariate data [Pangaea]
• Common visual representation
– Scatter plot
– Scatter plot matrix
• Content-based support for visual search and analysis in this data?
3.2 Regressional Feature Vector for Comparing Scatter Plots
Perform regressions (linear, square, log, …)
Form feature vectors
• Goodness of fit scores
• Coefficient parameters
[Scherer, Bernard et al., 2011]
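
A minimal sketch of the idea on this slide, under stated assumptions: each model is fitted as y ≈ a + b·g(x) with g(x) = x, x², or log x, and the feature vector concatenates R² (as the goodness-of-fit score) with the coefficients a and b per model. The exact model forms, the choice of R², and the ordering of features are assumptions; the slide only names the regression types and the two kinds of features, and the published descriptor of [Scherer, Bernard et al., 2011] may differ.

```python
import math


def _linear_fit(u, y):
    """Least-squares fit y ~ a + b*u; returns (a, b, r_squared)."""
    n = len(u)
    mean_u = sum(u) / n
    mean_y = sum(y) / n
    s_uu = sum((ui - mean_u) ** 2 for ui in u)
    s_uy = sum((ui - mean_u) * (yi - mean_y) for ui, yi in zip(u, y))
    b = s_uy / s_uu if s_uu else 0.0
    a = mean_y - b * mean_u
    ss_res = sum((yi - (a + b * ui)) ** 2 for ui, yi in zip(u, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r_squared = 1.0 - ss_res / ss_tot if ss_tot else 0.0
    return a, b, r_squared


# Transformations g(x) standing in for the regression types (assumption).
TRANSFORMS = [
    ("linear", lambda x: x),
    ("square", lambda x: x * x),
    ("log", lambda x: math.log(x)),  # requires x > 0
]


def regression_feature_vector(xs, ys):
    """Concatenate [R^2, a, b] for each regression model."""
    features = []
    for _name, transform in TRANSFORMS:
        u = [transform(x) for x in xs]
        a, b, r_squared = _linear_fit(u, ys)
        features.extend([r_squared, a, b])
    return features
```

Two scatter plots could then be compared by a distance between their vectors, for example Euclidean distance via `math.dist`; that choice is likewise an assumption.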
3.2 Search and Analysis Application [Scherer, Bernard et al., 2011]
Figure labels: query by example; cluster; altitude vs PPPP (pressure hPa); sort by similarity to f(x)=e^-x; spatial reference of data sets
3.2 A Benchmark for Earth Observation Data
• But how to create a benchmark data set for automatic evaluation?
• Input data
– BSRN earth observation data (radiation, temperature, etc.) for 40 stations
– 24,700 bivariate plots generated [Pangaea]
• Tobler's First Law of Geography for Similarity Class Formation
– Position x Month x Parameter
– 18x6 Longitude/Latitude grid
– Month of year
– Parameters of measurement (pressure, temp, alt, CO2, O3, …)
– 1608 similarity classes
• Evaluation of nine feature vectors
– Retrieval precision
– Timing
[Scherer, v. Landesberger et al., 2012]
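
One way to picture the class formation and the precision measure is the sketch below. It assumes the 18x6 grid covers the whole globe with equal-angle cells (20° of longitude by 30° of latitude), that a similarity class is the tuple (grid cell, month, parameter), and that retrieval precision is measured over the top k results. None of these details are stated on the slide, and all function names are hypothetical.

```python
def grid_cell(lon, lat, n_lon=18, n_lat=6):
    """Map lon in [-180, 180] and lat in [-90, 90] to a (col, row) cell."""
    # min(...) keeps the upper boundary (lon=180, lat=90) in the last cell.
    col = min(int((lon + 180.0) / 360.0 * n_lon), n_lon - 1)
    row = min(int((lat + 90.0) / 180.0 * n_lat), n_lat - 1)
    return col, row


def similarity_class(lon, lat, month, parameter):
    """Class key: Position x Month x Parameter (assumed tuple layout)."""
    return (grid_cell(lon, lat), month, parameter)


def precision_at_k(query_class, ranked_classes, k):
    """Fraction of the top-k retrieved plots sharing the query's class."""
    top = ranked_classes[:k]
    if not top:
        return 0.0
    return sum(1 for c in top if c == query_class) / len(top)
```

For example, two plots of the same parameter recorded in the same month at nearby stations fall into the same class, so a feature vector that ranks one near the other scores well.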