Papers

• Information Visualization and Visual Data Mining, Daniel A. Keim, IEEE Transactions on Visualization and Computer Graphics 8(1), 2002.
• DataJewel: Tightly Integrating Visualization with Temporal Data Mining, Mihael Ankerst, David H. Jones, Anne Kao, Changzhou Wang. ICDM Workshop on Visual Data Mining, Melbourne, FL, 2003 [Archived version]
• DEVise: Integrated Querying and Visual Exploration of Large Datasets, Miron Livny, Raghu Ramakrishnan, Kevin Beyer, Guangshun Chen, Donko Donjerkovic, Shilpa Lawande, Jussi Myllymaki, and Kent Wenger. Proc. SIGMOD 1997.

Intro: What is data mining?

• Data are generated in large amounts, e.g. transactions, telephone calls.
• Data are collected because they are believed to be a potential source of valuable information.
• Data mining is finding useful and interesting information in the data.
• Data can be "large" in two ways: the width and the height of the dataset.
• At the beginning, the computer analyzed the data and reported the result as text. Now we are moving towards "human-centred data mining," and visualization is one tool for doing so.

Visual data mining: include the human in the data exploration process

Combines
1) the flexibility, creativity and general knowledge of the human and
2) the enormous storage capacity and computational power of computers

Classification of Visual Data Mining Techniques

1) Data type to be visualized (6)
2) Visualization technique (5)
3) Interaction and distortion technique (5)

These 3 dimensions of classification can be assumed orthogonal.

1. Data type to be visualized (1/2)

1.1) 1-D data; the single dimension is usually very dense.
     E.g. temporal data, like time series of stock prices.
1.2) 2-D data.
     E.g. geographical maps
1.3) Multi-dimensional data
     E.g. tables from relational databases
     No simple mapping of attributes to the two dimensions of the screen

1. Data type to be visualized (2/2)

1.4) Text and hypertext, e.g. news articles
     Most of the standard visualization techniques cannot be applied. In most cases, a transformation of the data into description vectors is necessary first.
     E.g. word counting, then principal component analysis.
1.5) Hierarchies and graphs
     E.g. telephone calls
1.6) Algorithms and software
     E.g. for debugging operations
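The word-counting step mentioned under 1.4 can be sketched as follows. This is a minimal illustration, not the method of any of the papers: the function names `tokenize` and `description_vectors` and the two sample sentences are made up for this example, and the principal component analysis that would follow is left out.

```python
import re
from collections import Counter


def tokenize(text):
    """Split a text into lowercase alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def description_vectors(documents):
    """Turn each document into a vector of word counts.

    Returns the shared vocabulary (sorted) and one count vector per
    document, with one entry per vocabulary word.
    """
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    vectors = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, vectors


# Hypothetical sample documents, for illustration only.
docs = ["Stocks fall as prices rise", "Prices rise again"]
vocabulary, vectors = description_vectors(docs)
print(vocabulary)
for vector in vectors:
    print(vector)
```

Each document becomes one row of a multi-dimensional table, so the techniques for multi-dimensional data (1.3) apply once the dimensionality has been reduced.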
2. Visualization technique

2.1) standard 2D/3D displays
     e.g. bar charts and x-y plots.
2.2) geometrically transformed displays
     e.g. parallel coordinates.
2.3) icon-based displays (glyphs)
2.4) dense pixel displays
2.5) stacked displays
     Tailored to present data partitioned in a hierarchical fashion.
     Embed one coordinate system inside another coordinate system.
     Figure: by M. Ward, Worcester Polytechnic

3. Interaction and distortion technique (1/2)

• Dynamic: changes to visualizations are made automatically
• Interactive: changes are made manually

3.1) Dynamic projections
     E.g. to show all interesting two-dimensional projections of a multi-dimensional dataset as a series of scatter plots.
3.2) Interactive filtering
     browsing: direct selection of the desired subset
     querying: specify properties of the desired subsets

3. Interaction and distortion technique (2/2)

3.3) Interactive zooming
     On higher zoom levels, more details are shown.
3.4) Interactive distortion
     Show portions of the data with a high level of detail while others are shown with a lower level.
     E.g. spherical distortion and fisheye views.
3.5) Interactive Linking and Brushing
     – Combine different visualization methods to overcome the shortcomings of single techniques.
     – Changes to one visualization are automatically reflected in the other visualization.

Papers

• Information Visualization and Visual Data Mining, Daniel A. Keim, IEEE Transactions on Visualization and Computer Graphics 8(1), 2002.
• DataJewel: Tightly Integrating Visualization with Temporal Data Mining, Mihael Ankerst, David H. Jones, Anne Kao, Changzhou Wang. ICDM Workshop on Visual Data Mining, Melbourne, FL, 2003 [Archived version]
• DEVise: Integrated Querying and Visual Exploration of Large Datasets, Miron Livny, Raghu Ramakrishnan, Kevin Beyer, Guangshun Chen, Donko Donjerkovic, Shilpa Lawande, Jussi Myllymaki, and Kent Wenger. Proc. SIGMOD 1997.

Critiques

+ Good summary of visual data mining and InfoVis in general.
+ Nice all-around introductory material. Concise.
+ Great references. Supported his classifications with ample examples, and cites figures from other papers, e.g. "see Fig. 5 in [10]"
+ Good amount of pictures
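Linking and brushing (3.5 in the classification above) can be illustrated with a small sketch in which several views observe one shared selection. The class names `SharedSelection` and `View` and the sample records are assumptions made for this example; they do not come from the Keim paper or from any visualization library.

```python
class SharedSelection:
    """Holds the brushed record ids and notifies every linked view."""

    def __init__(self):
        self._selected = frozenset()
        self._listeners = []

    def subscribe(self, listener):
        self._listeners.append(listener)

    def brush(self, record_ids):
        """Replace the selection and propagate it to all linked views."""
        self._selected = frozenset(record_ids)
        for listener in self._listeners:
            listener(self._selected)


class View:
    """A stand-in for one visualization (scatter plot, bar chart, ...)."""

    def __init__(self, name, records, selection):
        self.name = name
        self.records = records      # maps record id -> value shown in this view
        self.highlighted = []       # record ids currently highlighted
        selection.subscribe(self.on_selection_changed)

    def on_selection_changed(self, selected_ids):
        # Highlight only the records this view actually displays.
        self.highlighted = sorted(i for i in self.records if i in selected_ids)


# Hypothetical data: two views over the same three records.
selection = SharedSelection()
scatter = View("scatter plot", {1: (0.2, 0.9), 2: (0.5, 0.1), 3: (0.7, 0.4)}, selection)
bars = View("bar chart", {1: 12, 2: 30, 3: 7}, selection)

selection.brush({1, 3})             # the user brushes records 1 and 3 in one view
print(scatter.highlighted, bars.highlighted)
```

Keeping the selection outside the views is one possible design; it lets a new view join the link by subscribing, without the existing views knowing about it.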
DataJewel

Main contribution:
• The DataJewel architecture tightly integrates a visualization component, an algorithmic component and a database component for temporal data mining.
• Bridges the field of InfoVis with other research communities, e.g. data mining.
• 2 aspects of temporal data mining: need to add new mining algorithms easily; need to link together tables that have no primary key.

User-centric Data Mining (1/3)

• The mining process is recursive
• At least one attribute contains a timestamp for each record. Call it the "event date".
• All other attributes are "event attributes"
• Attribute values are "events"

User-centric Data Mining (2/3)

Assumptions:
a) The number of event attributes is low (<10).
   Often, in one given analysis, the analyst selects a small number of event attributes which can be associated with each other in a particular domain.
b) The number of different events of one event attribute is moderate (<200).
   If this is not true, a concept hierarchy can be defined for the event attribute.
c) The smallest time unit of interest in the event dates is one day.

User-centric Data Mining (3/3)

Using the above assumptions, one instance of the visualization component and one of the algorithmic component are presented, and new ones can be easily integrated.

Visualization component: CalendarView
[A dense pixel display and a stacked display and Linking and Brushing]

• Multi-dimensional, with Event Date as the "key"
• Web-mining example: (figure omitted)
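The data model and assumptions (a) to (c) from the user-centric data mining slides can be written down as a small check. This is only a sketch: the function `check_assumptions`, the key name `event_date`, and the web-log style sample records are hypothetical, and only the thresholds (<10 and <200) are taken from the slides.

```python
from datetime import date, datetime

MAX_EVENT_ATTRIBUTES = 10   # assumption (a): fewer than 10 event attributes
MAX_DISTINCT_EVENTS = 200   # assumption (b): fewer than 200 events per attribute


def check_assumptions(records, date_key="event_date"):
    """Return a list of messages, one per violated assumption."""
    problems = []

    # Every key other than the event date is treated as an event attribute.
    attributes = sorted({key for record in records for key in record if key != date_key})
    if len(attributes) >= MAX_EVENT_ATTRIBUTES:
        problems.append(f"(a) {len(attributes)} event attributes")

    for attribute in attributes:
        distinct = {record[attribute] for record in records if attribute in record}
        if len(distinct) >= MAX_DISTINCT_EVENTS:
            problems.append(f"(b) {attribute}: {len(distinct)} distinct events")

    # Assumption (c): event dates are whole days, so datetimes are rejected.
    for index, record in enumerate(records):
        value = record.get(date_key)
        if not isinstance(value, date) or isinstance(value, datetime):
            problems.append(f"(c) record {index}: event date is not a whole day")

    return problems


# Hypothetical web-log records, for illustration only.
records = [
    {"event_date": date(2003, 1, 1), "page": "/home", "browser": "Mozilla"},
    {"event_date": datetime(2003, 1, 1, 12, 30), "page": "/faq"},
]
print(check_assumptions(records))
```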
Interaction with CalendarView

• Selection: the selected subset can be visualized in the CalendarView, following the iterative process
• Descending/Ascending order: good for finding "main" events and outlier events.
[Interactive filtering and interactive zooming]

Temporal Mining Component

• These algorithms assign colour to events to allow users to observe patterns easily.
• LongestStreak: discovers the one event of one event attribute with the longest consecutive streak of significant days. (What about the longest N streaks?)
• MatchingEvents extends LongestStreak: returns the LongestStreak event and the correlated event.
• MatchingEvents2: returns the LongestStreak of the first event attribute and, for each other event attribute, the event that is correlated.

Database Component (1/3)

• This component provides access to datasets in tables from relational database(s).
• The critical task is to scale up to large databases.
• Compute an aggregated version of the dataset such that it fits in main memory.

Database Component (2/3)

• Generate "sufficient statistics" for the event attribute page_hits
• Before / Query / After (figure omitted)

Database Component (3/3)

• mem_init = c * number of days * average number of events per day (the average is 402 in the aircraft maintenance domain for one airline)
• mem_new = c * number of days * average number of distinct events per day (the average is 32)
• The summary statistics always fit in main memory and the computation of the proposed algorithm is efficient. The authors believe this is true for most datasets which fulfill their assumptions, e.g. that the number of event attributes is low (<10).

Experiment with airplane maintenance datasets (1/2)

• Pentium III/800 MHz and 1 GB main memory
• Datasets span 12-14 years, and their sufficient statistics fit in main memory

1) LongestStreak finds a system of an airplane: "engine fuel". During the last five days of July 2000, we perceive many events, indicating problems with engine fuel.
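The aggregation into sufficient statistics and the LongestStreak idea can be sketched together as below. This is not the DataJewel implementation. The slides do not define a "significant day", so the sketch assumes a day is significant for an event when its count reaches a threshold `min_count`; the function names and the toy maintenance records are likewise invented for illustration and are not data from the paper.

```python
from collections import Counter
from datetime import date, timedelta


def sufficient_statistics(records, attribute, date_key="event_date"):
    """Aggregate raw records into one count per (day, event) pair."""
    return Counter((record[date_key], record[attribute]) for record in records)


def longest_streak(stats, min_count=1):
    """Return (event, first_day, length) of the longest run of consecutive
    significant days. Assumption: a day is significant for an event when
    its count is at least min_count. Ties keep the first run found."""
    days_by_event = {}
    for (day, event), count in stats.items():
        if count >= min_count:
            days_by_event.setdefault(event, []).append(day)

    best = (None, None, 0)
    for event, days in sorted(days_by_event.items()):
        previous = None
        run_start, run_length = None, 0
        for day in sorted(days):
            if previous is not None and day - previous == timedelta(days=1):
                run_length += 1                  # the run continues
            else:
                run_start, run_length = day, 1   # a new run starts
            if run_length > best[2]:
                best = (event, run_start, run_length)
            previous = day
    return best


# Toy records, invented for illustration only.
records = [{"event_date": date(2000, 7, d), "system": "engine fuel"} for d in range(27, 32)]
records.append({"event_date": date(2000, 7, 30), "system": "engine fuel"})
records += [{"event_date": date(2000, 7, d), "system": "landing gear"} for d in (27, 29)]

stats = sufficient_statistics(records, "system")
print(len(records), "raw records ->", len(stats), "aggregated entries")
print(longest_streak(stats))
```

The aggregated table has one entry per distinct event per day rather than one per raw event, which is the reduction the mem_init and mem_new formulas describe.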