
Data Mining: Concepts and Techniques
Chap 8. Data Streams, Time Series Data, and Sequential Patterns
Li Xiong
Slides credits: Jiawei Han and Micheline Kamber and others
March 27, 2008

Mining Stream, Time-Series, and Sequence Data
- Mining data streams
- Mining time-series data
- Mining sequence data

Mining Data Streams
- Stream data and stream data processing
- Basic methodologies for stream data processing and mining
- Stream frequent pattern analysis
- Stream classification
- Stream cluster analysis

Data Streams
- Data streams
  - A sequence of data in transmission
  - An ordered pair (s, ∆), where s is a sequence of tuples and ∆ is the sequence of time intervals
- Characteristics
  - Continuous
  - Huge volumes, possibly infinite
  - Fast changing, requiring fast, real-time response
  - Random access is expensive, so single-scan algorithms are needed
  - Low-level or multi-dimensional in nature

Stream Data Applications
- Telecommunication calling records
- Business: credit card transaction flows
- Network monitoring and traffic engineering
- Financial market: stock exchange
- Engineering & industrial processes: power supply & manufacturing
- Sensor, monitoring & surveillance: video streams, RFIDs
- Security monitoring
- Web logs and Web page click streams
- Massive data sets (even when the data is saved, random access is too expensive)

Architecture: Stream Query Processing and Mining
[Figure: architecture diagram. Labels: User/Application; SDMS (Stream Data Management System); Continuous Query; Results; Multiple streams; Stream Query Processor; Scratch Space (main memory and/or disk).]

DBMS versus DSMS
DBMS | DSMS
---- | ----
Persistent relations | Transient streams
One-time queries | Continuous queries
Random access | Sequential access
"Unbounded" disk store | Bounded main memory
Only current state matters | Historical data is important
No real-time services | Real-time requirements
Relatively low update rate | Possibly multi-GB arrival rate
Data at any granularity | Data at fine granularity
Assume precise data | Data stale/imprecise
Access plan determined by query processor, physical DB design | Unpredictable/variable data arrival and characteristics
Ack.: from Motwani's PODS tutorial slides

Mining Data Streams
- Stream data and stream data processing
- Foundations for stream data mining
- Stream frequent pattern analysis
- Stream classification
- Stream cluster analysis

Methodologies for Stream Data Processing
- Major challenges
  - Keep track of a large universe
- Methodology
  - Choosing a subset of data
    - Sampling
    - Sliding windows
    - Load shedding
  - Summarizing the data
    - Synopses (trade-off between accuracy and storage)

Random Sampling: Uniform Sampling
- Uniform sampling
  - Data stream of size N
  - Assume all samples are equally likely
- Example
  - A data stream of size 4 (also called the population)
  - Possible samples of size 2
Slides: R. Gemulla, W. Lehner, P. J. Haas
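
As a small illustration of the uniform sampling example, the sketch below enumerates every sample of size 2 from a population of size 4. The element names `a` to `d` are placeholders and do not come from the slides. Under uniform sampling, each enumerated sample is assigned the same probability.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical population of size N = 4 (names are placeholders).
population = ["a", "b", "c", "d"]
sample_size = 2

# All distinct samples of the requested size, ignoring order.
samples = list(combinations(population, sample_size))

# Uniform sampling: every sample is equally likely.
probability = Fraction(1, len(samples))

for sample in samples:
    print(sample, probability)
```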

Random Sampling: Reservoir Sampling
- Reservoir sampling
  - Single-scan algorithm
  - Computes a uniform sample of M elements without knowing N
- Idea
  - Maintain a reservoir, which forms a random sample of the elements seen so far in the stream
- Algorithm
  - Add the first M elements
  - Afterwards, at item i, flip a coin:
    a) ignore the element (reject)
    b) replace a random element in the sample (accept)
  - $P(t_i \text{ is accepted}) = \dfrac{\text{sample size}}{\text{current population size}} = \dfrac{M}{i}$
Slides: R. Gemulla, W. Lehner, P. J. Haas
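
The reservoir sampling steps can be sketched in a few lines of Python. This is a minimal sketch, not the slide authors' code: the function name `reservoir_sample` and the optional `rng` parameter are assumptions made for illustration.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], m: int,
                     rng: Optional[random.Random] = None) -> List[T]:
    """Return a uniform sample of up to m elements in a single scan."""
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(stream, start=1):
        if i <= m:
            # The first m elements fill the reservoir.
            reservoir.append(item)
        elif rng.random() < m / i:
            # Accept item i with probability m / i and
            # replace a uniformly chosen reservoir slot.
            reservoir[rng.randrange(m)] = item
        # Otherwise the item is rejected and the reservoir is unchanged.
    return reservoir


# The stream length is never passed in; only one pass is made.
print(reservoir_sample(range(1, 101), 5))
```

The reservoir holds at most M elements at any time, so memory use does not depend on the stream length.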

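Random Sampling: Reservoir Sampling (Example)
- Example
  - Data stream
  - Sample size M = 2
[Figure: tree of reservoir states. At the third item each branch has probability 1/3; at the fourth item the reject branch has probability 2/4 and each of the two replacement branches has probability 1/4.]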

Sliding Windows
[Figure: a window sliding over the bit stream 0 1 1 0 0 0 0 1 1 1 0 0 0 0 0 0 1 0 1 0 1 0 0 1 1 0 0 0 0 1 1 1]
- Sliding windows
  - Make decisions based only on recent data, within a sliding window of size w
  - An element arriving at time t expires at time t + w
- Why?
  - Approximation technique for bounded memory
  - Natural in applications (emphasizes recent data)
  - Well-specified and deterministic semantics
PODS 2002
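
The expiry rule (an element arriving at time t expires at time t + w) can be sketched as a counter of 1-bits in the current window. This is an exact version that stores the whole window, so it uses O(w) memory; the class name `SlidingWindowCounter` and the choice of one arrival per time unit are assumptions made for illustration.

```python
from collections import deque
from typing import Deque, List, Tuple


class SlidingWindowCounter:
    """Counts 1-bits among the elements that have not yet expired."""

    def __init__(self, w: int) -> None:
        self.w = w
        self.window: Deque[Tuple[int, int]] = deque()  # (arrival time, bit)
        self.ones = 0

    def add(self, t: int, bit: int) -> int:
        # An element that arrived at time a expires at time a + w.
        while self.window and self.window[0][0] + self.w <= t:
            _, old_bit = self.window.popleft()
            self.ones -= old_bit
        self.window.append((t, bit))
        self.ones += bit
        return self.ones


# First eight bits of the stream on the slide, one arrival per time unit.
bits: List[int] = [0, 1, 1, 0, 0, 0, 0, 1]
counter = SlidingWindowCounter(w=4)
counts = [counter.add(t, b) for t, b in enumerate(bits)]
print(counts)  # [0, 1, 2, 2, 2, 1, 0, 1]
```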

Load Shedding
- Load shedding
  - Discards some data so that the system can keep up with the input
- Techniques
  - Filters (semantic drop): choose what to shed based on QoS and selectivity
  - Drops (random drop): eliminate a random fraction of the input
- Hospital example
  - Load shedding based on condition
[Figure: two query plans. Without shedding: Patients and Doctors feed a Join that outputs the doctors who can work on a patient. With shedding: a Filter on Condition is applied to Patients before the Join with Doctors.]
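
The two shedding techniques can be sketched as generator functions placed in front of a downstream operator. The function names, the record layout, and the `"critical"` condition value are hypothetical; the slides only name the techniques and the hospital setting.

```python
import random
from typing import Callable, Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")


def random_drop(stream: Iterable[T], drop_fraction: float,
                rng: Optional[random.Random] = None) -> Iterator[T]:
    """Random drop: discard roughly drop_fraction of the input."""
    rng = rng or random.Random()
    for item in stream:
        if rng.random() >= drop_fraction:
            yield item


def semantic_drop(stream: Iterable[T],
                  keep: Callable[[T], bool]) -> Iterator[T]:
    """Semantic drop: keep only the items selected by a predicate."""
    for item in stream:
        if keep(item):
            yield item


# Hypothetical patient records; only critical patients reach the join.
patients = [
    {"id": 1, "condition": "critical"},
    {"id": 2, "condition": "stable"},
    {"id": 3, "condition": "critical"},
]
kept = list(semantic_drop(patients, lambda p: p["condition"] == "critical"))
print([p["id"] for p in kept])  # [1, 3]
```

A semantic filter lets the system decide which tuples matter, while a random drop only reduces volume.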

Synopsis
[Figure: a bit stream summarized into synopses/data structures]
- Synopsis
  - Summaries for data
  - Can be used to return approximate answers
  - Trade-off between space and accuracy
- Techniques
  - Histograms
  - Wavelets
  - Sketching
  - May require multiple passes

Mining Data Streams
- Stream data and stream data processing
- Foundations for stream data mining
- Stream frequent pattern analysis
- Stream classification
- Stream cluster analysis
- Research issues

Frequent Pattern Mining for Data Streams
- Issues
  - Multiple scans for training are not feasible
  - Memory/space management
  - Concept drift
- Methods
  - Approximate frequent patterns (Manku & Motwani, VLDB'02)
  - Mining evolution of frequent patterns (C. Giannella, J. Han, X. Yan, P.S. Yu, 2003)
  - Space-saving computation of frequent and top-k elements (Metwally, Agrawal, and El Abbadi, ICDT'05)

Mining Approximate Frequent Patterns
- Lossy Counting Algorithm (Manku & Motwani, VLDB'02)
- Motivation
  - Mining precise frequent patterns in stream data is unrealistic
  - Approximate answers are often sufficient (e.g., trend/pattern analysis)
  - Example: a router is interested in all flows whose frequency is at least 1% (σ) of the entire traffic stream seen so far; an error of 1/10 of σ (ε = 0.1%) is acceptable
- Major idea: approximation by tracing only "frequent" items
  - Advantage: guaranteed error bound
  - Disadvantage: must keep a large set of traces

Lossy Counting for Frequent Items
[Figure: the stream divided into Bucket 1, Bucket 2, Bucket 3]
- Input variables
  - σ: min_support
  - ε: error bound
- Fixed variables
  - w = 1/ε: window size (the width of each bucket)
- Running variables
  - N: current stream length
  - b_current = ⌈εN⌉: the current bucket
  - f_e: the real frequency count of element e
  - Set of (e, f, ∆): (element, approximate frequency, max error)

Lossy Counting for Frequent Items
[Figure: the stream divided into Bucket 1, Bucket 2, Bucket 3]
- For each new element e
  - If an entry for e exists, increment its frequency f by 1
  - Otherwise, create a new entry (e, 1, b_current - 1)
- At bucket boundaries
  - Delete entries with f + ∆ ≤ b_current
  - (In the simplified variant that keeps no ∆, the frequency of every entry is decremented by 1 instead, and entries that reach 0 are dropped; the two rules are alternatives and are not applied together)
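
The update rules can be sketched as a small Python class. This is a minimal sketch of the (e, f, ∆) form of lossy counting with the deletion rule f + ∆ ≤ b_current; the class name `LossyCounter`, its method names, and the use of `>=` in the output test are assumptions made for illustration.

```python
import math
from typing import Dict, Hashable, List, Tuple


class LossyCounter:
    """Approximate frequency counts with error at most epsilon * N."""

    def __init__(self, epsilon: float) -> None:
        self.epsilon = epsilon
        self.width = math.ceil(1 / epsilon)  # bucket width w = 1/epsilon
        self.n = 0                           # current stream length N
        self.entries: Dict[Hashable, List[int]] = {}  # e -> [f, delta]

    def add(self, e: Hashable) -> None:
        self.n += 1
        b_current = math.ceil(self.n / self.width)
        if e in self.entries:
            self.entries[e][0] += 1
        else:
            # A new entry may have been missed in earlier buckets,
            # so its maximum error is b_current - 1.
            self.entries[e] = [1, b_current - 1]
        if self.n % self.width == 0:
            # Bucket boundary: prune entries with f + delta <= b_current.
            self.entries = {k: v for k, v in self.entries.items()
                            if v[0] + v[1] > b_current}

    def frequent(self, sigma: float) -> List[Tuple[Hashable, int]]:
        """Entries whose stored count reaches (sigma - epsilon) * N."""
        threshold = (sigma - self.epsilon) * self.n
        return [(e, f) for e, (f, _) in self.entries.items()
                if f >= threshold]


# Two buckets of width 10; only "a" survives the pruning.
counter = LossyCounter(epsilon=0.1)
for symbol in "aabacadaea" + "afagahaiaj":
    counter.add(symbol)
print(counter.frequent(sigma=0.3))  # [('a', 11)]
```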

Illustration
[Figure: starting from an empty summary at b_current = 1, the elements of each incoming bucket are added to the summary of (e, f, ∆) entries.]

Approximation Guarantee
- Output: items with frequency counts exceeding (σ - ε)N
- Error analysis: how much do we undercount?
  - If the stream length seen so far is N and the bucket size is 1/ε, then $\text{frequency count error} \le \#\text{buckets} = \varepsilon N$
- Approximation guarantee
  - No false negatives
  - False positives have true frequency count at least (σ - ε)N
  - Frequency count underestimated by at most εN
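
As a worked instance of these bounds, the block below plugs in σ = 1% and ε = 0.1% from the router example. The stream length N = 10^6 is an assumed value chosen only to make the arithmetic concrete. The last line spells out why there are no false negatives: an item with true count f_e ≥ σN has a stored count f that is at most εN lower, so it still reaches the output threshold.

```latex
\begin{align*}
w &= \frac{1}{\varepsilon} = \frac{1}{0.001} = 1000
  && \text{(bucket size)} \\
\#\text{buckets} &= \frac{N}{w} = \varepsilon N = 0.001 \times 10^{6} = 1000
  && \text{(maximum undercount)} \\
(\sigma - \varepsilon) N &= (0.01 - 0.001) \times 10^{6} = 9000
  && \text{(output threshold)} \\
f_e \ge \sigma N = 10000 \;&\Longrightarrow\;
  f \ge f_e - \varepsilon N \ge (\sigma - \varepsilon) N = 9000
  && \text{(no false negatives)}
\end{align*}
```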

Lossy Counting for Frequent Itemsets
- Divide the stream into "buckets" as for frequent items
[Figure: the stream divided into Bucket 1, Bucket 2, Bucket 3]
- Set of (set, f, ∆): (itemset, approximate frequency, max error)

Update of Summary Data Structure
[Figure: processing 3 buckets of data in memory; their itemset counts are added to the existing summary data to produce the updated summary data.]

Summary of Lossy Counting
- Strengths
  - A simple idea
  - Can be extended to frequent itemsets
- Weaknesses
  - The space bound is not good
  - For frequent itemsets, each record is scanned many times
  - The output is based on all previous data, but sometimes only recent data is of interest

Mining Evolution of Frequent Patterns for Stream Data
- Mining evolution and dramatic changes of frequent patterns (Giannella, Han, Yan, Yu, 2003)
  - Use a tilted time window frame
  - Use a compressed form to store significant (approximate) frequent patterns and their time-dependent traces
