Mining Approximate Top-K Subspace Anomalies in Multi-Dimensional Time-Series Data

Xiaolei Li, Jiawei Han (University of Illinois at Urbana-Champaign), VLDB 2007. Presentation slides.


  1. Mining Approximate Top-K Subspace Anomalies in Multi-Dimensional Time-Series Data
     Xiaolei Li, Jiawei Han
     University of Illinois at Urbana-Champaign
     VLDB 2007

  2. Time Series Data
     • Many applications produce time series data
     [Figure: Intel stock]

  5. Apple, Intel, NASDAQ Computers Stock Values

  7. Apple, Intel, NASDAQ Computers Stock Values
     • Compare time series to gather differences
       ‣ Apple stock has a very different “trend”
       ‣ Intel stock had different magnitude

  10. Problem Statement
      • Find anomalies in a data cube of multi-dimensional time series data

  11. Table of Contents
      1. Time Series Examples
      2. Problem Statement
      3. Related Work
      4. Observed/Expected Time Series and Anomaly Measure
      5. Subspace Iterative Search
         i. Generating candidate subspaces
         ii. Discovering top-k anomaly cells
      6. Experiments
      7. Conclusion

  12. Multi-Dimensional Attributes
      • Time series are not flat data; they carry multi-dimensional attributes
      • Stock example
        ‣ Apple and Intel are part of the NASDAQ Computers Index
        ‣ Apple is hardware/software; Intel is hardware
        ‣ Related to NASDAQ-100 Technology Stock Index
      • Sales example
        ‣ Multi-dimensional information collected for every sale (e.g., buyer age, product type, store location, purchase time)
        ‣ Compare sales by any combination of categories or sub-categories: “sales of sporting apparel to males with 3+ children have been declining compared to overall male sporting apparel sales” (the first group is a subset of the second)

  14. Problem Statement
      • Find anomalies in the data cube of multi-dimensional time series data
      • Input data: relation R with a set of time series S associated with each tuple
        ‣ Attributes of R form a data cube C_R
        ‣ Each s_i is a time series
        ‣ Each u_i is a scalar indicating the count of the tuple

      Gender  Education   Income   Product  Profit  Count
      Female  Highschool  35k-45k  Food     s_1     u_1
      Female  Highschool  45k-60k  Apparel  s_2     u_2
      Female  College     35k-45k  Apparel  s_3     u_3
      Female  College     35k-45k  Book     s_4     u_4
      Female  College     45k-60k  Apparel  s_5     u_5
      Female  Graduate    45k-60k  Apparel  s_6     u_6
      Male    Highschool  35k-45k  Apparel  s_7     u_7
      Male    College     35k-45k  Food     s_8     u_8
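      The input relation can be sketched in a few lines of Python. This is only an illustration: the `Row` class, the `aggregate` helper, the three sample rows, and their numbers are all invented here, and summing series and counts is an assumed aggregation (natural for a profit measure) that the slide does not state.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Row:
    """One tuple of R: attribute values, a time series s_i, and a count u_i."""
    attrs: Tuple[str, ...]      # (Gender, Education, Income, Product)
    series: Tuple[float, ...]   # s_i, e.g. profit per time step
    count: int                  # u_i


# Hypothetical values, chosen only to make the arithmetic easy to follow.
ROWS = [
    Row(("Female", "Highschool", "35k-45k", "Food"), (1.0, 2.0, 3.0), 10),
    Row(("Female", "College", "35k-45k", "Apparel"), (2.0, 2.0, 2.0), 5),
    Row(("Male", "College", "35k-45k", "Food"), (0.5, 0.5, 1.0), 20),
]


def aggregate(rows: Sequence[Row], pattern: Tuple[str, ...]
              ) -> Optional[Tuple[Tuple[float, ...], int]]:
    """Sum the series and counts of all rows matching `pattern`.

    A "*" in `pattern` matches any value. Summation is an assumption of
    this sketch, not something the slides specify.
    """
    matched = [
        r for r in rows
        if all(p == "*" or p == a for p, a in zip(pattern, r.attrs))
    ]
    if not matched:
        return None
    length = len(matched[0].series)
    total = tuple(sum(r.series[t] for r in matched) for t in range(length))
    return total, sum(r.count for r in matched)
```

      Under these assumptions, `aggregate(ROWS, ("Female", "*", "*", "*"))` combines the two Female rows into a single series and count.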

  18. Data Cube Preliminaries
      • Given a relation R, a data cube (denoted as C_R) is the set of aggregates from all possible group-by’s on R
      • In an n-dimensional data cube, each cell has the form c = (a_1, a_2, ..., a_n : m) where each a_i is the value of the i-th attribute and m is the cube measure (e.g., profit)
      • A cell is k-dimensional if there are exactly k (≤ n) values amongst a_i which are not ∗ (i.e., all)
        ‣ 2-dimensional cell: (Female, ∗, ∗, Book: x)
        ‣ 3-dimensional cell: (∗, College, 35k-45k, Apparel: y)
        ‣ Base cell: none of a_i is ∗
      • Parent, descendant, sibling relationships
      [Figure: lattice of group-by’s over attributes A, B, C (ABC; AB, AC, BC; A, B, C; All), with parent and child marked]
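      The cell definitions above translate directly into code. The sketch below represents a cell as a tuple of attribute values with the measure left out; the function names are hypothetical, and since the slide names the parent/descendant relationships without defining them, the usual data-cube definitions are assumed (a parent generalizes exactly one non-∗ value to ∗).

```python
from typing import List, Tuple

ALL = "*"  # stands for the ∗ ("all") value of an attribute
Cell = Tuple[str, ...]


def dimensionality(cell: Cell) -> int:
    """Number of attribute values that are not ∗."""
    return sum(1 for a in cell if a != ALL)


def is_base_cell(cell: Cell) -> bool:
    """A base cell has no ∗ values."""
    return ALL not in cell


def is_ancestor(anc: Cell, desc: Cell) -> bool:
    """Assumed definition: `anc` differs from `desc`, and every non-∗
    value of `anc` agrees with `desc`."""
    return anc != desc and all(x == ALL or x == y for x, y in zip(anc, desc))


def parents(cell: Cell) -> List[Cell]:
    """Assumed definition: cells obtained by replacing one non-∗ value with ∗."""
    return [
        cell[:i] + (ALL,) + cell[i + 1:]
        for i, a in enumerate(cell)
        if a != ALL
    ]
```

      For the slide's 2-dimensional example, `dimensionality(("Female", "*", "*", "Book"))` counts the two non-∗ values, and `parents` returns the two 1-dimensional cells that generalize it.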

  22. Query Model
      • Given R, a probe cell p ∈ C_R, and an anomaly function g, find the anomaly cells among the descendants of p in C_R as measured by g
        ‣ Each abnormal cell must satisfy a minimum support (count) threshold
        ‣ Anomaly does not have to hold for entire time series
        ‣ Only the top k anomalies as ranked by g are needed
      [Figure: the cube C_R, with the probe cell p above its descendants down to the base cells]
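      To make the query semantics concrete, here is a brute-force sketch: filter candidate cells by minimum support, score each with g, and keep the k highest. It is not the subspace iterative search the talk proposes; the function name, the candidate layout, and the form of `g` (a callable that scores one cell's series) are assumptions made for illustration.

```python
import heapq
from typing import Callable, Iterable, List, Sequence, Tuple

Cell = Tuple[str, ...]
Candidate = Tuple[Cell, Sequence[float], int]  # (cell, time series, count)


def top_k_anomalies(
    candidates: Iterable[Candidate],
    g: Callable[[Sequence[float]], float],
    k: int,
    min_support: int,
) -> List[Tuple[float, Cell]]:
    """Return the k highest-scoring cells whose count meets `min_support`.

    Every candidate is scored, so the cost grows with the number of
    descendant cells; this is the naive baseline, not an efficient search.
    """
    eligible = [
        (g(series), cell)
        for cell, series, count in candidates
        if count >= min_support
    ]
    return heapq.nlargest(k, eligible)
```

      With a toy score such as `g = max`, a cell whose count falls below `min_support` is dropped before ranking, however extreme its series is.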

  24. Related Work
      • Exploratory Data Analysis
        ‣ [Sarawagi SIGMOD’00] explores OLAP anomaly but necessitates full cube materialization
        ‣ [Palpanas SSDBM’01] approximately finds interesting cells in data cube but still requires exponential calculations
        ‣ [Imielinski DMKD’02] requires anti-monotonic measure and does not focus on time series
      • Time Series Data Cube [Chen VLDB’02]
        ‣ Only suitable for low-dimensional data
        ‣ Requires user guidance
      • General outlier detection, subspace clustering, and time series similarity search do not address OLAP-style data

  26. Measuring Anomaly: Intuition
      1. For every cell, compute the expected time series (with respect to the probe cell)
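      The transcript stops before saying how the expected time series is computed, so the following is only one plausible reading: scale the probe cell's series by the ratio of the cell's count to the probe's count, then compare it with the observed series. The scaling rule and both function names are assumptions of this sketch, not taken from the slides.

```python
from typing import List, Sequence


def expected_series(probe_series: Sequence[float],
                    probe_count: int,
                    cell_count: int) -> List[float]:
    """ASSUMED rule: the cell is expected to contribute a share of the
    probe's series proportional to its share of the probe's count."""
    ratio = cell_count / probe_count
    return [ratio * v for v in probe_series]


def residual(observed: Sequence[float],
             expected: Sequence[float]) -> List[float]:
    """Pointwise difference between observed and expected series."""
    return [o - e for o, e in zip(observed, expected)]
```

      Under this assumption, a cell holding a quarter of the probe's tuples is expected to show a quarter of the probe's values at every time step; an anomaly measure would then be built on the residual.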
