Anomalies in Data
Maximilian Toller, Know-Center
KDDM2, www.tugraz.at
Anomalies in Data
Recall from earlier
What are Outliers?
A recap from KDDM1
What are Outliers?
Definitions
- "An observation that appears to deviate markedly from other members of the sample in which it occurs." (Grubbs, 1969)
- "An observation (or subset of observations) which appears to be inconsistent with the remainder of that set of data." (Barnett and Lewis, 1974)
- "An observation, which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism." (Hawkins, 1980)
What are Outliers?
Examples (easy)
[Figure: scatter plot of Y against X with three labelled groups: inliers; outliers according to Grubbs and Barnett; outliers according to Grubbs, Barnett and Hawkins.]
What are Outliers?
Examples (more difficult)
[Figures: four scatter plots of y against x, shown in two pairs.]
What are Outliers?
Methods: Preview
There are many outlier detection methods:
- Local outlier factor
- Angle-based outlier degree
- Artificial neural networks
- ...
Why are there so many?
What are Anomalies?
What are Anomalies?
Difference from Outliers
- In the literature, outlier and anomaly are used interchangeably
- For both, only vague and very similar definitions exist
- However, the terms have different origins and different typical uses:
Outliers typically...
- ...are motivated by statistics.
- ...are unusual data.
- ...are investigated by traditional researchers and statisticians.
Anomalies typically...
- ...require context.
- ...are abnormal events.
- ...are investigated by data analysts and data scientists.
What are Anomalies?
Example: Credit card fraud
- Billions of dollars lost every year
- Fraudulent transactions often differ significantly from normal ones
- Difficult to disguise fraud such that it is not visible on any scale
What are Anomalies?
Example: Cancer
- One of the most common causes of human death
- Disease with abnormal cell growth
- Cancer has abnormal gene expression signature (Quinn et al., 2019)
What are Anomalies?
The role of context
Abnormality is context-dependent
Discordant data problem (credit card fraud example)
- Many normal observations
- Rare outlying data
Anomaly class problem (cancer example)
- Normal data class
- Anomaly classes
Can data define abnormality?
Unlikely, Discordant and Contaminated Data
How to interpret suspicious data
Unlikely, Discordant and Contaminated Data
The Case of Hadlum vs Hadlum (Barnett and Lewis, 1974)
- Mr Hadlum accuses Mrs Hadlum of adultery
- Sole evidence: Birth of child 349 days after Mr Hadlum left the country
- Average human gestation period: 280 days
Unlikely, Discordant and Contaminated Data
The Case of Hadlum vs Hadlum (Zimek and Filzmoser, 2018)
- Mr Hadlum conjectured different distribution (red)
- Judges did not find Mrs Hadlum guilty, since 349 days unlikely, but not impossible (blue)
- (Modern research showed that more than 340 days is impossible)
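The "unlikely, but not impossible" reading can be illustrated with a small tail-probability calculation. This is only a sketch: it assumes gestation length follows a normal distribution with the 280-day mean from the slide, and the standard deviation of 13 days is a hypothetical value chosen for illustration, not a figure from the source. The function name `upper_tail` is likewise made up for this example.

```python
import math

MEAN_DAYS = 280.0      # average gestation period, taken from the slide
SD_DAYS = 13.0         # hypothetical standard deviation, NOT from the source
OBSERVED_DAYS = 349.0  # gestation length implied in Hadlum vs Hadlum


def upper_tail(value, mean, sd):
    """Return P(X >= value) for X ~ Normal(mean, sd**2)."""
    z = (value - mean) / sd
    # erfc keeps precision for small tail probabilities
    return 0.5 * math.erfc(z / math.sqrt(2.0))


p_tail = upper_tail(OBSERVED_DAYS, MEAN_DAYS, SD_DAYS)
print(f"P(gestation >= {OBSERVED_DAYS:.0f} days) = {p_tail:.2e}")
```

Under this assumed model the probability is very small but strictly positive, which matches the judges' position. Whether such a value counts as unlikely or discordant still depends on the model one is willing to accept.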
Unlikely, Discordant and Contaminated Data
The Antarctic Ozone Hole
- Ozone layer protects Earth from solar radiation
- Damaged by human emissions of chlorofluorocarbons
- High depletion (hole) above poles
Image source: https://de.wikipedia.org/wiki/Datei:Ozone_layer.jpg
Unlikely, Discordant and Contaminated Data
The (Ant)Arctic Ozone Hole
- Farman et al. (1985) discover hole in field study
- Authors hesitant to publish
- Nimbus satellite data showed no drop
- Problem: Largely deviating values discarded as measurement errors
Image credit: NASA/JPL-Caltech
Unlikely, Discordant and Contaminated Data
Definition
Unlikely data
- Hadlum case: Position of judges
- Ozone case: "Random drop of ozone not caused by humans"
- Data unlikely but still normal
- No correction
- Action: none
Discordant data
- Hadlum case: Position of Mr Hadlum
- Ozone case: Ozone field study by Farman et al. (1985)
- Data too unlikely to be normal
- Correction of model
- Action: investigate
Contamination
- Hadlum case: "Wrong day of birth?"
- Ozone case: Satellite measurement error
- Data incorrect or misleading
- Correction of data
- Action: remove
Unlikely, Discordant and Contaminated Data
Implications
- It is hard to classify data as unlikely, discordant or contaminated
- No universal decision criterion
- Domain knowledge as remedy
- Ultimately subjective
Unlikely, Discordant and Contaminated Data
Strategies
1. Try to ignore anomalies (Not interesting)
2. Find anomalies for investigation or removal (Interesting)
Robust Statistics
Data Analysis in Presence of Anomalies
Robust Statistics
Introduction I
Setting
- Potentially contaminated dataset
- Majority uncontaminated
- Cannot find or remove contamination, e.g. inserted by attacker
Task: Analyze data in spite of contamination, understand what is normal
Robust Statistics
Introduction II
Challenges
- No prior information about data
- Contamination may be arbitrarily "bad" (adversarial)
Question: Which methods are suitable?
Robust Statistics
Example: Mean and variance
Two common estimators
- Sample mean: $\bar{x} = \frac{1}{n} \sum_{j=1}^{n} x_j$
- Sample variance: $\hat{\sigma}_x^2 = \frac{1}{n-1} \sum_{j=1}^{n} (x_j - \bar{x})^2$
Mean and variance are influenced by contamination
- Original: $x = [1, 3, 2, 1, 9, 2, 3, 2, 3, 2, 2, 1]$, $\bar{x} \approx 2.58$, $\hat{\sigma}_x^2 \approx 4.63$
- Clean: $y = [1, 3, 2, 1, 2, 3, 2, 3, 2, 2, 1]$, $\bar{y} = 2$, $\hat{\sigma}_y^2 = 0.6$
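The two estimators and the numbers on this slide can be checked with a few lines of Python. This is a minimal sketch using the standard library's `statistics` module, whose `variance` function uses the same $n - 1$ denominator as the sample variance above; the variable names are chosen here for illustration.

```python
from statistics import mean, variance

# Data from the slide: x contains the suspicious value 9, y does not.
x = [1, 3, 2, 1, 9, 2, 3, 2, 3, 2, 2, 1]
y = [1, 3, 2, 1, 2, 3, 2, 3, 2, 2, 1]

mean_x, var_x = mean(x), variance(x)  # sample mean, sample variance (n - 1)
mean_y, var_y = mean(y), variance(y)

print(f"original: mean = {mean_x:.2f}, variance = {var_x:.2f}")  # 2.58, 4.63
print(f"clean:    mean = {mean_y:.2f}, variance = {var_y:.2f}")  # 2.00, 0.60
```

A single value of 9 among twelve observations raises the mean by roughly 0.58 and makes the variance more than seven times larger.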
Robust Statistics
Example: Mean and variance
What happens when attacker corrupts data unfavorably?
- Attack #1: $a_1 = [1, 3, 2, 1, 900, 2, 3, 2, 3, 2, 2, 1]$, $\bar{a}_1 \approx 76.83$, $\hat{\sigma}_{a_1}^2 \approx 67200.88$
- Attack #2: $a_2 = [1, 3, 2, 1, 900000000, 2, 3, 2, 3, 2, 2, 1]$, $\bar{a}_2 \approx 7.5 \times 10^{7}$, $\hat{\sigma}_{a_2}^2 \approx 6.75 \times 10^{16}$
- Attack #3: $a_3 = [1, 3, 2, 1, \infty, 2, 3, 2, 3, 2, 2, 1]$, $\bar{a}_3 = \infty$, $\hat{\sigma}_{a_3}^2 = \infty$
→ Mean and variance are not robust.
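The attacks above can be reproduced with a short sketch that inserts one attacker-chosen value into the clean data and recomputes both estimators. The helper name `corrupt` and the `results` dictionary are made up for this illustration. Only the finite attacks are run: with floating-point infinity the variance involves $\infty - \infty$, so numeric libraries do not necessarily return $\infty$ for Attack #3.

```python
from statistics import mean, variance

# Clean data from the slides (the value at index 4 will be attacker-controlled).
clean = [1, 3, 2, 1, 2, 3, 2, 3, 2, 2, 1]


def corrupt(values, bad_value, position=4):
    """Return a copy of values with one attacker-chosen value inserted."""
    corrupted = list(values)
    corrupted.insert(position, bad_value)
    return corrupted


results = {}
for bad_value in (9, 900, 900_000_000):
    sample = corrupt(clean, bad_value)
    results[bad_value] = (mean(sample), variance(sample))
    m, v = results[bad_value]
    print(f"inserted {bad_value:>11}: mean = {m:.4g}, variance = {v:.4g}")
```

Both estimates grow without bound as the single inserted value grows, even though eleven of the twelve observations never change.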