
Detecting Outliers under Interval Uncertainty: A New Algorithm Based on Constraint Satisfaction - PowerPoint PPT Presentation



  1. Detecting Outliers under Interval Uncertainty: A New Algorithm Based on Constraint Satisfaction
     Evgeny Dantsin and Alexander Wolpert, Department of Computer Science, Roosevelt University, Chicago, IL 60605, USA, {edantsin,awolpert}@roosevelt.edu
     Martine Ceberio, Gang Xiang, and Vladik Kreinovich, Department of Computer Science, University of Texas at El Paso, El Paso, TX 79968, USA, {mceberio,vladik}@cs.utep.edu

  2. 1. Outlier Detection Is Important
     • In many application areas, it is important to detect outliers, i.e., unusual, abnormal values.
     • In medicine: an outlier may mean disease.
     • In geophysics: an outlier may mean a mineral deposit.
     • In structural integrity testing: an outlier may mean a structural fault.
     • Traditional engineering approach to outlier detection:
       – collect measurement results $x_1, \ldots, x_n$ corresponding to normal situations;
       – compute $E \stackrel{\text{def}}{=} \frac{1}{n} \cdot \sum_{i=1}^{n} x_i$ and $\sigma \stackrel{\text{def}}{=} \sqrt{V}$, where $V \stackrel{\text{def}}{=} M - E^2$ and $M \stackrel{\text{def}}{=} \frac{1}{n} \cdot \sum_{i=1}^{n} x_i^2$;
       – a value $x$ is classified as an outlier if it is outside the interval $[L, U]$, where $L \stackrel{\text{def}}{=} E - k_0 \cdot \sigma$, $U \stackrel{\text{def}}{=} E + k_0 \cdot \sigma$, and $k_0 > 1$ is pre-selected (most frequently, $k_0 = 2$, $3$, or $6$).
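
The traditional k0-sigma test described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not code from the presentation; the function names `k_sigma_bounds` and `is_outlier` are hypothetical, and the variance is the population variance M − E² used on the slide.

```python
import math

def k_sigma_bounds(xs, k0=2.0):
    """Return (L, U) = (E - k0*sigma, E + k0*sigma) for the sample xs."""
    n = len(xs)
    E = sum(xs) / n                    # sample mean
    M = sum(x * x for x in xs) / n     # second moment
    V = max(M - E * E, 0.0)            # population variance, clamped against rounding
    sigma = math.sqrt(V)
    return E - k0 * sigma, E + k0 * sigma

def is_outlier(x, xs, k0=2.0):
    """Classify x as an outlier if it falls outside [L, U]."""
    L, U = k_sigma_bounds(xs, k0)
    return x < L or x > U
```

For example, with the normal-situation sample `[1, 2, 3, 4, 5]` and `k0 = 2`, the mean is 3 and the variance is 2, so the interval is roughly [0.17, 5.83]: the value 6 is flagged and the value 5 is not.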

  3. 2. Outlier Detection Under Interval Uncertainty
     • In practice: often, we only have intervals $\mathbf{x}_i = [\underline{x}_i, \overline{x}_i]$ of possible values of $x_i$.
     • Example: the value $\widetilde{x}_i$ measured by an instrument with a known upper bound $\Delta_i$ on the measurement error means that $x_i \in [\widetilde{x}_i - \Delta_i, \widetilde{x}_i + \Delta_i]$.
     • Problem: for different values $x_i \in \mathbf{x}_i$, we get different $L$ and $U$.
     • Objective: given $\mathbf{x}_i$ and $k_0$, compute
       $\mathbf{L} \stackrel{\text{def}}{=} [\underline{L}, \overline{L}] = \{L(x_1, \ldots, x_n) : x_1 \in \mathbf{x}_1, \ldots, x_n \in \mathbf{x}_n\}$;
       $\mathbf{U} \stackrel{\text{def}}{=} [\underline{U}, \overline{U}] = \{U(x_1, \ldots, x_n) : x_1 \in \mathbf{x}_1, \ldots, x_n \in \mathbf{x}_n\}$.
     • A value $x$ is a possible outlier if it is outside one of the possible $k_0$-sigma intervals $[L, U]$, i.e., if $x \notin [\overline{L}, \underline{U}]$.
     • A value $x$ is a guaranteed outlier if it is outside all possible $k_0$-sigma intervals $[L, U]$, i.e., if $x \notin [\underline{L}, \overline{U}]$.
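
Once the endpoints of the ranges of L and U are known, the two notions of outlier reduce to simple comparisons. The sketch below assumes those endpoints have already been computed by some other means; the function and parameter names (`L_upper`, `U_lower`, `L_lower`, `U_upper`) are hypothetical labels for the upper endpoint of the range of L, the lower endpoint of the range of U, and so on.

```python
def is_possible_outlier(x, L_upper, U_lower):
    """x lies outside at least one possible k0-sigma interval:
    it is below the largest possible L or above the smallest possible U."""
    return x < L_upper or x > U_lower

def is_guaranteed_outlier(x, L_lower, U_upper):
    """x lies outside every possible k0-sigma interval:
    it is below the smallest possible L or above the largest possible U."""
    return x < L_lower or x > U_upper
```

With L ranging over [0, 1] and U ranging over [5, 6], the value 5.5 is a possible outlier but not a guaranteed one, while 7 is both.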

  4. 3. Which Approach Is More Reasonable?
     • Situation: our main objective is not to miss an outlier.
       – Example: structural integrity tests.
       – Clarification: we do not want to risk launching a spaceship with a faulty part.
       – Reasonable approach: look for possible outliers.
     • Situation: we want to make sure that the value $x$ is an outlier.
       – Example: planning a surgery.
       – Clarification: we want to make sure that there is a micro-calcification before we start cutting the patient.
       – Reasonable approach: look for guaranteed outliers.

  5. 4. Detecting Outliers Under Interval Uncertainty: What Is Known
     • Case of possible outliers: there exist efficient algorithms for computing $\overline{L}$ and $\underline{U}$.
     • Case of guaranteed outliers: the computation of $\underline{L}$ and $\overline{U}$ is, in general, NP-hard.
     • Technical result: if $1 + (1/k_0)^2 < n$ (e.g., if $k_0 > 1$ and $n \ge 2$), then the maximum $\overline{U}$ of $U$ (and the minimum $\underline{L}$ of $L$) is always attained at a combination of endpoints of $\mathbf{x}_i$.
     • Resulting algorithm: compute $\overline{U}$ and $\underline{L}$ by trying all $2^n$ combinations of $\underline{x}_i$ and $\overline{x}_i$.
     • Specific case: when all measured values $\widetilde{x}_i \stackrel{\text{def}}{=} (\underline{x}_i + \overline{x}_i)/2$ are definitely different from each other, in the sense that the "narrowed" intervals
       $\left[\widetilde{x}_i - \frac{1 + \alpha^2}{n} \cdot \Delta_i,\ \widetilde{x}_i + \frac{1 + \alpha^2}{n} \cdot \Delta_i\right]$
       do not intersect, where $\alpha = 1/k_0$ and $\Delta_i \stackrel{\text{def}}{=} (\overline{x}_i - \underline{x}_i)/2$ is the interval's half-width.
     • Good news: in this case, we can compute $\overline{U}$ and $\underline{L}$ in feasible time.
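
The exhaustive algorithm mentioned on this slide (try all 2^n endpoint combinations) can be sketched as follows. This is an illustrative baseline only, usable for small n; the function name `upper_bound_U_bruteforce` and the representation of intervals as `(lower, upper)` tuples are assumptions made for the example.

```python
import itertools
import math

def upper_bound_U_bruteforce(intervals, k0=2.0):
    """Largest value of U = E + k0*sigma over all 2^n endpoint combinations.

    intervals: list of (lower, upper) tuples. Relies on the slide's technical
    result that the maximum is attained at endpoints when 1 + (1/k0)^2 < n.
    """
    n = len(intervals)
    best = -math.inf
    # itertools.product picks one endpoint from each (lower, upper) pair.
    for xs in itertools.product(*intervals):
        E = sum(xs) / n
        M = sum(x * x for x in xs) / n
        U = E + k0 * math.sqrt(max(M - E * E, 0.0))
        best = max(best, U)
    return best
```

For the two intervals [0, 1] and [2, 3] with k0 = 2, the four candidate vectors give U = 3, 4.5, 2.5, and 4, so the result is 4.5, attained at (0, 3).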

  6. 5. What We Plan To Do
     • More general case: no two narrowed intervals are proper subsets of one another.
     • In precise terms: one of them is not a subset of the interior of the other.
     • Objective: extend known efficient algorithms to this case.
     • Since $L(x_i) = -U(-x_i)$, it suffices to be able to compute $\overline{U}$.
     • Main idea: reduce the interval computation problem to the constraint satisfaction problem with the following constraints:
       – for every $i$, if in the maximizing assignment we have $x_i = \underline{x}_i$, then replacing this value with $x_i = \overline{x}_i$ will either decrease $U$ or leave $U$ unchanged;
       – for every $i$, if in the maximizing assignment we have $x_i = \overline{x}_i$, then replacing this value with $x_i = \underline{x}_i$ will either decrease $U$ or leave $U$ unchanged;
       – for every $i$ and $j$, replacing both $x_i$ and $x_j$ with the opposite ends of the corresponding intervals $\mathbf{x}_i$ and $\mathbf{x}_j$ will either decrease $U$ or leave $U$ unchanged.
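
The applicability condition (no narrowed interval lies inside the interior of another) can be checked directly. The sketch below uses the narrowed-interval definition from the presentation, with shrink factor (1 + α²)/n and α = 1/k0; the function names and the simple quadratic-time pairwise check are assumptions chosen for clarity rather than efficiency.

```python
def narrowed_intervals(intervals, k0):
    """Shrink each (lower, upper) interval around its midpoint by (1 + alpha^2)/n."""
    n = len(intervals)
    alpha = 1.0 / k0
    shrink = (1.0 + alpha ** 2) / n
    result = []
    for lo, hi in intervals:
        mid = (lo + hi) / 2.0       # measured value
        half = (hi - lo) / 2.0      # half-width Delta_i
        result.append((mid - shrink * half, mid + shrink * half))
    return result

def no_proper_nesting(intervals, k0):
    """True if no narrowed interval is a subset of the interior of another."""
    nar = narrowed_intervals(intervals, k0)
    for i, (a, b) in enumerate(nar):
        for j, (c, d) in enumerate(nar):
            if i != j and c < a and b < d:
                return False
    return True
```

For instance, with k0 = 2 the intervals [0, 1] and [2, 3] satisfy the condition, while [0, 10] and [4, 6] do not, because the narrowed version of the second sits strictly inside the narrowed version of the first.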

  7. 6. Algorithm
     • General idea:
       – First, we sort the values $\widetilde{x}_i$ into an increasing sequence.
       – Without losing generality, we can assume that $\widetilde{x}_1 \le \widetilde{x}_2 \le \ldots \le \widetilde{x}_n$.
       – Then, for every $k$ from $0$ to $n$, we compute the value $V^{(k)} = M^{(k)} - (E^{(k)})^2$ of the population variance $V$ for the vector $x^{(k)} = (\underline{x}_1, \ldots, \underline{x}_k, \overline{x}_{k+1}, \ldots, \overline{x}_n)$, and we compute $U^{(k)} = E^{(k)} + k_0 \cdot \sqrt{V^{(k)}}$.
       – Finally, we compute $\overline{U}$ as the largest of the $n + 1$ values $U^{(0)}, \ldots, U^{(n)}$.
     • Details: how to compute the values $V^{(k)}$
       – First, we explicitly compute $M^{(0)}$, $E^{(0)}$, and $V^{(0)} = M^{(0)} - (E^{(0)})^2$.
       – Once we know the values $M^{(k)}$ and $E^{(k)}$, we can compute
         $M^{(k+1)} = M^{(k)} + \frac{1}{n} \cdot (\underline{x}_{k+1})^2 - \frac{1}{n} \cdot (\overline{x}_{k+1})^2$
         and
         $E^{(k+1)} = E^{(k)} + \frac{1}{n} \cdot \underline{x}_{k+1} - \frac{1}{n} \cdot \overline{x}_{k+1}$.
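
The steps on this slide can be sketched in Python as below. This is an illustrative reading of the algorithm, not the authors' code: the function name `upper_bound_U_sorted` and the `(lower, upper)` tuple representation are assumptions, and the result is only meaningful when the applicability condition on the narrowed intervals holds.

```python
import math

def upper_bound_U_sorted(intervals, k0=2.0):
    """Largest of U^(0), ..., U^(n), where x^(k) takes lower endpoints for the
    first k intervals (sorted by midpoint) and upper endpoints for the rest."""
    n = len(intervals)
    # Sort by midpoint, i.e., by the measured value of each interval.
    ordered = sorted(intervals, key=lambda iv: (iv[0] + iv[1]) / 2.0)

    # k = 0: every component is at its upper endpoint.
    E = sum(hi for _, hi in ordered) / n
    M = sum(hi * hi for _, hi in ordered) / n
    best = E + k0 * math.sqrt(max(M - E * E, 0.0))

    # Step k -> k+1: component k+1 switches from upper to lower endpoint,
    # so E and M are updated in constant time.
    for lo, hi in ordered:
        E += (lo - hi) / n
        M += (lo * lo - hi * hi) / n
        best = max(best, E + k0 * math.sqrt(max(M - E * E, 0.0)))
    return best
```

On the intervals [0, 1] and [2, 3] with k0 = 2, the three candidates are U = 4 (k = 0), 4.5 (k = 1), and 3 (k = 2), so the function returns 4.5, the same value an exhaustive search over all four endpoint combinations gives. Clamping the variance at zero is a defensive choice against rounding error in the incremental updates.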

  8. 7. Number of Computation Steps
     • Sorting: requires $O(n \cdot \log(n))$ steps.
     • Computing the initial values $M^{(0)}$, $E^{(0)}$, and $V^{(0)}$ requires linear time $O(n)$.
     • For each $k$ from $0$ to $n - 1$, we need a constant number of steps to compute the next values $M^{(k+1)}$, $E^{(k+1)}$, and $V^{(k+1)}$ as
       $M^{(k+1)} = M^{(k)} + \frac{1}{n} \cdot (\underline{x}_{k+1})^2 - \frac{1}{n} \cdot (\overline{x}_{k+1})^2$
       and
       $E^{(k+1)} = E^{(k)} + \frac{1}{n} \cdot \underline{x}_{k+1} - \frac{1}{n} \cdot \overline{x}_{k+1}$.
     • Computing $U^{(k)} = E^{(k)} + k_0 \cdot \sqrt{V^{(k)}}$ also requires a constant number of steps.
     • Finally, finding the largest of the $n + 1$ values $U^{(k)}$ requires $O(n)$ steps.
     • Overall: we need $O(n \cdot \log(n)) + O(n) + O(n) + O(n) = O(n \cdot \log(n))$ steps.
     • Comment: if the measurement results $\widetilde{x}_i$ are already sorted, then we only need linear time to compute $\overline{U}$.

  9. 8. Justification of the Algorithm
     • Known: $\overline{U} = \max U$ is attained at a vector $x = (x_1, \ldots, x_n)$ in which each value $x_i$ is equal either to $\underline{x}_i$ or to $\overline{x}_i$.
     • New result: this maximum is attained at one of the vectors $x^{(k)}$ in which all the lower bounds $\underline{x}_i$ precede all the upper bounds $\overline{x}_i$.
     • How we prove it: by contradiction.
     • Assume: the maximum is attained at a vector $x$ in which one of the lower bounds follows one of the upper bounds.
     • Notation: let $i$ be the largest index at which an upper bound is immediately followed by a lower bound.
     • Conclusion: in the maximizing vector $x$, we have $x_i = \overline{x}_i$ and $x_{i+1} = \underline{x}_{i+1}$.
     • Rest of the proof: since the maximum is attained at $x$, each of the following replacements:
       – replacing $x_i$ with $\underline{x}_i$;
       – replacing $x_{i+1}$ with $\overline{x}_{i+1}$; and
       – replacing both
       leads to $\Delta U \le 0$; we trace these changes $\Delta U$.
     • We then conclude that one of the narrowed intervals is a proper subset of another, which contradicts our assumption.
