Combining extreme value theory and machine learning for novelty detection
Luca Steyn
INTRODUCTION
• Two topics: extreme value theory and novelty detection
• A new idea for multivariate extreme value theory and multivariate anomaly detection
• Brings together research from Statistics and Computer Science
What is novelty detection?
• Novelty detection is the process of identifying when new observations differ from what is expected as normal behaviour.
• A classification problem, i.e. normal or anomalous (positive or negative).
• Conventional classification algorithms fail to detect novel observations. (Is this a bad thing?)
• ⇒ Use a one-class classification approach: threshold a distribution representing the normal state of the system.
• Assumption: novel observations are scarce and differ to some extent from the observations in the normal class.
Methods to perform novelty detection
Many algorithms for novelty detection have been proposed. Broad approaches are:
• A distance-based approach
  - Modified KNN algorithm
• A domain-based approach
  - One-class support vector machines
• A reconstruction-based approach
  - Neural networks or PCA
• A probabilistic approach
  - Density estimation and thresholding
A probabilistic approach
• Let X ∈ ℝ^p and denote the probability density function (pdf) by f(x) = dF(x)/dx.
• Choose a threshold t such that F(t) = ∫_S f(x) dx is large, e.g. F(t) ≥ 0.9, where S = {x : f(x) > t}.
• Then, a new observation x* is novel if f(x*) < t.
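A minimal sketch of this thresholding rule. It assumes a kernel density estimate stands in for f and a 90% mass threshold; the data, bandwidth choice and threshold level are all illustrative, not from the slides.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# "Normal" training data: a 2-D Gaussian cloud stands in for the normal class.
X_train = rng.normal(size=(1000, 2))

# Estimate f with a kernel density estimator (scipy expects shape (d, n)).
kde = stats.gaussian_kde(X_train.T)

# Choose t so that the region S = {x : f(x) > t} holds ~90% of the mass:
# take t as the 10th percentile of the density values at the training points.
densities = kde(X_train.T)
t = np.quantile(densities, 0.10)

# A new observation x* is flagged as novel if f(x*) < t.
x_new = np.array([[4.0, 4.0]])
is_novel = bool(kde(x_new.T)[0] < t)
print(is_novel)  # a point at (4, 4) lies far outside the cloud -> True
```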
A probabilistic approach
A probabilistic approach
• If a new observation is below the threshold, how much certainty do we have that this observation is anomalous?
• Extreme value theory estimates a probability that an observation is anomalous.
Extreme value theory: Fisher–Tippett theorem
• Let {X_1, X_2, X_3, ...} be a sequence of independent and identically distributed (iid) random variables and let M_n = max{X_i ; i = 1, ..., n}. If sequences of constants {a_n > 0} and {b_n} exist such that P((M_n − b_n)/a_n ≤ x) → G(x) as n → ∞, then G is necessarily the Generalized Extreme Value (GEV) distribution.
Extreme value theory: Fisher–Tippett theorem
• The GEV distribution is given by
  G_γ(x) = exp{−(1 + γx)^(−1/γ)},  γ ≠ 0, 1 + γx > 0
  G_γ(x) = exp{−exp(−x)},          γ = 0, x ∈ ℝ
• Move from a non-parametric to a parametric setting (in the limit).
• Three types of GEV distributions: Fréchet–Pareto, Gumbel, (extremal) Weibull.
• Note: min(X) = −max(−X).
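The two branches above can be checked numerically. A small sketch, using scipy's `genextreme`, whose shape parameter follows the opposite sign convention (c = −γ):

```python
import numpy as np
from scipy.stats import genextreme

def gev_cdf(x, gamma):
    """GEV cdf in the extreme-value-index (gamma) parameterization above."""
    if gamma == 0.0:
        return np.exp(-np.exp(-x))           # Gumbel case
    arg = 1.0 + gamma * x
    if arg <= 0:                             # outside the support
        return 0.0 if gamma > 0 else 1.0
    return np.exp(-arg ** (-1.0 / gamma))

# scipy's genextreme uses shape c = -gamma, so the two should agree pointwise.
for gamma in (-0.3, 0.0, 0.5):               # Weibull, Gumbel, Frechet types
    assert np.isclose(gev_cdf(0.7, gamma), genextreme.cdf(0.7, c=-gamma))
print("parameterizations agree")
```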
Extreme value theory: Pickands–Balkema–de Haan theorem
• The distribution F is in the domain of attraction of the GEV distribution if and only if, for some auxiliary function b(·) and for all x with 1 + γx > 0,
  (1 − F(y + b(y)x)) / (1 − F(y)) → (1 + γx)^(−1/γ)  as y → ∞.
  Furthermore,
  b(y + b(y)x) / b(y) → 1 + γx.
Extreme value theory: Pickands–Balkema–de Haan theorem
• Essentially, this theorem states that there exists a high enough threshold t such that the exceedances Z = X − t are approximately generalised Pareto (GP) distributed. Hence, for a large threshold t,
  P(Z > z | X > t) ≈ (1 + γz / b(t))^(−1/γ).
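A sketch of this peaks-over-threshold idea, assuming a heavy-tailed Student-t sample (true γ = 1/4 for 4 degrees of freedom) and a 95% quantile threshold; both choices are illustrative.

```python
import numpy as np
from scipy.stats import genpareto

rng = np.random.default_rng(1)
x = rng.standard_t(df=4, size=20000)     # heavy-tailed sample (true gamma = 1/4)

# Peaks over threshold: pick a high threshold t and keep exceedances Z = X - t.
t = np.quantile(x, 0.95)
z = x[x > t] - t

# Fit the GP distribution to the exceedances, with location fixed at 0.
gamma_hat, _, scale_hat = genpareto.fit(z, floc=0)

# Tail probability P(Z > z | X > t) under the fitted model.
p_exceed = genpareto.sf(1.0, c=gamma_hat, scale=scale_hat)
print(gamma_hat, p_exceed)
```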
Example: Uniform distribution
Other problems with EVT
• The problem is multivariate.
• The distribution under normal conditions is multimodal.
Hence, one needs a method that transforms the data to overcome these issues.
An approach based on minimum probability density
• Redefine extreme value theory in terms of the minimum probability density.
• Let Y_i = f(X_i), i = 1, ..., n, and let E_n = min_i {Y_i} ≡ f(argmin_i f(X_i)).
• Assume X ~ N(μ, Σ).
• It can be shown that
  P(E_n ≤ y) ≈ 1 − exp{−a_n^(−1) y},  a Weibull-type GEV.
  Furthermore, we can choose a_n = G^(−1)(1/n), where G is the known distribution of Y = f(X).
An approach based on minimum probability density
• Hence, the probability that a new observation x* is novel is given by the probability that the density estimate at this observation, y* = f(x*), is less than the minimum probability density, i.e.:
  P(x* is novel) = P(E_n > y*) ≈ exp{−a_n^(−1) y*}.
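A sketch of this novelty probability under the slides' Gaussian assumption. For X ~ N(μ, Σ) the density value Y = f(X) has a known distribution through the chi-squared quantity Q = (X−μ)'Σ⁻¹(X−μ), since f(X) = c·exp(−Q/2) with c = (2π)^(−d/2)|Σ|^(−1/2); this lets us invert G at 1/n to get a_n. The dimensions and sample size are illustrative.

```python
import numpy as np
from scipy.stats import chi2, multivariate_normal

# Assume X ~ N(mu, Sigma); then f(X) = c * exp(-Q/2) with Q ~ chi-squared(d).
d, n = 2, 1000
mu, Sigma = np.zeros(d), np.eye(d)
mvn = multivariate_normal(mean=mu, cov=Sigma)
c = (2 * np.pi) ** (-d / 2) / np.sqrt(np.linalg.det(Sigma))

# G(y) = P(f(X) <= y) = P(Q >= -2*log(y/c)); invert G at 1/n to get a_n.
q_n = chi2.isf(1.0 / n, df=d)          # Q-quantile with upper tail mass 1/n
a_n = c * np.exp(-q_n / 2)             # a_n = G^(-1)(1/n)

# Novelty probability of a test point: P(x* is novel) ~ exp(-y*/a_n).
def p_novel(x_star):
    y_star = mvn.pdf(x_star)
    return np.exp(-y_star / a_n)

print(p_novel([0.0, 0.0]))   # centre of the normal class: probability near 0
print(p_novel([5.0, 5.0]))   # far out in the tail: probability near 1
```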
An approach based on minimum probability density
An approach based on minimum probability density
Problem: the Gaussian assumption is too strict.
An approach based on minimum probability density
• The Gaussian assumption leads to analytical expressions for the parameter estimates.
• The minimum of a GMM density is bounded below at zero.
• Hence, the density of a GMM is in the domain of attraction of the Weibull-type GEV.
• However, the parameters must be estimated via maximum likelihood.
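A sketch of the GMM extension: simulate block minima of a fitted GMM density and fit the Weibull parameters by maximum likelihood, since no closed form is available. The data, block size and component count are illustrative assumptions.

```python
import numpy as np
from scipy.stats import weibull_min
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)

# Bimodal "normal" data: a two-component GMM replaces the single Gaussian.
X = np.vstack([rng.normal(-3.0, 1.0, size=(500, 2)),
               rng.normal(3.0, 1.0, size=(500, 2))])
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# Simulate block minima of the GMM density: the density is bounded below by
# zero, so the minima fall in the Weibull-type GEV domain of attraction.
n, n_blocks = 200, 300
samples, _ = gmm.sample(n * n_blocks)
samples = rng.permutation(samples)      # sklearn orders samples by component
dens = np.exp(gmm.score_samples(samples)).reshape(n_blocks, n)
minima = dens.min(axis=1)

# No closed form here, so estimate the Weibull parameters by maximum likelihood.
shape, _, scale = weibull_min.fit(minima, floc=0)
print(shape, scale)
```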
Weibull density of GMM minimum density: An approach based on minimum probability density
Banknote authentication example
• Dataset: wavelet transform of banknotes – the variables are the variance, skewness, kurtosis and entropy of the wavelet-transformed image.
• There are 600 real banknotes in the training data.
• There are 162 real and 610 forged banknotes in the test set.
Banknote authentication example
• Select the number of components in the GMM with the BIC criterion.
• The optimum was 5 Gaussian components.
• Estimate the distribution of the minimum density of real banknotes using the Weibull GEV of the minimum density.
• Use this distribution to determine the probability that a test-set banknote is forged.
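The BIC-based model selection step can be sketched as follows; simulated clustered data stands in for the banknote features, so the selected number of components here is illustrative, not the slides' result of 5.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)

# Stand-in for the four banknote features: simulated data with three clusters.
X = np.vstack([rng.normal(m, 1.0, size=(200, 4)) for m in (-5.0, 0.0, 5.0)])

# Fit GMMs with 1..8 components and keep the number with the lowest BIC.
bics = [GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
        for k in range(1, 9)]
best_k = int(np.argmin(bics)) + 1
print(best_k)   # the three simulated clusters should favour about 3 components
```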
Banknote authentication example
• Results (rows = predicted, columns = response):

              Response
  Predicted   Real   Forged
  Real         162        1
  Forged         0      609

• Clearly, the model does very well in detecting fake banknotes.
• However, this is very easy data.
Supervised novelty detection and open-set recognition
• Open-set recognition: perform classification under the assumption that not all classes are known at training.
• Use extreme value theory to detect new classes.
• Similar concepts are used for supervised novelty detection.
A new approach based on the GP distribution
• Problem: the test set possibly contains classes not seen at training.
• Use a supervised model to classify known classes.
• Use extreme value theory to adjust the predicted probabilities to account for other classes.
• Estimate the probability that an observation is from a new class not seen at training.
A new approach based on the GP distribution
Consider a model that produces P(Y = k | x), k = 1, 2, ..., K. For each class:
1. Find the correctly classified training data {x_jk : ŷ_j = y_j = k}, j = 1, ..., n_k.
2. Let μ_k = mean(x_jk) and compute the distances d_jk = ||x_jk − μ_k||.
3. Fit a GP distribution to the exceedances Z_jk = D_jk − t_k above a threshold t_k.
The probability that an observation x is not novel with respect to class k is P(Z_k > z | D_k > t_k), where D_k = ||X − μ_k|| and Z_k = D_k − t_k.
Notice that a per-class estimation strategy is followed.
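Steps 1–3 can be sketched per class as below; the toy data, the Euclidean distance, the 90% threshold quantile and the function name `fit_class_tails` are all illustrative assumptions.

```python
import numpy as np
from scipy.stats import genpareto

def fit_class_tails(X, y, y_pred, quantile=0.90):
    """Steps 1-3 per class: class mean, distances, and a GP fit to the tail."""
    tails = {}
    for k in np.unique(y):
        Xk = X[(y == k) & (y_pred == k)]       # 1. correctly classified points
        mu_k = Xk.mean(axis=0)                 # 2. class mean ...
        d = np.linalg.norm(Xk - mu_k, axis=1)  #    ... and distances to it
        t_k = np.quantile(d, quantile)         # 3. threshold and exceedances
        z = d[d > t_k] - t_k
        gamma_k, _, sigma_k = genpareto.fit(z, floc=0)
        tails[k] = (mu_k, t_k, gamma_k, sigma_k)
    return tails

# Toy data: two Gaussian classes; the "predictions" are taken as the labels.
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0.0, 1.0, size=(500, 2)),
               rng.normal(6.0, 1.0, size=(500, 2))])
y = np.repeat([0, 1], 500)
tails = fit_class_tails(X, y, y_pred=y)
print({k: (t_k, gamma_k) for k, (_, t_k, gamma_k, _) in tails.items()})
```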
A new approach based on the GP distribution
Update probabilities: we update each class probability with
  P_new(Y = k | X = x*)
    = P({Y = k} ∩ {Z_k > z_k} | X = x*)
    = P(Y = k | X = x*) · P(Z_k > z_k | Y = k, X = x*)
    ≈ P(Y = k | X = x*) · (1 + γ_k z_k / σ_k)^(−1/γ_k).
The probability that an observation is from none of the classes is then
  P_new(Y = novel) = 1 − Σ_k P_new(Y = k | X = x*).
Classify as the class with maximum probability.
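The update rule above can be sketched as follows. The function name, the convention that z_k ≤ 0 leaves p_k unchanged, and the example tail parameters are illustrative assumptions.

```python
import numpy as np
from scipy.stats import genpareto

def adjust_probabilities(p, z, tail_params):
    """Rescale classifier probabilities p_k by the GP survival probability
    P(Z_k > z_k | D_k > t_k); the leftover mass is the novel-class probability.

    p           : array of P(Y = k | x) over the K known classes
    z           : array of exceedances z_k = d_k - t_k (may be negative)
    tail_params : list of (gamma_k, sigma_k) pairs, one per class
    """
    p_new = np.empty_like(p)
    for k, (gamma_k, sigma_k) in enumerate(tail_params):
        # Below the threshold the tail model does not apply: keep p_k as-is.
        surv = 1.0 if z[k] <= 0 else genpareto.sf(z[k], c=gamma_k, scale=sigma_k)
        p_new[k] = p[k] * surv
    p_novel = 1.0 - p_new.sum()
    return p_new, p_novel

# A confident "class 0" prediction that sits far outside class 0's tail:
p_new, p_novel = adjust_probabilities(
    p=np.array([0.9, 0.1]),
    z=np.array([5.0, -1.0]),           # far beyond class 0's threshold
    tail_params=[(0.1, 1.0), (0.1, 1.0)],
)
print(p_new, p_novel)                  # most of the mass moves to "novel"
```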
Handwritten digits example
Approach:
• Images of handwritten digits downloaded from Kaggle.
• Use 0 to 7 as the known classes in the training data.
• Use 0 to 9 in the test data, i.e. 8 and 9 are new classes.
• Fit a CNN on the training data and find the correctly classified training data.
• Extract the activations in the final hidden layer for each class's correctly classified training data.
• Use these features to estimate the probability that an observation is from a new class.
Handwritten digits example
Training data:

  Class          0     1     2     3     4     5     6     7
  Observations  3285  3728  3382  3496  3243  3054  3312  3501