Combining extreme value theory and machine learning for novelty detection
Luca Steyn
INTRODUCTION
• Two topics: extreme value theory and novelty detection
• A new idea for multivariate extreme value theory and multivariate anomaly detection
• Brings together research from Statistics and Computer Science
What is novelty detection?
• Novelty detection is the process of identifying when new observations differ from what is expected as normal behaviour.
• A classification problem, i.e. normal or anomalous (positive or negative).
• Conventional classification algorithms fail to detect novel observations. (Is this a bad thing?)
• ⇒ Use a one-class classification approach: threshold a distribution representing the normal state of the system.
• Assumption: novel observations are scarce and differ to some extent from the observations in the normal class.
Methods to perform novelty detection
Many algorithms for novelty detection have been proposed. Broad approaches are:
• A distance-based approach
  - Modified KNN algorithm
• A domain-based approach
  - One-class support vector machines
• A reconstruction-based approach
  - Neural networks or PCA
• A probabilistic approach
  - Density estimation and thresholding
A probabilistic approach
• Let X ∈ ℝ^p and denote the probability density function (pdf) by f(x) = dF(x)/dx.
• Choose a threshold t such that F(t) = ∫_S f(x) dx is large, e.g. F(t) ≥ 0.9, where S = {x : f(x) > t}.
• Then, a new observation x* is novel if f(x*) < t.
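A minimal sketch of this thresholding rule. It assumes a kernel density estimate stands in for f and a 90% mass threshold; the data, bandwidth choice and threshold level are all illustrative, not from the slides.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# "Normal" training data: a 2-D Gaussian cloud stands in for the normal class.
X_train = rng.normal(size=(1000, 2))

# Estimate f with a kernel density estimator (scipy expects shape (d, n)).
kde = stats.gaussian_kde(X_train.T)

# Choose t so that the region S = {x : f(x) > t} holds ~90% of the mass:
# take t as the 10th percentile of the density values at the training points.
densities = kde(X_train.T)
t = np.quantile(densities, 0.10)

# A new observation x* is flagged as novel if f(x*) < t.
x_new = np.array([[4.0, 4.0]])
is_novel = bool(kde(x_new.T)[0] < t)
print(is_novel)  # a point at (4, 4) lies far outside the cloud -> True
```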
A probabilistic approach
A probabilistic approach
• If a new observation is below the threshold, how much certainty do we have that this observation is anomalous?
• Extreme value theory estimates a probability that an observation is anomalous.
Extreme value theory: Fisher–Tippett theorem
• Let {X_1, X_2, X_3, ...} be a sequence of independent and identically distributed (iid) random variables and let M_n = max{X_i ; i = 1, ..., n}. If sequences of constants {a_n > 0} and {b_n} exist such that P((M_n − b_n)/a_n ≤ x) → G(x) as n → ∞, then G is necessarily the Generalized Extreme Value (GEV) distribution.
Extreme value theory: Fisher–Tippett theorem
• The GEV distribution is given by
  G_γ(x) = exp{−(1 + γx)^(−1/γ)},  γ ≠ 0, 1 + γx > 0
  G_γ(x) = exp{−exp(−x)},          γ = 0, x ∈ ℝ
• Move from a non-parametric to a parametric setting (in the limit).
• Three types of GEV distributions: Fréchet–Pareto, Gumbel, (extremal) Weibull.
• Note: min(X) = −max(−X).
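The two branches above can be checked numerically. A small sketch, using scipy's `genextreme`, whose shape parameter follows the opposite sign convention (c = −γ):

```python
import numpy as np
from scipy.stats import genextreme

def gev_cdf(x, gamma):
    """GEV cdf in the extreme-value-index (gamma) parameterization above."""
    if gamma == 0.0:
        return np.exp(-np.exp(-x))           # Gumbel case
    arg = 1.0 + gamma * x
    if arg <= 0:                             # outside the support
        return 0.0 if gamma > 0 else 1.0
    return np.exp(-arg ** (-1.0 / gamma))

# scipy's genextreme uses shape c = -gamma, so the two should agree pointwise.
for gamma in (-0.3, 0.0, 0.5):               # Weibull, Gumbel, Frechet types
    assert np.isclose(gev_cdf(0.7, gamma), genextreme.cdf(0.7, c=-gamma))
print("parameterizations agree")
```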
Extreme value theory: Pickands–Balkema–de Haan theorem
• The distribution F is in the domain of attraction of the GEV distribution if and only if, for some auxiliary function b(·) and for all x with 1 + γx > 0,
  (1 − F(y + b(y)x)) / (1 − F(y)) → (1 + γx)^(−1/γ)  as y → ∞.
  Furthermore,
  b(y + b(y)x) / b(y) → 1 + γx.
Extreme value theory: Pickands–Balkema–de Haan theorem
• Essentially, this theorem states that there exists a high enough threshold t such that the exceedances Z = X − t are approximately generalised Pareto (GP) distributed. Hence, for a large threshold t,
  P(Z > z | X > t) ≈ (1 + γz / b(t))^(−1/γ).
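A sketch of this peaks-over-threshold idea, assuming a heavy-tailed Student-t sample (true γ = 1/4 for 4 degrees of freedom) and a 95% quantile threshold; both choices are illustrative.

```python
import numpy as np
from scipy.stats import genpareto

rng = np.random.default_rng(1)
x = rng.standard_t(df=4, size=20000)     # heavy-tailed sample (true gamma = 1/4)

# Peaks over threshold: pick a high threshold t and keep exceedances Z = X - t.
t = np.quantile(x, 0.95)
z = x[x > t] - t

# Fit the GP distribution to the exceedances, with location fixed at 0.
gamma_hat, _, scale_hat = genpareto.fit(z, floc=0)

# Tail probability P(Z > z | X > t) under the fitted model.
p_exceed = genpareto.sf(1.0, c=gamma_hat, scale=scale_hat)
print(gamma_hat, p_exceed)
```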
Example: Uniform distribution
Other problems with EVT
• The problem is multivariate.
• The distribution under normal conditions is multimodal.
Hence, one needs a method that transforms the data to overcome these issues.
An approach based on minimum probability density
• Redefine extreme value theory in terms of the minimum probability density.
• Let Y_i = f(X_i), i = 1, ..., n, and let E_n = min_i {Y_i} ≡ f(argmin_i f(X_i)).
• Assume X ~ N(μ, Σ).
• It can be shown that
  P(E_n ≤ y) ≈ 1 − exp{−a_n^(−1) y},  a Weibull-type GEV.
  Furthermore, we can choose a_n = G^(−1)(1/n), where G is the known distribution of Y = f(X).
An approach based on minimum probability density
• Hence, the probability that a new observation x* is novel is given by the probability that the density estimate at this observation, y* = f(x*), is less than the minimum probability density, i.e.:
  P(x* is novel) = P(E_n > y*) ≈ exp{−a_n^(−1) y*}.
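A sketch of this novelty probability under the slides' Gaussian assumption. For X ~ N(μ, Σ) the density value Y = f(X) has a known distribution through the chi-squared quantity Q = (X−μ)'Σ⁻¹(X−μ), since f(X) = c·exp(−Q/2) with c = (2π)^(−d/2)|Σ|^(−1/2); this lets us invert G at 1/n to get a_n. The dimensions and sample size are illustrative.

```python
import numpy as np
from scipy.stats import chi2, multivariate_normal

# Assume X ~ N(mu, Sigma); then f(X) = c * exp(-Q/2) with Q ~ chi-squared(d).
d, n = 2, 1000
mu, Sigma = np.zeros(d), np.eye(d)
mvn = multivariate_normal(mean=mu, cov=Sigma)
c = (2 * np.pi) ** (-d / 2) / np.sqrt(np.linalg.det(Sigma))

# G(y) = P(f(X) <= y) = P(Q >= -2*log(y/c)); invert G at 1/n to get a_n.
q_n = chi2.isf(1.0 / n, df=d)          # Q-quantile with upper tail mass 1/n
a_n = c * np.exp(-q_n / 2)             # a_n = G^(-1)(1/n)

# Novelty probability of a test point: P(x* is novel) ~ exp(-y*/a_n).
def p_novel(x_star):
    y_star = mvn.pdf(x_star)
    return np.exp(-y_star / a_n)

print(p_novel([0.0, 0.0]))   # centre of the normal class: probability near 0
print(p_novel([5.0, 5.0]))   # far out in the tail: probability near 1
```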
An approach based on minimum probability density
An approach based on minimum probability density
Problem: the Gaussian assumption is too strict.
An approach based on minimum probability density
• The Gaussian assumption leads to analytical expressions for the parameter estimates.
• The minimum of a GMM density is bounded below at zero.
• Hence, the density of a GMM is in the domain of attraction of the Weibull-type GEV.
• However, the parameters must be estimated via maximum likelihood.
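A sketch of the GMM extension: simulate block minima of a fitted GMM density and fit the Weibull parameters by maximum likelihood, since no closed form is available. The data, block size and component count are illustrative assumptions.

```python
import numpy as np
from scipy.stats import weibull_min
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)

# Bimodal "normal" data: a two-component GMM replaces the single Gaussian.
X = np.vstack([rng.normal(-3.0, 1.0, size=(500, 2)),
               rng.normal(3.0, 1.0, size=(500, 2))])
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# Simulate block minima of the GMM density: the density is bounded below by
# zero, so the minima fall in the Weibull-type GEV domain of attraction.
n, n_blocks = 200, 300
samples, _ = gmm.sample(n * n_blocks)
samples = rng.permutation(samples)      # sklearn orders samples by component
dens = np.exp(gmm.score_samples(samples)).reshape(n_blocks, n)
minima = dens.min(axis=1)

# No closed form here, so estimate the Weibull parameters by maximum likelihood.
shape, _, scale = weibull_min.fit(minima, floc=0)
print(shape, scale)
```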
Weibull density of GMM minimum density: An approach based on minimum probability density
Banknote authentication example
• Dataset: wavelet transform of banknotes – the variables are the variance, skewness, kurtosis and entropy of the wavelet-transformed image.
• There are 600 real banknotes in the training data.
• There are 162 real and 610 forged banknotes in the test set.
Banknote authentication example
• Select the number of components in the GMM with the BIC criterion.
• The optimum was 5 Gaussian components.
• Estimate the distribution of the minimum density of real banknotes using the Weibull GEV of the minimum density.
• Use this distribution to determine the probability that a test-set banknote is forged.
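The BIC-based model selection step can be sketched as follows; simulated clustered data stands in for the banknote features, so the selected number of components here is illustrative, not the slides' result of 5.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)

# Stand-in for the four banknote features: simulated data with three clusters.
X = np.vstack([rng.normal(m, 1.0, size=(200, 4)) for m in (-5.0, 0.0, 5.0)])

# Fit GMMs with 1..8 components and keep the number with the lowest BIC.
bics = [GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
        for k in range(1, 9)]
best_k = int(np.argmin(bics)) + 1
print(best_k)   # the three simulated clusters should favour about 3 components
```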
Banknote authentication example
• Results (rows = predicted, columns = response):

              Response
  Predicted   Real   Forged
  Real         162        1
  Forged         0      609

• Clearly, the model does very well in detecting fake banknotes.
• However, this is very easy data.
Supervised novelty detection and open-set recognition
• Open-set recognition: perform classification under the assumption that not all classes are known at training.
• Use extreme value theory to detect new classes.
• Similar concepts are used for supervised novelty detection.
A new approach based on the GP distribution
• Problem: the test set possibly contains classes not seen at training.
• Use a supervised model to classify known classes.
• Use extreme value theory to adjust the predicted probabilities to account for other classes.
• Estimate the probability that an observation is from a new class not seen at training.
A new approach based on the GP distribution
Consider a model that produces P(Y = k | x), k = 1, 2, ..., K. For each class:
1. Find the correctly classified training data {x_jk : ŷ_j = y_j = k}, j = 1, ..., n_k.
2. Let μ_k = mean(x_jk) and compute the distances d_jk = ||x_jk − μ_k||.
3. Fit a GP distribution to the exceedances Z_jk = D_jk − t_k above a threshold t_k.
The probability that an observation x is not novel with respect to class k is P(Z_k > z | D_k > t_k), where D_k = ||X − μ_k|| and Z_k = D_k − t_k.
Notice that a per-class estimation strategy is followed.
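Steps 1–3 can be sketched per class as below; the toy data, the Euclidean distance, the 90% threshold quantile and the function name `fit_class_tails` are all illustrative assumptions.

```python
import numpy as np
from scipy.stats import genpareto

def fit_class_tails(X, y, y_pred, quantile=0.90):
    """Steps 1-3 per class: class mean, distances, and a GP fit to the tail."""
    tails = {}
    for k in np.unique(y):
        Xk = X[(y == k) & (y_pred == k)]       # 1. correctly classified points
        mu_k = Xk.mean(axis=0)                 # 2. class mean ...
        d = np.linalg.norm(Xk - mu_k, axis=1)  #    ... and distances to it
        t_k = np.quantile(d, quantile)         # 3. threshold and exceedances
        z = d[d > t_k] - t_k
        gamma_k, _, sigma_k = genpareto.fit(z, floc=0)
        tails[k] = (mu_k, t_k, gamma_k, sigma_k)
    return tails

# Toy data: two Gaussian classes; the "predictions" are taken as the labels.
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0.0, 1.0, size=(500, 2)),
               rng.normal(6.0, 1.0, size=(500, 2))])
y = np.repeat([0, 1], 500)
tails = fit_class_tails(X, y, y_pred=y)
print({k: (t_k, gamma_k) for k, (_, t_k, gamma_k, _) in tails.items()})
```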
A new approach based on the GP distribution
Update probabilities: we update each class probability with
  P_new(Y = k | X = x*)
    = P({Y = k} ∩ {Z_k > z_k} | X = x*)
    = P(Y = k | X = x*) · P(Z_k > z_k | Y = k, X = x*)
    ≈ P(Y = k | X = x*) · (1 + γ_k z_k / σ_k)^(−1/γ_k).
The probability that an observation is from none of the classes is then
  P_new(Y = novel) = 1 − Σ_k P_new(Y = k | X = x*).
Classify as the class with maximum probability.
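The update rule above can be sketched as follows. The function name, the convention that z_k ≤ 0 leaves p_k unchanged, and the example tail parameters are illustrative assumptions.

```python
import numpy as np
from scipy.stats import genpareto

def adjust_probabilities(p, z, tail_params):
    """Rescale classifier probabilities p_k by the GP survival probability
    P(Z_k > z_k | D_k > t_k); the leftover mass is the novel-class probability.

    p           : array of P(Y = k | x) over the K known classes
    z           : array of exceedances z_k = d_k - t_k (may be negative)
    tail_params : list of (gamma_k, sigma_k) pairs, one per class
    """
    p_new = np.empty_like(p)
    for k, (gamma_k, sigma_k) in enumerate(tail_params):
        # Below the threshold the tail model does not apply: keep p_k as-is.
        surv = 1.0 if z[k] <= 0 else genpareto.sf(z[k], c=gamma_k, scale=sigma_k)
        p_new[k] = p[k] * surv
    p_novel = 1.0 - p_new.sum()
    return p_new, p_novel

# A confident "class 0" prediction that sits far outside class 0's tail:
p_new, p_novel = adjust_probabilities(
    p=np.array([0.9, 0.1]),
    z=np.array([5.0, -1.0]),           # far beyond class 0's threshold
    tail_params=[(0.1, 1.0), (0.1, 1.0)],
)
print(p_new, p_novel)                  # most of the mass moves to "novel"
```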
Handwritten digits example
Approach:
• Images of handwritten digits downloaded from Kaggle.
• Use 0 to 7 as the known classes in the training data.
• Use 0 to 9 in the test data, i.e. 8 and 9 are new classes.
• Fit a CNN on the training data and find the correctly classified training data.
• Extract the activations in the final hidden layer for each class's correctly classified training data.
• Use these features to estimate the probability that an observation is from a new class.
Handwritten digits example
Training data:

  Class          0     1     2     3     4     5     6     7
  Observations  3285  3728  3382  3496  3243  3054  3312  3501