Anomaly detection

DESIGNING MACHINE LEARNING WORKFLOWS IN PYTHON

Dr. Chris Anagnostopoulos
Honorary Associate Professor
Anomalies and outliers

[Figures: Supervised / Unsupervised]
Anomalies and outliers

One of the two classes is very rare
Extreme case of dataset shift

Examples:
  cybersecurity
  fraud detection
  anti-money laundering
  fault detection
Unsupervised workflows

Careful use of a handful of labels:
  too few for training without overfitting
  just enough for model selection (but then you drop an unbiased estimate of accuracy)

How to fit an algorithm without labels?
How to estimate its performance?
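One way to put that handful of labels to work is shown in the sketch below: it assumes a small, hypothetical labelled validation set (X_val, y_val, with y_val using LoF's -1/1 convention) and uses the labels only to choose a setting of an unsupervised detector, never to train it.

from sklearn.neighbors import LocalOutlierFactor
from sklearn.metrics import f1_score

# X_val, y_val: small hypothetical labelled set (y_val: -1 = anomaly, 1 = normal).
# Fit the detector without labels; use the labels only to compare candidate settings.
best_f1, best_contamination = -1.0, None
for c in [0.01, 0.02, 0.05, 0.1]:
    preds = LocalOutlierFactor(contamination=c).fit_predict(X_val)
    score = f1_score(y_val, preds, pos_label=-1)
    if score > best_f1:
        best_f1, best_contamination = score, c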
Outlier: a datapoint that lies outside the range of the majority of the data

Local outlier: a datapoint that lies in an isolated region without other data
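To make the distinction concrete, here is a minimal toy sketch (not from the course data): two dense clusters, one hand-placed point far outside the range of the data, and one hand-placed point in an isolated region between the clusters. A density-based detector would typically flag both.

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
cluster_a = rng.normal(0.0, 0.5, size=(100, 2))
cluster_b = rng.normal(5.0, 0.5, size=(100, 2))
outlier = np.array([[20.0, 20.0]])        # far outside the range of the data
local_outlier = np.array([[2.5, 2.5]])    # isolated region between the clusters
X_toy = np.vstack([cluster_a, cluster_b, outlier, local_outlier])

preds = LocalOutlierFactor(n_neighbors=20).fit_predict(X_toy)
preds[-2:]   # both hand-placed points should come back as -1 (flagged)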
Local outlier factor (LoF)
Local outlier factor (LoF)

from sklearn.neighbors import LocalOutlierFactor as lof

clf = lof()
y_pred = clf.fit_predict(X)

y_pred[:4]
array([ 1,  1,  1, -1])

clf.negative_outlier_factor_[:4]
array([-0.99, -1.02, -1.08, -0.97])

confusion_matrix(y_pred, ground_truth)
array([[  5,  16],
       [  0, 184]])
Local outlier factor (LoF)

clf = lof(contamination=0.02)
y_pred = clf.fit_predict(X)

confusion_matrix(y_pred, ground_truth)
array([[  5,   0],
       [  0, 200]])
Who needs labels anyway!

DESIGNING MACHINE LEARNING WORKFLOWS IN PYTHON
Novelty detection

DESIGNING MACHINE LEARNING WORKFLOWS IN PYTHON

Dr. Chris Anagnostopoulos
Honorary Associate Professor
One-class classification

Training data without anomalies:
Future / test data with anomalies:
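A quick sketch of how that split might be constructed, assuming a hypothetical labelled pool (X_pool, y_pool, with 0 marking the normal class): the detector only ever sees normal examples during fitting, while the held-out test set keeps both classes.

from sklearn.model_selection import train_test_split

# Hypothetical labelled pool (X_pool, y_pool): 0 = normal, 1 = anomaly.
X_rest, X_test, y_rest, y_test = train_test_split(
    X_pool, y_pool, test_size=0.3, stratify=y_pool, random_state=0)

X_train = X_rest[y_rest == 0]   # training data without anomalies
# X_test keeps both classes, so anomalies only appear at prediction time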
Novelty LoF

Workaround:

preds = lof().fit_predict(
    np.concatenate([X_train, X_test]))
preds = preds[X_train.shape[0]:]

Novelty LoF:

clf = lof(novelty=True)
clf.fit(X_train)
y_pred = clf.predict(X_test)
One-class Support Vector Machine

clf = OneClassSVM()
clf.fit(X_train)
y_pred = clf.predict(X_test)

y_pred[:4]
array([ 1,  1,  1, -1])
One-class Support Vector Machine

clf = OneClassSVM()
clf.fit(X_train)
y_scores = clf.score_samples(X_test)

threshold = np.quantile(y_scores, 0.1)
y_pred = y_scores <= threshold
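As a small follow-up (not from the slide), the boolean flags produced by the quantile threshold can be mapped back onto the -1/1 convention that predict() uses, so they plug into the same downstream code:

import numpy as np

# True (score in the lowest 10%) becomes -1 (outlier), False becomes 1 (inlier).
y_pred_pm1 = np.where(y_scores <= threshold, -1, 1)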
Isolation Forests

clf = IsolationForest()
clf.fit(X_train)
y_scores = clf.score_samples(X_test)

clf = LocalOutlierFactor(novelty=True)
clf.fit(X_train)
y_scores = clf.score_samples(X_test)
clf_lof = LocalOutlierFactor(novelty=True).fit(X_train)
clf_isf = IsolationForest().fit(X_train)
clf_svm = OneClassSVM().fit(X_train)

roc_auc_score(y_test, clf_lof.score_samples(X_test))
0.9897
roc_auc_score(y_test, clf_isf.score_samples(X_test))
0.9692
roc_auc_score(y_test, clf_svm.score_samples(X_test))
0.9948
clf_lof = LocalOutlierFactor(novelty=True).fit(X_train)
clf_isf = IsolationForest().fit(X_train)
clf_svm = OneClassSVM().fit(X_train)

accuracy_score(y_test, clf_lof.predict(X_test))
0.9318
accuracy_score(y_test, clf_isf.predict(X_test))
0.9545
accuracy_score(y_test, clf_svm.predict(X_test))
0.5
What's new?

DESIGNING MACHINE LEARNING WORKFLOWS IN PYTHON
Distance-based learning

DESIGNING MACHINE LEARNING WORKFLOWS IN PYTHON

Dr. Chris Anagnostopoulos
Honorary Associate Professor
Distance and similarity

from sklearn.neighbors import DistanceMetric as dm

dist = dm.get_metric('euclidean')
X = [[0, 1], [2, 3], [0, 6]]
dist.pairwise(X)
array([[0.        , 2.82842712, 5.        ],
       [2.82842712, 0.        , 3.60555128],
       [5.        , 3.60555128, 0.        ]])

X = np.matrix(X)
np.sqrt(np.sum(np.square(X[0,:] - X[1,:])))
2.82842712
Non-Euclidean Local Outlier Factor

clf = LocalOutlierFactor(novelty=True, metric='chebyshev')
clf.fit(X_train)
y_pred = clf.predict(X_test)

dist = dm.get_metric('chebyshev')
X = [[0, 1], [2, 3], [0, 6]]
dist.pairwise(X)
array([[0., 2., 5.],
       [2., 0., 3.],
       [5., 3., 0.]])
Are all metrics similar?

Hamming distance matrix:

dist = dm.get_metric('hamming')
X = [[0, 1], [2, 3], [0, 6]]
dist.pairwise(X)
array([[0. , 1. , 0.5],
       [1. , 0. , 1. ],
       [0.5, 1. , 0. ]])
Are all metrics similar?

from scipy.spatial.distance import pdist
from scipy.spatial.distance import squareform

X = [[0, 1], [2, 3], [0, 6]]

pdist(X, 'cityblock')
array([4., 5., 5.])

squareform(pdist(X, 'cityblock'))
array([[0., 4., 5.],
       [4., 0., 5.],
       [5., 5., 0.]])
A real-world example

The Hepatitis dataset:

   Class   AGE  SEX  STEROID ...
0    2.0  40.0  0.0      0.0 ...
1    2.0  30.0  0.0      0.0 ...
2    1.0  47.0  0.0      1.0 ...

https://archive.ics.uci.edu/ml/datasets/Hepatitis
A real-world example

Euclidean distance:

squareform(pdist(X_hep, 'euclidean'))
[[  0.  127.   64.1]
 [127.    0.  128.2]
 [ 64.1 128.2    0. ]]

Example 1 is nearest to example 3: wrong class

Hamming distance:

squareform(pdist(X_hep, 'hamming'))
[[0.  0.5 0.7]
 [0.5 0.  0.6]
 [0.7 0.6 0. ]]

Example 1 is nearest to example 2: right class
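The nearest-neighbour claims above can be checked directly. The sketch below assumes X_hep holds just the three feature rows shown on the previous slide (classes 2, 2 and 1); row indices are 0-based while the slide counts examples from 1.

import numpy as np
from scipy.spatial.distance import pdist, squareform

D_euc = squareform(pdist(X_hep, 'euclidean'))
D_ham = squareform(pdist(X_hep, 'hamming'))

# Ignore self-distances before looking up each row's nearest neighbour.
np.fill_diagonal(D_euc, np.inf)
np.fill_diagonal(D_ham, np.inf)

D_euc[0].argmin()   # 2: under Euclidean distance, example 1 is nearest to example 3 (wrong class)
D_ham[0].argmin()   # 1: under Hamming distance, example 1 is nearest to example 2 (right class)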
A bigger toolbox

DESIGNING MACHINE LEARNING WORKFLOWS IN PYTHON
Unstructured data

DESIGNING MACHINE LEARNING WORKFLOWS IN PYTHON

Dr. Chris Anagnostopoulos
Honorary Associate Professor
Structured versus unstructured

   Class   AGE  SEX  STEROID ...
0    2.0  50.0  2.0      1.0 ...
1    2.0  40.0  1.0      1.0 ...
...

            label                                            sequence
0           VIRUS   AVTVVPDPTCCGTLSFKVPKDAKKGKHLGTFDIRQAIMDYGGLHSQ...
1   IMMUNE SYSTEM   QVQLQQPGAELVKPGASVKLSCKASGYTFTSYWMHWVKQRPGRGLE...
2   IMMUNE SYSTEM   QAVVTQESALTTSPGETVTLTCRSSTGAVTTSNYANWVQEKPDHLF...
3           VIRUS   MSQVTEQSVRFQTALASIKLIQASAVLDLTEDDFDFLTSNKVWIAT...
...

Can we build a detector that flags viruses as anomalous in this data?
import stringdist

stringdist.levenshtein('abc', 'acc')
1
stringdist.levenshtein('acc', 'cce')
2

              label   sequence
169   IMMUNE SYSTEM  ILSALVGIV
170   IMMUNE SYSTEM  ILSALVGIL

stringdist.levenshtein('ILSALVGIV', 'ILSALVGIL')
1
Some debugging

# This won't work
pdist(proteins['sequence'].iloc[:3], metric=stringdist.levenshtein)

Traceback (most recent call last):
ValueError: A 2-dimensional array must be passed.
Some debugging

sequences = np.array(proteins['sequence'].iloc[:3]).reshape(-1, 1)

# This won't work for a different reason
pdist(sequences, metric=stringdist.levenshtein)

Traceback (most recent call last):
TypeError: argument 1 must be str, not numpy.ndarray
Some debugging

# This one works!!
def my_levenshtein(x, y):
    return stringdist.levenshtein(x[0], y[0])

pdist(sequences, metric=my_levenshtein)
array([136.,   2., 136.])
Protein outliers with precomputed matrices

# This takes 2 minutes for about 1000 examples
# (squareform turns the condensed pdist output into the square matrix
#  that metric='precomputed' expects)
M = squareform(pdist(sequences, my_levenshtein))

LoF detector with a precomputed distance matrix:

# This takes 3 seconds
detector = lof(metric='precomputed', contamination=0.1)
preds = detector.fit_predict(M)

roc_auc_score(proteins['label'] == 'VIRUS', preds == -1)
0.64
Pick your distance

DESIGNING MACHINE LEARNING WORKFLOWS IN PYTHON
Concluding remarks

DESIGNING MACHINE LEARNING WORKFLOWS IN PYTHON

Dr. Chris Anagnostopoulos
Honorary Associate Professor
Concluding remarks

Refresher of supervised learning pipelines:
  feature engineering
  model fitting
  model selection

Risks of overfitting
Data fusion
Noisy labels and heuristics
Loss functions:
  costs of false positives vs. costs of false negatives
Concluding remarks

Unsupervised learning:
  anomaly detection
  novelty detection
  distance metrics
  unstructured data

Real-world use cases:
  cybersecurity
  healthcare
  retail banking
Congratulations!

DESIGNING MACHINE LEARNING WORKFLOWS IN PYTHON