Modeling Heterogeneous Statistical Patterns in High-dimensional - PowerPoint PPT Presentation

Modeling Heterogeneous Statistical Patterns in High-dimensional Data by Adversarial Distributions: An Unsupervised Generative Framework (FIRD)
Han Zhang¹, Wenhao Zheng³, Charley Chen¹, Kevin Gao¹, Yao Hu³, Ling Huang², Wei Xu¹
¹ Tsinghua University; ² AHI Fintech; ³ Youku Cognitive and Intelligent Lab, Alibaba Group

Fraud Hurts E-commerce Platforms in Many Ways
• Fake reviews, identity theft, payment fraud, merchant fraud, coupon hunting, …
• Waste over $1,000,000,000 a year.

Fraud Patterns vs. Normal Patterns [1, 2]
• Fraudsters display synchronized behaviors:
  • Resource sharing (e.g., IP: 987.654.32.1, Phone No.: 12345).
  • Similar control scripts.
• In contrast, normal users are usually randomly distributed.
[1] Girish Keshav Palshikar. 2002. The hidden truth-frauds and their control: A critical application for business intelligence. Intelligent Enterprise 5, 9 (2002), 46–51.
[2] S Benson Edwin Raj and A Annie Portia. 2011. Analysis on credit card fraud detection methods. In 2011 International Conference on Computer, Communication and Electrical Technology (ICCCET). IEEE, 152–156.

Challenge 1: Fraud pattern changes after exposure.
• The e-commerce platform collects fraud labels for training (e.g., IP: 987.654.32.1, Phone No.: 12345).
• Once exposed, fraudsters buy a new IP and phone number (e.g., IP: 732.198.43.1, Phone No.: 54321), so the labels become obsolete.
• Use unsupervised methods!

Challenge 2: Different Local Clustering Patterns
[Figure: example records over the features IP, Phone No., GPS City, Email, and Device ID. Different groups synchronize on different feature combinations, e.g., only GPS City or only IP.]
• Select useful features!

Challenge 3: Noisy Random Normal Users
[Figure: users grouped by GPS City. Ideally, synchronization identifies fraud groups ("Good Job!"); in reality, normal users also synchronize due to randomness ("Error!").]
• Robust to noise!

Problem Definition – Clustering + Feature Selection
• Discrete feature space.
  • Given dataset $\mathcal{D} = \{\mathbf{x}_n\}_{n=1}^{N}$, where each feature $x_{nm}$ takes discrete values from $\{X_{mk}\}_{k=1}^{K_m}$.
• Local clustering patterns.
  • Data points are grouped into clusters $\{\mathcal{G}_g\}_{g=1}^{G}$.
  • Within each cluster $\mathcal{G}_g$, there exists a feature subset $\mathcal{F}_g$ such that $\forall \mathbf{x}, \mathbf{x}' \in \mathcal{G}_g,\ \forall m \in \mathcal{F}_g,\ x_m = x'_m$ with high probability.
• Goal: find all $\mathcal{G}_g$ and $\mathcal{F}_g$, while tolerating the noise.

Key Results
• Applicable to a variety of applications.
  • Fraud detection + anomaly detection.
• Superior fraud detection performance.
  • 18% AUC improvement.
  • Interpretable results.
• Superior anomaly detection performance.
  • Over 5% AUC improvement on average.
• Robust to noise and hyperparameters.

Feature Selection in Clustering
• Idea: delete some features, then cluster the data.
• 3 types of methods [3]:
  • Filter model: filter the low-quality features before clustering.
  • Wrapper model: enumerate feature combinations and evaluate clustering performance.
  • Hybrid model: select features during clustering.
• Challenge 2 (LOCAL clustering patterns!): no feature should be deleted globally.
• *Suffer from identifiability issue in discrete space.
* We provide a proof in our paper.
[3] Salem Alelyani, Jiliang Tang, and Huan Liu. Feature Selection for Clustering: A Review. In Data Clustering: Algorithms and Applications. 2013. 29–60.

Dense Block Detection
• Idea: high-density blocks in data are potential anomalies [4, 5].
• Steps:
  1. Greedy search for the block with highest density.
  2. Delete the block.
  3. Repeat the process on the remaining data.
• Challenge 3 (noise!): normal users with random synchronization significantly affect the detection performance.
[4] Kijung Shin, Bryan Hooi, and Christos Faloutsos. M-Zoom: Fast Dense-Block Detection in Tensors with Quality Guarantees. ECML PKDD 2016. 264–280.
[5] Kijung Shin, Bryan Hooi, Jisu Kim, and Christos Faloutsos. D-Cube: Dense-Block Detection in Terabyte-Scale Tensors. WSDM 2017. 681–689.
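The three steps of dense block detection can be illustrated with a deliberately simplified sketch. This is not M-Zoom or D-Cube: the density measure (records per distinct attribute value, averaged over attributes), the peeling order, and the function names `density`, `densest_block`, and `find_dense_blocks` are all assumptions made for illustration.

```python
from collections import Counter

def density(records):
    """Records per distinct attribute value, averaged over attributes (assumed measure)."""
    if not records:
        return 0.0
    n_attrs = len(records[0])
    distinct = sum(len({r[a] for r in records}) for a in range(n_attrs))
    return len(records) * n_attrs / distinct

def densest_block(records):
    """Step 1: greedily peel the rarest (attribute, value) and keep the densest snapshot."""
    current = list(records)
    best, best_density = current, density(current)
    while current:
        counts = Counter((a, v) for r in current for a, v in enumerate(r))
        attr, value = min(counts, key=counts.get)
        current = [r for r in current if r[attr] != value]
        if density(current) > best_density:
            best, best_density = current, density(current)
    return best, best_density

def find_dense_blocks(records, n_blocks):
    """Steps 2 and 3: delete each found block and repeat on the remaining data."""
    remaining, blocks = list(records), []
    for _ in range(n_blocks):
        if not remaining:
            break
        block, block_density = densest_block(remaining)
        blocks.append((block, block_density))
        members = set(block)
        remaining = [r for r in remaining if r not in members]
    return blocks

# Toy data: 3 IPs x 3 phones used in every combination, plus 4 unrelated users.
fraud = [(f"ip{i}", f"ph{j}") for i in range(3) for j in range(3)]
normal = [(f"n{i}", f"m{i}") for i in range(4)]
blocks = find_dense_blocks(fraud + normal, n_blocks=2)
```

In this toy setting the synchronized records form the densest block. If normal users happened to share some of the same values by chance, they would be absorbed into or compete with that block, which is the noise problem the slide points out.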

FIRD: A Generative Probabilistic Model
Feature Independence and adveRsarial Distributions.

Enumerating Possible Feature Combinations?
ⓧ Exponential feature combinations.
[Figure: subsets of the features IP, Phone No., GPS City, Email, Active Time, Device ID.]
ⓧ Exponential feature value combinations.
[Figure: each feature takes one of several values (A, B, C, D, …), and every combination of values is a candidate pattern.]

A Decomposed Way of Feature Selection
✓ Conditional feature independence.
  • Features are independent within a cluster.
  • Linear complexity.
✓ Recognize clustering pattern on each feature, then combine.
  • Using the adversarial distributions to fit the data.

Fitting Patterns Using Adversarial Distributions in Each Feature
• For synchronized features in a cluster: a sparse probability distribution over the values (A, B, C, D, E), producing samples such as (B, B, B, B, B, …).
• For non-synchronized features in a cluster: a nearly random probability distribution over the values (A, B, C, D, E), producing samples such as (A, D, C, B, E, …).
• Solved Challenge 2: detecting local clustering patterns!

Observation Generation Process
• Choose a cluster $d_n \sim \mathrm{Multinomial}(\boldsymbol{\pi})$.
• For each feature $m$:
  • Choose indicator variable $f_{nm} \sim \mathrm{Bernoulli}(\boldsymbol{\mu}_{d_n})$ (a coin flip).
  • If $f_{nm} = 1$ (head), generate observation $x_{nm}$ from the sparse multinomial distribution.
  • If $f_{nm} = 0$ (tail), generate observation $x_{nm}$ from the nearly random multinomial distribution.
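The generation process can be sketched as a small sampler. This is an illustration under assumptions, not the authors' implementation: every feature is assumed to share the same number of values K, the indicator for feature m is assumed to use entry m of the cluster's vector μ, the nearly random distribution is assumed to be stored per cluster and feature, and the name `sample_observation` and all parameter values are hypothetical.

```python
import numpy as np

def sample_observation(rng, pi, mu, sparse, dense):
    """Draw one observation following the slide's generation process.

    pi:     (G,)      cluster weights
    mu:     (G, M)    mu[g, m] = P(f_nm = 1 | d_n = g)  (assumed per-feature entry)
    sparse: (G, M, K) sparse distributions, used when f_nm = 1
    dense:  (G, M, K) nearly random distributions, used when f_nm = 0
    """
    n_features = mu.shape[1]
    d = rng.choice(len(pi), p=pi)               # d_n ~ Multinomial(pi)
    f = rng.random(n_features) < mu[d]          # f_nm ~ Bernoulli(mu[d_n, m])
    x = np.empty(n_features, dtype=int)
    for m in range(n_features):
        probs = sparse[d, m] if f[m] else dense[d, m]
        x[m] = rng.choice(len(probs), p=probs)  # x_nm from the chosen distribution
    return d, f, x

# Hypothetical parameters: 2 clusters, 3 features, 5 values per feature.
G, M, K = 2, 3, 5
pi = np.array([0.5, 0.5])
mu = np.array([[0.9, 0.9, 0.1],    # cluster 0 synchronizes on features 0 and 1
               [0.1, 0.1, 0.9]])   # cluster 1 synchronizes on feature 2
sparse = np.full((G, M, K), 0.01)
sparse[0, :, 1] = 0.96             # cluster 0 concentrates on value index 1
sparse[1, :, 3] = 0.96             # cluster 1 concentrates on value index 3
dense = np.full((G, M, K), 1.0 / K)

rng = np.random.default_rng(0)
d, f, x = sample_observation(rng, pi, mu, sparse, dense)
```

Because each feature is drawn independently once the cluster is fixed, the cost of generating (or scoring) an observation grows linearly with the number of features.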

Noise Reduction
• Noise: outliers that are dissimilar to all clusters.
• Solves Challenge 3: noise from normal users.
[Figure: does observation $\mathbf{x}_n$ fit $p(\mathbf{x} \mid d = g)$?]
• An information-theoretic rule to recognize an outlier:
  $I(\mathbf{x}_n \mid d_n = g) = -\log p(\mathbf{x}_n \mid d_n = g) < (1 + \epsilon)\, H[p(\mathbf{x}_n \mid d_n = g)]$
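The rule compares the surprisal of an observation under a cluster with that cluster's entropy. The sketch below assumes one reading of the slide: an observation satisfying the inequality is typical of cluster g, and an observation that violates it for every cluster is treated as noise. It also assumes features are independent within a cluster, so both sides decompose into per-feature terms. The function names and the example distribution are hypothetical.

```python
import numpy as np

def surprisal(x, probs):
    """-log p(x | d = g) for independent features; probs has shape (M, K)."""
    return -np.sum(np.log(probs[np.arange(len(x)), x]))

def entropy(probs):
    """H[p(x | d = g)] as the sum of per-feature entropies (0 log 0 taken as 0)."""
    safe = np.where(probs > 0, probs, 1.0)
    return -np.sum(probs * np.log(safe))

def is_typical(x, probs, eps):
    """True if -log p(x | d = g) < (1 + eps) * H[p(x | d = g)]."""
    return surprisal(x, probs) < (1 + eps) * entropy(probs)

def is_outlier(x, cluster_probs, eps):
    """Assumed reading: noise if the observation is typical of no cluster."""
    return not any(is_typical(x, probs, eps) for probs in cluster_probs)

# One hypothetical cluster: feature 0 is sparse, feature 1 is uniform.
cluster = np.array([[0.97, 0.01, 0.01, 0.01],
                    [0.25, 0.25, 0.25, 0.25]])
follows_pattern = is_typical([0, 2], cluster, eps=0.1)
breaks_pattern = is_outlier([1, 2], [cluster], eps=0.1)
```

Here an observation that matches the sparse feature passes the test regardless of its value on the uniform feature, while one that deviates on the sparse feature incurs a large surprisal and is flagged.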

Probabilistic Inference Based on FIRD
• Inferring label $\ell$ for each observation given the label of each cluster:
  $\ell_n \triangleq \mathbb{E}_{d_n}[\ell \mid \mathbf{x}_n] = \sum_{g=1}^{G} p(\ell \mid d_n = g)\, p(d_n = g \mid \mathbf{x}_n)$
• Labels of clusters $p(\ell \mid d_n = g)$ are easier to obtain:
  • #Clusters << #Observations
  • Cluster patterns are easier to classify.
[Figure: from clustering to fraud label. Observations are assigned to Cluster A or Cluster B with weights $p(d_n = g \mid \mathbf{x}_n)$.]
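The expected label is a posterior-weighted average of the cluster labels. The following sketch assumes the per-cluster log-likelihoods log p(x_n | d_n = g) are already available and that the posterior follows from Bayes' rule with the cluster weights π; the function names and numbers are hypothetical.

```python
import numpy as np

def cluster_posterior(log_lik, pi):
    """p(d_n = g | x_n), proportional to pi_g * p(x_n | d_n = g).

    log_lik: (N, G) array of log p(x_n | d_n = g); pi: (G,) cluster weights.
    """
    log_post = log_lik + np.log(pi)
    log_post -= log_post.max(axis=1, keepdims=True)   # stabilize the exponent
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)

def expected_labels(posterior, cluster_label_prob):
    """l_n = sum_g p(l | d_n = g) * p(d_n = g | x_n)."""
    return posterior @ cluster_label_prob

# Two observations, two clusters; cluster 0 was judged fraudulent, cluster 1 normal.
posterior = np.array([[0.9, 0.1],
                      [0.2, 0.8]])
cluster_label_prob = np.array([1.0, 0.0])
scores = expected_labels(posterior, cluster_label_prob)
```

Only one label per cluster has to be supplied, which is why having far fewer clusters than observations makes labeling cheaper.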

Experimental Evaluations
Our Cython code of FIRD is available at https://github.com/fingertap/fird.cython.

Identify Fraud Groups
• Dataset
  • We collect the registration records from an e-commerce platform.
  • An account is labeled as Fraud if any malicious behavior is observed.
  • Labels are used only for evaluation.
• Objective
  • Good performance.
  • High interpretability.

Identify Fraud Groups - Performance
• Compare with dense block detection methods [4, 5]:
  • N:F is the ratio of normal users to fraudsters.
  • Higher N:F means larger noise.
• Results: 18% AUC ↑; robust to noise!

Interpretability: Visualize Detected Clusters
[Figure: user count (0 to 4000) per detected cluster (discrete semantic representations 1 to 20), split into Filtered Fraudster, Fraudster, Filtered Normal, and Normal. Annotated regions: fraud groups; normal users and individual fraudsters; synchronized normal users.]

Interpretability: Visualize One Fraud Cluster
[Figure: importance ($\boldsymbol{\mu}_1$) of each feature; the high-importance features form the fraud signature.]
10 random samples from the cluster (annotation: 3 fraud groups):
Channel  Device ID  Time  IP            IP City       Phone    Phone City      OS Type
B        180061     5     737.32.7.7    Coryborough   7037671  West Kristen    sim
B        405376     5     162.70.28.7   Amandaview    2916214  New Mariafurt   android
B        861328     5     162.70.28.7   Amandaview    1320211  East Erika      sim
B        201199     5     848.712.23.7  Port Heather  6571178  Valerieside     android
B        162176     15    761.326.87.7  Luisstad      2064801  Thompsonbury    android
B        498726     5     761.326.87.7  Luisstad      7932753  Edwardsfurt     android
B        893969     5     654.21.270.7  Luisstad      6699477  New Mariafurt   android
B        195884     5     654.21.270.7  Luisstad      1507813  New Robertland  android
B        221445     5     654.21.270.7  Luisstad      2611409  West Kellyport  android
B        148534     5     90.713.87.7   Luisstad      2999196  West Kristen    android

Interpretability: Visualize One Fraud Feature
[Figure: user count (left axis, 0 to 350) and α (right axis, 0 to 0.14) for each IP value, split into Fraudster and Normal User. Annotations: mislabeled fraudster; synchronized normal users.]
