Modeling Heterogeneous Statistical Patterns in High-dimensional Data by Adversarial Distributions: An Unsupervised Generative Framework (FIRD)
Han Zhang¹, Wenhao Zheng³, Charley Chen¹, Kevin Gao¹, Yao Hu³, Ling Huang², Wei Xu¹
¹ Tsinghua University, ² AHI Fintech, ³ Youku Cognitive and Intelligent Lab, Alibaba Group

Fraud Hurts E-commerce Platforms in Many Ways
• Fraud wastes over $1,000,000,000 a year.
• Fake reviews
• Identity theft
• Coupon hunting
• Payment fraud, merchant fraud, …

Fraud Patterns vs. Normal Patterns [1, 2]
• Fraudsters display synchronized behaviors.
  • Resource sharing (e.g., IP: 987.654.32.1, Phone No.: 12345)
  • Similar control scripts
• In contrast, normal users are usually randomly distributed.
[1] Girish Keshav Palshikar. 2002. The hidden truth-frauds and their control: A critical application for business intelligence. Intelligent Enterprise 5, 9 (2002), 46–51.
[2] S Benson Edwin Raj and A Annie Portia. 2011. Analysis on credit card fraud detection methods. In 2011 International Conference on Computer, Communication and Electrical Technology (ICCCET). IEEE, 152–156.

Challenge 1: Fraud Patterns Change After Exposure
• The e-commerce platform collects fraud labels for training (e.g., IP: 987.654.32.1, Phone No.: 12345).
• Once exposed, fraudsters buy a new IP and phone number (e.g., IP: 732.198.43.1, Phone No.: 54321), and the old labels become obsolete.
• Use unsupervised methods!

Challenge 2: Different Local Clustering Patterns
• Different groups synchronize on different feature combinations (e.g., only GPS City, or only IP).
• Select useful features!
[Figure: example records over the features IP, Phone No., GPS City, Email, and Device ID, in which groups A, B, and C each share values on a different subset of features.]

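Challenge 3: Noisy Random Normal Users
• Ideally, only fraudsters are synchronized (e.g., on GPS City), and detection works well.
• In reality, normal users also appear synchronized due to randomness, which causes errors.
• The method must be robust to noise!
[Figure: six groups (GS 1 to GS 6) on the GPS City feature, ideal case vs. reality.]
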
Problem Definition – Clustering + Feature Selection
• Discrete feature space.
  • Given dataset $X = \{\mathbf{x}_n\}_{n=1}^{N}$, where each feature $x_{nm}$ takes discrete values from $\{X_{mj}\}_{j=1}^{J_m}$.
• Local clustering patterns.
  • Data points are grouped into clusters $\{\mathcal{C}_k\}_{k=1}^{K}$.
  • Within each cluster $\mathcal{C}_k$, there exists a feature subset $\mathcal{F}_k$ such that $\forall \mathbf{x}, \mathbf{x}' \in \mathcal{C}_k,\ \forall m \in \mathcal{F}_k$: $x_m = x'_m$ with high probability.
• Goal: find all $\mathcal{C}_k$ and $\mathcal{F}_k$, while tolerating the noise.

Key Results
• Applicable to a variety of applications.
  • Fraud detection + anomaly detection.
• Superior fraud detection performance.
  • 18% AUC improvement.
• Interpretable results.
• Superior anomaly detection performance.
  • Over 5% AUC improvement on average.
• Robust to noise and hyperparameters.

Feature Selection in Clustering
• Idea: delete some features, then cluster the data.
• Three types of methods [3]:
  • Filter model: filter out the low-quality features before clustering.
  • Wrapper model: enumerate feature combinations and evaluate clustering performance.
  • Hybrid model: select features during clustering.
• Challenge 2 (local clustering patterns): no feature should be deleted globally.
• These methods suffer from an identifiability issue in discrete space.*
* We provide a proof in our paper.
[3] Salem Alelyani, Jiliang Tang, and Huan Liu. 2013. Feature Selection for Clustering: A Review. In Data Clustering: Algorithms and Applications. 29–60.

Dense Block Detection
• Idea: high-density blocks in data are potential anomalies [4, 5].
• Steps:
  1. Greedily search for the block with the highest density.
  2. Delete the block.
  3. Repeat the process on the remaining data.
• Challenge 3 (noise): normal users with random synchronization significantly affect the detection performance.
[4] Kijung Shin, Bryan Hooi, and Christos Faloutsos. 2016. M-Zoom: Fast Dense-Block Detection in Tensors with Quality Guarantees. In ECML PKDD 2016. 264–280.
[5] Kijung Shin, Bryan Hooi, Jisu Kim, and Christos Faloutsos. 2017. D-Cube: Dense-Block Detection in Terabyte-Scale Tensors. In WSDM 2017. 681–689.

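
The three steps of dense block detection can be illustrated with a toy sketch. It is a deliberately simplified greedy peeling, loosely in the spirit of [4], and not a reimplementation of M-Zoom or D-Cube. The density measure (records in the block divided by the average number of distinct values per feature) and the function names `densest_block` and `find_blocks` are assumptions made for illustration.

```python
from collections import Counter


def densest_block(records):
    """Greedy peeling: repeatedly drop the least-supported feature value
    and return the densest intermediate block seen along the way."""
    n_feat = len(records[0])
    alive = list(records)  # records still inside the block
    values = [set(r[m] for r in alive) for m in range(n_feat)]
    best_density, best_values = -1.0, None
    while alive:
        # Assumed density: mass / average number of values per feature.
        size = sum(len(v) for v in values) / n_feat
        density = len(alive) / size
        if density > best_density:
            best_density, best_values = density, [set(v) for v in values]
        counts = [Counter(r[m] for r in alive) for m in range(n_feat)]
        # Remove the (feature, value) pair covering the fewest records.
        m, v = min(
            ((m, v) for m in range(n_feat) for v in values[m]),
            key=lambda mv: counts[mv[0]][mv[1]],
        )
        values[m].discard(v)
        alive = [r for r in alive if r[m] != v]
    return best_density, best_values


def find_blocks(records, n_blocks):
    """Steps 1 to 3: find the densest block, delete it, repeat."""
    remaining, blocks = list(records), []
    for _ in range(n_blocks):
        if not remaining:
            break
        density, values = densest_block(remaining)
        members = [r for r in remaining
                   if all(r[m] in values[m] for m in range(len(r)))]
        blocks.append((density, values, members))
        remaining = [r for r in remaining if r not in members]
    return blocks


# Six synchronized records plus six records with unique values.
records = [("ip1", "ph1")] * 6 + [(f"ip{i}", f"ph{i}") for i in range(2, 8)]
for density, values, members in find_blocks(records, n_blocks=1):
    print(density, values, len(members))
```

In this toy data the synchronized records form the densest block. If normal users happened to share values by chance, they would form blocks of their own, which is the sensitivity to noise that the slide points out.
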
FIRD: A Generative Probabilistic Model
Feature Independence and adveRsarial Distributions.

Enumerating Possible Feature Combinations?
ⓧ Exponentially many feature combinations (over IP, Phone No., GPS City, Email, Active Time, Device ID).
ⓧ Exponentially many feature value combinations.
[Figure: each feature (IP, AT, GC, PN, MA, EM) takes several possible values (A, B, C, D).]

A Decomposed Way of Feature Selection
✓ Conditional feature independence.
  • Features are independent within a cluster.
  • Linear complexity.
✓ Recognize the clustering pattern on each feature, then combine.
  • Use the adversarial distributions to fit the data.

Fitting Patterns Using Adversarial Distributions in Each Feature
• For synchronized features in a cluster: a sparse distribution over the values A to E, e.g., samples (B, B, B, B, B, …).
• For non-synchronized features in a cluster: a nearly random distribution over the values A to E, e.g., samples (A, D, C, B, E, …).
• Solves Challenge 2: detecting local clustering patterns!

Observation Generation Process
• Choose a cluster $d_n \sim \mathrm{Multinomial}(\boldsymbol{\pi})$.
• For each feature $m$:
  • Choose indicator variable $f_{nm} \sim \mathrm{Bernoulli}(\boldsymbol{\mu}_{d_n})$.
  • If $f_{nm} = 1$ (head), generate observation $x_{nm}$ from the sparse multinomial distribution.
  • If $f_{nm} = 0$ (tail), generate observation $x_{nm}$ from the nearly random multinomial distribution.
[Figure: for each feature, the coin face $f_{nm}$ selects between a sparse distribution (head) and a nearly random distribution (tail) over the values A to E.]

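
The generation process can be sketched as a small sampler. This is a minimal illustration under stated assumptions, not the authors' Cython implementation: the function name `sample_fird`, the array names `pi`, `mu`, `sparse`, and `dense`, and the toy parameter values are all hypothetical, and every feature is assumed to share the same number of values `V` for simplicity.

```python
import numpy as np


def sample_fird(n_obs, pi, mu, sparse, dense, seed=0):
    """Draw observations following the generation process.

    pi:     (K,) cluster weights.
    mu:     (K, M) probability that feature m is synchronized in cluster k.
    sparse: (K, M, V) sparse multinomial parameters, used when f = 1.
    dense:  (K, M, V) nearly random multinomial parameters, used when f = 0.
    """
    rng = np.random.default_rng(seed)
    K, M, V = sparse.shape
    d = rng.choice(K, size=n_obs, p=pi)   # cluster of each observation
    f = rng.random((n_obs, M)) < mu[d]    # Bernoulli indicator per feature
    x = np.empty((n_obs, M), dtype=int)
    for n in range(n_obs):
        for m in range(M):
            probs = sparse[d[n], m] if f[n, m] else dense[d[n], m]
            x[n, m] = rng.choice(V, p=probs)
    return d, f, x


# Toy parameters (assumed): 2 clusters, 3 features, 5 values per feature.
pi = np.array([0.5, 0.5])
mu = np.array([[0.9, 0.9, 0.1],
               [0.1, 0.9, 0.9]])
sparse = np.full((2, 3, 5), 0.01)
sparse[0, :, 1] = 0.96  # cluster 0 concentrates on value 1
sparse[1, :, 3] = 0.96  # cluster 1 concentrates on value 3
dense = np.full((2, 3, 5), 0.2)  # uniform, i.e. "nearly random"

d, f, x = sample_fird(200, pi, mu, sparse, dense)
print(x[:5])
```

With these toy values, cluster 0 is mostly synchronized on the first two features and cluster 1 on the last two, which mirrors the idea that each cluster has its own feature subset.
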
Noise Reduction
• Noise: outliers that are dissimilar to all clusters.
• An information-theoretic rule to recognize an outlier:
  $I(\mathbf{x}_n \mid d_n = k) = -\log p(\mathbf{x}_n \mid d_n = k) < (1 + \epsilon)\, H[p(\mathbf{x}_n \mid d_n = k)]$
• Solves Challenge 3: noise from normal users.

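
The rule can be sketched in code under two assumptions that the slide does not spell out. First, an observation that satisfies the inequality for cluster k is read as typical of that cluster, and an observation that satisfies it for no cluster is treated as noise. Second, the cluster-conditional likelihood is taken to factorize over features, with the indicator marginalized out. The names `feature_marginals`, `self_information`, `cluster_entropy`, and `is_outlier`, and the toy parameters, are hypothetical.

```python
import numpy as np


def feature_marginals(mu, sparse, dense):
    """p(x_m = v | d = k) with the indicator f summed out: shape (K, M, V)."""
    return mu[..., None] * sparse + (1.0 - mu[..., None]) * dense


def self_information(x, marg):
    """-log p(x_n | d_n = k) for every observation and cluster: shape (N, K)."""
    M = marg.shape[1]
    logp = np.log(marg[:, np.arange(M), x])  # shape (K, N, M)
    return -logp.sum(axis=2).T


def cluster_entropy(marg):
    """Entropy of p(x | d = k); features are independent, so entropies add."""
    return -(marg * np.log(marg)).sum(axis=(1, 2))


def is_outlier(x, mu, sparse, dense, eps=0.5):
    marg = feature_marginals(mu, sparse, dense)
    info = self_information(x, marg)
    typical = info < (1.0 + eps) * cluster_entropy(marg)
    return ~typical.any(axis=1)  # typical of no cluster => outlier


# Toy parameters (assumed): 2 clusters, 3 features, 5 values per feature.
mu = np.full((2, 3), 0.9)
sparse = np.full((2, 3, 5), 0.01)
sparse[0, :, 1] = 0.96
sparse[1, :, 3] = 0.96
dense = np.full((2, 3, 5), 0.2)

x = np.array([[1, 1, 1],   # matches cluster 0
              [3, 3, 3],   # matches cluster 1
              [0, 2, 4]])  # matches neither cluster
print(is_outlier(x, mu, sparse, dense))
```

On these toy parameters the first two records are typical of one cluster each and the third is flagged as an outlier. A larger `eps` tolerates more mismatched features before a record is flagged.
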
Probabilistic Inference Based on FIRD
• Infer the label $\ell$ of each observation given the label of each cluster:
  $\ell_n \triangleq \mathbb{E}_{d_n}[\ell \mid \mathbf{x}_n] = \sum_{k=1}^{K} p(\ell \mid d_n = k)\, p(d_n = k \mid \mathbf{x}_n)$
• Labels of clusters, $p(\ell \mid d_n = k)$, are easier to obtain:
  • #Clusters << #Observations.
  • Cluster patterns are easier to classify.
[Figure: from clustering to fraud label. Each observation is assigned to Cluster A and Cluster B with weights $p(d_n = k \mid \mathbf{x}_n)$.]

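
The label inference can be sketched as a posterior-weighted average of cluster labels. This is a minimal illustration, assuming the per-feature marginals of each cluster are already available as an array `marg` of shape (K, M, V) and that the posterior follows Bayes' rule with cluster weights `pi`. The function names and toy values are hypothetical.

```python
import numpy as np


def posterior(x, pi, marg):
    """p(d_n = k | x_n), proportional to pi_k * prod_m p(x_nm | d_n = k)."""
    M = marg.shape[1]
    loglik = np.log(marg[:, np.arange(M), x]).sum(axis=2).T  # shape (N, K)
    logpost = np.log(pi) + loglik
    logpost -= logpost.max(axis=1, keepdims=True)  # numerical stability
    post = np.exp(logpost)
    return post / post.sum(axis=1, keepdims=True)


def expected_label(post, cluster_label_prob):
    """Sum over k of p(label | d_n = k) * p(d_n = k | x_n) per observation."""
    return post @ cluster_label_prob


# Toy setup (assumed): cluster 0 peaks on value 1, cluster 1 on value 3.
marg = np.full((2, 3, 5), 0.029)
marg[0, :, 1] = 0.884
marg[1, :, 3] = 0.884
pi = np.array([0.5, 0.5])
cluster_label_prob = np.array([1.0, 0.0])  # cluster 0 is labeled as fraud

x = np.array([[1, 1, 1], [3, 3, 3], [1, 3, 2]])
scores = expected_label(posterior(x, pi, marg), cluster_label_prob)
print(scores)
```

Only the K entries of `cluster_label_prob` need to be labeled, which reflects the point that the number of clusters is much smaller than the number of observations.
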
Experimental Evaluations
Our Cython code of FIRD is available at https://github.com/fingertap/fird.cython.

Identify Fraud Groups
• Dataset
  • We collect the registration records from an e-commerce platform.
  • An account is labeled as Fraud if any malicious behavior is observed.
  • Labels are used only for evaluation.
• Objective
  • Good performance.
  • High interpretability.

Identify Fraud Groups - Performance
• Comparison with dense block detection methods [4, 5]:
  • 18% AUC improvement.
  • Robust to noise!
• N:F is the ratio of normal users to fraudsters.
• A higher N:F means more noise.

Interpretability: Visualize Detected Clusters
[Figure: bar chart of user count (0 to 4000) for each of the 20 detected clusters (discrete semantic representations), with bars split into Filtered Fraudster, Fraudster, Filtered Normal, and Normal. Annotated regions: fraud groups; normal users and individual fraudsters; synchronized normal users.]

Interpretability: Visualize One Fraud Cluster
[Figure: feature importance ($\boldsymbol{\mu}_1$) for each feature, on a scale up to 1; the high-importance features form the fraud signature.]
10 random samples from the cluster (3 fraud groups):
Channel  Device ID  Time  IP            IP City       Phone    Phone City      OS Type
B        180061     5     737.32.7.7    Coryborough   7037671  West Kristen    sim
B        405376     5     162.70.28.7   Amandaview    2916214  New Mariafurt   android
B        861328     5     162.70.28.7   Amandaview    1320211  East Erika      sim
B        201199     5     848.712.23.7  Port Heather  6571178  Valerieside     android
B        162176     15    761.326.87.7  Luisstad      2064801  Thompsonbury    android
B        498726     5     761.326.87.7  Luisstad      7932753  Edwardsfurt     android
B        893969     5     654.21.270.7  Luisstad      6699477  New Mariafurt   android
B        195884     5     654.21.270.7  Luisstad      1507813  New Robertland  android
B        221445     5     654.21.270.7  Luisstad      2611409  West Kellyport  android
B        148534     5     90.713.87.7   Luisstad      2999196  West Kristen    android

Interpretability: Visualize One Fraud Feature
[Figure: for each value of one feature, the user count (left axis, 0 to 350) split into Fraudster and Normal User, together with α (right axis, 0 to 0.14). Annotations: mislabeled fraudster; synchronized normal users.]