Multi-Label Learning with Highly Incomplete Data via Collaborative Embedding
Yufei Han¹, Guolei Sun², Yun Shen¹, Xiangliang Zhang²
1. Symantec Research Labs   2. King Abdullah University of Science and Technology
Outline • Introduction and Problem Definition • Our Methods • Experimental Results
Multi-Label Classification in Cyber Security
• Multi-class classification: f(x) = c1? or c2? or c3? (e.g., f(x) = apple, f(x) = banana, or f(x) = orange); each instance gets exactly one class
• Multi-label classification: f(x) = {c1? and c2? and c3?}; each instance may carry several labels at once
Keywords: multi-label classification, collaborative embedding, incomplete features
Existing popular solutions
• Binary relevance
  – Constructs one binary classifier per label, independently
  – Does not consider label dependency
• Label power-set
  – Converts the problem into multi-class classification over label subsets
  – Labels A, B: {}, {A}, {B}, {A,B}
  – 2^n subsets: with 40 labels, 2^40 = 1,099,511,627,776 classes
• Classifier chains
  – Learn L binary classifiers by forming the training problems (x_i, y_1, ..., y_{j-1}) → y_j ∈ {0, 1}
  – Only capture the dependency of y_j on y_1, ..., y_{j-1}
(A code sketch of binary relevance vs. a classifier chain follows below.)
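Not from the original slides: a minimal scikit-learn sketch contrasting binary relevance and a classifier chain on a synthetic multi-label problem. The dataset generator, base classifier, and metric are arbitrary illustrative choices.

```python
# Binary relevance vs. classifier chain on a synthetic multi-label problem.
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier, ClassifierChain
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

X, Y = make_multilabel_classification(n_samples=1000, n_features=20,
                                       n_classes=5, random_state=0)
X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, random_state=0)

# Binary relevance: one independent binary classifier per label.
br = MultiOutputClassifier(LogisticRegression(max_iter=1000)).fit(X_tr, Y_tr)

# Classifier chain: label j is predicted from (x, y_1, ..., y_{j-1}).
cc = ClassifierChain(LogisticRegression(max_iter=1000), random_state=0).fit(X_tr, Y_tr)

print("Binary relevance micro-F1:", f1_score(Y_te, br.predict(X_te), average="micro"))
print("Classifier chain micro-F1:", f1_score(Y_te, cc.predict(X_te), average="micro"))
```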
Use Case of Multi-label Classification
• Train a prediction model for a given product
• Training data: one row per machine-day, with incomplete signature counts as features and incomplete labels as targets
[Slide figure: side-by-side feature and label matrices over machine-days, with missing entries shown as "?"]
Our Problem: A Tale of Two Cities
• Multi-label learning with incomplete feature values and weak labels
  – Training data X ∈ R^{N×D} (N instances with D features) is partially observed: Ω_{i,j} = 1 if X_{i,j} is observed, otherwise Ω_{i,j} = 0
  – Label assignment Y ∈ {0, 1}^{N×M} (M is the label dimension) is a positive-unlabeled matrix:
    • Y_{i,j} = 1 indicates that instance X_{i,:} is positively labeled with the j-th label
    • Y_{i,j} = 0 indicates that the entry is unobserved (not necessarily negative)
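A small illustrative sketch of this notation (the sizes, observation rate, and positive-label retention rate are made-up assumptions), simulating the observation mask Ω and a positive-unlabeled label matrix from a fully observed pair:

```python
# Simulate the setting in the problem definition (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
N, D, M = 100, 30, 8
X_full = rng.standard_normal((N, D))              # hypothetical full feature matrix
Y_full = (rng.random((N, M)) < 0.3).astype(int)   # hypothetical ground-truth labels

Omega = (rng.random((N, D)) < 0.6).astype(int)    # Omega[i,j] = 1 iff X[i,j] is observed
X_obs = X_full * Omega                            # unobserved feature entries zeroed out

# Positive-unlabeled labels: keep only a fraction of the true positives;
# everything else becomes 0 ("unobserved", not necessarily negative).
keep = (rng.random((N, M)) < 0.5).astype(int)
Y_pu = Y_full * keep
```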
Our Problem: A Tale of Two Cities
• Feature matrix: corrupted / incomplete data
  – Limited coverage of sensors
  – Privacy control
  – Failure of sensors
  – Partial responses
• Label matrix: weak supervision
  – Semi-supervised information
  – Positive-unlabeled / partially observed supervision
  – Weak pairwise / triple-wise constraints
• The classification model must be learned from both
Existing Approaches

| Methods            | Feature Values | Labels                | Transductive / Inductive |
|--------------------|----------------|-----------------------|--------------------------|
| BiasMC (ICML'15)   | Complete       | Positive (weak)       | Both                     |
| WELL (AAAI'10)     | Complete       | Positive (weak)       | Transductive             |
| LEML (ICML'14)     | Complete       | Positive and negative | Inductive                |
| CoEmbed (AAAI'17)  | Complete       | Positive and negative | Transductive             |
| MC-1 (NIPS'10)     | Missing        | Positive and negative | Transductive             |
| DirtyIMC (NIPS'15) | Noisy          | Positive and negative | Both                     |
| Our study          | Missing        | Positive (weak)       | Both                     |
Outline • Introduction and Problem Definition • Our Methods • Experimental Results
Collaborative Embedding: A Transfer Learning Approach
• Incomplete feature matrix X (signatures of security events): low-rank, least-squares-based matrix factorization, X ≈ UV^T
• Partially observed label matrix Y (security event class): cost-sensitive logistic matrix factorization, Y ≈ φ(WH^T), i.e., a logit model plus a regularizer R(W)
• The two factorizations are coupled through a shared embedding space
Feature Matrix Completion
• Low-rank completion of the partially observed feature matrix:
  U*, V* = argmin_{U,V} ‖ Ω ⊙ (X − UV^T) ‖_F^2   (⊙: element-wise product)
  – U: projected features of the data instances
  – V: spanning basis defining the projection subspace
(See the sketch below.)
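A minimal NumPy sketch of the masked least-squares objective above, optimized with plain gradient descent; the rank, step size, and iteration count are assumptions, and the paper's actual solver may differ.

```python
# Masked low-rank objective: min_{U,V} || Omega * (X - U V^T) ||_F^2
import numpy as np

def masked_factorize(X, Omega, rank=10, lr=1e-3, n_iter=500, seed=0):
    rng = np.random.default_rng(seed)
    N, D = X.shape
    U = 0.1 * rng.standard_normal((N, rank))
    V = 0.1 * rng.standard_normal((D, rank))
    for _ in range(n_iter):
        R = Omega * (U @ V.T - X)   # residual restricted to observed entries
        U -= lr * (R @ V)           # gradients up to a constant factor of 2
        V -= lr * (R.T @ U)
    return U, V

# Usage sketch: U, V = masked_factorize(X_obs, Omega); X_hat = U @ V.T
```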
Collaborative Embedding: A Transfer Learning Approach
• Next: the label side of the shared embedding, i.e., the cost-sensitive logistic matrix factorization Y ≈ φ(WH^T) with regularizer R(W)
Label Matrix Reconstruction
• Cost-sensitive logistic matrix factorization on the positive-unlabeled class assignment matrix:
  W*, H* = argmin_{W,H} Σ_{i,j} Γ_{i,j} log( 1 + exp( (1 − 2Y_{i,j}) X_{i,:}(WH^T)_{:,j} ) ) + λ( ‖W‖^2 + ‖H‖^2 )
  – Γ_{i,j} = α for observed and positively labeled entries (Y_{i,j} = 1)
  – Γ_{i,j} = 1 − α for unobserved, thus unlabeled, entries (Y_{i,j} = 0)
• Predicted label assignment: Ŷ = I( X(WH^T) )
(A code sketch follows below.)
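A hedged NumPy sketch of this cost-sensitive logistic factorization, again optimized with plain gradient descent; `alpha`, the rank, the regularization weight, and the step size are assumed hyper-parameters, and the exact optimizer in the paper may differ.

```python
# Cost-sensitive logistic matrix factorization on a positive-unlabeled Y.
# Logit for entry (i, j) is X[i, :] @ (W @ H.T)[:, j]; Gamma down-weights
# the unlabeled (Y = 0) entries relative to the observed positives.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cs_logistic_mf(X, Y, alpha=0.9, rank=10, lam=1e-2, lr=1e-3, n_iter=500, seed=0):
    rng = np.random.default_rng(seed)
    N, D = X.shape
    M = Y.shape[1]
    W = 0.1 * rng.standard_normal((D, rank))
    H = 0.1 * rng.standard_normal((M, rank))
    Gamma = np.where(Y == 1, alpha, 1.0 - alpha)   # cost-sensitive weights
    S = 1.0 - 2.0 * Y                              # -1 for positives, +1 for unlabeled
    for _ in range(n_iter):
        XW = X @ W
        Z = XW @ H.T                               # N x M logits
        G = Gamma * S * sigmoid(S * Z)             # d(loss)/d(Z)
        grad_W = X.T @ (G @ H) + 2 * lam * W
        grad_H = G.T @ XW + 2 * lam * H
        W -= lr * grad_W
        H -= lr * grad_H
    return W, H

# Usage sketch: W, H = cs_logistic_mf(X_hat, Y_pu); scores = sigmoid(X_hat @ W @ H.T)
```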
ColEmbed: Collaborative Embedding
• Collaborative embedding as a solution to learning with incomplete features and weak labels, combining:
  – Feature completion
  – Label completion
  – Functional feature extraction
  – Tolerance to residual error
(A hedged sketch of one way to combine the two objectives follows below.)
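One plausible way to evaluate a combined objective built from the two components on the previous slides, shown only to make the coupling concrete; the trade-off weight `c` and feeding the completed matrix UV^T into the label term are assumptions, not necessarily the paper's exact formulation.

```python
# Combined objective sketch: feature-completion loss + weighted label loss.
import numpy as np

def colembed_loss(X, Omega, Y, U, V, W, H, c=1.0, alpha=0.9, lam=1e-2):
    X_hat = U @ V.T                                   # completed feature matrix
    feat_loss = np.sum((Omega * (X - X_hat)) ** 2)    # feature completion term
    Gamma = np.where(Y == 1, alpha, 1.0 - alpha)      # cost-sensitive weights
    S = 1.0 - 2.0 * Y
    Z = X_hat @ W @ H.T                               # logits from completed features
    label_loss = np.sum(Gamma * np.logaddexp(0.0, S * Z))  # log(1 + exp(S * Z))
    reg = lam * (np.sum(W ** 2) + np.sum(H ** 2))
    return feat_loss + c * (label_loss + reg)
```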
Upper Bound of Reconstruction Error
• Provable reconstruction of the missing label entries
  – M, D: the number of labels and the dimensionality of the feature vectors
  – N: the number of training samples
  – t: the upper bound on the spectral norm of H
  – the maximum L2-norm of the row vectors in X
• The label reconstruction error is of the order of 1/(NM(1 − ))
ColEmbed-L
• Linear collaborative embedding: f(X̂) = X̂S^T
• Flexible for both transductive and inductive settings (see the sketch below)
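A sketch of inductive prediction with the linear model f(X̂) = X̂S^T: a new, partially observed instance is first completed by projecting its observed coordinates onto the learned basis V, then scored. Taking S^T = WH^T from the factorization above, and the least-squares projection itself, are assumptions for illustration.

```python
# Inductive prediction with a linear map from completed features to label scores.
import numpy as np

def inductive_predict(x_obs, omega, V, S):
    # x_obs: (D,) feature vector with zeros at unobserved positions
    # omega: (D,) 0/1 observation mask, V: (D, K) basis, S: (M, D) linear model
    idx = omega.astype(bool)
    # Complete the instance: embedding u minimizing ||x_obs[idx] - V[idx] @ u||^2
    u, *_ = np.linalg.lstsq(V[idx], x_obs[idx], rcond=None)
    x_hat = V @ u                       # completed feature vector
    return x_hat @ S.T                  # multi-label scores f(x_hat)

# Usage sketch: S = (W @ H.T).T  (shape M x D); scores = inductive_predict(x, mask, V, S)
```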
ColEmbed-NL
• Non-linear embedding: a linear combination of random feature expansions of the completed features
• Ali Rahimi and Benjamin Recht, Random Features for Large-Scale Kernel Machines, NIPS 2007
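A sketch of the random Fourier feature map from Rahimi and Recht that this non-linear variant builds on, approximating an RBF kernel so that a linear model on φ(X̂) behaves like a kernel model on X̂; the bandwidth and number of random features are assumptions.

```python
# Random Fourier features approximating an RBF kernel k(x, y) = exp(-gamma * ||x - y||^2).
import numpy as np

def random_fourier_features(X, n_features=256, gamma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    D = X.shape[1]
    Wr = rng.normal(scale=np.sqrt(2.0 * gamma), size=(D, n_features))  # random frequencies
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)                 # random phases
    return np.sqrt(2.0 / n_features) * np.cos(X @ Wr + b)

# Usage sketch: Phi = random_fourier_features(X_hat); fit the linear model on Phi
```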
Training Process
• Stochastic gradient descent: scales to large matrix factorization problems
• Non-linear case: the same stochastic updates, applied on top of the random feature expansion
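An illustrative stochastic-gradient step for the masked feature factorization, sampling a mini-batch of observed entries and updating only the touched rows of U and V; the batch size and learning rate are assumptions, and the full ColEmbed update would also include the label term.

```python
# One SGD step over a mini-batch of observed entries of the feature matrix.
import numpy as np

def sgd_step(X, obs_idx, U, V, lr=0.01, batch=256, rng=None):
    rng = rng or np.random.default_rng()
    rows, cols = obs_idx                          # indices where Omega == 1
    pick = rng.integers(0, len(rows), size=batch)
    i, j = rows[pick], cols[pick]
    err = (U[i] * V[j]).sum(axis=1) - X[i, j]     # residual on the sampled entries
    gU = err[:, None] * V[j]
    gV = err[:, None] * U[i]
    np.subtract.at(U, i, lr * gU)                 # scatter updates; duplicates are summed
    np.subtract.at(V, j, lr * gV)

# Usage sketch: rows, cols = np.nonzero(Omega); then call sgd_step repeatedly
```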
Outline • Introduction and Problem Definition • Our Methods • Experimental Results
Empirical Study
• The empirical study aims to answer the following questions:
  – Is it really helpful to reconstruct features and labels simultaneously?
  – Do transductive and inductive classification present consistently high precision?
  – Does the proposed method classify better than the state-of-the-art approaches?
  – Does the proposed method scale well?
Methods to Compare
• Baselines assuming complete feature values:
  – BiasMC (transductive) and BiasMC-I (inductive), by PU-learning
  – LEML (cost-sensitive binomial loss), needs both positive and negative labels
  – LEML (least-squares loss)
  – WELL, weak labels
  – CoEmbed, needs both positive and negative labels
• Baselines handling missing or noisy feature values:
  – MC-1, needs both positive and negative labels
  – DirtyIMC, needs both positive and negative labels
• For baselines requiring complete features, the incomplete feature matrix is first completed using the convex low-rank matrix completion approach, denoted MC-Convex
Evaluation Data Sets
• Public benchmark data sets
• Real-world IoT device event detection data
Feature Reconstruction
• Lower error in estimating the missing feature values, compared to the baseline method
Transductive Classification Accuracy • Higher classification accuracy than baseline methods
Inductive Classification Accuracy • Higher classification accuracy than baseline methods
On Real-world Security Data
• Consistently better performance in classifying real-world security data, compared to the baseline methods
• Evaluated in both transductive and inductive test modes
Efficiency Evaluation
• Run time in seconds grows linearly w.r.t. the number of instances
Takeaway
• Collaboratively reconstructing missing feature values and learning missing labels is beneficial for both tasks.
• Our proposed method is applicable in both transductive and inductive classification settings.
• Our proposed method outperforms the state-of-the-art approaches.
Future Work • Learning with incomplete data streams • Deep Neural Nets as a more powerful functional mapping between features and labels • Structured feature / label missing patterns • Further extension to multi-task learning