Embedding as a Tool for Algorithm Design
Le Song
College of Computing, Center for Machine Learning
Georgia Institute of Technology
What is machine learning (ML)? Designing algorithms and systems that can improve their performance with data.
The best design pattern for big data? Embedding structures.
Ex 1: Prediction for structured data
Drug/materials: effective or ineffective? Information spread: viral or non-viral? Code graphs: benign or malicious? Natural language: positive or negative?
Big dataset, explosive feature space
2.3 million organic materials; predict efficiency (PCE, 0-12%). Enumerating structural features (elements, level-1, level-2, ... substructures) yields an enormous sparse feature vector.

Method | Dimension | MAE
Level 6 | 1.3 billion | 0.096
Embedding | 0.1 million | 0.085

Embedding reduces the model size by 10,000 times.
Ex 2: Social information network
Modeling who will do what, and when? (e.g., users Jacob, David, Alice, Christine)
Complex behavior is not well modeled
~2 million internet TV views, 7,100 users, 385 programs.
Challenges: how to deal with users or items with no data? How to predict the future event, and how long until a user returns?
Result: the embedding model reduces the return-time prediction error (MAE, in hours) by 5 fold compared with tensor factorization.
Ex 3: Combinatorial optimization over graphs

Application | Optimization problem
Influence maximization | Minimum vertex/set cover
Community discovery | Maximum cut
Resource scheduling | Traveling salesman

All are NP-hard problems.
Simple heuristics do not exploit data
2-approximation for minimum vertex cover: repeat until all edges are covered: 1. select the uncovered edge with the largest total degree and add its endpoints to the cover (see the sketch below).
The decision rule is not data-driven. Can we learn it from data, and learn to be near optimal? (Figure: approximation ratio, 1 to 1.3.)
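A minimal Python sketch of this degree-greedy vertex-cover heuristic (the function name and the example graph are illustrative; adding both endpoints of the selected edge is what yields the classical 2-approximation guarantee):

```python
def greedy_vertex_cover(edges):
    """Degree-greedy vertex cover heuristic (sketch).

    Repeatedly pick the uncovered edge whose endpoints have the largest
    total degree and add both endpoints to the cover.
    """
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1

    cover, uncovered = set(), {tuple(e) for e in edges}
    while uncovered:
        # Uncovered edge with the largest total endpoint degree.
        u, v = max(uncovered, key=lambda e: degree[e[0]] + degree[e[1]])
        cover.update((u, v))
        uncovered = {e for e in uncovered if cover.isdisjoint(e)}
    return cover

# Toy example: a star attached to a triangle.
print(greedy_vertex_cover([(0, 1), (0, 2), (0, 3), (3, 4), (4, 5), (3, 5)]))
```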
Fundamental problems
Given a structure χ whose nodes carry attributes / raw information X_1, ..., X_6:
How to describe a node? How to describe the entire structure? How to incorporate various kinds of information? How to do it efficiently?
Represent structure as a latent variable model (LVM)
The structure χ is mapped to an LVM over a graph G = (V, E): each node i carries a continuous latent variable H_i in addition to its categorical/continuous raw features X_i.
Joint likelihood: p({H_i}, {X_i}) ∝ ∏_{i∈V} Φ(H_i, X_i | θ_Φ) · ∏_{(i,j)∈E} Ψ(H_i, H_j | θ_Ψ), with nonnegative node potentials Φ and edge potentials Ψ.
[Dai, Dai & Song 2016]
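As a sketch of this factorization, assuming discrete latent states and placeholder potential functions supplied by the caller, the (unnormalized) joint can be evaluated as:

```python
import numpy as np

def unnormalized_joint(H, X, edges, node_pot, edge_pot):
    """prod_i Phi(H_i, X_i) * prod_{(i,j) in E} Psi(H_i, H_j), computed in log space.

    H, X              : sequences of latent values H_i and observed features X_i
    edges             : list of (i, j) pairs
    node_pot, edge_pot: nonnegative potential functions supplied by the caller
    """
    log_p = sum(np.log(node_pot(H[i], X[i])) for i in range(len(H)))
    log_p += sum(np.log(edge_pot(H[i], H[j])) for i, j in edges)
    return np.exp(log_p)   # proportional to p({H_i}, {X_i}); normalizer omitted
```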
Posterior distribution as features
The posteriors p(H_i | {x_k}) give features of the nodes, μ_1(χ, W), μ_2(χ, W), ..., and their aggregate μ̄(χ, W) gives a feature of the entire structure:
p(H_i | {x_k}) = ∫ p({H_k}, {x_k}) d{H_k : k ≠ i} / p({x_k})
These features capture both nodal and topological information, and aggregate information from distant nodes.
[Dai, Dai & Song 2016]
Mean field algorithm aggregates information
Approximate the posterior p(H_i | {x_k}) ≈ q_i(H_i) via fixed point updates:
1. Initialize q_i(H_i), ∀ i
2. Iterate many times:
q_i(H_i) ← Φ(H_i, x_i) · exp( Σ_{j∈N(i)} ∫ q_j(H_j) log Ψ(H_i, H_j) dH_j ), ∀ i  (up to normalization)
Viewed as an operator: q_i = 𝒯 ∘ ( x_i, {q_j(H_j)}_{j∈N(i)} ).
[Song et al. 10a,b; 11a,b]
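A minimal numpy sketch of this fixed-point update for a pairwise model with discrete latent states (the potentials, graph, and iteration count are illustrative placeholders, not the talk's actual model):

```python
import numpy as np

def mean_field(node_pot, edge_pot, neighbors, n_iters=50):
    """Mean-field fixed-point updates for a pairwise model with discrete H_i.

    node_pot[i, h]  ~ Phi(H_i = h, x_i)        (nonnegative node potential)
    edge_pot[h, h'] ~ Psi(H_i = h, H_j = h')   (shared nonnegative edge potential)
    neighbors[i]    = list of neighbors of node i
    Returns q[i, h] ≈ p(H_i = h | {x_j}).
    """
    n, k = node_pot.shape
    q = np.full((n, k), 1.0 / k)                 # 1. initialize q_i uniformly
    log_edge = np.log(edge_pot)
    for _ in range(n_iters):                     # 2. iterate the fixed-point update
        for i in range(n):
            # Sum over neighbors of E_{q_j}[ log Psi(H_i, H_j) ].
            msg = sum(log_edge @ q[j] for j in neighbors[i])
            unnorm = node_pot[i] * np.exp(msg)
            q[i] = unnorm / unnorm.sum()
    return q
```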
Embedding of distribution
A density p(X) is mapped to feature space through its embedding μ_X = 𝔼_p[φ(X)] (collecting, e.g., mean, variance, and higher order moments); a different density q(X) maps to a different embedding μ'_X.
The map is injective for a rich nonlinear feature φ(x), so μ_X is a sufficient statistic of p(X).
Operator view: a computation 𝒯 ∘ p(x) on the distribution can equivalently be carried out as an operation 𝒯 ∘ μ_X on its embedding.
[Smola, Gretton, Song & Schölkopf 2007]
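A small sketch of an empirical mean embedding: map samples through a feature map φ and average. The finite-dimensional φ below is only an illustrative assumption; in the kernel setting φ is implicit and rich enough to make the embedding injective.

```python
import numpy as np

def mean_embedding(samples, feature_map):
    """Empirical embedding mu_X = (1/m) * sum_i phi(x_i) of a distribution p(X)."""
    return np.mean([feature_map(x) for x in samples], axis=0)

# Illustrative feature map: a few moments plus trigonometric features of a scalar X.
phi = lambda x: np.array([x, x**2, x**3, np.cos(x), np.sin(x)])

rng = np.random.default_rng(0)
p_samples = rng.normal(0.0, 1.0, size=1000)   # samples from p(X)
q_samples = rng.normal(0.5, 1.0, size=1000)   # samples from a different q(X)

mu_p, mu_q = mean_embedding(p_samples, phi), mean_embedding(q_samples, phi)
print(np.linalg.norm(mu_p - mu_q))            # different distributions => different embeddings
```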
Structure2vec (S2V): embedding mean field
Instead of the approximate posteriors, maintain embeddings of them, p(H_i | {x_k}) ↦ μ_i, via fixed point updates:
1. Initialize μ_i^(0), ∀ i
2. Iterate many times: μ_i^(t) ← 𝒯 ∘ ( x_i, {μ_j^(t-1)}_{j∈N(i)} ), ∀ i, for t = 1, 2, ...
How to parametrize 𝒯? It depends on the unknown potentials Φ(H_i, X_i) and Ψ(H_i, H_j).
Directly parameterize the nonlinear mapping
μ_i ← 𝒯 ∘ ( x_i, {μ_j}_{j∈N(i)} )
Any universal nonlinear function will do. E.g., with μ_i ∈ ℝ^d and x_i ∈ ℝ^n, a neural network parameterization:
μ_i ← σ( W_1 x_i + W_2 Σ_{j∈N(i)} μ_j ),
where W_1 is a d × n matrix, W_2 is a d × d matrix, and σ is a nonlinearity such as max(0, ·), tanh(·), or sigmoid(·).
Learn with supervision, unsupervised learning, or reinforcement learning.
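A minimal numpy sketch of this parameterized update, run synchronously over the whole graph for a few iterations (dimensions, the zero initialization, and the ReLU choice are illustrative assumptions):

```python
import numpy as np

def structure2vec_embed(X, adj, W1, W2, n_iters=4):
    """S2V-style embedding sketch: mu_i <- relu(W1 x_i + W2 * sum_{j in N(i)} mu_j).

    X   : (n, p) node features x_i
    adj : (n, n) 0/1 adjacency matrix of the graph
    W1  : (d, p) and W2 : (d, d) learnable weight matrices
    Returns node embeddings mu of shape (n, d) and the graph embedding sum_i mu_i.
    """
    n, d = X.shape[0], W1.shape[0]
    mu = np.zeros((n, d))                                    # 1. initialize mu_i
    for _ in range(n_iters):                                 # 2. iterate the embedding update
        neighbor_sum = adj @ mu                              # sum_{j in N(i)} mu_j for every i
        mu = np.maximum(0.0, X @ W1.T + neighbor_sum @ W2.T) # ReLU nonlinearity
    return mu, mu.sum(axis=0)                                # node and whole-graph embeddings
```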
Embedding belief propagation
Approximate p(H_i | {x_k}, θ) as q_i(H_i) ∝ Φ(H_i, x_i | θ) · ∏_{j∈N(i)} m_ji(H_i), with iterative message updates:
1. Initialize m_ij(H_j), ∀ i, j
2. Iterate many times:
m_ij(H_j) ← ∫ Φ(H_i, x_i | θ) Ψ(H_i, H_j | θ) ∏_{ℓ∈N(i)\j} m_ℓi(H_i) dH_i, ∀ i, j
Embedded as operators: ν_ij ← 𝒯 ∘ ( x_i, {ν_ℓi}_{ℓ∈N(i)\j} ) for the messages, and μ_i ← 𝒯' ∘ ( x_i, {ν_ℓi}_{ℓ∈N(i)} ) for the node features.
[Song et al. 10a,b; 11a,b]
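For reference, a compact numpy sketch of the classical message updates being embedded here, for discrete latent states with placeholder potentials (neighbor lists are assumed symmetric, messages are normalized for numerical stability; the embedded version replaces these messages with vectors ν_ij):

```python
import numpy as np

def loopy_bp(node_pot, edge_pot, neighbors, n_iters=30):
    """Loopy BP sketch: m_ij(H_j) <- sum_{H_i} Phi(H_i, x_i) Psi(H_i, H_j) prod_{l in N(i)\\j} m_li(H_i)."""
    k = node_pot.shape[1]
    msgs = {(i, j): np.full(k, 1.0 / k)            # 1. initialize messages m_ij
            for i in neighbors for j in neighbors[i]}
    for _ in range(n_iters):                       # 2. iterate message updates
        new = {}
        for (i, j) in msgs:
            prod = node_pot[i].copy()
            for l in neighbors[i]:
                if l != j:
                    prod *= msgs[(l, i)]
            m = edge_pot.T @ prod                  # sum over H_i of Psi(H_i, H_j) * prod(H_i)
            new[(i, j)] = m / m.sum()
        msgs = new
    # Node beliefs q_i(H_i) ∝ Phi(H_i, x_i) * prod_{l in N(i)} m_li(H_i).
    beliefs = np.array([node_pot[i] * np.prod([msgs[(l, i)] for l in neighbors[i]], axis=0)
                        for i in sorted(neighbors)])
    return beliefs / beliefs.sum(axis=1, keepdims=True)
```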
Ex 1: Prediction for structured data (revisited)
Drug/materials: effective or ineffective? Information spread: viral or non-viral? Code graphs: benign or malicious? Natural language: positive or negative?
Algorithm learning
Given n data points χ_1, χ_2, ..., χ_n and their labels y_1, y_2, ..., y_n, estimate the parameters W and u via
min_{u,W} L(u, W) := Σ_{j=1}^{n} ( y_j − u^T μ̄(W, χ_j) )^2

Computation | Operation | Similar to
Objective L(u, W) | A sequence of nonlinear mappings over the graph | Graphical model inference
Gradient ∂L/∂W | Chain rule of derivatives in reverse order | Back propagation in deep learning
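A hedged PyTorch sketch of this learning setup: embed each graph with the S2V-style update, read out a prediction with a linear layer u, and fit all parameters by back propagation on the squared loss (graph data, dimensions, and optimizer settings are illustrative assumptions):

```python
import torch
import torch.nn as nn

class S2VRegressor(nn.Module):
    """S2V-style graph embedding followed by a linear readout u (sketch)."""
    def __init__(self, p, d, n_iters=4):
        super().__init__()
        self.W1 = nn.Linear(p, d, bias=False)        # d x p matrix applied to x_i
        self.W2 = nn.Linear(d, d, bias=False)        # d x d matrix applied to neighbor sums
        self.readout = nn.Linear(d, 1, bias=False)   # the vector u
        self.n_iters = n_iters

    def forward(self, X, adj):
        # X: (n, p) node features, adj: (n, n) adjacency matrix of one graph chi.
        mu = torch.zeros(X.shape[0], self.W2.in_features)
        for _ in range(self.n_iters):                # sequence of nonlinear mappings over the graph
            mu = torch.relu(self.W1(X) + self.W2(adj @ mu))
        return self.readout(mu.sum(dim=0)).squeeze() # u^T mu_bar(W, chi)

def train(graphs, p, d, epochs=100):
    # graphs: list of (X_j, adj_j, y_j) tuples with a scalar label y_j.
    model = S2VRegressor(p, d)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(epochs):
        opt.zero_grad()
        loss = sum((model(X, adj) - y) ** 2 for X, adj, y in graphs)  # squared-loss objective
        loss.backward()                              # gradients via back propagation
        opt.step()
    return model
```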
10,000x smaller model but accurate prediction
Harvard clean energy project: predict material efficiency (0-12); 2.3 million organic molecules; 90% for training, 10% for testing.

Method | Test MAE | Test RMSE | # parameters
Mean predictor | 1.986 | 2.406 | 1
WL level-3 | 0.143 | 0.204 | 1.6 m
WL level-6 | 0.096 | 0.137 | 1.3 b
S2V-MF | 0.091 | 0.125 | 0.1 m
S2V-BP | 0.085 | 0.117 | 0.1 m

~4% relative error.
Ex 2: Social information network (revisited)
Modeling who will do what, and when? (e.g., users Jacob, David, Alice, Christine)