What deep generative models can do for you: Opportunities, challenges, and open questions Giulia Fanti Carnegie Mellon University 1
Zinan Lin, Hao Liang, Alankar Jain, Kiran Thekumparampil, Chen Wang, Sewoong Oh, Vyas Sekar 2
Common ML tools in networking • Classification: classifying network traffic • Reinforcement learning: traffic engineering • Unsupervised methods: clustering signals 3
This talk: Generative models • What are generative models? • Why are they relevant now? • How can they be useful in networking? • What are the limitations? 4
What is a generative model? • Models the joint probability distribution p(x) of a dataset • Example: a time series x[0], …, x[t−1] over time t, with x[t] = f(x[0:t−1], θ) + n[t], where θ are the learned model parameters and n[t] is noise • How do we pick f? • How to combine noise? 5
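To make the recursion x[t] = f(x[0:t−1], θ) + n[t] concrete, here is a minimal sketch of sampling from such a model. The linear choice of f (`linear_ar`), the Gaussian noise, and all parameter values are assumptions for illustration; the slides leave both f and the noise model open.

```python
import random

def sample_series(f, theta, x0, length, noise_std, rng):
    """Sample x[t] = f(history, theta) + n[t], with Gaussian noise n[t]."""
    x = list(x0)
    while len(x) < length:
        noise = rng.gauss(0.0, noise_std)
        x.append(f(x, theta) + noise)
    return x

def linear_ar(history, theta):
    """Hypothetical f: weighted sum of the last len(theta) samples."""
    recent = history[-len(theta):]
    return sum(w * v for w, v in zip(theta, recent))

rng = random.Random(0)
series = sample_series(linear_ar, theta=[0.3, 0.6], x0=[1.0, 1.0],
                       length=50, noise_std=0.1, rng=rng)
```

Swapping `linear_ar` for a learned function is what separates the hand-designed models from the deep generative models discussed later in the talk.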
How are they used in the networking community? • Use domain knowledge to extract high-level insights • Design parametric model to model those insights • Use data to populate parameters • Example: network traffic has temporal patterns (e.g., a period of 1 day): x[t] = sin(θt) + n[t] • Melamed (1993), Denneulin et al (2004), Swing, BURSE, Hierarchical bundling, Di et al (2014), …
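As a rough illustration of "use data to populate parameters" for the model x[t] = sin(θt) + n[t], the sketch below estimates θ by a least-squares grid search. The hourly sampling, the noise level, and the candidate grid are assumptions made for this example, not details from the talk.

```python
import math
import random

def fit_frequency(x, candidates):
    """Return the candidate theta minimizing squared error against sin(theta * t)."""
    def sse(theta):
        return sum((x_t - math.sin(theta * t)) ** 2 for t, x_t in enumerate(x))
    return min(candidates, key=sse)

rng = random.Random(0)
true_theta = 2 * math.pi / 24  # assumed: one cycle per 24 hourly samples (1 day)
x = [math.sin(true_theta * t) + rng.gauss(0.0, 0.1) for t in range(24 * 7)]

# Candidate periods of 6..48 hours, converted to angular frequency.
candidates = [2 * math.pi / period for period in range(6, 49)]
theta_hat = fit_frequency(x, candidates)
```

The fit works only because the sinusoidal shape was designed in up front, which is the flexibility and fidelity limitation the talk raises next.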
Problems with this approach • Poor flexibility: requires a new design for every type of data • Poor fidelity: doesn't capture properties that were not explicitly modeled 7
Deep generative models • Design a neural network to produce data of the right dimensionality • Use data to populate the parameters θ (a real-valued vector) 8
Generative Adversarial Networks (GANs): Breakthrough in generative modeling • Prior approaches • Likelihood-based • Heavily rely on domain knowledge • GANs • Adversarial learning • Limited a priori assumptions 9
Generative Adversarial Networks (GANs) [Diagram: noise z → generator G → discriminator D, which labels samples REAL or FAKE] 10
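For reference, the generator/discriminator game in the diagram is usually written as the minimax objective from the original GAN formulation (Goodfellow et al., 2014). The slides do not say which loss variant the later systems use, so treat this as the textbook form only; here p_data is the data distribution and p_z the noise distribution.

```latex
\min_{G} \max_{D} \;
  \mathbb{E}_{x \sim p_{\mathrm{data}}}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_{z}}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```

D is trained to output values near 1 on real samples and near 0 on generated ones, while G is trained to make D(G(z)) large.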
How can we use these tools in networking? • Sharing synthetic data • Discovering malicious inputs to black-box systems • Understanding complex datasets 11
Use case 1: Sharing synthetic data • Zinan Lin, Alankar Jain, Chen Wang, Vyas Sekar • github.com/fjxmlzn/DoppelGANger 12
Key stumbling block: Access to data • Enterprises (Division A, Division B): collaborative opportunities go untapped • Researchers: unreproducible research, limited potential 13
(Not a new) idea: Synthetic data models [Diagram: enterprises, researchers, and a data clearinghouse (ISAC, ISAO) exchange generative models] 14
Two main problems • Fidelity: real vs. generated data • Privacy: business secrets, user data [Diagram: real data → generative model → generated data] 15
Existing methods vs. DoppelGANger (generating synthetic time series data with GANs) [Plot, axes: privacy vs. fidelity; points: expert-designed parametric models, machine-learned models, anonymized raw data] 16
What kinds of data are we interested in? • Multi-dimensional time series • With metadata, e.g., (U.S., mobile traffic) 17
Datasets: Networking, security, and systems • Cluster traces • Google: task resource usage logs from 12.5k machines (2011) • IBM: resource usage measurements from 100k containers • Traffic measurements • Wikipedia web traffic: # daily views of Wikipedia articles (2016) • FCC Measuring Broadband America: Internet traffic and performance measurements from consumer devices around the country 18
DoppelGANger: Time series generation [Architecture diagram: noise → metadata generator (MLP) → metadata (A_1, …, A_m); noise → min/max generator (MLP) → (min ± max)/2; noise → RNN, unrolled, emitting records R_1, …, R_S through R_{T−S+1}, …, R_T; an auxiliary discriminator and a discriminator each output 1: real, 0: fake] 19
Part I: RNN + Batched Generation [Figure panels: unbatched vs. batched] 20
Challenge: Training on high-dynamic-range time series [Plot, x-axis: day] 21
Part II: Auto normalization • Standard normalization: normalize by global min/max • DoppelGANger: normalize each time series individually • Store (min, max) as “fake” metadata 22
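The per-series normalization can be sketched as below. This is a minimal illustration assuming scaling to [0, 1]; the function names and the toy dataset are hypothetical, and the actual DoppelGANger code may use a different range or encoding for the stored values.

```python
def auto_normalize(series):
    """Scale one time series to [0, 1]; return it with its (min, max) as extra metadata."""
    lo, hi = min(series), max(series)
    span = (hi - lo) or 1.0  # avoid division by zero for a constant series
    return [(v - lo) / span for v in series], (lo, hi)

def denormalize(normalized, min_max):
    """Invert auto_normalize using the stored (min, max) metadata."""
    lo, hi = min_max
    span = (hi - lo) or 1.0
    return [v * span + lo for v in normalized]

# Two series with very different ranges (toy values).
dataset = [[1.0, 2.0, 3.0], [1000.0, 3000.0, 2000.0]]
normalized = [auto_normalize(s) for s in dataset]
```

After this step both series span the same range, so the small series is no longer squashed toward zero as it would be under a global min/max, and the stored (min, max) pair lets the original scale be restored at generation time.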
Challenge: Complex relationships in metadata • Need to capture relation between metadata and time series • E.g., Cable vs Mobile users • Straw man: Joint generator of metadata and time series • Problem: too hard for a single generator [Histogram, before (single generator): count vs. time series min value] 24
Part III: Decoupled Generation, Auxiliary Discriminator • Two stage decoupling • Generate metadata (using a standard MLP) • Generate measurements conditioned on metadata • Auxiliary discriminator for metadata alone 25
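The two-stage decoupling amounts to sampling metadata first and then sampling measurements conditioned on it. The sketch below shows only that control flow, with toy random samplers standing in for the MLP and RNN generators; the `access` field, the class means, and the function names are all hypothetical.

```python
import random

def generate_metadata(rng):
    """Stage 1 stand-in for the metadata generator (an MLP in the slides)."""
    return {"access": rng.choice(["cable", "mobile"])}

def generate_measurements(metadata, length, rng):
    """Stage 2 stand-in: measurements conditioned on the sampled metadata."""
    # Hypothetical per-class mean, only to make the conditioning visible.
    mean = 100.0 if metadata["access"] == "cable" else 10.0
    return [rng.gauss(mean, 1.0) for _ in range(length)]

def generate_sample(length, rng):
    """Sample metadata, then a time series that depends on it."""
    metadata = generate_metadata(rng)
    return metadata, generate_measurements(metadata, length, rng)

rng = random.Random(0)
samples = [generate_sample(length=5, rng=rng) for _ in range(20)]
```

Because metadata is produced in its own stage, it can also be judged on its own, which is the role the slide gives to the auxiliary discriminator.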
Histogram of per-time-series values, without vs. with the auxiliary discriminator [Panels: without auxiliary discriminator; with auxiliary discriminator. Axes: count vs. time series min value] 26
Putting it together [Architecture diagram: noise → metadata generator (MLP) → metadata (A_1, …, A_m); noise → min/max generator (MLP) → (min ± max)/2; noise → RNN, unrolled, emitting records R_1, …, R_S through R_{T−S+1}, …, R_T; an auxiliary discriminator and a discriminator each output 1: real, 0: fake] 27
Temporal Correlations Microbenchmark 28
Downstream task: Predicting job failures in a compute cluster • Train on synthetic, test on real 29
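A "train on synthetic, test on real" evaluation can be sketched as follows. Everything here is a toy stand-in: the one-dimensional feature, the threshold classifier, and the shifted "synthetic" data are assumptions chosen to keep the example self-contained, not the predictor or data used in the talk.

```python
import random

def make_data(n, rng, shift=0.0):
    """Toy (feature, label) pairs: label 1 ("failed") has a higher mean feature."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        data.append((rng.gauss(3.0 * label + shift, 1.0), label))
    return data

def fit_threshold(train):
    """Nearest-centroid classifier in 1-D: threshold halfway between class means."""
    means = []
    for cls in (0, 1):
        values = [x for x, y in train if y == cls]
        means.append(sum(values) / len(values))
    return (means[0] + means[1]) / 2.0

def accuracy(threshold, test):
    """Fraction of test points where (feature > threshold) matches the label."""
    return sum((x > threshold) == bool(y) for x, y in test) / len(test)

rng = random.Random(0)
real_train, real_test = make_data(500, rng), make_data(500, rng)
synthetic_train = make_data(500, rng, shift=0.2)  # stand-in for generator output

acc_real = accuracy(fit_threshold(real_train), real_test)
acc_synthetic = accuracy(fit_threshold(synthetic_train), real_test)
```

The gap between `acc_real` and `acc_synthetic` is the quantity of interest: a small gap suggests the synthetic data preserves what the downstream task needs.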
Evaluating privacy • Protecting business secrets • Aggregate functions of the data • User privacy • Differential privacy • Robustness against membership inference 30
Differentially-private SGD kills fidelity in GANs 31
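For context on where the fidelity loss comes from, the sketch below shows one gradient aggregation step in the commonly described form of DP-SGD (per-example clipping followed by Gaussian noise, as in Abadi et al., 2016). The function name and toy gradients are hypothetical, and privacy accounting is omitted.

```python
import math
import random

def dp_sgd_gradient(per_example_grads, clip_norm, noise_multiplier, rng):
    """Clip each per-example gradient to clip_norm, sum, add Gaussian noise, average."""
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for grad in per_example_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i, g in enumerate(grad):
            total[i] += g * scale
    sigma = noise_multiplier * clip_norm  # noise scale tied to the clipping bound
    batch = len(per_example_grads)
    return [(t + rng.gauss(0.0, sigma)) / batch for t in total]

grads = [[3.0, 4.0], [0.3, 0.4]]  # toy gradients with norms 5.0 and 0.5
clipped_only = dp_sgd_gradient(grads, clip_norm=1.0, noise_multiplier=0.0,
                               rng=random.Random(0))
noisy = dp_sgd_gradient(grads, clip_norm=1.0, noise_multiplier=1.0,
                        rng=random.Random(0))
```

Every update is both truncated and perturbed, so the training signal is degraded at each step; this is one plausible reading of why the slide reports a fidelity collapse.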
Open questions: Synthetic data generation • Fidelity • Long sequences of data • Stateful protocols • Privacy • Differentially-private GANs • New privacy metrics? 32
Use case 2: Identifying malicious inputs to black-box systems • Zinan Lin, Hao Liang, Vyas Sekar 33
Black-box Devices and Systems Abound • IoT devices • Control units in vehicles / manufacturing • Servers / routers • NO source code / binary / protocol format / design doc 11/14/2019 Towards Oblivious Network Analysis using GANs HotNets'19 34
Identifying Attack Packets is Hard [Diagram: attacker sends packets to the system] • We want to identify attack packets, but do NOT have source code or system description 35
Motivating example • Packet classification • Vamanan et al [SIGCOMM 2010] • Singh et al [SIGCOMM 2013] • Yingchareonthawornchai et al [TON 2018] • Liang et al [SIGCOMM 2019] • Rashelbach et al [SIGCOMM 2020] • Many more… • Can an attacker identify many packets with high classification times? [Plot, x-axis: classification time] 36
Random packet generation • NeuroCuts, Liang et al [SIGCOMM 2019] • Can we generate many, diverse slow packets? [Histogram of 2,000 total packets: number of packets vs. classification time (ms), with a threshold separating fast packets from slow packets] 37
Common approaches • Fuzzing tools • Random sampling • Optimization of black-box functions • Bayesian optimization • Genetic algorithms • Simulated annealing GANs can help! 38
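As a point of comparison, the random-sampling baseline from the list above can be sketched as follows. The black box here (`black_box_time`) is a hypothetical stand-in with a narrow "slow" region; a real target would be queried over the network and timed.

```python
import random

def black_box_time(packet):
    """Hypothetical black box: most inputs are fast; a narrow region is slow."""
    src, dst = packet
    return 10.0 if (src % 97 == 0 and dst % 2 == 0) else 1.0

def random_search(system, n_queries, threshold, rng):
    """Query the black box with uniformly random inputs; keep those above threshold."""
    slow = []
    for _ in range(n_queries):
        packet = (rng.randrange(2 ** 16), rng.randrange(2 ** 16))
        if system(packet) > threshold:
            slow.append(packet)
    return slow

rng = random.Random(0)
slow_packets = random_search(black_box_time, n_queries=2000, threshold=5.0, rng=rng)
fraction_slow = len(slow_packets) / 2000
```

When slow inputs are rare, uniform sampling spends almost all of its query budget on fast inputs, which is the gap a learned generator is meant to close.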
Approach 1: Vanilla GAN • Challenge: too little training data [Diagram: random packets → classification decision tree → training dataset of “fast” packets and “slow” packets (1%) → GAN]
AmpGAN: Training with Feedback [Diagram: random packets → classification decision tree → training dataset of “fast” and “slow” packets → AmpGAN (GAN); generate packets with condition=“slow”]
Results [Histograms: number of random packets and number of AmpGAN packets vs. classification time (ms), with the threshold marked] 41
Results [Plot: fraction of “slow” packets vs. system calls, comparing genetic algorithms, simulated annealing, generalized SA, Bayesian optimization, and AmpGAN; annotations: 2.5x jump, 10x jump] 42
Open questions • Sequences of inputs • Can we use this to optimize systems as well as to find attacks? • E.g., CherryPick [NSDI 2017] 43
Use case 3: Extracting insights from unstructured data • Zinan Lin, Kiran Thekumparampil, Sewoong Oh • github.com/fjxmlzn/InfoGAN-CR 44
Disentangled GANs • d input noise variables z_1, z_2, …, z_d → Generator → k factors: hair color, rotation, background, bangs • How do the z_i's control the factors? • Vanilla GANs: z_1, z_2, …, z_d → Factor 1, Factor 2, …, Factor k • Disentangled GANs: c_1, c_2, …, c_k → Factor 1, Factor 2, …, Factor k, plus the z's • Code & Paper: https://github.com/fjxmlzn/InfoGAN-CR 45