
A Brief Introduction to Machine Learning (With Applications to Communications)
Osvaldo Simeone, King's College London, 11 June 2018

Goals and Learning Outcomes. Goals: Provide an...


Slide 28. When the True Distribution p(x, t) is Known... ...we do not need data D, and we have a standard inference problem, i.e., estimation (regression) or detection (classification). The solution can be computed directly from the posterior distribution $p(t|x) = p(x,t)/p(x)$ as $\hat{t}^*(x) = \arg\min_{\hat{t}} \mathrm{E}_{t \sim p(t|x)}\big[\ell(t, \hat{t}) \mid x\big]$.

Slide 29. When the Model p(x, t) is Known... With quadratic loss, the optimal predictor is the conditional mean $\hat{t}^*(x) = \mathrm{E}_{t \sim p(t|x)}[t \mid x]$; with the probability-of-error loss, it is the maximum a posteriori (MAP) decision $\hat{t}^*(x) = \arg\max_{t} p(t|x)$. Example: with the joint distribution p(x=0, t=0) = 0.05, p(x=0, t=1) = 0.45, p(x=1, t=0) = 0.4, p(x=1, t=1) = 0.1, we have p(t=1 | x=0) = 0.45/(0.05+0.45) = 0.9, so $\hat{t}^*(x=0) = 0.9 \times 1 + 0.1 \times 0 = 0.9$ under quadratic loss and $\hat{t}^*(x=0) = 1$ under the probability-of-error loss (MAP).
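As a numerical sanity check on this example, here is a minimal NumPy sketch (not from the slides) that computes the posterior, the conditional mean, and the MAP decision from the joint distribution above:

```python
import numpy as np

# Joint distribution p(x, t) from the slide: rows index x in {0, 1}, columns index t in {0, 1}
p_xt = np.array([[0.05, 0.45],
                 [0.40, 0.10]])

x = 0
posterior = p_xt[x] / p_xt[x].sum()              # p(t | x = 0) = [0.1, 0.9]

t_hat_quadratic = posterior @ np.array([0, 1])   # conditional mean (optimal under quadratic loss): 0.9
t_hat_map = int(np.argmax(posterior))            # MAP decision (optimal under probability of error): 1

print(posterior, t_hat_quadratic, t_hat_map)
```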

Slide 30. When the True Distribution p(x, t) is Not Known... ...we need data D, and we have a learning problem:
1. Model selection (inductive bias): define a parametric model, either a generative model p(x, t | θ) or a discriminative model p(t | x, θ).
2. Learning: given data D, optimize a learning criterion to obtain the parameter vector θ.
3. Inference: use the model to obtain the predictor $\hat{t}(x)$ (to be tested on new data).

Slide 31. Logistic Regression. Example: binary classification ($t \in \{0, 1\}$). 1. Model selection (inductive bias): logistic regression (discriminative model), where $\phi(x) = [\phi_1(x) \cdots \phi_{D'}(x)]^T$ is a vector of features (e.g., a bag-of-words representation of a text).

Slide 32. Logistic Regression. Parametric probabilistic model: $p(t = 1 \mid x, w) = \sigma(w^T \phi(x))$, where $\sigma(a) = (1 + \exp(-a))^{-1}$ is the sigmoid function.

Slide 33. Logistic Regression. 2. Learning: to be discussed. 3. Inference: with the probability-of-error loss, MAP classification decides $\hat{t} = 1$ if $w^T \phi(x) > 0$ and $\hat{t} = 0$ otherwise; the decision statistic $w^T \phi(x)$ is the logit, or log-likelihood ratio (LLR).
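A minimal sketch of this inference rule (illustrative only; the feature vector and weights below are made-up placeholders, not taken from the slides):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def classify(w, phi_x):
    """MAP classification for logistic regression: threshold the logit w^T phi(x) at zero."""
    logit = w @ phi_x                         # logit, i.e., the log-likelihood ratio (LLR)
    return int(logit > 0), sigmoid(logit)     # hard decision and p(t = 1 | x, w)

# Hypothetical example with D' = 3 features
w = np.array([0.5, -1.0, 2.0])
phi_x = np.array([1.0, 0.2, 0.1])
print(classify(w, phi_x))
```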

Slide 34. Multi-Layer Neural Networks. 1. Model selection (inductive bias): multi-layer neural network (discriminative model). Multiple layers of learnable weights enable feature learning.
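A sketch of what "multiple layers" means for the discriminative model (a tiny two-layer network with made-up dimensions, not from the slides; the hidden layer plays the role of learned features in place of a fixed φ(x)):

```python
import numpy as np

def relu(a):
    return np.maximum(0.0, a)

def two_layer_net(x, W1, b1, w2, b2):
    """Forward pass of a two-layer network: learned features h(x), then a logistic output layer."""
    h = relu(W1 @ x + b1)                # learned feature vector (replaces a fixed phi(x))
    logit = w2 @ h + b2
    return 1.0 / (1.0 + np.exp(-logit))  # p(t = 1 | x, theta)

# Hypothetical dimensions: 4 inputs, 8 hidden units
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)
w2, b2 = rng.normal(size=8), 0.0
print(two_layer_net(rng.normal(size=4), W1, b1, w2, b2))
```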

Slides 35-36. Supervised Learning (recap of the three steps):
1. Model selection (inductive bias): define a parametric model, either a generative model p(x, t | θ) or a discriminative model p(t | x, θ).
2. Learning: given data D, optimize a learning criterion to obtain the parameter vector θ.
3. Inference: use the model to obtain the predictor $\hat{t}(x)$ (to be tested on new data).

Slide 37. Learning: Maximum Likelihood. ML selects a value of θ that is the most likely to have generated the observed training set D:
maximize $p(\mathcal{D} \mid \theta)$ ⟺ maximize $\ln p(\mathcal{D} \mid \theta)$ (the log-likelihood, or LL) ⟺ minimize $-\ln p(\mathcal{D} \mid \theta)$ (the negative log-likelihood, or NLL).
For discriminative models: minimize $-\ln p(t_{\mathcal{D}} \mid x_{\mathcal{D}}, \theta) = -\sum_{n=1}^{N} \ln p(t_n \mid x_n, \theta)$.
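As an illustration (assuming the logistic-regression model of the previous slides; this helper is not part of the deck), the NLL of a labelled dataset can be computed as:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def nll(w, Phi, t):
    """Negative log-likelihood -sum_n ln p(t_n | x_n, w) for logistic regression.

    Phi is the N x D' matrix whose rows are the feature vectors phi(x_n);
    t is a length-N array of binary labels in {0, 1}.
    """
    p = sigmoid(Phi @ w)      # p(t_n = 1 | x_n, w) for every n
    eps = 1e-12               # guard against log(0)
    return -np.sum(t * np.log(p + eps) + (1 - t) * np.log(1 - p + eps))
```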

Slide 38. Learning: Maximum Likelihood. The problem rarely admits an analytical solution and is typically addressed by Stochastic Gradient Descent (SGD). For discriminative models, the update is $\theta^{\mathrm{new}} \leftarrow \theta^{\mathrm{old}} + \gamma \nabla_\theta \ln p(t_n \mid x_n, \theta) \big|_{\theta = \theta^{\mathrm{old}}}$, where γ is the learning rate. With multi-layer neural networks, this approach yields the backpropagation algorithm.
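For the logistic-regression model, the per-sample gradient is $\nabla_w \ln p(t_n \mid x_n, w) = (t_n - \sigma(w^T \phi(x_n)))\,\phi(x_n)$, which gives the following toy SGD sketch (the learning rate and number of epochs are assumptions, not values from the slides):

```python
import numpy as np

def sgd_logistic(Phi, t, gamma=0.1, epochs=50, seed=0):
    """SGD for ML learning of logistic regression: ascend ln p(t_n | x_n, w) one sample at a time."""
    rng = np.random.default_rng(seed)
    N, D = Phi.shape
    w = np.zeros(D)
    for _ in range(epochs):
        for n in rng.permutation(N):
            p = 1.0 / (1.0 + np.exp(-(Phi[n] @ w)))   # p(t_n = 1 | x_n, w)
            w += gamma * (t[n] - p) * Phi[n]          # gradient of ln p(t_n | x_n, w)
    return w
```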

Slide 39. Supervised Learning: the same three-step recap (model selection / inductive bias, learning, and inference with the predictor $\hat{t}(x)$), repeated before turning to model selection.

Slide 40. Model Selection. How to select a model (inductive bias)? Model selection typically requires choosing the model order, i.e., the capacity of the model. Example: for logistic regression, the model order M is the number of features.

Slide 41. Model Selection. Example: regression using a discriminative model p(t | x) in which $t = \hat{t}(x) + \mathcal{N}(0, 1)$, where $\hat{t}(x) = \sum_{m=0}^{M} w_m x^m$ is a polynomial of order M. [Figure: example training data over x ∈ [0, 1].]

Slide 42. Model Selection. With M = 1, using ML learning of the coefficients. [Figure: the fitted M = 1 predictor plotted against the training data over x ∈ [0, 1].]

Slide 43. Model Selection: Underfitting... With M = 1, the ML predictor $\hat{t}(x)$ underfits the data:
◮ the model is not rich enough to capture the variations present in the data;
◮ the training loss $L_{\mathcal{D}}(\theta) = \frac{1}{N} \sum_{n=1}^{N} (t_n - \hat{t}(x_n))^2$ is large.
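A minimal sketch of this polynomial-regression example (the true curve, the data-set size, and other details are assumptions, since the slides only specify the model $t = \hat{t}(x) + \mathcal{N}(0,1)$); under the unit-variance Gaussian noise model, ML learning of the coefficients reduces to least squares:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 10, 1
x = rng.uniform(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(0, 1, N)   # assumed ground truth; not specified in the slides

Phi = np.vander(x, M + 1, increasing=True)        # features [1, x, ..., x^M]
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)       # ML estimate under Gaussian noise = least squares

training_loss = np.mean((t - Phi @ w) ** 2)       # L_D(theta)
print(w, training_loss)
```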

Slide 44. Model Selection. With M = 9, using ML learning of the coefficients. [Figure: the fitted M = 9 and M = 1 predictors plotted against the training data.]

Slide 45. Model Selection: ...vs Overfitting. With M = 9, the ML predictor overfits the data:
◮ the model is too rich and, in accounting for the observations in the training set, it yields inaccurate predictions outside it;
◮ presumably, the generalization loss $L_p(\hat{t}) = \mathrm{E}_{(x,t) \sim p(x,t)}[(t - \hat{t}(x))^2]$ is large.

Slide 46. Model Selection. M = 3 seems to be a reasonable choice... ...but how do we know, given that we have no data outside of the training set? [Figure: the fitted M = 1, M = 3, and M = 9 predictors plotted against the training data.]

Slide 47. Model Selection: Validation. Keep some data aside (a validation set) to estimate the generalization error for different values of M. (See cross-validation for a more efficient way to use the data.)

Slide 48. Model Selection: Validation. Validation allows model-order selection. [Figure: root average squared loss versus model order M = 1, ..., 9, showing the training loss and the generalization loss estimated via validation, with underfitting at small M and overfitting at large M.] Validation can also be used more generally to select other hyperparameters (e.g., the learning rate).
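A sketch of validation-based model-order selection for the polynomial example above (the training/validation sizes, the true curve, and the candidate orders are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n):
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0, 1, n)    # assumed ground truth

def fit(x, t, M):
    Phi = np.vander(x, M + 1, increasing=True)
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)               # ML / least-squares fit
    return w

def rms_loss(w, x, t):
    Phi = np.vander(x, len(w), increasing=True)
    return np.sqrt(np.mean((t - Phi @ w) ** 2))               # root average squared loss

x_tr, t_tr = make_data(10)      # training set
x_val, t_val = make_data(100)   # held-out validation set

# Choose the model order with the smallest validation loss
val_loss = {M: rms_loss(fit(x_tr, t_tr, M), x_val, t_val) for M in range(1, 10)}
best_M = min(val_loss, key=val_loss.get)
print(best_M, val_loss)
```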

Slide 49. Model Selection: Validation. Model-order selection should also depend on the amount of data: it is a trade-off between bias (asymptotic error) and generalization gap. [Figure: root average quadratic loss versus the number of training examples for M = 1 and M = 7, showing the training loss and the generalization loss estimated via validation for each model order.]

Slide 50. Application to Communication Networks. Fog network architecture [5GPPP]. [Figure: fog architecture spanning the core cloud and core network, the edge cloud and edge/access network, and the wireless edge.]

Slide 51. At the Edge: Overview. At the edge:
◮ PHY: detection and decoding, precoding and power allocation, modulation recognition, localization, interference cancellation, joint source-channel coding, equalization in the presence of non-linearities.
◮ MAC/Link: radio resource allocation, scheduling, multi-RAT handover, dynamic spectrum access, admission control.
◮ Network: proactive caching.
◮ Application: computing resource allocation, content request prediction.

Slide 52. At the Edge: PHY. Channel detection and decoding – classification [Cammerer et al '17].

Slide 53. At the Edge: PHY. Channel detection and decoding – classification [Farsad and Goldsmith '18].

Slide 54. At the Edge: PHY. Channel equalization in the presence of non-linearities, e.g., for optical links – regression [Wang et al '16].

Slide 55. At the Edge: PHY. Channel equalization in the presence of non-linearities, e.g., for satellite links with non-linear amplifiers – regression [Bouchired et al '98].

Slide 56. At the Edge: PHY. Channel decoding for modulation schemes with complex optimal decoders, e.g., continuous phase modulation – classification [De Veciana and Zakhor '92].

Slide 57. At the Edge: PHY. Channel decoding – classification. Leverage domain knowledge to set up the parametrized model to be learned [Nachmani et al '16].

Slide 58. At the Edge: PHY. Modulation recognition – classification [Agirman-Tosun et al '11].

Slide 59. At the Edge: PHY. Localization – regression (coordinates) [Fang and Lin '08].

Slide 60. At the Edge: PHY. Precoding and power allocation – regression [Sun et al '17].

Slide 61. At the Edge: PHY. Interference cancellation – regression [Balatsoukas-Stimming '17].

Slide 62. At the Edge: MAC/Link. Spectrum sensing – classification [Tumuluru et al '10].

Slide 63. At the Edge: MAC/Link. mmWave channel quality prediction using depth images – regression [Okamoto et al '18].

Slide 64. At the Edge: Network and Application. Content prediction for proactive caching – classification [Chen et al '17].

Slide 65. At the Cloud: Overview. At the cloud:
◮ Network: routing (classification vs. look-up tables), SDN flow-table updating, proactive caching, congestion control.
◮ Application: cloud/fog computing, Internet traffic classification.

Slide 66. At the Cloud: Network. Link prediction for wireless routing – classification/regression [Wang et al '06].

Slide 67. At the Cloud: Network. Link prediction for optical routing – classification/regression [Musumeci et al '18].

Slide 68. At the Cloud: Network. Congestion prediction for smart routing – classification [Tang et al '17].

Slide 69. At the Cloud: Network and Application. Traffic classification – classification [Nguyen et al '08].

Slide 70. Overview:
◮ Supervised Learning
◮ Unsupervised Learning
◮ Reinforcement Learning

Slide 71. Unsupervised Learning. Unsupervised learning tasks operate over unlabelled data sets. General goal: discover properties of the data, e.g., for compressed representation. “Some of us see unsupervised learning as the key towards machines with common sense.” (Y. LeCun)

Slide 72. “Defining” Unsupervised Learning. Training set D: $x_n \sim_{\text{i.i.d.}} p(x)$, $n = 1, \ldots, N$. Goal: learn some useful properties of the distribution p(x). Alternative viewpoints to the frequentist framework: Bayesian and MDL.

Slide 73. Unsupervised Learning Tasks.
◮ Density estimation: estimate p(x), e.g., for use in plug-in estimators or compression algorithms, or to detect outliers.
◮ Clustering: partition all points in D into groups of similar objects (e.g., document clustering).
◮ Dimensionality reduction, representation, and feature extraction: represent each data point $x_n$ in a space of lower dimensionality, e.g., to highlight independent explanatory factors and/or to ease visualization, interpretation, or successive tasks.
◮ Generation of new samples: learn a machine that produces samples approximately distributed according to p(x), e.g., to produce artificial scenes for games or films.

Slides 74-75. Unsupervised Learning.
1. Model selection (inductive bias): define a parametric model p(x | θ).
2. Learning: given data D, optimize a learning criterion to obtain the parameter vector θ.
3. Clustering, feature extraction, sample generation...

Slides 76-77. Models. Unsupervised learning models typically involve hidden, or latent, variables: $z_n$ denotes the hidden variables associated with data point $x_n$. Example: $z_n$ is the cluster index of $x_n$.

Slide 78. (a) Directed Generative Models. Model the data x as being caused by a latent variable z: $p(x \mid \theta) = \sum_{z} p(z \mid \theta)\, p(x \mid z, \theta)$.

Slide 79. (a) Directed Generative Models. Example: document clustering,
◮ x is a document, and z is (interpreted as) its topic;
◮ p(z | θ) = distribution of topics;
◮ p(x | z, θ) = distribution of words in a document given the topic.
Basic representatives:
◮ Mixture of Gaussians
◮ Likelihood-free models
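To make the directed structure concrete, here is a small sketch of a mixture of Gaussians, which follows exactly this factorization; the mixture parameters below are arbitrary illustrative values (in practice θ would be learned, e.g., via EM):

```python
import numpy as np

# Illustrative parameters theta for a 1-D mixture of two Gaussians (arbitrary values)
pi = np.array([0.3, 0.7])          # p(z | theta), z in {0, 1}
mu = np.array([-2.0, 1.0])
sigma = np.array([1.0, 0.5])

def gaussian_pdf(x, m, s):
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

def marginal_likelihood(x):
    """p(x | theta) = sum_z p(z | theta) p(x | z, theta)."""
    return np.sum(pi * gaussian_pdf(x, mu, sigma))

def sample(rng):
    """Ancestral sampling: draw the latent z, then x given z."""
    z = rng.choice(2, p=pi)
    return z, rng.normal(mu[z], sigma[z])

rng = np.random.default_rng(0)
print(marginal_likelihood(0.5), sample(rng))
```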

Slide 80. (d) Autoencoders. Model both the encoding from the data to the hidden variables and the decoding from the hidden variables back to the data: p(z | x, θ) and p(x | z, θ).

Slide 81. (d) Autoencoders. Example: compression,
◮ x is an image and z is (interpreted as) a compressed (e.g., sparse) representation;
◮ p(z | x, θ) = compression of the image into the representation;
◮ p(x | z, θ) = decompression of the representation into an image.
Basic representatives: Principal Component Analysis (PCA), dictionary learning, neural network-based autoencoders.
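As a minimal illustration of the encode/decode structure (deterministic rather than probabilistic, which is a simplification of the general model above), a PCA-style linear autoencoder can be sketched as:

```python
import numpy as np

def pca_autoencoder(X, k):
    """Fit a linear (PCA) autoencoder: encode x -> z = W^T (x - mean), decode z -> x_hat = W z + mean."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # Principal directions = top-k right singular vectors of the centered data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:k].T                                   # D x k encoding/decoding matrix
    encode = lambda x: W.T @ (x - mean)            # z: k-dimensional representation
    decode = lambda z: W @ z + mean                # reconstruction x_hat
    return encode, decode

# Toy usage with random data (purely illustrative)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
encode, decode = pca_autoencoder(X, k=2)
x = X[0]
print(np.linalg.norm(x - decode(encode(x))))       # reconstruction error
```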

Slide 82. Unsupervised Learning: the same three steps as above (model selection of p(x | θ), learning of θ, and then clustering, feature extraction, sample generation...).
