DM812 METAHEURISTICS
Lecture 10: Machine Learning and Estimation of Distribution Algorithms

Marco Chiarandini
Department of Mathematics and Computer Science
University of Southern Denmark, Odense, Denmark
<marco@imada.sdu.dk>

Outline
1. Machine Learning
2. Model Based Metaheuristics
   Estimation of Distribution Algorithms

Machine Learning

- Machine Learning: algorithms that make computers learn from experience and modify their activities accordingly.
- A sub-field of artificial intelligence and, hence, of computer science, but also related to statistics (statistical learning, learning from data), computational complexity, information theory, biology, cognitive sciences, and philosophy.
- Components of a machine learning model: environment, learning element, performance element.
- Statistical learning methods: least squares, maximum likelihood, decision trees, Bayesian networks, artificial neural networks, evolutionary learning, reinforcement learning.
Learning Tasks and Domains

- Supervised learning: predict the value of an outcome measure based on a number of input measures.
  - Classification: recognition of classes
  - Regression, interpolation and density estimation: functional description
  - Learning a sequence of actions: robots, chess
  - Data mining: learning patterns and regularities from data
- Unsupervised learning: find association patterns among a set of inputs without any output being supplied.
- Reinforcement learning: learn an input-output mapping through continued interaction with the environment rather than learning from a teacher.

Supervised learning, more formally:
- Probability space P, input variables x ∈ R^p and output variable y ∈ R
- Learning machine implementing a function F(x, θ), θ ∈ W
- Find F(x, θ) such that, for example, min E[(F(x, θ) − y)^2]
- Two phases: training and testing (or learning and generalizing)
- Issues: which algorithm for which task? how many training samples?

Learning Algorithms: Overview

- Decision trees
- Inductive logic programming
- Bayesian learning
- Reinforcement learning
- Neural networks
- Evolutionary learning

Information Theory

- Information theory, founded by Shannon, is a branch of probability theory and statistics whose aim is quantifying information.
- The information entropy H measures the information contained in random variables:
  - X discrete random variable, 𝒳 the set of all messages x that X could take, p(x) = Pr(X = x) the probability of message x
  - I(x) self-information, the entropy contribution of an individual message
  - the bit, based on the binary logarithm, is the unit of information

  H(X) = E_X[I(x)] = − Σ_{x ∈ 𝒳} p(x) log p(x)

- The entropy is maximized when all the messages in the message space are equiprobable, p(x) = 1/n, in which case H(X) = log n.
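As a small illustration (not part of the original slides), the entropy definition above can be evaluated directly; the values are in bits (base-2 logarithm):

    import math

    def entropy(probs, base=2):
        # Shannon entropy H(X) = -sum p(x) log p(x), in bits by default
        return -sum(p * math.log(p, base) for p in probs if p > 0)

    # A fair coin (2 equally likely outcomes) vs. a fair die (6 outcomes):
    print(entropy([1/2, 1/2]))   # 1.0 bit   = log2(2)
    print(entropy([1/6] * 6))    # ~2.585 bits = log2(6)

This matches the remark above: the uniform distribution over n outcomes attains the maximum H(X) = log n.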
Entropy: Example

Recall H(X) = − Σ_{x ∈ 𝒳} p(x) log p(x).

- Random variable with two outcomes (binary entropy):
  H_b(p) = − p log p − (1 − p) log(1 − p)
- A fair coin flip (2 equally likely outcomes) has less entropy than a roll of a die (6 equally likely outcomes).

Model Based Metaheuristics

Key idea: solutions are generated using a parameterized probabilistic model that is updated using previously seen solutions.

1. Candidate solutions are constructed using some parameterized probabilistic model, that is, a parameterized probability distribution over the solution space.
2. The candidate solutions are used to modify the model in a way that is deemed to bias future sampling towards low cost solutions.

Stochastic Gradient Method

- {P(s, θ) | θ ∈ Θ} family of probability functions defined on s ∈ S
- Θ ⊂ R^m, an m-dimensional parameter space
- P continuous and differentiable
- The original problem may be replaced with the following continuous one:

  argmin_{θ ∈ Θ} E_θ[f(s)]

- Gradient method:
  - Step 1: start from some initial guess θ_0
  - Step 2: at stage t, calculate the gradient ∇E_{θ_t}[f(s)] and update θ_{t+1} to θ_t + α_t ∇E_{θ_t}[f(s)], where α_t is a step-size parameter.
- Computing ∇E_{θ_t}[f(s)] requires a summation over the whole search space.
- Stochastic gradient method: replace Step 2 by

  θ_{t+1} = θ_t + α_t Σ_{s ∈ S_t} f(s) ∇ ln P(s, θ_t)

  where S_t is a sample of solutions drawn from P(·, θ_t).
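A minimal sketch of the stochastic gradient update, under assumptions not in the slides: the model is an independent Bernoulli distribution over binary strings (so ∇ ln P(s, θ) has the closed form s_i/θ_i − (1 − s_i)/(1 − θ_i)), f is a hypothetical cost equal to the number of ones, and the step is taken in the descent direction since the slides phrase the problem as a minimization of E_θ[f(s)]:

    import random

    def f(s):
        # Hypothetical cost to minimize (illustration only)
        return sum(s)

    def sample(theta):
        return [1 if random.random() < p else 0 for p in theta]

    def grad_log_p(s, theta):
        # d/d theta_i of ln P(s, theta) for an independent Bernoulli model
        return [si / ti - (1 - si) / (1 - ti) for si, ti in zip(s, theta)]

    def stochastic_gradient_step(theta, alpha=0.01, n_samples=20, eps=1e-3):
        grads = [0.0] * len(theta)
        for _ in range(n_samples):          # sample S_t from P(., theta_t)
            s = sample(theta)
            g = grad_log_p(s, theta)
            grads = [gi + f(s) * gj for gi, gj in zip(grads, g)]
        # descent step on E_theta[f(s)]; alpha plays the role of the step size alpha_t
        theta = [ti - alpha * gi for ti, gi in zip(theta, grads)]
        # keep the parameters strictly inside (0, 1)
        return [min(1 - eps, max(eps, ti)) for ti in theta]

    theta = [0.5] * 10
    for t in range(200):
        theta = stochastic_gradient_step(theta)

This is only meant to make the update rule concrete; in practice the estimator is noisy and is usually stabilized, e.g. by averaging over the sample or subtracting a baseline from f.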
Estimation of Distribution Algorithms (EDAs)

- Key idea: estimate a probability distribution over the search space and then use it to sample new solutions.
- Attempt to learn the structure of a problem and thus avoid the problem of breaking good building blocks, as happens in EAs.
- EDAs originate from evolutionary computation by replacing recombination and mutation with:
  - constructing a parameterized probabilistic model from selected promising solutions;
  - generating new solutions according to the constructed model.
- Needed: a probabilistic model and an update rule for the model's parameters and/or structure.

Estimation of Distribution Algorithm (EDA):
  generate an initial population sp
  while termination criterion is not satisfied do
      select sc from sp
      estimate the probability distribution p_i(x_i) of each solution component i
          from the highest quality solutions of sc
      generate a new sp by sampling according to p_i(x_i)

Probabilistic Models

No Interaction
- Simplest form of EDA: the univariate marginal distribution algorithm [Mühlenbein and Paaß, 1996]. Assumes variables are independent.
- Works asymptotically the same as uniform crossover.
- Probabilities are chosen as weighted frequencies over the population.
- Solutions are replaced by their probability vector.
- In binary representations, all probabilities are initialized to 0.5.
- A mutation operator can be applied to the probabilities.
- Classical selection procedures.
- Incremental learning with binary strings (a small code sketch follows after these lists):
  p_{t+1,i}(x_i) = (1 − ρ) p_{t,i}(x_i) + ρ r_{t,i}(x_i)   with x_i ∈ S_best

Probabilistic Models (cont.)

Pairwise Interaction
- Chain distribution of neighboring variables; conditional probabilities constructed using sample frequencies.
- Solutions replaced by a vector of all pairwise probabilities.
- Chains (= orderings of the variables) found by a greedy algorithm.
- Dependency trees.
- Forests.
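A minimal sketch of the univariate (no-interaction) case with the incremental learning rule above. Assumptions not in the slides: Python, a hypothetical OneMax objective to maximize, truncation selection of the n_best strings as S_best, and ρ = 0.1:

    import random

    def onemax(s):
        # Hypothetical objective to maximize (illustration only)
        return sum(s)

    def sample(p):
        return [1 if random.random() < pi else 0 for pi in p]

    def univariate_eda(n=20, pop_size=50, n_best=10, rho=0.1, iterations=100):
        p = [0.5] * n                         # binary representation: start at 0.5
        best_seen = None
        for _ in range(iterations):
            pop = [sample(p) for _ in range(pop_size)]
            best = sorted(pop, key=onemax, reverse=True)[:n_best]   # S_best
            if best_seen is None or onemax(best[0]) > onemax(best_seen):
                best_seen = best[0]
            # frequencies r_{t,i} of value 1 among the selected solutions
            r = [sum(s[i] for s in best) / n_best for i in range(n)]
            # incremental learning rule: p_{t+1,i} = (1 - rho) p_{t,i} + rho r_{t,i}
            p = [(1 - rho) * pi + rho * ri for pi, ri in zip(p, r)]
        return p, best_seen

    p, best = univariate_eda()

Setting ρ = 1 would correspond to re-estimating the marginals from S_best at every iteration, while smaller ρ gives the incremental (PBIL-style) behaviour described above.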
Probabilistic Models (cont.)

Multivariate
- Independent clusters based on minimum description length, found by a greedy algorithm that merges groups with the least increase in the metric.
- Bayesian optimization algorithm: learns a Bayesian network that models good solutions
  - a metric to select among models (Bayesian-Dirichlet looks only at accuracy)
  - a greedy algorithm to search the network space
- Increased computational time spent learning the models.
- Prior information can be included.
- Should work for decomposable problems.

Bayesian Optimization  [Pelikan and Goldberg (GECCO-1999)]

The Bayesian Optimization Algorithm (BOA):
  Step 1: set t = 0, randomly generate initial population P(0)
  Step 2: select a set of promising strings S(t) from P(t)
  Step 3: construct the network B using a chosen metric and constraints
  Step 4: generate a set of new strings O(t) according to the joint distribution encoded by B
          (a sampling sketch follows below)
  Step 5: create a new population P(t + 1) by replacing some strings from P(t) with O(t); set t = t + 1
  Step 6: if the termination criteria are not met, go to Step 2

Bayesian Networks

- A BN encodes relationships between the variables contained in the modeled data, representing the structure of the problem.
- Each node corresponds to one variable X_i (the i-th position in the string).
- Directed edges represent relationships between variables.
- Restriction to acyclic graphs with maximum in-degree k.
- X = (X_1, ..., X_n) vector of variables, Π_{X_i} set of parents of X_i in the network:

  p(X) = ∏_{i=1}^{n} p(X_i | Π_{X_i})

Bayesian Networks (cont.)

Examples of measures of the quality of networks (related to the concept of model selection in statistics):

- Minimum Description Length (implements a sort of Occam's razor): "length of the shortest two-part code for a string s that uses less than α bits. It consists of the number of bits needed to encode the model m_p that defines a distribution and the negative log likelihood of s under this distribution."

  β_s(α) = min_{m_p} { − log p(s | m_p) + K(m_p) : p(s | m_p) > 0, K(m_p) ≤ α }

  K(s) is the Kolmogorov complexity: the length of the shortest program that outputs s on a Universal Turing Machine.

- Bayesian Dirichlet metric: combines the prior knowledge about the problem and the statistical data from a given data set.
- Many others: e.g., Akaike Information Criterion, ...
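Step 4 of BOA samples new strings from the factorized joint distribution p(X) = ∏_i p(X_i | Π_{X_i}) shown above. A minimal sketch of this ancestral sampling, under assumptions not in the slides: binary variables, the network given as parent lists whose indices already form a topological order, and conditional probability tables indexed by tuples of parent values (the network and tables here are purely hypothetical, whereas in BOA they would be learned in Step 3):

    import random

    # Hypothetical network over 4 binary variables: parents[i] lists the parents of X_i,
    # cpt[i] maps a tuple of parent values to Pr(X_i = 1 | parents)
    parents = {0: [], 1: [0], 2: [0], 3: [1, 2]}
    cpt = {
        0: {(): 0.7},
        1: {(0,): 0.2, (1,): 0.9},
        2: {(0,): 0.4, (1,): 0.6},
        3: {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 0.95},
    }

    def sample_string(parents, cpt):
        # Ancestral sampling: visit variables in topological order (here 0..n-1)
        # and draw each X_i from p(X_i | parents of X_i)
        x = {}
        for i in sorted(parents):
            pa = tuple(x[j] for j in parents[i])
            x[i] = 1 if random.random() < cpt[i][pa] else 0
        return [x[i] for i in sorted(x)]

    new_strings = [sample_string(parents, cpt) for _ in range(10)]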