Class #02: Types of Learning; Information Theory
Machine Learning (COMP 135): M. Allen, 09 Sept. 19

Defining a Learning Problem
} Suppose we have three basic components:
  1. A set of tasks, T
  2. A performance measure, P
  3. Data describing some experience, E
} A computer program learns if its performance at tasks in T, as measured by P, improves based on E.
} From: Tom M. Mitchell, Machine Learning (1997)

An Example Problem
} Suppose we want to build a system, like Siri or Alexa, that responds to voice commands
} What are our components?
  1. Tasks, T: take system actions, based upon speech
  2. Performance measure, P: how often the correct action is taken during testing
  3. Experience, E: this is the tricky part!

The Expert Systems Approach
} One (older) approach used expert-generated rules:
  1. Find someone with advanced knowledge of linguistics
  2. Get them to devise the structural rules of the language's grammar and semantics
  3. Encode those rules in a program for parsing written language
  4. Build another program to translate speech into written language, and tie that to another program for taking actions based upon the parsing
Another Approach: Supervised Learning
} In supervised learning, we:
  1. Provide a set of correct answers to a problem
  2. Use algorithms to find (mostly) correct answers to similar problems
} We can still use experts, but their job is different:
  } They don't need to devise complex rules for understanding speech
  } Instead, they just have to be able to tell what the correct results of understanding look like

Another Approach: Supervised Learning
} Collect a large set of things a sample set of test users say to our system
} For each, map it to the correct outcome action the system should take:
    "Call my wife"              ->  call(555-123-4567)
    "Set an alarm for 4:00 AM"  ->  alarm_set(04:00)
    "Play Pod Save America"     ->  podcast_play("Pod Save America")
    ...
} A large set of such (speech, action) pairs can be created
} This can then form the experience, E, the system needs

Inductive Learning
} In its simplest form, induction is the task of learning a function on some inputs from examples of its outputs
} For a function, f, that we want to learn, each of these training examples is a pair (x, f(x))
} We assume that we do not yet know the actual form of the function f (if we did, we wouldn't need to learn it)
} Learning problem: find a hypothesis function, h, such that h(x) = f(x) (at least most of the time), based on a training set of example input-output pairs

Decisions to Make
} When collecting our training example pairs, (x, f(x)), we still have some decisions to make
} Example: Medical Informatics
  } We have some genetic information about patients
  } Some get sick with a disease and some don't
  } Patients live for a number of years (sick or not)
} Question: what do we want to learn from this data?
} Depending upon what we decide, we may use:
  } Different models of the data
  } Different machine learning approaches
  } Different measurements of successful learning
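The inductive-learning setup above can be made concrete with a short Python sketch. The particular "true" function f and the straight-line hypothesis class below are invented purely for illustration; only NumPy is assumed.

    import numpy as np

    # A hidden "true" function f that generates the labels; in practice we
    # never see its form, only example (x, f(x)) pairs.  (Hypothetical f.)
    def f(x):
        return 3.0 * x + 2.0

    # Experience E: a training set of input-output pairs (x, f(x))
    xs = np.linspace(0.0, 10.0, 20)
    ys = f(xs)

    # Hypothesis class: straight lines h(x) = w*x + b; fit one to the examples
    w, b = np.polyfit(xs, ys, deg=1)

    def h(x):
        return w * x + b

    # Check that h(x) = f(x), at least approximately, on an unseen input
    print(h(4.5), f(4.5))   # both close to 15.5

The same pattern applies to the speech example: the (speech, action) pairs are the training set, and the learner's job is to find an h that maps new utterances to the correct actions.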
One Approach: Regression
} We decide that we want to try to learn to predict how long patients will live
} We base this upon information about the degree to which they express a specific gene
} A regression problem: the function we learn is the "best (linear) fit" to the data we have
[Figure: best linear fit to the data. Image source: https://aldro61.github.io/microbiome-summer-school-2017/sections/basics/]

Another Approach: Classification
} We decide instead that we simply want to decide whether a patient will get the disease or not
} We base this upon information about expression of two genes
} A classification problem: the learned function separates individuals into 2 groups (binary classes)
[Figure: separation of individuals into two classes. Image source: https://aldro61.github.io/microbiome-summer-school-2017/sections/basics/]

Which is the Correct Approach?
} The approach we use depends upon what we want to achieve, and what works best based upon the data we have
} Much machine learning involves investigating different approaches

Uncertainty and Learning
} Often, when learning, we deal with uncertainty:
  } Incomplete data sets, with missing information
  } Noisy data sets, with unreliable information
  } Stochasticity: causes and effects related non-deterministically
  } And many more...
} Probability theory gives us mathematics for such cases
} A precise mathematical theory of chance and causality
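A minimal sketch of the two framings, assuming NumPy and scikit-learn. The gene-expression data here is synthetic and invented for illustration (the slides' figures come from the linked tutorial, not from this data).

    import numpy as np
    from sklearn.linear_model import LinearRegression, LogisticRegression

    rng = np.random.default_rng(0)

    # Synthetic stand-in for the slides' data: expression level of one gene,
    # a noisy "years lived" value, and a binary "got the disease" label
    gene = rng.uniform(0.0, 5.0, size=(100, 1))
    years = 80.0 - 6.0 * gene[:, 0] + rng.normal(0.0, 4.0, size=100)
    sick = (gene[:, 0] + rng.normal(0.0, 0.5, size=100) > 2.5).astype(int)

    # Regression: predict how long patients live (a real-valued output)
    reg = LinearRegression().fit(gene, years)
    print("predicted lifespan at expression 3.0:", reg.predict([[3.0]])[0])

    # Classification: predict whether a patient gets the disease (a binary output)
    clf = LogisticRegression().fit(gene, sick)
    print("predicted class at expression 3.0:", clf.predict([[3.0]])[0])

The target we choose (years vs. sick/not-sick) is exactly the "decision to make" from the previous slide: it determines the model, the learning approach, and how success is measured.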
Basic Elements of Probability
} Suppose we have some event, e: some fact about the world that may be true or false
} We write P(e) for the probability that e occurs: 0 ≤ P(e) ≤ 1
} We can understand this value as:
  1. P(e) = 1: e will certainly happen
  2. P(e) = 0: e will certainly not happen
  3. P(e) = k, 0 < k < 1: over an arbitrarily long stretch of time, we will observe the fraction
       (# of times event e occurs) / (total # of events) = k

Properties of Probability
} Every event must either occur, or not occur:
    P(e ∨ ¬e) = 1
    P(e) = 1 − P(¬e)
} Furthermore, suppose that we have a set of all possible events, each with its own probability:
    E = {e_1, e_2, ..., e_k}
    P = {p_1, p_2, ..., p_k}
} This set of probabilities is called a probability distribution, and it must have the following property:
    Σ_i p_i = 1

Probability Distributions
} A uniform distribution is one in which every event occurs with equal probability, which means that we have:
    P = {p_1, p_2, ..., p_k} ∧ ∀i, p_i = 1/k
} Such distributions are common in games of chance, e.g. where we have a fair coin-toss:
    E = {Heads, Tails}
    P_1 = {0.5, 0.5}
} Not every distribution is uniform, and we might have a coin that comes up tails more often than heads (or even always!):
    P_2 = {0.25, 0.75}
    P_3 = {0.0, 1.0}

Information Theory
} Claude Shannon created information theory in his 1948 paper, "A mathematical theory of communication"
} A theory of the amount of information that can be carried by communication channels
} Has implications in networks, encryption, compression, and many other areas
} Also the source of the term "bit" (credited to John Tukey)
[Photo of Shannon omitted. Photo source: Konrad Jacobs (https://opc.mfo.de/detail?photo_id=3807)]
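A small Python sketch of the distribution properties above, assuming only NumPy. The helper names is_distribution and uniform are hypothetical, chosen just for this illustration.

    import numpy as np

    def is_distribution(p, tol=1e-9):
        """Check the defining properties of a probability distribution:
        every p_i lies in [0, 1] and the p_i sum to 1."""
        p = np.asarray(p, dtype=float)
        return bool(np.all((p >= 0) & (p <= 1)) and abs(p.sum() - 1.0) < tol)

    def uniform(k):
        """Uniform distribution over k events: every p_i = 1/k."""
        return np.full(k, 1.0 / k)

    P1 = [0.5, 0.5]    # fair coin
    P2 = [0.25, 0.75]  # biased coin
    P3 = [0.0, 1.0]    # coin that always comes up tails

    print(is_distribution(P1), is_distribution(P2), is_distribution(P3))  # True True True
    print(uniform(6))  # fair six-sided die: [1/6, 1/6, ..., 1/6]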
Information Carried by Events
} Information is relative to our uncertainty about an event
} If we do not know whether an event has happened or not, then learning that fact is a gain in information
} If we already know this fact, then there is no information gained when we see the outcome
} Thus, if we have a fixed coin that always comes up tails, actually flipping it tells us nothing we don't already know
} Flipping a fair coin does tell us something, on the other hand, since we can't predict the outcome ahead of time

Amount of Information
} From N. Abramson (1963): if an event e_i occurs with probability p_i, the amount of information carried is:
    I(e_i) = log_2 (1 / p_i)
} (The base of the logarithm doesn't really matter, but if we use base-2, we are measuring information in bits)
} Thus, if we flip a fair coin, and it comes up tails, we have gained information equal to:
    I(Tails) = log_2 (1 / P(Tails)) = log_2 (1 / 0.5) = log_2 2 = 1.0

Biased Data Carries Less Information
} While flipping a fair coin yields 1.0 bit of information, flipping one that is biased gives us less
} If we have a somewhat biased coin, then we get:
    E = {Heads, Tails}
    P_2 = {0.25, 0.75}
    I(Tails) = log_2 (1 / P(Tails)) = log_2 (1 / 0.75) = log_2 1.33 ≈ 0.415
} If we have a totally biased coin, then we get:
    P_3 = {0.0, 1.0}
    I(Tails) = log_2 (1 / P(Tails)) = log_2 (1 / 1.0) = log_2 1.0 = 0.0

Entropy: Total Average Information
} Shannon defined the entropy of a probability distribution P = {p_1, p_2, ..., p_k} as the average amount of information carried by events:
    H(P) = Σ_i p_i log_2 (1 / p_i) = − Σ_i p_i log_2 p_i
} This can be thought of in a variety of ways, including:
  } How much uncertainty we have about the average event
  } How much information we get when an average event occurs
  } How many bits on average are needed to communicate about the events (Shannon was interested in finding the most efficient overall encodings to use in transmitting information)
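The information and entropy formulas above are easy to check numerically. A minimal Python sketch, assuming only NumPy; it reproduces the coin values from the slides (1.0, ≈0.415, and 0.0 bits) along with the corresponding entropies.

    import numpy as np

    def information(p):
        """Information (in bits) carried by an event with probability p:
        I = log2(1/p).  Not defined for p = 0 (an impossible event never occurs)."""
        return np.log2(1.0 / p)

    def entropy(P):
        """Entropy of a distribution: H(P) = -sum_i p_i * log2(p_i), the average
        information per event.  Terms with p_i = 0 contribute 0 by convention."""
        P = np.asarray(P, dtype=float)
        nz = P[P > 0]
        return float(-(nz * np.log2(nz)).sum())

    print(information(0.5))    # fair coin, tails:    1.0 bit
    print(information(0.75))   # biased coin, tails:  ~0.415 bits
    print(information(1.0))    # sure thing:          0.0 bits

    print(entropy([0.5, 0.5]))    # fair coin:        1.0 bit on average
    print(entropy([0.25, 0.75]))  # biased coin:      ~0.811 bits on average
    print(entropy([0.0, 1.0]))    # totally biased:   0.0 bits

Note that the uniform (fair) distribution has the highest entropy: it is the case in which we are most uncertain about what will happen, so each outcome carries the most information on average.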