For Thursday • Read chapter 23, sections 1-3 • Homework: – Chapter 18, exercise 25, parts a and b only
Program 4 • Any questions?
PAC Learning • The only reasonable expectation of a learner is that with high probability it learns a close approximation to the target concept. • In the PAC model, we specify two small parameters, ε and δ, and require that with probability at least (1 − δ) a system learns a concept with error at most ε.
Version Space • The set of all hypotheses in H consistent with a set of training examples. • Can be represented by its boundaries: the most general and the most specific consistent hypotheses.
Consistent Learners • A learner L using a hypothesis space H and training data D is said to be a consistent learner if it always outputs a hypothesis with zero error on D whenever H contains such a hypothesis. • By definition, a consistent learner must produce a hypothesis in the version space for H given D. • Therefore, to bound the number of examples needed by a consistent learner, we just need to bound the number of examples needed to ensure that the version space contains no hypotheses with unacceptably high error.
ε-Exhausted Version Space • The version space, VS_{H,D}, is said to be ε-exhausted iff every hypothesis in it has true error less than or equal to ε. • In other words, there are enough training examples to guarantee that any consistent hypothesis has error at most ε. • One can never be sure that the version space is ε-exhausted, but one can bound the probability that it is not. • Theorem 7.1 (Haussler, 1988): If the hypothesis space H is finite, and D is a sequence of m ≥ 1 independent random examples for some target concept c, then for any 0 ≤ ε ≤ 1, the probability that the version space VS_{H,D} is not ε-exhausted is less than or equal to: |H|e^(−εm)
Sample Complexity Analysis • Let δ be an upper bound on the probability of not exhausting the version space. So:
P(some bad h consistent with D) ≤ |H|e^(−εm) ≤ δ
e^(−εm) ≤ δ/|H|
−εm ≤ ln(δ/|H|)
εm ≥ ln(|H|/δ)   (flip inequality)
m ≥ (1/ε)(ln|H| + ln(1/δ))
Sample Complexity Result • Therefore, any consistent learner, given at least: m ≥ (1/ε)(ln|H| + ln(1/δ)) examples will produce a result that is PAC. • Just need to determine the size of a hypothesis space to instantiate this result for learning specific classes of concepts. • This gives a sufficient number of examples for PAC learning, but not a necessary number. Several approximations, like that used to bound the probability of a disjunction, make this a gross over-estimate in practice.
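As a quick sanity check, here is a minimal Python sketch of this bound; the function name pac_sample_bound is my own, and it simply evaluates the formula above, rounding up to a whole number of examples:

```python
from math import ceil, log

def pac_sample_bound(eps, delta, ln_H):
    """Sufficient number of examples for a consistent learner to be PAC.

    eps   -- maximum allowed true error (epsilon)
    delta -- maximum allowed probability of failure
    ln_H  -- natural log of the hypothesis space size, ln|H|
    """
    # m >= (1/eps) * (ln|H| + ln(1/delta)), rounded up
    return ceil((ln_H + log(1.0 / delta)) / eps)
```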
Sample Complexity of Conjunction Learning • Consider conjunctions over n boolean features. There are 3^n of these, since each feature can appear positively, appear negatively, or not appear in a given conjunction. Therefore |H| = 3^n, so a sufficient number of examples to learn a PAC concept is: m ≥ (1/ε)(ln(1/δ) + n ln 3) • Concrete examples: – δ=ε=0.05, n=10 gives 280 examples – δ=0.01, ε=0.05, n=10 gives 312 examples – δ=ε=0.01, n=10 gives 1,560 examples – δ=ε=0.01, n=50 gives 5,954 examples • Result holds for any consistent learner.
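Plugging ln|H| = n ln 3 into the pac_sample_bound sketch above reproduces these numbers:

```python
from math import log

print(pac_sample_bound(0.05, 0.05, 10 * log(3)))  # 280
print(pac_sample_bound(0.05, 0.01, 10 * log(3)))  # 312
print(pac_sample_bound(0.01, 0.01, 10 * log(3)))  # 1560
print(pac_sample_bound(0.01, 0.01, 50 * log(3)))  # 5954
```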
Sample Complexity of Learning Arbitrary Boolean Functions • Consider any boolean function over n boolean features, such as the hypothesis space of DNF or decision trees. There are 2^(2^n) of these, so a sufficient number of examples to learn a PAC concept is: m ≥ (1/ε)(ln(1/δ) + 2^n ln 2) • Concrete examples: – δ=ε=0.05, n=10 gives 14,256 examples – δ=ε=0.05, n=20 gives 14,536,410 examples – δ=ε=0.05, n=50 gives 1.561 x 10^16 examples
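Likewise, plugging ln|H| = 2^n ln 2 into the same sketch shows the explosion in sample complexity:

```python
from math import log

print(pac_sample_bound(0.05, 0.05, 2**10 * log(2)))  # 14256
print(pac_sample_bound(0.05, 0.05, 2**20 * log(2)))  # 14536410
print(pac_sample_bound(0.05, 0.05, 2**50 * log(2)))  # ~1.561e16
```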
COLT Conclusions • The PAC framework provides a theoretical basis for analyzing the effectiveness of learning algorithms. • The sample complexity for any consistent learner using some hypothesis space, H, can be determined from a measure of its expressiveness, |H| or VC(H), quantifying bias and relating it to generalization. • If sample complexity is tractable, then the computational complexity of finding a consistent hypothesis in H governs its PAC learnability. • Constant factors are more important in sample complexity than in computational complexity, since our ability to gather data is generally not growing exponentially. • Experimental results suggest that theoretical sample complexity bounds over-estimate the number of training instances needed in practice, since they are worst-case upper bounds.
COLT Conclusions (cont) • Additional results produced for analyzing: – Learning with queries – Learning with noisy data – Average case sample complexity given assumptions about the data distribution. – Learning finite automata – Learning neural networks • Analyzing practical algorithms that use a preference bias is difficult. • Some effective practical algorithms motivated by theoretical results: – Winnow – Boosting – Support Vector Machines (SVM)
Beyond a Single Learner • Ensembles of learners work better than individual learning algorithms • Several possible ensemble approaches: – Ensembles created by using different learning methods and voting – Bagging – Boosting
Bagging • Learn each member of the ensemble from a different random selection of the training examples (a bootstrap sample, drawn with replacement), then combine their predictions by voting. • Seems to work fairly well, but no real guarantees.
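A minimal sketch of bagging, assuming a generic learner interface in which learn takes a list of examples and returns a hypothesis function; all names here are illustrative:

```python
import random
from collections import Counter

def bagging(examples, learn, k):
    """Train an ensemble of k hypotheses, each on a bootstrap sample of
    the training data (same size, drawn uniformly with replacement)."""
    return [learn([random.choice(examples) for _ in examples])
            for _ in range(k)]

def vote(ensemble, x):
    """Classify x by majority vote over the ensemble members."""
    return Counter(h(x) for h in ensemble).most_common(1)[0][0]
```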
Boosting • The most widely used ensemble method. • Based on the concept of a weighted training set. • Works especially well with weak learners. • Start with all example weights equal to 1. • Learn a hypothesis from the weighted examples. • Increase the weights of all misclassified examples and decrease the weights of all correctly classified examples. • Learn a new hypothesis. • Repeat.
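A minimal sketch of the reweighting loop just described (a simplified, AdaBoost-flavored scheme; the multiplicative factor beta and the interface of weighted_learn are illustrative assumptions, not the exact algorithm from the slide):

```python
def boost(examples, weighted_learn, rounds, beta=2.0):
    """examples is a list of (x, y) pairs; weighted_learn takes the examples
    and a parallel list of weights and returns a hypothesis function h(x)."""
    weights = [1.0] * len(examples)      # start with all weights at 1
    hypotheses = []
    for _ in range(rounds):
        h = weighted_learn(examples, weights)
        for i, (x, y) in enumerate(examples):
            if h(x) != y:
                weights[i] *= beta       # increase misclassified weights
            else:
                weights[i] /= beta       # decrease correct weights
        hypotheses.append(h)
    return hypotheses
```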
Why Neural Networks? • Analogy to biological systems, the best examples we have of robust learning systems. • Models of biological systems allowing us to understand how they learn and adapt. • Massive parallelism that allows for computational efficiency. • Graceful degradation due to distributed representations that spread knowledge representation over large numbers of computational units. • Intelligent behavior is an emergent property of large numbers of simple units rather than resulting from explicit symbolically encoded rules.
Neural Speed Constraints • Neuron “switching time” is on the order of milliseconds compared to nanoseconds for current transistors. • A factor of a million difference in speed. • However, biological systems can perform significant cognitive tasks (vision, language understanding) in seconds or tenths of seconds.
What That Means • Therefore, there is only time for about a hundred serial steps to perform such tasks (e.g., 0.1 sec per task / 1 msec per neuron firing ≈ 100 steps). • Even with limited abilities, current AI systems require orders of magnitude more serial steps. • The human brain has approximately 10^11 neurons, each connected on average to 10^4 others, and therefore must exploit massive parallelism.
Real Neurons • Cells forming the basis of neural tissue – Cell body – Dendrites – Axon – Synaptic terminals • The electrical potential across the cell membrane exhibits spikes called action potentials. • Originating in the cell body, this spike travels down the axon and causes chemical neurotransmitters to be released at synaptic terminals. • This chemical diffuses across the synapse into the dendrites of neighboring cells.
Real Neurons (cont) • Synapses can be excitatory or inhibitory. • Size of synaptic terminal influences strength of connection. • Cells "add up" the incoming chemical messages from all neighboring cells and, if the net positive influence exceeds a threshold, they "fire" and emit an action potential.
Model Neuron (Linear Threshold Unit) • Neuron modelled by a unit (j) connected by weights, w_ji, to other units (i): • Net input to a unit is defined as: net_j = Σ_i w_ji · o_i • Output of a unit is a threshold function on the net input: – o_j = 1 if net_j > T_j – o_j = 0 otherwise
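A minimal Python sketch of a linear threshold unit, directly following the definitions above (the function name ltu is my own):

```python
def ltu(weights, inputs, threshold):
    """Linear threshold unit: fire (output 1) iff the weighted sum
    of the inputs exceeds the threshold T_j."""
    net = sum(w * o for w, o in zip(weights, inputs))
    return 1 if net > threshold else 0
```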
Neural Computation • McCulloch and Pitts (1943) showed how linear threshold units can be used to compute logical functions. • Can build basic logic gates – AND: Let all w_ji be T_j/n + ε, where n = number of inputs – OR: Let all w_ji be T_j + ε – NOT: Let one input be a constant 1 with weight T_j + ε and the input to be inverted have weight −T_j
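The gate constructions can be checked directly with the hypothetical ltu sketch above (here T = 1 and a small ε = 0.1 are illustrative choices):

```python
T, eps, n = 1.0, 0.1, 2

def AND(a, b):
    return ltu([T / n + eps] * n, [a, b], T)   # all weights T/n + eps

def OR(a, b):
    return ltu([T + eps] * n, [a, b], T)       # all weights T + eps

def NOT(a):
    # constant-1 input with weight T + eps; inverted input with weight -T
    return ltu([T + eps, -T], [1, a], T)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, AND(a, b), OR(a, b))
print(NOT(0), NOT(1))  # 1 0
```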
Neural Computation (cont) • Can build arbitrary logic circuits, finite-state machines, and computers given these basic gates. • Given negated inputs, two layers of linear threshold units can specify any boolean function using a two-layer AND-OR network.
Learning • Hebb (1949) suggested that if two units are both active (firing) then the weight between them should increase: w_ji = w_ji + η·o_j·o_i – η is a constant called the learning rate – Supported by physiological evidence
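The Hebbian update as a one-line Python sketch (eta is the learning rate; names are illustrative):

```python
def hebb_update(w_ji, o_j, o_i, eta=0.1):
    """Hebbian learning: strengthen the weight when both units are active."""
    return w_ji + eta * o_j * o_i
```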