

  1. For Thursday • Read chapter 23, sections 1-3 • Homework: – Chapter 18, exercise 25, parts a and b only

  2. Program 4 • Any questions?

  3. PAC Learning • The only reasonable expectation of a learner is that with high probability it learns a close approximation to the target concept. • In the PAC model, we specify two small parameters, ε and δ, and require that with probability at least (1 − δ) a system learns a concept with error at most ε.

  4. Version Space • The set of all hypotheses consistent with the training examples; it bounds the generalizations of a set of examples between its most specific and most general members.

  5. Consistent Learners • A learner L using a hypothesis space H and training data D is said to be a consistent learner if it always outputs a hypothesis with zero error on D whenever H contains such a hypothesis. • By definition, a consistent learner must produce a hypothesis in the version space for H given D. • Therefore, to bound the number of examples needed by a consistent learner, we just need to bound the number of examples needed to ensure that the version space contains no hypotheses with unacceptably high error.

  6. ε-Exhausted Version Space • The version space, VS_{H,D}, is said to be ε-exhausted iff every hypothesis in it has true error less than or equal to ε. • In other words, there are enough training examples to guarantee that any consistent hypothesis has error at most ε. • One can never be sure that the version space is ε-exhausted, but one can bound the probability that it is not. • Theorem 7.1 (Haussler, 1988): If the hypothesis space H is finite, and D is a sequence of m ≥ 1 independent random examples for some target concept c, then for any 0 ≤ ε ≤ 1, the probability that the version space VS_{H,D} is not ε-exhausted is less than or equal to |H|e^(−εm).

  7. Sample Complexity Analysis • Let δ be an upper bound on the probability of not exhausting the version space. So we require:

    P(VS_{H,D} not ε-exhausted) ≤ |H|e^(−εm) ≤ δ

Solving for m:

    ln|H| − εm ≤ ln δ
    −εm ≤ ln δ − ln|H|
    m ≥ (ln|H| − ln δ)/ε    (dividing by −ε flips the inequality)
    m ≥ (1/ε)(ln|H| + ln(1/δ))

  8. Sample Complexity Result • Therefore, any consistent learner, given at least

    m ≥ (1/ε)(ln|H| + ln(1/δ))

examples will produce a result that is PAC. • We just need to determine the size of a hypothesis space to instantiate this result for learning specific classes of concepts. • This gives a sufficient number of examples for PAC learning, but not a necessary number. Several approximations, like the one used to bound the probability of a disjunction, make this a gross over-estimate in practice.
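The bound is easy to evaluate numerically. Here is a minimal Python sketch (the helper name pac_sample_bound is our own) that computes the smallest integer m satisfying it, taking ln|H| as an argument so that enormous hypothesis spaces never have to be materialized:

    import math

    def pac_sample_bound(eps, delta, ln_H):
        # Smallest integer m with m >= (1/eps) * (ln|H| + ln(1/delta)).
        return math.ceil((ln_H + math.log(1.0 / delta)) / eps)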

  9. Sample Complexity of Conjunction Learning • Consider conjunctions over n boolean features. There are 3^n of these, since each feature can appear positively, appear negatively, or not appear in a given conjunction. Therefore |H| = 3^n, so a sufficient number of examples to learn a PAC concept is:

    m ≥ (1/ε)(ln(1/δ) + n ln 3)

• Concrete examples: – δ=ε=0.05, n=10 gives 280 examples – δ=0.01, ε=0.05, n=10 gives 312 examples – δ=ε=0.01, n=10 gives 1,560 examples – δ=ε=0.01, n=50 gives 5,954 examples • Result holds for any consistent learner.
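As a check, the pac_sample_bound helper above reproduces these numbers exactly:

    # ln|H| = n * ln 3 for conjunctions over n boolean features
    print(pac_sample_bound(0.05, 0.05, 10 * math.log(3)))  # 280
    print(pac_sample_bound(0.05, 0.01, 10 * math.log(3)))  # 312
    print(pac_sample_bound(0.01, 0.01, 10 * math.log(3)))  # 1560
    print(pac_sample_bound(0.01, 0.01, 50 * math.log(3)))  # 5954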

  10. Sample Complexity of Learning Arbitrary Boolean Functions • Consider any boolean function over n boolean features, such as the hypothesis space of DNF or decision trees. There are 2^(2^n) of these, so a sufficient number of examples to learn a PAC concept is:

    m ≥ (1/ε)(ln(1/δ) + 2^n ln 2)

• Concrete examples: – δ=ε=0.05, n=10 gives 14,256 examples – δ=ε=0.05, n=20 gives 14,536,410 examples – δ=ε=0.05, n=50 gives 1.561 × 10^16 examples
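The same helper covers this case; only ln|H| changes:

    # ln|H| = 2^n * ln 2 for arbitrary boolean functions of n features
    print(pac_sample_bound(0.05, 0.05, 2**10 * math.log(2)))  # 14256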

  11. COLT Conclusions • The PAC framework provides a theoretical basis for analyzing the effectiveness of learning algorithms. • The sample complexity for any consistent learner using some hypothesis space, H, can be determined from a measure of its expressiveness, |H| or VC(H), quantifying bias and relating it to generalization. • If sample complexity is tractable, then the computational complexity of finding a consistent hypothesis in H governs its PAC learnability. • Constant factors are more important in sample complexity than in computational complexity, since our ability to gather data is generally not growing exponentially. • Experimental results suggest that theoretical sample complexity bounds over-estimate the number of training instances needed in practice, since they are worst-case upper bounds.

  12. COLT Conclusions (cont) • Additional results produced for analyzing: – Learning with queries – Learning with noisy data – Average case sample complexity given assumptions about the data distribution. – Learning finite automata – Learning neural networks • Analyzing practical algorithms that use a preference bias is difficult. • Some effective practical algorithms motivated by theoretical results: – Winnow – Boosting – Support Vector Machines (SVM)

  13. Beyond a Single Learner • Ensembles of learners often work better than any individual learning algorithm. • Several possible ensemble approaches: – Ensembles created by using different learning methods and voting – Bagging – Boosting

  14. Bagging • Each member of the ensemble is trained on a bootstrap sample: a random selection of examples drawn with replacement from the training set. • Seems to work fairly well, but offers no strong guarantees.
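A minimal Python sketch of this resampling scheme (the train interface and vote helper are hypothetical placeholders, not a specific library's API):

    import random

    def bagging_ensemble(examples, train, k=10):
        # Each hypothesis is trained on a bootstrap sample (with replacement).
        # train: function mapping a list of examples to a hypothesis (callable).
        samples = [random.choices(examples, k=len(examples)) for _ in range(k)]
        return [train(s) for s in samples]

    def vote(hypotheses, x):
        # Classify x by majority vote over the ensemble's predictions.
        preds = [h(x) for h in hypotheses]
        return max(set(preds), key=preds.count)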

  15. Boosting • The most widely used ensemble method. • Based on the concept of a weighted training set. • Works especially well with weak learners. • Start with all example weights at 1. • Learn a hypothesis from the weighted training set. • Increase the weights of all misclassified examples and decrease the weights of all correctly classified examples. • Learn a new hypothesis. • Repeat (a sketch of this loop follows).
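A minimal sketch of that reweighting loop, assuming a weak learner that accepts per-example weights (the weighted_train interface and the multiplicative factor beta are our own choices; this is the generic loop from the slide, not full AdaBoost):

    def boost(examples, labels, weighted_train, rounds=10, beta=2.0):
        # weighted_train(examples, labels, weights) -> hypothesis (callable)
        weights = [1.0] * len(examples)
        ensemble = []
        for _ in range(rounds):
            h = weighted_train(examples, labels, weights)
            for i, (x, y) in enumerate(zip(examples, labels)):
                if h(x) != y:
                    weights[i] *= beta   # boost misclassified examples
                else:
                    weights[i] /= beta   # damp correctly classified examples
            ensemble.append(h)
        return ensemble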

  16. Why Neural Networks?

  17. Why Neural Networks? • Analogy to biological systems, the best examples we have of robust learning systems. • Models of biological systems allow us to understand how they learn and adapt. • Massive parallelism allows for computational efficiency. • Graceful degradation due to distributed representations that spread knowledge over large numbers of computational units. • Intelligent behavior is an emergent property of large numbers of simple units rather than resulting from explicit symbolically encoded rules.

  18. Neural Speed Constraints • Neuron “switching time” is on the order of milliseconds compared to nanoseconds for current transistors. • A factor of a million difference in speed. • However, biological systems can perform significant cognitive tasks (vision, language understanding) in seconds or tenths of seconds.

  19. What That Means • Therefore, there is only time for about a hundred serial steps to perform such tasks (e.g., 0.1 s at roughly 1 ms per neuron firing ≈ 100 steps). • Even with limited abilities, current AI systems require orders of magnitude more serial steps. • The human brain has approximately 10^11 neurons, each connected on average to 10^4 others, so it must exploit massive parallelism.

  20. Real Neurons • Cells forming the basis of neural tissue: – Cell body – Dendrites – Axon – Synaptic terminals • The electrical potential across the cell membrane exhibits spikes called action potentials. • Originating in the cell body, this spike travels down the axon and causes chemical neurotransmitters to be released at the synaptic terminals. • This chemical diffuses across the synapse into the dendrites of neighboring cells.

  21. Real Neurons (cont) • Synapses can be excitatory or inhibitory. • The size of the synaptic terminal influences the strength of the connection. • Cells “add up” the incoming chemical messages from all neighboring cells, and if the net positive influence exceeds a threshold, they “fire” and emit an action potential.

  22. Model Neuron (Linear Threshold Unit) • Neuron modelled by a unit (j) connected by weights, w_ji, to other units (i). • Net input to a unit is defined as: net_j = Σ_i w_ji · o_i • Output of a unit is a threshold function on the net input: – o_j = 1 if net_j > T_j – o_j = 0 otherwise
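A direct Python transcription of this unit (the class name LTU is our own):

    class LTU:
        # Linear threshold unit: fires iff the weighted input sum
        # exceeds the threshold T.
        def __init__(self, weights, threshold):
            self.weights = weights
            self.threshold = threshold

        def output(self, inputs):
            net = sum(w * o for w, o in zip(self.weights, inputs))  # net_j
            return 1 if net > self.threshold else 0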

  23. Neural Computation • McCulloch and Pitts (1943) showed how linear threshold units can be used to compute logical functions. • Can build basic logic gates (for some small ε > 0): – AND: let all w_ji be T_j/n + ε, where n = number of inputs – OR: let all w_ji be T_j + ε – NOT: let one input be a constant 1 with weight T_j + ε, and give the input to be inverted weight −T_j • The sketch below verifies these constructions.
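Using the LTU class above, a quick truth-table check (the threshold T = 1 and ε = 0.1 are arbitrary choices):

    T, eps = 1.0, 0.1

    and2 = LTU([T/2 + eps, T/2 + eps], T)  # fires only when both inputs are 1
    or2  = LTU([T + eps, T + eps], T)      # fires when either input is 1
    not1 = LTU([T + eps, -T], T)           # first input is the constant 1

    for a in (0, 1):
        for b in (0, 1):
            print(a, b, and2.output([a, b]), or2.output([a, b]))
    print(not1.output([1, 0]), not1.output([1, 1]))  # prints: 1 0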

  24. Neural Computation (cont) • Can build arbitrary logic circuits, finite-state machines, and computers given these basic gates. • Given negated inputs, two layers of linear threshold units can specify any boolean function as a two-layer AND-OR network.

  25. Learning • Hebb (1949) suggested that if two units are both active (firing), then the weight between them should increase: w_ji = w_ji + η·o_j·o_i – η is a constant called the learning rate – Supported by physiological evidence
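The Hebbian update in code (eta = 0.1 is an arbitrary learning-rate choice):

    eta = 0.1  # learning rate

    def hebb_update(w_ji, o_j, o_i):
        # Strengthen the weight when both units are active (Hebb, 1949).
        return w_ji + eta * o_j * o_i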
