Learning: Nearest Neighbor, Perceptrons & Neural Nets Artificial Intelligence CSPP 56553 February 4, 2004
Nearest Neighbor Example II
• Credit Rating:
  – Classifier: Good / Poor
  – Features:
    • L = # late payments/yr
    • R = Income/Expenses
  Name  L   R     G/P
  A     0   1.2   G
  B     25  0.4   P
  C     5   0.7   G
  D     20  0.8   P
  E     30  0.85  P
  F     11  1.2   G
  G     7   1.15  G
  H     15  0.8   P
Nearest Neighbor Example II
  Name  L   R     G/P
  A     0   1.2   G
  B     25  0.4   P
  C     5   0.7   G
  D     20  0.8   P
  E     30  0.85  P
  F     11  1.2   G
  G     7   1.15  G
  H     15  0.8   P
[Figure: instances A–H plotted in the (L, R) plane; L on the horizontal axis (10–30), R on the vertical axis (up to about 1.2)]
Nearest Neighbor Example II
  Name  L   R     G/P
  I     6   1.15  G
  J     22  0.45  P
  K     15  1.2   ??
• Distance Measure: Sqrt((L1 - L2)^2 + [sqrt(10)*(R1 - R2)]^2)
  – Scaled distance
[Figure: new instances I, J, K plotted among A–H in the (L, R) plane; K’s class is the one to predict]
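A minimal sketch of the brute-force lookup this distance measure supports, in Python (assuming the labeled table A–H from the previous slides; the function and variable names are illustrative, not from the slides):

import math

# Labeled training instances from the earlier slide: name -> (L, R, class)
train = {
    'A': (0, 1.2, 'G'), 'B': (25, 0.4, 'P'), 'C': (5, 0.7, 'G'),
    'D': (20, 0.8, 'P'), 'E': (30, 0.85, 'P'), 'F': (11, 1.2, 'G'),
    'G': (7, 1.15, 'G'), 'H': (15, 0.8, 'P'),
}

def scaled_distance(l1, r1, l2, r2):
    # Sqrt((L1-L2)^2 + [sqrt(10)*(R1-R2)]^2): R is rescaled (by sqrt(10),
    # per the slide) before being combined with L.
    return math.sqrt((l1 - l2) ** 2 + (math.sqrt(10) * (r1 - r2)) ** 2)

def classify(l, r):
    # Brute-force nearest neighbor: compare against every stored instance.
    best = min(train.values(), key=lambda t: scaled_distance(l, r, t[0], t[1]))
    return best[2]

# Query instance K = (15, 1.2) from the slide; its class is unknown.
print(classify(15, 1.2))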
Nearest Neighbor: Issues • Prediction can be expensive if many features • Affected by classification, feature noise – One entry can change prediction • Definition of distance metric – How to combine different features • Different types, ranges of values • Sensitive to feature selection
Efficient Implementations • Classification cost: – Find nearest neighbor: O(n) • Compute distance between unknown and all instances • Compare distances – Problematic for large data sets • Alternative: – Use binary search to reduce to O(log n)
Efficient Implementation: K-D Trees • Divide instances into sets based on features – Binary branching: e.g. feature > value? – With d levels of splits there are 2^d leaves; setting 2^d = n gives depth d = O(log n) – To split cases into sets: • If there is one element in the set, stop • Otherwise pick a feature to split on – Find average position of two middle objects on that dimension » Split remaining objects based on average position » Recursively split subsets
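A hedged sketch of this construction in Python (assuming 2-D instances stored as (L, R, label) tuples with distinct feature values, alternating the split feature by depth and splitting at the average of the two middle values, as described above; names are illustrative):

class KDNode:
    def __init__(self, feature=None, value=None, left=None, right=None, leaf=None):
        self.feature = feature    # which feature this node tests (0 = L, 1 = R)
        self.value = value        # split threshold: go right if point[feature] > value
        self.left = left
        self.right = right
        self.leaf = leaf          # stored instance when this node is a leaf

def build_kdtree(points, depth=0):
    # points: list of (L, R, label) tuples
    if len(points) == 1:
        return KDNode(leaf=points[0])
    feature = depth % 2                                  # alternate the split feature
    pts = sorted(points, key=lambda p: p[feature])
    mid = len(pts) // 2
    # average position of the two middle objects on this dimension
    split = (pts[mid - 1][feature] + pts[mid][feature]) / 2.0
    return KDNode(feature=feature, value=split,
                  left=build_kdtree(pts[:mid], depth + 1),
                  right=build_kdtree(pts[mid:], depth + 1))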
K-D Trees: Classification
[Figure: k-d tree for the credit data. Root test: R > 0.825? Second-level tests: L > 17.5? and L > 9? Third-level tests: R > 0.6?, R > 0.75?, R > 1.175?, R > 1.025? Leaves labeled: Poor, Good, Good, Poor, Good, Good, Poor, Good]
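Classification then descends one root-to-leaf path, making one comparison per level; a sketch using the build_kdtree node structure above:

def kdtree_classify(node, point):
    # Follow one path from the root: one feature comparison per level,
    # so O(log n) comparisons for a balanced tree.
    while node.leaf is None:
        node = node.right if point[node.feature] > node.value else node.left
    # This returns the label of the instance in the query's cell -- a fast
    # approximation; an exact nearest-neighbor search would also check
    # neighboring cells whose boundaries lie closer than the best match found.
    return node.leaf[2]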
Efficient Implementation: Parallel Hardware • Classification cost: – # distance computations • Constant time with O(n) processors (one per stored instance) – Cost of finding the closest: • Compute pairwise minima, successively (a reduction tree) • O(log n) time
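A sequential Python stand-in for that reduction (purely illustrative; each pass corresponds to one parallel step, so about log2(n) steps suffice):

def tournament_min(distances):
    # Repeatedly take pairwise minima; each pass halves the list.
    # On parallel hardware every pass is one time step, giving O(log n) total.
    values = list(distances)
    while len(values) > 1:
        paired = [min(values[i], values[i + 1])
                  for i in range(0, len(values) - 1, 2)]
        if len(values) % 2 == 1:          # odd element carries over to the next round
            paired.append(values[-1])
        values = paired
    return values[0]

print(tournament_min([15.0, 10.3, 10.1, 5.2, 1.3, 4.0]))   # -> 1.3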
Nearest Neighbor: Analysis • Issue: – What features should we use? • E.g. Credit rating: Many possible features – Tax bracket, debt burden, retirement savings, etc. – Nearest neighbor uses ALL – Irrelevant feature(s) could mislead • Fundamental problem with nearest neighbor
Nearest Neighbor: Advantages • Fast training: – Just record feature vector - output value set • Can model wide variety of functions – Complex decision boundaries – Weak inductive bias • Very generally applicable
Summary: Nearest Neighbor • Nearest neighbor: – Training: record input vectors + output value – Prediction: closest training instance to new data • Efficient implementations • Pros: fast training, very general, little bias • Cons: distance metric (scaling), sensitivity to noise & extraneous features
Learning: Perceptrons Artificial Intelligence CSPP 56553 February 4, 2004
Agenda • Neural Networks: – Biological analogy • Perceptrons: Single layer networks • Perceptron training • Perceptron convergence theorem • Perceptron limitations • Conclusions
Neurons: The Concept
[Figure: a neuron, with dendrites, nucleus, cell body, and axon labeled]
• Neurons:
  – Receive inputs from other neurons (via synapses)
  – When input exceeds threshold, “fires”
  – Sends output along axon to other neurons
• Brain: 10^11 neurons, 10^16 synapses
Artificial Neural Nets • Simulated Neuron: – Node connected to other nodes via links • Links play the role of axon + synapse • Links associated with weight (like synapse) – Multiplied by output of node – Node combines input via activation function • E.g. sum of weighted inputs passed thru threshold • Simpler than real neuronal processes
Artificial Neural Net
[Figure: inputs x, each multiplied by a weight w, summed, then passed through a threshold to produce the output]
Perceptrons • Single neuron-like element – Binary inputs – Binary outputs • Weighted sum of inputs > threshold
Perceptron Structure
[Figure: inputs x_0 = 1, x_1, x_2, ..., x_n connect to output y through weights w_0, w_1, ..., w_n]
  y = 1 if Σ_{i=0}^{n} w_i x_i > 0
      0 otherwise
• The x_0 w_0 term compensates for the threshold
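That decision rule translates directly into code; a minimal Python sketch (x[0] is the constant 1 input, so w[0] absorbs the threshold):

def perceptron_output(w, x):
    # Fire (output 1) when the weighted sum of inputs exceeds 0;
    # x[0] is fixed at 1, so w[0] plays the role of the threshold.
    total = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if total > 0 else 0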
Perceptron Example
• Logical-OR: Linearly separable
  – 00: 0; 01: 1; 10: 1; 11: 1
[Figure: the OR function plotted in the (x1, x2) plane, shown twice; the single 0 point can be separated from the + points by a straight line]
Perceptron Convergence Procedure • Straight-forward training procedure – Learns linearly separable functions • Until perceptron yields correct output for all examples: – If the perceptron is correct, do nothing – If the perceptron is wrong, • If it incorrectly says “yes”, – Subtract input vector from weight vector • Otherwise, add input vector to weight vector
Perceptron Convergence Example
• LOGICAL-OR:
  Sample  x0  x1  x2  Desired Output
  1       1   0   0   0
  2       1   0   1   1
  3       1   1   0   1
  4       1   1   1   1
• Initial: w = (0 0 0); Pass 1: after S2, w = w + s2 = (1 0 1)
• Pass 2: S1: w = w - s1 = (0 0 1); S3: w = w + s3 = (1 1 1)
• Pass 3: S1: w = w - s1 = (0 1 1)
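A sketch of the full procedure run on these OR samples (it reuses the perceptron_output sketch shown earlier, reproduces the weight updates traced on this slide, and stops once a complete pass makes no mistakes):

# OR samples: (x0 = 1, x1, x2) with desired outputs
samples = [((1, 0, 0), 0), ((1, 0, 1), 1), ((1, 1, 0), 1), ((1, 1, 1), 1)]
w = [0, 0, 0]

converged = False
while not converged:
    converged = True
    for x, desired in samples:
        out = perceptron_output(w, x)
        if out == desired:
            continue                       # correct: do nothing
        converged = False
        if out == 1:                       # incorrectly said "yes"
            w = [wi - xi for wi, xi in zip(w, x)]
        else:                              # incorrectly said "no"
            w = [wi + xi for wi, xi in zip(w, x)]

print(w)   # ends at a separating weight vector, (0, 1, 1), matching the trace above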
Perceptron Convergence Theorem
• If there exists a weight vector v such that
    y = 1 if Σ_{i=0}^{n} v_i x_i > 0, else 0
  gives the correct output on every training example, perceptron training will find such a vector.
• Proof sketch (assume ||v|| = 1, v·x > δ for every positive example x, and ||x|| bounded; negative examples can be folded in by negating them):
  – After k mistakes, w = x_1 + x_2 + ... + x_k, so v·w > kδ
  – ||w||^2 grows by at most ||x||^2 per mistake: ||w + x||^2 <= ||w||^2 + ||x||^2, so ||w||^2 <= k ||x||^2
  – Combining: 1 >= v·w / ||w|| > kδ / (sqrt(k) ||x||) = sqrt(k) δ / ||x||
  – So the procedure converges after k <= O(1/δ^2) mistakes
Perceptron Learning
• Perceptrons learn linear decision boundaries
  – E.g. a pattern of 0s and +s separable by a line, but not XOR
[Figure: left, 0s and +s in the (x1, x2) plane separated by a line; right, the XOR pattern, which no single line separates]
• Why no weights work for XOR (inputs coded as ±1, threshold 0):
  X1   X2   Required constraint
  -1   -1   w1x1 + w2x2 < 0               (i.e. w1 + w2 > 0)
   1   -1   w1x1 + w2x2 > 0   => with the first row, implies w1 > 0
   1    1   w1x1 + w2x2 > 0 then follows, but XOR(1,1) should be false
  -1    1   w1x1 + w2x2 > 0   => with the first row, implies w2 > 0
Perceptron Example • Digit recognition – Assume display= 8 lightable bars – Inputs – on/off + threshold – 65 steps to recognize “8”
Perceptron Summary • Motivated by neuron activation • Simple training procedure • Guaranteed to converge – IF linearly separable
Neural Nets
• Multi-layer perceptrons
  – Inputs: real-valued
  – Intermediate “hidden” nodes
  – Output(s): one (or more) discrete-valued
[Figure: feed-forward network with inputs X1–X4, two hidden layers, and outputs Y1, Y2]
Neural Nets • Pro: More general than perceptrons – Not restricted to linear discriminants – Multiple outputs: one classification each • Con: No simple, guaranteed training procedure – Use greedy, hill-climbing procedure to train – “Gradient descent”, “Backpropagation”
Solving the XOR Problem
• Network topology: 2 hidden nodes (o1, o2), 1 output (y); each node also receives a constant -1 bias input
• Desired behavior:
  x1  x2  o1  o2  y
  0   0   0   0   0
  1   0   0   1   1
  0   1   0   1   1
  1   1   1   1   0
• Weights:
  w11 = w12 = 1; w21 = w22 = 1
  w01 = 3/2; w02 = 1/2; w03 = 1/2
  w13 = -1; w23 = 1
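A quick Python check of this network (a sketch assuming the bias weights w01, w02, w03 multiply the constant -1 inputs from the figure; it reproduces the o1, o2, y columns above):

def step(z):
    return 1 if z > 0 else 0

def xor_net(x1, x2):
    # Hidden layer: o1 behaves like AND (threshold 3/2), o2 like OR (threshold 1/2)
    o1 = step(1 * x1 + 1 * x2 - 3 / 2)     # w11 = w12 = 1, bias weight w01 = 3/2
    o2 = step(1 * x1 + 1 * x2 - 1 / 2)     # w21 = w22 = 1, bias weight w02 = 1/2
    # Output: OR but not AND, i.e. XOR
    y = step(-1 * o1 + 1 * o2 - 1 / 2)     # w13 = -1, w23 = 1, bias weight w03 = 1/2
    return o1, o2, y

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor_net(x1, x2))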
Neural Net Applications • Speech recognition • Handwriting recognition • NETtalk: Letter-to-sound rules • ALVINN: Autonomous driving
ALVINN • Driving as a neural network • Inputs: – Image pixel intensities • I.e. lane lines • 5 Hidden nodes • Outputs: – Steering actions • E.g. turn left/right; how far • Training: – Observe human behavior: sample images, steering
Backpropagation • Greedy, Hill-climbing procedure – Weights are parameters to change – Original hill-climb changes one parameter/step • Slow – If smooth function, change all parameters/step • Gradient descent – Backpropagation: Computes current output, works backward to correct error
Producing a Smooth Function • Key problem: – Pure step threshold is discontinuous • Not differentiable • Solution: – Sigmoid (squashed ‘s’ function): Logistic fn
  z = Σ_i w_i x_i
  s(z) = 1 / (1 + e^{-z})
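In code, the squashing function and its derivative (the derivative identity s'(z) = s(z)(1 - s(z)) is a standard fact, not stated on the slide, but it is what gradient-based training relies on):

import math

def sigmoid(z):
    # Logistic function: smooth, differentiable replacement for the step threshold
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    # d s(z)/dz = s(z) * (1 - s(z)); needed when computing gradients
    s = sigmoid(z)
    return s * (1.0 - s)

def unit_output(w, x):
    # z = sum_i w_i * x_i, then squash
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))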
Neural Net Training • Goal: – Determine how to change weights to get correct output • Make larger changes to the weights that produce larger reductions in error • Approach: • Compute actual output: o • Compare to desired output: d • Determine effect of each weight w on error = d - o • Adjust weights
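As an illustrative sketch only (not the slide's own algorithm): one gradient-descent update for a single sigmoid unit under squared error, using the sigmoid helpers above; the learning rate lr and the squared-error loss are assumptions:

def train_step(w, x, d, lr=0.5):
    # Forward pass: compute actual output o
    z = sum(wi * xi for wi, xi in zip(w, x))
    o = sigmoid(z)
    # Error between desired and actual output
    error = d - o
    # For squared error, the gradient w.r.t. each weight is -(d - o) * s'(z) * x_i,
    # so stepping against the gradient adjusts each weight in proportion to its
    # effect on the error.
    grad_z = error * sigmoid_derivative(z)
    return [wi + lr * grad_z * xi for wi, xi in zip(w, x)]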