Learning From Data, Lecture 5: Training Versus Testing
- The Two Questions of Learning
- Theory of Generalization ($E_{\text{in}} \approx E_{\text{out}}$)
- An Effective Number of Hypotheses
- A Combinatorial Puzzle
M. Magdon-Ismail, CSCI 4100/6100
Recap: The Two Questions of Learning
1. Can we make sure that $E_{\text{out}}(g)$ is close enough to $E_{\text{in}}(g)$?
2. Can we make $E_{\text{in}}(g)$ small enough?

The Hoeffding generalization bound:
$$E_{\text{out}}(g) \;\le\; \underbrace{E_{\text{in}}(g)}_{\text{in-sample error}} \;+\; \underbrace{\sqrt{\tfrac{1}{2N}\ln\tfrac{2|\mathcal{H}|}{\delta}}}_{\text{error bar (model complexity)}}$$

$E_{\text{in}}$: training (e.g. the practice exam). $E_{\text{out}}$: testing (e.g. the real exam).
There is a tradeoff when picking $|\mathcal{H}|$: a richer $\mathcal{H}$ can lower $E_{\text{in}}$ but raises the model-complexity term.
(Figure: error versus $|\mathcal{H}|$: the in-sample error falls, the model-complexity term rises, and $E_{\text{out}}$ is minimized at some $|\mathcal{H}|^*$.)
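A quick numeric sketch of this error bar (the function name and the choice $\delta = 0.05$ are illustrative, not from the lecture): for a fixed $N$, enlarging the hypothesis set widens the bar, which is the tradeoff described above.

```python
import numpy as np

def hoeffding_error_bar(N, M, delta=0.05):
    """Error bar sqrt((1/(2N)) * ln(2M/delta)) for a finite hypothesis set of size M."""
    return np.sqrt(np.log(2 * M / delta) / (2 * N))

# Larger |H| (here M) inflates the bar for the same amount of data N.
for M in [1, 100, 10**4, 10**6]:
    print(f"M = {M:>7d}   error bar = {hoeffding_error_bar(N=1000, M=M):.3f}")
```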
What Will The Theory of Generalization Achieve?
The finite-hypothesis bound
$$E_{\text{out}}(g) \;\le\; E_{\text{in}}(g) + \sqrt{\tfrac{1}{2N}\ln\tfrac{2|\mathcal{H}|}{\delta}}$$
will be replaced by a bound of the same shape,
$$E_{\text{out}}(g) \;\le\; E_{\text{in}}(g) + \sqrt{\tfrac{8}{N}\ln\tfrac{4\, m_{\mathcal{H}}(2N)}{\delta}},$$
in which the model-complexity term depends on the growth function $m_{\mathcal{H}}$ instead of $|\mathcal{H}|$.
The new bound will be applicable to infinite $\mathcal{H}$.
Why is $|\mathcal{H}|$ Overkill?
How did $|\mathcal{H}|$ come in? Define the bad events
$$\mathcal{B}_m = \{\,|E_{\text{out}}(h_m) - E_{\text{in}}(h_m)| > \epsilon\,\}, \qquad \mathcal{B}_g = \{\,|E_{\text{out}}(g) - E_{\text{in}}(g)| > \epsilon\,\}.$$
We do not know which $h_m$ ends up as $g$, so we use a worst-case union bound:
$$\mathbb{P}[\mathcal{B}_g] \;\le\; \mathbb{P}[\text{any } \mathcal{B}_m] \;\le\; \sum_{m=1}^{|\mathcal{H}|} \mathbb{P}[\mathcal{B}_m].$$
(Figure: overlapping bad events $\mathcal{B}_1, \mathcal{B}_2, \mathcal{B}_3$ drawn as intersecting regions.)
- The $\mathcal{B}_m$ are events (sets of outcomes); they can overlap.
- If the $\mathcal{B}_m$ overlap, the union bound is loose.
- If many $h_m$ are similar, their $\mathcal{B}_m$ overlap.
- There are "effectively" fewer than $|\mathcal{H}|$ hypotheses.
- We can replace $|\mathcal{H}|$ by something smaller.
$|\mathcal{H}|$ fails to account for similarity between hypotheses.
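A small simulation sketch of why the union bound is loose when the bad events overlap (all names and numbers here are illustrative): in the extreme case where the $M$ hypotheses are identical, $\mathbb{P}[\text{any }\mathcal{B}_m]$ does not grow with $M$, but the union bound does.

```python
import numpy as np

rng = np.random.default_rng(0)
N, eps, M, trials = 100, 0.1, 50, 20_000

# One hypothesis with true error E_out = 0.5; its bad event is |E_in - E_out| > eps.
E_in = rng.binomial(N, 0.5, size=trials) / N
p_bad = np.mean(np.abs(E_in - 0.5) > eps)

# If all M hypotheses are copies of this one, their bad events coincide:
print("P[any B_m]  =", p_bad)                  # unchanged by M
print("union bound =", min(1.0, M * p_bad))    # grows linearly with M
```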
Measuring the Diversity (Size) of $\mathcal{H}$
We need a way to measure the diversity of $\mathcal{H}$. A simple idea: fix any set of $N$ data points. If $\mathcal{H}$ is diverse, it should be able to implement all functions on these $N$ points.
A Data Set Reveals the True Colors of an $\mathcal{H}$
(Figure: a hypothesis set $\mathcal{H}$, drawn as a collection of many hypotheses.)
A Data Set Reveals the True Colors of an $\mathcal{H}$
(Figure: the same $\mathcal{H}$, now seen through the eyes of the data set $\mathcal{D}$; only the values on the data points are visible.)
A Data Set Reveals the True Colors of an $\mathcal{H}$
From the point of view of $\mathcal{D}$, the entire $\mathcal{H}$ is just one dichotomy: every $h \in \mathcal{H}$ produces the same $\pm 1$ pattern on the data points.
An Effective Number of Hypotheses
If $\mathcal{H}$ is diverse, it should be able to implement many dichotomies; $|\mathcal{H}|$ only captures the maximum possible diversity of $\mathcal{H}$.
Consider an $h \in \mathcal{H}$ and a data set $\mathbf{x}_1, \dots, \mathbf{x}_N$. Each $h$ gives us an $N$-tuple of $\pm 1$'s,
$$(h(\mathbf{x}_1), \dots, h(\mathbf{x}_N)),$$
a dichotomy of the inputs.
- If $\mathcal{H}$ is diverse, we get many different dichotomies.
- If $\mathcal{H}$ contains similar functions, we only get a few dichotomies.
The growth function quantifies this.
The Growth Function $m_{\mathcal{H}}(N)$
Define the restriction of $\mathcal{H}$ to the inputs $\mathbf{x}_1, \dots, \mathbf{x}_N$ (the set of dichotomies induced by $\mathcal{H}$):
$$\mathcal{H}(\mathbf{x}_1, \dots, \mathbf{x}_N) = \{\,(h(\mathbf{x}_1), \dots, h(\mathbf{x}_N)) \mid h \in \mathcal{H}\,\}.$$
The growth function is the size of the largest such set of dichotomies:
$$m_{\mathcal{H}}(N) = \max_{\mathbf{x}_1, \dots, \mathbf{x}_N} |\mathcal{H}(\mathbf{x}_1, \dots, \mathbf{x}_N)| \;\le\; 2^N.$$
Can we replace $|\mathcal{H}|$ by $m_{\mathcal{H}}$, an effective number of hypotheses?
- Replacing $|\mathcal{H}|$ with $2^N$ in the error bar $\sqrt{\tfrac{1}{2N}\ln\tfrac{2|\mathcal{H}|}{\delta}}$ is no help. (why?)
- We want $m_{\mathcal{H}}(N) \le \text{poly}(N)$ to get a useful error bar.
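The restriction $\mathcal{H}(\mathbf{x}_1, \dots, \mathbf{x}_N)$ can be computed directly for simple hypothesis sets. A minimal sketch (the helper name and the tiny positive-ray example are illustrative, not from the lecture):

```python
def dichotomies(hypotheses, X):
    """Distinct (+1/-1) patterns that the hypotheses induce on the fixed sample X."""
    return {tuple(h(x) for x in X) for h in hypotheses}

# Example: 1-D positive rays h(x) = +1 if x > w0, else -1, on N = 3 points.
X = [0.0, 1.0, 2.0]
rays = [lambda x, w0=w0: +1 if x > w0 else -1 for w0 in (-0.5, 0.5, 1.5, 2.5)]
print(len(dichotomies(rays, X)))   # 4 distinct dichotomies (= N + 1), out of 2^3 = 8
```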
Example: 2-D Perceptron Model
(Figure: dichotomies that a line can and cannot implement on 3 and 4 points.)
- On 3 points in general position, the perceptron can implement all 8 dichotomies: $m_{\mathcal{H}}(3) = 8 = 2^3$.
- On 4 points, it can implement at most 14 of the 16 dichotomies (the two "XOR" labelings are not linearly separable): $m_{\mathcal{H}}(4) = 14 < 2^4$.
What is $m_{\mathcal{H}}(5)$?
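One way to probe these numbers empirically (not part of the lecture; the function name, point sets, and the sampling approach are my own choices): sample many random weight vectors and count the distinct sign patterns on a fixed point set. Sampling only lower-bounds the count, but on these point sets it reliably finds all achievable dichotomies.

```python
import numpy as np

def count_perceptron_dichotomies(X, n_samples=200_000, seed=0):
    """Lower-bound the number of dichotomies of h(x) = sign(w0 + w1*x1 + w2*x2)
    on the points X by sampling random weight vectors."""
    rng = np.random.default_rng(seed)
    X1 = np.hstack([np.ones((len(X), 1)), X])     # prepend the bias coordinate
    W = rng.normal(size=(n_samples, 3))           # random (w0, w1, w2)
    patterns = np.where(X1 @ W.T > 0, 1, -1).T    # one label pattern per sampled w
    return len(np.unique(patterns, axis=0))

X3 = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
X4 = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(count_perceptron_dichotomies(X3))   # 8  = 2^3
print(count_perceptron_dichotomies(X4))   # 14 < 2^4 (the two XOR labelings never appear)
```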
Example: 1-D Positive Ray Model
$h(x) = \text{sign}(x - w_0)$: points to the right of $w_0$ are $+1$, points to the left are $-1$.
- Consider $N$ points $x_1 < x_2 < \cdots < x_N$ on the line.
- There are $N + 1$ dichotomies, depending on which of the $N + 1$ gaps you put $w_0$ in.
- $m_{\mathcal{H}}(N) = N + 1$.
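A brute-force check of $m_{\mathcal{H}}(N) = N + 1$ (a sketch; the helper name and random test points are illustrative): it is enough to try one threshold on each side of the data and one in each gap between consecutive sorted points.

```python
import numpy as np

def positive_ray_dichotomies(x):
    """All dichotomies of 1-D points x under h(x) = sign(x - w0)."""
    x = np.sort(np.asarray(x, dtype=float))
    # One representative threshold per region: left of all points, each gap, right of all.
    thresholds = np.concatenate(([x[0] - 1.0], (x[:-1] + x[1:]) / 2.0, [x[-1] + 1.0]))
    return {tuple(np.where(x > w0, 1, -1)) for w0 in thresholds}

x = np.random.default_rng(1).uniform(size=10)
print(len(positive_ray_dichotomies(x)))   # N + 1 = 11
```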
Example: Positive Rectangles in 2-D
($h(\mathbf{x}) = +1$ iff $\mathbf{x}$ lies inside an axis-aligned rectangle, $-1$ otherwise.)
(Figure: two panels with $N = 4$ and $N = 5$ points.)
- $N = 4$: for four suitably placed points, $\mathcal{H}$ implements all dichotomies, so $m_{\mathcal{H}}(4) = 2^4$.
- $N = 5$: some point always lies inside the bounding rectangle of the others, so the dichotomy that labels the others $+1$ and that point $-1$ cannot be implemented, and $m_{\mathcal{H}}(5) < 2^5$.
We have not computed $m_{\mathcal{H}}(5)$ exactly – not impossible, but tricky.
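A brute-force sketch for axis-aligned positive rectangles (the point configurations and helper name are illustrative assumptions): four points arranged in a "diamond" give all $2^4$ dichotomies, and adding a fifth point inside their bounding box makes some dichotomies impossible.

```python
import numpy as np
from itertools import combinations

def rectangle_dichotomies(P):
    """Dichotomies of points P under h(x) = +1 iff x lies in some axis-aligned rectangle."""
    P = np.asarray(P, dtype=float)
    def cuts(v):
        # Candidate boundaries: one value outside each end and one between consecutive points.
        v = np.sort(v)
        return np.concatenate(([v[0] - 1], (v[:-1] + v[1:]) / 2, [v[-1] + 1]))
    xs, ys = cuts(P[:, 0]), cuts(P[:, 1])
    dich = set()
    for x_lo, x_hi in combinations(xs, 2):
        for y_lo, y_hi in combinations(ys, 2):
            inside = (P[:, 0] > x_lo) & (P[:, 0] < x_hi) & (P[:, 1] > y_lo) & (P[:, 1] < y_hi)
            dich.add(tuple(np.where(inside, 1, -1)))
    return dich

diamond = [(0, 1), (0, -1), (-1, 0), (1, 0)]
print(len(rectangle_dichotomies(diamond)))               # 16 = 2^4
print(len(rectangle_dichotomies(diamond + [(0, 0)])))    # < 2^5 = 32
```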
Example Growth Functions

  N                      1   2   3   4    5      ...
  2-D perceptron         2   4   8   14   ...
  1-D pos. ray           2   3   4   5    ...
  2-D pos. rectangles    2   4   8   16   <2^5   ...

- $m_{\mathcal{H}}(N)$ drops below $2^N$ – there is hope for the generalization bound.
- A break point is any $n$ for which $m_{\mathcal{H}}(n) < 2^n$.
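To see why a break point matters, compare the error bar when $m_{\mathcal{H}}$ grows polynomially versus exponentially. A sketch using the previewed bound (constants as reconstructed earlier; the specific growth functions and $\delta = 0.05$ are illustrative):

```python
import numpy as np

def error_bar(N, log_m_2N, delta=0.05):
    """sqrt((8/N) * (ln 4 + ln m_H(2N) - ln delta)); ln m_H(2N) is passed in to avoid overflow."""
    return np.sqrt(8.0 / N * (np.log(4.0) + log_m_2N - np.log(delta)))

for N in [100, 1_000, 10_000, 100_000]:
    poly = error_bar(N, log_m_2N=np.log(2 * N + 1))       # polynomial growth, e.g. m_H(n) = n + 1
    expo = error_bar(N, log_m_2N=2 * N * np.log(2.0))     # no break point: m_H(n) = 2^n
    print(f"N = {N:>6d}   polynomial m_H: {poly:.3f}   m_H = 2^n: {expo:.3f}")
```

With a polynomial growth function the bar shrinks as $N$ grows; with $m_{\mathcal{H}}(n) = 2^n$ it stays bounded away from zero.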
A Combinatorial Puzzle
A set of dichotomies on 3 points:

  x1  x2  x3
  ◦   ◦   ◦
  ◦   ◦   •
  ◦   •   ◦
  ◦   •   •
A Combinatorial Puzzle
The same set of dichotomies:

  x1  x2  x3
  ◦   ◦   ◦
  ◦   ◦   •
  ◦   •   ◦
  ◦   •   •

Two points are shattered: the pair $(x_2, x_3)$ takes on all four patterns $(\circ,\circ), (\circ,\bullet), (\bullet,\circ), (\bullet,\bullet)$.
A Combinatorial Puzzle
Another set of dichotomies on 3 points:

  x1  x2  x3
  ◦   ◦   ◦
  ◦   ◦   •
  ◦   •   ◦
  •   ◦   ◦

No pair of points is shattered: every pair of columns is missing the pattern $(\bullet, \bullet)$.
A Combinatorial Puzzle
For $N = 3$, four dichotomies is the maximum possible with no 2 points shattered:

  x1  x2  x3
  ◦   ◦   ◦
  ◦   ◦   •
  ◦   •   ◦
  •   ◦   ◦

If $N = 4$, how many dichotomies can you write down with no 2 points shattered?

  x1  x2  x3  x4
  ◦   ◦   ◦   ◦
  ◦   ◦   ◦   •
  ...
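The puzzle can be attacked by brute force: enumerate subsets of the $2^4 = 16$ possible dichotomies on 4 points and keep the largest one in which no pair of points is shattered, i.e. no two columns show all four patterns. A sketch (exhaustive search over all $2^{16}$ subsets; slow-ish but fine at this size, and the answer is left to the run):

```python
from itertools import product, combinations

points = 4
all_dichotomies = list(product((+1, -1), repeat=points))   # the 16 possible dichotomies

def no_pair_shattered(subset):
    """True if no two of the points see all four patterns (+,+), (+,-), (-,+), (-,-)."""
    for i, j in combinations(range(points), 2):
        if len({(d[i], d[j]) for d in subset}) == 4:
            return False
    return True

best = 0
for mask in range(1 << len(all_dichotomies)):               # every subset of the 16 dichotomies
    subset = [d for b, d in enumerate(all_dichotomies) if mask >> b & 1]
    if len(subset) > best and no_pair_shattered(subset):
        best = len(subset)
print("max dichotomies on 4 points with no 2 points shattered:", best)
```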