Fusion of Continuous Output Classifiers
Jacob Hays, Amit Pillay, James DeFelice
Definitions
• x – feature vector
• c – number of classes
• L – number of classifiers
• {ω_1, ω_2, …, ω_c} – Set of class labels
• {D_1, D_2, …, D_L} – Set of classifiers
  ▫ All c outputs from D_i are in the interval [0, 1]
• DP(x) – Decision Profile matrix:

  DP(x) = \begin{bmatrix}
    d_{1,1}(x) & \cdots & d_{1,j}(x) & \cdots & d_{1,c}(x) \\
    \vdots     &        & \vdots     &        & \vdots     \\
    d_{i,1}(x) & \cdots & d_{i,j}(x) & \cdots & d_{i,c}(x) \\
    \vdots     &        & \vdots     &        & \vdots     \\
    d_{L,1}(x) & \cdots & d_{L,j}(x) & \cdots & d_{L,c}(x)
  \end{bmatrix}
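A minimal sketch (not from the slides) of assembling DP(x), assuming NumPy and scikit-learn-style base classifiers that expose `predict_proba`:

```python
import numpy as np

def decision_profile(classifiers, x):
    """Stack the soft outputs d_{i,j}(x) of L classifiers into an L x c matrix DP(x)."""
    # Each classifier is assumed to return a length-c vector of supports in [0, 1].
    x = np.asarray(x, dtype=float).reshape(1, -1)
    return np.vstack([clf.predict_proba(x)[0] for clf in classifiers])
```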
Approaches
• Class Conscious – use one column of DP(x) at a time
  ▫ Ex) simple/weighted averages
• Class Indifferent – treat DP(x) as a whole as a new feature space; use a new classifier on it to make the final decision
Discriminant to Continuous
• Non-continuous classifiers produce a label from discriminant scores
• {g_1(x), g_2(x), …, g_c(x)} – outputs of D
  ▫ We would like to normalize these to the [0, 1] interval
• {g'_1(x), g'_2(x), …, g'_c(x)}, where \sum_{j=1}^{c} g'_j(x) = 1
• Softmax method normalizes to [0, 1]:

  g'_j(x) = \frac{\exp\{g_j(x)\}}{\sum_{k=1}^{c} \exp\{g_k(x)\}}

• Even better if g'_j(x) can be interpreted as a probability
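A small sketch of the softmax normalization above; subtracting the maximum is an added numerical-stability detail, not part of the slide:

```python
import numpy as np

def softmax(g):
    """Softmax normalization of discriminant scores g_1..g_c into supports in [0, 1] that sum to 1."""
    g = np.asarray(g, dtype=float)
    e = np.exp(g - g.max())   # shift by the max so exp() cannot overflow
    return e / e.sum()
```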
Converting Linear Discriminant
• Assuming normal densities, g_j(x) = \log\{P(\omega_j)\, p(x \mid \omega_j)\} (up to constant additive terms)
• Let C be the constant additive terms we drop and A = \exp\{C\}; then

  P(\omega_j)\, p(x \mid \omega_j) = A \exp\{g_j(x)\}

• Plug into Bayes' rule; the A cancels and it simplifies to the softmax function:

  P(\omega_j \mid x) = \frac{A \exp\{g_j(x)\}}{\sum_{k=1}^{c} A \exp\{g_k(x)\}} = \frac{\exp\{g_j(x)\}}{\sum_{k=1}^{c} \exp\{g_k(x)\}}
Neural Networks
• Consider a NN with c outputs, {y_1, …, y_c}
• When trained using squared error, the outputs can be used as an approximation of the posterior probabilities
• Normalize them to the [0, 1] interval using the softmax function
• The normalization is independent of the neural network training; it is applied only to the outputs
Laplace Estimator for Decision Trees
• In decision trees, you use entropy to split the distribution based on a single feature per level
• Normally, you continue to split until there is a single class in each leaf of the tree
• In Probability Estimating Trees, instead of splitting until a single class remains in a leaf, you split until around K points are in each leaf, and use various methods to estimate the probability of each class at each leaf
Count-Based Probability, Laplace
• {k_1, k_2, …, k_c} – number of sample points of classes {ω_1, ω_2, …, ω_c}, respectively, in a leaf
• K = k_1 + k_2 + … + k_c
• Maximum likelihood (ML) estimate:

  \hat{P}(\omega_j \mid x) = \frac{k_j}{K}, \quad j = 1, \ldots, c

• When K is too small, the estimates are unreliable
Laplace Estimator
• Laplace correction:

  \hat{P}(\omega_j \mid x) = \frac{k_j + 1}{K + c}

• m-estimation:

  \hat{P}(\omega_j \mid x) = \frac{k_j + m \times \hat{P}(\omega_j)}{K + m}

• It is best to set m so that m \times \hat{P}(\omega_j) \approx 10
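Minimal sketches of the two corrections, assuming `counts` holds the leaf counts k_1..k_c; the choice of m is left to the caller, following the slide's rule of thumb:

```python
import numpy as np

def laplace_estimate(counts):
    """Laplace correction: (k_j + 1) / (K + c) for each class in the leaf."""
    counts = np.asarray(counts, dtype=float)
    return (counts + 1.0) / (counts.sum() + len(counts))

def m_estimate(counts, priors, m):
    """m-estimation: (k_j + m * P_hat(w_j)) / (K + m).
    The slide suggests choosing m so that m * P_hat(w_j) is roughly 10."""
    counts = np.asarray(counts, dtype=float)
    priors = np.asarray(priors, dtype=float)
    return (counts + m * priors) / (counts.sum() + m)
```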
Ting and Witten Laplace Estimator
• Ting and Witten
  ▫ ω* is the majority class in the leaf

  \hat{P}(\omega_j \mid x) =
  \begin{cases}
    1 - \dfrac{\left(\sum_{l \neq j} k_l\right) + 1}{K + 2}, & \text{if } \omega_j = \omega^* \\[2ex]
    \left[1 - \hat{P}(\omega^* \mid x)\right] \dfrac{k_j}{\sum_{l \neq *} k_l}, & \text{otherwise}
  \end{cases}
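A sketch reconstructed from the formula above; the even spread over the other classes when the leaf is pure (no non-majority points) is an added assumption:

```python
import numpy as np

def ting_witten_estimate(counts):
    """Leaf probabilities following the majority-class formula on the slide."""
    counts = np.asarray(counts, dtype=float)
    K = counts.sum()
    star = int(counts.argmax())                     # index of the majority class omega*
    p = np.empty_like(counts)
    p[star] = 1.0 - ((K - counts[star]) + 1.0) / (K + 2.0)
    rest = K - counts[star]                          # total count of non-majority classes
    others = np.arange(len(counts)) != star
    if rest > 0:
        p[others] = (1.0 - p[star]) * counts[others] / rest
    else:
        # Pure leaf: spread the leftover mass evenly (assumption, not from the slide).
        p[others] = (1.0 - p[star]) / (len(counts) - 1)
    return p
```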
Weighted Distance Laplace Estimate
• Take the average distance from x to all samples of class ω_j, over the average distance from x to all samples:

  \hat{P}(\omega_j \mid x) = \frac{\frac{1}{k_j} \sum_{x^{(j)} \in \omega_j} d(x, x^{(j)})}{\frac{1}{K} \sum_{i=1}^{K} d(x, x^{(i)})}
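A direct transcription of the ratio described in the bullet above, assuming Euclidean distance; whether "all samples" means the samples in the leaf or the whole training set, and whether a further normalization across classes is intended, is not specified on the slide:

```python
import numpy as np

def distance_based_estimate(x, X, y, target_class):
    """Average distance from x to the samples of target_class, over the average distance to all samples."""
    X = np.asarray(X, dtype=float)
    d = np.linalg.norm(X - np.asarray(x, dtype=float), axis=1)   # Euclidean distance (assumption)
    return d[np.asarray(y) == target_class].mean() / d.mean()
```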
Example
Class Conscious Combiners
• Non-trainable combiners
  ▫ No extra parameters; everything is defined up front
  ▫ A function of the classifier outputs for a specific class:

  \mu_j(x) = F[d_{1,j}(x), d_{2,j}(x), \ldots, d_{L,j}(x)]

• Simple mean:

  \mu_j(x) = \frac{1}{L} \sum_{i=1}^{L} d_{i,j}(x)
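A minimal sketch of the simple mean combiner over a decision profile built as earlier, plus the resulting label decision (taking the argmax is the usual final step, not stated on this slide):

```python
import numpy as np

def simple_mean_combiner(dp):
    """Simple mean: mu_j(x) = (1/L) * sum_i d_{i,j}(x), one support per class."""
    mu = np.asarray(dp, dtype=float).mean(axis=0)
    return int(mu.argmax()), mu   # predicted class index and the class supports
```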
Popular Class Conscious Combiners
• Minimum/Maximum/Median, e.g.:

  \mu_j(x) = \max_i \{d_{i,j}(x)\}

• Trimmed mean:
  ▫ The L degrees of support are sorted, X percent of the values are dropped, and the mean is taken of the remaining
• Product:

  \mu_j(x) = \prod_{i=1}^{L} d_{i,j}(x)
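A sketch of these non-trainable rules applied column-wise; splitting the X percent evenly between the two ends of the sorted supports is an assumption, since the slide only says X percent of values are dropped:

```python
import numpy as np

def combine_columns(dp, rule="max", trim_percent=20):
    """Apply a non-trainable class-conscious combiner to each column of the L x c decision profile."""
    dp = np.asarray(dp, dtype=float)
    if rule == "min":
        return dp.min(axis=0)
    if rule == "max":
        return dp.max(axis=0)
    if rule == "median":
        return np.median(dp, axis=0)
    if rule == "product":
        return dp.prod(axis=0)
    if rule == "trimmed_mean":
        # Assumption: drop trim_percent/2 of the values from each end of the sorted supports.
        k = int(round(dp.shape[0] * trim_percent / 200.0))
        return np.sort(dp, axis=0)[k:dp.shape[0] - k].mean(axis=0)
    raise ValueError(f"unknown rule: {rule}")
```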
Generalized Mean Function

  \mu_j(x, \alpha) = \left( \frac{1}{L} \sum_{i=1}^{L} d_{i,j}(x)^{\alpha} \right)^{1/\alpha}

• The generalized mean is defined as above, except for the following special cases:
  ▫ α → -∞: minimum
  ▫ α = -1: harmonic mean
  ▫ α = 0: geometric mean, \mu_j(x) = \left( \prod_{i=1}^{L} d_{i,j}(x) \right)^{1/L}
  ▫ α = 1: simple arithmetic mean
  ▫ α → +∞: maximum
• α is chosen beforehand; it sets the level of optimism
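A sketch of the generalized mean with the limiting cases handled explicitly; the small clipping before the logarithm is an added numerical guard (zero supports would otherwise break the geometric and negative-α cases), not part of the slide:

```python
import numpy as np

def generalized_mean(dp, alpha):
    """Generalized mean combiner over the columns of DP(x)."""
    dp = np.asarray(dp, dtype=float)
    if np.isneginf(alpha):
        return dp.min(axis=0)                      # alpha -> -inf: minimum
    if np.isposinf(alpha):
        return dp.max(axis=0)                      # alpha -> +inf: maximum
    d = np.clip(dp, 1e-12, None)                   # guard against zero supports
    if alpha == 0:
        return np.exp(np.log(d).mean(axis=0))      # geometric mean
    return np.power(np.power(d, alpha).mean(axis=0), 1.0 / alpha)
```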
Class Conscious Combiner Example
Example: Effect of Optimism α
• 100 training/test sets
  ▫ Training set (a): 200 samples
  ▫ Testing set (b): 1000 samples
• For each ensemble:
  ▫ 10 bootstrap samples (200 values)
  ▫ Train a classifier on each (Parzen)
Example: Effect of Optimism α
• Generalized mean evaluated for:
  ▫ -50 <= α <= +50, in steps of 1
  ▫ -1 <= α <= +1, in steps of 0.1
• The simple mean combiner gives the best result
Interpreting Results
• The mean combiner isn't always the best
• The shape of the error curve depends upon:
  ▫ The problem
  ▫ The base classifier used
• Average and product are the most intensely studied combiners
  ▫ For some problems, average may be...
    - Less accurate, but
    - More stable
Ordered Weighted Averaging
• Generalized, non-trainable combiner
• L coefficients (one for each classifier)
• Sort the classifier outputs for class ω_j in descending order
• Multiply by a vector of coefficients b (weights)
  ▫ i_1, …, i_L is a permutation of the indices 1, …, L such that d_{i_1,j}(x) ≥ … ≥ d_{i_L,j}(x)

  \mu_j(x) = \sum_{k=1}^{L} b_k \, d_{i_k, j}(x)
Ordered Weighted Averaging: Example
• Consider a jury assessing a sports performance (diving)
  ▫ Reduce subjective bias with a trimmed mean:
    - Drop the lowest and highest scores
    - Average the remaining

  d_j = [0.6 \;\; 0.7 \;\; 0.2 \;\; 0.6 \;\; 0.6]^T
  b = [0 \;\; \tfrac{1}{3} \;\; \tfrac{1}{3} \;\; \tfrac{1}{3} \;\; 0]^T
  \mu_j = 0 \cdot 0.7 + \tfrac{1}{3} \cdot 0.6 + \tfrac{1}{3} \cdot 0.6 + \tfrac{1}{3} \cdot 0.6 + 0 \cdot 0.2 = 0.6
Ordered Weighted Averaging
• General form of the trimmed mean:
  ▫ b = [0, 1/(L-2), 1/(L-2), …, 1/(L-2), 0]^T
• Other operations may be modeled with careful selection of b:
  ▫ Minimum: b = [0, 0, …, 1]^T
  ▫ Maximum: b = [1, 0, …, 0]^T
  ▫ Average: b = [1/L, 1/L, …, 1/L]^T
• Much effort has been spent on developing new aggregation connectives
  ▫ The bigger question: when to use which one?
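A sketch of OWA over a decision profile; `trimmed_mean_weights` builds the b vector shown above, and the minimum/maximum/average vectors can be formed the same way:

```python
import numpy as np

def owa_combine(dp, b):
    """Ordered weighted averaging: sort each column of DP(x) in descending order, then weight by b."""
    dp = np.asarray(dp, dtype=float)
    b = np.asarray(b, dtype=float)
    sorted_desc = -np.sort(-dp, axis=0)    # largest support first within each class column
    return b @ sorted_desc                 # mu_j(x) = sum_k b_k * d_{i_k, j}(x)

def trimmed_mean_weights(L):
    """b = [0, 1/(L-2), ..., 1/(L-2), 0]: drop the highest and lowest supports."""
    b = np.full(L, 1.0 / (L - 2))
    b[0] = b[-1] = 0.0
    return b
```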
Trainable Combiners
• Combiners with additional parameters to be trained
  ▫ Weighted average
  ▫ Fuzzy integral
Weighted Average
• Three groups, based on the number of weights
• L weights
  ▫ One weight per classifier:

  \mu_j(x) = \sum_{i=1}^{L} w_i \, d_{i,j}(x)

  ▫ Similar to the equation we saw for ordered weighted averaging, except here we optimize w_i (and we do not reorder the d_{i,j})
  ▫ w_i for classifier D_i is usually based on its estimated error rate
Weighted Average
• c x L weights
  ▫ Weights are specific to each class:

  \mu_j(x) = \sum_{i=1}^{L} w_{ij} \, d_{i,j}(x)

  ▫ Only the j-th column of DP(x) is used in the calculation
  ▫ Linear regression is commonly used to derive the optimal weights
  ▫ A "class conscious" combiner
Weighted Average
• c x c x L weights
  ▫ Support for each class is determined from the entire decision profile DP(x):

  \mu_j(x) = \sum_{i=1}^{L} \sum_{k=1}^{c} w_{ikj} \, d_{i,k}(x)

  ▫ A different weight space for each class ω_j
  ▫ The whole decision profile is an intermediate feature space
    - A "class indifferent" combiner
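Minimal sketches of the three weighted-average forms, assuming NumPy weight arrays of shape (L,), (L, c), and (L, c, c) respectively; the array shapes are an implementation choice, not from the slides:

```python
import numpy as np

def weighted_average_L(dp, w):
    """L weights, one per classifier: mu_j(x) = sum_i w[i] * d_{i,j}(x)."""
    return np.asarray(w, dtype=float) @ np.asarray(dp, dtype=float)

def weighted_average_cL(dp, W):
    """c x L weights (class conscious): mu_j(x) = sum_i W[i, j] * d_{i,j}(x)."""
    return (np.asarray(W, dtype=float) * np.asarray(dp, dtype=float)).sum(axis=0)

def weighted_average_ccL(dp, W):
    """c x c x L weights (class indifferent): mu_j(x) = sum_i sum_k W[i, k, j] * d_{i,k}(x)."""
    return np.einsum('ikj,ik->j', np.asarray(W, dtype=float), np.asarray(dp, dtype=float))
```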
Weighted Average: Class Conscious

  \mu_j(x) = \sum_{i=1}^{L} w_{ij} \, d_{i,j}(x)

• The d_{i,j}(x) are point estimates of P(ω_j | x)
  ▫ If the estimates are unbiased, then the combined estimate Q(x) = μ_j(x)...
    - is an unbiased, minimum-variance estimate of P(ω_j | x), conditional upon...
    - the restriction that the coefficients sum to 1:

  \sum_{i=1}^{L} w_{ij} = 1
Weighted Average: Class Conscious

  \mu_j(x) = \sum_{i=1}^{L} w_{ij} \, d_{i,j}(x)

• The weights are derived to minimize the variance of Q(x)
  ▫ The variance of Q(x) is <= the variance of any single classifier's estimate
• We assume the point estimates are unbiased
  ▫ The variance of d_{i,j}(x) equals the expected squared error of d_{i,j}(x)
• When the coefficients minimize the variance,
  ▫ Q(x) is a better estimate of P(ω_j | x) than any single d_{i,j}(x)
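A sketch of one way to obtain minimum-variance weights. The slides only state the variance result; the extra assumption here is that the classifiers' estimation errors are uncorrelated, which yields classic inverse-variance weighting, whereas correlated errors would require the full error covariance matrix:

```python
import numpy as np

def min_variance_weights(error_variances):
    """Weights minimizing the variance of the combined estimate, subject to sum(w) = 1,
    under the assumption of uncorrelated estimation errors (inverse-variance weighting)."""
    v = np.asarray(error_variances, dtype=float)
    w = 1.0 / v
    return w / w.sum()
```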