Time-series-based Ensemble Modeling for Bio-Medical Applications

Maciej Ogorzałek 1,2
in collaboration with: Christian Merkwirth, Grzegorz Surowka, Leszek Nowak, Katarzyna Grzesiak-Kopec 1, Joerg Wichard 3

1 Department of Information Technologies, Jagiellonian University, Kraków
2 Chair of Bio-signals and Systems, Hong Kong Polytechnic University (under DSS)
3 FMP Berlin, Germany

M. Ogorzałek – p. 1
Learning a Dependency from Data

Given: a sample of input-output pairs (x_µ, y_µ) with µ = 1, …, N, drawn from a functional dependence y(x) (maybe corrupted by noise).
Aim: choose a model (function) f̂ out of a hypothesis space H that is as close to the true dependency f as possible.
• Classification (discrete classes): f : R^D → {0, 1, 2, …}
• Regression (continuous output): f : R^D → R
Implementation is usually via the solution of an appropriate optimization problem:
• Matrix inversion in case of linear regression
• Minimization of a loss function on the training data
• Quadratic programming problem for SVMs
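The linear-regression case above reduces to solving the normal equations; for a one-dimensional line the required matrix inversion collapses to a closed form. A minimal sketch (illustrative Python, not part of any toolbox mentioned here):

```python
# Fit y = a*x + b by least squares: the closed-form solution of the
# normal equations (the 2x2 matrix inversion done by hand).
def fit_line(xs, ys):
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)  # slope
    b = (sy - a * sx) / n                          # intercept
    return a, b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]   # noiseless data from y = 2x + 1
a, b = fit_line(xs, ys)
```

With noise-free data the recovered coefficients match the generating dependency exactly; with noisy targets they minimize the squared training loss, which is the optimization problem the slide refers to.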
Validation and Model Selection
• Generalization error: how does the model perform on unseen data (samples)?
• The exact generalization error is not accessible, since we have only a limited number of observations.
• Training on a small data set tends to overfit, causing the generalization error to be significantly higher than the training error.
• This is a consequence of the mismatch between the capacity of the hypothesis space H (its VC (Vapnik-Chervonenkis) dimension) and the number of training observations.
• Validation: estimating the generalization error using just the given data set
– Needed for choosing the optimal model structure or learning parameters (step sizes etc.)
• Model Selection: selecting the model with the lowest (estimated) generalization error
• But estimation of the generalization error is very unreliable on small data sets
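The standard way to estimate the generalization error from the given data alone is K-fold cross-validation, mentioned again on the next slide. A minimal sketch, with a deliberately trivial learner (predict the mean of the training targets); `train` and `predict` are hypothetical placeholder callables, not the API of any toolbox discussed here:

```python
# Estimate generalization error by K-fold cross-validation:
# repeatedly hold out one fold for testing, train on the rest,
# and average the held-out squared errors.
def kfold_error(xs, ys, train, predict, k=4):
    n = len(xs)
    folds = [list(range(i, n, k)) for i in range(k)]   # interleaved folds
    total = 0.0
    for test_idx in folds:
        train_idx = [i for i in range(n) if i not in test_idx]
        model = train([xs[i] for i in train_idx],
                      [ys[i] for i in train_idx])
        for i in test_idx:
            total += (ys[i] - predict(model, xs[i])) ** 2
    return total / n

train = lambda xs, ys: sum(ys) / len(ys)   # "model" = mean of targets
predict = lambda model, x: model           # constant prediction
err = kfold_error([0, 1, 2, 3], [1.0, 1.0, 1.0, 1.0], train, predict, k=2)
```

On constant targets every fold predicts perfectly, so the estimate is zero; on real data the estimate is noisy, which is exactly the unreliability on small data sets that the last bullet warns about.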
Improving Generalization for Single Models
• Remedies:
– Manipulating the training algorithm (e.g. early stopping)
– Regularization by adding a penalty to the loss function
– Using algorithms with built-in capacity control (e.g. SVM)
– Relying on criteria like BIC (Bayesian Information Criterion), AIC (Akaike Information Criterion), GCV (Generalized Cross-Validation) or cross-validation to select the optimal model complexity
– Reformulating the loss function:
• ε-insensitive loss
• Huber loss
• SVM loss for classification
Question
• Are there any other methods to improve the generalization error?
• Yes: by combining several individual models!
Ensemble Methods
Ensemble: averaging the output of K separately trained models
• Simple average: f̄(x) = (1/K) Σ_{k=1..K} f_k(x)
• Weighted average: f̄(x) = Σ_k w_k f_k(x) with Σ_k w_k = 1
Error decomposition:
e(x) = (y(x) − f̄(x))²
ε̄(x) = (1/K) Σ_{k=1..K} (y(x) − f_k(x))²
ā(x) = (1/K) Σ_{k=1..K} (f_k(x) − f̄(x))²
so that e(x) = ε̄(x) − ā(x), and integrating over the input space: E = Ē − Ā
Interpretation:
• The ensemble generalization error is never larger than the average error of the individual models
• An ensemble should consist of well-trained but diverse models
• An ensemble often outperforms the best constituting model
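The decomposition e(x) = ε̄(x) − ā(x) can be checked numerically at a single input point; the prediction values below are made up for illustration:

```python
# Numerical check of the ambiguity decomposition e(x) = eps_bar(x) - a_bar(x):
# ensemble error = mean member error minus member diversity (ambiguity).
preds = [1.0, 3.0, 2.6]   # predictions f_k(x) of K = 3 models at one point x
y = 2.0                   # true target y(x)
K = len(preds)

f_bar = sum(preds) / K                               # ensemble output f_bar(x)
e = (y - f_bar) ** 2                                 # ensemble error e(x)
eps_bar = sum((y - f) ** 2 for f in preds) / K       # mean member error
a_bar = sum((f - f_bar) ** 2 for f in preds) / K     # ambiguity a_bar(x)
```

Since ā(x) is a variance and therefore non-negative, e(x) ≤ ε̄(x) always holds, which is the first interpretation bullet; ā(x) grows with diversity, which is the second.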
Decorrelating Models
E = Ē − Ā. How can we obtain models that have a low generalization error (small Ē) but are mutually uncorrelated (large Ā)?
• Varying the model structure (e.g. topology)
• Exploiting the disadvantage of getting stuck in local minima:
– Varying initial conditions
– Varying parameters of the training procedure
– Using the ε-insensitive loss function
• Training a large population of models
• Applying resampling or sequencing techniques. Resampling generates new data sets by omitting or duplicating samples of the original data set; these techniques can be used to estimate generalization errors and for model construction:
– Bootstrapping: generate bootstrap replicates by randomly drawing samples from the training set
– Cross-Validation: divide the data set repeatedly into a training and a test part
– Bumping: construct models on bootstrap replicates and choose the best model on the full data set
– Bagging: bootstrap aggregation; create several models on bootstrap replicates and average these
– Boosting: create a sequence of models where the training of the next model depends on the output of the previous models
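Bagging is the simplest of the listed techniques to sketch. The "model" below is deliberately trivial (the mean of its bootstrap replicate); the function names are illustrative, not from any toolbox mentioned here:

```python
import random

# Bagging sketch: train each member on a bootstrap replicate (same size
# as the data, drawn with replacement), then average the member outputs.
def bagged_means(ys, n_models=50, seed=0):
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        replicate = [rng.choice(ys) for _ in ys]        # bootstrap replicate
        models.append(sum(replicate) / len(replicate))  # "model" = its mean
    return models

def ensemble_predict(models):
    return sum(models) / len(models)   # simple average of the members
```

Because every replicate omits some samples and duplicates others, the members disagree, which supplies the decorrelation (large Ā) the slide asks for without touching the training algorithm itself.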
Crosstraining – Constructing Ensembles
• Finesse: efficiently reuse samples by combining training, validation and selection of models
• Additional benefit of reduced correlation between models
• Repeatedly partition the data set randomly into two sample classes:
– Training set, used for training and stopping criteria
– Test set, used only for assessing the generalization error after the model has been trained
• Train a population of (heterogeneous) models and select the best ones according to their error on the test set
• Repartition the data set, taking care that the test sets are mutually disjoint
• Combine the best models of all partitionings into an ensemble
• Optionally weight the models according to the estimated generalization error on the total data set
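The steps above can be sketched as follows. The candidate "models" are deliberately trivial (mean and midrange of the training targets), and for brevity this sketch omits the bookkeeping that keeps the test sets mutually disjoint across rounds; all names are illustrative:

```python
import random

# Crosstraining sketch: repeatedly split the data, fit several candidate
# models on the training part, keep the one with the lowest test error,
# and collect the per-round winners into the final ensemble.
def crosstrain(ys, n_rounds=5, seed=1):
    rng = random.Random(seed)
    winners = []
    for _ in range(n_rounds):
        data = ys[:]
        rng.shuffle(data)                       # random partition
        half = len(data) // 2
        train, test = data[:half], data[half:]
        candidates = [
            sum(train) / len(train),            # candidate 1: mean model
            (min(train) + max(train)) / 2,      # candidate 2: midrange model
        ]
        best = min(candidates,
                   key=lambda m: sum((t - m) ** 2 for t in test))
        winners.append(best)                    # selected on unseen test part
    return sum(winners) / len(winners)          # simple-average ensemble
```

Each sample serves as training data in some rounds and as test data in others, which is the efficient reuse the first bullet calls "finesse".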
Pros and Cons of Ensembles
Ensemble Methods
• Advantages:
– Straightforward extension of existing modeling algorithms
– Almost fool-proof minimization of the generalization error
– Makes no assumptions on the structure of the underlying models
– Simplifies the problem of model selection
• Disadvantages:
– Increased computational effort
– Interpretation of an ensemble is even harder than drawing conclusions from a single model
Combining Heterogeneous Models
• Advantages:
– Often one model type performs superior on the given data set
– The probability of using an unsuited model type decreases
– Inherent decorrelation, even without manipulating the data set or the training parameters
• Disadvantages:
– Assessing the generalization performance of heterogeneous models is even more difficult than for models of the same type
The ENTOOL Toolbox for Statistical Learning
• The ENTOOL toolbox for statistical learning is designed to make state-of-the-art machine learning algorithms available under a common interface
• Allows construction of single models or ensembles of (heterogeneous) models
• Supports decorrelation of models by offering resampling techniques
• Though primarily designed for regression, it is possible to construct ensembles of classifiers with ENTOOL
• Requirements: Matlab (TM)
• Operating systems: Windows, Linux, Solaris (limited)