Deep Learning Hyperparameter Optimization with Competing Objectives
GTC 2018 - S8136
Scott Clark, scott@sigopt.com
OUTLINE
1. Why is Tuning Models Hard?
2. Common Tuning Methods
3. Deep Learning Example
4. Tuning Multiple Metrics
5. Multi-metric Optimization Examples
Deep Learning / AI is extremely powerful.
Tuning these systems is extremely non-intuitive.
Photo: Joe Ross
TUNABLE PARAMETERS IN DEEP LEARNING
Photo: Tammy Strobel
STANDARD METHODS FOR HYPERPARAMETER SEARCH
STANDARD TUNING METHODS
Manual Search / Grid Search / Random Search
Parameter Configuration:
- Weights
- Thresholds
- Window sizes
- Transformations
Training Data -> ML / AI Model -> Cross Validation / Testing Data -> ?
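For reference, a minimal sketch of what grid search and random search look like in code; the parameter names, ranges, and the toy train_and_score objective below are illustrative placeholders, not the exact setup from this talk.

# Minimal sketch of grid search vs. random search over two hyperparameters.
import itertools
import random

def train_and_score(config):
    # Toy stand-in for "train the model and return validation accuracy";
    # in practice this runs the full training / cross-validation pipeline.
    return 1.0 - abs(config["learning_rate"] - 3e-3) - abs(config["dropout"] - 0.25)

# Grid search: exhaustively evaluate a fixed lattice of values.
grid = {
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "dropout": [0.1, 0.3, 0.5],
}
grid_configs = [dict(zip(grid, combo)) for combo in itertools.product(*grid.values())]

# Random search: sample the same ranges for a fixed budget of trials.
random_configs = [
    {"learning_rate": 10 ** random.uniform(-4, -2), "dropout": random.uniform(0.1, 0.5)}
    for _ in range(len(grid_configs))
]

best_config = max(grid_configs + random_configs, key=train_and_score)
print(best_config, train_and_score(best_config))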
OPTIMIZATION FEEDBACK LOOP
Training Data -> ML / AI Model -> Cross Validation / Testing Data -> Objective Metric -> REST API -> New Configurations -> ML / AI Model -> Better Results
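A minimal sketch of this feedback loop using SigOpt's Python client (create an experiment, request suggestions, report observations over the REST API). The parameter space, budget, API token, and the evaluate_model helper are placeholders, and exact fields may differ across client versions.

# Sketch of the suggest / observe loop with SigOpt's Python client (v1-style API).
from sigopt import Connection

def evaluate_model(assignments):
    # Placeholder: build the model from `assignments`, train it, and return
    # the cross-validated objective metric (e.g. accuracy).
    ...

conn = Connection(client_token="SIGOPT_API_TOKEN")

experiment = conn.experiments().create(
    name="CNN text classifier",
    parameters=[
        dict(name="log_learning_rate", type="double", bounds=dict(min=-5.0, max=-1.0)),
        dict(name="dropout", type="double", bounds=dict(min=0.1, max=0.6)),
    ],
    observation_budget=60,
)

for _ in range(experiment.observation_budget):
    suggestion = conn.experiments(experiment.id).suggestions().create()
    accuracy = evaluate_model(suggestion.assignments)        # train + validate
    conn.experiments(experiment.id).observations().create(   # report the metric back
        suggestion=suggestion.id,
        value=accuracy,
    )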
DEEP LEARNING EXAMPLE
SIGOPT + MXNET
● Classify movie reviews using a CNN in MXNet
https://aws.amazon.com/blogs/machine-learning/fast-cnn-tuning-with-aws-gpu-instances-and-sigopt/
TEXT CLASSIFICATION PIPELINE
Training Text -> ML / AI Model (MXNet) -> Validation / Testing Text -> Accuracy -> REST API -> Hyperparameter Configurations and Feature Transformations -> Better Results
STOCHASTIC GRADIENT DESCENT
● Comparison of several RMSProp SGD parametrizations
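For context, these are the optimizer knobs typically being varied in such a comparison; the names and ranges below are illustrative placeholders, not the exact search space behind these plots.

# Illustrative RMSProp search space (placeholder ranges, log scale where noted).
rmsprop_space = [
    dict(name="log_learning_rate", type="double", bounds=dict(min=-6.0, max=-1.0)),
    dict(name="decay_rate",        type="double", bounds=dict(min=0.5,  max=0.999)),
    dict(name="log_epsilon",       type="double", bounds=dict(min=-10.0, max=-4.0)),
]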
ARCHITECTURE PARAMETERS
MULTIPLICATIVE TUNING SPEED UP
SPEED UP #1: CPU -> GPU
SPEED UP #2: RANDOM/GRID -> SIGOPT
CONSISTENTLY BETTER AND FASTER
TUNING MULTIPLE METRICS
What if we want to optimize multiple competing metrics?
● Complexity Tradeoffs
  ○ Accuracy vs Training Time
  ○ Accuracy vs Inference Time
● Business Metrics
  ○ Fraud Accuracy vs Money Lost
  ○ Conversion Rate vs LTV
  ○ Engagement vs Profit
  ○ Profit vs Drawdown
PARETO OPTIMAL
What does it mean to optimize two metrics simultaneously?
Pareto efficiency or Pareto optimality is a state of allocation of resources from which it is impossible to reallocate so as to make any one individual or preference criterion better off without making at least one individual or preference criterion worse off.
PARETO OPTIMAL
What does it mean to optimize two metrics simultaneously?
The red points are on the Pareto Efficient Frontier; they strictly dominate all of the grey points. You can do no better in one metric without sacrificing performance in the other. Point N is Pareto Optimal compared to Point K.
PARETO EFFICIENT FRONTIER
The goal is to have the best set of feasible solutions to select from. After optimization, the expert picks one or more of the red points from the Pareto Efficient Frontier to further study or put into production.
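A minimal sketch of picking out the Pareto Efficient Frontier from a set of observed results, assuming both metrics are oriented so that larger is better (e.g. accuracy and negated training time); the example values are made up.

def pareto_frontier(points):
    # A point is on the frontier if no other point is at least as good in
    # both metrics (and not identical to it), i.e. nothing dominates it.
    frontier = []
    for candidate in points:
        dominated = any(
            other != candidate
            and other[0] >= candidate[0]
            and other[1] >= candidate[1]
            for other in points
        )
        if not dominated:
            frontier.append(candidate)
    return frontier

# (accuracy, negative training time) pairs from tuned configurations.
observations = [(0.81, -120), (0.85, -300), (0.85, -200), (0.90, -900), (0.78, -90)]
print(pareto_frontier(observations))  # the "red points"; the rest are dominated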
TOY EXAMPLE
MULTI-METRIC OPTIMIZATION
DEEP LEARNING EXAMPLES
MULTI-METRIC OPT IN DEEP LEARNING
https://devblogs.nvidia.com/sigopt-deep-learning-hyperparameter-optimization/
DEEP LEARNING TRADEOFFS
● Deep Learning pipelines are time consuming and expensive to run
● Application and deployment conditions may make certain configurations less desirable
● Tuning for both accuracy and complexity metrics like training or inference time allows the expert to make the best decision for production
STOCHASTIC GRADIENT DESCENT
● Comparison of several RMSProp SGD parametrizations
● Different configurations converge differently
TEXT CLASSIFICATION PIPELINE
Training Text -> ML / AI Model (MXNet) -> Validation / Testing Text -> Accuracy + Training Time -> REST API -> Hyperparameter Configurations and Feature Transformations -> Better Results
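A sketch of how such a two-metric experiment could be declared with SigOpt's multimetric support, maximizing accuracy together with negated training time so the optimizer maps out the trade-off. Names, ranges, budget, and the train_and_time helper are placeholders, and field details may vary by client version.

# Sketch of a two-metric (accuracy vs. training time) SigOpt experiment.
from sigopt import Connection

def train_and_time(assignments):
    # Placeholder: train with these parameters, return (accuracy, seconds).
    ...

conn = Connection(client_token="SIGOPT_API_TOKEN")

experiment = conn.experiments().create(
    name="Text classifier: accuracy vs training time",
    parameters=[
        dict(name="log_learning_rate", type="double", bounds=dict(min=-5.0, max=-1.0)),
        dict(name="num_filters", type="int", bounds=dict(min=50, max=250)),
    ],
    metrics=[
        dict(name="accuracy"),
        dict(name="neg_training_time"),
    ],
    observation_budget=120,
)

suggestion = conn.experiments(experiment.id).suggestions().create()
accuracy, seconds = train_and_time(suggestion.assignments)
conn.experiments(experiment.id).observations().create(
    suggestion=suggestion.id,
    values=[
        dict(name="accuracy", value=accuracy),
        dict(name="neg_training_time", value=-seconds),  # negate so both are maximized
    ],
)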
FINDING THE FRONTIER
SEQUENCE CLASSIFICATION PIPELINE
Training Sequences -> ML / AI Model (TensorFlow) -> Validation / Testing Sequences -> Accuracy + Inference Time -> REST API -> Hyperparameter Configurations and Feature Transformations -> Better Results
TEXT CLASSIFICATION PIPELINE
FINDING THE FRONTIER
FINDING THE FRONTIER
LOAN CLASSIFICATION PIPELINE
Training Data -> ML / AI Model (LightGBM) -> Validation / Testing Data -> AUCPR + Avg $ Lost -> REST API -> Hyperparameter Configurations and Feature Transformations -> Better Results
GRID SEARCH CAN MISLEAD
● Best grid search point (wrt accuracy) loses >$35 / transaction
● Best grid search point (wrt loss) has 70% accuracy
● Points on the Pareto Frontier give the user more information about what is possible and more control of trade-offs
DISTRIBUTED TRAINING/SCHEDULING
● SigOpt serves as a distributed scheduler for training models across workers
● Workers access the SigOpt API for the latest parameters to try for each model
● Enables easy distributed training of non-distributed algorithms across any number of models
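A sketch of one worker in this scheme: each worker independently asks the API for a suggestion, trains one model, and reports the result, so scaling out is just running more copies of this loop. The API token, experiment id, and the train_and_evaluate helper are placeholders.

# Sketch of a single tuning worker; run many of these in parallel.
from sigopt import Connection

def train_and_evaluate(assignments):
    # Placeholder: train a (non-distributed) model with these parameters
    # and return its objective metric.
    ...

def worker_loop(experiment_id, trials_per_worker=10):
    conn = Connection(client_token="SIGOPT_API_TOKEN")
    for _ in range(trials_per_worker):
        suggestion = conn.experiments(experiment_id).suggestions().create()
        metric = train_and_evaluate(suggestion.assignments)
        conn.experiments(experiment_id).observations().create(
            suggestion=suggestion.id,
            value=metric,
        )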
TAKEAWAYS
One metric may not paint the whole picture
- Think about metric trade-offs in your model pipelines
- Optimizing for the wrong thing can be very expensive
Not all optimization strategies are equal
- Pick an optimization strategy that gives the most flexibility
- Different tools enable you to tackle new problems
Questions? contact@sigopt.com https://sigopt.com @SigOpt