Data Farming: Getting the Most Out of Moore’s Law and Cluster Computing
Data Mining vs. Data Farming
• Miners seek valuable buried nuggets
  - Miners have no control over what’s there or how hard it is to separate it out
  - Data Mining seeks valuable information buried within massive amounts of data
• Farmers cultivate to maximize yield
  - Farmers manipulate the environment to their advantage: pest control, irrigation, fertilizer, etc.
  - Data Farming manipulates simulation models to advantage with designed experimentation
Simulation in DoD
• DoD uses complex high-dimensional simulation models as an important tool in its decision-making processes for diverse areas such as: logistics, humanitarian aid, peace support operations, anti-piracy & anti-terrorist efforts, future force planning, and combat modeling
• Many simulations involve dozens, hundreds, or thousands of “factors” that can be set to different levels
Abstracting Simulation
[Diagram: Inputs → Simulation Model → Outputs]
• A computer simulation transforms inputs to outputs
• Pareto Principle - a small subset of the inputs dominates in determining the outputs
Design of Experiments
“The idea behind [simulation]…is to [replace] theory by experiment whenever the former falters” - Hammersley and Handscomb
But simulation experiments are different...
Typical assumptions for physical experiments
– Small/moderate # of factors
– Univariate response
– Homogeneous error
– Linear
– Sparse effects
– Higher order interactions negligible
– Normal errors
Characteristics of typical simulation models
– Large # of factors
– Many output measures of interest
– Heterogeneous error
– Non-linear
– Many significant effects
– Significant higher order interactions
– Varied error structure
Why Do We Need DOE?
Without a good plan for changing multiple factors simultaneously:
• We limit the insights possible (can’t “untangle” effects)
• Haphazardly choosing scenarios can use up a lot of time without yielding answers to the fundamental questions
A Simple Example: Capture the Flag
Speed   Stealth   Success?
Low     Low       No
High    High      Yes
[Plot: Stealth vs. Speed]
Which is more important, stealth or speed?
No way to tell! The factors are “confounded”
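The confounding in the Capture the Flag table can be checked numerically. The sketch below is only an illustration: it assumes a 0/1 coding for Low/High and No/Yes, and the helper name `effect` is hypothetical, not something from the slides.

```python
# The two runs from the Capture the Flag table, coded Low = 0, High = 1
# and No = 0, Yes = 1 (this coding is an assumption for illustration).
runs = [
    {"speed": 0, "stealth": 0, "success": 0},
    {"speed": 1, "stealth": 1, "success": 1},
]


def effect(factor):
    """Mean success at the high level minus mean success at the low level."""
    high = [r["success"] for r in runs if r[factor] == 1]
    low = [r["success"] for r in runs if r[factor] == 0]
    return sum(high) / len(high) - sum(low) / len(low)


speed_effect = effect("speed")
stealth_effect = effect("stealth")

# Speed and stealth take the same level in every run, so any estimate
# computed from these runs cannot tell the two factors apart.
confounded = all(r["speed"] == r["stealth"] for r in runs)

print(speed_effect, stealth_effect, confounded)
```

Both estimated effects come out identical because the two factor columns are identical, which is what “confounded” means here.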
One-at-a-Time Variation? No!
Speed   Stealth   Success?
Low     Low       No
High    Low       No
Low     High      No
[Plot: Stealth vs. Speed]
If we vary Speed and Stealth separately, we (incorrectly) conclude neither contributes to success!
By varying Speed and Stealth together rather than separately, we see there is an “interaction”
This is a “factorial” or “gridded” design
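A small sketch of the 2x2 factorial design may make the interaction concrete. It assumes the response implied by the slide tables (success only when both Speed and Stealth are High) and a 0/1 coding; the function name `success` is hypothetical.

```python
from itertools import product

# Full factorial ("gridded") design: every combination of two levels
# of (speed, stealth), coded Low = 0, High = 1.
levels = [0, 1]
design = list(product(levels, repeat=2))


def success(speed, stealth):
    # Assumed response, pieced together from the slide tables:
    # success only when both factors are at the High level.
    return 1 if speed == 1 and stealth == 1 else 0


results = {point: success(*point) for point in design}

# Effect of raising speed, measured separately at each stealth level.
speed_effect_low_stealth = results[(1, 0)] - results[(0, 0)]
speed_effect_high_stealth = results[(1, 1)] - results[(0, 1)]

# A nonzero difference means the effect of speed depends on stealth,
# i.e. the two factors interact.
interaction = speed_effect_high_stealth - speed_effect_low_stealth

print(design)
print(speed_effect_low_stealth, speed_effect_high_stealth, interaction)
```

The one-at-a-time runs only ever measure `speed_effect_low_stealth`, which is zero; the fourth corner of the grid is what reveals the interaction.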
Finer Grids
• Which output would you prefer to see?
[Two plots of Stealth vs. Speed]
• The fly in the ointment - Studying two factors at this level of detail requires 11x11=121 experiments. Three factors would take 11x11x11=1331 experiments.
Factorial Designs grow exponentially with the number of factors!
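The growth of a gridded design is easy to tabulate. In this sketch the function name `grid_runs` and the factor counts beyond three are illustrative choices, not values from the slides.

```python
def grid_runs(levels_per_factor, n_factors):
    """Number of design points in a full factorial grid."""
    return levels_per_factor ** n_factors


# 11 levels per factor, as in the slide; the larger factor counts are
# added only to show how quickly the grid grows.
for n_factors in (1, 2, 3, 5, 10):
    print(n_factors, grid_runs(11, n_factors))
```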
How Bad is That?
• Consider a model with 100 factors
• Study each factor at only two levels
This would require $2^{100}$ experiments
$2^{100} \approx 10^{30}$, i.e., a “one” followed by thirty zeros!
If we could perform one billion experiments per second and started running experiments at the big bang, we would have completed less than 1/2500th of the total number of experiments!
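The arithmetic behind this claim can be reproduced in a few lines. The age of the universe (about 13.8 billion years) is an assumption supplied here; the slide does not state which figure it used.

```python
total_runs = 2 ** 100                 # two levels for each of 100 factors

RUNS_PER_SECOND = 1e9                 # one billion experiments per second
AGE_OF_UNIVERSE_YEARS = 13.8e9        # assumed value, not given on the slide
SECONDS_PER_YEAR = 365.25 * 24 * 3600

completed = RUNS_PER_SECOND * AGE_OF_UNIVERSE_YEARS * SECONDS_PER_YEAR
fraction = completed / total_runs

print(f"total runs: {total_runs:.3e}")
print(f"fraction completed since the big bang: about 1/{1 / fraction:.0f}")
```

Under these assumptions the completed fraction is somewhat smaller than 1/2500, consistent with the slide’s “less than”.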
Can Moore’s Law Save Us?
• Moore’s Law is not a law - it is an observation that computing power has maintained an exponential growth rate
• In recent years, this has produced “petaflop” computers
  - Petaflop = 1000 trillion ops/second
  - Cost of “Roadrunner” = $133 million
• Using the Roadrunner supercomputer would reduce the time required for our experiment to a mere 40 million years
• This is better, but still not good enough to be of practical use
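The 40-million-year figure follows if each experiment is assumed to cost a single operation on a petaflop machine. That one-operation-per-experiment reading is an assumption made here to match the slide’s number; a real simulation run would cost far more.

```python
PETAFLOP_OPS_PER_SECOND = 1e15        # 1000 trillion operations per second
SECONDS_PER_YEAR = 365.25 * 24 * 3600

total_runs = 2 ** 100                 # 100 factors at two levels each

# Assumption: one experiment per operation (wildly optimistic).
seconds_needed = total_runs / PETAFLOP_OPS_PER_SECOND
years_needed = seconds_needed / SECONDS_PER_YEAR

print(f"about {years_needed / 1e6:.0f} million years")
```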
We Need New Types of Designs
• Factorial (gridded) designs are most familiar
• We have focused on Latin hypercubes and sequential approaches
[Chart label: Efficient R5 FF and CCD]
Nearly Orthogonal Latin Hypercubes
[Figure: scatterplot matrix of factors A through G, each axis scaled from -1.0 to 1.0]
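To give a feel for what a Latin hypercube design looks like, here is a minimal sketch of a plain random Latin hypercube on the range -1 to 1. It is not the nearly orthogonal construction the slide refers to; the function names, the run count of 17, and the seed are all hypothetical choices for illustration. The correlation check simply reports how far a random draw is from orthogonal.

```python
import random


def latin_hypercube(n_runs, n_factors, seed=0):
    """Random Latin hypercube: each factor uses each level exactly once."""
    rng = random.Random(seed)
    # n_runs equally spaced levels covering [-1, 1]
    levels = [-1 + 2 * i / (n_runs - 1) for i in range(n_runs)]
    columns = []
    for _ in range(n_factors):
        column = levels[:]
        rng.shuffle(column)        # independent random order per factor
        columns.append(column)
    # Transpose so that each row is one run (one design point)
    return [list(row) for row in zip(*columns)]


def correlation(x, y):
    """Pearson correlation between two equal-length columns."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


design = latin_hypercube(n_runs=17, n_factors=7)   # factors A through G
columns = [list(col) for col in zip(*design)]

# Largest absolute pairwise correlation between factor columns;
# a nearly orthogonal design would keep this close to zero.
max_abs_corr = max(
    abs(correlation(columns[i], columns[j]))
    for i in range(len(columns))
    for j in range(i + 1, len(columns))
)
print(len(design), len(design[0]), round(max_abs_corr, 3))
```

The point of the design is the run count: 17 runs cover 17 levels of every factor, whereas a 17-level grid over seven factors would need 17^7 runs.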