In practice OPTIMIZATION ERROR: TRAINING / VALIDATION / TEST. Training set: used to train the classifier. Validation set: used to monitor performance in real time and check for overfitting. Test set: used to evaluate the final performance of the classifier.
NO CHEATING! NEVER USE YOUR TRAINING DATA TO VALIDATE OR TEST YOUR ALGORITHM!
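As a concrete illustration, here is a minimal sketch of such a three-way split with scikit-learn; the 60/20/20 proportions and the synthetic dataset are arbitrary choices, not part of the original slides.

```python
# Minimal sketch of a three-way split (train / validation / test) with scikit-learn.
# The 60/20/20 proportions are an arbitrary illustrative choice.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# First carve out the test set, then split the remainder into train and validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 600, 200, 200
```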
The algorithm used to minimize the error is called OPTIMIZATION. THERE ARE SEVERAL OPTIMIZATION TECHNIQUES, AND THEY DEPEND ON THE MACHINE LEARNING ALGORITHM. NEURAL NETWORKS USE GRADIENT DESCENT, AS WE WILL SEE LATER: at each epoch, the weights to be learned are updated as W_{t+1} = W_t - λ ∇f(W_t), where λ is the learning rate.
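A minimal NumPy sketch of this update rule, assuming a toy quadratic loss so that the gradient can be written by hand; the loss, starting point and learning rate are illustrative only.

```python
import numpy as np

# Sketch of the gradient-descent update W_{t+1} = W_t - lambda * grad f(W_t),
# illustrated on a toy quadratic loss f(W) = ||W - W*||^2 (purely illustrative choice).
w_true = np.array([2.0, -1.0])   # weights the toy loss is centered on
w = np.zeros(2)                  # initial weights
learning_rate = 0.1              # lambda in the slide's notation

def grad_f(w):
    # Gradient of f(W) = ||W - W*||^2 is 2 (W - W*)
    return 2.0 * (w - w_true)

for epoch in range(100):         # one update per epoch in this toy setting
    w = w - learning_rate * grad_f(w)

print(w)  # approaches w_true as the iterations proceed
```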
f_W(x): the differences between methods lie in the function that is used: CARTS (decision trees), RANDOM FORESTS, SUPPORT VECTOR MACHINES (kernel algorithms), ARTIFICIAL NEURAL NETWORKS (DEEP LEARNING).
HOW TO CHOOSE YOUR CLASSICAL CLASSIFIER? NO RULE OF THUMB - REALLY DEPENDS ON THE APPLICATION

CARTS / RANDOM FOREST
++ : easy to interpret ("white box"), little data preparation, handles both numerical and categorical data, fast
-- : over-complex trees, unstable, biased trees if some classes dominate
Python: sklearn.ensemble.RandomForestClassifier, sklearn.ensemble.RandomForestRegressor

SVM
++ : the kernel trick allows non-linear problems
-- : not very well suited to multi-class problems
Python: sklearn.svm.SVC

NEURAL NETWORK (NN)
++ : seed of deep learning, very efficient with large amounts of data, as we will see
-- : more difficult to interpret, computing intensive
Python: sklearn.neural_network.MLPClassifier, sklearn.neural_network.MLPRegressor
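A minimal sketch fitting the three scikit-learn estimators from the table on a synthetic dataset; the dataset and hyperparameters are arbitrary illustrative choices, not recommendations.

```python
# Fit the three sklearn estimators from the table on a toy dataset and compare accuracies.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "svm (rbf kernel)": SVC(kernel="rbf"),
    "neural network": MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))  # accuracy on the held-out test set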
THE CHOICE CAN DEPEND ON YOUR MAIN INTEREST.
IT IS ALSO INFLUENCED BY “MAINSTREAM” TRENDS.
PART II: A FOCUS ON “SHALLOW” NEURAL NETWORKS
THE NEURON: INSPIRED BY NEUROSCIENCE? Credit: Karpathy
Mark I Perceptron: FIRST IMPLEMENTATION OF A NEURAL NETWORK [Rosenblatt, 1957!]. INTENDED TO BE A MACHINE (NOT AN ALGORITHM): it had an array of 400 photocells, randomly connected to the "neurons". Weights were encoded in potentiometers, and weight updates during learning were performed by electric motors.
TODAY'S ARTIFICIAL NEURON: pre-activation z(x) = W · x + b; output f(x) = g(W · x + b), with input x, weight vector W, bias b and activation function g.
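A minimal NumPy sketch of a single artificial neuron; the sigmoid is just one possible choice for the activation function g, and the numerical values are made up for illustration.

```python
import numpy as np

# One artificial neuron: pre-activation z(x) = W·x + b, output f(x) = g(z(x)).
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])   # input vector
W = np.array([0.1, 0.4, -0.2])   # weights (illustrative values)
b = 0.3                          # bias

z = W @ x + b                    # pre-activation
output = sigmoid(z)              # activation
print(z, output)
```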
LAYER OF NEURONS: f(x) = g(W x + b). SAME IDEA, except that now W becomes a matrix and b a vector.
HIDDEN LAYERS OF NEURONS. FIRST LAYER, applied to the INPUT: z_h(x) = W_h x + b_h
ACTIVATION FUNCTION OF THE HIDDEN LAYER: h(x) = g(z_h(x)) = g(W_h x + b_h)
OUTPUT LAYER: z_o(x) = W_o h(x) + b_o
PREDICTION: f(x) = softmax(z_o(x))
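Putting the four steps together, a minimal NumPy sketch of the forward pass of such a one-hidden-layer network; the layer sizes, the random weights and the choice of ReLU for g are illustrative assumptions.

```python
import numpy as np

# Forward pass of a one-hidden-layer ("shallow") network, following the slides:
#   z_h(x) = W_h x + b_h       (first layer)
#   h(x)   = g(z_h(x))         (hidden activation, ReLU chosen here as g)
#   z_o(x) = W_o h(x) + b_o    (output layer)
#   f(x)   = softmax(z_o(x))   (prediction)
rng = np.random.default_rng(0)

n_in, n_hidden, n_classes = 4, 8, 3
W_h, b_h = rng.normal(size=(n_hidden, n_in)), np.zeros(n_hidden)
W_o, b_o = rng.normal(size=(n_classes, n_hidden)), np.zeros(n_classes)

def softmax(z):
    e = np.exp(z - z.max())      # subtract max for numerical stability
    return e / e.sum()

x = rng.normal(size=n_in)        # one input example
z_h = W_h @ x + b_h              # first layer pre-activation
h = np.maximum(0.0, z_h)         # ReLU activation
z_o = W_o @ h + b_o              # output pre-activation
f = softmax(z_o)                 # class probabilities, sum to 1
print(f, f.sum())
```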
"CLASSICAL" MACHINE LEARNING: f_W(x) = y, the LABEL. A NEURAL NETWORK REPLACES THIS BY A GENERAL NON-LINEAR FUNCTION WITH SOME PARAMETERS W: p = g_3(W_3 g_2(W_2 g_1(W_1 x_0)))
WHY HIDDEN LAYERS? Stacking layers lets the network represent increasingly complex functions. Credit: Karpathy
SO LET'S GO DEEPER AND DEEPER! YES, BUT… IT IS NOT SO STRAIGHTFORWARD: DEEPER MEANS MORE WEIGHTS, A MORE DIFFICULT OPTIMIZATION AND A HIGHER RISK OF OVERFITTING…
LET’S FIRST EXAMINE IN MORE DETAIL HOW SIMPLE “SHALLOW” NETWORKS WORK
ACTIVATION FUNCTIONS? THEY ADD NON-LINEARITIES TO THE PROCESS.
ACTIVATION FUNCTIONS (+ MANY OTHERS!)
Sigmoid: f(x) = 1 / (1 + e^(-x))
Tanh: f(x) = tanh(x)
ReLU: f(x) = max(0, x)
Soft ReLU (softplus): f(x) = log(1 + e^x)
Leaky ReLU: f(x) = εx + (1 - ε) max(0, x)
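These functions are easy to write down directly; a minimal NumPy sketch, where the slope ε of the leaky ReLU is an arbitrary illustrative value.

```python
import numpy as np

# The activation functions listed above, written as plain NumPy functions.
sigmoid    = lambda x: 1.0 / (1.0 + np.exp(-x))
tanh       = lambda x: np.tanh(x)
relu       = lambda x: np.maximum(0.0, x)
soft_relu  = lambda x: np.log1p(np.exp(x))                               # softplus
leaky_relu = lambda x, eps=0.01: eps * x + (1 - eps) * np.maximum(0.0, x)

x = np.linspace(-3, 3, 7)
for name, g in [("sigmoid", sigmoid), ("tanh", tanh), ("relu", relu),
                ("soft relu", soft_relu), ("leaky relu", leaky_relu)]:
    print(name, np.round(g(x), 3))
```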
WHAT IS THE MEANING OF THE ACTIVATION FUNCTION? Any real function on an interval (a, b) can be approximated by a linear combination of translated and scaled ReLU functions.
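A minimal sketch illustrating this claim numerically: a function on an interval is fitted by least squares with a linear combination of translated ReLUs. The target function and the number of ReLU "knots" are arbitrary choices.

```python
import numpy as np

# Approximate a function on an interval (a, b) with a linear combination of
# translated ReLUs, c_0 + sum_k c_k * max(0, x - t_k), fitted by least squares.
a, b, n_knots = -2.0, 2.0, 20
x = np.linspace(a, b, 400)
target = np.sin(2 * x)                                # function to approximate

knots = np.linspace(a, b, n_knots)
basis = np.maximum(0.0, x[:, None] - knots[None, :])  # one shifted ReLU per knot
basis = np.hstack([np.ones((x.size, 1)), basis])      # constant term for the offset

coeffs, *_ = np.linalg.lstsq(basis, target, rcond=None)  # least-squares fit of the weights
approx = basis @ coeffs
print("max abs error:", np.abs(approx - target).max())   # small for enough knots
```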