

  1. Non-Uniform Stochastic Average Gradient for Training Conditional Random Fields. Mark Schmidt, Reza Babanezhad, Mohamed Ahmed, Ann Clifton, Anoop Sarkar. University of British Columbia and Simon Fraser University. NIPS Optimization Workshop, 2014.

  2. Motivation: Structured Prediction.
     Classical supervised learning: predict a single label y from an input x.
     Structured prediction: predict a structured object y (for example, the word in an image of handwritten letters) from an input x.
     Other structured prediction tasks: labelling all people/places in Wikipedia, finding coding regions in DNA sequences, labelling all voxels in an MRI as normal or tumor, predicting protein structure from sequence, weather forecasting, translating from French to English, etc.

  3. Motivation: Structured Prediction.
     Naive approaches to predicting letters y given images x:
     Multinomial logistic regression to predict the whole word:
       $p(y \mid x, w) = \frac{\exp(w_y^T F(x))}{\sum_{y'} \exp(w_{y'}^T F(x))}.$
     This requires a parameter vector w_k for every possible word k.
     Multinomial logistic regression to predict each letter independently:
       $p(y_j \mid x_j, w) = \frac{\exp(w_{y_j}^T F(x_j))}{\sum_{y_j'} \exp(w_{y_j'}^T F(x_j))}.$
     This works if you are really good at predicting individual letters, but it ignores the dependencies between letters (a code sketch of the per-letter model follows below).
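To make the per-letter baseline concrete, here is a minimal NumPy sketch of the second naive approach (an independent multinomial logistic regression over letter labels). The weight matrix, feature vector, and problem sizes are hypothetical placeholders for illustration, not part of the original slides.

```python
import numpy as np

def letter_probabilities(W, f_xj):
    """Per-letter multinomial logistic regression (softmax).

    W    : (num_letters, num_features) -- one weight vector w_y per candidate letter y.
    f_xj : (num_features,)             -- features F(x_j) of a single letter image.
    Returns p(y_j | x_j, w) for every candidate letter.
    """
    scores = W @ f_xj                     # w_y^T F(x_j) for each letter y
    scores -= scores.max()                # stabilize the exponentials
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()  # normalize over candidate letters

# Hypothetical example: 26 letter classes, 128 image features.
rng = np.random.default_rng(0)
W = rng.normal(size=(26, 128))
f_xj = rng.normal(size=128)
p = letter_probabilities(W, f_xj)         # note: ignores the neighbouring letters entirely
```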

  4. Motivation: Structured Prediction.
     What letter is this? What are these letters?
     [Slides show images of handwritten characters: an individual letter can be ambiguous on its own, but easy to read in the context of its neighbours.]

  5. Conditional Random Fields.
     Conditional random fields model targets y given inputs x using
       $p(y \mid x, w) = \frac{\exp(w^T F(y, x))}{\sum_{y'} \exp(w^T F(y', x))} = \frac{\exp(w^T F(y, x))}{Z},$
     where w are the parameters.
     Examples of features F(y, x):
       F(y_j, x): these features lead to a logistic model for each letter.
       F(y_{j-1}, y_j, x): dependency between adjacent letters ('q-u').
       F(y_{j-1}, y_j, j, x): position-based dependency (French: 'e-r' ending).
       F(y_{j-2}, y_{j-1}, y_j, j, x): third-order and position (English: 'i-n-g' ending).
       F(y \in D, x): is y in a dictionary D?
     CRFs are a ubiquitous tool in natural language processing: part-of-speech tagging, semantic role labelling, information extraction, shallow parsing, named-entity recognition, etc.
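As a concrete, if deliberately naive, illustration of the model definition, the sketch below scores a chain with unary features F(y_j, x) and pairwise features F(y_{j-1}, y_j, x), and computes the normalizer Z by brute-force enumeration. The array shapes, names, and the tiny example are assumptions made for illustration; the exponential cost of the enumeration is exactly why dynamic programming is used in practice.

```python
import itertools
import numpy as np

def chain_score(unary, pairwise, y):
    """Unnormalized log-score w^T F(y, x) for a chain CRF.

    unary    : (T, K) -- unary log-potentials w^T F(y_j = k, x) for position j, label k.
    pairwise : (K, K) -- pairwise log-potentials w^T F(y_{j-1} = a, y_j = b, x).
    y        : length-T label sequence.
    """
    score = sum(unary[j, y[j]] for j in range(len(y)))
    score += sum(pairwise[y[j - 1], y[j]] for j in range(1, len(y)))
    return score

def log_prob_bruteforce(unary, pairwise, y):
    """log p(y | x, w): subtract log Z computed by enumerating all K^T label sequences."""
    T, K = unary.shape
    all_scores = [chain_score(unary, pairwise, y_prime)
                  for y_prime in itertools.product(range(K), repeat=T)]
    log_Z = np.logaddexp.reduce(all_scores)
    return chain_score(unary, pairwise, y) - log_Z

# Tiny hypothetical example: a 4-letter word over a 5-letter alphabet.
rng = np.random.default_rng(0)
unary, pairwise = rng.normal(size=(4, 5)), rng.normal(size=(5, 5))
print(log_prob_bruteforce(unary, pairwise, (0, 2, 1, 4)))
```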

  6. Optimization Formulation and Challenge.
     Typically train using the ℓ2-regularized negative log-likelihood:
       $\min_w f(w) = \frac{\lambda}{2} \|w\|^2 - \frac{1}{n} \sum_{i=1}^{n} \log p(y^i \mid x^i, w).$
     Good news: ∇f(w) is Lipschitz-continuous and f is strongly convex.
     Bad news: evaluating log p(y^i | x^i, w) and its gradient is expensive.
     Chain structures: run forward-backward on each example (see the sketch below).
     General features: the cost is exponential in the tree-width of the dependency graph.
     There is a lot of work on approximate evaluation, but this optimization problem remains a bottleneck.
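For chain structures, log p(y^i | x^i, w) is the unnormalized score minus log Z, and log Z can be computed in O(T K^2) time by the forward recursion rather than by enumeration. Below is a minimal sketch of that recursion, reusing the hypothetical unary/pairwise potentials from the previous block, together with the regularized objective assembled from per-example log-probabilities; it is an illustrative sketch, not the authors' implementation.

```python
import numpy as np

def forward_log_Z(unary, pairwise):
    """Forward recursion for a chain CRF: computes log Z in O(T * K^2).

    unary    : (T, K) unary log-potentials.
    pairwise : (K, K) pairwise log-potentials.
    """
    alpha = unary[0].copy()  # alpha[b] = log-sum of scores of length-1 prefixes ending in label b
    for j in range(1, unary.shape[0]):
        # alpha_new[b] = log sum_a exp(alpha[a] + pairwise[a, b]) + unary[j, b]
        alpha = np.logaddexp.reduce(alpha[:, None] + pairwise, axis=0) + unary[j]
    return np.logaddexp.reduce(alpha)

def regularized_nll(log_probs, w, lam):
    """f(w) = (lam/2) ||w||^2 - (1/n) sum_i log p(y^i | x^i, w), given per-example log-probs."""
    return 0.5 * lam * np.dot(w, w) - np.mean(log_probs)
```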

  7. Current Optimization Methods.
     Lafferty et al. [2001] proposed an iterative scaling approach.
     It is outperformed by the L-BFGS quasi-Newton algorithm [Wallach, 2002; Sha & Pereira, 2003], which has a linear convergence rate: O(log(1/ε)) iterations required.
     But each L-BFGS iteration requires evaluating log p(y^i | x^i, w) for all n examples.
     To scale to large n, stochastic gradient methods were examined [Vishwanathan et al., 2006]; their iteration cost is independent of n (a minimal sketch follows below).
     But they have a sublinear convergence rate: O(1/ε) iterations required.
     Or, with a constant step size, you get a linear rate, but only up to a fixed tolerance.
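The sketch below shows the plain stochastic gradient baseline discussed above. The `example_gradient` oracle, which would internally run forward-backward on one training example, is a hypothetical placeholder; the point is that each iteration touches a single example, so its cost is independent of n, while a constant step size only converges to within a fixed tolerance of the solution.

```python
import numpy as np

def sgd(example_gradient, w0, n, lam, step_size, num_iters, rng):
    """Stochastic gradient descent on f(w) = (lam/2)||w||^2 - (1/n) sum_i log p(y^i | x^i, w).

    example_gradient(i, w) should return -grad_w log p(y^i | x^i, w) for one example
    (e.g. computed by forward-backward); its cost does not depend on n.
    """
    w = w0.copy()
    for t in range(num_iters):
        i = rng.integers(n)                   # sample one training example uniformly
        g = lam * w + example_gradient(i, w)  # stochastic gradient of the regularized objective
        w -= step_size * g                    # constant step: linear rate up to a fixed tolerance
    return w
```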
