Structured Prediction: Final words (CS 6355: Structured Prediction)



SLIDE 1

CS 6355: Structured Prediction

Structured Prediction

Final words

SLIDE 2

A look back

  • What is a structure?
  • The machine learning of interdependent variables

SLIDE 3

Recall: A working definition of a structure

A structure is a concept that can be applied to any complex thing, whether it be a bicycle, a commercial company, or a carbon molecule. By complex, we mean: 1. It is divisible into parts, 2. There are different kinds of parts, 3. The parts are arranged in a specifiable way, and, 4. Each part has a specifiable function in the structure of the thing as a whole


From the book Analysing Sentences: An Introduction to English Syntax by Noel Burton-Roberts, 1986.

SLIDE 4

An example task: Semantic Parsing

Find the largest state in the US

[Figure: database tables US_STATES(name, size, population, capital) and US_CITIES(name, state, population), plus query templates: SELECT expression FROM table WHERE condition; SELECT expression FROM table; MAX(numeric list); ORDERBY predicate; DELETE FROM table WHERE condition; Expression 1 = Expression 2]
SLIDE 5

A plausible strategy to build the query

Find the largest state in the US

[Figure: the same tables and query templates; no template chosen yet]
SLIDE 6

A plausible strategy to build the query

Find the largest state in the US

[Figure: template chosen: SELECT expression FROM table WHERE condition]
SLIDE 7

A plausible strategy to build the query

Find the largest state in the US

[Figure: partial query: SELECT expression FROM US_STATES WHERE condition]
SLIDE 8

A plausible strategy to build the query

Find the largest state in the US

[Figure: partial query: SELECT name FROM US_STATES WHERE condition]
SLIDE 9

A plausible strategy to build the query

Find the largest state in the US

[Figure: partial query: SELECT name FROM US_STATES WHERE expression = (SELECT expression FROM table)]
SLIDE 10

A plausible strategy to build the query

Find the largest state in the US

[Figure: partial query: SELECT name FROM US_STATES WHERE expression = (SELECT MAX(numeric list) FROM table)]
SLIDE 11

A plausible strategy to build the query

Find the largest state in the US

[Figure: partial query: SELECT name FROM US_STATES WHERE expression = (SELECT MAX(numeric list) FROM US_STATES)]
SLIDE 12

A plausible strategy to build the query

Find the largest state in the US

[Figure: completed query: SELECT name FROM US_STATES WHERE size = (SELECT MAX(size) FROM US_STATES)]

Or perhaps population?

SLIDE 13

A plausible strategy to build the query

Find the largest state in the US

[Figure: completed query: SELECT name FROM US_STATES WHERE size = (SELECT MAX(size) FROM US_STATES)]

Or perhaps population?

  • At each step, many, many decisions to make
  • Some decisions are simply not allowed

– A query has to be well formed!

  • Even so, many possible options

– Why does “Find” map to SELECT?
– Largest by size, population, or population of the capital?
SLIDE 14

Standard classification tools can’t predict structures

X: “Find the largest state in the US.”
Y: the SQL query shown below

  • Classification is about making one decision

– Spam or not spam, or predict one label, etc

We need to make multiple decisions

– Each part needs a label

  • Should “US” be mapped to us_states or us_cities?
  • Should “Find” be mapped to SELECT or DELETE?

– The decisions interact with each other

  • If the outer FROM clause talks about the table us_states, then the inner FROM clause should not talk about utah_counties

– How to compose the fragments together to create the whole structure?

  • Should the output consist of a WHERE clause? What should go in it?


SELECT name FROM us_states WHERE size = (SELECT MAX(size) FROM us_states)

SLIDE 15

How did we get here?


Binary classification

  • Learning algorithms
  • Prediction is easy – Threshold
  • Features (???)

Multiclass classification

  • Different strategies
  • One-vs-all, all-vs-all
  • Global learning algorithms
  • One feature vector per outcome
  • Each outcome scored
  • Prediction = highest scoring outcome

Structured classification

  • Global models or local models
  • Each outcome scored
  • Prediction = highest scoring outcome
  • Inference is no longer easy!
  • Makes all the difference
SLIDE 16

Structured output is…

  • A graph, possibly labeled and/or directed

– Possibly from a restricted family, such as chains, trees, etc.
– A discrete representation of the input
– E.g. a table, the SRL frame output, a sequence of labels, etc.

  • A collection of inter-dependent decisions

– Eg: The sequence of decisions used to construct the output

  • The result of a combinatorial optimization problem

– argmax_{y ∈ all outputs} score(x, y)

(The three bullets above view a structure as a representation, procedurally, and formally.)
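The formal view above can be made concrete with a tiny brute-force sketch. Everything here (the label set, the scoring function with its per-position and agreement terms) is invented for illustration; real output spaces are far too large to enumerate:

```python
from itertools import product

# Toy structured prediction: outputs are label sequences of length 3.
LABELS = ["A", "B"]

def score(x, y):
    # Hypothetical score: reward labels that match the input, plus
    # a bonus for adjacent labels that agree (an interdependency).
    unary = sum(1.0 for xi, yi in zip(x, y) if xi == yi)
    pairwise = sum(0.5 for a, b in zip(y, y[1:]) if a == b)
    return unary + pairwise

def predict(x):
    # argmax_{y in all outputs} score(x, y): feasible only because
    # the output space here is tiny (2^3 = 8 sequences).
    return max(product(LABELS, repeat=len(x)), key=lambda y: score(x, y))

print(predict(("A", "B", "A")))
```

Note how the agreement bonus can pull the prediction away from the per-position best labels: the decisions interact.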

SLIDE 17

Challenges with structured output

  • Two challenges

1. We cannot train a separate weight vector for each possible inference outcome

  • For multiclass, we could train one weight vector for each label

2. We cannot enumerate all possible structures for inference

  • Inference for binary/multiclass is easy
  • Solution

– Decompose the output into parts that are labeled
– Define:

  • how the parts interact with each other
  • how labels are scored for each part
  • an inference algorithm to assign labels to all the parts
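The decomposition above can be sketched in a few lines: one shared weight vector scores every part, so there is no weight vector per whole output, and because the parts here do not interact, inference reduces to an independent argmax per part. The feature map and weights are hypothetical:

```python
import numpy as np

# One shared weight vector; the score of a structure is the sum of
# part scores, so we never need a weight vector per whole output.
rng = np.random.default_rng(0)
w = rng.normal(size=4)

def part_features(x, i, label):
    # Hypothetical per-part features: an indicator of (token, label).
    f = np.zeros(4)
    f[2 * (x[i] == "a") + (label == 1)] = 1.0
    return f

def structure_score(x, y):
    # score(x, y) = sum over parts of w . phi(x, part)
    return sum(w @ part_features(x, i, yi) for i, yi in enumerate(y))

def infer(x):
    # With no interactions between parts, inference decomposes:
    # pick the best label for each part independently.
    return [max((0, 1), key=lambda lab: w @ part_features(x, i, lab))
            for i in range(len(x))]

y_hat = infer(["a", "b", "a"])
print(y_hat, structure_score(["a", "b", "a"], y_hat))
```

Once parts interact (pairwise factors, constraints), the per-part argmax is no longer valid and a real inference algorithm is needed.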


SLIDE 18

Multiclass as a structured output

  • A structure is…

– A graph (in general, a hypergraph), possibly labeled and/or directed
– A collection of inter-dependent decisions
– The output of a combinatorial optimization problem

argmax_{y ∈ all outputs} score(x, y)

  • Multiclass

– A graph with one node and no edges

  • Node label is the output

– Can be composed via multiple decisions
– Winner-take-all: argmax_i w^T φ(x, i)
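The winner-take-all rule can be written out directly; the weights and the block-structured joint feature map below are illustrative assumptions, not a prescribed design:

```python
import numpy as np

# Winner-take-all multiclass: y_hat = argmax_i  w . phi(x, i)
NUM_CLASSES, NUM_FEATS = 3, 2
w = np.array([0.5, -1.0, 2.0, 0.1, -0.3, 1.5])  # one block per class

def phi(x, i):
    # Joint feature map: place the input features in class i's block,
    # so a single weight vector scores every (input, class) pair.
    f = np.zeros(NUM_CLASSES * NUM_FEATS)
    f[i * NUM_FEATS:(i + 1) * NUM_FEATS] = x
    return f

def predict(x):
    return max(range(NUM_CLASSES), key=lambda i: w @ phi(x, i))

x = np.array([1.0, 2.0])
print(predict(x))
```

This is the one-node, no-edge "structure": enumerating the outputs is trivial, which is exactly what stops being true for general structures.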


SLIDE 19

Multiclass is a structure: Implications

1. A lot of the ideas from multiclass may be generalized to structures

– Not always trivial, but useful to keep in mind

2. Broad statements about structured learning must apply to multiclass classification

– Useful for sanity check, also for understanding

3. Binary classification is the most “trivial” form of structured classification

– Multiclass with two classes


SLIDE 20

Structured Prediction

The machine learning of interdependent variables


SLIDE 21

Computational issues


  • Model definition: What are the parts of the output? What are the inter-dependencies?
  • How to train the model?
  • How to do inference?
  • Data annotation difficulty
  • Background knowledge about the domain
  • Semi-supervised / indirectly supervised?

SLIDE 22

Computational issues

(Roadmap repeated; focus: model definition.)

SLIDE 23

What does it mean to define the model?


[Figure: output variables y1, y2, y3, y4]

Say we want to predict four output variables from some input x.

SLIDE 24

What does it mean to define the model?


[Figure: y1, y2, y3, y4, each with its own factor]

Say we want to predict four output variables from some input x. Option 1: Score each decision separately.

Recall: Each factor is a local expert about all the random variables connected to it, i.e. a factor can assign a score to assignments of the variables connected to it.

Pro: Prediction is easy; each y is independent.
Con: No consideration of interactions.

SLIDE 25

What does it mean to define the model?


[Figure: y1, y2, y3, y4 with pairwise factors between pairs of variables]

Say we want to predict four output variables from some input x. Option 2: Add pairwise factors.


Pro: Accounts for pairwise dependencies.
Con: Makes prediction harder; ignores third- and higher-order dependencies.
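Option 2 can be sketched as code: unary scores per variable plus pairwise scores on adjacent pairs, with brute-force MAP inference over a deliberately tiny output space. All scores below are invented; a real model would learn them:

```python
from itertools import product

# Pairwise factor model over three binary variables.
LABELS = [0, 1]
unary = [{0: 1.0, 1: 0.2}, {0: 0.1, 1: 0.9}, {0: 0.4, 1: 0.5}]
pairwise = {(0, 0): 0.5, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.5}

def score(y):
    # Sum of unary factor scores plus adjacent-pair factor scores.
    s = sum(unary[i][yi] for i, yi in enumerate(y))
    s += sum(pairwise[a, b] for a, b in zip(y, y[1:]))
    return s

# Prediction is now a joint optimization: here, brute force over the
# 2^3 assignments (for chains, Viterbi would do this efficiently).
y_best = max(product(LABELS, repeat=3), key=score)
print(y_best, score(y_best))
```

Note how the agreement-favoring pairwise factors can flip an individually best label, which is exactly the cost and the benefit of Option 2 over Option 1.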

SLIDE 26

What does it mean to define the model?


[Figure: y1, y2, y3, y4 with factors over triples of variables]

Say we want to predict four output variables from some input x. Option 3: Use only order-3 factors.


Pro: Accounts for order-3 dependencies.
Con: Prediction is even harder; inference must now consider all triples of labels.

SLIDE 27

What does it mean to define the model?


[Figure: y1, y2, y3, y4 with a single factor over all four variables]

Say we want to predict four output variables from some input x. Option 4: Use order-4 factors.


Pro: Accounts for order-4 dependencies.
Con: Basically no decomposition over the labels!

SLIDE 28

What does it mean to define the model?


[Figure: the factorization options over y1, y2, y3, y4 side by side]

Say we want to predict four output variables from some input x.


How do we decide what to do?

SLIDE 29

Some aspects to consider

  • Availability of supervision

– Supervised algorithms are well studied; supervision is hard (or expensive) to obtain

  • Complexity of model

– More complex models encode complex dependencies between parts; complex models make learning and inference harder

  • Features

– Most of the time we will assume that we have a good feature set to model our problem. But do we?

  • Domain knowledge

– Incorporating background knowledge into learning and inference in a mathematically sound way


SLIDE 30

Computational issues

(Roadmap repeated; focus: how to train the model?)

SLIDE 31

Training structured models

  • Inference in training makes all the difference from multiclass/binary classification

  • Empirical risk minimization principle

– Minimize loss over the training data
– Regularize the parameters to prevent overfitting

  • We have seen different training strategies falling under this umbrella

– Conditional Random Fields
– Structural Support Vector Machines
– Structured Perceptron (doesn’t have regularization)

  • Different algorithms exist

– We saw stochastic gradient descent in some detail
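As one concrete instance, the structured perceptron fits in a few lines: predict with the current weights, and on a mistake move the weights toward the gold structure's features. The feature map, the toy data, and the enumeration-based inference are illustrative assumptions:

```python
import numpy as np
from itertools import product

LABELS = [0, 1]

def phi(x, y):
    # Hypothetical joint features: emission-style and transition-style.
    f = np.zeros(4)
    for xi, yi in zip(x, y):
        f[yi] += xi                      # input mass routed to label yi
    for a, b in zip(y, y[1:]):
        f[2 + (a == b)] += 1.0           # adjacent (un)equal-pair counts
    return f

def predict(w, x):
    # Inference by enumeration; Viterbi would replace this for chains.
    return max(product(LABELS, repeat=len(x)), key=lambda y: w @ phi(x, y))

def perceptron_epoch(w, data):
    for x, y_gold in data:
        y_hat = predict(w, x)
        if y_hat != y_gold:
            w = w + phi(x, y_gold) - phi(x, y_hat)  # no regularization
    return w

data = [([1.0, -1.0, 1.0], (1, 0, 1)), ([-1.0, -1.0, 1.0], (0, 0, 1))]
w = np.zeros(4)
for _ in range(10):
    w = perceptron_epoch(w, data)
print([predict(w, x) for x, _ in data])
```

The update calls inference on every example, which is why inference cost dominates structured training in a way it never does for binary classification.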


SLIDE 32

Training considerations

  • Train globally vs train locally


[Figure: one global model over y1, y2, y3, y4 and x]

Global: Train according to your final model.
Pro: Learning uses all the available information.
Con: Computationally expensive.

SLIDE 33

Training considerations

  • Train globally vs train locally


[Figure: the model over y1, y2, y3, y4 and x decomposed into smaller models over pairs: (y1, y2), (y2, y3), (y3, y4), (y1, y4), (y2, y4), (y1, y3)]

Local: Decompose your model into smaller ones and train each one separately. The full model is still used at prediction time.
Pro: Easier to train.
Con: May not capture global dependencies.

SLIDE 34

Training considerations

  • Local vs global

– Local learning

  • Learn parameters for individual components independently
  • Learning algorithm not aware of the full structure

– Global learning

  • Learn parameters for the full structure
  • Learning algorithm “knows” about the full structure

– Depends on inference complexity
– Jury is still out on which one is better
– Also depends on the size of the available data


How do we choose?

SLIDE 35

Computational issues

(Roadmap repeated; focus: how to do inference?)

SLIDE 36

Inference

  • What is inference? The prediction step

– More broadly, an aggregation operation on the space of outputs for an example: max, expectation, sample, sum
– Different flavors: MAP, marginal, loss-augmented

  • Many algorithms, solution strategies

– Combinatorial optimization; one size doesn’t fit all
– Graph algorithms, integer linear programming, heuristics, Monte Carlo methods, …

  • Some tradeoffs

– Programming effort
– Exact vs. inexact
– Is the problem solvable with a known algorithm?
– Do we care about the exact answer?


How do we choose?
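As an example of a problem solvable with a known algorithm: MAP inference for a chain model with unary (emission) and pairwise (transition) scores is solved exactly by Viterbi. A minimal sketch with invented scores:

```python
import numpy as np

def viterbi(unary, trans):
    # unary[i, j]: score of label j at position i; trans[a, b]: score
    # of label b following label a. Returns the highest-scoring path.
    n, k = unary.shape
    best = unary[0].copy()              # best score ending in each label
    back = np.zeros((n, k), dtype=int)  # backpointers
    for i in range(1, n):
        cand = best[:, None] + trans            # cand[a, b]
        back[i] = np.argmax(cand, axis=0)       # best predecessor per label
        best = cand[back[i], np.arange(k)] + unary[i]
    # Follow backpointers from the best final label.
    y = [int(np.argmax(best))]
    for i in range(n - 1, 0, -1):
        y.append(int(back[i][y[-1]]))
    return y[::-1]

unary = np.array([[1.0, 0.2], [0.1, 0.9], [0.4, 0.5]])
trans = np.array([[0.5, 0.0], [0.0, 0.5]])
print(viterbi(unary, trans))
```

This is O(n k^2) instead of O(k^n): exact, cheap, but only because the chain structure was assumed. For general factor graphs, that tradeoff list above is the real decision.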

SLIDE 37

Computational issues

(Roadmap repeated; focus: background knowledge about the domain.)

SLIDE 38

How does background knowledge affect your choices?

  • Background knowledge biases your predictor in several ways

– What is the model?

  • Maybe third-order factors are not needed, etc.

– Your choices of learning and inference algorithms
– Feature functions
– Constraints that prohibit certain inference outcomes


SLIDE 39

Computational issues

(Roadmap repeated; focus: data annotation difficulty and semi-supervised/indirectly supervised learning.)

SLIDE 40

Data and how it influences your model

  • Annotated data is a precious resource

– Takes specialized expertise to generate
– Or very clever tricks (like online games that produce data as a side effect)

  • Important directions

– Learning with latent representations, indirect supervision, partial supervision
– In all these cases:

  • Learning is rarely a convex problem
  • Modeling choices become very important! A bad model will hurt


SLIDE 41

Looking ahead

  • Big questions (a very limited and biased set)

– Representations

  • Can we learn the factorization?
  • Can we learn feature functions?

– Dealing with the data problem for new applications

  • Clever tricks to get data
  • Taming latent variable learning

– Applications

  • How does structured prediction help you?
  • Gathering importance as computer programs have to deal with uncertain, noisy inputs and make complex decisions
