Robust Models in Information Retrieval

Nedim Lipka, Benno Stein
Bauhaus-Universität Weimar
www.webis.de
Outline

· Introduction
· Bias and Variance
· Robust Models in IR
· Summary
· Excursus: Bias Types
Introduction
Classification Task

Given:
❑ set O of real-world objects o
❑ feature space X with feature vectors x
❑ classification function (closed form unknown) c : X → Y
❑ sample S = {(x, y) | x ∈ X, y = c(x)}

Searched:
❑ a hypothesis h ∈ H that minimizes the generalization error $\mathit{err}(h) := P(h(\mathbf{x}) \neq c(\mathbf{x}))$

Measuring the effectiveness of h:
❑ $\mathit{err}_S(h) = \frac{1}{|S|} \sum_{(\mathbf{x},y) \in S} \mathit{loss}_{0/1}(h(\mathbf{x}), c(\mathbf{x}))$
  err_S(h) is called the test error if S is not used for the construction of h.
❑ $\mathit{err}(h^*) := \min_{h \in H} \mathit{err}(h)$ defines a lower bound for err(h) ➜ restriction bias
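To make err_S(h) concrete, here is a minimal Python sketch; the hypothesis h and the sample S are hypothetical placeholders, not part of the slides:

```python
# Empirical error of a hypothesis h on a labeled sample S under the 0/1 loss.
def zero_one_loss(y_pred, y_true):
    """loss_{0/1}: 1 for a misclassification, 0 otherwise."""
    return 0 if y_pred == y_true else 1

def empirical_error(h, S):
    """err_S(h) = (1/|S|) * sum of 0/1 losses over the sample S."""
    return sum(zero_one_loss(h(x), y) for x, y in S) / len(S)

# Example: a threshold hypothesis on a one-dimensional feature space.
h = lambda x: 1 if x > 0.5 else 0
S = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1), (0.3, 1)]  # last pair is misclassified
print(empirical_error(h, S))  # 0.2
```

If S was held out during the construction of h, this value is the test error.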
Model Formation Task

The process (the function) α for deriving x from o is called model formation:

α : O → X

Choosing between different model formation functions α_1, …, α_m
➜ choosing between different feature spaces X_{α_1}, …, X_{α_m}
➜ choosing between different hypothesis spaces H_{α_1}, …, H_{α_m}

[Figure: each feature space X_{α_i} induces its own hypothesis space H_{α_i}]

We call the model under α_1 more robust than the model under α_2 iff

$\mathit{err}_S(h^*_{\alpha_1}) > \mathit{err}_S(h^*_{\alpha_2})$ and $\mathit{err}(h^*_{\alpha_1}) < \mathit{err}(h^*_{\alpha_2})$

i.e., the more robust model fits the sample worse but generalizes better (see the sketch below).
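The definition suggests a measurement procedure: estimate err_S via the training error and err(h) via a held-out set, once per model formation function. A minimal sketch, assuming scikit-learn and synthetic data; the feature maps alpha_1, alpha_2 and all constants are illustrative:

```python
# Compare two model formation functions alpha_1, alpha_2 by training error
# (err_S) and held-out error (an estimate of the generalization error).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
o = rng.uniform(-1, 1, size=(100, 1))              # "real-world objects"
y = (o[:, 0] ** 2 > 0.25).astype(int)              # true concept c

alpha_1 = lambda o: np.hstack([o, o ** 2])         # compact, well-suited features
alpha_2 = lambda o: np.hstack([o, o ** 2,          # same features plus 50 noise dimensions
                               rng.normal(size=(len(o), 50))])

for name, alpha in [("alpha_1", alpha_1), ("alpha_2", alpha_2)]:
    X_tr, X_te, y_tr, y_te = train_test_split(alpha(o), y, test_size=0.5, random_state=0)
    h = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print(name,
          "err_S =", round(1 - h.score(X_tr, y_tr), 3),
          "err estimate =", round(1 - h.score(X_te, y_te), 3))
```

If alpha_1 shows the higher training error but the lower held-out error, the model under alpha_1 is the more robust one in the sense defined above.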
The Whole Picture

[Diagram: real-world object classification maps the objects O to the classes Y; model formation α maps O to the feature space X; feature vector classification c maps X to Y]

Learning means searching for an h ∈ H such that P(h(x) ≠ c(x)) is minimal.
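Read as code, the diagram is a function composition: h ∘ α should approximate the real-world classification of O. A minimal, entirely illustrative sketch with texts as objects:

```python
# Model formation alpha : O -> X followed by classification on X.
# All names and thresholds are illustrative.
def alpha(document: str) -> tuple:
    """Model formation: map a real-world object (a text) to a feature vector."""
    words = document.lower().split()
    return (len(words),                                        # document length
            sum(w in {"buy", "cheap", "free"} for w in words)) # spam-word count

def h(x: tuple) -> str:
    """A hand-written hypothesis on the feature space X."""
    length, spam_words = x
    return "spam" if spam_words / max(length, 1) > 0.2 else "ham"

print(h(alpha("Buy cheap watches free shipping")))    # spam
print(h(alpha("Meeting moved to Tuesday afternoon"))) # ham
```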
Bias and Variance
Error Decomposition

Consider:
❑ a feature vector x and its predicted class label ŷ = h(x), where
❑ h is characterized by a weight vector θ, where
❑ θ has been estimated based on a random sample S = {(x, c(x))}.
➜ θ ≡ θ(S), and hence h ≡ h(θ_S)

Observations:
❑ A series of samples S_i, S_i ⊆ U, entails a series of hypotheses h(θ_i),
❑ giving for a feature vector x a series of class labels ŷ_i = h(θ_i, x).
➜ ŷ is considered a random variable, denoted as Z.

Consequences:
❑ σ²(Z), the variance of Z, is the variance of the prediction.
❑ |θ| : |S| ↑ ➜ σ²(Z) ↑ (more parameters relative to the sample size increase the prediction variance)
❑ |S| : |U| ↓ ➜ σ²(Z) ↑ (a smaller sample relative to the population increases the prediction variance)

A resampling sketch of this effect follows below.
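The effect of |θ| : |S| on σ²(Z) can be observed by refitting the same learner on many random samples S_i and measuring the variance of the predictions at a fixed x. A minimal sketch, assuming NumPy; the target function, noise level, and polynomial learner are illustrative:

```python
# Estimate the prediction variance sigma^2(Z) at a fixed point x0 by
# refitting a polynomial learner on many random samples S_i.
import numpy as np

rng = np.random.default_rng(1)
c = lambda x: np.sin(2 * np.pi * x)   # "true" target function
x0 = 0.3                              # fixed query point

def predictions(degree, n_samples=500, sample_size=20):
    """One prediction h(theta_i, x0) per random sample S_i."""
    preds = []
    for _ in range(n_samples):
        xs = rng.uniform(0, 1, sample_size)
        ys = c(xs) + rng.normal(0, 0.2, sample_size)  # noisy labels
        theta = np.polyfit(xs, ys, degree)            # theta estimated from S_i
        preds.append(np.polyval(theta, x0))
    return np.array(preds)

for degree in (1, 9):                 # small |theta| vs. large |theta|, |S| fixed
    print(f"degree {degree}: prediction variance = {predictions(degree).var():.4f}")
```

With |S| fixed, the degree-9 fit (larger |θ| : |S|) typically shows the larger prediction variance.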
Error Decomposition (continued)

Let Z and Y denote the random variables for ŷ (= h(θ_S, x)) and y (= c(x)), and assume Z and Y are independent, so that E(Z · Y) = E(Z) · E(Y).

$$
\begin{aligned}
\mathit{MSE}(Z) &= E((Z - Y)^2) \\
                &= E(Z^2 - 2 \cdot Z \cdot Y + Y^2) \\
                &= E(Z^2) - 2 \cdot E(Z \cdot Y) + E(Y^2) \\
                &= (E(Z))^2 + \sigma^2(Z) - 2 \cdot E(Z \cdot Y) + (E(Y))^2 + \sigma^2(Y) \\
                &= (E(Z))^2 - 2 \cdot E(Z) \cdot E(Y) + (E(Y))^2 + \sigma^2(Z) + \sigma^2(Y) \\
                &= (E(Z) - E(Y))^2 + \sigma^2(Z) + \sigma^2(Y) \\
                &= (E(Z - Y))^2 + \sigma^2(Z) + \sigma^2(Y) \\
                &= (\mathit{bias}(Z))^2 + \sigma^2(Z) + \mathit{IrreducibleError}
\end{aligned}
$$

Here $E(X^2) = (E(X))^2 + \sigma^2(X)$ is used twice, and $\sigma^2(Y)$ is the irreducible error. If Y is constant: $\mathit{MSE}(Z) = (E(Z) - Y)^2 + \sigma^2(Z)$.
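The decomposition is easy to verify numerically for the constant-Y case; a minimal sketch with NumPy, where the distribution of Z is an arbitrary choice:

```python
# Numerical check of MSE(Z) = (E(Z) - Y)^2 + sigma^2(Z) for constant Y.
import numpy as np

rng = np.random.default_rng(2)
Y = 1.0                              # constant true value
Z = rng.normal(1.4, 0.5, 1_000_000)  # predictions: biased (0.4) and noisy (0.5)

print(np.mean((Z - Y) ** 2))          # MSE(Z), approx. 0.4**2 + 0.5**2 = 0.41
print((Z.mean() - Y) ** 2 + Z.var())  # bias^2 + variance, identical up to rounding
```

Both lines print ≈ 0.41: with the population-style variance (ddof = 0), the identity holds exactly on the sample.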