Learning from Irregularly-Sampled Time Series: A Missing Data Perspective
Steven Cheng-Xian Li, Benjamin M. Marlin
University of Massachusetts Amherst
Irregularly-Sampled Time Series
Irregularly-sampled time series: time series with non-uniform time intervals between successive measurements.
Problem and Challenges
Problem: learning from a collection of irregularly-sampled time series within a common time interval.
Challenges:
• Each time series is observed at arbitrary time points.
• Different data cases may have different numbers of observations.
• Observed samples may not be aligned in time.
• Many real-world time series are extremely sparse.
• Most machine learning algorithms require data lying in a fixed-dimensional feature space.
Problem and Challenges
Problem: learning from a collection of irregularly-sampled time series within a common time interval.
Tasks:
• Learning the distribution of latent temporal processes.
• Inferring the latent process associated with a time series.
• Classification of time series.
This can be transformed into a missing data problem.
Index Representation of Incomplete Data
Data defined on an index set I:
• Complete data as a mapping I → R.
• Examples of index sets: pixel coordinates for images, timestamps for time series.
• Applicable to both finite and continuous index sets.
Index representation of an incomplete data case (x, t):
• t ≡ {t_i}_{i=1}^{|t|} ⊂ I are the indices of the observed entries.
• x_i is the corresponding value observed at t_i.
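A minimal NumPy sketch of this index representation, with made-up numbers: an incomplete time series is stored as a pair of equal-length arrays, the observed values x and their timestamps t.

import numpy as np

# Hypothetical example: an irregularly-sampled series on the interval I = [0, T].
T = 48.0
t = np.array([0.7, 3.2, 9.5, 27.1, 40.8])   # observation times t_i, a subset of I
x = np.array([1.3, 0.9, 2.4, 1.1, 0.5])     # values x_i observed at the times t_i

assert len(t) == len(x)                     # |t| observed entries
incomplete_case = (x, t)                    # index representation (x, t)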
Generative Process of Incomplete Data
Generative process for an incomplete case (x, t):
f ~ p_θ(f)                    complete data f: I → R
t ~ p_I(t | f)                t ∈ 2^I (a subset of I)
x = [f(t_i)]_{i=1}^{|t|}      values of f indexed at t
Goal: learning the complete data distribution p_θ given the incomplete dataset D = {(x_i, t_i)}_{i=1}^{n}.
Generative Process of Incomplete Data
Generative process for an incomplete case (x, t):
f ~ p_θ(f)                    complete data f: I → R
t ~ p_I(t)                    independence between f and t
x = [f(t_i)]_{i=1}^{|t|}      values of f indexed at t
Goal: learning the complete data distribution p_θ given the incomplete dataset D = {(x_i, t_i)}_{i=1}^{n}.
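A minimal sketch of this generative process under the independence assumption. The concrete choices here (a random sinusoid standing in for p_θ(f), a Poisson number of uniform times standing in for p_I(t)) are illustrative assumptions, not the paper's model.

import numpy as np

rng = np.random.default_rng(0)
T = 48.0

def sample_complete_function():
    # Toy stand-in for f ~ p_theta(f): a random sinusoid on [0, T].
    a = rng.normal()
    w = rng.uniform(0.1, 1.0)
    phi = rng.uniform(0.0, 2.0 * np.pi)
    return lambda t: a * np.sin(w * t + phi)

def sample_times():
    # Toy stand-in for t ~ p_I(t), drawn independently of f.
    n = rng.poisson(10) + 1
    return np.sort(rng.uniform(0.0, T, size=n))

f = sample_complete_function()   # f ~ p_theta(f)
t = sample_times()               # t ~ p_I(t)
x = f(t)                         # x = [f(t_1), ..., f(t_|t|)]
incomplete_case = (x, t)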
Encoder-Decoder Framework for Incomplete Data
Probabilistic latent variable model.
Decoder:
• Models the data generating process: z ~ p_z(z), f = g_θ(z).
• Given t ~ p_I, the corresponding values are g_θ(z, t) ≡ [f(t_i)]_{i=1}^{|t|}.
• Note: our goal is to model g_θ, not p_I.
Encoder-Decoder Framework for Incomplete Data
• Different incomplete cases carry different levels of uncertainty.
Encoder (stochastic):
• Models the posterior distribution q_φ(z | x, t).
• Functional form: q_φ(z | x, t) = q_φ(z | m(x, t)).
• Example: q_φ(z | x, t) = N(z | µ_φ(v), Σ_φ(v)) with v = m(x, t).
Masking function m(x, t):
• Maps the incomplete case (x, t) onto the full index set, e.g., by replacing all missing entries with zero.
[Figure: an incomplete series (value x vs. time t on [0, T]) and its masked representation m(x, t).]
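A minimal sketch of one possible masking function m(x, t) for the finite-index-set case (the continuous-time version, a sum of Dirac deltas, appears later in the continuous convolutional layer). The zero-filled vector v is what the encoder networks µ_φ and Σ_φ would consume; all names here are hypothetical.

import numpy as np

def mask_fn(x, t, index_set):
    # m(x, t) for a finite index set: place each observed value x_i at its index t_i
    # and fill every unobserved entry with zero.
    v = np.zeros(len(index_set))
    position = {idx: k for k, idx in enumerate(index_set)}
    for value, idx in zip(x, t):
        v[position[idx]] = value
    return v

# Hypothetical usage with a finite index set of 6 entries, 3 of them observed.
index_set = [0, 1, 2, 3, 4, 5]
v = mask_fn(x=np.array([1.3, 0.9, 2.4]), t=np.array([1, 3, 4]), index_set=index_set)
# v == [0.0, 1.3, 0.0, 0.9, 2.4, 0.0]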
Partial Variational Autoencoder (P-VAE)
Generative process:
t ~ p_I(t)
z ~ p(z)
f = g_θ(z)
x_i ~ p(x_i | f(t_i))        (i.i.d. noise)
Example: p(x_i | f(t_i)) = N(x_i | f(t_i), σ^2)
Joint distribution:
p(x, t) = p_I(t) ∫ p(z) ∏_{i=1}^{|t|} p_θ(x_i | z, t_i) dz
p_θ(x_i | z, t_i) is shorthand for p(x_i | f(t_i)) with f = g_θ(z).
[Diagram: encoder q_φ maps a data case (x, t) ~ p_D to z; decoder g_θ maps z to the reconstruction x̂.]
Partial Variational Autoencoder (P-VAE)
Variational lower bound for log p(x, t):
∫ q_φ(z | x, t) log [ p_z(z) p_I(t) ∏_{i=1}^{|t|} p_θ(x_i | z, t_i) / q_φ(z | x, t) ] dz
Learning with gradients, without p_I(t) involved (the p_I(t) term does not depend on φ or θ, so it can be dropped):
∇_{φ,θ} E_{z ~ q_φ(z | x, t)} [ log ( p_z(z) ∏_{i=1}^{|t|} p_θ(x_i | z, t_i) / q_φ(z | x, t) ) ]
References: Kingma & Welling (2014), Auto-Encoding Variational Bayes; Ma et al. (2018), Partial VAE for Hybrid Recommender System.
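A minimal NumPy sketch of the per-case P-VAE objective under stated assumptions: a diagonal-Gaussian encoder, a Gaussian observation model with fixed σ, and abstract encode/decode callables standing in for q_φ and g_θ. All names here are hypothetical, and the p_I(t) term is dropped as on the slide.

import numpy as np

rng = np.random.default_rng(0)

def pvae_bound(x, t, encode, decode, sigma=0.1, n_samples=8):
    # Monte Carlo estimate of the P-VAE bound with the p_I(t) term dropped
    # (it does not depend on theta or phi). `encode(x, t)` returns (mu, log_var)
    # of a diagonal-Gaussian q_phi(z | x, t); `decode(z, t)` returns g_theta(z, t).
    mu, log_var = encode(x, t)
    std = np.exp(0.5 * log_var)
    bound = 0.0
    for _ in range(n_samples):
        eps = rng.standard_normal(mu.shape)
        z = mu + std * eps                                      # reparameterized z ~ q_phi(z | x, t)
        f_t = decode(z, t)                                      # [f(t_i)] with f = g_theta(z)
        log_lik = -0.5 * np.sum((x - f_t) ** 2 / sigma**2
                                + np.log(2 * np.pi * sigma**2)) # sum_i log N(x_i | f(t_i), sigma^2)
        log_prior = -0.5 * np.sum(z**2 + np.log(2 * np.pi))     # log p_z(z), standard normal prior
        log_q = -0.5 * np.sum(eps**2 + np.log(2 * np.pi) + log_var)  # log q_phi(z | x, t)
        bound += (log_lik + log_prior - log_q) / n_samples
    return bound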
Partial Variational Autoencoder (P-VAE)
Conditional objective (lower bound for log p(x | t)):
E_{z ~ q_φ(z | x, t)} [ log ( p_z(z) ∏_{i=1}^{|t|} p_θ(x_i | z, t_i) / q_φ(z | x, t) ) ]
Related work:
• MIWAE [Mattei & Frellsen, 2019]
• Partial VAE [Ma et al., 2018]
• Neural processes [Garnelo et al., 2018]
Partial Bidirectional GAN (P-BiGAN)
[Diagram: two paths feed a discriminator D over tuples (x, t, z).
Encoding path: a real incomplete case (x, t) ~ p_D is encoded as z ~ q_φ(z | x, t).
Decoding path: z' ~ p_z together with times (·, t') ~ p_D is decoded as x' = g_θ(z', t').]
Discriminator: D(m(x, t), z)
References: Li, Jiang, Marlin (2019), MisGAN: Learning from Incomplete Data with GANs; Donahue et al. (2016), Adversarial Feature Learning (BiGAN).
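A minimal sketch of how the two kinds of discriminator inputs could be assembled, with encode_sample, decode, sample_z, sample_times, and mask as hypothetical callables standing in for q_φ, g_θ, p_z, the time distribution of the data, and m(x, t). The adversarial losses and optimizers are omitted; this is not the paper's full training procedure.

import numpy as np

def pbigan_pairs(x, t, encode_sample, decode, sample_z, sample_times, mask):
    # Encoding path: a real incomplete case (x, t) paired with z ~ q_phi(z | x, t).
    z_enc = encode_sample(x, t)
    encoding_pair = (mask(x, t), z_enc)

    # Decoding path: z' ~ p_z with times t' drawn from the data, decoded to
    # x' = g_theta(z', t'), and paired with the z' that generated it.
    z_dec = sample_z()
    t_prime = sample_times()
    x_prime = decode(z_dec, t_prime)
    decoding_pair = (mask(x_prime, t_prime), z_dec)

    # A discriminator D(m(x, t), z) is trained to separate the two kinds of pairs,
    # while q_phi and g_theta are trained adversarially to make them indistinguishable.
    return encoding_pair, decoding_pair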
Invertibility Property of P-BiGAN
Theorem: For (x, t) with non-zero probability, if z ~ q_φ(z | x, t) then g_θ(z, t) = x.
g_θ(z, t) is shorthand notation for [f(t_i)]_{i=1}^{|t|} with f = g_θ(z).
Autoencoding Regularization for P-BiGAN
[Diagram: the encoding path is additionally decoded through g_θ to a reconstruction x̂, and a reconstruction loss ℓ(x, x̂) is added to the P-BiGAN objective.]
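A minimal sketch of this regularizer, assuming the same hypothetical encode_sample/decode callables as above; squared error is used as one plausible choice of ℓ, not necessarily the paper's.

def autoencoding_penalty(x, t, encode_sample, decode,
                         loss=lambda a, b: float(((a - b) ** 2).mean())):
    # Reconstruction term l(x, x_hat) added to the P-BiGAN objective: encode the
    # incomplete case, decode back at the same times, and compare to the observations.
    z = encode_sample(x, t)   # z ~ q_phi(z | x, t)
    x_hat = decode(z, t)      # x_hat = g_theta(z, t)
    return loss(x, x_hat)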
Missing Data Imputation
Imputation distribution:
p(x' | t', x, t) = E_{z ~ q_φ(z | x, t)} [ p_θ(x' | z, t') ]
Sampling:
z ~ q_φ(z | x, t)
f = g_θ(z)
x' = [f(t'_i)]_{i=1}^{|t'|}
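A minimal sketch of the sampling procedure above, again with hypothetical encode_sample and decode callables; the observation noise on x' is ignored, so each sample is the decoded value at the query times t'.

def impute(x, t, t_query, encode_sample, decode, n_samples=5):
    # Draw imputations from p(x' | t', x, t) = E_{z ~ q_phi(z|x,t)}[p_theta(x' | z, t')]
    # by sampling z from the approximate posterior and decoding at the query times t'.
    imputations = []
    for _ in range(n_samples):
        z = encode_sample(x, t)                  # z ~ q_phi(z | x, t)
        imputations.append(decode(z, t_query))   # x' = [f(t'_i)] with f = g_theta(z)
    return imputations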
Supervised Learning: Classification
Adding a classification term to the objective:
E_{z ~ q_φ(z | x, t)} [ log ( p_z(z) p_θ(x | z, t) p(y | z) / q_φ(z | x, t) ) ]
  = E_{q_φ(z | x, t)} [ log ( p_z(z) p_θ(x | z, t) / q_φ(z | x, t) ) ] + E_{q_φ(z | x, t)} [ log p(y | z) ]
The first term is the original variational bound; the second is the classification regularization.
Prediction:
ŷ = argmax_y E_{q_φ(z | x, t)} [ log p(y | z) ]
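A minimal sketch of the prediction rule, assuming a hypothetical class_logits callable that returns unnormalized scores for p(y | z); the expectation over q_φ is estimated by averaging over posterior samples.

import numpy as np

def predict_label(x, t, encode_sample, class_logits, n_samples=16):
    # Prediction rule y_hat = argmax_y E_{q_phi(z|x,t)}[log p(y | z)], estimated by
    # averaging normalized class log-probabilities over posterior samples of z.
    avg = None
    for _ in range(n_samples):
        z = encode_sample(x, t)                       # z ~ q_phi(z | x, t)
        logits = class_logits(z)                      # unnormalized scores for p(y | z)
        log_p = logits - np.logaddexp.reduce(logits)  # log p(y | z)
        avg = log_p if avg is None else avg + log_p
    return int(np.argmax(avg / n_samples))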
MNIST Completion
[Figure: P-VAE and P-BiGAN completions under square observation with 90% missing, and under independent dropout with 90% missing.]
CelebA Completion
[Figure: P-VAE and P-BiGAN completions under square observation with 90% missing, and under independent dropout with 90% missing.]
Architecture for Irregularly-Sampled Time Series
How do we construct the decoder, encoder, and discriminator for a continuous index set, e.g., time series with I = [0, T]?
[Diagram: the P-VAE and P-BiGAN architectures from the previous slides.]
Decoder for Continuous Time Series
Generative process for time series:
z ~ p_z(z)
v = CNN_θ(z)        (values on evenly-spaced times u)
f(t) = Σ_{i=1}^{L} K(u_i, t) v_i / Σ_{i=1}^{L} K(u_i, t)        (kernel smoother)
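A minimal sketch of the kernel smoother step, using a Gaussian kernel as one plausible choice of K (the choice of kernel and bandwidth here is an assumption, and the grid values v are just random stand-ins for the CNN output).

import numpy as np

def kernel_smoother(v, u, t_query, bandwidth):
    # Continuous decoder output: f(t) = sum_i K(u_i, t) v_i / sum_i K(u_i, t),
    # where v are the CNN outputs on the evenly-spaced grid u.
    K = np.exp(-0.5 * ((t_query[:, None] - u[None, :]) / bandwidth) ** 2)  # shape (|t_query|, L)
    return (K @ v) / K.sum(axis=1)

# Hypothetical usage: L = 8 grid points on [0, T], queried at arbitrary times.
T = 48.0
u = np.linspace(0.0, T, 8)
v = np.random.default_rng(0).standard_normal(8)   # stand-in for v = CNN_theta(z)
f_t = kernel_smoother(v, u, t_query=np.array([3.7, 11.2, 40.5]), bandwidth=T / 8)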
Continuous Convolutional Layer
CNN encoder/discriminator: cross-correlation, evaluated on an evenly-spaced index u, between
• a continuous kernel w(t), and
• the masked function m(x, t)(t) = Σ_{i=1}^{|t|} x_i δ(t − t_i), where δ(·) is the Dirac delta function.
Continuous Convolutional Layer
Cross-correlation between w and m(x, t), evaluated on the evenly-spaced grid u:
(w ⋆ m(x, t))(u) = Σ_{i: t_i ∈ neighbor(u)} w(t_i − u) x_i
Construct the kernel w(t) using a degree-1 B-spline.
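A minimal sketch of this cross-correlation. In the actual layer the kernel is learned (parameterized with degree-1 B-spline bases); here a fixed hat-shaped kernel of the same form is used as a simplified stand-in, and the grid size and width are arbitrary example values.

import numpy as np

def continuous_conv(x, t, u, width):
    # (w * m(x, t))(u_j) = sum_{i: t_i in neighbor(u_j)} w(t_i - u_j) x_i, with w here
    # a fixed degree-1 B-spline (hat function) of the given support width, so only
    # observations within `width` of the grid point u_j contribute.
    def w(d):
        return np.maximum(0.0, 1.0 - np.abs(d) / width)
    out = np.zeros(len(u))
    for j, u_j in enumerate(u):
        out[j] = np.sum(w(t - u_j) * x)   # weights vanish outside the neighborhood of u_j
    return out

# Hypothetical usage: map an irregular series onto an 8-point grid of features.
u = np.linspace(0.0, 48.0, 8)
features = continuous_conv(x=np.array([1.3, 0.9, 2.4]),
                           t=np.array([3.2, 9.5, 27.1]),
                           u=u, width=48.0 / 7)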