Missing Data Imputation using Optimal Transport
Boris Muzellec, Julie Josse, Claire Boyer, Marco Cuturi
The missing data issue
• Big data is plagued with missing values
• What to do?
Option 1: Remove entries with missing values ⇒ information loss, not sustainable
[Figure: example with 25% missing rate; panels for 2d, 3d, 6d, 10d]
With 1% missing rate:
- 5d: 95% rows kept
- 300d: 5% rows kept
Option 2: Impute with reasonable guesses
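The "rows kept" figures for Option 1 can be checked with a short calculation. The sketch below assumes each entry is missing independently with the same probability (an assumption not stated on the slide), so a row of d features is fully observed with probability (1 - p)^d. The function name `complete_row_fraction` is hypothetical and used only for illustration.

```python
def complete_row_fraction(missing_rate: float, n_features: int) -> float:
    """Expected fraction of rows with no missing entry.

    Assumes every entry is missing independently with probability
    `missing_rate` (an illustrative assumption, not taken from the slides).
    """
    return (1.0 - missing_rate) ** n_features


# Settings mentioned on the slide: 1% missing rate in 5d and 300d,
# and 25% missing rate in 2d, 3d, 6d and 10d.
settings = [(0.01, 5), (0.01, 300), (0.25, 2), (0.25, 3), (0.25, 6), (0.25, 10)]
for rate, dims in settings:
    frac = complete_row_fraction(rate, dims)
    print(f"missing rate {rate:.0%}, {dims:>3}d: {frac:.1%} of rows kept")
```

Under this independence assumption, the 1% cases come out close to the 95% and 5% quoted on the slide, which shows how quickly row deletion becomes unsustainable as the dimension grows.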
Outline
1. Missing data and Optimal Transport
2. Non-parametric imputation with OT
3. Fitting parametric imputation models with OT
How to impute?
Deforms joint and marginal distributions:
- Mean imputation
- Regression (conditional expectation)
Preserves distributions:
• Using a conditional model:
- With logistic, multinomial, Poisson regressions: R's mice (Van Buuren, 2011)
• Assuming a joint model:
- EM + Gaussian distribution: Amelia (Honaker et al., 2011)
- Low-rank models: Softimpute (Mazumder et al., 2010)
- VAE and GAN: MIWAE (Mattei & Frellsen, 2019), GAIN (Yoon et al., 2018)
- …
This work:
- Preserves distributions
- Parametric assumption not necessary
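To illustrate the point that mean imputation deforms the marginal distribution, here is a minimal sketch on hypothetical toy data (standard normal draws with 25% of entries masked completely at random; neither the data nor the masking scheme comes from the slides). It is not the method proposed in this work. It only shows that filling every gap with the observed mean shrinks the variance.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical toy data: 10,000 draws from a standard normal distribution.
full = [random.gauss(0.0, 1.0) for _ in range(10_000)]

# Mask about 25% of the entries completely at random (None = missing).
observed = [x if random.random() > 0.25 else None for x in full]

# Mean imputation: replace every missing entry with the observed mean.
obs_values = [x for x in observed if x is not None]
obs_mean = statistics.fmean(obs_values)
imputed = [obs_mean if x is None else x for x in observed]

var_observed = statistics.pvariance(obs_values)
var_imputed = statistics.pvariance(imputed)
print(f"variance of observed values:    {var_observed:.3f}")
print(f"variance after mean imputation: {var_imputed:.3f}")
```

Because every imputed entry sits exactly at the mean, the variance after imputation equals the observed variance scaled by the fraction of observed entries, so the imputed marginal is too concentrated around its centre.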