Ordinary Least Squares for Histogram Data based on Wasserstein Distance Rosanna Verde Antonio Irpino Dipartimento di Studi Europei e Mediterranei Seconda Università degli Studi di Napoli (ITALY) [rosanna.verde] [antonio.irpino]@unina2.it COMPSTAT 2010 - Paris - August 22-27
Outline
Histogram data
A regression model for histogram variables
Properties of the Wasserstein distance
Ordinary Least Squares fitting
Tools for the interpretation
An application on real data
Sources of histogram data
Results of summary/clustering procedures: from surveys, from large databases.
From sensors: temperatures, pollutant concentration, network activity.
Data streams: description of time windows.
Image analysis: color bandwidths.
Confidential data: summary (non-punctual) data.
[Figure: two example histograms]
Histogram data as a particular case of modal symbolic descriptions [Bock and Diday (2000)]
Histogram data are a kind of symbolic representation in which an individual is described by means of a histogram.
In Bock and Diday (2000), the histogram variable is one of three definitions of modal numerical variables:
[Histogram variable] The description is a classic histogram whose support is partitioned into intervals. Each interval is weighted by its empirical density.
[Empirical distribution function variable] The description is given by an empirical distribution function.
[Model of distribution variable] The description is given by a predefined model of a random variable.
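Histogram variable
Let $Y$ be a continuous variable defined on a finite support $S = [\underline{y}, \overline{y}]$, where $\underline{y}$ and $\overline{y}$ are the minimum and maximum values of the variable domain.
The support of $Y$ is partitioned into a set of contiguous intervals (bins) $\{I_1, \ldots, I_h, \ldots, I_H\}$, with $I_h = [\underline{y}_h, \overline{y}_h)$.
Given $n$ observations $y_1, \ldots, y_n$ of the variable $Y$, each semi-open interval $I_h$ is associated with a random variable equal to
\[ \Psi(I_h) = \sum_{u=1}^{n} \Psi_{y_u}(I_h), \qquad \Psi_{y_u}(I_h) = \begin{cases} 1 & \text{if } y_u \in I_h \\ 0 & \text{otherwise.} \end{cases} \]
It is possible to associate with $I_h$ an empirical distribution $\pi_h = \Psi(I_h)/n$.
A histogram of $Y$ is the representation in which each pair $(I_h, \pi_h)$ (for $h = 1, \ldots, H$) is represented by a vertical bar, with base interval $I_h$ along the horizontal axis and area proportional to $\pi_h$.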
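The construction above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function name `histogram_description` and the sample values are hypothetical, and NumPy closes the last bin on the right, a small departure from the semi-open convention.

```python
import numpy as np

def histogram_description(observations, edges):
    """Return (edges, weights): the bins I_h and their relative frequencies pi_h.

    Hypothetical helper for illustration. Observations outside the
    support are ignored by np.histogram, so the weights are relative
    to the observations that fall inside [edges[0], edges[-1]].
    """
    edges = np.asarray(edges, dtype=float)
    counts, _ = np.histogram(observations, bins=edges)  # Psi(I_h) for each bin
    weights = counts / counts.sum()                     # pi_h = Psi(I_h) / n
    return edges, weights

# Eight made-up observations on the support [0, 10], four bins of width 2.5.
edges, weights = histogram_description(
    [1.0, 2.0, 2.5, 3.5, 4.0, 7.0, 9.0, 9.5],
    [0.0, 2.5, 5.0, 7.5, 10.0],
)
```

The pair `(edges, weights)` is one possible in-memory form of the histogram description $\{(I_h, \pi_h)\}$.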
A regression model for histogram variables
In order to study the dependence of a histogram variable Y (dependent) on another histogram variable X (independent), we introduce a new regression approach based on the Ordinary Least Squares estimation method.
Given the nature of the variables, we propose to compute the squared deviations in the least squares function by using the Wasserstein distance.
A regression model for histogram variables
Data = Model Fit + Residual
Linear regression is a general method for estimating and describing the association between a continuous outcome variable (dependent) and one or more predictors in a single equation.
This is a conceptually easy task with classic data. But what does it mean when dealing with histogram data?
Simple linear regression
[Figure: two panels, "Classic data" and "Histogram data"]
Regression between histograms: a proposal
A solution was given by Billard and Diday (2006).
The model fits a linear regression line through the mixture of the n bivariate distributions.
Given a point (punctual) value of X, it is possible to predict a point value of Y.
Regression between histograms: our approach
Given a histogram description for X, we search for a linear transformation of the description which allows us to predict the histogram description of Y.
For example: given the temperature histogram observed in a region during a month, is it possible to predict the distribution of the temperature of another month using a linear transformation of the histogram variable?
A histogram is predicted by a histogram.
Wasserstein distance
We propose to use the Wasserstein-Kantorovich metric in the least squares function, in particular the derived $L_2$ Mallows distance between two quantile functions:
\[ d_W(x_i, x_j) = \sqrt{\int_0^1 \left( F_i^{-1}(t) - F_j^{-1}(t) \right)^2 dt} \]
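As a rough numerical sketch of this distance for two histograms, the quantile function of each histogram can be evaluated on a grid of $t$ values and the integral approximated by a midpoint rule. The assumptions here are not from the slides: observations are taken as uniformly spread within each bin (so the quantile function is piecewise linear), all bin weights are positive, and the helper names `histogram_quantile` and `wasserstein_l2` are hypothetical.

```python
import numpy as np

def histogram_quantile(edges, weights, t):
    """Quantile function F^{-1}(t) of a histogram, assuming uniformity within bins."""
    edges = np.asarray(edges, dtype=float)
    cdf = np.concatenate(([0.0], np.cumsum(weights)))  # CDF value at each bin edge
    cdf /= cdf[-1]                                     # guard against weights not summing to 1
    return np.interp(t, cdf, edges)                    # piecewise-linear inverse of the CDF

def wasserstein_l2(hist_a, hist_b, grid_size=10_000):
    """Midpoint-rule approximation of the L2 Wasserstein distance."""
    t = (np.arange(grid_size) + 0.5) / grid_size       # midpoints of a regular grid on [0, 1]
    qa = histogram_quantile(*hist_a, t)
    qb = histogram_quantile(*hist_b, t)
    return float(np.sqrt(np.mean((qa - qb) ** 2)))

# Two one-bin histograms: uniform on [0, 1] and uniform on [1, 2].
shifted = wasserstein_l2(([0.0, 1.0], [1.0]), ([1.0, 2.0], [1.0]))
# Uniform on [0, 1] against uniform on [0, 2]: the exact value is sqrt(1/3).
stretched = wasserstein_l2(([0.0, 1.0], [1.0]), ([0.0, 2.0], [1.0]))
```

In the first example the two quantile functions differ by the constant 1, so the distance is a pure shift.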
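An interpretative decomposition of the $L_2$ Wasserstein metric
\[ d_W^2(x_i, x_j) := \int_0^1 \left( F_i^{-1}(t) - F_j^{-1}(t) \right)^2 dt = \underbrace{\left( \bar{x}_i - \bar{x}_j \right)^2}_{\text{Location}} + \underbrace{\left( \sigma_{x_i} - \sigma_{x_j} \right)^2}_{\text{Size}} + \underbrace{2 \sigma_{x_i} \sigma_{x_j} \left( 1 - \rho(x_i, x_j) \right)}_{\text{Shape}} \]
[Figure: QQ plot of the two distributions]
If the two distributions have the same shape:
\[ d_W^2(x_i, x_j) = \underbrace{\left( \bar{x}_i - \bar{x}_j \right)^2}_{\text{Location}} + \underbrace{\left( \sigma_{x_i} - \sigma_{x_j} \right)^2}_{\text{Size}} \]
If they have the same size and shape:
\[ d_W^2(x_i, x_j) = \underbrace{\left( \bar{x}_i - \bar{x}_j \right)^2}_{\text{Location}} \]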
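The decomposition can be checked numerically. The sketch below assumes that each quantile function is represented by its values on a common regular grid of $t$ (here simply two sorted samples of equal size, drawn from arbitrary distributions chosen for illustration); the function name `decompose_wasserstein` is hypothetical.

```python
import numpy as np

def decompose_wasserstein(qa, qb):
    """Split the squared L2 Wasserstein distance into location, size and shape.

    qa and qb hold two quantile functions evaluated on the same grid of t.
    """
    qa = np.asarray(qa, dtype=float)
    qb = np.asarray(qb, dtype=float)
    sa, sb = qa.std(), qb.std()                    # standard deviations (ddof=0)
    rho = np.mean((qa - qa.mean()) * (qb - qb.mean())) / (sa * sb)
    location = (qa.mean() - qb.mean()) ** 2
    size = (sa - sb) ** 2
    shape = 2.0 * sa * sb * (1.0 - rho)
    return float(location), float(size), float(shape)

rng = np.random.default_rng(0)
qa = np.sort(rng.normal(10.0, 2.0, 500))           # sorted sample = empirical quantile function
qb = np.sort(rng.exponential(3.0, 500))

location, size, shape = decompose_wasserstein(qa, qb)
total = float(np.mean((qa - qb) ** 2))             # squared distance computed directly

# A pure shift changes only the location component.
shift_location, shift_size, shift_shape = decompose_wasserstein(qa, qa + 5.0)
```

The three components add up to `total`, and for the shifted copy only the location term is non-zero.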
Some simplifications and notations
Quantile function of the i-th macro-unit $x_i$ (histogram/distribution data):
\[ x_i(t) = F_i^{-1}(t) \]
Mean and variance of the distribution/histogram data:
\[ \bar{x}_i = \int_0^1 x_i(t)\,dt \quad \text{and} \quad \sigma_{x_i}^2 = \int_0^1 \left( x_i(t) - \bar{x}_i \right)^2 dt = \int_0^1 x_i^2(t)\,dt - \bar{x}_i^2 \]
Average distribution/histogram data:
\[ \bar{x}(t) = \frac{1}{n} \sum_{i=1}^{n} x_i(t), \quad t \in [0,1]; \qquad \bar{x} = \int_0^1 \bar{x}(t)\,dt = \frac{1}{n} \sum_{i=1}^{n} \bar{x}_i = \frac{1}{n} \sum_{i=1}^{n} \int_0^1 x_i(t)\,dt \]
Correlation between a pair of distribution/histogram data $(x_i, x_j)$:
\[ \rho(x_i, x_j) = \frac{\int_0^1 x_i(t)\, x_j(t)\,dt - \bar{x}_i \bar{x}_j}{\sigma_{x_i} \sigma_{x_j}} \]
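For a histogram, the mean and standard deviation defined above have a closed form if one assumes, as an illustration and not as something stated on the slide, that the mass of each bin is uniformly spread over it. With bin center $c_h$ and half-width $r_h$, each bin contributes $\pi_h c_h$ to the mean and $\pi_h (c_h^2 + r_h^2/3)$ to the second moment. The function name `histogram_mean_std` is hypothetical.

```python
import numpy as np

def histogram_mean_std(edges, weights):
    """Mean and standard deviation of a histogram, assuming uniformity within bins."""
    edges = np.asarray(edges, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                                  # normalise the weights pi_h
    centers = (edges[:-1] + edges[1:]) / 2.0         # c_h
    radii = (edges[1:] - edges[:-1]) / 2.0           # r_h
    mean = np.sum(w * centers)
    second_moment = np.sum(w * (centers ** 2 + radii ** 2 / 3.0))
    return float(mean), float(np.sqrt(second_moment - mean ** 2))

# Two equally weighted bins [0, 2) and [2, 4]: the same as a uniform on [0, 4].
mean_x, std_x = histogram_mean_std([0.0, 2.0, 4.0], [0.5, 0.5])
```

For this example the result agrees with the mean and standard deviation of a uniform distribution on $[0, 4]$.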
Fitting with a linear model
Given two variables Y and X, a regression model is proposed here to find the linear transformation of X which best fits Y:
\[ y_i(t) = \underbrace{\alpha + \beta\, x_i(t)}_{\hat{y}_i(t)} + \varepsilon_i(t), \quad t \in [0,1] \]
The error, to be made as close as possible to zero, is:
\[ \varepsilon_i(t) = y_i(t) - \hat{y}_i(t) \]
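The error term in the classic case
Classic case (Euclidean norm):
\[ \varepsilon_i = y_i - \hat{y}_i \quad \Rightarrow \quad \varepsilon_i^2 = \left( y_i - \hat{y}_i \right)^2 = d_E^2\left( y_i, \hat{y}_i \right) \]
[Figure: regression line with the error shown as the gap between the observed $y_i$ and the predicted $\hat{y}_i$ at $x_i$]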
The error term of the model (our approach)
Histogram case (Wasserstein distance):
\[ \varepsilon_i(t) = y_i(t) - \hat{y}_i(t) \quad \Rightarrow \quad \varepsilon_i^2(t) = \left( y_i(t) - \hat{y}_i(t) \right)^2, \quad t \in [0,1] \]
\[ \int_0^1 \left( y_i(t) - \hat{y}_i(t) \right)^2 dt = d_W^2\left( y_i, \hat{y}_i \right) \]
[Figure: observed $y_i(t)$ and predicted $\hat{y}_i(t)$ for a given $x_i(t)$; the (squared) error is $d_W^2(y_i, \hat{y}_i)$]
Fitting a linear model: histograms
We propose to find a linear transformation of the quantile function of $x_i$ (histogram data) in order to predict the quantile function of $y_i$, i.e.:
\[ \hat{y}_i(t) = f(x_i(t)) = \alpha + \beta\, x_i(t), \quad t \in [0,1] \]
It is worth noting that the linear transformation is unique: the same parameters $\alpha$ and $\beta$ are estimated for all the macro-units $(x_i, y_i)$, $i = 1, \ldots, n$.
A first problem: a quantile function $\hat{y}_i(t)$ can be derived only if $\beta > 0$, because a quantile function must be non-decreasing in $t$ and a negative $\beta$ reverses the order of the quantiles of $x_i$.
In order to overcome this problem, we propose a solution based on the decomposition of the Wasserstein distance.
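The problem with a negative slope can be seen on a toy example. The values below are made up for illustration, and `is_quantile_function` is a hypothetical helper that only checks monotonicity on the grid.

```python
import numpy as np

def is_quantile_function(values):
    """True if the values, taken at increasing t, are non-decreasing."""
    return bool(np.all(np.diff(values) >= 0))

# Quantiles of a made-up x_i at increasing levels t.
x_q = np.array([1.0, 2.0, 4.0, 7.0])

valid = is_quantile_function(2.0 + 0.5 * x_q)     # beta > 0 keeps the order
invalid = is_quantile_function(2.0 - 0.5 * x_q)   # beta < 0 reverses the order
```

With $\beta = 0.5$ the transformed values are still increasing, while with $\beta = -0.5$ they decrease, so they cannot be the quantiles of any distribution.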
Solution to $\beta < 0$
The quantile function can be decomposed as:
\[ x_i(t) = \bar{x}_i + x_i^c(t), \quad \text{where } x_i^c(t) = x_i(t) - \bar{x}_i \text{ is the centered quantile function.} \]
Then, we propose the following model:
\[ y_i(t) = \alpha + \beta_1 \bar{x}_i + \beta_2\, x_i^c(t) + \varepsilon_i(t) \]
Using the Wasserstein distance, it is possible to set up an OLS method that returns three coefficients. We demonstrate that $\beta_2$ is always greater than or equal to zero.
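A minimal sketch of how such a fit could be computed follows. The estimators used here are not quoted from the slides: they are derived under the assumption that all quantile functions are sampled on a common regular grid of $t$, in which case the least squares criterion separates into an ordinary regression on the means and a no-intercept regression on the centered quantile functions. The function name `fit_centered_model` and the simulated data are hypothetical.

```python
import numpy as np

def fit_centered_model(X, Y):
    """Least squares fit of y_i(t) = alpha + beta1 * mean(x_i) + beta2 * x_i^c(t).

    X and Y have shape (n, m): one row per macro-unit, each row holding a
    quantile function on the same grid of m values of t (rows non-decreasing).
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    x_bar, y_bar = X.mean(axis=1), Y.mean(axis=1)       # means of each unit
    Xc, Yc = X - x_bar[:, None], Y - y_bar[:, None]     # centered quantile functions

    # Part 1: ordinary simple regression of the means of Y on the means of X.
    dx = x_bar - x_bar.mean()
    beta1 = np.sum(dx * (y_bar - y_bar.mean())) / np.sum(dx ** 2)
    alpha = y_bar.mean() - beta1 * x_bar.mean()

    # Part 2: no-intercept regression on the centered quantile functions.
    # Both Xc and Yc rows are non-decreasing, so the numerator is >= 0.
    beta2 = np.sum(Xc * Yc) / np.sum(Xc ** 2)
    return float(alpha), float(beta1), float(beta2)

rng = np.random.default_rng(1)
centers = rng.uniform(0.0, 10.0, size=(20, 1))
X = np.sort(rng.normal(loc=centers, scale=1.0, size=(20, 50)), axis=1)
x_bar = X.mean(axis=1, keepdims=True)
Y = 1.0 - 2.0 * x_bar + 0.5 * (X - x_bar)               # noise-free data from the model

alpha, beta1, beta2 = fit_centered_model(X, Y)
```

In this simulated case the means of Y decrease as the means of X increase ($\beta_1 = -2$), yet every predicted $\hat{y}_i(t)$ is still a valid quantile function because $\beta_2$ is positive.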
The error term: a property of the Wasserstein distance decomposition
The (squared) error can be written according to two components:
\[ d_W^2\left( y_i, \hat{y}_i \right) = \int_0^1 \left( y_i(t) - \hat{y}_i(t) \right)^2 dt = \left( \bar{y}_i - \bar{\hat{y}}_i \right)^2 + d_W^2\left( y_i^c, \hat{y}_i^c \right) \]
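This two-component form can be verified numerically. The sketch assumes the observed and predicted quantile functions are available on the same regular grid of $t$; the two sorted samples below are arbitrary stand-ins, and `squared_error_parts` is a hypothetical name.

```python
import numpy as np

def squared_error_parts(y_q, y_hat_q):
    """Split the squared L2 Wasserstein error into a mean part and a centered part."""
    y_q = np.asarray(y_q, dtype=float)
    y_hat_q = np.asarray(y_hat_q, dtype=float)
    mean_part = (y_q.mean() - y_hat_q.mean()) ** 2
    y_c = y_q - y_q.mean()                       # centered observed quantile function
    y_hat_c = y_hat_q - y_hat_q.mean()           # centered predicted quantile function
    centered_part = np.mean((y_c - y_hat_c) ** 2)
    return float(mean_part), float(centered_part)

rng = np.random.default_rng(2)
y_q = np.sort(rng.gamma(2.0, 1.5, 400))          # stand-in for an observed y_i(t)
y_hat_q = np.sort(rng.normal(3.0, 1.0, 400))     # stand-in for a predicted y_i(t)

mean_part, centered_part = squared_error_parts(y_q, y_hat_q)
total = float(np.mean((y_q - y_hat_q) ** 2))     # squared error computed directly
```

The cross term vanishes because the centered functions integrate to zero, which is why the two parts add up to `total`.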