clustering compositional data trajectories
Clustering compositional data trajectories, F. Greco, F. Bruno


  1. Clustering compositional data trajectories. F. Greco, F. Bruno. Dipartimento di Scienze Statistiche, Università di Bologna

  2. Outline: We deal with trajectories of compositional data, that is, sequences of composition measurements taken along a domain. Observed trajectories are known as “functional data”. The problem of clustering compositional data trajectories is addressed. A procedure for clustering functional data can be summarised as follows:
     • smooth the curves in order to remove measurement errors;
     • choose a metric to evaluate dissimilarity among the considered objects;
     • apply a clustering algorithm and evaluate the quality of the obtained partition.

  3. Some notation: An observed compositional data trajectory can be seen as a set of measurements taken along a domain $x \in [x_{\min}; x_{\max}]$ (for example altitude, depth, time). The complete data matrix for a compositional trajectory is denoted as $D = [x_{\bullet}, p_{\bullet\bullet}]$. The generic $t$-th row is denoted as $[x_t, p_{\bullet t}]$ ($t = 1, \dots, T$) and contains the $C$-dimensional composition vector $p_{\bullet t} = (p_{1t}, p_{2t}, \dots, p_{Ct})$ observed at $x_t$.

  4. Smoothing compositional data trajectories: Steps proposed as a strategy for obtaining smoothed compositional trajectories:
     • apply the additive log-ratio (alr) transformation to the observed compositions: $z = \mathrm{alr}(p) = \left( \log\frac{p_1}{p_C}, \log\frac{p_2}{p_C}, \dots, \log\frac{p_{C-1}}{p_C} \right)$;
     • smooth the transformed data trajectories by means of the usual smoothing techniques (B-spline, P-spline, cubic spline, etc.);
     • in order to obtain smoothed compositions, apply the inverse alr transformation to the smoothed transformed data.
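The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides do not say how the spline smoother is configured, so a centred moving average is swapped in as a stand-in smoother (it is not a spline), and the function names and the small trajectory `P` are hypothetical.

```python
import math

def alr(p):
    """Additive log-ratio: log(p_j / p_C) for j = 1..C-1."""
    return [math.log(pj / p[-1]) for pj in p[:-1]]

def alr_inv(z):
    """Inverse alr: map C-1 real coordinates back to a C-part composition."""
    expz = [math.exp(zj) for zj in z] + [1.0]
    total = sum(expz)
    return [e / total for e in expz]

def moving_average(rows, half_window=1):
    """Stand-in smoother (NOT a spline): centred moving average per column."""
    T = len(rows)
    out = []
    for t in range(T):
        lo, hi = max(0, t - half_window), min(T, t + half_window + 1)
        window = rows[lo:hi]
        out.append([sum(col) / len(window) for col in zip(*window)])
    return out

def smooth_trajectory(P, half_window=1):
    """alr-transform, smooth in alr coordinates, then back-transform."""
    Z = [alr(p) for p in P]
    Z_hat = moving_average(Z, half_window)
    return [alr_inv(z) for z in Z_hat]

# Hypothetical trajectory: T = 4 measurements of a C = 3 part composition.
P = [[0.70, 0.22, 0.08],
     [0.74, 0.19, 0.07],
     [0.71, 0.21, 0.08],
     [0.76, 0.18, 0.06]]
P_hat = smooth_trajectory(P)
```

Because the smoothing happens in alr coordinates and the result is mapped back with the inverse alr, every smoothed row is again a valid composition (positive parts summing to one), whichever smoother is used in the middle step.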

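  5. Smoothing compositional data trajectories (scheme with $C = 3$ parts):
     $$\begin{pmatrix} p_{11} & p_{21} & p_{31} \\ p_{12} & p_{22} & p_{32} \\ \vdots & \vdots & \vdots \\ p_{1t} & p_{2t} & p_{3t} \\ \vdots & \vdots & \vdots \\ p_{1T} & p_{2T} & p_{3T} \end{pmatrix} \xrightarrow{\;\mathrm{alr}\;} \begin{pmatrix} z_{11} & z_{21} \\ z_{12} & z_{22} \\ \vdots & \vdots \\ z_{1t} & z_{2t} \\ \vdots & \vdots \\ z_{1T} & z_{2T} \end{pmatrix} \xrightarrow{\; z_{it} = \hat{z}_{it} + \varepsilon_{it} \;} \begin{pmatrix} \hat{z}_{11} & \hat{z}_{21} \\ \hat{z}_{12} & \hat{z}_{22} \\ \vdots & \vdots \\ \hat{z}_{1t} & \hat{z}_{2t} \\ \vdots & \vdots \\ \hat{z}_{1T} & \hat{z}_{2T} \end{pmatrix} \xrightarrow{\;\mathrm{alr}^{-1}\;} \begin{pmatrix} \hat{p}_{11} & \hat{p}_{21} & \hat{p}_{31} \\ \hat{p}_{12} & \hat{p}_{22} & \hat{p}_{32} \\ \vdots & \vdots & \vdots \\ \hat{p}_{1t} & \hat{p}_{2t} & \hat{p}_{3t} \\ \vdots & \vdots & \vdots \\ \hat{p}_{1T} & \hat{p}_{2T} & \hat{p}_{3T} \end{pmatrix}$$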

  6. Functional Data Analysis (1): When considering $K$ trajectories, the data matrix referred to the $k$-th trajectory is defined as $D_k = [x_{\bullet k}, p_{\bullet\bullet k}]$, where $p_{\bullet t k}$ is the $C$-dimensional compositional vector at time $t$ for trajectory $k$ and $p_{\bullet\bullet k}$ is the $T_k \times C$ matrix containing the data measured for the $k$-th trajectory. Smoothed compositional trajectories $\hat{p}_{\bullet\bullet k}$, for $k = 1, \dots, K$, are obtained following the steps described before. Several approaches in functional cluster analysis are based on measuring the differences in observed curves by evaluating differences in the spline coefficients $\hat{\beta}_k = (\hat{\beta}_{1k}, \hat{\beta}_{2k}, \dots, \hat{\beta}_{Tk})$.

  7. Functional Data Analysis (2): This approach is effective only if the same degree and vector of knots, as well as the same basis functions, are used. Measurements might be taken at different values of the predictor variable (misalignment), and the quantities $\min(x_{\bullet k})$ and $\max(x_{\bullet k})$ can vary considerably among trajectories. For this reason we prefer a more flexible approach based on different knot placement and a different amount of smoothing for each trajectory.

  8. Construction of the metric: Given two generic functions $f$ and $g$, a measure of the distance between them in the interval $[x_{\min}; x_{\max}]$ is the integral
     $$d(f, g) = \int_{x_{\min}}^{x_{\max}} \| f(x) - g(x) \| \, dx,$$
     where $\|\cdot\|$ indicates a norm. This integral can be evaluated via Monte Carlo integration by averaging point-to-point distances on a regular grid in the interval $[x_{\min}; x_{\max}]$ as follows:
     $$d(f, g) \cong n^{-1} \sum_{i=1}^{n} \| f(x_i) - g(x_i) \|.$$
     Suitable differences and norms have to be adopted in the simplex.
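As a quick numerical check of the grid approximation, here is a minimal sketch for real-valued functions, with the absolute value as the norm. The function name, the midpoint grid and the test functions are assumptions made for illustration. Note that the plain average approximates the integral divided by the interval length $x_{\max} - x_{\min}$; when all trajectories share the same interval this constant factor is the same for every pair and does not change the ranking of distances.

```python
def grid_average_distance(f, g, x_min, x_max, n=1000, norm=abs):
    """Average of point-to-point distances on a regular grid of n midpoints."""
    h = (x_max - x_min) / n
    grid = [x_min + (i + 0.5) * h for i in range(n)]
    return sum(norm(f(x) - g(x)) for x in grid) / n

# Hypothetical example: f(x) = x and g(x) = 0 on [0, 2].
f = lambda x: x
g = lambda x: 0.0
d = grid_average_distance(f, g, 0.0, 2.0)

# Rescaling by the interval length recovers the integral of |f - g|.
integral = d * (2.0 - 0.0)
```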

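  9. Construction of the metric: Given two compositions $q = (q_1, q_2, \dots, q_C)$ and $w = (w_1, w_2, \dots, w_C)$, their difference is evaluated as
     $$m = q \ominus w = \Gamma\left[ \frac{q_1}{w_1}, \frac{q_2}{w_2}, \dots, \frac{q_C}{w_C} \right] = \left[ \frac{q_1 / w_1}{\sum_{i=1}^{C} q_i / w_i}, \frac{q_2 / w_2}{\sum_{i=1}^{C} q_i / w_i}, \dots, \frac{q_C / w_C}{\sum_{i=1}^{C} q_i / w_i} \right],$$
     where $\Gamma$ denotes closure (rescaling to unit sum). The distance $d(q, w)$ is defined as the norm of the difference:
     $$d(q, w) = \| m \| = \sqrt{ \mathrm{alr}(m)' \, \Psi^{-1} \, \mathrm{alr}(m) }, \quad \text{where } \Psi^{-1} = \left[ I_{C-1} - \frac{1}{C} \, j_{C-1} j_{C-1}' \right]$$
     and $j_{C-1}$ is a vector of $C - 1$ ones.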
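The difference and the norm on this slide can be sketched as follows. The function names are hypothetical. The quadratic form is expanded by hand using the identity $z' (I - \tfrac{1}{C} j j') z = \sum_i z_i^2 - \tfrac{1}{C} (\sum_i z_i)^2$, and the square root is taken so that the result is a norm.

```python
import math

def closure(v):
    """Rescale a vector of positive parts so that it sums to one."""
    s = sum(v)
    return [x / s for x in v]

def perturbation_difference(q, w):
    """m = closure(q_1 / w_1, ..., q_C / w_C)."""
    return closure([qi / wi for qi, wi in zip(q, w)])

def alr(p):
    """Additive log-ratio with the last part as reference."""
    return [math.log(pj / p[-1]) for pj in p[:-1]]

def simplex_norm(m):
    """sqrt( alr(m)' (I - jj'/C) alr(m) ), expanded without matrices."""
    z = alr(m)
    C = len(m)
    quad = sum(zj * zj for zj in z) - sum(z) ** 2 / C
    return math.sqrt(max(quad, 0.0))  # guard against tiny negative round-off

def distance(q, w):
    """Distance between two compositions: norm of their difference."""
    return simplex_norm(perturbation_difference(q, w))

# Hypothetical compositions with C = 3 parts.
q = [0.70, 0.22, 0.08]
w = [0.60, 0.30, 0.10]
d_qw = distance(q, w)
```

The distance is symmetric, zero when the two compositions coincide, and unchanged if either argument is rescaled by a positive constant, since only ratios of parts enter the computation.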

  10. A distance between trajectories in the simplex (1): Predicted values $\tilde{p}_{\bullet\bullet l}$ and $\tilde{p}_{\bullet\bullet k}$ on a grid are obtained. Unlike $p_{\bullet\bullet l}$ and $p_{\bullet\bullet k}$, these predicted values are aligned on the grid $x_i$, $i = 1, \dots, n$. The distance between trajectories $k$ and $l$ is measured as
     $$d(l, k) \cong n^{-1} \sum_{i=1}^{n} \| \tilde{p}_{\bullet i l} - \tilde{p}_{\bullet i k} \|,$$
     with the difference and the norm taken in the simplex, as defined in slide 9. The distance matrix $D$ with generic entry $D_{lk} = d(l, k)$ is finally obtained. Starting from this matrix, alternative clustering algorithms can be adopted.
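A minimal sketch of how the matrix $D$ could be assembled, assuming the predicted values of every trajectory are already available on the same grid. The point-to-point distance is the simplex distance of slide 9. All names and the three small trajectories are hypothetical.

```python
import math

def closure(v):
    s = sum(v)
    return [x / s for x in v]

def simplex_distance(q, w):
    """Norm of the simplex difference of q and w, via alr coordinates."""
    m = closure([qi / wi for qi, wi in zip(q, w)])
    z = [math.log(mj / m[-1]) for mj in m[:-1]]
    quad = sum(zj * zj for zj in z) - sum(z) ** 2 / len(m)
    return math.sqrt(max(quad, 0.0))

def trajectory_distance(P_l, P_k):
    """Average point-to-point simplex distance over a shared grid."""
    assert len(P_l) == len(P_k), "trajectories must be aligned on the same grid"
    return sum(simplex_distance(a, b) for a, b in zip(P_l, P_k)) / len(P_l)

def distance_matrix(trajectories):
    """Symmetric K x K matrix with entries d(l, k) and a zero diagonal."""
    K = len(trajectories)
    D = [[0.0] * K for _ in range(K)]
    for l in range(K):
        for k in range(l + 1, K):
            D[l][k] = D[k][l] = trajectory_distance(trajectories[l], trajectories[k])
    return D

# Hypothetical predicted values: K = 3 trajectories on a grid of n = 3 points.
trajectories = [
    [[0.70, 0.22, 0.08], [0.72, 0.20, 0.08], [0.75, 0.18, 0.07]],
    [[0.71, 0.21, 0.08], [0.73, 0.20, 0.07], [0.74, 0.19, 0.07]],
    [[0.60, 0.30, 0.10], [0.58, 0.31, 0.11], [0.55, 0.33, 0.12]],
]
D = distance_matrix(trajectories)
```

In this toy example the first two trajectories are close to each other and both are far from the third, which is what a clustering algorithm applied to `D` would be expected to pick up.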

  11. Shape and level (center) [Figure]

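  12. A distance between trajectories in the simplex (2): The center of a curve in the simplex is captured by its geometric mean. Thus, for the predicted values $\tilde{p}_{\bullet\bullet k}$, $k = 1, \dots, K$, the centered trajectories are obtained as
     $$\tilde{c}_{\bullet\bullet k} = \tilde{p}_{\bullet\bullet k} \ominus \tilde{g}_{k}, \qquad \tilde{g}_{k} = \left( \prod_{i=1}^{n} \tilde{p}_{\bullet i k} \right)^{n^{-1}},$$
     where $\ominus$ is the difference in the simplex and $\tilde{g}_{k}$ is the geometric mean of the predicted values for trajectory $k$. Distances between centered trajectories:
     $$d^{*}(l, k) \cong n^{-1} \sum_{i=1}^{n} \| \tilde{c}_{\bullet i l} - \tilde{c}_{\bullet i k} \|.$$
     We obtain the distance matrix $D^{*}$, where the generic entry is $D^{*}_{lk} = d^{*}(l, k)$.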
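The centering step can be illustrated with a short sketch. The names and data are hypothetical, and closing the geometric mean to unit sum is an assumption (it is harmless here, because the simplex difference re-closes its result anyway). The example checks the property that motivates centering: two trajectories with the same shape but a different level end up with identical centered versions.

```python
import math

def closure(v):
    s = sum(v)
    return [x / s for x in v]

def geometric_mean(P):
    """Component-wise geometric mean of the n compositions in P, closed."""
    n = len(P)
    C = len(P[0])
    return closure([math.exp(sum(math.log(p[j]) for p in P) / n) for j in range(C)])

def center(P):
    """Subtract (in the simplex) the geometric mean from every grid point."""
    g = geometric_mean(P)
    return [closure([pj / gj for pj, gj in zip(p, g)]) for p in P]

# Hypothetical predicted trajectory on a grid of n = 3 points.
P = [[0.70, 0.22, 0.08], [0.72, 0.20, 0.08], [0.75, 0.18, 0.07]]

# Same shape, different level: perturb every point by a fixed composition.
shift = [0.5, 0.3, 0.2]
P_shifted = [closure([pj * sj for pj, sj in zip(p, shift)]) for p in P]

C_orig = center(P)
C_shift = center(P_shifted)  # coincides with C_orig up to round-off
```

After centering, the geometric mean of each trajectory is the neutral composition with all parts equal to $1/C$, so $d^{*}$ compares shapes only, while $d$ also reflects differences in level.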

  13. Clustering algorithms: Two clustering algorithms are applied and compared:
     • Hierarchical clustering: Ward algorithm;
     • Partitional clustering: k-medoids (appealing because a representative object in the cluster can be identified).
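To show why k-medoids fits naturally with a precomputed distance matrix, here is a minimal sketch of a simple alternating scheme (assign each object to its nearest medoid, then pick as new medoid the member with the smallest total distance to its cluster). This is an assumption for illustration and not necessarily the variant used by the authors; Ward clustering is not shown because it would normally rely on a library routine. The function name, the initial medoids and the toy matrix are hypothetical.

```python
def k_medoids(D, k, init=None, max_iter=100):
    """Simple alternating k-medoids on a precomputed distance matrix D."""
    n = len(D)
    medoids = list(init) if init is not None else list(range(k))
    for _ in range(max_iter):
        # Assignment step: nearest medoid for every object.
        labels = [min(range(k), key=lambda c: D[i][medoids[c]]) for i in range(n)]
        # Update step: within each cluster, pick the most central member.
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:  # keep the old medoid if a cluster empties
                new_medoids.append(medoids[c])
                continue
            best = min(members, key=lambda m: sum(D[m][j] for j in members))
            new_medoids.append(best)
        if new_medoids == medoids:
            break
        medoids = new_medoids
    labels = [min(range(k), key=lambda c: D[i][medoids[c]]) for i in range(n)]
    return medoids, labels

# Toy distance matrix: six objects on a line at 0, 1, 2, 10, 11, 12.
xs = [0, 1, 2, 10, 11, 12]
D = [[abs(a - b) for b in xs] for a in xs]
medoids, labels = k_medoids(D, k=2, init=[0, 1])
# medoids -> [1, 4], labels -> [0, 0, 0, 1, 1, 1]
```

The returned medoids are indices of actual objects, so with trajectory data each cluster is summarised by one observed trajectory rather than by an averaged curve.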

  14. Data – Particulate matter vertical profiles: Particulate vertical profiles measured along height (71 launches in the winter period). We consider three compositional classes: 0.3-0.4; 0.4-0.5; 0.5-1.6.

  15. Trajectory – matrix D – Class 1 [Figure: axes Standardised Height and 1st composition]

  16. Trajectory – matrix D – Class 2 [Figure: axes Standardised Height and 2nd composition]

  17. Trajectory – matrix D – Class 3 [Figure: axes Standardised Height and 3rd composition]

  18. Trajectory – matrix D* – Class 1 [Figure: axes Standardised Height and 1st composition]

  19. Trajectory – matrix D* – Class 2 [Figure: axes Standardised Height and 2nd composition]

  20. Trajectory – matrix D* – Class 3 [Figure: axes Standardised Height and 3rd composition]

  21. Trajectories within the Cluster – matrix D – Class 1 [Figure, three panels: axes Height and 1st Composition]

  22. Trajectories within the Cluster – matrix D – Class 2 [Figure, three panels: axes Height and 2nd Composition]

  23. Trajectories within the Cluster – matrix D – Class 3 [Figure, three panels: axes Height and 3rd Composition]

  24. Trajectories within the Cluster – matrix D* – Class 1 [Figure, three panels: axes Height and 1st Composition]

  25. Trajectories within the Cluster – matrix D* – Class 2 [Figure, three panels: axes Height and 2nd Composition]
