Sequential data analysis - 2
Dissimilarities among pairs of state sequences
8/7/2009gr

Measures of dissimilarity between sequences

Example on the first 6 mvad sequences: non-normalized LCP distance

R> seqdist(mvad.seq[1:6, ], method = "LCP", norm = FALSE)
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    0  140  140  140  140  140
[2,]  140    0  140  140   90  140
[3,]  140  140    0   92  140  140
[4,]  140  140   92    0  140  140
[5,]  140   90  140  140    0  140
[6,]  140  140  140  140  140    0

Example on the first 6 mvad sequences: normalized LCP distance

R> seqdist(mvad.seq[1:6, ], method = "LCP", norm = TRUE)
     [,1]      [,2]      [,3]      [,4]      [,5] [,6]
[1,]    0 1.0000000 1.0000000 1.0000000 1.0000000    1
[2,]    1 0.0000000 1.0000000 1.0000000 0.6428571    1
[3,]    1 1.0000000 0.0000000 0.6571429 1.0000000    1
[4,]    1 1.0000000 0.6571429 0.0000000 1.0000000    1
[5,]    1 0.6428571 1.0000000 1.0000000 0.0000000    1
[6,]    1 1.0000000 1.0000000 1.0000000 1.0000000    0

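The LCP values can be checked by hand on small sequences. The following base R sketch assumes that the LCP distance is defined by analogy with the LCS distance, d = |x| + |y| - 2 LCP(x, y), which agrees with the value 140 printed for two 70-month sequences without a common prefix. The helper names lcp_length and lcp_dist are hypothetical, and the normalization by |x| + |y| is an assumption that matches the printed output for equal-length sequences (90 / 140 = 0.6428571).

```r
# Length of the longest common prefix of two state vectors.
lcp_length <- function(x, y) {
  n <- min(length(x), length(y))
  if (n == 0) return(0)
  mismatch <- which(x[seq_len(n)] != y[seq_len(n)])
  if (length(mismatch) == 0) n else mismatch[1] - 1
}

# Assumed definition: d = |x| + |y| - 2 * LCP(x, y).
# Assumed normalization: division by |x| + |y|.
lcp_dist <- function(x, y, norm = FALSE) {
  d <- length(x) + length(y) - 2 * lcp_length(x, y)
  if (norm) d / (length(x) + length(y)) else d
}

x <- c("a", "a", "b", "c", "c")
y <- c("a", "a", "b", "b", "c")
z <- c("b", "a", "b", "c", "c")

lcp_dist(x, y)               # common prefix a-a-b: 5 + 5 - 2 * 3 = 4
lcp_dist(x, z)               # no common prefix: 5 + 5 = 10
lcp_dist(x, y, norm = TRUE)  # 4 / 10 = 0.4
```

Because only the prefix counts, x and z are at maximal distance although they differ in their first state only.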
LCS: Longest Common Subsequences

LLCS: length of the longest common subsequence shared by two sequences, written $A_\ell(x, y)$ below.

Example:
  x : 1-1-1-2-2-3-3
  y : 1-1-1-4-3-3-4
  LLCS = 5 (the common subsequence 1-1-1-3-3)

Distance measure:
\[ d_{LCS}(x, y) = A_\ell(x, x) + A_\ell(y, y) - 2 A_\ell(x, y) \]
For the example, $d_{LCS}(x, y) = 7 + 7 - 2 \cdot 5 = 4$.

Normalized form:
\[ D_{LCS}(x, y) = 1 - \frac{A_\ell(x, y)}{\sqrt{|x| \cdot |y|}} \]
For the example, $D_{LCS}(x, y) = 1 - 5/7 \approx 0.286$.

LLCS: example

R> x <- c(1, 1, 1, 2, 2, 3, 3)
R> y <- c(1, 1, 1, 4, 3, 3, 4)
R> seqdist(seqdef(rbind(x, y)), method = "LCS")
     [,1] [,2]
[1,]    0    4
[2,]    4    0
R> seqdist(seqdef(rbind(x, y)), method = "LCS", norm = TRUE)
          [,1]      [,2]
[1,] 0.0000000 0.2857143
[2,] 0.2857143 0.0000000

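The two values returned by seqdist() can be reproduced without TraMineR. The sketch below is a plain dynamic-programming computation of the LLCS in base R; the function name llcs is hypothetical, and the normalized value is computed as 1 - LLCS / sqrt(|x| |y|), the form that reproduces the 0.2857143 printed above.

```r
# Length of the longest common subsequence (LLCS) by dynamic programming.
llcs <- function(x, y) {
  n <- length(x)
  m <- length(y)
  # tab[i + 1, j + 1] holds the LLCS of x[1..i] and y[1..j]
  tab <- matrix(0, n + 1, m + 1)
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      tab[i + 1, j + 1] <- if (x[i] == y[j]) {
        tab[i, j] + 1
      } else {
        max(tab[i, j + 1], tab[i + 1, j])
      }
    }
  }
  tab[n + 1, m + 1]
}

x <- c(1, 1, 1, 2, 2, 3, 3)
y <- c(1, 1, 1, 4, 3, 3, 4)

a_xy  <- llcs(x, y)                              # 5
d_lcs <- llcs(x, x) + llcs(y, y) - 2 * a_xy      # 7 + 7 - 10 = 4
D_lcs <- 1 - a_xy / sqrt(length(x) * length(y))  # 1 - 5/7
```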
Optimal matching (optimal alignment)

- Based on the edit distance of Levenshtein (1966).
- Inspired by the alignment methods used in biology (DNA or protein sequences).
- Introduced in the social sciences by Abbott and Forrest (1986).

Optimal matching (OM): principle

The aim is to transform one sequence into the other, using two types of operations:
- insertion or deletion (indel) of an element;
- substitution of an element.

Each operation has a cost. The OM distance is the minimal total cost of transforming one sequence into the other.

OM: example

Consider the two sequences:
  Sequence 1:  0 1 1 2 2 2 3
  Sequence 2:  0 1 1 2 2 3 3

Insertion of element '2' in sequence 2:
  Sequence 1:  0 1 1 2 2 2 3
  Sequence 2:  0 1 1 2 2 2 3 3

Deletion of element '3' in sequence 2:
  Sequence 1:  0 1 1 2 2 2 3
  Sequence 2:  0 1 1 2 2 2 3

The two sequences are now identical.

OM: substitution example

Consider the two sequences:
  Sequence 1:  0 1 1 2 2 2 3
  Sequence 2:  0 1 1 2 2 3 3

Substitution of the first '3' of sequence 2 by element '2':
  Sequence 1:  0 1 1 2 2 2 3
  Sequence 2:  0 1 1 2 2 2 3

Assigning indel and substitution costs

Indel cost: the same cost applies to every insertion or deletion, so the indel cost is a single constant.

Substitution costs: each substitution may receive a different cost, which gives a matrix of substitution costs. The costs must however be symmetric: $c_{i,j} = c_{j,i}$.

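The OM principle (minimal total cost of indels and substitutions) can be sketched as a small dynamic program in base R. This is only an illustration under stated assumptions, not TraMineR's implementation: the function name om_dist is hypothetical, and the constant substitution cost of 2 and the two indel values are chosen for the example only.

```r
# OM distance: minimal total cost of indels and substitutions turning x into y.
# `sm` is a symmetric substitution-cost matrix whose dimnames are the states.
om_dist <- function(x, y, indel, sm) {
  x <- as.character(x)
  y <- as.character(y)
  n <- length(x)
  m <- length(y)
  # cost[i + 1, j + 1]: cheapest way to turn x[1..i] into y[1..j]
  cost <- matrix(0, n + 1, m + 1)
  cost[, 1] <- (0:n) * indel   # delete all of x[1..i]
  cost[1, ] <- (0:m) * indel   # insert all of y[1..j]
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost[i + 1, j + 1] <- min(
        cost[i, j + 1] + indel,       # delete x[i]
        cost[i + 1, j] + indel,       # insert y[j]
        cost[i, j] + sm[x[i], y[j]]   # substitute (cost 0 if the states match)
      )
    }
  }
  cost[n + 1, m + 1]
}

# Constant substitution cost of 2 between the four states 0, 1, 2, 3.
states <- as.character(0:3)
sm <- matrix(2, 4, 4, dimnames = list(states, states))
diag(sm) <- 0

s1 <- c(0, 1, 1, 2, 2, 2, 3)
s2 <- c(0, 1, 1, 2, 2, 3, 3)

om_dist(s1, s2, indel = 0.5, sm = sm)  # one insertion + one deletion: 1
om_dist(s1, s2, indel = 4, sm = sm)    # one substitution is cheaper: 2
```

The same pair of sequences thus gets a different distance, and a different optimal alignment, depending on the ratio between indel and substitution costs.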
Defining substitution costs

- Unique cost: $c_{i,j} = c$ for all $i \neq j$ (the user provides $c$).
- Costs based on transition rates (no additional input required):
  $c_{i,j} = c_{j,i} = 2 - p(i_t \mid j_{t-1}) - p(j_t \mid i_{t-1})$
- Custom costs (the user provides the whole cost matrix).
- Learned optimal costs (Gauthier et al., 2008, and their TCOFFEE software).

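The transition-rate formula can be illustrated on a toy example. The sketch below uses base R only; the three short sequences and all variable names are made up for the illustration, and the code is not meant to mirror how seqsubm() is implemented.

```r
# Toy data: three sequences over the states A, B, C (one row per sequence).
seqs <- rbind(
  c("A", "A", "B", "B", "C"),
  c("A", "B", "B", "C", "C"),
  c("A", "A", "A", "B", "C")
)
states <- sort(unique(as.vector(seqs)))

# Count the transitions from the state at t - 1 (rows) to the state at t (columns).
from <- as.vector(seqs[, -ncol(seqs)])
to   <- as.vector(seqs[, -1])
counts <- matrix(0, length(states), length(states),
                 dimnames = list(states, states))
for (k in seq_along(from)) {
  counts[from[k], to[k]] <- counts[from[k], to[k]] + 1
}

# trate[i, j] estimates p(j at t | i at t - 1).
trate <- counts / rowSums(counts)

# c_ij = c_ji = 2 - p(i_t | j_{t-1}) - p(j_t | i_{t-1}), with a zero diagonal.
cost <- 2 - trate - t(trate)
diag(cost) <- 0
round(cost, 2)
```

Here A and B get a cost of 2 - 0.5 - 0 = 1.5, while A and C, which never follow each other, get the maximal cost of 2. Frequent transitions therefore translate into cheap substitutions.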
Using optimal matching in TraMineR

1. Create the state sequence object with seqdef().
2. Get a substitution cost matrix, or compute one with seqsubm().
3. Compute the matrix of OM distances with seqdist(..., method = "OM", indel = ..., sm = ...).

Cost matrix: unique costs

R> subm.unique <- seqsubm(mvad.seq, method = "CONSTANT", cval = 2)
R> subm.unique
     EM-> FE-> HE-> JL-> SC-> TR->
EM->    0    2    2    2    2    2
FE->    2    0    2    2    2    2
HE->    2    2    0    2    2    2
JL->    2    2    2    0    2    2
SC->    2    2    2    2    0    2
TR->    2    2    2    2    2    0

Cost matrix: custom costs

R> subm.custom <- matrix(c(0, 1, 1, 2, 1, 1, 1, 0, 1, 2,
+     1, 2, 1, 1, 0, 3, 1, 2, 2, 2, 3, 0, 3, 1, 1, 1, 1,
+     3, 0, 2, 1, 2, 2, 1, 2, 0), nrow = 6, ncol = 6, byrow = TRUE,
+     dimnames = list(mvad.shortlab, mvad.shortlab))
R> subm.custom
   EM FE HE JL SC TR
EM  0  1  1  2  1  1
FE  1  0  1  2  1  2
HE  1  1  0  3  1  2
JL  2  2  3  0  3  1
SC  1  1  1  3  0  2
TR  1  2  2  1  2  0

Cost matrix: based on transition rates

R> subm.txrate <- seqsubm(mvad.seq, method = "TRATE")
R> subm.txrate
        EM->    FE->    HE->    JL->    SC->    TR->
EM-> 0.00000 1.97008 1.98723 1.95173 1.98536 1.95950
FE-> 1.97008 0.00000 1.99318 1.98266 1.99092 1.99235
HE-> 1.98723 1.99318 0.00000 1.99584 1.98184 1.99949
JL-> 1.95173 1.98266 1.99584 0.00000 1.99385 1.97808
SC-> 1.98536 1.99092 1.98184 1.99385 0.00000 1.99666
TR-> 1.95950 1.99235 1.99949 1.97808 1.99666 0.00000

Computing the distances

Using the substitution cost matrix, we compute the distances.

R> mvad.dist <- seqdist(mvad.seq, method = "OM", indel = 4,
+     sm = subm.custom, norm = TRUE)
R> round(mvad.dist[1:10, 1:10], digits = 2)
      [1]  [2]  [3]  [4]  [5]  [6]  [7]  [8]  [9] [10]
[1]  0.00 1.03 0.86 0.90 1.03 0.47 0.46 0.34 0.27 0.57
[2]  1.03 0.00 1.23 1.93 0.16 1.49 0.57 0.69 1.30 1.37
[3]  0.86 1.23 0.00 1.01 1.39 0.70 1.14 1.20 0.59 1.26
[4]  0.90 1.93 1.01 0.00 1.93 0.46 1.36 1.24 0.63 0.90
[5]  1.03 0.16 1.39 1.93 0.00 1.49 0.64 0.69 1.30 1.37
[6]  0.47 1.49 0.70 0.46 1.49 0.00 0.91 0.80 0.20 0.99
[7]  0.46 0.57 1.14 1.36 0.64 0.91 0.00 0.11 0.73 0.80
[8]  0.34 0.69 1.20 1.24 0.69 0.80 0.11 0.00 0.61 0.69
[9]  0.27 1.30 0.59 0.63 1.30 0.20 0.73 0.61 0.00 0.79
[10] 0.57 1.37 1.26 0.90 1.37 0.99 0.80 0.69 0.79 0.00

Clustering and MDS

Section outline: Dissimilarities among pairs of state sequences
- Measures of dissimilarity between sequences
  - LCP
  - LCS
  - Optimal matching
- Clustering and MDS
  - Cluster analysis
  - Plotting sequences by cluster
  - Multidimensional scaling (MDS)
- Sequence dispersion
- Analysis of sequence discrepancy

Cluster analysis

Once we have a dissimilarity (distance) matrix, we can run any clustering algorithm that accepts such a matrix as input. R offers several possibilities, for instance in the cluster package:
- agnes(): agglomerative nesting, i.e. hierarchical clustering (average, ward, ...).
- diana(): divisive analysis.
- pam(): partitioning around medoids (non-hierarchical and faster, but the number of clusters must be set a priori).

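As a minimal illustration of these functions, the sketch below clusters a toy dissimilarity matrix instead of the mvad distances. The six points and all object names are made up for the example; agnes(), pam(), as.hclust() and cutree() are used with their documented arguments.

```r
library(cluster)  # recommended package distributed with R

# Toy dissimilarity matrix: two well-separated groups of three objects.
pts <- c(0, 1, 2, 10, 11, 12)
d <- as.matrix(dist(pts))

# Hierarchical clustering (Ward) on the dissimilarity matrix.
hc <- agnes(d, diss = TRUE, method = "ward")
cl_hier <- cutree(as.hclust(hc), k = 2)

# Partitioning around medoids; the number of clusters k is fixed a priori.
pm <- pam(d, k = 2, diss = TRUE)
cl_pam <- pm$clustering

table(cl_hier, cl_pam)  # cross-tabulate the two partitions
```

On such clearly separated data both methods should recover the same two groups; on real sequence data the partitions usually differ, and comparing them is a useful robustness check.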
Hierarchical clustering (Ward)

R> library(cluster)
R> mvad.clusterward <- agnes(mvad.dist, diss = TRUE, method = "ward")
R> plot(mvad.clusterward, ask = FALSE, which.plots = 2)

[Figure: dendrogram of the agnes() Ward clustering of mvad.dist. Vertical axis: Height, from 0 to 15. Agglomerative Coefficient = 0.99.]

Warning!

Do not forget to specify the diss = TRUE option. Otherwise (i.e. by default), the functions agnes(), diana(), pam(), ... treat the dissimilarity matrix as a data matrix and first compute the Euclidean distances between its rows.

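The effect of forgetting diss = TRUE can be made visible on a toy matrix. The sketch below is an illustration only (object names are hypothetical): it compares the merge heights obtained with and without the option, and with an explicit Euclidean distance between the rows of the dissimilarity matrix.

```r
library(cluster)

d <- as.matrix(dist(c(0, 1, 2, 10, 11, 12)))  # toy dissimilarity matrix

right <- agnes(d, diss = TRUE, method = "ward")  # d used as dissimilarities
wrong <- agnes(d, method = "ward")               # d treated as a data matrix

# Explicit version of what the default call does: Euclidean distances
# between the rows of d, then clustering of those distances.
rows <- agnes(dist(d), method = "ward")

# Merge heights: the default call matches `rows`, not `right`.
same_as_rows  <- isTRUE(all.equal(sort(wrong$height), sort(rows$height)))
same_as_right <- isTRUE(all.equal(sort(wrong$height), sort(right$height)))
c(same_as_rows = same_as_rows, same_as_right = same_as_right)
```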
Retrieving cluster membership

Select the number of clusters, cut the tree at the chosen level, and store the cluster membership in a vector.

R> mvad.cl3 <- cutree(mvad.clusterward, k = 3)
R> mvad.cl3[1:10]
 [1] 1 2 1 1 2 1 1 1 1 3
R> clust.labels <- c("Employment", "Education", "Jobless")
R> mvad.cl3.factor <- factor(mvad.cl3, levels = c(1, 2,
+     3), labels = clust.labels)

Exploring clusters graphically

Three types of graphics:
1. Transversal distributions with seqdplot()
2. Frequency plots with seqfplot()
3. Individual index-plots with seqiplot()

Required argument: the state sequence object.
Use group = cluster.membership.factor to get the plots by cluster.

Transversal distributions

R> seqdplot(mvad.seq, group = mvad.cl3.factor)

Most frequent sequences

R> seqfplot(mvad.seq, group = mvad.cl3.factor)

Individual sequences

R> seqiplot(mvad.seq, group = mvad.cl3.factor, tlim = 0, border = NA,
+     space = 0)

Sorting sequences for i-plot display

The previous i-plots become clearer if we sort the sequences. Several sort keys are possible:
- distance to the most frequent sequence;
- distance to the centrotype or any other useful reference;
- scores on the first factor of an MDS analysis.

Computing distance to most frequent sequence

Compute, in each cluster, the distances to the most frequent sequence (refseq = 0), here using the custom substitution cost matrix.

R> mvad.distom <- numeric(nrow(mvad))
R> mvad.distom[mvad.cl3 == 1] <- seqdist(mvad.seq[mvad.cl3 ==
+     1, ], refseq = 0, method = "OM", indel = 4, sm = subm.custom)
R> mvad.distom[mvad.cl3 == 2] <- seqdist(mvad.seq[mvad.cl3 ==
+     2, ], refseq = 0, method = "OM", indel = 4, sm = subm.custom)
R> mvad.distom[mvad.cl3 == 3] <- seqdist(mvad.seq[mvad.cl3 ==
+     3, ], refseq = 0, method = "OM", indel = 4, sm = subm.custom)

Sort: distance to most frequent sequence

R> seqiplot(mvad.seq, group = mvad.cl3.factor, tlim = 0, border = NA,
+     space = 0, sortv = mvad.distom)

Sort: first factor of MDS analysis

R> mds1d <- cmdscale(mvad.dist, k = 1)
R> seqiplot(mvad.seq, group = mvad.cl3.factor, tlim = 0, border = NA,
+     space = 0, sortv = mds1d)

Scatterplot (MDS)

Through multidimensional scaling (MDS), we get a scatter plot of the sequences.

R> mds2d <- cmdscale(mvad.dist, k = 2)
R> plot(mds2d, type = "n")
R> points(mds2d[mvad.cl3 == 1, ], pch = 16, col = "red")
R> points(mds2d[mvad.cl3 == 2, ], pch = 16, col = "blue")
R> points(mds2d[mvad.cl3 == 3, ], pch = 16, col = "green")
R> legend("bottomright", fill = c("red", "blue", "green"),
+     legend = clust.labels)

Sequence scatterplot colored by cluster

[Figure: MDS scatterplot of the sequences. Horizontal axis: mds2d[,1] (about -2.0 to 1.0). Vertical axis: mds2d[,2] (about -1.5 to 0.5). Legend: Employment, Education, Jobless.]

Code for scatterplot colored by sex

R> plot(mds2d, type = "n")
R> points(mds2d[mvad$male == "yes", ], pch = 16, col = "red")
R> points(mds2d[mvad$male == "no", ], pch = 23, col = "blue")
R> legend("bottomright", col = c("red", "blue"), pch = c(16,
+     23), legend = c("Men", "Women"))

Sequence scatterplot colored by sex

[Figure: MDS scatterplot of the sequences. Horizontal axis: mds2d[,1] (about -2.0 to 1.0). Vertical axis: mds2d[,2] (about -1.5 to 0.5). Legend: Men, Women.]

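What cmdscale() returns can be checked on a toy configuration where the answer is known. In the sketch below, the five two-dimensional points and all object names are made up; only their pairwise distances are passed to cmdscale(), as is done with mvad.dist.

```r
# Five objects with known 2-D coordinates; MDS only sees their distances.
pts <- cbind(c(0, 1, 3, 6, 10), c(0, 2, 1, 3, 0))
d <- dist(pts)

mds2 <- cmdscale(d, k = 2)  # two coordinates per object, for a scatter plot
mds1 <- cmdscale(d, k = 1)  # first axis only, usable as a sort key

# The 2-D solution reproduces the input distances (up to rotation and shift).
max_error <- max(abs(dist(mds2) - d))

# Ordering of the objects along the first axis, e.g. for sortv in seqiplot().
ord <- order(mds1[, 1])
```

For sequence dissimilarities such as OM distances, a low-dimensional solution is generally only an approximation, so the scatter plot should be read as a summary of the distance matrix rather than an exact map.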
Sequence dispersion

Dispersion of the set of sequences

From the distance matrix, we get the pseudo-variance of the set of sequences. The sum of squares SS can be expressed in terms of the distances between pairs:

\[
SS = \sum_{i=1}^{n} (y_i - \bar{y})^2
   = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=i+1}^{n} (y_i - y_j)^2
   = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=i+1}^{n} d_{ij}
\]

Setting $d_{ij}$ equal to the OM, LCP, LCS, ... distance, we get SS. We can then apply the ANOVA principle (Studer et al., 2009).

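The identity between the usual sum of squares and its pairwise-distance form can be verified numerically. The sketch below uses base R and a made-up numeric vector; the helper name ss_from_dist is hypothetical. Dividing SS by n gives a pseudo-variance, which is consistent with the outputs shown in these slides (total SS 30434.455 over 712 sequences, i.e. about 42.745).

```r
# Sum of squares computed from pairwise dissimilarities only.
ss_from_dist <- function(d) {
  d <- as.matrix(d)
  sum(d[upper.tri(d)]) / nrow(d)  # (1/n) * sum over pairs i < j of d_ij
}

y <- c(1, 2, 3, 6)
d_sq <- as.matrix(dist(y))^2      # d_ij = (y_i - y_j)^2

ss_pairs <- ss_from_dist(d_sq)       # 14
ss_usual <- sum((y - mean(y))^2)     # 14, the usual sum of squares
pseudo_var <- ss_pairs / length(y)   # SS / n = 3.5
```

With sequences there is no mean sequence, but the pairwise form still applies: any dissimilarity matrix can be plugged in as d.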
Compute the sequence dispersion

R> distMatLCS <- seqdist(mvad.seq, method = "LCS")
R> distMatLCS[1:6, 1:7]
     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,]    0  140  116  108  140   64   60
[2,]  140    0   72  140   22  140   80
[3,]  116   72    0   68   90   72   60
[4,]  108  140   68    0  140   46  112
[5,]  140   22   90  140    0  140   90
[6,]   64  140   72   46  140    0   68
R> dissvar(distMatLCS)
[1] 42.74502

Analysis of sequence discrepancy

ANOVA-like analysis based on pairwise dissimilarities

We decompose the SS (sum of squares equivalent):
\[ SS_T = SS_B + SS_W \]

Here, with the formula shown earlier,
\[ SS_T = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=i+1}^{n} d_{ij} \]
\[ SS_W = \sum_{g} \left( \frac{1}{n_g} \sum_{i=1}^{n_g} \sum_{j=i+1}^{n_g} d_{ij,g} \right) \]
\[ SS_B = SS_T - SS_W \]

Pseudo R-square and ANOVA table

ANOVA table for m groups:

          Discrepancy   df                        Mean discr.     F
Between   $SS_B$        $df_B = m - 1$            $SS_B / df_B$   $\dfrac{SS_B / df_B}{SS_W / df_W}$
Within    $SS_W$        $df_W = \sum_g n_g - m$   $SS_W / df_W$
Total     $SS_T$        $df_T = n - 1$

Pseudo $R^2$:
\[ R^2 = \frac{SS_B}{SS_T} \]

Pseudo F

\[ F = \frac{SS_B / (m - 1)}{SS_W / (n - m)} \]

- Normality is not defensible in this setting, so F cannot be compared with an F distribution.
- Significance is assessed through a permutation test.
- Permutation test: repeatedly reassign each covariate profile at random to one of the observed sequences and recompute F. This yields the empirical distribution of F under independence.

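The decomposition and the permutation test can be sketched in base R on a toy example. This is an illustration under stated assumptions, not the code behind dissassoc(): the helper names ss_from_dist and pseudo_f are hypothetical, the data are made up, and the p-value uses the common convention of counting the observed statistic among the permutations, which may differ from what TraMineR reports.

```r
# Sum of squares from pairwise dissimilarities: (1/n) * sum_{i<j} d_ij.
ss_from_dist <- function(d) sum(d[upper.tri(d)]) / nrow(d)

# Pseudo F and pseudo R2 for one grouping factor.
pseudo_f <- function(d, group) {
  group <- as.factor(group)
  n <- nrow(d)
  m <- nlevels(group)
  ss_t <- ss_from_dist(d)
  ss_w <- sum(sapply(levels(group), function(g) {
    idx <- which(group == g)
    ss_from_dist(d[idx, idx, drop = FALSE])
  }))
  ss_b <- ss_t - ss_w
  c(F = (ss_b / (m - 1)) / (ss_w / (n - m)), R2 = ss_b / ss_t)
}

y <- c(1, 2, 3, 7, 8, 9)
group <- rep(c("a", "b"), each = 3)
d <- as.matrix(dist(y))^2   # squared Euclidean distances, as in classical ANOVA

obs <- pseudo_f(d, group)   # F = 54 and R2 = 54/58 on this toy example

# Permutation test: shuffle the group labels and recompute F each time.
set.seed(1)
f_perm <- replicate(999, pseudo_f(d, sample(group))["F"])
p_value <- (sum(f_perm >= obs["F"]) + 1) / (length(f_perm) + 1)
```

With squared Euclidean distances the pseudo F coincides with the classical one-way ANOVA F, which gives a convenient way to check the implementation before applying it to sequence dissimilarities.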
Running an ANOVA-like analysis for gcse5eq

R> mvad.lcs <- seqdist(mvad.seq, method = "LCS")
R> da <- dissassoc(mvad.lcs, group = mvad$gcse5eq, R = 1000)

ANOVA output

R> print(da)
Pseudo ANOVA table:
             SS  df        MSE
Exp    2499.945   1 2499.94539
Res   27934.510 710   39.34438
Total 30434.455 711   42.80514

Test values (p-values based on 999 permutation):
  PseudoF   PseudoR2 PseudoF_Pval  PseudoT PseudoT_Pval
 63.54009 0.08214195            0 1.199912            0

Variance per level:
        n variance
no    452 37.48481
yes   260 42.27453
Total 712 42.74502

Distribution of pseudo F

R> hist(da, col = "cyan")

[Figure: histogram titled "Distribution of PseudoF". Horizontal axis: PseudoF (ticks 1 to 4). Vertical axis: Frequency (0 to 120).]

Multiple factor analysis
Generalize the previous approach to multiple covariates.
There are different approaches. Here, we measure the additional contribution of each covariate v once all other covariates have been accounted for.
The F statistic reads
$F_v = \frac{(SS_{B_c} - SS_{B_v}) / p}{SS_{W_c} / (n - m - 1)}$
where $SS_{B_c}$ and $SS_{W_c}$ are the explained and residual sums of squares of the full model, $SS_{B_v}$ is the explained sum of squares of the model after removing variable v, and p is the number of indicators or contrasts used to encode the covariate v.
Significance is again assessed through permutation tests.
Running a multiple factor analysis
R> da.mfac <- dissmfac(mvad.lcs ~ male + Grammar + funemp + gcse5eq +
+     fmpr + livboth, data = mvad, R = 1000)
R> print(da.mfac)
  Variable   PseudoF    PseudoR2 p_value
1     male  3.274802 0.003840223   0.026
2  Grammar 21.124081 0.024771330   0.000
3   funemp  4.483016 0.005257046   0.003
4  gcse5eq 75.725976 0.088800698   0.000
5     fmpr  2.715988 0.003184926   0.045
6  livboth  2.314571 0.002714201   0.078
7    Total 24.829102 0.174448528   0.000
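As a rough plausibility check, the Total row can be related to the F formula by plain arithmetic. The sketch below rests on assumptions not stated on the slides: that each of the six covariates is encoded by a single indicator (so m = 6), and that the Total row tests all covariates jointly, so that p = m and the explained sum of squares of the reduced model is zero. It only addresses the Total row, not the per-variable rows.

```r
# Values copied from the printed output and from the size of the data set
n  <- 712          # number of sequences
m  <- 6            # assumed: one indicator per covariate
r2 <- 0.174448528  # PseudoR2 in the Total row

# With R2 = SS_B / SS_T and 1 - R2 = SS_W / SS_T, the factor SS_T cancels
# in the ratio, so F can be written with R2 alone.
f_total <- (r2 / m) / ((1 - r2) / (n - m - 1))
print(f_total)  # about 24.829, close to the PseudoF in the Total row
```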
Differences over time
How do differences between groups vary over time?
How do differences between men's and women's insertion trajectories vary over time?
Compute $R^2$ for short sliding windows (length 2).
We thus get a sequence of $R^2$ values, which can be plotted.
Similarly, we can plot series of
  the total residual discrepancy ($SS_W$)
  the residual discrepancy of each group ($SS_G$)
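The sliding-window idea can be illustrated in base R on a toy state matrix. This is a sketch under assumptions: the Hamming distance (number of mismatching positions) stands in for whatever dissimilarity seqdiff uses, SS is assumed to be (1/n) times the sum of pairwise dissimilarities, and the helper names and data are hypothetical.

```r
# Pseudo R2 from a dissimilarity matrix, with SS = (1/n) * sum_{i<j} d_ij
pseudo_r2 <- function(d, group) {
  ss <- function(m) sum(m[upper.tri(m)]) / nrow(m)
  ss_w <- sum(sapply(unique(group), function(g) {
    idx <- which(group == g)
    ss(d[idx, idx, drop = FALSE])
  }))
  1 - ss_w / ss(d)
}

# Hamming distance between all pairs of rows of a state matrix
hamming <- function(states) {
  n <- nrow(states)
  d <- matrix(0, n, n)
  for (k in seq_len(ncol(states))) {
    d <- d + outer(states[, k], states[, k], "!=")
  }
  d
}

# Toy data (hypothetical): 6 sequences observed at 4 time points
states <- rbind(
  c("S", "S", "E", "E"),
  c("S", "S", "E", "E"),
  c("S", "E", "E", "E"),
  c("S", "S", "S", "E"),
  c("S", "S", "S", "S"),
  c("S", "S", "S", "E")
)
group <- c("a", "a", "a", "b", "b", "b")

# One R2 per window of length 2: positions (1,2), (2,3), (3,4)
r2 <- sapply(1:(ncol(states) - 1), function(t) {
  pseudo_r2(hamming(states[, t:(t + 1), drop = FALSE]), group)
})
print(r2)
```

In this toy example the groups barely differ in the first window and diverge afterwards, so the series of R2 values rises over time.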
Differences over time
R> mvad.diff <- seqdiff(mvad.seq, group = mvad$gcse5eq)
R> mvad.diff$stat[1:4, ]
        PseudoF   PseudoR2  PseudoT
Sep.93 29.09196 0.03936176 2.313692
Oct.93 29.39664 0.03975760 2.223468
Nov.93 29.76849 0.04024027 2.265784
Dec.93 30.09793 0.04066750 2.304112
R> mvad.diff$variance[1:4, ]
              no       yes     Total
Sep.93 0.3688107 0.3113979 0.3620982
Oct.93 0.3691362 0.3127219 0.3629661
Nov.93 0.3704210 0.3133136 0.3642237
Dec.93 0.3725771 0.3146893 0.3663363
Plotting R-squares over time
R> plot(mvad.diff)
[Figure: PseudoR2 (y-axis) plotted over time, from Sep.93 to Apr.99 (x-axis)]
Plotting residual discrepancy over time
R> plot(mvad.diff, stat = "Variance")
[Figure: Variance (y-axis) plotted over time, from Sep.93 to Apr.99 (x-axis), with one curve each for no, yes and Total]
Tree structured discrepancy analysis
Objective: find the most important predictors and their interactions.
Iteratively segment the cases using the values of the covariates (predictors), so that the groups are as homogeneous as possible.
At each step, we select the covariate and split with the highest $R^2$.
The significance of the split is assessed through a permutation F test.
Growing stops when the selected split is not significant.
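The split-selection step can be sketched in base R for a single node. This is only an illustration under assumptions: it compares covariates by pseudo R2 with SS taken as (1/n) times the sum of pairwise dissimilarities, it omits the permutation test and the recursion, and the names best_split, pseudo_r2 and the toy data are hypothetical rather than part of TraMineR.

```r
# Pseudo R2 from a dissimilarity matrix, with SS = (1/n) * sum_{i<j} d_ij
pseudo_r2 <- function(d, group) {
  ss <- function(m) sum(m[upper.tri(m)]) / nrow(m)
  ss_w <- sum(sapply(unique(group), function(g) {
    idx <- which(group == g)
    ss(d[idx, idx, drop = FALSE])
  }))
  1 - ss_w / ss(d)
}

# For one node: compute R2 for each candidate covariate and keep the best one
best_split <- function(d, covariates) {
  r2 <- sapply(covariates, function(v) pseudo_r2(d, v))
  list(variable = names(r2)[which.max(r2)], r2 = r2)
}

# Toy data (hypothetical): covariate u separates the two clusters, v does not
x <- c(0, 1, 2, 10, 11, 12)
d <- as.matrix(dist(x))
covs <- data.frame(
  u = c("no", "no", "no", "yes", "yes", "yes"),
  v = c("no", "yes", "no", "yes", "no", "yes")
)
split <- best_split(d, covs)
print(split$variable)
print(split$r2)
```

A full tree would test the selected split with a permutation F test and, if it is significant, repeat the same step within each resulting group.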
Growing the tree
R> dt <- disstree(mvad.lcs ~ male + Grammar + funemp + gcse5eq +
+     fmpr + livboth, data = mvad, R = 5000)
R> print(dt)
Dissimilarity tree
 Global R2: 0.113
 |-- Root [ 712 ] var: 42.7
   |-> gcse5eq R2: 0.0821
   |-- no [ 452 ] var: 37.5
     |-> funemp R2: 0.0107
     |-- no [ 362 ] var: 35.9
       |-> male R2: 0.0123
       |-- no [ 146 ] var: 38.7
       |-- yes [ 216 ] var: 33.3
     |-- yes [ 90 ] var: 41.8
   |-- yes [ 260 ] var: 42.3
     |-> Grammar R2: 0.0534
     |-- no [ 183 ] var: 42.2
     |-- yes [ 77 ] var: 34.9
Creating a Graphviz plot of the tree
Using the simplified interface to generate a file for GraphViz
R> seqtree2dot(dt, "fg_mvadseqtree", seqdata = mvad.seq, type = "d",
+     border = NA, withlegend = FALSE, axes = FALSE, ylab = "",
+     yaxis = FALSE)
Graphical Tree
[Figure: graphical rendering of the tree]
Sequential data analysis - 2 Mining event sequences
Outline
1 Dissimilarities among pairs of state sequences
2 Mining event sequences
3 Conclusion: Sequence of analyses
Section outline
2 Mining event sequences
  Event sequences
  Creating event subsequences in TraMineR
  Seeking frequent and discriminant subsequences
  Looking for state patterns
  Looking for specific subsequences
  Temporal constraints
Event sequences
Analysis of event sequences
Objective
Focus on events rather than states.
Interest in the patterns of events.
Pattern of events: events that occur systematically together and in the same order.
Are there typical "patterns" of events?
Relationship with covariates
Which patterns best discriminate specific groups?
Typical differences in event sequences between men and women.
Event patterns vs. typical state sequencing.
Association rules between event subsequences:
The sequence Leaving home → Childbirth is generally followed by Marriage → Second Childbirth.
Not yet available, but ... coming soon.