Sequential data analysis with TraMineR, Part 2
Gilbert Ritschard
Department of Econometrics and Laboratory of Demography, University of Geneva
http://mephisto.unige.ch/biomining
APA-ATI Workshop on Exploratory Data


  1. Sequential data analysis - 2
      Dissimilarities among pairs of state sequences: Measures of dissimilarity between sequences
      Example on the first 6 mvad sequences: non-normalized LCP distance
      R> seqdist(mvad.seq[1:6, ], method = "LCP", norm = FALSE)
           [,1] [,2] [,3] [,4] [,5] [,6]
      [1,]    0  140  140  140  140  140
      [2,]  140    0  140  140   90  140
      [3,]  140  140    0   92  140  140
      [4,]  140  140   92    0  140  140
      [5,]  140   90  140  140    0  140
      [6,]  140  140  140  140  140    0
      8/7/2009gr 11/100

  2. Example on the first 6 mvad sequences: normalized LCP distance
      R> seqdist(mvad.seq[1:6, ], method = "LCP", norm = TRUE)
           [,1]      [,2]      [,3]      [,4]      [,5] [,6]
      [1,]    0 1.0000000 1.0000000 1.0000000 1.0000000    1
      [2,]    1 0.0000000 1.0000000 1.0000000 0.6428571    1
      [3,]    1 1.0000000 0.0000000 0.6571429 1.0000000    1
      [4,]    1 1.0000000 0.6571429 0.0000000 1.0000000    1
      [5,]    1 0.6428571 1.0000000 1.0000000 0.0000000    1
      [6,]    1 1.0000000 1.0000000 1.0000000 1.0000000    0
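
The two LCP outputs can be related with a small base-R sketch. It assumes the raw LCP distance is |x| + |y| minus twice the length of the common prefix, and that the normalized form is 1 minus the prefix length divided by sqrt(|x| * |y|). These formulas are assumptions that agree with the values 90 and 0.6428571 shown for sequences 2 and 5; they are not taken from the TraMineR source. The helper names and the two sequences are hypothetical.

```r
# Length of the longest common prefix of two state sequences (vectors).
lcp_length <- function(x, y) {
  n <- min(length(x), length(y))
  if (n == 0) return(0)
  mismatch <- which(x[seq_len(n)] != y[seq_len(n)])
  if (length(mismatch) == 0) n else mismatch[1] - 1
}

# Raw LCP distance (assumed form): |x| + |y| - 2 * prefix length.
lcp_dist <- function(x, y) length(x) + length(y) - 2 * lcp_length(x, y)

# Normalized LCP distance (assumed form): 1 - prefix length / sqrt(|x| * |y|).
lcp_dist_norm <- function(x, y) {
  1 - lcp_length(x, y) / sqrt(length(x) * length(y))
}

# Two hypothetical sequences of length 70 sharing their first 25 states.
x <- c(rep("TR", 25), rep("EM", 45))
y <- c(rep("TR", 25), rep("FE", 45))

raw <- lcp_dist(x, y)        # 70 + 70 - 2 * 25
nrm <- lcp_dist_norm(x, y)   # 1 - 25 / 70
```

Under these assumptions, two sequences of length 70 with no common prefix are at raw distance 140 and normalized distance 1, as in most cells of the matrices.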

  3. LCS: Longest Common Subsequences
      LLCS: length of the longest common subsequence shared by 2 sequences.
      Example:
        x : 1-1-1-2-2-3-3
        y : 1-1-1-4-3-3-4
        LLCS = 5
      Distance measure, where $A_\ell(x, y)$ denotes the LLCS of x and y:
        $d_{LCS}(x, y) = A_\ell(x, x) + A_\ell(y, y) - 2 A_\ell(x, y)$
      For the example, $d_{LCS}(x, y) = 7 + 7 - 2 \cdot 5 = 4$.
      Normalized form:
        $D_{LCS}(x, y) = 1 - \frac{A_\ell(x, y)}{\sqrt{|x| \cdot |y|}}$
      For the example, $D_{LCS}(x, y) = 1 - 5/7 \approx 0.286$.

  7. LLCS: example
      R> x <- c(1, 1, 1, 2, 2, 3, 3)
      R> y <- c(1, 1, 1, 4, 3, 3, 4)
      R> seqdist(seqdef(rbind(x, y)), method = "LCS")
           [,1] [,2]
      [1,]    0    4
      [2,]    4    0
      R> seqdist(seqdef(rbind(x, y)), method = "LCS", norm = TRUE)
                [,1]      [,2]
      [1,] 0.0000000 0.2857143
      [2,] 0.2857143 0.0000000
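
The LCS values above can be checked by hand with a short dynamic-programming sketch in base R. The function name llcs is hypothetical, and the normalized form used is 1 minus LLCS divided by sqrt(|x| * |y|), which is an assumption that matches the printed value 0.2857143. This is an illustration of the definitions, not TraMineR's implementation.

```r
# Length of the longest common subsequence, by dynamic programming.
# L[i + 1, j + 1] holds the LLCS of the first i elements of x
# and the first j elements of y.
llcs <- function(x, y) {
  n <- length(x)
  m <- length(y)
  L <- matrix(0, n + 1, m + 1)
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      L[i + 1, j + 1] <- if (x[i] == y[j]) {
        L[i, j] + 1
      } else {
        max(L[i, j + 1], L[i + 1, j])
      }
    }
  }
  L[n + 1, m + 1]
}

x <- c(1, 1, 1, 2, 2, 3, 3)
y <- c(1, 1, 1, 4, 3, 3, 4)

a_xy <- llcs(x, y)                                     # common subsequence 1-1-1-3-3
d_lcs <- llcs(x, x) + llcs(y, y) - 2 * a_xy            # 7 + 7 - 2 * 5
d_lcs_norm <- 1 - a_xy / sqrt(length(x) * length(y))   # 1 - 5 / 7
```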

  8. Optimal matching (optimal alignment)
        - Based on the Levenshtein (1966) distance.
        - Inspired by the alignment methods used in biology (DNA or protein sequences).
        - Introduced in the social sciences by Abbott and Forrest (1986).

  12. Optimal matching (OM): principle
      We want to transform one sequence into the other, using two types of operations:
        - insertion or deletion (indel) of an element;
        - substitution of an element.
      Each operation has a cost.
      The OM distance is the minimal total cost of transforming one sequence into the other.

  15. OM: example
      Consider the two sequences:
        1:  0 1 1 2 2 2 3
        2:  0 1 1 2 2 3 3
      Insertion of element '2' in sequence 2:
        1:  0 1 1 2 2 2 3
        2:  0 1 1 2 2 2 3 3
      Deletion of element '3' from sequence 2:
        1:  0 1 1 2 2 2 3
        2:  0 1 1 2 2 2 3
      The two sequences are now identical.

  19. OM: substitution example
      Consider the two sequences:
        1:  0 1 1 2 2 2 3
        2:  0 1 1 2 2 3 3
      Substitution of '3' by element '2' in sequence 2:
        1:  0 1 1 2 2 2 3
        2:  0 1 1 2 2 2 3
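
The two ways of aligning the example sequences (one insertion plus one deletion, or a single substitution) can be compared with a minimal dynamic-programming sketch in base R. The function om_dist and its arguments are hypothetical names, the costs are illustrative, and the sequences are read from the example with the leading 1 and 2 taken as row labels. It is a sketch of the principle, not TraMineR's algorithm.

```r
# Optimal matching distance with a constant indel cost and a
# substitution-cost function (Levenshtein-type recursion).
# D[i + 1, j + 1] is the minimal cost of turning the first i elements
# of x into the first j elements of y.
om_dist <- function(x, y, indel = 1, sub_cost = function(a, b) 2) {
  n <- length(x)
  m <- length(y)
  D <- matrix(0, n + 1, m + 1)
  D[, 1] <- (0:n) * indel
  D[1, ] <- (0:m) * indel
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      s <- if (x[i] == y[j]) 0 else sub_cost(x[i], y[j])
      D[i + 1, j + 1] <- min(D[i, j + 1] + indel,   # delete x[i]
                             D[i + 1, j] + indel,   # insert y[j]
                             D[i, j] + s)           # match or substitute
    }
  }
  D[n + 1, m + 1]
}

s1 <- c(0, 1, 1, 2, 2, 2, 3)
s2 <- c(0, 1, 1, 2, 2, 3, 3)

# Cheap substitution: the single substitution is the optimal path.
d_cheap_sub <- om_dist(s1, s2, indel = 1, sub_cost = function(a, b) 1)

# Costly substitution: one insertion plus one deletion is cheaper.
d_costly_sub <- om_dist(s1, s2, indel = 1, sub_cost = function(a, b) 3)
```

Which path is optimal depends only on the relative costs: a substitution is never used when it costs more than two indels.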

  21. Assigning indel and substitution costs
      Indel cost: the same cost applies to each insertion or deletion, so the indel cost is a single constant.
      Substitution costs: each substitution may receive a different cost, given by a matrix of substitution costs. The costs must, however, be symmetric: $c_{i,j} = c_{j,i}$.

  23. Defining substitution costs
        - Unique cost $c_{i,j} = c$ (the user provides c).
        - Based on transition rates (no additional input required):
            $c_{i,j} = c_{j,i} = 2 - p(i_t \mid j_{t-1}) - p(j_t \mid i_{t-1})$
        - Custom costs (the user provides the whole cost matrix).
        - Learned optimal costs (Gauthier et al., 2008, and their TCOFFEE software).
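
The transition-rate formula can be illustrated on a toy data set with base R only. The three sequences and all object names below are hypothetical, and the code assumes every state occurs at least once before the last position (otherwise a row of rates would be undefined). It is a sketch of the formula, not a reimplementation of seqsubm().

```r
# Toy state sequences (rows = individuals, columns = time points).
seqs <- rbind(c("A", "A", "B", "B"),
              c("A", "B", "B", "C"),
              c("B", "B", "C", "C"))
states <- sort(unique(as.vector(seqs)))

# Pair each state at t - 1 (from) with the state at t (to).
from <- factor(as.vector(seqs[, -ncol(seqs)]), levels = states)
to   <- factor(as.vector(seqs[, -1]), levels = states)

# Transition counts: rows = state at t - 1, columns = state at t.
counts <- unclass(table(from, to))
dimnames(counts) <- list(states, states)

# Transition rates p(state at t | state at t - 1): each row sums to 1.
trate <- counts / rowSums(counts)

# Substitution costs c_ij = 2 - p(i | j) - p(j | i), zero on the diagonal.
subm <- 2 - trate - t(trate)
diag(subm) <- 0
```

States that are rarely or never adjacent in time get a cost close to 2, which is why the transition-rate matrix computed on mvad has all its off-diagonal entries between 1.95 and 2.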

  26. Using Optimal Matching in TraMineR
        - Create the state sequence object with seqdef().
        - Get a substitution cost matrix, or compute one with seqsubm().
        - Compute the matrix of OM distances with seqdist(..., method="OM", indel=..., sm=...).

  29. Cost Matrix: Unique Costs
      R> subm.unique <- seqsubm(mvad.seq, method = "CONSTANT", cval = 2)
      R> subm.unique
           EM-> FE-> HE-> JL-> SC-> TR->
      EM->    0    2    2    2    2    2
      FE->    2    0    2    2    2    2
      HE->    2    2    0    2    2    2
      JL->    2    2    2    0    2    2
      SC->    2    2    2    2    0    2
      TR->    2    2    2    2    2    0

  30. Cost Matrix: Custom Costs
      R> subm.custom <- matrix(c(0, 1, 1, 2, 1, 1, 1, 0, 1, 2,
      +     1, 2, 1, 1, 0, 3, 1, 2, 2, 2, 3, 0, 3, 1, 1, 1, 1,
      +     3, 0, 2, 1, 2, 2, 1, 2, 0), nrow = 6, ncol = 6, byrow = TRUE,
      +     dimnames = list(mvad.shortlab, mvad.shortlab))
      R> subm.custom
         EM FE HE JL SC TR
      EM  0  1  1  2  1  1
      FE  1  0  1  2  1  2
      HE  1  1  0  3  1  2
      JL  2  2  3  0  3  1
      SC  1  1  1  3  0  2
      TR  1  2  2  1  2  0

  31. Cost Matrix: Based on Transition Rates
      R> subm.txrate <- seqsubm(mvad.seq, method = "TRATE")
      R> subm.txrate
              EM->    FE->    HE->    JL->    SC->    TR->
      EM-> 0.00000 1.97008 1.98723 1.95173 1.98536 1.95950
      FE-> 1.97008 0.00000 1.99318 1.98266 1.99092 1.99235
      HE-> 1.98723 1.99318 0.00000 1.99584 1.98184 1.99949
      JL-> 1.95173 1.98266 1.99584 0.00000 1.99385 1.97808
      SC-> 1.98536 1.99092 1.98184 1.99385 0.00000 1.99666
      TR-> 1.95950 1.99235 1.99949 1.97808 1.99666 0.00000

  32. Computing the distances
      Using the substitution cost matrix, we compute the distances:
      R> mvad.dist <- seqdist(mvad.seq, method = "OM", indel = 4,
      +     sm = subm.custom, norm = TRUE)
      R> round(mvad.dist[1:10, 1:10], digits = 2)
            [1]  [2]  [3]  [4]  [5]  [6]  [7]  [8]  [9] [10]
       [1] 0.00 1.03 0.86 0.90 1.03 0.47 0.46 0.34 0.27 0.57
       [2] 1.03 0.00 1.23 1.93 0.16 1.49 0.57 0.69 1.30 1.37
       [3] 0.86 1.23 0.00 1.01 1.39 0.70 1.14 1.20 0.59 1.26
       [4] 0.90 1.93 1.01 0.00 1.93 0.46 1.36 1.24 0.63 0.90
       [5] 1.03 0.16 1.39 1.93 0.00 1.49 0.64 0.69 1.30 1.37
       [6] 0.47 1.49 0.70 0.46 1.49 0.00 0.91 0.80 0.20 0.99
       [7] 0.46 0.57 1.14 1.36 0.64 0.91 0.00 0.11 0.73 0.80
       [8] 0.34 0.69 1.20 1.24 0.69 0.80 0.11 0.00 0.61 0.69
       [9] 0.27 1.30 0.59 0.63 1.30 0.20 0.73 0.61 0.00 0.79
      [10] 0.57 1.37 1.26 0.90 1.37 0.99 0.80 0.69 0.79 0.00

  33. Clustering and MDS: section outline
      Dissimilarities among pairs of state sequences
        - Measures of dissimilarity between sequences: LCP, LCS, optimal matching
        - Clustering and MDS: cluster analysis, plotting sequences by cluster, multidimensional scaling (MDS)
        - Sequence dispersion
        - Analysis of sequence discrepancy

  34. Cluster analysis
      Once we have a dissimilarity (distance) matrix, we can run any clustering algorithm that accepts such a matrix as input.
      There are several possibilities in R, for instance with the cluster package:
        - agnes(): agglomerative nesting, i.e. hierarchical clustering (average, ward, ...).
        - diana(): divisive analysis.
        - pam(): partitioning around medoids (non-hierarchical and faster, but the number of clusters must be set a priori).

  36. Hierarchical clustering (Ward)
      R> library(cluster)
      R> mvad.clusterward <- agnes(mvad.dist, diss = T, method = "ward")
      R> plot(mvad.clusterward, ask = F, which.plots = 2)
      [Figure: dendrogram of agnes(x = mvad.dist, diss = T, method = "ward") on mvad.dist; vertical axis: Height (0 to 15); Agglomerative Coefficient = 0.99]

  37. Warning!!!
      Do not forget to specify the diss = T option when passing the dissimilarity matrix.
      Otherwise (i.e. by default) the functions agnes(), diana(), pam(), ... treat the matrix as a data table and first compute the Euclidean distance matrix between the rows of the dissimilarity matrix.
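
The effect described in this warning can be reproduced with base R alone. The sketch below uses dist() to stand in for what happens when a dissimilarity matrix is treated as a data table, and as.dist() for the intended use; the 4 x 4 matrix is hypothetical, and the equivalence with the default behavior of the cluster functions is assumed from the warning rather than demonstrated.

```r
# A small dissimilarity matrix between four hypothetical sequences.
D <- matrix(c(0, 1, 4, 5,
              1, 0, 3, 6,
              4, 3, 0, 2,
              5, 6, 2, 0), nrow = 4, byrow = TRUE)

# Intended use: D already holds the pairwise dissimilarities.
d_ok <- as.dist(D)

# Pitfall: treating D as a data table computes Euclidean distances
# between its rows, which is a different set of dissimilarities.
d_wrong <- dist(D)

# The two sets of dissimilarities do not coincide.
same <- isTRUE(all.equal(as.vector(d_ok), as.vector(d_wrong)))
```

Here the dissimilarity between the first two sequences is 1 in D, but the Euclidean distance between rows 1 and 2 of D is 2.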

  38. Retrieving cluster membership
      Select the number of clusters, cut the tree at the chosen level, and store the cluster membership in a vector.
      R> mvad.cl3 <- cutree(mvad.clusterward, k = 3)
      R> mvad.cl3[1:10]
       [1] 1 2 1 1 2 1 1 1 1 3
      R> clust.labels <- c("Employment", "Education", "Jobless")
      R> mvad.cl3.factor <- factor(mvad.cl3, levels = c(1, 2,
      +     3), labels = clust.labels)

  39. Exploring clusters graphically
      Three types of graphics:
        1. Transversal distributions with seqdplot()
        2. Frequency plots with seqfplot()
        3. Individual index-plots with seqiplot()
      Required argument: the state sequence object.
      Use group = cluster.membership.factor to get plots by cluster.

  40. Transversal Distributions
      R> seqdplot(mvad.seq, group = mvad.cl3.factor)

  41. Most frequent sequences
      R> seqfplot(mvad.seq, group = mvad.cl3.factor)

  42. Individual sequences
      R> seqiplot(mvad.seq, group = mvad.cl3.factor, tlim = 0, border = NA,
      +     space = 0)

  43. Sorting sequences for i-plot display
      The previous i-plots become clearer if we sort the sequences. There are several possibilities:
        - according to the distance to the most frequent sequence;
        - according to the distance to the centro-type or any other useful reference;
        - according to the scores on the first factor of an MDS analysis.

  44. Computing distance to most frequent sequence
      Compute, in each cluster, the distances to the most frequent sequence (refseq = 0), using here the custom substitution cost matrix.
      R> mvad.distom <- numeric(nrow(mvad))
      R> mvad.distom[mvad.cl3 == 1] <- seqdist(mvad.seq[mvad.cl3 ==
      +     1, ], refseq = 0, method = "OM", indel = 4, sm = subm.custom)
      R> mvad.distom[mvad.cl3 == 2] <- seqdist(mvad.seq[mvad.cl3 ==
      +     2, ], refseq = 0, method = "OM", indel = 4, sm = subm.custom)
      R> mvad.distom[mvad.cl3 == 3] <- seqdist(mvad.seq[mvad.cl3 ==
      +     3, ], refseq = 0, method = "OM", indel = 4, sm = subm.custom)

  45. Sort: Distance to most frequent sequence
      R> seqiplot(mvad.seq, group = mvad.cl3.factor, tlim = 0, border = NA,
      +     space = 0, sortv = mvad.distom)

  46. Sort: First factor of MDS analysis
      R> mds1d <- cmdscale(mvad.dist, k = 1)
      R> seqiplot(mvad.seq, group = mvad.cl3.factor, tlim = 0, border = NA,
      +     space = 0, sortv = mds1d)
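
What a one-dimensional MDS score provides as a sorting variable can be seen on a toy case in base R. The four positions below are hypothetical; for points that really lie on a line, cmdscale() recovers their order up to a shift and a change of sign, so sorting on the first factor orders the objects along their main axis of variation.

```r
# Hypothetical points on a line and their pairwise distances.
pos <- c(0, 1, 3, 7)
d <- dist(pos)

# One-dimensional classical MDS.
mds1d <- cmdscale(d, k = 1)

# Ordering along the first MDS axis (increasing or decreasing,
# since the sign of an MDS axis is arbitrary).
ord <- order(mds1d[, 1])

# The one-dimensional configuration reproduces the original distances.
max_err <- max(abs(as.vector(dist(mds1d)) - as.vector(d)))
```

With sequence dissimilarities the fit is only approximate, so the first factor gives a useful but partial ordering.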

  47. Scatterplot (MDS)
      Through multidimensional scaling (MDS), we get a scatterplot of the sequences.
      R> mds2d <- cmdscale(mvad.dist, k = 2)
      R> plot(mds2d, type = "n")
      R> points(mds2d[mvad.cl3 == 1, ], pch = 16, col = "red")
      R> points(mds2d[mvad.cl3 == 2, ], pch = 16, col = "blue")
      R> points(mds2d[mvad.cl3 == 3, ], pch = 16, col = "green")
      R> legend("bottomright", fill = c("red", "blue", "green"),
      +     legend = clust.labels)

  48. Sequence scatterplot colored by cluster
      [Figure: MDS scatterplot of the sequences; x-axis mds2d[,1], y-axis mds2d[,2]; points colored by cluster (Employment, Education, Jobless)]

  49. Code for scatterplot colored by sex
      R> plot(mds2d, type = "n")
      R> points(mds2d[mvad$male == "yes", ], pch = 16, col = "red")
      R> points(mds2d[mvad$male == "no", ], pch = 23, col = "blue")
      R> legend("bottomright", col = c("red", "blue"), pch = c(16,
      +     23), legend = c("Men", "Women"))

  50. Sequence scatterplot colored by sex
      [Figure: MDS scatterplot of the sequences; x-axis mds2d[,1], y-axis mds2d[,2]; points marked by sex (Men, Women)]

  51. Section outline: Sequence dispersion

  52. Dispersion of the set of sequences
      From the distance matrix, we get the pseudo-variance of the set of sequences.
      The sum of squares SS can be expressed in terms of distances between pairs:
        $SS = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=i+1}^{n} (y_i - y_j)^2 = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=i+1}^{n} d_{ij}$
      Setting $d_{ij}$ equal to the OM, LCP, LCS, ... distance, we get SS.
      We can then apply the ANOVA principle (Studer et al., 2009).
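
The identity between the usual sum of squares and the pairwise expression can be verified numerically on plain numbers, where d_ij is the squared difference. The function ss_from_diss and the data are hypothetical; dividing SS by n as the pseudo-variance is an assumption that is consistent with the outputs in these slides (30434.455 / 712 = 42.745).

```r
# Sum of squares from pairwise dissimilarities: (1/n) * sum over i < j of d_ij.
ss_from_diss <- function(d) {
  d <- as.matrix(d)
  sum(d[upper.tri(d)]) / nrow(d)
}

# Check on plain numbers, with d_ij = (y_i - y_j)^2.
y <- c(2, 4, 4, 7, 9)
d_sq <- outer(y, y, function(a, b) (a - b)^2)

ss_classic <- sum((y - mean(y))^2)   # usual sum of squared deviations
ss_pairs <- ss_from_diss(d_sq)       # same quantity from the pairs only
pseudo_var <- ss_pairs / length(y)   # assumed definition: SS / n
```

The pairwise form needs no mean, which is what makes it usable with sequences, for which only dissimilarities are available.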

  55. Compute the sequence dispersion
      R> distMatLCS <- seqdist(mvad.seq, method = "LCS")
      R> distMatLCS[1:6, 1:7]
           [,1] [,2] [,3] [,4] [,5] [,6] [,7]
      [1,]    0  140  116  108  140   64   60
      [2,]  140    0   72  140   22  140   80
      [3,]  116   72    0   68   90   72   60
      [4,]  108  140   68    0  140   46  112
      [5,]  140   22   90  140    0  140   90
      [6,]   64  140   72   46  140    0   68
      R> dissvar(distMatLCS)
      [1] 42.74502

  56. Section outline: Analysis of sequence discrepancy

  57. Analysis of sequence discrepancy
      ANOVA-like analysis based on pairwise dissimilarities.
      We decompose the SS (sum of squares equivalent):
        $SS_T = SS_B + SS_W$
      Here, with the formula shown earlier,
        $SS_T = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=i+1}^{n} d_{ij}$
        $SS_W = \sum_{g} \left( \frac{1}{n_g} \sum_{i=1}^{n_g} \sum_{j=i+1}^{n_g} d_{ij,g} \right)$
        $SS_B = SS_T - SS_W$
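
The decomposition can be sketched in base R for any dissimilarity matrix and grouping factor. The function names and the toy data are hypothetical; with squared differences of plain numbers as dissimilarities, the result coincides with the classical ANOVA sums of squares, which gives a simple check. This is an illustration of the formulas, not the code behind dissassoc().

```r
# Sum of squares from a dissimilarity matrix: (1/n) * sum over i < j of d_ij.
ss_from_diss <- function(d) {
  d <- as.matrix(d)
  sum(d[upper.tri(d)]) / nrow(d)
}

# Total, within-group and between-group discrepancy.
diss_decomp <- function(d, group) {
  d <- as.matrix(d)
  group <- factor(group)
  ss_t <- ss_from_diss(d)
  # Within: apply the same formula inside each group, then sum over groups.
  ss_w <- sum(sapply(levels(group), function(g) {
    idx <- which(group == g)
    ss_from_diss(d[idx, idx, drop = FALSE])
  }))
  c(SS_T = ss_t, SS_W = ss_w, SS_B = ss_t - ss_w)
}

# Hypothetical example: squared differences of numbers in two groups.
y <- c(1, 2, 3, 10, 11, 12)
g <- c("a", "a", "a", "b", "b", "b")
d <- outer(y, y, function(a, b) (a - b)^2)

res <- diss_decomp(d, g)
pseudo_r2 <- unname(res["SS_B"] / res["SS_T"])
```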

  58. Pseudo R-square and ANOVA Table
      ANOVA table for m groups:
                 Discrepancy   df                           Mean Discr.       F
        Between  $SS_B$        $df_B = m - 1$               $SS_B / df_B$     $\frac{SS_B / df_B}{SS_W / df_W}$
        Within   $SS_W$        $df_W = \sum_g n_g - m$      $SS_W / df_W$
        Total    $SS_T$        $df_T = n - 1$
      Pseudo $R^2$:
        $R^2 = \frac{SS_B}{SS_T}$

  60. Pseudo F
        $F = \frac{SS_B / (m - 1)}{SS_W / (n - m)}$
      Normality is not defensible in this setting, so F cannot be compared with an F distribution.
      Significance is assessed through a permutation test: repeatedly reassign each covariate profile at random to one of the observed sequences and recompute F. This yields the empirical distribution of F under independence.
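
The permutation test can be sketched in base R as follows. All names and data are hypothetical, the dissimilarities are squared differences of simulated numbers, and the p-value uses the common (count + 1) / (R + 1) convention, which is an assumption and may differ from what dissassoc() reports.

```r
set.seed(1)

# Sum of squares from a dissimilarity matrix: (1/n) * sum over i < j of d_ij.
ss_from_diss <- function(d) sum(d[upper.tri(d)]) / nrow(d)

# Pseudo F = [SS_B / (m - 1)] / [SS_W / (n - m)].
pseudo_f <- function(d, group) {
  group <- factor(group)
  n <- nrow(d)
  m <- nlevels(group)
  ss_t <- ss_from_diss(d)
  ss_w <- sum(sapply(levels(group), function(g) {
    idx <- which(group == g)
    ss_from_diss(d[idx, idx, drop = FALSE])
  }))
  ((ss_t - ss_w) / (m - 1)) / (ss_w / (n - m))
}

# Hypothetical data: two well-separated groups of numbers.
y <- c(rnorm(15, mean = 0), rnorm(15, mean = 3))
g <- rep(c("a", "b"), each = 15)
d <- outer(y, y, function(a, b) (a - b)^2)

f_obs <- pseudo_f(d, g)

# Permutation distribution: shuffle the group labels and recompute F.
R <- 999
f_perm <- replicate(R, pseudo_f(d, sample(g)))

# Share of F values at least as large as the observed one.
p_value <- (sum(f_perm >= f_obs) + 1) / (R + 1)
```

Shuffling the labels breaks any link between group and dissimilarities, so f_perm approximates the distribution of F under independence without any normality assumption.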

  61. Analysis of sequence discrepancy
      Running an ANOVA-like analysis for gcse5eq:
      R> mvad.lcs <- seqdist(mvad.seq, method = "LCS")
      R> da <- dissassoc(mvad.lcs, group = mvad$gcse5eq, R = 1000)

  62. ANOVA output
      R> print(da)
      Pseudo ANOVA table:
                   SS  df        MSE
      Exp    2499.945   1 2499.94539
      Res   27934.510 710   39.34438
      Total 30434.455 711   42.80514
      Test values (p-values based on 999 permutation):
       PseudoF   PseudoR2 PseudoF_Pval  PseudoT PseudoT_Pval
      63.54009 0.08214195            0 1.199912            0
      Variance per level:
              n variance
      no    452 37.48481
      yes   260 42.27453
      Total 712 42.74502

  63. Distribution of pseudo F
      R> hist(da, col = "cyan")
      [Figure: histogram titled "Distribution of PseudoF"; x-axis: PseudoF, y-axis: Frequency]

  64. Multiple factor analysis
      Generalizes the previous approach to multiple covariates.
      There are different approaches. Here, we measure the additional contribution of each covariate v once all other covariates are accounted for.
      The F statistic reads
      $$F_v = \frac{(SS_{B_c} - SS_{B_v}) / p}{SS_{W_c} / (n - m - 1)}$$
      where $SS_{B_c}$ and $SS_{W_c}$ are the explained and residual sums of squares of the full model, $SS_{B_v}$ is the explained sum of squares of the model after removing variable v, and p is the number of indicators or contrasts used to encode the covariate v.
      Significance is again assessed through permutation tests.

  65. Running a multiple factor analysis
      R> da.mfac <- dissmfac(mvad.lcs ~ male + Grammar + funemp + gcse5eq +
      +     fmpr + livboth, data = mvad, R = 1000)
      R> print(da.mfac)
        Variable   PseudoF    PseudoR2 p_value
      1     male  3.274802 0.003840223   0.026
      2  Grammar 21.124081 0.024771330   0.000
      3   funemp  4.483016 0.005257046   0.003
      4  gcse5eq 75.725976 0.088800698   0.000
      5     fmpr  2.715988 0.003184926   0.045
      6  livboth  2.314571 0.002714201   0.078
      7    Total 24.829102 0.174448528   0.000
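As a sanity check, the Total row can be related to the F formula of the multiple factor slide. For the full model the explained part is $SS_{B_c}$ with m degrees of freedom, and dividing numerator and denominator by $SS_T$ gives F in terms of the pseudo $R^2$. The sketch below assumes that m is the total number of indicators in the model and that each of the six covariates is coded by a single indicator (m = 6); n = 712 comes from the ANOVA output shown earlier. The helper name is hypothetical.

```r
# Hypothetical helper: global pseudo F computed from the global pseudo R2.
# F = (R2 / m) / ((1 - R2) / (n - m - 1))
pseudo_f_from_r2 <- function(r2, n, m) {
  (r2 / m) / ((1 - r2) / (n - m - 1))
}

# Values read from the printed output; m = 6 is an assumption
# (six covariates, one indicator each).
f_total <- pseudo_f_from_r2(r2 = 0.174448528, n = 712, m = 6)
f_total
```

Under these assumptions the result is about 24.83, in line with the PseudoF printed for Total.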

  66. Differences over time
      How do differences between groups vary over time? For example, how do differences between the insertion trajectories of men and women vary over time?
      Compute $R^2$ for short sliding windows (length 2).
      We thus get a sequence of $R^2$ values, which can be plotted.
      Similarly, we can plot the series of the total residual discrepancy ($SS_W$) and of the residual discrepancy of each group ($SS_G$).
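The sliding-window idea can be sketched in base R as follows. This is an illustration only: the helpers are hypothetical, the toy sequences are made up, and a simple position-wise mismatch count (Hamming distance) stands in for whatever dissimilarity is actually used on each window.

```r
# Hypothetical helpers (not TraMineR functions).
pseudo_r2 <- function(d, group) {
  group <- as.character(group)
  ss <- function(m) sum(m[upper.tri(m)]) / nrow(m)
  ss_t <- ss(d)
  ss_w <- sum(sapply(unique(group), function(g) {
    idx <- which(group == g)
    ss(d[idx, idx, drop = FALSE])
  }))
  (ss_t - ss_w) / ss_t
}

# Number of positions at which two rows of the state matrix differ
hamming <- function(s) {
  n <- nrow(s)
  d <- matrix(0, n, n)
  for (i in seq_len(n)) for (j in seq_len(n)) d[i, j] <- sum(s[i, ] != s[j, ])
  d
}

# One pseudo R2 per window of `width` consecutive positions
sliding_r2 <- function(s, group, width = 2) {
  starts <- seq_len(ncol(s) - width + 1)
  sapply(starts, function(k) {
    w <- s[, k:(k + width - 1), drop = FALSE]
    pseudo_r2(hamming(w), group)
  })
}

# Toy data (assumed): groups are mixed at the start and separated at the end
s <- rbind(c("S", "E", "E", "E"),
           c("E", "S", "E", "E"),
           c("S", "E", "S", "S"),
           c("E", "S", "S", "S"))
g <- c("a", "a", "b", "b")
r2 <- sliding_r2(s, g)
r2
```

For these toy sequences the three windows give 0, 0.5 and 1: the grouping explains nothing at the start and everything at the end.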

  69. Differences over time
      R> mvad.diff <- seqdiff(mvad.seq, group = mvad$gcse5eq)
      R> mvad.diff$stat[1:4, ]
              PseudoF   PseudoR2  PseudoT
      Sep.93 29.09196 0.03936176 2.313692
      Oct.93 29.39664 0.03975760 2.223468
      Nov.93 29.76849 0.04024027 2.265784
      Dec.93 30.09793 0.04066750 2.304112
      R> mvad.diff$variance[1:4, ]
                    no       yes     Total
      Sep.93 0.3688107 0.3113979 0.3620982
      Oct.93 0.3691362 0.3127219 0.3629661
      Nov.93 0.3704210 0.3133136 0.3642237
      Dec.93 0.3725771 0.3146893 0.3663363

  70. Plotting R-squares over time
      R> plot(mvad.diff)
      [Figure: PseudoR2 (y-axis) over time, Sep.93 to Apr.99 (x-axis)]

  71. Plotting residual discrepancy over time
      R> plot(mvad.diff, stat = "Variance")
      [Figure: Variance (y-axis) over time, Sep.93 to Apr.99 (x-axis), with legend entries no, yes and Total]

  72. Tree structured discrepancy analysis
      Objective: find the most important predictors and their interactions.
      Iteratively segment the cases using values of the covariates (predictors), such that the groups are as homogeneous as possible.
      At each step, we select the covariate and split with the highest $R^2$.
      The significance of the split is assessed through a permutation F test.
      Growing stops when the selected split is not significant.
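One step of this procedure, choosing the covariate whose split has the highest pseudo $R^2$, can be sketched in base R. The helpers and toy data are hypothetical. The sketch treats each covariate's categories directly as the split, and it leaves out both the permutation test and the recursion into child nodes.

```r
# Hypothetical helpers (not TraMineR functions).
split_r2 <- function(d, group) {
  group <- as.character(group)
  ss <- function(m) sum(m[upper.tri(m)]) / nrow(m)
  ss_t <- ss(d)
  ss_w <- sum(sapply(unique(group), function(g) {
    idx <- which(group == g)
    ss(d[idx, idx, drop = FALSE])
  }))
  (ss_t - ss_w) / ss_t
}

# Evaluate every covariate (column of a data frame) and keep the best one
best_split <- function(d, covariates) {
  r2 <- sapply(covariates, function(v) split_r2(d, v))
  list(variable = names(r2)[which.max(r2)], r2 = max(r2), all = r2)
}

# Toy data (assumed): `good` separates the two clusters, `noise` does not
x <- c(0, 1, 2, 10, 11, 12)
d <- as.matrix(dist(x))^2
covs <- data.frame(good  = c("a", "a", "a", "b", "b", "b"),
                   noise = c("u", "v", "u", "v", "u", "v"))
best <- best_split(d, covs)
best$variable
best$all
```

Here `good` is selected, since it explains 150 of the total discrepancy of 154, while `noise` explains only a small share. A full tree would test the chosen split by permutation and, if it is significant, repeat the search within each resulting group.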

  74. Growing the tree
      R> dt <- disstree(mvad.lcs ~ male + Grammar + funemp + gcse5eq +
      +     fmpr + livboth, data = mvad, R = 5000)
      R> print(dt)
      Dissimilarity tree
       Global R2: 0.113
       |-- Root [ 712 ] var: 42.7
        |-> gcse5eq R2: 0.0821
         |-- no [ 452 ] var: 37.5
          |-> funemp R2: 0.0107
           |-- no [ 362 ] var: 35.9
            |-> male R2: 0.0123
             |-- no [ 146 ] var: 38.7
             |-- yes [ 216 ] var: 33.3
           |-- yes [ 90 ] var: 41.8
         |-- yes [ 260 ] var: 42.3
          |-> Grammar R2: 0.0534
           |-- no [ 183 ] var: 42.2
           |-- yes [ 77 ] var: 34.9

  75. Creating a Graphviz plot of the tree
      Using the simplified interface to generate a file for GraphViz:
      R> seqtree2dot(dt, "fg_mvadseqtree", seqdata = mvad.seq, type = "d",
      +     border = NA, withlegend = FALSE, axes = FALSE, ylab = "",
      +     yaxis = FALSE)

  76. Graphical Tree
      [Figure: graphical rendering of the tree]

  77. Outline
      1 Dissimilarities among pairs of state sequences
      2 Mining event sequences
      3 Conclusion: Sequence of analyses

  78. Section outline
      2 Mining event sequences
          Event sequences
          Creating event subsequences in TraMineR
          Seeking frequent and discriminant subsequences
          Looking for state patterns
          Looking for specific subsequences
          Temporal constraints

  79. Analysis of event sequences
      Objective: focus on events rather than states. The interest lies in the patterns of events.
      Pattern of events: events that systematically occur together and in the same order.
      Are there typical “patterns” of events?
      Relationship with covariates: which patterns best discriminate specific groups? For example, typical differences in event sequences between men and women.
      Event patterns vs. typical state sequencing.
      Association rules between event subsequences: for example, the sequence Leaving home → Childbirth is generally followed by Marriage → Second Childbirth. Not yet available, but coming soon.
