


  1. Mining Event Histories: Some New Insights on Personal Swiss Life Courses
     Gilbert Ritschard
     Dept of Econometrics and Laboratory of Demography, University of Geneva
     http://mephisto.unige.ch
     PaVie Seminar, Lausanne, October 22, 2008
     21/10/2008gr 1/95

     My talk is about life courses, so let me start with an example of a scientific life course.

     date        event
     1970-1979   Studies in econometrics
     1980-1992   Mathematical Economics
     1985-...    Work with Social scientists (Family studies)
                 Interest in Statistics for social sciences
     1990-1995   Interest in Neural Networks
     2000-...    KDD and data mining (Clustering, supervised learning)
     2003-...    Work with historians, demographers, psychologists (longitudinal data)
     2005-...    KDD and Data mining approaches for analysing life course data
     2007-...    Start a SNF project on “Mining Event Histories”

     Outline
     1 Sequence Analysis in Social Sciences
     2 Survival Trees
     3 Characterizing, rendering and clustering sequence data
     4 Mining Frequent Episodes

     Sequence Analysis in Social Sciences: Motivation

     Motivation
     Individual life course paradigm.
     - Following macro quantities (e.g. #divorces, fertility rate, mean education level, ...) over time insufficient for understanding social behavior.
     - Need to follow individual life courses.
     Data availability
     - Large panel surveys in many countries (SHP, CHER, SILC, GGP, ...)
     - Biographical retrospective surveys (FFS, ...).
     - Statistical matching of censuses, population registers and other administrative data.

     Motivation
     Need for suited methods for discovering interesting knowledge from these individual longitudinal data.
     Social scientists use
     - Essentially Survival analysis (Event History Analysis)
     - More rarely sequential data analysis (Optimal Matching, Markov Chain Models)
     Could social scientists benefit from data-mining approaches?
     - Which methods?
     - Are there specific issues with those methods for social scientists?

     Motivation: KD in Social sciences
     In KDD (Knowledge discovery in databases) and data mining, focus on prediction and classification.
     - Improve prediction and classification errors.
     In Social science, aim is understanding/explaining (social) behaviors.
     - Hence focus is on process rather than output.

  2. Sequence Analysis in Social Sciences: What kind of data?

     What kind of data?
     What kind of data are we dealing with?
     - Mainly categorical longitudinal data describing life courses
     - Data can be in different forms ...

     Ontology of longitudinal data (Aristotelean tree)
     [Figure: tree diagram. Recoverable node labels: Longitudinal data; States; one state per time unit t; several states at each t; Events; time stamped events; event sequence; spell duration.]

     Alternative views of Individual Longitudinal Data

     Table: Time stamped events, record for Sandra
     ending secondary school in 1970
     first job in 1971
     marriage in 1973

     Table: State sequence view, Sandra
     year              1969     1970       1971       1972       1973
     civil status      single   single     single     single     married
     education level   primary  secondary  secondary  secondary  secondary
     job               no       no         first      first      first

     Transforming time stamped events into state sequences
     Example: the “BioFam” data
     - Data from the retrospective survey conducted in 2002 by the Swiss Household Panel (SHP) (with support of Federal Statistical Office, Swiss National Fund for Scientific Research, University of Neuchatel.)
     - Retrospective survey: 5560 individuals
     - Retained familial life events: Leaving Home, First childbirth, First marriage and First divorce.
     - Age 15 to 45 → 2601 remaining individuals, born between 1909 and 1957.

     Creating state sequences
     Example of time stamped data:

     individual  LHome  marriage  childbirth  divorce
     1           1989   1990      1992        NA

     Deriving the states
     Need one state for each combination of events:

     state  LHome   marriage  childbirth  divorce
     0      no      no        no          no
     1      yes     no        no          no
     2      no      yes       yes/no      no
     3      yes     yes       no          no
     4      no      no        yes         no
     5      yes     no        yes         no
     6      yes     yes       yes         no
     7      yes/no  yes       yes/no      yes
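     The state coding above can be sketched in a few lines of Python. This is only an illustration, not the author's implementation: the function names `family_state` and `events_to_states` are hypothetical, and the rule that an event counts from the year after it occurs is an assumption, chosen because it reproduces the deck's example sequence for individual 1 (0 0 1 3 3 6 over 1988-1993).

```python
def family_state(lhome, marriage, childbirth, divorce):
    """Map a combination of experienced events to a state code 0-7,
    following the state table in the slides."""
    if divorce:
        return 7  # divorced, whatever LHome and childbirth are
    if marriage:
        if not lhome:
            return 2  # married, has not left home, with or without child
        return 6 if childbirth else 3
    # Not married: the four LHome x childbirth combinations
    codes = {(False, False): 0, (True, False): 1,
             (False, True): 4, (True, True): 5}
    return codes[(lhome, childbirth)]


def events_to_states(event_years, years):
    """Turn time stamped events (None for NA) into one state per year.

    Assumption: an event dated in year y is reflected in the state
    from year y + 1 onwards.
    """
    states = []
    for year in years:
        occurred = {name: (y is not None and y < year)
                    for name, y in event_years.items()}
        states.append(family_state(occurred["LHome"], occurred["marriage"],
                                   occurred["childbirth"], occurred["divorce"]))
    return states


individual_1 = {"LHome": 1989, "marriage": 1990, "childbirth": 1992, "divorce": None}
print(events_to_states(individual_1, range(1988, 1994)))  # [0, 0, 1, 3, 3, 6]
```

     Keeping the table lookup (`family_state`) separate from the timing rule (`events_to_states`) makes it easy to swap in a different convention, for example counting an event from its own year.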

  3. Sequence Analysis in Social Sciences: What kind of data?

     From events to states
     Example of transformation:

     events:
     individual  LHome  marriage  childbirth  divorce
     1           1989   1990      1992        NA

     states:
     individual  ...  1988  1989  1990  1991  1992  1993  ...
     1           ...  0     0     1     3     3     6     ...

     Sequence Analysis in Social Sciences: Issues with life course data

     Issues with life course data
     Incomplete sequences
     - Censored and truncated data: Cases falling out of observation before experiencing an event of interest.
     - Sequences of varying length.
     Time varying predictors.
     - Example: When analysing time to divorce, presence of children is a time varying predictor.
     Data collected by clusters
     - Example: Household panel surveys.
     - Multi-level analysis to account for unobserved shared characteristics of members of a same cluster.

     Multi-level: Simple linear regression example
     [Figure: Children (vertical axis) against Education (horizontal axis), with fitted lines y = 15.6 - 0.8 x, y = 12.5 - 0.8 x, y = 6.2 - 0.8 x and y = 3.2 + 0.2 x.]

     Sequence Analysis in Social Sciences: Methods for Longitudinal Data

     Classical statistical approaches: Survival Approaches
     Survival or Event history analysis (Blossfeld and Rohwer, 2002)
     - Focuses on one event.
     - Concerned with duration until event occurs or with hazard of experiencing event.
     Survival curves: Distribution of duration until event occurs
       $S(t) = p(T \ge t)$.
     Hazard models: Regression like models for $S(t, x)$ or hazard $h(t) = p(T = t \mid T \ge t)$
       $h(t, x) = g\bigl(t, \beta_0 + \beta_1 x_1 + \beta_2 x_2(t) + \cdots\bigr)$.
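     To make the two definitions concrete, here is a minimal discrete-time life-table sketch in Python. It estimates the hazard h(t) = p(T = t | T ≥ t) as events at t divided by cases still at risk at t, and builds S(t) = p(T ≥ t) from the recursion S(t+1) = S(t)(1 - h(t)). The function name `life_table`, the toy data, and the convention that a censored case stays at risk through its last observed time unit are all assumptions made for illustration.

```python
def life_table(durations, observed):
    """Return {t: (hazard, survival)} for t = 1..max(durations).

    durations[i]: last time unit at which case i was observed
    observed[i]:  True if the event occurred at durations[i],
                  False if the case was censored there
    """
    table = {}
    survival = 1.0  # S(1) = p(T >= 1) = 1
    for t in range(1, max(durations) + 1):
        at_risk = sum(1 for d in durations if d >= t)
        events = sum(1 for d, o in zip(durations, observed) if d == t and o)
        hazard = events / at_risk  # at_risk >= 1 because t <= max(durations)
        table[t] = (hazard, survival)
        survival *= 1.0 - hazard  # S(t+1) = S(t) * (1 - h(t))
    return table


# Hypothetical toy data: five cases, two of them censored
durations = [2, 3, 3, 5, 4]
observed = [True, True, False, True, False]
for t, (h, s) in life_table(durations, observed).items():
    print(t, round(h, 3), round(s, 3))
```

     The censored cases contribute to the risk set but never to the event count, which is what lets incomplete sequences be used rather than discarded.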
     Survival curves (Switzerland, SHP 2002 biographical survey)
     [Figure: Survival probability (vertical axis, 0 to 1) against AGE (years) (horizontal axis, 0 to 80), panel for Women. Curves: Leaving home, Marriage, 1st Childbirth, Parents' death, Last child left, Divorce, Widowing.]

     Analysis of sequences
     Frequencies of given subsequences
     - Essentially event sequences, e.g. (First job → Marriage).
     - Subsequences considered as categories ⇒ Methods for categorical data apply (Frequencies, cross tables, log-linear models, logistic regression, ...).
     Markov chain models
     - State sequences.
     - Focuses on transition rates between states.
     - Does the rate also depend on previous states? How many previous states are significant?
     Optimal Matching (Abbott and Forrest, 1986)
     - State sequences.
     - Edit distance (Levenshtein, 1966; Needleman and Wunsch, 1970) between pairs of sequences.
     - Clustering of sequences.
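     The edit distance behind optimal matching can be sketched with the standard dynamic program. This is a generic Levenshtein-style computation with a constant substitution cost, not the specific cost setting of any cited paper; the function name `om_distance` and the default costs (indel = 1, substitution = 2) are assumptions chosen for illustration.

```python
def om_distance(a, b, indel=1.0, sub=2.0):
    """Minimal total cost of turning state sequence a into b using
    insertions, deletions (cost indel each) and substitutions (cost sub)."""
    n, m = len(a), len(b)
    # prev[j]: cost of turning the first i-1 states of a into the first j of b
    prev = [j * indel for j in range(m + 1)]
    for i in range(1, n + 1):
        curr = [i * indel] + [0.0] * m
        for j in range(1, m + 1):
            match_cost = 0.0 if a[i - 1] == b[j - 1] else sub
            curr[j] = min(prev[j] + indel,           # delete a[i-1]
                          curr[j - 1] + indel,       # insert b[j-1]
                          prev[j - 1] + match_cost)  # match or substitute
        prev = curr
    return prev[m]


# Two state sequences coded 0-7 (the second one is hypothetical)
print(om_distance([0, 0, 1, 3, 3, 6], [0, 1, 3, 6]))  # 2.0
```

     Computing this distance for every pair of sequences gives a dissimilarity matrix, which is the input a clustering step would work from.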
