Ordering comparisons
Comparing distributions: Part 4
R.W. Oldford
More than two distributions

Often there are more than two distributions to compare at once. For example, in the SAheart data, we looked at systolic blood pressure (sbp) separated first by family history (famhist) and then by whether the patients in fact had coronary heart disease or not (chd == 1 or chd == 0).

    library(ElemStatLearn)  # provides the SAheart data
    cols <- adjustcolor(c("firebrick", "steelblue"), 0.5)
    # Two panels side by side; remember the old settings
    savePar <- par(mfrow = c(1, 2))
    boxplot(sbp ~ famhist, data = SAheart, col = cols, main = "family history")
    boxplot(sbp ~ chd, data = SAheart, col = cols, main = "Coronary heart disease")
    par(savePar)
[Figure: side-by-side boxplots of sbp (axis roughly 100 to 220). Left panel, "family history": Absent, Present. Right panel, "Coronary heart disease": 0, 1.]
How might we express this mathematically?
Each pair of boxplots is a different comparison and is modelled separately. A separate mathematical representation would be given for each comparison.

Family history only model:

$$y_{ik} = \mu + \alpha_i + r_{ik}, \quad k = 1, \ldots, n_i, \quad i = \begin{cases} 1 & \text{if famhist = Present} \\ 2 & \text{if famhist = Absent} \end{cases}$$

where $n_i$ is the number of patients in group $i$. (And usually $\alpha_+ = \sum_i \alpha_i = 0$.)

Coronary heart disease only model:

$$y_{jk} = \mu + \beta_j + r_{jk}, \quad k = 1, \ldots, n_j, \quad j = \begin{cases} 1 & \text{if chd = 1} \\ 2 & \text{if chd = 0} \end{cases}$$

where $n_j$ is the number of patients in group $j$. (And usually $\beta_+ = \sum_j \beta_j = 0$.)

Each model matches one pair of boxplots.
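To see how the pieces of a one-factor model relate to the data, the following is a minimal sketch in base R. The readings are made-up illustrative values (not taken from SAheart), and the estimates assume the unweighted constraint that the group effects sum to zero.

```r
# Hypothetical readings for two groups (illustrative values only, not SAheart data)
y <- c(130, 142, 150, 158, 118, 126, 134, 138)
g <- factor(rep(c("Present", "Absent"), each = 4),
            levels = c("Present", "Absent"))   # i = 1 is Present, i = 2 is Absent

group_means <- tapply(y, g, mean)   # one mean per group
mu    <- mean(group_means)          # overall level: unweighted mean of the group means
alpha <- group_means - mu           # group effects; these sum to zero by construction

# Residuals r_ik: what is left after removing mu + alpha_i
resid <- as.numeric(y - (mu + alpha[as.character(g)]))

sum(alpha)                          # zero, up to rounding
```

Under this constraint, `mu` is the centre between the two boxplots and each `alpha` is how far a group sits above or below it; the residuals carry the within-group spread.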
We might want to compare these at the same time, that is, four distributions at once.

    boxplot(sbp ~ famhist + chd, data = SAheart, col = cols,
            main = "Family history and Coronary heart disease")

[Figure: boxplots of sbp (axis roughly 100 to 220) titled "Family history and Coronary heart disease", for the groups Absent.0, Present.0, Absent.1, Present.1.]

How is this different from the first? Which comparisons are of interest?
The corresponding mathematical representation would model family history and coronary heart disease together:

$$y_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{ij} + r_{ijk}, \quad k = 1, \ldots, n_{ij}$$

where, as before,

$$i = \begin{cases} 1 & \text{if famhist = Present} \\ 2 & \text{if famhist = Absent} \end{cases} \quad \text{and} \quad j = \begin{cases} 1 & \text{if chd = 1} \\ 2 & \text{if chd = 0} \end{cases}$$

and $n_{ij}$ is the number of patients in group $(i, j)$. As before we require $\alpha_+ = \beta_+ = 0$ and additionally that $\gamma_{+j} = \gamma_{i+} = 0$.

How would you compare the chd groups among those having famhist? How would you compare those with famhist to those without, given that they have chd? What is $\gamma_{ij}$? What does it mean if $\gamma_{ij} = 0$?
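One way to get a feel for $\gamma_{ij}$ is to decompose a 2 x 2 table of cell means by hand. The sketch below uses hypothetical cell means (illustrative numbers, not computed from SAheart) and the unweighted sum-to-zero constraints stated above.

```r
# Hypothetical cell means of sbp (illustrative values only, not from SAheart)
# rows: famhist (i = 1 Present, i = 2 Absent); columns: chd (j = 1 for chd = 1, j = 2 for chd = 0)
cell <- matrix(c(146, 136,
                 142, 128),
               nrow = 2, byrow = TRUE,
               dimnames = list(famhist = c("Present", "Absent"),
                               chd = c("1", "0")))

mu    <- mean(cell)                            # overall level
alpha <- rowMeans(cell) - mu                   # famhist effects, sum to zero
beta  <- colMeans(cell) - mu                   # chd effects, sum to zero
gamma <- cell - mu - outer(alpha, beta, "+")   # interaction: what additivity misses

gamma
rowSums(gamma)   # zero for each i
colSums(gamma)   # zero for each j
```

If every entry of `gamma` were zero, the chd difference would be the same size in both famhist groups (the effects would simply add). Here the chd difference is 10 for Present and 14 for Absent, and that mismatch is what `gamma` records.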
Visual comparisons: most easily made between adjacent displays (here boxplots).

[Figure: boxplots titled "family history and chd" in the order Absent.0, Present.0, Absent.1, Present.1, with the same four groups drawn as nodes of a graph joined along the layout.]

The positions in the layout of the displays can be thought of as a graph where the nodes are the displays (groups) and the edges are the immediate adjacencies between displays, where the pairwise comparisons are most reliably made.

In this layout, unfortunately, the adjacent displays are not always those whose comparisons are of most interest. One is of no real interest at all.
Visual comparisons: at left is the layout as given; at right are the comparisons of interest.

[Figure: at left, boxplots titled "family history and chd" in the order Absent.0, Present.0, Absent.1, Present.1; at right, a graph on the nodes Present.1, Present.0, Absent.1, Absent.0 whose edges are the comparisons of interest.]

We could produce a new layout by simply starting at one node of the graph and moving along the edges to every other node. If we visit every edge, we have every comparison of interest.
We can construct the display as follows:

    library(knitr)  # for kable()
    # Get the groups
    groups <- with(SAheart, split(sbp, list(famhist, chd)))
    # with names
    kable(t(names(groups)))

    Absent.0  Present.0  Absent.1  Present.1

Now order the groups

    # Put the names in the desired order
    ord <- c("Present.0", "Present.1", "Absent.1", "Absent.0", "Present.0")

and display the boxplots in that order

    # Match the colours to the family history
    cols <- adjustcolor(c("steelblue", "steelblue", "firebrick",
                          "firebrick", "steelblue"), 0.5)
    # Create the display according to the layout.
    boxplot(groups[ord], col = cols)
[Figure: boxplots of sbp (axis roughly 100 to 220) in the order Present.0, Present.1, Absent.1, Absent.0, Present.0, with the same sequence drawn as a path of nodes above.]

All pairwise comparisons of interest are adjacent. No comparison that is not of interest is an adjacent pair. (N.B. one of the groups appears twice.)

It would be nice if we had some automated way of laying out displays.
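The claim that the new order shows exactly the comparisons of interest can be checked mechanically. The sketch below is one way to do it in base R; `adjacent_pairs` is a hypothetical helper written for this illustration, and the set `wanted` assumes the comparisons of interest are those that change exactly one of famhist or chd.

```r
# Node order used in the display
ord <- c("Present.0", "Present.1", "Absent.1", "Absent.0", "Present.0")

# Hypothetical helper: the unordered pairs that sit side by side in a sequence
adjacent_pairs <- function(s) {
  left  <- head(s, -1)
  right <- tail(s, -1)
  first  <- ifelse(left < right, left, right)   # put each pair in a fixed order
  second <- ifelse(left < right, right, left)
  unique(paste(first, second, sep = " | "))
}

# Assumed comparisons of interest: groups differing in exactly one factor
wanted <- c("Present.0 | Present.1", "Absent.1 | Present.1",
            "Absent.0 | Absent.1",   "Absent.0 | Present.0")

setequal(adjacent_pairs(ord), wanted)   # TRUE: the new order shows exactly these pairs

# The default layout fails the same check
default_order <- c("Absent.0", "Present.0", "Absent.1", "Present.1")
setequal(adjacent_pairs(default_order), wanted)   # FALSE
```

The default layout places Present.0 next to Absent.1 (two groups that differ in both factors) and never places Absent.0 next to Absent.1 or Present.0 next to Present.1.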
Layout via graph structure

A graph $G = (V, E)$, where $V = \{1, 2, \ldots, k\}$ is a set of vertices (or nodes) and

$$E = \{(i, j) : i \in V_b \subset V,\ j \in V_e \subset V,\ \text{and } i \neq j\}$$

is a set of ordered pairs of nodes representing the edges of the graph.

The graph is undirected if whenever $(i, j) \in E$ so too is $(j, i) \in E$; that is, the edge $(i, j) = (j, i)$ is an unordered pair of vertices.

A complete graph on $k$ nodes (below, $k = 5$ and the nodes are labelled $X_1, \ldots, X_5$) is of special interest since it has an (undirected) edge between every pair of nodes in the graph.

Every pairwise comparison is represented by an edge, and every ordering of the nodes by a path travelling along the edges of the graph.
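As a small illustration, the edge set of the complete graph can be listed directly in base R. This is only a sketch; the variable names are chosen for the example.

```r
k <- 5
nodes <- paste0("X", seq_len(k))   # node labels X1, ..., X5

# One row per unordered pair of distinct nodes, i.e. one row per edge
edges <- t(combn(nodes, 2))
colnames(edges) <- c("from", "to")

nrow(edges)        # number of edges in the complete graph
choose(k, 2)       # the same count, k * (k - 1) / 2
```

Each row of `edges` is one pairwise comparison, so the complete graph on five nodes represents all ten comparisons among five groups.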
For example, a path that visits every node exactly once is called a Hamiltonian path (or a Hamiltonian cycle if it returns to the original node). (Also, a path uses no edge more than once.)

The path provides an ordering of the nodes (which might, in turn, represent displays).
We could continue this path and make sure we visit all edges:

The first five figures above identify a second Hamiltonian path. Returning to the original node in the last figure covers all edges exactly once.

This is called an Eulerian path (or Eulerian cycle since it returns to the original node; sometimes an Eulerian tour), or simply an Eulerian.

By visiting every edge, an Eulerian ensures that every pairwise comparison (of interest, as defined by the graph) appears once in the order of nodes.
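Whether such a closed tour exists at all depends on the graph: by Euler's classical result, a connected graph has an Eulerian cycle exactly when every node has even degree. The sketch below checks that condition for complete graphs in base R; `all_degrees_even` is a hypothetical helper written for this illustration.

```r
# Hypothetical helper: does every node of the complete graph on k nodes have even degree?
all_degrees_even <- function(k) {
  edges <- t(combn(k, 2))                                # all unordered pairs of 1..k
  deg <- table(factor(c(edges), levels = seq_len(k)))    # number of edges at each node
  all(deg %% 2 == 0)
}

all_degrees_even(5)   # each node has degree 4
all_degrees_even(4)   # each node has degree 3
```

In a complete graph every node has degree k - 1, so the condition holds when k is odd, as with the five-node graph here. When k is even, no closed tour can use every edge exactly once, and some comparisons would have to be repeated.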
Example: Patients having advanced cancers of one of five major organs (stomach, bronchus, colon, ovary, or breast) were treated with Vitamin C (ascorbate). Interest lies in understanding whether patient survival time (in days, it seems) is different depending on the organ affected by cancer.

    library("PairViz")
    data(cancer)  # Need this step to load the data
    str(cancer)   # Summary of structure of the data

    ## 'data.frame': 64 obs. of 2 variables:
    ## $ Survival: int 124 42 25 45 412 51 1112 46 103 876 ...
    ## $ Organ   : Factor w/ 5 levels "Breast","Bronchus",..: 5 5 5 5 5 5 5 5 5 5 ...

    # We can separate the survival times by which organ is affected
    organs <- with(cancer, split(Survival, Organ))
    # And record their names for use later!
    organNames <- names(organs)
    # the structure of the organs data
    str(organs)

    ## List of 5
    ## $ Breast  : int [1:11] 1235 24 1581 1166 40 727 3808 791 1804 3460 ...
    ## $ Bronchus: int [1:17] 81 461 20 450 246 166 63 64 155 859 ...
    ## $ Colon   : int [1:17] 248 377 189 1843 180 537 519 455 406 365 ...
    ## $ Ovary   : int [1:6] 1234 89 201 356 2970 456
    ## $ Stomach : int [1:13] 124 42 25 45 412 51 1112 46 103 876 ...
Suppose we would like to compare every organ type with every other. To find an order, we need to find an Eulerian for a complete graph of k = 5 nodes. This can be found in PairViz as follows:

    library(PairViz)
    ord <- eulerian(5)
    ord

    ## [1] 1 2 3 1 4 2 5 3 4 5 1

which allows us to create the boxplots:

    library(colorspace)
    cols <- rainbow_hcl(5, c = 50)  # choose a chroma of 50 to dull the colours
    boxplot(organs[ord], col = cols[ord],
            ylab = "Survival time", main = "Cancer treated by vitamin C")
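It is easy to confirm that this order really does place every pair of groups side by side exactly once. The following is a minimal base R check, with the sequence copied from the printed output rather than recomputed (so PairViz is not needed to run it).

```r
ord <- c(1, 2, 3, 1, 4, 2, 5, 3, 4, 5, 1)   # copied from the printed output

left  <- head(ord, -1)
right <- tail(ord, -1)
visited <- paste(pmin(left, right), pmax(left, right))   # unordered adjacent pairs

all_pairs <- apply(combn(5, 2), 2, paste, collapse = " ")  # every pair of 1..5

length(visited)                  # 10 adjacent pairs in the display
anyDuplicated(visited) == 0      # TRUE: no pair appears twice
setequal(visited, all_pairs)     # TRUE: every pair appears
ord[1] == ord[length(ord)]       # TRUE: the tour returns to its starting node
```

Eleven boxplots are therefore enough to show all ten pairwise comparisons among the five organs, at the cost of drawing some groups more than once.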
Comparing survival times for vitamin C treated cancer in major organs.

[Figure: boxplots titled "Cancer treated by vitamin C", y-axis "Survival time" (0 to 3000), in the order Breast, Bronchus, Colon, Breast, Ovary, Bronchus, Stomach, Colon, Ovary, Stomach, Breast.]

Every pair of organs appears adjacent somewhere in the display.

Taking square roots should make the data look a little less asymmetric.
Comparing square roots of survival times for vitamin C treated cancer in major organs.

    # Split the data
    sqrtOrgans <- with(cancer, split(sqrt(Survival), Organ))
    boxplot(sqrtOrgans[ord], col = cols[ord],
            ylab = expression(sqrt("Survival time")),
            main = "Cancer treated by vitamin C")

Again, every pair of organs will appear adjacent somewhere in the display.
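The effect of the square root on asymmetry can be illustrated numerically on the one group whose values are fully listed in the `str(organs)` output, the six Ovary survival times. This is only a sketch: `skewness` is a hypothetical helper using a simple moment-based measure, and six observations are far too few to draw conclusions from.

```r
# Ovary survival times, copied from the str(organs) output
ovary <- c(1234, 89, 201, 356, 2970, 456)

# Hypothetical helper: moment-based skewness (zero for a perfectly symmetric sample)
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

c(raw = skewness(ovary), sqrt = skewness(sqrt(ovary)))
```

Both values are positive (a long right tail), but the value after taking square roots is smaller, which is the sense in which the transformed data look less asymmetric.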