Ordering comparisons Comparing distributions: Part 4 R.W. Oldford - - PowerPoint PPT Presentation
Ordering comparisons Comparing distributions: Part 4 R.W. Oldford - - PowerPoint PPT Presentation
Ordering comparisons Comparing distributions: Part 4 R.W. Oldford More than two distributions Often there are more than two distributions to compare at once. For example, in the SAheart data, we looked at systolic blood pressure separated first
More than two distributions
Often there are more than two distributions to compare at once. For example, in the SAheart data, we looked at systolic blood pressure separated first by family history (famhist) and then by whether they in fact had coronary heart disease or not (chd == 1 or chd==0).
library(ElemStatLearn) cols <- adjustcolor(c("firebrick", "steelblue"), 0.5) savePar <- par(mfrow=c(1,2)) boxplot(sbp ~ famhist , data=SAheart, col=cols, main="family history") boxplot(sbp ~ chd, data=SAheart, col=cols, main="Coronary heart disease") par(savePar)
More than two distributions
For example, in the SAheart data, we looked at systolic blood pressure separated first by family history (famhist) and then by whether they in fact had coronary heart disease or not (chd == 1 or chd==0).
Absent Present 100 120 140 160 180 200 220
family history
1 100 120 140 160 180 200 220
Coronary heart disease
More than two distributions
How might we express this mathematically?
Absent Present 100 120 140 160 180 200 220
family history
1 100 120 140 160 180 200 220
Coronary heart disease
More than two distributions
Each pair of boxplots is a different comparison and are modelled separately. A separate mathematical representation would be given for each comparison. Family history only model: yik = µ+αi+rik with k = 1, . . . , ni and i =
- 1
if famhist = Present 2 if famhist = Absent and ni is the number of patients in group i. (And usually α+ =
i αi = 0.)
More than two distributions
Each pair of boxplots is a different comparison and are modelled separately. A separate mathematical representation would be given for each comparison. Family history only model: yik = µ+αi+rik with k = 1, . . . , ni and i =
- 1
if famhist = Present 2 if famhist = Absent and ni is the number of patients in group i. (And usually α+ =
i αi = 0.)
Chronic heart disease only model: yjk = µ + βj + rjk with k = 1, . . . , nj and j =
- 1
if chd = 1 2 if chd = 0 and nj is the number of patients in group j.(And usually β+ =
j βj = 0.)
Each model matches one pair of boxplots.
More than two distributions
We might want to compare these at the same time, or 4 distributions at once.
boxplot(sbp ~ famhist + chd , data=SAheart, col=cols, main="Family history and Coronary heart disease")
Absent.0 Present.0 Absent.1 Present.1 100 120 140 160 180 200 220
Family history and Coronary heart disease How is this different from the first? Which comparisons are of interest?
More than two distributions
The corresponding mathematical representation would model Family and chronic heart disease together: yijk = µ + αi + βj + γij + rijk with k = 1, . . . , nij where as before i =
- 1
if famhist = Present 2 if famhist = Absent and j =
- 1
if chd = 1 2 if chd = 0 and nij is the number of patients in group (i, j). As before we require α+ = β+ = 0 and additionally that γ+j = γi+ = 0. How would you compare chd groups having famhist? Those with famhist to those without given they have chd? What is γij? What does it mean if γij = 0?
More than two distributions
Visual comparisons: most easily made between adjacent displays (here boxplots)
Absent.0 Present.0 Absent.1 Present.1 100 120 140 160 180 200 220
family history and chd
Absent.0 Present.0 Absent.1 Present.1
The positions in the layout of the displays can be thought of as a graph where the nodes are the displays (groups) and the edges the immediate adjacency between displays where the pairwise comparisons are most reliably made. In this layout, unfortunately, the adjacent displays are not always those whose comparisons are of most interest. One is of no real interest at all.
More than two distributions
Visual comparisons: At left is the layout as given; at right are the comparisons of interest
Absent.0 Present.0 Absent.1 Present.1 100 120 140 160 180 200 220
family history and chd
Absent.0 Present.0 Absent.1 Present.1
We could produce a new layout by simply starting at one node of the graph and moving along the edges to every other node. If we visit every edge, we have every comparison
- f interest.
More than two distributions
We can construct the display as follows:
# Get the groups groups <- with(SAheart, split(sbp, list(famhist, chd))) # with names kable(t(names(groups))) Absent.0 Present.0 Absent.1 Present.1
Now order the groups
# Put the names in the desired order
- rd <- c("Present.0", "Present.1", "Absent.1",
"Absent.0", "Present.0")
and display the boxplots in that order
# Match the colours to the family history cols <- adjustcolor(c("steelblue", "steelblue", "firebrick", "firebrick", "steelblue"), 0.5) # Create the display according to the layout. boxplot(groups[ord], col=cols)
More than two distributions
Present.0 Present.1 Absent.1 Absent.0 Present.0
Present.0 Present.1 Absent.1 Absent.0 Present.0 100 120 140 160 180 200 220
All pairwise comparisons of interest are adjacent. No comparison that is not of interest is an adjacent pair.
(N.B. one of the groups appears twice.)
It would be nice if we had some automated way of laying out displays.
Layout via graph structure
A graph G = (V , E) where V = {1, 2, ..., k} is a set of vertices (or nodes) and E = {(i, j) : i ∈ Vb ⊂ V , j ∈ Ve ⊂ V , and i = j} is a set of ordered pairs of nodes representing the edges of the graph. The graph is undirected if whenever (i, j) ∈ E so too is (j, i) ∈ E; that is the edge (i, j) = (j, i) is an unordered pair of vertices. A complete graph on k = 5 nodes (below k = 5 and nodes are labelled X1, . . . , X5) is of special interest since it has (undirected) an edge between every pair of nodes in the graph. Every pairwise comparison is represented by an edge, every ordering of the nodes by a path travelling along the edges of the graph.
Layout via graph structure
For example, a path that visits every node exactly once is called a Hamiltonian path (or a Hamiltonian cycle if it returns to the original node). (Also a path uses no edge more than once.) The path provides an ordering of the nodes (which might, in turn, represent displays).
Layout via graph structure
We could continue this path and make sure we visit all edges: The first five figures above identify a second Hamiltonian path. Returning to the
- riginal node in the last figure covers all edges exactly once. This is called an Eulerian
path (or Eulerian cycle since it returns to the original node; sometimes an Eulerian tour), or simply an Eulerian. By visiting every edge, an Eulerian ensures that every pairwise comparison (of interest, as defined by the graph) appears once in the order of nodes.
Layout via graph structure
Example: Patients having advanced cancers of one of five major organs (stomach, bronchus, colon, ovary, or breast) were treated with Vitamin C (ascorbate). Interest lies in understanding whether patient survival time (in days, it seems) is different depending on the organ affected by cancer.
library("PairViz") data(cancer) # Need this step to load the data str(cancer) # Summary of structure of the data ## 'data.frame': 64 obs. of 2 variables: ## $ Survival: int 124 42 25 45 412 51 1112 46 103 876 ... ## $ Organ : Factor w/ 5 levels "Breast","Bronchus",..: 5 5 5 5 5 5 5 5 5 5 ... # We can separate the survival times by which organ is affected
- rgans <- with(cancer, split(Survival, Organ))
# And record their names for use later!
- rganNames <- names(organs)
# the structure of the organs data str(organs) ## List of 5 ## $ Breast : int [1:11] 1235 24 1581 1166 40 727 3808 791 1804 3460 ... ## $ Bronchus: int [1:17] 81 461 20 450 246 166 63 64 155 859 ... ## $ Colon : int [1:17] 248 377 189 1843 180 537 519 455 406 365 ... ## $ Ovary : int [1:6] 1234 89 201 356 2970 456 ## $ Stomach : int [1:13] 124 42 25 45 412 51 1112 46 103 876 ...
Layout via graph structure
Suppose we would like to compare every organ type with every other. To find an
- rder, we need to find an Eulerian for a complete graph of k = 5 nodes.
This can be found in PairViz as follows: library(PairViz)
- rd <- eulerian(5)
- rd
## [1] 1 2 3 1 4 2 5 3 4 5 1 which allows us to create the boxplots:
library(colorspace) cols <- rainbow_hcl(5, c = 50) # choose chromaticity of 50 to dull colours boxplot(organs[ord], col=cols[ord], ylab="Survival time", main="Cancer treated by vitamin C")
Layout via graph structure
Comparing survival times for vitamin C treated cancer in major organs.
Breast Bronchus Colon Breast Ovary Bronchus Stomach Colon Ovary Stomach Breast 1000 2000 3000
Cancer treated by vitamin C
Survival time
Every pair of comparisons appear adjacent to one another. Taking square roots should make the data look a little less asymmetric.
Layout via graph structure
Comparing √ Survival times for vitamin C treated cancer in major organs. # Split the data sqrtOrgans <- with(cancer, split(sqrt(Survival), Organ)) boxplot(sqrtOrgans[ord], col=cols[ord], ylab=expression(sqrt("Survival time")), main="Cancer treated by vitamin C") Again, every pair of comparisons will appear adjacent to one another.
Layout via graph structure
Comparing √ Survival times for vitamin C treated cancer in major organs.
Breast Bronchus Colon Breast Ovary Bronchus Stomach Colon Ovary Stomach Breast 10 20 30 40 50 60
Cancer treated by vitamin C
Survival time
Again, every pair of comparisons will appear adjacent to one another. Note that this Eulerian order does not show all organs (colours) in the first five boxplots. That is, it is not one Hamiltonian followed by another.
Layout via graph structure
We can ensure an Eulerian on a complete graph is composed of one Hamiltonian after another in PairViz as follows:
# Get a Hamiltonian decomposition
- rdHam <- hpaths(5)
# Here each Hamiltonian is a row of the matrix
- rdHam
## [,1] [,2] [,3] [,4] [,5] ## [1,] 1 2 3 5 4 ## [2,] 1 3 4 2 5 # For our purposes, we would like the Hamiltonians to be # joined together
- rdHam <- hpaths(5, matrix = FALSE)
# Here the Eulerian is composed of the two Hamiltonians
- rdHam
## [1] 1 2 3 5 4 1 3 4 2 5 1
The displays can be constructed from this order as well.
# boxplot(sqrtOrgans[ordHam], col=cols[ordHam], ylab=expression(sqrt("Survival time")), main="Cancer treated by vitamin C")
Layout via graph structure
We can ensure an Eulerian on a complete graph is composed of one Hamiltonian after another in PairViz as follows:
Breast Bronchus Colon Stomach Ovary Breast Colon Ovary Bronchus Stomach Breast 10 20 30 40 50 60
Cancer treated by vitamin C
Survival time
Again, every pair of comparisons will appear adjacent to one another. There are many possible orderings.
Layout via graph structure
We could, for example, run some statistical test comparing each pair and then use the
- bserved significance level for that test.
Suppose we test the equality of each pairwise means (assuming normal distributions of course). A function in R that accomplishes this (and which corrects the p-values for simultaneity,
though this doesn’t matter for our purpose of simply ordering) is pairwise.t.test. # Get the test results test <- with(cancer, pairwise.t.test(sqrt(Survival), Organ)) pvals <- test$p.value pvals ## Breast Bronchus Colon Ovary ## Bronchus 0.0002419522 NA NA NA ## Colon 0.0230079256 0.5087249 NA NA ## Ovary 0.7514312378 0.1702530 0.7514312 NA ## Stomach 0.0012609285 0.7765983 0.7514312 0.2933608
From this we can see that in testing the hypothesis that the mean √ Survival is identical for those with colon cancer and those with breast cancer, the observed significance level (or “p-value”) is about 0.023. This low value indicates evidence against the hypothesis that the means are identical. Note the shape of the output (compare row names to column names). This will require a little work to put it into a more useful form.
Layout via graph structure
Imagine the graph, with the boxplots (or organs) as nodes, and edges being the
- comparisons. We could assign weights to the edges of the graph that were identical to
the significance levels: We might then ask whether we could have a path that
◮ visited all edges but preferred to visit low-weight (or high weight) edges soonest (a
greedy Eulerian)
◮ this is particularly useful if not all edges can be visited (for whatever reason)
◮ visited every node but had the least (or most) total weight
Layout via graph structure
The PairViz package can handle graphs with weights, but we first need to get the weights into the right form. ngrps <- length(organNames) # Build the matrix of weights weights <- matrix(0, nrow=ngrps, ncol=ngrps) rownames(weights) <- organNames colnames(weights) <- organNames # Now insert the p-values into the lower triangle of the weight matrix. # (Note that you should check to see that the names match) weights[2:ngrps, 1:(ngrps-1)] <- pvals weights ## Breast Bronchus Colon Ovary Stomach ## Breast 0.0000000000 0.0000000 0.0000000 0.0000000 ## Bronchus 0.0002419522 NA NA NA ## Colon 0.0230079256 0.5087249 NA NA ## Ovary 0.7514312378 0.1702530 0.7514312 NA ## Stomach 0.0012609285 0.7765983 0.7514312 0.2933608
Layout via graph structure
Now need to fix up the diagonal and the upper triangular part diag(weights) <- 0 for (i in 1:(ngrps -1)) { for (j in (i+1):ngrps) { weights[i,j] <- weights[j,i] } } # The matrix of weights: weights ## Breast Bronchus Colon Ovary Stomach ## Breast 0.0000000000 0.0002419522 0.02300793 0.7514312 0.001260928 ## Bronchus 0.0002419522 0.0000000000 0.50872487 0.1702530 0.776598277 ## Colon 0.0230079256 0.5087248668 0.00000000 0.7514312 0.751431238 ## Ovary 0.7514312378 0.1702530406 0.75143124 0.0000000 0.293360755 ## Stomach 0.0012609285 0.7765982772 0.75143124 0.2933608 0.000000000 This matrix now holds the edge weights for the graph. For example, the edge from the node Bronchus to the node Ovary has weight 0.1702530. We can now use this as defining a weighted graph in PairViz and come up with some possibly different orderings.
Layout via graph structure
Breast Bronchus Colon Ovary Stomach Breast 0.0000 0.0002 0.0230 0.7514 0.0013 Bronchus 0.0002 0.0000 0.5087 0.1703 0.7766 Colon 0.0230 0.5087 0.0000 0.7514 0.7514 Ovary 0.7514 0.1703 0.7514 0.0000 0.2934 Stomach 0.0013 0.7766 0.7514 0.2934 0.0000
Eulerians: low2highEulord <- eulerian(weights); colnames(weights)[low2highEulord] ## [1] "Bronchus" "Breast" "Stomach" "Ovary" "Bronchus" "Colon" ## [7] "Breast" "Ovary" "Colon" "Stomach" "Bronchus" high2LowEulord <- eulerian(- weights); colnames(weights)[high2LowEulord] ## [1] "Bronchus" "Stomach" "Colon" "Ovary" "Breast" "Colon" ## [7] "Bronchus" "Ovary" "Stomach" "Breast" "Bronchus" Note that in this case, one is the reverse of the other. This need NOT be true in general.
Layout via graph structure
Breast Bronchus Colon Ovary Stomach Breast 0.0000 0.0002 0.0230 0.7514 0.0013 Bronchus 0.0002 0.0000 0.5087 0.1703 0.7766 Colon 0.0230 0.5087 0.0000 0.7514 0.7514 Ovary 0.7514 0.1703 0.7514 0.0000 0.2934 Stomach 0.0013 0.7766 0.7514 0.2934 0.0000
Hamiltonians: low2highHamord <- weighted_hpaths(weights); low2highHamord ## [,1] [,2] [,3] [,4] [,5] ## [1,] 5 1 3 2 4 ## [2,] 5 3 4 1 2 high2LowHamord <- weighted_hpaths(- weights); high2LowHamord ## [,1] [,2] [,3] [,4] [,5] ## [1,] 2 5 3 4 1 ## [2,] 2 3 1 5 4
Layout via graph structure
Eulerian boxplot(sqrtOrgans[low2highEulord], col=cols[low2highEulord], ylab=expression(sqrt("Survival time")), main="Cancer treated by vitamin C")
Bronchus Breast Stomach Ovary Bronchus Colon Breast Ovary Colon Stomach Bronchus 10 20 30 40 50 60
Cancer treated by vitamin C
Survival time
Layout via graph structure
Hamiltonian: lowest total weight boxplot(sqrtOrgans[low2highHamord[1,]], col=cols[low2highHamord[1,]], ylab=expression(sqrt("Survival time")), main="Cancer treated by vitamin C")
Stomach Breast Colon Bronchus Ovary 10 20 30 40 50 60
Cancer treated by vitamin C
Survival time
Layout via graph structure
Hamiltonian: highest total weight boxplot(sqrtOrgans[high2LowHamord[1,]], col=cols[high2LowHamord[1,]], ylab=expression(sqrt("Survival time")), main="Cancer treated by vitamin C")
Bronchus Stomach Colon Ovary Breast 10 20 30 40 50 60
Cancer treated by vitamin C
Survival time
Layout via graph structure
Could in fact work with any graph structure. For example, suppose we want to look at
- nly that half of the pairwise comparisons which are most significant.
# First get the complete graph as constructed from the weights: g <- mk_complete_graph(weights) plot(g, "circo")
Breast Bronchus Colon Ovary Stomach
Layout via graph structure
Now find that half of the pairwise comparisons which are most significant.
# First get the weights: wvals <- weights[upper.tri(weights)] # And the median q <- quantile(wvals, 0.5) # and now trim the graph via the PairViz function dn_graph() g1 <- dn_graph(g, q, `<=`) savePar <- par(oma=rep(6,4)) plot(g1, "circo") Breast Bronchus Ovary Colon Stomach par(savePar)
Layout via graph structure
Note that the subgraph g is not even (i.e. some nodes have an odd number of edges). It must be made even by adding extra edges if we are to get an Eulerian. (But not necessarily to get a Hamiltonian path.)
Breast Bronchus Ovary Colon Stomach # The eulerian: g1EulOrd <- eulerian(g1) g1EulOrd ## [1] "Bronchus" "Breast" "Colon" "Breast" "Stomach" "Ovary" ## [7] "Bronchus"
Layout via graph structure
Finding a Hamiltonian for any graph g is hard (NP-hard, NP-complete), so there is no function in PairViz that solves the problem. Luckily for this graph, there are only two solutions and both are obvious (and interchangeable).
Breast Bronchus Ovary Colon Stomach # The eulerian: g1HamOrd <- c("Colon", "Breast", "Stomach", "Ovary", "Bronchus", "Breast") g1HamOrd ## [1] "Colon" "Breast" "Stomach" "Ovary" "Bronchus" "Breast"
Layout via graph structure
“Eulerian” from g1 (with added edge). Note that we need to add the organNames to cols now since the paths are returned as a vector of strings and not numbers. names(cols) <- organNames boxplot(sqrtOrgans[g1EulOrd], col=cols[g1EulOrd], ylab=expression(sqrt("Survival time")), main="Most significant pairwise differences adjacent (Eulerian)")
Bronchus Breast Colon Breast Stomach Ovary Bronchus 10 20 30 40 50 60
Most significant pairwise differences adjacent (Eulerian)
Survival time
Layout via graph structure
g Hamiltonian (found by inspection) boxplot(sqrtOrgans[g1HamOrd], col=cols[g1HamOrd], ylab=expression(sqrt("Survival time")), main="Most significant pairwise differences adjacent (Hamiltonian)")
Colon Breast Stomach Ovary Bronchus Breast 10 20 30 40 50 60
Most significant pairwise differences adjacent (Hamiltonian)
Survival time
Layout via graph structure
Back to the SAheart data. Here we construct the graph directly:
library(PairViz) ; library(ElemStatLearn) groups <- with(SAheart, split(sbp, list(famhist, chd))) ## Build the graph nodes <- names(groups) SAheartGraph = new("graphNEL", nodes = nodes, edgemode = "undirected") SAheartGraph = addEdge("Present.1", "Present.0", SAheartGraph) SAheartGraph = addEdge("Present.1", "Absent.1", SAheartGraph) SAheartGraph = addEdge("Absent.1", "Absent.0", SAheartGraph) SAheartGraph = addEdge("Absent.0", "Present.0", SAheartGraph)
- rd <- eulerian(SAheartGraph)
- rd
cols <- adjustcolor(rep(c("firebrick", "steelblue"), 2), 0.5) names(cols) <- nodes savePar <- par(mfrow=c(1,2), oma=rep(6,4)) boxplot(groups[ord], col=cols[ord], main="family history and chd") plot(SAheartGraph, "circo") par(savePar)
Layout via graph structure
Back to the SAheart data. Here we construct the graph directly:
## [1] "Present.0" "Present.1" "Absent.1" "Absent.0" "Present.0"
Present.0 Present.1 Absent.1 Absent.0 Present.0 100 120 140 160 180 200 220
family history and chd
Absent.0 Present.0 Absent.1 Present.1
(Note that the colours here are chosen to be consistent with earlier presentations of the SAheart and do not encode the value of the node in the graph.)
Layout via graph structure
Note:
◮ pairwise comparisons are often best made when the displays are adjacent ◮ graph structure can be used to identify pairs of interest
◮ displays are nodes ◮ edges indicate pairs to be adjacent
◮ an Eulerian path will visit all edges (at least) once ◮ a Hamiltonian path will visit all nodes (displays) ◮ weights on edges can indicate importance/interest in those pairs
◮ could choose paths that have small (large) total weight