  1. “Concurrence Topology:” A Tool for Describing High-Order Statistical Dependence in Data Steven P. Ellis (mostly joint work with Arno Klein) 6/9/’14

  2. Abstract Data analytic methods possessing the following three features are desirable: (1) The method describes “high-order dependence” among variables. (2) It does so with few preconceptions. And (3) it can handle at least dozens, maybe hundreds, of variables. However, if approached in a naive fashion, data analysis having these three features triggers a “combinatorial explosion”: the output from the analysis can include thousands, maybe millions, of numbers. Few methods exist possessing all three features yet which avoid the combinatorial explosion. Ellis has devised a data analytic method he calls “Concurrence Topology (CT)” which does so.

  3. Abstract, continued CT takes an apparently radically new approach to solving this problem. It starts by translating data into a “filtration”, a series of “shapes”. The shapes in the series are called “frames”. A filtration is like a building. The frames are like floors of the building. But while the floors of a building are two-dimensional, the frames of a filtration can have dimension much higher than two. A filtration can have holes that are like elevator shafts in a building. Such holes indicate relatively weak or negative association among the variables. CT uses computational algebraic topology to describe the pattern of holes. Normally, there are no more than a few dozen holes, so CT avoids the combinatorial explosion. Often one can identify small groups of variables that are closely associated with a given hole. This facilitates interpretation of the hole.

  4. Abstract, continued A limitation of CT is that, so far, it only works with binary data. But quantitative data can always be binarized. Ellis wrote software in R (available upon request) implementing CT. A paper, written by Arno Klein and Ellis, introducing CT and demonstrating it on fMRI data has been accepted by a topology journal.
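The binarization step mentioned above can be sketched in a few lines. A median split is one common choice; the thresholding rule below is an illustrative assumption, not necessarily the one used in the paper (whose software is in R; Python is used here only for illustration):

```python
import statistics

def binarize(column):
    """Dichotomize a quantitative variable at its median (one common choice)."""
    m = statistics.median(column)
    return [1 if v > m else 0 for v in column]

print(binarize([2.1, 5.0, 3.3, 8.7, 0.4]))  # → [0, 1, 0, 1, 0]
```

Any monotone cut (median, mean, a clinically meaningful threshold) yields binary data that CT can then consume.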

  5. ◮ Free R code exists that implements the procedures described in this talk. ◮ Reference: S.P. Ellis, A. Klein (2014) “Describing high-order statistical dependence using ‘concurrence topology’, with application to functional MRI brain data,” Homology, Homotopy, and Applications, 16, 245–264.

  6. CONCERNED WITH DATA ANALYSIS CHARACTERIZED BY THREE FEATURES

  7. INGREDIENT 1: HIGH-ORDER DEPENDENCE ◮ A statistic that can be computed from a multivariate sample by looking at only k variables at a time, but which cannot be obtained by looking at fewer than k variables at a time, reflects “kth-order dependence” among the variables. ◮ “High-order dependence” means dependence of order at least 3.

  8. Examples: ◮ The list of means of 10 variables reflects “first order dependence”. ◮ A correlation matrix of 10 variables reflects second order dependence. ◮ A simple network model reflects second order dependence. ◮ Factor analysis reflects second order dependence.

  9. Regression ◮ Least squares estimates of the coefficients in a regression model Y = β0 + β1X1 + β2X2 + β3X3 + β4X4 + β5X5 + error reflect second order dependence. ◮ The least squares estimate of the interaction coefficient β12 in the regression model Y = β0 + β1X1 + β2X2 + β12X1:X2 + error reflects third order dependence. ◮ Interactions in regression models can be important. ◮ This suggests that looking at dependence of order higher than 2 might be useful in general.

  10. Three data sets, identical (statistically) up to 2nd order, but not at third order:

           I            II           III
        x  y  z      x  y  z      x  y  z
        0  0  0      0  0  1      0  0  0
        0  1  1      0  1  0      0  0  1
        1  0  1      1  0  0      0  1  0
        1  1  0      1  1  1      1  0  0
        0  0  0      0  0  1      0  1  1
        0  1  1      0  1  0      1  0  1
        1  0  1      1  0  0      1  1  0
        1  1  0      1  1  1      1  1  1
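This can be checked directly. In the sketch below (Python for illustration; the deck's own software is in R), data set I consists of the even-parity patterns and II of the odd-parity patterns, each repeated twice, while III contains all eight patterns once (up to row order). Every first- and second-order moment agrees across the three data sets, but the third-order moment E[xyz] tells them apart:

```python
from itertools import combinations, product

# data set I: x + y + z even; II: x + y + z odd; III: all eight patterns
I   = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)] * 2
II  = [(0, 0, 1), (0, 1, 0), (1, 0, 0), (1, 1, 1)] * 2
III = list(product((0, 1), repeat=3))

def moment(data, idx):
    """Average over the sample of the product of the coordinates selected by idx."""
    total = 0
    for row in data:
        p = 1
        for i in idx:
            p *= row[i]
        total += p
    return total / len(data)

# all first- and second-order moments coincide across the three data sets
for k in (1, 2):
    for idx in combinations(range(3), k):
        assert moment(I, idx) == moment(II, idx) == moment(III, idx)

# ... but the third-order moment E[xyz] separates them
print(moment(I, (0, 1, 2)), moment(II, (0, 1, 2)), moment(III, (0, 1, 2)))
# → 0.0 0.25 0.125
```

For binary variables the moments E[x], E[xy], E[xyz], ... determine the joint distribution, so agreement of all moments up to order 2 is exactly "identical up to 2nd order".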

  11. SUMMARIZING ◮ Typically, there are very many ways variables can be dependent. ◮ Any summary of dependence cannot capture every sort of dependence. ◮ Example: correlation, which captures only pairwise linear dependence.

  12. INGREDIENT 2: “AGNOSTIC” STATISTICS ◮ Typically, formulating a regression model involves choices. ◮ Which variable should be the response (dependent) variable? ◮ Which variables should be the predictors (independent variables)? ◮ Which variables should be included in interactions? ◮ Ditto for path analysis ◮ If you have prior knowledge to guide you, then regression modeling (or path analysis) is a powerful way to learn from data. ◮ A more data-driven approach is “agnostic analysis”: ◮ Treating all variables the same a priori. ◮ Example: Factor analysis is a second order agnostic analysis method.

  13. Group variables ◮ I’m mostly interested in “unsupervised” methods. ◮ But if there is a variable that specifies classes or groups that the data come from, then one might not want to treat it like any old variable. ◮ Output of unsupervised methods can be used as part of input to supervised methods. ◮ Give examples later.

  14. INGREDIENT 3: “LARGE” NUMBER OF VARIABLES ◮ In this talk “large number” means “dozens”, maybe a hundred or so.

  15. “COMBINATORIAL EXPLOSION” ◮ The three features constitute an “explosive mixture”. ◮ Prima facie, agnostically describing kth-order dependence in a data set means examining all combinations of k variables at a time. ◮ If there are many variables and k > 2, the number of combinations can be huge. ◮ Sometimes the collection of all these combinations can be regarded as a “haystack” in which we’re searching for “needles”.

  16. “COMBINATORIAL EXPLOSION:” EXAMPLE Analysis of seventh-order dependence among the regions of the brain’s “default mode network” in an fMRI data set. ◮ 32 variables. ◮ A naive agnostic seventh-order analysis of 32 variables means looking at C(32, 7) = 3,365,856 combinations of 7 variables. ◮ E.g., ≥ 3,365,856 terms in a “log linear model”. ◮ The data contained only 6,144 numbers.

  17. “COMBINATORIAL EXPLOSION,” continued ◮ Computing and interpreting many many combinations is very challenging. ◮ With many combinations, looking at individual combinations of k variables is not helpful: ◮ Difficult to interpret a torrent of numbers. ◮ When there are many groups of k variables, behavior of individual groups is unlikely to be reproducible. ◮ (Multiple comparisons)

  18. SOME METHODS THAT AGNOSTICALLY CAPTURE HIGH ORDER DEPENDENCE IN MANY VARIABLES

  19. “Unsupervised” methods: ◮ There seem to be few established unsupervised methods that capture high order dependence. ◮ Independent Components Analysis ◮ Tensor based methods: ◮ “Parallel factor analysis” ◮ “Tucker 3” ◮ Only go up to third order dependence?

  20. “Supervised” methods ◮ Many machine learning classification methods tap into high order dependence.

  21. Experimental methods.

  22. CONCURRENCE TOPOLOGY (CT) ◮ Apparently new “unsupervised” method for high-order agnostic analysis of dependence among dozens (hundreds?) of variables. ◮ CT is radically different from methods mentioned above. ◮ Since there are few methods there’s no need to choose among them: “Use all of them.” ◮ So comparing methods to see which is best is not urgent.

  23. ◮ The germ of the idea for CT came from a theoretical neuroscience talk I heard by the mathematician Carina Curto.

  24. CONCURRENCE TOPOLOGY (CT), continued ◮ CT is often able to extract a moderate number of high order statistics from a combinatorial explosion. ◮ CT detects certain forms of negative or weak association among the variables.

  25. TOPOLOGY

  26. TOPOLOGY [figure slide]

  27. TOPOLOGY, continued ◮ Topology is the study of qualitative aspects of shapes. ◮ Quantitative aspects of shapes such as length, angle, area, volume, and curvature are only loosely connected to topology. ◮ Famously, topology can’t tell the difference between a donut and a coffee cup. ◮ Topology does pay attention to holes in shapes (like the hole in a donut or in the handle of a coffee cup). ◮ Topology ignores details. ◮ That’s good: there’s a combinatorial explosion of details. We have to ignore practically all of them. ◮ That’s bad: sometimes the details are important. ◮ “Needles” are details. ◮ But often we can recover details from a CT analysis.

  28. ANALOGY FOR CT ◮ Consider this hypothetical histogram. [Figure: “Persistence in a histogram” — a histogram with peaks labeled A, B, and C.]

  29. ANALOGY, continued ◮ Y axis is “count” or “frequency”. ◮ It’s discrete: 1, 2, 3, . . . . ◮ Cut the histogram at various heights. ◮ Suffices to do it at whole number heights ◮ “Frequency levels” ◮ Dark line segments show intersections of horizontal lines with the histogram. ◮ As horizontal line moves downward, sometimes a gap appears in the intersection. ◮ A gap is “born”. ◮ At a lower level the gap might be filled in. ◮ The gap “dies”. ◮ Difference in 2 heights is the “lifespan” of the gap.

  30. PERSISTENCE ◮ The phenomenon of birth and death of gaps is “persistence”. ◮ Can plot it (“persistence plot”). [Figure: “Persistence plot in dimension 0 for histogram” — death level plotted against birth level, with points labeled A, B, and C.]
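The birth/death bookkeeping on this slide and the previous one can be sketched directly: cut the histogram at each whole-number level, list the gaps (runs of bins below the cut, flanked by bins at or above it), and match each gap to the gap one level up that contains it. The toy histogram heights and the containment-matching rule below are my own illustrative assumptions, not the paper's implementation (which is in R):

```python
def gaps_at(heights, h):
    """Maximal runs of bins below level h that lie between bins at or above h."""
    present = [i for i, v in enumerate(heights) if v >= h]
    return [(a + 1, b - 1) for a, b in zip(present, present[1:]) if b - a > 1]

def gap_persistence(heights):
    """Sweep the level line downward; return (birth, death) level pairs of gaps."""
    alive = {}   # gap interval -> level at which it was born
    pairs = []
    for h in range(max(heights), 0, -1):
        matched = set()
        next_alive = {}
        for g in gaps_at(heights, h):
            # a gap at level h sits inside at most one gap from level h + 1
            parent = next((p for p in alive if p[0] <= g[0] and g[1] <= p[1]), None)
            if parent is not None and parent not in matched:
                next_alive[g] = alive[parent]   # surviving gap keeps its birth level
                matched.add(parent)
            else:
                next_alive[g] = h               # newly born (or split-off) gap
        for p, birth in alive.items():
            if p not in matched:
                pairs.append((birth, h))        # gap filled in at level h: it dies
        alive = next_alive
    pairs.extend((birth, 0) for birth in alive.values())
    return pairs

# toy histogram: peaks of height 6, 10, 12 separated by valleys of height 1 and 4
print(gap_persistence([6, 1, 10, 4, 12]))   # → [(10, 4), (6, 1)]
```

Each pair's difference, birth minus death, is the gap's lifespan from the previous slide: the deep valley (height 1) below the lower peak (height 6) yields the pair (6, 1), the shallower valley below the height-10 peak yields (10, 4).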
