Problems for Multivariate Data Analysis Censored data. Riffle: an R - PowerPoint PPT Presentation

Riffle: an R Package for Nonmetric Clustering
Geoffrey B. Matthews and Robin A. Matthews
Western Washington University, Bellingham, WA, USA

Problems for Multivariate Data Analysis
• Censored data.
  – Tied ranks and reduced variance when “< 5” ⇒ “5”.
  – Systematic bias when omitted.
• Missing data.
  – Omit entire row when one variable column is missing?
• Noisy, “useless” parameters.
  – Measured anyway.
  – Can be unrelated to major patterns.
• Dissimilar data types
  – Chemical
    ∗ pH, alkalinity
  – Physical
    ∗ temperature, percent canopy cover, sediment size, land use classes
  – Biological
    ∗ chlorophyll, sex (male, female, juvenile)
    ∗ rare species (counts 1-2)
    ∗ common species (counts 10,000-100,000)

Riffle
Matthews & Hearne, IEEE PAMI, 1991
A clustering algorithm:
• group similar points into clusters.
A nonmetric algorithm:
• uses only order statistics for continuous data
• can handle both continuous and categorical data together
Uses variables independently:
• ignores scattered missing values
• uses incommensurable variables without normalizing
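The censoring problem listed above can be made concrete with a small R sketch. The concentrations below are invented for illustration; only the detection limit of 5 comes from the slide's “< 5” example.

```r
# Hypothetical measurements; four of them fall below the detection limit.
true_values <- c(1.2, 2.5, 3.1, 4.4, 6.0, 7.5, 9.3, 12.8)
detection_limit <- 5

# Substitution: every censored value is recorded as the detection limit.
substituted <- pmax(true_values, detection_limit)

# Omission: censored values are dropped altogether.
omitted <- true_values[true_values >= detection_limit]

var(true_values)    # variance of the uncensored data
var(substituted)    # smaller, because the low tail collapses onto 5
rank(substituted)   # the four censored observations share one tied rank
mean(true_values)
mean(omitted)       # shifted upward, because only large values remain
```

Substitution keeps the sample size but ties the censored observations together, while omission keeps only the upper part of the distribution.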

Proportional Reduction in Error
• Measuring Predictability for Categorical Variables

            red  green  blue  Errors
  A           5      8     2       7
  B           2      3     9       5
  C           8      1     0       1
  Totals     15     12    11      13

Errors predicting (red, green, blue) a priori: 12 + 11 = 23
Errors predicting (red, green, blue) given (A, B, C): 7 + 5 + 1 = 13
Proportional reduction in error: (23 − 13)/23 = 10/23
• More meaningful and robust than, e.g., χ²

Proportional Reduction in Error
Independent variables:

            red  green  blue  Errors
  A           6      3     9       9
  B           4      2     6       6
  C           2      1     3       3
  Totals     12      6    18      18

Minimum: 0% reduction, (18 − 18)/18 = 0/18

Perfectly predictable variables:

            red  green  blue  Errors
  A          12      0     0       0
  B           0      0    18       0
  C           0      6     0       0
  Totals     12      6    18       0

Maximum: 100% reduction, (18 − 0)/18 = 18/18

Clustering with categorical variables
• Assign clusters to maximize predictability over other variables.

[Figure: table of point-to-cluster assignments, with cluster-by-category count tables for Variables 1, 2, and 3]

Handling ordered variables
• Cuts adjusted to maximize predictability of clusters

[Figure: two scatter plots of points labeled A, B, and C on axes x and y]

Cluster counts per bin, first plot:

        x                     y
  A    20   10    0      A     0   10   20
  B     0   10   10      B    20    0    0
  C     0    0   10      C     0   10    0
  Score: 20/40           Score: 30/40

Cluster counts per bin, second plot:

        x                     y
  A    30    0    0      A     0    0   30
  B     0   20    0      B    20    0    0
  C     0    0   10      C     0   10    0
  Score: 30/30           Score: 30/30
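The scores in the tables above can be checked with a few lines of R. This is a minimal sketch, not code from the Riffle package: the helper name `pre` is hypothetical, and it assumes each table has the predictor's levels as rows and the predicted categories as columns. The measure computed this way is commonly known as Goodman and Kruskal's lambda.

```r
# Proportional reduction in error for a table of counts.
# `pre` is a hypothetical helper, not part of any package.
pre <- function(tab) {
  # Errors made by always guessing the most common column overall.
  prior_errors <- sum(tab) - max(colSums(tab))
  # Errors made by guessing the most common column within each row.
  conditional_errors <- sum(rowSums(tab) - apply(tab, 1, max))
  (prior_errors - conditional_errors) / prior_errors
}

as_table <- function(counts) {
  matrix(counts, nrow = 3, byrow = TRUE,
         dimnames = list(c("A", "B", "C"), NULL))
}

example     <- as_table(c(5, 8, 2,   2, 3, 9,    8, 1, 0))
independent <- as_table(c(6, 3, 9,   4, 2, 6,    2, 1, 3))
perfect     <- as_table(c(12, 0, 0,  0, 0, 18,   0, 6, 0))
first_x_cut <- as_table(c(20, 10, 0, 0, 10, 10,  0, 0, 10))

pre(example)      # (23 - 13) / 23
pre(independent)  # 0
pre(perfect)      # 1
pre(first_x_cut)  # 20 / 40
```

The sketch divides by the a priori error count, so it is undefined when every observation falls in a single column.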

Cutting Gaussian variables
• Generate independent Gaussians from cluster statistics µ_i, σ_i
• Cut where max likelihood changes from one to another.

[Figure: scatter plot of clusters A, B, and C on axes x and y, with density curves]

Cluster counts per bin:

        x                     y
  A    30    0    0      A     0    1   29
  B     1   19    0      B    20    0    0
  C     0    0   10      C     0   10    0
  Score: 28/29           Score: 30/31

Alternative handling of Gaussian variables (EM)
• Assign to most likely group, instead of max predictability.
• Not used in Riffle.

[Figure: scatter plot of clusters A, B, and C on axes x and y]

Essential Algorithm

  variables <- quantile.cuts(data)
  clusters <- seed.clusters(variables)
  score <- reduction.in.error(variables, clusters)
  while (improving(score)) {
    variables <- best.cuts(variables, clusters)
    clusters <- best.clusters(variables, clusters)
    score <- reduction.in.error(variables, clusters)
  }
  return (clusters, variables)

Getting things started
To find initial cuts for variables:
• Use quantiles for cut points.
• Use quantiles for µ_i, overall σ for σ_i.
To find initial clusters, given cut variables:
• Select one point randomly as seed.
• Find other seeds by selecting points as different as possible.
• Assign each seed to a different cluster.
• Assign all other points to cluster of most similar seed.
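The initialization steps above could be sketched in R roughly as follows. This is an illustration under stated assumptions, not the Riffle implementation: the helper names `quantile_cuts` and `seed_clusters` are hypothetical, and "similarity" is assumed here to mean the number of variables on which two points fall in the same bin, which the slides do not specify.

```r
# Bin every numeric column at its quantiles (hypothetical helper).
# Returns an integer matrix of bin indices, one column per variable.
quantile_cuts <- function(data, bins = 3) {
  sapply(data, function(x) {
    breaks <- unique(quantile(x, probs = seq(0, 1, length.out = bins + 1),
                              na.rm = TRUE))
    cut(x, breaks = breaks, include.lowest = TRUE, labels = FALSE)
  })
}

# Pick k seeds that are as different as possible, then assign every
# point to the cluster of its most similar seed (hypothetical helper).
seed_clusters <- function(binned, k) {
  n <- nrow(binned)
  # Assumed dissimilarity: count of variables whose bins differ;
  # missing values are skipped.
  dissimilarity <- function(i, j) {
    sum(binned[i, ] != binned[j, ], na.rm = TRUE)
  }
  seeds <- sample(n, 1)                      # one random starting seed
  while (length(seeds) < k) {
    # Dissimilarity from each point to its closest existing seed.
    nearest <- sapply(seq_len(n), function(i) {
      min(sapply(seeds, function(s) dissimilarity(i, s)))
    })
    nearest[seeds] <- -1                     # never reselect a seed
    seeds <- c(seeds, which.max(nearest))
  }
  sapply(seq_len(n), function(i) {
    which.min(sapply(seeds, function(s) dissimilarity(i, s)))
  })
}

set.seed(1)
data <- data.frame(x = c(rnorm(20, 2), rnorm(20, 10)),
                   y = c(rnorm(20, 10), rnorm(20, 2)))
binned <- quantile_cuts(data, bins = 3)
clusters <- seed_clusters(binned, k = 2)
table(clusters)
```

Because each variable contributes to the dissimilarity independently, a missing value in one column only removes that column from the comparison and does not force the row to be dropped.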

Embellishments
• Each variable is dealt with independently.
• Each variable has a score (predictability vs. cluster).
• Use score to eliminate variables, or rank them in importance.
• We use this to handle the curse of dimensionality and find a small set of critical variables.
• Data reduction

Data Exploration vs. Confirmation
• Clustering in general is exploratory.
• Clustering data with known groups:
  – correlation between clusters and groups measures significance.
  – identifies important variables as the ones with high predictability.
  – determines not only significance of effect, but also which variables are affected the most.
  – we have used this to chart seasonal effects.

Conclusion
• We have used Riffle successfully for over 10 years for ecological and toxicological data analysis.
• Riffle can cluster using incommensurate variables.
• Riffle handles censored data and missing data with few assumptions.
• Riffle can reduce complexity in highly multivariate datasets.
• R package available 2006.
