Computational geometry and statistical depth measures Eynat Rafalin Computer Science Department, Tufts University www.cs.tufts.edu/research/geometry Joint work with Prof. Diane Souvaine 1 Interface 04
Outline of talk
• Data analysis, computational geometry and depth-based statistics
• Applications
  – A basic technique: the duality transform
  – Least Median of Squares (LMS) regression in optimal time
  – Half-space depth contours in optimal time
  – Depth contours
  – Simplicial depth
• Future research

Computational Geometry
• Deals with problems that require geometric algorithms for their solutions.
• Systematic study of algorithms and data structures for geometric objects, with a focus on exact algorithms that are asymptotically fast.
• At the outset: once exact algorithms have been obtained and refined, and are still slow, move to approximation algorithms.
Computational geometry is everywhere!

Computational geometry & Statistics – data analysis
Multivariate analysis by Data depth
• Data depth: a way of measuring how deep a given point x in R^d is relative to F, a probability distribution, or relative to a given data cloud.
• Examples:
  – Halfspace (Location, Tukey) depth (Hodges 55, Tukey 75)
  – Simplicial depth (Liu 90)
  – Convex Hull Peeling depth (Barnett 76, Eddy 82)
  – Regression depth (Rousseeuw & Hubert 99)
  – Mahalanobis depth (Mahalanobis 36)
  – Oja depth (Oja 83)

Multivariate analysis by Data depth
• The concept provides a center-outward ordering of points.
• Nonparametric, multivariate statistics.
• Robust.
• Affine invariance: for many depth functions the choice of axes does not affect the depth values.

Outliers and Robustness
• Observations that deviate from the main part of the data (outliers) can have an undesirable influence on the analysis of the data.
• A robust depth function yields reasonable results even if several unannounced outliers occur in the data [Handbook of Statistics 15, Rao & Maddala].
• For example:
  – Depth contours are nested contours that enclose regions with increasing depth.
  – For half-space depth contours: in the presence of m outliers, only the m outermost depth contours may be corrupted by the outliers, but the inner set of depth contours will maintain its shape [Donoho & Gasko 92].

Data depth: a characterization, visualization and quantification tool
• Deepest point
• Outliers
• Depth contours
• Bag-plot (Box-plot) [Rousseeuw, Ruts, Tukey 99]
• Scale curve as a measure of scale [Liu, Parelius, Singh 99]
• Fan plot as a measure of tailedness [Liu, Parelius, Singh 99]
• Robustified classification and cluster analysis [Rousseeuw, Ruts 96]

Fan plots [Liu, Parelius & Singh 99]
[Figure: fan plot. Axes: relative area (CH of p% / CH) and percentile of points, where CH is the convex hull.]
50 data points, created from a random distribution with covariance matrix 4 times the identity. The fans are created for data sets containing the 1/6, 2/6, ... central regions. For each region, the area of the CH of 2, 4, 6, ...% of the points is computed.

The continuous and finite sample case
• Most depth functions are defined with respect to a probability distribution F, considering {X_1, ..., X_n} random observations from F.
• The finite sample version of the depth function is obtained by replacing F by F_n, the empirical distribution of the sample {X_1, ..., X_n}.
• In general, computational geometers study the finite sample case!

Applications
Applications
• History
  – Shamos, Geometry and statistics: problems at the interface, 1976
  – Bentley & Shamos, A problem in multivariate statistics: algorithm, data structure and applications, 1977

Data set of the stellar cluster CYGOB1 (Leroy & Rousseeuw 87)
[Figure: scatter plot with fitted line. Axes: star spectrum and logarithm of light intensity.]
Given a set of points, find a line such that the sum of the squares of the residuals is minimized.

Data set of the stellar cluster CYGOB1 (Leroy & Rousseeuw 87)
[Figure: scatter plot with fitted line. Axes: star spectrum and logarithm of light intensity.]
Given a set of points, find a line such that the median of the squares of the residuals is minimized.

Least Median of Squares Regression
• Ordinary least sum of squares: low breakdown point
• Least median of squares: high breakdown point
• Given a set of points, find a line such that the median of the squares of the residuals is minimized
• Find two parallel lines at minimum vertical distance from each other with half of the data points in the slab they define
• Naïve approach: O(n^3)
• O(n^2 log n) time algorithm for computing the LMS line in R^2 [Souvaine, Steele 87]
• An O(n^2) algorithm using duality and topological sweep [Edelsbrunner, Souvaine 90]
[Figure: data points A, B, C and the line l]

Points and lines
• It is hard to find an order in a set of points.
• An arrangement of lines is easier.
• A set of points can be transformed into an arrangement of lines, preserving important properties, using duality:
  T: a point (a, b) → a line y = ax + b

Duality
T: a point (a, b) → a line y = ax + b
T: a line y = cx + d → a point (-c, d)
Primal: points A (1, 2), B (2, 1), C (3, 0), D (4, -1); lines l: y = -x + 3 and m: y = -2x + 2
Dual: lines TA: y = x + 2, TB: y = 2x + 1, TC: y = 3x, TD: y = 4x - 1; points T(l) = (1, 3) and T(m) = (2, 2)
T preserves slope, vertical distance and the above/below relationship.

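The transform on this slide can be checked numerically. The sketch below is only an illustration: the function names and the (slope, intercept) tuple representation of a line are assumptions, not part of the talk. It maps the slide's points and lines to their duals and compares signed vertical distances before and after.

```python
# A line y = s*x + t is represented as the tuple (s, t); this representation
# and all function names are illustrative assumptions.

def dual_of_point(point):
    """Map a point (a, b) to the line y = a*x + b, returned as (slope, intercept)."""
    a, b = point
    return (a, b)

def dual_of_line(line):
    """Map a line y = c*x + d, given as (c, d), to the point (-c, d)."""
    c, d = line
    return (-c, d)

def signed_vertical_distance(point, line):
    """Positive when the point lies above the line, zero when it lies on it."""
    x, y = point
    slope, intercept = line
    return y - (slope * x + intercept)

# Example data from the slide.
A, B, C, D = (1, 2), (2, 1), (3, 0), (4, -1)
l = (-1, 3)   # y = -x + 3
m = (-2, 2)   # y = -2x + 2

for p in (A, B, C, D):
    primal = signed_vertical_distance(p, m)
    dual = signed_vertical_distance(dual_of_line(m), dual_of_point(p))
    print(p, "primal distance to m:", primal, "dual distance:", dual)
```

With this convention the magnitude of the vertical distance is unchanged and its sign flips: a point above a line becomes a line above a point. Incidence is preserved, so the four collinear points on l become four lines through the single point T(l).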
LMS
[Figure: primal view with data points A, B, C, x, y, z and the LMS line l; dual view with lines TA, TB, TC, Tx, Ty, Tz and the point T(l)]
The LMS line bisects a slab bounded by 2 parallel lines, one of which goes through 2 data points and the other goes through one data point (provable characteristics of LMS).

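The characterization above suggests a brute-force search: try the slope of every pair of data points and, for each slope, find the thinnest vertical slab containing h points. The sketch below is a hedged illustration, not the O(n^2) algorithm of [Edelsbrunner, Souvaine 90]. The function name, the default h = floor(n/2) + 1 as the meaning of "half of the data points", and the use of a sort (which gives O(n^3 log n) rather than O(n^3)) are all assumptions made for brevity.

```python
from itertools import combinations

def lms_line_bruteforce(points, h=None):
    """Return (slope, intercept, slab_height) of a brute-force LMS fit.

    points: list of (x, y) pairs, at least two with distinct x.
    h: number of points the slab must contain; defaults to n // 2 + 1
       (an assumed convention for "half of the data points").
    """
    n = len(points)
    if h is None:
        h = n // 2 + 1
    best = None  # (slab_height, slope, intercept of the slab's mid-line)
    for (x1, y1), (x2, y2) in combinations(points, 2):
        if x1 == x2:
            continue  # vertical direction: not a candidate slope
        slope = (y2 - y1) / (x2 - x1)
        # For a fixed slope, each point determines the intercept of the
        # parallel line passing through it.
        intercepts = sorted(y - slope * x for x, y in points)
        # The thinnest slab with h points spans h consecutive intercepts.
        for i in range(n - h + 1):
            height = intercepts[i + h - 1] - intercepts[i]
            if best is None or height < best[0]:
                mid = (intercepts[i] + intercepts[i + h - 1]) / 2
                best = (height, slope, mid)
    if best is None:
        raise ValueError("need at least two points with distinct x")
    return best[1], best[2], best[0]
```

For five points on y = 2x + 1 plus one gross outlier, for example `[(0, 1), (1, 3), (2, 5), (3, 7), (4, 9), (5, -20)]`, this returns slope 2, intercept 1 and slab height 0: the outlier does not move the fit.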
• Least Median of Squares (LMS) Regression
  – The LMS line can be computed in 2D in O(n^2) [Edelsbrunner, Souvaine 90]. Earlier result: [Souvaine, Steele 87]
  – Practical approximation algorithm [Mount, Netanyahu, Romanik, Silverman, Yu 97], [Mount, Erickson, Har-Peled 04]

Half-space depth
[Figure: point set A, B, C, D, E, F, G with a query point p]
The half-space depth of a point p is the minimum number of points of a given set S lying in any closed halfplane bounded by a line through p.
Question: how to compute the half-space depth contours efficiently? (naive cost per point: O(n^2))

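The definition above can be turned into a direct O(n^2) computation for a single point. The sketch below is one possible naive version, not an algorithm from the talk; the function name is hypothetical. It relies on the observation that the count only changes when the line through p passes through a data point, so it is enough to examine each such line and the two lines obtained by rotating it slightly either way.

```python
def halfspace_depth_naive(p, points):
    """Half-space (Tukey) depth of p relative to a list of 2D points.

    Naive O(n^2) sketch. Exact for integer or rational coordinates because
    it only uses sign tests on cross and dot products.
    """
    px, py = p
    # Points equal to p lie in every closed halfplane through p.
    coincident = sum(1 for x, y in points if (x, y) == (px, py))
    others = [(x - px, y - py) for x, y in points if (x, y) != (px, py)]
    if not others:
        return coincident
    best = len(points)
    for dx, dy in others:
        left = right = ahead = behind = 0
        for vx, vy in others:
            cross = dx * vy - dy * vx
            if cross > 0:
                left += 1      # strictly left of the directed line
            elif cross < 0:
                right += 1     # strictly right of the directed line
            elif dx * vx + dy * vy > 0:
                ahead += 1     # on the line, same side of p as (dx, dy)
            else:
                behind += 1    # on the line, opposite side of p
        # A slight rotation pushes the 'ahead' points to one side and the
        # 'behind' points to the other; all four pairings are reachable.
        best = min(best, coincident + min(left, right) + min(ahead, behind))
    return best
```

For the four corners of a square together with its center, the center has depth 3, each corner has depth 1, and any point outside the square has depth 0.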
The depth of a point p: the minimum number of points of S lying in any closed halfspace determined by a line through p
• A line l through p → a point T(l) on the line T(p)
• k points in the half-plane above the line l through p → k lines above the point T(l)
• To count how many lines lie above a point, look at the level
[Figure: primal view with points A, B, C, D, E, F, G, the query point p and a line l through p; dual view with lines TA, TB, TC, TD, TE, TF, TG, the line T(p) and the point T(l)]

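The counting correspondence on this slide can be illustrated with a few lines of code. This is a sketch under assumptions: the function names are hypothetical, lines are given as (slope, intercept) pairs, and the duality is the one used earlier in the talk (point (a, b) to line y = ax + b, line y = cx + d to point (-c, d)).

```python
def count_points_above_line(points, line):
    """Primal count: data points strictly above the line y = c*x + d."""
    c, d = line
    return sum(1 for x, y in points if y > c * x + d)

def count_dual_lines_above_point(points, line):
    """Dual count: lines T(point) passing strictly above the point T(line)."""
    c, d = line
    qx, qy = -c, d  # T(line) = (-c, d)
    # T((a, b)) is the line y = a*x + b; evaluate it at x = qx.
    return sum(1 for a, b in points if a * qx + b > qy)

# Hypothetical sample data, not taken from the slide's figure.
sample = [(0, 0), (1, 5), (2, -1), (3, 4)]
line = (1, 1)  # y = x + 1
print(count_points_above_line(sample, line),
      count_dual_lines_above_point(sample, line))
```

Both counts agree for any input, because the dual inequality is an algebraic rearrangement of the primal one. This is why counting points in a half-plane reduces to reading off a level in the dual arrangement.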
[Figure: primal view with points A to G, the query point p and the contours of depth 1 and depth 2; dual view with lines TA to TG and Tp]
All the half-space depth contours in R^2 can be computed in O(n^2) time using topological sweep [Miller, Ramaswami, Rousseeuw, Sellares, Souvaine, Streinu, Struyf 01]

Half-space depth contours
• The depth of p, the minimum number of points lying in any closed half-space determined by a line l through p, is the minimum level attained by the points T(l) along the dual line T(p)
• To compute the k-th half-space depth contour (all points of depth at least k), find the k-th level in the dual

Sweeping an arrangement of lines
• Vertical line sweep
  – Report all intersection pairs
  – sorted in order of x coordinate
  – O(n^2 log n) time and O(n) space
• Topological line sweep
  – Report all intersection pairs
  – according to a partial order related to the levels of the arrangement
  – O(n^2) time and O(n) space

Duality in 3D
T: a point (a, b, c) → a plane z = ax + by + c

Half-space depth in R^d
• The depth of a point p is the minimum number of points of a given set S lying in any closed half-space bounded by a hyperplane through p

Collaboration – half-space depth
• The depth of a single point can be computed in O(n log n) [Rousseeuw & Ruts 1996]. The lower bound is Ω(n log n) [Aloupis, Cortes, Gomez, Soss, Toussaint 02]
• Computing the 2D Tukey median can be done in O(n log^5 n) [Matousek 1991], and was improved to O(n log^3 n) [Langerman, Steiger 03]
• Computing all 2D depth contours can be done in O(n^2) time using duality & topological sweep [Miller, Ramaswami, Rousseeuw, Sellares, Souvaine, Streinu, Struyf 01]
• Another approach for computing depth contours uses parallel arrangement construction [Fukuda & Rosta 02]
• Half-space depth contours can be computed for display in 2D using hardware-assisted computation [Krishnan, Mustafa, Venkatasubramanian 02]

Depth Contours
Depth Contours
• Nested contours that enclose regions with increasing depth.
• First introduced by Tukey as a data visualization tool for two-dimensional data (half-space depth contours) [Tukey 75]
• Provide powerful tools to visualize and compare data sets.