Multivariate and Functional Robust Fusion Methods for Big Data

B. Ghattas, joint work with A. Cholaquidis and R. Fraiman
Université d'Aix-Marseille
badihghattas@gmail.com
Outline

1 Example
2 A general setup for RFM
3 Some applications of RFM
4 Simulation results
The problem

We address an important problem in Big Data: how to combine estimators computed on different subsamples by robust fusion procedures when the whole sample cannot be handled at once.

Our idea

A classic 'divide and conquer' approach.

Cases: multivariate location and scatter matrix, the covariance operator for functional data, and clustering problems.
Estimating the median

To estimate the median of a huge set of iid random variables $\{X_1, \dots, X_n\}$ with common density $f_X$, we split the sample into $m$ subsamples of size $l$, with $n = ml$. We compute the median of each subsample and obtain $m$ random variables $Y_1, \dots, Y_m$. We then take the median of the set $Y_1, \dots, Y_m$.

This estimator clearly does not coincide with the median of the whole original sample $\{X_1, \dots, X_n\}$, but it will be close. What else can we say about its efficiency and robustness?
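The split-then-combine procedure described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `median_of_medians` is hypothetical, and the choice to drop trailing observations when $n$ is not a multiple of $m$ is an assumption made only to keep the example short.

```python
import numpy as np

def median_of_medians(x, m):
    """Median of the m subsample medians (hypothetical helper name).

    Assumption: if len(x) is not a multiple of m, the trailing
    observations that do not fill a complete subsample are dropped.
    """
    x = np.asarray(x, dtype=float)
    l = x.size // m                      # subsample size, n = m * l
    if l == 0:
        raise ValueError("m must not exceed the sample size")
    blocks = x[: m * l].reshape(m, l)    # row j holds subsample j
    y = np.median(blocks, axis=1)        # Y_1, ..., Y_m
    return float(np.median(y))

# Compare with the median of the whole sample on simulated uniform data.
rng = np.random.default_rng(0)
sample = rng.random(100_005)
print(median_of_medians(sample, m=15), float(np.median(sample)))
```

Only one subsample has to be processed at a time in a real setting; the reshape above is used purely for compactness.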
Estimating the median

Each of the $m$ variables $Y_i$ is the median of $l$ iid variables with density $f_X$. If $l = 2k + 1$, then $Y_i$ has density

$$g_Y(t) = \frac{(2k+1)!}{(k!)^2} \, F_X(t)^k \, (1 - F_X(t))^k \, f_X(t).$$

If $f_X$ is uniform on $[0, 1]$, this becomes

$$h_Y(t) = \frac{(2k+1)!}{(k!)^2} \, t^k (1 - t)^k \, \mathbf{1}_{[0,1]}(t),$$

which is the density of a $\mathrm{Beta}(k+1, k+1)$ distribution.
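The Beta claim for the uniform case can be checked by simulation. The sketch below draws many subsamples of size $2k+1$ from the uniform distribution, takes their medians, and compares the empirical mean and variance with the standard moments of a $\mathrm{Beta}(a, b)$ distribution. The function names are hypothetical and the values of `k` and `reps` are arbitrary choices.

```python
import numpy as np

def simulate_subsample_medians(k, reps, seed=0):
    """Medians of `reps` independent uniform subsamples of size 2k + 1."""
    rng = np.random.default_rng(seed)
    u = rng.random((reps, 2 * k + 1))
    return np.median(u, axis=1)

def beta_mean_var(a, b):
    """Mean and variance of a Beta(a, b) distribution."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, var

k = 3
medians = simulate_subsample_medians(k, reps=200_000)
mean, var = beta_mean_var(k + 1, k + 1)
# The empirical and theoretical moments should be close.
print(medians.mean(), mean)
print(medians.var(), var)
```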
Estimating the median

Asymptotically, the empirical median $\hat\theta = \mathrm{med}(X_1, \dots, X_n)$ satisfies

$$\hat\theta \sim N(\theta, V(\hat\theta)), \qquad V(\hat\theta) = \frac{1}{4 n f_X(\theta)^2},$$

while the median of medians $\tilde\theta_{RFM}$ satisfies

$$\tilde\theta_{RFM} \sim N(\theta, V(\tilde\theta_{RFM})), \qquad V(\tilde\theta_{RFM}) = \frac{1}{4 m h_Y(\theta)^2}.$$

For the uniform case, both are centred at $1/2$, $f_X(0.5) = 1$, and

$$h_Y(0.5) = \left(\tfrac{1}{2}\right)^{2k} \frac{(2k+1)!}{(k!)^2} \sim 2\sqrt{k/\pi}.$$

Since $n = m(2k+1)$, the relative efficiency of $\tilde\theta_{RFM}$ with respect to $\hat\theta$ is

$$\frac{V(\hat\theta)}{V(\tilde\theta_{RFM})} = \frac{h_Y(0.5)^2}{2k+1} \to \frac{2}{\pi}.$$
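The uniform-case quantities on this slide are easy to evaluate numerically. The sketch below computes $h_Y(1/2)$ exactly through log-gamma functions, then the variance ratio $V(\hat\theta) / V(\tilde\theta_{RFM}) = h_Y(1/2)^2 / (2k+1)$, which follows from the two variance formulas with $n = m(2k+1)$ and $f_X(1/2) = 1$. The function names are hypothetical.

```python
import math

def beta_median_density(k):
    """h_Y(1/2) = (1/2)^(2k) * (2k+1)! / (k!)^2, computed in log space."""
    log_h = math.lgamma(2 * k + 2) - 2 * math.lgamma(k + 1) - 2 * k * math.log(2)
    return math.exp(log_h)

def variance_ratio(k):
    """V(theta_hat) / V(theta_tilde_RFM) in the uniform case."""
    return beta_median_density(k) ** 2 / (2 * k + 1)

# Tabulate the ratio for growing subsample sizes l = 2k + 1.
for k in (1, 10, 100, 10_000):
    print(k, beta_median_density(k), variance_ratio(k))
```

Working in log space avoids overflow in the factorials when $k$ is large.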
The RFM

Let $\{X_1, \dots, X_n\}$ be a sample of iid random elements in a metric space $E$, and let $\theta$ be a parameter to estimate.

a) Split the sample into $m$ subsamples, with $n = ml$:
   $\{X_1, \dots, X_l\}, \{X_{l+1}, \dots, X_{2l}\}, \dots, \{X_{(m-1)l+1}, \dots, X_{lm}\}$.
b) Compute a robust estimate of $\theta$ on each subsample, obtaining $\hat\theta_1, \dots, \hat\theta_m$.
c) Compute the final estimate $\tilde\theta_{RFM}$ by combining $\hat\theta_1, \dots, \hat\theta_m$ with a robust approach. For instance, $\tilde\theta_{RFM}$ can be the deepest point, or the average of the 40% deepest points, among $\hat\theta_1, \dots, \hat\theta_m$.

Table: Parameter estimation using RFM

What are the consistency, efficiency, robustness and computational time properties of the RFM?
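Steps a) to c) can be written as one generic function that takes the per-subsample estimator and the fusion rule as arguments. This is a sketch under stated assumptions rather than the authors' code: the name `rfm` and its signature are hypothetical, the data are assumed to fit in a NumPy array whose first axis indexes observations, and leftover observations are dropped when $n$ is not a multiple of $m$.

```python
import numpy as np

def rfm(x, m, estimate, fuse):
    """Generic robust fusion sketch (hypothetical signature).

    x        : array whose first axis indexes the n observations
    m        : number of subsamples
    estimate : callable mapping one subsample to an estimate of theta
    fuse     : callable mapping the array of m estimates to a final estimate
    """
    x = np.asarray(x)
    l = len(x) // m                                   # subsample size
    if l == 0:
        raise ValueError("m must not exceed the sample size")
    # a) split, b) estimate on each subsample
    estimates = np.asarray([estimate(x[j * l:(j + 1) * l]) for j in range(m)])
    # c) combine the m estimates with a robust rule
    return fuse(estimates)

# Univariate usage: the median of medians is the special case
# estimate = fuse = np.median.
rng = np.random.default_rng(0)
data = rng.standard_normal(10_000)
print(rfm(data, m=10, estimate=np.median, fuse=np.median))
```

Passing the estimator and the fusion rule as callables keeps the splitting logic independent of the parameter being estimated, so the same skeleton covers location, scatter, or any other case where both steps are available as functions.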
The depth function

Let $X$ be a random variable taking values in a Banach space $(E, \|\cdot\|)$, with probability distribution $P_X$, and let $x \in E$. The depth of $x$ with respect to $P_X$ is defined as

$$D(x, P_X) = 1 - \left\| E_{P_X}\!\left[ \frac{X - x}{\|X - x\|} \right] \right\| \tag{1}$$

(see Chaudhuri [1996], Vardi and Zhang [2000], and the extension to a very general setup by Chakraborty and Chaudhuri [2014]).

With a suitable norm, this depth can be used for the "fusion" step of RFM.
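An empirical version of the depth in (1) replaces the expectation by an average over a finite set of points. The sketch below does this for vectors in $\mathbb{R}^d$ with the Euclidean norm and uses it to pick the deepest point, or to average a fraction of the deepest points, among a set of estimates. Several things here are assumptions for illustration only: the function names, the Euclidean norm, and the convention that points equal to $x$ contribute zero to the average.

```python
import numpy as np

def spatial_depth(x, sample):
    """Empirical depth: 1 - || mean of (X_i - x) / ||X_i - x|| ||.

    Assumption: sample points equal to x contribute a zero vector,
    since their unit direction is undefined.
    """
    sample = np.asarray(sample, dtype=float)
    diffs = sample - np.asarray(x, dtype=float)
    norms = np.linalg.norm(diffs, axis=1)
    keep = norms > 0
    units = diffs[keep] / norms[keep, None]
    return 1.0 - float(np.linalg.norm(units.sum(axis=0) / len(sample)))

def deepest_point(estimates):
    """Return the estimate with the largest empirical depth."""
    estimates = np.asarray(estimates, dtype=float)
    depths = [spatial_depth(e, estimates) for e in estimates]
    return estimates[int(np.argmax(depths))]

def mean_of_deepest(estimates, fraction=0.4):
    """Average the given fraction of deepest estimates."""
    estimates = np.asarray(estimates, dtype=float)
    depths = np.array([spatial_depth(e, estimates) for e in estimates])
    n_keep = max(1, int(np.ceil(fraction * len(estimates))))
    top = np.argsort(depths)[::-1][:n_keep]
    return estimates[top].mean(axis=0)
```

For example, among the points `[[0, 0], [1, 0], [-1, 0], [0, 1], [0, -1], [100, 100]]` the function `deepest_point` returns the origin, while the far-away point receives the smallest depth.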
Breakdown point for the RFM

Breakdown point

Following Donoho [1982], we consider the finite-sample breakdown point.

Definition. Let $x = \{x_1, \dots, x_n\}$ be a dataset, $\theta$ an unknown parameter lying in a metric space $\Theta$, and $\hat\theta_n = \hat\theta_n(x)$ an estimate based on $x$. Let $\mathcal{X}_p$ be the set of all datasets $y$ of size $n$ having $n - p$ elements in common with $x$:

$$\mathcal{X}_p = \{ y : \mathrm{card}(y) = n, \ \mathrm{card}(x \cap y) = n - p \}.$$

Then the breakdown point of $\hat\theta_n$ at $x$ is $\epsilon_n^*(\hat\theta_n, x) = p^*/n$, where

$$p^* = \max\{ p \ge 0 : \forall y \in \mathcal{X}_p, \ \hat\theta_n(y) \text{ is bounded and also bounded away from the boundary } \partial\Theta, \text{ if } \partial\Theta \ne \emptyset \}.$$

It is the maximum proportion of outliers (located at the worst possible positions) that a sample can contain before the estimate breaks, in the sense that it can become arbitrarily large (or arbitrarily close to the boundary of the parameter space).
Breakdown point for the RFM

BP analysis

Consider the case where the robust estimates on each subsample have breakdown point 0.5.

Let $B_i = 1$ if observation $i$ is an outlier and $0$ otherwise, and assume that the variables $B_i$ are iid Bernoulli, $B_i \sim B(p)$.

Let $S_j = \sum_{s=1}^{l} B_{(j-1)l+s}$ be the number of outliers in subsample $j$, for $j = 1, \dots, m$.

The RFM estimator breaks down if and only if there are more than $m/2$ subsamples in which $S_j$ is greater than $k$ (recall that $l = 2k + 1$).
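Under the iid Bernoulli model, the breakdown criterion above translates into two nested binomial tail probabilities: each $S_j$ is $\mathrm{Bin}(l, p)$, so a subsample is broken with probability $q = P(S_j > k)$, and the number of broken subsamples is $\mathrm{Bin}(m, q)$. The sketch below evaluates this directly. The function names are hypothetical, and the pmf is summed in log space as a practical choice to avoid underflow for large $l$.

```python
import math

def binom_upper_tail(n, p, k):
    """Return P(Bin(n, p) > k), summing the pmf in log space."""
    if p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0 if k < n else 0.0
    total = 0.0
    for s in range(k + 1, n + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(s + 1) - math.lgamma(n - s + 1)
                   + s * math.log(p) + (n - s) * math.log1p(-p))
        total += math.exp(log_pmf)
    return min(total, 1.0)

def rfm_breakdown_probability(m, l, p):
    """P(more than m/2 subsamples have S_j > k), with k = l // 2."""
    k = l // 2
    q = binom_upper_tail(l, p, k)          # one subsample is broken
    return binom_upper_tail(m, q, m // 2)  # a majority of subsamples is broken

# Example with m = 5 subsamples of size l = 2001.
for p in (0.45, 0.49, 0.5):
    print(p, rfm_breakdown_probability(5, 2001, p))
```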
Breakdown point for the RFM

To take a glance at the behaviour of the BP, we simulate $n = 30000$ indicators $X \sim B(p)$:

- Split the sample into $m$ subsamples, and for each one compute $S_j$.
- Calculate the proportion of subsamples containing more than $l/2$ outliers; that is, the percentage of times the estimator breaks down.
- Repeat this experiment 5000 times and look at the average value of this proportion.

  m    p = 0.45   p = 0.49   p = 0.495   p = 0.499
  5    0          0.0020     0.0820      0.3892
 10    0          0.0088     0.1564      0.5352
 30    0          0.0052     0.1426      0.5186
 50    0          0.0080     0.1598      0.5412
100    0          0.0192     0.2162      0.6084
150    0          0.0278     0.2728      0.6780

As expected, the best possible choice would be to take $m$ as small as possible.
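A Monte Carlo experiment of this kind can be sketched as follows. The code reads "the estimator breaks down" as the criterion from the BP analysis (more than $m/2$ subsamples with more than $l/2$ outliers), which is an assumption about the exact protocol, so it is not claimed to reproduce the table digit for digit. The function name `breakdown_rate` is hypothetical. Each $S_j$ is drawn directly from $\mathrm{Bin}(l, p)$, which has the same distribution as a sum of $l$ iid $B(p)$ indicators.

```python
import numpy as np

def breakdown_rate(n, m, p, reps=5000, seed=0):
    """Fraction of replications in which the RFM estimator breaks down.

    Assumption: breakdown means that more than m/2 subsamples
    contain more than l/2 outliers, with l = n // m.
    """
    rng = np.random.default_rng(seed)
    l = n // m
    s = rng.binomial(l, p, size=(reps, m))   # S_j for every replication
    broken = s > l / 2                       # subsample estimates that break
    return float(np.mean(broken.sum(axis=1) > m / 2))

# Sweep the same grid of m and p as in the table above.
for m in (5, 10, 30, 50, 100, 150):
    print(m, [breakdown_rate(30_000, m, p) for p in (0.45, 0.49, 0.495, 0.499)])
```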
Three applications

- Estimating multivariate location and scatter matrix,
- Estimating the covariance operator for functional data, and
- Clustering.

Solutions for many other problems may be derived from these cases (Principal Components, for example, both for non-functional and functional data).