Nonparametric Distributed Learning Architecture: Algorithm and Application - PowerPoint PPT Presentation



  1. Nonparametric Distributed Learning Architecture: Algorithm and Application
Scott Bruce, Zeda Li, Hsiang-Chieh (Alex) Yang, and Subhadeep (DEEP) Mukhopadhyay
Temple University - Department Award Day Seminar, April 15, 2016
Best Paper Award, JSM 2016 Section on Nonparametric Statistics of ASA; winner of the Fox School Ph.D. Student Research Competition.
Outline: Introduction; Elements of Distributed Statistical Learning; Applications

  2. Big Data Statistical Inference: Motivating Example
Goal: a nonparametric two-sample inference algorithm for the Expedia personalized hotel recommendation engine. We develop a scalable distributed algorithm that can mine search data from millions of travelers to find the important features that best predict customers' likelihood to book a hotel.
Key challenges:
- Variety: different data types require different statistical measures.
- Volume: over 10 million observations across 52 variables.
- Scalability: distributed, parallel processing for massive data analysis.

  3. Summary of Main Contributions
Dramatic increases in the size of datasets have made traditional "centralized" statistical inference techniques prohibitive. Surprisingly, very little attention has been given to developing inferential algorithms for data whose volume exceeds the capacity of a single-machine system. Indeed, the topic of big data statistical inference is very much in its nascent stage of development.
A question of immediate concern: how can we design a data-intensive statistical inference architecture without changing the fundamental data modeling principles that were developed for 'small' data over the last century?
To address this problem we present MetaLP, a flexible and distributed statistical modeling paradigm suitable for large-scale data analysis that addresses (1) massive volume and (2) variety, i.e., the mixed data problem.

  4. LP Nonparametric Harmonic Analysis
The conventional statistical approach fails to address the 'mixed data problem'. We resolve this by representing data in a new transform domain via a specially designed procedure (analogous to the time → frequency domain representation via the Fourier transform).
Theorem (Mukhopadhyay and Parzen, 2014). A random variable X (discrete or continuous) with finite variance admits the following decomposition:
$$ X - \mathbb{E}(X) \;=\; \sum_{j>0} T_j(X; X)\, \mathbb{E}\big[ X\, T_j(X; X) \big] $$
with probability 1.
Traditional and modern statistical measures developed for different data types can be compactly expressed as inner products in the LP Hilbert space.
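The slides do not spell out how the score functions are computed. The sketch below is one plausible implementation, assuming the mid-distribution (mid-rank) transform and Gram-Schmidt orthonormalization described in Mukhopadhyay and Parzen (2014); the helper name lp_score_functions is ours.

```python
import numpy as np

def lp_score_functions(x, m=4):
    """Empirical LP orthonormal score functions T_1,...,T_m evaluated at the
    sample points of x (discrete or continuous).  Construction: powers of the
    mid-distribution (mid-rank) transform, orthonormalized via QR under the
    empirical inner product."""
    x = np.asarray(x)
    n = len(x)
    vals, counts = np.unique(x, return_counts=True)
    p = counts / n
    fmid = np.cumsum(p) - 0.5 * p          # mid-distribution value per distinct value
    u = fmid[np.searchsorted(vals, x)]     # F_mid(x_i) for every observation
    # A discrete x with d distinct values supports at most d - 1 score functions.
    m = min(m, len(vals) - 1)
    # Raw basis: centered powers of the mid-rank transform.
    B = np.column_stack([u ** j for j in range(1, m + 1)])
    B = B - B.mean(axis=0)
    # Orthogonalize (QR), then rescale columns to unit empirical variance.
    Q, _ = np.linalg.qr(B)
    return Q / Q.std(axis=0, ddof=0)       # n x m matrix; column j-1 holds T_j(x_i; x)
```

For a continuous variable this yields polynomial-like shapes of the rank transform; for a discrete variable the shapes adapt to the atoms, which is what Figure 1 on the next slide illustrates.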

  5. Data-Adaptive Shapes
Figure 1: The left 2x2 panel shows the first four LP orthonormal score functions for the discrete variable length of stay. The right panel shows the shape of the score functions for the continuous variable price usd.

  6. LP Hilbert Functional Representation
Define the two-sample LP statistic for variable selection of a mixed random variable X (either continuous or discrete) based on our specially designed score functions:
$$ \mathrm{LP}[j; X, Y] \;=\; \mathrm{Cor}\big[ T_j(X; X),\, Y \big] \;=\; \mathbb{E}\big[ T_j(X; X)\, T_1(Y; Y) \big]. \qquad (1) $$
Properties:
- Sample LP statistics $\sqrt{n}\,\widehat{\mathrm{LP}}[j; X, Y]$ asymptotically converge to i.i.d. standard normal distributions (Mukhopadhyay and Parzen, 2014).
- LP[1; X, Y] unifies various measures of linear association for different data-type combinations.
- Higher-order LP statistics capture distributional differences.
- Allows data scientists to write a single computing formula irrespective of data type, with a common metric and common asymptotic characteristics: a step towards unified algorithms.
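A minimal sketch of the sample version of (1), reusing the hypothetical lp_score_functions helper above; here y is the binary group label (e.g., booked vs. not booked).

```python
import numpy as np

def lp_statistics(x, y, m=4):
    """Sample two-sample LP statistics LP[j; X, Y], j = 1..m: empirical
    correlations between T_j(X; X) and T_1(Y; Y).  Also returns the
    standardized values sqrt(n) * LP_hat, approximately i.i.d. N(0, 1)
    under the null of no association."""
    Tx = lp_score_functions(x, m)            # n x m score matrix for X
    Ty = lp_score_functions(y, 1)[:, 0]      # first score function of the label Y
    n = len(x)
    lp_hat = Tx.T @ Ty / n                   # empirical E[T_j(X; X) T_1(Y; Y)]
    return lp_hat, np.sqrt(n) * lp_hat
```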

  7. Meta-Analysis and Data-Parallelism
The key is to recognize that meta-analytic logic provides a formal statistical framework to address: how do we judiciously combine the "local" LP-inferences executed in parallel by different servers to get the "global" inference for the original big data?
Towards large-scale parallel computing: we use meta-analysis to parallelize the statistical inference process for massive datasets.

  8. What to Combine?
Instead of simply providing point estimates, we seek to provide a distribution estimator (analogous to the Bayesian posterior distribution) for the LP statistics via a confidence distribution (CD) that contains information for virtually all types of statistical inference (e.g., estimation, hypothesis testing, confidence intervals).
Definition (Confidence Distribution). Suppose $\Theta$ is the parameter space of the unknown parameter of interest $\theta$, and $\omega$ is the sample space corresponding to the data $X_n = \{X_1, X_2, \ldots, X_n\}^{T}$. Then a function $H_n(\cdot) = H_n(X_n, \cdot)$ on $\omega \times \Theta \to [0, 1]$ is a confidence distribution (CD) if:
(i) for each given $X_n \in \omega$, $H_n(\cdot)$ is a continuous cumulative distribution function on $\Theta$;
(ii) at the true parameter value $\theta = \theta_0$, $H_n(\theta_0) = H_n(X_n, \theta_0)$, as a function of the sample $X_n$, follows $U[0, 1]$.
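To make the definition concrete for LP statistics: by the asymptotic normality on slide 6, a natural aCD for LP[j; X, Y] on a single partition is N(LP_hat, 1/n). The sketch below (with made-up numbers for lp_hat and n) shows how one distribution estimator supports a test, an estimate, and an interval.

```python
import numpy as np
from scipy.stats import norm

def lp_confidence_distribution(lp_hat, n):
    """Asymptotic confidence distribution (aCD) for a single LP statistic on one
    partition: since sqrt(n) * (LP_hat - LP) is approximately N(0, 1), the aCD
    for LP is N(LP_hat, 1/n), returned here as a CDF over the parameter space."""
    return lambda theta: norm.cdf(np.sqrt(n) * (theta - lp_hat))

# Illustrative values only: lp_hat = 0.04 on a partition of n = 200_000 searches.
H = lp_confidence_distribution(lp_hat=0.04, n=200_000)
p_value = 2 * min(H(0.0), 1 - H(0.0))                        # test of H0: LP = 0
ci_95 = (0.04 - 1.96 / np.sqrt(200_000), 0.04 + 1.96 / np.sqrt(200_000))
```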

  9. How to Combine?
The combining function for CDs across k different studies can be expressed as
$$ H^{(c)}(\mathrm{LP}[j; X, Y]) \;=\; G_c\Big\{ g_c\big( H(\mathrm{LP}_1[j; X, Y]), \ldots, H(\mathrm{LP}_k[j; X, Y]) \big) \Big\}. $$
The function $G_c$ is determined by the monotonic $g_c$ function, defined as $G_c(t) = P\big( g_c(U_1, \ldots, U_k) \le t \big)$, in which $U_1, \ldots, U_k$ are independent $U[0, 1]$ random variables.
A popular and useful choice for $g_c$ is
$$ g_c(u_1, \ldots, u_k) \;=\; \alpha_1 F_0^{-1}(u_1) + \cdots + \alpha_k F_0^{-1}(u_k), $$
where $F_0(\cdot)$ is a given cumulative distribution function and $\alpha_\ell \ge 0$, with at least one $\alpha_\ell \neq 0$, are generic weights.
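A sketch of this combining recipe, specialized (as on the next slide) to F0 = Φ, in which case g_c(U_1, ..., U_k) is N(0, Σ α_ℓ²) and G_c has the closed form Φ(t / √(Σ α_ℓ²)). The local_cds argument is a list of per-partition CDF callables such as those produced by the lp_confidence_distribution sketch above.

```python
import numpy as np
from scipy.stats import norm

def combine_cds(local_cds, weights):
    """Combine k confidence distributions H_1,...,H_k using
    g_c(u_1,...,u_k) = sum_l alpha_l * Phi^{-1}(u_l) and
    G_c(t) = Phi(t / sqrt(sum_l alpha_l^2)).  Returns the combined CDF."""
    alpha = np.asarray(weights, dtype=float)
    scale = np.sqrt(np.sum(alpha ** 2))

    def H_combined(theta):
        g = sum(a * norm.ppf(H(theta)) for a, H in zip(alpha, local_cds))
        return norm.cdf(g / scale)

    return H_combined
```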

  10. Combining Formula for the LP CDs
Theorem (Bruce, Li, Yang and Mukhopadhyay, 2016). Setting $F_0^{-1}(t) = \Phi^{-1}(t)$ and $\alpha_\ell = \sqrt{n_\ell}$, where $n_\ell$ is the size of subpopulation $\ell = 1, \ldots, k$, the combined aCD for $\mathrm{LP}[j; X, Y]$ follows:
$$ H^{(c)}(\mathrm{LP}[j; X, Y]) \;=\; \Phi\!\left( \Big( \textstyle\sum_{\ell=1}^{k} n_\ell \Big)^{1/2} \Big( \mathrm{LP}[j; X, Y] - \widehat{\mathrm{LP}}^{(c)}[j; X, Y] \Big) \right), $$
where
$$ \widehat{\mathrm{LP}}^{(c)}[j; X, Y] \;=\; \frac{\sum_{\ell=1}^{k} n_\ell\, \widehat{\mathrm{LP}}_\ell[j; X, Y]}{\sum_{\ell=1}^{k} n_\ell} $$
and $\big( \sum_{\ell=1}^{k} n_\ell \big)^{-1}$ are the mean and variance, respectively, of the combined aCD for $\mathrm{LP}[j; X, Y]$.
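Because every ingredient is normal, the combined aCD has the closed form above, so the reduce step needs only the per-partition LP statistics and sample sizes, never the raw data. A sketch:

```python
import numpy as np

def combine_lp_fixed_effects(lp_hats, ns):
    """Closed form of the combined aCD in the theorem (alpha_l = sqrt(n_l)):
    normal with mean sum(n_l * LP_hat_l) / sum(n_l) and variance 1 / sum(n_l).
    lp_hats[l] is the local LP statistic on partition l; ns[l] its sample size."""
    lp_hats, ns = np.asarray(lp_hats, float), np.asarray(ns, float)
    mean = np.sum(ns * lp_hats) / np.sum(ns)
    return mean, 1.0 / np.sum(ns)
```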

  11. Big Datasets are Often Heterogeneous
Failure to take heterogeneity into account can easily spoil the big data discovery process. We model the partition-level LP statistics with a random-effects model:
$$ \widehat{\mathrm{LP}}_\ell[j; X, Y] \,\big|\, \mathrm{LP}_\ell[j; X, Y], s_\ell \;\overset{\text{iid}}{\sim}\; N\big( \mathrm{LP}_\ell[j; X, Y],\, s_\ell^2 \big) \qquad (2) $$
$$ \mathrm{LP}_\ell[j; X, Y] \,\big|\, \mathrm{LP}[j; X, Y], \tau \;\overset{\text{iid}}{\sim}\; N\big( \mathrm{LP}[j; X, Y],\, \tau^2 \big) \qquad (3) $$
Figure 2: Histogram of LP statistics of the variable price usd based on the random partition and the visitor location country id partition.
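The slides do not say how the between-partition variance τ² is estimated; the sketch below assumes the standard DerSimonian-Laird moment estimator with within-partition variances s_ℓ² = 1/n_ℓ, which is one common choice for the model in (2)-(3).

```python
import numpy as np

def estimate_tau2(lp_hats, ns):
    """Between-partition heterogeneity tau^2 for the random-effects model (2)-(3),
    using the DerSimonian-Laird moment estimator (an assumption, not stated on
    the slides) with within-partition variances 1/n_l."""
    lp_hats, ns = np.asarray(lp_hats, float), np.asarray(ns, float)
    w = ns                                       # inverse within-partition variances
    mu = np.sum(w * lp_hats) / np.sum(w)         # fixed-effects pooled estimate
    Q = np.sum(w * (lp_hats - mu) ** 2)          # Cochran's Q heterogeneity statistic
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    return max(0.0, (Q - (len(lp_hats) - 1)) / c)
```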

  12. Heterogeneity-Corrected LP Confidence Distribution
Theorem (Bruce, Li, Yang and Mukhopadhyay, 2016). Setting $F_0^{-1}(t) = \Phi^{-1}(t)$ and $\alpha_\ell = 1/\sqrt{\tau^2 + (1/n_\ell)}$, where $n_\ell$ is the size of subpopulation $\ell = 1, \ldots, k$, the combined aCD for $\mathrm{LP}[j; X, Y]$ follows:
$$ H^{(c)}(\mathrm{LP}[j; X, Y]) \;=\; \Phi\!\left( \Big( \textstyle\sum_{\ell=1}^{k} \frac{1}{\tau^2 + (1/n_\ell)} \Big)^{1/2} \Big( \mathrm{LP}[j; X, Y] - \widehat{\mathrm{LP}}^{(c)}[j; X, Y] \Big) \right), $$
where
$$ \widehat{\mathrm{LP}}^{(c)}[j; X, Y] \;=\; \frac{\sum_{\ell=1}^{k} \big( \tau^2 + (1/n_\ell) \big)^{-1}\, \widehat{\mathrm{LP}}_\ell[j; X, Y]}{\sum_{\ell=1}^{k} \big( \tau^2 + (1/n_\ell) \big)^{-1}} $$
and $\Big( \sum_{\ell=1}^{k} 1/\big( \tau^2 + (1/n_\ell) \big) \Big)^{-1}$ are the mean and variance, respectively, of the combined aCD for $\mathrm{LP}[j; X, Y]$.
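A sketch of the corresponding heterogeneity-corrected reduce step, mirroring the fixed-effects version above:

```python
import numpy as np

def combine_lp_random_effects(lp_hats, ns, tau2):
    """Heterogeneity-corrected combined aCD from the theorem: inverse-variance
    weights w_l = 1 / (tau^2 + 1/n_l); the combined aCD is normal with mean
    sum(w_l * LP_hat_l) / sum(w_l) and variance 1 / sum(w_l).
    With tau2 = 0 this reduces to the fixed-effects combination."""
    lp_hats, ns = np.asarray(lp_hats, float), np.asarray(ns, float)
    w = 1.0 / (tau2 + 1.0 / ns)
    return np.sum(w * lp_hats) / np.sum(w), 1.0 / np.sum(w)
```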

  13. Expedia: Variable Importance, Impact of Regularization
Figure 3: 95% confidence intervals for each variable's LP statistics under random sampling partitioning (black) and country ID partitioning (red).
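The intervals in Figure 3 can be read directly off the mean and variance of the combined aCD; a minimal sketch of that final step, used to flag variables whose intervals exclude zero:

```python
import numpy as np
from scipy.stats import norm

def lp_confidence_interval(mean, var, level=0.95):
    """Interval for LP[j; X, Y] from a combined aCD with the given mean and
    variance; variables whose intervals exclude zero are flagged as important."""
    z = norm.ppf(0.5 + level / 2.0)
    half = z * np.sqrt(var)
    return mean - half, mean + half
```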
