
Distributed Data Mining for Pervasive and Privacy-Sensitive Applications
Hillol Kargupta, Dept. of Computer Science and Electrical Engineering


  1. Distributed Data Mining for Pervasive and Privacy-Sensitive Applications
     Hillol Kargupta, Dept. of Computer Science and Electrical Engineering, University of Maryland Baltimore County
     http://www.cs.umbc.edu/~hillol
     hillol@cs.umbc.edu
     Roadmap:
     ■ Distributed Data Mining (DDM)
     ■ Pervasive and privacy-sensitive applications of DDM
     ■ Dealing with ensembles of data mining models
     ■ Linear representations for advanced meta-level analysis of models
     ■ Conclusions

  2. Distributed Data Mining (DDM)
     ■ Distributed resources
       – data
       – computation and communication
       – users
     ■ Data mining by properly exploiting the distributed resources
     Distributed Resources and DDM
     ■ Distributed compute nodes connected by a fast communication network
       – Partition the data if necessary and distribute the computation
     ■ Inherently distributed data that may not be collected to a single site or re-partitioned
       – Connected by a limited-bandwidth network
       – Privacy-sensitive data

  3. Pervasive Applications: UMBC Fleet Health Monitoring
     • Vehicle health monitoring systems
     • Collect and analyze vehicle-related information
     • On-board / in situ data analysis
     • Send out interesting patterns
     • Analyze data for the entire fleet
     • UMBC fleet operations management
     Continued…
     ■ Onboard real-time vehicle-mining system over a wireless network

  4. Pervasive Applications: MobiMine
     ■ MobiMine system: a mobile data stream mining system for monitoring financial data
     DDM from NASA EOS Distributed Data Repositories

  5. Mining from Distributed Privacy-Sensitive Data
     ■ Analyze data without moving the data in its original form.
     ■ Many DDM algorithms are privacy-friendly since they minimize data communication.
     Distributed Data Mining
     [Architecture diagram: Sites 1 and 2 each perform local analysis (mining & filtering); the resulting models/patterns and filtered data flow to a central site for aggregation and analysis of the models/patterns.]
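To make the pattern on this slide concrete, here is a minimal sketch (not the authors' system) in which each site fits a model on its own data and only the fitted models, never the raw records, travel to the central site. The depth-limited scikit-learn trees, the synthetic data, and the majority-vote combiner are all illustrative assumptions.

```python
# Sketch of the DDM pattern on this slide: local mining at each site,
# only models/patterns travel to the central site (illustrative only).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def mine_locally(X_site, y_site):
    """Run local analysis at one site; only the fitted model leaves the site."""
    model = DecisionTreeClassifier(max_depth=3)
    model.fit(X_site, y_site)
    return model

def aggregate_at_central_site(models, X_new):
    """Central site combines the collected models, e.g. by majority vote."""
    votes = np.stack([m.predict(X_new) for m in models])
    return np.round(votes.mean(axis=0))

# Two sites with privacy-sensitive local data (synthetic stand-ins here).
rng = np.random.default_rng(0)
sites = [(rng.random((100, 4)), rng.integers(0, 2, 100)) for _ in range(2)]
local_models = [mine_locally(X, y) for X, y in sites]
print(aggregate_at_central_site(local_models, rng.random((5, 4))))
```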

  6. Ensemble of Classifiers and Clusters
     [Diagram: base classifiers f_1(x), f_2(x), f_3(x), …, f_n(x) feed into a weighted sum.]
     f(x) = Σ_i a_i f_i(x), where a_i is the weight for the i-th base classifier and f_i(x) is the output of the i-th classifier.
     Discrete Structures for Data Mining Models
     ■ Trees, and in general graphs, are popular choices for data mining models:
       – Decision trees (tree)
       – Neural networks (graph)
       – Graphical models (graph)
       – Clusters (graph, hypergraph)
     ■ Dealing with ensembles requires an algebraic framework.
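The weighted-sum combiner f(x) = Σ_i a_i f_i(x) can be transcribed directly; the toy base classifiers and weights below are assumptions for illustration.

```python
# f(x) = sum_i a_i * f_i(x): weighted-sum aggregation of base classifier outputs.
def ensemble_output(x, base_classifiers, weights):
    """base_classifiers: list of callables f_i; weights: list of a_i."""
    return sum(a * f(x) for a, f in zip(weights, base_classifiers))

# Toy base classifiers over a binary feature vector x (illustrative only).
f1 = lambda x: x[0]            # predicts using feature 0
f2 = lambda x: 1 - x[1]        # predicts the complement of feature 1
f3 = lambda x: x[0] * x[1]     # predicts the AND of features 0 and 1

print(ensemble_output([1, 0], [f1, f2, f3], [0.5, 0.3, 0.2]))  # 0.5 + 0.3 + 0.0 = 0.8
```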

  7. Examples
     ■ Eigen analysis of graphs:
       – Graphs can be represented using matrices
       – Eigen analysis of the Laplacian of graphs (Chung, 1997)
     ■ Wavelet, Fourier, or other representations of discrete structures??
     Decision Trees as Functions
     [Figure: the Outlook/Humidity/Wind "play tennis" decision tree with attribute values Sunny, Overcast, Rain, High, Normal, Strong, Weak and numeric leaf labels 1 (Yes) and 0 (No).]
     ■ A decision tree can be viewed as a numeric function.
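As a concrete instance of "a decision tree is a numeric function", the play-tennis tree in the figure can be written as a plain function from encoded attribute values to a 0/1 output; the integer encodings below are assumed for illustration.

```python
# The decision tree on this slide viewed as a numeric function.
# Encodings (assumed): Outlook 0=Sunny, 1=Overcast, 2=Rain;
# Humidity 0=Normal, 1=High; Wind 0=Weak, 1=Strong. Output: 1=Yes, 0=No.
def play_tennis(outlook, humidity, wind):
    if outlook == 1:                    # Overcast -> always play
        return 1
    if outlook == 0:                    # Sunny: decided by Humidity
        return 0 if humidity == 1 else 1
    return 0 if wind == 1 else 1        # Rain: decided by Wind

print(play_tennis(outlook=0, humidity=1, wind=0))  # Sunny, High humidity -> 0 (No)
```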

  8. Fourier Representation of a Decision Tree
     [Figure: the Outlook/Humidity/Wind tree annotated with a Fourier coefficient (FC), a partition, and a Fourier basis function.]
     Fourier basis: f(x) = Σ_{j ∈ Ξ} w_j Ψ_j(x), with j, x ∈ {0, 1}^l.
     Ψ_j(x) = (-1)^{j · x} is the j-th Fourier basis function, and w_j is the corresponding Fourier coefficient:
     w_j = (1/N) Σ_x f(x) Ψ_j(x).
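A brute-force sketch of the transform just defined: enumerate every x ∈ {0,1}^l, evaluate Ψ_j(x) = (-1)^{j·x}, and average to obtain w_j. The example Boolean function and the small l are assumptions; slide 11 notes that the spectrum of an actual decision tree can be computed far more efficiently.

```python
from itertools import product

def fourier_coefficient(f, j, l):
    """w_j = (1/N) * sum_x f(x) * (-1)^(j . x), with N = 2**l and j, x in {0,1}^l."""
    total = 0.0
    for x in product((0, 1), repeat=l):
        parity = sum(jb * xb for jb, xb in zip(j, x)) % 2   # j . x mod 2
        total += f(x) * (-1) ** parity                      # Psi_j(x) = (-1)^(j . x)
    return total / 2 ** l

# Example Boolean function on l = 3 bits (assumed): f(x) = x_1 AND x_3.
f = lambda x: x[0] & x[2]
spectrum = {j: fourier_coefficient(f, j, 3) for j in product((0, 1), repeat=3)}
print(spectrum[(0, 0, 0)], spectrum[(1, 0, 1)])   # 0.25 and 0.25 for this f
```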

  9. Partitions
     A partition j is an l-bit Boolean string. It can also be viewed as a subset of variables.
     Example: partition 101 ⇒ {x_1, x_3}, i.e., it contains the features associated with the locations indicated by the 1-s in the partition.
     Order of a partition = the number of 1-s in the partition.
     Fourier Spectrum of a Decision Tree
     ■ Very sparse representation; polynomial number of non-zero coefficients. If k is the depth of the tree, then all coefficients involving more than k features are zero.
     ■ Higher-order coefficients are exponentially smaller than the low-order coefficients (Kushilevitz and Mansour, 1990; Park and Kargupta, 2001).
     ■ The spectrum can be approximated by the low-order coefficients with significant magnitude.
     ■ Further details in [Linial, Mansour, Nisan, 89], [Park, Ayyagari, Kargupta, 01], [Kargupta et al., 2001].
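The sparsity and decay properties above suggest a simple approximation scheme: keep only the coefficients whose partition order is at most k and evaluate the truncated sum. This sketch reuses `spectrum` and `f` from the previous block and is illustrative only.

```python
def order(j):
    """Order of a partition = number of 1-s in the bit string j."""
    return sum(j)

def truncated_spectrum(spectrum, k):
    """Keep only coefficients of order <= k (the low-order part of the spectrum)."""
    return {j: w for j, w in spectrum.items() if order(j) <= k}

def evaluate(spectrum, x):
    """f(x) ~= sum_j w_j * (-1)^(j . x) over the retained coefficients."""
    return sum(w * (-1) ** (sum(jb * xb for jb, xb in zip(j, x)) % 2)
               for j, w in spectrum.items())

low = truncated_spectrum(spectrum, k=1)          # spectrum from the previous sketch
print(evaluate(low, (1, 0, 1)), f((1, 0, 1)))    # low-order approximation vs. exact value
```

For the toy AND function the order-1 truncation gives 0.75 where the exact value is 1, showing the accuracy/compression trade-off the slide describes.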

  10. Exponential Decay of FCs (S&P 500 Index Data)
     [Plot: magnitudes of the Fourier coefficients decay exponentially with coefficient order on the S&P 500 index data.]
     Compression
     [Plot: a sufficient spectrum; 99% of the energy is preserved in the lower-order coefficients.]
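One way to reproduce this kind of compression statistic is to measure the fraction of the total spectrum energy (sum of squared coefficients) captured up to each order. The sketch reuses `spectrum` from the earlier block; the 99% figure on the slide refers to the authors' S&P 500 experiment, not to this toy function.

```python
def energy_by_order(spectrum):
    """Cumulative fraction of total energy (sum of w_j^2) captured up to each order."""
    total = sum(w * w for w in spectrum.values())
    max_order = max(sum(j) for j in spectrum)
    cumulative = []
    for k in range(max_order + 1):
        kept = sum(w * w for j, w in spectrum.items() if sum(j) <= k)
        cumulative.append(kept / total)
    return cumulative

print(energy_by_order(spectrum))   # pick the smallest k whose fraction exceeds the target (e.g. 0.99)
```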

  11. Fourier Spectrum and Decision Trees
     [Diagram: Decision Tree ⇄ Fourier Spectrum.]
     ■ Developed efficient algorithms to
       – Compute the Fourier spectrum of a decision tree (IEEE TKDE, SIAM Data Mining Conf., IEEE Data Mining Conf., ACM SIGKDD Explorations)
       – Compute a tree from the Fourier spectrum (DMKD, SIGMOD 2002)
     Aggregation of Multiple Decision Trees
     F1(x) = Σ_j w_j^(1) Ψ_j(x),  F2(x) = Σ_j w_j^(2) Ψ_j(x),  F3(x) = Σ_j w_j^(3) Ψ_j(x)
     F(x) = a1*F1(x) + a2*F2(x) + a3*F3(x)
     ■ Weighted average of decision trees through Fourier analysis
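Because the Fourier transform is linear, the weighted combination F(x) = a1*F1(x) + a2*F2(x) + a3*F3(x) can be formed coefficient by coefficient on the spectra themselves. Below is a sketch using the dict-of-coefficients representation from the earlier blocks; the weights and toy spectra are assumptions.

```python
def aggregate_spectra(spectra, weights):
    """Spectrum of F = sum_i a_i * F_i: combine coefficients partition by partition."""
    combined = {}
    for a, spec in zip(weights, spectra):
        for j, w in spec.items():
            combined[j] = combined.get(j, 0.0) + a * w
    return combined

# Three per-tree spectra (toy values) combined with weights a1, a2, a3.
F = aggregate_spectra([{(1, 0): 0.5}, {(1, 0): 0.25, (0, 1): 0.5}, {(0, 1): -0.5}],
                      weights=[0.5, 0.3, 0.2])
print(F)   # {(1, 0): 0.325, (0, 1): 0.05}
```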

  12. Visualization of Decision Trees
     ■ FCs are color-coded according to their magnitude.
     ■ Brighter spots are more significant coefficients.
     ■ On clicking, the partition corresponding to the coefficient is displayed.
     PCA-Based Visualization of Decision Trees
     [Scatter plot: decision trees projected onto the 1st and 2nd principal components.]
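A hedged sketch of the kind of PCA projection shown on the slide: describe each tree by a vector of its Fourier coefficients over a shared list of partitions, then project onto the first two principal components. The coefficient vectors below are placeholders, and scikit-learn's PCA is assumed as the projection tool.

```python
import numpy as np
from sklearn.decomposition import PCA

# Each row: one decision tree, described by its Fourier coefficients over a
# fixed, shared list of partitions (toy numbers standing in for real spectra).
tree_vectors = np.array([
    [0.50, 0.20, -0.10, 0.05],
    [0.45, 0.25, -0.05, 0.00],
    [0.10, -0.30, 0.40, 0.20],
])
projected = PCA(n_components=2).fit_transform(tree_vectors)
print(projected)   # 2-D coordinates: 1st and 2nd principal components per tree
```

Trees with similar spectra land close together in the projection, which is what makes the plot useful for comparing models in an ensemble.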
