
Roadmap Distributed Data Mining: Why Bother? Distributed - PDF document

Distributed Data Mining: Current Pleasures and Emerging Applications. Hillol Kargupta, University of Maryland, Baltimore County and AGNIK


  1. Distributed Data Mining: Current Pleasures and Emerging Applications
     Hillol Kargupta, University of Maryland, Baltimore County and AGNIK
     www.cs.umbc.edu/~hillol
     Acknowledgements: Wes Griffin, Souptik Datta, Kanishka Bhaduri, Kamalika Das, Ran Wolff, Chris Giannella

     Roadmap
     - Distributed Data Mining: Why Bother?
     - Some Emerging Applications
     - Local Algorithms
     - Exact Local Algorithms
     - Approximate Local Algorithms
     - Resources

  2. Data Mining and Distributed Data Mining
     - Data Mining: Scalable analysis of data by paying careful attention to the resources:
       - computing,
       - communication,
       - storage, and
       - human-computer interaction.
     - Distributed data mining (DDM): Mining data using distributed resources.

     Data Mining for Distributed and Ubiquitous Environments: Applications
     - Mining Large Databases from distributed sites
     - Grid data mining in Earth Science, Astronomy, Counter-terrorism, Bioinformatics
     - Monitoring Multiple time critical data streams
     - Monitoring vehicle data streams in real-time
     - Monitoring physiological data streams
     - Analyzing data in Lightweight Sensor Networks and Mobile devices
     - Limited network bandwidth
     - Limited power supply
     - Preserving privacy
     - Security/Safety related applications
     - Peer-to-peer data mining
     - Large decentralized asynchronous environments

  3. Vehicles: Source of High Volume Data Streams
     - Vehicles generate tons of data
     - Hundreds of different parameters from different subsystems
     - High throughput data streams
     - So what?

     Why Mine Vehicle Data?
     - Fuel consumption analysis
     - Fleet analytics
     - Vehicle benchmarking
     - Predictive health-monitoring
     - Driver behavior analytics

     Slide callouts: High gas prices. Breakdowns cost thousands of dollars. Bad driving costs money: fuel, brake shoe, insurance, law-suits.
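
The slide stops at "high throughput data streams", so as an illustration only, here is a minimal sketch of how a single vehicle parameter could be summarized on the fly without storing the stream. It uses Welford's online mean/variance update, which is a swapped-in standard technique and not something the slides describe. The `RunningStats` class and the fuel-rate readings are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Constant-memory mean/variance for one stream parameter (Welford's method)."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def update(self, value: float) -> None:
        # Fold one new reading into the summary; the reading itself is not kept.
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)

    @property
    def variance(self) -> float:
        # Sample variance; returns 0.0 when fewer than two readings were seen.
        return self.m2 / (self.count - 1) if self.count > 1 else 0.0


# Hypothetical fuel-rate readings; the values are illustrative only.
fuel_rate = RunningStats()
for reading in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    fuel_rate.update(reading)
```

Each parameter needs only three numbers of state, so the same summary could in principle be kept for hundreds of parameters at once.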

  4. From Concept to Commercial Product
     - First prototype: PDA-based platform
       - Other choices: cell phones and low-cost, less powerful embedded devices
       - Circa 2001
     - Market Entry Point
       - Circa 2005
       - Location management companies
       - M2M companies
     - Low Cost Embedded GPS Devices
       - Resource constrained
       - 3-4K run time memory
       - Circa 2007
       - 250K footprint
       - Resource sharing with GPS program

     Private & Secure Data Mining from Multi-Party Distributed Data
     - Compute global patterns without direct access to the multi-party raw distributed data
     - Minimize communication cost
     - Must come with provably correct guarantees with respect to a given privacy model
     - Must be scalable with respect to
       - number of data sites
       - size of the data
     - Privacy-preserving data mining
       - Blends in "pattern-preserving" transformations with data analysis
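
To make "compute global patterns without direct access to the multi-party raw distributed data" concrete, here is a minimal sketch of a secure sum based on additive secret sharing. It is a textbook construction, not necessarily the approach behind these slides. It assumes honest-but-curious sites that do not collude, and it simulates all sites in one process. `MODULUS`, `make_shares`, `secure_sum`, and the site counts are hypothetical names and values.

```python
import secrets
from typing import List

MODULUS = 2**61 - 1  # assumption: larger than any possible true total


def make_shares(value: int, n_sites: int) -> List[int]:
    """Split value into n_sites random shares that sum to value mod MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_sites - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def secure_sum(local_values: List[int]) -> int:
    """Simulate the protocol in one process; real sites would exchange shares over a network."""
    n = len(local_values)
    # shares[i][j] is the share that site i sends to site j.
    shares = [make_shares(v, n) for v in local_values]
    # Each site j adds the shares it received and publishes only that partial sum.
    partial_sums = [sum(shares[i][j] for i in range(n)) % MODULUS for j in range(n)]
    # The published partial sums reveal the total but not any single site's value.
    return sum(partial_sums) % MODULUS


# Hypothetical local counts held by three sites.
site_counts = [120, 45, 310]
total = secure_sum(site_counts)
```

The sketch also shows the tension with "minimize communication cost": every site sends one share to every other site, so the message count grows quadratically with the number of sites.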

  5. How PURSUIT Works for the User
     - Need to have your own sensor such as SNORT, MINDS
     - Download PURSUIT plug-in for the sensor and install
     - PURSUIT plug-in offers
       - A stand-alone interface for processing your alerts from the sensor and cross-domain analysis
       - Web account for detailed cross-domain statistics
       - Optional distributed collaboration management module for managing the threats and archiving forensics

     Figure: PURSUIT Web Site

  6. Peer-to-peer (P2P) Networks
     - Relies primarily on the computing resources of the participants in the network rather than a relatively low number of servers.
     - P2P networks are typically used for connecting nodes via largely ad hoc connections.
     - No central administrator/coordinator
     - Peers simultaneously function as both "clients" and "servers"
     - Privacy is an important issue in most P2P applications

     Where do we find P2P Networks?
     - Applications:
       - File-sharing networks: KaZaA, Napster, Gnutella
       - P2P network storage, web caching,
       - P2P bio-informatics,
       - P2P astronomy,
       - P2P Information retrieval
       - P2P Sensor Networks?
       - P2P Mobile Ad-hoc NETwork (MANET)?
     - Next Generation:
       - P2P Search Engines, Social Networking, Digital libraries, P2P "YouTube"?
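
As a minimal sketch of computing without a central coordinator, the snippet below simulates pairwise gossip averaging, in which peers talk only to their direct neighbors. This is a generic technique chosen for illustration; it is not one of the local algorithms listed in the talk's roadmap, which the slides shown here do not describe. The function `gossip_average`, the ring topology, and the values are hypothetical.

```python
import random
from typing import Dict, List


def gossip_average(values: List[float], neighbors: Dict[int, List[int]],
                   rounds: int, seed: int = 0) -> List[float]:
    """Pairwise gossip: a random peer and one of its neighbors replace
    their estimates with the pair's mean, which preserves the global sum."""
    rng = random.Random(seed)  # seeded so the simulation is repeatable
    estimates = list(values)
    for _ in range(rounds):
        i = rng.randrange(len(estimates))
        j = rng.choice(neighbors[i])
        pair_mean = (estimates[i] + estimates[j]) / 2.0
        estimates[i] = estimates[j] = pair_mean
    return estimates


# Hypothetical 6-peer ring; each peer only knows its two neighbors.
n_peers = 6
ring = {i: [(i - 1) % n_peers, (i + 1) % n_peers] for i in range(n_peers)}
local_values = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
estimates = gossip_average(local_values, ring, rounds=2000)
```

No peer ever sees the full data set, yet with enough exchanges every estimate drifts toward the global mean (35.0 for these values). A real deployment would be asynchronous and would have to handle peers joining and leaving, which this simulation ignores.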

  7. P2P Web Mining
     - Web mining in a server-less environment
     - Useful Browser Data
       - Web-browser history
       - Browser cache
       - Click-stream data stored at browser (browsing pattern)
       - Search queries typed in the search engine
       - User profile
       - Bookmarks
     - Challenges
       - Indexing, clustering, data analysis in a decentralized asynchronous manner
       - Scalability
       - Privacy
