Parallel Clustering for Visualizing Large Scientific Line Data - PowerPoint PPT Presentation

Parallel Clustering for Visualizing Large Scientific Line Data
Jishang Wei, University of California, Davis
Hongfeng Yu, Sandia National Laboratories
Jacqueline H. Chen, Sandia National Laboratories
Kwan-Liu Ma, University of California, Davis

Background
• Line data in scientific simulations and experiments
  – Line: an ordered sequence of multi-dimensional data points
  – Examples: vector field lines, white matter fibers, time series curves
Image credits: Jeffrey Heer, Michael Bostock, and Vadim Ogievetsky, A Tour Through the Visualization Zoo, 2010; O. Mallo, R. Peikert, C. Sigg, F. Sadlo, Illuminated Lines Revisited, 2005; generated by Pierre Fillard, Neurospin CEA.

Motivation
• Challenges in visualizing large line data
  – Visual clutter: cluster first, then visualize
  – Large data: use a parallel machine to handle the heavy workload
• Our contribution
  – A parallel design of model-based clustering for categorizing and visualizing large line data with multiple CPUs and GPUs
Image credits: Chaoli Wang, Hongfeng Yu, and Kwan-Liu Ma, Importance-Driven Time-Varying Data Visualization, 2008; http://www.absoluteastronomy.com/topics/Drishti; O'Donnell, Cerebral White Matter Analysis Using Diffusion Imaging, 2006.

Model-based Clustering
• What is model-based clustering?
  – Assume that the data can be divided into K groups, each with a probabilistic model that describes the data within it
  – Recover the model parameters from the data
  – Assign each data object to the cluster with the highest probability
• Why model-based clustering?
  – Clusters lines of different lengths
  – Processes large data efficiently
• Model-based clustering of line data
  – Polynomial regression model
  – Recover the model parameters using the Expectation-Maximization algorithm
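The assignment rule above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes each line is a list of (t, y) pairs with a scalar value y, that each cluster is a polynomial in t with Gaussian noise, and that the model parameters are already known. The names `poly_eval`, `line_log_likelihood`, `memberships`, and `assign` are hypothetical.

```python
import math

def poly_eval(coeffs, t):
    """Evaluate c0 + c1*t + c2*t^2 + ... at t using Horner's rule."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * t + c
    return result

def line_log_likelihood(line, coeffs, sigma2):
    """Log-likelihood of one line under y = poly(t) + Gaussian noise (variance sigma2).
    The sum runs over however many points the line has, so lengths may differ."""
    ll = 0.0
    for t, y in line:
        r = y - poly_eval(coeffs, t)
        ll += -0.5 * math.log(2 * math.pi * sigma2) - r * r / (2 * sigma2)
    return ll

def memberships(line, models):
    """Posterior probability of each cluster for one line.
    models: list of (mixing_weight, coeffs, sigma2) tuples (assumed layout)."""
    logs = [math.log(w) + line_log_likelihood(line, c, s2) for w, c, s2 in models]
    top = max(logs)  # subtract the maximum before exponentiating, for stability
    exps = [math.exp(v - top) for v in logs]
    total = sum(exps)
    return [e / total for e in exps]

def assign(line, models):
    """Index of the cluster with the highest membership probability."""
    p = memberships(line, models)
    return max(range(len(p)), key=p.__getitem__)

# Two hypothetical clusters: y = t and y = 1 - t.
models = [(0.5, [0.0, 1.0], 0.1), (0.5, [1.0, -1.0], 0.1)]
rising = [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]
falling = [(0.0, 1.0), (0.25, 0.8), (0.5, 0.5), (0.75, 0.2), (1.0, 0.0)]
print(assign(rising, models), assign(falling, models))
```

The two example lines have three and five points, which shows why a per-point likelihood handles lines of different lengths without resampling them to a common size.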

Parallel Model-based Clustering
• Distribute line data to multiple compute nodes
  – Keep the workload balanced and minimize communication costs between compute nodes
  – Use a sorted balancing algorithm to ensure that the total number of data points on each compute node is roughly the same
• Preprocess line data on each compute node
  – Smooth and sample the local lines on each compute node
  – Use GPUs to accelerate the preprocessing
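The slides do not spell out the sorted balancing algorithm. One plausible reading, shown here only as an assumption, is the classic greedy scheme: sort lines by point count in descending order, then give each line to the node that currently holds the fewest points. The function name `balance_lines` is hypothetical.

```python
import heapq

def balance_lines(line_lengths, num_nodes):
    """Greedy sorted balancing (an assumed variant, not confirmed by the slides).
    line_lengths[i] is the number of points in line i.
    Returns, for each node, the list of line indices assigned to it."""
    heap = [(0, node) for node in range(num_nodes)]  # (current load, node id)
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_nodes)]
    # Visit the longest lines first so the short ones can even out the loads.
    order = sorted(range(len(line_lengths)),
                   key=lambda i: line_lengths[i], reverse=True)
    for i in order:
        load, node = heapq.heappop(heap)  # least-loaded node so far
        assignment[node].append(i)
        heapq.heappush(heap, (load + line_lengths[i], node))
    return assignment

lengths = [10, 9, 8, 7, 6, 5, 4, 3]
parts = balance_lines(lengths, 2)
print([sum(lengths[i] for i in part) for part in parts])
```

Balancing on point counts rather than line counts matters because the per-line cost of smoothing, sampling, and the E-step grows with the number of points in the line.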

Parallel Model-based Clustering
• Cluster lines using multiple CPUs
  – On each compute node, initialize the K component model parameters
  – Iterate between two steps
    • Expectation step: on each compute node, estimate the local lines' probabilistic membership in the different clusters
    • Maximization step: on each compute node, calculate the K model parameters globally
  – Assign each local line to the cluster with the highest membership probability on each CPU node
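The slides do not say how the global maximization step is communicated. A common pattern, assumed here, is for each node to accumulate membership-weighted least-squares terms over its local lines, sum those terms across nodes (the role an all-reduce would play on a real cluster), and then solve the same small system everywhere. The sketch below simulates two nodes in one process for a single cluster; `local_stats`, `reduce_stats`, and `solve` are hypothetical names.

```python
def local_stats(lines, resp, degree):
    """Weighted normal-equation terms for one cluster on one node.
    lines: list of lines, each a list of (t, y); resp[i]: membership of line i."""
    n = degree + 1
    A = [[0.0] * n for _ in range(n)]
    b = [0.0] * n
    for line, w in zip(lines, resp):
        for t, y in line:
            powers = [t ** k for k in range(n)]
            for i in range(n):
                b[i] += w * powers[i] * y
                for j in range(n):
                    A[i][j] += w * powers[i] * powers[j]
    return A, b

def reduce_stats(stats):
    """Element-wise sum over nodes; stands in for an all-reduce."""
    n = len(stats[0][1])
    A = [[sum(s[0][i][j] for s in stats) for j in range(n)] for i in range(n)]
    b = [sum(s[1][i] for s in stats) for i in range(n)]
    return A, b

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

# Two simulated nodes whose lines all lie on y = 2t + 1.
node0 = local_stats([[(0.0, 1.0), (1.0, 3.0)]], [1.0], degree=1)
node1 = local_stats([[(2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]], [0.5], degree=1)
A, b = reduce_stats([node0, node1])
print(solve(A, b))  # coefficients [c0, c1] of the fitted polynomial
```

Under this assumption only the small (degree + 1) by (degree + 1) matrices travel between nodes, never the lines themselves, which fits the stated goal of minimizing communication costs.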

Experiment Settings
• Cluster: 8 compute nodes, each containing
  – One Intel quad-core 3.00 GHz CPU with 4 GB of memory
  – One NVIDIA GeForce GTX 285 GPU
• Datasets:
  – 10,000 streamlines from the vector field of a solar plume simulation
  – 1,000,000 time series curves correlating multiple variables generated from a combustion simulation

Case  Data set     Number of lines  Number of compute nodes
                                    1  2  3  4  5  6  7  8
1     solar plume  10,000           X  X  X  X  X  X  X  X
2     combustion   10,000           X  X  X  X  X  X  X  X
3     combustion   100,000          X  X  X  X  X  X  X  X
4     combustion   1,000,000                 X  X  X  X  X

Table: Setup of experiments. Entries marked with "X" represent experiment runs.

Clustering Performance Results
[Figure: eight plots showing smoothing time, resampling time, E-step time, and M-step time for Case 1 and for Case 4.]
Speedups of the scalability study. In each plot, the horizontal axis is the number of nodes and the vertical axis is the running time in seconds; one curve shows the real speed-up time and the other the ideal speed-up time.

Clustering Performance Results
[Figure: eight plots of per-node running time, each labeled with its difference ratio.]
  Case 1: smoothing time (0.53%), resampling time (1.64%), E-step time (0.11%), M-step time (0.03%)
  Case 4: smoothing time (3.46%), resampling time (2.09%), E-step time (0.16%), M-step time (0.01%)
Workloads among 8 nodes for Cases 1 and 4. In each plot, the horizontal axis represents the node ID, and the vertical axis represents the running time in seconds. The percentage associated with each plot is the difference ratio $dr = (\text{max time} - \text{min time}) / \text{max time}$ between the maximum and minimum times among the nodes.
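The difference ratio is simple enough to compute directly from per-node timings. The sketch below follows the formula in the caption; the function name `difference_ratio` and the sample timings are made up for illustration and are not measurements from the slides.

```python
def difference_ratio(times):
    """dr = (max time - min time) / max time over the per-node running times."""
    slowest = max(times)
    fastest = min(times)
    return (slowest - fastest) / slowest

# Hypothetical running times in seconds for three nodes.
print(f"{difference_ratio([10.0, 9.5, 9.8]):.2%}")
```

A ratio near zero means the slowest and fastest nodes finished at almost the same time, which is what a well-balanced distribution of data points should produce.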

Visualization Results
[Figure: panels (a)-(i).]
Visualization of the streamlines generated from the solar plume velocity vector field. (a) shows the overview of all 10,000 streamlines. (b)-(i) show the eight different groups of streamlines.

Visualization Results
[Figure: panels (a)-(o).]
Visualization of the time series curves relating two variables, mixture fraction (the red axis) and temperature (the green axis), in the combustion simulation. (a) shows the overview of all 100,000 time series curves. (b)-(o) show the fourteen different groups of time series curves.
