data services for scientific computing

Data Services for Scientific Computing. Tony Hey, Corporate Vice President, Microsoft Research



  1. Data Services for Scientific Computing. Tony Hey, Corporate Vice President, Microsoft Research

  2. Scientific Data
     • In 2000, the Sloan Digital Sky Survey collected more data in its 1st week than was collected in the entire history of astronomy
     • By 2016, the new Large Synoptic Survey Telescope in Chile will acquire 140 terabytes in 5 days, more than Sloan acquired in 10 years
     • The Large Hadron Collider at CERN generates 40 terabytes of data every second
     Sources: The Economist, Feb '10; IDC

  3. Global information and available storage
     [Chart: information created vs. available storage, in exabytes (0 to 2,000), 2005 to 2011, with the later years marked as forecast]
     1 exabyte = 1 million terabytes, equivalent to 10 billion copies of The Economist
     Source: IDC, as reported in The Economist, Feb 25, 2010

  4. Economics of Storage
     [Chart: cost per gigabyte, 2000 to 2010]
     • Hard drive storage (per gigabyte): $44.56 falling to $0.07
     • Web storage (per gigabyte): $1,250 falling to $0.15
     Source: Wired Magazine, April 2010. Figures in USD

  5. Cost per Genome
     [Chart: cost axis values $3,000,000,000; $60,000,000; $1,000,000; $48,000; $10,000; $2,500; $500; $100]
     Callouts: $3 billion per genome; $45,000 per genome; $500-$10,000 per genome; $100 per genome?
     Source: George Church, Harvard Medical School, as reported in IEEE Spectrum, Feb '10. Figures in USD

  6. Moore's Law is alive and well... but a hardware issue just became a software problem
     [Chart: transistors (in thousands), frequency (MHz), and cores on a log scale from 1.E-01 to 1.E+07, 1970 to 2010]
     Source: Jack Dongarra, Kunle Olukotun, Lance Hammond, Herb Sutter, Burton Smith, Chris Batten, Krste Asanovic, and Kathy Yelick

  7. Computing Tools for Big Data
     Scientific Workflow Workbench (Trident)
     • Built on top of Windows Workflow Foundation
     • Visually program workflows with the use of libraries of activities and workflows
     • Scale from desktops to HPC clusters
     • Distribution: moving work closer to the data source
     • Workflow sharing in the myExperiment social Web site for researchers
     Version 1.2 available for download on CodePlex (Apache 2.0 open source)
     Dryad and DryadLINQ
     • Programming models for writing distributed data-parallel applications that scale from a small cluster to a large data center
     • A DryadLINQ programmer can use thousands of machines, each of them with multiple processors or cores, without prior knowledge of parallel programming
     Academic release available for download

  8. Dryad
     • Continuously deployed since 2006
     • The execution engine for Bing analytics
     • Running on >> 10^4 machines
     • Runs on clusters > 3000 machines
     • Sifting through > 10 PB of data daily

  9. Dryad & DryadLINQ
     [Layer diagram, top to bottom]
     • DryadLINQ: high-level language API (C#)
     • Dryad: dataflow graph as the computation model, distributed execution, fault-tolerance, scheduling
     • Cluster Services: remote process execution, naming, storage
     • Windows Server (one per machine)

  10. DryadLINQ leverages LINQ's extensibility
      • LINQ: Microsoft's Language INtegrated Query
      • Released with .NET Framework 3.5, extremely extensible
      [Diagram: a .NET program (C#, VB, F#, etc.) calls the query interface; a LINQ provider hands the query to an execution engine]
      • Execution engines: LINQ-to-Objects, LINQ-to-SQL, LINQ-to-XML, PLINQ, DryadLINQ
      • Scalability: single-core, multi-core (PLINQ), cluster (DryadLINQ); from local machine to cluster
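As a rough illustration of the provider idea on this slide, here is a minimal Python sketch. It is an analogy, not LINQ or DryadLINQ code: the query is written once, and the caller swaps in the execution strategy. The function `word_count`, its `map_fn` parameter, and the sample data are assumptions made for this example, and the thread pool merely stands in for a different execution engine.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def word_count(lines, map_fn=map):
    """Count words across lines; map_fn decides how the per-line step runs."""
    partial_counts = map_fn(lambda line: Counter(line.split()), lines)
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total

lines = ["a b a", "b c", "a"]  # made-up sample data

# Same query, two execution strategies.
sequential = word_count(lines)  # built-in map, runs in the calling thread
with ThreadPoolExecutor(max_workers=2) as pool:
    threaded = word_count(lines, map_fn=pool.map)  # per-line step runs on pool threads

print(sequential == threaded)
```

The query body never changes; only the object that executes the per-line step does, which is the property the slide attributes to LINQ providers.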

  11. WorldWide Telescope - TeraPixel
      Challenge: Create the largest, clearest seamless image of the sky
      Digitized Sky Survey (DSS)
      • Produced photographic plates of overlapping regions of the sky
      • 1,791 pairs of red-light and blue-light images acquired from two telescopes
      • Scanned over a 15-year period into 3,120,100 files, 417 GB
      Create Spherical Image
      1. Create color plates from DSS data
      2. Stitch and smooth images
      3. Create sky image pyramid for WWT
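To give a feel for the size of the image pyramid in step 3, the sketch below counts levels and tiles for a quadtree pyramid. The slide does not state the tile size or image dimensions, so the 256-pixel square tiles, the doubling of resolution per level, and the image width of 2^20 pixels (about 1.1 terapixels when squared) are all assumptions chosen for illustration.

```python
def pyramid_levels(width_px: int, tile_px: int = 256) -> int:
    """Deepest level index needed so that 2**level tiles span width_px."""
    tiles_across = -(-width_px // tile_px)   # ceiling division
    return (tiles_across - 1).bit_length()   # smallest n with 2**n >= tiles_across

def total_tiles(levels: int) -> int:
    """Tiles in a full quadtree pyramid: 4**0 + 4**1 + ... + 4**levels."""
    return sum(4 ** level for level in range(levels + 1))

width = 2 ** 20                  # hypothetical image width in pixels
deepest = pyramid_levels(width)  # level 0 is a single tile covering everything
print(deepest, total_tiles(deepest))
```

Under these assumptions the deepest level alone holds 4^12 tiles, which suggests why pyramid generation is listed as its own processing step.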

  12. WorldWide Telescope - TeraPixel: Computational and Data Intensive
      Create RGB color plates from DSS data
      • Vignetting correction (red, blue)
      • Astrometric alignment
      • Statistical analysis (saturation & noise floor)
      • Colored plate creation
      Stitch and smooth images
      • Project sphere image onto plane
      • Distributed gradient-domain processing
      Create sky image pyramid for WWT
      • Tiled multi-resolution
      Large-scale data aggregation easily performed with an integrated set of technologies
      • DryadLINQ => concise code
      • .NET Parallel Extension => faster decompression of DSS data
      • DryadLINQ + Windows HPC => efficient and robust execution
      Managed and coordinated by Project Trident: A Scientific Workflow Workbench

  13. Workflows for Processing Data in Parallel
      Local Desktop Machine (process automation and reruns)
      • Collecting user inputs
      • Staging data across the HPC cluster
      • Using DryadLINQ for parallel processing
      • Post processing
      HPC Cluster (processing data in parallel, e.g. generating color images)
      • Executing the workflow in parallel on the HPC cluster
      • Trident workflow runtime close to data on each node
      Data partition: \UserData\Terapixel\All\Part
        1791
        0,56, MSR-SCR-Dryad1
        1,56, MSR-SCR-Dryad4
        2,56, MSR-SCR-Dryad5
        ……
        1790, 56, MSR-SCR-Dryad32
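The partition listing above maps each of the 1,791 partitions to a cluster node. A minimal sketch of one way to build such a mapping follows. Round-robin placement is an assumption (the listing on the slide is visibly not round-robin), the node names are invented, and the node count of 64 is borrowed from the deployment slide.

```python
from collections import Counter

def assign_round_robin(num_partitions, nodes):
    """Return (partition_index, node) pairs, cycling through nodes in order."""
    return [(i, nodes[i % len(nodes)]) for i in range(num_partitions)]

nodes = [f"node-{n:02d}" for n in range(1, 65)]  # 64 hypothetical node names
table = assign_round_robin(1791, nodes)

# How evenly is the work spread?
load = Counter(node for _, node in table)
print(table[0], table[-1], min(load.values()), max(load.values()))
```

With 1,791 partitions on 64 nodes, every node ends up with 27 or 28 partitions, so no node waits long for stragglers if the partitions cost roughly the same to process.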

  14. Deployment Architecture: Generating RGB color plates
      • Generation of 1,791 plates with 64 compute nodes
      • Processing time: 5 hours
      • Input: 417 GB (compressed, 4 TB uncompressed)
      • Output: 790 GB (approx. 450 MB/plate)
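The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below assumes decimal units (1 GB = 10^9 bytes) and treats the 5 hours as wall-clock time for the whole run; neither assumption is stated on the slide.

```python
GB = 10 ** 9   # decimal units, an assumption
TB = 10 ** 12

plates = 1791
nodes = 64
hours = 5
input_compressed = 417 * GB
input_uncompressed = 4 * TB
output_total = 790 * GB

plates_per_node = plates / nodes                             # average work per node
plates_per_hour = plates / hours                             # cluster-wide throughput
output_per_plate_mb = output_total / plates / 10 ** 6        # average plate size
compression_ratio = input_uncompressed / input_compressed
read_rate_mb_s = input_uncompressed / (hours * 3600) / 10 ** 6  # uncompressed input per second

print(f"{plates_per_node:.1f} plates per node")
print(f"{plates_per_hour:.1f} plates per hour")
print(f"{output_per_plate_mb:.0f} MB per plate")
print(f"{compression_ratio:.1f}x compression")
print(f"{read_rate_mb_s:.0f} MB/s of uncompressed input")
```

The average of about 441 MB per plate agrees with the slide's "approx. 450 MB/plate".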

  15. WorldWide Telescope - TeraPixel
      Result: Largest, clearest, and smoothest sky image in the world
      Special thanks to
      • Brian McLean (Space Telescope Science Institute)
      • Misha Kazhdan (Johns Hopkins University), Hugues Hoppe (MSR), and Dinoj Surendran (MSR)
      • Dean Guo (MSR), Christophe Poulain (MSR)
      • Aditi Team

  16. Cloud Computing: One Definition
      For the US National Institute of Standards and Technology (NIST), Cloud Computing means:
      • On-demand service
      • Broad network access
      • Resource pooling
      • Flexible resource allocation
      • Measured service

  17. Microsoft's Datacenter Evolution
      • Generation 1: Datacenter Co-Location
      • Generation 2: Quincy and San Antonio
      • Generation 3: Chicago and Dublin
      • Generation 4: Modular Datacenter
      [Diagram labels: Facility PAC, Server, Capacity, Time to Market, Lower TCO]

  18. Cloud Options

  19. Cloud Services
      • Infrastructure as a Service (IaaS) – Provide a way to host virtual machines on demand
      • Platform as a Service (PaaS) – You write an application to Cloud APIs and the platform manages and scales it for you
      • Software as a Service (SaaS) – Delivery of software to the desktop from the Cloud

  20. Azure Programming Model
      [Diagram labels: Public Internet; Load Balancer; Front-end Web Role(s); Worker Role; Azure Services (storage); Switches; Load-balancers]
      • In-band communication – software control
      • Highly-available Fabric Controller

  21. MODIS Azure: Computing Evapotranspiration (ET) in the Cloud A pipeline for download, processing, and reduction of diverse NASA MODIS satellite imagery. Contributors: Catharine van Ingen (MSR), Youngryel Ryu (UC Berkeley), Jie Li (Univ. of Virginia)

  22. MODIS Azure
      • Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and by transpiration, that is, evaporation through plant membranes.
      • Climate change isn't just about a change in temperature; it's also about a change in the water balance and hence the water supply, which is critical to human activity.
      Source: Youngryel Ryu's PhD project

  23. MODIS Azure
      Aqua, Terra: Time series raster data, 36 spectral bands, 1-2d
      • Over some period of time at some time frequency at some spatial granularity over some spatial area
      • Conversion from L0 data to L2 and beyond as well as reprojection

  24. MODIS Azure: Four Stage Image Processing Pipeline
      Data collection stage
      • Downloads requested input tiles from NASA FTP sites
      • Includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile
      Reprojection stage
      • Converts source tile(s) to intermediate result sinusoidal tiles
      • Simple nearest neighbor or spline algorithms
      Derivation reduction stage
      • First stage visible to scientist
      • Computes ET in our initial use
      Analysis reduction stage
      • Optional second visible stage
      • Enables production of science analysis artifacts such as maps
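The reprojection stage mentions simple nearest neighbor resampling. The sketch below shows that technique on a tiny grid and is not the project's code. The function name `reproject_nearest` and the coordinate mapping passed to it are hypothetical; a real reprojection would supply a map-projection transform (for example, into the sinusoidal grid) in place of the toy 2x upsample used here.

```python
import math

def reproject_nearest(src, dst_rows, dst_cols, dst_to_src):
    """Fill a dst_rows x dst_cols grid by sampling the nearest source cell.

    dst_to_src(r, c) returns fractional (row, col) source coordinates.
    Cells that map outside the source grid are set to None.
    """
    src_rows, src_cols = len(src), len(src[0])
    out = []
    for r in range(dst_rows):
        row = []
        for c in range(dst_cols):
            sr, sc = dst_to_src(r, c)
            i = math.floor(sr + 0.5)  # round half up to the nearest index
            j = math.floor(sc + 0.5)
            inside = 0 <= i < src_rows and 0 <= j < src_cols
            row.append(src[i][j] if inside else None)
        out.append(row)
    return out

source_tile = [[1, 2],
               [3, 4]]

# Hypothetical mapping: a 2x upsample aligned on cell centres.
upsampled = reproject_nearest(
    source_tile, 4, 4, lambda r, c: (r / 2 - 0.25, c / 2 - 0.25)
)
for row in upsampled:
    print(row)
```

Nearest neighbor copies source values unchanged, which keeps categorical or flagged pixels intact; a spline, the other option named on the slide, would interpolate between them instead.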

  25. MODIS Azure: Architectural Overview
      [Diagram labels: <PipelineStage> Request; MODISAzure Service (Web Role); <PipelineStage>Job Queue; Persist <PipelineStage>JobStatus; Service Monitor (Worker Role); Parse & Persist <PipelineStage>TaskStatus; Dispatch; <PipelineStage>Task Queue]
      MODISAzure Service is the Web Role front door
      • Receives all user requests
      • Queues request to appropriate Download, Reprojection, or Reduction Job Queue
      Service Monitor is a dedicated Worker Role
      • Parses all job requests into tasks – recoverable units of work
      • Execution status of all jobs and tasks persisted in Tables
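The flow on this slide (a front door queues jobs, a monitor splits jobs into tasks, and status is persisted throughout) can be sketched in memory as below. This is not Azure code: Python's `queue.Queue` and a plain dictionary stand in for the Azure queues and tables, the function names and ids are invented, and re-queueing a failed task is only one assumed reading of "recoverable units of work".

```python
from queue import Queue

job_queue = Queue()   # stands in for a <PipelineStage>Job Queue
task_queue = Queue()  # stands in for a <PipelineStage>Task Queue
status = {}           # stands in for the JobStatus / TaskStatus tables

def submit_job(job_id, tiles):
    """Front door (Web Role): record the job and queue it."""
    status[job_id] = "queued"
    job_queue.put((job_id, tiles))

def monitor_step():
    """Service Monitor (Worker Role): split one job into per-tile tasks."""
    job_id, tiles = job_queue.get()
    for tile in tiles:
        task_id = f"{job_id}/{tile}"
        status[task_id] = "pending"
        task_queue.put(task_id)
    status[job_id] = "dispatched"

def worker_step(process):
    """Run one task; on failure, re-queue it so the work can be retried."""
    task_id = task_queue.get()
    try:
        process(task_id)
        status[task_id] = "done"
    except Exception:
        status[task_id] = "pending"
        task_queue.put(task_id)

submit_job("reprojection-1", ["tile-a", "tile-b"])  # hypothetical ids
monitor_step()
while not task_queue.empty():
    worker_step(lambda task_id: None)  # placeholder for real processing

print(status)
```

Because each task carries its own status entry, a crashed worker loses at most one tile's worth of work, which is the benefit of parsing jobs into small tasks.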

  26. Cost of Computing One Year of ET for the US
      • Computational costs driven by data scale and the need to run reduction multiple times
      • Storage costs driven by data scale and the 12-month project duration
      • Total: $1420
      • Small with respect to the people costs, even at graduate student rates!
