Big ideas + big data = real-life benefits. Thursday 27 October 2016. synchrotron.org.au
Big Data at the Australian Synchrotron. Professor Andrew Peele, Director, Australian Synchrotron, and ANSTO Representative in Victoria. synchrotron.org.au
Australian Nuclear Science and Technology Organisation. ANSTO is a public research organisation with a variety of roles for the nation and operates Australia's multipurpose nuclear reactor. Its roles span: expert advice and support to Government and international agencies; research and innovation in science and engineering; and commercial businesses.
Australia's National Research Priorities and ANSTO Research Infrastructure.
National priorities: advanced manufacturing; environmental change; health; cyber security; soil and water; transport; food; energy; resources.
ANSTO research themes: isotope tracing in natural systems; radiobiology and bioimaging; radiotracers and radioisotopes; nuclear stewardship; materials characterisation.
Landmark and National Research Infrastructure: • OPAL multi-purpose reactor • Australian Centre for Neutron Scattering • Australian Synchrotron • Centre for Accelerator Science • National Deuteration Facility
Multi-site organisation: Clayton VIC, Lucas Heights NSW, Camperdown NSW.
Life-changing pharmaceutical breakthroughs. Several drugs have been developed following structural studies and target screening at the Australian Synchrotron and are now in clinical trials:
• Venetoclax, developed by WEHI, Genentech and Abbott, for treatment of chronic lymphocytic leukaemia
• CSL362, developed by St Vincent's Institute of Medical Research and CSL, for treatment of acute myeloid leukaemia
• Momelotinib, developed by Gilead Sciences, for treatment of myelofibrosis
• Nexvax2, developed by Monash University with ImmusanT, for treatment of celiac disease
• Solanezumab, developed by St Vincent's Institute, for treatment of Alzheimer's disease
• PRMT5 inhibitors, developed by Cancer Therapeutics CRC with Merck, for treatment of melanoma, breast cancer and pancreatic cancer cells
Infrastructure for researchers
Merit beamtime (80%): • Free of charge to users • Travel and accommodation paid • Expectation to publish
Facility time (20%), including commercial access.
[Chart: shifts requested vs shifts awarded (0 to 900) for each beamline: Far-IR, IMBL, IRM, MX1/MX2, PD, SAXS, SXR, XAS, XFM.]
Infrastructure for researchers
Three application rounds per year. Operates 24/7 (apart from maintenance periods).
Access is peer reviewed based on merit, consistent with international best practice: quality of the proposal and the need for synchrotron radiation (40%), national benefit (30%), and the track record of the applicants (30%).
All facilities are oversubscribed: more than 5600 researcher visits per year and around 1000 experiments, with a success rate for applications of about 60%. That is about right for competition to breed excellence.
Our current 10 operational beamlines (capacity for 30+ beamlines):
• IRM: Infrared Microscope
• Far-IR: Terahertz / Far-IR Spectroscopy
• MX2: Micro-focused Macromolecular Crystallography
• MX1: Macromolecular Crystallography
• XFM: X-ray Fluorescence Microscopy (4–25 keV)
• IMBL: Imaging and Medical Beamline (30–120 keV)
• PD: Powder Diffraction (4–37 keV)
• XAS: X-ray Absorption Spectroscopy (4–50 keV)
• SAXS/WAXS: Small Angle / Wide Angle X-ray Scattering (6–20 keV)
• SXR: Soft X-ray Spectroscopy and Imaging (90–2500 eV)
Managing Big Data at the Australian Synchrotron. Dr Andreas Moll, Senior Scientific Software Engineer. synchrotron.org.au
Flavours of Big Data: data volume. Imaging and Medical Beamline: ~146 TB. X-ray Fluorescence Microscopy beamline: ~270 TB.
Flavours of Big Data: single images. 1 Gigapixel image: 40 × 9 mm = 66667 × 15000 pixels (600 nm pixel size); raw data 250 GB; scan time 38 hrs. Petrographic section of high-grade ore from the western shear zone of the Sunrise Dam gold deposit, WA; Sr:Fe:Rb map. Fisher et al., Miner. Deposita 50, 665-674 (2015).
Flavours of Big Data: data rate. Micro Crystallography (MX2) beamline. [Figure: sample orientation and the resulting diffraction pattern.] Data acquisition took 15 minutes; the next iteration of the detector will take 18 seconds and can produce raw data at ~4 GB/s!
Dealing with Big Data
Big Data definition: a volume of data that is too large or too complex to process by simple means, hence requiring significant investments in IT infrastructure, workflows and tools to capture, store, transfer, analyse and visualise datasets.
Scientific software: • Data management • Workflows • Real-time analysis • Distributed computing • Automatic workflows for data reduction and processing • Remote analysis tools for users
Infrastructure: • Storage • Compute (CPU + GPU) • Network
Infrastructure at the Australian Synchrotron
Storage: central storage 650 TB; additional storage at RDS 440 TB. We still keep all historic user data (except IMBL); the official data retention period is 6–12 months.
HPC: MASSIVE (operated by Monash University) • Batch system (based on SLURM, see the sketch below) • Remote desktop environment • Real-time visualisation
42 nodes, each with 2 × 6-core X5650 CPUs, 48 GB RAM and 2 NVIDIA M2070 GPUs; 58 TB GPFS file system.
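A minimal sketch of what submitting an analysis job to a SLURM batch system like MASSIVE's can look like; the resource requests and the reconstruct.py script are illustrative placeholders, not the cluster's actual configuration.

```python
import subprocess

# Hypothetical job script; resource numbers and the script path are placeholders.
job_script = """#!/bin/bash
#SBATCH --job-name=ct-recon
#SBATCH --cpus-per-task=12
#SBATCH --gres=gpu:1
#SBATCH --mem=48G
#SBATCH --time=02:00:00
srun python reconstruct.py /projects/experiment_001
"""

with open("recon.sbatch", "w") as f:
    f.write(job_script)

subprocess.run(["sbatch", "recon.sbatch"], check=True)  # hand the job to the SLURM queue
```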
Data collection and processing: Imaging and Medical Beamline
• Three experimental enclosures for various resolutions and imaging modalities
• Largest beam in the world, up to 540 × 48 mm in hutch 3B
• High flux from the superconducting multipole wiggler
• Dedicated near-beam surgery and animal holding and preparation facilities
• All with Computed Tomography (CT) capability
Computed Tomography
[Diagram: the X-ray beam passes through the sample and is captured on the detector; the projections (individual TIFF files) are reconstructed into slices, which feed visualisation and analysis.]
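As a rough illustration of the projection-to-slice step, here is a minimal filtered back projection sketch using scikit-image; the real reconstruction is done with X-tract (GPU-accelerated), and the array sizes here are toy values compared with the IMBL numbers on the following slides.

```python
import numpy as np
from skimage.transform import iradon  # filtered back projection

def reconstruct_slice(sinogram: np.ndarray, angles_deg: np.ndarray) -> np.ndarray:
    """Reconstruct one tomographic slice from its sinogram.

    sinogram   : shape (detector_pixels, n_angles), one column per projection angle
    angles_deg : projection angles in degrees
    """
    return iradon(sinogram, theta=angles_deg, circle=True)

angles = np.linspace(0.0, 180.0, 180, endpoint=False)
sinogram = np.random.rand(256, 180).astype(np.float32)   # placeholder projection data
slice_img = reconstruct_slice(sinogram, angles)
print(slice_img.shape)                                    # (256, 256): one floating-point slice
```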
Computed Tomography
Detector parameters (hutch 2B): X pixels 2560; Y pixels 600; bit depth (Ruby) 16; single image size 2.9 MB; acquisition time* 0.05 s; projections 1800; slices 25; total dataset size 132 GB; time 38 min.
Raw data size: ~3–5 GB per minute; ~3 samples / 2 hours; ~12 samples / shift; ~36 samples per day; ~14 TB raw data in a 3-day experiment.
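A quick sanity check of the numbers above, assuming the acquisition time applies per image:

```python
x_pixels, y_pixels, bit_depth = 2560, 600, 16
projections, slices = 1800, 25

image_mb = x_pixels * y_pixels * (bit_depth // 8) / 1024**2   # ~2.9 MB per image
dataset_gb = image_mb * projections * slices / 1024           # ~129 GB per sample (~132 GB quoted)
scan_min = projections * slices * 0.05 / 60                   # ~37.5 min per sample (~38 min quoted)

print(f"{image_mb:.1f} MB per image, {dataset_gb:.0f} GB per dataset, {scan_min:.0f} min per scan")
```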
Computed Tomography
1) Stitching: stitches serial scans into a single projection image at each angle. 2560 × 600 px × 25 slices with 10% overlap; 1800 projections; 116 GB per sample.
2) Reconstruction with X-tract: uses the projections to reconstruct tomographic slices of the sample. One slice (2560 × 2560 px) is now 32-bit, 25 MB per slice; a full sample (13620 slices) is 332 GB (plus an 8-bit version, 83 GB).
~60 TB total data potential for one experiment (3 days)!
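A minimal sketch of the stitching idea, assuming 25 detector strips of 600 × 2560 px per angle with a 10% vertical overlap; the facility's own tools do the real stitching.

```python
import numpy as np

def stitch_strips(strips, overlap=0.10):
    """Stack detector strips vertically, keeping only the non-overlapping rows."""
    step = int(strips[0].shape[0] * (1 - overlap))            # rows kept from each strip
    parts = [s[:step] for s in strips[:-1]] + [strips[-1]]
    return np.vstack(parts)

strips = [np.zeros((600, 2560), dtype=np.uint16) for _ in range(25)]
projection = stitch_strips(strips)
print(projection.shape)   # (13560, 2560): close to the 13620 slices quoted for a full sample
```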
Computed Tomography (continued): reconstruction with X-tract proceeded as above; the dataset in this case came to ~22 TB.
Online vs offline
Online (during the beamtime): at IMBL the detector writes to local storage, and the user at the beamline connects via VNC to the imblcompute server (48 CPUs, 2 GPUs, 512 GB RAM, 60 TB storage). How to handle Big Data: run the processing for each projection in parallel; X-tract uses CUDA for GPU-accelerated compute.
Offline (post beamtime): a paradigm shift: bring the users to the data and not the data to the users, i.e. remote analysis instead of data transfer (sftp, hard drives, etc.).
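A minimal sketch of the "run for each projection in parallel" idea on a machine like imblcompute; the file layout and the flat/dark correction are illustrative assumptions, not X-tract itself, which does this GPU-accelerated via CUDA.

```python
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path
import numpy as np
import tifffile  # common TIFF reader, assumed to be available on the compute node

def correct(projection_path, flat, dark):
    """Flat- and dark-field correct one projection and write the result alongside it."""
    raw = tifffile.imread(projection_path).astype(np.float32)
    corrected = (raw - dark) / np.clip(flat - dark, 1e-6, None)
    out = projection_path.with_suffix(".corr.tif")
    tifffile.imwrite(out, corrected)
    return out

def correct_all(projection_dir, flat, dark, workers=48):
    """Process every projection in parallel across the available CPUs."""
    paths = sorted(Path(projection_dir).glob("proj_*.tif"))
    with ProcessPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(correct, p, flat, dark) for p in paths]
        return [f.result() for f in futures]
```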
Remote access with Strudel, MASSIVE's scientific remote desktop launcher.
Gigapixel image on MASSIVE
Each MASSIVE remote access session provides 12 CPUs and 1 GPU.
The Gigapixel image = 2,505 files, each 100 MB, analysed using the GeoPixe software, which can run in Cluster mode for data sorting and extraction.
How to handle Big Data: Cluster mode • Partition the data • Parallelise sorting through the data (see the sketch below)
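A minimal sketch of the Cluster mode pattern: partition the ~2,505 event files and sort/extract them in parallel, then merge the results. GeoPixe does the real extraction; the extract_counts() stub here is a hypothetical stand-in so the sketch runs end to end.

```python
import os
from concurrent.futures import ProcessPoolExecutor

def extract_counts(path):
    # Stand-in for GeoPixe's per-file event sorting/extraction step.
    return os.path.getsize(path)

def partition(files, n_chunks):
    """Stripe the file list into n_chunks roughly equal pieces, one per worker."""
    return [files[i::n_chunks] for i in range(n_chunks)]

def process_chunk(chunk):
    return {path: extract_counts(path) for path in chunk}

def cluster_sort(files, workers=12):
    """Partition the files, process each chunk in parallel, and merge the results."""
    merged = {}
    with ProcessPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(process_chunk, partition(files, workers)):
            merged.update(partial)
    return merged
```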
'Realtime' processing and data reduction
Automatic workflows:
• reduce data by averaging frames, removing unwanted data, etc. (a sketch of such a step follows below)
• first, a quick reconstruction of 'live' data for fast user feedback
• full processing of the data where possible
Example, MX2 beamline: a workflow for automatic data processing and protein structure determination from MX diffraction images (close to real-time):
1. single-shot assessment of space group and quality metric
2. data reduction of datasets, with special care for the type of experiment (chemical or protein crystallography)
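A minimal sketch of one such reduction step: average repeated frames and drop obvious outliers before full processing. Shapes and thresholds are illustrative assumptions, not the beamline's actual pipeline.

```python
import numpy as np

def reduce_frames(frames, max_dev=5.0):
    """Average a stack of frames, discarding frames whose total intensity is an outlier.

    frames : array of shape (n_frames, height, width)
    """
    totals = frames.sum(axis=(1, 2))
    median = np.median(totals)
    mad = np.median(np.abs(totals - median))                  # robust spread estimate
    keep = np.abs(totals - median) <= max_dev * (mad + 1e-9)
    return frames[keep].mean(axis=0)

stack = np.random.poisson(100, size=(20, 512, 512)).astype(np.float32)
stack[3] *= 10                       # simulate an unwanted frame (e.g. a beam glitch)
reduced = reduce_frames(stack)       # one averaged frame, with the outlier removed
```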
What we have learned
The design and implementation of all workflows were driven by the available infrastructure (e.g. the MASSIVE and RDS services existed before the workflows). As a result, the workflows are custom built, cannot be re-used and depend on external service providers.
Next iteration: • decouple workflow and infrastructure • generic workflow software • microservice architecture
Real-time diffraction spot finding at MX2: • uses the newly developed workflow software • checks the quality of the recorded data live (a sketch of such a check follows below)
ASCI: Australian Synchrotron Computing Infrastructure
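A minimal sketch of a live data-quality check in the spirit of the MX2 spot finding: threshold a diffraction image and count connected bright regions. The threshold rule is an illustrative assumption; the real check runs inside the new workflow software.

```python
import numpy as np
from scipy import ndimage

def count_spots(image, sigma=5.0, min_pixels=2):
    """Count connected regions that stand well above the background."""
    background = np.median(image)
    noise = image.std()
    mask = image > background + sigma * noise
    labels, n = ndimage.label(mask)
    if n == 0:
        return 0
    sizes = ndimage.sum(mask, labels, index=range(1, n + 1))
    return int((sizes >= min_pixels).sum())

# A workflow step could call this on every frame as it is written, flagging frames
# whose spot count drops so users get live feedback on the recorded data quality.
frame = np.random.poisson(5, (2048, 2048)).astype(np.float32)
print(count_spots(frame))            # close to zero for a featureless frame
```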
ASCI - Australian Synchrotron Computing Infrastructure
Users connect through an HTML5-based VNC connection; a firewall and nginx handle routing and security. An infrastructure service creates analysis session instances on demand, with automatic load balancing of Docker containers built from beamline-specific Docker images (IMBL, XFM, SAXS/WAXS, ...), alongside a workflow service.
Hardware: 6 nodes, each with 48 CPUs, 2 GPUs (NVIDIA GeForce GTX 1080) and 512 GB RAM; 2 PB (raw) of Ceph storage.
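A minimal sketch of what the "create instance" step could look like with the Docker SDK for Python; the image name, mount path and resource limits are assumptions for illustration, not the actual ASCI infrastructure service.

```python
import docker

def launch_analysis_session(beamline, experiment_path):
    """Start one analysis session container for a given beamline and experiment."""
    client = docker.from_env()
    return client.containers.run(
        image=f"asci/{beamline}-analysis:latest",       # hypothetical per-beamline image
        detach=True,
        volumes={experiment_path: {"bind": "/data", "mode": "ro"}},
        nano_cpus=8 * 10**9,                            # cap the session at 8 CPUs
        mem_limit="64g",
        labels={"beamline": beamline, "service": "asci-session"},
    )

session = launch_analysis_session("xfm", "/mnt/ceph/experiments/12345")
print(session.id)  # the routing layer (nginx) would proxy the user's VNC session here
```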