Introduction to Big Data Analytics Frameworks

Introduction to [Big] Data Analytics Frameworks
Data Sciences (pilot) Training EC
Sébastien Varrette, PhD
Parallel Computing and Optimization Group (PCOG), University of Luxembourg (UL), Luxembourg
Feb. 7th and Apr. 1st, 2019, Luxembourg


  1. Introduction: Different HPC Needs per Domain
      [Figure: HPC needs compared per domain]
      Domains: Deep Learning / Cognitive Computing; Biomedical Industry / Life Sciences; Material Science & Engineering; IoT, FinTech; ALL Research Computing Domains
      Axes: #Cores, Network Bandwidth, Flops/Core, Network Latency, Storage Capacity, I/O Performance
      Sebastien Varrette (University of Luxembourg) Introduction to [Big] Data Analytics Frameworks 9 / 126

  2. New Trends in HPC
      Continued scaling of scientific, industrial & financial applications
        → ... well beyond Exascale
      New trends changing the landscape for HPC
        → Emergence of Big Data analytics
        → Emergence of (Hyperscale) Cloud Computing
        → Data intensive Internet of Things (IoT) applications
        → Deep learning & cognitive computing paradigms
      [Source: EuroLab-4-HPC] Eurolab-4-HPC Long-Term Vision on High-Performance Computing (Foundations for a Centre of Excellence in High-Performance Computing Systems). Editors: Theo Ungerer, Paul Carpenter. Funded by the European Union Horizon 2020 Framework Programme (H2020-EU.1.2.2. - FET Proactive).
      [Source: IDC RIKEN report, 2016] Special Study: Analysis of the Characteristics and Development Trends of the Next-Generation of Supercomputers in Foreign Countries. Earl C. Joseph, Ph.D., Robert Sorensen, Steve Conway, Kevin Monroe. Study carried out for RIKEN.

  3. Toward Modular Computing
      Aiming at scalable, flexible HPC infrastructures
        → Primary processing on CPUs and accelerators
            - HPC & Extreme Scale Booster modules
        → Specialized modules for:
            - HTC & I/O intensive workloads
            - [Big] Data Analytics & AI
      [Source: "Towards Modular Supercomputing: The DEEP and DEEP-ER projects", 2016]

  4. Summary
      1 Introduction
          HPC & BD Trends
          Reviewing the Main HPC and BD Components
      2 [Big] Data Management in HPC Environment: Overview and Challenges
          Performance Overview in Data transfer
          Data transfer in practice
          Sharing Data
      3 Big Data Analytics with Hadoop, Spark etc.
          Apache Hadoop
          Batch vs Stream vs Hybrid Processing
          Apache Spark
      4 [Brief] Overview of other useful Data Analytics frameworks
          Python Libraries
          R – Statistical Computing
      5 Conclusion

  7. HPC Computing Hardware
      Base: CPU (Central Processing Unit), highest software flexibility
        → High performance across all computational domains
        → Ex: Intel Core i9-9900K (Q4'18), Rpeak ≃ 922 GFlops (DP)
            - 8 cores @ 3.6 GHz (14nm, 95W, ≃ 3.5 billion transistors) + integ. graphics
            - [Figure: Intel Coffee Lake die]
      Accelerators:
        GPU (Graphics Processing Unit): ideal for ML/DL workloads
        → Ex: Nvidia Tesla V100 SXM2 (Q2'17), Rpeak ≃ 7.8 TFlops (DP)
            - 5120 cores @ 1.3 GHz (12nm, 250W, 21 billion transistors)
        Intel MIC (Many Integrated Core) Accelerator
        ASIC (Application-Specific Integrated Circuits), FPGA (Field Programmable Gate Array)
        → least software flexibility
        → highest performance for specialized problems
            - Ex: AI, Mining, Sequencing...
      ⇒ toward hybrid platforms w. DL enabled accelerators
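As a rough cross-check of the Rpeak figures above, theoretical peak is commonly estimated as cores × clock × floating-point operations per cycle per core. The sketch below applies that estimate to the CPU example. The function name and the flops-per-cycle values are assumptions for illustration (the slide does not state them); 32 is simply the value that reproduces the quoted ≃ 922 GFlops.

```python
def peak_gflops(cores: int, clock_ghz: float, flops_per_cycle: int) -> float:
    """Theoretical peak in GFlops: cores x clock (GHz) x flops per cycle per core."""
    return cores * clock_ghz * flops_per_cycle


if __name__ == "__main__":
    # 8 cores @ 3.6 GHz come from the slide; flops_per_cycle is an assumption.
    for flops_per_cycle in (16, 32):
        gflops = peak_gflops(8, 3.6, flops_per_cycle)
        print(f"{flops_per_cycle} flops/cycle/core -> {gflops:.1f} GFlops")
```

The estimate is an upper bound: sustained performance depends on the instruction mix, memory bandwidth and the clock actually reached under load.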

  8. HPC Components: Local Memory
      [Figure: memory hierarchy from the CPU registers through the caches and the memory bus to main memory, then over the I/O bus to disk; each level is larger, slower and cheaper than the previous one]

      Level   Component         Size            Speed
      1       Registers         500 bytes       sub ns
      2       L1-cache (SRAM)   64 KB to 8 MB   1-2 cycles
      2       L2-cache (SRAM)   64 KB to 8 MB   10 cycles
      2       L3-cache (DRAM)   64 KB to 8 MB   20 cycles
      3       Memory (DRAM)     1 GB            hundreds of cycles
      4       Disk              1 TB            tens of thousands of cycles

      SSD (SATA3)              R/W: 550 MB/s; 100000 IOPS   450 €/TB
      HDD (SATA3 @ 7.2 krpm)   R/W: 227 MB/s; 85 IOPS       54 €/TB

  9. HPC Components: Interconnect
      latency: time to send a minimal (0 byte) message from A to B
      bandwidth: max amount of data communicated per unit of time

      Technology             Effective Bandwidth     Latency
      Gigabit Ethernet         1 Gb/s   125 MB/s     40 µs to 300 µs
      10 Gigabit Ethernet     10 Gb/s   1.25 GB/s    4 µs to 5 µs
      Infiniband QDR          40 Gb/s   5 GB/s       1.29 µs to 2.6 µs
      Infiniband EDR         100 Gb/s   12.5 GB/s    0.61 µs to 1.3 µs
      Infiniband HDR         200 Gb/s   25 GB/s      0.5 µs to 1.1 µs
      100 Gigabit Ethernet   100 Gb/s   12.5 GB/s    30 µs
      Intel Omnipath         100 Gb/s   12.5 GB/s    0.9 µs

      Interconnect share in the Top500 list [Source: www.top500.org, Nov. 2017]:
      10G 40.8 %, Infiniband 32.6 %, Custom 13.4 %, Omnipath 7 %, Gigabit Ethernet 4.8 %, Proprietary 1.4 %
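The two definitions above combine into the usual first-order cost model for a single message: time ≈ latency + size / bandwidth. A minimal sketch, assuming decimal units (1 Gb/s = 10^9 bit/s) and ignoring congestion and protocol overhead; the function names are illustrative only.

```python
def gbps_to_gbytes_per_s(gbps: float) -> float:
    """Convert a line rate in Gb/s to GB/s (8 bits per byte)."""
    return gbps / 8.0


def transfer_time_s(size_bytes: float, gbps: float, latency_us: float) -> float:
    """First-order model: latency plus size divided by bandwidth, in seconds."""
    bandwidth_bytes_per_s = gbps * 1e9 / 8.0
    return latency_us * 1e-6 + size_bytes / bandwidth_bytes_per_s


if __name__ == "__main__":
    # (technology, Gb/s, latency in µs) taken from rows of the table above.
    links = [("Gigabit Ethernet", 1, 40.0), ("Infiniband EDR", 100, 0.61)]
    for name, gbps, latency_us in links:
        for size in (64, 1_000_000_000):
            t = transfer_time_s(size, gbps, latency_us)
            print(f"{name}: {size} bytes in {t:.6f} s")
```

For small messages the latency term dominates, which is why low-latency fabrics matter for tightly coupled parallel jobs even when the nominal bandwidth is the same.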

  13. Network Topologies
      Direct vs. Indirect interconnect
        → direct: each network node attaches to at least one compute node
        → indirect: compute nodes attached at the edge of the network only
            - many routers only connect to other routers.
      Main HPC Topologies
      CLOS Network / Fat-Trees [Indirect]
        → can be fully non-blocking (1:1) or blocking (x:1)
        → typically enables best performance
            - Non-blocking bandwidth, lowest network latency
      Mesh or 3D-torus [Direct]
        → Blocking network, cost-effective for systems at scale
        → Great performance solutions for applications with locality
        → Simple expansion for future growth

  14. HPC Components: Operating System
      Exclusively Linux-based (really 100%)
      Reasons:
        → stability
        → development flexibility
      [Figure: operating system share in the Top500 list: Linux 100 %. Source: www.top500.org, Nov 2017]

  15. HPC Components: Software Stack
      Remote connection to the platform: SSH
      Identity Management / SSO: LDAP, Kerberos, IPA...
      Resource management: job/batch scheduler
        → SLURM, OAR, PBS, MOAB/Torque...
      (Automatic) Node Deployment:
        → FAI, Kickstart, Puppet, Chef, Ansible, Kadeploy...
      (Automatic) User Software Management:
        → Easybuild, Environment Modules, LMod
      Platform Monitoring:
        → Nagios, Icinga, Ganglia, Foreman, Cacti, Alerta...

  16. [Big]Data Management
      Storage architectural classes & I/O layers
      [Figure: I/O path from the application through the [distributed] file system down to the disks, for the three storage classes]
        → DAS interface: SATA, SAS, FC...
        → SAN interface: iSCSI..., reached over Fiber Channel or Ethernet/network
        → NAS interface: NFS, CIFS, AFP..., reached over Fiber Channel or Ethernet/network
        → Disk attachment: SATA, SAS, Fiber Channel

  17. [Big]Data Management: Disk Encl.
      ≃ 120 K€ per enclosure, 48-60 disks (4U)
        → incl. redundant (i.e. 2) RAID controllers (master/slave)

  19. [Big]Data Management: File Systems
      File System (FS): logical manner to store, organize, manipulate & access data
      (local) Disk FS: FAT32, NTFS, HFS+, ext{3,4}, {x,z,btr}fs...
        → manage data on permanent storage devices
        → poor perf. read: 100 → 400 MB/s | write: 10 → 200 MB/s

  21. [Big]Data Management: File Systems
      Networked FS: NFS, CIFS/SMB, AFP
        → disk access from remote nodes via network access
        → poorer performance for HPC jobs, especially parallel I/O
            - read: only 381 MB/s on a system capable of 740 MB/s (16 tasks)
            - write: only 90 MB/s on a system capable of 400 MB/s (4 tasks)
      [Source: LISA'09] Ray Paden: How to Build a Petabyte Sized Storage System
      [scale-out] NAS
        → aka Appliances (OneFS...)
        → Focus on CIFS, NFS
        → Integrated HW/SW
        → Ex: EMC (Isilon), IBM (SONAS), DDN...

  22. [Big]Data Management: File Systems
      Basic Clustered FS: GPFS
        → File access is parallel
        → File system overhead operations are distributed and done in parallel
            - no metadata servers
        → File clients access file data through file servers via the LAN

  23. [Big]Data Management: File Systems
      Multi-Component Clustered FS: Lustre, Panasas
        → File access is parallel
        → File system overhead operations are handled by dedicated components
            - metadata server (Lustre) or director blades (Panasas)
        → Multi-component architecture
        → File clients access file data through file servers via the LAN

  26. [Big]Data Management: FS Summary
      File System (FS): logical manner to store, organize & access data
        → (local) Disk FS: FAT32, NTFS, HFS+, ext4, {x,z,btr}fs...
        → Networked FS: NFS, CIFS/SMB, AFP
        → Parallel/Distributed FS: SpectrumScale/GPFS, Lustre
            - typical FS for HPC / HTC (High Throughput Computing)
      Main Characteristic of Parallel/Distributed File Systems:
      Capacity and Performance increase with #servers

      Name            Type                      Read* [GB/s]   Write* [GB/s]
      ext4            Disk FS                   0.426          0.212
      nfs             Networked FS              0.381          0.090
      gpfs (iris)     Parallel/Distributed FS   11.25          9.46
      lustre (iris)   Parallel/Distributed FS   12.88          10.07
      gpfs (gaia)     Parallel/Distributed FS   7.74           6.524
      lustre (gaia)   Parallel/Distributed FS   4.5            2.956
      * maximum random read/write, per IOZone or IOR measures, using concurrent nodes for networked FS.

  28. HPC Components: Data Center
      Definition (Data Center): facility to house computer systems and associated components
        → Basic storage component: rack (height: 42 RU)
      Challenges: Power (UPS, battery), Cooling, Fire protection, Security
      Power/Heat dissipation per rack:
        → HPC computing racks: 30-120 kW
        → Storage racks: 15 kW
        → Interconnect racks: 5 kW
      Power Usage Effectiveness: PUE = Total facility power / IT equipment power
      Various Cooling Technology
        → Airflow
        → Direct-Liquid Cooling, Immersion...
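The PUE ratio above is straightforward to compute once both power figures are known. A minimal sketch; the function name and the 1300 kW / 1000 kW figures are hypothetical values chosen for illustration, not measurements from the slides.

```python
def pue(total_facility_kw: float, it_equipment_kw: float) -> float:
    """Power Usage Effectiveness = total facility power / IT equipment power."""
    if it_equipment_kw <= 0:
        raise ValueError("IT equipment power must be positive")
    if total_facility_kw < it_equipment_kw:
        raise ValueError("total facility power includes the IT equipment power")
    return total_facility_kw / it_equipment_kw


if __name__ == "__main__":
    # Hypothetical site: 1000 kW of IT load plus 300 kW for cooling, UPS losses, etc.
    print(f"PUE = {pue(1300.0, 1000.0):.2f}")
```

A PUE of 1.0 would mean that every watt entering the facility reaches the IT equipment; the cooling technologies listed above mainly aim at reducing the overhead term.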

  29. Software/Modules Management
      https://hpc.uni.lu/users/software/
      Based on Environment Modules / LMod
        → convenient way to dynamically change the user's environment ($PATH)
        → makes it easy to load software through the module command
      Currently on UL HPC:
        → > 200 software packages, in multiple versions, within 18 categ.
        → reworked software set for the iris cluster, now deployed everywhere
            - RESIF v2.0, allowing [real] semantic versioning of released builds
        → hierarchical organization. Ex: toolchain/{foss,intel}

        $> module avail    # List available modules
        $> module load <category>/<software>[/<version>]

  31. Software/Modules Management
      Key module variable: $MODULEPATH, i.e. where to look for modules
        → altered with module use <path>. Ex:
            export EASYBUILD_PREFIX=$HOME/.local/easybuild
            export LOCAL_MODULES=$EASYBUILD_PREFIX/modules/all
            module use $LOCAL_MODULES
      Main module commands:

      Command                        Description
      module avail                   List all the modules which are available to be loaded
      module spider <pattern>        Search for <pattern> among available modules (Lmod only)
      module load <mod1> [mod2...]   Load a module
      module unload <module>         Unload a module
      module list                    List loaded modules
      module purge                   Unload all modules (purge)
      module display <module>        Display what a module does
      module use <path>              Prepend the directory to the MODULEPATH environment variable
      module unuse <path>            Remove the directory from the MODULEPATH environment variable
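The effect of module use and module unuse on $MODULEPATH, as described in the table above, can be pictured as plain list manipulation on a colon-separated string. The sketch below is a simplified model, not the Lmod or Environment Modules implementation; the function names and the de-duplication behaviour are assumptions.

```python
def module_use(modulepath: str, directory: str) -> str:
    """Model of `module use <path>`: prepend the directory to MODULEPATH."""
    # Drop empty entries and any earlier occurrence so the directory appears once.
    entries = [p for p in modulepath.split(":") if p and p != directory]
    return ":".join([directory] + entries)


def module_unuse(modulepath: str, directory: str) -> str:
    """Model of `module unuse <path>`: remove the directory from MODULEPATH."""
    return ":".join(p for p in modulepath.split(":") if p and p != directory)


if __name__ == "__main__":
    # Hypothetical starting value; the local path mirrors $LOCAL_MODULES above.
    path = "/opt/apps/modules/all"
    path = module_use(path, "/home/user/.local/easybuild/modules/all")
    print(path)
    print(module_unuse(path, "/home/user/.local/easybuild/modules/all"))
```

Because the directory is prepended, modules found there take precedence over system-wide ones with the same name in this model.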

  32. Software/Modules Management
      http://hpcugent.github.io/easybuild/
      EasyBuild: open-source framework to (automatically) build scientific SW
      Why?: "Could you please install this software on the cluster?"
        → Scientific software is often difficult to build
            - non-standard build tools / incomplete build procedures
            - hardcoded parameters and/or poor/outdated documentation
        → EasyBuild helps to facilitate this task
            - consistent software build and installation framework
            - includes testing step that helps validate builds
            - automatically generates LMod modulefiles

        $> module use $LOCAL_MODULES
        $> module load tools/EasyBuild
        # Search for recipes for a given software
        $> eb -S Spark
        $> eb Spark-2.4.0-Hadoop-2.7-Java-1.8.eb -Dr    # Dry-run install
        $> eb Spark-2.4.0-Hadoop-2.7-Java-1.8.eb -r     # Install, resolving dependencies (--robot)

  33. [Big] Data Management in HPC Environment: Overview and Challenges
      Summary (section 2 of the outline):
          Performance Overview in Data transfer
          Data transfer in practice
          Sharing Data

  35. Data Intensive Computing
      Data volumes increasing massively
        → Clusters, storage capacity increasing massively
      Disk speeds are not keeping pace.
      Seek speeds even worse than read/write

  37. Speed Expectation on Data Transfer
      http://fasterdata.es.net/
      How long to transfer 1 TB of data across various speed networks?

      Network     Time
      10 Mbps     300 hrs (12.5 days)
      100 Mbps    30 hrs
      1 Gbps      3 hrs
      10 Gbps     20 minutes

      (Again) small I/Os really kill performance
        → Ex: transferring 80 TB for the backup of ecosystem_biology
        → same rack, 10 Gb/s: 4 weeks → 63 TB transferred...
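The table and the backup example above can be compared with the ideal line-rate figures. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 Mbps = 10^6 bit/s) and no protocol overhead, so it gives a lower bound on transfer time; the slide's figures are more conservative than this bound. Function names are illustrative.

```python
def ideal_transfer_hours(size_bytes: float, rate_bits_per_s: float) -> float:
    """Lower bound on transfer time (hours) at full line rate, without overhead."""
    return size_bytes * 8 / rate_bits_per_s / 3600


def effective_rate_mbit_s(transferred_bytes: float, elapsed_s: float) -> float:
    """Average rate actually achieved, in Mbit/s."""
    return transferred_bytes * 8 / elapsed_s / 1e6


if __name__ == "__main__":
    one_tb = 1e12
    for label, rate in [("10 Mbps", 10e6), ("100 Mbps", 100e6),
                        ("1 Gbps", 1e9), ("10 Gbps", 10e9)]:
        print(f"{label}: {ideal_transfer_hours(one_tb, rate):.2f} h (ideal)")

    # Backup example from the slide: 63 TB moved in 4 weeks over a 10 Gb/s link.
    four_weeks_s = 4 * 7 * 24 * 3600
    achieved = effective_rate_mbit_s(63e12, four_weeks_s)
    print(f"achieved: {achieved:.0f} Mbit/s, "
          f"{100 * achieved / 10_000:.1f} % of the 10 Gb/s line rate")
```

Under these assumptions the backup ran at roughly 2 % of the nominal link speed, which is the point the slide makes about small I/Os.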

  40. Storage Performances: GPFS
      [Figure: GPFS storage performance]

  41. Storage Performances: Lustre
      [Figure: Lustre storage performance]

  42. Storage Performances: Read
      Based on IOR or IOZone, reference I/O benchmarks
        → tests performed in 2013
      [Figure: read I/O bandwidth (MiB/s, log scale from 64 to 65536) vs. number of threads (0 to 15); series: SHM / Bigmem, Lustre / Gaia, NFS / Gaia, SSD / Gaia, Hard Disk / Chaos]

  43. Storage Performances: Write
      Based on IOR or IOZone, reference I/O benchmarks
        → tests performed in 2013
      [Figure: write I/O bandwidth (MiB/s, log scale from 64 to 32768) vs. number of threads (0 to 15); series: SHM / Bigmem, Lustre / Gaia, NFS / Gaia, SSD / Gaia, Hard Disk / Chaos]

  44. Understanding Your Storage Options
      Where can I store and manipulate my data?
      Shared storage
        → NFS: not scalable, ≃ 1.5 GB/s (R), O(100 TB)
        → GPFS/SpectrumScale: scalable, ≃ 10-500 GB/s (R), O(10 PB)
        → Lustre: scalable, ≃ 10-500 GB/s (R), O(10 PB)
      Local storage
        → local file system (/tmp), O(1 TB)
            - over HDD ≃ 100 MB/s, over SSD ≃ 400 MB/s
        → RAM (/dev/shm), ≃ 30 GB/s (R), O(100 GB)
      Distributed storage
        → HDFS, Ceph, GlusterFS, BeeGFS: scalable, ≃ 1 GB/s
      ⇒ In all cases: small I/Os really kill storage performance
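Why small I/Os hurt so much can be seen with a back-of-the-envelope model: a large sequential read is limited by bandwidth, while many small random requests are limited by the number of operations per second. The sketch below uses the HDD figures quoted on the Local Memory slide (227 MB/s, 85 IOPS); the 4 KiB request size and the function names are assumptions for illustration.

```python
import math


def sequential_read_s(size_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Time to stream the data in one large sequential read."""
    return size_bytes / bandwidth_bytes_per_s


def random_read_s(size_bytes: float, io_size_bytes: int, iops: float) -> float:
    """Time to read the same data as independent small requests, IOPS-bound."""
    operations = math.ceil(size_bytes / io_size_bytes)
    return operations / iops


if __name__ == "__main__":
    one_gb = 1e9
    seq = sequential_read_s(one_gb, 227e6)   # HDD, sequential
    rnd = random_read_s(one_gb, 4096, 85)    # HDD, 4 KiB random requests
    print(f"sequential: {seq:.1f} s, small random I/O: {rnd:.0f} s "
          f"({rnd / seq:.0f}x slower)")
```

The same reasoning applies to shared and parallel file systems, where each small request additionally pays a network round trip and metadata cost.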

  46. Data Transfer in Practice
        $> wget [-O <output>] <url>    # download file from <url>
        $> curl [-o <output>] <url>    # download file from <url>
      Transfer from FTP/HTTP[S]: wget or (better) curl
        → can also serve to send HTTP POST requests
        → support HTTP cookies (useful for JDK download)

  47. Data Transfer in Practice
        $> scp [-P <port>] <src> <user>@<host>:<path>
        $> rsync -avzu [-e 'ssh -p <port>'] <src> <user>@<host>:<path>
      [Secure] Transfer between two machines over SSH
        → scp or (better) rsync (transfers only what is required)
      Assumes you have understood and configured SSH appropriately!
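When transfers are scripted, it helps to build the rsync command as an argument list so that the optional SSH port is quoted correctly. The sketch below only assembles and prints the command shown above, it does not run it; the helper name, user, host and paths are hypothetical.

```python
import shlex
from typing import List, Optional


def rsync_command(src: str, user: str, host: str, path: str,
                  port: Optional[int] = None) -> List[str]:
    """Build the argument list for `rsync -avzu [-e 'ssh -p <port>'] <src> <user>@<host>:<path>`."""
    cmd = ["rsync", "-avzu"]
    if port is not None:
        # Pass the non-standard SSH port through rsync's remote-shell option.
        cmd += ["-e", f"ssh -p {port}"]
    cmd += [src, f"{user}@{host}:{path}"]
    return cmd


if __name__ == "__main__":
    cmd = rsync_command("data/", "alice", "cluster.example.org", "backup/", port=8022)
    print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run(cmd, check=True) once SSH key authentication is in place.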

  50. SSH: Secure Shell
      Ensure secure connection to remote (UL) server
        → establish encrypted tunnel using asymmetric keys
            - Public id_rsa.pub vs. Private id_rsa (without .pub)
            - typically on a non-standard port (Ex: 8022), which limits script kiddies
            - Basic rule: 1 machine = 1 key pair
        → the private key is SECRET: never send it to anybody
            - Can be protected with a passphrase
      SSH is used as a secure backbone channel for many tools
        → Remote shell, i.e. remote command line
        → File transfer: rsync, scp, sftp
        → versioning synchronization (svn, git), github, gitlab etc.
      Authentication:
        → password (disable if possible)
        → (better) public key authentication

  53. SSH: Public Key Authentication
      Client (local machine), local homedir ~/.ssh/
        → id_rsa, id_rsa.pub: owns the local private key
        → known_hosts: logs known servers
      Server (remote machine), remote homedir ~/.ssh/
        → authorized_keys: knows the granted (public) key
      Server, SSH server config /etc/ssh/
        → sshd_config
        → ssh_host_rsa_key, ssh_host_rsa_key.pub

  55. SSH: Public Key Authentication
      Client (local machine), local homedir ~/.ssh/
        → id_rsa, id_rsa.pub: owns the local private key
      Server (remote machine), remote homedir ~/.ssh/
        → authorized_keys: knows the granted (public) key
      Authentication steps:
        1. Client initiates the connection
        2. Server creates a random challenge and "encrypts" it using the public key
        3. Client solves the challenge using the private key and returns the response
        4. Server allows the connection iff response == challenge
      Restrict to public key authentication in /etc/ssh/sshd_config:
        PermitRootLogin no
        # Enable Public key auth.
        RSAAuthentication yes        # SSH protocol 1 only; deprecated in current OpenSSH
        PubkeyAuthentication yes
        # Disable Passwords
        PasswordAuthentication no
        ChallengeResponseAuthentication no
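The four steps above can be mimicked with textbook RSA on toy numbers. This is only an illustration of the challenge/response idea as the slide presents it: the primes are tiny and insecure, all names are hypothetical, and current OpenSSH (protocol 2) has the client sign session data rather than decrypt a challenge.

```python
import secrets

# Toy RSA key pair built from textbook-sized primes (insecure, illustration only).
P, Q = 61, 53
N = P * Q                              # public modulus
E = 17                                 # public exponent, stands in for id_rsa.pub
D = pow(E, -1, (P - 1) * (Q - 1))      # private exponent, stands in for id_rsa (Python 3.8+)


def server_make_challenge():
    """Step 2: pick a random challenge and 'encrypt' it with the public key."""
    challenge = 2 + secrets.randbelow(N - 2)    # random value in [2, N-1]
    return challenge, pow(challenge, E, N)


def client_solve(encrypted_challenge: int) -> int:
    """Step 3: recover the challenge with the private key."""
    return pow(encrypted_challenge, D, N)


def server_verify(challenge: int, response: int) -> bool:
    """Step 4: allow the connection iff response == challenge."""
    return response == challenge


if __name__ == "__main__":
    challenge, encrypted = server_make_challenge()
    response = client_solve(encrypted)
    print("connection allowed:", server_verify(challenge, response))
```

Only the holder of the private exponent can turn the encrypted value back into the challenge, which is why the public key can be copied freely into authorized_keys on the server.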

  59. SSH Setup on Linux / Mac OS
      OpenSSH natively supported; configuration directory: ~/.ssh/
        → package openssh-client (Debian-like) or openssh-clients (Redhat-like)
      SSH Key Pairs (public vs private) generation: ssh-keygen
        → specify a strong passphrase
            - protects your private key from being stolen (i.e. impersonation)
            - drawback: passphrase must be typed to use your key (use ssh-agent)
      DSA and RSA 1024 bit are deprecated now!
        $> ssh-keygen -t rsa -b 4096 -o -a 100    # 4096 bits RSA
        $> ssh-keygen -t ed25519 -o -a 100        # new sexy Ed25519 (better)
      Public key: ~/.ssh/id_{rsa,ed25519}.pub
      Private (identity) key: ~/.ssh/id_{rsa,ed25519}

  60. SSH Setup on Windows
      Use MobaXterm! http://mobaxterm.mobatek.net/
        → [tabbed] Sessions management
        → X11 server with enhanced X extensions
        → Graphical SFTP browser
        → SSH gateway / tunnels wizards
        → [remote] Text Editor
        → ...

  61. Summary
      1 Introduction
          HPC & BD Trends
          Reviewing the Main HPC and BD Components
      2 [Big] Data Management in HPC Environment: Overview and Challenges
          Performance Overview in Data transfer
          Data transfer in practice
          Sharing Data
      3 Big Data Analytics with Hadoop, Spark etc.
          Apache Hadoop
          Batch vs Stream vs Hybrid Processing
          Apache Spark
      4 [Brief] Overview of other useful Data Analytics frameworks
          Python Libraries
          R – Statistical Computing
      5 Conclusion

  63. Sharing Code and Data
      Before doing Big Data, manage and version your normal data correctly
      What kinds of systems are available?
        Good: NAS, Cloud
          → NextCloud, Dropbox, {Google,iCloud} Drive, Figshare...
        Better: Version Control Systems (VCS)
          → SVN, Git and Mercurial
        Best: Version Control Systems on the Public/Private Cloud
          → GitHub, Bitbucket, GitLab
      Which one?
        → Depends on the level of privacy you expect
            - ... but you probably already know these tools
        → Few handle GB-sized files... unless you use Git LFS (Large File Storage)

  65. Centralized VCS - CVS, SVN
      [Figure: a central VCS server holds the version database (Version 1, Version 2, Version 3); Computer A and Computer B each check out a file from it.]

  66. Distributed VCS - Git
      [Figure: the server computer holds a version database (Version 1, Version 2, Version 3); Computer A and Computer B each hold a file plus their own complete copy of that version database.]
      Everybody has the full history of commits

  72. Tracking changes (most VCS)
      Delta storage: checkins over time ("-" means the file did not change in that checkin)

          C1        C2    C3    C4    C5
          file A    Δ1    -     Δ2    -
          file B    -     -     Δ1    Δ2
          file C    Δ1    Δ2    -     Δ3
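
The delta storage model above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not how any particular VCS implements it: the names `make_delta`, `apply_delta` and `DeltaStore` are hypothetical, files are plain lists of lines, and deltas are derived with the standard-library `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher


def make_delta(old, new):
    """Record only the regions that differ between two lists of lines."""
    ops = SequenceMatcher(a=old, b=new, autojunk=False).get_opcodes()
    # Keep (start, end, replacement) for every non-equal region of `old`.
    return [(i1, i2, new[j1:j2]) for tag, i1, i2, j1, j2 in ops if tag != "equal"]


def apply_delta(old, delta):
    """Rebuild the newer version from the older version plus one delta."""
    result, cursor = [], 0
    for start, end, replacement in delta:
        result.extend(old[cursor:start])  # unchanged lines before the edit
        result.extend(replacement)        # lines introduced by the edit
        cursor = end                      # skip the lines the edit replaced
    result.extend(old[cursor:])           # unchanged tail
    return result


class DeltaStore:
    """Hypothetical per-file store: one base version plus a chain of deltas."""

    def __init__(self, base):
        self.base = list(base)
        self.deltas = []

    def commit(self, new):
        latest = self.checkout(len(self.deltas))
        self.deltas.append(make_delta(latest, new))

    def checkout(self, version):
        """Version 0 is the base; version k replays the first k deltas."""
        lines = list(self.base)
        for delta in self.deltas[:version]:
            lines = apply_delta(lines, delta)
        return lines


# "file A" from the slide: base content at C1, then two deltas (Δ1, Δ2).
file_a = DeltaStore(["alpha", "beta"])
file_a.commit(["alpha", "beta", "gamma"])  # Δ1
file_a.commit(["alpha", "BETA", "gamma"])  # Δ2
print(file_a.checkout(1))
```

Note the trade-off this makes visible: storage stays small because only changes are kept, but checking out a late version means replaying every delta before it.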

  82. Tracking changes (Git)
      Snapshot (DAG) storage, in contrast with the delta storage used by most VCS: checkins over time

          C1    C2    C3    C4    C5
          A     A1    A1    A2    A2
          B     B     B     B1    B2
          C     C1    C2    C2    C3
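
The snapshot model can be sketched the same way. The following is a simplified, hypothetical `SnapshotStore`, assuming each checkin records the full set of files but stores a given content only once, keyed by its hash. Real Git objects differ in detail (object headers, tree objects, compression, and merge commits with several parents), so treat this as an illustration of the idea only.

```python
import hashlib


class SnapshotStore:
    """Hypothetical content-addressed store: every commit is a full snapshot."""

    def __init__(self):
        self.blobs = {}    # content hash -> file content (stored once)
        self.commits = []  # each commit: {"parent": index or None, "tree": {...}}

    def commit(self, files):
        """Record a snapshot of all files; unchanged content is not duplicated."""
        tree = {}
        for name, content in files.items():
            digest = hashlib.sha1(content.encode("utf-8")).hexdigest()
            self.blobs.setdefault(digest, content)
            tree[name] = digest
        # Linear history only: the parent is simply the previous commit.
        parent = len(self.commits) - 1 if self.commits else None
        self.commits.append({"parent": parent, "tree": tree})
        return len(self.commits) - 1

    def checkout(self, commit_id):
        """Return the complete file set of a commit without replaying history."""
        tree = self.commits[commit_id]["tree"]
        return {name: self.blobs[digest] for name, digest in tree.items()}


# First three checkins from the slide: B never changes, so it is stored once.
repo = SnapshotStore()
c1 = repo.commit({"A": "A", "B": "B", "C": "C"})
c2 = repo.commit({"A": "A1", "B": "B", "C": "C1"})
c3 = repo.commit({"A": "A1", "B": "B", "C": "C2"})
print(len(repo.blobs), "blobs stored for", 3 * len(repo.commits), "file entries")
```

Compared with the delta chain, any version is available in one lookup per file, and the parent links between commits are what form the history graph.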

  83. VCS Taxonomy
      [Figure: tools classified by storage model and by architecture]
      Delta storage
        local:       rcs, Mac OS File Versions
        centralized: cvs, Subversion (svn)
        distributed: mercurial (hg)
      Snapshot (DAG) storage
        local:       cp -r, rsync, time machine, bontmia, backupninja, duplicity
        centralized: duplicity
        distributed: bitkeeper, bazaar (bzr), git
