Improving I/O Performance Through Colocating Interrelated Input Data and Near-Optimal Load Balancing

  1. Improving I/O Performance Through Colocating Interrelated Input Data and Near-Optimal Load Balancing
     Felix Seibert, Mathias Peters, and Florian Schintke
     4th HPBDC Workshop 2018, Vancouver
     seibert@zib.de

  2. Overview
     ◮ Background and Motivation
     ◮ Placement Strategy
     ◮ Experimental Evaluation

  3. Background: Application
     ◮ GeoMultiSens: Analyze the Earth’s surface based on remote sensing images
       [Pipeline diagram: Data Sources, Homogenization, Evaluation, Visual Exploration]
     ◮ The more images, the better the results (several PB available)
     ◮ Implementation as a MapReduce job in Flink
     ◮ Goal: distribute files in the distributed file system (XtreemFS) such that the computation is efficient and performant

  4. Background: Application Details (1)
     ◮ Data-parallel application
     ◮ Parallelization along geographical regions (UTM grid)
       [Dataflow diagram: Image 1, Image 2, ..., Image k are read from XtreemFS via a Flink DataSource; Flink Map steps produce a Composite and a Classification]

  5. Background: Application Details (2)
       [Same dataflow diagram as slide 4]
     ◮ composites have a large memory footprint
     ◮ (de)serialization is expensive (Python ↔ Java)
     ◮ ⇒ composites should not be moved
     ◮ ⇒ analysis of one region (group) should run on one node

  6. File Placement Issues (1)
     ◮ State of the art: more or less random distribution ⇒ no data-local processing possible
       node 1: assigned region 32TPT; files on local hard drive: 32TQU/file_1, 32TKS/file_1, 32TPT/file_1
       node 2: assigned region 32TQU; files on local hard drive: 32TKS/file_2, 32TQU/file_2, 32TPT/file_2
       node 3: assigned region 32TKS; files on local hard drive: 32TPT/file_3, 32TKS/file_3, 32TQU/file_3
     ◮ Network traffic
     ◮ Disk scheduling

  7. File Placement Issues (2)
     ◮ Goal: colocated UTM regions (groups) for local access
       node 1: assigned region 32TQU; files on local hard drive: 32TQU/file_1, 32TQU/file_2, 32TQU/file_3
       node 2: assigned region 32TKS; files on local hard drive: 32TKS/file_1, 32TKS/file_2, 32TKS/file_3
       node 3: assigned region 32TPT; files on local hard drive: 32TPT/file_1, 32TPT/file_2, 32TPT/file_3
     ◮ Network traffic ⇒ local grouping
     ◮ Disk scheduling ⇒ load balancing issues (Europe: 1400 groups between 3 MB and 25 GB)

  8. Hybrid placement optimization strategy
     ◮ place all files of the same group in the same, tagged folder (see the sketch below)
     ◮ the distributed file system places all files of the same group on the same server
     ◮ load balancing (Storage Server Assignment Problem) is NP-hard
     ◮ thus, use an approximation algorithm
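
     As a rough illustration of the first bullet, the following minimal Python sketch computes the total size per tagged folder, which the load balancer needs as input. It assumes the files already lie in one tagged folder per UTM region; the function name group_sizes_by_folder and the path handling are invented for illustration and are not GeoMultiSens or XtreemFS code.

     import os
     from collections import defaultdict

     def group_sizes_by_folder(region_folders):
         """Return {tagged region folder: total size in bytes of all files below it}."""
         sizes = defaultdict(int)
         for folder in region_folders:                      # e.g. ".../32TQU", ".../32TKS"
             for root, _dirs, files in os.walk(folder):
                 for name in files:
                     sizes[folder] += os.path.getsize(os.path.join(root, name))
         return dict(sizes)

     The resulting group sizes correspond to the s(f) values used by the assignment rule on slide 11.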

  9. Storage Server Assignment Problem (1)
     [Illustration: file groups (regions) to be assigned to storage machines (OSDs) m1 and m2]
     ◮ Goal: Assign file groups to machines such that the most loaded machine is loaded as little as possible (formalized in the sketch below)
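
     One way to write this goal down, using the notation introduced on slide 11 and assuming that load is measured relative to server capacity (an assumption that matches the normalization by c(S) in the LPT rule there), is the following LaTeX sketch; A denotes an assignment of file groups to servers and T ranges over the servers in S:

     % relative-load formulation of the Storage Server Assignment Problem (a sketch)
     \min_{A\colon F \to S} \; \max_{T \in S} \;
       \frac{1}{c(T)} \sum_{f \in F,\, A(f) = T} s(f)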

  10. Storage Server Assignment Problem (2)
      [Illustration: the same file groups and two possible assignments to machines m1 and m2]
      ◮ our assignment problem is equivalent to multi-processor scheduling
      ◮ Approximation algorithm: Largest Processing Time first (LPT), which becomes largest group size first in our scenario

  11. LPT - Formal Description
      ◮ S set of storage servers (OSDs) with capacities c : S → N
      ◮ F set of file groups with sizes s : F → N
      ◮ Sort F = { f_1, ..., f_n } such that s(f_i) ≥ s(f_j) for all 1 ≤ i < j ≤ n
      ◮ S_i denotes the storage server assigned to group f_i
      ◮ ℓ(S) denotes the current load of storage server S
      ◮ for i = 1, ..., n, S_i is given by (see also the sketch below)
        S_i = arg min_{S ∈ S} ( ℓ(S) + s(f_i) ) / c(S)
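
     To make the rule concrete, here is a minimal Python sketch of the assignment loop. It is illustrative only: the names (lpt_assign, load, capacities) are invented, and this is not the code of the XtreemFS client tool.

     def lpt_assign(groups, capacities):
         """Assign file groups to storage servers (OSDs), largest group first.

         groups:      dict mapping group name -> total size s(f) in bytes
         capacities:  dict mapping server name -> capacity c(S) in bytes
         Returns a dict mapping group name -> server name.
         """
         load = {server: 0 for server in capacities}        # current load l(S)
         assignment = {}
         # sort groups by decreasing size: s(f_1) >= s(f_2) >= ...
         for group, size in sorted(groups.items(), key=lambda kv: kv[1], reverse=True):
             # arg min over servers S of (l(S) + s(f_i)) / c(S)
             best = min(capacities, key=lambda s: (load[s] + size) / capacities[s])
             assignment[group] = best
             load[best] += size
         return assignment

     # toy example: three equally sized OSDs, four region groups (made-up numbers)
     servers = {"osd1": 4000, "osd2": 4000, "osd3": 4000}
     regions = {"32TQU": 2500, "32TKS": 1800, "32TPT": 1200, "33UUP": 900}
     print(lpt_assign(regions, servers))

     Because every group is placed with the same arg-min step, the same loop body also covers the online setting mentioned on slide 19: a newly arriving group is assigned with one more iteration, just without the initial sorting.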

  12.–18. LPT - Step by Step Example (1)–(7)
      [Diagram sequence: the file groups, sorted by decreasing size, are assigned one at a time to machine m1, m2, or m3 according to the rule on slide 11; each step shows the remaining file groups and the machines' current loads]

  19. LPT: Key Properties
      ◮ simple and fast algorithm
      ◮ suitable for offline and online problems
      ◮ good theoretical performance
      ◮ practical evaluation: differs by less than 1% (offline) / less than 5% (online) from the optimal solution

  20. Implementation: Architecture
      [Architecture diagram: the client (GMS application) mounts the file system; the OSDs (object storage devices) store the file contents; the MRC (metadata and replica catalogue) manages metadata; the client tool contains the LPT implementation]
      new code:
      ◮ client tool
      ◮ OSD selection policy for MRC

  21. Implementation: Add Group(s)
      [Diagram: the client accesses files via the POSIX interface (read/write on the OSDs, metadata operations on the MRC); the client tool adds file groups (tag folders) and configures the OSD selection policy of the MRC]
      add folders(/path/to/xtreemfs mount/some/subdirs/32TQU)
        ⇒ MRC adds mapping entry: some/subdirs/32TQU → OSD 17

  22. Implementation: Add File(s)
      [Diagram: same architecture as slide 21]
      open(/path/to/xtreemfs mount/some/subdirs/32TQU/LC8/file.tif)
        ⇒ MRC finds match for prefix some/subdirs/32TQU
        ⇒ file.tif is stored on OSD 17
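
     The mechanism on slides 21 and 22 can be pictured roughly with the Python sketch below. The names (add_mapping, select_osd) are invented, the longest-prefix tie-breaking is an assumption, and this is not the actual OSD selection policy inside the XtreemFS MRC.

     def add_mapping(mapping, folder, osd):
         """Record that all files below 'folder' belong on 'osd' (slide 21)."""
         mapping[folder.rstrip("/")] = osd                  # e.g. "some/subdirs/32TQU" -> "OSD 17"

     def select_osd(mapping, file_path):
         """Return the OSD of the longest registered folder prefix, or None (slide 22)."""
         best = None
         for folder, osd in mapping.items():
             if file_path.startswith(folder + "/") and (best is None or len(folder) > len(best[0])):
                 best = (folder, osd)
         return best[1] if best else None

     mapping = {}
     add_mapping(mapping, "some/subdirs/32TQU", "OSD 17")
     print(select_osd(mapping, "some/subdirs/32TQU/LC8/file.tif"))   # -> OSD 17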

  23. Experimental setup
      ◮ input data: 3.3 TB of satellite images in 355 groups
      ◮ hardware: one master, 29 worker/storage nodes, each with 16 CPU cores and a 10 Gbit network
      ◮ job: read and decompress all data, in the same way as for land cover classification
      ◮ tested file distributions:
        ◮ Random File (state-of-the-art default)
        ◮ Random File Group (e.g., CoHadoop)
        ◮ LPT File Group (our strategy)
      ◮ each tested with HDDs and SSDs
      ◮ 10 repetitions for each setup

  24. Network traffic
      ◮ measure total (incoming) network traffic of the whole job
      [Bar chart: total Rx in GB (axis 0–4000) for Random File, Random File Group, and LPT File Group]
      ◮ 95% decrease compared to Random File
      ◮ 68% decrease compared to Random File Group

  25. Running times and CPU wait times (relative values)
      [Bar chart: relative runtime and total CPU wait time for HDD and SSD setups under Random File, Random File Group, and LPT File Group placement]
      ◮ baseline: Random File takes 40 min with HDDs
      ◮ 39% running time and 50% CPU wait time reduction compared to Random File
      ◮ 65% running time and 47% CPU wait time reduction compared to Random File Group

  26. Running times and CPU wait times (relative values)
      [Same bar chart as slide 25]
      ◮ baseline: Random File takes 16 min with SSDs
      ◮ file placement has no significant impact with SSD setups
      ◮ low network usage seems to have little impact ⇒ HDD speedup mostly due to better scheduling

  27. Conclusions/Summary
      ◮ Lightweight file placement mechanism that combines:
        ◮ colocation of related input files for local performance
        ◮ nearly optimal storage server selection for global performance (load balancing)
      ◮ Empirically verified benefits of colocated LPT placement:
        ◮ network traffic reduced by around 95% compared to Random File placement
        ◮ time to read the input reduced by 39% / 65% compared to Random File / Random Group placement
        ◮ load balance differs from the optimal solution by less than 5%
      ⇒ XtreemFS is ready for efficient large-scale analysis of the Earth’s surface

  28. References
      ◮ XtreemFS: www.xtreemfs.org, https://github.com/xtreemfs/xtreemfs
      ◮ client tool: https://github.com/felse/xtreemfs_client
      ◮ Application (GeoMultiSens): http://www.geomultisens.de/
      ◮ Many thanks to the GeoMultiSens team!
      ◮ Felix Seibert: https://www.zib.de/members/seibert
      ◮ Funding: GeoMultiSens (grants 01IS14010C and 01IS14010B) and the Berlin Big Data Center (BBDC) (grant 01IS14013B).
