
Berkeley Archival Storage Encapsulation Library (BASE)



  1. Berkeley Archival Storage Encapsulation Library (BASE)
     Alex Sim, Junmin Gu
     Scientific Data Management Research Group
     Computational Research Division
     Lawrence Berkeley National Laboratory
     A. Sim, CRD, LBNL, Sep. 30, 2015

  2. BASE
     • Berkeley Archival Storage Encapsulation Library
     • Support for archival data on mass storage systems is critical to the operations of ESGF
       • One of the fundamental ESGF data management services
     • Large-scale data access from NERSC HPSS
       • Lets the ESGF Gateway system integrate data access to archival files at NERSC HPSS
     • Built on experience with the Berkeley Storage Manager (BeStMan), 2005-2015, and the Hierarchical Resource Manager (HRM), 1998-2006
     • Ensures efficient data access to the archival storage at NERSC for ESGF

  3. BeStMan system architecture
     [Diagram. Labels: USER 1, USER 2, ..., USER n; WAN/LAN; SRM; Request Queue Management; Security Module; Local Policy Module; File Service Queue; USER QUEUE Management; DISK Management; MSS Access Management (PFTP, HSI, MSRCP, SCP...); Network Access Management (GridFTP, FTP, BBFTP, SCP...); MSS; DISK; GridFTP server; FTP server; BBFTP server]

  4. BASE design
     [Diagram. Labels: Python module for NERSC HPSS; Berkeley Archival Storage Encapsulation (BASE) Library; Browse (ls), Retrieve (get), Archive (put); Checksum enabled; HSI; NERSC HPSS; Local DISK Storage]

  5. Main functions
     • Python module with three main functions: browsing, retrieving, and archiving
     • File browsing function
       • Gets file size information for files on HPSS
     • File retrieving function
       • Gets a file from the HPSS source location to the local destination disk path
     • File archiving function
       • Puts a file from the local source disk path to the HPSS destination path
       • The ESGF Gateway service would not use this function

  6. Backend calls
     • HSI command
       • Used in the backend to access HPSS
       • Its output log is parsed to determine the operation status
       • After a successful HSI operation, the output log is removed to reduce disk usage.
       • If the HSI operation fails for any reason, the output log is kept so that the cause of the failure can be investigated.
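The log handling described on this slide can be sketched in Python as follows. This is only an illustration, not the BASE implementation: `run_with_log` is a hypothetical helper, the command is a generic placeholder rather than a real HSI invocation, and the process exit status stands in for the log parsing that BASE performs.

```python
import os
import subprocess
import sys
import tempfile


def run_with_log(cmd, log_dir):
    """Run cmd, capture its output in a log file, and keep the log only on failure.

    Hypothetical helper: the exit status is used as a stand-in for parsing
    the log. Returns (ok, log_path); log_path is None when the log was removed.
    """
    fd, log_path = tempfile.mkstemp(suffix=".log", dir=log_dir)
    with os.fdopen(fd, "w") as log:
        # Send both stdout and stderr of the command to the log file.
        result = subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT)
    if result.returncode == 0:
        # Success: the log is no longer needed, so free the disk space.
        os.remove(log_path)
        return True, None
    # Failure: keep the log so the cause can be investigated later.
    return False, log_path
```

A caller would pass the backend command as an argument list and inspect the returned log path only when the first element of the result is False.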

  7. Interface
     • User code or higher-level service integrations use the Python methods.
     • The Python methods are backed by C++ class methods.

  8. Checksum
     • Checksum comparison is enabled if the HPSS file was archived with the checksum option.
       • The checksum value is saved on the HPSS file system.
     • If the archived file does not have a checksum value, the checksum comparison is skipped.
     • By default, sha256 checksums are compared
       • The checksum type can be configured with an option.
       • The manual describes how to change the checksum type.
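The comparison rule on this slide (compare when a stored checksum exists, skip otherwise, sha256 by default) could look roughly like the sketch below. It uses only Python's standard `hashlib`; the function names and the way the stored checksum is passed in are assumptions for illustration, not the BASE API.

```python
import hashlib


def file_checksum(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Return the hex digest of the file at path, read in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path, stored_checksum, algorithm="sha256"):
    """Compare a local file against a stored checksum (hypothetical helper).

    Returns None when there is no stored checksum (comparison skipped),
    otherwise True or False.
    """
    if not stored_checksum:
        return None  # archived without a checksum: skip the comparison
    return file_checksum(path, algorithm) == stored_checksum.lower()
```

Returning None for the skipped case keeps "not checked" distinct from "checked and mismatched".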

  9. SVN repo
     • SVN repository for the BASE source code
       • https://code.lbl.gov/projects/base/
     • Anonymous access is enabled

         svn checkout --username anonsvn https://code.lbl.gov/svn/base/trunk/base

  10. Configure/Make/Install
      • Configure
        • Finds the necessary paths and options.
      • Make and make install
        • Build the library and place the library file in the lib directory of the distribution directory
      • Example

          cd base
          ./configure
          make
          make install
          ls -l dist/lib

  11. HSI preparation
      • Download HSI from NERSC (version 4.0.1.2)
        • NERSC NIM account is needed
        • https://www.nersc.gov/users/storage-and-file-systems/hpss/storing-and-retrieving-data/software-downloads/
      • Install
        • HSI installation instructions
        • https://www.nersc.gov/users/storage-and-file-systems/hpss/storing-and-retrieving-data/clients/hsi-configuration-and-installation/
      • Credential setup
        • Before first use, the NERSC HPSS password needs to be set up in the $HOME/.netrc file
        • NERSC NIM account is needed
        • http://www.nersc.gov/users/storage-and-file-systems/hpss/getting-started/hpss-passwords/#toc-anchor-3
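Before the first run, it can help to confirm that a .netrc entry actually exists. The sketch below uses Python's standard `netrc` module. The helper name is hypothetical, and the host name to look up is an assumption (the configuration shown later in these slides uses archive.nersc.gov); the NERSC page linked above is the authority on the exact entry format.

```python
import netrc


def has_netrc_entry(netrc_path, host):
    """Return True when netrc_path holds a login and password for host.

    Hypothetical helper for a pre-flight check; it does not contact HPSS.
    """
    try:
        entry = netrc.netrc(netrc_path).authenticators(host)
    except (OSError, netrc.NetrcParseError):
        return False  # missing, unreadable, or malformed file
    if entry is None:
        return False  # no entry for this host and no default entry
    login, _account, password = entry
    return bool(login) and bool(password)
```

For example, `has_netrc_entry(os.path.expanduser("~/.netrc"), "archive.nersc.gov")` would report whether that file has an entry for the assumed host.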

  12. Python samples
      • The samples directory includes all Python code samples

          % ls samples/
          HOW_TO_RUN.txt        sample-ls.py
          sample-put.py         esgf_base_mss.py
          sample-multi-read.py  sample-get.py
          sample-multi-write.py

  13. class sdm
      • esgf_base_mss.py has the class sdm
        • It includes the getSize(), getFile() and putFile() calls for browsing, retrieving and archiving, respectively.

          class sdm(object):
              def getSize(self, src): ...        # browse: size of the HPSS file src
              def getFile(self, src, tgt): ...   # retrieve: HPSS src to local tgt
              def putFile(self, src, tgt): ...   # archive: local src to HPSS tgt

  14. Browsing
      • Browsing a file
        • Gets file size information for files on HPSS

          import esgf_base_mss

          mssf = esgf_base_mss.sdm()
          filesize = mssf.getSize(hpss_file_path)

  15. Retrieving
      • Retrieving a file
        • Retrieves a file from the NERSC HPSS to the local disk path

          import esgf_base_mss

          mssf = esgf_base_mss.sdm()
          mssf.getFile(hpss_file_path, local_file_path)

  16. Archiving
      • Archiving a file
        • Archives a file from the local disk path to the NERSC HPSS

          import esgf_base_mss

          mssf = esgf_base_mss.sdm()
          mssf.putFile(local_file_path, hpss_file_path)

  17. Configuration options
      • esgf_base_mss.py writes the configuration options for the library
        • base_mss.rc
      • To change the options, update esgf_base_mss.py

          mss*HSI=hsi
          mss*checksum=sha256
          mss*MSSHostName=archive.nersc.gov
          mss*EnableLogging=true
          mss*MSSLogFile=/.../samples/msslogs/mss.log
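The `mss*key=value` layout shown on this slide is simple enough to render and read back with a few lines of Python. The sketch below is an illustration of that layout only: `render_rc` and `parse_rc` are hypothetical helpers, not functions from esgf_base_mss.py, and the MSSLogFile option is left out because its path is elided in the slide.

```python
def render_rc(options, prefix="mss"):
    """Render options as 'prefix*key=value' lines, one per option."""
    return "".join(f"{prefix}*{key}={value}\n" for key, value in options.items())


def parse_rc(text, prefix="mss"):
    """Read 'prefix*key=value' lines back into a dict, ignoring other lines."""
    marker = prefix + "*"
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if "=" not in line:
            continue
        name, value = line.split("=", 1)
        if name.startswith(marker):
            options[name[len(marker):]] = value
    return options


# Option names and values copied from the slide (log file path omitted).
options = {
    "HSI": "hsi",
    "checksum": "sha256",
    "MSSHostName": "archive.nersc.gov",
    "EnableLogging": "true",
}
rc_text = render_rc(options)
```

Writing `rc_text` to a file named base_mss.rc would reproduce the first four lines of the slide's example.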

  18. Notes on the API runs
      • NERSC policy sets a maximum number of concurrent connections to the NERSC HPSS.
        • It used to be 15.
        • When a 16th connection is attempted, the HSI connection fails immediately with the error message “421 Service not available - maximum number of sessions exceeded”.

  19. Multi-file requests
      • This is optional
      • A multi-file request can be programmed.
        • It needs to be done at the user level using the API, with control over the maximum number of concurrent connections.
      • The same multi-file request can be made with sequential individual file requests in multiple threads.
      • Both cases should control the maximum number of concurrent connections to the HPSS.
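One way to keep a multi-file request under the connection limit is a fixed-size worker pool. The sketch below uses Python's standard `concurrent.futures`; `run_bounded` is a hypothetical helper (it is not the library's runTask), `transfer` is any callable supplied by the caller, and the default limit of 15 is only the historical value mentioned in these slides.

```python
from concurrent.futures import ThreadPoolExecutor


def run_bounded(transfer, requests, max_workers, limit=15):
    """Call transfer(*request) for every request with at most max_workers threads.

    Hypothetical helper. max_workers must stay below limit so the pool can
    never open more concurrent connections than the site allows.
    """
    if not 1 <= max_workers < limit:
        raise ValueError("max_workers must be at least 1 and less than limit")
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(transfer, *request) for request in requests]
        # Collect results in submission order; an exception raised by a
        # transfer is re-raised here.
        return [future.result() for future in futures]
```

With an sdm object named mssf, a call such as `run_bounded(mssf.getFile, [(src1, dst1), (src2, dst2)], 3)` would issue the individual file requests with no more than three in flight at once, assuming the underlying calls are safe to run from threads.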

  20. Retrieving multiple files
      • For retrieving multiple files from the NERSC HPSS source paths to the local disk destination paths (e.g. sample-multi-read.py for 3 files):

          from multiprocessing import Process, Queue, current_process, freeze_support, Pool
          import esgf_base_mss

          mssf = esgf_base_mss.sdm()
          tasks = []
          # put each file request in an array
          tasks.append((esgf_base_mss.readFromNersc, (mssf, hpss_file_path_1, local_file_path_1)))
          tasks.append((esgf_base_mss.readFromNersc, (mssf, hpss_file_path_2, local_file_path_2)))
          tasks.append((esgf_base_mss.readFromNersc, (mssf, hpss_file_path_3, local_file_path_3)))
          # run the requests with at most 3 concurrent connections
          esgf_base_mss.runTask(tasks, 3)

      • Note that N in runTask(tasks, N) must be less than the maximum number of concurrent connections allowed to HPSS.
      • Also, make sure that the multiprocessing package for Python is imported.

  21. Archiving multiple files
      • For archiving multiple files from the local disk source paths to the NERSC HPSS destination paths (e.g. sample-multi-write.py for 3 files):

          from multiprocessing import Process, Queue, current_process, freeze_support, Pool
          import esgf_base_mss

          mssf = esgf_base_mss.sdm()
          tasks = []
          # put each file request in an array
          tasks.append((esgf_base_mss.putToNersc, (mssf, local_file_path_1, hpss_file_path_1)))
          tasks.append((esgf_base_mss.putToNersc, (mssf, local_file_path_2, hpss_file_path_2)))
          tasks.append((esgf_base_mss.putToNersc, (mssf, local_file_path_3, hpss_file_path_3)))
          # run the requests with at most 3 concurrent connections
          esgf_base_mss.runTask(tasks, 3)

      • Note that N in runTask(tasks, N) must be less than the maximum number of concurrent connections allowed to HPSS.
      • Also, make sure that the multiprocessing package for Python is imported.

  22. Summary
      • BASE Library
        • Python API and C/C++ library file
        • HSI access for HPSS
        • NERSC HPSS as the first step
      • Source code is available under the BSD license
        • Anonymous access
        • https://code.lbl.gov/projects/base/
      • Support is available
        • sdmsupport@lbl.gov
