
Accessing Data in the Cloud: Using SAS to Read Data from Amazon Simple Storage Service (S3)



  1. Accessing Data in the Cloud: Using SAS to read data from Amazon Simple Storage Service (S3) (seleritysas.com)

  2. What is Amazon Simple Storage Service (S3)?
     • An object store, not a file system
     • Write once, read many (WORM)
     • Eventually consistent
     • 99.999999999% durability
     • Unlimited storage capacity
     • Highly scalable and available data storage
     • Low latency and high throughput performance

  3. What Public Data is Available in S3?
     • AWS Public Datasets: https://aws.amazon.com/public-datasets/
       • Geospatial and Environmental Datasets
       • Genomics and Life Science Datasets
       • Datasets for Machine Learning
       • Regulatory and Statistical Data
     • awesome-public-datasets: https://github.com/caesar0301/awesome-public-datasets
     • NYC Taxi and Limousine Commission: http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml

  4. What is the typical workflow to use raw data from S3?
     • Download the data file from S3 to your PC using HTTP/HTTPS
     • Upload/import the data to SAS

  5. What would make this more efficient?
     • Cutting out the middleman (your local PC)

  6. How can we have S3 communicate directly with the SAS Server?
     • Use the FILENAME URL access method
       ✓ Easy to implement
       ✗ File is retrieved over HTTP, serially
       ✗ The slowest of all options, and subject to timeouts for very large files
     • Use PROC S3 to download files to the SAS Server's filesystem
       ✓ Very fast, as it uses parallel downloads
       ✗ Only available from SAS 9.4M4 onward
       ✗ Only works with secure S3 files, not public S3 files

  7. How can we have S3 communicate directly with the SAS Server? (continued)
     • Use the AWS CLI to download files to the SAS Server's filesystem
       ✓ Very fast, as it uses parallel downloads
       ✗ Need to install the AWS CLI on the SAS Server
       ✗ Need the ability to run X commands on the SAS Server
     • "Mount" the S3 storage on the SAS Server
       ✓ Treat it like a local disk
       ✗ S3 is not designed for block storage/access
       ✗ Potential issues with current storage driver implementations
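The AWS CLI option comes down to one `aws s3 cp` call issued on the SAS Server. The slides do not show the exact command used, so the following is only a minimal Python sketch of how such a command line could be assembled and quoted safely. The helper name `build_s3_copy_command`, the destination path under `/tmp`, and the use of `--no-sign-request` (anonymous access to a public bucket) are assumptions for illustration.

```python
import shlex


def build_s3_copy_command(bucket, key, dest, anonymous=True):
    """Return the argument list for an `aws s3 cp` download (hypothetical helper)."""
    cmd = ["aws", "s3", "cp", f"s3://{bucket}/{key}", dest]
    if anonymous:
        # Assumption: the bucket is public, so requests are sent unsigned.
        cmd.append("--no-sign-request")
    return cmd


cmd = build_s3_copy_command(
    "nyc-tlc",
    "trip data/yellow_tripdata_2017-01.csv",
    "/tmp/yellow_tripdata_2017-01.csv",  # hypothetical destination on the SAS Server
)

# shlex.join quotes the argument containing a space, so the resulting string
# is safe to pass to a shell.
command_line = shlex.join(cmd)
print(command_line)
```

The quoting matters because the object key contains a space. A string built this way is the kind of command that would be handed to an X statement in SAS.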

  8. Example: NYC Trip Data in S3
     • NYC Yellow Cab trip data for January 2017
       • 9,710,124 records
       • CSV format
       • 815 MB
     • Location
       • Bucket: nyc-tlc
       • Object Key: trip data/yellow_tripdata_2017-01.csv
       • HTTP Protocol: https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2017-01.csv
       • S3 Protocol: "s3://nyc-tlc/trip data/yellow_tripdata_2017-01.csv"
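The bucket and object key above map onto the two addresses mechanically. As a minimal sketch, the Python below derives both forms from the bucket and key. The function names are hypothetical, and writing the space in the key as `+` simply mirrors the HTTP URL shown on the slide.

```python
from urllib.parse import quote


def s3_uri(bucket, key):
    """Build the s3:// form; the key is used verbatim, spaces included."""
    return f"s3://{bucket}/{key}"


def https_url(bucket, key):
    """Build the path-style HTTPS form used on the slide."""
    # Percent-encode the key but keep "/" as the path separator, then write
    # spaces as "+" to match the slide. A literal "+" in a key would already
    # have been encoded as %2B, so the replacement is unambiguous.
    encoded_key = quote(key, safe="/").replace("%20", "+")
    return f"https://s3.amazonaws.com/{bucket}/{encoded_key}"


bucket = "nyc-tlc"
key = "trip data/yellow_tripdata_2017-01.csv"

print(s3_uri(bucket, key))
print(https_url(bucket, key))
```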

  9. FILENAME URL Access Method
     NOTE: The data set WORK.YELLOW_TRIPDATA_2017_01 has 9710124 observations and 17 variables.
           real time 36.09 seconds
           cpu time 33.85 seconds

  10. PROC S3
     NOTE: PROCEDURE S3 used (Total process time):
           real time 3.77 seconds
           cpu time 6.31 seconds
     NOTE: PROCEDURE IMPORT used (Total process time):
           real time 26.75 seconds
           cpu time 26.75 seconds

  11. AWS CLI
     NOTE: DATA statement used (Total process time):
           real time 5.80 seconds
           cpu time 0.00 seconds
     NOTE: PROCEDURE IMPORT used (Total process time):
           real time 26.59 seconds
           cpu time 26.59 seconds
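To compare the three approaches end to end, the real times from the logs on slides 9 to 11 can be added up per method. The sketch below does that arithmetic in Python. It assumes the 36.09 seconds reported for FILENAME URL covers both retrieval and import in a single step, and it uses the 815 MB file size from slide 8 to estimate download throughput; the variable names are illustrative.

```python
FILE_SIZE_MB = 815  # file size from slide 8

# Real (wall-clock) seconds per step, taken from the logs on slides 9 to 11.
timings = {
    "FILENAME URL": [36.09],                  # assumed: retrieval and import in one step
    "PROC S3 + PROC IMPORT": [3.77, 26.75],   # download, then import
    "AWS CLI + PROC IMPORT": [5.80, 26.59],   # download, then import
}

# Total elapsed time per method.
totals = {method: round(sum(steps), 2) for method, steps in timings.items()}

# Approximate download throughput for the two methods with a separate download step.
throughput_mb_s = {
    method: FILE_SIZE_MB / steps[0]
    for method, steps in timings.items()
    if len(steps) > 1
}

fastest = min(totals, key=totals.get)

for method, total in totals.items():
    print(f"{method}: {total:.2f} s total")
for method, rate in throughput_mb_s.items():
    print(f"{method}: about {rate:.0f} MB/s download")
print(f"Fastest overall: {fastest}")
```

On these numbers the download step is small next to PROC IMPORT, which takes roughly 27 seconds either way, so the end-to-end gain from a faster download is a few seconds on a file of this size.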

  12. Questions? Contact: michael@selerity.com.au, 1300 727 757, seleritysas.com
