Accessing Data in the Cloud
Using SAS to read data from Amazon Simple Storage Service (S3)
seleritysas.com

What is Amazon Simple Storage Service (S3)?
• An object store, not a file system
• Write once, read many (WORM)
• Eventually consistent
• 99.999999999% durability
• Unlimited storage capacity
• Highly scalable and available data storage
• Low latency and high throughput performance

What Public Data is Available in S3?
• AWS Public Datasets
  • https://aws.amazon.com/public-datasets/
  • Geospatial and Environmental Datasets
  • Genomics and Life Science Datasets
  • Datasets for Machine Learning
  • Regulatory and Statistical Data
• awesome-public-datasets
  • https://github.com/caesar0301/awesome-public-datasets
• NYC Taxi and Limousine Commission
  • http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml

What is the typical workflow to use raw data from S3?
• Download the data file from S3 to your PC using http/https
• Upload/Import the data to SAS

What would make this more efficient?
• Cutting out the middleman (your local PC)

How can we have S3 communicate directly with the SAS Server?
• Use the FILENAME URL access method
  ✓ Easy to implement
  ✗ File is retrieved using the http protocol (serially)
  ✗ The slowest of all options, subject to timeouts for very large files
• Use PROC S3 to download files to the SAS Server’s filesystem
  ✓ Very fast, as it uses parallel downloads
  ✗ Only available from SAS 9.4M4
  ✗ Only works with secure S3 files, not public S3 files

How can we have S3 communicate directly with the SAS Server? (continued)
• Use the AWS CLI to download files to the SAS Server’s filesystem
  ✓ Very fast, as it uses parallel downloads
  ✗ Need to install the AWS CLI on the SAS Server
  ✗ Need the ability to run X commands on the SAS Server
• “Mount” the S3 storage on the SAS Server
  ✓ Treat it like a local disk
  ✗ S3 is not designed for block storage/access
  ✗ Potential issues with current storage driver implementations

Example: NYC Trip Data in S3
• NYC Yellow Cab trip data for January 2017
  • 9,710,124 records
  • CSV format
  • 815 MB
• Location
  • Bucket: nyc-tlc
  • Object Key: trip data/yellow_tripdata_2017-01.csv
  • HTTP Protocol: https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2017-01.csv
  • S3 Protocol: “s3://nyc-tlc/trip data/yellow_tripdata_2017-01.csv”

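The bucket and object key above determine both addresses on this slide, as well as the command an AWS CLI download would use. The following is a minimal sketch in Python, used here purely for illustration rather than as part of the SAS workflow. The destination path and the `--no-sign-request` flag are assumptions not taken from the slides. The flag skips credential signing and only suits objects that allow anonymous reads.

```python
import shlex
from urllib.parse import quote_plus


def s3_locations(bucket, key):
    """Return the HTTPS URL and the S3 URI for one object."""
    # Path-style HTTPS URL: keep "/" separators and write spaces as "+",
    # matching the form shown on the slide.
    https_url = f"https://s3.amazonaws.com/{bucket}/{quote_plus(key, safe='/')}"
    # S3 URI as used by the AWS CLI: the key is written as-is, spaces included.
    s3_uri = f"s3://{bucket}/{key}"
    return https_url, s3_uri


def aws_cli_download(s3_uri, dest):
    """Build (but do not run) an `aws s3 cp` command as an argument list."""
    # Assumption: --no-sign-request is only appropriate for objects that
    # permit anonymous reads; drop it when credentials are configured.
    return ["aws", "s3", "cp", s3_uri, dest, "--no-sign-request"]


https_url, s3_uri = s3_locations("nyc-tlc", "trip data/yellow_tripdata_2017-01.csv")
# Hypothetical destination path on the SAS Server's filesystem.
command = aws_cli_download(s3_uri, "/tmp/yellow_tripdata_2017-01.csv")

print(https_url)
print(s3_uri)
print(shlex.join(command))  # shell-quoted so the space in the key survives
```

The key contains a space, so the S3 URI has to be quoted whenever the command is passed through a shell, for example from an X command.
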
FILENAME URL Access Method
NOTE: The data set WORK.YELLOW_TRIPDATA_2017_01 has 9710124 observations and 17 variables.
      real time           36.09 seconds
      cpu time            33.85 seconds

PROC S3
NOTE: PROCEDURE S3 used (Total process time):
      real time           3.77 seconds
      cpu time            6.31 seconds
NOTE: PROCEDURE IMPORT used (Total process time):
      real time           26.75 seconds
      cpu time            26.75 seconds

AWS CLI
NOTE: DATA statement used (Total process time):
      real time           5.80 seconds
      cpu time            0.00 seconds
NOTE: PROCEDURE IMPORT used (Total process time):
      real time           26.59 seconds
      cpu time            26.59 seconds

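To compare the three approaches end to end, the real-time figures from the log notes can be added up per method. The sketch below is in Python purely for illustration and uses only the numbers reported on the slides. These look like single-run timings, so treat the ranking as indicative rather than a benchmark. The FILENAME URL log reports one combined figure, so a separate download rate can only be derived for PROC S3 and the AWS CLI.

```python
FILE_SIZE_MB = 815  # size of yellow_tripdata_2017-01.csv, from the example slide

# Real-time seconds per step, copied from the SAS log notes.
# FILENAME URL downloads and imports in one step; the other two methods
# download first and then run PROC IMPORT.
steps = {
    "FILENAME URL": [36.09],
    "PROC S3": [3.77, 26.75],
    "AWS CLI": [5.80, 26.59],
}

# Total elapsed real time per method, rounded to the precision of the log.
totals = {name: round(sum(times), 2) for name, times in steps.items()}

# Download throughput in MB/s, only where the download was timed separately.
download_rate = {
    name: round(FILE_SIZE_MB / times[0], 1)
    for name, times in steps.items()
    if len(times) == 2
}

for name in sorted(totals, key=totals.get):
    rate = download_rate.get(name)
    rate_text = f", download about {rate} MB/s" if rate is not None else ""
    print(f"{name}: {totals[name]} s total{rate_text}")
```

In these figures the import step dominates: the download method changes the total by only a few seconds, while PROC IMPORT takes roughly 26 to 27 seconds either way.
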
Questions?
Contact
michael@selerity.com.au
1300 727 757
seleritysas.com