Accessing Data in the Cloud Using SAS to read data from Amazon - PowerPoint PPT Presentation
Accessing Data in the Cloud
Using SAS to read data from Amazon Simple Storage Service (S3)
seleritysas.com
What is Amazon Simple Storage Service (S3)?
• An object store, not a file system
• Write once, read many (WORM)
• Eventually consistent
• 99.999999999% durability
• Unlimited storage capacity
• Highly scalable and available data storage
• Low latency and high throughput performance
What Public Data is Available in S3?
• AWS Public Datasets
  • https://aws.amazon.com/public-datasets/
  • Geospatial and Environmental Datasets
  • Genomics and Life Science Datasets
  • Datasets for Machine Learning
  • Regulatory and Statistical Data
• awesome-public-datasets
  • https://github.com/caesar0301/awesome-public-datasets
• NYC Taxi and Limousine Commission
  • http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml
What is the typical workflow to use raw data from S3?
• Download the data file from S3 to your PC using HTTP/HTTPS
• Upload/Import the data to SAS
What would make this more efficient?
• Cutting out the middleman (your local PC)
How can we have S3 communicate directly with the SAS Server?
• Use the FILENAME URL access method
  ✓ Easy to implement
  ✗ File is retrieved over HTTP (serially)
  ✗ The slowest of all options, and subject to timeouts for very large files
• Use PROC S3 to download files to the SAS Server's filesystem
  ✓ Very fast, as it uses parallel downloads
  ✗ Only available from SAS 9.4M4 onward
  ✗ Only works with secure S3 files, not public S3 files
How can we have S3 communicate directly with the SAS Server? (continued)
• Use the AWS CLI to download files to the SAS Server's filesystem
  ✓ Very fast, as it uses parallel downloads
  ✗ Need to install the AWS CLI on the SAS Server
  ✗ Need the ability to run X commands on the SAS Server
• "Mount" the S3 storage on the SAS Server
  ✓ Treat it like a local disk
  ✗ S3 is not designed for block storage/access
  ✗ Potential issues with current storage driver implementations
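The "parallel downloads" credited to PROC S3 and the AWS CLI can be pictured as splitting one object into byte ranges that are fetched concurrently, rather than streaming the file serially. The Python sketch below only plans the ranges and makes no network calls. The function name `plan_ranges`, the 8 MiB part size, and the 815 MiB object size are illustrative assumptions, not settings taken from either tool.

```python
def plan_ranges(total_bytes: int, part_bytes: int) -> list[tuple[int, int]]:
    """Return inclusive (start, end) byte offsets that cover the whole object."""
    if total_bytes < 0 or part_bytes <= 0:
        raise ValueError("total_bytes must be >= 0 and part_bytes must be > 0")
    return [
        (start, min(start + part_bytes, total_bytes) - 1)
        for start in range(0, total_bytes, part_bytes)
    ]


# Assumed sizes for illustration only.
object_size = 815 * 1024 * 1024
part_size = 8 * 1024 * 1024

parts = plan_ranges(object_size, part_size)
first_start, first_end = parts[0]
print(len(parts), "parts")
# HTTP Range headers use inclusive offsets, which is why each end is start + size - 1.
print(f"Range: bytes={first_start}-{first_end}")
```

Each range could then be requested by a separate worker and the pieces written back at their offsets, which is why these options tolerate large files better than a single serial transfer.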
Example: NYC Trip Data in S3
• NYC Yellow Cab trip data for January 2017
  • 9,710,124 records
  • CSV format
  • 815 MB
• Location
  • Bucket: nyc-tlc
  • Object Key: trip data/yellow_tripdata_2017-01.csv
  • HTTP Protocol: https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2017-01.csv
  • S3 Protocol: "s3://nyc-tlc/trip data/yellow_tripdata_2017-01.csv"
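A minimal Python sketch of how the bucket and object key on this slide map onto the two address forms. The helper names `s3_http_url` and `s3_uri` are hypothetical, and the sketch assumes the path-style `s3.amazonaws.com` endpoint and the `+` encoding for spaces exactly as they appear in the slide's URL.

```python
from urllib.parse import quote_plus


def s3_http_url(bucket: str, key: str) -> str:
    """Path-style HTTPS address: spaces in the key become '+', '/' is kept."""
    return f"https://s3.amazonaws.com/{bucket}/{quote_plus(key, safe='/')}"


def s3_uri(bucket: str, key: str) -> str:
    """S3 protocol address: the key is used verbatim, spaces included."""
    return f"s3://{bucket}/{key}"


bucket = "nyc-tlc"
key = "trip data/yellow_tripdata_2017-01.csv"

print(s3_http_url(bucket, key))
print(s3_uri(bucket, key))
```

The space in the key is the detail to watch: the HTTP form must encode it, while the S3 form keeps it and therefore needs quoting on a command line.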
FILENAME URL Access Method
NOTE: The data set WORK.YELLOW_TRIPDATA_2017_01 has 9710124 observations and 17 variables.
      real time           36.09 seconds
      cpu time            33.85 seconds
PROC S3
NOTE: PROCEDURE S3 used (Total process time):
      real time           3.77 seconds
      cpu time            6.31 seconds
NOTE: PROCEDURE IMPORT used (Total process time):
      real time           26.75 seconds
      cpu time            26.75 seconds
AWS CLI
NOTE: DATA statement used (Total process time):
      real time           5.80 seconds
      cpu time            0.00 seconds
NOTE: PROCEDURE IMPORT used (Total process time):
      real time           26.59 seconds
      cpu time            26.59 seconds
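To compare the three approaches end to end, the real (elapsed) times from the log notes on the three slides above can be summed per method. This is a small Python sketch of that arithmetic. It assumes the single FILENAME URL figure covers both retrieval and import, and the step labels are mine rather than SAS terminology.

```python
# Real (elapsed) times in seconds, copied from the SAS log notes above.
timings = {
    "FILENAME URL": {"read and import": 36.09},
    "PROC S3": {"download": 3.77, "import": 26.75},
    "AWS CLI": {"download": 5.80, "import": 26.59},
}

# Total elapsed time per method, rounded to the precision of the log.
totals = {
    method: round(sum(steps.values()), 2)
    for method, steps in timings.items()
}

# Print fastest first.
for method, total in sorted(totals.items(), key=lambda item: item[1]):
    print(f"{method:<13}{total:6.2f} s")
```

On these figures the download itself is a small share of the total: PROC IMPORT takes roughly 26 to 27 seconds in both download-based runs.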
Questions? Contact michael@selerity.com.au 1300 727 757 seleritysas.com