getting the most from hash objects
play

Getting the Most from Hash Objects Bharath Gowda Getting the most - PowerPoint PPT Presentation

Getting the Most from Hash Objects Bharath Gowda Getting the most from Hash objects Techniques covered are: SQL join Data step merge using BASE engine Data step merge using SPDE merge Index Key lookup Hash object lookup


  1. Getting the Most from Hash Objects Bharath Gowda

  2. Getting the most from Hash objects Techniques covered are: • SQL join • Data step merge using BASE engine • Data step merge using SPDE merge • Index Key lookup • Hash object lookup • Simple join, iteration and sorting with hash

  3. Test setup • Only using local SAS datasets. • Datasets (including SPDE) reside in the local work library directory. • Storing data locally nullifies the network delays and I/O delays. • The environment has a super fast solid state hard disk which decreases I/O delay. • Same join logic is implemented over all the techniques.

  4. PROC SQL • Requires no sorting prior to joins Rows in pred_cost 13,160,905 • Multi way joins can be performed. • SQL uses internal utility tables for joins Rows in speciality 16,170,805 Proc sql; Rows in acute_pred create table acute_pred as 12,001,616 (inner join) select a.*,b.nwau_sas from pred_cost as a inner join speciality real time 0:20:28.79 as b user cpu time 0:14:15.31 on a.yearid = b.yearid and a.facility_identifier= b.facility_identifier Memory 254512.00k ; quit;

  5. Data step merge • It requires a sort first by the key Rows in pred_cost 13,160,905 variables in all the input datasets. Proc sort data=pred_cost out= srt_pred_cost; Rows in speciality 16,170,805 by yearid facility_identifier; Run; Rows in acute_pred 12,001,616 Proc sort speciality out= srt_spec (keep= yearid (inner join) facility_identifier nwau_sas); by yearid facility_identifier; real time Run; user cpu time (avg 0:30:28.79 combined sorting and 0:21:15.31 Data acute_pred; merging) Merge srt_pred_cost(in=a) srt_spec(in=b); by yearid facility_identifier; Memory(combined avg) 162393.59k If a and b; Run;

  6. SPDE merge • SPDE stands for Scalable Performance data engine Rows in pred_cost 13,160,905 • SPDE combines software and hardware capabilities. Rows in speciality 16,170,805 libname workspde spde "%sysfunc(pathname(work))" temp=yes; Rows in acute_pred proc copy in=work out= workspde; 12,001,616 (simple join) select pred_cost specialty; run; data workspde.acute_pred; real time 0:16:15.31 merge workspde.Hiepred_acutecost_1718(in=a) user cpu time 0:12:28.79 workspde.nwau16 (in=b keep= nwau_sas yearid facility_identifier); by yearid facility_identifier; Memory 1261040.70k If a and b; run;

  7. Index key lookup • Main dataset is not sorted. Rows in pred_cost 13,160,905 • Multiple set statements with key= option . • _IORC_ is one of the key automatic variable which needs attention. Rows in speciality 16,170,805 Proc datasets lib=work nolist; Rows in acute_pred Modify speciality; 12,001,616 (inner join) Index create myindex=(yearid facility_identifier); run; real time 0:12:15.31 user cpu time 0:06:28.79 Data acute_pred; set pred_cost; Set speciality key=myindex ; Memory 142279.71k If _IORC_ =0 ; run;

  8. HASH Objects • One of the fastest approaches Data acute_pred; for the look up activities. length nwau_sas 8.; • In memory computation makes If _n_=1 then do; this technique the fastest. declare hash hn16(dataset: ‘speciality’,duplicate:’e’); hn16.definekey(‘yearid’,’facility_identifier’); • Dataset size is a key factor for hn16.definedata(‘nwau_sas’); hn16.definedone(); memory consumption. end; • Pre sorting is not required. set pred_cost; • Hash object needs to be defined Rc=hn16.find(key :yearid, key:facility_identifier); and instantiated. If rc=0; • Lookups are performed with the Run; find() function.

  9. HASH Objects • Simple inner joins and left joins can be achieved by keeping a check on Rows in pred_cost 13,160,905 the return code variable. Rows in speciality 16,170,805 Rows in acute_pred if rc=0; 12,001,616 (inner join) real time 0:08:28.79 user cpu time 0:03:15.31 Memory 1465437.28k

  10. HASH Iterators • Hash iterator objects need to be defined and instantiated. • Hash objects are assigned to hash iterator objects. Data top bottom; length nwau_sas 8.; If _n_=1 then do; declare hash hn16(dataset:’speciality’, ordered:’descending’); hn16.definekey(‘yearid’,’facility_identifier’); hn16.definedata(all:’Y’); hn16.definedone(); Declare hiter iter_ex(‘hn16’); End; /*hn16.output(dataset:’ sort_dec ’)*/

  11. HASH Iterators • First() and next() functions to Declare hiter iter_ex(‘hn16’); iterate top to bottom and last() End; and prev() functions to iterate Iter_ex.First(); bottom to top do I = 1 to 10; output top; • Can be used to get the top and Iter_ex.Next(); end; bottom records. Iter_ex.last(); do I = 1 to 10; output bottom; Iter_ex.prev(); end; Run;

  12. Avoid or use? • Where should you apply these techniques? Index Datastep SPDE Hash Hash Type Proc SQL lookup Merge Merge lookup Iterators (key=) Small to medium sized tables (< 10,000 ✓ ✓     rows) ✓  ✓ ✓ ✓ ✓ Huge table (> 10 million rows) ✓  ✓ ✓ ✓ ✓ Unsorted data(> 10 million rows) ✓  ✓    Datasets with multiple indexes  ✓ ✓    Less memory hungry techniques

  13. Getting the most from Hash objects Bharath Gowda, SAS analyst Independent Contractor 0468304568 | bharathg1307@gmail.com

Recommend


More recommend