Getting the Most from Hash Objects Bharath Gowda
Getting the most from Hash objects Techniques covered are: • SQL join • Data step merge using BASE engine • Data step merge using SPDE merge • Index Key lookup • Hash object lookup • Simple join, iteration and sorting with hash
Test setup • Only using local SAS datasets. • Datasets (including SPDE) reside in the local work library directory. • Storing data locally nullifies the network delays and I/O delays. • The environment has a super fast solid state hard disk which decreases I/O delay. • Same join logic is implemented over all the techniques.
PROC SQL • Requires no sorting prior to joins Rows in pred_cost 13,160,905 • Multi way joins can be performed. • SQL uses internal utility tables for joins Rows in speciality 16,170,805 Proc sql; Rows in acute_pred create table acute_pred as 12,001,616 (inner join) select a.*,b.nwau_sas from pred_cost as a inner join speciality real time 0:20:28.79 as b user cpu time 0:14:15.31 on a.yearid = b.yearid and a.facility_identifier= b.facility_identifier Memory 254512.00k ; quit;
Data step merge • It requires a sort first by the key Rows in pred_cost 13,160,905 variables in all the input datasets. Proc sort data=pred_cost out= srt_pred_cost; Rows in speciality 16,170,805 by yearid facility_identifier; Run; Rows in acute_pred 12,001,616 Proc sort speciality out= srt_spec (keep= yearid (inner join) facility_identifier nwau_sas); by yearid facility_identifier; real time Run; user cpu time (avg 0:30:28.79 combined sorting and 0:21:15.31 Data acute_pred; merging) Merge srt_pred_cost(in=a) srt_spec(in=b); by yearid facility_identifier; Memory(combined avg) 162393.59k If a and b; Run;
SPDE merge • SPDE stands for Scalable Performance data engine Rows in pred_cost 13,160,905 • SPDE combines software and hardware capabilities. Rows in speciality 16,170,805 libname workspde spde "%sysfunc(pathname(work))" temp=yes; Rows in acute_pred proc copy in=work out= workspde; 12,001,616 (simple join) select pred_cost specialty; run; data workspde.acute_pred; real time 0:16:15.31 merge workspde.Hiepred_acutecost_1718(in=a) user cpu time 0:12:28.79 workspde.nwau16 (in=b keep= nwau_sas yearid facility_identifier); by yearid facility_identifier; Memory 1261040.70k If a and b; run;
Index key lookup • Main dataset is not sorted. Rows in pred_cost 13,160,905 • Multiple set statements with key= option . • _IORC_ is one of the key automatic variable which needs attention. Rows in speciality 16,170,805 Proc datasets lib=work nolist; Rows in acute_pred Modify speciality; 12,001,616 (inner join) Index create myindex=(yearid facility_identifier); run; real time 0:12:15.31 user cpu time 0:06:28.79 Data acute_pred; set pred_cost; Set speciality key=myindex ; Memory 142279.71k If _IORC_ =0 ; run;
HASH Objects • One of the fastest approaches Data acute_pred; for the look up activities. length nwau_sas 8.; • In memory computation makes If _n_=1 then do; this technique the fastest. declare hash hn16(dataset: ‘speciality’,duplicate:’e’); hn16.definekey(‘yearid’,’facility_identifier’); • Dataset size is a key factor for hn16.definedata(‘nwau_sas’); hn16.definedone(); memory consumption. end; • Pre sorting is not required. set pred_cost; • Hash object needs to be defined Rc=hn16.find(key :yearid, key:facility_identifier); and instantiated. If rc=0; • Lookups are performed with the Run; find() function.
HASH Objects • Simple inner joins and left joins can be achieved by keeping a check on Rows in pred_cost 13,160,905 the return code variable. Rows in speciality 16,170,805 Rows in acute_pred if rc=0; 12,001,616 (inner join) real time 0:08:28.79 user cpu time 0:03:15.31 Memory 1465437.28k
HASH Iterators • Hash iterator objects need to be defined and instantiated. • Hash objects are assigned to hash iterator objects. Data top bottom; length nwau_sas 8.; If _n_=1 then do; declare hash hn16(dataset:’speciality’, ordered:’descending’); hn16.definekey(‘yearid’,’facility_identifier’); hn16.definedata(all:’Y’); hn16.definedone(); Declare hiter iter_ex(‘hn16’); End; /*hn16.output(dataset:’ sort_dec ’)*/
HASH Iterators • First() and next() functions to Declare hiter iter_ex(‘hn16’); iterate top to bottom and last() End; and prev() functions to iterate Iter_ex.First(); bottom to top do I = 1 to 10; output top; • Can be used to get the top and Iter_ex.Next(); end; bottom records. Iter_ex.last(); do I = 1 to 10; output bottom; Iter_ex.prev(); end; Run;
Avoid or use? • Where should you apply these techniques? Index Datastep SPDE Hash Hash Type Proc SQL lookup Merge Merge lookup Iterators (key=) Small to medium sized tables (< 10,000 ✓ ✓ rows) ✓ ✓ ✓ ✓ ✓ Huge table (> 10 million rows) ✓ ✓ ✓ ✓ ✓ Unsorted data(> 10 million rows) ✓ ✓ Datasets with multiple indexes ✓ ✓ Less memory hungry techniques
Getting the most from Hash objects Bharath Gowda, SAS analyst Independent Contractor 0468304568 | bharathg1307@gmail.com
Recommend
More recommend