big data sets seen as a big problem and how to deal with
play

Big Data Sets Seen as a Big Problem and How to Deal with Them - PowerPoint PPT Presentation

Big Data Sets Seen as a Big Problem and How to Deal with Them Daniil Shliakhov, Kharkiv, Ukraine Frankfurt 2018 B I G D ATA S E T S S E E N A S A B I G P R O B L E M INTRO Running time is an issue! Running time Normal Size Data Sets


  1. Big Data Sets Seen as a Big Problem and How to Deal with Them Daniil Shliakhov, Kharkiv, Ukraine Frankfurt 2018

  2. B I G D ATA S E T S S E E N A S A B I G P R O B L E M INTRO Running time is an issue! Running time Normal Size Data Sets Large Data Sets Page 2

  3. B I G D ATA S E T S S E E N A S A B I G P R O B L E M INTRO Parameter Treatment n Mean SD Median Min Max Alkaline Baseline Phosphatase (U/L) Pooled TRT1 xxx xxx.x xxx.xx xxx.xx xxx xxx Pooled TRT2 xxx xxx.x xxx.xx xxx.xx xxx xxx Cycle 1 Pooled TRT1 xxx xxx.x xxx.xx xxx.xx xxx xxx Pooled TRT2 xxx xxx.x xxx.xx xxx.xx xxx xxx Cycle 2 Pooled TRT1 xxx xxx.x xxx.xx xxx.xx xxx xxx Pooled TRT2 xxx xxx.x xxx.xx xxx.xx xxx xxx Page 3

  4. GENERAL TIPS RETRIEVING SAS DATA SETS VIEW option

  5. G E N E R A L T I P S VIEW OPTION A simple data step? How much time it may take to run this step if ADAM.ADLB is huge? data adlb; set adam.adlb; run; Page 5

  6. G E N E R A L T I P S VIEW OPTION 35 seconds? Too looooong L data adlb; set adam.adlb; run; Real time 34:23 CPU time 3:10 Page 6

  7. G E N E R A L T I P S VIEW OPTION A SAS VIEW is a type of SAS data set that retrieves data values from other files data adlb / VIEW=adlb; set adam.adlb; run; Page 7

  8. G E N E R A L T I P S VIEW OPTION Less than 1 second? MAGIC! data adlb / VIEW=adlb; set adam.adlb; run; Real time 0:03 CPU time 0:01 Page 8

  9. G E N E R A L T I P S VIEW OPTION Simple merge, common sort… How long? data analysis ; merge adam.adsl adam.adlb; by studyid usubjid; run ; proc sort data=analysis; by trt01an parcat paramcd avisitn; run; Page 9

  10. G E N E R A L T I P S VIEW OPTION Simple merge, common sort… Too looooong again data analysis ; merge adam.adsl adam.adlb; by studyid usubjid; run ; proc sort data=analysis; by trt01an parcat paramcd avisitn; run; Data Step PROC step Real time 36:04 1:12.40 CPU time 8:18 12.45 Page 10

  11. G E N E R A L T I P S VIEW OPTION Adding VIEW option to do magic data analysis / VIEW=analysis ; merge adam.adsl adam.adlb; by studyid usubjid; run ; proc sort data=analysis out=analysis_sorted; by trt01an parcat paramcd avisitn; run; Page 11

  12. G E N E R A L T I P S VIEW OPTION VIEW option gives extra time to drink coffee with colleagues data analysis / VIEW=analysis ; merge adam.adsl adam.adlb; by studyid usubjid; run ; proc sort data=analysis out=analysis_sorted; by trt01an parcat paramcd avisitn; run; Data Step PROC step Real time 0:04 1:17.65 CPU time 0:02 20.32 Page 12

  13. GENERAL TIPS RETRIEVING SAS DATA SETS IF or WHERE?

  14. G E N E R A L T I P S IF or WHERE? IF vs. WHERE. Who is the champion? data adlb ; set adam.adlb; if ANL01FL = 'Y'; run ; data adlb ; set adam.adlb; where ANL01FL = 'Y'; run ; Page 14

  15. G E N E R A L T I P S IF or WHERE? IF is champion! Woohoo! data adlb ; set adam.adlb; IF statement WHERE if ANL01FL = 'Y'; statement run ; Real time 31:64 33:31 data adlb ; CPU time 3:53 5:68 set adam.adlb; where ANL01FL = 'Y'; run ; Page 15

  16. G E N E R A L T I P S IF or WHERE? INDEX helps WHERE to win J data adlb ; set adam.adlb; if ANL01FL = 'Y'; run ; INDEX APPLIED data adlb ; set adam.adlb; where ANL01FL = 'Y'; run ; IF statement WHERE statement Real time 32:15 27:26 CPU time 4:28 2.98 Page 16

  17. SAS PROCEDURES DESCRIPTIVE STATS

  18. S A S P R O C E D U R E S DESCRIPTIVE STATS proc means data=adlb noprint; by trt01an parcat paramcd avisitn; var aval; output out = mnout n = n mean = mean median = median std = std min = min max = max; run; Page 18

  19. S A S P R O C E D U R E S DESCRIPTIVE STATS proc univariate data=adlb noprint; by trt01an parcat paramcd avisitn; var aval; output out = mnout n = n mean = mean median = median std = std min = min max = max; run; Page 19

  20. S A S P R O C E D U R E S DESCRIPTIVE STATS proc summary data=adlb noprint; by trt01an parcat paramcd avisitn; var aval; output out = mnout n = n mean = mean median = median std = std min = min max = max; run; Page 20

  21. S A S P R O C E D U R E S DESCRIPTIVE STATS proc sql noprint; create table mnout as select trt01an, parcat, paramcd, avisitn, COUNT(*) as n, MEAN(aval) as mean, MEDIAN(aval) as median, STD(aval) as std, MIN(aval) as min, MAX(aval) as max from adlb group by trt01an, parcat, paramcd, avisitn quit; Page 21

  22. S A S P R O C E D U R E S DESCRIPTIVE STATS DESCRIPTIVE STATS COMPARISON MEANS UNIVARIATE SUMMARY SQL Real time 15:14 24:78 13:24 13:45 CPU time 3:38 1:76 3:33 2:53 Page 22

  23. SAS PROCEDURES FREQUENCY

  24. S A S P R O C E D U R E S FREQUENCY proc freq data=adlb noprint; by trt01an parcat paramcd; tables avisitn / out=frout; run; Page 24

  25. S A S P R O C E D U R E S FREQUENCY proc summary data=adlb nway noprint; by trt01an parcat paramcd avisitn; output out=frout; run; Page 25

  26. S A S P R O C E D U R E S FREQUENCY proc sql noprint; create table frout as select trt01an, parcat, paramcd, avisitn, COUNT(*) as count from adlb group by trt01an, parcat, paramcd, avisitn quit; Page 26

  27. S A S P R O C E D U R E S FREQUENCY FREQUENCY COMPARISON FREQ SQL SUMMARY Real time 13:62 12:19 12:02 CPU time 2:04 1:63 0:88 Page 27

  28. CONCLUSIONS Do not be afraid to work with big data sets. Just choose the “right” procedure!

  29. T H A N K YO U Daniil Shliakhov daniil.shlyakhov@intego-group.com Kharkiv, Ukraine www.intego-group.com

Recommend


More recommend