Data Processing Gymnastics Get your Data into the Format you Need David Lawrence Department of Social Security
Data Repackaging Tall and Thin to Short and Fat 26 October 2017
The Requirement Random Sample Survey (RSS) Result Data • Supplied in several datasets – some with multiple records per respondent • Want to convert it to a single record per respondent The RSS is used by DSS as the primary Assurance Tool – a sample of recipients has their entitlement reassessed and from this the level of Payment Accuracy can be estimated Data Repackaging 3
The Problem – Repeating Groups • Various Facets of the Survey are divided into separate datasets • Where the respondent could have multiple occurrences of a given result (eg payment components or debts) these are stored as individual records in the appropriate dataset • These individual records (naturally) have the same structure • The maximum number of these records per subject is not known Data Repackaging 4
The Old Way • PROC TRANSPOSE the data for each of the variables in the repeating group • Merge the transposed data steps by the classification variables Drawbacks • Repetitive Code • Maximum number of Repeating Groups not known. This must be determined before the transpose begins Data Repackaging 5
Solution – Step 1 Count the Repeating Groups /* Counts the Number of Observations per ID Number */ data CRNCUST( keep = CRN obscnt ); retain obscnt; set DS3Extrct; by crn; if first.crn then obscnt = 1 ; else obscnt + 1 ; if last.crn then output; run ; /* Extracts the Maximum Number of Observations */ proc means data = CRNCUST; var obscnt; output out = DEBTstatscust max = maxobs; run ; /* sets maximum macro */ data _null_; set DEBTstatscust; call symput( 'cmax', trim( left( maxobs ))); run ; Data Repackaging 6
Step 2 – Generalised Macro for Array Creation / * Creates macro for arrays */ %macro matrixgen( arrayname, startel, endel, type ); array &arrayname {&cmax} &type &startel - &endel; retain &startel - &endel; %mend matrixgen ; • Note that number of array elements is set to &cmax – the max number of observations macro variable created in Step 1 above. • Type allows the array to be declared as numeric or character • The required retain statements to keep are included in the macro definition Data Repackaging 7
Step 3a Create and Initialise the Arrays data DS3Repack; set DS3Extrct; by crn; % matrixcdebt ( IdNo, DebtId1, DebtId&max, ); % matrixcdebt ( BenT, BenType1, BenType&cmax, $ ); % matrixcdebt ( Amount, DebtAMT1, DebtAMT&cmax, ); % matrixcdebt ( SYear, SurvYear1, SurvYear&cmax, $ ); % matrixcdebt ( SvPRD, Surv_Period1, Surv_Period&cmax, $ ); retain crncnt; if first.crn then do; crncnt = 1 ; do x = 1 to &cmax; IdNo(x) = . ; Amount(x) = 0 ; BenT(x) = ‘’ SYear(x) = ''; SvPRD(x) = ''; end; end; Data Repackaging 8
Step 3b Populate the Arrays and Output IdNo(crncnt) = DebtId; Amount(crncnt) = DebtAMT; BenT(crncnt) = BenType; SYear(crncnt) = SurvYear; SvPRD(crncnt) = Surv_Period; crncnt + 1 ; DROP DebtId DebtAMT BenType SurvYear Surv_Period crncnt x; if last.crn then do; NoDebts = crncnt – 1; output; end; run ; Data Repackaging 9
Date Processing for Time Series Extraction Minimising errors for multiple date parameters 26 October 2017
The Requirement • Extracting and Processing Data for multiple points in time from a source with dates supplied in several formats • DSS obtains data from the Department of Human Services Enterprise Data Warehouse • The platform is accessed through SAS EG but the raw data is accessed via a SAS EG connection to Teradata Data Repackaging 11
The Problem – Multiple Date Formats Dates are stored on the EDW either as Teradata Dates • Extract Date and Logical Delete Date Or as a number (yyyymmdd) – eg 20 th June 2017 = 20,170,620 Extraction from one EDW Source is via a SAS/SQL Macro that requires SAS Dates as parameters Once data is extracted processing in SAS uses SAS Date format Some tasks require extractions from multiple points in time to build Time Series or comparison tables . Data Repackaging 12
The Old Way • Hard Coding or Inputting all the relevant dates and date formats • Numbering extracts from multiple files and using a Proc Format to assign the appropriate dates Drawbacks • The more input parameters required – the greater the chance of an incomplete update. • The incorrect result may not be easily noticed Data Repackaging 13
Solution – Declare the Minimum Number of Dates Ideally declare one or two dates Do so in Year and Month Components Let SAS create the rest %let Report_Year = 2017; %let Report_Month = 6; Data Repackaging 14
Date Creation Processing Tools Create a format to determine the last day of the month proc format; value mntdfmt 1, 3, 5, 7, 8, 10, 12 = 31 4, 6, 9, 11 = 30 2 = 28; run; Data Repackaging 15
Creating the Date Variables (Basic) Further manipulation of the inputs can create start dates and title text data dateprep; Report_endD = put( &Report_Month, mntdfmt. ); if &Report_Month = 2 and mod( &Report_Year, 4 ) = 0 and mod( &Report_Year, 200 ) ne 0 then Report_endD = 29; ReportEndDt = mdy( &Report_Month, Report_endD, &Report_Year); DateNum = ( &Report_Year * 10000 ) + ( &Report_Month * 100 ) + Report_endD; DateText = put( DateNum, $8. ); TD_Date = “date ‘” ||substr( DateText , 1, 4 )||’ - ’|| substr( DateText, 5, 2 )|| ’ - ’ || substr( DateText , 7, 2 )||”’”; call symput( ‘ SasDate ’ , trim( left( ReportEndDt ))); call symput( ‘ TeradataDate ’ , trim( left( TD_Date ))); call symput( ‘ NumericDate ’ , trim( left( DateNum ))); run ; Data Repackaging 16
Multiple Points in Time for a Time Series Basic Parameters %let srsstart = 2015; %let srsend = 2017; %let frstmnt = 6; %let lstmnt = 8; Data Repackaging 17
Multiple Points in Time for a Time Series %macro testdtc ; %do x = &srsstart %to &srsend; %if &x = &srsstart %then %do; %let loopstart = &frstmnt; %end; %else %do; %let loopstart = 1; %end; %if &x = &srsend %then %do; %let loopend = &lstmnt; %end; %else %do; %let loopend = 12; %end; Data Repackaging 18
Multiple Points in Time for a Time Series %macro testdtc ; %do x = &srsstart %to &srsend; %do y = &loopstart %to &loopend; %if &x = &srsstart %then %do; data dateprep&x&y; %let loopstart = &frstmnt; fildy = input( put( &y, fildfmt. ), 2. ); %end; if &y = 2 and mod( &x, 4 ) = 0 and mod( &x, 200 ) ne 0 then fildy = '29'; %else %do; fildtd = mdy( &y, fildy, &x ); %let loopstart = 1; fildtn = ( &x * 10000 ) + ( &y * 100 ) + fildy; %end; fldttxt = put( fildtn, $8. ); tdatadttxt = "date '"||substr( fldttxt, 1 , 4 )||'-'||substr( fldttxt, 5 , 2 )||'-'||substr( fldttxt, 7 , 2 )||"'"; %if &x = &srsend %then %do; if &y <= 6 then do; %let loopend = &lstmnt; year1 = &x; %end; year2 = &x- 1 ; %else %do; end; %let loopend = 12; else do; %end; year1 = &x + 1 ; year2 = &x; end; curFYr = trim( left( put( year2, $4. )))||'/'||trim( left( substr( put( year1, $4. ), 3 , 2 ))); run; Data Repackaging 19
Multiple Points in Time for a Time Series %macro testdtc ; %do x = &srsstart %to &srsend; %do y = &loopstart %to &loopend; %if &x = &srsstart %then %do; data dateprep&x&y; %let loopstart = &frstmnt; fildy = input( put( &y, fildfmt. ), 2. ); %end; if &y = 2 and mod( &x, 4 ) = 0 and mod( &x, 200 ) ne 0 then fildy = '29'; %else %do; fildtd = mdy( &y, fildy, &x ); %let loopstart = 1; fildtn = ( &x * 10000 ) + ( &y * 100 ) + fildy; %end; fldttxt = put( fildtn, $8. ); tdatadttxt = "date '"||substr( fldttxt, 1 , 4 )||'-'||substr( fldttxt, 5 , 2 )||'-'||substr( fldttxt, 7 , 2 )||"'"; %if &x = &srsend %then %do; if &y <= 6 then do; %let loopend = &lstmnt; year1 = &x; %end; year2 = &x- 1 ; %else %do; end; %let loopend = 12; else do; %end; year1 = &x + 1 ; proc print data=dateprep&x&y; run; year2 = &x; end; curFYr = trim( left( put( year2, $4. )))||'/'||trim( left( substr( put( year1, $4. ), 3 , 2 ))); %end; run; %end; %mend ; Data Repackaging 20
Multiple Points in Time for a Time Series %macro testdtc ; %do x = &srsstart %to &srsend; %if &x = &srsstart %then %do; %let loopstart = &frstmnt; %end; %else %do; %let loopstart = 1; %end; %if &x = &srsend %then %do; %let loopend = &lstmnt; %end; %else %do; %let loopend = 12; %end; %do y = &loopstart %to &loopend; data dateprep&x&y; fildy = input( put( &y, fildfmt. ), 2. ); if &y = 2 and mod( &x, 4 ) = 0 and mod( &x, 200 ) ne 0 then fildy = '29'; fildtd = mdy( &y, fildy, &x ); fildtn = ( &x * 10000 ) + ( &y * 100 ) + fildy; fldttxt = put( fildtn, $8. ); tdatadttxt = "date '"||substr( fldttxt, 1 , 4 )||'-'||substr( fldttxt, 5 , 2 )||'-'||substr( fldttxt, 7 , 2 )||"'"; if &y <= 6 then do; year1 = &x; year2 = &x- 1 ; end; else do; year1 = &x + 1 ; year2 = &x; end; curFYr = trim( left( put( year2, $4. )))||'/'||trim( left( substr( put( year1, $4. ), 3 , 2 ))); run; proc print data=dateprep&x&y; run; %end; %end; %mend ; Data Repackaging 21
Recommend
More recommend