Analysis of variance and regression November 13, 2007
SAS language
• The SAS environments
• Reading in, data step
• Summary statistics
• Subsetting data
• More on reading in, missing values
• Combination of data sets
Lene Theil Skovgaard, Dept. of Biostatistics, Institute of Public Health, University of Copenhagen e-mail: L.T.Skovgaard@biostat.ku.dk http://staff.pubhealth.ku.dk/~lts/regression07_2
SAS language, November 2007

SAS exercises on this course
• Two teachers to help you
• Private user names and passwords!
• Two students share each machine
• Many of you know SAS ANALYST from the course on basic statistics, but here we focus on the SAS language
• References:
  – Aa. T. Andersen, T.V. Bedsted, M. Feilberg, R.B. Jakobsen and A. Milhøj: Elementær indføring i SAS. Akademisk Forlag, 2002 (in Danish).
  – Aa. T. Andersen, M. Feilberg, R.B. Jakobsen and A. Milhøj: Statistik med SAS. Akademisk Forlag, 2002 (in Danish).
  – R.P. Cody and J.K. Smith: Applied Statistics and the SAS Programming Language. 4th ed., Prentice Hall, 1997.

Menus vs. Language
• Menus:
  + No learning by heart
  + No syntax errors
  + Stepwise learning
  − Inflexible
  − A bit hard to find your way around
  − Do not contain everything
  − Tedious in the long run

Menus vs. Language
• Language:
  − Some learning by heart
  − Many syntax errors in the beginning
  + Logically coherent
  + Reproducible
  + Easier to document
  + Easier to communicate

Basic structure
• SAS Core
  – Database system ("Engine")
  – Programming language
• SAS Base
  – Data manipulation: DATA, SORT, PRINT, (PLOT)
  – Minimal statistics: MEANS, UNIVARIATE, TABULATE
• Special modules
  – SAS/STAT: TTEST, GLM, GENMOD, etc.
  – SAS/GRAPH: GPLOT
  – SAS/ASSIST, QC, ETS, FSP, IML, ...
  – SAS ANALYST
  – SAS Enterprise

SAS in a nutshell
    (Raw data, Data file) + Program → Log + (Data file, Output)
• Batch SAS:
  – *.sas Program file
  – *.log Log file
  – *.lst Output file

• SAS Display Manager: environment for program development and data handling
  – Program editor: common or enhanced
  – Output window
  – Log window
  – Graphics window
  – Explorer, Viewtable, Toolbar, Results
Note: Program code must be saved

Example
O'Neill et al. (1983): Lung function for 25 patients with cystic fibrosis.

Some of these data may be found in the text file T:\pemax.txt (created using e.g. Wordpad):

    age sex height weight fev1 pemax
      7   1    109   13.1   32    95
      7   2    112   12.9   19    85
      8   1    124   14.1   22   100
      8   2    125   16.2   41    85
      8   1    127   21.5   52    95
      9   1    130   17.5   44    80
     11   2    139   30.7   28    65
     12   2    150   28.4   18   110
     12   1    146   25.1   24    70
     13   2    155   31.5   23    95
     13   1    156   39.9   39   110
     14   2    153   42.1   26    90
     14   1    160   45.6   45   100
     15   2    158   51.2   45    80
     16   2    160   35.9   31   134
     17   2    153   34.8   29   134
     17   1    174   44.7   49   165
     17   2    176   60.1   29   120
     17   1    171   42.6   38   130
     19   2    156   37.2   21    85
     19   1    174   54.6   37    85
     20   1    178   64.0   34   160
     23   1    180   73.8   57   165
     23   1    175   51.1   33    95
     23   1    179   71.5   52   195

Reading in data (more later on...)

    data sasuser.pemax;
    infile 'T:\pemax.txt' firstobs=2;
    input age sex height weight fev1 pemax;
    run;

To execute the program, we click on the 'running man', and then we look at the log file:

    NOTE: 25 records were read from the infile 'pemax.txt'.
          The minimum record length was 21.
          The maximum record length was 21.
    NOTE: The data set SASUSER.PEMAX has 25 observations and 6 variables.
    NOTE: DATA statement used:
          real time 0.11 seconds
          cpu time 0.01 seconds

No output
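
When the text file is not at hand, the same kind of data step can be tried with a few rows typed directly into the program. This is only a sketch: the data set name work.pemax_demo is hypothetical, and just the first three rows of the data are used.

```sas
/* Sketch: read three rows inline with DATALINES instead of from T:\pemax.txt. */
/* work.pemax_demo is a hypothetical name used for illustration only.         */
data work.pemax_demo;
  input age sex height weight fev1 pemax;
  datalines;
7 1 109 13.1 32 95
7 2 112 12.9 19 85
8 1 124 14.1 22 100
;
run;

/* List the variables and their types to check what was read */
proc contents data=work.pemax_demo;
run;

/* Print the observations */
proc print data=work.pemax_demo;
run;
```

As with the file-based version, the log is the place to check that the number of observations and variables is as expected (here 3 observations and 6 variables).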

What if it did not work as intended?
1. Find out why!
2. Correct
3. Try again
SAS is executed sequentially. If we want to add something, we can just do it later.

Recall commands
• When a piece of program has been executed, it may sometimes disappear from the program editor
• Earlier pieces may be recovered using F4
• Note that the pieces accumulate: if you use F4 several times, you will get the previous pieces successively, one after another

Definition of new variables, transformation
We want to study body mass index, bmi:

    data sasuser.pemax;
    infile 'T:\pemax.txt' firstobs=2;
    input age sex height weight fev1 pemax;
    bmi=weight/(height/100)**2;
    run;

    proc print data=sasuser.pemax;
    run;

    Obs  age  sex  height  weight  fev1  pemax      bmi
      1    7    1     109    13.1    32     95  11.0260
      2    7    2     112    12.9    19     85  10.2838
      3    8    1     124    14.1    22    100   9.1701

Transformations
• Arithmetic
  – The usual operators: + - * /
  – Raising to a power: **, e.g. x**2
  – Square root: sqrt(x)
  – Logarithms: log(x), log10(x), log2(x)
    All logarithms are proportional: log2(x) = log(x)/log(2)
• Relations:
    =   <   >   <=  >=  ^= (not equal)
    eq  lt  gt  le  ge  ne (alternative notation)
  Note: <> means "not equal" only in WHERE expressions; in DATA step expressions it is the MAX operator, so use ne or ^= there.
• Logical operators: and or not
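
The operators and functions listed above can be tried out in a small data step without any input file. This is a sketch only: the data set name work.transform_demo and the variable names are hypothetical.

```sas
/* Sketch: a data step with no input creates a single observation. */
/* All names here are hypothetical and for illustration only.      */
data work.transform_demo;
  x       = 8;
  square  = x**2;            /* raising to a power                  */
  root    = sqrt(x);         /* square root                         */
  ln_x    = log(x);          /* natural logarithm                   */
  log2_x  = log2(x);         /* base-2 logarithm                    */
  ratio   = log(x)/log(2);   /* should agree with log2_x            */
  is_big  = (x ge 5);        /* a true comparison gives 1, false 0  */
  differs = (x ne 3);        /* "not equal" written with ne         */
run;

proc print data=work.transform_demo;
run;
```

Comparing the columns log2_x and ratio in the printout illustrates that logarithms to different bases differ only by a constant factor.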

Other types of variable definitions

    data sasuser.pemax;
    infile 'T:\pemax.txt' firstobs=2;
    input age sex height weight fev1 pemax;
    bmi=weight/(height/100)**2;  /* as defined earlier; needed for fat below */
    length csex $ 6;  /* in order to avoid truncation: otherwise csex gets length 4 from 'male' */
    if sex=1 then csex='male';
    if sex=2 then csex='female';
    fat=(bmi>18);
    run;

    proc print data=sasuser.pemax;
    var csex age bmi fat;
    run;

    Obs  csex    age      bmi  fat
      1  female    7  11.0260    0
      2  male      7  10.2838    0
      .  .         .        .    .
      .  .         .        .    .
     14  male     15  20.5095    1
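
One point worth checking when an indicator such as fat is defined through a comparison: SAS treats a missing numeric value as smaller than any number, so a comparison like (bmi>18) gives 0, not missing, when bmi is missing. The sketch below, with hypothetical data set and variable names, shows one way to keep the indicator missing in that case.

```sas
/* Sketch: indicator variables and missing values.           */
/* work.fat_demo, fat_naive and fat are hypothetical names.  */
data work.fat_demo;
  input bmi;
  fat_naive = (bmi > 18);              /* 0 also when bmi is missing         */
  if bmi ne . then fat = (bmi > 18);   /* left missing when bmi is missing   */
  datalines;
11.0
20.5
.
;
run;

proc print data=work.fat_demo;
run;
```

In the third observation fat_naive is 0 while fat stays missing, which is usually the safer behaviour for later summaries.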

Ingredients of a DATA step
• Specification line (name of new data set)
• Data source (here: read from file)
• Variables to read in
• Possible calculations
• Possible redefinitions
• To be concluded with run;

Variables
• The columns in a data set
• May be numerical variables (contain numbers)
• or character variables (contain text strings, letters)
• Values of a character variable are enclosed in quotation marks, e.g. 'male' (except in data files)
• A period (.) denotes a missing value for a numerical variable
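
The two variable types and the period for missing values can be seen together in a small inline example. This is a sketch with made-up values: the data set work.vars_demo and the variables id, name and weight are hypothetical.

```sas
/* Sketch: one character variable (name) and one missing numeric value. */
/* All names and values are hypothetical.                               */
data work.vars_demo;
  input id name $ weight;   /* the $ marks name as a character variable */
  datalines;
1 Anna 13.1
2 Bo .
;
run;

proc print data=work.vars_demo;
run;

/* Count non-missing and missing values of weight */
proc means data=work.vars_demo n nmiss;
  var weight;
run;
```

Note that the text values appear without quotation marks in the data lines, whereas they would need quotation marks inside program statements such as if name='Bo' then ...;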

Variable names
• SAS does not care about upper/lower case (SEX, sex and Sex refer to the same variable)
• Names may be up to 32 characters long (previously only 8)
• Names may contain English letters, digits and underscore (_)
• but they are not allowed to start with a digit

Calculation of summary statistics in SAS

    proc means data=sasuser.pemax;
    run;

    The MEANS Procedure

    Variable    N          Mean       Std Dev       Minimum       Maximum
    -------------------------------------------------------------------------
    age        25    14.4800000     5.0589854     7.0000000    23.0000000
    sex        25     1.4400000     0.5066228     1.0000000     2.0000000
    fev1       25    34.7200000    11.1971723    18.0000000    57.0000000
    pemax      25   109.1200000    33.4369058    65.0000000   195.0000000
    bmi        25    15.3422331     3.8633242     9.1701353    22.7777778
    -------------------------------------------------------------------------

These statistics are the default; others may be chosen as options

From Help pages:

    /* Some of the keywords available with PROC MEANS:
       N        - number of observations
       MEAN     - mean value
       MIN      - minimum value
       MAX      - maximum value
       SUM      - total of values
       NMISS    - number of missing values
       MAXDEC=n - set maximum number of decimal places
    */

statistic-keyword(s)
specifies which statistics to compute and the order to display them in the output. The available keywords in the PROC statement are

    Descriptive statistic keywords
      CLM             RANGE
      CSS             SKEWNESS|SKEW
      CV              STDDEV|STD
      KURTOSIS|KURT   STDERR
      LCLM            SUM
      MAX             SUMWGT
      MEAN            UCLM
      MIN             USS
      N               VAR
      NMISS

    Quantile statistic keywords
      MEDIAN|P50      Q3|P75
      P1              P90
      P5              P95
      P10             P99
      Q1|P25          QRANGE

    Hypothesis testing keywords
      PROBT           T
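
As an illustration of how these keywords are combined, the following sketch asks for a chosen set of statistics and limits the number of decimals. It assumes that the data set sasuser.pemax with the variable bmi has already been created as on the earlier slides.

```sas
/* Sketch: selected statistics with at most two decimals.          */
/* Assumes sasuser.pemax (including bmi) exists from earlier steps. */
proc means data=sasuser.pemax n nmiss mean std min max maxdec=2;
  var age bmi fev1 pemax;
run;
```

The statistics appear in the output in the order in which the keywords are written in the PROC statement.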

If we want to see the medians:

    proc means data=sasuser.pemax median;
    var age bmi fev1;
    run;

    The MEANS Procedure

    Variable        Median
    ------------------------
    age         14.0000000
    bmi         14.8660771
    fev1        33.0000000
    ------------------------

Oops: now we get only the median! Specifying a statistic keyword replaces the default statistics.

    proc means data=sasuser.pemax N mean median;
    var age bmi fev1;
    run;

    The MEANS Procedure

    Variable    N          Mean        Median
    ----------------------------------------------
    age        25    14.4800000    14.0000000
    bmi        25    15.3422331    14.8660771
    fev1       25    34.7200000    33.0000000
    ----------------------------------------------

Sorting the data
• Often used because other procedures demand this
• Example:

    proc sort data=sasuser.pemax out=sorted_pemax;
    by sex descending weight;
    run;

• If out=xxx is omitted, the original data set will be replaced by the sorted data.
• Note the option DESCENDING in front of weight
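
To check that the sort did what was intended, one can print the first few observations of the sorted data set. A minimal sketch, assuming sasuser.pemax exists as created on the earlier slides:

```sas
/* Sketch: sort into a new data set and inspect the first rows. */
/* Assumes sasuser.pemax exists from the earlier data step.     */
proc sort data=sasuser.pemax out=sorted_pemax;
  by sex descending weight;
run;

/* obs=5 restricts the printout to the first five observations */
proc print data=sorted_pemax(obs=5);
  var sex weight age;
run;
```

Within each value of sex, the weights should appear in decreasing order.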

BY statement
• May be found in many procedures (MEANS, REG, GLM, ...)
• Performs the analyses within each group separately
• Demands sorted data

    proc sort data=sasuser.pemax;
    by sex;
    run;

    proc means data=sasuser.pemax;
    where sex ne .;
    by sex;
    run;

• Remember to exclude observations with a missing value of the BY variable (here with the WHERE statement), otherwise they will form a separate group
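
For PROC MEANS specifically, a CLASS statement is an alternative to BY that does not require the data to be sorted. This is offered only as a sketch of that alternative, not as part of the original slides, and it assumes that sasuser.pemax (including bmi) exists as created earlier.

```sas
/* Sketch: group-wise summary statistics without sorting first.    */
/* Assumes sasuser.pemax (including bmi) exists from earlier steps. */
proc means data=sasuser.pemax n mean std;
  class sex;
  var age bmi fev1 pemax;
run;
```

By default, observations with a missing value of the CLASS variable are left out of the analysis, so no WHERE statement is needed here. The results are shown in one table rather than in a separate table per group.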