Experimental design (continued) Spring 2017 Michelle Mazurek Some content adapted from Bilge Mutlu, Vibha Sazawal, Howard Seltman 1
Administrative • No class Tuesday • Homework 1 • Plug for Tanu Mitra grad student session 2
Today’s class • Finish threats to validity • Experimental design / choices • Alternatives to experiments 3
Quick review • Internal validity: causality – Isolate variable of interest – Randomized assignment • External validity – Representative sample – Representative environment/task/analysis • Valid constructs – Measure something meaningful – Reliable 4
Know what you’re measuring • Especially when dealing with large-scale data from the internet – What are you missing? What is duplicated? – What is the precision and accuracy of the data? – Are you capturing what you think you’re capturing? – *Vantage point* – Representativeness / diversity 5
Calibrating constructs • Examine outliers and spikes • Check for self-consistency • Compare multiple measures – Multiple datasets – Multiple ways of calculating a value • Test with synthetic data • Check longitudinal data periodically! 6
Mis-measurements, now what? • Discard? (Why might this be bad?) – Discard outliers? Definition? • Use an explicit adjustment? 7
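One possible answer to "Definition?" is Tukey's 1.5 × IQR rule. The sketch below flags suspicious values instead of silently discarding them. The function name `iqr_outliers`, the multiplier `k=1.5`, and the sample data are illustrative assumptions, not something the lecture prescribes.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return (low_fence, high_fence, flagged) using Tukey's IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    flagged = [v for v in values if v < low or v > high]
    return low, high, flagged

# Hypothetical task-completion times in seconds; 300 looks like a spike.
times = [31, 29, 35, 33, 30, 32, 34, 300]
low, high, flagged = iqr_outliers(times)
print(flagged)
```

Whatever rule is used, it should be fixed before looking at the results and reported alongside them.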
Other measurement notes • (Don’t really fit here, but from Paxson paper) • Metadata and good analysis logging is critical! • Be clear about unknowns and limitations 8
4. Power • Power: Likelihood that if there’s a real effect, you will find it. • Why might you not find it? – Sample size – Effect size – Missing explanatory variables – Variability 9
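To see how sample size and effect size drive power, here is a minimal Monte Carlo sketch. It assumes normally distributed data with a known standard deviation of 1 and a two-sided two-sample z-test. The function name `estimate_power` and all parameter values are illustrative assumptions.

```python
import math
import random
from statistics import NormalDist, fmean

def estimate_power(n_per_group, effect_size, trials=2000, alpha=0.05, seed=0):
    """Fraction of simulated experiments in which the z-test rejects the null."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se = math.sqrt(2 / n_per_group)  # standard error of the mean difference (sd = 1)
    hits = 0
    for _ in range(trials):
        control = [rng.gauss(0, 1) for _ in range(n_per_group)]
        treated = [rng.gauss(effect_size, 1) for _ in range(n_per_group)]
        z = (fmean(treated) - fmean(control)) / se
        if abs(z) > z_crit:
            hits += 1
    return hits / trials

power_small_n = estimate_power(10, 0.5)    # few participants, medium effect
power_large_n = estimate_power(100, 0.5)   # many participants, same effect
false_positive_rate = estimate_power(100, 0.0)  # no real effect at all
print(power_small_n, power_large_n, false_positive_rate)
```

With no real effect, the rejection rate should hover around alpha; with a real effect, it rises as the sample grows.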
Promote power • Covariates: Measure possible confounds, include in analysis • Use reliable measurements • Control the environment • Potential tradeoff: Generalizability for power – E.g., limit variability between subjects 10
EXPERIMENTAL DESIGN 11
Some important decisions • What is the hypothesis? • Between or within subjects? • What treatment levels / conditions? • What dependent variables to measure? 12
Good hypothesis design • Predicted relationship between (at least) 2 vars – Testable, falsifiable • Operational – Vars are clearly defined – Relationship / how you measure it clearly defined 13
Good hypothesis design (cont.) • Justified – Exploratory results – Theory in related area – Well justified intuition? • Parsimonious 14
Between vs. Within • Between: Each participant belongs to exactly one condition • Within: Each participant belongs to multiple 15
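For a between-subjects design, random assignment can be sketched as a shuffle followed by dealing participants out round-robin, which keeps group sizes within one of each other. The function name `assign_between`, the participant IDs, and the condition labels are hypothetical.

```python
import random

def assign_between(participants, conditions, seed=None):
    """Randomly assign each participant to exactly one condition."""
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)  # the random order decides who lands where
    return {p: conditions[i % len(conditions)] for i, p in enumerate(shuffled)}

participants = [f"P{i:02d}" for i in range(1, 11)]   # hypothetical IDs
assignment = assign_between(participants, ["control", "treatment"], seed=42)
group_sizes = {c: list(assignment.values()).count(c) for c in ["control", "treatment"]}
print(group_sizes)
```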
Between vs. Within • Between: More participants; cleaner/less bias • Within: More time each; more power (less subject-to-subject variability) 16
Improving on between-subjects • Matching: Get like participants for each condition • Pro: reduces variability • Con: Hard to find; what do you match on? • In general, be very cautious 17
Improving on within-subjects • Ordering effects can be HUGE – Learning, fatigue – Range effects: learn most for closest conditions • Mitigate via counterbalancing – All possible orders – Balanced Latin square (one row per order): A B C D / C A D B / B D A C / D C B A 18
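A balanced Latin square can be generated instead of written by hand. The sketch below uses a common construction for an even number of conditions (first row 0, 1, n-1, 2, n-2, ..., then cyclic shifts). It yields a different square from the one listed on the slide, but with the same properties: every condition appears once in each position, and every ordered pair of adjacent conditions appears exactly once. The function name is an assumption.

```python
def balanced_latin_square(conditions):
    """Return one presentation order per row for an even number of conditions."""
    n = len(conditions)
    if n % 2 != 0:
        raise ValueError("this construction needs an even number of conditions")
    # First row of indices: 0, 1, n-1, 2, n-2, ...
    first = [0]
    lo, hi = 1, n - 1
    while len(first) < n:
        first.append(lo)
        lo += 1
        if len(first) < n:
            first.append(hi)
            hi -= 1
    # Each later row shifts every index by one more (mod n).
    return [[conditions[(idx + shift) % n] for idx in first] for shift in range(n)]

square = balanced_latin_square(["A", "B", "C", "D"])
for row in square:
    print(" ".join(row))
```

Participants are then assigned to rows in rotation, so each order is used equally often.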
Counterbalancing doesn’t fix: • Range effects (most average treatment) • Context effects (what most participants are more familiar with) 19
Mixed models are also possible • Everyone gets the same three tasks • Order of tasks varies • Tool with which to execute tasks varies 20
Selecting conditions • How many IVs? – Password meter example • How many / which levels for each? – Cannot infer anything about levels you didn’t test 21
Full-factorial (or not) • Full-factorial: All possible combinations of all IVs – And all orderings? • Not: Only a subset – Selected how? – Recall: Vary at most one thing each time! • Planned comparisons! 22
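Enumerating conditions makes the cost of a full-factorial design concrete. The sketch below lists every combination of two IVs, then one possible subset: a baseline plus the conditions that differ from it in exactly one IV. The IV names and levels are hypothetical, loosely modeled on the password meter example.

```python
from itertools import product

# Hypothetical independent variables and levels.
ivs = {
    "meter": ["none", "bar", "text"],
    "policy": ["basic8", "comp16"],
}

# Full-factorial: every combination of every level.
full_factorial = [dict(zip(ivs, combo)) for combo in product(*ivs.values())]

# A subset: keep conditions that differ from the baseline in at most one IV.
baseline = {name: levels[0] for name, levels in ivs.items()}
one_at_a_time = [
    cond for cond in full_factorial
    if sum(cond[name] != baseline[name] for name in ivs) <= 1
]

print(len(full_factorial), len(one_at_a_time))
```

The subset is cheaper to run, but it cannot reveal how the two IVs behave in combination.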
Why multivariate? • What is different between running one experiment with two IVs vs. two experiments with one IV each? • Interaction effects! 23
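In a 2×2 design, an interaction can be read off the cell means as a difference of differences: does the effect of one IV change depending on the level of the other? The sketch below computes that contrast. The function name and the cell means are made-up illustrations, and a real analysis would also test whether the contrast is statistically significant.

```python
def interaction_contrast(cell_means):
    """cell_means[(a, b)] with a, b in {0, 1}; returns the difference of differences."""
    effect_of_a_when_b0 = cell_means[(1, 0)] - cell_means[(0, 0)]
    effect_of_a_when_b1 = cell_means[(1, 1)] - cell_means[(0, 1)]
    return effect_of_a_when_b1 - effect_of_a_when_b0

# Hypothetical mean scores. Additive case: A adds 2 regardless of B.
additive = {(0, 0): 10, (1, 0): 12, (0, 1): 15, (1, 1): 17}
# Interacting case: A adds 2 when B = 0 but 10 when B = 1.
interacting = {(0, 0): 10, (1, 0): 12, (0, 1): 15, (1, 1): 25}

print(interaction_contrast(additive), interaction_contrast(interacting))
```

Two separate one-IV experiments would only ever observe one row or one column of this table, so the contrast could not be computed.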
Dependent variables • What and how to measure? – Construct validity, again! – Performance (time, errors, FP/FN, etc.) – Opinions/attitude – Audio recording, screen capture, keystrokes, copy-pasting behavior, etc. – Demographics • Multiple measures toward higher-level construct? 24
NOT JUST EXPERIMENTS 25
Kinds of measurement studies • Experimental • Observational/correlational • Quasi-experimental 26
Observational/correlational • Observe that X and Y (don’t) increase and decrease together / in opposition • Researcher doesn’t apply any control or treatment: just measures incidence – Does lead exposure correlate with crime rate? • Directionality and third-variable problems are both issues 27
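The third-variable problem is easy to reproduce with synthetic data. In the sketch below, a hidden variable `z` drives both `x` and `y`, so they correlate strongly even though neither causes the other. Once `z` is subtracted out, the correlation vanishes. All names and values are illustrative, and subtracting `z` directly only works because we generated the data and know its exact contribution.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

rng = random.Random(1)
z = [rng.gauss(0, 1) for _ in range(2000)]    # hidden third variable
x = [zi + rng.gauss(0, 0.5) for zi in z]      # x depends only on z
y = [zi + rng.gauss(0, 0.5) for zi in z]      # y depends only on z

r_xy = pearson(x, y)
r_resid = pearson([xi - zi for xi, zi in zip(x, z)],
                  [yi - zi for yi, zi in zip(y, z)])
print(round(r_xy, 2), round(r_resid, 2))
```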
Quasi-experiments • Subset of observational studies • Can’t randomize assignment • But, experimenter controls something • [Diagram: Group 1, Group 2, Treatment] 28
Observational examples • Cohort study • Regression discontinuity • BIBIFI example 29
Pluses and minuses • Can measure things that simply can’t be done with true experiments • In general, association at best – causality very hard to establish – Some statistical techniques to help exist • Low internal validity – can you maximize it within the available constraints? 30