Experimental Design & Evaluation 11. Controlled Experiment


  1. Experimental Design & Evaluation 11. Controlled Experiment
     Sunyoung Kim, PhD

  2. Today’s agenda
     • Hypothesis testing
     • Threats
     • Experimental Biases

  3. Hypothesis Testing

  4. Hypothesis testing
     How do we “prove” a hypothesis in science? In most cases, it is impossible to prove the hypothesis directly. Instead, we try to disprove the null hypothesis.
     • It is easier to disprove things, by counter-example
     • First, we suppose the null hypothesis is true (null hypothesis = opposite of hypothesis)
     • Then a conflicting result is found
       o Disprove the null hypothesis
     • Hence, the hypothesis is “proved” (more precisely, supported by the evidence)

  5. Hypothesis testing
     1. Perform statistical analysis
     2. Draw conclusion
     3. Communicate results

  6. T-test
     1. Assume that the true means of the two populations are not different: Null Hypothesis (H0)
     2. Compute the means of the two samples
     3. Compute the difference between the two sample means
     4. Compute the chance of observing at least this much difference if H0 were true: P-value
     5. If the chance is low (P < 0.05), the observation seems to contradict the assumption
     6. Thus, the assumption is unlikely to be true
     7. Thus, we conclude that the true means are different: Alternative Hypothesis (H1)
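
The seven steps above can be sketched in Python using only the standard library. This is a minimal illustration under stated assumptions, not the course's own tooling: it assumes the pooled-variance (Student's) form of the two-sample t-test, the function names `pooled_t_test` and `two_sided_p` are made up for this sketch, the two samples are hypothetical, and the P-value is approximated by numerically integrating the t density with Simpson's rule.

```python
import math
from statistics import mean, variance


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    c = math.exp(log_c) / math.sqrt(df * math.pi)
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t, df, steps=2000):
    """Approximate P(|T| >= |t|) with Simpson's rule (steps must be even)."""
    b = abs(t)
    if b == 0:
        return 1.0
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_0_to_t = total * h / 3
    return max(0.0, 1 - 2 * area_0_to_t)


def pooled_t_test(x, y):
    """Steps 2-4: sample means, their difference, and the P-value under H0."""
    n1, n2 = len(x), len(y)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / df
    std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (mean(x) - mean(y)) / std_err
    return t, df, two_sided_p(t, df)


# Hypothetical samples, for illustration only (not data from the slides).
group_a = [12.1, 11.8, 12.6, 12.0, 11.5]
group_b = [11.2, 11.0, 11.9, 10.8, 11.4]

t, df, p = pooled_t_test(group_a, group_b)
print(f"t({df}) = {t:.3f}, P = {p:.3f}")
print("reject H0" if p < 0.05 else "fail to reject H0")  # steps 5-7
```

As a rough check, `two_sided_p(2.827, 16)` comes out close to the P = 0.012 reported in the Kindle vs. iPad example.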

  7. Example) Kindle vs. iPad
     Hypothesis: College students type faster using iPad’s keyboard than using Kindle’s keyboard.
     • Independent variable: Device (iPad or Kindle)
     • Dependent variable: Typing speed
     • Confounding variable: Prior technology experience

  8. Subject  Kindle Time (s)  iPad Time (s)
     1        43               34
     2        33
     3        43               36
     4        35               31
     5        36               41
     6        39               39
     7        42               5
     8        43               29
     9        41               30
     10       39               41
     “College students type faster using iPad than Kindle, t(16) = 2.827, P = 0.012.”

  9. Example) Drinking Water
     Trace metals in drinking water affect the flavor, and an unusually high concentration can pose a health hazard. Ten pairs of data were taken measuring zinc concentration in bottom water and surface water.
     • Independent variable:
     • Dependent variable:

  10. Pair  Bottom  Surface
      1     0.43    0.415
      2     0.266   0.238
      3     0.567   0.39
      4     0.531   0.41
      5     0.707   0.605
      6     0.716   0.609
      7     0.651   0.632
      8     0.589   0.523
      9     0.469   0.411
      10    0.723   0.612
      “There is no significant difference in the concentration of zinc at the bottom and the surface of the water, t(18) = 1.309, P = 0.207.”
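
The slide reports an independent-samples result with 18 degrees of freedom. Because the measurements are described as ten pairs, a paired t-test on the per-pair differences is another analysis one might consider. The sketch below is an illustration rather than the course's own analysis: the function names `independent_t` and `paired_t` are made up here, and it assumes the flattened table reads row by row as (bottom, surface). It computes both statistics from the values on the slide so they can be compared.

```python
import math
from statistics import mean, stdev, variance

# Zinc concentrations from the slide, one entry per pair.
bottom = [0.43, 0.266, 0.567, 0.531, 0.707, 0.716, 0.651, 0.589, 0.469, 0.723]
surface = [0.415, 0.238, 0.39, 0.41, 0.605, 0.609, 0.632, 0.523, 0.411, 0.612]


def independent_t(x, y):
    """Pooled-variance two-sample t statistic; treats the samples as unrelated."""
    n1, n2 = len(x), len(y)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / df
    return (mean(x) - mean(y)) / math.sqrt(pooled_var * (1 / n1 + 1 / n2)), df


def paired_t(x, y):
    """Paired t statistic; works on the within-pair differences."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n)), n - 1


t_ind, df_ind = independent_t(bottom, surface)
t_pair, df_pair = paired_t(bottom, surface)
print(f"independent: t({df_ind}) = {t_ind:.3f}")
print(f"paired:      t({df_pair}) = {t_pair:.3f}")
```

The two statistics can differ substantially, because pairing removes the variation between pairs from the error term.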

  11. Example) Caffeine and Metabolism
      A study of the effect of caffeine on muscle metabolism used eighteen male volunteers who each underwent arm exercise tests. Nine of the men were randomly selected to take a capsule containing pure caffeine one hour before the test. The other men received a placebo capsule. During each exercise the subject's respiratory exchange ratio (RER) was measured. (RER is the ratio of CO2 produced to O2 consumed and is an indicator of whether energy is being obtained from carbohydrates or fats.)
      • Independent variable:
      • Dependent variable:

  12. #  Placebo  Caffeine
      1  105      96
      2  119      99
      3  100      94
      4  97       89
      5  96       96
      6  101      93
      7  94       88
      8  95       105
      9  98       88

  13. What if you have more than two cases?
      Use “One-way ANOVA”
      • While the t-test is for comparing 2 means, ANOVA is for >2
      • Calculate the ‘F’ ratio

  14. Example) Tar Contents in Cigarettes
      We want to see whether the tar contents (in milligrams) of three different brands of cigarettes differ. Lab Precise took 6 samples from each of the three brands and got the following measurements:
      • Independent variable:
      • Dependent variable:

  15. Brand A  Brand B  Brand C
      10.21    11.32    11.6
      10.25    11.2     11.9
      10.24    11.4     11.8
      9.8      10.5     12.3
      9.77     10.68    12.2
      9.73     10.9     12.2
      “The three cigarette brands differ in their mean amount of tar, F(2,15) = 65.464, P < 0.001.”
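
The ‘F’ ratio for this example can be worked out by hand as the between-brand mean square divided by the within-brand mean square. The following is a minimal standard-library sketch: the function name `one_way_anova` is made up for illustration, and it assumes the flattened table reads row by row as (Brand A, Brand B, Brand C). Under that reading it should reproduce the F(2,15) = 65.464 reported on the slide.

```python
from statistics import mean

# Tar measurements (mg) from the slide, six samples per brand.
brands = {
    "A": [10.21, 10.25, 10.24, 9.8, 9.77, 9.73],
    "B": [11.32, 11.2, 11.4, 10.5, 10.68, 10.9],
    "C": [11.6, 11.9, 11.8, 12.3, 12.2, 12.2],
}


def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of samples."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of each observation around its own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)

    df_between, df_within = k - 1, n - k
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within


f_ratio, df_between, df_within = one_way_anova(list(brands.values()))
print(f"F({df_between},{df_within}) = {f_ratio:.3f}")
```

A large F means the brand means are spread far apart relative to the sample-to-sample scatter within each brand.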

  16. Threats

  17. Threats to Your Findings
      • Validity
      • Reliability

  18. Validity
      Validity is concerned with the study's success at measuring what the researchers set out to measure: how well does a test measure what it is purported to measure?
      • Is it internally valid?
      • Is it externally valid?

  19. Internal Validity
      How well is an experiment done, especially in avoiding confounding (more than one possible independent variable [cause] acting at the same time)?
      The extent to which a causal conclusion based on a study is warranted
      • Differences (in means) should be a result of experimental factors (e.g. what we are testing)
      • Variances in means result from differences in participants
      • Other variances are controlled or exist randomly

  20. Threats to Internal Validity
      • Ordering effects: Effects might be due to the order of the test conditions
        o People learn, and people get tired
        o Don’t present tasks or interfaces in the same order for all users
        o Randomize or counterbalance the ordering
      • Selection effects: Effects might be due to participant differences
        o Don’t use pre-existing groups
        o Randomly assign users to the conditions of the independent variable
      • Experimental bias: Effects might be due to the experimenter
        o Experimenter may be enthusiastic about interface X but not Y
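
Counterbalancing and random assignment are easy to script. The sketch below is one possible approach under stated assumptions, not a prescribed procedure: the function names and participant IDs are hypothetical, and the counterbalancing uses a simple rotated Latin square, which balances the position of each condition but not which condition precedes which.

```python
import random


def latin_square_orders(conditions):
    """Rotate the condition list so each condition appears once per position."""
    n = len(conditions)
    return [[conditions[(row + col) % n] for col in range(n)] for row in range(n)]


def assign_orders(participants, conditions, seed=None):
    """Within-subjects: shuffle participants, then cycle through the orders."""
    rng = random.Random(seed)
    orders = latin_square_orders(conditions)
    shuffled = list(participants)
    rng.shuffle(shuffled)
    return {p: orders[i % len(orders)] for i, p in enumerate(shuffled)}


def random_groups(participants, conditions, seed=None):
    """Between-subjects: shuffle, then deal participants out round-robin."""
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)
    groups = {c: [] for c in conditions}
    for i, p in enumerate(shuffled):
        groups[conditions[i % len(conditions)]].append(p)
    return groups


# Hypothetical participant IDs, for illustration only.
people = [f"P{i}" for i in range(1, 9)]
print(assign_orders(people, ["iPad", "Kindle"], seed=42))
print(random_groups(people, ["iPad", "Kindle"], seed=42))
```

Fixing `seed` makes the assignment reproducible for the study record; leaving it as `None` gives a fresh shuffle on each run.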

  21. Internal Validity: Example
      An industrial psychologist wants to study the effects of soft classical music on the productivity of a group of typists in a typing pool. At the beginning of the month, the psychologist meets with the typists to explain the study, gets their consent to play the music during the working day, and then begins to have music piped into the office where the typists work. At the end of the month, the typists' supervisor reports a 30% increase in the number of documents completed by the typing pool that month. "Soft music increases productivity," the psychologist concludes.

  22. External Validity
      How generalizable is the result? The extent to which the results of a study can be generalized to other situations and to other people
      • Extent to which results can be generalized to a broader context
      • Participants in your study are “representative”
      • Test conditions can be generalized to the real world

  23. Threats to External Validity
      • Population: Findings are not generalizable to other people
        o Draw a random sample from your real target population
      • Ecological: Findings are not generalizable to other situations
        o Make lab conditions as realistic as possible in important respects
      • Training
        o Training should mimic how the real interface would be encountered and learned
      • Task
        o Base your tasks on task analysis

  24. External Validity: Example
      An educational researcher wants to study the effectiveness of a new method of teaching reading to first graders. The researcher asks all 30 of the first-grade teachers in a particular school district if they would like to receive training in the new method and then use it during the coming school year. Fourteen teachers volunteer to learn and use the new method; 16 teachers say that they would prefer to use their current approach. At the end of the school year, students who have been instructed with the new method have significantly higher average scores on a reading achievement test than students who have received more traditional reading instruction. "The new method is definitely better than the old one," the researcher concludes.

  25. Reliability
      • The quality of measurements
      • Reliability is the "consistency" or "repeatability" of your measures

  26. Threats to Reliability
      • Uncontrolled variation
        o Previous experience (e.g., novices vs. experts)
        o User differences
        o Task design: Do tasks measure what you try to measure?
        o Measurement error
      • Solutions
        o Eliminate uncontrolled variation
        o Repetition

  27. Validity vs. Reliability

  28. Threats to Your Findings
      • Internal Validity: Are observed results actually caused by the independent variables?
      • External Validity: Can observed results be generalized to the world outside the lab?
      • Reliability: Will consistent results be obtained by repeating the experiment?

  29. Experimental Biases

  30. Experimental Biases
      • Hawthorne effect
      • Experimenter effect
      • Placebo effect
      • Novelty effect

  31. Hawthorne Effect
      • The phenomenon in which subjects' behavior changes due to the mere fact that they are being observed.

  32. Experimenter Effect
      • A researcher’s bias influences what they see
      • Example from Wikipedia: music backmasking
        o Once the subliminal lyrics are pointed out, they become obvious
      • Dowsing
        o Not more likely than chance
      • The issue: If you expect to see something, maybe something in that expectation leads you to see it
      • Solved via double-blind studies

  33. Placebo Effect
      • Subject expectancy
      • If you think the treatment, condition, etc. has some benefit, then it may
      • Placebo-based anti-depressants, muscle relaxants, etc.
      • In computing, an improved GUI, a better device, etc.
      • Steve Jobs: http://www.youtube.com/watch?v=8JZBLjxPBUU
      • Bill Buxton: http://www.youtube.com/watch?v=Arrus9CxUiA

  34. Novelty Effect
      • Typically with technology
      • Performance improves when technology is instituted because people have increased interest in new technology
      • Examples: Computer-assisted instruction in secondary schools, computers in the classroom in general, etc.
