Quantitative analysis with statistics (and ponies) (Some slides and pony-based examples from Blase Ur) 1
• Interviews, diary studies • Start stats • Thursday: Ethics/IRB • Tuesday: More stats • New homework is available 2
INTERVIEWS 3
Why an interview? • Rich data (from fewer people) • Good for exploration – When you aren’t sure what you’ll find – Helps identify themes, gain new perspectives • Usually cannot generalize quantitatively • Potential for bias (conducting, analyzing) • Structured vs. semi-structured 4
Interview best practices • Make participants comfortable • Avoid leading questions • Support whatever participants say – Don’t make them feel incorrect or stupid • Know when to ask a follow-up • Get a broad range of participants (hard) 5
Try it! • In pairs, write two interview questions about password security/usability • Change partners with another pair and ask each other; report back 6
DIARY STUDIES 7
Why do a diary study? • Rich longitudinal data (from a few participants) – In the field … ish • Natural reactions and occurrences – Existence and quantity of phenomena – User reactions in the moment rather than via recall • Lots of work for you and your participants • On paper vs. technology-mediated 8
Experience sampling • Kind of a prompted diary • Send participants a stimulus when they are in their natural life, not in the lab 9
Diary / ESM best practices • When will an entry be recorded? – How often? Over what time period? • How long will it take to record an entry? – How structured is the response? • Pay well – Pay per response? But don’t create bias 10
Facebook regrets (Wang et al.) • Online survey, interviews, diary study, 2nd survey • What do people regret posting? Why? • How do users mitigate? 11
FB regrets – Interviews • Semi-structured, in-person, in-lab • Recruiting via Craigslist – Why pre-screen questionnaire? – 19/301 • Coded by a single author for high-level themes 12
FB regrets – Diary study • “The diary study did not turn out to be very useful” • Daily online form (30 days) – Facebook activities, incidents – “Have you changed anything in your privacy settings? What and why?” – “Have you posted something on Facebook and then regretted doing it? Why and what happened?” – 22+ days of entries: $15 – 12/19 interviewees entered 1+ logs (217 total logs) 13
Location-sharing (Consolvo et al.) • Whether and what about location to disclose – To people you know • Preliminary interview – Buddy list, expected preferences • Two-week ESM (simulated location requests) • Final interview to reflect on experience 14
ESM study • Whether to disclose or not, and why – Customized askers, customized context questions – If so, how granular? – Where are you and what are you doing? – One-time or standing request • $60-$250 to maximize participation • Average response rate: above 90% 15
Statistics for experimental comparisons • The main idea: Hypothesis testing • Choosing the right test: Comparisons • Regressions • Other stuff – Non-independence, directional tests, effect size • Tools 16
What’s the big idea, anyway? OVERVIEW 17
Statistics • In general: analyzing and interpreting data • We often mean: Statistical hypothesis testing – Question: Are two things different? – Is it unlikely the data would look like this unless there is actually a difference in real life? 18
Important note • This lecture is not going to be precise or complete. It is intended to give you some intuition and help you understand what questions to ask. 19
The prototypical case • Q: Do ponies who drink more caffeine make better passwords? • Experiment: Recruit 30 ponies. Give 15 caffeine pills and 15 placebos. They all create passwords. 20
Hypotheses • Null hypothesis: There is no difference – Caffeine does not affect pony password strength. • Alternative hypothesis: There is a difference – Caffeine affects pony password strength. • Note what is not here (more on this later): – Which direction is the effect? – How strong is the effect? 21
Hypotheses, continued • Statistical test gives you one of two answers: 1. Reject the null: We have (strong) evidence the alternative is true. 2. Don’t reject the null: We don’t have (strong) evidence the alternative is true. • Again, note what isn’t here: – We have strong evidence the null is true. (NOPE) 22
P values • What is the probability that the data would look like this if there’s no actual difference? – i.e., the probability we tell everyone about ponies and caffeine but it isn’t really true • Most often, α = 0.05; some people choose 0.01 – If p < 0.05, reject the null hypothesis; there is a “significant” difference between caffeine and placebo – A p-value is not magic, just a probability, and the threshold is arbitrary – But significance is reported TRUE or FALSE: you don’t say something is “more significant” because the p-value is lower 23
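A minimal sketch (not from the slides) of the p-value idea, using a permutation test in Python; numpy is assumed, and the pony password scores below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical password-strength scores (e.g., log10 guess numbers)
# for 15 caffeinated and 15 placebo ponies; illustrative data only.
caffeine = rng.normal(loc=8.5, scale=2.0, size=15)
placebo = rng.normal(loc=7.0, scale=2.0, size=15)

observed = caffeine.mean() - placebo.mean()

# Permutation test: if the null hypothesis were true (caffeine has no
# effect), the group labels are arbitrary, so shuffle them many times
# and see how often a difference this large appears by chance.
pooled = np.concatenate([caffeine, placebo])
n_perm = 10_000
count = 0
for _ in range(n_perm):
    rng.shuffle(pooled)
    diff = pooled[:15].mean() - pooled[15:].mean()
    if abs(diff) >= abs(observed):
        count += 1

p_value = count / n_perm
print(f"observed difference = {observed:.2f}, p = {p_value:.4f}")
# Reject the null at alpha = 0.05 only if p < 0.05.
```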
Type II Error (False negative) • There is a difference, but you didn’t find evidence – No one will know the power of caffeinated ponies • Hypothesis tests DO NOT BOUND this error • Instead, statistical power is the probability of rejecting the null hypothesis if you should – Requires that you estimate the effect size (hard) 24
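A hedged sketch of a power calculation for a two-group design, assuming the statsmodels library is available and using a made-up effect-size estimate (Cohen's d = 0.5) for the pony experiment:

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Assumed (made-up) medium effect size for caffeine on password strength.
effect_size = 0.5  # Cohen's d

# How many ponies per group would we need for 80% power at alpha = 0.05?
n_per_group = analysis.solve_power(effect_size=effect_size,
                                   alpha=0.05, power=0.80)
print(f"ponies needed per group: {n_per_group:.1f}")

# Conversely: with only 15 ponies per group, what power do we have?
power = analysis.solve_power(effect_size=effect_size,
                             nobs1=15, alpha=0.05, ratio=1.0)
print(f"power with n=15 per group: {power:.2f}")
```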
Hypotheses, power, probability • After an experiment, one of four things has happened (total P = 1): – Reality: difference, and you rejected the null → probability estimated via power analysis – Reality: difference, but you didn’t reject → ? – Reality: no difference, but you rejected the null → probability bounded by α – Reality: no difference, and you didn’t reject → ? • Which box are you in? You don’t know. 25
Correlation and causation • Correlation: We observe that two things are related – Do rural or urban ponies make stronger passwords? • Causation: We randomly assigned participants to groups and gave them different treatments – If designed properly – Do password meters help ponies? 26
CHOOSING THE RIGHT TEST 27
What kind of data do you have? • Explanatory variables: inputs, x-values – e.g., conditions, demographics • Outcome variables: outputs, y-values – e.g., time taken, Likert responses, password strength 28
What kind of data do you have? • Quantitative – Discrete (Number of caffeine pills taken by each pony) – Continuous (Weight of each pony) • Categorical – Binary (Is it or isn’t it a pony?) – Nominal: No order (Color of the pony) – Ordinal: Ordered (Is the pony super cool, cool, a little cool, or uncool?) 29
What kind of data do you have? • Does your dependent data follow a normal distribution? (You can test this!) – If so, use parametric tests. – If not, use non-parametric tests. • Are your data independent? – If not, use repeated-measures tests, mixed models, etc. 30
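A minimal sketch of checking normality before choosing a test, assuming scipy is installed; the data are invented:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Hypothetical outcome: seconds each pony took to create a password.
times = rng.lognormal(mean=3.0, sigma=0.5, size=30)

# Shapiro-Wilk test: the null hypothesis is that the data are normal.
stat, p = stats.shapiro(times)
if p < 0.05:
    print(f"p = {p:.3f}: data look non-normal; prefer non-parametric tests")
else:
    print(f"p = {p:.3f}: no evidence against normality; parametric tests OK")
```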
If both are categorical … • Participants each used one of two systems – Did they like the system they got? (Yes/no) • H_A: System affects user sentiment • Use (Pearson’s) χ² (chi-squared) test of independence – If fewer than 5 data points in any single cell, use Fisher’s Exact Test (which also works with lots of data) 31
Contingency tables • Rows are one variable, columns the other • Example: – Row = condition – Column = true/false • χ² = 97.013, df = 14, p = 1.767e-14 32
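A minimal sketch of a χ² test of independence on a small contingency table (falling back to Fisher's Exact Test when cells are sparse), assuming scipy; the counts are made up:

```python
from scipy import stats

# Hypothetical 2x2 contingency table:
# rows = condition (system A, system B), columns = liked it (yes, no).
table = [[40, 10],   # system A: 40 yes, 10 no
         [25, 25]]   # system B: 25 yes, 25 no

chi2, p, dof, expected = stats.chi2_contingency(table)
print(f"chi2 = {chi2:.3f}, df = {dof}, p = {p:.4f}")

# If any expected cell count is below 5, fall back to Fisher's Exact Test
# (scipy's version handles 2x2 tables).
if (expected < 5).any():
    odds_ratio, p = stats.fisher_exact(table)
    print(f"Fisher's Exact Test: p = {p:.4f}")
```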
Explanatory: categorical Outcome: continuous … • Participants each used one system – Measure a continuous value (time taken, password guess number) • H_A: System affects password strength • Normal, continuous outcome (compare means): – 2 conditions: t-test – 3+ conditions: ANOVA 33
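A minimal sketch of the t-test and one-way ANOVA with scipy, on invented password-strength values:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Hypothetical log10(guess number) for passwords created in each condition.
control = rng.normal(7.0, 2.0, size=15)
meter = rng.normal(8.0, 2.0, size=15)
nudge = rng.normal(8.5, 2.0, size=15)

# Two conditions: independent-samples t-test.
t, p = stats.ttest_ind(control, meter)
print(f"t-test: t = {t:.2f}, p = {p:.4f}")

# Three or more conditions: one-way ANOVA.
f, p = stats.f_oneway(control, meter, nudge)
print(f"ANOVA: F = {f:.2f}, p = {p:.4f}")
```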
Explanatory: categorical Outcome: continuous … • Non-normal or ordinal outcome – Does one group tend to have larger values? – 2 conditions: Mann-Whitney U (AKA Wilcoxon rank-sum) – 3+ conditions: Kruskal-Wallis 34
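The non-parametric counterparts, sketched with scipy on invented, skewed data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Hypothetical skewed outcome (e.g., seconds until the pony gave up).
group_a = rng.exponential(scale=30, size=15)
group_b = rng.exponential(scale=45, size=15)
group_c = rng.exponential(scale=60, size=15)

# Two conditions: Mann-Whitney U (Wilcoxon rank-sum).
u, p = stats.mannwhitneyu(group_a, group_b)
print(f"Mann-Whitney U = {u:.1f}, p = {p:.4f}")

# Three or more conditions: Kruskal-Wallis.
h, p = stats.kruskal(group_a, group_b, group_c)
print(f"Kruskal-Wallis H = {h:.2f}, p = {p:.4f}")
```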
[Chart: Outcome – length of password] 35
What about Likert-scale data? • Respond to the statement: Ponies are magical. – 7: Strongly agree – 6: Agree – 5: Mildly agree – 4: Neutral – 3: Mildly disagree – 2: Disagree – 1: Strongly disagree 36
What about Likert-scale data? • Some people treat it as continuous (not good) • Other people treat it as ordinal (better!) – The difference between 1 and 2 ≠ the difference between 2 and 3 – Use Mann-Whitney U / Kruskal-Wallis • Another good option: binning (simpler) – Transform into binary “agree” and “not agree” – Use χ² or FET 37
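A minimal sketch of both options for Likert data: treating responses as ordinal (Mann-Whitney U) and binning them into agree / not agree (Fisher's Exact Test). scipy is assumed and the responses are invented:

```python
from scipy import stats

# Hypothetical Likert responses (1-7) to "Ponies are magical" in two conditions.
control = [7, 6, 6, 5, 4, 3, 6, 5, 2, 6]
meter = [4, 3, 2, 5, 4, 3, 1, 4, 2, 3]

# Ordinal treatment: Mann-Whitney U on the raw responses.
u, p = stats.mannwhitneyu(control, meter)
print(f"Mann-Whitney U = {u:.1f}, p = {p:.4f}")

# Binning: collapse to "agree" (5-7) vs. "not agree" (1-4), then test the
# resulting 2x2 table; cells are small here, so use Fisher's Exact Test.
def bin_agree(responses):
    agree = sum(r >= 5 for r in responses)
    return [agree, len(responses) - agree]

table = [bin_agree(control), bin_agree(meter)]
odds_ratio, p = stats.fisher_exact(table)
print(f"Fisher's Exact Test: p = {p:.4f}")
```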
[Chart: “Password meter annoying” ratings by condition – Control: baseline meter; Visual: three-segment, green, tiny, huge, no suggestions, text-only, bunny; Scoring: half-score, one-third-score, nudge-16, nudge-comp8; Visual & Scoring: text-only half-score, bold text-only half-score] 38
Notes for study design • Plan your analysis before you collect data! – What explanatory, outcome variables? – Which tests will be appropriate? • Ensure that you collect what you need and know what to do with it – Otherwise your experiment may be wasted 39
CONTRASTS 40