Validity: How to make sure your testing investment yields reliable data (interactive panel with experts)

  1. Validity: How to make sure your testing investment yields reliable data (interactive panel with experts) – Daniel Burstein, Bob Kemper, Aaron Gray, Eric Miller

  2. Interactive panel with experts: Daniel Burstein, Director of Editorial Content, MECLABS (@DanielBurstein); Eric Miller, Director of Product Management, Monetate (@dynamiller, @monetate); Bob Kemper, Director of Sciences, MECLABS; Aaron Gray, Head of Day 2 (@agray)

  3. How marketers validate test results

  4. Three important (and often misunderstood) elements of statistical significance: • Significant difference • Sample size • Level of confidence

  5. Significant difference: 11% is more than 10%* (*…except when it's not)
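
The asterisk is easier to see with numbers. Below is a minimal sketch (not from the deck) of a standard two-proportion z-test; the visitor counts are hypothetical. With 1,000 visitors per version, an 11%-vs-10% result is well within the range of chance; with 50,000 per version, the same rates are conclusive.

```python
from math import erf, sqrt

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test: could the gap between two observed
    conversion rates plausibly be chance alone?"""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)               # pooled rate under "no real difference"
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # normal approximation
    return z, p_value

# 10% vs. 11% with 1,000 visitors per version: not significant (p ~ 0.47)
print(two_proportion_z_test(100, 1_000, 110, 1_000))

# The same 10% vs. 11% with 50,000 visitors per version: highly significant (p < 0.001)
print(two_proportion_z_test(5_000, 50_000, 5_500, 50_000))
```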

  6. Which Treatment Won? A B

  7. Initial Monetate Campaign

  8. Test A ‐ Warm Fleece

  9. Test B ‐ Layers

  10. Monetate Reporting: A Vs. Control • Incremental Revenue: -$35k • New Customer Acquisition: 14.79% lift at p95 • AOV: -8.81% lift at p99

  11. Monetate Reporting: B Vs. Control • Incremental Revenue: $43k • AOV: 13.15% lift at p99 • RPS: 13.47% lift at p90

  12. A Vs. B Conversion: 4.30% lift at p80

  13. Final Results
      • They both won for different segments
      • "A - Fleece" was the overall winner and won with new customer acquisition, and is now shown only to that segment
      • "B - Layers" won with existing customers, with significant lift in AOV over time
      • New campaigns were iterated to take advantage of learnings

  14. Resulting Campaign

  15. Sample size
      • n=2: "Well, you're alive today even though you didn't have one of those fancy car seats." – My Mom
      • n=7,813: "Compared with seat belts, child restraints…were associated with a 28% reduction in risk for death." – Michael R. Elliott, PhD; Michael J. Kallan, MS; Dennis R. Durbin, MD, MSCE; Flaura K. Winston, MD, PhD

  16. Sample size
      The number of test subjects needed to get "statistically significant" results; achieving that number is a function of visitor volume and time.
      Factors in determining sample size:
      • Test complexity (number of versions being tested)
      • Conversion rate
      • Performance difference between variations
      • Confidence level
      But too short a test may not be as valid as it looks, especially if distribution over time is a factor. Be realistic about what kind of test your site can support (see the sketch below).
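
As an illustration of how these factors interact, here is a sketch of the standard textbook sample-size formula for a two-version test on conversion rates. The baseline rate, detectable lift, power, and daily-traffic figures are made up for the example, and real testing tools may use slightly different formulas.

```python
from statistics import NormalDist

def sample_size_per_arm(base_rate, relative_lift, confidence=0.95, power=0.80):
    """Visitors needed in each version to detect `relative_lift` over `base_rate`
    (two-sided test on proportions, standard textbook approximation)."""
    p1 = base_rate
    p2 = base_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # ~1.96 for 95% confidence
    z_beta = NormalDist().inv_cdf(power)                       # ~0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return int((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2) + 1

# Hypothetical: 5% baseline conversion, and we want to detect a 10% relative lift
n = sample_size_per_arm(0.05, 0.10)
print(n)              # roughly 31,000 visitors per version
print(n / 2_000)      # ~16 days if each version gets about 2,000 visitors per day
```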

  17. Level of confidence ("Piled Higher and Deeper" by Jorge Cham, www.phdcomics.com)

  18. Level of confidence: What is it?
      Statistical Level of Confidence – the statistical probability that there really is a performance difference between the control and experimental treatments ("unofficial" but useful), based upon the data collected to date.
      How (or where) do I get it?
      • The math – determine the mean difference, the standard deviation, and the sample size, and use the formula for confidence interval limits (a sketch follows below)
      • Or get it from your metrics / testing tool
      The big (inferential) statistics question: What are the chances that what I just saw could have happened "just by chance," and that these two pages are really no different at all?
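
A minimal sketch of the "do the math yourself" route, assuming a simple two-version test on conversion rates: compute the confidence interval for the difference between the two rates and check whether it excludes zero. The visitor counts below are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def lift_confidence_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Confidence interval for (treatment rate - control rate)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)   # unpooled standard error
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = p_b - p_a
    return diff - z * se, diff + z * se

# Hypothetical: 20,000 visitors per version, 5.0% control vs. 5.6% treatment
low, high = lift_confidence_interval(1_000, 20_000, 1_120, 20_000)
print(f"95% CI for the lift: {low:+.4f} to {high:+.4f}")
# The interval excludes zero, so the difference is significant at 95% confidence.
```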

  19. Level of confidence: What does it MEAN?
      Imagine an experiment…
      • Take one FAIR coin (i.e., if flipped ∞ times, it would come out heads 50% of the time)
      • Flip the coin 'n' (many) times and record # Heads (e.g., say 60 times)
      • Then do it over and over again, same # of flips each time
      • Chart the results: bar height proportional to the # of times the coin came out with that many Heads
      The math – 5 times out of every 100 that I do the coin-flip experiment, I expect to get a difference between my two samples that's AT LEAST as big as this one, even though there is NO ACTUAL difference.
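
The same idea can be made concrete with a short simulation (not part of the deck, and all numbers are hypothetical): create two pages that truly convert at the same rate, and count how often random noise alone produces a difference at least as big as the one observed.

```python
import random

def chance_alone(true_rate, n_per_arm, observed_diff, trials=5_000):
    """Simulate two *identical* pages and count how often random noise alone
    produces a difference at least as big as the one we observed."""
    hits = 0
    for _ in range(trials):
        a = sum(random.random() < true_rate for _ in range(n_per_arm))
        b = sum(random.random() < true_rate for _ in range(n_per_arm))
        if abs(b - a) / n_per_arm >= observed_diff:
            hits += 1
    return hits / trials

# Hypothetical: both pages truly convert at 10%; we "observed" 11% vs. 10%
print(chance_alone(true_rate=0.10, n_per_arm=1_000, observed_diff=0.01))
# Prints roughly 0.45: with only 1,000 visitors per version, chance alone
# looks this good almost half the time.
```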

  20. Level of confidence: How do I decide on the right level?
      Most common is 95% (i.e., a 5% chance you'll think they're different when they're really not).
      • There is no 'magic' to the 95% LoC; it is mainly a matter of 'convention' or agreement
      • The onus for picking the 'right' level for your test is on YOU
      • Sometimes the tools limit you; 95% is seldom a "bad" choice
      • Higher confidence = longer test and a bigger difference needed for validity
      • Decide based on the level of risk of being wrong vs. the cost of prolonging the test (see the sketch below)
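
To see why higher confidence means a longer test, here is a quick sketch that recomputes the required sample size per version at 90%, 95%, and 99% confidence for the same hypothetical test as before (5% baseline, 10% relative lift, 80% power).

```python
from statistics import NormalDist

def n_per_arm(p1, p2, confidence, power=0.80):
    """Visitors per version needed for a two-sided test on proportions."""
    z_a = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_b = NormalDist().inv_cdf(power)
    return int((z_a + z_b) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2) + 1

# Same hypothetical test (5% baseline vs. 5.5% target), three confidence levels:
for conf in (0.90, 0.95, 0.99):
    print(f"{conf:.0%} confidence -> {n_per_arm(0.05, 0.055, conf):,} visitors per version")
# 90% -> ~24,600   95% -> ~31,200   99% -> ~46,500
```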

  21. The iPod of validity tools

  22. How marketers validate test results

  23. Experiment – Background
      Experiment ID: (Protected)
      Location: MarketingExperiments Research Library
      Background: Consumer company that offers online brokerage services
      Goal: To increase the volume of accounts created online
      Primary research question: Which page design will generate the highest rate of conversion?
      Test Design: A/B/C/D multi-factor split test

  24. Experiment – Control Treatment: • Heavily competing imagery and messages • Rotating banner • Multiple calls-to-action

  25. Experiment – Exp. Treatment 1
      • Most of the elements on the page are unchanged; only one block of information has been optimized (rotating banner retained)
      • Headline has been added
      • Bulleted copy highlights key value proposition points
      • "Chat With a Live Agent" CTA removed
      • Large, clear call-to-action has been added

  26. Experiment – Exp. Treatment 2
      • Left column remained the same, but we removed footer elements (rotating banner retained)
      • Long copy, vertical flow
      • Added awards and testimonials in right-hand column
      • Large, clear call-to-action similar to Treatment 1

  27. Experiment – Exp. Treatment 3
      • Similar to Treatment 2, except left-hand column width reduced even further (rotating banner retained)
      • Left-hand column has a more navigational role
      • Still a long-copy, vertical-flow, single call-to-action design

  28. Experiment – All Treatments Summary Control Treatment 1 Treatment 2 Treatment 3

  29. Experiment – Results: No Significant Difference
      None of the treatment designs performed with conclusive results.
      Test Design    Conversion Rate   Relative Diff%
      Control        5.95%             –
      Treatment 1    6.99%             17.42%
      Treatment 2    6.51%             9.38%
      Treatment 3    6.77%             13.70%
      What you need to understand: According to the testing platform we were using, the aggregate results came up inconclusive. None of the treatments outperformed the control with any significant difference.

  30. Experiment – Validity Threat
      • However, we noticed an interesting performance shift in the control and treatments towards the end of the test.
      • We discovered that during the test, there was an email sent that skewed the sampling distribution.
      [Chart: Conversion rate by test duration (Day 1 to Day 11) for Control vs. Treatment 3 – the treatment consistently beats the control until the final days, when the control beats the treatment]

  31. Experiment – Results: 31% Increase in Conversions
      The highest performing treatment outperformed the control by 31%.
      Test Design    Conversion Rate   Relative Diff%
      Control        5.35%             –
      Treatment 1    6.67%             25%
      Treatment 2    6.13%             15%
      Treatment 3    7.03%             31%
      What you need to understand: After excluding the data collected after the email had been sent out, each of the treatments substantially outperformed the control with conclusive validity.

  32. Validity Threats: The reason you can't blindly trust your tools
      • Sample Distortion Effect – the effect on a test outcome caused by failing to collect a sufficient number of observations
      • History Effect
      • Instrumentation Effect
      • Selection Effect

  33. Validity Threats: History effect
      When a test variable is affected by an extraneous variable associated with the passage of time.
      Examples:
      • An email send that skews conversion for one treatment (as in the previous experiment)
      • A newsworthy event that changes the nature of arriving subjects, whether temporarily or permanently (e.g., the 9/11 attack)

  34. Validity Threats: History effect

  35. Validity Threats: History effect

  36. Validity Threats: History effect
      Identification:
      • Sniff test – but only to a point
      • Did anything happen? REALLY HARD
      Mitigation:
      • Segmented reporting (see the sketch below)
      • Test with longer time horizons, but to a point
      • Iterate, iterate, iterate, target, test
      • Balance the cost of being wrong
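
Segmented reporting can be as simple as rolling raw results up by day and variant and eyeballing the trend, as in the chart on slide 30. Below is a minimal sketch with a hypothetical log format (day, variant, converted flag); any analytics export with those three fields would work the same way.

```python
from collections import defaultdict

def daily_report(observations):
    """Roll raw observations up into per-day conversion rates for each variant,
    so a mid-test shift (e.g., an email blast) is visible at a glance."""
    counts = defaultdict(lambda: [0, 0])          # (day, variant) -> [visitors, conversions]
    for day, variant, converted in observations:
        counts[(day, variant)][0] += 1
        counts[(day, variant)][1] += converted
    for (day, variant), (visitors, conversions) in sorted(counts.items()):
        print(f"Day {day:>2}  {variant:<12} {conversions / visitors:6.2%}  (n={visitors})")

# Hypothetical log rows: (day, variant, converted 0/1)
log = [(1, "control", 0), (1, "control", 1), (1, "treatment_3", 1),
       (2, "control", 0), (2, "treatment_3", 1), (2, "treatment_3", 0)]
daily_report(log)
```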

  37. Validity Threats: Instrumentation effect
      When a test variable is affected by a change in the measurement instrument.
      Examples:
      • Short-duration response time slowdowns (e.g., due to server load, page weight, or page-code problems)
      • Splitter malfunction
      • Inconsistent URLs
      • Server downtime
