Validity: How to make sure your testing investment yields reliable data (interactive panel with experts)
Daniel Burstein, Bob Kemper, Aaron Gray, Eric Miller
Interactive panel with experts
Daniel Burstein, Director of Editorial Content, MECLABS (@DanielBurstein)
Eric Miller, Director of Product Management, Monetate (@dynamiller, @monetate)
Bob Kemper, Director of Sciences, MECLABS
Aaron Gray, Head of Day 2 (@agray)
How marketers validate test results
Three important (and often misunderstood) elements of statistical significance
• Significant difference
• Sample size
• Level of confidence
Significant difference
11% is more than 10%*
*…except when it’s not
Which Treatment Won? A B
Initial Monetate Campaign
Test A ‐ Warm Fleece
Test B ‐ Layers
Monetate Reporting: A Vs. Control
Incremental Revenue: -$35k
New Customer Acquisition: 14.79% lift at p95
AOV: -8.81% lift at p99
Monetate Reporting: B Vs. Control
Incremental Revenue: $43k
AOV: 13.15% lift at p99
RPS: 13.47% lift at p90
A Vs. B Conversion: 4.30% lift at p80
Final Results
• They both won for different segments.
• “A - Fleece” was the overall winner, won with new customer acquisition, and is now shown only to that segment.
• “B - Layers” won with existing customers, with significant lift in AOV over time.
• New campaigns were iterated to take advantage of learnings.
Resulting Campaign
Sample size
n=2: “Well, you’re alive today even though you didn’t have one of those fancy car seats.” – My Mom
n=7,813: “Compared with seat belts, child restraints…were associated with a 28% reduction in risk for death.” – Michael R. Elliott, PhD; Michael J. Kallan, MS; Dennis R. Durbin, MD, MSCE; Flaura K. Winston, MD, PhD
Sample size
• Number of test subjects needed to get “statistically significant” results
• Achieving that number is a function of visitor volume and time
• Factors in determining sample size:
  • Test complexity (number of versions being tested)
  • Conversion rate
  • Performance difference between variations
  • Confidence level
• But too short a test may not be as valid as it looks, especially if distribution over time is a factor
• Be realistic about what kind of test your site can support
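To make the "function of visitor volume and time" point concrete, here is a minimal sketch of the standard two-proportion sample-size approximation. The baseline rate, expected lift, confidence level, and power below are illustrative assumptions, not numbers from the deck.

```python
# Hypothetical sketch: approximate visitors needed per variation for a
# two-proportion test. All input numbers are assumptions for illustration.
from math import sqrt
from scipy.stats import norm

def required_sample_size(p_control, p_treatment, alpha=0.05, power=0.80):
    """Approximate visitors per variation (two-sided z-test of proportions)."""
    z_alpha = norm.ppf(1 - alpha / 2)   # e.g., 1.96 for 95% confidence
    z_beta = norm.ppf(power)            # e.g., 0.84 for 80% power
    p_bar = (p_control + p_treatment) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p_control * (1 - p_control)
                                 + p_treatment * (1 - p_treatment))) ** 2
    return numerator / (p_control - p_treatment) ** 2

# Example: 5% baseline conversion, hoping to detect a lift to 6%.
print(round(required_sample_size(0.05, 0.06)))  # roughly 8,000+ visitors per variation
```

Note how quickly the requirement grows as the expected difference shrinks, which is why low-traffic sites should be realistic about how many variations they can support.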
Level of confidence
“Piled Higher and Deeper” by Jorge Cham, www.phdcomics.com
Level of confidence
What is it?
• Statistical Level of Confidence – the statistical probability (an “unofficial” but useful reading) that there really is a performance difference between the control and experimental treatments, based upon the data collected to date.
How (or where) do I get it?
• The math – determine the mean difference, the standard deviation and the sample size, and use the formula for confidence interval limits.
• Or… get it from your metrics / testing tool.
The big (inferential) statistics question:
• What are the chances that what I just saw could have happened “just by chance,” and that these two pages are really no different at all?
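A minimal sketch of the "do the math yourself" option, assuming a simple two-proportion confidence interval for the difference between control and treatment; the conversion and visitor counts are invented for illustration.

```python
# Rough sketch: confidence-interval limits for the difference between two
# conversion rates. The counts below are invented for illustration.
from math import sqrt
from scipy.stats import norm

def diff_confidence_interval(conv_c, n_c, conv_t, n_t, confidence=0.95):
    p_c, p_t = conv_c / n_c, conv_t / n_t
    diff = p_t - p_c
    se = sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)  # std. error of the difference
    z = norm.ppf(1 - (1 - confidence) / 2)                    # 1.96 for 95%
    return diff - z * se, diff + z * se

low, high = diff_confidence_interval(conv_c=500, n_c=10_000, conv_t=560, n_t=10_000)
print(low, high)  # if the interval excludes 0, the difference is "significant" at that level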
Level of confidence
What does it MEAN? Imagine an experiment…
• Take one FAIR coin (i.e., if flipped ∞ times, it would come out heads 50%).
• Flip the coin ‘n’ (many) times (e.g., say 60 times) and record # Heads.
• Then do it over and over again; same # flips.
• Plot the results, proportional to the # of times the coin comes out with that many Heads.
The math – 5 times out of every 100 that I do the coin-flip experiment, I expect to get a difference between my two samples that’s AT LEAST as big as this one, even though there is NO ACTUAL difference.
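The coin-flip thought experiment is easy to simulate. This sketch assumes two samples drawn from the same fair coin and counts how often their gap is at least as large as some observed difference; all numbers are illustrative.

```python
# Simulate the slide's thought experiment: two samples drawn from the SAME fair coin.
# How often is the difference between them at least as big as an observed difference?
import random

def simulate(n_flips=60, observed_diff=10, runs=10_000):
    at_least_as_big = 0
    for _ in range(runs):
        heads_a = sum(random.random() < 0.5 for _ in range(n_flips))
        heads_b = sum(random.random() < 0.5 for _ in range(n_flips))
        if abs(heads_a - heads_b) >= observed_diff:
            at_least_as_big += 1
    return at_least_as_big / runs

# The printed fraction is the chance of seeing a gap this large between two identical
# "pages" purely by chance; the slide's "5 times out of every 100" corresponds to 0.05.
print(simulate())
```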
Level of confidence
How do I decide on the right level?
• Most common is 95% (i.e., a 5% chance you’ll think they’re different when they’re really not).
• There is no ‘magic’ to the 95% LoC; it is mainly a matter of convention or agreement.
• The onus for picking the ‘right’ level for your test is on YOU.
• Sometimes the tools limit you.
• 95% is seldom a “bad” choice.
• Higher confidence = longer test, and a bigger difference is needed for validity.
• Decide based on the level of risk of being wrong vs. the cost of prolonging the test.
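As a rough illustration of "higher confidence = longer test," the same sample-size approximation used in the earlier sketch can be evaluated at several confidence levels; the 5% baseline, 6% target, and 80% power are assumptions.

```python
# Illustrative only: required visitors per variation at different confidence levels,
# holding the other factors fixed (assumed 5% baseline, 6% target, 80% power).
from math import sqrt
from scipy.stats import norm

def n_per_variation(p1, p2, confidence, power=0.80):
    z_a = norm.ppf(1 - (1 - confidence) / 2)
    z_b = norm.ppf(power)
    p_bar = (p1 + p2) / 2
    return ((z_a * sqrt(2 * p_bar * (1 - p_bar))
             + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2) / (p1 - p2) ** 2

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence -> ~{round(n_per_variation(0.05, 0.06, level))} visitors per variation")
# Raising the confidence level raises the visitor requirement, i.e., a longer test.
```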
The iPod of validity tools
How marketers validate test results
Experiment – Background
Experiment ID: (Protected)
Location: MarketingExperiments Research Library
Research Notes:
Background: Consumer company that offers online brokerage services
Goal: To increase the volume of accounts created online
Primary research question: Which page design will generate the highest rate of conversion?
Test Design: A/B/C/D multi-factor split test
Experiment – Control Treatment
Control
• Heavily competing imagery and messages
• Multiple calls-to-action
[Page screenshot with rotating banner]
Experiment – Exp. Treatment 1
Treatment 1
• Most of the elements on the page are unchanged; only one block of information has been optimized
• Headline has been added
• Bulleted copy highlighted key value proposition points
• “Chat With a Live Agent” CTA removed
• Large, clear call-to-action has been added
[Page screenshot with rotating banner]
Experiment – Exp. Treatment 2
Treatment 2
• Left column remained the same, but we removed footer elements
• Long copy, vertical flow
• Added awards and testimonials in right-hand column
• Large, clear call-to-action similar to Treatment 1
[Page screenshot with rotating banner]
Experiment – Exp. Treatment 3
Treatment 3
• Similar to Treatment 2, except left-hand column width reduced even further
• Left-hand column has a more navigational role
• Still a long-copy, vertical-flow, single call-to-action design
[Page screenshot with rotating banner]
Experiment – All Treatments Summary Control Treatment 1 Treatment 2 Treatment 3
Experiment – Results
No Significant Difference: None of the treatment designs performed with conclusive results

Test Design    Conversion Rate    Relative Diff%
Control        5.95%              -
Treatment 1    6.99%              17.42%
Treatment 2    6.51%              9.38%
Treatment 3    6.77%              13.70%

What you need to understand: According to the testing platform we were using, the aggregate results came up inconclusive. None of the treatments outperformed the control with any significant difference.
Experiment Validity Threat
• However, we noticed an interesting performance shift in the control and treatments towards the end of the test.
• We discovered that during the test, there was an email sent that skewed the sampling distribution.
[Chart: Conversion rate (3%–19%) by test duration (Day 1–Day 11) for Control and Treatment 3. The treatment consistently beats the control until late in the test, when the control beats the treatment.]
Experiment Results
31% Increase in Conversions: The highest performing treatment outperformed the control by 31%

Test Design    Conversion Rate    Relative Diff%
Control        5.35%              -
Treatment 1    6.67%              25%
Treatment 2    6.13%              15%
Treatment 3    7.03%              31%

What you need to understand: After excluding the data collected after the email had been sent out, each of the treatments substantially outperformed the control with conclusive validity.
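If you want to sanity-check numbers like these rather than relying only on the platform's verdict, a two-proportion z-test is one way to do it. The visitor counts below are invented, since the deck does not publish them; only the shape of the calculation is the point.

```python
# Hypothetical sanity check: two-proportion z-test on a control-vs-treatment split.
# The visitor counts are assumptions; the conversion rates echo the table above.
from math import sqrt
from scipy.stats import norm

def two_proportion_z(conv_c, n_c, conv_t, n_t):
    p_c, p_t = conv_c / n_c, conv_t / n_t
    p_pool = (conv_c + conv_t) / (n_c + n_t)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    z = (p_t - p_c) / se
    p_value = 2 * (1 - norm.cdf(abs(z)))  # two-sided
    return z, p_value

# e.g., 5.35% vs. 7.03% conversion with an assumed 10,000 visitors per arm
print(two_proportion_z(535, 10_000, 703, 10_000))  # small p-value supports a real difference
```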
Validity Threats: The reason you can’t blindly trust your tools
• Sample Distortion Effect – the effect on a test outcome caused by failing to collect a sufficient number of observations
• History Effect
• Instrumentation Effect
• Selection Effect
Validity Threats: History effect
When a test variable is affected by an extraneous variable associated with the passage of time
Examples
• An email send that skews conversion for one treatment (as in the previous experiment)
• A newsworthy event that changes the nature of arriving subjects, whether temporarily or permanently (e.g., the 9/11 attack)
Validity Threats: History effect
Identification:
• Sniff test, but only to a point
• Did anything happen? REALLY HARD
Mitigation:
• Segmented reporting
• Test with longer time horizons, but only to a point
• Iterate, iterate, iterate, target, test
• Balance the cost of being wrong
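One way to act on the "segmented reporting" idea is simply to recompute results with and without the window affected by the extraneous event. A minimal pandas sketch; the file name, column names, and cutoff date are all assumptions.

```python
# Minimal sketch of segmented reporting after a suspected history effect.
# Assumes a hypothetical export with 'date', 'treatment', and 'converted' columns,
# and an assumed cutoff date for the distorting event (e.g., an email send).
import pandas as pd

def conversion_by_treatment(df: pd.DataFrame) -> pd.Series:
    return df.groupby("treatment")["converted"].mean()

visits = pd.read_csv("test_log.csv", parse_dates=["date"])  # hypothetical export
email_send_date = pd.Timestamp("2012-11-08")                # assumed cutoff

print("All data:\n", conversion_by_treatment(visits))
print("Before the email:\n", conversion_by_treatment(visits[visits["date"] < email_send_date]))
```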
Validity Threats: Instrumentation effect
When a test variable is affected by a change in the measurement instrument
Examples
• Short-duration response-time slowdowns (e.g., due to server load, page weight, or page-code problems)
• Splitter malfunction
• Inconsistent URLs
• Server downtime