The one weird trick for analyzing big data Eyeball the data early and often! John Lamping
A testimonial, from the Acknowledgements of "Long-Read Assembly Achieves Optimal Repeat Resolution": "GMK would also like to thank John Lamping of Human Longevity Inc., chatting with whom drove him to take a data-driven approach to this project."
"Looking at data? How boring! That's not my job!"
Data is a window onto your domain
"Look at queries!"
Eyeballing queries at Google [Australia]
Old Google:
    Australia - Wikipedia
    Australia travel guide - Wikitravel
    Tourism Australia
    Latest Australia news | The Guardian
    Australia.gov.au
New Google:
    Tourism Australia
    Austria - Lonely Planet
    Australia - Wikipedia
    Australia.gov.au
    Austria - Wikipedia
Eyeball the data early and often!
User session data
09:32:10 query [australia]
    Australia - Wikipedia
    Australia travel guide - Wikitravel
    Tourism Australia
    Latest Australia news | The Guardian
    Australia.gov.au
09:32:44 click position 3 Australia travel guide - Wikitravel
09:35:12 query [brisbane]
Eyeballing user sessions
[australia] (34)
    3: Australia travel guide - Wikitravel (2:28)
[brisbane] (20)
    4: Brisbane, Australia, Attractions - Tourism Australia (6:20)
[morton island] (20)
[moreton island] (40)
    3: Moreton Island - Visit Brisbane (3:12)
    8: Moreton island - Lonely Planet (50)
    7: Moreton Island National Park and Recreation Area (Department of ... (54:02)
[ayers rock] (30)
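The compact session view can be generated mechanically from the raw log. Here is a minimal sketch, assuming a hypothetical event record of the form (timestamp, kind, text, position); the slides do not show the real log schema, so the field names and the helpers `fmt` and `compact_session` are illustrative only. Each line ends with the time elapsed until the next event.

```python
from datetime import datetime

def fmt(seconds):
    """Format a duration as plain seconds, or m:ss once it reaches a minute."""
    seconds = int(seconds)
    if seconds < 60:
        return str(seconds)
    return f"{seconds // 60}:{seconds % 60:02d}"

def compact_session(events):
    """Render (time, kind, text, position) events as one line per action.

    Each line ends with the time until the next event. The last event has
    no successor, so it gets no duration.
    """
    times = [datetime.strptime(e[0], "%H:%M:%S") for e in events]
    lines = []
    for i, (_, kind, text, position) in enumerate(events):
        if i + 1 < len(events):
            gap = f" ({fmt((times[i + 1] - times[i]).total_seconds())})"
        else:
            gap = ""
        if kind == "query":
            lines.append(f"[{text}]{gap}")
        else:  # a click, indented under its query
            lines.append(f"    {position}: {text}{gap}")
    return lines

# Hypothetical event records transcribed from the raw session log.
events = [
    ("09:32:10", "query", "australia", None),
    ("09:32:44", "click", "Australia travel guide - Wikitravel", 3),
    ("09:35:12", "query", "brisbane", None),
]
session_lines = compact_session(events)
print("\n".join(session_lines))
```

Putting the durations next to each action is what makes long dwell times and quick reformulations (such as a corrected spelling) visible at a glance.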
Which data items to eyeball? Sample from the ones that matter.
Which data items to eyeball? Sample proportionally to how much they matter.
● by difference
● by volume
● by value
● ...
Selecting A/B testing differences A A B B A A A B A A B A B B A A B B B A A A B A B A B B B B
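One reading of sampling "by difference" is to restrict attention to the items whose output differs between the two arms and then sample from those. This is a rough sketch under that assumption; the function names and the per-query results below are hypothetical and not taken from the talk.

```python
import random

def differing_items(results_a, results_b):
    """Return the keys whose outputs differ between arm A and arm B, sorted."""
    shared = results_a.keys() & results_b.keys()
    return sorted(key for key in shared if results_a[key] != results_b[key])

def sample_differences(results_a, results_b, k, seed=0):
    """Pick up to k differing keys uniformly at random for side-by-side review."""
    diffs = differing_items(results_a, results_b)
    return random.Random(seed).sample(diffs, min(k, len(diffs)))

# Hypothetical data: query -> top result under each arm.
arm_a = {
    "australia": "Australia - Wikipedia",
    "brisbane": "Visit Brisbane",
    "moreton island": "Moreton Island - Visit Brisbane",
}
arm_b = {
    "australia": "Tourism Australia",
    "brisbane": "Visit Brisbane",
    "moreton island": "Moreton Island - Visit Brisbane",
}

for query in sample_differences(arm_a, arm_b, k=5):
    print(f"[{query}]  A: {arm_a[query]}  |  B: {arm_b[query]}")
```

Items that behave identically in both arms carry no information about the change, so leaving them out keeps every eyeballed example relevant.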
Eyeballing a weighted sample
Sample    Workers   Company
Emily     55,000    Rio Tinto
Jake      15,000    Telstra
Nick      45,000    Commonwealth Bank
Olivia    22        The Gantry restaurant
Each sample is equally important.
Eyeballing a weighted sample
Workers   Company
55,000    Rio Tinto
45        The Paddington
122       Peppers Gallery Hotel
5         Melbourne Dry Cleaners
Each sample is equally important.
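The contrast between the two samples can be reproduced with a weighted draw. The sketch below uses the worker counts shown on the slides; the helper `weighted_sample`, the sample size, and the seed are assumptions made for illustration. Weighting by workers corresponds to picking workers at random, while a constant weight corresponds to picking companies at random.

```python
import random

def weighted_sample(items, weight, k, seed=0):
    """Draw k items with replacement, with probability proportional to weight(item)."""
    rng = random.Random(seed)
    weights = [weight(item) for item in items]
    return rng.choices(items, weights=weights, k=k)

# (company, workers) pairs, with the counts shown on the slides.
companies = [
    ("Rio Tinto", 55_000),
    ("Telstra", 15_000),
    ("Commonwealth Bank", 45_000),
    ("The Gantry restaurant", 22),
    ("The Paddington", 45),
    ("Peppers Gallery Hotel", 122),
    ("Melbourne Dry Cleaners", 5),
]

by_workers = weighted_sample(companies, weight=lambda c: c[1], k=1000)
by_company = weighted_sample(companies, weight=lambda c: 1, k=1000)

big = {"Rio Tinto", "Telstra", "Commonwealth Bank"}
share_big_by_workers = sum(name in big for name, _ in by_workers) / len(by_workers)
share_big_by_company = sum(name in big for name, _ in by_company) / len(by_company)
print(f"large employers, weighted by workers: {share_big_by_workers:.2f}")
print(f"large employers, weighted by company: {share_big_by_company:.2f}")
```

Which weighting is right depends on the question: a question about the typical worker calls for the first, a question about the typical company for the second.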
What do we do when we should be eyeballing?
What does a new Google engineer do? Tweak parameters to optimize metrics. ● Revenue ● Clicks
Revenue?
Revenue? Google Search Yahoo! Search Australia Australia Buy Australia on Ebay Australia - Wikipedia Junky ads The Vegemite store Tourism Australia Australia - Wikipedia Australia travel guide Tourism Australia Australia - The World Factbook Australia travel guide Australia.gov.au
Clicks? The one weird trick for analyzing big data
The lure of metrics Metrics capture only a part of the picture. 20 trees
Watch out for accidental patterns. 4 4 4 6 9 3 2 1 6 3 9 8 8 4 1 1 6 8 4 1 3 5 5 7 7 3 0 3 0 8
A sad tale of not eyeballing data often enough
I was good I eyeballed the documents. I eyeballed the hierarchy.
Document hierarchy data
❏ attendance faith leader prayer finances
❏ attendance faith priest church bible
❏ faith bible leader torah synagogue
Eyeballing a document hierarchy
word significance = frequency * difference from parent node
                  = frequency * log(frequency / frequency in parent)
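The score frequency * log(frequency / frequency in parent) can be sketched as follows. This assumes that "frequency" means relative frequency within a node, and that the parent's word list includes the words of its children; the slide states neither. The function names and the toy word counts are hypothetical.

```python
import math
from collections import Counter

def relative_frequencies(words):
    """Map each word to its share of all word occurrences in the list."""
    counts = Counter(words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def word_significance(node_words, parent_words):
    """Score each word in a node as f * log(f / f_parent).

    Words absent from the parent are skipped (assumption: the parent
    aggregates its children, so this should not normally happen).
    """
    node = relative_frequencies(node_words)
    parent = relative_frequencies(parent_words)
    return {
        word: f * math.log(f / parent[word])
        for word, f in node.items()
        if word in parent
    }

# Toy counts, invented for illustration.
parent_words = ["faith"] * 4 + ["church"] * 2 + ["torah"] * 2
node_words = ["faith"] * 2 + ["church"] * 2

scores = word_significance(node_words, parent_words)
for word, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{word:8s} {score:.3f}")
```

A word that is as common in the node as in its parent scores zero, so labels built from the top-scoring words show what distinguishes a node rather than what it shares with its siblings.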
Eyeballing a document hierarchy
❏ faith prayer minister church priest
❏ bible minister church pope jesus
❏ synagogue muslim torah temple kosher
I was good mostly I eyeballed the documents. I eyeballed the hierarchy. I wrote a large scale quality test metric. Whenever a change reduced the quality metric, I fixed it. But I didn't eyeball the difference to the hierarchy.
Eyeball the data early and often! When something changes, look at the data again.
Two plausible alternatives Catholic Protestant Other Christian Other Eyeball the data early and often.
DNA sequencing
DNA sequencing G A G G G T G C T G T A G C C C A T T T G T G A G G T T G G C A A G G T C C C A T T T G
DNA sequencing G G T T G G C A A G G T C C C A T T T G G G T T G C C A A G G A C C C A T T T G G G T T G C A A G GGT C C C A T T T G G G T T G G C A A G G G A T A A C G T A
Our code's task: T T G G C A A G G T T G G A A A G G
Sequences show variants A G A G C C C A T A T T T A G G C G C T A G T A C T T G T G C C A A G A C T G G A G A T A G C A A C C A A C A T T G G A A G G T C C A A
Not working We ran it against data with known variants. It found most of them. But it missed many of them.
What to do? Print statements? Tracing? Tweak some parameters!
Eyeball the data early and often! When your analysis sees something, eyeball data for a few examples of it.
Some data
Sequence 4205
Position 462 T
Position 463 T
Position 464 G
Position 465 G
Position 466 C
Position 467 A
Position 468 A
Position 469 G
Position 470 G
More data
Sequence 2602
Position 144 T
Position 145 T
Position 146 A
Position 147 G
Position 148 C
Position 149 A
Position 150 A
Position 151 G
Position 152 G
Would you see the problem? Eyeball the data early and often!
A little formatting to support visualization
Sequence 4205 T T G G C A A G G
Sequence 2602 T T A G C A A G G
Sequence 4403 T T G G C A A A G
Sequence 0605 T T G G C A A G G
Sequence 3878 T T G G C A G G A
Sequence 4138 T T G G C A A G G
Sequence 4942 A T T G C A A G G
Sequence 1319 T T G G T A A G G
Sequence 2251 T T G G C A A G G
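Going from one record per position to one row per sequence takes only a few lines. The sketch below assumes the reads are already aligned column for column and have equal length. The record layout, the function names, and the start position used for sequence 4403 are hypothetical. The `mark_matches` option, which replaces bases that agree with the column majority by a dot, is an extra that the slides do not show.

```python
from collections import Counter, defaultdict

def rows_from_records(records):
    """Group (sequence_id, position, base) records into one base string per sequence."""
    by_sequence = defaultdict(list)
    for sequence_id, position, base in records:
        by_sequence[sequence_id].append((position, base))
    return {
        sequence_id: "".join(base for _, base in sorted(pairs))
        for sequence_id, pairs in by_sequence.items()
    }

def format_rows(rows, mark_matches=False):
    """One line per sequence; optionally show bases that match the majority as '.'."""
    consensus = "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*rows.values())
    )
    lines = []
    for sequence_id, bases in rows.items():
        if mark_matches:
            bases = "".join("." if b == c else b for b, c in zip(bases, consensus))
        lines.append(f"Sequence {sequence_id} " + " ".join(bases))
    return lines

def records_for(sequence_id, start, bases):
    """Build per-position records for one read (helper for this example only)."""
    return [(sequence_id, start + i, base) for i, base in enumerate(bases)]

records = (
    records_for("4205", 462, "TTGGCAAGG")
    + records_for("2602", 144, "TTAGCAAGG")
    + records_for("4403", 100, "TTGGCAAAG")  # start position is made up
)
rows = rows_from_records(records)
plain = format_rows(rows)
marked = format_rows(rows, mark_matches=True)
print("\n".join(plain))
print("\n".join(marked))
```

Once the reads are stacked, a column that disagrees with its neighbours stands out visually, which the one-position-per-line listing hides.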
An expert's visualization
Eyeballing data reveals ...
Eyeballing data reveals ... the best bug of my career.
Eyeball the data early and often! Eyeball the data early and often! Eyeball the data early and often!
Have fun doing it!