
Youtube Revisited: On the Importance of Correct Measurement Methodology
Ossi Karkulahti, Jussi Kangasharju
University of Helsinki
www.helsinki.fi/yliopisto

Introduction
• Measuring large systems is challenging
• Full system analysis is expensive -> sampling
• The way sampling is conducted affects the results
• Ideally a random and representative sample
• Technological limitations may skew the sampling process
• A biased sample may yield incorrect conclusions
• Could also affect any derivative work
• We will show the effects of three different sampling methods on YouTube

Motivation
• Previous YouTube video metadata collection has relied on:
  • selecting videos belonging to certain categories
  • crawling related videos
  • using most recent videos
• We argue that all these methods lead to a biased sample
• The results are not representative in all aspects
• Other work bases its assumptions on these results

Our Contributions
• We have collected three datasets with three methods
• We compare the methods for collecting YouTube video metadata
• We demonstrate the differences in various metrics between the different datasets

Data Collection
• We have collected metadata by three different methods:
  1. Most recent videos (MR)
  2. Related videos (BFS)
  3. Random string (RS)
• A fourth method is to use videos from a certain category, which is obviously biased
  • M. Cha, H. Kwak, P. Rodriguez, Y.-Y. Ahn, and S. Moon. I tube, you tube, everybody tubes: Analyzing the world’s largest user generated content video system. IMC, 2007.

1. Most Recent Videos (MR)
• Periodically collect metadata of the most recent videos
• Included information: video ID, view count, length, category, publish date etc.
• Obviously limited to new videos
• Previously used by, e.g.:
  • X. Cheng, J. Liu, and C. Dale. Understanding the characteristics of internet short video sharing: A youtube-based measurement study. Multimedia, IEEE Transactions on, 2013.
  • G. Szabo and B. A. Huberman. Predicting the popularity of online content. Communications of the ACM, 2010.

2. Related Videos (BFS)
• Select a video ID, ask for its related videos, then for the related videos of all those videos, and so on
• We limited related videos to 50 per video
• In theory, one seed yields ~125,000 videos (50x50x50)
• The number of unique videos is lower, because the related videos overlap
• Can be seen as similar to breadth-first search (BFS)
• Fast: most of the time one query returns metadata of tens of videos
• X. Cheng, J. Liu, and C. Dale. Understanding the characteristics of internet short video sharing: A youtube-based measurement study. Multimedia, IEEE Transactions on, 2013.
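A minimal sketch of the related-videos crawl described on this slide, assuming a hypothetical callable `get_related` (video ID -> list of related IDs) that stands in for the actual API query. The function name, the depth limit of 3 (read off the 50x50x50 estimate) and the toy graph are illustrative assumptions, not the authors' crawler.

```python
from collections import deque

def bfs_crawl(seed, get_related, max_depth=3, max_related=50):
    """Breadth-first crawl of the related-video graph, starting from `seed`.

    `get_related` is a hypothetical stand-in for the API call that returns
    the related video IDs of one video; it is not a real YouTube function.
    """
    seen = {seed}
    queue = deque([(seed, 0)])
    while queue:
        video_id, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand videos at the last level
        for related_id in get_related(video_id)[:max_related]:
            if related_id not in seen:  # related lists overlap, so skip repeats
                seen.add(related_id)
                queue.append((related_id, depth + 1))
    return seen

# Toy related-video graph (made-up IDs) with overlapping related lists.
toy_graph = {"a": ["b", "c"], "b": ["c", "d"], "c": ["a", "d"], "d": ["e"]}
crawled = bfs_crawl("a", lambda v: toy_graph.get(v, []), max_depth=2)

# Theoretical size of the third level with 50 related videos per video.
third_level_bound = 50 ** 3
```

The `seen` set is what makes the number of unique videos fall below the theoretical bound: every video that appears in more than one related list is counted only once.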

3. Random Strings (RS)
• Zhou et al. have used a similar method to estimate YouTube’s size (“Counting YouTube Videos via Random Prefix Sampling”, IMC 2011)
• Generate a random character string and ask the API to return videos whose IDs include the string
• Characters: ‘a-Z’, ‘0-9’, ‘-’, ‘_’; four-letter strings work the best
• On average a random string matched 6.9 video IDs
• For an unknown reason, the returned IDs include ‘-’

3. Random Strings (RS)
A random string w57j would match and return metadata for the following videos:
• W57J-21gSSo
• XcY-W57J-Uo
• w57j-VVNAg0
• W57J-msuors
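The random-string method can be sketched as follows. The alphabet and the four-character length come from the slides; the helper names `random_query` and `matches` are hypothetical, and the case-insensitive containment test is only an assumption inferred from the w57j example, not a description of how the API actually matches IDs.

```python
import random
import string

# Alphabet named on the slide: a-z, A-Z, 0-9, '-' and '_'.
ALPHABET = string.ascii_letters + string.digits + "-_"

def random_query(length=4, rng=random):
    """Return a random string of `length` characters drawn from ALPHABET."""
    return "".join(rng.choice(ALPHABET) for _ in range(length))

def matches(query, video_id):
    """Case-insensitive containment test (an assumption based on the example)."""
    return query.lower() in video_id.lower()

# The four IDs listed on the slide for the query "w57j".
example_ids = ["W57J-21gSSo", "XcY-W57J-Uo", "w57j-VVNAg0", "W57J-msuors"]
matched = [vid for vid in example_ids if matches("w57j", vid)]

query = random_query()  # a fresh four-character query string
```

In a real collection run, each generated query would be sent to the search API and the returned metadata stored; that request is omitted here because its exact form is not given in the slides.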

Datasets

Dataset  Method              Time period           N
MR-09    Most recent videos  Summer 2009           9,405
MR-11    Most recent videos  Summer 2011           8,766
MR-14    Most recent videos  Late 2013-early 2014  10,000
RS       Random ID           Early 2014            ~5 million
BFS      Related videos      Early 2014            ~5 million

Results
• Popularity
• Views
• Age
• Categories
• Length

Popularity
• RS and BFS: very different view count distributions
• BFS has a two-part distribution, with a quick-dropping tail
• RS follows Zipf more closely, with a truncated tail
• BFS data seems to over-estimate view counts
• RS: top 10 -> 5 % of all views, top 1,000 -> 43 %, top 10,000 -> 74 %
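The "top k -> share of all views" figures are a concentration measure that can be computed from a list of per-video view counts. The sketch below shows one way to do it; the function name `top_k_share` is hypothetical and the view counts are made-up toy values, not numbers from the datasets.

```python
def top_k_share(view_counts, k):
    """Fraction of all views that belongs to the k most-viewed videos."""
    ranked = sorted(view_counts, reverse=True)
    total = sum(ranked)
    return sum(ranked[:k]) / total if total else 0.0

# Toy view counts (illustrative only).
toy_views = [100, 50, 25, 15, 10]
share_top1 = top_k_share(toy_views, 1)  # 100 / 200
share_top2 = top_k_share(toy_views, 2)  # 150 / 200
```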

Popularity after 30 days
• MR and BFS seem to over-estimate video popularity
• However, MR-09 resembles RS

Views
• The 5th percentile of BFS is higher than the median of RS and MR
• BFS view counts are at least one order of magnitude higher than the RS ones
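A comparison of this kind (a low percentile of one dataset against the median of another) can be sketched with a nearest-rank percentile. The `percentile` helper is hypothetical, and the two short lists are made-up toy values chosen only to illustrate the comparison; they are not taken from the BFS or RS datasets.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: the smallest value such that at least
    p percent of the data is less than or equal to it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Toy view counts (illustrative only).
toy_bfs_views = [1000, 2000, 5000, 10000, 50000]
toy_rs_views = [1, 5, 20, 100, 3000]

bfs_p5 = percentile(toy_bfs_views, 5)
rs_median = percentile(toy_rs_views, 50)
bfs_p5_exceeds_rs_median = bfs_p5 > rs_median
```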

Views
• The median, 5th and 95th percentiles for BFS and RS over eight years
• BFS’s median is most of the time two orders of magnitude higher than RS’s

Age Distribution
• BFS has fewer videos newer than two years, but a lot of very recent videos
• The drop in RS is an artifact of the method
• RS: 29 % of videos are newer than a year, and the majority are newer than two years

Categories (share of videos)
• Most videos in each dataset:
  • RS: People & Blogs (default category for an upload)
  • BFS: Music
  • MR: News & Politics

Categories (share of views)
• Distribution of number of views is more similar
• Music videos get most views

Popularity based on Category

Video Length
• RS and MR: most common length is 60 s or less
• BFS: most common length is 3-5 min (music videos?)
• All: videos 3-5 min long get most views

Summary of the Methods

BFS                                  MR                     RS
Tends to over-estimate some metrics  Over-estimates views   Most ‘reliable’
Fast, up to 100 per query            Slow                   Not that fast, ~7 per query
Mostly popular music videos?         Limited to new videos  Mysterious ‘-’ curiosity
                                     Mostly news clips?

Conclusion 1/2
• We have used YouTube as an example, using three data collection methods
• The datasets differ in many key metrics that have been used in past research (MR, BFS)
• RS has not previously been used in this manner
• Differences between RS and the others raise questions about the general applicability of the previous results
• We believe that RS produces a representative sample

Conclusion 2/2
• As the BFS dataset demonstrates, even large datasets are not immune to bias introduced by the method
• The data collection method can have a significant impact on the results
• Whatever sampling method is selected, be aware of its properties and weaknesses
• Be careful when adopting results from earlier work
• Time to accept more reappraisal work?

Questions?
