Skip Blocks: Reusing Execution History to Accelerate Web Scripts
Sarah Chasins (University of California, Berkeley)
Rastislav Bodik (University of Washington)
OOPSLA, Oct 25, 2017, Vancouver
[Figure: coders and non-coders both care about data, today and tomorrow.]
What web data collection tools do we have?

Tools that require users to:
● copy & paste
● hire a human to reverse engineer target webpages
● hire a coder to use one of these: JS, AJAX, the DOM, ...
● Helena (our tool!): web automation for end users, both coders and non-coders
[System diagram: a demonstration goes into the Helena PBD tool, which produces a Helena program; an editor (where skip blocks are added) turns it into program'; program' runs in the web browser. The demonstration-to-program PBD tool is our prior work.]
Let's PBD a web automation script!

Goal: scrape all papers by the top 10,000 CS authors from Google Scholar.

Input: a demonstration. Output: a script that produces rows like:
Author 1 | Paper A | 1998
Author 1 | Paper B | 2007
Author 2 | Paper C | 2012
Author 2 | Paper D | 2009
Author 3 | Paper E | 2014
Author 3 | Paper F | 2006
...
[Same system diagram as before. Today we're asking: how can this go wrong, and how can we handle it?]
How is rent changing across Seattle neighborhoods?
[User experience figure: new listings have pushed the last three listings from page 1 onto page 2, so the script kept scraping duplicates! Kept losing the network connection, wasting 10+ hours.]
How is the minimum wage affecting Seattle restaurants? (Department of Economics, University of Washington)
Can we design a better carpool matching algorithm? (Civil & Environmental Engineering, University of Washington)
How do charitable foundations communicate with supporters? (Evans School of Public Policy & Governance, University of Washington)
Problem Statement

(1) Failures: What happens when the network fails, the server fails, or the computer fails? When we lose our session with the server and have to start over?
(2) Data changes: What happens when the server gives the client pages produced from different (potentially conflicting) reads of the underlying data store?

These are not client-side problems → the scraping script can't prevent them; it must handle them.
[Same system diagram. Skip blocks, added in the editor, are our solution.]
Solution

Failures and data changes: on the surface, they seem like very different problems. "Just don't redo the same work you've already done!" But what's the 'same' work? After all, data changes...

Our answer: the skip block! The user can:
● tell us what makes objects the same
● associate the code that operates on an object

If the object is already committed (memoized), skip the block; else, run the block. No reverse engineering! Reasoning is about output data.
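To make the semantics concrete, here is a minimal Python sketch (hypothetical helper names, not the actual Helena runtime): a skip block memoizes on the user-chosen key attributes and skips the block body whenever an object with the same key has already been committed.

    # Sketch of skip-block semantics. 'committed' is in-memory here; the real
    # system persists it across runs.
    committed = set()

    def skip_block(key_attrs, body):
        # key_attrs: the attributes the user says make two objects "the same"
        if key_attrs in committed:
            return                   # duplicate: skip the whole block
        body()                       # run the code that operates on the object
        committed.add(key_attrs)     # commit only after the body completes

    # Skip an author we've already scraped, keyed on (name, institution).
    skip_block(("A. Author", "Some University"), lambda: print("scraping..."))
    skip_block(("A. Author", "Some University"), lambda: print("never runs"))

Note that the commit happens only after the body finishes, so work interrupted by a failure is not marked done.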
Recovering from Failures

A text-ified representation of the block language:

for (aRow in p1.authors){              // for each author, scrape stuff about the author
  scrape aRow.author_name
  scrape aRow.author_institution
  p2 = click aRow.author_name          // click the author
  for (pRow in p2.papers){             // for the author's papers, scrape paper stuff
    scrape pRow.title
    scrape pRow.citations
    output([aRow.author_name, pRow.title, pRow.citations])   // add a row of output with the author and paper info
  }
}
Recovering from Failures

Key attributes: is the current author the same as another we've already seen?

for (aRow in p1.authors){
  skipBlock(Author(aRow.author_name, aRow.author_institution)){   // key attributes
    scrape aRow.author_name
    scrape aRow.author_institution
    p2 = click aRow.author_name
    for (pRow in p2.papers){
      scrape pRow.title
      scrape pRow.citations
      output([aRow.author_name, pRow.title, pRow.citations])
    }
  }   // block: the code that operates on the author object
}

If ever, in any run, the script has committed an object with the same key attributes, it skips the block.
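"In any run" means the commit log must survive crashes and restarts. One way to persist it, sketched here with an append-only JSON-lines file (the file format and names are our assumption; the actual system's storage layer may differ):

    import json, os

    LOG = "commit_log.jsonl"   # hypothetical file name

    def load_committed():
        # a restarted script sees every commit from every prior run
        if not os.path.exists(LOG):
            return set()
        with open(LOG) as f:
            return {tuple(json.loads(line)) for line in f}

    def commit(key_attrs):
        # append-only: a crash mid-run still preserves earlier commits
        with open(LOG, "a") as f:
            f.write(json.dumps(list(key_attrs)) + "\n")

    committed = load_committed()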
Recovering from Failures

for (aRow in p1.authors){
  skipBlock(Author(aRow.author_name, aRow.author_institution)){
    scrape aRow.author_name
    scrape aRow.author_institution
    p2 = click aRow.author_name
    for (pRow in p2.papers){
      scrape pRow.title
      scrape pRow.citations
      output([aRow.author_name, pRow.title, pRow.citations])
    }
  }
}

[Timeline figure: for each author (a1, a2, ...), a run of page loads (p1, p2, ..., p42) ends in a skip-block commit. Always at least one page load per author (to load the paper list), but often ≈ 40.]
Recovering from Failures

[Same timeline figure, with an external failure marked before the current author's commit. Recovery without the author skip block redoes every prior page load. Recovery with the author skip block "fast-forwards" over prior work: 10 authors per page, so skipping 10 authors costs just 1 page load, while each skipped author avoids ≈ 40 page loads; by this failure point, 200 page loads are skipped. Only the work after the last commit is redone.]
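Recovery itself then needs no special-case logic. A sketch under assumed names (the retry wrapper and exception handling are ours, not the paper's): on any failure, simply rerun the script from the top, and the skip blocks fast-forward over every committed author.

    def run_with_recovery(run_script, max_attempts=100):
        # Rerunning from the first author is cheap: skip blocks turn the
        # replay into fast-forwarding, costing ~1 author-list page load per
        # 10 skipped authors instead of ~40 paper-page loads per author.
        for _ in range(max_attempts):
            try:
                run_script()
                return
            except Exception:   # e.g., dropped connection, server error
                continue        # committed work is preserved; just retry
        raise RuntimeError("giving up after repeated failures")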
Nested Skip Blocks

[Figure: City 1 contains Restaurant A (Review i, Review ii), Restaurant B, ...; City 2 contains Restaurant C (Review iii, Review iv), Restaurant D, ...]

In authors vs. papers, authors is clearly the right level for the skip block. But here?
● Skip block only at city → scraping a whole city takes many hours, so scraping half a city also takes hours.
● Skip block only at restaurant → iterating through a city's restaurant list takes a long time, and now we have to go through all of Seattle and San Francisco before we can resume in the middle of Vancouver.
● Skip block at city & restaurant → adjustable-granularity skipping.
Nested Skip Blocks

for (aRow in p1.authors){
  skipBlock(Author(aRow.author_name, aRow.author_institution)){
    scrape aRow.author_name
    scrape aRow.author_institution
    p2 = click aRow.author_name
    for (pRow in p2.papers){
      skipBlock(Paper(pRow.title, pRow.year)){
        scrape pRow.title
        scrape pRow.citations
        output([aRow.author_name, pRow.title, pRow.citations])
      }
    }
  }
}

And the inner block may commit even if the outer doesn't, like a nested open transaction.
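A Python sketch of those open-transaction semantics (hypothetical names; 'committed' would be persisted in the real system). If a failure kills the run mid-author, the author commit is lost, but the papers committed so far stay committed and are skipped on the next run:

    from dataclasses import dataclass

    committed = set()

    def skip_block(key, body):
        if key in committed:
            return
        body()
        committed.add(key)   # an inner commit survives even if the outer never commits

    @dataclass(frozen=True)
    class Paper:
        title: str
        year: int

    def scrape_author(name, institution, papers):
        def author_body():
            for p in papers:
                # a crash mid-loop loses the Author commit, but not these:
                skip_block(("Paper", p.title, p.year),
                           lambda p=p: print(name, p.title, p.year))
        skip_block(("Author", name, institution), author_body)

    scrape_author("A. Author", "Some University",
                  [Paper("A Paper Title", 2017)])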
Refreshing a Dataset

for (aRow in p1.authors){
  skipBlock(Author(aRow.author_name, aRow.author_institution), -∞){   // -∞ is the default staleness threshold
    scrape aRow.author_name
    scrape aRow.author_institution
    p2 = click aRow.author_name
    for (pRow in p2.papers){
      skipBlock(Paper(pRow.title, pRow.year)){
        scrape pRow.title
        scrape pRow.citations
        output([aRow.author_name, pRow.title, pRow.citations])
      }
    }
  }
}

-∞ means skip any duplicate we've ever seen. If we're scraping once a week, we don't want to revisit each author. But after a year, maybe we should see what's new.
Refreshing a Dataset

for (aRow in p1.authors){
  skipBlock(Author(aRow.author_name, aRow.author_institution), now - 365*24*60){
    scrape aRow.author_name
    scrape aRow.author_institution
    p2 = click aRow.author_name
    for (pRow in p2.papers){
      skipBlock(Paper(pRow.title, pRow.year)){
        scrape pRow.title
        scrape pRow.citations
        output([aRow.author_name, pRow.title, pRow.citations])
      }
    }
  }
}

We also have logical time (ex: skip duplicates from the last 3 runs).

Bonus! In addition to failure recovery and data redundancy handling, we get incremental/longitudinal scraping!
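A Python sketch of the staleness check (assumed names; timestamps in seconds here, whatever unit Helena actually uses): a commit now records when it happened, and a skip block only skips if the commit is fresh enough.

    import time

    committed_at = {}   # key -> commit timestamp; persisted across runs in reality

    def skip_block(key, body, threshold=float("-inf")):
        # Skip only if the key was committed at or after 'threshold'.
        # The default -inf skips any duplicate we've ever seen.
        ts = committed_at.get(key)
        if ts is not None and ts >= threshold:
            return
        body()
        committed_at[key] = time.time()

    # Weekly scrape: skip authors committed within the last year, revisit older ones.
    one_year_ago = time.time() - 365 * 24 * 60 * 60
    skip_block(("A. Author", "Some University"),
               lambda: print("re-scraping stale author"),
               threshold=one_year_ago)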
Demo time!
Benchmark Suite

7 long-running web scraping tasks (drawn from "Need web data?" requests).
Ex: for the 50 top foundations, scrape the last 1,000 tweets they tweeted.
Ex: scrape all Seattle apartment listings from Craigslist.
Data Change (within one run)

Measured the full execution time of:
● the script with skip blocks
● the script without skip blocks
Chart shows the speedup from using skip blocks (higher is better).

[Chart: speedups per benchmark, from 0.9x to 1.7x.]
1.7x: skipping one ad skips one page load, and pagination gives us so many duplicate ads!
0.9x: all overhead, no gains. Skipping a tweet doesn't skip any page loads!
Data Re-Scraping (within multiple runs)

Executed the script with skip blocks. One week later, measured the full execution time of:
● the script with skip blocks
● the script without skip blocks
Chart shows the speedup from using skip blocks (higher is better).

[Chart: speedups per benchmark, from 1.9x to 49x.]
49x: lots of benefit from last week's data. The Gates Foundation doesn't post that many new tweets in a week!
1.9x: little additional benefit from last week's data; ≈ the same speedup as the first run.
Failure Recovery

For each benchmark, for three failure locations, measured the execution time of:
● a script that recovers by naive restarting
● a script that recovers by skip-block fast-forwarding
Normalized by the execution time of a script that doesn't encounter failures (lower is better).

[Chart: with skip-block fast-forwarding, execution times stay close to the no-failure baseline. A failure during high churn → see new data → slower recovery.]
Overall, performance close to ideal!
User Study

[Screenshots: the UI in the Helena tool; the UI in the online survey.]