A Large-scale Study of Automated Web Search Traffic
Greg Buehrer¹, Jack Stokes², and Kumar Chellapilla¹
¹Microsoft Live Labs, ²Microsoft Research
Problem Statement
• Goal
  – Distinguish search queries as either automated or issued by a human
• Motivation
  – Improve QoS for humans
  – Increase/improve data for relevance
• Caveats
  – Currently next-day analysis
  – Requires sessionization
  – Currently on a per-day basis; could analyze over longer time periods
[Figure: cumulative query count (0–5000) vs. time of day (hours)]
Why automate query traffic?
• To collect information
  – About the search engine
    • SEOs query to check URL presence and rankings, and to find low-result queries
  – For personal gain
    • Easy stock quotes, business news, etc.
    • Scrape for email addresses, phone numbers, good spam queries
• To commit click fraud
  – Click on ads of competitors
Exploring the Query Logs
Top Queries of the Day
1. “”
2. “google”
3. “yahoo”
4. “fire+department+-location%3ajp”
5. “youtube”
…
11. “microsoft”
Search Traffic Flow
[Diagram: traffic sources feeding into the search engine]
• Browsers: IE, Safari, Firefox, etc.
• Custom programs: C# WebRequests, browser automation, etc.
• Search pages: MSN, Live.com, Local Shopping, etc.
• Our applications: Club Live, MSN Live, etc.
• 3rd-party applications
Query Stream Classification Process
Log queries → Sessionize queries into users → Calculate features for users → Classify users
(Focus of this paper)
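A minimal sketch of the sessionization step, assuming each raw log row carries a cookie-based user ID and a timestamp (the field names here are hypothetical, not the actual log schema):

```python
from collections import defaultdict

def sessionize(log_rows):
    """Group raw query-log rows into per-user sessions keyed by cookie ID.

    Each row is assumed to be a dict with at least 'user_id' and 'timestamp'
    fields; real rows would also carry query text, clicks, IP address, etc.
    """
    sessions = defaultdict(list)
    for row in log_rows:
        sessions[row["user_id"]].append(row)
    # Order each user's requests chronologically so that rate and
    # periodicity features can be computed downstream.
    for rows in sessions.values():
        rows.sort(key=lambda r: r["timestamp"])
    return sessions
```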
Feature Set
• Physical limits – time and space bound
  – Volume
    • Number of queries, clicks, etc. (sustained)
  – Rate
    • Maximum interactions in a small time frame
  – Space
    • Distinct locations in a given time frame
• Behavioral signals
  – Entropy/chaos bound
    • Entropy of keywords, lengths, temporal ordering, periodicity, query category
  – Signatures
    • Spam score of keywords, adult score of keywords, etc.
    • CTR, dwell time, etc.
    • Blacklisted IPs, user agents, locales, etc.
Features are simple calculations and require little time to compute over the full data.
Data Set
• First we sampled 100M requests (all requests for a chosen user are included, using cookies to identify users)
• Then we pruned to users with at least 5 interactions, totaling 46M requests
PL: Volume
• Total requests, queries, clicks, keywords, etc.
  – Most discriminating feature class
  – One user queried for “mynet” 12,061 times
[Histogram: number of users (log scale) vs. number of requests (0–480, plus More)]
PL: Rate
• Number of events per (small) time period
  – Requests, clicks
[Histogram: number of users (log scale) vs. max number of requests per 10-second period]
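A sketch of the rate feature, assuming it is the maximum number of requests falling in any 10-second window; the exact window semantics are an assumption, not taken from the slide:

```python
from bisect import bisect_right

def max_requests_per_window(timestamps, window=10.0):
    """Maximum number of requests in any sliding window of `window` seconds.

    `timestamps` is an iterable of request times in seconds (an assumed
    input format). Bots often burst far above human rates.
    """
    ts = sorted(timestamps)
    best = 0
    for i, start in enumerate(ts):
        # Index of the first timestamp strictly after start + window.
        j = bisect_right(ts, start + window, lo=i)
        best = max(best, j - i)
    return best
```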
PL: Geography
• Distinct IP addresses, considering only the first two octets
  – One user had 38 different cities in 4 hrs (428 queries)
Example:
  4:18:34 AM  IP1  Charlottesville, Virginia
  4:18:47 AM  IP2  Tampa, Florida
  4:18:52 AM  IP3  Los Angeles, California
  4:19:13 AM  IP4  Johnson City, Tennessee
  4:22:15 AM  IP5  Delhi, Delhi
  4:22:58 AM  IP6  Pittsburgh, Pennsylvania
  4:23:03 AM  IP7  Canton, Georgia
  4:23:17 AM  IP8  Saint Peter, Minnesota
[Histogram: number of users (log scale) vs. number of IP addresses, first two octets (0–96)]
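A minimal sketch of the geography feature, counting distinct first-two-octet (/16) prefixes seen for one user; the dotted-quad string input format is an assumption:

```python
def distinct_ip_prefixes(ip_addresses):
    """Count distinct first-two-octet (/16) prefixes in a user's requests.

    Assumes IPv4 addresses given as dotted-quad strings, e.g. '131.107.1.2'.
    A high count in a short time frame suggests a distributed botnet.
    """
    prefixes = {".".join(ip.split(".")[:2]) for ip in ip_addresses}
    return len(prefixes)
```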
B: Click-through Rate
• Histograms for light and heavy users
[Figure: CTR histograms for users with 5+ requests vs. users with 50+ requests]
• Histograms show many more zero-click users when the volume is high
  – Rank checking does not require a click
  – Scraping top URLs for a query does not require a click
B: Keyword, Query Entropy
• Calculated as informational entropy, where the token is either a keyword or the whole query
Example:
  06:20:59 2007: financial+trade+cycle
  06:24:14 2007: blue+letter+bible
  06:25:30 2007: should+know+before
  06:27:40 2007: individuals+cannot+adequately
  06:30:23 2007: representing+several+bareboat+companies
  06:31:52 2007: following+provisions+that
  06:33:22 2007: post+jobs+with+careerbuilder
  06:34:38 2007: edit+keyboard+shortcuts
  06:35:15 2007: ways+consumer+knowledge+test
  06:36:28 2007: like+writing+good+code
  06:39:19 2007: save+money+with+road+runner
  06:41:00 2007: featured+inquiry+logo+when+does
  06:43:03 2007: asylum+lake+controversy
  06:44:40 2007: introduced
  06:45:11 2007: abdominal+wall+pathway
  06:46:51 2007: calendars
  06:47:44 2007: free+press+release+distribution
  06:49:25 2007: early+double+knits+were
  07:03:27 2007: serves+audiobook+professionals
[Histogram: number of users (log scale) vs. keyword/query entropy]
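A sketch of the informational (Shannon) entropy over a user's tokens, where a token is either a single keyword or the whole query string; the choice of a base-2 logarithm is an assumption:

```python
import math
from collections import Counter

def token_entropy(tokens):
    """Shannon entropy (bits) of the empirical token distribution.

    `tokens` is the list of keywords (or whole query strings) issued by one
    user; very low entropy indicates repetitive, likely automated traffic.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```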
B: Alphabetical Ordering
• Some users issue their queries in alphabetical order
Example 1:
  2102manpuku, 2103manpuku, 2104manpuku, …
Example 2:
  http astro stanford …
  http adulthealth lo …
  http www bigdrugsto …
  http www cheap diet pills online …
  http www generic vi …
  http contrib cgi cl …
  http www e insaat b …
  http buy tramadol o …
  http cialis raulserrano info ciaxlis …
  http englishgrad cas ilstu edu files …
[Histogram: number of users (log scale) vs. alphabetical-ordering score (−1 to 1)]
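The slide does not define the ordering score on the histogram's x-axis; one plausible realization, shown purely as an assumption, is a Kendall-tau-style correlation between the order in which queries were issued and their alphabetical order:

```python
from itertools import combinations

def alphabetical_ordering_score(queries):
    """Correlation in [-1, 1] between issue order and alphabetical order.

    +1 means queries were issued in strictly alphabetical order, -1 means
    strictly reverse order, and values near 0 are typical of human traffic.
    This particular statistic is an assumption, not the study's exact score.
    """
    concordant = discordant = 0
    for (_, a), (_, b) in combinations(enumerate(queries), 2):
        if a < b:
            concordant += 1
        elif a > b:
            discordant += 1
        # equal queries are ignored as ties
    pairs = concordant + discordant
    return 0.0 if pairs == 0 else (concordant - discordant) / pairs
```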
B: Spam & Adult Scores • A small dictionary of spam (or adult) keywords and weights is used (normalized sum) Example 1.E+05 Managing your internal communities based group captive convert video from 1.E+04 book your mountain resort agreement forms online 1.E+03 find your true love 1.E+02 products from thousands mtge market share slips 1.E+01 mailing list archives studnet loan bill 1.E+00 your dream major 0 0.03 0.06 0.09 0.12 0.15 0.18 0.21 0.24 0.27 0.3 0.33 More computer degrees from home free shipping coupon offers
B: Length Entropy
• Length of each keyword, length of each query
Example:
  nex pae cln intc tei eu3 wfr eem olv ssg sqi oj lqde nq igf trf cl ief nzd dax ewl rib xil bbdb nex csco
[Histogram: number of users (log scale) vs. length entropy (0–0.2)]
B: Query Periodicity
• Entropy of elapsed time between successive requests (or clicks, for dwell time)
  – Could also use an FFT
[Histogram: number of users (log scale) vs. periodicity entropy (0–2.4)]
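A sketch of the periodicity feature as the entropy of inter-arrival times between successive requests; bucketing the gaps into whole seconds is an assumed choice:

```python
import math
from collections import Counter

def interarrival_entropy(timestamps, bucket_seconds=1.0):
    """Entropy of elapsed times between successive requests.

    Timestamps are in seconds; gaps are bucketed (here into whole-second
    bins, an assumption) before computing Shannon entropy. Highly periodic
    bot traffic concentrates in a few bins and scores near zero.
    """
    ts = sorted(timestamps)
    gaps = [round((b - a) / bucket_seconds) for a, b in zip(ts, ts[1:])]
    if not gaps:
        return 0.0
    counts = Counter(gaps)
    total = len(gaps)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```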
B: Advanced Query Terms
• Scan each query for “title:”, “link:”, “url:”, etc. and keep a count of the total number of occurrences
[Histogram: number of users (log scale) vs. number of advanced query terms (1–190, plus More)]
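A minimal sketch of counting advanced-operator occurrences across a user's queries; the operator list below is illustrative, not the exhaustive set used in the study:

```python
# Illustrative operator list; the study mentions "title:", "link:", "url:", etc.
ADVANCED_OPERATORS = ("title:", "link:", "url:")

def advanced_term_count(queries):
    """Total occurrences of advanced query operators across a user's queries."""
    return sum(q.lower().count(op) for q in queries for op in ADVANCED_OPERATORS)
```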
Others
• Reputations
  – Use bags of values that represent blacklists (or whitelists) for particular fields
    • IP address
    • User agent
    • User ID (was previously tagged as automated)
    • Country code / locale
• CLR boost
  – % CLR gain afforded by user ID, day, etc.
• Ranks of the queries
Preliminary Classification Results
• Weka
  – 320 labeled data points
  – Not chosen randomly (active learner)
  – Search page entry points
  – Did not include reputation features

Top-ranked features:
  1. Query Count
  2. Query Entropy
  3. Max interval
  4. CTR
  5. Spam Score

Classifier results:
  Classifier    TP   TN   FP   FN   %
  Bayes Net     183  120  11    6   95
  Naïve Bayes   185  106  25    4   91
  AdaBoost      179  119  12   10   93
  Bagging       185  115  16    4   94
  ADTree        182  121  10    7   95
  PART          184  120  11    5   95
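The experiments above used Weka. Purely as an illustration, a roughly analogous experiment in Python with scikit-learn (a substitute toolkit, not what the authors used, and with placeholder data since the labeled set is not public) might look like:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

# X: one row per user with features such as query count, query entropy,
# max 10-second rate, CTR, and spam score; y: 1 = automated, 0 = human.
# Both are random placeholders here.
rng = np.random.default_rng(0)
X = rng.random((320, 5))
y = rng.integers(0, 2, size=320)

clf = AdaBoostClassifier(n_estimators=50)
scores = cross_val_score(clf, X, y, cv=10)
print(f"10-fold CV accuracy: {scores.mean():.2f}")
```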
Mixed Signals
• It is not uncommon to have automated traffic and human traffic on the same user ID
  – One user issued 6,534 queries; the first five (with 4 clicks) were
    • Pottery barn
    • Pottery barn kids
    • Pottery barn kids outlet
    • Pottery barn kids outlet store
    • Pier 1
  – Then 6,529 queries without a click (mostly blank)
Conclusion
• Feature set to distinguish human search query traffic from automated query traffic
  – Divided into two groups: physical limits and behavioral signals
  – Initial results suggest the features can be used to classify traffic effectively
Exploring the Query Logs: Future Work
• How many IP addresses have no cookies at all? 19.3M
• How many of these 19.3M have < 100 queries? 19.1M
• Can we sessionize these into users?
Questions?