Analysis of wide area user mobility patterns • Kevin Simler*, Steven E. Czerwinski†, Anthony Joseph • UC Berkeley • 2004/12/02 • * Now at MIT • † Now at Google
Motivation • We want to understand user behavior • In order to design better systems • In order to generate synthetic traces • In order to model user behavior • How can we capture user presence in the wide area? Web, IM, …, e-mail
Why e-mail? • E-mail is a widely used service • Users typically check e-mail first • Berkeley provides IMAP + web front end • Any Internet connection → e-mail access • E-mail reflects users’ Internet presence
Outline • Background • Analysis and results • User modeling • Future work • Summary
Trace characteristics • 31 days (May 2003) • Server from UC Berkeley EECS dept. • Regular IMAP plus web front end • 1004 active users, primarily professors, graduate students, and support staff • Tracked across different service providers
Building on previous work • Wireless campus studies: mobility on a campus; single service provider with homogeneous users (Tang & Baker, Kotz & Essien, Balazinska & Castro) • Metricom WLAN: mobility across/between cities; single service provider with more diverse users (Tang & Baker)
Trace data • Each entry in the trace includes: timestamp (seconds); request type (login, close, select, etc.); username; IP address
Preprocessing • We want user behavior • Trace records client application behavior (Outlook, Eudora, Thunderbird, etc.) • Primary difference: the client polls for new e-mail at regular intervals, with a fixed period per client that varies across clients
We filter client polling using a Fourier transform [Figure: timeline of client connections from a single user, each running from login to logout] • Use a Fourier transform to identify the polling period p • Identify sequences of connections separated by p and remove all but the first connection • Clump the remaining connections into user sessions, with a gap of > 15 minutes separating one session from the next • Now we have (roughly) a trace of user behavior
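The filtering steps above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the slides name the Fourier transform and the 15-minute gap, while the 60-second binning, the 60-second matching tolerance, the plain DFT (standing in for an FFT library), and all function names are hypothetical choices made for this example.

```python
import cmath

def dominant_period(timestamps, bin_seconds=60):
    """Estimate the polling period (seconds) of one user's connection start times."""
    start = min(timestamps)
    n = int((max(timestamps) - start) // bin_seconds) + 1
    signal = [0.0] * n
    for t in timestamps:
        signal[int((t - start) // bin_seconds)] += 1.0
    mean = sum(signal) / n
    signal = [s - mean for s in signal]  # remove the constant (zero-frequency) part
    best_k, best_power = 1, 0.0
    for k in range(1, n // 2 + 1):  # plain DFT; an FFT would be used at scale
        coeff = sum(s * cmath.exp(-2j * cmath.pi * k * i / n)
                    for i, s in enumerate(signal))
        power = abs(coeff) ** 2
        if power > best_power:
            best_k, best_power = k, power
    return n * bin_seconds / best_k

def remove_polling(timestamps, period, tolerance=60):
    """Drop connections that follow an earlier connection by about one period."""
    kept, tails = [], []  # tails: latest connection of each candidate polling chain
    for t in sorted(timestamps):
        tails = [x for x in tails if t - x <= period + tolerance]
        match = next((x for x in tails if abs((t - x) - period) <= tolerance), None)
        if match is not None:
            tails.remove(match)  # t continues an existing chain, so it is dropped
        else:
            kept.append(t)       # t starts a new chain, so it is kept
        tails.append(t)
    return kept

def clump_sessions(timestamps, max_gap=15 * 60):
    """Group connections into sessions; a gap above max_gap starts a new session."""
    sessions = []
    for t in sorted(timestamps):
        if sessions and t - sessions[-1][-1] <= max_gap:
            sessions[-1].append(t)
        else:
            sessions.append([t])
    return sessions
```

The period estimate is limited by the bin size and trace length, so `remove_polling` matches with a tolerance rather than requiring exact multiples of p.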
Outline • Background • Trace analysis (defining location, daily mobility, monthly mobility, session activity) • User modeling • Future work • Summary
Defining network location • Connection used to access the Internet, e.g. a dial-up ISP or a campus wireless network • Approximated by a combination of: authoritative DNS server, AS number, subnet
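One way to represent such a location is as a composite key, sketched below. The /24 prefix length is an assumption (the slides do not say how a subnet is delimited), and the DNS server and AS number are taken as inputs because the lookup method is not described; the names `Location` and `make_location` are hypothetical.

```python
import ipaddress
from typing import NamedTuple

class Location(NamedTuple):
    dns_server: str  # authoritative DNS server for the client address
    as_number: int   # autonomous system number
    subnet: str      # enclosing subnet in CIDR notation

def make_location(ip, dns_server, as_number, prefix_len=24):
    """Build a comparable location key; prefix_len=24 is an illustrative choice."""
    net = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
    return Location(dns_server.lower(), as_number, str(net))
```

Two logins then count as the same location when their keys compare equal, for example two addresses in the same /24 behind the same DNS server and AS.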
How mobile are users each day? [Figure: fraction of user-days (y-axis, 0 to 0.6) by number of locations (x-axis, 0 to 3)] • 50% of user-days involve logging in from only 1 location • 15% of user-days involve logging in from 2 locations • Upshot: On any given day, users are not highly mobile
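A distribution like this could be computed from the session trace roughly as follows. The sketch assumes sessions arrive as hypothetical `(username, timestamp_seconds, location)` tuples and that days are cut at multiples of 86400 seconds; it only counts user-days with at least one session, so the 0-location bar would additionally need the full grid of users and days.

```python
from collections import Counter, defaultdict

SECONDS_PER_DAY = 86400

def locations_per_user_day(sessions):
    """Return {number of locations: fraction of user-days} for active user-days."""
    seen = defaultdict(set)
    for user, ts, loc in sessions:
        seen[(user, ts // SECONDS_PER_DAY)].add(loc)
    counts = Counter(len(locs) for locs in seen.values())
    total = sum(counts.values())
    return {n: c / total for n, c in sorted(counts.items())}
```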
How mobile are users in 31 days? • How many unique subnets do they visit? • How many unique AS #s do they visit? • Let’s look at a graph…
How mobile are users in 31 days? [Figure: cumulative fraction of users (y-axis, 0 to 1) vs. # clusters (x-axis, 0 to 14), with one curve for subnets and one for AS #s] • 80% of users log in from 8 or fewer unique subnets • 90% of users log in from 3 or fewer unique AS numbers • Upshot: Again, most users are not highly mobile
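The cumulative curves can be reproduced with a small helper such as the one below. It is a sketch that assumes the trace has been reduced to hypothetical `(username, cluster_id)` pairs, where a cluster is either a subnet or an AS number.

```python
from collections import defaultdict

def unique_count_cdf(records):
    """Return [(n, fraction of users with at most n unique clusters)], sorted by n."""
    per_user = defaultdict(set)
    for user, cluster in records:
        per_user[user].add(cluster)
    sizes = sorted(len(clusters) for clusters in per_user.values())
    total = len(sizes)
    cdf = []
    for i, n in enumerate(sizes, start=1):
        if cdf and cdf[-1][0] == n:
            cdf[-1] = (n, i / total)  # same n again: raise the cumulative fraction
        else:
            cdf.append((n, i / total))
    return cdf
```

Running it once on subnets and once on AS numbers gives the two curves of the figure.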
User activity at a location [Figure: fraction of visits (y-axis, 0 to 0.7) by # sessions (x-axis: 1, 2, 3, 4+)] • 60% of visits to a location result in only 1 session • 20% of visits to a location result in exactly 2 sessions • Upshot: Users access their e-mail once or twice per visit.
Outline • Background • Trace analysis • User modeling (categorizing users, model structure, training and testing) • Future work • Summary
Categorizing users • Based on number of primary locations • For a given user, a primary location is one where the user spends >5% of the time • Categories: users with 1 primary location, users with 2 primary locations, users with 3+ primary locations
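The categorization rule can be sketched as below. The 5% threshold comes from the slide; how "time" is measured is not stated, so the example assumes a hypothetical per-user mapping from location to seconds attributed to it.

```python
def primary_locations(time_at_location, threshold=0.05):
    """Return the locations holding more than `threshold` of the user's total time."""
    total = sum(time_at_location.values())
    if total == 0:
        return []
    return sorted(loc for loc, secs in time_at_location.items()
                  if secs / total > threshold)

def categorize(time_at_location):
    """Map a user to category 1, 2, or 3 (where 3 stands for '3+')."""
    return min(len(primary_locations(time_at_location)), 3)
```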
Structure of our models • One model for each category • Two-tiered Markov model (MM) • High-level states represent user’s location • Low-level states represent user’s activity • Both MMs are 1st-order
Model structure for category 2 • 2 primary locations + 1 traveling state • High-level (location) states: primary 1, primary 2, traveling • Low-level (session) states, i.e. Logged-In and Logged-Out
Training • We have all the information: which locations are primary; where the user is, at any time; when the user is logged in/out • Simple to compute transition probabilities
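For a first-order Markov model, the transition probabilities are the normalized counts of consecutive state pairs. The sketch below assumes the trace has already been turned into a sequence of state labels sampled at fixed time steps; the slides do not specify that discretization, and the function name is hypothetical.

```python
from collections import Counter, defaultdict

def transition_probabilities(states):
    """Maximum-likelihood first-order transition probabilities from one sequence."""
    counts = defaultdict(Counter)
    for current, following in zip(states, states[1:]):
        counts[current][following] += 1
    return {current: {nxt: c / sum(row.values()) for nxt, c in row.items()}
            for current, row in counts.items()}
```

The same function applies to both tiers: once on location labels (primary 1, primary 2, traveling) and once per location on Logged-In/Logged-Out labels.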
Testing methodology • Create synthetic trace • Choose metrics to measure a trace • Compare real trace with synthetic trace
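Creating a synthetic trace from a two-tiered model amounts to sampling both chains step by step. The sketch below assumes one particular coupling, in which the location is sampled first and the session state is then sampled from that location's low-level chain; the slides do not spell out how the tiers interact, and all names here are hypothetical.

```python
import random

def sample_next(row, rng):
    """Sample the next state from a {state: probability} row."""
    states = list(row)
    return rng.choices(states, weights=[row[s] for s in states], k=1)[0]

def synthesize(location_model, session_models, start_location, start_session,
               steps, seed=0):
    """Generate `steps` (location, session_state) pairs from the two-tier model."""
    rng = random.Random(seed)  # seeded so a synthetic trace can be reproduced
    location, session = start_location, start_session
    trace = [(location, session)]
    for _ in range(steps - 1):
        location = sample_next(location_model[location], rng)
        session = sample_next(session_models[location][session], rng)
        trace.append((location, session))
    return trace
```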
Testing one metric • # of sessions between visits to primary • Each user visits his primary, leaves to visit other locations, then comes back to his primary • Every time this happens, record the number of other locations • There will be a CDF for the entire trace (real or synthetic)
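The metric can be sketched as a scan over a user's sequence of visited locations. Two assumptions are made here: each visit to a non-primary location counts once (the slide does not say whether repeats count), and the real and synthetic CDFs are compared by their largest vertical gap, which is a comparison chosen for this example rather than one named in the slides.

```python
def excursions(visits, primary):
    """Count other-location visits between consecutive visits to the primary."""
    counts, away, seen_primary = [], 0, False
    for loc in visits:
        if loc == primary:
            if seen_primary and away > 0:
                counts.append(away)  # a completed round trip away from the primary
            seen_primary, away = True, 0
        elif seen_primary:
            away += 1
    return counts

def empirical_cdf(samples, x):
    """Fraction of samples that are <= x."""
    return sum(1 for s in samples if s <= x) / len(samples)

def max_cdf_gap(real, synthetic):
    """Largest vertical distance between the two empirical CDFs."""
    points = sorted(set(real) | set(synthetic))
    return max(abs(empirical_cdf(real, x) - empirical_cdf(synthetic, x))
               for x in points)
```

A small gap suggests the synthetic trace reproduces this metric of the real trace; other metrics would be checked the same way.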
Testing results
Outline • Background • Trace analysis • User modeling • Future work • Summary
Using the results • Synthetic traces can help test systems • User behavior has implications for design, e.g. focus resources on primary locations • Model can predict user behavior on the fly, e.g. to cache, or not to cache?
As technology changes… • Blackberries: more physical locations; shorter, more frequent sessions; still, primary locations will be important • Wireless LAN hotspots: more network locations
Outline • Background • Trace analysis • User modeling • Future work • Summary
Summary – what we’ve done • Obtained a trace from an e-mail server • Filtered out client polling • Analyzed trace of user behavior • Modeled categories of users with tiered MM • Generated synthetic traces
Summary – user behavior • Most users log in from 1 or 2 locations • But a few users are highly mobile • Users access e-mail infrequently, but for long periods of time
Thank you • Quick clarifying questions?