Reading into Scarce Data: Estimating uncertainty in real-world conditions using Bayesian inference. Max Sklar, Foursquare Engineer, @maxsklar
We're talking about this problem:
- When do I have enough data to be confident in my answer?
- What if I'm forced to give an estimate even if I don't have enough data?
- How can we do this in a non-arbitrary way?
Example: Rotten Tomatoes ratings. Is this fair?
Scarce data is inevitable! Most instances won't have a lot of data.
● Tradeoff: dividing data into more buckets means getting less data per bucket.
● Power curves
○ A few buckets will have tons of examples, while most won't have many
○ Ex. venues, users, words in documents
● Every social network has this distribution
What can we say about the sparse buckets? Look at the roulette wheel vs. the restaurant story. Wheel: Red, Red, Red, Red. Restaurant: +, +, +, +. Takeaway: you're willing to make a judgement about a restaurant after going 4 times, but you're not willing to let 4 spins change your mind about the wheel. Why is this? It turns out the intuition is very logical!
What can we say about the sparse buckets? Suppose we're pulling data out of a multinomial distribution. Each distribution is represented by a pie chart. We don't know which pie chart we're on, only the result of throwing darts at it. We should have an idea before starting of which distributions are more likely than others. We are willing to change our mind given data using Bayes' rule:
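The rule the slide points to, written out for this setup (a standard statement; "pie chart" is just the slide's name for a candidate multinomial distribution):

$$P(\text{pie chart} \mid \text{data}) \;\propto\; P(\text{data} \mid \text{pie chart}) \cdot P(\text{pie chart})$$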
A clean way of doing this: the Dirichlet prior.
Dirichlet prior:
- has an expected multinomial distribution (think a pie chart)
- has a weight that tells us how confident we are. High weight = we're not so willing to change our mind with new data. Low weight = we place high value on new data. Collecting more data increases the weight.
Compare with the normal distribution, which:
- has a mean
- has a standard deviation, which could be interpreted as how confident we are in the mean.
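In symbols (standard Dirichlet facts, added here for reference rather than taken from the slide): for a Dirichlet with parameters $\alpha_1, \dots, \alpha_K$,

$$\mathbb{E}[p_k] = \frac{\alpha_k}{W}, \qquad W = \sum_k \alpha_k \ \text{(the weight)}$$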
Estimating food preferences. You decide to poll 1000 random people in a given area on what they like better: pizza or hot dogs. You get the following result:
Pizza: 638, so 638/1000 = 63.8%
Hot dogs: 362, so 362/1000 = 36.2%
The familiar way to find a probability is to divide the number of events for that state by the total number of events. This is the maximum likelihood estimate. In this case, because there's a lot of data, this is actually a very good probability distribution to guess for the next person coming out of this sample. But then there are other cases...
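As a formula, this is just:

$$\hat p_k = \frac{n_k}{N}: \qquad \hat p_{\text{pizza}} = \frac{638}{1000} = 0.638, \qquad \hat p_{\text{hot dog}} = \frac{362}{1000} = 0.362$$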
Cases where maximum likelihood fails:
● No data: pizza x 0, hot dog x 0. MLE = 0/0 (indeterminate).
● Small amount of data, one category is a no-show: pizza x 2, hot dog x 0. MLE for hot dog is 0. If the next person likes hot dogs that's infinite loss, but it's very possible.
● Small amount of data, both items show: pizza x 1, hot dog x 4. MLE for pizza is 20%. But this could easily have been produced by a 50-50 split, and log loss tells us not to be too bold.
● Large amount of data, one category is a no-show: 1000 observations in one category, 0 in the other. Where would you rather be, 16 Handles or the 14th Street Post Office? We still don't want to guess 0.
Solution: Add prior data
● When we count up our data, we initialize both counts to zero. Instead, initialize both counts to something > 0. It doesn't even have to be an integer!
Same four cases as the previous slide, with 0.5 added to each count:
| | No data | One no-show (2, 0) | Both show (1, 4) | Lots of data (638, 362) |
| Pizza count | .5 + 0 = 0.5 | .5 + 2 = 2.5 | .5 + 1 = 1.5 | .5 + 638 = 638.5 |
| Pizza probability | 50% (was indeterminate) | 83% (was 100%) | 25% (was 20%) | 63.8% (was ~same) |
| Hot dog count | .5 + 0 = 0.5 | .5 + 0 = 0.5 | .5 + 4 = 4.5 | .5 + 362 = 362.5 |
| Hot dog probability | 50% (was indeterminate) | 17% (was 0%) | 75% (was 80%) | 36.2% (was ~same) |
| Total count | 1 (was 0) | 3 (was 2) | 6 (was 5) | 1001 (was 1000) |
● When sparse data isn't a problem (4th column), the number doesn't change much.
● Now even the post office will be assigned a probability of one out of every 2001 people.
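A quick Python sketch of the arithmetic above (the 0.5 prior counts and the four scenarios come straight from the table; the function name is just for illustration):

```python
# A minimal sketch of the pseudo-count smoothing above.
# The prior count (0.5 per category) and the scenarios are taken from the slide.
def smoothed_estimate(counts, prior=0.5):
    """Add `prior` to every category's count, then normalize."""
    totals = [c + prior for c in counts]
    denom = sum(totals)
    return [t / denom for t in totals]

scenarios = {
    "no data":      (0, 0),
    "one no-show":  (2, 0),
    "both show":    (1, 4),
    "lots of data": (638, 362),
}
for name, (pizza, hot_dog) in scenarios.items():
    p_pizza, p_hot_dog = smoothed_estimate([pizza, hot_dog])
    print(f"{name}: pizza {p_pizza:.1%}, hot dog {p_hot_dog:.1%}")
# no data: pizza 50.0%, hot dog 50.0%
# one no-show: pizza 83.3%, hot dog 16.7%
# both show: pizza 25.0%, hot dog 75.0%
# lots of data: pizza 63.8%, hot dog 36.2%
```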
The initial counts correspond to a Dirichlet distribution!
n1: number of observations for state 1
n2: number of observations for state 2
N = n1 + n2: total number of observations
a1: prior on state 1
a2: prior on state 2
W = a1 + a2: total prior (the weight of the Dirichlet distribution)
MLE estimate for state 1 = n1 / N
Dirichlet prior estimate for state 1 = (n1 + a1) / (N + W)
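More generally, with K states this is the mean of the posterior Dirichlet (standard conjugacy, stated here for reference):

$$\text{posterior} = \mathrm{Dirichlet}(a_1 + n_1, \dots, a_K + n_K), \qquad \text{estimate for state } k = \frac{n_k + a_k}{N + W}$$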
Math for the Beta distribution
● The Beta prior is the Dirichlet prior with 2 categories.
● When using Bayes' rule to find a posterior distribution after seeing data, the result is a new Beta prior!
● The end result is the same as we saw before from adding additional data, so there's no need to understand this.
● But trust me, the math works.
● See for yourself! It's not as bad as it looks.
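The derivation the last bullet points at is the standard conjugacy argument, restated here rather than copied from the slide:

$$\text{prior: } p(x) \propto x^{a_1 - 1}(1 - x)^{a_2 - 1} \quad \text{(Beta)}$$
$$\text{likelihood of } n_1 \text{ and } n_2 \text{ observations: } \propto x^{n_1}(1 - x)^{n_2}$$
$$\text{posterior: } \propto x^{a_1 + n_1 - 1}(1 - x)^{a_2 + n_2 - 1}, \ \text{ i.e. } \mathrm{Beta}(a_1 + n_1,\ a_2 + n_2)$$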
Beta, Gamma, Dirichlet
● Beta prior
○ Use it when you're dividing into two groups, or deciding between true and false.
● Dirichlet prior
○ This is when you have multiple categories to choose from. Add a little bit of prior data to each category.
● Gamma prior
○ This is when you have a rate of occurrence of events. Add some prior events, and some prior time.
○ Can be used to build Dirichlets.
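For reference, the rate case and the Gamma-to-Dirichlet construction look like this (textbook facts, not taken from the slides):

$$\text{prior } \mathrm{Gamma}(\alpha, \beta)\ (\alpha \text{ prior events, } \beta \text{ prior time}),\ k \text{ events in time } t \;\Rightarrow\; \text{posterior } \mathrm{Gamma}(\alpha + k,\ \beta + t),\ \text{rate estimate } \frac{\alpha + k}{\beta + t}$$
$$X_k \sim \mathrm{Gamma}(\alpha_k) \text{ independent, common scale} \;\Rightarrow\; \left(\frac{X_1}{\sum_j X_j}, \dots, \frac{X_K}{\sum_j X_j}\right) \sim \mathrm{Dirichlet}(\alpha_1, \dots, \alpha_K)$$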
How do we find the prior?
● Without any other data, it's subjective.
● Same weight on all categories?
● Roulette wheel vs. restaurant again.
● When would you want high weight? Roulette wheel.
● When would you want low weight? Team of observers example.
● Cautionary tale: the Dirichlet jellybean example.
How do we find the prior? With big data, we can get fancier!
● We may not have a lot of data about this barbecue, but we've been to lots of barbecues.
● Use our experience: find the Dirichlet distribution which maximizes the likelihood of past data.
A short Python script will do it
● Available on GitHub: BayesPy/ConjugatePriorTools
○ The Dirichlet probability distribution over counts is easy to implement in Python.
● Uses gradient descent and second-order methods.
● Let's run it!
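A minimal sketch of the idea, not the actual BayesPy/ConjugatePriorTools code: maximize the Dirichlet-multinomial (Polya) likelihood of historical count rows by gradient ascent. The function names, step size, and example rows are illustrative assumptions; the real script also uses second-order updates.

```python
# Sketch: fit a Dirichlet prior to rows of historical counts by maximizing
# the Dirichlet-multinomial (Polya) log-likelihood with gradient ascent.
import numpy as np
from scipy.special import gammaln, digamma

def log_likelihood(counts, alpha):
    """Log probability of the count rows under Dirichlet parameters alpha,
    dropping the multinomial coefficient (it does not depend on alpha)."""
    counts = np.asarray(counts, dtype=float)
    N = counts.sum(axis=1)   # total observations in each row
    W = alpha.sum()          # total prior weight
    return np.sum(gammaln(W) - gammaln(N + W)
                  + np.sum(gammaln(counts + alpha) - gammaln(alpha), axis=1))

def gradient(counts, alpha):
    """Derivative of the log-likelihood with respect to each alpha_k."""
    counts = np.asarray(counts, dtype=float)
    N = counts.sum(axis=1)
    W = alpha.sum()
    return np.sum(digamma(W) - digamma(N + W)[:, None]
                  + digamma(counts + alpha) - digamma(alpha), axis=0)

def fit_dirichlet(counts, n_categories, learning_rate=0.01, steps=2000):
    alpha = np.ones(n_categories)        # start from a flat prior
    for _ in range(steps):
        alpha += learning_rate * gradient(counts, alpha)
        alpha = np.maximum(alpha, 1e-6)  # keep the parameters positive
    return alpha

# Example: rows of (positive, neutral, negative) tip counts from past venues
# (made-up numbers, just to show the call).
rows = [[3, 8, 1], [0, 2, 0], [10, 15, 2], [1, 1, 1]]
print(fit_dirichlet(rows, 3))
```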
Finally, some applications in the Foursquare app. Tip sentiment: using a modified version of the AFINN word list, each tip is classified as positive, negative, or neutral. Here is the prior that we found: Positive 1.4, Neutral 3.1, Negative 0.4. High sentiment: McNally Jackson Bookstore, Pepe's Pizza. Low sentiment: 14th Street Post Office.
Examples: a venue without tips; a venue with 1 negative tip; a venue with 5 negative tips; a venue with 30 tips (10 of each).
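A sketch of what those four cases look like if the prior above is applied as pseudo-counts; the code and whatever it prints are just that arithmetic, not figures from the talk.

```python
# Sketch: apply the tip-sentiment prior (positive 1.4, neutral 3.1, negative 0.4)
# as pseudo-counts to the four example venues above.
PRIOR = {"positive": 1.4, "neutral": 3.1, "negative": 0.4}

def sentiment_estimate(tip_counts):
    """tip_counts: dict of observed tips per class; returns smoothed proportions."""
    totals = {k: PRIOR[k] + tip_counts.get(k, 0) for k in PRIOR}
    denom = sum(totals.values())
    return {k: v / denom for k, v in totals.items()}

examples = {
    "no tips":           {},
    "1 negative tip":    {"negative": 1},
    "5 negative tips":   {"negative": 5},
    "30 tips, 10 each":  {"positive": 10, "neutral": 10, "negative": 10},
}
for name, counts in examples.items():
    est = sentiment_estimate(counts)
    print(name, {k: f"{v:.0%}" for k, v in est.items()})
```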
Prior for partitioned popularity with a 48-dimensional Dirichlet
● Monday through Thursday compressed into one day group, 2-hour intervals (4 day groups x 12 intervals = 48 buckets)
● Total weight: about 31 (Why so high?)
● What if the time buckets are sampled unevenly?
Likes and dislikes: #allnew4sq
Foursquare released an update on June 7, 2012 that allowed users to like and dislike venues. We now have a lot of data points, but most venues have very few likes or dislikes.
Beta prior: likes 7.50, dislikes 0.90 (likes are much more common).
We've done some initial analysis, but there are a few issues. Let's look at the list and try to diagnose...
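A sketch of how a Beta(7.50, 0.90) prior would turn raw like/dislike counts into a smoothed score, assuming it is applied as pseudo-counts. The example calls line up with the percentages on the next two slides (e.g. 38 likes and 0 dislikes comes out near 98%), but the exact production formula is an assumption here.

```python
# Sketch: smoothed "like" probability under a Beta(7.50, 0.90) prior,
# applied as pseudo-counts. Assumes this is how the listed percentages were computed.
PRIOR_LIKES, PRIOR_DISLIKES = 7.50, 0.90

def like_score(likes, dislikes):
    return (likes + PRIOR_LIKES) / (likes + dislikes + PRIOR_LIKES + PRIOR_DISLIKES)

print(f"{like_score(38, 0):.0%}")  # 38 likes, 0 dislikes -> 98%
print(f"{like_score(0, 1):.0%}")   # 0 likes, 1 dislike (14th Street Post Office) -> 80%
```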
Examples of like/dislike ratings: this list illustrates some of the work that still needs to be done.
Highly liked venues (likes / total ratings, smoothed score):
● AT&T Park, 38/38, 98%
● Magic Kingdom Park, 31/31, 98%
● Blue Bottle Coffee, 28/28, 98%
● Navy Pier (Chicago), 26/26, 97%
● Shake Shack, 24/24, 97%
● Musée du Louvre, 24/24, 97%
● Museum of Natural History, 24/24, 97%
● MoMA, 44/45, 96%
● UEFA Euro 2012 Russia / Czech, 261/272, 95%
Examples of like/dislike ratings: this list illustrates some of the work that still needs to be done.
Highly disliked venues (not sure if people actually dislike these):
● VH1 Big Morning Buzz, 17/41, 50%
● Cross Country Cookout (History Channel event in Nashville), 14/30, 56%
● Penny Farthing, 1/5, 63% (more confident, shows lack of data)
● Moscow Passport Office, 1/5, 63%
● 14th Street Post Office, 0/1, 80%
These projects will make these usable:
1) We need more than a week's worth of data. The negative signal in particular is weak.
2) There's a signal in "not rating" even though you've been somewhere. We need to look at who is rating (what's worked in the past).
3) Combine likes with a better tip sentiment algorithm.
4) Personalization: Netflix-style matrix factorization.
Thanks! Questions? Follow me on Twitter: @maxsklar