Fighting Spam with Haskell Simon Marlow 5 Sept 2015

Headlines
▪ Migrated a large service to Haskell
  ▪ thousands of machines
  ▪ every action on Facebook (and Instagram) runs some Haskell code
  ▪ live code updates (<10 min)
▪ This talk
  ▪ The problem we’re solving: abuse detection & remediation
  ▪ Why (and how) Haskell?
  ▪ Tales from the trenches

The problem
▪ There is spam and other types of abuse
  ▪ Malware attacks, credential stealing
  ▪ Sites that trick users into liking/sharing things or divulging passwords
  ▪ Fake accounts that spam people and pages
  ▪ Spammers can use automation and viral attacks
▪ Want to catch as much as possible in a completely automated way
▪ Squash attacks quickly and safely

[Diagram: Evil? Yes!]

∑ We call this system Sigma

Sigma :: Content -> Bool
▪ Sigma classifies tens of billions of actions per day
  ▪ Facebook + Instagram
▪ Sigma is a rule engine
▪ For each action type, evaluate a set of rules
▪ Rules can block or take other action
▪ Manual + machine learned rules
▪ Rules can be updated live
▪ Highly effective at eliminating spam, malware, malicious URLs, etc.

How do we define rules?

Example
▪ Fanatics are spamming their friends with posts about Functional Programming!
▪ Let’s fix it!

Example
▪ We want a rule that says
  ▪ If the person is posting about Functional Programming (need info about the content)
  ▪ And they have >100 friends (need to fetch the friend list)
  ▪ And more than half of their friends like C++ (need info about each friend)
  ▪ Then block, else allow

Our rule, in Haskell

    fpSpammer :: Haxl Bool
    fpSpammer =

▪ Haxl is a monad
▪ “Haxl Bool” is the type of a computation that may:
  ▪ do data-fetching
  ▪ consult input data
  ▪ maybe throw exceptions
  ▪ finally, return a Bool

Our rule, in Haskell

    fpSpammer :: Haxl Bool
    fpSpammer = talkingAboutFP
      where
      talkingAboutFP =
        strContains "Functional Programming" <$> postContent

▪ postContent is part of the input (say)

    postContent :: Haxl Text

Our rule, in Haskell

    fpSpammer :: Haxl Bool
    fpSpammer =
      talkingAboutFP .&&
      numFriends .> 100
      where
      talkingAboutFP =
        strContains "Functional Programming" <$> postContent

    (.&&) :: Haxl Bool -> Haxl Bool -> Haxl Bool
    (.>) :: Ord a => Haxl a -> Haxl a -> Haxl Bool
    numFriends :: Haxl Int

Our rule, in Haskell

    fpSpammer :: Haxl Bool
    fpSpammer =
      talkingAboutFP .&&
      numFriends .> 100 .&&
      friendsLikeCPlusPlus
      where
      talkingAboutFP =
        strContains "Functional Programming" <$> postContent
      friendsLikeCPlusPlus = do
        friends <- getFriends
        cppFriends <- filterM likesCPlusPlus friends
        return (length cppFriends >= length friends `div` 2)

Observations
▪ Our language is Haskell + libraries
  ▪ Embedded Domain-Specific Language (EDSL)
  ▪ Users can pick up a Haskell book and learn about it
  ▪ Tradeoff: not exactly the syntax we might have chosen, but we get to take advantage of existing tooling, documentation etc.
▪ Focus on expressing functionality concisely, avoid operational details
▪ “pure” semantics
  ▪ no side effects – easy to reason about
  ▪ scope for automatic optimisation

Efficiency
▪ Rules are data + computation
▪ Fetching remote data can be slow
▪ Latency is important!
  ▪ We’re on the clock: the user is waiting
▪ So what about efficiency?

Fetching data efficiently is all that matters.

1. Fetch only the data you need to make a decision
2. Fetch data concurrently whenever possible

Let’s deal with (1) first.

Example
▪ We want a rule that says
  ▪ If the person is posting about Functional Programming (fast)
  ▪ And they have >100 friends (slow)
  ▪ And more than half of their friends like C++ (very slow)
  ▪ Then block, else allow
▪ Avoid slow checks if fast checks already determine the answer

.&& is short-cutting

    fpSpammer :: Haxl Bool
    fpSpammer =
      talkingAboutFP .&&
      numFriends .> 100 .&&
      friendsLikeCPlusPlus
      where
      talkingAboutFP =
        strContains "Functional Programming" <$> postContent
      friendsLikeCPlusPlus = do
        friends <- getFriends
        cppFriends <- filterM likesCPlusPlus friends
        return (length cppFriends >= length friends `div` 2)

▪ Programmer is responsible for getting the order right
  ▪ (tooling helps with this)
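To make the short-cutting behaviour concrete, here is a minimal sketch that works in any monad. The definitions of `(.&&)` and `(.>)` are one plausible reading of the signatures shown earlier, not necessarily how Haxl defines them, and plain `IO` stands in for `Haxl`. The counter and the stub checks are hypothetical. Because `IO` has no `Num` instance, the literal is written `pure 100` here.

```haskell
import Data.IORef (modifyIORef, newIORef, readIORef)

-- Short-cutting conjunction: the right operand only runs if the left is True.
-- (Assumed definition for illustration.)
(.&&) :: Monad m => m Bool -> m Bool -> m Bool
a .&& b = do
  x <- a
  if x then b else return False
infixr 3 .&&

-- Comparison lifted over two computations. (Assumed definition.)
(.>) :: (Monad m, Ord a) => m a -> m a -> m Bool
a .> b = (>) <$> a <*> b
infix 4 .>

main :: IO ()
main = do
  expensiveCalls <- newIORef (0 :: Int)
  let talkingAboutFP       = return False          -- fast check, fails
      numFriends           = return (250 :: Int)   -- stub value
      friendsLikeCPlusPlus = do                    -- stands in for the very slow check
        modifyIORef expensiveCalls (+ 1)
        return True
  verdict <- talkingAboutFP .&& numFriends .> pure 100 .&& friendsLikeCPlusPlus
  n <- readIORef expensiveCalls
  print (verdict, n)  -- prints (False,0): the slow check never ran
```

Swapping the operands so that the slow check comes first would still give the same verdict but would run the expensive check, which is why operand order matters.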

We can speculate

    fpSpammer :: Haxl Bool
    fpSpammer =
      talkingAboutFP .&&
      do a <- numFriends .> 100
         b <- friendsLikeCPlusPlus
         return (a && b)
      where
      talkingAboutFP =
        strContains "Functional Programming" <$> postContent
      friendsLikeCPlusPlus = do
        friends <- getFriends
        cppFriends <- filterM likesCPlusPlus friends
        return (length cppFriends >= length friends `div` 2)

▪ Avoid shortcutting behaviour by explicitly evaluating both conditions

Concurrency
▪ Multiple independent data-fetch requests must be executed concurrently and/or batched
▪ Traditional languages and frameworks make the programmer deal with this
  ▪ threads, futures/promises, async, callbacks, etc.
  ▪ Hard to get right
  ▪ Our users don’t care
  ▪ Clutters the code
  ▪ Hard to refactor later

Haxl’s advantage
▪ Because our language has no side effects, the framework can handle concurrency automatically
▪ We can exploit concurrency as far as data dependencies allow
▪ The programmer doesn’t need to think about it

    friendsLikeCPlusPlus = do
      friends <- getFriends
      cppFriends <- filterM likesCPlusPlus friends
      ...

[Diagram: getFriends runs first; the likesCPlusPlus calls, one per friend, then run side by side]

    numCommonFriends a b = do
      fa <- friendsOf a
      fb <- friendsOf b
      return (length (intersect fa fb))

[Diagram: friendsOf a and friendsOf b side by side, both feeding into length (intersect ...)]

How does Haxl work?

Step 1
▪ Haxl is a Monad
▪ The implementation of (>>=) will allow the computation to block, waiting for data.

    -- This is the result of a computation
    data Result a
      = Done a                                 -- Done indicates that we have finished
      | Blocked (Seq BlockedRequest) (Haxl a)  -- Blocked indicates that the computation requires this data

    -- Haxl may need to do IO
    newtype Haxl a = Haxl { unHaxl :: IO (Result a) }

Monad instance

    instance Monad Haxl where
      return a = Haxl $ return (Done a)
      Haxl m >>= k = Haxl $ do
        r <- m
        case r of
          Done a       -> unHaxl (k a)
          Blocked br c -> return (Blocked br (c >>= k))

If m blocks with continuation c, the continuation for m >>= k is c >>= k

So far we can only block on one data-fetch
▪ Our example will block on the first friendsOf request:

    numCommonFriends a b = do
      fa <- friendsOf a    -- blocks here
      fb <- friendsOf b
      return (length (intersect fa fb))

▪ How do we allow the Monad to collect multiple data-fetches, so we can execute them concurrently?

First, rewrite to use Applicative operators

    numCommonFriends a b =
      length <$> (intersect <$> friendsOf a <*> friendsOf b)

▪ Applicative is a weaker version of Monad

    class Applicative f where
      pure :: a -> f a
      (<*>) :: f (a -> b) -> f a -> f b

    class Monad m where
      return :: a -> m a
      (>>=) :: m a -> (a -> m b) -> m b

▪ When we use Applicative, Haxl can collect multiple data fetches and execute them concurrently.

Applicative instance

    instance Applicative Haxl where
      pure = return
      Haxl f <*> Haxl x = Haxl $ do
        f' <- f
        x' <- x
        case (f',x') of
          (Done g,        Done y       ) -> return (Done (g y))
          (Done g,        Blocked br c ) -> return (Blocked br (g <$> c))
          (Blocked br c,  Done y       ) -> return (Blocked br (c <*> return y))
          (Blocked br1 c, Blocked br2 d) -> return (Blocked (br1 <> br2) (c <*> d))

▪ <*> allows both arguments to block waiting for data
▪ <*> can be nested, letting us collect an arbitrary number of data fetches to execute concurrently
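The two instances can be tried out end to end with a toy version of the monad. The following is a sketch under several assumptions, not the real Haxl library: a plain list replaces `Seq`, a `BlockedRequest` is just a user id plus an `IORef` for the answer, the friend data is made up, and `runHaxl` is a hypothetical driver that serves every pending request and counts the rounds.

```haskell
import Data.IORef (IORef, newIORef, readIORef, writeIORef)
import Data.List (intersect)
import Data.Maybe (fromMaybe)

-- A pending fetch: whose friends we want, and where to store the answer.
data BlockedRequest = BlockedRequest Int (IORef [Int])

data Result a
  = Done a
  | Blocked [BlockedRequest] (Haxl a)

newtype Haxl a = Haxl { unHaxl :: IO (Result a) }

instance Functor Haxl where
  fmap f (Haxl m) = Haxl $ do
    r <- m
    return $ case r of
      Done a       -> Done (f a)
      Blocked br c -> Blocked br (fmap f c)

instance Applicative Haxl where
  pure a = Haxl (return (Done a))
  Haxl f <*> Haxl x = Haxl $ do
    f' <- f
    x' <- x
    return $ case (f', x') of
      (Done g,        Done y)        -> Done (g y)
      (Done g,        Blocked br c)  -> Blocked br (g <$> c)
      (Blocked br c,  Done y)        -> Blocked br (c <*> pure y)
      (Blocked br1 c, Blocked br2 d) -> Blocked (br1 ++ br2) (c <*> d)

instance Monad Haxl where
  Haxl m >>= k = Haxl $ do
    r <- m
    case r of
      Done a       -> unHaxl (k a)
      Blocked br c -> return (Blocked br (c >>= k))

-- Made-up friend lists standing in for a remote data source.
friendGraph :: [(Int, [Int])]
friendGraph = [(1, [2, 3, 4]), (2, [3, 4, 5])]

-- Always blocks once; the continuation reads whatever the driver stored.
friendsOf :: Int -> Haxl [Int]
friendsOf uid = Haxl $ do
  ref <- newIORef []
  return (Blocked [BlockedRequest uid ref] (Haxl (Done <$> readIORef ref)))

-- Hypothetical driver: serve all pending requests, then resume; count rounds.
runHaxl :: Haxl a -> IO (a, Int)
runHaxl = go 0
  where
    go :: Int -> Haxl a -> IO (a, Int)
    go n (Haxl m) = do
      r <- m
      case r of
        Done a         -> return (a, n)
        Blocked reqs c -> mapM_ fetch reqs >> go (n + 1) c
    fetch (BlockedRequest uid ref) =
      writeIORef ref (fromMaybe [] (lookup uid friendGraph))

main :: IO ()
main = do
  viaBind <- runHaxl (friendsOf 1 >>= \fa ->
                      friendsOf 2 >>= \fb ->
                      return (length (intersect fa fb)))
  viaAp   <- runHaxl (length <$> (intersect <$> friendsOf 1 <*> friendsOf 2))
  print viaBind  -- (2,2): two common friends, two rounds of fetching
  print viaAp    -- (2,1): same answer, both fetches in one round
```

Both versions compute the same answer; only the number of rounds differs, because `<*>` can see both blocked requests at once while `>>=` cannot look past the first.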

(Some) Concurrency for free
▪ Applicative is a standard class in Haskell
▪ Lots of library functions are already defined using it
▪ These work concurrently when used with Haxl
▪ e.g.

    sequence :: Monad m => [m a] -> m [a]
    mapM :: Monad m => (a -> m b) -> [a] -> m [b]
    filterM :: Monad m => (a -> m Bool) -> [a] -> m [a]

    friendsLikeCPlusPlus = do
      friends <- getFriends
      cppFriends <- filterM likesCPlusPlus friends
      ...
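One way to see why library functions built on Applicative batch for free is a small cost model. This is an illustration only, not Haxl: the hypothetical `Rounds` type pairs a value with the number of fetch rounds needed to produce it, `<*>` overlaps its two sides (max) and `>>=` sequences them (sum). The stub `getFriends` and `likesCPlusPlus` each cost one round. It relies on `filterM` in current `base` being implemented with Applicative operations.

```haskell
import Control.Monad (filterM, foldM)

-- Cost model (hypothetical): a value plus the rounds needed to compute it.
data Rounds a = Rounds Int a

instance Functor Rounds where
  fmap f (Rounds n a) = Rounds n (f a)

instance Applicative Rounds where
  pure = Rounds 0
  Rounds m f <*> Rounds n a = Rounds (max m n) (f a)  -- independent: overlap

instance Monad Rounds where
  Rounds m a >>= k =
    let Rounds n b = k a in Rounds (m + n) b          -- dependent: in sequence

getFriends :: Rounds [Int]
getFriends = Rounds 1 [1 .. 10]

likesCPlusPlus :: Int -> Rounds Bool
likesCPlusPlus uid = Rounds 1 (even uid)

-- filterM combines the per-friend checks applicatively.
batched :: Rounds [Int]
batched = getFriends >>= filterM likesCPlusPlus

-- A hand-written monadic loop forces one check after another.
oneByOne :: Rounds [Int]
oneByOne = getFriends >>= foldM step []
  where
    step acc uid = do
      ok <- likesCPlusPlus uid
      return (if ok then acc ++ [uid] else acc)

describe :: Rounds a -> (Int, a)
describe (Rounds n a) = (n, a)

main :: IO ()
main = do
  print (describe batched)   -- (2,[2,4,6,8,10])
  print (describe oneByOne)  -- (11,[2,4,6,8,10])
```

In this model the friend list costs one round and all ten checks share a second round, whereas the explicit loop pays one round per friend.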

Back to our example
▪ These behave the same:

    -- This is the version we want to write
    numCommonFriends a b = do
      fa <- friendsOf a
      fb <- friendsOf b
      return (length (intersect fa fb))

    -- This is the version we want to run
    numCommonFriends a b =
      length <$> (intersect <$> friendsOf a <*> friendsOf b)

▪ Data dependencies tell us we can translate one into the other

Applicative Do
▪ We implemented this transformation in the compiler
▪ Users just turn it on:

    {-# LANGUAGE ApplicativeDo #-}

▪ ... and get automatic concurrency/batching when using “do”
▪ Semantics-preserving with existing code (if certain standard properties hold), but provides better performance in some cases
▪ Extension pushed upstream to GHC (17/9/2015), will be in 8.0.1
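The effect of the extension can be observed without Haxl by using a type whose Applicative and Monad operations are deliberately distinguishable. In this sketch the hypothetical `Rounds` type counts fetch rounds (max for `<*>`, sum for `>>=`), and `friendsOf` is a stub returning made-up data. With `ApplicativeDo` enabled, the `do` block with two independent binds is expected to be compiled to `<$>`/`<*>`, while the hand-written `>>=` chain stays sequential.

```haskell
{-# LANGUAGE ApplicativeDo #-}
import Data.List (intersect)

-- Cost model (hypothetical): a value plus the rounds needed to compute it.
data Rounds a = Rounds Int a

instance Functor Rounds where
  fmap f (Rounds n a) = Rounds n (f a)

instance Applicative Rounds where
  pure = Rounds 0
  Rounds m f <*> Rounds n a = Rounds (max m n) (f a)

instance Monad Rounds where
  Rounds m a >>= k = let Rounds n b = k a in Rounds (m + n) b

-- Stub fetch: one round, made-up friend ids.
friendsOf :: Int -> Rounds [Int]
friendsOf uid = Rounds 1 [uid .. uid + 3]

-- Written with "do"; the two binds do not depend on each other.
viaDo :: Int -> Int -> Rounds Int
viaDo a b = do
  fa <- friendsOf a
  fb <- friendsOf b
  return (length (intersect fa fb))

-- Explicit binds are not rewritten by the extension.
viaBind :: Int -> Int -> Rounds Int
viaBind a b =
  friendsOf a >>= \fa ->
  friendsOf b >>= \fb ->
  return (length (intersect fa fb))

describe :: Rounds a -> (Int, a)
describe (Rounds n a) = (n, a)

main :: IO ()
main = mapM_ (print . describe) [viaDo 1 2, viaBind 1 2]
-- prints (1,3) then (2,3): same result, one round instead of two
```

Removing the pragma should make `viaDo` report two rounds as well, which is the difference the extension is meant to eliminate.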

Does this work in practice?
▪ In Sigma, our most common request executes hundreds of fetches in under ten rounds.
▪ Performance problems that come up in code reviews (and production) tend to be about fetching too much data, almost never about concurrency
