economical machine learning via functional programming

  1. Economical machine learning via functional programming • Big Data Scala by the Bay – August 18, 2015 • David Andrzejewski (@davidandrzej), Data Sciences Engineering, Sumo Logic • Sumo Logic Confidential

  2. Sumo Logic • Machine data intelligence platform in AWS • Early-ish Scala adopter (2.7.7 in 2010) • Free trial for < 500 MB/day

  3. This talk • Machine learning is useful, but... • ...brings additional engineering complexity • Functional programming techniques can help

  4. Machine learning: so hot right now • Will robots... – take our jobs? – annihilate humanity? • Key clip art – robots studying – heads with gears in them

  5. “Machine learning studies computer algorithms for learning to do stuff.” -Prof. Rob Schapire (COS 511 scribe notes)

  6. What kinds of “stuff” can machines learn to do? And how do they do it?
     What: • ...predict whether someone will click an ad • ...rank / recommend content by relevance • ...classify behavior as malicious or not • ...label images or text based on content
     How: • Model: $f(x) = y$ • Estimate: $[(x_1, y_1), \ldots, (x_N, y_N)] \rightarrow \hat{f}(x)$ • Predict: $\hat{f}(x) = \hat{y}$
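
The model / estimate / predict split above maps naturally onto a higher-order function: estimation takes labeled pairs and returns the fitted function. The sketch below is only an illustration, using a 1-nearest-neighbour rule on a single numeric input; the object name, `estimate`, and the choice of learner are assumptions, not something taken from the talk.

```scala
object EstimateSketch {
  // Estimate: training pairs (x_i, y_i) in, fitted prediction function out.
  // The learner here is a toy 1-nearest-neighbour rule (an assumption for illustration).
  def estimate(training: Seq[(Double, String)]): Double => String = {
    require(training.nonEmpty, "need at least one training pair")
    // Predict: return the label of the closest training input.
    x => training.minBy { case (xi, _) => math.abs(xi - x) }._2
  }
}
```

For example, `EstimateSketch.estimate(Seq(1.0 -> "low", 10.0 -> "high"))` returns a function that maps `2.0` to `"low"` and `8.5` to `"high"`.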

  7. Rise of complementary goods • Moore’s Law (source: Fairchild via computerhistory.org) • More Cloud (source: Forrester via Forbes) • More Data (source: IDC via The Economist)

  8. “Machine Learning disrupts software engineering.” -Léon Bottou (ICML 2015 keynote)

  9. Technical debt “...you are sure that it will make further changes harder in the future.” – Martin Fowler • Tight coupling • Hidden dependencies • Code repetition / duplication • Statefulness • Duct-taped workarounds

  10. ML: new & exciting ways to shoot yourself in the foot
      Trough of disillusionment? • Unreliable contracts • Unrealistic assumptions • Hard to – test and debug – safely improve – manage data/features • Easy to – erode boundaries – glue / hack / duct tape
      References: • Machine Learning: The High Interest Credit Card of Technical Debt, D. Sculley et al (NIPS 2014 workshop) • Two big challenges in machine learning, Léon Bottou (ICML 2015 keynote) • A Systems View of Machine Learning, Joshua Bloom (PyData 2015 keynote)

  11. [Figure; labels: Payments, Principal]

  12. [Figure; labels: N → ∞, Payments, Principal]

  13. [Figure; labels: Payments, Principal]

  14. [Figure; labels: Payments, Machine Learning, Principal]

  15. Out of the Tar Pit, Moseley & Marks (2006), h/t Paco Nathan
      • Essential complexity: actual problem; business logic; SQL; declarative
      • Incidental complexity: “reality tax”; implementation detail; Hadoop; imperative

  16. Control “complexity spend” with Functional Programming • Avoid mutable state • Minimize custom logic surface area • Facilitate local reasoning • Compose small, well-typed, well-tested functions

  17. Functional programming (FP) • Big idea: pure functions – no side effects – referential transparency • Consequences – immutability – first-class functions – higher-order functions • Examples use scalaz – version 7.1.3 – see “learning scalaz” blog post series by eed3si9n

  21. Case 0: your code, does it work? Step 0: use the types • Useful tricks – Case class wrappers – Unboxed tagged types
      Before: def getData(datasetId: Long, startTime: Long, endTime: Long)
      After: def getData(datasetId: DatasetId, timeRange: DatasetInterval)
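
A minimal sketch of the "use the types" step, assuming plain Scala value classes as the wrapper mechanism (the slide also mentions unboxed tagged types from scalaz, which are not used here). `Timestamp`, the fields of `DatasetInterval`, and the body of `getData` are hypothetical; the slide shows only the signatures.

```scala
object TypedIds {
  // Value classes: distinct types at compile time, usually no wrapper object at runtime.
  final case class DatasetId(value: Long) extends AnyVal
  final case class Timestamp(millis: Long) extends AnyVal

  // Hypothetical interval type; construction rejects reversed ranges.
  final case class DatasetInterval(start: Timestamp, end: Timestamp) {
    require(start.millis <= end.millis, "start must not be after end")
    def lengthMillis: Long = end.millis - start.millis
  }

  // Stand-in body: the real lookup is not shown on the slide.
  def getData(datasetId: DatasetId, timeRange: DatasetInterval): String =
    s"dataset ${datasetId.value} over ${timeRange.lengthMillis} ms"
}
```

With three bare `Long` parameters, swapping `startTime` and `datasetId` compiles silently; with the wrapped signature the same mistake is a type error, and a reversed interval fails at construction.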

  22. Case 0: your code, does it work? Step 1: unit testing • See: Testing for data scientists, Trey Causey (PyData 2015) • For known pairs $(x_1, y_1), (x_2, y_2), \ldots, (x_k, y_k)$, check $f(x_i) \stackrel{?}{=} y_i$

  23. Case 0: your code, does it work? Step 2: property testing • Define properties you expect to hold • Heuristics to probe edge cases • ML examples – bounded output – find weird edge cases (e.g., empty clusters) • Universal: $\forall x.\; p(x, f(x))$ • Conditional: $\forall x.\; c(x) \implies p(x, f(x))$
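
The universal and conditional forms above can be sketched without any library. The hand-rolled `counterexample` helper below is a stand-in for a real property-testing tool, and the function under test (`sigmoid`), the generator, and the seed are all assumptions chosen for illustration.

```scala
import scala.util.Random

object PropertySketch {
  // Function under test: a logistic score that should always lie in [0, 1].
  def sigmoid(x: Double): Double = 1.0 / (1.0 + math.exp(-x))

  // Generator heuristic: mix hand-picked edge cases with random values.
  val edgeCases: Vector[Double] = Vector(0.0, -0.0, 1e308, -1e308, Double.MinPositiveValue)
  def genDouble(rng: Random): Double =
    if (rng.nextInt(4) == 0) edgeCases(rng.nextInt(edgeCases.size))
    else rng.nextGaussian() * 100.0

  // Returns the first generated input that violates the property, if any.
  def counterexample[A](gen: Random => A, trials: Int, seed: Long)(prop: A => Boolean): Option[A] = {
    val rng = new Random(seed)
    Iterator.fill(trials)(gen(rng)).find(a => !prop(a))
  }

  // Universal property: for all x, 0 <= f(x) <= 1.
  val boundedViolation: Option[Double] =
    counterexample(genDouble, 1000, 42L) { x => val y = sigmoid(x); y >= 0.0 && y <= 1.0 }

  // Conditional property: for all x, x >= 0 implies f(x) >= 0.5.
  val conditionalViolation: Option[Double] =
    counterexample(genDouble, 1000, 42L)(x => !(x >= 0.0) || sigmoid(x) >= 0.5)

  // A property that looks plausible but is false in floating point: f(x) < 1.
  val strictViolation: Option[Double] =
    counterexample(genDouble, 1000, 42L)(x => sigmoid(x) < 1.0)
}
```

The first two properties should find no counterexample, while the third is expected to fail on large inputs where `sigmoid` rounds to exactly `1.0`, the kind of edge case the slide has in mind.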

  24. Case 0: your code, does it work? Step 3: statistical estimators are functions • Confidence intervals • PAC-style bounds: $P(L \geq \epsilon) \leq \delta$ • Property testing – custom generator: sets of datasets

  25. Case 1: loose coupling via type class pattern
      Approach: Example
      • Hard-wired: def foo(x: MyBuzzType)
      • Parametric polymorphism: def add[T](x: T)
      • Variance annotation: class Stack[+T]
      • Ad-hoc polymorphism: def sort[T : Ordering](xs: List[T])

  26. Advantages of type classes • Retroactively extend external types (e.g., Joda Time) • Nicer than “wrapper class” / subtyping (blog post) • ML sweet spot: consumers “just short” of being polymorphic • Examples – Timestamped[T] – Featurable[T] – Labeled[T]
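
A minimal sketch of the type class pattern, using the `Featurable[T]` name from the slide. The trait's single method, the two instances, `LogLine`, and `featureMatrix` are hypothetical; the talk does not show its own definitions.

```scala
object FeaturableSketch {
  // Hypothetical type class: anything that can be turned into a feature vector.
  trait Featurable[T] {
    def features(t: T): Vector[Double]
  }

  object Featurable {
    // Retroactive instance for a type we do not own (String).
    implicit val stringFeaturable: Featurable[String] = new Featurable[String] {
      def features(s: String): Vector[Double] =
        Vector(s.length.toDouble, s.count(_.isDigit).toDouble)
    }
  }

  // A domain type whose instance lives in its companion and reuses the String instance.
  final case class LogLine(level: String, message: String)
  object LogLine {
    implicit val logLineFeaturable: Featurable[LogLine] = new Featurable[LogLine] {
      def features(l: LogLine): Vector[Double] = {
        val isError = if (l.level == "ERROR") 1.0 else 0.0
        isError +: implicitly[Featurable[String]].features(l.message)
      }
    }
  }

  // Consumer that is polymorphic over any T with a Featurable instance (context bound).
  def featureMatrix[T: Featurable](xs: List[T]): List[Vector[Double]] =
    xs.map(x => implicitly[Featurable[T]].features(x))
}
```

`featureMatrix` works on `List[String]` and `List[LogLine]` alike, without either type extending a common base class.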

  27. • Basic k-fold CV • Stratified: need label info
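
The contrast between the two bullets can be sketched as follows: basic k-fold needs nothing from the element type, while the stratified version asks only for a `Labeled[T]` instance. The trait's shape, the round-robin assignment, and both function names are assumptions made for this illustration.

```scala
object StratifiedFolds {
  // Hypothetical type class exposing a label for an item.
  trait Labeled[T] {
    def label(t: T): String
  }

  // Basic k-fold: deal items into k folds round-robin by position.
  def kFold[T](xs: Vector[T], k: Int): Vector[Vector[T]] = {
    require(k > 0, "k must be positive")
    (0 until k).toVector.map { f =>
      xs.zipWithIndex.collect { case (x, i) if i % k == f => x }
    }
  }

  // Stratified k-fold: split each label group separately, then merge fold by fold,
  // so every fold keeps roughly the overall label proportions.
  def stratifiedKFold[T](xs: Vector[T], k: Int)(implicit L: Labeled[T]): Vector[Vector[T]] = {
    val perLabel = xs.groupBy(x => L.label(x)).values.toVector.map(group => kFold(group, k))
    (0 until k).toVector.map(f => perLabel.flatMap(folds => folds(f)))
  }
}
```

Any type gains stratified splitting by supplying a `Labeled` instance, which is the "just short of polymorphic" consumer the previous slide describes.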

  28. Type class laws • Property Testing + Type Classes • What about transitivity? Let’s add trait TotalOrdering[T]
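
One way to read "property testing + type classes" is to state the laws once, generically, and run them against every instance. The sketch below checks totality and transitivity for the standard library's `Ordering` over a sample; it does not reproduce the talk's `TotalOrdering[T]` trait, and the deliberately broken `cyclic` instance is an invented example.

```scala
object OrderingLaws {
  // Totality: any two sampled values are comparable one way or the other.
  def total[T](samples: Seq[T])(implicit ord: Ordering[T]): Boolean =
    samples.forall(a => samples.forall(b => ord.lteq(a, b) || ord.lteq(b, a)))

  // Transitivity: a <= b and b <= c implies a <= c, over all sampled triples.
  def transitive[T](samples: Seq[T])(implicit ord: Ordering[T]): Boolean =
    samples.forall { a =>
      samples.forall { b =>
        samples.forall { c =>
          !(ord.lteq(a, b) && ord.lteq(b, c)) || ord.lteq(a, c)
        }
      }
    }

  // Deliberately broken instance: a rock-paper-scissors "ordering" is total but cyclic.
  val cyclic: Ordering[String] = new Ordering[String] {
    private val beats = Map("rock" -> "scissors", "scissors" -> "paper", "paper" -> "rock")
    def compare(a: String, b: String): Int =
      if (a == b) 0 else if (beats(a) == b) 1 else -1
  }
}
```

The compiler accepts `cyclic` as an `Ordering[String]`; only the law check reveals that it is not a valid ordering.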

  29. Case 2: Monoids + Monoids = Monoids • Experimental evaluation code frequently manipulates results • How to combine?

  30. Implementing Monoid (I believe Shapeless can do this automagically...!?)

  31. Map(TestGroup -> Results(79,119,171,14), ControlGroup -> Results(34,77,136,112))
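
A minimal sketch of "Monoids + Monoids = Monoids" in plain Scala, with a small `Monoid` trait standing in for scalaz's. The field names of `Results` are assumptions (the slides show only four integers), and the numbers in the usage note are made up for illustration.

```scala
object MonoidSketch {
  // Minimal stand-in for a Monoid type class: an identity element and an associative append.
  trait Monoid[A] {
    def zero: A
    def append(x: A, y: A): A
  }

  // Hypothetical result counts; the field names are assumptions.
  final case class Results(tp: Int, fp: Int, tn: Int, fn: Int)

  object Monoid {
    // Results combine by field-wise addition.
    implicit val resultsMonoid: Monoid[Results] = new Monoid[Results] {
      val zero: Results = Results(0, 0, 0, 0)
      def append(x: Results, y: Results): Results =
        Results(x.tp + y.tp, x.fp + y.fp, x.tn + y.tn, x.fn + y.fn)
    }

    // Map[K, V] is a monoid whenever V is: merge keys, appending values on collision.
    implicit def mapMonoid[K, V](implicit V: Monoid[V]): Monoid[Map[K, V]] =
      new Monoid[Map[K, V]] {
        val zero: Map[K, V] = Map.empty[K, V]
        def append(x: Map[K, V], y: Map[K, V]): Map[K, V] =
          y.foldLeft(x) { case (acc, (k, v)) =>
            acc.updated(k, V.append(acc.getOrElse(k, V.zero), v))
          }
      }
  }

  // Combine any number of values of a monoidal type.
  def combineAll[A](xs: Seq[A])(implicit A: Monoid[A]): A =
    xs.foldLeft(A.zero)((x, y) => A.append(x, y))
}
```

Combining `Map("test" -> Results(1, 2, 3, 4), "control" -> Results(5, 6, 7, 8))` with `Map("test" -> Results(10, 20, 30, 40))` via `combineAll` sums the `"test"` entries field by field and carries `"control"` through unchanged; the nested instance is derived by the compiler.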

  32. Distributed compute via monoid homomorphism • See: Twitter Algebird and related talks, Jimmy Lin “Monoidify!” paper • $f(s_1 + s_2) = f(s_1) \oplus f(s_2)$
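
The homomorphism property says that summarizing concatenated data gives the same answer as summarizing each partition and combining the summaries. A minimal sketch, assuming a count-and-sum summary; `Moments`, `summarize`, and `summarizePartitioned` are hypothetical names, and no Algebird API is used.

```scala
object HomomorphismSketch {
  // Summary that forms a monoid under field-wise addition (the "⊕" side).
  final case class Moments(n: Long, sum: Double) {
    def |+|(that: Moments): Moments = Moments(n + that.n, sum + that.sum)
    def mean: Double = if (n == 0L) 0.0 else sum / n
  }
  val empty: Moments = Moments(0L, 0.0)

  // f: maps a chunk of raw data to its summary.
  def summarize(chunk: Seq[Double]): Moments = Moments(chunk.size.toLong, chunk.sum)

  // "Distributed" evaluation: summarize each partition independently, then combine.
  def summarizePartitioned(parts: Seq[Seq[Double]]): Moments =
    parts.map(part => summarize(part)).foldLeft(empty)(_ |+| _)
}
```

Because `summarize(a ++ b)` equals `summarize(a) |+| summarize(b)` (up to floating-point rounding), each partition can be processed on a different machine and only the small summaries need to move.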

  33. Monoidal classifiers: 400x faster than Weka • Algebraic Classifiers: a generic approach to fast cross-validation, online training, and parallel training, Izbicki (ICML 2013)

  34. Key trick: prefix-sum
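
One reading of the prefix-sum trick, assuming models that combine associatively: train one model per fold, take running combinations from the left and from the right, and the "all folds except i" model is a prefix combined with a suffix. The toy `MeanModel` and every name below are assumptions, not code from the talk or the cited paper.

```scala
object PrefixSumCV {
  // Toy monoidal "model": count and sum, enough to predict the training mean.
  final case class MeanModel(n: Int, sum: Double) {
    def combine(that: MeanModel): MeanModel = MeanModel(n + that.n, sum + that.sum)
  }
  val empty: MeanModel = MeanModel(0, 0.0)

  def train(fold: Seq[Double]): MeanModel = MeanModel(fold.size, fold.sum)

  // For each fold i, the model trained on every fold except i.
  def leaveOneFoldOut(folds: Vector[Seq[Double]]): Vector[MeanModel] = {
    val models = folds.map(fold => train(fold))        // one training run per fold
    val prefix = models.scanLeft(empty)(_ combine _)   // prefix(i) = m_0 + ... + m_(i-1)
    val suffix = models.scanRight(empty)(_ combine _)  // suffix(i) = m_i + ... + m_(k-1)
    models.indices.toVector.map(i => prefix(i) combine suffix(i + 1))
  }
}
```

Each data point is trained on once, and the k held-out models cost only a linear number of cheap `combine` calls instead of k full retraining passes.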

  35. Case 3: auditing computation with Writer Monad • Understanding multiclass predictions (credit: Kumar Avijit) • $f(x) = \operatorname{argmax}_i \, w_i^T x$

  38. Confusion matrix with “max significant feature” • Tracking illuminates “bad features”

  39. How did we do that? Writer Monad in simple drawings
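
Since the drawings did not survive, here is a minimal sketch of the idea in plain Scala: a tiny `Writer` that carries a log alongside a value, used to record which feature contributed most to each class score in an argmax-of-dot-products classifier. This is a simplified stand-in for scalaz's `Writer`, and `score`, `predict`, and the log format are assumptions.

```scala
object WriterSketch {
  // Minimal Writer: a value plus an append-only log.
  final case class Writer[A](log: Vector[String], value: A) {
    def map[B](f: A => B): Writer[B] = Writer(log, f(value))
    def flatMap[B](f: A => Writer[B]): Writer[B] = {
      val next = f(value)
      Writer(log ++ next.log, next.value) // logs accumulate in execution order
    }
  }

  // Score one class as w . x, logging the index of the largest single contribution.
  def score(cls: String, w: Vector[Double], x: Vector[Double]): Writer[Double] = {
    require(w.nonEmpty && w.size == x.size, "weights and features must align")
    val contributions = w.zip(x).map { case (wi, xi) => wi * xi }
    val top = contributions.indexOf(contributions.max)
    Writer(Vector(s"$cls: top feature $top"), contributions.sum)
  }

  // Predict argmax_i (w_i . x); the audit trail is threaded through by flatMap.
  def predict(weights: List[(String, Vector[Double])], x: Vector[Double]): Writer[String] = {
    require(weights.nonEmpty, "need at least one class")
    val start = Writer(Vector.empty[String], List.empty[(String, Double)])
    val scored = weights.foldLeft(start) { case (acc, (cls, w)) =>
      for { soFar <- acc; s <- score(cls, w, x) } yield soFar :+ (cls -> s)
    }
    scored.map(scores => scores.maxBy(_._2)._1)
  }
}
```

The prediction logic never mentions logging explicitly beyond `score`; the log travels with the value, so the caller gets both the predicted class and the per-class audit trail.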

  42. Case 4: stateful traversal • Example: sampling from p-th order autoregressive model • $x_t = \sum_{i=1}^{p} \phi_i x_{t-i} + \epsilon_t$

  43. Case 4: stateful traversal • Re-arrange to take current window state as input • $x_t = \sum_{i=1}^{p} \phi_i x_{t-i} + \epsilon_t$

  44. Case 4: stateful traversal • Partially apply the function for fixed parameters • $x_t = \sum_{i=1}^{p} \phi_i x_{t-i} + \epsilon_t$

  45. Case 4: stateful traversal • Map function over random noise terms • $x_t = \sum_{i=1}^{p} \phi_i x_{t-i} + \epsilon_t$ • Now we’ve got something like g: Window => (Window, Prediction)

  47. (1) Convert each position into an independent State calculation (2) Traverse/Sequence to convert List[State] → State[List] (3) Supply initial window state and run()
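
The three steps can be sketched in plain Scala with a small hand-written `State` in place of scalaz's. The window layout (most recent value last), `arStep`, `sequence`, and `sample` are assumptions for illustration; noise terms are passed in explicitly so the example stays deterministic.

```scala
object StateSketch {
  // Minimal State: a function from a state to (new state, result).
  final case class State[S, A](run: S => (S, A)) {
    def map[B](f: A => B): State[S, B] =
      State((s: S) => { val (s1, a) = run(s); (s1, f(a)) })
    def flatMap[B](f: A => State[S, B]): State[S, B] =
      State((s: S) => { val (s1, a) = run(s); f(a).run(s1) })
  }

  // Step 2: turn a list of State steps into one State producing a list of results.
  def sequence[S, A](steps: List[State[S, A]]): State[S, List[A]] =
    steps.foldRight(State((s: S) => (s, List.empty[A]))) { (step, rest) =>
      for { a <- step; as <- rest } yield a :: as
    }

  type Window = Vector[Double] // last p values, most recent last

  // g: Window => (Window, Prediction), for fixed coefficients and one noise term.
  def arStep(coeffs: Vector[Double])(noise: Double): State[Window, Double] =
    State((window: Window) => {
      // coeffs(0) multiplies the most recent value, coeffs(1) the one before, and so on.
      val next = coeffs.zip(window.reverse).map { case (c, x) => c * x }.sum + noise
      (window.tail :+ next, next)
    })

  def sample(coeffs: Vector[Double], noises: List[Double], init: Window): List[Double] = {
    require(coeffs.nonEmpty && coeffs.size == init.size, "window must hold p values")
    val step: Double => State[Window, Double] = arStep(coeffs) // partial application
    val steps = noises.map(step)                               // step 1: one State per position
    sequence(steps).run(init)._2                               // steps 2 and 3
  }
}
```

With coefficients `Vector(0.5, 0.25)`, initial window `Vector(4.0, 8.0)`, and noise terms `List(1.0, 0.0, -1.0)`, `sample` yields `List(6.0, 5.0, 3.0)`; the sliding window is threaded through by `sequence` rather than by a mutable variable. This fold-based `sequence` is not stack-safe for very long series.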

  48. Monoids, Monads, who cares? • Ubiquitous patterns – Monoids: generalized addition/combination – Monads: computation within context • Make it explicit and reap the rewards – type-checking – generalized wiring – optimization opportunities – common vocabulary

  49. Manage ML tech debt with functional programming
      • Correctness: type-oriented design; function composition; unit tests; property tests; bounds and randomized behavior
      • Loose coupling: type class design pattern; utility functions via ad hoc polymorphism; chaining type classes; law-checking
      • Monoid design: combine data structures; leverage general plumbing; efficient distributed computation; cross-fold validation
      • Monad design: instrumented model prediction with Writer; stateful traversal; type-checked failure handling
