Data-Intensive Distributed Computing, CS 431/631 451/651 (Fall 2020), Part 7: Data Mining (2/4)

  1. Data-Intensive Distributed Computing, CS 431/631 451/651 (Fall 2020), Part 7: Data Mining (2/4). Ali Abedi. These slides are available at https://www.student.cs.uwaterloo.ca/~cs451. This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States license; see http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.

  2. Stochastic Gradient Descent. Source: Wikipedia (Water Slide)

  3. Stochastic Gradient Descent: Gradient Descent vs. Stochastic Gradient Descent (SGD)

  4. Stochastic Gradient Descent. Gradient Descent: considers all training instances in every iteration. Stochastic Gradient Descent (SGD): considers a single random instance in every iteration.

  6. Batch Gradient Descent vs. Stochastic Gradient Descent. Mini-batching: considers a random subset of instances in every iteration.
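
To make the distinction concrete, here is a minimal sketch of the three update strategies for a linear model with squared loss. This is illustrative only: the data layout, learning rate, and function names are assumptions, not code from the course.

    // Illustrative sketch: full-batch GD, SGD, and mini-batch SGD updates for
    // a linear model with squared loss. Names and data layout are assumed.
    object GradientDescentVariants {
      type Example = (Array[Double], Double)              // (features, label)

      def predict(w: Array[Double], x: Array[Double]): Double =
        (w zip x).map { case (wi, xi) => wi * xi }.sum

      // Gradient of the squared loss for a single example
      def gradient(w: Array[Double], ex: Example): Array[Double] = {
        val (x, y) = ex
        val err = predict(w, x) - y
        x.map(_ * err)
      }

      // Gradient descent: every update considers all training instances
      def gdStep(w: Array[Double], data: Seq[Example], lr: Double): Array[Double] = {
        val g = data.map(gradient(w, _)).transpose.map(_.sum / data.size)
        (w zip g).map { case (wi, gi) => wi - lr * gi }
      }

      // Stochastic gradient descent: every update considers one random instance
      def sgdStep(w: Array[Double], data: Seq[Example], lr: Double): Array[Double] = {
        val g = gradient(w, data(scala.util.Random.nextInt(data.size)))
        (w zip g).map { case (wi, gi) => wi - lr * gi }
      }

      // Mini-batching: every update considers a random subset of instances
      def miniBatchStep(w: Array[Double], data: Seq[Example], lr: Double, b: Int): Array[Double] =
        gdStep(w, scala.util.Random.shuffle(data.toList).take(b), lr)
    }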

  7. Ensembles. Source: Wikipedia (Orchestra)

  8. Ensemble Learning: learn multiple models and combine their results to make a prediction. Common implementation: train classifiers on different partitions of the input data (embarrassingly parallel!). Combining predictions: majority voting or model averaging.
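
For the "combining predictions" step, a minimal sketch of the two rules mentioned above (majority voting over predicted labels, averaging over real-valued scores); the object and method names here are illustrative, not from the course code.

    // Illustrative sketch of combining an ensemble's predictions.
    object EnsembleCombiners {
      // Majority voting over predicted class labels
      def majorityVote(predictions: Seq[Int]): Int =
        predictions.groupBy(identity).maxBy(_._2.size)._1

      // Model averaging over real-valued scores (e.g., probabilities)
      def average(scores: Seq[Double]): Double =
        scores.sum / scores.size
    }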

  9. Ensemble Learning: learn multiple models, combine their results to make a prediction. Why does it work? If the errors are uncorrelated, it is less likely that multiple classifiers are all wrong; this reduces the variance component of the error.

  10. MapReduce Implementation

  11. Gradient Descent [diagram: training data split across mappers; each mapper computes partial gradients, a reducer aggregates them and updates the model; iterate until convergence]
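
A sketch of one way to read the diagram, with plain Scala collections standing in for the MapReduce machinery: each mapper computes a partial gradient over its shard of the training data, the reducer sums the partial gradients, the driver updates the model, and the job is re-run until convergence. The signature and the grad callback are assumptions for illustration.

    // Illustrative sketch of batch gradient descent in a map/reduce style.
    // `grad` is a per-example gradient (e.g., the one sketched above).
    def distributedGd(
        w0: Array[Double],
        shards: Seq[Seq[(Array[Double], Double)]],          // one Seq per mapper
        grad: (Array[Double], (Array[Double], Double)) => Array[Double],
        lr: Double,
        iters: Int): Array[Double] = {

      def sum(a: Array[Double], b: Array[Double]): Array[Double] =
        (a zip b).map { case (x, y) => x + y }

      (1 to iters).foldLeft(w0) { (w, _) =>                        // iterate
        val partials = shards.map(_.map(grad(w, _)).reduce(sum))   // mappers
        val g = partials.reduce(sum)                               // reducer
        (w zip g).map { case (wi, gi) => wi - lr * gi }            // update model
      }
    }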

  12. Stochastic Gradient Descent [diagram: training data split across mappers, each mapper acting as a learner]. No iteration! This is great because we no longer need iterations: each mapper goes through its records, applies the stochastic gradient descent update rule to each record, and updates its model; this continues for all records.

  13. Stochastic Gradient Descent [diagram: mappers feed the training data to reducers, each of which runs a learner]. No iteration!

  14. MapReduce Implementation: how do we output the model? Option 1: write the model out as “side data”. Option 2: emit the model as intermediate output.

  15. What about Spark? [diagram: RDD[T] → mapPartitions with f: (Iterator[T]) ⇒ Iterator[U] running a learner → RDD[U]]
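
A rough sketch of the pattern in this diagram: each partition's iterator is consumed by an SGD learner inside mapPartitions, and each learner emits its model. The Instance and Model case classes and the learning-rate handling are assumptions; only mapPartitions itself is Spark API.

    // Sketch: one SGD learner per partition via mapPartitions.
    import org.apache.spark.rdd.RDD

    case class Instance(features: Array[Double], label: Double)
    case class Model(weights: Array[Double])

    // f: Iterator[Instance] => Iterator[Model], as in the diagram
    def trainOnPartition(numFeatures: Int, lr: Double)(
        instances: Iterator[Instance]): Iterator[Model] = {
      var w = Array.fill(numFeatures)(0.0)
      instances.foreach { inst =>
        val err = (w zip inst.features).map { case (wi, xi) => wi * xi }.sum - inst.label
        w = (w zip inst.features).map { case (wi, xi) => wi - lr * err * xi }
      }
      Iterator.single(Model(w))
    }

    def trainModels(data: RDD[Instance], numFeatures: Int, lr: Double): RDD[Model] =
      data.mapPartitions(trainOnPartition(numFeatures, lr))

The per-partition models can then be collected and combined, for example by averaging their weights, echoing the ensemble idea from the earlier slides.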

  16. In practice… data scientists usually use the transformations provided by Spark MLlib: val model = LinearRegressionWithSGD.train(parsedData, numIterations, stepSize); val prediction = model.predict(point.features)
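
Filled out a little, the snippet might look like the following, using the RDD-based spark.mllib API; the input path, parsing logic, and parameter values are placeholders, and sc is an existing SparkContext.

    // Hedged example around the two calls on the slide (spark.mllib, RDD API).
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

    // Placeholder input: one "label,f1 f2 f3 ..." record per line
    val data = sc.textFile("data/regression.txt")
    val parsedData = data.map { line =>
      val parts = line.split(',')
      LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
    }.cache()

    val numIterations = 100
    val stepSize = 0.0001
    val model = LinearRegressionWithSGD.train(parsedData, numIterations, stepSize)

    // Predict for each training point and pair the prediction with the true label
    val labelsAndPreds = parsedData.map { point =>
      (point.label, model.predict(point.features))
    }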

  17. Sentiment Analysis Case Study. Task: binary polarity classification, {positive, negative} sentiment; use the “emoticon trick” to gather labeled data. Data: test set of 500k positive / 500k negative tweets from 9/1/2011; training sets of {1m, 10m, 100m} instances from before that date (50/50 split). Features: sliding-window byte 4-grams. Models and optimization: logistic regression trained with SGD (L2 regularization); ensembles of various sizes (simple weighted voting). Source: Lin and Kolcz. (2012) Large-Scale Machine Learning at Twitter. SIGMOD.
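
Since “sliding-window byte 4-grams” may be unfamiliar, here is a tiny illustrative sketch of extracting them from a tweet; the function name is made up for this example.

    // Illustrative: every window of 4 consecutive bytes becomes one feature.
    def byte4grams(text: String): Seq[Seq[Byte]] =
      text.getBytes("UTF-8").toSeq.sliding(4).toSeq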

  18. Diminishing returns… Ensembles trained on 10m examples beat a single classifier trained on 100m examples, essentially “for free”. [figure: accuracy of a single classifier vs. ensembles at 10m and 100m instances]

  19. Supervised Machine Learning [diagram: training data feeds a machine learning algorithm that produces a model; at testing/deployment time, the model predicts labels for new instances]

  20. Evaluation: how do we know how well we’re doing? We induce a model such that the loss is minimized (one standard formulation is sketched below), but we need end-to-end metrics. The obvious metric: accuracy.
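
One standard way to write the objective the slide gestures at (a reconstruction, not a quote from the slide): given training pairs $(x_i, y_i)$, a model $f$ with parameters $w$, and a loss $\ell$, induce

    \hat{w} = \arg\min_{w} \frac{1}{n} \sum_{i=1}^{n} \ell\big(f(x_i; w),\, y_i\big)

Minimizing this training loss is not the end goal, though; we still need end-to-end metrics on held-out data.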

  21. Metrics (confusion matrix):
      Predicted Positive, Actual Positive: True Positive (TP); Predicted Positive, Actual Negative: False Positive (FP), Type I error
      Predicted Negative, Actual Positive: False Negative (FN), Type II error; Predicted Negative, Actual Negative: True Negative (TN)
      Precision = TP/(TP + FP); Recall (TPR) = TP/(TP + FN); Fall-out (FPR) = FP/(FP + TN); Miss rate = FN/(FN + TN)
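
A small sketch that computes these quantities from raw confusion-matrix counts (the case class is illustrative). As a worked example, Confusion(tp = 90, fp = 10, fn = 30, tn = 870) gives precision 0.9, recall 0.75, and fall-out of about 0.011.

    // Illustrative: standard metrics from confusion-matrix counts.
    case class Confusion(tp: Long, fp: Long, fn: Long, tn: Long) {
      def precision: Double = tp.toDouble / (tp + fp)
      def recall: Double    = tp.toDouble / (tp + fn)   // true positive rate (TPR)
      def fallOut: Double   = fp.toDouble / (fp + tn)   // false positive rate (FPR)
      def accuracy: Double  = (tp + tn).toDouble / (tp + fp + fn + tn)
    }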

  23. ROC and PR Curves; AUC. A receiver operating characteristic (ROC) curve is a graphical plot that illustrates the diagnostic ability of a binary classifier as its discrimination threshold is varied. Source: Davis and Goadrich. (2006) The Relationship Between Precision-Recall and ROC Curves.

  24. Training/Testing Splits [diagram: the data is divided into a Training portion and a Test portion]. Cross-Validation.

  25.–29. Training/Testing Splits: Cross-Validation [figure built up across slides 25–29, illustrating cross-validation: each portion of the data takes a turn as the test split]
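
A minimal sketch of k-fold cross-validation as depicted here: each fold takes a turn as the test split while the remaining folds form the training split, and the metric is averaged across folds. The train and evaluate callbacks are placeholders for whatever model and metric are in use.

    // Illustrative k-fold cross-validation over an in-memory dataset.
    def crossValidate[D, M](data: Seq[D], k: Int)(
        train: Seq[D] => M)(evaluate: (M, Seq[D]) => Double): Double = {
      // Assign examples to k folds round-robin
      val folds = data.zipWithIndex.groupBy(_._2 % k).values.map(_.map(_._1)).toSeq
      val scores = folds.indices.map { i =>
        val test       = folds(i)
        val trainSplit = folds.indices.filter(_ != i).flatMap(folds)
        evaluate(train(trainSplit), test)
      }
      scores.sum / scores.size                 // average metric across the k folds
    }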

  30. Typical Industry Setup [diagram: along a timeline, a Training period is followed by a Test period, and then by an A/B test in production]

  31. A/B Testing [diagram: users are split X% / (100 − X)% into Control and Treatment groups]; gather metrics, compare the alternatives.
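
One common way to make the split deterministic (the “properly bucketing users” point on the next slide) is to hash the user id, so the same user always lands in the same bucket. A sketch under that assumption, not a description of any particular company’s system:

    // Illustrative: hash-based assignment of users to treatment vs. control.
    import java.security.MessageDigest

    def inTreatment(userId: String, experiment: String, treatmentPercent: Int): Boolean = {
      val digest = MessageDigest.getInstance("MD5")
        .digest(s"$experiment:$userId".getBytes("UTF-8"))
      val bucket = (java.util.Arrays.hashCode(digest) % 100 + 100) % 100
      bucket < treatmentPercent   // e.g., 10 sends roughly 10% of users to treatment
    }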

  32. A/B Testing: Complexities. Properly bucketing users; novelty and learning effects; long- vs. short-term effects; multiple, interacting tests; nosy tech journalists…

  33. Supervised Machine Learning [diagram repeated from slide 19: training data feeds a machine learning algorithm that produces a model, which is then used at testing/deployment time]

  34. Applied ML in Academia: download an interesting dataset (it comes with the problem); run a baseline model (train/test); build a better model (train/test); does the new model beat the baseline? Yes: publish a paper! No: try again!

  38. Fantasy vs. Reality. Fantasy: extract features, develop a cool ML technique, #Profit. Reality: What’s the task? Where’s the data? What’s in this dataset? What’s all the f#$!* crap? Clean the data, extract features, “do” machine learning, fail, iterate…

  40. “It’s impossible to overstress this: 80% of the work in any data project is in cleaning the data.” – DJ Patil, “Data Jujitsu”. Source: Wikipedia (Jujitsu)

  42. On finding things…

  43. On naming things… userid vs. user_id; CamelCase, smallCamelCase, snake_case, camel_Snake, dunder__snake

  44. On feature extraction… ^(\\w+\\s+\\d+\\s+\\d+:\\d+:\\d+)\\s+ ([^@]+?)@(\\S+)\\s+(\\S+):\\s+(\\S+)\\s+(\\S+) \\s+((?:\\S+?,\\s+)*(?:\\S+?))\\s+(\\S+)\\s+(\\S+) \\s+\\[([^\\]]+)\\]\\s+\"(\\w+)\\s+([^\"\\\\]* (?:\\\\.[^\"\\\\]*)*)\\s+(\\S+)\"\\s+(\\S+)\\s+ (\\S+)\\s+\"([^\"\\\\]*(?:\\\\.[^\"\\\\]*)*) \"\\s+\"([^\"\\\\]*(?:\\\\.[^\"\\\\]*)*)\"\\s* (\\d*-[\\d-]*)?\\s*(\\d+)?\\s*(\\d*\\.[\\d\\.]*)? (\\s+[-\\w]+)?.*$ (an actual Java regular expression used to parse log messages at Twitter circa 2010). Friction is cumulative!

  45. Data Plumbing… Gone Wrong! [scene: consumer internet company in the Bay Area…] “Okay, let’s get going… where’s the click data?” “It’s over here…” “Well, that’s kinda non-intuitive, but okay… Oh, BTW, where’s the timestamp of the click?” “Well, it wouldn’t fit, so we had to shoehorn… Hang on, I don’t remember… Uh, bad news. Looks like we forgot to log it…” [grumble, grumble, grumble] Frontend Engineer: develops new features, adds logging code to capture clicks. Data Scientist: analyzes user behavior, extracts insights to improve the feature.

  46. Fantasy vs. Reality. Fantasy: extract features, develop a cool ML technique, #Profit. Reality: What’s the task? Where’s the data? What’s in this dataset? What’s all the f#$!* crap? Clean the data, extract features, “do” machine learning, fail, iterate…

  47. Congratulations, you’re halfway there… Source: Wikipedia (Hills)

  48. Congratulations, you’re halfway there… Does it actually work? (A/B testing) Is it fast enough? Good, you’re two thirds there…

  49. Productionize. Source: Wikipedia (Oil refinery)

  50. Productionize: What are your jobs’ dependencies? How and when are your jobs scheduled? Are there enough resources? How do you know if it’s working? Who do you call if it stops working? Infrastructure is critical here! (plumbing)

  51. Takeaway lesson: most of data science isn’t glamorous! Source: Wikipedia (Plumbing)
