COMP9313: Big Data Management Classification and PySpark MLlib
PySpark MLlib
• MLlib is Spark's scalable machine learning library, consisting of common learning algorithms and utilities:
  • Basic Statistics
  • Classification
  • Regression
  • Clustering
  • Recommendation Systems
  • Dimensionality Reduction
  • Feature Extraction
  • Optimization
• It is, more or less, a Spark version of scikit-learn
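As a first taste of the API, here is a minimal sketch of training a classifier with the DataFrame-based pyspark.ml package; the toy data, app name, and choice of logistic regression are illustrative only:

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("mllib-intro").getOrCreate()

# Toy labelled data: (class label, feature vector)
train = spark.createDataFrame(
    [(0.0, Vectors.dense([0.0, 1.1])),
     (1.0, Vectors.dense([2.0, 1.0])),
     (0.0, Vectors.dense([0.1, 1.2])),
     (1.0, Vectors.dense([1.9, 0.8]))],
    ["label", "features"])

# Fit a classifier, then apply it back to the same data
model = LogisticRegression(maxIter=10).fit(train)
model.transform(train).select("label", "prediction").show()
```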
Classification
• Classification
  • predicts categorical class labels
  • constructs a model from the training set and the values (class labels) of a classifying attribute, then uses the model to classify new data
• Prediction (a.k.a. regression)
  • models continuous-valued functions, i.e., predicts unknown or missing values
• Applications
  • medical diagnosis
  • credit approval
  • natural language processing
Classification and Regression
• Given a new object o, map it to a feature vector 𝐱 = (x_1, x_2, …, x_d)^T
• Predict the output (class label) y ∈ 𝒴
• Binary classification
  • 𝒴 = {0, 1} (sometimes {−1, 1})
• Multi-class classification
  • 𝒴 = {1, 2, …, C}
• Learn a classification function
  • f(𝐱): ℝ^d ↦ 𝒴
• Regression: f(𝐱): ℝ^d ↦ ℝ
Example of Classification – Text Categorization
• Given: a document or sentence
  • E.g., "A statement released by Scott Morrison said he has received advice … advising the upcoming sitting be cancelled."
• Predict: the topic
  • Pre-defined labels: politics or not?
• How to learn the classification function f(𝐱): ℝ^d ↦ 𝒴?
• How to convert a document to a feature vector 𝐱 ∈ ℝ^d?
• How to convert the pre-defined labels to 𝒴 = {0, 1}?
Example of Classification – Text Categorization
• Input object: a sequence of words
• Input features 𝐱
  • Bag-of-words representation
  • freq(Morrison) = 2, freq(Trump) = 0, …
  • 𝐱 = (2, 1, 0, …)^T
• Class labels 𝒴
  • Politics: 1
  • Not politics: −1
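Such a bag-of-words vector can be built in PySpark with Tokenizer and CountVectorizer. A minimal sketch, reusing the example sentence from the previous slide (column names and app name are made up):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, CountVectorizer

spark = SparkSession.builder.appName("bow-demo").getOrCreate()

docs = spark.createDataFrame(
    [("a statement released by scott morrison said morrison has received advice",)],
    ["text"])

# Split each document into words, then count occurrences of each word
words = Tokenizer(inputCol="text", outputCol="words").transform(docs)
bow = CountVectorizer(inputCol="words", outputCol="features").fit(words)
bow.transform(words).select("features").show(truncate=False)
# bow.vocabulary maps each index of the sparse count vector back to a word
```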
Convert a Problem into a Classification Problem
• Input
  • How to generate the input feature vectors
• Output
  • Class labels
• Another example: image classification
  • Input: a matrix of RGB values
  • Input features: a color histogram
    • E.g., pixel_count(red) = ?, pixel_count(blue) = ?
  • Output: class labels
    • Building: 1
    • Not building: −1
Supervised Learning
• How to get f(𝐱)?
• In supervised learning, we are given a set of training examples:
  • {(𝐱_i, y_i)}, i = 1, …, n
• Independent and identically distributed (i.i.d.) assumption
  • A critical assumption for machine learning theory
Machine Learning Terminologies
• Supervised learning takes labelled data as input
  • an #instances × #attributes matrix/table
  • #attributes = #features + 1
    • the extra attribute (usually the last) holds the class label
• The labelled data are split into 2 or 3 disjoint subsets (see the sketch below)
  • Training data (used to build a classifier)
  • Development data (used to select a classifier)
  • Testing data (used to evaluate the classifier)
• Output of the classifier
  • Binary classification: #labels = 2
  • Multi-class classification: #labels > 2
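A minimal sketch of the three-way split in PySpark via randomSplit; the DataFrame contents are made up, with an id column standing in for real features:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("split-demo").getOrCreate()

# Hypothetical labelled data: the last column is the class label
df = spark.createDataFrame([(i, float(i % 2)) for i in range(100)],
                           ["id", "label"])

# Three disjoint subsets: roughly 70% train, 15% dev, 15% test
train, dev, test = df.randomSplit([0.7, 0.15, 0.15], seed=42)
```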
Machine Learning Terminologies
• Evaluate the classifier
  • False positive: not politics, but classified as politics
  • False negative: politics, but classified as not politics
  • True positive: politics, and classified as politics
• Precision = TP / (TP + FP)
• Recall = TP / (TP + FN)
• F1 score = 2 · (precision · recall) / (precision + recall)
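These metrics follow directly from the confusion counts; a tiny worked example with made-up counts:

```python
# Hypothetical confusion counts for the "politics" class
tp, fp, fn = 6, 2, 3  # true positives, false positives, false negatives

precision = tp / (tp + fp)                          # 6/8 = 0.75
recall = tp / (tp + fn)                             # 6/9 ≈ 0.67
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.71
print(precision, recall, f1)
```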
Classification: A Two-Step Process
• Classifier construction
  • Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  • The set of tuples used for classifier construction is the training set
  • The classifier is represented as classification rules, decision trees, or mathematical formulae
• Classifier usage: classifying future or unknown objects
  • Estimate the accuracy of the classifier
    • The known label of each test sample is compared with the classifier's prediction
    • The accuracy rate is the percentage of test-set samples correctly classified by the classifier
    • The test set must be independent of the training set; otherwise over-fitting will occur
  • If the accuracy is acceptable, use the classifier to classify data tuples whose class labels are not known
Classification Process 1: Preprocessing and Feature Engineering
[Figure: raw data is preprocessed into training data]
Classification Process 2: Train a Classifier
[Figure: a classification algorithm is applied to the training data to produce a classifier f(𝐱); its predictions (1, 0, 1, 1, 0) on the training data give Precision = 0.66, Recall = 0.66, F1 = 0.66]
Classification Process 3: Evaluate the Classifier
[Figure: the classifier predicts (1, 1, 1, 1, 0) on the test data; comparing against the true labels gives Precision = 75%, Recall = 100%, F1 = 0.86]
How to Judge a Model?
• Based on training error or testing error?
  • Testing error
  • Otherwise, this is a kind of data snooping => overfitting
• What if there are multiple models to choose from?
  • Further split a "development set" from the training set
• Can we trust the error values on the development set?
  • We need a "large" dev set => less data for training
  • Solution: k-fold cross-validation
k-fold Cross-Validation
[Figure: illustration of k-fold cross-validation]
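In PySpark, k-fold cross-validation is provided by CrossValidator: the training data is split into k folds, each candidate model is trained on k−1 folds and scored on the held-out fold, and the averaged scores pick the best model. A minimal sketch with made-up toy data and grid values:

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.linalg import Vectors
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

spark = SparkSession.builder.appName("cv-demo").getOrCreate()

# Toy labelled data
train = spark.createDataFrame(
    [(float(i % 2), Vectors.dense([float(i), float(i % 3)])) for i in range(30)],
    ["label", "features"])

lr = LogisticRegression()
grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()

# 5-fold CV: each candidate is trained on 4 folds and scored on the 5th
cv = CrossValidator(estimator=lr,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(),
                    numFolds=5)
best_model = cv.fit(train).bestModel
```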
Text Classification
• Assigning subject categories, topics, or genres
• Spam detection
• Authorship identification
• Age/gender identification
• Language identification
• Sentiment analysis
• …
• We will do text classification in Project 2
Text Classification: Problem Definition
• Input
  • a document or sentence d
• Output
  • a class label c ∈ {c_1, c_2, …}
• Classification methods:
  • Naïve Bayes
  • Logistic regression
  • Support vector machines
  • …
Naïve Bayes: Intuition
• A simple ("naïve") classification method based on Bayes' rule
• Relies on a very simple representation of the document: bag of words
[Figure: a movie review, "I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet!", reduced to word counts: it 6, I 5, the 4, to 3, and 3, seen 2, yet 1, would 1, whimsical 1, times 1, sweet 1, satirical 1, adventure 1, genre 1, fairy 1, humor 1, have 1, great 1, …]
Naïve Bayes Classifier
• Bayes' rule, for a document d and a class c:
  P(c|d) = P(d|c) P(c) / P(d)
• We want to know which class is most likely:
  c_MAP = argmax_{c∈C} P(c|d)
Naïve Bayes Classifier
• MAP is "maximum a posteriori", i.e., the most likely class:
  c_MAP = argmax_{c∈C} P(c|d)
        = argmax_{c∈C} P(d|c) P(c) / P(d)   (Bayes' rule)
        = argmax_{c∈C} P(d|c) P(c)   (dropping the denominator)
        = argmax_{c∈C} P(x_1, x_2, …, x_n | c) P(c)   (document d represented as features x_1, …, x_n)
• This form has O(|X|^n · |C|) parameters, which could only be estimated if a very, very large number of training examples were available.
Multinomial Naïve Bayes: Independence Assumptions
  P(x_1, x_2, …, x_n | c)
• Bag-of-words assumption: assume word position doesn't matter
• Conditional independence: assume the feature probabilities P(x_i|c_j) are independent given the class c:
  P(x_1, …, x_n | c) = P(x_1|c) · P(x_2|c) · … · P(x_n|c)
Multinomial Naïve Bayes Classifier
  c_MAP = argmax_{c∈C} P(x_1, x_2, …, x_n | c) P(c)
  c_NB = argmax_{c∈C} P(c_j) ∏_{x∈X} P(x|c)
• For text: positions ← all word positions in the test document
  c_NB = argmax_{c∈C} P(c_j) ∏_{i∈positions} P(x_i|c_j)
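In practice the product of many small probabilities underflows floating point, so the standard trick is to maximize the sum of log-probabilities instead. A self-contained sketch with a hypothetical two-class, two-word model:

```python
import math

# Hypothetical learned parameters: log priors and per-class word log-probabilities
log_prior = {"pos": math.log(0.5), "neg": math.log(0.5)}
log_cond = {
    "pos": {"great": math.log(0.3), "boring": math.log(0.1)},
    "neg": {"great": math.log(0.1), "boring": math.log(0.4)},
}

def classify(words):
    # c_NB = argmax_c [ log P(c) + sum over word positions of log P(x_i|c) ]
    scores = {c: log_prior[c] +
                 sum(log_cond[c].get(w, 0.0)  # out-of-vocabulary words add 0, i.e., are skipped
                     for w in words)
              for c in log_prior}
    return max(scores, key=scores.get)

print(classify(["great", "great", "boring"]))  # -> "pos"
```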
Learning the Multinomial Naïve Bayes Model
• First attempt: maximum likelihood estimates
  • simply use the frequencies in the data
  P̂(c_j) = doccount(C = c_j) / N_doc
  P̂(w_i|c_j) = count(w_i, c_j) / Σ_{w∈V} count(w, c_j)
  (the fraction of times word w_i appears among all words in documents of topic c_j)
• Create a mega-document for topic j by concatenating all docs in this topic
  • Use the frequency of w in the mega-document
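A minimal sketch of these maximum likelihood estimates on a made-up four-document training set:

```python
from collections import Counter

# Hypothetical training set: (document tokens, class label)
docs = [(["fun", "couple", "love", "love"], "comedy"),
        (["fast", "furious", "shoot"], "action"),
        (["couple", "fly", "fast", "fun", "fun"], "comedy"),
        (["furious", "shoot", "shoot", "fun"], "action")]

# P(c_j): the fraction of training documents labelled c_j
p_class = {c: n / len(docs) for c, n in Counter(c for _, c in docs).items()}

# P(w_i|c_j): word frequencies inside each per-class "mega-document"
mega = {}
for words, c in docs:
    mega.setdefault(c, Counter()).update(words)
p_word = {c: {w: n / sum(cnt.values()) for w, n in cnt.items()}
          for c, cnt in mega.items()}

print(p_class["comedy"])         # 0.5
print(p_word["comedy"]["love"])  # 2/9 (2 of the 9 words in the comedy mega-document)
```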
Problem with Maximum Likelihood
• What if we have seen no training documents containing the word "fantastic" and classified in the topic positive?
  P̂("fantastic"|positive) = count("fantastic", positive) / Σ_{w∈V} count(w, positive) = 0
• Zero probabilities cannot be conditioned away, no matter the other evidence!
  c_MAP = argmax_c P̂(c) ∏_i P̂(x_i|c)