COMP9313: Big Data Management Classification and PySpark MLlib
PySpark MLlib
• MLlib is Spark's scalable machine learning library, consisting of common learning algorithms and utilities:
  • Basic Statistics
  • Classification
  • Regression
  • Clustering
  • Recommendation Systems
  • Dimensionality Reduction
  • Feature Extraction
  • Optimization
• It is, more or less, a Spark version of scikit-learn
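As a first taste of the API, here is a minimal sketch of training a classifier with the DataFrame-based pyspark.ml package; the toy data, app name, and choice of logistic regression are illustrative only:

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("mllib-intro").getOrCreate()

# Toy labelled data: (class label, feature vector)
train = spark.createDataFrame(
    [(0.0, Vectors.dense([0.0, 1.1])),
     (1.0, Vectors.dense([2.0, 1.0])),
     (0.0, Vectors.dense([0.1, 1.2])),
     (1.0, Vectors.dense([1.9, 0.8]))],
    ["label", "features"])

# Fit a classifier, then apply it back to the same data
model = LogisticRegression(maxIter=10).fit(train)
model.transform(train).select("label", "prediction").show()
```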
Classification
• Classification
  • predicts categorical class labels
  • constructs a model from the training set and the values (class labels) of a classifying attribute, then uses the model to classify new data
• Prediction (a.k.a. regression)
  • models continuous-valued functions, i.e., predicts unknown or missing values
• Applications
  • medical diagnosis
  • credit approval
  • natural language processing
Classification and Regression
• Given a new object o, map it to a feature vector 𝐱 = (x_1, x_2, …, x_d)^T
• Predict the output (class label) y ∈ 𝒴
• Binary classification
  • 𝒴 = {0, 1} (sometimes {−1, 1})
• Multi-class classification
  • 𝒴 = {1, 2, …, C}
• Learn a classification function
  • f(𝐱): ℝ^d ↦ 𝒴
• Regression: f(𝐱): ℝ^d ↦ ℝ
Example of Classification – Text Categorization
• Given: a document or sentence
  • E.g., "A statement released by Scott Morrison said he has received advice … advising the upcoming sitting be cancelled."
• Predict: the topic
  • Pre-defined labels: politics or not?
• How to learn the classification function f(𝐱): ℝ^d ↦ 𝒴?
• How to convert a document to a feature vector 𝐱 ∈ ℝ^d?
• How to convert the pre-defined labels to 𝒴 = {0, 1}?
Example of Classification – Text Categorization
• Input object: a sequence of words
• Input features 𝐱
  • Bag-of-words representation
  • freq(Morrison) = 2, freq(Trump) = 0, …
  • 𝐱 = (2, 1, 0, …)^T
• Class labels 𝒴
  • Politics: 1
  • Not politics: −1
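Such a bag-of-words vector can be built in PySpark with Tokenizer and CountVectorizer. A minimal sketch, reusing the example sentence from the previous slide (column names and app name are made up):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, CountVectorizer

spark = SparkSession.builder.appName("bow-demo").getOrCreate()

docs = spark.createDataFrame(
    [("a statement released by scott morrison said morrison has received advice",)],
    ["text"])

# Split each document into words, then count occurrences of each word
words = Tokenizer(inputCol="text", outputCol="words").transform(docs)
bow = CountVectorizer(inputCol="words", outputCol="features").fit(words)
bow.transform(words).select("features").show(truncate=False)
# bow.vocabulary maps each index of the sparse count vector back to a word
```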
Convert a Problem into a Classification Problem
• Input
  • How to generate the input feature vectors
• Output
  • Class labels
• Another example: image classification
  • Input: a matrix of RGB values
  • Input features: a color histogram
    • E.g., pixel_count(red) = ?, pixel_count(blue) = ?
  • Output: class labels
    • Building: 1
    • Not building: −1
Supervised Learning
• How to get f(𝐱)?
• In supervised learning, we are given a set of training examples:
  • {(𝐱_i, y_i)}, i = 1, …, n
• Independent and identically distributed (i.i.d.) assumption
  • A critical assumption for machine learning theory
Machine Learning Terminologies
• Supervised learning takes labelled data as input
  • an #instances × #attributes matrix/table
  • #attributes = #features + 1
    • the extra attribute (usually the last) holds the class label
• The labelled data are split into 2 or 3 disjoint subsets (see the sketch below)
  • Training data (used to build a classifier)
  • Development data (used to select a classifier)
  • Testing data (used to evaluate the classifier)
• Output of the classifier
  • Binary classification: #labels = 2
  • Multi-class classification: #labels > 2
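A minimal sketch of the three-way split in PySpark via randomSplit; the DataFrame contents are made up, with an id column standing in for real features:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("split-demo").getOrCreate()

# Hypothetical labelled data: the last column is the class label
df = spark.createDataFrame([(i, float(i % 2)) for i in range(100)],
                           ["id", "label"])

# Three disjoint subsets: roughly 70% train, 15% dev, 15% test
train, dev, test = df.randomSplit([0.7, 0.15, 0.15], seed=42)
```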
Machine Learning Terminologies
• Evaluate the classifier
  • False positive: not politics, but classified as politics
  • False negative: politics, but classified as not politics
  • True positive: politics, and classified as politics
• Precision = TP / (TP + FP)
• Recall = TP / (TP + FN)
• F1 score = 2 · (precision · recall) / (precision + recall)
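These metrics follow directly from the confusion counts; a tiny worked example with made-up counts:

```python
# Hypothetical confusion counts for the "politics" class
tp, fp, fn = 6, 2, 3  # true positives, false positives, false negatives

precision = tp / (tp + fp)                          # 6/8 = 0.75
recall = tp / (tp + fn)                             # 6/9 ≈ 0.67
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.71
print(precision, recall, f1)
```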
Classification: A Two-Step Process
• Classifier construction
  • Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  • The set of tuples used for classifier construction is the training set
  • The classifier is represented as classification rules, decision trees, or mathematical formulae
• Classifier usage: classifying future or unknown objects
  • Estimate the accuracy of the classifier
    • The known label of each test sample is compared with the classifier's prediction
    • The accuracy rate is the percentage of test-set samples correctly classified by the classifier
    • The test set must be independent of the training set; otherwise over-fitting will occur
  • If the accuracy is acceptable, use the classifier to classify data tuples whose class labels are not known
Classification Process 1: Preprocessing and Feature Engineering
[Figure: raw data is preprocessed into training data]
Classification Process 2: Train a Classifier
[Figure: a classification algorithm is applied to the training data to produce a classifier f(𝐱); its predictions (1, 0, 1, 1, 0) on the training data give Precision = 0.66, Recall = 0.66, F1 = 0.66]
Classification Process 3: Evaluate the Classifier
[Figure: the classifier predicts (1, 1, 1, 1, 0) on the test data; comparing against the true labels gives Precision = 75%, Recall = 100%, F1 = 0.86]
How to Judge a Model?
• Based on training error or testing error?
  • Testing error
  • Otherwise, this is a kind of data snooping => overfitting
• What if there are multiple models to choose from?
  • Further split a "development set" from the training set
• Can we trust the error values on the development set?
  • We need a "large" dev set => less data for training
  • Solution: k-fold cross-validation
k-fold Cross-Validation
[Figure: illustration of k-fold cross-validation]
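In PySpark, k-fold cross-validation is provided by CrossValidator: the training data is split into k folds, each candidate model is trained on k−1 folds and scored on the held-out fold, and the averaged scores pick the best model. A minimal sketch with made-up toy data and grid values:

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.linalg import Vectors
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

spark = SparkSession.builder.appName("cv-demo").getOrCreate()

# Toy labelled data
train = spark.createDataFrame(
    [(float(i % 2), Vectors.dense([float(i), float(i % 3)])) for i in range(30)],
    ["label", "features"])

lr = LogisticRegression()
grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()

# 5-fold CV: each candidate is trained on 4 folds and scored on the 5th
cv = CrossValidator(estimator=lr,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(),
                    numFolds=5)
best_model = cv.fit(train).bestModel
```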
Text Classification
• Assigning subject categories, topics, or genres
• Spam detection
• Authorship identification
• Age/gender identification
• Language identification
• Sentiment analysis
• …
• We will do text classification in Project 2
Text Classification: Problem Definition
• Input
  • a document or sentence d
• Output
  • a class label c ∈ {c_1, c_2, …}
• Classification methods:
  • Naïve Bayes
  • Logistic regression
  • Support vector machines
  • …
Naïve Bayes: Intuition
• A simple ("naïve") classification method based on Bayes' rule
• Relies on a very simple representation of the document: bag of words
[Figure: a movie review, "I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet!", reduced to word counts: it 6, I 5, the 4, to 3, and 3, seen 2, yet 1, would 1, whimsical 1, times 1, sweet 1, satirical 1, adventure 1, genre 1, fairy 1, humor 1, have 1, great 1, …]
Naïve Bayes Classifier
• Bayes' rule, for a document d and a class c:
  P(c|d) = P(d|c) P(c) / P(d)
• We want to know which class is most likely:
  c_MAP = argmax_{c∈C} P(c|d)
Naïve Bayes Classifier
• MAP is "maximum a posteriori", i.e., the most likely class:
  c_MAP = argmax_{c∈C} P(c|d)
        = argmax_{c∈C} P(d|c) P(c) / P(d)   (Bayes' rule)
        = argmax_{c∈C} P(d|c) P(c)   (dropping the denominator)
        = argmax_{c∈C} P(x_1, x_2, …, x_n | c) P(c)   (document d represented as features x_1, …, x_n)
• This form has O(|X|^n · |C|) parameters, which could only be estimated if a very, very large number of training examples were available.
Multinomial Naïve Bayes: Independence Assumptions
  P(x_1, x_2, …, x_n | c)
• Bag-of-words assumption: assume word position doesn't matter
• Conditional independence: assume the feature probabilities P(x_i|c_j) are independent given the class c:
  P(x_1, …, x_n | c) = P(x_1|c) · P(x_2|c) · … · P(x_n|c)
Multinomial Naïve Bayes Classifier
  c_MAP = argmax_{c∈C} P(x_1, x_2, …, x_n | c) P(c)
  c_NB = argmax_{c∈C} P(c_j) ∏_{x∈X} P(x|c)
• For text: positions ← all word positions in the test document
  c_NB = argmax_{c∈C} P(c_j) ∏_{i∈positions} P(x_i|c_j)
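In practice the product of many small probabilities underflows floating point, so the standard trick is to maximize the sum of log-probabilities instead. A self-contained sketch with a hypothetical two-class, two-word model:

```python
import math

# Hypothetical learned parameters: log priors and per-class word log-probabilities
log_prior = {"pos": math.log(0.5), "neg": math.log(0.5)}
log_cond = {
    "pos": {"great": math.log(0.3), "boring": math.log(0.1)},
    "neg": {"great": math.log(0.1), "boring": math.log(0.4)},
}

def classify(words):
    # c_NB = argmax_c [ log P(c) + sum over word positions of log P(x_i|c) ]
    scores = {c: log_prior[c] +
                 sum(log_cond[c].get(w, 0.0)  # out-of-vocabulary words add 0, i.e., are skipped
                     for w in words)
              for c in log_prior}
    return max(scores, key=scores.get)

print(classify(["great", "great", "boring"]))  # -> "pos"
```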
Learning the Multinomial Naïve Bayes Model
• First attempt: maximum likelihood estimates
  • simply use the frequencies in the data
  P̂(c_j) = doccount(C = c_j) / N_doc
  P̂(w_i|c_j) = count(w_i, c_j) / Σ_{w∈V} count(w, c_j)
  (the fraction of times word w_i appears among all words in documents of topic c_j)
• Create a mega-document for topic j by concatenating all docs in this topic
  • Use the frequency of w in the mega-document
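A minimal sketch of these maximum likelihood estimates on a made-up four-document training set:

```python
from collections import Counter

# Hypothetical training set: (document tokens, class label)
docs = [(["fun", "couple", "love", "love"], "comedy"),
        (["fast", "furious", "shoot"], "action"),
        (["couple", "fly", "fast", "fun", "fun"], "comedy"),
        (["furious", "shoot", "shoot", "fun"], "action")]

# P(c_j): the fraction of training documents labelled c_j
p_class = {c: n / len(docs) for c, n in Counter(c for _, c in docs).items()}

# P(w_i|c_j): word frequencies inside each per-class "mega-document"
mega = {}
for words, c in docs:
    mega.setdefault(c, Counter()).update(words)
p_word = {c: {w: n / sum(cnt.values()) for w, n in cnt.items()}
          for c, cnt in mega.items()}

print(p_class["comedy"])         # 0.5
print(p_word["comedy"]["love"])  # 2/9 (2 of the 9 words in the comedy mega-document)
```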
Problem with Maximum Likelihood
• What if we have seen no training documents containing the word "fantastic" and classified in the topic positive?
  P̂("fantastic"|positive) = count("fantastic", positive) / Σ_{w∈V} count(w, positive) = 0
• Zero probabilities cannot be conditioned away, no matter the other evidence!
  c_MAP = argmax_c P̂(c) ∏_i P̂(x_i|c)