
Introduction in ML with scikit-learn - Professor Patrick McDaniel



  1. Introduction in ML with scikit-learn, Professor Patrick McDaniel, Jonathan Price, Fall 2015

  2. Features • Attributes in a data set • “Individual measurable property of phenomenon being observed” • Choosing/discovering features is a crucial part of ML • Ex: ‣ Character Recognition: histograms of pixels ‣ Speech Recognition: sound length, power, frequency ‣ Malware Detection: function use counts, byte counts
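The character-recognition example above (histograms of pixels) can be made concrete with a short sketch. The 8x8 image below is synthetic, invented purely for illustration, not drawn from any real dataset:

```python
import numpy as np

# Synthetic 8x8 grayscale "character" image with intensities 0-15
# (a stand-in for a real scanned digit)
img = np.arange(64).reshape(8, 8) % 16

# A 16-bin intensity histogram is one simple feature vector:
# each bin counts how many pixels fall in that intensity range.
features, _ = np.histogram(img, bins=16, range=(0, 16))
print(features)        # one count per intensity level
print(features.sum())  # total equals the number of pixels (64)
```

The histogram discards pixel positions but keeps the intensity distribution, which is the trade-off any feature choice makes.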

  3. Supervised Learning • Inferring a function from labeled training data • The features are selected by the developer • As such, it requires the developer to know something about the dataset to infer good features • Based on pairs of input objects and output values • Ex: ‣ Regression – Predict values ‣ Classification – Predict groupings
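The regression/classification split on this slide can be sketched with two tiny scikit-learn estimators. The toy data below is invented for illustration; both estimators learn from labeled (input, output) pairs:

```python
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier

# Labeled training inputs
X = [[0], [1], [2], [3]]

# Regression: predict a continuous value from noisy outputs
reg = LinearRegression().fit(X, [0.1, 0.9, 2.1, 2.9])
print(reg.predict([[4]]))    # a value near 4

# Classification: predict a discrete group
clf = KNeighborsClassifier(n_neighbors=1).fit(X, [0, 0, 1, 1])
print(clf.predict([[2.6]]))  # group 1 (nearest training point is x=3)
```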

  4. Unsupervised Learning • Find hidden structure or patterns in unlabeled data • Requires no prior knowledge of the nature of the data • Not limited by biases inherent in feature selection • Ex: ‣ K-means ‣ Clustering ‣ Neural networks
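As an example of the k-means entry above: the estimator groups unlabeled points purely by proximity; note that fit() receives no target labels at all. The toy points are invented:

```python
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled data: two visually obvious groups, but no labels provided
X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])

# Ask k-means to discover 2 clusters on its own
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)  # first two points share one label, last two the other
```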

  5. Scikit-learn • The easy way to do data mining and data analysis • It’s all Python scripts (yay) • Built on NumPy, SciPy, and matplotlib • Okay, let’s get it: ‣ pip install numpy scipy scikit-learn

  6. Let’s Do One • Classification of digits problem • Classify images of drawn numbers

  7. Before We Start • What can we use about the image of a character to solve this problem?

  8. Dataset • A dataset object in scikit-learn is a dictionary-like object that holds all the data (and some metadata). • The actual data is stored as an (n_samples, n_features) array • Let’s get the digits dataset: >>> from sklearn import datasets >>> digits = datasets.load_digits()

  9. Dataset

  10. Dataset • “digit database by collecting 250 samples from 44 writers. The samples written by 30 writers are used for training, cross-validation and writer dependent testing, and the digits written by the other 14 are used for writer independent testing” • 500 x 500 pixel characters, compressed to form 8x8 images (and then a feature vector of length 64)
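The compression described above ends at small pixel grids; scikit-learn keeps both forms, and each 64-element feature vector is just the corresponding image flattened:

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()
print(digits.images.shape)  # (1797, 8, 8): the 8x8 pixel grids

# Each row of digits.data is the matching image unrolled to length 64
print(np.array_equal(digits.images[0].ravel(), digits.data[0]))
```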

  11. Let’s Do Some Estimating • We’re going to use support vector classification (SVC). We’ll explain later. • This code sets up the classifier clf: >>> from sklearn import svm >>> clf = svm.SVC(gamma=0.001, C=100.) • We will also treat this as a black box and come back to the gamma/C values later
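The slides treat gamma and C as a black box. One common way to choose such values, not shown in the original deck, is a cross-validated grid search; a sketch on a subset of the digits for speed:

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

digits = datasets.load_digits()

# Try a small grid of gamma/C combinations with 3-fold cross-validation
params = {"gamma": [0.0001, 0.001, 0.01], "C": [1.0, 10.0, 100.0]}
search = GridSearchCV(svm.SVC(), params, cv=3)
search.fit(digits.data[:500], digits.target[:500])
print(search.best_params_)  # the best-scoring gamma/C pair
```

Loosely: gamma controls how far a single training example's influence reaches, and C trades margin width against training errors.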

  12. Fit And Predict • To fit the classifier: >>> clf.fit(digits.data[:-1], digits.target[:-1]) • Now, we predict! >>> clf.predict(digits.data[-1:]) array([8]) • Which is apparently this image from before:
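Predicting the single held-out sample above is a smoke test. A more honest evaluation, an addition not in the slides, holds out a proper test set and measures accuracy:

```python
from sklearn import datasets, svm
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

# Same estimator and hyperparameters as the slide
clf = svm.SVC(gamma=0.001, C=100.0)
clf.fit(X_train, y_train)

acc = accuracy_score(y_test, clf.predict(X_test))
print(acc)  # typically around 0.99 for this setup
```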

  13. It’s (Sort of) That Easy! • We glossed over a couple of details, but this shows how easy scikit-learn makes the actual implementation • Let’s talk about some of the concepts we skipped over earlier

  14. SVCs • We are NOT going into implementation details. • Used for classification, regression, and detecting outliers • Advantages: ‣ Works in high-dimensional spaces ‣ Memory efficient ‣ Versatile • Disadvantages: ‣ Bad when # of features > # of samples ‣ Don’t directly provide probability estimates
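On that last disadvantage: SVC can still emit class probabilities, but only via an extra calibration step enabled with probability=True, which runs internal cross-validation and is therefore slower. A sketch on a subset of the digits:

```python
from sklearn import datasets, svm

digits = datasets.load_digits()

# probability=True bolts probability calibration onto the SVC
clf = svm.SVC(gamma=0.001, C=100.0, probability=True)
clf.fit(digits.data[:500], digits.target[:500])

proba = clf.predict_proba(digits.data[500:501])
print(proba.shape)  # (1, 10): one probability per digit class
print(proba.sum())  # each row sums to 1
```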

  15. SVC: Graphically

  16. Next Week • Next, we will go over a security usage of data analysis: a malware classification Kaggle challenge from Microsoft • See the course site for supplemental readings and setup instructions
