Introduction in ML with scikit-learn, Professor Patrick McDaniel

SLIDE 1

Introduction in ML with scikit-learn

Professor Patrick McDaniel
Jonathan Price
Fall 2015

SLIDE 2

Features

  • Attributes in a data set
  • “Individual measurable property of phenomenon being observed”
  • Choosing/discovering features is a crucial part of ML
  • Ex:
      • Character Recognition: histograms of pixels
      • Speech Recognition: sound length, power, frequency
      • Malware Detection: function use counts, byte counts
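
As a rough illustration of the "byte counts" feature idea, the sketch below turns a raw byte string into a fixed-length feature vector. The function name `byte_count_features` and the sample bytes are made up for this example; a real pipeline would read the bytes from files and would likely combine several kinds of features.

```python
import numpy as np


def byte_count_features(data: bytes) -> np.ndarray:
    """Return a length-256 vector: how often each byte value occurs (hypothetical helper)."""
    # Interpret the raw bytes as unsigned 8-bit integers.
    values = np.frombuffer(data, dtype=np.uint8)
    # bincount tallies occurrences; minlength pads the result to all 256 byte values.
    return np.bincount(values, minlength=256)


sample = b"\x00\x00\x90\x90\x90\xff"  # made-up bytes, not taken from any real program
features = byte_count_features(sample)
print(features.shape)   # one feature per possible byte value
print(features[0x90])   # how many times byte 0x90 occurs in the sample
```

Every input, whatever its length, maps to a vector of the same size, which is what most scikit-learn estimators expect.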

SLIDE 3

Supervised Learning

  • Inferring a function from labeled training data
  • The features are selected by the developer
  • As such, it requires the developer to know something about the dataset to infer good features
  • Based on pairs of input objects and output values
  • Ex:
      • Regression: predict continuous values
      • Classification: predict groupings (discrete classes)
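
The two example tasks can be sketched on toy data as follows. The data points are invented for illustration, and `LinearRegression` and `KNeighborsClassifier` are just two convenient scikit-learn estimators chosen for this sketch, not ones the slides prescribe.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier

# Regression: each input is paired with a continuous output (here y = 2x + 1).
X_reg = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y_reg = 2.0 * X_reg.ravel() + 1.0
reg = LinearRegression().fit(X_reg, y_reg)
reg_pred = reg.predict(np.array([[5.0]]))
print(reg_pred)

# Classification: each input is paired with a group label (0 or 1).
X_clf = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_clf = np.array([0, 0, 0, 1, 1, 1])
knn = KNeighborsClassifier(n_neighbors=1).fit(X_clf, y_clf)
clf_pred = knn.predict(np.array([[0.1, 0.1], [5.0, 5.1]]))
print(clf_pred)
```

Both estimators share the same `fit(X, y)` / `predict(X)` interface; only the kind of output value differs.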

SLIDE 4

Unsupervised Learning

  • Find hidden structure or patterns in unlabeled data
  • Requires no prior knowledge of the nature of the data
  • Not limited by biases inherent in feature selection
  • Ex:
      • Clustering (e.g., K-means)
      • Neural networks (some types, e.g., autoencoders)
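
A minimal K-means sketch, assuming two well-separated groups of made-up 2-D points. No labels are passed to `fit`; the algorithm only sees the points. The values of `n_init` and `random_state` are arbitrary choices made here for reproducibility.

```python
import numpy as np
from sklearn.cluster import KMeans

# Six unlabeled points forming two obvious groups (illustrative data).
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [8.0, 8.0], [8.1, 7.9], [7.9, 8.2]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_
print(labels)                   # cluster index assigned to each point
print(kmeans.cluster_centers_)  # one center per discovered cluster
```

Which group receives index 0 and which receives index 1 is arbitrary; only the grouping itself is meaningful.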

SLIDE 5

Scikit-learn

  • The easy way to do data mining and data analysis
  • It's all Python scripts (yay)
  • Built on NumPy, SciPy, and matplotlib
  • Okay, let's get it:
      • pip install numpy scipy scikit-learn
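
One quick way to confirm the install worked is to import the packages and print their versions. Note that the pip package `scikit-learn` is imported under the name `sklearn`.

```python
import numpy
import scipy
import sklearn  # installed by pip as "scikit-learn"

versions = {
    "numpy": numpy.__version__,
    "scipy": scipy.__version__,
    "scikit-learn": sklearn.__version__,
}
for name, version in versions.items():
    print(f"{name}: {version}")
```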

SLIDE 6

Let's Do One

  • Classification of digits problem
  • Classify images of drawn numbers

SLIDE 7

Before We Start

  • What can we use about the image of a character to solve this problem?

SLIDE 8

Dataset

  • A dataset object in scikit-learn is a dictionary-like object that holds all the data (and some metadata).
  • The actual data is stored as an (n_samples, n_features) array
  • Let's get the digit dataset:

>>> from sklearn import datasets
>>> digits = datasets.load_digits()

SLIDE 9

Dataset
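
The object returned by `datasets.load_digits()` can be inspected directly. The short sketch below lists its keys and prints the array shapes, following the (n_samples, n_features) layout.

```python
from sklearn import datasets

digits = datasets.load_digits()

print(sorted(digits.keys()))  # dictionary-like access to data and metadata
print(digits.data.shape)      # (n_samples, n_features)
print(digits.target.shape)    # one label per sample
print(digits.images.shape)    # the same samples as 2-D images
```

`digits.data` is what gets passed to an estimator; `digits.target` holds the digit (0 to 9) each sample represents.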

SLIDE 10

Dataset

  • “digit database by collecting 250 samples from 44 writers. The samples written by 30 writers are used for training, cross-validation and writer dependent testing, and the digits written by the other 14 are used for writer independent testing”
  • 500 x 500 pixel characters, compressed to form a small image (and then a feature vector of length 64)
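
To see how the small image relates to the length-64 feature vector, the sketch below compares one entry of `digits.images` with the matching row of `digits.data` in the copy of the dataset bundled with scikit-learn.

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()

first_image = digits.images[0]   # 2-D grid of pixel intensities
first_vector = digits.data[0]    # the same sample as a flat feature vector

print(first_image.astype(int))   # print the grid as integers for readability

# Flattening the grid row by row gives exactly the feature vector.
flattened = first_image.reshape(-1)
same = np.array_equal(flattened, first_vector)
print(first_image.shape, first_vector.shape, same)
```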

SLIDE 11

Let's Do Some Estimating

  • We’re going to use support vector classification (SVC). We’ll explain later.
  • This code sets up the classifier clf:

>>> from sklearn import svm
>>> clf = svm.SVC(gamma=0.001, C=100.)

  • We will also treat this as a black box and come back to the gamma/C values later
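
As a rough sketch of why these values matter: `gamma` is the kernel coefficient of the default RBF kernel, and `C` is the regularization parameter (regularization strength is inversely proportional to `C`). The comparison below cross-validates two `gamma` settings; the second value (1.0) and the choice of 3 folds are arbitrary picks for illustration, not recommendations from the slides.

```python
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

digits = datasets.load_digits()

scores = {}
for gamma in (0.001, 1.0):
    clf = svm.SVC(gamma=gamma, C=100.)
    # Mean accuracy over 3 cross-validation folds for this gamma.
    scores[gamma] = cross_val_score(clf, digits.data, digits.target, cv=3).mean()

for gamma, score in scores.items():
    print(f"gamma={gamma}: mean accuracy {score:.3f}")
```

On this dataset the small `gamma` does far better. A systematic search over both parameters could use `sklearn.model_selection.GridSearchCV`.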

SLIDE 12

Fit And Predict

  • To fit the classifier:

>>> clf.fit(digits.data[:-1], digits.target[:-1])

  • Now, we predict!

>>> clf.predict(digits.data[-1:])  # slice keeps the 2-D shape that predict expects
array([8])

  • Which is apparently the image we saw before
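
Predicting a single sample says little about how well the classifier works in general. A common check, sketched here with an arbitrary 25% hold-out split and a fixed `random_state` (both assumptions of this example), is to train on one part of the data and score on the rest.

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()

# Hold out a quarter of the samples; the classifier never sees them during fit.
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

clf = svm.SVC(gamma=0.001, C=100.)
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)  # fraction of held-out digits classified correctly
print(f"held-out accuracy: {accuracy:.3f}")
```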

SLIDE 13

It's (Sort of) That Easy!

  • We glossed over a couple of details, but this shows how easy scikit-learn makes the actual implementation
  • Let's talk about some of the concepts we skipped over earlier

SLIDE 14

SVCs

  • We are NOT going into implementation details.
  • Used for classification, regression, and detecting outliers
  • Advantages:
      • Works in high-dimensional spaces
      • Memory efficient
      • Versatile
  • Disadvantages:
      • Prone to overfitting when # of features is much greater than # of samples
      • Don’t directly provide probability estimates
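
To make the last point concrete: an `SVC` natively produces decision scores, not probabilities. The sketch below, which trains on the first 500 digits only to keep it quick (an arbitrary choice), shows the raw scores and then the probability estimates scikit-learn can compute when `probability=True` is set, at the cost of extra internal cross-validation during `fit`.

```python
from sklearn import datasets, svm

digits = datasets.load_digits()
X_train, y_train = digits.data[:500], digits.target[:500]
X_new = digits.data[-5:]

# probability=True asks scikit-learn to fit an extra calibration step.
clf = svm.SVC(gamma=0.001, C=100., probability=True, random_state=0)
clf.fit(X_train, y_train)

scores = clf.decision_function(X_new)  # raw per-class scores, not probabilities
proba = clf.predict_proba(X_new)       # calibrated estimates, one row per sample

print(scores.shape, proba.shape)
print(proba.sum(axis=1))               # each row of estimates sums to 1
```

Because the probabilities come from a separate calibration step, they are estimates layered on top of the classifier and may not always agree exactly with `predict`.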

SLIDE 15

SVC: Graphically

SLIDE 16

Next Week

  • Next, we will go over a security use of data analysis: a malware classification Kaggle challenge from Microsoft
  • See the course site for supplemental readings and setup instructions