Machine Learning: Algorithms and Applications

Floriano Zini
Free University of Bozen-Bolzano, Faculty of Computer Science
Academic Year 2011-2012
Lecture 11: 21 May 2012

Unsupervised Learning (cont…)

Slides courtesy of Bing Liu: www.cs.uic.edu/~liub/WebMiningBook.html


Road map

n Basic concepts n K-means algorithm n Representation of clusters n Hierarchical clustering n Distance functions n Data standardization n Handling mixed attributes n Which clustering algorithm to use? n Cluster evaluation n Summary

Mixed attributes

n The distance functions we have seen are for

data with all numeric attributes, or all nominal attributes, etc.

n In many practical cases data has different types

  • f attributes, from the following 6:

q interval-scaled q ratio-scaled q symmetric binary q asymmetric binary q nominal q ordinal

n Clustering a data set involving mixed attributes is

a challenging problem


Convert to a single type

n One common way of dealing with mixed

attributes is to:

1.

Choose a dominant attribute type

2.

Convert the other types to this type

n E.g., if most attributes in a data set are

interval-scaled

q we convert ordinal attributes and ratio-scaled

attributes to interval-scaled attributes

q it is also appropriate to treat symmetric binary

attributes as interval-scaled attributes

Convert to a single type (cont…)

• It does not make much sense to convert a nominal attribute or an asymmetric binary attribute to an interval-scaled attribute
  ◦ but it is frequently done in practice by assigning numbers to them according to some hidden ordering, e.g., prices of the fruits
• Alternatively, a nominal attribute can be converted to a set of (symmetric) binary attributes, which are then treated as numeric attributes, as sketched below
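A minimal sketch of this conversion, assuming a plain Python list of attribute values (the attribute and its values are made up for illustration):

```python
def nominal_to_binary(values):
    """Convert a nominal attribute into a set of symmetric binary attributes."""
    categories = sorted(set(values))  # fix an order for the new binary attributes
    return [[1 if v == c else 0 for c in categories] for v in values]

colors = ["red", "green", "red", "blue"]
print(nominal_to_binary(colors))
# [[0, 0, 1], [0, 1, 0], [0, 0, 1], [1, 0, 0]]  (categories: blue, green, red)
```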


Combining individual distances

• This approach computes individual attribute distances and then combines them
• A combination formula, proposed by Gower, is

  $$\mathrm{dist}(\mathbf{x}_i, \mathbf{x}_j) = \frac{\sum_{f=1}^{r} \delta_{ij}^{f} \, d_{ij}^{f}}{\sum_{f=1}^{r} \delta_{ij}^{f}} \qquad (4)$$

  ◦ The distance dist(x_i, x_j) is between 0 and 1
  ◦ r is the number of attributes
  ◦ d_ij^f is the distance contributed by attribute f, in the range [0, 1]
  ◦ δ_ij^f is an indicator:

  $$\delta_{ij}^{f} = \begin{cases} 1 & \text{if } x_{if} \text{ and } x_{jf} \text{ are not missing} \\ 0 & \text{if } x_{if} \text{ or } x_{jf} \text{ is missing} \\ 0 & \text{if attribute } f \text{ is asymmetric binary and } x_{if} = x_{jf} = 0 \end{cases}$$

Combining individual distances (cont…)

• If f is a binary or nominal attribute:

  $$d_{ij}^{f} = \begin{cases} 1 & \text{if } x_{if} \neq x_{jf} \\ 0 & \text{otherwise} \end{cases}$$

  ◦ distance (4) then reduces to:
    - equation (3) of lecture 10 if all attributes are nominal
    - the simple matching distance (1) of lecture 10 if all attributes are symmetric binary
    - the Jaccard distance (2) of lecture 10 if all attributes are asymmetric binary
• If f is interval-scaled:

  $$d_{ij}^{f} = \frac{|x_{if} - x_{jf}|}{R_f}, \qquad R_f = \max(f) - \min(f)$$

  ◦ R_f is the value range of f
  ◦ If all attributes are interval-scaled, distance (4) reduces to the Manhattan distance, assuming all attribute values are standardized
• Ordinal and ratio-scaled attributes are converted to interval-scaled attributes and handled in the same way
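A minimal sketch of formula (4), assuming each attribute is tagged with one of the types "interval", "nominal", "symmetric", or "asymmetric" (the tagging scheme and names are illustrative, not from the slides):

```python
def gower_distance(xi, xj, attr_types, ranges):
    """dist(xi, xj) = sum_f(delta_f * d_f) / sum_f(delta_f), formula (4)."""
    num, den = 0.0, 0.0
    for f, (a, b) in enumerate(zip(xi, xj)):
        if a is None or b is None:                 # delta_f = 0: missing value
            continue
        if attr_types[f] == "asymmetric" and a == 0 and b == 0:
            continue                               # delta_f = 0: both values absent
        if attr_types[f] == "interval":
            d = abs(a - b) / ranges[f]             # |x_if - x_jf| / R_f
        else:                                      # nominal or binary attribute
            d = 0.0 if a == b else 1.0
        num += d                                   # delta_f = 1 from here on
        den += 1.0
    return num / den if den > 0 else 0.0

# One interval attribute with range 10, one nominal, one asymmetric binary
xi, xj = (3.0, "red", 0), (7.0, "red", 0)
print(gower_distance(xi, xj, ["interval", "nominal", "asymmetric"], {0: 10.0}))
# (0.4 + 0.0) / 2 = 0.2  (the 0/0 asymmetric attribute is skipped)
```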



How to choose a clustering algorithm

n Clustering research has a long history

q A vast collection of algorithms are available q We only introduced several main algorithms

n Choosing the “best” algorithm is challenging

q Every algorithm has limitations and works well with certain

data distributions

q It is very hard, if not impossible, to know what distribution

the application data follow

n

The data may not fully follow any “ideal” structure or distribution required by the algorithms

q One also needs to decide how to standardize the data, to

choose a suitable distance function and to select other parameter values


How to choose a clustering algorithm (cont…)

• Due to these complexities, the common practice is to:
  1. run several algorithms using different distance functions and parameter settings
  2. carefully analyze and compare the results
• The interpretation of the results must be based on:
  ◦ insight into the meaning of the original data
  ◦ knowledge of the algorithms used
• Clustering is highly application dependent and to a certain extent subjective (personal preferences)



Cluster Evaluation: hard problem

• The quality of a clustering is very hard to evaluate because we do not know the correct clusters
• Some methods are used:
  ◦ User inspection
    - A panel of experts inspects the resulting clusters and scores them
    - Study centroids and spreads
    - Examine rules (e.g., from a decision tree) that describe the clusters
    - For text documents, one can inspect by reading
    - The final score is the average of the individual scores
• Manual inspection is labor intensive and time consuming

Cluster evaluation: ground truth

• We use some labeled data (for classification)
  ◦ Assumption: each class is a cluster
• Let the classes in the data D be C = (c_1, c_2, …, c_k)
  ◦ The clustering method produces k clusters, which divide D into k disjoint subsets D_1, D_2, …, D_k
• After clustering, a confusion matrix is constructed
  ◦ From the matrix, we compute various measures: entropy, purity, precision, recall, and F-score


Evaluation measures: Entropy

• For each cluster D_i, we can measure the entropy as

  $$\mathrm{entropy}(D_i) = -\sum_{j=1}^{k} \Pr_i(c_j) \log_2 \Pr_i(c_j)$$

  ◦ Pr_i(c_j): proportion of class c_j in cluster D_i
• The entropy of the whole clustering is

  $$\mathrm{entropy}_{total}(D) = \sum_{i=1}^{k} \frac{|D_i|}{|D|} \, \mathrm{entropy}(D_i)$$

  ◦ |D_i|/|D| is the weight of cluster D_i, proportional to its size
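A minimal sketch of both formulas, reading the clustering as a list of per-cluster class counts (the counts are made up for illustration):

```python
import math

def cluster_entropy(class_counts):
    """entropy(D_i) = -sum_j Pr_i(c_j) * log2 Pr_i(c_j)."""
    size = sum(class_counts)
    return -sum((n / size) * math.log2(n / size) for n in class_counts if n > 0)

def total_entropy(clusters):
    """Size-weighted sum of the per-cluster entropies."""
    total = sum(sum(c) for c in clusters)
    return sum((sum(c) / total) * cluster_entropy(c) for c in clusters)

# Rows = clusters D_1..D_3, columns = class counts within each cluster
print(total_entropy([[45, 3, 2], [5, 40, 5], [0, 7, 43]]))
```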

Evaluation measures: purity

• Purity measures the extent to which a cluster contains only one class of data

  $$\mathrm{purity}(D_i) = \max_j \Pr_i(c_j)$$

• The purity of the whole clustering is

  $$\mathrm{purity}_{total}(D) = \sum_{i=1}^{k} \frac{|D_i|}{|D|} \, \mathrm{purity}(D_i)$$

  ◦ |D_i|/|D| is the weight of cluster D_i, proportional to its size
• Precision, recall, and F-measure can be computed as well, based on the class that is most frequent in the cluster
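The same sketch extended to purity (same made-up counts as in the entropy example):

```python
def cluster_purity(class_counts):
    """purity(D_i) = proportion of the most frequent class in the cluster."""
    return max(class_counts) / sum(class_counts)

def total_purity(clusters):
    """Size-weighted sum of the per-cluster purities."""
    total = sum(sum(c) for c in clusters)
    return sum((sum(c) / total) * cluster_purity(c) for c in clusters)

print(total_purity([[45, 3, 2], [5, 40, 5], [0, 7, 43]]))
# (50/150)*0.90 + (50/150)*0.80 + (50/150)*0.86 = 0.8533...
```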


An example

• We can use the total entropy or purity to compare:
  ◦ different clustering results from the same algorithm
  ◦ different algorithms
• Precision, recall, and F-measure can be computed as well for each cluster
  ◦ E.g., the precision of class Science in cluster 1 is 0.89 and the recall is 0.83; the F-measure is thus 2 × 0.89 × 0.83 / (0.89 + 0.83) ≈ 0.86
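A minimal sketch of the per-cluster computation; the raw counts are hypothetical, chosen only to reproduce the 0.89/0.83 figures above:

```python
def precision_recall_f(n_class_in_cluster, cluster_size, n_class_in_data):
    """Precision/recall/F-measure of one class within one cluster."""
    p = n_class_in_cluster / cluster_size     # fraction of the cluster that is this class
    r = n_class_in_cluster / n_class_in_data  # fraction of the class captured by the cluster
    return p, r, 2 * p * r / (p + r)          # harmonic mean = F-measure

print(precision_recall_f(89, 100, 107))  # ~(0.89, 0.83, 0.86)
```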

A remark about ground truth evaluation

• Commonly used to compare different clustering algorithms
• A real-life data set for clustering has no class labels
  ◦ Thus, although an algorithm may perform very well on some labeled data sets, there is no guarantee that it will perform well on the actual application data at hand
• The fact that it performs well on some labeled data sets does give us some confidence in the quality of the algorithm
• This evaluation method is said to be based on external data or information


Evaluation based on internal information

• Intra-cluster cohesion (compactness):
  ◦ Cohesion measures how near the data points in a cluster are to the cluster centroid
  ◦ The sum of squared error (SSE) is a commonly used measure
• Inter-cluster separation (isolation):
  ◦ Separation means that different cluster centroids should be far away from one another (both measures are sketched below)
• In most applications, expert judgments are still the key
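A minimal sketch of the two measures on plain Python point lists (data and function names are illustrative):

```python
def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    return [sum(p[d] for p in points) / len(points) for d in range(len(points[0]))]

def sse(clusters):
    """Cohesion: sum of squared distances of points to their cluster centroid."""
    total = 0.0
    for points in clusters:
        c = centroid(points)
        total += sum(sum((x - m) ** 2 for x, m in zip(p, c)) for p in points)
    return total

def min_centroid_separation(clusters):
    """Isolation: smallest squared distance between any two centroids."""
    cs = [centroid(points) for points in clusters]
    return min(sum((a - b) ** 2 for a, b in zip(cs[i], cs[j]))
               for i in range(len(cs)) for j in range(i + 1, len(cs)))

clusters = [[(1.0, 1.0), (1.2, 0.8)], [(5.0, 5.0), (5.1, 4.9)]]
print(sse(clusters), min_centroid_separation(clusters))  # low SSE, large separation
```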

Indirect evaluation

• In some applications, clustering is not the primary task but is used to help perform another task
• We can use the performance on the primary task to compare clustering methods
• For instance, in an application the primary task is to provide book purchasing recommendations to online shoppers
  ◦ If we can cluster shoppers according to their features, we might be able to provide better recommendations
  ◦ We can evaluate different clustering algorithms based on how well they help with the recommendation task
  ◦ Here, we assume that the recommendations can be reliably evaluated



Summary

• Clustering has a long history and is still active
  ◦ There are a huge number of clustering algorithms
  ◦ More are still coming every year
• We only introduced several main algorithms; there are many others, e.g.:
  ◦ density-based algorithms, subspace clustering, scale-up methods, neural-network-based methods, fuzzy clustering, co-clustering, etc.
• Clustering is hard to evaluate, but very useful in practice
  ◦ This partially explains why a large number of clustering algorithms are still being devised every year
• Clustering is highly application dependent and to some extent subjective


Reinforcement Learning

These slides are an adaptation of slides drawn by Tom Mitchell and modified by Liviu Ciortuz

Introduction

• Supervised learning is the simplest and most studied type of learning
• How can an agent learn behaviors when it doesn't have a teacher to tell it how to perform?
  ◦ The agent has a task to perform
  ◦ It takes some actions in the world
  ◦ At some later point, it gets feedback telling it how well it did on performing the task
  ◦ The agent performs the same task over and over again
• This problem is called reinforcement learning:
  ◦ The agent gets positive reward for tasks done well
  ◦ The agent gets negative reward for tasks done poorly


Introduction (cont…)

• The goal is to get the agent to act in the world so as to maximize its rewards
• The agent has to figure out what it did that made it get the reward/punishment
  ◦ This is known as the credit assignment problem
• Reinforcement learning can be used to train computers to do many tasks, such as:
  ◦ playing board games
  ◦ job shop scheduling
  ◦ controlling robots
  ◦ flight/taxi scheduling
  ◦ …

Overview

• Task: control learning
  ◦ make an autonomous agent (robot) perform actions, observe the consequences, and learn a control strategy
• The Q learning algorithm
  ◦ acquires optimal control strategies from delayed rewards, even when the agent has no prior knowledge of the effect of its actions on the environment (see the sketch below)
• Reinforcement learning is related to dynamic programming, which is used to solve optimization problems
  ◦ While DP assumes that the agent/program knows the effect (and rewards) of all its actions, in RL the agent has to experiment in the real world
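The algorithm itself is covered later in the course; as a preview, here is a minimal tabular Q-learning sketch, assuming a hypothetical environment function step(s, a) that returns (next_state, reward):

```python
import random

def q_learning(step, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.9, epsilon=0.1, start_state=0):
    """Learn a table Q(s, a) from delayed rewards by acting and observing."""
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s = start_state
        for _ in range(100):                      # cap the episode length
            # epsilon-greedy: mostly exploit current estimates, sometimes explore
            if random.random() < epsilon:
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda x: Q[s][x])
            s2, r = step(s, a)                    # act; the environment answers
            # move Q(s, a) toward the observed reward plus discounted future value
            Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
            s = s2
    return Q
```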


Reinforcement Learning Problem

Example: play Backgammon (TD-Gammon [Tesauro, 1995]); immediate reward:
  ◦ +100 if win
  ◦ -100 if lose
  ◦ 0 otherwise

• Target function to learn: $\pi : S \to A$
• Goal: maximize the discounted sum of rewards

  $$r_0 + \gamma r_1 + \gamma^2 r_2 + \dots, \qquad 0 \le \gamma < 1$$
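A minimal sketch of the quantity being maximized (the reward sequence is illustrative):

```python
def discounted_return(rewards, gamma=0.9):
    """r_0 + gamma*r_1 + gamma^2*r_2 + ..., with 0 <= gamma < 1."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Backgammon-style rewards: nothing until the final win (+100)
print(discounted_return([0, 0, 0, 100], gamma=0.9))  # 0.9**3 * 100 = 72.9
```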

Control learning characteristics

Learning Sequential Control Strategies Using Markov Decision Processes

Agent’s Learning Task