Statistical Natural Language Processing: Unsupervised learning
Çağrı Çöltekin
University of Tübingen, Seminar für Sprachwissenschaft
Summer Semester 2017

Outline: Recap – Clustering – PCA – Autoencoders – Practical matters

Recap: supervised learning
• The methods we studied so far are instances of supervised learning
• In supervised learning, we have a set of predictors x, and want to predict a response or outcome variable y
• During training, we have both input and output variables
• Training consists of estimating the parameters w of a model
• During prediction, we are given x and make predictions based on the model we learned

Supervised learning: regression
• The response (outcome) variable (y) is a quantitative variable
• Given the features (x), we want to predict the value of y
[Figure: data points plotted against x and y, with the value of y to be predicted from x]

Supervised learning: classification
• The response (outcome) is a label; in the example, positive (+) or negative (−)
• Given the features (x₁ and x₂), we want to predict the label of an unknown instance (?)
[Figure: + and − instances in the x₁–x₂ plane, with an unlabelled instance marked ?]

Supervised learning: estimating parameters
• Most models/methods estimate a set of parameters w
• Often we find the parameters that minimize a loss function during training
  – For least-squares regression: J(w) = ∑ᵢ (ŷᵢ − yᵢ)² + ∥w∥
  – For logistic regression, the negative log likelihood: J(w) = −log L(w) + ∥w∥
• If the loss function is convex, we can find a global minimum using analytic solutions; otherwise we use search methods such as gradient descent

Unsupervised learning
• In unsupervised learning, we do not have labels
• Our aim is to find useful patterns/structure in the data
• Typical unsupervised methods include
  – Clustering: find related groups of instances
  – Dimensionality reduction: find an accurate/useful lower-dimensional representation of the data
  – Density estimation: find a probability distribution that explains the data
• All can be cast as graphical models with hidden variables
• Evaluation is difficult: we do not have ‘true’ labels/values

Models with hidden variables
[Figure: a hidden Markov model with hidden states q₀, q₁, q₂, q₃, q₄, …, q_T and observations o₁, o₂, o₃, o₄, …, o_T]
• HMMs, or other models with hidden variables, can be learned without labels
• Unsupervised learning is essentially learning the hidden variables

Clustering: why do we do it?
• The aim is to find groups of instances/items that are similar to each other
• Applications include
  – Clustering documents, e.g., news into topics (see the sketch below)
  – Clustering words, e.g., for better parsing
  – Clustering (literary) texts, e.g., for authorship attribution
  – Clustering languages or dialects for determining their relations
  – …
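As a concrete illustration of the first application above, the following is a minimal sketch of clustering a handful of documents into topics. It assumes scikit-learn is available; the example texts, the tf-idf representation, and the choice of two clusters are assumptions made for the illustration (k-means itself is introduced later in this lecture).

```python
# Hypothetical example: clustering news-like documents into topics.
# Assumes scikit-learn is installed; texts and cluster count are made up.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "the striker scored two goals in the final",
    "the coach praised the team after the match",
    "parliament passed the new budget bill",
    "the minister announced tax reforms today",
]

# Represent each document as a tf-idf vector ...
X = TfidfVectorizer().fit_transform(docs)

# ... and group the documents into 2 clusters.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # e.g. [0 0 1 1]: sports vs. politics documents
```

Tf-idf vectors are only one possible representation; any vector representation with a sensible distance measure (discussed next) would do.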
Clustering
• Unlike classification, we do not have labels
• We want to find ‘natural’ groups in the data
• Clustering can be hierarchical or non-hierarchical
• Clustering can be bottom-up (agglomerative) or top-down (divisive)
• For most (useful) problems we cannot find globally optimum solutions; we often rely on greedy algorithms that find a local minimum
• The measure of distance or similarity between the items is important

Clustering in two-dimensional space
[Figure: unlabelled data points in the x₁–x₂ plane forming a few visible groups]
• Intuitively, similar or closer data points are grouped together

Similarity and distance
• The notion of distance (similarity) is important in clustering. A distance measure D
  – is symmetric: D(a, b) = D(b, a)
  – is non-negative: D(a, b) ⩾ 0 for all a, b, and D(a, b) = 0 iff a = b
  – obeys the triangle inequality: D(a, b) + D(b, c) ⩾ D(a, c)
• The choice of distance is application specific
• We will often be faced with defining distance measures between linguistic units (letters, words, sentences, documents, …)

Distance measures in Euclidean space
[Figure: two points in the x₁–x₂ plane illustrating the distances between them]
• Euclidean distance: ∥a − b∥ = √(∑_{j=1}^{k} (a_j − b_j)²)
• Manhattan distance: ∥a − b∥₁ = ∑_{j=1}^{k} |a_j − b_j|

How to do clustering
• Most clustering algorithms try to minimize the scatter within each cluster:
    ½ ∑_{k=1}^{K} ∑_{C(a)=k} ∑_{C(b)=k} d(a, b)
• This is equivalent to maximizing the scatter between clusters:
    ½ ∑_{k=1}^{K} ∑_{C(a)=k} ∑_{C(b)≠k} d(a, b)

K-means clustering
K-means is a popular method for clustering:
1. Randomly choose K centroids, m₁, …, m_K, representing K clusters
2. Repeat until convergence:
   – Assign each data point to the cluster of the nearest centroid
   – Re-calculate the centroid locations based on the assignments
• Effectively, we are finding a local minimum of the sum of squared Euclidean distances within each cluster:
    ½ ∑_{k=1}^{K} ∑_{C(a)=k} ∑_{C(b)=k} ∥a − b∥²
• Note the similarity with the EM algorithm
(A minimal implementation sketch follows the visualization below.)

K-means clustering: visualization
[Figure: a sequence of panels over a 2D data set: the data; cluster centroids set randomly; data points assigned to the closest centroid; centroids recalculated — the last two steps repeated until convergence]
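The two-step loop on the K-means slide can be written down directly. The following is a minimal NumPy sketch; the function name, the convergence test, and the toy data are my own choices for illustration, not the course's reference implementation.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain k-means: the assign/re-estimate loop from the slides.

    X: (n, d) array of data points; k: number of clusters.
    Returns (centroids, assignments).
    """
    rng = np.random.default_rng(seed)
    # 1. Choose K centroids randomly (here: K distinct data points).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assign each point to the cluster of the nearest centroid
        # (squared Euclidean distance).
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assignments = dists.argmin(axis=1)
        # Re-calculate the centroid locations from the assignments.
        new_centroids = np.array([
            X[assignments == j].mean(axis=0) if np.any(assignments == j)
            else centroids[j]              # keep an empty cluster in place
            for j in range(k)
        ])
        # 2. Repeat until convergence (centroids stop moving).
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, assignments

# Tiny usage example with two obvious groups.
X = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
              [4.0, 4.1], [4.2, 3.9], [3.9, 4.2]])
centroids, labels = kmeans(X, k=2)
print(labels)    # e.g. [0 0 0 1 1 1] (cluster ids may be swapped)
```

Like the algorithm on the slide, this only finds a local minimum of the within-cluster squared Euclidean distance; in practice it is usually restarted from several random initializations.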