Data pre-processing for k-means clustering, Karolis Urbonas

DataCamp Customer Segmentation in Python
Data pre-processing for k-means clustering
Karolis Urbonas, Head of Data Science, Amazon

Advantages of k-means clustering
- One of the most popular unsupervised learning methods
- Simple and fast
- Works well*

* with certain assumptions about the data

Key k-means assumptions
- Symmetric distribution of variables (not skewed)
- Variables with the same average values
- Variables with the same variance

Skewed variables
[Figure panels: Left-skewed, Right-skewed]

Skewed variables
- Skew removed with logarithmic transformation

Variables on the same scale

    datamart_rfm.describe()

- K-means assumes equal mean
- and equal variance
- This is not the case with RFM data

Let's review the concepts

Managing skewed variables

Identifying skewness
- Visual analysis of the distribution
- If it has a tail, it's skewed
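A numeric check can complement the visual one. The sketch below uses pandas' `DataFrame.skew()`, which returns the sample skewness per column: values near 0 suggest a symmetric distribution, positive values a right tail, negative values a left tail. The table `toy_rfm` and its contents are synthetic and only stand in for the real data.

```python
import numpy as np
import pandas as pd

# Toy RFM-style table; the values are synthetic and purely illustrative.
rng = np.random.default_rng(42)
toy_rfm = pd.DataFrame({
    "Recency": rng.exponential(scale=90, size=2000) + 1,         # long right tail
    "Frequency": rng.lognormal(mean=1.0, sigma=1.0, size=2000),  # long right tail
    "Symmetric": rng.normal(loc=50, scale=10, size=2000),        # no tail, for contrast
})

# Sample skewness per column.
skewness = toy_rfm.skew()
print(skewness.round(2))
```

The two tailed columns should report clearly positive skewness, while the normal column should stay close to zero.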

Exploring distribution of Recency

    import seaborn as sns
    from matplotlib import pyplot as plt

    # histplot with kde=True replaces distplot, deprecated since seaborn 0.11
    sns.histplot(datamart['Recency'], kde=True)
    plt.show()

Exploring distribution of Frequency

    # histplot with kde=True replaces distplot, deprecated since seaborn 0.11
    sns.histplot(datamart['Frequency'], kde=True)
    plt.show()

Data transformations to manage skewness
- Logarithmic transformation (positive values only)

    import numpy as np

    frequency_log = np.log(datamart['Frequency'])
    # histplot with kde=True replaces distplot, deprecated since seaborn 0.11
    sns.histplot(frequency_log, kde=True)
    plt.show()
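To see the effect of the log transformation without a plot, the skewness can be compared before and after. This is a minimal sketch on synthetic, strictly positive data; the `frequency` series is an assumption and not the course dataset.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed, strictly positive values (illustrative only).
rng = np.random.default_rng(0)
frequency = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=5000))

skew_before = frequency.skew()
skew_after = np.log(frequency).skew()

print(f"skewness before log: {skew_before:.2f}")
print(f"skewness after log:  {skew_after:.2f}")
```

Because log-normal data becomes normal under the log, the second figure should land near zero. Real RFM variables will rarely be this clean.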

Dealing with negative values
- Adding a constant before log transformation
- Cube root transformation
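Both options can be sketched in a few lines. The data below is made up, and the choice of shift (moving the minimum to 1) is one possible convention rather than a rule from the slides.

```python
import numpy as np
import pandas as pd

# Synthetic values that include negatives and zero (illustrative only).
values = pd.Series([-8.0, -1.0, 0.0, 1.0, 27.0, 1000.0])

# Option 1: shift so the minimum becomes 1, then take the log.
# The shift constant is a modelling choice, not a fixed rule.
shift = 1 - values.min()
shifted_log = np.log(values + shift)

# Option 2: cube root, which is defined for negative numbers and zero.
cube_root = np.cbrt(values)

print(shifted_log.round(3).tolist())
print(cube_root.round(3).tolist())
```

The shifted log compresses large values more aggressively than the cube root, but the result depends on the constant chosen. The cube root needs no constant and preserves the sign.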

Let's practice how to identify and manage skewed variables!

Centering and scaling variables

Identifying an issue

    datamart_rfm.describe()

- Analyze key statistics of the dataset
- Compare mean and standard deviation

Centering variables with different means
- K-means works well on variables with the same mean
- Centering variables is done by subtracting the average value from each observation

    datamart_centered = datamart_rfm - datamart_rfm.mean()
    datamart_centered.describe().round(2)

Scaling variables with different variance
- K-means works better on variables with the same variance / standard deviation
- Scaling variables is done by dividing each variable by its standard deviation

    datamart_scaled = datamart_rfm / datamart_rfm.std()
    datamart_scaled.describe().round(2)

Combining centering and scaling
- Subtract mean and divide by standard deviation manually
- Or use a scaler from the scikit-learn library (returns a numpy.ndarray object)

    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler()
    scaler.fit(datamart_rfm)
    datamart_normalized = scaler.transform(datamart_rfm)
    print('mean: ', datamart_normalized.mean(axis=0).round(2))
    print('std: ', datamart_normalized.std(axis=0).round(2))

Output:

    mean: [-0. -0. 0.]
    std: [1. 1. 1.]
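The manual route and `StandardScaler` agree only if the same standard deviation is used: `StandardScaler` divides by the population standard deviation (`ddof=0`), whereas pandas' `.std()` defaults to the sample version (`ddof=1`). The sketch below checks this on a synthetic stand-in for `datamart_rfm`; the table and its column names are assumptions. It also wraps the returned array back into a DataFrame so the index and column names are kept.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for datamart_rfm (illustrative values only).
rng = np.random.default_rng(1)
toy_rfm = pd.DataFrame({
    "Recency": rng.integers(1, 365, size=500),
    "Frequency": rng.integers(1, 50, size=500),
    "MonetaryValue": rng.gamma(shape=2.0, scale=150.0, size=500),
})

# Manual version: ddof=0 matches StandardScaler; pandas' default ddof=1
# would give slightly different values.
manual = (toy_rfm - toy_rfm.mean()) / toy_rfm.std(ddof=0)

# scikit-learn version: returns a plain numpy.ndarray.
scaled = StandardScaler().fit_transform(toy_rfm)

# Wrap the ndarray back into a DataFrame to keep the index and column names.
scaled_df = pd.DataFrame(scaled, index=toy_rfm.index, columns=toy_rfm.columns)

print(np.allclose(manual, scaled_df))
```

For data sets of realistic size the `ddof` difference is small, but it explains why a manual result and the scaler's output may not match to the last decimal.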

Test different approaches by yourself!

Sequence of structuring pre-processing steps

Why does the sequence matter?
- Log transformation only works with positive data
- Normalization centers the data around zero, which produces negative values, and the log is undefined for them
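The ordering problem is easy to reproduce. This sketch, on synthetic positive data, standardizes first and then applies the log, counts the resulting invalid values, and compares that with the log-first order.

```python
import numpy as np
import pandas as pd

# Synthetic positive, right-skewed values (illustrative only).
rng = np.random.default_rng(7)
x = pd.Series(rng.lognormal(mean=2.0, sigma=1.0, size=1000))

# Wrong order: standardize first, then log.
standardized = (x - x.mean()) / x.std()
with np.errstate(invalid="ignore", divide="ignore"):
    log_after = np.log(standardized)  # NaN wherever standardized < 0
n_invalid = int(log_after.isna().sum())

# Right order: log first, then standardize.
logged = np.log(x)
log_first = (logged - logged.mean()) / logged.std()

print("invalid values, standardize then log:", n_invalid)
print("invalid values, log then standardize:", int(log_first.isna().sum()))
```

Every observation below the mean becomes negative after centering, so a large share of the data is lost when the log is applied last.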

Sequence
1. Unskew the data (log transformation)
2. Standardize to the same average values
3. Scale to the same standard deviation
4. Store as a separate array to be used for clustering

Coding the sequence
- Unskew the data with log transformation

    import numpy as np
    datamart_log = np.log(datamart_rfm)

- Normalize the variables with StandardScaler

    from sklearn.preprocessing import StandardScaler
    scaler = StandardScaler()
    scaler.fit(datamart_log)

- Store it separately for clustering

    datamart_normalized = scaler.transform(datamart_log)
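One possible way to package the same sequence is a scikit-learn pipeline, which fixes the order of the steps in a single object. This is an alternative sketch and not the approach shown on the slides; `toy_rfm` is a synthetic, strictly positive stand-in for `datamart_rfm`, and its column names are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Synthetic, strictly positive stand-in for datamart_rfm (illustrative only).
rng = np.random.default_rng(3)
toy_rfm = pd.DataFrame({
    "Recency": rng.lognormal(mean=4.0, sigma=0.8, size=300),
    "Frequency": rng.lognormal(mean=1.5, sigma=1.0, size=300),
    "MonetaryValue": rng.lognormal(mean=5.0, sigma=1.2, size=300),
})

# Step 1 (log) always runs before steps 2 and 3 (center and scale).
preprocess = make_pipeline(FunctionTransformer(np.log), StandardScaler())
normalized = preprocess.fit_transform(toy_rfm)

# Step 4: store the result separately, keeping the index and column names.
datamart_normalized = pd.DataFrame(
    normalized, index=toy_rfm.index, columns=toy_rfm.columns
)

print(datamart_normalized.describe().round(2))
```

A fitted pipeline can also be reused on new customers with `preprocess.transform(...)`, so the means and standard deviations learned from the original data are applied consistently.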

Practice on RFM data!
