Unsupervised Learning
UNSUPERVISED LEARNING IN PYTHON
Benjamin Wilson
Director of Research at lateral.io

Unsupervised learning
- Unsupervised learning finds patterns in data
- E.g., clustering customers by their purchases
- Compressing the data using purchase patterns (dimension reduction)

Supervised vs unsupervised learning
- Supervised learning finds patterns for a prediction task
- E.g., classify tumors as benign or cancerous (labels)
- Unsupervised learning finds patterns in data
- ... but without a specific prediction task in mind

Iris dataset
- Measurements of many iris plants
- Three species of iris: setosa, versicolor, virginica
- Petal length, petal width, sepal length, sepal width (the features of the dataset)
1 http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html/

Arrays, features & samples
- 2D NumPy array
- Columns are measurements (the features)
- Rows represent iris plants (the samples)

Iris data is 4-dimensional
- Iris samples are points in 4-dimensional space
- Dimension = number of features
- Dimension too high to visualize!
- ... but unsupervised learning gives insight
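
To make "dimension = number of features" concrete, here is a minimal sketch that uses plain Python lists in place of a NumPy array. The three rows are the iris measurements printed elsewhere in this deck; the variable names and the column order noted in the comments are assumptions for illustration.

```python
# Each inner list is one sample (an iris plant); each position is one feature.
# Assumed column order: sepal length, sepal width, petal length, petal width.
samples = [
    [5.0, 3.3, 1.4, 0.2],
    [5.0, 3.5, 1.3, 0.3],
    [7.2, 3.2, 6.0, 1.8],
]

n_samples = len(samples)      # number of rows
n_features = len(samples[0])  # number of columns = dimension of the space

# One column holds a single feature across all samples.
sepal_lengths = [row[0] for row in samples]

print(n_samples, n_features)
print(sepal_lengths)
```

With a real NumPy array the same two numbers would be available together as the array's shape.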

k-means clustering
- Finds clusters of samples
- Number of clusters must be specified
- Implemented in sklearn ("scikit-learn")

    print(samples)
    # [[ 5.   3.3  1.4  0.2]
    #  [ 5.   3.5  1.3  0.3]
    #  ...
    #  [ 7.2  3.2  6.   1.8]]

    from sklearn.cluster import KMeans
    model = KMeans(n_clusters=3)   # the number of clusters must be given up front
    model.fit(samples)
    # KMeans(algorithm='auto', ...)

    labels = model.predict(samples)   # one cluster label per sample
    print(labels)
    # [0 0 1 1 0 1 2 1 0 1 ...]

Cluster labels for new samples
- New samples can be assigned to existing clusters
- k-means remembers the mean of each cluster (the "centroids")
- Finds the nearest centroid to each new sample

Cluster labels for new samples

    print(new_samples)
    # [[ 5.7  4.4  1.5  0.4]
    #  [ 6.5  3.   5.5  1.8]
    #  [ 5.8  2.7  5.1  1.9]]

    new_labels = model.predict(new_samples)
    print(new_labels)
    # [0 2 1]
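
The "nearest centroid" rule can be sketched in a few lines of plain Python. This is an illustration of the idea, not scikit-learn's implementation. The centroid values are hypothetical (made up for this sketch); in practice they would come from the fitted model, which scikit-learn exposes as `cluster_centers_`.

```python
# Hypothetical centroids for clusters 0, 1 and 2 (made up for this sketch).
centroids = [
    [5.0, 3.4, 1.5, 0.2],
    [5.9, 2.8, 4.4, 1.4],
    [6.8, 3.1, 5.7, 2.1],
]

# The new samples shown on the slide above.
new_samples = [
    [5.7, 4.4, 1.5, 0.4],
    [6.5, 3.0, 5.5, 1.8],
    [5.8, 2.7, 5.1, 1.9],
]


def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_centroid(sample, centroids):
    """Return the index of the centroid closest to the sample."""
    distances = [squared_distance(sample, c) for c in centroids]
    return distances.index(min(distances))


new_labels = [nearest_centroid(s, centroids) for s in new_samples]
print(new_labels)
```

Squared distances are enough here because only the ordering of the distances matters, not their actual values.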

Scatter plots
- Scatter plot of sepal length vs. petal length
- Each point represents an iris sample
- Color points by cluster labels
- PyPlot (matplotlib.pyplot)

Scatter plots

    import matplotlib.pyplot as plt
    xs = samples[:,0]   # column 0: sepal length
    ys = samples[:,2]   # column 2: petal length
    plt.scatter(xs, ys, c=labels)   # color each point by its cluster label
    plt.show()

Let's practice!

Evaluating a clustering

Evaluating a clustering
- Can check correspondence with e.g. iris species
- ... but what if there are no species to check against?
- Measure quality of a clustering
- Informs choice of how many clusters to look for

Iris: clusters vs species
- k-means found 3 clusters amongst the iris samples
- Do the clusters correspond to the species?

    species  setosa  versicolor  virginica
    labels
    0             0           2         36
    1            50           0          0
    2             0          48         14

Cross tabulation with pandas
- Clusters vs species is a "cross-tabulation"
- Use the pandas library
- Given the species of each sample as a list named species

    print(species)
    # ['setosa', 'setosa', 'versicolor', 'virginica', ... ]

Aligning labels and species

    import pandas as pd
    df = pd.DataFrame({'labels': labels, 'species': species})
    print(df)
    #    labels     species
    # 0       1      setosa
    # 1       1      setosa
    # 2       2  versicolor
    # 3       2   virginica
    # 4       1      setosa
    # ...

Crosstab of labels and species

    ct = pd.crosstab(df['labels'], df['species'])
    print(ct)
    # species  setosa  versicolor  virginica
    # labels
    # 0             0           2         36
    # 1            50           0          0
    # 2             0          48         14

How to evaluate a clustering if there were no species information?
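
A cross-tabulation is just a count of how often each (label, species) pair occurs. The sketch below reproduces that counting with the standard library only, to show what the table contains; it is not how pandas implements `crosstab`. The six labels and species are hypothetical values made up for the example.

```python
from collections import Counter

# Hypothetical cluster labels and species for six samples (made up for this sketch).
labels = [1, 1, 2, 2, 1, 0]
species = ['setosa', 'setosa', 'versicolor', 'virginica', 'setosa', 'virginica']

# Count how often each (label, species) pair occurs.
counts = Counter(zip(labels, species))

row_names = sorted(set(labels))   # one row per cluster label
col_names = sorted(set(species))  # one column per species

# A Counter returns 0 for pairs that never occur, which fills the empty cells.
table = [[counts[(r, c)] for c in col_names] for r in row_names]

print(col_names)
for r, row in zip(row_names, table):
    print(r, row)
```

A row with a single large entry means that cluster lines up with one species; a row spread over several columns means the cluster mixes species.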

Measuring clustering quality
- Using only samples and their cluster labels
- A good clustering has tight clusters
- Samples in each cluster bunched together

Inertia measures clustering quality
- Measures how spread out the clusters are (lower is better)
- Distance from each sample to centroid of its cluster
- After fit(), available as attribute inertia_
- k-means attempts to minimize the inertia when choosing clusters

    from sklearn.cluster import KMeans
    model = KMeans(n_clusters=3)
    model.fit(samples)
    print(model.inertia_)
    # 78.9408414261
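
scikit-learn documents `inertia_` as the sum of squared distances from each sample to its closest cluster center. Below is a minimal sketch of that calculation in plain Python; the toy samples, labels and centroids are hypothetical and chosen so the arithmetic is easy to check by hand.

```python
def inertia(samples, labels, centroids):
    """Sum of squared Euclidean distances from each sample to its cluster's centroid."""
    total = 0.0
    for sample, label in zip(samples, labels):
        centroid = centroids[label]
        total += sum((x - c) ** 2 for x, c in zip(sample, centroid))
    return total


# Hypothetical 2-D toy data: two tight groups of two points each.
samples = [[1.0, 1.0], [1.0, 3.0], [8.0, 8.0], [10.0, 8.0]]

# Two clusters, each centroid being the mean of its two points.
tight = inertia(samples, [0, 0, 1, 1], [[1.0, 2.0], [9.0, 8.0]])

# One cluster for everything, centroid at the overall mean.
loose = inertia(samples, [0, 0, 0, 0], [[5.0, 5.0]])

print(tight, loose)
```

The two-cluster grouping has far lower inertia than the single cluster, which is what "tight clusters" means numerically.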

The number of clusters
- Clusterings of the iris dataset with different numbers of clusters
- More clusters means lower inertia
- What is the best number of clusters?

How many clusters to choose?
- A good clustering has tight clusters (so low inertia)
- ... but not too many clusters!
- Choose an "elbow" in the inertia plot
- Where inertia begins to decrease more slowly
- E.g., for iris dataset, 3 is a good choice
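
The elbow can be seen by computing the inertia for several values of k. The sketch below uses a tiny pure-Python, one-dimensional k-means as a stand-in for `KMeans`, so it runs without any libraries. The toy data are hypothetical, and the evenly spaced initialisation is a simplification for reproducibility (scikit-learn defaults to k-means++ initialisation).

```python
def kmeans_1d(points, k, n_iter=20):
    """Tiny 1-D k-means (Lloyd's algorithm); returns (centroids, inertia)."""
    # Simplified, deterministic initialisation: k evenly spaced input points.
    centroids = [points[i * len(points) // k] for i in range(k)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: (p - centroids[j]) ** 2)
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its points
        # (an empty cluster keeps its previous centroid).
        centroids = [
            sum(c) / len(c) if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
    # Inertia: squared distance from each point to its nearest centroid.
    inertia = sum(min((p - c) ** 2 for c in centroids) for p in points)
    return centroids, inertia


# Hypothetical toy data with three obvious groups around 1, 5 and 9.
points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.2, 8.8]

ks = list(range(1, 6))
inertias = [kmeans_1d(points, k)[1] for k in ks]
for k, value in zip(ks, inertias):
    print(k, round(value, 2))
```

On this toy data the inertia falls sharply up to k = 3 and barely changes afterwards, so 3 is the elbow. With scikit-learn, the same loop would fit `KMeans(n_clusters=k)` and collect `model.inertia_` for each k.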

Let's practice!

Transforming features for better clusterings

Piedmont wines dataset
- 178 samples from 3 distinct varieties of red wine: Barolo, Grignolino and Barbera
- Features measure chemical composition, e.g. alcohol content
- Visual properties like "color intensity"
1 Source: https://archive.ics.uci.edu/ml/datasets/Wine

Clustering the wines

    from sklearn.cluster import KMeans
    model = KMeans(n_clusters=3)
    labels = model.fit_predict(samples)   # fit and predict in one call

Clusters vs. varieties

    import pandas as pd
    df = pd.DataFrame({'labels': labels, 'varieties': varieties})
    ct = pd.crosstab(df['labels'], df['varieties'])
    print(ct)
    # varieties  Barbera  Barolo  Grignolino
    # labels
    # 0               29      13          20
    # 1                0      46           1
    # 2               19       0          50

Feature variances
- The wine features have very different variances!
- Variance of a feature measures spread of its values

    feature     variance
    alcohol         0.65
    malic_acid      1.24
    ...
    od280           0.50
    proline     99166.71

StandardScaler
- In k-means: feature variance = feature influence
- StandardScaler transforms each feature to have mean 0 and variance 1
- Features are said to be "standardized"

sklearn StandardScaler

    from sklearn.preprocessing import StandardScaler
    scaler = StandardScaler()
    scaler.fit(samples)   # learns the mean and standard deviation of each feature
    # StandardScaler(copy=True, with_mean=True, with_std=True)

    samples_scaled = scaler.transform(samples)
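
What "mean 0 and variance 1" means for a single feature can be sketched with the standard library alone. This is an illustration under stated assumptions, not StandardScaler's code: the feature values are hypothetical, and the population standard deviation (dividing by n) is used, which to my understanding matches StandardScaler's behaviour.

```python
import statistics


def standardize(column):
    """Shift a feature to mean 0 and scale it to variance 1."""
    mean = statistics.mean(column)
    std = statistics.pstdev(column)  # population standard deviation
    return [(x - mean) / std for x in column]


# Hypothetical values for two wine features on very different scales.
alcohol = [12.0, 13.0, 14.0, 13.0]
proline = [500.0, 1000.0, 1500.0, 1000.0]

alcohol_scaled = standardize(alcohol)
proline_scaled = standardize(proline)

# Variance before and after standardization.
print(statistics.pvariance(alcohol), statistics.pvariance(alcohol_scaled))
print(statistics.pvariance(proline), statistics.pvariance(proline_scaled))
```

Before scaling, differences in proline would swamp differences in alcohol in any distance calculation. After scaling, both features vary on the same scale, so each has comparable influence on the clustering.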

Similar methods
- StandardScaler and KMeans have similar methods
- Use fit() / transform() with StandardScaler
- Use fit() / predict() with KMeans

StandardScaler, then KMeans
- Need to perform two steps: StandardScaler, then KMeans
- Use sklearn pipeline to combine multiple steps
- Data flows from one step into the next

Pipelines combine multiple steps

    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import KMeans
    scaler = StandardScaler()
    kmeans = KMeans(n_clusters=3)

    from sklearn.pipeline import make_pipeline
    pipeline = make_pipeline(scaler, kmeans)
    pipeline.fit(samples)   # fits the scaler, then fits KMeans on the scaled data
    # Pipeline(steps=...)

    labels = pipeline.predict(samples)
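
How data flows from one step into the next can be sketched with a toy pipeline. All three classes below are hypothetical and written only for this illustration; they mimic the fit() / transform() / predict() pattern but are not scikit-learn's implementation.

```python
class CenterStep:
    """Toy transformer: subtracts the mean learned during fit."""

    def fit(self, xs):
        self.mean_ = sum(xs) / len(xs)
        return self

    def transform(self, xs):
        return [x - self.mean_ for x in xs]


class SignStep:
    """Toy final step: labels each value 0 (negative) or 1 (non-negative)."""

    def fit(self, xs):
        return self  # nothing to learn in this toy step

    def predict(self, xs):
        return [0 if x < 0 else 1 for x in xs]


class MiniPipeline:
    """Chains steps: every step but the last transforms, the last predicts."""

    def __init__(self, *steps):
        self.steps = steps

    def fit(self, xs):
        # Each intermediate step is fitted, then its output feeds the next step.
        for step in self.steps[:-1]:
            xs = step.fit(xs).transform(xs)
        self.steps[-1].fit(xs)
        return self

    def predict(self, xs):
        # Reuse the already fitted transformers, then ask the last step to predict.
        for step in self.steps[:-1]:
            xs = step.transform(xs)
        return self.steps[-1].predict(xs)


values = [1.0, 2.0, 9.0, 10.0]
pipeline = MiniPipeline(CenterStep(), SignStep())
pipeline.fit(values)
labels = pipeline.predict(values)
print(labels)
```

The point of the pattern is that the caller only sees fit() and predict(); the intermediate transformed data never has to be handled by hand.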

Feature standardization improves clustering

With feature standardization:

    varieties  Barbera  Barolo  Grignolino
    labels
    0                0      59           3
    1               48       0           3
    2                0       0          65

Without feature standardization, the clustering was very bad:

    varieties  Barbera  Barolo  Grignolino
    labels
    0               29      13          20
    1                0      46           1
    2               19       0          50

sklearn preprocessing steps
- StandardScaler is a "preprocessing" step
- MaxAbsScaler and Normalizer are other examples

Let's practice!