Dimensionality reduction: feature extraction
PRACTICING MACHINE LEARNING INTERVIEW QUESTIONS IN PYTHON
Lisa Stuart, Data Scientist

Unsupervised learning methods
- Principal component analysis (PCA) --> Lesson 3.1
- Singular value decomposition (SVD) --> Lesson 3.1
- Clustering / grouping --> Lesson 3.3
- Exploratory data mining

Dimensionality reduction != feature selection
1 https://slideplayer.com/slide/9699240/
2 https://www.analyticsvidhya.com/blog/2016/03/practical-guide-principal-component-analysis-python/

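The distinction in this slide title can be made concrete with a small sketch. It assumes the iris dataset and scikit-learn's `SelectKBest` purely for illustration (neither appears in the slides): feature selection keeps a subset of the original columns unchanged, while feature extraction (here PCA) builds new columns as combinations of all of them.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

# Illustrative dataset (an assumption, not from the slides): 150 rows, 4 features.
X, y = load_iris(return_X_y=True)

# Feature selection: keep 2 of the 4 original columns, values unchanged.
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)
kept = selector.get_support(indices=True)  # indices of the surviving columns

# Feature extraction: build 2 new columns, each a linear combination of all 4.
pca = PCA(n_components=2)
X_extracted = pca.fit_transform(X)

print("columns kept by selection:", kept)
print("selection is a column subset:", np.array_equal(X_selected, X[:, kept]))
print("PCA weights per original feature (rows = components):")
print(pca.components_.round(2))
```
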
Curse of dimensionality
1 https://www.visiondummy.com/2014/04/curse-dimensionality-affect-classification/

1-D search

2-D search

3-D search

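A rough way to see why the search gets harder from 1-D to 3-D: if each axis is split into a fixed number of positions, the number of cells to inspect grows exponentially with the number of dimensions. The sketch below assumes 10 positions per axis, a number chosen only for illustration, and `cells_to_search` is a hypothetical helper.

```python
def cells_to_search(positions_per_axis: int, n_dims: int) -> int:
    """Grid cells to inspect when every axis has the same number of positions."""
    return positions_per_axis ** n_dims


# 10 positions per axis is an illustrative assumption.
for n_dims in (1, 2, 3, 10):
    print(f"{n_dims}-D search: {cells_to_search(10, n_dims):,} cells")
```

With a fixed number of observations, the same data therefore covers a rapidly shrinking fraction of the space as dimensions are added.
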
Dimensionality reduction methods
- PCA
- SVD

PCA
- Relationship between X and y (in the 2-D illustration these are two features; PCA is unsupervised and uses no target)
- Calculated by finding principal axes
- Translates, rotates and scales
- Lower-dimensional projection of the data
1 https://scikit-learn.org/stable/modules/decomposition.html

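As a minimal sketch of the projection described above, the following fits PCA to synthetic 2-D data (generated here as an assumption, not taken from the course) and projects it onto a single principal axis.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic 2-D data lying close to a line (illustrative assumption).
rng = np.random.default_rng(0)
t = rng.normal(scale=3.0, size=200)
X = np.column_stack([t, 0.5 * t]) + rng.normal(scale=0.3, size=(200, 2))

pca = PCA(n_components=1)
X_proj = pca.fit_transform(X)           # lower-dimensional projection, shape (200, 1)
X_back = pca.inverse_transform(X_proj)  # same points mapped back onto the principal axis

print("principal axis (unit vector):", pca.components_[0])
print("share of variance kept:", pca.explained_variance_ratio_[0])
```
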
SVD
- Linear algebra and vector calculus
- Decomposes data matrix into three matrices
- Results in 'singular' values
- Variance in the (centered) data is proportional to the sum of squares (SS) of the singular values
1 https://galaxydatatech.com/2018/07/15/singular-value-decomposition/

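The three-matrix decomposition and the link between singular values and variance can be checked numerically. This sketch uses NumPy's `np.linalg.svd` on random data (an assumption made for illustration) and column-centers the matrix first, since the variance identity holds for centered data.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))   # illustrative random data
Xc = X - X.mean(axis=0)         # center each column

# Decompose into three matrices: Xc = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
X_rebuilt = U @ np.diag(s) @ Vt

# Total sample variance equals the SS of the singular values divided by (n - 1).
total_variance = Xc.var(axis=0, ddof=1).sum()
from_singular_values = (s ** 2).sum() / (Xc.shape[0] - 1)

print("reconstruction matches:", np.allclose(X_rebuilt, Xc))
print("variance matches:", np.isclose(total_variance, from_singular_values))
```
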
Dimension reduction functions

Function/method                          returns
sklearn.decomposition.PCA                principal component analysis
sklearn.decomposition.TruncatedSVD       singular value decomposition
PCA/SVD.fit_transform(X)                 fits and transforms data
PCA/SVD.explained_variance_ratio_        variance explained by PCs

Other matrix decomposition algorithms

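A short usage sketch of the calls in this table, shown with `TruncatedSVD`. The digits dataset and the choice of 10 components are assumptions for illustration only.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import TruncatedSVD

# Illustrative data: 1797 images flattened to 64 features.
X, _ = load_digits(return_X_y=True)

svd = TruncatedSVD(n_components=10, random_state=0)
X_reduced = svd.fit_transform(X)  # fits the model and returns the reduced data

print("reduced shape:", X_reduced.shape)
print("variance explained per component:", svd.explained_variance_ratio_.round(3))
print("total variance explained:", svd.explained_variance_ratio_.sum())
```

`PCA` exposes the same `fit_transform` and `explained_variance_ratio_` interface; unlike `PCA`, `TruncatedSVD` does not center the data before decomposing it.
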
Let's practice!

Dimensionality reduction: visualization techniques

Why dimensionality reduction?
1. Speeds up ML training
2. Visualization
3. Can improve accuracy

Visualization techniques
- PCA
- t-SNE

Visualizing with PCA
1 https://districtdatalabs.silvrback.com/principal-component-analysis-with-python

Scree plot
1 https://towardsdatascience.com/a-step-by-step-explanation-of-principal-component-analysis-b836fb9c97e2

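The values behind a scree plot are the per-component explained variance ratios. A minimal sketch, assuming the wine dataset and a 90% cumulative-variance cutoff (both chosen here for illustration, not taken from the slides):

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Illustrative data: 13 features, standardized so no feature dominates by scale.
X, _ = load_wine(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

pca = PCA().fit(X_scaled)                # keep all components
ratios = pca.explained_variance_ratio_   # heights of the scree plot bars
cumulative = np.cumsum(ratios)

# Smallest number of components whose cumulative share reaches the 90% cutoff.
n_for_90 = int(np.searchsorted(cumulative, 0.90) + 1)

for i, (r, c) in enumerate(zip(ratios, cumulative), start=1):
    print(f"PC{i}: {r:.3f} (cumulative {c:.3f})")
print("components needed for 90% of the variance:", n_for_90)
```

Plotting `ratios` against the component number gives the scree plot itself; the "elbow" in that curve is a common visual cue for how many components to keep.
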
t-SNE
- Probabilistic
- Pairs of data points
- Low-dimensional embedding
- Plot embeddings

Visualizing with t-SNE

    # t-SNE with loan data
    import matplotlib.pyplot as plt
    import pandas as pd
    import seaborn as sns
    from sklearn.manifold import TSNE

    loans = pd.read_csv('loans_dataset.csv')

    # Feature matrix
    X = loans.drop('Loan Status', axis=1)

    tsne = TSNE(n_components=2, verbose=1, perplexity=40)
    tsne_results = tsne.fit_transform(X)
    loans['t-SNE-PC-one'] = tsne_results[:, 0]
    loans['t-SNE-PC-two'] = tsne_results[:, 1]

    # t-SNE viz
    plt.figure(figsize=(16, 10))
    sns.scatterplot(
        x="t-SNE-PC-one", y="t-SNE-PC-two",
        hue="Loan Status",
        palette=sns.color_palette(["grey", "blue"]),
        data=loans,
        legend="full",
        alpha=0.3
    )

1 https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html

Visualizing with t-SNE

PCA vs t-SNE digits data
1 https://towardsdatascience.com/visualising-high-dimensional-datasets-using-pca-and-t-sne-in-python-8ef87e7915b

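A comparison like this one could be reproduced roughly as follows. The sketch assumes scikit-learn's bundled digits dataset and a 300-row subset to keep t-SNE quick; the perplexity value is also an illustrative choice.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)
X, y = X[:300], y[:300]  # small subset (assumption) so t-SNE runs quickly

# Linear projection onto the two leading principal axes.
X_pca = PCA(n_components=2).fit_transform(X)

# Non-linear embedding that tries to preserve local neighbourhoods.
X_tsne = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

print("PCA embedding shape:", X_pca.shape)
print("t-SNE embedding shape:", X_tsne.shape)
```

Plotting each embedding's two columns, colored by `y`, gives the side-by-side view; t-SNE typically separates the digit classes more clearly than the first two principal components do.
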
Let's practice!

Clustering analysis: selecting the right clustering algorithm

Clustering algorithms
- Features >> Observations
- Model training more challenging
- Rely on distance calculations
- Most commonly used unsupervised technique

Practical applications of clustering
- Customer segmentation
- Document classification
- Insurance / transaction fraud detection
- Image segmentation
- Anomaly detection
- Many more...

Distance metrics: Manhattan (taxicab) distance
1 https://en.wikipedia.org/wiki/Taxicab_geometry

Distance metrics: Euclidean distance
1 http://rosalind.info/glossary/euclidean-distance/

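Both distance metrics (Manhattan and Euclidean) can be written in a few lines. The helper names below are hypothetical and the points are chosen only for illustration.

```python
import math
from typing import Sequence


def manhattan_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Sum of absolute coordinate differences (taxicab distance)."""
    return sum(abs(a - b) for a, b in zip(p, q))


def euclidean_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Straight-line distance: square root of the sum of squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


p, q = (0, 0), (3, 4)
print(manhattan_distance(p, q))  # 7
print(euclidean_distance(p, q))  # 5.0
```
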
K-means
1. Initial centroids
2. Assign each observation to nearest centroid
3. Create new centroids
4. Repeat steps 2 and 3
1 http://sherrytowers.com/2013/10/24/k-means-clustering/

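The four steps above can be sketched directly in NumPy. This is a teaching sketch under simple assumptions (random observations as initial centroids, Euclidean distance, a hypothetical `k_means` helper), not the implementation used by `sklearn.cluster.KMeans`.

```python
import numpy as np


def k_means(X: np.ndarray, k: int, n_iter: int = 100, seed: int = 0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct observations as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each observation to its nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: new centroid = mean of the observations assigned to it
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: repeat steps 2 and 3 until the centroids stop moving.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids


# Two well-separated synthetic blobs (illustrative data).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)), rng.normal(5, 0.5, size=(50, 2))])
labels, centroids = k_means(X, k=2)
print(centroids.round(2))
```
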
Hierarchical agglomerative clustering
1 https://www.datanovia.com/en/lessons/agglomerative-hierarchical-clustering/

Agglomerative clustering linkage
- Ward linkage
- Maximum / complete linkage
- Average linkage
- Single linkage

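In scikit-learn these four options map onto the `linkage` parameter of `AgglomerativeClustering`. A small sketch on synthetic blobs (the data and the choice of three clusters are assumptions for illustration):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Three synthetic blobs of 30 points each (illustrative data).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(30, 2)),
    rng.normal(loc=[4, 4], scale=0.5, size=(30, 2)),
    rng.normal(loc=[0, 4], scale=0.5, size=(30, 2)),
])

results = {}
for linkage in ("ward", "complete", "average", "single"):
    model = AgglomerativeClustering(n_clusters=3, linkage=linkage)
    results[linkage] = model.fit_predict(X)
    # Cluster sizes make it easy to spot when a linkage behaves differently.
    print(f"{linkage:>8}: cluster sizes {np.bincount(results[linkage])}")
```

On clean, compact blobs the linkages tend to agree; differences usually show up on elongated or noisy clusters, where single linkage in particular can chain points together.
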
Selecting a clustering algorithm
- Cluster stability assessment
- K-means and HC use Euclidean distance
- Inter- and intra-cluster distances

"An appropriate dissimilarity measure is far more important in obtaining success with clustering than choice of clustering algorithm." - from Elements of Statistical Learning
1 https://slideplayer.com/slide/8363774/

Clustering functions

Function/method                               returns
sklearn.cluster.KMeans                        K-Means clustering algorithm
sklearn.cluster.AgglomerativeClustering       Agglomerative clustering algorithm
kmeans.inertia_                               SS distances of observations to closest cluster center
import scipy.cluster.hierarchy as sch         Hierarchical clustering for dendrograms
sch.dendrogram()                              Dendrogram function

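A minimal sketch of the SciPy calls in the table. `sch.dendrogram()` takes a linkage matrix, so the example first builds one with `sch.linkage`; the data, the Ward method, and the cut into two clusters are illustrative assumptions. `no_plot=True` returns the tree layout without drawing it.

```python
import numpy as np
import scipy.cluster.hierarchy as sch

# Two small synthetic groups of 5 points each (illustrative data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(5, 2)), rng.normal(3, 0.3, size=(5, 2))])

# Linkage matrix: one row per merge, (n - 1) rows in total.
Z = sch.linkage(X, method="ward")

# Dendrogram layout as a dict; drop no_plot=True to draw it with matplotlib.
tree = sch.dendrogram(Z, no_plot=True)

# Cut the tree into 2 flat clusters.
flat_labels = sch.fcluster(Z, t=2, criterion="maxclust")

print("leaf order:", tree["leaves"])
print("flat cluster labels:", flat_labels)
```
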
Let's practice!

Clustering analysis: choosing the optimal number of clusters

Methods for optimal k
- Silhouette method
- Elbow method

Silhouette coefficient
- Composed of 2 scores
- Mean distance between each observation and all others:
  - in the same cluster
  - in the nearest cluster

Silhouette coefficient values
- Between -1 and 1
-  1: near others in same cluster, very far from others in other clusters
- -1: not near others in same cluster, close to others in other clusters
-  0: denotes overlapping clusters

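For a single observation the coefficient is commonly written as s = (b - a) / max(a, b), where a is the mean distance to the other members of its own cluster and b is the mean distance to the members of the nearest other cluster. A hand-rolled sketch follows; the helper name is hypothetical, and it assumes Euclidean distance and that the observation's cluster has at least two members.

```python
import numpy as np


def silhouette_coefficient(X: np.ndarray, labels: np.ndarray, i: int) -> float:
    """Silhouette coefficient of observation i (assumes its cluster has >= 2 members)."""
    distances = np.linalg.norm(X - X[i], axis=1)
    same = labels == labels[i]
    same[i] = False                   # exclude the observation itself
    a = distances[same].mean()        # mean distance within its own cluster
    b = min(                          # mean distance to the nearest other cluster
        distances[labels == other].mean()
        for other in set(labels.tolist())
        if other != labels[i]
    )
    return (b - a) / max(a, b)


X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]])
labels = np.array([0, 0, 1, 1])
s0 = silhouette_coefficient(X, labels, 0)
print(round(s0, 3))  # close to 1: tight cluster, far from the other one
```

`sklearn.metrics.silhouette_samples` computes the same quantity for every observation at once.
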
Silhouette score
1 https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html

Elbow method
1 https://www.datanovia.com/en/lessons/determining-the-optimal-number-of-clusters-3-must-know-methods/

Optimal k selection functions

Function/method                      returns
sklearn.cluster.KMeans               K-Means clustering algorithm
sklearn.metrics.silhouette_score     score between -1 and 1 as measure of cluster stability
kmeans.inertia_                      SS distances of observations to closest cluster center
range(start, stop)                   sequence of values beginning with start, up to but not including stop
list.append(kmeans.inertia_)         appends inertia value to list

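Put together, the functions in this table give a simple loop over candidate values of k. The sketch below assumes synthetic data from `make_blobs` with three known centers and a search range of 2 to 6, both chosen for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 3 well-separated centers (illustrative assumption).
X, _ = make_blobs(
    n_samples=300, centers=[[0, 0], [6, 0], [0, 6]], cluster_std=0.6, random_state=0
)

k_values = range(2, 7)  # candidate k: 2, 3, 4, 5, 6
inertias, silhouettes = [], []
for k in k_values:
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(kmeans.inertia_)                           # for the elbow method
    silhouettes.append(silhouette_score(X, kmeans.labels_))    # for the silhouette method

best_k = k_values[int(np.argmax(silhouettes))]
for k, inertia, sil in zip(k_values, inertias, silhouettes):
    print(f"k={k}: inertia={inertia:.1f}, silhouette={sil:.3f}")
print("k with the highest silhouette score:", best_k)
```

For the elbow method, plot `inertias` against `k_values` and look for the point where the curve stops dropping sharply; for the silhouette method, pick the k with the highest score.
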
Let's practice!