Handling missing data

Practicing Machine Learning Interview Questions in Python

Lisa Stuart, Data Scientist
Prerequisites

Supervised Learning with scikit-learn
Unsupervised Learning in Python
Course outline

Chapter 1: Pre-processing and Visualization
  Missing data, Outliers, Normalization
Chapter 2: Supervised Learning
  Feature selection, Regularization, Feature engineering
Chapter 3: Unsupervised Learning
  Cluster algorithm selection, Feature extraction, Dimension reduction
Chapter 4: Model Selection and Evaluation
  Model generalization and evaluation, Model selection
Machine learning (ML) pipeline
Our ML pipeline
Missing data

Impact of different techniques
Finding missing values
Strategies to handle
Techniques

1. Omission
   Removal of rows --> .dropna(axis=0)
   Removal of columns --> .dropna(axis=1)

2. Imputation
   Fill with zero --> SimpleImputer(strategy='constant', fill_value=0)
   Impute mean --> SimpleImputer(strategy='mean')
   Impute median --> SimpleImputer(strategy='median')
   Impute mode --> SimpleImputer(strategy='most_frequent')
   Iterative imputation --> IterativeImputer()
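A minimal sketch of these calls on a made-up two-column DataFrame (the column names and values are illustrative, not from the course data); note that IterativeImputer still needs scikit-learn's experimental import:

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (required for IterativeImputer)
from sklearn.impute import SimpleImputer, IterativeImputer

df = pd.DataFrame({'age': [25.0, np.nan, 40.0, 31.0],
                   'income': [50.0, 62.0, np.nan, 48.0]})

# 1. Omission
rows_kept = df.dropna(axis=0)   # drop rows containing any missing value
cols_kept = df.dropna(axis=1)   # drop columns containing any missing value

# 2. Imputation (each imputer returns an array of the filled values)
zero_filled   = SimpleImputer(strategy='constant', fill_value=0).fit_transform(df)
mean_filled   = SimpleImputer(strategy='mean').fit_transform(df)
median_filled = SimpleImputer(strategy='median').fit_transform(df)
mode_filled   = SimpleImputer(strategy='most_frequent').fit_transform(df)
iterative     = IterativeImputer().fit_transform(df)  # models each feature from the others
```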
Why bother?

Reduce the probability of introducing bias
Most ML algorithms require complete data
Effects of imputation

Depend on:
  Missing values
  Original variance
  Presence of outliers
  Size and direction of skew

Omission --> Can remove too much
Zero --> Biases results downward
Mean --> Affected more by outliers
Median --> Better in case of outliers
Mode and iterative imputation --> Try them out
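The answer slide later in this chapter notes that the best strategy is the one that disturbs the data the least; a quick, hedged way to compare is to check the variance under each fill on a made-up skewed feature:

```python
import numpy as np
import pandas as pd

# Made-up right-skewed feature with an outlier (50) and two missing values.
x = pd.Series([1, 2, 2, 3, 3, 3, 4, 50, np.nan, np.nan])

print('original variance:', x.var())                    # missing values skipped
print('zero filled:      ', x.fillna(0).var())           # zeros pull the center down
print('mean imputed:     ', x.fillna(x.mean()).var())    # mean (8.5) is dragged up by the outlier
print('median imputed:   ', x.fillna(x.median()).var())  # median (3.0) ignores the outlier
```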
Function --> returns

df.isna().sum() --> number missing
df['feature'].mean() --> feature mean
.shape --> row, column dimensions
df.columns --> column names
.fillna(0) --> fills missing with 0
.select_dtypes(include=[np.number]) --> numeric columns
.select_dtypes(include=['object']) --> string columns
.fit_transform(numeric_cols) --> fits and transforms
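Tying these together, a small sketch on a toy DataFrame (the columns are invented stand-ins for the course data):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy DataFrame standing in for the course data (columns are made up).
df = pd.DataFrame({'feature': [1.0, np.nan, 3.0, 4.0],
                   'city': ['NY', 'LA', None, 'SF']})

print(df.shape)                # (rows, columns)
print(df.columns)              # column names
print(df.isna().sum())         # number missing per column
print(df['feature'].mean())    # feature mean (missing values skipped by default)

numeric_cols = df.select_dtypes(include=[np.number])   # numeric columns only
string_cols = df.select_dtypes(include=['object'])     # string columns only

filled = df.fillna(0)          # fill every missing value with 0

imputed = SimpleImputer(strategy='mean').fit_transform(numeric_cols)  # fit and transform in one step
```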
Effects of missing values

What are the effects of missing values in a Machine Learning (ML) setting? Select the answer that is true:

- Missing values aren't a problem since most of sklearn's algorithms can handle them.
- Removing observations or features with missing values is generally a good idea.
- Missing data tends to introduce bias that leads to misleading results, so they cannot be ignored.
- Filling missing values with zero will bias results upward.
Effects of missing values: answer

What are the effects of missing values in a Machine Learning (ML) setting?

The correct answer is: Missing data tends to introduce bias that leads to misleading results, so they cannot be ignored.

(Filling missing values by testing which strategy impacts the variance of a given dataset the least is the best approach.)
Effects of missing values: incorrect answers

What are the effects of missing values in a Machine Learning (ML) setting?

Missing values aren't a problem... (Most of sklearn's algorithms cannot handle missing values and will throw an error.)

Removing observations or features with missing values... (Unless your dataset is large and the proportion of missing values small, removing rows or columns with missing data usually shrinks your dataset too much to be useful in subsequent ML.)

Filling missing values with zero will bias results upward. (It's the opposite: filling with zero will bias results downward.)
Let's practice!
Data distributions and transformations

Practicing Machine Learning Interview Questions in Python

Lisa Stuart, Data Scientist
Different distributions

[Figure: training and test data sets drawn from different distributions]
Source: https://www.researchgate.net/figure/Bias-Training-and-test-data-sets-are-drawn-from-different-distributions_fig22_330485084
Train/test split

train, test = train_test_split(df, test_size=0.3)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

sns.pairplot() --> plot matrix of distributions and scatterplots
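To check that the two splits follow similar distributions, a pairplot of each split can be compared side by side. A hedged sketch using seaborn's built-in iris data as a stand-in for the course dataset:

```python
import seaborn as sns
from sklearn.model_selection import train_test_split

# Iris is only a stand-in for the course data.
df = sns.load_dataset('iris')
X = df.drop(columns='species')
y = df['species']

# Split a whole DataFrame, or split features and target together.
train, test = train_test_split(df, test_size=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Histograms on the diagonal, scatterplots off it; comparing the two plots
# shows whether the splits were drawn from similar distributions.
sns.pairplot(train)
sns.pairplot(test)
```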
Data transformation

[Figure: example of the effect of a log transformation on the distribution of a dataset]
Source: https://www.researchgate.net/figure/Example-of-the-effect-of-a-log-transformation-on-the-distribution-of-the-dataset_fig20_308007227
Box-Cox transformations

scipy.stats.boxcox(data, lmbda= )

lmbda (p)    x^p transform        name
-2           x^-2 = 1/x^2         reciprocal square
-1           x^-1 = 1/x           reciprocal
-1/2         x^-0.5 = 1/√x        reciprocal square root
0.0          log(x)               log
1/2          x^0.5 = √x           square root
1            x^1 = x              no transform
2            x^2                  square
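In scipy, passing a fixed lmbda returns just the transformed data, while omitting it lets scipy estimate the lambda that makes the result closest to normal. A small sketch on made-up, strictly positive skewed data (Box-Cox requires positive values):

```python
import numpy as np
from scipy import stats

# Made-up, strictly positive, right-skewed data.
rng = np.random.default_rng(0)
data = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

# Fixed lambda: returns only the transformed values (lmbda=0.0 is the log transform).
logged = stats.boxcox(data, lmbda=0.0)

# Lambda omitted: scipy estimates the lambda that makes the result closest to normal
# and returns both the transformed data and that lambda.
transformed, best_lambda = stats.boxcox(data)
print(best_lambda)   # should be near 0 for lognormal data
```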
Let's practice!
Data outliers and scaling

Practicing Machine Learning Interview Questions in Python

Lisa Stuart, Data Scientist
Outliers

One or more observations that are distant from the rest of the observations in a given feature.

Source: https://bolt.mph.ufl.edu/6050-6052/unit-1/one-quantitative-variable-introduction/understanding-outliers/
Inter-quartile range (IQR)

[Figure illustrating the inter-quartile range]
Image: by Jhguch at en.wikipedia, CC BY-SA 2.5, https://commons.wikimedia.org/w/index.php?curid=14524285
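The box plot's whiskers come from the IQR; a common rule of thumb flags points more than 1.5 * IQR beyond the quartiles as outliers. A minimal sketch with made-up numbers:

```python
import pandas as pd

# Made-up feature with one clear outlier (95).
x = pd.Series([10, 12, 12, 13, 14, 15, 15, 16, 18, 95])

q1, q3 = x.quantile(0.25), x.quantile(0.75)
iqr = q3 - q1

# 1.5 * IQR rule: points beyond the whiskers are flagged as outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = x[(x < lower) | (x > upper)]
print(outliers)   # flags 95
```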
Line of best fit

[Figure: outliers relative to a line of best fit]
Source: https://www.r-bloggers.com/outlier-detection-and-treatment-with-r/
Outlier functions

Function --> returns

sns.boxplot(x= , y='Loan Status') --> box plot conditioned on target variable
sns.distplot() --> histogram and kernel density estimate (kde)
np.abs() --> absolute value
stats.zscore() --> calculated z-score
mstats.winsorize(limits=[0.05, 0.05]) --> floor and ceiling applied to outliers
np.where(condition, true, false) --> replaced values
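A short sketch chaining these calls on an invented 'income' feature with a toy 'Loan Status' target (both made up for illustration); sns.distplot() is deprecated in newer seaborn, so sns.histplot(..., kde=True) is used in its place:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.stats import mstats

# Invented data: 'income' has one extreme value; 'Loan Status' is a toy target.
df = pd.DataFrame({'income': [40, 42, 45, 47, 50, 52, 55, 300],
                   'Loan Status': ['Y', 'N', 'Y', 'Y', 'N', 'Y', 'N', 'N']})

sns.boxplot(x='income', y='Loan Status', data=df)    # box plot conditioned on the target
plt.figure()
sns.histplot(df['income'], kde=True)                 # histogram + kde

z = np.abs(stats.zscore(df['income']))               # distance from the mean in SDs
winsorized = mstats.winsorize(df['income'], limits=[0.05, 0.05])   # floor/ceiling the extreme 5% per side
capped = np.where(z > 2, df['income'].median(), df['income'])      # replace extreme points with the median
```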
High vs low variance

[Figure: distributions with high vs low variance]
Source: https://machinelearningmastery.com/a-gentle-introduction-to-calculating-normal-summary-statistics/
Standardization vs normalization

Standardization: Z-score standardization; scales to mean 0 and sd 1
Normalization: Min/max normalizing; scales to between (0, 1)

Source: https://medium.com/@rrfd/standardize-or-normalize-examples-in-python-e3f174b65dfc
Scaling functions

sklearn.preprocessing.StandardScaler() --> (mean=0, sd=1)
sklearn.preprocessing.MinMaxScaler() --> (0, 1)
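A minimal sketch of both scalers on a made-up one-column feature matrix; StandardScaler applies (x - mean) / sd and MinMaxScaler applies (x - min) / (max - min):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Made-up single-feature matrix (scalers expect 2-D input).
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

standardized = StandardScaler().fit_transform(X)   # (x - mean) / sd
normalized = MinMaxScaler().fit_transform(X)       # (x - min) / (max - min)

print(standardized.mean(), standardized.std())     # ~0.0 and 1.0
print(normalized.min(), normalized.max())          # 0.0 and 1.0
```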
Outliers and scaling

How should outliers be identified and properly dealt with? What effect does min/max or z-score standardization have on data? Select the statement that is true:

- An outlier is a point that is just outside the range of similar points in a feature.
- In a given context, outliers considered anomalous are helpful in building a predictive ML model.
- Min/max scaling gives data a mean of 0, an SD of 1, and increases variance.
- Z-score standardization scales data to be in the interval (0, 1) and improves model fit.
Outliers and scaling: answer

How should outliers be identified and properly dealt with? What effect does min/max or z-score standardization have on data?

The correct answer is: In a given context, outliers considered anomalous are helpful in building a predictive ML model.

(Data anomalies are common in fraud detection, cybersecurity events, and other scenarios where the goal is to find them.)