Handling missing data

Practicing Machine Learning Interview Questions in Python

Lisa Stuart, Data Scientist
Prerequisites

Supervised Learning with scikit-learn
Unsupervised Learning in Python
Course outline

Chapter 1: Pre-processing and Visualization
  Missing data, Outliers, Normalization
Chapter 2: Supervised Learning
  Feature selection, Regularization, Feature engineering
Chapter 3: Unsupervised Learning
  Cluster algorithm selection, Feature extraction, Dimension reduction
Chapter 4: Model Selection and Evaluation
  Model generalization and evaluation, Model selection
Machine learning (ML) pipeline
Our ML pipeline
Missing data

Impact of different techniques
Finding missing values
Strategies to handle
Techniques

1. Omission
   Removal of rows --> .dropna(axis=0)
   Removal of columns --> .dropna(axis=1)

2. Imputation
   Fill with zero --> SimpleImputer(strategy='constant', fill_value=0)
   Impute mean --> SimpleImputer(strategy='mean')
   Impute median --> SimpleImputer(strategy='median')
   Impute mode --> SimpleImputer(strategy='most_frequent')
   Iterative imputation --> IterativeImputer()
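A minimal sketch of these calls on a made-up two-column DataFrame (the column names and values are illustrative, not from the course data); note that IterativeImputer still needs scikit-learn's experimental import:

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (required for IterativeImputer)
from sklearn.impute import SimpleImputer, IterativeImputer

df = pd.DataFrame({'age': [25.0, np.nan, 40.0, 31.0],
                   'income': [50.0, 62.0, np.nan, 48.0]})

# 1. Omission
rows_kept = df.dropna(axis=0)   # drop rows containing any missing value
cols_kept = df.dropna(axis=1)   # drop columns containing any missing value

# 2. Imputation (each imputer returns an array of the filled values)
zero_filled   = SimpleImputer(strategy='constant', fill_value=0).fit_transform(df)
mean_filled   = SimpleImputer(strategy='mean').fit_transform(df)
median_filled = SimpleImputer(strategy='median').fit_transform(df)
mode_filled   = SimpleImputer(strategy='most_frequent').fit_transform(df)
iterative     = IterativeImputer().fit_transform(df)  # models each feature from the others
```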
Why bother?

Reduce the probability of introducing bias
Most ML algorithms require complete data
Effects of imputation

Depend on:
  Missing values
  Original variance
  Presence of outliers
  Size and direction of skew

Omission --> Can remove too much
Zero --> Biases results downward
Mean --> Affected more by outliers
Median --> Better in case of outliers
Mode and iterative imputation --> Try them out
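The answer slide later in this chapter notes that the best strategy is the one that disturbs the data the least; a quick, hedged way to compare is to check the variance under each fill on a made-up skewed feature:

```python
import numpy as np
import pandas as pd

# Made-up right-skewed feature with an outlier (50) and two missing values.
x = pd.Series([1, 2, 2, 3, 3, 3, 4, 50, np.nan, np.nan])

print('original variance:', x.var())                    # missing values skipped
print('zero filled:      ', x.fillna(0).var())           # zeros pull the center down
print('mean imputed:     ', x.fillna(x.mean()).var())    # mean (8.5) is dragged up by the outlier
print('median imputed:   ', x.fillna(x.median()).var())  # median (3.0) ignores the outlier
```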
Function --> returns

df.isna().sum() --> number missing
df['feature'].mean() --> feature mean
.shape --> row, column dimensions
df.columns --> column names
.fillna(0) --> fills missing with 0
.select_dtypes(include=[np.number]) --> numeric columns
.select_dtypes(include=['object']) --> string columns
.fit_transform(numeric_cols) --> fits and transforms
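Tying these together, a small sketch on a toy DataFrame (the columns are invented stand-ins for the course data):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy DataFrame standing in for the course data (columns are made up).
df = pd.DataFrame({'feature': [1.0, np.nan, 3.0, 4.0],
                   'city': ['NY', 'LA', None, 'SF']})

print(df.shape)                # (rows, columns)
print(df.columns)              # column names
print(df.isna().sum())         # number missing per column
print(df['feature'].mean())    # feature mean (missing values skipped by default)

numeric_cols = df.select_dtypes(include=[np.number])   # numeric columns only
string_cols = df.select_dtypes(include=['object'])     # string columns only

filled = df.fillna(0)          # fill every missing value with 0

imputed = SimpleImputer(strategy='mean').fit_transform(numeric_cols)  # fit and transform in one step
```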
Effects of missing values

What are the effects of missing values in a Machine Learning (ML) setting? Select the answer that is true:

- Missing values aren't a problem since most of sklearn's algorithms can handle them.
- Removing observations or features with missing values is generally a good idea.
- Missing data tends to introduce bias that leads to misleading results, so they cannot be ignored.
- Filling missing values with zero will bias results upward.
Effects of missing values: answer

What are the effects of missing values in a Machine Learning (ML) setting?

The correct answer is: Missing data tends to introduce bias that leads to misleading results, so they cannot be ignored.

(Filling missing values by testing which strategy impacts the variance of a given dataset the least is the best approach.)
Effects of missing values: incorrect answers

What are the effects of missing values in a Machine Learning (ML) setting?

Missing values aren't a problem... (Most of sklearn's algorithms cannot handle missing values and will throw an error.)

Removing observations or features with missing values... (Unless your dataset is large and the proportion of missing values small, removing rows or columns with missing data usually shrinks your dataset too much to be useful in subsequent ML.)

Filling missing values with zero will bias results upward. (It's the opposite: filling with zero will bias results downward.)
Let's practice!
Data distributions and transformations

Practicing Machine Learning Interview Questions in Python

Lisa Stuart, Data Scientist
Different distributions

[Figure: training and test data sets drawn from different distributions]
Source: https://www.researchgate.net/figure/Bias-Training-and-test-data-sets-are-drawn-from-different-distributions_fig22_330485084
Train/test split

train, test = train_test_split(df, test_size=0.3)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

sns.pairplot() --> plot matrix of distributions and scatterplots
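To check that the two splits follow similar distributions, a pairplot of each split can be compared side by side. A hedged sketch using seaborn's built-in iris data as a stand-in for the course dataset:

```python
import seaborn as sns
from sklearn.model_selection import train_test_split

# Iris is only a stand-in for the course data.
df = sns.load_dataset('iris')
X = df.drop(columns='species')
y = df['species']

# Split a whole DataFrame, or split features and target together.
train, test = train_test_split(df, test_size=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Histograms on the diagonal, scatterplots off it; comparing the two plots
# shows whether the splits were drawn from similar distributions.
sns.pairplot(train)
sns.pairplot(test)
```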
Data transformation

[Figure: example of the effect of a log transformation on the distribution of a dataset]
Source: https://www.researchgate.net/figure/Example-of-the-effect-of-a-log-transformation-on-the-distribution-of-the-dataset_fig20_308007227
Box-Cox transformations

scipy.stats.boxcox(data, lmbda= )

lmbda (p)    x^p transform        name
-2           x^-2 = 1/x^2         reciprocal square
-1           x^-1 = 1/x           reciprocal
-1/2         x^-0.5 = 1/√x        reciprocal square root
0.0          log(x)               log
1/2          x^0.5 = √x           square root
1            x^1 = x              no transform
2            x^2                  square
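In scipy, passing a fixed lmbda returns just the transformed data, while omitting it lets scipy estimate the lambda that makes the result closest to normal. A small sketch on made-up, strictly positive skewed data (Box-Cox requires positive values):

```python
import numpy as np
from scipy import stats

# Made-up, strictly positive, right-skewed data.
rng = np.random.default_rng(0)
data = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

# Fixed lambda: returns only the transformed values (lmbda=0.0 is the log transform).
logged = stats.boxcox(data, lmbda=0.0)

# Lambda omitted: scipy estimates the lambda that makes the result closest to normal
# and returns both the transformed data and that lambda.
transformed, best_lambda = stats.boxcox(data)
print(best_lambda)   # should be near 0 for lognormal data
```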
Let's practice!
Data outliers and scaling

Practicing Machine Learning Interview Questions in Python

Lisa Stuart, Data Scientist
Outliers

One or more observations that are distant from the rest of the observations in a given feature.

Source: https://bolt.mph.ufl.edu/6050-6052/unit-1/one-quantitative-variable-introduction/understanding-outliers/
Inter-quartile range (IQR)

[Figure illustrating the inter-quartile range]
Image: by Jhguch at en.wikipedia, CC BY-SA 2.5, https://commons.wikimedia.org/w/index.php?curid=14524285
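The box plot's whiskers come from the IQR; a common rule of thumb flags points more than 1.5 * IQR beyond the quartiles as outliers. A minimal sketch with made-up numbers:

```python
import pandas as pd

# Made-up feature with one clear outlier (95).
x = pd.Series([10, 12, 12, 13, 14, 15, 15, 16, 18, 95])

q1, q3 = x.quantile(0.25), x.quantile(0.75)
iqr = q3 - q1

# 1.5 * IQR rule: points beyond the whiskers are flagged as outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = x[(x < lower) | (x > upper)]
print(outliers)   # flags 95
```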
Line of best fit

[Figure: outliers relative to a line of best fit]
Source: https://www.r-bloggers.com/outlier-detection-and-treatment-with-r/
Outlier functions

Function --> returns

sns.boxplot(x= , y='Loan Status') --> box plot conditioned on target variable
sns.distplot() --> histogram and kernel density estimate (kde)
np.abs() --> absolute value
stats.zscore() --> calculated z-score
mstats.winsorize(limits=[0.05, 0.05]) --> floor and ceiling applied to outliers
np.where(condition, true, false) --> replaced values
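A short sketch chaining these calls on an invented 'income' feature with a toy 'Loan Status' target (both made up for illustration); sns.distplot() is deprecated in newer seaborn, so sns.histplot(..., kde=True) is used in its place:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.stats import mstats

# Invented data: 'income' has one extreme value; 'Loan Status' is a toy target.
df = pd.DataFrame({'income': [40, 42, 45, 47, 50, 52, 55, 300],
                   'Loan Status': ['Y', 'N', 'Y', 'Y', 'N', 'Y', 'N', 'N']})

sns.boxplot(x='income', y='Loan Status', data=df)    # box plot conditioned on the target
plt.figure()
sns.histplot(df['income'], kde=True)                 # histogram + kde

z = np.abs(stats.zscore(df['income']))               # distance from the mean in SDs
winsorized = mstats.winsorize(df['income'], limits=[0.05, 0.05])   # floor/ceiling the extreme 5% per side
capped = np.where(z > 2, df['income'].median(), df['income'])      # replace extreme points with the median
```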
High vs low variance

[Figure: distributions with high vs low variance]
Source: https://machinelearningmastery.com/a-gentle-introduction-to-calculating-normal-summary-statistics/
Standardization vs normalization

Standardization: Z-score standardization; scales to mean 0 and sd 1
Normalization: Min/max normalizing; scales to between (0, 1)

Source: https://medium.com/@rrfd/standardize-or-normalize-examples-in-python-e3f174b65dfc
Scaling functions

sklearn.preprocessing.StandardScaler() --> (mean=0, sd=1)
sklearn.preprocessing.MinMaxScaler() --> (0, 1)
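A minimal sketch of both scalers on a made-up one-column feature matrix; StandardScaler applies (x - mean) / sd and MinMaxScaler applies (x - min) / (max - min):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Made-up single-feature matrix (scalers expect 2-D input).
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

standardized = StandardScaler().fit_transform(X)   # (x - mean) / sd
normalized = MinMaxScaler().fit_transform(X)       # (x - min) / (max - min)

print(standardized.mean(), standardized.std())     # ~0.0 and 1.0
print(normalized.min(), normalized.max())          # 0.0 and 1.0
```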
Outliers and scaling

How should outliers be identified and properly dealt with? What effect does min/max or z-score standardization have on data? Select the statement that is true:

- An outlier is a point that is just outside the range of similar points in a feature.
- In a given context, outliers considered anomalous are helpful in building a predictive ML model.
- Min/max scaling gives data a mean of 0, an SD of 1, and increases variance.
- Z-score standardization scales data to be in the interval (0, 1) and improves model fit.
Outliers and scaling: answer

How should outliers be identified and properly dealt with? What effect does min/max or z-score standardization have on data?

The correct answer is: In a given context, outliers considered anomalous are helpful in building a predictive ML model.

(Data anomalies are common in fraud detection, cybersecurity events, and other scenarios where the goal is to find them.)