Exploratory data analysis
PREDICTING CTR WITH MACHINE LEARNING IN PYTHON
Kevin Huo, Instructor
A closer look at features
int: an integer: 1, 2, etc.
float: decimals: 3.02, 4.56, etc.
object: string: "hello", "world", etc.
datetime: datetime: 2018-01-01, etc.

    print(df.columns)
    ['id', 'click', 'hour', 'C1', ... ]

    print(df.dtypes)
    id       object
    click     int64
    ...

    df.select_dtypes(include=['int', 'float'])
    click     int64
    ...
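The dtype inspection above can be tried on a small frame. This is a minimal sketch: the frame and its `price` column are made up for illustration, and it selects with `include=["number"]` instead of the slide's `['int', 'float']`, since `"number"` matches every integer and float width regardless of platform.

```python
import pandas as pd

# Hypothetical toy frame; column names loosely follow the slide, values are made up.
df = pd.DataFrame({
    "id": ["a1", "b2", "c3"],
    "click": [0, 1, 0],
    "hour": pd.to_datetime(["2018-01-01", "2018-01-02", "2018-01-03"]),
    "price": [3.02, 4.56, 1.10],
})

print(df.dtypes)

# Keep only the numeric columns (all integer and float dtypes).
numeric_df = df.select_dtypes(include=["number"])
print(list(numeric_df.columns))
```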
Missing data

    df.info()
    Data columns (total 24 columns):
    id    50000 non-null object
    ...

    df['id'].isnull()
    [False, False, False, False, ... ]

    df.isnull().sum(axis = 0)
    id    0
    ...
    dtype: object

    df.isnull().sum(axis = 0).sum()
    0
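The slide's dataset has no missing values, so every count is 0. As a sketch of what the same calls return when values are missing, here is a made-up frame with two gaps (the data is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two missing values, made up for illustration.
df = pd.DataFrame({
    "id": ["a1", "b2", None, "d4"],
    "click": [0, 1, 0, 1],
    "C1": [1005.0, np.nan, 1002.0, 1005.0],
})

missing_per_column = df.isnull().sum(axis=0)   # one count per column
total_missing = int(missing_per_column.sum())  # single number for the whole frame

print(missing_per_column)
print(total_missing)

# Rows that contain at least one missing value.
rows_with_missing = df[df.isnull().any(axis=1)]
print(rows_with_missing)
```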
Looking at distributions

    df.groupby(['search_engine_type', 'click']).size()
    search_engine_type  click
    1002                0        940
                        1        240
    ...

    df.groupby(['search_engine_type', 'click']).size().unstack()
    click                 0    1
    search_engine_type
    1002                940  240
    ...
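A minimal sketch of the long and wide forms shown above, on made-up data. The `fill_value=0` argument is an addition not shown on the slide; it keeps a zero instead of NaN when a (type, click) pair never occurs.

```python
import pandas as pd

# Hypothetical toy data; the search_engine_type values mimic the slide.
df = pd.DataFrame({
    "search_engine_type": [1002, 1002, 1002, 1005, 1005],
    "click":              [0,    0,    1,    0,    1],
})

# Long form: one row per (search_engine_type, click) pair.
counts_long = df.groupby(["search_engine_type", "click"]).size()

# Wide form: click values become columns; fill_value=0 covers absent pairs.
counts_wide = counts_long.unstack(fill_value=0)
print(counts_wide)
```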
Breakdown by CTR

    df.reset_index()
    click  search_engine_type    0    1
                         1002  940  240

    df.rename(columns = {0: 'non_clicks', 1: 'clicks'}, inplace = True)
    click  search_engine_type  non_clicks  clicks
                         1002         940     240
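The slide stops at the renamed counts. One way to get from there to an actual click-through rate is sketched below, assuming CTR is defined as clicks divided by total impressions. The 1002 row repeats the slide's counts; the 1005 row is made up.

```python
import pandas as pd

# Wide click counts: the 1002 row repeats the slide, the 1005 row is hypothetical.
counts = pd.DataFrame(
    {0: [940, 300], 1: [240, 100]},
    index=pd.Index([1002, 1005], name="search_engine_type"),
)
counts.columns.name = "click"

breakdown = counts.reset_index().rename(columns={0: "non_clicks", 1: "clicks"})

# CTR = clicks / impressions, where impressions = clicks + non_clicks.
impressions = breakdown["clicks"] + breakdown["non_clicks"]
breakdown["CTR"] = breakdown["clicks"] / impressions
print(breakdown)
```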
Let's practice!
Feature engineering
Dealing with dates

    print(df.hour.head(1))
    14102101

    df['hour'] = pd.to_datetime(df['hour'], format = '%y%m%d%H')
    df['hour_of_day'] = df['hour'].dt.hour
    print(df.hour.head(1))
    2014-10-21 01:00:00

    print(df.groupby('hour_of_day')['click'].sum())
                 click
    hour_of_day
    1             1092
    2             6546
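A self-contained sketch of the same date handling on made-up rows. The cast to `str` is an added precaution, not something the slide shows: it makes sure the `%y%m%d%H` format is matched against text such as "14102101" even when the column was loaded as integers.

```python
import pandas as pd

# Hypothetical sample in the slide's YYMMDDHH layout.
df = pd.DataFrame({"hour": [14102101, 14102101, 14102102], "click": [1, 0, 1]})

# Cast to str first so the format string is matched against text like "14102101".
df["hour"] = pd.to_datetime(df["hour"].astype(str), format="%y%m%d%H")
df["hour_of_day"] = df["hour"].dt.hour

# Total clicks observed in each hour of the day.
clicks_by_hour = df.groupby("hour_of_day")["click"].sum()
print(clicks_by_hour)
```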
Converting categorical variables via hashing
Categorical features must be converted into a numerical format
Hash function: maps arbitrary input to an integer output, returning the exact same output for a given input
Lambda function: lambda x: f(x)
Apply the hash function via f(x) = hash(x) as follows:

    df['site_id'] = df['site_id'].apply(lambda x: hash(x))

    83a0ad1a -> -9161053084583616050
    85f751fd -> 818242008494177460
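One caveat with the built-in `hash()`: for strings, Python salts it per process unless `PYTHONHASHSEED` is fixed, so the integers can differ between runs. If hashed features need to be reproducible, for example between training and scoring, a digest-based hash is one option. The sketch below uses a hypothetical helper, `stable_hash`, built on SHA-256. It is not the slide's method, only an illustration of the same idea with a deterministic function.

```python
import hashlib

import pandas as pd


def stable_hash(value: str) -> int:
    """Hypothetical helper: map a string to an integer in [0, 2**63)."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    # Take the first 8 bytes and reduce so the result fits a signed 64-bit column.
    return int.from_bytes(digest[:8], byteorder="big") % (2 ** 63)


df = pd.DataFrame({"site_id": ["83a0ad1a", "85f751fd", "83a0ad1a"]})
df["site_id_hashed"] = df["site_id"].apply(stable_hash)
print(df)
```

Equal inputs always map to the same integer, in this process and in later ones, which is the property the slide relies on.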
A closer look at features
Examples of count() and nunique():

    df['ad_type'].count()
    50000

    df['ad_type'].nunique()
    31
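Both methods skip missing values by default, which is easy to overlook when the data has none. A small sketch with made-up `ad_type` values shows the difference from a plain row count:

```python
import numpy as np
import pandas as pd

# Hypothetical ad_type values, including one missing entry.
ad_type = pd.Series([1, 2, 2, 3, np.nan, 3, 3], name="ad_type")

n_rows = len(ad_type)            # all rows, missing included
n_present = ad_type.count()      # non-missing rows only
n_distinct = ad_type.nunique()   # distinct non-missing values

print(n_rows, n_present, n_distinct)
```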
Creating features
Most of the variables are categorical
Adding more features can improve predictive power
Example of a new feature: impressions by device_id (user) and search_engine_type:

    df['device_id_count'] = df.groupby('device_id')['click'].transform("count")
    df['search_engine_type_count'] = df.groupby('search_engine_type')['click'].transform("count")
    print(df.head(1))
    ... device_id_count search_engine_type_count
    ...           40862                    47710
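The count features above rely on `transform("count")` returning one value per original row rather than one per group. A minimal sketch on a made-up impression log:

```python
import pandas as pd

# Hypothetical impression log; each row is one ad impression.
df = pd.DataFrame({
    "device_id": ["d1", "d1", "d2", "d1", "d3"],
    "search_engine_type": [1002, 1005, 1002, 1002, 1005],
    "click": [0, 1, 0, 0, 1],
})

# transform("count") returns one value per original row, so the result
# aligns with df and can be assigned directly as a new column.
df["device_id_count"] = df.groupby("device_id")["click"].transform("count")
df["search_engine_type_count"] = (
    df.groupby("search_engine_type")["click"].transform("count")
)
print(df)
```

Using `agg("count")` instead would give one row per group and would need a merge back onto the original frame.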
Let's practice!
Standardizing features
Why standardization is important
Standardization: ensuring your data fits the assumptions that models have
Certain features may have too high a variance, which might unfairly dominate models
Example: certain counts have too large a range of values due to one spam user
Does not apply to categorical variables such as site_id, app_id, device_id, etc.
Log normalization

    df.var()
    click    1.294270e-01
    hour     1.123316e-01

    df.var().median()
    0.7108583771671939

    print(df['device_id_count'].var())
    249362570.10134825

    df['device_id_count'] = df['device_id_count'].apply(lambda x: np.log(x))
    print(df['device_id_count'].var())
    15.628476003312514
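The effect of the log transform on a heavy-tailed count can be reproduced on a few made-up numbers. In this sketch the last value plays the role of a spam user. All values are hypothetical and at least 1, so `np.log` is defined everywhere; if zeros were possible, `np.log1p` would be a safer choice.

```python
import numpy as np
import pandas as pd

# Hypothetical impression counts per device; the last one mimics a spam user.
df = pd.DataFrame({"device_id_count": [1, 2, 3, 5, 8, 40000]})

var_before = df["device_id_count"].var()

# Counts are >= 1 here, so np.log is defined for every row.
df["device_id_count"] = np.log(df["device_id_count"])

var_after = df["device_id_count"].var()
print(var_before, var_after)
```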
Scaling data
Standard scaling converts all features to have a mean of 0 and a standard deviation of 1
Generally a good practice for machine learning models
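Standard scaling applies z = (x - mean) / std to each feature. The sketch below does this by hand with NumPy on made-up values, using the population standard deviation (`ddof=0`), which is the convention scikit-learn's `StandardScaler` follows.

```python
import numpy as np

# Hypothetical feature values.
x = np.array([10.5, 32.3, 4.0, 18.2])

# z = (x - mean) / std, with the population standard deviation (ddof=0).
z = (x - x.mean()) / x.std()

# After scaling, the mean is 0 and the standard deviation is 1 (up to rounding).
print(z.mean(), z.std())
```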
How to standard scale data
Scaling can be done using StandardScaler() as follows:

    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler()
    X[numeric_cols] = scaler.fit_transform(X[numeric_cols])

    1    10.5 -> 0.85
    2    32.3 -> 1.54
    dtype: float64
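A runnable sketch of the same call, assuming scikit-learn is installed. The frame `X` and the list `numeric_cols` are made up here; the categorical `site_id` column is deliberately left out of the scaling.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix; numeric_cols lists the columns to scale.
X = pd.DataFrame({
    "device_id_count": [10.5, 32.3, 4.0, 18.2],
    "search_engine_type_count": [100.0, 250.0, 80.0, 120.0],
    "site_id": ["a", "b", "a", "c"],  # categorical, left untouched
})
numeric_cols = ["device_id_count", "search_engine_type_count"]

scaler = StandardScaler()
X[numeric_cols] = scaler.fit_transform(X[numeric_cols])

# Each scaled column now has mean 0 and population standard deviation 1.
print(X[numeric_cols].mean())
print(X[numeric_cols].std(ddof=0))
```

In a train/test workflow the scaler would normally be fit on the training split only and then applied to the test split with `transform()`, so that no test-set statistics leak into training.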
Let's practice!