Time-delayed features and auto-regressive models

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON

Chris Holdgraf
Fellow, Berkeley Institute for Data Science
The past is useful
Timeseries data almost always have information that is shared between timepoints.
Information in the past can help predict what happens in the future.
Often the features best-suited to predict a timeseries are previous values of the same timeseries.
A note on smoothness and autocorrelation
A common question to ask of a timeseries: how smooth is the data?
In other words, how correlated is a timepoint with its neighboring timepoints (called autocorrelation)?
The amount of autocorrelation in your data will impact your models.
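A quick way to check this in code: pandas can compute the correlation between a series and a lagged copy of itself. This is a minimal sketch, assuming data is a pandas Series holding the timeseries, with the lag values chosen only for illustration.

# Autocorrelation at a few example lags; values near 1 indicate a smooth,
# highly autocorrelated signal
for lag in [1, 2, 5, 10]:
    print('Lag {}: autocorrelation = {:.3f}'.format(lag, data.autocorr(lag=lag)))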
Creating time-lagged features
Let's see how we could build a model that uses values in the past as input features.
We can use this to assess how autocorrelated our signal is (and lots of other stuff too).
Time-shifting data with Pandas

print(df)

    df
0  0.0
1  1.0
2  2.0
3  3.0
4  4.0

# Shift a DataFrame/Series by 3 index values towards the past
print(df.shift(3))

    df
0  NaN
1  NaN
2  NaN
3  0.0
4  1.0
Creating a time-shifted DataFrame

# data is a pandas Series containing time series data
data = pd.Series(...)

# Shifts
shifts = [0, 1, 2, 3, 4, 5, 6, 7]

# Create a dictionary of time-shifted data
many_shifts = {'lag_{}'.format(ii): data.shift(ii) for ii in shifts}

# Convert them into a DataFrame
many_shifts = pd.DataFrame(many_shifts)
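One practical note before fitting: each shifted column begins with NaN values, and scikit-learn estimators will not accept NaNs. A minimal sketch of one way to handle this (filling each column with its median) so the features stay aligned with the target; dropping the incomplete rows is another option.

# Shifted columns start with NaNs; fill them so every row can be used
many_shifts = many_shifts.fillna(many_shifts.median())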
Fitting a model with time-shifted features

from sklearn.linear_model import Ridge

# Fit the model using these input features
model = Ridge()
model.fit(many_shifts, data)
Interpreting the auto-regressive model coefficients

# Visualize the fit model coefficients
fig, ax = plt.subplots()
ax.bar(many_shifts.columns, model.coef_)
ax.set(xlabel='Coefficient name', ylabel='Coefficient value')

# Set formatting so it looks nice
plt.setp(ax.get_xticklabels(), rotation=45, horizontalalignment='right')
Visualizing coefficients for a rough signal
Visualizing coefficients for a smooth signal
Let's practice!
Cross-validating timeseries data
Cross-validation with scikit-learn

# Iterating over the "split" method yields train/test indices
for tr, tt in cv.split(X, y):
    model.fit(X[tr], y[tr])
    model.score(X[tt], y[tt])
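The loop above assumes a few objects already exist. A minimal sketch of that setup, with the data shapes and estimator chosen purely for illustration:

import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical data: 100 samples with 5 input features
X = np.random.randn(100, 5)
y = np.random.randn(100)

# Any scikit-learn estimator works here; Ridge matches the earlier slides
model = Ridge()

# "cv" is a cross-validation iterator; several choices are shown below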
Cross-validation types: KFold
KFold cross-validation splits your data into multiple "folds" of equal size.
It is one of the most common cross-validation routines.

from sklearn.model_selection import KFold
cv = KFold(n_splits=5)
for tr, tt in cv.split(X, y):
    ...
Visualizing model predictions

fig, axs = plt.subplots(2, 1)

# Plot the indices chosen for validation on each loop
axs[0].scatter(tt, [0] * len(tt), marker='_', s=2, lw=40)
axs[0].set(ylim=[-.1, .1], title='Test set indices (color=CV loop)',
           xlabel='Index of raw data')

# Plot the model predictions on each iteration
axs[1].plot(model.predict(X[tt]))
axs[1].set(title='Test set predictions on each CV loop',
           xlabel='Prediction index')
Visualizing KFold CV behavior
A note on shuffling your data
Many CV iterators let you shuffle data as a part of the cross-validation process.
This only works if the data is i.i.d., which timeseries usually is not.
You should not shuffle your data when making predictions with timeseries.

from sklearn.model_selection import ShuffleSplit
cv = ShuffleSplit(n_splits=3)
for tr, tt in cv.split(X, y):
    ...
Visualizing shuffled CV behavior
Using the time series CV iterator
Thus far, we've broken the linear passage of time in the cross-validation.
However, you generally should not use datapoints from the future to predict data in the past.
One approach: always use training data from the past to predict the future.
Visualizing time series cross-validation iterators

# Import and initialize the cross-validation iterator
from sklearn.model_selection import TimeSeriesSplit
cv = TimeSeriesSplit(n_splits=10)

fig, ax = plt.subplots(figsize=(10, 5))
for ii, (tr, tt) in enumerate(cv.split(X, y)):
    # Plot training and test indices
    l1 = ax.scatter(tr, [ii] * len(tr), c=[plt.cm.coolwarm(.1)],
                    marker='_', lw=6)
    l2 = ax.scatter(tt, [ii] * len(tt), c=[plt.cm.coolwarm(.9)],
                    marker='_', lw=6)
    ax.set(ylim=[10, -1], title='TimeSeriesSplit behavior',
           xlabel='data index', ylabel='CV iteration')
ax.legend([l1, l2], ['Training', 'Validation'])
Visualizing the TimeSeriesSplit cross-validation iterator
Custom scoring functions in scikit-learn
scikit-learn accepts any callable with the signature (estimator, X, y) as a scoring function:

def myfunction(estimator, X, y):
    y_pred = estimator.predict(X)
    # my_custom_function stands in for whatever metric you want to compute
    my_custom_score = my_custom_function(y_pred, y)
    return my_custom_score
A custom correlation function for scikit-learn

def my_pearsonr(est, X, y):
    # Generate predictions and convert to a vector
    y_pred = est.predict(X).squeeze()

    # Use the numpy "corrcoef" function to calculate a correlation matrix
    my_corrcoef_matrix = np.corrcoef(y_pred, y.squeeze())

    # Return a single correlation value from the matrix
    my_corrcoef = my_corrcoef_matrix[1, 0]
    return my_corrcoef
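A custom scorer like this can be passed straight to scikit-learn's cross-validation helpers through the scoring argument. A minimal usage sketch, assuming model, X, y, and a cross-validation iterator cv as defined earlier:

from sklearn.model_selection import cross_val_score

# One correlation score per cross-validation fold
scores = cross_val_score(model, X, y, cv=cv, scoring=my_pearsonr)
print(scores)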
Let's practice!
Stationarity and stability
Stationarity
Stationary time series do not change their statistical properties over time
(e.g., mean, standard deviation, trends).
Most time series are non-stationary to some extent.
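A quick, informal check for stationarity is to plot rolling statistics and see whether they drift over time. This is a minimal sketch, assuming data is a pandas Series, with the window size chosen only for illustration:

# Large drifts in the rolling mean or standard deviation suggest
# the series is non-stationary
rolling_mean = data.rolling(window=50).mean()
rolling_std = data.rolling(window=50).std()

fig, ax = plt.subplots()
ax.plot(data, label='raw data')
ax.plot(rolling_mean, label='rolling mean')
ax.plot(rolling_std, label='rolling std')
ax.legend()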
Model stability
Non-stationary data results in variability in our model.
The statistical properties the model finds may change with the data.
In addition, we will be less certain about the correct values of model parameters.
How can we quantify this?
Cross-validation to quantify parameter stability
One approach: use cross-validation.
Calculate model parameters on each iteration.
Assess parameter stability across all CV splits.
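A minimal sketch of what this could look like, collecting the fitted coefficients from each TimeSeriesSplit fold; the name cv_coefficients is an assumption, chosen to match the bootstrapping code later in this section:

from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit

cv = TimeSeriesSplit(n_splits=10)
model = Ridge()

# Fit on each training split and keep the fitted coefficients
cv_coefficients = []
for tr, tt in cv.split(X, y):
    model.fit(X[tr], y[tr])
    cv_coefficients.append(model.coef_)

# Shape: (n_cv_folds, n_coefficients)
cv_coefficients = np.array(cv_coefficients)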
Bootstrapping the mean
Bootstrapping is a common way to assess variability.
The bootstrap:
1. Take a random sample of the data with replacement
2. Calculate the mean of the sample
3. Repeat this process many times (1000s)
4. Calculate the percentiles of the result (usually 2.5 and 97.5)
The result is a 95% confidence interval of the mean of each coefficient.
Bootstrapping the mean

from sklearn.utils import resample

# cv_coefficients has shape (n_cv_folds, n_coefficients)
n_boots = 100
n_coefficients = cv_coefficients.shape[1]
bootstrap_means = np.zeros((n_boots, n_coefficients))
for ii in range(n_boots):
    # Generate random indices for our data with replacement,
    # then take the sample mean
    random_sample = resample(cv_coefficients)
    bootstrap_means[ii] = random_sample.mean(axis=0)

# Compute the percentiles of choice for the bootstrapped means
percentiles = np.percentile(bootstrap_means, (2.5, 97.5), axis=0)
Plotting the bootstrapped coefficients

fig, ax = plt.subplots()
ax.scatter(many_shifts.columns, percentiles[0], marker='_', s=200)
ax.scatter(many_shifts.columns, percentiles[1], marker='_', s=200)
Assessing model performance stability
If using TimeSeriesSplit, you can plot the model's score over time.
This is useful for finding regions of time that hurt the score.
It is also useful for finding non-stationary signals.
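A minimal sketch of one way to do this, scoring each TimeSeriesSplit validation fold in order and plotting the result; the my_pearsonr scorer defined earlier is assumed:

from sklearn.model_selection import cross_val_score, TimeSeriesSplit

cv = TimeSeriesSplit(n_splits=10)

# One score per validation fold, ordered in time
scores = cross_val_score(model, X, y, cv=cv, scoring=my_pearsonr)

fig, ax = plt.subplots()
ax.plot(scores, marker='o')
ax.set(xlabel='CV iteration (ordered in time)', ylabel='Score')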