Concept Drift: Learning on Data Streams
Pádraig Cunningham
Director, Insight @ UCD; PI @ CeADAR
Online Learning & Concept Drift Predictive Analytics The Old Days ❏ Static data, the concept doesn’t change ❏ Velocity & Variety rather than Volume Concept Drift Online Learning Learning on Data Streams ❏ Model Update ❏ Tools MOA: Massive Online Analysis (moa.cms.waikato.ac.nz) ❏ from the makers of Weka ❏ Apache Spark ❏ RapidMiner (rapidminer.com) ❏
A Typical Predictive Analytics Task

❏ Heart attack patient admitted: 19 variables measured during the first 24 hours
   ❏ Blood pressure, age, plus 17 other ordered and binary variables
   ❏ Considered important indicators of the patient’s condition
❏ Goal: learn from historic data and build a model to identify high-risk patients,
   i.e. patients who will not survive 30 days (based on evidence from the initial 24-hour data)
❏ Assumed to be a static ‘concept’: this model is good for all time

Consider just 3 features (see the sketch below the table):

  No   Age   BMI   BP    Res.
  1    60    20    140   Ok
  2    60    21    145   Ok
  3    85    23    130   Ok
  4    81    22    160   No
  5    70    24    170   No
  6    72    26    135   No
  7    81    26    145   No
  8    66    23    155   No
  Q    66    24    148   ?
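As a toy illustration of the static-model idea, here is a minimal sketch that fits a classifier to the eight labelled patients above and scores the query patient Q. It assumes scikit-learn is available; the choice of a decision tree is illustrative, not the method used in the talk:

```python
from sklearn.tree import DecisionTreeClassifier

# The eight labelled patients from the table: [Age, BMI, BP].
X = [[60, 20, 140], [60, 21, 145], [85, 23, 130], [81, 22, 160],
     [70, 24, 170], [72, 26, 135], [81, 26, 145], [66, 23, 155]]
# Res. column: "Ok" = survived 30 days, "No" = did not survive.
y = ["Ok", "Ok", "Ok", "No", "No", "No", "No", "No"]

model = DecisionTreeClassifier(random_state=0).fit(X, y)

# Query patient Q from the table: Age 66, BMI 24, BP 148.
print(model.predict([[66, 24, 148]]))  # the model's call for patient Q
```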
Predictive Analytics AKA: Supervised ML
Volume: Big Data’s Little Secret

❏ ‘Big’ doesn’t really matter... typically not a lot of data is needed to build a good model
❏ Example: classification of prostate cancer microscopy images (Gleason Grade 3 vs Gleason Grade 4)
Velocity: Game Analytics

❏ Task: user segmentation, predicting return from new users
   ❏ Inputs: player profile
   ❏ Outputs: premium user, yes/no?
❏ A game’s lifecycle (user numbers) tracks the technology-adoption curve (user type)
❏ Consider: a model trained on Early Adopters but used on the Late Majority
Other Game Lifecycles

[Google Trends charts: Flappy Bird, Angry Birds, Farmville]
Energy Demand Prediction

❏ The demand profile has different ‘regimes’
Concept Drift

[Figure: the same decision boundary at training time vs later on]

❏ Over time, things the model expects to be positive come up negative
❏ Examples: bad loans, antibiotic resistance, conversion / churn prediction
Concept Drift: Spam Detection

❏ Without retraining, error creeps up over time (see the sketch after this slide)

[Figure: error % vs time; the static model’s error climbs while the updating model’s stays low]

Delany, Cunningham, Tsymbal, FLAIRS 2006
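The creeping-error effect is easy to reproduce on synthetic data. A minimal sketch, assuming NumPy and scikit-learn; the drift mechanism here (a decision boundary that rotates over time) is an illustrative assumption, not the setup of the cited spam study:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def batch(t, n=300):
    """One batch from a stream whose decision boundary drifts with t."""
    X = rng.normal(size=(n, 2))
    y = (X[:, 0] + 0.2 * t * X[:, 1] > 0).astype(int)  # boundary rotates over time
    return X, y

X0, y0 = batch(0)
static = LogisticRegression(max_iter=1000).fit(X0, y0)    # trained once, never updated
updating = LogisticRegression(max_iter=1000).fit(X0, y0)  # retrained on each new batch

for t in range(1, 6):
    X, y = batch(t)
    # Evaluate both models on the new batch before the updating one retrains.
    print(f"t={t}  static err={1 - static.score(X, y):.2f}  "
          f"updating err={1 - updating.score(X, y):.2f}")
    updating.fit(X, y)  # the updating model learns the current regime
```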
Model Update Strategies

❏ First: when does new data (outcomes) become available?
   ❏ Immediately: energy consumption, financial prediction
   ❏ After a time lag: credit scoring, online games (financial return); see the buffering sketch after this slide
   ❏ Sometimes never

[Figure: a training window over the stream; inputs are known up to time t, but the outcome at t is still ‘?’]

❏ Key parameters
   ❏ window size
   ❏ update frequency
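Labels that arrive with a time lag must be joined back to the predictions they belong to before the model can learn from them. A minimal sketch of that bookkeeping, assuming each event carries a unique id; all names here are hypothetical, for illustration:

```python
# Sketch: buffering predictions until delayed outcomes arrive.
pending = {}        # event_id -> features we predicted on
training_data = []  # (features, outcome) pairs ready for retraining

def on_prediction(event_id, features, model):
    """Score an incoming event and remember it until its label arrives."""
    pending[event_id] = features
    return model.predict([features])[0]

def on_outcome(event_id, outcome):
    """Join a late-arriving outcome back to its stored features."""
    features = pending.pop(event_id, None)
    if features is not None:
        training_data.append((features, outcome))
```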
Model Update

❏ Retrain the model on a recent ‘window’ of data (see the sketch after this slide)
❏ Window size
   ❏ Big enough: hundreds of examples
   ❏ Not too big: the oldest data in the window must still be relevant
❏ Update frequency
   ❏ The model should be up to date
   ❏ But there is no need to retrain on every click

[Figure: a window that is too big vs one that is about right: enough data?]
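A minimal sketch of windowed retraining, assuming scikit-learn; the values of WINDOW_SIZE and UPDATE_EVERY are illustrative assumptions, and in practice both are tuned to the stream:

```python
from collections import deque
from sklearn.linear_model import LogisticRegression

WINDOW_SIZE = 500    # keep only the most recent labelled examples
UPDATE_EVERY = 100   # retrain after this many new examples, not on every click

window = deque(maxlen=WINDOW_SIZE)  # the oldest examples fall off automatically
model = None
seen = 0

def observe(features, label):
    """Add one labelled example from the stream; retrain periodically."""
    global model, seen
    window.append((features, label))
    seen += 1
    if seen % UPDATE_EVERY == 0:
        X = [f for f, _ in window]
        y = [l for _, l in window]
        model = LogisticRegression(max_iter=1000).fit(X, y)  # retrain on the window
```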
Software: MOA (Massive Online Analysis)

❏ From the makers of Weka
❏ moa.cms.waikato.ac.nz
Software: Apache Spark

❏ From Apache
Summary

❏ The Old Days: train static models on historic data
❏ Predictive analytics on data streams
   ❏ The underlying ‘concept’ changes over time; how will we know?
   ❏ The model can be updated using new data
❏ Adaptive models: data workflows become a big issue
❏ New parameters to be considered
   ❏ Window size
   ❏ Update frequency