
EuroPython 2020: Real Time Machine Learning with Python



  1. EuroPython 2020: Real Time Machine Learning with Python
      Alejandro Saucedo | as@seldon.io | Twitter: @AxSaucedo

  2. Hello, my name is Alejandro Saucedo (@AxSaucedo)
      ● Engineering Director, Seldon Technologies
      ● Chief Scientist, The Institute for Ethical AI & ML
      ● Head of Solutions Eng & Sci, Eigen Technologies
      ● Software Engineer & DevX Lead, Bloomberg LP

  3. Seldon: OSS Production ML Deployment

  4. The Institute for Ethical AI & Machine Learning

  5. We are part of the LFAI

  6. Today
      ● Conceptual intro to stream processing
      ● Machine learning for real time
      ● Tradeoffs across tools
      ● Hands on use-case

  7. Real Time Reddit Processing
      ● Real time ML model for Reddit comments
      ● 200k comments for training model
      ● /r/science comments removed by mods
      We will be fixing the front page of the internet

  8. A trip to the past present: ETL
      ● E - Extract
      ● T - Transform
      ● L - Load
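
The three ETL stages can be sketched in plain Python. This is a minimal illustration rather than code from the talk: the sample records, the in-memory `warehouse` dictionary and the function names `extract`, `transform` and `load` are all hypothetical.

```python
import csv
import io

# Hypothetical raw input standing in for an external source system.
RAW_CSV = "id,score\nc1,10\nc2,-3\nc3,7\n"


def extract(raw_text):
    """Extract: read rows out of the source format."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def transform(rows):
    """Transform: cast types and keep only positively scored rows."""
    typed = [{"id": r["id"], "score": int(r["score"])} for r in rows]
    return [r for r in typed if r["score"] > 0]


def load(rows, store):
    """Load: write the transformed rows into the destination store."""
    for row in rows:
        store[row["id"]] = row
    return store


# A plain dict plays the role of the destination (assumption for illustration).
warehouse = load(transform(extract(RAW_CSV)), {})
print(sorted(warehouse))
```

Loading the raw rows first and transforming them inside the store would turn the same three functions into an ELT flow.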

  9. Variations
      ● ETL - Extract Transform Load
      ● ELT - Extract Load Transform
      ● EL - Extract Load
      ● LT - Load Transform
      ● WTF - LOL

  10. Specialised Tools

  11. Specialised Tools by category
      ● EL: Nifi, Flume
      ● ETL: Oozie, Airflow
      ● ELT: Elasticsearch, Data Warehouse
      ● … Jupyter notebook?

  12. Batch VS Streaming
      The spectrum of data processing

  13. Batch VS AND Streaming
      The right tool for the challenge

  14. Unifying Worlds
      Massive drive on converging worlds

  15. Streaming Concepts: Windows
      Processing of batches in real time
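
One common kind of window is the tumbling window: fixed-size, non-overlapping time buckets, each treated as a small batch. The sketch below is an illustration in plain Python, not code from the talk. The window size, the sample events and the function name `tumbling_windows` are assumptions.

```python
from collections import defaultdict

WINDOW_SECONDS = 10  # assumed window size, for illustration only


def tumbling_windows(events, size=WINDOW_SECONDS):
    """Group (timestamp, value) events into fixed, non-overlapping windows."""
    windows = defaultdict(list)
    for timestamp, value in events:
        # Every timestamp maps to exactly one window, keyed by its start time.
        window_start = (timestamp // size) * size
        windows[window_start].append(value)
    return dict(windows)


# Hypothetical events: (event time in seconds, comment score).
events = [(1, 5), (4, 2), (11, 7), (19, 1), (25, 3)]
per_window_totals = {
    start: sum(values) for start, values in tumbling_windows(events).items()
}
print(per_window_totals)
```

The sketch has the full event list in hand. A stream processor would instead emit each window's result as soon as that window closes.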

  16. Streaming Concepts: Checkpoints
      Keeping track of stream progress
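
A checkpoint can be as simple as a persisted offset that a restarted consumer reads before continuing. The following is a minimal sketch under that assumption, using a JSON file as the checkpoint store. The function name `process_stream`, the `stop_after` parameter (which simulates an interruption) and the file layout are hypothetical and do not reflect how any particular stream processor implements checkpoints.

```python
import json
import os
import tempfile


def process_stream(records, checkpoint_path, handle, stop_after=None):
    """Process records from the last checkpointed offset onwards."""
    offset = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            offset = json.load(f)["offset"]

    processed = 0
    for position in range(offset, len(records)):
        if stop_after is not None and processed >= stop_after:
            break  # simulate a crash or shutdown part-way through
        handle(records[position])
        processed += 1
        # Persist progress only after the record has been handled.
        with open(checkpoint_path, "w") as f:
            json.dump({"offset": position + 1}, f)


records = ["a", "b", "c", "d", "e"]
seen = []
checkpoint_file = os.path.join(tempfile.mkdtemp(), "checkpoint.json")

process_stream(records, checkpoint_file, seen.append, stop_after=2)  # first run
process_stream(records, checkpoint_file, seen.append)  # restart resumes at "c"
print(seen)
```

Writing the checkpoint after handling a record means a crash between the two steps replays that record on restart, so this sketch gives at-least-once rather than exactly-once processing.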

  17. Streaming Concepts: Watermarks
      Considering data that comes late in windows and stream batches
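
A watermark can be sketched as a moving threshold that trails the newest event time seen so far: a window closes once the watermark passes its end, and events arriving for an already closed window are treated as too late. The code below is a simplified illustration in plain Python. The window size, the lateness bound, the sample events and the function name `windows_with_watermark` are assumptions, and real engines offer more policies than dropping late data.

```python
from collections import defaultdict

WINDOW_SECONDS = 10   # assumed window size
ALLOWED_LATENESS = 5  # assumed lag of the watermark behind the newest event


def windows_with_watermark(events):
    """Return (closed windows in emission order, dropped late events)."""
    open_windows = defaultdict(list)
    closed, dropped = [], []
    watermark = float("-inf")

    for event_time, value in events:  # events arrive in processing order
        window_start = (event_time // WINDOW_SECONDS) * WINDOW_SECONDS
        if window_start + WINDOW_SECONDS <= watermark:
            dropped.append((event_time, value))  # its window already closed
        else:
            open_windows[window_start].append(value)

        # The watermark trails the largest event time seen so far.
        watermark = max(watermark, event_time - ALLOWED_LATENESS)
        for start in sorted(open_windows):
            if start + WINDOW_SECONDS <= watermark:
                closed.append((start, open_windows.pop(start)))

    return closed, dropped


# Hypothetical out-of-order stream: (event time in seconds, payload).
events = [(1, "a"), (12, "b"), (8, "c"), (16, "d"), (3, "e"), (27, "f")]
closed, dropped = windows_with_watermark(events)
print(closed, dropped)
```

Here event `"c"` arrives out of order but before its window closes, so it is still counted, while `"e"` arrives after the watermark has passed the end of its window and is dropped.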

  18. Some Stream Processing Tools
      ● Flink (Multiple Languages)
      ● Kafka Streams (Multiple Languages)
      ● Spark Streaming (Multiple Languages)
      ● Faust (Python)
      ● Apache Beam (Python)

  19. Today we’re using
      ● Stream Processing
      ● ML Serving
      ● ML Training

  20. Machine Learning Workflow

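  21. Model Training
      Pipeline stages: Clean Text, SpaCy Tokenizer, TFIDF Vectorizer, Logistic Regression

          from sklearn.feature_extraction.text import TfidfVectorizer
          from sklearn.linear_model import LogisticRegression
          from ml_utils import CleanTextTransformer, SpacyTokenTransformer

          # Clean Text
          clean_text_transformer = CleanTextTransformer()
          # SpaCy Tokenizer
          spacy_tokenizer = SpacyTokenTransformer()
          # TFIDF Vectorizer: input is already tokenized, so the
          # preprocessor and tokenizer are identity functions
          tfidf_vectorizer = TfidfVectorizer(
              min_df=3,
              max_features=1000,
              preprocessor=lambda x: x,
              tokenizer=lambda x: x,
              token_pattern=None,
              ngram_range=(1, 3),
              use_idf=1,
              smooth_idf=1,
              sublinear_tf=1)
          # Logistic Regression
          lr_model = LogisticRegression(C=1.0, verbose=True)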

  22. Model Training

          # Input example: “You are a DUMMY!!!!!”
          x_train_clean = \
              clean_text_transformer.transform(x_train)
          # After cleaning: “You are dummy”
          x_train_tokenized = \
              spacy_tokenizer.transform(x_train_clean)
          # After tokenizing: [ PRON, IS, DUMB ]
          tfidf_vectorizer.fit(
              x_train_tokenized[TOKEN_COLUMN].values)
          x_train_tfidf = \
              tfidf_vectorizer.transform(
                  x_train_tokenized[TOKEN_COLUMN].values)
          # After vectorizing: [ 1000, 0100, 0010 ]
          lr_model.fit(x_train_tfidf, y_train)
          # x_test_tfidf is not defined on the slide; presumably the test set
          # passed through the same steps
          pred = lr_model.predict(x_test_tfidf)
          # Prediction: [ 1 ]

  23. More on EDA & Model Evaluation
      https://github.com/axsaucedo/reddit-classification-exploration/

  24. Overview of Components
      ● Reddit Source
      ● Queue: Topic: reddit_stream, Topic: prediction, Topic: alert
      ● Stream processor: Processor: fetch_stream, Processor: ml_predict
      ● ML Service: seldon model
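
The way these components connect can be sketched with in-memory `asyncio` queues standing in for the three topics. This is an illustration only and not the talk's Faust and Seldon code: `stub_model` is a hypothetical placeholder for the ML service, the threshold value is assumed, and a `None` sentinel ends the stream, which a real unbounded stream would not have.

```python
import asyncio

MODERATION_THRESHOLD = 0.5  # assumed value; the slides do not state it


def stub_model(text):
    """Hypothetical stand-in for the ML service: share of upper-case letters."""
    letters = [c for c in text if c.isalpha()]
    return sum(c.isupper() for c in letters) / len(letters) if letters else 0.0


async def fetch_stream(comments, reddit_stream):
    """Source processor: publish each comment, then a None sentinel."""
    for comment in comments:
        await reddit_stream.put(comment)
    await reddit_stream.put(None)


async def ml_predict(reddit_stream, prediction, alert):
    """Scoring processor: publish every prediction, alert above the threshold."""
    while True:
        comment = await reddit_stream.get()
        if comment is None:
            break
        data = {"probability": stub_model(comment), "original": comment}
        await prediction.put(data)
        if data["probability"] > MODERATION_THRESHOLD:
            await alert.put(data)


async def run_pipeline(comments):
    # One in-memory queue per topic in the diagram.
    reddit_stream = asyncio.Queue()
    prediction = asyncio.Queue()
    alert = asyncio.Queue()
    await asyncio.gather(
        fetch_stream(comments, reddit_stream),
        ml_predict(reddit_stream, prediction, alert),
    )
    flagged = [alert.get_nowait()["original"] for _ in range(alert.qsize())]
    return prediction.qsize(), flagged


n_predictions, alerts = asyncio.run(
    run_pipeline(["Interesting paper, thanks.", "YOU ARE A DUMMY"])
)
print(n_predictions, alerts)
```

Every comment yields a message on the prediction queue, and only those scoring above the threshold are also sent to the alert queue.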

  25. Generating comments
      (Diagram: the component overview from slide 24.)

          # Runs every 0.1 seconds and publishes one comment to the stream
          @app.timer(0.1)
          async def generate_reddit_comments():
              reddit_sample = await fetch_reddit_comment()
              reddit_data = {
                  "id": reddit_sample["id"].values[0],
                  "score": int(reddit_sample["score"].values[0]),
                  # ... Cut down for simplicity
              }
              await app.topic("reddit_stream").send(
                  key=reddit_data["id"],
                  value=reddit_data)

  26. ML Stream Processing Step
      (Diagram: the component overview from slide 24.)

          # Consumes comments from the reddit_stream topic
          @app.agent(app.topic("reddit_stream"))
          async def predict_reddit_content(tokenized_stream):
              async for key, comment_extended in tokenized_stream.items():
                  tokens = comment_extended["body_tokens"]
                  probability = seldon_prediction_req(tokens)
                  data = {
                      "probability": probability,
                      "original": comment_extended["body"]
                  }
                  # Publish every prediction
                  await app.topic("reddit_prediction").send(
                      key=key,
                      value=data)
                  # Additionally alert moderators above the threshold
                  if probability > MODERATION_THRESHOLD:
                      await reddit_mod_alert_topic.send(
                          key=key,
                          value=data)

  27. ML Model Request Step
      (Diagram: the component overview from slide 24.)

          import numpy as np
          from seldon_core.seldon_client import SeldonClient

          sc = SeldonClient(
              gateway_endpoint="istio-ingress.istio-system.svc.cluster.local",
              deployment_name="reddit-model",
              namespace="default")

          def seldon_prediction_req(tokens):
              # Send the tokens to the deployed model and return its output
              data = np.array(tokens)
              output = sc.predict(data=data)
              return output.response["data"]["ndarray"]

  28. Overview of Seldon Model Serving

  29. Wrapping ML models for Serving with Seldon

          import dill
          from ml_utils import CleanTextTransformer, SpacyTokenTransformer

          class RedditClassifier:
              def __init__(self):
                  self._clean_text_transformer = CleanTextTransformer()
                  self._spacy_tokenizer = SpacyTokenTransformer()
                  with open('tfidf_vectorizer.model', 'rb') as model_file:
                      self._tfidf_vectorizer = dill.load(model_file)
                  with open('lr.model', 'rb') as model_file:
                      self._lr_model = dill.load(model_file)

              def predict(self, X, feature_names):
                  clean_text = self._clean_text_transformer.transform(X)
                  spacy_tokens = self._spacy_tokenizer.transform(clean_text)
                  tfidf_features = self._tfidf_vectorizer.transform(spacy_tokens)
                  predictions = self._lr_model.predict_proba(tfidf_features)
                  return predictions

  30. Overview of Components
      ● Reddit Source
      ● Queue: Topic: reddit_stream, Topic: prediction, Topic: alert
      ● Stream processor: Processor: fetch_stream, Processor: ml_predict
      ● ML Service: seldon model

  31. Recap of Today
      ● Conceptual intro to stream processing
      ● Machine learning for real time
      ● Tradeoffs across tools
      ● Hands on use-case

  32. EuroPython 2020: Real Time Machine Learning with Python
      Alejandro Saucedo | as@seldon.io | @AxSaucedo
