
Data Science in the Cloud, by Stefan Krawczyk (@stefkrawczyk)

Stefan Krawczyk (@stefkrawczyk, linkedin.com/in/skrawczyk), November 2016

Who are Data Scientists? Means: skills vary wildly. But they're in demand and expensive ("The Sexiest Job of the 21st Century", HBR).


  1. Online & Streamed Computation
    ● Very likely you start with a batch system
    ● Do you need to recompute:
      ○ features for all users?
      ○ predicted results for all users?
    ● Are you heavily dependent on your ETL running every night?
    ● Online vs Streamed depends on in house factors:
      ○ Number of models
      ○ How often they change
      ○ Cadence of output required
      ○ In house eng. expertise
      ○ etc.
    (Callout: We use online system for recommendations)
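
The "recompute for all users?" question above can be made concrete with a toy feature. This is a minimal sketch, not the speaker's system: the feature (a per-user event count), the `(user_id, item)` event shape, and both function names are assumptions made for illustration.

```python
from collections import Counter


def batch_features(events):
    """Batch style: recompute the feature for every user from the full event log."""
    return Counter(user_id for user_id, _item in events)


def online_update(features, event):
    """Online/streamed style: touch only the user named in the incoming event."""
    user_id, _item = event
    features[user_id] += 1
    return features


# Hypothetical event log: (user_id, item) pairs.
events = [("u1", "shirt"), ("u2", "dress"), ("u1", "jeans")]
nightly = batch_features(events)

# A new event arrives; only u2's feature is updated, no full recompute.
new_event = ("u2", "scarf")
live = online_update(Counter(nightly), new_event)
```

Both paths end in the same feature values; the difference is how much work each new event triggers, which is what the in house factors listed above trade off.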

  2-5. Streamed Example

  6-8. Online/Streaming Thoughts
    ● Dedicated infrastructure → More room on batch infrastructure
      ○ Hopefully $$$ savings
      ○ Hopefully less stressed Data Scientists
    ● Requires better software engineering practices
      ○ Code portability/reuse
      ○ Designing APIs/Tools Data Scientists will use
    ● Prototyping on AWS Lambda & Kinesis was surprisingly quick
      ○ Need to compile C libs on an Amazon Linux instance
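
To give a feel for the Lambda & Kinesis prototype mentioned above, here is a minimal sketch of a Python Lambda handler consuming a Kinesis batch. Kinesis payloads arrive base64-encoded under `record["kinesis"]["data"]`; everything else (the JSON payload with `user_id` and `features`, the `WEIGHTS` table, the `score` function) is a hypothetical stand-in, not the setup from the talk.

```python
import base64
import json

# Hypothetical model: a fixed linear scorer standing in for a real trained model.
WEIGHTS = {"clicks": 0.5, "purchases": 2.0}


def score(features):
    """Weighted sum over known features; unknown feature names are ignored."""
    return sum(
        WEIGHTS[name] * float(value)
        for name, value in features.items()
        if name in WEIGHTS
    )


def handler(event, context):
    """Lambda entry point: decode each Kinesis record in the batch and score it."""
    results = []
    for record in event.get("Records", []):
        # Kinesis delivers the payload base64-encoded.
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        results.append(
            {"user_id": payload["user_id"], "score": score(payload["features"])}
        )
    return results
```

The handler can be exercised locally by passing a hand-built event dict, which keeps the scoring code portable between the streaming path and a batch job.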

  9. What’s in a Model? Scaling model knowledge

  10-15. Ever:
    ● Had someone leave and then nobody understands how they trained their models?
      ○ Or you didn’t remember yourself?
    ● Had a performance dip in models and had trouble figuring out why?
      ○ Or not known what’s changed between model deployments?
    ● Wanted to compare model performance over time?
    ● Wanted to train a model in R/Python/Spark and then deploy it to a webserver?

  16-24. Produce Model Artifacts
    ● Isn’t that just saving the coefficients/model values?
      ○ NO!
    ● Why?
      ○ Helps a lot with model debugging & diagnosis!
      ○ How do you deal with organizational drift?
      ○ Makes it easy to keep an archive and track changes over time
      ○ Can more easily use in downstream processes
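
One way to picture an artifact that is more than saved coefficients: a small, language-neutral record that carries the coefficients together with the context needed to debug, archive, and compare models. This is a sketch under assumptions; the `ModelArtifact` class, its field names, and `diff_artifacts` are hypothetical and not taken from the talk.

```python
import json
from dataclasses import asdict, dataclass, field, replace


@dataclass
class ModelArtifact:
    """Coefficients plus the context needed to understand them later."""
    name: str
    version: str
    coefficients: dict
    feature_names: list
    training_data: str      # hypothetical pointer to the training snapshot
    code_revision: str      # hypothetical revision of the training code
    metrics: dict = field(default_factory=dict)

    def to_json(self):
        """Serialize to JSON so non-Python consumers can read the artifact."""
        return json.dumps(asdict(self), sort_keys=True, indent=2)

    @classmethod
    def from_json(cls, text):
        """Rebuild an artifact from its JSON form."""
        return cls(**json.loads(text))


def diff_artifacts(old, new):
    """Return {field: (old_value, new_value)} for every field that changed."""
    before, after = asdict(old), asdict(new)
    return {key: (before[key], after[key]) for key in before if before[key] != after[key]}


v1 = ModelArtifact(
    name="recs",
    version="1",
    coefficients={"clicks": 0.5},
    feature_names=["clicks"],
    training_data="snapshot-a",
    code_revision="abc123",
    metrics={"auc": 0.71},
)
# A later deployment with retrained coefficients.
v2 = replace(v1, version="2", coefficients={"clicks": 0.6})
changed = diff_artifacts(v1, v2)
```

Storing each deployed version this way gives an archive to compare over time, and `diff_artifacts` answers the "what changed between deployments?" question directly.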
