
BETTER TOGETHER: USING SPARK AND REDSHIFT TO COMBINE YOUR DATA WITH PUBLIC DATASETS

EUGENE MANDEL (@EUGMANDEL), JAWBONE, QCON SF 2014


  1. BETTER TOGETHER: USING SPARK AND REDSHIFT TO COMBINE YOUR DATA WITH PUBLIC DATASETS. EUGENE MANDEL (@EUGMANDEL), JAWBONE. QCON SF 2014

  2. JAWBONE DATA: MOVEMENT, SLEEP, WORKOUTS, MEALS, MOOD

  3. SOUTH NAPA EARTHQUAKE 2014

  4. [CHART: % OF PEOPLE AWAKE AT 3:25 VS. DISTANCE FROM EPICENTER (MILES)]

  5. DATA FUSION IS THE PROCESS OF INTEGRATION OF MULTIPLE DATA AND KNOWLEDGE REPRESENTING THE SAME REAL-WORLD OBJECT INTO A CONSISTENT, ACCURATE, AND USEFUL REPRESENTATION. (WIKIPEDIA)

  6. DATA FUSION - HOW TO FIND THE ELEPHANT. IMAGE SOURCE: HTTP://COMMONS.WIKIMEDIA.ORG/WIKI/FILE%3ABLIND_MEN_AND_ELEPHANT.PNG

  7. DATA FUSION: POWERFUL BUT HARD; DATA IS NOISY; DOMAIN UNDERSTANDING IS KEY

  8. LET’S TALK ABOUT THE WEATHER

  9. MODEL THE PROBLEM

  10. [CHART: ACTIVITY VS. AIR TEMP (°F)]

  11. FIND THE DATA

  12. UNDERSTAND THE DATA

  13. HOURLY DAILY

  14. DATA GENERATION PROCESS: NETWORK OF WEATHER STATIONS; FREQUENCY OF MEASUREMENTS - HOURLY TO DAILY; COLLABORATION WITH INTERNATIONAL AGENCIES; AGGREGATION AND QA BY NCDC

  15. UNDERSTAND THE DOMAIN. WEATHER STATION: TIME: 2014-07-09 13:04:00; AIR TEMP: 86°F; PRECIPITATION: 3CM

  16. QA THE DATA

  17. BUT ISN’T IT DONE?

  18. …MAYBE NOT! AIR TEMP: 105°F. BAKERSFIELD, CA (JULY 17, 15:00) VS. DULUTH, MN (JAN 12, 05:00)

  19. DATA VALIDATION: DOMAIN KNOWLEDGE; COMPARE MULTIPLE SOURCES - E.G. CLIMATE; MANUAL REVIEW OF FLAGGED DATA POINTS
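
The validation idea on slide 19 can be sketched as a range check against a second source. This is a minimal illustration, not the speaker's code: the `CLIMATE_RANGE_F` table, its values, the year 2014 on the timestamps, and the `flag_suspect` helper are all assumptions made up to mirror the Bakersfield/Duluth example on slide 18.

```python
from datetime import datetime

# Hypothetical plausible air-temperature ranges (°F) per (station, month).
# In practice these would come from a second source such as climate normals.
CLIMATE_RANGE_F = {
    ("BAKERSFIELD, CA", 7): (60.0, 115.0),
    ("DULUTH, MN", 1): (-40.0, 45.0),
}


def flag_suspect(readings):
    """Return readings whose temperature falls outside the expected range.

    Readings with no known range are flagged too, so they reach manual review.
    """
    flagged = []
    for station, timestamp, temp_f in readings:
        expected = CLIMATE_RANGE_F.get((station, timestamp.month))
        if expected is None or not (expected[0] <= temp_f <= expected[1]):
            flagged.append((station, timestamp, temp_f))
    return flagged


# The same 105°F reading is plausible in one place and time, not in the other.
readings = [
    ("BAKERSFIELD, CA", datetime(2014, 7, 17, 15, 0), 105.0),
    ("DULUTH, MN", datetime(2014, 1, 12, 5, 0), 105.0),
]
suspect = flag_suspect(readings)
```

Only the Duluth reading ends up in `suspect`; flagged points would then go to the manual review the slide mentions rather than being dropped automatically.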

  20. JOIN

  21. HOW? DOMAIN SPECIFIC. WEATHER STATION A: LAT: 39.36; LON: -74.45; TIME: 2014-07-09 13:04:00; AIR TEMP: 74°F; ELEVATION: 30FT. WEATHER STATION B: LAT: 39.35; LON: -74.44; TIME: 2014-07-09 13:00:00; AIR TEMP: 60°F; ELEVATION: 120FT
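
One way to make the join on slide 21 domain specific is to match on more than coordinates: two stations less than a mile apart can still disagree by 14°F when their elevations differ. The sketch below is an assumption about how such a join might look, not the method used in the talk. The `match_station` helper, its thresholds, and the user `event` are hypothetical; only the two station records come from the slide.

```python
import math
from datetime import datetime, timedelta

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in miles."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def match_station(event, observations, max_miles=5.0,
                  max_gap=timedelta(minutes=30), max_elev_diff_ft=50.0):
    """Pick the nearest observation that is also close in time and elevation."""
    def distance(obs):
        return haversine_miles(event["lat"], event["lon"], obs["lat"], obs["lon"])

    candidates = [
        obs for obs in observations
        if abs(obs["time"] - event["time"]) <= max_gap
        and abs(obs["elevation_ft"] - event["elevation_ft"]) <= max_elev_diff_ft
        and distance(obs) <= max_miles
    ]
    return min(candidates, key=distance) if candidates else None


# Station records taken from slide 21.
observations = [
    {"station": "A", "lat": 39.36, "lon": -74.45, "elevation_ft": 30.0,
     "time": datetime(2014, 7, 9, 13, 4), "air_temp_f": 74.0},
    {"station": "B", "lat": 39.35, "lon": -74.44, "elevation_ft": 120.0,
     "time": datetime(2014, 7, 9, 13, 0), "air_temp_f": 60.0},
]
# Hypothetical user event located between the two stations, near sea level.
event = {"lat": 39.355, "lon": -74.445, "elevation_ft": 25.0,
         "time": datetime(2014, 7, 9, 13, 2)}
nearest = match_station(event, observations)
```

Here both stations pass the distance and time filters, and the elevation filter is what excludes station B. The thresholds are placeholders that would need tuning against real data.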

  22. COVERAGE: DO THE DATASETS INTERSECT ENOUGH? PLACES; TIMES; USERS
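
The coverage question on slide 22 can be checked before any modeling by measuring how many of your own records have a counterpart in the public dataset. A minimal sketch, assuming both datasets can be keyed by `(place, hour)`; the `coverage` helper, the key layout, and the sample values are hypothetical.

```python
def coverage(user_records, weather_keys):
    """Fraction of user records that have a weather observation for the same (place, hour)."""
    weather_keys = set(weather_keys)
    records = list(user_records)
    if not records:
        return 0.0
    matched = sum(1 for r in records if (r["place"], r["hour"]) in weather_keys)
    return matched / len(records)


# Made-up sample: four user records, three of which have matching weather data.
user_records = [
    {"user": "u1", "place": "place_a", "hour": "2014-07-09T13"},
    {"user": "u2", "place": "place_a", "hour": "2014-07-09T13"},
    {"user": "u3", "place": "place_b", "hour": "2014-07-09T13"},
    {"user": "u4", "place": "place_c", "hour": "2014-07-09T14"},
]
weather_keys = [
    ("place_a", "2014-07-09T13"),
    ("place_b", "2014-07-09T13"),
]
share = coverage(user_records, weather_keys)
```

The same function can be applied per place, per time window, or per user to see where the intersection is too thin to support conclusions.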

  23. ISOLATE THE EFFECT

  24. CONFOUNDING VARIABLES: WHAT ELSE AFFECTS ACTIVITY? WEEKDAYS/WEEKENDS; DAYLIGHT; RAIN/SNOW
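
A simple way to keep a confounder such as weekday versus weekend from blurring the temperature effect is to stratify: compare steps across temperature buckets only within the same day type. The sketch below illustrates that idea under stated assumptions; `mean_steps_by_stratum`, the 10°F bucket width, and all step counts are made up for illustration and are not results from the talk.

```python
from collections import defaultdict
from datetime import date


def mean_steps_by_stratum(days, bucket_width_f=10):
    """Average daily steps per (day type, temperature bucket)."""
    totals = defaultdict(lambda: [0, 0])  # key -> [sum of steps, number of days]
    for d in days:
        day_type = "weekend" if d["date"].weekday() >= 5 else "weekday"
        bucket = int(d["max_temp_f"] // bucket_width_f) * bucket_width_f
        acc = totals[(day_type, bucket)]
        acc[0] += d["steps"]
        acc[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


# Made-up sample days; step counts are illustrative only.
days = [
    {"date": date(2014, 7, 9), "max_temp_f": 86, "steps": 6000},   # Wednesday
    {"date": date(2014, 7, 10), "max_temp_f": 84, "steps": 7000},  # Thursday
    {"date": date(2014, 7, 12), "max_temp_f": 88, "steps": 9000},  # Saturday
    {"date": date(2014, 7, 13), "max_temp_f": 71, "steps": 5000},  # Sunday
]
strata = mean_steps_by_stratum(days)
```

Daylight and rain/snow could be added as further stratification keys in the same way, at the cost of fewer observations per stratum.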

  25. REDSHIFT VS SPARK

  26. AMAZON REDSHIFT: RELATIONAL ANALYTICAL DATABASE BY AMAZON; COMPLEX QUERIES ON LARGE DATASETS IN SECONDS; SQL INTERFACE (POSTGRES); MANAGED CLUSTER

  27. EXAMPLE: DAYLIGHT REDSHIFT PYTHON
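
The code shown on the daylight example slides is not preserved in this transcript. As a stand-in, here is a minimal plain-Python sketch of how daylight hours could be estimated from latitude and day of year, using a common textbook approximation for solar declination. It ignores atmospheric refraction and the equation of time, and the function name `daylight_hours` is hypothetical.

```python
import math


def daylight_hours(latitude_deg, day_of_year):
    """Approximate hours between sunrise and sunset at a given latitude."""
    # Approximate solar declination in degrees (about +23.44 at the June solstice).
    declination = -23.44 * math.cos(math.radians(360.0 / 365.0 * (day_of_year + 10)))
    # Cosine of the sunrise/sunset hour angle.
    x = -math.tan(math.radians(latitude_deg)) * math.tan(math.radians(declination))
    # Clamp to [-1, 1]: beyond that range the sun never sets or never rises.
    x = max(-1.0, min(1.0, x))
    hour_angle_deg = math.degrees(math.acos(x))
    # The Earth rotates 15 degrees per hour; daylight spans twice the hour angle.
    return 2.0 * hour_angle_deg / 15.0


# Latitude of weather station A from slide 21; July 9 is day 190 of 2014.
hours = daylight_hours(39.36, 190)
```

A per-row function like this is the kind of logic that can be applied either after pulling query results into Python or inside a Spark transformation, which is one reason daylight makes a convenient comparison case.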

  28. SPARK: IN-MEMORY DATA PROCESSING FRAMEWORK; MODELS COMPUTATION AS A GRAPH OF RDDS (RESILIENT DISTRIBUTED DATASETS); FUNCTIONAL PROGRAMMING MODEL (SCALA, PYTHON); SQL; CAN READ FROM SAME SOURCES AS HADOOP

  29. EXAMPLE: DAYLIGHT SPARK

  30. SILVER BULLET? PICK YOUR OWN ADVENTURE. SPARK / REDSHIFT: PROGRAMMER-FRIENDLY; EASY TO SHARE DATA WITH NON-DEVELOPERS; END-TO-END SOLUTION; MANAGED - EASY SCALING; SELF-DOCUMENTING

  31. WHAT DID WE FIND?

  32. IDEAL TEMP FOR MOVEMENT [CHART: DAILY STEPS VS. MAX TEMP (°F)]

  33. AND NOW BY STATE… [CHART: DAILY STEPS VS. MAX TEMP (°F)]

  34. HOURLY STEPS BY AIR TEMP: WEEKENDS

  35. LESS CHOICE = SMALLER EFFECT: WEEKDAYS

  36. DATA FUSION: POWERFUL BUT HARD; DATA IS NOISY; DOMAIN UNDERSTANDING IS KEY

  37. THANK YOU! @EUGMANDEL WWW.LINKEDIN.COM/IN/EUGENEMANDEL
