

  1. Storm Distributed and fault-tolerant realtime computation Nathan Marz Twitter

  2. Basic info • Open sourced September 19th, 2011 • Implementation is 15,000 lines of code • Used by over 25 companies • >2700 watchers on GitHub (most watched JVM project) • Very active mailing list: >2300 messages, >670 members

  3. Before Storm: queues and workers

  4. Example (simplified)

  5. Example Workers schemify tweets and append to Hadoop

  6. Example Workers update statistics on URLs by incrementing counters in Cassandra

  7. Example Use mod/hashing to make sure same URL always goes to same worker

  8. Scaling: deploy, then reconfigure/redeploy

  9. Problems • Scaling is painful • Poor fault-tolerance • Coding is tedious

  10. What we want • Guaranteed data processing • Horizontal scalability • Fault-tolerance • No intermediate message brokers! • Higher level abstraction than message passing • “Just works”

  11. Storm • Guaranteed data processing • Horizontal scalability • Fault-tolerance • No intermediate message brokers! • Higher level abstraction than message passing • “Just works”

  12. Use cases • Stream processing • Continuous computation • Distributed RPC

  13. Storm Cluster

  14. Storm Cluster Master node runs “Nimbus” (similar to Hadoop JobTracker)

  15. Storm Cluster ZooKeeper, used for cluster coordination

  16. Storm Cluster Worker nodes run the worker processes

  17. Starting a topology

  18. Killing a topology

  19. Concepts • Streams • Spouts • Bolts • Topologies

  20. Streams An unbounded sequence of tuples

  21. Spouts Source of streams

  22. Spout examples • Read from Kestrel queue • Read from Twitter streaming API
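
The transcript omits the talk's code screenshots. As a minimal sketch of what a spout looks like, assuming the backtype.storm package names of the 0.x era; the class name and hard-coded sentences are illustrative, not the talk's actual code:

```java
import java.util.Map;
import java.util.Random;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;

// Minimal spout: emits a random sentence each time Storm calls nextTuple().
public class RandomSentenceSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private Random rand;
    private static final String[] SENTENCES = {
        "the cow jumped over the moon",
        "an apple a day keeps the doctor away"
    };

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
        this.rand = new Random();
    }

    @Override
    public void nextTuple() {
        // Storm calls this in a loop; emit one tuple per call.
        collector.emit(new Values(SENTENCES[rand.nextInt(SENTENCES.length)]));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("sentence"));
    }
}
```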

  23. Bolts Process input streams and produce new streams

  24. Bolts • Functions • Filters • Aggregation • Joins • Talk to databases
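
A matching sketch of a simple “function” bolt, again assuming the backtype.storm API; ExclamationBolt is illustrative, not from the talk:

```java
import java.util.Map;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// A simple "function" bolt: appends "!!!" to the first field of each input tuple.
public class ExclamationBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple tuple) {
        // Anchoring the output to the input adds an edge to the tuple tree (slide 57).
        collector.emit(tuple, new Values(tuple.getString(0) + "!!!"));
        collector.ack(tuple);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}
```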

  25. Topology Network of spouts and bolts

  26. Tasks Each spout and bolt executes as many tasks across the cluster

  27. Task execution Tasks are spread across the cluster

  28. Task execution Tasks are spread across the cluster

  29. Stream grouping When a tuple is emitted, which task does it go to?

  30. Stream grouping • Shuffle grouping: pick a random task • Fields grouping: mod hashing on a subset of tuple fields • All grouping: send to all tasks • Global grouping: pick task with lowest id
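
A sketch of how each of these groupings is declared when wiring a topology; UrlSpout and MyBolt are hypothetical stand-ins:

```java
// Fragment: assumes imports of TopologyBuilder and Fields from backtype.storm.
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("urls", new UrlSpout(), 4);  // UrlSpout is hypothetical
builder.setBolt("a", new MyBolt(), 8).shuffleGrouping("urls");                    // random task
builder.setBolt("b", new MyBolt(), 8).fieldsGrouping("urls", new Fields("url"));  // hash on "url"
builder.setBolt("c", new MyBolt(), 8).allGrouping("urls");                        // every task
builder.setBolt("d", new MyBolt(), 8).globalGrouping("urls");                     // lowest-id task
```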

  31. Topology (diagram) Edges are labeled with their groupings: fields groupings on [“id1”, “id2”] and [“url”], several shuffle groupings, and an all grouping

  32. Streaming word count TopologyBuilder is used to construct topologies in Java

  33. Streaming word count Define a spout in the topology with parallelism of 5 tasks

  34. Streaming word count Split sentences into words with parallelism of 8 tasks

  35. Streaming word count Consumer decides what data it receives and how it gets grouped

  36. Streaming word count Create a word count stream
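
Slides 32 through 36 annotate a code screenshot missing from this transcript. The wiring plausibly looked like the following, modeled on storm-starter's word-count example; WordCount is assumed to be a counting bolt like the one in storm-starter:

```java
TopologyBuilder builder = new TopologyBuilder();

// Slide 33: define the spout with parallelism of 5 tasks.
builder.setSpout("sentences", new RandomSentenceSpout(), 5);

// Slides 34-35: split sentences into words with 8 tasks; the consumer (this bolt)
// decides the grouping, here a shuffle grouping.
builder.setBolt("split", new SplitSentence(), 8)
       .shuffleGrouping("sentences");

// Slide 36: count words; fields grouping on "word" so the same word
// always goes to the same task.
builder.setBolt("count", new WordCount(), 12)
       .fieldsGrouping("split", new Fields("word"));
```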

  37. Streaming word count splitsentence.py
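
splitsentence.py is a bolt whose logic is written in Python and run through Storm's multilang protocol. On the Java side such a bolt is declared by subclassing ShellBolt; this sketch mirrors the storm-starter version:

```java
import java.util.Map;
import backtype.storm.task.ShellBolt;
import backtype.storm.topology.IRichBolt;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.tuple.Fields;

// Java-side declaration of a bolt implemented in splitsentence.py.
public class SplitSentence extends ShellBolt implements IRichBolt {
    public SplitSentence() {
        super("python", "splitsentence.py");
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }

    @Override
    public Map<String, Object> getComponentConfiguration() {
        return null;
    }
}
```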

  38. Streaming word count

  39. Streaming word count Submitting topology to a cluster

  40. Streaming word count Running topology in local mode
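
A sketch contrasting the two submission paths from slides 39 and 40; the worker count and topology name are illustrative:

```java
// Fragment: inside a main method declared `throws Exception`.
Config conf = new Config();
conf.setNumWorkers(4);  // worker processes to allocate for this topology

if (args.length > 0) {
    // Slide 39: submit to a real cluster under the name given on the command line.
    StormSubmitter.submitTopology(args[0], conf, builder.createTopology());
} else {
    // Slide 40: local mode simulates a cluster in-process, for development.
    LocalCluster cluster = new LocalCluster();
    cluster.submitTopology("word-count", conf, builder.createTopology());
    Thread.sleep(10000);  // let it run briefly
    cluster.shutdown();
}
```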

  41. Demo

  42. Distributed RPC Data flow for Distributed RPC

  43. DRPC Example Computing “reach” of a URL on the fly
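
For context, a sketch of what the client side of such a DRPC call looks like; host, port, and URL are placeholders:

```java
import backtype.storm.utils.DRPCClient;

// Fragment: DRPCClient.execute throws thrift exceptions, propagated by the caller.
DRPCClient client = new DRPCClient("drpc-host.example.com", 3772);
String reach = client.execute("reach", "http://example.com/some-url");
System.out.println("reach: " + reach);
```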

  44. Reach Reach is the number of unique people exposed to a URL on Twitter

  45. Computing reach (pipeline diagram) URL -> tweeters -> followers -> distinct followers -> count -> reach

  46. Reach topology

  47. Reach topology

  48. Reach topology

  49. Reach topology Keep set of followers for each request id in memory

  50. Reach topology Update followers set when receive a new follower

  51. Reach topology Emit partial count after receiving all followers for a request id
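
A sketch of the bolt slides 49 through 51 describe, loosely following the later storm-starter ReachTopology; it relies on Storm's batch-bolt coordination to know when all followers for a request id have arrived:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import backtype.storm.coordination.BatchOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBatchBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Slide 49: keep the set of followers for one request id in memory.
public class PartialUniquer extends BaseBatchBolt {
    private BatchOutputCollector collector;
    private Object id;  // the request id this instance is bound to
    private final Set<String> followers = new HashSet<String>();

    @Override
    public void prepare(Map conf, TopologyContext context,
                        BatchOutputCollector collector, Object id) {
        this.collector = collector;
        this.id = id;
    }

    @Override
    public void execute(Tuple tuple) {
        // Slide 50: update the follower set as each follower arrives.
        followers.add(tuple.getString(1));
    }

    @Override
    public void finishBatch() {
        // Slide 51: all followers for this request id received; emit the partial count.
        collector.emit(new Values(id, followers.size()));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("id", "partial-count"));
    }
}
```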

  52. Demo

  53. Guaranteeing message processing “Tuple tree”

  54. Guaranteeing message processing • A spout tuple is not fully processed until all tuples in the tree have been completed

  55. Guaranteeing message processing • If the tuple tree is not completed within a specified timeout, the spout tuple is replayed

  56. Guaranteeing message processing Reliability API

  57. Guaranteeing message processing “Anchoring” creates a new edge in the tuple tree

  58. Guaranteeing message processing Marks a single node in the tree as complete
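
The two reliability-API calls these slides refer to, sketched inside a bolt's execute method; collector is the OutputCollector saved in prepare:

```java
public void execute(Tuple input) {
    for (String word : input.getString(0).split(" ")) {
        // Slide 57: passing the input tuple as the first argument "anchors" the
        // new tuple, creating a new edge in the tuple tree.
        collector.emit(input, new Values(word));
    }
    // Slide 58: ack marks this node of the tree as complete. On failure,
    // collector.fail(input) would trigger an immediate replay from the spout.
    collector.ack(input);
}
```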

  59. Guaranteeing message processing • Storm tracks tuple trees for you in an extremely efficient way

  60. Transactional topologies How do you do idempotent counting with an at-least-once delivery guarantee?

  61. Transactional topologies Won’t you overcount?

  62. Transactional topologies Transactional topologies solve this problem

  63. Transactional topologies Built completely on top of Storm’s primitives of streams, spouts, and bolts

  64. Transactional topologies Enables fault-tolerant, exactly-once messaging semantics

  65. Transactional topologies Process small batches of tuples (batch 1, batch 2, batch 3, ...)

  66. Transactional topologies If a batch fails, replay the whole batch

  67. Transactional topologies Once a batch is completed, commit the batch

  68. Transactional topologies Bolts can optionally be “committers”

  69. Transactional topologies Commits are ordered. If there’s a failure during commit, the whole batch plus commit is retried

  70. Example

  71. Example New instance of this object for every transaction attempt

  72. Example Aggregate the count for this batch

  73. Example Only update database if transaction ids differ

  74. Example This enables idempotency since commits are ordered

  75. Example (Credit goes to Kafka devs for this trick)
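
The update rule from slides 73 and 74, sketched against a hypothetical key-value store; the real code would talk to a database such as Cassandra, and the names here are illustrative:

```java
// A (count, txid) pair as stored per key.
class CountAndTxid {
    long count;
    long txid;
    CountAndTxid(long count, long txid) { this.count = count; this.txid = txid; }
}

// Hypothetical store interface standing in for the real database.
interface KeyValueStore {
    CountAndTxid get(String key);
    void put(String key, CountAndTxid value);
}

class TxidCounting {
    // Apply a batch's partial count exactly once. Because commits are ordered,
    // finding this batch's txid already stored means the batch was applied before.
    static void applyBatchCount(KeyValueStore db, String key, long batchCount, long txid) {
        CountAndTxid stored = db.get(key);
        if (stored == null) {
            db.put(key, new CountAndTxid(batchCount, txid));
        } else if (stored.txid != txid) {
            // First commit of this transaction for this key: add and record the txid.
            db.put(key, new CountAndTxid(stored.count + batchCount, txid));
        }
        // else: same txid, so this exact batch already updated the key; skipping
        // the update makes replays harmless (idempotent counting).
    }
}
```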

  76. Transactional topologies Multiple batches can be processed in parallel, but commits are guaranteed to be ordered

  77. Transactional topologies • Requires a more sophisticated source queue than Kestrel or RabbitMQ • storm-contrib has a transactional spout implementation for Kafka

  78. Storm UI

  79. Storm on EC2 https://github.com/nathanmarz/storm-deploy One-click deploy tool

  80. Starter code https://github.com/nathanmarz/storm-starter Example topologies

  81. Documentation

  82. Ecosystem • Scala, JRuby, and Clojure DSLs • Kestrel, Redis, AMQP, JMS, and other spout adapters • Multilang adapters • Cassandra, MongoDB integration

  83. Questions? http://github.com/nathanmarz/storm

  84. Future work • State spout • Storm on Mesos • “Swapping” • Auto-scaling • Higher level abstractions

  85. Implementation KafkaTransactionalSpout

  86. Implementation (diagram of the transactional spout subtopology; its edges use all groupings)

  87. Implementation TransactionalSpout is a subtopology consisting of a spout and a bolt

  88. Implementation The spout consists of one task that coordinates the transactions

  89. Implementation The bolt emits the batches of tuples

  90. Implementation The coordinator emits a “batch” stream and a “commit” stream

  91. Implementation Batch stream

  92. Implementation Commit stream

  93. Implementation The coordinator reuses the tuple-tree framework to detect success or failure of batches and commits, and replays appropriately
