
Twitter Data Processing with MongoDB By Ama & Sameera



  1. Twitter Data Processing with MongoDB By Ama & Sameera

  2. Introduction
     • Create a Twitter developer account
     • Get an access key
     • Access the REST API
     • Execute some POST and GET queries
     • Download a sample of Twitter streaming data
     • Analyze a single object: a tweet (JSON format)
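
The last step, analyzing a single tweet, can be sketched in JavaScript as follows. The sample object is hypothetical and heavily trimmed: only `text` and `user.time_zone` appear in the queries on the later slides, and the remaining fields are assumptions about the payload shape. `summarizeTweet` is an illustrative helper name.

```javascript
// Hypothetical, trimmed-down tweet; a real streaming payload has many more fields.
const rawTweet = JSON.stringify({
  id_str: "1", // assumed field, for illustration only
  text: "Happy Sunday from Paris!",
  user: { screen_name: "example_user", time_zone: "Paris" },
});

// Parse the JSON text and pull out the fields the later queries rely on.
function summarizeTweet(json) {
  const tweet = JSON.parse(json);
  // user or user.time_zone may be absent, so fall back to null explicitly.
  const hasZone = tweet.user && tweet.user.time_zone != null;
  return {
    text: tweet.text,
    timeZone: hasZone ? tweet.user.time_zone : null,
  };
}

const tweetSummary = summarizeTweet(rawTweet);
console.log(tweetSummary);
```

Pasting the same JSON into a viewer such as the one linked on slide 10 gives a comparable view of the nested structure.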

  3. Running Hadoop

  4. Twitter Application

  5. Flume configuration

  6. Flume- data streaming

  7. Hadoop File System

  8. Running MongoDB services

  9. Twitter data import

  10. Data Structure
      • http://www.jsoneditoronline.org/

  11. Data Mining

  12. Tweets Per Topic
      db.finaltwitterdata.aggregate([
        { $match: { $or: [ { 'text': { $regex: ".*Sunday.*" } }, { 'text': { $regex: ".*sunday.*" } } ] } },
        { $group: { _id: null, count: { $sum: 1 } } }
      ])
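
The two-branch `$or` in the query above exists only to cover two capitalizations. One alternative, sketched here under the assumption that any capitalization should count, is a single `$regex` with the `i` option, which MongoDB supports for case-insensitive matching. `countByKeyword` is a hypothetical helper name; it only builds the pipeline and needs no database connection.

```javascript
// Hypothetical helper: build a pipeline that counts tweets mentioning a keyword.
// $options: "i" makes the match case-insensitive, so "Sunday", "sunday" and
// "SUNDAY" are all counted (a superset of the two-pattern $or on the slide).
// An unanchored regex already matches anywhere in the string, so the
// surrounding ".*" from the slide is not needed.
function countByKeyword(keyword) {
  return [
    { $match: { text: { $regex: keyword, $options: "i" } } },
    { $group: { _id: null, count: { $sum: 1 } } },
  ];
}

const sundayPipeline = countByKeyword("Sunday");
console.log(JSON.stringify(sundayPipeline, null, 2));
// In the mongo shell this could be run as:
//   db.finaltwitterdata.aggregate(countByKeyword("Sunday"))
```

Because the match is broader than the original, the resulting count may be higher than the one reported on the slide.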

  13. Tweets vs. Time-Zone: Paris
      db.finaltwitterdata.aggregate([
        { $match: { $or: [ { 'text': { $regex: ".*Paris.*" } }, { 'text': { $regex: ".*paris.*" } } ] } },
        { $group: { _id: "$user.time_zone", count: { $sum: 1 } } },
        { $sort: { count: -1 } }
      ])
      [Chart: tweet count per time zone; y-axis ticks from 0 to 8000]

  14. Tweets vs. Time-Zone: Thanksgiving
      db.finaltwitterdata.aggregate([
        { $match: { $or: [ { 'text': { $regex: ".*Thanksgiving.*" } }, { 'text': { $regex: ".*thanksgiving.*" } } ] } },
        { $group: { _id: "$user.time_zone", count: { $sum: 1 } } },
        { $sort: { count: -1 } }
      ])
      [Chart: tweet count per time zone; y-axis ticks from 0 to 6000]
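
To make clear what the `$match`, `$group` and `$sort` stages in the two time-zone queries compute, here is a plain-JavaScript sketch of the same logic over a small in-memory array. The sample documents and the `countByTimeZone` name are hypothetical, and the match is case-insensitive, which is slightly broader than the two capitalizations used on the slides.

```javascript
// Hypothetical sample documents standing in for the finaltwitterdata collection.
const sampleTweets = [
  { text: "I love Paris", user: { time_zone: "Paris" } },
  { text: "paris in the fall", user: { time_zone: "Eastern Time (US & Canada)" } },
  { text: "Flying to Paris", user: { time_zone: "Paris" } },
  { text: "Happy Thanksgiving", user: { time_zone: "Paris" } },
];

// In-memory equivalent of: $match on text, $group by user.time_zone,
// $sort by count descending.
function countByTimeZone(tweets, keyword) {
  const pattern = new RegExp(keyword, "i");
  const counts = new Map();
  for (const tweet of tweets) {
    if (!pattern.test(tweet.text)) continue;
    // Tweets without a time zone are grouped under null, as $group would do.
    const hasZone = tweet.user && tweet.user.time_zone != null;
    const zone = hasZone ? tweet.user.time_zone : null;
    counts.set(zone, (counts.get(zone) || 0) + 1);
  }
  return [...counts.entries()]
    .map(([zone, count]) => ({ _id: zone, count }))
    .sort((a, b) => b.count - a.count);
}

const parisByZone = countByTimeZone(sampleTweets, "Paris");
console.log(parisByZone);
```

With the sample data, three tweets match "Paris": two from the "Paris" time zone and one from "Eastern Time (US & Canada)", so "Paris" sorts first.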

  15. American Music Awards (AMA) 2015

  16. AMA: Artist of the year
      db.finaltwitterdata.aggregate([
        { $match: { $or: [ { 'text': { $regex: ".*Nicky Minaj.*" } }, { 'text': { $regex: ".*@NICKYMINAJ.*" } }, { 'text': { $regex: ".*nicky minaj.*" } } ] } },
        { $group: { _id: null, count: { $sum: 1 } } }
      ])

  17. AMA: Performances
      db.finaltwitterdata.aggregate([
        { $match: { $or: [ { 'text': { $regex: ".*5SOS.*" } }, { 'text': { $regex: ".*5 Seconds Of Summer.*" } }, { 'text': { $regex: ".*5 Seconds of Summer.*" } }, { 'text': { $regex: ".*5 seconds of summer.*" } } ] } },
        { $group: { _id: null, count: { $sum: 1 } } }
      ])
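
The AMA queries list one `$or` branch per spelling of a name. A possible way to keep such queries short, sketched here as an assumption rather than as the presenters' method, is to join the aliases into one case-insensitive alternation. `escapeRegex` and `countByAliases` are hypothetical helper names; the code only builds the pipeline.

```javascript
// Escape regex metacharacters so each alias is matched literally.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: count tweets that mention any of the given aliases.
// The aliases become one alternation ("a|b") matched case-insensitively,
// which is broader than listing a few fixed capitalizations.
function countByAliases(aliases) {
  const pattern = aliases.map(escapeRegex).join("|");
  return [
    { $match: { text: { $regex: pattern, $options: "i" } } },
    { $group: { _id: null, count: { $sum: 1 } } },
  ];
}

const fiveSosPipeline = countByAliases(["5SOS", "5 Seconds of Summer"]);
console.log(JSON.stringify(fiveSosPipeline, null, 2));
// In the mongo shell this could be run as:
//   db.finaltwitterdata.aggregate(countByAliases(["5SOS", "5 Seconds of Summer"]))
```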

  18. AMA: Favorite Electronic Dance Music Artist

  19. Research Paper: "Fast Data in the Era of Big Data: Twitter’s Real-Time Related Query Suggestion Architecture"

  20. Introduction
      • After significant breaking news events, Twitter aims to provide relevant results within minutes, typically within ten minutes.
      • Related query suggestion is a feature that most searchers are likely familiar with, e.g. typing “Obama”
      • Two systems were built to achieve this target, but only one was eventually deployed:
        • The first implementation was based on a typical Hadoop-based analytics stack.
        • The second implementation, which was eventually deployed, is a custom in-memory processing engine.

  21. Problem definition
      • "Search assistance" at Twitter
      • The Twitter context introduces a real-time "twist"
      • At Twitter, search assistance must be provided in real time and must dynamically adapt to the rapidly evolving "global conversation".
      • The architecture considers three aspects of data (volume, velocity, and variety) and addresses the challenges of real-time data processing in the era of "big data".

  22. First approach: Hadoop
      • The first solution sought to take advantage of Twitter's existing analytics platform: Hadoop.
      • Its Hadoop platform incorporates components such as Pig, HBase, ZooKeeper, and Vertica.
      • Data is written to the Hadoop Distributed File System (HDFS) via a number of real-time and batch processes.
      • Instead of writing Hadoop code directly in Java, analytics at Twitter is performed mostly using Pig.

  23. Hadoop Platform

  24. Disadvantages
      • Although the system produced reasonable output, its latency was estimated in hours.
      • This is far from the targeted 10 minutes.
      • The latency is primarily attributed to:
        • The data import pipeline, which moves data from tens of thousands of production hosts onto HDFS
        • MapReduce jobs

  25. New approach: In-memory processing engine

  26. New approach: Search Assistance Engine
      The search assistance engine consists of:
      • A lightweight frontend serving requests from an in-memory cache
      • A backend that consumes the firehose and query hose to compute related query suggestions and spelling corrections

  27. Dataflow
      The query path: as a query from a given user is delivered through the query hose, the following actions are taken:
      • Query statistics are updated in the query statistics store
      • The query is added to the sessions store
      • For each previous query in the session, a query co-occurrence is formed with the new query
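
The three actions on the query path can be sketched with plain in-memory maps. This is a minimal illustration under stated assumptions, not the engine described in the paper: the store names, the `handleQuery` function, and the pair-key encoding are all hypothetical, and concerns the slides do not detail (session expiry, ranking, persistence) are left out.

```javascript
// In-memory stand-ins for the three stores; names are illustrative only.
const queryStats = new Map();    // query -> number of times seen
const sessions = new Map();      // userId -> queries issued in the session so far
const cooccurrences = new Map(); // JSON pair [previous, current] -> count

function handleQuery(userId, query) {
  // 1. Update the query statistics store.
  queryStats.set(query, (queryStats.get(query) || 0) + 1);

  // 2. Add the query to the sessions store, remembering what came before it.
  const previous = sessions.get(userId) || [];
  sessions.set(userId, [...previous, query]);

  // 3. Form a co-occurrence between each previous query and the new one.
  for (const earlier of previous) {
    const key = JSON.stringify([earlier, query]);
    cooccurrences.set(key, (cooccurrences.get(key) || 0) + 1);
  }
}

// Usage: one hypothetical user issues two queries in the same session.
handleQuery("user-1", "obama");
handleQuery("user-1", "white house");
console.log([...cooccurrences.entries()]);
```

After the two calls, the only recorded pair is ("obama", "white house"), which is the kind of signal a related-query suggestion could be derived from.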

  28. Conclusion
      • Although the authors found the experience instructive, they hope that future system designers can benefit from their story and build the right solution the first time.
      • It would be desirable to build a generic data processing platform capable of handling both “big data” and “fast data”.

  29. Thank you ☺

  30. Questions?
