Twitter Data Processing with MongoDB By Ama & Sameera
Introduction
• Create a Twitter developer account
• Get access keys
• Access the REST API
• Execute some POST and GET queries
• Download a sample of Twitter streaming data
• Analyze a single tweet object (JSON format)
Running Hadoop
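As a sketch, on a standard Hadoop 2.x installation the daemons are brought up with the stock scripts (assuming HADOOP_HOME/sbin is on the PATH):

start-dfs.sh     # starts NameNode, DataNodes, SecondaryNameNode
start-yarn.sh    # starts ResourceManager, NodeManagers
jps              # confirm the daemons are running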
Twitter Application
Flume configuration
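A minimal flume.conf sketch for streaming tweets into HDFS, assuming Apache Flume's bundled TwitterSource; the credential values, keywords, and HDFS path are placeholders to be filled in from the Twitter application:

TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = <consumer key>
TwitterAgent.sources.Twitter.consumerSecret = <consumer secret>
TwitterAgent.sources.Twitter.accessToken = <access token>
TwitterAgent.sources.Twitter.accessTokenSecret = <access token secret>
TwitterAgent.sources.Twitter.keywords = paris, thanksgiving, AMAs

TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:9000/user/flume/tweets/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream

TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000

The agent is then started with: flume-ng agent --conf conf --conf-file flume.conf --name TwitterAgent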
Flume: data streaming
Hadoop File System
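A sketch for verifying the streamed files on HDFS (the path is a placeholder matching the Flume sink sketch above):

hdfs dfs -ls /user/flume/tweets
hdfs dfs -cat /user/flume/tweets/FlumeData.* | head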
Running MongoDB services
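A minimal sketch of bringing the MongoDB services up (the dbpath is a placeholder):

mongod --dbpath /data/db    # start the server
mongo                       # open a shell; verify with "show dbs"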
Twitter data import
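A sketch of the import step, assuming the collected tweets were copied off HDFS into a local JSON file; the database and file names are placeholders, while finaltwitterdata is the collection queried below:

mongoimport --db twitter --collection finaltwitterdata --file tweets.json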
Data Structure
• http://www.jsoneditoronline.org/
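An abridged sketch of a single tweet object, trimmed to the fields the queries below rely on (all values here are invented for illustration):

{
  "created_at": "Sun Nov 22 20:30:00 +0000 2015",
  "id_str": "668550000000000000",
  "text": "Happy Sunday! #AMAs",
  "user": {
    "screen_name": "example_user",
    "time_zone": "Eastern Time (US & Canada)"
  },
  "entities": { "hashtags": [ { "text": "AMAs" } ] }
}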
Data Mining
Tweets Per Topic
db.finaltwitterdata.aggregate([
  { $match: { $or: [
      { text: { $regex: ".*Sunday.*" } },
      { text: { $regex: ".*sunday.*" } }
  ] } },
  { $group: { _id: null, count: { $sum: 1 } } }
])
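The per-spelling $or clauses above (and in the queries that follow) can be collapsed with MongoDB's standard case-insensitive regex option; a minimal sketch:

db.finaltwitterdata.aggregate([
  { $match: { text: { $regex: "Sunday", $options: "i" } } },  // matches Sunday, sunday, SUNDAY, ...
  { $group: { _id: null, count: { $sum: 1 } } }
])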
Tweets vs. Time-Zone: Paris
db.finaltwitterdata.aggregate([
  { $match: { $or: [
      { text: { $regex: ".*Paris.*" } },
      { text: { $regex: ".*paris.*" } }
  ] } },
  { $group: { _id: "$user.time_zone", count: { $sum: 1 } } },
  { $sort: { count: -1 } }
])
(Bar chart: tweet counts per time zone; y-axis 0 to 8000.)
Tweets vs. Time-Zone: Thanksgiving
db.finaltwitterdata.aggregate([
  { $match: { $or: [
      { text: { $regex: ".*Thanksgiving.*" } },
      { text: { $regex: ".*thanksgiving.*" } }
  ] } },
  { $group: { _id: "$user.time_zone", count: { $sum: 1 } } },
  { $sort: { count: -1 } }
])
(Bar chart: tweet counts per time zone; y-axis 0 to 6000.)
American Music Awards (AMA) 2015
AMA: Artist of the Year
db.finaltwitterdata.aggregate([
  { $match: { $or: [
      { text: { $regex: ".*Nicky Minaj.*" } },
      { text: { $regex: ".*@NICKYMINAJ.*" } },
      { text: { $regex: ".*nicky minaj.*" } }
  ] } },
  { $group: { _id: null, count: { $sum: 1 } } }
])
AMA: Performances
db.finaltwitterdata.aggregate([
  { $match: { $or: [
      { text: { $regex: ".*5SOS.*" } },
      { text: { $regex: ".*5 Seconds Of Summer.*" } },
      { text: { $regex: ".*5 Seconds of Summer.*" } },
      { text: { $regex: ".*5 seconds of summer.*" } }
  ] } },
  { $group: { _id: null, count: { $sum: 1 } } }
])
AMA: Favorite Electronic Dance Music Artist
Research Paper: "Fast Data in the Era of Big Data: Twitter's Real-Time Related Query Suggestion Architecture"
Introduction
• After significant breaking news events, Twitter aims to provide relevant results within minutes, with a target of roughly ten minutes.
• Related query suggestion is a feature most searchers are likely familiar with, e.g., after typing "Obama".
• Two systems were built to achieve this target, but only one was eventually deployed:
  • The first implementation was based on a typical Hadoop-based analytics stack.
  • The second implementation, which was eventually deployed, is a custom in-memory processing engine.
Problem Definition
• "Search assistance" @ Twitter
• The Twitter context introduces a real-time "twist".
• At Twitter, search assistance needs to be provided in real time and must dynamically adapt to the rapidly evolving "global conversation".
• The architecture considers three aspects of data (volume, velocity, and variety) and addresses the challenges of real-time data processing in the era of "big data".
First Approach: Hadoop
• The first solution sought to take advantage of Twitter's existing analytics platform: Hadoop.
• Incorporated into its Hadoop platform are components such as Pig, HBase, ZooKeeper, and Vertica.
• Data is written to the Hadoop Distributed File System (HDFS) via a number of real-time and batch processes.
• Instead of writing Hadoop code directly in Java, analytics at Twitter is performed mostly using Pig.
Hadoop Platform
Disadvantages
• Although the system worked reasonably well in terms of output, latency was measured in hours.
• This is far from the targeted ten minutes.
• The latency is primarily attributed to:
  • The data import pipeline moving data from tens of thousands of production hosts onto HDFS
  • MapReduce jobs
New approach: In-memory processing engine
New approach: Search Assistance Engine
The search assistance engine consists of:
• A lightweight frontend serving requests from an in-memory cache
• A backend that consumes the firehose and query hose to compute related query suggestions and spelling corrections
Dataflow
The query path: as a query from a given user is delivered through the query hose, the following actions are taken (see the sketch below):
• Query statistics are updated in the query statistics store.
• The query is added to the sessions store.
• For each previous query in the session, a query co-occurrence is formed with the new query.
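A minimal JavaScript sketch of this query path; all names here (queryStats, sessions, cooccurrences, onQuery) are hypothetical illustrations of the paper's description, not its actual code:

// Hypothetical in-memory stores, sketching the described dataflow.
const queryStats = new Map();      // query -> count
const sessions = new Map();        // userId -> list of recent queries
const cooccurrences = new Map();   // "q1|q2" -> count

function onQuery(userId, query) {
  // 1. Update query statistics.
  queryStats.set(query, (queryStats.get(query) || 0) + 1);

  // 2. Form a co-occurrence with each previous query in the session.
  const session = sessions.get(userId) || [];
  for (const prev of session) {
    const key = prev + "|" + query;
    cooccurrences.set(key, (cooccurrences.get(key) || 0) + 1);
  }

  // 3. Add the new query to the sessions store.
  session.push(query);
  sessions.set(userId, session);
}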
Conclusion
• The authors found the experience instructive, and hope that future system designers can benefit from their story and build the right solution the first time.
• It would be desirable to build a generic data processing platform capable of handling both "big data" and "fast data".
Thank you ☺
Questions?