Spark Code Camp: Discover Spark Streaming & Spark SQL
Project Overview
● Focus on Spark Streaming and Spark SQL
● Explored the Streaming API of Apache Spark on the Ukko cluster
  ○ Window-based stream content
  ○ Direct stream content
● Used the Twitter Streaming API as the data source
● Aim: collect tweet data and analyse it
  ○ Find popular hashtags
  ○ Discover tweet frequency per location
  ○ Discover tweeting trends over time
Open-Source Stack
APIs Stack
● Spark Core & Streaming
  ○ "org.apache.spark" %% "spark-core" % "1.0.2" % "provided"
  ○ "org.apache.spark" %% "spark-streaming" % "1.0.2" % "provided"
● Twitter4j & Twitter Stream
  ○ "org.twitter4j" % "twitter4j-core" % "3.0.3"
  ○ "org.twitter4j" % "twitter4j-stream" % "3.0.3"
  ○ "org.apache.spark" %% "spark-streaming-twitter" % "1.0.2" % "provided"
● Akka
  ○ "com.typesafe.akka" % "akka-actor_2.10" % "2.2-M1"
● Socko
  ○ "org.mashupbots.socko" % "socko-webserver_2.10" % "0.4.2"
● Spark SQL
  ○ "org.apache.spark" %% "spark-sql" % "1.0.0" % "provided"
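The coordinates above could be collected into a single build.sbt roughly as follows. This is only a sketch: the project name, the project version, and the Scala version (2.10.4, chosen to match the `_2.10` artifacts) are assumptions, while the dependency versions are copied unchanged from the list.

```scala
// build.sbt (sketch; name, version and scalaVersion are assumed, not taken from the slides)
name := "spark-code-camp"

version := "0.1"

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  // Spark Core, Streaming and SQL are "provided" by the cluster at runtime
  "org.apache.spark" %% "spark-core" % "1.0.2" % "provided",
  "org.apache.spark" %% "spark-streaming" % "1.0.2" % "provided",
  "org.apache.spark" %% "spark-streaming-twitter" % "1.0.2" % "provided",
  "org.apache.spark" %% "spark-sql" % "1.0.0" % "provided",
  // Twitter client libraries
  "org.twitter4j" % "twitter4j-core" % "3.0.3",
  "org.twitter4j" % "twitter4j-stream" % "3.0.3",
  // Actor system and web server
  "com.typesafe.akka" % "akka-actor_2.10" % "2.2-M1",
  "org.mashupbots.socko" % "socko-webserver_2.10" % "0.4.2"
)
```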
Results
● Discovered the most popular hashtags in the last n seconds using sliding-window streaming
● Dynamic graph plotting with live feeds from the Twitter stream content
● Generated a dataset of tweets in text files and in Spark SQL tables
  ○ One million tweets collected
● Used Spark SQL to analyse the tweet dataset
● Used actor-based interaction between the stream content and the web server
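The sliding-window hashtag count could look roughly like the sketch below, written against the Spark 1.0.x Streaming API and modelled on the popular-tags example that ships with Spark. It is not the team's actual code. The object name, the 2-second batch interval, the 60-second window, and the top-10 cutoff are all assumptions chosen for illustration.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext._                 // pair-RDD operations such as sortByKey (Spark 1.0.x)
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._   // pair-DStream operations such as reduceByKeyAndWindow
import org.apache.spark.streaming.twitter.TwitterUtils

object PopularHashtags {

  // Pure helper: whitespace-separated tokens that start with '#' and have at least one more character.
  def extractHashtags(text: String): List[String] =
    text.split("\\s+").filter(t => t.startsWith("#") && t.length > 1).toList

  def main(args: Array[String]): Unit = {
    val windowSeconds = 60 // assumed value for "n seconds"
    val conf = new SparkConf().setAppName("PopularHashtags")
    val ssc = new StreamingContext(conf, Seconds(2)) // assumed batch interval

    // Passing None lets twitter4j pick up OAuth credentials from its default
    // configuration (for example the twitter4j.oauth.* system properties).
    val tweets = TwitterUtils.createStream(ssc, None)

    // Count each hashtag over the last windowSeconds, then sort by count, descending.
    val counts = tweets
      .flatMap(status => extractHashtags(status.getText))
      .map(tag => (tag, 1))
      .reduceByKeyAndWindow(_ + _, Seconds(windowSeconds))
      .map { case (tag, count) => (count, tag) }
      .transform(_.sortByKey(false))

    // Print the ten most frequent hashtags of the current window on every batch.
    counts.foreachRDD { rdd =>
      rdd.take(10).foreach { case (count, tag) => println(s"$tag: $count") }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Because no slide duration is passed, the window advances by one batch interval, so the ranking is refreshed every 2 seconds in this sketch.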
Challenges & Learning
● Explored the Streaming API
  ○ Few tutorials available for exploring streaming in Spark
  ○ Few streaming sources: Twitter or other?
● Build environment
  ○ Maven or SBT?
● Stack selection based on learning curve
  ○ Short time to explore and experiment with different open-source software stacks
  ○ Decision challenges
    ■ Scala-based framework: Akka or Play?
    ■ Web server: Socko or another HTTP web server?
    ■ Graph: Chart.js or other chart libraries?
    ■ Storage: file system, Hive, Shark, or Spark SQL?
● Stream handling
  ○ Which attributes of a Twitter status (a user tweet == status) are useful?
  ○ What is possible with a huge stream of data?
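One possible answer to the attribute question, given the project aims (hashtags, location, trends over time), is to keep only the text, the user location, and the creation time of each status. The sketch below is plain Scala with no Spark dependency. `TweetRecord` and `TweetStats` are hypothetical names, and filling the record from twitter4j's `Status.getText`, `Status.getUser.getLocation`, and `Status.getCreatedAt` is an assumption about how it would be populated.

```scala
// Hypothetical minimal record: only the attributes needed for the three analyses.
case class TweetRecord(text: String, location: Option[String], createdAtMillis: Long)

object TweetStats {

  // Lower-cased hashtags found in the tweet text.
  def hashtags(t: TweetRecord): List[String] =
    t.text.split("\\s+")
      .filter(w => w.startsWith("#") && w.length > 1)
      .map(_.toLowerCase)
      .toList

  // Number of tweets per user-declared location; missing locations are grouped as "unknown".
  def countByLocation(ts: Seq[TweetRecord]): Map[String, Int] =
    ts.groupBy(_.location.getOrElse("unknown")).map { case (loc, group) => (loc, group.size) }

  // Number of tweets per hour of day (UTC), a coarse view of trends over time.
  def countByHour(ts: Seq[TweetRecord]): Map[Int, Int] =
    ts.groupBy(t => hourOfDayUtc(t.createdAtMillis)).map { case (hour, group) => (hour, group.size) }

  // Epoch milliseconds start at 00:00 UTC, so integer division by one hour, modulo 24, gives the UTC hour.
  private def hourOfDayUtc(millis: Long): Int = ((millis / 3600000L) % 24).toInt
}
```

Keeping the record this small makes it cheap to write to text files or load into a table, at the cost of discarding fields such as retweet counts that later questions might need.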
References
● http://sockoweb.org/
● https://github.com/mashupbots/socko
● http://akka.io/
● https://spark.apache.org/streaming/
● http://www.chartjs.org/
Team Members
● Maninder Pal Singh
● Ayesha Ahmad
● Md. Mesbahul Islam