
Spark Code Camp: Discover Spark Streaming & Spark SQL Project



  1. Spark Code Camp Discover Spark Streaming & Spark SQL

  2. Project Overview
     ● Focus on Spark Streaming and Spark SQL
     ● Explored the Streaming API of Apache Spark on the Ukko cluster
       ○ Window-based stream content
       ○ Direct stream content
     ● Used the Twitter Streaming API as the data source
     ● Aim: collect tweet data and analyse it
       ○ Find popular hashtags
       ○ Discover tweet frequency per location
       ○ Discover tweeting trends over time

  3. Open-Source Stack

  4. APIs Stack
     ● Spark Core & Streaming
       ○ "org.apache.spark" %% "spark-core" % "1.0.2" % "provided"
       ○ "org.apache.spark" %% "spark-streaming" % "1.0.2" % "provided"
     ● Twitter4j & Twitter Stream
       ○ "org.twitter4j" % "twitter4j-core" % "3.0.3"
       ○ "org.twitter4j" % "twitter4j-stream" % "3.0.3"
       ○ "org.apache.spark" %% "spark-streaming-twitter" % "1.0.2" % "provided"
     ● Akka
       ○ "com.typesafe.akka" % "akka-actor_2.10" % "2.2-M1"
     ● Socko
       ○ "org.mashupbots.socko" % "socko-webserver_2.10" % "0.4.2"
     ● Spark SQL
       ○ "org.apache.spark" %% "spark-sql" % "1.0.0" % "provided"
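The coordinates on this slide would normally sit inside a `libraryDependencies` setting. The fragment below is only a sketch of how such a `build.sbt` might look: the project name and the exact `scalaVersion` are assumptions (the `_2.10` suffixes suggest a Scala 2.10 build), and every version number is copied from the slide as listed, not verified.

```scala
// build.sbt (sketch; project name and scalaVersion are assumptions)
name := "spark-code-camp"

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  // Spark Core & Streaming ("provided": supplied by the cluster at run time)
  "org.apache.spark" %% "spark-core" % "1.0.2" % "provided",
  "org.apache.spark" %% "spark-streaming" % "1.0.2" % "provided",
  // Twitter4j & Twitter Stream
  "org.twitter4j" % "twitter4j-core" % "3.0.3",
  "org.twitter4j" % "twitter4j-stream" % "3.0.3",
  "org.apache.spark" %% "spark-streaming-twitter" % "1.0.2" % "provided",
  // Akka
  "com.typesafe.akka" % "akka-actor_2.10" % "2.2-M1",
  // Socko
  "org.mashupbots.socko" % "socko-webserver_2.10" % "0.4.2",
  // Spark SQL
  "org.apache.spark" %% "spark-sql" % "1.0.0" % "provided"
)
```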

  5. Results
     ● Discovered the most popular hashtags in the last n seconds using sliding-window streaming
     ● Dynamic graph plotting with live feeds from Twitter stream content
     ● Generated a dataset of tweets in text files and in Spark SQL tables
       ○ One million tweets collected
     ● Used Spark SQL to analyse the tweet dataset
     ● Used actor-based interaction between stream content and the web server
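The slides do not include the team's code, so the following is only a minimal sketch of how "most popular hashtags in the last n seconds" could be computed with the Spark Streaming 1.0.x API. The object name `PopularHashtags`, the helper `extractHashtags`, the 2-second batch interval, and the 60-second window sliding every 10 seconds are all assumptions chosen for illustration. It also assumes the twitter4j OAuth credentials are supplied through the usual `twitter4j.oauth.*` system properties.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext._                 // sortByKey on pair RDDs (Spark 1.0.x)
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._   // pair DStream operations (Spark 1.0.x)
import org.apache.spark.streaming.twitter.TwitterUtils

object PopularHashtags {

  // Hypothetical helper: whitespace-separated tokens that start with '#', lowercased.
  def extractHashtags(text: String): Seq[String] =
    text.split("\\s+")
      .filter(w => w.startsWith("#") && w.length > 1)
      .map(_.toLowerCase)
      .toList

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("PopularHashtags")
    val ssc = new StreamingContext(conf, Seconds(2))   // assumed batch interval

    // None: let twitter4j read OAuth credentials from system properties.
    val tweets = TwitterUtils.createStream(ssc, None)

    // Count each hashtag over the last 60 seconds, recomputed every 10 seconds.
    val counts = tweets
      .flatMap(status => extractHashtags(status.getText))
      .map(tag => (tag, 1))
      .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(60), Seconds(10))

    // Sort by count, descending, and print the ten most frequent hashtags.
    val ranked = counts
      .map { case (tag, n) => (n, tag) }
      .transform(_.sortByKey(ascending = false))

    ranked.foreachRDD { rdd =>
      rdd.take(10).foreach { case (n, tag) => println(s"$tag: $n") }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Because the Spark artifacts are marked "provided" in the dependency list, a job like this would typically be packaged and launched with `spark-submit`. The window and slide durations must be multiples of the batch interval, which is why 60 and 10 seconds pair with a 2-second batch here.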

  6. Challenges & Learning
     ● Exploring the Streaming API
       ○ Few tutorials available for streaming in Spark
       ○ Few streaming sources: Twitter or other?
     ● Build environment
       ○ Maven or SBT
     ● Stack selection based on learning curve
       ○ Short time to explore and experiment with different open-source software stacks
       ○ Decision challenges
         ■ Scala-based framework: Akka or Play?
         ■ Web server: Socko or another HTTP web server?
         ■ Graph: Chart.js or other chart libraries?
         ■ Storage: file system, Hive, Shark, or Spark SQL?
     ● Stream handling
       ○ Which attributes of a Twitter status (a user tweet == status) are useful?
       ○ What is possible with a huge stream of data?

  7. References
     ● http://sockoweb.org/
     ● https://github.com/mashupbots/socko
     ● http://akka.io/
     ● https://spark.apache.org/streaming/
     ● http://www.chartjs.org/

  8. Team Members
     ● Maninder Pal Singh
     ● Ayesha Ahmad
     ● Md. Mesbahul Islam
