Scalable Tools - Part II


  1. Scalable Tools - Part II Adisak Sukul, Ph.D., Lecturer, Department of Computer Science, adisak@iastate.edu http://web.cs.iastate.edu/~adisak/Bigdata/

  2. Scalable Tools session • We will be using Spark, Python and PySpark. • We will use Jupyter Notebook as the IDE.

  3. Install your software: create an Ubuntu VM in VirtualBox on your laptop • This lecture walks through how to download and set up VirtualBox with Ubuntu. • Then we will walk through installing Spark, Python, and the Jupyter Notebook on this VirtualBox Ubuntu VM.

  4. Option 1: host it locally on your laptop (recommended) • This option requires you to download large files. I will also have those files available locally to copy during the session. • 1. Download VirtualBox (for Windows and Mac): https://www.virtualbox.org/wiki/Downloads (108 MB for Windows, 91 MB for Mac). If you are using Linux, you don't need this. • 2. Download Ubuntu: https://www.ubuntu.com/download/desktop • Select Ubuntu Desktop 18.04 LTS (size 1.8 GB)

  5. Option 1 (continued) • 3. Install VirtualBox • 4. Create a VM using the Ubuntu .iso file • 5. Log in to Ubuntu • We will install PySpark in class

  6. VirtualBox (issue if you have Docker) • Only the 32-bit option is available? Disable Hyper-V in Windows Features, then restart Windows.

  7. Update your Ubuntu • sudo apt-get update • sudo apt-get upgrade

  8. Verify your Python 3 • python • python3
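
Once an interpreter starts, the same check can be made from inside Python. The snippet below is a minimal sketch; the helper name `is_python3` is hypothetical and only used for illustration.

```python
import sys


def is_python3(version_info=sys.version_info):
    """Return True when the given version tuple belongs to Python 3."""
    return version_info[0] == 3


# Print the full version string of the interpreter running this script.
print(sys.version)
print("Running Python 3:", is_python3())
```

If this prints `False` when started with `python`, start it with `python3` instead.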

  9. Install Jupyter with pip • First install pip3: • sudo apt install python3-pip

  10. Install Jupyter with pip3 • (You can also get the full Anaconda distribution instead.) • pip3 install jupyter

  11. Install the Java JDK • Spark runs on the Java Virtual Machine, so don't forget to install it (otherwise you will get weird errors). • sudo apt-get install openjdk-8-jdk • Or: sudo apt-get install default-jdk
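
To confirm that the JDK step worked before moving on, one option is to look for the `java` launcher from Python. This is a minimal sketch using only the standard library; the helper name `find_java` is hypothetical.

```python
import shutil
import subprocess


def find_java():
    """Return the full path of the `java` launcher, or None if it is not on PATH."""
    return shutil.which("java")


java_path = find_java()
if java_path is None:
    print("java not found on PATH; install a JDK before starting Spark")
else:
    # `java -version` normally writes its report to stderr, so capture both streams.
    result = subprocess.run(
        [java_path, "-version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    print(result.stderr.strip() or result.stdout.strip())
```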

  12. Start the Jupyter notebook • Type "jupyter notebook" • If it shows "command not found", then pip hasn't placed jupyter on your system path • Restart Ubuntu • Or run Jupyter this way: • ~/.local/bin/jupyter-notebook
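
The "command not found" case comes down to whether `~/.local/bin` is listed in `PATH`. The sketch below checks that from Python; the helper name `dir_on_path` is hypothetical.

```python
import os


def dir_on_path(directory, path_value):
    """Return True if `directory` is one of the entries in a PATH-style string."""
    target = os.path.normpath(os.path.expanduser(directory))
    entries = [
        os.path.normpath(os.path.expanduser(entry))
        for entry in path_value.split(os.pathsep)
        if entry
    ]
    return target in entries


user_bin = "~/.local/bin"  # the directory the slide runs jupyter-notebook from
on_path = dir_on_path(user_bin, os.environ.get("PATH", ""))
print(user_bin, "is on PATH:", on_path)
```

If this prints `False`, running `~/.local/bin/jupyter-notebook` directly, as the slide suggests, avoids the problem.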

  13. Install PySpark • PySpark is available on PyPI, but pip3 doesn't work!! Get pip with: sudo apt-get install python-pip • Then: pip install pyspark • Wow, it's super easy!! • If you wish to use conda: conda install -c conda-forge pyspark • Make sure you see the Spark logo. If not, it's a trap :P • http://sigdelta.com/blog/how-to-install-pyspark-locally/

  14. Verify your PySpark
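
The slide's screenshot is not part of this transcript. One way to verify the install, offered here as a minimal sketch and not necessarily what the slide showed, is to ask the interpreter whether it can find the `pyspark` package. The helper name `pyspark_installed` is hypothetical.

```python
import importlib.util


def pyspark_installed():
    """Return True if the interpreter running this script can locate pyspark."""
    return importlib.util.find_spec("pyspark") is not None


if pyspark_installed():
    import pyspark

    print("pyspark version:", pyspark.__version__)
else:
    print("pyspark was not found for this interpreter; revisit the install step")
```

The result depends on which interpreter runs the check, so run it with the same `python` or `python3` that the notebook uses.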

  15. Let's run your 1st Spark program
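
The program shown on this slide did not survive in the transcript. As a stand-in, here is a small first job, a sketch under stated assumptions: it runs Spark in local mode (`local[*]`) and sums the squares of 1..n. The names `square`, `sum_of_squares_local`, and `sum_of_squares_spark` are hypothetical.

```python
def square(x):
    """Square a number; this is the function Spark applies to each element."""
    return x * x


def sum_of_squares_local(n=10):
    """Plain-Python reference result, handy for checking the Spark answer."""
    return sum(square(x) for x in range(1, n + 1))


def sum_of_squares_spark(n=10):
    """Compute the same sum with an RDD on a local Spark session."""
    # Imported here so the rest of the file still loads without pyspark.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[*]")  # assumption: local mode
        .appName("FirstSparkProgram")
        .getOrCreate()
    )
    try:
        rdd = spark.sparkContext.parallelize(range(1, n + 1))
        return rdd.map(square).sum()
    finally:
        spark.stop()
```

Calling `sum_of_squares_spark(10)` in a notebook cell should give the same value as `sum_of_squares_local(10)` once PySpark and Java are installed.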

  16. Now, your PySpark exercise • Install Hadoop • Use Spark to count words from your favorite site • Example: we are using cs.iastate.edu

  17. Installing Hadoop • https://www.digitalocean.com/community/tutorials/how-to-install-hadoop-in-stand-alone-mode-on-ubuntu-16-04

  20. Run the MapReduce example

  21. Step 1: get the text from the website • You could use bs4 (BeautifulSoup4) to scrape the web (more elegant) • Or manually save the web page to a .txt file (less elegant :P)

  22. pip3 install bs4 • Python code to easily get and process text from a web page
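
With bs4 installed, text extraction could look roughly like the sketch below. It works on an HTML string so it can be tried without a network connection; the helper name `extract_text` and the sample HTML are made up for illustration.

```python
from bs4 import BeautifulSoup


def extract_text(html):
    """Return the visible text of an HTML document as one space-separated string."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop script and style elements so their contents do not count as words.
    for tag in soup(["script", "style"]):
        tag.decompose()
    return soup.get_text(separator=" ", strip=True)


# Made-up sample page, used only to show the call.
sample = (
    "<html><body><h1>Hello</h1><p>big <b>data</b></p>"
    "<script>var x = 1;</script></body></html>"
)
print(extract_text(sample))
```

For a real page, the HTML string would come from a download step such as `urllib.request.urlopen(url).read()`.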

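  23. Python 3 code to save a page's text to a file (needs: pip3 install html2text; url is left empty, as on the slide)

        import urllib.request  # Python 3 replacement for urllib2
        import html2text

        url = ''  # set this to the page you want to fetch
        page = urllib.request.urlopen(url)
        # read() returns bytes; UTF-8 is an assumption about the page encoding
        html_content = page.read().decode('utf-8', errors='replace')
        rendered_content = html2text.html2text(html_content)
        # the with-block closes the file even if the write fails
        with open('file_text.txt', 'w') as f:
            f.write(rendered_content)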

  24. Create a PySpark word count program • https://spark.apache.org/examples.html
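
A word count over the saved text could be sketched as follows. This is one possible version using the RDD API, not necessarily the one on the linked examples page. It assumes local mode (`local[*]`) and an input file such as `file_text.txt` from the previous step; the names `tokenize`, `count_words_local`, and `count_words_spark` are hypothetical.

```python
import re
from collections import Counter


def tokenize(line):
    """Lowercase a line and split it into runs of letters."""
    return re.findall(r"[a-z]+", line.lower())


def count_words_local(lines):
    """Plain-Python reference count, useful for checking small inputs."""
    return Counter(word for line in lines for word in tokenize(line))


def count_words_spark(path, top_n=10):
    """Return the top_n (word, count) pairs for a text file, using Spark RDDs."""
    # Imported here so the helpers above work even without pyspark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[*]")  # assumption: local mode
        .appName("WordCount")
        .getOrCreate()
    )
    try:
        counts = (
            spark.sparkContext.textFile(path)
            .flatMap(tokenize)                 # one record per word
            .map(lambda word: (word, 1))       # pair each word with a count of 1
            .reduceByKey(lambda a, b: a + b)   # add up the counts per word
        )
        # Sort by descending count and bring only the top entries to the driver.
        return counts.takeOrdered(top_n, key=lambda pair: -pair[1])
    finally:
        spark.stop()
```

In a notebook, `count_words_spark('file_text.txt')` would return a list of the most frequent words with their counts.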

  25. Thank you • Questions? • adisak@iastate.edu • http://web.cs.iastate.edu/~adisak/MBDS2018/
