Scalable Tools - Part II Adisak Sukul, Ph.D., Lecturer, Department of Computer Science, adisak@iastate.edu http://web.cs.iastate.edu/~adisak/Bigdata/
Scalable Tools session • We will be using Spark, Python, and PySpark. • We will use Jupyter Notebook as the IDE.
Install your software: create an Ubuntu VM in VirtualBox on your laptop • This lecture will walk through how to download and set up VirtualBox with Ubuntu. • Then we will walk through installing Spark, Python, and the Jupyter Notebook on this VirtualBox Ubuntu.
Option 1: host it locally on your laptop (recommended) • This option requires you to download large files. I will also have those files available locally to copy during the session. 1. Download VirtualBox (for Windows and Mac) • https://www.virtualbox.org/wiki/Downloads • (108 MB for Windows, 91 MB for Mac) • (If you are using Linux, you don't need this) 2. Download Ubuntu • https://www.ubuntu.com/download/desktop • Select Ubuntu Desktop 18.04 LTS • (size: 1.8 GB)
• 3. Install VirtualBox • 4. Create a VM using the Ubuntu .iso file • 5. Log in to Ubuntu • We will install PySpark in class
VirtualBox (issue if you have Docker) • Only 32-bit options available? Disable Hyper-V in Windows Features, then restart Windows.
Update your Ubuntu • sudo apt-get update • sudo apt-get upgrade
Verify your Python 3 install • python --version • python3 --version
Install Jupyter with pip • Install pip3: • sudo apt install python3-pip
Install Jupyter with pip3 • (You can also install the full Anaconda distribution instead) • pip3 install jupyter
Install the Java JDK • Spark is based on Java, so don't forget to install it (otherwise you will get weird errors). • sudo apt-get install openjdk-8-jdk • Or: • sudo apt-get install default-jdk
Start the Jupyter notebook • Type "jupyter notebook" • If it shows "command not found", pip hasn't placed jupyter on your system PATH • Restart your Ubuntu VM • Or run Jupyter directly: • ~/.local/bin/jupyter-notebook
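Once the notebook opens, a quick sanity check (just a sketch; run it in any notebook cell) confirms which Python interpreter the kernel is using:

import sys

# The kernel should report Python 3.x and point at the VM's interpreter.
print(sys.version)
print(sys.executable)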
Install PySpark • PySpark is available on PyPI, but pip3 doesn't work here!! Get pip by: • sudo apt-get install python-pip • Then it's super easy: • pip install pyspark • If you wish to use conda instead: • conda install -c conda-forge pyspark • Make sure you see the Spark logo when pyspark starts. If not, it's a trap :P • http://sigdelta.com/blog/how-to-install-pyspark-locally/
Verify your pyspark
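A minimal verification sketch you can run in a notebook cell or the python3 shell (the local[*] master and the app name here are arbitrary illustrative choices, not part of the slides):

from pyspark.sql import SparkSession

# Start a local Spark session using all cores of the VM.
spark = SparkSession.builder.master("local[*]").appName("verify-pyspark").getOrCreate()

print(spark.version)            # prints the installed Spark version
print(spark.range(10).count())  # runs a tiny job; should print 10

spark.stop()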
Let's run your 1st Spark program
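As a first program, a small RDD job is enough to see Spark at work (a sketch; the numbers and names are illustrative):

from pyspark import SparkContext

# Create a local SparkContext (skip this line inside the pyspark shell, where `sc` already exists).
sc = SparkContext("local[*]", "first-spark-program")

data = sc.parallelize(range(1, 101))       # distribute the numbers 1..100
evens = data.filter(lambda x: x % 2 == 0)  # keep the even ones
print(evens.count())                       # 50
print(data.sum())                          # 5050

sc.stop()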
Now, your PySpark exercise • Install Hadoop • Use Spark to count words from your favorite site • Example: we will use cs.iastate.edu
Installing Hadoop • https://www.digitalocean.com/community/tutorials/how-to-install-hadoop-in-stand-alone-mode-on-ubuntu-16-04
Run MapReduce example
• 1. Get the text from the website • You could use bs4 (BeautifulSoup4) to scrape the web (more elegant) • Or manually save the web page to a .txt file (less elegant :P)
• pip3 install bs4 • Python code to easily get the text from a web page:
• import urllib.request
• from bs4 import BeautifulSoup
• url = ''  # e.g. 'https://www.cs.iastate.edu/'
• page = urllib.request.urlopen(url)
• html_content = page.read()
• rendered_content = BeautifulSoup(html_content, 'html.parser').get_text()
• file = open('file_text.txt', 'w')
• file.write(rendered_content)
• file.close()
Create a PySpark word count program • https://spark.apache.org/examples.html
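The word count from that page can be pointed at the text file saved in the previous step; a sketch assuming file_text.txt from the bs4 snippet above:

from pyspark import SparkContext

sc = SparkContext("local[*]", "wordcount")

text_file = sc.textFile("file_text.txt")                  # the page text saved earlier
counts = (text_file.flatMap(lambda line: line.split())    # split each line into words
                   .map(lambda word: (word, 1))           # pair each word with a count of 1
                   .reduceByKey(lambda a, b: a + b))      # add up the counts per word

# Show the 20 most frequent words instead of writing out to HDFS.
for word, count in counts.takeOrdered(20, key=lambda pair: -pair[1]):
    print(word, count)

sc.stop()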
Thank you • Questions? • adisak@iastate.edu • http://web.cs.iastate.edu/~adisak/MBDS2018/