Scalable Tools - Part II Adisak Sukul, Ph.D., Lecturer, Department of Computer Science, adisak@iastate.edu http://web.cs.iastate.edu/~adisak/Bigdata/
Scalable Tools session • We will be using Spark, Python, and PySpark. • We will use Jupyter Notebook as the IDE.
Install your software: create an Ubuntu VM in VirtualBox on your laptop • This lecture will walk through how to download and set up VirtualBox with Ubuntu. • Then we will walk through installing Spark, Python, and the Jupyter Notebook on this VirtualBox Ubuntu.
Option 1: host it locally on your laptop (recommended) • This option requires you to download large files. I will also have those files available locally to copy during the session. 1. Download VirtualBox (for Windows and Mac) • https://www.virtualbox.org/wiki/Downloads • (108 MB for Windows, 91 MB for Mac) • (If you are using Linux, you don't need this) 2. Download Ubuntu • https://www.ubuntu.com/download/desktop • Select Ubuntu Desktop 18.04 LTS • (size: 1.8 GB)
• 3. Install VirtualBox • 4. Create a VM using the Ubuntu .iso file • 5. Log in to Ubuntu • We will install PySpark in class
VirtualBox (issue if you have Docker) • Only 32-bit options available? Disable Hyper-V in Windows Features, then restart Windows.
Update your Ubuntu • sudo apt-get update • sudo apt-get upgrade
Verify your Python 3 install • python --version • python3 --version
Install Jupyter with pip • Install pip3: • sudo apt install python3-pip
Install Jupyter with pip3 • (You can also install the full Anaconda distribution instead) • pip3 install jupyter
Install the Java JDK • Spark is based on Java, so don't forget to install it (otherwise you will get weird errors). • sudo apt-get install openjdk-8-jdk • Or: • sudo apt-get install default-jdk
Start the Jupyter notebook • Type "jupyter notebook" • If it shows "command not found", pip hasn't placed jupyter on your system PATH • Restart your Ubuntu VM • Or run Jupyter directly: • ~/.local/bin/jupyter-notebook
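Once the notebook opens, a quick sanity check (just a sketch; run it in any notebook cell) confirms which Python interpreter the kernel is using:

import sys

# The kernel should report Python 3.x and point at the VM's interpreter.
print(sys.version)
print(sys.executable)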
Install PySpark • PySpark is available on PyPI, but pip3 doesn't work here!! Get pip by: • sudo apt-get install python-pip • Then it's super easy: • pip install pyspark • If you wish to use conda instead: • conda install -c conda-forge pyspark • Make sure you see the Spark logo when pyspark starts. If not, it's a trap :P • http://sigdelta.com/blog/how-to-install-pyspark-locally/
Verify your pyspark
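A minimal verification sketch you can run in a notebook cell or the python3 shell (the local[*] master and the app name here are arbitrary illustrative choices, not part of the slides):

from pyspark.sql import SparkSession

# Start a local Spark session using all cores of the VM.
spark = SparkSession.builder.master("local[*]").appName("verify-pyspark").getOrCreate()

print(spark.version)            # prints the installed Spark version
print(spark.range(10).count())  # runs a tiny job; should print 10

spark.stop()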
Let's run your 1st Spark program
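As a first program, a small RDD job is enough to see Spark at work (a sketch; the numbers and names are illustrative):

from pyspark import SparkContext

# Create a local SparkContext (skip this line inside the pyspark shell, where `sc` already exists).
sc = SparkContext("local[*]", "first-spark-program")

data = sc.parallelize(range(1, 101))       # distribute the numbers 1..100
evens = data.filter(lambda x: x % 2 == 0)  # keep the even ones
print(evens.count())                       # 50
print(data.sum())                          # 5050

sc.stop()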
Now, your PySpark exercise • Install Hadoop • Use Spark to count words from your favorite site • Example: we will use cs.iastate.edu
Installing Hadoop • https://www.digitalocean.com/community/tutorials/how-to-install-hadoop-in-stand-alone-mode-on-ubuntu-16-04
Run MapReduce example
• 1. Get the text from the website • You could use bs4 (BeautifulSoup4) to scrape the web (more elegant) • Or manually save the web page to a .txt file (less elegant :P)
• pip3 install bs4 • Python code to easily get the text from a web page:
• import urllib.request
• from bs4 import BeautifulSoup
• url = ''  # e.g. 'https://www.cs.iastate.edu/'
• page = urllib.request.urlopen(url)
• html_content = page.read()
• rendered_content = BeautifulSoup(html_content, 'html.parser').get_text()
• file = open('file_text.txt', 'w')
• file.write(rendered_content)
• file.close()
Create a PySpark word count program • https://spark.apache.org/examples.html
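The word count from that page can be pointed at the text file saved in the previous step; a sketch assuming file_text.txt from the bs4 snippet above:

from pyspark import SparkContext

sc = SparkContext("local[*]", "wordcount")

text_file = sc.textFile("file_text.txt")                  # the page text saved earlier
counts = (text_file.flatMap(lambda line: line.split())    # split each line into words
                   .map(lambda word: (word, 1))           # pair each word with a count of 1
                   .reduceByKey(lambda a, b: a + b))      # add up the counts per word

# Show the 20 most frequent words instead of writing out to HDFS.
for word, count in counts.takeOrdered(20, key=lambda pair: -pair[1]):
    print(word, count)

sc.stop()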
Thank you • Questions? • adisak@iastate.edu • http://web.cs.iastate.edu/~adisak/MBDS2018/