Importing flat files from the web
INTERMEDIATE IMPORTING DATA IN PYTHON
Hugo Bowne-Anderson, Data Scientist at DataCamp
You're already great at importing!
- Flat files such as .txt and .csv
- Pickled files, Excel spreadsheets, and many others!
- Data from relational databases
- You can do all of these locally
- What if your data is online?
Can you import web data?
- You can: go to the URL and click to download files
- BUT: not reproducible, not scalable
You'll learn how to...
- Import and locally save datasets from the web
- Load datasets into pandas DataFrames
- Make HTTP requests (GET requests)
- Scrape web data such as HTML
- Parse HTML into useful data (BeautifulSoup)
- Use the urllib and requests packages
The urllib package
- Provides an interface for fetching data across the web
- urlopen() - accepts URLs instead of file names
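As a minimal sketch (not from the slide): because urlopen() returns a file-like response object, you can read from a URL much as you would from a local file. The URL is the wine quality dataset used on the next slide.

    from urllib.request import urlopen

    url = ('http://archive.ics.uci.edu/ml/machine-learning-databases/'
           'wine-quality/winequality-white.csv')

    # urlopen() returns a file-like HTTPResponse object
    with urlopen(url) as response:
        header = response.readline().decode('utf-8')  # first line: the column names
    print(header)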
How to automate file download in Python

    from urllib.request import urlretrieve
    url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv'
    urlretrieve(url, 'winequality-white.csv')

    ('winequality-white.csv', <http.client.HTTPMessage at 0x103cf1128>)
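Since the course also covers loading datasets into pandas DataFrames, here is a brief follow-up sketch (assuming pandas is installed and the download above succeeded):

    import pandas as pd

    # the UCI wine quality file is semicolon-delimited
    df = pd.read_csv('winequality-white.csv', sep=';')
    print(df.head())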
Let's practice!
HTTP requests to import files from the web
INTERMEDIATE IMPORTING DATA IN PYTHON
Hugo Bowne-Anderson, Data Scientist at DataCamp
URL
- Uniform/Universal Resource Locator
- References to web resources
- Focus: web addresses
- Ingredients:
  - Protocol identifier - http:
  - Resource name - datacamp.com
- These specify web addresses uniquely
HTTP
- HyperText Transfer Protocol
- Foundation of data communication for the web
- HTTPS - more secure form of HTTP
- Going to a website = sending an HTTP request
  - GET request
- urlretrieve() performs a GET request
- HTML - HyperText Markup Language
GET requests using urllib

    from urllib.request import urlopen, Request
    url = "https://www.wikipedia.org/"
    request = Request(url)
    response = urlopen(request)
    html = response.read()
    response.close()
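One detail worth noting (not on the slide): response.read() returns bytes, so you will usually decode it to a string before working with it. A minimal sketch, using the response as a context manager so it closes automatically:

    from urllib.request import urlopen, Request

    url = "https://www.wikipedia.org/"
    request = Request(url)

    # the with-block closes the response for you
    with urlopen(request) as response:
        html = response.read().decode('utf-8')  # bytes -> str

    print(html[:100])  # first 100 characters of the page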
GET requests using requests
- Used by "her Majesty's Government, Amazon, Google, Twilio, NPR, Obama for America, Twitter, Sony, and Federal U.S. Institutions that prefer to be unnamed"
GET requests using requests
- One of the most downloaded Python packages

    import requests
    url = "https://www.wikipedia.org/"
    r = requests.get(url)
    text = r.text
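Beyond .text, the Response object exposes a few other useful attributes; a brief sketch (these attribute names are standard requests API, though they are not shown on the slide):

    import requests

    url = "https://www.wikipedia.org/"
    r = requests.get(url)

    print(r.status_code)              # 200 on success
    print(r.encoding)                 # encoding used to build r.text
    print(r.headers['Content-Type'])  # response headers behave like a dict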
Let's practice!
Scraping the web in Python
INTERMEDIATE IMPORTING DATA IN PYTHON
Hugo Bowne-Anderson, Data Scientist at DataCamp
HTML
- Mix of unstructured and structured data
- Structured data:
  - Has a pre-defined data model, or
  - Is organized in a defined manner
- Unstructured data: neither of these properties
BeautifulSoup
- Parse and extract structured data from HTML
- Make tag soup beautiful and extract information
BeautifulSoup

    from bs4 import BeautifulSoup
    import requests
    url = 'https://www.crummy.com/software/BeautifulSoup/'
    r = requests.get(url)
    html_doc = r.text
    soup = BeautifulSoup(html_doc)
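A side note (not from the slide): recent versions of bs4 warn if no parser is specified, so it is common to name one explicitly. A minimal variant of the code above:

    from bs4 import BeautifulSoup
    import requests

    url = 'https://www.crummy.com/software/BeautifulSoup/'
    r = requests.get(url)

    # 'html.parser' is Python's built-in parser; 'lxml' or 'html5lib' also work if installed
    soup = BeautifulSoup(r.text, 'html.parser')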
Prettified Soup

    print(soup.prettify())

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/transitional.dtd">
    <html>
     <head>
      <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
      <title>
       Beautiful Soup: We called him Tortoise because he taught us.
      </title>
      <link href="mailto:leonardr@segfault.org" rev="made"/>
      <link href="/nb/themes/Default/nb.css" rel="stylesheet" type="text/css"/>
      <meta content="Beautiful Soup: a library designed for screen-scraping HTML and XML." name="Description"/>
      <meta content="Markov Approximation 1.4 (module: leonardr)" name="generator"/>
      <meta content="Leonard Richardson" name="author"/>
     </head>
     <body alink="red" bgcolor="white" link="blue" text="black" vlink="660066">
      <img align="right" src="10.1.jpg" width="250"/>
      <br/>
      <p>
Exploring BeautifulSoup
- Many methods such as:

    print(soup.title)

    <title>Beautiful Soup: We called him Tortoise because he taught us.</title>

    print(soup.get_text())

    Beautiful Soup: We called him Tortoise because he taught us.
    You didn't write that awful page. You're just trying to get some data out of it.
    Beautiful Soup is here to help. Since 2004, it's been saving programmers hours or days
    of work on quick-turnaround screen scraping projects.
Exploring BeautifulSoup
- find_all()

    for link in soup.find_all('a'):
        print(link.get('href'))

    bs4/download/
    #Download
    bs4/doc/
    #HallOfFame
    https://code.launchpad.net/beautifulsoup
    https://groups.google.com/forum/?fromgroups#!forum/beautifulsoup
    http://www.candlemarkandgleam.com/shop/constellation-games/
    http://constellation.crummy.com/Constellation%20Games%20excerpt.html
    https://groups.google.com/forum/?fromgroups#!forum/beautifulsoup
    https://bugs.launchpad.net/beautifulsoup/
    http://lxml.de/
    http://code.google.com/p/html5lib/
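A small extension of the slide above: find_all() returns Tag objects, so you can pull out both the link text and its href, for example into a list of tuples. A sketch, assuming soup was built from the BeautifulSoup homepage as shown earlier:

    # pair each anchor's visible text with its href attribute
    links = [(a.get_text(strip=True), a.get('href')) for a in soup.find_all('a')]
    for text, href in links[:5]:
        print(text, '->', href)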
Let's practice!