Importing flat files from the web



  1. Importing flat files from the web
     INTERMEDIATE IMPORTING DATA IN PYTHON
     Hugo Bowne-Anderson, Data Scientist at DataCamp

  2. You're already great at importing!
     - Flat files such as .txt and .csv
     - Pickled files, Excel spreadsheets, and many others!
     - Data from relational databases
     - You can do all these locally
     - What if your data is online?

  3. Can you import web data?
     - You can: go to the URL and click to download files
     - BUT: not reproducible, not scalable

  4. You'll learn how to…
     - Import and locally save datasets from the web
     - Load datasets into pandas DataFrames
     - Make HTTP requests (GET requests)
     - Scrape web data such as HTML
     - Parse HTML into useful data (BeautifulSoup)
     - Use the urllib and requests packages

  5. The urllib package
     - Provides an interface for fetching data across the web
     - urlopen() - accepts URLs instead of file names
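
To illustrate that urlopen() takes a URL where open() would take a file name, here is a minimal sketch that runs offline. It assumes a throwaway local file (sample.txt, a hypothetical name) addressed through a file:// URL; an http:// or https:// URL would be passed in the same way.

```python
import tempfile
from pathlib import Path
from urllib.request import urlopen

# Hypothetical sample file, created only so the sketch runs without a network.
tmp_dir = Path(tempfile.mkdtemp())
sample = tmp_dir / "sample.txt"
sample.write_bytes(b"hello from a URL\n")

# Path.as_uri() turns the absolute path into a file:// URL.
url = sample.as_uri()

# urlopen() returns a file-like response; read() yields bytes, not str.
with urlopen(url) as response:
    raw = response.read()

text = raw.decode("utf-8")
print(text)
```

Note that the response body arrives as bytes, so it has to be decoded before it can be treated as text.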

  6. How to automate file download in Python

         from urllib.request import urlretrieve
         url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv'
         urlretrieve(url, 'winequality-white.csv')

         ('winequality-white.csv', <http.client.HTTPMessage at 0x103cf1128>)
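
Once urlretrieve() has saved a file, it can be read like any other local file. The sketch below is one way to do that under stated assumptions: a small local file (remote-data.csv, a hypothetical name with made-up rows) stands in for the remote dataset so the example runs offline, and the standard-library csv module is used in place of pandas to avoid a third-party dependency. The semicolon delimiter is an assumption about this sample.

```python
import csv
import tempfile
from pathlib import Path
from urllib.request import urlretrieve

work_dir = Path(tempfile.mkdtemp())

# Hypothetical stand-in for a remote dataset: a small semicolon-separated file.
source = work_dir / "remote-data.csv"
source.write_bytes(b'"fixed acidity";"pH";"quality"\n7.0;3.0;6\n6.3;3.3;6\n')

# urlretrieve(url, filename) saves the resource locally and returns
# a (local filename, headers) tuple.
target = work_dir / "local-copy.csv"
saved_path, headers = urlretrieve(source.as_uri(), str(target))

# Read the saved copy back; the delimiter is an assumption about this sample.
with open(saved_path, newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f, delimiter=";"))

header, records = rows[0], rows[1:]
print(header)
print(records)
```

With pandas installed, the same saved file could instead be loaded with pd.read_csv(saved_path, sep=';').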

  7. Let's practice!

  8. HTTP requests to import files from the web

  9. URL
     - Uniform/Universal Resource Locator
     - References to web resources
     - Focus: web addresses
     - Ingredients:
       - Protocol identifier - http:
       - Resource name - datacamp.com
     - These specify web addresses uniquely
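
The two ingredients can be pulled apart programmatically. This is a small sketch using the standard library's urllib.parse.urlparse; the path segment in the example address is hypothetical and only there to show where the resource name ends.

```python
from urllib.parse import urlparse

# Example address; the path segment is hypothetical.
url = "http://datacamp.com/some/page"
parts = urlparse(url)

print(parts.scheme)  # protocol identifier
print(parts.netloc)  # resource name (host)
print(parts.path)    # location of the page on that host
```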

  10. HTTP
      - HyperText Transfer Protocol
      - Foundation of data communication for the web
      - HTTPS - more secure form of HTTP
      - Going to a website = sending HTTP request
        - GET request
      - urlretrieve() performs a GET request
      - HTML - HyperText Markup Language

  11. GET requests using urllib

          from urllib.request import urlopen, Request
          url = "https://www.wikipedia.org/"
          request = Request(url)
          response = urlopen(request)
          html = response.read()
          response.close()
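
A Request object can be inspected before anything is sent, which makes it easy to confirm which HTTP method urlopen() would use. The following sketch sends nothing over the network; the payload in the second request is a hypothetical value, included only to show when the method stops being GET.

```python
from urllib.request import Request

url = "https://www.wikipedia.org/"

# With no data attached, the request is a GET.
request = Request(url)
method = request.get_method()

# Attaching a body (hypothetical payload) switches the default method to POST.
request_with_body = Request(url, data=b"q=python")
method_with_body = request_with_body.get_method()

print(method, request.full_url)
print(method_with_body)
```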

  12. GET requests using requests
      - Used by “her Majesty's Government, Amazon, Google, Twilio, NPR, Obama for America, Twitter, Sony, and Federal U.S. Institutions that prefer to be unnamed”

  13. GET requests using requests
      - One of the most downloaded Python packages

          import requests
          url = "https://www.wikipedia.org/"
          r = requests.get(url)
          text = r.text

  14. Let's practice!

  15. Scraping the web in Python

  16. HTML
      - Mix of unstructured and structured data
      - Structured data:
        - Has pre-defined data model, or
        - Organized in a defined manner
      - Unstructured data: neither of these properties

  17. BeautifulSoup
      - Parse and extract structured data from HTML
      - Make tag soup beautiful and extract information

  18. BeautifulSoup

          from bs4 import BeautifulSoup
          import requests
          url = 'https://www.crummy.com/software/BeautifulSoup/'
          r = requests.get(url)
          html_doc = r.text
          # Naming the parser explicitly avoids bs4's guessed-parser warning.
          soup = BeautifulSoup(html_doc, 'html.parser')

  19. Prettified Soup

          print(soup.prettify())

          <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/transitional.dtd">
          <html>
           <head>
            <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
            <title>
             Beautiful Soup: We called him Tortoise because he taught us.
            </title>
            <link href="mailto:leonardr@segfault.org" rev="made"/>
            <link href="/nb/themes/Default/nb.css" rel="stylesheet" type="text/css"/>
            <meta content="Beautiful Soup: a library designed for screen-scraping HTML and XML." name="Description"/>
            <meta content="Markov Approximation 1.4 (module: leonardr)" name="generator"/>
            <meta content="Leonard Richardson" name="author"/>
           </head>
           <body alink="red" bgcolor="white" link="blue" text="black" vlink="660066">
            <img align="right" src="10.1.jpg" width="250"/>
            <br/>
            <p>

  20. Exploring BeautifulSoup
      - Many methods such as:

          print(soup.title)

          <title>Beautiful Soup: We called him Tortoise because he taught us.</title>

          print(soup.get_text())

          Beautiful Soup: We called him Tortoise because he taught us.
          You didn't write that awful page. You're just trying to get some data out of it. Beautiful Soup is here to help. Since 2004, it's been saving programmers hours or days of work on quick-turnaround screen scraping projects.

  21. Exploring BeautifulSoup
      - find_all()

          for link in soup.find_all('a'):
              print(link.get('href'))

          bs4/download/
          #Download
          bs4/doc/
          #HallOfFame
          https://code.launchpad.net/beautifulsoup
          https://groups.google.com/forum/?fromgroups#!forum/beautifulsoup
          http://www.candlemarkandgleam.com/shop/constellation-games/
          http://constellation.crummy.com/Constellation%20Games%20excerpt.html
          https://groups.google.com/forum/?fromgroups#!forum/beautifulsoup
          https://bugs.launchpad.net/beautifulsoup/
          http://lxml.de/
          http://code.google.com/p/html5lib/
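
To make concrete what collecting every link's href involves, here is a dependency-free sketch of the same idea. It is not BeautifulSoup: it swaps in the standard library's html.parser.HTMLParser, and both the LinkCollector class and the HTML snippet are hypothetical, loosely modeled on the links listed above.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs; href may be missing.
            self.links.append(dict(attrs).get("href"))


# Hypothetical snippet, not the real page.
html_doc = """
<html><body>
<p><a href="bs4/download/">Download</a> | <a href="bs4/doc/">Documentation</a></p>
<p><a name="HallOfFame">Hall of Fame</a></p>
</body></html>
"""

collector = LinkCollector()
collector.feed(html_doc)
print(collector.links)
```

An anchor without an href yields None here, which mirrors what link.get('href') returns for such a tag in BeautifulSoup.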

  22. Let's practice!
