web scraping 101

Web Scraping 101: Working with Web Data in R, Charlotte - PowerPoint PPT Presentation

  1. Web Scraping 101
     WORKING WITH WEB DATA IN R
     Charlotte Wickham, Instructor

  2. Selectors
     - Little browser extensions
     - Identify the specific bit(s) you want
     - Give you a unique ID to grab them with
     - Not used in this course (but worth grabbing after)

  3. rvest
     - rvest is a dedicated web scraping package
     - Makes things shockingly easy
     - Read HTML page with read_html(x = ___)
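
As a minimal sketch of the read_html() step, the snippet below parses a small inline page instead of a live URL so that it runs without network access. The page source and the variable names are hypothetical and are not part of the course material.

```r
library(rvest)

# Hypothetical inline page, used here in place of a real URL
page_source <- "<html><head><title>Demo page</title></head>
<body><p>Hello, scraper!</p></body></html>"

# read_html() accepts a URL, a file path, or a string of markup
page <- read_html(page_source)

# The parsed result is an XML document object
class(page)
```

Passing a URL string instead of page_source should work the same way, assuming the site is reachable.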

  4. Parsing HTML
     - read_html() returns an XML document
     - Use html_node() to extract contents with XPATHs

  5. Parsing HTML
       wiki_r <- read_html(
         "https://en.wikipedia.org/wiki/R_(programming_language)"
       )
       wiki_r
       {xml_document}
       <html class="client-nojs" lang="en" dir="ltr">
       [1] <head>\n<meta http-equiv="Content-Type" content="text/html; c ...
       [2] <body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ...

  6.   wiki_r <- read_html(
         "https://en.wikipedia.org/wiki/R_(programming_language)"
       )
       wiki_r
       {xml_document}
       <html class="client-nojs" lang="en" dir="ltr">
       [1] <head>\n<meta http-equiv="Content-Type" content="text/html; c ...
       [2] <body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ...

       html_node(wiki_r, xpath = "//ul")
       {xml_node}
       <ul>
       [1] <li><a href="/wiki/Common_Lisp" title="Common Lisp">Common Li ...
       [2] <li><a href="/wiki/S_(programming_language)" title="S (progra ...
       [3] <li>\n<a href="/wiki/Scheme_(programming_language)" title="Sc ...
       [4] <li><a href="/wiki/XLispStat" title="XLispStat">XLispStat</a> ...
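
The XPath lookup can be tried offline with a sketch like the one below. The inline page is a hypothetical stand-in for the Wikipedia article, and html_nodes() (the plural form, which returns every match) does not appear on the slides; it is included only to show how it differs from html_node().

```r
library(rvest)

# Hypothetical page with two unordered lists
page <- read_html("<html><body>
  <p>Influences</p>
  <ul><li>Common Lisp</li><li>S</li></ul>
  <ul><li>Scheme</li></ul>
</body></html>")

# html_node() returns only the first element matching the XPath
first_list <- html_node(page, xpath = "//ul")

# html_nodes() returns all matching elements as a node set
all_lists <- html_nodes(page, xpath = "//ul")
length(all_lists)

# Text of the <li> items inside the first list
first_items <- html_text(html_nodes(first_list, xpath = ".//li"))
first_items
```

Because html_node() stops at the first match, "//ul" on a real article may return a navigation list rather than the list of interest; a more specific XPath is usually needed there.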

  7. Let's practice!

  8. HTML Structure
     Oliver Keyes, Instructor

  9. Tags
     - HTML is content within tags
     - Like XML
       <p> this is a test </p>

  10. Attributes
        <a href = "https://en.wikipedia.org/"> this is a test </a>

  11. Extracting information
      - html_text(x = ___): get text contents
      - html_attr(x = ___, name = ___): get specific attribute
      - html_name(x = ___): get tag name
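
The three extractor functions can be compared side by side on a single link element. This is a sketch that reuses the anchor tag from the Attributes slide inside a hypothetical inline page; the variable names are illustrative.

```r
library(rvest)

# Hypothetical page holding the anchor tag from the Attributes slide
page <- read_html(
  '<html><body><a href="https://en.wikipedia.org/">this is a test</a></body></html>'
)
link <- html_node(page, xpath = "//a")

link_text <- html_text(link)                 # text between the tags
link_href <- html_attr(link, name = "href")  # value of the href attribute
link_tag  <- html_name(link)                 # name of the tag itself

link_text
link_href
link_tag
```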

  12. Let's practice!

  13. Reformatting Data
      Charlotte Wickham, Instructor

  14. HTML tables
      - HTML tables are dedicated structures: <table>...</table>
      - They can be turned into data.frames with html_table()
      - Use colnames(table) <- c("name", "second_name") to name the columns
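
A small sketch of the table workflow, assuming a hypothetical two-column table embedded in an inline page. The table contents and the replacement column names are illustrative only.

```r
library(rvest)

# Hypothetical page containing one small HTML table
page <- read_html("<html><body><table>
  <tr><th>A</th><th>B</th></tr>
  <tr><td>R</td><td>1993</td></tr>
  <tr><td>S</td><td>1976</td></tr>
</table></body></html>")

# Select the <table> element, then convert it to a data frame
table_node <- html_node(page, xpath = "//table")
languages <- html_table(table_node)

# Replace the scraped header names with more descriptive ones
colnames(languages) <- c("language", "first_appeared")
languages
```

Renaming after conversion is handy when the scraped headers are missing, duplicated, or awkward to type.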

  15. Turning things into data.frames
      - Non-tables can also become data.frames
      - Use data.frame(), with the vectors of text or names or attributes
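
For content that is not a table, one possible approach is to extract parallel vectors and combine them. The sketch below assumes a hypothetical list of links; html_nodes() and the column names are choices made for this illustration, not something shown on the slides.

```r
library(rvest)

# Hypothetical page with a list of links rather than a table
page <- read_html('<html><body><ul>
  <li><a href="/wiki/Common_Lisp">Common Lisp</a></li>
  <li><a href="/wiki/XLispStat">XLispStat</a></li>
</ul></body></html>')

# Grab every link inside a list item
links <- html_nodes(page, xpath = "//li/a")

# Build one column from the link text and one from the href attribute
link_df <- data.frame(
  title = html_text(links),
  url = html_attr(links, name = "href"),
  stringsAsFactors = FALSE
)
link_df
```

Both vectors come from the same node set, so they have the same length and the rows line up.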

  16. Let's practice!
