

  1. Importing data into R
  Workshop 3

  2. Objectives
  By doing this workshop and carrying out the independent study the successful student will be able to:
  ● Describe the breadth of data sources
  ● Devise reproducible strategies to import local and remote data in a variety of formats
  A short talk outlines some of the possibilities, followed by opportunities for you to apply and combine ideas.

  3. Outline
  Where: stored locally (on your own computer) or remotely (on another computer/server)
  Format: varies; structured as XML or JSON, held in databases, or may require harvesting
  How: base R functions; access to APIs for many forms of specialised data has been made easier with packages, e.g., Bioconductor
  Result: always the same: dataframes or dataframe-like structures (e.g., tibbles)

  4. Locally stored: txt, csv or similar files
  Essentially plain text (can be opened in Notepad and make sense)
  Occasionally fixed-width columns (see the sketch after this slide), more commonly 'delimited' by a particular character
  Read in with the read.table() methods; read.table(file) is the minimum needed
  Remember that file location matters
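  For the rarer fixed-width case there is read.fwf(), which takes column widths instead of a separator. A minimal sketch; the file name, widths and column names here are hypothetical:

    # read.fwf() splits each line at fixed character positions
    # (file name, widths and col.names are illustrative only)
    mydata <- read.fwf("../data/fixedwidth.txt",
                       widths = c(4, 2, 2, 6),
                       col.names = c("year", "month", "day", "value"))
    str(mydata)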

  5. Locally stored: txt, csv or similar files
  > mydata <- read.table("../data/structurepred.txt")
  > str(mydata)
  'data.frame': 91 obs. of 3 variables:
   $ V1: Factor w/ 91 levels "0.08","0.353",..: 91 84 31 32 37 18 25 89 88 3 ...
   $ V2: Factor w/ 4 levels "Abstruct","Predicto",..: 3 1 1 1 1 1 1 1 1 1 ...
   $ V3: Factor w/ 31 levels "1","10","11",..: 31 1 12 23 25 26 27 28 29 30 ...
  Arguments depend on data format:
  > mydata <- read.table("../data/structurepred.txt", header=T)
  > str(mydata)
  'data.frame': 90 obs. of 3 variables:
   $ rmsd: num 9.04 14.95 17.73 3.12 11.28 ...
   $ prog: Factor w/ 3 levels "Abstruct","Predicto",..: 1 1 1 1 1 1 1 1 1 1 ...
   $ prot: int 1 2 3 4 5 6 7 8 9 10 ...
  The other defaults are appropriate here (including sep)

  6. Locally stored: txt, csv or similar files
  Arguments depend on data format:
  > mydata <- read.table("../data/Icd10Code.csv", header=T)
  Error in read.table("../data/Icd10Code.csv", header = T) :
    more columns than column names
  Try reading the first line only:
  > mydata <- read.table("../data/Icd10Code.csv", header=F, nrows = 1)
  > (mydata)
                   V1
  1 Code,Description
  The default sep is the problem:
  > mydata <- read.table("../data/Icd10Code.csv", header=T, sep=",")
  OR
  > mydata <- read.csv("../data/Icd10Code.csv")
  > str(mydata)
  'data.frame': 12131 obs. of 2 variables:
   $ Code       : Factor w/ 12131 levels "A00","A000","A001",..: 1 2 3 4 5 ...
   $ Description: Factor w/ 12079 levels "4-Aminophenol derivatives",..: 1822 1823 1824 1826 11605 ...
  See manual: defaults depend on which read. method is used

  7. Locally stored: special format files
  Cannot usually be opened in Notepad (and make sense)
  Often specific to particular software
  If you have the software you can export the data as comma- or tab-delimited text and use a read.table() method
  But it is much better to do all processing in the script: every step documented and repeatable
  To determine how to read that type of file: Google. Keep googling.

  8. Locally stored: special format files
  Packages are often the solution, e.g. haven and foreign ("Read Data Stored by Minitab, S, SAS, SPSS, Stata, Systat, Weka, dBase, ...")
  Once:
  > install.packages("haven")
  Once each session:
  > library(haven)
  e.g. read_sav():
  > mydata <- read_sav("../data/prac9a.sav")
  > str(mydata)
  Classes 'tbl_df', 'tbl' and 'data.frame': 120 obs. of 6 variables:
   $ terrsize: atomic 0.463 0.446 0.651 0.507 0.879 ...
    ..- attr(*, "label")= chr "Territory size (Ha)"
   $ country : Class 'labelled' atomic [1:120] 1 1 1 1 1 1 1 1 1 1 ...
   ...
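  Note that haven imports SPSS value labels as 'labelled' vectors, as in the country column above. A minimal follow-on sketch, assuming the same prac9a.sav file; as_factor() converts labelled vectors to ordinary R factors:

    # assumes the prac9a.sav import shown above
    library(haven)
    mydata <- read_sav("../data/prac9a.sav")
    # convert the SPSS value labels on 'country' to a plain R factor
    mydata$country <- as_factor(mydata$country)
    str(mydata$country)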

  9. Files from the Internet
  Why read from the web rather than saving the files and then reading them in normally? Repeatability
  Especially useful if you need to rerun analyses on regularly updated public data
  Use the same methods as before: you just replace the file location with the URL of the data
  You still need to know the data format

  10. Files from the Internet
  Data from a buoy (buoy #44025) off the coast of New Jersey at
  http://www.ndbc.noaa.gov/view_text_file.php?filename=44025h2011.txt.gz&dir=data/historical/stdmet/
  # to make the code more readable we set a variable to the website address of the file:
  > file <- "http://www.ndbc.noaa.gov/view_text_file.php?filename=44025h2011.txt.gz&dir=data/historical/stdmet/"
  # data format: look on the web or use:
  > readLines(file, n=5)
  [1] "#YY MM DD hh mm WDIR WSPD GST WVHT DPD APD MWD PRES ATMP WTMP DEWP VIS TIDE"
  [2] "#yr mo dy hr mn degT m/s m/s m sec sec degT hPa degC degC degC mi ft"
  [3] "2010 12 31 23 50 222 7.2 8.5 0.75 4.55 3.72 203 1022.2 6.9 6.7 3.5 99.0 99.00"
  [4] "2011 01 01 00 50 233 6.0 6.8 0.76 4.76 3.77 196 1022.2 6.7 6.7 3.7 99.0 99.00"
  [5] "2011 01 01 01 50 230 5.0 5.9 0.72 4.55 3.85 201 1021.9 6.8 6.7 3.5 99.0 99.00"
  > mydata <- read.table(file, header = F, skip=2)
  > str(mydata)
  'data.frame': 6358 obs. of 18 variables:
   $ V1 : int 2010 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
   $ V2 : int 12 1 1 1 1 1 1 1 1 1 ...
   ...
   $ V18: num 99 99 99 99 99 99 99 99 99 99 ...
  Would still need some 'tidying'
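  What might that 'tidying' look like? A minimal sketch, assuming the file and mydata objects above. Taking variable names from the commented header line is straightforward; treating 99 (printed as 99, 99.0 or 99.00) as a missing-value code is an assumption based on the output, so check the NDBC documentation:

    # take column names from the first header line (assumes 'file' from above)
    hdr <- scan(file, what = "character", nlines = 1)
    names(mydata) <- sub("^#", "", hdr)   # "#YY" becomes "YY"
    # recode the apparent missing-value placeholders as NA (an assumption)
    mydata[mydata == 99] <- NA
    str(mydata)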

  11. Web scraping
  What if data are not in a file but on a webpage?
  One solution is to 'scrape' ('harvest', d'ya geddit??) the data using the package rvest, plus an extension to Chrome called SelectorGadget that lets you interactively determine which 'css selector' you need to extract the desired components from a page
  To see rvest in action we are going to retrieve the results of a Google Scholar search for Calvin Dytham, user TJUyl1gAAAAJ

  12. Web scraping with rvest
  > install.packages("XML")
  > install.packages("rvest")
  > library(rvest)
  > library(magrittr)
  > page <- read_html("https://scholar.google.co.uk/citations?user=TJUyl1gAAAAJ&hl=en")
  # Specify the css selector in html_nodes(), extract the text with html_text()
  # and change the string to numeric using as.numeric().
  > citations <- page %>% html_nodes("#gsc_a_b .gsc_a_c") %>% html_text() %>% as.numeric()
  > years <- page %>% html_nodes("#gsc_a_b .gsc_a_y") %>% html_text() %>% as.numeric()
  > citations
   [1] 1228  314  290  265  263  216  200  193  184  180  131  111  110  100   94   87   87   86
  [19]   79   76
  > years
   [1] 2011 2012 1999 2010 1999 2008 1999 2007 2011 2002 1998 2002 2003 2007 2005 2007 1995 2010
  [19] 2009 2009
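  A short follow-on, assuming the citations and years vectors above: the scraped vectors end up in a dataframe, the same end point as every other import route:

    # combine the scraped vectors into the usual dataframe result
    scholar <- data.frame(year = years, citations = citations)
    str(scholar)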

  13. Data from databases
  Many packages:
  For relational databases (Oracle, MSSQL, MySQL): RMySQL, RODBC
  For non-relational databases (MongoDB, Hadoop): rmongodb, rhbase
  For particular specialised research fields: packages for import, tidying and analysis, e.g. rentrez, Bioconductor
  rOpenSci: R packages that provide programmatic access to a variety of scientific data, full-text of journal articles, and repositories that provide real-time metrics of scholarly impact
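  A minimal sketch of the relational route with RMySQL; the host, database, user, password and table names are hypothetical placeholders, not a real server:

    library(RMySQL)
    # connection details below are placeholders only
    con <- dbConnect(MySQL(), host = "dbserver.example.com", dbname = "research",
                     user = "student", password = "xxxx")
    mydata <- dbGetQuery(con, "SELECT * FROM measurements")  # returns a dataframe
    dbDisconnect(con)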

  14. And finally ...
  Connection to programming: many programming concepts are typically required for data import, for example input and output streams, pattern matching and loops
  When working with big datasets that take a while to read in, save your workspace (.RData file) and reload that, rather than reading in and tidying the data each time
  Further reading:
  This R Data Import Tutorial Is Everything You Need
  Importing Data Into R - Part Two
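  A minimal sketch of that save-and-reload pattern; the file names are examples:

    # after a slow import-and-tidy session, save everything once
    save.image("buoy_tidy.RData")
    # ... later, in a fresh session, restore without re-importing
    load("buoy_tidy.RData")
    # or save and restore a single object
    saveRDS(mydata, "buoy.rds")
    mydata <- readRDS("buoy.rds")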
