1 Importing data into R: Workshop 2
2 Learning outcomes
By following the slides and applying the techniques to the workshop examples, the successful student will be able to:
● Describe the breadth of data sources
● Devise reproducible strategies to import local and remote data in a variety of formats
A short talk outlining some possibilities, followed by opportunities for you to apply and combine ideas. Facilitated problem solving rather than detailed tutorials.
3 Outline: four aspects to consider
Where: stored locally (on your own computer) or remotely (on another computer/server)
Format: varies; may be plain text, structured as XML or JSON, held in databases, or may require harvesting
How: base R functions; access to APIs for many forms of specialised data has been made easier by packages, e.g., Bioconductor
Result: often dataframes or dataframe-like structures (e.g., tibbles), sometimes specialised data structures
4 Revision: Locally stored: txt, csv or similar files
Essentially plain text (can be opened in Notepad and make sense)
Occasionally fixed-width columns; more commonly 'delimited' by a particular character
Read in with the read.table() family of functions
read.table(file) is the minimum needed; the other arguments have defaults
Remember that file location matters
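A minimal sketch of checking where R is looking before you read anything; the folder and file names here are just examples:
# where is R currently looking? (the 'working directory')
> getwd()
# what files can R see in a data folder relative to that?
> list.files("../data")
# read a file using a path relative to the working directory
> mydata <- read.table("../data/myfile.txt", header = TRUE)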
5 Revision: Locally stored: txt, csv or similar files
Arguments depend on data format:
> mydata <- read.table("../data/structurepred.txt")
> str(mydata)
'data.frame': 91 obs. of 3 variables:
 $ V1: Factor w/ 91 levels "0.08","0.353",..: 91 84 31 32 37 18 25 89 88 3 ...
 $ V2: Factor w/ 4 levels "Abstruct","Predicto",..: 3 1 1 1 1 1 1 1 1 1 ...
 $ V3: Factor w/ 31 levels "1","10","11",..: 31 1 12 23 25 26 27 28 29 30 ...
With header = T the first line is read as column names rather than data:
> mydata <- read.table("../data/structurepred.txt", header = T)
> str(mydata)
'data.frame': 90 obs. of 3 variables:
 $ rmsd: num 9.04 14.95 17.73 3.12 11.28 ...
 $ prog: Factor w/ 3 levels "Abstruct","Predicto",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ prot: int 1 2 3 4 5 6 7 8 9 10 ...
The other defaults (including sep) are appropriate here.
6 Revision: Locally stored: txt, csv or similar files
Arguments depend on data format:
> mydata <- read.table("../data/Icd10Code.csv", header = T)
Error in read.table("../data/Icd10Code.csv", header = T) :
  more columns than column names
Try reading the first line only:
> mydata <- read.table("../data/Icd10Code.csv", header = F, nrows = 1)
> mydata
            V1
1 Code,Description
The default sep is the problem, so set it explicitly:
> mydata <- read.table("../data/Icd10Code.csv", header = T, sep = ",")
OR
> mydata <- read.csv("../data/Icd10Code.csv")
> str(mydata)
'data.frame': 12131 obs. of 2 variables:
 $ Code : Factor w/ 12131 levels "A00","A000","A001",..: 1 2 3 4 5 ...
 $ Description: Factor w/ 12079 levels "4-Aminophenol derivatives",..: 1822 1823 1824 1826 11605 ...
See the manual: the defaults depend on which read.* function you use.
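When read.table() complains about column counts, base R's count.fields() shows how many fields each line yields under a given separator, which quickly reveals a sep problem; a diagnostic sketch using the same file:
# fields per line with the default whitespace separator
> table(count.fields("../data/Icd10Code.csv", sep = ""))
# fields per line with a comma separator
> table(count.fields("../data/Icd10Code.csv", sep = ","))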
7 Locally stored: special format files
Cannot usually be opened in Notepad (and make sense)
Often specific to particular software
Filepaths: no change
Method/function: may differ
● If you have the original software you can export the data as comma- or tab-delimited text and use a read.table method
● But it is much better to do all processing in the script: every step is then documented and repeatable
● To determine how to read that type of file: Google
● Keep googling
8 Locally stored: special format files
Packages are often the solution, e.g., haven or foreign: read data stored by Minitab, S, SAS, SPSS, Stata, Systat, Weka, dBase, ...
* Already installed in Biology/Rlibs (on your own PC, do once):
> install.packages("haven")
* Once each session:
> library(haven)
e.g., read_sav for SPSS files:
> mydata <- read_sav("../data/prac9a.sav")
> str(mydata)
Classes 'tbl_df', 'tbl' and 'data.frame': 120 obs. of 6 variables:
 $ terrsize: atomic 0.463 0.446 0.651 0.507 0.879 ...
  ..- attr(*, "label")= chr "Territory size (Ha)"
 $ country : Class 'labelled' atomic [1:120] 1 1 1 1 1 1 1 1 1 1 ...
 ...
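haven imports SPSS value labels as 'labelled' vectors rather than factors; a short follow-up sketch converting one to an ordinary factor with haven's as_factor():
# turn the labelled country codes into a factor using their SPSS labels
> mydata$country <- as_factor(mydata$country)
> str(mydata$country)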
9 Files from the Internet
Why read from the web rather than saving the files and then reading them in as normal? Repeatability.
Especially useful if you need to rerun analyses on regularly updated public data.
Use the same methods as before: you just replace the file location with the URL of the data.
You still need to know the data format.
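If you also want an archived local copy (useful when the remote file is updated and you need to keep the version you analysed), base R's download.file() keeps the retrieval step in the script; a sketch with a hypothetical URL and file name:
# record the download itself in the script (url and destfile are examples)
> url <- "http://example.com/data/mydata.csv"
> download.file(url, destfile = "../data/mydata_copy.csv")
> mydata <- read.csv("../data/mydata_copy.csv")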
10 Files from the Internet
Data from a buoy (buoy #44025) off the coast of New Jersey at
http://www.ndbc.noaa.gov/view_text_file.php?filename=44025h2011.txt.gz&dir=data/historical/stdmet/
# to make the code more readable, set a variable to the web address of the file:
> file <- "http://www.ndbc.noaa.gov/view_text_file.php?filename=44025h2011.txt.gz&dir=data/historical/stdmet/"
# data format: look on the web or use:
> readLines(file, n = 5)
[1] "#YY MM DD hh mm WDIR WSPD GST WVHT DPD APD MWD PRES ATMP WTMP DEWP VIS TIDE"
[2] "#yr mo dy hr mn degT m/s m/s m sec sec degT hPa degC degC degC mi ft"
[3] "2010 12 31 23 50 222 7.2 8.5 0.75 4.55 3.72 203 1022.2 6.9 6.7 3.5 99.0 99.00"
[4] "2011 01 01 00 50 233 6.0 6.8 0.76 4.76 3.77 196 1022.2 6.7 6.7 3.7 99.0 99.00"
[5] "2011 01 01 01 50 230 5.0 5.9 0.72 4.55 3.85 201 1021.9 6.8 6.7 3.5 99.0 99.00"
> mydata <- read.table(file, header = F, skip = 2)
> str(mydata)
'data.frame': 6358 obs. of 18 variables:
 $ V1 : int 2010 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
 $ V2 : int 12 1 1 1 1 1 1 1 1 1 ...
 ...
 $ V18: num 99 99 99 99 99 99 99 99 99 99 ...
Would still need some 'tidying'
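A sketch of that tidying, assuming the names in the file's first header line are wanted as column names and that values such as 99/99.00 are missing-data codes (check the station's documentation before relying on this):
# re-read the first header line and use it for column names
> hdr <- scan(file, what = "character", nlines = 1)
> hdr[1] <- sub("#", "", hdr[1])   # drop the leading '#' from "#YY"
> names(mydata) <- hdr
# assumption: 99 is a 'not recorded' code in these columns
> mydata$VIS[mydata$VIS == 99] <- NA
> mydata$TIDE[mydata$TIDE == 99] <- NA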
11 Web scraping
What if data are not in a file but on a webpage?
One solution is to 'scrape' the data using the package rvest (d'ya geddit?? 'harvest') and a Chrome extension called SelectorGadget, which lets you interactively determine which 'css selector' you need to extract the desired components from a page.
To see rvest in action we are going to retrieve the results of a Google Scholar search for Calvin Dytham (user TJUyl1gAAAAJ).
12 Web scraping with rvest
> install.packages("XML")
> install.packages("rvest")
> library(rvest)
> library(magrittr)
> page <- read_html("https://scholar.google.co.uk/citations?user=TJUyl1gAAAAJ&hl=en")
# Specify the css selector in html_nodes() and extract the text with html_text(),
# then change the strings to numeric using as.numeric().
> citations <- page %>% html_nodes("#gsc_a_b .gsc_a_c") %>% html_text() %>% as.numeric()
> years <- page %>% html_nodes("#gsc_a_b .gsc_a_y") %>% html_text() %>% as.numeric()
> citations
 [1] 1228 314 290 265 263 216 200 193 184 180 131 111 110 100 94 87 87 86
[19] 79 76
> years
 [1] 2011 2012 1999 2010 1999 2008 1999 2007 2011 2002 1998 2002 2003 2007 2005 2007 1995 2010
[19] 2009 2009
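A quick follow-up to check the scrape and gather the two vectors into one object, in plain base R:
# combine into a data frame and eyeball the relationship
> papers <- data.frame(year = years, citations = citations)
> head(papers)
> plot(citations ~ year, data = papers)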
13 Data from databases
Many packages:
● For relational databases (Oracle, MSSQL, MySQL): RMySQL, RODBC
● For non-relational databases (MongoDB, Hadoop): rmongodb, rhbase
● For particular specialised research fields, packages for import, tidying and analysis: rentrez, Bioconductor
● rOpenSci: R packages that provide programmatic access to a variety of scientific data, the full text of journal articles, and repositories that provide real-time metrics of scholarly impact.
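As an illustration only, a minimal sketch of querying a MySQL database through the DBI interface with RMySQL; the host, credentials and table name are placeholders:
> library(DBI)
# connect (all connection details here are made up)
> con <- dbConnect(RMySQL::MySQL(), host = "myhost", user = "me",
+                  password = "secret", dbname = "mydb")
# queries come back as ordinary data frames
> mydata <- dbGetQuery(con, "SELECT * FROM mytable")
> dbDisconnect(con)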
14 Resulting data structures
Dataframes or dataframe-like structures (e.g., tibbles)
Specialised: for microarray expression, flow cytometry, image, proteomic and transcriptomic data. These usually have an element which holds the actual data in a dataframe or matrix.
> library(EBImage)
> img1 <- readImage(file1)
> str(img1)
Formal class 'Image' [package "EBImage"] with 2 slots
  ..@ .Data : num [1:768, 1:512] 0.447 0.451 0.463 0.455 0.463 ...
  ..@ colormode: int 0
> display(img1)
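A sketch of pulling the pixel values out of the specialised structure, using EBImage's imageData() accessor (object name as on the slide):
# extract the underlying numeric matrix of intensities
> pixels <- imageData(img1)
> dim(pixels)
> hist(pixels)   # distribution of pixel intensities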
16 And finally ...
Connection to programming: many programming concepts are typically required for data import, for example input and output streams, pattern matching, and loops.
When working with big datasets that take a while to read in, save your workspace (.RData file) and reload that, rather than reading in and tidying the data each time.
Further reading: 'This R Data Import Tutorial Is Everything You Need' and 'Importing Data Into R - Part Two'.
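A sketch of that save-and-reload pattern (object and file names are examples):
# after the slow import and tidying, save the tidied object once
> save(mydata, file = "../data/mydata_tidy.RData")
# in later sessions, reload it instead of re-importing and re-tidying
> load("../data/mydata_tidy.RData")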
17 Summary
Data can be imported from locally or remotely stored files, scraped from webpages, read from local or remote databases, or accessed via APIs.
Tips
● Understand the data format: read the documentation, open plain text files, use readLines()
● Google import errors
● Experiment and test with toy examples
Data structures: mainly dataframes and tibbles; sometimes specialised structures with metadata (e.g., Bioconductor packages). Read documentation and google a lot.