/
Data Import in R Fraud Detection Course - 2019/2020 Nuno Moniz nuno.moniz@fc.up.pt /
Today 1. Data Import in R · 1.1. Datasets · 1.2. Importing · 1.3. Databases · 1.4. Hands-on Data Import Fraud Detection Course 2019/2020 - Nuno Moniz /
Data Import in R /
Datasets · The most common setting: data analysis tasks use source data sets sorted in tabular formats · Datasets are bi-dimensional structures (e.g. table) - Rows are observation a certain phenomenon, and - Columns contain information describing each observation · The most common data structure in R to store such tables is the data frame Today: How to import data stored in di�erent formats/infra-structures into an R data frame Fraud Detection Course 2019/2020 - Nuno Moniz /
Internal R Data Sets · R installations include many data sets. Obtain them by doing the following: data() · New packages may add new datasets: data(package = "DMwR") # datasets from a specific package data(package = .packages(all.available = TRUE)) # from all packages · To load them, use the function data data(iris) head(iris,3) ## Sepal.Length Sepal.Width Petal.Length Petal.Width Species ## 1 5.1 3.5 1.4 0.2 setosa ## 2 4.9 3.0 1.4 0.2 setosa ## 3 4.7 3.2 1.3 0.2 setosa Fraud Detection Course 2019/2020 - Nuno Moniz /
RData File Format · All R objects (including data frames) can be saved in an RData �le: ds <- data.frame(x = rnorm(10), y = rnorm(10)) # generate data frame with random values save(ds, file="dummy.RData") # save the data frame to an RData file · To load the data frame back into R: rm(ds) # remove the variable ds from the R environment ds ## Error in eval(expr, envir, enclos): object 'ds' not found load("dummy.RData") # load the RData file with the data frame head(ds, 3) ## x y ## 1 -1.5090279 0.1893106 ## 2 0.2562916 1.4275355 ## 3 0.1648592 0.1893078 Fraud Detection Course 2019/2020 - Nuno Moniz /
Import Text Files · Text �les: - A common way of storing and sharing data sets, - And one of the easiest also · Examples include: - CSV �les - Tabbed �les - Spaced �les - . . . Fraud Detection Course 2019/2020 - Nuno Moniz /
Importing CSV Files · In CSV �les, the values in each line are separated by commas File "dummy.csv" Import to a data frame: ID, Name, Age ds <- read.csv("dummy.csv", 23424, Ana, 45 header = TRUE) head(ds) 11234, Charles, 23 77654, Susanne, 76 ## X ID Name Age ## 1 1 23424 Ana 45 ## 2 2 11234 Charles 23 ## 3 3 77654 Susanne 76 Fraud Detection Course 2019/2020 - Nuno Moniz /
Importing CSV Files (alternate) · Some countries use the comma as a decimal separator (Portugal does, US doesn't) · An alternate method for importing CSV �les is based on the use of a semi-colon for separating values File "dummy.csv" Import to a data frame: ID, Name, Age ds <- read.csv2("dummy.csv") 23424, Ana, 45 head(ds) 11234, Charles, 23 77654, Susanne, 76 Fraud Detection Course 2019/2020 - Nuno Moniz /
Importing Other Text Files · Files can also be presented as a regular .txt �le File "dummy.csv" ID, Name, Age 23424, Ana, ? 11234, Charles, 23 77654, Susanne, 76 · Import to a data frame: ds <- read.table("dummy.txt", header=TRUE, na.strings="???") · If you want to speci�cy the type of data in each column: ds <- read.table("dummy.txt", header=TRUE, na.strings="???", colClasses=c("integer","character","character")) Fraud Detection Course 2019/2020 - Nuno Moniz /
Summary on Text Formats · Other functions with similar objective - read.table , read.csv , read.csv2 , read.delim , . . . · Relevant Parameters (very important): - sep : indicates the character used as values separator - dec : indicates the character used as decimal separator - header : indicates whether the �rst line contains the header column names - na.strings : indicates the vector of characters interpreted as unknown values · Calling this functions will result in the creation of a data frame · Beware of problem related to character encoding Fraud Detection Course 2019/2020 - Nuno Moniz /
Importing Data from Spreadsheets (Method 1) ODBC connection (Windows) · Communication protocol between databases · Microsoft Excel is able to communicate using this protocol Installation · Install package RODBC Example library(RODBC) fc <- "C:\\Documents and Settings\\xpto\\My Documents\\calc.xls" # made-up cn <- odbcConnectExcel(fc) shs <- sqlTables(cn) dat <- sqlQuery(cn, paste("SELECT * FROM", shs$TABLE_NAME[1])) Fraud Detection Course 2019/2020 - Nuno Moniz /
Importing Data from Spreadsheets (Method 2) Package gdata · This package includes the functions read.xls Installation · Install package gdata · Requires the availability of Perl Example library(gdata) fc <- "c:\\Documents and Settings\\xpto\\My Documents\\calc.xls" dat <- read.xls(fc, sheet=1) # ?read.xls for more examples Fraud Detection Course 2019/2020 - Nuno Moniz /
Importing Data from Spreadsheets (Method 3) Using the clipboard · It's the easiest when importing small data tables Copy... And (sort of) paste ds <- read.table("clipboard", header=TRUE) ds ## ID Name Age ## 1 23424 Ana <NA> ## 2 11234 Charles 23 ## 3 77654 Susanne 76 Fraud Detection Course 2019/2020 - Nuno Moniz /
Importing Data from Spreadsheets (Method 4) Package readxl · This packages includes the function read_excel Example library(readxl) fc <- "c:\\Documents and Settings\\xpto\\My Documents\\calc.xls" dat <- read_excel(fc, sheet=1) Fraud Detection Course 2019/2020 - Nuno Moniz /
Connection to DB's · Take a look at the DBI package · This package provides a front-end interface to DBMS-speci�c drivers Fraud Detection Course 2019/2020 - Nuno Moniz /
Connection to DB's (example) # it's a dummy example - it won't run properly library(DBI) library(RMySQL) drv <- dbDriver("MySQL") # Loading the MySQL driver con <- dbConnect(drv, dbname="transDB", # connecting to the DBMS username="myuser", password="mypasswd", host="localhost") # getting the results of a query as a data frame data <- dbGetQuery(con, "SELECT * FROM clients") dbDisconnect(con) # close connection Fraud Detection Course 2019/2020 - Nuno Moniz /
Connection to DB's (another example) library(DBI) library(RMySQL) drv <- dbDriver("MySQL") # Loading the MySQL driver con <- dbConnect(drv,dbname="transDB", # connecting to the DBMS username="myuser",password="mypasswd", host="localhost") res <- dbSendQuery(con, "SELECT * FROM transactions") while (!dbHasCompleted(res)) { someData <- fetch(res, n = 50) # get the next 50 records on a data frame process(someData) # call some function that handles the current chunk } dbClearResult(res) # clear the results set dbDisconnect(con) # close connection Fraud Detection Course 2019/2020 - Nuno Moniz /
More Data Importation · Databases - R has interfaces to all major DBMS (packages DBI , RMySQL , ROracle , etc.) · Formats from other statistical software - Minitab, S-Plus, SPSS, Stata, SAS, etc. - Packages foreign , Hmisc · For more information, check the R manual on Data Import/Export Fraud Detection Course 2019/2020 - Nuno Moniz /
Hands-on: Importing Data in R · The site UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/machine-learning- databases/audiology/) (click the link) contains a data set of an Audiology problem - Download the dataset audiology.standardized.data and import that data into an R data frame. - Read the information on the web page to �nd out how unknown values are represented and make sure they are properly translated into R nomenclature Fraud Detection Course 2019/2020 - Nuno Moniz /
/
Recommend
More recommend