Literary Data: Some Approaches Andrew Goldstone - PowerPoint PPT Presentation

Literary Data: Some Approaches Andrew Goldstone http://www.rci.rutgers.edu/~ag978/litdata March 12, 2015. Higher-order functions and dplyr.

it depends
▶ every program has dependencies
    ▶ software packages (library)
    ▶ data files (readLines, read.csv, scan, dir …)
▶ good programs document their dependencies clearly at the start
▶ nice programs allow their users to meet dependencies in a controlled fashion

Which is better as a file dependency:
▶ "/Users/agoldst/jockers/data/plainText/austen.txt"
▶ "austen.txt"
▶ "../../../../data/plainText/austen.txt"

file system, once and for all
▶ Every R process has a working directory
▶ RStudio defaults to
    ▶ ~
    ▶ the project directory
    ▶ the containing directory of the file you launched RStudio to open
▶ Knitting starts a new R process whose working directory is the containing directory of the R markdown file

portable dependencies
▶ all file paths relative to the working directory
▶ working directory set to the directory containing the program script
▶ working directory never subsequently modified

Testing portability
▶ for console testing, start each session by setting the working directory once to the script-containing directory
▶ for knitting, do not modify the working directory
▶ read your error messages (including in knit PDFs)
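One way to follow these rules in a script, as a minimal base R sketch. The `data` subdirectory, the file name, and the `check_dependency` helper are hypothetical placeholders, not the course's actual layout.

```r
# Assumption: the working directory is the directory containing this script,
# and the data lives in a hypothetical "data" subdirectory beneath it.
data_dir <- "data"
austen_file <- file.path(data_dir, "austen.txt")  # relative path, so portable

# Hypothetical helper: report a missing file in readable terms
# instead of letting a later read call fail obscurely.
check_dependency <- function (path) {
    if (!file.exists(path)) {
        message("Missing data file: ", path,
                " (working directory is ", getwd(), ")")
        return(FALSE)
    }
    TRUE
}

if (check_dependency(austen_file)) {
    austen <- readLines(austen_file, encoding="UTF-8")
}
```

Because the path is built relative to the working directory, the same script should run unchanged on another machine as long as the directory layout travels with it.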

ha ha ha, character encoding
ef bb bf 54 68 65 20 50 72 6f 6a 65 63 74 20 47
 .  .  .  T  h  e     P  r  o  j  e  c  t     G

Moral
    readLines(filename, encoding="UTF-8")
    read.csv(filename, as.is=T, ..., encoding="UTF-8")
    scan(..., encoding="UTF-8")
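The first three bytes on the slide (ef bb bf) are the UTF-8 byte order mark, which is why they print as dots rather than letters. A small base R sketch, assuming nothing beyond the sixteen bytes shown, that detects the mark and decodes the rest:

```r
# The sixteen bytes shown on the slide.
bytes <- as.raw(c(0xef, 0xbb, 0xbf, 0x54, 0x68, 0x65, 0x20, 0x50,
                  0x72, 0x6f, 0x6a, 0x65, 0x63, 0x74, 0x20, 0x47))

bom <- as.raw(c(0xef, 0xbb, 0xbf))        # UTF-8 byte order mark
has_bom <- identical(bytes[1:3], bom)

# Drop the mark, if present, before converting the remaining bytes to text.
text <- rawToChar(if (has_bom) bytes[-(1:3)] else bytes)
text
```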

first class
▶ function definitions look like assignments because they are
▶ function (...) { ... } is a value like any other

function as parameter
    bind <- function (x, f) {
        f(x)
    }
    bind(c(1, 2, 3), sum)
    [1] 6
    twice <- function (s) {
        str_c(s, s)
    }
    bind("ha", twice)
    [1] "haha"

funny function
    `%p%` <- function (x, y) {
        x + y
    }
    100 %p% 200
    [1] 300
    `%b%` <- function (x, f) {
        f(x)
    }
    "ha" %b% twice
    [1] "haha"

anonymous function
    bind("parenthetical", function (s) {
        str_c("(", s, ")")
    })
    [1] "(parenthetical)"

    map <- function (f, xs) {
        result <- list()
        for (j in seq_along(xs)) {
            result[[j]] <- f(xs[[j]])
        }
        result
    }
    map(twice, c("well", "now", "no"))
    [[1]]
    [1] "wellwell"

    [[2]]
    [1] "nownow"

    [[3]]
    [1] "nono"

    pos <- function (x) (x > 0)
    filter_vector <- function (f, xs) {
        result <- c()
        for (x in xs) {
            if (f(x)) {
                result <- c(result, x)
            }
        }
        result
    }
    filter_vector(pos, (-5):5)
    [1] 1 2 3 4 5

curry
    map_f <- function (f) {
        function (xs) {
            result <- list()
            for (j in seq_along(xs)) {
                result[[j]] <- f(xs[[j]])
            }
            result
        }
    }
    map_f(twice)(c("well", "now", "no"))
    [[1]]
    [1] "wellwell"

    [[2]]
    [1] "nownow"

    [[3]]
    [1] "nono"

▶ how would you write filter_f ?
    filter_f <- function (f) {
        function (xs) {
            result <- c()
            for (x in xs) {
                if (f(x)) {
                    result <- c(result, x)
                }
            }
            result
        }
    }
    filter_f(pos)(c(-1, 1))
    [1] 1

built-in
    lapply(lst, f) # same as map(f, lst)
    lapply(lst, f, x, y, ...) # ... passed on to f:
    # inside the for loop: result[[j]] <- f(xs[[j]], x, y, ...)
▶ sapply (returns a vector if possible)
▶ apply (iterate over rows/columns of a matrix)
▶ tapply (iterate over groups identified by a factor)
▶ what a mess!
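A short base R sketch of how these built-ins line up with the hand-written map and filter_vector from the earlier slides. As an assumption for portability, `paste0` stands in for stringr's `str_c` so that no packages are needed; `Filter` is base R's own counterpart to filter_vector.

```r
# Assumption: paste0 replaces str_c so the sketch runs without stringr.
twice <- function (s) { paste0(s, s) }
pos <- function (x) (x > 0)

words <- c("well", "now", "no")

as_list <- lapply(words, twice)                      # always a list, like map
as_vector <- sapply(words, twice, USE.NAMES=FALSE)   # simplified to a character vector
positives <- Filter(pos, (-5):5)                     # keeps elements where pos is TRUE
padded <- lapply(words, paste0, "!")                 # extra arguments are passed on to f
```

Here `as_list` has the same shape as the output of `map(twice, words)`, while `sapply` collapses the same results into a plain vector when it can.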

dplyr: split-apply-combine
▶ split a data frame up into pieces (rows, groups of rows…)
▶ do something to each piece
▶ put together the result
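The three steps can be spelled out by hand in base R before handing them over to dplyr. This is only a sketch on a hypothetical toy data frame (`toy` and its values are invented for illustration, not taken from the course data):

```r
# Hypothetical toy data, not the course's laureates table.
toy <- data.frame(decade=c("2000", "2000", "2010", "2010"),
                  age=c(60, 70, 80, 90),
                  stringsAsFactors=FALSE)

# split: one data frame per decade
pieces <- split(toy, toy$decade)

# apply: summarize each piece
avg_ages <- sapply(pieces, function (p) mean(p$age))

# combine: put the per-piece results back into one data frame
result <- data.frame(decade=names(avg_ages),
                     avg_age=unname(avg_ages),
                     stringsAsFactors=FALSE)
```

dplyr's group_by and summarize perform the same split, apply, and combine steps in a single pipeline.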

No new functionality (but…)
    select      column indexing by name (or $)
    filter      logical row subscripting
    arrange     order
    mutate      expressions in terms of columns
    summarize   for loops, table, sum, mean …

    surnames <- select(laureates, surname)
    head(surnames) # data type!
           surname
    1      Modiano
    2        Munro
    3          Yan
    4  Tranströmer
    5 Vargas Llosa
    6       Müller
    women <- filter(laureates, gender=="female")
    women$surname[1:3]
    [1] "Munro"   "Müller"  "Lessing"

(notionally)
    women <- filter(laureates, function (laur_row) {
        laur_row$gender == "female"
    })

modularity, but…
    women_surnames <- select(filter(laureates, gender=="female"),
                             surname, year) # ugh

curiouser and curiouser
    women_last <- laureates %>%
        filter(gender=="female") %>%
        select(surname, year)
    women_last[1:3, ]
      surname year
    1   Munro 2013
    2  Müller 2009
    3 Lessing 2007

“pipe”
    x %>% f # equivalent to...
    x %b% f # or...
    f(x)
    x %>% f(y, z) # equivalent to...
    f(x, y, z) # wow!?

a special function
    laureates %>%
        filter(gender=="female") %>%
        summarize(total=length(surname)) # think about it...
      total
    1    12
    laureates %>%
        filter(gender=="female") %>%
        summarize(total=n())
      total
    1    12

    laureates %>%
        filter(!is.na(diedCountryCode)) %>%
        filter(bornCountryCode != diedCountryCode) %>%
        summarize(n_exiles=n())
      n_exiles
    1       34

    f_sws <- laureates %>%
        filter(gender=="female" | bornCountry=="Sweden") %>%
        select(firstname, surname) %>%
        slice(1:5)
    f_sws
      firstname     surname
    1     Alice       Munro
    2     Tomas Tranströmer
    3     Herta      Müller
    4     Doris     Lessing
    5  Elfriede     Jelinek
    f_sws %>%
        mutate(fullname=str_c(firstname, surname, sep=" "))
      firstname     surname          fullname
    1     Alice       Munro       Alice Munro
    2     Tomas Tranströmer Tomas Tranströmer
    3     Herta      Müller      Herta Müller
    4     Doris     Lessing     Doris Lessing
    5  Elfriede     Jelinek  Elfriede Jelinek

    f_sws %>%
        mutate(fullname=str_c(firstname, surname, sep=" ")) %>%
        select(-firstname) # minus!!
          surname          fullname
    1       Munro       Alice Munro
    2 Tranströmer Tomas Tranströmer
    3      Müller      Herta Müller
    4     Lessing     Doris Lessing
    5     Jelinek  Elfriede Jelinek

    f_sws %>%
        mutate(fullname=str_c(firstname, surname, sep=" ")) %>%
        arrange(surname) %>%
        select(fullname)
               fullname
    1  Elfriede Jelinek
    2     Doris Lessing
    3       Alice Munro
    4      Herta Müller
    5 Tomas Tranströmer

    laur_ages <- laureates %>%
        filter(born != "0000-00-00") %>% # Mo Yan
        mutate(born_year=as.numeric(str_sub(born, 1, 4))) %>%
        mutate(age=year - born_year) %>%
        mutate(decade=str_c(str_sub(year, 1, 3), 0)) %>%
        select(surname, year, decade, age)
    laur_ages %>% slice(1:3)
          surname year decade age
    1     Modiano 2014   2010  69
    2       Munro 2013   2010  82
    3 Tranströmer 2011   2010  80
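The decade column comes from a string trick: keep the first three digits of the year and append a zero. A base R sketch of that step on hypothetical example years, with `substr` and `paste0` standing in (as an assumption) for stringr's `str_sub` and `str_c`, alongside an arithmetic alternative:

```r
year <- c(2014, 2009, 1987)   # hypothetical example years

# String approach, mirroring the slide: first three digits, then "0".
decade_str <- paste0(substr(as.character(year), 1, 3), "0")

# Arithmetic alternative: integer-divide by 10, then multiply back.
decade_num <- (year %/% 10) * 10
```

The string version yields character values and the arithmetic version yields numbers; either works as a grouping key, but the two should not be mixed within one column.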

split
    laur_ages %>%
        group_by(decade) # no-op?
    Source: local data frame [106 x 4]
    Groups: decade

            surname year decade age
    1       Modiano 2014   2010  69
    2         Munro 2013   2010  82
    3   Tranströmer 2011   2010  80
    4  Vargas Llosa 2010   2010  74
    5        Müller 2009   2000  56
    6     Le Clézio 2008   2000  68
    7       Lessing 2007   2000  88
    8         Pamuk 2006   2000  54
    9        Pinter 2005   2000  75
    10      Jelinek 2004   2000  58
    ..          ...  ...    ... ...

apply-combine
    laur_ages %>%
        group_by(decade) %>%
        summarize(avg_age=mean(age))
    Source: local data frame [12 x 2]

       decade  avg_age
    1    1900 64.11111
    2    1910 58.87500
    3    1920 60.10000
    4    1930 56.44444
    5    1940 64.33333
    6    1950 63.70000
    7    1960 66.20000
    8    1970 67.00000
    9    1980 67.60000
    10   1990 67.50000
    11   2000 66.40000
    12   2010 76.25000

input…
    txl <- read.csv("three-percent.csv", as.is=T,
                    encoding="UTF-8")
    nrow(txl)
    [1] 3187
    colnames(txl) # N.B.
     [1] "ISBN"         "Titles"       "AuthorFN"
     [4] "AuthorLN"     "TranslatorFN" "TranslatorLN"
     [7] "Publisher"    "Genre"        "Price"
    [10] "Month"        "Year"         "Lanuage"
    [13] "Country"      "OtherFN"      "OtherLN"
    [16] "Other.Role"

    txl <- txl %>%
        filter(Year < 2015) %>%
        select(ISBN, Year, Country, Genre, Publisher, Language=Lanuage)

split, apply, combine
    txl %>%
        group_by(Year, Genre) %>%
        summarize(count=n())
    Source: local data frame [14 x 3]
    Groups: Year

       Year   Genre count
    1  2008 Fiction   278
    2  2008  Poetry    82
    3  2009 Fiction   291
    4  2009  Poetry    72
    5  2010 Fiction   265
    6  2010  Poetry    93
    7  2011 Fiction   303
    8  2011  Poetry    67
    9  2012 Fiction   387
    10 2012  Poetry    73
    11 2013 Fiction   448
    12 2013  Poetry    93
    13 2014 Fiction   494
    14 2014  Poetry    75


