
Literary Data: Some Approaches. Andrew Goldstone

Literary Data: Some Approaches. Andrew Goldstone. http://www.rci.rutgers.edu/~ag978/litdata March 12, 2015. Higher-order functions and dplyr.


  1. Literary Data: Some Approaches Andrew Goldstone http://www.rci.rutgers.edu/~ag978/litdata March 12, 2015. Higher-order functions and dplyr.

  2. it depends
     ▶ every program has dependencies
       ▶ software packages (library)
       ▶ data files (readLines, read.csv, scan, dir …)
     ▶ good programs document their dependencies clearly at the start
     ▶ nice programs allow their users to meet dependencies in a controlled fashion
     Which is better as a file dependency:
     ▶ "/Users/agoldst/jockers/data/plainText/austen.txt"
     ▶ "austen.txt"
     ▶ "../../../../data/plainText/austen.txt"

  3. file system, once and for all
     ▶ Every R process has a working directory
     ▶ RStudio defaults to
       ▶ ~
       ▶ the project directory
       ▶ the containing directory of the file you launched RStudio to open
     ▶ Knitting starts a new R process whose working directory is the containing directory of the R markdown file

  4. portable dependencies
     ▶ all file paths relative to the working directory
     ▶ working directory set to the directory containing the program script
     ▶ working directory never subsequently modified
     Testing portability
     ▶ for console testing, start each session by setting the working directory once to the script-containing directory
     ▶ for knitting, do not modify the working directory
     ▶ read your error messages (including in knit PDFs)
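
One way to make a file dependency fail loudly is sketched below. The helper name `require_file` is hypothetical (it is not part of the slides or of base R); it only wraps `file.exists` and reports the working directory when a relative path cannot be resolved.

```r
# Hypothetical helper: stop early, with a readable message, if a data file
# is not where the script expects it relative to the working directory.
require_file <- function (path) {
    if (!file.exists(path)) {
        stop("Cannot find ", path, " from working directory ", getwd())
    }
    path
}

# Usage, assuming a file named austen.txt sits next to the script:
# austen <- readLines(require_file("austen.txt"), encoding="UTF-8")
```

Because the path stays relative, the same script runs unchanged on another machine as long as the working directory is the script's own directory.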

  5. ha ha ha, character encoding
         ef bb bf 54 68 65 20 50 72 6f 6a 65 63 74 20 47
         .  .  .  T  h  e     P  r  o  j  e  c  t     G
     Moral
         scan(..., encoding="UTF-8")
         readLines(filename, encoding="UTF-8")
         read.csv(filename, as.is=T, ..., encoding="UTF-8")
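
The three leading bytes on the slide (ef bb bf) are the UTF-8 byte order mark. The sketch below writes the slide's sixteen bytes to a temporary file and reads them back. It assumes the `"UTF-8-BOM"` encoding name accepted by base R's `file()` connections, which is a different route from the `encoding="UTF-8"` argument shown on the slide.

```r
# The sixteen bytes shown on the slide.
bytes <- as.raw(c(0xef, 0xbb, 0xbf, 0x54, 0x68, 0x65, 0x20, 0x50,
                  0x72, 0x6f, 0x6a, 0x65, 0x63, 0x74, 0x20, 0x47))
tmp <- tempfile(fileext=".txt")
writeBin(bytes, tmp)

# Inspect the raw start of the file: these three bytes are the byte order mark.
first3 <- readBin(tmp, what="raw", n=3)

# Read through a connection declared as UTF-8 with a byte order mark,
# so that the mark is dropped from the text.
con <- file(tmp, encoding="UTF-8-BOM")
txt <- readLines(con, warn=FALSE)  # warn=FALSE: the file has no final newline
close(con)
```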

  7. first class
     ▶ function definitions look like assignments because they are
     ▶ function (...) { ... } is a value like any other
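
A small sketch of what "a value like any other" allows: functions can be stored in a list and handed to other functions. The names `ops` and `results` are illustrative only.

```r
# Functions stored in a list, like any other values.
ops <- list(total=sum, biggest=max)

# Apply each stored function to the same vector.
results <- lapply(ops, function (f) f(c(1, 2, 3)))
```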

  8. function as parameter
         bind <- function (x, f) {
             f(x)
         }
         bind(c(1, 2, 3), sum)
         [1] 6
         twice <- function (s) {
             str_c(s, s)
         }
         bind("ha", twice)
         [1] "haha"

  9. funny function
         `%p%` <- function (x, y) {
             x + y
         }
         100 %p% 200
         [1] 300
         `%b%` <- function (x, f) {
             f(x)
         }
         "ha" %b% twice
         [1] "haha"

  11. anonymous function
          bind("parenthetical",
               function (s) {
                   str_c("(", s, ")")
               }
          )
          [1] "(parenthetical)"

  12. map <- function (f, xs) {
          result <- list()
          for (j in seq_along(xs)) {
              result[[j]] <- f(xs[[j]])
          }
          result
      }
      map(twice, c("well", "now", "no"))
      [[1]]
      [1] "wellwell"

      [[2]]
      [1] "nownow"

      [[3]]
      [1] "nono"

  13. pos <- function (x) (x > 0)
      filter_vector <- function (f, xs) {
          result <- c()
          for (x in xs) {
              if (f(x)) {
                  result <- c(result, x)
              }
          }
          result
      }
      filter_vector(pos, (-5):5)
      [1] 1 2 3 4 5

  14. curry
          map_f <- function (f) {
              function (xs) {
                  result <- list()
                  for (j in seq_along(xs)) {
                      result[[j]] <- f(xs[[j]])
                  }
                  result
              }
          }
          map_f(twice)(c("well", "now", "no"))
          [[1]]
          [1] "wellwell"

          [[2]]
          [1] "nownow"

          [[3]]
          [1] "nono"

  15. ▶ how would you write filter_f ?
          filter_f <- function (f) {
              function (xs) {
                  result <- c()
                  for (x in xs) {
                      if (f(x)) {
                          result <- c(result, x)
                      }
                  }
                  result
              }
          }
          filter_f(pos)(c(-1, 1))
          [1] 1

  17. built-in
          lapply(lst, f) # same as map(f, lst)
          lapply(lst, f, x, y, ...) # ... passed on to f:
          # inside the for loop:
          result[[j]] <- f(xs[[j]], x, y, ...)
      ▶ sapply (returns a vector if possible)
      ▶ apply (iterate over rows/columns of a matrix)
      ▶ tapply (iterate over groups identified by a factor)
      ▶ what a mess!
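
The built-ins listed above can be compared side by side on toy inputs. This is only a sketch in base R; the vectors and the matrix are made-up illustration values, not data from the slides.

```r
words <- c("well", "now", "no")

# lapply always returns a list; sapply simplifies to a vector when it can.
lengths_list <- lapply(words, nchar)
lengths_vec <- sapply(words, nchar)   # named by the input strings

# Extra arguments after the function are passed on to it.
shouted <- lapply(words, paste, "!", sep="")

# apply iterates over rows (1) or columns (2) of a matrix.
m <- matrix(1:6, nrow=2)
row_sums <- apply(m, 1, sum)

# tapply iterates over groups identified by a factor.
ages <- c(69, 82, 80, 56)
decades <- factor(c("2010", "2010", "2010", "2000"))
mean_by_decade <- tapply(ages, decades, mean)
```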

  18. dplyr: split-apply-combine
      ▶ split a data frame up into pieces (rows, groups of rows…)
      ▶ do something to each piece
      ▶ put together the result
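
The three steps can be sketched without dplyr, using base R's `split`, `lapply`, and `rbind`. The data frame `df` holds made-up toy values and is an assumption of this example.

```r
# Toy data frame (made-up values for illustration).
df <- data.frame(decade=c("2000", "2010", "2010"),
                 age=c(56, 69, 82),
                 stringsAsFactors=FALSE)

# split: one data frame per decade
pieces <- split(df, df$decade)

# apply: compute one summary row for each piece
results <- lapply(pieces, function (piece) {
    data.frame(decade=piece$decade[1],
               avg_age=mean(piece$age),
               stringsAsFactors=FALSE)
})

# combine: stack the summary rows back into one data frame
combined <- do.call(rbind, results)
```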

  19. No new functionality (but…)
          arrange    order
          select     column indexing by name (or $)
          filter     logical row subscripting
          mutate     expressions in terms of columns
          summarize  for loops, table, sum, mean …
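
To make the correspondence concrete, here is a sketch of the base R idiom behind each verb, with the dplyr call it stands in for noted in a comment. The data frame `toy` and its column values are assumptions made for this example.

```r
# Toy data frame (illustration values only).
toy <- data.frame(surname=c("Munro", "Modiano", "Lessing"),
                  year=c(2013, 2014, 2007),
                  gender=c("female", "male", "female"),
                  stringsAsFactors=FALSE)

arranged <- toy[order(toy$year), ]                    # arrange(toy, year)
selected <- toy["surname"]                            # select(toy, surname)
women <- toy[toy$gender == "female", ]                # filter(toy, gender=="female")
mutated <- transform(toy, decade=year - year %% 10)   # mutate(toy, decade=...)
summarized <- data.frame(total=nrow(toy))             # summarize(toy, total=n())
```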

  20. surnames <- select(laureates, surname)
      head(surnames) # data type!
             surname
      1      Modiano
      2        Munro
      3          Yan
      4  Tranströmer
      5 Vargas Llosa
      6       Müller
      women <- filter(laureates, gender=="female")
      women$surname[1:3]
      [1] "Munro"   "Müller"  "Lessing"

  21. (notionally)
          women <- filter(laureates, function (laur_row) {
              laur_row$gender == "female"
          })

  22. modularity, but…
          women_surnames <- select(filter(laureates, gender=="female"),
                                   surname, year) # ugh

  23. curiouser and curiouser
          women_last <- laureates %>%
              filter(gender=="female") %>%
              select(surname, year)
          women_last[1:3, ]
            surname year
          1   Munro 2013
          2  Müller 2009
          3 Lessing 2007

  24. “pipe”
          x %>% f # equivalent to...
          x %b% f # or...
          f(x)
          x %>% f(y, z) # equivalent to...
          f(x, y, z) # wow!?
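
The equivalences on this slide can be checked directly. The sketch below assumes the magrittr package is installed (it supplies the `%>%` operator that dplyr re-exports); the vector `xs` and the strings are illustration values.

```r
library(magrittr)  # provides %>%; assumed to be installed

xs <- c(1, 4, 9)

# x %>% f is f(x)
piped <- xs %>% sqrt
direct <- sqrt(xs)

# x %>% f(y, z) is f(x, y, z): the left side becomes the first argument
piped_args <- "ha" %>% paste("ha", sep="-")
direct_args <- paste("ha", "ha", sep="-")

# Chains read left to right instead of inside out.
total <- xs %>% sqrt %>% sum
```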

  25. a special function
          laureates %>%
              filter(gender=="female") %>%
              summarize(total=length(surname)) # think about it...
            total
          1    12
          laureates %>%
              filter(gender=="female") %>%
              summarize(total=n())
            total
          1    12

  27. laureates %>%
          filter(!is.na(diedCountryCode)) %>%
          filter(bornCountryCode != diedCountryCode) %>%
          summarize(n_exiles=n())
        n_exiles
      1       34

  28. f_sws <- laureates %>%
          filter(gender=="female" | bornCountry=="Sweden") %>%
          select(firstname, surname) %>%
          slice(1:5)
      f_sws
        firstname     surname
      1     Alice       Munro
      2     Tomas Tranströmer
      3     Herta      Müller
      4     Doris     Lessing
      5  Elfriede     Jelinek
      f_sws %>%
          mutate(fullname=str_c(firstname, surname, sep=" "))
        firstname     surname          fullname
      1     Alice       Munro       Alice Munro
      2     Tomas Tranströmer Tomas Tranströmer
      3     Herta      Müller      Herta Müller
      4     Doris     Lessing     Doris Lessing
      5  Elfriede     Jelinek  Elfriede Jelinek

  29. f_sws %>%
          mutate(fullname=str_c(firstname, surname, sep=" ")) %>%
          select(-firstname) # minus!!
            surname          fullname
      1       Munro       Alice Munro
      2 Tranströmer Tomas Tranströmer
      3      Müller      Herta Müller
      4     Lessing     Doris Lessing
      5     Jelinek  Elfriede Jelinek

  30. f_sws %>%
          mutate(fullname=str_c(firstname, surname, sep=" ")) %>%
          arrange(surname) %>%
          select(fullname)
                 fullname
      1  Elfriede Jelinek
      2     Doris Lessing
      3       Alice Munro
      4      Herta Müller
      5 Tomas Tranströmer

  31. laur_ages <- laureates %>%
          filter(born != "0000-00-00") %>% # Mo Yan
          mutate(born_year=as.numeric(str_sub(born, 1, 4))) %>%
          mutate(age=year - born_year) %>%
          mutate(decade=str_c(str_sub(year, 1, 3), 0)) %>%
          select(surname, year, decade, age)
      laur_ages %>% slice(1:3)
            surname year decade age
      1     Modiano 2014   2010  69
      2       Munro 2013   2010  82
      3 Tranströmer 2011   2010  80
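
The column arithmetic in this pipeline can be traced step by step on plain vectors. The sketch uses base R's `substr` and `paste0` in place of `str_sub` and `str_c`, and the `born` and `year` values are toy inputs chosen for illustration, not rows read from the laureates data.

```r
# Toy inputs: one placeholder birth date, as in the filter step above.
born <- c("1945-07-30", "0000-00-00", "1931-07-10")
year <- c(2014, 2012, 2013)

# Drop the placeholder date before doing arithmetic on it.
keep <- born != "0000-00-00"

# First four characters of the date string give the birth year.
born_year <- as.numeric(substr(born[keep], 1, 4))
age <- year[keep] - born_year

# First three digits of the prize year plus "0" give the decade label.
decade <- paste0(substr(year[keep], 1, 3), "0")
```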

  32. split
          laur_ages %>%
              group_by(decade) # no-op?
          Source: local data frame [106 x 4]
          Groups: decade

                  surname year decade age
          1       Modiano 2014   2010  69
          2         Munro 2013   2010  82
          3   Tranströmer 2011   2010  80
          4  Vargas Llosa 2010   2010  74
          5        Müller 2009   2000  56
          6     Le Clézio 2008   2000  68
          7       Lessing 2007   2000  88
          8         Pamuk 2006   2000  54
          9        Pinter 2005   2000  75
          10      Jelinek 2004   2000  58
          ..          ...  ...    ... ...

  33. apply-combine
          laur_ages %>%
              group_by(decade) %>%
              summarize(avg_age=mean(age))
          Source: local data frame [12 x 2]

             decade  avg_age
          1    1900 64.11111
          2    1910 58.87500
          3    1920 60.10000
          4    1930 56.44444
          5    1940 64.33333
          6    1950 63.70000
          7    1960 66.20000
          8    1970 67.00000
          9    1980 67.60000
          10   1990 67.50000
          11   2000 66.40000
          12   2010 76.25000

  34. input…
          txl <- read.csv("three-percent.csv", as.is=T,
                          encoding="UTF-8") # N.B.
          nrow(txl)
          [1] 3187
          colnames(txl)
           [1] "ISBN"         "Titles"       "AuthorFN"
           [4] "AuthorLN"     "TranslatorFN" "TranslatorLN"
           [7] "Publisher"    "Genre"        "Price"
          [10] "Month"        "Year"         "Lanuage"
          [13] "Country"      "OtherFN"      "OtherLN"
          [16] "Other.Role"

  35. txl <- txl %>%
          filter(Year < 2015) %>%
          select(ISBN, Year, Country, Genre, Publisher, Language=Lanuage)

  36. split, apply, combine
          txl %>%
              group_by(Year, Genre) %>%
              summarize(count=n())
          Source: local data frame [14 x 3]
          Groups: Year

             Year   Genre count
          1  2008 Fiction   278
          2  2008  Poetry    82
          3  2009 Fiction   291
          4  2009  Poetry    72
          5  2010 Fiction   265
          6  2010  Poetry    93
          7  2011 Fiction   303
          8  2011  Poetry    67
          9  2012 Fiction   387
          10 2012  Poetry    73
          11 2013 Fiction   448
          12 2013  Poetry    93
          13 2014 Fiction   494
          14 2014  Poetry    75
