Literary Data: Some Approaches
Andrew Goldstone
http://www.rci.rutgers.edu/~ag978/litdata
April 30, 2015
Named entity recognition
one tool does not do it all

▶ NLP and openNLP packages let you interface with the Apache OpenNLP Java software (and theoretically Stanford CoreNLP too). These packages are not good.
▶ much useful software lacks R glue
▶ but some of it can be used from the command line…and thus via R
This is a Unix system! I know this!

▶ think of a program as a function:

program <- function (input, arg1, arg2, ...) {
    ... # make calculations
    ... # possible side effects
    output
}

▶ in the shell, this looks like

program arg1 arg2 ...
example

echo "Hello Raptor"
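The program-as-function analogy can be made literal: a shell function is itself a small program with arguments, stdin, and stdout. A minimal sketch (`shout` is an invented example, not part of openNLP):

```shell
# a shell function behaves like the diagram: input on stdin, result on stdout
# (`shout` is hypothetical, for illustration only)
shout () {
    tr '[:lower:]' '[:upper:]'   # the "calculation": uppercase the input
}
printf 'hello raptor\n' | shout
# prints: HELLO RAPTOR
```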
where did input and output go?

standard input: a source, which might represent the user typing
standard output: a target, which might represent the console
redirection

▶ can replace stdin and stdout with files (no more typing)

wc < sheik.txt > sheik_wc.txt

▶ can connect stdout of one program to stdin of the next one

echo "Hello Raptor" | wc

▶ shell analogue of dplyr pipelines:

txl %>% filter(Genre == "poetry") %>% arrange(Price)
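Both redirections can be tried with any ordinary command. A small self-contained sketch using `wc -w` (the `greeting.txt` filename is invented for the demo):

```shell
printf 'Hello Raptor\n' > greeting.txt   # stdout redirected to a file
wc -w < greeting.txt                     # stdin redirected from the file
printf 'Hello Raptor\n' | wc -w          # pipe: one stdout feeds the next stdin
rm greeting.txt
```

Both `wc -w` invocations count the same two words; only the plumbing differs.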
openNLP command line

opennlp SomeTool arg ...

▶ takes text input on stdin, sends results to stdout
▶ arg is the name of an auxiliary data file SomeTool needs

SentenceDetector    split text into sentences
TokenizerME         split sentences (N.B.) into words
TokenNameFinder     Named Entity Recognition (persons, places…)
POSTagger           part-of-speech labeling
openNLP pipeline

opennlp SentenceDetector en-sent.bin < sheik.txt \
    > sheik_sent.txt
opennlp TokenizerME en-tok.bin < sheik_sent.txt \
    > sheik_words.txt
opennlp POSTagger en-pos.bin < sheik_words.txt \
    > sheik_pos.txt

head sheik_pos.txt

"_`` Are_VBP you_PRP coming_VBG in_IN to_TO watch_VB the_DT
dancing_NN ,_, Lady_NNP Conway_NNP ?_. "_''
"_`` I_PRP most_RBS decidedly_RB am_VBP not_RB ._.
simplify the pipeline

opennlp SentenceDetector en-sent.bin < sheik.txt | \
    opennlp TokenizerME en-tok.bin | \
    opennlp POSTagger en-pos.bin > sheik_pos.txt
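The shape of such a pipeline — each stage reading the previous stage's stdout — works with any filters, not just opennlp. A toy three-stage analogue with everyday tools:

```shell
# three stages chained by pipes, same shape as the opennlp pipeline
printf 'Hello Raptor\n' |
    tr '[:upper:]' '[:lower:]' |   # stage 1: lowercase
    tr ' ' '\n' |                  # stage 2: one word per line
    wc -l                          # stage 3: count the lines (2)
```

No intermediate files are written; the shell hands each stage's output straight to the next.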
the command line from R

system(s)           # run command s in current working directory
system(s, intern=T) # return stdout as a character vector
getting the stuff

install.packages("openNLP")
install.packages("openNLPmodels.en",
                 repos="http://datacube.wu.ac.at")

brew install apache-opennlp # Mac, requires homebrew
# I'm working on a setup for the VM
composing a command

sent_model_file <- system.file("models",
    "en-sent.bin", package = "openNLPdata")
token_model_file <- system.file("models",
    "en-token.bin", package = "openNLPdata")

ner_command <- function (input_file, entity_type) {
    entity_model_file <- system.file("models",
        str_c("en-ner-", entity_type, ".bin"),
        package = "openNLPmodels.en")
    str_c("opennlp SentenceDetector ", sent_model_file,
          " < ", input_file,
          " | opennlp TokenizerME ", token_model_file,
          " | opennlp TokenNameFinder ", entity_model_file)
}
doing the deed

# Stowe, Uncle Tom's Cabin, vol. 1
input_file <- "wright_body/VAC7958.txt"
stowe <- system(ner_command(input_file, "location"), intern=T)
head(stowe)

[1] "LATE in the afternoon of a chilly day in February , two gentlemen were sitting"
[2] "alone over their wine , in a well-furnished dining parlor , in the town of <START:location> P—— <END> , in"
[3] "<START:location> Kentucky <END> ."
[4] "There were no servants present , and the gentlemen , with chairs closely"
[5] "approaching , seemed to be discussing some subject with great earnestness ."
[6] "For convenience sake , we have said , hitherto , two gentlemen ."
re-parsing the results

extract_locs <- function (tagged) {
    unlist(str_extract_all(tagged,
            perl("<START:location> .*? <END>"))) %>%
        str_replace("^<START:location> ", "") %>%
        str_replace(" <END>$", "")
}
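A rough shell analogue of `extract_locs`, using `grep -o` and `sed` on a made-up tagged line (`[^<]*` stands in for the non-greedy `.*?`):

```shell
# extract <START:location> ... <END> spans, then strip the tags
# (the input line is invented to mimic TokenNameFinder output)
printf 'in <START:location> Kentucky <END> .\n' |
    grep -o '<START:location> [^<]* <END>' |
    sed -e 's/^<START:location> //' -e 's/ <END>$//'
# prints: Kentucky
```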
corpus

locs_frame <- function (fs) {
    data_frame(filename=fs) %>%
        mutate(cmd=ner_command(filename, "location")) %>%
        group_by(filename) %>%
        do({
            message("NER on ", .$filename)
            locs <- extract_locs(system(.$cmd, intern=T))
            if (length(locs) == 0) locs <- NA
            data_frame(loc=locs)
        }) %>%
        group_by(filename, loc) %>%
        summarize(count=n())
}
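The run-per-file-then-tally logic of `locs_frame` can be sketched in the shell as well. The file names and contents below are invented, and the grep/sed stage stands in for the opennlp call:

```shell
# make a tiny fake corpus
mkdir -p demo_body
printf 'in <START:location> Kentucky <END> .\n' > demo_body/a.txt
printf 'to <START:location> Paris <END> and <START:location> Paris <END> .\n' > demo_body/b.txt

# for each file: extract location spans, count occurrences per file
tally=$(for f in demo_body/*.txt; do
    grep -o '<START:location> [^<]* <END>' "$f" |
        sed -e 's/^<START:location> //' -e 's/ <END>$//' |
        sort | uniq -c | awk -v f="$f" '{print f, $2, $1}'
done)
printf '%s\n' "$tally"

rm -r demo_body
```

Each output line is a (filename, location, count) triple, the same shape as the data frame `locs_frame` returns.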
fs <- Sys.glob("wright_body/*.txt")

if (!file.exists("wright_locs.tsv")) {
    locs_frame(fs) %>% # takes a while
        write.table("wright_locs.tsv", sep="\t",
                    col.names=T, row.names=F, quote=F)
}
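This compute-once, cache-to-file idiom reads the same way in shell; a sketch (the `demo_wc.txt` name is invented):

```shell
# recompute only if the cached file is missing, then read it back
if [ ! -f demo_wc.txt ]; then
    printf 'Hello Raptor\n' | wc -w > demo_wc.txt   # the slow step
fi
cached=$(tr -d ' ' < demo_wc.txt)
echo "$cached"
rm demo_wc.txt
# prints: 2
```

Re-running the script skips the slow step as long as the cache file exists.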
locs <- read.table("wright_locs.tsv", sep="\t",
    comment.char="", header=T, as.is=T, quote="")

locs <- locs %>%
    group_by(loc) %>%
    filter(sum(count) > 5, n() > 2) # Wilkens's filter
merge with metadata

meta <- read.table("wright_meta.tsv", sep="\t",
    as.is=T, header=T, quote="") %>%
    mutate(filename=file.path("wright_body", file)) %>%
    select(filename, pubplace) %>%
    mutate(pubplace=str_trim(pubplace))
# (but the place names need more cleaning than this)

locs <- inner_join(locs, meta, by="filename")
preliminary peek

top_locs <- locs %>%
    group_by(pubplace, loc) %>%
    # but just what have we been counting here?
    summarize(count=sum(count)) %>%
    filter(pubplace %in% c("Boston", "New York")) %>%
    top_n(5) %>%
    arrange(desc(count)) %>%
    rename(published=pubplace, mentioned=loc)
hmmm

print_tabular(top_locs)

published   mentioned    count
Boston      New York      3493
Boston      Boston        3273
Boston      England       2716
Boston      Florence      2309
Boston      Paris         1781
New York    New York     13866
New York    England       7800
New York    Paris         5463
New York    London        5131
New York    Washington    4852
what could come next: geospatial data and visualization

▶ Lincoln Mullen's extensive draft chapters
  lincolnmullen.com/projects/dh-r/geospatial-data.html
  lincolnmullen.com/projects/dh-r/mapping.html
▶ Bivand, Roger, et al. Applied Spatial Data Analysis with R. 2nd ed. New York: Springer, 2013. DOI: 10.1007/978-1-4614-7618-4.
▶ ggmap package (geocoding)
▶ sp and many more R packages (spatial analysis)
▶ GIS libraries and command-line tools (not R): GDAL, GEOS, …
I saw something shiny

install.packages("tmap") # new on CRAN; needs GDAL, GEOS
library("tmap")
vignette("tmap-nutshell") # some documentation
next