Literary Data: Some Approaches
Andrew Goldstone
http://www.rci.rutgers.edu/~ag978/litdata
April 30, 2015
Named entity recognition
one tool does not do it all

▶ NLP and openNLP packages let you interface with the Apache OpenNLP Java software (and theoretically Stanford CoreNLP too). These packages are not good.
▶ much useful software lacks R glue
▶ but some of it can be used from the command line…and thus via R
This is a Unix system! I know this!

▶ think of a program as a function:

program <- function (input, arg1, arg2, ...) {
    ... # make calculations
    ... # possible side effects
    output
}

▶ in the shell, this looks like

program arg1 arg2 ...
example

echo "Hello Raptor"
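The program-as-function analogy can be made literal: a shell function is itself a small program with arguments, stdin, and stdout. A minimal sketch (`shout` is an invented example, not part of openNLP):

```shell
# a shell function behaves like the diagram: input on stdin, result on stdout
# (`shout` is hypothetical, for illustration only)
shout () {
    tr '[:lower:]' '[:upper:]'   # the "calculation": uppercase the input
}
printf 'hello raptor\n' | shout
# prints: HELLO RAPTOR
```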
where did input and output go?

standard input: a source, which might represent the user typing
standard output: a target, which might represent the console
redirection

▶ can replace stdin and stdout with files (no more typing)

wc < sheik.txt > sheik_wc.txt

▶ can connect stdout of one program to stdin of the next one

echo "Hello Raptor" | wc

▶ shell analogue of dplyr pipelines:

txl %>% filter(Genre == "poetry") %>% arrange(Price)
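Both redirections can be tried with any ordinary command. A small self-contained sketch using `wc -w` (the `greeting.txt` filename is invented for the demo):

```shell
printf 'Hello Raptor\n' > greeting.txt   # stdout redirected to a file
wc -w < greeting.txt                     # stdin redirected from the file
printf 'Hello Raptor\n' | wc -w          # pipe: one stdout feeds the next stdin
rm greeting.txt
```

Both `wc -w` invocations count the same two words; only the plumbing differs.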
openNLP command line

opennlp SomeTool arg ...

▶ takes text input on stdin, sends results to stdout
▶ arg is the name of an auxiliary data file SomeTool needs

SentenceDetector    split text into sentences
TokenizerME         split sentences (N.B.) into words
TokenNameFinder     Named Entity Recognition (persons, places…)
POSTagger           part-of-speech labeling
openNLP pipeline

opennlp SentenceDetector en-sent.bin < sheik.txt \
    > sheik_sent.txt
opennlp TokenizerME en-tok.bin < sheik_sent.txt \
    > sheik_words.txt
opennlp POSTagger en-pos.bin < sheik_words.txt \
    > sheik_pos.txt

head sheik_pos.txt

"_`` Are_VBP you_PRP coming_VBG in_IN to_TO watch_VB the_DT
dancing_NN ,_, Lady_NNP Conway_NNP ?_. "_''
"_`` I_PRP most_RBS decidedly_RB am_VBP not_RB ._.
simplify the pipeline

opennlp SentenceDetector en-sent.bin < sheik.txt | \
    opennlp TokenizerME en-tok.bin | \
    opennlp POSTagger en-pos.bin > sheik_pos.txt
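The shape of such a pipeline — each stage reading the previous stage's stdout — works with any filters, not just opennlp. A toy three-stage analogue with everyday tools:

```shell
# three stages chained by pipes, same shape as the opennlp pipeline
printf 'Hello Raptor\n' |
    tr '[:upper:]' '[:lower:]' |   # stage 1: lowercase
    tr ' ' '\n' |                  # stage 2: one word per line
    wc -l                          # stage 3: count the lines (2)
```

No intermediate files are written; the shell hands each stage's output straight to the next.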
the command line from R

system(s)           # run command s in current working directory
system(s, intern=T) # return stdout as a character vector
getting the stuff

install.packages("openNLP")
install.packages("openNLPmodels.en",
                 repos="http://datacube.wu.ac.at")

brew install apache-opennlp # Mac, requires homebrew
# I'm working on a setup for the VM
composing a command

sent_model_file <- system.file("models",
    "en-sent.bin", package = "openNLPdata")
token_model_file <- system.file("models",
    "en-token.bin", package = "openNLPdata")

ner_command <- function (input_file, entity_type) {
    entity_model_file <- system.file("models",
        str_c("en-ner-", entity_type, ".bin"),
        package = "openNLPmodels.en")
    str_c("opennlp SentenceDetector ", sent_model_file,
          " < ", input_file,
          " | opennlp TokenizerME ", token_model_file,
          " | opennlp TokenNameFinder ", entity_model_file)
}
doing the deed

# Stowe, Uncle Tom's Cabin, vol. 1
input_file <- "wright_body/VAC7958.txt"
stowe <- system(ner_command(input_file, "location"), intern=T)
head(stowe)

[1] "LATE in the afternoon of a chilly day in February , two gentlemen were sitting"
[2] "alone over their wine , in a well-furnished dining parlor , in the town of <START:location> P—— <END> , in"
[3] "<START:location> Kentucky <END> ."
[4] "There were no servants present , and the gentlemen , with chairs closely"
[5] "approaching , seemed to be discussing some subject with great earnestness ."
[6] "For convenience sake , we have said , hitherto , two gentlemen ."
re-parsing the results

extract_locs <- function (tagged) {
    unlist(str_extract_all(tagged,
            perl("<START:location> .*? <END>"))) %>%
        str_replace("^<START:location> ", "") %>%
        str_replace(" <END>$", "")
}
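A rough shell analogue of `extract_locs`, using `grep -o` and `sed` on a made-up tagged line (`[^<]*` stands in for the non-greedy `.*?`):

```shell
# extract <START:location> ... <END> spans, then strip the tags
# (the input line is invented to mimic TokenNameFinder output)
printf 'in <START:location> Kentucky <END> .\n' |
    grep -o '<START:location> [^<]* <END>' |
    sed -e 's/^<START:location> //' -e 's/ <END>$//'
# prints: Kentucky
```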
corpus

locs_frame <- function (fs) {
    data_frame(filename=fs) %>%
        mutate(cmd=ner_command(filename, "location")) %>%
        group_by(filename) %>%
        do({
            message("NER on ", .$filename)
            locs <- extract_locs(system(.$cmd, intern=T))
            if (length(locs) == 0) locs <- NA
            data_frame(loc=locs)
        }) %>%
        group_by(filename, loc) %>%
        summarize(count=n())
}
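The run-per-file-then-tally logic of `locs_frame` can be sketched in the shell as well. The file names and contents below are invented, and the grep/sed stage stands in for the opennlp call:

```shell
# make a tiny fake corpus
mkdir -p demo_body
printf 'in <START:location> Kentucky <END> .\n' > demo_body/a.txt
printf 'to <START:location> Paris <END> and <START:location> Paris <END> .\n' > demo_body/b.txt

# for each file: extract location spans, count occurrences per file
tally=$(for f in demo_body/*.txt; do
    grep -o '<START:location> [^<]* <END>' "$f" |
        sed -e 's/^<START:location> //' -e 's/ <END>$//' |
        sort | uniq -c | awk -v f="$f" '{print f, $2, $1}'
done)
printf '%s\n' "$tally"

rm -r demo_body
```

Each output line is a (filename, location, count) triple, the same shape as the data frame `locs_frame` returns.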
fs <- Sys.glob("wright_body/*.txt")

if (!file.exists("wright_locs.tsv")) {
    locs_frame(fs) %>% # takes a while
        write.table("wright_locs.tsv", sep="\t",
                    col.names=T, row.names=F, quote=F)
}
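This compute-once, cache-to-file idiom reads the same way in shell; a sketch (the `demo_wc.txt` name is invented):

```shell
# recompute only if the cached file is missing, then read it back
if [ ! -f demo_wc.txt ]; then
    printf 'Hello Raptor\n' | wc -w > demo_wc.txt   # the slow step
fi
cached=$(tr -d ' ' < demo_wc.txt)
echo "$cached"
rm demo_wc.txt
# prints: 2
```

Re-running the script skips the slow step as long as the cache file exists.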
locs <- read.table("wright_locs.tsv", sep="\t",
    comment.char="", header=T, as.is=T, quote="")

locs <- locs %>%
    group_by(loc) %>%
    filter(sum(count) > 5, n() > 2) # Wilkens's filter
merge with metadata

meta <- read.table("wright_meta.tsv", sep="\t",
    as.is=T, header=T, quote="") %>%
    mutate(filename=file.path("wright_body", file)) %>%
    select(filename, pubplace) %>%
    mutate(pubplace=str_trim(pubplace))
# (but the place names need more cleaning than this)

locs <- inner_join(locs, meta, by="filename")
preliminary peek

top_locs <- locs %>%
    group_by(pubplace, loc) %>%
    # but just what have we been counting here?
    summarize(count=sum(count)) %>%
    filter(pubplace %in% c("Boston", "New York")) %>%
    top_n(5) %>%
    arrange(desc(count)) %>%
    rename(published=pubplace, mentioned=loc)
hmmm

print_tabular(top_locs)

published   mentioned    count
Boston      New York      3493
Boston      Boston        3273
Boston      England       2716
Boston      Florence      2309
Boston      Paris         1781
New York    New York     13866
New York    England       7800
New York    Paris         5463
New York    London        5131
New York    Washington    4852
what could come next: geospatial data and visualization

▶ Lincoln Mullen's extensive draft chapters
  lincolnmullen.com/projects/dh-r/geospatial-data.html
  lincolnmullen.com/projects/dh-r/mapping.html
▶ Bivand, Roger, et al. Applied Spatial Data Analysis with R. 2nd ed. New York: Springer, 2013. DOI: 10.1007/978-1-4614-7618-4.
▶ ggmap package (geocoding)
▶ sp and many more R packages (spatial analysis)
▶ GIS libraries and command-line tools (not R): GDAL, GEOS, …
I saw something shiny

install.packages("tmap") # new on CRAN; needs GDAL, GEOS
library("tmap")
vignette("tmap-nutshell") # some documentation
next