More on dplyr


  1. ~/>_ More on dplyr

  2. ~/> previously …

  3. gg_miss_fct(x = riskfactors, fct = marital)

  4. quick_na <- function(x, vals = c(9, 10, 11, 97, 99)) {
       x[x %in% vals] <- NA
       x
     }

     num_vec <- c(1:12, 97, 97, 99, NA)
     num_vec
     #> [1]  1  2  3  4  5  6  7  8  9 10 11 12 97 97 99 NA

     quick_na(num_vec)
     #> [1]  1  2  3  4  5  6  7  8 NA NA NA 12 NA NA NA NA
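A minimal sketch of applying the same helper to columns of a data frame. The `survey` tibble and its column names are hypothetical, invented for illustration. Note that the default `vals` include 9, 10, and 11, which can be legitimate values in some columns, so the sketch passes the sentinel codes explicitly for each column.

```r
library(dplyr)

# Same helper as on the slide: recode sentinel codes to NA
quick_na <- function(x, vals = c(9, 10, 11, 97, 99)) {
  x[x %in% vals] <- NA
  x
}

# Hypothetical survey data (assumption: 97 and 99 mean "missing" for age,
# only 99 means "missing" for visits)
survey <- tibble(
  id     = 1:5,
  age    = c(34, 97, 51, 99, 28),
  visits = c(2, 9, 99, 4, 1)
)

# Recode column by column, with codes appropriate to each column
cleaned <- survey %>%
  mutate(
    age    = quick_na(age, vals = c(97, 99)),
    visits = quick_na(visits, vals = 99)
  )

cleaned
```

Here the value 9 in `visits` survives because it is not listed in `vals` for that column.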

  5. ~/>_ Working with dplyr

  6. ~/>_ Standard verbs

  7. - group_by(): Group the data at the level we want, such as “Religion by Region” or “Authors by Publications by Year”.
     - filter() rows, select() columns: Filter or select pieces of the data. This gets us the subset of the table we want to work on.
     - mutate(): Mutate the data by creating new variables at the current level of grouping. Mutating adds new columns to the table.
     - summarize(): Summarize or aggregate the grouped data. This creates new variables at a higher level of grouping. For example, we might calculate means with mean() or counts with n(). This results in a smaller summary table, which we might do more things with if we want.
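The verbs above can be sketched on R's built-in `mtcars` data, which is an assumption made here for self-containment and is not the data used in the slides. The first pipeline stops at `mutate()`, so the table keeps one row per car. The second continues to `summarize()`, which collapses it to one row per group.

```r
library(dplyr)

# filter rows, select columns, group, then add a column at the current
# level of grouping: the result still has one row per car
centered <- mtcars %>%
  filter(hp > 100) %>%
  select(cyl, mpg, wt) %>%
  group_by(cyl) %>%
  mutate(mpg_centered = mpg - mean(mpg))

# summarize moves up one level: one row per value of cyl
by_cyl <- centered %>%
  summarize(n = n(), mean_mpg = mean(mpg), mean_wt = mean(wt))

by_cyl
```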

  8. ~/>_ Scoped verbs

  9. Scoped verbs
     - action_all(): Take action on all variables
     - action_if(): Take action on a subset of variables selected by a criterion
     - action_at(): Take action on a subset of variables selected by their names
     Here action can be mutate, summarize, or filter.
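A short sketch of the three scoped variants, using the built-in `iris` data as a stand-in (an assumption; the slides use other data). In dplyr 1.0.0 and later the scoped verbs are superseded by `across()`, but they remain available.

```r
library(dplyr)

# _all: act on every column (Species is dropped first so all are numeric)
all_means <- iris %>%
  select(-Species) %>%
  summarize_all(mean)

# _if: act only on columns that satisfy a predicate
species_means <- iris %>%
  group_by(Species) %>%
  summarize_if(is.numeric, mean)

# _at: act only on columns picked by name
in_mm <- iris %>%
  mutate_at(vars(Sepal.Length, Sepal.Width), ~ . * 10)
```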

  10. Useful scope-setters: is.character(), is.factor(), is.numeric(), is.logical(), is.integer(), is.ordered(), lubridate::is.Date()

  11. Useful scoping helpers: starts_with(), ends_with(), contains(), one_of(), matches(), vars(), everything()
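A brief sketch of a few of these helpers in use, again on the built-in `iris` data (an assumption made for illustration). The helpers work inside `select()` directly, and inside `vars()` when passed to an `_at()` verb.

```r
library(dplyr)

# Pick columns by the start or end of their names
sepal_cols <- iris %>% select(starts_with("Sepal"))
width_cols <- iris %>% select(ends_with("Width"))

# everything() selects all remaining columns, handy for reordering
reordered <- iris %>% select(Species, everything())

# vars() wraps a selection for the _at() verbs
petal_means <- iris %>%
  group_by(Species) %>%
  summarize_at(vars(contains("Petal")), mean)
```

In current versions of tidyselect, `one_of()` is superseded by `any_of()` and `all_of()`.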

  12. Examples
      organdata %>%
        group_by(world) %>%
        summarize_if(is.numeric, mean, na.rm = TRUE) %>%
        select(world, donors, pubhealth, roads) %>%
        select_all(tools::toTitleCase)

      # A tibble: 4 x 4
        World       Donors Pubhealth Roads
        <chr>        <dbl>     <dbl> <dbl>
      1 NA            28.1      5.45 161.
      2 Corporatist   16.8      6.40 132.
      3 Liberal       15.6      5.75 111.
      4 SocDem        14.8      6.54  82.7

  13. Examples
      organdata %>%
        group_by(country) %>%
        summarize_if(is.numeric, funs(avg = mean, sd = sd), na.rm = TRUE) %>%
        select(country, donors_avg, donors_sd, roads_avg, roads_sd) %>%
        arrange(desc(donors_avg))
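`funs()` was deprecated in dplyr 0.8.0. A sketch of the same pattern, several named summaries per column, written with `across()` and a named list of functions, assuming dplyr 1.0.0 or later. It uses the built-in `mtcars` data rather than `organdata`, so the column names here are not those of the slide.

```r
library(dplyr)

car_stats <- mtcars %>%
  group_by(cyl) %>%
  summarize(across(
    c(mpg, hp),
    # Names in the list become suffixes: mpg_avg, mpg_sd, hp_avg, hp_sd
    list(
      avg = ~ mean(.x, na.rm = TRUE),
      sd  = ~ sd(.x, na.rm = TRUE)
    )
  )) %>%
  arrange(desc(mpg_avg))

car_stats
```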

  14. Examples
      # A tibble: 17 x 5
         country        donors_avg donors_sd roads_avg roads_sd
         <chr>               <dbl>     <dbl>     <dbl>    <dbl>
       1 Spain                28.1     4.96      161.     35.3
       2 Austria              23.5     2.42      150.     30.3
       3 Belgium              21.9     1.94      155.     20.6
       4 United States        20.0     1.33      155.      8.35
       5 Ireland              19.8     2.48      118.     10.8
       6 Finland              18.4     1.53       93.6    19.0
       7 France               16.8     1.60      156.     20.1
       8 Norway               15.4     1.11       70.0     6.68
       9 Switzerland          14.2     1.71       96.4    21.7
      10 Canada               14.0     0.751     109.     17.7
      11 Netherlands          13.7     1.55       76.1     9.93
      12 United Kingdom       13.5     0.775      67.9    10.5
      13 Sweden               13.1     1.75       72.3    13.2
      14 Denmark              13.1     1.47      102.     12.4
      15 Germany              13.0     0.611     113.     25.9
      16 Italy                11.1     4.28      122.     10.2
      17 Australia            10.6     1.14      105.     14.3

  15. ~/>_ Scoping and Mapping

  16. map() and friends are the general case
      out <- lm(donors ~ pop + gdp + roads, data = organdata)
      summary(out)

      Call:
      lm(formula = donors ~ pop + gdp + roads, data = organdata)

      Residuals:
          Min      1Q  Median      3Q     Max
      -13.423  -2.658  -0.080   1.963  15.864

      Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
      (Intercept)  4.506e+00  2.364e+00   1.906   0.0580 .
      pop         -1.153e-05  5.643e-06  -2.043   0.0423 *
      gdp          1.082e-04  7.527e-05   1.438   0.1521
      roads        8.988e-02  1.032e-02   8.710 1.14e-15 ***
      ---
      Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

      Residual standard error: 4.325 on 200 degrees of freedom
        (34 observations deleted due to missingness)
      Multiple R-squared:  0.2944, Adjusted R-squared:  0.2838
      F-statistic: 27.81 on 3 and 200 DF,  p-value: 4.486e-15

  17. map() and friends are the general case
      > names(summary(out))
       [1] "call"          "terms"         "residuals"     "coefficients"  "aliased"
       [6] "sigma"         "df"            "r.squared"     "adj.r.squared" "fstatistic"
      [11] "cov.unscaled"  "na.action"

  18. map() and friends are the general case
      organdata %>%
        split(.$world) %>%
        map(~ lm(donors ~ pop + gdp + roads, data = .)) %>%
        map(summary) %>%
        map_dbl("r.squared")

      We’ll see cleaner ways to do this shortly.
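The same split, fit, summarize, extract pattern can be tried without `organdata`. This sketch substitutes the built-in `mtcars` data and a one-predictor model (both assumptions made for self-containment). Passing a string to `map_dbl()` extracts the element of that name from each list item, here the `"r.squared"` component of each model summary.

```r
library(dplyr)
library(purrr)

r2_by_cyl <- mtcars %>%
  split(.$cyl) %>%                      # list of data frames, one per cyl
  map(~ lm(mpg ~ wt, data = .x)) %>%    # fit one model per piece
  map(summary) %>%                      # summarize each model
  map_dbl("r.squared")                  # pull out R-squared as a numeric vector

r2_by_cyl
```

The result is a named numeric vector, with names taken from the levels used in `split()`.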

  19. ~/>_ Zero Counts in dplyr

  20. data %>%
        select(start_year, job_type1) %>%
        group_by(start_year, job_type1) %>%
        summarize(n = n()) %>%
        mutate(pct = (n / sum(n)) * 100)

      # A tibble: 689 x 4
      # Groups:   start_year [38]
         start_year job_type1                        n    pct
         <date>     <chr>                        <int>  <dbl>
       1 1945-01-03 NA                               5  0.880
       2 1945-01-03 Acting/entertainer              11  1.94
       3 1945-01-03 Aeronautics                      2  0.352
       4 1945-01-03 Agriculture                     65 11.4
       5 1945-01-03 Business or banking            108 19.0
       6 1945-01-03 Clergy                           3  0.528
       7 1945-01-03 Congressional Aide              11  1.94
       8 1945-01-03 Construction/building trades     9  1.58
       9 1945-01-03 Education                       58 10.2
      10 1945-01-03 Engineering                      2  0.352
      # … with 679 more rows

  21. df <- data %>%
        filter(position == "U.S. Representative",
               start > "1945-01-01") %>%
        group_by(pid) %>%
        nest() %>%
        mutate(data = map(data, ~ mutate(.x, term_id = 1 + congress - first(congress)))) %>%
        unnest() %>%
        filter(term_id == 1,
               party %in% c("Democrat", "Republican"),
               start_year > int_to_year(2012)) %>%
        group_by(start_year, party, sex) %>%
        select(pid, start_year, party, sex)

      This caused the difference in N you saw in class. Fixed here.
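The `term_id` step numbers each member's terms relative to the first congress they appear in. A sketch on hypothetical toy data (the `terms` tibble is invented for illustration) shows the same calculation with a grouped `mutate()`, which for this particular step should give the same result as the nest, map, unnest sequence. It assumes rows are already ordered by `congress` within each `pid`, since `first()` depends on row order.

```r
library(dplyr)

# Hypothetical data: one row per member (pid) per congress served
terms <- tibble(
  pid      = c(1, 1, 1, 2, 2, 3),
  congress = c(110, 111, 112, 111, 112, 113)
)

with_terms <- terms %>%
  group_by(pid) %>%
  # first(congress) is evaluated within each member's rows
  mutate(term_id = 1 + congress - first(congress)) %>%
  ungroup()

# Keep only each member's first term
first_terms <- with_terms %>% filter(term_id == 1)
```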

  22. > df
      # A tibble: 293 x 4
      # Groups:   start_year, party, sex [14]
           pid start_year party      sex
         <int> <date>     <chr>      <chr>
       1  3160 2013-01-03 Republican M
       2  3161 2013-01-03 Democrat   F
       3  3162 2013-01-03 Democrat   M
       4  3163 2013-01-03 Republican M
       5  3164 2013-01-03 Democrat   M
       6  3165 2013-01-03 Republican M
       7  3166 2013-01-03 Republican M
       8  3167 2013-01-03 Democrat   F
       9  3168 2013-01-03 Republican M
      10  3169 2013-01-03 Democrat   M
      # … with 283 more rows

  23. df %>%
        group_by(start_year, party, sex) %>%
        summarize(N = n()) %>%
        mutate(freq = N / sum(N))

      #> # A tibble: 14 x 5
      #> # Groups:   start_year, party [8]
      #>    start_year party      sex       N   freq
      #>    <date>     <chr>      <chr> <int>  <dbl>
      #>  1 2013-01-03 Democrat   F        21 0.362
      #>  2 2013-01-03 Democrat   M        37 0.638
      #>  3 2013-01-03 Republican F         8 0.101
      #>  4 2013-01-03 Republican M        71 0.899
      #>  5 2015-01-03 Democrat   M         1 1
      #>  6 2015-01-03 Republican M         5 1
      #>  7 2017-01-03 Democrat   F         6 0.24
      #>  8 2017-01-03 Democrat   M        19 0.76
      #>  9 2017-01-03 Republican F         2 0.0667
      #> 10 2017-01-03 Republican M        28 0.933
      #> 11 2019-01-03 Democrat   F        33 0.647
      #> 12 2019-01-03 Democrat   M        18 0.353
      #> 13 2019-01-03 Republican F         1 0.0323
      #> 14 2019-01-03 Republican M        30 0.968

  24. ## Hex colors for sex
      sex_colors <- c("#E69F00", "#993300")

      ## Hex color codes for Dem Blue and Rep Red
      party_colors <- c("#2E74C0", "#CB454A")

      ## Group labels
      mf_labs <- tibble(M = "Men", F = "Women")

      theme_set(theme_minimal())

  25. df %>%
        group_by(start_year, party, sex) %>%
        summarize(N = n()) %>%
        mutate(freq = N / sum(N)) %>%
        ggplot(aes(x = start_year, y = freq, fill = sex)) +
        geom_col() +
        scale_y_continuous(labels = scales::percent) +
        scale_fill_manual(values = sex_colors, labels = c("Women", "Men")) +
        labs(x = "Year", y = "Percent", fill = "Group") +
        facet_wrap(~ party)

  26. df %>%
        group_by(start_year, party, sex) %>%
        summarize(N = n()) %>%
        mutate(freq = N / sum(N)) %>%
        ggplot(aes(x = start_year, y = freq, color = sex)) +
        geom_line(size = 1.1) +
        scale_y_continuous(labels = scales::percent) +
        scale_color_manual(values = sex_colors, labels = c("Women", "Men")) +
        guides(color = guide_legend(reverse = TRUE)) +
        labs(x = "Year", y = "Percent", color = "Group") +
        facet_wrap(~ party)

  27. Should go to zero!
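The line does not drop to zero because combinations with no observations, such as women first elected in 2015 in the table above, produce no row at all, so the plot simply connects the neighboring years. One common way to make the zeros explicit is `tidyr::complete()`, sketched here on hypothetical toy data (the `members` tibble is invented for illustration). This is not necessarily the approach the slides go on to use.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: no women among the 2015 entrants
members <- tibble(
  start_year = c(2013, 2013, 2013, 2013, 2015, 2015),
  party      = c("Democrat", "Democrat", "Republican", "Republican",
                 "Democrat", "Republican"),
  sex        = c("F", "M", "F", "M", "M", "M")
)

counts <- members %>%
  count(start_year, party, sex, name = "N") %>%
  # Add a row for every year x party x sex combination, with N = 0
  # where the combination was absent from the data
  complete(start_year, party, sex, fill = list(N = 0L)) %>%
  group_by(start_year, party) %>%
  mutate(freq = N / sum(N)) %>%
  ungroup()

counts
```

When the grouping variables are factors, `group_by(..., .drop = FALSE)` is another option that keeps empty groups in the summary.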
