Understanding string distances


  1. Understanding string distances. Intermediate Regular Expressions in R. Angelo Zehr, Data Journalist

  2. What is a string distance?

  3. What is a string distance?

  4. Real world applications

  5. INTERMEDIATE REGULAR EXPRESSIONS IN R

  6. String distances in R

         library(stringdist)
         stringdist("saturday", "sunday", method = "lv")

     Returns: 3

     Swapping the arguments gives the identical result:

         stringdist("sunday", "saturday", method = "lv")
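
To make the number 3 less of a black box, here is a minimal base R sketch of the Levenshtein distance computed with dynamic programming. The function name `levenshtein` is made up for this illustration; it is not part of stringdist and is not how the package implements `method = "lv"`.

```r
# Illustrative sketch: Levenshtein distance via dynamic programming.
# `levenshtein` is a hypothetical helper, not a stringdist function.
levenshtein <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  n <- length(x)
  m <- length(y)
  # d[i + 1, j + 1] is the distance between the first i characters of a
  # and the first j characters of b.
  d <- matrix(0L, nrow = n + 1, ncol = m + 1)
  d[, 1] <- 0:n
  d[1, ] <- 0:m
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost <- if (x[i] == y[j]) 0L else 1L
      d[i + 1, j + 1] <- min(
        d[i, j + 1] + 1L,  # deletion
        d[i + 1, j] + 1L,  # insertion
        d[i, j] + cost     # substitution, or a free match
      )
    }
  }
  d[n + 1, m + 1]
}

levenshtein("saturday", "sunday")
levenshtein("sunday", "saturday")
```

Both calls return 3, in line with the slide, because every insertion in one direction is a deletion in the other.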

  7. Finding a match

         amatch(
           x = "Sonday",
           table = c("Friday", "Saturday", "Sunday"),
           maxDist = 1,
           method = "lv"
         )

     Returns: 3
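
The idea behind this lookup can be sketched in base R: compute the distance to every candidate, take the closest one, and accept it only if it is within the allowed distance. `closest_match` is a hypothetical helper written for this note, and it uses `utils::adist()` (base R's Levenshtein distance) as a stand-in for stringdist.

```r
# Hypothetical helper, loosely modelled on the amatch() call above.
closest_match <- function(x, table, max_dist = 1) {
  d <- drop(utils::adist(x, table))  # Levenshtein distance to each candidate
  best <- which.min(d)               # position of the closest candidate
  if (d[best] <= max_dist) best else NA_integer_
}

days <- c("Friday", "Saturday", "Sunday")
closest_match("Sonday", days)  # position of "Sunday"
closest_match("Monday", days)  # no candidate within one edit
```

The first call gives 3, the position of "Sunday". The second gives `NA`, because "Monday" is two edits away from "Sunday" and further from the other two.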

  8. Let's practice!

  9. Methods of string distances. Angelo Zehr, Data Journalist

  10. Damerau-Levenshtein

  11. Method abbreviations

      Regular Levenshtein distance:

          stringdist(a, b, method = "lv")

      Damerau-Levenshtein distance:

          stringdist(a, b, method = "dl")

      Optimal String Alignment distance:

          stringdist(a, b, method = "osa")
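
What separates these methods is how they treat two swapped neighbouring characters. The base R sketch below adds that single transposition case to the usual Levenshtein recurrence, which gives the Optimal String Alignment variant. `osa_distance` is a made-up name for this illustration and is not the stringdist implementation. Full Damerau-Levenshtein goes further by allowing more edits around a transposed pair, and is not covered here.

```r
# Illustrative sketch of the Optimal String Alignment (OSA) distance.
# `osa_distance` is a hypothetical helper, not a stringdist function.
osa_distance <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  n <- length(x)
  m <- length(y)
  d <- matrix(0L, nrow = n + 1, ncol = m + 1)
  d[, 1] <- 0:n
  d[1, ] <- 0:m
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost <- if (x[i] == y[j]) 0L else 1L
      best <- min(d[i, j + 1] + 1L, d[i + 1, j] + 1L, d[i, j] + cost)
      # Extra case compared with plain Levenshtein:
      # swapping two adjacent characters counts as one edit.
      if (i > 1 && j > 1 && x[i] == y[j - 1] && x[i - 1] == y[j]) {
        best <- min(best, d[i - 1, j - 1] + 1L)
      }
      d[i + 1, j + 1] <- best
    }
  }
  d[n + 1, m + 1]
}

osa_distance("ab", "ba")       # one transposition
drop(utils::adist("ab", "ba")) # plain Levenshtein needs two edits
```

For "ab" and "ba" the sketch returns 1, while base R's plain Levenshtein `adist()` returns 2.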

  12. Q-Grams (or n-grams)

  13. Q-Grams (or n-grams)

  14. Inspecting q-grams

          qgrams("Honolulu", "Hanolulu", q = 2)

      Returns (column order may differ):

             Ho on no ol lu ul Ha an
          V1  1  1  1  1  2  1  0  0
          V2  0  0  1  1  2  1  1  1

  15. Method abbreviations

      Sum of q-grams that are not shared:

          stringdist(a, b, method = "qgram")   # equals 4

      Distinct q-grams that are not shared, divided by the total number of distinct q-grams:

          stringdist(a, b, method = "jaccard") # equals 0.5

      Cosine distance between the q-gram count vectors:

          stringdist(a, b, method = "cosine")  # equals 0.22
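
The three figures on this slide can be reproduced by hand. The slide does not say which strings `a` and `b` stand for, so the sketch below assumes "Honolulu" and "Hanolulu" with q = 2, a pairing that happens to yield 4, 0.5 and 0.22. The helpers `qgrams_of` and `qgram_profile` are hypothetical base R functions written for this note, not part of stringdist.

```r
# Split a string into its q-grams (assumes nchar(s) >= q).
qgrams_of <- function(s, q = 2) {
  starts <- seq_len(nchar(s) - q + 1)
  substring(s, starts, starts + q - 1)
}

# One row per string, one column per distinct q-gram, cells are counts.
qgram_profile <- function(a, b, q = 2) {
  ga <- qgrams_of(a, q)
  gb <- qgrams_of(b, q)
  grams <- union(ga, gb)
  count_in <- function(own) vapply(grams, function(g) sum(own == g), numeric(1))
  rbind(a = count_in(ga), b = count_in(gb))
}

p <- qgram_profile("Honolulu", "Hanolulu", q = 2)

# q-gram distance: sum of absolute count differences.
qgram_dist <- sum(abs(p["a", ] - p["b", ]))

# Jaccard distance: 1 - shared distinct q-grams / all distinct q-grams.
present <- p > 0
jaccard_dist <- 1 - sum(present["a", ] & present["b", ]) / ncol(p)

# Cosine distance: 1 - cosine similarity of the two count vectors.
cosine_dist <- 1 - sum(p["a", ] * p["b", ]) /
  sqrt(sum(p["a", ]^2) * sum(p["b", ]^2))

c(qgram = qgram_dist, jaccard = jaccard_dist, cosine = round(cosine_dist, 2))
```

Under these assumptions the three values come out as 4, 0.5 and 0.22.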

  16. Let's practice!

  17. Fuzzy joins. Angelo Zehr, Instructor

  18. A regular join

  19. A fuzzy join

  20. The fuzzyjoin package

          library(fuzzyjoin)

          stringdist_join(
            user_input, database,
            by = c("user_input" = "name"),
            method = "lv",
            max_dist = 1,
            distance_col = "distance"
          )
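
Conceptually, a string-distance join compares every row on the left with every row on the right and keeps the pairs that are close enough. The base R sketch below shows that idea with `utils::adist()` standing in for the Levenshtein method. The two data frames are invented example data, and the code is not how fuzzyjoin works internally.

```r
# Hypothetical example data; the column names follow the slide.
user_input <- data.frame(
  user_input = c("Sonday", "Fryday", "Mittwoch"),
  stringsAsFactors = FALSE
)
database <- data.frame(
  name = c("Friday", "Saturday", "Sunday"),
  stringsAsFactors = FALSE
)

# Distance between every input (rows) and every database name (columns).
d <- utils::adist(user_input$user_input, database$name)

# Keep the pairs that are at most one edit apart.
hits <- which(d <= 1, arr.ind = TRUE)
joined <- data.frame(
  user_input = user_input$user_input[hits[, "row"]],
  name       = database$name[hits[, "col"]],
  distance   = d[hits],
  stringsAsFactors = FALSE
)
joined
```

With this data, "Sonday" pairs with "Sunday" and "Fryday" with "Friday", each at distance 1, while "Mittwoch" has no partner and drops out.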

  21. stringdist_join: Result

  22. Let's practice!

  23. Custom Fuzzy Matching. Angelo Zehr, Data Journalist

  24. Combining two fuzzy matches

  25. Combining two fuzzy matches

  26. Fuzzy matches: Helper functions

      For the string comparison:

          small_str_distance <- function(left, right) {
            stringdist(left, right) <= 5
          }

      For the number comparison:

          close_to_each_other <- function(left, right) {
            abs(left - right) <= 3
          }

  27. The fuzzy join

          fuzzy_left_join(
            a, b,
            by = c(
              "title" = "prod_title",
              "year" = "prod_year"
            ),
            match_fun = c(
              "title" = small_str_distance,
              "year" = close_to_each_other
            )
          )
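
A base R sketch of what such a two-condition left join does: test every pair of rows with both helper functions, keep the pairs where both return TRUE, and pad unmatched left rows with NA. The data frames are invented for illustration. The string helper here uses `utils::adist()` (plain Levenshtein) as a stand-in, whereas the slide's helper calls `stringdist()` with its default method, so results can differ in edge cases.

```r
# Hypothetical data; column names follow the slide.
a <- data.frame(title = c("Alien", "Blade Runner", "Solaris"),
                year = c(1979, 1982, 1972), stringsAsFactors = FALSE)
b <- data.frame(prod_title = c("Aliens", "Blade Runer"),
                prod_year = c(1986, 1982), stringsAsFactors = FALSE)

# Base R stand-ins for the helper functions on the previous slide.
small_str_distance <- function(left, right) {
  mapply(function(l, r) drop(utils::adist(l, r)), left, right) <= 5
}
close_to_each_other <- function(left, right) {
  abs(left - right) <= 3
}

# Compare every row of a with every row of b.
pairs <- expand.grid(i = seq_len(nrow(a)), j = seq_len(nrow(b)))
keep <- small_str_distance(a$title[pairs$i], b$prod_title[pairs$j]) &
  close_to_each_other(a$year[pairs$i], b$prod_year[pairs$j])
matched <- pairs[keep, ]

# Left join: rows of a without a partner are kept and padded with NA.
unmatched <- setdiff(seq_len(nrow(a)), matched$i)
idx <- rbind(
  matched,
  data.frame(i = unmatched, j = rep(NA_integer_, length(unmatched)))
)
result <- cbind(a[idx$i, , drop = FALSE], b[idx$j, , drop = FALSE])
rownames(result) <- NULL
result
```

Only "Blade Runner" finds a partner here. "Alien" and "Aliens" are one edit apart but seven years apart, so the year condition rejects the pair, which is the point of combining two match functions.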

  28. The fuzzy join: The result

  29. Let's practice!

  30. Congratulations. Angelo Zehr, Data Journalist

  31. A look back

      1. Regular Expressions: Writing custom patterns
         str_view(), str_match(), str_detect(), ...
      2. Creating strings with data
         glue(), glue_collapse(), ...
      3. Extracting structured data from text
         str_extract_all(), extract(), ...
      4. Similarities between strings
         stringdist(), amatch(), stringdist_join()

  32. Next courses

  33. Thank you!
