class: center, middle, inverse, title-slide # Strings and regex in R ## Maria Novosolov ### 2020-11-16 --- # When does it come up? - data stored as notes - non-uniformly formatted data - filenames - almost everywhere --- # What are strings? Strings (character types) = pretty much anything surrounded by quotes **Double quotes** are the preferred style unless your text contains double quotes .center[ ![](https://media.giphy.com/media/xTiN0z1dTpeedhds8o/giphy.gif) ] --- .center[.img-med[ ![](https://media.giphy.com/media/HUjc0sbfgaInK/giphy.gif) ]] ### `"Keeps away the nargles."` -- ### `'Luna whispered, "Keeps away the nargles."'` -- ### `"Luna's eyes widened, and she whispered, \"Keeps away the nargles.\""` --- # What's with the backslash? .center[.img-big[ ![](https://media.giphy.com/media/gCwXh8EHKKfMk/giphy.gif) ]] --- # Escaped characters Special characters that may have an alternate meaning In our previous case: is `"` literally a double-quote or is it marking a character string? -- Some other examples: `\n` = new line `\t` = tab `\\` = backslash --- # Check your escapes with `writeLines()` ```r writeLines("Nitwit! \n\tBlubber! \n\t\tOddment! \n\U1F9D9\t\t\tTweak!") ``` ``` ## Nitwit! ## Blubber! ## Oddment! ## 🧙 Tweak! ``` _You can also write emojis this way!_ --- # UTF-8 Character-encoding system of choice but not fully supported by Windows (yet); this is how you can write in Hebrew, English, Chinese, and emojis in one sentence It is also possible to use letter code to write non-Latin alphabet languages. For example: Hebrew with letter codes: ```r writeLines("\U05DB\U05EA\U05D1") ``` ``` ## כתב ``` ```r # write the letters-codes left-to-right but it prints right-to-left ``` --- class: exercise, center, middle Use `writeLines()` to write your name, degree, and favorite day of the week. Add an emoji in the end. Each one should be in a new line and tab every other line. --- class: inverse # stringr .center[ ![](https://stringr.tidyverse.org/logo.png) https://stringr.tidyverse.org _A consistent, simple and easy-to-use set of wrappers around the `stringi` package._ ] --- # General pattern .center[.xlarge[ str_.gray[verb](.gray[string], .gray[...]) ]] .large[ string = character string or vector ... = additional arguments include `pattern` to match or replacement string ] --- # Functions for the day .large[ - single-input functions: `str_verb(string)` - multi-input functions: `str_verb(string, pattern, ...)` - `*_all` variants - `str_glue()` for interpolating strings - `str_view()` for checking regex ] If you're already familiar with the base R equivalents, check out [this vignette](https://stringr.tidyverse.org/articles/from-base.html) for the "translations" --- # Let's get some data ```r spells <- read_csv("data/spells.csv") # from the rcorpora pkg (incantation <- spells$incantation[1:5]) ``` ``` ## [1] "Accio" "Aguamenti" "Alohomora" "Anapneo" "Aparecium" ``` .center[![](https://media.giphy.com/media/kSjGBipdCXdHW/giphy.gif)] --- .center[.xlarge[ str_.gray[verb](.gray[string]) ]] ```r # often a useful first step to avoid dealing with capitals str_to_lower(incantation) ``` ``` ## [1] "accio" "aguamenti" "alohomora" "anapneo" "aparecium" ``` ```r str_length(incantation) ``` ``` ## [1] 5 9 9 7 9 ``` --- .center[.xlarge[ str_.gray[verb](.gray[string], .gray[pattern]) ]] ```r str_detect(incantation, "o") # similar to grepl ``` ``` ## [1] TRUE FALSE TRUE TRUE FALSE ``` ```r str_subset(incantation, "o") ``` ``` ## [1] "Accio" "Alohomora" "Anapneo" ``` ```r str_count(incantation, "o") ``` ``` ## [1] 1 0 3 1 0 ``` --- .center[.xlarge[str_.gray[verb]_all(.gray[string], .gray[...]) ]] ```r str_extract_all(incantation, ".o") ``` ``` ## [[1]] ## [1] "io" ## ## [[2]] ## character(0) ## ## [[3]] ## [1] "lo" "ho" "mo" ## ## [[4]] ## [1] "eo" ## ## [[5]] ## character(0) ``` --- class: exercise # Your turn How many `object`s are there among the effect descriptions? Replace `object` with a different noun. ```r spells ``` ``` ## # A tibble: 91 x 3 ## incantation effect type ## <chr> <chr> <chr> ## 1 Accio Summons an object Charm ## 2 Aguamenti Shoots water from wand Charm ## 3 Alohomora Opens locked objects Charm ## 4 Anapneo Clears the target's airway Spell ## 5 Aparecium Reveals invisible ink Spell ## # … with 86 more rows ``` --- .center[.xlarge[ str_glue(.gray[...]) ]] ```r # works similarly to `paste()` or `sprintf()` str_glue('Hermione shouted, "{incantation}!"') ``` ``` ## Hermione shouted, "Accio!" ## Hermione shouted, "Aguamenti!" ## Hermione shouted, "Alohomora!" ## Hermione shouted, "Anapneo!" ## Hermione shouted, "Aparecium!" ``` --- class: inverse, center, middle .xlarge[regex] ## regular expressions --- class: exercise # How would you extract out amounts of money? >"We'll bet 37 Galleons, 15 Sickles, 3 Knuts" "George and I invented them - 7 Sickles each, a bargain!" "True, both of them had paid 2 Sickles for a S.P.E.W. badge" "And 1000 Galleons prize money!" -- Pseudocode: ```r money_types <- c("Galleons", "Sickles", "Knuts") sentences %>% extract(number_before(money_types), money_types) ``` --- class: exercise >"We'll bet 37 Galleons, 15 Sickles, 3 Knuts" "George and I invented them - 7 Sickles each, a bargain!" "True, both of them had paid 2 Sickles for a S.P.E.W. badge" "And 1000 Galleons prize money!" >"I pull down about 100 sacks of galleons a year!" -- Pseudocode: ```r money_types <- c("Galleons", "Sickles", "Knuts") %>% * to_lower_case() sentences %>% * to_lower_case() %>% * extract(closest_number_before(money_types)), money_types) ``` --- # regex helpers .center[.xlarge[`str_view_all()`]] and `str_view()` are your friends Other resources: - [regex101](https://regex101.com) - interpret regex - [regexplain](https://www.garrickadenbuie.com/project/regexplain/) - interpret and write regex (lots of cheatsheets) - [rex](https://github.com/kevinushey/rex) package - "friendly" regex - [rebus](https://cran.r-project.org/web/packages/rebus/rebus.pdf) package - "friendly" regex - [regexpal](https://www.regexpal.com/) - interpret regex --- # Key expressions _Note:_ There are often several ways of writing the same regex. For this presentation, I chose my favorite style, which I find the most flexible `^` = start `$` = end `.` = anything ```r str_subset(c("football", "baseball", "ballroom"), "ball$") ``` ``` ## [1] "football" "baseball" ``` --- .center[.xlarge[[.gray[one of]]]] brackets = _one of_ the characters specified within the brackets (in the example, an `o`, `n`, `e`, space, or `f`) `[a-z]` = any lower case letters `[0-9]` = any number from 0 to 9 (also `\\d`) variations on the theme: `[09-]` = 0, 9, or - --- class: exercise # Guess the regex .center[.xlarge[any vowel]] --- class: exercise # Guess the regex .center[.xlarge[starts with a capital letter]] --- class: exercise # Guess the regex .center[.xlarge[a vowel and the two characters next to it]] --- # Numbers of things `*` = 0 or more `+` = 1 or more `{n}` = exactly `n` number of times `{n,}` = at least `n` number of times `{n,m}` = between `n` and `m` times --- class: exercise # Guess the regex .center[.xlarge[at least one number]] --- class: exercise # Guess the regex .center[.xlarge[ends with `jpg` or `jpeg`]] --- class: exercise # Guess the regex .center[.xlarge[at least 2 #s in a row]] --- class: inverse, center, middle # How do we say NOT? --- class: center, middle .img-big[ ![](https://media.giphy.com/media/gbErpwcLlizvi/giphy.gif) ] --- _Inside_ of brackets, a caret ("hat"/`^`) means **NOT** _Outside_ of brackets and at the beginning of a string, it means **begins with** .center[.img-big[![](https://media.giphy.com/media/FvJ4fmyI6gHLi/giphy.gif)]] --- class: exercise # Which is which? .center[.xlarge[`^[a-z]` versus `[^a-z]`]] --- class: inverse, center, middle # Working in a dataframe --- ```r spells ``` ``` ## # A tibble: 91 x 3 ## incantation effect type ## <chr> <chr> <chr> ## 1 Accio Summons an object Charm ## 2 Aguamenti Shoots water from wand Charm ## 3 Alohomora Opens locked objects Charm ## 4 Anapneo Clears the target's airway Spell ## 5 Aparecium Reveals invisible ink Spell ## # … with 86 more rows ``` --- ```r spells %>% * mutate(effect = str_to_lower(effect)) ``` ``` ## # A tibble: 91 x 3 ## incantation effect type ## <chr> <chr> <chr> ## 1 Accio summons an object Charm ## 2 Aguamenti shoots water from wand Charm ## 3 Alohomora opens locked objects Charm ## 4 Anapneo clears the target's airway Spell ## 5 Aparecium reveals invisible ink Spell ## # … with 86 more rows ``` --- ```r spells %>% mutate(effect = str_to_lower(effect)) %>% * mutate(effect = str_split(effect, " ")) ``` ``` ## # A tibble: 91 x 3 ## incantation effect type ## <chr> <list> <chr> ## 1 Accio <chr [3]> Charm ## 2 Aguamenti <chr [4]> Charm ## 3 Alohomora <chr [3]> Charm ## 4 Anapneo <chr [4]> Spell ## 5 Aparecium <chr [3]> Spell ## # … with 86 more rows ``` ### Creates a _list-column_; supports vector results of different lengths --- ```r spells %>% mutate(effect = str_to_lower(effect)) %>% mutate(effect = str_split(effect, " ")) %>% * unnest(effect) ``` ``` ## # A tibble: 351 x 3 ## incantation effect type ## <chr> <chr> <chr> ## 1 Accio summons Charm ## 2 Accio an Charm ## 3 Accio object Charm ## 4 Aguamenti shoots Charm ## 5 Aguamenti water Charm ## # … with 346 more rows ``` ### _Note:_ this is very similar to the `unnest_tokens()` function in the [`tidytext`](https://www.tidytextmining.com/) package --- ```r spells %>% mutate(effect = str_to_lower(effect)) %>% mutate(effect = str_split(effect, " ")) %>% unnest(effect) %>% * count(effect, sort = TRUE) ``` ``` ## # A tibble: 231 x 2 ## effect n ## <chr> <int> ## 1 a 11 ## 2 an 11 ## 3 to 9 ## 4 object 7 ## 5 objects 7 ## # … with 226 more rows ``` --- class: inverse, center, middle # A few more miscellaneous tricks .x-large[![](https://media.giphy.com/media/54Yiwc8TSvoJO/giphy.gif)] --- .center[.xlarge[(.gray[capture groups])]] Extract groups within a pattern ```r phone_numbers <- c("058 222 1234", "054-121 1221") str_match(phone_numbers, "([0-9]+)[ -]([0-9]+)[ -]([0-9]+)") ``` ``` ## [,1] [,2] [,3] [,4] ## [1,] "058 222 1234" "058" "222" "1234" ## [2,] "054-121 1221" "054" "121" "1221" ``` ```r # same as "(\\d+)[ -](\\d+)[ -](\\d+)" ``` --- .pull-left[ ## look behinds `(?<=)` ![](https://static.boredpanda.com/blog/wp-content/uploads/2016/10/newborn-baby-harry-potter-photo-shoot-kayla-glover-4.jpg) ] .pull-right[ ## look aheads `(?=)` ![](https://imagesvc.meredithcorp.io/v3/mm/image?url=https%3A%2F%2Fewedit.files.wordpress.com%2F2015%2F01%2Fharry-potter_510.jpg&w=400&c=sc&poi=face&q=85) ] --- class: inverse, center, middle Some more practice --- str_extract("It does not do to dwell on dreams and forget to live.", ) --- str_extract("It does not do to dwell on dreams and forget to live.", ) --- # Bonus question How do you extract the currency conversions from this text? ```r text <- c("There are 29 Knuts in 1 silver Sickle", "and there are 493 Knuts in 1 golden Galleon.") ``` -- Shortcut if you only care about the first number... ```r parse_number(text) ``` ``` ## [1] 29 493 ``` --- # Summary - strings/characters = things that are quoted in R - backslashes are used for "escape characters" (`\n`, `\\`) - use `writeLines()` to show how strings will display - `stringr` provides a simple, consistent interface to work with strings in R - regular expressions describe the logic of a pattern in text - check regex in R with `str_view()`/`str_view_all()` - use `unnest()` to expand list-columns Cheatsheets: [Basic regular expressions in R](https://www.rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf), [Working with strings with stringr](https://resources.rstudio.com/rstudio-cheatsheets/stringr-cheat-sheet) Santa Barbara Eco-Data-Science [text workshop materials](https://github.com/eco-data-science/text_workshop) (covers pdftools, sentiment analysis, etc.) --- # You are now ready to face the world of strings and regex .center[.xlarge[ ![](https://media.giphy.com/media/ffynNaSYx2yTC/giphy.gif)]]