# Strings and regex in R
## Maria Novosolov
### 2020-11-16

---

# When does it come up?

- data stored as notes
- non-uniformly formatted data
- filenames
- almost everywhere

---

# What are strings?

Strings (character types) = pretty much anything surrounded by quotes

**Double quotes** are the preferred style unless your text contains double quotes

---

### `"Keeps away the nargles."`

--
### `'Luna whispered, "Keeps away the nargles."'`

--
### `"Luna's eyes widened, and she whispered, \"Keeps away the nargles.\""`

---

# What's with the backslash?

---

# Escaped characters

Some characters have an alternate meaning; a backslash tells R which meaning you intend

In our previous case: is `"` a literal double quote, or does it mark the end of the character string?

Some other examples:

`\n` = new line  
`\t` = tab  
`\\` = backslash

---

# Check your escapes with `writeLines()`

```r
writeLines("Nitwit! \n\tBlubber! \n\t\tOddment! \n\U1F9D9\t\t\tTweak!")
```

```
## Nitwit! 
## 	Blubber! 
## 		Oddment! 
## 🧙			Tweak!
```

_You can also write emojis this way!_
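
A minimal sketch of why the check is useful, using only base R: `print()` shows the escapes as you typed them, while `writeLines()` shows what they produce. The variable name `x` is just for illustration.

```r
x <- "Nitwit!\n\tBlubber!"

print(x)       # shows the escape sequences as typed, inside quotes
writeLines(x)  # interprets them: a line break, then a tab

# An escape sequence takes two keystrokes but is stored as one character
nchar("\n")
nchar("\\")
```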

---

# UTF-8

The character-encoding system of choice, although it is not yet fully supported on Windows (as of this talk). UTF-8 is how you can write Hebrew, English, Chinese, and emoji in one sentence.

You can also use Unicode code points (`\U` escapes) to write languages with non-Latin alphabets. For example:

Hebrew with code points:

```r
writeLines("\U05DB\U05EA\U05D1") 
```

```
## כתב
```

```r
# the code points are written left-to-right, but the text prints right-to-left
```
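
As a rough illustration of what the code points are, base R can convert between strings and integer code points with `intToUtf8()` and `utf8ToInt()`. This sketch assumes a UTF-8 session; the variable names are made up for the example.

```r
# The same Hebrew word, built from escapes and from integer code points
from_escape <- "\u05DB\u05EA\u05D1"
from_codes  <- intToUtf8(c(0x05DB, 0x05EA, 0x05D1))

identical(from_escape, from_codes)

# Going the other way: string -> integer code points (shown in hex)
as.hexmode(utf8ToInt(from_escape))
```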

---
class: exercise, center, middle

Use `writeLines()` to write your name, degree, and favorite day of the week, with an emoji at the end. Put each item on a new line and indent every other line with a tab.

---

# stringr

https://stringr.tidyverse.org

_A consistent, simple and easy-to-use set of wrappers around the `stringi` package._

---

# General pattern

Most functions look like `str_verb(string, ...)`, where `...` = additional arguments, such as the `pattern` to match or a replacement string

---

# Functions for the day

.large[
- single-input functions: `str_verb(string)`
- multi-input functions: `str_verb(string, pattern, ...)`
- `*_all` variants
- `str_glue()` for interpolating strings
- `str_view()` for checking regex
]

If you're already familiar with the base R equivalents, check out [this vignette](https://stringr.tidyverse.org/articles/from-base.html) for the "translations"
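
A small sketch of the function families listed above, next to rough base R counterparts. The example vector `x` is an assumption chosen for illustration, and the base R lines are approximate translations rather than exact equivalents in every edge case.

```r
library(stringr)

x <- c("Accio", "Aguamenti", "Alohomora")

# single-input: str_verb(string)
str_to_upper(x)

# multi-input: str_verb(string, pattern, ...)
str_replace(x, "o", "_")      # replaces the first match in each string

# *_all variant: acts on every match, not just the first
str_replace_all(x, "o", "_")

# rough base R "translations"
grepl("o", x)                 # ~ str_detect(x, "o")
gsub("o", "_", x)             # ~ str_replace_all(x, "o", "_")
```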

---

# Let's get some data

```r
library(tidyverse) # assumed setup: provides read_csv(), stringr, dplyr, tidyr

spells <- read_csv("data/spells.csv") # from the rcorpora pkg

(incantation <- spells$incantation[1:5])
```

```
## [1] "Accio"     "Aguamenti" "Alohomora" "Anapneo"   "Aparecium"
```

---

```r
# often a useful first step to avoid dealing with capitals
str_to_lower(incantation) 
```

```
## [1] "accio"     "aguamenti" "alohomora" "anapneo"   "aparecium"
```

```r
str_length(incantation)
```

```
## [1] 5 9 9 7 9
```

---

```r
str_detect(incantation, "o") # similar to grepl
```

```
## [1]  TRUE FALSE  TRUE  TRUE FALSE
```

```r
str_subset(incantation, "o")
```

```
## [1] "Accio"     "Alohomora" "Anapneo"
```

```r
str_count(incantation, "o")
```

```
## [1] 1 0 3 1 0
```

---

```r
str_extract_all(incantation, ".o")
```

```
## [[1]]
## [1] "io"
## 
## [[2]]
## character(0)
## 
## [[3]]
## [1] "lo" "ho" "mo"
## 
## [[4]]
## [1] "eo"
## 
## [[5]]
## character(0)
```

---

# Your turn

How many `object`s are there among the effect descriptions?   
Replace `object` with a different noun.

```r
spells
```

```
## # A tibble: 91 x 3
##   incantation effect                     type 
##   <chr>       <chr>                      <chr>
## 1 Accio       Summons an object          Charm
## 2 Aguamenti   Shoots water from wand     Charm
## 3 Alohomora   Opens locked objects       Charm
## 4 Anapneo     Clears the target's airway Spell
## 5 Aparecium   Reveals invisible ink      Spell
## # … with 86 more rows
```

---

```r
# works similarly to `paste()` or `sprintf()`
str_glue('Hermione shouted, "{incantation}!"')
```

```
## Hermione shouted, "Accio!"
## Hermione shouted, "Aguamenti!"
## Hermione shouted, "Alohomora!"
## Hermione shouted, "Anapneo!"
## Hermione shouted, "Aparecium!"
```

---

## regular expressions

---

# How would you extract amounts of money?

>"We'll bet 37 Galleons, 15 Sickles, 3 Knuts"  
"George and I invented them - 7 Sickles each, a bargain!"  
"True, both of them had paid 2 Sickles for a S.P.E.W. badge"  
"And 1000 Galleons prize money!"

--
Pseudocode:

```r
money_types <- c("Galleons", "Sickles", "Knuts")

sentences %>% 
  extract(number_before(money_types), money_types) 
```

---

>"I pull down about 100 sacks of galleons a year!"

Pseudocode:

```r
money_types <- c("Galleons", "Sickles", "Knuts") %>% 
  to_lower_case()

sentences %>% 
  to_lower_case() %>%
  extract(closest_number_before(money_types), money_types)
```
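
One possible way to turn this pseudocode into real stringr code is sketched below. It relies on regex pieces explained on the following slides (brackets, `+`, `*`, and `[^...]`), plus a lazy quantifier (`*?`) that this deck does not cover. The `pattern` construction is an illustrative assumption, not the only approach.

```r
library(stringr)

sentences <- c(
  "We'll bet 37 Galleons, 15 Sickles, 3 Knuts",
  "I pull down about 100 sacks of galleons a year!"
)
money_types <- str_to_lower(c("Galleons", "Sickles", "Knuts"))

# "galleons|sickles|knuts": match any one of the money types
types_regex <- str_c(money_types, collapse = "|")

# a run of digits, then as few non-digits as possible, then a money type
pattern <- str_c("([0-9]+)[^0-9]*?(", types_regex, ")")

# one matrix per sentence: full match, number, money type
amounts <- str_match_all(str_to_lower(sentences), pattern)
amounts
```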

---

# regex helpers

Regex helpers and `str_view()` are your friends

Other resources:

- [regex101](https://regex101.com) - interpret regex
- [regexplain](https://www.garrickadenbuie.com/project/regexplain/) - interpret and write regex (lots of cheatsheets)
- [rex](https://github.com/kevinushey/rex) package - "friendly" regex
- [rebus](https://cran.r-project.org/web/packages/rebus/rebus.pdf) package - "friendly" regex
- [regexpal](https://www.regexpal.com/) - interpret regex

---

# Key expressions

_Note:_ There are often several ways of writing the same regex. For this presentation, I chose my favorite style, which I find the most flexible.

`^` = start  
`$` = end  
`.` = anything

```r
str_subset(c("football", "baseball", "ballroom"), "ball$")
```

```
## [1] "football" "baseball"
```
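
A short sketch of the other two expressions on the same words. The vector name `words` is introduced here only for the example.

```r
library(stringr)

words <- c("football", "baseball", "ballroom")

str_subset(words, "^ball")  # starts with "ball"
str_subset(words, "b.s")    # "b", then any one character, then "s"
```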

---

brackets = _one of_ the characters specified within the brackets (for example, `[one f]` matches an `o`, `n`, `e`, space, or `f`)

`[a-z]` = any lowercase letter  
`[0-9]` = any digit from 0 to 9 (also `\\d`)

variations on the theme:  
`[09-]` = 0, 9, or -
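
To make the difference between these bracket expressions concrete, here is a small sketch on a made-up string (`x` is an assumption for illustration):

```r
library(stringr)

x <- "phone: 09-123"

str_extract_all(x, "[a-z]")[[1]]  # each lowercase letter
str_extract_all(x, "[0-9]")[[1]]  # each digit
str_extract_all(x, "\\d")[[1]]    # the same digits, using the shorthand
str_extract_all(x, "[09-]")[[1]]  # only a 0, a 9, or a literal hyphen
```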

---

# Guess the regex

---

# Numbers of things

`*` = 0 or more  
`+` = 1 or more

`{n}` = exactly `n` number of times  
`{n,}` = at least `n` number of times  
`{n,m}` = between `n` and `m` times
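
A minimal sketch of how the quantifiers behave when combined with `[0-9]`. The example strings are adapted from the quotes earlier in the deck; the vector name `x` is an assumption.

```r
library(stringr)

x <- c("7 Sickles each", "37 Galleons", "1000 Galleons prize money")

str_extract(x, "[0-9]+")      # one or more digits: the whole number
str_extract(x, "[0-9]{2}")    # exactly two digits in a row
str_extract(x, "[0-9]{3,}")   # at least three digits in a row
str_extract(x, "[0-9]{1,2}")  # between one and two digits

# `*` also accepts zero occurrences, so it "matches" even without digits
str_detect("Knuts", "[0-9]*")
str_detect("Knuts", "[0-9]+")
```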

---

# Guess the regex

---

# How do we say NOT?

---

_Inside_ of brackets, a caret ("hat"/`^`) means **NOT**

_Outside_ of brackets, at the beginning of a pattern, it means the string **begins with** what follows
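
The two meanings of the caret can be compared side by side. This is a sketch on made-up strings; `x` is an assumption for illustration.

```r
library(stringr)

x <- c("Accio", "accio!", "42 Knuts")

# outside brackets: begins with a lowercase letter
str_subset(x, "^[a-z]")

# inside brackets: every character that is NOT a lowercase letter
str_extract_all(x, "[^a-z]")

# both at once: begins with something that is not a lowercase letter
str_subset(x, "^[^a-z]")
```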

---

# Which is which?

---

# Working in a dataframe

---

```r
spells
```

---

```r
spells %>% 
  mutate(effect = str_to_lower(effect))
```

```
## # A tibble: 91 x 3
##   incantation effect                     type 
##   <chr>       <chr>                      <chr>
## 1 Accio       summons an object          Charm
## 2 Aguamenti   shoots water from wand     Charm
## 3 Alohomora   opens locked objects       Charm
## 4 Anapneo     clears the target's airway Spell
## 5 Aparecium   reveals invisible ink      Spell
## # … with 86 more rows
```

---

```r
spells %>% 
  mutate(effect = str_to_lower(effect)) %>% 
  mutate(effect = str_split(effect, " "))
```

```
## # A tibble: 91 x 3
##   incantation effect    type 
##   <chr>       <list>    <chr>
## 1 Accio       <chr [3]> Charm
## 2 Aguamenti   <chr [4]> Charm
## 3 Alohomora   <chr [3]> Charm
## 4 Anapneo     <chr [4]> Spell
## 5 Aparecium   <chr [3]> Spell
## # … with 86 more rows
```

### Creates a _list-column_; supports vector results of different lengths

---

```r
spells %>% 
  mutate(effect = str_to_lower(effect)) %>% 
  mutate(effect = str_split(effect, " ")) %>% 
  unnest(effect)
```

```
## # A tibble: 351 x 3
##   incantation effect  type 
##   <chr>       <chr>   <chr>
## 1 Accio       summons Charm
## 2 Accio       an      Charm
## 3 Accio       object  Charm
## 4 Aguamenti   shoots  Charm
## 5 Aguamenti   water   Charm
## # … with 346 more rows
```

### _Note:_ this is very similar to the `unnest_tokens()` function in the [`tidytext`](https://www.tidytextmining.com/) package

---

```r
spells %>% 
  mutate(effect = str_to_lower(effect)) %>% 
  mutate(effect = str_split(effect, " ")) %>% 
  unnest(effect) %>% 
  count(effect, sort = TRUE)
```

```
## # A tibble: 231 x 2
##   effect      n
##   <chr>   <int>
## 1 a          11
## 2 an         11
## 3 to          9
## 4 object      7
## 5 objects     7
## # … with 226 more rows
```

---

# A few more miscellaneous tricks

---

Extract groups within a pattern

```r
phone_numbers <- c("058 222 1234", "054-121 1221")

str_match(phone_numbers, "([0-9]+)[ -]([0-9]+)[ -]([0-9]+)")
```

```
##      [,1]           [,2]  [,3]  [,4]  
## [1,] "058 222 1234" "058" "222" "1234"
## [2,] "054-121 1221" "054" "121" "1221"
```

```r
# same as "(\\d+)[ -](\\d+)[ -](\\d+)"
```

---

![](https://static.boredpanda.com/blog/wp-content/uploads/2016/10/newborn-baby-harry-potter-photo-shoot-kayla-glover-4.jpg)

![](https://imagesvc.meredithcorp.io/v3/mm/image?url=https%3A%2F%2Fewedit.files.wordpress.com%2F2015%2F01%2Fharry-potter_510.jpg&w=400&c=sc&poi=face&q=85)

---

class: inverse, center, middle

Some more practice

---

`str_extract("It does not do to dwell on dreams and forget to live.", )`

---

# Bonus question

How do you extract the currency conversions from this text?

```r
text <- c("There are 29 Knuts in 1 silver Sickle",
          "and there are 493 Knuts in 1 golden Galleon.")
```

Shortcut if you only care about the first number...

```r
parse_number(text)
```

```
## [1]  29 493
```
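
If you need both numbers and both currencies, one possible approach is `str_match()` with a capture group for each piece. This is a sketch that assumes every sentence follows the "N X in M adjective Y" shape seen here; the column names are made up for readability.

```r
library(stringr)

text <- c("There are 29 Knuts in 1 silver Sickle",
          "and there are 493 Knuts in 1 golden Galleon.")

# number, currency, "in", number, a lowercase adjective, currency
conversions <- str_match(
  text,
  "([0-9]+) ([A-Za-z]+) in ([0-9]+) [a-z]+ ([A-Za-z]+)"
)
colnames(conversions) <- c("match", "n_from", "from", "n_to", "to")
conversions

# captured groups are strings; convert when you need numbers
as.numeric(conversions[, "n_from"])
```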

---

# Summary

- strings/characters = things that are quoted in R
- backslashes are used for "escape characters" (`\n`, `\\`)
- use `writeLines()` to show how strings will display
- `stringr` provides a simple, consistent interface to work with strings in R
- regular expressions describe the logic of a pattern in text
- check regex in R with `str_view()`/`str_view_all()`
- use `unnest()` to expand list-columns

Cheatsheets: [Basic regular expressions in R](https://www.rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf), [Working with strings with stringr](https://resources.rstudio.com/rstudio-cheatsheets/stringr-cheat-sheet)

Santa Barbara Eco-Data-Science [text workshop materials](https://github.com/eco-data-science/text_workshop) (covers pdftools, sentiment analysis, etc.)

---

# You are now ready to face the world of strings and regex