How to write loops correctly and other stories

# How to write loops correctly and other stories
### Maria Novosolov
### 01-09-2020

---

# General comments about the upcoming semester

* Emphasis on reading and writing code correctly

* More interactive - you do the work, we are just here to help

* Community participation
    - Slack
    - Code camp

---
class: inverse, center, middle

# Iterating over data in R

---

# Why bother?

* What, you're really going to write out every line?

---

## Three main methods:

* `for` and `while` loops

* `*apply` function family

* `map` function in `tidyverse`

---

# `for` loops

---

# Basic ingrediants of a `for` loop

It will at least have (1) and (2)
]

.pull-right[
<img src="https://cdn.datamentor.io/wp-content/uploads/2017/11/r-for-loop.jpg" width=60%>
]

---

### `for` loop syntax

```r
for(i in 1:10){ #sequence
  print(i)      #body
}
```

```
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
```

---

### Alternative for sequence

Use `seq_along()` as an idiot proof alternative to `1:length(df)` because it will not work if the vector your iterating over is of length zero

```r
#random dataframe to play with
df<- tibble(a = rnorm(10), b = rnorm(10),
            c = rnorm(10), d = rnorm(10))
#print the names of the columns in df
for(i in seq_along(df)){
  print(names(df[i]))
}
```

```
## [1] "a"
## [1] "b"
## [1] "c"
## [1] "d"
```

---

# An R syntax detour

What is the difference between `df[i]` and `df[[i]]`?

--
.pull-left[

```r
i=1
df[i]
```

```
## # A tibble: 10 x 1
##          a
##      <dbl>
##  1 -0.474 
##  2  0.594 
##  3 -1.56  
##  4  0.925 
##  5  1.07  
##  6 -0.463 
##  7 -0.0949
##  8 -2.02  
##  9 -0.773 
## 10 -0.0626
```
]

```r
df[[i]]
```

```
##  [1] -0.47424657  0.59431182 -1.56432478  0.92470994  1.06844556 -0.46284965
##  [7] -0.09488813 -2.02470098 -0.77316958 -0.06262095
```

]

---
class: exercise, center, middle

# Practice time!

Write a for loop that calculates the mean of each column in `mtcars`
Bonus: calculate mean only for the continuous columns

---

In many cases we would like to save the output so we'll add the output

**Remember!** Initiating the output with a known length will save you computing time

If you don't know the length of the output it is worth saving the output into a list and then flattening it out using `unlist` (for vectors) or `dplyr::bind_rows()` (for dataframe rows)
---

# Let's save the mean for each column of mtcars into a vector

```r
col_mean<- vector(mode = "double",length = (length(mtcars)-1))
for(i in seq_along(mtcars)){
  if(is.numeric(mtcars[[i]])==T){
    col_mean[i]<- mean(mtcars[[i]],na.rm=T)
  }
}
print(col_mean)
```

```
##  [1]  20.090625   6.187500 230.721875 146.687500   3.596563   3.217250
##  [7]  17.848750   0.437500   0.406250   3.687500   2.812500
```

---
In the previous case I knew how many numeric columns there are. But what if we don't know?

```r
#create a list with a length as the number of columns in mtcars
col_mean<- vector(mode = "list",length = length(mtcars))
for(i in seq_along(mtcars)){
  if(is.numeric(mtcars[[i]])==T){
    #note that tho save into a list you use double brackets 
    col_mean[[i]]<- mean(mtcars[[i]],na.rm=T)
  }
}
unlist(col_mean)
```

```
##  [1]  20.090625   6.187500 230.721875 146.687500   3.596563   3.217250
##  [7]  17.848750   0.437500   0.406250   3.687500   2.812500
```

---
# Recipe to successful `for` loop writing

1. Write down the general objective for the loop including the data that will be used

**For example:** Find all the characters in the starwars data that are older than X, and add the word "old" to their name.

2. Write in words each of the steps you will be taking to accomplish your objective

3. Write the code that fulfills each step

4. Test the loop on the first iteration
---

# Some more advice

* If the loop is complicated then start with a simplified loop and build on top of it

* If you think you need to write a loop within the loop try and find a way to use `*apply` or write a function that will execute the inner loop.

* If you think you'll be using the loop more than once embed the loop into a function. To read more about functions, check out: https://github.com/globedatasci-ku/functions
---

# Some more advice

* Make comments and documentation for all your code

.center[
<img src="https://img.libquotes.com/pic-quotes/v2/damian-conway-quote-lbx2j9r.jpg" width=60%>
]
---
class: exercise, center, middle

# Let's practice!

I want you to think of an objective that can be used in a loop and then write the steps that you will take to get this objective

---

# Come present your loop architecture!

---

# More practice!

Write the code that fulfills each step of your loop and test that it runs

---

# `while` loops

---

# When to use and general advice

* Use only when you do not know the number of iterations something will take

* Careful!! Writing the `while` loop wrong can lead to infinity

---

# `while` loop syntax

```r
#initiate a counter
i=1
#start the while loop
while(i < 10){
  #do something
  print(i)
  #increase the counter
  i = i+1
}
```

```
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
```
]

.pull-right[
<img src="https://cdn.datamentor.io/wp-content/uploads/2017/11/r-while-loop.jpg" width=60%>
]
---

# Practice time!

Convert the `mtcars` for loop from the examples above into a while loop

---
.center[
#**REMEMBER!**

Use a while loop as your last resort or if you are really certain of why you are using it
]
---

# `*apply` function family

---

# General guidelines

* Allows you to do row-wise/column-wise operations (`apply()`)

* Can be used to iterate over a list (`lapply()`)

* Can be used in parallel (`mcapply()`)

---
---

# `apply()` Examples

### General syntax:

apply(data,margin,function,specific variables of the function)

margins: 1 = row wise; 2 = col wise

```r
#subset the starwars dat to have only height and mass
sub_sw<- starwars[,2:3]
# calculate the mean for each of them
apply(sub_sw,2,mean,na.rm=T)
```

```
##    height      mass 
## 174.35802  97.31186
```

---
# adding a progress bar (`pbapply()`)

Similar to apply but adds a progress bar (can't see in the slide format)

```r
pbapply(sub_sw,2,mean,na.rm=T)
```

```
##    height      mass 
## 174.35802  97.31186
```

---
class: exercise, center, middle
# Time to practice!

Convert part of your loop to use an apply function

---

# `lapply()` Example

general syntax: 
lapply(list, function)

* Returns a list

```r
lapply(starwars$films,length)
```

```
## [[1]]
## [1] 5
## 
## [[2]]
## [1] 6
## 
## [[3]]
## [1] 7
## 
## [[4]]
## [1] 4
## 
## [[5]]
## [1] 5
## 
## [[6]]
## [1] 3
## 
## [[7]]
## [1] 3
## 
## [[8]]
## [1] 1
## 
## [[9]]
## [1] 1
## 
## [[10]]
## [1] 6
## 
## [[11]]
## [1] 3
## 
## [[12]]
## [1] 2
## 
## [[13]]
## [1] 5
## 
## [[14]]
## [1] 4
## 
## [[15]]
## [1] 1
## 
## [[16]]
## [1] 3
## 
## [[17]]
## [1] 3
## 
## [[18]]
## [1] 1
## 
## [[19]]
## [1] 5
## 
## [[20]]
## [1] 5
## 
## [[21]]
## [1] 3
## 
## [[22]]
## [1] 1
## 
## [[23]]
## [1] 1
## 
## [[24]]
## [1] 2
## 
## [[25]]
## [1] 1
## 
## [[26]]
## [1] 2
## 
## [[27]]
## [1] 1
## 
## [[28]]
## [1] 1
## 
## [[29]]
## [1] 1
## 
## [[30]]
## [1] 1
## 
## [[31]]
## [1] 1
## 
## [[32]]
## [1] 3
## 
## [[33]]
## [1] 1
## 
## [[34]]
## [1] 2
## 
## [[35]]
## [1] 1
## 
## [[36]]
## [1] 1
## 
## [[37]]
## [1] 1
## 
## [[38]]
## [1] 2
## 
## [[39]]
## [1] 1
## 
## [[40]]
## [1] 1
## 
## [[41]]
## [1] 2
## 
## [[42]]
## [1] 1
## 
## [[43]]
## [1] 1
## 
## [[44]]
## [1] 3
## 
## [[45]]
## [1] 1
## 
## [[46]]
## [1] 1
## 
## [[47]]
## [1] 1
## 
## [[48]]
## [1] 3
## 
## [[49]]
## [1] 3
## 
## [[50]]
## [1] 3
## 
## [[51]]
## [1] 2
## 
## [[52]]
## [1] 2
## 
## [[53]]
## [1] 2
## 
## [[54]]
## [1] 1
## 
## [[55]]
## [1] 3
## 
## [[56]]
## [1] 2
## 
## [[57]]
## [1] 1
## 
## [[58]]
## [1] 1
## 
## [[59]]
## [1] 1
## 
## [[60]]
## [1] 2
## 
## [[61]]
## [1] 2
## 
## [[62]]
## [1] 1
## 
## [[63]]
## [1] 1
## 
## [[64]]
## [1] 2
## 
## [[65]]
## [1] 2
## 
## [[66]]
## [1] 1
## 
## [[67]]
## [1] 1
## 
## [[68]]
## [1] 1
## 
## [[69]]
## [1] 1
## 
## [[70]]
## [1] 1
## 
## [[71]]
## [1] 1
## 
## [[72]]
## [1] 1
## 
## [[73]]
## [1] 2
## 
## [[74]]
## [1] 1
## 
## [[75]]
## [1] 1
## 
## [[76]]
## [1] 2
## 
## [[77]]
## [1] 1
## 
## [[78]]
## [1] 1
## 
## [[79]]
## [1] 2
## 
## [[80]]
## [1] 2
## 
## [[81]]
## [1] 1
## 
## [[82]]
## [1] 1
## 
## [[83]]
## [1] 1
## 
## [[84]]
## [1] 1
## 
## [[85]]
## [1] 1
## 
## [[86]]
## [1] 1
## 
## [[87]]
## [1] 3
```

---

# With progress bar

```r
pblapply(starwars$films,length)
```

---
class: exercise, center, middle

# Time to practice!

create a list from the continuous variables in your data and use `lapply` to calculate the mean of the continuous variables

---
class: inverse, center, middle

# Loops in Parallel

---

# `foreach` package

* `foreach` is a parallel equivalent to `lapply` function

* Let's you run your analysis faster

* A more flexible version of `mcapply()`

* Needs some understanding whether your analysis is independent

---

# General sytax

In its base `foreach` works like `lapply`

```r
foreach(i = 1:3) %do% {
  sqrt(i)
}
```

```
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 1.414214
## 
## [[3]]
## [1] 1.732051
```

---

# To make it into a vector

```r
foreach(i = 1:3,.combine ='c') %do% {
  sqrt(i)
}
```

```
## [1] 1.000000 1.414214 1.732051
```

---

# or

```r
#save the results into a variable
res<- foreach(i = 1:3) %do% {
  sqrt(i)
}

#use do.call to combine it
do.call('c', res)
```

```
## [1] 1.000000 1.414214 1.732051
```

---

# An R syntax detour - the `do.call()` function

* Takes a function and gives it a list of arguments
* What, you mean like `lapply()`?
    - Nope, do.call can collapse lists into a *single vector*
    - `lapply` always returns a list.
* Also useful for collapsing lists using `rbind` and `cbind`

---

# An R syntax detour - the `do.call()` function

```r
lapply(starwars$films[8:10], 
       c)
```

```
## [[1]]
## [1] "A New Hope"
## 
## [[2]]
## [1] "A New Hope"
## 
## [[3]]
## [1] "The Empire Strikes Back" "Attack of the Clones"   
## [3] "The Phantom Menace"      "Revenge of the Sith"    
## [5] "Return of the Jedi"      "A New Hope"
```

```r
do.call(c, 
        starwars$films[8:10])
```

```
## [1] "A New Hope"              "A New Hope"             
## [3] "The Empire Strikes Back" "Attack of the Clones"   
## [5] "The Phantom Menace"      "Revenge of the Sith"    
## [7] "Return of the Jedi"      "A New Hope"
```

---
class: center, middle

# Let's put things in parallel

---
# General idea and guidelines

* Use the several cores in your computer to run your analysis in parallel

* Make sure that the iterations are independent of each other

* Parallel works a bit different between Windows and Mac. If you want to read more about it: https://psu-psychology.github.io/r-bootcamp-2018/talks/parallel_r.html

---

# Run `foreach` in parallel

* replace `%do%` with `%dopar%`

* Register a parallel back end - with one of the packages `doParallel`, `doMC`, or `doMPI`

---

# Easy example on registering clusters

```r
# define the numer of cores you're going to use
cl <- parallel::makeCluster(2)

# register them
doParallel::registerDoParallel(cl)

# run the loop in parallel
foreach(i = 1:3, .combine = 'c') %dopar% {
  sqrt(i)
}
```

```
## [1] 1.000000 1.414214 1.732051
```

```r
parallel::stopCluster(cl)
```

---

# Lets compare `lapply` to `foreach`

```r
head(
  unlist(
    lapply(starwars$films,
              length)))
```

```
## [1] 5 6 7 4 5 3
```
]

```r
# define the number of cores you're going to use
cl <- parallel::makeCluster(2)

# register them
doParallel::registerDoParallel(cl)

# run the loop in parallel
head(
  foreach(i = seq_along(starwars$films),
          .combine = 'c') %dopar% {
  length(starwars$films[[i]])
})
```

```
## [1] 5 6 7 4 5 3
```

```r
parallel::stopCluster(cl)
```
]

---
### If you want to dive deeper into working with `foreach`check out this blog post

https://privefl.github.io/blog/a-guide-to-parallelism-in-r/

---
class: exercise, center, middle

# Let's practice!

Convert the `mtcars` for loop to a foreach loop. Bonus, make it run in parallel
---

## Non-parallel version

```r
foreach(i = seq_along(mtcars),.combine = 'c') %do% {
            if(is.numeric(mtcars[[i]])==T){
              mean(mtcars[[i]],na.rm=T)
            }
}
```

```
##  [1]  20.090625   6.187500 230.721875 146.687500   3.596563   3.217250
##  [7]  17.848750   0.437500   0.406250   3.687500   2.812500
```

# parallel version

```r
# define the number of cores you're going to use
cl <- parallel::makeCluster(2)

# register them
doParallel::registerDoParallel(cl)
foreach(i = seq_along(mtcars),.combine = 'c') %dopar% {
            if(is.numeric(mtcars[[i]])==T){
              mean(mtcars[[i]],na.rm=T)
            }
}
```

```
##  [1]  20.090625   6.187500 230.721875 146.687500   3.596563   3.217250
##  [7]  17.848750   0.437500   0.406250   3.687500   2.812500
```

```r
parallel::stopCluster(cl)
```

---

# `pblapply` a parallel lapply version

```r
#define the number of clusters and register the cluster
cl <- parallel::makeCluster(2)
doParallel::registerDoParallel(cl)
#subset the mtcars to have only continuous columns
mtcars2<- mtcars[,1:8]

#run pbapply
pbapply(mtcars2,2,mean,cl=cl)
```

```
##        mpg        cyl       disp         hp       drat         wt       qsec 
##  20.090625   6.187500 230.721875 146.687500   3.596563   3.217250  17.848750 
##         vs 
##   0.437500
```

```r
#stop the cluster
parallel::stopCluster(cl)
```