class: center, middle, inverse, title-slide # How to write loops correctly and other stories ### Maria Novosolov ### 01-09-2020 --- # General comments about the upcoming semester * Emphasis on reading and writing code correctly * More interactive - you do the work, we are just here to help * Community participation - Slack - Code camp --- class: inverse, center, middle # Iterating over data in R --- # Why bother? * What, you're really going to write out every line? <img src="http://cdn3.whatculture.com/images/2015/04/Simpsons-Milestone-600x300.jpg" width=60%> --- ## Three main methods: * `for` and `while` loops * `*apply` function family * `map` function in `tidyverse` --- class: inverse, center, middle # `for` loops --- # Basic ingrediants of a `for` loop .pull-left[ A `for` loop has: 1. Sequence 2. Body 3. Output It will at least have (1) and (2) ] .pull-right[ <img src="https://cdn.datamentor.io/wp-content/uploads/2017/11/r-for-loop.jpg" width=60%> ] --- ### `for` loop syntax ```r for(i in 1:10){ #sequence print(i) #body } ``` ``` ## [1] 1 ## [1] 2 ## [1] 3 ## [1] 4 ## [1] 5 ## [1] 6 ## [1] 7 ## [1] 8 ## [1] 9 ## [1] 10 ``` --- ### Alternative for sequence Use `seq_along()` as an idiot proof alternative to `1:length(df)` because it will not work if the vector your iterating over is of length zero ```r #random dataframe to play with df<- tibble(a = rnorm(10), b = rnorm(10), c = rnorm(10), d = rnorm(10)) #print the names of the columns in df for(i in seq_along(df)){ print(names(df[i])) } ``` ``` ## [1] "a" ## [1] "b" ## [1] "c" ## [1] "d" ``` --- # An R syntax detour What is the difference between `df[i]` and `df[[i]]`? -- .pull-left[ ```r i=1 df[i] ``` ``` ## # A tibble: 10 x 1 ## a ## <dbl> ## 1 -0.474 ## 2 0.594 ## 3 -1.56 ## 4 0.925 ## 5 1.07 ## 6 -0.463 ## 7 -0.0949 ## 8 -2.02 ## 9 -0.773 ## 10 -0.0626 ``` ] .pull-right[ ```r df[[i]] ``` ``` ## [1] -0.47424657 0.59431182 -1.56432478 0.92470994 1.06844556 -0.46284965 ## [7] -0.09488813 -2.02470098 -0.77316958 -0.06262095 ``` ] --- class: exercise, center, middle # Practice time! Write a for loop that calculates the mean of each column in `mtcars` Bonus: calculate mean only for the continuous columns --- In many cases we would like to save the output so we'll add the output **Remember!** Initiating the output with a known length will save you computing time If you don't know the length of the output it is worth saving the output into a list and then flattening it out using `unlist` (for vectors) or `dplyr::bind_rows()` (for dataframe rows) --- # Let's save the mean for each column of mtcars into a vector ```r col_mean<- vector(mode = "double",length = (length(mtcars)-1)) for(i in seq_along(mtcars)){ if(is.numeric(mtcars[[i]])==T){ col_mean[i]<- mean(mtcars[[i]],na.rm=T) } } print(col_mean) ``` ``` ## [1] 20.090625 6.187500 230.721875 146.687500 3.596563 3.217250 ## [7] 17.848750 0.437500 0.406250 3.687500 2.812500 ``` --- In the previous case I knew how many numeric columns there are. But what if we don't know? ```r #create a list with a length as the number of columns in mtcars col_mean<- vector(mode = "list",length = length(mtcars)) for(i in seq_along(mtcars)){ if(is.numeric(mtcars[[i]])==T){ #note that tho save into a list you use double brackets col_mean[[i]]<- mean(mtcars[[i]],na.rm=T) } } unlist(col_mean) ``` ``` ## [1] 20.090625 6.187500 230.721875 146.687500 3.596563 3.217250 ## [7] 17.848750 0.437500 0.406250 3.687500 2.812500 ``` --- # Recipe to successful `for` loop writing 1. Write down the general objective for the loop including the data that will be used **For example:** Find all the characters in the starwars data that are older than X, and add the word "old" to their name. 2. Write in words each of the steps you will be taking to accomplish your objective 3. Write the code that fulfills each step 4. Test the loop on the first iteration --- # Some more advice * If the loop is complicated then start with a simplified loop and build on top of it * If you think you need to write a loop within the loop try and find a way to use `*apply` or write a function that will execute the inner loop. * If you think you'll be using the loop more than once embed the loop into a function. To read more about functions, check out: https://github.com/globedatasci-ku/functions --- # Some more advice * Make comments and documentation for all your code .center[ <img src="https://img.libquotes.com/pic-quotes/v2/damian-conway-quote-lbx2j9r.jpg" width=60%> ] --- class: exercise, center, middle # Let's practice! I want you to think of an objective that can be used in a loop and then write the steps that you will take to get this objective --- class: center, middle # Come present your loop architecture! --- class: exercise, center, middle # More practice! Write the code that fulfills each step of your loop and test that it runs --- class: inverse, center, middle # `while` loops --- # When to use and general advice * Use only when you do not know the number of iterations something will take * Careful!! Writing the `while` loop wrong can lead to infinity <img src="img/while_loop2.png" width=40%> --- # `while` loop syntax .pull-left[ ```r #initiate a counter i=1 #start the while loop while(i < 10){ #do something print(i) #increase the counter i = i+1 } ``` ``` ## [1] 1 ## [1] 2 ## [1] 3 ## [1] 4 ## [1] 5 ## [1] 6 ## [1] 7 ## [1] 8 ## [1] 9 ``` ] .pull-right[ <img src="https://cdn.datamentor.io/wp-content/uploads/2017/11/r-while-loop.jpg" width=60%> ] --- class: exercise, center, middle # Practice time! Convert the `mtcars` for loop from the examples above into a while loop --- .center[ #**REMEMBER!** Use a while loop as your last resort or if you are really certain of why you are using it ] --- class: inverse, center, middle # `*apply` function family --- # General guidelines * Allows you to do row-wise/column-wise operations (`apply()`) * Can be used to iterate over a list (`lapply()`) * Can be used in parallel (`mcapply()`) --- --- # `apply()` Examples ### General syntax: apply(data,margin,function,specific variables of the function) margins: 1 = row wise; 2 = col wise ```r #subset the starwars dat to have only height and mass sub_sw<- starwars[,2:3] # calculate the mean for each of them apply(sub_sw,2,mean,na.rm=T) ``` ``` ## height mass ## 174.35802 97.31186 ``` --- # adding a progress bar (`pbapply()`) Similar to apply but adds a progress bar (can't see in the slide format) ```r pbapply(sub_sw,2,mean,na.rm=T) ``` ``` ## height mass ## 174.35802 97.31186 ``` --- class: exercise, center, middle # Time to practice! Convert part of your loop to use an apply function --- # `lapply()` Example general syntax: lapply(list, function) * Returns a list ```r lapply(starwars$films,length) ``` ``` ## [[1]] ## [1] 5 ## ## [[2]] ## [1] 6 ## ## [[3]] ## [1] 7 ## ## [[4]] ## [1] 4 ## ## [[5]] ## [1] 5 ## ## [[6]] ## [1] 3 ## ## [[7]] ## [1] 3 ## ## [[8]] ## [1] 1 ## ## [[9]] ## [1] 1 ## ## [[10]] ## [1] 6 ## ## [[11]] ## [1] 3 ## ## [[12]] ## [1] 2 ## ## [[13]] ## [1] 5 ## ## [[14]] ## [1] 4 ## ## [[15]] ## [1] 1 ## ## [[16]] ## [1] 3 ## ## [[17]] ## [1] 3 ## ## [[18]] ## [1] 1 ## ## [[19]] ## [1] 5 ## ## [[20]] ## [1] 5 ## ## [[21]] ## [1] 3 ## ## [[22]] ## [1] 1 ## ## [[23]] ## [1] 1 ## ## [[24]] ## [1] 2 ## ## [[25]] ## [1] 1 ## ## [[26]] ## [1] 2 ## ## [[27]] ## [1] 1 ## ## [[28]] ## [1] 1 ## ## [[29]] ## [1] 1 ## ## [[30]] ## [1] 1 ## ## [[31]] ## [1] 1 ## ## [[32]] ## [1] 3 ## ## [[33]] ## [1] 1 ## ## [[34]] ## [1] 2 ## ## [[35]] ## [1] 1 ## ## [[36]] ## [1] 1 ## ## [[37]] ## [1] 1 ## ## [[38]] ## [1] 2 ## ## [[39]] ## [1] 1 ## ## [[40]] ## [1] 1 ## ## [[41]] ## [1] 2 ## ## [[42]] ## [1] 1 ## ## [[43]] ## [1] 1 ## ## [[44]] ## [1] 3 ## ## [[45]] ## [1] 1 ## ## [[46]] ## [1] 1 ## ## [[47]] ## [1] 1 ## ## [[48]] ## [1] 3 ## ## [[49]] ## [1] 3 ## ## [[50]] ## [1] 3 ## ## [[51]] ## [1] 2 ## ## [[52]] ## [1] 2 ## ## [[53]] ## [1] 2 ## ## [[54]] ## [1] 1 ## ## [[55]] ## [1] 3 ## ## [[56]] ## [1] 2 ## ## [[57]] ## [1] 1 ## ## [[58]] ## [1] 1 ## ## [[59]] ## [1] 1 ## ## [[60]] ## [1] 2 ## ## [[61]] ## [1] 2 ## ## [[62]] ## [1] 1 ## ## [[63]] ## [1] 1 ## ## [[64]] ## [1] 2 ## ## [[65]] ## [1] 2 ## ## [[66]] ## [1] 1 ## ## [[67]] ## [1] 1 ## ## [[68]] ## [1] 1 ## ## [[69]] ## [1] 1 ## ## [[70]] ## [1] 1 ## ## [[71]] ## [1] 1 ## ## [[72]] ## [1] 1 ## ## [[73]] ## [1] 2 ## ## [[74]] ## [1] 1 ## ## [[75]] ## [1] 1 ## ## [[76]] ## [1] 2 ## ## [[77]] ## [1] 1 ## ## [[78]] ## [1] 1 ## ## [[79]] ## [1] 2 ## ## [[80]] ## [1] 2 ## ## [[81]] ## [1] 1 ## ## [[82]] ## [1] 1 ## ## [[83]] ## [1] 1 ## ## [[84]] ## [1] 1 ## ## [[85]] ## [1] 1 ## ## [[86]] ## [1] 1 ## ## [[87]] ## [1] 3 ``` --- # With progress bar ```r pblapply(starwars$films,length) ``` ``` ## [[1]] ## [1] 5 ## ## [[2]] ## [1] 6 ## ## [[3]] ## [1] 7 ## ## [[4]] ## [1] 4 ## ## [[5]] ## [1] 5 ## ## [[6]] ## [1] 3 ## ## [[7]] ## [1] 3 ## ## [[8]] ## [1] 1 ## ## [[9]] ## [1] 1 ## ## [[10]] ## [1] 6 ## ## [[11]] ## [1] 3 ## ## [[12]] ## [1] 2 ## ## [[13]] ## [1] 5 ## ## [[14]] ## [1] 4 ## ## [[15]] ## [1] 1 ## ## [[16]] ## [1] 3 ## ## [[17]] ## [1] 3 ## ## [[18]] ## [1] 1 ## ## [[19]] ## [1] 5 ## ## [[20]] ## [1] 5 ## ## [[21]] ## [1] 3 ## ## [[22]] ## [1] 1 ## ## [[23]] ## [1] 1 ## ## [[24]] ## [1] 2 ## ## [[25]] ## [1] 1 ## ## [[26]] ## [1] 2 ## ## [[27]] ## [1] 1 ## ## [[28]] ## [1] 1 ## ## [[29]] ## [1] 1 ## ## [[30]] ## [1] 1 ## ## [[31]] ## [1] 1 ## ## [[32]] ## [1] 3 ## ## [[33]] ## [1] 1 ## ## [[34]] ## [1] 2 ## ## [[35]] ## [1] 1 ## ## [[36]] ## [1] 1 ## ## [[37]] ## [1] 1 ## ## [[38]] ## [1] 2 ## ## [[39]] ## [1] 1 ## ## [[40]] ## [1] 1 ## ## [[41]] ## [1] 2 ## ## [[42]] ## [1] 1 ## ## [[43]] ## [1] 1 ## ## [[44]] ## [1] 3 ## ## [[45]] ## [1] 1 ## ## [[46]] ## [1] 1 ## ## [[47]] ## [1] 1 ## ## [[48]] ## [1] 3 ## ## [[49]] ## [1] 3 ## ## [[50]] ## [1] 3 ## ## [[51]] ## [1] 2 ## ## [[52]] ## [1] 2 ## ## [[53]] ## [1] 2 ## ## [[54]] ## [1] 1 ## ## [[55]] ## [1] 3 ## ## [[56]] ## [1] 2 ## ## [[57]] ## [1] 1 ## ## [[58]] ## [1] 1 ## ## [[59]] ## [1] 1 ## ## [[60]] ## [1] 2 ## ## [[61]] ## [1] 2 ## ## [[62]] ## [1] 1 ## ## [[63]] ## [1] 1 ## ## [[64]] ## [1] 2 ## ## [[65]] ## [1] 2 ## ## [[66]] ## [1] 1 ## ## [[67]] ## [1] 1 ## ## [[68]] ## [1] 1 ## ## [[69]] ## [1] 1 ## ## [[70]] ## [1] 1 ## ## [[71]] ## [1] 1 ## ## [[72]] ## [1] 1 ## ## [[73]] ## [1] 2 ## ## [[74]] ## [1] 1 ## ## [[75]] ## [1] 1 ## ## [[76]] ## [1] 2 ## ## [[77]] ## [1] 1 ## ## [[78]] ## [1] 1 ## ## [[79]] ## [1] 2 ## ## [[80]] ## [1] 2 ## ## [[81]] ## [1] 1 ## ## [[82]] ## [1] 1 ## ## [[83]] ## [1] 1 ## ## [[84]] ## [1] 1 ## ## [[85]] ## [1] 1 ## ## [[86]] ## [1] 1 ## ## [[87]] ## [1] 3 ``` --- class: exercise, center, middle # Time to practice! create a list from the continuous variables in your data and use `lapply` to calculate the mean of the continuous variables --- class: inverse, center, middle # Loops in Parallel --- # `foreach` package * `foreach` is a parallel equivalent to `lapply` function * Let's you run your analysis faster * A more flexible version of `mcapply()` * Needs some understanding whether your analysis is independent --- # General sytax In its base `foreach` works like `lapply` ```r foreach(i = 1:3) %do% { sqrt(i) } ``` ``` ## [[1]] ## [1] 1 ## ## [[2]] ## [1] 1.414214 ## ## [[3]] ## [1] 1.732051 ``` --- # To make it into a vector ```r foreach(i = 1:3,.combine ='c') %do% { sqrt(i) } ``` ``` ## [1] 1.000000 1.414214 1.732051 ``` --- # or ```r #save the results into a variable res<- foreach(i = 1:3) %do% { sqrt(i) } #use do.call to combine it do.call('c', res) ``` ``` ## [1] 1.000000 1.414214 1.732051 ``` --- # An R syntax detour - the `do.call()` function * Takes a function and gives it a list of arguments * What, you mean like `lapply()`? - Nope, do.call can collapse lists into a *single vector* - `lapply` always returns a list. * Also useful for collapsing lists using `rbind` and `cbind` --- # An R syntax detour - the `do.call()` function ```r lapply(starwars$films[8:10], c) ``` ``` ## [[1]] ## [1] "A New Hope" ## ## [[2]] ## [1] "A New Hope" ## ## [[3]] ## [1] "The Empire Strikes Back" "Attack of the Clones" ## [3] "The Phantom Menace" "Revenge of the Sith" ## [5] "Return of the Jedi" "A New Hope" ``` ```r do.call(c, starwars$films[8:10]) ``` ``` ## [1] "A New Hope" "A New Hope" ## [3] "The Empire Strikes Back" "Attack of the Clones" ## [5] "The Phantom Menace" "Revenge of the Sith" ## [7] "Return of the Jedi" "A New Hope" ``` --- class: center, middle # Let's put things in parallel --- # General idea and guidelines * Use the several cores in your computer to run your analysis in parallel * Make sure that the iterations are independent of each other * Parallel works a bit different between Windows and Mac. If you want to read more about it: https://psu-psychology.github.io/r-bootcamp-2018/talks/parallel_r.html --- # Run `foreach` in parallel * replace `%do%` with `%dopar%` * Register a parallel back end - with one of the packages `doParallel`, `doMC`, or `doMPI` --- # Easy example on registering clusters ```r # define the numer of cores you're going to use cl <- parallel::makeCluster(2) # register them doParallel::registerDoParallel(cl) # run the loop in parallel foreach(i = 1:3, .combine = 'c') %dopar% { sqrt(i) } ``` ``` ## [1] 1.000000 1.414214 1.732051 ``` ```r parallel::stopCluster(cl) ``` --- # Lets compare `lapply` to `foreach` .pull-left[ ```r head( unlist( lapply(starwars$films, length))) ``` ``` ## [1] 5 6 7 4 5 3 ``` ] .pull-right[ ```r # define the number of cores you're going to use cl <- parallel::makeCluster(2) # register them doParallel::registerDoParallel(cl) # run the loop in parallel head( foreach(i = seq_along(starwars$films), .combine = 'c') %dopar% { length(starwars$films[[i]]) }) ``` ``` ## [1] 5 6 7 4 5 3 ``` ```r parallel::stopCluster(cl) ``` ] --- ### If you want to dive deeper into working with `foreach`check out this blog post https://privefl.github.io/blog/a-guide-to-parallelism-in-r/ --- class: exercise, center, middle # Let's practice! Convert the `mtcars` for loop to a foreach loop. Bonus, make it run in parallel --- ## Non-parallel version ```r foreach(i = seq_along(mtcars),.combine = 'c') %do% { if(is.numeric(mtcars[[i]])==T){ mean(mtcars[[i]],na.rm=T) } } ``` ``` ## [1] 20.090625 6.187500 230.721875 146.687500 3.596563 3.217250 ## [7] 17.848750 0.437500 0.406250 3.687500 2.812500 ``` # parallel version ```r # define the number of cores you're going to use cl <- parallel::makeCluster(2) # register them doParallel::registerDoParallel(cl) foreach(i = seq_along(mtcars),.combine = 'c') %dopar% { if(is.numeric(mtcars[[i]])==T){ mean(mtcars[[i]],na.rm=T) } } ``` ``` ## [1] 20.090625 6.187500 230.721875 146.687500 3.596563 3.217250 ## [7] 17.848750 0.437500 0.406250 3.687500 2.812500 ``` ```r parallel::stopCluster(cl) ``` --- # `pblapply` a parallel lapply version ```r #define the number of clusters and register the cluster cl <- parallel::makeCluster(2) doParallel::registerDoParallel(cl) #subset the mtcars to have only continuous columns mtcars2<- mtcars[,1:8] #run pbapply pbapply(mtcars2,2,mean,cl=cl) ``` ``` ## mpg cyl disp hp drat wt qsec ## 20.090625 6.187500 230.721875 146.687500 3.596563 3.217250 17.848750 ## vs ## 0.437500 ``` ```r #stop the cluster parallel::stopCluster(cl) ```