for loop

`for` or `apply`?

A for loop is used to apply the same function calls to a collection of objects. R has a family of functions, the apply family, which can be used in much the same way. You’ve already used one member of the family, apply, in the first lesson. The apply family members include:

apply - apply over the margins of an array (e.g. the rows or columns of a matrix)
lapply - apply over an object and return list
sapply - apply over an object and return a simplified object (an array) if possible
vapply - similar to sapply but you specify the type of object returned by the iterations
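
As a rough illustration of how the return types differ, the sketch below calls each family member on small made-up inputs. The objects `m` and `scores` are invented for this example and are not part of the lesson's data.

```r
# Hypothetical data, for illustration only
m <- matrix(1:6, nrow = 2)            # 2 rows, 3 columns
scores <- list(a = 1:3, b = 4:6)

row_sums   <- apply(m, 1, sum)        # margin 1 = rows: one sum per row
as_list    <- lapply(scores, mean)    # always returns a list
simplified <- sapply(scores, mean)    # simplified to a named numeric vector
# vapply stops with an error if any result is not a single numeric value
checked    <- vapply(scores, mean, FUN.VALUE = numeric(1))
```

Here `sapply` and `vapply` give the same vector; the difference is that `vapply` checks every result against `FUN.VALUE` instead of guessing the output shape.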

Each of these has an argument FUN which takes a function to apply to each element of the object. Instead of looping over filenames and calling analyze, as you did earlier, you could sapply over filenames with FUN = analyze:

sapply(filenames, FUN = analyze)

Deciding whether to use for or one of the apply family is largely a matter of personal preference. Using an apply family function forces you to encapsulate your operations as a function rather than as separate calls inside a for loop. A for loop is more natural in some circumstances: for several related operations, it saves you from having to pass a lot of extra arguments to your function.
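
To make the comparison concrete, here is a minimal sketch of the same computation written both ways. The helper `scale_by`, its `multiplier` argument, and the vector `values` are hypothetical names chosen for this example.

```r
# Hypothetical helper and data, used only for this comparison
scale_by <- function(x, multiplier) x * multiplier
values <- c(2, 4, 6)

# for loop: call the function directly inside the loop body
out_for <- numeric(length(values))
for (i in seq_along(values)) {
  out_for[i] <- scale_by(values[i], multiplier = 10)
}

# sapply: extra arguments after FUN are forwarded to each call of FUN
out_sapply <- sapply(values, FUN = scale_by, multiplier = 10)
```

Both versions produce the same vector. With `sapply`, every extra input has to travel through the function's arguments, which is the trade-off described above.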

Loops in R Are Slow

No, they are not, as long as you follow some golden rules:

Don’t use a loop when a vectorized alternative exists
Don’t grow objects (via c, cbind, etc) during the loop - R has to create a new object and copy across the information just to add a new element or row/column
Allocate an object to hold the results and fill it in during the loop
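
The first rule can be sketched with a small made-up vector `x` (an assumption for this example): the loop and the vectorized expression give the same answer, but the vectorized form needs no loop at all.

```r
x <- c(1.5, 2.5, 3.5)   # made-up values

# Loop version: allocate the result first, then fill it in
doubled_loop <- numeric(length(x))
for (i in seq_along(x)) {
  doubled_loop[i] <- x[i] * 2
}

# Vectorized version: arithmetic in R already works element-wise
doubled_vec <- x * 2
```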

As an example, we’ll create a new version of analyze that will return the mean inflammation per day (column) of each file.

analyze2 <- function(filenames) {
  for (f in seq_along(filenames)) {
    fdata <- read.csv(filenames[f], header = FALSE)
    res <- apply(fdata, 2, mean)
    if (f == 1) {
      out <- res
    } else {
      # The loop is slowed by this call to cbind that grows the object
      out <- cbind(out, res)
    }
  }
  return(out)
}

system.time(avg2 <- analyze2(filenames))

   user  system elapsed 
  0.027   0.000   0.026 

Note how we add a new column to out at each iteration. This is a cardinal sin of writing a for loop in R.

Instead, we can create an empty matrix with the right dimensions (rows/columns) to hold the results. Then we loop over the files but this time we fill in the fth column of our results matrix out. This time there is no copying/growing for R to deal with.

analyze3 <- function(filenames) {
  out <- matrix(ncol = length(filenames), nrow = 40) # assuming 40 here from files
  for (f in seq_along(filenames)) {
    fdata <- read.csv(filenames[f], header = FALSE)
    out[, f] <- apply(fdata, 2, mean)
  }
  return(out)
}

system.time(avg3 <- analyze3(filenames))

   user  system elapsed 
  0.024   0.000   0.024 

In this simple example there is little difference in the compute time of analyze2 and analyze3. This is because we are only iterating over 12 files and hence we only incur 12 copy/grow operations. If we were doing this over more files or the data objects we were growing were larger, the penalty for copying/growing would be much larger.
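
To see the penalty grow, one option is to repeat the comparison with many more iterations. The sketch below does not read any files; it fills each column with a made-up constant, and `grow`, `prealloc`, and the size `n` are hypothetical names and values chosen for illustration. Timings will vary by machine, so none are shown.

```r
n <- 2000   # arbitrary number of columns, chosen for illustration

grow <- function(n) {
  out <- NULL
  for (i in seq_len(n)) {
    out <- cbind(out, rep(i, 40))   # builds a new, larger matrix every iteration
  }
  out
}

prealloc <- function(n) {
  out <- matrix(NA_real_, nrow = 40, ncol = n)
  for (i in seq_len(n)) {
    out[, i] <- rep(i, 40)          # fills a column of the existing matrix
  }
  out
}

system.time(grown <- grow(n))
system.time(filled <- prealloc(n))
```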

Note that apply handles these memory allocation issues for you, but then you have to write the loop part as a function to pass to apply. At its heart, apply is just a for loop with extra convenience.
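
As a sketch of that idea, the loop body of analyze3 can be moved into a function and handed to vapply. The names `analyze4` and `column_means` are hypothetical, and the code keeps the same assumption as analyze3 that every file has 40 columns.

```r
# Hypothetical analyze4: same values as analyze3, with the loop body as a function
analyze4 <- function(filenames) {
  column_means <- function(f) {
    fdata <- read.csv(f, header = FALSE)
    apply(fdata, 2, mean)
  }
  # vapply allocates the result matrix; FUN.VALUE assumes 40 columns per file
  vapply(filenames, column_means, FUN.VALUE = numeric(40), USE.NAMES = FALSE)
}
```

The result has one column per file, as in analyze3, although its rows may carry the column names that read.csv assigns.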

Writing a simple for loop in R

Let’s get back to the conceptual meaning of a loop. Suppose you want to do several printouts of the following form: The year is [year] where [year] is equal to 2010, 2011, up to 2015. You can do this as follows:

print(paste("The year is", 2010))
[1] "The year is 2010"
print(paste("The year is", 2011))
[1] "The year is 2011"
print(paste("The year is", 2012))
[1] "The year is 2012"
print(paste("The year is", 2013))
[1] "The year is 2013"
print(paste("The year is", 2014))
[1] "The year is 2014"
print(paste("The year is", 2015))
[1] "The year is 2015"

You can immediately see that this is rather tedious: you repeat the same code chunk over and over. This violates the DRY principle (Don’t Repeat Yourself), which applies in every programming language. In this case, a for loop in R lets you automate the repetitive part:

for (year in c(2010,2011,2012,2013,2014,2015)){
  print(paste("The year is", year))
}
[1] "The year is 2010"
[1] "The year is 2011"
[1] "The year is 2012"
[1] "The year is 2013"
[1] "The year is 2014"
[1] "The year is 2015"

The best way to understand what is going on in the for loop is to read it as follows: “For each year that is in the sequence c(2010,2011,2012,2013,2014,2015), execute the code chunk print(paste("The year is", year))”. Once the for loop has executed the code chunk for every year in the vector, the loop stops and R moves on to the first instruction after the loop block.

See how we did that? By using a for loop you only need to write down your code chunk once (instead of six times). The for loop then runs the statement once for each provided value (the different years we provided) and sets the variable (year in this case) to that value. You can simplify the code even further: c(2010,2011,2012,2013,2014,2015) can also be written as 2010:2015, which creates the same sequence of values:

for (year in 2010:2015){
  print(paste("The year is", year))
}
[1] "The year is 2010"
[1] "The year is 2011"
[1] "The year is 2012"
[1] "The year is 2013"
[1] "The year is 2014"
[1] "The year is 2015"

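One detail worth checking: the two ways of writing the sequence hold the same values but store them with different types, which makes no difference to the printed text here. The variable names `years_c` and `years_seq` below are chosen for this example.

```r
years_c   <- c(2010, 2011, 2012, 2013, 2014, 2015)
years_seq <- 2010:2015

same_values <- all(years_c == years_seq)   # TRUE: same values in the same order
type_c      <- typeof(years_c)             # "double"
type_seq    <- typeof(years_seq)           # "integer"
```
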
As a last note on the for loop in R: in this case we made use of the variable year but in fact any variable could be used here. For example you could have used i, a commonly-used variable in for loops that stands for index:

for (i in 2010:2015){
  print(paste("The year is", i))
}
[1] "The year is 2010"
[1] "The year is 2011"
[1] "The year is 2012"
[1] "The year is 2013"
[1] "The year is 2014"
[1] "The year is 2015"

This produces exactly the same output. So you can name the variable any way you want, but the code is easier to understand if you use meaningful names.

Let’s have a look at a more mathematical example. Suppose you need to print all odd numbers between 1 and 10, skipping the even ones. In that case your loop would look like this:

for (i in 1:10) {
  if (!i %% 2) {
    next
  }
  print(i)
}
[1] 1
[1] 3
[1] 5
[1] 7
[1] 9

Notice the introduction of the next statement. Let’s explore the meaning of this statement by walking through the loop together:

The loop runs once for each value of i from 1 to 10 and stops after the last value. On each iteration we check whether i is even. If i has a remainder of zero when divided by 2 (that’s why we use the modulus operator %%), then i %% 2 is 0, which R treats as FALSE, so !i %% 2 evaluates to TRUE and we enter the if statement. There we meet the next statement, which jumps straight back to the top of the loop with the next value of i, skipping the instructions that follow (here, print(i)). If the remainder is non-zero, i is odd, the condition evaluates to FALSE, and we skip the if block, execute print(i) and loop back.
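
The condition can be checked by hand. In the sketch below, the variable names `skip_four` and `skip_three` are chosen for this example; the second half rewrites the loop with an explicit comparison, which some readers find easier to follow.

```r
# %% binds more tightly than !, so !i %% 2 is read as !(i %% 2)
skip_four  <- !4 %% 2   # TRUE: remainder 0 counts as FALSE, negated to TRUE
skip_three <- !3 %% 2   # FALSE: remainder 1 counts as TRUE, negated to FALSE

# The same loop with the test for even numbers spelled out
for (i in 1:10) {
  if (i %% 2 == 0) {
    next          # even number: skip the print below
  }
  print(i)
}
```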
