for loop
for or apply?
A for loop is used to apply the same function calls to a collection of objects. R has a family of functions, the apply family, which can be used in much the same way. You've already used one of the family, apply, in the first lesson. The apply family members include
- apply - apply over the margins of an array (e.g. the rows or columns of a matrix)
- lapply - apply over an object and return a list
- sapply - apply over an object and return a simplified object (an array) if possible
- vapply - similar to sapply, but you specify the type of object returned by the iterations
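To make the differences between the family members concrete, here is a minimal sketch on toy data. The matrix m and the list scores are made up for illustration and are not part of the lesson's inflammation data.

```r
# A small matrix to apply over: 3 rows, 4 columns, filled column by column
m <- matrix(1:12, nrow = 3)

# apply: work over a margin (1 = rows, 2 = columns)
row_sums  <- apply(m, 1, sum)
col_means <- apply(m, 2, mean)

# A hypothetical list of numeric vectors of different lengths
scores <- list(a = c(1, 2, 3), b = c(10, 20), c = 5)

# lapply always returns a list, one element per input element
as_list <- lapply(scores, mean)

# sapply simplifies the result to a named numeric vector when it can
as_vector <- sapply(scores, mean)

# vapply does the same, but checks every result against the template
# given in FUN.VALUE (here: exactly one number per element)
checked <- vapply(scores, mean, FUN.VALUE = numeric(1))
```

If one of the results did not match the FUN.VALUE template, vapply would stop with an error instead of silently returning a differently shaped object, which is why it is often preferred inside functions.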
Each of these has an argument FUN which takes a function to apply to each element of the object. Instead of looping over filenames and calling analyze, as you did earlier, you could sapply over filenames with FUN = analyze:
sapply(filenames, FUN = analyze)
Deciding whether to use for or one of the apply family is largely a matter of personal preference. Using an apply family function forces you to encapsulate your operations as a function rather than as separate calls inside a for loop. for loops are more natural in some circumstances; for several related operations, a for loop saves you from passing a lot of extra arguments to your function.
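The trade-off can be seen in a small sketch. The list readings, the function scaled_mean, and the variable multiplier are hypothetical names chosen for this illustration; they do not come from the lesson.

```r
# Hypothetical data: three numeric vectors, one with a missing value
readings <- list(a = c(1, 2, NA), b = c(4, 5, 6), c = c(7, 8, 9))

# apply-family style: the operation lives in a function, and the extra
# arguments (multiplier and na.rm) have to be passed through sapply
scaled_mean <- function(x, multiplier, na.rm = FALSE) {
  mean(x, na.rm = na.rm) * multiplier
}
via_sapply <- sapply(readings, scaled_mean, multiplier = 10, na.rm = TRUE)

# for-loop style: the same steps written inline, using variables from
# the surrounding code directly; the result vector is allocated up front
multiplier <- 10
via_loop <- numeric(length(readings))
names(via_loop) <- names(readings)
for (i in seq_along(readings)) {
  via_loop[i] <- mean(readings[[i]], na.rm = TRUE) * multiplier
}
```

Both versions produce the same named vector. Which one reads better depends mostly on how many extra arguments the function would need.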
Loops in R Are Slow
No, they are not, provided you follow some golden rules:
- Don't use a loop when a vectorized alternative exists
- Don't grow objects (via c, cbind, etc.) during the loop - R has to create a new object and copy across the information just to add a new element or row/column
- Allocate an object to hold the results and fill it in during the loop
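As a small illustration of the first and third rules, the sketch below squares the elements of a made-up vector x twice: once in a loop that fills a preallocated result, and once with the vectorized operator.

```r
x <- c(3, 1, 4, 1, 5, 9, 2, 6)

# Loop version: the result vector is allocated up front, then filled in
squares_loop <- numeric(length(x))
for (i in seq_along(x)) {
  squares_loop[i] <- x[i]^2
}

# Vectorized version: arithmetic operators already work element-wise,
# so no explicit loop is needed
squares_vec <- x^2
```

The two results are the same; the vectorized form is shorter and leaves the looping to R's internals.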
As an example, we'll create a new version of analyze that will return the mean inflammation per day (column) of each file.
analyze2 <- function(filenames) {
  for (f in seq_along(filenames)) {
    fdata <- read.csv(filenames[f], header = FALSE)
    res <- apply(fdata, 2, mean)
    if (f == 1) {
      out <- res
    } else {
      # The loop is slowed by this call to cbind that grows the object
      out <- cbind(out, res)
    }
  }
  return(out)
}
system.time(avg2 <- analyze2(filenames))
user system elapsed
0.027 0.000 0.026
Note how we add a new column to out at each iteration? This is a cardinal sin of writing a for loop in R.
Instead, we can create an empty matrix with the right dimensions (rows/columns) to hold the results. Then we loop over the files, but this time we fill in the f-th column of our results matrix out. This time there is no copying/growing for R to deal with.
analyze3 <- function(filenames) {
  out <- matrix(ncol = length(filenames), nrow = 40) # assuming 40 here from files
  for (f in seq_along(filenames)) {
    fdata <- read.csv(filenames[f], header = FALSE)
    out[, f] <- apply(fdata, 2, mean)
  }
  return(out)
}
system.time(avg3 <- analyze3(filenames))
user system elapsed
0.024 0.000 0.024
In this simple example there is little difference in the compute time of analyze2 and analyze3. This is because we are only iterating over 12 files and hence we only incur 12 copy/grow operations. If we were doing this over more files or the data objects we were growing were larger, the penalty for copying/growing would be much larger.
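To see the penalty at a larger scale without needing more files, the following sketch runs both patterns on synthetic columns. The functions grow and prealloc and the iteration count n are hypothetical, chosen only for this illustration; the timings depend on the machine, so none are quoted here.

```r
n <- 2000  # hypothetical number of iterations, far more than 12 files

# Grows the result with cbind, as analyze2 does
grow <- function(n) {
  out <- NULL
  for (i in seq_len(n)) {
    out <- cbind(out, rep(i, 40))  # copies everything built so far
  }
  out
}

# Fills a preallocated matrix, as analyze3 does
prealloc <- function(n) {
  out <- matrix(NA_real_, nrow = 40, ncol = n)
  for (i in seq_len(n)) {
    out[, i] <- rep(i, 40)  # writes into existing storage
  }
  out
}

t_grow <- system.time(g <- grow(n))
t_pre  <- system.time(p <- prealloc(n))
print(rbind(grow = t_grow, prealloc = t_pre)[, 1:3])
```

Both functions return a 40 by n matrix with the same values; only the amount of copying differs, and that difference widens as n increases.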
Note that apply handles these memory allocation issues for you, but then you have to write the loop part as a function to pass to apply. At its heart, apply is just a for loop with extra convenience.
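A minimal sketch of that equivalence, using vapply and made-up data frames in place of the inflammation files so that it runs without any CSVs. The name fake_files and the 5 by 4 shape are assumptions for illustration only.

```r
# Hypothetical stand-ins for the data files: three data frames with
# 5 rows and 4 columns of random numbers each
set.seed(1)
fake_files <- replicate(3, as.data.frame(matrix(runif(20), nrow = 5)),
                        simplify = FALSE)

# Explicit loop with a preallocated results matrix (one column per file)
out_loop <- matrix(NA_real_, nrow = 4, ncol = length(fake_files))
for (f in seq_along(fake_files)) {
  out_loop[, f] <- colMeans(fake_files[[f]])
}

# vapply allocates and fills the same matrix for us; FUN.VALUE declares
# that each call returns four numbers
out_vapply <- vapply(fake_files, colMeans, FUN.VALUE = numeric(4))
```

The values agree; the vapply result additionally carries the column names of the data frames as row names.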
Writing a simple for loop in R
Let’s get back to the conceptual meaning of a loop. Suppose you want to do several printouts of the following form: The year is [year] where [year] is equal to 2010, 2011, up to 2015. You can do this as follows:
print(paste("The year is", 2010))
"The year is 2010"
print(paste("The year is", 2011))
"The year is 2011"
print(paste("The year is", 2012))
"The year is 2012"
print(paste("The year is", 2013))
"The year is 2013"
print(paste("The year is", 2014))
"The year is 2014"
print(paste("The year is", 2015))
"The year is 2015"
You immediately see this is rather tedious: you repeat the same code chunk over and over. This violates the DRY principle, which applies in every programming language: Don't Repeat Yourself. In this case, by making use of a for loop in R, you can automate the repetitive part:
for (year in c(2010,2011,2012,2013,2014,2015)){
  print(paste("The year is", year))
}
"The year is 2010"
"The year is 2011"
"The year is 2012"
"The year is 2013"
"The year is 2014"
"The year is 2015"
The best way to understand what is going on in the for loop is to read it as follows: "For each year that is in the sequence c(2010,2011,2012,2013,2014,2015), execute the code chunk print(paste("The year is", year))". Once the for loop has executed the code chunk for every year in the vector, the loop stops and execution continues with the first instruction after the loop block.
See how we did that? By using a for loop you only need to write down your code chunk once (instead of six times). The for loop then runs the statement once for each provided value (the different years) and sets the variable (year in this case) to that value. You can simplify the code even further: c(2010,2011,2012,2013,2014,2015) can also be written as 2010:2015; this creates the same sequence of years:
for (year in 2010:2015){
  print(paste("The year is", year))
}
"The year is 2010"
"The year is 2011"
"The year is 2012"
"The year is 2013"
"The year is 2014"
"The year is 2015"
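One detail worth knowing, shown in the sketch below: the two ways of writing the sequence hold the same values, but R stores them with different types. The variable names years_c and years_colon are made up for this illustration.

```r
years_c     <- c(2010, 2011, 2012, 2013, 2014, 2015)
years_colon <- 2010:2015

# The values are the same, element by element
same_values <- all(years_c == years_colon)

# c() with plain numbers gives doubles; the colon operator gives integers
type_c     <- typeof(years_c)      # "double"
type_colon <- typeof(years_colon)  # "integer"

# For the printout in the loop this makes no difference
label_c     <- paste("The year is", years_c[1])
label_colon <- paste("The year is", years_colon[1])
```

The loop prints the same text either way; the type difference only matters in code that checks types strictly, for example with identical().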
As a last note on the for loop in R: in this case we made use of the variable year, but in fact any variable name could be used here. For example, you could have used i, a commonly used variable in for loops that stands for index:
for (i in 2010:2015){
  print(paste("The year is", i))
}
"The year is 2010"
"The year is 2011"
"The year is 2012"
"The year is 2013"
"The year is 2014"
"The year is 2015"
This produces exactly the same output. So you can name the variable any way you want, but the code is easier to understand if you use meaningful names.
Let's have a look at a more mathematical example. Suppose you need to print all odd numbers between 1 and 10, while the even numbers should not be printed. In that case your loop would look like this:
for (i in 1:10) {
  if (!i %% 2){
    next
  }
  print(i)
}
1
3
5
7
9
Notice the introduction of the next statement. Let's explore its meaning by walking through the loop together:
While there are values of i left in 1:10, we enter the loop body; once they are used up, the loop stops. Inside the body, we check whether i is even. The modulus operator %% gives the remainder of i divided by 2, and !i %% 2 is evaluated as !(i %% 2). If the remainder is zero (i is even), the condition evaluates to TRUE and we enter the if statement. There we meet the next statement, which jumps straight back to the i in 1:10 step with the following value of i, skipping the instructions that follow (here print(i)). If the remainder is nonzero (i is odd), the condition is FALSE, so we skip the if statement, execute print(i), and loop back.
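The walkthrough can be checked with a short sketch that records which values reach the end of the loop body instead of printing them. The names printed, nums, and odds are made up for this illustration; the last two lines show a vectorized alternative that needs no loop.

```r
# Record which values of i get past the if statement
printed <- logical(10)  # preallocated: one flag per value of i
for (i in 1:10) {
  if (!i %% 2) {  # parsed as !(i %% 2): TRUE only when the remainder is 0
    next          # even number: skip the rest of the body
  }
  printed[i] <- TRUE
}
reached <- which(printed)

# Vectorized alternative: keep the values whose remainder is 1
nums <- 1:10
odds <- nums[nums %% 2 == 1]
```

Both reached and odds contain 1, 3, 5, 7, 9, matching the loop's printed output.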