Writing a for-loop in R

freeimages.co.uk

There may be no R topic that is more controversial than the humble for-loop. And, to top it off, good help is hard to find. I was astounded by the lack of useful posts when I googled “for loops in R” (the top return linked to a page that did not exist). In fact, even searching for help within R is not easy and not even that helpful when successful (?for won’t get you anywhere. ?'for' will get you the help page but it is by no means exhaustive.) So, at the request of Sam, a faithful reader of the Paleocave blog, I’m going to throw my hat into the ring and brace myself for the potential onslaught of internet troll wrath.

How to loop in R

Use the for loop if you want to do the same task a specific number of times.
It looks like this.

for (counter in vector) {commands}

I’m going to set up a loop to square every element of my dataset, foo, which contains the odd integers from 1 to 100 (keep in mind that vectorizing would be faster for my trivial example – see below).


foo = seq(1, 100, by=2)

foo.squared = NULL

for (i in 1:50) {
  foo.squared[i] = foo[i]^2
}

If the goal is to create a new vector, you first have to set up a vector to store things in before running the loop. This is the foo.squared = NULL part. This was a hard lesson for me to learn. R doesn’t like being told to operate on a vector that doesn’t exist yet. So, we set up an empty vector to add stuff to later (note that this isn’t the most speed-efficient way to do this, but it’s fairly foolproof). Next, the real for-loop begins. This code says we’ll loop 50 times (1:50). The counter we set up is ‘i’ (but you can put whatever variable name you want there), and it holds the number of the loop that we are on (for the first loop, i=1; second loop, i=2). For our new vector foo.squared, the ith element will equal the square of the ith element of foo.

If you are new to programming, it is sometimes difficult to keep straight the difference between the number of the loop you are on and the value of the vector element being operated on. For example, when we’ve looped through the instructions 4 times, the next loop will be loop number 5 (so i=5). However, the 5th element of foo is foo[5], which is equal to 9. Therefore, foo.squared[5] should equal 81.
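
One way to see the difference between the counter and the element is to print both as the loop runs. This is just a small sketch using the same foo vector, cut down to the first five passes so the output stays short:

```r
foo = seq(1, 100, by = 2)     # odd integers 1, 3, 5, ..., 99
foo.squared = NULL

for (i in 1:5) {
  foo.squared[i] = foo[i]^2
  # i is the loop number; foo[i] is the value stored at that position
  cat("loop", i, "foo[i] =", foo[i], "foo.squared[i] =", foo.squared[i], "\n")
}
```

On the fifth pass the counter is 5, but the element being squared is foo[5].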

Silly mistakes to be made
If you are having problems with your loop, it could be one of these silly mental slips.

Did you reset your vector inside the loop? Is it possible you put a new.vector = NULL inside the loop instead of before it? Yeah, I’ve done it. About 45 minutes later I finally figured out what was wrong with my loop.

Did you forget to subscript your new vector? Possibly the inside of your loop looks like this
{ foo.squared = foo[i]^2 }.
You are missing your square brackets with a counter on the left side of the equal sign. This will result in foo.squared containing only one value – the last value calculated by the loop.
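
Both slips are easy to reproduce. The sketch below runs each one on the foo vector; the names wrong.reset and wrong.subscript are made up for this illustration:

```r
foo = seq(1, 100, by = 2)

# Mistake 1: resetting the storage vector inside the loop
wrong.reset = NULL
for (i in 1:50) {
  wrong.reset = NULL              # wipes out everything stored so far
  wrong.reset[i] = foo[i]^2
}
sum(!is.na(wrong.reset))          # counts how many real values survived

# Mistake 2: forgetting the [i] on the left side of the equal sign
wrong.subscript = NULL
for (i in 1:50) {
  wrong.subscript = foo[i]^2      # overwrites the whole object each time
}
length(wrong.subscript)           # only the last value calculated is left
```

In the first case only the final position holds a value and everything before it is NA; in the second the result is a single number rather than a vector.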

Why is this controversial?

A little background:
1) Loops are slow in R. This fact puts lots of R users on the defensive from the very beginning. Users of almost any other language can just bring up looping speed when they want to get under R users’ skins. The fact is, for many people, it doesn’t matter. Computers are fast and even slow looping will likely accomplish what you need in a reasonable length of time unless you are working with a really huge dataset. And there are lots of workarounds for users of big data in R.

2) Much of R itself is written in C (along with some Fortran and R). When you set up a vector in R, you can easily do operations on the entire vector (this is the vectorization that gets discussed so frequently in R literature).

foo.squared = foo^2

Underneath the R code you just executed is blazingly fast C code running loops to get you the answer. The upshot here is that C is much faster than R, and if you can get what you seek in R by applying a command to a vector, it’s typically a good idea to do so.
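
If you want to see the gap for yourself, here is a rough timing sketch. The vector size is arbitrary and the names big, loop.time and vector.time are placeholders; the actual numbers will depend on your machine and R version.

```r
big = seq(1, 2000000, by = 2)     # one million odd integers (size chosen arbitrarily)

# R-level loop, filling a storage vector of the right length
loop.time = system.time({
  loop.result = numeric(length(big))
  for (i in 1:length(big)) {
    loop.result[i] = big[i]^2
  }
})

# vectorized: the looping happens in compiled code
vector.time = system.time({
  vector.result = big^2
})

loop.time["elapsed"]
vector.time["elapsed"]
all.equal(loop.result, vector.result)   # checks that both give the same answer
```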

3) R is a functional language, and as a result flow control is somewhat de-emphasized. Many R natives would prefer that you use the apply family of functions rather than writing a for-loop (often possible, but not always). Adding a layer of vitriol to this preference for apply is the rumor (left over from the S language from which R was derived) that apply is faster than a for-loop. This is false (at least theoretically), because inside the code for the apply function is a for-loop written in R. There are a couple of functions in the apply family which do avoid R loops and therefore probably are faster than a loop. But most apply functions are no faster than a well-constructed loop (more on well-constructed loops later). Using apply is best left for another post, though; we have plenty to tackle just learning how to write a half-way decent loop.
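
For comparison only, here is a minimal sketch of the squaring task written both ways. sapply is one member of the apply family in base R; the names loop.result and apply.result are placeholders for this example.

```r
foo = seq(1, 100, by = 2)

# for-loop version
loop.result = numeric(length(foo))
for (i in 1:length(foo)) {
  loop.result[i] = foo[i]^2
}

# apply-family version: sapply calls the function once per element of foo
apply.result = sapply(foo, function(x) x^2)

all.equal(loop.result, apply.result)
```

Which one reads better is largely a matter of taste; the point is only that they do the same job.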

Some more advanced looping thoughts

If you are writing a for-loop inside of a larger construct, the number of times you want to loop could depend on the length of a vector, which could change depending on other factors. In that case, you can set up the “counter in vector” part of the loop like this:

for (i in 1:length(foo)) {
  # stuff to do once for each element of foo
}
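
One caveat that the 1:length(foo) pattern brings with it: if the vector ever turns out to be empty, 1:0 counts backwards and the loop still runs. Base R’s seq_along() avoids this. The sketch below uses a made-up empty vector to show the difference:

```r
empty = numeric(0)

1:length(empty)        # this is 1:0, which counts down from 1 to 0
seq_along(empty)       # zero-length, so a loop over it never starts

runs = 0
for (i in seq_along(empty)) {
  runs = runs + 1      # never reached for an empty vector
}
runs
```

For a non-empty vector, seq_along(foo) and 1:length(foo) give the same counter values.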

The well constructed loop

If you are running into speed problems, there are a couple of things to try (see also The R Inferno).

Get as much stuff as possible out of the loop. If there are any operations that could be done to the vector prior to looping, get them outside of those curly brackets.
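
As a small, made-up illustration of this, suppose we want to subtract the mean of foo from each element. The mean never changes, so there is no reason to recompute it on every pass (centered.slow, centered and foo.mean are names invented for this sketch):

```r
foo = seq(1, 100, by = 2)

# slower pattern: mean(foo) is recalculated on every pass
centered.slow = numeric(length(foo))
for (i in 1:length(foo)) {
  centered.slow[i] = foo[i] - mean(foo)
}

# better: calculate the constant once, before the loop
foo.mean = mean(foo)
centered = numeric(length(foo))
for (i in 1:length(foo)) {
  centered[i] = foo[i] - foo.mean
}

all.equal(centered.slow, centered)
```

With 50 elements you will not notice the difference, but the habit pays off when the loop is long or the repeated calculation is expensive.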

Avoid growing your object

In the example above we created an empty vector to store our new values in (foo.squared). That vector is empty, and every time we go through the loop we grow the vector by one. It would be faster if we could set up our vector to be the right length ahead of time and then just simply fill that vector with the correct values.
foo.squared = numeric(length=50) #generates a vector of 50 zeros; now we run the loop as before
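
Put together, the preallocated version of the earlier example would look something like this sketch:

```r
foo = seq(1, 100, by = 2)
foo.squared = numeric(length = 50)   # 50 zeros, already the right length

for (i in 1:50) {
  foo.squared[i] = foo[i]^2          # fill in place; the vector never grows
}

length(foo.squared)
head(foo.squared)
```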

Of course, sometimes when we write loops we don’t know how many things are going to come out the other end. Usually we can guess at an upper bound, though. It’s going to be faster to partially fill a very long vector using a loop and then get rid of the meaningless stuff at the end than it is to grow a vector one loop at a time. We can make a very large vector full of NAs and dump the leftover NAs at the end. Give these two loops a try and note the speed difference on your computer.


bar = seq(1,200000, by=2)
bar.squared = rep(NA, 200000)

for (i in 1:length(bar)) {
  bar.squared[i] = bar[i]^2
}

#get rid of excess NAs
bar.squared = bar.squared[!is.na(bar.squared)]
summary(bar.squared)

Versus


bar = seq(1, 200000, by=2)
bar.squared = NULL


for (i in 1:length(bar)) {
  bar.squared[i] = bar[i]^2
}

summary(bar.squared)
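
If you would rather have R do the timing than watch the clock, one option is to wrap each loop in system.time(). This is only a sketch: the names prealloc.time, grow.time and bar.grown are placeholders, and the size of the gap will depend on your machine and your version of R.

```r
bar = seq(1, 200000, by = 2)

# preallocate an over-long vector of NAs, fill it, then trim the leftovers
prealloc.time = system.time({
  bar.squared = rep(NA, 200000)
  for (i in 1:length(bar)) {
    bar.squared[i] = bar[i]^2
  }
  bar.squared = bar.squared[!is.na(bar.squared)]
})

# grow the vector by one element on every pass
grow.time = system.time({
  bar.grown = NULL
  for (i in 1:length(bar)) {
    bar.grown[i] = bar[i]^2
  }
})

prealloc.time["elapsed"]
grow.time["elapsed"]
all.equal(bar.squared, bar.grown)   # checks that both loops produce the same values
```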

26 Responses to Writing a for-loop in R

  1. Sam Hardman 23 March, 2013 at 11:30 am

    Thanks for writing this, I really appreciate it. You did a great job of making loops easy to understand…

  2. AHJ 29 May, 2013 at 9:06 am

    Thank you. Finally some help other than to use the apply family.

  3. Patrick 29 May, 2013 at 6:43 pm

    Glad to help. Now go write some awesome loops!

  4. Andrew 26 July, 2013 at 2:30 pm

    Thanks! This helped me do something I was starting to believe was impossible. Awesome.

  5. Rain M 4 September, 2013 at 11:23 am

    Thank you. The tip about NOT creating empty vectors was extremely helpful. You saved me few hours of run time.

    • Patrick 4 September, 2013 at 3:48 pm

      Yeah, that’s pretty handy isn’t it? That is the main reason I think that many people believe that the apply command is faster than a loop.

  6. Michelle 19 September, 2013 at 12:28 pm

    Finally! A sane post on for loops!

  7. Sean 18 October, 2013 at 8:37 am

    Thanks for this, nice and simple. Good tip on not growing the vector!

  8. mobb 26 October, 2013 at 3:31 pm

    I’m starting to think using R is like riding a fixie. Everyone thinks it’s cool but it’s a real bitch to ride.

  9. Charles 27 October, 2013 at 10:19 am

    Awesome – thank you!

  10. John Burkhart 31 October, 2013 at 7:47 pm

    I wanted to concur with your assessment of the value of speed in most day-to-day statistical computing. I’m a statistician mostly working on ecological datasets and I also work with climatological and designed experiment data. It doesn’t seem like computational speed affects my ability to solve problems efficiently. The problems I work on are different enough from one another that the majority of my working time is spent in conceptualization of the problem and figuring out ways to solve it. Most computational optimization then seems like spending a dime trying to win a nickel.

    Great post!

    • Patrick 1 November, 2013 at 8:46 am

      I know, right? If it turns out you are doing something that’s going to be repeated numerous times you can always port your algorithm to some other language like Python (or Java or C) if it’s really necessary for it to be blazing fast.

  11. Jöri 12 November, 2013 at 12:25 pm

    Thanks! It could be so easy to understand and to adapt if everyone would write such straight forward examples without any blabla’s and difficult words ^^

  12. Iwen 22 January, 2014 at 12:29 am

    Hi Patrick,

    This page has been very helpful, but I was hoping you could help me out with an R function I’m setting up – I have been at this for a few days as I’m new to R. I stumbled on this page because I am trying to figure out why only my last value is being calculated by the for loop. You explain above:

    ” You are missing your square brackets with a counter on the left side of the equal sign. This will result in foo.squared containing only one value – the last value calculated by the loop.”

    I’m not really sure what that means but I am guessing that I have to add ‘[i]‘ to the left side of this..
    ‘{ foo.squared = foo[i]^2 }’
    so that I have this inside the for loop instead:
    ‘ { foo.squared[i] = foo[i]^2 }.’

    Am I reading that wrong?

    Here is the function that I’m testing out in the R Console first:
    >iddffor (i in 1:length(id)) {
    +df[i, ] <- c(id[i],0)
    +test.data<-getmonitor(id[i], "specdata")
    }

    *eventually I am going to extract a single value from each iteration of 'test.data' and insert each into the respective i-th row, column 2 of df

    *That for loop above creates df (new data frame) just as I want it, but when I type 'test.data' into the command line to see what it produces, I only get the data frame for
    'getmonitor(5,"specdata")'

    What I am EXPECTING or wanting to get from test.data is a
    list (getmonitor(1,"specdata"), getmonitor(3,"specdata"), getmonitor(5,"specdata"))
    determined by id which I tried specifying above using the i-th element of id…….
    test.data<-getmonitor(id[i],"specdata")

    **NOTES/BACKGROUND: getmonitor() is a function I have saved in the R editor with the arguments (id,directory) where id is a numeric vector of length 1 and directory is a character vector of length 1. The function opens up one SINGLE file named 'id[i]' inside the folder "specdata".

    I tried doing
    +test.data[i] <- getmonitor(id[i], "specdata")

    but it got really weird… any help would be appreciated! Sorry that this is so long, no other help pages were addressing this exact problem. Thank you in advance even if you are not able to help.

  13. Iwen 22 January, 2014 at 12:35 am

    Sorry, I don’t know what happened to the for loop. The beginning was cut off. This is the complete code:
    id<-c(1,3,5)
    df<-data.frame(id=rep(NA,3),nobs=rep(NA,3))
    for(i in 1:length(id)){
    df[i,]<-c(id[i],0)
    test.data<-getmonitor(id[i],"specdata")
    }

    If for some reason the code gets messed up again, I set 'id' to equal a list containing the numbers 1, 3, and 5 using the c() function. Then I create an empty data frame called 'df' with two columns labeled 'id' and 'nobs' (1st and 2nd column respectively). Then I begin the for loop iterating from 1:length(id).. which in this case is 1:3

    Thank you. Sorry for the split posts

    • Iwen 22 January, 2014 at 1:16 am

      I figured out how to do it! I really just wanted to ask someone because this has been taking up so much of my time in the last few days! I’m going to post this just in case someone else has similar problems…

      Basically, I realized I hadn’t passed ‘test.data’ to anything yet, so every iteration was just erasing the previous ‘test.data’ and so only the last ‘test.data <- getmonitor(5,"specdata")' was being calculated… so I added this to the for loop, passing the iterations from 'getmonitor' to the dataframe 'df' each time.

      ……
      test.data<-getmonitor(id[i],"specdata")
      good<-complete.cases(test.data)
      nobbs<-sum(good)

      df[i,]<-c(id[i],nobbs)

      }

      return(df)
      }

      • Patrick 22 January, 2014 at 5:42 am

        Iwen,
        Whoa, way to bang your head against a problem until you prevail. Glad you were able to solve your issue!

        Keep up the R!

  14. LTrain 10 March, 2014 at 4:33 pm

    Thank you. Thank you many times over. I’m not a programmer, but nonetheless I am determined to learn R. It’s a slow process, made slower by the fact that most R-related discussions are written for/by people that have more experience than I do, so they tend to skip over things that would be obvious to a programmer, but are not obvious to me. It’s pretty hard on the self-esteem to go to the “R Programming for Dummies” page, and still get stuck. Thanks for translating the programming jargon into terms a layperson like me can understand!
    That tip about writing an empty vector before the loop – I never would have figured that out through trial and error, and never came across it on stack overflow… although it must be there somewhere. If you ever feel inspired to write about debugging code, I’d love to hear what you have to say. I’ve reached the point where I can find WHERE the errors are (woo traceback), but I usually don’t know what they mean, or how to fix them. Keep up the great posts!
    -a wandering marine biologist

    • LTrain 10 March, 2014 at 4:37 pm

      ps. I was in the same class, trying to figure out the same getmonitor function. Boy do I sympathize. I must have spent a solid 12 hours trying to figure out how to write that code… but eventually I figured it out. Cheers, Iwen

  15. boral 16 March, 2014 at 11:12 pm

    Good post. Keep posting this type of posts. They are really helpful.

  16. tahir 6 April, 2014 at 1:26 pm

    thanks for your job
    but I have a question about ability to cut off the loop to return to (FOR) again when a some expression is ascertained

    example
    5 for (i in 1:10){
    if i>5 go to 5
    else go on
    }

    • Patrick 6 April, 2014 at 1:56 pm

      Hi Tahir,
      Sounds like you want a “while” loop?
      You can set one up like this…

      foo= NULL
      i=1
      while (i<5){
      foo[i]=i
      i = i+1
      }

      foo would then equal 1 2 3 4

      Is that what you were asking?

  17. Angus 17 June, 2014 at 2:16 pm

    Thanks for this tutorial, it is appreciated very much
    It sure helped in figuring out how to get the results i needed.

    obz.id <- unique(data$id)
    n.obz <- NULL

    for(i in 1:length(obz.id)) {
    n.obz[i] <- length(data[completeCases,]$ID[data[completeCases,]$ID == i])
    }

    str(n.obz)
    ## int [1:332] 117 1041 243 474 402 228 442 192 275 148

    Then i manually made a data frame from the output of a variable and the loop

    obz.id <- unique(data$id)
    nobz.

    obz <- data.frame(idz = obz.id, nobz = n.obz)

    head(obz, 3)
    ## idz nobz
    ## 1 1 117
    ## 2 2 1041
    ## 3 3 243

    tail(obs, 3)
    ## idz nobz
    ## 330 330 447
    ## 331 331 284
    ## 332 332 16

    Question:

    How can a loop output data for two or more columns of a data frame?
    To achieve the data frame i manually made.

    I have tried several times but i get this results

    obz2 <- data.frame()

    obz2 <- data.frame()
    for(i in 1:length(obz.id)) {
    obs2 <- c(id = id[i] <- obz.id[i],
    nobz = nobz[i] <- length(data[completeCases,]$ID[data[completeCases,]$ID == i]))
    }

    obs2
    ## id nobs
    ## 332 16

    Any help kindly appreciated

    • Patrick 18 June, 2014 at 6:08 am

      Since I don’t have your original data frame, I can’t test it out, but I would try this…
      # set up a data frame of the right size first, then fill it row by row
      n <- length(obz.id)
      obz2 <- data.frame(id = rep(NA, n), nobz = rep(NA, n))
      for(i in 1:n) {
      obz2$id[i] <- obz.id[i]
      obz2$nobz[i] <- length(data[completeCases,]$ID[data[completeCases,]$ID == i])
      }

  18. Angus 19 June, 2014 at 7:23 pm

    Patrick,
    Thanks for your response, it was very helpful.

    I was able to get it to work like this.

    data <- list.files("/path/to/files/data/", full.names = TRUE, pattern = ".csv")

    data.df <- NULL
    obs <- NULL
    id <- NULL
    nobs <- NULL
    id <- 1:332

    for(i in 1:length(id)) {
    data.df <- rbind(data.df, read.csv(data[i]))
    completeCases <- complete.cases(data.df$sulfate, data.df$nitrate)
    obs$id[i] <- id[i]
    obs$nobs[i] <- length(data.df[completeCases,]$ID[data.df[completeCases,]$ID == i])
    }

    obs <- as.data.frame(obs)

    return(obs)

    In the R environment these work fine

    obs[obs$id %in% 1, ]
    id nobs
    1 1 117
    obs[obs$id %in% c(2, 4,8), ]
    id nobs
    2 2 1041
    4 4 474
    8 8 192

    When i use it in a function i comment,
    #id <- NULL
    #id <- 1:332

    then run the function

    complete("data", 1)
    id nobs
    1 1 117
    complete("data", c(2, 4,8))
    id nobs
    1 2 117
    2 4 1041
    3 8 243

    The "id" field is returned correctly, but not the "nobs" field

    Any help appreciated

    • Patrick 20 June, 2014 at 7:36 am

      I’m not sure of the problem here. Your code seems to work as I would expect. What is the problem with the obz that are being returned? Again, I don’t know what your actual data are, but if you ask for the subset of your data that are equal to 2 4 and 8, it doesn’t look like any of your obz are that small. If you ask for complete(“data”, 117), I would think it would return your first line just as if you ask for complete(“data”, 1). Is this what happens? What do you want to happen?
