Writing a for-loop in R

There may be no R topic that is more controversial than the humble for-loop. And, to top it off, good help is hard to find. I was astounded by the lack of useful posts when I googled “for loops in R” (the top return linked to a page that did not exist). In fact, even searching for help within R is not easy and not even that helpful when successful (?for won’t get you anywhere. ?'for' will get you the help page but it is by no means exhaustive.) So, at the request of Sam, a faithful reader of the Paleocave blog, I’m going to throw my hat into the ring and brace myself for the potential onslaught of internet troll wrath.

How to loop in R

Use the for loop if you want to do the same task a specific number of times.
It looks like this.

for (counter in vector) {commands}

I’m going to set up a loop to square every element of my dataset, foo, which contains the odd integers from 1 to 100 (keep in mind that vectorizing would be faster for my trivial example – see below).


foo = seq(1, 100, by=2)

foo.squared = NULL

for (i in 1:50) {
  foo.squared[i] = foo[i]^2
}

If the goal is to create a new vector, you first have to set up a vector to store things in before running the loop. This is the foo.squared = NULL part. This was a hard lesson for me to learn. R doesn’t like being told to operate on a vector that doesn’t exist yet. So, we set up an empty vector to add stuff to later (note that this isn’t the most speed-efficient way to do this, but it’s fairly foolproof). Next, the real for-loop begins. This code says we’ll loop 50 times (1:50). The counter we set up is ‘i’ (but you can put whatever variable name you want there). On each pass, i takes the number of the loop we are on (for the first loop, i=1; second loop, i=2), and the ith element of our new vector foo.squared is set to the square of the ith element of foo.

If you are new to programming it is sometimes difficult to keep straight the difference in the number of loops you are on versus the value of the element of vector being operated on. For example when we’ve looped through the instructions 4 times, the next loop will be loop number 5 (so i=5). However the 5th element of foo will be foo[5], which is equal to 9. Therefore, foo.squared[5] should equal 81.
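To see the difference between the loop number and the element it points at, it can help to print both on every pass. This is only a small sketch using the foo vector from above, cut down to the first five loops to keep the output short:

```r
foo = seq(1, 100, by=2)
foo.squared = NULL

for (i in 1:5) {
  foo.squared[i] = foo[i]^2
  # i is the loop number; foo[i] is the value stored at that position
  cat("loop", i, ": foo[i] =", foo[i], ", foo.squared[i] =", foo.squared[i], "\n")
}
```

On the fifth pass this should report foo[i] = 9 and foo.squared[i] = 81, matching the reasoning above.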

Silly mistakes to be made
If you are having problems with your loop, it could be one of these silly mental slips.

Did you reset your vector inside the loop? Is it possible you put a new.vector = NULL inside the loop instead of before it? Yeah, I’ve done it. About 45 minutes later I finally figured out what was wrong with my loop.

Did you forget to subscript your new vector? Possibly the inside of your loop looks like this
{ foo.squared = foo[i]^2 }.
You are missing your square brackets with a counter on the left side of the equal sign. This will result in foo.squared containing only one value – the last value calculated by the loop.
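Both slips are easy to reproduce on purpose. The sketch below reuses foo from above; the names reset.inside and no.subscript are made up for this illustration:

```r
foo = seq(1, 100, by=2)

# Mistake 1: resetting the result vector inside the loop
reset.inside = NULL
for (i in 1:50) {
  reset.inside = NULL            # wipes out everything stored so far
  reset.inside[i] = foo[i]^2
}
sum(!is.na(reset.inside))        # only the last slot holds a value

# Mistake 2: forgetting the [i] on the left side of the equal sign
no.subscript = NULL
for (i in 1:50) {
  no.subscript = foo[i]^2        # overwrites the whole object every time
}
length(no.subscript)             # a single value, from the final loop
```

In the first case you end up with a vector of 50 elements where everything but the last is NA; in the second you end up with one number.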

Why is this controversial?

A little background:
1) Loops are slow in R. This fact puts lots of R users on the defensive from the very beginning. Users of almost any other language can just bring up looping speed when they want to get under R users’ skins. The fact is, for many people, it doesn’t matter. Computers are fast and even slow looping will likely accomplish what you need in a reasonable length of time unless you are working with a really huge dataset. And there are lots of workarounds for users of big data in R.

2) Much of R itself is written in C (along with some Fortran and R). When you set up a vector in R, you can easily do operations on the entire vector (this is the vectorization that gets discussed so frequently in R literature).

foo.squared = foo^2

Underneath the R code you just executed is blazingly fast C code running loops to get you the answer. The upshot here is that C is much faster than R, and if you can get what you seek in R by applying a command to a vector it’s typically a good idea to do so.
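If you want to convince yourself that the loop and the vectorized command give the same answer, a quick check along these lines should do it (loop.result and vectorized.result are names invented for this sketch):

```r
foo = seq(1, 100, by=2)

# The loop version from earlier
loop.result = NULL
for (i in 1:50) {
  loop.result[i] = foo[i]^2
}

# The vectorized version
vectorized.result = foo^2

# TRUE if the two vectors hold the same values
all.equal(loop.result, vectorized.result)
```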

3) R is a functional language, and as a result flow control is somewhat de-emphasized. Many R natives would prefer that you use the apply family of functions rather than writing a for-loop (often possible, but not always). Adding a layer of vitriol to this preference is the rumor (left over from the S language, from which R was derived) that apply is faster than a for-loop. This is false (at least theoretically), because inside the code for the apply command is a for-loop written in R. A couple of functions in the apply family do avoid R loops and therefore probably are faster than a loop. But most apply functions are no faster than a well constructed loop (more on well constructed later). Using apply is best left for another post; we have plenty to tackle just learning how to write a half-way decent loop.

Some more advanced looping thoughts

If you are writing a for-loop inside of a larger construct, the number of times you want to loop could depend on the length of a vector, which could change depending on other factors. Therefore, you can set up the “counter in vector” part of the loop like this

for (i in 1:length(foo)) {
  #stuff to do the number of times that foo is long
}
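One caveat, offered as a side note rather than something the example above depends on: if the vector ever ends up empty, 1:length(foo) becomes 1:0, which counts down (1, then 0), so the loop runs twice instead of not at all. Base R's seq_along() is a common way around this. A minimal sketch, with empty, runs.colon and runs.seq as made-up names:

```r
empty = numeric(0)

# 1:length() on an empty vector gives c(1, 0), so the body runs twice
runs.colon = 0
for (i in 1:length(empty)) {
  runs.colon = runs.colon + 1
}

# seq_along() gives an empty sequence, so the body never runs
runs.seq = 0
for (i in seq_along(empty)) {
  runs.seq = runs.seq + 1
}

runs.colon
runs.seq
```

For any vector with at least one element, the two forms loop over exactly the same values.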

The well constructed loop

If you are running into speed problems there are a couple of things to try (see also The R Inferno).

Get as much stuff as possible out of the loop. If there are any operations that could be done to the vector prior to looping, get them outside of those curly brackets.
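As a hypothetical illustration (centering foo on its mean is my own example, not part of the original squaring task), compare recomputing a constant on every pass with computing it once before the loop:

```r
foo = seq(1, 100, by=2)

# Wasteful: mean(foo) is recomputed on every pass even though it never changes
centered.slow = numeric(length(foo))
for (i in 1:length(foo)) {
  centered.slow[i] = foo[i] - mean(foo)
}

# Better: compute the constant once, outside the curly brackets
foo.mean = mean(foo)
centered = numeric(length(foo))
for (i in 1:length(foo)) {
  centered[i] = foo[i] - foo.mean
}

all.equal(centered.slow, centered)
```

Both loops give the same result; the second simply does less work per pass. On a vector this small you won't notice the difference, but it adds up when the vector or the repeated operation is large.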

Avoid growing your object

In the example above we created an empty vector to store our new values in (foo.squared). That vector is empty, and every time we go through the loop we grow the vector by one. It would be faster if we could set up our vector to be the right length ahead of time and then simply fill that vector with the correct values.

foo.squared = numeric(length=50) #generates a vector of 50 zeros; now we run the loop as before
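Put together with the earlier loop, the preallocated version would look something like this:

```r
foo = seq(1, 100, by=2)
foo.squared = numeric(length=50)   # 50 zeros, already the right length

for (i in 1:50) {
  foo.squared[i] = foo[i]^2        # overwrite a zero; the vector never grows
}

length(foo.squared)
foo.squared[5]
```

The only change from the first example is the line that sets up foo.squared; the loop itself is untouched.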

Of course, sometimes when we write loops we don’t know how many things are going to come out the other end. Usually we can guess at an upper bound though. It’s going to be faster to partially fill a very long vector using a loop, and then get rid of the meaningless stuff at the end, than to grow a vector one loop at a time. We can make a very large vector full of NAs and dump the leftover NAs at the end. Give these two loops a try and note the speed difference on your computer.


bar = seq(1,200000, by=2)
bar.squared = rep(NA, 200000)

for (i in 1:length(bar)) {
  bar.squared[i] = bar[i]^2
}

#get rid of excess NAs
bar.squared = bar.squared[!is.na(bar.squared)]
summary(bar.squared)

Versus


bar = seq(1, 200000, by=2)
bar.squared = NULL


for (i in 1:length(bar)) {
  bar.squared[i] = bar[i]^2
}

summary(bar.squared)
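If you'd rather have R do the timing for you, one option is to wrap each loop in base R's system.time(). This is just a sketch; time.prealloc, time.grow and bar.grown are names I made up for it, and the size of the gap will depend on your machine and your version of R (it may be large or fairly small).

```r
bar = seq(1, 200000, by=2)

# Loop 1: fill a long NA vector, then drop the leftover NAs
time.prealloc = system.time({
  bar.squared = rep(NA, 200000)
  for (i in 1:length(bar)) {
    bar.squared[i] = bar[i]^2
  }
  bar.squared = bar.squared[!is.na(bar.squared)]
})

# Loop 2: grow the vector one element per pass
time.grow = system.time({
  bar.grown = NULL
  for (i in 1:length(bar)) {
    bar.grown[i] = bar[i]^2
  }
})

time.prealloc["elapsed"]
time.grow["elapsed"]

# Both approaches should produce the same values
identical(bar.squared, bar.grown)
```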

About Patrick

I'm usually a paleontologist or an isotope geochemist, but I like statistics, math history, markets, and soccer on less technical levels. My posts could involve any or all of the above, or anything else for that matter.

57 thoughts on “Writing a for-loop in R”

  1. Thanks for writing this, I really appreciate it. You did a great job of making loops easy to understand…

    1. Yeah, that’s pretty handy isn’t it? That is the main reason I think that many people believe that the apply command is faster than a loop.

  2. I’m starting to think using R is like riding a fixie. Everyone thinks it’s cool but it’s a real bitch to ride.

  3. I wanted to concur with your assessment of the value of speed in most day-to-day statistical computing. I’m a statistician mostly working on ecological datasets and I also work with climatological and designed experiment data. It doesn’t seem like computational speed affects my ability to solve problems efficiently. The problems I work on are different enough from one another that the majority of my working time is spent in conceptualization of the problem and figuring out ways to solve it. Most computational optimization then seems like spending a dime trying to win a nickel.

    Great post!

    1. I know, right? If it turns out you are doing something that’s going to be repeated numerous times you can always port your algorithm to some other language like Python (or Java or C) if it’s really necessary for it be blazing fast.

  4. Thanks! It could be so easy to understand and to adapt if everyone would write such straight forward examples without any blabla’s and difficult words ^^

  5. Hi Patrick,

    This page has been very helpful, but I was hoping you could help me out with an R function I’m setting up – I have been at this for a few days as I’m new to R. I stumbled on this page because I am trying to figure out why only my last value is being calculated by the for loop. You explain above:

    ” You are missing your square brackets with a counter on the left side of the equal sign. This will result in foo.squared containing only one value – the last value calculated by the loop.”

    I’m not really sure what that means but I am guessing that I have to add ‘[i]’ to the left side of this..
    ‘{ foo.squared = foo[i]^2 }’
    so that I have this inside the for loop instead:
    ‘ { foo.squared[i] = foo[i]^2 }.’

    Am I reading that wrong?

    Here is the function that I’m testing out in the R Console first:
    >iddffor (i in 1:length(id)) {
    +df[i, ] <- c(id[i],0)
    +test.data<-getmonitor(id[i], "specdata")
    }

    *eventually I am going to extract a single value from each iteration of 'test.data' and insert each into the respective i-th row, column 2 of df

    *That for loop above creates df (new data frame) just as I want it, but when I type 'test.data' into the command line to see what it produces, I only get the data frame for
    'getmonitor(5,"specdata")'

    What I am EXPECTING or wanting to get from test.data is a
    list (getmonitor(1,"specdata"), getmonitor(3,"specdata"), getmonitor(5,"specdata"))
    determined by id which I tried specifying above using the i-th element of id…….
    test.data<-getmonitor(id[i],"specdata")

    **NOTES/BACKGROUND: getmonitor() is a function I have saved in the R editor with the arguments (id,directory) where id is a numeric vector of length 1 and directory is a character vector of length 1. The function opens up one SINGLE file named 'id[i]' inside the folder "specdata".

    I tried doing
    +test.data[i] <- getmonitor(id[i], "specdata")

    but it got really weird… any help would be appreciated! Sorry that this is so long, no other help pages were addressing this exact problem. Thank you in advance even if you are not able to help.

  6. Sorry, I don’t know what happened to the for loop. The beginning was cut off. This is the complete code:
    id<-c(1,3,5)
    df<-data.frame(id=rep(NA,3),nobs=rep(NA,3))
    for(i in 1:length(id)){
    df[i,]<-c(id[i],0)
    test.data<-getmonitor(id[i],"specdata")
    }

    If for some reason the code gets messed up again, I set 'id' to equal a list containing the numbers 1, 3, and 5 using the c() function. Then I create an empty data frame called 'df' with two columns labeled 'id' and 'nobs' (1st and 2nd column respectively). Then I begin the for loop iterating from 1:length(id).. which in this case is 1:3

    Thank you. Sorry for the split posts

    1. I figured out how to do it! I really just wanted to ask someone because this has been taking up so much of my time in the last few days! I’m going to post this just in case someone else has similar problems…

      Basically, I realized I hadn’t passed ‘test.data’ to anything yet, so every iteration was just erasing the previous ‘test.data’ and so only the last ‘test.data <- getmonitor(5,"specdata")' was being calculated… so I added this to the for loop, passing the iterations from 'getmonitor' to the dataframe 'df' each time.

      ……
      test.data<-getmonitor(id[i],"specdata")
      good<-complete.cases(test.data)
      nobbs<-sum(good)

      df[i,]<-c(id[i],nobbs)

      }

      return(df)
      }

      1. Iwen,
        Whoa, way to bang your head against a problem until you prevail. Glad you were able to solve your issue!

        Keep up the R!

  7. Thank you. Thank you many times over. I’m not a programmer, but nonetheless I am determined to learn R. It’s a slow process, made slower by the fact that most R-related discussions are written for/by people that have more experience than I do, so they tend to skip over things that would be obvious to a programmer, but are not obvious to me. It’s pretty hard on the self-esteem to go to the “R Programming for Dummies” page, and still get stuck. Thanks for translating the programming jargon into terms a layperson like me can understand!
    That tip about writing an empty vector before the loop – I never would have figured that out through trial and error, and never came across it on stack overflow… although it must be there somewhere. If you ever feel inspired to write about debugging code, I’d love to hear what you have to say. I’ve reached the point where I can find WHERE the errors are (woo traceback), but I usually don’t know what they mean, or how to fix them. Keep up the great posts!
    -a wandering marine biologist

    1. ps. I was in the same class, trying to figure out the same getmonitor function. Boy do I sympathize. I must have spent a solid 12 hours trying to figure out how to write that code… but eventually I figured it out. Cheers, Iwen

  8. thanks for your job
    but I have a question about ability to cut off the loop to return to (FOR) again when a some expression is ascertained

    example
    5 for (i in 1:10){
    if i>5 go to 5
    else go on
    }

    1. Hi Tahir,
      Sounds like you want a “while” loop?
      You can set one up like this…

      foo= NULL
      i=1
      while (i<5){
      foo[i]=i
      i = i+1
      }

      foo would then equal 1 2 3 4

      Is that what you were asking?

  9. Thanks for this tutorial, it is appreciated very much
    It sure helped in figuring out how to get the results i needed.

    obz.id <- unique(data$id)
    n.obz <- NULL

    for(i in 1:length(obz.id)) {
    n.obz[i] <- length(data[completeCases,]$ID[data[completeCases,]$ID == i])
    }

    str(n.obz)
    ## int [1:332] 117 1041 243 474 402 228 442 192 275 148

    Then i manually made a data frame from the output of a variable and the loop

    obz.id <- unique(data$id)
    nobz.

    obz <- data.frame(idz = obz.id, nobz = n.obz)

    head(obz, 3)
    ## idz nobz
    ## 1 1 117
    ## 2 2 1041
    ## 3 3 243

    tail(obs, 3)
    ## idz nobz
    ## 330 330 447
    ## 331 331 284
    ## 332 332 16

    Question:

    How can a loop to ouput data for two or multiple columns for a data frame?
    To achieve the data frame i manually made.

    I have tried several times but i get this results

    obz2 <- data.frame()

    obz2 <- data.frame()
    for(i in 1:length(obz.id)) {
    obs2 <- c(id = id[i] <- obz.id[i],
    nobz = nobz[i] <- length(data[completeCases,]$ID[data[completeCases,]$ID == i]))
    }

    obs2
    ## id nobs
    ## 332 16

    Any help kindly appreciated

    1. Since I don’t have your original data frame, I can’t test it out, but I would try this…
      n <- length(obz.id)
      #set up a data frame with one (empty) row per id before looping
      obz2 <- data.frame(id = rep(NA, n), nobz = rep(NA, n))
      for(i in 1:n) {
      obz2$id[i] <- obz.id[i]
      obz2$nobz[i] <- length(data[completeCases,]$ID[data[completeCases,]$ID == i])
      }

  10. Patrick,
    Thanks for your response, it was very helpful.

    I was able to get it to work like this.

    data <- list.files("/path/to/files/data/", full.names = TRUE, pattern = ".csv")

    data.df <- NULL
    obs <- NULL
    id <- NULL
    nobs <- NULL
    id <- 1:332

    for(i in 1:length(id)) {
    data.df <- rbind(data.df, read.csv(data[i]))
    completeCases <- complete.cases(data.df$sulfate, data.df$nitrate)
    obs$id[i] <- id[i]
    obs$nobs[i] <- length(data.df[completeCases,]$ID[data.df[completeCases,]$ID == i])
    }

    obs <- as.data.frame(obs)

    return(obs)

    In the R environment these work fine

    obs[obs$id %in% 1, ]
    id nobs
    1 1 117
    obs[obs$id %in% c(2, 4,8), ]
    id nobs
    2 2 1041
    4 4 474
    8 8 192

    When i use it in a function i comment,
    #id <- NULL
    #id <- 1:332

    then run the function

    complete("data", 1)
    id nobs
    1 1 117
    complete("data", c(2, 4,8))
    id nobs
    1 2 117
    2 4 1041
    3 8 243

    The 'id" field are return corectly, but not the "nobs" field

    Any help appreciated

    1. I’m not sure of the problem here. Your code seems to work as I would expect. What is the problem with the obz that are being returned? Again, I don’t know what your actual data are, but if you ask for the subset of your data that are equal to 2 4 and 8, it doesn’t look like any of your obz are that small. If you ask for complete(“data”, 117), I would think it would return your first line just as if you ask for complete(“data”, 1). Is this what happens? What do you want to happen?

  11. Really nice article! Thanks! Practical explanations like these are very helpful to beginners like me:). Could you pls recommend other good resources for learning R?

  12. Patrick can you please answer a question that is not clear upon reading many different tutorials and help sites? (good subject to maybe expand upon your article by this one point). Other sites and tutorials are too basic and don’t cover things as in depth as you did here.

    What data type is the for loop indexing variable, most commonly defined as “i”? Is it a list, is it a vector, is it a numeric item within a vector… (my guess was the last one but the code doesn’t seem to treat it as such)?

    1. The first time through the loop i takes on the value of the first item in the vector you supply.
      e.g., for (i in 1:5){} i will be 1 the first time through. Your “vector” could be any R object (I think). You could store a vector like this: myvector=c(5, 13, 25, 100, 4).
      for (i in myvector){} i will be 5 the first time through and 13 the second time through, etc. You can play around with this by just asking R to print i in the loop.
      for (i in myvector){print(i)} and see what values i is taking on. Does this help?

      1. It seems to display as a vector of numbers. I am new to R but I suspect that could change depending on how you use i.

        > x for (i in 1:5) {
        + x[i] print(x)
        [1] 1 2 3 4 5
        >

        1. That didn’t format so well… a vector of length one to be specific.

          > x for (i in 1:5) {
          + x[i] print(x)
          [1] 1 2 3 4 5
          >

      2. Third time is a charm…

        x <- vector()
        for (i in 1:5) {
        x[i] <- i
        print(i)
        }
        [1] 1
        [1] 2
        [1] 3
        [1] 4
        [1] 5
        print(x)
        [1] 1 2 3 4 5

        1. Sure, and i could be an item in a character vector as well.
          for (i in letters[1:5]){
          print(i)
          }

          will give you
          “a” “b” “c” “d” “e”, each printed on its own line

  13. #Thank you, this is truly very helpful specially for biologist like me. I have a question.. So I am using .csv files that have data.. first column is usually sample/patient name then each subsequent column is a data point that is either numerical or factor. since sometimes we have 100+ columns, it is cumbersome to write mean or table (Filename$Columnname) for each to get an output. .. or whatever function. Instead, i was trying to use the for loop function. so what i did is first create the data frame using:

    IDA=read.csv("IDAData.csv", header=TRUE, sep=",")

    #Then, i created a null vector as you indicated above

    mean.IDA=NULL

    #then i wrote for loop as such:

    for (i in colnames(IDA) {mean.IDA= mean(IDA$i)}

    #then i type mean.IDA to see an output but what i get is “NULL” . I am not sure why is this happening.

    Also, if some of the columns are numerica and other are continuous, how can i loop over only those who are factor (or numerical) to avoid the error/warning messages.

    I’d appreciate your response as i have been stuck at this for quite some time now.

  14. How can I solve these problems:
    a) For – Loop
    Assume you have estimated the following regression analysis: y = 10 + 5*X. By means of a for-loop you predict the y’s corresponding to all possible X’s (ranging from 0 – 600). Save predicted y’s in vector y.

    b) While – Loop
    Assume you have estimated the following regression analysis: y = 10 + 5*X. By means of a while-loop you predict the y’s corresponding to all possible X’s as long as the predicted value is ≤ 3000. Save the predicted y’s.

    c) Repeat – Loop
    Assume you have estimated the following regression analysis: y = 10 + 5*X. By means of a repeat-loop you predict the y’s corresponding to all possible X’s as long as the predicted value is ≤ 3000. Save the predicted y’s.

    Thanks

  15. Patrick,

    Great post ! I’ve recently learnt how to do complex for loops the hard way. I can’t understand why other teachers of this subject miss the whole point – which is that important objects written in a for loop must have different names from those used outside a for loop eg. data[[i]] instead of data1 etc. – and this is a nightmare if these new names aren’t accepted.

    So here is my method for converting long complex code into a for loop so it can be repeated:

    1. Put your datasets into lists, eg. dataset1 <- list(….), dataset2 <- list(….)

    2. Write objects which will be output from the for loop as empty lists (I haven't used vectors) eg.

    object1 <- list()
    object2 <- list()

    3. Then write the for loop eg.:

    for (i in 1:50 ) {

    4. Then immediately convert your dataset lists into temporary vectors within the for loop. The reason for this is that not all functions will accept an indexed [[i]] object (even if that indexed object IS IDENTICAL to the normal non-indexed object in terms of content !):

    data1 <- as.vector(dataset1[[i]])
    data2 <- as.vector(dataset2[[i]])

    The rest of the coding which is using data1 and data2 can then be left as it was until the next step:

    5. At the end of the for loop add [[i]] to (ie. index) the objects which are to be extracted after you close the loop.

    output1[[i]] <- ………
    output2[[i]] <- ………

    (Make sure these are listed as empty lists in 2. above, and that all mentions of these objects are indexed with [[i]])

    6. Try the for loop – probably won't work (but might) because objects to the right hand side have not been indexed with [[i]] – so now convert these in whole loop to indexed objects and make sure they are listed in 2. above.

    output1[[i]] <- object1[[i]] + object2[[i]]

    7. This way work backwards through the coding – at some point the computer will eventually realise that this is indeed a loop and that you really do want to iterate through all the iterations extracting information from each step (instead of just from the last step) !!! – that should give you a working loop with a minimum number of indexed objects.

    Hope this helps.

  16. This was really helpful, thank you!

    Anyway, I thought you should know that there is another article that pretty blatantly rips off huge parts of this article (eg. “This was a hard lesson for me to learn. R doesn’t like being told to operate on a vector that doesn’t exist yet.” from this article is rewritten as “This was a quite hard lesson that I learnt after spending about half an hour understanding my mistake. R basically does not like to operate upon a vector which does not have an existence.” )

    The plagiarising article is located here: https://blog.udemy.com/r-tutorial/

    Hope this helps!

    1. I have seen it. I was really ruffled by it initially, but as I read through it I saw that it was so bad that I really shouldn’t be too concerned.

  17. Not at all involved. A blatant attempt at a ripoff. But the post was so unreadable and unintelligible, I assume no one is actually using it.

  18. Hey there, I know you’ve mentioned the apply function in a paragraph and that “using apply is best left for another post”, but still I think it would be good if you left the code to do the same command using apply. For those who are interested, this takes the same amount of time as creating an empty vector and filling it.

    bar = seq(1,200000, by=2)
    bar.squared = sapply(bar, function(x) x^2)
    summary(bar.squared)

  19. Im new in R, and the teacher is encouraging me to write in R language, my question is how i can apply “for” loop inside another “for” loop, the equation has to sigma signs ∑∑??

  20. hi, and thanks for useful lessons
    i have a data that should even 2 columns near be 1 column for example:
    1 2 2 3 3 4 1 2
    2 3 1 2 1 3 4 1
    1 2 3 2 4 1 4 2
    these should be:
    12 23 34 12
    23 12 13 41
    12 32 41 42
    how can i do this work in R software for these 3 rows and 4 columns?

  21. hi,sir. I am new in R. I need help to create a command for “for loop” .. I have a variable that have col=16 and row=252. this var is key for some formulas, however I use the data for each 36 rows and all col [i:(i:35),16], such as [1:36,16], [2:37,16],… [217:252,16]. so, how to describe that i is var x.

    x=y-r
    x = [252,16] –>numeric
    t=matrix(colMeans(x)16,1)
    s=matrix(cov(x),16,16)
    A = t(t)%*%(s)^-1%*%t
    ….
    ….
    ….
    therefore, I want use those statement for 17 times windows and per windows have 36 rows.
    my loop(but it doesn’t work, but I write manual or w/o loop, it’s work):
    x=y-r
    x = [252,16] –>numeric
    for (i in exchange){
    t=matrix(colMeans(x[i:(i+35)],)16,1))
    s=matrix(cov(x[i:(i+35)],),16,16))
    A = t(t)%*%(s)^-1%*%t
    ….
    ….
    ….

    }

    I hope you can help me..
    thank you

  22. i have a vector x<-c(4,-2,-3,0,7,-9,10,-10,5,-4)
    i need a r code that will extract and output the negative values of x and retrace positive values of x with multiplication of values of x by 2.
    note using for loop
