Insert random NAs in a vector in R

I was recently writing a function that needed to deal with NAs in some kind of semi-intelligent way. I wanted to test it with some fake data, meaning that I was going to need a vector with some random NAs sprinkled in. After a few disappointing Google searches and a Stack Overflow post or two that left something to be desired, I sat down, thought for a few minutes, and came up with this.

# create a vector of random values
foo <- rnorm(n = 100, mean = 20, sd = 5)
# randomly choose 15 indices to replace
# this is the step in which I thought I was clever
# because I use which() and %in% in the same line
ind <- which(foo %in% sample(foo, 15))
# now replace those indices in foo with NA
foo[ind] <- NA
# here is our vector with 15 random NAs
foo

Not especially game changing but more elegant than any of the solutions I found on the interwebs, so there it is FTW.

About Patrick

I'm usually a paleontologist or an isotope geochemist, but I like statistics, math history, markets, and soccer on less technical levels. My posts could involve any or all of the above, or anything else for that matter.

10 thoughts on “Insert random NAs in a vector in R”

  1. You don’t need which() or even the %in%. Just use ind <- sample(seq_along(foo), 15). Or replace seq_along() with length() for that matter.

    If you want to keep the %in% part, leave off the which(); you can use the resulting logical vector to index without the which() to extract the indices of the TRUE values.

  2. Good call everyone on avoiding the index vector. A post on R-bloggers will result in clean code, that’s for sure.

  3. Here’s what I’ve been using (for data frames)…
    ################################################################
    #
    #
    ######### The create random missing values function.
    #
    # First, grab some data.

    no.miss <- read.table("http://www.unt.edu/rss/class/Jon/R_SC/Module4/missForest_noMiss.txt",
    header=TRUE, sep=",", na.strings="NA", dec=".", strip.white=TRUE)
    summary(no.miss)
    nrow(no.miss)
    ncol(no.miss)

    # Next, take a sample of the data.

    samp <- no.miss[sample(seq_len(nrow(no.miss)), 500, replace = FALSE),]
    summary(samp)
    nrow(samp)
    ncol(samp)

    # Now, create the function.

    a.prob <- .05 # This is the proportion of missing data.
    b.prob <- 1 - a.prob

    # apply() passes each cell in as the first argument; its value is ignored.
    create.NA <- function(data, impute.cols = NULL){
    sample(c(NA,1), 1, prob = c(a.prob,b.prob), replace = TRUE)
    }

    # Apply the function (across rows and columns); while leaving out
    # the row ID column (column #1).

    samp.na <- is.na(apply(samp[,-1], c(1,2), create.NA))
    samp.NA <- samp[,-1]

    for (j in 1:ncol(samp.na)){
    out <- which(samp.na[,j])
    samp.NA[out,j] <- NA
    }; rm(a.prob, b.prob, j, out, create.NA, samp.na)

    # Reassemble the data (with the row ID column).

    samp.NA <- data.frame(samp[,1], samp.NA)
    names(samp.NA)[1] <- "id"
    summary(samp.NA)
    nrow(samp.NA)
    ncol(samp.NA)

    # End script.
    ################################################################

    Cheers,
    Jon

  4. The original code will usually work if you don’t have the same value multiple times in foo, but if you do, every copy of a sampled value is matched, so you may end up with more than 15 NAs. For example:

    foo <- c(letters, letters)

    will always eliminate the letters in pairs with the original code.
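
The points raised in the comments can be checked with a short sketch. It compares the value-matching approach from the post with index-based sampling on a vector full of duplicates, then shows one possible vectorised take on the data-frame case. The seed, the names by_value, by_index, p, df, vals and mask, and the toy data frame are all made up for illustration; none of them come from the post or the comments.

```r
set.seed(42)  # arbitrary seed, only so the sketch is reproducible

# A vector in which every value occurs exactly twice
foo <- c(letters, letters)

# Approach from the post: match on sampled *values*
by_value <- foo
by_value[which(by_value %in% sample(by_value, 15))] <- NA
n_by_value <- sum(is.na(by_value))  # an even number, at least 16 for this vector

# Index-based approach: sample *positions* instead
by_index <- foo
by_index[sample(seq_along(by_index), 15)] <- NA
n_by_index <- sum(is.na(by_index))  # exactly 15, duplicates or not

# Data-frame variant (an assumption, not the commenter's code):
# blank out each non-id cell independently with probability p
p <- 0.05  # hypothetical proportion of missing data
df <- data.frame(id = 1:20, x = rnorm(20), y = runif(20))  # toy data
vals <- df[-1]  # every column except the id column
mask <- matrix(runif(nrow(vals) * ncol(vals)) < p, nrow = nrow(vals))
vals[mask] <- NA  # a logical matrix can index a data frame of the same shape
df[-1] <- vals
```

Sampling positions gives an exact count of NAs. The per-cell probability version only hits the target proportion on average, which may or may not be what a given test needs.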
