I was recently writing a function that needed to deal with NAs in some kind of semi-intelligent way. I wanted to test it with some fake data, which meant I needed a vector with some random NAs sprinkled in. After a few disappointing Google searches and a Stack Overflow post or two that left something to be desired, I sat down, thought for a few minutes, and came up with this.
#create a vector of random values
foo <- rnorm(n=100, mean=20, sd=5)
#randomly choose 15 indices to replace
#this is the step in which I thought I was clever
#because I use which() and %in% in the same line
ind <- which(foo %in% sample(foo, 15))
#now replace those indices in foo with NA
foo[ind] <- NA
#here is our vector with 15 random NAs
foo
Not especially game changing but more elegant than any of the solutions I found on the interwebs, so there it is FTW.
You can just do:
foo <- rnorm(n=100, mean=20, sd=5)
foo[sample(1:length(foo),15)] <- NA
You don’t need which() or even the %in%. Just use ind <- sample(seq_along(foo), 15). Or pass length(foo) instead of seq_along(foo) for that matter, since sample(n, 15) draws from 1:n.
If you want to keep the %in% part, leave off the which(). The logical vector that %in% returns can index foo directly, so there is no need for which() to extract the indices of the TRUE values.
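The variants suggested in these comments can be sketched side by side. This is only an illustration: the seed and the names foo1, foo2, foo3 and idx are arbitrary and not from the original post.

```r
set.seed(1)  # arbitrary seed, only so the run is repeatable
foo <- rnorm(n = 100, mean = 20, sd = 5)

# Variant 1: sample positions directly with seq_along()
foo1 <- foo
idx <- sample(seq_along(foo1), 15)
foo1[idx] <- NA

# Variant 2: sample(n, k) draws from 1:n, so length() works as well
foo2 <- foo
foo2[sample(length(foo2), 15)] <- NA

# Variant 3: keep %in% but drop which(); the logical vector indexes directly
foo3 <- foo
foo3[foo3 %in% sample(foo3, 15)] <- NA

# Count the NAs produced by each variant
sum(is.na(foo1))
sum(is.na(foo2))
sum(is.na(foo3))
```

Variants 1 and 2 pick positions, so they always replace exactly 15 elements. Variant 3 still matches by value, so it only gives 15 when the sampled values are unique in foo.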
foo[sample(1:length(foo),10)] = NA
foo[sample(seq(foo), 15)] <- NA
Try this:
foo <- rnorm(n=100, mean=20, sd=5)
foo[sample.int(length(foo), 15)] <- NA_real_
Good call everyone on avoiding the index vector. A post on R-bloggers will result in clean code, that’s for sure.
Here’s what I’ve been using (for data frames)…
################################################################
#
#
######### The create random missing values function.
#
# First, grab some data.
no.miss <- read.table("http://www.unt.edu/rss/class/Jon/R_SC/Module4/missForest_noMiss.txt",
header=TRUE, sep=",", na.strings="NA", dec=".", strip.white=TRUE)
summary(no.miss)
nrow(no.miss)
ncol(no.miss)
# Next, take a sample of the data.
samp <- no.miss[sample(seq_len(nrow(no.miss)), 500, replace = FALSE),]
summary(samp)
nrow(samp)
ncol(samp)
# Now, create the function.
a.prob <- .05 # <- This is the proportion of missing data.
b.prob <- 1 - a.prob
create.NA <- function(data, impute.cols = NULL){
  # Both arguments are ignored; each call returns NA with probability a.prob, else 1.
  sample(c(NA,1), 1, prob = c(a.prob,b.prob), replace = TRUE)
}
# Apply the function (across rows and columns); while leaving out
# the row ID column (column #1).
samp.na <- is.na(apply(samp[,-1], c(1,2), create.NA))
samp.NA <- samp[,-1]
for (j in 1:ncol(samp.na)){
  out <- which(samp.na[,j])  # samp.na is already logical, so no comparison with "TRUE" is needed
  samp.NA[out,j] <- NA
}; rm(a.prob, b.prob, j, out, create.NA, samp.na)
# Reassemble the data (with the row ID column).
samp.NA <- data.frame(samp[,1], samp.NA)
names(samp.NA)[1] <- "id"
summary(samp.NA)
nrow(samp.NA)
ncol(samp.NA)
# End script.
################################################################
Cheers,
Jon
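For readers who cannot reach the data file, the same idea can be sketched on made-up data. This is not Jon's script, just a shorter take on it: the data frame below is a hypothetical stand-in, and the names mask and n_cells are illustrative. As in the script above, each cell is blanked independently with probability a.prob, so the realised share of NAs will only be close to 5%.

```r
set.seed(123)  # arbitrary seed, only so the run is repeatable

# Hypothetical stand-in data; the first column plays the role of the row ID
samp <- data.frame(id = 1:500,
                   x  = rnorm(500),
                   y  = runif(500),
                   z  = sample(letters, 500, replace = TRUE),
                   stringsAsFactors = FALSE)

a.prob <- 0.05  # proportion of missing data, as in the script above

# Logical mask with one column per non-ID column;
# each cell is TRUE with probability a.prob
n_cells <- nrow(samp) * (ncol(samp) - 1)
mask <- matrix(runif(n_cells) < a.prob, nrow = nrow(samp))

# Blank out the masked cells column by column, leaving the ID column alone
samp.NA <- samp
for (j in seq_len(ncol(mask))) {
  samp.NA[[j + 1]][mask[, j]] <- NA
}

summary(samp.NA)
```

Drawing the whole mask at once avoids calling a function once per cell, which is what apply() over both margins does.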
Baaahhh!
The data URL was cut off above…
Here it is…
http://www.unt.edu/rss/class/Jon/R_SC/Module4/missForest_noMiss.txt
The original code will usually work if you don’t have the same value multiple times in foo, but if you do, then you may not have 15 eliminated. For example:
foo <- c(letters, letters)
will always eliminate the letters in pairs with the original code.
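A quick way to see this is to run both approaches on that vector. The sketch below uses an arbitrary seed and illustrative names (by_value, by_index). Because every letter appears exactly twice, the value-based version always removes an even number of elements, so it can never hit exactly 15 here.

```r
set.seed(42)  # arbitrary seed, only so the run is repeatable
foo <- c(letters, letters)  # every value appears exactly twice

# Original approach: matches by value, so both copies of a sampled letter go
by_value <- foo
by_value[which(by_value %in% sample(by_value, 15))] <- NA
n_by_value <- sum(is.na(by_value))  # always even, so never exactly 15

# Index-based approach: matches by position, so exactly 15 are replaced
by_index <- foo
by_index[sample.int(length(by_index), 15)] <- NA
n_by_index <- sum(is.na(by_index))

n_by_value
n_by_index
```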
You are a life-saver, Patrick.