Tag Archives: R

Strange behavior from the cut function with dates in R

Update: @hadleywickham tweeted to let me know that daylight saving time was the culprit. This explains the behavior I document in the first part of this post, but the behavior of the cut function using truncated dates (discussed further down the post) is still unexplained.

I recently encountered some strange behavior from R when using the cut.POSIXt method with “day” as the interval specification. This function isn’t working as I intended and I doubt that it is working properly. I’ll show you the behavior I’m seeing (and what I was expecting), then I’ll show you my current base R workaround. To generate a reproducible example, I’ll use this latemail function I gleaned from this Stack Overflow post.

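latemail <- function(N, st="2013/01/01", et="2013/12/31") {
 # convert the start and end dates to POSIXct (midnight UTC)
 st <- as.POSIXct(as.Date(st))
 et <- as.POSIXct(as.Date(et))
 # length of the interval in seconds
 dt <- as.numeric(difftime(et, st, units="secs"))
 # N sorted random offsets (in seconds) from the start
 ev <- sort(runif(N, 0, dt))
 # return the random date-times
 st + ev
 }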

And generate some data…


set.seed(7110)
# generate 1000 random POSIXct dates and times
bar <- data.frame("date"=latemail(1000, st="2013/03/02", et="2013/03/30"))
# assign a code to each day (labels = FALSE returns integer codes rather than a factor)
bar$dateCut <- cut(bar$date, "day", labels = FALSE)

I expected that all rows with the date 2013-03-01 would receive factor 1, all rows with the date 2013-03-02 would receive factor 2, and so on. At first glance this seems to be what is happening.

head(bar, 10)
     date                 dateCut
1    2013-03-01 19:10:31  1
2    2013-03-01 19:31:31  1
3    2013-03-01 19:55:02  1
4    2013-03-01 20:09:36  1
5    2013-03-01 20:13:32  1
6    2013-03-01 22:15:42  1
7    2013-03-01 22:16:06  1
8    2013-03-01 23:41:50  1
9    2013-03-02 00:30:53  2
10   2013-03-02 01:08:52  2

Note that at row 9 the date changes from March 1 to March 2 and the factor (dateCut) changes from 1 to 2. So far so good. But we shall see some strange things in the midnight hour.  
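
Since daylight saving time turned out to be the culprit, one way to sidestep it is to bin on the calendar date rather than on the date-time. The sketch below is only an illustration, not necessarily the workaround from the full post: the four timestamps and the America/Chicago time zone are assumptions, chosen because they straddle the 2013 spring-forward on March 10.

```r
# Hypothetical timestamps around the 2013 spring-forward (March 10, US Central)
tz <- "America/Chicago"  # assumed time zone; substitute your own
x <- as.POSIXct(c("2013-03-09 23:30:00", "2013-03-10 00:30:00",
                  "2013-03-10 23:30:00", "2013-03-11 00:30:00"), tz = tz)

# Drop the time of day, keeping the calendar date in the same time zone
day <- as.Date(x, tz = tz)

# Number the days consecutively, starting at 1 for the earliest date
dayIndex <- as.integer(day - min(day)) + 1L
data.frame(date = x, dayIndex = dayIndex)
```

Because the index is computed from whole calendar dates, a 23-hour or 25-hour day cannot shift the bin boundaries.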

Insert random NAs in a vector in R

I was recently writing a function that needed to deal with NAs in some kind of semi-intelligent way. I wanted to test it with some fake data, meaning that I was going to need a vector with some random NAs sprinkled in. After a few disappointing Google searches and a Stack Overflow post or two that left something to be desired, I sat down, thought for a few minutes, and came up with this.

# create a vector of random values
foo <- rnorm(n=100, mean=20, sd=5)
# randomly choose 15 indices to replace
# this is the step in which I thought I was clever
# because I use which() and %in% in the same line
ind <- which(foo %in% sample(foo, 15))
# now replace those indices in foo with NA
foo[ind] <- NA
# here is our vector with 15 random NAs
foo

Not especially game changing but more elegant than any of the solutions I found on the interwebs, so there it is FTW.
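
One caveat with the approach above: %in% matches on values, so if the vector contained duplicated values (unlikely with rnorm, but common with integers or rounded data), more than 15 elements could end up as NA. Here is a minimal sketch of an alternative that samples positions instead of values; the seed and the name baz are just for illustration.

```r
set.seed(42)  # arbitrary seed so the example is reproducible
# a vector with many repeated values
baz <- sample(1:10, 100, replace = TRUE)
# sample 15 positions (not values) to replace
ind <- sample(length(baz), 15)
baz[ind] <- NA
# count the NAs: one per sampled position, duplicates or not
sum(is.na(baz))
```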

GIS in R: Part 1

I messed around with R for years without really learning how to use it properly. I think it’s because I could always throw my hands up when the going got tough and run back and cling to the skirts of Excel or JMP or Systat. I finally learned how to use R when I needed to do a fairly hefty GIS project and I didn’t have access to a computer with ArcGIS and couldn’t afford to buy it (who can?). So I started looking into R’s spatial abilities.

Admittedly, R might not be the most obvious choice among free GIS options. Combinations of QGIS (http://www.qgis.org/), GRASS (http://grass.osgeo.org/), PostGIS (http://postgis.refractions.net/), or OpenGeo (http://boundlessgeo.com/solutions/opengeo-suite/download/) might pop up in Google searches before R. R might not even be the first general purpose programming language you think of for GIS, especially now that ArcGIS relies on Python for much of its modeling. However, all of these tools have a significant learning curve, and I was farther along in R than in any of these alternatives, so I started googling and watching tutorial videos.

So should you be using R for analyzing and displaying spatial data? If you already know a little or a lot of R, if you need a cross-platform solution, or if you need to apply some fairly heavy stats to your spatial data, R just might be a good solution for you. It turns out R has lots of support for spatial data and does a great job displaying it too.

There are a number of packages useful for analyzing and displaying your spatial data. I think the four most useful right out of the gate are sp, rgdal, maptools, and raster. If you haven’t installed packages before, do this…

install.packages("sp")
install.packages("raster")
install.packages("maptools")

…and if you are on a Windows machine…

install.packages("rgdal")

If you’re on a Mac, installing rgdal is a little tricky. Give this a try

setRepositories(ind=1:2)
install.packages("rgdal")

If that doesn’t work read this over.

Installing rgdal on a Mac

After installing a package, you need to load it before you can use the functions it contains. To use the functions in the sp package, you should type

library(sp)

to load the rgdal package…

library(rgdal)

etc.
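
If you are not sure which of these packages made it onto your machine, a small check before calling library() can save some confusing error messages. This is a minimal sketch; missingPackages is a made-up helper name, not something from any of the packages above.

```r
# return the names of packages that are not installed
missingPackages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# check the spatial packages used in this post
missingPackages(c("sp", "raster", "maptools", "rgdal"))
```

An empty result means everything is installed; anything listed still needs install.packages().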

Writing a for-loop in R

freeimages.co.uk

There may be no R topic that is more controversial than the humble for-loop. And, to top it off, good help is hard to find. I was astounded by the lack of useful posts when I googled “for loops in R” (the top result linked to a page that did not exist). In fact, even searching for help within R is not easy and not even that helpful when successful (?for won’t get you anywhere; ?'for' will get you the help page, but it is by no means exhaustive). So, at the request of Sam, a faithful reader of the Paleocave blog, I’m going to throw my hat into the ring and brace myself for the potential onslaught of internet troll wrath.

How to loop in R

Use the for loop if you want to do the same task a specific number of times.
It looks like this.

for (counter in vector) {commands}

I’m going to set up a loop to square every element of my dataset, foo, which contains the odd integers from 1 to 100 (keep in mind that vectorizing would be faster for my trivial example – see below).


foo = seq(1, 100, by=2)

foo.squared = NULL

for (i in 1:50) {
  foo.squared[i] = foo[i]^2
}

If the creation of a new vector is the goal, you first have to set up a vector to store things in before running the loop. This is the foo.squared = NULL part. This was a hard lesson for me to learn. R doesn’t like being told to operate on a vector that doesn’t exist yet. So, we set up an empty vector to add stuff to later (note that this isn’t the most speed-efficient way to do this, but it’s fairly fool-proof). Next, the real for-loop begins. This code says we’ll loop 50 times (1:50). The counter we set up is ‘i’ (but you can put whatever variable name you want there), and it tracks which loop we are on (for the first loop, i=1; second loop, i=2). For our new vector foo.squared, the ith element will equal the square of the ith element of foo.
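
For comparison, here is a sketch of the two faster alternatives: preallocating the result vector to its final length, and skipping the loop entirely with a vectorized operation. The names foo.prealloc and foo.vectorized are just for illustration.

```r
foo <- seq(1, 100, by = 2)

# preallocate a numeric vector of the final length, then fill it in
foo.prealloc <- numeric(length(foo))
for (i in seq_along(foo)) {
  foo.prealloc[i] <- foo[i]^2
}

# vectorized version: ^ operates on every element at once
foo.vectorized <- foo^2

# both approaches give the same result
identical(foo.prealloc, foo.vectorized)
```

Using seq_along(foo) instead of a hard-coded 1:50 means the loop still works if the length of foo changes.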

Science-y New Year’s Resolution: Learn to Code

In a 1995 interview, Steve Jobs said he thought that computer programming should be a liberal art. In other words, he thought everyone’s education should include a year of learning a computer language, because it teaches you how to think in a certain way. If that was true in 1995, just think how much more crucial knowing how to code in some language is today. Perhaps learning a computer language should be on your to-do list; maybe a new year’s resolution?

If you want to learn a computer language, the logical question is: which one?

moRe

Hopefully my first R post whetted your appetite for open source data software. I’m gearing up for more R posts regardless. I thought I’d do a quick post about a couple of useful commands, ‘View’ and ‘fix’. When you first break the shackles of Excel, one of the toughest things is not being able to see your data. Try this: fire up R (go download it and install it if you haven’t already) and let’s call up a built-in dataset by typing

volcano
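
If printing all of volcano floods your console, a few base R functions give a more compact look at it. This is a small sketch; the View call is wrapped in interactive() because it opens a window and only makes sense in an interactive session.

```r
# volcano ships with base R (datasets package)
dim(volcano)          # number of rows and columns
str(volcano)          # compact description of the object
volcano[1:5, 1:5]     # top-left corner of the matrix

if (interactive()) {
  View(volcano)       # spreadsheet-style viewer
}
```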

R you ready for this? Statistics for free!

If you’ve listened to the show for a while or if you’ve been reading the Paleocave blog from the beginning (like when we actually used to update it regularly), then you might know that I’m rather fascinated with statistics. Imagine my delight a few years ago when I found out that one of the most powerful statistical tools around (the one that most of the cool kids use) was available for free! That tool is called R. It’s a great tool but a terrible name. R is named both for its developers, Robert Gentleman and Ross Ihaka (Robert and Ross), and as a sort of pun, because it was an open source rewrite of the S language. That’s cool, I guess, but R as a name is horrible for search engine optimization. Oh well, keeps out the riff-raff I suppose.

The vast majority of people would call R a programming language. Real computer programmers (the kind of people that argue about Ruby vs Perl) will tell you it’s not really a ‘language,’ it’s a ‘programming environment.’ Whatever, I don’t think I really know the difference. Don’t get intimidated, because it’s pretty easy to do as much or as little as you want in R.