# Tag Archives | Statistics

## Strange behavior from the cut function with dates in R

Update: @hadleywickham tweeted to let me know that daylight saving time was the culprit. Though this explains the behavior I document in the first part of this post, the behavior of the cut function with truncated dates (discussed further down the post) is still unexplained.

I recently encountered some strange behavior from R when using the cut.POSIXt method with “day” as the interval specification. The function isn’t doing what I intended, and I doubt it is working properly. I’ll show you the behavior I’m seeing (and what I was expecting), then I’ll show you my current base R workaround. To generate a reproducible example, I’ll use this latemail function I gleaned from a Stack Overflow post.

```
latemail <- function(N, st = "2013/01/01", et = "2013/12/31") {
  st <- as.POSIXct(as.Date(st))
  et <- as.POSIXct(as.Date(et))
  dt <- as.numeric(difftime(et, st, unit = "sec"))
  ev <- sort(runif(N, 0, dt))
  rt <- st + ev
}
```

And generate some data…

```
set.seed(7110)
# generate 1000 random POSIXct dates and times
bar <- data.frame("date" = latemail(1000, st = "2013/03/02", et = "2013/03/30"))
# assign factors based on the day portion of the date
bar$dateCut <- cut(bar$date, "day", labels = FALSE)
```

I expected that all rows with the date 2013-03-01 would receive factor 1, all rows with the date 2013-03-02 would receive factor 2, and so on. At first glance this seems to be what is happening.

`head(bar, 10)`
```
     date                 dateCut
1    2013-03-01 19:10:31  1
2    2013-03-01 19:31:31  1
3    2013-03-01 19:55:02  1
4    2013-03-01 20:09:36  1
5    2013-03-01 20:13:32  1
6    2013-03-01 22:15:42  1
7    2013-03-01 22:16:06  1
8    2013-03-01 23:41:50  1
9    2013-03-02 00:30:53  2
10   2013-03-02 01:08:52  2
```

Note that at row 9 the date changes from March 1 to March 2 and the factor (dateCut) changes from 1 to 2. So far so good. But we shall see some strange things in the midnight hour.
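One base-R alternative worth sketching here (not necessarily the workaround this post goes on to describe) is to skip cut.POSIXt entirely and label each timestamp with its calendar day via format(), then turn those labels into consecutive integers. Because the day label is read off in local time rather than computed by interval arithmetic, the daylight saving transition can’t shift the boundaries. The dates below are made up to straddle the US spring-forward change:

```
# Sketch of a base-R alternative to cut(x, "day"): label each timestamp
# with its calendar day, then convert the labels to consecutive integers.
# (Illustrative only -- not necessarily the workaround the post describes.)
dayFactor <- function(x) {
  d <- format(x, "%Y-%m-%d")  # calendar day in local time
  as.integer(factor(d, levels = sort(unique(d))))
}

x <- as.POSIXct(c("2013-03-09 23:30:00",  # night before US DST change
                  "2013-03-10 03:10:00",  # the DST change day itself
                  "2013-03-11 00:05:00"),
                tz = "America/New_York")
dayFactor(x)  # 1 2 3, one integer per calendar day
```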

## Insert random NAs in a vector in R

I was recently writing a function which was going to need to deal with NAs in some kind of semi-intelligent way. I wanted to test it with some fake data, meaning that I was going to need a vector with some random NAs sprinkled in. After a few disappointing Google searches and a Stack Overflow post or two that left something to be desired, I sat down, thought for a few minutes, and came up with this.

```
# create a vector of random values
foo <- rnorm(n = 100, mean = 20, sd = 5)

# randomly choose 15 indices to replace
# this is the step in which I thought I was clever
# because I use which() and %in% in the same line
ind <- which(foo %in% sample(foo, 15))

# now replace those indices in foo with NA
foo[ind] <- NA

# here is our vector with 15 random NAs
foo
```

Not especially game changing but more elegant than any of the solutions I found on the interwebs, so there it is FTW.
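As a hedged aside, one detail of the which()/%in% trick is that it matches on values, so it relies on the vector containing no duplicates (safe for rnorm draws, but not in general). A sketch of a variant that samples the indices directly avoids that edge case:

```
# Alternative sketch: sample the indices themselves rather than the values.
# This behaves correctly even when the vector contains duplicate values,
# where matching on values could replace more (or fewer) than n entries.
insert_nas <- function(x, n) {
  x[sample(length(x), n)] <- NA
  x
}

set.seed(42)
foo <- rnorm(n = 100, mean = 20, sd = 5)
foo <- insert_nas(foo, 15)
sum(is.na(foo))  # 15
```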

## GIS in R: Part 1

I messed around with R for years without really learning how to use it properly. I think it’s because I could always throw my hands up when the going got tough and run back and cling the skirts of Excel or JMP or Systat. I finally learned how to use R when I needed to do a fairly hefty GIS project and I didn’t have access to a computer with ArcGIS and couldn’t afford to buy it (who can?). So I started looking into R’s spatial abilities.

Admittedly, R might not be the most obvious free GIS option; combinations of QGIS (http://www.qgis.org/), GRASS (http://grass.osgeo.org/), PostGIS (http://postgis.refractions.net/), or OpenGeo (http://boundlessgeo.com/solutions/opengeo-suite/download/) might pop up in Google searches before R. R might not even be the first general-purpose programming language you think of for GIS, especially now that ArcGIS relies on Python for much of its modeling. However, all of these tools have a significant learning curve, and I was farther along in R than in any of the alternatives, so I started googling and watching tutorial videos. So should you be using R for analyzing and displaying spatial data? If you already know a little or a lot of R, need a cross-platform solution, or need to apply some fairly heavy statistics to your spatial data, R just might be a good solution for you. It turns out R has lots of support for spatial data and does a great job displaying it too.

There are a number of packages useful for analyzing and displaying your spatial data. I think the four most useful right out of the gate are sp, rgdal, maptools, and raster. If you haven’t installed packages before, do this…

```
install.packages("sp")
install.packages("raster")
install.packages("maptools")
```

…and if you are on a Windows machine…

```
install.packages("rgdal")
```

If you’re on a Mac, installing rgdal is a little tricky. Give this a try:

```
setRepositories(ind = 1:2)
install.packages("rgdal")
```

If that doesn’t work, read this over:
http://blog.fellstat.com/?p=46

After installing the packages, you need to load a package’s library before you can use the functions it contains. To use the functions in the sp package, type

```
library(sp)
library(rgdal)
```

and so on for each package you need.
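Once sp is loaded, here is a minimal sketch of what working with its core class looks like (the coordinates and site names below are made up for illustration; this assumes the sp package installed successfully):

```
# Minimal sketch of sp's SpatialPointsDataFrame class.
# The coordinates and attributes here are made-up example data.
library(sp)

xy <- data.frame(lon = c(-120.5, -120.7, -121.1),
                 lat = c(  38.5,   38.9,   39.2))
pts <- SpatialPointsDataFrame(
  coords = xy,
  data   = data.frame(site = c("A", "B", "C")),
  proj4string = CRS("+proj=longlat +datum=WGS84")
)

summary(pts)  # class, bounding box, projection, attribute summary
plot(pts)     # quick map of the points
```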

## Moneyball: Sports, Markets, and Statistics

I’m not exactly early to the party on this one. Michael Lewis’ book, Moneyball, came out in 2003. I’m a bit of a Michael Lewis fan, but I ignored Moneyball for years because I’m not really much of a baseball fan. The bottom line is that you don’t need to be a baseball fan to get something (I would argue a lot) out of this book. If you like baseball, you’ll probably like Moneyball; if you like math/stats/science, you’ll probably like Moneyball; if you like business, you’ll probably like Moneyball.

### Related Media:

If you’ve listened to the show for a while or if you’ve been reading the paleocave blog from the beginning (like when we actually used to update it regularly), then you might know that I’m rather fascinated with statistics. Imagine my delight a few years ago when I found out that one of the most powerful statistical tools available (the one that most of the cool kids use) was available for free! That tool is called R.  It’s a great tool but a terrible name.  R is named both for the developers Robert Gentleman and Ross Ihaka (Robert and Ross), and as a sort of pun because it was an open source rewrite of the S language. That’s cool, I guess, but R as a name is horrible search engine optimization. Oh well, keeps out the riff-raff I suppose.

The vast majority of people would call R a programming language. Real computer programmers (the kind of people that argue about Ruby vs Perl) will tell you it’s not really a ‘language,’ it’s a ‘programming environment.’ Whatever, I don’t think I really know the difference.  Don’t get intimidated, because it’s pretty easy to do as much or as little as you want in R.

## Post-Halloween Special! You’ll be Metrified!

Why hello there Paleo-Posse! Long-time, no-see!

It’s Jacob here! Back from an extended vacation from the blog. In case you missed it, I got married, went on a not-long-enough honeymoon, and now I’m back to the grind. Did you miss me? I missed you!

Thankfully the Caribbean is full of fruity drinks that helped me keep the urge to supply SCIENCE to the Paleo-Posse in check, at least while I was on my honeymoon.

## Design of Experiment: AT&T vs. Verizon 3G Networks

Hey there PaleoPals.  Today’s article is a bit unorthodox, as it’s actually a quick summary of an experiment that I did with a friend for one of my graduate classes recently.  If you’re interested in engineering, statistics, testing, and scientific journals, you should enjoy it.  If none of those things interest you, I promise there will be a prize at the end if you read all the way through.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In engineering, there exists a field of study known as Systems Engineering, which deals with managing extremely complicated systems (like a rocket or jet fighter) which comprise many different disciplines, and producing a product with the ease and efficiency of building a Lego set.

Inside of Systems Engineering, there is a study known as Design of Experiments (DoE), which deals with how to effectively “design an experiment” (engineers aren’t very creative, lexicologically.  Yes, that’s a word.  I just made it up.)  DoE is an extremely important tool for complex projects where many thousands of things may need to be tested at once. You want to design the test so that the important results are readily apparent, without having to dig too far through the data for the answer you need, and DoE allows you to do that.
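The core DoE idea of testing every combination of factor levels can be sketched in a few lines of R with expand.grid(). The carriers come from the experiment described here; the other factors and their levels are hypothetical stand-ins for a real test plan:

```
# Sketch of a two-level full-factorial design using base R's expand.grid().
# "carrier" reflects the experiment in the post; "location" and "time"
# are hypothetical factors for illustration.
design <- expand.grid(
  carrier  = c("AT&T", "Verizon"),
  location = c("indoors", "outdoors"),
  time     = c("peak", "off-peak")
)

design        # 2 x 2 x 2 = 8 runs, one per combination of factor levels
nrow(design)  # 8
```

Each row of the design matrix is one experimental run, so every factor combination gets tested exactly once.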

So, in my DoE course for my Systems Engineering Master’s degree, I decided to design an experiment to test the 3G networks of AT&T and Verizon.  The results may surprise you…

## Randomness

You’re on a road trip with your buddies and your iPod is the one churning out the tunes, when a Captain and Tennille song comes on.  The guy riding shotgun shoots you a look and skips to the next song.  After that one finishes, there they are again, and “Do that to Me One More Time” is killing the good vibes in the car. Skip.  Then, of all things, “Muskrat Love” comes on? “Dude, how much Captain and Tennille do you have on your iPod?”  Turns out you only have a greatest hits album, and you have something like 8 days’ worth of other music on that Jobsian box.  Clearly the iPod can’t generate a random playlist worth a damn.

Noise, chance, whatever you want to call it.  People often misunderstand what randomness actually is.  Randomness is a sequence of things such that there is no intelligible pattern or combination.  The problem is we expect that random means things will be evenly spaced out or that there will be “no coincidences.”  In fact, one of the ways you know you are looking at randomness is that there will be repetition or related items in the list.
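You can see this with a quick simulation. The sketch below shuffles a hypothetical playlist (the artist counts are made up: 10 artists with 20 songs each) and counts how often two songs by the same artist land back to back. In a truly random shuffle, such runs are the norm, not a glitch:

```
# Sketch: shuffle a hypothetical playlist and count how often two songs
# by the same artist end up back to back. With 10 artists contributing
# 20 songs each, adjacent repeats are expected, not evidence of bias.
set.seed(1)
playlist <- rep(paste0("artist", 1:10), each = 20)

adjacent_repeats <- function(order) {
  sum(head(order, -1) == tail(order, -1))
}

repeats <- replicate(1000, adjacent_repeats(sample(playlist)))
mean(repeats)       # averages around 19 adjacent same-artist pairs per shuffle
mean(repeats == 0)  # a shuffle with zero repeats essentially never happens
```

So a shuffle that "feels" random to us, with no artist twice in a row, would actually be statistical evidence of tampering.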