Strange behavior from the cut function with dates in R

Update: @hadlywickham tweeted to me to let me know that  daylight savings time was the culprit. Though this explains the behavior I document in the first part of this post, the behavior of the cut function using truncated dates (discussed further down the post) is still unexplained.

I recently encountered some strange behavior from R when using the cut.POSIXt method with “day” as the interval specification. This function isn’t working as I intended and I doubt that it is working properly. I’ll show you the behavior I’m seeing (and what I was expecting) then I’ll show you my current base R workaround. To generate a reproducible example, I’ll use this latemail function I gleaned from this stack overflow post.

latemail <- function(N, st="2013/01/01", et="2013/12/31") {
 st <- as.POSIXct(as.Date(st))
 et <- as.POSIXct(as.Date(et))
 dt <- as.numeric(difftime(et,st,unit="sec"))
 ev <- sort(runif(N, 0, dt))
 rt <- st + ev
 }

And generate some data…


set.seed(7110)
#generate 1000 random POSIXlt dates and times
bar<-data.frame("date"=latemail(1000, st="2013/03/02", et="2013/03/30"))
# assign factors based on the day portion of the POSIXlt object
bar$dateCut <- cut(bar$date, "day", labels = FALSE)

I expected that all rows with the date 2013-03-01 would receive factor 1, all rows with the date 2013-03-02 would receive factor 2, and so on. At first glance this seems to be what is happening.

head(bar, 10)
     date                 dateCut
1    2013-03-01 19:10:31  1
2    2013-03-01 19:31:31  1
3    2013-03-01 19:55:02  1
4    2013-03-01 20:09:36  1
5    2013-03-01 20:13:32  1
6    2013-03-01 22:15:42  1
7    2013-03-01 22:16:06  1
8    2013-03-01 23:41:50  1
9    2013-03-02 00:30:53  2
10   2013-03-02 01:08:52  2

Note that at row 9 the date changes from March 1 to March 2 and the factor (dateCut) changes from 1 to 2. So far so good. But we shall see some strange things in the midnight hour.  

For additional locations where I see the expected behavior you can check

bar[ c(259, 260, 294, 295), ]
259  2013-03-08 23:22:15  8
260  2013-03-09 00:11:08  9
294  2013-03-09 23:59:11  9
295  2013-03-10 00:56:19  10

Now the weirdness.

bar[320:326, ]
320  2013-03-10 22:14:22  10
321  2013-03-10 22:28:03  10
322  2013-03-11 00:08:27  10
323  2013-03-11 00:30:08  10
324  2013-03-11 00:56:23  10
325  2013-03-11 01:19:54  11
326  2013-03-11 01:22:43  11

At row 322 the date changes from March 10 to March 11 but the dateCut factor doesn’t change until line 325. After 1:00 AM things seem to behave as expected. At first I thought maybe some sort of floor rounding was going on which was rounding midnight back to the previous day, but notice that the previous examples included times between midnight and 1:00 that were cut as expected. More weirdness examples:

bar[398:405,]
398  2013-03-12 23:56:20  12
399  2013-03-13 00:53:47  12
400  2013-03-13 01:30:33  13
401  2013-03-13 01:45:31  13
bar[430:435,]
430  2013-03-13 23:45:48  13
431  2013-03-14 00:28:40  13
432  2013-03-14 00:46:24  13
433  2013-03-14 00:55:16  13
434  2013-03-14 01:33:19  14
435  2013-03-14 02:02:45  14

I see even stranger behavior when I truncate to just the date.

bar$datetrunc=trunc(bar$date, "day")  
bar$truncCut <- cut(bar$datetrunc, "day", labels = FALSE) 

Again, things work fine for a while

head(bar, 10)
   date             dateCut datetrunc truncCut
1  2013-03-01 19:10:31 1   2013-03-01  1
2  2013-03-01 19:31:31 1   2013-03-01  1
3  2013-03-01 19:55:02 1   2013-03-01  1
4  2013-03-01 20:09:36 1   2013-03-01  1
5  2013-03-01 20:13:32 1   2013-03-01  1
6  2013-03-01 22:15:42 1   2013-03-01  1
7  2013-03-01 22:16:06 1   2013-03-01  1
8  2013-03-01 23:41:50 1   2013-03-01  1
9  2013-03-02 00:30:53 2   2013-03-02  2
10 2013-03-02 01:08:52 2   2013-03-02  2

But eventually wind up worse than ever.

bar[320:330,]
    date               dateCut datetrunc truncCut
320 2013-03-10 22:14:22  10  2013-03-10  10
321 2013-03-10 22:28:03  10  2013-03-10  10
322 2013-03-11 00:08:27  10  2013-03-11  10
323 2013-03-11 00:30:08  10  2013-03-11  10
324 2013-03-11 00:56:23  10  2013-03-11  10
325 2013-03-11 01:19:54  11  2013-03-11  10
326 2013-03-11 01:22:43  11  2013-03-11  10
327 2013-03-11 02:29:34  11  2013-03-11  10
328 2013-03-11 02:34:23  11  2013-03-11  10
329 2013-03-11 02:51:47  11  2013-03-11  10
330 2013-03-11 03:11:00  11  2013-03-11  10

The timeCut factor changes 3 rows too late but the truncCut factor stays stuck at 10 for a long time (47 rows). At row 369, the timeCut factor changes to 12 (correctly) and the truncCut factor finally turns over to 11.

bar[365:375,]
    date              dateCut datetrunc truncCut
365 2013-03-11 19:49:05  11  2013-03-11  10
366 2013-03-11 21:19:31  11  2013-03-11  10
367 2013-03-11 21:31:58  11  2013-03-11  10
368 2013-03-11 22:06:44  11  2013-03-11  10
369 2013-03-12 02:45:14  12  2013-03-12  11
370 2013-03-12 03:14:56  12  2013-03-12  11
371 2013-03-12 04:02:03  12  2013-03-12  11
372 2013-03-12 05:12:03  12  2013-03-12  11
373 2013-03-12 05:31:53  12  2013-03-12  11
374 2013-03-12 05:56:08  12  2013-03-12  11
375 2013-03-12 06:40:45  12  2013-03-12  11

My initial sidestep involved the rank() function (it achieved the desired result, but was S L O W). I won’t torture you with it here. I consulted with Dr. Erin Hodgess and devised this work around, which is pretty speedy.

foo <- unique(bar$datetrunc)
bar$truncMatch <- match(bar$datetrunc, foo)

Here’s that strange section where the truncCut factor behaved so poorly. No problem for my new truncMatch factor.

 

bar[320:330,]
    date            dateCut datetrunc truncCut truncMatch
320 2013-03-10 22:14:22  10  2013-03-10  10   10
321 2013-03-10 22:28:03  10  2013-03-10  10   10
322 2013-03-11 00:08:27  10  2013-03-11  10   11
323 2013-03-11 00:30:08  10  2013-03-11  10   11
324 2013-03-11 00:56:23  10  2013-03-11  10   11
325 2013-03-11 01:19:54  11  2013-03-11  10   11
326 2013-03-11 01:22:43  11  2013-03-11  10   11
327 2013-03-11 02:29:34  11  2013-03-11  10   11
328 2013-03-11 02:34:23  11  2013-03-11  10   11
329 2013-03-11 02:51:47  11  2013-03-11  10   11
330 2013-03-11 03:11:00  11  2013-03-11  10   11


 

Share

About Patrick

I'm usually a paleontologist or an isotope geochemist, but I like statistics, math history, markets, and soccer on less technical levels. My posts could involve any or all of the above, or anything else for that matter.

6 thoughts on “Strange behavior from the cut function with dates in R

  1. Welcome to the world of global time. POSIXct is internally represented as GMT, but displayed as local time, and converting from Date to a POSIXt type always occurs in GMT. You should convert to character first if you plan to use Date with local time zones… actually I tend to avoid the Date type entirely if I am working with POSIXt. I also make a habit of setting TZ explicitly to the time zone (Sys.setenv(TZ=”Etc/GMT+5″)) that applies to my data rather than letting the POSIXt functions use the default time zone of the analysis computer which may only be accidentally related to the time zone of the data.

  2. I cannot reproduce your results.
    I suppose you are using and old R version.

    After setting the same seed “set.seed(7110)”:
    > bar[320:326, ]
    date dateCut
    320 2013-03-11 03:14:22 10
    321 2013-03-11 03:28:03 10
    322 2013-03-11 05:08:27 10
    323 2013-03-11 05:30:08 10
    324 2013-03-11 05:56:23 10
    325 2013-03-11 06:19:54 10
    326 2013-03-11 06:22:43 10

    Can you try this?

    > bar[cumsum(rle(bar$dateCut)$lengths), ]
    date dateCut
    36 2013-03-02 23:37:33 1
    85 2013-03-03 22:02:47 2
    120 2013-03-04 23:34:06 3
    154 2013-03-05 23:38:17 4
    185 2013-03-06 23:56:31 5
    221 2013-03-07 23:04:29 6
    255 2013-03-08 23:24:12 7
    285 2013-03-09 20:42:51 8
    322 2013-03-10 23:17:27 9
    360 2013-03-11 23:50:14 10
    406 2013-03-12 23:28:04 11
    442 2013-03-13 23:59:30 12
    468 2013-03-14 23:58:19 13
    505 2013-03-15 23:06:13 14
    538 2013-03-16 23:29:41 15
    578 2013-03-17 22:14:15 16
    613 2013-03-18 23:05:24 17
    647 2013-03-19 23:38:36 18
    676 2013-03-20 22:41:55 19
    715 2013-03-21 23:07:41 20
    746 2013-03-22 21:38:00 21
    787 2013-03-23 23:44:49 22
    821 2013-03-24 20:37:02 23
    861 2013-03-25 22:45:50 24
    900 2013-03-26 23:35:38 25
    935 2013-03-27 23:36:08 26
    964 2013-03-28 23:15:15 27
    998 2013-03-29 23:36:15 28
    1000 2013-03-30 00:56:12 29
    > library(lubridate)
    > all(mday(bar$date)==bar$dateCut+1)
    [1] TRUE
    > sessionInfo()
    R version 3.1.1 (2014-07-10)
    Platform: x86_64-w64-mingw32/x64 (64-bit)

    locale:
    [1] LC_COLLATE=Spanish_Spain.1252 LC_CTYPE=Spanish_Spain.1252 LC_MONETARY=Spanish_Spain.1252 LC_NUMERIC=C
    [5] LC_TIME=Spanish_Spain.1252

    attached base packages:
    [1] stats graphics grDevices utils datasets methods base

    other attached packages:
    [1] lubridate_1.3.3

  3. This definitely is related to daylight saving time as Hadley says.
    But for me the difference doesn’t happen on ‘cut’ or ‘mday’, just happens on ‘factor’:

    > set.seed(7110)
    > bar bar$dateCut
    > bar$datetrunc bar$truncCut
    > library(lubridate)
    > bar$mdayCut all(bar$mdayCut==bar$dateCut)
    [1] TRUE
    > bar$factorCut bar[bar$factorCut!=bar$dateCut, ]
    date dateCut datetrunc truncCut mdayCut factorCut
    32 2013-03-03 00:00:03 2 2013-03-03 2 2 1
    72 2013-03-04 00:33:03 3 2013-03-04 3 3 2
    73 2013-03-04 00:46:06 3 2013-03-04 3 3 2
    106 2013-03-05 00:02:59 4 2013-03-05 4 4 3
    107 2013-03-05 00:06:21 4 2013-03-05 4 4 3
    108 2013-03-05 00:52:46 4 2013-03-05 4 4 3
    138 2013-03-06 00:11:27 5 2013-03-06 5 5 4
    139 2013-03-06 00:33:16 5 2013-03-06 5 5 4
    140 2013-03-06 00:35:49 5 2013-03-06 5 5 4
    141 2013-03-06 00:48:13 5 2013-03-06 5 5 4
    142 2013-03-06 00:50:46 5 2013-03-06 5 5 4
    184 2013-03-07 00:29:31 6 2013-03-07 6 6 5
    218 2013-03-08 00:39:27 7 2013-03-08 7 7 6
    285 2013-03-10 00:08:15 9 2013-03-10 9 9 8
    312 2013-03-11 00:08:20 10 2013-03-11 10 10 9
    313 2013-03-11 00:57:46 10 2013-03-11 10 10 9
    365 2013-03-12 00:49:05 11 2013-03-12 11 11 10
    427 2013-03-14 00:04:02 13 2013-03-14 13 13 12
    457 2013-03-15 00:18:23 14 2013-03-15 14 14 13
    458 2013-03-15 00:36:24 14 2013-03-15 14 14 13
    459 2013-03-15 00:46:44 14 2013-03-15 14 14 13
    482 2013-03-16 00:23:54 15 2013-03-16 15 15 14
    525 2013-03-17 00:20:19 16 2013-03-17 16 16 15
    526 2013-03-17 00:25:21 16 2013-03-17 16 16 15
    562 2013-03-18 00:13:14 17 2013-03-18 17 17 16
    598 2013-03-19 00:17:50 18 2013-03-19 18 18 17
    599 2013-03-19 00:23:36 18 2013-03-19 18 18 17
    720 2013-03-22 00:56:46 21 2013-03-22 21 21 20
    753 2013-03-23 00:51:46 22 2013-03-23 22 22 21
    784 2013-03-24 00:15:04 23 2013-03-24 23 23 22
    785 2013-03-24 00:27:31 23 2013-03-24 23 23 22
    817 2013-03-25 00:14:23 24 2013-03-25 24 24 23
    845 2013-03-26 00:08:52 25 2013-03-26 25 25 24
    846 2013-03-26 00:16:30 25 2013-03-26 25 25 24
    847 2013-03-26 00:26:28 25 2013-03-26 25 25 24
    848 2013-03-26 00:28:59 25 2013-03-26 25 25 24
    849 2013-03-26 00:54:22 25 2013-03-26 25 25 24
    889 2013-03-27 00:01:35 26 2013-03-27 26 26 25
    890 2013-03-27 00:22:36 26 2013-03-27 26 26 25
    891 2013-03-27 00:34:47 26 2013-03-27 26 26 25
    927 2013-03-28 00:58:03 27 2013-03-28 27 27 26
    967 2013-03-29 00:33:26 28 2013-03-29 28 28 27
    998 2013-03-30 00:26:13 29 2013-03-30 29 29 28
    999 2013-03-30 00:56:36 29 2013-03-30 29 29 28
    1000 2013-03-30 00:56:42 29 2013-03-30 29 29 28

  4. (This is a repost. It seems that your webpage doesn’t like the “minus than” symbol and escapes the rest of the line.
    I have changed the symbol in my source code to use “=” instead.)

    This definitely is related to daylight saving time as Hadley says.
    But for me the difference doesn’t happen on ‘cut’ or ‘mday’, just happens on ‘factor’:

    > set.seed(7110)
    > bar=data.frame(“date”=latemail(1000, st=”2013/03/02″, et=”2013/03/30”))
    > bar$dateCut = cut(bar$date, “day”, labels = FALSE)
    >
    > bar$datetrunc = trunc(bar$date, “day”)
    > bar$truncCut = cut(bar$datetrunc, “day”, labels=FALSE)
    >
    > library(lubridate)
    > bar$mdayCut = mday(bar$date) -1
    > all(bar$mdayCut==bar$dateCut)
    [1] TRUE
    > bar$factorCut = as.numeric(factor(as.Date(bar$date)))
    > bar[bar$factorCut!=bar$dateCut, ]
    date dateCut datetrunc truncCut mdayCut factorCut
    32 2013-03-03 00:00:03 2 2013-03-03 2 2 1
    72 2013-03-04 00:33:03 3 2013-03-04 3 3 2
    73 2013-03-04 00:46:06 3 2013-03-04 3 3 2
    106 2013-03-05 00:02:59 4 2013-03-05 4 4 3
    107 2013-03-05 00:06:21 4 2013-03-05 4 4 3
    108 2013-03-05 00:52:46 4 2013-03-05 4 4 3
    138 2013-03-06 00:11:27 5 2013-03-06 5 5 4
    139 2013-03-06 00:33:16 5 2013-03-06 5 5 4
    140 2013-03-06 00:35:49 5 2013-03-06 5 5 4
    141 2013-03-06 00:48:13 5 2013-03-06 5 5 4
    142 2013-03-06 00:50:46 5 2013-03-06 5 5 4
    184 2013-03-07 00:29:31 6 2013-03-07 6 6 5
    218 2013-03-08 00:39:27 7 2013-03-08 7 7 6
    285 2013-03-10 00:08:15 9 2013-03-10 9 9 8
    312 2013-03-11 00:08:20 10 2013-03-11 10 10 9
    313 2013-03-11 00:57:46 10 2013-03-11 10 10 9
    365 2013-03-12 00:49:05 11 2013-03-12 11 11 10
    427 2013-03-14 00:04:02 13 2013-03-14 13 13 12
    457 2013-03-15 00:18:23 14 2013-03-15 14 14 13
    458 2013-03-15 00:36:24 14 2013-03-15 14 14 13
    459 2013-03-15 00:46:44 14 2013-03-15 14 14 13
    482 2013-03-16 00:23:54 15 2013-03-16 15 15 14
    525 2013-03-17 00:20:19 16 2013-03-17 16 16 15
    526 2013-03-17 00:25:21 16 2013-03-17 16 16 15
    562 2013-03-18 00:13:14 17 2013-03-18 17 17 16
    598 2013-03-19 00:17:50 18 2013-03-19 18 18 17
    599 2013-03-19 00:23:36 18 2013-03-19 18 18 17
    720 2013-03-22 00:56:46 21 2013-03-22 21 21 20
    753 2013-03-23 00:51:46 22 2013-03-23 22 22 21
    784 2013-03-24 00:15:04 23 2013-03-24 23 23 22
    785 2013-03-24 00:27:31 23 2013-03-24 23 23 22
    817 2013-03-25 00:14:23 24 2013-03-25 24 24 23
    845 2013-03-26 00:08:52 25 2013-03-26 25 25 24
    846 2013-03-26 00:16:30 25 2013-03-26 25 25 24
    847 2013-03-26 00:26:28 25 2013-03-26 25 25 24
    848 2013-03-26 00:28:59 25 2013-03-26 25 25 24
    849 2013-03-26 00:54:22 25 2013-03-26 25 25 24
    889 2013-03-27 00:01:35 26 2013-03-27 26 26 25
    890 2013-03-27 00:22:36 26 2013-03-27 26 26 25
    891 2013-03-27 00:34:47 26 2013-03-27 26 26 25
    927 2013-03-28 00:58:03 27 2013-03-28 27 27 26
    967 2013-03-29 00:33:26 28 2013-03-29 28 28 27
    998 2013-03-30 00:26:13 29 2013-03-30 29 29 28
    999 2013-03-30 00:56:36 29 2013-03-30 29 29 28
    1000 2013-03-30 00:56:42 29 2013-03-30 29 29 28

  5. Do see the results when you turn off daylight savings, e.g., Sys.setenv(TZ=’EST’) or Sys.setenv(TZ=’UTC’)?

Leave a Reply to Marcos Cancel reply

Your email address will not be published. Required fields are marked *

*

This site uses Akismet to reduce spam. Learn how your comment data is processed.