Update: @hadlywickham tweeted to me to let me know that daylight savings time was the culprit. Though this explains the behavior I document in the first part of this post, the behavior of the cut function using truncated dates (discussed further down the post) is still unexplained.
I recently encountered some strange behavior from R when using the cut.POSIXt method with “day” as the interval specification. This function isn’t working as I intended and I doubt that it is working properly. I’ll show you the behavior I’m seeing (and what I was expecting) then I’ll show you my current base R workaround. To generate a reproducible example, I’ll use this latemail function I gleaned from this stack overflow post.
latemail <- function(N, st="2013/01/01", et="2013/12/31") { st <- as.POSIXct(as.Date(st)) et <- as.POSIXct(as.Date(et)) dt <- as.numeric(difftime(et,st,unit="sec")) ev <- sort(runif(N, 0, dt)) rt <- st + ev }
And generate some data…
set.seed(7110) #generate 1000 random POSIXlt dates and times bar<-data.frame("date"=latemail(1000, st="2013/03/02", et="2013/03/30")) # assign factors based on the day portion of the POSIXlt object bar$dateCut <- cut(bar$date, "day", labels = FALSE)
I expected that all rows with the date 2013-03-01 would receive factor 1, all rows with the date 2013-03-02 would receive factor 2, and so on. At first glance this seems to be what is happening.
head(bar, 10)
date dateCut 1 2013-03-01 19:10:31 1 2 2013-03-01 19:31:31 1 3 2013-03-01 19:55:02 1 4 2013-03-01 20:09:36 1 5 2013-03-01 20:13:32 1 6 2013-03-01 22:15:42 1 7 2013-03-01 22:16:06 1 8 2013-03-01 23:41:50 1 9 2013-03-02 00:30:53 2 10 2013-03-02 01:08:52 2
Note that at row 9 the date changes from March 1 to March 2 and the factor (dateCut) changes from 1 to 2. So far so good. But we shall see some strange things in the midnight hour.
For additional locations where I see the expected behavior you can check
bar[ c(259, 260, 294, 295), ]
259 2013-03-08 23:22:15 8 260 2013-03-09 00:11:08 9
294 2013-03-09 23:59:11 9 295 2013-03-10 00:56:19 10
Now the weirdness.
bar[320:326, ] 320 2013-03-10 22:14:22 10 321 2013-03-10 22:28:03 10 322 2013-03-11 00:08:27 10 323 2013-03-11 00:30:08 10 324 2013-03-11 00:56:23 10 325 2013-03-11 01:19:54 11 326 2013-03-11 01:22:43 11
At row 322 the date changes from March 10 to March 11 but the dateCut factor doesn’t change until line 325. After 1:00 AM things seem to behave as expected. At first I thought maybe some sort of floor rounding was going on which was rounding midnight back to the previous day, but notice that the previous examples included times between midnight and 1:00 that were cut as expected. More weirdness examples:
bar[398:405,] 398 2013-03-12 23:56:20 12 399 2013-03-13 00:53:47 12 400 2013-03-13 01:30:33 13 401 2013-03-13 01:45:31 13
bar[430:435,] 430 2013-03-13 23:45:48 13 431 2013-03-14 00:28:40 13 432 2013-03-14 00:46:24 13 433 2013-03-14 00:55:16 13 434 2013-03-14 01:33:19 14 435 2013-03-14 02:02:45 14
I see even stranger behavior when I truncate to just the date.
bar$datetrunc=trunc(bar$date, "day")
bar$truncCut <- cut(bar$datetrunc, "day", labels = FALSE)
Again, things work fine for a while
head(bar, 10) date dateCut datetrunc truncCut 1 2013-03-01 19:10:31 1 2013-03-01 1 2 2013-03-01 19:31:31 1 2013-03-01 1 3 2013-03-01 19:55:02 1 2013-03-01 1 4 2013-03-01 20:09:36 1 2013-03-01 1 5 2013-03-01 20:13:32 1 2013-03-01 1 6 2013-03-01 22:15:42 1 2013-03-01 1 7 2013-03-01 22:16:06 1 2013-03-01 1 8 2013-03-01 23:41:50 1 2013-03-01 1 9 2013-03-02 00:30:53 2 2013-03-02 2 10 2013-03-02 01:08:52 2 2013-03-02 2
But eventually wind up worse than ever.
bar[320:330,] date dateCut datetrunc truncCut 320 2013-03-10 22:14:22 10 2013-03-10 10 321 2013-03-10 22:28:03 10 2013-03-10 10 322 2013-03-11 00:08:27 10 2013-03-11 10 323 2013-03-11 00:30:08 10 2013-03-11 10 324 2013-03-11 00:56:23 10 2013-03-11 10 325 2013-03-11 01:19:54 11 2013-03-11 10 326 2013-03-11 01:22:43 11 2013-03-11 10 327 2013-03-11 02:29:34 11 2013-03-11 10 328 2013-03-11 02:34:23 11 2013-03-11 10 329 2013-03-11 02:51:47 11 2013-03-11 10 330 2013-03-11 03:11:00 11 2013-03-11 10
The timeCut factor changes 3 rows too late but the truncCut factor stays stuck at 10 for a long time (47 rows). At row 369, the timeCut factor changes to 12 (correctly) and the truncCut factor finally turns over to 11.
bar[365:375,]
date dateCut datetrunc truncCut
365 2013-03-11 19:49:05 11 2013-03-11 10
366 2013-03-11 21:19:31 11 2013-03-11 10
367 2013-03-11 21:31:58 11 2013-03-11 10
368 2013-03-11 22:06:44 11 2013-03-11 10
369 2013-03-12 02:45:14 12 2013-03-12 11
370 2013-03-12 03:14:56 12 2013-03-12 11
371 2013-03-12 04:02:03 12 2013-03-12 11
372 2013-03-12 05:12:03 12 2013-03-12 11
373 2013-03-12 05:31:53 12 2013-03-12 11
374 2013-03-12 05:56:08 12 2013-03-12 11
375 2013-03-12 06:40:45 12 2013-03-12 11
My initial sidestep involved the rank() function (it achieved the desired result, but was S L O W). I won’t torture you with it here. I consulted with Dr. Erin Hodgess and devised this work around, which is pretty speedy.
foo <- unique(bar$datetrunc) bar$truncMatch <- match(bar$datetrunc, foo)
Here’s that strange section where the truncCut factor behaved so poorly. No problem for my new truncMatch factor.
bar[320:330,] date dateCut datetrunc truncCut truncMatch 320 2013-03-10 22:14:22 10 2013-03-10 10 10 321 2013-03-10 22:28:03 10 2013-03-10 10 10 322 2013-03-11 00:08:27 10 2013-03-11 10 11 323 2013-03-11 00:30:08 10 2013-03-11 10 11 324 2013-03-11 00:56:23 10 2013-03-11 10 11 325 2013-03-11 01:19:54 11 2013-03-11 10 11 326 2013-03-11 01:22:43 11 2013-03-11 10 11 327 2013-03-11 02:29:34 11 2013-03-11 10 11 328 2013-03-11 02:34:23 11 2013-03-11 10 11 329 2013-03-11 02:51:47 11 2013-03-11 10 11 330 2013-03-11 03:11:00 11 2013-03-11 10 11
Welcome to the world of global time. POSIXct is internally represented as GMT, but displayed as local time, and converting from Date to a POSIXt type always occurs in GMT. You should convert to character first if you plan to use Date with local time zones… actually I tend to avoid the Date type entirely if I am working with POSIXt. I also make a habit of setting TZ explicitly to the time zone (Sys.setenv(TZ=”Etc/GMT+5″)) that applies to my data rather than letting the POSIXt functions use the default time zone of the analysis computer which may only be accidentally related to the time zone of the data.
I cannot reproduce your results.
I suppose you are using and old R version.
After setting the same seed “set.seed(7110)”:
> bar[320:326, ]
date dateCut
320 2013-03-11 03:14:22 10
321 2013-03-11 03:28:03 10
322 2013-03-11 05:08:27 10
323 2013-03-11 05:30:08 10
324 2013-03-11 05:56:23 10
325 2013-03-11 06:19:54 10
326 2013-03-11 06:22:43 10
Can you try this?
> bar[cumsum(rle(bar$dateCut)$lengths), ]
date dateCut
36 2013-03-02 23:37:33 1
85 2013-03-03 22:02:47 2
120 2013-03-04 23:34:06 3
154 2013-03-05 23:38:17 4
185 2013-03-06 23:56:31 5
221 2013-03-07 23:04:29 6
255 2013-03-08 23:24:12 7
285 2013-03-09 20:42:51 8
322 2013-03-10 23:17:27 9
360 2013-03-11 23:50:14 10
406 2013-03-12 23:28:04 11
442 2013-03-13 23:59:30 12
468 2013-03-14 23:58:19 13
505 2013-03-15 23:06:13 14
538 2013-03-16 23:29:41 15
578 2013-03-17 22:14:15 16
613 2013-03-18 23:05:24 17
647 2013-03-19 23:38:36 18
676 2013-03-20 22:41:55 19
715 2013-03-21 23:07:41 20
746 2013-03-22 21:38:00 21
787 2013-03-23 23:44:49 22
821 2013-03-24 20:37:02 23
861 2013-03-25 22:45:50 24
900 2013-03-26 23:35:38 25
935 2013-03-27 23:36:08 26
964 2013-03-28 23:15:15 27
998 2013-03-29 23:36:15 28
1000 2013-03-30 00:56:12 29
> library(lubridate)
> all(mday(bar$date)==bar$dateCut+1)
[1] TRUE
> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=Spanish_Spain.1252 LC_CTYPE=Spanish_Spain.1252 LC_MONETARY=Spanish_Spain.1252 LC_NUMERIC=C
[5] LC_TIME=Spanish_Spain.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] lubridate_1.3.3
This definitely is related to daylight saving time as Hadley says.
But for me the difference doesn’t happen on ‘cut’ or ‘mday’, just happens on ‘factor’:
> set.seed(7110)
> bar bar$dateCut
> bar$datetrunc bar$truncCut
> library(lubridate)
> bar$mdayCut all(bar$mdayCut==bar$dateCut)
[1] TRUE
> bar$factorCut bar[bar$factorCut!=bar$dateCut, ]
date dateCut datetrunc truncCut mdayCut factorCut
32 2013-03-03 00:00:03 2 2013-03-03 2 2 1
72 2013-03-04 00:33:03 3 2013-03-04 3 3 2
73 2013-03-04 00:46:06 3 2013-03-04 3 3 2
106 2013-03-05 00:02:59 4 2013-03-05 4 4 3
107 2013-03-05 00:06:21 4 2013-03-05 4 4 3
108 2013-03-05 00:52:46 4 2013-03-05 4 4 3
138 2013-03-06 00:11:27 5 2013-03-06 5 5 4
139 2013-03-06 00:33:16 5 2013-03-06 5 5 4
140 2013-03-06 00:35:49 5 2013-03-06 5 5 4
141 2013-03-06 00:48:13 5 2013-03-06 5 5 4
142 2013-03-06 00:50:46 5 2013-03-06 5 5 4
184 2013-03-07 00:29:31 6 2013-03-07 6 6 5
218 2013-03-08 00:39:27 7 2013-03-08 7 7 6
285 2013-03-10 00:08:15 9 2013-03-10 9 9 8
312 2013-03-11 00:08:20 10 2013-03-11 10 10 9
313 2013-03-11 00:57:46 10 2013-03-11 10 10 9
365 2013-03-12 00:49:05 11 2013-03-12 11 11 10
427 2013-03-14 00:04:02 13 2013-03-14 13 13 12
457 2013-03-15 00:18:23 14 2013-03-15 14 14 13
458 2013-03-15 00:36:24 14 2013-03-15 14 14 13
459 2013-03-15 00:46:44 14 2013-03-15 14 14 13
482 2013-03-16 00:23:54 15 2013-03-16 15 15 14
525 2013-03-17 00:20:19 16 2013-03-17 16 16 15
526 2013-03-17 00:25:21 16 2013-03-17 16 16 15
562 2013-03-18 00:13:14 17 2013-03-18 17 17 16
598 2013-03-19 00:17:50 18 2013-03-19 18 18 17
599 2013-03-19 00:23:36 18 2013-03-19 18 18 17
720 2013-03-22 00:56:46 21 2013-03-22 21 21 20
753 2013-03-23 00:51:46 22 2013-03-23 22 22 21
784 2013-03-24 00:15:04 23 2013-03-24 23 23 22
785 2013-03-24 00:27:31 23 2013-03-24 23 23 22
817 2013-03-25 00:14:23 24 2013-03-25 24 24 23
845 2013-03-26 00:08:52 25 2013-03-26 25 25 24
846 2013-03-26 00:16:30 25 2013-03-26 25 25 24
847 2013-03-26 00:26:28 25 2013-03-26 25 25 24
848 2013-03-26 00:28:59 25 2013-03-26 25 25 24
849 2013-03-26 00:54:22 25 2013-03-26 25 25 24
889 2013-03-27 00:01:35 26 2013-03-27 26 26 25
890 2013-03-27 00:22:36 26 2013-03-27 26 26 25
891 2013-03-27 00:34:47 26 2013-03-27 26 26 25
927 2013-03-28 00:58:03 27 2013-03-28 27 27 26
967 2013-03-29 00:33:26 28 2013-03-29 28 28 27
998 2013-03-30 00:26:13 29 2013-03-30 29 29 28
999 2013-03-30 00:56:36 29 2013-03-30 29 29 28
1000 2013-03-30 00:56:42 29 2013-03-30 29 29 28
(This is a repost. It seems that your webpage doesn’t like the “minus than” symbol and escapes the rest of the line.
I have changed the symbol in my source code to use “=” instead.)
This definitely is related to daylight saving time as Hadley says.
But for me the difference doesn’t happen on ‘cut’ or ‘mday’, just happens on ‘factor’:
> set.seed(7110)
> bar=data.frame(“date”=latemail(1000, st=”2013/03/02″, et=”2013/03/30”))
> bar$dateCut = cut(bar$date, “day”, labels = FALSE)
>
> bar$datetrunc = trunc(bar$date, “day”)
> bar$truncCut = cut(bar$datetrunc, “day”, labels=FALSE)
>
> library(lubridate)
> bar$mdayCut = mday(bar$date) -1
> all(bar$mdayCut==bar$dateCut)
[1] TRUE
> bar$factorCut = as.numeric(factor(as.Date(bar$date)))
> bar[bar$factorCut!=bar$dateCut, ]
date dateCut datetrunc truncCut mdayCut factorCut
32 2013-03-03 00:00:03 2 2013-03-03 2 2 1
72 2013-03-04 00:33:03 3 2013-03-04 3 3 2
73 2013-03-04 00:46:06 3 2013-03-04 3 3 2
106 2013-03-05 00:02:59 4 2013-03-05 4 4 3
107 2013-03-05 00:06:21 4 2013-03-05 4 4 3
108 2013-03-05 00:52:46 4 2013-03-05 4 4 3
138 2013-03-06 00:11:27 5 2013-03-06 5 5 4
139 2013-03-06 00:33:16 5 2013-03-06 5 5 4
140 2013-03-06 00:35:49 5 2013-03-06 5 5 4
141 2013-03-06 00:48:13 5 2013-03-06 5 5 4
142 2013-03-06 00:50:46 5 2013-03-06 5 5 4
184 2013-03-07 00:29:31 6 2013-03-07 6 6 5
218 2013-03-08 00:39:27 7 2013-03-08 7 7 6
285 2013-03-10 00:08:15 9 2013-03-10 9 9 8
312 2013-03-11 00:08:20 10 2013-03-11 10 10 9
313 2013-03-11 00:57:46 10 2013-03-11 10 10 9
365 2013-03-12 00:49:05 11 2013-03-12 11 11 10
427 2013-03-14 00:04:02 13 2013-03-14 13 13 12
457 2013-03-15 00:18:23 14 2013-03-15 14 14 13
458 2013-03-15 00:36:24 14 2013-03-15 14 14 13
459 2013-03-15 00:46:44 14 2013-03-15 14 14 13
482 2013-03-16 00:23:54 15 2013-03-16 15 15 14
525 2013-03-17 00:20:19 16 2013-03-17 16 16 15
526 2013-03-17 00:25:21 16 2013-03-17 16 16 15
562 2013-03-18 00:13:14 17 2013-03-18 17 17 16
598 2013-03-19 00:17:50 18 2013-03-19 18 18 17
599 2013-03-19 00:23:36 18 2013-03-19 18 18 17
720 2013-03-22 00:56:46 21 2013-03-22 21 21 20
753 2013-03-23 00:51:46 22 2013-03-23 22 22 21
784 2013-03-24 00:15:04 23 2013-03-24 23 23 22
785 2013-03-24 00:27:31 23 2013-03-24 23 23 22
817 2013-03-25 00:14:23 24 2013-03-25 24 24 23
845 2013-03-26 00:08:52 25 2013-03-26 25 25 24
846 2013-03-26 00:16:30 25 2013-03-26 25 25 24
847 2013-03-26 00:26:28 25 2013-03-26 25 25 24
848 2013-03-26 00:28:59 25 2013-03-26 25 25 24
849 2013-03-26 00:54:22 25 2013-03-26 25 25 24
889 2013-03-27 00:01:35 26 2013-03-27 26 26 25
890 2013-03-27 00:22:36 26 2013-03-27 26 26 25
891 2013-03-27 00:34:47 26 2013-03-27 26 26 25
927 2013-03-28 00:58:03 27 2013-03-28 27 27 26
967 2013-03-29 00:33:26 28 2013-03-29 28 28 27
998 2013-03-30 00:26:13 29 2013-03-30 29 29 28
999 2013-03-30 00:56:36 29 2013-03-30 29 29 28
1000 2013-03-30 00:56:42 29 2013-03-30 29 29 28
Do see the results when you turn off daylight savings, e.g., Sys.setenv(TZ=’EST’) or Sys.setenv(TZ=’UTC’)?
You should try cutting with “DSTday” instead of by “day”.