Published 8 August 2018
Sometimes R is special.
Consider the following objects[1]:
time_strings <- paste0("1997-10-05 ", 1:5, ":00")
strptime(time_strings, format = "%Y-%m-%d %H:%M")
> [1] "1997-10-05 01:00:00 NZST" "1997-10-05 02:00:00"
> [3] "1997-10-05 03:00:00 NZDT" "1997-10-05 04:00:00 NZDT"
> [5] "1997-10-05 05:00:00 NZDT"
You might notice something interesting about the second item in that list: it has no time zone. In fact, it should not exist: the 5th of October 1997 is the date on which daylight saving time begins in New Zealand. The clock jumps from 1:59am to 3:00am, skipping the hour of 2:00am entirely. 2:00am on the 5th of October 1997 is a non-time.
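As a rough check of that jump, the sketch below does the arithmetic on an absolute timestamp instead of a wall-clock string. It assumes the IANA zone name "Pacific/Auckland" is available on your system (the other examples in this post rely on the session's time zone), and the variable names are only for illustration.

```r
# Assumption: "Pacific/Auckland" is a valid time zone name on this system.
tz <- "Pacific/Auckland"

# One minute before daylight saving time begins
before_gap <- as.POSIXct("1997-10-05 01:59", format = "%Y-%m-%d %H:%M", tz = tz)

# Add 60 seconds of real elapsed time
after_gap <- before_gap + 60

# Wall-clock readings on either side of the transition
format(before_gap, "%H:%M")
format(after_gap, "%H:%M")
```

Sixty seconds after 01:59 the wall clock should read 03:00, so no instant on that day ever displays as 02:00.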
This limbo time has some special properties. For example:
is.na(strptime(time_strings, format = "%Y-%m-%d %H:%M"))
> [1] FALSE TRUE FALSE FALSE FALSE
R printed a value for every one of these times, so why does it think that our nonexistent time is NA? And why do we get a time for this value at all?
If we go about creating this date in other ways, we'll get an actual NA:
as.POSIXct("1997-10-05 2:00", format="%Y-%m-%d %H:%M")
> [1] NA
This weird effect is due to the way R stores times internally, and - more importantly - due to the fact that it can store them in one of two different ways.
A tale of two date-times
R has two ways of storing date-times: the POSIXct class and the POSIXlt class. If you don't care which is which, you can spend about 98% of your programming time ignoring the difference:
as.POSIXlt("2018-07-01 15:00", format = "%Y-%m-%d %H:%M")
> [1] "2018-07-01 15:00:00 NZST"
as.POSIXct("2018-07-01 15:00", format = "%Y-%m-%d %H:%M")
> [1] "2018-07-01 15:00:00 NZST"
The difference between the two classes is mainly in how they're stored. You can see this if you run unclass() on the two objects:
lt_date <- as.POSIXlt("2018-07-01 15:00", format = "%Y-%m-%d %H:%M")
ct_date <- as.POSIXct("2018-07-01 15:00", format = "%Y-%m-%d %H:%M")
unclass(ct_date)
> [1] 1530414000
> attr(,"tzone")
> [1] ""
unclass(lt_date)
> $sec
> [1] 0
>
> $min
> [1] 0
>
> $hour
> [1] 15
>
> $mday
> [1] 1
>
> $mon
> [1] 6
>
> $year
> [1] 118
>
> $wday
> [1] 0
>
> $yday
> [1] 181
>
> $isdst
> [1] 0
>
> $zone
> [1] "NZST"
>
> $gmtoff
> [1] NA
The POSIXct class is a "calendar time", and is stored as a number of seconds since the Unix Epoch (midnight GMT on Thursday January 1, 1970). The POSIXlt class, however, is a "local time", and is stored as a list of quantities. Once we know this, the strange behaviour we saw at the start of this post starts to make sense.
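To make the two storage models concrete, here is a small sketch of how each one is typically used. It pins the zone to UTC so the numbers don't depend on the session's time zone; 03:00 UTC is the same instant as the 15:00 NZST used above.

```r
# Pin the zone to UTC so the result doesn't depend on the session's time zone
ct <- as.POSIXct("2018-07-01 03:00", format = "%Y-%m-%d %H:%M", tz = "UTC")
lt <- as.POSIXlt(ct)

# POSIXct: a single number, so arithmetic is done in seconds
as.numeric(ct)         # 1530414000
as.numeric(ct + 3600)  # one hour later: 1530417600

# POSIXlt: named fields, with a couple of offsets to remember
lt$hour          # hour of day: 3
lt$mon + 1       # months count from 0, so July is stored as 6
lt$year + 1900   # years count from 1900, so 2018 is stored as 118
```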
R's help tells us that the strptime() function "converts character vectors to class 'POSIXlt'". A POSIXlt object has no issue with storing times that don't exist, like 2:00am on the 5th of October 1997. It's just a collection of numbers.
If we try to make a POSIXct with an invalid time, though, things are trickier. POSIXct values are stored as the number of seconds since the Unix Epoch: if you're referencing a time that literally doesn't exist, you can't actually express it.
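A minimal sketch of that contrast, assuming the zone name "Pacific/Auckland" is available (the variable names are hypothetical):

```r
# Assumption: "Pacific/Auckland" is a valid time zone name on this system.
tz <- "Pacific/Auckland"

# strptime() returns a POSIXlt, which simply records the parsed fields
lt_gap <- strptime("1997-10-05 02:00", format = "%Y-%m-%d %H:%M", tz = tz)
lt_gap$hour
lt_gap$min

# Converting to POSIXct forces R to look for a matching instant in time
ct_gap <- as.POSIXct(lt_gap)
is.na(ct_gap)
```

On the system used for this post the conversion gives NA. How nonexistent times are handled may differ between platforms and R versions, so treat the last line as something to check rather than rely on.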
So why does is.na() return TRUE for invalid POSIXlt dates?
This mystery has a slightly obscure - but not particularly tricky - solution.
JavaScript has "truthy" and "falsy" values: values that are not true or false themselves, but are treated as true or false within logical tests:
if (!0) { console.log("The number zero is falsy") }
if ("z") { console.log("The string 'z' is truthy") }
if ([]) { console.log("An empty array is truthy") }
R is a lot more rigorous about what is "truthy" and what is "falsey", although it still has a few annoying edge cases:
if (0 == FALSE){ message("0 is still falsey...")}
But it also has a third boolean option: NA. And I guess that means things can be considered "NA-y"?
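As a quick illustration of that third option, here is how NA behaves in base R's logical operators. NA stands for "unknown", so the result is only definite when the other operand settles it.

```r
# The other operand decides the answer, so the unknown doesn't matter
NA & FALSE   # FALSE
NA | TRUE    # TRUE

# The answer depends on the unknown value, so it stays unknown
NA & TRUE    # NA
NA == TRUE   # NA

# if() needs a definite TRUE or FALSE, so guard with isTRUE()
x <- NA
if (isTRUE(x)) message("definitely TRUE") else message("FALSE or unknown")
```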
Very few things are NA-y in R. The most notable are the value NA and its typed variants, along with the value NaN, representing "not a number":
lapply(list(NA_character_, NA_complex_, NA_integer_, NA_real_, NaN), function(x) { is.na(x) })
> [[1]]
> [1] TRUE
>
> [[2]]
> [1] TRUE
>
> [[3]]
> [1] TRUE
>
> [[4]]
> [1] TRUE
>
> [[5]]
> [1] TRUE
The internals of is.na() are, unfortunately, hidden from the average R developer:
is.na
> function (x) .Primitive("is.na")
But not so for POSIXlt objects:
is.na.POSIXlt
> function (x)
> is.na(as.POSIXct(x))
> <bytecode: 0x108c15e78>
> <environment: namespace:base>
Here we find why our timeless POSIXlt, even though it's a valid list of values, returns TRUE from is.na(). The first thing this function does is convert it to a POSIXct, which (of course) immediately becomes NA. Another weird mystery of R's internals solved.
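One way to see this for yourself is to compare the individual fields with the result of the conversion. This is only a sketch: it assumes the zone name "Pacific/Auckland" is available, and `core_fields` is a hypothetical helper vector naming the fields we care about.

```r
# Assumption: "Pacific/Auckland" is a valid time zone name on this system.
tz <- "Pacific/Auckland"
lt_gap <- strptime("1997-10-05 02:00", format = "%Y-%m-%d %H:%M", tz = tz)

# None of the stored date and time fields is itself NA
core_fields <- c("sec", "min", "hour", "mday", "mon", "year")
field_is_na <- vapply(unclass(lt_gap)[core_fields], is.na, logical(1))
field_is_na

# The answer from is.na() comes from the POSIXct conversion instead
is.na(lt_gap)
is.na(as.POSIXct(lt_gap))
```

The last two lines should agree with each other on a version of R where is.na.POSIXlt is defined as shown above; the definition may differ in other versions.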
So which date should I use?
The general rule of thumb, according to the internet, seems to be:
- If you need to extract bits of dates (months, days, years, what-have-you), use POSIXlt.
- Otherwise (and if you can help it) use POSIXct.
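In practice the two rules combine quite naturally. The sketch below (with a made-up `events` data frame, pinned to UTC for reproducibility) keeps POSIXct in storage and converts to POSIXlt only at the moment a component is needed.

```r
# Hypothetical data: store the timestamps as POSIXct
when <- as.POSIXct(c("2018-07-01 15:00", "2018-12-25 09:30"),
                   format = "%Y-%m-%d %H:%M", tz = "UTC")
events <- data.frame(id = 1:2, when = when)

# Convert to POSIXlt only when you need to pull out components
lt <- as.POSIXlt(events$when, tz = "UTC")
lt$mon + 1   # month numbers: 7 12
lt$mday      # days of the month: 1 25

# Differences and comparisons work directly on the POSIXct column
difftime(events$when[2], events$when[1], units = "days")
events$when[1] < events$when[2]
```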
By being aware of the differences, though, you're probably already better equipped to deal with these sorts of issues than 90% of R programmers.
[1] These examples were created in New Zealand, where systems are set to New Zealand time. You might need to set your environment's time zone with Sys.setenv(TZ="NZ") if you wish to replicate these examples.