When is an NA not an NA?

Published
8 August 2018

Sometimes R is special.

Consider the following objects[1]:

time_strings <- paste0("1997-10-05 ", 1:5, ":00")

strptime(time_strings, format = "%Y-%m-%d %H:%M")

> [1] "1997-10-05 01:00:00 NZST" "1997-10-05 02:00:00"
> [3] "1997-10-05 03:00:00 NZDT" "1997-10-05 04:00:00 NZDT"
> [5] "1997-10-05 05:00:00 NZDT"

You might notice something interesting about the second item in that vector: it has no time zone. In fact, it should not exist: the 5th of October 1997 is the date on which daylight saving time begins in New Zealand. The clock jumps from 1:59am to 3:00am, skipping the hour of 2:00am entirely. 2:00am on the 5th of October 1997 is a non-time.

This limbo time has some special properties. For example:

is.na(strptime(time_strings, format = "%Y-%m-%d %H:%M"))

> [1] FALSE  TRUE FALSE FALSE FALSE

Every element prints as a time, so why does R think that our non-existent time is NA? And if it is NA, why do we get a time for this value at all?

If we go about creating this date in other ways, we'll get an actual NA:

as.POSIXct("1997-10-05 2:00", format="%Y-%m-%d %H:%M")

> [1] NA

This weird effect is due to the way R stores times internally, and - more importantly - due to the fact that it can store them in one of two different ways.

A tale of two date-times

R has two ways of storing date-times: the POSIXct class, and the POSIXlt class. And if you don't care about which is which, you can spend about 98% of your programming time ignoring the difference:

as.POSIXlt("2018-07-01 15:00", format = "%Y-%m-%d %H:%M")

> [1] "2018-07-01 15:00:00 NZST"

as.POSIXct("2018-07-01 15:00", format = "%Y-%m-%d %H:%M")

> [1] "2018-07-01 15:00:00 NZST"

The difference between the two classes is mainly in how they're stored. You can see this if you run unclass() on the two objects:

lt_date <- as.POSIXlt("2018-07-01 15:00", format = "%Y-%m-%d %H:%M")
ct_date <- as.POSIXct("2018-07-01 15:00", format = "%Y-%m-%d %H:%M")

unclass(ct_date)

> [1] 1530414000
> attr(,"tzone")
> [1] ""

unclass(lt_date)

> $sec
> [1] 0
> 
> $min
> [1] 0
> 
> $hour
> [1] 15
> 
> $mday
> [1] 1
> 
> $mon
> [1] 6
> 
> $year
> [1] 118
> 
> $wday
> [1] 0
> 
> $yday
> [1] 181
> 
> $isdst
> [1] 0
> 
> $zone
> [1] "NZST"
> 
> $gmtoff
> [1] NA

The POSIXct class is a "calendar time", and is stored as a number of seconds since the Unix Epoch (midnight GMT on Thursday January 1, 1970). The POSIXlt class, however, is a "local time", and is stored as a list of quantities. Once we know this, the strange behaviour we saw at the start of this post starts to make sense.
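As a minimal sketch of the two storage models, the snippet below uses UTC rather than the post's New Zealand setup so that it does not depend on the local time zone (that choice, and the variable names, are assumptions made for illustration). 03:00 UTC is the same instant as 15:00 NZST, so the number of seconds matches the one shown above.

```r
# POSIXct: one number, the seconds elapsed since 1970-01-01 00:00:00 UTC.
ct <- as.POSIXct("2018-07-01 03:00", tz = "UTC")
secs <- as.numeric(ct)
secs  # 1530414000

# The same instant can be rebuilt from that number alone.
rebuilt <- as.POSIXct(secs, origin = "1970-01-01", tz = "UTC")

# POSIXlt: a list of calendar fields. Years count from 1900, months from 0.
lt <- as.POSIXlt(ct)
fields <- c(year = lt$year + 1900, month = lt$mon + 1,
            day = lt$mday, hour = lt$hour)
fields
```

The offsets on `$year` and `$mon` explain the 118 and 6 in the unclass() output above.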

R's help tells us that the strptime() function "converts character vectors to class 'POSIXlt'". A POSIXlt object has no issue with storing times that don't exist, like 2:00am on the 5th of October 1997. It's just a collection of numbers.

If we try to make a POSIXct with an invalid time, though, things are trickier. POSIXct values are stored as the number of seconds since the Unix Epoch: if you're referencing a time that literally doesn't exist, you can't actually express it.
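One way to see the gap from the POSIXct side is to count seconds across it. This is a sketch that assumes your system's time zone database includes "Pacific/Auckland"; passing `tz` explicitly means it should not depend on the TZ environment variable.

```r
# Assumption: the "Pacific/Auckland" zone is available on this system.
tz <- "Pacific/Auckland"
before <- as.POSIXct("1997-10-05 01:00", tz = tz)
after  <- as.POSIXct("1997-10-05 03:00", tz = tz)

# The wall clocks read two hours apart, but only one hour of seconds elapsed.
gap_seconds <- as.numeric(after) - as.numeric(before)
gap_seconds  # 3600

# Stepping forward 30 minutes at a time never lands on 02:00 or 02:30.
steps <- format(seq(before, by = "30 mins", length.out = 4), "%H:%M")
steps  # "01:00" "01:30" "03:00" "03:30"
```

Every count of seconds maps to a wall-clock time on one side of the jump or the other, so there is no number left over to stand for 2:00am.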

So why does is.na() return TRUE for invalid POSIXlt dates?

This mystery has a slightly obscure - but not particularly tricky - solution.

JavaScript has "truthy" and "falsey" values – that is, values which are not true or false themselves but are treated as true or false within logical tests:

if (!0) { console.log("The number zero is falsey") }
if ("z") { console.log("The character 'z' is truthy") }
if (!"") { console.log("An empty string is falsey") }

R is a lot more rigorous about what is "truthy" and what is "falsey", although it still has a few annoying edge cases:

if (0 == FALSE){ message("0 is still falsey...")}

But it also has a third boolean option: NA. And I guess that means things can be considered "NA-y"?

Very few things are NA-y in R. The most notable are the value NA and its children, along with the value NaN, representing "not a number":

lapply(list(NA_character_, NA_complex_, NA_integer_, NA_real_, NaN), function(x) { is.na(x) })

> [[1]]
> [1] TRUE
>
> [[2]]
> [1] TRUE
>
> [[3]]
> [1] TRUE
>
> [[4]]
> [1] TRUE 
>
> [[5]]
> [1] TRUE
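A short sketch of how this plays out in practice, using only base R (the variable names are just for illustration): NaN counts as NA, but NA does not count as NaN, and NA behaves as a genuine third value in logical operations.

```r
x <- c(1, NA, NaN)

na_flags  <- is.na(x)   # FALSE  TRUE  TRUE
nan_flags <- is.nan(x)  # FALSE FALSE  TRUE

# NA propagates through logic unless the other operand settles the answer.
and_false <- NA & FALSE  # FALSE
or_true   <- NA | TRUE   # TRUE
and_true  <- NA & TRUE   # NA
```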

The internals of is.na() are, unfortunately, hidden from the average R developer:

is.na

> function (x)  .Primitive("is.na")

But not so for POSIXlts:

is.na.POSIXlt

> function (x) 
> is.na(as.POSIXct(x))
> <bytecode: 0x108c15e78>
> <environment: namespace:base>

Here we find why our timeless POSIXlt – even though it's a valid list of a bunch of values – makes is.na() return TRUE. The first thing this function does is turn it into a POSIXct, which (of course) immediately becomes NA. Another weird mystery of R's internals solved.
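The mechanism at work is S3 dispatch: is.na() hands off to a class-specific method when one exists. Here is a toy sketch of the same idea; the `limbo_time` class and its method are hypothetical and are not part of base R.

```r
# Hypothetical class: a plain list of fields, loosely in the spirit of POSIXlt.
limbo <- structure(list(hour = 2L, min = 0L), class = "limbo_time")

# Hypothetical method: treat any value in the "skipped" hour as NA.
is.na.limbo_time <- function(x) x$hour == 2L

with_method    <- is.na(limbo)           # dispatches to is.na.limbo_time
without_method <- is.na(unclass(limbo))  # default check on the bare list

with_method
without_method
```

The fields themselves are perfectly ordinary values; it is the method that decides the object as a whole counts as NA.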

So which date should I use?

The general rule of thumb, according to the internet, seems to be:

  • If you need to extract bits of dates (months, days, years, what-have-you), use POSIXlt.
  • Otherwise (and if you can help it) use POSIXct.
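A rough sketch of that rule of thumb in action, again using UTC and made-up timestamps so the result does not depend on your local time zone:

```r
stamps <- as.POSIXct(c("2018-07-01 15:00", "2018-12-25 09:30"), tz = "UTC")

# POSIXct is a plain count of seconds, so arithmetic is straightforward.
later <- stamps + 90 * 60  # add ninety minutes
later_text <- format(later, "%Y-%m-%d %H:%M")
later_text  # "2018-07-01 16:30" "2018-12-25 11:00"

# Convert to POSIXlt only when the individual pieces are needed.
parts  <- as.POSIXlt(stamps)
months <- parts$mon + 1  # months are stored 0-based
hours  <- parts$hour
```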

By being aware of the differences, though, you're probably already better equipped to deal with these sorts of issues than 90% of R programmers.


  1. These are created in New Zealand, where systems are set to New Zealand time. You might need to set your environment's time zone with Sys.setenv(TZ="NZ") if you wish to replicate these examples.