Infix operators I have known and loved

For the last month or so, I’ve been learning the R programming language. It’s been super-interesting, and quite the change from my usual stomping grounds of high-level OO languages like Ruby or Python.

I’m now past the point of complete beginner, and getting my teeth into some of the more advanced stuff.[1] One thing I’ve already had a bunch of fun with, however, is R’s infix operator syntax.

Like a bunch of shiny hinges

I don’t know why I like infix operators. Perhaps it’s R’s lack of per-object methods, resulting in the sort of front-loading of methods that resembles a German dependent clause in reverse:

mangled_variable <- mutilate(fold(spindle(variable)))

In a world where functions hang heavy on the left-hand side of your line, infix operators allow you to pace things out - to make sentences out of your code. The pipe operator from magrittr, by Stefan Milton Bache and Hadley Wickham, is the most popular (and obvious) example of that:

library(magrittr)

mangled_variable <- variable %>% spindle() %>% fold() %>% mutilate()

…although when overused, it starts to resemble a rambling run-on sentence.

I think that’s really what makes them shine for me: infix operators allow you to change the grammar of your code. And as long as you keep your statements simple, giving the code room to breathe (and your reader room to comprehend), they can turn a stodgy functional morass of parentheses into something a little more…well, human.
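
For anyone new to the syntax, the mechanics are small: in R, any function whose name is wrapped in percent signs can be called between its two arguments. Here is a minimal sketch of a pipe-like operator in base R. `%then%` is a made-up name for illustration, not how magrittr implements `%>%`, and it takes a bare function on the right rather than a call.

```r
# Hypothetical operator: pass the left-hand value to the function on the right.
`%then%` <- function(lhs, f) f(lhs)

# Nested calls read inside-out...
nested <- toupper(trimws("  shiny hinges  "))

# ...while the infix form reads left to right.
piped <- "  shiny hinges  " %then% trimws %then% toupper

identical(nested, piped)  # TRUE
```

Custom `%...%` operators all share one precedence level and group left to right, which is why the chain above needs no parentheses.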

The multi-line string

Python has truly spoiled us for multi-line strings:

"""
Yes, this is a long string that's got more than eighty characters in it.
What of it? I can just wrap it across multiple lines with the python
triple-quote operator.
"""

Ruby gets in on the action as well, although I always feel like I’m somehow poking Matz with a stick whenever I resort to its multi-line string notation:

long_string = <<-EOF
Here, too, we find ourselves writing long, multi-line strings whose
contents exceed the eighty-character limit that tradition and aesthetics
have made the default for line length.
EOF

But how can we do this in R? I’ve been migrating a lot of data to R from Excel recently, and my predecessors have left informative, useful, historically-relevant comments on data columns. But how do I transcribe a two-hundred-character-long piece of prose?

comment(df$Column) <- "This column has a long comment on it, which will be of great use to future programmers and analysts. But adding this comment really breaks that eighty-character limit."

Sure, we can just hard-wrap, if we don’t mind having stray whitespace and newlines all through our code:

comment(df$Column) <-
  "This column has a long comment on it, which will be of great use to
  future programmers and analysts. But adding this comment really breaks
  that eighty-character limit."

Or we could edit them all out in post, I guess…

# Collapse each line break, plus the indentation around it, into a single space.
remove_whitespace <- function(str){ gsub("\\s*\n\\s*", " ", str) }

comment(df$Column) <- remove_whitespace(
  "This column has a long comment on it, which will be of great use to
  future programmers and analysts. But adding this comment really breaks
  that eighty-character limit."
)

But in my experience, multi-line strings tend to start life as single-line strings, gently growing until they’re pushing up against that 80-character margin, and forcing you to stop and fix them.

R’s base addition operator can’t deal with character strings,[2] but we can always make our own…

`%+%` <- function(str1, str2){ paste(str1, str2) }

comment(df$Column) <- 
  "This column has a long comment on it, which will be of great use" %+%
  "to future programmers and analysts. But adding this comment" %+%
  "really breaks that eighty-character limit."

As a side note: since we’re using paste, R will add a space between each of our lines. Which is nice. This notation still isn’t as elegant as Python or Ruby, but I don’t think I’ll hit anything better until I work out how to write unary operator functions…
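
To see what that joining behaviour looks like in practice, here is a small sketch. `%+%` matches the definition in the text; `%++%` is a hypothetical sibling, not from the original, built on `paste0` for cases where no separator is wanted.

```r
# Same definition as in the text: paste() joins with a single space.
`%+%` <- function(str1, str2) paste(str1, str2)

# Hypothetical variant: paste0() joins with nothing in between.
`%++%` <- function(str1, str2) paste0(str1, str2)

spaced <- "a long comment," %+%
  "split across" %+%
  "three lines"

unspaced <- "eighty" %++% "-character"

spaced    # "a long comment, split across three lines"
unspaced  # "eighty-character"
```

Note that each continued line ends with the operator. If the operator started the next line instead, R would treat the statement as finished at the end of the previous one.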

Case-insensitive equality

More strings. Excel doesn’t think case is hugely important. Case in point: the LOOKUP family of functions will quite happily match foobar against FoOBaR or derivatives. Once I convert a (relatively wide) Excel table into an R data frame, I find myself doing a lot of calls like this:[3]

column <- excel_table$F

some_variable <-
  reference_table %>%
  subset(Category == column[5] & UseCase == column[4]) %>%
  join(date_range, by="Date", type="right") %>%
  use_series("Value")

As soon as your reference table uses a different case for its Category or Use Case qualifiers, though, you’ve got to re-tool that whole subset call:

some_variable <-
  reference_table %>%
  subset(
    tolower(Category) == tolower(column[5]) &
    tolower(UseCase) == tolower(column[4])
  ) %>%
  # ...

There are two points that really jar me here: first, of course, is the fact that this statement has almost doubled in length; second, we’ve gone from a nice simple infix operator to a handful of bulky function calls. Let’s fix that by hiding our implementation away behind a new operator:

`%=i%` <- function(a, b){ tolower(a) == tolower(b) }

some_variable <-
  reference_table %>%
  subset(Category %=i% column[5] & UseCase %=i% column[4]) %>%
  # ...

As a bonus, it’s now a lot easier to switch between case-sensitive and case-insensitive comparisons.
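
Because `tolower` and `==` are both vectorised, the operator works element-wise on whole columns, which is what `subset` needs. A small sketch with made-up data follows; the `reference` data frame and its values are hypothetical.

```r
`%=i%` <- function(a, b) tolower(a) == tolower(b)

# Hypothetical reference table with inconsistent capitalisation.
reference <- data.frame(
  Category = c("Fruit", "FRUIT", "veg"),
  Value = c(1, 2, 3),
  stringsAsFactors = FALSE
)

"foobar" %=i% "FoOBaR"           # TRUE
reference$Category %=i% "fruit"  # TRUE TRUE FALSE

# Inside subset(), the operator compares the whole column at once.
matched <- subset(reference, Category %=i% "fruit")
matched$Value                    # 1 2
```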

The letter range

As mentioned, I’ve been migrating a lot of data from Excel to R recently. There are plenty of good libraries for reading data out of Excel files, but when it comes to replicating functions, you generally have to write your own code.

When you’re spending days parsing Excel files, it helps to have your newly-minted data frame’s columns named “A” through whatever column your Excel table goes to. It wasn’t long before I thrashed out a quick recursive function for turning integers into Excel-style column letters:

int_to_column <- function(int) {
  int[int <= 0] <- NA

  col <- ifelse(
    int <= 26,
    LETTERS[int],
    paste0(int_to_column((int-1) %/% 26), LETTERS[(int-1) %% 26 + 1])
  )

  col[is.na(col)] <- ""
  col  # return the vector itself; a trailing assignment would return only its right-hand side
}

Which, because this is R and everything is a vector, allows for some pretty cool tricks:

> int_to_column(20:30)
 [1] "T"  "U"  "V"  "W"  "X"  "Y"  "Z"  "AA" "AB" "AC" "AD"

But what if you want to select columns “CD” through “DA”, say? Sure, you can trial-and-error to work out which numbers these correspond to…or you could make your own letter-style range operator.

First, we need the inverse of our function that turns numbers into columns:

column_to_int <- function(col) {
  col <- toupper(col)
  col_chars <- nchar(col)

  ifelse(
    col_chars < 2,
    # match() gives NA, not a zero-length result, for anything that isn't a single letter
    match(col, LETTERS),
    column_to_int(substring(col,1,col_chars-1))*26 +
      column_to_int(substring(col,col_chars))
  )
}

This allows us to write, for example:

> column_to_int(c("A", "E", "P", "DD", "EZ"))
[1]   1   5  16 108 156
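
The arithmetic behind this is bijective base 26: each letter is a digit from 1 to 26, and every step to the left multiplies by 26, so "DD" is 4 × 26 + 4 = 108 and "EZ" is 5 × 26 + 26 = 156. As a cross-check, here is a non-recursive sketch of the same conversion. The names `column_to_int_one` and `column_to_int_loop` are hypothetical, and the code assumes input made up only of the letters A to Z.

```r
# Hypothetical alternative: fold the letters of one name from left to right.
column_to_int_one <- function(col) {
  # Map "A".."Z" to 1..26 via their code points.
  digits <- utf8ToInt(toupper(col)) - utf8ToInt("A") + 1L
  # Each step shifts the running total one base-26 place and adds the next digit.
  Reduce(function(total, digit) total * 26L + digit, digits)
}

# Apply the single-name version to each element of a character vector.
column_to_int_loop <- function(cols) {
  vapply(cols, column_to_int_one, integer(1), USE.NAMES = FALSE)
}

column_to_int_loop(c("A", "E", "P", "DD", "EZ"))  # 1 5 16 108 156
```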

Now we can define our alphabetical range operator:

`%:%` <- function(a,b) {
  int_to_column(column_to_int(a):column_to_int(b))
}

And that, in turn, allows us to perform magic:

> "A" %:% "CC"
 [1] "A"  "B"  "C"  "D"  "E"  "F"  "G"  "H"  "I"  "J"  "K"  "L"  "M"  "N"  "O"  "P"  "Q"  "R" 
[19] "S"  "T"  "U"  "V"  "W"  "X"  "Y"  "Z"  "AA" "AB" "AC" "AD" "AE" "AF" "AG" "AH" "AI" "AJ"
[37] "AK" "AL" "AM" "AN" "AO" "AP" "AQ" "AR" "AS" "AT" "AU" "AV" "AW" "AX" "AY" "AZ" "BA" "BB"
[55] "BC" "BD" "BE" "BF" "BG" "BH" "BI" "BJ" "BK" "BL" "BM" "BN" "BO" "BP" "BQ" "BR" "BS" "BT"
[73] "BU" "BV" "BW" "BX" "BY" "BZ" "CA" "CB" "CC"
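
In practice, a range operator like this earns its keep when indexing a data frame whose columns carry Excel-style names. The sketch below is self-contained, so it swaps in a simpler lookup-table version of `%:%` rather than the recursive helpers; that version is an assumption for illustration and only handles names up to "ZZ" (702 columns). The `sheet` data frame is made up.

```r
# All one- and two-letter column names in Excel order: "A".."Z", "AA".."ZZ".
excel_names <- c(
  LETTERS,
  paste0(rep(LETTERS, each = 26), rep(LETTERS, times = 26))
)

# Hypothetical lookup-based range operator, limited to columns "A" through "ZZ".
`%:%` <- function(a, b) {
  excel_names[match(toupper(a), excel_names):match(toupper(b), excel_names)]
}

# Made-up sheet with 30 columns, named the way Excel would name them.
sheet <- as.data.frame(matrix(seq_len(60), nrow = 2, ncol = 30))
names(sheet) <- excel_names[seq_along(sheet)]

# Select columns "Y" through "AB" by name.
picked <- sheet[, "Y" %:% "AB"]
names(picked)  # "Y" "Z" "AA" "AB"
```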

  1. I’m looking at you, non-standard evaluation. 

  2. Or at least I don’t think you can - redefining base operators involves a fun trip through the guts of R’s S3 and S4 class systems, a current side-project of mine. 

  3. And, incidentally, it’s long-winded statements like this that I believe really benefit from piping. Try writing this without %>%s and maintain readability, go on.