RPN Parsing in R

We explore R’s computation-on-the language capabilities in this post on reverse polish notation.

Reverse Polish Notation

Reverse Polish Notation (RPN henceforth) is an algebraic notation that orders operands ahead of their operators. A + B becomes A B + in RPN. One advantage of polish notation is that under the assumption that operators are binary, there is no precedence ambiguity:

A + (B * C)    # Normal
A B C * +      # RPN
(A (B C *) +)  # RPN, unnecessary parens for emphasis

(A + B) * C    # Normal
A B + C *      # RPN
((A B +) C *)  # RPN, unnecessary parens for emphasis

In RPN as well as standard polish notation, operator-operand grouping is wholly determined by the order in which they appear in the expression. This makes it a easier to enter complex algebraic expressions in calculators with small displays. The additional advantage of reverse polish notation is every time we encounter an operator, we can evaluate the operation and keep only the numeric result instead of storing numbers and operators. This is probably why most Hewlett Packard calculators used RPN, including the venerable HP-12C:

HP-12C Calculator

Undoubtedly the cognitive load required to understand RPN is substantial. I remember being baffled when I got my first HP-12C, but once you get the hang of it there is no going back. I still reach for my HP-12C when I need to do some quick calculations rather than type them in finder or at the R prompt.

Computing on the Language

One of the many remarkable things about R is that you can compute on the language. “Language” in this context are unevaluated R expressions of the type produced by quote or parse1. We will use this capability to convert lists of RPN tokens into standard R expressions. But before we do so we need to cover some of the basics.

Normally, when you type an expression into the R command line, it is parsed and evaluated:

6 * 7
[1] 42

Parsing converts what would otherwise be a text string into a data structure that R can interpret as a set of commands. We can see the result of parsing by preventing evaluation with quote. This produces “call” objects, which themselves are “language”:

exp <- quote(6 * 7)
class(exp)
[1] "call"
typeof(exp)
[1] "language"

The quoted call looks just as we typed it at the terminal, but that is an artifice of how R chooses to display them. In effect, the printing of a call undoes the parsing2 and returns the call as a string. We can reveal the internal list structure3 of calls with as.list4:

str(as.list(exp))
List of 3
 $ : symbol *
 $ : num 6
 $ : num 7

The first element of a call is a function or the name of a function, and subsequent elements the arguments. In R operators are really functions disguised as operators, which explains why the * shows up as the first element of the list:

class(`*`)
[1] "function"
`*`(6, 7)
[1] 42

R could have hidden all this language business behind the scenes, but by exposing it to the user it allows us to do some rather remarkable things:

exp
6 * 7
exp[[1]] <- as.name('/')   # Yes, calls are really lists
exp
6/7

Quoted calls can be evaluated:

eval(exp)
[1] 0.857

Big deal, we could have done this with regex, right? In this specific case we could have, but generally speaking you need the context of a language object to properly manipulate it. Suppose you wanted to replace assignment = symbols with <- in the following expression:

`x=`=TRUE
if(`x=`) `x=`=c(x="x = \"x =\" y")
`x=`
              x 
"x = \"x =\" y" 

We need to deal with variable names that can contain arbitrary strings, as well as arbitrary character strings. I get a headache trying to think of the regex that would correctly identify which = symbols are for assignment, and which one are not. Yet this this type of substitution is trivial if you operate on the language object directly.

Another mechanism for creating call objects is to use call or as.call to assemble them from component pieces:

call('/', 378, 9)
378/9
as.call(list(as.name('/'), 378, 9))
378/9

call wants the function name in character format; internally it will convert it to a symbol when it assembles the call. as.call does less input processing so it requires as.name to create the symbol5.

typeof(as.name('/'))
[1] "symbol"

When symbols are evaluated, R looks for that symbol through the search path and returns the associated object. In this case it would be the division function in the base environment.

Parsing RPN

Why bother with RPN now that we have nice big displays and IDEs with auto-completing parentheses and lots of memory? Well, parsing RPN is a great example of R’s language computation capabilities, and that’s good enough for me.

First, we need to define a helper function to convert operators in character format to the their symbol equivalents6.

chr_to_name <- function(y)
  lapply(y, function(x) if(is.numeric(x)) x else as.name(x))

For a single call conversion to normal form consists just of moving the operator to the front of the call list:

tokens <- chr_to_name(list(20, 22, '+'))
str(tokens)
List of 3
 $ : num 20
 $ : num 22
 $ : symbol +
exp <- as.call(tokens[c(3, 1, 2)])
exp
20 + 22
eval(exp)
[1] 42

We can generalize the single call conversion by using a stack to build the full RPN parser:

rpn <- function(...) {
  L <- chr_to_name(list(...))
  i <- 1
  while(length(L) >= i) {
    if(is.name(L[[i]])) {
      L[[i]] <- as.call(L[i-c(0, 2, 1)])
      L[i-(1:2)] <- NULL
      i <- i-1
    } else {
      i <- i+1
    }
  }
  L[[1]]
}

L is both our input list and stack. The key expression is:

L[[i]] <- as.call(L[i-c(0, 2, 1)])

It selects three elements from our stack in relation to our counter i, orders them correctly, converts them into a call, and re-assigns them to the L[[i]] element of our stack. The rest of the function is essentially bookkeeping and cleanup. This should work with any syntactically correct list of RPN tokens:

rpn(20, 22, '+')
20 + 22
rpn(9, 3, '-', 2, 5, '+', '*')
(9 - 3) * (2 + 5)

In order to get a better sense of what is going on in rpn we modified it with explain and recorded the results for you to step through here:

    FPS:   Frame: / 48

Part of the reason it is so easy to compute on the language in R is that since calls are lists, list manipulation facilities can be used on them. For example, there is also recursive solution for parsing RPN.

One More Thing

The rpn function is a simple example of what you can do with R’s language computation facilities. A more interesting example is explain, which I wrote for this blog post. In the call:

explain(rpn)(9, 3, '-', 2, 5, '+', '*')

explain modifies rpn so that for each step in the evaluation it updates the debugger view to highlight the corresponding line and show the state of our stack L.

explain(rpn)
function (...) 
{
    {
        .res <- (L <- chr_to_name(list(...)))
        refresh_display(2)
        .res
    }
    {
        .res <- (i <- 1)
        refresh_display(3)
        .res
    }
    while ({
        .res <- (length(L) >= i)
        refresh_display(4)
        .res
    }) {
        if ({
            .res <- (is.name(L[[i]]))
            refresh_display(5)
            .res
        }) {
            {
                .res <- (L[[i]] <- as.call(L[i - c(0, 2, 1)]))
                refresh_display(6)
... Omitting 28 lines.

As you can see when comparing to the original function, each “top-level”7 call has been modified by adding a call to refresh_display. This was done with by applying enmonitor_one to each of them:

enmonitor_one
function(lang, line) {
  call(
    '{',
    call('<-', quote(.res), call("(", lang)),  # eval and temporarily store
    bquote(refresh_display(.(line))),          # update debug display
    quote(.res)                                # return temporary value
  )
}
<bytecode: 0x7fe493c2d320>

In R curly braces ({}), parentheses (()), and assignments (<-) are all calls themselves that like operators are displayed specially when deparsed. This is also true of control flow elements like if and while. This allows us to add code blocks, parenthesize expressions, and manipulate control structures.

We also need to line up each call’s position in the displayed output, and there is some work going on behind the scenes that computes the call line number and provides it to refresh_display. So in refresh_display(2), we are telling refresh_display to highlight the second line in the function source. You can see we do this in enmonitor_one with bquote, which is like quote, except that it allows evaluation of expressions enclosed in .().

If you are interested in the gory details of how explain is implemented you can look at the source. Beware though that the design philosophy of that code was expediency, not elegance. It will almost certainly break with anything other than rpn, but it should be possible to generalize it into a terminal debugger.

The call by call modification of a function is inspired by Jim Hester’s fantastic covr package, which contains a properly implemented method for tracing statements in calls. If you are looking for best practices when doing this type of ting you will have better luck looking there. vignette('how_it_works', package='covr') is a good starting point.

Conclusions

It is cute that it is easy to write an RPN parser in R. It is amazing that we can write a terminal debugger8 in R that runs in the same session it is created in. R’s willingness to freely expose its inner workings to programmers is one of its under-appreciated features.

Appendix

Recursive Solution

There is also a recursive solution with fewer lines of code:

rpn2 <- function(...) {
  rpn_rec <- function(tl, hd=list())
    if(length(tl)) {
      hd <- if(is.numeric(tl[[1]])) c(hd, tl[1])
      else c(head(hd, -2), list(as.call(c(tl[1], tail(hd, 2)))))
      Recall(tl[-1], hd)
    } else hd[[1]]
  rpn_rec(chr_to_name(list(...)))
}
rpn2(9, 3, '-', 2, 5, '+', '*')
(9 - 3) * (2 + 5)

Unfortunately this solution does not lend itself well to a step-through analysis. We will not discuss this further other than to point out that since language objects in R are structured like lists, they are amenable to list programming techniques.


  1. quote produces a single R statement or “call”, whereas parse produces a list of them, possibly of length one. The lists of calls produced by parse are called “expression” objects.

  2. Deparsing is the opposite of parsing, whereby a language object is converted back to character representation so it can be displayed. When a language object is printed, it is deparsed first so that it looks like code one might type rather than a list structure.

  3. Pair list really, and even though internally these are stored in a manner quite different to the traditional “vector” lists we’re used to in R, their semantics are very similar when accessed through R. See this SO answer for more details.

  4. We know from looking at the R internals that calls are really stored as (pair) lists. We agree that as.list(x) being “list” is not actually proof that the underlying storage of x is a list.

  5. You can make calls with as.call with string function names, and even an anonymous function, but in those cases the language object will not rendered as an operator, e.g. as.call(list('+', 1, 2)) is "+"(1, 2).

  6. You can make calls with as.call with string function names, and even an anonymous function, but in those cases the language object will not rendered as an operator, e.g. as.call(list('+', 1, 2)) is "+"(1, 2).

  7. Loosely speaking we consider top-level calls encountered first when traversing the function body, except for control loop calls such as if and while which are stepped into.

  8. Granted, we haven’t really done that here, but hopefully our hacky rpn specific debugger is proof-of-concept enough that it can be done. Additionally, I suspect (but haven’t checked), that the Rstudio debugger is implemented in R as well.

Brodie Gaslam Written by:

Brodie Gaslam is a hobbyist programmer based on the US East Coast.