Documenting S3 Methods with Roxygen2

This post is primarily intended for personal documentation. I have been incredibly confused by how to go about documenting S3 Methods in R, and things have gotten worse with the transition to Roxygen2 4.0 (or perhaps 3.0, whenever @export gained auto-S3 detection). This is what I’ve figured out so far:

@export will write export namespace directives for normal functions and S3method namespace directives for functions it detects as S3 methods. I am not entirely sure whether the S3 detection just looks for a period in the function name or does something more sophisticated. Corollary: @export will NOT write an export directive for S3 methods!
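For instance, with a hypothetical print.foo method, tagging the function with @export alone is enough for roxygen2 to emit an S3method() directive rather than export(print.foo):

#' Print Method for Hypothetical "foo" Objects
#'
#' @export
print.foo <- function(x, ...) {
  cat("a foo object\n")   # roxygen2 detects this as an S3 method and writes
  invisible(x)            # S3method(print, foo) to NAMESPACE, not export(print.foo)
}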

@S3method will write the S3method namespace directive, but is deprecated as of Roxygen2 4.0.

@method produces the documentation entry corresponding to a method, e.g.:
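For instance, a block along the following lines (print.foo is again hypothetical) yields a usage entry of roughly the form \method{print}{foo}(x, ...) in the generated Rd file:

#' @param x a foo object
#' @param ... unused
#' @method print foo
#' @export
print.foo <- function(x, ...) invisible(x)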

However, this will only happen if there is text in addition to @method in the roxygen block, or if you assign the @method block to a different Rd file with @rdname.

data.table vs. dplyr in Split Apply Combine Style Analysis


In this post I will compare the use and performance of dplyr and data.table for the purposes of “split apply combine” style analysis, with some comparisons to base R methods.

Skip to the bottom of the post if you’re just interested in the benchmarks.

Both packages offer similar functionality for “split apply combine” style analysis. Both packages also offer additional functionality (e.g. indexed merges for data.table, an SQL database interface for dplyr), but I will focus only on split apply combine analysis in this post.

Performance is comparable across packages, though data.table pulls ahead when there is a large number of groups in the data, particularly when using aggregating computations (e.g. one row per group) with low overhead functions (e.g. mean). If the computations you are using are slow, there will be little difference between the packages, mostly because the bulk of execution time will be the computation, not the manipulation to split / re-group the data.

NOTE: R 3.1 may well affect the results of these tests, presumably to the benefit of dplyr. I will try to re-run them on that version in the not too distant future.

Split Apply Combine Analysis

Data often contains sub-groups that are distinguishable based on one or more (usually) categorical variables. For example, the iris R built-in data set has a Species variable that allows you to separate the data into groups by species. A common analysis for this type of data with groups is to run a computation on each group. This type of analysis is known as “Split-Apply-Combine” due to a common pattern in R code that involves splitting the data set into the groups of interest, applying a function to each group, and recombining the summarized pieces into a new data set. A simple example with the iris data set:
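Sketched out with base functions (the values / ind column names in the output come from stack), the three steps look roughly like this:

pieces <- split(iris$Sepal.Length, iris$Species)   # split
means  <- lapply(pieces, mean)                     # apply
stack(means)                                       # combine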

values ind
1 5.01 setosa
2 5.94 versicolor
3 6.59 virginica

Implementations in Base R

Base R provides some functions that facilitate this type of analysis:
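Calls roughly along these lines produce the two output blocks that follow (tapply for the first, aggregate for the second):

tapply(iris$Sepal.Length, iris$Species, mean)   # named vector, one value per group
aggregate(. ~ Species, data = iris, mean)       # data.frame, every column averaged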

setosa versicolor virginica
5.01 5.94 6.59

Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 5.01 3.43 1.46 0.246
2 versicolor 5.94 2.77 4.26 1.326
3 virginica 6.59 2.97 5.55 2.026

I will not get into much detail about what is going on here other than to highlight some important limitations of the built-in approaches:

  • tapply only summarizes one vector at a time, and grouping by multiple variables produces a multi-dimensional array rather than a data frame as is often desired
  • aggregate applies the same function to every column
  • Both tapply and aggregate are simplest to use when the user function returns one value per group; both will still work if the function returns multiple values, but additional manipulation is often required to get the desired result

Third party packages


plyr

plyr is a very popular third-party Split Apply Combine package. Unfortunately, due to R inefficiencies with data frames, it performs slowly on large data sets with many groups. As a result, we will not review plyr here.


dplyr

dplyr is an optimized take on plyr targeted more specifically at data frame like structures. In addition to being faster than plyr, dplyr introduces a new data manipulation grammar that can be used consistently across a wide variety of data frame like objects (e.g. database tables, data.tables).


data.table

data.table extends data frames into indexed table objects that can perform highly optimized Split Apply Combine operations (strictly speaking there is no actual splitting, for efficiency reasons, but the result of the calculation is the same) as well as indexed merges. Disclosure: I am a long time data.table user, so I naturally tend to be biased towards it, but I have run the tests in this post as objectively as possible, except for those items that are a matter of personal preference.

Syntax and Grammar

Both plyr and dplyr can operate directly on a data.frame. For use with data.table we must first convert the data.frame to a data.table. For illustration purposes, we will use:
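Something like the following, where iris.dt is the data.table referenced later in the post and iris.df is simply my label for the dplyr side:

library(dplyr)
library(data.table)

iris.df <- iris                  # dplyr verbs work on plain data.frames
iris.dt <- data.table(iris)      # data.table copy of iris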

Here we will quickly review the basic syntax for common computations with the iris data set.


Both dplyr and data.table interpret variable names in the context of the data, much like subset.
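For example, a simple row filter reads almost identically across the two packages and base subset:

subset(iris, Sepal.Length > 7)       # base
filter(iris.df, Sepal.Length > 7)    # dplyr
iris.dt[Sepal.Length > 7]            # data.table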

Modify Column

One major difference between the two packages is that data.table can modify objects by reference. This runs against the general R philosophy of avoiding side effects, but has the advantage of being faster because it skips a memory re-allocation. For example, in the data.table expression below the iris.dt object itself is modified.
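A sketch of the contrast, using a hypothetical derived column Sepal.Area: dplyr returns a modified copy, while the data.table := assignment alters iris.dt in place.

mutate(iris.df, Sepal.Area = Sepal.Length * Sepal.Width)   # returns a new object
iris.dt[, Sepal.Area := Sepal.Length * Sepal.Width]        # modifies iris.dt by reference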


The philosophical differences between the two packages become more apparent with this task:
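For example, a grouped aggregation (mean.sl is just an illustrative name; %.% was the chaining operator in dplyr 0.1.x):

iris.df %.% group_by(Species) %.% summarise(mean.sl = mean(Sepal.Length))   # dplyr
iris.dt[, list(mean.sl = mean(Sepal.Length)), by = Species]                 # data.table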
dplyr appears to favor a grammar that conveys the meaning of the task in something resembling natural language, while data.table is looking for compact expressions that achieve the analytical objective.
Now, let’s compute on groups without aggregating, and then filter the results:
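For instance, ranking sepal lengths within each species and then keeping the rows ranked in the top three (sl.rank is an illustrative name):

iris.df %.% group_by(Species) %.%
  mutate(sl.rank = rank(-Sepal.Length)) %.%
  filter(sl.rank <= 3)                                                                     # dplyr

iris.dt[, list(Sepal.Length, sl.rank = rank(-Sepal.Length)), by = Species][sl.rank <= 3]   # data.table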
Here you can see that both dplyr and data.table support chaining, but in somewhat different ways. dplyr can keep chaining with %.%, and data.table can chain [.data.table. The main difference is that dplyr chains for every operation, whereas [.data.table only needs to chain if you need to compute on the result of the by operation.

Indirect Variable Specification

Both dplyr and data.table are designed to primarily work by users specifying the variable names they want to compute on. Sometimes it is desirable to set-up computations that will operate without direct knowledge of the variable names. In this example, we attempt to group by a column specified in a variable and compute the mean of all other columns:
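A sketch of the idea with the grouping column held in a character variable; the data.table side is shown, while the dplyr side at the time involved regroup() / summarise_each() style gymnastics and, as noted below, did not work reliably:

grp <- "Species"                        # grouping column chosen at run time
iris.dt[, lapply(.SD, mean), by = grp]  # character by, .SD covers the remaining columns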
So this can be done, but it takes a bit of effort and quickly gets complicated if you are trying to do more interesting computations. Explaining what’s going on here is a topic for another post. Note that the dplyr example doesn’t work with the current version (see discussion on SO).

Parting Thoughts on Syntax / Grammar

As noted earlier, data.table favors succinct syntax whereas dplyr favors a grammar that more closely follows common language constructs. Which approach is better is ultimately a matter of personal preference. Interestingly, both dplyr and data.table depart from the base R paradigms in their own ways. data.table’s syntax is much closer to base R functions, but it gleefully employs side effects to achieve its efficiency objectives. dplyr’s grammar is completely different from base R, but it does adhere to the no side effects philosophy.


As of this writing, the only noteworthy difference I’ve noticed in the context of split apply combine analysis (outside of the summarise_each issue noted earlier) is that dplyr does not allow arbitrarily sized group results. The results must either be one row per group when using summarise, or the same number of rows as the original group when using mutate, and the number of columns must be explicitly specified.

data.table allows arbitrary numbers of rows and columns (the latter provided each group has the same number of columns). dplyr will potentially add this feature in the future as documented on github.
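For example, in data.table each group may return any number of rows; here range yields two rows per group (sl.range is an illustrative name):

iris.dt[, list(sl.range = range(Sepal.Length)), by = Species]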



Dimensions

We will test how the following factors affect performance:

  • Number of rows in the data.frame: 10K, 100K, 1MM, 10MM (also tried 500K, 5MM, and 20MM)
  • Number of columns in the data.frame: 1 or 3 numeric data columns, plus the grouping column
  • Number of groups: 10, 100, 1000, …, 1MM, with a minimum group size of 10 rows
  • Group size constancy: group sizes exactly as above, or group sizes on average as above but varying randomly for any given group (normally distributed with SD == 0.2 × mean group size)
  • Operation type: aggregating (using mean), non-aggregating (using rev)


The data set was structured to contain a single factor grouping column, under the assumption that grouping by multiple columns is unlikely to be a big driver of performance. The data columns are all numeric. I did not test other data types, mostly because numeric columns are the common use case and I was running into way too many permutations already. Here is an example data frame:
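A sketch of how such a data frame can be generated (the G.10 and V.# names match the output below; everything else is illustrative):

set.seed(1)
n.rows   <- 1e4
n.groups <- 1e3                    # average group size of 10 rows
df <- data.frame(
  G.10 = factor(sample(n.groups, n.rows, replace = TRUE)),
  V.1  = runif(n.rows),
  V.2  = runif(n.rows),
  V.3  = runif(n.rows)
)
str(df)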

'data.frame': 10000 obs. of 4 variables:
$ G.10: Factor w/ 1000 levels "1","2","3","4",..: 690 188 414 595 665 933 405 851 516 439 …
$ V.1 : num 0.736 0.702 0.691 0.377 0.161 …
$ V.2 : num 0.0112 0.0763 0.175 0.3586 0.2254 …
$ V.3 : num 0.516 0.268 0.484 0.822 0.989 …

The G.10 column is the one we will group by. The V.# columns are random uniform in c(0, 1). Data frames will change in size/dimensions as described in the “Dimensions” section above, but will otherwise look pretty much like this one. Here we show that in this example the groups are on average of size 10 rows, but do vary in size:
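A quick check along these lines produces the output that follows:

mean(table(df$G.10))     # average group size
table(table(df$G.10))    # distribution of group sizes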

[1] 10

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
2 9 29 26 56 53 78 99 118 117 85 81 73 58 32 25 25 14 11 5 2 1 1

It turns out that having every group the same size or having them varying in size as shown above has very little impact on performance, so I’m only showing results for the unequal size tests.


I pre-generated all the combinations of data as described above, and then ran five iterations of each test for both dplyr and data.table, discarding the slowest of the five tests. Tests were timed with system.time and the values I report are the “elapsed” times.

The computations chosen (mean and rev) are purposefully simple, low overhead functions to ensure that the bulk of the execution is related to the splitting and grouping of the data sets and not to evaluating the user functions. These are the commands I used (one column versions):
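They were roughly of this form (result column names are illustrative; dt is the data.table version of df):

dt <- data.table(df)

# aggregating (one row per group)
df %.% group_by(G.10) %.% summarise(V.1.mean = mean(V.1))   # dplyr
dt[, list(V.1.mean = mean(V.1)), by = G.10]                 # data.table

# non-aggregating (one row per input row)
df %.% group_by(G.10) %.% mutate(V.1.rev = rev(V.1))        # dplyr
dt[, list(V.1.rev = rev(V.1)), by = G.10]                   # data.table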

The bulk of the tests were run on Mac OS 10.8 with R 3.0.2, dplyr 0.1.3, and data.table 1.9.2 (the vs. base tests were run on a different machine).

You can look at the full data generation and benchmarking code if you wish.


The short of it is that dplyr and data.table perform comparably for all data frame sizes unless you have a very large number of groups. If you do have high group numbers then data.table can be substantially faster, particularly when using computations that lead to one row per group in the result. With fewer groups dplyr is slightly faster.

The following chart summarizes the results on data frames with one column of data. Please note that the X axis in the plots is log10. The facet rows each correspond to a particular data frame size. The columns distinguish between an aggregating operation (left, using mean) and non-aggregating one (right, using rev). Higher values mean slower performance. Each point represents one test run, and the lines connect the means of the four preserved test runs.

[Figure: high level results]

dplyr and data.table are neck and neck until about 10K groups. Once you get to 100K groups, data.table seems to have a 4-5x speed advantage for aggregating operations and a 1.5x-2x advantage for non-aggregating ones. Interestingly, the number of groups seems to matter more to the performance difference than the size of the groups.

Adding columns seems to have very little effect on the aggregating computation, but a substantial impact on the non-aggregating one. Here are the results for the 10MM row data frame with one vs. three columns:

[Figure: testing more columns]

Vs. Base

I also ran some tests for base, focusing only on the 1MM row data frame with 100K groups (note these tests were run on a different machine, though same package versions). Here are the commands I compared:
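Plausible base candidates for the two operation types look something like this (the exact calls may have differed):

tapply(df$V.1, df$G.10, mean)            # aggregating, named vector result
aggregate(V.1 ~ G.10, data = df, mean)   # aggregating, data.frame result
ave(df$V.1, df$G.10, FUN = rev)          # non-aggregating, keeps the original length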

And the results:

[Figure: base packages]

Surprisingly, base performs remarkably well for non-aggregating tasks.


As I’ve noted previously, I’m currently a data.table user, and based on these tests I will continue to be one, particularly because I am already familiar with the syntax. Under some circumstances data.table is definitely faster than dplyr, but those circumstances are narrow enough that they needn’t be the determining factor when choosing a tool to perform split apply combine style analysis.

Both packages work very well and offer substantial advantages over base R functionality. You will be well served with either. If you find one of them is not able to perform the task you are trying to do, it’s probably worth your time to see if the other can do it (please comment here if you do run into that situation).


Setting Up A Programming Blog

My Specs

When I set out to find a blogging platform for writing about coding I had some simple requirements:

  • Must work in plain text editing mode (I hate WYSIWYG editors)
  • Markdown or wiki style mark-up language
  • Monospace code blocks
  • Monospace code output blocks distinguishable from the code blocks
  • Inline code (supported as part of markdown with backticks)

And some nice to have:

  • Code highlighting (simple is fine, even just distinguishing comments from code)
  • Markdown preview
  • Fenced code blocks with syntax recognition

WordPress seemed to be the only blog platform I could find out much about, and with its cornucopia of plug-ins I assumed I would have no trouble meeting my requirements. Little did I know…

What I Tried

WP Markdown

Initially the most promising solution appeared to be WP Markdown, which pretty much does everything I want with one major limitation: the distinguishable code output blocks.

I am willing to use a little HTML, though nothing that requires me to set attributes for individual tags. So the solution I came up with is using the <SAMP> tag, which is spec’ed as being exactly what I’m looking for. And this appeared to work great, except that after initial submission, any attempts to edit the post would open the editor with all newlines stripped out of my <SAMP> block! For example:


I’m guessing this is something easy to address so I’ll follow up with the author at some point.

Crayon And Syntax Highlighter Evolved

Both Crayon and Syntax Highlighter Evolved are beautiful plug-ins that really bring your code to life. Unfortunately, neither supports markdown directly.

Crayon has a slight advantage for my use case because it catches inline backticked code, and I prefer the styling, but neither does quite what I need.

Markdown On Save Improved

Looked like a promising implementation of Markdown, but has been deprecated in favor of an implementation within Jetpack.


Jetpack

Jetpack looks to be a Swiss Army knife plug-in that, among other things, supports markdown. Since I did not need the extra functionality, and since the package requires a WordPress.com account and apparently wants to advertise on some level, I steered clear of it. Brian Krogsgard has an interesting post on the topic.

Jetpack Markdown

Hurrah, someone liberated the markdown module from Jetpack!

The module is great and does everything I need. The only drawback is that it doesn’t natively syntax highlight and doesn’t have a preview mode like WP Markdown. Also, since this seems to be somewhat unofficial I am not sure what the long term maintenance prospect for the plug-in is, but it fits the bill so far.

My Setup

I ended up settling on a combination of Jetpack Markdown, Crayon Highlighter, and custom CSS implemented through Custom CSS Manager.

Issues / Caveats

Jetpack Markdown Installation

One installation note for Jetpack Markdown: in order to get it to work (v2.9) I had to extract the folder from the zip file and manually install it via FTP by copying it into wp-content/plugins/. The zip file I downloaded had an extra folder level that prevented the WordPress “Add New” plug-in install from working properly.

Fencing and mini-tags

Jetpack Markdown (I think it is Jetpack Markdown, but could be something else) converts non-standard characters to their HTML entities if they are part of fenced code blocks (i.e. code following triple backticks), indented code blocks, or within mini tags before putting them inside <pre> / <code> tags. This then trips up syntax highlighters leading to things such as &lt;- instead of <- inside code blocks.

As a workaround, I use <pre> tags for code blocks. This works well for Crayon, but not so much for Syntax Highlighter Evolved, which doesn’t appear to recognize simple <pre> tags. I suspect this is due to the WP plug-in rather than the underlying JS library, but I gave up trying to figure it out after a couple of minutes browsing through the source code.

SAMP tags

Unfortunately, WordPress likes to replace newlines with <br> tags inside <samp> blocks. That is fine unless you set white-space: pre; on the <samp> blocks, which you have to do if you want tab alignment and the like to work. One workaround I used was to reduce line spacing so that it looks like there aren’t empty lines due to the <br> tags. Total hack, but close enough for my purposes.

Custom CSS

For reference, this is what I settled on:

I’ve made several other adjustments, but those are mostly related to header formatting, spacing, etc.

Performance Impact of S3 and S4 Dispatch

The Setup

I have known for a while that S3 and S4 dispatch carries a performance penalty, but I have never run into a situation where it is a glaring problem. I am currently contemplating writing some functions that could potentially be called many times, so I decided to benchmark S3 and S4 dispatch. The quick answer is that the performance impact of method dispatch is unlikely to matter most of the time (for an example of an exception see the “When Is This a Problem?” section below).

The tests are structured to evaluate direct calls to the functions, as well as a couple of variations on dispatch. They were executed on an OS X 10.8 / 2.0GHz Dual Core i7 / 8GB RAM system as well as on a Win7 / 2.8GHz Quad Core Xeon / 6GB RAM system (both with R 3.0.2 run through RStudio).

S3 Methods
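A reconstruction of the setup consistent with the results below (the class names and function bodies are illustrative guesses):

library(microbenchmark)

myfun         <- function(obj, ...) UseMethod("myfun")
myfun.default <- function(obj, ...) NULL               # does essentially nothing
myfun.classed <- function(obj, ...) NULL               # method for class "classed"
myfun.child   <- function(obj, ...) NextMethod()       # dispatches a second time

x <- list()                                            # unclassed: falls through to the default
y <- structure(list(), class = "classed")              # one dispatch
z <- structure(list(), class = c("child", "classed"))  # dispatch plus NextMethod

microbenchmark(myfun(x), myfun(y), myfun(z), myfun.default(x))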

Unit: nanoseconds
expr min lq median uq max neval
myfun(x) 2886 3535.5 3990.5 5194.0 7427 100
myfun(y) 2298 2769.0 3095.5 3796.0 5499 100
myfun(z) 5125 6034.0 6730.0 7744.5 38117 100
myfun.default(x) 346 446.0 588.5 722.0 1317 100 # baseline

Clearly method dispatch carries a performance penalty relative to the baseline case. The simplest dispatch scenarios (first and second here) add on the order of 3μs to execution. While the extra amount of time is small on an absolute basis, it is a ~5x increase over baseline. Interestingly, the unclassed object (i.e. the one that ends up at myfun.default) is almost 25% slower to dispatch than the classed one1. Invoking NextMethod as myfun(z) results in another roughly 2x increase. This makes sense since we’re using S3 dispatch twice.

S4 Methods
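Again an illustrative reconstruction; the class and method names are my own:

setClass("clsA", representation(val = "numeric"))
setClass("clsB", contains = "clsA")

setGeneric("myfun2", function(obj) standardGeneric("myfun2"))
setMethod("myfun2", "clsA", function(obj) NULL)
setMethod("myfun2", "clsB", function(obj) callNextMethod())

myfun2S41 <- function(obj) NULL      # plain function used as the baseline

w <- new("clsA", val = 1)            # simple S4 dispatch
u <- new("clsB", val = 1)            # dispatch plus callNextMethod

microbenchmark(myfun2(w), myfun2(u), myfun2S41(w))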

Unit: nanoseconds
expr min lq median uq max neval
myfun2(w) 11751 13943.0 15127.5 16231.0 42846 100
myfun2(u) 146687 150092.5 155058.5 162427.0 332575 100
myfun2S41(w) 456 621.5 709.5 803.5 1580 100 # baseline

The simplest S4 dispatch creates a ~20x increase over baseline. Using callNextMethod you go up to a ~200x increase in execution time over baseline. This is on a clean workspace with only the default packages and microbenchmark loaded.

Method look up in S4 is a lot more complex than in S3, so one would expect a performance penalty between S3 and S4. What is truly surprising is what happens with callNextMethod. The logic within callNextMethod is fairly complex and uses several calls to functions such as is that presumably do lookups on the S4 tables, so perhaps this should be expected2.

When Is This a Problem?

You are unlikely to notice S3 and S4 dispatch in most use cases, especially if you properly vectorize your functions. Even our slowest benchmark ran in about an eighth of a millisecond. One type of situation that could cause problems is if your function is used as part of a split-apply-combine analysis. Consider this example that takes advantage of S4 dispatch to apply the correct method to each column of a data frame. We benchmark that approach against a more traditional function based on if control flow:
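A sketch of the comparison under some assumptions: an S4 generic (vecconcat) that dispatches on column type versus an if based equivalent (vecconcat2), applied to a data frame with 1MM rows, ~100K groups, and integer data columns:

setGeneric("vecconcat", function(x) standardGeneric("vecconcat"))
setMethod("vecconcat", "numeric", function(x) paste0(x, collapse = "."))    # also covers integer
setMethod("vecconcat", "character", function(x) paste0(x, collapse = "."))

vecconcat2 <- function(x) {                          # the "traditional" version
  if (is.integer(x)) paste0(x, collapse = ".")
  else if (is.character(x)) paste0(x, collapse = ".")
}

set.seed(1)
n  <- 1e6
df <- data.frame(
  grp = sample(1e5, n, replace = TRUE),
  a   = sample(100L, n, replace = TRUE),
  b   = sample(100L, n, replace = TRUE)
)

library(microbenchmark)
microbenchmark(
  aggregate(df[-1], df[1], vecconcat),
  aggregate(df[-1], df[1], vecconcat2),
  times = 5
)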

Unit: seconds
expr min lq median uq max neval
aggregate(df[-1], df[1], vecconcat) 5.66 5.70 5.71 5.71 5.71 5
aggregate(df[-1], df[1], vecconcat2) 2.91 2.98 3.01 3.02 3.04 5

So about 2x slower using S4 dispatch. It turns out that a fair chunk of the ~3 seconds of the traditional function run is taken up by aggregate itself. If we use the more efficient data.table, the difference is closer to 10x because now there is a lot less overhead from the aggregation function itself:
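The data.table version of the same comparison (same assumptions about the data):

library(data.table)
dt <- data.table(df)

microbenchmark(
  dt[, lapply(.SD, vecconcat),  by = grp],
  dt[, lapply(.SD, vecconcat2), by = grp],
  times = 5
)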

Unit: seconds
expr min lq median uq max neval
dt[, lapply(.SD, vecconcat), by =] 2.87 2.883 2.908 2.911 2.922 5
dt[, lapply(.SD, vecconcat2), by =] 0.25 0.255 0.256 0.258 0.258 5

Some might argue that there is no benefit whatsoever to the S4 approach so the slowness is moot, but even in this example there are some disguised benefits. For example, the S4 code will work with both numeric and integer columns, whereas the “traditional” approach only works with integer. More importantly, there may be functions that have more legitimate internal use for S4 methods that would still carry the same performance overhead as this top level dispatch.


Most of the time you won’t have to worry about S3/S4 dispatch. However, if you plan on developing with S3 or S4, make sure you carefully think through the use cases to ensure that the extra overhead from method dispatch won’t add up to substantial extra time in any of them.

One big caveat is that we are using fairly simple S4 dispatch here. It is possible that with complex webs of S4 classes dispatch times could become an issue in more traditional use cases.

1 At first I thought this could indicate that S3 generics with lots of methods could be particularly affected, but the slowdown seems to be specific to the invocation of the default method, not to looking through a large method list.
2 I replicated this result on a windows machine as well, suggesting this overhead is real.