Should We Care About R Object Internals?

R does a pretty good job of abstracting away the memory management aspect of programming. We don’t need to know anything about malloc, free, memory leaks, illegal access, and other terrible things that can happen when one tries to manually manage memory.

So why pull back the curtain and peer into the memory internals of R objects? Under some circumstances there can be important performance implications.

R implements a copy-on-modify optimization that allows multiple variables to point to the same object / memory location until one of them is modified. In particular, this prevents copies when variables are passed to functions that read but do not modify their arguments. If the function were to modify the argument in the function frame¹, then it would be copied first so that the reference² from the calling frame is not affected. This preserves R’s call-by-value³ semantics while only copying objects when needed.

One problem with the copy-on-modify optimization is that as of R3.5.2 it is based on a very conservative reference counting heuristic⁴. This can cause unexpected and/or unnecessary object copies.

We can use tracemem to detect object copies, unexpected or otherwise:

## Under R 3.5.2.
x <- sample(10)
tracemem(x)          # tracemem will print to screen when `x` is copied

[1] "<0x7fcc8a9b7c00>"

c(x)                 # 'touch' x, with no changes

 [1]  0  2  1  9  3  5  7 10  8  6

x[1] <- 0

We know the memory referenced by x was not copied despite being modified because tracemem was quiet. Let’s try again:

identity(x)          # 'touch', with more emotion

 [1]  0  2  1  9  3  5  7 10  8  6

x[2] <- 0

tracemem[0x7fcc8a9b7c00 -> 0x7fcc841cfc00]:

This time there was a copy as evidenced by the tracemem output. What we really need is a mechanism to understand why the first assignment did not lead to a copy, but the second one did. In both cases there is only one reference to x so it should be safe to modify x without copy.

.Internal(inspect(...)) comes in handy here; it allows us to inspect the guts of our R objects:

x <- sample(10)
.Internal(inspect(x))

@7fa5826a0008 13 INTSXP g0c4 [NAM(1)] (len=10, tl=0) 2,5,10,6,8,...

c(x)

[1]  2  5 10  6  8  9  4  7  1  3

.Internal(inspect(x))

@7fa5826a0008 13 INTSXP g0c4 [NAM(1)] (len=10, tl=0) 2,5,10,6,8,...

identity(x)

 [1]  2  5 10  6  8  9  4  7  1  3

.Internal(inspect(x))

@7fa5826a0008 13 INTSXP g0c4 [NAM(3)] (len=10, tl=0) 2,5,10,6,8,...

If you’re wondering what that gobbledygook is, don’t worry, we’ll explain it in detail shortly. In essence, it is internal meta data associated with the R object. For now notice the highlighted [NAM(#)] bit. That is the “reference” counter. c(x) did not increment it, but identity(x) did⁵. x was copied on the second assignment because by that point the NAM reference counting heuristic⁶ suggests there could be more than one reference to the object.

It turns out that there is a substantive difference between c and identity. The former is a primitive function, while the latter is a closure function:

function (...)  .Primitive("c")

typeof(identity)

[1] "closure"

Closures are complex. Among other things they each have their own evaluation frame. Because R does not currently track whether this frame is destroyed on function exit, R assumes it could persist along with its references to the parameters. For this reason the “NAM” value on arguments to closures is always incremented. Primitives on the other hand are straight-to-compiled C code functions that have full control of memory and reference count so they can leave it unchanged when warranted.

And what about .Internal(inspect(...))? It’s the R equivalent of a speakeasy: undocumented, unpublicized, and as shown here occasionally useful to know about. Only the cool kids on r-devel are clued in, and I happen to know about it because my older cousin’s college room mate’s brother is one of the cool kids.

A Note About .Internal, and Some History

.Internal is an interface used to call compiled C code routines. There are several other interfaces that do similar things, including .Primitive, .Call and .External. Unlike the last two, .Internal is not intended for “public use”:

Only true R wizards should even consider using this function, and only R developers can add to the list of internal functions. ?.Internal, R-core

I’m probably more in the Mickey-mouse-in-Fantasia category than “Wizard”, but specifically for .Internal(inspect(...)) in interactive use, there is a long precedent of actual wizards and non-wizards using it on the r-devel mailing-list for just this purpose.

The inspect internal was quietly added by Simon Urbanek at revision 48129 to what was then R under development 2.9.0:

$ svn log -r48129
------------------------------------------------------------------------
r48129 | urbaneks | 2009-03-16 11:25:35 -0400 (Mon, 16 Mar 2009) | 1 line

add .inspect debugging tool
------------------------------------------------------------------------

The function appears to originally have been part of the inspect package, the only trace of which I can find is on Rforge. It does not show up in the CRAN archives, so perhaps it never made it on there. The package does have some terse documentation.

As far as I can tell the only documentation outside of Rforge is this informal announcement:

FWIW inspect is now part of R itself but as an internal function so you can either use it directly via .Internal or for compatibility
inspect <- function(...) .Internal(inspect(...))
the arguments are (x, max.depth=-1, max.elements=5) [the last one is only supported in R-devel].
- Simon Urbanek on R-devel

Unofficial Documentation

Overview

Consider this output of .Internal(inspect(...)):

x <- sample(10)
invisible(gc(full=FALSE))   # run garbage collector.
.Internal(inspect(x))

@7fa582fa7248 13 INTSXP g0c4 [MARK,NAM(1)] (len=10, tl=0) 10,9,2,6,5,...

It can be broken up into several different tokens:

Address          Type Name   Extra                  True Length
|                |           |                      |
+-----------+    +----+      +-----------+          +--+
@7fa582fa7248 13 INTSXP g0c4 [MARK,NAM(1)] (len=10, tl=0) 10,9,2,6,5,...
              ++        +--+               +------+       +------------+
              |         |                  |              |
              Type      GC Info            Length         Data

All of the information after the address token is object meta data. It is part of the underlying C-level representation of the R object. We’ll review each piece of output next.

Note: What follows is my interpretation of what is in R internals and in the sources⁷. I could be wrong about some or all of it, and what I’m right about could change in the future. Do not make rash life decisions based on what follows.

Address Token

@7fa582fa7248 13 INTSXP g0c4 [MARK,NAM(1)] (len=10, tl=0) 10,9,2,6,5,...

7fa582fa7248 represents the memory location of the R object as a hexadecimal offset. I know of no legitimate use within R of this number other than to confirm whether two symbols point to the same R object or not.

One interesting point I noticed while writing this post is that the address only uses 12 hexadecimal digits, or 48 bits ( $16^{12} = (2^{4})^{12} = 2^{48}$ ), despite my processor being 64 bit. It turns out this because current x86-64 bit processors address 48 bits of memory space.

Type and Type Name Tokens

@7fa582fa7248 13 INTSXP g0c4 [MARK,NAM(1)] (len=10, tl=0) 10,9,2,6,5,...

In R internals object types are integer values. For example, NULL objects are type 0, with type name NILSXP, and integer vectors are type 13, with type name INTSXP. Generally the types have corresponding values as returned by the typeof R function although some types like CHARSXP are not typically visible from R.

For a full listing of the types and type names see the the SEXP section of the R Internals Manual.

GC Info Token, Part 1: Generation Data

@7fa582fa7248 13 INTSXP g0c4 [MARK,NAM(1)] (len=10, tl=0) 10,9,2,6,5,...

Software like R that manages memory allocation will be responsible for freeing previously-allocated-but-no-longer-used memory for re-use. This process is known as garbage collection (GC). R’s garbage collector is generational: it adjusts collection frequency depending on object age. The idea is that “most objects die young”⁸ so you can improve garbage collection performance by first collecting the younger generation in the hopes that frees up enough memory to continue without further collection.

The first two highlighted characters partly encode the generation of the R object. The 0 or 1 following the “g” corresponds to the value of the gcgen⁹ bit of the meta data. Additionally, R uses the mark bit, and whether it is set is shown by the “MARK” token appearing as highlighted here. This allows tracking of up to three generations of objects¹⁰, where objects that have a 1 value for the mark bit are considered older than those that don’t.

The mark bit is set on an object if at the time the garbage collector runs the object is referenced by a variable. I ran gc() in the example so it would show up here.

GC Info Token, Part 2: Node Class

@7fa582fa7248 13 INTSXP g0c4 [MARK,NAM(1)] (len=10, tl=0) 10,9,2,6,5,...

The next two characters starting with a “c” and followed by a number in 0-7 represent the “node class” of the R object, which appears to be a rough measure of size:

c == 0: Non-vector nodes (e.g. NULL, pairlists, closures, etc.).
0 < c < 6: Small vectors of size up to $8 \times 2^{c - 1}$ bytes.
c == 6: Vectors with custom allocators¹¹ (rare).
c == 7: Vectors larger than $8 \times 2^{c - 1}$ bytes (> length 32 logical/integers, > length 16 numerics/character/lists).

Each of the node classes in 0 through 5 are allocated from memory pages that are approximately 2,000 or 8,000 bytes¹² depending on the system. New objects can be created from these pages quickly, at least until the page for that class is filled and a new one needs to be allocated by the OS. Large vector allocations are requested directly from the OS. Custom allocators will obviously depend on their implementation.

Extra Token

@7fa582fa7248 13 INTSXP g0c4 [MARK,NAM(1)] (len=10, tl=0) 10,9,2,6,5,...

A comma separated list of sub-tokens with additional information. The meanings of the sub-tokens if they appear follow:

OBJ: Has a “class” attribute.
NAM(#): The “named” value of the object, a heuristic used to determine whether the memory underlying and object needs to be copied if the object is modified. If # == 0 it will not appear in the inspect() output and it is safe to modify the memory in-place as there are no references to it. If # == 1 the memory is referenced by one symbol (variable) and can be modified in-place by some primitive functions. If # > 1 the memory must be copied if it is modified. Note that constants have # > 1 so that there values cannot be changed¹³. “named” values on an object can only ever be incremented. This token is mutually exclusive with the “REF” token discussed next.
REF(#): A true reference counter that can be both incremented and decremented. In particular this resolves one of the biggest drawbacks of the “named” heuristic: when variables are passed to closure functions their “named” value automatically becomes greater than one, requiring copy for any modifications that happen later, even though there likely are no references remaining after the closure is done evaluating. This could be implemented as early as R3.6.0, and would replace the “named” system. This token is mutually exclusive with the “NAM” token discussed previously.
MARK: The object was referenced by a variable at the time of garbage collection, i.e. it could not be collected because it was in use. In conjunction with the rest of the GC info data this helps define the garbage collection generation of the object.
DBG: (closures only) has been debugged.
TR: (closure only) has been traced.
STP: (closure only) has been debugonced, but once full reference counting is implemented will be used on non-closure objects that reference counting should be turned off.
S4: Is S4 (also implicit in the “gp” code).
AB: Is an active binding (also implicit in “gp” code), i.e. does typing the symbol name trigger an action.
LCK: Is a locked environment (also implicit in “gp” code), e.g. package namespaces.
gp=0x####: Hexadecimal, value of the “General Purpose” 16 bit code associated with the object. This is used to encode things such as whether a promise has been seen, the encoding of a character string, whether an object is S4, and others¹⁴. Many of the flags that are implied by these codes also are highlighted by other flags, so in some ways this is redundant except for the more unusual gp bit combinations.
GL: Is the Global Environment.
ATT: Has attributes.

Length And True Length Tokens

@7fa582fa7248 13 INTSXP g0c4 [MARK,NAM(1)] (len=10, tl=0) 10,9,2,6,5,...

The length of vectors. For true length:

This is almost unused. The only current use is for hash tables of environments (VECSXPs), where length is the size of the table and truelength is the number of primary slots in use, and for the reference hash tables in serialization (VECSXPs), where truelength is the number of slots in use.
- R Core in R Internals

Additionally, as of R3.4.0 R will over-allocate vectors that have values assigned past the end to try to mitigate the growing vectors¹⁵ problem:

x <- sample(100)
.Internal(inspect(x))

@7f8529893de0 13 INTSXP g0c7 [NAM(1)] (len=100, tl=0) 41,97,16,65,49,...

x[101] <- 101L
.Internal(inspect(x))

@7f85245b0f70 13 INTSXP g0c7 [NAM(1),gp=0x20] (len=101, tl=106) 41,97,16,65,49,...

Currently the default is to grow the vector to 1.05 times the length. The fifth gp bit is set ( $0 x 20 = 2^{5}$ ) presumably marking that this vector is “growable”.

Data Token

@7fa582fa7248 13 INTSXP g0c4 [MARK,NAM(1)] (len=10, tl=0) 10,9,2,6,5,...

A small snippet of the data.

Encoding and Cache Tokens

Not seen in the integer example are two tokens that show up with character vectors:

.Internal(inspect(letters[1:2]))

@7fd28e4958c8 16 STRSXP g0c2 [] (len=2, tl=0)
  @7fd28e4958c8 09 CHARSXP g0c1 [MARK,gp=0x61] [ASCII] [cached] "a"
  @7fd28e3a7fb8 09 CHARSXP g0c1 [MARK,gp=0x60] [ASCII] [cached] "b"

The first line represents the character vector itself, and looks a lot like the integer vector. The next two indented lines represent the vector elements, so there is one line for “a”, and one line for “b”. Breaking down the “a” line:

  Address          Type Name    Extra                  Cached
  |                |            |                      |
  +-----------+    +-----+      +------------+         +------+
  @7fd48995b920 09 CHARSXP g0c1 [MARK,gp=0x61] [ASCII] [cached] "a"
                ++         +--+                +-----+          +-+
                |          |                   |                |
                Type       GC Info             Encoding         Data

Encoding: “bytes”, “latin1”, “UTF8”, or “ASCII”: the encoding of the string.
Cached: string is part of the global string hash table, which should almost always be the case.

Aside: the keen-eyed observer might notice that the “general purpose” values are not the same for “a” (0x61), and for “b” (0x60). 0x60 in hexadecimal is $2^{5} + 2^{6}$ , which according to R Internals means the string is in the global string cache (as it should be) and that it is ASCII encoded. That is obviously true of both strings. What about the additional $2^{0}$ present in the 0x61, but only for “a”? This is probably related to symbol names (a.k.a. variable names) of loaded environments, which are hashed for faster lookup¹⁶. However, because the symbol names are also stored in the string pool, if you use a string that happens to be associated with a hashed symbol, it will show up marked in the $2^{0}$ general purpose bit:

b <- 42
.Internal(inspect(letters[1:2]))

@7fd28e4958c8 16 STRSXP g0c2 [] (len=2, tl=0)
  @7fd28e187920 09 CHARSXP g0c1 [MARK,gp=0x61] [ASCII] [cached] "a"
  @7fd28e3a7fb8 09 CHARSXP g0c1 [MARK,gp=0x61] [ASCII] [cached] "b"

Notice how the creation of the b variable changed the state of the “b” CHARSXP.

ALTREP Token

As of R3.5.0 R supports alternative representations (ALTREP) of some object types or states. For example, the : operator generates an ALTREP object instead of an integer vector:

.Internal(inspect(1:1e8))

@7fef6290dd90 13 INTSXP g0c0 [NAM(3)]  1 : 1000000000 (compact)

Instead of materializing a vector with 100 million sequential integer values, R just records that a sequence starting at 1 and ending at 1e8 in 1 increments is implicitly represented by the object. Some functions can take advantage of this to sequentially or otherwise use the values of the vector without allocating the memory required to hold it in its entirety¹⁷.

Another use of ALTREP is storing the sorted and missingness state in a wrapper:

.Internal(inspect(sort(c(3, 1, 2))))

@7fd29113bd90 14 REALSXP g0c0 [NAM(1)]  wrapper [srt=1,no_na=1]
  @7fd290e72588 14 REALSXP g0c3 [NAM(3)] (len=3, tl=0) 1,2,3

I haven’t looked at the specifics of how this is implemented, but functions that are ALTREP aware should be able to tell that the vector is sorted and has no NAs just by consulting the meta data instead of scanning the entire vector.

As ALTREP becomes more prevalent, understanding what objects are ALTREPs and which functions can use them without materializing them will become increasingly important to write more efficient code.

Inspecting Complex Objects

.Internal(inspect(...)) will dump out the meta data for recursive objects recursively. You can control the depth of the recursion with the first parameter:

.Internal(inspect(getNamespace('stats'), max.depth=1))

@7fe2431e40a8 04 ENVSXP g1c0 [MARK,NAM(3),LCK,gp=0x4000] <namespace:stats>
ENCLOS:
  @7fe2431e7688 04 ENVSXP g1c0 [MARK,NAM(3),LCK,gp=0x4000,ATT] <0x7fe2431e7688>
  ATTRIB:
    @7fe2431e3e08 02 LISTSXP g1c0 [MARK] 
HASHTAB:
  @7fe243affa00 19 VECSXP g1c7 [MARK] (len=1018, tl=720)

Absent the max.depth parameter you would risk substantial screen output. There are some slight differences in output here from what we have seen before, primarily from inspect labeling different components of the complex object.

There is also the max.elements parameter, although in order to pass it you must also pass max.depth as the .Internal call will only match positionally:

.Internal(inspect(as.list(1:1e3), max.depth=-1, max.elements=2))

@7fe247a3ec00 19 VECSXP g0c7 [] (len=1000, tl=0)
  @7fe2471eea20 13 INTSXP g0c1 [] (len=1, tl=0) 1
  @7fe2471ee9e8 13 INTSXP g0c1 [] (len=1, tl=0) 2
  ...

Aside: notice how similar the structure of the vector list above is to the character vector structure.

Parting Thoughts

While I wrote this mostly as a reference to myself, I hope it will be useful to others.

Questions, comments, or just want to share this post? Post about it on Twitter. Follow @brodiegaslam for updates and/or new post notifications.

Acknowledgements

Jim Hester for pointing me to the named/reference discussion on R-devel and to the new usage of truelength for over-allocated vectors.
Image Credit: A Pile of RAM, by Blake Patterson, under CC-BY 2.0, background removed and cropped.

Updates

2/25/19 8pm: added docs for MARK in extra section (h/t Jim Hester).
2/25/19 8pm: truelength is also used for growable vectors (h/t Jim Hester).

I use the term “frame” somewhat loosely to mean the location were variables reside, and you can use the term to mean “environment”. The “frame” is the part of the environment that stores the variable names. The global environment is the default place that variables you create at the R prompt reside. Functions have their own frames distinct from the global and other environments. Note that where the variables reside and where the objects they reference are stored in memory are separable concepts. You can think of variables as index entries in a book, and the objects as content in the book. The index is the frame. The book can have many indices.↩
When I say an object is referenced, I mean that there is a symbol you can use to access it. For example, after we run x <- c(42, 24), the object c(42, 24) in memory is referenced by the symbol x. If I were to type y <- x then there would be two references to the object, x and y. If I call var(x) then there will be an additional reference inside the var function while it is evaluating.↩
Call-by-value means that the value of arguments passed to functions is not changed in the frame that calls the function, even if it is changed in the function frame. Contrast to call-by-reference where any modifications within the function frame are reflected in the calling frame.↩
Future versions of R, possibly as early as R3.6.0, will implement more sophisticated reference counting that will better detect whether objects truly need to be copied or not.↩
For the NAM (named) reference counting heuristic, the amount of the increment doesn’t matter, only whether it is incremented to be greater than one or not.↩
Future versions of R, possibly as early as R3.6.0, will implement more sophisticated reference counting that will better detect whether objects truly need to be copied or not.↩
I looked primarily in src/main/inspect.c and src/main/memory.c of the r-devel sources at revision 76003 (R-devel 3.6.0 2019-01-21).↩
I unfortunately cannot recall exactly where that quote comes from. That aside: Imagine a script that calls many functions sequentially. Each of those calls is likely to generate many internal R objects that are no longer needed as soon as the function ends evaluation. Only the return values are preserved in variables. These function specific objects are likely to be the younger generation that is eliminated first. Preserved objects in variables will then be aged into an older generation and only be reviewed for freeing if eliminating the young generation is insufficient.↩
The gcgen and mark bits are part of the rest of header meta-data.↩
As best I can tell the generations are “not marked”, “marked and gen bit is 0”, “marked and gen bit is 1”. Since the generations are changed by the garbage collector, and the garbage collector has no reason to preserve any non-referenced object, it kind of make sense that the distinction between 0 and 1 for the gen bit is meaningless if an object is not also considered marked.↩
It is possible to implement custom memory allocators, but in that case R is no longer responsible for managing the memory.↩
There is some small overhead for each page, and also some rounding to multiples of the underlying R object sizes. Systems with long vector support (64 bit?) appear to use the 8000 byte base page size.↩
This came up in the Matt Dowle - Peter Daalgard thread.↩
For a full accounting of the general purpose bits see R Internals.↩
Growing vectors as in x <- 0; for(i in 1:100) x[i] <- i used to require allocating a new vector at each assignment, copying the old vector, and appending the new value. This is pretty slow. As of R3.4.0 R allocates some extra memory to the vector each time we write to the end, so that re-allocation is not necessary at each assignment. The improvement helps but growing vectors this way is still bad practice compared to e.g.: x <- numeric(100); for(i in 1:100) x[i] <- i.↩
This distinct from the string pool hash table. I am not certain of this, but presumably the symbol hash returns the address of the object it is bound to, as opposed to the global string pool, which just returns the CHARSXP with the same string value.↩
e.g. for can loop through the values of an ALTREP sequence without materializing the full vector.↩

The Secret Lives of R Objects