Defining USE_GENERATIONAL_GC selects a non-moving generational collector with two or three generations. The generational collector divides allocated nodes into generations based on some notion of age; younger generations are collected more frequently than older ones. Memory allocated by R_alloc is maintained on a stack. Code that allocates memory with R_alloc must use vmaxget and vmaxset to obtain and reset the stack pointer.
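The usual discipline, in C code called from R and assuming the declarations in R_ext/Memory.h, looks roughly like the sketch below; the helper function and the work it does are made up for illustration:

    #include <R.h>   /* declares R_alloc, vmaxget and vmaxset */

    /* Hypothetical helper: scratch space from R_alloc is reclaimed by
       resetting the R_alloc stack pointer, not by calling free(). */
    void square_into_scratch(double *x, int n)
    {
        void *vmax = vmaxget();              /* save the R_alloc stack pointer */
        double *tmp = (double *) R_alloc(n, sizeof(double));
        for (int i = 0; i < n; i++)
            tmp[i] = x[i] * x[i];            /* use the scratch space */
        vmaxset(vmax);                       /* release everything allocated above */
    }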
After every LEVEL_0_FREQ level 0 collections a level 1 collection is done; after every LEVEL_1_FREQ level 1 collections a level 2 collection occurs. Thus, roughly, every LEVEL_0_FREQ-th collection is a level 1 collection and every (LEVEL_0_FREQ * LEVEL_1_FREQ)-th collection is a level 2 collection. When a level N collection fails to produce at least MinFreeFrac * R_NSize free nodes and MinFreeFrac * R_VSize free vector space, the next collection will be a level N + 1 collection.
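In outline, the scheduling might be expressed as in the sketch below; the constants shown, the counters and force_level are illustrative stand-ins, not the actual memory.c code:

    /* Illustrative sketch of the level scheduling described above. */
    #define LEVEL_0_FREQ 20              /* illustrative values */
    #define LEVEL_1_FREQ 5
    static double MinFreeFrac = 0.2;     /* illustrative value */
    static double R_NSize, R_VSize;      /* stand-ins for the heap size parameters */

    static int count0 = 0, count1 = 0;
    static int force_level = 0;          /* set when a collection frees too little */

    static int next_collection_level(void)
    {
        int level = force_level;
        if (level < 1 && ++count0 >= LEVEL_0_FREQ) { count0 = 0; level = 1; }
        if (level == 1 && ++count1 >= LEVEL_1_FREQ) { count1 = 0; level = 2; }
        return level;
    }

    /* After a level N collection, escalate the next collection if less than
       MinFreeFrac of the node heap or of the vector heap was recovered. */
    static void check_free_space(int level, double free_nodes, double free_vcells)
    {
        if (free_nodes < MinFreeFrac * R_NSize || free_vcells < MinFreeFrac * R_VSize)
            force_level = level < 2 ? level + 1 : 2;
        else
            force_level = 0;
    }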
Memory for nodes is obtained from malloc in chunks called pages. Pages are grouped into classes according to the size of the vector component of the nodes they contain. Occasionally an attempt is made to release unused pages back to the operating system. When pages are released, a number of free nodes equal to R_MaxKeepFrac times the number of allocated nodes in each class is retained. Pages not needed to meet this requirement are released. An attempt to release pages is made every R_PageReleaseFreq level 1 or level 2 collections.
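A self-contained sketch of the retention rule, with made-up data structures standing in for the real page and size-class bookkeeping:

    #include <stdlib.h>

    static double R_MaxKeepFrac = 0.5;   /* illustrative value */

    typedef struct page {
        struct page *next;
        int nodes_per_page;
        int nodes_in_use;                /* 0 means the page is completely free */
        /* node storage would follow here */
    } page;

    typedef struct {
        page *pages;                     /* all pages in this size class */
        long nodes_allocated;            /* nodes currently in use in this class */
        long nodes_free;                 /* free nodes available in this class */
    } size_class;

    /* Release empty pages back to the OS, but keep enough free nodes to cover
       R_MaxKeepFrac times the number of allocated nodes in the class. */
    static void release_pages(size_class *c)
    {
        long keep = (long) (R_MaxKeepFrac * c->nodes_allocated);
        page **link = &c->pages;
        while (*link != NULL) {
            page *p = *link;
            if (p->nodes_in_use == 0 && c->nodes_free - p->nodes_per_page >= keep) {
                *link = p->next;                 /* unlink the empty page */
                c->nodes_free -= p->nodes_per_page;
                free(p);                         /* return it to the OS */
            } else
                link = &p->next;
        }
    }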
R_NSize and R_VSize are used for triggering collections. The initial values set by defaults or command line arguments are used as minimal values. After full collections these levels are adjusted up or down, though not below the minimal values or above the maximal values, towards maintaining heap occupancy within a specified range. When the number of nodes in use reaches R_NGrowFrac * R_NSize, the value of R_NSize is incremented by R_NGrowIncrMin + R_NGrowIncrFrac * R_NSize. When the number of nodes in use falls below R_NShrinkFrac * R_NSize, R_NSize is decremented by R_NShrinkIncrMin + R_NShrinkIncrFrac * R_NSize. Analogous adjustments are made to R_VSize. This mechanism for adjusting the heap size constants is very primitive but hopefully adequate for now; some modeling and experimentation would be useful. We want the heap sizes to be set at levels adequate for the current computations. The present mechanism uses only the size of the current live heap to estimate the current needs; since the current live heap size can be very volatile, the adjustment mechanism only makes gradual adjustments. A more sophisticated strategy would use more of the live heap history.
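The adjustment rule, written out as a sketch with assumed default values (the vector heap gets the analogous treatment; for simplicity everything is kept in doubles here):

    /* Sketch of the node-count adjustment described above; orig_R_NSize stands
       for the initial (minimal) value. */
    static double R_NSize, orig_R_NSize, R_MaxNSize;
    static double R_NGrowFrac = 0.70, R_NShrinkFrac = 0.30;
    static double R_NGrowIncrMin = 40000, R_NGrowIncrFrac = 0.05;
    static double R_NShrinkIncrMin = 0, R_NShrinkIncrFrac = 0.20;

    static void adjust_nsize(double nodes_in_use)
    {
        if (nodes_in_use > R_NGrowFrac * R_NSize)
            R_NSize += R_NGrowIncrMin + R_NGrowIncrFrac * R_NSize;
        else if (nodes_in_use < R_NShrinkFrac * R_NSize)
            R_NSize -= R_NShrinkIncrMin + R_NShrinkIncrFrac * R_NSize;

        /* never fall below the initial (minimal) value or exceed the maximum */
        if (R_NSize < orig_R_NSize) R_NSize = orig_R_NSize;
        if (R_NSize > R_MaxNSize)   R_NSize = R_MaxNSize;
    }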
The benchmark is the nuke example from the help page of the boot package:

    library(boot)
    data(nuclear)
    nuke <- nuclear[, c(1, 2, 5, 7, 8, 10, 11)]
    nuke.lm <- glm(log(cost) ~ date + log(cap) + ne + ct + log(cum.n) + pt,
                   data = nuke)
    nuke.diag <- glm.diag(nuke.lm)
    nuke.res <- nuke.diag$res * nuke.diag$sd
    nuke.res <- nuke.res - mean(nuke.res)
    nuke.data <- data.frame(nuke, resid = nuke.res, fit = fitted(nuke.lm))
    new.data <- data.frame(cost = 1, date = 73.00, cap = 886, ne = 0,
                           ct = 0, cum.n = 11, pt = 1)
    new.fit <- predict(nuke.lm, new.data)
    nuke.fun <- function(dat, inds, i.pred, fit.pred, x.pred) {
        assign(".inds", inds, envir = .GlobalEnv)
        lm.b <- glm(fit + resid[.inds] ~ date + log(cap) + ne + ct +
                    log(cum.n) + pt, data = dat)
        pred.b <- predict(lm.b, x.pred)
        remove(".inds", envir = .GlobalEnv)
        c(coef(lm.b), pred.b - (fit.pred + dat$resid[i.pred]))
    }
    for (i in 1:20)
        print(system.time(boot(nuke.data, nuke.fun, R = 999, m = 1,
                               fit.pred = new.fit, x.pred = new.data)))

The results for the first, tenth and twentieth runs on several architectures are given in the tables below. We can improve locality of reference by the simple trick of sorting the free lists, either on each older generation collection or on each full collection. Sorting on each full collection seems sufficient, so I have adopted that approach (a sketch of such an address sort follows the tables).
              |        PPC          |    PPC, bytecode    |       AMD-K6
              |  1st    10th   20th |  1st    10th   20th |  1st    10th   20th
    no sort   | 50.91  51.67  51.82 | 40.86  48.84  49.88 | 64.10  70.36  72.87
    all old   | 51.62  50.96  50.95 | 39.96  39.82  39.94 | 62.72  61.95  61.51
    full only | 51.47  50.94  51.13 | 40.16  40.41  40.77 | 62.59  62.27  62.47

              |        PPC          |    PPC, bytecode    |       AMD-K6
              |  1st    10th   20th |  1st    10th   20th |   1st     10th    20th
    no sort   | 38.58  41.75  42.86 | 72.05  79.03  82.98 | 152.63  164.84  177.63
    all old   | 37.21  36.86  36.97 | 72.16  72.23  72.48 | 151.28  150.63  151.02
    full only | 37.66  37.43  37.57 | 72.95  73.16  74.13 | 151.25  150.37  151.30
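A minimal sketch of such an address sort, assuming the free cells are kept on a singly linked list; in the collector itself the sort would more likely be folded into the sweep, but the idea is the same:

    #include <stdint.h>
    #include <stdlib.h>

    /* Rebuild a singly linked free list in increasing address order.  The
       freenode layout is a placeholder for the real free cell. */
    typedef struct freenode { struct freenode *next; } freenode;

    static int addr_cmp(const void *a, const void *b)
    {
        uintptr_t x = (uintptr_t) *(freenode * const *) a;
        uintptr_t y = (uintptr_t) *(freenode * const *) b;
        return (x < y) ? -1 : (x > y) ? 1 : 0;
    }

    static freenode *sort_free_list(freenode *head)
    {
        size_t n = 0, i;
        for (freenode *p = head; p != NULL; p = p->next) n++;
        if (n < 2) return head;

        freenode **v = malloc(n * sizeof *v);
        if (v == NULL) return head;          /* out of memory: leave unsorted */
        i = 0;
        for (freenode *p = head; p != NULL; p = p->next) v[i++] = p;
        qsort(v, n, sizeof *v, addr_cmp);

        for (i = 0; i + 1 < n; i++) v[i]->next = v[i + 1];
        v[n - 1]->next = NULL;
        head = v[0];
        free(v);
        return head;
    }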
R_VSize and R_NSize are restricted to remain below the values of R_MaxVSize and R_MaxNSize. Initially these maximal values are both set to INT_MAX, which ensures that the size counters will not wrap on systems with that much memory. On systems with less memory this means the heap will continue to grow until a malloc call fails. In many cases this may be adequate, but there are situations where paging can lead to serious system performance degradation before malloc fails. The functions R_SetMaxVSize and R_SetMaxNSize can be used to set new maximal values; the values will be kept above the current use values. A problem with the current arrangement is that before initialization R_VSize is measured in bytes, while after initialization it is measured in vector cells (typically 8 bytes); the values are also reported to the user in bytes. These conversions create the possibility of integer overflow and are messy. We should change this so that the internal representation is always in vector cells.
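To illustrate the unit problem, here is a sketch of a byte-based limit setter that clamps rather than wraps and respects current use; the names and the byte interface are assumptions, and the real R_SetMaxVSize need not look like this:

    #include <limits.h>

    #define VCELL_BYTES 8                /* a vector cell is typically 8 bytes */

    static long max_vcells;              /* maximum, in vector cells */
    static long vcells_in_use;           /* stand-in for current usage */

    static void set_max_vsize_bytes(double bytes)
    {
        double cells = bytes / VCELL_BYTES;
        if (cells > (double) LONG_MAX)
            cells = (double) LONG_MAX;   /* clamp instead of wrapping */
        if (cells >= (double) vcells_in_use)
            max_vcells = (long) cells;   /* keep the limit above current use */
    }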
Hitting a memory limit raises an error that aborts the current computation. With a more sophisticated error handling system it would be possible to make the limit soft, in the sense of raising a recoverable error so that a user-defined error handler could increase the limit and allow the calculation to continue.
It might be a good idea to bump the maximal limit a bit when it is hit, to ensure there is enough headroom to do any necessary cleanup. We haven't done this yet since it is a pain to do, and so far it looks like enough room to delete big variables is usually available even in artificial test cases. Here is one that triggers a node cell overflow:
    e <- NULL
    while (TRUE) {
        ee <- new.env()
        assign("x", e, envir = ee)
        e <- ee
    }
R_NSize and R_VSize, as well as several other size-related variables in the collector, are defined as int. This is not ideal for 64-bit architectures, where pointers and longs are 64 bits but int is often 32 bits. These definitions should be changed to long (or perhaps something like size_t); I believe there are gcc warning options that can help track down incompatibilities. There are a few places where heap adjustments use occupancy fractions expressed as doubles. I think these will work fine even for 64-bit integers, but this should be reviewed.
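For example (R_size_t is an assumed name, not necessarily what the sources will end up using):

    #include <stddef.h>

    /* Sketch: widen the size-related globals.  size_t is one candidate;
       long would also work on most 64-bit platforms. */
    typedef size_t R_size_t;

    static R_size_t R_NSize, R_VSize;
    static R_size_t R_MaxNSize, R_MaxVSize;

    /* The occupancy fractions can remain doubles: a double represents integers
       exactly up to 2^53, so products such as R_NGrowFrac * R_NSize stay
       accurate even with 64-bit sizes. */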
The current upper limits of INT_MAX on R_VSize and R_NSize are intended to prevent wrapping on systems with huge amounts of memory. They limit the number of nodes and the number of vector cells to about 2,000,000,000 each.
We should review all uses of integer variable types for appropriate size and for possible overflows.
There are several measures of memory use to consider. First, there is the pair of threshold values (R_NSize, R_VSize). They represent limits up to which allocation can occur without a GC. Second, there is the total amount of memory allocated by malloc. This will fluctuate more than the (R_NSize, R_VSize) pair because of efforts to release pages back to the OS. It will usually be well below the limit implied by (R_NSize, R_VSize) right after a GC. Right before a GC it could be above, due to granularity in the page allocation for small nodes, but it will usually be below because the GC will typically be triggered by only one of the thresholds being reached. Currently the total malloc allocation is not recorded, but it could be computed easily if we wanted to (and we may want to if we decide to control total malloc allocation directly); a sketch of such tracking is given below.
Third, there is the actual memory footprint of R in the OS. This is the number that top or the Windows Task Manager should be reporting, if they are accurate (the Task Manager may be, but I think top needs to be taken with a grain of salt). What happens here depends a lot on the quality of the malloc implementation, in particular on whether calls to free are able to reduce the memory footprint.
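Returning to the second measure: if we did decide to record the total malloc allocation, a minimal sketch would route page and large-vector allocations through counting wrappers (the names are illustrative):

    #include <stdlib.h>

    /* Keep a running total of malloc'd bytes. */
    static size_t malloc_in_use = 0;

    static void *tracked_malloc(size_t bytes)
    {
        void *p = malloc(bytes);
        if (p != NULL) malloc_in_use += bytes;
        return p;
    }

    static void tracked_free(void *p, size_t bytes)
    {
        if (p != NULL) malloc_in_use -= bytes;
        free(p);
    }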
For limiting memory footprint we would ideally like to control the third measure, but this is quite difficult on Unix and Windows, or any system with real VM. Controlling malloc is a surrogate that may be adequate.
For preventing a computation from going out of control, using R_MaxNSize and R_MaxVSize should be adequate. It might even be sufficient for controlling memory footprint.
The current approach assumes that a malloc ought to succeed if it is permitted to occur by the current R_NSize and R_VSize settings. If it fails, an error is signaled immediately. With a malloc limit (and maybe even without one) it would make sense to go through a few intermediate steps after a malloc fails and before signaling an error:
- If a malloc fails, run a garbage collection and try the allocation again.
- If malloc still fails, try to release all possible pages and try again.
- If malloc still fails, then signal an error.

A sketch of this staged retry is given below. As always with resource exhaustion errors, an issue is that after signaling the error the problem may still exist, so it may not be possible to do anything about it. A scheme that allows a small reserve for processing the error and cleaning up may be needed.
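A minimal sketch of that staged retry, with placeholder names standing in for whatever the collector actually provides:

    #include <stdlib.h>

    /* Stand-ins for collector facilities; the names are illustrative. */
    extern void run_full_gc(void);
    extern void release_all_free_pages(void);
    extern void mem_error(const char *msg);

    static void *alloc_with_retries(size_t bytes)
    {
        void *p = malloc(bytes);
        if (p == NULL) {                 /* first, run a full collection */
            run_full_gc();
            p = malloc(bytes);
        }
        if (p == NULL) {                 /* then release all free pages */
            release_all_free_pages();
            p = malloc(bytes);
        }
        if (p == NULL)                   /* finally give up */
            mem_error("memory exhausted");
        return p;
    }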
The Memory help page needs to be adjusted once the interface for tuning the GC is established.