That addresses the issue of paths with long tails inside a path tracer. Since everything is done in warps under the hood, keeping threads working will increase occupancy.
Let's say you have a long path which terminates after 6 bounces and another which terminates after 3. The latter's thread could potentially shoot another path for a different sub-frame, which might also take only 3 bounces, and would therefore keep that thread occupied for about as long as the thread handling the long path.
That pixel would then have calculated two samples while the pixel with the longer path has only one. To reach a consistent samples-per-pixel count you would need to track that count per pixel as well and somehow fill in the under-sampled pixels, which in turn means more memory accesses, and memory accesses are the enemy of performance.
The issue is that this needs to be driven by some global heuristic or by the clock cycles spent so far, since, as said, no information about neighboring rays is available.
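A minimal sketch of that per-thread path regeneration idea, assuming a CUDA megakernel-style path tracer. `Params`, `tracePath()`, `targetSpp` and `clockBudget` are hypothetical placeholders for whatever your renderer uses; `tracePath()` stands in for tracing one complete path:

```
// Hypothetical launch parameters; not an existing API.
struct Params
{
    unsigned int  width, height;
    unsigned int  targetSpp;
    long long     clockBudget;     // per-launch budget in clock64() ticks
    unsigned int* sampleCount;     // samples taken so far, per pixel
    float3*       accum;           // accumulated radiance, per pixel
};

// Stand-in for tracing one complete path (implementation elsewhere).
__device__ float3 tracePath(const Params& p, unsigned int pixel, unsigned int sample);

extern "C" __global__ void pathTraceRegenerate(Params params)
{
    const unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    const unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= params.width || y >= params.height)
        return;

    const unsigned int pixel = y * params.width + x;
    const long long    start = clock64();

    // Keep shooting complete paths for this pixel until either the sample
    // budget or the per-launch clock budget is exhausted. Threads whose
    // paths terminate early simply loop more often and stay busy.
    while (params.sampleCount[pixel] < params.targetSpp &&
           clock64() - start < params.clockBudget)
    {
        const float3 radiance = tracePath(params, pixel, params.sampleCount[pixel]);
        params.accum[pixel].x += radiance.x;
        params.accum[pixel].y += radiance.y;
        params.accum[pixel].z += radiance.z;
        params.sampleCount[pixel] += 1;   // tracked so under-sampled pixels can be filled later
    }
}
```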
Another approach would be to pick new rays for other pixels from a pool. That means, for example, when rendering a 3840x2160 image you don't actually launch with the full size but with a smaller 2D launch, e.g. a quarter of it, and each thread picks the next ray from the list of remaining rays as soon as it's done with one path. That requires atomics. Threads handling shorter paths would simply pick up more rays.
(I doubt that will be faster than just launching with the full size and letting the implementation schedule that internally.)
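For illustration, a hedged sketch of that pool idea: a persistent quarter-size launch where each thread fetches the next unhandled ray index via an atomic counter. `PoolParams`, `tracePath()` and the counter layout are illustrative assumptions, not an existing API:

```
struct PoolParams
{
    unsigned int  width, height;
    unsigned int  totalRays;       // e.g. width * height * samplesPerPixel
    unsigned int* nextRay;         // global work counter in device memory, starts at 0
    float3*       accum;           // accumulated radiance, per pixel
};

// Stand-in for tracing one complete path (implementation elsewhere).
__device__ float3 tracePath(const PoolParams& p, unsigned int pixel, unsigned int sample);

extern "C" __global__ void pathTracePool(PoolParams params)
{
    while (true)
    {
        // Grab the next unhandled ray; shorter paths loop back here sooner
        // and therefore pick up more rays over the lifetime of the launch.
        const unsigned int rayIndex = atomicAdd(params.nextRay, 1u);
        if (rayIndex >= params.totalRays)
            return;   // pool exhausted, this thread retires

        const unsigned int pixel  = rayIndex % (params.width * params.height);
        const unsigned int sample = rayIndex / (params.width * params.height);
        const float3 radiance = tracePath(params, pixel, sample);

        // Different threads may write the same pixel, so accumulate atomically.
        atomicAdd(&params.accum[pixel].x, radiance.x);
        atomicAdd(&params.accum[pixel].y, radiance.y);
        atomicAdd(&params.accum[pixel].z, radiance.z);
    }
}
```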
The described wavefront approach is slightly different. There you would also shoot fewer rays per launch and, depending on the live state of each ray, either continue it with the next path segment in the following launch or replace it with another not-yet-handled path.
That requires an analysis step between the individual wavefront launches, which runs exactly into the memory bandwidth limits I described earlier.
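An illustrative sketch of what such an analysis step could look like: live paths are compacted into the next launch's work list and dead slots are refilled with not-yet-started paths. Every pass reads and writes the full path state, which is where the memory bandwidth goes. `PathState`, `WavefrontParams` and `makeCameraPathState()` are hypothetical names:

```
struct PathState { /* origin, direction, throughput, pixel, ... */ };

struct WavefrontParams
{
    unsigned int     numPathsInFlight;  // size of the current wavefront
    unsigned int     totalPaths;        // all paths of the frame
    const int*       pathAlive;         // per-path flag written by the trace pass
    const PathState* pathStateIn;
    PathState*       pathStateOut;      // compacted work list for the next launch
    unsigned int*    numLiveOut;        // atomic counter, reset to 0 before this pass
    unsigned int*    nextNewPath;       // atomic counter over not-yet-started paths
};

// Stand-in for generating a fresh camera path (implementation elsewhere).
__device__ PathState makeCameraPathState(const WavefrontParams& p, unsigned int pathIndex);

extern "C" __global__ void compactPaths(WavefrontParams params)
{
    const unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= params.numPathsInFlight)
        return;

    if (params.pathAlive[i])
    {
        // Keep the live path: copy its full state into the next wavefront.
        const unsigned int dst = atomicAdd(params.numLiveOut, 1u);
        params.pathStateOut[dst] = params.pathStateIn[i];
    }
    else
    {
        // Replace the finished path with a fresh, not-yet-handled one.
        const unsigned int newPath = atomicAdd(params.nextNewPath, 1u);
        if (newPath < params.totalPaths)
        {
            const unsigned int dst = atomicAdd(params.numLiveOut, 1u);
            params.pathStateOut[dst] = makeCameraPathState(params, newPath);
        }
    }
}
```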
The question is always "Is it worth it?", and especially on RTX boards you need to keep the RT cores busy, which works better when using the shader pipeline than when stopping after each wavefront.
(When reading that article, note that OptiX Prime has been discontinued in OptiX 7 for that reason.)
Another approach is to shoot all samples per pixel at once and work over the image in tiles like many final-frame renderers do. That improves the locality of the paths per pixel, meaning the path lengths should be relatively similar within a tile.
In practice my experiments showed that RTX boards handle full-screen launches similarly well, though.
That means the underlying scheduler is already pretty sophisticated, so all of this would need to be evaluated on a case-by-case basis.
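For completeness, a host-side sketch of the tiled scheme described above: one launch per tile, with all samples of that tile's pixels taken inside the launch. `Tile` and `launchTile()` are hypothetical stand-ins for a smaller 2D launch (e.g. a tile-sized launch with a tile offset in the launch parameters):

```
struct Tile { unsigned int x, y, width, height, spp; };

void launchTile(const Tile& tile);   // wraps the actual launch (implementation elsewhere)

void renderTiled(unsigned int width, unsigned int height,
                 unsigned int tileSize, unsigned int spp)
{
    for (unsigned int ty = 0; ty < height; ty += tileSize)
    {
        for (unsigned int tx = 0; tx < width; tx += tileSize)
        {
            Tile tile;
            tile.x      = tx;
            tile.y      = ty;
            tile.width  = (width  - tx < tileSize) ? width  - tx : tileSize; // clamp at the border
            tile.height = (height - ty < tileSize) ? height - ty : tileSize;
            tile.spp    = spp;    // all samples of the tile in one go
            launchTile(tile);     // paths within a tile tend to be of similar length
        }
    }
}
```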
Divergence hurts quite a lot, and it can make sense to sort rays into buckets with similar directions, for example into octants based on the sign bits of their direction components.
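A minimal sketch of that octant bucketing: a 3-bit key built from the sign bits of the ray direction. Rays sharing a key head into the same octant and can be binned or sorted by this key before tracing (the function name is just illustrative):

```
__device__ __forceinline__ unsigned int directionOctant(const float3& dir)
{
    // Bit 0: x negative, bit 1: y negative, bit 2: z negative.
    return (dir.x < 0.0f ? 1u : 0u) |
           (dir.y < 0.0f ? 2u : 0u) |
           (dir.z < 0.0f ? 4u : 0u);
}
```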
Mind that you cannot do arbitrarily much work per launch, since under Windows GPUs running in WDDM mode are subject to a 2-second Timeout Detection and Recovery (TDR) mechanism. That does not apply to GPUs dedicated to compute work running in the Tesla Compute Cluster (TCC) driver mode (not available on GeForce).
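The usual way around the TDR limit is to render progressively, one sub-frame (e.g. one sample per pixel) per launch, so that no single launch gets anywhere near 2 seconds. A hedged host-side sketch; `renderSubFrame()` is a hypothetical wrapper around the actual launch plus a stream synchronize:

```
// Stand-in for one launch that adds one sample per pixel to the accumulation buffer.
void renderSubFrame(unsigned int width, unsigned int height, unsigned int sampleIndex);

void renderProgressive(unsigned int width, unsigned int height, unsigned int targetSpp)
{
    for (unsigned int sample = 0; sample < targetSpp; ++sample)
    {
        renderSubFrame(width, height, sample);   // each launch stays short
        // Optionally display or denoise the running average between launches.
    }
}
```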