Hashed Mailboxing

17th May 2012

The ideal situation in terms of efficiency for a raytracer when tracing rays around a scene is that the acceleration structures take the brunt of the work (assuming a simple use-case with no complex shading, displacement/subdivision or asset paging) - that is, if you profile the raytracer, the acceleration structure intersection functions are being hit more than the actual intersection functions of the primitives within the acceleration structures. This depends on quite a few things, like the layout of the scene, the acceleration structure, the geometry and how tightly boundary boxes (of either the object or any the acceleration structures use) fit around geometry or other areas.

There’s a balancing point between the density of the geometry being intersected against and how efficient the acceleration structure being used is. With a KDTree (and to a lesser extent with BVH or BIH), this is generally governed by the depth of the acceleration structures. With a Grid acceleration structure, this is done by the number of grid cells the scene area or mesh is covered by. Up to a certain point (and assuming the acceleration structure works well) increasing either of these variables increases the efficiency of the overall ray intersection process, as in ideal situations (generally most of the time), this allows less intersection tests to be needed, because the cells of the acceleration structure storing the primitives store less primitives, so less primitives have to be tested against the ray. However, increasing these numbers comes at a price - the build time of the acceleration structure goes up, as does the amount of memory required to store it. In the KDTree’s case, both of these can quite quickly become prohibitive.

Something that is common with acceleration structures (especially KDTrees, Grids and BIHs) is the fact that primitives within the acceleration structure can often be in multiple nodes or cells at once if they straddle the boundary. Due to the way most ray traversal algorithms work with the respective acceleration structures, they generally involve walking the ray through the structure until it finds nodes or cells with primitives in, and then testing each primitive in that node/cell against the ray. In the case when a ray didn’t intersect with any primitives in one node/cell and so the ray moves to the next node/cell, if some of the primitives in this next node were in the previous node as well, they will have already been intersected against the ray, so they’re not going to intersect that primitive this time. This can be quite wasteful.

In A fast voxel traversal algorithm for raytracing, J. Amanatides and A. Woo introduced a method to avoid this inefficiency, whereby each ray and each primitive have a unique ID, and when a ray is intersected against a primitive, a 2D matrix is marked with the respective ray and primitive IDs, so that a fast lookup can be done in the future to skip the test if the situation arises again. This works very nicely when there are a few number of rays (e.g. only one for physics collision tests or similar) or when there are a fairly low number of primitives and rays. But when the number of primitives and rays are both in the millions, then the amount of memory required to store this matrix becomes ridiculous, not to mention causing huge CPU cache latency problems (due to cache misses), which significantly outweigh any benefit that using the technique might provide.

In Realtime Ray Tracing on Current CPU Architectures, C. Benthin introduced (section 4.4.4) the “Hashed Mailbox”, whereby, because for a particular ray intersection test (assuming a fairly good acceleration structure) the ray will only have to be intersected against a small subset of the total number of primitives that exist, a much smaller amount of memory is actually required in order to still give a useful efficiency boost. So a small amount of memory (with around 64 slots) is allocated, and the slots to use for marking primitives that have been checked are selected based on hashing the primitive ID. This gave on average between 15% to 25% speed increases. This new technique still requires a fair amount of memory (1KB) to be allocated and freed however, which can itself still have an overhead, especially if it is done on a per-ray-intersection basis.

In Ray-triangle intersection algorithm for modern cpu architectures, A. S. Maxim Shevtsov and A. Kapustin introduced “Inverse Mailboxing”, which uses even less memory (32 bytes for an 8-mailbox slot structure), which simply stores the last eight intersected objects on the stack, and so can be easily used within the ray traversal code/algorithm. This method doesn’t guarantee that duplicate object intersection tests will not be made per-ray for any object, but assuming a decent acceleration structure and a limited number of primitives per node or cell, the chances of duplicate checks will be very small, while still providing a useful speedup.

I’ve found using this last technique has for general meshes given between 7%-23% speedups, and in the case of compound objects (objects with sub-objects within them) - where the cost of intersecting the sub-objects can be much more expensive than just a simple primitive intersection - up to a 40% speed increase.