Curve Rendering for Fur and Hair

Furry Killeroo render

I’ve implemented Curve primitive ray intersection in Imagine, based on Koji Nakamaru and Yoshio Ohno’s 2002 paper “Ray Tracing for Curves Primitive”. Basically, it involves projecting each curve’s control point positions into orthographic ray-space, so that the main intersection test can be done as a two-dimensional curve width test looking down the ray, after which the depth (t) of the hit along the ray can be worked out.
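As a rough illustration of the ray-space idea, here is a minimal C++ sketch for a single straight segment. It is not Imagine’s code: the names `RaySpace`, `makeRaySpace`, `toRaySpace` and `intersectStraightSegment` are hypothetical, and a constant half-width along the segment is assumed.

```cpp
#include <algorithm>
#include <cmath>

struct Vec3 { double x, y, z; };

static Vec3 sub(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static double dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
static Vec3 cross(Vec3 a, Vec3 b) {
    return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}
static Vec3 normalise(Vec3 v) {
    const double len = std::sqrt(dot(v, v));
    return {v.x / len, v.y / len, v.z / len};
}

// Orthonormal frame whose w axis is the normalised ray direction.
struct RaySpace { Vec3 origin, u, v, w; };

RaySpace makeRaySpace(Vec3 origin, Vec3 dir) {
    const Vec3 w = normalise(dir);
    // Pick a helper axis that is not parallel to w.
    const Vec3 helper = (std::fabs(w.x) > 0.9) ? Vec3{0.0, 1.0, 0.0} : Vec3{1.0, 0.0, 0.0};
    const Vec3 u = normalise(cross(helper, w));
    const Vec3 v = cross(w, u);
    return {origin, u, v, w};
}

// Project a world-space point: x/y are the offset from the ray, z is the depth along it.
Vec3 toRaySpace(const RaySpace& rs, Vec3 p) {
    const Vec3 d = sub(p, rs.origin);
    return {dot(d, rs.u), dot(d, rs.v), dot(d, rs.w)};
}

// In ray space the ray is the z axis, so the hit test reduces to a 2D distance
// test between the origin (0,0) and the projected segment.
bool intersectStraightSegment(const RaySpace& rs, Vec3 p0, Vec3 p1,
                              double halfWidth, double& tHit) {
    const Vec3 a = toRaySpace(rs, p0);
    const Vec3 b = toRaySpace(rs, p1);
    const double dx = b.x - a.x, dy = b.y - a.y;
    const double lenSq = dx * dx + dy * dy;
    // Parameter of the closest point on the 2D segment to the origin, clamped to [0,1].
    double s = 0.0;
    if (lenSq > 0.0) {
        s = std::max(0.0, std::min(1.0, -(a.x * dx + a.y * dy) / lenSq));
    }
    const double cx = a.x + s * dx, cy = a.y + s * dy;
    if (cx * cx + cy * cy > halfWidth * halfWidth) return false;  // width test failed
    tHit = a.z + s * (b.z - a.z);  // depth along the (normalised) ray
    return tHit > 0.0;
}
```

Because the frame is orthonormal and the direction is normalised, the z coordinate in ray space can be used directly as the ray t value.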

For straight curve primitives, this is sufficient, but for actual curves with any curvature down the length, splitting the projected curve recursively and performing the intersection test on these split curves is necessary. The recursion level needed to ensure accurate intersection depends on the curvature of each curve.
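The splitting step can be sketched as follows, assuming the curves are cubic Béziers (an assumption for illustration, since the curve basis isn’t stated here) and that each split is made at the parametric midpoint with de Casteljau’s algorithm. `CubicBezier`, `splitCurve` and `countLeafSegments` are hypothetical names.

```cpp
#include <array>

struct Vec3 { double x, y, z; };

static Vec3 midpoint(Vec3 a, Vec3 b) {
    return {(a.x + b.x) * 0.5, (a.y + b.y) * 0.5, (a.z + b.z) * 0.5};
}

using CubicBezier = std::array<Vec3, 4>;

// Split a cubic Bezier at t = 0.5 with de Casteljau's algorithm.
// Both halves together trace exactly the same curve as the input.
void splitCurve(const CubicBezier& c, CubicBezier& left, CubicBezier& right) {
    const Vec3 p01 = midpoint(c[0], c[1]);
    const Vec3 p12 = midpoint(c[1], c[2]);
    const Vec3 p23 = midpoint(c[2], c[3]);
    const Vec3 p012 = midpoint(p01, p12);
    const Vec3 p123 = midpoint(p12, p23);
    const Vec3 mid = midpoint(p012, p123);  // point on the curve at t = 0.5
    left = CubicBezier{{c[0], p01, p012, mid}};
    right = CubicBezier{{mid, p123, p23, c[3]}};
}

// Recurse to a fixed depth and count the leaf segments. A real intersector
// would run the straight-segment width test at each leaf instead of counting.
int countLeafSegments(const CubicBezier& c, int depth) {
    if (depth == 0) return 1;
    CubicBezier left, right;
    splitCurve(c, left, right);
    return countLeafSegments(left, depth - 1) + countLeafSegments(right, depth - 1);
}
```

In this sketch the number of leaf segments doubles with every level, which is why the required depth has such a direct effect on intersection cost.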

This recursive splitting obviously affects the performance of the algorithm. While intersecting straight curves is fairly fast, a curve that bends gently by around 45 degrees from root to tip needs a recursive splitting depth of six, which results in 32 recursive splits and a total of 64 intersections across the original curve and the recursively split curves. This is unfortunate, as to some extent it makes rendering non-straight curves impractical for reasonable numbers (100,000+) of curves.

For the moment, I’m setting the resultant geometric and shader normals from any intersection as facing back along the original camera ray, so that the normal always faces the camera. This is sufficient for very thin curves.

Hairy Cube render

I’ve also implemented a set of Hair BSDFs for Diffuse and Specular, based on the 1989 Kajiya and Kay paper, which is commonly used. This is optimised for very thin curves, with no effective normal change across the curve, but with a tangent value which can be calculated from the intersection position on the ray.
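For reference, the tangent-based terms of the Kajiya and Kay model can be sketched as below. This is a minimal illustration rather than Imagine’s BSDF code: the function names are hypothetical, all vectors are assumed to be unit length, and both `toLight` and `toEye` are assumed to point away from the hair. With that convention the mirror cone satisfies T·E = -T·L, which gives the sign used in the specular term; formulations that write +(T·L)(T·E) use a light vector pointing along the direction of light travel instead.

```cpp
#include <algorithm>
#include <cmath>

struct Vec3 { double x, y, z; };

static double dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Sine of the angle between two unit vectors, derived from their dot product.
static double sineFromCosine(double c) {
    return std::sqrt(std::max(0.0, 1.0 - c * c));
}

// Diffuse term: proportional to the sine of the angle between tangent and light.
double kajiyaKayDiffuse(Vec3 tangent, Vec3 toLight) {
    return sineFromCosine(dot(tangent, toLight));
}

// Specular term: cosine of the angle between the eye vector and the mirror
// cone around the tangent, raised to a shininess exponent.
double kajiyaKaySpecular(Vec3 tangent, Vec3 toLight, Vec3 toEye, double exponent) {
    const double tl = dot(tangent, toLight);
    const double te = dot(tangent, toEye);
    const double c = sineFromCosine(tl) * sineFromCosine(te) - tl * te;
    return std::pow(std::max(0.0, c), exponent);
}
```

Neither term uses a surface normal, which is why the model suits very thin curves where only the tangent is meaningful.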

For the moment, I’m storing curves in an acceleration structure, which works well for very short curves or longer curves which are axis-aligned. For anything else (long curves running diagonally across dimensions) it borders on ineffectual, as the resulting axis-aligned bounding boxes for each curve are extraordinarily large, with the boxes of multiple curves often overlapping each other. I had hoped that spatial partitioning (with curve clipping to bounding boxes) would improve this considerably, as it’s fairly useful for triangles, but the improvement is nowhere near as good as I had hoped (although it does provide a slight intersection speedup compared to object partitioning).

So for longer hairs, I’m going to have to think about how to speed this up considerably, as currently rendering long curved curves is orders of magnitude more expensive than I would have liked. It’s also going to either involve work to model and simulate hair strand interaction more, or import curves from elsewhere, as the method of generating hairs around meshes (importance sampling positions on each triangle for the root positions) and giving a random tilt or curve only really works with fur-type curves.

Carpet



KDTree Benchmarks

I’ve been meaning to do some benchmarks on KDTree building / rendering for some time, given that although I’ve arrived at a set of build criteria which generally give the best intersection performance, I’d never actually done comparisons of the pros/cons of the different variables when building KDTrees and their cost/benefit ratio regarding intersection performance, and had never seen any published comprehensive benchmarks of KDTrees which tested more than one or two variables at a time. I’m also unlikely to play around with KDTrees much more, as I have to some degree moved over to using SSE-optimised BVH acceleration structures which generally provide slightly better render times, but orders-of-magnitude faster build time, thanks to standard binning.

Caveats: First of all, these tests are of my implementation of a KDTree. In certain configurations, it’s very close to PBRT’s reference implementation, but I’ve optimised the traversal code to reduce branching, and added “Perfect Split” clipping and min/max binning approximation as options while building. I’ve also added inverse hashed mailboxing to ray intersection. There are no guarantees as to the correctness of my implementation, although in testing against other renderers which use KDTrees it seems highly competitive, and it is definitely faster than PBRT’s implementation - ~2x faster with an Empty Bonus of 0.9, clipping and inverse hashed mailboxing.

So while the following results are fairly comprehensive, there are certain build criteria I didn’t bother testing, and the tests don’t in any way cover the leaf node threshold size, which I’m pretty certain will have a strong influence on both build time and intersection time. In addition, due to the time it’s taken to accumulate the results, I’ve had to reduce the number of test scenes from the planned six down to three, which also means these tests could be more thorough. I also recorded acceleration structure size, from which - because I know the KDTree nodes are 8 bytes each - I can work out the number of nodes the resulting KDTree had. Unfortunately, I didn’t record the breakdown of interior / leaf nodes, which would have been useful. I’m not showing those numbers in these results, but I use them in the results section to analyse some of the timing numbers.

The results of acceleration structure build time are also slightly misleading, although still relevant, because I’ve disabled multi-threaded building of the KDTree. Building is not a completely parallelisable problem: while it definitely is possible to parallelise KDTree building to a degree, the amount of parallelism possible depends on the criteria used for building. For example, with “Greedy” n log n SAH splitting, each thread needs to compile its own copy of the list of edges, which either requires a lot more memory or, if you break the task down into separate chunks, a merge-sort stage at the end. It also causes a lot of cache thrashing, generally limiting the scalability of multithreading. So while the build times are not realistic in terms of “wall clock” time (I generally get a 3-4x speedup from multithreading the KDTree builds), they are consistently single-threaded for all configurations and the render times are consistently completely parallel, which allows a consistent comparison of the cost/benefit ratio of building a KDTree in a particular configuration vs. the render time.

Testing methodology / criteria

The theoretical best way to test ray intersection performance of an acceleration structure tends to be by casting well-distributed rays at the centre of the acceleration structure from a sphere around it, simulating the kind of incoherent rays that path tracing produces off diffuse surfaces. However, I’d prefer to keep tests more practical, so I’ve decided to test different meshes by putting them into a fully-enclosed room which has diffuse materials and a single area light, and timing how long it takes to build the acceleration structure for the main test mesh and render the resulting image. The number of ray bounces for all ray types will be four, and the render resolution and number of samples-per-pixel will be the same for all tests: 1024 x 768 and 128 respectively.

Instead of baking down all the triangles into one main scene acceleration structure, I’m going to use the two-level acceleration structure hierarchy I use for instancing, such that the scene has an acceleration structure containing all objects, and each object’s geometry instance has an acceleration structure of the triangle mesh. This allows me to only alter the configuration of the central test object’s KDTree, while keeping the other acceleration structures for the walls, floor, etc and scene constant. The acceleration structures for the wall, floor and ceiling will each only have two triangles anyway, so will immediately produce leaf nodes.

For building the acceleration structure, I’m going to test two different methods: min/max binning SAH approximation, and the standard “Greedy” n log n SAH version which PBRT implements. For both methods, I’m going to test always splitting on the largest volume extent axis vs. checking all three available axes. For the “Greedy” method, I’m also going to test the effect the “Empty bonus” value has, and what effect clipping triangles (“Perfect splits”) has.

The only variable that will be tested for rendering itself is enabling/disabling Inverse Hashed mailboxing.

Test Scenes:

I’ll be testing the following three scenes:

Scene 1: Stanford Dragon: 871K triangles, in axis-aligned orientation

Scene 1 example render

Scene 2: Robot: 1.6M triangles, at 45-degree rotation to axis-alignment

Scene 2 example render

Scene 3: XYZ Dragon: 7.2M triangles, axis-aligned orientation

Scene 3 example render

Because I’m using a two-level hierarchy of acceleration structures, in order to test Scene 2 fairly with regard to the axis alignment rotation, I’ve re-baked the mesh out in world-space with a 45-degree rotation around the Y (up) axis, so that its triangles are rotated within object-space as well.

KDTree leaf node threshold size is always 6, and for min/max binning approximation, when the number of objects left to be split is less than 8192, I switch over to the normal “Greedy” implementation - otherwise, I’ve found that min/max binning does a pretty poor job of splitting at that level. Published papers on the min/max binning algorithm for KDTrees seem to show similar issues.
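For readers unfamiliar with min/max binning, the counting step can be sketched along one axis as follows. This is a simplified illustration with hypothetical names (`Extent`, `SplitCounts`, `minMaxBinCounts`), not the implementation benchmarked here: each primitive’s minimum and maximum are dropped into bins, and running sums over the bins give the number of primitives on each side of every bin boundary without sorting any edges.

```cpp
#include <algorithm>
#include <vector>

struct Extent { double lo, hi; };  // a primitive's bounds along the chosen axis

struct SplitCounts {
    // Entry k-1 describes the candidate plane at bin boundary k (k = 1..numBins-1).
    std::vector<int> below;
    std::vector<int> above;
};

SplitCounts minMaxBinCounts(const std::vector<Extent>& prims,
                            double nodeLo, double nodeHi, int numBins) {
    std::vector<int> minBins(numBins, 0), maxBins(numBins, 0);
    const double scale = numBins / (nodeHi - nodeLo);
    auto binOf = [&](double x) {
        const int b = static_cast<int>((x - nodeLo) * scale);
        return std::max(0, std::min(numBins - 1, b));
    };
    for (const Extent& e : prims) {
        ++minBins[binOf(e.lo)];  // bin where the primitive starts
        ++maxBins[binOf(e.hi)];  // bin where the primitive ends
    }
    SplitCounts out;
    out.below.resize(numBins - 1);
    out.above.resize(numBins - 1);
    int running = 0;
    for (int k = 1; k < numBins; ++k) {  // primitives starting before plane k
        running += minBins[k - 1];
        out.below[k - 1] = running;
    }
    running = 0;
    for (int k = numBins - 1; k >= 1; --k) {  // primitives ending at or after plane k
        running += maxBins[k];
        out.above[k - 1] = running;
    }
    return out;
}
```

Only the bin boundaries are considered as split candidates, which is where the approximation comes from: with few primitives left, the fixed boundaries are a coarse fit, and an exact method becomes the better choice.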

The specific times are recorded within the code using gettimeofday() to capture the respective start/end times, calculating the difference and converting it to double precision, giving a number in seconds. That number is rounded to 5 decimal places.
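The timing approach described above looks roughly like this. It is a sketch of the method rather than the actual benchmark code, and `elapsedSeconds` and `timeSeconds` are hypothetical helper names.

```cpp
#include <sys/time.h>
#include <cmath>

// Difference between two gettimeofday() samples, in seconds as a double,
// rounded to 5 decimal places.
double elapsedSeconds(const timeval& start, const timeval& end) {
    const double secs = static_cast<double>(end.tv_sec - start.tv_sec) +
                        static_cast<double>(end.tv_usec - start.tv_usec) * 1e-6;
    return std::round(secs * 1e5) / 1e5;
}

// Time an arbitrary callable, e.g. a build or render step.
template <typename Work>
double timeSeconds(Work&& work) {
    timeval start, end;
    gettimeofday(&start, nullptr);
    work();
    gettimeofday(&end, nullptr);
    return elapsedSeconds(start, end);
}
```

A call such as `timeSeconds([&] { buildTree(); })` would then return the build time in seconds, where `buildTree` stands in for whatever is being measured.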

For all test configurations, I’ve run six independent tests in a clean state, and the results shown here are the mean average of those results, again rounded to 5 decimal places. For the last two tests of Scene 3, there were differences in time for some of the tests outside the margin of error, so I ran a few more tests for those. The tests were all run on the same computer (Dual Xeon, Sandy Bridge micro-arch), running Linux 3.5 kernel, and all test executables were built with GCC 4.7.1 with the same options. I also tried to ensure that the CPU core temperatures were below 60 degC before running each test.

Test Results / Analysis

Scene 1

All Axes | Build (s) | Render (s) | Render with mailboxing (s)
No       | 2.67848   | 100.58898  | 71.75651
Yes      | 3.92627   | 98.72142   | 71.91679

With min/max binning approximation, build times are relatively fast (a fair amount of the time is taken up by the “Greedy” partitioning which takes over after 8192 triangles), and surprisingly, while checking all three axes instead of just the single largest extent axis takes longer, it doesn’t take three times as long, which is what I would have expected. I guess some of the time would be due to memory allocation, which when checking all three axes is only done once.

On the rendering side, checking all axes while building seems to have very little effect in this case, indicating that the extra time penalty for building might not always be worth the effort. Enabling Inverse Hashed Mailboxing has a huge effect on render times, reducing them to roughly 71-73% of the times without it.

All Axes | Empty bonus | Clipping | Build (s) | Render (s) | Render with mailboxing (s)
No       | 0.0         | No       | 4.81748   | 84.54154   | 63.03533
No       | 0.5         | No       | 4.88900   | 83.79326   | 62.14473
No       | 0.9         | No       | 7.51690   | 79.24561   | 59.32178
Yes      | 0.5         | No       | 8.11435   | 86.87896   | 67.30279
Yes      | 0.9         | No       | 10.87438  | 72.90415   | 56.69851
No       | 0.9         | Yes      | 14.48284  | 73.29534   | 54.64541
Yes      | 0.9         | Yes      | 25.01317  | 67.64028   | 52.54923

With the “Greedy” n log n partitioning method, build times are more expensive, and the “Empty bonus” value, used when building to maximise - or cut away - empty space has a very big effect on both the build time and the render time. Based on the memory allocation numbers I noted, and thus the number of nodes in each KDTree, I can see that having a bigger empty bonus leads to deeper (and therefore bigger) trees, as expected, but also better rendering times. Checking all three axes gives slightly better render times with an Empty Bonus of 0.9, but worse render times with one of 0.5. Triangle clipping is very expensive at build-time in this test, and only provides slight speed improvements at render time.

Although faster to build, the min/max binning approximation cannot generate trees as good as those from the “Greedy” method, so render times are significantly faster with trees built with the “Greedy” method. Once again, enabling Inverse Hashed Mailboxing significantly reduces render times across the board.

Scene 2

All Axes | Build (s) | Render (s) | Render with mailboxing (s)
No       | 12.36240  | 170.46582  | 131.55602
Yes      | 14.05898  | 169.83621  | 131.14230

Min/max binning really struggled with this mesh, not really partitioning the triangles very well, and leading to very large leaf nodes (over 400 triangles in some cases) once the depth limit (35) had been hit. Checking all three axes gave barely detectable improvements at render time.

Again, enabling Inverse Hashed Mailboxing has a similar effect to that seen previously, reducing render times to roughly 77% of the times without it.

All Axes | Empty bonus | Clipping | Build (s) | Render (s) | Render with mailboxing (s)
No       | 0.0         | No       | 18.18691  | 133.51584  | 93.14001
No       | 0.5         | No       | 18.32624  | 133.04663  | 93.46723
No       | 0.9         | No       | 71.08561  | 141.30970  | 98.75770
Yes      | 0.5         | No       | 24.46969  | 129.35730  | 95.41422
Yes      | 0.9         | No       | 85.27671  | 131.80109  | 96.49541
No       | 0.9         | Yes      | 31.75843  | 90.94120   | 70.42518
Yes      | 0.9         | Yes      | 52.58952  | 87.22592   | 63.87435

Once again, with the “Greedy” partitioning method, build times are more expensive, but with this mesh, a more aggressive Empty Bonus value produced deeper trees with no benefit - I think this is down to the fact that the mesh has been rotated at 45 degrees to the axis alignment, leading to non-tight AABBs. This is backed up by the fact that the tests with “Clipping” enabled (which spatially clips triangles to test split criteria as well as clipping a node’s BB to its parent node’s, thus removing overlap) not only build faster despite doing more work, but have significantly reduced render times. Checking all three axes in the final test with clipping enabled is much more expensive at build time over testing a single axis, but gives a reasonable render time improvement.

Once again, enabling Inverse Hashed Mailboxing significantly reduces render times across the board.

Scene 3

All Axes | Build (s) | Render (s) | Render with mailboxing (s)
No       | 28.78550  | 142.52660  | 118.61174
Yes      | 40.60875  | 141.60918  | 117.62730

In this test with min/max binning, checking all three axes has a fairly moderate penalty at build time over checking only one axis and provides a very slight render time improvement. Enabling Inverse Hashed Mailboxing provides a significant render time improvement once again.

All Axes | Empty bonus | Clipping | Build (s) | Render (s) | Render with mailboxing (s)
No       | 0.0         | No       | 51.80268  | 128.99604  | 105.48060
No       | 0.5         | No       | 52.92773  | 128.41903  | 104.65014
No       | 0.9         | No       | 71.33298  | 123.14448  | 100.40070
Yes      | 0.5         | No       | 91.35005  | 138.43419  | 112.33473
Yes      | 0.9         | No       | 109.65947 | 123.00479  | 100.35440
No       | 0.9         | Yes      | 112.55201 | 117.00627  | 95.81172
Yes      | 0.9         | Yes      | 209.72551 | 117.77966  | 95.89430

The “Greedy” partitioning method takes a significant amount of time to build with this mesh, sometimes taking longer to build than it does to render (although building is single threaded as I’ve disabled multi-threaded building). Except for a strange result in the test with an Empty Bonus of 0.5 and checking all axes, which for some reason produced huge trees (based on the number of nodes allocated) but still with large leaf nodes, leading to slow performance, the same general picture appears, although this time the “Empty Bonus” variable has much less of an effect, as do clipping and checking all axes. This is probably because this mesh is generally axis-aligned and its triangles are extremely small. Given the build-time penalty for clipping and checking all axes, the trade-off of build time against render time doesn’t seem as worthwhile in this test.

Inverse Hashed Mailboxing once again brings down the render times in all cases.

Conclusion

KDTree building is very expensive, especially with “Greedy” n log n building, but in all cases this generates faster trees for rendering than min/max binning approximation does. The “Empty Bonus” variable seems to have variable results, with more aggressive values giving deeper trees but faster render times for some meshes, but producing worse trees in other cases. With “Perfect Split” clipping enabled, build time is even more expensive, but with meshes that don’t have tight AABBs, seems to be well worth the extra work. Checking all axes doesn’t really seem to provide that much benefit, but with min/max binning, it’s possible to parallelise checking all three axes at once with SSE intrinsics, so the cost/benefit payoff is possibly there, although that’s not possible for KDTree building with the “Greedy” method or when spatially clipping triangles, so standard scalar code needs to be used, increasing the work that needs to be done.

Using Inverse Hashed Mailboxing provides no build-time penalty and significantly reduces render time in all situations.

Future investigation: Given how much better the results of KDTree building are with “Greedy” building, I’d be interested in what benefit using min/max binning over standard binning has with BVH acceleration structures, and whether the huge potential penalty for doing “Greedy” building with BVHs would provide any render time benefits.



Hashed Mailboxing

The ideal situation in terms of efficiency for a raytracer when tracing rays around a scene is that the acceleration structures take the brunt of the work (assuming a simple use-case with no complex shading, displacement/subdivision or asset paging) - that is, if you profile the raytracer, the acceleration structure intersection functions are being hit more than the actual intersection functions of the primitives within the acceleration structures. This depends on quite a few things, like the layout of the scene, the acceleration structure, the geometry and how tightly bounding boxes (of either the object or any that the acceleration structures use) fit around geometry or other areas.

There’s a balancing point between the density of the geometry being intersected against and how efficient the acceleration structure being used is. With a KDTree (and to a lesser extent with BVH or BIH), this is generally governed by the depth of the acceleration structure. With a Grid acceleration structure, it is governed by the number of grid cells covering the scene area or mesh. Up to a certain point (and assuming the acceleration structure works well), increasing either of these variables increases the efficiency of the overall ray intersection process: in most situations fewer intersection tests are needed, because each cell of the acceleration structure stores fewer primitives, so fewer primitives have to be tested against the ray. However, increasing these numbers comes at a price - the build time of the acceleration structure goes up, as does the amount of memory required to store it. In the KDTree’s case, both of these can quite quickly become prohibitive.

Something common to acceleration structures (especially KDTrees, Grids and BIHs) is that primitives can often be in multiple nodes or cells at once if they straddle a boundary. Most ray traversal algorithms walk the ray through the structure until it reaches a node or cell containing primitives, and then test each primitive in that node/cell against the ray. When the ray doesn’t hit anything in one node/cell and moves on to the next, any primitives in the next node that were also in the previous node have already been tested against the ray, so repeating the test is redundant. This can be quite wasteful.

In “A fast voxel traversal algorithm for ray tracing”, J. Amanatides and A. Woo introduced a method to avoid this inefficiency, whereby each ray and each primitive have a unique ID, and when a ray is intersected against a primitive, a 2D matrix is marked with the respective ray and primitive IDs, so that a fast lookup can be done in the future to skip the test if the situation arises again. This works very nicely when there is only a small number of rays (e.g. only one for physics collision tests or similar) or when there are fairly low numbers of both primitives and rays. But when the numbers of primitives and rays are both in the millions, the amount of memory required to store this matrix becomes ridiculous, not to mention the huge latency from CPU cache misses, and these costs significantly outweigh any benefit that the technique might provide.

In “Realtime Ray Tracing on Current CPU Architectures”, C. Benthin introduced (section 4.4.4) the “Hashed Mailbox”. The idea is that for a particular ray intersection test (assuming a fairly good acceleration structure) the ray only has to be intersected against a small subset of the total number of primitives, so a much smaller amount of memory is enough to still give a useful efficiency boost. A small amount of memory (with around 64 slots) is allocated, and the slot used to mark a primitive as checked is selected by hashing the primitive ID. This gave speed increases of between 15% and 25% on average. This technique still requires a fair amount of memory (1KB) to be allocated and freed, however, which can itself have an overhead, especially if it is done on a per-ray-intersection basis.
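A hashed mailbox along these lines can be sketched as follows. This is a simplified illustration, not the layout from the thesis: the class name, the one-mailbox-per-ray usage and the trivial hash (the low bits of the primitive ID) are all assumptions, and because each slot here holds only a 4-byte ID, the footprint is smaller than the 1KB figure quoted above.

```cpp
#include <array>
#include <cstdint>

constexpr std::uint32_t kMailboxSlots = 64;           // must be a power of two
constexpr std::uint32_t kEmptySlot = 0xFFFFFFFFu;     // assumed never to be a valid primitive ID

// One mailbox per ray: remembers recently tested primitive IDs in a small
// hash table so that repeated tests against the same primitive can be skipped.
class HashedMailbox {
public:
    HashedMailbox() { slots_.fill(kEmptySlot); }

    // Returns true if primID is already recorded (the intersection test can be
    // skipped). Otherwise records it, evicting whatever occupied the slot.
    bool testAndSet(std::uint32_t primID) {
        const std::uint32_t slot = primID & (kMailboxSlots - 1);
        if (slots_[slot] == primID) return true;
        slots_[slot] = primID;
        return false;
    }

private:
    std::array<std::uint32_t, kMailboxSlots> slots_;
};
```

A hash collision only causes an extra, redundant intersection test, never a missed one, so correctness doesn’t depend on the quality of the hash.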

In “Ray-triangle intersection algorithm for modern CPU architectures”, M. Shevtsov, A. Soupikov and A. Kapustin introduced “Inverse Mailboxing”, which uses even less memory (32 bytes for an 8-slot mailbox structure). It simply stores the last eight intersected objects on the stack, and so can easily be used within the ray traversal code/algorithm. This method doesn’t guarantee that duplicate intersection tests will never be made per-ray for an object, but assuming a decent acceleration structure and a limited number of primitives per node or cell, the chance of duplicate checks is very small, while still providing a useful speedup.
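The inverse mailbox can be sketched as a small ring buffer of the most recently tested IDs. This is an illustration under assumptions, not the paper’s or Imagine’s code: the struct name, the linear scan and the round-robin replacement are choices made for the sketch, and the write cursor is kept alongside the eight 4-byte IDs.

```cpp
#include <cstdint>

// Remembers the last eight tested primitive IDs. Small enough to live on the
// stack inside the traversal function, so nothing is allocated per ray.
struct InverseMailbox {
    std::uint32_t ids[8];
    unsigned next = 0;  // ring-buffer write position

    InverseMailbox() {
        for (std::uint32_t& id : ids) id = 0xFFFFFFFFu;  // assumed never a valid ID
    }

    // True if primID was one of the last eight primitives recorded.
    bool contains(std::uint32_t primID) const {
        for (std::uint32_t id : ids) {
            if (id == primID) return true;
        }
        return false;
    }

    // Record a primitive that has just been tested, overwriting the oldest entry.
    void record(std::uint32_t primID) {
        ids[next] = primID;
        next = (next + 1) & 7;
    }
};
```

In traversal code the pattern would be to skip a primitive when `contains` returns true, and otherwise run the intersection test and call `record`.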

I’ve found that for general meshes this last technique gives speedups of between 7% and 23%, and in the case of compound objects (objects with sub-objects within them) - where intersecting the sub-objects can be much more expensive than a simple primitive intersection - up to a 40% speed increase.



