C++ Virtual Function Overhead
I have recently been trying to change Imagine’s Acceleration Structure infrastructure to be more dynamic, allowing different objects and geometry instances to have different acceleration structures either automatically based on heuristics, or based on user-specification within the interface.
For the past two years, Imagine’s acceleration structures have been implemented with the Curiously Recurring Template Pattern (CRTP), which gives Virtual Function-like dispatch to some extent while still letting the compiler inline the calls. I chose this approach because I had assumed that Virtual Functions would carry some overhead, even though my earlier experiments with them had, surprisingly, shown no measurable overhead compared to fully inlinable functions.
The “Curiously Recurring Template” pattern doesn’t easily allow holding a pointer to a templated base class (templated on Triangles, Shapes or Objects in Imagine’s case) and then deciding at run time which derived templated acceleration structure is instantiated behind that pointer, which is what I needed for this dynamic flexibility.
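To make that limitation concrete, here is a minimal sketch of the pattern. All names other than KDTreeVolume and getHitObject() (AccelBase, BVHVolume, getHitObjectImpl, trace, Ray, HitResult) are hypothetical stand-ins rather than Imagine’s actual code, and the traversal bodies are placeholders.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Hypothetical stand-ins for illustration; not Imagine's real types.
struct Ray { float origin[3]; float dir[3]; };
struct HitResult { bool hit = false; std::uint32_t objectIndex = 0; };

// CRTP base: Derived is a compile-time parameter, so the forwarding call
// below is resolved statically and is a candidate for inlining.
template <typename Derived>
class AccelBase
{
public:
    bool getHitObject(const Ray& ray, HitResult& result) const
    {
        return static_cast<const Derived*>(this)->getHitObjectImpl(ray, result);
    }
};

template <typename T>
class KDTreeVolume : public AccelBase<KDTreeVolume<T>>
{
public:
    bool getHitObjectImpl(const Ray& ray, HitResult& result) const
    {
        // Placeholder logic standing in for a real kd-tree traversal.
        result.hit = ray.dir[2] > 0.0f;
        result.objectIndex = 1;
        return result.hit;
    }
};

template <typename T>
class BVHVolume : public AccelBase<BVHVolume<T>>
{
public:
    bool getHitObjectImpl(const Ray& ray, HitResult& result) const
    {
        // Placeholder logic standing in for a real BVH traversal.
        result.hit = ray.dir[2] > 0.0f;
        result.objectIndex = 2;
        return result.hit;
    }
};

// Each derived class gets its own, unrelated base type, so there is no
// single base pointer type that could refer to either structure.
static_assert(!std::is_same<AccelBase<KDTreeVolume<int>>,
                            AccelBase<BVHVolume<int>>>::value,
              "CRTP bases of different derived classes are distinct types");

// Callers therefore have to be templated on the concrete structure type.
template <typename Accel>
bool trace(const Accel& accel, const Ray& ray, HitResult& result)
{
    return accel.getHitObject(ray, result);
}

// Example use: the concrete type is fixed at compile time at the call site.
inline std::uint32_t exampleHitIndex()
{
    KDTreeVolume<int> accel;
    const Ray ray = {{0.0f, 0.0f, 0.0f}, {0.0f, 0.0f, 1.0f}};
    HitResult result;
    const bool hit = trace(accel, ray, result);
    assert(hit);
    (void)hit; // silence unused-variable warnings when NDEBUG is defined
    return result.objectIndex;
}
```

Because the structure type is baked into every caller, picking a different structure per object at run time would mean either templating everything that touches it or adding a type-erasing layer on top.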
So I decided to do some more benchmarks to investigate any overhead again.
First of all, I tried a synthetic benchmark: a simple for loop of 1,073,741,824 iterations, calling getHitObject() on the acceleration structure pointer. For the “Curiously Recurring Template” implementation, the pointer was of type KDTreeVolume<T>, the actual derived class, and the object allocated for that pointer was of the same type.
In the Virtual Function implementation, the pointer was of type AccelerationStructure<T>, the base class, and the actual object allocated for the pointer was a KDTree<T>, a duplicate of the KDTreeVolume<T> class that instead used standard C++ inheritance with virtual functions from the abstract AccelerationStructure<T> base class.
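A minimal sketch of that Virtual Function arrangement might look like the following. AccelerationStructure<T>, KDTree<T> and getHitObject() are the names used above; BruteForceList, AccelKind, makeAccel, trace, Ray and HitResult are hypothetical additions for illustration, and the method bodies are placeholders rather than real traversal code.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>

// Hypothetical stand-ins for illustration; not Imagine's real types.
struct Ray { float origin[3]; float dir[3]; };
struct HitResult { bool hit = false; std::uint32_t objectIndex = 0; };

// Abstract base class: calls through a pointer to it are dispatched
// via the vtable at run time.
template <typename T>
class AccelerationStructure
{
public:
    virtual ~AccelerationStructure() = default;
    virtual bool getHitObject(const Ray& ray, HitResult& result) const = 0;
};

template <typename T>
class KDTree : public AccelerationStructure<T>
{
public:
    bool getHitObject(const Ray& ray, HitResult& result) const override
    {
        // Placeholder logic standing in for a real kd-tree traversal.
        result.hit = ray.dir[2] > 0.0f;
        result.objectIndex = 1;
        return result.hit;
    }
};

// Hypothetical second structure, to show the choice being made at run time.
template <typename T>
class BruteForceList : public AccelerationStructure<T>
{
public:
    bool getHitObject(const Ray& ray, HitResult& result) const override
    {
        result.hit = ray.dir[2] > 0.0f;
        result.objectIndex = 2;
        return result.hit;
    }
};

enum class AccelKind { KDTree, BruteForce };

// Hypothetical factory: a heuristic or a user setting picks the kind.
template <typename T>
std::unique_ptr<AccelerationStructure<T>> makeAccel(AccelKind kind)
{
    if (kind == AccelKind::KDTree)
        return std::unique_ptr<AccelerationStructure<T>>(new KDTree<T>());
    return std::unique_ptr<AccelerationStructure<T>>(new BruteForceList<T>());
}

// Callers only ever see the base pointer type.
template <typename T>
bool trace(const AccelerationStructure<T>* accel, const Ray& ray, HitResult& result)
{
    assert(accel != nullptr);
    return accel->getHitObject(ray, result);
}
```

Here the caller never names the concrete class, which is the run-time flexibility the CRTP version lacks. The cost is one indirect call per getHitObject() invocation.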
The code was run in a single thread.
Eight pairs of runs were done, alternating between the two implementations, and I tried to make sure the CPU core temperatures were below 55 °C before starting each run, so that neither implementation was disadvantaged by a core being unable to Turbo Boost.
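The timing harness itself isn’t shown here, so the following is only a sketch of how such a single-threaded, alternating measurement could be structured. The names timeLoop, runAlternating, PairTiming and sink are all hypothetical, and the temperature check between runs is left out.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <vector>

// Times `iterations` calls of `fn` on the calling thread and returns the
// elapsed wall-clock time in seconds. Results are accumulated into `sink`
// so that the loop has an observable effect.
template <typename Fn>
double timeLoop(Fn&& fn, std::uint64_t iterations, std::uint64_t& sink)
{
    assert(iterations > 0);
    const auto start = std::chrono::steady_clock::now();
    for (std::uint64_t i = 0; i < iterations; ++i)
        sink += fn(i);
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(end - start).count();
}

struct PairTiming
{
    double first;  // seconds for implementation A
    double second; // seconds for implementation B
};

// Runs A then B, `pairs` times over, so that slow drift (thermal state,
// background load) affects both implementations roughly equally.
template <typename FnA, typename FnB>
std::vector<PairTiming> runAlternating(FnA&& a, FnB&& b, int pairs,
                                       std::uint64_t iterations,
                                       std::uint64_t& sink)
{
    assert(pairs > 0);
    std::vector<PairTiming> timings;
    timings.reserve(static_cast<std::size_t>(pairs));
    for (int p = 0; p < pairs; ++p)
    {
        PairTiming t;
        t.first = timeLoop(a, iterations, sink);
        t.second = timeLoop(b, iterations, sink);
        timings.push_back(t);
    }
    return timings;
}
```

In a real run, `a` and `b` would wrap the getHitObject() call on each implementation’s pointer, with eight pairs and 1,073,741,824 iterations. An optimiser can still simplify trivial loop bodies, so the sink is a precaution rather than a guarantee.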
Test Run | CRTP | Virtual Function |
---|---|---|
1 | 138.73941 | 137.11993 |
2 | 141.10697 | 137.10513 |
3 | 136.96121 | 137.07798 |
4 | 136.67388 | 139.22481 |
5 | 136.91679 | 138.22481 |
6 | 137.20460 | 138.49725 |
7 | 138.77106 | 139.43341 |
8 | 136.85819 | 137.13652 |
Mean Average | 137.90402 | 137.97748 |
Other than the first two results for the “Curiously Recurring Template” implementation, the Virtual Function method seems very slightly slower, but the difference is within the margin of error. This surprised me, but given how simple the test case is, it’s possible that the processor’s branch predictor was hiding the overhead of the v-table lookup, so the benchmark might not be showing a difference that exists in practice.
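One rough way to check the “margin of error” claim is to compare the gap between the two means with the run-to-run spread of each column. The sketch below does that with the sample standard deviation. The data is copied from the table above, the helper names (meanOf, sampleStdDev, crtpRuns, virtualRuns) are hypothetical, and this is a sanity check rather than a proper significance test.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Arithmetic mean of a non-empty list of timings.
inline double meanOf(const std::vector<double>& values)
{
    assert(!values.empty());
    double sum = 0.0;
    for (double v : values)
        sum += v;
    return sum / static_cast<double>(values.size());
}

// Sample standard deviation (n - 1 in the denominator).
inline double sampleStdDev(const std::vector<double>& values)
{
    assert(values.size() > 1);
    const double m = meanOf(values);
    double squares = 0.0;
    for (double v : values)
        squares += (v - m) * (v - m);
    return std::sqrt(squares / static_cast<double>(values.size() - 1));
}

// Timings copied from the synthetic benchmark table.
const std::vector<double> crtpRuns = {
    138.73941, 141.10697, 136.96121, 136.67388,
    136.91679, 137.20460, 138.77106, 136.85819};

const std::vector<double> virtualRuns = {
    137.11993, 137.10513, 137.07798, 139.22481,
    138.22481, 138.49725, 139.43341, 137.13652};

// True when the gap between the means is smaller than the spread of
// either column on its own.
inline bool gapWithinSpread()
{
    const double gap = std::fabs(meanOf(crtpRuns) - meanOf(virtualRuns));
    return gap < sampleStdDev(crtpRuns) && gap < sampleStdDev(virtualRuns);
}
```

With these numbers the means differ by about 0.07, while the sample standard deviations come out at roughly 1.5 and 1.0 in the table’s units, so the gap is well inside the noise.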
So I decided to do some more realistic general purpose rendering benchmarks with the two different implementations, to see if any difference could be spotted there.
I created a scene with a fully-enclosed room, with 32 cubes inside and 2 area lights, and all surfaces fully diffuse. Every geometry object had 12 triangles or fewer, so the acceleration structures would be very shallow and any Virtual Function overhead would not be dwarfed by the work each function was doing on the acceleration structure.
The first test was at 1024x768, with 256 samples per pixel, and 10 ray bounces.
With this test, the getHitObject() function would be called 2,013,265,920 times (for each camera ray and each diffuse bounce ray) and the doesOcclude() function would be called 4,026,531,840 times: once per light (no light importance sampling was used) for each surface evaluation. All threads were being used for rendering.
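The call counts follow directly from the render settings. As a worked check, the sketch below reproduces them, assuming (as the quoted totals imply) that each sample accounts for 10 getHitObject() calls and that every one of those surface evaluations tests both lights. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Number of getHitObject() calls: one per path segment, for every sample
// of every pixel. 64-bit arithmetic avoids overflow at these sizes.
inline std::uint64_t hitObjectCalls(std::uint64_t width, std::uint64_t height,
                                    std::uint64_t samplesPerPixel,
                                    std::uint64_t segmentsPerSample)
{
    assert(width > 0 && height > 0 && samplesPerPixel > 0 && segmentsPerSample > 0);
    return width * height * samplesPerPixel * segmentsPerSample;
}

// Number of doesOcclude() calls: one shadow test per light for each
// surface evaluation (no light importance sampling).
inline std::uint64_t occlusionCalls(std::uint64_t surfaceEvaluations,
                                    std::uint64_t lightCount)
{
    assert(lightCount > 0);
    return surfaceEvaluations * lightCount;
}
```

For 1024x768 at 256 samples per pixel, hitObjectCalls(1024, 768, 256, 10) gives 786,432 x 256 x 10 = 2,013,265,920, and occlusionCalls of that figure with 2 lights gives 4,026,531,840, matching the totals above.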
Test Run | CRTP (mm:ss) | Virtual Functions (mm:ss) |
---|---|---|
1 | 05:52.720 | 05:41.340 |
2 | 05:53.420 | 05:40.790 |
3 | 05:51.870 | 05:41.620 |
4 | 05:49.580 | 05:42.200 |
5 | 05:46.380 | 05:41.790 |
6 | 05:49.150 | 05:44.380 |
Average | 05:50.520 | 05:42.020 |
The above results are in minutes and seconds, and show that the Virtual Function implementation was consistently slightly faster. I decided to run these tests again with double the number of samples per pixel (which doubles the call counts above), and again the results were pretty much the same:
Test Run | CRTP (mm:ss) | Virtual Functions (mm:ss) |
---|---|---|
1 | 11:41.970 | 11:40.880 |
2 | 11:41.380 | 11:37.710 |
3 | 11:47.790 | 11:31.560 |
4 | 11:37.530 | 11:26.680 |
5 | 11:45.270 | 11:31.530 |
Average | 11:42.788 | 11:33.672 |
I’m still surprised by this, but I’m putting it down to either compilers being a lot more clever than they used to be, or the inlined “Curiously Recurring Template” implementation producing bloated code. Or, more likely, things like cache misses, load dependencies and branch mispredictions within the triangle and bounding box ray intersection code hide any overhead there may be.
Regardless, it appears that, to all intents and purposes, Virtual Function overhead is practically non-existent in my use-case. I’m sure things would change with deeper inheritance hierarchies and multiple inheritance, but for what I need it seems safe to move back to Virtual Functions. What’s more, Intel’s Embree high-performance ray tracing kernels use Virtual Functions, and they’re pretty much state-of-the-art.