A bit more than a year ago I found ATI Tootle, an interesting mesh preprocessing tool for simultaneous vertex cache AND overdraw optimization. Somebody wisely commented then that “there’s probably something that is 1000x faster and 99% as good”. Well, that thing is here today and it’s bearing the equally ridiculous name Tipsy. I won’t get around to actually implementing it, just as I didn’t with Tootle, but I sincerely hope the fine purveyors of the relevant middleware we use will find the time.
The topic of fast float-to-int conversion is a favorite among game developers, optimization freaks and other scum like that - so I was somewhat puzzled that I had never encountered this problem in practice. Until today, that is.
We have some code which takes a bunch of mesh instances, transforms them and sticks them into a large vertex buffer in order to save draw calls. The source vertex data includes, among other things, a float4 containing in its xyz components a normal (range -1..1), which is transformed and compressed into three of the four bytes of a uint32. We have a function called CompressNormalToD3DCOLOR, which takes a D3DXVECTOR4 and outputs a uint32, processing all four components. It was written long ago by the happy elves of the Land of Perfect Compilers. This is how it looked:
inline void CompressNormalToD3DCOLOR( const D3DXVECTOR4 & normal, uint32 & n )
{
    D3DXVECTOR4 norm( normal );
    norm += D3DXVECTOR4( 1, 1, 1, 1 );
    norm *= 127.5f;
    uint32 r = uint32( norm.x );
    uint32 g = uint32( norm.y );
    uint32 b = uint32( norm.z );
    uint32 a = uint32( norm.w );
    n = (a << 24) + (r << 16) + (g << 8) + b;
}
In the land of Real Compilers, this straightforward piece of code compiles to 73 instructions, most of which are memory accesses (Visual C++ from Visual Studio 2005). The function is called a zillion times, and in a certain worst-case scenario which occurs on the 10th second after you run the game, it starts taking up to 25% of the frame time.
Some CPUs have a fscking instruction that does this, FFS.
My first reaction (after panic, which was the zeroth) was to make a special-case function which only processes three of the components, since in this case we are sure we don’t need the fourth. At the cost of an additional branch, half of the calls to the full function were eliminated, which led to about a 3x reduction in the time taken just to compress normals.
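The special-case version might look something like this - a minimal, self-contained sketch with hypothetical names (the real code takes a D3DXVECTOR4; here a bare struct stands in for it):

```cpp
#include <cstdint>

// Minimal stand-in for D3DXVECTOR4, so the sketch is self-contained.
struct Vec4 { float x, y, z, w; };

// Sketch of the three-component special case: when the caller doesn't
// need the alpha byte, one of the four float-to-int conversions (and the
// temporary vector) goes away.
inline void CompressNormal3(const Vec4 & normal, uint32_t & n)
{
    uint32_t r = uint32_t((normal.x + 1.0f) * 127.5f);
    uint32_t g = uint32_t((normal.y + 1.0f) * 127.5f);
    uint32_t b = uint32_t((normal.z + 1.0f) * 127.5f);
    n = (r << 16) | (g << 8) | b; // alpha byte left at zero
}
```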
I remembered I recently read something like an overview of the various float-to-int techniques on Mike Herf’s site. (Go there and read it. He’s the person behind the wonderfully smooth UI of Kai’s Power Tools, and 10 years later, Picasa.) I whipped up a simple synthetic benchmark with the original code, the three-component version, inline assembly FISTP, the double-magic-numbers trick, and the direct conversion trick (especially useful because in my case the normals are guaranteed to be in a certain range) - you can read the descriptions here. On my aging home Athlon 2 GHz, the original monstrosity takes about 200 clocks for one float4->uint32 conversion. The 3-component version is 136 clocks. Directly using FISTP via inline assembly, which requires manually setting the FPU control word, is only 28 clocks. The real superstars are the last two techniques: magic numbers, which takes 22 clocks, and direct conversion, which only takes 20 clocks - a tenfold improvement over the original code!
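For reference, the double-magic-number trick can be sketched like this (my reconstruction from Herf’s description, assuming a little-endian machine and the default round-to-nearest mode): adding 1.5 * 2^52 pushes the integer part of x into the low mantissa bits of the double, where it can be read back directly as a 32-bit integer.

```cpp
#include <cstdint>
#include <cstring>

// Magic-number float-to-int: after adding 1.5 * 2^52, the low 32 bits of
// the double's bit pattern hold the (round-to-nearest) integer value of x.
// Assumes little-endian; valid for |x| well below 2^31.
inline int32_t FastFloatToInt(double x)
{
    const double magic = 6755399441055744.0; // 1.5 * 2^52
    x += magic;
    int32_t result;
    std::memcpy(&result, &x, sizeof(result)); // low 32 bits of the mantissa
    return result;
}
```

Note that unlike a plain C cast it rounds to nearest instead of truncating toward zero, which is usually what you want for color packing anyway.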
Of course, all is not rosy. Wise men say FLD/FISTP doesn’t pipeline well. Magic numbers require that you keep your FPU in 53-bit precision mode - which a) isn’t a good idea and b) DirectX won’t honor. Direct conversion works if you can easily get your data in e.g. the [2, 4) range - notice how 4 is excluded: normals are [-1, 1], and it’s not trivial to get them into [-1, 1) without inflicting more clocks.
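The direct conversion relies on the fact that for x in [2, 4) the sign and exponent bits of an IEEE float are constant, so the top mantissa bits directly encode (x - 2)/2 as a fixed-point fraction. A minimal sketch (my own, not the exact code from the benchmark):

```cpp
#include <cstdint>
#include <cstring>

// Direct conversion: for x in [2, 4) the top eight of the 23 mantissa
// bits are exactly floor((x - 2) / 2 * 256) -- an 8-bit value extracted
// with no float-to-int instruction at all.
inline uint32_t FloatRangeToByte(float x) // x must be in [2, 4)
{
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));
    return (bits >> 15) & 0xFF; // top 8 mantissa bits
}
```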
So, what turned out to be the best type of float-to-int conversion?
The one that takes zero clocks, of course. It turned out that most of the meshes which undergo this packing have fixed (0, 0, 1) normals in modelspace for all their vertices, which means the transformation and packing of their normals can happen per-instance, not per-vertex. Of course, I realized this only after spending an hour or so in reading Herf’s papers and benchmarking his suggested variants.
Well, at least I’ll be damn well prepared next time it comes up.
My seemingly innocent question about querying the video card memory on Vista turned into a 42-post bloodbath, giving little in the way of useful answers, but illustrating perfectly how there is no such thing as a “PC gaming platform”, and why I want to get out of the PC mess ASAP.
The problem with the video card memory size isn’t new. It’s a question Microsoft actively lies about when answering; the simple query function in DirectX 9, IDirect3DDevice9::GetAvailableTextureMem, has always lied - it returns the memory on the video card plus the available physical system memory plus the current distance between the Mars moons Phobos and Deimos in Russian versts, divided by the temperature of the water in Lake Sammamish expressed in Réaumur degrees. An infiltrated enemy agent managed to sneak in the IDxDiag interface, which works more or less reliably on XP, but in the run-up to Vista he was discovered, shot down, and the oversight was corrected: on Vista the same IDxDiag code returns rubbish too - to the extent that even the DxDiag tool shipped with DirectX, which countless QA staff and even users have been trained to run and send dumps from, has become useless in that regard. So now you have to resort to quality software engineering techniques such as using DirectX 7 or DirectX 10 in an otherwise top-to-bottom DirectX 9 application. Or running GetAvailableTextureMem and subtracting the physical memory. Or dividing it by three. Or assuming that everybody with Vista has 256 MB of RAM on the videocard - hey, it’s the current mode, why not?
Apparently the Microsoft position is that it’s no business of the developer to know how much fast memory he can use. Please pretend that you can use as many textures as you like, we’ll take care of it. If we gave you this number, you’d only do stupid things with it… we’re from the OS vendor, and we’re here to help you! Relax and watch the blinkenlights. People even went so far as to suggest ridiculous things like starting the game with the options screen so the user can pick the best settings for himself (the “all my friends are geeks with $600 videocards” solution), or starting the game with the ugliest settings by default (the “who cares about review scores” solution). What’s interesting is the clear demarcation line between the people who are actually shipping games to be sold to real-world humans for a living, and who find real value in knowing the video memory size, and the rest - technical evangelists, high-end demo creators and academics, whose idea of development is pushing high-end hardware around and occasionally presenting to enthusiast users.
The PC as a platform is hopelessly fragmented. The rift between the high end and the low end is bigger than ever, from the Crysis crowd, who consider 6800-class hardware “low-end”, to the Zuma clone crowd, who don’t even have GPUs to speak of. The vendors - ATI^H^H^HAMD, NVIDIA, Microsoft - are each pulling the rug in their own direction, with little to no support for developers trying to stick to the rapidly disappearing “middle ground”, what was the mainstream of PC gaming a few years ago. (The rumored Intel intrusion into the field, trying to push raytracing on multi-multicore CPUs, will make things much worse in this regard.) The publishers demand support for hardware (in our case, DX81-class GPUs) which has long ago fallen off Microsoft’s radar and isn’t even targeted by the shader compilers released with the DirectX SDKs. The reviewers demand graphics quality rivaling the multimillion 6-hour cinematic fests subsidized by console vendors and passed off as “AAA games”. The users demand not to think. And a pony.
If only there were platforms where the hardware was cheap and powerful, the drivers appeared three or four times a year, the vendor was eager to help your development, and there were tens of millions of users eager to buy games. I would gladly accept the lack of a GetAvailableTextureMemory function - I’d replace it with a compile-time constant in a heartbeat.
I came across an interesting paper called Deferred Pixel Shading on the Playstation 3, by Alan Heirich and Louis Bavoil. They used the RSX as a pure rasterizer to build the G-buffers, then ran a pretty complex shadowing algorithm on five SPUs. They achieved 30 giga-ops (note that they don’t quote GFlops, which are much more commonly used to measure performance in this field - this is surely intentional) and around 11 GBytes/sec data transferred around the system.
Let’s convert this to more familiar terms, pretending that their “ops” are actually “flops” (there shouldn’t be much difference anyway, from what I know about the SPU instruction set). A game running at 30 fps in a 1280×720 resolution, without antialiasing, needs to shade 27.6 MPixels/sec. If you use 5 SPUs, like the authors of the paper, and achieve the same throughput, this means you’d have about 1000 operations per pixel; given that traditional GPU pixel shader instructions are usually four-wide, this would be roughly equivalent to a 200-250 instruction pixel shader. On the bandwidth side, you would have about 400 bytes per pixel. If you use, say, four 32-bit surfaces for your G-buffers - which is what I remember as normal from the deferred shading papers I’ve read - and want to write another 32 bits to the final framebuffer, this leaves you with over 300 bytes of extra data to shuffle around - various shadowbuffers, several passes etc. 250 instructions for the lighting shader itself is also pretty generous, even though it would have to be divided among several passes. (You’d realistically want to do MSAA or even SSAA for a real game, which would raise the bandwidth and computational cost significantly - but on the other hand, neither the 30 Gops nor the 11 GB/s are anywhere near the theoretical throughput of Cell.)
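Spelled out, the back-of-the-envelope arithmetic above looks like this:

```cpp
// 1280x720 at 30 fps, no antialiasing, against the paper's measured
// 30 Gops and 11 GB/s on five SPUs.
const double kFps    = 30.0;
const double kPixels = 1280.0 * 720.0;
const double kOps    = 30e9; // "ops", as the paper quotes them
const double kBytes  = 11e9;

double MPixelsPerSec() { return kFps * kPixels / 1e6; }      // ~27.6
double OpsPerPixel()   { return kOps / (kFps * kPixels); }   // ~1085
double BytesPerPixel() { return kBytes / (kFps * kPixels); } // ~398
```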
All in all, I fully expect to see games doing deferred shading on the Cell before the generation is over. You “just” need to come up with a renderer, scene, world and game design which can utilize the strengths of the deferred shading fully - so the title would stand apart from the forward-rendering crowd, which would justify the pain of getting this to work. But on paper (pun intended), the numbers add up - it definitely seems possible.
This wouldn’t be a real blog if I didn’t bitch about life in general, and post pictures of my cat. Unfortunately, I don’t have a cat. Fortunately, although there has been much to bitch about around me lately, I had two bright spots in my day today, both vaguely music-related. First, one of my favorite games, Rez, is coming to XBLA. I have spent lots of hours in Rez, trying to beat this or that boss; usually I quickly lose interest in games which are mostly about being hard and presenting a challenge (that’s why I’m not so excited about the other oldskool legend announced today for XBLA, Ikaruga). And in the best of times, I’m indifferent to the trance/electronic/whatchamacallit type of music in Rez. But there’s something about it that hypnotizes me for hours, chasing the lines on the screen. If there ever was a game that could benefit from HDTV, it would be Rez; in a perfect world it would even come with a vector screen. Eh, maybe I’m just a sucker for rail shooters, and the next announcement that would make me just as happy would be a Panzer Dragoon Orta spiritual sequel. By the way, Rez can’t work on the PS3 in its present form - it really needs the throbbing of the controller in your hands as part of the experience. (Please, no trance vibrator jokes… if at all possible.)
The second bright spot was this video of somebody called Richard Lewis (sorry, I even tried to learn who this guy was, without luck) singing a gentle love song accompanied by his Nintendo DS, strumming virtual chords with his stylus in a game called Jam Sessions. If this is not a killer app, I don’t know what is.
Both perform roughly the same job - they allow us to enter data in a central database, then present us with different views of the data, let us run queries and summarize the results. (One of them has more serious obligations handling one particular form of local view of the data, but I’m mostly talking about its other duties here.) Both are used by virtually the same set of people, most of them sitting in the same room; both of them are occasionally used by people around the globe, who are given access to our databases and connect to them thanks to the wonders of the Internet.
But these two applications aren’t created equal. One of them presents a rich, responsive interface, with all kinds of filtering, sorting and cross-referencing of relevant data. The other is slow, clunky, takes a constant amount of time (a couple of seconds) to query the server even for the most mundane of tasks, and is chained to the UI conventions of an ancient presentation framework designed 10 years ago to fulfill completely different tasks.
The two applications fulfill similar needs to similar groups of users. It makes exactly the same sense for both of them to be implemented as Web apps, rendering inefficient HTML, doing needless roundtrips to the server, relying on the mercy of not one, but two intermediaries (the browser and the web server). Thankfully, the first application is a native Win32 app.
I love the UI of Gmail, but I would gladly switch to a desktop email reader with the same UI conventions, connecting to a database somewhere in the world. What I like about Gmail is not the fact that it renders through my browser, but its nice, unorthodox UI. I use a great little local-client Gmail on my mobile phone, written in Java; it beats the crap out of even the mobile-optimized server-side Gmail running through Opera Mini. For purely political reasons Google will never release anything like that for the desktop - but I bet the experience would be vastly superior to even the snappiest AJAX-rendering browser.
Web applications are a wonderful thing, but they are not the only solution to everything. Having more than one user of an application, or even having remote, off-site users, is not a good reason by itself to suffer through HTML forms and stateless HTTP request/responses. AJAX tricks may make the user interface slightly more responsive, but they won’t ever turn Flickr into Picasa. Doing a quick and dirty job through a browser might be OK for something I do once or twice a month (e.g. paying a bill online, or ordering books), but for something that I use dozens of times a day - e.g. email, or bugtracking, or code reviews - I prefer a native client.
The good native application in the true story above is TortoiseSVN. The crappy web application is the Mantis bugtracker. Any comments suggesting that I replace Mantis with superior bugtracking brand XXX must include offers of assistance with converting about a dozen home-grown tools around it, with migrating around 10k bugs from 8 projects, and retraining on the order of 50 people, most of them fairly conservative artists.
Since quite a lot of the search strings leading people to this blog are related to DXT compression, I feel obliged to link the best papers on the subject I’ve had the pleasure to read (but, unfortunately, not to implement). Both are published on Intel’s site, both are written by an id Software programmer called J.M.P. van Waveren; for some reason, googling “waveren site:intel.com” doesn’t find both of them, although other search strings find each of them individually.
One of them is called Real-Time DXT Compression and presents a heavily optimized SSE2 DXT compressor; the other is called Real-Time Texture Streaming & Decompression and presents a similarly SSE2-ified JPEG-like scheme. Since Intel’s webmasters can’t be trusted to keep the URLs alive, you’ll be better off searching for them by keywords at site:intel.com. Both achieve very impressive rates - e.g. 200 MPixels/sec RGB-to-DXT1 compression and 190 MPixels/sec DCT decompression, both on a beefy Conroe chip.
Go ahead, dig in the SIMD intrinsics of your [employer’s] platform of choice. You have no excuse to load zlibbed DXTs anymore. (I hope no one is loading uncompressed TGA or BMPs in 2007, right? RIGHT?)
Somebody over at Beyond3D linked an interesting paper evaluating the tradeoffs between a software-managed local store (or scratchpad memory, as it was known on earlier Sony platforms) and the hardware-managed caches known from the desktop CPUs:
There are two basic models for the on-chip memory in CMP systems: hardware-managed coherent caches and software-managed streaming memory. This paper performs a direct comparison of the two models under the same set of assumptions about technology,
area, and computational capabilities. The goal is to quantify how and when they differ in terms of performance, energy consumption, bandwidth requirements, and latency tolerance for general-purpose CMPs. We demonstrate that for data-parallel applications, the cache-based and streaming models perform and scale equally well. For certain applications with little data reuse, streaming scales better due to better bandwidth use and macroscopic software prefetching. However, the introduction of techniques such as hardware prefetching and non-allocating stores to the cache-based model eliminates the streaming advantage. Overall, our results indicate that there is not sufficient advantage in building streaming memory systems where all on-chip memory structures are explicitly managed. On the other hand, we show that streaming at the programming model level is particularly beneficial, even with the cache-based model, as it enhances locality and creates opportunities for bandwidth optimizations. Moreover, we observe that stream programming is actually easier with the cache-based model because the hardware guarantees correct, best-effort execution even when the programmer cannot fully regularize an application’s code.
I’m still not convinced of the rightness of the thou shalt have no single GameObject school of thought. Having all objects in the game world present a uniform interface decouples many object-handling tasks from the internal representation of the object, the two chief examples being the editor and the script interface. We had to add very little code to the editor, for example, when we added new, different types of objects - grass, particle systems and terrain decals - for them, the entire set of operations such as move/scale/rotate, undo/redo, edit properties and save/load to file worked automagically, just because they were simply GameObjects. We really have three classes of GameObjects - very lightweight, simple and numerous (think grass), lightweight-dumb-static (think trees and rocks) and full-blown (think units and buildings) - but 90% of the engine code and 99% of the game code doesn’t know the distinction between them. (And we use normal Lua objects, not game-engine living-in-the-world GameObjects, for abstract gameplay entities like the economy simulation of the city, so that would count maybe as a fourth one.) So far the system seems to work OK, with significant advantages in code simplicity and memory footprint compared to the previous two systems we used. Two years ago we built a game around a single type of GameObject, even for abstract gameplay entities, and there was an invisible “economy” object placed somewhere on the map. And seven years ago we had a full-blown OOP-textbook-madness, seven-layers-deep hierarchy of GameObjects, complete with a mess of virtual functions and semi-complete implementations jumping back and forth between the layers of the inheritance tree.
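The uniform-interface idea can be sketched like this (all names hypothetical, nothing like our actual code): editor operations go through the base class, so a new object type gets them for free.

```cpp
#include <string>
#include <vector>

// A single GameObject base: the editor and scripts only ever see this.
struct GameObject
{
    GameObject() : x(0), y(0), z(0) {}
    virtual ~GameObject() {}
    virtual void Move(float dx, float dy, float dz) { x += dx; y += dy; z += dz; }
    virtual std::string TypeName() const = 0;
    float x, y, z;
};

// Two of the "classes" of objects; the editor doesn't care which is which.
struct Grass : GameObject { std::string TypeName() const { return "Grass"; } };
struct Unit  : GameObject { std::string TypeName() const { return "Unit"; } };

// An editor operation working on a mixed selection through the base class.
inline void MoveSelection(std::vector<GameObject*> & sel,
                          float dx, float dy, float dz)
{
    for (size_t i = 0; i < sel.size(); ++i)
        sel[i]->Move(dx, dy, dz);
}
```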
Seriously, who came up with the brilliant idea of storing the mip levels of a texture bigger-to-smaller after the header in a DDS file? This way, when you want to load just, say, the 256×256 mip level and all the smaller ones, you need to do two disjoint reads from the file - one to parse the DDS header, and another for the mip subchain. Nothing would be simpler than storing them smaller-to-bigger - that way you’d be able to read with a single read operation or, at worst, with two adjacent ones. When you’re trying to read textures asynchronously, the bigger-to-smaller order forces you either to keep a preloaded table of all the DDS headers in your game (which is a bad idea in many ways - it consumes memory proportional to the entire data set of the game instead of the currently needed data set, introduces additional asset build steps, and messes with the ability to let artists change texture formats and sizes on the fly while the game is running), or to do two asynchronous reads, doubling the latency for textures appearing on the screen.
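To see why, here is a sketch of the offset calculation (assuming an uncompressed 32-bit format for simplicity): the offset of the mip subchain you actually want depends on the sizes of all the larger mips, which you only know after parsing the header - hence the second, disjoint read.

```cpp
#include <cstddef>
#include <cstdint>

// Byte offset of mip level firstWantedMip in a bigger-to-smaller layout,
// for an uncompressed 32-bit-per-pixel texture. The top-level dimensions
// come from the DDS header, so this can't run until read #1 completes.
inline size_t MipChainOffset(uint32_t topWidth, uint32_t topHeight,
                             uint32_t firstWantedMip)
{
    size_t offset = 0;
    for (uint32_t mip = 0; mip < firstWantedMip; ++mip)
    {
        uint32_t w = topWidth >> mip;  if (w == 0) w = 1;
        uint32_t h = topHeight >> mip; if (h == 0) h = 1;
        offset += size_t(w) * h * 4; // 4 bytes per pixel
    }
    return offset; // the second read must start here
}
```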
I try hard not to be one of those not-invented-here guys who insist on having their own data processing tools and file formats for everything, but it’s not easy…