Archive for October, 2007

The best type of float-to-int conversion

Tuesday, October 16th, 2007

The topic of fast float-to-int conversion is one of the favorites among game developers, optimization freaks and other scum like that - so I was somewhat puzzled that I never encountered this problem in practice. Until today, that is.

We have some code which takes a bunch of mesh instances, transforms them and sticks them into a large vertex buffer in order to save draw calls. The source vertex data includes, among other things, a float4 containing in its xyz components a normal (range -1..1), which is transformed and compressed into three of the four bytes of a uint32. We have a function called CompressNormalToD3DCOLOR, which takes a D3DXVECTOR4 and outputs a uint32, processing all four components. It was written long ago by the happy elves of the Land of Perfect Compilers. This is how it looked:

inline void CompressNormalToD3DCOLOR(const D3DXVECTOR4 & normal, uint32 & n )
{
    D3DXVECTOR4 norm( normal );
    norm += D3DXVECTOR4( 1, 1, 1, 1 );
    norm *= 127.5f;

    uint32 r = uint32(norm.x);
    uint32 g = uint32(norm.y);
    uint32 b = uint32(norm.z);
    uint32 a = uint32(norm.w);

    n = (a<<24) + (r<<16) + (g<<8) + b;
}

Innocent enough?

In the land of Real Compilers, this straightforward piece of code compiles to 73 instructions, most of which are memory accesses (Visual C++ from Visual Studio 2005). The function is called a zillion times, and in a certain worst-case scenario which occurs about ten seconds after you run the game, it starts taking up to 25% of the frame time.

Some CPUs have a fscking instruction that does this, FFS.

My first reaction (after panic, which was the zeroth) was to make a special-case function which only processes three of the components, since in this case we are sure we don't need the fourth. At the cost of an additional branch, half of the calls to the function were eliminated, which led to about a 3x reduction in the time taken just to compress normals.
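
The special case is nothing clever - just the same arithmetic with the fourth component dropped. Something along these lines (a sketch, not necessarily the exact code we shipped; what happens to the alpha byte here is a guess):

// Sketch of the three-component special case; the alpha byte is simply
// left at zero, which may or may not match the real code.
inline void CompressNormalToD3DCOLOR_XYZ(const D3DXVECTOR4 & normal, uint32 & n)
{
    uint32 r = uint32((normal.x + 1.0f) * 127.5f);
    uint32 g = uint32((normal.y + 1.0f) * 127.5f);
    uint32 b = uint32((normal.z + 1.0f) * 127.5f);

    n = (r<<16) + (g<<8) + b;
}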

I remembered I had recently read an overview of the various float-to-int techniques on Mike Herf's site. (Go there and read it. He's the person behind the wonderfully smooth UI of Kai's Power Tools and, 10 years later, Picasa.) I whipped up a simple synthetic benchmark with the original code, the three-component version, inline assembly FISTP, the double-magic-numbers trick, and the direct conversion trick (especially useful because in my case the normals are guaranteed to be in a certain range) - you can read the descriptions here. On my aging 2 GHz Athlon at home, the original monstrosity takes about 200 clocks for one float4->uint32 conversion. The three-component version is 136 clocks. Directly using FISTP via inline assembly, which requires manually setting the FPU control word, is only 28 clocks. The real superstars are the last two techniques: magic numbers, at 22 clocks, and direct conversion, at only 20 clocks - a tenfold improvement over the original code!
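
For the curious, the magic numbers trick boils down to roughly this - a sketch after Herf's description, not exactly what I benchmarked, and assuming a little-endian machine, 32-bit int, and round-to-nearest:

// Sketch of the magic-number trick (after Herf): adding 1.5 * 2^52 forces
// the integer part of x into the low bits of the double's mantissa.
// Assumes |x| < 2^31 and that the addition really is rounded to 53 bits
// (see the precision-mode caveat below). Note it rounds to nearest,
// unlike a C cast, which truncates.
inline int FloatToIntMagic(double x)
{
    const double magic = 6755399441055744.0;   // 2^52 + 2^51
    union { double d; int i[2]; } u;
    u.d = x + magic;
    return u.i[0];   // low 32 bits hold the rounded result (little-endian)
}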

Of course, all is not rosy. Wise men say FLD/FISTP doesn't pipeline well. Magic numbers require that you keep your FPU in 53-bit precision mode - which a) isn't a good idea and b) DirectX won't honor. Direct conversion works if you can easily get your data into, say, the [2, 4) range - notice how 4 is excluded: normals are in [-1, 1], and it's not trivial to get them into [-1, 1) without inflicting more clocks.
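
To make that last one concrete: for a float known to lie in [2, 4) the sign and exponent bits are constant, so the top of the mantissa already is the value in fixed point, and extracting a byte costs a shift and a mask. Again a sketch, with the usual type-punning caveats, reusing the uint32 type from the code above:

// Sketch of the direct-conversion trick for a float in [2, 4): the top eight
// mantissa bits are floor((x - 2) / 2 * 256).
inline uint32 RangedFloatToByte(float x)   // x must be in [2, 4)
{
    union { float f; uint32 i; } u;
    u.f = x;
    return (u.i >> 15) & 0xFF;   // bits 15..22 = top 8 bits of the mantissa
}

// A normal component in [-1, 1) could be fed in as (component + 3.0f);
// the catch mentioned above: a component of exactly 1.0 lands on 4.0,
// outside the range, and comes back as 0.

(The scale works out to 128 per unit instead of the 127.5 above, so the quantization differs slightly.)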

So, what turned out to be the best type of float-to-int conversion?

The one that takes zero clocks, of course. It turned out that most of the meshes which undergo this packing have a fixed (0, 0, 1) normal in modelspace for all their vertices, which means the transformation and packing of their normals can happen per-instance, not per-vertex. Of course, I realized this only after spending an hour or so reading Herf's papers and benchmarking his suggested variants.
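
In other words, something like this once per instance instead of once per vertex (a rough sketch of the idea; instanceWorld and the function name are made up):

// Sketch of the per-instance version: when every vertex of a mesh has the
// same modelspace normal (0, 0, 1), transform and pack it once per instance
// and stamp the result into each output vertex.
inline uint32 PackInstanceNormal(const D3DXMATRIX & instanceWorld)
{
    D3DXVECTOR3 modelNormal(0, 0, 1);
    D3DXVECTOR3 worldNormal;
    D3DXVec3TransformNormal(&worldNormal, &modelNormal, &instanceWorld);

    uint32 packed;
    // w is a don't-care here
    CompressNormalToD3DCOLOR(D3DXVECTOR4(worldNormal.x, worldNormal.y, worldNormal.z, 0), packed);
    return packed;   // write this same value into every vertex of the instance
}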

Well, at least I’ll be damn well prepared next time it comes up.

PC Gaming Must Die

Saturday, October 6th, 2007

My seemingly innocent question about querying the video card memory on Vista turned into a 42-post bloodbath, giving little in the way of useful answers, but illustrating perfectly how there is no such thing as a “PC gaming platform”, and why I want to get out of the PC mess ASAP.

The problem with the video card memory size isn't new. It's a question Microsoft actively lies to you about when answering; the simple query function in DirectX 9, IDirect3DDevice9::GetAvailableTextureMem, has always lied - it returns the memory on the video card plus the available physical system memory plus the current distance between the Mars moons Phobos and Deimos in Russian versts, divided by the temperature of the water in Lake Sammamish expressed in Réaumur degrees. An infiltrated enemy agent managed to sneak in the IDxDiag interface, which works more or less reliably on XP, but in the run-up to Vista he was discovered, shot, and the oversight was corrected: on Vista the same IDxDiag code returns rubbish too - to the extent that even the DxDiag tool shipped with DirectX, which countless QA staff and even users have been trained to run and send dumps from, has become useless in that regard. So now you have to resort to quality software engineering techniques such as using DirectX 7 or DirectX 10 in an otherwise top-to-bottom DirectX 9 application. Or running GetAvailableTextureMem and subtracting the physical memory. Or dividing it by three. Or assuming that everybody with Vista has 256 MB of RAM on the videocard - hey, it's the current mode, why not?
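
For the record, the "subtract the physical memory" hack from the list above looks something like this - a sketch only, and about as trustworthy as it sounds (whether you subtract total or available physical memory is anyone's guess):

#include <windows.h>
#include <d3d9.h>

// Sketch of the "GetAvailableTextureMem minus physical RAM" guess mentioned
// above. Not reliable - just the least bad of a bad bunch.
UINT GuessVideoMemoryBytes(IDirect3DDevice9 * device)
{
    UINT reported = device->GetAvailableTextureMem();   // VRAM + system RAM + versts

    MEMORYSTATUSEX status = { sizeof(status) };
    GlobalMemoryStatusEx(&status);

    if (reported > status.ullAvailPhys)
        return UINT(reported - status.ullAvailPhys);
    return reported;   // shrug
}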

Apparently the Microsoft position is that it's none of the developer's business how much fast memory he can use. Please pretend that you can use as many textures as you like, we'll take care of it. If we gave you this number, you'd only do stupid things with it… we're from the OS vendor, and we're here to help you! Relax and watch the blinkenlights. People even went so far as to suggest ridiculous things like starting the game with the options screen so the user can pick the best settings for himself (the "all my friends are geeks with $600 videocards" solution), or starting the game with the ugliest settings by default (the "who cares about review scores" solution). What's interesting is the clear demarcation line between the people who are actually shipping games to be sold to real-world humans for a living and find real value in knowing the video memory size, and the rest - technical evangelists, high-end demo creators and academics, whose idea of development is pushing high-end hardware around and occasionally presenting to enthusiast users.

The PC as a platform is hopelessly fragmented. The rift between the high end and the low end is bigger than ever, from the Crysis crowd, who consider 6800-class hardware "low-end", to the Zuma-clone crowd, who don't even have GPUs to speak of. The vendors - ATI^H^H^HAMD, NVIDIA, Microsoft - are each pulling the rug in their own direction, with little to no support for developers trying to stick to the rapidly disappearing "middle ground" that was mainstream PC gaming a few years ago. (The rumored Intel intrusion into the field, trying to push raytracing on multi-multicore CPUs, will make things much worse in this regard.) The publishers demand support for hardware (in our case, DX8.1-class GPUs) which has long since fallen off Microsoft's radar and isn't even targeted by the shader compilers released with the DirectX SDKs. The reviewers demand graphics quality rivaling the multimillion-dollar 6-hour cinematic fests subsidized by console vendors and passed off as "AAA games". The users demand not to have to think. And a pony.

If only there were platforms where the hardware was cheap and powerful, the drivers appeared three or four times a year, the vendor was eager to help your development, and there were tens of millions of users eager to buy games. I would gladly accept the lack of a GetAvailableTextureMemory function - I’d replace it with a compile-time constant in a heartbeat.