Archive for the ‘Gamedev’ Category

The best type of float-to-int conversion

Tuesday, October 16th, 2007

The topic of fast float-to-int conversion is one of the favorites among game developers, optimization freaks and other scum like that - so I was somewhat puzzled that I had never encountered this problem in practice. Until today, that is.

We have some code which takes a bunch of mesh instances, transforms them and sticks them into a large vertex buffer in order to save draw calls. The source vertex data includes, among other things, a float4 containing in its xyz components a normal (range -1..1), which is transformed and compressed into three of the four bytes of a uint32. We have a function called CompressNormalToD3DCOLOR, which takes a D3DXVECTOR4 and outputs a uint32, processing all four components. It was written long ago by the happy elves of the Land of Perfect Compilers. This is how it looked:

inline void CompressNormalToD3DCOLOR(const D3DXVECTOR4 & normal, uint32 & n )
{
    D3DXVECTOR4 norm( normal );
    norm += D3DXVECTOR4( 1, 1, 1, 1 );
    norm *= 127.5f;

    uint32 r = uint32(norm.x);
    uint32 g = uint32(norm.y);
    uint32 b = uint32(norm.z);
    uint32 a = uint32(norm.w);

    n = (a<<24) + (r<<16) + (g<<8) + b;
}

Innocent enough?

In the land of Real Compilers, this straightforward piece of code compiles to 73 instructions, most of which are memory accesses (Visual C++ from Visual Studio 2005). The function is called a zillion times, and in a certain worst-case scenario, which occurs on the 10th second after you run the game, it starts taking up to 25% of the frame time.

Some CPUs have a fscking instruction that does this, FFS.

My first reaction (after panic, which was the zeroth) was to make a special-case function which only processes three of the components, since in this case we are sure we don't need the fourth. At the cost of an additional branch, half of the calls to the original function were eliminated, which altogether led to about a 3x reduction in the time taken just to compress normals.
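
At the risk of spelling out the obvious, here is a minimal sketch of what such a special case might look like - the name and the hard-coded alpha are mine, not necessarily what went into our codebase:

inline void CompressNormal3ToD3DCOLOR(const D3DXVECTOR4 & normal, uint32 & n)
{
    // Only x, y and z are compressed; alpha is forced to 255 since nobody reads it.
    uint32 r = uint32((normal.x + 1.0f) * 127.5f);
    uint32 g = uint32((normal.y + 1.0f) * 127.5f);
    uint32 b = uint32((normal.z + 1.0f) * 127.5f);

    n = (255u << 24) + (r << 16) + (g << 8) + b;
}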

I remembered I had recently read an overview of the various float-to-int techniques on Mike Herf's site. (Go there and read it. He's the person behind the wonderfully smooth UI of Kai's Power Tools and, 10 years later, Picasa.) I whipped up a simple synthetic benchmark with the original code, the three-component version, inline assembly FISTP, the double magic-numbers trick, and the direct conversion trick (especially useful because in my case the normals are guaranteed to be in a certain range) - you can read the descriptions here. On my aging 2 GHz Athlon at home, the original monstrosity takes about 200 clocks for one float4->uint32 conversion. The three-component version is 136 clocks. Directly using FISTP via inline assembly, which requires manually setting the FPU control word, takes only 28 clocks. The real superstars are the last two techniques: magic numbers, which takes 22 clocks, and direct conversion, which takes only 20 clocks - a tenfold improvement over the original code!

Of course, all is not rosy. Wise men say FLD/FISTP doesn't pipeline well. Magic numbers require that you keep your FPU in 53-bit precision mode - which a) isn't a good idea and b) DirectX won't honor. Direct conversion works if you can easily get your data into, e.g., the [2, 4) range - notice that 4 is excluded: normals are in [-1, 1], and it's not trivial to get them into [-1, 1) without spending more clocks.
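
For the record, here is roughly what those two tricks boil down to - a sketch under the usual assumptions (little-endian x86, round-to-nearest, 32-bit int), with the constants taken from Herf's descriptions rather than from our code:

#include <cstring>

// Magic numbers: adding 2^52 + 2^51 to a double forces the rounded integer
// into the low 32 bits of its mantissa. Needs the FPU in 53-bit precision mode.
inline int FloatToIntMagic(double d)
{
    d += 6755399441055744.0;      // 2^52 + 2^51
    int i;
    std::memcpy(&i, &d, 4);       // low dword on a little-endian machine
    return i;
}

// Direct conversion: for x in [0, 1), adding 1.0f pins the exponent to that of
// [1, 2), so the top mantissa bits are a fixed-point fraction - read back 8 of them.
inline unsigned int UnitFloatToByte(float x)   // x must be in [0, 1)
{
    float biased = x + 1.0f;                   // now in [1, 2)
    unsigned int bits;
    std::memcpy(&bits, &biased, 4);
    return (bits >> 15) & 0xFF;                // floor(x * 256)
}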

So, what turned out to be the best type of float-to-int conversion?

The one that takes zero clocks, of course. It turned out that most of the meshes which undergo this packing have fixed (0, 0, 1) normals in modelspace for all their vertices, which means the transformation and packing of their normals can happen per-instance, not per-vertex. Of course, I realized this only after spending an hour or so reading Herf's papers and benchmarking his suggested variants.
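
In other words, the whole thing collapses to something like this per instance - PackedVertex and the parameter names are made up for illustration, and the transform is assumed to be rigid, so no inverse-transpose fuss:

// A hypothetical helper: transform and compress the shared modelspace normal
// once per instance, then stamp the result into every vertex.
void PackInstanceNormals(const D3DXMATRIX & instanceTransform,
                         PackedVertex * outVertices, size_t vertexCount)
{
    D3DXVECTOR4 modelNormal(0, 0, 1, 0);   // the fixed modelspace normal
    D3DXVECTOR4 worldNormal;
    D3DXVec4Transform(&worldNormal, &modelNormal, &instanceTransform);

    uint32 packedNormal;
    CompressNormalToD3DCOLOR(worldNormal, packedNormal);

    for (size_t i = 0; i < vertexCount; ++i)
        outVertices[i].normal = packedNormal;   // same packed value for every vertex
}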

Well, at least I’ll be damn well prepared next time it comes up.

Deferred Pixel Shading on Cell

Thursday, September 13th, 2007

I came across an interesting paper called Deferred Pixel Shading on the Playstation 3, by Alan Heirich and Louis Bavoil. They used the RSX as a pure rasterizer to build the G-buffers, then ran a pretty complex shadowing algorithm on five SPUs. They achieved 30 giga-ops (note that they don't quote GFlops, which are much more commonly used to measure performance in this field - this is surely intentional) and around 11 GB/s of data transferred around the system.

Let's convert this to more familiar terms, pretending that their “ops” are actually “flops” (there shouldn't be much difference anyway, from what I know about the SPU instruction set). A game running at 30 fps in a 1280×720 resolution, without antialiasing, needs to shade 27.6 MPixels/sec. If you use 5 SPUs, like the authors of the paper, and achieve the same throughput, this means you'd have about 1000 operations per pixel; given that traditional GPU pixel shader instructions are usually four-wide, this would be roughly equivalent to a 200-250 instruction pixel shader. On the bandwidth side, you would have about 400 bytes per pixel. If you use, say, four 32-bit surfaces for your G-buffers - which is what I remember as normal from the deferred shading papers I've read - and want to write another 32 bits to the final framebuffer, this leaves you with over 300 bytes of extra data to shuffle around - various shadowbuffers, several passes etc. 250 instructions for the lighting shader itself is also pretty generous, even though it would have to be divided among several passes. (You'd realistically want to do MSAA or even SSAA for a real game, which would raise the bandwidth and computational cost significantly - but on the other hand, neither the 30 Gops nor the 11 GB/s are anywhere near the theoretical throughput of Cell.)
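
The back-of-the-envelope arithmetic, spelled out so you can plug in your own resolution or figures (the numbers below are just the ones from the paper):

#include <cstdio>

int main()
{
    const double pixelsPerSecond = 1280.0 * 720.0 * 30.0;  // ~27.6 MPixels/sec
    const double opsPerSecond    = 30e9;                    // 30 Gops on 5 SPUs
    const double bytesPerSecond  = 11e9;                    // 11 GB/s

    std::printf("ops per pixel:   %.0f\n", opsPerSecond / pixelsPerSecond);    // ~1085
    std::printf("bytes per pixel: %.0f\n", bytesPerSecond / pixelsPerSecond);  // ~398
    return 0;
}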

All in all, I fully expect to see games doing deferred shading on the Cell before the generation is over. You “just” need to come up with a renderer, scene, world and game design which can fully utilize the strengths of deferred shading - so that the title stands apart from the forward-rendering crowd, which would justify the pain of getting this to work. But on paper (pun intended), the numbers add up - it definitely seems possible.

DXTn Compression

Monday, September 3rd, 2007

Since quite a lot of the search strings leading people to this blog are related to DXT compression, I feel obliged to link the best papers on the subject I've had the pleasure to read (but, unfortunately, not to implement). Both are published on Intel's site, and both are written by an id Software programmer called J.M.P. van Waveren; for some reason, googling “waveren site:intel.com” doesn't find both of them, although other search strings find each of them individually.

One of them is called Real-Time DXT Compression and presents a heavily optimized SSE2 DXT compressor; the other is called Real-Time Texture Streaming & Decompression and presents a similarly SSE2-ified JPEG-like scheme. Since Intel's webmasters can't be trusted to keep the URLs alive, you'll be better off searching for them by keywords at site:intel.com. Both achieve very impressive rates - e.g. 200 MPixels/sec RGB-to-DXT1 compression and 190 MPixels/sec DCT decompression, both on a beefy Conroe chip.

Go ahead, dig into the SIMD intrinsics of your [employer's] platform of choice. You have no excuse to load zlibbed DXTs anymore. (I hope no one is loading uncompressed TGAs or BMPs in 2007, right? RIGHT?)

Don’t Fear The God Object

Thursday, August 30th, 2007

I'm still not convinced of the rightness of the thou shalt have no single GameObject school of thought. Having all objects present in the game world expose a uniform interface decouples many object-handling tasks from the internal representation of the object, the two chief examples being the editor and the script interface. We had to add very little code to the editor, for example, when we added new, different types of objects - grass, particle systems and terrain decals - for them, the entire set of operations such as move/scale/rotate, undo/redo, edit properties and save/load to file worked automagically, just because they were simply GameObjects. We really have three classes of GameObjects - very lightweight, simple and numerous (think grass), lightweight-dumb-static (think trees and rocks) and full-blown (think units and buildings) - but 90% of the engine code and 99% of the game code doesn't know the distinction between them. (And we use normal Lua objects, not game-engine living-in-the-world GameObjects, for abstract gameplay entities like the economy simulation of the city, so that would count maybe as a fourth class.) So far the system seems to work OK, with significant advantages in code simplicity and memory footprint compared to the previous two systems we used. Two years ago we built a game around a single type of GameObject, even for abstract gameplay entities, and there was an invisible “economy” object placed somewhere on the map. And seven years ago we had a full-blown OOP-textbook-madness, seven-layers-deep hierarchy of GameObjects, complete with a mess of virtual functions and semi-complete implementations jumping back and forth between the layers of the inheritance tree.
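
For the sake of concreteness, the shape of the thing - with illustrative class and method names, since the real interface is both larger and messier than this sketch:

// Forward declarations of the engine-side types the interface talks about.
class Matrix;
class Serializer;
class PropertyGrid;

// The uniform interface the editor and the script binding see.
class GameObject
{
public:
    virtual ~GameObject() {}

    virtual void SetTransform(const Matrix & m) = 0;       // move/scale/rotate
    virtual void GetTransform(Matrix & m) const = 0;

    virtual void Save(Serializer & s) const = 0;           // save/load, undo/redo
    virtual void Load(Serializer & s) = 0;

    virtual void EditProperties(PropertyGrid & grid) = 0;  // property editing
};

// The three flavors behind it, which 90% of the code never distinguishes:
class GrassObject    : public GameObject { /* numerous, very lightweight */ };
class StaticObject   : public GameObject { /* trees, rocks: dumb and static */ };
class FullGameObject : public GameObject { /* units, buildings: the works */ };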

DVD-Hostile File Formats

Thursday, August 30th, 2007

Seriously, who came up with the brilliant idea of storing the mip levels of a texture bigger-to-smaller after the header in a DDS file? This way, when you want to load just, say, mip level 256×256 and all the smaller ones, you need to do two disjoint reads from the file - one to parse the DDS header, and another for the mip subchain. Nothing would be simpler than storing them smaller-to-bigger - that way you'd be able to read with a single read operation or, at worst, with two adjacent ones. When you're trying to read textures asynchronously, the bigger-to-smaller order forces you to either keep a preloaded table of all the DDS headers in your game (which is a bad idea in many ways - it consumes memory proportional to the entire data set of the game instead of the currently needed data set, introduces additional asset build steps, and messes with the ability to let artists change texture formats and sizes on the fly while the game is running), or to do two asynchronous reads, doubling the latency for textures appearing on the screen.
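
To make the gripe concrete: if you know a texture's dimensions and format up front (from a manifest, or simply by convention), the offset of the subchain starting at a given mip is a trivial sum, and with a smaller-to-bigger layout the whole subchain would be one contiguous read. A hypothetical helper, assuming a plain uncompressed format - block-compressed formats would count 4×4 blocks instead:

#include <cstddef>

// Byte offset of mip level 'firstMip' within the pixel data of a file that
// stores mips bigger-to-smaller (the actual DDS layout). Add the header size
// to get the absolute file offset of the subchain you want to stream in.
size_t MipSubchainOffset(size_t width, size_t height,
                         size_t bytesPerPixel, size_t firstMip)
{
    size_t offset = 0;
    for (size_t level = 0; level < firstMip; ++level)
    {
        offset += width * height * bytesPerPixel;
        width  = width  > 1 ? width  / 2 : 1;
        height = height > 1 ? height / 2 : 1;
    }
    return offset;
}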

I try hard not to be one of those not-invented-here guys who insist on having their own data processing tools and file formats for everything, but it’s not easy…

The Four Screens

Wednesday, August 29th, 2007

I've been thinking lately how a game programmer should have not the (hopefully) standard two, but four monitors on his PC. Besides the usual two for the game itself and his favorite IDE, he must have one for the generated machine code, because you can't really trust a compiler even with the most basic of tasks. And one for his profiler of choice, to always keep an eye on what the game is really doing.

Real-world example: a misunderstood feature for dynamic remapping of terrain layer textures leads to a per-frame resolving of texture filenames to device texture objects. Textures are looked up by filename, which must be handled without regard for case, and stored in a hash map. The unfortunate combination of three seemingly simple, convenient tools - dynamic reloading of the terrain table, a case-insensitive std::string wrapper, and a hash map - leads to 40,000 calls to tolower() per frame. Not a big deal - nothing you'd even think about without a profiler. But there goes 1% of the total CPU time of a completely CPU-bound game - a percent you'd have a hard time shaving off somewhere else.
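
The fix is as boring as the bug: fold the case once, when the lookup key is built, instead of once per character per comparison inside the wrapper. Something along these lines - Texture and the map are stand-ins for the real types, and the 2005-era code would use a hash_map rather than std::unordered_map:

#include <cctype>
#include <cstddef>
#include <string>
#include <unordered_map>

class Texture;   // stand-in for the device texture object

// One lowercase pass per key, done when the key is created.
std::string FoldCase(const std::string & name)
{
    std::string folded(name);
    for (std::size_t i = 0; i < folded.size(); ++i)
        folded[i] = (char)std::tolower((unsigned char)folded[i]);
    return folded;
}

std::unordered_map<std::string, Texture *> g_texturesByName;   // keys stored pre-folded

Texture * FindTexture(const std::string & filename)
{
    std::unordered_map<std::string, Texture *>::const_iterator it =
        g_texturesByName.find(FoldCase(filename));
    return it != g_texturesByName.end() ? it->second : 0;
}

(The even better fix is to not resolve filenames every frame at all, but that's a separate story.)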

Another one: an innocent, smallish 12-byte structure is augmented with a “dirty” flag, to keep track of yet another type of dynamically modifiable data. The structure is put in a three-dimensional array - which should be banned outright, by the way. Then the array is “sliced” - iterated over (the wrong) two of the dimensions - and the dirty flag is checked. This particular access pattern and the particular array and structure sizes lead to about 4,000 successive cache misses per frame, which take up to 2.5% of the frame time on the target platform if you're going for 30 fps. See it in a profiler, spend 5 minutes moving the dirty flags to a separate array (or keep a “modified data” queue aside if you have 30 minutes), and shave off those 2.5%. Or don't, and ship a seemingly bottleneck-free but severely underperforming game, full of such pearls of wisdom.
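
The five-minute version of the fix, with invented names - the payload stays where it is, and the per-frame scan walks a small, dense array of flags instead of striding through the 3D array in the wrong order:

#include <cstddef>
#include <vector>

struct Cell { float a, b, c; };   // the original 12-byte payload, flag removed

struct CellGrid
{
    std::vector<Cell>          cells;   // indexed however the 3D array was
    std::vector<unsigned char> dirty;   // one tightly packed flag per cell

    void MarkDirty(std::size_t index) { dirty[index] = 1; }

    void FlushDirty()
    {
        // Linear scan over a dense array: the prefetcher's favorite access pattern.
        for (std::size_t i = 0; i < dirty.size(); ++i)
            if (dirty[i])
            {
                // ... update whatever depends on cells[i] ...
                dirty[i] = 0;
            }
    }
};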

Normal Maps Compression

Saturday, January 20th, 2007

Once again I set out to prove people smarter than me are wrong by trying various things for tangent-space normal map compression, and once again I found myself wrong.

Lesson one: you can’t stuff normals in DXT1, no matter how much voodoo you do on the reconstruction side. The blocks show through, painting little 4×4 bricks on top of the bricks of my test texture.

Lesson two: if you resort to 8 bits per pixel, try the so-called DXT5_NM format. It stores the X component in alpha, the Y component replicated in the color channels, and reconstructs Z = sqrt(1 - X*X - Y*Y). It looks OK, and has no visible blocking artifacts.

The problem with reconstructing Z from X and Y is that it's a non-linear operation, and it breaks down under bilinear filtering - just think of what Z component you'd reconstruct when fetching right in the middle between two pixels, one with (1, 0) and the other with (0, 1) XY components. The error is greatest near the equator (small Z values), but it is always there. Thankfully, real-world normal maps rarely have sharp normal boundaries, and most of the normals point straight up, so the general appearance of bumpiness is preserved. And it sure beats the blocks you get if you keep the Z values somewhere in the DXT5 block, or the halved resolution you get if you go to two bytes per texel.
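
In code, the reconstruction and the midpoint example above look like this - plain C++ standing in for the pixel shader:

#include <cmath>
#include <cstdio>

// DXT5_NM reconstruction: X comes from alpha, Y from the color channels,
// both in [0, 1]; expand to [-1, 1] and rebuild Z from the unit-length constraint.
void ReconstructNormal(float texX, float texY, float out[3])
{
    float x = texX * 2.0f - 1.0f;
    float y = texY * 2.0f - 1.0f;
    float zSq = 1.0f - x * x - y * y;
    out[0] = x;
    out[1] = y;
    out[2] = zSq > 0.0f ? std::sqrt(zSq) : 0.0f;
}

int main()
{
    // Bilinear filtering halfway between pixels storing (1, 0) and (0, 1)
    // yields XY = (0.5, 0.5), i.e. texture values (0.75, 0.75). Reconstruction
    // then gives Z = sqrt(1 - 0.25 - 0.25) ~= 0.707, whereas the renormalized
    // filtered normal would have Z = 0.
    float n[3];
    ReconstructNormal(0.75f, 0.75f, n);
    std::printf("%.3f %.3f %.3f\n", n[0], n[1], n[2]);   // 0.500 0.500 0.707
    return 0;
}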

The Case for 64-bit OSes

Friday, September 8th, 2006

I did a small experiment in memory allocation today. My work PC has 2 GB of physical RAM. I tried allocating:

  • 1 GB in one go: failed
  • 2 x 512 MB: failed
  • 1 x 900 MB: succeeded
  • 2048 x 1 MB: failed around the 1800th allocation

Clearly something is fragmenting the address space - a DLL with a fixed load address, the AGP aperture, maybe some stubborn driver - I don't know enough about Windows internals to find out what it is. But I do know I don't want to have to worry about whether it will be possible to allocate 1 GB on a machine with 2 GB of physical RAM!
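
The experiment itself is nothing fancier than a loop around malloc (VirtualAlloc would let you ask the OS more directly, but the point stands); roughly:

#include <cstdio>
#include <cstdlib>

int main()
{
    // Probe how many 1 MB chunks a 32-bit process gets before the address
    // space runs out of holes. The memory is deliberately leaked - it's a probe.
    const unsigned chunk = 1024 * 1024;
    unsigned gotMB = 0;
    for (unsigned i = 0; i < 2048; ++i)
    {
        if (std::malloc(chunk) == 0)
            break;
        ++gotMB;
    }
    std::printf("got %u MB before the first failure\n", gotMB);
    return 0;
}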

Iron Lua, please!

Thursday, September 7th, 2006

Via Let's Kill Dave comes the news that IronPython 1.0 is out. And you could even use it to build XNA games, as it compiles to the same MSIL that C# compiles to.

Python is sooo 2004, however. Lua’s where all the cool kids, sorry, responsible game developers are. Several years ago His Moonness, Roberto himself, had worked on a CLR incarnation of Lua; maybe it’s time to dust it off, bring it up to .NET 2.0 and Lua 5.1, and release it. For great justice.

Active vs. Passive Observer

Monday, September 4th, 2006

An interesting, and somewhat obvious in hindsight, distinction made by John Carmack in his QuakeCon keynote between active and passive observers of a game's visuals. The passive observer is someone who is looking at a screenshot or watching another person play; the active observer is the user in control of the game. The passive observer is much more likely to notice a bad texel/pixel ratio, aliasing on the trees in the distance or wobbly shadows; for him, the image is the end. The active observer treats the image as a means to an end, which is the game itself. Essentially, we're marketing games to passive observers, who'll buy them and then experience them as active observers.