A bit more than a year ago I found ATI Tootle, an interesting mesh preprocessing tool for simultaneous vertex cache AND overdraw optimization. Somebody wisely commented then that “there’s probably something that is 1000x faster and 99% as good”. Well, that thing is here today and it’s bearing the equally ridiculous name Tipsy. I won’t get around to actually implementing it, just as I didn’t with Tootle, but I sincerely hope the fine purveyors of the relevant middleware we use will find the time.
The topic of fast float-to-int conversion is a favorite among game developers, optimization freaks and other scum like that - so I was somewhat puzzled that I had never encountered this problem in practice. Until today, that is.
We have some code which takes a bunch of mesh instances, transforms them and sticks them into a large vertex buffer in order to save draw calls. The source vertex data includes, among other things, a float4 containing in its xyz components a normal (range -1..1), which is transformed and compressed into three of the four bytes of a uint32. We have a function called CompressNormalToD3DCOLOR, which takes a D3DXVECTOR4 and outputs a uint32, processing all four components. It was written long ago by the happy elves of the Land of Perfect Compilers. This is how it looked:
inline void CompressNormalToD3DCOLOR( const D3DXVECTOR4 & normal, uint32 & n )
{
    D3DXVECTOR4 norm( normal );
    norm += D3DXVECTOR4( 1, 1, 1, 1 );
    norm *= 127.5f;
    uint32 r = uint32( norm.x );
    uint32 g = uint32( norm.y );
    uint32 b = uint32( norm.z );
    uint32 a = uint32( norm.w );
    n = (a << 24) + (r << 16) + (g << 8) + b;
}
In the land of Real Compilers, this straightforward piece of code compiles to 73 instructions, most of which are memory accesses (Visual C++ from Visual Studio 2005). The function is called a zillion times, and in a certain worst-case scenario which occurs on the 10th second after you run the game, it starts taking up to 25% of the frame time.
Some CPUs have a fscking instruction that does this, FFS.
My first reaction (after panic, which was the zeroth) was to make a special-case function which only processes three of the components, since in this case we are sure we don’t need the fourth. At the cost of an additional branch, half of the calls to the full function were eliminated, which led to about a 3x reduction in the time taken just to compress normals.
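The special-case version might look something like this - a minimal, self-contained sketch with hypothetical names (the real code takes a D3DXVECTOR4; here a bare struct stands in for it):

```cpp
#include <cstdint>

// Minimal stand-in for D3DXVECTOR4, so the sketch is self-contained.
struct Vec4 { float x, y, z, w; };

// Sketch of the three-component special case: when the caller doesn't
// need the alpha byte, one of the four float-to-int conversions (and the
// temporary vector) goes away.
inline void CompressNormal3(const Vec4 & normal, uint32_t & n)
{
    uint32_t r = uint32_t((normal.x + 1.0f) * 127.5f);
    uint32_t g = uint32_t((normal.y + 1.0f) * 127.5f);
    uint32_t b = uint32_t((normal.z + 1.0f) * 127.5f);
    n = (r << 16) | (g << 8) | b; // alpha byte left at zero
}
```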
I remembered I recently read something like an overview of the various float-to-int techniques on Mike Herf’s site. (Go there and read it. He’s the person behind the wonderfully smooth UI of Kai’s Power Tools, and 10 years later, Picasa.) I whipped up a simple synthetic benchmark with the original code, the three-component version, inline assembly FISTP, the double-magic-numbers trick, and the direct conversion trick (especially useful because in my case the normals are guaranteed to be in a certain range) - you can read the descriptions here. On my aging home Athlon 2 GHz, the original monstrosity takes about 200 clocks for one float4->uint32 conversion. The 3-component version is 136 clocks. Directly using FISTP via inline assembly, which requires manually setting the FPU control word, is only 28 clocks. The real superstars are the last two techniques: magic numbers, which takes 22 clocks, and direct conversion, which only takes 20 clocks - a tenfold improvement over the original code!
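For reference, the double-magic-number trick can be sketched like this (my reconstruction from Herf’s description, assuming a little-endian machine and the default round-to-nearest mode): adding 1.5 * 2^52 pushes the integer part of x into the low mantissa bits of the double, where it can be read back directly as a 32-bit integer.

```cpp
#include <cstdint>
#include <cstring>

// Magic-number float-to-int: after adding 1.5 * 2^52, the low 32 bits of
// the double's bit pattern hold the (round-to-nearest) integer value of x.
// Assumes little-endian; valid for |x| well below 2^31.
inline int32_t FastFloatToInt(double x)
{
    const double magic = 6755399441055744.0; // 1.5 * 2^52
    x += magic;
    int32_t result;
    std::memcpy(&result, &x, sizeof(result)); // low 32 bits of the mantissa
    return result;
}
```

Note that unlike a plain C cast it rounds to nearest instead of truncating toward zero, which is usually what you want for color packing anyway.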
Of course, all is not rosy. Wise men say FLD/FISTP doesn’t pipeline well. Magic numbers require that you keep your FPU in 53-bit precision mode - which a) isn’t a good idea and b) DirectX won’t honor. Direct conversion works if you can easily get your data in e.g. the [2, 4) range - notice how 4 is excluded: normals are [-1, 1], and it’s not trivial to get them into [-1, 1) without inflicting more clocks.
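The direct conversion relies on the fact that for x in [2, 4) the sign and exponent bits of an IEEE float are constant, so the top mantissa bits directly encode (x - 2)/2 as a fixed-point fraction. A minimal sketch (my own, not the exact code from the benchmark):

```cpp
#include <cstdint>
#include <cstring>

// Direct conversion: for x in [2, 4) the top eight of the 23 mantissa
// bits are exactly floor((x - 2) / 2 * 256) -- an 8-bit value extracted
// with no float-to-int instruction at all.
inline uint32_t FloatRangeToByte(float x) // x must be in [2, 4)
{
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));
    return (bits >> 15) & 0xFF; // top 8 mantissa bits
}
```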
So, what turned out to be the best type of float-to-int conversion?
The one that takes zero clocks, of course. It turned out that most of the meshes which undergo this packing have fixed (0, 0, 1) normals in modelspace for all their vertices, which means the transformation and packing of their normals can happen per-instance, not per-vertex. Of course, I realized this only after spending an hour or so in reading Herf’s papers and benchmarking his suggested variants.
Well, at least I’ll be damn well prepared next time it comes up.
My seemingly innocent question about querying the video card memory on Vista turned into a 42-post bloodbath, giving little in the way of useful answers, but illustrating perfectly how there is no such thing as a “PC gaming platform”, and why I want to get out of the PC mess ASAP.
The problem with the video card memory size isn’t new. It’s a question Microsoft actively lies about when answering; the simple query function in DirectX 9, IDirect3DDevice9::GetAvailableTextureMem, has always lied - it returns the memory on the video card plus the available physical system memory plus the current distance between the Mars moons Phobos and Deimos in Russian versts, divided by the temperature of the water in Lake Sammamish expressed in Réaumur degrees. An infiltrated enemy agent managed to sneak in the IDxDiag interface, which works more or less reliably on XP, but in the run-up to Vista he was discovered, shot down, and the oversight was corrected: on Vista the same IDxDiag code returns rubbish too - to the extent that even the DxDiag tool shipped with DirectX, which countless QA staff and even users have been trained to run and send dumps from, has become useless in that regard. So now you have to resort to quality software engineering techniques such as using DirectX 7 or DirectX 10 in an otherwise top-to-bottom DirectX 9 application. Or running GetAvailableTextureMem and subtracting the physical memory. Or dividing it by three. Or assuming that everybody with Vista has 256 MB of RAM on the videocard - hey, it’s the current mode, why not?
Apparently the Microsoft position is that it’s no business of the developer to know how much fast memory he can use. Please pretend that you can use as many textures as you like, we’ll take care of it. If we gave you this number, you’d only do stupid things with it… we’re from the OS vendor, and we’re here to help you! Relax and watch the blinkenlights. People even went so far as to suggest ridiculous things like starting the game with the options screen so the user can pick the best settings for himself (the “all my friends are geeks with $600 videocards” solution), or starting the game with the ugliest settings by default (the “who cares about review scores” solution). What’s interesting is the clear demarcation line between the people who are actually shipping games to be sold to real-world humans for a living, and who find real value in knowing the video memory size, and the rest - technical evangelists, high-end demo creators and academics, whose idea of development is pushing high-end hardware around and occasionally presenting to enthusiast users.
The PC as a platform is hopelessly fragmented. The rift between the high end and the low end is bigger than ever, from the Crysis crowd, who consider 6800-class hardware “low-end”, to the Zuma clone crowd, who don’t even have GPUs to speak of. The vendors - ATI^H^H^HAMD, NVIDIA, Microsoft - are each pulling the rug in their own direction, with little to no support for developers trying to stick to the rapidly disappearing “middle ground”, what was the mainstream of PC gaming a few years ago. (The rumored Intel intrusion into the field, trying to push raytracing on multi-multicore CPUs, will make things much worse in this regard.) The publishers demand support for hardware (in our case, DX81-class GPUs) which has long ago fallen off Microsoft’s radar and isn’t even targeted by the shader compilers released with the DirectX SDKs. The reviewers demand graphics quality rivaling the multimillion 6-hour cinematic fests subsidized by console vendors and passed off as “AAA games”. The users demand not to think. And a pony.
If only there were platforms where the hardware was cheap and powerful, the drivers appeared three or four times a year, the vendor was eager to help your development, and there were tens of millions of users eager to buy games. I would gladly accept the lack of a GetAvailableTextureMemory function - I’d replace it with a compile-time constant in a heartbeat.
I came across an interesting paper called Deferred Pixel Shading on the Playstation 3, by Alan Heirich and Louis Bavoil. They used the RSX as a pure rasterizer to build the G-buffers, then ran a pretty complex shadowing algorithm on five SPUs. They achieved 30 giga-ops (note that they don’t quote GFlops, which are much more commonly used to measure performance in this field - this is surely intentional) and around 11 GBytes/sec data transferred around the system.
Let’s convert this to more familiar terms, pretending that their “ops” are actually “flops” (there shouldn’t be much difference anyway, from what I know about the SPU instruction set). A game running at 30 fps in a 1280×720 resolution, without antialiasing, needs to shade 27.6 MPixels/sec. If you use 5 SPUs, like the authors of the paper, and achieve the same throughput, this means you’d have about 1000 operations per pixel; given that traditional GPU pixel shader instructions are usually four-wide, this would be roughly equivalent to a 200-250 instruction pixel shader. On the bandwidth side, you would have about 400 bytes per pixel. If you use, say, four 32-bit surfaces for your G-buffers - which is what I remember as normal from the deferred shading papers I’ve read - and want to write another 32 bits to the final framebuffer, this leaves you with over 300 bytes of extra data to shuffle around - various shadowbuffers, several passes etc. 250 instructions for the lighting shader itself is also pretty generous, even though it would have to be divided among several passes. (You’d realistically want to do MSAA or even SSAA for a real game, which would raise the bandwidth and computational cost significantly - but on the other hand, neither the 30 Gops nor the 11 GB/s are anywhere near the theoretical throughput of Cell.)
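Spelled out, the back-of-the-envelope arithmetic above looks like this:

```cpp
// 1280x720 at 30 fps, no antialiasing, against the paper's measured
// 30 Gops and 11 GB/s on five SPUs.
const double kFps    = 30.0;
const double kPixels = 1280.0 * 720.0;
const double kOps    = 30e9; // "ops", as the paper quotes them
const double kBytes  = 11e9;

double MPixelsPerSec() { return kFps * kPixels / 1e6; }      // ~27.6
double OpsPerPixel()   { return kOps / (kFps * kPixels); }   // ~1085
double BytesPerPixel() { return kBytes / (kFps * kPixels); } // ~398
```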
All in all, I fully expect to see games doing deferred shading on the Cell before the generation is over. You “just” need to come up with a renderer, scene, world and game design which can utilize the strengths of the deferred shading fully - so the title would stand apart from the forward-rendering crowd, which would justify the pain of getting this to work. But on paper (pun intended), the numbers add up - it definitely seems possible.
This wouldn’t be a real blog if I didn’t bitch about life in general, and post pictures of my cat. Unfortunately, I don’t have a cat. Fortunately, although there has been much to bitch about around me lately, I had two bright spots in my day today, both vaguely music-related. First, one of my favorite games, Rez, is coming to XBLA. I have spent lots of hours in Rez, trying to beat this or that boss; usually I quickly lose interest in games which are mostly about being hard and presenting a challenge (that’s why I’m not so excited about the other oldskool legend announced today for XBLA, Ikaruga). And in the best of times, I’m indifferent to the trance/electronic/whatchamacallit type of music in Rez. But there’s something about it that hypnotizes me for hours, chasing the lines on the screen. If there ever was a game that could benefit from HDTV, it would be Rez; in a perfect world it would even come with a vector screen. Eh, maybe I’m just a sucker for rail shooters, and the next announcement that would make me just as happy would be a Panzer Dragoon Orta spiritual sequel. By the way, Rez can’t work on the PS3 in its present form - it really needs the throbbing of the controller in your hands as part of the experience. (Please, no trance vibrator jokes… if at all possible.)
The second bright spot was this video of somebody called Richard Lewis (sorry, I even tried to learn who this guy was, without luck) singing a gentle love song accompanied by his Nintendo DS, strumming virtual chords with his stylus in a game called Jam Sessions. If this is not a killer app, I don’t know what is.
Both perform roughly the same job - they allow us to enter data in a central database, then present us with different views of the data, let us run queries and summarize the results. (One of them has more serious obligations handling one particular form of local view of the data, but I’m mostly talking about its other duties here.) Both are used by virtually the same set of people, most of them sitting in the same room; both of them are occasionally used by people around the globe, who are given access to our databases and connect to them thanks to the wonders of the Internet.
But these two applications aren’t created equal. One of them presents a rich, responsive interface, with all kinds of filtering, sorting and cross-referencing of relevant data. The other is slow, clunky, takes a constant amount of time (a couple of seconds) to query the server even for the most mundane of tasks, and is chained to the UI conventions of an ancient presentation framework designed 10 years ago to fulfill completely different tasks.
The two applications fulfill similar needs to similar groups of users. It makes exactly the same sense for both of them to be implemented as Web apps, rendering inefficient HTML, doing needless roundtrips to the server, relying on the mercy of not one, but two intermediaries (the browser and the web server). Thankfully, the first application is a native Win32 app.
I love the UI of Gmail, but I would gladly switch to a desktop email reader with the same UI conventions, connecting to a database somewhere in the world. What I like about Gmail is not the fact that it renders through my browser, but its nice, unorthodox UI. I use a great little local-client Gmail on my mobile phone, written in Java; it beats the crap out of even the mobile-optimized server-side Gmail running through Opera Mini. For purely political reasons Google will never release anything like that for the desktop - but I bet the experience would be vastly superior to even the snappiest AJAX-rendering browser.
Web applications are a wonderful thing, but they are not the only solution to everything. Having more than one user of an application, or even having remote, off-site users, is not a good reason by itself to suffer through HTML forms and stateless HTTP request/responses. AJAX tricks may make the user interface slightly more responsive, but they won’t ever turn Flickr into Picasa. Doing a quick and dirty job through a browser might be OK for something I do once or twice a month (e.g. paying a bill online, or ordering books), but for something that I use dozens of times a day - e.g. email, or bugtracking, or code reviews - I prefer a native client.
The good native application in the true story above is TortoiseSVN. The crappy web application is the Mantis bugtracker. Any comments suggesting that I replace Mantis with superior bugtracking brand XXX must include offers of assistance with converting about a dozen home-grown tools around it, with migrating around 10k bugs from 8 projects, and retraining on the order of 50 people, most of them fairly conservative artists.
Since quite a lot of the search strings leading people to this blog are related to DXT compression, I feel obliged to link the best papers on the subject I’ve had the pleasure to read (but, unfortunately, not to implement). Both are published on Intel’s site, both are written by an id Software programmer called J.M.P. van Waveren; for some reason, googling “waveren site:intel.com” doesn’t find both of them, although other search strings find each of them individually.
One of them is called Real-Time DXT Compression and presents a heavily optimized SSE2 DXT compressor; the other is called Real-Time Texture Streaming & Decompression and presents a similarly SSE2-ified JPEG-like scheme. Since Intel’s webmasters can’t be trusted to keep the URLs alive, you’ll be better off searching for them by keywords at site:intel.com. Both achieve very impressive rates - e.g. 200 MPixels/sec RGB-to-DXT1 compression and 190 MPixels/sec DCT decompression, both on a beefy Conroe chip.
Go ahead, dig in the SIMD intrinsics of your [employer’s] platform of choice. You have no excuse to load zlibbed DXTs anymore. (I hope no one is loading uncompressed TGA or BMPs in 2007, right? RIGHT?)
Somebody over at Beyond3D linked an interesting paper evaluating the tradeoffs between a software-managed local store (or scratchpad memory, as it was known on earlier Sony platforms) and the hardware-managed caches known from the desktop CPUs:
There are two basic models for the on-chip memory in CMP systems: hardware-managed coherent caches and software-managed streaming memory. This paper performs a direct comparison of the two models under the same set of assumptions about technology,
area, and computational capabilities. The goal is to quantify how and when they differ in terms of performance, energy consumption, bandwidth requirements, and latency tolerance for general-purpose CMPs. We demonstrate that for data-parallel applications, the cache-based and streaming models perform and scale equally well. For certain applications with little data reuse, streaming scales better due to better bandwidth use and macroscopic software prefetching. However, the introduction of techniques such as hardware prefetching and non-allocating stores to the cache-based model eliminates the streaming advantage. Overall, our results indicate that there is not sufficient advantage in building streaming memory systems where all on-chip memory structures are explicitly managed. On the other hand, we show that streaming at the programming model level is particularly beneficial, even with the cache-based model, as it enhances locality and creates opportunities for bandwidth optimizations. Moreover, we observe that stream programming is actually easier with the cache-based model because the hardware guarantees correct, best-effort execution even when the programmer cannot fully regularize an application’s code.
I’m still not convinced of the rightness of the thou shalt have no single GameObject school of thought. Having all objects in the game world present a uniform interface decouples many object-handling tasks from the internal representation of the object, the two chief examples being the editor and the script interface. We had to add very little code to the editor, for example, when we added new, different types of objects - grass, particle systems and terrain decals - for them, the entire set of operations such as move/scale/rotate, undo/redo, edit properties and save/load to file worked automagically, just because they were simply GameObjects. We really have three classes of GameObjects - very lightweight, simple and numerous (think grass), lightweight-dumb-static (think trees and rocks) and full-blown (think units and buildings) - but 90% of the engine code and 99% of the game code doesn’t know the distinction between them. (And we use normal Lua objects, not game-engine living-in-the-world GameObjects, for abstract gameplay entities like the economy simulation of the city, so that would count maybe as a fourth one.) So far the system seems to work OK, with significant advantages in code simplicity and memory footprint compared to the previous two systems we used. Two years ago we built a game around a single type of GameObject, even for abstract gameplay entities, and there was an invisible “economy” object placed somewhere on the map. And seven years ago we had a full-blown OOP-textbook-madness, seven-layers-deep hierarchy of GameObjects, complete with a mess of virtual functions and semi-complete implementations jumping back and forth between the layers of the inheritance tree.
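The uniform-interface idea can be sketched like this (all names hypothetical, nothing like our actual code): editor operations go through the base class, so a new object type gets them for free.

```cpp
#include <string>
#include <vector>

// A single GameObject base: the editor and scripts only ever see this.
struct GameObject
{
    GameObject() : x(0), y(0), z(0) {}
    virtual ~GameObject() {}
    virtual void Move(float dx, float dy, float dz) { x += dx; y += dy; z += dz; }
    virtual std::string TypeName() const = 0;
    float x, y, z;
};

// Two of the "classes" of objects; the editor doesn't care which is which.
struct Grass : GameObject { std::string TypeName() const { return "Grass"; } };
struct Unit  : GameObject { std::string TypeName() const { return "Unit"; } };

// An editor operation working on a mixed selection through the base class.
inline void MoveSelection(std::vector<GameObject*> & sel,
                          float dx, float dy, float dz)
{
    for (size_t i = 0; i < sel.size(); ++i)
        sel[i]->Move(dx, dy, dz);
}
```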
Seriously, who came up with the brilliant idea of storing the mip levels of a texture bigger-to-smaller after the header in a DDS file? This way, when you want to load just, say, the 256×256 mip level and all the smaller ones, you need to do two disjoint reads from the file - one to parse the DDS header, and another for the mip subchain. Nothing would be simpler than storing them smaller-to-bigger - that way you’d be able to read with a single read operation or, at worst, with two adjacent ones. When you’re trying to read textures asynchronously, the bigger-to-smaller order forces you either to keep a preloaded table of all the DDS headers in your game (which is a bad idea in many ways - it consumes memory proportional to the entire data set of the game instead of the currently needed data set, introduces additional asset build steps, and messes with the ability to let artists change texture formats and sizes on the fly while the game is running), or to do two asynchronous reads, doubling the latency for textures appearing on the screen.
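To see why, here is a sketch of the offset calculation (assuming an uncompressed 32-bit format for simplicity): the offset of the mip subchain you actually want depends on the sizes of all the larger mips, which you only know after parsing the header - hence the second, disjoint read.

```cpp
#include <cstddef>
#include <cstdint>

// Byte offset of mip level firstWantedMip in a bigger-to-smaller layout,
// for an uncompressed 32-bit-per-pixel texture. The top-level dimensions
// come from the DDS header, so this can't run until read #1 completes.
inline size_t MipChainOffset(uint32_t topWidth, uint32_t topHeight,
                             uint32_t firstWantedMip)
{
    size_t offset = 0;
    for (uint32_t mip = 0; mip < firstWantedMip; ++mip)
    {
        uint32_t w = topWidth >> mip;  if (w == 0) w = 1;
        uint32_t h = topHeight >> mip; if (h == 0) h = 1;
        offset += size_t(w) * h * 4; // 4 bytes per pixel
    }
    return offset; // the second read must start here
}
```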
I try hard not to be one of those not-invented-here guys who insist on having their own data processing tools and file formats for everything, but it’s not easy…