Archive for September, 2007

Deferred Pixel Shading on Cell

Thursday, September 13th, 2007

I came across an interesting paper called Deferred Pixel Shading on the Playstation 3, by Alan Heirich and Louis Bavoil. They used the RSX as a pure rasterizer to build the G-buffers, then ran a pretty complex shadowing algorithm on five SPUs. They achieved 30 giga-ops (note that they don’t quote GFlops, which are much more commonly used to measure performance in this field - this is surely intentional) and around 11 GBytes/sec data transferred around the system.

Let’s convert this to more familiar terms, pretending that their “ops” are actually “flops” (there shouldn’t be much difference anyway, from what I know about the SPU instruction set). A game running at 30 fps in a 1280×720 resolution, without antialiasing, needs to shade 27.6 MPixels/sec. If you use 5 SPUs, like the authors of the paper, and achieve the same throughput, this means you’d have about 1000 operations per pixel; given that traditional GPU pixel shader instructions are usually four-wide, this would be roughly equivalent to a 200-250 instruction pixel shader. On the bandwidth side, you would have about 400 bytes per pixel. If you use, say, four 32-bit surfaces for your G-buffers - which is what I remember as normal from the deferred shading papers I’ve read - and want to write another 32 bits to the final framebuffer, this leaves you with over 300 bytes of extra data to shuffle around - various shadowbuffers, several passes etc. 250 instructions for the lighting shader itself is also pretty generous, even though it would have to be divided among several passes. (You’d realistically want to do MSAA or even SSAA for a real game, which would raise the bandwidth and computational cost significantly - but on the other hand, neither the 30 Gops nor the 11 GB/s are anywhere near the theoretical throughput of Cell.)
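
For anyone who wants to redo the arithmetic with their own resolution or frame rate, here is a minimal Python sketch of the budget above. The function and variable names are mine, not the paper's, and I'm assuming "giga" means 10^9 for both ops and bytes. With the paper's figures it comes out at roughly 1,085 operations and just under 400 bytes per pixel, which is where the rounded "about 1000" and "about 400" come from.

```python
# Back-of-the-envelope per-pixel budget, using the figures quoted in the post.
# Assumption: "giga" is 10**9 for both the ops and the bytes figure.
WIDTH, HEIGHT, FPS = 1280, 720, 30
OPS_PER_SEC = 30e9      # 30 giga-ops reported in the paper
BYTES_PER_SEC = 11e9    # ~11 GBytes/sec reported in the paper


def per_pixel_budget(width, height, fps, ops_per_sec, bytes_per_sec):
    """Return (pixels/sec, ops per pixel, bytes per pixel) for one shading pass per frame."""
    pixels_per_sec = width * height * fps
    return (pixels_per_sec,
            ops_per_sec / pixels_per_sec,
            bytes_per_sec / pixels_per_sec)


pixels_per_sec, ops_per_pixel, bytes_per_pixel = per_pixel_budget(
    WIDTH, HEIGHT, FPS, OPS_PER_SEC, BYTES_PER_SEC)

# Four 32-bit G-buffer surfaces plus one 32-bit write to the final framebuffer.
gbuffer_bytes = 4 * 4 + 4
spare_bytes = bytes_per_pixel - gbuffer_bytes

# Rough conversion to four-wide GPU shader instructions (the post's rule of thumb).
vec4_instructions = ops_per_pixel / 4

print(f"{pixels_per_sec / 1e6:.1f} MPixels/sec")
print(f"{ops_per_pixel:.0f} ops/pixel, ~{vec4_instructions:.0f} four-wide instructions")
print(f"{bytes_per_pixel:.0f} bytes/pixel, {spare_bytes:.0f} left after G-buffer traffic")
```

Change WIDTH, HEIGHT or FPS to see how quickly the budget shrinks at 1080p or 60 fps; multiplying the pixel count by a sample count gives a crude upper bound for supersampling.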

All in all, I fully expect to see games doing deferred shading on the Cell before the generation is over. You “just” need to come up with a renderer, scene, world and game design which can fully utilize the strengths of deferred shading - so the title would stand apart from the forward-rendering crowd, which would justify the pain of getting this to work. But on paper (pun intended), the numbers add up - it definitely seems possible.

Some days are better than others

Wednesday, September 12th, 2007

This wouldn’t be a real blog if I didn’t bitch about life in general, and post pictures of my cat. Unfortunately, I don’t have a cat. Fortunately, although there has been much to bitch about around me lately, I had two bright spots in my day today, both vaguely music-related. First, one of my favorite games, Rez, is coming to XBLA. I have spent lots of hours in Rez, trying to beat this or that boss; usually I quickly lose interest in games which are mostly about being hard and presenting a challenge (that’s why I’m not so excited about the other oldskool legend announced today for XBLA, Ikaruga). And in the best of times, I’m indifferent to the trance/electronic/whatchamacallit type of music in Rez. But there’s something about it that hypnotizes me for hours, chasing the lines on the screen. If there was a game that could benefit from HDTV, it would be Rez; in a perfect world it should come with a vector screen, even. Eh, maybe I’m just a sucker for rail shooters, and the next announcement that would make me just as happy would be a Panzer Dragoon Orta spiritual sequel. By the way, Rez can’t work on the PS3 in its present form - it really needs the throbbing of the controller in your hands as part of the experience. (Please, no trance vibrator jokes… if at all possible.)

The second bright spot was this video of somebody called Richard Lewis (sorry, I even tried to learn who this guy was, without luck) singing a gentle love song accompanied by his Nintendo DS, strumming virtual chords with his stylus in a game called Jam Sessions. If this is not a killer app, I don’t know what is.

Web Applications Must Die

Saturday, September 8th, 2007

Close behind the obligatory text editor and IDE, there are two applications everybody on our software team uses a lot on a daily basis.

Both perform roughly the same job - they allow us to enter data in a central database, then present us with different views of the data, let us run queries and summarize the results. (One of them has more serious obligations handling one particular form of local view of the data, but I’m mostly talking about its other duties here.) Both are used by virtually the same set of people, most of them sitting in the same room; both of them are occasionally used by people around the globe, who are given access to our databases, and connect to them thanks to the wonders of the Internet.

But these two applications aren’t created equal. One of them presents a rich, responsive interface, with all kinds of filtering, sorting, cross-referencing of relevant data. The other is slow, clunky, takes a constant amount of time (a couple of seconds) to query the server even for the most mundane of tasks, and is chained to the UI conventions of an ancient presentation framework designed 10 years ago to fulfill completely different tasks.

The two applications fulfill similar needs to similar groups of users. It makes exactly the same sense for both of them to be implemented as Web apps, rendering inefficient HTML, doing needless roundtrips to the server, relying on the mercy of not one, but two intermediaries (the browser and the web server). Thankfully, the first application is a native Win32 app.

I love the UI of Gmail, but I would gladly switch to a desktop email reader with the same UI conventions, connecting to a database somewhere in the world. What I like about Gmail is not the fact that it renders through my browser, but its nice, unorthodox UI. I use a great little local-client Gmail on my mobile phone, written in Java; it beats the crap out of even the mobile-optimized server-side Gmail running through Opera Mini. For purely political reasons Google will never release anything like that for the desktop - but I bet the experience would be vastly superior to even the snappiest AJAX-rendering browser.

Web applications are a wonderful thing, but they are not the only solution to everything. Having more than one user of an application, and even having remote, off-site users, is not a good reason by itself to suffer through HTML forms and stateless HTTP request/responses. AJAX tricks may make the user interface slightly more responsive, but they won’t ever turn Flickr into Picasa. Doing a quick and dirty job through a browser might be OK for something I do once or twice a month (e.g. paying a bill online, or ordering books), but for something that I use dozens of times a day - e.g. email, or bugtracking, or code reviews - I prefer a native client.

The good native application in the true story above is TortoiseSVN. The crappy web application is the Mantis bugtracker. Any comments suggesting that I replace Mantis with superior bugtracking brand XXX must include offers of assistance with converting about a dozen home-grown tools around it, with migrating around 10k bugs from 8 projects, and retraining on the order of 50 people, most of them fairly conservative artists.

DXTn Compression

Monday, September 3rd, 2007

Since quite a lot of the search strings leading people to this blog are related to DXT compression, I feel obliged to link the best papers on the subject I’ve had the pleasure to read (but, unfortunately, not to implement). Both are published on Intel’s site, both are written by an id Software programmer called J.M.P. van Waveren; for some reason, googling “waveren site:intel.com” doesn’t find both of them, although other search strings find each of them individually.

One of them is called Real-Time DXT Compression and presents a heavily optimized SSE2 DXT compressor; the other is called Real-Time Texture Streaming & Decompression and presents a similarly SSE2-ified JPEG-alike scheme. Since Intel’s webmasters can’t be trusted to keep the URLs alive, you’ll be better off searching for them by keyword with site:intel.com. Both achieve very impressive rates - e.g. 200 MPixels/sec RGB to DXT1 compression and 190 MPixels/sec DCT decompression, both on a beefy Conroe chip.

Go ahead, dig into the SIMD intrinsics of your [employer’s] platform of choice. You have no excuse to load zlibbed DXTs anymore. (I hope no one is loading uncompressed TGAs or BMPs in 2007, right? RIGHT?)

Local storage vs. caches

Monday, September 3rd, 2007

Somebody over at Beyond3D linked an interesting paper evaluating the tradeoffs between a software-managed local store (or scratchpad memory, as it was known on earlier Sony platforms) and the hardware-managed caches known from the desktop CPUs:

There are two basic models for the on-chip memory in CMP systems: hardware-managed coherent caches and software-managed streaming memory. This paper performs a direct comparison of the two models under the same set of assumptions about technology, area, and computational capabilities. The goal is to quantify how and when they differ in terms of performance, energy consumption, bandwidth requirements, and latency tolerance for general-purpose CMPs. We demonstrate that for data-parallel applications, the cache-based and streaming models perform and scale equally well. For certain applications with little data reuse, streaming scales better due to better bandwidth use and macroscopic software prefetching. However, the introduction of techniques such as hardware prefetching and non-allocating stores to the cache-based model eliminates the streaming advantage. Overall, our results indicate that there is not sufficient advantage in building streaming memory systems where all on-chip memory structures are explicitly managed. On the other hand, we show that streaming at the programming model level is particularly beneficial, even with the cache-based model, as it enhances locality and creates opportunities for bandwidth optimizations. Moreover, we observe that stream programming is actually easier with the cache-based model because the hardware guarantees correct, best-effort execution even when the programmer cannot fully regularize an application’s code.