The Four Screens

August 29th, 2007

I’ve been thinking lately how a game programmer should have not the (hopefully) standard two, but four monitors on his PC. Besides the usual two for the game itself and his favorite IDE, he must have one for the generated machine code, because you can’t really trust a compiler even with the most basic of tasks. And one for his profiler of choice, to always keep an eye on what the game is really doing.

Real-world example: a misunderstood feature for dynamic remapping of terrain layer textures leads to a per-frame resolving of texture filenames to device texture objects. Textures are looked up by filename, which must be handled without regard for case, and stored in a hash map. The unfortunate combination of three seemingly simple, convenient tools - dynamic reloading of the terrain table, a case-insensitive std::string wrapper, and a hash map - leads to 40,000 calls to tolower() per frame. Not a big deal - nothing you’d even think about without a profiler. But there goes 1% of the total CPU time of a completely CPU-bound game - a percent you’d have a hard time shaving off somewhere else.
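
To make the combination concrete, here’s a sketch (my own names, not the actual engine code) of the kind of case-insensitive hash map that produces this behavior - every lookup walks the whole key and calls tolower() per character:

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <unordered_map>

// Illustrative sketch, not the original engine code.
struct CaseInsensitiveHash {
    std::size_t operator()(const std::string& s) const {
        std::size_t h = 2166136261u;                       // FNV-1a, 32-bit flavor
        for (unsigned char c : s) {
            h ^= (unsigned char)std::tolower(c);           // tolower per character, per lookup
            h *= 16777619u;
        }
        return h;
    }
};

struct CaseInsensitiveEqual {
    bool operator()(const std::string& a, const std::string& b) const {
        if (a.size() != b.size()) return false;
        for (std::size_t i = 0; i < a.size(); ++i)
            if (std::tolower((unsigned char)a[i]) != std::tolower((unsigned char)b[i]))
                return false;
        return true;
    }
};

// Resolving filenames through such a map every frame multiplies the tolower()
// cost by (lookups per frame) * (average filename length). Lowercasing the key
// once at insertion time, or caching the resolved texture pointer, removes the
// per-frame cost entirely.
using TextureByName = std::unordered_map<std::string, void*,
                                         CaseInsensitiveHash, CaseInsensitiveEqual>;
```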

Another one: an innocent, smallish 12-byte structure is augmented with a “dirty” flag, to keep track of yet another type of dynamically modifiable data. The structure is put in a three-dimensional array - which has to be banned outright, by the way. Then the array is “sliced” - iterated over (the wrong) two of the dimensions - and the dirty flag is checked. This particular access pattern and the particular array and structure sizes lead to about 4,000 successive cache misses per frame, which take up to 2.5% of the frame time on the target platform if you’re going for 30 fps. See it in a profiler, spend 5 minutes moving the dirty flags to a separate array (or keep a “modified data” queue aside if you have 30 minutes), and shave off these 2.5%. Or don’t, and ship a seemingly bottleneck-free but severely underperforming game, full of such pearls of wisdom.
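
A sketch of the five-minute fix, with hypothetical names: keep the dirty flags in a flat parallel array, so the per-frame scan walks densely packed bytes instead of striding through 12-byte structs in the wrong order:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative sketch - names are mine, not the original code.
struct Cell { std::uint32_t a, b, c; };   // the innocent 12-byte structure

struct Grid {
    int nx = 0, ny = 0, nz = 0;
    std::vector<Cell>         cells;      // nx*ny*nz elements, flattened
    std::vector<std::uint8_t> dirty;      // parallel flags, 1 byte per cell

    void mark_dirty(int x, int y, int z) {
        dirty[(std::size_t)(z * ny + y) * nx + x] = 1;
    }

    // The per-frame scan now touches only the compact flag array; the 12-byte
    // payload is fetched only for cells that actually changed.
    template <class F>
    void flush_dirty(F&& on_dirty) {
        for (std::size_t i = 0; i < dirty.size(); ++i) {
            if (dirty[i]) {
                on_dirty(cells[i]);
                dirty[i] = 0;
            }
        }
    }
};
```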

T-Rex Playstation Demo from 1993

February 5th, 2007

I heard legends of the original Playstation tech demo featuring a T-Rex a long time ago, but couldn’t find it. Finally somebody uploaded it to YouTube for all the world to admire:

Comments from the thread on B3D:


This demo was first shown off on Playstation hardware back in 1993… 3dfx didn’t exist (as a company), Matrox hadn’t released *any* of its consumer cards with 3D acceleration yet (they were the first), and the fastest piece of x86 you could lay your hands on was a 66MHz P5 (which typically ranged between $5K-$8K for a machine with one).

Roberto Ierusalimschy on threads in Lua

February 5th, 2007

…We did not (and still do not) believe in the standard multithreading model, which is preemptive concurrency with shared memory: we still think that no one can write correct programs in a language where ‘a=a+1’ is not deterministic.
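
To make the quote concrete, here is a sketch of the same problem in C++ (mine, not Ierusalimschy’s): two preemptively scheduled threads share a plain int, the read-modify-write of ‘a = a + 1’ interleaves, and the result is not deterministic (and is formally a data race):

```cpp
#include <cstdio>
#include <thread>

int a = 0;  // shared and not atomic - which is exactly the point

void bump() {
    for (int i = 0; i < 1000000; ++i)
        a = a + 1;   // load, add, store: another thread can run between the steps
}

int main() {
    std::thread t1(bump), t2(bump);
    t1.join();
    t2.join();
    // Prints anything between 1000000 and 2000000 from run to run
    // (and the program technically has undefined behavior).
    std::printf("a = %d\n", a);
}
```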

Making Of Lost Planet / Dead Rising

January 31st, 2007

Since “making of sotc” remains the most important search phrase leading people here, I will attempt to whore some more hits by linking another interesting article, this time about Capcom’s new engine, used for Dead Rising and Lost Planet now, and for Devil May Cry 4 and Resident Evil 5 in the future. The Japanese original is here, and a semi-translated version is here. (For some reason Google Translate doesn’t translate all of the text - possibly on purpose.)

Several highlights, translated by kind people on various forums:

  1. This MT engine work started in Sept. 2004, and they had something up and running by January 2005. It’s based on the Onimusha 3 engine. The project started with just one engineer, then ramped up to 3, and now they have 5 people maintaining and upgrading the code. They have now added 4 people just for the PS3 port, which started in Oct. 2005. The MT engine is currently used for Dead Rising and Lost Planet, but future cross-platform next-gen games will also use it.
  2. They evaluated UE3, and they appreciate the strengths of that engine. But they were worried about some of its performance limitations at the time, and about the lack of support personnel in Japan. They have high hopes for UE3 in the future. But they decided this time to build their own tech.
  3. They started with Xbox 360 since it is so close to the PC platform and mostly compatible.
  4. There have been requests from developers to license their engine due to the success of Lost Planet and Dead Rising. But Capcom feels that it would take too much effort to hire the appropriate support staff. They would rather put more effort into developing even better games for their users.
  5. They talk a bit about the multithreading techniques they are using to get the power out of the asynchronously multicore CPUs in the 360 and PS3.
  6. They give a detailed description, with screenshots, of the technique they are using to do motion blur on the Xbox 360. The algorithms are based on a talk given by Simon Green of NVIDIA at GDC back in 2003. (This is the one aspect of Lost Planet that looks truly next-gen, and makes the game really stand out and look unbelievably beautiful.)
  7. More info on Xenon performance: one 3.2 GHz core has about 2/3 the power of a 3.2 GHz P4, but if you use all 6 hardware threads you can get about 4x that, so a 3.2 GHz Xenon with 3 cores and 6 SMT threads is roughly on par with a 3.2 GHz P4EE with 2 cores and 4 SMT threads. And yes, the memory access latency is bad, but as with current-gen GPUs, more threads let you hide the memory access problem better, so it is not a big problem.
  8. Each character has 10k to 20k polys, a VS has 30k to 40k, and the background about 500k. There are normally around 160 MB of textures in memory, 60-80 MB of which is background.
  9. MSAA is adjusted dynamically based on the scene frame rate, e.g. below 30 fps, 4xAA drops to 2xAA (see the sketch after this list).
  10. MT stands for Multi-Target, Multi-Threaded, Meta Tools engine.
  11. Dead Rising scenes render about 4M polys, and Lost Planet scenes max out at about 3M. This is because of the demanding number of particles needed for Lost Planet. They said that drawing bleeding zombies is technologically much easier than creating a rich organic world filled with smoke, fire, and snow.
  12. They are very proud of the techniques they’ve been able to employ to get a tremendous amount of good-looking particle effects on screen without causing slowdown. They said that utilizing the Xbox 360 EDRAM for certain screen effects gives them great speed without hurting the frame rate, and that the EDRAM and learning to properly use the multithreaded processors are the two “tricks to making Xbox 360 games run well”.
  13. They also mention that although their MT engine is launching with Dead Rising and Lost Planet on Xbox 360, it will next be used for the PS3 title Devil May Cry 4, and then for an Xbox 360/PS3 multiplatform title, Resident Evil 5.
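
Point 9 is a single sentence in the translation; purely as an illustration of the general idea (my guess, not Capcom’s code), such a heuristic might look like this - measure the frame time and step the MSAA level down when the 30 fps budget is blown, back up when there’s headroom:

```cpp
// Illustrative only - a generic "drop AA when over budget" heuristic,
// not the MT engine's actual implementation.
struct DynamicMsaa {
    int   samples  = 4;       // current MSAA level: 4 -> 2 -> 1
    float budgetMs = 33.3f;   // 30 fps target

    // Call once per frame with the measured frame time in milliseconds.
    void update(float frameMs) {
        if (frameMs > budgetMs * 1.05f && samples > 1)
            samples /= 2;                      // e.g. 4xAA -> 2xAA when below 30 fps
        else if (frameMs < budgetMs * 0.80f && samples < 4)
            samples *= 2;                      // restore quality when there's headroom
    }
};
```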

Edit: Here’s a more complete translation.

Switching out of Gear

January 30th, 2007

You’ve played too many modern shooters when you launch Unreal Tournament (the original, not some newfangled drive-fest) and:

* you constantly try to reload

* when you get hit, you instinctively try to hide to regenerate

* … and you keep wondering if it’s _really_ worth it to walk over the weird boxes with crosses on them

* somebody gets close, you try to melee him… where’s the melee button?

And, seriously, dude, where’s my textures? You call these high-resolution textures? Is this the Wii version or something?

Normal Maps Compression

January 20th, 2007

Once again I set out to prove people smarter than me are wrong by trying various things for tangent-space normal map compression, and once again I found myself wrong.

Lesson one: you can’t stuff normals in DXT1, no matter how much voodoo you do on the reconstruction side. The blocks show through, painting little 4×4 bricks on top of the bricks of my test texture.

Lesson two: if you resort to 8 bits per pixel, try the so-called DXT5_NM format: the X component goes in the alpha channel, the Y component is replicated in the color components, and Z = sqrt(1 - X*X - Y*Y) is reconstructed at runtime. It looks OK, and has no visible blocking artifacts.
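
For illustration, here is how that packing and reconstruction look in code (a sketch with my own helper names; in practice the reconstruction lives in the pixel shader):

```cpp
#include <cmath>
#include <cstdint>

// Sketch of the DXT5_NM layout described above. X goes into alpha - the
// channel DXT5 compresses best - Y is replicated into R/G/B, and Z is dropped.
struct Texel { std::uint8_t r, g, b, a; };

static std::uint8_t to8(float v) {                 // [-1,1] -> [0,255]
    float t = v * 0.5f + 0.5f;
    if (t < 0.0f) t = 0.0f;
    if (t > 1.0f) t = 1.0f;
    return (std::uint8_t)(t * 255.0f + 0.5f);
}

Texel pack_nm(float nx, float ny /* nz dropped */) {
    std::uint8_t y = to8(ny);
    return { y, y, y, to8(nx) };
}

// Reconstruction after sampling: X from alpha, Y from green, Z recomputed.
void unpack_nm(const Texel& t, float& nx, float& ny, float& nz) {
    nx = t.a / 255.0f * 2.0f - 1.0f;
    ny = t.g / 255.0f * 2.0f - 1.0f;
    float z2 = 1.0f - nx * nx - ny * ny;
    nz = z2 > 0.0f ? std::sqrt(z2) : 0.0f;
}
```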

The problem with reconstructing Z from X and Y is that it’s a non-linear operation, and it breaks down under bilinear filtering - just think of what Z component you’d reconstruct when fetching right in the middle between two pixels, one with (1,0) and the other with (0,1) XY components: you get Z = sqrt(1 - 0.25 - 0.25) ≈ 0.71, where the correctly filtered normal would have Z = 0. The error is greatest near the equator (small Z values), but it is always there. Thankfully, real-world normal maps rarely have sharp normal boundaries, and most of the normals point straight up, so the general appearance of bumpiness is preserved. And it sure beats the blocks you get if you keep the Z values somewhere in the DXT5 block, or the halved resolution of the normal map you’d need if you went to two bytes per texel.

50 Books For Everyone In the Game Industry

October 11th, 2006

For those with a bunch of Amazon gift certificates and a month to spare, here’s Next Generation’s list of 50 Books For Everyone In the Game Industry. I’ve read, what, four? Too bad I don’t have the month, or the gift certificates.

Single-channel DXT1 compression

October 3rd, 2006

I did some experiments recently with “fast” (runtime) compression of single-channel images into the green channel of DXT1 blocks (6 bits of endpoint precision). We use the so-called splat mapping technique for terrain texturing, and the splat maps, which get regenerated every time something changes the terrain textures (e.g. whenever your mouse moves in the editor while the terrain texture brush is active), occupy 8 bits per pixel and are pretty large.

I tried the so-called “range fit”, where the extremal points of the input luminance data are used as endpoints. It is very fast, but produced 20 units of mean square error over my (admittedly unrealistic) purely random test image. (Here and below, the error unit = 1/255 of the max possible value, so literally units in an 8-bit channel.) For comparison, the mean square error from simply encoding into 6 bits is around 2 units. What does “very fast” mean? Without any effort spent on making the code run faster, the straightforward range fit takes around 600 ns per DXT1 block, which would mean on the order of 40 ms for a 1024×1024 image, or 25 MB/sec, if that sounds more familiar to you. This can probably be improved by at least a factor of two if you tinker with it. (All numbers taken on a 2 GHz Athlon64. Its official moniker is “X2 3800+”, but since my tests were single-threaded, the X2 doesn’t matter, and it probably would have been called “3000+” if it were a single-core CPU.)
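
Here is roughly what that range fit looks like for one 4×4 block (a sketch with my own helper names and the most naive index selection; real code would at least be vectorized):

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdlib>

// Sketch of "range fit" for one channel into the green bits of a DXT1 block.
// Input: 16 luminance bytes (a 4x4 block), output: the 8-byte DXT1 block.
static std::uint8_t expand6(int g) { return (std::uint8_t)((g << 2) | (g >> 4)); } // 6 -> 8 bits

void range_fit_green(const std::uint8_t in[16], std::uint8_t out[8]) {
    // Endpoints = extremes of the input range, quantized to 6 bits.
    int lo = 255, hi = 0;
    for (int i = 0; i < 16; ++i) { lo = std::min(lo, (int)in[i]); hi = std::max(hi, (int)in[i]); }
    int g0 = (hi * 63 + 127) / 255;            // "max" endpoint -> color0
    int g1 = (lo * 63 + 127) / 255;            // "min" endpoint -> color1

    // color0 > color1 keeps the block in DXT1's 4-color mode.
    std::uint16_t c0 = (std::uint16_t)(g0 << 5), c1 = (std::uint16_t)(g1 << 5); // green field of RGB565
    out[0] = (std::uint8_t)c0; out[1] = (std::uint8_t)(c0 >> 8);
    out[2] = (std::uint8_t)c1; out[3] = (std::uint8_t)(c1 >> 8);

    // Palette the decoder will reconstruct: c0, c1, 2/3*c0 + 1/3*c1, 1/3*c0 + 2/3*c1.
    int pal[4] = { expand6(g0), expand6(g1),
                   (2 * expand6(g0) + expand6(g1)) / 3,
                   (expand6(g0) + 2 * expand6(g1)) / 3 };

    // Pick the closest palette entry per pixel, 2 bits each, one byte per row.
    for (int row = 0; row < 4; ++row) {
        std::uint8_t bits = 0;
        for (int col = 0; col < 4; ++col) {
            int v = in[row * 4 + col], best = 0, bestErr = 256;
            for (int p = 0; p < 4; ++p) {
                int err = std::abs(v - pal[p]);
                if (err < bestErr) { bestErr = err; best = p; }
            }
            bits |= (std::uint8_t)(best << (col * 2));
        }
        out[4 + row] = bits;
    }
}
```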

I tried brute force search, where all 4096/2 combinations of endpoints are used, which produced around 15 units of MSE, but was very, very slow.

Next I tried brute forcing not the entire endpoint space, but just “wobbling” the endpoints around the extremal points of the input range; varying the amplitude of this wobbling exhibited the typical “90% of the benefit for 10% of the effort” curve, with the sweet spot around 8 units of the 6-bit endpoint range producing 15-ish MSEs for less than 10% of the brute force time.
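
The “wobble” search, sketched with the same caveats (my names, naive error metric): start from the range-fit endpoints and try every endpoint pair within ±radius of them, keeping the pair with the smallest block error.

```cpp
#include <algorithm>
#include <cstdint>

static int expand6(int g) { return (g << 2) | (g >> 4); }     // 6 -> 8 bits

// Sum of squared per-pixel errors against the 4-color DXT1 palette for (g0, g1).
static int block_error(const std::uint8_t in[16], int g0, int g1) {
    int pal[4] = { expand6(g0), expand6(g1),
                   (2 * expand6(g0) + expand6(g1)) / 3,
                   (expand6(g0) + 2 * expand6(g1)) / 3 };
    int total = 0;
    for (int i = 0; i < 16; ++i) {
        int best = 255 * 255;
        for (int p = 0; p < 4; ++p) {
            int d = (int)in[i] - pal[p];
            best = std::min(best, d * d);
        }
        total += best;
    }
    return total;
}

void wobble_fit(const std::uint8_t in[16], int& bestG0, int& bestG1, int radius = 8) {
    int lo = 255, hi = 0;
    for (int i = 0; i < 16; ++i) { lo = std::min(lo, (int)in[i]); hi = std::max(hi, (int)in[i]); }
    const int hi6 = (hi * 63 + 127) / 255;     // plain range-fit endpoints as the start
    const int lo6 = (lo * 63 + 127) / 255;

    bestG0 = hi6; bestG1 = lo6;
    int bestErr = block_error(in, hi6, lo6);

    for (int a = std::max(0, hi6 - radius); a <= std::min(63, hi6 + radius); ++a)
        for (int b = std::max(0, lo6 - radius); b <= std::min(a, lo6 + radius); ++b) {
            // b <= a keeps color0 >= color1, i.e. DXT1's 4-color mode.
            int err = block_error(in, a, b);
            if (err < bestErr) { bestErr = err; bestG0 = a; bestG1 = b; }
        }
    // bestG0/bestG1 then go into the 8-byte block exactly as in the range-fit sketch above.
}
```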

Of course, you wouldn’t want to compress purely random data; on reasonable images, like a splatmap, or the luminance channel of a diffuse color texture, the numbers are much, much more reasonable: 2.25 units of MSE for 6-bit luminance; 4.6 units of MSE for range fit into DXT1, which is virtually instant; and 3.4 units of MSE for brute force DXT1 search, which takes forever. Again, 8 units of “wobbling” the endpoints gets you within 1% of the MSE for 10% of the running time of the brute force search.

Moral of the story: it’s perfectly reasonable to compress single-channel images into DXT1 at runtime using the range fit algorithm. If you need better results, you can look several steps around the range fit endpoints. And, by the way, the x64 versions ran about 5% faster.

Update: Range fit on the X2 3800+ reaches 75 MB/s, not 25.

Token Filters for Lua

September 17th, 2006

In an attempt to at least get a chance at making those haughty Lisp fanatics shut up for a minute about their macros, people have been playing with a patch for Lua which makes some of the same tricks possible: token filters. The idea is that you register with the Lua interpreter your own Lua function, which gets a chance to examine and modify the token stream of any subsequent piece of Lua code that gets executed. Your function gets called for each token and outputs one or more tokens, but thanks to the tricks made possible by Lua coroutines, you can pretend that you simply get() any further tokens you need and put() them to the output stream at your own pace.

Of course, this is much more inconvenient than Lisp macros, because a token stream is an order of magnitude harder to examine and make intelligent decisions upon than the no-syntax S-expressions of Lisp, but still, it’s a nice hack. People have started doing truly interesting/useful things, like this function parameter checking facility.

The main problem I see is not the awkwardness of the token stream mechanics; if this catches on, I think sensible frameworks that allow working with higher-level Lua constructs will emerge. The problem is the uber-simplistic Lua compiler, which does virtually no optimizations. More often than not, the idea of macros (and other metaprogramming techniques, such as template metaprogramming in C++) is to introduce simpler syntax for things that get expanded or calculated at compile time, with most of the resulting inefficiency optimized away. With Lua, there’s nothing optimizing your code, so you’ll often have nothing to gain by running some calculation at compile time instead of at runtime. And the lack of optimization is a feature - it comes with the “tiny, simple, embeddable” part of the Lua bundle.

FILE_FLAG_SEQUENTIAL_FTW

September 13th, 2006

I’ve been toying with file system optimization recently, as we’re starting to implement a background loading system. On pdimov’s advice, I tried setting FILE_FLAG_SEQUENTIAL_SCAN on all Windows file open operations, and got an amazing, absolutely automatic 10% reduction in load time… an excellent case of money left on the table. (The load process has already been optimized in several iterations, so shaving off 10% by other means would not be trivial.)
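
For reference, the whole change is one flag in CreateFile (a minimal sketch, error handling and the rest of the loading path elided):

```cpp
#include <windows.h>

// Minimal sketch: the only change is OR-ing FILE_FLAG_SEQUENTIAL_SCAN into the
// flags, which hints the cache manager to read ahead aggressively and to evict
// pages behind the read position.
HANDLE OpenForStreaming(const wchar_t* path) {
    return CreateFileW(path,
                       GENERIC_READ,
                       FILE_SHARE_READ,
                       nullptr,                 // default security
                       OPEN_EXISTING,
                       FILE_ATTRIBUTE_NORMAL | FILE_FLAG_SEQUENTIAL_SCAN,
                       nullptr);                // no template file
}
```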

It’s of great educational value to examine all kinds of logs and statistics produced by the file system. For example, I estimated that more than 70% of high-level file read operations would consume an entire file; the real number turned out to be around 10%. I found all sorts of weird leftover tiny reads, like skeletons being read one bone at a time, or scripts being loaded via a 512 byte buffer. We probably have a wristwatch programmer embedded undercover in our team…

The very existence of functions like fread() is some kind of lie-to-children, like the fact that the earth is round (it is not) or that the atom is a little planet. People who pride themselves on writing in a “to-the-metal” language like C or C++ to squeeze maximal performance out of the hardware, and still use fread() (or, heaven forbid, iostreams), should take some time to read about low-level esoterica like FILE_FLAG_NO_BUFFERING with its sector-aligned reads bypassing the system cache, I/O completion ports for file access, or even the scatter/gather API used to build your own cache. This thesis by Jan Wassenberg quotes a 10x improvement in effective read throughput for a real-world loading sequence over plain read().
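
To give a flavor of that low-level path, here is a minimal sketch of an unbuffered read - the general Win32 recipe, not our loader. With FILE_FLAG_NO_BUFFERING the file offset, the read size, and the buffer address must all be multiples of the volume’s sector size, so you allocate page-aligned memory and round the request size up yourself.

```cpp
#include <windows.h>

// Sketch of an unbuffered, cache-bypassing read of a whole file. VirtualAlloc
// returns page-aligned memory, which satisfies typical 512/4096-byte sectors.
bool ReadWholeFileUnbuffered(const wchar_t* path, void** data, DWORD* size) {
    HANDLE h = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, nullptr,
                           OPEN_EXISTING, FILE_FLAG_NO_BUFFERING, nullptr);
    if (h == INVALID_HANDLE_VALUE) return false;

    DWORD fileSize = GetFileSize(h, nullptr);
    const DWORD sector = 4096;                                 // assumed; query the volume in real code
    DWORD rounded = (fileSize + sector - 1) & ~(sector - 1);   // read size must be a sector multiple

    void* buf = VirtualAlloc(nullptr, rounded, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
    DWORD read = 0;
    // Reading past EOF is fine: ReadFile succeeds and clips to the file size.
    BOOL ok = buf && ReadFile(h, buf, rounded, &read, nullptr);
    CloseHandle(h);

    *data = buf;
    *size = fileSize;
    return ok != FALSE;
}
```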