Sparse Voxel Octrees for Realtime Raytracing!

•July 26, 2010 • 2 Comments

So I’ve been doing research on sparse voxel octrees off and on for about a year now. The more I look into using them, the more practical they become. After recently generating the first render with my current sparse voxel octree implementation, I’ve gathered a fairly clear understanding of why they’re such a practical approach. That said, I’m going to use this blog post to kind of outline my experiences with the technology.

One of the biggest benefits of this technology is the level of detail that it allows. Because everything is effectively treated as a sparse 3D texture, you get all of the benefits of standard 2D texturing. Like regular textures, they’re very suitable for streaming. They also have implicit level-of-detail through the use of mipmapping. That said, the end result is completely unique 3D texture data everywhere, and the amount of detail you can render in a very short period of time is quite astounding (I’ll post some screenshots/video within the next few months to back up this claim =) ).

Another major benefit comes from the fixed maximum render time. When ray-casting for the primary ray hit, there is a maximum number of steps you need to take in your 3D DDA (DDA stands for Digital Differential Analyzer; a 3D DDA is simply a fast way to traverse all voxels intersected by a ray in the order that they’re intersected). In my current implementation, for a 90 degree field-of-view, there is a maximum of 3,840 steps required in my 3D DDA. This allows me to guarantee worst-case performance! Rasterization certainly can’t guarantee that!
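To make the traversal concrete, here is a minimal sketch of a 3D DDA over a plain uniform grid, written in Python. It is not the implementation described in this post: the names `dda_traverse` and `max_cells_visited`, the unit-cube cells, and the requirement that the ray starts inside the grid are all assumptions made for illustration.

```python
import math

def dda_traverse(origin, direction, grid_size):
    """Yield the (x, y, z) cells of a grid_size^3 grid that a ray passes through, in order.

    Assumptions (not from the post): cells are unit cubes, the origin lies
    inside the grid, and the direction does not need to be normalized.
    """
    cell = [int(math.floor(c)) for c in origin]
    step, t_max, t_delta = [], [], []
    for axis in range(3):
        d = direction[axis]
        if d > 0:
            step.append(1)
            t_max.append((cell[axis] + 1 - origin[axis]) / d)  # distance to next boundary
            t_delta.append(1.0 / d)                            # distance between boundaries
        elif d < 0:
            step.append(-1)
            t_max.append((cell[axis] - origin[axis]) / d)
            t_delta.append(-1.0 / d)
        else:
            step.append(0)
            t_max.append(math.inf)  # the ray never crosses a boundary on this axis
            t_delta.append(math.inf)

    while all(0 <= cell[a] < grid_size for a in range(3)):
        yield tuple(cell)
        axis = t_max.index(min(t_max))  # the axis whose boundary is hit first
        if step[axis] == 0:
            return  # degenerate all-zero direction
        cell[axis] += step[axis]
        t_max[axis] += t_delta[axis]


def max_cells_visited(grid_size):
    """Upper bound on cells one ray can visit: every step advances one axis by one cell."""
    return 3 * grid_size - 2
```

The bound in `max_cells_visited` only applies to this simple uniform grid. It is not meant to reproduce the 3,840-step figure quoted above, which depends on details of the actual octree traversal.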

One major drawback of this technology is the lack of research available. NVIDIA released a great paper called Efficient Sparse Voxel Octrees. id Software has also released some research along with a video that demonstrates the tech in action. However, none of the research that I’ve found covers the practical implementation details. That said, I find myself implementing a lot of things without really knowing whether they’ll work or not, but I suppose that’s just the nature of research.

One thing that I’ve found that really helps me in terms of my mental model, as well as my code, is dividing the sparse voxel octree into two different octrees. I have one sparse octree (a resolution of 1024^3) that represents a high-level node partition of my scene. I call this octree the “indirection octree”. At each node in the indirection octree, I have a pointer to a “brick”. A brick is a sparse octree of size 128^3 whose leaves each represent a voxel in the 3D texture. While this complicates my tree traversal somewhat, it makes management and streaming *MUCH* easier. It also allows me to implement some interesting high-level optimizations, such as a coarse DDA that I can then use to adjust the LOD sooner, and therefore take far fewer 3D-DDA steps overall.
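As a rough illustration of the two-level idea, the sketch below splits a global voxel coordinate into a brick coordinate and a coordinate local to that brick. This is only one possible reading of the layout: the names `split_coordinate` and `TwoLevelVolume` are hypothetical, plain dictionaries stand in for both octrees, and the mapping of exactly one brick per top-level cell is an assumption rather than something stated above.

```python
BRICK_RES = 128  # per-brick resolution quoted in the post

def split_coordinate(gx, gy, gz, brick_res=BRICK_RES):
    """Split a global voxel coordinate into (brick coordinate, local coordinate)."""
    brick = (gx // brick_res, gy // brick_res, gz // brick_res)
    local = (gx % brick_res, gy % brick_res, gz % brick_res)
    return brick, local


class TwoLevelVolume:
    """Sparse two-level voxel store; dicts stand in for the real octrees (assumption)."""

    def __init__(self, brick_res=BRICK_RES):
        self.brick_res = brick_res
        self.indirection = {}  # brick coordinate -> {local coordinate -> voxel value}

    def set_voxel(self, gx, gy, gz, value):
        brick, local = split_coordinate(gx, gy, gz, self.brick_res)
        self.indirection.setdefault(brick, {})[local] = value

    def get_voxel(self, gx, gy, gz):
        brick, local = split_coordinate(gx, gy, gz, self.brick_res)
        brick_data = self.indirection.get(brick)
        if brick_data is None:
            return None  # the whole brick is empty, so no voxel data is touched
        return brick_data.get(local)
```

The early return in `get_voxel` hints at why the split helps with streaming and coarse traversal: an absent brick can be rejected, loaded, or evicted as a unit without looking at individual voxels.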

Overall, I find working with voxels to be extremely fascinating. It’s quite a challenge, but the results are pretty interesting. Given my current results, I definitely think that this technology can utilize low-end PC graphics hardware. I’m also holding out hope that this can run at interactive rates on a high-end desktop CPU. I’ll be sure to post more info when I get it! Watch out for a video in the next few months as well!

Creative Technology, Part II

•July 19, 2010

Last week I wrote an article about the need for graphics programmers to be more creative, rather than simply going out and grabbing existing tech. This week, I’d like to add to that with an example of this kind of methodology in action. One major technology that fits in this category is Megatextures. It can run on very low end hardware, and provides a massive boost in visual quality. It also puts a cap on how much graphics memory is required for texturing, and subsequently, texture bandwidth usage. It makes things like decals and detail-textures totally irrelevant!

Another technology that is a slightly less relevant example is real-time ray tracing of voxel environments. Because all static scene elements are treated like textures, you get all of the benefits of Megatextures along with the upper bound on render time that ray tracing provides! In my current project, I find that even an unoptimized ray-tracer runs quite fast (1024 x 1024, approx 1 sec in my test cases, single-threaded and unoptimized). I’m hopeful that even a CPU-based ray-tracer would be possible, but a GPU implementation is certainly doable. When I first started tinkering with the technology, it seemed like it would require a very high-end graphics card to run it, but as time goes by and I implement more algorithmic-level optimizations, it seems more and more likely that even low-end hardware can support this technology (I intend to post results of some of my experimental optimizations in the near future).

That said, I’m very hopeful that we’ll see more people playing with uncommon technology and hopefully bringing in some very cool solutions to some very real problems such as micropolygons and streaming. I personally think that voxels will begin to take over realtime rendering in the next 10 years, and it’s pretty much unexplored territory as far as our industry is concerned. Hopefully there will be plenty of similar unique technologies picking up steam in our industry!

Creative Technology…

•July 15, 2010

One thing that the game industry deals with on a fairly regular basis is new technology. In graphics, it’s especially true that there are new techniques coming out all of the time. These techniques are pretty varied and focus on everything from animation to texturing. Not only that, but the pace of innovation has exploded over the last few years due to the introduction of extremely powerful and extremely *flexible* graphics cards. In the end, it’s always a rat race to keep up in an attempt to be the first to market with a particular feature.

On the console side of the industry, the platforms are a lot more static than on the PC side. As a matter of fact, modern graphics hardware in PCs is roughly 10-20x more powerful than the graphics hardware in the current batch of consoles. However, there is one kind of graphics hardware that is commonly ignored by current PC games: cheap, low-performance integrated graphics. I used to believe that supporting this kind of hardware wasn’t really that important, but with the ever-increasing popularity of laptops, that kind of ideology no longer holds up.

Lucky for us graphics guys, even the low end hardware is seeing some very important advances. These advances mostly manifest themselves in the form of increased flexibility. We can now do things on even low end hardware that would have been impossible only a few short years ago. That leads me to the point of this post: I believe that the industry has simply taken a “more features equals better” approach towards graphics. Not only does this approach typically fail to effectively support low-end hardware, it’s simply not a sustainable approach for the future.

Given the number of new graphics technologies coming out every year, it will be impossible to maintain the current approach without completely astronomical budgets. Even then, would all of those new features contribute in a truly meaningful way? Let’s pretend for a second that we’re not talking about graphics any more… we’re now talking about cars. Let’s presume that you want to build the fastest race car ever. If you go out and grab the fastest engine, the best tires, the best car frame, the best everything currently in existence… would you get the best car? In theory, it seems like you would. But in reality, you’d probably have lots of things that don’t really fit well together. Well guess what: graphics technologies are no different!

That said, I think that the game industry needs to stop this “more equals better” approach and start taking a smarter, more creative, and holistic approach. We should be thinking “smarter equals better”. We should be focusing on developing creative uses for all of that new flexibility that graphics cards are now giving us! We should be focusing on technologies that don’t just increase the visual quality, but also enhance practicality! It’s only a matter of time before it’s no longer about how your game looks on high-end hardware… it will be about how it looks and performs on low-end hardware.

*EDIT*:
Sorry for not posting on my usual Monday schedule… I’m just getting over being sick, and it turns out that it’s really tough to write with any real capacity whilst not feeling well! =X

Deferred Shading.

•June 28, 2010

One very popular rendering technique these days is deferred shading.  It’s a very simple concept, and it comes with plenty of benefits.  It’s generally very easy to implement, and it avoids a whole slew of problems on some platforms.  To top it all off, it has been successfully used in several high profile games!

Due to its list of benefits, its drawbacks are often overlooked.  However, I personally think it’s important to examine both sides of any technology, good and bad, to determine if it’s the best fit for a given situation.  That said, I’d like to present to you my observations on deferred shading.

For people who are not familiar with deferred shading, I’m going to give a brief overview.  The premise is that you render out scene details, rather than fully shaded/lit colors.  The only details that you render are the ones needed for your shading model.  For instance, you need surface normals for lighting, so you render out surface normals to a buffer (typically called a g-buffer, which is short for “geometry buffer”).  Due to the ability of modern graphics cards to render to multiple buffers simultaneously, it becomes very simple to render out all of the relevant scene details in a single pass.  Once this data is available, all shading calculations use the data stored in the buffers to light the scene.  Now that we’ve covered the idea behind deferred shading, I can talk about its pros and cons.
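The two stages described above can be sketched on the CPU in a few lines of Python. This is a toy model, not a real renderer: the fragment dictionaries, the function names, and the choice of simple Lambert diffuse lighting with directional lights are all assumptions made for illustration.

```python
import math

def geometry_pass(fragments, width, height):
    """Pass 1: store per-pixel surface data (the g-buffer) instead of shaded colors."""
    gbuffer = {
        "depth":  [[math.inf] * width for _ in range(height)],
        "normal": [[None] * width for _ in range(height)],
        "albedo": [[None] * width for _ in range(height)],
    }
    for frag in fragments:
        x, y = frag["x"], frag["y"]
        if frag["depth"] < gbuffer["depth"][y][x]:  # keep only the nearest surface
            gbuffer["depth"][y][x] = frag["depth"]
            gbuffer["normal"][y][x] = frag["normal"]
            gbuffer["albedo"][y][x] = frag["albedo"]
    return gbuffer


def lighting_pass(gbuffer, light_dirs):
    """Pass 2: shade every pixel once per light using only the stored data."""
    height = len(gbuffer["depth"])
    width = len(gbuffer["depth"][0])
    image = [[(0.0, 0.0, 0.0)] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            normal = gbuffer["normal"][y][x]
            if normal is None:
                continue  # nothing was rendered to this pixel
            albedo = gbuffer["albedo"][y][x]
            # Lambert diffuse term summed over all lights (an assumed shading model).
            intensity = sum(
                max(0.0, sum(n * l for n, l in zip(normal, light)))
                for light in light_dirs
            )
            image[y][x] = tuple(c * intensity for c in albedo)
    return image
```

Note that `lighting_pass` never looks at the scene geometry again. Its cost depends on the number of pixels and lights, which is the property the rest of this post weighs up.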

On the positive side for deferred shading, it solves several key problems:

- On some platforms, such as the PC, draw calls are very expensive.  By using deferred shading, you separate texture blending from shading.  A separate draw call is required for objects that have different shaders, different textures, or different shader parameters.  Because texturing and shading are broken up into two distinct stages, the number of parameters that need to change on a per object basis decreases.  Therefore, more objects can be rendered in the same draw call.  This is one of deferred shading’s biggest wins!

- Geometry doesn’t have to be processed multiple times.  In multi-pass renderers (renderers that draw an object once per light; Doom 3 is a good example of this), you have to render an object once for every light that affects it, which requires its geometry to be retransformed each time.  Deferred shading transforms geometry only once, and on platforms that have unified shading units (such as the Xbox 360 or any DX10 hardware), the units that would otherwise be retransforming vertices become available during the lighting phase.  Not only do you save time processing geometry, you actually gain more processors for use with shading!

- It helps with small triangles.  Because triangles only have to be rendered once, small triangles won’t hit your framerate as hard as they would in a multipass or forward renderer.

- It’s simple.  It’s very easy to write a deferred shading engine.  Because you don’t have to write complex batching systems or figure out how to manage shaders, you can simply avoid those things entirely.  Those take time and energy, and they’re typically not very fun to write.
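To put a number on the geometry-processing point in the list above, the following sketch counts how many times objects are submitted for drawing under each approach. It is a simplified model under stated assumptions: every object is affected by at least one light, and ambient, shadow, and light-volume passes are ignored. The function names are hypothetical.

```python
def multipass_geometry_submissions(lights_per_object):
    """Multipass: each object is drawn once for every light that affects it."""
    return sum(lights_per_object)


def deferred_geometry_submissions(lights_per_object):
    """Deferred: each object is drawn once into the g-buffer, whatever the light count."""
    return len(lights_per_object)


# Three objects affected by 3, 1 and 4 lights respectively.
scene = [3, 1, 4]
multipass_draws = multipass_geometry_submissions(scene)
deferred_draws = deferred_geometry_submissions(scene)
```

In this example the multipass renderer transforms geometry 8 times against 3 for the deferred renderer, and the gap grows with the number of lights per object.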

Now then, deferred shading isn’t all rainbows and ponies… it’s time to cover the problems!

- On the Xbox 360, you don’t have enough framebuffer space to render to high-precision render targets.  Therefore, you end up with very limited framebuffer space and have to put in all kinds of hacks to make the most of what you have.  The list of these hacks is quite extensive, but it generally means that you lose colored specular (or that you end up with incorrect specular lighting colors).

- You have to transform normals into view-space or world-space.  If you use tangent-space normal maps, then you have to pass a rotation matrix from your vertex shader into your pixel shader.  In multipass lighting engines, you can simply perform lighting directly in tangent space.  This means that you only need to transform lighting vectors in the vertex shader, which can then be passed directly to the pixel shader for lighting (this is quite efficient).  Also, simply interpolating the rotation matrix from the vertex shader to the pixel shader has a cost.  Interpolation of variables is not free!

- You have to render to multiple render targets.  This is one of *the* biggest bottlenecks.  This requires a high write bandwidth (imagine a simple low-precision case of 4 RGBA buffers at 4 bytes per pixel each… that’s 16 bytes per pixel, plus an additional 4 bytes for depth/stencil) and the video card can have trouble keeping up.  This means that rendering to your g-buffers is not cheap!  Even when reading data from the g-buffers, because it’s uncompressed, it requires much more bandwidth than the compressed textures that would typically be used.  Note that on some platforms, requiring a high write throughput is not a problem.  For instance, the Xbox 360 has 10MB of very fast RAM that acts as the framebuffer.  Thanks to that high speed RAM, it’s very unlikely that you will become framebuffer-write limited.

- You lose the ability to change lighting models for objects.  Because you render all of your scene data to g-buffers, in order to get different lighting models on objects, you have to encode the required lighting model into your g-buffer somewhere.  In a shader that calculates lighting, you would have to branch based on the lighting model for a given pixel.  Because branching is prohibitively expensive in a case like this, it’s almost always avoided.

- When shadowing a scene, you have to transform individual pixels into shadow space to determine whether or not they should be shadowed.  This increases the cost of shadowing considerably.
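The bandwidth figures in the render-target item above are easy to work through. The sketch below reproduces the 16 + 4 bytes per pixel from the text and scales it to a full frame. The 1280 x 720 resolution, the 60 frames per second, and the function names are assumptions chosen only for the example, and the model ignores any framebuffer compression the hardware might apply.

```python
def gbuffer_bytes_per_pixel(num_targets=4, bytes_per_target=4, depth_stencil_bytes=4):
    """Bytes written per pixel: 4 RGBA8 targets (16 bytes) plus 4 bytes of depth/stencil."""
    return num_targets * bytes_per_target + depth_stencil_bytes


def gbuffer_write_bytes_per_second(width, height, fps, overdraw=1.0, **layout):
    """Raw g-buffer write traffic per second, ignoring any hardware compression."""
    return width * height * gbuffer_bytes_per_pixel(**layout) * overdraw * fps


per_pixel = gbuffer_bytes_per_pixel()                       # 20 bytes
per_second = gbuffer_write_bytes_per_second(1280, 720, 60)  # about 1.1 GB/s with no overdraw
```

That is the cost of writing each pixel exactly once. Any overdraw multiplies it, and the lighting phase then has to read the same uncompressed data back.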

To sum everything up, deferred shading cuts down a lot of the management complexity, but it typically does so at a cost to fill rate.  In an environment where bandwidth is at a high premium, I wouldn’t necessarily recommend going with a deferred shading engine.  However, it really all depends on your requirements.  If there is a chance that you’re going to be draw call limited, then deferred shading is a very practical way of circumventing that entire problem.  If you’re going to be fill rate limited and you have programmer time to burn, then you might want to consider a multipass renderer.  Not many people realize this, but I worked on a shipped title that was of the “Builder” genre that used a multipass renderer.  Thanks to a very sophisticated batching system, it could handle a whole lot of lights — more than most deferred shading engines could achieve!  That said, I personally believe you can achieve better performance and better quality through a multipass renderer in almost all cases.  However, implementing a multipass renderer can require a great deal of infrastructure and frankly, it’s sometimes just better to put your time towards something else such as special effects, particles, or other eye candy.

Memory performance.

•June 20, 2010

So a friend of mine pointed me to a really good article about the performance characteristics of RAM.  The article itself is really long and goes into a whole lot of detail about how RAM works, complete with circuit schematics and all.  While this article covers a ton of useful topics, there are really only a few that apply to most people on a day-to-day basis.  That said, I figured I’d write a quick summary so that people can avoid reading this massive article.  Please note that my re-hash isn’t meant to be a replacement for this article, but rather it’s meant to be a summary (along with a quick code sample for the adventurous =)).

So probably the most important thing that this article covers is the cache.  The cache is basically a small amount of very high performance memory (called SRAM) that contains a copy of data that resides in RAM.  The point of the cache is to keep the processor from having to go out and actually get data from RAM all of the time.  Because the cache is usually much faster than RAM, organizing data for cache efficiency can be a *huge* win.  One common optimization for cache usage is called strip-mining, which is the process of breaking up one loop, and replacing it with many smaller loops.  If you then re-organize the data needed by each loop so that it is tightly packed, you can achieve even more cache hits!  The idea is to both limit and localize the number of memory accesses per loop iteration, which causes more cache hits, which causes *much* better performance.  To give you an idea, this optimization can sometimes speed up code by a full order of magnitude!
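Here is one way to read the loop-splitting and data-packing idea above, sketched in Python. The particle data and the function names are made up for the example. Python itself will not show the cache benefit, since its objects are not laid out like C structs, so treat this purely as an illustration of the transformation you would apply in a lower-level language.

```python
from array import array

def update_interleaved(particles, dt):
    """One loop that touches every field of every record on each iteration."""
    for p in particles:
        p["x"] += p["vx"] * dt
        p["y"] += p["vy"] * dt


def update_split(xs, vxs, ys, vys, dt):
    """The same work as two smaller loops, each walking tightly packed arrays."""
    for i in range(len(xs)):
        xs[i] += vxs[i] * dt
    for i in range(len(ys)):
        ys[i] += vys[i] * dt


# Usage: both layouts hold the same two particles.
particles = [
    {"x": 0.0, "y": 1.0, "vx": 2.0, "vy": -1.0},
    {"x": 1.0, "y": 1.0, "vx": 0.5, "vy": 0.5},
]
xs, vxs = array("d", [0.0, 1.0]), array("d", [2.0, 0.5])
ys, vys = array("d", [1.0, 1.0]), array("d", [-1.0, 0.5])
update_interleaved(particles, 0.5)
update_split(xs, vxs, ys, vys, 0.5)
```

Each loop in `update_split` reads and writes only two contiguous arrays, so each iteration touches fewer, more localized memory locations than the interleaved version.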

The second most important thing that this article covers is the performance characteristics of RAM itself.  While RAM stands for “random access memory”, it turns out that non-sequential access is actually slower than sequential access.  One way to deal with this is to actually make a copy of data that you’re going to be reading from.  If the data is small enough to fit in the cache, then you’ll effectively spend a bunch of time reading data from RAM into the cache, and then be working on the data in the cache.  Because the processor’s caching mechanism is functionally transparent to the application, the best way to do this is to simply allocate a buffer on the stack, copy the data, then read from that buffer.  It’s very counter-intuitive given that you end up doing extra work to gain speed, but in some cases, this can really help out your performance.  In the end, you’re simply performing CPU operations to prevent extra RAM operations in the future, and because the CPU is much faster than RAM, it ends up being a (sometimes big) win =).

In order to illustrate RAM timings, I’ve put together a quick demo that shows random access vs. sequential access.  Because I need to bypass the cache in order to accurately time RAM access speed, I end up having to use some inline assembly language.  The instruction that I’m using (I use movntdqa to perform cache-bypassing reads for all of you assembly people out there =) ) is part of SSE4.1, so it is only supported on the later 45nm Core 2 chips and better.  Please note that this code is designed for and has only been tested on an Intel i7 processor.

Disclaimer:  Processors are *very* complex these days, and they contain a lot of hardware dedicated to optimizing certain operations.  Because this hardware is entirely automatic from the programmer’s point of view, it is often difficult or impossible to bypass all of the hardware optimizations that might be going on.  That said, this test may not be completely accurate, or it may not provide exact numbers.  Also, interference from the operating system and other applications can greatly affect the performance of this code, and results can sometimes vary by a noticeable margin.  However, in all of my runs, sequential RAM access is *always* faster than non-sequential RAM access.  If anyone notices any bugs, please feel free to point them out and I’ll make the necessary fixes ASAP!

The rasterization problem.

•June 13, 2010 • 6 Comments

I’m constantly surprised at how few people realize that small polygons kill rasterization.  If you’ve ever had the joy of writing a software rasterizer, you’ll find that it immediately becomes apparent that small triangles are extremely hard to optimize for, and this problem translates over to graphics cards as well.  In this blog post, I’m going to explain exactly why this is!

So what are the advantages of rasterization?  It turns out that there are many!  The biggest strength of rasterization is memory access.  In low-poly graphics (such as games), a single polygon tends to cover a large number of pixels.  This means that any textures assigned to that polygon are going to be accessed in a fairly localized way (assuming you have mipmapping turned on), and the end result is a streaming memory-access pattern… memory from texture data is read in a mostly linear way, and then the shaded data is written out in a mostly linear way.  This effectively turns rasterized polygons into a nice streaming operation, which both memory and processors are set up to deal with very, very efficiently.  Because of this, modern GPUs are optimized to contain banks of processors that effectively exploit these patterns.  The end result is that images can be rendered *very* quickly.

So why do small triangles kill rasterization?  For starters, you can’t assume that large numbers of pixels will be generated by a single polygon.  This means that having a huge bank of processors to handle pixel processing is no longer an effective optimization.  For instance, let’s say you have a GPU with 24 fragment shading units (or as Direct3D calls them, pixel shading units).  If a triangle covers only one pixel, then you end up throwing away the work of 23 of the 24 fragment shading units.  As you can probably see, this is not the most efficient way to render a frame.
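The 23-out-of-24 example can be generalized with a little arithmetic. The sketch below assumes a deliberately simplified model in which the shading units work on one triangle at a time in fixed-size batches. Real hardware is more complicated, and the function name is hypothetical.

```python
import math

def shading_unit_utilization(pixels_covered, units=24):
    """Fraction of shading-unit work that is useful when one triangle is shaded
    in batches of `units` pixels (a simplified model, not real hardware)."""
    if pixels_covered <= 0:
        return 0.0
    batches = math.ceil(pixels_covered / units)
    return pixels_covered / (batches * units)


one_pixel = shading_unit_utilization(1)     # 1/24: 23 of the 24 units sit idle
large_tri = shading_unit_utilization(2400)  # 1.0: every unit is doing useful work
```

Under this model, a triangle needs to cover at least a full batch of pixels before the bank of shading units pays for itself.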

In order to deal with this problem, recent GPUs have been designed with generic processors that can move back and forth between vertex processing and pixel processing.  While this helps solve the problem, it’s not a full solution.  For instance, you can’t assume that triangles won’t overlap, so you can’t process multiple triangles at the same time.  You also can’t assume that their memory access will be reasonable.  In the worst case, you end up scattering pixels all over the framebuffer.  This means texture data will likely be gathered from all over texture memory.  Remember that nice streaming pattern I was talking about?  That’s totally gone in this case, and performance will be severely impacted as a result.  So what now?

Curved surfaces with displacement maps are one possible solution to this problem.  The curved surfaces are adaptively subdivided into triangles, which keeps triangles a relatively uniform size in screenspace (which prevents them from becoming tiny).  This allows those huge banks of processors to once again become a good optimization strategy, and it largely brings back the streaming behavior seen when rendering large triangles.  However, once again, this isn’t a bulletproof solution.  While games would technically be able to represent more detail in less data, we’d still run into the small triangle problem with very distant surfaces (in this case, we’d have small patches).  Also, evaluating curved surfaces is expensive, and requiring displacement maps only increases this cost.  All is not lost however.  Since GPUs are still quickly growing in performance, this will likely become the most popular rendering technique for games in the near future.
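A rough sketch of how adaptive subdivision can keep triangles near a target screen-space size is shown below. The heuristic (splitting an edge into enough segments that each one projects to about `target_pixels`), the pinhole camera model, the default values, and the function names are all assumptions made for illustration. Real tessellators use more elaborate metrics.

```python
import math

def projected_edge_pixels(edge_length, distance, fov_degrees=90.0, screen_width=1024):
    """Approximate on-screen length, in pixels, of an edge facing the camera."""
    focal_px = (screen_width / 2) / math.tan(math.radians(fov_degrees) / 2)
    return edge_length * focal_px / distance


def edge_segments(edge_length, distance, target_pixels=8.0, **camera):
    """Number of segments to split an edge into so each covers roughly target_pixels."""
    pixels = projected_edge_pixels(edge_length, distance, **camera)
    return max(1, math.ceil(pixels / target_pixels))


near = edge_segments(1.0, 10.0)    # a close patch gets subdivided several times
far = edge_segments(1.0, 1000.0)   # a distant patch is not subdivided at all
```

The distant case shows the limitation mentioned above: once a whole patch projects to less than a pixel, no amount of adaptive subdivision can make it larger, so the small-primitive problem returns.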

Another technique aimed at dealing with this problem is sparse voxel octrees.  This particular representation has the added advantage of allowing for global scene access within pixel shaders, so this might very well become the technology that replaces rasterization as the most popular rendering algorithm for games.  Only time will tell I suppose. =)

printf( “Hello world!” );

•June 8, 2010 • 5 Comments

Hello everybody!  This is the very first post to my brand-spankin-new Blog!  From this point on, I plan on updating this every Monday with rants about programming, graphics, and life in the game industry!  If you can’t tell by all of these exclamation points, then I’ll have you know that I’m really excited to have this blog!