Archive for December, 2012

Fast collisions

Writing your own physics engine is no good unless the code runs fast enough. This is how I got 1000 boxes in a stack running at 60 fps.


Very Sleepy is a free sampling profiler, easy to install and simple to use. I used it to find the hotspots, and I would have got nowhere without it.

Sparse Matrices

The matrices involved in collision calculations generally have only a few non-zero elements in each row, so using NxN storage space is very wasteful of memory and expensive to iterate through. It’s still important to store all the data contiguously. I store the inverse of each diagonal element as well to save a division in the solver.

Accelerating Gauss-Seidel

Gauss-Seidel is reliable, but it couldn’t be described as fast. It crawls to convergence in linear steps. It’s possible to improve on this by scaling the step taken on each iteration (the classic successive over-relaxation technique). But scale it too much and the convergence is worse, perhaps disastrously so. There’s no simple solution to finding the right scaling factor (the relaxation parameter). To be honest, I tried to follow the maths but gave up when the problem looked harder than solving the equation in the first place. In the end I settled on increasing the parameter slowly as long as the system seemed to be converging, and reducing it at the first sign of trouble. While perhaps not optimal, this was measurably faster than standard Gauss-Seidel.

Warm starting the solver

In a stable stack of objects the constraint forces don’t change much from one frame to the next. Caching the solution and using it as a starting point for the next frame can dramatically cut the number of iterations required per frame.

Limiting memory allocations

I’m careful to avoid O(N^2) algorithms throughout, but when it comes to memory allocations, you can’t even afford O(N) per frame.

For example, in my first attempt at caching the solver results, the overhead was bigger than the savings! I used std::unordered_map and it was allocating memory for every element. I wrote a basic hash table to use instead. With small elements it makes sense to store them in the table itself rather than allocating memory for each one as STL does.

Sleeping Objects

I’ve mentioned this before, but inactive objects can go to sleep and use zero processor time. When it comes to stacks, the entire stack has to sleep in one go or it will become unstable. In theory it should be possible to put a stack to sleep in stages starting at the bottom, but I haven’t quite worked that out.


A physics upgrade

Naturally, once I had the fluid dynamics running I wanted to hook it up to the physics engine and have the explosions move objects around. It didn’t quite work at first because

  • The characters moved by setting their velocity directly, so the force from the blast didn’t affect them.
  • Other objects had no friction and would fly forever once set in motion.
  • Collisions only resolved overlapping bodies and didn’t affect velocity.

It turns out that resolving positions in a collision is pretty easy but resolving velocities is a tricky business. Well, I can’t resist this kind of thing so I had to have a go.

Suppose we have a large number of objects that need to be updated in a time step. In that time, some of them may collide, and the positions and velocities at the end of the time step should respect certain constraints, such as non-overlap and positive or zero relative velocity at every contact point. Also, basic laws of collisions such as conservation of momentum and the coefficient of restitution should apply. Numerical artefacts such as jitter should be minimised. And it needs to be fast.

I’ll start with a method for resolving positions, because, as I’ve said, it’s quite straightforward and leads on to part of the velocity solution. First, we need a list of potential collisions. Why not actual collisions? At this stage, it’s just not possible to tell. Object A may appear to miss object B, but then object C comes along and knocks B into the path of A. I collect collisions using a tolerance based on the maximum local velocity. Each potential collision provides a constraint equation, and these equations form a linear system, which can be solved, for example, using the Gauss-Seidel method. The output is a list of offsets which, when applied along the collision normals, will prevent intersection at the end of the time step. Although iterative, the solution can be calculated to any accuracy so there is little jitter.

Now, a related technique can resolve intersections by changing the velocity of each body so that colliding bodies come exactly into contact at the end of the time step. This works only for perfectly inelastic collisions, where the objects stick together: the post-collision velocity is used to correct the intersection, so it can’t carry any bounce.

What if we resolve positions and velocities separately? Can there be bounce in that case? It turns out that no, it doesn’t work. You need to know the outgoing velocities at the start, but these depend on the global system. Get them wrong and energy is not conserved, and the bounce may look odd.

There is another method, though. The collisions can be processed one by one until none remain, that is, all the relative velocities are positive. This works (almost) correctly for perfectly elastic collisions, and not quite so well for inelastic collisions. Even slightly inelastic collisions dissipate too much energy over several iterations, and convergence becomes poor as the coefficient of restitution (Cr) approaches zero.

Interestingly, a combination of the two methods is perfectly fine. Calculate a result for Cr = 0 using a linear system and another for Cr = 1 using sequential impulses and they can be mixed together according to the desired value of Cr. A limitation of this technique is that all simultaneous collisions have the same Cr. But it’s the only one I know of that works! All others are a poor approximation. (As opposed to a better approximation – there is no exact solution.)

The only collision paper that makes sense:

A 600 box stack.


Fluid dynamics part 3

The SSE floating point registers are 128 bits wide and can process four 32-bit floats at once, but I’m running the simulation in serial, one float at a time. Can this be improved?

I can’t rely on the compiler to do anything about it. It just isn’t a simple enough case for the compiler to detect. And the potential is there for a 4x speed-up. Nothing less will do! So I have to do it by hand.

The strategy is to process four cells in parallel. And I would like the code to resemble the non-parallel version, too, to make switching back and forth for testing and debugging easy. So rewriting the whole thing using intrinsics is out. Instead, replacing every float with a ‘float4’ would turn a cell into a block of cells, and then there would be four times fewer of them.

I needed a float4 type, which doesn’t exist in C++. So I made one. It’s just a class with a single __m128 member. It has no named member functions (float doesn’t have any), just constructors, casts and overloaded operators. The compiler seems to optimise this pretty well if everything is inlined.

There’s just one problem. To calculate a flow, I need a cell and its neighbour. But the neighbouring cell isn’t a separate object, it’s either offset by one in the same block, or it’s the first cell in the next block. To get a block of four neighbouring cells I use shuffling and masking to shift a block to the left, and then when applying the flow I shift it the other way. This shifting doesn’t add much overall to the cost. I can make it work with or without SIMD like this:

	// With SIMD: build the neighbour block by shifting across by one cell
	Block blockRight;
	ShiftBlockLeft(blockRight, *pBlock, *(pBlock + 1));

	// Without SIMD: the neighbour is simply the next block
	// Block& blockRight = *(pBlock + 1);

And that’s about it. And it actually is four times faster. So that’s nice.

Fluid dynamics part 2

Consider two cells of unit area, touching along a boundary of unit length. We have the average mass and momentum of each cell, and we want to know how these will change in a unit time step. (Using units based on the cell size and time step makes things simpler. Scaling factors can be folded into constants, saving some multiplications.)

The flow of mass is density times velocity. And since density times velocity is the momentum density, the flow in unit time is

mass flow = dot(momentum, flow direction)

The flow of momentum is only slightly more tricky. It’s the mass flow times the velocity, and velocity is momentum divided by density, so

momentum flow (parallel) = mass flow * mass flow / density
momentum flow (perpendicular) = dot(momentum, perpendicular) * mass flow / density

All these values are interpolated at the midpoint of the two cells.

Each cell has several neighbours, so updating the grid is a two-pass process. First, calculate all the flows. Then, apply them to the cells. To update the grid,

source density -= mass flow
dest density += mass flow
source momentum -= momentum flow
dest momentum += momentum flow

Note that the same amount is added to one cell as is subtracted from the other, so both mass and momentum are conserved. The only problem comes if one of the densities becomes negative. There are various ways of dealing with this, but just making sure that no more mass can flow than is actually in the cell is the easiest.

This is all fine, but run it and nothing actually happens because all the initial momentums are zero. Trying to start an explosion by putting a huge density in one cell doesn’t do anything either. Why is that? If each cell is full of particles and the average momentum is zero, that doesn’t mean the momentum of every particle is zero. In fact, they will be flying in all directions and it just happens that it all averages out. If a cell has more particles than its neighbour, some particles will just randomly cross over, and fewer will come back. So there is a flow proportional to the difference in density between the cells.

mass flow += (source density - dest density) * diffusion constant

If there’s mass flow, then there’s momentum flow, from the above formula.

Once things get moving they aren’t necessarily going to stop. You can end up with a sort of lava lamp effect as the mass flows around. Applying some damping can sort this out. For each cell,

momentum *= damping factor

And draining the excess mass can be useful too,

density = density * drain factor + (1 - drain factor)

Also, some friction between cells helps to smooth out the flow.

momentum flow (perpendicular) += momentum difference * friction constant

What are the neighbours of a cell? Technically, only the four touching cells. But eight-way flow looks better. Hexagons would be an alternative, but I like the cells to line up with the walls.

Boundary conditions are important. A cell can be part of the active area, or it can be behind a wall. Nothing can flow through a wall, so the flow is zeroed if a cell or its neighbour is out of the active area. It is, however, possible for momentum to bounce off a wall.

So the idea is simple, but there are several subtleties to the implementation. It helps to have an extra row of wall cells all the way around the grid, and the flow calculation stops one short of the edge. For rendering, I want the cell corners, so I resample the grid, averaging the four cells around each corner.

The speed is not bad. A full screen of cells at 8 pixel resolution runs at 60 fps. Still, I wanted it faster and the algorithm seems a perfect match for vectorization. So that’s the next post.

Fluid dynamics

After working on refactoring for a few days I wanted something exciting. So I thought I would try to make explosions. I wanted something better than a sprite, because after all, I can’t draw, and also the idea of shockwaves bouncing off walls and travelling down corridors was appealing.

I had a look online and found plenty of demos, mostly showing smoke inside a box or something equally useless running on a GPU. There are tutorials, and they all start with lots of maths and are obsessed with stability, with no intuitive explanations. And they all want to simulate incompressible flow because compression makes the maths even harder. I don’t care about physical accuracy! I just want fireballs!

So I started to think about how you might simulate something like that. I decided that the system is essentially a huge number of particles, far too many to simulate individually, but it could be managed by averaging the properties of particles inside fixed regions or cells. Evolving the system is then a matter of determining how the average properties in a cell would change. The simplest particles have two properties: mass and momentum. So the important properties of a cell are the total mass (or equivalently, the density) and the total momentum. To give at least some physical basis, these should be conserved throughout the system. Mass and momentum can flow between a cell and its neighbours. Make sure the flow is always plausible and it has to be stable (if it blew up, I knew I had a bug – a sign error usually).

The details call for another post, but for now, a picture.


Ask how to render game objects on a programming forum and you will probably be told something along the lines of ‘give each class a virtual Render() function and call it on every object from the main render loop’. This approach is popular with beginners because they know all about object-oriented programming and not much about rendering.

It falls apart very quickly. In general, a single pass over all objects is not enough, and the order is important. The interface can be made more complex but it has to be duplicated in every object. And after all that, 99% of the objects are rendered in exactly the same way; either they are a model or (in a 2D game) a sprite, so the abstraction is useless.

The next step is to build the objects out of components, so each object gets a render component and the rendering loops over those instead. This is better, but not much. There’s still no overall structure to the scene, which makes it hard to control and optimise.

And then, if you know a lot about object-oriented programming and something about rendering too, you end up with the scene graph.

There are lots of game engines built around scene graphs (I’m not linking to them – they know who they are). It’s an idea that comes from graphical editing tools. The structure of the scene is a tree, so objects can be grouped together, which is very useful when you are editing a scene. You can have various types of nodes which afford various properties to their sub-nodes, so a tree can perform a variety of tasks. And this is true in a game too. Unfortunately, because the tree is a very general data structure, it’s not especially good at anything.

Let’s look at what you might want a scene graph to do:

  • Logical grouping of game objects
  • Relative transformations
  • Culling and visibility
  • Level of detail (LOD)
  • State management
  • Collision detection

Note that a lot of that has nothing to do with rendering. Mixing unrelated data means more complex code and slower processing. It should be split off into the appropriate subsystem instead. The game logic system can group objects together. The physics or animation systems can constrain objects and stick them together. Collisions are better handled by a specialised data structure containing only collision information.

That leaves culling/visibility, render states and LODs. Where does any of that benefit from a tree structure? On a modern CPU, it’s quicker to cull objects in a flat list than traverse a tree, even if there are thousands of them. At most, you could go one level deep by specifying ranges in the list with a combined bounding box. The same goes for states. Have a flat list and sort it by effect (an effect is a combination of render state and shaders). For LODs, put them all in the list and cull the unwanted ones according to distance from the camera.

And does the game code care about this sort of organisation? Generally, no. Just chuck the object in there and let the renderer do its job. This makes the interface to the renderer very simple. You can still build levels and models using a tree. It just doesn’t serve much purpose at runtime. Flatten it in a preprocessing step or on loading.

In summary, a game is not a collection of objects, and it’s not a 3D editing package. Design the code around what actually needs to happen, and it ends up a lot simpler, and works better.



There’s very little to choose between Allegro and SDL. They have similar features and interfaces. I can’t remember why I chose SDL to begin with. Anyway, after a bit of research I decided to try Allegro to solve some rendering problems.

It only took a day to make the switch. I already used SDL mostly through a wrapper so I only had to make new versions of a few cpp files. Operation was in most cases virtually identical.

But they aren’t quite the same. SDL (at least the latest version that isn’t stuck in development hell) uses a software renderer. Allegro uses hardware rendering. Some basic features that are standard in hardware rendering are missing from the SDL renderer. If you want to use them, you have to go through OpenGL. Allegro provides access to these through a simple interface.

So that was my choice. Change all the rendering to OpenGL, or switch libraries to Allegro and stay with a 2D API.

On the non-technical side, Allegro seems to have more active development and more helpful forums. So maybe there is something to choose between them after all.