The floating point registers are 128 bits wide and can process four 32-bit floats at once, but I’m running the simulation in serial, one float at a time. Can this be improved?

I can’t rely on the compiler to do anything about it. It just isn’t a simple enough case for the compiler to detect. And the potential is there for a 4x speed-up. Nothing less will do! So I have to do it by hand.

The strategy is to process four cells in parallel. And I would like the code to ressemble the non-parallel version, too, to make switching back and forth for testing and debugging easy. So rewriting the whole thing using intrinsics is out. Instead, replacing every float with a ‘float4’ would turn a cell into a block of cells, and then there would be four times fewer of them.

I needed a float4 type, which doesn’t exist in c++. So I made one. It’s just a class with a single __m128 member. It has no named member functions (float doesn’t have any), just constructors, casts and overloaded operators. The compiler seems to optimise this pretty well if everything is inlined.

There’s just one problem. To calculate a flow, I need a cell and its neighbour. But the neighbouring cell isn’t a separate object, it’s either offset by one in the same block, or it’s the first cell in the next block. To get a block of four neighbouring cells I use shuffling and masking to shift a block to the left, and then when applying the flow I shift it the other way. This shifting doesn’t add much overall to the cost. I can make it work with or without SIMD like this:

	Block blockRight; 
	ShiftBlockLeft(blockRight, *pBlock, *(pBlock + 1));
	Block& blockRight = *(pBlock + 1);

And that’s about it. And it actually is four times faster. So that’s nice.