Tag Archive: c++

An immediate mode GUI

I’ve been using an immediate mode GUI system for a while and I think the code is stable enough now to talk about my experience of it.

The definitive tutorial – though I deviated a lot from this:


I have so far made a level editor and a database using this system. I used my own GUI system because I knew I would have complex interactions between the GUI and the scene/document, and I didn’t fancy learning and possibly fighting an existing API. It was not for the final result, which would be better using proper Windows controls. Apart from having a non-standard look and feel I have no file selector, no copy/paste in text boxes and no tool-tips. But as the only person using it, I felt free to make that sacrifice.

Some techniques that I used:

  • Simple interface – For example, if (gui.DoButton(“MyButton”)). There are some optional input flags and that’s it.
  • No Widget IDs – I felt they were ugly, so I dispensed with them. Instead, I identify widgets by counting. This works as long as the GUI does not change while a widget is still active, and this has always been the case. The GUI changes in response to triggered widgets, and  widgets are deactivated when they are triggered. There can only be one active widget at a time, therefore there cannot be any active widgets after a trigger.
  • Automatic layout – It would not be practical to hard code the position of every button. I have vertical and horizontal lists and layout flags that can be passed to any widget. The size of the layouts is calculated and used in the following frame.
  • RAII – Layouts are created as objects on the stack. When they go out of scope, they remove themselves from the GUI layout stack.
  • Storing temporary data inside the GUI – For example the text box in the tutorial above will output values to the buffer while they are being edited. This complicates the code. My text box takes an initial string, then holds the temporary string internally while it is edited, returning the final string only when triggered.
  • Multiple trigger conditions –  How can you have a button that can distinguish left and right clicks when it only returns a bool? Place two buttons in the same spot, using a flag to prevent the bottom one being hidden. One is triggered on a left click, the other on a right click.
  • Building complex widgets out of basic ones – I have menus, lists, trees and tables. They are all buttons and labels underneath.
  • Drag boxes and Hot boxes – Drag boxes allow the user to click in a window and move it, triggering when it is dropped. Hot boxes report mouse clicks inside them. They allow painting tiles in the level and moving objects around. These are ridiculously simple to use.

I had some trouble organising the code on the user side until I realised what was going on. It’s not event-driven. The code has to be arranged the same way the interface is. That may sound inflexible – well, maybe – but it breaks down into windows, toolbars, menus and so on. Instead of events, I kept a queue of deferred actions. Want to load a file? Well, you need ask first whether to save the old one. Then request a file name. The actions might happen anywhere; it just depends where in the interface they are handled. Trust the action queue.

It’s very efficient in terms of the amount of code used, and the logic is easy to follow. My entire database interface is under 1000 lines of code, using a total of 120 widgets. And that does a lot of stuff: record editing, tables, a record tree, a schema builder and all the associated menus and dialog boxes. Now, if I could just figure out how to do tool tips…

Fast collisions

Writing your own physics engine is no good unless the code runs fast enough. This is how I got 1000 boxes in a stack running at 60 fps.


Very Sleepy is a free sampling profiler, easy to install and simple to use. I used it to find the hotspots, and I would have got nowhere without it.

Sparse Matrices

The matrices involved in collision calculations generally have only a few non-zero elements in each row, so using NxN storage space is very wasteful of memory and expensive to iterate through. It’s still important to store all the data contiguously. I store the inverse of each diagonal element as well to save a division in the solver.

Accelerating Gauss-Seidel

Gauss-Seidel is reliable, but it couldn’t be described as fast. It crawls to convergence in linear steps. It’s possible to improve on this by scaling the step taken on each iteration. But scale it too much and the convergence is worse, perhaps disastrously so. There’s no simple solution to finding the right scaling factor (the relaxation parameter). To be honest, I tried to follow the maths but gave up when the problem looked harder than solving the equation in the first place. In the end I settled on increasing the parameter slowly as long as the system seemed to be converging, and reducing it at the first sign of trouble. While perhaps not optimal, this was measurably faster than standard Gauss-Seidel.

Warm starting the solver

In a stable stack of objects the constraint forces don’t change much from one frame to the next. Caching the solution and using it a starting point for the next frame can dramatically cut the number of iterations required per frame.

Limiting memory allocations

I’m careful to avoid O(N^2) algorithms throughout, but when it comes to memory, you can’t even afford O(N).

For example, in my first attempt at caching the solver results, the overhead was bigger than the savings! I used std::unordered_map and it was allocating memory for every element. I wrote a basic hash table to use instead. With small elements it makes sense to store them in the table itself rather than allocating memory for each one as STL does.

Sleeping Objects

I’ve mentioned this before, but inactive objects can go to sleep and use zero processor time. When it comes to stacks, the entire stack has to sleep in one go or it will become unstable. In theory it should be possible to put a stack to sleep in stages starting at the bottom, but I haven’t quite worked that out.

Fluid dynamics part 3

The floating point registers are 128 bits wide and can process four 32-bit floats at once, but I’m running the simulation in serial, one float at a time. Can this be improved?

I can’t rely on the compiler to do anything about it. It just isn’t a simple enough case for the compiler to detect. And the potential is there for a 4x speed-up. Nothing less will do! So I have to do it by hand.

The strategy is to process four cells in parallel. And I would like the code to ressemble the non-parallel version, too, to make switching back and forth for testing and debugging easy. So rewriting the whole thing using intrinsics is out. Instead, replacing every float with a ‘float4’ would turn a cell into a block of cells, and then there would be four times fewer of them.

I needed a float4 type, which doesn’t exist in c++. So I made one. It’s just a class with a single __m128 member. It has no named member functions (float doesn’t have any), just constructors, casts and overloaded operators. The compiler seems to optimise this pretty well if everything is inlined.

There’s just one problem. To calculate a flow, I need a cell and its neighbour. But the neighbouring cell isn’t a separate object, it’s either offset by one in the same block, or it’s the first cell in the next block. To get a block of four neighbouring cells I use shuffling and masking to shift a block to the left, and then when applying the flow I shift it the other way. This shifting doesn’t add much overall to the cost. I can make it work with or without SIMD like this:

	Block blockRight; 
	ShiftBlockLeft(blockRight, *pBlock, *(pBlock + 1));
	Block& blockRight = *(pBlock + 1);

And that’s about it. And it actually is four times faster. So that’s nice.