How to write a 4k demo/intro
Now I’m almost done with the functional part of my 4k intro (read: all the tricky stuff is working, and there’s still lots of space to add new stuff). I’ll use this occasion to summarize some of my findings. Maybe someone else will find it useful (but I doubt that :) – I surely do, so this is also some kind of reference for myself. So here is KeyJ’s little TinyDemoWritingGuide. It’s targeted towards experienced C programmers. Not everything here may be completely true as I’m a beginner in the 4k field for myself.
Most newschool demos are written for the Win32 platform, and this one will be no exception. Writing 4k demos here is really easy nowadays (I wouldn’t have done it if this hadn’t been the case :), thanks to Crinkler, an excellent executable compressor (quote) specifically targeted towards compressing 4k intros (end quote). It does not work with lcc-win32 (yet? or ever?), but this isn’t going to be a problem either: Microsoft’s Visual C++ 2005 Express Edition, the Platform SDK and the DirectX SDK are available for free. If you don’t trust the WGA check that is required for the regular download of the DirectX SDK, don’t worry, there are alternative sources on the Internet.
Setting all this stuff up is left as an exercise to the reader. Especially Crinkler can be tricky; I ended up with a modified version of the VC++ command-line environment, because I could not get it working as a drop-in replacement for Microsoft’s linker.
Don’t expect your code to work with Crinkler »out of the box«. It won’t. You have to visit all the compiler and linker settings of your project first and set them to reasonable values. Common sense is sufficient to properly set up most of the switches – if you don’t understand these, you’re probably not going to see the finish line anyway. In particular, make sure to disable whole-program optimization (otherwise the compiler won’t generate COFF files) and disallow any default library (LIBCMT is a good starting point). Finally, add the undocumented
/QIfist parameter to the compiler command-line. This will remove any references to a library function that is invoked on each float-to-integer conversion. The price you’ll have to pay for that is unpredictable rounding behaviour (don’t expect values to be rounded down!).
Remove all library function references
You’ll most likely have some references to standard library functions like
memcpy in your code. Of course, you need to get rid of them and rewrite them with your own implementation. Unfortunately, Microsoft’s compiler seems to have some heuristics that detect
memset-like code and replace them by a library call. That’s absolutely not what you want, so you have to come up with a method to disguise your code. I succeeded by using a reverse loop instead of a forward one, but finally, I did the logical thing and replaced it by a compact assembler version centered around x86’s mighty
rep stosb instruction.
Optimize for code size, not performance
OK, this is dead obvious. But I repeat it nonetheless, because it’s a thing you have to keep in mind constantly
Don’t do any uninitialization,
ExitProcess(0) is more than sufficient.
Read the API documentation thoroughly to determine the absolute minimum set of initialization functions. In most cases, you don’t even need a message loop, for example.
For large structures like window classes or pixel format descriptors, use constants. Use assignments only for stuff that the compiler won’t grok otherwise.
Don’t buffer things that can be computed in realtime fast enough, because buffering always introduces overhead.
Avoid loop unrolling.
Don’t be afraid to use dirty code to reach these goals – you won’t win a compo with textbook source code.
Minimize floating point usage
One bad thing about having no default library is that
float doesn’t work flawlessly: the compiler constantly generates library calls to some conversion routines. SSE and SSE2 don’t work either, for the same reason. So all you get is double-precision FPU arithmetic, which tends to generate outrageously large code. The compiler tries to merge constants as good as possible, but still, every constant uses 4 to 12 bytes of code/data, not including the
fld opcode itself. The 4 bytes are the size of the memory address used to reference the constant’s address, and the additional 8 bytes are for the constant itself. Moreover, the stack structure of the FPU is really inefficiant, yielding additional bloat. So, reduce floating point usage to a bare minimum and try to do as much as possible with integers instead.
Clever use of functions
Everything (I mean everything!) that is more than approx. 10 bytes and used multiple times should be extracted and put into a function. This also includes simple stuff like PRNG postprocessing, especially if floating point math is involved. To reduce function call overhead, you should always use the
__fastcall convention, because this saves the nasty push/pop orgies in function prologs and epilogs and may even skip stack frame generation completely.
Every function that is used only one time should be declared with
__inline isn’t always sufficient. In times when you misuse the C compiler as a better assembler, you should specify exactly what you want.) This eliminates function call overhead altogether. Please note that the compiler seems to be unable to merge nested inline function calls.
Devise compact data structures
For geometry and music data, you’ll surely have to use a data structure to store it. Usually, there is lots of potential to minimize these. The first rule of thumb is to use the smallest data type possible. If something doesn’t fit in a
char, you should rethink your concept. Don’t be afraid to quantize vertex coordinates into single bytes – the difference between 0.7 and the real square root of 1/2 won’t be noticeable anyway. Next, disable all kind of stuffing or alignment. If the compiler decides to word-align the members of your
char x,y,z structure, you’ve won nothing.
Another important part is to find the right balance between direct and indirect indexing. If you use 3-byte vertices, but only half of the vertices used are actually different from any other of the set, a list with indexes to an array of unique vertices is the better choice. However, be careful not to overdo it: if most of your vertices are already unique, the overhead of the index list will make things worse.
You can help this »unification« process by tuning the vertex coordinates to the optimal compromize between high unification and low quantization error. Or better yet, design your model to have ideal properties from the ground up.
Finally, keep in mind that your raw data is going to be compressed, so it makes sense to introduce as much local redundancy as possible. This often means so convert AOS (array of structure) data to SOA (structure of arrays). For example, If you have vertex position and color data, separate them, because the positions are likely to have a completely other distribution and pattern than the colors and vice-versa. If possible, sort the data so that similar values are in close proximity to each other.
Check the generated code
A good programmer, and in particular one who is going to write demos, should always have a little compiler running in his/her head while coding, just so he/she can estimate how the generated machine code looks like. However, you can’t always trust the real compiler. Fortunately, it has an option to write an assembler dump of the generated code to a file, including the C source lines and even machine code. This is tremendously useful for checking if the MSVC-made code lives up to the expectations. In most cases, it does: I couldn’t find too much parts where hand-optimizations are possible of feasible. Of much greater help is the hex code output, because it gives you a feeling about how big the instructions are. And it helps you to find spots of a bad C-command-to-machine-code ratio. For example, I discovered a huge waste of approximately 30 bytes for a simple
glVertex3d(a.x, a.y, a.z) instruction (and because I’m drawing triangles, this code was actually present three times with three different vertices). By replacing this with
glVertex3dv(&a.x);, I gained a two-thirds improvement in code size for this part.
After the release of my 4k intro, I’ll publish some more in-depth information on how it was made.
It will help me a lot when I code my 4K game :)
You mentioned glVertex3dv. Did you try to use DirectX functions instead ? I wander which version generates smaller final code.
Good man! I will be waiting more about! Nice post!
Very nice article. Helped me to get started :)