Compiler benchmark

(February 6, 2009)

I usually write my demos using Microsoft’s C compiler for Win32 and GCC for Linux. But how well does Intel’s compiler optimize? And can the performance of MSVC and GCC be improved using a clever selection of compiler switches? That’s what I wanted to find out, so I wrote my own little benchmark based on some code from my demos and ran it through all these compilers with different options. The results are a little bit different from what I expected …

The tests

As I already mentioned, my benchmark contains some real-world code from two of my latest demos. I stripped off all the rendering code, though – all the math is still present, but nothing is submitted to OpenGL. The simulation is run over a varying number of frames (from 500 to 20000) depending on the scene to get a roughly equal run time for all three tests. The actual speed is then converted into frames per second.

All tests were run on a Core2 Duo T7300 @ 2.00 GHz on Windows XP. The tests were repeated three times and the highest result was selected.
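
The measuring procedure described above (run a fixed number of frames, convert to frames per second, repeat, keep the highest value) could be sketched roughly like this. This is not the original benchmark code: the names frame_fn, frames_per_second, best_of and measure_fps are made up for illustration, and the use of clock() as the timer is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical callback type: advances the simulation by one frame. */
typedef void (*frame_fn)(void);

/* Convert a frame count and an elapsed time into frames per second. */
double frames_per_second(unsigned frames, double seconds)
{
    /* Guard against a timer too coarse to register the run. */
    return seconds > 0.0 ? (double)frames / seconds : 0.0;
}

/* Return the highest of n measurements; n must be at least 1. */
double best_of(const double *fps, size_t n)
{
    size_t i;
    double best;
    assert(n > 0);
    best = fps[0];
    for (i = 1; i < n; ++i)
        if (fps[i] > best)
            best = fps[i];
    return best;
}

/* Time `frames` calls of simulate_frame and return the resulting fps.
   clock() is an assumption; the timer of the real benchmark is unknown. */
double measure_fps(frame_fn simulate_frame, unsigned frames)
{
    unsigned i;
    clock_t start = clock();
    for (i = 0; i < frames; ++i)
        simulate_frame();
    return frames_per_second(frames, (double)(clock() - start) / CLOCKS_PER_SEC);
}
```

With three results from measure_fps stored in an array, best_of(results, 3) gives the figure that would be reported for that compiler configuration. Taking the maximum rather than the average is a simple way to filter out runs that were disturbed by other processes.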

The compilers

I used two different versions of the Microsoft compilers for the benchmark: The one which ships with Visual C++ 2005 Express Edition (which is the one I used for all my demos from 2007 to now) and the one which ships with Visual C++ 2008 Express Edition. I let each of them run with multiple sets of options: Unoptimized (vc200x_unopt) and optimized, where the latter is also used together with optimizations for the traditional x87 FPU (vc200x_fpu), SSE (vc200x_sse) and SSE2 (vc200x_sse2).

The GCC tests were performed with a current MinGW pre-release version of GCC 4.3. The build options first step through the optimization levels: none (gcc_unopt), -O1 (gcc_O1), -O2 (gcc_O2) and -O3 (gcc_O3). The -O3 options are then extended cumulatively: first by -ffast-math (gcc_O3_fast), then additionally by GCC’s optimizations for the Core microarchitecture (gcc_O3_core), and finally, on top of both, by SSE (gcc_O3_sse) or SSE2 (gcc_O3_sse2) code generation.

For the Intel compiler tests, the current version (11.0) was used. First, the compiler was run without any optimization options (icl_unopt). Then, the optimization levels were tested: /O1 optimized for size (icl_O1) and for speed (icl_O1t), /O2 (icl_O2) and /O3 (icl_O3). Like with GCC, various additional options were tested with /O3, namely optimizations for SSE (icl_sse), SSE2 (icl_sse2) and SSE3 (icl_sse3). The SSE2 and SSE3 builds were again tested with auto-parallelization to check if Intel’s compiler can automatically benefit from dual-core machines (icl_sse2_par, icl_sse3_par).
In the tests, it turned out that Intel’s compiler does not deactivate all optimizations when run without parameters, but uses a default set of optimizations. The result is that the icl_unopt scores are quite a bit better than expected. There’s no way to disable optimizations on that compiler, as the /O0 switch, for reasons unknown, is only available on Unix platforms.

Here’s a complete list of all compilers and options:

Microsoft Visual C++ 2005 (14.00.50727.762)
vc2005_unopt cl /GF /FD /MT /GS-
vc2005_fpu cl /Ox /Ob2 /Oi /Ot /Oy /fp:fast /GF /FD /MT /GS-
vc2005_sse cl /arch:SSE /Ox /Ob2 /Oi /Ot /Oy /fp:fast /GF /FD /MT /GS-
vc2005_sse2 cl /arch:SSE2 /Ox /Ob2 /Oi /Ot /Oy /fp:fast /GF /FD /MT /GS-
Microsoft Visual C++ 2008
vc2008_unopt cl /GF /FD /MT /GS-
vc2008_fpu cl /Ox /Ob2 /Oi /Ot /Oy /fp:fast /GF /FD /MT /GS-
vc2008_sse cl /arch:SSE /Ox /Ob2 /Oi /Ot /Oy /fp:fast /GF /FD /MT /GS-
vc2008_sse2 cl /arch:SSE2 /Ox /Ob2 /Oi /Ot /Oy /fp:fast /GF /FD /MT /GS-
GCC (4.3.2-tdm-2 mingw32)
gcc_unopt gcc
gcc_O1 gcc -O1
gcc_O2 gcc -O2
gcc_O3 gcc -O3
gcc_O3_fast gcc -O3 -ffast-math
gcc_O3_core gcc -O3 -ffast-math -march=core2
gcc_O3_sse gcc -O3 -ffast-math -march=core2 -msse -mfpmath=sse
gcc_O3_sse2 gcc -O3 -ffast-math -march=core2 -msse2 -mfpmath=sse
Intel C Compiler (11.0.066, Build 20081105)
icl_unopt icl
icl_O1 icl /O1
icl_O1t icl /O1 /Ot
icl_O2 icl /O2
icl_O3 icl /O3
icl_sse icl /O3 /QxK
icl_sse2 icl /O3 /QxW
icl_sse3 icl /O3 /QxO
icl_sse2_par icl /O3 /QxW /Qparallel
icl_sse3_par icl /O3 /QxO /Qparallel

The results

The first benchmark scene is the »magnet« scene from the end of Vortex 2. It’s not very computationally intensive, but hard to optimize due to bad coding ;)
For the most part, the results are pretty much as expected: MSVC and GCC are roughly equally fast, Intel’s compiler is 30% faster, and the results generally scale well with the optimization options. However, there’s one peculiarity: Even though the code uses single-precision floating-point arithmetic only, MSVC and ICL yield surprisingly low speeds in their SSE builds and need SSE2 to reach their maximum performance. Only GCC manages to reach the expected values with SSE.

This scene is taken from the exploding flower leaves in 8-Bit Wonderland. Again, it’s pure single-precision FP arithmetic, but this time, the code is much cleaner and should be easily optimizable. Or so I thought. In fact, the results for this benchmark are quite weird. MSVC shows no obvious inconsistencies. GCC’s results fall into just two speed classes: One of them is roughly as fast as the unoptimized MSVC code, the other one is 20% faster. The weird thing is that it reaches its fastest speeds with -O3 -ffast-math, whereas SSE and SSE2 optimizations slow things down again.
The Intel compiler is far ahead of the others in this test, almost doubling GCC’s best performance. Again, the SSE version is much slower than most of the others, but it’s still faster than MSVC’s best results.

The cube grid / greetings scene from Vortex 2 has a good mix of integer and single-precision FP arithmetic, together with many memory accesses. It’s also the slowest of the tested scenes and the one with the smallest differences between the benchmark results.
MSVC acts a bit like the Intel compiler in the other tests: The SSE scores are lower than the FPU ones. GCC is fastest without SSE/SSE2, just like in the flowers scene. However, the differences are much smaller. Anyway, it doesn’t quite reach MSVC’s performance, even in the fastest setting. ICL, on the other hand, is consistently 25% faster than MSVC in this test. Interestingly, for the /O1 optimization level, the size optimization performs a little bit better than the speed optimization.


Overall, there’s a clear winner: Intel’s compiler is the fastest one, beating the others by 25% to 100%. However, even though the compiler said it parallelized some of my loops, this didn’t have much impact on the performance. In some cases, the parallelized code was even a tiny bit slower than the sequential one.

The standard compilers (MSVC and GCC) are roughly equally fast – sometimes one of them is faster, sometimes the other one. The differences between the two MSVC versions are minimal; if Microsoft made any changes, they must be homeopathic.
A very strange fact about GCC is that there’s no difference between the SSE and SSE2 speeds. In one case, this was a good thing and raised the performance to a level even above Intel’s, but for the other two benchmarks, SSE/SSE2 was actually slower than the FPU with -ffast-math.

All things considered, it’s hard to draw a (personal) conclusion from the results. The Intel compiler is nice, but it’s hard to use (Visual Studio integration doesn’t work with the Express Editions) and tends to generate bloated code, 50% larger than MSVC’s. The differences between MSVC and GCC are noticeable, but still not worth the hassle of using GCC on Win32 or being sad because of the unavailability of MSVC on Linux. GCC’s behaviour with respect to SSE/SSE2 is strange, however – maybe future versions will rectify this.

Update [2012-08-22]: Today, somebody asked by mail whether he could get the sources for the benchmark. I didn’t release them initially because they were parts of my demo source code that I didn’t want to make public (and, honestly, because it’s really nothing to be proud of). But now, three years later, the code in question has been released anyway, so nothing holds me back – and here it is: (8k)
