nVidia driver bug?

(July 8, 2007)

While testing my current work-in-progress demo on my brand-new Vista-powered and GeForce-8-equipped laptop, I noticed some really strange rendering glitches. Since this was the only machine where the bug occurred, I thought it would be some bug in my code that caused incompatibilities with that particular driver version for that particular chip revision, or perhaps with Vista. However, a friend had the very same problem on a GeForce 7 card with Windows XP and a much older driver than the one I use on my main development PC, which has an nVidia card, too. This meant that the problem needed some serious debugging :)

The notebook still had its original junkware-infested standard Vista installation, and I didn’t want to install Visual Studio and all that stuff onto a Windows installation that was going to be killed the next day anyway. So I debugged “remotely”, that is, I started Visual Studio on my other notebook via Remote Desktop and launched the compiled executable on the new laptop itself. Strangely enough, the bugs disappeared! I found out that the problem only occurred if the program was run from the local disk – when run directly from a network share, everything was OK. At least the first time; the bug sometimes reappeared when the program was run again. Also, the behavior depended on the number of printf() calls in the code, which is really strange.

Anyway, I found a stable, but unsatisfying, solution. The code that failed was a routine that loaded simple geometry data (vertex data and indices) from disk. It did so by allocating memory, loading the whole file into the newly-allocated memory block, passing the loaded data directly into an OpenGL display list (via glInterleavedArrays() and glDrawElements(), surrounded by glNewList(..., GL_COMPILE)…glEndList()) and deallocating the memory afterwards. I figured it must be a problem with my file loader routine, which is actually much more complicated than the description I just gave. By not deallocating the memory after compiling the display list, the problem went away, reliably. So I stopped debugging at this point – I knew that it was just a hack that only hid some serious bug in my memory management code (at least, so I thought), but I didn’t want to dig any deeper then. Anyway, it’s just a demo; clean programming is not a requirement there :)

During the three-hour ride from Glauchau to Hannover today, the real reason for the bug suddenly occurred to me: It’s not a memory management problem, it’s a mere race condition!

Analysis

If I read the spec correctly, glDrawElements() should be an atomic operation: It should not return until all the primitives for that call have been drawn. My suspicion is that the nVidia OpenGL driver violates this constraint due to optimization: It seems that the driver tries to make use of multi-threading or multi-core facilities as well as possible, executing parts of the driver on another CPU or in another thread than the controlling OpenGL application. This is usually a good idea, because programmers are mostly lazy and don’t optimize for multiprocessing systems, so they can at least profit from the driver’s multi-CPU features. However, the nVidia approach might be a little bit over the top.

It looks like the nVidia driver defers glDrawElements() to the second core on multi-core processors, processing the vertices »in the background« while the application code continues to run. The problem is that the application may change the vertex data right after the glDrawElements() call: My code deallocated the data block, reallocated it (which usually yields the same address!) and loaded fresh vertex data from disk. From the viewpoint of the driver (which runs on the other CPU core) the vertex data suddenly changed during processing, so I got a weird mixture of vertices from the old and the new model.
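
To make the suspected mechanism concrete, here is a small simulation in plain C. It is only a sketch of my assumption, not of the real driver: FakeDriver, driver_submit(), driver_work() and stale_reads() are made-up names that have nothing to do with OpenGL, and the interleaving is fixed by hand instead of coming from a second thread, so the outcome is reproducible.

```c
#include <assert.h>
#include <stddef.h>

enum { NUM_VERTS = 8 };

/* Hypothetical stand-in for the driver; not a real OpenGL structure. */
typedef struct {
    const float *client;    /* pointer handed over by the application */
    float seen[NUM_VERTS];  /* values the "driver" actually read */
    size_t next;            /* next vertex that still has to be read */
} FakeDriver;

/* Read up to `count` further vertices from the client's memory. */
void driver_work(FakeDriver *d, size_t count)
{
    assert(d->client != NULL);
    for (; count > 0 && d->next < NUM_VERTS; count--, d->next++)
        d->seen[d->next] = d->client[d->next];
}

/* Accept a draw call. A synchronous driver reads everything before
   returning; a deferring one only remembers the pointer. */
void driver_submit(FakeDriver *d, const float *verts, int synchronous)
{
    d->client = verts;
    d->next = 0;
    if (synchronous)
        driver_work(d, NUM_VERTS);
}

/* Application side. Returns how many vertices the driver read from
   model B although model A was the one being drawn. */
int stale_reads(int synchronous)
{
    float buffer[NUM_VERTS];
    FakeDriver drv;
    int i, wrong = 0;

    for (i = 0; i < NUM_VERTS; i++) buffer[i] = 1.0f;  /* model A */
    driver_submit(&drv, buffer, synchronous);

    driver_work(&drv, NUM_VERTS / 2);  /* background work gets halfway */
    for (i = 0; i < NUM_VERTS; i++) buffer[i] = 2.0f;  /* model B, same address */
    driver_work(&drv, NUM_VERTS);      /* background work finishes */

    for (i = 0; i < NUM_VERTS; i++)
        if (drv.seen[i] != 1.0f) wrong++;
    return wrong;
}
```

With a synchronous driver, stale_reads(1) returns 0. With the deferring one, stale_reads(0) returns 4: half of the vertices come from the wrong model, which is the kind of mixture I saw on screen.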

This explanation is plausible, because it explains all of the strange behavior I saw:

  • Both computers that showed the problem were equipped with dual-core CPUs of different manufacturers.
  • My own development computer did not show the problem, even though it had an nVidia GPU and a (most probably) vulnerable driver. The reason: It only has a single-core CPU, so the OpenGL ICD runs in-thread.
  • Another dual-core computer with ATI graphics did not show the problem.
  • Execution via a network share did not show the problem in most cases, because the OpenGL driver was finished processing the vertex data before the network handshake finished.
  • The second execution mostly failed because the data was loaded much quicker due to caching.
  • Additional printf()s caused the application thread to run slower, so the driver had more time to finish its processing – hence, the problems vanished when debugging output was verbose enough.
  • Not freeing the allocated memory solves the problem because the new vertex data will always be loaded into another, unique memory location. No vertex data will be overwritten.
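
The last point can be sketched in a few lines of C. This is not the loader from my demo, just a minimal illustration under the assumption that every geometry block is kept alive until shutdown; load_geometry_retained() and release_all_geometry() are hypothetical names, and a memcpy() stands in for reading the file.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical bookkeeping: remembers every geometry block so that it
   is released once at shutdown instead of right after use. */
typedef struct Retained {
    void *block;
    struct Retained *next;
} Retained;

static Retained *retained_head = NULL;

/* Copy `size` bytes of model data into a fresh block and keep it alive.
   Returns the block, or NULL if allocation failed. */
void *load_geometry_retained(const void *file_data, size_t size)
{
    Retained *node;
    void *block;

    assert(file_data != NULL && size > 0);
    node = malloc(sizeof *node);
    block = malloc(size);
    if (node == NULL || block == NULL) {
        free(node);
        free(block);
        return NULL;
    }
    memcpy(block, file_data, size);  /* stands in for reading the file */
    node->block = block;
    node->next = retained_head;
    retained_head = node;
    /* The display list would be compiled from `block` here;
       the block is deliberately NOT freed afterwards. */
    return block;
}

/* Free everything at shutdown; returns the number of blocks released. */
size_t release_all_geometry(void)
{
    size_t released = 0;
    while (retained_head != NULL) {
        Retained *node = retained_head;
        retained_head = node->next;
        free(node->block);
        free(node);
        released++;
    }
    return released;
}
```

Because all blocks are alive at the same time, no two models can ever share an address, so a driver that is still reading an old block never sees new data in it. The price is memory that stays allocated for the whole run.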

Confirmation – Help needed!

To verify my assumptions, I wrote a minimal application that exhibits the bug. It renders a hard-coded cube model (corner coordinates [-1, -1, -1] and [1, 1, 1], colors map to [0, 0, 0]…[1, 1, 1] along the axes) into a display list via glInterleavedArrays() / glDrawElements(). After that, it modifies the model data in-place by inverting all the vertex coordinates (both position and color) and fixing the vertex indices so that the resulting model will be visually identical to the original. This model will be compiled into another display list. The contents of both lists are then shown next to each other. Normally, both cubes should look exactly the same. However, on nVidia chips and multi-core CPUs, they don’t – the first model will be broken in some way because the modifications take place during the rendering.
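
The in-place inversion is easier to follow in code. The following is a sketch without any OpenGL calls and not a copy of nvbug.c: I assume here that corner i of the cube is encoded by its three low bits, that the vertex layout matches GL_C3F_V3F (color first, then position), and that the cube is drawn as six quads. All function names are made up for this example.

```c
#include <assert.h>

enum { NUM_CORNERS = 8, NUM_INDICES = 24 };  /* 6 quads, 4 indices each */

/* Assumed interleaved layout, matching GL_C3F_V3F. */
typedef struct { float r, g, b, x, y, z; } Vertex;

/* Corner i has x, y, z = -1 or +1 according to bits 0, 1, 2 of i;
   the color maps [-1, 1] to [0, 1] along each axis. */
void build_cube(Vertex *v, unsigned char *idx)
{
    static const unsigned char quads[NUM_INDICES] = {
        0, 2, 6, 4,   1, 3, 7, 5,   /* x = -1, x = +1 */
        0, 1, 5, 4,   2, 3, 7, 6,   /* y = -1, y = +1 */
        0, 1, 3, 2,   4, 5, 7, 6    /* z = -1, z = +1 */
    };
    int i;
    for (i = 0; i < NUM_CORNERS; i++) {
        v[i].x = (i & 1) ? 1.0f : -1.0f;
        v[i].y = (i & 2) ? 1.0f : -1.0f;
        v[i].z = (i & 4) ? 1.0f : -1.0f;
        v[i].r = (v[i].x + 1.0f) * 0.5f;
        v[i].g = (v[i].y + 1.0f) * 0.5f;
        v[i].b = (v[i].z + 1.0f) * 0.5f;
    }
    for (i = 0; i < NUM_INDICES; i++) idx[i] = quads[i];
}

/* Negate positions and invert colors: slot i now holds the opposite
   corner, 7 - i. Remapping every index the same way undoes the change. */
void invert_in_place(Vertex *v, unsigned char *idx)
{
    int i;
    for (i = 0; i < NUM_CORNERS; i++) {
        v[i].x = -v[i].x;  v[i].y = -v[i].y;  v[i].z = -v[i].z;
        v[i].r = 1.0f - v[i].r;
        v[i].g = 1.0f - v[i].g;
        v[i].b = 1.0f - v[i].b;
    }
    for (i = 0; i < NUM_INDICES; i++)
        idx[i] = (unsigned char)(7 - idx[i]);
}

/* Resolve the indices into the vertex stream that would be drawn. */
void expand(const Vertex *v, const unsigned char *idx, Vertex *out)
{
    int i;
    for (i = 0; i < NUM_INDICES; i++) {
        assert(idx[i] < NUM_CORNERS);
        out[i] = v[idx[i]];
    }
}

/* Returns 1 if the drawn vertex stream is the same before and after. */
int inversion_preserves_cube(void)
{
    Vertex v[NUM_CORNERS], before[NUM_INDICES], after[NUM_INDICES];
    unsigned char idx[NUM_INDICES];
    int i;

    build_cube(v, idx);
    expand(v, idx, before);
    invert_in_place(v, idx);
    expand(v, idx, after);
    for (i = 0; i < NUM_INDICES; i++)
        if (before[i].x != after[i].x || before[i].y != after[i].y ||
            before[i].z != after[i].z || before[i].r != after[i].r ||
            before[i].g != after[i].g || before[i].b != after[i].b)
            return 0;
    return 1;
}
```

Every byte of the vertex array and the index array changes, yet the stream of vertices that gets drawn stays the same. That is what makes the test useful: if the two cubes differ on screen, the first list must have picked up data that was written after its glDrawElements() call.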

The source code can be found here:
      nvbug.c
And the executable here:
      nvbug.exe
The program is available for Windows only because I already confirmed that the Linux drivers are not affected.
WARNING: This program will crash some driver versions with a bluescreen. So be warned and save your work before executing the program.

I’d love to hear from readers who tried this simple test program:

  • the result
    • Do the cubes look identical? (i.e. it works)
    • Do they differ? (i.e. bug observed)
    • Does the system crash? (I hope not :)
  • the name of your graphics chip (e.g. »GeForce 8600M GT«)
  • your operating system version (e.g. »Windows XP SP2«)
  • your graphics driver version (e.g. »158.22«)
  • your CPU type (e.g. »Intel Core 2 Duo T7300 @2.0 GHz«)

Or alternatively, if you find a bug in my code or think my assumption of the atomicity of glDrawElements() is not correct, don’t hesitate to tell me that, too.
