Tale of a computer bug

I finally fixed a bug that has been pestering me for literally months. It is bug 117541. The story of how I finally tracked down the cause may be interesting, and should certainly be a cautionary tale for C++ programmers. Allow me to relate the story. (Less technically inclined readers: I will try to make it understandable, but this is a bit dry)

I first noticed the bug when I upgraded some packages on my system around May 2005. The upgrades included an upgrade to glibc (the Gnu C library, one of the core pieces the system is built around). The upgrades included a handy feature to detect programs which were using memory incorrectly, and to abort them pre-emptively (since it could be an attack that causes the memory corruption). The problem was that when I ran JuK from Konsole (where you can see the debugging output), it would always display a glibc crash message when I shutdown JuK. Actually, JuK would always crash, but normally you wouldn’t see the message, and the nice little “Your program crashed” window would never show up for some reason, so I didn’t notice before.

I poked around for a bit, but I couldn’t discover the cause. This was when I was working on the improved cover art support for JuK, and one thing that was necessary was to save covers on shutdown. This crash was preventing that from happening, so I forced the covers to save before the shutdown crash seemed to happen since I couldn’t seem to find the bug.

Since the bug wasn’t exactly serious, and since it seemed to be only affecting me (I was the only one who seemed to get the crash), I just let it go. I didn’t even report it in the KDE Bug database.

A few days ago however, I got a report from a couple of different sources that they were noticing crashes on shutdown. And with the same error message I got! Now that I knew I wasn’t the only one getting the bug, I decided to try again.

First, I realized I needed to bring out the big guns for debugging. I installed the latest valgrind (and disabled the exec-stack protection which somehow found its way on my system :). Now, valgrind is nice because it is a tool that can track each and every single bit of memory your application uses, and tell you exactly how it is incorrectly used. Better yet, it will tell you from what part of the code the error occurred.

I reproduced the bug under valgrind, and received the following output:

JuK valgrind log

So what was happening is that the PlaylistItem::Pointer class, which guards the PlaylistItem, was trying to write to memory it no longer had access to. And then, even worse, JuK was trying to free memory that had already been freed.

But looking at the PlaylistItem::Pointer code, it looked to be perfect. There were certainly no obvious bugs where valgrind was pointing.

So what I did was start to add a debugging harness around PlaylistItem::Pointer. I even went so far as to add a function to JuK’s DCOP interface to allow me to print out the status of the PlaylistItem::Pointer internal data structures at any time.

This helped a little bit. I discovered that there were very many PlaylistItems that were being pointed to, but didn’t know that they were being pointed to.

The “obvious” thing to do from there is to figure out what was causing the PlaylistItem to reset and think it wasn’t being guarded. But there were only a few lines in all of JuK that could cause such a reset, and they all looked very innocuous. There was PlaylistItem::Pointer, which as I mentioned was damn near perfect. The only other option was the PlaylistItem construction code. But that code is only run once, and runs before the PlaylistItem::Pointer starts to point at it! There should have been no way to get the results I was getting.

I was stumped, even though I could have figured it out here. What I ended up doing was adding even more debugging. I would add calls to every function in PlaylistItem::Pointer to make sure the data was still consistent. When the data became inconsistent, it would force a crash. Eventually I was able to get JuK to crash in the very function call that made the data inconsistent.

What I discovered was surprising: The PlaylistItem constructor really was being run after its PlaylistItem::Pointer was made. Huh?

It made sense when I saw the backtrace of calls leading to the failure though. This had occurred while the PlaylistItem was being created. What happens is that when a PlaylistItem is created, it first has to call some code that it inherits from Qt and KDE. This happens before the PlaylistItem initializes its own data.

The Qt code (in QListViewItem) takes care of adding the newly created item into its parent list on-screen, using a function called insertItem() in the parent. JuK had its own implementation of that function, which was used to wrap the new item in a PlaylistItem::Pointer and then add it to a list of items to process when the results were needed. This way items could be added quickly, even if the user deleted the item in the meantime before the result was needed. But, this was happening from the Qt code. The PlaylistItem constructor hadn’t been run yet.

So what happened is that the new PlaylistItem::Pointer correctly set a flag in the new item to indicate that it was being guarded. Then, once insertItem() and the rest of Qt were done, the PlaylistItem constructor finally ran, and accidentally reset that flag to indicate no guards. Eventually the PlaylistItem was deleted, and since it had no guards, it didn’t indicate to the PlaylistItem::Pointer mechanism that it needed to be cleaned up.

On shutdown, JuK’s remaining PlaylistItem::Pointer objects would try to delete the items they were pointing to, but they had already been deleted without being cleaned up. This caused the two error messages detected in valgrind. This same error was detected by glibc, which forced the crash.

The code had to defeat a C++ mechanism to stop this kind of error, by forcing the code to pretend that the QListViewItem was a PlaylistItem using static_cast. If we had used dynamic_cast instead, which checks if the conversion is valid during program execution, we would have caught the error much sooner.

So the moral of the story is, if a QListView method you’re overridding says that it uses a QListViewItem*, that is possibly true even if you never create a QListViewItem* directly! Be careful, and use dynamic_cast when possible.

« It’s the little thingsComprehensive Exam »