Debugging Stories : The Keypad
This keypad.
I’ve looked at this keypad for a week. It’s the first keypad you encounter in Deus Ex : Mankind Divided. The reason I’ve looked at this keypad for a week is because interacting with it was causing the Xbox One to reboot.
The first keypad of the game would reboot one of our target platform. Yikes.
Fortunately, I was able to reproduce the bug fairly easily. I attached the VS debugger to the process, just to see if I could catch anything obvious. Maybe a log, or an exception? Hmm. Nope. Nothing.
I searched the documentation and talked with QA to find out what might be the cause of a reboot. I found that, usually, the console reboot when the GPU hangs and that it should leave a crash dump on disk before rebooting. Great, so I looked where the crash dump should be and… nothing.
I digged a bit more and found out that I could debug at the kernel level instead of the process level. Maybe I could catch something there? Fired up WinDbg - a tool with a peculiar UX to say the least - attached to the console, reproduced the crash and… still nothing.
The Sherlock Holmes feeling I had at the beginning of this investigation was starting to dissipate and be replaced by fear. We were doing something very wrong, but there were no trace of what was happening except, huh, our entire code base?
One morning, I noticed that, when reproducing the bug, the game would freeze once, resume for 2-3 frames, freezed again and then the console would reboot. So I thought, maybe I could just pause the execution with the VS debugger when it freezes the first time? And so I did.
In front of me was The Callstack.
Somewhere, deep in Scaleform where the glyph atlas was updated, was a innocent looking wait on a GPU fence. But the fence never came.
This was a CPU/GPU deadlock.
But why?
First, if there was too much text on screen of varying sizes and fonts, the text atlas would fill up. This was the case for this first keypad, because we had a lot of tutorial prompts and stuff. Once full, the CPU would wait for the GPU to be done with it, replace parts of the atlas with some new glyphs, then continue with what it needed to render. The first problem here is that it was waiting for a moment in the current, on-going frame to finish - the one we are building right now.
But it didn’t just simply wait. It actively pushed dummy commands to the graphics command buffer to force D3D (this was D3D11) to perform an implicit command flush to the GPU, forcing the GPU to start processing the previously submitted commands that used the atlas.
As terrible as it sounds, it should’ve worked. Only, Scaleform was not alone in this - we were also using the GPU to, you know, render a game.
As I was mostly responsible for the Scaleform integration at the time, I had to ask my fellow rendering programmers for a bit of help to understand what was happening. We deduced that the graphic command processor on the GPU was probably waiting for a result of another command processor, either move or compute, before the fence. Because we generally didn’t fill the command buffers of those processors up to the limit where D3D would perform an implicit flush, they would only get flushed to the GPU at the end of the frame, just before the Present call.
So now we have the complete picture. While waiting for its fence, the CPU sends dummy graphics commands to the GPU, forcing an implicit flush of all previous graphics commands. The graphic command processor starts to execute some of those, reaches the point of its dependency with the (probably) move command processor and waits. However, the move command buffer is never flushed to the GPU because the CPU is busy waiting on that fence, never reaching the explicit flush call at the end of the frame. And so the GPU and CPU are waiting for each other, forever. Duh.
The console notice this lack of progress and does the only thing it can - reboot.
The bug was fixed by increasing the size of the glyph atlas enough so we never reach a point where we have to wait on that fence.
Oh, and that GPU hang crash dump wasn’t there because it had to be handled by the CPU…