Racing Object Construction: a debugging tale

Like so many great debugging sessions, this one started with a classic line that I’ve heard many times before, and is almost always wrong (unless it’s coming from Kenny): “Hey, it looks like there’s a code generation issue here. Wanna take a look?” Code generation issues are not unheard of, but they’re very uncommon. Nonetheless, there was a crash, and a memory dump file to look at. I’ve obfuscated a few file & type names to protect the guilty, but in all other respects this represents a real crash.

Aside: Post-mortem debugging is a special genre of debugging where, instead of stepping through a program and watching things change, you have a only snapshot of a program that represents the program state at some moment in time, usually when something bad happened. In this case, the program crashed due to an access violation trying to read from address 0.

I debug these in windbg. For Windows, there’s no better tool for poking around the state of a dump file. Visual Studio is a great debugger with some wonderful features, but too many of them rely on being able to run the application, look at symbols, etc. Windbg is better for looking at what’s happening “on the metal” with or without symbols or code.

So, I opened up the crash dump. The application failed due to an access violation. We can start by looking at the register state in the dump file to see the instruction and data that the program crashed on. For the record, in this case we’re looking at the amd64 / x64 instruction set &  machine architecture.

0:040> .ecxr
rax=0000000000000000 rbx=00000158897f70a0 rcx=000001588937aba0
rdx=00007ffb2776ebf8 rsi=0000015889704558 rdi=0000015889704530
…
somelib!ControlBlock::Resolve+0x26:
00007ffb`4f278fe6 488b00 mov rax,qword ptr [rax] ds:00000000`00000000=?

The RAX register value is 0. That’s certainly bad if you’re trying to dereference it. In most modern operating systems, the first page of memory is reserved but not committed to force a crash if code attempts to dereference it. But is it a codegen issue? Well, we need to understand what the code around this is doing to see whether this instruction is correct. In windbg, we can do this by disassembling the method up to the point of failure, as well as a few instructions past the failure.

0:040> u 00007ffb`4f278fe6-26 00007ffb`4f278fe6 
somelib!...:
00007ffb`4f278fc0 4053 push rbx
...
00007ffb`4f278fda 488b4b10 mov rcx,qword ptr [rbx+10h]
00007ffb`4f278fde 48897c2430 mov qword ptr [rsp+30h],rdi
00007ffb`4f278fe3 488b01 mov rax,qword ptr [rcx]
00007ffb`4f278fe6 488b00 mov rax,qword ptr [rax]

0:040> u
somelib!...
00007ffb`4f278fe9 ff1581000300    call    qword ptr [__guard_dispatch_icall_fptr...]
00007ffb`4f278fef 488b4b10        mov     rcx,qword ptr [rbx+10h]
00007ffb`4f278ff3 8bf8            mov     edi,eax

Without even looking at the code, I can tell a few things from the debugger. First, these two lines are a double-dereference. It’s followed by a __guard_dispatch_icall_fptr call. That last call tells me that I’m looking up a method address. The double-indirect is characteristic of a virtual method table dispatch.

The code happens to be C++ code, which matches what we might expect from the dissassembly.

hr = target->QueryInterface(
    riid, reinterpret_cast<void**>(objectReference));

The guard call is a call to a control flow guard (CFG) check function, which validates that the target of the virtual call is a valid code page. The CFG function expects rax to contain the address of the function to be called. In short, this looks like typical code for loading and invoking a virtual method. This probably isn’t a code generation issue. But what is it?

rax was nullptr. We were attempting to read a method address that we could invoke from the v-table address held in rax, and store the function back in rax. It just happens to be the case that we were looking at the first method in the v-table. Otherwise there would have been an offset added to rax before dereferencing. The value in rax was read from rcx. Presumably, rcx should have been a pointer to a valid C++ object, if everything were working as expected. So let’s see what’s hanging out at rcx.

0:040> dqs rcx
00000158`8937aba0  00007ffb`277f85d0 mydll!ViewModel::`vftable'
00000158`8937aba8  00007ffb`277f8590 mydll!ViewModel::`vftable'
00000158`8937abb0  00007ffb`277f84a8 mydll!ViewModel::`vftable'
00000158`8937abb8  00007ffb`277f8478 mydll!ViewModel::`vftable'

Hmm. The machine read rcx and store zero into rax, but the dump file says the value in that memory location is 00007ffb`277f85d0, and the debugger tells us that points to a valid v-table. Something’s fishy. So what’s up?

There are a few thoughts that go through my head in this situation:

First off, depending on the process dump, the debugger sometimes fills in data from cached DLL and EXE files rather than reading it from the dump. This can lead to mismatches if the code had been overwritten. That doesn’t apply here, though, since rcx points to a heap address. The only place that value could have come from is the dump file itself. When the snapshot was taken, that address had the non-null value.

Which leads to the second thought: this is likely a race. Dump file process snapshots represent the process state at the time the dump was taken. There are two ways it can get out of sync with what a thread saw at the time that the thread triggered the exception. First, other threads in the process continue to run for a fraction of a second while the exception machinery determines that the exception isn’t handled and a crash dump needs to be taken. The failing thread won’t change any state, but other threads can. Second, even if another thread never got paged in during that time, the dump machinery captures the flushed process state. If there is a race involving a missing memory barrier, the thread may have seen a stale cache line state that is not reflected in the dump file.

There are a couple of ways that a v-table could go from null to non-null, and point to something that looks correct in context. This is either first-time construction, or memory re-use, where another object happened to have a reasonable value, was released, zero’d out, and reallocated. It could even be the same type of object, if that type gets allocated and released frequently.

In this case, now that I’ve found the v-table, I can reason a bit more about the usage patterns. This is basically created once on startup. It’s a view model for an MVVM application and the view is basically created once. So this isn’t memory re-use. That means we need to understand how an attempt was made to call a virtual method prior to construction. What I’m looking for is a

class ViewModel : public BaseViewModel {...}

BaseViewModel::BaseViewModel(
    Controller^ controller,
    View^ view)
{
    ...
    // Listen for presentation ready change events
    m_token =
        someView->ViewChanged +=
            ref new TypedEventHandler<IView^, Object^>(
                this, &MyViewModel::OnViewChanged);

I’ve abbreviated a bit from the original code, but sure enough, we can see that we’re registering for an event inside the constructor. What’s more, reading a bit more source I found that this event can fire asynchronously at any time after registering.

The details get a little gnarly from here, and could warrant there own entire post, but basically C++ /CX objects (like most winrt objects in any language) support weak references via a separate control block. TypedEventHandler takes a weak reference to ‘this’ and the resulting weak reference, implemented on the control block object, points to the outer type, MyView, not BaseMyView.

To understand what actually happens here, we need to understand a bit about C++ object construction. Constructors in C++ run a few things during construction in a specific order:

  1. First, base classes are constructed transitively, each using the order described here.
  2. Next the v-tables for the current class are assigned
  3. Member object constructors run
  4. The current class constructor body is run.

You can see this in action in this simple program:

struct Member {
    Member() { cout << "member constructor\n"; }
};

struct Base {
    Base() { 
        cout << "base constructor\n"; 
        work();
    }
    virtual void work() { cout << "base work\n"; }
};

struct Derived : public Base {
    Derived() { 
        cout << "derived constructor\n"; 
        work();
    }
    virtual void work() override { cout << "derived work\n"; }
    Member m;
};

int main() {
    Derived d;
    return 0;
}

Output:
base constructor
base work
member constructor
derived constructor
derived work

Check line 2 of the output. Even though we’re constructing a ‘Derived’, the constructor for Base called Base::work(), not Derived::work(). v-tables pointers on an object are set as the constructors run, and can be uninitialized or overwritten as outer objects override the behavior of inner objects.

The behavior we see here in C++ is one choice. C# fills in the correct v-table before running any constructors at all. While that might sound better at a glance, in practice, it’s just a different choice with different implications. If the code above were converted to C#, it means the Derived class’s work() method would be called before its members were initialized.

Coming back to our original problem, the weak reference implementation in C++ /CX uses a placement new model. The control block used for reference counting and weak references allocates memory for the underlying type, then runs the constructor over that memory. The weak reference (basically a pointer to the control block) was handed out during the construction of the base type. When the event fired on another thread before the construction was complete and observable in cache, the control block attempted to call a method on the derived type, but the v-table hadn’t been set yet. Kaboom.

So what’s the takeaway?

In this case, the object enabled a callback against itself before construction was done. It was asynchronous here, but the same can occur with a synchronous callback registered against & triggered during construction. Although this case led to a crash, other variations of this type of coding pattern can lead to obscure behavior bugs as well.

Anytime you see a reference to self being handed out in a constructor in any programming language, be wary.

Using ‘this’ in the constructor may be correct, but only if you can reason thoroughly about the execution order of the class, derived classes, and the recipient. Specific ordering behaviors durin construction are language dependent, and if you’re relying on them, ask yourself if you’re being too clever for your own good.

2 thoughts on “Racing Object Construction: a debugging tale

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s