Last week, Rick Ballard came by my office for a consult. He had caught Xcode at a crash in
objc_msgSend(). The crash looked like an intermittent problem that had been plaguing Xcode for months. So he called the local expert on debugging
objc_msgSend(). Dr. Gregory Parker, Department of Diagnostic Engineering.
The good news was that Rick's crash was reliably reproducible. Running tests on a live patient is better than performing an autopsy on a dead one. The bad news was that the obvious debugging tools had not helped.
guardmalloc had turned up nothing, and
AUTO_USE_GUARDS=YES (the GC equivalent of
guardmalloc) just thrashed the machine for two hours before running out of address space.
So you crashed in
objc_msgSend(). The selector was
-isAbsolutePath, which was reasonable but meant the debugger's backtrace was missing a frame.
objc_msgSend() had read the class from the object, read the method cache from the class, read a method from the method cache, and crashed while trying to read the
IMP from the method. Theory: either one of those data structures had been hit by a memory smasher, or the original object was bogus but happened to have dereferenceable pointers in the right places to survive that long. The method cache's mask was invalid - it should have been of the form 2^n-1 - so the failure must have been at or before that point in the chain.
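The mask check is a one-liner worth keeping around. The following is a minimal sketch in C, not code from the Objective-C runtime; the function name `mask_is_plausible` is made up for illustration. It relies only on the fact stated above, that a valid mask has the form 2^n-1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: a value of the form 2^n - 1 (0, 1, 3, 7, 15, ...)
   has all of its set bits packed at the bottom, so adding 1 produces a
   power of two (or wraps to 0) that shares no bits with the original. */
bool mask_is_plausible(uint32_t mask)
{
    return (mask & (mask + 1u)) == 0;
}
```

A mask that fails this test means the cache structure, or the pointer that led to it, is already garbage.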
The object pointer itself looked plausible. Theory: the object was valid, but a previous object at the same location had been used after being freed. We had the great luxury of a reproducible crash, so we turned on
MallocStackLoggingNoCompact and ran it again. That memory had only been used for one object, and it had not been deallocated. So the evidence did not support the use-after-free theory. But the history showed that the object had been allocated as an
NSPathStore2 - an internal subclass of
NSString for file pathnames - which matched the selector
-isAbsolutePath and matched the call site's expectations. The theory that the object pointer was valid looked good.
The object pointer was good, and the method cache was not: the failure was on the chain between them. The contents of the object looked good. The bytes looked like alternating zero and ASCII, which is a dead giveaway for the UTF-16 used inside
NSString. The string value decoded as
@"/Xcode4/usr/bin/llvm-gcc", which made sense in the call site's context.
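The "alternating zero and ASCII" pattern can be decoded by hand, or with a few lines of C. This is only a sketch: it assumes little-endian UTF-16 code units that are all ASCII, it says nothing about the real layout of an NSPathStore2 instance, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode ASCII-only UTF-16LE bytes into a C string.
   Returns the number of characters written, or -1 if the input has an odd
   length, contains a non-ASCII code unit, or does not fit in `out`. */
int utf16le_ascii_decode(const uint8_t *bytes, size_t nbytes,
                         char *out, size_t outsize)
{
    assert(bytes != NULL && out != NULL);
    size_t n = nbytes / 2;
    if (nbytes % 2 != 0 || outsize < n + 1) return -1;

    for (size_t i = 0; i < n; i++) {
        uint8_t lo = bytes[2 * i];      /* ASCII character */
        uint8_t hi = bytes[2 * i + 1];  /* must be zero for ASCII */
        if (hi != 0 || lo > 0x7f) return -1;
        out[i] = (char)lo;
    }
    out[n] = '\0';
    return (int)n;
}
```

Feeding it the bytes `2f 00 58 00 63 00 ...` from a memory dump yields a string beginning with `/Xc`.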
The isa pointer was not so good. Its value was
0xa0050000. This was not class
NSPathStore2 or any other class.
vmmap showed it to be in Foundation's data segment, and
otool showed it was specifically in Foundation's constant CF strings. But instead of pointing to the start of some string, it pointed to the middle of a string object. That string object was
@"tzm-Latn": some localization thingy, perhaps? Theory: some bug had replaced this object's isa pointer with a pointer to the middle of an unrelated localization string object. This did not sound like a good theory.
Go back to the board. Symptom: the object was allocated as an
NSPathStore2. Symptom: the object's
isa pointer is now
0xa0050000, which is not
NSPathStore2. What should the
isa pointer's value have been?
objc_getClass() agreed: the correct
isa pointer should have been
0xa005f198.
0xa0050000 is suspiciously similar. Theory: something had cleared two bytes of this object, leaving a nonsense
isa pointer.
@"tzm-Latn" was a red herring.
Aha! This is 32-bit i386. Little endian. The pointer
0xa005f198 is stored backwards in memory:
0x98 0xf1 0x05 0xa0. Clearing the least-significant bytes of the
isa pointer meant clearing bytes 0 and 1 of the object, not bytes 2 and 3. Damage to bytes 0 and 1 is exactly what you'd expect from a two-byte overrun of the object preceding this one in memory. Theory: the bug was in that preceding object, and this
NSPathStore2 object was an innocent victim.
malloc_history works with a pointer to the middle of an allocation, too. We plugged in
object-1 and got back an instance of
DVTSourceModelItem, not deallocated. Rick recognized this as part of Xcode's indexer, which was always running in another thread at the time of the crash. A buffer overrun from the
DVTSourceModelItem object fit the symptoms.
But where was the buffer? I had expected an overrun in some heap-allocated C array, not an ordinary object. Nor did
DVTSourceModelItem have any C arrays in its instance variables.
Theory: the compiler or runtime had allocated too little memory for the instance of class
DVTSourceModelItem, and ordinary ivar access had overrun that allocation. It was a long shot, but easy to test.
class_getInstanceSize() and an eyeball count of ivars all agreed that the object was 32 bytes. Theory disproved.
We tested the overrun theory again. Add an unused ivar to the end of
DVTSourceModelItem, recompile, and run it. No crash. Remove the ivar. Crash. The extra ivar "fixed" the bug. The buffer overrun theory still fit the evidence, but we couldn't find it.
No more ideas. We needed data. Debugger watchpoints were out: there were thousands of instances of
DVTSourceModelItem, and we couldn't watch two bytes after each of them. We were not yet desperate enough to try brute force code inspection.
AUTO_USE_GUARDS=YES could catch it, if it didn't fall over first. Since we had a suspect in mind, we could play the
guardmalloc trick ourselves with a narrower target. Override the allocation of
DVTSourceModelItem instances,
mprotect() the page after the allocation, and cross our fingers really hard hoping that it still reproduced after changing the timing so much.
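For readers who have not played this trick before, the general shape is sketched below in plain C. This is not the code we used in Xcode. The function name `guarded_alloc` is hypothetical, the block is never freed, and the returned pointer is only as aligned as `size` happens to make it (fine for a 32-byte object, not in general).

```c
#define _DEFAULT_SOURCE /* glibc: expose MAP_ANON under strict -std modes */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: return `size` bytes whose last byte sits directly
   before an inaccessible guard page, so that any access past the end
   faults immediately. Returns NULL on failure. */
void *guarded_alloc(size_t size)
{
    assert(size > 0);
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t data_pages = (size + page - 1) / page;
    size_t total = (data_pages + 1) * page;   /* data pages + one guard */

    uint8_t *base = mmap(NULL, total, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANON, -1, 0);
    if (base == MAP_FAILED) return NULL;

    uint8_t *guard = base + data_pages * page;
    if (mprotect(guard, page, PROT_NONE) != 0) {
        munmap(base, total);
        return NULL;
    }
    return guard - size;   /* object ends exactly where the guard begins */
}
```

Spending at least two pages per object is why this only works for a narrow target. Applied to every allocation in the process, it is the same address-space appetite that sank AUTO_USE_GUARDS=YES.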
Bang! It crashed (good) somewhere new (also good).
-init was writing to one of its own instance variables. The ivar was a bit in a bitfield, and that bitfield was at the end of the ivar list.
Disassemble. The generated code read 4 bytes around the bit into a register, changed the bit in that register, and wrote the 4 bytes back to memory. That's typical for a bitfield. The unexpected part was that the 4 bytes spanned the last two bytes of the object and the first two bytes after the object. That's a bug. Most of the time the out-of-bounds access is harmless - it reads two bytes it shouldn't, and writes back the same value. But if another thread writes to those two bytes in between, that write is lost and the program can crash:
| Thread 1 | Thread 2 |
|---|---|
| reads four bytes, including two bytes outside the object | |
| | allocates a new object |
| | writes an isa pointer |
| writes four bytes, clobbering the new value written by Thread 2 | |
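The interleaving can be replayed deterministically without any threads. The C sketch below is a simulation, not the code clang generated. The flat `heap` buffer, the bit chosen, and the function name are all assumptions made for illustration. Only the 32-byte object size and the two-in, two-out span of the 4-byte access come from the investigation.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { OBJ_SIZE = 32, RMW_OFFSET = OBJ_SIZE - 2 };

/* Hypothetical replay: bytes [0,32) stand for the object that owns the
   bitfield, bytes [32,36) for the isa of the next allocation. Returns
   the neighbor's isa as it reads after the race. */
uint32_t replay_race(uint32_t neighbor_isa)
{
    uint8_t heap[OBJ_SIZE + 4] = {0};
    uint8_t reg[4];

    /* Thread 1: read 4 bytes, two inside the object and two past it. */
    memcpy(reg, heap + RMW_OFFSET, sizeof reg);
    reg[0] |= 0x01;   /* set one bitfield bit in the register copy */

    /* Thread 2: allocate the neighbor and write its isa (little-endian). */
    for (int i = 0; i < 4; i++)
        heap[OBJ_SIZE + i] = (uint8_t)(neighbor_isa >> (8 * i));

    /* Thread 1: write all 4 bytes back, including the two stale ones. */
    memcpy(heap + RMW_OFFSET, reg, sizeof reg);
    assert(heap[RMW_OFFSET] == 0x01);   /* the intended bit did get set */

    uint32_t isa = 0;
    for (int i = 0; i < 4; i++)
        isa |= (uint32_t)heap[OBJ_SIZE + i] << (8 * i);
    return isa;
}
```

With `0xa005f198` as the neighbor's isa, the replay returns `0xa0050000`: the stale zero bytes that Thread 1 read before the neighbor existed overwrite the low half of the pointer.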
Theory: a compiler bug generated bad code for
DVTSourceModelItem's bitfield ivar, causing a read-modify-write out of bounds by two bytes, which corrupted memory in other threads. Test: try a different compiler.
DVTSourceModelItem.m was built with
clang, so we recompiled with
llvm-gcc. No crash, and the disassembly looked correct. Compile with
clang again, crash again.
Diagnosis: a clang compiler bug in bitfield ivars. The patient's symptoms were treated with an extra ivar in
DVTSourceModelItem until a compiler transplant could be performed.
Elapsed time: about three hours. Too long for an episode of a TV procedural drama, unfortunately.