Looks like I need to take some more of my own medicine! Here’s code I worked on over a year ago, and it took a year for the data race to finally come home to roost. The following API is fundamentally flawed; no implementation of it can dodge the data race. Here’s the deal: I’ve got a field which from time-to-time is reset to NULL, generally as part of some overall timeout-based cache-flushing policy. It’s also in a time-critical function so I’m using lock-free accessing code with lots of comments on proper usage, Big Flashy Warnings, etc. (The code to set the field uses locks, and isn’t shown here)
private volatile Code _code; // Can be reset to NULL by cache-flushing thread at any time
And the gratis accessor function:
public Code code() { return _code; }
And a some convenience functions to test various conditions. Here’s one:
public boolean is_c2_code() { return code() != null && code().is_c2_method(); }
And some code using this API:
int nx = next_m.is_c2_code() ? ...next_m.code()... : ...blah...;
Ok, everybody can spot the implementation race-condition – especially after the fact! It’s easy in isolation although somewhat harder in a 0.5MLOC program. The issue is in the implementation of “is_c2_code()” – I’ve called “code()” twice. And very very rarely it changes between the loads of the underlying _code field flipping from not-null for the “code() != null” test and then to null for the “code().is_c2_method()” call. Boom, instant NullPointerException. Grumble, mumble, I go back and fix “is_c2_code()” to only read _code once:
public boolean is_c2_code() { Code code = code(); // read _code ONCE return code != null && code.is_c2_method(); }
Crank up my test case, wait a few hours and … Lo! The bug is NOT fixed! Sure enough, under the right combination of pressure and temperature I blow up at the “…next_m.code()…” bit with another NPE. And then the flaw with the API hits me: if I call the “is_c2_code()” test, it tests one version of _code – and then when I call “code()” again in “…next_m.code()…” I get another load of _code which might hold a completely different value! In fact, I need to do an entire conceptual operation from a single load of _code. I can’t use any convenience functions unless they take the already-loaded _code field as an argument. And with that epiphany I finally got it right (well, I hope anyways):
private volatile Code _code; // Can be reset to NULL by cache-flushing thread at any time public Code code() { return _code; } // Convenience function: takes a pre-loaded Code value static boolean is_c2_code( Code code ) { return code != null && code.is_c2_method(); } And some code using this new API: Code code = next_m.code(); // Read it ONCE for the whole operation int nx = Code.is_c2_code(code) ? ...code... : ...blah...;
And that’s my data-race lesson of the week: API’s can have data races as well as plain old code.
And now my question for you, the gentle reader: how do I enforce this over time? In the actual code base there are about a dozen convenience functions, all now carefully labeled. All the use-points now do a single clearly labeled read, then pass the loaded value around in all the convenience functions. But in 6 months, when Junior Engineer comes along and looks at this class and says “I need a new function that tests _code and returns a blah”…. he likely makes another data-race API mistake. Is there something better I can do than just comment, comment, comment?
Thanks,
Cliff


