A short conversation on Biased Locking, between myself, Gil Tene (also of Azul Systems) and Martin Thompson.
On 11/13/2011 12:57 PM, Martin Thompson wrote:
I’m looking a problem where some third party code has more locks that necessary. I thought I could get around the issue if I made sure only one thread ever enters the lock and thus with Biased locking it would be very cheap. I noticed an interesting problem whereby with Hotspot Biased locking is slightly more expensive than normal locking when uncontended. which really threw me. Does this make any sense to you? I plan to blog about this.
On Mon, Nov 14, 2011 at 3:11 AM, Cliff Click <email@example.com> wrote:
I’d have to do a deep drill into what Sun is doing, and what the hardware is doing, and what you’re doing to nit-pick this apart. Our BiasedLocking is different, so results on Azul will vary. CAS costs vary from platform to platform, so the relative gain (or loss) of BiasedLocking varies by platform. Older X86′s or Sparc’s – CAS is expensive; BiasedLocking is a Win. New X86′s or Azul – CAS is cheap; BiasedLocking is closer to a wash.
On 11/13/2011 11:43 PM, Martin Thompson wrote:
For me the important point is the single threaded case. This is uncontended. If I was to write a biased locking algorithm I could do this without using a CAS for subsequent calls. Only another thread needs to CAS when beginning the revoke cycle if my understanding is correct? The point of biased locking is to avoid the atomic instructions, and therefore their cost. This benchmark is suggesting that one still bears the CAS cost even uncontended which seems like a bug to me based on the evidence.
On Mon, Nov 14, 2011 at 8:46 AM, Cliff Click <firstname.lastname@example.org> wrote:
Ahh… but the trick for nehalem & Azul vega chips is that *uncontended* CAS ops are cheap; 3 clks on Azul, 10-12 on nehalem1. Where-as, biased locking can require bookkeeping and the bookkeeping might exceed the cost of a CAS. All depends on how Sun implements it. We went a different route, so without some serious spelunking I can’t tell you what Sun is doing.
You can run Sun’s JVM in a debugger & in a hot loop – and hit ^C. You’ll almost surely be spinning in your hot-loop, and you can look at what Sun is doing from the debugger. Or at least that’s what I’d end up doing if you manage to convince me I need to know what Sun is up too.
I get the chance tomorrow, I’ll see how we fair vs the same rev of Sun JVM. We don’t have an exact hardware match, so maybe we’ll get some more data points.
I’ll have a play with this later. My tests have been on Penryn, Nehalem, Westmere and Sandybridge (tock-tick-tock-tick), so I cover everything that matters on x86
I do not understand the 3 clocks quote for CAS. In my experience a CAS, aka lock cmpxchg, is ~100 cycles plus the cost of draining the store buffers. BTW I was talking to Intel last week and they have released a new version of VTune with Java support again.
I’m sure you have done a much better job on the implementation. If this is the case it could be a good selling point in the concurrent world. The more I dig the less I am impressed with Sun/Oracle implementation as you get seriously concurrent or parallel. When I write control experiments in C/C++ for work I’m doing I keep being surprised by the results by comparison.
p.s. I take you are on European time given the responses?
On Mon, Nov 14, 2011 at 4:16 PM, Cliff Click <email@example.com> wrote:
Sorry, should have explained more – 3 clks for Azul *Vega* hardware…. for a cache-hitting CAS. Our CAS does not order memory or fence, just provides atomicity guarantees. Our Vega FENCE op can “hit in cache” – if there are no outstanding cache misses, then again it’s a 1-clk op. So we can do a cache-hitting uncontended lock acquire in something like 5 or 6 clks – with full CAS semantics, no BiasedLocking. So we ripped out Sun’s BiasedLocking implementation (it had other weaknesses and design choices for which we have better underlying software already, e.g. we can stop individual threads fairly cheaply). When we moved to X86 we re-inspected the situation and decided we needed a BiasedLocking impl of some kind, we just rolled our own. Then it turned out that the recent crop of X86′s have a cheap “cache hitting” CAS – if it hits in L2 I think it’s like a dozen clocks – *not* a 100. This might Yet Again spell the Doom of BiasedLocking… a software solution to a hardware problem (expensive CAS) to a software problem (frequent uncontended locks).
(re European time: ) Nope – just complicated life. Some days I got the kids and I’m busy – and this kind of email, while fun, it’s not really time critical to answer.
Seems you and Gil had a race condition
On 11/14/2011 8:09 AM, Gil Tene wrote:
On CAS costs:
While there is a common misconception that CAS (or ordering) requires store buffers to be drained, and inherently involves the latency of socket-to-socket cache-line-invalidate operations, that’s just not the case. A processor with nothing outstanding in the load/store pipelines has practically no cost for ordering, and ordering it’s own operations “globally” is a processor-local concern: it only has to get it right with respect to it’s own outstanding load and store operations. This doesn’t require any flushing – just correct interpretation of state and correct application of state changes. It must order the operation correctly with respect to other ops before making them globally visible – something an out-of-order execution core like modern x86 is extremely good at.
The cause of 100+ cycle CAS cost had historically been due to the fact that processors implemented it by including global (processor-to-processor) communications steps which inherently involve memory-latency level (and/or socket-to-socket) operations. However, there is nothing in a CAS (or in ordering operations for that matter) that inherently makes it a super-expensive (100+ cycle) operation. It was only that way because processors did not bother to optimize implementation for multi-processor, multi-core situations. The only 100+ cycle cost that is required is one where there is contention for the cache line in question, and in that sense a CAS should have been not much different from a load/store to contended cache lines.
With Vega, we went the extreme route (having 800+ cores makes you assume the common program will actually need frequent uncontended CASs). A Vega CAS is simply an atomic read-modify-write operation with no global ordering requirements. If you want ordering you use a separate fence instruction to force the ordering you want (e.g. “fence stst | ldst”). Ordering would be practically free (1 cycle) if there are no outstanding loads or stores in the pipe, and would have the same cost regardless of whether there are atomic or non-atomic load/store stuff sitting in the pipe. The way we did the atomic read-modify-write was nearly trivial: the cache line is brought into L1 in exclusive mode, and then the processor’s L1 cache’s communications to the outside world is frozen for the duration of the operation. This makes the operation inherently atomic, and has no effect on any other processors unless they need to invalidate one of the L1 cache lines in our processor cache during our atomic op (and if they do, the delay they see is trivial compared to the communications delay of getting to us and back).
With x86, it gets more complicated because a CAS has very specific ordering semantics (as do all LOCK instructions). However, all this means is that the processor is responsible for maintaining that order appearance. Once it gains exclusive access to the cache line, it can perform similar processor-local games to make sure the operation is both atomic and appears in the right order to other processors. It has to treat it right with respect to it’s own outstanding load and stores, and with respect to it’s store buffer, but those are all core-local concerns, and non of them require the long latency of communicating with any other processor’s L1/L2/L3 levels.
As a result, an uncontended CAS on modern (Nehalem and later) cores is much cheaper than before. It is still sensitive to being mixed with other cache missing load and stores, but no more so that a regular load and store would when combined with an ordering operation barrier.
Nice clear explanation
This fits with what I’m measuring. I’m seeing lock xadd for example take 15 cycles on Westmere and 20 cycles on Sandybridge if uncontended and with an empty store buffer. This can get much worse if the store buffer is significant and therefore we are back in the >100 cycles space again.
I’ve had official confirmation back from Intel on my findings with lock instructions on Sandybridge. The reason for the issues are two fold. Firstly, there is a ~33% latency cost increase with the uncontended instructions (they cannot explain this!). Secondly, and more importantly, is the invalidate messages are now faster. So under contention a ping pong effect kicks in resulting in a thread loosing the lock very quickly as new claim is issued. So micro-benchmarks are really hurtful but real world apps look better.
For your last statement is is not the case the ordering instructions do not matter because he stores cannot be re-ordered anyway on the x86 memory model?
BTW I’ve been talking to the ######## guys and they are saying nice things about the engagement with Azul. Seems you have a number of folk pushing you in the low-latency space
On Nov 14, 2011, at 8:51 AM, Cliff Click wrote:
store/load can be reordered – and it’s very useful to do so, so X86 is good at it.
st [MISS_in_cache_adr],0 // miss in cache – will take 1000 clks
locked op[HIT_in_cache_adr1] // some random locked op that hits in cache – without the lock will take ~1 clks
ld rax,[HIT_in_cache_adr2] // takes 1 clk
ld rbx,[rax] // Would LOVE to start this early… if the above load into RAX could complete early
Would otherwise reorder except for the lock prefix.
Furthermore, even for stores “cannot be re-ordered” doesn’t mean “have to wait for each other and execute serially”. It just means that their effects cannot be made visible to the outside world in an order other than stated.
Imagine for examples the following sequence:
st [miss_in_L3_cache_addr_A], 1
st [hit_in_cache_addr_B], 2
ld regX, [hit_in_cache_addr_C]
test regX, mask
branch somewhere if we don’t like the test result
st [hit_in_cache_addr_C], 3
do some work that takes less than an L3 miss…
This entire sequence can be executed in the time it takes the first instruction to complete it’s L3 miss. It just can’t be made visible until the first miss is completed and the store is made visible “first”. So the store buffer, while continuing to be stored to, has to be gated at the first store until it completes.
Obviously, if the values of subsequent stores was somehow dependent on the earlier ones, or on earlier not-yet-evaluated operations (such as a predicted branch the depends on a missing load – imagine that addr_C above was cache missing), they may have to wait before being evaluated and committed, but even there speculative execution could be used to pre-evaluate and pre-store them, as long as the processor was able to yank the store back out before they become visible if the speculation was determined to be wrong (branch prediction followed by a store would be a common case for this sort of “I want to store but not absolutely sure about the value yet” situation).
On Nov 15, 2011, at 12:06 AM, Martin Thompson wrote:
This is exactly what I was getting at. If a lock or sfence instruction was executed next in sequence the processor would need to stall until those instructions had retired. If it was just another straight forward store the processor could continue until the store buffer is full or another cache miss occurs that stops branch prediction.