I’ve been wrestling with concurrent algorithms again. This time, it’s a Non-Blocking (concurrent, lock-free) HashTable. I’ve had it basically figured out for a few months now and I’m slowing widening the circle of people I expose it too. This HashTable seems too good to be true, which is why I’m trying hard to vet it’s correctness before claiming victory. It’s single-threaded performance appears to equal java.util.HashTable, and at high CPU count (750 cpus on a high-end Azul machine) it’s about twice as fast as Doug Lea’s marvelous java.util.concurrent.ConcurrentHashMap – with 4096-way striping. This is for very high read-to-write ratios (99% reads, 1% table mods). For higher write ratios the Non-Blocking table scales much better than ConcurrentHashMap. Of course, if you lower the 4096-way striping to the default 16-way, then the Non-Blocking table becomes easily 100x faster. Performance summary:
- As fast as java.util.HashTable for single threads
- As fast as ConcurrentHashMap for < 32 cpus and >95% read rates
- Faster for higher write rates than ConcurrentHashMap. (2x to 8x faster, depending)
- Faster for more CPUs. Much faster if not striping ConcurrentHashMap enough (easily 100x faster)
- Scales linearly for all read/write rates at least up 750 cpus. Tested on 4-way X86, Niagra, 32-way UltraSparc2, 384-way Azul Vega1 and a 768-way Azul Vega2.
Ok, how do I do it?
By writing a State-Machine based algorithm, instead of the usual concurrent programming style.
From this idea I get a simple state machine, a short Java implementation of it, and a very fast very scalable algorithm.
First the easy stuff: I assume you have written hash tables in the past (nearly every non-Java programmer does at one point or another; the Java guys get it easy). A hash table is a collection of {key,value} pairs with fast random access (first access is via the hash function). I’ve chosen a power-of-2 closed table – collisions cause re-probing into the table (contrast this to an open table where collisions turn into linked-list traversals). I chose power-of-2 instead of a prime number because ‘AND’ is faster than ‘MOD’ by an amount larger than an L1-miss-L2-hit. I re-probe by adding 1 and AND’ing to the table size (better cache behavior for repeated re-probes). Other design decisions make sense here, depending on your exact usage pattern.
Now the fun stuff: How do I handle the {key,value} pairs? This is where the state-machine kicks in. I define states for each interesting key & value. For keys the states are {null,K} – where K is any key. A key-slot makes a 1-time transition from null to some K. For values the states are {null,V/T} – where V is any value and T is a Tombstone, a special token that is not any valid value and represents a deleted key. A value-slot makes a 1-time transition from null to some V, and can waffle about being different V’s or T (deleted) according to whatever was the last table modification.
This gives me 2 key-states and 3 value-states, so the {key,value} pair has 6 states.
- {null,null} – Empty
- {K,null} – Partially inserted key/value
- {K,V} – Fully functional {key,value} pair
- {K,T} – Previously inserted, now deleted Key
- {null,V} – partially inserted K/V pair being read out-of-order
- {null,T} – partially inserted K/T pair being read out-of-order
Now here’s the cool thing: it doesn’t matter what order I read or write these bits within each pair, I always get some sane state. I can act on that state. This means I do not need to reason about ordering when either inserting or looking up data in the table! This also means I do not need any memory fencing or any locking! I do need CAS when changing the table (but not double-CAS, since I can read out-of-order I can write out-of-order as well – hence I do not need to atomically update both words). Here, then, is the ‘get(Object key)’ code:
Object get( Object K ) {
idx = hash = K.hash();
while( true ) {
idx &= (size-1);
key = table[2*idx+0]; // key/val kept in even/odd pairs
val = table[2*idx+1]; // in the large array
if( key == null ) return null; // a miss
if( K == key || key.equals(K) ) // key hit?
return val == T ? null : val; // a hit (unless deleted)
idx++; // reprobe
}
}
I’ve left out some details (e.g. using the hash to avoid doing some key-compares that are doomed to fail), but this is the gist of it. ‘get’ is extremely lightweight – just the basic hash, lookup and return value for a hit. More next time on how I do ‘put()’ as that is where the state machine gets a real workout. But this post is already too long.
Cliff
PS – Slides will be online soon, and the source code as well (need to get the license figured out)



Recent Comments