Wednesday, April 8, 2015

Case Of Fence Intrinsics

The Java Memory Model describes how threads in Java interact with memory. C++11 also defines a similar memory model [c++spec]; probably any language which supports threads needs to have a memory model [hans1]. In essence, a memory model tries to guarantee Sequential Consistency [sq] for Data Race Free programs (SC for DRF) [c++11]. In order to implement the memory model, a language may need to provide well-defined synchronization operations, such as locking or, in Java, volatile reads/writes.

Java provides many such operations and constructs for this purpose: volatile, locks, final, synchronized, Unsafe.putOrdered(), etc. Since JDK 8 we also have the Fence Intrinsics API [jep171], which provides memory ordering (fences) without modifying any data. In this post I will try to explain how the fences are defined in terms of the JMM, and we will look at some use cases where the Fence API may be used.

Let's start with a very basic example:

 
T1             

a = 1
b = 1



T2        
                     
while(b!=1);
assert(a==1)


Fig. 1


The programmer's expectation in the above code is that the assert will never fail, but this expectation will only be met if the code is executed sequentially, i.e. in the same order in which it was written [sq]. In the real world this is far from the truth: because of compiler and processor/hardware optimizations, your code doesn't run sequentially. In other words, not only may concurrent executions be interleaved, but they may also be reordered and otherwise manipulated into an optimized form that bears little resemblance to your code [doug1].

 
T1             

b = 1
a = 1



T2        
                     
while(b!=1);
assert(a==1) // assert may fail here 


Fig. 2 - Reordered Code


The code in Fig. 1 can easily be reordered into the code in Fig. 2 as a result of optimization by the compiler/hardware. You can easily see that if b = 1 is reordered before a = 1, then the assert in thread T2 may fail. In order to fix this broken code we need to rely on the guarantees given by the memory model about the resulting semantics. In other words, there needs to be some kind of synchronization action [sa] between the two threads T1 and T2 which guarantees correct and well-defined behaviour. The code in Fig. 2 can easily be fixed by declaring b as volatile in Java (Fig. 3). A write to a volatile variable v synchronizes-with all subsequent reads of v by any thread [sw], and if an action x synchronizes-with a following action y, then we also have hb(x, y); i.e. a write to a volatile field happens-before every subsequent read of that field [hb].

 
T1             

a = 1
b = 1
c = 2;



T2        
                     
while(b!=1);
assert(a==1) // will always pass


Fig. 3 - Fix by making b as volatile
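The fix in Fig. 3 can be sketched as a small runnable program (the Thread/join scaffolding and the class name are mine; a and b are the fields from the figure):

```java
public class VolatileHB {
    static int a = 0;
    static volatile int b = 0; // the volatile write/read pair creates the happens-before edge

    public static void main(String[] args) throws InterruptedException {
        Thread t1 = new Thread(() -> {
            a = 1;
            b = 1; // volatile write: a = 1 cannot be reordered after it
        });
        Thread t2 = new Thread(() -> {
            while (b != 1) { /* spin until the volatile write is visible */ }
            // the volatile read of b makes the earlier write a = 1 visible
            if (a != 1) throw new AssertionError("a should be 1");
        });
        t2.start();
        t1.start();
        t1.join();
        t2.join();
        System.out.println("assert passed: a = " + a);
    }
}
```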


Now, declaring b as volatile is one way to correct the above code. One thing to note here is that in order to make the code work we only need the weaker, partial guarantee of happens-before [hb], not necessarily full sequential consistency. For example, it doesn't matter whether c = 2 is executed before b = 1, as long as a = 1 is executed before b = 1. Java therefore has further constructs which can provide a partial happens-before guarantee: unsafe.putOrderedInt(object, offset, value) and unsafe.getIntVolatile(object, offset). We can use these two methods to achieve a happens-before guarantee without using the volatile keyword.
 
T1             

a = 1
unsafe.putIntOrdered(b,1);
c = 2;



T2        
                     
while(unsafe.getIntVolatile(b)!=1); 
assert(a==1) // will always pass


Fig. 4 - Fix by using Unsafe.putOrdered() where b is a normal variable
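Fig. 4 uses sun.misc.Unsafe, which takes reflection tricks to obtain. The public counterpart of unsafe.putOrderedInt() is AtomicInteger.lazySet() (an ordered/release store), and of unsafe.getIntVolatile() is AtomicInteger.get(). A runnable sketch of the same pattern, with the harness being mine:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class OrderedWrite {
    static int a = 0;
    static final AtomicInteger b = new AtomicInteger(0); // plays the role of b in Fig. 4

    public static void main(String[] args) throws InterruptedException {
        Thread t1 = new Thread(() -> {
            a = 1;
            b.lazySet(1); // ordered store, like unsafe.putOrderedInt: a = 1 cannot move after it
        });
        Thread t2 = new Thread(() -> {
            while (b.get() != 1) { /* volatile read, like unsafe.getIntVolatile */ }
            if (a != 1) throw new AssertionError("a should be 1");
        });
        t2.start();
        t1.start();
        t1.join();
        t2.join();
        System.out.println("assert passed");
    }
}
```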


volatile generally provides a somewhat stronger guarantee than bare happens-before. The idea here is only to show some alternative ways in which the same effect can be achieved, not necessarily better ones. Now let's look at the Fence Intrinsics API available since JDK 8 [unsafe]. Below is the Javadoc for the fence methods (sun.misc.Unsafe) from a JDK 9 build. The comments are pretty comprehensive, so please read them patiently.

 
 /**
     * Ensures that loads before the fence will not be reordered with loads and
     * stores after the fence; a "LoadLoad plus LoadStore barrier".
     *
     * Corresponds to C11 atomic_thread_fence(memory_order_acquire)
     * (an "acquire fence").
     *
     * A pure LoadLoad fence is not provided, since the addition of LoadStore
     * is almost always desired, and most current hardware instructions that
     * provide a LoadLoad barrier also provide a LoadStore barrier for free.
     * @since 1.8
     */
    public native void loadFence();

    /**
     * Ensures that loads and stores before the fence will not be reordered with
     * stores after the fence; a "StoreStore plus LoadStore barrier".
     *
     * Corresponds to C11 atomic_thread_fence(memory_order_release)
     * (a "release fence").
     *
     * A pure StoreStore fence is not provided, since the addition of LoadStore
     * is almost always desired, and most current hardware instructions that
     * provide a StoreStore barrier also provide a LoadStore barrier for free.
     * @since 1.8
     */
    public native void storeFence();

    /**
     * Ensures that loads and stores before the fence will not be reordered
     * with loads and stores after the fence.  Implies the effects of both
     * loadFence() and storeFence(), and in addition, the effect of a StoreLoad
     * barrier.
     *
     * Corresponds to C11 atomic_thread_fence(memory_order_seq_cst).
     * @since 1.8
     */
    public native void fullFence(); 

Let's see how we can use the intrinsics to achieve the correctness of the code in Fig. 1.

 
T1             

a = 1
unsafe.storeFence();
b = 1  
c = 2;



T2        
                     
while(b!=1); 
unsafe.loadFence();
assert(a==1) // will always pass


Fig. 5 - Fixing using Fences - Wrong Attempt


Now, the above code does provide the necessary compiler/hardware fences to prevent reordering, but there is a subtle problem: who tells the compiler not to optimize the loop while(b!=1)? Probably no one, and the compiler may well optimize it into an infinite loop: it reads the value of b once as 0, concludes that nobody can update it, and loops forever. As a result your code breaks.

Let's fix the code in Fig. 5.

 
T1             

a = 1
unsafe.storeFence();
b = 1  
c = 2;



T2        
                     
while(b!=1){ 
unsafe.loadFence();
}
unsafe.loadFence();
assert(a==1) // will always pass


Fig. 6 - Fixing using Fences


Repeatedly calling loadFence() inside the while loop prevents this compiler optimization. The code in Fig. 6 may work, but is there any advantage of this approach over its counterparts unsafe.putOrdered() and volatile? Probably not, at least not in this scenario. In fact, c = 2 is now not allowed to move up above the storeFence(), whereas it was allowed to move up past the volatile write and unsafe.putOrdered() in Fig. 3 and Fig. 4. We are probably losing more than we gain.
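As an aside: since JDK 9 equivalent fences are exposed through the supported java.lang.invoke.VarHandle class (acquireFence/releaseFence/fullFence), so the pattern of Fig. 6 can be written without sun.misc.Unsafe. A sketch under that assumption (class name and harness are mine):

```java
import java.lang.invoke.VarHandle;

public class FenceDemo {
    static int a = 0, b = 0, c = 0;

    public static void main(String[] args) throws InterruptedException {
        Thread t1 = new Thread(() -> {
            a = 1;
            VarHandle.releaseFence(); // plays the role of unsafe.storeFence() in Fig. 6
            b = 1;
            c = 2;
        });
        Thread t2 = new Thread(() -> {
            while (b != 1) {
                VarHandle.acquireFence(); // as in Fig. 6: also keeps the loop re-reading b
            }
            VarHandle.acquireFence(); // plays the role of unsafe.loadFence()
            if (a != 1) throw new AssertionError("a should be 1");
        });
        t2.start();
        t1.start();
        t1.join();
        t2.join();
        System.out.println("assert passed");
    }
}
```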


Similarly, let's try Double-Checked Locking (DCL) [dcl] and create a Singleton using volatile.

 
volatile Singleton instance = null;
Singleton getInstance() {
  if (instance == null) {
    synchronized (Singleton.class) {
      if (instance == null)
        instance = new Singleton();
    }
  }
  return instance;
}

                         
Fig. 7 -  DCL using volatile


Now let's try the same using the Fences API.

 
Singleton instance = null;
Singleton getInstance() {
  Singleton tmp = instance;
  U.loadFence();
  if (tmp == null) {
    synchronized (Singleton.class) {
      tmp = instance;
      if (tmp == null) {
        tmp = new Singleton();
        U.storeFence();
        instance = tmp;
      }
    }
  }
  return instance;
}

                         
Fig. 8 -  DCL using Fences


The above code works without using the volatile or final keyword: storeFence() prevents the reference to the singleton from being published before its construction is done, and loadFence() guarantees that we read the fully initialized fields of the Singleton [dclnew]. I don't argue whether the above implementation of DCL is better or worse; there may be better strategies for the DCL pattern [wikiDCL]. The purpose is only to show how it can be achieved using the Fences API. Let's see some more examples where the Fences API can be used.
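For comparison, one well-known alternative strategy for lazy singletons is the initialization-on-demand holder idiom, which gets safe publication from the JVM's class-initialization semantics with no explicit fences at all (a sketch; names are mine):

```java
public class Singleton {
    private Singleton() { }

    // The JVM's class-initialization locking guarantees that INSTANCE is
    // safely published to every thread that triggers Holder's initialization.
    private static class Holder {
        static final Singleton INSTANCE = new Singleton();
    }

    public static Singleton getInstance() {
        return Holder.INSTANCE; // Holder is initialized lazily, on first call
    }

    public static void main(String[] args) {
        // both calls observe the same, fully constructed instance
        System.out.println(getInstance() == getInstance());
    }
}
```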

Happens-before consistency is not sufficient [jls]

 
T1             

r1 = x
if(r1!=0) {
  y = 1;
}


T2        
                     
r2 = y
if(r2!=0) {
  x = 1;
}


Happens-before consistent but not sequentially consistent
Fig. 9 - y == x == 1 allowed


Let's fix the above code using fences:

 
T1             

r1 = x
if(r1!=0) {
  U.storeFence()
  y = 1;
}


T2        
                     
r2 = y
if(r2!=0) {
   U.storeFence()
  x = 1;
}


Fixed using the Fences API
Fig. 10 - y == x == 1 not allowed now


In the above example storeFence() prevents any load or store before it from moving down past it, and any store after it from moving up through it. Hence no execution is possible in which both y and x end up as 1, i.e. y == 1 and x == 1 together is not allowed.

Let's take another example:

 
T1             

B = 1
r2 = A


T2        
                     
A = 1
r1 = B


Happens-before consistent but not sequentially consistent
Fig. 11 - r2 == r1 == 0 is allowed


Let's see if we can fix this with fences:

 
T1             

B = 1
U.fullFence()
r2 = A


T2        
                     
A = 1
U.fullFence()
r1 = B


Sequentially consistent
Fig. 12 - r2 == r1 == 0 is not allowed now 


The code in Fig. 11 cannot be fixed using storeFence() or loadFence(), as it requires a StoreLoad barrier [cookbook], which is provided by the fullFence() method of the Fence API. In Fig. 12 fullFence() prevents the store before it from being reordered with the load after it, and hence no execution is possible in which both r1 and r2 are 0, i.e. r1 == 0 and r2 == 0 together is not allowed.
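A runnable sketch of Fig. 12, using the supported VarHandle.fullFence() (JDK 9+) in place of U.fullFence(); the trial harness counting the forbidden (0, 0) outcome is mine:

```java
import java.lang.invoke.VarHandle;

public class Dekker {
    static int A, B, r1, r2;

    public static void main(String[] args) throws InterruptedException {
        int forbidden = 0; // how often we observe r1 == r2 == 0
        for (int i = 0; i < 2000; i++) {
            A = 0; B = 0;
            Thread t1 = new Thread(() -> {
                B = 1;
                VarHandle.fullFence(); // StoreLoad barrier, as U.fullFence() in Fig. 12
                r2 = A;
            });
            Thread t2 = new Thread(() -> {
                A = 1;
                VarHandle.fullFence();
                r1 = B;
            });
            t1.start(); t2.start();
            t1.join(); t2.join(); // join gives happens-before, so reading r1/r2 here is safe
            if (r1 == 0 && r2 == 0) forbidden++;
        }
        System.out.println(forbidden == 0
            ? "forbidden outcome never seen"
            : "forbidden outcome seen " + forbidden + " times");
    }
}
```

Without the two fences, some trials would be expected to print the forbidden outcome on hardware that reorders stores with subsequent loads.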
There is one more very interesting use case of the Fences API: reading optimistically with StampedLock [sl].
 
double distanceFromOrigin() { // A read-only method
  long stamp = sl.tryOptimisticRead();
  double currentX = x, currentY = y;
  if (!sl.validate(stamp)) {
    stamp = sl.readLock();
    try {
      currentX = x;
      currentY = y;
    } finally {
      sl.unlockRead(stamp);
    }
  }
  return Math.sqrt(currentX * currentX + currentY * currentY);
}



Fig. 13 - Optimistic Read using Stamped Lock


In the above example we first read x and y optimistically using tryOptimisticRead(), and then validate whether that optimistic read is still valid; if it is not, we take the read lock and read again, otherwise we are good to go. The strict requirement here is that the optimistic reads of x and y must not be reordered with (move down past) the sl.validate(stamp) call. To enforce this, validate() has to apply a loadFence() before checking the stamp's validity.
 
     * The design employs elements of Sequence locks
     * (as used in linux kernels; see Lameter's
     * http://www.lameter.com/gelato2005.pdf
     * and elsewhere; see
     * Boehm's http://www.hpl.hp.com/techreports/2012/HPL-2012-68.html)
     * and Ordered RW locks (see Shirako et al
     * http://dl.acm.org/citation.cfm?id=2312015)
     *
     *
     * As noted in Boehm's paper (above), sequence validation (mainly
     * method validate()) requires stricter ordering rules than apply
     * to normal volatile reads (of "state").  To force orderings of
     * reads before a validation and the validation itself in those
     * cases where this is not already forced, we use
     * Unsafe.loadFence.

    public boolean validate(long stamp) {
        U.loadFence();
        return (stamp & SBITS) == (state & SBITS);
    }



Fig. 14 - validate method from Stamped Lock
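Putting Fig. 13 into a self-contained class (the method body is adapted from the StampedLock Javadoc; the Point wrapper and main harness are mine):

```java
import java.util.concurrent.locks.StampedLock;

public class Point {
    private final StampedLock sl = new StampedLock();
    private double x, y;

    Point(double x, double y) { this.x = x; this.y = y; }

    double distanceFromOrigin() { // a read-only method, optimistic first
        long stamp = sl.tryOptimisticRead();
        double currentX = x, currentY = y;
        if (!sl.validate(stamp)) {   // validate() applies the loadFence internally (Fig. 14)
            stamp = sl.readLock();   // optimistic read failed; fall back to the read lock
            try {
                currentX = x;
                currentY = y;
            } finally {
                sl.unlockRead(stamp);
            }
        }
        return Math.sqrt(currentX * currentX + currentY * currentY);
    }

    public static void main(String[] args) {
        Point p = new Point(3.0, 4.0);
        System.out.println(p.distanceFromOrigin());
    }
}
```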


Possibly there will be much more to this with Enhanced Volatiles [ev]. I will try to post on that in the future.
