History | Log In     View a printable version of the current page.  
Issue Details (XML | Word | Printable)

Key: QB-3035
Type: Bug Bug
Status: Resolved Resolved
Resolution: Fixed
Priority: Blocker Blocker
Assignee: Robin Shen
Reporter: AlSt
Votes: 0
Watchers: 0
Operations

If you were logged in you would be able to see more operations.
QuickBuild

Race condition/Endless loop in GridJob.sun()

Created: 22/Aug/17 07:50 AM   Updated: 14/Sep/17 01:32 AM
Component/s: None
Affects Version/s: 7.0.19
Fix Version/s: 7.0.22

Original Estimate: Unknown Remaining Estimate: Unknown Time Spent: Unknown


 Description  « Hide
It seems there is some kind of a race condition in com.pmease.quickbuild/src/com/pmease/quickbuild/grid/GridJob.java. The call to taskNode.getNodeService().isGridJobActive(taskId, getId()) is done infinitely although the node does not exist anymore causing a lot of log lines like:

2017-08-22 09:14:35,141 [pool-1-thread-73487] WARN com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://172.16.122.254:8811/service/node' failed: Connection refused (Connection refused), retrying after 5 seconds...
2017-08-22 09:14:40,150 [pool-1-thread-73487] WARN com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://172.16.122.254:8811/service/node' failed: Connection refused (Connection refused), retrying after 5 seconds...
2017-08-22 09:14:45,152 [pool-1-thread-73487] WARN com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://172.16.122.254:8811/service/node' failed: Connection refused (Connection refused), retrying after 5 seconds...
2017-08-22 09:14:50,156 [pool-1-thread-73487] WARN com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://172.16.122.254:8811/service/node' failed: Connection refused (Connection refused), retrying after 5 seconds...
2017-08-22 09:14:55,157 [pool-1-thread-73487] WARN com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://172.16.122.254:8811/service/node' failed: Connection refused (Connection refused), retrying after 5 seconds...
2017-08-22 09:15:00,215 [pool-1-thread-73487] WARN com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://172.16.122.254:8811/service/node' failed: Connection refused (Connection refused), retrying after 5 seconds...
2017-08-22 09:15:05,217 [pool-1-thread-73487] WARN com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://172.16.122.254:8811/service/node' failed: Connection refused (Connection refused), retrying after 5 seconds...
2017-08-22 09:15:10,266 [pool-1-thread-73487] WARN com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://172.16.122.254:8811/service/node' failed: Connection refused (Connection refused), retrying after 5 seconds...
2017-08-22 09:15:15,268 [pool-1-thread-73487] WARN com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://172.16.122.254:8811/service/node' failed: Connection refused (Connection refused), retrying after 5 seconds...
2017-08-22 09:15:20,270 [pool-1-thread-73487] WARN com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://172.16.122.254:8811/service/node' failed: Connection refused (Connection refused), retrying after 5 seconds...
2017-08-22 09:15:25,275 [pool-1-thread-73487] WARN com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://172.16.122.254:8811/service/node' failed: Connection refused (Connection refused), retrying after 5 seconds...
2017-08-22 09:15:30,277 [pool-1-thread-73487] WARN com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://172.16.122.254:8811/service/node' failed: Connection refused (Connection refused), retrying after 5 seconds...
2017-08-22 09:15:35,279 [pool-1-thread-73487] WARN com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://172.16.122.254:8811/service/node' failed: Connection refused (Connection refused), retrying after 5 seconds...
2017-08-22 09:15:40,281 [pool-1-thread-73487] WARN com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://172.16.122.254:8811/service/node' failed: Connection refused (Connection refused), retrying after 5 seconds...
2017-08-22 09:15:45,347 [pool-1-thread-73487] WARN com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://172.16.122.254:8811/service/node' failed: Connection refused (Connection refused), retrying after 5 seconds...
2017-08-22 09:15:50,350 [pool-1-thread-73487] WARN com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://172.16.122.254:8811/service/node' failed: Connection refused (Connection refused), retrying after 5 seconds...
2017-08-22 09:15:55,560 [pool-1-thread-73487] WARN com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://172.16.122.254:8811/service/node' failed: Connection refused (Connection refused), retrying after 5 seconds...
2017-08-22 09:16:00,562 [pool-1-thread-73487] WARN com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://172.16.122.254:8811/service/node' failed: Connection refused (Connection refused), retrying after 5 seconds...
2017-08-22 09:16:05,563 [pool-1-thread-73487] WARN com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://172.16.122.254:8811/service/node' failed: Connection refused (Connection refused), retrying after 5 seconds...
2017-08-22 09:16:10,790 [pool-1-thread-73487] WARN com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://172.16.122.254:8811/service/node' failed: Connection refused (Connection refused), retrying after 5 seconds...
2017-08-22 09:16:15,791 [pool-1-thread-73487] WARN com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://172.16.122.254:8811/service/node' failed: Connection refused (Connection refused), retrying after 5 seconds...
2017-08-22 09:16:20,800 [pool-1-thread-73487] WARN com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://172.16.122.254:8811/service/node' failed: Connection refused (Connection refused), retrying after 5 seconds...
2017-08-22 09:16:25,801 [pool-1-thread-73487] WARN com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://172.16.122.254:8811/service/node' failed: Connection refused (Connection refused), retrying after 5 seconds...
2017-08-22 09:16:30,804 [pool-1-thread-73487] WARN com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://172.16.122.254:8811/service/node' failed: Connection refused (Connection refused), retrying after 5 seconds...

I investigated further and the HessianURLConnection only does 3 retries. Afterwards the loop

while (state.get() == 1) {
try {
if (!taskNode.getNodeService().isGridJobActive(taskId, getId())) {
Thread thread = GridJob.this.thread;
if (thread != null)
thread.interrupt();
}
state.wait(TEST_INTERVAL);
} catch (Exception e) {
Thread thread = GridJob.this.thread;
if (thread != null)
thread.interrupt();
}
}
state.set(3);
state.notify();

is executing isGridJobActive again and the connection attempt is tried again and again.

There is some weird behavior because of this with a second node. It tries to authenticate with the token of another node which obviously previously executed the grid job.

And I think I know whats wrong and why it is a endless loop:
there is a synchronized(state) on line 84 which surrounds the while loop which is checking state.get() == 1 and that blocks the synchronized(state) on line 110 so the state never gets changed and that's why it is in an endless loop.
Usually surrounding the loop with the synchronized would work if the state.wait(...) would be called to release the lock at least for that time, but it never gets called because of the Hessian exception. That's why it hangs in line 110 and the retry is done every 5 seconds. So the easist fix would be most likely to move the state.wait(...) of line 92 outside the try/catch. Also the InteruptedException has to be handled outside.

Basically the thing is to ensure that the lock on "state" in line 84 is released at some time that "state" can change that the condition of the loop will become false at some point.

 All   Comments   Work Log   Change History      Sort Order:
Robin Shen [23/Aug/17 02:37 AM]
Yes you are right! I did not realize the node connection exception can skip the WAIT statement. Will get it fixed in next patch release. Thanks!

AlSt [30/Aug/17 09:14 AM]
Hi Robin,
Just wanted to ask when the next patch release is expected to be released? Thank you very much.

Alex


AlSt [31/Aug/17 05:44 AM]
Thanks!