<< Back to previous view

[QB-3035] Race condition/Endless loop in GridJob.sun()
Created: 22/Aug/17  Updated: 14/Sep/17

Status: Resolved
Project: QuickBuild
Component/s: None
Affects Version/s: 7.0.19
Fix Version/s: 7.0.22

Type: Bug Priority: Blocker
Reporter: AlSt Assigned To: Robin Shen
Resolution: Fixed Votes: 0
Remaining Estimate: Unknown Time Spent: Unknown
Original Estimate: Unknown


 Description   
It seems there is some kind of a race condition in com.pmease.quickbuild/src/com/pmease/quickbuild/grid/GridJob.java. The call to taskNode.getNodeService().isGridJobActive(taskId, getId()) is done infinitely although the node does not exist anymore causing a lot of log lines like:

2017-08-22 09:14:35,141 [pool-1-thread-73487] WARN com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://172.16.122.254:8811/service/node' failed: Connection refused (Connection refused), retrying after 5 seconds...
2017-08-22 09:14:40,150 [pool-1-thread-73487] WARN com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://172.16.122.254:8811/service/node' failed: Connection refused (Connection refused), retrying after 5 seconds...
2017-08-22 09:14:45,152 [pool-1-thread-73487] WARN com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://172.16.122.254:8811/service/node' failed: Connection refused (Connection refused), retrying after 5 seconds...
2017-08-22 09:14:50,156 [pool-1-thread-73487] WARN com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://172.16.122.254:8811/service/node' failed: Connection refused (Connection refused), retrying after 5 seconds...
2017-08-22 09:14:55,157 [pool-1-thread-73487] WARN com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://172.16.122.254:8811/service/node' failed: Connection refused (Connection refused), retrying after 5 seconds...
2017-08-22 09:15:00,215 [pool-1-thread-73487] WARN com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://172.16.122.254:8811/service/node' failed: Connection refused (Connection refused), retrying after 5 seconds...
2017-08-22 09:15:05,217 [pool-1-thread-73487] WARN com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://172.16.122.254:8811/service/node' failed: Connection refused (Connection refused), retrying after 5 seconds...
2017-08-22 09:15:10,266 [pool-1-thread-73487] WARN com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://172.16.122.254:8811/service/node' failed: Connection refused (Connection refused), retrying after 5 seconds...
2017-08-22 09:15:15,268 [pool-1-thread-73487] WARN com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://172.16.122.254:8811/service/node' failed: Connection refused (Connection refused), retrying after 5 seconds...
2017-08-22 09:15:20,270 [pool-1-thread-73487] WARN com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://172.16.122.254:8811/service/node' failed: Connection refused (Connection refused), retrying after 5 seconds...
2017-08-22 09:15:25,275 [pool-1-thread-73487] WARN com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://172.16.122.254:8811/service/node' failed: Connection refused (Connection refused), retrying after 5 seconds...
2017-08-22 09:15:30,277 [pool-1-thread-73487] WARN com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://172.16.122.254:8811/service/node' failed: Connection refused (Connection refused), retrying after 5 seconds...
2017-08-22 09:15:35,279 [pool-1-thread-73487] WARN com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://172.16.122.254:8811/service/node' failed: Connection refused (Connection refused), retrying after 5 seconds...
2017-08-22 09:15:40,281 [pool-1-thread-73487] WARN com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://172.16.122.254:8811/service/node' failed: Connection refused (Connection refused), retrying after 5 seconds...
2017-08-22 09:15:45,347 [pool-1-thread-73487] WARN com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://172.16.122.254:8811/service/node' failed: Connection refused (Connection refused), retrying after 5 seconds...
2017-08-22 09:15:50,350 [pool-1-thread-73487] WARN com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://172.16.122.254:8811/service/node' failed: Connection refused (Connection refused), retrying after 5 seconds...
2017-08-22 09:15:55,560 [pool-1-thread-73487] WARN com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://172.16.122.254:8811/service/node' failed: Connection refused (Connection refused), retrying after 5 seconds...
2017-08-22 09:16:00,562 [pool-1-thread-73487] WARN com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://172.16.122.254:8811/service/node' failed: Connection refused (Connection refused), retrying after 5 seconds...
2017-08-22 09:16:05,563 [pool-1-thread-73487] WARN com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://172.16.122.254:8811/service/node' failed: Connection refused (Connection refused), retrying after 5 seconds...
2017-08-22 09:16:10,790 [pool-1-thread-73487] WARN com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://172.16.122.254:8811/service/node' failed: Connection refused (Connection refused), retrying after 5 seconds...
2017-08-22 09:16:15,791 [pool-1-thread-73487] WARN com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://172.16.122.254:8811/service/node' failed: Connection refused (Connection refused), retrying after 5 seconds...
2017-08-22 09:16:20,800 [pool-1-thread-73487] WARN com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://172.16.122.254:8811/service/node' failed: Connection refused (Connection refused), retrying after 5 seconds...
2017-08-22 09:16:25,801 [pool-1-thread-73487] WARN com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://172.16.122.254:8811/service/node' failed: Connection refused (Connection refused), retrying after 5 seconds...
2017-08-22 09:16:30,804 [pool-1-thread-73487] WARN com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://172.16.122.254:8811/service/node' failed: Connection refused (Connection refused), retrying after 5 seconds...

I investigated further and the HessianURLConnection only does 3 retries. Afterwards the loop

while (state.get() == 1) {
try {
if (!taskNode.getNodeService().isGridJobActive(taskId, getId())) {
Thread thread = GridJob.this.thread;
if (thread != null)
thread.interrupt();
}
state.wait(TEST_INTERVAL);
} catch (Exception e) {
Thread thread = GridJob.this.thread;
if (thread != null)
thread.interrupt();
}
}
state.set(3);
state.notify();

is executing isGridJobActive again and the connection attempt is tried again and again.

There is some weird behavior because of this with a second node. It tries to authenticate with the token of another node which obviously previously executed the grid job.

And I think I know whats wrong and why it is a endless loop:
there is a synchronized(state) on line 84 which surrounds the while loop which is checking state.get() == 1 and that blocks the synchronized(state) on line 110 so the state never gets changed and that's why it is in an endless loop.
Usually surrounding the loop with the synchronized would work if the state.wait(...) would be called to release the lock at least for that time, but it never gets called because of the Hessian exception. That's why it hangs in line 110 and the retry is done every 5 seconds. So the easist fix would be most likely to move the state.wait(...) of line 92 outside the try/catch. Also the InteruptedException has to be handled outside.

Basically the thing is to ensure that the lock on "state" in line 84 is released at some time that "state" can change that the condition of the loop will become false at some point.

 Comments   
Comment by Robin Shen [ 23/Aug/17 02:37 AM ]
Yes you are right! I did not realize the node connection exception can skip the WAIT statement. Will get it fixed in next patch release. Thanks!
Comment by AlSt [ 30/Aug/17 09:14 AM ]
Hi Robin,
Just wanted to ask when the next patch release is expected to be released? Thank you very much.

Alex
Comment by Robin Shen [ 31/Aug/17 01:04 AM ]
Got fixed now in https://build.pmease.com/build/3974/build_overview
Comment by AlSt [ 31/Aug/17 05:44 AM ]
Thanks!
Generated at Fri May 03 00:30:08 UTC 2024 using JIRA 189.