<< Back to previous view |
[QB-3035] Race condition/Endless loop in GridJob.sun()
|
|
Status: | Resolved |
Project: | QuickBuild |
Component/s: | None |
Affects Version/s: | 7.0.19 |
Fix Version/s: | 7.0.22 |
Type: | Bug | Priority: | Blocker |
Reporter: | AlSt | Assigned To: | Robin Shen |
Resolution: | Fixed | Votes: | 0 |
Remaining Estimate: | Unknown | Time Spent: | Unknown |
Original Estimate: | Unknown |
Description |
It seems there is some kind of a race condition in com.pmease.quickbuild/src/com/pmease/quickbuild/grid/GridJob.java. The call to taskNode.getNodeService().isGridJobActive(taskId, getId()) is done infinitely although the node does not exist anymore causing a lot of log lines like:
2017-08-22 09:14:35,141 [pool-1-thread-73487] WARN com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://172.16.122.254:8811/service/node' failed: Connection refused (Connection refused), retrying after 5 seconds... 2017-08-22 09:14:40,150 [pool-1-thread-73487] WARN com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://172.16.122.254:8811/service/node' failed: Connection refused (Connection refused), retrying after 5 seconds... 2017-08-22 09:14:45,152 [pool-1-thread-73487] WARN com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://172.16.122.254:8811/service/node' failed: Connection refused (Connection refused), retrying after 5 seconds... 2017-08-22 09:14:50,156 [pool-1-thread-73487] WARN com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://172.16.122.254:8811/service/node' failed: Connection refused (Connection refused), retrying after 5 seconds... 2017-08-22 09:14:55,157 [pool-1-thread-73487] WARN com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://172.16.122.254:8811/service/node' failed: Connection refused (Connection refused), retrying after 5 seconds... 2017-08-22 09:15:00,215 [pool-1-thread-73487] WARN com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://172.16.122.254:8811/service/node' failed: Connection refused (Connection refused), retrying after 5 seconds... 2017-08-22 09:15:05,217 [pool-1-thread-73487] WARN com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://172.16.122.254:8811/service/node' failed: Connection refused (Connection refused), retrying after 5 seconds... 2017-08-22 09:15:10,266 [pool-1-thread-73487] WARN com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://172.16.122.254:8811/service/node' failed: Connection refused (Connection refused), retrying after 5 seconds... 2017-08-22 09:15:15,268 [pool-1-thread-73487] WARN com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://172.16.122.254:8811/service/node' failed: Connection refused (Connection refused), retrying after 5 seconds... 2017-08-22 09:15:20,270 [pool-1-thread-73487] WARN com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://172.16.122.254:8811/service/node' failed: Connection refused (Connection refused), retrying after 5 seconds... 2017-08-22 09:15:25,275 [pool-1-thread-73487] WARN com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://172.16.122.254:8811/service/node' failed: Connection refused (Connection refused), retrying after 5 seconds... 2017-08-22 09:15:30,277 [pool-1-thread-73487] WARN com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://172.16.122.254:8811/service/node' failed: Connection refused (Connection refused), retrying after 5 seconds... 2017-08-22 09:15:35,279 [pool-1-thread-73487] WARN com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://172.16.122.254:8811/service/node' failed: Connection refused (Connection refused), retrying after 5 seconds... 2017-08-22 09:15:40,281 [pool-1-thread-73487] WARN com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://172.16.122.254:8811/service/node' failed: Connection refused (Connection refused), retrying after 5 seconds... 2017-08-22 09:15:45,347 [pool-1-thread-73487] WARN com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://172.16.122.254:8811/service/node' failed: Connection refused (Connection refused), retrying after 5 seconds... 2017-08-22 09:15:50,350 [pool-1-thread-73487] WARN com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://172.16.122.254:8811/service/node' failed: Connection refused (Connection refused), retrying after 5 seconds... 2017-08-22 09:15:55,560 [pool-1-thread-73487] WARN com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://172.16.122.254:8811/service/node' failed: Connection refused (Connection refused), retrying after 5 seconds... 2017-08-22 09:16:00,562 [pool-1-thread-73487] WARN com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://172.16.122.254:8811/service/node' failed: Connection refused (Connection refused), retrying after 5 seconds... 2017-08-22 09:16:05,563 [pool-1-thread-73487] WARN com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://172.16.122.254:8811/service/node' failed: Connection refused (Connection refused), retrying after 5 seconds... 2017-08-22 09:16:10,790 [pool-1-thread-73487] WARN com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://172.16.122.254:8811/service/node' failed: Connection refused (Connection refused), retrying after 5 seconds... 2017-08-22 09:16:15,791 [pool-1-thread-73487] WARN com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://172.16.122.254:8811/service/node' failed: Connection refused (Connection refused), retrying after 5 seconds... 2017-08-22 09:16:20,800 [pool-1-thread-73487] WARN com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://172.16.122.254:8811/service/node' failed: Connection refused (Connection refused), retrying after 5 seconds... 2017-08-22 09:16:25,801 [pool-1-thread-73487] WARN com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://172.16.122.254:8811/service/node' failed: Connection refused (Connection refused), retrying after 5 seconds... 2017-08-22 09:16:30,804 [pool-1-thread-73487] WARN com.caucho.hessian.client.HessianURLConnection - Connecting to 'http://172.16.122.254:8811/service/node' failed: Connection refused (Connection refused), retrying after 5 seconds... I investigated further and the HessianURLConnection only does 3 retries. Afterwards the loop while (state.get() == 1) { try { if (!taskNode.getNodeService().isGridJobActive(taskId, getId())) { Thread thread = GridJob.this.thread; if (thread != null) thread.interrupt(); } state.wait(TEST_INTERVAL); } catch (Exception e) { Thread thread = GridJob.this.thread; if (thread != null) thread.interrupt(); } } state.set(3); state.notify(); is executing isGridJobActive again and the connection attempt is tried again and again. There is some weird behavior because of this with a second node. It tries to authenticate with the token of another node which obviously previously executed the grid job. And I think I know whats wrong and why it is a endless loop: there is a synchronized(state) on line 84 which surrounds the while loop which is checking state.get() == 1 and that blocks the synchronized(state) on line 110 so the state never gets changed and that's why it is in an endless loop. Usually surrounding the loop with the synchronized would work if the state.wait(...) would be called to release the lock at least for that time, but it never gets called because of the Hessian exception. That's why it hangs in line 110 and the retry is done every 5 seconds. So the easist fix would be most likely to move the state.wait(...) of line 92 outside the try/catch. Also the InteruptedException has to be handled outside. Basically the thing is to ensure that the lock on "state" in line 84 is released at some time that "state" can change that the condition of the loop will become false at some point. |
Comments |
Comment by Robin Shen [ 23/Aug/17 02:37 AM ] |
Yes you are right! I did not realize the node connection exception can skip the WAIT statement. Will get it fixed in next patch release. Thanks! |
Comment by AlSt [ 30/Aug/17 09:14 AM ] |
Hi Robin,
Just wanted to ask when the next patch release is expected to be released? Thank you very much. Alex |
Comment by Robin Shen [ 31/Aug/17 01:04 AM ] |
Got fixed now in https://build.pmease.com/build/3974/build_overview |
Comment by AlSt [ 31/Aug/17 05:44 AM ] |
Thanks! |