[QB-2440] When an agent goes offline, QB cancels all of the jobs which are assigned to the agent
Created: 14/May/15  Updated: 18/May/15

Status: Closed
Project: QuickBuild
Component/s: None
Affects Version/s: 6.0.10
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Phong Trinh Assigned To: Robin Shen
Resolution: Won't Fix Votes: 0
Remaining Estimate: Unknown Time Spent: Unknown
Original Estimate: Unknown


 Description   
 When a node/agent goes offline, all of the jobs which are assigned to that agent are automatically cancelled by QuickBuild. This causes us to lose jobs which are scheduled for nightly builds. Sometimes an agent is under heavy load and doesn't respond to the server for a short period of time. QB then thinks the agent is offline and cancels all of the jobs which are assigned to it. In fact, the agent is back online within several minutes or so. I think the jobs should wait for the agent until it is back online and then resume.
 This bug causes a serious issue for us here, since we cannot afford to lose nightly builds.

 Thank you in advance,
ptrinh

 Comments   
Comment by Robin Shen [ 14/May/15 11:25 PM ]
QB has to cancel jobs when it detects that an agent is offline, as otherwise the queue may fill up and block other builds. In your case, you may specify a sufficiently large agent timeout (via "administration / system setting") so that the QB server does not kick the agent out when there is a short outage.
Comment by Phong Trinh [ 18/May/15 07:30 PM ]
There is an issue with setting a large timeout as well. When the machine/node has a network hiccup (or fails for some other reason), QB thinks the machine is still available and assigns a job to it. The server then tries to run the job on the machine/node but is unable to connect to it. QB then cancels the job, so we lose this job. We think QB should not cancel jobs which are assigned to a node which goes offline, and we would like QB to have a way to manage these jobs. Maybe there could be an option/configuration to tell QB whether to cancel these jobs or not. This issue is critical to us, and we need to discuss it with you.
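To make the requested option concrete, here is a minimal sketch of the behaviour being asked for. It is not QuickBuild code and does not use any QuickBuild API; every name in it (OfflinePolicy, Agent, Scheduler, agent_timeout) is hypothetical and only illustrates the difference between cancelling and keeping jobs when an agent misses its timeout.

```python
from dataclasses import dataclass, field
from enum import Enum


class OfflinePolicy(Enum):
    """Hypothetical setting: what to do with jobs of an agent that went offline."""
    CANCEL = "cancel"    # behaviour described in this issue
    REQUEUE = "requeue"  # behaviour requested: keep the jobs waiting


@dataclass
class Agent:
    name: str
    last_heartbeat: float  # seconds on some shared clock

    def is_offline(self, now: float, timeout: float) -> bool:
        # The agent counts as offline once it has been silent longer than the timeout.
        return now - self.last_heartbeat > timeout


@dataclass
class Scheduler:
    agent_timeout: float
    policy: OfflinePolicy
    assigned: dict = field(default_factory=dict)   # agent name -> list of job names
    waiting: list = field(default_factory=list)    # jobs kept for a later retry
    cancelled: list = field(default_factory=list)  # jobs that were dropped

    def check_agent(self, agent: Agent, now: float) -> None:
        """Apply the offline policy to the agent's jobs if it has timed out."""
        if not agent.is_offline(now, self.agent_timeout):
            return
        jobs = self.assigned.pop(agent.name, [])
        if self.policy is OfflinePolicy.CANCEL:
            self.cancelled.extend(jobs)
        else:
            self.waiting.extend(jobs)


# Usage: the agent last answered at t=0 and the timeout is 60 seconds.
scheduler = Scheduler(agent_timeout=60.0, policy=OfflinePolicy.REQUEUE)
scheduler.assigned["agent-1"] = ["nightly-build"]
agent = Agent(name="agent-1", last_heartbeat=0.0)
scheduler.check_agent(agent, now=30.0)  # within the timeout: nothing changes
scheduler.check_agent(agent, now=90.0)  # timed out: the job is kept in `waiting`
```

With the REQUEUE policy the nightly job stays in the waiting list instead of being dropped. A real option of this kind would probably also need a limit on how long jobs may wait, since waiting jobs can fill the queue and block other builds, as noted in the earlier comment.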
Comment by Robin Shen [ 18/May/15 11:20 PM ]
Handling such a case is very cumbersome and error-prone, as the build can fail at any stage. I'd suggest reducing agent/server load or improving network bandwidth instead of having the application recover from these low-level errors.
Generated at Sun May 05 17:55:34 UTC 2024 using JIRA 189.