Issue Details

Key: QB-2440
Type: Bug
Status: Closed
Resolution: Won't Fix
Priority: Major
Assignee: Robin Shen
Reporter: Phong Trinh
Votes: 0
Watchers: 1
QuickBuild

When an agent goes offline, QB cancels all jobs assigned to that agent

Created: 14/May/15 08:13 PM   Updated: 18/May/15 11:20 PM
Component/s: None
Affects Version/s: 6.0.10
Fix Version/s: None

Original Estimate: Unknown Remaining Estimate: Unknown Time Spent: Unknown


 Description
 When a node/agent goes offline, QuickBuild automatically cancels all of the jobs assigned to that agent. This causes us to lose jobs that are scheduled for nightly builds. Sometimes an agent is under heavy load and doesn't respond to the server for a short period of time. QB then thinks the agent is offline and cancels all of the jobs assigned to it, when in fact the agent is back online within several minutes or so. I think the jobs should wait for the agent until it is back online and then resume their operations.
 This bug causes a serious issue for us here, since we cannot afford to lose nightly builds.

 Thank you in advance,
ptrinh

 Comments
Robin Shen [14/May/15 11:25 PM]
QB has to cancel jobs when it detects that an agent is offline, as otherwise the queue may fill up and block other builds. In your case, you may specify a sufficiently large agent timeout (via "administration / system setting") so that the QB server does not kick the agent out during a short outage.

Phong Trinh [18/May/15 07:30 PM]
There is also an issue with setting a large timeout. When the machine/node has a network hiccup (or fails for some other reason), QB thinks the machine is still available and assigns a job to it. The server then tries to run the job on the machine/node but is unable to connect to it. QB then cancels the job, so we lose this job. We think QB should not cancel jobs that are assigned to a node which goes offline, and we would like QB to have a way to manage these jobs. Maybe there could be an option/configuration to tell QB whether or not to cancel these jobs. This issue is critical to us, and we need to discuss it with you.

Robin Shen [18/May/15 11:20 PM]
Handling such a case is very cumbersome and error-prone, as the build can fail at every possible stage. I'd suggest reducing agent/server load or improving network bandwidth instead of having the application recover from these low-level errors.