
Key: QB-2192
Type: Bug
Status: Resolved
Resolution: Fixed
Priority: Major
Assignee: Robin Shen
Reporter: AlSt
Votes: 0
Watchers: 0
QuickBuild

Multiple builds running on connection loss

Created: 25/Sep/14 07:46 AM   Updated: 21/Dec/14 01:08 PM
Component/s: None
Affects Version/s: 5.1.32
Fix Version/s: 6.0.0

Original Estimate: Unknown Remaining Estimate: Unknown Time Spent: Unknown
File Attachments: 1. QB-fun.log (4 kb)



Description
We have one machine that occasionally loses its network connection (for whatever reason).
If that happens during a build, the server kills the running build on the server side, and as soon as the agent reconnects it starts a newly scheduled build on that node even though a build is still active there. The agent seems to know about the still-running build, as it writes to the log when the build finally finishes.

I'll attach a log that shows this behavior nicely: there were broken dependency builds in that cycle, so a lot of short builds were triggered while the one build was still running.
Since the build was still running, at least the wrapper has to know that processes it started are still alive. In a situation like that the agent must not start new builds.
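A minimal sketch of the agent-side guard requested above, written in Python rather than taken from QuickBuild. The `Agent` class, `start_build`, `finish_build`, and `NodeBusyError` are hypothetical names used only for illustration, and the second build ID is made up. The sketch assumes one resource per node and that the agent tracks the builds it started itself.

```python
class NodeBusyError(Exception):
    """Raised when a build is requested while another one is still running."""


class Agent:
    """Hypothetical agent that tracks the builds it has started itself."""

    def __init__(self, max_parallel=1):
        self.max_parallel = max_parallel
        self.running = set()  # IDs of builds whose processes are still alive

    def start_build(self, build_id):
        # Decide from local state only, regardless of what the server
        # believes after a connection loss.
        if len(self.running) >= self.max_parallel:
            raise NodeBusyError(
                f"cannot start {build_id}: still running {sorted(self.running)}"
            )
        self.running.add(build_id)

    def finish_build(self, build_id):
        self.running.discard(build_id)


agent = Agent()
agent.start_build(1094608)      # the long-running build from the attached log
try:
    agent.start_build(1094609)  # hypothetical build dispatched after reconnect
except NodeBusyError as err:
    print(err)
```

With a guard like this, the decision does not depend on the server's view of the node, which is the part that goes stale when the connection drops.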

Comments
AlSt [25/Sep/14 08:00 AM]
The attached log shows one build being started, followed by many further builds being started on the same node before the initial build is stopped.

Robin Shen [26/Sep/14 12:03 AM]
Is it possible for you to attach a simple QB database demonstrating this issue and let me know the reproducing steps?

AlSt [29/Sep/14 07:29 AM]
Reproducing it is not that easy, as we only see it on this one node, but the pattern is very clear:
A long-running build starts (one resource on that node, no parallel runs)
The server loses the connection to the agent and after some timeout simply cancels the build on the server side
The build is still running on the node
The server-agent connection is reestablished and the server triggers fresh builds that want to use that node
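The sequence above can be replayed with a toy model. This is only an illustration of the reported pattern, not QuickBuild's actual scheduling code: `Node`, `Server`, `dispatch`, and `on_connection_timeout` are hypothetical names, and the second build ID is invented.

```python
class Node:
    """The build machine: holds the builds that are really executing."""

    def __init__(self):
        self.processes = []


class Server:
    """Toy server that trusts only its own record of the node."""

    def __init__(self, node):
        self.node = node
        self.active = None  # build the server believes is running on the node

    def dispatch(self, build_id):
        # Only the server-side record is consulted, never the node itself.
        if self.active is not None:
            return False
        self.active = build_id
        self.node.processes.append(build_id)
        return True

    def on_connection_timeout(self):
        # The server drops its record but cannot reach the node,
        # so the process there keeps running.
        self.active = None


node = Node()
server = Server(node)
server.dispatch(1094608)        # long-running build starts
server.on_connection_timeout()  # connection lost, build canceled server side
server.dispatch(1094609)        # after reconnect: hypothetical fresh build
print(node.processes)           # both builds now occupy the single resource
```

In this model the node ends up with two builds on a single resource because the server's record and the node's real state diverge after the timeout.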

BUG: there is still a build running on that node, and the wrapper is aware of it, as it tries to write back the log when the build finally finishes. The server does not accept the log and produces an error because it has already canceled the build.
The state of the agent is completely ignored when the server-agent connection is reestablished. The server does not ask the agent whether anything is still running; it just assumes that nothing is and triggers a fresh build.
In the attached log you can see this with build ID 1094608. This build runs into the issue, then a reconnect happens and the agent just starts fresh builds as they appear in the queue. If the dependency build for those had not been broken at that time, the builds would simply have failed, because the resources were already in use by build 1094608, which was still running and only stopped at the end of its run.
The agent just has to check whether it is running builds before starting new ones on a reconnect.
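One way the missing check could look from the server side is a reconciliation step on reconnect. The sketch below is an assumption about a possible design, not the fix that shipped in 6.0.0: `NodeState`, `on_reconnect`, `on_agent_finished`, and `can_dispatch` are hypothetical names, and it presumes the agent can report the IDs of builds it is still running.

```python
from dataclasses import dataclass, field


@dataclass
class NodeState:
    """Hypothetical server-side bookkeeping for one build node."""

    server_active: set = field(default_factory=set)  # builds the server tracks
    orphans: set = field(default_factory=set)        # canceled but still running

    def on_reconnect(self, agent_running):
        # Ask the agent what it is still running; anything the server has
        # already given up on keeps the node reserved until it ends.
        self.orphans = set(agent_running) - self.server_active
        return self.orphans

    def on_agent_finished(self, build_id):
        self.orphans.discard(build_id)
        self.server_active.discard(build_id)

    def can_dispatch(self):
        # New builds are allowed only when nothing at all runs on the node.
        return not self.server_active and not self.orphans


state = NodeState()                       # server canceled 1094608 on timeout
orphans = state.on_reconnect({1094608})   # agent reports it is still running
print(orphans, state.can_dispatch())
state.on_agent_finished(1094608)
print(state.can_dispatch())
```

Whether the server should wait for an orphaned build to finish, as here, or tell the agent to kill it is a policy choice the report leaves open.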

Robin Shen [30/Sep/14 12:22 AM]
Thanks. I am clear on this issue now.