[#QB-4120] QuickBuild node went down, but it was shown as available on the server

Created: 18/Sep/24 01:06 AM Updated: 22/Nov/24 03:06 AM

Component/s: None
Affects Version/s: 12.0.29
Fix Version/s: None
Original Estimate: Unknown
Remaining Estimate: Unknown
Time Spent: Unknown

Description

I have a QuickBuild configuration that runs many test cases based on its inputs, with one test case running on one node at a time. The node/system is reverted after each test. I have a trigger that sends a large number of test cases to the configuration, and they wait until nodes become available. I am having two issues:
- QuickBuild server drops all of my tests when all of the nodes in the resource disappeared.
- The nodes/test systems are down (while reverting) but they are still shown as active on the QB server. I am thinking of increasing the heartbeat frequency at the node level. Is that possible?
Any advice is appreciated.
ptrinh

Comments

Robin Shen [18/Sep/24 01:42 AM]

The agent timeout can be configured in the system settings. However, instead of changing the agent timeout, please make sure to stop the agent service gracefully before reverting the system, so that the agent has a chance to notify the server that it is going down.

Phong Trinh [18/Sep/24 02:34 AM]

I am concerned that if I stop the service before reverting the system, there is a chance that the QB server drops the node before the reversion happens.

Robin Shen [18/Sep/24 03:45 AM]

You may run a command step to stop the service, for instance:

# sleep a while to allow the build to finish
sleep 10
service buildagent stop

Make sure to untick the option "wait for finish" in the advanced settings of the step.

Phong Trinh [18/Sep/24 09:20 PM]

Thank you very much for your suggestion. I am going to give it a try and will keep you informed.
Regarding my first issue: "QuickBuild server drops all of my tests when all of the nodes in the resource disappeared." How do I keep the server from dropping the tests?
Some nodes are under very heavy load and need more time to respond to the server. Can I override the agent timeout for those nodes individually?

Robin Shen [19/Sep/24 01:51 AM]

What do you mean by "when all of the nodes in the resource disappeared"? Is this because all your nodes are under heavy load and the QB server thinks they are dead? If so, what do you mean by "QB drops all tests"?

Phong Trinh [24/Sep/24 02:44 AM]

I have a QB configuration for my tests, such as root/RegressionTest, which is set up to allow concurrent test execution. This configuration runs tests on nodes within the resource 'Eligible for Regression Tests,' which contains 5 nodes. After each test completes, even if it fails, the node/system is reverted. When I run 15 tests, 5 are executed on the available nodes, while the remaining tests wait for nodes to free up. If all 5 tests finish simultaneously, the nodes are reverted and go offline, effectively disappearing from the 'Eligible for Regression Tests' resource. When this occurs, QB drops the remaining tests.

Robin Shen [24/Sep/24 07:21 AM]

How are you triggering builds of configuration "root/RegressionTest"? Is it manually, via the RESTful API, or via a trigger build step from another configuration? How do you distinguish the different tests when triggering the configuration?

Also, have you tried shutting down the agents gracefully? Taking an agent down forcibly can cause job loss, due to incorrect agent state.

Phong Trinh [24/Sep/24 08:57 PM]

I created a step in another configuration. The script this step uses to trigger root/RegressionTest is similar to the following:

import com.pmease.quickbuild.*;

// Comma-separated list of product versions to test (up to 15 versions)
def productVersions = "1.0.0,1.1.0,1.2.0,1.3.0,1.4.0,2.0.0,2.1.0";
def configurationIdToTrigger = system.configurationManager.get("root/RegressionTest").id;

// Request one build of root/RegressionTest per product version
for (productVersion in productVersions.split(",")) {
   def newRequest = new BuildRequest();
   newRequest.configurationId = configurationIdToTrigger;
   newRequest.variables = ["version": productVersion];
   system.buildEngine.requestBuild(Context.getUser(), false, newRequest);
}

I stop the QB agent service as you suggested:
# sleep a while to allow the build to finish
sleep 10
service buildagent stop

Make sure to untick the option "wait for finish" in advanced setting of the step.

Robin Shen [25/Sep/24 01:40 AM]

Thanks for the elaboration. This happens because the QB server sends the next job to the build agent immediately after the current job finishes, before the build agent has been signaled to shut down. To work around this issue:

1. Edit the user attributes of each of your build agents to add an attribute, say "ready", with its initial value set to "true".
2. For the grid resource you are using, change its node selection setting to select only build agents whose attribute "ready" equals "true".
3. Edit the pre-execute action of the master step (or the step using the above resource) of configuration "root/RegressionTest" to execute the script below:
groovy:

// Mark this agent as not ready so that no further job is dispatched to it
def userAttributes = node.userAttributes;
userAttributes["ready"] = "false";
node.setUserAttributes(userAttributes, true);
4. After reverting the build agent, change the property "ready" to "true" in the file "<build agent dir>/conf/attributes.properties". This has to be done before starting the agent service.
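
Step 4 could be scripted as part of the revert process. The following is only a minimal sketch, not a tested procedure. It assumes attributes.properties holds plain key=value lines. The set_property helper and the /opt/buildagent default path are hypothetical names introduced for illustration.

```shell
#!/bin/sh
# Hypothetical helper: set key=value in a properties-style file, replacing an
# existing "key=" line or appending a new one if the key is absent.
# Assumes simple keys and values (no regex metacharacters, no "|" or "&").
set_property() {
    file=$1
    key=$2
    value=$3
    tmp="${file}.tmp.$$"
    if [ -f "$file" ] && grep -q "^${key}=" "$file"; then
        # Rewrite the matching line, then replace the file in one move
        sed "s|^${key}=.*|${key}=${value}|" "$file" > "$tmp" && mv "$tmp" "$file"
    else
        printf '%s=%s\n' "$key" "$value" >> "$file"
    fi
}

# Assumed install location; substitute the real <build agent dir>.
AGENT_DIR="${AGENT_DIR:-/opt/buildagent}"

# Intended order on the reverted machine (left commented out in this sketch):
#   set_property "$AGENT_DIR/conf/attributes.properties" ready true
#   service buildagent start
```

Updating the file before the service starts matters because the agent would otherwise come up still marked as not ready, or be picked up before the flag is set.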

Phong Trinh [25/Sep/24 02:12 AM]

Thank you very much for your suggestion, Robin!
I think I did something very similar to your suggestion. I created a step, the first step in the process, that sets the user attribute as follows:
  // RunningNode holds the name of the node that the test is running on
  def agentName = "${var.getValue('RunningNode')}";
  def agent = grid.getNode(agentName);
  def userAttributes = agent.userAttributes;
  userAttributes.put("CleanMachine", "0");
  agent.setUserAttributes(userAttributes, true);

The grid resource I am using is set to select only build agents whose attribute "CleanMachine" equals "1".

I reset the value of the attribute to 1 after the reversion. However, I still get the same issue.

Robin Shen [25/Sep/24 02:36 AM]

Is each agent running only a single test at a time? If so, the approach should work. You may check the user attribute of the agent running the test to see whether its value has changed.

Phong Trinh [25/Sep/24 03:03 AM]

Yes, that is correct. Each of the agents is running a single test at the same time.

Robin Shen [25/Sep/24 05:50 AM]

So I guess the master step of configuration "root/RegressionTest" is set to use the resource. If so, please use the logic I suggested in the pre-execute action of the master step and see if it works.

Phong Trinh [26/Sep/24 12:31 AM]

When one of the tests is running on a node, none of the other tests can run on that node. I think setting the user attribute in the pre-execute action is the same as setting it in the first step. However, I will try it and keep you informed.
Thank you for your help!

Phong Trinh [30/Sep/24 12:26 AM]

- I updated the reversion process to reset the flag/user attribute CleanMachine=1. It works for me. Thank you very much, Robin!
- Regarding the issue of dropped tests, I configured the setup to keep at least one node available in the resource group. Hopefully, QB can be made configurable to drop builds/tests when no nodes in the group are available after a period of time.

Robin Shen [30/Sep/24 11:33 AM]

Even if no nodes are available, QB does not drop tests on my side with this approach. If you can reproduce this issue with a sample database, please send me the database backup for investigation.

Phong Trinh [22/Nov/24 03:03 AM]

Thank you for your help, Robin. Please close this request. If this happens again, I will create a new ticket.