Key: QB-4001
Type: Bug
Status: Closed
Resolution: Won't Fix
Priority: Major
Assignee: Robin Shen
Reporter: Nguyen Duc Long
Votes: 0
Watchers: 0
QuickBuild

GridTaskFuture - Error testing job

Created: 11/Jul/23 04:21 AM   Updated: 23/Dec/23 01:13 AM
Component/s: None
Affects Version/s: 10.0.42
Fix Version/s: None

Original Estimate: Unknown Remaining Estimate: Unknown Time Spent: Unknown


Description
I have configured the cleaning strategy for the configuration to run every 4 days.
An agent failed for an unknown reason, but the build did not stop.
After 4 days, the build was automatically deleted by the cleaning strategy. The faulty agent has also been unauthorized.


However, I still receive the following warning every second:

2023-07-11 12:24:42,506 [pool-1-thread-7879] WARN com.pmease.quickbuild.grid.GridTaskFuture - Error testing job (job class: com.pmease.quickbuild.stepsupport.StepExecutionJob, job id: 00fdba52-a26b-4579-9099-5c408ae5fd87, build id: 8967377, job node: 21DJGD20:8814), will retry later...
2023-07-11 12:24:42,506 [pool-1-thread-7974] WARN com.pmease.quickbuild.grid.GridTaskFuture - Error testing job (job class: com.pmease.quickbuild.stepsupport.StepExecutionJob, job id: 3b3d07b1-a906-4d6e-b129-7a25280e6bc1, build id: 8966996, job node: 21DJGD20:8815), will retry later...
2023-07-11 12:24:43,507 [pool-1-thread-7879] WARN com.pmease.quickbuild.grid.GridTaskFuture - Error testing job (job class: com.pmease.quickbuild.stepsupport.StepExecutionJob, job id: 00fdba52-a26b-4579-9099-5c408ae5fd87, build id: 8967377, job node: 21DJGD20:8814), will retry later...
2023-07-11 12:24:43,507 [pool-1-thread-7974] WARN com.pmease.quickbuild.grid.GridTaskFuture - Error testing job (job class: com.pmease.quickbuild.stepsupport.StepExecutionJob, job id: 3b3d07b1-a906-4d6e-b129-7a25280e6bc1, build id: 8966996, job node: 21DJGD20:8815), will retry later...
2023-07-11 12:24:44,509 [pool-1-thread-7974] WARN com.pmease.quickbuild.grid.GridTaskFuture - Error testing job (job class: com.pmease.quickbuild.stepsupport.StepExecutionJob, job id: 3b3d07b1-a906-4d6e-b129-7a25280e6bc1, build id: 8966996, job node: 21DJGD20:8815), will retry later...
2023-07-11 12:24:44,510 [pool-1-thread-7879] WARN com.pmease.quickbuild.grid.GridTaskFuture - Error testing job (job class: com.pmease.quickbuild.stepsupport.StepExecutionJob, job id: 00fdba52-a26b-4579-9099-5c408ae5fd87, build id: 8967377, job node: 21DJGD20:8814), will retry later...
2023-07-11 12:24:45,510 [pool-1-thread-7974] WARN com.pmease.quickbuild.grid.GridTaskFuture - Error testing job (job class: com.pmease.quickbuild.stepsupport.StepExecutionJob, job id: 3b3d07b1-a906-4d6e-b129-7a25280e6bc1, build id: 8966996, job node: 21DJGD20:8815), will retry later...
2023-07-11 12:24:45,511 [pool-1-thread-7879] WARN com.pmease.quickbuild.grid.GridTaskFuture - Error testing job (job class: com.pmease.quickbuild.stepsupport.StepExecutionJob, job id: 00fdba52-a26b-4579-9099-5c408ae5fd87, build id: 8967377, job node: 21DJGD20:8814), will retry later...
2023-07-11 12:24:46,514 [pool-1-thread-7974] WARN com.pmease.quickbuild.grid.GridTaskFuture - Error testing job (job class: com.pmease.quickbuild.stepsupport.StepExecutionJob, job id: 3b3d07b1-a906-4d6e-b129-7a25280e6bc1, build id: 8966996, job node: 21DJGD20:8815), will retry later...
2023-07-11 12:24:46,515 [pool-1-thread-7879] WARN com.pmease.quickbuild.grid.GridTaskFuture - Error testing job (job class: com.pmease.quickbuild.stepsupport.StepExecutionJob, job id: 00fdba52-a26b-4579-9099-5c408ae5fd87, build id: 8967377, job node: 21DJGD20:8814), will retry later...

Comments
Robin Shen [11/Jul/23 10:39 PM]
In case of an agent error, you need to set up a build timeout so that the build gets timed out.

Nguyen Duc Long [12/Jul/23 06:48 AM]
I have set the timeout for this configuration to 90 minutes.
The build didn't stop, so I assumed it was a bug.

The main problem here is that QB keeps checking the job even though both the build and the agent have been removed.

Robin Shen [12/Jul/23 10:58 PM]
Before the build times out, QB waits for the time specified by the disconnect tolerance of the step. What is this value set to?

Robin Shen [12/Jul/23 11:00 PM]
Also, QB does not periodically check the build record in the database while a build is running, as that can stress the database. So a running build will not be aware that it has been deleted until the end of the job, when it updates the record in the database.

Luong Chu [13/Jul/23 03:17 AM]
Hello Robin Shen,
Build timeout: 6 hours or 10 hours, depending on the type of build.
Step disconnect tolerance: 150
But the problem here is that GridTaskFuture.testJobs(boolean cancel) is called continuously, and this catch block runs whenever the agent has an error:
} catch (Throwable t) {
    Date now = new Date();
    if (job.getDisconnectToleration() == 0
            || job.getLastDisconnectDate() != null && now.getTime() - job.getLastDisconnectDate().getTime() > job.getDisconnectToleration()*1000L) {
        logger.error("Error testing job (job class: {}, job id: {}, build id: {}, job node: {})",
                job.getClass().getName(), job.getId(), buildId, node.getAddress());
        job.setException(new QuickbuildException("Error testing job.", t));
        jobFinished(job, false);
    } else {
        logger.warn("Error testing job (job class: {}, job id: {}, build id: {}, job node: {}), will retry later...",
                job.getClass().getName(), job.getId(), buildId, node.getAddress());
    }
}

As far as we can tell, job.getLastDisconnectDate() always returns null, because setLastDisconnectDate() is never called anywhere. Since the disconnect tolerance is also greater than 0, the else branch always runs and the warning keeps being logged:
logger.warn("Error testing job (job class: {}, job id: {}, build id: {}, job node: {}), will retry later...",
job.getClass().getName(), job.getId(), buildId, node.getAddress());

When the agent is activated again, the job is cancelled in the try block and this error is logged:
logger.error("Unable to find job (job class: {}, job id: {}, build id: {}, job node: {})",
job.getClass().getName(), job.getId(), buildId, node.getAddress());
jobFinished(job, false);

What is the best way to stop this repeated warning and cancel the job correctly?


Robin Shen [14/Jul/23 12:32 AM]
You are using an old version. The issue relating to disconnect tolerance has been fixed since QB11. I'd suggest upgrading to the latest version periodically.

Luong Chu [14/Jul/23 03:24 AM]
Was it fixed by adding the following?
if (job.getLastDisconnectDate() == null)
    job.setLastDisconnectDate(new Date());

Robin Shen [14/Jul/23 11:08 PM]
Yes, for this part. I am not sure whether other parts were fixed as well, so I strongly suggest upgrading to the most recent version.
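
For readers following the thread, here is a minimal, self-contained sketch of the disconnect tolerance check with the null guard discussed above. It is not QuickBuild source code: the class DisconnectToleranceSketch, its nested Job class, and the onTestFailure method are hypothetical stand-ins, and only the getter and setter names are taken from the snippet quoted earlier. It assumes the tolerance is in seconds, as the multiplication by 1000L in the quoted code suggests, and it does not model what happens after a successful test.

```java
import java.util.Date;

/** Hypothetical illustration only; not part of QuickBuild. */
public class DisconnectToleranceSketch {

    /** Stand-in for the job state read by the quoted catch block. */
    static final class Job {
        private final int disconnectToleration; // seconds; 0 means fail immediately
        private Date lastDisconnectDate;        // null until a failed test is recorded

        Job(int disconnectToleration) { this.disconnectToleration = disconnectToleration; }
        int getDisconnectToleration() { return disconnectToleration; }
        Date getLastDisconnectDate() { return lastDisconnectDate; }
        void setLastDisconnectDate(Date date) { this.lastDisconnectDate = date; }
    }

    /**
     * Called each time testing the job on its node fails.
     * Returns true when the job should be marked as failed,
     * false when the caller should log a warning and retry later.
     */
    static boolean onTestFailure(Job job, Date now) {
        // Guard from the thread: remember when the disconnect was first seen.
        // Without it the timestamp stays null and the tolerance never expires.
        if (job.getLastDisconnectDate() == null)
            job.setLastDisconnectDate(now);

        long elapsedMillis = now.getTime() - job.getLastDisconnectDate().getTime();
        return job.getDisconnectToleration() == 0
                || elapsedMillis > job.getDisconnectToleration() * 1000L;
    }

    public static void main(String[] args) {
        Job job = new Job(150); // tolerance value reported in this issue
        System.out.println(onTestFailure(job, new Date(0L)));       // first failure: retry
        System.out.println(onTestFailure(job, new Date(150_000L))); // exactly at the limit: retry
        System.out.println(onTestFailure(job, new Date(150_001L))); // past the limit: fail the job
    }
}
```

The timestamp is passed in as a parameter rather than read from the system clock so that the check can be exercised without waiting; with a tolerance of 150, the sketch keeps retrying until more than 150 seconds have passed since the first recorded failure.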