
[QB-4001] Bug GridTaskFuture - Error testing job
Created: 11/Jul/23  Updated: 23/Dec/23

Status: Closed
Project: QuickBuild
Component/s: None
Affects Version/s: 10.0.42
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Nguyen Duc Long Assigned To: Robin Shen
Resolution: Won't Fix Votes: 0
Remaining Estimate: Unknown Time Spent: Unknown
Original Estimate: Unknown


 Description   
I have set the cleaning strategy for the configuration to run every 4 days.
An agent became faulty for an unknown reason, but the build did not stop.
After 4 days, the build was automatically deleted by the cleaning strategy. The faulty agent has also been unauthorized.


However, I still continuously receive an error notice every second.

2023-07-11 12:24:42,506 [pool-1-thread-7879] WARN com.pmease.quickbuild.grid.GridTaskFuture - Error testing job (job class: com.pmease.quickbuild.stepsupport.StepExecutionJob, job id: 00fdba52-a26b-4579-9099-5c408ae5fd87, build id: 8967377, job node: 21DJGD20:8814), will retry later...
2023-07-11 12:24:42,506 [pool-1-thread-7974] WARN com.pmease.quickbuild.grid.GridTaskFuture - Error testing job (job class: com.pmease.quickbuild.stepsupport.StepExecutionJob, job id: 3b3d07b1-a906-4d6e-b129-7a25280e6bc1, build id: 8966996, job node: 21DJGD20:8815), will retry later...
2023-07-11 12:24:43,507 [pool-1-thread-7879] WARN com.pmease.quickbuild.grid.GridTaskFuture - Error testing job (job class: com.pmease.quickbuild.stepsupport.StepExecutionJob, job id: 00fdba52-a26b-4579-9099-5c408ae5fd87, build id: 8967377, job node: 21DJGD20:8814), will retry later...
2023-07-11 12:24:43,507 [pool-1-thread-7974] WARN com.pmease.quickbuild.grid.GridTaskFuture - Error testing job (job class: com.pmease.quickbuild.stepsupport.StepExecutionJob, job id: 3b3d07b1-a906-4d6e-b129-7a25280e6bc1, build id: 8966996, job node: 21DJGD20:8815), will retry later...
2023-07-11 12:24:44,509 [pool-1-thread-7974] WARN com.pmease.quickbuild.grid.GridTaskFuture - Error testing job (job class: com.pmease.quickbuild.stepsupport.StepExecutionJob, job id: 3b3d07b1-a906-4d6e-b129-7a25280e6bc1, build id: 8966996, job node: 21DJGD20:8815), will retry later...
2023-07-11 12:24:44,510 [pool-1-thread-7879] WARN com.pmease.quickbuild.grid.GridTaskFuture - Error testing job (job class: com.pmease.quickbuild.stepsupport.StepExecutionJob, job id: 00fdba52-a26b-4579-9099-5c408ae5fd87, build id: 8967377, job node: 21DJGD20:8814), will retry later...
2023-07-11 12:24:45,510 [pool-1-thread-7974] WARN com.pmease.quickbuild.grid.GridTaskFuture - Error testing job (job class: com.pmease.quickbuild.stepsupport.StepExecutionJob, job id: 3b3d07b1-a906-4d6e-b129-7a25280e6bc1, build id: 8966996, job node: 21DJGD20:8815), will retry later...
2023-07-11 12:24:45,511 [pool-1-thread-7879] WARN com.pmease.quickbuild.grid.GridTaskFuture - Error testing job (job class: com.pmease.quickbuild.stepsupport.StepExecutionJob, job id: 00fdba52-a26b-4579-9099-5c408ae5fd87, build id: 8967377, job node: 21DJGD20:8814), will retry later...
2023-07-11 12:24:46,514 [pool-1-thread-7974] WARN com.pmease.quickbuild.grid.GridTaskFuture - Error testing job (job class: com.pmease.quickbuild.stepsupport.StepExecutionJob, job id: 3b3d07b1-a906-4d6e-b129-7a25280e6bc1, build id: 8966996, job node: 21DJGD20:8815), will retry later...
2023-07-11 12:24:46,515 [pool-1-thread-7879] WARN com.pmease.quickbuild.grid.GridTaskFuture - Error testing job (job class: com.pmease.quickbuild.stepsupport.StepExecutionJob, job id: 00fdba52-a26b-4579-9099-5c408ae5fd87, build id: 8967377, job node: 21DJGD20:8814), will retry later...

 Comments   
Comment by Robin Shen [ 11/Jul/23 10:39 PM ]
In case of an agent error, you will need to set up a build timeout so that the build times out.
Comment by Nguyen Duc Long [ 12/Jul/23 06:48 AM ]
I have set the timeout for this configuration to 90 min.
It didn't stop, so I assumed it was a bug.

The main problem here is that QB keeps checking the job even though the build has been deleted and the agent has been removed.
Comment by Robin Shen [ 12/Jul/23 10:58 PM ]
Before the build times out, QB will wait for the time indicated by the disconnect tolerance of the step. What value is this?
Comment by Robin Shen [ 12/Jul/23 11:00 PM ]
Also, QB does not periodically check the build record in the database while the build is running, as that can stress the database. So a running build will not be aware that it has been deleted until the end of the job, when it updates the record in the database.
Comment by Luong Chu [ 13/Jul/23 03:17 AM ]
Hello Robin Shen,
Build timeout: 6 hours or 10 hours, depending on the type of build.
Step disconnect tolerance is 150.
But the problem here is that GridTaskFuture.testJobs(boolean cancel) is called continuously, and the catch block is executed when the agent has an error:
} catch (Throwable t) {
    Date now = new Date();
    if (job.getDisconnectToleration() == 0
            || job.getLastDisconnectDate() != null && now.getTime() - job.getLastDisconnectDate().getTime() > job.getDisconnectToleration()*1000L) {
        logger.error("Error testing job (job class: {}, job id: {}, build id: {}, job node: {})",
                job.getClass().getName(), job.getId(), buildId, node.getAddress());
        job.setException(new QuickbuildException("Error testing job.", t));
        jobFinished(job, false);
    } else {
        logger.warn("Error testing job (job class: {}, job id: {}, build id: {}, job node: {}), will retry later...",
                job.getClass().getName(), job.getId(), buildId, node.getAddress());
    }
}

As far as we can tell, job.getLastDisconnectDate() always returns null because setLastDisconnectDate() is not called anywhere. Since the disconnect tolerance is also > 0, the warning keeps being logged:
logger.warn("Error testing job (job class: {}, job id: {}, build id: {}, job node: {}), will retry later...",
        job.getClass().getName(), job.getId(), buildId, node.getAddress());

When the agent is activated again, the job is cancelled in the try block and this error is logged:
logger.error("Unable to find job (job class: {}, job id: {}, build id: {}, job node: {})",
        job.getClass().getName(), job.getId(), buildId, node.getAddress());
jobFinished(job, false);

What is the best way to avoid this repeated warning and cancel the job correctly?

Comment by Robin Shen [ 14/Jul/23 12:32 AM ]
You are using an old version. The issue relating to disconnect tolerance has been fixed since QB11. I'd suggest upgrading to the latest version periodically.
Comment by Luong Chu [ 14/Jul/23 03:24 AM ]
Is it fixed by adding
if (job.getLastDisconnectDate() == null)
    job.setLastDisconnectDate(new Date()); ?
Comment by Robin Shen [ 14/Jul/23 11:08 PM ]
Yes, for this part. I am not sure whether other parts were fixed as well, so I strongly suggest upgrading to the most recent version.
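
For readers following the thread, the effect of that two-line change on the tolerance check can be sketched in isolation. This is a minimal, self-contained illustration and not the QuickBuild source: the Job class, shouldFail, checksUntilFailure and the recordFirstFailure flag are hypothetical names introduced here, and it assumes the timestamp is recorded just before the condition is evaluated and that the check runs once per second.

```java
import java.util.Date;

public class DisconnectToleranceSketch {

    /** Hypothetical stand-in for the job; holds only the two values the check reads. */
    static class Job {
        private final int disconnectToleration; // seconds; 0 means fail on first error
        private Date lastDisconnectDate;        // null until a failure is recorded

        Job(int disconnectToleration) { this.disconnectToleration = disconnectToleration; }
        int getDisconnectToleration() { return disconnectToleration; }
        Date getLastDisconnectDate() { return lastDisconnectDate; }
        void setLastDisconnectDate(Date date) { lastDisconnectDate = date; }
    }

    /**
     * Returns true if the job should be failed, false if it should be retried.
     * recordFirstFailure switches the two-line change from the thread on or off.
     */
    static boolean shouldFail(Job job, Date now, boolean recordFirstFailure) {
        if (recordFirstFailure && job.getLastDisconnectDate() == null)
            job.setLastDisconnectDate(now);
        // Same condition as the quoted catch block, with explicit parentheses.
        return job.getDisconnectToleration() == 0
                || (job.getLastDisconnectDate() != null
                    && now.getTime() - job.getLastDisconnectDate().getTime()
                        > job.getDisconnectToleration() * 1000L);
    }

    /** Simulates one failing check per second; returns the check index that fails the job, or -1. */
    static int checksUntilFailure(int tolerationSeconds, boolean recordFirstFailure, int maxChecks) {
        Job job = new Job(tolerationSeconds);
        for (int i = 0; i < maxChecks; i++) {
            Date now = new Date(i * 1000L); // simulated clock, one second per check
            if (shouldFail(job, now, recordFirstFailure))
                return i;
        }
        return -1; // the job was never failed: the warning would repeat indefinitely
    }

    public static void main(String[] args) {
        System.out.println(checksUntilFailure(150, false, 1000)); // prints -1
        System.out.println(checksUntilFailure(150, true, 1000));  // prints 151
    }
}
```

With a tolerance of 150 and no recorded disconnect date, the simulated job is never failed, which matches the endless "will retry later..." warnings in the description. Once the first failure is recorded, the job is failed on the first check made more than 150 seconds later.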
Generated at Thu May 16 11:39:31 UTC 2024 using JIRA 189.