<< Back to previous view

[QB-3126] Step is cancelled but unknown reason
Created: 24/Jan/18  Updated: 19/Jan/19

Status: Closed
Project: QuickBuild
Component/s: None
Affects Version/s: 7.0.28
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Piotr Jedrys Assigned To: Robin Shen
Resolution: Cannot Reproduce Votes: 0
Remaining Estimate: Unknown Time Spent: Unknown
Original Estimate: Unknown
Environment: Linux


 Description   
Again we got the problem that a step is cancelled, but we cannot see who and why cancelled the step. No logs in audit log of the node, no logs in audit log, no logs in server log which points out who cancelled the build.

04:07:05,896 INFO - Terminating launched command gracefully...
04:07:05,909 WARN - *** Init system aborted.
04:07:05,956 WARN - write /dev/stdout: broken pipe
04:07:06,002 INFO - Executing post-execute action...
04:07:06,003 ERROR - Step 'xxx>test - run seq>test - 1 run' is failed.
    java.lang.RuntimeException: java.lang.InterruptedException
        at com.pmease.quickbuild.execution.Commandline.execute(Commandline.java:383)
        at com.pmease.quickbuild.plugin.basis.CommandBuildStep.run(CommandBuildStep.java:215)
        at com.pmease.quickbuild.plugin.basis.CommandBuildStep$$EnhancerByCGLIB$$9ac4f79c.CGLIB$run$12(<generated>)
        at com.pmease.quickbuild.plugin.basis.CommandBuildStep$$EnhancerByCGLIB$$9ac4f79c$$FastClassByCGLIB$$71371017.invoke(<generated>)
        at net.sf.cglib.proxy.MethodProxy.invokeSuper(MethodProxy.java:215)
        at com.pmease.quickbuild.DefaultScriptEngine$Interpolator.intercept(DefaultScriptEngine.java:261)
        at com.pmease.quickbuild.plugin.basis.CommandBuildStep$$EnhancerByCGLIB$$9ac4f79c.run(<generated>)
        at com.pmease.quickbuild.stepsupport.Step.doExecute(Step.java:619)
        at com.pmease.quickbuild.stepsupport.Step.execute(Step.java:521)
        at com.pmease.quickbuild.stepsupport.StepExecutionJob.executeStepAwareJob(StepExecutionJob.java:31)
        at com.pmease.quickbuild.stepsupport.StepAwareJob.executeBuildAwareJob(StepAwareJob.java:45)
        at com.pmease.quickbuild.BuildAwareJob.execute(BuildAwareJob.java:60)
        at com.pmease.quickbuild.grid.GridJob.run(GridJob.java:107)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:473)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1152)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622)
        at java.lang.Thread.run(Thread.java:748)
    Caused by: java.lang.InterruptedException
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:503)
        at java.lang.UNIXProcess.waitFor(UNIXProcess.java:264)
        at com.pmease.quickbuild.execution.Commandline.execute(Commandline.java:353)
        ... 17 more

 Comments   
Comment by Piotr Jedrys [ 24/Jan/18 11:58 AM ]
Finally found it - we get 502 response from QB server.

Comment by Piotr Jedrys [ 24/Jan/18 12:00 PM ]
jvm 1 | 2018-01-23 03:07:05,882 WARN - Error connecting task node 'qb:8810', will cancel running job...
jvm 1 | com.caucho.hessian.client.HessianConnectionException: 502: java.io.IOException: Server returned HTTP response code: 502 for URL: http://qb/service/node
jvm 1 | at com.caucho.hessian.client.HessianURLConnection.sendRequest(HessianURLConnection.java:165)
jvm 1 | at com.caucho.hessian.client.HessianProxy.sendRequest(HessianProxy.java:300)
jvm 1 | at com.caucho.hessian.client.HessianProxy.invoke(HessianProxy.java:171)
jvm 1 | at com.sun.proxy.$Proxy47.isGridJobActive(Unknown Source)
jvm 1 | at com.pmease.quickbuild.grid.GridJob$1.run(GridJob.java:85)
jvm 1 | at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1152)
jvm 1 | at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622)
jvm 1 | at java.lang.Thread.run(Thread.java:748)
jvm 1 | Caused by: java.io.IOException: Server returned HTTP response code: 502 for URL: http://qb/service/node
jvm 1 | at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
jvm 1 | at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
jvm 1 | at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
jvm 1 | at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
jvm 1 | at sun.net.www.protocol.http.HttpURLConnection$7.run(HttpURLConnection.java:1725)
jvm 1 | at sun.net.www.protocol.http.HttpURLConnection$7.run(HttpURLConnection.java:1723)
jvm 1 | at java.security.AccessController.doPrivileged(Native Method)
jvm 1 | at sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1721)
jvm 1 | at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1292)
jvm 1 | at com.caucho.hessian.client.HessianURLConnection.sendRequest(HessianURLConnection.java:145)
jvm 1 | ... 7 more
jvm 1 | Caused by: java.io.IOException: Server returned HTTP response code: 502 for URL: http://qb/service/node
jvm 1 | at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1676)
jvm 1 | at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:468)
jvm 1 | at com.caucho.hessian.client.HessianURLConnection.sendRequest(HessianURLConnection.java:132)
jvm 1 | ... 7 more


Can you explain why server returned 502. What should we check/fix
Comment by Robin Shen [ 24/Jan/18 11:38 PM ]
This status code seems odd. Can you please check QB server log to see if there is any errors relating to this during this time period?
Comment by Piotr Jedrys [ 25/Jan/18 12:23 PM ]
Only a warning with this :

2018-01-23 03:07:05,949 [pool-1-thread-14980] WARN com.pmease.quickbuild.grid.GridTaskFuture - Job still exists on job node and cancel command is issued (job id: f27d88f1-7bfd-42cd-a934-21c6143b9b70, bu
ild id: 10140887, job node: agent:8811)...
2018-01-23 03:07:05,987 [pool-1-thread-15517] WARN com.pmease.quickbuild.grid.GridTaskFuture - Job still exists on job node and cancel command is issued (job id: 2b29eb96-1581-474a-bd6f-4e99e02e3b78, bu
ild id: 10140889, job node: agent:8811)...
2018-01-23 03:07:06,828 [qtp536855282-165183] WARN org.eclipse.jetty.servlet.ServletHandler - /service/server
 com.pmease.quickbuild.AccessDeniedException: Access denied for connection from '172.18.0.4'.
        at com.pmease.quickbuild.security.SecurityHelper.checkAgentToken(SecurityHelper.java:757)
        at com.pmease.quickbuild.grid.GridServlet.service(GridServlet.java:34)
        at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:812)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1669)
        at com.pmease.quickbuild.Quickbuild$DisableTraceFilter.doFilter(Quickbuild.java:1076)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
        at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
        at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:221)
        at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
        at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
        at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
        at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
        at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
        at org.eclipse.jetty.server.Server.handle(Server.java:499)
        at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:311)
        at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:258)
        at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:544)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
        at java.lang.Thread.run(Thread.java:748)
Comment by Robin Shen [ 25/Jan/18 11:05 PM ]
Do you have any form of proxy between QB server and agent?
Comment by Piotr Jedrys [ 26/Jan/18 08:13 AM ]
yes, there is an nginx set up as a proxy (no buffering enabled on it though)
Comment by Robin Shen [ 26/Jan/18 11:52 PM ]
Please check at Nginx side as it seems that the 405 message is not reported by QB server.
Comment by Piotr Jedrys [ 27/Jan/18 08:41 PM ]
No 405 messages in nginx . At that time only these:

10.117.225.47 - - [23/Jan/2018:03:06:33 +0000] "POST /service/connect HTTP/1.1" 200 219 "-" "Java/1.7.0_121" "-"
10.117.225.50 - - [23/Jan/2018:03:06:33 +0000] "POST /service/server HTTP/1.1" 200 73 "-" "Java/1.7.0_121" "-"
10.66.205.31 - - [23/Jan/2018:03:07:05 +0000] "POST /service/node HTTP/1.1" 502 173 "-" "Java/1.7.0_151" "-"
10.66.110.77 - - [23/Jan/2018:03:07:05 +0000] "POST /service/server HTTP/1.1" 502 173 "-" "Java/1.7.0_121" "-"
10.66.185.224 - - [23/Jan/2018:03:07:05 +0000] "POST /service/node HTTP/1.1" 502 173 "-" "Java/1.7.0_121" "-"
10.117.225.50 - - [23/Jan/2018:03:07:05 +0000] "POST /service/node HTTP/1.1" 502 173 "-" "Java/1.7.0_121" "-"
10.66.225.242 - - [23/Jan/2018:03:07:05 +0000] "POST /service/node HTTP/1.1" 502 173 "-" "Java/1.7.0_121" "-"
10.66.104.181 - - [23/Jan/2018:03:07:05 +0000] "POST /service/server HTTP/1.1" 502 173 "-" "Java/1.7.0_121" "-"
10.66.193.118 - - [23/Jan/2018:03:07:05 +0000] "POST /service/server HTTP/1.1" 502 173 "-" "Java/1.7.0_121" "-"
10.66.152.89 - - [23/Jan/2018:03:07:05 +0000] "POST /service/node HTTP/1.1" 502 173 "-" "Java/1.7.0_151" "-"
10.111.8.183 - - [23/Jan/2018:03:07:05 +0000] "GET /wicket/page?38-1.IBehaviorListener.0-&action=ajaxRefresh&random=0.10830004480240407 HTTP/1.1" 502 575 "http://qb/build/10139551/junit_report" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36" "-"
10.66.127.110 - - [23/Jan/2018:03:07:05 +0000] "POST /service/server HTTP/1.1" 502 173 "-" "Java/1.7.0_121" "-"
10.66.104.173 - - [23/Jan/2018:03:07:05 +0000] "POST /service/node HTTP/1.1" 502 173 "-" "Java/1.7.0_121" "-"
10.66.104.173 - - [23/Jan/2018:03:07:05 +0000] "POST /service/connect HTTP/1.1" 200 219 "-" "Java/1.7.0_121" "-"
10.66.193.128 - - [23/Jan/2018:03:07:05 +0000] "POST /service/connect HTTP/1.1" 200 219 "-" "Java/1.7.0_121" "-"
Comment by Robin Shen [ 27/Jan/18 11:36 PM ]
This seems quite odd. Are you able to reproduce this issue?
Comment by Piotr Jedrys [ 28/Jan/18 03:02 PM ]
It is a random issue on our production server, (I cannot reproduce it locally), which seems like a problem of overloaded QB. Any suggestions?
Comment by Robin Shen [ 29/Jan/18 12:18 AM ]
If this happens when server is overloaded, please reduce server load to see if the situation is better. Several ways to reduce server load:
1. make sure all master step runs on build agents
2. if possible, manage artifacts on a build agent by using "store artifacts on specified agent" on advanced setting of the root configuration
Comment by Piotr Jedrys [ 29/Jan/18 10:31 AM ]
We already had both this points done.
All steps run on build agents and we store artifacts on specified agents not on master.
Comment by Robin Shen [ 30/Jan/18 12:07 AM ]
How many build agents do you have and how many builds are running at peak time?
Comment by Piotr Jedrys [ 30/Jan/18 10:35 AM ]
120 agents, and from what I could see around 50-60 builds (but that is hard to measure, maybe you know exact command how we can check how many builds are running)
Comment by Robin Shen [ 30/Jan/18 11:56 PM ]
You may check the queue to see how many of them are running when this issue happens. I can not find any clue why this happens. Can you please telnet against the server port from the problematic agent to see if there are any network connection issues to server?
Generated at Sat May 04 23:40:57 UTC 2024 using JIRA 189.