History | Log In     View a printable version of the current page.  
Issue Details (XML | Word | Printable)

Key: QB-3126
Type: Bug Bug
Status: Closed Closed
Resolution: Cannot Reproduce
Priority: Blocker Blocker
Assignee: Robin Shen
Reporter: Piotr Jedrys
Votes: 0
Watchers: 0
Operations

If you were logged in you would be able to see more operations.
QuickBuild

Step is cancelled but unknown reason

Created: 24/Jan/18 11:35 AM   Updated: 19/Jan/19 01:09 PM
Component/s: None
Affects Version/s: 7.0.28
Fix Version/s: None

Original Estimate: Unknown Remaining Estimate: Unknown Time Spent: Unknown
Environment: Linux


 Description  « Hide
Again we got the problem that a step is cancelled, but we cannot see who and why cancelled the step. No logs in audit log of the node, no logs in audit log, no logs in server log which points out who cancelled the build.

04:07:05,896 INFO - Terminating launched command gracefully...
04:07:05,909 WARN - *** Init system aborted.
04:07:05,956 WARN - write /dev/stdout: broken pipe
04:07:06,002 INFO - Executing post-execute action...
04:07:06,003 ERROR - Step 'xxx>test - run seq>test - 1 run' is failed.
    java.lang.RuntimeException: java.lang.InterruptedException
        at com.pmease.quickbuild.execution.Commandline.execute(Commandline.java:383)
        at com.pmease.quickbuild.plugin.basis.CommandBuildStep.run(CommandBuildStep.java:215)
        at com.pmease.quickbuild.plugin.basis.CommandBuildStep$$EnhancerByCGLIB$$9ac4f79c.CGLIB$run$12(<generated>)
        at com.pmease.quickbuild.plugin.basis.CommandBuildStep$$EnhancerByCGLIB$$9ac4f79c$$FastClassByCGLIB$$71371017.invoke(<generated>)
        at net.sf.cglib.proxy.MethodProxy.invokeSuper(MethodProxy.java:215)
        at com.pmease.quickbuild.DefaultScriptEngine$Interpolator.intercept(DefaultScriptEngine.java:261)
        at com.pmease.quickbuild.plugin.basis.CommandBuildStep$$EnhancerByCGLIB$$9ac4f79c.run(<generated>)
        at com.pmease.quickbuild.stepsupport.Step.doExecute(Step.java:619)
        at com.pmease.quickbuild.stepsupport.Step.execute(Step.java:521)
        at com.pmease.quickbuild.stepsupport.StepExecutionJob.executeStepAwareJob(StepExecutionJob.java:31)
        at com.pmease.quickbuild.stepsupport.StepAwareJob.executeBuildAwareJob(StepAwareJob.java:45)
        at com.pmease.quickbuild.BuildAwareJob.execute(BuildAwareJob.java:60)
        at com.pmease.quickbuild.grid.GridJob.run(GridJob.java:107)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:473)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1152)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622)
        at java.lang.Thread.run(Thread.java:748)
    Caused by: java.lang.InterruptedException
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:503)
        at java.lang.UNIXProcess.waitFor(UNIXProcess.java:264)
        at com.pmease.quickbuild.execution.Commandline.execute(Commandline.java:353)
        ... 17 more

 All   Comments   Work Log   Change History      Sort Order:
Piotr Jedrys [24/Jan/18 11:58 AM]
Finally found it - we get 502 response from QB server.


Piotr Jedrys [24/Jan/18 12:00 PM]
jvm 1 | 2018-01-23 03:07:05,882 WARN - Error connecting task node 'qb:8810', will cancel running job...
jvm 1 | com.caucho.hessian.client.HessianConnectionException: 502: java.io.IOException: Server returned HTTP response code: 502 for URL: http://qb/service/node
jvm 1 | at com.caucho.hessian.client.HessianURLConnection.sendRequest(HessianURLConnection.java:165)
jvm 1 | at com.caucho.hessian.client.HessianProxy.sendRequest(HessianProxy.java:300)
jvm 1 | at com.caucho.hessian.client.HessianProxy.invoke(HessianProxy.java:171)
jvm 1 | at com.sun.proxy.$Proxy47.isGridJobActive(Unknown Source)
jvm 1 | at com.pmease.quickbuild.grid.GridJob$1.run(GridJob.java:85)
jvm 1 | at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1152)
jvm 1 | at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622)
jvm 1 | at java.lang.Thread.run(Thread.java:748)
jvm 1 | Caused by: java.io.IOException: Server returned HTTP response code: 502 for URL: http://qb/service/node
jvm 1 | at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
jvm 1 | at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
jvm 1 | at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
jvm 1 | at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
jvm 1 | at sun.net.www.protocol.http.HttpURLConnection$7.run(HttpURLConnection.java:1725)
jvm 1 | at sun.net.www.protocol.http.HttpURLConnection$7.run(HttpURLConnection.java:1723)
jvm 1 | at java.security.AccessController.doPrivileged(Native Method)
jvm 1 | at sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1721)
jvm 1 | at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1292)
jvm 1 | at com.caucho.hessian.client.HessianURLConnection.sendRequest(HessianURLConnection.java:145)
jvm 1 | ... 7 more
jvm 1 | Caused by: java.io.IOException: Server returned HTTP response code: 502 for URL: http://qb/service/node
jvm 1 | at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1676)
jvm 1 | at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:468)
jvm 1 | at com.caucho.hessian.client.HessianURLConnection.sendRequest(HessianURLConnection.java:132)
jvm 1 | ... 7 more


Can you explain why server returned 502. What should we check/fix

Robin Shen [24/Jan/18 11:38 PM]
This status code seems odd. Can you please check QB server log to see if there is any errors relating to this during this time period?

Piotr Jedrys [25/Jan/18 12:23 PM]
Only a warning with this :

2018-01-23 03:07:05,949 [pool-1-thread-14980] WARN com.pmease.quickbuild.grid.GridTaskFuture - Job still exists on job node and cancel command is issued (job id: f27d88f1-7bfd-42cd-a934-21c6143b9b70, bu
ild id: 10140887, job node: agent:8811)...
2018-01-23 03:07:05,987 [pool-1-thread-15517] WARN com.pmease.quickbuild.grid.GridTaskFuture - Job still exists on job node and cancel command is issued (job id: 2b29eb96-1581-474a-bd6f-4e99e02e3b78, bu
ild id: 10140889, job node: agent:8811)...
2018-01-23 03:07:06,828 [qtp536855282-165183] WARN org.eclipse.jetty.servlet.ServletHandler - /service/server
 com.pmease.quickbuild.AccessDeniedException: Access denied for connection from '172.18.0.4'.
        at com.pmease.quickbuild.security.SecurityHelper.checkAgentToken(SecurityHelper.java:757)
        at com.pmease.quickbuild.grid.GridServlet.service(GridServlet.java:34)
        at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:812)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1669)
        at com.pmease.quickbuild.Quickbuild$DisableTraceFilter.doFilter(Quickbuild.java:1076)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
        at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
        at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:221)
        at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
        at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
        at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
        at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
        at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
        at org.eclipse.jetty.server.Server.handle(Server.java:499)
        at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:311)
        at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:258)
        at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:544)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
        at java.lang.Thread.run(Thread.java:748)

Robin Shen [25/Jan/18 11:05 PM]
Do you have any form of proxy between QB server and agent?

Piotr Jedrys [26/Jan/18 08:13 AM]
yes, there is an nginx set up as a proxy (no buffering enabled on it though)

Robin Shen [26/Jan/18 11:52 PM]
Please check at Nginx side as it seems that the 405 message is not reported by QB server.

Piotr Jedrys [27/Jan/18 08:41 PM]
No 405 messages in nginx . At that time only these:

10.117.225.47 - - [23/Jan/2018:03:06:33 +0000] "POST /service/connect HTTP/1.1" 200 219 "-" "Java/1.7.0_121" "-"
10.117.225.50 - - [23/Jan/2018:03:06:33 +0000] "POST /service/server HTTP/1.1" 200 73 "-" "Java/1.7.0_121" "-"
10.66.205.31 - - [23/Jan/2018:03:07:05 +0000] "POST /service/node HTTP/1.1" 502 173 "-" "Java/1.7.0_151" "-"
10.66.110.77 - - [23/Jan/2018:03:07:05 +0000] "POST /service/server HTTP/1.1" 502 173 "-" "Java/1.7.0_121" "-"
10.66.185.224 - - [23/Jan/2018:03:07:05 +0000] "POST /service/node HTTP/1.1" 502 173 "-" "Java/1.7.0_121" "-"
10.117.225.50 - - [23/Jan/2018:03:07:05 +0000] "POST /service/node HTTP/1.1" 502 173 "-" "Java/1.7.0_121" "-"
10.66.225.242 - - [23/Jan/2018:03:07:05 +0000] "POST /service/node HTTP/1.1" 502 173 "-" "Java/1.7.0_121" "-"
10.66.104.181 - - [23/Jan/2018:03:07:05 +0000] "POST /service/server HTTP/1.1" 502 173 "-" "Java/1.7.0_121" "-"
10.66.193.118 - - [23/Jan/2018:03:07:05 +0000] "POST /service/server HTTP/1.1" 502 173 "-" "Java/1.7.0_121" "-"
10.66.152.89 - - [23/Jan/2018:03:07:05 +0000] "POST /service/node HTTP/1.1" 502 173 "-" "Java/1.7.0_151" "-"
10.111.8.183 - - [23/Jan/2018:03:07:05 +0000] "GET /wicket/page?38-1.IBehaviorListener.0-&action=ajaxRefresh&random=0.10830004480240407 HTTP/1.1" 502 575 "http://qb/build/10139551/junit_report" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36" "-"
10.66.127.110 - - [23/Jan/2018:03:07:05 +0000] "POST /service/server HTTP/1.1" 502 173 "-" "Java/1.7.0_121" "-"
10.66.104.173 - - [23/Jan/2018:03:07:05 +0000] "POST /service/node HTTP/1.1" 502 173 "-" "Java/1.7.0_121" "-"
10.66.104.173 - - [23/Jan/2018:03:07:05 +0000] "POST /service/connect HTTP/1.1" 200 219 "-" "Java/1.7.0_121" "-"
10.66.193.128 - - [23/Jan/2018:03:07:05 +0000] "POST /service/connect HTTP/1.1" 200 219 "-" "Java/1.7.0_121" "-"

Robin Shen [27/Jan/18 11:36 PM]
This seems quite odd. Are you able to reproduce this issue?

Piotr Jedrys [28/Jan/18 03:02 PM]
It is a random issue on our production server, (I cannot reproduce it locally), which seems like a problem of overloaded QB. Any suggestions?

Robin Shen [29/Jan/18 12:18 AM]
If this happens when server is overloaded, please reduce server load to see if the situation is better. Several ways to reduce server load:
1. make sure all master step runs on build agents
2. if possible, manage artifacts on a build agent by using "store artifacts on specified agent" on advanced setting of the root configuration

Piotr Jedrys [29/Jan/18 10:31 AM]
We already had both this points done.
All steps run on build agents and we store artifacts on specified agents not on master.

Robin Shen [30/Jan/18 12:07 AM]
How many build agents do you have and how many builds are running at peak time?

Piotr Jedrys [30/Jan/18 10:35 AM]
120 agents, and from what I could see around 50-60 builds (but that is hard to measure, maybe you know exact command how we can check how many builds are running)

Robin Shen [30/Jan/18 11:56 PM]
You may check the queue to see how many of them are running when this issue happens. I can not find any clue why this happens. Can you please telnet against the server port from the problematic agent to see if there are any network connection issues to server?