[QB-949] Build requests hanging in CHECKING_BUILD_CONDITION state because socket readTimeout is not set
Status: | Resolved |
Project: | QuickBuild |
Component/s: | None |
Affects Version/s: | 3.1.45 |
Fix Version/s: | 3.1.48 |
Type: | Bug | Priority: | Critical |
Reporter: | Rene Raasuke | Assigned To: | Unassigned |
Resolution: | Fixed | Votes: | 0 |
Remaining Estimate: | Unknown | Time Spent: | Unknown |
Original Estimate: | Unknown |
Environment: | Debian Squeeze x86-64 |
File Attachments: | blocked_queue.png |
Description |
We have a situation where a build request hangs in the CHECKING_BUILD_CONDITION state while the build with the same build version appears to have finished already.
After digging through the heap dump I found that a thread processing that request is stuck at a socket read:

"Thread-5248962" daemon prio=10 tid=0x00007f21800cb000 nid=0x1924 runnable [0x00007f217f848000]
   java.lang.Thread.State: RUNNABLE
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.read(SocketInputStream.java:129)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
    - locked <0x0000000708777a90> (a java.io.BufferedInputStream)
    at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:687)
    at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:632)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1195)
    - locked <0x0000000708777ad0> (a sun.net.www.protocol.http.HttpURLConnection)
    at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:379)
    at com.caucho.hessian.client.HessianProxy.invoke(HessianProxy.java:166)
    at $Proxy100.cacheBuildStatus(Unknown Source)
    at com.pmease.quickbuild.DefaultBuildEngine.cacheBuildStatusInGrid(DefaultBuildEngine.java:1275)
    at com.pmease.quickbuild.DefaultBuildEngine.process(DefaultBuildEngine.java:286)
    at com.pmease.quickbuild.DefaultBuildEngine.access$1(DefaultBuildEngine.java:242)
    at com.pmease.quickbuild.DefaultBuildEngine$2.run(DefaultBuildEngine.java:753)
    at java.lang.Thread.run(Thread.java:662)

Checking the parameters of the connection QB uses to communicate with the grid, I found that you have set connectionTimeout but no readTimeout, meaning the call will wait forever if a node disappears in the middle of a socket session without closing the connection properly. We cannot assume the network to be 100% stable, given the distributed nature of our grid.

What's most annoying is that it's impossible to remove the hanging build from the queue. It just does not go away. The only way to get rid of it is to restart QB, which we cannot afford to do very often. |
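For illustration, the difference between the two timeouts can be reproduced with plain java.net.HttpURLConnection, the class that appears in the stack trace. This is only a minimal sketch and not QuickBuild's actual code: the class name ReadTimeoutDemo, the helper fetchStatus, and the silent local server are all assumptions made up for the example. The server completes the TCP handshake but never answers, which is roughly what a vanished grid node looks like to the caller.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.InetAddress;
import java.net.Proxy;
import java.net.ServerSocket;
import java.net.SocketTimeoutException;
import java.net.URI;
import java.net.URL;

// Hypothetical demo class, not part of QuickBuild or Hessian.
class ReadTimeoutDemo {

    /**
     * Sends a GET request with both timeouts set.
     * Returns the HTTP status code, or -1 if a timeout expired
     * (connect or read) before a response arrived.
     */
    static int fetchStatus(URL url, int connectTimeoutMs, int readTimeoutMs) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection(Proxy.NO_PROXY);
        conn.setConnectTimeout(connectTimeoutMs); // bounds the TCP connect only
        conn.setReadTimeout(readTimeoutMs);       // bounds each blocking read; 0 means wait forever
        try {
            return conn.getResponseCode();
        } catch (SocketTimeoutException e) {
            // Without a read timeout, a silent peer would block here indefinitely.
            return -1;
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // The OS completes the handshake for a listening socket even though
        // accept() is never called, so the connect succeeds and the read stalls.
        try (ServerSocket silent = new ServerSocket(0, 1, InetAddress.getByName("127.0.0.1"))) {
            URL url = URI.create("http://127.0.0.1:" + silent.getLocalPort() + "/").toURL();
            int status = fetchStatus(url, 1000, 500);
            System.out.println(status == -1 ? "read timed out" : "status " + status);
        }
    }
}
```

With the setReadTimeout call removed (or set to 0), the same request would block in socketRead0 just like the thread above.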
Comments |
Comment by Rene Raasuke [ 14/Jun/11 02:14 PM ] |
Attached a screenshot of the situation as well |
Comment by Robin Shen [ 15/Jun/11 12:34 AM ] |
We previously set a socket read timeout, but received frequent bug reports of builds hitting it. Unlike a socket connect, a socket read may take a very long time because of the current remote call mechanism in QB (for example, when a build resolves a dependency on another build, the call simply waits for that build to complete). So we temporarily removed the socket read timeout in QB3. As we are redesigning the grid system in QB4 to use asynchronous remote calls (a remote call will not wait for a build to finish; instead, the finished build will notify the consumer), a read timeout will become feasible, since each remote call is expected to finish in a short time. |
Comment by Rene Raasuke [ 15/Jun/11 04:44 AM ] |
I'm afraid we can't wait for QB4 with this one. We get several incidents per week. We'd rather lose a build request than watch the queue fill up with thousands of builds, eventually bringing the whole grid to a halt.
What I suggest you do: 1) set the timeout to a big enough number; it does not have to be 30 seconds. Even 20 minutes would do. 2) and/or make the timeout configurable so we can set it for ourselves only. |
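A rough sketch of what both suggestions combined could look like: a generous 20-minute default that can be overridden per installation. The class GridTimeoutConfig and the system property name grid.readTimeoutSeconds are hypothetical and chosen only for this example; they are not existing QuickBuild settings.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of QuickBuild.
class GridTimeoutConfig {

    // Assumed property name for illustration only.
    static final String READ_TIMEOUT_PROPERTY = "grid.readTimeoutSeconds";

    // Default of 20 minutes, as suggested above.
    static final int DEFAULT_READ_TIMEOUT_SECONDS = 20 * 60;

    /**
     * Returns the read timeout in milliseconds, suitable for
     * URLConnection.setReadTimeout(int). Falls back to the default when
     * the property is unset, not a number, or negative. A value of 0 is
     * passed through and means "no timeout" to URLConnection.
     */
    static int readTimeoutMillis() {
        int seconds = DEFAULT_READ_TIMEOUT_SECONDS;
        String raw = System.getProperty(READ_TIMEOUT_PROPERTY);
        if (raw != null) {
            try {
                int parsed = Integer.parseInt(raw.trim());
                if (parsed >= 0) {
                    seconds = parsed;
                }
            } catch (NumberFormatException e) {
                // Ignore malformed values and keep the default.
            }
        }
        // Clamp so that very large values do not overflow int milliseconds.
        return (int) Math.min(Integer.MAX_VALUE, TimeUnit.SECONDS.toMillis(seconds));
    }
}
```

Under these assumptions, starting the JVM with -Dgrid.readTimeoutSeconds=3600 would raise the limit to one hour for a single site, while everyone else keeps the default.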
Comment by Robin Shen [ 15/Jun/11 11:26 PM ] |
OK, I understand. Will add this back into QB3, probably with a configurable read timeout. |