Key: QB-949
Type: Bug
Status: Resolved
Resolution: Fixed
Priority: Critical
Assignee: Unassigned
Reporter: Rene Raasuke
Votes: 0
Watchers: 0

QuickBuild

Build requests hanging in CHECKING_BUILD_CONDITION state because socket readTimeout is not set

Created: 14/Jun/11 02:12 PM   Updated: 18/Jun/11 12:13 PM
Component/s: None
Affects Version/s: 3.1.45
Fix Version/s: 3.1.48

Original Estimate: Unknown Remaining Estimate: Unknown Time Spent: Unknown
File Attachments: None
Image Attachments:

1. blocked_queue.png (52 kb)
Environment: Debian Squeeze x86-64


Description
We have a situation where a build request hangs in the CHECKING_BUILD_CONDITION state while the build with the same build version appears to have finished already.

After digging through the heap dump, I found that the thread processing that request is stuck in a socket read:
"Thread-5248962" daemon prio=10 tid=0x00007f21800cb000 nid=0x1924 runnable [0x00007f217f848000]
   java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
- locked <0x0000000708777a90> (a java.io.BufferedInputStream)
at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:687)
at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:632)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1195)
- locked <0x0000000708777ad0> (a sun.net.www.protocol.http.HttpURLConnection)
at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:379)
at com.caucho.hessian.client.HessianProxy.invoke(HessianProxy.java:166)
at $Proxy100.cacheBuildStatus(Unknown Source)
at com.pmease.quickbuild.DefaultBuildEngine.cacheBuildStatusInGrid(DefaultBuildEngine.java:1275)
at com.pmease.quickbuild.DefaultBuildEngine.process(DefaultBuildEngine.java:286)
at com.pmease.quickbuild.DefaultBuildEngine.access$1(DefaultBuildEngine.java:242)
at com.pmease.quickbuild.DefaultBuildEngine$2.run(DefaultBuildEngine.java:753)
at java.lang.Thread.run(Thread.java:662)

Checking the connection parameters QB uses to communicate with the grid, I found that connectionTimeout is set but readTimeout is not, so the thread will wait forever if a node disappears in the middle of a socket session without closing the connection properly. We cannot assume the network is 100% stable, given the distributed nature of our grid.
Most annoying of all, it is impossible to remove the hanging build from the queue; it simply does not go away. The only way to get rid of it is to restart QB, which we cannot afford to do very often.
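
To illustrate the difference between the two timeouts, here is a minimal Java sketch, not QuickBuild's actual code. It talks to a local server socket that accepts the TCP connection but never answers, which roughly imitates a node that vanishes mid-session. The class and method names (ReadTimeoutSketch, open, timesOutAgainstSilentPeer) are made up for this example; only the java.net calls are real.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.ServerSocket;
import java.net.SocketTimeoutException;
import java.net.URI;
import java.net.URL;

final class ReadTimeoutSketch {

    /** Opens an HTTP connection with both timeouts set (milliseconds). */
    static HttpURLConnection open(URL url, int connectTimeoutMs, int readTimeoutMs)
            throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setConnectTimeout(connectTimeoutMs); // limits the TCP connect only
        conn.setReadTimeout(readTimeoutMs);       // limits each blocking read; 0 means wait forever
        return conn;
    }

    /** Returns true if a request to a peer that never replies is abandoned by the read timeout. */
    static boolean timesOutAgainstSilentPeer(int readTimeoutMs) {
        // The listening socket lets the connect succeed, but nothing ever sends a response.
        try (ServerSocket silent = new ServerSocket(0)) {
            URL url = URI.create("http://127.0.0.1:" + silent.getLocalPort() + "/").toURL();
            HttpURLConnection conn = open(url, 1000, readTimeoutMs);
            try {
                conn.getResponseCode(); // the same call the stuck thread is blocked in
                return false;
            } catch (SocketTimeoutException e) {
                return true;            // the read gave up instead of hanging
            } finally {
                conn.disconnect();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("timed out: " + timesOutAgainstSilentPeer(500));
    }
}
```

With the read timeout left at its default of 0, the same getResponseCode() call would block until the peer closes the connection, which matches the thread state shown in the dump.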

Comments
Rene Raasuke [14/Jun/11 02:14 PM]
Attached a screenshot of the situation as well

Robin Shen [15/Jun/11 12:34 AM]
Previously we set a socket read timeout, but we got frequent bug reports that builds were hitting it. Unlike a socket connect, a socket read may take a very long time because of the current remote call mechanism in QB (for example, when a build resolves a dependency on another build, it simply waits for that build to complete). So we temporarily removed the socket read timeout in QB3. We are redesigning the grid system in QB4 to use asynchronous remote calls (a remote call will not wait for a build to finish; instead, the finished build will notify the consumer), so a read timeout will be feasible there, as each remote call is expected to finish in a short time.
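
The asynchronous pattern mentioned above can be sketched roughly as follows. This is only an illustration of the general idea, not the QB4 design: the class BuildCompletionRegistry and its methods are hypothetical, and a real grid would deliver the notification over the network rather than in-process.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper for illustration; not part of QuickBuild.
final class BuildCompletionRegistry {
    private final Map<String, CompletableFuture<String>> pending = new ConcurrentHashMap<>();

    /** Short call: registers interest in a build and returns at once instead of blocking. */
    CompletableFuture<String> whenFinished(String buildId) {
        return pending.computeIfAbsent(buildId, id -> new CompletableFuture<>());
    }

    /** Called when a build completes; every consumer holding the future is notified. */
    void buildFinished(String buildId, String status) {
        whenFinished(buildId).complete(status);
    }

    public static void main(String[] args) {
        BuildCompletionRegistry registry = new BuildCompletionRegistry();
        // The consumer attaches a callback and carries on; no socket read stays open meanwhile.
        registry.whenFinished("build-1")
                .thenAccept(status -> System.out.println("build-1 finished: " + status));
        registry.buildFinished("build-1", "SUCCESSFUL");
    }
}
```

Because no call waits for a build to complete, every remote call stays short, and a modest read timeout no longer collides with long-running dependencies. The sketch never removes finished entries from the map; a real implementation would need to clean them up.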

Rene Raasuke [15/Jun/11 04:44 AM]
I'm afraid we can't wait for QB4 with this one. We get several incidents per week. We'd rather lose a build request than watch the queue fill up with thousands of builds, eventually bringing the whole grid to a halt.
What I suggest you do:
1) Set the timeout to a big enough number; it does not have to be 30 seconds. Even 20 minutes would do.
2) And/or make the timeout configurable so that we can set it for ourselves only.
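
Both suggestions could be combined along these lines. This is a sketch under assumptions: the property name grid.readTimeoutMillis and the class ReadTimeoutConfig are invented for the example and are not actual QuickBuild settings. The default is the 20 minutes mentioned above.

```java
import java.util.Properties;
import java.util.concurrent.TimeUnit;

// Hypothetical configuration helper; names are illustrative only.
final class ReadTimeoutConfig {
    /** Made-up property name, not an actual QuickBuild option. */
    static final String PROPERTY = "grid.readTimeoutMillis";

    /** Generous default: 20 minutes, expressed in milliseconds. */
    static final int DEFAULT_MILLIS = (int) TimeUnit.MINUTES.toMillis(20);

    /** Returns the configured read timeout, falling back to the default on missing or invalid input. */
    static int resolve(Properties props) {
        String raw = props.getProperty(PROPERTY);
        if (raw == null || raw.trim().isEmpty()) {
            return DEFAULT_MILLIS;
        }
        try {
            int value = Integer.parseInt(raw.trim());
            // Negative values are rejected by URLConnection.setReadTimeout, so ignore them here.
            // 0 is passed through: it means "no timeout", i.e. the current behaviour.
            return value >= 0 ? value : DEFAULT_MILLIS;
        } catch (NumberFormatException e) {
            return DEFAULT_MILLIS;
        }
    }

    public static void main(String[] args) {
        // Example: java -Dgrid.readTimeoutMillis=60000 ReadTimeoutConfig
        System.out.println(resolve(System.getProperties()));
    }
}
```

The resolved value would then be handed to whatever sets the read timeout on the grid connection, so sites with long-running remote calls can raise it or disable it without affecting anyone else.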

Robin Shen [15/Jun/11 11:26 PM]
OK, I understand. We will add this back into QB3, probably with a configurable read timeout.