
[QB-949] Build requests hanging in CHECKING_BUILD_CONDITION state because socket readTimeout is not set
Created: 14/Jun/11  Updated: 18/Jun/11

Status: Resolved
Project: QuickBuild
Component/s: None
Affects Version/s: 3.1.45
Fix Version/s: 3.1.48

Type: Bug Priority: Critical
Reporter: Rene Raasuke Assigned To: Unassigned
Resolution: Fixed Votes: 0
Remaining Estimate: Unknown Time Spent: Unknown
Original Estimate: Unknown
Environment: Debian Squeeze x86-64

File Attachments: PNG File blocked_queue.png    

 Description   
We have a situation where a build request hangs in the CHECKING_BUILD_CONDITION state while the build with the same build version appears to have finished already.

After digging through the heap dump, I found that the thread processing that request is stuck in a socket read:
"Thread-5248962" daemon prio=10 tid=0x00007f21800cb000 nid=0x1924 runnable [0x00007f217f848000]
   java.lang.Thread.State: RUNNABLE
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.read(SocketInputStream.java:129)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
    - locked <0x0000000708777a90> (a java.io.BufferedInputStream)
    at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:687)
    at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:632)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1195)
    - locked <0x0000000708777ad0> (a sun.net.www.protocol.http.HttpURLConnection)
    at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:379)
    at com.caucho.hessian.client.HessianProxy.invoke(HessianProxy.java:166)
    at $Proxy100.cacheBuildStatus(Unknown Source)
    at com.pmease.quickbuild.DefaultBuildEngine.cacheBuildStatusInGrid(DefaultBuildEngine.java:1275)
    at com.pmease.quickbuild.DefaultBuildEngine.process(DefaultBuildEngine.java:286)
    at com.pmease.quickbuild.DefaultBuildEngine.access$1(DefaultBuildEngine.java:242)
    at com.pmease.quickbuild.DefaultBuildEngine$2.run(DefaultBuildEngine.java:753)
    at java.lang.Thread.run(Thread.java:662)

Checking the parameters of the connection QB uses to communicate with the grid, I found that you set connectionTimeout but no readTimeout, meaning a read will wait forever if a node disappears in the middle of a socket session without closing the connection properly. We cannot assume the network to be 100% stable due to the distributed nature of our grid.
What's most annoying is that it's impossible to remove the hanging build from the queue. It just does not go away. The only way to get rid of it is to restart QB, which we cannot afford to do very often.
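
To illustrate the difference between the two timeouts described above, here is a minimal sketch using only java.net. It is not QuickBuild or Hessian code: the class name ReadTimeoutDemo, the method timesOut and the local "silent" server are assumptions made for the demo. The server accepts the TCP connection but never answers, which roughly mimics a node that disappears mid-session. With a read timeout set, getResponseCode() (the call that is blocked in the stack trace) fails with a SocketTimeoutException instead of waiting forever.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.Proxy;
import java.net.ServerSocket;
import java.net.SocketTimeoutException;
import java.net.URI;
import java.net.URL;

public class ReadTimeoutDemo {

    /**
     * Calls a local server that accepts the connection but never responds,
     * and reports whether the client gave up with a read timeout.
     */
    public static boolean timesOut(int readTimeoutMillis) {
        // The listening socket lets the OS complete the TCP handshake,
        // but nothing ever reads the request or writes a response.
        try (ServerSocket silentNode = new ServerSocket(0)) {
            URL url = URI.create("http://127.0.0.1:" + silentNode.getLocalPort() + "/").toURL();
            HttpURLConnection conn = (HttpURLConnection) url.openConnection(Proxy.NO_PROXY);
            conn.setConnectTimeout(1000);           // bounds connection setup only
            conn.setReadTimeout(readTimeoutMillis); // bounds each blocking read; 0 means wait forever
            try {
                conn.getResponseCode();             // blocks until a response arrives or the timeout fires
                return false;
            } catch (SocketTimeoutException e) {
                return true;
            } finally {
                conn.disconnect();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("read timed out: " + timesOut(500));
    }
}
```

Passing 0 as the read timeout would reproduce the hang, so the demo deliberately avoids it.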

 Comments   
Comment by Rene Raasuke [ 14/Jun/11 02:14 PM ]
Attached a screenshot of the situation as well
Comment by Robin Shen [ 15/Jun/11 12:34 AM ]
Previously we set the socket read timeout, but we got frequent bug reports of builds hitting socket read timeouts. Unlike a socket connection, a socket read may take a very long time due to the current remote call mechanism in QB (for example, when a build resolves a dependency on another build, it simply waits for that build to complete). So we temporarily removed the socket read timeout in QB3. As we are redesigning the grid system in QB4 to use asynchronous remote calls (a remote call will not wait for a build to finish; instead, the finished build will notify the consumer), a read timeout will be feasible because each remote call is expected to finish in a short time.
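
The notification idea in the comment above can be sketched in-process as follows. This is only an illustration of the pattern, not the QB4 design: the class BuildCompletionRegistry and its methods are hypothetical, and a real grid would deliver the notification over the network. The point is that the consumer registers interest and returns immediately, so no single call has to stay open for the duration of a build.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

public class BuildCompletionRegistry {

    // One future per build id; it is completed when that build finishes.
    private final Map<String, CompletableFuture<String>> pending = new ConcurrentHashMap<>();

    /** Consumer side: returns immediately with a future instead of blocking. */
    public CompletableFuture<String> whenFinished(String buildId) {
        return pending.computeIfAbsent(buildId, id -> new CompletableFuture<>());
    }

    /** Producer side: the finished build notifies whoever is (or will be) waiting. */
    public void notifyFinished(String buildId, String status) {
        whenFinished(buildId).complete(status);
    }

    public static void main(String[] args) {
        BuildCompletionRegistry registry = new BuildCompletionRegistry();
        registry.whenFinished("build-1")
                .thenAccept(status -> System.out.println("build-1 finished: " + status));
        registry.notifyFinished("build-1", "SUCCESSFUL");
    }
}
```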
Comment by Rene Raasuke [ 15/Jun/11 04:44 AM ]
I'm afraid we can't wait for QB4 with this one. We get several incidents per week. We'd rather lose a build request than watch the queue fill up with thousands of builds, eventually bringing the whole grid to a halt.
What I suggest you do:
1) Set the timeout to a big enough number; it does not have to be 30 seconds. Even 20 minutes would do.
2) And/or make the timeout configurable so we can set it for ourselves only.
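
A rough sketch of how the two suggestions could be combined: a generous 20-minute default that can be overridden per installation. The property name grid.readTimeoutSeconds and the class GridTimeoutConfig are made up for illustration and are not actual QuickBuild settings.

```java
public class GridTimeoutConfig {

    // Hypothetical property name, used only for this sketch.
    public static final String PROPERTY = "grid.readTimeoutSeconds";

    // Default of 20 minutes, as suggested in the comment above.
    public static final int DEFAULT_SECONDS = 20 * 60;

    /**
     * Converts a configured value in seconds to milliseconds.
     * Missing, malformed or negative values fall back to the default;
     * 0 is passed through and means "no timeout" for java.net sockets.
     */
    public static int readTimeoutMillis(String configured) {
        int seconds = DEFAULT_SECONDS;
        if (configured != null) {
            try {
                int parsed = Integer.parseInt(configured.trim());
                if (parsed >= 0) {
                    seconds = parsed;
                }
            } catch (NumberFormatException ignored) {
                // keep the default
            }
        }
        // Clamp so that very large values do not overflow an int.
        return (int) Math.min((long) seconds * 1000L, Integer.MAX_VALUE);
    }

    public static void main(String[] args) {
        int millis = readTimeoutMillis(System.getProperty(PROPERTY));
        System.out.println("read timeout (ms): " + millis);
    }
}
```

Run with, for example, -Dgrid.readTimeoutSeconds=60 to override the default; the resulting value would then be passed to setReadTimeout on the connection.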
Comment by Robin Shen [ 15/Jun/11 11:26 PM ]
OK, I understand. Will add this back into QB3, probably with a configurable read timeout.
Generated at Thu May 16 18:01:52 UTC 2024 using JIRA 189.