
[QB-1896] Connection timeouts in EC2
Created: 24/Dec/13  Updated: 11/Jan/14

Status: Resolved
Project: QuickBuild
Component/s: None
Affects Version/s: 5.1.3
Fix Version/s: 5.1.6

Type: Bug Priority: Major
Reporter: Lukasz Guminski Assigned To: Robin Shen
Resolution: Fixed Votes: 0
Remaining Estimate: Unknown Time Spent: Unknown
Original Estimate: Unknown


 Description   
The Publish -> Archive step executed on EC2 instances regularly fails with a "connection timed out" error.

Is it possible to make the timeout configurable?

full stacktrace:
https://gist.github.com/anonymous/8058039#file-gistfile1-txt

 Comments   
Comment by Lukasz Guminski [ 06/Jan/14 01:20 PM ]
With 5.1.6 I keep receiving the same exception (on an EC2 instance that was started manually, not via a cloud profile):
https://gist.github.com/anonymous/8282692#file-gistfile1-txt

Could you please tell me how to configure the timeout value?
Comment by Robin Shen [ 07/Jan/14 12:28 AM ]
The read timeout has already been set to 1 hour. If it still results in a read timeout, please examine your artifact publish step to make sure that not too many files are being published at once. It can also take a long time if you specify a pattern that searches a large directory.
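To illustrate the point about pattern scope, here is a minimal sketch in Python. It is not QB code: it uses Python's `pathlib` globbing as a stand-in for QB's pattern matching, and the workspace layout (`dir1`, `dir2`, `src/pkg*`) is hypothetical. It shows that a recursive pattern and a scoped pattern can return the same files while the recursive one has to walk the whole tree.

```python
import tempfile
from pathlib import Path


def build_tree(root):
    """Create a hypothetical workspace: two small output dirs plus a large source tree."""
    for d in ("dir1", "dir2"):
        (root / d).mkdir()
        (root / d / (d + ".zip")).touch()
    for i in range(50):
        sub = root / "src" / ("pkg%d" % i)
        sub.mkdir(parents=True)
        for j in range(20):
            (sub / ("file%d.txt" % j)).touch()


def recursive_search(root):
    # "**/*.zip" has to descend into every directory under root.
    return sorted(p.relative_to(root).as_posix() for p in root.glob("**/*.zip"))


def scoped_search(root):
    # "dir1/*.zip, dir2/*.zip" only lists the two named directories.
    hits = []
    for pattern in ("dir1/*.zip", "dir2/*.zip"):
        hits.extend(p.relative_to(root).as_posix() for p in root.glob(pattern))
    return sorted(hits)


def demo():
    """Build the sample tree and return (recursive_hits, scoped_hits)."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        build_tree(root)
        return recursive_search(root), scoped_search(root)


if __name__ == "__main__":
    rec, scoped = demo()
    print("recursive:", rec)
    print("scoped:   ", scoped)
```

Both searches find the same two zip files here; the difference is only in how many directories have to be visited, which grows with the size of the workspace in the recursive case.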
Comment by Lukasz Guminski [ 07/Jan/14 10:30 AM ]
We use QB in a corporate environment, and our builds produce relatively large images which need to be transferred back to our company. This might change after QB-1800 is resolved (publishing to Amazon S3), but for now we cannot reduce the size of artifacts.

Therefore the best option would be to expose the value as a configuration parameter.
Comment by Robin Shen [ 08/Jan/14 12:23 AM ]
OK. Will make it configurable in the next patch release.
Comment by Robin Shen [ 08/Jan/14 04:01 AM ]
Checked this issue again. Although we can make this configurable, setting it to a very high value can prevent artifact-sending threads from quitting in a timely manner when there are network connection problems.
Also, the socket read timeout does not mean that the file transfer has to finish within that period. It is the maximum time during which no data is being sent. So the timeout should not be caused by the large artifact; instead, it should be caused by the fact that QB spends a lot of time searching for files matching the specified patterns in the specified directory before sending any data to the server. For instance, if the workspace contains a great many files, the pattern "**/*.zip" will cause QB to spend a lot of time searching for zip files recursively in the workspace. However, if you can make sure that your zip files sit inside a few small directories, you can limit the search scope by specifying a pattern such as: dir1/*.zip, dir2/*.zip
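The idle-time meaning of a socket read timeout can be sketched as follows. This is a generic Python illustration, not QB's implementation: the helper names `sender` and `receive` and the timing values are made up for the example. A transfer that takes longer in total than the read timeout still succeeds as long as each gap between chunks is shorter than the timeout, whereas a single long pause before any data arrives triggers it.

```python
import socket
import threading
import time


def sender(conn, chunks, pause):
    """Send `chunks` one-byte messages, sleeping `pause` seconds before each."""
    try:
        for _ in range(chunks):
            time.sleep(pause)
            conn.sendall(b"x")
    except OSError:
        pass  # the receiver gave up and closed its end
    finally:
        conn.close()


def receive(read_timeout, chunks, pause):
    """Return (bytes_received, elapsed_seconds, timed_out)."""
    a, b = socket.socketpair()
    b.settimeout(read_timeout)  # applies to each recv() call separately
    t = threading.Thread(target=sender, args=(a, chunks, pause))
    start = time.monotonic()
    t.start()
    received, timed_out = 0, False
    try:
        while True:
            data = b.recv(1024)
            if not data:  # sender closed its end: transfer complete
                break
            received += len(data)
    except socket.timeout:
        timed_out = True  # no data arrived within read_timeout
    finally:
        b.close()
    t.join()
    return received, time.monotonic() - start, timed_out


if __name__ == "__main__":
    # About 1 s of total transfer with a 0.5 s read timeout: gaps are only 0.2 s.
    print(receive(read_timeout=0.5, chunks=5, pause=0.2))
    # One 1 s pause before the first byte with a 0.3 s read timeout.
    print(receive(read_timeout=0.3, chunks=1, pause=1.0))
```

In the first call all five bytes arrive without a timeout even though the transfer outlasts the timeout value; in the second call the receiver times out before any data is sent.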
Comment by Maikel vd Hurk [ 08/Jan/14 05:43 AM ]
If it were caused by the search time for patterns, I would expect similar behavior in a setup without EC2 as well. But without EC2 we don't see this pattern of socket timeouts.
Comment by Lukasz Guminski [ 08/Jan/14 01:32 PM ]
I agree with Maikel that it is not about IO operations. IO operations on Amazon ephemeral storage are indeed slower, but not that slow. In this case we are talking about publishing a directory containing built images and various build artifacts: 2194 files in total (587 MB).

When I ran the _tar_ utility to create an archive, it took 1 min 20 sec with compression and 5 sec without.

And the publish step breaks exactly after 1 hour, after transferring 100-300 MB.
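
As a rough back-of-the-envelope check of these figures, the sketch below computes the implied average transfer rate and the time the full 587 MB would need at that rate. It assumes a steady rate over the whole hour and decimal units (1 MB = 1000 kB); the function name `transfer_estimate` is only for illustration.

```python
def transfer_estimate(total_mb, moved_mb, window_s=3600):
    """Return (average rate in kB/s, hours needed for total_mb at that rate)."""
    rate_kb_s = moved_mb * 1000 / window_s  # decimal units: 1 MB = 1000 kB
    hours_needed = total_mb / moved_mb * window_s / 3600
    return rate_kb_s, hours_needed


if __name__ == "__main__":
    for moved in (100, 300):
        rate, hours = transfer_estimate(total_mb=587, moved_mb=moved)
        print("%d MB in 1 h: %.1f kB/s, full transfer needs %.2f h" % (moved, rate, hours))
```

Under these assumptions the observed rate is roughly 28 to 83 kB/s, so the complete transfer would need roughly 2 to 6 hours.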

Comment by Robin Shen [ 08/Jan/14 11:30 PM ]
Can you please send me the full logs of the server and of the agent running the artifact publish step (found in the logs directory of the server and agent installations) from when this error happens? Please also send me the build log of the failed build.
Comment by Robin Shen [ 10/Jan/14 12:27 AM ]
If possible, please upgrade to 5.1.7, enable debug logging (conf/log4j.properties) on the server and the problematic agent, and then collect the relevant logs when the timeout occurs.
Comment by Lukasz Guminski [ 10/Jan/14 05:19 PM ]
Hi Robin, we need to schedule the upgrade, so it is not so easy to get to 5.1.7. I will provide you with logs from 5.1.6, unless you think the extended logging feature is essential in this case.

BTW we have raised several EC2 related requests:

QB-1905 Passing user tags from EC2 cloud profile to instances
QB-1904 Build requests disappear after unsuccessful scaling
QB-1901 Support for Amazon spot instances
QB-1899 Constraining the size of EC2 cloud

do you think there is a change to have them addressed in 5.1.8 ?

This would make it easier for me to get all the approvals for the QB downtime related to the upgrade. Regards,
Lukasz
Comment by Lukasz Guminski [ 10/Jan/14 05:20 PM ]
a change -> a chance
Comment by Maikel vd Hurk [ 10/Jan/14 05:29 PM ]
Please also consider QB-1800 Publishing artifacts in Amazon S3, this would also partially fulfil the request of QB-826 Multiple storage areas.
Comment by Robin Shen [ 11/Jan/14 02:02 AM ]
We need to investigate these features and prioritize them among other features. For now, we cannot give an estimated date for their implementation.
Generated at Tue May 21 07:19:22 UTC 2024 using JIRA 189.