[#QB-1896] Connection timeouts in EC2

QuickBuild

Created: 24/Dec/13 11:33 PM Updated: 11/Jan/14 02:02 AM

Component/s: None
Affects Version/s: 5.1.3
Fix Version/s: 5.1.6
Original Estimate: Unknown
Remaining Estimate: Unknown
Time Spent: Unknown

Description

The Publish -> Archive step executed on EC2 instances regularly fails with a "connection timed out" error.

Is it possible to make the timeout configurable?

Full stack trace:
https://gist.github.com/anonymous/8058039#file-gistfile1-txt

Lukasz Guminski [06/Jan/14 01:20 PM]

With 5.1.6 I keep receiving the same exception (on an EC2 instance that was started manually, not via a cloud profile):
https://gist.github.com/anonymous/8282692#file-gistfile1-txt

Could you please explain how to configure the timeout value?

Robin Shen [07/Jan/14 12:28 AM]

The read timeout is already set to 1 hour. If you still get a read timeout, please examine your artifact publish step to make sure that not too many files are being published at once. Publishing can also take a long time if you specify a pattern that searches a large directory.
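
As a rough illustration of the pattern-search cost mentioned above, the following Python sketch compares a recursive pattern with patterns scoped to known directories. It is only an analogy: it uses Python's `pathlib` globbing rather than QuickBuild's own matcher, and the workspace layout (`dir1`, `dir2`, `src/...`) and helper names are made up for the example.

```python
import os
import tempfile
from pathlib import Path


def build_tree(root):
    """Create a small hypothetical workspace: two artifact dirs and a deep source tree."""
    for name in ("dir1", "dir2"):
        (root / name).mkdir()
        (root / name / f"{name}.zip").touch()
    deep = root / "src" / "a" / "b" / "c"
    deep.mkdir(parents=True)
    for i in range(50):
        (deep / f"file{i}.txt").touch()


def find(root, patterns):
    """Return the sorted relative paths that match any of the glob patterns."""
    return sorted(p.relative_to(root).as_posix() for pat in patterns for p in root.glob(pat))


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    build_tree(root)
    # Recursive pattern: has to descend into every directory of the workspace.
    recursive = find(root, ["**/*.zip"])
    # Scoped patterns: only list the two directories known to hold archives.
    scoped = find(root, ["dir1/*.zip", "dir2/*.zip"])
    # Number of directories a full recursive walk has to visit.
    dirs_in_workspace = sum(1 for _ in os.walk(root))
```

Both calls find the same two archives, but the recursive pattern has to visit every directory in the workspace, while the scoped patterns touch only `dir1` and `dir2`. In a workspace with very many files the difference in search time can be large.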

Lukasz Guminski [07/Jan/14 10:30 AM]

We use QB in a corporate environment, and our builds produce relatively large images which need to be transferred back to our company. This might change after ~~QB-1800~~ is resolved (publishing to Amazon S3), but for now we cannot reduce the size of artifacts.

Therefore the best option would be exposing the value as a configuration parameter.

Robin Shen [08/Jan/14 12:23 AM]

OK. Will make it configurable in the next patch release.

Robin Shen [08/Jan/14 04:01 AM]

I checked this issue again. Although we can make this configurable, setting it to a very high value can prevent the artifact-sending threads from quitting promptly when there are network connection problems.
Also, the socket read timeout does not mean that the file transfer has to finish within that period. It is the maximum time during which no data is transferred. So the timeout should not be caused by the large artifact; it is more likely that QB spends a lot of time searching for files matching the specified patterns in the specified directory before sending any data to the server. For instance, if the workspace contains a great many files and you specify the pattern "**/*.zip", QB will spend a lot of time searching for zip files recursively in the workspace. If you can make sure that your zip files sit inside a few small directories, you can limit the search scope with a pattern such as: dir1/*.zip, dir2/*.zip
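
The distinction between an idle (read) timeout and a limit on the whole transfer can be sketched in Python. This is a minimal illustration using Python's `socket` module, not QuickBuild's actual code; the helper names `send_chunks` and `transfer`, the chunk sizes, and the timings are assumptions chosen for the demo.

```python
import socket
import threading
import time


def send_chunks(sock, chunks, gap):
    """Send each chunk after pausing `gap` seconds, then close the socket."""
    try:
        for chunk in chunks:
            time.sleep(gap)
            sock.sendall(chunk)
    except OSError:
        pass  # the receiver gave up and closed its end
    finally:
        sock.close()


def transfer(chunks, gap, read_timeout):
    """Receive everything the sender writes, or raise socket.timeout.

    The timeout bounds each individual recv() call (the idle time),
    not the duration of the whole transfer.
    """
    sender_sock, receiver_sock = socket.socketpair()
    sender = threading.Thread(target=send_chunks, args=(sender_sock, chunks, gap))
    sender.start()
    receiver_sock.settimeout(read_timeout)
    received = b""
    try:
        while True:
            data = receiver_sock.recv(4096)
            if not data:  # sender closed the connection: transfer complete
                return received
            received += data
    finally:
        receiver_sock.close()
        sender.join()


# 12 chunks sent 0.05 s apart take about 0.6 s in total, longer than the
# 0.4 s timeout, yet no single pause exceeds it, so the transfer completes.
payload = transfer([b"x" * 10] * 12, gap=0.05, read_timeout=0.4)

# A 1 s pause before any data arrives exceeds a 0.2 s timeout.
try:
    transfer([b"x"], gap=1.0, read_timeout=0.2)
    stalled = False
except socket.timeout:
    stalled = True
```

The first call succeeds even though the transfer as a whole outlasts the timeout; the second fails because nothing arrives within the timeout window. That second case corresponds to a sender that is busy (for example, searching for files) before it writes any data.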

Maikel vd Hurk [08/Jan/14 05:43 AM]

If it were caused by the search time for patterns, I would expect similar behavior in a setup without EC2 as well. But without EC2 we don't see this pattern of socket timeouts.

Lukasz Guminski [08/Jan/14 01:32 PM]

I agree with Maikel that it is not about IO operations. IO operations on Amazon ephemeral storage are indeed slower, but not that slow. In this case we are talking about publishing a directory containing built images and various build artifacts: 2194 files in total (587 MB).

When I ran the _tar_ utility to create an archive, it took 1 min 20 s with compression and 5 s without.

And the publish step breaks exactly after 1 hour, after transferring 100-300 MB.
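
For reference, the figures above can be turned into average transfer rates with a few lines of Python. This is plain arithmetic on the reported numbers, assuming 1 MB = 1000 KB and a steady rate; the function names are made up for the example.

```python
def average_rate_kb_per_s(megabytes, seconds):
    """Average transfer rate in KB/s, taking 1 MB = 1000 KB."""
    return megabytes * 1000 / seconds


def hours_to_send(total_mb, mb_per_hour):
    """Hours needed to send total_mb at a steady rate of mb_per_hour."""
    return total_mb / mb_per_hour


for sent_mb in (100, 300):  # the reported range transferred before the failure
    rate = average_rate_kb_per_s(sent_mb, 3600)
    hours = hours_to_send(587, sent_mb)
    print(f"{sent_mb} MB/hour = {rate:.1f} KB/s; 587 MB would need {hours:.2f} h")
```

At 100-300 MB per hour the average rate is roughly 28-83 KB/s, and sending all 587 MB at that pace would take about 2 to 6 hours.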

Robin Shen [08/Jan/14 11:30 PM]

Can you please send me the full logs of the server and of the agent running the artifact publish step (found in the logs directory of the server and agent installations) from when this error happens? Please also send me the build log of the failed build.

Robin Shen [10/Jan/14 12:27 AM]

If possible, please upgrade to 5.1.7, enable debug logging (conf/log4j.properties) on the server and the problematic agent, and then collect the relevant logs when the timeout occurs.

Lukasz Guminski [10/Jan/14 05:19 PM]

Hi Robin, we need to schedule the upgrade, so it is not so easy to get to 5.1.7. I will provide you with logs from 5.1.6, unless you think the extended logging feature is essential in this case.

BTW we have raised several EC2 related requests:

~~QB-1905~~ Passing user tags from EC2 cloud profile to instances
~~QB-1904~~ Build requests disappear after unsuccessful scaling
~~QB-1901~~ Support for Amazon spot instances
~~QB-1899~~ Constraining the size of EC2 cloud

do you think there is a change to have them addressed in 5.1.8 ?

This would make it easier for me to get all the approvals for the QB downtime related to the upgrade.

Regards,
Lukasz

Lukasz Guminski [10/Jan/14 05:20 PM]

a change -> a chance

Maikel vd Hurk [10/Jan/14 05:29 PM]

Please also consider ~~QB-1800~~ Publishing artifacts in Amazon S3; this would also partially fulfil the request in QB-826 (Multiple storage areas).

Robin Shen [11/Jan/14 02:02 AM]

We need to investigate these features and prioritize them among other features. For now, we cannot give an estimated date for implementing them.