
Key: QB-2640
Type: Bug
Status: Open
Priority: Major
Assignee: Robin Shen
Reporter: AlSt
Votes: 0
Watchers: 0

QB Checkout on AIX is extremely slow

Created: 19/Feb/16 03:18 PM   Updated: 10/Mar/16 11:15 AM
Component/s: None
Affects Version/s: 6.0.35
Fix Version/s: None

Original Estimate: Unknown
Remaining Estimate: Unknown
Time Spent: Unknown
File Attachments: 1. JGet.zip (3 kb)

Image Attachments: 1. networkperformance.png (16 kb)
Environment: AIX 7.1, different IBM JDKs (1.7 and 1.8 tested).


Description
There seems to be an issue with QB checkouts. The artifacts are stored on the QB server; their total size is about 500 MB.

Our test results (we ran the tests multiple times):
* wget against the zip download of all artifacts of a build: about 15 seconds
* download via a self-written Java test application that operates like wget: about 20 seconds (results varied somewhat when increasing and decreasing buffer sizes, but nothing was as slow as the QB checkout)
* checkout on an agent running on AIX: 45 minutes
* checkout on an agent with the TAR_TO_DISK system property set to true: 15 minutes

I've also attached the test Java class I wrote. Usage:
java -jar jget.jar <url-to-artifact-download> <target-filename> <chunksize-in-byte> <base64auth>

base64auth = Base64-encoded "username:password"
I had to do it this way to keep Java 6 compatibility without using third-party libs.

AlSt [19/Feb/16 03:25 PM]
Oh... I missed the last parameter: after base64auth there is an additional parameter <useBuffered>, which must be true or false and indicates whether BufferedInputStream and BufferedOutputStream should be used.

CPU and memory usage are also fine during the download.
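
For reference, here is a minimal sketch of what such a downloader can look like. The actual class is in the attached JGet.zip; the code below is only an illustrative reconstruction of the described behavior (Java 6 compatible, no third-party libs), not the attached source.

java:
import java.io.*;
import java.net.HttpURLConnection;
import java.net.URL;

public class JGet {
    public static void main(String[] args) throws IOException {
        String url = args[0];                                 // <url-to-artifact-download>
        String target = args[1];                              // <target-filename>
        int chunkSize = Integer.parseInt(args[2]);            // <chunksize-in-byte>
        String base64Auth = args[3];                          // <base64auth>
        boolean useBuffered = Boolean.parseBoolean(args[4]);  // <useBuffered>

        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestProperty("Authorization", "Basic " + base64Auth);

        InputStream in = conn.getInputStream();
        OutputStream out = new FileOutputStream(target);
        if (useBuffered) {
            in = new BufferedInputStream(in);
            out = new BufferedOutputStream(out);
        }
        // Copy the response body in chunks of the requested size
        byte[] chunk = new byte[chunkSize];
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        out.close();
        in.close();
        conn.disconnect();
    }
}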

Robin Shen [20/Feb/16 01:09 AM]
How are the "source path" and "file patterns" specified? Scanning can be very slow when a directory containing many files is specified.

AlSt [22/Feb/16 09:10 AM]
Source path: artifacts
Patterns: **

The thing is that the same checkout from any other machine (except the AIX ones) is pretty fast (~20 seconds). And if I'm not mistaken, the file matching is done on the server, so that part should always take the same time.

Robin Shen [23/Feb/16 12:16 AM]
The pattern will be matched on the node storing the artifacts. Can you please run a script on the checkout node that transfers a single file from AIX, to see how it performs?
groovy:
grid.transferFile(aixNode, "/path/to/a/large/file", node, "/path/to/local/file", null, null);

AlSt [23/Feb/16 12:38 PM]
I tried two scenarios. The file size was always about 500 MB (a zip file).

Running the build on the server with this script:
def aixNode = grid.getNode('aixnode:8811')
grid.transferFile(aixNode, "/opt/test.zip", node, "C:/test.zip", null, null)
grid.transferFile(node, "C:/test.zip", aixNode, "/tmp/test2.zip", null, null)

and running the build on AIX with this script:
grid.transferFile(node, "/opt/test.zip", grid.serverNode, "C:/test.zip", null, null)
grid.transferFile(grid.serverNode, "C:/test.zip", node, "/tmp/test2.zip", null, null)

Both ways the entire script took about 20 seconds.

Robin Shen [24/Feb/16 12:24 AM]
Thanks for doing the test. It looks like the file transfer itself is not the problem. Can you please modify your script a bit so that it scans/tars and transfers the slow-downloading artifacts on AIX?

def aixNode = grid.getNode('aixnode:8811')
grid.transferFiles(aixNode, "/absolute/path/to/artifacts", "**", node, "C:/temp", false, null, null);

If it is slow, then limit "/absolute/path/to/artifacts" to a small folder containing only several large files to see if it performs better.

AlSt [01/Mar/16 09:04 AM]
I did some more testing. What I did:

1. Used grid.transferFiles to transfer the artifacts from the server to the AIX node, exactly as they are published by the dependency build.
2. Used grid.transferFiles to transfer the same artifacts, but prepackaged into 5 tar files of roughly equal size (also adding up to about 500 MB).

Both times the transfer took around 45 minutes.

Second test:
1. Tarred all files into a single archive.
2. Used the following script:
groovy:
logger.info('transferFile')
grid.transferFile(grid.serverNode, 'C:/aixfull/aixtest.tar', node, '/tmp/singlefile', null, null);
logger.info('transferFiles')
grid.transferFiles(grid.serverNode, 'C:/aixfull', 'aixtest.tar', node, '/tmp/multifile', false, null, null);

Transferring the file via transferFile takes about 10 seconds.
Transferring the file via transferFiles takes again around 45 minutes.

So file scanning does not seem to be the problem.

You can also see the transfer rate in the attached screenshot (networkperformance.png). The red-marked range is the single-file transfer; after that, the transferFiles call started. As you can see, the rate is about 150 MBit/s for transferFile and about 2 MBit/s for transferFiles.

I could also run the tests the other way round, transferring from the AIX node to the server, if needed. For now I only tested the direction server -> AIX.

Robin Shen [02/Mar/16 12:30 AM]
Thanks for your detailed test. When transferFiles is called, QB has to tar all matched files (even if only one file matches). It looks like the tar utility used by QB is causing the issue. Please run the test below on the server to verify that this is indeed the case:

groovy:
// Tar the files under C:/aixfull matching "aixtest.tar" into /temp/multifile
OutputStream os = new FileOutputStream("/temp/multifile");
com.pmease.quickbuild.util.FileUtils.tar(new File("C:/aixfull"), "aixtest.tar", false, os);
os.close();

If yes, please turn off any anti-virus software, or change "/temp/multifile" to a different drive, to see if that speeds it up.

AlSt [03/Mar/16 11:29 AM]
I ran the script on the AIX node and it took about 20 seconds.
I also suspected that it might not be the tar itself, so I tested tar and untar together with this script:

groovy:
def inStream = new java.io.PipedInputStream()
def outStream = new java.io.PipedOutputStream(inStream)

// Tar the directory into one end of the pipe...
def t1 = Thread.start {
    com.pmease.quickbuild.util.FileUtils.tar(new File("/tmp/multifile"), "aixtest.tar", false, outStream)
    outStream.close()
}

// ...while concurrently untarring from the other end
def t2 = Thread.start {
    com.pmease.quickbuild.util.FileUtils.untar(inStream, false, new File("/tmp/destdir"))
    inStream.close()
}

t1.join()
t2.join()

This also took about 20 seconds. So the problem seems to be multi-file transfer over the network with this tar mechanism. I will test the tar functionality on the QuickBuild server to see how it behaves there.

AlSt [03/Mar/16 11:32 AM]
OK, I also tested the same script on the QuickBuild server running Windows. It took about 25 seconds.

AlSt [03/Mar/16 12:02 PM]
Now I also tested the slow checkout again: around 40 minutes. Then I had the idea of testing with the server option in the QuickBuild repository:

So I checked the "Server" checkbox and entered the same QB server as before, just to force QB to use the "remote server" code path. The checkout took about 5 minutes.

Robin Shen [04/Mar/16 01:36 AM]
This seems odd. Can you please turn on the TAR_TO_DISK option (defined as a system property via JVM -D in conf/wrapper.conf) on both the server node and the transfer target node, and then test the script below to see how it performs?

grid.transferFiles(grid.serverNode, 'C:/aixfull', 'aixtest.tar', node, '/tmp/multifile', false, null, null);
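
For reference, a sketch of how the property can be defined, assuming the standard Java Service Wrapper syntax used by conf/wrapper.conf; the index (.3 here) is illustrative and must not collide with wrapper.java.additional entries already present:

wrapper.java.additional.3=-DTAR_TO_DISK=true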

AlSt [04/Mar/16 08:08 AM]
Does this really have any effect on the server when I set this system property? I took a look at CopyFilesJob and the option seems to be used only on the receiver side. And as I already said in the issue description:
* checkout on an agent with the TAR_TO_DISK system property set to true: 15 minutes
So this option already cuts the transfer time by about two-thirds.

I can do the test you've suggested, but setting the system property on the server side is a bit difficult: I would have to start a separate server, because restarting the production server is not easy (there are currently 130 builds running). So my main question is whether the setting is really needed on the server side.

Robin Shen [05/Mar/16 12:25 AM]
I overlooked that message. Using this option on the CopyFilesJob side is enough. Please keep it on at the file-receiving side, and edit "conf/log4j.properties" at both the file-receiving and file-sending sides to add the log statement below:
log4j.logger.com.pmease.quickbuild.grid=DEBUG

Then wait a minute for this file to take effect, and run the transferFiles script. QB will log the file-transfer timing in the system log on both the file-receiving and file-sending sides, something like below:

at file receiving side:

Downloading files ...
Finished receiving files...
Finished extracting files...

at file sending side:

Serving files...
Finished archiving files...
Finished sending files...

Please post the exact log statements from both sides so that I can check the timing.

AlSt [07/Mar/16 12:34 PM]
OK. I tried with a separate server instance to be able to track everything better, because with 100+ agents connected the debug log is very noisy. So I ran the copy job from a newly set-up server on a Windows Server 2012 R2 machine.
The copy task (without TAR_TO_DISK) suddenly took 10 seconds. I switched the node back to the production QB server -> again 45 minutes. This drove me nuts. So I ran the copy task from the production QB server to an Ubuntu 12 node -> 12 seconds. Then I started modifying the groovy script like this:

groovy:
def source = grid.getNode('<node>:8811')
grid.transferFiles(source, '/tmp/multifile', 'aixtest.tar', node, '/tmp/multifile', false, null, null);

I tried the script with different source nodes:
* Node running Windows Server 2003 -> ~45 minutes
* Server running Windows Server 2008 -> ~45 minutes
* Node running Windows Server 2012 R2 -> ~10 seconds
* Node running Ubuntu 12 -> ~10 seconds

We also tested disabling delayed ACK on AIX, and suddenly the transfer took only about 2 minutes from the server running Windows Server 2008. So this seems to be a strange communication issue between Windows and AIX. Disabling delayed ACK is not recommended, though, because it produces heavy network load. Maybe you know more or have some suggestions. The interesting thing is that only the multi-file transfer using the tar functionality is this slow.
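
For what it's worth, this symptom pattern (a stream of small writes, each stalling for the receiver's delayed-ACK timeout) is the classic interaction between Nagle's algorithm on the sender and delayed ACK on the receiver. Below is a minimal Java sketch of the usual sender-side mitigation; the host, port, and write sizes are placeholders, and whether QB's transfer sockets expose this option is an open question:

java:
import java.io.OutputStream;
import java.net.Socket;

public class NoDelaySender {
    public static void main(String[] args) throws Exception {
        Socket socket = new Socket("aixnode", 8811); // placeholder host/port
        // Disable Nagle's algorithm so small writes are sent immediately instead
        // of being held until the previous packet is ACKed; with delayed ACK on
        // the receiver, each such wait can last ~200 ms.
        socket.setTcpNoDelay(true);
        OutputStream out = socket.getOutputStream();
        byte[] chunk = new byte[512]; // deliberately small writes
        for (int i = 0; i < 1000; i++) {
            out.write(chunk);
        }
        out.close();
        socket.close();
    }
}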

Robin Shen [08/Mar/16 02:09 AM]
Can you please upgrade your prod server to the build below?
https://build.pmease.com/build/2983/overview

I simplified multi-file transferring to use exactly the same mechanism as single-file transferring (except for the additional tar/untar step); previously there was some logic that sent keep-alive bits while the tar/untar was ongoing.

AlSt [10/Mar/16 11:15 AM]
We'll migrate the server to Linux next week and also upgrade to the latest QB version. As this seems to be a Windows-only problem with versions up to and including Server 2008, it should not happen anymore.

Thanks for your help.