1. I cannot remember. I will have to reproduce the problem without cleaning up the workspace so that I can check.
2. Yes, I am sure on this one that they DO exist on the platform node.
I checked the code and cannot find any obvious problems so far. Please send the newest backup of the database that is causing this problem, and I will run a load test on your configuration to see if I can reproduce it.
Thanks. We finally got this reproduced. It is a configuration problem rather than a QuickBuild bug. I will explain it using two platforms: linux and windows.
1. Assume the linux node is selected as the master node.
2. At the platform compile stage, the linux node and the windows node are selected for their corresponding platforms and run the platform compile/test tasks in parallel.
3. The linux node generates test reports into "${vars.getValue("basedir")}/build/tests/linux", and at the same time the reports from the windows node are transferred to the linux node and expanded into "${vars.getValue("basedir")}/build/tests/windows".
Here a race condition occurs: if the directory "${vars.getValue("basedir")}/build/tests" does not exist, two threads try to create it at the same time, and one thread loses, so its files are not copied to the intended directory. QuickBuild does its best to avoid this contention by locking the workspace while files are transferred and expanded from an external node. However, it cannot do so when directories are created by an external command, as the linux platform step does in this scenario. This is also why you will never experience this problem when all platform steps run on separate nodes other than the master.
This problem can be easily resolved by adding an extra step before the "platforms" step that creates the necessary directory structure to avoid contention. In this simplified case, you need to create "${vars.getValue("basedir")}/build/tests" in advance. I am closing this bug. Please reopen it if you experience this issue again after following the suggestions above.
Thanks. I am not very experienced with multithreading, mutexes, race conditions and the like, so I would never have figured this one out on my own. It would be great if you could eventually add some tips on this to the documentation to help avoid this kind of threading contention problem.
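The contention described in this comment can be sketched outside QuickBuild. The following is a minimal Python illustration and not QuickBuild's code: `naive_ensure_dir`, `safe_ensure_dir` and `run_platforms` are hypothetical helper names, and a temporary directory stands in for `${vars.getValue("basedir")}`. It assumes each platform worker first makes sure the shared `build/tests` directory exists and then creates its own subdirectory.

```python
import os
import tempfile
import threading


def naive_ensure_dir(path):
    """Check-then-create: not atomic, so two callers can both pass the check."""
    if not os.path.isdir(path):
        os.mkdir(path)  # raises FileExistsError for the caller that loses the race


def safe_ensure_dir(path):
    """Idempotent creation: succeeds whether or not the directory already exists."""
    os.makedirs(path, exist_ok=True)


def run_platforms(tests_dir, platforms, ensure_dir):
    """Run one thread per platform; each ensures tests_dir, then its own subdirectory."""
    errors = []
    barrier = threading.Barrier(len(platforms))

    def worker(platform):
        barrier.wait()  # release all workers together to maximise contention
        try:
            ensure_dir(tests_dir)
            os.mkdir(os.path.join(tests_dir, platform))
        except OSError as exc:
            errors.append((platform, exc))

    threads = [threading.Thread(target=worker, args=(p,)) for p in platforms]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return errors


if __name__ == "__main__":
    basedir = tempfile.mkdtemp()  # stand-in for the real workspace base directory
    tests_dir = os.path.join(basedir, "build", "tests")
    # Equivalent of the extra step before "platforms": create the shared parent first.
    safe_ensure_dir(tests_dir)
    errors = run_platforms(tests_dir, ["linux", "windows"], naive_ensure_dir)
    print("errors:", errors)
    print("created:", sorted(os.listdir(tests_dir)))
```

Because the shared parent exists before the workers start, even the non-atomic check never reaches `os.mkdir` on the parent, which is the effect the extra pre-creation step is meant to have. Without that step, the naive variant can fail intermittently, depending on thread timing.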
Yes, we will extend our documentation to cover this.
I followed your instructions by having my lgcbuild-generic step (which runs just before the platforms parallel step) create the necessary directory structure, and I still have the same problem. Some unit tests are still not being copied to the node selected as the master node.
I verified that the directory structure was created by checking the log of the generic step:
10:03:10,1 [lgcbuild-generic@cmlinux64build2:8811] INFO - module.generic:
10:03:10,1 [lgcbuild-generic@cmlinux64build2:8811] INFO - Created dir: /d01/QuickBuild/workspace/root/lgcbuilds/DSInfrastructure/dsinfra_native/DS_5000_4_0_0/dsinfra-native/build/tests
10:03:10,1 [lgcbuild-generic@cmlinux64build2:8811] INFO - Created dir: /d01/QuickBuild/workspace/root/lgcbuilds/DSInfrastructure/dsinfra_native/DS_5000_4_0_0/dsinfra-native/build/module/install
Any chance there is still a bug, or is there something else I can do?
This is interesting. There must be another source of the problem besides the one we have discovered. Could you please check the two questions below, which I asked at the very beginning of this issue?
1. Have the missing files actually been transferred to the master node?
2. If not, do the missing files actually exist on the platform node?
Thanks. Also, is it possible that your platform native steps recreated these directories (delete and create) while running?
I answered those two questions to the best of my ability earlier in this issue. Since I cannot reproduce it reliably, I cannot answer with 100% confidence. However, I am pretty sure they are being transferred and that they do exist on the platform node.
For your new question, I am sure those directories do NOT get deleted and recreated. I was hoping that, since you were able to reproduce my problem, you could test whether creating the directory structure before the platforms step does fix the issue. Maybe you will also see that it still doesn't help in your test environment.
I did test pre-creating the directory structure, and the problem went away. Maybe I have not tested long enough; I will leave it running all night to see if the problem still exists.
Also, to narrow down the problem, please arrange your JUnit report publish steps so that they execute sequentially instead of concurrently, to see if the problem still exists. The report publish steps execute quickly, and running them sequentially does not normally hurt performance.
It ran the whole night without any problems. I am now changing the file transfer mechanism so that it writes files sequentially (the previous one wrote in parallel with proper locking, which I think is also correct) and adding proper debugging messages. This will be available in 2.0.2. The next time this issue happens, please check whether the test report files are missing or the test report directory itself is missing on the master node.
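The check requested here (report files missing versus report directory missing) can be scripted so that it is answered the same way after every failed build. This is only a sketch: `classify_reports` is a hypothetical helper, and the `build/tests/<platform>` layout and the `TEST-example.xml` file name are assumptions based on the paths and platforms mentioned in this issue.

```python
import os
import tempfile


def classify_reports(tests_dir, platforms):
    """Report, per platform, whether its directory is missing, empty, or populated."""
    status = {}
    for platform in platforms:
        platform_dir = os.path.join(tests_dir, platform)
        if not os.path.isdir(platform_dir):
            status[platform] = "directory missing"
            continue
        # Count regular files anywhere below the platform directory.
        file_count = sum(len(files) for _, _, files in os.walk(platform_dir))
        if file_count == 0:
            status[platform] = "files missing"
        else:
            status[platform] = f"{file_count} file(s) present"
    return status


if __name__ == "__main__":
    # Demo layout in a temporary directory; point tests_dir at the real
    # build/tests directory on the master node instead.
    tests_dir = tempfile.mkdtemp()
    os.makedirs(os.path.join(tests_dir, "linux"))
    os.makedirs(os.path.join(tests_dir, "windows"))
    with open(os.path.join(tests_dir, "linux", "TEST-example.xml"), "w") as handle:
        handle.write("<testsuite/>")
    for name, state in classify_reports(tests_dir, ["linux", "windows", "win32"]).items():
        print(f"{name}: {state}")
```

Running it with the workspace cleanup step disabled distinguishes a transfer that never created the directory from one whose files were removed afterwards.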
I guess I will wait until 2.0.2 before starting my testing then. Thanks.
I changed my unit test qb-publish step to be sequential instead of parallel and that did NOT make a difference.
One thing that might help: if I run an individual module, it usually works correctly. However, if I run my big 70-module trigger chain, where multiple builds happen at the same time, the problem occurs often. I am going to turn off the cleanup step so I can inspect the master node after it fails.
If this holds true, is it possible that there is a workspace overlap between these concurrently running builds, so that the workspace cleanup of one build affects another? In any case, in 2.0.2 we will print file names after transferring to check whether the files are really transferred.
No possibility of overlap with any of these builds. They are all completely isolated and have their own svn repository and their own sandbox.
I thought more about your question of whether the lgcbuild-test steps delete and recreate the build/tests directory. I checked and they do NOT, but there are statements like:
    <delete dir="${env.BUILD_DIR}/tests">
        <exclude name="lgcTestCases.xml"/>
        <exclude name="lgcTestCases2.xml"/>
    </delete>
Since the excluded files ARE present, the tests directory itself should NOT actually be deleted. I also found that I create some extra files in the tests directory instead of in the tests/${os.platform} directory. This means each of the platforms has a file of the same name in the plain tests directory instead of in its platform-specific directory. However, this shouldn't make a difference, as I am not specifying build/tests/** in the output files, but build/tests/<platform>/**. Just to be safe, I changed all of my usages of the build/tests directory to build/tests/${os.platform}, so that everything is in the platform-specific directory that QB2 will output. After that change, everything seems to work. I was not able to recreate the problem. I will do some more test runs to be sure. Do you have any idea why this might have fixed the problem? It should NOT have made a difference, since QB2 was not told to copy those files.
This is good news. Don't the statements below actually delete the platform-specific directory?
    <delete dir="${env.BUILD_DIR}/tests">
        <exclude name="lgcTestCases.xml"/>
        <exclude name="lgcTestCases2.xml"/>
    </delete>
For example, when win32 test results are transferred from the win32 agent, the directory "${env.BUILD_DIR}/tests/win32" is created to hold them. If the statements above run after this transfer, the win32 test files are lost because they have been deleted.
You are right. How stupid of me not to notice this. However, since the master agent runs in parallel with the other platform agents AND since the other platform agents have to wait for input files to be transferred while the master agent does not, I would assume the master agent will almost always run the above delete statement BEFORE any other platform agent transfers its files back. Therefore, I think it is unlikely that this delete statement actually deletes anything transferred. The fact that my last test ran without any problems even with this delete statement in place seems to confirm this. But it is still worth fixing, just to rule out any possibility of this delete statement causing problems.
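The effect of that delete statement on already-transferred results can be modelled with a short Python sketch. It is a rough stand-in and not Ant's implementation: `delete_except` is a hypothetical helper that assumes the two `<exclude>` names only protect files directly inside the tests directory, and it models file deletion only. The file `TEST-example.xml` is an invented placeholder.

```python
import os
import tempfile


def delete_except(root, excluded_names):
    """Delete every file below root except top-level files listed in excluded_names.

    Returns the deleted paths relative to root, using forward slashes.
    """
    removed = []
    for dirpath, _, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            relative = os.path.relpath(path, root)
            if relative in excluded_names:
                continue  # protected: a top-level file named in an exclude
            os.remove(path)
            removed.append(relative.replace(os.sep, "/"))
    return sorted(removed)


if __name__ == "__main__":
    tests_dir = tempfile.mkdtemp()  # stand-in for ${env.BUILD_DIR}/tests
    os.makedirs(os.path.join(tests_dir, "win32"))
    for relative in ("lgcTestCases.xml", os.path.join("win32", "TEST-example.xml")):
        with open(os.path.join(tests_dir, relative), "w") as handle:
            handle.write("<testsuite/>")
    # Simulates the delete running AFTER the win32 results were transferred.
    print("removed:", delete_except(tests_dir, {"lgcTestCases.xml", "lgcTestCases2.xml"}))
```

In this model the excluded file survives while the transferred win32 result is removed, which is why the outcome depends on whether the delete runs before or after the transfer. Restricting every platform step to build/tests/${os.platform}, as done in the earlier comment, removes that ordering dependency.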
You said previously that you were changing the file transfer mechanism to be sequential. That sounds like it might slow things down, and since it appears you were right that it was my bug and not yours, do you still need to change to sequential?
It might be slower. I will revert the change in 2.0.2, and let's see if this issue is truly resolved.
Thanks.
1. Have the missing files actually been transferred to the master node?
2. If not, do the missing files actually exist on the platform node?
In the meantime, we will examine our code carefully to see if there are any race conditions in this regard.