
Key: QB-438
Type: Bug
Status: Resolved
Resolution: Won't Fix
Priority: Major
Assignee: Robin Shen
Reporter: Shawn Castrianni
Votes: 0
Watchers: 0
QuickBuild

Junit Report publishing has bug such that it thinks there is nothing to publish

Created: 24/Sep/09 09:02 AM   Updated: 03/Oct/09 07:19 AM
Component/s: None
Affects Version/s: None
Fix Version/s: None

Original Estimate: Unknown Remaining Estimate: Unknown Time Spent: Unknown


Description
This is the same JUnit report publishing bug that was first reported in the user forum. It is NOT the same as the most recently fixed bug, which caused "WARN - No report generators defined" to show up in the log. This is the original bug, where everything seemed to work except the publishing part: it couldn't find any JUnit XML files to publish, causing "WARN - There is no input file need publish for 'JUnit Report', [Dir: /d01/QuickBuild/workspace/root/lgcbuilds/DSInfrastructure/dsinfra_native/DS_5000_4_0_0, Pattern: dsinfra-native/build/tests/linux64/xml/*.xml]" to show up in the log.

If you remember, this happens when I try to be very advanced in my node management and have the controller of the entire build (the node that runs the master step) also be used as one of my platform child nodes. This is all described in my original user forum post. So if I send out unit tests to be run on 5 platforms where the master node is used as one of those platforms, this bug occurs some of the time. It is not 100% reproducible. It seems to be some sort of timing issue or workspace lock issue or something. I have 5 parallel steps all executing unit tests on 5 different platform nodes such that one of those platform nodes is the master node. Therefore, if they all finish at the same time causing all to send their output files back to the master node, one of those nodes is sending output files to itself while the other 4 actually have to send the files. Maybe the bug is caused when a different node is sending output files back at the same time the master node is sending files back to itself?

I have checked the logs and I can see NO error during the sending output files step. I do see that the master node sending files to itself is very quick since it shouldn't have to do anything. I do see that the different nodes sending files to the master node takes a few seconds.

My workaround was to NOT reuse the master node as a platform unit test node, but that causes 1 extra node to be used unnecessarily. I was hoping that your fix for "WARN - No report generators defined" would have fixed this problem too, but apparently not. I really could use a fix for this, as it is now getting in the way of other things.

By the way, this was tested against 2.0B9, not 2.0.0.

Robin Shen [24/Sep/09 09:25 AM]
Yes, we have not resolved this issue yet. I would like to ask the questions below:
1. Have the missing files actually been transferred to the master node?
2. If not, do the missing files actually exist on the platform node?

In the meantime, we will examine our code carefully to see whether there are any race conditions in this regard.

Shawn Castrianni [24/Sep/09 03:24 PM]
1. I cannot remember, I will have to try and reproduce and not clean up the workspace to check
2. Yes, I am sure on this one that they DO exist on the platform node.

Robin Shen [25/Sep/09 12:34 AM]
I checked the code and cannot find any obvious problems so far. Please send the newest backup of the database that shows this problem, and I will run a load test on your configuration to see if I can reproduce it.

Thanks.

Robin Shen [26/Sep/09 01:07 PM]
We finally got this reproduced. It is a configuration problem rather than a QuickBuild bug. I will explain it using two platforms, linux and windows:

1. Assume the linux node is selected as the master node.
2. At the platform compile stage, the linux node and the windows node are selected for their corresponding platforms and run the platform compile/test tasks in parallel.
3. The linux node generates test reports into "${vars.getValue("basedir")}/build/tests/linux", and at the same time the reports from the windows node are transferred to the linux node and expanded into "${vars.getValue("basedir")}/build/tests/windows". Here a race condition occurs: if the directory "${vars.getValue("basedir")}/build/tests" does not exist, two threads try to create it at the same time, and one thread loses, so its files are not copied to the intended directory.

QuickBuild does its best to avoid this contention by locking the workspace while files are transferred and expanded from an external node. However, it cannot do so when directories are created by an external command, as the linux platform step does in this scenario. That is why you never experience this problem when all platform steps run on separate nodes other than the master.

This problem can be easily resolved by adding an extra step before the "platforms" step that creates the necessary directory structure, which avoids the contention. In this simplified case, you will need to create "${vars.getValue("basedir")}/build/tests" in advance.
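
[Editor's sketch] The contention and the suggested pre-creation step can be illustrated in a few lines of Python. This is only an illustration under stated assumptions, not QuickBuild's or Ant's actual code: the helper names `ensure_dir` and `create_platform_dirs` and the temporary directory are hypothetical. The point is that an idempotent "create if missing" call on the shared parent lets several workers create their own platform subdirectories at the same time without one of them failing.

```python
import os
import tempfile
import threading

def ensure_dir(path):
    """Create path and any missing parents; succeed if it already exists."""
    # exist_ok=True makes the call idempotent, so a concurrent creator
    # of the same parent directory does not turn into an error here.
    os.makedirs(path, exist_ok=True)
    return path

def create_platform_dirs(base, platforms):
    """Create base/build/tests/<platform> from one thread per platform."""
    errors = []

    def worker(platform):
        try:
            ensure_dir(os.path.join(base, "build", "tests", platform))
        except OSError as exc:  # collected so the caller can inspect failures
            errors.append((platform, exc))

    threads = [threading.Thread(target=worker, args=(p,)) for p in platforms]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return errors

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as base:
        # Pre-create the shared parent, as the extra step is meant to do.
        ensure_dir(os.path.join(base, "build", "tests"))
        print(create_platform_dirs(base, ["linux", "windows"]))
```

In a real build the equivalent would be a step that runs once, before the parallel platform steps start, so that none of them ever has to create the shared parent directory.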

I am closing this bug. Please reopen it if you experience this issue again after following the above suggestions.

Change by Robin Shen [26/Sep/09 01:07 PM]
Field Original Value New Value
Status Open [ 1 ] Resolved [ 5 ]
Resolution Won't Fix [ 2 ]

Shawn Castrianni [26/Sep/09 03:27 PM]
Thanks. I am not very experienced with multithreading, mutexes, race conditions and the like, so I never would have figured this one out on my own. It would be great if you could eventually write some help tips on this kind of thing in the documentation to help avoid these kinds of threading contention problems.

Robin Shen [26/Sep/09 11:06 PM]
Yes, we will extend our documentation to cover this.

Shawn Castrianni [29/Sep/09 03:38 PM]
I followed your instructions by having my lgcbuild-generic step (which runs just before the platforms parallel step) create the necessary directory structure, and I still have the same problem. Some unit tests are still not being copied to the node selected as the master node.

I verified the directory structure was created by checking the log of the generic step:

10:03:10,1 [lgcbuild-generic@cmlinux64build2:8811] INFO - module.generic:
10:03:10,1 [lgcbuild-generic@cmlinux64build2:8811] INFO - Created dir: /d01/QuickBuild/workspace/root/lgcbuilds/DSInfrastructure/dsinfra_native/DS_5000_4_0_0/dsinfra-native/build/tests
10:03:10,1 [lgcbuild-generic@cmlinux64build2:8811] INFO - Created dir: /d01/QuickBuild/workspace/root/lgcbuilds/DSInfrastructure/dsinfra_native/DS_5000_4_0_0/dsinfra-native/build/module/install



Any chance there is still a bug or is there something else I can do?

Robin Shen [29/Sep/09 10:35 PM]
This is interesting. There must be another source of the problem besides the one we have discovered. Can you please check the points below, which I asked about at the very beginning of this issue?

1. Have the missing files actually been transferred to the master node?
2. If not, do the missing files actually exist on the platform node?

Thanks

Robin Shen [29/Sep/09 10:44 PM]
And is it possible that your platform native steps recreate these directories (delete and then create them) while they are running?

Shawn Castrianni [30/Sep/09 05:30 AM]
I answered those two questions to my best ability earlier in this issue. Since I cannot reproduce it reliably, I cannot answer with 100% confidence. However, I am pretty sure they are being transferred and that they do exist on the platform node.

For your new question, I am sure those directories do NOT get deleted and recreated.


I was hoping that since you were able to reproduce my problem, that you could test that adding the directory structure creation before the platforms step does fix the issue. Maybe you will also see that it still doesn't help in your test environment.

Robin Shen [30/Sep/09 06:04 AM]
I did test pre-creating the directory structure, and the problem went away. Maybe I have not tested long enough; I will leave it running all night to see if the problem still exists.

Robin Shen [30/Sep/09 06:31 AM]
Also, to narrow down the problem, please arrange your JUnit report publish steps so that they execute sequentially instead of concurrently, and see whether the problem still exists. The report publish steps execute quickly, so running them sequentially does not normally hurt performance.

Shawn Castrianni [30/Sep/09 08:43 PM]
I will try that at 3AM Houston time.

Robin Shen [30/Sep/09 11:26 PM]
It ran the whole night without any problems. I am now changing the file transfer mechanism so that it writes files sequentially (the previous one wrote in parallel with proper locking, which I think is also correct) and adding proper debugging messages. This will be available in 2.0.2. The next time this issue happens, please check whether the test report files or the test report directory are missing on the master node.

Shawn Castrianni [01/Oct/09 01:45 AM]
I guess I will wait until 2.0.2 before starting my testing then. Thanks.

Shawn Castrianni [01/Oct/09 07:53 PM]
Do you have an ETA for 2.0.2 release?

Robin Shen [01/Oct/09 10:59 PM]
It will be released in just one or two days.

Shawn Castrianni [02/Oct/09 08:15 AM]
I changed my unit test qb-publish step to be sequential instead of parallel and that did NOT make a difference.

One thing that might help is that if I run an individual module, it usually works correctly. However, if I run my big 70 module trigger chain where multiple builds are happening at the same time, it usually happens a lot. I am going to turn off the cleanup step to inspect the master node after it fails.

Robin Shen [02/Oct/09 11:58 AM]
If this holds true, is it possible that there is a workspace overlap between these concurrently running builds, and that the workspace cleanup of one build affects another?
Nevertheless, in 2.0.2 we will print the file names after transferring so that you can check whether the files were really transferred.
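
[Editor's sketch] Until such logging is available, the same check can be done by hand on the master node. The Python helper below is hypothetical and not part of QuickBuild. It assumes the publish pattern has wildcards only in the file name, as in the pattern from the warning, and it distinguishes a missing report directory from missing report files.

```python
import glob
import os
import tempfile

def matching_reports(workspace_dir, pattern):
    """Return sorted paths, relative to workspace_dir, that match pattern."""
    hits = glob.glob(os.path.join(workspace_dir, pattern))
    return sorted(
        os.path.relpath(h, workspace_dir).replace(os.sep, "/") for h in hits
    )

def diagnose(workspace_dir, pattern):
    """Say whether the report directory or only the report files are missing.

    Assumes wildcards appear only in the last path component of pattern.
    """
    report_dir = os.path.dirname(os.path.join(workspace_dir, pattern))
    if not os.path.isdir(report_dir):
        return "directory missing"
    if not matching_reports(workspace_dir, pattern):
        return "files missing"
    return "ok"

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workspace:
        # An empty workspace has no report directory at all.
        print(diagnose(workspace, "build/tests/linux64/xml/*.xml"))
```

Running `diagnose` against the real workspace directory and pattern right after a failed build, with the cleanup step turned off, would show which of the two cases applies.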

Shawn Castrianni [02/Oct/09 04:42 PM]
No possibility of overlap with all of these builds. They are all completely isolated and have their own svn repository and own sandbox.

I thought more on your statement of if the lgcbuild-test steps delete and recreate the build/tests directory. I checked and they did NOT, but there are statements like:

<delete dir="${env.BUILD_DIR}/tests">
    <exclude name="lgcTestCases.xml"/>
    <exclude name="lgcTestCases2.xml"/>
</delete>

but since the excluded files ARE present, the tests directory should NOT actually be deleted. I also found that I create some extra files in the tests directory instead of in the tests/${os.platform} directory. This means each of the platforms has a file of the same name in the plain tests directory instead of in its platform-specific directory. However, this shouldn't make a difference, as I am not specifying build/tests/** in the output files, but build/tests/<platform>/**.

But just to be safe, I changed all of my usages of the build/tests directory to build/tests/${os.platform} so that everything is in the platform-specific directory that QB2 will output. After that change, everything seems to work now. I was not able to recreate the problem. I will do some more test runs to be sure.

Do you have any idea why this might have fixed the problem? It should NOT have made a difference since QB2 was not told to copy those files.

Robin Shen [03/Oct/09 12:24 AM]
This is good news. Don't the statements below actually delete the platform-specific directory?

<delete dir="${env.BUILD_DIR}/tests">
    <exclude name="lgcTestCases.xml"/>
    <exclude name="lgcTestCases2.xml"/>
</delete>

For example, when win32 test results are transferred from the win32 agent, the directory "${env.BUILD_DIR}/tests/win32" will be created to hold them. If the above statements run after this transfer, the win32 test files are lost, since they have been deleted.
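
[Editor's sketch] The effect can be made concrete with a rough Python simulation. This is not Ant itself. It assumes the delete removes every file under `tests` except those whose path relative to `tests` matches an excluded name, which is how the statements are read in this comment. The helper names `delete_except` and `make_file` and the report file name are hypothetical.

```python
import os
import tempfile

def delete_except(root, excluded):
    """Delete every file under root unless its path relative to root is excluded.

    Returns the sorted relative paths of the files that were deleted.
    """
    deleted = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            if rel in excluded:
                continue  # kept: matches an exclude entry at the top level
            os.remove(full)
            deleted.append(rel)
    return sorted(deleted)

def make_file(root, rel):
    """Create an empty file at root/rel, creating parent directories."""
    path = os.path.join(root, *rel.split("/"))
    os.makedirs(os.path.dirname(path), exist_ok=True)
    open(path, "w").close()

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tests:
        make_file(tests, "lgcTestCases.xml")
        make_file(tests, "win32/xml/TEST-example.xml")  # transferred earlier
        print(delete_except(tests, {"lgcTestCases.xml", "lgcTestCases2.xml"}))
```

Under that assumption the excluded top-level files survive, while a report that another agent had already placed under `tests/win32` is removed. Limiting each platform step to its own `tests/<platform>` directory avoids touching the other platforms' results.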


Shawn Castrianni [03/Oct/09 06:00 AM]
You are right. How stupid of me not to notice this. However, since the master agent runs in parallel with the other platform agents AND since the other platform agents have to wait for input files to be transferred, which the master agent does not, I would assume the master agent will almost always run the above delete statement BEFORE any other platform agent transfers its files back. Therefore, I think it is unlikely that this delete statement actually deletes anything that was transferred. The fact that my last test ran without any problems even with this delete statement in place seems to confirm this. But it is still worth fixing, just to rule out any possibility of this delete statement causing problems.


You said previously that you were changing the file transfer mechanism to be sequential. That sounds like it might slow things down and since it appears that you were right that it was my bug and not yours, do you still need to change to sequential??

Robin Shen [03/Oct/09 07:19 AM]
It might be slower. I will revert the change in 2.0.2 and let's see if this issue is truly resolved.

Thanks