[#QB-2384] Timeouts or stop not working correctly under FreeBSD 10

QuickBuild

Created: 22/Mar/15 06:16 PM Updated: 03/Aug/16 12:00 AM

Component/s: None
Affects Version/s: 6.0.9
Fix Version/s: 6.0.14

Original Estimate:	Unknown	Remaining Estimate:	Unknown	Time Spent:	Unknown
File Attachments:	1. can_not_kill_by_sigterm.c (0.7 kb)

Description

For the Neovim builds, I have an agent running on FreeBSD 10. I have timeouts placed on a few steps because there are test cases that creep up and can result in a build taking an extraordinary amount of time. Unfortunately, the timeout doesn't appear to be kicking in, but I think it might be because QB is having a hard time canceling the work on that node. The build has been running for over 6 hours now, and has failed to stop despite my hitting stop several times.

Comments

Robin Shen [22/Mar/15 11:24 PM]

Yes, it could be that the forked process cannot be terminated by QB. Can you please kill that process with the OS kill utility and then check whether the build can be stopped?

John Szakmeister [23/Mar/15 12:05 AM]

That's what I did and it stopped. It'd be nice if I didn't have to do that though. :-(

Robin Shen [23/Mar/15 11:11 PM]

Which process is the build forking? And can it be killed gracefully via "kill -TERM"? If not, QB might not be able to stop it, as it sends SIGTERM to forked processes.

John Szakmeister [31/Mar/15 08:27 AM]

It was running busted, though it may have been doing it through a shell. It died with a regular TERM signal, so I'm not sure why QB had trouble stopping it. It would be nice if QB would try again at some point though. Some programs run cleanup code for SIGTERM, but it's possible they could be stuck in that cleanup. Usually, another SIGTERM will force it to exit anyways. Subversion is one such application.

Ours did not have this behavior, but it is possible that the test app in use had some bad behavior in terms of signals (though, it was not blocking SIGTERM). We use libuv in a test application, and there is a known bug for libuv and signal delivery. However, libuv also does nothing for SIGTERM so the default handler to terminate the process should still be in place.

We did end up modifying the test app slightly and that helped us avoid the issue, but I still think QB should have at least tried harder here.

Robin Shen [01/Apr/15 12:50 AM]

If the timeout or first cancel does not work, please check whether manually stopping the build another one or two times works. Upon build timeout or first cancel, QB issues SIGTERM to all spawned processes and then sits there waiting. Upon a second cancellation (which has to be issued manually by the user), it will issue the "kill -9" command to forcibly kill any living processes. We do not retry the kill ourselves, as some cleanup steps might be running in QB (set up by the QB user with an upon-cancellation condition), so we leave it up to the end user to retry the cancellation.

The first cancellation, along with the subsequent cancellation retry, should be reflected in the build log something like this:

08:40:44,963 INFO - Terminating launched command gracefully...
08:40:49,582 INFO - Killing process 4965...
08:40:49,582 INFO - Killing process 4964...
08:40:49,582 INFO - Killing process 4963...
08:40:49,582 INFO - Killing process 4960...
08:40:49,623 INFO - Killing process 4965...
08:41:24,404 INFO - Unable to terminate launched command gracefully, terminating forcibly instead...
08:41:24,424 INFO - Calling kill utility to forcibly kill process 4965...
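
The two-stage behavior described above (SIGTERM first, "kill -9" only afterwards) can be sketched as a small shell helper. This is only an illustration and not QuickBuild's code: the function name terminate_with_escalation and the 30 second default grace period are made up here, and unlike QB this sketch escalates on its own once the grace period expires instead of waiting for a second manual cancel.

```shell
#!/bin/sh
# Hypothetical helper (not part of QuickBuild): send SIGTERM, wait up to a
# grace period, then fall back to SIGKILL if the process is still there.
# Usage: terminate_with_escalation PID [GRACE_SECONDS]
terminate_with_escalation ()
{
    pid=$1
    grace=${2:-30}   # assumed default, pick whatever suits the build

    # If the signal cannot be delivered, the process is already gone.
    kill -TERM "$pid" 2>/dev/null || return 0

    while [ "$grace" -gt 0 ]
    do
        # "kill -0" delivers nothing; it only checks that the pid exists.
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
        grace=$((grace - 1))
    done

    echo "Process $pid did not exit after SIGTERM, sending SIGKILL" >&2
    kill -KILL "$pid" 2>/dev/null
    return 0
}
```

Called as "terminate_with_escalation 4965 10", it would give process 4965 ten seconds to exit before forcing it. Note that it handles a single pid only, so child processes would still need to be collected separately.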

John Szakmeister [01/Apr/15 03:02 PM]

Unfortunately, I can't really recreate the issue right now. How about we close this for now? If I can provide something more concrete, I'll let you know. I will say that I did try to cancel the build several times with no effect--the build just continued to hang there. Unfortunately, it's now been a while, so I don't have the logs to go back to. :-(

Robin Shen [01/Apr/15 11:38 PM]

No problem, let's keep the issue open here. Please comment on it if the problem happens again.

John Szakmeister [03/Apr/15 12:31 PM]

I'm seeing this in the logs:

01:58:44,044 INFO - Running step...
01:59:44,125 INFO - Terminating launched command gracefully...
03:09:33,060 INFO - Unable to terminate launched command gracefully, terminating forcibly instead...
03:10:23,912 INFO - Unable to terminate launched command gracefully, terminating forcibly instead...

Nothing about process ids, and nothing about calling the kill utility. Is that expected?

Robin Shen [04/Apr/15 01:30 AM]

It seems that QB is not able to capture the process IDs of the launched processes. Can you please set up a simple step that calls the sleep utility and then cancel the build, to see whether the log contains process IDs like the ones shown in my log above?

John Szakmeister [04/Apr/15 07:01 AM]

I tried letting a timeout lapse and manually stopping the build, and they both worked using sleep. Would turning up the logging level (to DEBUG) help uncover anything?

Robin Shen [05/Apr/15 01:59 AM]

The IDs of the processes being killed are printed at INFO level, so DEBUG will not reveal anything more in this regard. Please help with another test:
1. Download the attached "can_not_kill_by_sigterm.c" and compile it with gcc to get a.out.
2. Create a test.sh that calls a.out:
    #!/bin/sh
    /path/to/a.out
3. Add a command build step in QB to execute test.sh.
4. Run the build and then stop it manually.
5. The first stop attempt will cause QB to send SIGTERM to a.out, and process IDs should be printed in the build log. The build will not exit, as a.out cannot be killed by SIGTERM.
6. Now stop the build again from the QB UI. This time QB should call the kill command to kill a.out forcibly, and the build should stop after a while.

Please check whether the above works on your side.
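
If compiling the attachment is inconvenient, a rough shell stand-in for a.out might look like the sketch below. It is an assumption about what the test needs (a process that survives SIGTERM), not a copy of the attached C file, and the function name ignore_term_and_wait is made up for illustration.

```shell
#!/bin/sh
# Hypothetical stand-in for a.out: ignore SIGTERM so that only SIGKILL
# ("kill -9") can end the step. An ignored signal stays ignored in child
# processes, so the sleep started below also survives "kill -TERM".
# Usage: ignore_term_and_wait [SECONDS]
ignore_term_and_wait ()
{
    trap '' TERM
    echo "pid $$ is now ignoring SIGTERM"
    sleep "${1:-600}"
}
```

To use it as the build step, put the function in test.sh and add a final line such as "ignore_term_and_wait 600". The first stop should then leave the step running, and only the forcible kill should end it.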

John Szakmeister [09/Apr/15 08:21 AM]

Sorry it took so long, but I did manage to do this today. A few things came up.

First, my apologies, but I didn't make sure that the sleep step was executing on the FreeBSD node. The sleep test actually fails on the FreeBSD node:

    22:45:22,841 INFO - Executing pre-execute action...
    22:45:22,841 INFO - Running step...
    22:45:22,850 INFO - Checking step execute condition...
    22:45:22,851 INFO - Step execute condition satisfied, executing...
    22:45:23,022 INFO - Executing pre-execute action...
    22:45:23,023 INFO - Running step...
    22:45:23,024 DEBUG - Executing command: /bin/sleep 600
    22:45:23,024 DEBUG - Command working directory: /usr/home/quickbuild/buildagent/workspace/root/neovim/sleep-test
    22:45:33,051 INFO - Terminating launched command gracefully...
    22:46:08,935 INFO - Unable to terminate launched command gracefully, terminating forcibly instead...

No PIDs of any sort. Compiling and running a.out via test.sh had similar results.

Robin Shen [09/Apr/15 11:29 PM]

So stopping a build simply executing a sleep command does not work? I do not have FreeBSD 10 at hand and will set up one to check this issue.

John Szakmeister [10/Apr/15 10:13 AM]

> So stopping a build simply executing a sleep command does not work?

That's correct.

Robin Shen [11/Apr/15 12:20 AM]

Thanks for the info, will go ahead to setup a FreeBSD 10 system to check what might be the cause.

Robin Shen [04/May/15 02:54 AM]

This issue has now been fixed in 6.0.14. (Due to a JIRA bug, we cannot update the status of this issue right now.)

John Szakmeister [04/May/15 10:20 AM]

Thanks a million! This couldn't have come at a better time... my FreeBSD was getting stuck quite frequently this week. Thank you!

John Szakmeister [27/Jul/16 11:58 PM]

I'm seeing this issue again in 6.1.19.

Robin Shen [01/Aug/16 12:09 PM]

It works on my side. Please create a simple configuration running a command/batch step with the command below:
sleep 300

Then stop the build to see if it works.

John Szakmeister [01/Aug/16 06:03 PM]

Sorry, I didn't mean to imply that it never works--it only appears to mostly work. I've definitely had builds run up against the time limits and QB has terminated them without issue.

But then, there are ones like this: http://neovim-qb.szakmeister.net/build/7676 and this: http://neovim-qb.szakmeister.net/build/7668.

The first ran for 8 hours before I got a chance to take a closer look, while the second ran considerably longer (over a day). I had to log into the FreeBSD node and kill the running process for the executable (nvim, in this case). I'm not sure how it's happening. Looking at the timeout builds, there are none that are at approximately the limit (meaning, QB wasn't able to terminate them and I had to jump in and kill the process under test).

Your simple test works, but I think things are failing when trying to kill a tree of processes. The OS X and Linux boxes we have configured don't appear to suffer from this issue. :-(

Robin Shen [02/Aug/16 12:29 AM]

Maybe that is because the process does not respond to the SIGTERM signal. QB uses the script below to walk the process tree and send SIGTERM to every descendant:

#!/bin/sh
list_descendants ()
{
    local children=$(pgrep -P "$1")

    for pid in $children
    do
        list_descendants "$pid"
    done

    echo "$children"
}

kill $(list_descendants $1)

John Szakmeister [02/Aug/16 11:20 AM]

It's a good suggestion, but I use "kill <pid>" to stop the executable under test (not QuickBuild), which defaults to using SIGTERM and it exits. :-( I'll try and dig into this more, but I'm not sure where to go. What I can say is that a child process is being abandoned (because its parent dies), so "pgrep -P <pid>" no longer finds it afterwards. I wonder if there's a bit of a race between killing the children and the children starting a new process, which causes the new child to be missed? I'll try to look and see if there's a safer way to kill a tree of processes. OTOH, QB is getting stuck when this condition occurs waiting for a process that may never terminate. I have a step to kill any lingering processes, but it doesn't get to that point. Is there something that could be done there?
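
One way the suspected race could be narrowed, offered only as a sketch and not as anything QB does, is to SIGSTOP each process before listing its children, so that it cannot fork anything new while the tree is being walked. The function names freeze_tree and kill_tree are made up for illustration. This does not help with children that were already abandoned before the walk starts, since "pgrep -P" can no longer find them.

```shell
#!/bin/sh
# Hypothetical sketch: stop a process first, then list its children.
# A stopped process cannot fork, so the child list stays complete.
# Prints the pid of every process it managed to stop, children first.
freeze_tree ()
{
    kill -STOP "$1" 2>/dev/null || return 0   # already gone
    for child in $(pgrep -P "$1")
    do
        freeze_tree "$child"
    done
    echo "$1"
}

# Freeze the whole tree, queue SIGTERM for every member, then resume them
# so the pending SIGTERM is delivered.
kill_tree ()
{
    pids=$(freeze_tree "$1")
    [ -n "$pids" ] || return 0
    # $pids is left unquoted on purpose so that each pid is its own argument.
    kill -TERM $pids 2>/dev/null
    kill -CONT $pids 2>/dev/null
    return 0
}
```

Unlike the QB script, this version also signals the root pid itself. A process that handles or ignores SIGTERM would still need a follow-up SIGKILL.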

Robin Shen [02/Aug/16 09:57 PM]

It seems there is no commonly available way to kill the whole process tree other than using pgrep. Maybe you can add some terminating logic at the end of the main command to terminate the child processes yourself, if you know which child processes are being spawned.

John Szakmeister [03/Aug/16 12:00 AM]

One approach that I think would do the trick the majority of the time is to create a new session for the launched process and then use pkill to kill the session by session id. Some quick tests show that works well, and abandoned children are still terminated. The trick here is that the launched process needs to make a call to setsid() to set itself as a session leader. I think the only catch here is that you have to fork a new process and let the parent die so that it is able to become a session leader. I'm not sure how that affects things. I can probably work something into our testing to help make this happen for us, but that obviously won't help others who run into this issue.
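
A minimal sketch of that idea in shell, under some assumptions: it relies on the setsid(1) utility from util-linux, which may not be available on a FreeBSD 10 agent, and on "pkill -s", which selects processes by session id. The function names and the SIDFILE argument are made up for illustration.

```shell
#!/bin/sh
# Hypothetical sketch: start a command as the leader of a new session and
# record the session id in a file.
# Usage: start_in_session SIDFILE COMMAND [ARG...]
start_in_session ()
{
    sidfile=$1
    shift
    # The inner shell is the session leader, so its pid ($$) is also the
    # session id. "exec" keeps that pid for the real command.
    setsid sh -c 'echo "$$" > "$0"; exec "$@"' "$sidfile" "$@" &
}

# Signal every process still in the recorded session, including children
# whose parent has already died.
# Usage: kill_session SIDFILE [SIGNAL]
kill_session ()
{
    pkill "-${2:-TERM}" -s "$(cat "$1")"
}
```

For example, "start_in_session /tmp/step.sid ./run-tests.sh" followed later by "kill_session /tmp/step.sid" (both paths are placeholders). A process that calls setsid() itself leaves the session and would be missed.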