Gridengine and Qlogin Zombies
Written on 2009-09-12.
We’re using gridengine 6.2 on our cluster system at work. On this system we have a queue for interactive jobs with a fixed number of slots. If someone wants to start an interactive session and no free slots are available, the interactive job cannot start. Otherwise it consumes one slot for the duration of the session and frees it again after the user quits the interactive session.
Unfortunately, if the user forgets to quit the interactive session properly (e.g. by entering “quit” in the terminal) and just closes the terminal, the interactive session becomes a zombie. The user cannot use this session anymore, but the process is still running and thus blocking a slot. Since it happens quite often that users fail to quit the interactive session properly, the free slots are used up pretty soon and no one can start new interactive sessions anymore – quite a nice bug in the gridengine software (available in Debian/Lenny by the way…)
You can see that those processes are still running using ps and qstat. The question is: how can I effectively search for those processes and kill them? Gridengine does not seem to be aware of those zombies, and the user might be running other interactive jobs in parallel that must not be killed. So how does one find the zombies and kill them reliably?
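For what it's worth, the "see them with qstat" part can at least be scripted. The following is only a rough sketch in Python, assuming that interactive sessions show up in the default qstat listing under the job name QLOGIN and that the columns are in the default order (job-ID, prior, name, user, state, date, time, queue, slots). The function names parse_qlogin_jobs and list_qlogin_jobs are made up for this post. It only lists candidates; it does not tell you which of them are zombies.

```python
import subprocess


def parse_qlogin_jobs(qstat_text):
    """Extract interactive (QLOGIN) jobs from plain `qstat` output.

    Assumes the default column order:
    job-ID prior name user state date time [queue] slots
    """
    jobs = []
    for line in qstat_text.splitlines():
        fields = line.split()
        # Skip the header, the dashed separator and anything too short.
        if len(fields) < 7 or not fields[0].isdigit():
            continue
        job_id, _prio, name, user, state = fields[:5]
        if name != "QLOGIN":  # assumption: qlogin sessions carry this name
            continue
        # Pending jobs have no queue column; running ones show queue@host.
        queue = fields[7] if len(fields) > 7 and "@" in fields[7] else None
        jobs.append({"job_id": int(job_id), "user": user,
                     "state": state, "queue": queue})
    return jobs


def list_qlogin_jobs():
    """Run `qstat -u '*'` (jobs of all users) and return the QLOGIN jobs."""
    result = subprocess.run(["qstat", "-u", "*"],
                            capture_output=True, text=True, check=True)
    return parse_qlogin_jobs(result.stdout)
```

Calling list_qlogin_jobs() on a host where qstat works gives one entry per interactive job with its job id, owner, state and queue instance. That narrows things down to the sessions worth inspecting with ps on the respective node, but deciding which of them have lost their terminal is still the open question.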
This entry was tagged gridengine and lazyweb