Manage OOM score of frontend processes
We had a fire in CMS where an increase in the number of jobs in the pool increased the memory used by each of the frontend's worker sub-processes.
This pushed the node into swap and invoked the OOM killer. Unfortunately, after a few rounds, we got unlucky and the OOM killer selected the frontend top-level process as a victim.
No frontend process = no glideins = angry users.
Can we set the oom_adj (on newer kernels, oom_score_adj) explicitly for the parent and child processes so the child process is always selected?
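A minimal sketch of what this could look like, assuming the workers are forked from the top-level process and that the kernel exposes /proc/<pid>/oom_score_adj (range -1000 to 1000). The helper names set_oom_score_adj and spawn_worker, the proc_root parameter, and the value 500 are illustrative only and are not taken from the frontend code:

```python
import os

# Valid range for oom_score_adj on kernels that provide it.
OOM_SCORE_ADJ_MIN = -1000
OOM_SCORE_ADJ_MAX = 1000


def set_oom_score_adj(value, pid="self", proc_root="/proc"):
    """Write `value` to <proc_root>/<pid>/oom_score_adj.

    Hypothetical helper. `proc_root` is a parameter only so the function
    can be exercised against a scratch directory instead of the real /proc.
    """
    if not OOM_SCORE_ADJ_MIN <= value <= OOM_SCORE_ADJ_MAX:
        raise ValueError("oom_score_adj must be between -1000 and 1000")
    path = os.path.join(proc_root, str(pid), "oom_score_adj")
    with open(path, "w") as fd:
        fd.write(str(value))


def spawn_worker(work, child_adj=500):
    """Fork a worker that marks itself as a preferred OOM victim.

    Hypothetical helper; 500 is an arbitrary illustrative value.
    Returns the child's pid in the parent.
    """
    pid = os.fork()
    if pid == 0:
        # Child: raise our own score so we are picked before the parent.
        try:
            set_oom_score_adj(child_adj)
        except OSError:
            pass  # older kernel or restricted /proc; carry on unadjusted
        try:
            work()
        finally:
            os._exit(0)
    return pid
```

The sketch only raises the score in the child, since a process can normally do that without extra privileges, whereas lowering the parent's score generally needs CAP_SYS_RESOURCE. The adjustment is inherited across fork, so it has to be set in the child after the fork rather than in the parent before it.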
#1 Updated by Burt Holzman over 4 years ago
Is this the right thing to do? Killing the children leads to harder-to-understand behavior (the service is running, but not always requesting glideins because the child keeps dying).
I'd actually think about the opposite: always kill the parent, so that it's much clearer what's going on.
Not really convinced either way.