Feature #7815

Manage OOM score of frontend processes

Added by Brian Bockelman over 4 years ago. Updated 12 months ago.

Status: New
Priority: Normal
Category: Frontend
Target version:
Start date: 02/10/2015
Due date:
% Done: 0%
Estimated time:
Stakeholders: CMS
Duration:
Description

We had a fire in CMS where an increase in the number of jobs in the pool caused an increase in the memory used by each worker sub-process of the frontend.

This pushed the node into swap and invoked the OOM killer. Unfortunately, after a few rounds, we got unlucky and the OOM killer selected the frontend top-level process as a victim.

No frontend process = no glideins = angry users.

Can we set the oom_adj (on newer kernels, oom_score_adj) explicitly for the parent and child processes so the child process is always selected?
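As a rough illustration of what this request could look like, here is a minimal Python sketch that writes an adjustment to /proc/<pid>/oom_score_adj and falls back to the older oom_adj file when the newer one is absent. The function name set_oom_adjustment, the proc_root parameter (there only so the code can be exercised against a scratch directory), and the value 500 in the usage note are hypothetical and not part of the frontend code. The conversion to the legacy -17..15 scale is an approximation.

```python
import os

# Documented ranges of the two kernel interfaces.
OOM_SCORE_ADJ_MIN, OOM_SCORE_ADJ_MAX = -1000, 1000   # /proc/<pid>/oom_score_adj
OOM_ADJ_MIN, OOM_ADJ_MAX = -17, 15                   # /proc/<pid>/oom_adj (legacy)


def set_oom_adjustment(score_adj, pid="self", proc_root="/proc"):
    """Write an OOM adjustment for `pid` and return (path_written, value_written).

    Hypothetical helper: prefers oom_score_adj and falls back to the
    legacy oom_adj file with an approximately rescaled value.
    """
    if not OOM_SCORE_ADJ_MIN <= score_adj <= OOM_SCORE_ADJ_MAX:
        raise ValueError("score_adj must be between -1000 and 1000")

    proc_dir = os.path.join(proc_root, str(pid))
    new_path = os.path.join(proc_dir, "oom_score_adj")
    old_path = os.path.join(proc_dir, "oom_adj")

    if os.path.exists(new_path):
        path, value = new_path, score_adj
    else:
        # Approximate rescaling onto the legacy range (assumption, not exact).
        if score_adj == OOM_SCORE_ADJ_MAX:
            value = OOM_ADJ_MAX
        else:
            value = int(score_adj * 17 / 1000)
        value = max(OOM_ADJ_MIN, min(OOM_ADJ_MAX, value))
        path = old_path

    with open(path, "w") as handle:
        handle.write(str(value))
    return path, value
```

Under these assumptions, each worker could call set_oom_adjustment(500) on itself right after the fork, which makes the children more attractive victims than the parent. Raising a process's own score does not normally require privileges, whereas lowering it (for example to protect the parent) requires CAP_SYS_RESOURCE, so raising the children's score is the easier direction for an unprivileged service.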

History

#1 Updated by Burt Holzman over 4 years ago

Is this the right thing to do? Killing the children leads to harder-to-understand behavior (the service is running, but not always requesting glideins since the child keeps dying).
I'd actually think about the opposite: always kill the parent, so that it's much clearer what's going on.

Not really convinced either way.

#2 Updated by Parag Mhashilkar over 4 years ago

  • Assignee set to Marco Mambelli
  • Target version set to v3_3

#3 Updated by Parag Mhashilkar over 4 years ago

  • Priority changed from High to Normal

I agree with Burt: if there are system-wide issues, killing the main frontend process is better than killing a randomly selected child.

#4 Updated by Parag Mhashilkar over 4 years ago

  • Target version changed from v3_3 to v3_2_x

#5 Updated by Marco Mambelli over 1 year ago

  • Target version changed from v3_2_x to v3_4_x

#6 Updated by Marco Mambelli 12 months ago

  • Target version changed from v3_4_x to v3_5_x

