Improve frontend monitoring for Multi Core Glideins
Original request from James Letts ...
Hello Parag, Burt,
So we went off and thought carefully about what we would like to monitor, and how this might differ from what we have currently in the glideinWMS monitor.
The motivation for this is the rollout of multi-CPU applications for both production and analysis workflows. As I understand it, the current glideinWMS frontend monitoring counts glideins and jobs. In a multi-core world, we also need to look at the CPU usage of the glideins and jobs. To get an idea of what we would like, I wrote a simple monitor, mostly as an exercise to understand what we should ask you, which weights glideins by the number of CPUs available to each one. It does not look at jobs, although weighting jobs by CPU count would also be important as we roll out multi-CPU applications.
Our aim of course is to spot problems, which we would usually see as a spike in unmatched resources, especially those that persist over a long period of time. Not weighting by CPU masks the problems at sites with multi-core glideins.
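To illustrate the masking effect, here is a minimal sketch of CPU weighting. It is not the monitor mentioned above. The slot records are made-up sample data; the keys `Cpus` and `State` follow HTCondor slot ClassAd attribute names, and the function names are hypothetical.

```python
from collections import Counter

# Made-up sample data: two busy single-core glideins and one idle 8-core glidein.
# "Cpus" and "State" mirror HTCondor slot ClassAd attribute names.
SAMPLE_SLOTS = [
    {"Name": "slot1@a", "Cpus": 1, "State": "Claimed"},
    {"Name": "slot1@b", "Cpus": 1, "State": "Claimed"},
    {"Name": "slot1@c", "Cpus": 8, "State": "Unclaimed"},
]


def count_by_state(slots, cpu_weighted=False):
    """Sum slots per state, counting each slot as 1 or as its CPU count."""
    totals = Counter()
    for slot in slots:
        weight = slot.get("Cpus", 1) if cpu_weighted else 1
        totals[slot["State"]] += weight
    return dict(totals)


def unclaimed_fraction(slots, cpu_weighted=False):
    """Fraction of the pool (by slot or by CPU) that is unclaimed."""
    totals = count_by_state(slots, cpu_weighted)
    total = sum(totals.values())
    return totals.get("Unclaimed", 0) / total if total else 0.0


if __name__ == "__main__":
    print(count_by_state(SAMPLE_SLOTS))                      # per-glidein view
    print(count_by_state(SAMPLE_SLOTS, cpu_weighted=True))   # per-CPU view
```

With this sample, the unweighted view reports one of three glideins unclaimed, while the weighted view reports 8 of 10 CPUs unclaimed, which is the kind of problem the unweighted plots hide.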
So what we wanted to ask about are these items:
- CPU-weighted plots (high priority): Would it be possible to add CPU-weighted plots to the frontend monitoring? Or perhaps even make this the default view? I think the unweighted view that we have now is still useful, so it would be good to keep it as well.
- Filters (lower priority): We can already filter the plots by frontend group and factory entry. In addition:
  - We would like to be able to select the glideins by type (Partitionable, Dynamic, Static).
  - It would be very useful for operations if we could filter the running and idle jobs by schedd (mainly to identify problematic schedulers).
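The two requested filters could be sketched as follows. This is only an illustration on made-up records, not existing glideinWMS code. `SlotType`, `Cpus`, `JobStatus` (1 = idle, 2 = running) and `RequestCpus` follow HTCondor ClassAd attribute names; the `Schedd` key and the function names are assumptions made for the example.

```python
from collections import defaultdict

# HTCondor JobStatus codes of interest here.
JOB_STATUS = {1: "Idle", 2: "Running"}

# Made-up sample data; the "Schedd" key is a hypothetical field name.
SAMPLE_JOBS = [
    {"Schedd": "schedd-a", "JobStatus": 2, "RequestCpus": 4},
    {"Schedd": "schedd-a", "JobStatus": 1, "RequestCpus": 1},
    {"Schedd": "schedd-b", "JobStatus": 1, "RequestCpus": 4},
    {"Schedd": "schedd-b", "JobStatus": 1, "RequestCpus": 4},
]
SAMPLE_TYPED_SLOTS = [
    {"SlotType": "Partitionable", "Cpus": 8},
    {"SlotType": "Dynamic", "Cpus": 4},
    {"SlotType": "Static", "Cpus": 1},
]


def select_slots(slots, slot_type):
    """Keep only glidein slots of the given type."""
    return [s for s in slots if s.get("SlotType") == slot_type]


def cores_by_schedd(jobs):
    """CPU-weighted idle and running totals, grouped by schedd."""
    table = defaultdict(lambda: {"Idle": 0, "Running": 0})
    for job in jobs:
        status = JOB_STATUS.get(job["JobStatus"])
        if status is None:
            continue  # ignore held, completed, etc.
        table[job["Schedd"]][status] += job.get("RequestCpus", 1)
    return {name: dict(counts) for name, counts in table.items()}


if __name__ == "__main__":
    print(select_slots(SAMPLE_TYPED_SLOTS, "Partitionable"))
    print(cores_by_schedd(SAMPLE_JOBS))
```

In this sample, a schedd with only idle cores and nothing running (`schedd-b`) stands out immediately, which is the use case for spotting problematic schedulers.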
What do you think? Is any of this possible, and on what timescales? The rollout of multi-core applications is upon us, so the first item is a priority for us.
Lastly, I have a question about resource requests in a multi-core world. Do the resource requests that glideinWMS makes take into account the CPU count of the glideins? Or does a given number of idle user jobs trigger the same number of glideins irrespective of the CPU multiplicity? I don’t have hard numbers yet to know if we are over-requesting glideins at sites with multi-core enabled, but I suspect that we might be at certain times. Any guidance or explanation that you could give here would be very useful.
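The concern can be made concrete with a small arithmetic sketch comparing the two hypothetical request policies. This does not describe what glideinWMS actually does, which is exactly the open question; the function names and the `RequestCpus` key (borrowed from the HTCondor job ClassAd) are assumptions for illustration.

```python
import math


def glideins_per_job(idle_jobs):
    """Hypothetical policy: request one glidein per idle job, ignoring cores."""
    return len(idle_jobs)


def glideins_by_cores(idle_jobs, cores_per_glidein):
    """Hypothetical policy: request enough glideins to cover the idle cores.

    This is a lower bound; it ignores fragmentation when jobs do not
    pack evenly into a glidein.
    """
    if cores_per_glidein <= 0:
        raise ValueError("cores_per_glidein must be positive")
    idle_cores = sum(job.get("RequestCpus", 1) for job in idle_jobs)
    return math.ceil(idle_cores / cores_per_glidein)


if __name__ == "__main__":
    # 16 idle single-core jobs at a site whose glideins have 8 cores each.
    idle = [{"RequestCpus": 1} for _ in range(16)]
    print(glideins_per_job(idle))       # 16 glideins, i.e. 128 cores requested
    print(glideins_by_cores(idle, 8))   # 2 glideins, i.e. 16 cores requested
```

If requests followed the first policy at a site with 8-core glideins, 16 idle single-core jobs would ask for eight times more cores than they can use.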