bulk_update_jobs is CPU-bound and not keeping up
When we moved the jobsub_q agent and reporting to pomsgpvm02, CPU usage on the server spiked, mainly from calls to bulk_update_jobs. The issue is that we make all the calls through the ORM to update Job items individually, and the flush/commit is then CPU-bound converting all those Job items into update commands.
So we need to rewrite the code to avoid generating Job objects for all the jobs we're modifying, and use the bulk insert mechanism where possible.
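For illustration, here is a minimal sketch of what the bulk mechanism looks like in SQLAlchemy's ORM. The Job model, its columns, the job ids, and the in-memory SQLite engine are all made up for the example (the real model has more fields), and it assumes SQLAlchemy 1.4 or later. The point is that the rows are plain dicts, so no Job objects are created or tracked by the session.

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class Job(Base):
    # Hypothetical, pared-down stand-in for the real Job model.
    __tablename__ = "jobs"
    job_id = Column(Integer, primary_key=True)
    jobsub_job_id = Column(String, unique=True, nullable=False)
    status = Column(String)


engine = create_engine("sqlite://")  # in-memory database, illustration only
Base.metadata.create_all(engine)

# Made-up job ids; each row is a plain dict keyed by column name.
new_rows = [
    {"jobsub_job_id": "1.0@example", "status": "Idle"},
    {"jobsub_job_id": "2.0@example", "status": "Running"},
]

with Session(engine) as session:
    # Dicts are turned into INSERTs directly; no Job instances are built.
    session.bulk_insert_mappings(Job, new_rows)
    session.commit()

    # The update counterpart needs the primary key in every dict.
    session.bulk_update_mappings(Job, [{"job_id": 1, "status": "Completed"}])
    session.commit()

    statuses = dict(session.query(Job.jobsub_job_id, Job.status).all())
```
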
#1 Updated by Marc Mengel about 3 years ago
- % Done changed from 0 to 90
Most of the changes for this are on the feature/better_bulk_update branch, with a few small patches after that so we don't clear data we already set (i.e. if one update reports task_project='something' and another one reports task_project=None or task_project='', don't take the None/'').
This ends up with this code, which does everything with bulk_insert_mappings() calls, which are hopefully much faster (and use less memory) than the current code, though we won't really know until we finish it up.
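The "don't take the None/''" rule from the patches mentioned above could look roughly like the following. This is a sketch only: merge_update is a hypothetical helper name, not something from the branch, and it assumes each update arrives as a dict of field values.

```python
def merge_update(current, incoming):
    """Overlay incoming on current, ignoring None and '' values."""
    merged = dict(current)
    for key, value in incoming.items():
        if value is None or value == "":
            # Do not clear a field that an earlier update already set.
            continue
        merged[key] = value
    return merged


first = {"task_project": "something", "status": "Idle"}
second = {"task_project": None, "status": "Running"}
third = {"task_project": ""}

result = merge_update(merge_update(first, second), third)
```

With these inputs the status moves to 'Running' while task_project keeps the value from the first update.
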
I have this marked only 90% done because it probably needs some locking code, particularly on the part where we decide what needs an insert; we may need to lock the table between checking and inserting so we don't have two update calls trying to insert duplicate records.
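To make the race concrete, here is a rough sketch of the "decide what needs an insert" step. The function name and dict layout are assumptions for illustration, not the code on the branch. existing_ids stands for the result of a query for jobsub_job_ids already in the table; if a second caller inserts one of those ids between that query and our insert, we get a duplicate key error, which is why the check and the insert need to be covered by the same lock.

```python
def split_inserts_and_updates(rows, existing_ids):
    """Partition row dicts into (inserts, updates) by jobsub_job_id."""
    seen = set(existing_ids)
    inserts, updates = [], []
    for row in rows:
        jid = row["jobsub_job_id"]
        if jid in seen:
            updates.append(row)
        else:
            inserts.append(row)
            # A later row for the same id in this batch must be an update.
            seen.add(jid)
    return inserts, updates


rows = [
    {"jobsub_job_id": "1.0@example", "status": "Running"},
    {"jobsub_job_id": "2.0@example", "status": "Idle"},
    {"jobsub_job_id": "2.0@example", "status": "Running"},
]
inserts, updates = split_inserts_and_updates(rows, existing_ids={"1.0@example"})
```
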
#2 Updated by Marc Mengel about 3 years ago
So for now I'm trying 36cab63, which briefly locks the whole table over a query for existing jobsub_job_ids and the bulk insert, to prevent duplicate key errors. I'd really like to pass the "on conflict do nothing" bits through to the bulk insert, but I don't see a way to do that in SQLAlchemy just yet.
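For reference, SQLAlchemy does expose "on conflict do nothing" at the Core level through its PostgreSQL dialect, though not through bulk_insert_mappings(). The sketch below assumes the database is PostgreSQL and uses a made-up jobs table; it only compiles the statement to show the SQL, and it is not what 36cab63 does.

```python
from sqlalchemy import Column, Integer, MetaData, String, Table
from sqlalchemy.dialects import postgresql
from sqlalchemy.dialects.postgresql import insert

metadata = MetaData()

# Hypothetical table definition; the real one has more columns.
jobs = Table(
    "jobs",
    metadata,
    Column("job_id", Integer, primary_key=True),
    Column("jobsub_job_id", String, unique=True, nullable=False),
    Column("status", String),
)

# PostgreSQL-specific INSERT that silently skips rows whose
# jobsub_job_id already exists.
stmt = insert(jobs).on_conflict_do_nothing(index_elements=["jobsub_job_id"])

# Compile without a database connection, just to inspect the SQL text.
sql = str(stmt.compile(dialect=postgresql.dialect()))
```

Against a live connection, such a statement can be run with a list of dicts, e.g. connection.execute(stmt, rows), which would make the table lock unnecessary for the insert step.
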
#3 Updated by Marc Mengel about 3 years ago
Had a good discussion with Robert; he suggested rather than locking the table, lock any related entities that exist (i.e. just the tasks the processes are associated with) which prevents overlapping updates without locking the whole table...
which as it turns out we're already doing -- we lock the Task rows we've been asked to work on already. So dropping any other locking altogether.
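A row-level lock of that kind is typically a SELECT ... FOR UPDATE on the rows in question. The sketch below uses a made-up tasks table and task ids and assumes PostgreSQL; it is not the actual query in the code, and it only compiles the statement to show the SQL.

```python
from sqlalchemy import Column, Integer, MetaData, String, Table
from sqlalchemy.dialects import postgresql

metadata = MetaData()

# Hypothetical stand-in for the Task table.
tasks = Table(
    "tasks",
    metadata,
    Column("task_id", Integer, primary_key=True),
    Column("status", String),
)

task_ids = [101, 102]  # made-up ids of the tasks named in one update call

# Lock only the matching Task rows until the transaction ends; a second
# update call touching the same tasks waits here instead of overlapping.
stmt = tasks.select().where(tasks.c.task_id.in_(task_ids)).with_for_update()

sql = str(stmt.compile(dialect=postgresql.dialect()))
```

Update calls for unrelated tasks are not blocked, which is the advantage over locking the whole table.
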