Memory overrun from jobsub_q_scraper -> uwsgi on pomsgpvm02
So we ran pomsgpvm02 out of memory yesterday around 16:20.
We ran out of memory so fast that, as best I can see, we didn't have time to log what killed us.
But looking at the idle-jobs graph,
I see that Nova dropped just over 10k new (idle) jobs into the queue just before then, so I suspect a series of full 1k-job bulk_update calls arriving in that short window is what killed us. This is the sort of thing we should be able to test synthetically in development.
In the meantime, I hacked the job_reporter code in production to knock the max batch size down to 256 -- hoping that will help; we may also need an explicit call to the garbage collector on the server side at the end of a bulk_update request...
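Both stopgaps can be sketched in a few lines. This is only an illustration, not the actual job_reporter or POMS server code: the names `batched`, `collect_after`, and the `bulk_update` stand-in are hypothetical.

```python
import gc

# Hypothetical client-side cap: the production hack knocked the max
# batch size per bulk_update call down from 1024 to 256.
MAX_BATCH = 256

def batched(items, size=MAX_BATCH):
    """Yield successive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# Hypothetical server-side wrapper: force a garbage-collection pass
# once a bulk_update request finishes, so its large per-request
# object graph is reclaimed promptly instead of lingering in uwsgi.
def collect_after(handler):
    def wrapper(*args, **kwargs):
        try:
            return handler(*args, **kwargs)
        finally:
            gc.collect()
    return wrapper

@collect_after
def bulk_update(updates):
    # stand-in for the real handler: just report the batch size
    return len(updates)

sizes = [bulk_update(batch) for batch in batched(list(range(600)))]
print(sizes)  # [256, 256, 88]
```

So even if a 10k-job burst arrives, no single request carries more than 256 updates, and each request's garbage is collected before the next one lands.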
#1 Updated by Marc Mengel about 3 years ago
- % Done changed from 0 to 90
Okay, so I wrote a script source:test/bash_test_big_job_batch which, together with source:test/data/mk_condor_q_20k.sh, lets me reproduce the problem. After
making some cleanups in bulk_update that didn't help at all, I found that the problem
was really in wrapup_tasks when 20k or so jobs complete at the same time.
It was issuing a SQLAlchemy query(Task).options(joinedload(Jobs)).options(joinedload(CampaignSnapshots))...
which ended up being totally massive when 20k jobs become completed, blowing our
memory up by roughly 2G resident and 3G virtual. So now I've rewritten big chunks
of wrapup_tasks and things are Much Better.
Still need to retest wrapup_tasks in other situations.
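For the record, chaining joinedload() over two collections is quadratic-ish by construction: the single JOIN materializes the cross product of both collections per parent row, whereas loading the collections in separate queries (e.g. selectinload-style) is additive. The collection sizes below are made up for illustration; only the 20k task count comes from the ticket.

```python
# Why two chained joinedload()s explode: one big JOIN returns
# (jobs x snapshots) rows per task, multiplicatively, while
# separate per-collection queries return (jobs + snapshots)
# rows per task, additively.
tasks = 20_000          # tasks completing at once (from the ticket)
jobs_per_task = 10      # hypothetical collection size
snapshots_per_task = 5  # hypothetical collection size

joined_rows = tasks * jobs_per_task * snapshots_per_task
separate_rows = tasks * (jobs_per_task + snapshots_per_task)

print(joined_rows)    # 1000000 rows materialized by the single JOIN
print(separate_rows)  # 300000 rows across separate loads
```

Even with these modest per-task collection sizes, the joined query hands the ORM over 3x the rows, each carrying the full duplicated Task columns, which lines up with the 2G-resident blowup observed.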