Bug #17641

Memory overrun from jobsub_q_scraper -> uwsgi on pomsgpvm02

Added by Marc Mengel about 3 years ago. Updated about 3 years ago.

Status: Closed
Priority: Normal
Assignee:
Target version:
Start date: 09/05/2017
Due date:
% Done: 100%
Estimated time:
First Occurred:
Scope: Internal
Experiment: -
Stakeholders:
Duration:

Description

We ran pomsgpvm02 out of memory yesterday around 16:20.

As best I can see, memory ran out so fast that we didn't have time to log what killed us.

But looking at the idle jobs graph, I see that Nova dropped just over 10k new (idle) jobs into the queue just before then, so I suspect it was a series of full 1k bulk_update calls arriving in that short time window that killed us. This is the sort of thing we should be able to test synthetically in development.

In the meantime, I hacked the job_reporter code in production to knock the max batch size down to 256, hoping that will help. We might also need to put an explicit call to the garbage collector on the server side at the end of a bulk_update request.
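For illustration only, the two mitigations described above (a smaller batch cap on the reporter side, and a forced garbage collection at the end of each request on the server side) could be sketched roughly as follows. This is not the actual job_reporter or bulk_update code: `batches`, `send_in_batches`, `handle_bulk_update`, and the `send` / `apply_update` callables are hypothetical names, and the previous cap of 1000 is only inferred from the "full 1k" calls mentioned above.

```python
import gc
from typing import Callable, Iterator, List, Sequence

# Reduced cap from the description; the earlier cap is assumed to have been 1000.
MAX_BATCH_SIZE = 256


def batches(items: Sequence[dict], size: int = MAX_BATCH_SIZE) -> Iterator[List[dict]]:
    """Yield consecutive slices of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def send_in_batches(items: Sequence[dict],
                    send: Callable[[List[dict]], None],
                    size: int = MAX_BATCH_SIZE) -> int:
    """Reporter side (hypothetical): call send() once per batch, return the call count."""
    calls = 0
    for batch in batches(items, size):
        send(batch)
        calls += 1
    return calls


def handle_bulk_update(batch: List[dict],
                       apply_update: Callable[[List[dict]], None]) -> None:
    """Server side (hypothetical): apply one batch, then force a collection pass."""
    try:
        apply_update(batch)
    finally:
        # Explicit collection at the end of the request, as suggested above.
        gc.collect()
```

With 10k new jobs, a cap of 256 turns roughly 10 large requests into 40 smaller ones, so each request holds fewer rows in memory at a time, at the cost of more round trips.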

Associated revisions

Revision e46520be (diff)
Added by Marc Mengel about 3 years ago

memory footprint much improved now! issue #17641

History

#1 Updated by Marc Mengel about 3 years ago

  • % Done changed from 0 to 90

Okay, so I wrote a script, source:test/bash_test_big_job_batch, which along with source:test/data/mk_condor_q_20k.sh lets me reproduce the problem. After making some cleanups in bulk_update that didn't help at all, I found that the problem was really in wrapup_tasks, when 20k or so jobs complete at the same time.

It was issuing a SQLAlchemy query(Task).options(joinedload(Jobs)).options(joinedload(CampaignSnapshots)... which ended up being totally massive when 20k jobs become completed, and blew our memory up by basically 2G resident and 3G virtual. So now I've rewritten big chunks of wrapup_tasks and things are Much Better.
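The comment does not say how wrapup_tasks was rewritten, so the following is only a generic sketch of one way to avoid this kind of blowup: let the database decide which tasks are finished and return just their ids, instead of eagerly loading every task together with all its jobs. It uses the standard-library sqlite3 module rather than SQLAlchemy, and the `tasks` / `jobs` tables, their columns, and the function names are hypothetical, not the POMS schema.

```python
import sqlite3
from typing import List


def finished_task_ids(conn: sqlite3.Connection) -> List[int]:
    """Return ids of unfinished tasks whose jobs are all 'Completed'.

    Only integer ids come back, so memory use scales with the number
    of tasks, not with the number of job rows joined to them.
    """
    rows = conn.execute(
        """
        SELECT t.id
          FROM tasks t
          JOIN jobs j ON j.task_id = t.id
         WHERE t.status != 'Completed'
         GROUP BY t.id
        HAVING SUM(CASE WHEN j.status != 'Completed' THEN 1 ELSE 0 END) = 0
         ORDER BY t.id
        """
    )
    return [row[0] for row in rows]


def wrap_up(conn: sqlite3.Connection, chunk: int = 500) -> int:
    """Mark finished tasks as completed in bounded chunks; return how many."""
    ids = finished_task_ids(conn)
    for start in range(0, len(ids), chunk):
        part = ids[start:start + chunk]
        marks = ",".join("?" * len(part))  # one placeholder per id
        conn.execute(
            f"UPDATE tasks SET status = 'Completed' WHERE id IN ({marks})",
            part,
        )
    conn.commit()
    return len(ids)
```

The same idea carries over to an ORM: query for the id column with an aggregate and a HAVING clause, then update by id in chunks, rather than materializing full object graphs.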

Still need to retest wrapup_tasks in other situations.

#2 Updated by Marc Mengel about 3 years ago

  • Target version set to v2_2_1
  • % Done changed from 90 to 100

#3 Updated by Anna Mazzacane about 3 years ago

  • Status changed from New to Resolved

#4 Updated by Anna Mazzacane about 3 years ago

  • Status changed from Resolved to Closed

