Bug #4229

Improve factory performance in writeStats to bring down the loop time to acceptable value.

Added by Parag Mhashilkar over 7 years ago. Updated over 7 years ago.

Status: Closed
Priority: High
Assignee: Parag Mhashilkar
Category: Factory
Start date: 06/27/2013
% Done: 0%

Description

I just looked into the SDSC factory performance, and the results are dismal, to be polite!
Each entry only manages to run one iteration every 10 minutes!

Here is an example (grepping for rrd_stats):
[2013-06-27T06:49:51-07:00 112877] rrd_stats written
[2013-06-27T06:59:41-07:00 1679] rrd_stats written
[2013-06-27T07:08:40-07:00 16410] rrd_stats written
[2013-06-27T07:19:25-07:00 30368] rrd_stats written
[2013-06-27T07:29:36-07:00 43355] rrd_stats written
[2013-06-27T07:39:51-07:00 57594] rrd_stats written
[2013-06-27T07:49:58-07:00 72208] rrd_stats written
[2013-06-27T07:59:35-07:00 88020] rrd_stats written
[2013-06-27T08:10:36-07:00 102382] rrd_stats written
[2013-06-27T08:20:45-07:00 117565] rrd_stats written
[2013-06-27T08:31:24-07:00 4541] rrd_stats written
[2013-06-27T08:42:06-07:00 20515] rrd_stats written
[2013-06-27T08:53:10-07:00 38602] rrd_stats written
[2013-06-27T09:05:15-07:00 55830] rrd_stats written
[2013-06-27T09:15:17-07:00 71594] rrd_stats written
[2013-06-27T09:25:10-07:00 87537] rrd_stats written
[2013-06-27T09:35:14-07:00 104778] rrd_stats written

The main culprit seems to be the monitoring.
Below is an example: it takes about 30s to process all the requests and do the submissions,
but then 10 minutes go into monitoring!!!

This has to be fixed... ASAP.

Thanks,
Igor


Related issues

Related to GlideinWMS - Bug #4276: MASTER: Improve factory performance in writeStats to bring down the loop time to acceptable value (Closed, 07/01/2013)

History

#1 Updated by Parag Mhashilkar over 7 years ago

  • Assignee set to Parag Mhashilkar

Igor,

Burt suggested a temporary workaround: increase the number of glideins submitted per cycle. Is that an option?

Can you provide some info on the timing? I need to know the time spent between these two logging lines in the group logs:

logSupport.log.info("Writing stats for all entries")
....
logSupport.log.info("Sleep %is" % sleep_time)
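The elapsed time between those two lines can be pulled straight out of the group logs. A minimal sketch (all helper names are hypothetical; the log format is taken from the excerpts in this ticket):

```python
import re
from datetime import datetime

# Timestamps prefix every log line, e.g. [2013-06-27T12:54:53-07:00 123153]
TS_RE = re.compile(r'^\[(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})')

def stats_durations(lines):
    """Yield seconds elapsed between each 'Writing stats' line and the
    following 'Sleep ..s' line in a group log."""
    start = None
    for line in lines:
        m = TS_RE.match(line)
        if not m:
            continue
        ts = datetime.strptime(m.group(1), '%Y-%m-%dT%H:%M:%S')
        if 'Writing stats for all entries' in line:
            start = ts
        elif 'Sleep' in line and start is not None:
            yield (ts - start).total_seconds()
            start = None

log = [
    '[2013-06-27T12:54:53-07:00 123153] Writing stats for all entries',
    '[2013-06-27T12:55:24-07:00 123153] Sleep 60s',
]
print(list(stats_durations(log)))  # [31.0]
```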

#2 Updated by Igor Sfiligoi over 7 years ago

Not something we have control over; the FEs specify how many idle glideins they want.

Igor

#3 Updated by Parag Mhashilkar over 7 years ago

Here is the info I could piece together from the group logs for one iteration

Details on Time spent (approx) every iteration

  • Until Finding work (from condor_status + condor_q): 150 sec
  • To actually find work based on info received from condor: 40 sec
  • Performing work (glidein submissions ie condor_submit): 150 sec
  • Advertising entry info (condor_advertise): 60 sec
  • Writing Stats (rrdtool updates) + processing its output from child processes: 120+30 sec
  • Sleep at the end of the iteration: 60 sec
  • Total: 610 sec

So clearly we need to address the issue at several levels.

To get an approx 20% gain with minimum effort:

  • We do not need to sleep for 60 sec in a busy factory any more
  • Combine the individual advertisements to bring the advertisement time down from 60 sec to a few sec
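The combined-advertisement bullet can be sketched using condor_advertise's -multiple flag, which accepts several ClassAds in one file separated by blank lines; the Python helper names below are hypothetical, and `command` stands for whichever HTCondor update command is appropriate for the ad type being sent:

```python
import subprocess
import tempfile

def join_ads(classads):
    """Concatenate classads with blank-line boundaries, the separator
    condor_advertise -multiple expects."""
    return '\n\n'.join(ad.strip() for ad in classads)

# Hypothetical helper: one condor_advertise invocation for all entries
# instead of one fork per entry, reducing both wall time and collector load.
def advertise_all(classads, command):
    with tempfile.NamedTemporaryFile('w', suffix='.classad',
                                     delete=False) as f:
        f.write(join_ads(classads))
        fname = f.name
    subprocess.check_call(['condor_advertise', '-multiple', command, fname])
```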

Next:

  • Reduce the memory footprint of the entry so that we can process more children at a time. That's 190 sec down to whatever number we can get to
  • Make writing stats more efficient
  • See if we can make the condor queries more efficient

Log excerpt from one iteration:
[2013-06-27T12:54:53-07:00 123153] Writing stats for all entries
[2013-06-27T12:54:53-07:00 123153] Number of parallel writes for stats: 24
[2013-06-27T12:54:55-07:00 123153] Processing response from children after write stats
[2013-06-27T12:55:24-07:00 123153] Sleep 60s
[2013-06-27T12:56:24-07:00 123153] Iteration at Thu Jun 27 12:56:24 2013
[2013-06-27T12:59:58-07:00 123153] Finding work
[2013-06-27T13:00:33-07:00 123153] Work tasks grouped by entries using existing factory key
[2013-06-27T13:00:33-07:00 123153] Entry: CMS_T2_UA_KIPT_cream_ce (Tasks: 4)
[2013-06-27T13:00:33-07:00 123153] Entry: Engage_US_VT_valiant (Tasks: 10)
[...]
[2013-06-27T13:00:33-07:00 123153] Entry: CMS_T2_FR_CCIN2P3_cccreamceli08_verylong (Tasks: 4)
[2013-06-27T13:00:33-07:00 123153] Entry: CMSHTPC_T3_US_Omaha_tusker (Tasks: 6)
[2013-06-27T13:00:33-07:00 123153] Entry: GLOW_US_UConn_gluskap_64 (Tasks: 2)
[2013-06-27T13:00:33-07:00 123153] Found 1509 total tasks to work on
[2013-06-27T13:02:56-07:00 123153] All children forked for glideFactoryEntry.check_and_perform_work terminated. Loading post work state for the entry.
[2013-06-27T13:02:56-07:00 123153] Advertising entry: Engage_US_VT_valiant
[2013-06-27T13:02:59-07:00 123153] Advertising entry: CMS_T1_UK_RAL_lcgce02
[2013-06-27T13:03:00-07:00 123153] Advertising entry: OSG_US_SMU_smuosg1
[2013-06-27T13:03:55-07:00 123153] Advertising entry: CMS_T2_IT_Bari_ce2
[...]
[2013-06-27T13:03:56-07:00 123153] Writing stats for all entries
[2013-06-27T13:03:56-07:00 123153] Number of parallel writes for stats: 24
[2013-06-27T13:03:58-07:00 123153] Processing response from children after write stats
[2013-06-27T13:04:27-07:00 123153] Sleep 60s
[2013-06-27T13:05:28-07:00 123153] Iteration at Thu Jun 27 13:05:28 2013
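The "Number of parallel writes for stats: 24" line suggests a bounded pool of stat-writer children. A minimal sketch of that pattern, where `write_entry_stats` is a hypothetical stand-in for the per-entry rrdtool work:

```python
from multiprocessing import Pool

def write_entry_stats(entry_name):
    """Stand-in for the per-entry stats write (rrdtool updates would go
    here); returns the entry name so the parent can process responses."""
    return entry_name

def write_stats(entries, parallel_writes=24):
    """Write stats for all entries with at most `parallel_writes`
    concurrent child processes, mirroring the cap seen in the log."""
    with Pool(processes=parallel_writes) as pool:
        return pool.map(write_entry_stats, entries)
```

Raising `parallel_writes` only helps while the per-child memory footprint stays small enough, which is exactly the "reduce the memory footprint" item above.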

#4 Updated by Igor Sfiligoi over 7 years ago

Parag mis-diagnosed part of the time used.

The initial time (X minutes) is spent mostly in the log cleanup section (LogFiles.cleanup in entry.initIteration).

It should be easy to fork it into a separate process, since it changes no state in memory (only on disk).
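A minimal sketch of that idea, assuming a cleanup callable that only touches disk (all names hypothetical):

```python
import os

def fork_cleanup(cleanup_fn):
    """Run the disk-only log cleanup in a forked child so the parent
    can continue the iteration immediately."""
    pid = os.fork()
    if pid == 0:
        # child: do the cleanup, then exit without running the parent's
        # normal shutdown path
        try:
            cleanup_fn()
        finally:
            os._exit(0)
    return pid  # parent reaps the child later with os.waitpid(pid, 0)
```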

#5 Updated by Parag Mhashilkar over 7 years ago

All the changes described below are now in branch_v2plus_4229

Igor and I exchanged several messages and addressed the following issues:

  • Igor
    • Changed the log cleanup so that now it is forked. Estimated: 150 sec
  • Parag
    • Changed the semantics of loop_delay (sleep_time). Estimated: 60 sec
    • Changed the advertisement of glidefactory & glidefactoryclient classads to a single multi advertise instead of one per entry. This should also help reduce the load on the collector. Estimated: 60 sec
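The loop_delay change can be sketched as treating it as a minimum iteration period rather than an unconditional sleep (helper name hypothetical):

```python
import time

def remaining_sleep(loop_delay, iteration_start, now=None):
    """Sleep only for whatever part of loop_delay the iteration's work
    did not already use; a busy factory therefore sleeps 0 sec."""
    elapsed = (now if now is not None else time.time()) - iteration_start
    return max(0.0, loop_delay - elapsed)
```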

I believe the next biggest gain will come from reducing the memory footprint of the entry forked processes when doing work. This will let us fork more at a time, reducing the total time for the iteration.

#6 Updated by Igor Sfiligoi over 7 years ago

After Parag's changes, the bottleneck was again in the cleanup code; it was taking longer than everything else combined.

So I made two more changes:
  1. Do cleanup only once every half hour. There is really no good reason to do it once a minute.
  2. Spawn four cleanup processes, each looking at its own quarter of the entries.

These changes were deployed on the production OSG factory in San Diego, and now the times are in the 2.5-3.5 minute range.

PS: I have hardcoded both the 30 minute cleanup interval and the 4 forks. We should probably make them configurable parameters. Not sure it is worth doing in v2, though.
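Both changes can be sketched as follows, with the 30 minute interval and 4 forks hardcoded as described (all names hypothetical; making them configurable would just mean reading these constants from the factory config instead):

```python
import time

CLEANUP_INTERVAL = 30 * 60   # seconds; hardcoded as in the comment
CLEANUP_FORKS = 4            # hardcoded as in the comment

def cleanup_due(last_cleanup, now=None):
    """True once at least CLEANUP_INTERVAL has passed since the last
    cleanup, so cleanup runs every half hour instead of every iteration."""
    current = now if now is not None else time.time()
    return (current - last_cleanup) >= CLEANUP_INTERVAL

def partition_entries(entries, nforks=CLEANUP_FORKS):
    """Slice the entry list so fork i cleans entries[i::nforks],
    i.e. each of the 4 workers handles its own quarter."""
    return [entries[i::nforks] for i in range(nforks)]
```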

#7 Updated by Parag Mhashilkar over 7 years ago

Glad we are down to 3 min. Any further gains will not be as easy or of this magnitude. Let it run for some time so we gain more experience.
When I am back, I will clean up the code. There is considerable code duplication, at least in the part I put in place. Once the cleanup is done I will assign it for review and cut a 2.7.2 release.

#8 Updated by Burt Holzman over 7 years ago

Oh boy, there is a metric buttload of code duplication in commit:43ced837. I can try to clean that up while you're on vacation.

#9 Updated by Parag Mhashilkar over 7 years ago

Sure. I was hesitant to change the frontend when I made this quick fix. Master already has the changes I had in mind to remove the duplicated code:

  • move the common code to a new library file, lib/classSupport.py, and move the relevant classad and advertise-manager base classes there
  • remove the duplicated code from the factory and frontend and inherit from the common code in factory/frontend

Of course, if you think there is a better or more pythonic way, I am all ears.
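The proposed layout might look roughly like this; only the module name lib/classSupport.py comes from the comment, while the class and method names are illustrative, not the actual GlideinWMS API:

```python
# lib/classSupport.py (sketch): shared base class so factory and
# frontend no longer duplicate the multi-advertise machinery.

class ClassadAdvertiser:
    """Common classad collection and multi-advertise formatting."""
    def __init__(self):
        self.classads = []

    def add(self, ad):
        self.classads.append(ad)

    def to_multi_file_text(self):
        # blank lines separate ads for condor_advertise -multiple
        return '\n\n'.join(self.classads)

# factory/ and frontend/ would then subclass instead of duplicating:
class FactoryClassadAdvertiser(ClassadAdvertiser):
    adv_type = 'glidefactory'

class FrontendClassadAdvertiser(ClassadAdvertiser):
    adv_type = 'glideclient'
```

Keeping the shared logic in one base class means a fix like the multi-advertise change lands in both daemons at once.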

#10 Updated by Burt Holzman over 7 years ago

Sounds like a good plan to me.

#11 Updated by Burt Holzman over 7 years ago

  • Status changed from New to Assigned

#12 Updated by Parag Mhashilkar over 7 years ago

  • Status changed from Assigned to Closed

Deferred the code cleanup and consolidation to the master branch since we may not have many more v2 releases. Merged the changes to branch_v2plus.
