Improve factory performance in writeStats to bring down the loop time to acceptable value.
I just looked into the SDSC factory performance, and the results are dismal, to be polite!
Each entry only manages to run one iteration every 10 minutes!
Here is an example (grepping for rrd_stats):
[2013-06-27T06:49:51-07:00 112877] rrd_stats written
[2013-06-27T06:59:41-07:00 1679] rrd_stats written
[2013-06-27T07:08:40-07:00 16410] rrd_stats written
[2013-06-27T07:19:25-07:00 30368] rrd_stats written
[2013-06-27T07:29:36-07:00 43355] rrd_stats written
[2013-06-27T07:39:51-07:00 57594] rrd_stats written
[2013-06-27T07:49:58-07:00 72208] rrd_stats written
[2013-06-27T07:59:35-07:00 88020] rrd_stats written
[2013-06-27T08:10:36-07:00 102382] rrd_stats written
[2013-06-27T08:20:45-07:00 117565] rrd_stats written
[2013-06-27T08:31:24-07:00 4541] rrd_stats written
[2013-06-27T08:42:06-07:00 20515] rrd_stats written
[2013-06-27T08:53:10-07:00 38602] rrd_stats written
[2013-06-27T09:05:15-07:00 55830] rrd_stats written
[2013-06-27T09:15:17-07:00 71594] rrd_stats written
[2013-06-27T09:25:10-07:00 87537] rrd_stats written
[2013-06-27T09:35:14-07:00 104778] rrd_stats written
The main culprit seems to be the monitoring;
below is an example; it takes about 30s to do the processing of all the requests and do the submissions,
but then 10 minutes in monitoring!!!
This has to be fixed... ASAP.
#1 Updated by Parag Mhashilkar over 7 years ago
- Assignee set to Parag Mhashilkar
Burt suggested temp work around to increase the number of glideins submitted per cycle. Is that an option?
Can you provide some info on the timing. I need to know time spent between these two logging lines in the group logs
logSupport.log.info("Writing stats for all entries") .... logSupport.log.info("Sleep %is" % sleep_time)
#3 Updated by Parag Mhashilkar over 7 years ago
Here is the info I could piece together from the group logs for one iteration
Details on Time spent (approx) every iteration
- Until Finding work (from condor_status + condor_q): 150 sec
- To actually find work based on info received from condor: 40 sec
- Performing work (glidein submissions ie condor_submit): 150 sec
- Advertising entry info (condor_advertise): 60 sec
- Writing Stats (rrdtool updates) + processing its output from child processes: 120+30 sec
- Sleep at the end of the iteration: 60 sec
- Total: 610 sec
So clearly we need to settle on addressing the issue at different levels
To get approx 20% gain with minimum effort
- We do not need to sleep for 60 sec in a busy factory any more
- Combine the individual advertisements to get down the advertisement time from 60 sec to few sec
- Reduce the memory footprint of the entry so that we can processes more children at a time. Thats 190 sec down to whatever number we can get to
- Make writing stats more efficient
- See if we can make condor queries more efficient if possible
[2013-06-27T12:54:53-07:00 123153] Writing stats for all entries [2013-06-27T12:54:53-07:00 123153] Number of parallel writes for stats: 24 [2013-06-27T12:54:55-07:00 123153] Processing response from children after write stats [2013-06-27T12:55:24-07:00 123153] Sleep 60s [2013-06-27T12:56:24-07:00 123153] Iteration at Thu Jun 27 12:56:24 2013 [2013-06-27T12:59:58-07:00 123153] Finding work [2013-06-27T13:00:33-07:00 123153] Work tasks grouped by entries using existing factory key [2013-06-27T13:00:33-07:00 123153] Entry: CMS_T2_UA_KIPT_cream_ce (Tasks: 4) [2013-06-27T13:00:33-07:00 123153] Entry: Engage_US_VT_valiant (Tasks: 10) [...] [2013-06-27T13:00:33-07:00 123153] Entry: CMS_T2_FR_CCIN2P3_cccreamceli08_verylong (Tasks: 4) [2013-06-27T13:00:33-07:00 123153] Entry: CMSHTPC_T3_US_Omaha_tusker (Tasks: 6) [2013-06-27T13:00:33-07:00 123153] Entry: GLOW_US_UConn_gluskap_64 (Tasks: 2) [2013-06-27T13:00:33-07:00 123153] Found 1509 total tasks to work on [2013-06-27T13:02:56-07:00 123153] All children forked for glideFactoryEntry.check_and_perform_work terminated. Loading post work state for the entry. [2013-06-27T13:02:56-07:00 123153] Advertising entry: Engage_US_VT_valiant [2013-06-27T13:02:59-07:00 123153] Advertising entry: CMS_T1_UK_RAL_lcgce02 [2013-06-27T13:03:00-07:00 123153] Advertising entry: OSG_US_SMU_smuosg1 [2013-06-27T13:03:55-07:00 123153] Advertising entry: CMS_T2_IT_Bari_ce2 [...] [2013-06-27T13:03:56-07:00 123153] Writing stats for all entries [2013-06-27T13:03:56-07:00 123153] Number of parallel writes for stats: 24 [2013-06-27T13:03:58-07:00 123153] Processing response from children after write stats [2013-06-27T13:04:27-07:00 123153] Sleep 60s [2013-06-27T13:05:28-07:00 123153] Iteration at Thu Jun 27 13:05:28 2013
#4 Updated by Igor Sfiligoi over 7 years ago
Parag mis-diagnosed part of the time used.
The initial time (X minutes) is spent mostly in the log cleanup section (LogFiles.cleanup in entry.initIteration).
So it should be easily forked into a different process, since it changes no state in memory (only on disk).
#5 Updated by Parag Mhashilkar over 7 years ago
All the changes described below are now in branch_v2plus_4229
Me and Igor exchanged several messages and addressed following issues -
- Changed the log cleanup so that now it is forked. Estimated: 150 sec
- Changed the semantic for loop_delay (sleep_time). Estimated: 60 sec * Changed advertisement of glidefactory & glidefactoryclient classads into multi advertise instead of per entry. This should also help reduce the load on the collector. Estimated: 60 sec
I believe next biggest gain will come from reducing the memory footprint of the entry forked processes when doing work. This will let us fork more at a time reducing the total time for the iteration.
#6 Updated by Igor Sfiligoi over 7 years ago
After Parag's changes, the bottleneck was again in the cleanup code; it was taking longer than everything else combined.So I made two more changes:
- Do cleanup only once every half hour. There is really no good reason to do it once a minute.
- Spawn four cleanup process, each looking at its quarter of the entries.
These changes were deployed on the production OSG factory in San Diego, and now the time are in the 2.5-3.5 minute range.
PS: I have hardcoded both the 30 minute cleanup time, as well as the 4 forks. We should probably make it a configurable parameter. Not sure it is worth doing it in v2, though.
#7 Updated by Parag Mhashilkar over 7 years ago
Glad we are down to 3 min. Any further gains will not be that easy or of this magnitude. Let it run for some time before we get more experience.
When I am back, I will clean up the code. There is considerable code duplication atleast in the part which I put in place. Once cleanup is done will assign for review and put a 2.7.2 release.
#9 Updated by Parag Mhashilkar over 7 years ago
Sure. I was hesitant to change the frontend when I made this quick fix. Master already has the changes I had in mind to remove the duplicate code.
- move the common code to a new library file, lib/classSupport.py and move the relevant base classes classads and advertisemanagers there
- remove the duplicated code from the factory and frontend and inherit from the common code in factory/frontend.
Of course if you think there is a better or more pythonic way I am all ears