Project

General

Profile

Bug #1200

expired/renewed proxy creates error in completed_jobs reporting

Added by Dennis Box over 8 years ago. Updated almost 8 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Douglas Strain
Category:
-
Target version:
Start date:
04/14/2011
Due date:
% Done:

90%

Estimated time:
Spent time:
First Occurred:
Occurs In:
Stakeholders:
Duration:

Description

Hello GlideinWMS Team,

Today we discovered an error in which it appears that glideins that have already terminated are appearing multiple times later in the completed_jobs logs (see below output). The problem does not only happen here but must be affecting whatever the completed_jobs logs get their info from. We suspect the equivalent Log_Completed rrd's also have the duplicated information because our analyze_entries tool is also reporting duplicates.

The case that made this apparent today was due to a glidein that failed because of it's proxy expiring. The glidein really failed at 9:05 am PST yesterday the 11th. Since then the proxy has been renewed. However if you see below you will see that the same glidein is reported to have finished 6 more times today.

This caused confusion because both the completed_jobs log and the analyze_entries email continued to report startup errors today when really no glideins have failed since the proxy was renewed yesterday at 11:30 am PST. Also the actual job error log is timestamped with the original time:

$ ls l /home/gfactory/glideinsubmit/glidein_Production_v4_0/client_log/user_fehcc/entry_CMS_T2_US_UCSD_gw2/job.36506.2.err
-rw-r--r-
1 fehcc fehcc 8558 Apr 11 09:05 /home/gfactory/glideinsubmit/glidein_Production_v4_0/client_log/user_fehcc/entry_CMS_T2_US_UCSD_gw2/job.36506.2.err

Can you please have a look and see why this might be happening? In the meantime I will try and come up with a way to consistently reproduce the problem.

It is urgent that this is addressed soon because we rely on dependable monitoring information for the daily operations.

Thank you,
Jeff Dost
OSG Glidein Factory Operations

$ grep 36506.002 log/entry_CMS_T2_US_UCSD_gw2/completed_jobs_2011041*.log
log/entry_CMS_T2_US_UCSD_gw2/completed_jobs_20110411.log:<job terminated="2011-04-11T15:44:25-07:00" client="hcc-glidein_frontend" username="fehcc" id="36506.002" duration="1201" condor_started="False" condor_duration="0"><user jobsnr="0" duration="0" goodput="0" terminated="0"/><wastemill validation="1000" idle="0" nosuccess="0" badput="1000"/></job>
log/entry_CMS_T2_US_UCSD_gw2/completed_jobs_20110411.log:<job terminated="2011-04-11T19:44:37-07:00" client="hcc-glidein_frontend" username="fehcc" id="36506.002" duration="1201" condor_started="False" condor_duration="0"><user jobsnr="0" duration="0" goodput="0" terminated="0"/><wastemill validation="1000" idle="0" nosuccess="0" badput="1000"/></job>
log/entry_CMS_T2_US_UCSD_gw2/completed_jobs_20110412.log:<job terminated="2011-04-12T05:30:04-07:00" client="hcc-glidein_frontend" username="fehcc" id="36506.002" duration="1201" condor_started="False" condor_duration="0"><user jobsnr="0" duration="0" goodput="0" terminated="0"/><wastemill validation="1000" idle="0" nosuccess="0" badput="1000"/></job>
log/entry_CMS_T2_US_UCSD_gw2/completed_jobs_20110412.log:<job terminated="2011-04-12T10:13:57-07:00" client="hcc-glidein_frontend" username="fehcc" id="36506.002" duration="1201" condor_started="False" condor_duration="0"><user jobsnr="0" duration="0" goodput="0" terminated="0"/><wastemill validation="1000" idle="0" nosuccess="0" badput="1000"/></job>
log/entry_CMS_T2_US_UCSD_gw2/completed_jobs_20110412.log:<job terminated="2011-04-12T11:44:23-07:00" client="hcc-glidein_frontend" username="fehcc" id="36506.002" duration="1201" condor_started="False" condor_duration="0"><user jobsnr="0" duration="0" goodput="0" terminated="0"/><wastemill validation="1000" idle="0" nosuccess="0" badput="1000"/></job>
log/entry_CMS_T2_US_UCSD_gw2/completed_jobs_20110412.log:<job terminated="2011-04-12T12:29:43-07:00" client="hcc-glidein_frontend" username="fehcc" id="36506.002" duration="1201" condor_started="False" condor_duration="0"><user jobsnr="0" duration="0" goodput="0" terminated="0"/><wastemill validation="1000" idle="0" nosuccess="0" badput="1000"/></job>

History

#1 Updated by Dennis Box over 8 years ago

  • Assignee changed from Dennis Box to Douglas Strain

#2 Updated by Douglas Strain over 8 years ago

  • Status changed from New to Closed
  • % Done changed from 0 to 90

I believe I have found the completed jobs duplication error. I will be committing it shortly into the v2_plus branch. I have tested it in my local cluster as well as tested it using the logs from a problematic directory at UCSD, but I don't have the breadth of groups/users as the UCSD factory, so please let me know your results once you try to integrate it into the factory there.

The problem is this:
- The log statistics are indexed by client name and username.
- If the client has two groups, (for example, with one group having no jobs submitted to the entry point), then when the second group advertises, then the stats are replaced. Then the first group advertises the next iteration, it sees that all the stats from the first group are gone. Therefore, they must all be newly completed! It then reports all the jobs as completed (again)
- So, whenever the groups statistics thrash, the jobs are reported again. For whatever reason, this seems to happen about once every 2-4 hours for most entry points that have two groups.

I fixed it with the following:
- The log_stats are now indexed by client name and "username:client_int_name" e.g. "ferenci:EngageVO-submit3-v1_0.main"
- Since this include the group ".main", if the group changes, it is indexed to a new entry in the dictionary and does not mess up the diff between the old and new stats.
- When I print out the completed jobs, I strip the end off and just print out the username.
I don't think the username is used anywhere else (the rest is summary statistics), but, if it is, there may be additional code changes needed.

#3 Updated by Douglas Strain almost 8 years ago

  • Target version set to v2_5_3


Also available in: Atom PDF