Bug #24516
Confusing verification script
Start date:
06/04/2020
Due date:
% Done:
0%
Estimated time:
First Occurred:
Occurs In:
Stakeholders:
Description
Mirica installed a new factory 3.6.2 from scratch.
Condor was not starting and she got this error.
It is at least misleading.
[root@fermicloud044 condor]# systemctl status gwms-factory.service
● gwms-factory.service - GWMS Factory Service
   Loaded: loaded (/usr/lib/systemd/system/gwms-factory.service; disabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Thu 2020-06-04 18:09:21 CDT; 16s ago
     Docs: http://glideinwms.fnal.gov/doc.prd/factory/index.html
  Process: 4898 ExecStart=/usr/sbin/gwms-factory start --check_35_ready (code=exited, status=150)
Jun 04 18:09:21 fermicloud044.fnal.gov gwms-factory[4898]: RET = main() # capital letters used because pylint considers this a constant
Jun 04 18:09:21 fermicloud044.fnal.gov gwms-factory[4898]: File "/usr/bin/fact_chown_check", line 52, in main
Jun 04 18:09:21 fermicloud044.fnal.gov gwms-factory[4898]: coll_query = htcondor.Collector().locateAll(htcondor.DaemonTypes.Schedd)
Jun 04 18:09:21 fermicloud044.fnal.gov gwms-factory[4898]: IOError: Failed communication with collector.
Jun 04 18:09:21 fermicloud044.fnal.gov gwms-factory[4898]: The Factory is not ready for 3.5.x. Please run /usr/bin/fact_chown_check --verbo...tails.
Jun 04 18:09:21 fermicloud044.fnal.gov systemd[1]: gwms-factory.service: control process exited, code=exited status=150
Jun 04 18:09:21 fermicloud044.fnal.gov gwms-factory[4898]: To disable this check remove the --check_35_ready option from the gwms-factory.s...AILED]
Jun 04 18:09:21 fermicloud044.fnal.gov systemd[1]: Failed to start GWMS Factory Service.
Jun 04 18:09:21 fermicloud044.fnal.gov systemd[1]: Unit gwms-factory.service entered failed state.
Jun 04 18:09:21 fermicloud044.fnal.gov systemd[1]: gwms-factory.service failed.
Hint: Some lines were ellipsized, use -l to show in full.
[root@fermicloud044 condor]#
A brand new install should never get this error.
In addition, fact_chown_check is not documented (I was unable to find any documentation), and when run as root it printed:
Directory /var/log/gwms-factory/client/user_frontend/glidein_gfactory_instance is owned by user with id 43680, while the user running this process is 0
Please, make sure to run the fact_chown script. More details at https://glideinwms.fnal.gov/doc.v3_5_1/factory/configuration.html#single_user
And as gfactory:
-bash-4.2$ fact_chown_check
Traceback (most recent call last):
  File "/usr/kerberos/bin/fact_chown_check", line 100, in <module>
    RET = main() # capital letters used because pylint considers this a constant
  File "/usr/kerberos/bin/fact_chown_check", line 52, in main
    coll_query = htcondor.Collector().locateAll(htcondor.DaemonTypes.Schedd)
IOError: Failed communication with collector.
The stack trace is confusing; a message saying that fact_chown_check needs condor to be running would be more operator friendly.
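One possible shape for such a message, as a minimal sketch only: the collector call is the one shown in the traceback, while the function names (query_collector, locate_schedds_or_explain) and the wording of the hint are hypothetical and not taken from the actual fact_chown_check source. It assumes the failure surfaces as IOError, as in the traceback above.

```python
import sys


def query_collector():
    """Run the collector query shown in the traceback (needs the htcondor bindings)."""
    import htcondor  # imported lazily so the sketch loads without the bindings
    return htcondor.Collector().locateAll(htcondor.DaemonTypes.Schedd)


def locate_schedds_or_explain(query=query_collector, err=sys.stderr):
    """Return the query result, or None after printing an operator-friendly hint."""
    try:
        return query()
    except IOError as exc:
        # Hypothetical wording; the real message is up to the developers.
        err.write(
            "Unable to contact the HTCondor collector (%s).\n"
            "fact_chown_check needs HTCondor to be running; "
            "start condor and run the check again.\n" % exc
        )
        return None
```

With something along these lines, the caller could exit with a non-zero status when None is returned, so the operator sees the hint instead of the stack trace.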
History
#1 Updated by Marco Mambelli 8 months ago
- Description updated
#2 Updated by Marco Mambelli 4 months ago
- Target version changed from v3_6_4 to v3_6_5
#3 Updated by Marco Mambelli 3 months ago
- Target version changed from v3_6_5 to v3_6_6
#4 Updated by Marco Mambelli about 1 month ago
- Target version changed from v3_6_6 to v3_6_7