Bug #24516

Updated by Marco Mambelli 2 months ago

Mirica installed a new factory 3.6.2 from scratch.
Condor was not starting and she got this error.
The message is misleading at best.
<pre>
[root@fermicloud044 condor]# systemctl status gwms-factory.service
● gwms-factory.service - GWMS Factory Service
Loaded: loaded (/usr/lib/systemd/system/gwms-factory.service; disabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Thu 2020-06-04 18:09:21 CDT; 16s ago
Docs: http://glideinwms.fnal.gov/doc.prd/factory/index.html
Process: 4898 ExecStart=/usr/sbin/gwms-factory start --check_35_ready (code=exited, status=150)
Jun 04 18:09:21 fermicloud044.fnal.gov gwms-factory[4898]: RET = main() # capital letters used because pylint considers this a constant
Jun 04 18:09:21 fermicloud044.fnal.gov gwms-factory[4898]: File "/usr/bin/fact_chown_check", line 52, in main
Jun 04 18:09:21 fermicloud044.fnal.gov gwms-factory[4898]: coll_query = htcondor.Collector().locateAll(htcondor.DaemonTypes.Schedd)
Jun 04 18:09:21 fermicloud044.fnal.gov gwms-factory[4898]: IOError: Failed communication with collector.
Jun 04 18:09:21 fermicloud044.fnal.gov gwms-factory[4898]: The Factory is not ready for 3.5.x. Please run /usr/bin/fact_chown_check --verbo...tails.
Jun 04 18:09:21 fermicloud044.fnal.gov systemd[1]: gwms-factory.service: control process exited, code=exited status=150
Jun 04 18:09:21 fermicloud044.fnal.gov gwms-factory[4898]: To disable this check remove the --check_35_ready option from the gwms-factory.s...AILED]
Jun 04 18:09:21 fermicloud044.fnal.gov systemd[1]: Failed to start GWMS Factory Service.
Jun 04 18:09:21 fermicloud044.fnal.gov systemd[1]: Unit gwms-factory.service entered failed state.
Jun 04 18:09:21 fermicloud044.fnal.gov systemd[1]: gwms-factory.service failed.
Hint: Some lines were ellipsized, use -l to show in full.
[root@fermicloud044 condor]#
</pre>

A brand new install should never get this error.
In addition, fact_chown_check is not documented (I was unable to find any documentation), and when run as root it printed:
<pre>
Directory /var/log/gwms-factory/client/user_frontend/glidein_gfactory_instance is owned by user with id 43680, while the user running this process is 0
Please, make sure to run the fact_chown script. More details at https://glideinwms.fnal.gov/doc.v3_5_1/factory/configuration.html#single_user
</pre>
And as gfactory:
<pre>
-bash-4.2$ fact_chown_check
Traceback (most recent call last):
File "/usr/kerberos/bin/fact_chown_check", line 100, in <module>
RET = main() # capital letters used because pylint considers this a constant
File "/usr/kerberos/bin/fact_chown_check", line 52, in main
coll_query = htcondor.Collector().locateAll(htcondor.DaemonTypes.Schedd)
IOError: Failed communication with collector.
</pre>
The stack trace is confusing; a message saying that it needs condor to be running would be more operator-friendly.
