Common errors with Managed Proxies Service

In the Managed Proxy Service (Service Certificate Management), the User Support for Distributed Computing group (USDC, a.k.a. FIFE Support) maintains service certificates on behalf of experiments, periodically generates VOMS proxies from these certificates, and pushes them to the experiments' interactive nodes. Due to a recent procedural change within the Scientific Computing Division, the USDC group will no longer open tickets on behalf of experiments when it sees an issue with an interactive node (e.g. a node is down). This ensures that experiment activities on these nodes are not disrupted unintentionally. It is up to experiment representatives to open Service Desk tickets for any interactive node-related issues.

To help experiments with this, the USDC group now sends notifications to experiments whenever there is an error with any of their nodes (or with the service as it relates to the experiment). We hope these notifications help experiments understand potential issues with their interactive nodes and alert them to ongoing issues that need to be addressed. Below are the most common errors we've seen with the service, along with the recommended action for each. We'll update this list as other common errors appear.

In general, though, our advice is to try to log into the node. If that fails, open a ServiceNow ticket to the Scientific Server Infrastructure group to investigate. Otherwise, open a ticket to Distributed Computing Support (USDC).
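The general advice above can be sketched as a small helper. This is only an illustration, not part of the Managed Proxy Service: the function names and queue strings are hypothetical, and checking whether TCP port 22 accepts a connection is a rough stand-in for actually logging in (a node can accept connections and still refuse logins).

```python
import socket

# Hypothetical labels for the two support groups named in the text.
SSI_QUEUE = "Scientific Server Infrastructure"
USDC_QUEUE = "Distributed Computing Support (USDC)"


def ssh_port_reachable(host, port=22, timeout=5.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    This only shows that something is listening on the port; it does not
    prove that an interactive login would succeed.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers timeouts, refused connections, and DNS failures
        return False


def ticket_queue(can_log_in):
    """Pick the group to ticket, following the general advice in the text."""
    return USDC_QUEUE if can_log_in else SSI_QUEUE
```

For example, `ticket_queue(ssh_port_reachable("exptgpvm01"))` would suggest a group for the placeholder node name used in the log excerpts on this page.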

One note before we go into the errors: this service runs every four hours. If you get a notification and can't address it (or ignore it) for four hours, and you do not get another notification four hours after the first, the issue has most likely resolved itself.
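As a rough sketch of that rule of thumb, the check below compares the time of the last notification against the four-hour run interval. The function name and the 15-minute slack (to allow for the run and the notification taking a little time) are assumptions made for illustration, not values taken from the service.

```python
from datetime import datetime, timedelta

# From the text: the service runs every four hours.
RUN_INTERVAL = timedelta(hours=4)


def likely_resolved(last_notification, now, slack=timedelta(minutes=15)):
    """Return True if a full run interval (plus slack) has passed silently.

    `slack` is an assumed allowance for run and delivery delays.
    """
    return now - last_notification > RUN_INTERVAL + slack


first = datetime(2017, 10, 16, 18, 0)
print(likely_resolved(first, datetime(2017, 10, 16, 21, 0)))   # still inside the interval
print(likely_resolved(first, datetime(2017, 10, 16, 22, 30)))  # a run has passed with no new notification
```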

  1. Error: Can't ping a node (unknown reason), so proxy copy fails
    2017-10-16 17:58:42,909 - WARNING - The node exptgpvm01 didn't return a response to ping after 5 seconds.  Please investigate, and see if the node is up. 
    It may be necessary for the experiment to request via a ServiceNow ticket that the Scientific Server Infrastructure group reboot the node. Moving to the next node
    2017-10-16 17:59:20,005 - ERROR - Error copying /path/to/proxy to exptgpvm01. Trying next node
     ssh: connect to host exptgpvm01 port 22: Connection timed out
    lost connection
    
    2017-10-16 17:59:20,006 - WARNING - Node exptgpvm01 didn't respond to pings earlier - so it's expected that copying there would fail.
    
    • Possible Reasons: The node may be down because users are overloading the system, or it simply can't be pinged.
    • Recommended Action: See if you can log into that node to investigate. If you can't, open a ServiceNow ticket to the Scientific Server Infrastructure group to reboot the node or investigate.
  2. Error: Can't ping a node (no route to host, or some other known reason), so proxy copy fails
    2017-10-06 05:58:32,325 - WARNING - The node exptgpvm01 didn't return a response to ping after 5 seconds.  Please investigate, and see if the node is up. 
    It may be necessary for the experiment to request via a ServiceNow ticket that the Scientific Server Infrastructure group reboot the node.  Moving to the next node
    2017-10-06 05:58:47,555 - ERROR - Error copying /path/to/proxy to exptgpvm01. Trying next node
     ssh: connect to host exptgpvm01 port 22: No route to host
    lost connection
    
    2017-10-06 05:58:47,556 - WARNING - Node exptgpvm01 didn't respond to pings earlier - so it's expected that copying there would fail.
    
    • Possible Reasons: The node may be down.
    • Recommended Action: See if you can log into that node to investigate. If you can't (in this case, you probably can't), open a ServiceNow ticket to the Scientific Server Infrastructure group to reboot the node or investigate.
  3. Error: File permission error
    2017-09-29 13:58:53,737 - ERROR - Error changing permission of experiment.proxy to mode 400 on exptgpvm01. Trying next node
     mv: cannot stat `/path/to/experimentproxyfile.new': No such file or directory
    
    • Possible Reasons: An error within the Managed Proxy service itself while pushing the proxy onto the interactive node.
    • Recommended Action: This is usually a transient issue that goes away. The proxies are valid for 12 hours, and we renew them every four, so that gives us a generous grace period. Check the update time listed on the proxy to make sure it's within the last 8 hours (which will ensure that you can use this proxy through the next automated run). Open a ServiceNow ticket to Distributed Computing Support if you're concerned or if there's an immediate need for this to be resolved.
  4. Error: Connection timed out
    2017-08-13 13:59:35,116 - ERROR - Error copying /path/to/proxy to exptgpvm01. Trying next node
     Connection timed out during banner exchange
    lost connection
    
    • Possible Reasons: There may be maintenance going on with the node such that ssh was disabled.
    • Recommended Action: This is usually a transient issue that goes away. Check to make sure there's no downtime occurring. If not, try to ssh to the node. If that doesn't work, open a ticket to the Scientific Server Infrastructure group to investigate. If you can ssh in, check the update time listed on the proxy to make sure it's within the last 8 hours (which will ensure that you can use this proxy through the next automated run). If it isn't, open a ServiceNow ticket to Distributed Computing Support and we'll try to manually re-push the proxy (and investigate if that doesn't work).
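Items 3 and 4 both recommend checking that the proxy was updated within the last 8 hours. The figure follows from the numbers given above: a 12-hour lifetime minus the 4-hour run interval leaves 8 hours. A minimal sketch of that check is shown here. It assumes the "update time" can be read from the proxy file's modification time, which may not match how your experiment inspects the proxy, and the function names are hypothetical.

```python
import os
import time

PROXY_LIFETIME_HOURS = 12  # from the text: proxies are valid for 12 hours
RUN_INTERVAL_HOURS = 4     # from the text: proxies are renewed every four hours
# A proxy updated within this window remains valid through the next run.
MAX_AGE_HOURS = PROXY_LIFETIME_HOURS - RUN_INTERVAL_HOURS


def proxy_age_hours(path, now=None):
    """Hours since the file at `path` was last modified.

    Assumption: the file's mtime reflects when the proxy was last pushed.
    `now` is a Unix timestamp; it defaults to the current time.
    """
    now = time.time() if now is None else now
    return (now - os.path.getmtime(path)) / 3600.0


def proxy_is_fresh(path, now=None):
    """True if the proxy file was updated within the last MAX_AGE_HOURS hours."""
    return proxy_age_hours(path, now) <= MAX_AGE_HOURS
```

Calling `proxy_is_fresh("/path/to/proxy")` on the interactive node (with the real proxy path substituted for the placeholder) gives a quick yes/no answer; if it returns `False`, the proxy may expire before the next automated run.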