Initial DAQ cluster setup checklist

Geoff Savage, 01/27/2020 02:00 PM


Objective: reduce the number of service desk tickets during the initial setup of DAQ development and production clusters.

Networking

  1. define subnets for IPMI, fnal/public and data/daq interfaces
    • How many cables are needed? Shared IPMI/public?
    • Name interfaces by function.
  2. define host names for all network interfaces and make them consistent
    • mydaq-br01, mydaq-eb01, mydaq-br01-ipmi, mydaq-br01-daq
    • the list of host names should be complete, as if all hardware were already installed
      • Reserve a few IPs for the next computer installs
    • put all host names into /etc/hosts and distribute it across all servers
    • How do we automate generation of the hosts file? (A sketch follows this list.)
    • Right now /etc/hosts is managed by Puppet. Remove this?
  3. make a consistent IP address assignment across all subnets
    • use address blocks for the same server roles
    • keep the last octet of the IP address the same across all NICs of the same host
      • Discussion with networking
  4. configure authentication
    • Kerberos for the public interface
    • public key for the data interface
      • The daq user accesses everything over the private network
      • Users testing artdaq will get instructions to set up their own public keys
  5. create instructions for rebooting servers using IPMI (a sketch follows this list)
  6. enable 9000-byte MTU (jumbo) frames on all DAQ interfaces and networking equipment by default
    • Switch configuration by networking.
    • Just on DAQ network, not all interfaces.
    • NFS on public network - jumbo frames for performance? No jumbo on public.
  7. configure and verify that multicasting is enabled and working on all networking equipment
    • Need testing software to verify the configuration (a send/receive sketch follows this list).
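
One answer to the hosts-file automation question in item 2 might be a small generator script. Below is a minimal Python sketch; the subnet prefixes, host names, and last-octet assignments are hypothetical placeholders, not the experiment's real IP plan.

  #!/usr/bin/env python3
  """Generate /etc/hosts entries from a master host list (sketch)."""

  # Hypothetical subnet prefixes; the interface suffixes follow the
  # naming convention in item 2 (e.g. mydaq-br01-ipmi).
  SUBNETS = {
      "": "131.225.80.",       # fnal/public interface
      "-ipmi": "192.168.10.",  # IPMI interface
      "-daq": "192.168.20.",   # data/daq interface
  }

  # Hypothetical host list: (base name, shared last octet per item 3).
  HOSTS = [
      ("mydaq-br01", 11),
      ("mydaq-eb01", 21),
  ]

  def hosts_lines():
      for name, octet in HOSTS:
          for suffix, prefix in SUBNETS.items():
              yield f"{prefix}{octet}\t{name}{suffix}"

  if __name__ == "__main__":
      for line in hosts_lines():
          print(line)

The generated file could then be distributed to all servers (or handed to Puppet), which would also answer the "remove from Puppet?" question in item 2.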
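For the IPMI reboot instructions in item 5, the tooling will most likely be ipmitool. The sketch below wraps it in Python; the -ipmi host-name suffix, the lanplus interface, and the credentials are assumptions.

  #!/usr/bin/env python3
  """Power-cycle a DAQ server via its IPMI interface (sketch)."""
  import subprocess
  import sys

  def power_cycle(host, user="admin", password="CHANGEME"):
      # 'chassis power cycle' is a standard ipmitool subcommand;
      # 'chassis power status' can be used to verify afterwards.
      subprocess.run(
          ["ipmitool", "-I", "lanplus",
           "-H", f"{host}-ipmi", "-U", user, "-P", password,
           "chassis", "power", "cycle"],
          check=True)

  if __name__ == "__main__":
      power_cycle(sys.argv[1])   # e.g. power_cycle("mydaq-br01")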
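For item 7, a minimal send/receive pair can verify multicast across the DAQ network: run the receiver on one host and the sender on another. The group address and port are arbitrary test values; 239.0.0.0/8 is the administratively scoped multicast range.

  #!/usr/bin/env python3
  """Minimal multicast check (sketch): 'recv' on one host, 'send' on another."""
  import socket
  import struct
  import sys

  GROUP, PORT = "239.1.1.1", 5007   # arbitrary test group/port

  def send():
      s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
      s.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 2)
      s.sendto(b"daq multicast test", (GROUP, PORT))

  def recv():
      s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
      s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
      s.bind(("", PORT))
      mreq = struct.pack("4sl", socket.inet_aton(GROUP), socket.INADDR_ANY)
      s.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
      print(s.recvfrom(1024))   # blocks until the test message arrives

  if __name__ == "__main__":
      send() if sys.argv[1] == "send" else recv()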

Users

  1. define a shared user for
    • managing UPS products
    • running daq, dcs, databases
    • Experiments manage .k5logins for the shared accounts
  2. add all people from the RSI group to the /root/.k5login (format example after this list)
  3. add all known daq users to the daq and dcs shared accounts
  4. shared user profiles are not expected to have any customizations
  5. Control room accounts - shared
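
For reference, a .k5login file contains one Kerberos principal per line; the principals below are hypothetical examples in the FNAL.GOV realm.

  alice@FNAL.GOV
  bob@FNAL.GOV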

Storage areas

  1. set up a reliable NFS server for /home, /daq/products, /daq/database, /daq/log, /daq/tmp, ..., /data, /scratch, /daq/backup
    • No mounts from the lab's central storage or pnfs.
  2. reserve adequate disk space for each area
  3. create a designated scratch area for doing builds on a local NVMe drive, preferably on the fastest server
    • a fast NVMe drive such as a Samsung 970 Pro or better is preferred
  4. set up a nightly backup for /home and a weekly backup for the /daq/backup areas
  5. NFS performance should be monitored (a rough throughput probe sketch follows this list)
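
A rough sketch of the kind of check item 5 implies: time a large sequential write on an NFS-mounted area and report throughput. The mount point and file size are assumptions, and a real setup would feed the result to the monitoring system instead of printing it.

  #!/usr/bin/env python3
  """Rough NFS throughput probe (sketch): write N MB and report MB/s."""
  import os
  import time

  MOUNT = "/daq/tmp"   # assumed NFS-mounted area
  SIZE_MB = 256

  def write_throughput(path=MOUNT, size_mb=SIZE_MB):
      target = os.path.join(path, "nfs_probe.bin")
      block = b"\0" * (1024 * 1024)
      start = time.monotonic()
      with open(target, "wb") as f:
          for _ in range(size_mb):
              f.write(block)
          f.flush()
          os.fsync(f.fileno())   # force the data out to the server
      elapsed = time.monotonic() - start
      os.remove(target)
      return size_mb / elapsed

  if __name__ == "__main__":
      print(f"{write_throughput():.1f} MB/s to {MOUNT}")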

Software

  1. any base software, such as the OS and productivity RPMs, should be identical on all servers (see the comparison sketch after this list)
  2. the default list of installed software packages should not impede development/testing work, e.g. emacs, vim, mc, tmux, perf, iperf, strace, dstat, ...; VNC/MATE should be installed by default
  3. implement system monitoring using Ganglia or similar software
  4. Support for MOSH?
    • We should try it. Might be blocked by ACLs.
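
To check item 1, something like the sketch below could diff the installed RPM set across servers over ssh. The host names are placeholders, and passwordless (e.g. Kerberos) ssh access to each host is assumed.

  #!/usr/bin/env python3
  """Compare installed RPM sets across servers (sketch)."""
  import subprocess

  HOSTS = ["mydaq-br01", "mydaq-eb01"]   # placeholder host list

  def package_set(host):
      out = subprocess.run(["ssh", host, "rpm", "-qa"],
                           capture_output=True, text=True, check=True)
      return set(out.stdout.split())

  if __name__ == "__main__":
      reference = package_set(HOSTS[0])
      for host in HOSTS[1:]:
          diff = reference ^ package_set(host)   # symmetric difference
          print(f"{host}: {len(diff)} packages differ from {HOSTS[0]}")
          for pkg in sorted(diff):
              print("  ", pkg)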

System Services

  1. Optional: DNS, Kerberos, NIS, Supervisord, InfluxDB, Prometheus.
  2. Ganglia, Graphite, Grafana
    • system monitoring - check_mk, Netdata
    • singularity container to distribute monitoring software
    • graphite/grafana - part of standard installation
  3. Disable SELinux enforcing - run in permissive mode
  4. Disable the firewall on private networks

Geoff

  • Turn off checking of RAID arrays.
  • RAID arrays must be RAID 10? You lose half the disk capacity?
  • Do we really need a hosts file?
    • If we use a hosts file we should use a script to create it (see the sketch in the Networking section).
  • NTP from Fermilab servers works well. No need for an experiment NTP server.
  • Who is in the RSI group?
  • Use Ansible to verify that the settings from Puppet are correct?
  • Buffer sizes in network switches
  • Database computer specs