Initial DAQ cluster setup checklist.
Objective: To reduce the number of service desk tickets during the initial setup of DAQ development / production clusters.
- define subnets for IPMI, fnal/public and data/daq interfaces
- How many cables are needed? Shared IPMI/public?
- Name interfaces by function.
- define host names for all network interfaces and make them consistent
- mydaq-br01, mydaq-eb01, mydaq-br01-ipmi, mydaq-br01-daq
- the list of host names should be complete as if all hardware is available
- Reserve a few IPs for the next computer installs
- put all host names into /etc/hosts and distribute it across all servers
- How do we automate generation of the hosts file?
- Right now /etc/hosts is managed by Puppet. Remove this?
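One possible answer to the hosts file automation question is a small script that renders the file from a single inventory. The sketch below is only an illustration: the function name `render_hosts`, the inventory format, and the 10.0.x.x addresses are assumptions, not part of the current setup. It rejects malformed addresses and duplicate entries before writing anything.

```python
from ipaddress import IPv4Address


def render_hosts(entries):
    """Render /etc/hosts text from (ip_string, hostname) pairs.

    Raises ValueError on a malformed IPv4 address or a duplicate IP/name.
    """
    seen_ips, seen_names = set(), set()
    rows = []
    for ip_text, name in entries:
        ip = IPv4Address(ip_text)  # ValueError if the address is malformed
        if ip in seen_ips or name in seen_names:
            raise ValueError(f"duplicate entry: {ip_text} {name}")
        seen_ips.add(ip)
        seen_names.add(name)
        rows.append((ip, name))
    rows.sort()  # numeric order by address, then by name
    body = "".join(f"{ip}\t{name}\n" for ip, name in rows)
    return "127.0.0.1\tlocalhost\n" + body


if __name__ == "__main__":
    # Placeholder inventory using the naming pattern from this checklist.
    inventory = [
        ("10.0.2.11", "mydaq-br01-daq"),
        ("10.0.1.11", "mydaq-br01"),
        ("10.0.3.11", "mydaq-br01-ipmi"),
    ]
    print(render_hosts(inventory), end="")
```

The generated text could then be distributed by whatever mechanism ends up owning /etc/hosts.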
- make a consistent IP address assignment across all subnets
- use address blocks for the same server roles
- make the last octet of the IP address the same across all NICs of the same host
- Discussion with networking
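The address assignment rules above (one address block per server role, same last octet on every NIC of a host) can be sketched as follows. The subnets, the role block start values, and the function name `plan_addresses` are hypothetical placeholders; the real values would come out of the discussion with networking.

```python
from ipaddress import IPv4Network

# Hypothetical subnets, keyed by the host-name suffix of each interface.
SUBNETS = {
    "": IPv4Network("10.0.1.0/24"),       # public interface (placeholder)
    "-daq": IPv4Network("10.0.2.0/24"),   # data/daq interface (placeholder)
    "-ipmi": IPv4Network("10.0.3.0/24"),  # IPMI interface (placeholder)
}

# Hypothetical role blocks: the last octet at which each role's block starts.
ROLE_BASE = {"br": 10, "eb": 30}


def plan_addresses(cluster, role, index):
    """Return {hostname: IPv4Address} for every interface of one server.

    The last octet is ROLE_BASE[role] + index on every subnet, so all NICs
    of a host share it.
    """
    octet = ROLE_BASE[role] + index
    if not 0 < octet < 255:
        raise ValueError(f"last octet {octet} does not fit in a /24")
    name = f"{cluster}-{role}{index:02d}"
    return {
        name + suffix: net.network_address + octet
        for suffix, net in SUBNETS.items()
    }


if __name__ == "__main__":
    for host, addr in plan_addresses("mydaq", "br", 1).items():
        print(addr, host)
```

Planning addresses for the full list of host names up front, including hardware that has not arrived yet, keeps the blocks contiguous.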
- configure authentication
- Kerberos for the public interface
- public key for the data interface
- Access everything over private network for the daq user
- User testing artdaq will get instructions to set up their own public key
- create instructions for rebooting servers using IPMI
- enable 9000-byte MTU (jumbo) frames on all DAQ interfaces and networking equipment by default
- Switch configuration by networking.
- Just on DAQ network, not all interfaces.
- NFS on public network - jumbo frames for performance? No jumbo on public.
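A quick host-side check of the MTU policy above (jumbo frames on DAQ interfaces only, none on public) might look like the sketch below. It reads the current MTU of each interface from Linux sysfs and compares it with an expected table. The interface names `daq0` and `pub0` and the function names are assumptions for illustration; it does not verify the switch configuration.

```python
from pathlib import Path


def read_mtus(sys_net="/sys/class/net"):
    """Return {interface: mtu} as reported by Linux sysfs."""
    mtus = {}
    for entry in Path(sys_net).iterdir():
        mtu_file = entry / "mtu"
        if mtu_file.is_file():
            mtus[entry.name] = int(mtu_file.read_text())
    return mtus


def mtu_mismatches(actual, expected):
    """Return [(interface, actual_mtu, expected_mtu)] for every deviation.

    An interface listed in `expected` but absent from `actual` is reported
    with an actual MTU of None.
    """
    problems = []
    for iface in sorted(expected):
        found = actual.get(iface)
        if found != expected[iface]:
            problems.append((iface, found, expected[iface]))
    return problems


if __name__ == "__main__":
    # Hypothetical interface names; replace with the function-based names
    # chosen for the cluster.
    policy = {"daq0": 9000, "pub0": 1500}
    for iface, found, wanted in mtu_mismatches(read_mtus(), policy):
        print(f"{iface}: MTU is {found}, expected {wanted}")
```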
- configure and verify that multicasting is enabled and working on all networking equipment
- Need testing software to verify the configuration.
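As a starting point for such testing software, the sketch below uses only the Python standard library: one host joins a multicast group and waits, another sends a probe datagram. The group address, port, and function names are hypothetical choices; this is not an existing DAQ tool, and a probe arriving on one pair of hosts does not prove that every switch is configured correctly.

```python
import socket
import sys
from ipaddress import ip_address

GROUP = "239.255.0.1"  # hypothetical administratively scoped group
PORT = 50000           # hypothetical UDP port


def membership_request(group, local_ip="0.0.0.0"):
    """Build the 8-byte ip_mreq structure for IP_ADD_MEMBERSHIP."""
    if not ip_address(group).is_multicast:
        raise ValueError(f"{group} is not a multicast address")
    return socket.inet_aton(group) + socket.inet_aton(local_ip)


def receive_probe(group=GROUP, port=PORT, local_ip="0.0.0.0", timeout=10.0):
    """Join the group and return (payload, sender) of the first datagram."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    try:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind(("", port))
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                        membership_request(group, local_ip))
        sock.settimeout(timeout)  # socket.timeout is raised if nothing arrives
        return sock.recvfrom(1024)
    finally:
        sock.close()


def send_probe(payload=b"mcast-probe", group=GROUP, port=PORT, ttl=1):
    """Send one datagram to the group; ttl=1 keeps it on the local subnet."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    try:
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
        sock.sendto(payload, (group, port))
    finally:
        sock.close()


if __name__ == "__main__":
    # Run with "recv" on one host, then with "send" on another.
    if sys.argv[1:] == ["recv"]:
        print(receive_probe())
    else:
        send_probe()
```

Passing the address of the DAQ interface as `local_ip` makes the receiver join on that interface instead of the system default.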
- define a shared user for
- managing UPS products
- running daq, dcs, databases
- Experiments manage .k5logins for the shared accounts
- add all people from the RSI group to the /root/.k5login
- add all known daq users to the daq and dcs shared accounts
- shared user profiles are not expected to have any customizations
- Control room accounts - shared
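Since the shared accounts are controlled through .k5login files, those files could be generated from one user list per account. The sketch below is an illustration only: the function name `render_k5login` and the user names are made up, and the realm `FNAL.GOV` is an assumption that should be checked against the site Kerberos configuration.

```python
def render_k5login(usernames, realm="FNAL.GOV"):
    """Return .k5login text with one principal per line, sorted, no duplicates.

    The default realm is an assumption; pass the correct one explicitly.
    """
    principals = sorted(
        {f"{user.strip()}@{realm}" for user in usernames if user.strip()}
    )
    return "".join(principal + "\n" for principal in principals)


if __name__ == "__main__":
    # Placeholder user names for a shared daq account.
    print(render_k5login(["bob", "alice", "bob"]), end="")
```

Keeping the user lists in one place makes it straightforward to regenerate the files for the daq, dcs, and root accounts when people join or leave.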
- set up a reliable NFS server for /home, /daq/software, /daq/log, /daq/run_records, /daq/scratch
- No mounts from the lab's central storage or pnfs.
- cvmfs requires additional configuration to optimize
- reserve adequate disk space for each area
- raid 10 for nfs server
- create a designated scratch area for doing builds on a local NVMe drive, preferably on the fastest server
- a faster NVMe drive such as Samsung 970 Pro or faster is preferred
- Pick a current-generation SSD; larger capacities generally have faster write speeds.
- set up a nightly backup for /home and a weekly backup for /daq/backup areas
- the performance of the NFS should be monitored
- /data is a local file system on data logger computers
- raid 10 for performance
- lose half the disk space
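The "lose half the disk space" trade-off can be made concrete when sizing a data logger. The helper below is a minimal sketch that assumes a conventional RAID 10 layout of two-way mirrored pairs built from identical disks; the function name and the example disk sizes are illustrative only.

```python
def raid10_usable_tb(n_disks, disk_tb):
    """Usable capacity in TB of a RAID 10 array of two-way mirrored pairs.

    Assumes identical disks and an even disk count of at least four.
    """
    if n_disks < 4 or n_disks % 2:
        raise ValueError("this sketch expects an even number of disks, at least 4")
    return n_disks * disk_tb / 2


if __name__ == "__main__":
    # Example: eight 4 TB disks give 32 TB raw and half of that usable.
    print(raid10_usable_tb(8, 4.0))
```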
- any base software such as the OS and productivity RPMs should be identical on all servers
- the default list of installed software packages should not impede development/testing work; include e.g. emacs, vim, mc, tmux, perf, iperf, strace, dstat. VNC/MATE should be installed by default
- implement system monitoring using ganglia or similar software
- Support for MOSH?
- We should try it. Might be blocked by ACLs.
- Optional: DNS, Kerberos, NIS, Supervisord, InfluxDB, Prometheus, Ganglia, Graphite, Grafana
- system monitoring - check_mk, Netdata
- singularity container to distribute monitoring software
- graphite/grafana - part of standard installation
- Keep separate hardware monitoring for system administration
- Combined hardware monitoring for DAQ - DAQ monitoring + hardware monitoring
- Disable SELinux enforcing; use permissive mode
- Disable firewall on private networks
- Turn off checking of raid arrays.
- Raid arrays must be raid 10? You lose half the disk size?
- Do we really need hosts file?
- If we use a hosts file we should use a script to create the file.
- NTP from Fermilab servers works well. No need for an experiment NTP server.
- Who is in the RSI group?
- Use Ansible to verify the settings from puppet are correct?
- Buffer sizes in network switches
- Database computer specs