Initial daq cluster setup checklist » History » Version 10
Geoff Savage, 01/27/2020 01:13 PM
1 | 1 | Gennadiy Lukhanin | h1. Initial DAQ cluster setup checklist. |
---|---|---|---|
2 | 1 | Gennadiy Lukhanin | |
3 | 1 | Gennadiy Lukhanin | _Objective: To reduce the number of service desk tickets during the initial setup of DAQ development / production clusters._ |
4 | 1 | Gennadiy Lukhanin | |
5 | 1 | Gennadiy Lukhanin | h2. Networking |
6 | 1 | Gennadiy Lukhanin | |
7 | 1 | Gennadiy Lukhanin | # define subnets for IPMI, fnal/public and data interfaces |
8 | 10 | Geoff Savage | ** How many cables are needed? Shared IPMI/public? |
9 | 10 | Geoff Savage | ** Name interfaces by function. |
10 | 1 | Gennadiy Lukhanin | # define host names for all network interfaces and make them consistent |
11 | 1 | Gennadiy Lukhanin | ** mydaq-br01, mydaq-eb01, mydaq-ipmi-br01, mydaq-data-br01 |
12 | 1 | Gennadiy Lukhanin | ** the list of host names should be complete as if all hardware is available |
13 | 3 | Ron Rechenmacher | ** put all host names into /etc/hosts and distribute it across all servers |
14 | 1 | Gennadiy Lukhanin | # make a consistent IP address assignment across all subnets |
15 | 1 | Gennadiy Lukhanin | ** use address blocks for the same server roles |
16 | 1 | Gennadiy Lukhanin | ** make the last octet of the IP address the same across all NICs of the same host |
17 | 1 | Gennadiy Lukhanin | # configure authentication |
18 | 1 | Gennadiy Lukhanin | ** Kerberos for the public interface |
19 | 1 | Gennadiy Lukhanin | ** publickey for the data interface |
20 | 1 | Gennadiy Lukhanin | # create instructions for rebooting servers using IPMI |
21 | 10 | Geoff Savage | # enable jumbo frames (MTU 9000) on -all- DAQ interfaces and networking equipment by default |
22 | 10 | Geoff Savage | ** Switch configuration by networking. |
23 | 10 | Geoff Savage | ** Just on DAQ network, not all interfaces. |
24 | 10 | Geoff Savage | ** NFS on public network - jumbo frames for performance? |
25 | 1 | Gennadiy Lukhanin | # configure multicasting and verify that it is enabled and working on all networking equipment |
26 | 1 | Gennadiy Lukhanin | |
27 | 1 | Gennadiy Lukhanin | h2. Users |
28 | 1 | Gennadiy Lukhanin | |
29 | 1 | Gennadiy Lukhanin | # define a shared user for |
30 | 1 | Gennadiy Lukhanin | ** managing UPS products |
31 | 1 | Gennadiy Lukhanin | ** running daq, dcs, databases |
32 | 1 | Gennadiy Lukhanin | # add all people from the RSI group to the /root/.k5login |
33 | 1 | Gennadiy Lukhanin | # add all known daq users to the daq and dcs shared accounts |
34 | 1 | Gennadiy Lukhanin | # shared user profiles are not expected to have any customizations |
35 | 1 | Gennadiy Lukhanin | |
36 | 1 | Gennadiy Lukhanin | h2. Storage areas |
37 | 1 | Gennadiy Lukhanin | |
38 | 1 | Gennadiy Lukhanin | # set up a reliable NFS server for /home, /daq/products, /daq/database, /daq/log, /daq/tmp,.... /data /scratch, /daq/backup |
39 | 1 | Gennadiy Lukhanin | # reserve adequate disk space for each area |
40 | 1 | Gennadiy Lukhanin | # create a designated scratch area for doing builds on a local NVMe drive, preferably on the fastest server |
41 | 1 | Gennadiy Lukhanin | ** a fast NVMe drive such as the Samsung 970 Pro or faster is preferred |
42 | 1 | Gennadiy Lukhanin | # set up a nightly backup for /home and a weekly backup for /daq/backup areas |
43 | 1 | Gennadiy Lukhanin | # the performance of the NFS should be monitored |
44 | 1 | Gennadiy Lukhanin | |
45 | 1 | Gennadiy Lukhanin | h2. Software |
46 | 1 | Gennadiy Lukhanin | |
47 | 1 | Gennadiy Lukhanin | # any base software such as the OS and productivity RPMs should be identical on all servers |
48 | 1 | Gennadiy Lukhanin | # the default list of installed software packages should not impede development/testing work, e.g. emacs, vim, mc, tmux, perf, iperf, strace, dstat,..... VNC/MATE should be installed by default |
49 | 2 | Gennadiy Lukhanin | # implement system monitoring using ganglia |
50 | 2 | Gennadiy Lukhanin | or similar software |
51 | 4 | Pengfei Ding | |
52 | 5 | Pengfei Ding | h2. System Services |
53 | 4 | Pengfei Ding | |
54 | 6 | Pengfei Ding | # Optional: DNS, Kerberos, NIS, Supervisord, InfluxDB, Prometheus. |
55 | 6 | Pengfei Ding | # Ganglia, Graphite. |
56 | 7 | Geoff Savage | |
57 | 7 | Geoff Savage | h2. Geoff |
58 | 7 | Geoff Savage | |
59 | 8 | Geoff Savage | * Turn off checking of RAID arrays. |
60 | 8 | Geoff Savage | * RAID arrays must be RAID 10? You lose half the disk capacity? |
61 | 8 | Geoff Savage | * Do we really need a hosts file? |
62 | 8 | Geoff Savage | ** If we use a hosts file we should use a script to create the file. |
63 | 8 | Geoff Savage | * NTP from Fermilab servers works well. No need for an experiment NTP server. |
64 | 8 | Geoff Savage | * Who is in the RSI group? |
65 | 8 | Geoff Savage | * Use Ansible to verify the settings from Puppet are correct? |
66 | 9 | Geoff Savage | |
67 | 9 | Geoff Savage | * Buffer sizes in network switches |
68 | 9 | Geoff Savage | * Database computer specs |
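
The Networking items on consistent host names and IP assignment, together with the note that a hosts file should be created by a script, could be sketched roughly as follows. This is only an illustration under stated assumptions: the subnets, the role address blocks, and the function names @host_entries@ and @render_hosts@ are hypothetical and not part of the checklist. Only the naming pattern (mydaq-br01, mydaq-ipmi-br01, mydaq-data-br01) and the "same last octet across all NICs" rule come from the list above.

```python
from ipaddress import IPv4Network

# Hypothetical subnets, one per interface function. Real values would come
# from the networking group; the empty key stands for the public interface.
SUBNETS = {
    "": IPv4Network("192.168.10.0/24"),      # public, e.g. mydaq-br01
    "ipmi": IPv4Network("192.168.20.0/24"),  # e.g. mydaq-ipmi-br01
    "data": IPv4Network("192.168.30.0/24"),  # e.g. mydaq-data-br01
}

# Hypothetical address blocks per server role:
# role -> (last octet of the first host, number of hosts planned).
# The counts should cover all planned hardware, not only what is installed.
ROLE_BLOCKS = {
    "br": (11, 4),
    "eb": (51, 2),
}


def host_entries(cluster="mydaq"):
    """Return (address, name) pairs for every interface of every host.

    Each host gets the same last octet on every subnet.
    """
    entries = []
    for role, (first_octet, count) in ROLE_BLOCKS.items():
        for index in range(1, count + 1):
            octet = first_octet + index - 1
            for function, subnet in SUBNETS.items():
                prefix = f"{cluster}-{function}-" if function else f"{cluster}-"
                name = f"{prefix}{role}{index:02d}"
                address = subnet.network_address + octet
                entries.append((str(address), name))
    return entries


def render_hosts(entries):
    """Format the entries as lines suitable for appending to /etc/hosts."""
    return "\n".join(f"{address}\t{name}" for address, name in entries)


if __name__ == "__main__":
    print(render_hosts(host_entries()))
```

Generating the file from one table of roles and subnets keeps the names and addresses consistent by construction, and the same output can be distributed to all servers.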