Install OFED drivers and mvapich2

mvapich2 is installed as part of the OFED package

Build and install

You must have root privileges to perform the install.

After the build, the rpms are in /usr/local/OFED-1.5.4.1/RPMS.
If this is a shared directory (as on the dsfr machines), you can install these rpms on other machines with the following steps:
  • cd /usr/local/OFED-1.5.4.1
  • ./install.pl -c ofed.conf (this step will now install existing rpms from the RPMS directory)
  • /etc/rc.d/init.d/openibd start
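
The steps above can be wrapped in a small guard so that the install is not attempted as a non-root user or on a machine that cannot see the built rpms. This is only a sketch: the helper names count_rpms and install_prebuilt_ofed are made up for illustration, and the paths are the ones quoted on this page.

```shell
OFED_DIR=/usr/local/OFED-1.5.4.1

# Count the rpm files found anywhere below directory $1 (0 if it is missing).
count_rpms() {
    find "$1" -type f -name '*.rpm' 2>/dev/null | wc -l | tr -d ' '
}

# Hypothetical wrapper around the three steps listed above.
install_prebuilt_ofed() {
    if [ "$(id -u)" -ne 0 ]; then
        echo "error: must be run as root" >&2
        return 1
    fi
    if [ "$(count_rpms "$OFED_DIR/RPMS")" -eq 0 ]; then
        echo "error: no rpms under $OFED_DIR/RPMS; build OFED first" >&2
        return 1
    fi
    cd "$OFED_DIR" || return 1
    ./install.pl -c ofed.conf || return 1
    /etc/rc.d/init.d/openibd start
}
```

Nothing is installed until install_prebuilt_ofed is called explicitly from a root shell.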

Configuration

  • add these lines to /etc/security/limits.conf (the leading * is the domain field and applies the limit to all users):
      * hard memlock unlimited
      * soft memlock unlimited
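
To confirm that the memlock setting took effect, the limits can be read back with ulimit. The check below is a minimal sketch (memlock_ok is a made-up helper name). Because limits.conf is applied by pam_limits at login, it needs to be run from a new login session, not from the shell that was open while the file was edited.

```shell
# Succeeds if $1 (a value as printed by `ulimit -l`) means "no limit".
memlock_ok() {
    [ "$1" = "unlimited" ]
}

soft=$(ulimit -S -l)   # soft limit on locked memory
hard=$(ulimit -H -l)   # hard limit on locked memory

if memlock_ok "$soft" && memlock_ok "$hard"; then
    echo "memlock OK: soft=$soft hard=$hard"
else
    echo "memlock NOT unlimited: soft=$soft hard=$hard"
fi
```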

Protect against unintentional updates

Older copies of these rpms are also distributed via yum. We need to make sure we don't accidentally replace the packages we just built.

Add this block near the bottom of /etc/yum.conf

# This is to avoid problems with the IB drivers from OFED.org.
exclude=compat-dapl,compat-dapl-devel,dapl,dapl-debuginfo,dapl-devel,dapl-devel-static,dapl-utils,ibacm,ibsim,ibsim-debuginfo,ibutils,infiniband-diags,infinipath-psm,infinipath-psm-devel,kernel-ib,kernel-ib-devel,libcxgb3,libcxgb3-debuginfo,libcxgb3-devel,libcxgb4,libcxgb4-debuginfo,libcxgb4-devel,libibcm,libibcm-debuginfo,libibcm-devel,libibmad,libibmad-debuginfo,libibmad-devel,libibmad-static,libibumad,libibumad-debuginfo,libibumad-devel,libibumad-static,libibverbs,libibverbs-debuginfo,libibverbs-devel,libibverbs-devel-static,libibverbs-utils,libipathverbs,libipathverbs-debuginfo,libipathverbs-devel,libmlx4,libmlx4-debuginfo,libmlx4-devel,libmthca,libmthca-static,libmthca-debuginfo,libmthca-devel-static,libnes,libnes-debuginfo,libnes-devel-static,librdmacm,librdmacm-debuginfo,librdmacm-devel,librdmacm-utils,libsdp,libsdp-debuginfo,libsdp-devel,mpi-selector,mpitests_mvapich2_gcc,mpitests_mvapich_gcc,mpitests_openmpi_gcc,mstflint,mvapich2_gcc,mvapich_gcc,ofed-docs,ofed-scripts,openmpi_gcc,opensm,opensm-debuginfo,opensm-devel,opensm-libs,opensm-static,perftest,qperf,qperf-debuginfo,rds-devel,rds-tools,sdpnetstat,srptools
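
A quick way to check that the exclude line is in place is to look up a few of the package names in it. The function below is a rough sketch, not part of yum: yum_excludes is a made-up name, and it only understands plain comma- or space-separated names as used above (no glob patterns).

```shell
# Succeeds if package $1 is named on an exclude= line of the yum config $2
# (default /etc/yum.conf). Matches whole names only, so "kernel" does not
# match "kernel-ib".
yum_excludes() {
    pkg=$1
    conf=${2:-/etc/yum.conf}
    grep '^exclude=' "$conf" 2>/dev/null \
        | sed 's/^exclude=//' \
        | tr ', ' '\n\n' \
        | grep -Fqx "$pkg"
}

# Spot-check a few packages taken from the list above.
for p in kernel-ib libibverbs opensm mvapich2_gcc; do
    if yum_excludes "$p"; then
        echo "$p: excluded"
    else
        echo "$p: NOT excluded"
    fi
done
```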

Subnet manager

The subnet manager needs to run on at least one machine. Two is best, so that a standby is available if the first goes down. We chose dsfr1 and dsag.

  • chkconfig --list opensmd
  • chkconfig opensmd on (if not already)
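
The output of 'chkconfig --list' has one column per runlevel (for example "3:on"), so the check can be scripted. The sketch below assumes that runlevels 3 and 5 are the ones that matter on these machines; runlevel_on is a made-up helper name.

```shell
# Succeeds if the `chkconfig --list` line in $1 shows "on" for runlevel $2.
runlevel_on() {
    case "$1" in
        *"$2:on"*) return 0 ;;
        *) return 1 ;;
    esac
}

if command -v chkconfig >/dev/null 2>&1; then
    line=$(chkconfig --list opensmd 2>/dev/null)
    for level in 3 5; do   # assumption: the multi-user runlevels in use here
        if runlevel_on "$line" "$level"; then
            echo "opensmd: on in runlevel $level"
        else
            echo "opensmd: OFF in runlevel $level"
        fi
    done
else
    echo "chkconfig not found on this machine"
fi
```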

Kernel updates

If the kernel is updated, the OFED software needs to be rebuilt and reinstalled.
Afterwards, make sure that the subnet manager is running on the appropriate machines.

Note on testing

For simple connectivity testing, we've used
  • 'ibv_rc_pingpong' on one node (it then waits for a connection), and
  • 'ibv_rc_pingpong <node1>' on a second node, where <node1> is the hostname of the first node

Other tests are described on the following web page: http://pic.dhe.ibm.com/infocenter/lnxinfo/v3r0m0/index.jsp?topic=%2Fliaai.hpcrh%2Ftestib.htm (Google search = "infiniband testing").

We've used a simple test from artdaq in the past to help reproduce connection problems. The steps to do this are the following:

  • log into one of the nodes in the cluster
  • 'source /products/setup'
  • 'setup artdaq v0_03_01 -q e2:prof'
  • 'export FHICL_FILE_PATH=$FHICL_FILE_PATH:$ARTDAQ_DIR/fcl'
  • 'daqrate 2 2 10 202 --nodes=<clusterNodeA>,<clusterNodeB> -- -c daqrate_simdata_noart.fcl'
    • Please NOTE that for this to be a useful test, the two nodes must be different (e.g. dsfr6 and dseb8).
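
The steps above can be collected into a shell function that refuses to run when the two node names are the same, since that case does not exercise the network. This is a sketch only: nodes_differ and run_daqrate_test are made-up names, and the commands inside are copied from the list above.

```shell
# Succeeds only if two non-empty, different node names are given.
nodes_differ() {
    [ -n "$1" ] && [ -n "$2" ] && [ "$1" != "$2" ]
}

# Hypothetical wrapper: set up artdaq and run daqrate between two nodes.
run_daqrate_test() {
    node_a=$1
    node_b=$2
    if ! nodes_differ "$node_a" "$node_b"; then
        echo "error: need two different node names" >&2
        return 1
    fi
    . /products/setup
    setup artdaq v0_03_01 -q e2:prof
    export FHICL_FILE_PATH=$FHICL_FILE_PATH:$ARTDAQ_DIR/fcl
    daqrate 2 2 10 202 --nodes="$node_a,$node_b" -- -c daqrate_simdata_noart.fcl
}
```

Example use, after logging into one of the cluster nodes: 'run_daqrate_test dsfr6 dseb8'.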

Additional step?

On 08-Jul-2013, Ron ran some of the InfiniBand tests that turn up in a Google search for "infiniband tests" and saw that CPU frequency scaling was causing an issue. He disabled it with the following steps:
  • service cpuspeed stop
  • chkconfig cpuspeed off
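
One way to see whether frequency scaling is still active is to read the per-CPU governor files that the kernel exposes under sysfs. The sketch below is an assumption about how to check, not something Ron reported doing; show_governors is a made-up name, and what appears after cpuspeed is stopped depends on the kernel and cpufreq driver.

```shell
# Print the scaling governor of each CPU found under sysfs root $1 (default /sys).
show_governors() {
    root=${1:-/sys}
    found=0
    for f in "$root"/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -r "$f" ] || continue
        found=1
        cpu=${f#"$root"/devices/system/cpu/}
        echo "${cpu%%/*}: $(cat "$f")"
    done
    [ "$found" -eq 1 ] || echo "no cpufreq governors found"
}

show_governors
```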