
LOCO in MPI

This is an implementation of LOCO (Linear Optics from Closed Orbits) that uses multiple processors to speed up reaching the solution.
The source code can be retrieved by issuing the following command:

git clone http://cdcvs.fnal.gov/projects/booster_loco_mpi

Requirements

Builds are supported on MacOSX 10.8.4 and on Linux on the CLX cluster.

Compiling on MacOSX

The requirements for building mfitit:
  • python (comes pre-installed)
  • Apple Accelerate Framework -- Apple's implementation of BLAS and LAPACK, used for the SVD (comes pre-installed)
  • git
  • gnuplot -- use MacPorts to install:
 sudo port install gnuplot 
  • Boost library -- use MacPorts to install:
 sudo port install boost 

  • Boost.Process 0.5 -- this beta version of the process library is not part of Boost and needs to be downloaded and installed separately from the Boost.Process page.

  • GNU Scientific Library -- use MacPorts to install:
 sudo port install gsl 
  • OpenMPI -- use MacPorts to install:
 sudo port install openmpi-default 

For unknown reasons, the above command must be run twice: the first invocation inevitably fails.
For completeness, select openmpi-mp-fortran as the default MPI after the install completes:

 sudo port select --set mpi openmpi-mp-fortran 
  • MADX for Mac can be downloaded from the MADX homepage
  • Elegant is no longer supported.

Compiling on LINUX (CLX cluster)

Unlike LOCO in C++, LOCO in MPI will use an old version of the Boost library that is already installed.
A special version of Boost.Process has been written by J. deVoy to supplement the old Boost library.
All the necessary libraries (OpenMPI, GNU Scientific Library, LAPACK, ATLAS, and MADX) have been installed on the cluster, so mfitit should build without any problems.

Building

Edit the Makefile to put in the appropriate paths if necessary. The Makefile has been set up so that it knows whether it is run on a Mac or on the CLX cluster.
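The OS switch in the Makefile is commonly done with `uname`; a hypothetical sketch (the actual variable names and flags in the project's Makefile may differ) looks like this:

```make
# Hypothetical sketch of the Mac/CLX switch -- not the actual Makefile.
UNAME := $(shell uname -s)

ifeq ($(UNAME),Darwin)
    # MacOSX: use Apple's Accelerate framework for BLAS/LAPACK
    LDFLAGS += -framework Accelerate
else
    # CLX cluster: link the installed LAPACK/ATLAS and GSL libraries
    LDFLAGS += -llapack -lgsl -lgslcblas
endif
```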

Run

 make

mfitit should be created. Move mfitit to a directory in your PATH.

A pre-built version of mfitit for MacOSX or CLX Linux can be requested from the author.

Running

mfitit assumes the same directory structure as fitit. The file options can be changed in .loco.
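The option names of the form --fitit.xxx in the help output suggest that .loco is a Boost.Program_options configuration file; assuming that INI-style syntax, a minimal .loco might look like the following (a sketch only -- adjust the paths to your setup):

```ini
# Hypothetical .loco example, assuming Boost.Program_options INI syntax
[fitit]
plots_dir   = ./plots
common_dir  = ./common
results_dir = ./results
work_dir    = /tmp
madx        = /Users/cytan/bin/madx
mlattice    = booster.madx
```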

Checking the configuration

mfitit -h

The output should look something like this:

~/expt/booster/loco_mpi/optimize@xanadu% ./mfitit -h
mfitit version 2-25-gc9247c5-45
config file: /Users/cytan/.loco
         gnuplot font = /Library/Fonts/Arial.ttf
         plots dir = ./plots
         common dir = ./common
         headers dir = /Users/cytan/expt/booster/loco_new/common
         results dir = ./results
         work dir = /tmp
         magnet settings = MagnetSettings.txt
         rpos file = deltaRPOS.txt
         qsdqsf file = qsdqsf.dat
         lattice = booster.madx
         machine params file = machine_parameters.sdds
         parameters to vary file = parameters2vary.sdds
MADX is the tracking program
         MADX = /Users/cytan/bin/madx

command line options and their DEFAULT values

generic options :
  -h [ --help ]                         this message
  -A [ --ormA ] arg                     ORM input file. Either one file with 
                                        both H & V data or two files separated 
                                        by , or space. H first, V second
  -d [ --dfile ] arg                    dispersion input file name
  -r [ --rfactor ] arg (=1 1)           rfactors. dresp rfactor first, loco 
                                        rfactor second
  -s [ --smin ] arg (=0.0025000000000000001)
                                        smin (s): start time for processing the
                                        ORM data
  -S [ --Sthreshold ] arg (=0.14999999999999999)
                                        Sthreshold: singular value threshold
  -t [ --trange ] arg (=3 30)           start and stop time in ms (separated by
                                        a space) for fitting the ramp data
  -c [ --config ] arg (=/Users/cytan/.loco)
                                        LOCO configuration file
  -u [ --useoffset ]                    use magnet current offsets in 
                                        Magnet.txt
  -n [ --nfit_iter ] arg (=3)           number of LOCO fit iterations
  -D [ --debug ]                        enable debugging
  -v [ --version ]                      print version

config file options:
  --fitit.gnufont arg                   gnuplot font
  --fitit.plots_dir arg (=./plots)      plot directory
  --fitit.common_dir arg (=./common)    common directory
  --fitit.results_dir arg (=./results)  results directory
  --fitit.headers_dir arg (=./headers)  headers directory
  --fitit.work_dir arg (=/tmp)          work directory for temporary files
  --fitit.rposfile arg (=deltaRPOS.txt) rpos ramp filename
  --fitit.elegant arg                   elegant path
  --fitit.madx arg                      madx path
  --fitit.bad_bpm_list arg (=bad_bpm_list.dat)
                                        bad bpm list
  --fitit.bad_corrector_list arg (=bad_corrector_list.dat)
                                        bad corrector list
  --fitit.lattice arg (=machine.lte)    elegant lattice file
  --fitit.magnet_settings arg (=MagnetSettings.txt)
                                        magnet settings up the ramp file
  --fitit.orm_ele arg (=ORM.ele)        elegant ORM setup file
  --fitit.twiss_ele arg (=twiss.ele)    elegant twiss setup file
  --fitit.mlattice arg (=booster.madx)  madx lattice file
  --fitit.machine_params_sdds arg (=machine_parameters.sdds)
                                        initial machine element calibrations
  --fitit.params2vary_sdds arg (=parameters2vary.sdds)
                                        parameters to vary and their step 
                                        change
  --fitit.qsdqsf_fname arg (=qsdqsf.dat)
                                        QL and QS K values in madx format

Creating machine_params_vs_t.sdds, twiss_vs_t.twi and optics_vs_t.sdds

The requirements are that you have the following files:

  • in the directory where you are going to perform LOCO, the collected data:
      • an ORM file, e.g. 28MARCH_ORM_ALL.TXT
      • a dispersion file, e.g. 28MARCH_DISP_1.TXT
  • in the common directory:
      • deltaRPOS.txt -- the RPOS ramp file used for dispersion measurements
      • MagnetSettings.txt -- the ramp file that is the source file for the creation of optics.sdds
      • bad_bpm_list.dat -- the bad BPM list
      • bad_corrector_list.dat -- the bad corrector list
      • parameters2vary.sdds -- parameters to vary and their step values. Unlike LOCO TCL, this file is never overwritten.
      • machine_parameters.sdds -- the initial values of the BPM calibrations, tilts, etc.
  • MADX input files that are required in the common directory:
      • booster.madx -- parent file
      • booster.ele -- magnet definitions
      • booster.seq -- lattice file
      • qsdqsf.dat -- QL and QS K values in MADX format
      • DC.dat -- the strengths of the DC elements

The latest versions of the MADX input files can be downloaded from here

An example of the files in the common directory can be found in common.zip

The files that are generated in the results directory are:

  • machine_params_vs_t.sdds -- the LOCO output.
  • twiss_vs_t.twi -- twiss file in SDDS format: binary SDDS if Elegant is used, ASCII if MADX is used.
  • optics_vs_t.sdds -- created from MagnetSettings.txt after interpolation.

If the -D option is given on the command line, all the plot files are also deposited in the plots directory (unless the location has been changed in .loco). The plots directory is automatically created by mfitit.

NOTE: So many plots are generated that the program slows down tremendously. Use -D only if the plots are really necessary!
Todo: the number of plots to save will be added as an option.

Running with MPI on a Mac

mfitit has been successfully tested on a Mac running both an i7 and an i5. Unfortunately, on such an unclustered machine, only the onboard CPUs will be used. For example,

mpiexec -np 2 mfitit -A 28MARCH_ORM_ALL.TXT -d 28MARCH_DISP_1.TXT -u

will use 2 processors and create the machine_params_vs_t.sdds, twiss_vs_t.twi and optics_vs_t.sdds files in the results directory.

The minimum number of processors required for mfitit is 2. One will become the master while the other becomes the slave.

Running with MPI on the CLX Cluster

Running mfitit on a cluster is where the real fun begins! For example

mpiexec -np 30 -machinefile /usr/local/openmpi/etc/openmpi-default-hostfile mfitit -A 28MARCH_ORM_ALL.TXT -d 28MARCH_DISP_1.TXT -u -r 0.5 0.5

uses 30 processors from the available processors in the openmpi-default-hostfile for creating the machine_params_vs_t.sdds, twiss_vs_t.twi and optics_vs_t.sdds files in the results directory.

These options mean that mfitit will use
  • the current offsets of the magnets in the input *.TXT files (from -u), and
  • an rfactor of 0.5 in both the calculation of the dispersion and the rpos effect in LOCO (from -r 0.5 0.5).

Benchmarking

The whole point of doing MPI is to speed up the LOCO calculation. In this particular implementation, MPI LOCO only divides up the calculation of the Jacobian among the given number of CPUs less one. These CPUs are called slaves. The SVD calculation (which is done 3 times by default) is always performed on one CPU, which is called the master. The calculation speedup from increasing the number of CPUs for one Booster ramp breakpoint is shown here:

The same data plotted as a stacked histogram and compared to serial LOCO is shown here:

It is not surprising that as the number of CPUs is increased, the total calculation time decreases. But as the communication between CPUs starts to dominate, the speedup diminishes. It is clear that the asymptotic number of CPUs is about 30 in this implementation. Furthermore, since the SVD is always calculated by the master CPU, its completion time is unaffected by the number of slave CPUs, as expected. The 1-CPU serial LOCO has the same completion time as 2-CPU MPI LOCO because essentially one slave is performing the calculation of the Jacobian while one master is calculating the SVD. This shows that 2-CPU communication is not a bottleneck in this implementation.

  • Bottom line: Using 30 CPUs, MPI LOCO is 5.8 times faster than serial LOCO that uses 1 CPU.

Debugging

A serial debugger like gdb can be used to debug MPI applications. First, either define DEBUG_MPI or comment out the #ifdef DEBUG_MPI guard in LOCO.cpp.
It is easier to debug on a Mac rather than on the CLX cluster because the processes are on the same CPU.

When mfitit is run, it will print out the PIDs of the processes. For example, running the following produces two PIDs:

mpiexec -np 2 mfitit -A 28MARCH_ORM_ALL.TXT -d 28MARCH_DISP_1.TXT 

This gives:
PID 75870 on ad115712-mac.fnal.gov ready for attach
PID 75869 on ad115712-mac.fnal.gov ready for attach

gdb can then attach to the PIDs. For example, to attach to the master, which has the smallest PID:

gdb -pid 75869

and to the slave:
gdb -pid 75870

gdb should print out something like this:

............................................ done
Reading symbols for shared libraries + done
0x00007fff88662386 in __semwait_signal ()
(gdb) 

Issue the following commands to gdb; the replies are shown. In this example the master is attached.

(gdb) cont
Continuing.
^C
Program received signal SIGINT, Interrupt.
0x00007fff88662386 in __semwait_signal ()
(gdb) n
Single stepping until exit from function __semwait_signal, 
which has no line number information.
0x00007fff886634c8 in cerror ()
(gdb) 
Single stepping until exit from function cerror, 
which has no line number information.
0x00007fff886634db in cerror_nocancel ()
(gdb) 
Single stepping until exit from function cerror_nocancel, 
which has no line number information.
0x00007fff88650bba in cthread_set_errno_self ()
(gdb) 
Single stepping until exit from function cthread_set_errno_self, 
which has no line number information.
0x00007fff80bb508b in cthread_set_errno_self ()
(gdb) 
Single stepping until exit from function cthread_set_errno_self, 
which has no line number information.
0x00007fff886634fc in cerror_nocancel ()
(gdb) 
Single stepping until exit from function cerror_nocancel, 
which has no line number information.
0x00007fff80c377c8 in nanosleep ()
(gdb) 
Single stepping until exit from function nanosleep, 
which has no line number information.
0x00007fff80c37652 in sleep ()
(gdb) 
Single stepping until exit from function sleep, 
which has no line number information.
LOCO::calculate_Jacobian (this=0x7fff5db12150, cycle_time=0.0030983011199999999, machine_header=0x7fd35f581168 "/Users/cytan/expt/booster/loco_new/common/MHeader.sdds", machine_params_sdds=@0x7fff5db112b8, yerr=@0x7fff5db11250) at LOCO.cpp:2952
2952        while (0 == i)

and to break out of the while loop, issue the following command:
(gdb) set var i=7
Current language:  auto; currently c++
(gdb) n
2958      MPI_Comm_size(MPI_COMM_WORLD, &num_processors);

Do the same for the slave and debugging can proceed as normal.

Some pictures saved for this Wiki