LOCO in MPI¶
This is an implementation of LOCO that uses multiple processors to reach the solution faster.
The source code can be retrieved by issuing the following command:
git clone http://cdcvs.fnal.gov/projects/booster_loco_mpi
Requirements¶
Builds are supported on MacOSX 10.8.4 and on LINUX on the CLX cluster.
Compiling on MacOSX¶
The requirements for building mfitit:
- python (comes pre-installed)
- Apple Accelerate Framework
- Apple's implementation of BLAS and LAPACK used for SVD. (comes pre-installed)
- git
- This is available at: git for Mac OS X
- gnuplot
- use MacPorts to install
sudo port install gnuplot
- Boost library
- use MacPorts to install
sudo port install boost
- Boost.Process 0.5. This is the beta version of the process library and needs to be downloaded and installed separately.
- GNU Scientific library
- use MacPorts to install
sudo port install gsl
- OpenMPI
- use MacPorts to install
sudo port install openmpi-default
For some reason the above command must be run twice, because the first invocation inevitably fails.
For completeness, set mpi to openmpi-mp-fortran after the install completes
sudo port select --set mpi openmpi-mp-fortran
- MADX for Mac can be downloaded from the MADX homepage
- Elegant is no longer supported.
Compiling on LINUX (CLX cluster)¶
Unlike LOCO in C++, LOCO in MPI will use an old version of the Boost library that is already installed.
A special version of Boost::process has been written by J. deVoy to supplement the old Boost library.
All the necessary libraries (OpenMPI, GNU Scientific Library, LAPACK, ATLAS and MADX) have been installed on the cluster, so mfitit should build without any problems.
Building¶
Edit the Makefile to put in the appropriate paths if necessary. The Makefile has been set up so that it knows whether it is run on a Mac or on the CLX cluster.
Run
make
This should create mfitit. Move mfitit to a directory on your PATH.
A pre-built version of mfitit for MacOSX or CLX LINUX can be requested from the author: cytan@fnal.gov
Running¶
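mfitit assumes the same directory structure as fitit. The file and directory options can be changed in the configuration file .loco in your home directory, or in another file given with the -c option.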
Checking the configuration¶
mfitit -h
The output should look something like this
~/expt/booster/loco_mpi/optimize@xanadu% ./mfitit -h
mfitit version 2-25-gc9247c5-45
config file: /Users/cytan/.loco
gnuplot font = /Library/Fonts/Arial.ttf
plots dir = ./plots
common dir = ./common
headers dir = /Users/cytan/expt/booster/loco_new/common
results dir = ./results
work dir = /tmp
magnet settings = MagnetSettings.txt
rpos file = deltaRPOS.txt
qsdqsf file = qsdqsf.dat
lattice = booster.madx
machine params file = machine_parameters.sdds
parameters to vary file = parameters2vary.sdds
MADX is the tracking program
MADX = /Users/cytan/bin/madx
command line options and their DEFAULT values
generic options :
  -h [ --help ]  this message
  -A [ --ormA ] arg  ORM input file. Either one file with both H & V data or two files separated by , or space. H first, V second
  -d [ --dfile ] arg  dispersion input file name
  -r [ --rfactor ] arg (=1 1)  rfactors. dresp rfactor first, loco rfactor second
  -s [ --smin ] arg (=0.0025000000000000001)  smin (s): start time for processing the ORM data
  -S [ --Sthreshold ] arg (=0.14999999999999999)  Sthreshold: singular value threshold
  -t [ --trange ] arg (=3 30)  start and stop time in ms (separated by a space) for fitting the ramp data
  -c [ --config ] arg (=/Users/cytan/.loco)  LOCO configuration file
  -u [ --useoffset ]  use magnet current offsets in Magnet.txt
  -n [ --nfit_iter ] arg (=3)  number of LOCO fit iterations
  -D [ --debug ]  enable debugging
  -v [ --version ]  print version
config file options:
  --fitit.gnufont arg  gnuplot font
  --fitit.plots_dir arg (=./plots)  plot directory
  --fitit.common_dir arg (=./common)  common directory
  --fitit.results_dir arg (=./results)  results directory
  --fitit.headers_dir arg (=./headers)  headers directory
  --fitit.work_dir arg (=/tmp)  work directory for temporary files
  --fitit.rposfile arg (=deltaRPOS.txt)  rpos ramp filename
  --fitit.elegant arg  elegant path
  --fitit.madx arg  madx path
  --fitit.bad_bpm_list arg (=bad_bpm_list.dat)  bad bpm list
  --fitit.bad_corrector_list arg (=bad_corrector_list.dat)  bad corrector list
  --fitit.lattice arg (=machine.lte)  elegant lattice file
  --fitit.magnet_settings arg (=MagnetSettings.txt)  magnet settings up the ramp file
  --fitit.orm_ele arg (=ORM.ele)  elegant ORM setup file
  --fitit.twiss_ele arg (=twiss.ele)  elegant twiss setup file
  --fitit.mlattice arg (=booster.madx)  madx lattice file
  --fitit.machine_params_sdds arg (=machine_parameters.sdds)  initial machine element calibrations
  --fitit.params2vary_sdds arg (=parameters2vary.sdds)  parameters to vary and their step change
  --fitit.qsdqsf_fname arg (=qsdqsf.dat)  QL and QS K values in madx format
Creating machine_params_vs_t.sdds, twiss_vs_t.twi and optics_vs_t.sdds¶
The following files are required:
- in the directory where you are going to perform LOCO:
- An ORM file: e.g. 28MARCH_ORM_ALL.TXT
- A Disp file: e.g. 28MARCH_DISP_1.TXT
- in the common directory:
- deltaRPOS.txt -- the RPOS ramp file used for dispersion measurements
- MagnetSettings.txt -- the ramp file that is the source file for the creation of optics.sdds
- bad_bpm_list.dat -- the bad bpm list
- bad_corrector_list.dat -- the bad corrector list
- parameters2vary.sdds -- parameters to vary and their step values. Unlike LOCO TCL, this file is never overwritten.
- machine_parameters.sdds -- the initial values of the BPM calibrations, tilts etc.
- MADX input files that are required in the common directory:
- booster.madx -- parent file
- booster.ele -- magnet definitions
- booster.seq -- lattice file
- qsdqsf.dat -- QL and QS K values in MADX format
- DC.dat -- the strength of the DC elements
The latest versions of the MADX input files can be downloaded from here
An example of the files in the common directory can be found in common.zip
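Before starting a long run, it can help to confirm that every input file listed above is in place. The following is a minimal sketch, not part of mfitit: the helper name `missing_inputs` is hypothetical, and the default `./common` directory is an assumption taken from the defaults printed by `mfitit -h`, so adjust it if your .loco file points elsewhere.

```python
from pathlib import Path

# File names taken from the list above. The common directory default mirrors
# the default printed by `mfitit -h` and may differ in your .loco file.
REQUIRED_COMMON_FILES = [
    "deltaRPOS.txt",
    "MagnetSettings.txt",
    "bad_bpm_list.dat",
    "bad_corrector_list.dat",
    "parameters2vary.sdds",
    "machine_parameters.sdds",
    "booster.madx",
    "booster.ele",
    "booster.seq",
    "qsdqsf.dat",
    "DC.dat",
]


def missing_inputs(orm_file, disp_file, common_dir="./common"):
    """Return the paths of required input files that do not exist.

    Hypothetical helper for illustration; mfitit itself does not provide it.
    """
    candidates = [Path(orm_file), Path(disp_file)]
    candidates += [Path(common_dir) / name for name in REQUIRED_COMMON_FILES]
    return [str(path) for path in candidates if not path.is_file()]


if __name__ == "__main__":
    # Example file names from the text above.
    missing = missing_inputs("28MARCH_ORM_ALL.TXT", "28MARCH_DISP_1.TXT")
    if missing:
        print("Missing input files:")
        for path in missing:
            print("  " + path)
    else:
        print("All required input files are present.")
```

Run it from the directory where LOCO will be performed. An empty list means all the files named in this section were found.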
The files that are generated in the results directory are:
- machine_params_vs_t.sdds file that is the LOCO output.
- twiss_vs_t.twi file in sdds format. It is in binary sdds if Elegant is used and in ascii format if MADX is used.
- optics_vs_t.sdds file is created from MagnetSettings.txt after interpolation.
- all the plot files are deposited in the plots directory (unless the location has been changed in .loco). The plots directory is automatically created by mfitit.
NOTE: So many plots are generated that the program slows down tremendously. Generating them is only recommended if the plots are really necessary!
Todo: number of plots to save will be added as an option.
Running with MPI on a Mac¶
mfitit has been successfully tested on Macs with both i7 and i5 processors. Unfortunately, on an unclustered machine like this, only the onboard CPUs can be used. For example
mpiexec -np 2 mfitit -A 28MARCH_ORM_ALL.TXT -d 28MARCH_DISP_1.TXT -u
uses 2 processors and creates the machine_params_vs_t.sdds, twiss_vs_t.twi and optics_vs_t.sdds files in the results directory.
The minimum number of processors required for mfitit is 2. One will become the master while the other becomes the slave.
Running with MPI on the CLX Cluster¶
Running mfitit on a cluster is where the real fun begins! For example
mpiexec -np 30 -machinefile /usr/local/openmpi/etc/openmpi-default-hostfile mfitit -A 28MARCH_ORM_ALL.TXT -d 28MARCH_DISP_1.TXT -u -r 0.5 0.5
uses 30 of the processors listed in the openmpi-default-hostfile to create the machine_params_vs_t.sdds, twiss_vs_t.twi and optics_vs_t.sdds files in the results directory. The options mean that mfitit will use
- (-u) the current offsets of the magnets in the input *.TXT files.
- (-r 0.5 0.5) an rfactor of 0.5 both in the calculation of the dispersion and in the rpos effect in LOCO.
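For scripted runs, the command lines above can be assembled programmatically. This is a minimal sketch rather than anything shipped with mfitit: `build_mfitit_command` is a hypothetical helper, and it only uses the options that appear on this page (-np, -machinefile, -A, -d, -u and -r).

```python
import shlex


def build_mfitit_command(num_procs, orm_file, disp_file, use_offset=False,
                         rfactors=None, machinefile=None, mfitit="mfitit"):
    """Build an mpiexec argument list for mfitit (hypothetical helper)."""
    if num_procs < 2:
        # mfitit needs one master plus at least one slave.
        raise ValueError("mfitit requires at least 2 processes")
    cmd = ["mpiexec", "-np", str(num_procs)]
    if machinefile is not None:
        cmd += ["-machinefile", machinefile]
    cmd += [mfitit, "-A", orm_file, "-d", disp_file]
    if use_offset:
        cmd.append("-u")
    if rfactors is not None:
        # Order follows the help text: dresp rfactor first, loco rfactor second.
        dresp_rfactor, loco_rfactor = rfactors
        cmd += ["-r", str(dresp_rfactor), str(loco_rfactor)]
    return cmd


if __name__ == "__main__":
    cmd = build_mfitit_command(
        30, "28MARCH_ORM_ALL.TXT", "28MARCH_DISP_1.TXT",
        use_offset=True, rfactors=(0.5, 0.5),
        machinefile="/usr/local/openmpi/etc/openmpi-default-hostfile",
    )
    # Print the command; pass `cmd` to subprocess.run(cmd, check=True) to launch it.
    print(shlex.join(cmd))
```

Building the command as a list avoids shell quoting problems with file names. The check on `num_procs` reflects the 2-processor minimum stated earlier.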
Benchmarking¶
The whole point of using MPI is to speed up the LOCO calculation. In this particular implementation, MPI LOCO only divides the calculation of the Jacobian among the given number of CPUs less 1. These CPUs are called slaves. The SVD calculation (which is done 3 times by default) is always done on one CPU, which is called the master. The calculation speedup from increasing the number of CPUs for one Booster ramp breakpoint is shown here:
The same data plotted as a stacked histogram and compared to serial LOCO is shown here:
It is not surprising that the total calculation time decreases as the number of CPUs is increased. But as the communication between CPUs starts to dominate, the gain from each additional CPU shrinks. In this implementation, the speedup levels off at about 30 CPUs. Furthermore, since the SVD is always calculated by the master CPU, its completion time is unaffected by the number of slave CPUs, as expected. Serial LOCO on 1 CPU has the same completion time as MPI LOCO on 2 CPUs because, in the MPI case, one slave essentially performs the whole Jacobian calculation and one master calculates the SVD. This shows that communication between 2 CPUs is not a bottleneck in this implementation.
- Bottom line: Using 30 CPUs, MPI LOCO is 5.8 times faster than serial LOCO that uses 1 CPU.
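The division of work described in this section can be illustrated with a short sketch. It is not taken from LOCO.cpp: the function names are hypothetical, rank 0 is assumed to be the master, and the contiguous, near-equal split of Jacobian columns is an assumption about how such a division could be done. The second function is an idealized model that ignores communication cost, so it gives an upper bound rather than a prediction of the measured curve.

```python
def split_columns(num_columns, num_procs):
    """Assign Jacobian column indices to slave ranks 1..num_procs-1.

    Rank 0 is taken to be the master and gets no columns. Contiguous blocks
    that differ by at most one column are an assumption of this sketch.
    """
    num_slaves = num_procs - 1
    if num_slaves < 1:
        raise ValueError("need at least 2 processes: one master and one slave")
    base, extra = divmod(num_columns, num_slaves)
    assignment = {}
    start = 0
    for rank in range(1, num_procs):
        # The first `extra` slaves take one additional column each.
        count = base + (1 if rank <= extra else 0)
        assignment[rank] = range(start, start + count)
        start += count
    return assignment


def ideal_speedup(jacobian_time, svd_time, num_procs):
    """Speedup over serial LOCO if communication cost were zero.

    Only the Jacobian time is divided among the slaves; the SVD time stays
    on the master and is not reduced.
    """
    num_slaves = num_procs - 1
    serial = jacobian_time + svd_time
    parallel = jacobian_time / num_slaves + svd_time
    return serial / parallel
```

With `num_procs = 2` the model gives a speedup of exactly 1, which matches the observation that 2-CPU MPI LOCO takes as long as serial LOCO. As `num_procs` grows, the fixed SVD time limits the achievable speedup even before communication is taken into account.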
Debugging¶
A serial debugger like gdb can be used to debug MPI applications. First, either define DEBUG_MPI or comment out the #ifdef DEBUG_MPI (and its matching #endif) in LOCO.cpp.
It is easier to debug on a Mac than on the CLX cluster because all the processes run on the same machine.
When mfitit is run, it prints the PIDs of its processes. For example, running the following produces two PIDs
mpiexec -np 2 mfitit -A 28MARCH_ORM_ALL.TXT -d 28MARCH_DISP_1.TXT
This gives
PID 75870 on ad115712-mac.fnal.gov ready for attach
PID 75869 on ad115712-mac.fnal.gov ready for attach
gdb can then attach to these PIDs. For example, to attach to the master, which has the smallest PID
gdb -pid 75869
and the slave
gdb -pid 75870
gdb should print out something like this
............................................ done
Reading symbols for shared libraries + done
0x00007fff88662386 in __semwait_signal ()
(gdb)
Give gdb the following commands; its replies are shown as well. In this example, gdb is attached to the master.
(gdb) cont
Continuing.
^C
Program received signal SIGINT, Interrupt.
0x00007fff88662386 in __semwait_signal ()
(gdb) n
Single stepping until exit from function __semwait_signal,
which has no line number information.
0x00007fff886634c8 in cerror ()
(gdb)
Single stepping until exit from function cerror,
which has no line number information.
0x00007fff886634db in cerror_nocancel ()
(gdb)
Single stepping until exit from function cerror_nocancel,
which has no line number information.
0x00007fff88650bba in cthread_set_errno_self ()
(gdb)
Single stepping until exit from function cthread_set_errno_self,
which has no line number information.
0x00007fff80bb508b in cthread_set_errno_self ()
(gdb)
Single stepping until exit from function cthread_set_errno_self,
which has no line number information.
0x00007fff886634fc in cerror_nocancel ()
(gdb)
Single stepping until exit from function cerror_nocancel,
which has no line number information.
0x00007fff80c377c8 in nanosleep ()
(gdb)
Single stepping until exit from function nanosleep,
which has no line number information.
0x00007fff80c37652 in sleep ()
(gdb)
Single stepping until exit from function sleep,
which has no line number information.
LOCO::calculate_Jacobian (this=0x7fff5db12150, cycle_time=0.0030983011199999999, machine_header=0x7fd35f581168 "/Users/cytan/expt/booster/loco_new/common/MHeader.sdds", machine_params_sdds=@0x7fff5db112b8, yerr=@0x7fff5db11250) at LOCO.cpp:2952
2952 while (0 == i)
To break out of the while loop, issue the following command
(gdb) set var i=7
Current language: auto; currently c++
(gdb) n
2958 MPI_Comm_size(MPI_COMM_WORLD, &num_processors);
Do the same for the slave and debugging can proceed as normal.
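When many processes are started, picking the PIDs out of the startup messages by hand gets tedious. The sketch below parses the "ready for attach" lines and prints one gdb command per process. The helper name `gdb_attach_commands` is hypothetical, and labelling the smallest PID as the master simply follows the statement made earlier in this section; it is not verified against LOCO.cpp.

```python
import re

# Matches the startup lines shown above, e.g.
# "PID 75869 on ad115712-mac.fnal.gov ready for attach"
ATTACH_LINE = re.compile(r"PID (\d+) on (\S+) ready for attach")


def gdb_attach_commands(output):
    """Return (role, command) pairs from mfitit's "ready for attach" lines."""
    pids = sorted(int(match.group(1)) for match in ATTACH_LINE.finditer(output))
    commands = []
    for index, pid in enumerate(pids):
        # Assumption taken from the text above: the smallest PID is the master.
        role = "master" if index == 0 else "slave"
        commands.append((role, f"gdb -pid {pid}"))
    return commands


if __name__ == "__main__":
    example = (
        "PID 75870 on ad115712-mac.fnal.gov ready for attach\n"
        "PID 75869 on ad115712-mac.fnal.gov ready for attach\n"
    )
    for role, command in gdb_attach_commands(example):
        print(f"{role}: {command}")
```

Each printed command can be run in its own terminal. Note that the host name is matched but not used, so on the cluster the commands still have to be run on the node where each process lives.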