
FaRX

FaRX is a C++ software package designed to produce comprehensive weekly status updates concerning the NOvA Far Detector and its hardware. In this wiki, we provide an overview of how to acquire and run this package, and how to interpret the results. We also provide a short tutorial on how to modify this package in order to produce new metrics.

Acquiring and running FaRX

Module History updates can be run using FaRX on the novadq gpvm out of the S16-10-07 tag. If you are just doing an update, skip to the instructions below.

FaRX is committed to the novasoft svn repository. Acquiring it is as easy as setting up novasoft, moving into your favorite distribution, and executing the command:

addpkg_svn -h FaRX

It is then a matter of compiling and running the package:

cd FaRX
make
./bin/run

FaRX is not part of the ART framework, but it does depend upon several ROOT objects in order to compile and run. Keep this in mind if you begin to encounter odd errors complaining about missing definitions for TTree or TH1F objects--it is possible that your environment has not been set up to recognize ROOT.

Running FaRX takes approximately five minutes as of April 2015. The program is, by default, verbose enough that you can follow its progress if you are concerned that it might have hung.

Keeping FaRX up to date

FaRX does not run automatically. At this point, it is usually just run after maintenance has been done.

Why Module_History(all).csv?

The largest single source of information used by FaRX is the Ash River Hardware QA database. This database contains, among other things, information on when each piece of hardware attached to the Far Detector was added and removed. This timing information is used by FaRX to determine when a repair was attempted. Note that FaRX interprets all non-initial installation hardware swaps as attempted repairs.

FaRX does not access the Ash River Hardware QA database directly. Raw database information can be difficult to interpret, and a straightforward interpretation may result in "ghost" entries: TECCs or TECs which remain "on" the detector after having been removed. The reason for this is straightforward: the installation time recorded for an APD or FEB is the time at which the piece is attached to the detector. However, the 'installation' time for a TEC is the time that it was attached to an APD, and for a TECC, the time that it was attached to an FEB. Removal times are defined in the same way. In addition, there are also simply false database entries--records that were made incorrectly, or accidentally. Without careful bookkeeping, this can result in false information about the status of the detector.

FaRX gets around these problems in a two-step process (plus a number of smaller tweaks, which can be found in the comments of the StatusWatcher class).

First, instead of directly accessing the Ash River Hardware QA database, it accesses the results of the Module_History(all) SQL script found here. This script places some reasonable time constraints on the relative FEB and APD install/remove times to eliminate many false database entries.

Second, the results of this script are loaded into memory, and checked for inconsistencies: e.g., cases in which an APD is installed over an existing APD. It attempts to reconcile these inconsistencies by generating move-by-move histories of each module on the detector, and then tweaking histories with logical conflicts to represent the most consistent set of 'real' histories (again, see the StatusWatcher class for more details).
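As an illustration of the kind of inconsistency check described above, the following is a minimal sketch (not the actual StatusWatcher code; the structure and function names are hypothetical) that scans a time-ordered list of install/remove events for a single position and flags any APD installed over an existing APD:

#include <string>
#include <vector>

// Each entry describes one install or remove event for a single detector
// position; the list is assumed to already be sorted in time order.
struct HardwareEvent {
  std::string action;  // "install" or "remove"
  std::string serial;  // serial number of the APD
};

// Return a description of every event in which an APD is installed while the
// position is already occupied.
std::vector<std::string> FindDoubleInstalls(const std::vector<HardwareEvent>& events) {
  std::vector<std::string> conflicts;
  bool occupied = false;
  for (const HardwareEvent& e : events) {
    if (e.action == "install") {
      if (occupied)
        conflicts.push_back("APD " + e.serial + " installed over an existing APD");
      occupied = true;
    } else if (e.action == "remove") {
      occupied = false;
    }
  }
  return conflicts;
}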

STEP 1: Update Module_History(all).csv

Before running FaRX for an update, the FaRX maintainer should begin by executing the Module_History(all) SQL script. After clicking "Execute" to the right of this script, you will be directed to a new page with an "Export to CSV" button located at the top left. Click on this button, and save the resulting file with a date-appropriate name, e.g., Module_Histories_2018_4_8.csv. The date in the filename must be the most recent Sunday, and the numbers should not be zero-padded. Move this file to novadq@novadqgpvm01.fnal.gov:/home/novadq/Releases/FDHardwareReport/S16-10-07/FaRX.
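If it helps to automate the naming, the following standalone sketch (hypothetical; not part of FaRX) prints a filename following the convention above, i.e., the most recent Sunday with no zero-padding:

#include <cstdio>
#include <ctime>

int main() {
  std::time_t now = std::time(nullptr);
  std::tm local = *std::localtime(&now);
  // tm_wday is 0 for Sunday, so stepping back tm_wday days lands on the most
  // recent Sunday (today, if today is a Sunday).
  std::time_t sunday = now - static_cast<std::time_t>(local.tm_wday) * 24 * 60 * 60;
  std::tm s = *std::localtime(&sunday);
  // Plain %d gives the non-zero-padded numbers, e.g. Module_Histories_2018_4_8.csv
  std::printf("Module_Histories_%d_%d_%d.csv\n",
              s.tm_year + 1900, s.tm_mon + 1, s.tm_mday);
  return 0;
}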

STEP 2: Run the FaRX script

Log on to novadq@novadqgpvm01.fnal.gov.

cd /home/novadq/Releases/FDHardwareReport/S16-10-07/FaRX
setup_nova -r S16-10-07
nohup python AutoRunFaRX.py &

While the script is running, you can tail the logfile if you wish to monitor progress. FaRX logfiles are in /home/novadq/Releases/FDHardwareReport/logs.
This script will automatically update the Module Histories on the Hardware Watchlist website.

Interpreting FaRX Plots

Basic Plots

A good place to start is with the results/StatusReportSummaries_BasicPlots directory. This directory contains basic information on noisy/quiet/non-reporting channel issue rates in simple plots. These plots are generally used to determine the overall state of the detector, in terms of issue rates. At the present time (Apr 6 2015), all plots are contained in file BasicPlots.root.

Before describing individual plots, we first comment that total issue rates can be split into three regimes--modules in each of these regimes behave slightly differently. A module with an issue rate > 95% tends to be a complete failure, which can only be fixed by repair work. A module with an issue rate < 50% tends to be a transient problem, which will often disappear entirely by the next week. Meanwhile, a module with an issue rate > 50%, but < 95%, will generally oscillate between high and low issue rates over the course of several months, before finally becoming a permanent member of the > 95% issue rate regime, or dropping off of the radar and ceasing to be a problem.
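These three regimes can be summarized as a simple classification. The helper below is hypothetical and not part of FaRX; the thresholds are the ones quoted above:

#include <string>

// Classify a module's total issue rate (0.0 - 1.0) into the three regimes
// described above.
std::string IssueRegime(double issueRate) {
  if (issueRate > 0.95) return "complete failure: only repair work will fix it";
  if (issueRate > 0.50) return "oscillating: may become permanent or fade away";
  if (issueRate > 0.00) return "transient: often gone by the next week";
  return "no issue";
}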

As a result of this, we have historically documented our issue rates with different definitions of what issue rate qualifies as a problem: either > 95%, > 50%, or > 0%. Most of the plots to be found in BasicPlots.root include results for each of these three regimes, in order to paint a complete picture of the detector's issue modules.

BasicPlots.root:/ChannelRatesCanvas is a good place to start. Here, we show the percent of all channels on the detector that are a problem, as a function of time:

Example IssueChannelsRateCanvas

"Problem" defined as "on a module with an issue rate > X", for the X={0%,50%,95%}. We show both the number of channels that are a problem (the solid lines), as well as the projected number of channels that would have been a problem, had no maintenance work been done (the dashed lines). This projection is done by assuming that had work not been done on a module, then the issue rate would have remained at least as high as it had been before work was done.

BasicPlots.root:/SavedChannelsCanvas shows the ratio (projected_issue_channels_w/o_work - actual issue channels)/projected_issue_channels_w/o_work. This is a rough measure of how many more channels would have been a problem had no maintenance work been done:

Example SavedChannelsCanvas
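The projection assumption and the saved-channels ratio can be written out explicitly. The sketch below is illustrative only; the helper names are hypothetical, and the real calculation lives in the FaRX source:

#include <algorithm>

// Projection assumption: had no work been done on a module, its issue rate
// would have remained at least as high as it was before the work.
double ProjectedRateWithoutWork(double observedRate, double rateBeforeWork,
                                bool workWasDone) {
  return workWasDone ? std::max(observedRate, rateBeforeWork) : observedRate;
}

// SavedChannelsCanvas quantity:
// (projected issue channels w/o work - actual issue channels) / projected issue channels w/o work
double SavedChannelsFraction(double projectedIssueChannels, double actualIssueChannels) {
  if (projectedIssueChannels <= 0.0) return 0.0;  // nothing projected to be a problem
  return (projectedIssueChannels - actualIssueChannels) / projectedIssueChannels;
}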

BasicPlots.root:/ModuleRatesCanvas shows the number of all modules on the detector that are a problem, in each regime, as a function of time:

Example ModuleRatesCanvas

Finally, BasicPlots.root:/SwapsMadeCanvas.ps provides a rough measure of maintenance operations that are successful vs unsuccessful in addressing high issue rates:

Example SwapsMadeCanvas

There are a few caveats concerning these plots. For one, the threshold for a module being a problem is set in FaRX/Main.C, in the parameter sent to MakeStatusReports() in the last line of the following code block:

//--------
// Initiate the StatusWatcher instance, and load the DB, Watchlist, and Return
// Watchers into it
//
// Note that the parameter sent to MakeStatusReports()
// indicates the 'problem' issue rate around which FIXED, FAILED - type statuses
// are determined. The MakeStatusReports function MUST be called if the results
// that come later are to be trusted. It is in this step that most metrics
// describing the status of the detector are determined.

watcher::StatusWatcher TheStatusWatcher;
TheStatusWatcher.LoadHistories( &TheDBWatcher, &TheWatchlistWatcher, &TheReturnWatcher );
TheStatusWatcher.MakeStatusReports(0.00);

The default value is 0.0, meaning that any non-zero issue rate is a problem. You may change this value if you wish, noting that it will affect the assignation of module statuses FIXED, FAILED, etc. throughout the program.

In addition, FaRX currently counts work as having been done on a module if: (1) the issue rate on this position had been non-zero prior to a hardware swap, and (2) a hardware swap has been made in the time between last week's Nearline Watchlist and this week's. If work is done that does not satisfy these criteria, then that work will not show up on these plots.

The shown plots are stacks. This means that each "category" of color in a given column represents the sum of all entries in that category, as well as all entries in the categories plotted below that one in the same column.

Finally, the second plot in BasicPlots.root:/SwapsMadeCanvas.ps--Del % Issue Channels Per Swap--is exactly that: the change in the % of issue channels (with > 0% issue rates being a problem) from week to week, divided by the number of hardware swaps recorded. This change is signed so that a positive change means that the percent of issue channels has gone down; a negative change means that the percent of issue channels has gone up. This metric is meant to provide a rough single measure of success, but on weeks that only one or two swaps are made, the general increase in issue rates from all sources may result in a net "negative" improvement. Therefore, on weeks with a low number of hardware swaps made, this metric may report a falsely "negative" overall outcome.
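In code, the quantity might look like the hypothetical sketch below, with the sign chosen so that an improvement (fewer issue channels) reads as positive:

// Change in the % of issue channels from last week to this week, per recorded
// hardware swap. Positive = the percent of issue channels went down.
double DelIssueChannelsPerSwap(double pctIssueLastWeek, double pctIssueThisWeek,
                               int nSwaps) {
  if (nSwaps <= 0) return 0.0;  // not defined on weeks with no recorded swaps
  return (pctIssueLastWeek - pctIssueThisWeek) / nSwaps;
}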

Advanced plots

The folder results/StatusReportSummaries_Advanced holds more in-depth views of the issue rates observed in the detector, including kinds of problems (Hi/Lo ADC counts, FEB dropouts) not included in the Basic Plots. All of these plots are currently contained in the file AdvancedPlots.root. A good place to start is with AdvancedPlots.root:/ChannelRatesCanvas:

Example Advanced ChannelRatesCanvas

This plot shows the percent of channels that suffer from each type of issue, as shown in the TLegend instance at the top left. Note that if a channel is both noisy and experiences some rate of Hi ADC counts, then it will contribute to both issue rates equally--the rates displayed are not exclusive. Note that the sharp downward peak in Hi/Lo ADC and dropout rates on 02-17-2015 is artificial--these metrics were being introduced and refined in the weeks surrounding this date, and were not produced on that date. Therefore, the Hi/Lo ADC and dropout rates of 0.00 on 02-17-2015 are not correct, and should not be used in any analysis.

The next place to look might be AdvancedPlots.root:/ChannelRates_ZoomCanvas, which shows a close-up of the lower-rate issue types (excluding the typically overwhelming All and Non-Reporting issue rates):

Example Advanced ChannelRatesCanvas (Zoom)

Finally, AdvancedPlots.root:/ModuleRatesCanvas shows the percent of modules that suffer from each type of issue, with each module given equal weight regardless of how high each issue rate is for that module:

Example Advanced ModuleRatesCanvas

Dropout plots

There are two major categories of FEB dropouts: single FEB dropouts, in which some FEB becomes unacceptably noisy and is automatically removed from a run; and DCM-wide FEB dropouts, in which a whole DCM drops out of a run. The dropouts that are of most concern, in terms of hardware maintenance, are those in which a single FEB drops out frequently. The file FaRX/DropoutsPlots/DropoutPlots.root contains a single plot which illustrates the two regimes of FEB dropouts:

Displayed are all non-zero dropout average rates for FEBs, with rates averaged over the four most recent Nearline Watchlists. The low-rate structure shows FEBs which drop out only occasionally, including cases in which a whole DCM will drop out; the high-rate structure shows FEBs which drop out frequently.

Returns Plots

The folder results/StatusReportSummaries_Returns holds information about APDs that were returned to Caltech for repair. An overarching theme to these plots is: do the APDs that Caltech repairs end up back on the detector? If so, do they perform at the same level as APDs that did not need repair?

The file ReturnsAnalysis.root:/ReturnsAnalysis_Basic contains five plots that can be used to answer that question: these show, for each APD problem type (electrical, cooling, and vacuum), the number of times that any APD was swapped off of the detector; the number of times that a returned APD was swapped off of the detector; and the number of times that an APD of each problem type was swapped off of the detector. More rigorous analysis options can be found in the ReturnsWatcher class. Two examples are shown below:

The file ReturnsAnalysis.root:/ReturnsAnalysis_Advanced contains a number of TH2F objects showing the number of APD swaps made as a function of position on the detector, where the x-axis is diblock, and the y-axis is DCM. This plot is made both for all APDs, and for APDs which had been returned for vacuum problems and which are now on the detector. An example is shown below:

The file results/StatusReportSummaries_Returns/TotalCounts.txt records the number of returns found in the Returns lists stored in the FaRX/csv/Return_Records/ directory, as well as how many of those are currently on the detector. These numbers are shown both for all returned APDs, and by problem type. An example is shown below:

More information about Returns analysis can be found in the ReturnsWatcher and StatusWatcher classes.

Making Weekly Repair Request Lists

The most important weekly output for the FaRX maintainer is a spreadsheet describing what work ought to be done during the next beam downtime. Downtime may occur unexpectedly--and it often does. Therefore, this list should be produced on a weekly basis, whether or not downtime is planned. Generally, it should be produced as soon as that week's Nearline Watchlist has been made available (Tuesday at around 7:00 AM, as of April 7 2015). It should then be uploaded to a publicly-available location, and presented at the next Operations meeting for discussion.

Part I: Identifying Problem Modules

The major source of raw material for the weekly Repair Request list is the file results/StatusReportSummaries_Rates/MaintenanceList.csv. This file (which can be opened by Excel, Numbers, or any other spreadsheet program) contains information about each module which has had an issue rate greater than 0.0 at some point since the Nearline Watchlist was introduced (October 21, 2014).

Begin by opening up this file in Excel, Numbers, or Open Office, and sorting by the column TOT RATE in descending order. TOT RATE is the total issue rate of this module, including all possible sources (noisy, quiet, non-reporting, Hi/Lo ADC counts, and dropout). All rates as presented in this spreadsheet are average rates, taken over the last four Nearline Watchlists.

Practice has shown that modules with average noisy/non-reporting issue rates > 90% tend to be modules that we will have to work on before they get better. Therefore, we begin by ignoring any modules with an average total issue rate < 90%.

We then check to see which modules have issue rates that are > 90% due to high noisy/non-reporting rates, and which are > 90% due to influence from other causes (e.g., hi/lo ADC counts). We do this by consulting the columns TYPE and TYPE FRAC, and SUB and SUB FRAC. TYPE and TYPE FRAC display the dominant problem type, and what issue rate that problem type is reporting; SUB and SUB FRAC display the next most dominant problem type, and what issue rate that problem type is reporting. In all cases, dominance is decided simply by which issue rate is highest / next-highest.

Below, we see a MaintenanceList.csv, sorted in descending order by TOT FRAC, and color-coded to signify whether a module has a > 90% noisy/non-reporting rate (red); a > 90% total issue rate that was pushed into the > 90% regime by Hi ADC counts (orange); or a > 90% total issue rate that was pushed into the > 90% regime by Lo ADC counts (purple):

We see that there are twelve modules with noisy/non-reporting rates > 90%. We can tell more about them by looking at the columns LAST REPAIR TYPE and LAST REPAIR DATE. "LAST REPAIR DATE" records the date of the first Nearline Watchlist produced after work was done on that location. This is done for the sake of simplicity: in general we are less interested in the particular time that a repair was made, and more in the general timeframe. By grouping work in this manner, we can easily group repair work by the period in which it was done.

Similarly, "LAST REPAIR TYPE" displays a record of what actions were taken at that position, during the last repair period during which work was done at that position. Hardware is abbreviated as A=APD, T=TEC, F=FEB, CC=TECC. A "+" sign after a hardware abbreviation means that that type of hardware was added to that position; a "-" sign, that it was removed. As an example, the sequence "A+A-T+T-" means that an APD/TEC pair was removed, and replaced.

Much more detailed information about work done at a given position can be found by looking up its complete history file. These too can be found in the results/StatusReportSummaries_Rates directory, where the history of each module listed in MaintenanceList.csv is named as BPP_BLOCK_PLANE_POSITION.txt. BLOCK, PLANE, and POSITION are integers describing the coordinates of that module (e.g., BPP_2_28_5.txt). An example history, color-coded, is shown below for the position (Block, Plane, Position) = (2, 28, 5). The report shown here was generated on January 7, 2015:

As shown above, a module history is composed of alternating blocks of one-line Nearline Watchlist summaries, and one-line Maintenance summaries. Each time that a part of the detector is installed or removed, a Maintenance summary is added to that module; each time that a module registers a non-zero issue rate on some week's Nearline Watchlist, a Nearline Watchlist summary is added to that module.

Reading the example above from bottom to top:

1) The first four Maintenance summaries correspond to the initial installation of hardware at (Block, Plane, Position) = (2, 28, 5).

2) On May 29, 2014, the FEB and TECC are replaced.

3) In August, the APD and TEC are removed for the retrofit; a new APD and TEC are placed on this position late in October.

4) The first time that this position registers a non-zero issue rate on a Nearline Watchlist is on November 4, 2014.

5) The issue rate remains quite high over the next four weeks, until the APD and TEC are replaced on December 2, 2014, in an attempt to resolve the problem.

6) In this case, that replacement did not solve the problem, and the high issue rate continues until the last recorded Nearline Watchlist date on January 6, 2015.

Whenever a summary is added to a module, the overall status of the module is also recorded. These are found to the far left of each line. There are a few possible options: "NO ISSUE" means that the total noisy/quiet/non-reporting issue rate at that position was below the minimum threshold sent by the line

 TheStatusWatcher.MakeStatusReports(0.00);

in Main.C. "FIXED" means that the total issue rate of the previous week's Nearline Watchlist was greater than the minimum threshold; work was done on that position; and the total issue rate on that module is now less than the minimum threshold. "FAILED" means that the total issue rate is still higher than the minimum threshold, even after work was done.

"BROKE" is a rare (and perhaps glibly-labeled) condition in which work is done at a location which had an issue rate lower than the minimum threshold; and now that issue rate is higher than the minimum threshold. This usually results from a high minimum threshold. In that case, it is common for, e.g., a 98% issue rate to register as a 99% issue rate after work having been done. It is unlikely that the repair work actually diminished the functioning of that position--rather, a natural oscillation in issue rate pushed the value about threshold. An important additional case: if an APD/TEC was added after the retrofit, and immediately has a high issue rate -- then by definition of the "BROKE" label, it too will fall into this category. This particular source of "BROKE" designation appears in the example module history, above.

"NEW ISSUE" means that an issue rate has jumped above the minimum threshold, after having been below that threshold in the previous week. "PENDING" means that an issue rate was above the minimum threshold last week, and is still there this week. This name reflects the notion that we have usually already assigned such channels to our maintenance list--we are aware of the problem, and are waiting until the next downtime to address it.

Finally, "AUTOFIXED" and "AUTOFAILED" refer to cases in which no work was done at a location, but its issue rate has dropped below or above the minimum problem threshold, respectively.

Part II: Oscillating Modules and DSO Scan Diagnoses

After identifying modules with a noisy/non-reporting issue rate > 90%, the FaRX maintainer is advised to check the module histories of each such module, as described in the previous section. A particularly important item to check is whether these high issue rates are stable or oscillating. The TOT SLOPE column in MaintenanceList.csv is meant to be a first-order measure of this oscillation, and is calculated simply as (tot issue rate this week - tot issue rate four weeks ago) / 4. However, it should be used in tandem with the module history record to determine whether a module falls into the "oscillating" category.
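As a concrete example of the TOT SLOPE definition, consider the hypothetical sketch below. The helper names and the tolerance value are illustrative only; the module history should still be consulted before labeling a module as oscillating:

#include <cmath>

// TOT SLOPE: per-week change in total issue rate over the last four Nearline
// Watchlists.
double TotSlope(double totRateThisWeek, double totRateFourWeeksAgo) {
  return (totRateThisWeek - totRateFourWeeksAgo) / 4.0;
}

// First-pass oscillation flag: a large slope in either direction suggests the
// issue rate has not been stable over the last four weeks.
bool LooksOscillating(double totRateThisWeek, double totRateFourWeeksAgo,
                      double tolerance = 0.05) {
  return std::fabs(TotSlope(totRateThisWeek, totRateFourWeeksAgo)) > tolerance;
}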

If issue rates have been oscillating up and down over the past several weeks, it is best not to try replacing hardware at this position; this module may eventually fall off the radar and cease to be a problem. If they have been constant or near-constant for the past four weeks (at a minimum), then they are candidates for hardware replacement. "Oscillating" modules should not be placed on the week's Repair Request list.

Once non-oscillating modules with an issue rate > 90% have been identified, then this list of modules, as well as their histories, should then be sent to the DSO Scan Expert (Tian Xin, as of Apr 7 2015).

The DSO Scan Expert will determine the most likely source of the problems at each module, and respond with a suggested course of action (e.g., "Replace APD/TEC", "Replace FEB/TECC", or "Check communication cables"). The FaRX maintainer should add these diagnoses to the MaintenanceList.csv under column "SUGGESTED ACTION". After this list is presented at the next Operations meeting, the group will decide whether the recommended action should be taken during the next beam downtime. That decision should be placed in the "ACT?" column. An example is shown below:

Note that there are two lines not highlighted in red which we have also provided comments on--these are modules which had shown noisy/non-reporting rates > 90% for the previous week, and were therefore pending maintenance work--however, their noisy/non-reporting rates are now < 90%, with Hi/Lo ADC issue rates pushing the total above 90%. This is a relatively new problem mode, as of Apr 7 2015, and so we are tagging these modules for future study.

We also show above the four columns FEB SN, TECC SN, APD SN, and TEC SN. These columns show the serial numbers of each piece of hardware currently registered as being located at that position on the detector. Database problems can occasionally lead to there being no APD, TEC, FEB, or TECC at some location on the detector; in that case, the entry for that module in the SN column would be blank. The FaRX maintainer should follow up on such cases, and determine what has gone wrong--it is rarely the case that part of the detector is actually missing an APD/TEC or FEB/TECC.

Part III: Identifying High Dropout Rate Modules

Next, the FaRX maintainer should check the file results/DropoutRates/AllDropoutRecords.csv. Again, this file can be opened by any spreadsheet editing program. The structure of results/DropoutRates/AllDropoutRecords.csv is identical to that of results/StatusReportSummaries_Rates/MaintenanceList.csv, but only modules with a four-week average dropout rate of greater than 0.00 are reported.

First, sort the spreadsheet by column DROPOUT RATE (AVG) in descending order. The quantity in this column is the number of times that each FEB has dropped out, divided by the number of subruns--i.e., the average number of times that each FEB drops out per subrun.

We then define three regimes: FEBs with a dropout rate > 1.0; a dropout rate > 0.1 but < 1.0; and a dropout rate > 0.0 but < 0.1. The latter two groups are defined somewhat arbitrarily, though they should roughly separate one-off FEB dropouts from FEBs that are dropping out more than once per subrun. The former group (dropout rate > 1.0) is defined to include FEBs that drop out very frequently--it is likely that something is going on at such locations, and that these positions may benefit from a hardware replacement.
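A hypothetical classification of these regimes, using the thresholds quoted above (not part of FaRX):

#include <string>

// Classify a FEB's four-week average dropout rate (dropouts per subrun).
std::string DropoutRegime(double avgDropoutsPerSubrun) {
  if (avgDropoutsPerSubrun > 1.0) return "frequent: candidate for the Repair Request list";
  if (avgDropoutsPerSubrun > 0.1) return "intermittent";
  if (avgDropoutsPerSubrun > 0.0) return "occasional (often one-off or DCM-wide)";
  return "no dropouts recorded";
}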

The FaRX maintainer should identify the FEBs with dropout rates > 1.0, and put them on the Repair Request list with the Suggested Action "Check/tighten communication cables":

Simply tightening the communication cables has been found to be very effective at addressing these high-dropout rate FEBs. If this tightening fails, the next step would be to treat this module as you would any other problem module: send that module's identity and history to the DSO Scan Expert, ask for a suggested course of action, and add that suggestion to the next week's Repair Request list.

As when identifying high noisy/non-reporting rate modules, it is useful to make certain that these rates are not oscillating with time. For that reason, the FaRX maintainer should at least compare the DROPOUT RATE (CUR), DROPOUT RATE (AVG), and DROPOUT SLOPE quantities. If DROPOUT SLOPE is high, or if DROPOUT RATE (CUR) is very different from DROPOUT RATE (AVG), then it is quite possible that these high dropout rates are spurious, and will disappear with time. However, there is one important distinction separating high-dropout rate modules from high noisy/non-reporting rate modules: checking/tightening cables is a very low-cost repair option, and can be done very quickly. Therefore, if in doubt about whether a high dropout rate is oscillating or not, it is reasonable to err on the side of checking/tightening that cable.

Part IV: Slow / No Coolers

As of Apr 8 2015, the Run Coordinator is responsible for identifying slow/no coolers on the detector, and determining the appropriate course of action. DocDB 11016 contains a table of cooling problem symptoms, and the rates of success of different techniques for solving these problems. This may be used to determine what should be done with a cooling problem.

Once cooling problems have been identified, module histories can be generated for these positions by uncommenting the following block in your Main.C file,

//-------
// If you want to produce a spreadsheet
// describing modules with cooling problems,
// add a vector of coordinates to [coolprobs]
// for each module.

TString coolFolder = "results/CoolingProblems";
bool coordsAreBPP = false;
std::vector<int> coolcoords(3);
std::vector< std::vector<int> > coolprobs;
coolcoords[0] = 1; coolcoords[1] = 11;  coolcoords[2] = 52; coolprobs.push_back(coolcoords);
coolcoords[0] = 1; coolcoords[1] = 10;  coolcoords[2] =  8; coolprobs.push_back(coolcoords);
coolcoords[0] = 6; coolcoords[1] =  3;  coolcoords[2] =  4; coolprobs.push_back(coolcoords);
TheStatusWatcher.MakeOutputCSV( coolFolder, coolprobs, coordsAreBPP);

and modifying the vector of coordinates [std::vector<std::vector<int> > coolprobs] to contain the coordinates for each module that you want to know more about. You can save these coordinates in either BPP or DDF format--just be consistent, and make certain to change [bool coordsAreBPP] to reflect your choice. NB that while the code block above is set up to create a folder detailing cooling problem modules, you can save information about any group of modules that you would like to investigate using the MakeOutputCSV command--simply copy and modify the code shown above, providing an appropriate label for the new save folder.
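For example, a minimal variant saving a different group of modules in BPP coordinates might look like the following (the folder name and the single (2, 28, 5) coordinate are illustrative only):

TString bppFolder = "results/MyProblemModules";
std::vector<int> bppCoords(3);
std::vector< std::vector<int> > bppProbs;
bppCoords[0] = 2; bppCoords[1] = 28; bppCoords[2] = 5; bppProbs.push_back(bppCoords);  // (BLOCK, PLANE, POSITION)
// Passing true as the last argument interprets the coordinates as BPP rather than DDF
TheStatusWatcher.MakeOutputCSV( bppFolder, bppProbs, true );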

After recompiling and running FaRX, you will have a new results/CoolingProblems folder which contains:

1) A MaintenanceList.csv of the same format as the noisy/non-reporting MaintenanceList.csv
2) An individual history file for each coordinate vector sent to [coolprobs], with each file labeled by module coordinate

Finally, grab the histories from MaintenanceList.csv, and modify the SUGGESTED ACTION and ACT? columns to reflect what the Run Coordinator and Ops group have agreed upon:

Part V: Putting It All Together

Finally, collect the results of Steps I-IV into a single spreadsheet:

Note that this combined spreadsheet has a new column that was not present in the original MaintenanceList.csv files, "PROB TYPE". In this column, the FaRX maintainer should record the original reason that this channel was being considered. In addition, to make it easier to coordinate repairs, entries should be sorted by (BLOCK, PLANE, POSITION); this way, Ash River techs can easily organize the work by position, in order to make the requested repairs as efficiently as possible.

At this stage, it is not necessary to record information on issue rates, or on historical repairs. These things should have already been considered during the initial analysis of the week's problem modules. However, it is very helpful to provide the serial numbers of the APD, TEC, FEB, and TECC currently recorded at each position. This information provides a double-check that work is being done on the correct module. Alternately, there may be database errors, in which the wrong hardware is registered at some position. If the Ash River techs find a discrepancy between the hardware recorded as having been installed on some position, and the hardware actually installed on that position, then the FaRX maintainer should correct this in the Ash River Hardware QA database. If information in the database is incorrect, then the wrong settings may be applied to that module--this will lead to poorer detector performance.

This final table should be presented at the next Operations meeting, at which point SUGGESTED ACTIONS may be modified, and the decision whether or not to ACT? will be made. Once these decisions are made, the FaRX maintainer should upload this spreadsheet to a public location. As of Apr 8 2015, these weekly updates are uploaded to DocDB 12804.

History Dump

It is uncommon, but the FaRX maintainer may at some point want to look through a large number of module histories--for example, all module histories for some Diblock. In this case, using the MakeOutputCSV( ) command might be unwieldy. Instead, one may uncomment the lines:

//-------
// Dump all histories to file. Passing the second parameter as FALSE means
// to output history files with names based on the DDF coordinate system
// (DIBLOCK, DCM, FEBPORT). As TRUE, with names based on the BPP coordinate
// system (BLOCK, PLANE, POSITION)
TheStatusWatcher.DumpAllHistories("results/AllHistories_DDF", false);
TheStatusWatcher.DumpAllHistories("results/AllHistories_BPP", true);

The DumpAllHistories function prints the history of each module on the detector to its own file. These output files can be organized by (DIBLOCK, DCM, FEBPORT) coordinates (the first line), or by (BLOCK, PLANE, POSITION) coordinates (the second line). It may be the case that both types of organization are needed, but keep in mind that the combined size of all output files is large (~200 MB, for either the DDF-organized files or the BPP-organized files). As such, running the DumpAllHistories( ) command is not recommended for casual use.

Permanently Inactive Modules

Three modules are permanently inactive, and will always register a non-reporting rate of 100%. FaRX automatically ignores these modules when producing the noisy/non-reporting rate results/StatusReportSummaries_Rates/MaintenanceList.csv file. However, they come up often enough as to merit their own section. These modules are:

Module MFV 02897, (Block, Plane, Position) = (6,30,11):

Module MFV 05611, (Block, Plane, Position) = (9,22,10):

Module MFV 10646, (Block, Plane, Position) = (26,21,11):

Again--these modules will always have a 100% non-reporting rate. They are incorrigible. You should not attempt to do any work on these positions. At some point, you will steadfastly ignore this warning, and place one or three of these modules on some weekly maintenance list. This will bring shame upon your family name for no more than three generations, but no less than one week.

Using DSO Scans to Diagnose Hardware Problems

UNDER CONSTRUCTION

Using Navicat to address database problems

UNDER CONSTRUCTION

Bothering Jaroslav

It is best to bother Jaroslav between the hours of 4 and 6 AM on Saturday and Sunday mornings. This is especially true of mornings following late-night run issues.

It is customary to begin your bother with a statement along the lines of "There's a bee in the control room," or "The contrast on the novacr01 display is set a bit higher than usual." It is absolutely imperative that you share these worries with Jaroslav as soon as possible, without consulting either Yahoo Answers or the MINOS shifter for an answer. Such actions bring Jaroslav great joy, and his tone should not be mistaken for one of incredulous despair.
