Bug #2205

Occasional station failure on reboot.

Added by Vitali Tupikov almost 9 years ago. Updated over 3 years ago.

Status: Closed
Priority: Normal
Start date: 11/23/2011
Due date:
% Done: 100%
Estimated time:
Spent time:
Duration:

Description

Over months of LLRF operation it was noticed that the station reboot occasionally failed (roughly once in 30 reboots). The symptoms of the failure are as follows:
1) the slot0 controller responds fine to backdoor terminal commands;
2) the TRIG2 line, which signals data ready from the MFC to slot0, stays low all the time.

History

#1 Updated by Vitali Tupikov almost 9 years ago

  • % Done changed from 0 to 70

It took a few weeks and more than 100 power reboots to reproduce the failed condition on the Linac development station. Below are a link to the study in the LLRF log book and a copy of the observed symptoms.

http://www-bd.fnal.gov/cgi-mach/machlog.pl?nb=llrf&action=view&page=279&anchor=085739&hilite=08:57:39-

1) There is no TRIG2 signal present (the line stays low all the time, from the very beginning);
2) Slot0 booted up OK and reached the command line prompt on the serial backdoor terminal;
3) Slot0 responds reasonably to command requests from the backdoor terminal (MfcDump(mfcId,1), for example);
4) ACNET devices become available through ACNET pages, though none of the scalars is updated. The initial scalar values are bogus (like 1.23e+19);
5) The phase lock LED of the RF module is OFF.

After analyzing the symptoms with Brian and Philip, we narrowed down the list of possible sources of the failure:
1) No LO Clock to MFC;
2) FPGA is not configured;
3) Clock IC is not configured.

I measured the clock frequency and power level at the 805 Ref test point on the front panel of the RF module, which sources the LO Clock. As can be seen in the log book at

http://www-bd.fnal.gov/cgi-mach/machlog.pl?nb=llrf&action=view&page=279&anchor=084310&hilite=08:43:10-

the frequency is OK, but the power level is below the spec for the MFC's clock input.

#2 Updated by Vitali Tupikov almost 9 years ago

  • Status changed from New to Feedback

#3 Updated by Vitali Tupikov almost 9 years ago

I changed the status to Feedback, assuming that Brian will decide what to do next.

#4 Updated by Brian Chase almost 9 years ago

So, the power level is 15 dB low. We have evidence that the board did not boot properly. Can we make a connection? Suggested tests:
Boot the system without any clock and see what happens.
Figure out why the LO signal is low and fix it.
Test system with fixed clock.

Note that this problem with the LO generation may be systemic. We have no data on this.
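
For a sense of scale, the 15 dB shortfall can be converted to linear ratios. The short Python sketch below uses only the standard decibel definitions; the function names are made up for illustration, and the absolute spec level for the MFC clock input is not stated in this thread, so only the ratios are computed.

```python
def db_to_power_ratio(db: float) -> float:
    """Convert a level difference in dB to a linear power ratio."""
    return 10 ** (db / 10)


def db_to_amplitude_ratio(db: float) -> float:
    """Convert a level difference in dB to a linear voltage (amplitude) ratio."""
    return 10 ** (db / 20)


shortfall_db = -15.0  # "15 dB low" as reported above
print(f"power ratio:     {db_to_power_ratio(shortfall_db):.3f}")      # 0.032
print(f"amplitude ratio: {db_to_amplitude_ratio(shortfall_db):.3f}")  # 0.178
```

In other words, a clock that is 15 dB low delivers roughly 3% of the expected power, or about 18% of the expected voltage swing.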

#5 Updated by Vitali Tupikov almost 9 years ago

  • Status changed from Feedback to Resolved
  • % Done changed from 70 to 100

I am closing the thread with the following conclusion.

http://www-bd.fnal.gov/cgi-mach/machlog.pl?nb=llrf&action=view&page=279&anchor=074401&hilite=07:44:01-

Summarizing all the studies carried out on the occasional failures of the LLRF station at bootup, the following can be concluded: the source of the failure is likely in the handshake implementation of the "TRIG2_DAQ_Ready_for_Read" trigger generation in the FPGA. Though it is not evident from the handshake mechanism documentation how this could happen, I would assume that a race condition between the FPGA and DSP bootup is to blame. Unfortunately, confirming this by troubleshooting is complicated by the very rare occurrence of the failures. To resolve the problem, I would recommend a slight modification to the TRIG2 generation scheme in the FPGA: eliminate the DPI10 acknowledge signal from the DSP, which clears the flip-flop used to set up TRIG2. Instead of DPI10, a counter restarted by the same DAQ_over pulse that sets TRIG2 can be used to generate the flip-flop clearing signal internally in the FPGA.
Hopefully the same modification will fix the failed soft-reset feature for LEL as well.
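
The difference between the two clearing schemes can be illustrated with a small behavioural model. This is a sketch in Python, not the FPGA code: the class names, the clear-over-set priority, the `hold_ticks` value, and the "DPI10 stuck high after boot" scenario are all assumptions made for illustration. The actual failure mechanism was not confirmed in these studies.

```python
from dataclasses import dataclass


@dataclass
class AckClearedTrig2:
    """Existing scheme: flip-flop set by DAQ_over, cleared by DPI10 from the DSP."""
    trig2: bool = False

    def tick(self, daq_over: bool, dpi10: bool) -> bool:
        if dpi10:          # assumption: clear has priority over set
            self.trig2 = False
        elif daq_over:
            self.trig2 = True
        return self.trig2


@dataclass
class CounterClearedTrig2:
    """Proposed scheme: a counter restarted by DAQ_over clears the flip-flop."""
    hold_ticks: int = 3    # hypothetical TRIG2 pulse width, in clock ticks
    trig2: bool = False
    count: int = 0

    def tick(self, daq_over: bool, dpi10: bool = False) -> bool:
        # dpi10 is accepted but ignored: the DSP no longer takes part in clearing.
        if daq_over:
            self.trig2 = True
            self.count = 0
        elif self.trig2:
            self.count += 1
            if self.count >= self.hold_ticks:
                self.trig2 = False
        return self.trig2


def run(model, daq_over_seq, dpi10_seq):
    """Clock the model once per sample and return the TRIG2 trace."""
    return [model.tick(d, a) for d, a in zip(daq_over_seq, dpi10_seq)]


daq_over = [True, False, False, False, False] * 2
dpi10_stuck = [True] * 10  # hypothetical bad state of the DSP after boot

print(run(AckClearedTrig2(), daq_over, dpi10_stuck))      # TRIG2 never rises
print(run(CounterClearedTrig2(), daq_over, dpi10_stuck))  # TRIG2 still pulses
```

In this model, the ack-cleared flip-flop can be held low indefinitely by the state of the DSP, which matches the observed symptom, while the counter-cleared version produces a TRIG2 pulse for every DAQ_over regardless of what the DSP is doing.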

#6 Updated by Brian Chase over 3 years ago

  • Status changed from Resolved to Closed

