Bug #21215

One-shots returning wrong data?

Added by Richard Neswold over 2 years ago. Updated about 1 year ago.

Category: DPM CLIB Support

Bill reports that he's occasionally getting DPM_PEND for his BBM one-shot requests. He's also seeing the wrong data returned when a request finally succeeds. The situation is this:

  • Once a minute, he reads L:CISM followed by L:RF3ISM as one-shots.
  • If he gets a DPM_PEND while reading L:CISM, he retries up to five times.
  • If all the retries fail, he gives up and moves on to L:RF3ISM.
  • At this point, there's a chance that the reading from the second device looks like data from the first device. These devices are on different front-ends, so it doesn't appear to be a front-end problem.

He provides an example:

                                           Cycle
Device    Value                    n-2             n-1               n
L:CINT    sum                        0               0               0
          beam cycle                 0               0               0
          all cycles        99,444,749      99,445,740      99,447,448
L:RF3INT  sum           14,176,050,814               0  14,176,639,292
          beam cycle        15,665,785               0      15,666,365
          all cycles        55,190,564      99,446,708      55,192,253

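In cycle n-1, the data returned for L:RF3INT looks like the data expected for L:CINT: the sum and beam-cycle values are 0, and the all-cycles count (99,446,708) continues the L:CINT sequence rather than the L:RF3INT one.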

We need to set up a test to see if the issue is in DPM or the CLIB interface to DPM.


#1 Updated by Richard Neswold over 2 years ago

I created an Erlang client which reads the devices once a minute. I wrote it in Erlang to avoid using CLIB: if my client shows the same problem, then DPM is suspect. If not, we can look at CLIB's code for the problem. It's currently producing output like this:

L:RF3ISM(1sh) ->  18,996,620,948,      21,264,196,      67,651,150
  L:CISM(1sh) ->               0,               0,     111,939,124
  L:CISM(per) ->               0,               0,     111,940,023
L:RF3ISM(per) ->  18,997,136,177,      21,264,729,      67,652,049
time to perform the one-shot requests
ignoring {marker,#Ref<>}
  L:CISM(1sh) ->               0,               0,     111,940,024
L:RF3ISM(1sh) ->  18,997,136,692,      21,264,730,      67,652,050
  L:CISM(per) ->               0,               0,     111,940,923
L:RF3ISM(per) ->  18,997,651,847,      21,265,263,      67,652,949
time to perform the one-shot requests
ignoring {marker,#Ref<>}
L:RF3ISM(1sh) ->  18,997,652,361,      21,265,264,      67,652,950
  L:CISM(1sh) ->               0,               0,     111,940,924
  L:CISM(per) ->               0,               0,     111,941,823
L:RF3ISM(per) ->  18,998,167,185,      21,265,797,      67,653,849

When Bill sees the problem again, I'll look back at the output to see if I saw it too.

#2 Updated by Richard Neswold over 2 years ago

I modified the test program to ask for the one-shot data one at a time. The previous version set up both one-shot requests at the same time, which doesn't mimic what Bill's application is doing. This version waits for the first result before asking for the second. It will also retry the request up to 5 times if it gets an error status.
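The sequential read-with-retry behavior described above can be sketched roughly as follows. This is only an illustration in Python, not the actual Erlang test program: the `read_once` callable, the `OK` and `DPM_PEND` status values, and the function names are all hypothetical stand-ins for whatever the real client uses to talk to DPM.

```python
# Hypothetical status values; the real DPM statuses are not shown in this issue.
OK = "OK"
DPM_PEND = "DPM_PEND"


def read_with_retry(read_once, device, max_retries=5):
    """Request one one-shot reading, retrying on any error status.

    read_once is an assumed callable: device -> (status, data).
    Returns (status, data, retries_used).
    """
    status, data = read_once(device)
    retries = 0
    while status != OK and retries < max_retries:
        retries += 1
        status, data = read_once(device)
    return status, data, retries


def poll_sequentially(read_once, devices, max_retries=5):
    """Read each device in order, waiting for one result before asking for the next."""
    results = {}
    for device in devices:
        results[device] = read_with_retry(read_once, device, max_retries)
    return results
```

Calling `poll_sequentially(read_once, ["L:CISM", "L:RF3ISM"])` once a minute would mimic the request pattern of Bill's application: the second device is never requested until the first has either succeeded or used up its retries.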

The test program has been restarted.

#3 Updated by Richard Neswold over 2 years ago

Bill informed me that if his BBM application sees a problem, it doesn't necessarily mean my test program will see it at the same time. He recommended that the test program check for the error condition itself, instead of me waiting for him to report an occurrence and then looking back through the output.

I added code to the test program to look for bad data and generate an easy-to-see message when it happens.
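The issue doesn't record which check the test program uses, so the following is only one possible heuristic, sketched in Python with hypothetical names. It assumes the "all cycles" counter of each device advances slowly from one reading to the next, so a reading whose counter sits closer to the other device's last good value than to its own is flagged as suspect.

```python
def looks_swapped(all_cycles, own_last_good, other_last_good):
    """Return True if a reading's all-cycles count is nearer to the other
    device's last good count than to this device's own last good count."""
    return abs(all_cycles - other_last_good) < abs(all_cycles - own_last_good)


def report(device, all_cycles, own_last_good, other_last_good):
    """Build an easy-to-spot message for a suspect reading, or None if it looks fine."""
    if looks_swapped(all_cycles, own_last_good, other_last_good):
        return "***** BAD DATA for %s: all cycles = %d *****" % (device, all_cycles)
    return None
```

Applied to the values in Bill's example, the cycle n-1 reading for L:RF3INT (99,446,708) is far from its previous count (55,190,564) and close to L:CINT's (99,444,749), so it would be flagged.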

#4 Updated by Richard Neswold over 2 years ago

We restarted Bill's test app to use Charlie's Java-based DPM. We've already seen the DPM_PEND/retry errors. We'll keep running to see if the "wrong data" error occurs.

Since the DPM_PEND errors show up with both DPM implementations, so far this indicates the DPM_PEND problem is in CLIB.

#5 Updated by Richard Neswold over 2 years ago

  • Category changed from Data Pool Manager to DPM CLIB Support
  • Assignee changed from Richard Neswold to Charles King

We've seen incorrect data being returned using the new Java DPM code base. Charlie is looking into the CLIB support.

#6 Updated by Richard Neswold about 1 year ago

This issue is still open, Charlie.
