Bug #21215
One-shots returning wrong data?
Description
Bill reports that he's occasionally getting DPM_PEND for his BBM one-shot requests. He's also seeing the wrong data being returned when a request finally succeeds. The situation is this:
- Once a minute, he reads L:CISM followed by L:RF3ISM as one-shots.
- If he gets a DPM_PEND while reading L:CISM, he retries up to five times.
- Eventually he gives up and moves on to L:RF3ISM.
- At this point, there's a chance that the reading from the second device looks like data from the first device. These devices are on different front-ends, so it doesn't appear to be a front-end problem.
He provides an example:
Device | Value | Cycle n-2 | Cycle n-1 | Cycle n
---|---|---|---|---
L:CINT | sum | 0 | 0 | 0
 | beam cycle | 0 | 0 | 0
 | all cycles | 99,444,749 | 99,445,740 | 99,447,448
L:RF3INT | sum | 14,176,050,814 | 0 | 14,176,639,292
 | beam cycle | 15,665,785 | 0 | 15,666,365
 | all cycles | 55,190,564 | 99,446,708 | 55,192,253
In cycle (n - 1), it can be seen that the data returned for L:RF3INT looks like the data expected for L:CINT.
We need to set up a test to see if the issue is in DPM or the CLIB interface to DPM.
History
#1 Updated by Richard Neswold over 2 years ago
I created an Erlang client which reads the devices once a minute. I wrote it in Erlang to avoid using CLIB: if my client also shows the problem, then DPM is suspect; if not, we can look at CLIB's code for the problem. It's currently producing output like this:
    L:RF3ISM(1sh) -> 18,996,620,948, 21,264,196, 67,651,150
    L:CISM(1sh) -> 0, 0, 111,939,124
    L:CISM(per) -> 0, 0, 111,940,023
    L:RF3ISM(per) -> 18,997,136,177, 21,264,729, 67,652,049
    time to perform the one-shot requests
    ignoring {marker,#Ref<0.0.3.171>}
    L:CISM(1sh) -> 0, 0, 111,940,024
    L:RF3ISM(1sh) -> 18,997,136,692, 21,264,730, 67,652,050
    L:CISM(per) -> 0, 0, 111,940,923
    L:RF3ISM(per) -> 18,997,651,847, 21,265,263, 67,652,949
    time to perform the one-shot requests
    ignoring {marker,#Ref<0.0.3.212>}
    L:RF3ISM(1sh) -> 18,997,652,361, 21,265,264, 67,652,950
    L:CISM(1sh) -> 0, 0, 111,940,924
    L:CISM(per) -> 0, 0, 111,941,823
    L:RF3ISM(per) -> 18,997,136,177, 21,265,797, 67,653,849
When Bill sees the problem again, I'll look back at the output to see if I saw it too.
#2 Updated by Richard Neswold over 2 years ago
I modified the test program to ask for the one-shot data one at a time. The previous version set up both one-shot requests at the same time, which doesn't mimic what Bill's application is doing. This version waits for the first result before asking for the second. It will also retry the request up to 5 times if it gets an error status.
The test program has been restarted.
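For anyone wanting to reproduce the request pattern outside the Erlang client, the sequential read-with-retry flow described above might be sketched roughly as follows. This is an illustration in Python, not the actual test program: the `read` callable, the `DPM_PEND` placeholder value, and the reading of "retry up to 5 times" as one initial attempt plus five retries are all assumptions.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Placeholder status; the real DPM_PEND code is not given in this issue.
DPM_PEND = "DPM_PEND"
# Assumption: one initial attempt plus up to five retries.
MAX_RETRIES = 5

Reading = Tuple[int, int, int]  # (sum, beam cycle, all cycles)
# Hypothetical reader: returns (status, data); a status of None means success.
ReadFn = Callable[[str], Tuple[Optional[str], Optional[Reading]]]


def read_one_shot(read: ReadFn, device: str) -> Optional[Reading]:
    """Issue a one-shot read, retrying on any error status, then give up."""
    for _ in range(1 + MAX_RETRIES):
        status, data = read(device)
        if status is None:
            return data
    return None  # gave up; the caller moves on to the next device


def poll_once(read: ReadFn, devices: List[str]) -> Dict[str, Optional[Reading]]:
    """Read the devices strictly one at a time, in the order given."""
    results: Dict[str, Optional[Reading]] = {}
    for device in devices:
        # The next request is only issued after this one has finished.
        results[device] = read_one_shot(read, device)
    return results
```

Calling `poll_once(read, ["L:CISM", "L:RF3ISM"])` once a minute would mimic the order in which Bill's application asks for its data.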
#3 Updated by Richard Neswold over 2 years ago
Bill informed me that if his BBM application sees a problem, it doesn't necessarily mean I'm going to see it. He recommended I check for the error condition instead of waiting for him to tell me the error occurred and then seeing if the test program saw it too.
I added code to the test program to look for bad data and generate an easy-to-see message when it happens.
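The exact check used in the test program isn't recorded here. As a rough sketch of one possibility, the example in the description suggests a simple heuristic: the first device reports 0 for both "sum" and "beam cycle", while the second normally reports large non-zero values. The function names and the message format below are hypothetical.

```python
from typing import Optional, Tuple

Reading = Tuple[int, int, int]  # (sum, beam cycle, all cycles)


def looks_like_first_device(reading: Reading) -> bool:
    """Heuristic only: a zero sum and zero beam-cycle count matches the
    pattern seen for the first device in the example table."""
    total, beam_cycle, _all_cycles = reading
    return total == 0 and beam_cycle == 0


def check_second_device(device: str, reading: Reading) -> Optional[str]:
    """Return an easy-to-spot message if the reading looks misattributed."""
    if looks_like_first_device(reading):
        return f"*** BAD DATA *** {device} -> {reading}"
    return None
```

With the values from the example table, the (n - 1) reading for L:RF3INT (0, 0, 99,446,708) would be flagged, while the (n - 2) and n readings would not.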
#4 Updated by Richard Neswold over 2 years ago
We restarted Bill's test app to use Charlie's Java-based DPM. We've already seen the DPM_PEND/retry errors. We'll keep running to see if the "wrong data" error occurs.
So far this indicates the DPM_PEND problem is in CLIB.
#5 Updated by Richard Neswold over 2 years ago
- Category changed from Data Pool Manager to DPM CLIB Support
- Assignee changed from Richard Neswold to Charles King
We've seen incorrect data being returned using the new Java DPM code base. Charlie is looking into the CLIB support.
#6 Updated by Richard Neswold about 1 year ago
This issue is still open, Charlie.