June 27th

Date:
06/27/2017

Attendees:
Anna M., Brandon W., Brian Y., Marc M., Margherita V.W., Steve W., Vladimir P., Robert I., Joe B., Felipe H.

Agenda:
- FIFE workshop 2017 feedback
- AOB

Discussion:

1. FIFE workshop 2017 feedback

ANNA:
Two presentations were given, and the experiments also gave feedback.
Alex gave feedback for NOvA and opened a ticket (RITM0575905) to keep track of it.
The issues faced by NOvA, which has much larger datasets, are not experienced by other experiments such as g-2 and ProtoDUNE (MCC9).
They have been using POMS and they are pleased with it.

  1. Proxy errors are happening more and more often when opening the Campaign page, and the page
    generally takes a long time to load even when it succeeds.
  1. There is a misunderstanding about the numbers presented for jobs completed and located; Alex
    is unsure how to interpret them.
    We are also still getting mismatched numbers; Marc found a bug related to the number of located jobs
    and is working on it.

Proxy error: Robert suggests that the problem lies in the POMS configuration and is related to the
Cheaper subsystem on the server side.

BRIAN:
Would it be possible to change the query producing the datasets?

MARC:
We should also consider caching, given the amount of data that is always returned.

STEVE:
Maybe the simplest thing would be to put a cache between POMS and SAM.
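
A cache of the kind Steve describes could be sketched roughly as below. This is a minimal illustration under stated assumptions, not POMS code: the class name `TTLCache`, the `get_or_fetch` method, and the idea of passing in a `fetch` callable that performs the expensive SAM query are all hypothetical names introduced here. Entries simply expire after a fixed time, so a page reload within that window does not trigger another query.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds.

    Hypothetical helper for illustration only; not part of POMS or SAM.
    """

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl_seconds = ttl_seconds
        self._clock = clock  # injectable so the expiry logic can be tested
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def get_or_fetch(self, key: Hashable, fetch: Callable[[], Any]) -> Any:
        """Return the cached value for key, calling fetch() only if the
        entry is missing or older than ttl_seconds."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self.ttl_seconds:
            return entry[1]
        value = fetch()  # the expensive call, e.g. a query to SAM
        self._entries[key] = (now, value)
        return value
```

The trade-off is staleness: with a large time-to-live the campaign page loads quickly but may show counts that are a few minutes old, which would need to be acceptable for datasets that keep growing.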

MARC:
Maybe we could add an AJAX callback and update the data as it comes in.

ANNA:
Whichever implementation we choose, we need to find a solution. So far g-2 and ProtoDUNE are not running
into the problems NOvA has, since they do not have huge datasets; they have the "keepup" process, which
keeps adding files.

STEVE:
nginx would help because of the ability to add caching.

ANNA:
So, as far as releases go, we agreed to delay the next release and instead have a bug-fix release focused on the
proxy error problem. Vladimir and Marc will look at this, and we will also try Robert's suggestion to
turn off "Cheaper".
We will also check the labels for jobs completed and located, which are causing
confusion for Alex.
As for reducing the information on the campaign page, we should not do that, since POMS actually
provides information on both job status and file status.

MARC:
Maybe we could have a cron job to check on file stats.

ANNA:
Let's have a plan for the fixes: there are some ideas on what to look into, so we should try different
implementations, compare them, and see which one works best.
Then we will focus on how to present the page with the active campaigns.
Another discussion with NOvA is about what is considered a "successful" job: for Condor a job can be successful even if
no output is returned, but that is not a success for the experiment.

BRIAN:
Alex said that, besides POMS, he looks at other sources such as Grafana.

ANNA:
Fifemon and the SAM station provide different information: jobs and files, respectively. Our goal with POMS is to give both types in a centralized way. As Alex stated: "We have an
ambitious goal."

BRIAN:
I heard from Alex that they want to use campaign information to see what is missing.

ANNA:
They can't compare Fifemon and POMS. POMS is a workflow management system; Fifemon is a monitoring tool that provides stats.

JOE:
Fifemon can be tagged by campaign but will not provide file info.

ANNA:
As we said in the past, we need to put effort into providing correct stats: if we don't have the correct info, then
it's better to say N/A than to show the wrong number.
But the main concern is the proxy problem.
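
The "N/A instead of a wrong number" rule mentioned above could look something like the following when rendering a count. This is only a sketch: `format_count` is a hypothetical helper, and treating `None` or a negative value as "unknown" is an assumption made for illustration, not how POMS currently represents missing stats.

```python
from typing import Optional


def format_count(value: Optional[int]) -> str:
    """Render a job or file count for display.

    Returns "N/A" when the value is unknown (None) or clearly invalid
    (negative), rather than displaying a misleading number.
    """
    if value is None or value < 0:
        return "N/A"
    return str(value)
```

A genuine zero is still shown as "0", so "no located files yet" stays distinguishable from "we could not determine the count".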

MARC:
It would be good if we could get access to the admin Apache logs to see how many proxy errors there are and what is going on.
Sometimes we get a proxy error not just from the main page, but it goes away on reloading.

ANNA:
Let's try the different potential approaches to investigating the proxy error and see what happens.

Action Items:
Marc and Vladimir will investigate possible solutions.