Qiming presented a detailed introduction to the Message Analyzer application that he is developing for NOvA DAQ. Along with Qiming, the people involved in the discussion were Jim, Marc, Ron, Chris, and Kurt. This page has a compilation of the notes that were taken by various individuals.
Does everyone understand the task here and what they are to do?
Spies on message facility messages.
Filtering can also be done on hostname. One piece of filtering data seems to be missing: the instance name.
What is the format of the application field?
Name of the process (type). The module and context are not used. There may not be enough information to distinguish deployments where there are multiple applications on one node.
Here are the fields that are available. Not all of them are well populated for online applications.
// Get methods available on a received message object
bool empty() const;
ErrorObj ErrorObject() const;
// Timing
timeval timestamp() const;
std::string timestr() const;
// Classification
std::string severity() const;
std::string category() const;
// Origin
std::string hostname() const;
std::string hostaddr() const;
std::string process() const;
long pid() const;
std::string application() const;
std::string module() const;
std::string context() const;
// Source location and body
std::string file() const;
long line() const;
std::string message() const;
Need hard definitions for each of these fields.
Severity is level-based, not a discrete value.
Art is not setting the application and process fields, but message facility is probably doing this.
What is the format of the category field?
Filtering based on levels or enumerations?
Does not include analysis for messages that are *not seen* within a period of time.
Ron asked if the MA system should be able to evaluate how well it is responding and how well its timing is working.
Should the system recognize global system states? yes - and it does allow for specifying this.
Pattern recognition - once a condition becomes true, it cannot be made false. How do conditions get reset? They do not.
A really neat project for a student would be to take this system and implement it in Prolog.
Should we talk with IIT about this? What about Gene Cooperman?
User action functions: support arbitrary constant arguments to the function. The functor receives these arguments in a "parse_arguments" call for setting instance state variables. These constant parameters are not passed on the function call itself; only the incoming message is.
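A minimal sketch of what such a user-action functor might look like. Only the "parse_arguments" name and the split between setup-time constant arguments and per-message calls come from the discussion; the `Message` struct, the class name, and all signatures are assumptions for illustration.

```cpp
#include <string>
#include <vector>

// Hypothetical message type; the real one carries the fields listed above.
struct Message { std::string body; };

// Sketch of a user-action functor: constant arguments from the
// configuration are delivered once via parse_arguments() and stored as
// instance state; the call operator itself sees only the incoming message.
class CountAction {
public:
  // Receives the constant arguments (e.g. resource names) at setup time.
  void parse_arguments(const std::vector<std::string>& args) {
    resources_ = args;
  }

  // Invoked per message; uses only the message plus stored state.
  bool operator()(const Message& msg) {
    ++calls_;
    return !resources_.empty();
  }

  int calls() const { return calls_; }

private:
  std::vector<std::string> resources_;
  int calls_ = 0;
};
```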
Hooked up to the resource manager? Yes. Some of the functions (e.g. count_percent) are tied to this behavior because they make use of resource names in their arguments.
There is a global repository of the components that are alive in the system. Qiming calls these participants. This is only used by the count_percent function and nothing else.
The counts in the repository are set by hand via the GUI (fixed).
It knows about system messages directly (from resource manager and run control). * run control? * resource manager?
Should timeout constraints be available in the system?
Looks like something that could work well with EPICS and the alarm handler?
Looks like the tool is meant to be interactive. This is how resetting of rules (and alarms) could occur.
“group by” and valid time window (run?). Specification of enumerated values?
Qiming recommended state handling using a custom function, as in state==InHardwareConfig, or in
category==ConnectionFailure, where InHardwareConfig and ConnectionFailure are enumerated types,
that are defined in the fhicl configuration. This takes care of a bunch of issues.
Ron wants the system to be usable in a general, simple way, so category should not be an enumeration.
The question of filtering on category came up: should messages with an arbitrary category be allowed into the system if that category is not filtered on in the configuration file? I said that arbitrary categories should be allowed if '*' is used in the category filter. If category is used, then the category must be listed in the enum field.
What about a valid time window for alarms? What if the condition is automatically cleared (e.g. by rebooting a node)? They are talking about adding a "hold-off" case, where an alarm cannot fire again within a certain time after it has occurred.
How should events be reset automatically?
Should conditions be allowed to change state from true to false?
Should there be latched alarms, i.e. alarms that occurred and were not acknowledged, but went back to the normal state?
How long is data for rules valid?
How do we generate a series of failures? (I think we need failure situations documented).
What should the procedure be for documenting failure scenarios? * FMEA * How do they manifest in the system? * What is the fhicl MA encoding for them? * Can we generate the stream of these messages?
MA handles problems that are exceptional - outside of what is reasonably handled in standard error handling within an application.
“Real-time event correlation analysis tool”
- Pattern recognition
- Correlation analysis
Fed with MessageFacility log messages
“flexible and user-configurable”
MessageFacility log messages
- application [in art, this is module type + instance label]; process name from some applications, application name in others
- message body
1. filtering based on (severity, “application”, category); this application is not the same as the above! Category is intended to be categorical, not “arbitrary”. But the system doesn’t enforce this. Severity is level-based (filter “bad or higher”, not “exactly bad”)
2. matching (of message body against regex)
3. testing the frequency of message occurrence in a certain period of time
4. recognition pattern is called a “Condition”
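The three pattern-recognition stages above (severity filtering, body matching, frequency testing) can be sketched in C++. This is a hand-rolled illustration, not the MessageAnalyzer's actual code: the `MfMessage` struct, class name, and thresholds are all assumptions. Note that the severity filter is level-based ("bad or higher"), as discussed.

```cpp
#include <deque>
#include <regex>
#include <string>

// Hypothetical message carrying a subset of the metadata discussed above.
struct MfMessage {
  int severity;          // level-based: higher means worse
  std::string category;
  std::string body;
  long timestamp;        // seconds
};

// Sketch of the three pattern-recognition stages: (1) level-based
// severity filter, (2) regex match of the message body, (3) frequency
// test over a sliding time window.
class ConditionSketch {
public:
  ConditionSketch(int min_severity, std::string body_regex,
                  int count, long window_sec)
      : min_severity_(min_severity), re_(std::move(body_regex)),
        count_(count), window_(window_sec) {}

  // Returns true when the condition fires on this message.
  bool observe(const MfMessage& m) {
    if (m.severity < min_severity_) return false;          // (1) filter
    if (!std::regex_search(m.body, re_)) return false;     // (2) match
    hits_.push_back(m.timestamp);                          // (3) frequency
    while (!hits_.empty() && hits_.front() < m.timestamp - window_)
      hits_.pop_front();                                   // drop old hits
    return static_cast<int>(hits_.size()) >= count_;
  }

private:
  int min_severity_;
  std::regex re_;
  int count_;
  long window_;
  std::deque<long> hits_;
};
```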
MessageAnalyzer identifies events based on a single condition. Can also use logically correlated conditions (patterns) to identify events, e.g. DCM_Heartattack when either InHardwareConfig or DuringDataTaking:
Event_DCM_Heartattack := DCM_Heartattack && (InHardwareConfig || DuringDataTaking)
Currently, once a condition turns TRUE, it can not turn FALSE again (until everything is reset).
An event identification rule is called a “Rule”. The elements inside a “Rule” have to be (primitive) “Conditions”; they can’t be other “Rules”. Qiming thinks this might be relaxed, to allow “Rules” to be composed from other “Rules”.
In order to avoid multiplicity of specific conditions, we use wildcards or regular expressions to collapse similar conditions.
DCM_Heartattack := RunControl: dcm-?? missed heartbeats
But now we have a problem: we can't look for a *single* node that has missed a heartbeat.
To help solve this, introduce “Extended Conditions”
1. allows 1-n, n-1, n-m definitions
2. output is an array of boolean values (1 or 2 dimensional); primitive conditions return a single boolean value
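One way to picture the one-dimensional case: the condition keeps a boolean per source, and the set of sources grows as new participants appear in the message stream. This is purely an illustrative sketch; the class and method names are invented and do not reflect the MessageAnalyzer's real data structures.

```cpp
#include <set>
#include <string>

// Sketch of a 1-D "extended condition": instead of a single boolean,
// it tracks one boolean per source, and the set of known sources grows
// over time as new messages arrive.
class ExtendedCondition {
public:
  // Record that the condition fired for a given source (e.g. "dcm-01").
  void set(const std::string& source) { fired_.insert(source); }

  // Per-source boolean: has this node, individually, fired?
  bool at(const std::string& source) const {
    return fired_.count(source) != 0;
  }

  // Number of distinct sources that have fired so far.
  std::size_t count() const { return fired_.size(); }

private:
  std::set<std::string> fired_;
};
```

With this shape, "a single node has missed a heartbeat" becomes an element-wise query, while "N% of nodes have missed heartbeats" is an aggregate over the array.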
Formal language introduced to allow restricting the correlations, when wanted
There isn’t much sophistication or flexibility in handling the resetting of conditions; only coarse control exists.
What is the precise meaning of a “domain”? How does this relate to relational databases?
“source” is determined from metadata; “target” is parsed from the message body (using regular expressions)
It might be best to give && and || equal precedence; right now, && has the higher precedence. NOPE! C and C++ give && higher precedence than ||; so it makes sense to leave things this way. What about other languages? Do we care?
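The precedence point can be checked directly in C++ itself: `&&` binds tighter than `||`, so `a || b && c` parses as `a || (b && c)`, not `(a || b) && c`. A tiny demonstration:

```cpp
// In C and C++, && binds tighter than ||, so "a || b && c"
// is parsed as "a || (b && c)", not "(a || b) && c".
bool mixed(bool a, bool b, bool c)          { return a || b && c; }
bool and_binds_tighter(bool a, bool b, bool c) { return a || (b && c); }
bool other_grouping(bool a, bool b, bool c)    { return (a || b) && c; }
```

For a = true, b = false, c = false: `mixed` and `and_binds_tighter` both yield true, while `other_grouping` yields false, showing which grouping the parser chose.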
Participants can be grouped or ungrouped. “DCM01” is grouped; “Data logger” is not grouped.
Future development (what is needed for expanded use)
1. Specify all the necessary metadata to identify a message. We should think of both multi-process and multi-threaded and multi-module “programs”.
2. Have a firm definition for each metadata item.
3. If necessary, add more metadata fields to the MessageLogger
Will this package become part of our delivered software? Built with cetbuildtools, packaged and distributed with UPS, etc.?
One “rule engine” instance monitors one “system”, by definition. That is, the definition of the “system” is that for which a single rule engine is absorbing and analyzing messages. One does not configure a single rule engine to process what one wants to call two “systems”.
Kurt's notes

Based on message facility messages. Pattern recognition plus correlation analysis.
Pattern recognition is currently based on MF standard fields, but this could be abstracted out.
Pattern recognition is delivered by Filtering on predefined fields, Matching of the body text, and Testing of frequency.
What does "application" mean in the Filtering step?
- somewhat application-specific. But, we think that something less loose would be better.
- need to come up with a mapping of MF fields to process characteristics in both an online and an offline context.
- actually, we need the superset of online and offline fields since a reconstruction program running in an online environment has characteristics of both (multiple instances of the same application that we want to differentiate and modules inside the application)
Filtering uses different methods for different fields (e.g. severity greater than X, but regex for MF category)
Testing of frequency is based on the generated timestamp, not the receipt time of the messages.
Should the MsgAnalyzer test whether the timestamps on the inputs are reasonably close to "now"? (Configurable so that replaying of message is possible.)
Once conditions become true, they stay true until reset externally.
Can a rule be used as part of the definition of another rule?
Afterwards, a message is sent to Run Control.
Uses wildcards or regular expressions to collapse similar conditions.
The output of an extended condition is an array of boolean values with one or two dimensions. The array grows over time as new messages arrive.
Questions about C1.$t = C2.$s = bn03
How about storing the messages in a DB table in memory and letting users write SQL queries?
Labelled regex groups
operation that is missing: not
can't specify the universe
source and target have special, pre-defined meanings.
source is determined from the metadata, and the target is parsed out of the message body.
add a convenience method to FHICL.cpp to return a regex from a string
how to facilitate better use of category?
NOvA probably needs a "category fest" to clean up the categories.
at the end, it would be great to revisit resets or scrolling time windows - how do conditions become false again after becoming true?
parser written in Boost::Spirit
fix the parsing to make && and || equal precedence
--- lunch break ---
for tracking the expected number of sources, maybe we pass N as a message
when comparing COUNT >= 3, it would be nice to specify whether to alarm only once when N==3 or every time that N is 3 or larger
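The "alarm once at N==3, or every time N >= 3" choice is the classic edge-triggered vs. level-triggered distinction. A sketch of what the two behaviors would look like; the class and flag names are invented for illustration and are not the MsgAnalyzer's API.

```cpp
// Sketch of the COUNT >= threshold choice: in edge-triggered mode the
// alarm fires only when the count first crosses the threshold; in
// level-triggered mode it fires on every message while the count is at
// or above threshold.
class CountAlarm {
public:
  CountAlarm(int threshold, bool edge_triggered)
      : threshold_(threshold), edge_(edge_triggered) {}

  // Called once per matching message; returns true if we should alarm.
  bool increment() {
    ++count_;
    bool above = count_ >= threshold_;
    if (!edge_) return above;        // level-triggered: every time
    bool fire = above && !fired_;    // edge-triggered: first crossing only
    if (fire) fired_ = true;
    return fire;
  }

private:
  int threshold_;
  bool edge_;
  int count_ = 0;
  bool fired_ = false;
};
```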
currently, the processing happens each time that a message arrives
Qiming is considering adding a buffer between DDS and the MsgAna
The reset functionality in the MsgAna GUI resets everything.
--- move to Libra ---
What changes would be needed to support additional projects?
Chris' compilation of what we wrote on the whiteboards
1. LQCD (cluster monitoring).
2. NOvA DAQ (clearing of alarms).
NOvA DCS (actions -> external message).
3. Reconstruction farm.
1. Is messagefacility metadata sufficient?
2. What is the correspondence between domains and a relational DB?
3. Should the message analyzer feed into a more formal alarm handler?
Issues to be addressed.
1. Module name, app name, instance name.
2. GUI / rule engine separation (daemon running).
3. Handle high rate bursts of messages.
5. Connection with system state.
6. Timing constraints.
7. Setup and teardown functions in custom code. Revisit API between
FHICL and custom functions.
8. Custom functions in scripting languages.
9. Subclassing: hierarchical categories in metadata (not just single
10. Generic posting of alarms.
11. "NOT" operation in boolean expressions.
12. Define, "universe."
13. "Group By" or aggregate.
BN NOvA DAQ scenarios.
A. "Send would block." (5s).
B. R_C complains about lost heart beat (2s).
C. Data logger complains about missing data (2s).
D. "Socket disconnected" (1s).
1. BN, "crashes" -> all DCMs start yelling: (A).
2. BNEVB application crashes: (D).
3. Network interface goes down.
4. DDS daemon dies: (B).
5. DDS daemon hangs: (B).
6. DDS daemon memory use increases forever.
7. BNEVB app sees "too large" trigger rate.
8. Data driven trigger falling behind.
9. Data driven trigger rate outside spec.
10. Data driven trigger observes corrupt data.
11. Data Driven Trigger gave up because the analysis of an event was taking too long.