Global Filtered Topic (Suppressed Alarms) #10

Closed
slominskir opened this issue Feb 18, 2021 · 13 comments

Comments

@slominskir
Member

It would be useful to have a global mask topic that is a filtered version of the active-alarms topic, but filtered by global filter rules. A Kafka Streams app could do the filtering. A command topic would also likely be needed to instruct the Streams app what to filter. Currently each consumer (client) can do local filtering based on criteria such as category or location, but this applies only to the local client (not all clients).

@slominskir
Member Author

slominskir commented Feb 18, 2021

We may actually want a separate global topic per control room (as each control room may want to filter things out differently). For example, we may want a mask topic for each of the following control rooms:

CEBAF
LERF
CHL
UITF

Alternatively, each control room could get a separate instance of the alarm system (but with lots of overlapping/duplicate alarms).

@slominskir
Member Author

In progress: https://github.com/JeffersonLab/alarms-filter

A Streams app that provides the ability to configure a set of output topics, each with a custom filter applied to the active-alarms topic. This allows any consumer to share the filtered topic (as opposed to a consumer-local filter, which isn't shared).
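A minimal in-memory sketch of that idea, not the alarms-filter implementation: one filter predicate per output topic, each applied to the same set of active alarms. The topic names, the record fields (`name`, `location`, `category`) and the rules themselves are hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical record shape: {"name": ..., "location": ..., "category": ...}
Alarm = Dict[str, str]

# Hypothetical global filter rules, one per output topic (e.g. one per control room).
# A predicate returns True when the alarm should be kept in that topic.
FILTERS: Dict[str, Callable[[Alarm], bool]] = {
    "filtered-alarms-cebaf": lambda a: a["location"] != "LERF",
    "filtered-alarms-lerf": lambda a: a["location"] == "LERF",
}


def route(active_alarms: List[Alarm]) -> Dict[str, List[str]]:
    """Return, per output topic, the names of the alarms that pass its filter."""
    return {
        topic: [a["name"] for a in active_alarms if keep(a)]
        for topic, keep in FILTERS.items()
    }


active = [
    {"name": "alarm1", "location": "LERF", "category": "RF"},
    {"name": "alarm2", "location": "Hall A", "category": "Radcon"},
]
routed = route(active)
```

Every consumer reading `filtered-alarms-cebaf` would then see the same filtered view, which is the difference from a consumer-local filter.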

@theojlab

We might want to consider having an isMaskable attribute on alarms. A scenario I'm thinking of is something like the radcon tritium alarms that were in Hall A a while back. Michele had to put them into "SITE" so that they wouldn't be disabled when Hall A wasn't running, because they had to be armed all the time, not just when Hall A was taking beam. Maybe we can come up with other options, but the ones I see now are: 1) put things "out-of-place" to avoid them being masked, or 2) allow items to be excluded from the facility mask.
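Option 2 could look roughly like the sketch below. It assumes a per-alarm boolean (`is_maskable`, hypothetical, defaulting to maskable) and a facility mask expressed as a set of locations; the alarm names are made up.

```python
from dataclasses import dataclass
from typing import Set


@dataclass(frozen=True)
class Alarm:
    name: str
    location: str
    is_maskable: bool = True  # hypothetical attribute; most alarms stay maskable


def is_masked(alarm: Alarm, masked_locations: Set[str]) -> bool:
    """An alarm is hidden only if its location is masked AND it allows masking."""
    return alarm.is_maskable and alarm.location in masked_locations


# Hypothetical alarms: the tritium alarm must stay armed even when Hall A is off.
tritium = Alarm("halla-tritium", "Hall A", is_maskable=False)
bpm = Alarm("halla-bpm", "Hall A")
masked = {"Hall A"}
```

With this, the tritium alarm can stay registered under Hall A instead of being moved to "SITE".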

@slominskir changed the title from "Global Mask Topic" to "Global Mask Topic (Disabled Alarms)" on Feb 23, 2021
@slominskir
Member Author

slominskir commented Mar 15, 2021

Terminology note:

ANSI/ISA 18.2-2016 defines three alarm states in which an alarm is "turned off"

  • Suppressed (by design): "prevent the annunciation of the alarm to the operator when the alarm is active"; "EXAMPLE: shelve, suppress by design, remove from service"
  • Shelved: "temporarily suppress an alarm, initiated by the operator, with engineering controls (e.g., time-limited) that unsuppress the alarm"
  • Out-of-Service: "state of an alarm during which the alarm indication is indefinitely suppressed, typically manually, for reasons such as maintenance"

I believe we use the term shelve and out-of-service as intended by the standard, but we casually use many names for "suppressed by design" including:

  • Filtered
  • Masked
  • Disabled
  • Off

This is likely because the word "suppressed" is fairly generic and in fact occurs in the definitions of shelved and out-of-service (the distinction there is that shelved is temporary / time limited whereas out-of-service is indefinite). The third category, "suppressed by design", is a sort of catch-all that we're using for the scenario where a portion of the machine is turned off because we're not using it, so we don't want to see alarms from that part of the machine. We may need to clarify which terms we are going to use.

I believe the distinction on who does the "turning off" isn't very useful as it could always be an operator.

Both Out-of-Service and our "turn off a portion of the machine" use case are indefinite, so they could actually be one and the same, and we could consolidate to just two distinct off states (indefinite vs temporary/timed/shelved). However, it might be useful to make a distinction between out-of-service (for maintenance) and "turned-off" (not needed for program). The distinction is being able to see what is broken vs what is just not needed, though "not needed" is a superset of "broken" as it could mean anything: too many broken items, not enough money in the budget, not compatible with current machine configuration, undergoing an upgrade, etc.

Another factor that could be used to make a distinction is whether you can easily turn off an entire group of alarms all-at-once vs one-at-a-time. It's tempting to say that is another difference between out-of-service and "Turn off", but it is possible in the future users would like to use wildcard expressions / grouping filters to select any suppression action.

@slominskir
Member Author

slominskir commented Mar 15, 2021

Note: it is possible for an alarm to be in two states at once: (1) flagged as out-of-service (broken) OR shelved, (2) filtered out of view because it is located in a portion of the machine that isn't part of the program

It is fine that these combinations are possible, even desirable as what is part of the program probably should be tracked separately from what is broken (and definitely separately from what is shelved).

Depending on how we handle out-of-service, an alarm could actually be in all three states at once (which is fine). Separating what is broken from what is shelved is not as critical, but not a problem either (unless ops abuses it in lieu of creating temporary shelved items). Since an alarm should rarely be out for maintenance, we could add a boolean to the registered-alarms topic. Alternatively, we currently have an indefinite option on shelving, which could be used as "out-of-service". The GUI calls this disabled now. If we take the definition of shelved to mean only timed suppressions, we could rename the shelved-alarms topic to suppressed-alarms to clarify that both indefinite and timed suppressed alarms are captured there. Or we could just document clearly that we use an indefinite shelving to mean out-of-service.

@slominskir
Member Author

slominskir commented Mar 15, 2021

I forgot to mention "suppressed by", which is a reference to a parent alarm that suppresses a given alarm in a hierarchy. I also left out on-delays. Given these five suppression modes, how about we use the following definitions:

Alarm Suppression States

| Precedence | Name | Duration | Definition |
|---|---|---|---|
| 1 | Disabled | Indefinite | A broken alarm can be flagged as out-of-service |
| 2 | Filtered | Indefinite | An alarm can be "suppressed by design" - generally a group of alarms are filtered out when not needed for the current machine program |
| 3 | Masked | Only while parent alarm is active | An alarm can be suppressed by a parent alarm to minimize confusion during an alarm flood and build an alarm hierarchy |
| 4 | Delayed | Short with expiration | An alarm with an on-delay is temporarily suppressed to minimize fleeting/chattering |
| 5 | Shelved | Short with expiration | A nuisance alarm can be temporarily shelved with a short expiration date |

@slominskir
Member Author

We need to determine whether we must track each of these alarm suppression states independently (or should they be mutually exclusive?). They may overlap in all sorts of ways, and at each transition the correct effective suppression state must be determined, ideally without coupling the various mechanisms that suppress alarms. There is a calculated "effective" suppression state, but it might be best to calculate it by combining the independent states to avoid confusion and provide a clear audit trail.

If any item on the list changes then all suppression rules must be re-applied in order of precedence (starting at 1). For example if an alarm is removed from a filter all the other suppression states must be re-evaluated, meaning we need to store state information about what other rules would have been in effect if the filter hadn't been in effect - such as continuous shelving.
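One way to picture this, as a sketch only: keep every suppression reason that currently applies to an alarm as an independent entry, and derive the effective state as the entry with the lowest precedence number. The enum values mirror the precedence list above; the `effective` helper and the set-based storage are assumptions, not an agreed design.

```python
from enum import IntEnum
from typing import Optional, Set


class Suppression(IntEnum):
    """Suppression states, numbered by precedence (1 wins over 5)."""
    DISABLED = 1
    FILTERED = 2
    MASKED = 3
    DELAYED = 4
    SHELVED = 5


def effective(states: Set[Suppression]) -> Optional[Suppression]:
    """Highest-precedence (lowest-numbered) state, or None if unsuppressed."""
    return min(states) if states else None


# An alarm that is both filtered and continuously shelved.
states = {Suppression.FILTERED, Suppression.SHELVED}
before = effective(states)            # the filter takes precedence

# Removing the alarm from the filter re-exposes the shelving underneath.
states.discard(Suppression.FILTERED)
after = effective(states)
```

Because the shelved entry was stored independently, nothing has to be reconstructed when the filter goes away; the effective state is simply recomputed.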

It is possible to create a "suppressed-alarms" topic with a message key made up of both the alarm name and the suppression reason, so that we can store all of the possible suppression states in one topic. We could create a new topic alarm_state that stores the calculated effective state (it could even compute the final effective state considering acknowledgements and active-alarms too). Or clients can do the calculation. If we have a "suppressed-alarms" topic, our current filter-app prototype will need to change, as it writes to a separate topic.
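A rough sketch of the composite key, using a plain dict to stand in for the materialized view of a compacted topic. It assumes a record with a `None` value acts as a tombstone that clears that one (alarm, reason) pair; the record values shown are hypothetical placeholders.

```python
from typing import Dict, Optional, Set, Tuple

Key = Tuple[str, str]  # (alarm name, suppression reason)


def apply(table: Dict[Key, dict], key: Key, value: Optional[dict]) -> None:
    """Apply one record the way a compacted topic view would: None is a tombstone."""
    if value is None:
        table.pop(key, None)
    else:
        table[key] = value


def reasons(table: Dict[Key, dict], name: str) -> Set[str]:
    """All suppression reasons currently recorded for one alarm."""
    return {reason for (alarm, reason) in table if alarm == name}


table: Dict[Key, dict] = {}
apply(table, ("alarm1", "Shelved"), {"expiration": 1615838400})  # hypothetical value
apply(table, ("alarm1", "Filtered"), {"filter": "lerf-off"})     # hypothetical value
apply(table, ("alarm1", "Filtered"), None)                       # filter removed
```

Keying on both fields means each suppression mechanism can write and clear its own entry without touching the others.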

@slominskir changed the title from "Global Mask Topic (Disabled Alarms)" to "Global Mask Topic (Suppressed Alarms)" on Mar 15, 2021
@slominskir changed the title from "Global Mask Topic (Suppressed Alarms)" to "Global Filtered Topic (Suppressed Alarms)" on Mar 15, 2021
@slominskir
Member Author

slominskir commented Mar 15, 2021

Filter flow might look like:

Source -> [active-alarms] -> disabled-app -> [non-disabled-alarms] -> filter-app -> [non-filtered-alarms] -> mask-app -> [non-masked-alarms] -> delay-app -> [non-delayed-alarms] -> state-calculator -> [operator-alarms]

Each suppression app would update suppressed-alarms topic

  • state-calculator computes effective state
    • honors acknowledgements
    • honors suppression precedence
    • honors active state
    • honors registered alarms (optional) - do you want to see state "Normal" alarms listed?

States:

NORMAL <-> ABNORMAL (active)

ACKNOWLEDGED <-> UNACKNOWLEDGED

SUPPRESSED <-> UNSUPPRESSED

  • plus combinations and variants of suppressed

One app might be able to do the whole process?
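As a sketch of the single-app variant, the chain could be a list of stages evaluated in order, where the first stage that matches records its reason and everything left over reaches the operator. The stage predicates and record fields (`disabled`, `location`, `parent_active`, `within_on_delay`) are hypothetical, and acknowledgements are left out for brevity.

```python
from typing import Callable, Dict, List, Tuple

Alarm = Dict[str, object]
# (suppression reason, predicate returning True when the stage suppresses the alarm)
Stage = Tuple[str, Callable[[Alarm], bool]]

# Same order as the flow above: disabled -> filter -> mask -> delay.
STAGES: List[Stage] = [
    ("Disabled", lambda a: bool(a.get("disabled"))),
    ("Filtered", lambda a: a.get("location") in {"LERF"}),  # hypothetical rule
    ("Masked", lambda a: bool(a.get("parent_active"))),
    ("Delayed", lambda a: bool(a.get("within_on_delay"))),
]


def run(active_alarms: List[Alarm]) -> Tuple[List[str], Dict[str, str]]:
    """Return (names for operator-alarms, {name: reason} for suppressed-alarms)."""
    operator_alarms: List[str] = []
    suppressed: Dict[str, str] = {}
    for alarm in active_alarms:
        for reason, suppresses in STAGES:
            if suppresses(alarm):
                suppressed[str(alarm["name"])] = reason
                break
        else:  # no stage suppressed the alarm
            operator_alarms.append(str(alarm["name"]))
    return operator_alarms, suppressed


alarms = [
    {"name": "a1", "disabled": True},
    {"name": "a2", "location": "LERF"},
    {"name": "a3", "location": "CEBAF"},
]
operator_alarms, suppressed = run(alarms)
```

Adding a new suppression type then means adding one entry to `STAGES` rather than one more app and topic.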

@michelejoyce

I think you're on the right track... layering the various "suppressions". That way, if any other type comes up, it would be easier to add.

Can this also apply to "calc/rules based alarms?" Isn't it just another form of alarm/don't alarm?

I think your definitions are good.

@slominskir
Member Author

Yeah, CALC alarms would likely go in that flow as well - probably before the disabled-app. The delay-app would likely handle off-delays too (not just on-delays).

@michelejoyce

Layering these things makes more and more sense...
Will it be onerous?

@michelejoyce

Especially since we're not only talking about global suppression, but instead individual and groups of alarms...

@slominskir
Member Author

slominskir commented Mar 17, 2021

If we create a bunch of separate apps it certainly is flexible, but it'll be costly and unwieldy as well (lots of moving parts and lots of duplicate work). It might make more sense to divide suppression in general into two pieces:

  1. Suppression Rule Processor - App responsible for keeping the suppressed-alarms topic up-to-date based on registered-alarms and active-alarms and filter-commands topics
  2. Active Alarm Suppressor - App responsible for actually creating a new output topic with suppressed active-alarms

Probably consolidate all of this into the alarm-filters project and rename it alarm-suppressors instead. See: JeffersonLab/alarms-filter#1


3 participants