Thresholds

Business service monitoring helps customers keep watch over mission-critical service delivery infrastructure and manage it to avoid interruptions in the services offered to the business.

Events and alarms are among the critical elements of monitoring. The fault management module deals mainly with configuring alarm thresholds, generating alarms and taking the necessary action when a threshold is breached. A threshold breach occurs when the current monitored value falls outside the range specified by the threshold. Alarm thresholds can be configured globally, in which case they apply to all devices, or for individual parameters through the provided dashboards.
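The breach rule described above can be sketched in a few lines of Python. This is an illustration only: the function name `is_breach` and the `low`/`high` range arguments are assumptions made for the example and are not part of the product.

```python
def is_breach(value: float, low: float, high: float) -> bool:
    """Return True when the monitored value lies outside the range [low, high]."""
    return value < low or value > high


# Hypothetical example: CPU utilization is expected to stay between 0 and 85 percent.
print(is_breach(92.0, 0.0, 85.0))  # True
print(is_breach(40.0, 0.0, 85.0))  # False
```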

The fault management system generates alarms of different severity levels. Notification Profiles let you define which users or managers are notified so that the problem can be resolved quickly.

Thresholds

Thresholds are the key parameters that determine whether a collected monitoring parameter is in an alarm state. Thresholds can be defined for any of the following collected parameters:

Global Thresholds

Click the 'Settings' tab. In the 'Fault and Notifications' section, click 'Thresholds'. The global thresholds defined in the system are listed.

Add: Click here to define thresholds. If you try to define a threshold for a parameter that already has a global threshold, the existing values are populated and you can edit them.

Activate: Select a threshold and click this button to activate the built-in threshold.

Deactivate: Select a threshold and click this button to deactivate the built-in threshold.

Delete: Select a threshold and click this button to delete the threshold definition from the system.

To define a new threshold, click 'Add'.

Add Global Threshold

Sr. No	Field Name	Description
1	Type	The type of the monitor for which the threshold is to be configured.
2	Resource	Displays the available resources of the chosen type. For example, if 'Performance Monitoring' is the chosen type, this field includes values such as Disk, CPU, Interface, Memory, Disk IO and Paging.
3	Parameter	Displays the available statistics for the selected resource. Select the statistic for which the threshold is to be defined.
	Formula
4	Should Not Exceed	Select this radio button if the collected value should not exceed the defined threshold value.
5	Should Not Lag	Select this radio button if the collected value should not lag (fall below) the defined threshold value.
6	Info	Define the threshold value. If the actual collected value breaches (exceeds or lags) the threshold value, an Info alarm is generated. Select the notification profile to which the fault is to be notified.
7	Warning	Define the threshold value. If the actual collected value breaches (exceeds or lags) the threshold value, a Warning alarm is generated. Select the notification profile to which the fault is to be notified.
8	Error	Define the threshold value. If the actual collected value breaches (exceeds or lags) the threshold value, an Error alarm is generated. Select the notification profile to which the fault is to be notified.
9	Notify To	Select the notification profile appropriate to the error condition. The notification mechanism works as per the parameters defined in the profile (for more information on notification profiles, refer to Notification Profiles).
10	Generate ticket	If the threshold violation has to be logged into the service desk, select the appropriate profile. The service requests are logged into the service desk as per the parameters defined in the profile (for more information on generate ticket profiles, refer to the section Alarm to Record). Note: Different profiles can be selected for different types of alerts.
11	Tolerance Count	The tolerance count to apply whenever a threshold breach happens. This prevents occasional irrelevant spikes in the monitored parameter from being classified as alarms. No alarm is generated if the breach happens within the tolerance count range. 0 minute represents infinite.
12	Suppress consecutive alerts	If a consecutive alarm is generated within the specified time period, the alarm is suppressed. 0 minute means that alarms are suppressed indefinitely until the active alarm is resolved. For example, if this option is not checked and the MySQL database is not available, an alarm is generated for each poll value until the MySQL database is available again.
13	Activate	If this is checked, the threshold is activated.
14	Update	Saves the global threshold.
15	Cancel	Cancels the definition process. The threshold is not saved in the system.
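The interplay of the 'Should Not Exceed' / 'Should Not Lag' option with the Info, Warning and Error values can be illustrated with a short Python sketch. It assumes that the most severe breached level wins and that a value equal to the threshold is not a breach; both are assumptions for the example, as are the function and argument names.

```python
from typing import Optional


def classify(value: float, info: float, warning: float, error: float,
             should_not_exceed: bool = True) -> Optional[str]:
    """Return the most severe level whose threshold is breached, or None."""
    # Check from most to least severe so the highest severity wins (assumed).
    for name, limit in (("Error", error), ("Warning", warning), ("Info", info)):
        breached = value > limit if should_not_exceed else value < limit
        if breached:
            return name
    return None


# 'Should Not Exceed': disk utilization in percent (hypothetical values).
print(classify(91, info=70, warning=80, error=90))  # Error
# 'Should Not Lag': free memory in percent (hypothetical values).
print(classify(15, info=30, warning=20, error=10, should_not_exceed=False))  # Warning
```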

System Thresholds

The System Thresholds tab provides the list of thresholds defined for the system resources and the notifications generated against them. There are two types of system thresholds: instance level and system level.

1. Instance Level Threshold

When a threshold is configured for an instance of the device, it is termed an instance level threshold. The following example shows how to configure an instance level threshold.

1. Click Performance > Enterprise List View > Desktop

2. Click List View and click a particular node

3. Click 'Disk' and in the 'Instances' drop down, select 'C:'

4. Click the threshold icon for disk space utilization

5. Configure the threshold as shown below and click 'Save'

Notes: Tolerance Count - Specify a tolerance count, if any, so that occasional irrelevant spikes in the monitored parameter are not classified as alarms. No alarm is generated if the breaches happen within the tolerance count range. 0 minute represents infinite.

Suppress Consecutive Alerts - If a consecutive alarm is generated within the specified time period, the alarm is suppressed. 0 minute means that alarms are suppressed indefinitely until the active alarm is resolved.
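One way to picture how the tolerance count and the suppression window interact is the Python sketch below. It is a simplified model, not the product's implementation: it assumes that the tolerance count is the number of consecutive breached polls tolerated before an alarm is raised, that time is measured in minutes, and that a poll back in range resolves the active alarm. The special meaning of 0 for the tolerance count is not modelled. The class and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlarmGate:
    tolerance_count: int   # consecutive breaches tolerated before alarming (assumed meaning)
    suppress_minutes: int  # 0 = keep suppressing until the active alarm is resolved
    streak: int = 0
    last_alarm_at: Optional[float] = None  # time of the last raised alarm, in minutes

    def on_poll(self, breached: bool, now_min: float) -> bool:
        """Return True if this poll should raise an alarm."""
        if not breached:
            # Value is back in range: treat the active alarm as resolved.
            self.streak = 0
            self.last_alarm_at = None
            return False
        self.streak += 1
        if self.streak <= self.tolerance_count:
            return False  # still within the tolerance count
        if self.last_alarm_at is not None:
            elapsed = now_min - self.last_alarm_at
            if self.suppress_minutes == 0 or elapsed < self.suppress_minutes:
                return False  # consecutive alarm suppressed
        self.last_alarm_at = now_min
        return True


# Six breached polls, five minutes apart.
gate = AlarmGate(tolerance_count=2, suppress_minutes=10)
raised = [gate.on_poll(True, t) for t in (0, 5, 10, 15, 20, 25)]
print(raised)  # [False, False, True, False, True, False]
```

In this model the first two breaches are absorbed by the tolerance count, the third raises an alarm, and later breaches raise a new alarm only once the 10-minute suppression window has elapsed.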

2. System Level Threshold

When a threshold is configured at the host level or for all instances of the device, it is termed a system level threshold.

1. Click Performance > Enterprise List View > Desktop

2. Click List View and click a particular node

3. Click 'Disk' and in the 'Instances' drop down, select 'All'

4. Click the threshold icon for disk space utilization

5. Configure the threshold and click 'Save'

Notes: 1. Alarms are generated with first priority at the instance level, followed by the system level and then the global level.

2. You can search for a system threshold based on threshold caption, host/resource name, parameter or instance.
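The priority order in note 1 amounts to a simple fallback lookup. The Python sketch below shows the idea under the assumption that an undefined threshold is represented as `None`; the function name is hypothetical.

```python
from typing import Optional


def effective_threshold(instance: Optional[float],
                        system: Optional[float],
                        global_: Optional[float]) -> Optional[float]:
    """Pick the threshold to apply: instance level first, then system, then global."""
    for candidate in (instance, system, global_):
        if candidate is not None:
            return candidate
    return None


# Hypothetical values: no instance level threshold for this disk, so the
# system level value (80) takes priority over the global one (90).
print(effective_threshold(None, 80.0, 90.0))  # 80.0
```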

To view the configured system thresholds, click 'Settings'. In the 'Fault and Notifications' section, click 'Thresholds', then click 'System Thresholds'. This screen lists all the system level and instance level thresholds that are currently defined.

Capacity Planning Thresholds

Here you can define global thresholds related to capacity planning.

Click the 'Settings' tab. In the 'Fault and Notifications' section, click 'Thresholds'.
Click 'Capacity Planning Thresholds'. All the capacity planning thresholds defined are listed here. Click 'Add' to add the global threshold for capacity planning as shown below.

Dynamic Thresholds

SapphireIMS allows multiple levels of alarm thresholds for each monitoring parameter so that alarm severity can be categorized as Information, Warning or Error. This helps the operator to prioritize and decide on the response and resolution plans.

Defining static thresholds is necessary for alerting, but relying purely on predetermined static thresholds makes it easy to miss subtle changes in behavior. Such changes often foreshadow a potential service disruption. In addition, a static threshold may not take business conditions into account, which can result in false alarms and, in turn, cause a real disruption to be missed.

This creates the need for dynamic baseline thresholds that are based on empirical data corresponding to business events and conditions, rather than static thresholds. The system should be able to read the empirical data and adjust its thresholds dynamically. This improves the alerting mechanism of performance monitoring and eliminates the maintenance overhead of manually evaluating peaks and configuring thresholds for them over time.

With the threshold baseline variation report, the operator can also see the changes that occur on a device and take proactive measures to resolve them.

Click the 'Settings' tab. In the 'Fault and Notifications' section, click 'Thresholds'. Click 'Dynamic Thresholds'. This screen lists all the dynamic thresholds that are currently defined.

Click 'Add' to add a new dynamic threshold.

Notes: 1. A static threshold has to be configured for a performance metric before a dynamic threshold can be configured for the same metric.

2. Dynamic threshold configuration is applicable to system performance, synthetic transaction monitoring and application monitoring performance metrics.

Select the resource category for which the dynamic threshold has to be configured.
Select the corresponding device categories

Select the resource type and corresponding resource parameter and click 'Add'. Similarly multiple resource parameters can be selected.

Click 'Next'

Select the host filter and based on that, select the hosts which have to be configured for dynamic thresholds

Enter the number of days of performance data to be considered before baselining the threshold.
Enter the baseline percentage deviation at which the alarm of the corresponding severity is generated for the selected performance metric.
Check 'Enable Notification Changes' to be notified when the baseline deviates according to the specified number of days and percentage deviation. Select the notification profile and click 'Next'.
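The relationship between the look-back period and the percentage deviation can be illustrated with a small Python sketch. The document does not state how the baseline is computed, so the sketch assumes a plain average of the samples collected over the chosen number of days; the function name, the sample values and the per-severity percentages are all hypothetical.

```python
from statistics import mean
from typing import Dict, List, Tuple


def dynamic_thresholds(samples: List[float],
                       deviation_pct: Dict[str, float]) -> Tuple[float, Dict[str, float]]:
    """Return (baseline, thresholds), with one threshold per severity level.

    The baseline is assumed to be the mean of the samples; each threshold is
    the baseline raised by the configured percentage deviation.
    """
    baseline = mean(samples)
    thresholds = {severity: baseline + baseline * pct / 100
                  for severity, pct in deviation_pct.items()}
    return baseline, thresholds


# Hypothetical CPU utilization samples collected over the look-back period.
baseline, thresholds = dynamic_thresholds(
    [40, 50, 60], {"Info": 10, "Warning": 20, "Error": 40})
print(baseline, thresholds)
```

With these inputs the baseline is 50, and the thresholds sit 10, 20 and 40 percent above it.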

Enter the profile name and description of the dynamic threshold configuration
Select the time zone to be considered for the threshold generation
Select the business event filter and the corresponding business event. The business events listed will be based on the selected 'Time Zone' filter and 'Business Event' filter. The threshold generation will take into consideration the time interval specified in the business event selected.
Check 'Freeze Baseline' to freeze the baseline once it is calculated. There will not be any further baseline calculation performed on this rule until 'Freeze Baseline' option is unchecked.
Click 'Save' to save the threshold configuration. Once saved, it will be listed in the screen as shown below along with the baseline status and threshold status.

Check any of the baseline profiles and click 'Freeze Baseline' to freeze the baseline and click 'Unfreeze Baseline' to unfreeze the baseline for the selected rule.
Click 'Data Screen' to view the thresholds for the individual devices or resources

Alarm Correlation

Alarms are of two types:

1. Alarms generated by SapphireIMS monitoring stack based on the threshold breach

2. Alarms/events generated by a third party source and forwarded to SapphireIMS (the standard events supported are syslog, SNMP traps and event logs)

Key properties of the alarms (internally generated, or received from a third party and converted into an internal format) are Severity, Description, Creation-time, Current Status and Hostname/Resource (where the alarm occurred).
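As an illustration, the key properties listed above map naturally onto a small record type. The Python sketch below is a hypothetical representation: the field names and sample values are assumptions and do not reflect the internal format used by SapphireIMS.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Alarm:
    severity: str            # e.g. "INFO", "WARNING" or "ERROR"
    description: str
    creation_time: datetime
    current_status: str      # e.g. "ACTIVE" (assumed value)
    resource: str            # hostname or resource where the alarm occurred


# Hypothetical alarm as it might look after a syslog event has been converted.
alarm = Alarm(
    severity="ERROR",
    description="MySQL service unavailable",
    creation_time=datetime(2024, 1, 15, 9, 30),
    current_status="ACTIVE",
    resource="db-server-01",
)
print(alarm.severity, alarm.resource)
```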

SapphireIMS allows customers to subscribe to alarms of interest and be notified through e-mail or SMS, trigger an automatic action, or raise an incident that can be tracked to closure.

Some of the common challenges faced are:

1. Too many non-informative alarms are displayed to the operators, who at times end up ignoring critical alarms.

2. Customers have stand-by or redundant infrastructure for mission-critical services. Key stakeholders are interested in receiving alarms only when both infrastructures are down. Otherwise, they want the alarms to be sent only to the operational teams for routine actions and service restoration.

3. Services are delivered over distributed infrastructure consisting of physical or virtual environments. Customers are interested in monitoring the impact on the service rather than just receiving individual component level alarms. In other words, can the alarms carry a message of service impact rather than only component impact?

4. Customers want to be alerted only if the problem persists for a period of time, and do not want to be notified of transient conditions.

5. Some devices or applications may show unusual behavior for a moment, for example a CPU spike, packet drops on the internet leased line, or a link failure. The device performing the task generates an alert instantly (for example, a syslog message on a link flap). Can the alarm monitoring system analyze such alerts and generate meaningful alarms only if the problem persists?

All the above challenges bring out the need to intelligently analyze and operate on the alarms before presenting them to the operators for action. Alarm correlation as a feature can address these challenges.

Click the 'Settings' tab. In the 'Fault and Notifications' section, click 'Thresholds'.
Click 'Alarm Correlation'

Generic Rules

These rules can be configured by the user for specific situations when alarms are generated.

1. Enable the correlation for the generic rule to take effect. When it is enabled, all alarms are displayed in the transient alarm list and are then acted upon by the correlation engine or rule. Otherwise, all alarms are displayed in the alarm list view without being passed through the correlation engine.

2. If 'All Pass Filter' is selected, the alarms are not filtered, i.e. they are passed through the correlation rule even if they do not match any input filters.

3. If 'No Pass Filter' is selected, a transient alarm that does not match any of the configured input alarm filters is dropped in the transient alarm list, i.e. if the input alarm does not match any of the correlation rules, the default action is selected.
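The difference between the two filter modes can be sketched as a small routing function. This is one reading of the options above, not the product's logic: the mode names, the returned labels and the representation of filters as Python callables are all assumptions made for the example.

```python
from typing import Callable, Dict, List

Alarm = Dict[str, str]


def route(alarm: Alarm, input_filters: List[Callable[[Alarm], bool]], mode: str) -> str:
    """Decide what happens to a transient alarm.

    mode is "all_pass" or "no_pass" (hypothetical names for the two options).
    """
    if any(matches(alarm) for matches in input_filters):
        return "correlate"      # alarm qualifies for a correlation rule
    if mode == "all_pass":
        return "pass_through"   # unmatched alarms are still let through
    return "drop"               # unmatched alarms are dropped (assumed default action)


def is_error(alarm: Alarm) -> bool:
    return alarm["severity"] == "ERROR"


warning_alarm = {"severity": "WARNING", "message": "cpu high"}
print(route(warning_alarm, [is_error], "all_pass"))  # pass_through
print(route(warning_alarm, [is_error], "no_pass"))   # drop
```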

Click 'Add' to add the generic rule

Input Alarm Filter - Here you can qualify the transient alarms which need to be passed through the correlation rule

In the above example, alarms from all the hosts that belong to the host/resource profile 'CM_HostProfile_Range_3-230', with severity equal to 'ERROR' and a message containing both 'mysql' and 'unavail', are passed through the alarm correlation rule.
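The filter condition in this example can be expressed as a short predicate. The Python sketch below is illustrative only: the alarm is modelled as a dictionary, the host addresses standing in for the members of 'CM_HostProfile_Range_3-230' are made up, and the case-insensitive message match is an assumption.

```python
from typing import Dict, List, Set


def matches_input_filter(alarm: Dict[str, str], profile_hosts: Set[str],
                         severity: str, keywords: List[str]) -> bool:
    """True if the alarm comes from a profile host, has the given severity
    and its message contains every keyword."""
    message = alarm["message"].lower()
    return (alarm["host"] in profile_hosts
            and alarm["severity"] == severity
            and all(keyword in message for keyword in keywords))


# Hypothetical members of the host/resource profile.
profile_hosts = {"192.168.3.10", "192.168.3.11"}
alarm = {"host": "192.168.3.10", "severity": "ERROR",
         "message": "MySQL service is Unavailable"}
print(matches_input_filter(alarm, profile_hosts, "ERROR", ["mysql", "unavail"]))  # True
```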

'Add New Profile' takes you to the Profile Manager page where you can add profiles (select hosts/resources) to be used for alarm correlation.

Output

After the input filter is applied to the alarm correlation rule, the next step is to define the output, i.e. the end actions to be performed on the alarms. The output can be generated as a new alarm, or the existing alarm can be enriched by modifying the notification, severity or alarm message.

In the example below, the alarm rule defined above is enriched by appending the reason "Respective host is currently down".
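Enrichment of this kind can be pictured as producing a copy of the alarm with a longer message. In the Python sketch below, the dictionary representation of the alarm and the " - " separator are assumptions; only the appended reason text comes from the example.

```python
from typing import Dict


def enrich(alarm: Dict[str, str], reason: str) -> Dict[str, str]:
    """Return a copy of the alarm with the reason appended to its message."""
    enriched = dict(alarm)  # leave the original alarm untouched
    enriched["message"] = f"{alarm['message']} - {reason}"
    return enriched


original = {"severity": "ERROR", "message": "MySQL service is unavailable"}
enriched = enrich(original, "Respective host is currently down")
print(enriched["message"])
# MySQL service is unavailable - Respective host is currently down
```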

Rule Engine

Alarms have two conditions: Sleep and Check Active Alarm. If an alarm is in the Sleep state, macros cannot be applied to it. Neither 'Within (sec)' (the time duration for checking the alarm state) nor the occurrence of the alarm is applicable. Only 'Delay interval (mins)', the duration for which the alarm stays in the Sleep state, applies. 'After Delay', the next action to be performed once the Sleep state completes, has to be selected. This can be a configured output action or a new state, or the alarm can be dropped.

If the 'Check Active Alarm' condition is selected, you need to specify the time duration ('Within (sec)') for checking the alarm state and the occurrence of the alarm state. You can also select macros to filter hosts (see Conditional Macros). If the alarm is available, select an output action or a new state, or drop the alarm. Similarly, if the alarm is unavailable, select an output action or a new state, or drop the alarm.

Click 'Preview' to view the flowchart of the alarm rule engine configuration, and click 'Finish' to save the configuration.
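The 'Check Active Alarm' condition can be thought of as a windowed occurrence check followed by a branch. The Python sketch below is a simplified reading of that behavior: it assumes that the alarm counts as available when at least the configured number of occurrences fall inside the 'Within (sec)' window, and the function name, timestamps and branch actions are hypothetical.

```python
from typing import Dict, List


def check_active_alarm(alarm_times: List[float], now: float,
                       within_sec: float, occurrence: int) -> str:
    """Return 'available' if enough alarm occurrences fall inside the window."""
    recent = [t for t in alarm_times if 0 <= now - t <= within_sec]
    return "available" if len(recent) >= occurrence else "unavailable"


# Hypothetical actions chosen for each branch of the condition.
branches: Dict[str, str] = {"available": "output_action", "unavailable": "drop"}

# Alarm occurrences at 100 s, 130 s and 170 s; the check runs at 180 s.
outcome = check_active_alarm([100.0, 130.0, 170.0], now=180.0,
                             within_sec=60, occurrence=2)
print(outcome, "->", branches[outcome])  # available -> output_action
```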

Conditional Macros

You can add conditional macros for use in generic rules. Macros help you to select hosts based on the severity or message criteria of the alarms.

Click 'Conditional Macros' and click the 'Add' button

In the above example, the macro checks for active alarms with Severity equal to 'ERROR' and message containing 'unavail' for all resources belonging to the selected profile.

In the above example, the macro checks for the host unavailability in the transient alarm list for the corresponding Field parameters (Severity/Message) that are selected.

The 'Source' option means that the correlation rule engine takes the system or resource details from the input transient alarms.

Transient Rules

These rules apply to all the active alarms and are added to update the tolerance count and to enable or disable the suppression of consecutive alarms for system or global thresholds.

Sr. No	Field Name	Description
1	Tolerance Count	The tolerance count that needs to be applied whenever a threshold breach happens. This helps to discount any occasional irrelevant spikes in the monitored parameter classified as an alarm. No alarm would be generated if the breach happens within the tolerance count range. 0 minute represents infinite.
2	Enable suppression of consecutive alarms occurrence within minutes	If any consecutive alarm is generated within the specified time period, the alarm is suppressed. 0 minute means the suppression of alarms continues for infinite time till the active alarm is resolved. For example, if this is not checked and MySQL database is not available, then the alarms would be generated for each poll value, until MySQL database is available.

Business Service Rules - You can add business service rules to configure thresholds that generate alarms. See Business Service for more details.

Note: For performance monitoring of devices, all built-in thresholds defined and activated will be global and will be applied to all the devices. If you need to define a device specific threshold then refer to 'Types of Thresholds' given below.

Global definitions are allowed only for the Availability statistic for Service Monitor and Application monitor resources. Threshold definition for all other statistics related to these monitors will be supported at the resource level only.
