...

On-Call Instructions

Before calling on any alarm, review the following:

  1. Locate the night's planned maintenance in the Footprints Change and Release Management workspace calendar to ensure the device is not scheduled to be down. 
  2. If a server is either specifically mentioned in an RFC or can be inferred to be part of an RFC, no calls should be placed to the group responsible for the system.  
  3. If the system is listed as anything other than production under ITIS services in the main view in Xymon, do not call.
  4. Operators only need to call for an Xymon alarm if it is also showing on Squared Up. 

After a new Squared Up alert pops up for a production machine, wait to see whether it clears. If the alert is still present after 20 minutes, Operations will need to call the group responsible for the system.
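The pre-call checks and the 20-minute rule above can be summarized as a small decision function. This is only an illustrative sketch: the `Alert` record, its field names, and `should_call` are hypothetical and do not correspond to any Xymon, Squared Up, or Footprints API. An operator would still gather each fact by hand from those tools.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# From the text: call only if the alert is still present after 20 minutes.
ALERT_HOLD = timedelta(minutes=20)


@dataclass
class Alert:
    """Hypothetical record of what the operator has looked up for one alert."""
    host: str
    first_seen: datetime   # when the alert first appeared
    environment: str       # listing under ITIS services in Xymon, e.g. "production"
    in_squared_up: bool    # is the alert also showing on Squared Up?
    covered_by_rfc: bool   # named in, or inferable from, an RFC / planned maintenance


def should_call(alert: Alert, now: datetime) -> bool:
    """Return True when the checks above say the responsible group should be called."""
    if alert.covered_by_rfc:
        return False       # planned maintenance: no call
    if alert.environment != "production":
        return False       # anything other than production: do not call
    if not alert.in_squared_up:
        return False       # Xymon-only alarms do not require a call
    return now - alert.first_seen >= ALERT_HOLD
```

For example, a production alert that is showing on Squared Up, is not covered by an RFC, and was first seen 25 minutes ago would return `True`. The same alert at 10 minutes would return `False`. Clustered systems and per-machine special instructions, covered in the steps that follow, are deliberately left out of this sketch.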

  1. After a new Xymon critical alert pops up for a production machine, begin considering which group to contact, if any. If the alert is still present after 20 minutes, Operations will need to call the system owner.
  2. Is the alert for a clustered device? Some systems, like Mailhub, are clustered, and thus can have several alerts before one needs to take action. Generally, clustered machines will NOT alarm in Squared Up for individual boxes, but they will in Xymon. This clue can help an operator determine the severity of the Xymon alert. Some clustered systems react in the opposite manner: they will alert in Squared Up but not in Xymon until critical mass is reached. In these cases, ensure enough machines are in alert before contacting the appropriate admins.
  3. Locate the night's planned maintenance in the Footprints Change and Release Management Workspace Calendar to ensure the device is not scheduled to be down.
  4. In Xymon, click the status icon (pictured below) along the row corresponding to the trouble server.
  5. Some admins have placed contact instructions, or instructions for which alarms should be ignored, on this page (an example of these instructions is pictured below). Follow any special instructions for the machine OR use the Configuration Management Database (CMDB) to locate the appropriate on-call.
  6. Call the on-call and inform them of the situation, affected device, and any other issues that may be cropping up due to the alert. Send a follow-up email. For CPU, memory, and disk alerts, paste the Xymon alert text into the email. 
  7. Log the contact and any appropriate follow-up activities.

...

The top level titles and subcategories are subject to change. IOC staff are encouraged to explore the application and become familiar with the various categories of systems monitored.

 

All non-green View

It may be a mouthful of a title, but it's entirely accurate. When a user can only have one Xymon window open, it should be this one. Again, Squared Up will display critical alerts for production machines. With those two programs running, a user should be able to determine which alerts warrant further attention without too much wasted effort. The non-green systems category provides a real-time listing (up to four hours) of the most recent changes in machine status for every monitored device. It can also display the last four hours of event acknowledgments by system administrators. Finally, all alerts are displayed in a dynamic expanding grid format.

 

Current Status

Any machines with a current error condition will be displayed at the top of the page. Selecting any of the status icons to the right of the machine listing will bring up service information for that system. If there are no current error conditions, this portion of the page will display “All Monitored Systems OK”.
Users who click on the underlined name of a machine will see information deemed appropriate by the administrators. The displayed information may be the other systems in a cluster, or specific information about a given server. For example, lppbakbm01.itap.purdue.edu is a Backup Production server. When it alarms in the ‘Current non-green Systems’ window, clicking its title brings the user to the ‘Backup Production’ category. If tsm01.itap.purdue.edu were in alarm instead, the user would bypass the ‘Backup Production’ category entirely and see specific instructions for tsm01 alerts. This is similar to the ‘Production Services’ example earlier in this document. It is impossible to tell from the ‘Current non-green Systems’ view which machines contain specific on-call instructions and which do not. The only consistent way to tell if a machine has further alarm instructions is to drill down to its lowest directory. Any names underlined within this category will contain specific instructions.

...