...


Description: 

Xymon is the web-based monitoring tool of choice for administrators and ITSO staff supporting Windows and Unix processes on the Purdue campus. It performs a valuable duty, giving the trained eye a quick overview of the hardware and processes that may be out of sync for several key University areas.


In general, IOC staff will only need to respond to Critical (Red) alerts that affect production devices (servers, virtual machines, etc.). All production Xymon alerts should also be reflected in Grafana. Generally, if a critical production alert appears in Xymon for more than 20 minutes, a system administrator or equivalent will need to be contacted regarding their malfunctioning equipment. In the event of a hardware failure, contact the appropriate system owner ASAP. Each shift is responsible for learning the contact procedures appropriate to its time of day.

...


Location:

Xymon: All non-green systems - Keep this link open on your workstation.

Xymon: Top View - Top-level view that displays alerts by group. Creating a sidebar for this view is suggested.

Categories include:

  • ITIS Services – Production, non-production, and pre-production machines and VMs for a number of systems such as Oracle, Banner, Blackboard/Brightspace, and SAP. IOC staff should only be concerned with devices nested within Production.
  • Infrastructure – There are no devices exclusively within this category that IOC staff are required to monitor.
  • Platform Support – Windows production and test boxes for applications. This category is in place for Windows administrators. There are no devices exclusively within this category that IOC staff are required to monitor.




On-Call Instructions

Before calling on any alarm, review the following:

  1. Locate the night's planned maintenance in the Footprints Change and Release Management workspace calendar to ensure the device is not scheduled to be down.
  2. If a server is either specifically mentioned in an RFC or can be inferred to be part of an RFC, no calls should be placed to the group responsible for the system.
  3. If the system is listed as anything other than production under ITIS Services in the main view in Xymon, do not call.
  4. Operators only need to call for a Xymon alarm if it is also showing in Grafana.

After a new Grafana alert pops up for a production machine, if the alert is still present in Xymon after 20 minutes, IOC will need to call the group responsible for the system.

  1. After a new Xymon critical alert pops up for a production machine, begin considering which group to contact, if any. If the alert is still present after 20 minutes, Operations will need to call the system owner.
  2. Is the alert for a clustered device? Some systems, like Mailhub, are clustered, and thus can have several alerts before one needs to take action. Generally, clustered machines will NOT alarm in Grafana for individual boxes, but they will in Xymon. This clue can help an operator determine the severity of the Xymon alert. Some clustered systems react in the opposite manner: they will alert in Grafana but not in Xymon until critical mass is reached. In these cases, ensure enough machines are in alert before contacting the appropriate admins.
  3. Locate the night's planned maintenance in the Footprints Change and Release Management workspace calendar to ensure the device is not scheduled to be down.
  4. In Xymon, click the status icon (pictured below) along the row corresponding to the trouble server.
  5. Some admins have placed instructions on this page describing which alarms should be ignored, or giving contact instructions (an example of these instructions is pictured below). Follow any special instructions for the machine OR use the Configuration Management Database (CMDB) Service Offering On Call page to locate the appropriate on-call.
  6. Call the on-call and inform them of the situation, the affected device, and any other issues that may be cropping up due to the alert. Send a follow-up email. For CPU, memory, and disk alerts, paste the Xymon alert text into the email.
  7. Log the contact and any appropriate follow-up activities.
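
The pre-call checklist and the 20-minute rule above can be summarized as a small decision function. This is only an illustrative sketch: the `Alert` record, its field names, and `should_call` are hypothetical and are not part of Xymon, Grafana, or Footprints. It does not model the clustered-device judgment call, which still needs an operator's eye.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# The 20-minute window described in the steps above.
CALL_THRESHOLD = timedelta(minutes=20)


@dataclass
class Alert:
    """Hypothetical record of one critical (red) Xymon alert."""
    host: str
    first_seen: datetime          # when the alert first appeared in Xymon
    is_production: bool           # listed under Production in ITIS Services
    in_planned_maintenance: bool  # named in, or inferable from, an RFC
    showing_in_grafana: bool      # also visible in Grafana


def should_call(alert: Alert, now: datetime) -> bool:
    """Return True when the checklist says the on-call should be phoned."""
    if not alert.is_production:
        return False  # anything other than production: do not call
    if alert.in_planned_maintenance:
        return False  # device is expected to be down
    if not alert.showing_in_grafana:
        return False  # only call when the alarm also shows in Grafana
    # Call only once the alert has persisted for the full window.
    return now - alert.first_seen >= CALL_THRESHOLD
```

For example, an alert on a production host that is not under maintenance and is visible in Grafana would return `False` at 19 minutes and `True` at 20 minutes after `first_seen`.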

...

IMPORTANT

If there is no answer from the group's on-call number, leave a voicemail message and send a follow-up email. Wait another 10 minutes and, if there is still no answer, contact the next on-call contact or manager. If there is no contact from the group by phone or email after this time, consult with your supervisor or on-call supervisor.

...

Admins Instructions

Find the group owner of the system from its system page as follows:

1. Go to Xymon.
2. Find the production status indicator (pictured below). It should be red if there is an alarm.
(Image: Production Status Indicator.PNG)
3. Click the red status indicator for the affected system.
4. Continue moving through the sub-menus until you find the system page. If the page has special instructions at the top, make sure to follow them when deciding whether an admin should be contacted, and which admin should be contacted.
(Image: citrix.PNG)

5. If the system name is clickable, there will be special instructions. Follow those instructions. Clickable links will appear underlined as pictured below.

(Image: System name.png)

These instructions often describe when NOT to call. One example is below.
(Image: instructions.PNG)

6. Once step 4 is complete, find the system in the Footprints Change and Release Management CMDB.

  • Search for the name of the system.
  • Click on the related CI(s) link.
  • Click the bubble button CI’s to “Named server”.
  • Click on the Managed by Relationship, and click “Go To”.
  • The on-call group information should be listed here.

7. If there is no information from the previous steps:

  • Search the communication log for the alarming system.
  • Review past correspondence and determine who to call.
  • If the server name starts with a “W”, it is most likely the Windows on-call.
  • If the server name starts with an “L”, it is most likely the Unix (Linux) on-call.
  • Consult with your coworkers.
  • Consult your supervisor or on-call supervisor.
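
The server-name rule of thumb in step 7 can be written down as a tiny helper. This is a sketch of the heuristic only; the function name and return strings are hypothetical, and the result is a best guess ("most likely"), not an authoritative lookup. The CMDB and any special instructions on the system page take precedence.

```python
def likely_on_call(hostname: str) -> str:
    """Guess the on-call group from the first letter of the server name."""
    first = hostname.strip()[:1].lower()
    if first == "w":
        return "Windows on-call"
    if first == "l":
        return "Unix (Linux) on-call"
    # No rule of thumb applies: consult coworkers or a supervisor.
    return "unknown"
```

For instance, `likely_on_call("lppbakbm01.itap.purdue.edu")` returns `"Unix (Linux) on-call"`, while a name such as `tsm01.itap.purdue.edu` falls through to `"unknown"` and needs the other lookup steps.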

...

When a user can only have one Xymon window open, it should be this one. Grafana will display critical alerts for production machines. With those two programs running, a user should be able to determine which alerts warrant further attention without too much wasted effort. The non-green systems category provides a real-time listing (up to four hours) of the most recent changes in machine status for every monitored device. It can also display the last four hours of event acknowledgments by system administrators.

...

Any machines with a current error condition will be displayed at the top of the page. Selecting any of the status icons to the right of the machine listing will bring up service information for that system. If there are no current error conditions, this portion of the page will display “All Monitored Systems OK”.
Users who click on the underlined name of a machine will see information deemed appropriate by the administrators. The displayed information may be the other systems in a cluster, or specific information about a given server. For example, lppbakbm01.itap.purdue.edu is a Backup Production server. When it alarms in the ‘Current non-green Systems’ window, clicking its title brings the user to the ‘Backup Production’ category. If tsm01.itap.purdue.edu were in alarm instead, the user would bypass the ‘Backup Production’ category entirely and see specific instructions for tsm01 alerts. This is similar to the ‘Production Services’ example earlier in this document.

It is impossible to tell from the ‘Current non-green Systems’ view which machines contain specific on-call instructions and which do not. The only consistent way to tell if a machine has further alarm instructions is to drill down to its lowest directory. Any names underlined within this category will contain specific instructions.

A machine will show up with a color other than green if there are any issues with it. Please see the Icon list for a description of each icon.

Status History

This section can display messages from the last 4 hours of monitoring. Each line contains, from left to right: time stamp, machine name, affected service, prior service status, and the updated machine state. Each system's title will be highlighted in red, yellow, or green, a color that corresponds to the current status of the service in question.

...

2. Clustered Systems - When a large number of these machines are in alarm, call after the 20-minute window. An example of a server cluster would be Software Remote. Some clustered systems will only alert in Grafana if enough issues are logged in Xymon.

...

4. Personal Workstations - Unless otherwise specified, ignore all personal workstation alerts. Personal workstation alerts should all be squelched in Grafana, and should no longer be found within Production categories in Xymon (as of 3/16/2012).

...