Xymon Service Monitoring Tool
Description:
Xymon is the web-based monitoring tool of choice for administrators and staff supporting Windows and Unix processes on the Purdue campus. It performs a valuable duty, giving the trained eye a quick overview of the hardware and processes that may be out of sync in several key University areas.
In general, IOC staff will only need to respond to Critical (Red) alerts that affect production devices (servers, virtual machines, etc.). All production Xymon alerts should also be reflected in Grafana. Generally, if a critical production alert appears in Xymon for more than 20 minutes, a system administrator or equivalent will need to be contacted regarding the malfunctioning equipment. In the event of a hardware failure, contact the appropriate system owner ASAP. Each shift is responsible for learning the contact procedures appropriate to its time of day.
Location:
Xymon: All non-green systems - Keep this link open on your workstation.
Xymon: Top View - Top-level view that displays alerts by group. Creating a sidebar for this view is suggested.
Categories include:
- ITIS Services – Production, Non-production, Pre-production machines and VMs for a number of systems such as Oracle, Banner, Blackboard/Brightspace, SAP. IOC staff should only be concerned with devices nested within Production.
- Infrastructure – There are no devices exclusively within this category that IOC staff are required to monitor.
- Platform Support – Windows production and test boxes for applications. This category is in place for Windows administrators. There are no devices exclusively within this category that IOC Staff are required to monitor.
Before calling on any alarm, review the following:
- Locate the night's planned maintenance in the Footprints Change and Release Management workspace calendar to ensure the device is not scheduled to be down.
- If a server is either specifically mentioned in an RFC or can be inferred to be part of an RFC, no calls should be placed to the group responsible for the system.
- If the system is listed as anything other than production under ITIS Services in the main view in Xymon, do not call.
- Operators only need to call for a Xymon alarm if it is also showing on Grafana Up.
- After a new Grafana alert pops up for a production machine, if the alert is still present in Xymon after 20 minutes, IOC will need to call the group responsible for the system.
- After a new Xymon critical alert pops up for a production machine, begin considering which group to contact, if any. If the alert is still present after 20 minutes Operations will need to call the system owner.
- Is the alert for a clustered device? Some systems, like Mailhub, are clustered, and thus can have several alerts before one needs to take action. Generally, clustered machines will NOT alarm in Grafana for individual boxes, but they will in Xymon. This clue can help an operator determine the severity of the Xymon alert. Some clustered systems react in the opposite manner: they will alert in Grafana but not in Xymon until critical mass is reached. In these cases, ensure enough machines are in alert before contacting the appropriate admins.
- In Xymon, click the status icon (pictured below) along the row corresponding to the trouble server.
- Some admins have placed instructions on this page about which alarms should be ignored, or contact instructions (pictured below is an example of these instructions). Follow any special instructions for the machine OR use the Service Offering On Call page to locate the appropriate on-call.
- Call the on-call and inform them of the situation, affected device, and any other issues that may be cropping up due to the alert. Send a follow-up email. For CPU, memory, and disk alerts, paste the Xymon alert text into the email.
- Log the contact and any appropriate follow-up activities.
If there is no answer at the group's on-call number, leave a voicemail. Call back in 10 minutes. If there is still no answer, leave a voicemail and send a follow-up email. Wait another 10 minutes; if there is still no response, contact the next on-call contact or manager. If the group has not responded by phone or email after this time, consult your supervisor or the on-call supervisor.
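The callback intervals above can be laid out as a simple timeline. The following is a minimal sketch, not an official tool: the function name `escalation_schedule` and the action wording are illustrative assumptions, and only the two 10-minute waits come from the procedure itself.

```python
from datetime import datetime, timedelta

# The 10-minute wait between attempts is taken from the procedure above.
CALLBACK_DELAY = timedelta(minutes=10)


def escalation_schedule(first_call: datetime) -> list[tuple[datetime, str]]:
    """Return (time, action) pairs for an on-call contact that goes unanswered."""
    second_call = first_call + CALLBACK_DELAY
    escalate = second_call + CALLBACK_DELAY
    return [
        (first_call, "Call the group's on-call number; leave a voicemail"),
        (second_call, "Call again; leave a voicemail and send a follow-up email"),
        (escalate, "Contact the next on-call contact or manager"),
    ]
    # If there is still no contact after the last step, the procedure says to
    # consult a supervisor; no fixed delay is given, so it is not scheduled here.


if __name__ == "__main__":
    for when, action in escalation_schedule(datetime(2024, 1, 1, 2, 0)):
        print(when.strftime("%H:%M"), action)
```

Each attempt should still be logged as it happens; the sketch only shows when the next step falls due.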
Treat a production machine with numerous purple status alerts as if it were a red alert.
Find the system page and the group that owns the system as follows:
1. Go to Xymon.
2. Find the production status indicator (pictured below); it should be red if there is an alarm.
3. Click the red status indicator for the affected system.
4. Continue moving through the sub-menus until you find the system page. If the page has special instructions at the top, follow them when deciding whether an admin should be contacted, and which admin.
5. If the system name is clickable, there will be special instructions. Follow those instructions. Clickable links will appear underlined as pictured below. These are often instructions about when NOT to call. One example is below.
6. If there is no information from the previous steps:
- Search the communication log for the alarming system.
- Review past correspondence and determine who to call.
- If the server name starts with a “W”, it is most likely the Windows on-call.
- If the server name starts with an “L”, it is most likely the Unix (Linux) on-call.
- Consult with your coworkers.
- Consult your supervisor or the on-call supervisor.
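The server-name rule of thumb from the last step can be expressed as a small lookup. This is only a sketch of that heuristic: the function name `likely_on_call` and the example hostnames are hypothetical, and the result is a hint ("most likely"), never a substitute for page instructions or the communication log.

```python
from typing import Optional

# First-letter hints as described in the steps above.
PREFIX_HINTS = {
    "w": "Windows on-call",
    "l": "Unix (Linux) on-call",
}


def likely_on_call(server_name: str) -> Optional[str]:
    """Guess the on-call group from the first letter of the server name."""
    name = server_name.strip().lower()
    if not name:
        return None
    # Any other first letter gives no hint; fall back to coworkers or a supervisor.
    return PREFIX_HINTS.get(name[0])


if __name__ == "__main__":
    # Hypothetical hostnames, for illustration only.
    for host in ("wexample01", "lexample02", "db01"):
        print(host, "->", likely_on_call(host))
```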
Xymon Top Level View
From the top level a user may drill down to determine a number of factors. To investigate a given category, click the face or symbol next to each title. Users who look at the Production category are presented with a large list of Services. Below is the current list.
The top level titles and subcategories are subject to change. IOC staff are encouraged to explore the application and become familiar with the various categories of systems monitored.
All non-green View
When a user can only have one Xymon window open, it should be this one. Grafana will display critical alerts for production machines. With those two programs running, a user should be able to determine which alerts warrant further attention without too much wasted effort. The non-green systems view provides a real-time listing (covering up to four hours) of the most recent changes in machine status for every monitored device. It can also display the last four hours of event acknowledgments by system administrators.
Current Status
Any machines with a current error condition will be displayed at the top of the page. Selecting any of the status icons to the right of the machine listing will bring up service information for that system. If there are no current error conditions, this portion of the page will display “All Monitored Systems OK”. A machine will show up with a non-green color if there are any issues with it. Please see the icon list for a description of each icon.
Status History
This section can display messages from the last 4 hours of monitoring. Each line contains from left to right: time stamp, machine name, affected service, prior service status, and the updated machine state. Each system's title will be highlighted in red, yellow, or green; a color which corresponds to the current status of the service in question.
Prior Status: State of monitored service prior to the most recent update
Updated Status: Current state of monitored service.
Status History: Time and date of the update
Acknowledged Alerts
From time to time an administrator will have acknowledged an alert within Xymon without clearing it. For acknowledged production alerts, a user will be presented with a check mark in place of the face / X shape as shown below.
Xymon Color Codes and Symbols
Color | Recently changed | Last change > 24 hours |
---|---|---|
Green: Status is OK | | |
Yellow: Warning | | |
Red: Critical | | |
Clear: No data | | |
Purple: No report | | |
Blue: Disabled | | |
When to Call
As previously mentioned, critical alerts in Xymon are indicated by the color red. Take time to click on the alert when it shows up in order to better prepare any future response. Generally, a system administrator should be alerted if one of their production devices is in an alarm state for more than 20 minutes. There are several exceptions to this rule, as follows:
1. Hardware Failure - Contact the on-call rapidly for production machines suffering a hardware failure. These failures do not clear up on their own, and prompt response (~5 minutes) is advisable.
2. Clustered Systems - When a large number of these machines are in alarm, call after the 20 minute window. An example of a server cluster would be Software Remote. Some clustered systems will only alert in Grafana if enough issues are logged in Xymon.
3. Test Machines - Unless otherwise specified, ignore all test machine alerts.
4. Personal Workstations - Unless otherwise specified, ignore all personal workstation alerts. Personal workstation alerts should all be squelched in Grafana and should no longer be found within Production categories in Xymon (as of 3/16/2012).
5. Purple Alerts - Treat a production machine with numerous purple status alerts as if it were a red alert.
6. Broad Spectrum Failure - If a large number of alerts are being recorded across the board for different systems, consider picking up the phone sooner rather than later; there could be a bigger problem emerging.
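The general rule and several of its exceptions can be summarized as a decision function. This is a rough sketch under stated assumptions, not a replacement for operator judgment: the `Alert` fields and the `should_call` name are hypothetical, the 5-minute figure is the approximate "~5 minutes" from the hardware failure item, and clustered systems and broad spectrum failures are deliberately not modeled because this document gives no numeric thresholds for them.

```python
from dataclasses import dataclass

STANDARD_WAIT_MIN = 20   # general rule: alarm still present after 20 minutes
HARDWARE_WAIT_MIN = 5    # approximate prompt-response window for hardware failures


@dataclass
class Alert:
    color: str                            # Xymon color, e.g. "red" or "purple"
    production: bool                      # listed under Production in Xymon
    minutes_in_alarm: int
    in_planned_maintenance: bool = False  # covered by an RFC / change calendar entry
    on_grafana: bool = True               # also showing in Grafana
    hardware_failure: bool = False
    numerous_purple: bool = False         # many purple statuses on the same machine


def should_call(alert: Alert) -> bool:
    """Return True when the documented rules say the on-call should be phoned."""
    # Test machines, workstations, and anything else non-production are ignored.
    if not alert.production or alert.in_planned_maintenance:
        return False
    if not alert.on_grafana:
        return False
    # Numerous purple statuses on a production machine are treated as red.
    critical = alert.color == "red" or (
        alert.color == "purple" and alert.numerous_purple
    )
    if not critical:
        return False
    wait = HARDWARE_WAIT_MIN if alert.hardware_failure else STANDARD_WAIT_MIN
    return alert.minutes_in_alarm >= wait


if __name__ == "__main__":
    print(should_call(Alert(color="red", production=True, minutes_in_alarm=25)))
```

Any special instructions on the system's Xymon page still take precedence over a result from a sketch like this.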
Column Quick Reference
Listed below are brief descriptions of the various column headings found within Xymon. Please refer to these if unsure of the nature of an alert.
Conn - The conn test 'pings' the host to see if it is responsive.
Content - The content column shows the status of a Web request, where a specific response was expected.
Cpu - this column will indicate the status of the processor on the machine being monitored. A yellow status will result if processing load reaches an administrator defined threshold, and a red status will result if the processing load gets even higher, generally over 95%. High load is not uncommon and generally will require no response.
Deadpath - TO BE FILLED OUT
Disk - this column will indicate the status of the drive space attached to the machine being monitored. This will generally be an indication of a full disk. Like CPU alerts, disk thresholds can be adjusted by administrators. The most common disk alert will generally be a full log file.
DNS - this column indicates the status of the domain name server being monitored.
F5Status - TO BE FILLED OUT
Files - The files test shows the status of file and directory checks performed on the host. These are typically tests that check the size of files or directories, or check that they exist with the correct owner/group/permissions.
fs - Reports the status of the file system.
FSCHK - Unix system utility (stands for file system check.)
hobbitd - Shows the status of the central Xymon daemon. If this is in alarm Xymon probably won't be working.
http - Shows the status of one or more Web requests sent to the server. HTTP is now the ubiquitous method for exchanging information across a network; it is the service used when your web browser requests information from the Internet.
Info - The info column shows static information about how this host is configured in the Xymon system. It may also contain contact information or other device specific instructions.
kpsing - To be filled out.
ldap - Shows the status of the LDAP Directory Service on the host. ldap is commonly used for storing information about users, e.g. their login-names and passwords, so if the ldap service is not running then users may have problems getting access to their systems.
lpstat - Monitors the status of the printing system on the client. Each client runs the lpstat command to get a list of printers it can detect. If that attempt fails to contact the cups daemon the service is reported as down and "daemon not responding" (red). If the cups daemon is responding, but it reports no printers, the service is reported as "DOWN" (red). Otherwise, if it can connect to the server (cups) but gets only few printers (fewer than 450) the service is reported as marginal (yellow). Otherwise it is reported as OK (green).
Memory - The memory column shows how much of the system memory (RAM) and swap-space is being used. If memory is running low, performance of the system will begin to degrade.
Metastat - The metastat column monitors the health of any meta devices found on the client system.
Msgs - The msgs column monitors system log-files or the Event log for warnings or critical errors.
Myconns - ??? To be filled out
Mysql - Monitors the status of the MySQL server on the client. It will attempt to connect and do a select. If it succeeds the service is reported as "OK" (green), otherwise it is reported as "Not OK" (red).
Netstat - Holds the output of the netstat command and graphs the number of packets received, sent and retransmitted.
Nmbd - Monitors the status of the Samba name service (nmb) on the client, if it exists (actually, if the monitor script knows about it). The script uses the nmblookup command and tests the output for success or failure.
Pop3 - Shows the status when trying to communicate with the POP Postoffice service on the host. POP is used when a user needs to pick up mail from a central mail server.
Procs - This column will indicate the status of specific processes (set on a per machine basis - the processes monitored will vary from machine to machine).
Ports - Shows the status of select tcp ports and connections that are expected to exist on the system.
Prtdiag - prtdiag is a bash script that generates a report that describes the state of the hardware on the running machine. It can even forward alerts for fans and heat alerts. (Note it does not collect its own data.)
Raidctl - Reports back all Solaris DiskSuite RAID software faults
Smbd - Monitors the status of the samba service on the client. It will attempt a connection twice with a pause in between (it uses smbclient). If the connection is successful on the first try, the service is reported as OK (green), if it is successful on the second try it is reported as "OK on second try" (yellow). However, if it fails on the second try it will be reported as "NOT OK" (red).
Smtp - Shows the status when trying to communicate with the SMTP Mail transfer service on the host. smtp is used on mail servers that handle outgoing mail, or that relay mail from one system to another.
Ssh - Shows the status when trying to communicate with the secure shell (ssh) server on the host. ssh is commonly used for encrypted console access to Unix servers, or for copying files between systems.
Svcs - This column (Services) is identical to 'Procs' listed above, but for Windows machines. Clicking on the status icon may give further detail on which processes triggered the alert. Allow 20 minutes for the alert to self-correct, and then follow on-call procedures for the machine being monitored.
SMBD - This column will indicate the status of the Samba server on the machine being monitored. Allow 20 minutes for this to self-correct, and then follow on-call procedures for the machine being monitored.
SMTP - this column will indicate the status of the SMTP (simple mail transfer protocol) server running on the machine being monitored.
SMTP-auth -
SMTP-nopmx -
SMTP-starttls -
sslcert - status of one or more SSL certificates on the server. SSL certificates are needed for services that use encryption, e.g. if you have a secure webserver. Certificates are normally issued by trusted organisations such as Verisign or Thawte, and are valid for a limited period of time.
Telnet - shows the status when trying to communicate with the Telnet service on the host. Telnet is commonly used for logging in to Unix servers or network devices such as routers.
Temp - temperature
Trends - holds a collection of the graphs that show trends in the utilisation, response-times etc. for the services monitored on this host.
Vmstat - This column will indicate the status of the virtual memory on the machine being monitored. Heavy load is not uncommon, but if the condition persists for over an hour, follow on-call procedures for the machine being monitored.
zfs - The zfs column monitors the health of any zfs pools found on the client system. ZFS is the Sun brand file system and logical volume manager.
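The lpstat and Smbd entries above each describe a small set of rules for choosing a status color. The sketch below restates those rules as code to make the thresholds easier to read. It is not the actual Xymon monitor script: the function names and parameters are hypothetical, and only the 450-printer threshold and the two-attempt behavior come from the entries themselves.

```python
MIN_EXPECTED_PRINTERS = 450  # threshold quoted in the lpstat entry


def lpstat_color(daemon_responding: bool, printer_count: int) -> str:
    """Map the lpstat outcomes described above to a Xymon color."""
    if not daemon_responding:
        return "red"      # "daemon not responding"
    if printer_count == 0:
        return "red"      # daemon answers but reports no printers: "DOWN"
    if printer_count < MIN_EXPECTED_PRINTERS:
        return "yellow"   # marginal: fewer printers than expected
    return "green"


def smbd_color(first_try_ok: bool, second_try_ok: bool) -> str:
    """Map the two smbclient connection attempts described above to a color."""
    if first_try_ok:
        return "green"    # OK on the first try
    if second_try_ok:
        return "yellow"   # "OK on second try"
    return "red"          # "NOT OK"


if __name__ == "__main__":
    print(lpstat_color(True, 449), smbd_color(False, True))
```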
If there are any errors that require a response, a brief description of the error condition and the corresponding actions taken should be included in the nightly shift logs.