...
Section | ||
---|---|---|
| ||
Description: Xymon is the web-based monitoring tool of choice for administrators and staff supporting ITSO Windows and Unix processes on the Purdue campus. It gives the trained eye a quick overview of the hardware and processes that may be out of sync in several key University areas.
|
Section |
---|
Location:
- Xymon: All non-green systems - Keep this link up on your workstation. https://monitor.itap.purdue.edu/
- Xymon: Top View - Top-level view that displays alerts by group. It is suggested that you create a sidebar for this view. Categories include:
|
Panel | ||
---|---|---|
| ||
Before calling on any alarm review the following:
After a new Grafana alert pops up for a production machine, if the alert is still present in Xymon after 20 minutes, IOC will need to call the group responsible for the system.
|
Panel | ||||||||
---|---|---|---|---|---|---|---|---|
| ||||||||
If there is no answer from the group's on-call number, leave a voicemail. Call back again in 10 minutes. If there is still no answer, leave a voicemail message and send a follow-up email. Wait another 10 minutes and, if there is no answer, contact the next on-call contact or manager. If there is no contact from the group by phone or email after this time, consult with your supervisor or on-call supervisor. |
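The call-back timing above can be laid out as a simple schedule. The following is a minimal sketch only: the function and constant names are hypothetical, the step wording is paraphrased, and it assumes the clock starts at the first call attempt. The procedure gives no fixed delay for the final supervisor step, so that step is listed separately rather than timed.

```python
from datetime import datetime, timedelta

# (minutes after the first call, action). Offsets follow the procedure above;
# the action wording is paraphrased. All names here are hypothetical.
CALL_STEPS = [
    (0, "Call the group's on-call number; leave a voicemail if unanswered"),
    (10, "Call back; if unanswered, leave a voicemail and send a follow-up email"),
    (20, "If still unanswered, contact the next on-call contact or manager"),
]

# The procedure gives no fixed delay for this last step, so it is not timed.
FINAL_STEP = "If there is still no contact, consult your supervisor or on-call supervisor"


def escalation_schedule(first_call):
    """Return (time, action) pairs for each timed step, starting at first_call."""
    return [
        (first_call + timedelta(minutes=offset), action)
        for offset, action in CALL_STEPS
    ]


if __name__ == "__main__":
    for when, action in escalation_schedule(datetime.now()):
        print(when.strftime("%H:%M"), action)
    print("then:", FINAL_STEP)
```

Running the script prints the three timed steps relative to the current time, followed by the untimed supervisor step.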
Panel | ||||||
---|---|---|---|---|---|---|
| ||||||
Treat a production machine with numerous purple status alerts as if it were a red alert. |
...
Panel | ||||
---|---|---|---|---|
| ||||
Find the group owner of the system page by:
Going to Xymon.
5. If the system name is clickable, there will be special instructions. Follow those instructions. Clickable links will appear underlined as pictured below. These instructions often describe when NOT to call. One example is below.
6. Once step 4 is complete, find the system in the Footprints Change and Release Management CMDB. Search for the name of the system. Search the communication log for the alarming system. |
Xymon Top Level View
From the top level a user may drill down to determine a number of factors. To investigate a given category, click the face or symbol next to each title. Users who look at the Production category are presented with a large list of Services. Below is the current list.
Clicking the face alongside the Backup subcategory returns the states of machines currently classified as such. Underlined system names indicate the presence of critical information. Additional important information may be found in the Info category for each machine. A Technical Operator will need this further information to resolve alerts.
Clicking on tsm01, as displayed above, generates the screen shown below.
The top level titles and subcategories are subject to change. The user is encouraged to explore the application and become familiar with the various categories of systems monitored.
...
All non-green View
It may be a mouthful of a title, but it's entirely accurate. When a user can only have one Xymon window open, it should be this one. Again, Grafana will display critical alerts for production machines. With those two programs running, a user should be able to determine which alerts warrant further attention without too much wasted effort. The non-green systems view provides a real-time listing of the most recent changes in machine status (up to four hours back) for every monitored device. It can also display the last 4 hours of event acknowledgments by system administrators. Finally, all alerts are displayed in a dynamic expanding grid format.
...
If there is no information from the previous steps, search the communication log for the alarming system. |
All non-green View
When a user can only have one Xymon window open, it should be this one. Grafana will display critical alerts for production machines. With those two programs running, a user should be able to determine which alerts warrant further attention without too much wasted effort. The non-green systems view provides a real-time listing of the most recent changes in machine status (up to four hours back) for every monitored device. It can also display the last 4 hours of event acknowledgments by system administrators.
Current Status
Any machines with a current error condition will be displayed at the top of the page. Selecting any of the status icons to the right of the machine listing will bring up service information for that system. If there are no current error conditions, this portion of the page will display “All Monitored Systems OK”. A machine will show up in a color other than green if it has any issues. Please see the Icon list for a description of each icon.
Status History
This section can display messages from the last 4 hours of monitoring. Each line contains, from left to right: time stamp, machine name, affected service, prior service status, and the updated machine state. Each system's title will be highlighted in red, yellow, or green, whichever color corresponds to the current status of the service in question.
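For anyone keeping their own notes on Status History entries, the five fields described above map naturally onto a small record type. This is a minimal sketch under stated assumptions: the pipe-delimited text format, the sample time stamp, and all names are hypothetical and are not Xymon's own export format or API.

```python
from dataclasses import dataclass


@dataclass
class StatusChange:
    """One Status History line, with fields in left-to-right order."""
    timestamp: str
    machine: str
    service: str
    previous_status: str
    current_status: str


def parse_status_line(line, sep="|"):
    """Split a hypothetical delimited line into the five documented fields."""
    parts = [part.strip() for part in line.split(sep)]
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}: {line!r}")
    return StatusChange(*parts)


def needs_attention(change):
    """True when the updated state is any color other than green."""
    return change.current_status.lower() != "green"


if __name__ == "__main__":
    # Sample line is illustrative only; the time stamp is made up.
    change = parse_status_line("2012-03-16 09:15 | tsm01 | disk | green | red")
    print(change.machine, change.service, needs_attention(change))
```

Keeping the prior and updated states as separate fields makes it easy to spot transitions, for example a service that has moved from green to red.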
...
2. Clustered Systems - When a large number of these machines are in alarm, call after the 20-minute window. An example of a server cluster would be Software Remote. Some clustered systems will only alert in Grafana if enough issues are logged in Xymon.
...
4. Personal Workstations - Unless otherwise specified, ignore all personal workstation alerts. Personal workstation alerts should all be squelched in Grafana and should no longer be found within Production categories in Xymon (as of 3/16/2012).
5. Purple Alerts - Treat a production machine with numerous purple status alerts as if it were a red alert. (Note: The example machine listed below was NOT a production box at the time of publication.)
6. Broad Spectrum Failure - If a large number of alerts are being recorded across the board for different systems, consider picking up the phone sooner rather than later; there could be a bigger problem emerging.
...