Content Comparison

...

Panel

borderWidth	10
title	Table of Contents

Table of Content Zone

Instructions for Contacting an Admin Regarding an Alarm/Outage
- Admin Follow-Up Email Template
What to Monitor
- On your workstation
- On the monitoring computers
Shift Log
Squared Up
- Xymon Alerts
  - Xymon
- Network Device Alerts
  - Network Incident On-Call Process
  - Information to provide when reporting network issues
Metasys
StruxureWare
Siemens
Cameras
UC4
Crestron Fusion

Outage Types with Unique Instructions
- Exchange Outages
- McAfee ePO Updates
- ICS (TLT) Tool Page Failure

...

Panel

borderColor	black
borderStyle	solid
title	Instructions for Contacting an Admin Regarding an Alarm/Outage

Admin Follow-up Email Template

After calling an admin to report an issue, a follow-up email should be sent to the admin's group (not the admin themselves) using the following template:

Subject: "Follow-up regarding phone call about issue XXXXXX".

Body: <Name>,

This is a follow-up email regarding my phone conversation with <admin> about <Type of alert/Issue> for the <machine / system name>.

<Operator>
ITaP Operations
CSC/IOC Service Desk Specialist
(765) 496-7272
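
The template above can be filled in programmatically. The following is a minimal sketch, not an official tool: the function name `build_follow_up` and its parameter names are hypothetical, and it assumes `<Name>` is the recipient group and `XXXXXX` is the issue number. The signature block is left out so the operator can append their own.

```python
from string import Template

# Subject and body follow the wording of the template on this page.
SUBJECT = Template("Follow-up regarding phone call about issue $issue")
BODY = Template(
    "$recipient,\n\n"
    "This is a follow-up email regarding my phone conversation with $admin "
    "about $alert_type for the $system.\n\n"
    "$operator\n"
)


def build_follow_up(issue, recipient, admin, alert_type, system, operator):
    """Return (subject, body) for the follow-up email (hypothetical helper)."""
    subject = SUBJECT.substitute(issue=issue)
    body = BODY.substitute(
        recipient=recipient,
        admin=admin,
        alert_type=alert_type,
        system=system,
        operator=operator,
    )
    return subject, body
```

Remember that the email goes to the admin's group, not to the admin personally.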

...

Panel

borderColor	black
borderStyle	solid
title	Squared Up

Squared Up will notify Ops IOC of two different types of alerts: Xymon alerts and Network Device Alerts. Each is covered in its own subsection below.

Xymon Alerts:

Before calling on any alarm, review the following:

Locate the night's planned maintenance in the Footprints Change and Release Management workspace calendar to ensure the device is not scheduled to be down.
If a server is either specifically mentioned in an RFC or can be inferred to be part of an RFC, no calls should be placed to the group responsible for the system.
If the system is listed as anything other than production under ITIS services in the main Xymon view, do not call.

After a new Squared Up alert pops up for a production machine, wait 20 minutes. If the alert is still present after 20 minutes, Operations will need to call the group responsible for the system.
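
The checks above can be summarized as a small decision function. This is only an illustrative sketch: the `Alert` class, its field names, and `should_call` are hypothetical and do not correspond to anything in Squared Up, Xymon, or Footprints. The operator still has to look up each value by hand.

```python
from dataclasses import dataclass

CALL_DELAY_MINUTES = 20  # wait time stated in this procedure


@dataclass
class Alert:
    system: str
    environment: str              # as listed under ITIS services in Xymon
    minutes_active: int           # how long the alert has been present
    in_planned_maintenance: bool  # found on the maintenance calendar
    covered_by_rfc: bool          # named in, or inferred to be part of, an RFC


def should_call(alert: Alert) -> bool:
    """Return True only when every pre-call check on this page passes."""
    if alert.in_planned_maintenance or alert.covered_by_rfc:
        return False
    if alert.environment.lower() != "production":
        return False
    return alert.minutes_active >= CALL_DELAY_MINUTES
```

For example, `should_call(Alert("wserver01", "production", 25, False, False))` returns `True`, while the same alert on a test system or inside an RFC window returns `False`.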

...

Panel

borderColor	black
borderStyle	solid
title	Xymon

Xymon is the web-based monitoring tool of choice for Administrators and staff supporting ITSO Windows and Unix processes on the Purdue Campus. It performs a valuable duty, giving the trained eye a quick overview of the hardware and processes that may be out of sync for several key University areas.

In general, Operations staff will only need to respond to Critical (Red) alerts that affect production devices (Servers, Virtual Machines, etc.). All production Xymon alerts should also be reflected in Squared Up. Alarms for McAfee (mcshield.exe) should not be reported unless they last for several hours.

While Operations staff are primarily concerned with Critical (Red) alerts, they should also be familiar with the various other colors of Xymon alerts and their meanings:

Xymon icons.PNG

Note

title	mcshield.exe

If a system has a red CPU alert, check the process list to see whether mcshield is the cause. If it is, mcshield will be at the top of the list with a high CPU percentage.

This is McAfee running an anti-virus sweep on this machine. It will take up a lot of CPU cycles, but can be ignored.
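
As a rough illustration of the check described in this note, the sketch below looks for mcshield as the top CPU consumer in a process list. The function name, the `(name, cpu_percent)` input format, and the 50% cut-off for "a high percentage" are all assumptions made for the example. This page does not define a specific threshold.

```python
def is_mcshield_cpu_alert(processes, min_percent=50.0):
    """Return True if mcshield is the top CPU consumer at a high percentage.

    processes: list of (process_name, cpu_percent) tuples copied from the
    alert's process listing (hypothetical input format).
    min_percent: assumed cut-off for "high"; adjust to local practice.
    """
    if not processes:
        return False
    top_name, top_cpu = max(processes, key=lambda p: p[1])
    return top_name.lower().startswith("mcshield") and top_cpu >= min_percent
```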

1. After a new Xymon critical alert pops up for a production machine, begin considering which group to contact, if any. If the alert is still present after 20 minutes, Operations IOC will need to call the system owner.

2. Is the alert for a clustered device? Some systems, like Mailhub, are clustered, and thus can have several alerts before one needs to take action. Generally, clustered machines will NOT alarm in Squared Up for individual boxes, but they will in Xymon. This clue can help an operator determine the severity of the Xymon alert. Some clustered systems react in the opposite manner: they will alert in Squared Up but not in Xymon until critical mass is reached. In these cases, ensure enough machines are in alert before contacting the appropriate admins.

3. Locate the night's planned maintenance in the Footprints Change and Release Management Workspace Calendar to ensure the device is not scheduled to be down.

4. Xymon - Click the status icon (pictured below) along the row corresponding to the trouble server.

5. Some admins have placed instructions on this page about which alarms should be ignored or whom to contact (an example of these instructions is pictured below). Click on the server name to find further instructions on who to contact or whether the alarm should be ignored. Follow any special instructions for the machine OR use the Configuration Management Database (CMDB) to locate the appropriate on-call. Further instructions on the use of the CMDB can be found here: Footprints - Configuration Management

6. Call the on-call and inform them of the situation, affected device, and any other issues that may be cropping up due to the alert. Send a follow-up email. For CPU, memory, and disk alerts, paste the Xymon alert text into the email.

Warning
If there is no answer from the group's on-call number, leave a voicemail. Call back again in 10 minutes. If there is no answer from the group by phone or email after this time, consult with your supervisor or the on-call supervisor.

7. Log the contact and any appropriate follow-up activities.
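
The call, voicemail, and escalation sequence in steps 6 and 7 can be sketched as a simple state check. This is an illustration only: `next_step` and its arguments are hypothetical names, and the returned strings merely paraphrase the steps on this page.

```python
CALLBACK_MINUTES = 10  # callback interval stated in the warning above


def next_step(attempts_made, minutes_since_last_attempt, answered):
    """Suggest the next action when contacting a group's on-call."""
    if answered:
        return "send follow-up email and log the contact"
    if attempts_made == 0:
        return "call the on-call; leave a voicemail if there is no answer"
    if attempts_made == 1:
        if minutes_since_last_attempt < CALLBACK_MINUTES:
            return "wait before calling back"
        return "call back"
    # Two unanswered attempts: escalate per the warning above.
    return "consult your supervisor or the on-call supervisor"
```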

Find the group owner on the system page by:

Going to Xymon.
Finding the production status indicator (pictured below). It will be red if there is an alarm.
Clicking the red status indicator for the affected system.
Continuing through the sub-menus until you find the page.

If the page has special instructions at the top, make sure to follow these instructions when deciding if an admin should be contacted, and which admin should be contacted.

5. If the system name is clickable, there will be special instructions. Follow those instructions.

System name.png

These are often instructions about when NOT to call.

6. Once step 4 is complete, find the system in the Footprints Change and Release Management CMDB.

Search for the name of the system.
Click on the related CI(s) link.
Set the CI’s bubble button to “Named server”.
Click on the Managed by Relationship, and click “Go To”.
The on-call group information should be listed here.

7. If there is no information from the previous steps:

Search the communication log for the alarming system.
Review past correspondence and determine who to call.
If the server name starts with a “W”, it is most likely the Windows on-call.
If the server name starts with an “L”, it is most likely the Unix (Linux) on-call.
Consult with your coworkers.
Consult your supervisor or the on-call supervisor.
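
The naming rule of thumb in step 7 can be written as a one-line lookup. This is a sketch of the heuristic only: `guess_on_call` is a hypothetical helper, and since the page says "most likely", the result is a starting point and not a definitive owner.

```python
def guess_on_call(server_name):
    """Guess the likely on-call group from the first letter of a server name.

    Returns None when the name gives no hint; in that case fall back to
    coworkers or a supervisor, as listed above.
    """
    first = server_name.strip()[:1].upper()
    if first == "W":
        return "Windows on-call"
    if first == "L":
        return "Unix (Linux) on-call"
    return None
```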

Info

icon	false
title	Purple Alerts

Treat a production machine with numerous purple status alerts as if it were a red alert.

...

Version	Old Version 9	New Version 10
Changes made by	Sean L McLane (Unlicensed)	Jennifer Murray (Unlicensed)
Saved on	Dec 14, 2018	Dec 20, 2018

Versions Compared