IOC Monitoring Instructions

The IOC operator's duties are primarily ones of observation and proactive intervention. The job revolves around inputs from Xymon and Squared Up, as well as human feedback from sources such as the CSC, Classrooms, etc.

Table of Contents

Instructions for Contacting an Admin Regarding an Alarm/Outage
- Admin Follow-Up Email Template
What to Monitor
- On your workstation
- On the monitoring computers
Shift Log
Squared Up
- Xymon Alerts
  - Xymon
- Network Device Alerts
  - Network Incident On-Call Process
  - Information to provide when reporting network issues
Metasys
StruxureWare
Cameras
Siemens
UC4

Outage Types with Unique Instructions
- Exchange Outages
- McAfee ePO Updates
- ICS (TLT) Tool Page Failure
Instructions for Contacting an Admin Regarding an Alarm/Outage

Admin Follow-up Email Template

After calling an admin to report an issue, a follow-up email should be sent to the admin's group (not the admin themselves) using the following template:

Subject: "Follow-up regarding phone call about issue XXXXXX".

Body: <Name>,

This is a follow-up email regarding my phone conversation with <admin> about <Type of alert/Issue> for the <machine / system name>.

<Operator>
CSC/IOC Service Desk Specialist
(765) 496-7272
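
The template above can be filled in mechanically. The following is a minimal Python sketch of that substitution, not an existing IOC tool; the function name `render_follow_up` and its parameter names are hypothetical.

```python
def render_follow_up(issue_id, name, admin, alert_type, system, operator):
    """Return (subject, body) for the admin follow-up email template.

    All parameter names are illustrative; the wording comes from the template.
    """
    subject = f"Follow-up regarding phone call about issue {issue_id}"
    body = (
        f"{name},\n\n"
        "This is a follow-up email regarding my phone conversation with "
        f"{admin} about {alert_type} for the {system}.\n\n"
        f"{operator}\n"
        "CSC/IOC Service Desk Specialist\n"
        "(765) 496-7272"
    )
    return subject, body
```

Remember that the message goes to the admin's group address, not to the individual admin.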

What to Monitor

On your workstation:

Email (either Outlook or Outlook Web App)
IOC Log
Xymon
Footprints

On the monitoring computers:

Squared Up
UC4
Metasys
StruxureWare
Siemens
StationStatus

Shift Log

The IOC Shift Log (found here) is how records are kept during the shift and between shifts. While you are monitoring, one of your duties is to maintain the log. This includes being aware of existing log entries and their implications, making new log entries when certain events occur, and updating log entries as their status changes or new information is reported. The following events should be noted in the log when they occur:

All communications (e-mail, in person, phone, etc.) between the IOC and other groups.
Service Alerts
UC4 Abends

Further information on the process for updating and maintaining the shift log can be found here: IOC: Shift Log Entries.

Squared Up

Squared Up will notify the IOC of two different types of alerts: Xymon alarms (covered in the next subsection) and Network Device Alerts (covered in a later subsection).

Xymon Alerts:

Before calling on any alarm, review the following:

Locate the night's planned maintenance in the Footprints Change and Release Management workspace calendar to ensure the device is not scheduled to be down.
If a server is either specifically mentioned in an RFC or can be inferred to be part of an RFC, no calls should be placed to the group responsible for the system.
If the system is listed as anything other than production under ITIS services in the main view in Xymon, do not call.

If a new Squared Up alert for a production machine is still present after 20 minutes, Operations will need to call the group responsible for the system.
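
The checks above amount to a short decision rule. The sketch below is one way to express it in Python, purely as an illustration; the `Alert` class, its field names, and `should_call` are hypothetical and do not correspond to any Squared Up or Xymon API.

```python
from dataclasses import dataclass

CALL_THRESHOLD_MINUTES = 20  # from the procedure above


@dataclass
class Alert:
    host: str
    environment: str              # as listed under ITIS services in Xymon
    minutes_active: int           # how long the alert has been present
    in_planned_maintenance: bool  # found on the Change and Release calendar
    covered_by_rfc: bool          # named in, or inferable from, an RFC


def should_call(alert: Alert) -> bool:
    """Return True when the responsible group should be called."""
    if alert.in_planned_maintenance or alert.covered_by_rfc:
        return False
    if alert.environment.lower() != "production":
        return False
    return alert.minutes_active >= CALL_THRESHOLD_MINUTES
```

For example, a production alert that has been active for 25 minutes with no maintenance or RFC coverage returns True, while the same alert at 10 minutes returns False.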

Xymon

Xymon is the web-based monitoring tool of choice for administrators and staff supporting ITSO Windows and Unix processes on the Purdue campus. It gives the trained eye a quick overview of the hardware and processes that may be out of sync for several key University areas.

In general, Operations staff will only need to respond to Critical (Red) alerts that affect production devices (servers, virtual machines, etc.). All production Xymon alerts should also be reflected in Squared Up. Alarms for McAfee (mcshield.exe) should not be reported unless they last for several hours.

While Operations staff are primarily concerned with Critical (Red) alerts, they should also be familiar with the various other colors of Xymon alerts and their meanings:

Xymon icons.PNG

1. After a new Xymon critical alert pops up for a production machine, begin considering which group to contact, if any. If the alert is still present after 20 minutes IOC will need to call the system owner.

2. Is the alert for a clustered device? Some systems, like Mailhub, are clustered, and thus can have several alerts before one needs to take action. Generally, clustered machines will NOT alarm in Squared Up for individual boxes, but they will in Xymon. This clue can help an operator determine the severity of the Xymon alert. Some clustered systems react in the opposite manner: they will alert in Squared Up but not Xymon until critical mass is reached. In these cases, ensure enough machines are in alert before contacting the appropriate admins.

3. Locate the night's planned maintenance in the Footprints Change and Release Management Workspace Calendar to ensure the device is not scheduled to be down.

4. Xymon - Click the status icon (pictured below) along the row corresponding to the trouble server.

5. Click on the server name to find further instructions on who to contact or whether the alarm should be ignored. Follow any special instructions for the machine OR use the Configuration Management Database (CMDB) to locate the appropriate on-call. For more on how to use the CMDB, see Configuration Management Database (CMDB).

6. Call the on-call and inform them of the situation, affected device, and any other issues that may be cropping up due to the alert. Send a follow-up email. For CPU, memory, and disk alerts, paste the Xymon alert text into the email.

If there is no answer from the group's on-call number, leave a voicemail. Call back again in 20 minutes. If there is no answer from the group by phone or email after this time, contact the group's manager. If you have questions, consult the CSC/IOC supervisor.

7. Log the contact, and appropriate follow-up activities.

How to find the system owner (Admin to call):

1. If the system name is clickable, there will be special instructions. Follow those instructions.

System name.png

These are often instructions about when NOT to call.

2. Find the system in the Footprints Change and Release management CMDB.

Search for the name of the system
Click on the related CI(s) link.
Click the bubble button CI’s to “Named server”
Click on the Managed by Relationship, and click “Go To”
The on call group information should be listed here.

3. If there is no information from the previous steps

Search the communication log for the alarming system.
Review past correspondence and determine who to call.
If the server name starts with a “W”, it is most likely the Windows on-call.
If the server name starts with an “L”, it is most likely the Unix (Linux) on-call.
Consult with your coworkers.
Consult your supervisor, or on call supervisor.
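
The naming hint in step 3 can be written as a small helper. This is only a sketch of a last-resort heuristic, assuming the check is case-insensitive (the text does not say); `guess_on_call` is a hypothetical name, and special instructions or the CMDB always take precedence.

```python
from typing import Optional


def guess_on_call(server_name: str) -> Optional[str]:
    """Guess the likely on-call group from the first letter of a server name.

    Returns None when the name gives no hint, in which case fall back to the
    communication log, coworkers, or a supervisor.
    """
    first = server_name.strip()[:1].upper()  # assumption: case does not matter
    if first == "W":
        return "Windows on-call"
    if first == "L":
        return "Unix (Linux) on-call"
    return None
```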

Purple Alerts

Treat a production machine with numerous purple status alerts as if it were a red alert.

Network Device Alerts

Locate the night's planned maintenance in the Footprints Change and Release Management workspace calendar to ensure the device is not scheduled to be down.
Action should be considered for any Squared Up alarm that remains in the Normal Operations Network View for 20 minutes or longer. Take the appropriate steps as outlined in the Network Incident On-Call Process section of this document.
Action should be taken for network issues reported by I-Light (GigaPOP), the ITaP help desk, campus personnel, students, and/or visitors.

8:00 AM to 5:00 PM, Monday through Friday:

Issue	Action
Wireless/PAL issues that are being experienced by multiple clients. Reports of multiple data PIC service issues in an area. Report of single HIGH PRIORITY data PIC or wireless PAL outage. Call from I-Light / GigaPOP	Follow the normal on-call procedures

Outside of 8:00 AM to 5:00 PM, Monday through Friday:

Issue	Action
Wireless/PAL issues that are being experienced by multiple clients in different buildings.	Send an email to itns-pdnhlog-ext@lists.purdue.edu to notify Data Networking. If this issue seems to be high priority or widespread it may be justified to escalate to our normal on-call procedures.
Reports of multiple data PIC service issues in an area. Report of single HIGH PRIORITY data PIC or wireless PAL outage. Call from I-Light / GigaPOP	Follow the normal on-call procedures.
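
The two tables above reduce to a single lookup: everything goes to the normal on-call procedures except after-hours wireless/PAL reports, which start with an email. A minimal Python sketch follows; the issue labels and the name `network_action` are hypothetical, chosen only for this illustration.

```python
ON_CALL = "follow the normal on-call procedures"
EMAIL_LIST = "email itns-pdnhlog-ext@lists.purdue.edu"

# Hypothetical labels for the issue types named in the tables above.
KNOWN_ISSUES = {
    "wireless_multiple_clients",    # after hours: clients in different buildings
    "multiple_pic_issues",          # multiple data PIC issues in an area
    "high_priority_single_outage",  # single HIGH PRIORITY data PIC or PAL outage
    "ilight_call",                  # call from I-Light / GigaPOP
}


def network_action(issue: str, business_hours: bool) -> str:
    """Return the first action for a reported network issue."""
    if issue not in KNOWN_ISSUES:
        raise ValueError(f"unknown issue type: {issue}")
    if issue == "wireless_multiple_clients" and not business_hours:
        # May still be escalated to on-call if high priority or widespread.
        return EMAIL_LIST
    return ON_CALL
```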

Network Incident On-Call Process

For all SQUARED UP alarms and trouble calls follow the on-call triage process:

For alarms originating from StruxureWare, DCM (Data Center Management) personnel should be notified in addition to proceeding with the Data Networking (ITIS) notification steps outlined below.

1. Verify that the building that is showing the alarm has electrical power. Pass this information to the on-call and include it in the follow-up email.
To determine whether a device is down on Squared Up due to a power outage, start by checking the middle screen, which Networking maintains.
• If it is on this list, do not call.
• If a building near a building on the power outage list is on Squared Up, do not call. (Only call if it is still showing on Squared Up once the listed power outage time is over.)
(You can use the Purdue Campus Map, https://www.purdue.edu/campus_map/, to see where buildings are located.)
2. For alarms originating from StruxureWare, DCM personnel should be notified in addition to proceeding with the Data Networking notification steps outlined below.
3. Call the Data Networking (ITIS) Primary On-Call phone (765-494-1591). If there is no answer, leave a voicemail and wait 5 minutes before proceeding.
4. Create a Footprints ticket and assign it to ITIS Networking. Put the ticket number in the follow-up email. Use the on-call's name as the contact info.
5. Send an email to itns-pdnhlog-ext@lists.purdue.edu. Justin McIntyre would also like to be cc'd on the email: mcintyrj@purdue.edu
You do not need to send another email each time you try to reach someone. If you leave a message, include this sentence in the body of the email: "We will escalate the on-call process if there is no response within 5 minutes."
6. Call the Data Networking (ITIS) Secondary On-Call phone (765-494-1530). If there is no answer, leave a voicemail and wait 5 minutes before proceeding.
7. Call Justin McIntyre at 630-675-7640. If there is no answer, call Richard Letts at 206-790-5837. Wait 5 minutes before proceeding to Step 8.
8. If none of the individuals from Steps 3 through 6 above has responded, repeat those steps until contact is made.

Special Notes:

Data Networking personnel who are contacted by Operations staff are responsible for providing issue status updates to Operations in a timely fashion. As a general rule this means that such feedback should be provided once you start investigating an issue, whenever an ETA for resolution has been determined and again when the issue has been resolved. Additional updates are always welcome as well, especially for extended outages.
If Squared Up itself goes down for more than 20 minutes, in whole or in part (Xymon, network devices, or the whole page), let Data Networks know that we cannot monitor their devices while it is down. Also send a ticket to the ITAP_Networking_II queue reporting that it is down; Networking now supports Squared Up. After you call, follow the normal procedure, i.e., follow-up email and log entry.

APC Alerts: Ignore all devices that have "APC" in the title. DO NOT email or call about these alerts. If there is a StruxureWare alert associated with it, please follow DCM guidelines.

Ignore the following list of buildings only outside of normal business hours:

Outside of business hours (defined below), do not call the Data Networks on-call staff for Squared Up alarms in any of the locations listed in the table below. If the hostname of the alarm contains a value from the “SquaredUp Value” column, the alarm should be ignored. These values are fairly static, but if any updates are needed we will forward you an updated list.

- Business Hours – Monday through Friday – 7:00 AM to 5:00 PM
- (Official Purdue Holidays and Weekends Adjoining those Holiday Dates are Considered to be Non-Business Hours, Superseding the Monday through Friday Standard)
- Example: 5/25/19 through 5/27/19 (Memorial Day and the weekend adjoining it) would not be considered business hours


Building Short Name	Building Long Name	SquaredUp Value	Reason for Ignoring
-	Any APC Devices	"-apc******-"	Always ignore APCs in Squared Up
-	Any Device with the word "test"	"test"	Test devices are used for testing purposes and not on the production network.
-	Any tents	"-tent****-"	Tents are subject to tampering/environment issues and are hard to access res hall buildings after hours
844S	844 South River Road	"844s-"	No access, users aren't there after hours
AC22	Field Research Facility (ACRE Farm)	"ac22-"	No access, users aren't there after hours, feeds off of BECK
AC35	Pest Lab and Storage Facility (ACRE Farm)	"ac35-"	No access, users aren't there after hours, feeds off of BECK
AC41	Grain Drying Complex - Grain Auger (ACRE Farm)	"ac41-"	No access, users aren't there after hours, feeds off of BECK
AC42	Scales House (ACRE Farm)	"ac42-"	No access, users aren't there after hours, feeds off of BECK
AC43	USDA Soybean Research Lab (ACRE Farm)	"ac43-"	No access, users aren't there after hours, feeds off of BECK
AC44	USDA Rainulator Building Soil Erosion (ACRE Farm)	"ac44-"	No access, users aren't there after hours, feeds off of BECK
AC45	Var Test Facility (ACRE Farm)	"ac45-"	No access, users aren't there after hours, feeds off of BECK
AC46	Headquarters and Shop (ACRE Farm)	"ac46-"	No access, users aren't there after hours, feeds off of BECK
AC51	Weather Facility (ACRE Farm)	"ac51-"	No access, users aren't there after hours, feeds off of BECK
AC54	Crop Diagnostic Training Center (ACRE Farm)	"ac54-"	No access, users aren't there after hours, feeds off of BECK
AF01	Aquaculture (ASREC Farm)	"af01-"	No access, users aren't there after hours, feeds off of BECK
AFC	Anderson Flagship Center	"afc-"	Far remote site, no access
AIDC	Agricultural Information Distribution Center	"aidc-"	No access, users aren't there after hours
ASB	Airport Service Building	"asb-"	No access, airport hangar
B201	Swine Evaluation Headquarters (ASREC Farm)	"b201-"	No access, users aren't there after hours, feeds off of BECK
B401	Poultry and Hatchery Facility (ASREC Farm)	"b401-"	No access, users aren't there after hours, feeds off of BECK
B501	Sheep Research and Teaching Facility (ASREC Farm)	"b501-"	No access, users aren't there after hours, feeds off of BECK
B701	Swine Office Metabolism Facility (ASREC Farm)	"b701-"	No access, users aren't there after hours, feeds off of BECK
B713	Environmental Research Facility (ASREC Farm)	"b713-"	No access, users aren't there after hours, feeds off of BECK
B801	Farm Operations Shop and Headquarters (ASREC Farm)	"b801-"	No access, users aren't there after hours, feeds off of BECK
B901	Teaching Center and Classroom (ASREC Farm)	"b901-"	No access, users aren't there after hours, feeds off of BECK
BBCH	Purdue Baseball Clubhouse	"bbch-"	Seasonal
BBPB	Purdue Baseball Press Box	"bbpb-"	Seasonal
BTV	Boiler Television Building	"btv-"	No access
CB10	Beef Building (ASREC Farm)	"cb10-"	No access, users aren't there after hours, feeds off of BECK
COAL	Coal Handling Control/Fire Pump Building	"coal-"	No access
FSHR	TAP Fishers Remote Site	"fshr"	Charles Garwood, no access, remote site
GCMB	Golf Course Maintenance Barn	"gcmb-"	No access
GMF	Grounds Maintenance Facility	"gmf-"	No access
GMGF	Grounds Maintenance Greenhouse Facility	"gmgf-"	No access
ICSC	Indiana Corn and Soybean Innovation Center (ACRE Farm)	"icsc-"	No access, users aren't there after hours, feeds off of BECK
IDOT	Indiana Department of Transportation	"idot-"	Andy Sydelko says they can be ignored
INDY	Indianapolis External Site	"indy-"	Gro site - Susan Brock says ignore \| Esk site - Mark Sharp says ignore
INOK	Investments Warehouse	"inok-"	Andy Sydelko says they can be ignored
INSS	Intramural Storage Shed	"inss-"	No access, only one user
KKM	TAP Kokomo Site	"kkm-"	Charles Garwood, no access, remote site
LMSB	Laboratory Material Storage Building	"lmsb-"	No access
NA	TAP New Albany Site	"na-"	Charles Garwood, no access, remote site
NACC	Native American Educational and Cultural Center	"nacc-"	No access
PAGE	Thomas A. Page Pavilion	"page-"	No access
PWB	Purdue West Annex - Building B	"pwb-"	Dennis Lord - Ignore outside of 7:30AM to 4:30 PM
PWC	Purdue West - Building C	"pwc-"	Dennis Lord - Ignore outside of 7:30AM to 4:30 PM
RALR	Stadium Area - Visiting Team Locker Room	"ralr-"	No access, Seasonal
SBCH	Purdue Softball Clubhouse	"sbch-"	Seasonal
SBPB	Purdue Softball Press Box	"sbpb-"	Seasonal
SCHO	Schowe House	"scho-"	No access
SD02	Dairy Research Unit (ASREC Farm)	"sd02-"	No access, users aren't there after hours, feeds off of BECK
SIA	Subaru of Indiana Automotive	"sia-"	No access
SOCC	Purdue Women’s Soccer Building	"socc-"	No access
SPUR	Spurgeon Golf Training Center	"spur-"	No access
SWNA	State Wide New Albany	"swna-"	Off campus
SWSB	State Wide South Bend	"swsb-"	Off campus
TM02	Throckmorton Pesticide Building	"tm02-"	No access, users aren't there after hours
TM08	Throckmorton Meigs Building	"tm08-"	No access, users aren't there after hours
TM11	Throckmorton Fruit Barn	"tm11-"	No access, users aren't there after hours
TM36	Throckmorton Farm Crop Barn	"tm36-"	No access, users aren't there after hours
TPB	Rankin Track Press Box	"tpb-"	Seasonal
TURF	Intercollegiate Athletic Sports Turf Building	"turf-"	No access, users aren't there after hours
UNPD	University Police Department	"unpd-"	This is a backup connection, Mick Kiefer says ignore outside business hours
USDA	USDA Building 1 (ASREC Farm)	"usda-"	No access, users aren't there after hours, feeds off of BECK
VCPR	Veterinary Center for Paralysis Research	"vcpr-"	Dennis Barnett says ignore until business hours
VLAB	Veterinary Laboratory Animal Building	"vlab-"	Dennis Barnett says ignore until business hours
VOIN	Voinoff (Samuel) Golf Pavilion	"voin-"	Seasonal, no access
VTCH	Vision Technology 1 Building	"vtch-"	No access, not a Purdue building
VPRB	Veterinary Pathobiology Research Building	"vprb-"	Dennis Barnett says ignore until business hours
WH9	Well House 9	"wh9-"	One user, no access
WRIT	John S Wright Forestry Center	"writ-"	No access, users aren't there after hours
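
The ignore rule above combines a time check with a substring match on the hostname. The Python sketch below illustrates that logic under stated assumptions: only a few building entries from the table are included, holiday dates (and any adjoining weekend dates) must be supplied by the caller, 5:00 PM itself is treated as outside business hours, and the function names and the `ac22-101-c3560-01` hostname are hypothetical.

```python
from datetime import date, datetime

# A small sample of the "SquaredUp Value" column; extend from the table above.
IGNORE_VALUES = ["844s-", "ac22-", "afc-", "kkm-"]


def is_business_hours(when: datetime, holidays=frozenset()) -> bool:
    """Monday through Friday, 7:00 AM to 5:00 PM, excluding holiday dates."""
    if when.date() in holidays:
        return False
    if when.weekday() >= 5:  # Saturday or Sunday
        return False
    return 7 <= when.hour < 17


def should_ignore(hostname: str, when: datetime, holidays=frozenset()) -> bool:
    """True when an after-hours alarm's hostname contains a listed value."""
    if is_business_hours(when, holidays):
        return False
    name = hostname.lower()
    return any(value in name for value in IGNORE_VALUES)
```

For instance, `should_ignore("ac22-101-c3560-01", datetime(2019, 5, 27, 10, 0), holidays={date(2019, 5, 27)})` treats Memorial Day morning as non-business hours and ignores the alarm.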

Information to Provide When Reporting Network Issues

For Squared Up alarms, provide the following information from Squared Up when reporting the issue:

Date/Time that alarm started (Example: October 26, 10:00 AM)
Affected Device Name (Example: mrdh-285n-c2950-01)
Last Ping (Example: 2017-11-01 11:14:43)

For network issues reported by I-Light / GigaPOP or the ITaP help desk, campus personnel, students, or visitors, report the following information:

Date/Time that issue began or was first noticed (Example: October 26, 10:00 AM)
Affected service (Examples: wireless/PAL, data PIC(s), or I-Light / GigaPOP call back)
Location where problem is occurring (Examples: wireless on 2nd floor of Armstrong, all data PICs in the Forestry building, or I-Light / GigaPOP). Whenever possible obtain a specific building and nearest room number.
Name and phone number (or at least email) of person experiencing/reporting problem

More details can be found on the attached document below:

ITIS-Data Network Incident Resolution Process.docx

Metasys

If a yellow alarm window opens for the event or audit logs becoming full, send mail to iti-dcm@purdue.edu and acknowledge the alert.
If there is an AirStack alarm (red circle) or any red alarm window opens in Metasys, call Todd Turner at 68214 (this number should be used at all times.) Leave voicemail if Todd does not answer.

If Todd does not answer or respond through email after 20 minutes, call Patrick at 765-427-3020 (C), 6-1752 (W), 765-421-6069 (H). If he does not answer, call Jon Miller at 765-414-7646.

Once the problem has been reported you can acknowledge an alarm window to close it.
PF staff may call us from the site to ask that we contact Todd (68214) or contact the vendor, Keller-Rivest. Contact information for them is listed in the Rolodex. Call them in the order listed until you reach someone.

StruxureWare

Check calendar and Log for any special instructions regarding these systems.
Firmware/Software update pop ups - Send e-mail to ITIS Data Center Management.
Humidity alarms (low or high) in StruxureWare should be reported by email only - no phone call needed.
Dew Point alarms should be reported by email only.
Battery alarms in StruxureWare for UPSs (devices with "apc" or "trp" in the hostname) should be reported by email only - no phone call needed.
Red alarms - Call Todd Turner (68214).
- Data Network alarms for the buildings LAMB, LYNN, ERHT, and TEL need to be reported to Data Networks. (I believe this changed to just DCM now - I will follow up)
- “Device status may be inaccurate because an attempt to transfer a device definition file (DDF) failed” alarm. Right-click on the device and select "Request device scan" (per the email received by IOC on 05/06/2018).
  - Call Todd Turner (68214) if the step above doesn't clear the alarm. If Todd does not answer or respond through email after 20 minutes, call Patrick at 765-427-3020 (C), 6-1752 (W), 765-421-6069 (H). If he does not answer, call Jon Miller at 765-414-7646.
Communication lost, Connection, or Timeout errors

1. If ALL devices in a room are down, call right away.
2. Ping the IP address. If it pings, the alarm should clear. If it does not ping or clear after 10 minutes, move to step 3.
3. Contact Data Center Management via phone and email.
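
The three steps above can be read as a small decision function. The sketch below is illustrative only: `next_action` and its return strings are hypothetical, and the ping result and alarm state are passed in by the operator rather than measured by the code.

```python
WAIT_LIMIT_MINUTES = 10  # from step 2 above


def next_action(all_devices_down: bool, pings: bool,
                alarm_cleared: bool, minutes_elapsed: int) -> str:
    """Return the operator's next action for a communication-type alarm."""
    if all_devices_down:
        return "contact DCM by phone and email"  # step 1: call right away
    if pings and alarm_cleared:
        return "no further action"
    if minutes_elapsed >= WAIT_LIMIT_MINUTES:
        return "contact DCM by phone and email"  # step 3
    return "keep watching"
```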

Physical Facilities will call to report PMs on the CRAC & generator units listed below. Send email to ITI Data Center Management at iti-dcm@purdue.edu when they start and stop.
This protocol also pertains to when they come in on the weekends for a key. Ask what they are working on and send email to ITI Data Center Management.

Include the building, room, device name, and Physical Facilities technician name in the e-mail.
This is only for the equipment listed below. In all other cases dealing with Physical Facilities requests for permission to do work call Todd Turner at 496-8214.

Generator Test TEL Nodes

ERHT
LAMB
LYNN
TEL

CRAC units Data Centers

FREH G2 CRAC #1,2,3,4
FREH G57 CRAC 1
FREH G60 CRAC 1, ACG-2
HAAS CRAC 1,2,3
MATH B60 CRAC 1,2,3
MATH G72 Chiller
MATH G109 CRAC 5,6,13
MATH G190 CRAC 1,12,32

TEL Nodes

ERHT 5 CRAC 1,2
LAMB 20 ACG-20,21
LYNN G168 ACG-40, 41
TEL 210 CRAC 1,2,3

Cameras

StruxureWare also includes monitoring functionality for the cameras in the data centers. This view in StruxureWare should be open at all times on the large screen.

Please monitor the cameras from time to time to make sure nothing suspicious is going on. Check the Shift Log, Change and Release Management Workspace Calendar and your email for scheduled work. If it appears that the occupants do not belong in the room or are removing things during the night, notify PUPD (48221) to investigate. You will also need to call Todd Turner if you do call PUPD, or if you feel unsure.
HAAS is considered a lights-out facility with several groups who have card access. The building fire panel is located in the datacenter. The Fire & Safety group works Midnight to 8am doing panel tests across campus, and you will see them or the PUFD in the room from time-to-time.

Siemens

NOTE: There is no audio alarm for Siemens.

If the Alarms button turns red, call Patrick Finnegan.

Cell: 765-427-3020
Office: 61752
Home: 765-421-6069

Attempt to call their cell first. If you cannot reach Patrick by any of those numbers, call Todd Turner at 68214.

If the "Device Failures" button turns red, try closing Siemens completely and then restarting it.
- If Siemens is then unable to start, call Patrick Finnegan at the above listed numbers.

If the IOC receives an automated call regarding the MATH back-up chiller:

For the starting call, please listen to and acknowledge the message, then send a notification email to itidatacentermanagement@purdue.edu.
If there are any Siemens alarms that follow the back-up chiller alarm, call and email the Data Center Management on-call list.

UC4

General Job Failure

Production Control must be contacted within 30 minutes.

Between 08:00 – 17:00 on Monday – Friday (during non-holiday workdays), do not call the PCA; only send a report utilizing the shift log abend reporting tool. Follow-up emails are not necessary due to the log automatically sending out an email once the abend has been reported.
Otherwise, call the PCA (Production Control Analyst) at the number listed within UC4’s console (60287).
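
The time-of-day rule for general job failures can be sketched as follows. This is a hedged illustration, not part of UC4 or the shift log tool; `abend_action` and its return strings are hypothetical, holidays are supplied by the caller, and 17:00 itself is treated as outside the window.

```python
from datetime import datetime

CONTACT_DEADLINE_MINUTES = 30  # Production Control must be contacted within this time


def abend_action(when: datetime, is_holiday: bool = False) -> str:
    """Return how to report a UC4 general job failure at the given time."""
    workday = when.weekday() < 5 and not is_holiday  # Monday through Friday
    if workday and 8 <= when.hour < 17:
        return "report with the shift log abend reporting tool"
    return "call the PCA"
```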

Down Agent

Immediately call the PCA listed within UC4’s console (60287) and send a follow-up email.

File Transfer Failure

Do not restart items with parentheses. These are child processes of the item that needs to be restarted.
Click ‘Restart’ once.
If it fails again, follow General Job Failure instructions.

Notifications

React as necessary following any notifications or instructions.
1. If needed, contact PCAs as with General Job Failures.

Further information regarding UC4’s configuration and operation can be found here.

Outage Types with Unique Instructions

Exchange Outage

Exchange outages are very similar to other outages, with one exception. When Exchange goes down, Operations staff members are required to inform Purdue Police, Purdue Fire, and the CSC. Contact these groups at both the start and the end of the outage. It may be appropriate to provide updates as the situation warrants.

24/7 call: Exchange, Purdue Police (48221), Purdue Fire (46919), CSC.

McAfee ePO Updates

There may be times when various campus system admins need exceptions or policies changed in McAfee ePolicy Orchestrator. With the most recent ePO system we are limiting their permission sets, so in an emergency they need to be able to contact someone from the Security Engineering team to implement the changes for them. Follow on-call procedures when an admin calls the IOC with a request like this.

ICS (TLT) Tool Page Failure - Tool pages stop working on https://lslab.ics.purdue.edu/icsWeb/a/tools/

Confirm the issue is not limited to the operator by testing with a zone account or by another method, such as asking a coworker to try the tool pages.
Use the CMDB to call the appropriate group.
Perform the standard phone introduction, inform the on-call of the problem, and gather their name.
Request the on-call restart the ICS Tool Pages on https://lslab.ics.purdue.edu/icsWeb/a/tools/ . Generally they will request to call back in several minutes. Please provide a call back number, especially if the outage line is not within reach.
All ICS Tool Page failures require a Global Service Alert.