The IOC operator's duties are primarily ones of observation and proactive intervention. The job revolves around inputs from Xymon and Squared Up, as well as human feedback from sources such as the CSC, Classrooms, etc.
...
Panel |
---|
title | Xymon |
---|
|
Xymon is the web-based monitoring tool of choice for administrators and staff supporting ITSO Windows and Unix processes on the Purdue campus. It gives the trained eye a quick overview of the hardware and processes that may be out of sync for several key University areas. In general, Operations staff only need to respond to Critical (Red) alerts that affect production devices (servers, virtual machines, etc.). All production Xymon alerts should also be reflected in Squared Up. Alarms for McAfee (mcshield.exe) should not be reported unless they last for several hours. While Operations staff are primarily concerned with Critical (Red) alerts, they should also be familiar with the other Xymon alert colors and their meanings: 
Note |
---|
|
|
If a system has a red CPU alert, check the processes to see whether it is due to mcshield; it will be at the top of the process list with a high CPU percentage.
This is McAfee running an anti-virus sweep on the machine. It will take up a lot of CPU cycles, but can be ignored.
1. After a new Xymon critical alert pops up for a production machine, begin considering which group to contact, if any. If the alert is still present after 20 minutes, IOC will need to call the system owner.
2. Is the alert for a clustered device? Some systems, like Mailhub, are clustered, and thus can have several alerts before one needs to take action. Generally, clustered machines will NOT alarm in Squared Up for individual boxes, but they will in Xymon. This clue can help an operator determine the severity of the Xymon alert. Some clustered systems react in the opposite manner: they will alert in Squared Up but not in Xymon until critical mass is reached. In these cases, ensure enough machines are in alert before contacting the appropriate admins.
3. Locate the night's planned maintenance in the Footprints Change and Release Management Workspace Calendar to ensure the device is not scheduled to be down.
4. Xymon - Click the status icon (pictured below) along the row corresponding to the trouble server.

5. Click on the server name to find further instructions on who to contact or whether the alarm should be ignored. Follow any special instructions for the machine OR use the Configuration Management Database (CMDB) to locate the appropriate on-call. For more on how to use the CMDB, see: Configuration Management Database (CMDB).
6. Call the on-call and inform them of the situation, affected device, and any other issues that may be cropping up due to the alert. Send a follow-up email. For CPU, memory, and disk alerts, paste the Xymon alert text into the email.
Warning |
---|
If there is no answer from the group's on-call number, leave a voicemail. Call back again in 20 minutes. If there is no answer from the group by phone or email after this time, contact the group's manager. If you have questions, consult with the CSC / IOC supervisor. |
7. Log the contact, and appropriate follow-up activities.
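The timing rule in the steps above can be sketched in a few lines of Python. This is an illustration only, not an IOC tool: the `XymonAlert` fields and the `hosts_in_maintenance` set are hypothetical stand-ins for what an operator reads from Xymon and the Footprints calendar. Cluster behavior and per-machine special instructions are not modelled.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# From the procedure: call the owner if the alert is still present after 20 minutes
CALL_AFTER = timedelta(minutes=20)


@dataclass
class XymonAlert:
    # Hypothetical field names for illustration, not a Xymon data format
    host: str
    color: str          # "red", "yellow", "purple", ...
    production: bool
    first_seen: datetime


def should_call_owner(alert, now, hosts_in_maintenance):
    """Return True when the steps above say IOC should call the system owner."""
    if alert.color != "red" or not alert.production:
        return False    # only Critical (Red) alerts on production devices
    if alert.host in hosts_in_maintenance:
        return False    # device is scheduled to be down tonight
    return now - alert.first_seen >= CALL_AFTER


start = datetime(2024, 1, 8, 22, 0)
alert = XymonAlert("wexample01", "red", True, start)
print(should_call_owner(alert, start + timedelta(minutes=5), set()))
print(should_call_owner(alert, start + timedelta(minutes=25), set()))
print(should_call_owner(alert, start + timedelta(minutes=25), {"wexample01"}))
```

The three calls cover an alert that is too new, one that has passed the 20-minute mark, and one on a host with planned maintenance.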
How to find the system owner (Admin to call):
1. If the system name is clickable, there will be special instructions. Follow those instructions.

- These instructions often describe when NOT to call.

2. Find the system in the Footprints Change and Release management CMDB.
- Search for the name of the system
- Click on the related CI(s) link.
- Click the bubble button CI’s to “Named server”
- Click on the Managed by Relationship, and click “Go To”
- The on call group information should be listed here.
3. If there is no information from the previous steps:
- Search the communication log for the alarming system.
- Review past correspondence and determine who to call.
- If the server name starts with a “W”, it is most likely the Windows on-call.
- If the server name starts with an “L”, it is most likely the Unix (Linux) on-call.
- Consult with your coworkers.
- Consult your supervisor, or on call supervisor.
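The name-prefix rule of thumb from step 3 can be written as a small helper. This is a sketch: the function name and return strings are made up for illustration, and the result is only a guess to confirm against the communication log, coworkers, or a supervisor.

```python
def guess_on_call_group(server_name):
    """Last-resort guess at the on-call group from the first letter of the name."""
    first = server_name.strip()[:1].upper()
    if first == "W":
        return "Windows on-call"
    if first == "L":
        return "Unix (Linux) on-call"
    return None  # no guess: check the log, coworkers, or a supervisor


for name in ["wexample01", "lexample02", "mailhub"]:
    print(name, "->", guess_on_call_group(name))
```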
Info |
---|
icon | false |
---|
title | Purple Alerts |
---|
|
Treat a production machine with numerous purple status alerts as if it were a red alert. |
...
Panel |
---|
title | Network Device Alerts |
---|
|
Note |
---|
- Locate the night's planned maintenance in the Footprints Change and Release Management workspace calendar to ensure the device is not scheduled to be down.
- Action should be considered for any Squared Up alarm that shows up in the Normal Operations Network View for 20 minutes or longer in duration. The appropriate steps should be taken as outlined in Section II, Incident On-Call Process found elsewhere in this document.
- Action should be taken for network issues reported by I-Light (GigaPOP), the ITaP help desk, campus personnel, students, and/or visitors.
|
8:00 AM to 5:00 PM, Monday through Friday: Issue | Action |
---|
- Wireless/PAL issues that are being experienced by multiple clients.
- Reports of multiple data PIC service issues in an area.
- Report of single HIGH PRIORITY data PIC or wireless PAL outage.
- Call from I-Light / GigaPOP
| - Follow the normal on-call procedures
|
Outside of 8:00 AM to 5:00 PM, Monday through Friday: Issue | Action |
---|
- Wireless/PAL issues that are being experienced by multiple clients in different buildings.
| - Send an email to itns-pdnhlog-ext@lists.purdue.edu to notify Data Networking. If this issue seems to be high priority or widespread it may be justified to escalate to our normal on-call procedures.
| - Reports of multiple data PIC service issues in an area.
- Report of single HIGH PRIORITY data PIC or wireless PAL outage.
- Call from I-Light / GigaPOP
| - Follow the normal on-call procedures.
|
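The two Issue/Action tables above reduce to one decision: only the wireless/PAL row changes outside of 8:00 AM to 5:00 PM, Monday through Friday. A minimal sketch follows; the issue keys are hypothetical labels for the table rows, and treating 5:00 PM exactly as outside the window is an assumption. Outside those hours the wireless row applies to clients in different buildings, which the caller is expected to have checked.

```python
from datetime import datetime

ON_CALL = "follow the normal on-call procedures"
EMAIL = "email Data Networking; escalate to on-call if high priority or widespread"

# Hypothetical labels for the rows in the tables above
ISSUES = {
    "wireless_multiple_clients",
    "multiple_pic_issues",
    "high_priority_single_outage",
    "ilight_gigapop_call",
}


def in_table_hours(when):
    """Monday through Friday, 8:00 AM up to (not including) 5:00 PM."""
    return when.weekday() < 5 and 8 <= when.hour < 17


def network_action(issue, when):
    if issue not in ISSUES:
        raise ValueError(f"unknown issue type: {issue}")
    if issue == "wireless_multiple_clients" and not in_table_hours(when):
        return EMAIL
    return ON_CALL


print(network_action("wireless_multiple_clients", datetime(2024, 1, 8, 10, 0)))
print(network_action("wireless_multiple_clients", datetime(2024, 1, 8, 22, 0)))
print(network_action("ilight_gigapop_call", datetime(2024, 1, 6, 3, 0)))
```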
Anchor |
---|
| networkincident |
---|
| networkincident |
---|
|
Network Incident On-Call Process
For all SQUARED UP alarms, trouble calls, and StruxureWare alarms (ERHT 5A/5B, LAMB 20, LYNN B168, or TEL 210), follow the on-call triage process:
For alarms originating from StruxureWare, DCM (Data Center Management) personnel should be notified in addition to proceeding with the Data Networking (ITIS) notification steps outlined below.
- Verify that the building that is showing the alarm has electrical power. Pass this information to the on-call and include it in the follow-up email.
To determine whether a device is down on Squared Up due to a power outage, start by checking the middle screen that Networking maintains.
• If the building is on this list, do not call.
• If a building near a building on the power outage list is showing on Squared Up, do not call. (Only call if it is still showing on Squared Up once the listed power outage time is over.)
(You can use the Purdue Campus Map, https://www.purdue.edu/campus_map/, to see where buildings are located.)
- Call Data Networking (ITIS) Primary On-Call phone (765-494-1591). If there is no answer, leave a voice mail and wait 5 minutes before proceeding.
- A Footprints ticket needs to be made and assigned to ITIS Networking. Put the ticket number in the follow-up email. For the ticket's contact info, use the on-call's name.
- An email should be sent to itns-pdnhlog-ext@lists.purdue.edu. Justin McIntyre would like to be cc'd in the email also: mcintyrj@purdue.edu
You do not need to send another email when you try to reach someone. If you leave a message, include this sentence in the body of the email: "We will escalate the on-call process if there is no response within 5 minutes."
- Call Data Networking (ITIS) Secondary On-Call phone (765-494-1530). If there is no answer, leave a voice mail and wait 5 minutes before proceeding.
- Call Justin McIntyre at 630-675-7640. If no answer, call Richard Letts at 206-790-5837. Wait 5 minutes before proceeding to Step 7.
- In the event that none of the individuals from Steps 3 through 6 above has responded, repeat those steps until contact is made.
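The call order and 5-minute waits above can be laid out as a timeline. This is a sketch under assumptions: it models only the unanswered phone calls (not the ticket or the email), it assumes each call itself takes no time, and the `call_schedule` helper is hypothetical.

```python
WAIT_MINUTES = 5  # wait after each unanswered step, per the process above

# Contact order from the steps above (labels only, no phone numbers).
# The last group is called back to back, followed by a single wait.
CALL_ROUND = [
    ["Data Networking Primary On-Call"],
    ["Data Networking Secondary On-Call"],
    ["Justin McIntyre", "Richard Letts"],
]


def call_schedule(max_attempts):
    """Yield (minutes_from_start, contact), repeating the round until max_attempts."""
    minute = 0
    emitted = 0
    while emitted < max_attempts:
        for group in CALL_ROUND:
            for contact in group:
                if emitted == max_attempts:
                    return
                yield minute, contact
                emitted += 1
            minute += WAIT_MINUTES


for minute, contact in call_schedule(5):
    print(f"+{minute:>2} min: call {contact}")
```

With five attempts, the fifth call goes back to the primary on-call 15 minutes after the first, matching the instruction to repeat the steps until contact is made.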
Special Notes:
- Data Networking personnel who are contacted by IOC staff are responsible for providing issue status updates to IOC in a timely fashion. As a general rule, this means that such feedback should be provided once you start investigating an issue, whenever an ETA for resolution has been determined, and again when the issue has been resolved. Additional updates are always welcome as well, especially for extended outages.
- If Squared Up itself ever goes down for more than 20 minutes, make sure to let Data Networks know that we cannot monitor their devices while it is down. (If only the Xymon part is down, do NOT call Data Networks, as it does not pertain to their devices; call only when the "Network Devices" section is down.) IT Service Management (ITSMO), 765-496-6390 / itcr-itsmo@lists.purdue.edu, will be the on-call if Squared Up is down.
- Battery Alarms in Squared Up: Email Data Networks only. Do NOT call. Make an FP ticket and assign it to Networking; the contact can be left blank or set to jpublic. Examples: anything with "APC" in it.
Ignore the following list of buildings outside of normal business hours:
Outside of business hours (as defined below), do not call the Data Networks on-call staff for Squared Up alarms in any of the locations listed in the table below. If the hostname of the alarm contains a value from the “SquaredUp Value” column, the alarm should be ignored. These values are fairly static, but should any updates be needed, we’ll forward you an updated list.
- Business hours are Monday through Friday from 7:00 AM to 5:00 PM.
- Official Purdue holidays and the weekends adjoining those holiday dates are considered non-business hours, superseding the Monday through Friday standard.
Example: 5/25/19 through 5/27/19 (Memorial Day and the weekend adjoining it) would not be considered business hours.
Building Short Name | Building Long Name | SquaredUp Value |
---|---|---|
- | Any APC Devices | "-apc******-" |
- | Any TRP Devices | "-trp******-" |
2550 | State Farm | "2550-" |
844S | 844 South River Road | "844s-" |
AC22 | Field Research Facility (ACRE Farm) | "ac22-" |
AC35 | Pest Lab and Storage Facility (ACRE Farm) | "ac35-" |
AC41 | Grain Drying Complex - Grain Auger (ACRE Farm) | "ac41-" |
AC42 | Scales House (ACRE Farm) | "ac42-" |
AC43 | USDA Soybean Research Lab (ACRE Farm) | "ac43-" |
AC44 | USDA Rainulator Building Soil Erosion (ACRE Farm) | "ac44-" |
AC45 | Var Test Facility (ACRE Farm) | "ac45-" |
AC46 | Headquarters and Shop (ACRE Farm) | "ac46-" |
AC51 | Weather Facility (ACRE Farm) | "ac51-" |
AC54 | Crop Diagnostic Training Center (ACRE Farm) | "ac54-" |
AF01 | Aquaculture (ASREC Farm) | "af01-" |
AFC | Anderson Flagship Center | "afc-" |
AIDC | Agricultural Information Distribution Center | "aidc-" |
ASB | Airport Service Building | "asb-" |
B201 | Swine Evaluation Headquarters (ASREC Farm) | "b201-" |
B401 | Poultry and Hatchery Facility (ASREC Farm) | "b401-" |
B501 | Sheep Research and Teaching Facility (ASREC Farm) | "b501-" |
B602 | Feed Mill (ASREC Farm) | "b602-" |
B701 | Swine Office Metabolism Facility (ASREC Farm) | "b701-" |
B713 | Environmental Research Facility (ASREC Farm) | "b713-" |
B801 | Farm Operations Shop and Headquarters (ASREC Farm) | "b801-" |
B901 | Teaching Center and Classroom (ASREC Farm) | "b901-" |
BBCH | Purdue Baseball Clubhouse | "bbch-" |
BBPB | Purdue Baseball Press Box | "bbpb-" |
BECK | Beck Agricultural Center (ACRE Farm) | "beck-" |
BTV | Boiler Television Building | "btv-" |
CB10 | Beef Building (ASREC Farm) | "cb10-" |
COAL | Coal Handling Control/Fire Pump Building | "coal-" |
GCMB | Golf Course Maintenance Barn | "gcmb-" |
GMF | Grounds Maintenance Facility | "gmf-" |
GMGF | Grounds Maintenance Greenhouse Facility | "gmgf-" |
ICSC | Indiana Corn and Soybean Innovation Center (ACRE Farm) | "icsc-" |
IDOT | Indiana Department of Transportation | "idot-" |
INOK | Investments Warehouse | "inok-" |
INSS | Intramural Storage Shed | "inss-" |
LMSB | Laboratory Material Storage Building | "lmsb-" |
NACC | Native American Educational and Cultural Center | "nacc-" |
PAGE | Thomas A. Page Pavilion | "page-" |
RALR | Stadium Area - Visiting Team Locker Room | "ralr-" |
SBCH | Purdue Softball Clubhouse | "sbch-" |
SBPB | Purdue Softball Press Box | "sbpb-" |
SCHO | Schowe House | "scho-" |
SD02 | Dairy Research Unit (ASREC Farm) | "sd02-" |
SIA | Subaru of Indiana Automotive | "sia-" |
SOCC | Purdue Women’s Soccer Building | "socc-" |
SPUR | Spurgeon Golf Training Center | "spur-" |
SWNA | State Wide New Albany | "swna-" |
SWSB | State Wide South Bend | "swsb-" |
TAP | Purdue Technical Assistance Program | "tap-" |
TM02 | Throckmorton Pesticide Building | "tm02-" |
TM08 | Throckmorton Meigs Building | "tm08-" |
TM11 | Throckmorton Fruit Barn | "tm11-" |
TM36 | Throckmorton Farm Crop Barn | "tm36-" |
TPB | Rankin Track Press Box | "tpb-" |
TURF | Intercollegiate Athletic Sports Turf Building | "turf-" |
UNPD | University Police Department | "unpd-" |
USDA | USDA Building 1 (ASREC Farm) | "usda-" |
VOIN | Voinoff (Samuel) Golf Pavilion | "voin-" |
WH9 | Well House 9 | "wh9-" |
WRIT | John S Wright Forestry Center | "writ-" |
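The ignore rule for this table can be sketched as a hostname check combined with the business-hours definition. Several parts are assumptions made for illustration: only a handful of table values are included, each `*` in the APC/TRP entries is read as exactly one character, 5:00 PM exactly counts as outside business hours, and the caller supplies `holidays` as a set of dates. The document says the hostname "contains" the value, so a substring test is used here; a prefix test would be stricter.

```python
import re
from datetime import date, datetime

# A few values from the "SquaredUp Value" column; the table above has the full list
IGNORE_VALUES = ["2550-", "844s-", "ac22-", "afc-", "unpd-"]
# Assumption: each * in "-apc******-" and "-trp******-" matches one character
IGNORE_PATTERNS = [re.compile(r"-apc.{6}-"), re.compile(r"-trp.{6}-")]


def is_business_hours(when, holidays=frozenset()):
    """Monday through Friday, 7:00 AM up to 5:00 PM, excluding listed holiday dates."""
    if when.date() in holidays:
        return False
    return when.weekday() < 5 and 7 <= when.hour < 17


def ignore_alarm(hostname, when, holidays=frozenset()):
    """True when a Squared Up alarm should not trigger a call to Data Networks."""
    if is_business_hours(when, holidays):
        return False
    name = hostname.lower()
    if any(value in name for value in IGNORE_VALUES):
        return True
    return any(p.search(name) is not None for p in IGNORE_PATTERNS)


memorial_day = {date(2019, 5, 27)}
# Hostnames below are made-up shapes, not real devices
print(ignore_alarm("ac22-101-c2950-01", datetime(2019, 5, 27, 10, 0), memorial_day))
print(ignore_alarm("ac22-101-c2950-01", datetime(2019, 5, 28, 10, 0), memorial_day))
print(ignore_alarm("xxxx-apc123456-01", datetime(2019, 5, 25, 10, 0)))
```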
For Squared Up alarms, report the following information from Squared Up:
- Date/Time that alarm started (Example: October 26, 10:00 AM)
- Affected Device Name (Example: mrdh-285n-c2950-01)
- Last Ping (Example: 2017-11-01 11:14:43)
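The three fields above could be collected in a small record before writing the follow-up email. This is a sketch: the class and method names are hypothetical, and the sample values are the examples given in the list.

```python
from dataclasses import dataclass


@dataclass
class SquaredUpAlarmReport:
    started: str      # date/time the alarm started
    device: str       # affected device name
    last_ping: str    # last ping timestamp shown in Squared Up

    def email_lines(self):
        """Return the report as lines ready to paste into the follow-up email."""
        return [
            f"Date/Time alarm started: {self.started}",
            f"Affected device name: {self.device}",
            f"Last ping: {self.last_ping}",
        ]


report = SquaredUpAlarmReport(
    "October 26, 10:00 AM", "mrdh-285n-c2950-01", "2017-11-01 11:14:43"
)
print("\n".join(report.email_lines()))
```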
For network issues reported by I-Light / GigaPOP or the ITaP help desk, campus personnel, students, or visitors, report the following information:
- Date/Time that issue began or was first noticed (Example: October 26, 10:00 AM)
- Affected service (Examples: wireless/PAL, data PIC(s), or I-Light / GigaPOP call back)
- Location where problem is occurring (Examples: wireless on 2nd floor of Armstrong, all data PICs in the Forestry building, or I-Light / GigaPOP). Whenever possible obtain a specific building and nearest room number.
- Name and phone number (or at least email) of person experiencing/reporting problem
More details can be found in the attached document: ITIS-Data Network Incident Resolution Process.docx
|
...