...

Squared Up

Squared Up notifies Operations of two types of alerts: Xymon alerts (covered below) and Network Device Alerts (covered in the Network Device Alerts section).

Xymon Alerts:

Before calling about any alarm, review the following:

  1. Locate the night's planned maintenance in the Footprints Change and Release Management workspace calendar to ensure the device is not scheduled to be down.
  2. If a server is either specifically mentioned in an RFC or can be inferred to be part of an RFC, no calls should be placed to the group responsible for the system. 
  3. If the system is listed as anything other than production under ITIS services in the main view in Xymon, do not call.

When a new Squared Up alert appears for a production machine, wait 20 minutes. If the alert is still present after that time, Operations must call the group responsible for the system.
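
The checks above can be summarized as a small decision function. This is only a minimal sketch, not an existing tool: the `Alert` fields and the `should_call` name are hypothetical, and it assumes the operator has already looked up the maintenance calendar, RFCs, and production status by hand.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical record of what the operator has already looked up by hand.
@dataclass
class Alert:
    first_seen: datetime          # when the alert appeared in Squared Up
    scheduled_maintenance: bool   # device is on tonight's maintenance calendar
    covered_by_rfc: bool          # named in, or inferable from, an RFC
    is_production: bool           # listed as production in Xymon

WAIT = timedelta(minutes=20)  # waiting period stated in this procedure

def should_call(alert: Alert, now: datetime) -> bool:
    """Return True only when every pre-call check passes and the alert has persisted."""
    if alert.scheduled_maintenance or alert.covered_by_rfc:
        return False
    if not alert.is_production:
        return False
    return now - alert.first_seen >= WAIT
```

Keeping the waiting period in one constant makes it easy to adjust if the procedure changes.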

...

Xymon

Xymon is the web-based monitoring tool of choice for administrators and staff supporting ITSO Windows and Unix processes on the Purdue campus. It gives the trained eye a quick overview of hardware and processes that may be out of sync in several key University areas.

In general, Operations staff only need to respond to Critical (Red) alerts that affect production devices (servers, virtual machines, etc.). All production Xymon alerts should also be reflected in Squared Up. Alarms for McAfee (mcshield.exe) should not be reported unless they last for several hours.

While Operations staff are primarily concerned with Critical (Red) alerts, they should also be familiar with the various other colors of Xymon alerts and their meanings:

Xymon icons.PNG

Note: mcshield.exe

If a system has a red CPU alert, check the process list to see whether mcshield is the cause. It will be at the top of the list with a high CPU percentage.

This is McAfee running an anti-virus sweep on this machine. It will take up a lot of CPU cycles, but can be ignored.
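
The mcshield rule can be written as a short check. This is a sketch under assumptions: the function names are hypothetical, and because this page does not define "several hours" precisely, the threshold is passed in by the caller rather than hard-coded.

```python
from datetime import timedelta

def cpu_alert_is_mcafee_scan(top_process: str) -> bool:
    """True when the top CPU consumer is the McAfee scanner (mcshield / mcshield.exe)."""
    return top_process.strip().lower() in {"mcshield", "mcshield.exe"}

def should_report_cpu_alert(top_process: str,
                            duration: timedelta,
                            several_hours: timedelta) -> bool:
    """Decide whether a red CPU alert needs to be reported.

    `several_hours` is an assumption supplied by the caller; the procedure
    does not give an exact figure.
    """
    if not cpu_alert_is_mcafee_scan(top_process):
        return True  # not a McAfee sweep: handle as a normal red alert
    return duration >= several_hours
```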

1. After a new Xymon critical alert appears for a production machine, begin considering which group to contact, if any. If the alert is still present after 20 minutes, Operations must call the system owner.

2. Is the alert for a clustered device? Some systems, like Mailhub, are clustered and can have several alerts before action is needed. Generally, clustered machines will NOT alarm in Squared Up for individual boxes, but they will in Xymon. This clue can help an operator determine the severity of the Xymon alert. Some clustered systems behave in the opposite manner: they alert in Squared Up but not in Xymon until critical mass is reached. In these cases, ensure enough machines are in alert before contacting the appropriate admins.

3. Locate the night's planned maintenance in the Footprints Change and Release Management Workspace Calendar to ensure the device is not scheduled to be down.

4. Xymon - Click the status icon (pictured below) along the row corresponding to the trouble server. 

5. Some admins have placed instructions on this page about which alarms to ignore or whom to contact (pictured below is an example of these instructions). Follow any special instructions for the machine OR use the Configuration Management Database (CMDB) to locate the appropriate on-call. Further instructions on the use of the CMDB can be found here: Footprints - Configuration Management

6. Call the on-call and inform them of the situation, affected device, and any other issues that may be cropping up due to the alert. Send a follow-up email. For CPU, memory, and disk alerts, paste the Xymon alert text into the email.

Warning

If there is no answer at the group's on-call number, leave a voicemail and call back in 10 minutes. If the group still has not responded by phone or email after this time, consult your supervisor or the on-call supervisor.

7. Log the contact and any appropriate follow-up activities.


To find the group that owns the system:

  1. Go to Xymon.
  2. Find the production status indicator (pictured below); it will be red if there is an alarm.
    Production Status Indicator.PNG
  3. Click the red status indicator for the affected system.
  4. Continue moving through the sub-menus until you find the system page.
    • If the page has special instructions at the top, follow them when deciding whether an admin should be contacted, and which admin to contact.
    citrix.PNG
  5. If the system name is clickable, there will be special instructions. Follow those instructions.
    System name.png
    • These instructions are often about when NOT to call.
    instructions.PNG
  6. Once the steps above are complete, find the system in the Footprints Change and Release Management CMDB.
    • Search for the name of the system.
    • Click on the related CI(s) link.
    • Click the bubble button CI’s to “Named server”
    • Click on the Managed by Relationship, and click “Go To”.
    • The on-call group information should be listed here.
  7. If there is no information from the previous steps:
    • Search the communication log for the alarming system.
    • Review past correspondence and determine who to call.
    • If the server name starts with a “W”, it is most likely handled by the Windows on-call.
    • If the server name starts with an “L”, it is most likely handled by the Unix (Linux) on-call.
    • Consult with your coworkers.
    • Consult your supervisor or the on-call supervisor.
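
The server-name fallback in the last step can be sketched as follows. The `guess_on_call` helper and the example hostnames are hypothetical. The result is only a best guess, to be used after the CMDB and communication log have yielded nothing.

```python
from typing import Optional

def guess_on_call(hostname: str) -> Optional[str]:
    """Last-resort guess at the on-call group from the first letter of the server name."""
    first = hostname.strip()[:1].upper()
    if first == "W":
        return "Windows on-call"
    if first == "L":
        return "Unix (Linux) on-call"
    return None  # no rule applies: consult coworkers or a supervisor

# Example with made-up hostnames:
# guess_on_call("wexample01") returns "Windows on-call"
# guess_on_call("lexample01") returns "Unix (Linux) on-call"
```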

Purple Alerts

Treat a production machine with numerous purple status alerts as if it were a red alert.


...

Network Device Alerts


Note
  • Locate the night's planned maintenance in the Footprints Change and Release Management workspace calendar to ensure the device is not scheduled to be down.
  • Action should be considered for any Squared Up alarm that remains in the Normal Operations Network View for 20 minutes or longer. Take the appropriate steps as outlined in Section II, Incident On-Call Process, found elsewhere in this document.
  • Action should be taken for network issues reported by I-Light, the ITaP help desk, campus personnel, students, and/or visitors.


8:00 AM to 5:00 PM, Monday through Friday:
Issue:
  • Wireless/PAL issues that are being experienced by multiple clients.
  • Reports of multiple data PIC service issues in an area.
  • Report of single HIGH PRIORITY data PIC or wireless PAL outage.
  • Call from I-Light
Action:
  • Follow the normal on-call procedures.

Outside of 8:00 AM to 5:00 PM, Monday through Friday:
Issue:
  • Wireless/PAL issues that are being experienced by multiple clients in different buildings.
Action:
  • Send an email to itns-pdnhlog-ext@lists.purdue.edu to notify the ITIS Data Team. If this issue seems to be high priority or widespread it may be justified to escalate to our normal on-call procedures.
Issue:
  • Reports of multiple data PIC service issues in an area.
  • Report of single HIGH PRIORITY data PIC or wireless PAL outage.
  • Call from I-Light
Action:
  • Follow the normal on-call procedures.
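
The time-of-day rules above can be sketched as a lookup. The `Issue` enum, the function names, and the `"email"` / `"on-call"` return values are hypothetical labels for illustration. Holidays are not modeled, and the `widespread` flag stands in for the operator's judgment that an issue is high priority or widespread.

```python
from datetime import datetime
from enum import Enum

class Issue(Enum):
    WIRELESS_MULTIPLE_CLIENTS = "wireless/PAL issues affecting multiple clients"
    MULTIPLE_PIC_ISSUES = "multiple data PIC service issues in an area"
    HIGH_PRIORITY_OUTAGE = "single HIGH PRIORITY data PIC or wireless PAL outage"
    I_LIGHT_CALL = "call from I-Light"

def is_business_hours(when: datetime) -> bool:
    """8:00 AM to 5:00 PM, Monday through Friday (holidays are not modeled here)."""
    return when.weekday() < 5 and 8 <= when.hour < 17

def network_action(issue: Issue, when: datetime, widespread: bool = False) -> str:
    """Return "email" for the email-only case, otherwise "on-call"."""
    if (issue is Issue.WIRELESS_MULTIPLE_CLIENTS
            and not is_business_hours(when)
            and not widespread):
        return "email"  # notify the ITIS Data Team list only
    return "on-call"    # follow the normal on-call procedures
```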


Network Incident On-Call Process

For Squared Up network alarms and trouble calls, follow the triage process below:

  1. Verify that the building that is showing the alarm has electrical power. Pass this information to the on-call and include it in the follow-up email.
  2. For alarms originating from StruxureWare, DCM personnel should be notified in addition to proceeding with the ITIS-Data Team notification steps outlined below.
  3. Call the ITIS-Data Networks Primary On-Call phone (765-494-1591). If there is no answer, leave a voicemail and wait 5 minutes before proceeding.
  4. Send an email to itns-pdnhlog-ext@lists.purdue.edu. You do not need to send another email each time you try to reach someone. If you leave a message, include this sentence in the body of the email: "We will escalate the on-call process if there is no response within 5 minutes."
  5. Call the ITIS-Data Networks Secondary On-Call phone (765-494-1530). If there is no answer, leave a voicemail and wait 5 minutes before proceeding.

  6. Call Garrett Williams at 440-429-7112. If there is no answer, call Daniel Pierce at 765-404-5432. Wait 5 minutes before proceeding to Step 7.

  7. In the event that none of the individuals from Steps 3 through 6 above has responded, repeat those steps until contact is made. 
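
The call order in the steps above, including the repeat in Step 7, can be sketched as a repeating sequence. The `ESCALATION` list and the `escalation_plan` helper are hypothetical. The sketch covers only the phone calls, since the notification email in Step 4 is sent once.

```python
from itertools import cycle, islice

# Contact order from Steps 3, 5 and 6, with the wait in minutes after each attempt.
# Labels only; the phone numbers are listed in the steps above.
ESCALATION = [
    ("ITIS-Data Networks Primary On-Call", 5),
    ("ITIS-Data Networks Secondary On-Call", 5),
    ("Named escalation contacts (Step 6)", 5),
]

def escalation_plan(attempts: int):
    """Return the first `attempts` (contact, wait_minutes) pairs, repeating per Step 7."""
    return list(islice(cycle(ESCALATION), attempts))
```

For example, `escalation_plan(4)` ends with the primary on-call again, because the sequence starts over after the Step 6 contacts.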

Special Notes:

  • ITIS-Data personnel who are contacted by Operations staff are responsible for providing issue status updates to Operations in a timely fashion.  As a general rule this means that such feedback should be provided once you start investigating an issue, whenever an ETA for resolution has been determined and again when the issue has been resolved.  Additional updates are always welcome as well, especially for extended outages.

Information to Provide When Reporting Network Issues

For Squared Up alarms, report the following information from Squared Up:

  1. Date/Time that alarm started (Example: October 26, 10:00 AM)
  2. Affected Device Name (Example: mrdh-285n-c2950-01)
  3. Last Ping (Example: 2017-11-01 11:14:43)
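
The three fields above could be collected in a small record before writing the email. This is a sketch only: the `SquaredUpAlarmReport` class and its method are hypothetical, and the values shown are the examples from the list above.

```python
from dataclasses import dataclass

@dataclass
class SquaredUpAlarmReport:
    alarm_started: str   # e.g. "October 26, 10:00 AM"
    device_name: str     # e.g. "mrdh-285n-c2950-01"
    last_ping: str       # e.g. "2017-11-01 11:14:43"

    def to_email_body(self) -> str:
        """Render the three required fields, one per line."""
        return "\n".join([
            f"Date/Time alarm started: {self.alarm_started}",
            f"Affected device name: {self.device_name}",
            f"Last ping: {self.last_ping}",
        ])

report = SquaredUpAlarmReport("October 26, 10:00 AM",
                              "mrdh-285n-c2950-01",
                              "2017-11-01 11:14:43")
print(report.to_email_body())
```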

For network issues reported by I-Light or the ITaP help desk, campus personnel, students, or visitors, report the following information:

  1. Date/Time that issue began or was first noticed (Example: October 26, 10:00 AM)
  2. Affected service (Examples: wireless/PAL, data PIC(s), or I-Light call back)
  3. Location where problem is occurring (Examples: wireless on 2nd floor of Armstrong, all data PICs in the Forestry building, or I-Light). Whenever possible obtain a specific building and nearest room number.
  4. Name and phone number (or at least email) of person experiencing/reporting problem

More details can be found in the attached document:

ITIS-Data Network Incident Resolution Process -- Updated 2-2-18.docx

...

StruxureWare

  • Check calendar and Log for any special instructions regarding these systems.
  • Firmware/Software update pop-ups - Send an e-mail to ITI ITIS Data Center Management.
  • Humidity alarms (low or high) in StruxureWare should be reported by email only - no phone call needed.
  • Battery alarms in StruxureWare for UPSs (devices with "apc" or "trp" in the hostname) should be reported by email only - no phone call needed
  • Red alarms (except data network alarms and the exceptions listed above) - Call Todd Turner (68214).
    • Data Network alarms for the buildings LAMB, LYNN, ERHT, and TEL need to be reported to Data Networks, per the email received by the IOC on 07/19/2017.
    • “Device status may be inaccurate because an attempt to transfer a device definition file (DDF) failed" alarm. Right click on the device and select "Request device scan" (per the email received by IOC on 05/06/2018).
      • Call Todd Turner (68214) if the step above doesn't clear the alarm.
  • Communication lost
    1. If ALL devices in a room are down, call right away.
    2. Otherwise, wait 10 minutes for communication to retry.
    3. If communication is still lost, contact DCM via phone and email.
  • Physical Facilities will call to report PMs on the CRAC and generator units listed below. Send an email to ITI Data Center Management at iti-dcm@purdue.edu when they start and stop.
    1. Include the building, room, device name, and Physical Facilities technician name in the e-mail.
    2. This is only for the equipment listed below. In all other cases dealing with Physical Facilities requests for permission to do work, call Todd Turner at 496-8214.

Generator Test TEL Nodes

  • ERHT
  • LAMB
  • LYNN
  • TEL

Crac units Data Centers

  • FREH G2 CRAC #1,2,3,4
  • FREH G57 CRAC 1
  • FREH G60 CRAC 1, ACG-2
  • HAAS CRAC 1,2,3
  • MATH B60 CRAC 1,2,3
  • MATH G72 Chiller
  • MATH G109 CRAC 5,6,13
  • MATH G190 CRAC 1,12,32

TEL Nodes

  • ERHT 5 CRAC 1,2
  • LAMB 20 ACG-20,21
  • LYNN G168 ACG-40, 41
  • TEL 210 CRAC 1,2,3
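
The StruxureWare alarm rules earlier in this section can be sketched as a routing function. This is an illustration under assumptions: the `kind` labels, the function name, and the return strings are hypothetical, and cases the rules do not spell out are returned as not covered rather than guessed.

```python
UPS_MARKERS = ("apc", "trp")  # hostname fragments that identify a UPS
DATA_NETWORK_BUILDINGS = {"LAMB", "LYNN", "ERHT", "TEL"}

def red_alarm_action(kind: str, hostname: str = "", building: str = "") -> str:
    """Map a StruxureWare alarm to the action described in the rules above."""
    kind = kind.lower()
    if kind == "humidity":
        return "email only"
    if kind == "battery" and any(m in hostname.lower() for m in UPS_MARKERS):
        return "email only"
    if kind == "data network":
        if building.upper() in DATA_NETWORK_BUILDINGS:
            return "report to Data Networks"
        return "not covered by this sketch"
    return "call"  # other red alarms: call the contact named in the red-alarm rule
```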


Cameras

StruxureWare also includes monitoring functionality for the cameras in the data centers. This view in StruxureWare should be open at all times on the large screen next to the door to MATH B60.

  • Please monitor the cameras from time to time to make sure nothing suspicious is going on. Check the Shift Log, Change and Release Management Workspace Calendar and your email for scheduled work. If it appears that the occupants do not belong in the room or are removing things during the night, notify PUPD (48221) to investigate. You will also need to call Todd Turner if you do call PUPD, or if you feel unsure.
  • HAAS is considered a lights-out facility with several groups who have card access. The building fire panel is located in the datacenter. The Fire & Safety group works Midnight to 8am doing panel tests across campus, and you will see them or the PUFD in the room from time-to-time.

...

Siemens

  • If the Alarms button turns red, call Patrick Finnegan at 765-427-3020 (C), 6-1752 (W), or 765-421-6069 (H).
    1. Call Todd Turner (68214) if you cannot reach Patrick.
    2. There is no audio alarm.
  • If the "Device Failures" button turns red, try closing Siemens and starting it again.
    1. If Siemens is unable to start, call Patrick Finnegan at the numbers above.
  • If IOC receives an automated call reporting an issue with the MATH backup chiller:
    1. For the MATH backup chiller starting call, listen to and acknowledge the message, then send a notification email to the mailing list itidatacentermanagement@purdue.edu.

    2. If any Siemens alarms occur after the backup chiller call (or at any other time), call and email the Data Center Management on-call list.

...

UC4

  • General Job Failure - must contact Production Control within 30 minutes
    1. 08:00-17:00 M-F, non-holiday workdays, only send email using the shift log abend reporting tool.
    2. At all other times, call the Production Control Analyst (PCA) at 60287.
  • Down Agent
    1. Call PCA (60287) and send follow-up email.
  • File Transfer Failure
    1. Do not restart items with parentheses (). These are child processes of the item that needs to be restarted.
    2. Restart once.
    3. If it fails again, follow General Job Failure instructions.
  • Notification
    1. React as necessary
    2. If needed contact PCAs as with General Job Failure.
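
The General Job Failure rule can be sketched as two helpers. The function names and the `holidays` parameter are hypothetical. The sketch assumes the caller supplies the list of holidays, since this page does not define them.

```python
from datetime import datetime, timedelta

def job_failure_contact(when: datetime, holidays: frozenset = frozenset()) -> str:
    """Return "email" during 08:00-17:00 on non-holiday weekdays, otherwise "call"."""
    workday = when.weekday() < 5 and when.date() not in holidays
    if workday and 8 <= when.hour < 17:
        return "email"  # use the shift log abend reporting tool
    return "call"       # call the Production Control Analyst

def contact_deadline(failed_at: datetime) -> datetime:
    """Production Control must be contacted within 30 minutes of the failure."""
    return failed_at + timedelta(minutes=30)
```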

Further information regarding UC4's configuration and operation can be found here: UC4 Intro and Configuration

...