Grafana IOC Dashboard/ Network Device Alerts (*Needs updated)
Description: Grafana is used to monitor both Network device alerts and Xymon alerts in one system.
Location: Grafana Network and Xymon Alerts
BGP Session Down/Critical Link Down
If Network Alerts appear with Outage Type: BGP Session Down or Critical Link Down, call Networking on-call (765-494-1591) immediately. Then, follow up and log as normal.
- Alerts are now classified based on “Outage Type”
- Device Down – just like how alerts were before, these are down network devices
- Critical Link Down – These are links that we in Networking have deemed critical. The alert currently displays the hostname of the device that is down, as well as the circuit attached to that link, if any.
- BGP Session Down – These are BGP sessions that have gone down. Previously, we had little monitoring of when redundancy was impacted.
- Polling time has been changed from 1 minute to 5 minutes. This means that alarms will potentially show up slower, and clear slower. This should not affect how IOC calls alarms as there is a start time for every alert.
A “normal” device down alarm will look like the following:
The new types of alarm that Networking would like to be called about immediately appear as follows (note that this shows the maximum number of alerts and is for demonstration purposes only):
- Locate the night's planned maintenance in the Footprints Change and Release Management workspace calendar to ensure the device is not scheduled to be down.
- Action should be considered for any Grafana alarm that shows up in the Normal Operations Network View for 20 minutes or longer (EXCEPTION: BGP SESSION DOWN/CRITICAL LINK DOWN). The appropriate steps should be taken as outlined in Section II, Incident On-Call Process, found elsewhere in this document.
- Action should be taken for network issues reported by I-Light (GigaPOP), the ITaP help desk, campus personnel, students, and/or visitors.
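The timing rules above can be sketched in Python. This is only an illustration, not part of any IOC tooling: the function name `action_for_alarm`, the returned strings, and the choice to pass the alert's start time in as a `datetime` are all assumptions. The sketch measures age from the alert's own start time, since the 5-minute polling interval can delay when an alert first appears. It does not cover the planned-maintenance check.

```python
from datetime import datetime, timedelta

# Outage types that Networking wants called in immediately (from the text above).
IMMEDIATE_OUTAGE_TYPES = {"bgp session down", "critical link down"}

# Ordinary alarms are acted on once they have lasted this long.
ACTION_THRESHOLD = timedelta(minutes=20)


def action_for_alarm(outage_type: str, start_time: datetime, now: datetime) -> str:
    """Return a hypothetical triage decision for one Grafana network alarm."""
    if outage_type.strip().lower() in IMMEDIATE_OUTAGE_TYPES:
        return "call immediately"
    # Age is measured from the alert's start time, not from when it was noticed.
    if now - start_time >= ACTION_THRESHOLD:
        return "start on-call process"
    return "keep watching"
```

For example, a Device Down alert that started at 10:00 would be left alone at 10:19 and handled through the on-call process from 10:20 onward, while a BGP Session Down alert would be called in as soon as it is seen.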
7:00 AM to 5:00 PM, Monday through Friday:
Issue | Action |
---|---|
|
|
|
|
|
|
Outside of 7:00 AM to 5:00 PM, Monday through Friday:
Issue | Action |
---|---|
|
|
|
|
|
|
Network Incident On-Call Process
For all Grafana/network device alarms & trouble calls follow the on-call triage process:
1. Verify that the building showing the alarm has electrical power, and verify that an RFC is not in place for the alarming buildings or machines.
Duke Energy Indiana Outage Map: https://outagemap.duke-energy.com/#/current-outages/in
To determine whether a device is down on the Grafana IOC Dashboard due to a power outage, check the middle screen that Networking maintains.
This needs to be up as well for remote monitoring of IOC. It is the RFC work (NOC) tab we are looking at.
- If it is on this list do not call.
- If a building near the building listed on the power outage list is on Grafana do not call. (Please only call if it is still showing on Grafana once the power outage time listed is over.)
(You can use the Purdue Campus Map, https://www.purdue.edu/campus_map/, to see where buildings are located.)
2. Call the Data Networking (PSC) Primary On-Call phone (765-494-1591). If there is no answer, leave a voice mail and wait 5 minutes before calling again.
3. A TeamDynamix ticket should be made and assigned to IT_NETWORK_OPS. Be sure to use the IOC Network Follow Up incident form in TDX and check notify contacts too.
- To create an IOC Network Follow Up ticket, select "+ New" > Incident Form.
- Select IOC Network Follow-Up under Template.
- Select IOC Network Follow-Up under Form.
- Fill out the Title and Description.
- Adjust impact and urgency based on the outage.
- Check the box for "Notify Contact(s)" under Contact.
- Press Save.
Any subsequent communication can be done by replying to the ticket notifications or by editing the ticket in TDX.
Telephone communication with Networking should be documented one of two ways: by email or via a TDX ticket. Communication with Networking that requires or requests action on their part should be documented by TDX ticket. Communication that does not expect or require action from Networking can be handled by email.
Of course, log all communication. Similar to other monitoring systems, any network device alarm log entries should be prefaced with "Network Device:" like the example below.
EXAMPLE:
10:51 | challen | Data Networks | Micah Peercy | 765-494-1591 / itns-pdnhlog-ext@lists.purdue.edu | Outgoing call | Resume Monitoring | Network Device: nwss-b001cabs-c9348uxm-01.tcom.purdue.edu. Admin is investigating. TDX: 1699562. 11:32 - Received email: Switch is back up. |
4. You do not need to send another email when you try to reach someone. If you leave a message, include this sentence in the body of the email: "We will escalate the on-call process if there is no response within 5 minutes."
5. Call the Data Networking (PSC) Secondary On-Call phone (765-494-1530). If there is no answer, leave a voice mail and wait 5 minutes before proceeding.
6. Call Justin McIntyre at 630-675-7640. If no answer, call Richard Letts at 206-790-5837. Wait 5 minutes before proceeding to Step 7.
7. If none of the individuals from Steps 2 through 6 above has responded, repeat those steps until contact is made.
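The call order in the steps above can be written down as a simple repeating sequence. The sketch below is illustrative only: the `Contact` type and the `escalation_calls` function are hypothetical names, and it assumes that "wait 5 minutes" after the primary call means moving on to the next contact in the chain once the wait is over. The phone numbers are the ones listed in the steps.

```python
from itertools import cycle, islice
from typing import Iterator, NamedTuple


class Contact(NamedTuple):
    name: str
    phone: str
    wait_minutes: int  # how long to wait after this call before the next one


# Order and numbers come from the steps above; the wait values follow the text.
ESCALATION_CHAIN = [
    Contact("Data Networking Primary On-Call", "765-494-1591", 5),
    Contact("Data Networking Secondary On-Call", "765-494-1530", 5),
    Contact("Justin McIntyre", "630-675-7640", 0),  # no answer: call the next person right away
    Contact("Richard Letts", "206-790-5837", 5),
]


def escalation_calls(max_calls: int) -> Iterator[Contact]:
    """Yield contacts in escalation order, starting over when the chain is exhausted."""
    return islice(cycle(ESCALATION_CHAIN), max_calls)


if __name__ == "__main__":
    for contact in escalation_calls(6):
        print(f"Call {contact.name} at {contact.phone}, then wait {contact.wait_minutes} min")
```

Modeling the chain as a cycle mirrors the final step: if nobody answers, the same sequence is repeated until contact is made.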
Special Notes:
- Data Networking personnel who are contacted by IOC staff are responsible for providing issue status updates to IOC Staff in a timely fashion. As a general rule this means that such feedback should be provided once you start investigating an issue, whenever an ETA for resolution has been determined, and again when the issue has been resolved. Additional updates are always welcome as well, especially for extended outages.
- If Grafana itself ever goes down for more than 20 minutes, let Data Networks know that we cannot monitor their devices while it is down. Also send a ticket to the ITAP_Networking_II queue reporting that it is down. This applies to any part of it (Xymon, network devices, or the whole page). Networking supports Grafana now. Follow normal procedure after you call, i.e., follow-up email and log.
- APC Alerts: Ignore
Ignore the following list of buildings only outside of normal business hours:
Outside of business hours (as defined below), do not call the Data Networks on-call staff for Grafana alarms in any of the locations listed in the table below. If the hostname of the alarm contains a value from the “Grafana Value” column of the table, the alarm should be ignored. These values are fairly static, but should any updates be needed, we’ll forward you an updated list.
- Business Hours – Monday through Friday - 7:00 AM to 5:00 PM
- (Official Purdue Holidays and Weekends Adjoining those Holiday Dates are Considered to be Non-Business Hours, Superseding the Monday through Friday Standard)
- Example: 11/24/22 through 11/27/22 (Thanksgiving Day and the weekend adjoining it) would not be considered business hours
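The business-hours definition above can be expressed as a small check. This is a sketch under stated assumptions: the function name `is_business_hours` is hypothetical, holiday dates have to be supplied by the caller (no Purdue holiday calendar is built in), and 5:00 PM is treated as the first moment outside business hours, which the text does not specify.

```python
from datetime import date, datetime, time
from typing import AbstractSet

BUSINESS_START = time(7, 0)   # 7:00 AM
BUSINESS_END = time(17, 0)    # 5:00 PM (treated as exclusive; an assumption)


def is_business_hours(moment: datetime,
                      non_business_dates: AbstractSet[date] = frozenset()) -> bool:
    """Return True if `moment` falls within Monday-Friday, 7:00 AM to 5:00 PM.

    `non_business_dates` holds official holidays (and any adjoining days)
    that override the Monday-Friday rule.
    """
    if moment.date() in non_business_dates:
        return False
    if moment.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return False
    return BUSINESS_START <= moment.time() < BUSINESS_END
```

With the Thanksgiving example, passing the dates 11/24/22 through 11/27/22 as `non_business_dates` makes Friday 11/25/22 at 10:00 AM count as non-business hours, even though it is a weekday.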
Building Short Name | Building Long Name | Grafana Value | Reason for Ignoring |
---|---|---|---|
- | Any APC Devices | "apc" | APC devices are managed/alerted in Struxureware, IOC will receive alerts for the edge switch |
- | All temporary tent switches (any devices with the word "tent") | "tent" | COVID tents, only responding to during business hours |
- | Any Device with the word "test" | "test" | Test devices are used for testing purposes and not on the production network. |
- | rtm.voip | rtm.voip | Reporting server, no customer impact |
844S | 844 South River Road | "844s-" | No access, users aren't there after hours |
AC22 | Field Research Facility (ACRE Farm) | "ac22-" | No access, users aren't there after hours, feeds off of BECK |
AC35 | Pest Lab and Storage Facility (ACRE Farm) | "ac35-" | No access, users aren't there after hours, feeds off of BECK |
AC41 | Grain Drying Complex - Grain Auger (ACRE Farm) | "ac41-" | No access, users aren't there after hours, feeds off of BECK |
AC42 | Scales House (ACRE Farm) | "ac42-" | No access, users aren't there after hours, feeds off of BECK |
AC43 | USDA Soybean Research Lab (ACRE Farm) | "ac43-" | No access, users aren't there after hours, feeds off of BECK |
AC44 | USDA Rainulator Building Soil Erosion (ACRE Farm) | "ac44-" | No access, users aren't there after hours, feeds off of BECK |
AC45 | Var Test Facility (ACRE Farm) | "ac45-" | No access, users aren't there after hours, feeds off of BECK |
AC46 | Headquarters and Shop (ACRE Farm) | "ac46-" | No access, users aren't there after hours, feeds off of BECK |
AC51 | Weather Facility (ACRE Farm) | "ac51-" | No access, users aren't there after hours, feeds off of BECK |
AC54 | Crop Diagnostic Training Center (ACRE Farm) | "ac54-" | No access, users aren't there after hours, feeds off of BECK |
AF01 | Aquaculture (ASREC Farm) | "af01-" | No access, users aren't there after hours, feeds off of BECK |
*AFC | *Anderson Flagship Center | "afc" | *see instructions below |
AIDC | Agricultural Information Distribution Center | "aidc-" | No access, users aren't there after hours |
ASB | Airport Service Building | "asb-" | No access, airport hangar |
B201 | Swine Evaluation Headquarters (ASREC Farm) | "b201-" | No access, users aren't there after hours, feeds off of BECK |
B401 | Poultry and Hatchery Facility (ASREC Farm) | "b401-" | No access, users aren't there after hours, feeds off of BECK |
B501 | Sheep Research and Teaching Facility (ASREC Farm) | "b501-" | No access, users aren't there after hours, feeds off of BECK |
B701 | Swine Office Metabolism Facility (ASREC Farm) | "b701-" | No access, users aren't there after hours, feeds off of BECK |
B713 | Environmental Research Facility (ASREC Farm) | "b713-" | No access, users aren't there after hours, feeds off of BECK |
B801 | Farm Operations Shop and Headquarters (ASREC Farm) | "b801-" | No access, users aren't there after hours, feeds off of BECK |
B901 | Teaching Center and Classroom (ASREC Farm) | "b901-" | No access, users aren't there after hours, feeds off of BECK |
BBCH | Purdue Baseball Clubhouse | "bbch-" | Seasonal |
BBPB | Purdue Baseball Press Box | "bbpb-" | Seasonal |
BTV | Boiler Television Building | "btv-" | No access |
CB10 | Beef Building (ASREC Farm) | "cb10-" | No access, users aren't there after hours, feeds off of BECK |
CB14 | Management/Teaching Barn (ASREC Farm) | "cb14-" | No access, users aren't there after hours, feeds off of BECK |
COAL | Coal Handling Control/Fire Pump Building | "coal-" | No access |
CRML | TAP Carmel Remote Site | "crml" | Charles Garwood, no access, remote site |
FHBC | Family Health Clinic of Burlington - Carroll County | "fhbc" | No remote login |
GCMB | Golf Course Maintenance Barn | "gcmb-" | No access |
GMF | Grounds Maintenance Facility | "gmf-" | No access |
GMGF | Grounds Maintenance Greenhouse Facility | "gmgf-" | No access |
ICSC | Indiana Corn and Soybean Innovation Center (ACRE Farm) | "icsc-" | No access, users aren't there after hours, feeds off of BECK |
IDOT | Indiana Department of Transportation | "idot-" | Andy Sydelko says they can be ignored |
INDY | Indianapolis External Site | "indy-" | Gro site - Susan Brock says ignore; Esk site - Mark Sharp says ignore |
INOK | Investments Warehouse | "inok-" | Andy Sydelko says they can be ignored |
INSS | Intramural Storage Shed | "inss-" | No access, only one user |
KKM | TAP Kokomo Site | "kkm-" | Charles Garwood, no access, remote site |
LMSB | Laboratory Material Storage Building | "lmsb-" | No access |
NA | TAP New Albany Site | "na-" | Charles Garwood, no access, remote site |
NACC | Native American Educational and Cultural Center | "nacc-" | No access |
PAGE | Thomas A. Page Pavilion | "page-" | No access |
PWB | Purdue West Annex - Building B | "pwb-" | Dennis Lord - Ignore outside of 7:30 AM to 4:30 PM |
PWC | Purdue West - Building C | "pwc-" | Dennis Lord - Ignore outside of 7:30 AM to 4:30 PM |
RALR | Stadium Area - Visiting Team Locker Room | "ralr-" | No access, Seasonal |
SAP (Dallas) | SAP Tunnel - Dallas Down (128.241.3.90) | "128.241.3.90" | Not critical, backup for SuccessFactors. The Atlanta one should still be called for. |
SBCH | Purdue Softball Clubhouse | "sbch-" | Seasonal |
SBPB | Purdue Softball Press Box | "sbpb-" | Seasonal |
SCHO | Schowe House | "scho-" | No access |
SD02 | Dairy Research Unit (ASREC Farm) | "sd02-" | No access, users aren't there after hours, feeds off of BECK |
SOCC | Purdue Women’s Soccer Building | "socc-" | No access |
SPUR | Spurgeon Golf Training Center | "spur-" | No access |
*SW | *State Wide | "sw**-" | *see instructions below |
TM02 | Throckmorton Pesticide Building | "tm02-" | No access, users aren't there after hours |
TM08 | Throckmorton Meigs Building | "tm08-" | No access, users aren't there after hours |
TM36 | Throckmorton Farm Crop Barn | "tm36-" | No access, users aren't there after hours |
TPB | Rankin Track Press Box | "tpb-" | Seasonal |
TURF | Intercollegiate Athletic Sports Turf Building | "turf-" | No access, users aren't there after hours |
UC | University Church | "uc-" | Ignore outside of business hours (per FP #1698941) |
UNPD | University Police Department | "unpd-" | This is a backup connection, Mick Kiefer says ignore outside business hours |
USDA | USDA Building 1 (ASREC Farm) | "usda-" | No access, users aren't there after hours, feeds off of BECK |
VCPR | Veterinary Center for Paralysis Research | "vcpr-" | Dennis Barnett says ignore until business hours |
VLAB | Veterinary Laboratory Animal Building | "vlab-" | Dennis Barnett says ignore until business hours |
VOIN | Voinoff (Samuel) Golf Pavilion | "voin-" | Seasonal, no access |
VTCH | Vision Technology 1 Building | "vtch-" | No access, not a Purdue building |
VPRB | Veterinary Pathobiology Research Building | "vprb-" | Dennis Barnett says ignore until business hours |
WH9 | Well House 9 | "wh9-" | One user, no access |
WRIT | John S Wright Forestry Center | "writ-" | No access, users aren't there after hours |
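The "hostname contains a Grafana Value" rule for the table above can be sketched as a substring check. This is an illustration only: `no_call_reason` is a hypothetical name, only a handful of rows are copied in, and the sample hostname in the usage note is made up. The starred AFC and SW rows are deliberately left out because they are handled under their own instructions.

```python
from typing import Optional

# A few rows from the table above, as {Grafana Value: reason}.
# This is a subset for illustration; *AFC and *SW are intentionally omitted.
NO_CALL_VALUES = {
    "apc": "managed/alerted in Struxureware",
    "tent": "temporary tent switch",
    "test": "not on the production network",
    "844s-": "no access after hours",
    "ac22-": "no access after hours",
    "bbch-": "seasonal",
}


def no_call_reason(hostname: str, business_hours: bool) -> Optional[str]:
    """Return why an alarm can be ignored, or None if it should be handled."""
    if business_hours:
        return None  # the list applies only outside business hours
    host = hostname.lower()
    for value, reason in NO_CALL_VALUES.items():
        if value in host:  # plain substring match, as the text describes
            return reason
    return None
```

Outside business hours, a hypothetical hostname such as `ac22-101-c9300-01.tcom.purdue.edu` would return a reason, while `nwss-b001cabs-c9348uxm-01.tcom.purdue.edu` would return `None` and go through the normal on-call process. Short values such as "na-" can match unrelated hostnames under a plain substring test, so a match is worth a second look before an alarm is dismissed.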
PFW No-Call List (Outside of business hours)
Building Short Name | Building Long Name | Grafana Value |
---|---|---|
AC | Steel Dynamics Keith E. Busse Alumni Center | "f-ac-ds-115-c9300-1" |
PXO | Purdue Cooperative Extension Service | "f-pxo-ds-5a-c3650-1" |
CRI | Community Research Institute | "f-cri-ds-910-c3650-1" |
SOCR | Hefner Soccer Fields | "f-socr-ds-2u01-c3650-1" |
PNW No-Call List (Outside of business hours)
Building Short Name | Building Long Name | Grafana Value |
---|---|---|
GYTE | Millard E. Gyte Science Building | "nw-hmd-gyte-108-x440-s1" |
GYTE | Millard E. Gyte Science Building | "nw-hmd-gyte-34-x440-s1" |
CLC | Challenger Learning Center | "nw-hmd-clc-*" |
LAWS | C. H. Lawshe Hall | "nw-hmd-laws-131-x440-s1" |
PWRS | Donald S. Powers Computer Education Building | "nw-hmd-pwrs-216-x440-s1" |
TECH | Technology Building | "nw-wst-tech-239a-x440-s1" |
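The PFW and PNW lists give full device names, with one entry ending in `*`. One possible way to check a device against such a list is shell-style pattern matching with Python's standard `fnmatch` module, sketched below. Treating `*` as a wildcard and comparing against the device name without a domain suffix are assumptions, and `on_regional_no_call_list` is a hypothetical name.

```python
from fnmatch import fnmatchcase
from typing import Iterable

# Grafana Values copied from the PNW table above.
PNW_NO_CALL = [
    "nw-hmd-gyte-108-x440-s1",
    "nw-hmd-gyte-34-x440-s1",
    "nw-hmd-clc-*",
    "nw-hmd-laws-131-x440-s1",
    "nw-hmd-pwrs-216-x440-s1",
    "nw-wst-tech-239a-x440-s1",
]


def on_regional_no_call_list(device_name: str, patterns: Iterable[str]) -> bool:
    """Return True if the device name matches any listed value or wildcard."""
    name = device_name.lower()
    return any(fnmatchcase(name, pattern) for pattern in patterns)
```

Under these assumptions, any device whose name starts with `nw-hmd-clc-` matches the wildcard entry, while a GYTE switch other than the two listed does not.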
Information to Provide When Reporting Network Issues
For Grafana alarms, report the following information from Grafana:
- Date/Time that alarm started (Example: October 26, 10:00 AM)
- Affected Device Name (Example: mrdh-285n-c2950-01)
- Last Ping (Example: 2017-11-01 11:14:43)
For network issues reported by I-Light / GigaPOP or the ITaP help desk, campus personnel, students, or visitors, report the following information:
- Date/Time that issue began or was first noticed (Example: October 26, 10:00 AM)
- Affected service (Examples: wireless/PAL, data PIC(s), or I-Light / GigaPOP call back)
- Location where problem is occurring (Examples: wireless on 2nd floor of Armstrong, all data PICs in the Forestry building, or I-Light / GigaPOP). Whenever possible, obtain a specific building and nearest room number.
- Name and phone number (or at least email) of person experiencing/reporting problem
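The checklist for user-reported issues can be captured as a small record with a completeness check, which may help when taking a call. This is a sketch only: the class `UserReportedIssue`, its field names, and `missing_fields` are hypothetical, and the reporter in the usage note is a placeholder.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class UserReportedIssue:
    started: str            # date/time the issue began or was first noticed
    affected_service: str   # e.g. wireless/PAL, data PIC(s)
    location: str           # building and nearest room number if possible
    reporter_contact: str   # name plus phone number (or at least email)


def missing_fields(report: UserReportedIssue) -> List[str]:
    """List the checklist items that are still blank."""
    return [f.name for f in fields(report) if not getattr(report, f.name).strip()]
```

For instance, a report with a start time, a service, and a placeholder reporter but an empty location would return `["location"]`, a prompt to ask for the building and room before calling Networking.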
More details can be found on the attached document below:
Description: The Down Nodes section in Solarwinds is used to monitor network device alerts similarly to Grafana.
Location: Solarwinds - Requires boilerad\username and career account password to sign in. If this page does not load immediately from the link, you may need to refresh it.
If any network device alert lasts longer than 20 minutes in Grafana, the Down Nodes section in Solarwinds can be a useful backup tool for determining whether an alarm should have cleared and for providing information to Networking. Currently, we follow the normal on-call process described above only for Grafana. (This includes ignoring alarms that are related to scheduled maintenance, that are on the no-call lists outside of normal business hours, etc.)
* State Wide and AFC
Treat as normal outages during and after regular business hours.
Typically, the IOC calls the Networking on-call about a statewide site alert, and the Networking on-call handles the alert from there. Networking will reach out to the statewide site contact(s) and will call or notify the ECN helpdesk (765-494-4326) as needed.
Building Sites:
SWKO - State Wide Kokomo
SWNA - State Wide New Albany
SWR - State Wide Richmond
SWCL - State Wide Columbus
SWSB - State Wide South Bend
AFC - Anderson Flagship Center
SIA - Subaru of Indiana Automotive
SW Contact Information:
Jason Culp jjculp@purdue.edu
ECN Helpdesk Number 765-494-4326