Grafana IOC Dashboard/ Network Device Alerts (*Needs updated)

Description: Grafana is used to monitor both Network device alerts and Xymon alerts in one system. 

Location:  Grafana Network and Xymon Alerts


 BGP Session Down/Critical Link Down

If Network Alerts appear with Outage Type: BGP Session Down or Critical Link Down, call Networking on-call (765-494-1591) immediately. Then, follow up and log as normal.

  • Alerts are now classified by “Outage Type”:
    • Device Down – the same as previous alerts; these are network devices that are down.
    • Critical Link Down – links that Networking has deemed critical. The alert currently displays the hostname of the device that is down and the circuit attached to that link, if any.
    • BGP Session Down – BGP sessions that have gone down. Previously there was little monitoring of when redundancy was impacted.
  • Polling time has been changed from 1 minute to 5 minutes, so alarms may appear and clear more slowly. This should not affect how IOC calls alarms, as every alert has a start time.
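The classification above can be read as a simple lookup from Outage Type to first action. The Python sketch below is illustrative only: the helper name and the case-insensitive matching are assumptions, not part of Grafana or any IOC tooling. The Outage Type labels and the on-call number are taken from this page.

```python
# Outage Type labels that call for an immediate phone call (from this page).
IMMEDIATE_CALL_TYPES = {"bgp session down", "critical link down"}
NETWORKING_ON_CALL = "765-494-1591"


def action_for_outage_type(outage_type: str) -> str:
    """Return the first action for an alert, based on its Outage Type.

    Hypothetical helper: matching is case-insensitive and ignores
    surrounding whitespace, which is an assumption about the input.
    """
    if outage_type.strip().lower() in IMMEDIATE_CALL_TYPES:
        return (
            f"Call Networking on-call immediately ({NETWORKING_ON_CALL}), "
            "then follow up and log."
        )
    return "Follow the normal Network Device Alerts process."


print(action_for_outage_type("BGP Session Down"))
print(action_for_outage_type("Device Down"))
```

Any Outage Type not in the immediate-call set, including Device Down, falls through to the normal process.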

A “normal” device down alarm will look like the following:

The new types of alarm, for which Networking would like to be called immediately, appear as follows (note that this shows the maximum number of alerts and is for demonstration purposes only):

Network Device Alerts
  • Locate the night's planned maintenance in the Footprints Change and Release Management workspace calendar to ensure the device is not scheduled to be down.
  • Action should be considered for any Grafana alarm that shows up in the Normal Operations Network View for 20 minutes or longer (EXCEPTION: BGP Session Down/Critical Link Down alerts, which require an immediate call). The appropriate steps should be taken as outlined in Section II, Incident On-Call Process found elsewhere in this document.
  • Action should be taken for network issues reported by I-Light (GigaPOP), the ITaP help desk, campus personnel, students, and/or visitors.
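As a rough illustration of the 20-minute rule, the sketch below measures an alarm's age from its start time. The function name is hypothetical, and the choice to measure from the alert's start time (rather than from when it first appeared on the dashboard) is an assumption based on the note that every alert carries a start time.

```python
from datetime import datetime, timedelta

ACTION_THRESHOLD = timedelta(minutes=20)


def alarm_needs_action(started_at: datetime, now: datetime) -> bool:
    """True once an alarm has been active for 20 minutes or longer.

    BGP Session Down / Critical Link Down alerts are not covered by this
    check because they are called immediately.
    """
    return now - started_at >= ACTION_THRESHOLD


start = datetime(2023, 11, 30, 10, 0)
print(alarm_needs_action(start, datetime(2023, 11, 30, 10, 15)))
print(alarm_needs_action(start, datetime(2023, 11, 30, 10, 20)))
```

The planned-maintenance check still applies before any call is made; this sketch covers only the duration test.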
7:00 AM to 5:00 PM, Monday through Friday:           

Issue

Action

  • Wireless/PAL issues that are being experienced by multiple clients.
  • Reports of multiple data PIC service issues in an area.
  • Report of single HIGH PRIORITY data PIC or wireless PAL outage.
  • Call from I-Light / GigaPOP
  • Follow the normal on-call procedures
  • Email from an associated NOC (Internet2, I-Light, etc.)
  • Outages with type: BGP Session Down (Refer to section on BGP Session Down/Critical Link Down)
  • Outages with type: Critical Link Down
  • Call Networking on-call immediately. 765-494-1591
Outside of 7:00 AM to 5:00 PM, Monday through Friday:          

Issue

Action

  • Wireless/PAL issues that are being experienced by multiple clients in different buildings.
  • Send an email to itns-pdnhlog-ext@lists.purdue.edu to notify Data Networking. If this issue seems to be high priority or widespread it may be justified to escalate to our normal on-call procedures.
  • Reports of multiple data PIC service issues in an area.
  • Report of single HIGH PRIORITY data PIC or wireless PAL outage.
  • Call from I-Light / GigaPOP
  • Follow the normal on-call procedures.
  • Email from an associated NOC (Internet2, I-Light, etc.)

Network Incident On-Call Process  

For all Grafana/network device alarms & trouble calls follow the on-call triage process:  

       1.  Verify that the building showing the alarm has electrical power. Verify that an RFC is not in place for the alarming buildings or machines.

           Duke Energy Indiana Outage Map: https://outagemap.duke-energy.com/#/current-outages/in

           * To determine if a device is down on Grafana IOC Dashboard due to power outage, please check the middle screen that networking maintains*  

https://docs.google.com/spreadsheets/d/1PuToFCw5I9AMlYowQmXj8Xbyx4FNk1luynJlEbDL-7A/edit#gid=1449271944 

This spreadsheet also needs to be open when monitoring the IOC remotely. The relevant tab is RFC work (NOC).

  • If the building is on this list, do not call.
  • If a building near one on the power outage list appears on Grafana, do not call. (Please call only if it is still showing on Grafana once the listed power outage time is over.)
    (You can use the Purdue Campus Map, https://www.purdue.edu/campus_map/, to see where buildings are located.)

       2.  Call the Data Networking (PSC) Primary On-Call phone (765-494-1591). If there is no answer, leave a voice mail and wait 5 minutes before calling again.
 
       3.  A TeamDynamix ticket should be made and assigned to IT_NETWORK_OPS. Be sure to use the IOC Network Follow Up incident form in TDX and check Notify Contact(s) as well.


    1. To create an IOC Network Follow Up ticket, select "+ New" > Incident Form.
    2. Select IOC Network Follow-Up under Template.
    3. Select IOC Network Follow-Up under Form.
    4. Fill out the Title and Description.
    5. Adjust impact and urgency based on the outage.
    6. Check the box for "Notify Contact(s)" under Contact.
    7. Press Save.

Any subsequent communication can be done by replying to the ticket notifications or by editing the ticket in TDX.   
Telephone communication with Networking should be documented one of two ways: by email or via a TDX ticket.   Communication with Networking that requires or requests action on their part should be documented by TDX ticket. Communication that does not expect or require action from Networking can be handled by email.  

          Of course, log all communication. Similar to other monitoring systems, any network device alarm log entries should be prefaced with "Network Device:" like the example below.

EXAMPLE: 

10:51 | challen | Data Networks | Micah Peercy | 765-494-1591 / itns-pdnhlog-ext@lists.purdue.edu | Outgoing call | Resume Monitoring | Network Device: nwss-b001cabs-c9348uxm-01.tcom.purdue.edu. Admin is investigating. TDX: 1699562.

11:32 - Received email: Switch is back up.

 

        4. You do not need to send another email when you try to reach someone. If you leave a message, include this sentence in the body of the email: "We will escalate the on-call process if there is no response within 5 minutes."  
 
       5.  Call the Data Networking (PSC) Secondary On-Call phone (765-494-1530). If there is no answer, leave a voice mail and wait 5 minutes before proceeding.
 
       6.  Call Justin McIntyre at 630-675-7640. If no answer, call Richard Letts at 206-790-5837. Wait 5 minutes before proceeding to the next step.

 
       7.  In the event that none of the contacts called in the steps above has responded, repeat those calls until contact is made.
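The calling order in the steps above can be sketched as a repeating sequence. This is an illustrative model only: the `CallStep` type and `call_plan` function are hypothetical names, the named contacts in step 6 are referred to by position rather than repeated here, and the zero-minute wait between the two step 6 contacts is an assumption drawn from the wording of that step.

```python
from dataclasses import dataclass
from itertools import cycle, islice


@dataclass(frozen=True)
class CallStep:
    who: str
    wait_minutes: int  # how long to wait for a response before moving on


# Order and wait times follow the numbered steps above.
CALL_ORDER = [
    CallStep("Data Networking (PSC) Primary On-Call, 765-494-1591", 5),
    CallStep("Data Networking (PSC) Secondary On-Call, 765-494-1530", 5),
    CallStep("First named contact in step 6", 0),
    CallStep("Second named contact in step 6", 5),
]


def call_plan(attempts: int) -> list:
    """Return the first `attempts` calls, cycling through the order again
    when nobody has responded (step 7)."""
    return list(islice(cycle(CALL_ORDER), attempts))


for step in call_plan(5):
    print(f"Call {step.who}; if no answer, wait {step.wait_minutes} min")
```

The fifth call in this example wraps back to the primary on-call phone, which mirrors the instruction to repeat the calls until contact is made.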

Special Notes:

  • Data Networking personnel who are contacted by IOC staff are responsible for providing issue status updates to IOC Staff in a timely fashion.  As a general rule this means that such feedback should be provided once you start investigating an issue, whenever an ETA for resolution has been determined, and again when the issue has been resolved.  Additional updates are always welcome as well, especially for extended outages.
  • If Grafana itself ever goes down for more than 20 minutes, let Data Networks know that we cannot monitor their devices while it is down, and send a ticket to the ITAP_Networking_II queue reporting the outage. This applies to any part of it (Xymon, Network devices, or the whole page). Networking now supports Grafana. Follow normal procedure after you call, i.e., follow-up email and log.
    • APC Alerts: Ignore

Ignore the following list of buildings only outside of normal business hours:
Outside of business hours (as defined below), please do not call the Data Networks on-call staff for Grafana alarms in any of the locations listed in the table below. Essentially, if the hostname value of the alarm contains a value from the “Grafana Value” column of the list, it should be ignored. These values are typically fairly static, but should any updates need to occur we’ll forward you an updated list.

    • Business Hours – Monday through Friday - 7:00 AM to 5:00 PM
    • (Official Purdue Holidays and Weekends Adjoining those Holiday Dates are Considered to be Non-Business Hours, Superseding the Monday through Friday Standard)
    • Example: 11/24/22 through 11/27/22 (Thanksgiving Day and the weekend adjoining it) would not be considered business hours
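The business-hours definition above can be sketched as a small check. The function name and the way holiday dates are supplied are assumptions for illustration. Treating exactly 5:00 PM as outside business hours is also an assumption, since the page does not say which side of the boundary it falls on.

```python
from datetime import date, datetime, time

BUSINESS_START = time(7, 0)   # 7:00 AM
BUSINESS_END = time(17, 0)    # 5:00 PM (treated as exclusive; an assumption)

# Example from this page: Thanksgiving Day 2022 and the adjoining weekend.
THANKSGIVING_2022 = {date(2022, 11, day) for day in range(24, 28)}


def is_business_hours(moment: datetime, non_business_dates: set) -> bool:
    """True for Monday through Friday, 7:00 AM to 5:00 PM, excluding
    holidays and the weekends adjoining them."""
    if moment.date() in non_business_dates:
        return False  # holiday dates supersede the Monday-Friday standard
    if moment.weekday() >= 5:  # Saturday = 5, Sunday = 6
        return False
    return BUSINESS_START <= moment.time() < BUSINESS_END


print(is_business_hours(datetime(2022, 11, 23, 10, 0), THANKSGIVING_2022))
print(is_business_hours(datetime(2022, 11, 25, 10, 0), THANKSGIVING_2022))
```

The set of non-business dates has to be maintained by hand in this sketch; it is not derived from any official holiday calendar.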


Building Short Name | Building Long Name | Grafana Value | Reason for Ignoring
- | Any APC Devices | "apc" | APC devices are managed/alerted in Struxureware, IOC will receive alerts for the edge switch
- | All temporary tent switches (any devices with the word "tent") | "tent" | COVID tents, only responding to during business hours
- | Any Device with the word "test" | "test" | Test devices are used for testing purposes and not on the production network.
- | rtm.voip | rtm.voip | Reporting server, no customer impact
844S | 844 South River Road | "844s-" | No access, users aren't there after hours
AC22 | Field Research Facility (ACRE Farm) | "ac22-" | No access, users aren't there after hours, feeds off of BECK
AC35 | Pest Lab and Storage Facility (ACRE Farm) | "ac35-" | No access, users aren't there after hours, feeds off of BECK
AC41 | Grain Drying Complex - Grain Auger (ACRE Farm) | "ac41-" | No access, users aren't there after hours, feeds off of BECK
AC42 | Scales House (ACRE Farm) | "ac42-" | No access, users aren't there after hours, feeds off of BECK
AC43 | USDA Soybean Research Lab (ACRE Farm) | "ac43-" | No access, users aren't there after hours, feeds off of BECK
AC44 | USDA Rainulator Building Soil Erosion (ACRE Farm) | "ac44-" | No access, users aren't there after hours, feeds off of BECK
AC45 | Var Test Facility (ACRE Farm) | "ac45-" | No access, users aren't there after hours, feeds off of BECK
AC46 | Headquarters and Shop (ACRE Farm) | "ac46-" | No access, users aren't there after hours, feeds off of BECK
AC51 | Weather Facility (ACRE Farm) | "ac51-" | No access, users aren't there after hours, feeds off of BECK
AC54 | Crop Diagnostic Training Center (ACRE Farm) | "ac54-" | No access, users aren't there after hours, feeds off of BECK
AF01 | Aquaculture (ASREC Farm) | "af01-" | No access, users aren't there after hours, feeds off of BECK
*AFC | Anderson Flagship Center | "afc" | *see instructions below
AIDC | Agricultural Information Distribution Center | "aidc-" | No access, users aren't there after hours
ASB | Airport Service Building | "asb-" | No access, airport hangar
B201 | Swine Evaluation Headquarters (ASREC Farm) | "b201-" | No access, users aren't there after hours, feeds off of BECK
B401 | Poultry and Hatchery Facility (ASREC Farm) | "b401-" | No access, users aren't there after hours, feeds off of BECK
B501 | Sheep Research and Teaching Facility (ASREC Farm) | "b501-" | No access, users aren't there after hours, feeds off of BECK
B701 | Swine Office Metabolism Facility (ASREC Farm) | "b701-" | No access, users aren't there after hours, feeds off of BECK
B713 | Environmental Research Facility (ASREC Farm) | "b713-" | No access, users aren't there after hours, feeds off of BECK
B801 | Farm Operations Shop and Headquarters (ASREC Farm) | "b801-" | No access, users aren't there after hours, feeds off of BECK
B901 | Teaching Center and Classroom (ASREC Farm) | "b901-" | No access, users aren't there after hours, feeds off of BECK
BBCH | Purdue Baseball Clubhouse | "bbch-" | Seasonal
BBPB | Purdue Baseball Press Box | "bbpb-" | Seasonal
BTV | Boiler Television Building | "btv-" | No access
CB10 | Beef Building (ASREC Farm) | "cb10-" | No access, users aren't there after hours, feeds off of BECK
CB14 | Management/Teaching Barn (ASREC Farm) | "cb14-" | No access, users aren't there after hours, feeds off of BECK
COAL | Coal Handling Control/Fire Pump Building | "coal-" | No access
CRML | TAP Carmel Remote Site | "crml" | Charles Garwood, no access, remote site
FHBC | Family Health Clinic of Burlington - Carroll County | "fhbc" | No remote login
GCMB | Golf Course Maintenance Barn | "gcmb-" | No access
GMF | Grounds Maintenance Facility | "gmf-" | No access
GMGF | Grounds Maintenance Greenhouse Facility | "gmgf-" | No access
ICSC | Indiana Corn and Soybean Innovation Center (ACRE Farm) | "icsc-" | No access, users aren't there after hours, feeds off of BECK
IDOT | Indiana Department of Transportation | "idot-" | Andy Sydelko says they can be ignored
INDY | Indianapolis External Site | "indy-" | Gro site - Susan Brock says ignore; Esk site - Mark Sharp says ignore
INOK | Investments Warehouse | "inok-" | Andy Sydelko says they can be ignored
INSS | Intramural Storage Shed | "inss-" | No access, only one user
KKM | TAP Kokomo Site | "kkm-" | Charles Garwood, no access, remote site
LMSB | Laboratory Material Storage Building | "lmsb-" | No access
NA | TAP New Albany Site | "na-" | Charles Garwood, no access, remote site
NACC | Native American Educational and Cultural Center | "nacc-" | No access
PAGE | Thomas A. Page Pavilion | "page-" | No access
PWB | Purdue West Annex - Building B | "pwb-" | Dennis Lord - Ignore outside of 7:30 AM to 4:30 PM
PWC | Purdue West - Building C | "pwc-" | Dennis Lord - Ignore outside of 7:30 AM to 4:30 PM
RALR | Stadium Area - Visiting Team Locker Room | "ralr-" | No access, Seasonal
SAP (Dallas) | SAP Tunnel - Dallas Down (128.241.3.90) | "128.241.3.90" | Not critical, backup for SuccessFactors. The Atlanta one should still be called for.
SBCH | Purdue Softball Clubhouse | "sbch-" | Seasonal
SBPB | Purdue Softball Press Box | "sbpb-" | Seasonal
SCHO | Schowe House | "scho-" | No access
SD02 | Dairy Research Unit (ASREC Farm) | "sd02-" | No access, users aren't there after hours, feeds off of BECK
SOCC | Purdue Women’s Soccer Building | "socc-" | No access
SPUR | Spurgeon Golf Training Center | "spur-" | No access
*SW | State Wide | "sw**-" | *see instructions below
TM02 | Throckmorton Pesticide Building | "tm02-" | No access, users aren't there after hours
TM08 | Throckmorton Meigs Building | "tm08-" | No access, users aren't there after hours
TM36 | Throckmorton Farm Crop Barn | "tm36-" | No access, users aren't there after hours
TPB | Rankin Track Press Box | "tpb-" | Seasonal
TURF | Intercollegiate Athletic Sports Turf Building | "turf-" | No access, users aren't there after hours
UC | University Church | "uc-" | Ignore outside of business hours (per FP #1698941)
UNPD | University Police Department | "unpd-" | This is a backup connection, Mick Kiefer says ignore outside business hours
USDA | USDA Building 1 (ASREC Farm) | "usda-" | No access, users aren't there after hours, feeds off of BECK
VCPR | Veterinary Center for Paralysis Research | "vcpr-" | Dennis Barnett says ignore until business hours
VLAB | Veterinary Laboratory Animal Building | "vlab-" | Dennis Barnett says ignore until business hours
VOIN | Voinoff (Samuel) Golf Pavilion | "voin-" | Seasonal, no access
VTCH | Vision Technology 1 Building | "vtch-" | No access, not a Purdue building
VPRB | Veterinary Pathobiology Research Building | "vprb-" | Dennis Barnett says ignore until business hours
WH9 | Well House 9 | "wh9-" | One user, no access
WRIT | John S Wright Forestry Center | "writ-" | No access, users aren't there after hours
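The rule for this list ("the hostname value of the alarm contains a value from the Grafana Value column") amounts to a substring check that applies only outside business hours. The sketch below is a minimal illustration: the function name is hypothetical, only an excerpt of the table is included, and the example hostnames other than `mrdh-285n-c2950-01` are made up.

```python
# Excerpt of the "Grafana Value" column; extend with the rest of the table.
# The *AFC and *SW rows are left out because they are treated as normal outages.
NO_CALL_VALUES = ("apc", "tent", "test", "rtm.voip", "844s-", "ac22-", "btv-")


def skip_call(hostname: str, in_business_hours: bool) -> bool:
    """True when an alarm should be ignored under the no-call list."""
    if in_business_hours:
        return False  # the list applies only outside business hours
    host = hostname.lower()
    # "Contains" is taken literally: the value may appear anywhere in the name.
    return any(value in host for value in NO_CALL_VALUES)


print(skip_call("ac22-101-c9300-01.tcom.purdue.edu", in_business_hours=False))
print(skip_call("mrdh-285n-c2950-01", in_business_hours=False))
```

Because the match is a plain substring test, short values such as "na-" could match hostnames they were not meant for; anchoring the match to the start of the hostname would be stricter, but that goes beyond what this page specifies.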

PFW No-Call List (Outside of business hours)

Building Short Name | Building Long Name | Grafana Value
AC | Steel Dynamics Keith E. Busse Alumni Center | "f-ac-ds-115-c9300-1"
PXO | Purdue Cooperative Extension Service | "f-pxo-ds-5a-c3650-1"
CRI | Community Research Institute | "f-cri-ds-910-c3650-1"
SOCR | Hefner Soccer Fields | "f-socr-ds-2u01-c3650-1"

PNW No-Call List (Outside of business hours)

Building Short Name | Building Long Name | Grafana Value
GYTE | Millard E. Gyte Science Building | "nw-hmd-gyte-108-x440-s1"
GYTE | Millard E. Gyte Science Building | "nw-hmd-gyte-34-x440-s1"
CLC | Challenger Learning Center | "nw-hmd-clc-*"
LAWS | C. H. Lawshe Hall | "nw-hmd-laws-131-x440-s1"
PWRS | Donald S. Powers Computer Education Building | "nw-hmd-pwrs-216-x440-s1"
TECH | Technology Building | "nw-wst-tech-239a-x440-s1"

Information to Provide When Reporting Network Issues

For Grafana alarms, report the following information from Grafana:

  1. Date/Time that alarm started (Example: October 26, 10:00 AM)
  2. Affected Device Name (Example: mrdh-285n-c2950-01)
  3. Last Ping (Example: 2017-11-01 11:14:43)

For network issues reported by I-Light / GigaPOP or the ITaP help desk, campus personnel, students, or visitors, report the following information:

  1. Date/Time that issue began or was first noticed (Example: October 26, 10:00 AM)
  2. Affected service (Examples: wireless/PAL, data PIC(s), or I-Light / GigaPOP call back)
  3. Location where problem is occurring (Examples: wireless on 2nd floor of Armstrong, all data PICs in the Forestry building, or I-Light / GigaPOP). Whenever possible obtain a specific building and nearest room number.
  4. Name and phone number (or at least email) of person experiencing/reporting problem
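The three items required for a Grafana alarm can be kept together in a small record so that nothing is left out when calling or logging. This is a sketch only: the class and method names are hypothetical, and the summary wording is an assumption. The "Network Device:" prefix follows the logging convention described earlier on this page, and the sample values are the examples given in the list above.

```python
from dataclasses import dataclass


@dataclass
class GrafanaAlarmReport:
    alarm_started: str  # e.g. "October 26, 10:00 AM"
    device_name: str    # e.g. "mrdh-285n-c2950-01"
    last_ping: str      # e.g. "2017-11-01 11:14:43"

    def summary(self) -> str:
        """One-line summary suitable for a log entry or ticket description."""
        return (
            f"Network Device: {self.device_name}. "
            f"Alarm started: {self.alarm_started}. "
            f"Last ping: {self.last_ping}."
        )


report = GrafanaAlarmReport(
    alarm_started="October 26, 10:00 AM",
    device_name="mrdh-285n-c2950-01",
    last_ping="2017-11-01 11:14:43",
)
print(report.summary())
```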

More details can be found on the attached document below:

PSC-Data Network Incident Resolution Process.docx

 Description: The Down Nodes section in Solarwinds is used to monitor network device alerts similarly to Grafana. 

Location:  Solarwinds - Requires boilerad\username and career account password to sign in. If this page does not load immediately from the link, you may need to refresh it.


If any network device alert lasts longer than 20 minutes in Grafana, the Down Nodes section in Solarwinds can be a useful backup tool for identifying whether an alarm should clear and for providing information to Networking. Currently, we follow the normal on-call process described above only for Grafana. (This includes ignoring alarms that are related to scheduled maintenance, that are on the no-call list outside of normal business hours, etc.)


* State Wide and AFC

Treat as normal outages during and after regular business hours.

Typically, the IOC calls the Networking on-call about a statewide site alert, and the Networking on-call handles the alert from there. Networking will reach out to the statewide site contact(s) and call/notify the ECN helpdesk (765-494-4326) as needed.

Building Sites:

SWKO - State Wide Kokomo
SWNA - State Wide New Albany
SWR - State Wide Richmond
SWCL - State Wide Columbus
SWSB - State Wide South Bend
AFC - Anderson Flagship Center
SIA - Subaru of Indiana Automotive

SW Contact Information:

Jason Culp jjculp@purdue.edu

ECN Helpdesk Number 765-494-4326
