Data Center Monitoring

In addition to remote monitoring of datacenter spaces (e.g., MATH, TEL, HAAS), the IOC will perform routine in-person inspections of the physical facilities. These currently include MATH B60, MATH G109, MATH G190, and TEL.

Log each walk-through in the IOC log, and indicate whether DCM was emailed (or called, as the case may be). No email or call is needed if no issues are found. Whether or not a follow-up email was sent, please select 'Yes' for the "Email?" option when making the log entry (examples near the bottom of the page).

Schedule:

On normal business days, walk-throughs should be performed once on first shift and twice on each subsequent shift. On non-business days (weekends and holidays), walk-throughs should be performed twice on all shifts:

                   

Shift        Scheduled walk-through times
1st Shift    09:00 (non-business days only), 13:00
2nd Shift    17:00, 21:00
3rd Shift    01:00, 05:00

(Table 1)
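
For reference, the schedule rule in Table 1 can also be expressed programmatically. The following is a minimal sketch (Python); the HOLIDAYS set and the is_business_day() helper are assumptions for illustration only, and Table 1 remains the authoritative schedule.

    from datetime import date

    # Minimal sketch of the Table 1 schedule rule (illustrative only).
    HOLIDAYS = set()  # assumption: populate with observed university holidays

    def is_business_day(d):
        """Assumed rule: Monday-Friday and not a listed holiday."""
        return d.weekday() < 5 and d not in HOLIDAYS

    def walkthrough_times(d, shift):
        """Return the scheduled walk-through times for shift 1, 2, or 3 on date d."""
        times = {
            1: ["09:00", "13:00"],  # 09:00 is performed on non-business days only
            2: ["17:00", "21:00"],
            3: ["01:00", "05:00"],
        }[shift]
        if shift == 1 and is_business_day(d):
            return ["13:00"]  # business days: a single first-shift walk-through
        return times

    # Example: first shift on a Saturday
    # walkthrough_times(date(2024, 6, 1), 1)  ->  ["09:00", "13:00"]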


Walk-throughs are performed primarily by the IOC secondary (working from MATH) inspecting MATH, and by the IOC primary inspecting TEL. If the secondary is unable to perform a walk-through for any reason, the IOC primary will assume this task.

Walk-throughs 

The Walk-Through in general


Although the majority of monitoring performed by the IOC is conducted electronically, such monitoring suffers from a number of limitations. Sensors may fail, be insufficiently sensitive, be improperly configured, or be too slow to detect changes in the datacenter environment that would otherwise cause concern for a trained human observer.


To support a university computing system that is expanding in scope and complexity, one in which the datacenters play an important (if not central) role, and to mitigate some of the shortcomings described above, IOC personnel will conduct in-person inspections of the datacenter spaces in addition to electronic monitoring.

These "walk-throughs" are primarily concerned with the detection of imminent or emergent failure of equipment and facility, safety issues that immediately imperil staff, and the security of the facility and the services and data contained within. These walk-throughs rely on gross examination of the datacenter space, and IOC personnel should be concerned with (and looking for) signs of several potential issues.


Major threats to the Datacenter:

  • Water
    • Water that is where it shouldn't be
      • water from a source external to the datacenter
      • water from within the datacenter
    • Water that isn't where it should be
      • could be related to above, often related to below (heat)
  • Heat
    • cooling failure/insufficiency
    • fire
  • Power Loss
    • failure of support infrastructure (cooling, monitoring, etc.) leading to equipment damage and/or loss
    • fluctuations and/or unplanned power loss may damage equipment (upon loss, resumption, or both)
    • service down-time (services are the entire point of a datacenter!)
  • Intrusion
    • theft of, or malicious damage/vandalism to, critical equipment and infrastructure
    • unintentional damage to critical equipment or support infrastructure
    • unauthorized and/or potentially hazardous access to data and services


As a matter of routine, IOC personnel should look for indicators of problems associated with these threats during datacenter walk-throughs. Additional observations that may indicate issues other than those explicitly described above should also be investigated.

When performing walk-throughs, look for the following (a checklist sketch appears after this list):

  • Water
    • standing/puddled water or liquid on the floor
    • water or liquid on, under, or adjacent to the racks or equipment
    • water leaking, dripping, or spraying from overhead pipes (any pipes or hoses, really)
    • water spots, dampness, or related damage to the ceiling or walls
    • the sound of water moving (flowing, dripping, bubbling, etc) where it probably shouldn't be
    • odour of glycol (akin to automotive coolant) or other fluids 
  • Heat
    • audible alarms indicating a failure of cooling equipment
    • noticeable and/or unusual areas of heat (hot spots)
    • fire
      • odour of smoke, heat, fire, or combustion
      • visible smoke, smokiness, or unexplained haze
      • activation of a fire alarm, smoke detection, or fire suppression system
      • melted, deformed, or otherwise abnormal appearance of cabling, equipment, or other materials
  • Power Loss/Electrical issues
    • power loss will likely be fairly obvious
      • widespread equipment shutdown
      • no or limited lighting
      • audible alarms from UPS and power distribution equipment
    • sparking, arcing, or arc burns/marks
    • electrical odours (ozone, "hot smells"; see above Heat)
  • Intrusion
    • broken, damaged, inoperable, or malfunctioning door locks, handles, or card readers
    • unsecured doors or doors left ajar
    • evidence of theft or damage 
      • unexpected or suspicious signs of equipment removal 
      • cut cables
      • suspiciously bent or broken rack doors, rails, or other equipment infrastructure
    • unknown and/or unexpected individuals in the datacenter at unexpected times
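
The items above could also be captured as a simple checklist form. The sketch below (Python) is a hypothetical encoding for illustration only; names such as WALKTHROUGH_CHECKLIST and print_checklist() are not part of any existing IOC tooling.

    # Hypothetical checklist derived from the list above; structure and wording
    # are illustrative and do not replace the written walk-through procedure.
    WALKTHROUGH_CHECKLIST = {
        "Water": [
            "standing/puddled liquid on the floor",
            "liquid on, under, or adjacent to racks or equipment",
            "leaks/drips/spray from overhead pipes or hoses",
            "water spots, dampness, or damage to ceiling or walls",
            "sound of water where it shouldn't be",
            "odour of glycol or other fluids",
        ],
        "Heat": [
            "audible cooling alarms",
            "unusual hot spots",
            "odour or signs of smoke/fire, or fire/suppression system activation",
            "melted or deformed cabling, equipment, or materials",
        ],
        "Power/Electrical": [
            "widespread equipment shutdown or loss of lighting",
            "UPS/power distribution alarms",
            "sparking, arcing, or arc marks",
            "electrical odours (ozone, 'hot' smells)",
        ],
        "Intrusion": [
            "damaged or malfunctioning locks, handles, or card readers",
            "unsecured or ajar doors",
            "signs of theft or damage (removed equipment, cut cables, bent racks)",
            "unknown or unexpected individuals at unexpected times",
        ],
    }

    def print_checklist():
        """Print a blank tick-box form for a single walk-through."""
        for category, items in WALKTHROUGH_CHECKLIST.items():
            print(category)
            for item in items:
                print(f"  [ ] {item}")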


The Basic Walkthrough

Walk-throughs should be performed as scheduled (see Table 1, above).



It is recommended that each datacenter walk-through consist of two distinct walks. The first walk should focus "up": observe and note the condition of the ceiling, the airspace above the racks and equipment, the walls and columns, and any overhead utilities (cabling, piping, etc.). When this first walk has been completed, a second should be performed, this time focusing "down": observe and note the condition of the racks themselves, the floor, walls, doors, and equipment. During both walks, general conditions should be observed: smells, sounds, temperature, etc.


Up:

Cabling, raceways, and overhead:

Check that cabling (if present) is bundled together, appears undamaged, and is not falling or pulling away from raceways or conduit. Inspect the ceiling for damaged tiles, signs of water leakage (sagging tiles, discolouration, visible moisture, dripping, etc.), or other defects (fig. 1).

 

(Figure 1)


Piping:

Inspect piping (if present) for visible leakage (dripping fluid or water), physical defects, or obvious damage (fig. 2). Also observe the areas above the racks for ceiling debris.

(Figure 2)

Down:

Floors, walls, racks, doors:

(Figure 3)

Many of the racks are fluid-cooled; note that the doors on these racks are fluid-filled radiators (figs. 3, 4). Pay particular attention to the floor in front of, underneath, and beside the racks for leaking coolant, water, or other liquids (fig. 4, circled).

(Figure 5)


The racks in figure 5 belong to the Bell Cluster. These doors (and only these) should be opened and the cooling feeds and rails inspected for leaks or damage.


Figure 6 shows residue from a previous leak. Be on the lookout for fresh leaks like this (fig. 6.2). Fluids of ANY colour are an issue!

(Figure 6)

(Figure 6.2)

(Figure 7)

Check the clean mats at the entrances to the datacenter. When a mat is excessively dirty (as pictured above, fig. 7), remove and discard the top layer, revealing a fresh layer underneath. This helps prevent dirt from being tracked into the datacenter and, ultimately, into the equipment.

Datacenter specific considerations:

MATH

  • G190 and G109

In addition to the general concerns outlined above, when performing a walk-through of MATH G190 and G109, please also inspect the Cooling Distribution Units (CDUs) located in each datacenter for any alarms (fig. 8):


(Figure 8)


In figure 8, CDUs are labeled and shown in their approximate locations in red. Green Xs indicate doors that should be checked to verify they are closed and secured. Purple stars indicate doors that should remain closed but unlocked. Yellow stars indicate doorways that should remain open for ventilation.


CDUs come in several flavours. Examples are shown below, with notes:

Coolcentric

Without PDU (fig. 9):

 

(Figure 9)


Two Coolcentric CDUs with PDU (fig. 10):

 

(Figure 10)


This CDU (fig. 11) is experiencing an alarm (fig. 12, circled); a close-up view is shown in fig. 13:

(Figure 11)


 

(Figure 12)   (Figure 13)


CoolIT

(Figure 14)

CoolIT CDUs are equipped with touchscreens that may go into screen-saver mode. Lightly touch the screen to check its status.


This CDU is experiencing an alarm (fig. 15):

(Figure 15)


Also check the following spaces:

  • The areas and hallways around B60, B22, B18, and B10. Leaks or water here could indicate an issue above (in the datacenter).
  • The freight elevator. Check that there are no packages. If any packages are found, place them in B92 (the room that opens to the freight elevator on B level) and send an email to DCM.
  • The loading dock exterior door. Verify that it is locked.



  • B60
  • TEL
  • HAAS
  • LAMB?

Reporting:

Emergencies involving threats to life (fire, injuries, etc.) should be reported via 911!

Issues involving large volumes of water in the datacenter, or electrical issues (sparking/arcing/smoke from feed panels or energized equipment), should be reported first to PUPD dispatch, followed immediately by a phone call to DCM.

Issues of excessive heat or water should be reported immediately to DCM via phone. 

Relatively minor issues of open doors, CRAC noises, CDU errors, or other non-critical issues should be reported to DCM via email.
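
To summarize the escalation order above, the following is a minimal sketch (Python); the category names and contact labels are placeholders for illustration only, and in a real incident the written procedure above governs.

    # Hedged sketch of the reporting rules above; labels are placeholders only.
    ESCALATION = {
        "life_threat":               ["911"],                           # fire, injuries, etc.
        "major_water_or_electrical": ["PUPD dispatch", "DCM (phone)"],  # flooding, arcing, smoke
        "heat_or_water":             ["DCM (phone)"],
        "minor":                     ["DCM (email)"],                   # open doors, CRAC noise, CDU errors
    }

    def report_targets(category):
        """Return who to contact, in order, for a given issue category."""
        # When unsure of severity, contact DCM via email or phone.
        return ESCALATION.get(category, ["DCM (email or phone)"])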

Log each walk-through in the IOC log, and indicate whether DCM was emailed (or called, as the case may be). Whether or not a follow-up email was sent, please select 'Yes' for the "Email?" option when making the log entry.

EXAMPLE: 

If you are unsure about the severity of an issue, contact DCM via email or phone!

Related content

  • IOC Monitoring Instructions
  • IOC Remote Monitoring (generally, and COVID-19 specifically)
  • IOC: Common Operating Environment
  • Siemens Data Center Monitoring Tool
  • Xymon Service Monitoring Tool