Cisco UCS Monitoring Extension

Introduction

Cisco Unified Computing System (UCS) combines servers, networking, storage and storage networking in a single platform. Cisco UCS is used to create a more cost-effective, efficient and centrally managed data center architecture by integrating computing, networking, virtualization and data storage components and resources.

This AppDynamics Cisco UCS monitoring extension covers UCS health monitoring from both proactive and reactive perspectives:

  1. Baselining Chassis Telemetry – This solution periodically collects temperature and power supply values from UCS chassis blade servers into the AppDynamics BiQ platform. The BiQ platform applies ML/AI to the aggregated data to learn the normal power supply and temperature telemetry, so that it can proactively alert when a server is getting too hot or when there is an outlier in the power supply voltage. This enables AppDynamics customers to monitor blade server health proactively (before UCS flags it as a fault) and to perform remediation actions before customers are impacted.

  2. Monitoring UCS Faults - The extension periodically polls the UCS fault engine and aggregates UCS faults in AppDynamics. These faults can be further categorised into critical UCS functional areas such as Disk health, Fan Module health, Fabric Interconnect health, Blade server health, Rack Unit health, Chassis PSU health, vNIC health, etc. All faults that are visible to the UCS manager are monitored by this extension, irrespective of the affected component within UCS. Only faults that have not been acknowledged in UCS are monitored. In addition, the extension reports only on Critical, Major, Minor and Warning faults; Info, Condition and Cleared severities are ignored and not monitored by AppDynamics.

  3. ServiceNow Integration – The UCS monitoring extension has an optional ServiceNow integration built-in; if enabled, the extension creates a ServiceNow incident with a detailed description of faults. The ServiceNow incident is auto-assigned to a pre-defined group. By default, it creates a P3 incident for Critical faults.
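The fault selection rule described in item 2 above can be sketched as follows. This is only an illustration of the rule, not the extension's own code (which is written in PowerShell); the Fault class and its field names are hypothetical.

```python
from dataclasses import dataclass

# Severities the extension reports on, per the description above.
# Info, Condition and Cleared are deliberately absent.
REPORTED_SEVERITIES = {"critical", "major", "minor", "warning"}


@dataclass
class Fault:
    # Hypothetical field names, chosen for illustration only.
    dn: str
    severity: str
    acknowledged: bool


def is_reported(fault):
    """Return True if the fault is unacknowledged and has a reported severity."""
    return (not fault.acknowledged) and fault.severity.lower() in REPORTED_SEVERITIES


def select_reported(faults):
    """Keep only the faults that match the selection rule, preserving order."""
    return [f for f in faults if is_reported(f)]
```

For example, an unacknowledged "critical" fault is kept, while an acknowledged "major" fault or an unacknowledged "info" fault is dropped.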

In summary, because this monitoring extension leverages the power of the AppDynamics BiQ platform, Cisco UCS customers can slice and dice UCS faults along numerous dimensions for reporting and trend analysis. For example, this query returns all critical UCS faults that were caused by a power-supply failure and had a direct (or knock-on) effect on a server or network component in the last 7 days.

SELECT * FROM ucs_faults WHERE cause = "power-supply" AND Severity = "critical" AND Type in ("network", "server") SINCE 7 days

In the same vein, the ADQL below uses a regular expression to calculate and return the average Inlet Air Temperature of the first blade server in the first chassis.

SELECT avg(toFloat(FmTempSenIo)) AS InletAirTemp FROM ucs_server_temperature WHERE Dn REGEXP "sys/chassis-1/blade-1.*"

Better still, you can save the ADQL as a metric so it can be automatically executed every minute to plot a time-series graph.
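The ADQL queries above can also be submitted programmatically. The sketch below assumes the Analytics Events API query route (/events/query) with the X-Events-API-AccountName and X-Events-API-Key headers described in the configuration table; build_adql_request is a hypothetical helper, and the endpoint, account name and key passed to it are placeholders. It only builds the request and does not send it.

```python
import urllib.request


def build_adql_request(analytics_endpoint, account_name, api_key, adql):
    """Build (but do not send) a POST request that carries an ADQL query.

    Assumes the Events API query route and its text content type;
    verify both against the documentation for your controller version.
    """
    url = analytics_endpoint.rstrip("/") + "/events/query"
    headers = {
        "X-Events-API-AccountName": account_name,
        "X-Events-API-Key": api_key,
        "Content-Type": "application/vnd.appd.events+text;v=2",
        "Accept": "application/vnd.appd.events+json;v=2",
    }
    # The query itself is sent as the plain-text request body.
    return urllib.request.Request(
        url, data=adql.encode("utf-8"), headers=headers, method="POST"
    )
```

Passing the returned object to urllib.request.urlopen would send the query; the response format should be checked against the Events API documentation.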

Prerequisites

The following requirements must be met:

1) BiQ/Analytics License

2) Windows PowerShell 5.0 or later, or PowerShell Core running on Windows, Linux or macOS.

3) Before the extension is installed, the generic AppDynamics extension prerequisites mentioned here need to be met.

Please do not proceed with the extension installation if any of the aforementioned prerequisites are not met.

Installation

1) Download and unzip the UCSMonitoringExtension.zip to the /monitors directory

2) Edit only the Value property in the config.json file located at /monitors/UCSMonitoringExtension

The table below contains a description of some of the configuration properties.

| Config Property Name | Description |
| --- | --- |
| UCSPasswordEncyptionKey | Any string of your choice. This key is used to encrypt and decrypt UCS connection details. |
| UCSURL | The IP address or domain name of the UCS manager. Do not include the http:// or https:// prefix. |
| analyticsEndpoint | The analytics endpoint of your controller. This differs depending on the location of your controller. Please refer to this doc. |
| X-Events-API-AccountName | The global account name, which you can get from the License page. |
| X-Events-API-Key | The analytics API key. Create it by following the instructions in this doc. Grant Manage, Query and Publish permissions to Custom Analytics Events. |
| EnableServiceNow | Set to 'yes' or 'no'. The other ServiceNow properties are required if set to 'yes'; otherwise, ignore them. |
| tierID | Required to monitor the health of the UCS monitoring extension, i.e., its connectivity to AppDynamics, UCS and ServiceNow. Follow the instructions in this doc to acquire the component (or tier) ID. |
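A quick sanity check of the values before running the setup script can catch the most common mistakes. The sketch below is a hypothetical helper, not part of the extension. It assumes the property values have already been read into a plain dictionary keyed by property name; the actual layout of config.json is not shown here, so loading it is left out.

```python
REQUIRED = (
    "UCSPasswordEncyptionKey",
    "analyticsEndpoint",
    "X-Events-API-AccountName",
    "X-Events-API-Key",
    "tierID",
)


def validate_config(values):
    """Return a list of problems found; an empty list means none were found."""
    problems = []

    ucs_url = str(values.get("UCSURL", ""))
    if not ucs_url:
        problems.append("UCSURL is empty")
    elif "://" in ucs_url:
        # The table asks for a bare IP address or domain name.
        problems.append("UCSURL must not include http:// or https://")

    # Exact match is an assumption; the extension may be more lenient.
    if values.get("EnableServiceNow") not in ("yes", "no"):
        problems.append("EnableServiceNow must be 'yes' or 'no'")

    for name in REQUIRED:
        if not values.get(name):
            problems.append(name + " is empty")

    return problems
```

The ServiceNow properties are not checked because their names are not listed in the table above.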

3) Launch PowerShell as an Administrator, change directory to the extension's folder and run the .\Setup.ps1 script. The Setup.ps1 script performs a one-time configuration of the following items:

  • Acquires a UCS session and exports the session details into an encrypted file in the SecureFolder.

  • If enabled, it acquires a ServiceNow session and stores it in an encrypted file in the SecureFolder using the AES encryption algorithm.

  • Creates AppDynamics Analytics Schemas – for UCS faults, Power Supply Stats and Chassis temperature.

  • Installs the ServiceNow and UCS PowerShell modules from the Microsoft PowerShell Gallery. If your server is behind a firewall that blocks access to https://www.powershellgallery.com, you need to download and install the PowerShell modules manually; refer to the Setup.ps1 script for the module names.

  • Creates a file named appd.setup.complete.indicator.txt, if and only if the setup was successful, to indicate that the setup completed.
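To make the schema-creation step more concrete, the sketch below builds the kind of request that could register a custom analytics schema. It assumes the Events API schema route (/events/schema/<name>) and its JSON content type; build_schema_request is a hypothetical helper, and the field list in the usage note is a partial guess taken from the example ADQL query, not the schema that Setup.ps1 actually creates.

```python
import json
import urllib.request


def build_schema_request(analytics_endpoint, account_name, api_key, schema_name, fields):
    """Build (but do not send) a POST request that would create a custom schema.

    `fields` maps field names to Events API type names such as "string".
    """
    url = analytics_endpoint.rstrip("/") + "/events/schema/" + schema_name
    body = json.dumps({"schema": fields}).encode("utf-8")
    headers = {
        "X-Events-API-AccountName": account_name,
        "X-Events-API-Key": api_key,
        "Content-Type": "application/vnd.appd.events+json;v=2",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")
```

For instance, calling it with the schema name "ucs_faults" and the fields {"cause": "string", "Severity": "string", "Type": "string"} produces a request whose body is {"schema": {...}} with those three fields.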

Fig. 1.0 Setup.ps1 Process

4) Log in to the AppDynamics Controller and navigate to Analytics – Searches – Add – 'Drag and Drop Search'. Click on the Schema drop-down and ensure all 3 UCS schemas are present.

Fig. 1.1 Analytics Schema

5) Run the FaultFinder.ps1 script manually and ensure there are no errors

6) Restart Machine Agent

7) Repeat step 4 after 4 minutes, but this time select the PSU schema. You should see some data.

Fig. 1.3 Verify PSU Telemetry

8) Repeat step 4, but this time select the UCS Fault schema. You should see some data if any UCS fault was found. In addition, if the ServiceNow integration is enabled, a ServiceNow ticket containing a summary of all faults should have been created.

Fig. 1.4 ServiceNow Incident

The ServiceNow incident number is logged in the UCSMonitor log file in the Machine Agent's log directory.

-------- End of UCS Monitoring Extension Setup --------

If you are interested in setting up a UCS dashboard similar to the one below, click here to continue reading

(Pro-tip: Right click the dashboard and select open in new tab)

[Image: UCS dashboard]
