Detect and resolve hardware issues to avert hardware failure in your VMware environment

Fix your hardware before it fails, with Site24x7's instant threshold-based alerts.

Properly functioning hardware is crucial to a reliable VMware environment. A malfunction in a single component can lead to issues like problematic VM behavior, corrupted hard disks, and faulty processors, and can cripple the entire VMware environment. Avoid such issues by monitoring all your hardware 24x7 so that every event is recorded.

How can hardware failure affect your resources?

VMware ESX/ESXi hosts are prone to erratic host and virtual machine behavior, purple screen errors, corrupt disk drives, and other errors. Errors of this kind usually indicate faulty hardware. Relying only on VMware host hardware logs or Common Information Model (CIM) logs to detect and diagnose them is not a wise choice, because by the time you have finished troubleshooting the errors, the hardware failure might have already affected your production environment.

It is wiser to avert issues by taking preventive measures, and the most reliable way to do so is to monitor your hardware with threshold limits set on each component. With continuous monitoring, an alert is triggered as soon as a threshold is breached, so technicians can take corrective measures, like replacing or fixing faulty hardware, and ensure continued uptime.
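As a rough illustration of threshold-based alerting, the Python sketch below compares sensor readings against per-component limits and reports every breach. The sensor names, the limits, and the `check_thresholds` helper are hypothetical and chosen only for this example; they do not represent Site24x7's configuration or API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Threshold:
    """Allowed range for one sensor; None means no limit on that side."""
    lower: Optional[float] = None
    upper: Optional[float] = None


def check_thresholds(readings, thresholds):
    """Return one alert message per reading that falls outside its limits."""
    alerts = []
    for sensor, value in readings.items():
        limit = thresholds.get(sensor)
        if limit is None:
            continue  # no threshold configured for this sensor
        if limit.upper is not None and value > limit.upper:
            alerts.append(f"{sensor}: {value} is above the upper limit {limit.upper}")
        elif limit.lower is not None and value < limit.lower:
            alerts.append(f"{sensor}: {value} is below the lower limit {limit.lower}")
    return alerts


# Hypothetical limits and readings, for illustration only.
thresholds = {
    "cpu_temperature_c": Threshold(lower=5.0, upper=85.0),
    "fan_speed_rpm": Threshold(lower=1000.0),
}
readings = {"cpu_temperature_c": 91.5, "fan_speed_rpm": 2400.0}

for alert in check_thresholds(readings, thresholds):
    print(alert)
```

In this sketch only the temperature reading produces an alert, because the fan speed stays above its lower limit.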

Monitoring host hardware health using sensors

Hardware sensors check the functioning of VMware hardware. Different sensors are mounted on various hardware components, and all of these sensors periodically collect data and send it to Site24x7 for monitoring. The On-Premise Poller acts as the probe for data collection, and Site24x7 receives and displays this data as intuitive charts.

You can track the performance of your hardware using the following sensors in VMware:

Power

The power supply is not always uniform: it may be turned off at times, it may fluctuate, or it may run at full capacity.

Fan

“Fan Transition to Critical from less severe” and “Fan Transition to Off Line” are common errors when it comes to host fan health.

Temperature

Hardware performance depends on systems running at the optimum temperature. Temperature sensors will frequently identify errors like “Temperature Lower Critical going low,” “Temperature Transition to Critical from less severe,” “Temperature Transition to Non-recoverable from less severe,” and “Temperature Upper Critical going high.” IT teams must control the temperature before it goes beyond the appropriate limits.
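Events such as "Transition to Critical from less severe" describe a change in severity rather than a single reading. The Python sketch below shows one way this could be modeled: each reading is classified against lower-critical, upper-critical, and non-recoverable limits, and an event is reported only when the severity gets worse. The limit values, severity names, and helper functions are assumptions made for illustration, not values taken from VMware or Site24x7.

```python
from enum import IntEnum


class Severity(IntEnum):
    OK = 0
    CRITICAL = 1
    NON_RECOVERABLE = 2


def classify(temp_c, lower_critical, upper_critical, upper_non_recoverable):
    """Map a temperature reading (in Celsius) to a severity level."""
    if temp_c >= upper_non_recoverable:
        return Severity.NON_RECOVERABLE
    if temp_c >= upper_critical or temp_c <= lower_critical:
        return Severity.CRITICAL
    return Severity.OK


def transitions(samples, **limits):
    """Yield (index, previous, current) each time the severity gets worse."""
    previous = Severity.OK
    for index, temp_c in enumerate(samples):
        current = classify(temp_c, **limits)
        if current > previous:
            yield index, previous, current
        previous = current


# Hypothetical limits and readings, for illustration only.
limits = dict(lower_critical=5.0, upper_critical=80.0, upper_non_recoverable=95.0)
samples = [42.0, 78.5, 83.0, 84.0, 97.0, 60.0]

for index, before, after in transitions(samples, **limits):
    print(f"sample {index}: {before.name} -> {after.name}")
```

Reporting only on worsening transitions keeps a sensor that stays in a critical state from raising the same alert on every poll.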

Processor

Processors are prone to errors like thermal trip errors, configuration errors, machine check exceptions, correctable machine check exceptions, and internal errors (IERR), all of which can affect the performance of the CPU.

Voltage

Voltage has to be monitored at the power supply input and output. Technicians generally keep an eye on errors like “Voltage Limit Exceeded” and “Voltage Transition to Critical from less severe.”

Storage

Storage sensors differ by storage type, and information on disk storage is required for capacity planning.

Watchdog

The Watchdog sensor monitors the system board, so its status should be tracked closely.

Memory

Memory has a great impact on resource allocation and is prone to errors like configuration errors, uncorrectable error-correcting code (ECC) errors, “Memory Transition to Critical,” and “Memory Critical Overtemperature.”

Battery

The status of the battery, the battery on array, and the battery on controller has to be closely watched. A color code depicts battery health, and it must never be red.

Other

Any hardware outside of the above categories is grouped as “other” sensors in VMware.


For all the sensors above, any change or deviation from the ideal performance should be tracked and reported to ensure uninterrupted performance and optimal hardware health.
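To make the idea of tracking deviations across sensor categories concrete, here is a minimal Python sketch that rolls individual sensor statuses up to the worst status per category. It assumes each sensor reports a color status of green, yellow, red, or unknown; that status set, the ranking, and the sample sensors are assumptions for illustration only.

```python
# Assumed ordering from healthiest to worst. "unknown" ranks above "green"
# so that a sensor with no status is not silently treated as healthy.
RANK = {"green": 0, "unknown": 1, "yellow": 2, "red": 3}


def worst_status_by_category(sensors):
    """Return the worst status seen in each sensor category."""
    worst = {}
    for category, _name, status in sensors:
        current = worst.get(category)
        if current is None or RANK[status] > RANK[current]:
            worst[category] = status
    return worst


# Hypothetical sensors as (category, name, status) tuples.
sensors = [
    ("fan", "Fan 1", "green"),
    ("fan", "Fan 2", "yellow"),
    ("battery", "Battery on controller", "red"),
    ("memory", "DIMM A1", "green"),
]

summary = worst_status_by_category(sensors)
needs_attention = sorted(c for c, s in summary.items() if s != "green")
print(summary)
print(needs_attention)
```

A rollup like this lets a technician see at a glance which categories need attention before drilling into individual sensors.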

How does Site24x7 help?

With Site24x7, you can configure threshold limits for all the sensors connected to the above hardware, so you can diagnose and fix hardware issues and stay on top of your VMware hardware. Site24x7 also enables you to monitor your virtual environment and locate the source of potential issues using performance metrics.