How to Monitor Bare Metal Server (BMS) in Huawei Cloud

Site24x7 continuously monitors your Bare Metal Server (BMS) metrics, giving your team real-time visibility into CPU utilization, memory consumption, network throughput, TCP connection states, GPU and NPU health, and process activity across your bare metal instances.

Catch resource bottlenecks, hardware degradation, and network anomalies before they impact your workloads, and keep your bare metal infrastructure running at peak performance.

Use cases

CPU analysis: Monitor CPU usage breakdown (e.g., user, kernel, I/O wait) with load average to identify resource bottlenecks and avoid unnecessary scaling.

Network health: Track error and drop rates to detect NIC issues early, preventing connectivity problems caused by misconfigurations or network faults.

Hardware monitoring: Observe GPU/NPU health, temperature, and errors to detect hardware degradation early and prevent failures in AI or compute workloads.

Setup and configuration

BMS resources are automatically discovered and monitored as part of the Huawei Cloud integration. To enable monitoring, follow these steps:

  • Navigate to Cloud > Huawei > Add Huawei Monitor. Follow the steps to add a Huawei Cloud monitor.
  • While adding or editing a Huawei Cloud monitor, select BMS from the Service/Resource Types drop-down menu and click Save.
  • Navigate to Cloud > Huawei, select the created Huawei monitor, and then click Bare Metal Server.
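Once the integration is saved, you can independently confirm that Cloud Eye (CES) is emitting BMS metrics by querying its metric-data API directly. The sketch below only builds the request URL; the endpoint path, the `SERVICE.BMS` namespace, and the `cpu_usage` metric name are assumptions — verify them against the Cloud Eye API reference for your region before use.

```python
# Sketch: build a Huawei Cloud CES (Cloud Eye) metric-data query URL for a
# BMS instance. Endpoint, namespace, and metric name are assumptions -- check
# the Cloud Eye API reference. Token authentication headers are not shown.
from urllib.parse import urlencode

def build_metric_data_url(region, project_id, instance_id,
                          metric_name="cpu_usage", period=300,
                          start_ms=0, end_ms=0):
    """Return the CES metric-data GET URL for one BMS instance."""
    base = f"https://ces.{region}.myhuaweicloud.com/V1.0/{project_id}/metric-data"
    params = {
        "namespace": "SERVICE.BMS",           # assumed BMS namespace
        "metric_name": metric_name,
        "dim.0": f"instance_id,{instance_id}",
        "period": period,                     # seconds per data point
        "filter": "average",
        "from": start_ms,                     # epoch milliseconds
        "to": end_ms,
    }
    return f"{base}?{urlencode(params)}"
```

A non-empty response from this endpoint means metrics are flowing and should appear in Site24x7 on the next poll.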

Supported metrics

CPU

| Metric name | Description | Units |
| --- | --- | --- |
| CPU Usage | Overall percentage of CPU capacity currently consumed by all processes on the BMS. | Percentage |
| Idle CPU Usage | Percentage of CPU time the processor spent idle with no pending tasks. | Percentage |
| User Space CPU Usage | Percentage of CPU time spent executing user-space application processes. | Percentage |
| Kernel Space CPU Usage | Percentage of CPU time spent executing kernel-space operations on behalf of processes. | Percentage |
| Other Process CPU Usage | Percentage of CPU time consumed by processes not categorized as user, kernel, or system. | Percentage |
| Nice Process CPU Usage | Percentage of CPU time spent on processes running at a reduced scheduling priority (nice > 0). | Percentage |
| IO Wait CPU Usage | Percentage of CPU time spent waiting for disk I/O operations to complete. | Percentage |
| CPU Interrupt Time | Percentage of CPU time spent handling hardware interrupt requests. | Percentage |
| CPU Software Interrupt Time | Percentage of CPU time spent processing software interrupt requests. | Percentage |
| 1 Minute Load Average | Average number of runnable and uninterruptible processes over the past minute. | Count |
| 5 Minute Load Average | Average number of runnable and uninterruptible processes over the past five minutes. | Count |
| 15 Minute Load Average | Average number of runnable and uninterruptible processes over the past 15 minutes. | Count |
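Load averages are most meaningful relative to core count: a sustained value above 1.0 per core suggests CPU saturation. A minimal sketch of that normalization on a Linux BMS:

```python
# Sketch: read the three load averages (per the table above) and normalize
# by core count; sustained per-core load above ~1.0 indicates saturation.
import os

def load_per_core():
    one, five, fifteen = os.getloadavg()   # 1/5/15-minute load averages
    cores = os.cpu_count() or 1
    return {"1m": one / cores, "5m": five / cores, "15m": fifteen / cores}
```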

Memory

| Metric name | Description | Units |
| --- | --- | --- |
| Available Memory | Amount of physical memory currently available for allocation to new processes. | GB |
| Memory Usage | Percentage of total physical memory currently in use by the OS and processes. | Percentage |
| Free Memory | Amount of physical memory not currently allocated to any process or cache. | GB |
| Memory Buffers | Amount of physical memory allocated to kernel I/O buffers for block device operations. | GB |
| Memory Cache | Amount of physical memory used as page cache for recently accessed files. | GB |
| Total Open Files | Total number of file descriptors currently open across all processes on the server. | Count |
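On a Linux BMS these memory figures map directly onto fields in `/proc/meminfo`. A minimal parsing sketch (field names `MemTotal`, `MemAvailable`, `MemFree`, `Buffers`, and `Cached` are standard kernel keys; values are reported in kB):

```python
# Sketch: derive the memory metrics above from /proc/meminfo on Linux.

def parse_meminfo(text):
    """Map /proc/meminfo field names to sizes in GB (source values are kB)."""
    fields = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if rest.strip():
            fields[key] = int(rest.split()[0]) / (1024 * 1024)  # kB -> GB
    used_pct = 100 * (1 - fields["MemAvailable"] / fields["MemTotal"])
    return {"available_gb": fields["MemAvailable"],
            "free_gb": fields["MemFree"],
            "buffers_gb": fields["Buffers"],
            "cached_gb": fields["Cached"],
            "usage_pct": used_pct}

# Example: parse_meminfo(open("/proc/meminfo").read())
```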

Network

| Metric name | Description | Units |
| --- | --- | --- |
| Inbound Bandwidth | Rate of data received by the server's network interface per second. | Bit/second |
| Outbound Bandwidth | Rate of data transmitted by the server's network interface per second. | Bit/second |
| Packet Receive Rate | Number of packets received per second by the network interface. | Count/second |
| Packet Send Rate | Number of packets transmitted per second by the network interface. | Count/second |
| Receive Error Rate | Rate of errors detected in packets received by the network interface. | Percentage |
| Transmit Error Rate | Rate of errors detected in packets transmitted by the network interface. | Percentage |
| Receive Drop Rate | Rate at which inbound packets are being dropped by the network interface. | Percentage |
| Transmit Drop Rate | Rate at which outbound packets are being dropped by the network interface. | Percentage |
| NTP Offset | Difference in milliseconds between the server's system clock and the NTP reference time source. | Milliseconds |
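Error and drop rates like those above can be cross-checked on the host itself from interface counters, e.g. two successive reads of `/proc/net/dev`. A minimal sketch of the receive-side calculation (the counter field names mirror the `packets`/`errs`/`drop` columns of that file):

```python
# Sketch: compute receive error and drop rates (as percentages of received
# packets, matching the units above) from two NIC counter snapshots.

def rx_rates(prev, curr):
    """prev/curr: dicts with cumulative 'packets', 'errs', 'drop' counters."""
    pkts = curr["packets"] - prev["packets"]
    if pkts <= 0:
        return {"err_pct": 0.0, "drop_pct": 0.0}
    return {"err_pct": 100 * (curr["errs"] - prev["errs"]) / pkts,
            "drop_pct": 100 * (curr["drop"] - prev["drop"]) / pkts}
```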

TCP Connections

| Metric name | Description | Units |
| --- | --- | --- |
| Total TCP Connections | Total number of TCP connections in all states currently tracked by the kernel. | Count |
| TCP ESTABLISHED | Number of TCP connections in the ESTABLISHED state, actively exchanging data. | Count |
| TCP SYN_SENT | Number of TCP connections in the SYN_SENT state, awaiting a remote SYN-ACK. | Count |
| TCP SYN_RECV | Number of TCP connections in the SYN_RECV state, having received a SYN and sent a SYN-ACK. | Count |
| TCP FIN_WAIT1 | Number of TCP connections in the FIN_WAIT1 state, having sent a FIN and awaiting acknowledgement. | Count |
| TCP FIN_WAIT2 | Number of TCP connections in the FIN_WAIT2 state, waiting for the remote side to send a FIN. | Count |
| TCP TIME_WAIT | Number of TCP connections in the TIME_WAIT state, waiting for the timeout period to expire. | Count |
| TCP CLOSE | Number of TCP connections in the CLOSE state. | Count |
| TCP CLOSE_WAIT | Number of TCP connections in the CLOSE_WAIT state, waiting for the local application to close the socket. | Count |
| TCP LAST_ACK | Number of TCP connections in the LAST_ACK state, waiting for the final ACK after sending a FIN. | Count |
| TCP LISTEN | Number of sockets in the LISTEN state, accepting incoming connection requests. | Count |
| TCP CLOSING | Number of TCP connections in the CLOSING state, where both sides have sent a FIN simultaneously. | Count |
| TCP Retransmission Rate | Rate at which TCP segments are being retransmitted due to loss or timeout. | Percentage |
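The same per-state counts can be reproduced on a Linux BMS from `/proc/net/tcp`, where the fourth column (`st`) is a hexadecimal state code. A minimal sketch (the code-to-name mapping follows the kernel's `tcp_states.h`):

```python
# Sketch: count TCP connections by state (the states tabled above) by
# parsing /proc/net/tcp; column 4 ("st") is a hex state code.
from collections import Counter

TCP_STATES = {"01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
              "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
              "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
              "0A": "LISTEN", "0B": "CLOSING"}

def count_tcp_states(proc_net_tcp_text):
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        cols = line.split()
        if len(cols) > 3:
            counts[TCP_STATES.get(cols[3], "UNKNOWN")] += 1
    return counts

# Example: count_tcp_states(open("/proc/net/tcp").read())
```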

GPU

| Metric name | Description | Units |
| --- | --- | --- |
| GPU Health Status | Operational health state of the GPU device, indicating whether the card is functioning normally. | Count |
| GPU Usage | Percentage of GPU compute capacity currently utilized by active workloads. | Percentage |
| GPU Memory Usage | Percentage of total GPU memory currently consumed by active workloads. | Percentage |
| GPU Encoder Usage | Percentage of the GPU's hardware video encoder currently in use. | Percentage |
| GPU Decoder Usage | Percentage of the GPU's hardware video decoder currently in use. | Percentage |
| GPU Free Memory | Amount of GPU memory not currently allocated to any workload. | MB |
| GPU Used Memory | Amount of GPU memory currently allocated to active workloads. | MB |
| GPU Temperature | Current operating temperature of the GPU. | Celsius |
| GPU Power Draw | Current power consumption of the GPU. | Watts |
| GPU Graphics Clocks | Current operating frequency of the GPU graphics engine. | MHz |
| GPU Memory Clocks | Current operating frequency of the GPU memory interface. | MHz |
| GPU SM Clocks | Current operating frequency of the GPU streaming multiprocessor array. | MHz |
| GPU Video Clocks | Current operating frequency of the GPU video encode/decode engine. | MHz |
| GPU Performance State | Current performance state level of the GPU, where lower values indicate higher performance modes. | Count |
| GPU Volatile Correctable ECC | Number of single-bit ECC memory errors detected and corrected in the current session. | Count |
| GPU Volatile Uncorrectable ECC | Number of multi-bit ECC memory errors detected that could not be corrected in the current session. | Count |
| GPU Aggregate Correctable ECC | Cumulative count of correctable single-bit ECC memory errors since the last driver reset. | Count |
| GPU Aggregate Uncorrectable ECC | Cumulative count of uncorrectable multi-bit ECC memory errors since the last driver reset. | Count |
| GPU Retired Page Single Bit | Number of GPU memory pages retired due to persistent single-bit ECC errors. | Count |
| GPU Retired Page Double Bit | Number of GPU memory pages retired due to uncorrectable double-bit ECC errors. | Count |
| GPU PCI Rx Throughput | Rate of data received by the GPU over the PCIe bus per second. | MB/second |
| GPU PCI Tx Throughput | Rate of data transmitted by the GPU over the PCIe bus per second. | MB/second |
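On NVIDIA-equipped BMS instances, several of these metrics can be spot-checked locally with `nvidia-smi --query-gpu` (requires the NVIDIA driver). A minimal sketch with a parser for its CSV output; the field list here is a small illustrative subset:

```python
# Sketch: query a few of the GPU metrics above via nvidia-smi and parse the
# CSV output (one row per GPU). NVIDIA GPUs with the driver installed only.
import subprocess

FIELDS = ["utilization.gpu", "memory.used", "memory.free",
          "temperature.gpu", "power.draw"]

def parse_gpu_csv(csv_text):
    """Parse `--format=csv,noheader,nounits` output into per-GPU dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        rows.append(dict(zip(FIELDS, values)))
    return rows

def query_gpus():
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={','.join(FIELDS)}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(out)
```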

NPU

| Metric name | Description | Units |
| --- | --- | --- |
| NPU Device Health | Operational health state of the Neural Processing Unit (NPU) device. | Count |
| NPU AI Core Usage | Percentage of NPU AI core compute capacity currently in use. | Percentage |
| NPU Memory Usage | Percentage of total NPU memory currently consumed by active workloads. | Percentage |
| NPU AI CPU Usage | Percentage of the NPU's embedded AI CPU currently in use. | Percentage |
| NPU Control CPU Usage | Percentage of the NPU's control CPU currently in use for management operations. | Percentage |
| NPU Memory Bandwidth Usage | Percentage of available NPU memory bandwidth currently consumed. | Percentage |
| NPU Memory Frequency | Current operating frequency of the NPU memory subsystem. | MHz |
| NPU AI Core Frequency | Current operating frequency of the NPU AI core array. | MHz |
| NPU Used Memory | Amount of NPU memory currently allocated to active inference or training workloads. | MB |
| NPU Single Bit Errors | Number of single-bit ECC errors detected in NPU memory. | Count |
| NPU Double Bit Errors | Number of double-bit ECC errors detected in NPU memory, which are uncorrectable. | Count |
| NPU Power | Current power consumption of the NPU device. | Watts |
| NPU Temperature | Current operating temperature of the NPU device. | Celsius |

Process

| Metric name | Description | Units |
| --- | --- | --- |
| Total Processes | Total number of processes currently present on the BMS in all states. | Count |
| Running Processes | Number of processes currently in the running state and actively consuming CPU. | Count |
| Idle Processes | Number of processes currently idle and not consuming CPU or waiting for I/O. | Count |
| Zombie Processes | Number of processes that have terminated but whose exit status has not yet been collected by a parent process. | Count |
| Blocked Processes | Number of processes currently blocked waiting for a resource such as disk I/O or a lock. | Count |
| Sleeping Processes | Number of processes in a sleep state, waiting for an event or timer before resuming execution. | Count |
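On a Linux BMS, these per-state counts correspond to the single-letter state field in each `/proc/<pid>/stat` (R = running, S = sleeping, D = blocked/uninterruptible, Z = zombie, I = idle kernel thread). A minimal tallying sketch over already-gathered state letters:

```python
# Sketch: tally process states (per the table above) from the state letters
# read out of /proc/<pid>/stat: R=running, S=sleeping, D=blocked, Z=zombie,
# I=idle kernel thread.
from collections import Counter

STATE_NAMES = {"R": "running", "S": "sleeping", "D": "blocked",
               "Z": "zombie", "I": "idle"}

def classify_states(state_letters):
    """Count state letters (field 3 of /proc/<pid>/stat, after the comm)."""
    counts = Counter(STATE_NAMES.get(s, "other") for s in state_letters)
    counts["total"] = len(state_letters)
    return counts
```

In practice the letters would be gathered by iterating numeric entries of `/proc` and reading the field that follows the parenthesized command name in each `stat` file.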

Threshold configuration

You can configure thresholds and alerts for all BMS metrics to proactively detect performance degradation or connection issues.

  1. Go to Admin > Configuration Profiles > Threshold and Availability.
  2. Create or edit your Threshold Profile for BMS.
  3. Assign the profile to the respective monitors to trigger alerts.
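Conceptually, a threshold profile encodes breach logic like the sketch below. This is illustrative only: the actual evaluation happens inside Site24x7, and the warning/critical levels shown are example values, not defaults.

```python
# Sketch of the breach logic a threshold profile encodes: compare a polled
# metric value against warning/critical levels for a "greater than" condition.
# Illustrative only; levels are example values.

def evaluate(value, warning, critical):
    """Return the alert severity for a metric sample."""
    if value >= critical:
        return "critical"
    if value >= warning:
        return "warning"
    return "ok"

# Example: alert when CPU usage crosses 80% (warning) or 95% (critical).
```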

IT Automation

Use Site24x7's IT Automation to resolve common BMS performance issues:

  1. Go to Admin > IT Automation Templates. Then, click Add Automation Templates.
  2. Create an automation rule by selecting the automation Type (e.g., Server reboot, clear queue).
  3. Map the created rules to the BMS for automatic execution during alerts.

Configuration rules

Use Configuration Rules to simplify bulk setup across BMS instances. Automatically assign Threshold Profiles, Notification Profiles, Tags, and Monitor Groups when new monitors are discovered.
