How to Monitor Bare Metal Server (BMS) in Huawei Cloud

Site24x7 continuously monitors your Bare Metal Server (BMS) metrics, giving your team real-time visibility into CPU utilization, memory consumption, network throughput, TCP connection states, GPU and NPU health, and process activity across your bare metal instances.

Catch resource bottlenecks, hardware degradation, and network anomalies before they impact your workloads, and keep your bare metal infrastructure running at peak performance.

Use cases

CPU analysis: Monitor the CPU usage breakdown (e.g., user, kernel, I/O wait) alongside load averages to identify resource bottlenecks and avoid unnecessary scaling.

Network health: Track error and drop rates to detect NIC issues early, preventing connectivity problems caused by misconfigurations or network faults.

Hardware monitoring: Observe GPU/NPU health, temperature, and errors to detect hardware degradation early and prevent failures in AI or compute workloads.

Setup and configuration

BMS resources are auto-discovered and monitored during the Huawei Cloud integration. To enable monitoring, follow these steps:

Navigate to Cloud > Huawei > Add Huawei Monitor. Follow the steps to add a Huawei Cloud monitor.
While adding or editing a Huawei Cloud monitor, select BMS from the Service/Resource Types drop-down menu and click Save.
Navigate to Cloud > Huawei, select the created Huawei monitor, and then click Bare Metal Server.

Supported metrics

CPU

Metric name	Description	Units
CPU Usage	Overall percentage of CPU capacity currently consumed by all processes on the BMS.	Percentage
Idle CPU Usage	Percentage of CPU time the processor spent idle with no pending tasks.	Percentage
User Space CPU Usage	Percentage of CPU time spent executing user-space application processes.	Percentage
Kernel Space CPU Usage	Percentage of CPU time spent executing kernel-space operations on behalf of processes.	Percentage
Other Process CPU Usage	Percentage of CPU time consumed by processes not categorized as user space or kernel space.	Percentage
Nice Process CPU Usage	Percentage of CPU time spent on processes running at a reduced scheduling priority (nice > 0).	Percentage
IO Wait CPU Usage	Percentage of CPU time spent waiting for disk I/O operations to complete.	Percentage
CPU Interrupt Time	Percentage of CPU time spent handling hardware interrupt requests.	Percentage
CPU Software Interrupt Time	Percentage of CPU time spent processing software interrupt requests.	Percentage
1 Minute Load Average	Average number of processes in a runnable or uninterruptible state over the past one minute.	Count
5 Minute Load Average	Average number of processes in a runnable or uninterruptible state over the past five minutes.	Count
15 Minute Load Average	Average number of processes in a runnable or uninterruptible state over the past 15 minutes.	Count
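
As a rough illustration of how a CPU time breakdown like the one in this table can be derived on a Linux host, the sketch below compares two samples of the aggregate `cpu` line in `/proc/stat`. It is not the collection logic used by Site24x7 or Huawei Cloud; the function names and sample values are hypothetical.

```python
# Field order of the aggregate "cpu" line in /proc/stat (first eight values).
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a dict of tick counts."""
    parts = line.split()
    if parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return dict(zip(FIELDS, (int(v) for v in parts[1:1 + len(FIELDS)])))


def cpu_breakdown(before, after):
    """Return each field's share of the elapsed CPU time, in percent."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        return {name: 0.0 for name in FIELDS}
    return {name: 100.0 * delta / total for name, delta in deltas.items()}


# Hypothetical samples taken a few seconds apart.
sample_1 = parse_cpu_line("cpu 1000 50 300 8000 100 10 40 0 0 0")
sample_2 = parse_cpu_line("cpu 1400 50 400 8400 150 10 40 0 0 0")

breakdown = cpu_breakdown(sample_1, sample_2)
overall_usage = 100.0 - breakdown["idle"]  # everything that was not idle time
```

Here `system` corresponds to kernel-space time and `iowait` to I/O wait; comparing `overall_usage` with the individual shares shows which category dominates during the sampled interval.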

Memory

Metric name	Description	Units
Available Memory	Amount of physical memory currently available for allocation to new processes.	GB
Memory Usage	Percentage of total physical memory currently in use by the OS and processes.	Percentage
Free Memory	Amount of physical memory not currently allocated to any process or cache.	GB
Memory Buffers	Amount of physical memory allocated to kernel I/O buffers for block device operations.	GB
Memory Cache	Amount of physical memory used as page cache for recently accessed files.	GB
Total Open Files	Total number of file descriptors currently open across all processes on the server.	Count
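
To make the relationship between these memory figures concrete, here is a minimal sketch that parses `/proc/meminfo`-style text and derives a usage percentage. The formula (memory that is not available, divided by total memory) is an assumption for illustration; the agent that feeds these metrics may calculate usage differently. The sample values are invented.

```python
KB_PER_GB = 1024 * 1024  # /proc/meminfo reports kB; treat 1 GB as 1024 * 1024 kB


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values in kB."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])
    return values


def memory_usage_percent(meminfo):
    """Assumed formula: share of total memory that is not available."""
    total = meminfo["MemTotal"]
    return 100.0 * (total - meminfo["MemAvailable"]) / total


# Hypothetical sample for a 16 GB host.
SAMPLE = """\
MemTotal:       16777216 kB
MemFree:         2097152 kB
MemAvailable:    8388608 kB
Buffers:          524288 kB
Cached:          4194304 kB
"""

info = parse_meminfo(SAMPLE)
usage = memory_usage_percent(info)        # 50.0 for this sample
free_gb = info["MemFree"] / KB_PER_GB     # 2.0
cache_gb = info["Cached"] / KB_PER_GB     # 4.0
```

Note that free memory is lower than available memory in the sample: buffers and page cache can be reclaimed, so they count toward what is available without being free.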

Network

Metric name	Description	Units
Inbound Bandwidth	Rate of data received by the server's network interface per second.	Bit/second
Outbound Bandwidth	Rate of data transmitted by the server's network interface per second.	Bit/second
Packet Receive Rate	Number of packets received per second by the network interface.	Count/second
Packet Send Rate	Number of packets transmitted per second by the network interface.	Count/second
Receive Error Rate	Rate of errors detected in packets received by the network interface.	Percentage
Transmit Error Rate	Rate of errors detected in packets transmitted by the network interface.	Percentage
Receive Drop Rate	Rate at which inbound packets are being dropped by the network interface.	Percentage
Transmit Drop Rate	Rate at which outbound packets are being dropped by the network interface.	Percentage
NTP Offset	Difference in milliseconds between the server's system clock and the NTP reference time source.	Milliseconds
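
Bandwidth and error or drop percentages are typically derived from interface counters sampled at two points in time. The sketch below shows that arithmetic under the assumption that a rate is "bad packets divided by all packets in the interval"; the function names and counter values are hypothetical and do not reflect a documented Site24x7 or Huawei Cloud formula.

```python
def bits_per_second(bytes_before, bytes_after, interval_seconds):
    """Convert a byte-counter delta into an average bit rate."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return 8 * (bytes_after - bytes_before) / interval_seconds


def rate_percent(bad_before, bad_after, total_before, total_after):
    """Share of packets in the interval that were counted as bad, in percent."""
    total = total_after - total_before
    if total <= 0:
        return 0.0  # no traffic in the interval, so nothing to report
    return 100.0 * (bad_after - bad_before) / total


# Hypothetical counters sampled 60 seconds apart.
inbound_bps = bits_per_second(1_000_000, 2_500_000, 60)
receive_drop_rate = rate_percent(5, 25, 10_000, 20_000)
receive_error_rate = rate_percent(0, 0, 10_000, 20_000)
```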

TCP Connections

Metric name	Description	Units
Total TCP Connections	Total number of TCP connections in all states currently tracked by the kernel.	Count
TCP ESTABLISHED	Number of TCP connections in the ESTABLISHED state, actively exchanging data.	Count
TCP SYN_SENT	Number of TCP connections in the SYN_SENT state, awaiting a remote SYN-ACK.	Count
TCP SYN_RECV	Number of TCP connections in the SYN_RECV state, having received a SYN and sent a SYN-ACK.	Count
TCP FIN_WAIT1	Number of TCP connections in the FIN_WAIT1 state, having sent a FIN and awaiting acknowledgement.	Count
TCP FIN_WAIT2	Number of TCP connections in the FIN_WAIT2 state, waiting for the remote side to send a FIN.	Count
TCP TIME_WAIT	Number of TCP connections in the TIME_WAIT state, waiting for the timeout period to expire.	Count
TCP CLOSE	Number of TCP connections in the CLOSE state.	Count
TCP CLOSE_WAIT	Number of TCP connections in the CLOSE_WAIT state, waiting for the local application to close the socket.	Count
TCP LAST_ACK	Number of TCP connections in the LAST_ACK state, waiting for the final ACK after sending a FIN.	Count
TCP LISTEN	Number of sockets in the LISTEN state, accepting incoming connection requests.	Count
TCP CLOSING	Number of TCP connections in the CLOSING state, where both sides have sent a FIN simultaneously.	Count
TCP Retransmission Rate	Rate at which TCP segments are being retransmitted due to loss or timeout.	Percentage
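
On Linux, per-state connection counts like these can be reproduced by tallying the hexadecimal `st` column of `/proc/net/tcp`. The following sketch does that for a small invented sample; it is meant only to show where the numbers come from, and the helper name is hypothetical.

```python
from collections import Counter

# Kernel state codes as they appear in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_tcp_states(proc_net_tcp_text):
    """Count sockets per TCP state from /proc/net/tcp-style text."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # first line is the header
        fields = line.split()
        if len(fields) > 3:
            counts[TCP_STATES.get(fields[3].upper(), "UNKNOWN")] += 1
    return counts


# Invented sample: one listener, two established connections, one TIME_WAIT.
SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue
   0: 00000000:0016 00000000:0000 0A 00000000:00000000
   1: 0A000005:0016 0A000009:D2F4 01 00000000:00000000
   2: 0A000005:0016 0A00000A:C350 01 00000000:00000000
   3: 0A000005:8F1C 0A000014:01BB 06 00000000:00000000
"""

states = count_tcp_states(SAMPLE)
total_connections = sum(states.values())
```

A steadily growing CLOSE_WAIT count usually points at an application that is not closing its sockets, while a large TIME_WAIT count is common on hosts that open many short-lived connections.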

GPU

Metric name	Description	Units
GPU Health Status	Operational health state of the GPU device, indicating whether the card is functioning normally.	Count
GPU Usage	Percentage of GPU compute capacity currently utilized by active workloads.	Percentage
GPU Memory Usage	Percentage of total GPU memory currently consumed by active workloads.	Percentage
GPU Encoder Usage	Percentage of the GPU's hardware video encoder currently in use.	Percentage
GPU Decoder Usage	Percentage of the GPU's hardware video decoder currently in use.	Percentage
GPU Free Memory	Amount of GPU memory not currently allocated to any workload.	MB
GPU Used Memory	Amount of GPU memory currently allocated to active workloads.	MB
GPU Temperature	Current operating temperature of the GPU.	Celsius
GPU Power Draw	Current power consumption of the GPU.	Watts
GPU Graphics Clocks	Current operating frequency of the GPU graphics engine.	MHz
GPU Memory Clocks	Current operating frequency of the GPU memory interface.	MHz
GPU SM Clocks	Current operating frequency of the GPU streaming multiprocessor array.	MHz
GPU Video Clocks	Current operating frequency of the GPU video encode/decode engine.	MHz
GPU Performance State	Current performance state level of the GPU, where lower values indicate higher performance modes.	Count
GPU Volatile Correctable ECC	Number of single-bit ECC memory errors detected and corrected in the current session.	Count
GPU Volatile Uncorrectable ECC	Number of multi-bit ECC memory errors detected that could not be corrected in the current session.	Count
GPU Aggregate Correctable ECC	Cumulative count of correctable single-bit ECC memory errors since the last driver reset.	Count
GPU Aggregate Uncorrectable ECC	Cumulative count of uncorrectable multi-bit ECC memory errors since the last driver reset.	Count
GPU Retired Page Single Bit	Number of GPU memory pages retired due to persistent single-bit ECC errors.	Count
GPU Retired Page Double Bit	Number of GPU memory pages retired due to uncorrectable double-bit ECC errors.	Count
GPU PCI Rx Throughput	Rate of data received by the GPU over the PCIe bus per second.	MB/second
GPU PCI Tx Throughput	Rate of data transmitted by the GPU over the PCIe bus per second.	MB/second
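
One way to act on the ECC and retired-page counters is to map them to a severity before alerting. The sketch below is a hypothetical triage policy, not a vendor recommendation: the class, the function, and the `correctable_warn` default of 100 are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class GpuEccSample:
    """One poll of the ECC-related GPU metrics (hypothetical container)."""
    volatile_correctable: int
    volatile_uncorrectable: int
    retired_pages_single_bit: int
    retired_pages_double_bit: int


def ecc_severity(sample, correctable_warn=100):
    """Classify a sample as 'ok', 'warning', or 'critical' (assumed policy)."""
    # Any uncorrectable error or double-bit page retirement is treated as critical.
    if sample.volatile_uncorrectable > 0 or sample.retired_pages_double_bit > 0:
        return "critical"
    # Many corrected errors or a single-bit page retirement suggests degradation.
    if (sample.volatile_correctable >= correctable_warn
            or sample.retired_pages_single_bit > 0):
        return "warning"
    return "ok"


healthy = ecc_severity(GpuEccSample(3, 0, 0, 0))
degrading = ecc_severity(GpuEccSample(250, 0, 1, 0))
failing = ecc_severity(GpuEccSample(10, 2, 0, 1))
```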

NPU

Metric name	Description	Units
NPU Device Health	Operational health state of the Neural Processing Unit (NPU) device.	Count
NPU AI Core Usage	Percentage of NPU AI core compute capacity currently in use.	Percentage
NPU Memory Usage	Percentage of total NPU memory currently consumed by active workloads.	Percentage
NPU AI CPU Usage	Percentage of the NPU's embedded AI CPU currently in use.	Percentage
NPU Control CPU Usage	Percentage of the NPU's control CPU currently in use for management operations.	Percentage
NPU Memory Bandwidth Usage	Percentage of available NPU memory bandwidth currently consumed.	Percentage
NPU Memory Frequency	Current operating frequency of the NPU memory subsystem.	MHz
NPU AI Core Frequency	Current operating frequency of the NPU AI core array.	MHz
NPU Used Memory	Amount of NPU memory currently allocated to active inference or training workloads.	MB
NPU Single Bit Errors	Number of single-bit ECC errors detected in NPU memory.	Count
NPU Double Bit Errors	Number of double-bit ECC errors detected in NPU memory, which are uncorrectable.	Count
NPU Power	Current power consumption of the NPU device.	Watts
NPU Temperature	Current operating temperature of the NPU device.	Celsius

Process

Metric name	Description	Units
Total Processes	Total number of processes currently present on the BMS in all states.	Count
Running Processes	Number of processes currently in the running state and actively consuming CPU.	Count
Idle Processes	Number of processes currently idle and not consuming CPU or waiting for I/O.	Count
Zombie Processes	Number of processes that have terminated but whose exit status has not yet been collected by a parent process.	Count
Blocked Processes	Number of processes currently blocked waiting for a resource such as disk I/O or a lock.	Count
Sleeping Processes	Number of processes in a sleep state, waiting for an event or timer before resuming execution.	Count
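
On Linux, process-state counts can be derived from the state letter in each `/proc/<pid>/stat` file. The sketch below tallies a handful of invented stat lines; the mapping from state letters to the metric names above (for example, `D` to blocked and `I` to idle) is an assumption for illustration.

```python
from collections import Counter

# Linux process state letters mapped to the categories used in this table.
STATE_NAMES = {
    "R": "running",
    "S": "sleeping",
    "D": "blocked",   # uninterruptible sleep, usually waiting on I/O
    "Z": "zombie",
    "I": "idle",      # idle kernel threads
}


def state_from_stat(stat_line):
    """Extract the state letter from a /proc/<pid>/stat line.

    The command name is wrapped in parentheses and may itself contain
    spaces or ')', so the state is the first field after the last ')'.
    """
    after_comm = stat_line[stat_line.rindex(")") + 1:]
    return after_comm.split()[0]


def summarize(stat_lines):
    """Count processes per category, plus a 'total' entry."""
    counts = Counter(
        STATE_NAMES.get(state_from_stat(line), "other") for line in stat_lines
    )
    counts["total"] = len(stat_lines)
    return counts


# Invented sample lines, truncated to the first few fields.
SAMPLE_LINES = [
    "1 (systemd) S 0 1 1",
    "42 (kworker/0:1) I 2 0 0",
    "977 (my app (v2)) R 1 977 977",
    "1200 (defunct) Z 977 977 977",
    "1301 (dd) D 977 1301 977",
]

summary = summarize(SAMPLE_LINES)
```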

Threshold configuration

You can configure thresholds and alerts for all BMS metrics to proactively detect performance degradation or connection issues.

Go to Admin > Configuration Profiles > Threshold and Availability.
Create or edit your Threshold Profile for BMS.
Assign the profile to the respective monitors to trigger alerts.
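
Threshold profiles are configured in the Site24x7 console, so no code is required. Purely to illustrate the idea of alerting on a sustained breach rather than a single spike, here is a small sketch; the function name, the 90 percent limit, and the three-poll window are hypothetical and do not describe how Site24x7 evaluates thresholds internally.

```python
def breaches_threshold(values, limit, polls_required=3):
    """Return True when the latest `polls_required` values all exceed `limit`."""
    if polls_required < 1:
        raise ValueError("polls_required must be at least 1")
    if len(values) < polls_required:
        return False  # not enough data yet to call it a sustained breach
    return all(value > limit for value in values[-polls_required:])


# Hypothetical CPU Usage readings from consecutive polls, oldest first.
sustained = breaches_threshold([62, 91, 93, 95], limit=90)
recovered = breaches_threshold([95, 93, 70], limit=90)
too_few = breaches_threshold([99, 99], limit=90)
```

Requiring several consecutive polls trades a slightly later alert for fewer false alarms on short-lived spikes.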

IT Automation

Use Site24x7's IT Automation to resolve common issues with BMS performance:

Go to Admin > IT Automation Templates. Then, click Add Automation Templates.
Create an automation rule by selecting the automation Type (e.g., Server reboot, clear queue).
Map the created rules to the BMS monitor so that they run automatically when an alert is triggered.

Configuration rules

Use Configuration Rules to simplify bulk setup across BMS instances. Automatically assign Threshold Profiles, Notification Profiles, Tags, and Monitor Groups when new monitors are discovered.
