How to Monitor Bare Metal Server (BMS) in Huawei Cloud
Site24x7 continuously monitors your Bare Metal Server (BMS) metrics, giving your team real-time visibility into CPU utilization, memory consumption, network throughput, TCP connection states, GPU and NPU health, and process activity across your bare metal instances.
Catch resource bottlenecks, hardware degradation, and network anomalies before they impact your workloads, and keep your bare metal infrastructure running at peak performance.
Use cases
CPU analysis: Monitor CPU usage breakdown (e.g., user, kernel, I/O wait) with load average to identify resource bottlenecks and avoid unnecessary scaling.
Network health: Track error and drop rates to detect NIC issues early, preventing connectivity problems caused by misconfigurations or network faults.
Hardware monitoring: Observe GPU/NPU health, temperature, and errors to detect hardware degradation early and prevent failures in AI or compute workloads.
Setup and configuration
BMS resources are auto-discovered and monitored during the Huawei Cloud integration. To enable monitoring, follow these steps:
- Navigate to Cloud > Huawei > Add Huawei Monitor. Follow the steps to add a Huawei Cloud monitor.
- While adding or editing a Huawei Cloud monitor, select BMS from the Service/Resource Types drop-down menu and click Save.
- Navigate to Cloud > Huawei, select the created Huawei monitor, and then click Bare Metal Server.
Supported metrics
CPU
| Metric name | Description | Units |
| CPU Usage | Overall percentage of CPU capacity currently consumed by all processes on the BMS. | Percentage |
| Idle CPU Usage | Percentage of CPU time the processor spent idle with no pending tasks. | Percentage |
| User Space CPU Usage | Percentage of CPU time spent executing user-space application processes. | Percentage |
| Kernel Space CPU Usage | Percentage of CPU time spent executing kernel-space operations on behalf of processes. | Percentage |
| Other Process CPU Usage | Percentage of CPU time consumed by activity not attributed to any of the other CPU categories in this table. | Percentage |
| Nice Process CPU Usage | Percentage of CPU time spent on processes running at a reduced scheduling priority (nice > 0). | Percentage |
| IO Wait CPU Usage | Percentage of CPU time spent waiting for disk I/O operations to complete. | Percentage |
| CPU Interrupt Time | Percentage of CPU time spent handling hardware interrupt requests. | Percentage |
| CPU Software Interrupt Time | Percentage of CPU time spent processing software interrupt requests. | Percentage |
| 1 Minute Load Average | Average number of processes in a runnable or uninterruptible state over the past one minute. | Count |
| 5 Minute Load Average | Average number of processes in a runnable or uninterruptible state over the past five minutes. | Count |
| 15 Minute Load Average | Average number of processes in a runnable or uninterruptible state over the past 15 minutes. | Count |
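As an illustration of how a CPU breakdown like the one in this table can be derived, the sketch below compares two samples of the aggregate `cpu` line from a Linux `/proc/stat` file. It assumes a Linux host and is not necessarily how the Huawei Cloud agent or Site24x7 computes these metrics; the sample counter values are hypothetical.

```python
# Field order of the aggregate "cpu" line in Linux /proc/stat (first eight values).
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Return cumulative tick counters from a /proc/stat 'cpu ...' line."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    return dict(zip(FIELDS, values))


def cpu_breakdown(before, after):
    """Percentage of CPU time spent in each state between two samples."""
    deltas = {name: after[name] - before[name] for name in before}
    total = sum(deltas.values())
    if total <= 0:
        raise ValueError("samples must be taken at different times")
    shares = {name: 100.0 * delta / total for name, delta in deltas.items()}
    # Overall usage is everything that is not idle time.
    shares["usage"] = 100.0 - shares["idle"]
    return shares


if __name__ == "__main__":
    # Hypothetical samples; on a real host, read /proc/stat twice a few seconds apart.
    first = parse_cpu_line("cpu 1000 50 300 8000 150 10 20 0")
    second = parse_cpu_line("cpu 1400 50 400 8400 230 10 40 0")
    for name, pct in cpu_breakdown(first, second).items():
        print(f"{name:8s} {pct:5.1f}%")
```

Working from the difference between two samples matters because the counters are cumulative since boot; a single reading only gives the long-run average.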
Memory
| Metric name | Description | Units |
| Available Memory | Amount of physical memory currently available for allocation to new processes. | GB |
| Memory Usage | Percentage of total physical memory currently in use by the OS and processes. | Percentage |
| Free Memory | Amount of physical memory not currently allocated to any process or cache. | GB |
| Memory Buffers | Amount of physical memory allocated to kernel I/O buffers for block device operations. | GB |
| Memory Cache | Amount of physical memory used as page cache for recently accessed files. | GB |
| Total Open Files | Total number of file descriptors currently open across all processes on the server. | Count |
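To make the relationship between these memory metrics concrete, here is a minimal sketch that reads values in the format of a Linux `/proc/meminfo` file. It assumes that Memory Usage is calculated as (total minus available) divided by total; that formula is an assumption made for illustration, and the sample figures are hypothetical.

```python
KB_PER_GB = 1024 * 1024


def parse_meminfo(text):
    """Map /proc/meminfo keys to their numeric values (kB for sized entries)."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            info[key.strip()] = int(fields[0])
    return info


def memory_summary(info):
    """Derive table-style memory metrics from parsed meminfo values."""
    total = info["MemTotal"]
    available = info["MemAvailable"]
    return {
        "available_gb": available / KB_PER_GB,
        "free_gb": info["MemFree"] / KB_PER_GB,
        "buffers_gb": info["Buffers"] / KB_PER_GB,
        "cache_gb": info["Cached"] / KB_PER_GB,
        # Assumed formula: memory that is not available counts as used.
        "usage_pct": 100.0 * (total - available) / total,
    }


if __name__ == "__main__":
    # Hypothetical sample; on a real host, read the text from /proc/meminfo.
    sample = """MemTotal:       16777216 kB
MemFree:         2097152 kB
MemAvailable:    8388608 kB
Buffers:          524288 kB
Cached:          4194304 kB"""
    for name, value in memory_summary(parse_meminfo(sample)).items():
        print(f"{name}: {value:.2f}")
```

Note that Free Memory and Available Memory differ: available memory also counts cache and buffers that the kernel can reclaim, so it is usually the more useful figure for alerting.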
Network
| Metric name | Description | Units |
| Inbound Bandwidth | Rate of data received by the server's network interface per second. | Bit/second |
| Outbound Bandwidth | Rate of data transmitted by the server's network interface per second. | Bit/second |
| Packet Receive Rate | Number of packets received per second by the network interface. | Count/second |
| Packet Send Rate | Number of packets transmitted per second by the network interface. | Count/second |
| Receive Error Rate | Rate of errors detected in packets received by the network interface. | Percentage |
| Transmit Error Rate | Rate of errors detected in packets transmitted by the network interface. | Percentage |
| Receive Drop Rate | Rate at which inbound packets are being dropped by the network interface. | Percentage |
| Transmit Drop Rate | Rate at which outbound packets are being dropped by the network interface. | Percentage |
| NTP Offset | Difference in milliseconds between the server's system clock and the NTP reference time source. | Milliseconds |
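The bandwidth, packet rate, and error and drop percentages above can all be derived from cumulative interface counters. The sketch below does this for two samples of one interface line in the Linux `/proc/net/dev` format. Treating the error and drop rates as a percentage of packets handled in the interval is an assumption for illustration, as are the interface name and counter values.

```python
def parse_net_dev_line(line):
    """Parse one interface line from /proc/net/dev into named counters."""
    name, _, counters = line.partition(":")
    v = [int(x) for x in counters.split()]
    return {
        "iface": name.strip(),
        # Receive columns: bytes packets errs drop fifo frame compressed multicast
        "rx_bytes": v[0], "rx_packets": v[1], "rx_errs": v[2], "rx_drop": v[3],
        # Transmit columns: bytes packets errs drop fifo colls carrier compressed
        "tx_bytes": v[8], "tx_packets": v[9], "tx_errs": v[10], "tx_drop": v[11],
    }


def interval_rates(before, after, seconds):
    """Derive per-second rates and percentages between two counter samples."""
    def delta(key):
        return after[key] - before[key]

    def pct(bad_key, packet_key):
        packets = delta(packet_key)
        return 100.0 * delta(bad_key) / packets if packets else 0.0

    return {
        "inbound_bps": delta("rx_bytes") * 8 / seconds,
        "outbound_bps": delta("tx_bytes") * 8 / seconds,
        "rx_pps": delta("rx_packets") / seconds,
        "tx_pps": delta("tx_packets") / seconds,
        "rx_error_pct": pct("rx_errs", "rx_packets"),
        "tx_error_pct": pct("tx_errs", "tx_packets"),
        "rx_drop_pct": pct("rx_drop", "rx_packets"),
        "tx_drop_pct": pct("tx_drop", "tx_packets"),
    }


if __name__ == "__main__":
    # Hypothetical samples taken 10 seconds apart.
    t0 = parse_net_dev_line("eth0: 1000000 10000 0 0 0 0 0 0 2000000 20000 0 0 0 0 0 0")
    t1 = parse_net_dev_line("eth0: 2250000 20000 50 100 0 0 0 0 2500000 30000 0 200 0 0 0 0")
    for name, value in interval_rates(t0, t1, seconds=10).items():
        print(f"{name}: {value}")
```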
TCP Connections
| Metric name | Description | Units |
| Total TCP Connections | Total number of TCP connections in all states currently tracked by the kernel. | Count |
| TCP ESTABLISHED | Number of TCP connections in the ESTABLISHED state, actively exchanging data. | Count |
| TCP SYN_SENT | Number of TCP connections in the SYN_SENT state, awaiting a remote SYN-ACK. | Count |
| TCP SYN_RECV | Number of TCP connections in the SYN_RECV state, having received a SYN and sent a SYN-ACK. | Count |
| TCP FIN_WAIT1 | Number of TCP connections in the FIN_WAIT1 state, having sent a FIN and awaiting acknowledgement. | Count |
| TCP FIN_WAIT2 | Number of TCP connections in the FIN_WAIT2 state, waiting for the remote side to send a FIN. | Count |
| TCP TIME_WAIT | Number of TCP connections in the TIME_WAIT state, waiting for the timeout period to expire. | Count |
| TCP CLOSE | Number of TCP connections in the CLOSE state. | Count |
| TCP CLOSE_WAIT | Number of TCP connections in the CLOSE_WAIT state, waiting for the local application to close the socket. | Count |
| TCP LAST_ACK | Number of TCP connections in the LAST_ACK state, waiting for the final ACK after sending a FIN. | Count |
| TCP LISTEN | Number of sockets in the LISTEN state, accepting incoming connection requests. | Count |
| TCP CLOSING | Number of TCP connections in the CLOSING state, where both sides have sent a FIN simultaneously. | Count |
| TCP Retransmission Rate | Rate at which TCP segments are being retransmitted due to loss or timeout. | Percentage |
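For readers who want to cross-check the per-state counts on a Linux host, the following sketch tallies connection states from text in the `/proc/net/tcp` format, where the fourth column holds the state as a hexadecimal code. It is an illustrative sketch with hypothetical sample rows, not the collection method used by the monitoring agent.

```python
from collections import Counter

# Linux kernel TCP state codes as they appear in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_tcp_states(table_text):
    """Count sockets per TCP state from /proc/net/tcp-style text."""
    counts = Counter()
    for line in table_text.splitlines()[1:]:  # first line is the column header
        fields = line.split()
        if len(fields) < 4:
            continue
        counts[TCP_STATES.get(fields[3].upper(), "UNKNOWN")] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical excerpt; on a real host, read /proc/net/tcp (and /proc/net/tcp6).
    sample = """  sl  local_address rem_address   st tx_queue rx_queue
   0: 0100007F:0277 00000000:0000 0A 00000000:00000000
   1: 0A00000F:0016 0A000002:D3A4 01 00000000:00000000
   2: 0A00000F:0016 0A000003:C1B2 01 00000000:00000000
   3: 0A00000F:8F3C 0A000004:01BB 06 00000000:00000000"""
    counts = count_tcp_states(sample)
    print("Total TCP Connections:", sum(counts.values()))
    for state, count in sorted(counts.items()):
        print(f"TCP {state}: {count}")
```

A steadily growing CLOSE_WAIT count usually points at an application that is not closing its sockets, while a large TIME_WAIT count is common on busy servers that open many short-lived connections.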
GPU
| Metric name | Description | Units |
| GPU Health Status | Operational health state of the GPU device, indicating whether the card is functioning normally. | Count |
| GPU Usage | Percentage of GPU compute capacity currently utilized by active workloads. | Percentage |
| GPU Memory Usage | Percentage of total GPU memory currently consumed by active workloads. | Percentage |
| GPU Encoder Usage | Percentage of the GPU's hardware video encoder currently in use. | Percentage |
| GPU Decoder Usage | Percentage of the GPU's hardware video decoder currently in use. | Percentage |
| GPU Free Memory | Amount of GPU memory not currently allocated to any workload. | MB |
| GPU Used Memory | Amount of GPU memory currently allocated to active workloads. | MB |
| GPU Temperature | Current operating temperature of the GPU. | Celsius |
| GPU Power Draw | Current power consumption of the GPU. | Watts |
| GPU Graphics Clocks | Current operating frequency of the GPU graphics engine. | MHz |
| GPU Memory Clocks | Current operating frequency of the GPU memory interface. | MHz |
| GPU SM Clocks | Current operating frequency of the GPU streaming multiprocessor array. | MHz |
| GPU Video Clocks | Current operating frequency of the GPU video encode/decode engine. | MHz |
| GPU Performance State | Current performance state level of the GPU, where lower values indicate higher performance modes. | Count |
| GPU Volatile Correctable ECC | Number of correctable (single-bit) ECC memory errors detected and corrected since the GPU driver was last loaded. | Count |
| GPU Volatile Uncorrectable ECC | Number of uncorrectable (multi-bit) ECC memory errors detected since the GPU driver was last loaded. | Count |
| GPU Aggregate Correctable ECC | Cumulative count of correctable single-bit ECC memory errors over the lifetime of the GPU, persisting across driver reloads and reboots. | Count |
| GPU Aggregate Uncorrectable ECC | Cumulative count of uncorrectable multi-bit ECC memory errors over the lifetime of the GPU, persisting across driver reloads and reboots. | Count |
| GPU Retired Page Single Bit | Number of GPU memory pages retired due to persistent single-bit ECC errors. | Count |
| GPU Retired Page Double Bit | Number of GPU memory pages retired due to uncorrectable double-bit ECC errors. | Count |
| GPU PCI Rx Throughput | Rate of data received by the GPU over the PCIe bus per second. | MB/second |
| GPU PCI Tx Throughput | Rate of data transmitted by the GPU over the PCIe bus per second. | MB/second |
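Several of the GPU metrics above can be spot-checked by hand. The sketch below parses CSV text in the shape produced by `nvidia-smi --query-gpu=index,temperature.gpu,utilization.gpu,memory.used,memory.total --format=csv,noheader,nounits`. It assumes the server has NVIDIA GPUs with `nvidia-smi` installed, and the sample rows are hypothetical; it is not a description of how the monitoring agent collects data.

```python
import csv
import io

# Column order matching the --query-gpu list named in the lead-in.
COLUMNS = ("index", "temperature_c", "gpu_util_pct", "mem_used_mb", "mem_total_mb")


def parse_gpu_rows(csv_text):
    """Turn nvidia-smi CSV rows (noheader, nounits) into one dict per GPU.

    Fields reported as unsupported (for example "[N/A]") are not handled
    here and would raise ValueError.
    """
    rows = []
    for raw in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if not raw:
            continue
        row = dict(zip(COLUMNS, (float(v) for v in raw)))
        row["index"] = int(row["index"])
        # GPU Memory Usage as a percentage of total GPU memory.
        row["mem_usage_pct"] = 100.0 * row["mem_used_mb"] / row["mem_total_mb"]
        rows.append(row)
    return rows


if __name__ == "__main__":
    # Hypothetical output for a server with two GPUs.
    sample = "0, 65, 87, 20480, 40960\n1, 48, 12, 4096, 40960\n"
    for gpu in parse_gpu_rows(sample):
        print(gpu)
```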
NPU
| Metric name | Description | Units |
| NPU Device Health | Operational health state of the Neural Processing Unit (NPU) device. | Count |
| NPU AI Core Usage | Percentage of NPU AI core compute capacity currently in use. | Percentage |
| NPU Memory Usage | Percentage of total NPU memory currently consumed by active workloads. | Percentage |
| NPU AI CPU Usage | Percentage of the NPU's embedded AI CPU currently in use. | Percentage |
| NPU Control CPU Usage | Percentage of the NPU's control CPU currently in use for management operations. | Percentage |
| NPU Memory Bandwidth Usage | Percentage of available NPU memory bandwidth currently consumed. | Percentage |
| NPU Memory Frequency | Current operating frequency of the NPU memory subsystem. | MHz |
| NPU AI Core Frequency | Current operating frequency of the NPU AI core array. | MHz |
| NPU Used Memory | Amount of NPU memory currently allocated to active inference or training workloads. | MB |
| NPU Single Bit Errors | Number of single-bit ECC errors detected in NPU memory. | Count |
| NPU Double Bit Errors | Number of double-bit ECC errors detected in NPU memory, which are uncorrectable. | Count |
| NPU Power | Current power consumption of the NPU device. | Watts |
| NPU Temperature | Current operating temperature of the NPU device. | Celsius |
Process
| Metric name | Description | Units |
| Total Processes | Total number of processes currently present on the BMS in all states. | Count |
| Running Processes | Number of processes currently in the running state and actively consuming CPU. | Count |
| Idle Processes | Number of processes currently idle and not consuming CPU or waiting for I/O. | Count |
| Zombie Processes | Number of processes that have terminated but whose exit status has not yet been collected by a parent process. | Count |
| Blocked Processes | Number of processes currently blocked waiting for a resource such as disk I/O or a lock. | Count |
| Sleeping Processes | Number of processes in a sleep state, waiting for an event or timer before resuming execution. | Count |
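The process counts above correspond closely to the one-letter state codes that Linux exposes in `/proc/<pid>/stat`. The sketch below tallies those codes from sample lines. The mapping from state letters to the metric names in this table is an assumption for illustration, and the sample lines are hypothetical and shortened.

```python
from collections import Counter

# Assumed mapping from Linux process state letters to the metric names above.
STATE_TO_METRIC = {
    "R": "Running Processes",
    "S": "Sleeping Processes",
    "D": "Blocked Processes",   # uninterruptible sleep, typically disk I/O
    "Z": "Zombie Processes",
    "I": "Idle Processes",      # idle kernel threads on newer kernels
}


def state_from_stat(stat_line):
    """Extract the state letter from a /proc/<pid>/stat line.

    The command name is wrapped in parentheses and may itself contain spaces
    or parentheses, so the state is the first field after the LAST ')'.
    """
    _, _, tail = stat_line.rpartition(")")
    return tail.split()[0]


def summarize(stat_lines):
    """Count processes per metric, plus the overall total."""
    counts = Counter({name: 0 for name in STATE_TO_METRIC.values()})
    total = 0
    for line in stat_lines:
        total += 1
        metric = STATE_TO_METRIC.get(state_from_stat(line))
        if metric:
            counts[metric] += 1
    counts["Total Processes"] = total
    return dict(counts)


if __name__ == "__main__":
    # Hypothetical, truncated stat lines; real lines carry many more fields.
    sample = [
        "1 (systemd) S 0 1 1 0",
        "87 (kworker/0:1) I 2 0 0 0",
        "412 (my (odd) app) R 1 412 412 0",
        "950 (dd) D 412 950 412 0",
        "951 (defunct) Z 412 951 412 0",
        "960 (bash) S 412 960 960 0",
    ]
    for metric, count in summarize(sample).items():
        print(f"{metric}: {count}")
```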
Threshold configuration
You can configure thresholds and alerts for all BMS metrics to proactively detect performance degradation or connection issues.
- Go to Admin > Configuration Profiles > Threshold and Availability.
- Create or edit your Threshold Profile for BMS.
- Assign the profile to the respective monitors to trigger alerts.
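To illustrate what a threshold on a BMS metric expresses, here is a minimal sketch of a check with warning and critical levels that must be breached for a number of consecutive polls. The `Threshold` class, its fields, and the `evaluate` function are hypothetical and written only for this example; they do not represent Site24x7's threshold engine or API.

```python
from dataclasses import dataclass


@dataclass
class Threshold:
    """Hypothetical threshold definition for a single metric."""
    metric: str
    warning: float
    critical: float
    consecutive_polls: int = 1  # polls that must breach before alerting


def evaluate(threshold, samples):
    """Return 'critical', 'warning', or 'ok' for samples ordered oldest first."""
    recent = samples[-threshold.consecutive_polls:]
    if len(recent) < threshold.consecutive_polls:
        return "ok"  # not enough data collected yet
    if all(value >= threshold.critical for value in recent):
        return "critical"
    if all(value >= threshold.warning for value in recent):
        return "warning"
    return "ok"


if __name__ == "__main__":
    # Example values chosen for illustration only.
    io_wait = Threshold("IO Wait CPU Usage", warning=20, critical=40, consecutive_polls=3)
    print(evaluate(io_wait, [5, 25, 30, 45]))   # last three polls are all >= 20
    print(evaluate(io_wait, [45, 50, 41]))      # last three polls are all >= 40
```

Requiring several consecutive breaches is a common way to avoid alerting on short spikes, at the cost of a slightly slower alert.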
IT Automation
Use Site24x7's IT Automation to resolve common issues with BMS performance:
- Go to Admin > IT Automation Templates. Then, click Add Automation Templates.
- Create an automation rule by selecting the automation type (e.g., server reboot or clear queue).
- Map the created rules to the BMS monitor so that they run automatically when an alert is triggered.
Configuration rules
Use Configuration Rules to simplify bulk setup across BMS instances. Automatically assign Threshold Profiles, Notification Profiles, Tags, and Monitor Groups when new monitors are discovered.
