How to monitor GaussDB openGauss in Huawei Cloud
Site24x7 provides comprehensive monitoring for openGauss on GaussDB, including CPU usage, session activity, SQL performance, write-ahead log (WAL) volume, and disaster recovery replication.
This solution equips enterprise database administrators (DBAs) with full-stack visibility, enabling them to effectively maintain service-level agreements (SLAs) for critical online transaction processing (OLTP) workloads.
Use cases
Idle transaction detection: Alerts for idle-in-transaction sessions help prevent lock accumulation, which can degrade concurrent query throughput.
Disaster recovery compliance: Monitoring the shard recovery point objective (RPO) and recovery time objective (RTO) helps you verify adherence to regulatory recovery objectives.
WAL space management: Alerts for inactive replication slots and transaction log (xlog) size help prevent excessive WAL accumulation, safeguarding disk space.
Setup and configuration
GaussDB openGauss resources are auto-discovered and monitored during the Huawei Cloud integration. To enable monitoring, follow the steps below:
- Navigate to Cloud > Huawei > Add Huawei Monitor. Follow the steps to add a Huawei Cloud monitor.
- While adding or editing a Huawei Cloud monitor, select GaussDB openGauss from the Service/Resource Types drop-down and click Save.
- Navigate to Cloud > Huawei, select the created Huawei monitor, and then click GaussDB openGauss to view the performance metrics.
Supported metrics
CPU and memory
| Metric name | Description | Units |
| CPU Usage | Overall percentage of CPU capacity currently consumed by the GaussDB V5 instance. | Percentage |
| Memory Usage | Percentage of total memory capacity currently in use by the GaussDB V5 instance. | Percentage |
| User-mode CPU Time Percentage | Proportion of CPU time spent executing user-space database processes. | Percentage |
| Kernel-mode CPU Time Percentage | Proportion of CPU time spent executing kernel-space operations on behalf of the database. | Percentage |
| Disk I/O Wait Time Percentage | Proportion of CPU time spent waiting for disk I/O operations to complete. | Percentage |
| Swap Memory Usage | Percentage of total swap space currently in use by the instance. | Percentage |
| Total Swap Memory | Total amount of swap space configured and available to the instance. | MB |
Network
| Metric name | Description | Units |
| Data Write Volume | Volume of data received by the instance from clients per second. | Byte |
| Outgoing Data Volume | Volume of data sent by the instance to clients per second. | Byte |
| Retransmission Ratio | Proportion of TCP segments that are retransmitted, indicating network reliability issues. | Percentage |
Disk and I/O
| Metric name | Description | Units |
| Disk IOPS | Total number of read and write I/O operations completed per second on the instance storage. | IOPS |
| Disk Write Throughput | Volume of data written to disk per second by the instance. | KB/second |
| Disk Read Throughput | Volume of data read from disk per second by the instance. | KB/second |
| Average Time per Disk Write | Average time in milliseconds needed to complete a single disk write operation. | Milliseconds |
| Average Time per Disk Read | Average time in milliseconds needed to complete a single disk read operation. | Milliseconds |
| Disk I/O Bandwidth Usage | Percentage of total available disk I/O bandwidth being consumed. | Percentage |
| IOPS Usage | Percentage of the provisioned IOPS limit currently in use by the instance. | Percentage |
Instance and disk
| Metric name | Description | Units |
| Used Instance Disk Size | Amount of disk storage currently consumed by instance-level data files. | MB |
| Total Instance Disk Size | Total disk capacity allocated to the GaussDB V5 instance. | MB |
| Instance Disk Usage | Percentage of the total instance disk capacity currently in use. | Percentage |
| Buffer Hit Rate | Percentage of data block read requests served from the shared buffer cache without requiring a disk read. | Percentage |
| Deadlocks | Number of deadlock events detected and resolved by the database engine in the monitoring period. | Count |
| Response Time of 80% SQL Statements | Latency value below which 80% of all SQL statement executions completed (P80). | Milliseconds |
| Response Time of 95% SQL Statements | Latency value below which 95% of all SQL statement executions completed (P95). | Milliseconds |
| System Database Size | Total size of all system database schemas on the instance. | MB |
| User Database Total Size | Combined size of all user-created databases on the instance. | MB |
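To make the percentile and hit-rate definitions above concrete, here is a minimal Python sketch. It assumes the nearest-rank percentile method and a simple hits/(hits + disk reads) ratio; the function names and sample values are hypothetical, and GaussDB or Site24x7 may compute these metrics differently.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Return the smallest sample such that at least pct% of samples are <= it.

    Uses the nearest-rank method (an assumption, not the documented GaussDB method).
    """
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank into the sorted list
    return ordered[rank - 1]


def buffer_hit_rate(buffer_hits, disk_reads):
    """Percentage of block read requests served from the buffer cache."""
    total = buffer_hits + disk_reads
    if total == 0:
        return 0.0  # no read activity in the period
    return 100.0 * buffer_hits / total


# Hypothetical SQL latencies in milliseconds for one monitoring period.
latencies_ms = [12, 15, 20, 22, 30, 35, 40, 80, 120, 400]
p80 = percentile_nearest_rank(latencies_ms, 80)
p95 = percentile_nearest_rank(latencies_ms, 95)
hit_rate = buffer_hit_rate(buffer_hits=990, disk_reads=10)
```

A large gap between the P80 and P95 values, as in this sample, usually points to a small set of slow statements rather than a general slowdown.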
Component and disk
| Metric name | Description | Units |
| Used Disk Size | Amount of disk storage consumed by component-level data (separate from instance-level accounting). | MB |
| Total Disk Size | Total disk capacity available at the component level. | MB |
| Disk Usage | Percentage of component-level disk capacity currently in use. | Percentage |
Replication
| Metric name | Description | Units |
| Primary Node Flow Control Duration | Duration during which the primary node is applying flow control to limit replication write speed. | Milliseconds |
| Standby RTO Duration | Estimated RTO for the standby node based on current redo progress. | Milliseconds |
| Standby Node Redo Progress | Measure of how far the standby node has progressed in replaying redo logs relative to the primary. | Byte |
| WAL Log Size in Replication Slot | Total size of WAL log data retained in active replication slots pending consumption by replicas. | Byte |
| Xlog Rate | Rate at which xlog sequence numbers are advancing on the primary node. | Byte |
| Shard Log Gap of DR Cluster | Difference in log volume between the primary shard and the disaster recovery cluster. | Byte |
| Size of Shard Logs to Be Replayed | Volume of shard logs on the DR cluster that have been received but not yet replayed. | Byte |
| Flushing Rate of Shard Logs in DR | Rate at which shard logs are being flushed to disk in the disaster recovery cluster. | Byte/second |
| Replay Rate of Shard Logs in DR | Rate at which shard logs are being replayed by the disaster recovery cluster. | Byte/second |
| Shard RPO | RPO for the shard, reflecting the maximum potential data loss in the event of failover. | Seconds |
| Shard RTO | RTO for the shard, reflecting the estimated time to recover after a failure. | Milliseconds |
| Inactive Replication Slots | Number of replication slots that are no longer being consumed, causing WAL log accumulation on the primary. | Count |
| Size of Read Replica Logs Not Replayed | Volume of logs received by read replica nodes that have not yet been applied. | MB |
| Difference Between Redo and Receipt Positions | Gap between the volume of logs received and the volume of logs replayed on the standby node. | MB |
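The relationship between received logs, replayed logs, and replay rate can be sketched as follows. This is an illustrative Python calculation only: the class, field names, and the catch-up estimate are assumptions made for this example and do not represent how GaussDB derives its RTO or redo-gap metrics.

```python
from dataclasses import dataclass


@dataclass
class StandbySample:
    """Hypothetical snapshot of log positions on a standby node."""
    received_mb: float  # log volume received from the primary
    replayed_mb: float  # log volume already replayed (redo applied)


def replay_gap_mb(sample):
    """Logs received but not yet replayed, never negative."""
    return max(sample.received_mb - sample.replayed_mb, 0.0)


def estimated_catch_up_seconds(gap_mb, replay_rate_mb_per_s):
    """Rough time needed to replay the backlog at the current replay rate.

    Returns None when the replay rate is zero or negative, because the
    backlog cannot drain in that case.
    """
    if replay_rate_mb_per_s <= 0:
        return None
    return gap_mb / replay_rate_mb_per_s


sample = StandbySample(received_mb=1500.0, replayed_mb=1200.0)
gap = replay_gap_mb(sample)
catch_up = estimated_catch_up_seconds(gap, replay_rate_mb_per_s=50.0)
```

A gap that keeps growing while the replay rate stays flat suggests the standby cannot keep up with the primary, which is the condition the replay and RTO metrics above are meant to surface.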
Sessions
| Metric name | Description | Units |
| User Logins per Second | Rate at which new user login events are occurring on the instance per second. | Count/second |
| User Logouts per Second | Rate at which user session logout events are occurring on the instance per second. | Count/second |
| Lock Waiting Session Ratio | Proportion of active sessions currently blocked waiting to acquire a lock. | Percentage |
| Active Session Rate | Proportion of total sessions that are actively executing queries. | Percentage |
| CN Connections | Number of connections currently established to the coordinator node. | Count |
| Online Sessions | Total number of user sessions currently connected to the instance. | Count |
| Active Sessions | Number of sessions currently executing a query or transaction. | Count |
| Online Session Rate | Proportion of the maximum session limit currently occupied by connected sessions. | Percentage |
| Sessions Waiting for Locks | Number of sessions currently blocked waiting to acquire a database lock. | Count |
| Waiting Sessions | Total number of sessions in any waiting state, including lock waits and I/O waits. | Count |
| Online Clients | Number of distinct client addresses with active connections to the instance. | Count |
| Active Clients | Number of distinct client addresses that have executed at least one query recently. | Count |
| Login Attempts with Incorrect Passwords | Number of authentication failures due to incorrect passwords in the monitoring period. | Count |
Transactions
| Metric name | Description | Units |
| User Committed Transactions per Second | Rate of user-initiated transactions successfully committed per second. | Count/second |
| User Rollback Transactions per Second | Rate of user-initiated transactions rolled back per second. | Count/second |
| Background Committed Transactions per Second | Rate of background process transactions committed per second, such as autovacuum and checkpoint work. | Count/second |
| Background Rollback Transactions per Second | Rate of background process transactions rolled back per second. | Count/second |
| Average Response Time of User Transactions | Mean time in milliseconds taken to complete a user-initiated transaction from start to commit or rollback. | Milliseconds |
| User Transaction Rollback Rate | Proportion of user transactions that ended in a rollback rather than a commit. | Percentage |
| Background Transaction Rollback Rate | Proportion of background transactions that ended in a rollback. | Percentage |
| Maximum Execution Duration of DB Transactions | Elapsed time of the longest currently running transaction on the instance. | Milliseconds |
| Idle Transactions | Number of transactions that are open but currently idle, holding locks or resources without executing. | Count |
| Oldest Two-Phase Commit Transaction Duration | Age of the oldest prepared (two-phase commit) transaction still pending resolution. | Milliseconds |
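As a small worked example, a rollback rate of this kind can be derived from the commit and rollback rates listed in the table. The Python sketch below assumes the rate is rollbacks divided by all finished transactions; the function name and input values are hypothetical.

```python
def rollback_rate(commits_per_s, rollbacks_per_s):
    """Percentage of finished transactions that ended in a rollback."""
    total = commits_per_s + rollbacks_per_s
    if total == 0:
        return 0.0  # no transactions finished in the period
    return 100.0 * rollbacks_per_s / total


# Hypothetical readings: 95 commits/s and 5 rollbacks/s.
user_rate = rollback_rate(commits_per_s=95.0, rollbacks_per_s=5.0)
```

Here 5 of every 100 finished transactions were rolled back, so the rate is 5%.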
SQL
| Metric name | Description | Units |
| Data Definition Language/s | Rate of DDL statements (CREATE, ALTER, DROP) executed per second. | Count/second |
| Data Manipulation Language/s | Rate of DML statements (INSERT, UPDATE, DELETE, SELECT) executed per second. | Count/second |
| Data Control Language/s | Rate of DCL statements (GRANT, REVOKE) executed per second. | Count/second |
| DDL and DCL Rate | Combined rate of DDL and DCL statements as a proportion of total SQL activity. | Percentage |
| Slow SQL Statements in System Database | Number of slow-running SQL statements detected in system database schemas. | Count |
| Slow SQL Statements in User Database | Number of slow-running SQL statements detected in user-created database schemas. | Count |
| SELECT Distribution | Distribution of SELECT statement execution across nodes or time, showing read workload spread. | Count |
| UPDATE Distribution | Distribution of UPDATE statement execution across nodes or time, showing write workload spread. | Count |
| INSERT Distribution | Distribution of INSERT statement execution across nodes or time. | Count |
| DELETE Distribution | Distribution of DELETE statement execution across nodes or time. | Count |
| SQL Statements | Total count of SQL statements executed in the monitoring period. | Count |
| Read Requests | Total number of read (SELECT) requests processed by the instance per second. | Count/second |
| INSERT Request Response Time | Average response time in milliseconds for INSERT statement execution. | Milliseconds |
| UPDATE Request Response Time | Average response time in milliseconds for UPDATE statement execution. | Milliseconds |
| DELETE Request Response Time | Average response time in milliseconds for DELETE statement execution. | Milliseconds |
| Read Request Response Time | Average response time in milliseconds for SELECT statement execution. | Milliseconds |
Resources
| Metric name | Description | Units |
| Used Dynamic Memory | Amount of dynamic memory currently allocated for query execution and other runtime operations. | MB |
| Dynamic Memory Usage | Percentage of the dynamic memory pool currently consumed by active operations. | Percentage |
| Used Memory | Total process memory currently consumed by the GaussDB V5 database process. | MB |
| Thread Pool Usage | Percentage of the thread pool capacity currently occupied by active worker threads. | Percentage |
Xlog
| Metric name | Description | Units |
| Data Volume to Be Flushed to Disks | Amount of dirty data in memory that has not yet been written to disk by the checkpoint process. | MB |
| Physical Reads per Second | Number of physical disk read operations performed by the database engine per second. | Count/second |
| Physical Writes per Second | Number of physical disk write operations performed by the database engine per second. | Count/second |
| Xlogs | Total number of xlog files currently present on the instance. | Count |
| Xlog Size | Total disk space consumed by xlog files on the instance. | MB |
| CN Temporary Directory Size | Disk space consumed by temporary files in the coordinator node's temporary directory. | MB |
Status
| Metric name | Description | Units |
| DN Status | Health and operational status of the Data Node instances in the GaussDB V5 cluster. | Status |
| Replication Slot Directory Size | Total disk space consumed by all replication slot directories on the primary node. | MB |
| Heap and Index Tuple Inconsistencies | Number of detected inconsistencies between heap table tuples and their corresponding index entries, indicating data integrity issues. | Count |
Threshold configuration
You can configure thresholds and alerts for all GaussDB openGauss metrics to proactively detect performance degradation or connection issues.
- Go to Admin > Configuration Profiles > Threshold and Availability.
- Create or edit your Threshold Profile for GaussDB openGauss.
- Assign the profile to the respective monitors to trigger alerts.
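Conceptually, a threshold profile compares each polled metric value against warning and critical limits. The Python sketch below models that idea under stated assumptions: the `Threshold` class, the `polls_required` field, and the status strings are hypothetical and are not part of any Site24x7 API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """Hypothetical threshold definition for a single metric."""
    warning: float
    critical: float
    polls_required: int = 1  # consecutive breaching polls before status changes


def evaluate(values, threshold):
    """Return 'critical', 'warning', or 'ok' based on the most recent polls."""
    recent = values[-threshold.polls_required:]
    if len(recent) < threshold.polls_required:
        return "ok"  # not enough data points yet to decide
    if all(v >= threshold.critical for v in recent):
        return "critical"
    if all(v >= threshold.warning for v in recent):
        return "warning"
    return "ok"


# Hypothetical Instance Disk Usage readings (percent), oldest first.
disk_usage = Threshold(warning=80.0, critical=90.0, polls_required=2)
status = evaluate([70.0, 92.0, 96.0], disk_usage)
```

Requiring more than one consecutive breaching poll is a common way to avoid alerting on a single short spike.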
IT Automation
Use Site24x7's IT Automation to resolve common issues with GaussDB openGauss performance:
- Go to Admin > IT Automation Templates. Then, click Add Automation Templates.
- Create an automation rule by selecting the automation Type (e.g., Server reboot, clear queue).
- Map the created rules to the GaussDB openGauss monitor for automatic execution during alerts.
Configuration rules
Use Configuration Rules to simplify bulk setup across GaussDB openGauss instances. Automatically assign Threshold Profiles, Notification Profiles, Tags, and Monitor Groups when new monitors are discovered.
