Alibaba Cloud Elastic High Performance Computing (E-HPC) Monitoring Integration
Site24x7 provides in-depth monitoring for Alibaba Cloud Elastic High Performance Computing (E-HPC) clusters, giving visibility into compute utilization, job activity, and resource distribution. By tracking metrics such as CPU usage, memory consumption, node performance, job execution status, and queue-level utilization, Site24x7 helps administrators ensure efficient scheduling, resource allocation, and system stability. Once your Alibaba Cloud account is integrated, all E-HPC clusters are automatically discovered and monitored.
Use cases
- Cluster health tracking: Monitor total and used CPUs, memory, and node utilization across clusters.
- Job monitoring: Identify running, queued, and failed jobs to optimize scheduling.
- User and project insights: Track CPU and memory consumption at user and project levels.
- Queue optimization: Analyze queue-level performance for efficient job distribution.
- Proactive alerts: Configure automations and alerts for high utilization or job failures.
Setup and configuration
- Log in to your Site24x7 account and navigate to Cloud > Alibaba Cloud > Add Monitor.
- On the Edit Alibaba Cloud Monitor page, select Elastic High Performance Computing (E-HPC) from the Service Types list.
- Once added, go to Cloud > Alibaba > E-HPC to view dashboards and performance metrics.
Supported metrics
Cluster Health
| Metric name | Description | Unit |
|---|---|---|
| Cluster Total CPUs | The total number of CPUs available in the E-HPC cluster. | Count |
| Cluster Total Memory | The total amount of memory available across all nodes in the cluster. | Bytes |
| Cluster Total Nodes | The total number of nodes in the E-HPC cluster. | Count |
| Cluster Used CPUs | The number of CPUs currently in use within the cluster. | Count |
| Cluster Used Core Time | The total core time consumed by all running jobs in the cluster. | Seconds |
| Used Memory | The total memory currently in use across all nodes in the cluster. | Bytes |
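
The totals and used values above can be combined into a utilization percentage, which is often easier to reason about than raw counts. The following is a minimal sketch under stated assumptions: the dictionary keys and sample values are hypothetical and do not represent a Site24x7 API or export format.

```python
def utilization_pct(used: float, total: float) -> float:
    """Return used/total as a percentage, or 0.0 when total is not positive."""
    if total <= 0:
        return 0.0
    return round(100.0 * used / total, 2)


# Hypothetical sample values for illustration only, not real monitor output.
sample = {
    "cluster_total_cpus": 256,
    "cluster_used_cpus": 192,
    "cluster_total_memory": 1024 * 2**30,  # bytes
    "used_memory": 512 * 2**30,            # bytes
}

cpu_pct = utilization_pct(sample["cluster_used_cpus"], sample["cluster_total_cpus"])
mem_pct = utilization_pct(sample["used_memory"], sample["cluster_total_memory"])
print(cpu_pct, mem_pct)  # 75.0 50.0
```

Guarding against a zero total avoids a division error for a cluster that reports no nodes yet.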
Node Performance
| Metric name | Description | Unit |
|---|---|---|
| Node CPU Usage by Cluster | The percentage of CPU utilization for each node within the cluster. | Percentage |
| Node Memory Usage by Cluster | The percentage of memory utilization for each node within the cluster. | Percentage |
| Node Used CPU by Cluster | The number of CPUs used by each node in the cluster. | Count |
| Node Used Memory by Cluster | The amount of memory used by each node in the cluster. | Bytes |
| Node Load 1m by Cluster | The system load average over the past one minute for each node. | Count |
| Node Network In Rate by Cluster | The rate of inbound network traffic to each node in the cluster. | Bytes/second |
Job Execution
| Metric name | Description | Unit |
|---|---|---|
| Running Jobs | The number of jobs currently running in the cluster. | Count |
| Queued Jobs | The number of jobs waiting in the queue for scheduling. | Count |
| Finished Jobs | The total number of successfully completed jobs. | Count |
| Failed Job Number by Cluster | The number of failed jobs within the cluster. | Count |
| Suspended Job Number by Cluster | The number of suspended jobs in the cluster. | Count |
| Job Run Duration by Cluster | The average runtime of jobs currently executing in the cluster. | Seconds |
| Job Wait Duration by Cluster | The average waiting time of jobs before execution. | Seconds |
| Running Job Number by Cluster | The total number of active jobs currently running. | Count |
| Pending Job Number by Cluster | The number of jobs pending resource allocation. | Count |
| Created Jobs by Cluster | The total number of jobs created in the cluster. | Count |
User/Project-Level
| Metric name | Description | Unit |
|---|---|---|
| Job Run Duration by User | The average duration of jobs run by a user. | Seconds |
| Job Used CPU by User | The total number of CPUs used by jobs run by a user. | Count |
| Job Used Memory by User | The total memory used by a user's jobs. | Bytes |
| Job CPU Usage by User | The percentage of CPU usage for jobs executed by a user. | Percentage |
| Job Memory Usage by User | The percentage of memory usage for jobs executed by a user. | Percentage |
| Job Run Duration by Project | The average runtime of jobs associated with a project. | Seconds |
| Job Used CPU by Project | The total number of CPUs used by all jobs under a project. | Count |
| Job Used Memory by Project | The total memory used by all jobs under a project. | Bytes |
Queue
| Metric name | Description | Unit |
|---|---|---|
| Job Number by Queue | The total number of jobs assigned to each queue. | Count |
| Queued Jobs by Queue | The number of jobs waiting in each queue. | Count |
| Running Job Number by Queue | The number of jobs currently running in each queue. | Count |
| Pending Job Number by Queue | The number of jobs pending in each queue. | Count |
| Queue Used CPUs | The number of CPUs currently used by jobs in each queue. | Count |
| Queue Total CPUs | The total number of CPUs allocated to each queue. | Count |
| Queue Total Memory | The total memory allocated to each queue. | Bytes |
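
One way to use the queue metrics together is to flag queues that are nearly full while jobs are still waiting, since those are the likeliest candidates for rebalancing. This is an illustrative sketch only: the `QueueSample` type, the `saturated_queues` helper, the 90% cutoff, and the sample data are all assumptions, not part of Site24x7.

```python
from dataclasses import dataclass


@dataclass
class QueueSample:
    # Hypothetical container mirroring the queue metrics in the table above.
    name: str
    total_cpus: int
    used_cpus: int
    queued_jobs: int


def saturated_queues(samples, cpu_threshold_pct=90.0):
    """Return names of queues at or above the CPU threshold that still have jobs waiting."""
    result = []
    for q in samples:
        if q.total_cpus <= 0:
            continue  # no capacity reported; utilization is undefined
        pct = 100.0 * q.used_cpus / q.total_cpus
        if pct >= cpu_threshold_pct and q.queued_jobs > 0:
            result.append(q.name)
    return result


queues = [
    QueueSample("batch", total_cpus=128, used_cpus=124, queued_jobs=7),
    QueueSample("debug", total_cpus=16, used_cpus=4, queued_jobs=0),
    QueueSample("gpu", total_cpus=64, used_cpus=64, queued_jobs=0),  # full, nothing waiting
]
print(saturated_queues(queues))  # ['batch']
```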
Job Resource Request and Allocation
| Metric name | Description | Unit |
|---|---|---|
| Job Required CPU by User | The total number of CPUs requested by a user's jobs. | Count |
| Job Required Memory by User | The total memory requested by a user's jobs. | Bytes |
| Running Job Required CPU by Cluster | The total number of CPUs requested by running jobs in the cluster. | Count |
| Running Job Required Memory by Cluster | The total memory requested by running jobs in the cluster. | Bytes |
| Pending Job Required CPU by Cluster | The total number of CPUs requested by pending jobs in the cluster. | Count |
| Pending Job Required Memory by Cluster | The total memory requested by pending jobs in the cluster. | Bytes |
| Job Required CPU by Queue | The total number of CPUs requested by jobs in each queue. | Count |
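
Comparing the CPUs requested by pending jobs with the CPUs still free in the cluster gives a rough indication of whether the backlog can be absorbed without adding nodes. The sketch below assumes this simple subtraction is a reasonable approximation; the function name and the input values are hypothetical.

```python
def cpu_shortfall(total_cpus: int, used_cpus: int, pending_required_cpus: int) -> int:
    """Return how many requested CPUs cannot be served by currently free CPUs."""
    free = max(total_cpus - used_cpus, 0)
    return max(pending_required_cpus - free, 0)


# Hypothetical values: 256 total CPUs, 192 in use, pending jobs requesting 100.
shortfall = cpu_shortfall(256, 192, 100)
print(shortfall)  # 36
```

A result of zero suggests the pending jobs fit within free capacity; a positive result is the number of additional CPUs the backlog would need.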
Threshold configuration
- Go to Admin > Configuration Profiles > Threshold and Availability.
- Create or edit a threshold profile for E-HPC.
- Assign the profile to the respective monitors to trigger alerts.
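
Conceptually, a threshold profile is a set of conditions that each poll's metric values are checked against. The following sketch illustrates that idea in plain Python; it is not Site24x7's implementation, and the `Rule` type, metric keys, and severity labels are assumptions chosen for the example.

```python
import operator
from dataclasses import dataclass

# Supported comparison operators for the hypothetical rules below.
OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}


@dataclass
class Rule:
    metric: str     # hypothetical metric key
    op: str         # one of the keys in OPS
    limit: float
    severity: str   # illustrative label only


def evaluate(rules, sample):
    """Return (metric, severity) pairs for every rule the sample breaches."""
    breaches = []
    for rule in rules:
        value = sample.get(rule.metric)
        if value is None:
            continue  # metric missing from this poll; skip the rule
        if OPS[rule.op](value, rule.limit):
            breaches.append((rule.metric, rule.severity))
    return breaches


rules = [
    Rule("failed_job_number", ">", 0, "Trouble"),
    Rule("node_cpu_usage_pct", ">=", 90, "Critical"),
]
sample = {"failed_job_number": 3, "node_cpu_usage_pct": 72.5}
print(evaluate(rules, sample))  # [('failed_job_number', 'Trouble')]
```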
IT automation
Site24x7's IT Automation tools help resolve performance degradation issues automatically. The alarm engine continuously evaluates system events against the thresholds you have defined and, when a breach occurs, runs the mapped automation.
- Go to Admin > IT Automation Templates.
- Create a new automation rule.
- Map the rule to the monitor for proactive resolution.
How to configure IT Automation for a monitor
Configuration rules
With Site24x7's Configuration Rules, you can set parameters such as Threshold Profile, Notification Profile, Tags, and Monitor Group for multiple monitors at once. Rules also apply these settings automatically whenever new E-HPC monitors are added.
How to add a Configuration Rule
