Alibaba Cloud Elastic High Performance Computing (E-HPC) Monitoring Integration

Site24x7 provides in-depth monitoring for Alibaba Cloud Elastic High Performance Computing (E-HPC) clusters, giving visibility into compute utilization, job activity, and resource distribution. By tracking metrics such as CPU usage, memory consumption, node performance, job execution status, and queue-level utilization, Site24x7 helps administrators ensure efficient scheduling, resource allocation, and system stability. Once your Alibaba Cloud account is integrated, all E-HPC clusters are automatically discovered and monitored.

Use cases

Cluster health tracking: Monitor total and used CPUs, memory, and node utilization across clusters.
Job monitoring: Identify running, queued, and failed jobs to optimize scheduling.
User and project insights: Track CPU and memory consumption at user and project levels.
Queue optimization: Analyze queue-level performance for efficient job distribution.
Proactive alerts: Configure automations and alerts for high utilization or job failures.

Setup and configuration

Log in to your Site24x7 account and navigate to Cloud > Alibaba Cloud > Add Monitor.
On the Edit Alibaba Cloud Monitor page, select Elastic High Performance Computing (E-HPC) from the Service Types list.
Once the service is added, go to Cloud > Alibaba Cloud > E-HPC to view dashboards and performance metrics.

Supported metrics

Cluster Health

Metric name	Description	Unit
Cluster Total CPUs	The total number of CPUs available in the E-HPC cluster.	Count
Cluster Total Memory	The total amount of memory available across all nodes in the cluster.	Bytes
Cluster Total Nodes	The total number of nodes in the E-HPC cluster.	Count
Cluster Used CPUs	The number of CPUs currently in use within the cluster.	Count
Cluster Used Core Time	The total core time consumed by all running jobs in the cluster.	Seconds
Used Memory	The total memory currently in use across all nodes in the cluster.	Bytes
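
The used and total values above are often easier to read as a utilization percentage. The following is a minimal sketch of that arithmetic, assuming the metric values have already been read from the dashboard or an export. The `ClusterHealth` class, its field names, and the sample numbers are hypothetical and are not part of Site24x7.

```python
from dataclasses import dataclass


@dataclass
class ClusterHealth:
    """Hypothetical container for the Cluster Health metrics listed above."""
    total_cpus: int     # Cluster Total CPUs (Count)
    used_cpus: int      # Cluster Used CPUs (Count)
    total_memory: int   # Cluster Total Memory (Bytes)
    used_memory: int    # Used Memory (Bytes)


def utilization_percent(used: float, total: float) -> float:
    """Return used/total as a percentage, or 0.0 when total is not positive."""
    if total <= 0:
        return 0.0
    return 100.0 * used / total


# Illustrative sample values only.
sample = ClusterHealth(
    total_cpus=128,
    used_cpus=96,
    total_memory=512 * 1024**3,
    used_memory=128 * 1024**3,
)
cpu_pct = utilization_percent(sample.used_cpus, sample.total_cpus)
mem_pct = utilization_percent(sample.used_memory, sample.total_memory)
print(f"CPU: {cpu_pct:.1f}%  Memory: {mem_pct:.1f}%")  # CPU: 75.0%  Memory: 25.0%
```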

Node Performance

Metric name	Description	Unit
Node CPU Usage by Cluster	The percentage of CPU utilization for each node within the cluster.	Percentage
Node Memory Usage by Cluster	The percentage of memory utilization for each node within the cluster.	Percentage
Node Used CPU by Cluster	The number of CPUs used by each node in the cluster.	Count
Node Used Memory by Cluster	The amount of memory used by each node in the cluster.	Bytes
Node Load 1m by Cluster	The system load average over the past one minute for each node.	Count
Node Network In Rate by Cluster	The rate of inbound network traffic to each node in the cluster.	Bytes/second

Job Execution

Metric name	Description	Unit
Running Jobs	The number of jobs currently running in the cluster.	Count
Queued Jobs	The number of jobs waiting in the queue for scheduling.	Count
Finished Jobs	The total number of successfully completed jobs.	Count
Failed Job Number by Cluster	The number of failed jobs within the cluster.	Count
Suspended Job Number by Cluster	The number of suspended jobs in the cluster.	Count
Job Run Duration by Cluster	The average runtime of jobs currently executing in the cluster.	Seconds
Job Wait Duration by Cluster	The average waiting time of jobs before execution.	Seconds
Running Job Number by Cluster	The total number of active jobs currently running.	Count
Pending Job Number by Cluster	The number of jobs pending resource allocation.	Count
Created Jobs by Cluster	The total number of jobs created in the cluster.	Count
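
Two derived figures that can help when reading the job counts above are a failure rate and a backlog ratio. This is only a sketch: the function names and sample values are hypothetical, and it assumes that Finished Jobs counts successful completions only, as the table describes.

```python
def failure_rate(finished: int, failed: int) -> float:
    """Fraction of completed jobs that failed (0.0 when nothing has completed)."""
    done = finished + failed
    return failed / done if done else 0.0


def backlog_ratio(queued: int, running: int) -> float:
    """Queued jobs per running job.

    Returns infinity when jobs are waiting but none are running,
    and 0.0 when nothing is queued or running.
    """
    if running == 0:
        return float("inf") if queued else 0.0
    return queued / running


# Illustrative sample values only.
print(failure_rate(finished=180, failed=20))  # 0.1
print(backlog_ratio(queued=30, running=10))   # 3.0
```

A rising backlog ratio alongside a long Job Wait Duration by Cluster may indicate that the cluster is undersized for the current workload.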

User/Project-Level

Metric name	Description	Unit
Job Run Duration by User	The average duration of jobs run by a user.	Seconds
Job Used CPU by User	The total number of CPUs used by jobs run by a user.	Count
Job Used Memory by User	The total memory used by a user's jobs.	Bytes
Job CPU Usage by User	The percentage of CPU usage for jobs executed by a user.	Percentage
Job Memory Usage by User	The percentage of memory usage for jobs executed by a user.	Percentage
Job Run Duration by Project	The average runtime of jobs associated with a project.	Seconds
Job Used CPU by Project	The total number of CPUs used by all jobs under a project.	Count
Job Used Memory by Project	The total memory used by all jobs under a project.	Bytes

Queue

Metric name	Description	Unit
Job Number by Queue	The total number of jobs assigned to each queue.	Count
Queued Jobs by Queue	The number of jobs waiting in each queue.	Count
Running Job Number by Queue	The number of jobs currently running in each queue.	Count
Pending Job Number by Queue	The number of jobs pending in each queue.	Count
Queue Used CPUs	The number of CPUs currently used by jobs in each queue.	Count
Queue Total CPUs	The total number of CPUs allocated to each queue.	Count
Queue Total Memory	The total memory allocated to each queue.	Bytes

Job Resource Request and Allocation

Metric name	Description	Unit
Job Required CPU by User	The total number of CPUs requested by a user's jobs.	Count
Job Required Memory by User	The total memory requested by a user's jobs.	Bytes
Running Job Required CPU by Cluster	The total number of CPUs requested by running jobs in the cluster.	Count
Running Job Required Memory by Cluster	The total memory requested by running jobs in the cluster.	Bytes
Pending Job Required CPU by Cluster	The total number of CPUs requested by pending jobs in the cluster.	Count
Pending Job Required Memory by Cluster	The total memory requested by pending jobs in the cluster.	Bytes
Job Required CPU by Queue	The total number of CPUs requested by jobs in each queue.	Count
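
One way to use the request metrics is to compare what pending jobs ask for against what the cluster currently has free. The sketch below assumes the values come from Cluster Total CPUs, Cluster Used CPUs, and Pending Job Required CPU by Cluster. The function names and numbers are hypothetical.

```python
def free_cpus(total_cpus: int, used_cpus: int) -> int:
    """CPUs not currently in use (never negative)."""
    return max(total_cpus - used_cpus, 0)


def cpu_shortfall(total_cpus: int, used_cpus: int, pending_required_cpus: int) -> int:
    """CPUs requested by pending jobs beyond what is free (0 if they all fit)."""
    return max(pending_required_cpus - free_cpus(total_cpus, used_cpus), 0)


# Illustrative sample: 32 CPUs are free, pending jobs request 48.
print(cpu_shortfall(total_cpus=128, used_cpus=96, pending_required_cpus=48))  # 16
```

A shortfall greater than zero suggests that pending jobs cannot all start until running jobs release CPUs or nodes are added.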

Threshold configuration

Go to Admin > Configuration Profiles > Threshold and Availability.
Create or edit a threshold profile for E-HPC.
Assign the profile to the relevant E-HPC monitors so that alerts are triggered when a threshold is breached.
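
Conceptually, a threshold profile pairs each metric with a condition and a limit. The following sketch illustrates that idea only. It is not Site24x7's internal logic or API, and the profile structure, limits, and sample values are assumptions chosen for illustration.

```python
import operator
from typing import Dict, List, Tuple

# Supported comparison symbols mapped to Python operators.
COMPARATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}

# Hypothetical profile: metric name -> (comparison, limit).
profile: Dict[str, Tuple[str, float]] = {
    "Node CPU Usage by Cluster": (">", 90.0),       # Percentage
    "Failed Job Number by Cluster": (">", 0),       # Count
    "Job Wait Duration by Cluster": (">=", 600),    # Seconds
}


def breached(profile: Dict[str, Tuple[str, float]],
             sample: Dict[str, float]) -> List[str]:
    """Return the names of metrics in `sample` that violate the profile."""
    hits = []
    for name, (symbol, limit) in profile.items():
        if name in sample and COMPARATORS[symbol](sample[name], limit):
            hits.append(name)
    return hits


# Illustrative sample values only.
sample = {
    "Node CPU Usage by Cluster": 94.2,
    "Failed Job Number by Cluster": 0,
    "Job Wait Duration by Cluster": 600,
}
print(breached(profile, sample))
# ['Node CPU Usage by Cluster', 'Job Wait Duration by Cluster']
```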

IT automation

Site24x7's IT Automation tools help resolve performance degradation issues automatically. The alarm engine continuously evaluates system events against the thresholds you have defined and, when a breach occurs, performs the mapped automation.

Go to Admin > IT Automation Templates.
Create a new automation rule.
Map the rule to the monitor for proactive resolution.

For detailed steps, see How to configure IT Automation for a monitor.
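
The relationship between a breached threshold and its mapped automation can be pictured as a lookup followed by a dispatch. The sketch below is a simplified model under that assumption, not the actual alarm engine. The action functions and the `AUTOMATION_MAP` table are hypothetical stand-ins for whatever automation you map in the product.

```python
from typing import Callable, Dict, List


# Hypothetical actions; a real automation might run a script or call a webhook.
def notify_admin(metric: str) -> str:
    return f"notified admin about {metric}"


def run_cleanup_script(metric: str) -> str:
    return f"ran cleanup script because of {metric}"


# Hypothetical mapping from a breached metric to its mapped automations.
AUTOMATION_MAP: Dict[str, List[Callable[[str], str]]] = {
    "Failed Job Number by Cluster": [notify_admin],
    "Node CPU Usage by Cluster": [notify_admin, run_cleanup_script],
}


def run_mapped_automations(breached_metrics: List[str]) -> List[str]:
    """Run every automation mapped to each breached metric, in order."""
    results = []
    for metric in breached_metrics:
        # Metrics without a mapping are skipped.
        for action in AUTOMATION_MAP.get(metric, []):
            results.append(action(metric))
    return results


for line in run_mapped_automations(["Node CPU Usage by Cluster"]):
    print(line)
```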

Configuration rules

With Site24x7's Configuration Rules, you can set parameters such as Threshold Profile, Notification Profile, Tags, and Monitor Group for multiple monitors at once and automate the configuration of your monitoring resources. A rule can also assign these settings automatically whenever new E-HPC monitors are added.

For detailed steps, see How to add a Configuration Rule.
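
A configuration rule can be thought of as a match condition plus a set of settings to assign. The following sketch models that idea under stated assumptions. The `Monitor` and `ConfigurationRule` classes, the matching criteria, and the profile names are all hypothetical and do not reflect how Site24x7 stores or evaluates rules.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Monitor:
    """Hypothetical monitor record."""
    name: str
    monitor_type: str
    settings: Dict[str, str] = field(default_factory=dict)


@dataclass
class ConfigurationRule:
    """Hypothetical rule: match on type and name, then assign settings."""
    monitor_type: str          # rule applies to monitors of this type
    name_contains: str         # and whose name contains this substring
    settings: Dict[str, str]   # settings to assign on match

    def matches(self, monitor: Monitor) -> bool:
        return (monitor.monitor_type == self.monitor_type
                and self.name_contains in monitor.name)


def apply_rules(monitor: Monitor, rules: List[ConfigurationRule]) -> Monitor:
    """Assign the settings of every matching rule; later rules override earlier ones."""
    for rule in rules:
        if rule.matches(monitor):
            monitor.settings.update(rule.settings)
    return monitor


rules = [
    ConfigurationRule(
        monitor_type="E-HPC",
        name_contains="prod",
        settings={
            "Threshold Profile": "E-HPC Strict",
            "Notification Profile": "On-call",
            "Monitor Group": "HPC Production",
        },
    ),
]

new_monitor = apply_rules(Monitor(name="prod-cluster-01", monitor_type="E-HPC"), rules)
print(new_monitor.settings["Threshold Profile"])  # E-HPC Strict
```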