Alibaba Cloud Elastic High Performance Computing (E-HPC) Monitoring Integration

Site24x7 provides in-depth monitoring for Alibaba Cloud Elastic High Performance Computing (E-HPC) clusters, giving visibility into compute utilization, job activity, and resource distribution. By tracking metrics such as CPU usage, memory consumption, node performance, job execution status, and queue-level utilization, Site24x7 helps administrators ensure efficient scheduling, resource allocation, and system stability. Once your Alibaba Cloud account is integrated, all E-HPC clusters are automatically discovered and monitored.

Use cases

  • Cluster health tracking: Monitor total and used CPUs, memory, and node utilization across clusters.
  • Job monitoring: Identify running, queued, and failed jobs to optimize scheduling.
  • User and project insights: Track CPU and memory consumption at user and project levels.
  • Queue optimization: Analyze queue-level performance for efficient job distribution.
  • Proactive alerts: Configure automations and alerts for high utilization or job failures.

Setup and configuration

  1. Log in to your Site24x7 account and navigate to Cloud > Alibaba Cloud > Add Monitor.
  2. On the Edit Alibaba Cloud Monitor page, select Elastic High Performance Computing (E-HPC) from the Service Types list.
  3. Once the monitor is added, go to Cloud > Alibaba > E-HPC to view dashboards and performance metrics.

Supported metrics

Cluster Health

Metric name | Description | Unit
Cluster Total CPUs | The total number of CPUs available in the E-HPC cluster. | Count
Cluster Total Memory | The total amount of memory available across all nodes in the cluster. | Bytes
Cluster Total Nodes | The total number of nodes in the E-HPC cluster. | Count
Cluster Used CPUs | The number of CPUs currently in use within the cluster. | Count
Cluster Used Core Time | The total core time consumed by all running jobs in the cluster. | Seconds
Used Memory | The total memory currently in use across all nodes in the cluster. | Bytes
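
The count and byte metrics in this group can be combined into utilization percentages for custom reports. The snippet below is a minimal sketch of that arithmetic, not a Site24x7 feature; the function name, dictionary keys, and sample values are all hypothetical.

```python
def utilization_percent(used: float, total: float) -> float:
    """Return used/total as a percentage, or 0.0 when total is not positive."""
    if total <= 0:
        return 0.0
    return round(100.0 * used / total, 2)


# Hypothetical readings for one cluster (not real data).
cluster = {
    "total_cpus": 128,
    "used_cpus": 96,
    "total_memory_bytes": 512 * 1024**3,
    "used_memory_bytes": 384 * 1024**3,
}

cpu_pct = utilization_percent(cluster["used_cpus"], cluster["total_cpus"])
mem_pct = utilization_percent(
    cluster["used_memory_bytes"], cluster["total_memory_bytes"]
)
print(f"CPU: {cpu_pct}%  Memory: {mem_pct}%")
```

Guarding against a zero total avoids a division error for clusters that currently report no nodes.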

Node Performance

Metric name | Description | Unit
Node CPU Usage by Cluster | The percentage of CPU utilization for each node within the cluster. | Percentage
Node Memory Usage by Cluster | The percentage of memory utilization for each node within the cluster. | Percentage
Node Used CPU by Cluster | The number of CPUs used by each node in the cluster. | Count
Node Used Memory by Cluster | The amount of memory used by each node in the cluster. | Bytes
Node Load 1m by Cluster | The system load average over the past one minute for each node. | Count
Node Network In Rate by Cluster | The rate of inbound network traffic to each node in the cluster. | Bytes/second

Job Execution

Metric name | Description | Unit
Running Jobs | The number of jobs currently running in the cluster. | Count
Queued Jobs | The number of jobs waiting in the queue for scheduling. | Count
Finished Jobs | The total number of successfully completed jobs. | Count
Failed Job Number by Cluster | The number of failed jobs within the cluster. | Count
Suspended Job Number by Cluster | The number of suspended jobs in the cluster. | Count
Job Run Duration by Cluster | The average runtime of jobs currently executing in the cluster. | Seconds
Job Wait Duration by Cluster | The average waiting time of jobs before execution. | Seconds
Running Job Number by Cluster | The total number of active jobs currently running. | Count
Pending Job Number by Cluster | The number of jobs pending resource allocation. | Count
Created Jobs by Cluster | The total number of jobs created in the cluster. | Count
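
Two derived indicators are often useful alongside these raw counts: the share of completed jobs that failed, and how long jobs wait relative to how long they run. The following is a rough sketch under the assumption that Finished Jobs counts only successful jobs, as described above; the function names and sample numbers are hypothetical.

```python
def job_failure_rate(finished_jobs: int, failed_jobs: int) -> float:
    """Fraction of completed jobs that failed (0.0 when nothing has completed)."""
    completed = finished_jobs + failed_jobs
    if completed == 0:
        return 0.0
    return failed_jobs / completed


def wait_to_run_ratio(wait_seconds: float, run_seconds: float) -> float:
    """Average wait time relative to average run time (0.0 when run time is zero)."""
    if run_seconds <= 0:
        return 0.0
    return wait_seconds / run_seconds


# Hypothetical readings for one cluster (not real data).
rate = job_failure_rate(finished_jobs=180, failed_jobs=20)
ratio = wait_to_run_ratio(wait_seconds=300, run_seconds=1200)
print(f"Failure rate: {rate:.1%}, wait/run ratio: {ratio:.2f}")
```

A rising wait/run ratio can suggest that jobs are queuing longer than the work itself takes, which may point to a capacity or scheduling issue.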

User/Project-Level

Metric name | Description | Unit
Job Run Duration by User | The average duration of jobs run by a user. | Seconds
Job Used CPU by User | The total number of CPUs used by jobs run by a user. | Count
Job Used Memory by User | The total memory used by a user's jobs. | Bytes
Job CPU Usage by User | The percentage of CPU usage for jobs executed by a user. | Percentage
Job Memory Usage by User | The percentage of memory usage for jobs executed by a user. | Percentage
Job Run Duration by Project | The average runtime of jobs associated with a project. | Seconds
Job Used CPU by Project | The total number of CPUs used by all jobs under a project. | Count
Job Used Memory by Project | The total memory used by all jobs under a project. | Bytes

Queue

Metric name | Description | Unit
Job Number by Queue | The total number of jobs assigned to each queue. | Count
Queued Jobs by Queue | The number of jobs waiting in each queue. | Count
Running Job Number by Queue | The number of jobs currently running in each queue. | Count
Pending Job Number by Queue | The number of jobs pending in each queue. | Count
Queue Used CPUs | The number of CPUs currently used by jobs in each queue. | Count
Queue Total CPUs | The total number of CPUs allocated to each queue. | Count
Queue Total Memory | The total memory allocated to each queue. | Bytes

Job Resource Request and Allocation

Metric name | Description | Unit
Job Required CPU by User | The total number of CPUs requested by a user's jobs. | Count
Job Required Memory by User | The total memory requested by a user's jobs. | Bytes
Running Job Required CPU by Cluster | The total number of CPUs requested by running jobs in the cluster. | Count
Running Job Required Memory by Cluster | The total memory requested by running jobs in the cluster. | Bytes
Pending Job Required CPU by Cluster | The total number of CPUs requested by pending jobs in the cluster. | Count
Pending Job Required Memory by Cluster | The total memory requested by pending jobs in the cluster. | Bytes
Job Required CPU by Queue | The total number of CPUs requested by jobs in each queue. | Count
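
Comparing requested resources with free capacity gives a quick estimate of whether pending jobs could start. The sketch below illustrates the idea with Cluster Total CPUs, Cluster Used CPUs, and Pending Job Required CPU by Cluster; the function name and values are hypothetical, and the estimate is deliberately coarse because a real scheduler places jobs node by node.

```python
def pending_cpu_shortfall(
    total_cpus: int, used_cpus: int, pending_required_cpus: int
) -> int:
    """CPUs missing for all pending jobs to start at once (0 if they fit)."""
    free_cpus = max(total_cpus - used_cpus, 0)
    return max(pending_required_cpus - free_cpus, 0)


# Hypothetical readings: 32 CPUs are free, pending jobs ask for 48.
shortfall = pending_cpu_shortfall(
    total_cpus=128, used_cpus=96, pending_required_cpus=48
)
print(f"CPU shortfall for pending jobs: {shortfall}")
```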

Threshold configuration

  1. Go to Admin > Configuration Profiles > Threshold and Availability.
  2. Create or edit a threshold profile for E-HPC.
  3. Assign the profile to the respective monitors to trigger alerts.
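
Conceptually, a threshold profile maps each metric to limits and classifies every polled value against them. The following is only an illustrative sketch of that idea, not Site24x7's implementation; the `Threshold` class, the status labels, the metric keys, and the limits are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Threshold:
    """Hypothetical upper limits for one metric."""
    metric: str
    warning: float
    critical: float


def evaluate(value: float, threshold: Threshold) -> str:
    """Classify a value as 'ok', 'warning', or 'critical'."""
    if value >= threshold.critical:
        return "critical"
    if value >= threshold.warning:
        return "warning"
    return "ok"


# Hypothetical profile and one polled sample (not real data).
profile: List[Threshold] = [
    Threshold("node_cpu_usage_percent", warning=80, critical=95),
    Threshold("failed_jobs", warning=1, critical=10),
]
sample: Dict[str, float] = {"node_cpu_usage_percent": 91.5, "failed_jobs": 0}

statuses = {t.metric: evaluate(sample[t.metric], t) for t in profile}
print(statuses)
```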

IT automation

Site24x7's IT Automation tools help resolve performance degradation issues automatically. The alarm engine continuously evaluates the system events for which thresholds have been defined and, when a breach occurs, performs the mapped automation.

  1. Go to Admin > IT Automation Templates.
  2. Create a new automation rule.
  3. Map the rule to the monitor for proactive resolution.
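
The relationship between a breach and its mapped automation can be pictured as a lookup from status to a list of actions. This is a conceptual sketch only; the status labels, action functions, and monitor name are hypothetical, and the actions merely return descriptive strings rather than calling any real service.

```python
from typing import Callable, Dict, List


def notify_on_call(monitor: str) -> str:
    """Hypothetical action: describe a notification."""
    return f"notify on-call engineer about {monitor}"


def request_more_nodes(monitor: str) -> str:
    """Hypothetical action: describe a scale-out request."""
    return f"request additional compute nodes for {monitor}"


# Hypothetical mapping from breach status to the automations to run.
AUTOMATIONS: Dict[str, List[Callable[[str], str]]] = {
    "warning": [notify_on_call],
    "critical": [notify_on_call, request_more_nodes],
}


def run_automations(monitor: str, status: str) -> List[str]:
    """Run every action mapped to the status; unmapped statuses run nothing."""
    return [action(monitor) for action in AUTOMATIONS.get(status, [])]


results = run_automations("ehpc-cluster-01", "critical")
for line in results:
    print(line)
```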

How to configure IT Automation for a monitor

Configuration rules

With Site24x7's Configuration Rules, you can set parameters such as Threshold Profile, Notification Profile, Tags, and Monitor Group for multiple monitors at once and automate the configuration of your monitoring resources. The settings are then assigned automatically whenever new E-HPC monitors are added.
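
The way a rule assigns settings to newly added monitors can be sketched as a match-then-apply step. The example below is an assumption-laden illustration, not the product's logic: the `Monitor` and `ConfigurationRule` classes, the name-based match criterion, and every setting value are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Monitor:
    """Hypothetical monitor record."""
    name: str
    monitor_type: str
    settings: Dict[str, object] = field(default_factory=dict)


@dataclass
class ConfigurationRule:
    """Hypothetical rule: match on type and a name fragment, then set values."""
    monitor_type: str
    name_contains: str
    settings: Dict[str, object]

    def matches(self, monitor: Monitor) -> bool:
        return (
            monitor.monitor_type == self.monitor_type
            and self.name_contains in monitor.name
        )


def apply_rules(monitor: Monitor, rules: List[ConfigurationRule]) -> Monitor:
    """Copy the settings of every matching rule onto the monitor."""
    for rule in rules:
        if rule.matches(monitor):
            monitor.settings.update(rule.settings)
    return monitor


rule = ConfigurationRule(
    monitor_type="E-HPC",
    name_contains="prod",
    settings={
        "threshold_profile": "ehpc-prod",
        "monitor_group": "HPC Production",
        "tags": ["env:prod"],
    },
)
new_monitor = apply_rules(Monitor("prod-genomics-cluster", "E-HPC"), [rule])
print(new_monitor.settings)
```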

How to add a Configuration Rule
