Monitor GPU nodes in AWS

Deep learning workloads often rely on GPUs (graphics processing units) due to their parallel processing capabilities. Amazon Web Services (AWS) provides P2 and P3 instances optimized for running deep learning frameworks like MXNet. Monitoring GPU nodes in AWS is crucial to ensure optimal performance, efficient resource utilization, and cost-effectiveness. In this article, we will explore the importance of monitoring GPU nodes, the metrics to track, and the tools and techniques available in AWS for monitoring and optimizing GPU performance.

Why Monitor GPU Nodes in AWS?

GPUs are critical components in various applications, including machine learning, scientific simulations, and graphics rendering. Monitoring GPU nodes helps:

  • Optimize performance: Identify bottlenecks and optimize GPU utilization for better performance.
  • Reduce costs: Right-size instances and avoid idle resources to minimize costs.
  • Ensure reliability: Detect and troubleshoot issues before they impact applications.

Metrics to Track

We should at least monitor the following metrics for GPUs:

  • GPU utilization: Monitor GPU usage to identify under- or over-utilization.
  • Memory usage: Track memory allocation and usage to prevent memory bottlenecks.
  • Temperature: Monitor GPU temperature to prevent overheating and throttling.
  • Power consumption: Track power usage to optimize energy efficiency.
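On a GPU instance you can take a quick one-off sample of most of these metrics with nvidia-smi, a minimal sketch that assumes the NVIDIA driver is installed (it is on AWS GPU and Deep Learning AMIs):

# One-off snapshot of GPU utilization, memory, temperature and power draw in CSV form
nvidia-smi --query-gpu=utilization.gpu,utilization.memory,memory.used,memory.total,temperature.gpu,power.draw --format=csv

For fleet-wide, continuous monitoring you want these metrics exported to a central system, which is what the tools below provide.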

Tools and Techniques for Monitoring GPU Nodes in AWS

There are several tools you can use to monitor GPU nodes in AWS. If you run workloads on Amazon EKS, you can use either Prometheus/Grafana or Amazon CloudWatch. If you use EC2 instances with the Deep Learning AMI (DLAMI), the CloudWatch agent comes preconfigured as a systemd service.

  • EKS
    - use Prometheus/Grafana
    - use CloudWatch
  • DLAMI
    - use CloudWatch

Monitoring GPU in EKS — use Prometheus/Grafana

This is similar to setting up Prometheus in any other Kubernetes cluster. Prometheus uses a pull model: it scrapes the metrics endpoint (usually /metrics) of each target, such as CoreDNS, kube-state-metrics, and the NVIDIA DCGM exporter, which exposes GPU metrics.

From Monitoring GPU workloads on Amazon EKS using AWS managed open-source services | AWS Cloud Operations & Migrations Blog
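As a rough sketch of where the GPU metrics come from in this setup, the NVIDIA DCGM exporter can be installed from NVIDIA's Helm chart so that Prometheus has a GPU metrics endpoint to scrape (repository URL and chart name are NVIDIA's; whether you still need a ServiceMonitor or scrape config depends on your Prometheus installation):

# Add NVIDIA's dcgm-exporter Helm repository and install the exporter
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update
# The exporter runs as a DaemonSet and serves GPU metrics on port 9400 at /metrics
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter --namespace kube-system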

An example of the metrics may look like this:

# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
# TYPE DCGM_FI_DEV_SM_CLOCK gauge
DCGM_FI_DEV_SM_CLOCK{gpu="0",UUID="GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3",device="nvidia0",modelName="NVIDIA A10G",Hostname="nvidia-dcgm-exporter-48cwd",DCGM_FI_DRIVER_VERSION="470.182.03",container="main",namespace="kube-system",pod="gpu-burn-c68d8c774-ltg9s"} 1455

Each metric is associated with multiple labels. With labels, we can aggregate metrics and visualize them in a dashboard. For example, we can aggregate GPU power usage for a given host using the Hostname label.

From https://aws.amazon.com/blogs/mt/monitoring-gpu-workloads-on-amazon-eks-using-aws-managed-open-source-services/
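As an illustrative sketch of that aggregation, the query below sums the DCGM exporter's power metric per host through the Prometheus HTTP API. It assumes Prometheus is reachable on localhost:9090 (for example via kubectl port-forward) and that DCGM_FI_DEV_POWER_USAGE is in your exporter's metric set:

# Sum GPU power draw (in watts) per node, grouped by the Hostname label
curl -s --get 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum by (Hostname) (DCGM_FI_DEV_POWER_USAGE)'

The same expression can be used directly as a Grafana panel query.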

Monitoring GPU in EKS — use CloudWatch

You can also use CloudWatch for your EKS cluster. CloudWatch uses a push model: the CloudWatch agent runs on each node, collecting metrics and pushing them to CloudWatch. You can follow the AWS documentation to configure the CloudWatch agent for your EKS cluster:

Install the CloudWatch agent by using the Amazon CloudWatch Observability EKS add-on — Amazon CloudWatch
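If you go this route, a minimal sketch of installing the add-on with the AWS CLI looks like the following; the cluster name and region are placeholders, and the IAM permissions the agent needs are covered in the documentation linked above:

# Install the CloudWatch Observability add-on on an existing EKS cluster
aws eks create-addon \
  --cluster-name my-gpu-cluster \
  --addon-name amazon-cloudwatch-observability \
  --region us-west-2

The add-on deploys the CloudWatch agent as a DaemonSet and, on GPU nodes, can collect NVIDIA GPU metrics into Container Insights.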

Monitor GPUs on DLAMI

The Deep Learning AMI (DLAMI) is available for Amazon EC2 in most regions. DLAMI ships with the CloudWatch agent preconfigured as systemd services; all you need to do is enable the level of GPU metrics you want.

There are 3 levels of GPU metrics:

  • minimal GPU metrics
  • partial GPU metrics
  • all available GPU metrics

All the available GPU metrics are:

utilization_gpu
utilization_memory
memory_total
memory_used
memory_free
temperature_gpu
power_draw
fan_speed
pcie_link_gen_current
pcie_link_width_current
encoder_stats_session_count
encoder_stats_average_fps
encoder_stats_average_latency
clocks_current_graphics
clocks_current_sm
clocks_current_memory
clocks_current_video

Each of the three levels of GPU metrics is configured as a systemd service backed by its own CloudWatch agent configuration file. For example, the configuration file for the minimal preconfigured GPU metrics is located at

/opt/aws/amazon-cloudwatch-agent/etc/dlami-amazon-cloudwatch-agent-minimal.json

You can enable all GPU metrics with systemd:

sudo systemctl enable dlami-cloudwatch-agent@all
sudo systemctl start dlami-cloudwatch-agent@all
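A quick sanity check that the agent actually started (standard systemd commands, nothing GPU-specific):

# Confirm the service is active; inspect recent logs if it is not
sudo systemctl status dlami-cloudwatch-agent@all
sudo journalctl -u dlami-cloudwatch-agent@all --since "15 min ago"

After a few minutes the GPU metrics should start showing up in the CloudWatch console.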

Please refer to Monitor GPUs with CloudWatch — Deep Learning AMI for more details.

Conclusions

Monitoring GPU nodes in AWS is essential for optimal performance, cost-effectiveness, and reliability. By tracking key metrics and leveraging AWS and third-party tools, you can ensure efficient GPU utilization and optimize your computing-intensive workloads. Remember to follow best practices and automate monitoring and optimization tasks to maximize the benefits of GPU computing in AWS.
