Scale and simplify ML workload monitoring on Amazon EKS with AWS Neuro …

Amazon Web Services is excited to announce the launch of the AWS Neuron Monitor container, an innovative tool designed to enhance the monitoring capabilities of AWS Inferentia and AWS Trainium chips on Amazon Elastic Kubernetes Service (Amazon EKS). This solution simplifies the integration of advanced monitoring tools such as Prometheus and Grafana, enabling you to set up and manage your machine learning (ML) workflows with AWS AI Chips. With the new Neuron Monitor container, you can visualize and optimize the performance of your ML applications, all within a familiar Kubernetes environment. The Neuron Monitor container can also run on Amazon Elastic Container Service (Amazon ECS), but for the purpose of this post, we primarily discuss Amazon EKS deployment.
In addition to the Neuron Monitor container, the release of CloudWatch Container Insights (for Neuron) provides further benefits. This extension offers monitoring and analytics tailored specifically to Neuron-based applications. With Container Insights, you can access more granular data, which makes it easier for developers to maintain the performance and operational health of their ML workloads.
Solution overview
The Neuron Monitor container solution provides a comprehensive monitoring framework for ML workloads on Amazon EKS, using the power of Neuron Monitor in conjunction with industry-standard tools like Prometheus, Grafana, and Amazon CloudWatch. By deploying the Neuron Monitor DaemonSet across EKS nodes, developers can collect and analyze performance metrics from ML workload pods.
In one flow, metrics gathered by Neuron Monitor are integrated with Prometheus, which is configured using a Helm chart for scalability and ease of management. These metrics are then visualized through Grafana, offering you detailed insights into your applications’ performance for effective troubleshooting and optimization.
Alternatively, metrics can be directed to CloudWatch through the CloudWatch Observability EKS add-on or a Helm chart, which provides deeper integration with AWS services in a single step. The add-on automatically discovers critical health metrics from the AWS Trainium and AWS Inferentia chips in Amazon EC2 Trn1 and Amazon EC2 Inf2 instances, as well as from Elastic Fabric Adapter, the network interface for EC2 instances. This integration can help you better understand the traffic impact on your distributed deep learning algorithms.
This architecture has many benefits:

Highly targeted and intentional monitoring on Container Insights
Real-time analytics and greater visibility into ML workload performance on Neuron
Native support for your existing Amazon EKS infrastructure

Neuron Monitor provides flexibility and depth in monitoring within the Kubernetes environment.
The following diagram illustrates the solution architecture:

Fig.1 Solution Architecture Diagram
In the following sections, we demonstrate how to use Container Insights for enhanced observability, and how to set up Prometheus and Grafana for this solution.
Configure Container Insights for enhanced observability
In this section, we walk through the steps to configure Container Insights.
Set up the CloudWatch Observability EKS add-on
Refer to Install the Amazon CloudWatch Observability EKS add-on for instructions to create the amazon-cloudwatch-observability add-on in your EKS cluster. This process involves deploying the necessary resources for monitoring directly within CloudWatch.
After you set up the add-on, check the health of the add-on with the following command:

aws eks describe-addon --cluster-name <value> --addon-name amazon-cloudwatch-observability

The output should contain the following property value:

"status": "ACTIVE",

For details about confirming the output, see Retrieve addon version compatibility.
Once the add-on is active, you can then directly view metrics in Container Insights.
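If you prefer to script this check instead of reading the output by eye, the following minimal Python sketch parses the JSON printed by the describe-addon command and reports whether the add-on is active. The helper name addon_is_active and the sample payload are illustrative assumptions; only the status field shown earlier comes from the command output.

```python
import json


def addon_is_active(describe_addon_output: str) -> bool:
    """Return True if `aws eks describe-addon` JSON output reports ACTIVE.

    addon_is_active is a hypothetical helper name, not part of any AWS tool.
    """
    response = json.loads(describe_addon_output)
    # The add-on details are nested under the top-level "addon" key.
    return response.get("addon", {}).get("status") == "ACTIVE"


if __name__ == "__main__":
    # Hand-written sample payload for illustration, not real cluster output.
    sample = (
        '{"addon": {"addonName": "amazon-cloudwatch-observability", '
        '"status": "ACTIVE"}}'
    )
    print(addon_is_active(sample))
```

In practice you would pass in the text captured from the describe-addon command, for example by redirecting its output to a file and reading that file.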
View CloudWatch metrics
Navigate to the Container Insights console, where you can visualize metrics and telemetry about your whole Amazon EKS environment, including your Neuron device metrics. The enhanced Container Insights page looks similar to the following screenshot, with the high-level summary of your clusters, along with kube-state and control-plane metrics. The Container Insights dashboard also shows cluster status and alarms. It uses predefined thresholds for CPU, memory, and NeuronCores to quickly identify which resources have higher consumption, and enables proactive actions to avoid performance impact.

Fig.2 CloudWatch Container Insights Dashboard
The out-of-the-box opinionated performance dashboards and troubleshooting UI enable you to see your Neuron metrics at multiple granularities, from an aggregated cluster level to the per-container and per-NeuronCore levels. With the Container Insights default configuration, you can also qualify and correlate your Neuron metrics against the other aspects of your infrastructure such as CPU, memory, disk, Elastic Fabric Adapter devices, and more.
When you navigate to any of the clusters based on their criticality, you can view the Performance monitoring dashboard, as shown in the following screenshot.

Fig.3 Performance Monitoring Dashboard Views
This monitoring dashboard provides various views to analyze performance, including:

Cluster-wide performance dashboard view – Provides an overview of resource utilization across the entire cluster
Node performance view – Visualizes metrics at the individual node level
Pod performance view – Focuses on pod-level metrics for CPU, memory, network, and so on
Container performance view – Drills down into utilization metrics for individual containers

This landing page has now been enhanced with Neuron metrics, including top 10 graphs, which help you identify unhealthy components in your environments even without alarms and take proactive action before application performance is impacted. For a more in-depth analysis of what is delivered on this landing page, refer to Announcing Amazon CloudWatch Container Insights with Enhanced Observability for Amazon EKS on EC2.
Prometheus and Grafana
In this section, we walk through the steps to set up Prometheus and Grafana.
Prerequisites
You should have an EKS cluster set up with AWS Inferentia or Trainium worker nodes.
Set up the Neuron Monitoring container
The Neuron Monitoring container is hosted on Amazon ECR Public. You can pull it from there right away, but pulling directly from the public registry isn't recommended for production workloads because pulls can be throttled. For production environments, copy the Neuron Monitoring container to your private Amazon Elastic Container Registry (Amazon ECR) repository, where the Amazon ECR pull through cache feature can keep it synchronized. For more information on this and on setting up a pull through cache, see the Neuron Monitor User Guide.
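To illustrate what the private image URI looks like once a pull through cache rule is in place, here is a minimal Python sketch that rewrites an Amazon ECR Public URI into its private-registry form. It assumes the cache rule was created with the repository prefix ecr-public; the prefix, the account ID, and the helper name pull_through_image_uri are assumptions for illustration, so substitute the values from your own rule.

```python
def pull_through_image_uri(
    public_uri: str,
    account_id: str,
    region: str,
    cache_prefix: str = "ecr-public",  # assumed prefix; use the one from your rule
) -> str:
    """Map an Amazon ECR Public image URI to a private pull through cache URI.

    pull_through_image_uri is a hypothetical helper, not an AWS API.
    """
    public_host = "public.ecr.aws/"
    if not public_uri.startswith(public_host):
        raise ValueError(f"not an Amazon ECR Public URI: {public_uri}")
    # Keep everything after the public host, e.g. "neuron/neuron-monitor:1.0.1".
    repository_and_tag = public_uri[len(public_host):]
    registry = f"{account_id}.dkr.ecr.{region}.amazonaws.com"
    return f"{registry}/{cache_prefix}/{repository_and_tag}"


if __name__ == "__main__":
    # 111122223333 is a placeholder account ID.
    print(
        pull_through_image_uri(
            "public.ecr.aws/neuron/neuron-monitor:1.0.1",
            account_id="111122223333",
            region="us-west-2",
        )
    )
```

The resulting string is what you would place in the image field of the DaemonSet manifest.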
Set up Kubernetes for Neuron Monitoring
You can use the following YAML configuration snippet to set up Neuron Monitoring in your Kubernetes cluster. This setup includes a DaemonSet that deploys the monitoring container on each suitable node in the neuron-monitor namespace, which must already exist when you apply the manifest:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: neuron-monitor
  namespace: neuron-monitor
  labels:
    app: neuron-monitor
    version: v1
spec:
  selector:
    matchLabels:
      app: neuron-monitor
  template:
    metadata:
      labels:
        app: neuron-monitor
        version: v1
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: kubernetes.io/os
                    operator: In
                    values:
                      - linux
                  - key: node.kubernetes.io/instance-type
                    operator: In
                    values:
                      - trn1.2xlarge
                      - trn1.32xlarge
                      - trn1n.32xlarge
                      - inf1.xlarge
                      - inf1.2xlarge
                      - inf1.6xlarge
                      - inf2.xlarge
                      - inf2.8xlarge
                      - inf2.24xlarge
                      - inf2.48xlarge
      containers:
        - name: neuron-monitor
          image: public.ecr.aws/neuron/neuron-monitor:1.0.1
          ports:
            - containerPort: 8000
          command:
            - "/opt/bin/entrypoint.sh"
          args:
            - "--port"
            - "8000"
          resources:
            limits:
              cpu: 500m
              memory: 256Mi
            requests:
              cpu: 256m
              memory: 128Mi
          env:
            - name: GOMEMLIMIT
              value: 160MiB
          securityContext:
            privileged: true

To apply this YAML file, complete the following steps:

Replace the image value in the manifest (public.ecr.aws/neuron/neuron-monitor:1.0.1 in the preceding example) with the URI of the Neuron Monitoring container image in your ECR repository.
Apply the YAML file with the Kubernetes command line tool:

kubectl apply -f <filename>.yaml

Verify that the Neuron Monitor container is running as a DaemonSet:

kubectl get daemonset -n neuron-monitor
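For an automated version of this check, the following minimal Python sketch inspects the JSON form of the same DaemonSet (for example, the output of kubectl get daemonset neuron-monitor -n neuron-monitor -o json) and compares the number of ready pods with the number of nodes the DaemonSet targets. The helper name daemonset_is_ready and the choice to treat zero targeted nodes as a failure are assumptions for illustration.

```python
import json


def daemonset_is_ready(daemonset_json: str) -> bool:
    """Return True when every scheduled Neuron Monitor pod is ready.

    daemonset_is_ready is a hypothetical helper name.
    """
    status = json.loads(daemonset_json).get("status", {})
    desired = status.get("desiredNumberScheduled", 0)
    ready = status.get("numberReady", 0)
    # Zero desired pods usually means the node affinity matched no nodes,
    # so this sketch reports that case as not ready rather than as success.
    return desired > 0 and ready == desired


if __name__ == "__main__":
    # Hand-written sample status for illustration, not real cluster output.
    sample = '{"status": {"desiredNumberScheduled": 2, "numberReady": 2}}'
    print(daemonset_is_ready(sample))
```

A result of False with zero desired pods is a hint to check that your worker nodes use one of the instance types listed in the node affinity rule.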

Set up Amazon Managed Service for Prometheus
To utilize Amazon Managed Service for Prometheus with your EKS cluster, you must first configure Prometheus to scrape metrics from Neuron Monitor pods and forward them to the managed service.
The Prometheus server stores its data on a persistent volume, so the EKS cluster needs a Container Storage Interface (CSI) driver that can provision one. You can use eksctl to set up the necessary components.

Create an AWS Identity and Access Management (IAM) service account with appropriate permissions:

eksctl create iamserviceaccount --name ebs-csi-controller-sa --namespace kube-system --cluster <cluster-name> --role-name <role-name> --role-only --attach-policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy --approve

Install the Amazon Elastic Block Store (Amazon EBS) CSI driver add-on:

eksctl create addon --name aws-ebs-csi-driver --cluster <cluster-name> --service-account-role-arn <role-arn> --force

Verify the add-on installation:

eksctl get addon --name aws-ebs-csi-driver --cluster <cluster-name>

Now you’re ready to set up your Amazon Managed Service for Prometheus workspace.

Create a workspace using the AWS Command Line Interface (AWS CLI) and confirm its active status:

aws amp create-workspace --alias <alias>
aws amp list-workspaces --alias <alias>

Set up the required service roles following the AWS guidelines to facilitate the ingestion of metrics from your EKS clusters. This includes creating an IAM role specifically for Prometheus ingestion:

aws iam get-role --role-name amp-iamproxy-ingest-role
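The workspace ID returned by the list-workspaces command is needed for the remote write endpoint in the next step. As a minimal sketch, the following Python helpers pick the IDs of active workspaces out of the list-workspaces JSON output and build the endpoint URL in the form used in this post. The helper names and the sample workspace ID are assumptions for illustration.

```python
import json
from typing import List


def active_workspace_ids(list_workspaces_output: str) -> List[str]:
    """Return the IDs of workspaces whose status code is ACTIVE.

    Expects the JSON printed by `aws amp list-workspaces`.
    """
    response = json.loads(list_workspaces_output)
    return [
        workspace["workspaceId"]
        for workspace in response.get("workspaces", [])
        if workspace.get("status", {}).get("statusCode") == "ACTIVE"
    ]


def remote_write_url(region: str, workspace_id: str) -> str:
    """Build the remote write endpoint for a workspace in the given Region."""
    return (
        f"https://aps-workspaces.{region}.amazonaws.com"
        f"/workspaces/{workspace_id}/api/v1/remote_write"
    )


if __name__ == "__main__":
    # ws-EXAMPLE is a placeholder ID, not a real workspace.
    sample = (
        '{"workspaces": [{"alias": "demo", "workspaceId": "ws-EXAMPLE", '
        '"status": {"statusCode": "ACTIVE"}}]}'
    )
    for workspace_id in active_workspace_ids(sample):
        print(remote_write_url("us-west-2", workspace_id))
```

Filtering on the status code avoids pointing Prometheus at a workspace that is still being created.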

Next, you install Prometheus in your EKS cluster using a Helm chart, configuring it to scrape metrics from Neuron Monitor and forward them to your Amazon Managed Service for Prometheus workspace. The following is an example of the Helm chart .yaml file to override the necessary configs:

serviceAccounts:
  server:
    name: "amp-iamproxy-ingest-service-account"
    annotations:
      eks.amazonaws.com/role-arn: "arn:aws:iam::<account-id>:role/amp-iamproxy-ingest-role"
server:
  remoteWrite:
    - url: https://aps-workspaces.<region>.amazonaws.com/workspaces/<workspace-id>/api/v1/remote_write
      sigv4:
        # Must be the same Region as the <region> in the url above
        region: us-west-2
      queue_config:
        max_samples_per_send: 1000
        max_shards: 200
        capacity: 2500
extraScrapeConfigs: |
  - job_name: neuron-monitor-stats
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: keep
        regex: neuron-monitor
      - source_labels: [__meta_kubernetes_pod_container_port_number]
        action: keep
        regex: 8000

This file has the following key sections:

serviceAccounts – Configures the service account used by Prometheus with the necessary IAM role for permissions to ingest metrics
remoteWrite – Specifies the endpoint for writing metrics to Amazon Managed Service for Prometheus, including AWS Region-specific details and batch-writing configurations
extraScrapeConfigs – Defines additional configurations for scraping metrics from Neuron Monitor pods, including selecting pods based on labels and making sure only relevant metrics are captured
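To make the effect of the two keep rules in extraScrapeConfigs concrete, the following Python sketch imitates how they filter discovered pod targets: a target survives only if its app label is exactly neuron-monitor and its container port is exactly 8000. This is a simplified model of Prometheus relabeling written for illustration, not Prometheus code; the is_kept helper and the sample targets are assumptions.

```python
import re
from typing import Dict

# The two keep rules from the scrape config: (source_labels, regex).
KEEP_RULES = [
    (["__meta_kubernetes_pod_label_app"], "neuron-monitor"),
    (["__meta_kubernetes_pod_container_port_number"], "8000"),
]


def is_kept(target_labels: Dict[str, str]) -> bool:
    """Return True if a discovered target passes every keep rule."""
    for source_labels, regex in KEEP_RULES:
        # Source label values are joined with ";" and a missing label counts
        # as an empty string. The regex must match the whole joined value.
        value = ";".join(target_labels.get(name, "") for name in source_labels)
        if re.fullmatch(regex, value) is None:
            return False  # a keep rule that does not match drops the target
    return True


if __name__ == "__main__":
    # Hypothetical discovered targets for illustration.
    neuron_pod = {
        "__meta_kubernetes_pod_label_app": "neuron-monitor",
        "__meta_kubernetes_pod_container_port_number": "8000",
    }
    other_pod = {
        "__meta_kubernetes_pod_label_app": "web",
        "__meta_kubernetes_pod_container_port_number": "8000",
    }
    print(is_kept(neuron_pod), is_kept(other_pod))
```

Because the match covers the whole value, a pod labeled neuron-monitor-canary or a container port such as 18000 would be dropped by these rules.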

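Install Prometheus in your EKS cluster using the Helm command and specifying the .yaml file (saved here as values.yaml). The command assumes that the prometheus-community chart repository has already been added to your Helm client:

helm install prometheus prometheus-community/prometheus -n prometheus --create-namespace -f values.yaml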

Verify the installation by checking that all Prometheus pods are running:

kubectl get pods -n prometheus

This confirms that Prometheus is correctly set up to collect metrics from the Neuron Monitor container and forward them to Amazon Managed Service for Prometheus.
Integrate Amazon Managed Grafana
When Prometheus is operational, complete the following steps:

Set up Amazon Managed Grafana. For instructions, see Getting started with Amazon Managed Grafana.
Configure it to use Amazon Managed Service for Prometheus as a data source. For details, see Use AWS data source configuration to add Amazon Managed Service for Prometheus as a data source.
Import the example Neuron Monitor dashboard from GitHub to quickly visualize your metrics.

The following screenshot shows your dashboard integrated with Amazon Managed Grafana.

Fig.4 Integrating Amazon Managed Grafana
Clean up
To make sure none of the resources created in this walkthrough are left running, complete the following cleanup steps:

Delete the Amazon Managed Grafana workspace.
Uninstall Prometheus from the EKS cluster:

helm uninstall prometheus -n prometheus

Remove the Amazon Managed Service for Prometheus workspace ID from the trust policy of the role amp-iamproxy-ingest-role or delete the role.
Delete the Amazon Managed Service for Prometheus workspace:

aws amp delete-workspace --workspace-id <workspace-id>

Clean up the CSI:

eksctl delete addon --cluster <cluster-name> --name aws-ebs-csi-driver
eksctl delete iamserviceaccount --name ebs-csi-controller-sa --namespace kube-system --cluster <cluster-name>

Delete the Neuron Monitor DaemonSet from the EKS cluster:

kubectl delete daemonset neuron-monitor -n neuron-monitor

Conclusion
The release of the Neuron Monitor container marks a significant enhancement in the monitoring of ML workloads on Amazon EKS, specifically tailored for AWS Inferentia and Trainium chips. This solution simplifies the integration of powerful monitoring tools like Prometheus, Grafana, and CloudWatch, so you can effectively manage and optimize your ML applications with ease and precision.
To explore the full capabilities of this monitoring solution, refer to Deploy Neuron Container on Elastic Kubernetes Service (EKS). Refer to Amazon EKS and Kubernetes Container Insights metrics to learn more about setting up the Neuron Monitor container and using Container Insights to fully harness the capabilities of your ML infrastructure on Amazon EKS. Additionally, engage with our community through our GitHub repo to share experiences and best practices, so you stay at the forefront of ML operations on AWS.

About the Authors
Niithiyn Vijeaswaran is a Solutions Architect at AWS. His area of focus is generative AI and AWS AI Accelerators. He holds a Bachelor’s degree in Computer Science and Bioinformatics. Niithiyn works closely with the Generative AI GTM team to enable AWS customers on multiple fronts and accelerate their adoption of generative AI. He’s an avid fan of the Dallas Mavericks and enjoys collecting sneakers.
Emir Ayar is a Senior Tech Lead Solutions Architect with the AWS Prototyping team. He specializes in assisting customers with building ML and generative AI solutions, and implementing architectural best practices. He supports customers in experimenting with solution architectures to achieve their business objectives, emphasizing agile innovation and prototyping. He lives in Luxembourg and enjoys playing synthesizers.
Ziwen Ning is a software development engineer at AWS. He currently focuses on enhancing the AI/ML experience through the integration of AWS Neuron with containerized environments and Kubernetes. In his free time, he enjoys challenging himself with badminton, swimming and other various sports, and immersing himself in music.
Rohit Talluri is a Generative AI GTM Specialist (Tech BD) at Amazon Web Services (AWS). He is partnering with top generative AI model builders, strategic customers, key AI/ML partners, and AWS Service Teams to enable the next generation of artificial intelligence, machine learning, and accelerated computing on AWS. He was previously an Enterprise Solutions Architect, and the Global Solutions Lead for AWS Mergers & Acquisitions Advisory.
Albert Opher is a Solutions Architect Intern at AWS. He is a rising senior at the University of Pennsylvania pursuing Dual Bachelor’s Degrees in Computer Information Science and Business Analytics in the Jerome Fisher Management and Technology Program. He has experience with multiple programming languages, AWS cloud services, AI/ML technologies, product and operations management, pre and early seed start-up ventures, and corporate finance.
Geeta Gharpure is a senior software developer on the Annapurna ML engineering team. She is focused on running large scale AI/ML workloads on Kubernetes. She lives in Sunnyvale, CA and enjoys listening to Audible in her free time.
