When it comes to developing and deploying advanced AI models, access to scalable, efficient GPU infrastructure is critical. But managing this infrastructure across cloud-native, containerized environments can be complex and costly. That’s where NVIDIA Run:ai can help. Now generally available on AWS Marketplace, NVIDIA Run:ai makes it even easier for organizations to streamline their AI infrastructure management.
Built for Kubernetes-native environments, NVIDIA Run:ai acts as a control plane for GPU infrastructure, removing complexity and enabling organizations to scale AI workloads with speed, efficiency, and proper governance.
This post dives into how NVIDIA Run:ai orchestrates AI workloads and GPUs across Amazon Web Services (AWS). It integrates seamlessly with NVIDIA GPU-accelerated Amazon EC2 instances, Amazon Elastic Kubernetes Service (EKS), Amazon SageMaker HyperPod, AWS Identity and Access Management (IAM), Amazon CloudWatch, and other AWS-native services.
The challenge: efficient GPU orchestration at scale
Modern AI workloads—from large-scale training to real-time inference—require dynamic access to powerful GPUs. But native GPU support in Kubernetes environments is limited, which creates challenges around visibility, sharing, and utilization of GPU resources.
NVIDIA Run:ai addresses these challenges with a Kubernetes-based AI orchestration platform designed specifically for AI/ML workloads. It introduces a virtual GPU pool, enabling dynamic, policy-based scheduling of GPU resources.
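To make the virtual-pool idea concrete, here is a deliberately simplified Python sketch of quota-aware scheduling over a shared GPU pool. The team names, grant labels, and policy logic are illustrative assumptions for this post, not the actual NVIDIA Run:ai scheduler:

```python
class VirtualGpuPool:
    """Toy model of a virtual GPU pool with quota-aware, policy-based
    scheduling. Real schedulers implement far richer policies."""

    def __init__(self, total_gpus: float, quotas: dict[str, float]):
        self.total = total_gpus
        self.quotas = quotas                      # team -> guaranteed GPUs
        self.alloc = {team: 0.0 for team in quotas}

    def request(self, team: str, gpus: float):
        """Grant a request from free capacity. Allocations beyond a team's
        guaranteed quota are labeled over-quota, i.e. eligible for
        preemption when quota owners reclaim capacity."""
        free = self.total - sum(self.alloc.values())
        if gpus > free:
            return None  # no capacity: a real scheduler would queue the job
        self.alloc[team] += gpus
        if self.alloc[team] <= self.quotas[team]:
            return "guaranteed"
        return "over-quota"
```

In this sketch, a team can dip into idle capacity beyond its quota, but those grants are marked as reclaimable rather than guaranteed.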
Key capabilities
NVIDIA Run:ai integrates seamlessly with NVIDIA-powered AWS services to optimize performance and simplify operations:
1. Amazon EC2 GPU-accelerated instances within Kubernetes clusters (NVIDIA A10G, A100, H100, etc.)
NVIDIA Run:ai schedules AI workloads on Kubernetes clusters that are deployed on EC2 instances with NVIDIA GPUs, maximizing GPU utilization through intelligent sharing and bin packing.
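The bin-packing behavior can be illustrated with a first-fit-decreasing sketch in Python. The node count, capacities, and placement heuristic here are simplified assumptions, not the scheduler's real algorithm:

```python
def pack_workloads(requests: dict[str, float], node_capacity: float, num_nodes: int):
    """First-fit decreasing: place (possibly fractional) GPU requests onto
    as few nodes as possible -- the intuition behind GPU bin packing."""
    nodes = [0.0] * num_nodes        # GPUs used per node
    placement = {}                   # workload -> node index (or None)
    # Place the largest requests first so small jobs fill the gaps.
    for name, gpus in sorted(requests.items(), key=lambda kv: -kv[1]):
        for i, used in enumerate(nodes):
            if used + gpus <= node_capacity + 1e-9:
                nodes[i] += gpus
                placement[name] = i
                break
        else:
            placement[name] = None   # unschedulable: would queue in practice
    return placement, nodes
```

Packing jobs tightly onto one node leaves other nodes fully idle and reclaimable, which is what drives utilization up.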
2. Amazon Elastic Kubernetes Service (EKS)
NVIDIA Run:ai integrates natively with Amazon EKS, providing a robust scheduling and orchestration layer that’s purpose-built for AI workloads and maximizes the utilization of GPU resources in Kubernetes clusters.
3. Amazon SageMaker HyperPod
NVIDIA Run:ai integrates with Amazon SageMaker HyperPod to seamlessly extend AI infrastructure across both on-premises and public/private cloud environments.
4. Amazon CloudWatch
Monitoring GPU workloads at scale requires real-time observability, which NVIDIA Run:ai delivers through integration with Amazon CloudWatch.
By combining NVIDIA Run:ai’s rich workload telemetry with CloudWatch’s analytics and alerting, users gain actionable insights into resource consumption and efficiency.
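As a sketch of what publishing workload telemetry to CloudWatch can look like, the snippet below builds a GPU-utilization metric datum and (optionally) sends it with boto3. The namespace, metric name, and dimensions are hypothetical examples, not the schema NVIDIA Run:ai actually emits:

```python
import datetime

def gpu_utilization_metric(cluster: str, node: str, utilization_pct: float) -> dict:
    """Build a CloudWatch metric datum for GPU utilization.
    Namespace/dimension names here are illustrative assumptions."""
    return {
        "MetricName": "GPUUtilization",
        "Dimensions": [
            {"Name": "Cluster", "Value": cluster},
            {"Name": "Node", "Value": node},
        ],
        "Timestamp": datetime.datetime.now(datetime.timezone.utc),
        "Value": utilization_pct,
        "Unit": "Percent",
    }

def publish(datum: dict) -> None:
    """Send one datum to CloudWatch (requires AWS credentials)."""
    import boto3  # imported lazily so the builder above stays testable offline
    boto3.client("cloudwatch").put_metric_data(
        Namespace="RunAI/GPU", MetricData=[datum]
    )
```

Once metrics land in a namespace like this, CloudWatch alarms and dashboards can be layered on top without touching the cluster.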
Integrating with AWS IAM
Security and governance are foundational for AI infrastructure. NVIDIA Run:ai integrates with AWS IAM to enforce fine-grained, role-based access to GPU resources.
IAM integration ensures that only authorized users and services can access or manage NVIDIA Run:ai resources within your AWS environment.
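For illustration, a least-privilege policy for a service that integrates with EKS and CloudWatch might look like the following. The statement shape and action list are assumptions made for this post, not an official NVIDIA Run:ai policy:

```python
import json

def operator_policy(cluster_arn: str) -> str:
    """Build an illustrative least-privilege IAM policy document that
    allows describing an EKS cluster and publishing CloudWatch metrics.
    The exact permissions a real deployment needs will differ."""
    return json.dumps({
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["eks:DescribeCluster", "eks:ListClusters"],
                "Resource": cluster_arn,
            },
            {
                "Effect": "Allow",
                "Action": ["cloudwatch:PutMetricData"],
                "Resource": "*",
            },
        ],
    }, indent=2)
```

Scoping the EKS statement to a single cluster ARN keeps the blast radius small if the role's credentials leak.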
Example: multi-team GPU orchestration on EKS
Imagine an enterprise AI platform with three teams: natural language processing (NLP), computer vision, and generative AI. Each team needs guaranteed GPU access for training, while also running inference jobs on shared infrastructure.
With NVIDIA Run:ai, each team receives a guaranteed GPU quota, and idle capacity can be borrowed for over-quota workloads, then reclaimed when its owner needs it.
This model allows AI teams to move faster without stepping on each other’s toes—or burning the budget on underutilized GPUs.
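One way to picture the quota-plus-borrowing model for these three teams is the sketch below, where over-quota allocations are preempted when a team reclaims its guarantee. The quotas, team names, and preemption order are illustrative assumptions, not the platform's actual mechanics:

```python
class FairSharePool:
    """Toy model of guaranteed quotas with reclaimable over-quota borrowing."""

    def __init__(self, total: float, quotas: dict[str, float]):
        self.total = total
        self.quotas = quotas                      # team -> guaranteed GPUs
        self.alloc = {t: 0.0 for t in quotas}

    def free(self) -> float:
        return self.total - sum(self.alloc.values())

    def request(self, team: str, gpus: float) -> bool:
        """Grant from free capacity; if a team is within its guarantee
        and the pool is full, preempt other teams' over-quota usage."""
        if self.alloc[team] + gpus <= self.quotas[team]:
            shortfall = gpus - self.free()
            if shortfall > 0:
                self._preempt(shortfall)
        if gpus <= self.free():
            self.alloc[team] += gpus
            return True
        return False                              # would queue in practice

    def _preempt(self, amount: float) -> None:
        """Reclaim capacity from teams running above their quota."""
        for t, used in self.alloc.items():
            over = used - self.quotas[t]
            if over > 0 and amount > 0:
                take = min(over, amount)
                self.alloc[t] -= take
                amount -= take
```

In the three-team scenario, NLP can borrow the generative AI team's idle GPUs for a burst of training, and hand them back automatically the moment that team submits jobs within its guarantee.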
Figure 2. NVIDIA Run:ai Dashboard
Get started
As enterprises scale their AI efforts, managing GPU infrastructure manually becomes unsustainable. NVIDIA Run:ai, in combination with NVIDIA technologies on AWS, offers a powerful orchestration layer that simplifies GPU management, boosts utilization, and accelerates AI innovation.
With native integration into EKS, EC2, IAM, SageMaker HyperPod, and CloudWatch, NVIDIA Run:ai provides a unified, enterprise-ready foundation for AI/ML workloads in the cloud.
To learn more or deploy NVIDIA Run:ai on your AWS environment, visit the NVIDIA Run:ai listing on AWS Marketplace or explore the NVIDIA Run:ai documentation.