NVIDIA Run:ai at Scale

Operating NVIDIA Run:ai at scale means ensuring the system can efficiently handle fluctuating workloads while maintaining optimal performance. As clusters grow, whether due to an increasing number of nodes or a surge in workload demand, NVIDIA Run:ai services must be appropriately tuned to support large-scale environments.

This guide outlines best practices for optimizing NVIDIA Run:ai for high-performance deployments, including NVIDIA Run:ai system services configuration, vertical scaling (adjusting CPU and memory resources), and, where applicable, horizontal scaling (replicas).

Each of the NVIDIA Run:ai containers has default resource requirements that reflect an average customer load. With significantly larger cluster loads, certain NVIDIA Run:ai services will require more CPU and memory resources. NVIDIA Run:ai supports configuring these resources for each NVIDIA Run:ai service group separately. For instructions and more information, see NVIDIA Run:ai services resource management.
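For illustration, what such a configuration ultimately sets is a standard Kubernetes resources block per service group. The following is a hypothetical values fragment; the scheduler key is an assumption for illustration only, and the exact key paths are given in NVIDIA Run:ai services resource management:

# Hypothetical values fragment. The "scheduler" service-group key is an
# assumption for illustration; see NVIDIA Run:ai services resource management
# for the exact keys and recommended values.
scheduler:
  resources:
    requests:
      cpu: "1"
      memory: 2Gi
    limits:
      cpu: "2"
      memory: 4Gi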

Scheduling Services

The scheduling services group should be scaled together with the number of nodes and the number of workloads handled by the Scheduler (running and pending). The resource recommendations for this group are based on internal benchmarks performed on stressed environments.
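Before raising these values, it can help to compare the Scheduler's actual consumption against its current requests. A minimal sketch, assuming the NVIDIA Run:ai cluster services run in the runai namespace and metrics-server is available for kubectl top:

# List current CPU and memory consumption of the scheduling pods.
# The "runai" namespace and "scheduler" name filter are assumptions;
# adjust them to match your installation.
kubectl top pods -n runai | grep scheduler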

Sync and Workload Services

The sync and workload service groups are less sensitive to scale. For large or intensive environments, it is still recommended to increase their CPU and memory resources; see NVIDIA Run:ai services resource management for how to configure them.

Horizontal Scaling

By default, NVIDIA Run:ai cluster services are deployed with a single replica. For large-scale and intensive environments, it is recommended to scale the NVIDIA Run:ai services horizontally by increasing the number of replicas. For more information, see NVIDIA Run:ai services replicas.
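As a quick check before scaling out, you can list the current deployments and their replica counts. A minimal sketch, assuming the cluster services run in the default runai namespace:

# Show current replica counts for the NVIDIA Run:ai cluster services
# (the "runai" namespace is an assumption; adjust to your installation).
kubectl get deployments -n runai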

Prometheus Remote-Write Tuning

NVIDIA Run:ai relies on Prometheus to scrape cluster metrics and forward them to the NVIDIA Run:ai control plane. The volume of metrics generated is directly proportional to the number of nodes, workloads, and projects in the system. When operating at scale, with hundreds or thousands of nodes and projects, the system generates a significant volume of metrics, which can strain the cluster and the network bandwidth.

To mitigate this impact, it is recommended to tune the Prometheus remote-write configuration. See remote write tuning for the tuning parameters available via the remote-write configuration, and refer to this article for optimizing Prometheus remote-write performance.

You can apply the remote-write configurations required as described in advanced cluster configurations.
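As a sketch of how such a change is applied, assuming the runaiconfig custom resource described in advanced cluster configurations:

# Open the cluster configuration custom resource for editing.
# The "runaiconfig" resource name and "runai" namespace follow the
# advanced cluster configurations guide; verify them in your installation.
kubectl edit runaiconfig runai -n runai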

The following example demonstrates the recommended approach in NVIDIA Run:ai for tuning Prometheus remote-write configurations:

remoteWrite:
  queueConfig:
    capacity: 5000            # samples buffered per shard before blocking
    maxSamplesPerSend: 1000   # maximum samples per remote-write request
    maxShards: 100            # upper bound on concurrent write shards
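With these values, each of up to 100 shards buffers up to 5,000 samples and sends at most 1,000 samples per request, which bounds both the memory consumed by the write queues and the size of each request sent to the control plane.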
Scaling the NVIDIA Run:ai Control Plane

For clusters with more than 32 nodes (SuperPod and larger), increase the replica count for key control plane services to 2.

To set the replica count, use the following NVIDIA Run:ai control plane Helm flag:

--set <service>.replicaCount=2
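For example, applied as part of a helm upgrade. The release name, chart reference, and namespace below are assumptions and should match your original installation:

# Hypothetical upgrade command: replace <service>, and verify the release
# name, chart, and namespace against your installation.
helm upgrade runai-backend runai-backend/control-plane -n runai-backend \
  --reuse-values \
  --set <service>.replicaCount=2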

Replicas for the following services should not be increased: postgres, keycloak, grafana, thanos, nats, redoc, cluster-migrator, identity provider reconciler, settings migrator.

For Grafana, enable autoscaling first and then set the minimum number of replicas (minReplicas). Use the following NVIDIA Run:ai control plane Helm flags:

--set grafana.autoscaling.enabled=true \
--set grafana.autoscaling.minReplicas=2
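With autoscaling enabled, the Kubernetes Horizontal Pod Autoscaler maintains at least two Grafana replicas and can add more under load, up to the maximum configured in the chart.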

Thanos is the third-party component NVIDIA Run:ai uses to store metrics under significant user load. Use the following NVIDIA Run:ai control plane Helm flags to increase the resources of the Thanos query and receive functions:

--set thanos.query.resources.limits.memory=3G \
--set thanos.query.resources.requests.memory=3G \
--set thanos.query.resources.limits.cpu=1 \
--set thanos.query.resources.requests.cpu=1 \
--set thanos.receive.resources.limits.memory=15G \
--set thanos.receive.resources.requests.memory=15G \
--set thanos.receive.resources.limits.cpu=2 \
--set thanos.receive.resources.requests.cpu=2
