This section explains how to view and manage clusters.
The Clusters table provides a quick and easy way to see the status of your clusters.
The Clusters table can be found under Resources in the NVIDIA Run:ai platform.
The Clusters table lists the clusters added to the NVIDIA Run:ai platform, along with their status.
The clusters table consists of the following columns:
Status - The status of the cluster. For more information, see the table below. Hover over the information icon for a short description and links to troubleshooting.
Creation time - The timestamp when the cluster was created.
URL - The URL that was given to the cluster.
NVIDIA Run:ai cluster version - The NVIDIA Run:ai version installed on the cluster.
Kubernetes distribution - The flavor of Kubernetes distribution.
Kubernetes version - The version of Kubernetes installed.
NVIDIA Run:ai cluster UUID - The unique ID of the cluster.
Waiting to connect - The cluster has never been connected.
Has service issues - At least one of the services is not working properly. You can view the list of nonfunctioning services for more information. See the troubleshooting scenarios below.
Connected - The NVIDIA Run:ai cluster is connected, and all NVIDIA Run:ai services are running.
Customizing the Table View
Filter - Click ADD FILTER, select the column to filter by, and enter the filter values
Search - Click SEARCH and type the value to search by
Sort - Click each column header to sort by
Column selection - Click COLUMNS and select the columns to display in the table
Download table - Click MORE and then click Download as CSV. Export to CSV is limited to 20,000 rows.
To add a new cluster, see the installation guide.
Select the cluster you want to remove
A dialog appears. Make sure to read the message carefully before removing.
Click REMOVE to confirm the removal.
Go to the Clusters API reference to view the available actions
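For example, clusters can also be listed programmatically. The following is a minimal sketch, assuming the v1 clusters endpoint and bearer-token authentication; replace the placeholders with your control plane URL and an API token, and see the API reference for the exact endpoints and authentication flow:
curl -s "https://<control-plane-url>/api/v1/clusters" -H "Authorization: Bearer <api-token>" -H "Content-Type: application/json"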
Before starting, make sure you have access to the Kubernetes cluster where NVIDIA Run:ai is deployed with the necessary permissions
Troubleshooting Scenarios
Cluster disconnected
Description: When the cluster's status is ‘disconnected’, there is no communication from the cluster services reaching the NVIDIA Run:ai Platform. This may be due to networking issues or issues with NVIDIA Run:ai services.
Mitigation:
Check NVIDIA Run:ai’s services status:
Make sure you have access to the Kubernetes cluster with permissions to view pods
Copy and paste the following command to verify that NVIDIA Run:ai’s services are running:
kubectl get pods -n runai | grep -E 'runai-agent|cluster-sync|assets-sync'
If any of the services are not running, see the ‘cluster has service issues’ scenario.
Check the network connection
Make sure you have access to the Kubernetes cluster with permissions to create pods
Copy and paste the following command to create a connectivity check pod:
kubectl run control-plane-connectivity-check -n runai --image=wbitt/network-multitool --command -- /bin/sh -c 'curl -sSf <control-plane-endpoint> > /dev/null && echo "Connection Successful" || echo "Failed connecting to the Control Plane"'
Replace <control-plane-endpoint> with the URL of the Control Plane in your environment. If the pod fails to connect to the Control Plane, check for potential network policies:
Check and modify the network policies
Copy and paste the following command to check the existence of network policies:
kubectl get networkpolicies -n runai
Review the policies to ensure that they allow traffic from the NVIDIA Run:ai namespace to the Control Plane. If necessary, update the policies to allow the required traffic
Example of allowing traffic:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-control-plane-traffic
  namespace: runai
spec:
  podSelector:
    matchLabels:
      app: runai
  policyTypes:
    - Ingress
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: <control-plane-ip-range>
      ports:
        - protocol: TCP
          port: <control-plane-port>
  ingress:
    - from:
        - ipBlock:
            cidr: <control-plane-ip-range>
      ports:
        - protocol: TCP
          port: <control-plane-port>
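After updating the manifest, apply it to the cluster. This is a sketch; the file name is an assumption:
kubectl apply -f allow-control-plane-traffic.yaml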
Check infrastructure-level configurations:
Ensure that firewall rules and security groups allow traffic between your Kubernetes cluster and the Control Plane
Verify required ports and protocols:
Ensure that the necessary ports and protocols for NVIDIA Run:ai’s services are not blocked by any firewalls or security groups
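For a quick check from inside the cluster, you can run a one-off pod and test whether the Control Plane port is reachable. This is a sketch using the same network-multitool image as above; replace the host and port placeholders with your environment's values:
kubectl run port-check -n runai --rm -i --restart=Never --image=wbitt/network-multitool --command -- nc -zv <control-plane-host> <control-plane-port>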
Check NVIDIA Run:ai services logs
Make sure you have access to the Kubernetes cluster with permissions to view logs
Copy and paste the following commands to view the logs of the NVIDIA Run:ai services:
kubectl logs deployment/runai-agent -n runai
kubectl logs deployment/cluster-sync -n runai
kubectl logs deployment/assets-sync -n runai
Try to identify the problem from the logs. If you cannot resolve the issue, continue to the next step.
Diagnosing internal network issues: NVIDIA Run:ai operates on Kubernetes, which uses its internal subnet and DNS services for communication between pods and services. If you find connectivity issues in the logs, the problem might be related to Kubernetes' internal networking.
To diagnose DNS or connectivity issues, you can start a debugging pod with networking utilities:
Copy the following command to your terminal, to start a pod with networking tools:
kubectl run -i --tty netutils --image=dersimn/netutils -- bash
This command creates an interactive pod (netutils) where you can use networking commands such as ping, curl, and nslookup to troubleshoot network issues.
Use this pod to perform network resolution tests and other diagnostics to identify any DNS or connectivity problems within your Kubernetes cluster.
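For example, from inside the netutils pod you can check in-cluster DNS resolution and Control Plane reachability (replace the endpoint placeholder with your Control Plane URL):
nslookup kubernetes.default.svc.cluster.local
curl -sSfv <control-plane-endpoint>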
Contact NVIDIA Run:ai’s support
Cluster has service issues
Description: When a cluster's status is ‘Has service issues’, it means that one or more NVIDIA Run:ai services running in the cluster are not available.
Mitigation:
Verify non-functioning services
Make sure you have access to the Kubernetes cluster with permissions to view the runaiconfig resource
Copy and paste the following command to determine which services are not functioning:
kubectl get runaiconfig -n runai runai -ojson | jq -r '.status.conditions | map(select(.type == "Available"))'
Check for Kubernetes events
Make sure you have access to the Kubernetes cluster with permissions to view events
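For example, recent events in the runai namespace can be listed, sorted by time:
kubectl get events -n runai --sort-by='.lastTimestamp'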
Inspect resource details
Make sure you have access to the Kubernetes cluster with permissions to describe resources
Copy and paste the following command to check the details of the required resource:
kubectl describe <resource_type> <name>
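For example, to inspect one of the NVIDIA Run:ai deployments (runai-agent is used here purely as an illustration):
kubectl describe deployment runai-agent -n runai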
Contact NVIDIA Run:ai’s Support
Cluster is waiting to connect
Description: When the cluster's status is ‘waiting to connect’, it means that no communication from the cluster services reaches the NVIDIA Run:ai Platform. This may be due to networking issues or issues with NVIDIA Run:ai services.
Mitigation:
Check NVIDIA Run:ai’s services status
Make sure you have access to the Kubernetes cluster with permissions to view pods
Copy and paste the following command to verify that NVIDIA Run:ai’s services are running:
kubectl get pods -n runai | grep -E 'runai-agent|cluster-sync|assets-sync'
If any of the services are not running, see the ‘cluster has service issues’ scenario.
Check the network connection
Make sure you have access to the Kubernetes cluster with permissions to create pods
Copy and paste the following command to create a connectivity check pod:
kubectl run control-plane-connectivity-check -n runai --image=wbitt/network-multitool --command -- /bin/sh -c 'curl -sSf <control-plane-endpoint> > /dev/null && echo "Connection Successful" || echo "Failed connecting to the Control Plane"'
Replace <control-plane-endpoint> with the URL of the Control Plane in your environment. If the pod fails to connect to the Control Plane, check for potential network policies:
Check and modify the network policies
Copy and paste the following command to check the existence of network policies:
kubectl get networkpolicies -n runai
Review the policies to ensure that they allow traffic from the NVIDIA Run:ai namespace to the Control Plane. If necessary, update the policies to allow the required traffic
Example of allowing traffic:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-control-plane-traffic
  namespace: runai
spec:
  podSelector:
    matchLabels:
      app: runai
  policyTypes:
    - Ingress
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: <control-plane-ip-range>
      ports:
        - protocol: TCP
          port: <control-plane-port>
  ingress:
    - from:
        - ipBlock:
            cidr: <control-plane-ip-range>
      ports:
        - protocol: TCP
          port: <control-plane-port>
Check infrastructure-level configurations:
Ensure that firewall rules and security groups allow traffic between your Kubernetes cluster and the Control Plane
Verify required ports and protocols:
Ensure that the necessary ports and protocols for NVIDIA Run:ai’s services are not blocked by any firewalls or security groups
Check NVIDIA Run:ai services logs
Make sure you have access to the Kubernetes cluster with permissions to view logs
Copy and paste the following commands to view the logs of the NVIDIA Run:ai services:
kubectl logs deployment/runai-agent -n runai
kubectl logs deployment/cluster-sync -n runai
kubectl logs deployment/assets-sync -n runai
Try to identify the problem from the logs. If you cannot resolve the issue, continue to the next step
Contact NVIDIA Run:ai’s support
Missing prerequisites
Description: When a cluster's status displays Missing prerequisites, it indicates that at least one of the Mandatory Prerequisites has not been fulfilled. In such cases, NVIDIA Run:ai services may not function properly.
Mitigation:
If you have ensured that all prerequisites are installed and the status still shows missing prerequisites, follow these steps:
Check the message in the NVIDIA Run:ai platform for further details regarding the missing prerequisites.
Inspect the runai-public ConfigMap:
Open your terminal and type the following command to list all ConfigMaps in the runai-public namespace:
kubectl get configmap -n runai-public
Describe the ConfigMap
Locate the ConfigMap named runai-public in the list
To view the detailed contents of this ConfigMap, type the following command:
kubectl describe configmap runai-public -n runai-public
Find Missing Prerequisites
In the output displayed, look for a section labeled dependencies.required
This section provides detailed information about any missing resources or prerequisites. Review this information to identify what is needed
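If the describe output is long, you can also pull the ConfigMap data and search for the relevant section directly. This is a sketch; the exact layout of the data inside the ConfigMap may differ:
kubectl get configmap runai-public -n runai-public -o yaml | grep -A 20 'dependencies.required'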
Contact NVIDIA Run:ai’s support