NCP-AIO NVIDIA AI Operations Questions and Answers
A system administrator needs a tool that covers all of the following:
GPU behavior monitoring
GPU configuration management
GPU policy oversight
GPU health and diagnostics
GPU accounting and process statistics
NVSwitch configuration and monitoring
What single tool should be used?
You are deploying an AI workload on a Kubernetes cluster that requires access to GPUs for training deep learning models. However, the pods are not able to detect the GPUs on the nodes.
What would be the first step to troubleshoot this issue?
A system administrator notices that jobs are failing intermittently on Base Command Manager due to incorrect GPU configurations in Slurm. The administrator needs to ensure that jobs utilize GPUs correctly.
How should they troubleshoot this issue?
After completing the installation of a Kubernetes cluster on your NVIDIA DGX systems using BCM, how can you verify that all worker nodes are properly registered and ready?
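For reference, node registration is commonly checked with `kubectl get nodes`, where each node should report a `Ready` status. The sketch below is one possible way to automate that check, assuming `kubectl` is installed and configured for the cluster. The function names `not_ready_nodes` and `fetch_nodes` are hypothetical helpers introduced for illustration and are not part of BCM or Kubernetes.

```python
import json
import subprocess


def not_ready_nodes(node_list: dict) -> list[str]:
    """Return the names of nodes whose Ready condition is not "True".

    `node_list` is the parsed output of `kubectl get nodes -o json`.
    """
    missing = []
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
        if not ready:
            missing.append(node["metadata"]["name"])
    return missing


def fetch_nodes() -> dict:
    """Run kubectl and parse its JSON output (requires cluster access)."""
    result = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

On a machine with cluster access, `not_ready_nodes(fetch_nodes())` returns an empty list when every node reports `Ready`.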
What should an administrator check if GPU-to-GPU communication is slow in a distributed system using Magnum IO?
You are using BCM to configure an active-passive high availability (HA) cluster for a firewall system. To ensure seamless failover, what is one best practice related to session synchronization between the active and passive nodes?
You are tasked with deploying a DOCA service on an NVIDIA BlueField DPU in an air-gapped data center environment. The DPU has the required BlueField OS version (3.9.0 or higher) installed, and you have access to the necessary container image from NVIDIA's NGC catalog. However, you need to ensure that the deployment process is successful without an internet connection.
Which of the following steps should you take to deploy the DOCA service on the DPU?
A new researcher needs access to GPU resources but should not have permission to modify cluster settings or manage other users.
What role should you assign them in Run:ai?
Your organization is deploying an AI workload that requires high-throughput access to shared storage across multiple servers. The workload involves both training and inference tasks that need fast read and write speeds.
Which storage architecture would best support this AI workload?
You have successfully pulled a TensorFlow container from NGC and now need to run it on your stand-alone GPU-enabled server.
Which command should you use to ensure that the container has access to all available GPUs?
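For reference, Docker 19.03 and later with the NVIDIA Container Toolkit exposes GPUs to a container through the `--gpus` flag, and `--gpus all` requests every GPU on the host. The sketch below only assembles such a command line. The helper name `docker_gpu_command` is hypothetical, and the image tag is a placeholder that must be replaced with the tag actually pulled from NGC.

```python
import shlex


def docker_gpu_command(image: str, gpus: str = "all") -> list[str]:
    """Build an argv list for running `image` with GPU access.

    `gpus` is passed straight to Docker's --gpus flag; "all" requests
    every GPU visible on the host.
    """
    return ["docker", "run", "--gpus", gpus, "--rm", "-it", image]


# Placeholder tag: substitute the tag that was pulled from NGC.
command = docker_gpu_command("nvcr.io/nvidia/tensorflow:TAG")
print(shlex.join(command))
```

The printed string can be pasted into a shell, or the argv list can be handed to `subprocess.run` directly.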
Your Kubernetes cluster is running a mixture of AI training and inference workloads. You want to ensure that inference services have higher priority than training jobs during peak resource usage times.
How would you configure Kubernetes to prioritize inference workloads?
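For reference, Kubernetes expresses workload priority through `PriorityClass` objects (API group `scheduling.k8s.io/v1`), which pods reference via `spec.priorityClassName`. The sketch below builds such a manifest as a Python dictionary and serializes it to JSON, which `kubectl apply -f` accepts. The class name `inference-high`, the value `1000000`, and the helper names are assumptions chosen for illustration only.

```python
import json


def priority_class(name: str, value: int, description: str = "") -> dict:
    """Return a PriorityClass manifest; higher `value` means higher priority."""
    return {
        "apiVersion": "scheduling.k8s.io/v1",
        "kind": "PriorityClass",
        "metadata": {"name": name},
        "value": value,
        "globalDefault": False,
        "description": description,
    }


def with_priority(pod_spec: dict, class_name: str) -> dict:
    """Return a copy of a pod spec that references the given PriorityClass."""
    updated = dict(pod_spec)
    updated["priorityClassName"] = class_name
    return updated


# Hypothetical class for inference services.
manifest = priority_class("inference-high", 1000000, "Inference services")
print(json.dumps(manifest, indent=2))
```

Training pods would then either reference a second class with a lower value or omit `priorityClassName` altogether, so that the scheduler favors inference pods when resources are scarce.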
You are a Solutions Architect designing a data center infrastructure for a cloud-based AI application that requires high-performance networking, storage, and security. You need to choose a software framework to program the NVIDIA BlueField DPUs that will be used in the infrastructure. The framework must support the development of custom applications and services, as well as enable tailored solutions for specific workloads. Additionally, the framework should allow for the integration of storage services such as NVMe over Fabrics (NVMe-oF) and elastic block storage.
Which framework should you choose?
What is the primary purpose of assigning a provisioning role to a node in NVIDIA Base Command Manager (BCM)?
A GPU administrator needs to virtualize AI/ML training in an HGX environment.
How can the NVIDIA Fabric Manager be used to meet this demand?
