Your organization is running multiple AI models on a single A100 GPU using MIG in a multi-tenant environment. One of the tenants reports a performance issue, but you notice that other tenants are unaffected.
What feature of MIG ensures that one tenant's workload does not impact others?
An administrator is troubleshooting a bottleneck in a deep learning run time and needs consistent data feed rates to GPUs.
Which storage metric should be used?
A system administrator is troubleshooting a Docker container that crashes unexpectedly due to a segmentation fault. They want to generate and analyze core dumps to identify the root cause of the crash.
Why would generating core dumps be a critical step in troubleshooting this issue?
A system administrator needs to lower latency for an AI application by utilizing GPUDirect Storage.
What two (2) bottlenecks are avoided with this approach? (Choose two.)
A system administrator is looking to set up virtual machines in an HGX environment with NVIDIA Fabric Manager.
What three (3) tasks will Fabric Manager accomplish? (Choose three.)
An organization only needs basic network monitoring and validation tools.
Which UFM platform should they use?
You are configuring cloudbursting for your on-premises cluster using BCM, and you plan to extend the cluster into both AWS and Azure.
What is a key requirement for enabling cloudbursting across multiple cloud providers?
An organization has multiple containers and wants to view STDIN, STDOUT, and STDERR I/O streams of a specific container.
What command should be used?
What should an administrator check if GPU-to-GPU communication is slow in a distributed system using Magnum IO?
A system administrator needs to scale a Kubernetes Job to 4 replicas.
What command should be used?
You are managing a Slurm cluster with multiple GPU nodes, each equipped with different types of GPUs. Some jobs are being allocated GPUs that should be reserved for other purposes, such as display rendering.
How would you ensure that only the intended GPUs are allocated to jobs?
A system administrator is experiencing issues with Docker containers failing to start due to volume mounting problems. They suspect the issue is related to incorrect file permissions on shared volumes between the host and containers.
How should the administrator troubleshoot this issue?
You are managing a high availability (HA) cluster that hosts mission-critical applications. One of the nodes in the cluster has failed, but the application remains available to users.
What mechanism is responsible for ensuring that the workload continues to run without interruption?
A data scientist is training a deep learning model and notices slower than expected training times. The data scientist alerts a system administrator to inspect the issue. The system administrator suspects the disk IO is the issue.
What command should be used?
Which two (2) ways does the pre-configured GPU Operator in NVIDIA Enterprise Catalog differ from the GPU Operator in the public NGC catalog? (Choose two.)
A system administrator needs to configure and manage multiple installations of NVIDIA hardware ranging from single DGX BasePOD to SuperPOD.
Which software stack should be used?