NCP-AII NVIDIA AI Infrastructure Questions and Answers

Questions 4

A systems administrator is preparing a new DGX server for deployment. What is the most secure approach to configuring the BMC port during initial setup?

Options:

A.

Enable remote access to the BMC over the internet using the default admin credentials for initial troubleshooting.

B.

Connect the BMC port directly to the production network and retain default admin credentials for convenience.

C.

Leave the BMC port disconnected until after the operating system is fully configured and in production.

D.

Connect the BMC port to a dedicated and firewalled network and change the default admin credentials.

Questions 5

An infrastructure engineer is preparing a new AI cluster for production use, relying on NVIDIA switches and high-speed optical transceivers for node connectivity. The team is finalizing network validation before launching large-scale training jobs. Why is it critical to confirm and align the firmware version on all switch transceivers prior to production?

Options:

A.

To guarantee that hardware inventory tools can report serial numbers and manufacturer codes for asset management, which is critical for future support and troubleshooting.

B.

To ensure stability, bandwidth, and compatibility across the cluster, avoiding link issues and performance loss.

C.

To allow the network operating system to automatically discover all connected transceivers with heterogeneous firmware.

D.

To reduce GPU memory consumption during distributed training jobs.
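
The firmware alignment check this question describes can be sketched in a few lines. This is a minimal illustration, not a vendor tool: the inventory dictionary, the port names, and the version strings are hypothetical placeholders, and in practice the data would come from whatever the switch operating system reports.

```python
from collections import defaultdict


def group_by_firmware(inventory):
    """Group port names by the transceiver firmware version they report."""
    groups = defaultdict(list)
    for port, version in inventory.items():
        groups[version].append(port)
    return dict(groups)


def mismatched_ports(inventory, expected):
    """Return the sorted ports whose firmware differs from the expected version."""
    return sorted(port for port, version in inventory.items() if version != expected)


# Hypothetical inventory: port name -> transceiver firmware version.
inventory = {"swp1": "47.120.0", "swp2": "47.120.0", "swp3": "46.110.2"}
print(mismatched_ports(inventory, expected="47.120.0"))  # ['swp3']
```

An empty result means every transceiver in the inventory reports the expected version.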

Questions 6

You are tasked with setting up High Availability (HA) for NVIDIA Base Command Manager (BCM) in a new GPU cluster. The cluster consists of a primary head node, a secondary head node, and several compute nodes. The requirements are automatic failover of BCM services, minimal disruption to workloads, and proper cluster health monitoring during and after installation. During your BCM HA installation and configuration process, which two of the following actions are mandatory for ensuring a robust and verified HA cluster configuration?

Pick the 2 correct responses below.

Options:

A.

Assign a floating Virtual IP address that can automatically migrate between the primary and secondary head nodes during failover.

B.

Compute nodes must be powered on and performing work to initiate synchronization of the head nodes.

C.

After configuration is complete, simulate a failover by stopping BCM services on the active head node to verify that all services are running on the secondary node with no interruption.

D.

Configure both head nodes to use independent static IP addresses for BCM services instead of relying on a shared virtual IP address.

E.

During configuration, explicitly synchronize both the configuration and state data directories from the primary to the secondary head node to ensure consistency.

Questions 7

A team is installing the NVIDIA Run:ai control plane on a Kubernetes cluster. Which two (2) options are most critical to validate before proceeding? (Pick the 2 correct responses below)

Options:

A.

Helm is installed on the installer machine.

B.

Ensure Kubernetes is running on the cluster.

C.

All cluster nodes have NVIDIA GPUs installed.

D.

NTP is disabled to simplify time synchronization.

Questions 8

What is the primary purpose of performing a NeMo burn-in on a new AI infrastructure?

Options:

A.

To benchmark production training speed and ensure all GPUs are running at identical clock speeds.

B.

To stress test the hardware and software stack with representative NeMo workloads, ensuring reliability.

C.

To tune NeMo model hyperparameters for maximum accuracy on user datasets during cluster deployment.

Questions 9

After configuring NGC CLI with ngc config set, a user receives “Authentication failed” errors when pulling containers. What step was most likely omitted?

Options:

A.

Installing the CLI with apt-get instead of manual extraction.

B.

Entering the API key during ngc config set or storing it in ~/.ngc/config.

C.

Setting --format_type=json to enable API interactions.

D.

Running sudo systemctl restart docker after configuration.

Questions 10

Two ports must be connected, but one is SFP and the other is QSFP: for example, a 25 GbE host channel adapter must connect to a QSFP port capable of both 100 GbE and 25 GbE. Which of the following solutions best meets this requirement?

Options:

A.

SFP Connectors

B.

SFP to 1G BASE-T (RJ45) adapter

C.

QSA Adapter

Questions 11

An engineer must ensure that a BlueField-3 NIC firmware download matches the cluster’s PSID. Which step is critical before installation?

Options:

A.

Check that the DPU’s BMC IP is reachable by ping.

B.

Confirm that the firmware file size matches the DPU’s flash capacity.

C.

Use mstflint -d <PCI_ID> query to validate the device PSID before selecting the firmware image.

D.

Verify that the SHA256 hash of the firmware matches NVIDIA’s public ledger.
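
The PSID comparison this question revolves around can be sketched as a small text check. This is a minimal sketch, assuming the query output contains a line that begins with PSID: followed by the identifier; the sample text and its values are placeholders rather than real device output, and the function names are hypothetical.

```python
import re


def extract_psid(query_output):
    """Return the PSID value from query-style text, or None if no PSID line exists."""
    match = re.search(r"^PSID:\s*(\S+)", query_output, flags=re.MULTILINE)
    return match.group(1) if match else None


def psid_matches(device_output, image_psid):
    """True when the device PSID equals the PSID the firmware image targets."""
    device_psid = extract_psid(device_output)
    return device_psid is not None and device_psid == image_psid


# Illustrative text only; the values are placeholders, not real device output.
sample = """FW Version:            0.0.0000
PSID:                  MT_0000000000
"""
print(extract_psid(sample))  # MT_0000000000
```

A firmware image would only be selected when psid_matches returns True for the device being updated.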

Questions 12

An administrator needs to perform a comprehensive pre-production stress test on a DGX H100 system. Which command validates GPU, CPU, memory, and storage components while following NVIDIA’s recommended procedure?

Options:

A.

nvidia-smi -q | grep "GPU Stress Test"

B.

sudo nvsm stress-test --force

C.

stress --cpu $(nproc) --io $(nproc) --timeout 600

D.

./gpu_burn 60

Questions 13

You are standing up an NVIDIA DGX system for enterprise production. Stakeholder teams require system reliability, performance consistency under load, and proper escalation processes before release. A recent system in another cluster experienced intermittent GPU failures attributed to missed early-stage validation. Which deployment and validation sequence best addresses production readiness and mitigates the risk of avoidable downtime or performance loss?

Options:

A.

Install latest OS images and drivers, confirm OS and container functionality, invite users for a monitored production trial, and collect workload feedback to plan any further diagnostics or updates.

B.

Complete hardware and cabling, power on the system, update firmware and drivers, run full hardware health checks and stress diagnostics using NVSM, verify all GPU and system sensor logs, and validate GPU accessibility.

C.

Update network topology, assign static IPs and DNS entries, register the system with NVIDIA, then conduct basic OS-level checks and enable user access after login testing is successful.

D.

Power on the system, install all AI frameworks, configure the CUDA and library stack, set up user environments, then plan stress tests and diagnostics as part of ongoing routine operations.

Questions 14

A system administrator needs to configure a BlueField DPU and enable RShim on the baseboard management controller (BMC). Which command should be executed?

Options:

A.

ipmitool raw 0x32 0x6a 1

B.

systemctl restart rshim

C.

systemctl enable bmc-rshim.service

D.

scp <path_to_bfb> root@<bmc_ip>:/dev/rshim0/boot

Questions 15

An administrator installs NVIDIA GPU drivers on a DGX H100 system with UEFI Secure Boot enabled. After reboot, the drivers fail to load. What is the first action to resolve this issue?

Options:

A.

Disable Secure Boot permanently in BIOS/UEFI settings.

B.

Delete /etc/X11/xorg.conf to force driver reconfiguration.

C.

Enroll the Machine Owner Key (MOK) during system reboot and enter the recorded password.

D.

Reinstall drivers using apt-get install nvidia-driver-550 without rebooting.

Questions 16

You are training a deep neural network using NCCL to coordinate communication across four GPUs in a single node. During early performance testing, you notice inconsistent scaling and longer-than-expected training times, even though all GPUs are being used. Which strategy would most effectively improve NCCL efficiency and collective operation performance in this setting?

Options:

A.

Adjust the batch size so that each GPU receives an equal-sized portion of the batch, ensuring all GPUs process similar workloads and communication is evenly distributed.

B.

Assign the largest possible workload to the first GPU to maximize its utilization, and allow the remaining GPUs to process smaller or variable batch sizes as needed.

C.

Disable automatic load balancing so that the deep learning framework can dynamically assign samples to any GPU available during each iteration.

D.

Increase the communication frequency between GPUs while allowing workloads to be unevenly split, so synchronization is more frequent and model updates happen faster.
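
The equal-split idea mentioned in option A can be sketched as simple arithmetic. This is a minimal illustration with hypothetical function names: it computes the per-GPU share of a global batch and, when the batch does not divide evenly, the largest batch size that does.

```python
def per_gpu_batch(global_batch, num_gpus):
    """Return the equal per-GPU share, or raise if the batch cannot be split evenly."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    if global_batch % num_gpus != 0:
        raise ValueError("global batch is not divisible by the number of GPUs")
    return global_batch // num_gpus


def largest_even_global_batch(global_batch, num_gpus):
    """Round the global batch down to the nearest multiple of num_gpus."""
    return (global_batch // num_gpus) * num_gpus


print(per_gpu_batch(256, 4))              # 64
print(largest_even_global_batch(250, 4))  # 248
```

With four GPUs, a global batch of 250 would leave two GPUs with one more sample than the others; rounding down to 248 gives each GPU 62 samples.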

Questions 17

An engineer needs to verify the current firmware versions of all components (ATF, BSP, NIC, UEFI) on a BlueField-3 DPU’s BMC. Which Redfish API command provides this information?

Options:

A.

mlxconfig -d <dev> q

B.

curl -k -u root:<password> -X GET https://<DPU-BMC-IP>/redfish/v1/UpdateService/FirmwareList

C.

mstflint -d <PCI_ID> query full

D.

curl -k -u root:<password> -X GET https://<DPU-BMC-IP>/redfish/v1/UpdateService/FirmwareInventory
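
Reading versions out of a Redfish firmware inventory can be sketched as follows. This is a minimal sketch, assuming the standard Redfish layout in which a collection lists its entries under Members, each with an @odata.id link, and each linked resource carries Id and Version properties. The fetch callable is injected so the example runs without a BMC, and the EXAMPLE_ names and version strings are hypothetical placeholders.

```python
def firmware_versions(collection, fetch):
    """Map each inventory member's Id to its Version.

    collection: parsed JSON of a firmware inventory collection.
    fetch: callable that takes an @odata.id path and returns parsed JSON.
    """
    versions = {}
    for member in collection.get("Members", []):
        path = member["@odata.id"]
        item = fetch(path)
        # Fall back to the path when a resource does not report an Id.
        versions[item.get("Id", path)] = item.get("Version")
    return versions


# Hypothetical canned responses standing in for HTTP GET requests.
base = "/redfish/v1/UpdateService/FirmwareInventory"
canned = {
    f"{base}/EXAMPLE_ATF": {"Id": "EXAMPLE_ATF", "Version": "x.y.z"},
    f"{base}/EXAMPLE_UEFI": {"Id": "EXAMPLE_UEFI", "Version": "a.b.c"},
}
collection = {"Members": [{"@odata.id": path} for path in canned]}
print(firmware_versions(collection, canned.__getitem__))
```

In a real script, fetch would issue an authenticated GET against the BMC and return the decoded JSON body.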

Questions 18

A system administrator needs to validate a GPU-based server and ensure that no errors occur under load. What command should be used?

Options:

A.

nvsm dump health

B.

stress-test --usage

C.

nvsm show health

D.

nvsm stress-test

Questions 19

A team is validating a DGX BasePOD deployment. Using cmsh, they run a command to check GPU health across all nodes. What indicates that the system is ready for AI workloads?

Options:

A.

The command output is ignored if the system powers on without errors.

B.

At least half of the GPUs report Status_Health = OK.

C.

All GPUs report Status_Health = OK and Health = OK for each device.

D.

Only the head node’s GPUs need to be healthy.
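
An all-devices health check of the kind these options describe can be sketched as below. This is a minimal sketch under stated assumptions: the field names Status_Health and Health are taken from the options above, while the list-of-dictionaries record layout and the function name are hypothetical and do not reflect actual cmsh output.

```python
def all_gpus_healthy(gpu_reports):
    """True only if at least one GPU is reported and every GPU shows OK in both fields."""
    return bool(gpu_reports) and all(
        report.get("Status_Health") == "OK" and report.get("Health") == "OK"
        for report in gpu_reports
    )


# Hypothetical per-GPU records collected from all nodes.
reports = [
    {"node": "dgx001", "gpu": 0, "Status_Health": "OK", "Health": "OK"},
    {"node": "dgx001", "gpu": 1, "Status_Health": "OK", "Health": "Warning"},
]
print(all_gpus_healthy(reports))  # False
```

Treating an empty report list as unhealthy avoids declaring readiness when no data was collected.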

Questions 20

A customer has just completed the first boot of their DGX system and is prompted to create an administrative user. What is the correct approach for setting up this user to ensure secure BMC and GRUB access?

Options:

A.

Create a unique, strong, lower-case username and password that will be used for both BMC and GRUB access, avoiding default or weak credentials.

B.

Create separate usernames for BMC and GRUB to maximize flexibility.

C.

Skip the creation of a new user and retain the default admin account for BMC and GRUB access.

D.

Use “sysadmin” as the username and a simple password for ease of management.

Questions 21

You are tasked with updating both NVIDIA GPU drivers and DOCA drivers on a set of servers used for AI workloads. The environment previously had an older driver stack and custom kernel modules. What is the most important step to successfully upgrade the drivers without causing conflicts?

Options:

A.

Update the GPU driver leaving the DOCA and OFED drivers unchanged as long as they are detecting the hardware properly.

B.

Validate the driver version post-install since the fresh install will overwrite the legacy drivers.

C.

Keep the older driver running alongside the new version in case you need to roll back the upgrade.

D.

Uninstall all existing GPU and DOCA-related drivers and associated kernel modules before the new install.

Questions 22

For a 48-hour NCCL burn-in test, which parameters ensure sustained fabric stress while detecting silent data corruption?

Options:

A.

broadcast_perf -b 4G -e 16G -w 160

B.

all_reduce_perf -b 8G -e 32G -c 1000 -z 1 -G 1000

C.

all_reduce_perf -b 8G -e 32G -z 1 -G 1000

D.

reduce_scatter_perf -f 2 -g 8

Questions 23

After updating BlueField-3 DPU BMC firmware via Redfish, the engineer observes “TaskState: Running” but no progress after 15 minutes. How should they track the update’s completion status?

Options:

A.

Check /var/log/messages on the DPU operating system for update logs.

B.

Query the DPU BMC with the Task ID of the installation process.

C.

Power cycle the DPU immediately to force a rollback.

D.

Run bfrec --status on the DPU to view flash progress.
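
Tracking a long-running Redfish task by polling it can be sketched as follows. This is a minimal sketch, assuming the task resource exposes the standard Redfish TaskState and PercentComplete properties. The fetch_task callable is injected so the example runs without a BMC, and the canned responses are hypothetical.

```python
import time

# Redfish task states after which no further progress is expected.
TERMINAL_STATES = {"Completed", "Killed", "Exception", "Cancelled"}


def wait_for_task(fetch_task, interval=0.0, max_polls=10, sleep=time.sleep):
    """Poll until the task reaches a terminal state or max_polls is exhausted.

    Returns the last task document seen, so the caller can inspect its state.
    """
    task = {}
    for _ in range(max_polls):
        task = fetch_task()
        if task.get("TaskState") in TERMINAL_STATES:
            break
        sleep(interval)
    return task


# Hypothetical sequence of responses for a single task ID.
responses = iter([
    {"TaskState": "Running", "PercentComplete": 40},
    {"TaskState": "Running", "PercentComplete": 90},
    {"TaskState": "Completed", "PercentComplete": 100},
])
result = wait_for_task(lambda: next(responses))
print(result["TaskState"])  # Completed
```

If the returned state is still Running after max_polls attempts, the caller knows the task has not finished and can decide whether to keep waiting or escalate.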

Questions 24

You are expanding a DGX-based deep learning cluster to train on large, high-resolution images that cannot fit into local cache. Multiple nodes will access this data concurrently and require high performance. Which storage and networking solution best meets these requirements?

Options:

A.

Increase the SSD RAID-0 local cache size in each node so it can absorb most training data, making network storage type and speed less important for performance.

B.

Implement a standard NFS server on a 10GbE network because the cluster can access the export and job performance will not be impacted.

C.

Deploy a high-performance parallel file system across InfiniBand or 40/100GbE, ensuring at least 3 GB/s per node and scalable aggregate bandwidth for all cluster workloads.

D.

Recommend general-purpose object storage for all training data because it is optimized for deep learning workloads and distributed data access at any scale.
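
The per-node figure quoted in option C translates into an aggregate requirement by simple multiplication. This is a minimal arithmetic sketch: the 3 GB/s default comes from that option, while the node count and function name are hypothetical.

```python
def required_aggregate_gb_s(nodes, per_node_gb_s=3.0):
    """Aggregate storage bandwidth (GB/s) needed if every node reads at the per-node rate."""
    if nodes < 0:
        raise ValueError("nodes must be non-negative")
    return nodes * per_node_gb_s


print(required_aggregate_gb_s(32))  # 96.0
```

For a hypothetical 32-node cluster, the storage system would need to sustain about 96 GB/s when all nodes read concurrently.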

Questions 25

As the infrastructure lead for an NVIDIA AI Factory deployment, you have just uploaded the latest supported firmware packages to your DGX system. It is now critical to ensure all hardware components run the new firmware and the DGX returns to full operational capability. Which sequence best guarantees that all relevant components are correctly running updated firmware?

Options:

A.

Perform a software-driven restart on the operating system of every compute node, then use advanced tools to check firmware status, and reissue update commands if any firmware appears inactive afterward.

B.

Execute a single AC power cycle on the DGX after the update process, then reset the software stack and verify status using diagnostic commands on each node for confirmation of all component updates.

C.

Initiate a cold power cycle on all node trays to activate firmware, follow with a DGX reboot procedure, and use the management interface to finish activating CPLD firmware on the host.

D.

Initiate a cold power cycle on the system to activate firmware for components, reset the BMC using the recommended command, and perform an AC power cycle to ensure EROT and CPLD firmware is activated.

Questions 26

During multi-node HPL burn-in, GPUs show uneven utilization. Which configuration ensures balanced workload distribution?

Options:

A.

Enable HPL_USE_NVSHMEM=1 for shared memory acceleration

B.

HPL_RUN_GEMM_TESTS to skip validation

C.

Set --gpu-affinity and --cpu-affinity to align GPU and NUMA nodes

D.

HPL_OOC_TILE_M to 8192 for larger blocks

Questions 27

During BCM cluster setup, an engineer must configure bonded network interfaces on DGX nodes for high availability. Which cmsh command sequence properly configures a bond0 interface with two physical NICs?

Options:

A.

device use dgx001 ; interfaces add vlan vlan100 ; set parent bond0 ; set mode 1 ; set network internalnet

B.

device use dgx001 ; interfaces add bond bond0 ; append interfaces enp225s0f1np1 enp97s0f1np1 ; set mode 1 ; set network internalnet

C.

device use dgx001 ; interfaces set enp225s0f1np1 network internalnet ; interfaces set enp97s0f1np1 network internalnet

D.

device use dgx001 ; interfaces delete enp225s0f1np1 ; interfaces delete enp97s0f1np1

Questions 28

You are installing the operating system as part of the initial setup for a new NVIDIA Base Command Manager (BCM) cluster. Which two of the following actions are essential for a successful OS installation on the cluster’s head node? (Pick the 2 correct responses below)

Options:

A.

Configure network switches for PXE boot to all compute nodes before installing the OS on the head node.

B.

Download the latest BCM ISO and verify its integrity using the provided checksum, then start the installation.

C.

Start the head node OS installation process with the system BIOS set to legacy boot mode instead of UEFI.

D.

Set the desired time zone and configure NTP synchronization during the OS installation wizard.

Questions 29

An engineer wants to verify that an NVIDIA GPU is accessible inside a Docker container for running deep learning workloads. The NVIDIA Container Toolkit is installed on a machine with working NVIDIA drivers. Which command demonstrates the correct way to run a container that can access all available GPUs?

Options:

A.

docker run --rm --runtime=docker nvidia/cuda nvidia-smi

B.

docker run --rm -it ubuntu:22.04 nvidia-smi

C.

docker run --rm --gpus all nvidia/cuda:12.4.6-base-ubuntu22.04 nvidia-smi

D.

docker run --rm nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
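
Assembling a GPU-enabled container command from a script can be sketched as below. This is a minimal sketch, assuming Docker's --gpus flag, which requests GPU devices for the container. The function name is hypothetical, and the image tag is borrowed from option D purely for illustration.

```python
import shlex


def gpu_container_command(image, gpus="all"):
    """Build the argument list for a throwaway container that runs nvidia-smi."""
    return ["docker", "run", "--rm", "--gpus", gpus, image, "nvidia-smi"]


cmd = gpu_container_command("nvidia/cuda:12.4.0-base-ubuntu22.04")
print(shlex.join(cmd))
# docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
```

Keeping the command as a list means it can be passed to subprocess.run without shell quoting concerns.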

Questions 30

What are the purposes of using NCCL to verify the East-West fabric in an NVIDIA AI Factory?

Pick the 2 correct responses below.

Options:

A.

To measure the storage network performance.

B.

To measure the latency between GPUs.

C.

To measure the power consumption of GPUs.

D.

To measure bandwidth between GPUs.
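
The bandwidth figures that NCCL performance tests report can be reproduced by hand. This is a minimal sketch, assuming the convention documented for the nccl-tests suite, where algorithm bandwidth is message size divided by elapsed time and the all-reduce bus bandwidth scales it by 2(n-1)/n for n ranks. The input values and function names are hypothetical.

```python
def algorithm_bandwidth_gb_s(size_bytes, seconds):
    """Algorithm bandwidth in GB/s: bytes moved per second, divided by 1e9."""
    return size_bytes / seconds / 1e9


def allreduce_bus_bandwidth_gb_s(size_bytes, seconds, ranks):
    """Bus bandwidth for all-reduce: algorithm bandwidth scaled by 2 * (n - 1) / n."""
    if ranks < 1:
        raise ValueError("ranks must be at least 1")
    return algorithm_bandwidth_gb_s(size_bytes, seconds) * 2 * (ranks - 1) / ranks


# Hypothetical measurement: 8 GB all-reduce across 8 ranks completing in 0.5 s.
print(algorithm_bandwidth_gb_s(8e9, 0.5))         # 16.0
print(allreduce_bus_bandwidth_gb_s(8e9, 0.5, 8))  # 28.0
```

The scaling factor makes the bus bandwidth comparable to the hardware link speed regardless of how many ranks take part.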

Questions 31

After configuring HA, the administrator runs cmsh status and notices the secondary head node reports mysql [FAIL]. What is the most likely cause?

Options:

A.

The BCM license expired after HA configuration.

B.

Network connectivity issues between the primary and secondary head nodes.

C.

The secondary head node lacks NVIDIA GPU drivers.

D.

The cluster nodes are powered on during the HA configuration.

Questions 32

A leaf switch shows "FW Version Mismatch" alerts for transceivers after cluster expansion. Which tool validates transceiver firmware against expected versions?

Options:

A.

flint

B.

iblinkinfo

C.

mlxconfig

D.

ethtool

Questions 33

Which of the following steps are essential components of a recommended DGX cluster installation procedure?

Pick the 2 correct responses below.

Options:

A.

Group nodes by function during initial setup and assign them to relevant categories in the cluster management tool.

B.

Configure networking by validating all interfaces on each node, ensuring proper InfiniBand and Ethernet connectivity prior to installing cluster software.

C.

Install Slurm on the head node and then configure the compute nodes’ default OS images.

D.

Complete application containerization, run distributed jobs, and skip validation of node health or storage availability.

Questions 34

After NCCL burn-in reports "transport retry count exceeded," which corrective action addresses the underlying fabric issue?

Options:

A.

Switch from Ring to Tree algorithms via NCCL_ALGO=TREE

B.

Reduce message size to decrease network utilization

C.

Increase NCCL_IB_TIMEOUT to tolerate longer latencies

D.

Inspect InfiniBand link quality metrics (BER, symbol errors) and replace faulty cables

Questions 35

A system engineer needs to set the vGPU scheduling behavior for all GPUs to share the scheduling equally with the default time slice length. What command should be used?

Options:

A.

esxcli system module parameters set -m nvidia -p "NVreg_RegistryDwords=RmPVMRL=0x01"

B.

esxcli graphics module parameters set -m nvidia -p "NVreg_RegistryDwords=RmPVMRL=0x01"

C.

esxcli system module parameters set -m nvidia -p "NVreg_RegistryDwords=FRL=0x01"

D.

esxcli system module parameters set -m nvidia -p "NVreg_RegistryDwords=RmPVMRL=0x00"

Questions 36

After a firmware upgrade on a DGX H100, the administrator notices that one GPU is not detected by the system. Which troubleshooting step should be performed first to identify the root cause?

Options:

A.

Review firmware update logs and run nvsm show health to check for hardware or firmware errors on the affected GPU.

B.

Remove the GPU from the system and replace it with a new one before any diagnostics.

C.

Ignore the issue and proceed with production workloads if the other GPUs are operational.

D.

Immediately re-run the firmware upgrade on all system components.

Exam Code: NCP-AII
Exam Name: NVIDIA AI Infrastructure
Last Update: May 21, 2026
Questions: 123
