
Databricks-Machine-Learning-Associate Databricks Certified Machine Learning Associate Exam Questions and Answers

Questions 4

A data scientist is using MLflow to track their machine learning experiment. As a part of each of their MLflow runs, they are performing hyperparameter tuning. The data scientist would like to have one parent run for the tuning process with a child run for each unique combination of hyperparameter values. All parent and child runs are being manually started with mlflow.start_run.

Which of the following approaches can the data scientist use to accomplish this MLflow run organization?

Options:

A.

They can turn on Databricks Autologging

B.

They can specify nested=True when starting the child run for each unique combination of hyperparameter values

C.

They can start each child run inside the parent run's indented code block using mlflow.start_run()

D.

They can start each child run with the same experiment ID as the parent run

E.

They can specify nested=True when starting the parent run for the tuning process

Questions 5

A data scientist wants to use Spark ML to one-hot encode the categorical features in their PySpark DataFrame features_df. A list of the names of the string columns is assigned to the input_columns variable.

They have developed this code block to accomplish this task:

[Image: code block for Question 5]

The code block is returning an error.

Which of the following adjustments does the data scientist need to make to accomplish this task?

Options:

A.

They need to specify the method parameter to the OneHotEncoder.

B.

They need to remove the line with the fit operation.

C.

They need to use StringIndexer prior to one-hot encoding the features.

D.

They need to use VectorAssembler prior to one-hot encoding the features.

Questions 6

A machine learning engineer has grown tired of needing to install the MLflow Python library on each of their clusters. They ask a senior machine learning engineer how their notebooks can load the MLflow library without installing it each time. The senior machine learning engineer suggests that they use Databricks Runtime for Machine Learning.

Which of the following approaches describes how the machine learning engineer can begin using Databricks Runtime for Machine Learning?

Options:

A.

They can add a line enabling Databricks Runtime ML in their init script when creating their clusters.

B.

They can check the Databricks Runtime ML box when creating their clusters.

C.

They can select a Databricks Runtime ML version from the Databricks Runtime Version dropdown when creating their clusters.

D.

They can set the runtime-version variable in their Spark session to “ml”.

Questions 7

A data scientist has created two linear regression models. The first model uses price as a label variable and the second model uses log(price) as a label variable. When evaluating the RMSE of each model by comparing the label predictions to the actual price values, the data scientist notices that the RMSE for the second model is much larger than the RMSE of the first model.

Which of the following possible explanations for this difference is invalid?

Options:

A.

The second model is much more accurate than the first model

B.

The data scientist failed to exponentiate the predictions in the second model prior to computing the RMSE

C.

The data scientist failed to take the log of the predictions in the first model prior to computing the RMSE

D.

The first model is much more accurate than the second model

E.

The RMSE is an invalid evaluation metric for regression problems
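
The scale mismatch described in this question can be reproduced with a few numbers. The sketch below is a minimal pure-Python illustration; the prices and predictions are invented, and the helper name rmse is an assumption rather than anything from the exam's code. It computes RMSE for log-scale predictions against raw prices, first without and then with exponentiation.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))


# Hypothetical prices and predictions, invented for illustration only.
actual_price = [100.0, 200.0, 400.0]

# Model 1 predicts price directly.
pred_price = [110.0, 190.0, 420.0]

# Model 2 predicts log(price); here its log-scale predictions are exact.
pred_log_price = [math.log(p) for p in actual_price]

rmse_model_1 = rmse(actual_price, pred_price)

# Comparing log-scale predictions to raw prices mixes two different scales.
rmse_model_2_mismatched = rmse(actual_price, pred_log_price)

# Exponentiating first puts predictions and labels on the same price scale.
rmse_model_2_matched = rmse(actual_price, [math.exp(p) for p in pred_log_price])

print(rmse_model_1, rmse_model_2_mismatched, rmse_model_2_matched)
```

In this toy setup the second model's predictions are perfect on the log scale, yet its RMSE looks far worse than the first model's until the predictions are exponentiated.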

Questions 8

A data scientist has been given an incomplete notebook from the data engineering team. The notebook uses a Spark DataFrame spark_df on which the data scientist needs to perform further feature engineering. Unfortunately, the data scientist has not yet learned the PySpark DataFrame API.

Which of the following blocks of code can the data scientist run to be able to use the pandas API on Spark?

Options:

A.

import pyspark.pandas as ps

df = ps.DataFrame(spark_df)

B.

import pyspark.pandas as ps

df = ps.to_pandas(spark_df)

C.

spark_df.to_pandas()

D.

import pandas as pd

df = pd.DataFrame(spark_df)

Questions 9

Which of the following describes the relationship between native Spark DataFrames and pandas API on Spark DataFrames?

Options:

A.

pandas API on Spark DataFrames are single-node versions of Spark DataFrames with additional metadata

B.

pandas API on Spark DataFrames are more performant than Spark DataFrames

C.

pandas API on Spark DataFrames are made up of Spark DataFrames and additional metadata

D.

pandas API on Spark DataFrames are less mutable versions of Spark DataFrames

Questions 10

Which of the following hyperparameter optimization methods automatically makes informed selections of hyperparameter values based on previous trials for each iterative model evaluation?

Options:

A.

Random Search

B.

Halving Random Search

C.

Tree of Parzen Estimators

D.

Grid Search

Questions 11

Which of the following approaches can be used to view the notebook that was run to create an MLflow run?

Options:

A.

Open the MLmodel artifact in the MLflow run page

B.

Click the "Models" link in the row corresponding to the run in the MLflow experiment page

C.

Click the "Source" link in the row corresponding to the run in the MLflow experiment page

D.

Click the "Start Time" link in the row corresponding to the run in the MLflow experiment page

Questions 12

The implementation of linear regression in Spark ML first attempts to solve the linear regression problem using matrix decomposition, but this method does not scale well to large datasets with a large number of variables.

Which of the following approaches does Spark ML use to distribute the training of a linear regression model for large data?

Options:

A.

Logistic regression

B.

Spark ML cannot distribute linear regression training

C.

Iterative optimization

D.

Least-squares method

E.

Singular value decomposition
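
The question contrasts a one-shot matrix decomposition with approaches that scale to large data. As a rough illustration of what an iterative solver does, the sketch below fits a one-feature linear regression by plain gradient descent in pure Python. It is not Spark ML's solver; the function name fit_linear_gd, the data, the learning rate, and the iteration count are all assumptions chosen for the example.

```python
def fit_linear_gd(xs, ys, learning_rate=0.05, n_iters=2000):
    """Fit y = w * x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(n_iters):
        # Each pass needs only sums over the rows, which is the kind of
        # computation that can be split across partitions and combined.
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2.0 * sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Hypothetical data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = fit_linear_gd(xs, ys)
print(w, b)
```

Each iteration depends only on aggregate sums over the data, so no step requires holding a full feature matrix in one place.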

Questions 13

A team is developing guidelines on when to use various evaluation metrics for classification problems. The team needs to provide input on when to use the F1 score over accuracy.

[Image: Question 13]

Which of the following suggestions should the team include in their guidelines?

Options:

A.

The F1 score should be utilized over accuracy when the number of actual positive cases is identical to the number of actual negative cases.

B.

The F1 score should be utilized over accuracy when there are greater than two classes in the target variable.

C.

The F1 score should be utilized over accuracy when there is significant imbalance between positive and negative classes and avoiding false negatives is a priority.

D.

The F1 score should be utilized over accuracy when identifying true positives and true negatives are equally important to the business problem.
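
To see how accuracy and the F1 score can diverge, the sketch below scores a classifier that always predicts the negative class on a made-up dataset with 5 positives and 95 negatives. The helper names accuracy and f1_score are written here in plain Python for illustration and are not taken from any particular library.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class (1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical imbalanced labels: 5 positives, 95 negatives.
y_true = [1] * 5 + [0] * 95

# A degenerate classifier that never predicts the positive class.
y_pred_all_negative = [0] * 100

acc = accuracy(y_true, y_pred_all_negative)
f1 = f1_score(y_true, y_pred_all_negative)
print(acc, f1)
```

The classifier misses every positive case, which the F1 score reflects, while accuracy stays high because the negatives dominate.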

Questions 14

A data scientist wants to parallelize the training of trees in a gradient boosted tree to speed up the training process. A colleague suggests that parallelizing a boosted tree algorithm can be difficult.

Which of the following describes why?

Options:

A.

Gradient boosting is not a linear algebra-based algorithm which is required for parallelization

B.

Gradient boosting requires access to all data at once which cannot happen during parallelization.

C.

Gradient boosting calculates gradients in evaluation metrics using all cores which prevents parallelization.

D.

Gradient boosting is an iterative algorithm that requires information from the previous iteration to perform the next step.
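
The dependency between boosting rounds can be made concrete with a toy implementation. The sketch below is a minimal pure-Python gradient boosting loop for squared error using one-split stumps; the function names fit_stump and boost, the data, and the learning rate are assumptions for illustration, not a production algorithm.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression stump to residuals; return a predictor."""
    best = None
    # Try each distinct x (except the largest) as a split threshold.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left)
        sse += sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds=5, learning_rate=0.5):
    """Run boosting rounds and return the training MSE after each round."""
    predictions = [sum(ys) / len(ys)] * len(ys)
    losses = []
    for _ in range(n_rounds):
        # The residuals depend on every tree fitted so far, so this round
        # cannot start until the previous round has finished.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        predictions = [p + learning_rate * stump(x)
                       for x, p in zip(xs, predictions)]
        losses.append(sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys))
    return losses


# Hypothetical one-feature training data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.5, 3.0, 3.5, 6.0, 6.5]

losses = boost(xs, ys)
print(losses)
```

The work inside a single round (scanning candidate splits) can be parallelized, but the rounds themselves run one after another because each tree is fit to the residuals left by its predecessors.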

Questions 15

Which of the following statements describes a Spark ML estimator?

Options:

A.

An estimator is a hyperparameter grid that can be used to train a model

B.

An estimator chains multiple algorithms together to specify an ML workflow

C.

An estimator is a trained ML model which turns a DataFrame with features into a DataFrame with predictions

D.

An estimator is an algorithm which can be fit on a DataFrame to produce a Transformer

E.

An estimator is an evaluation tool to assess the quality of a model
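
The fit/transform vocabulary used in these options can be sketched without Spark. The toy classes below, MeanImputer and MeanImputerModel, are hypothetical names written in plain Python over lists; they only mimic the shape of the contract and are not Spark ML classes.

```python
class MeanImputerModel:
    """Fitted object: holds a learned mean and applies it to new values."""

    def __init__(self, mean):
        self.mean = mean

    def transform(self, values):
        # Replace missing entries (None) with the mean learned during fit.
        return [self.mean if v is None else v for v in values]


class MeanImputer:
    """Unfitted algorithm: learns state from data via fit()."""

    def fit(self, values):
        observed = [v for v in values if v is not None]
        return MeanImputerModel(sum(observed) / len(observed))


# Hypothetical data: fit on one list, then transform another.
model = MeanImputer().fit([1.0, None, 3.0])
filled = model.transform([None, 5.0])
print(filled)
```

Here fit() learns a value from the data and returns a separate object, and that returned object is what exposes transform().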

Questions 16

A data scientist has written a feature engineering notebook that utilizes the pandas library. As the size of the data processed by the notebook increases, the notebook's runtime is increasing drastically.

Which of the following tools can the data scientist use to spend the least amount of time refactoring their notebook to scale with big data?

Options:

A.

PySpark DataFrame API

B.

pandas API on Spark

C.

Spark SQL

D.

Feature Store

Questions 17

A machine learning engineer is trying to perform batch model inference. They want to get predictions using the linear regression model saved at the path model_uri for the DataFrame batch_df.

batch_df has the following schema:

customer_id STRING

The machine learning engineer runs the following code block to perform inference on batch_df using the linear regression model at model_uri:

[Image: code block for Question 17]

In which situation will the machine learning engineer’s code block perform the desired inference?

Options:

A.

When the Feature Store feature set was logged with the model at model_uri

B.

When all of the features used by the model at model_uri are in a Spark DataFrame in the PySpark

C.

When the model at model_uri only uses customer_id as a feature

D.

This code block will not perform the desired inference in any situation.

E.

When all of the features used by the model at model_uri are in a single Feature Store table

Questions 18

A data scientist is developing a single-node machine learning model. They have a large number of model configurations to test as a part of their experiment. As a result, the model tuning process takes too long to complete. Which of the following approaches can be used to speed up the model tuning process?

Options:

A.

Implement MLflow Experiment Tracking

B.

Scale up with Spark ML

C.

Enable autoscaling clusters

D.

Parallelize with Hyperopt

Questions 19

A data scientist wants to explore the Spark DataFrame spark_df. The data scientist wants visual histograms displaying the distribution of numeric features to be included in the exploration.

Which of the following lines of code can the data scientist run to accomplish the task?

Options:

A.

spark_df.describe()

B.

dbutils.data(spark_df).summarize()

C.

This task cannot be accomplished in a single line of code.

D.

spark_df.summary()

E.

dbutils.data.summarize(spark_df)

Questions 20

A data scientist has developed a random forest regressor rfr and included it as the final stage in a Spark ML Pipeline pipeline. They then set up a cross-validation process with pipeline as the estimator in the following code block:

[Image: code block for Question 20]

Which of the following is a negative consequence of including pipeline as the estimator in the cross-validation process rather than rfr as the estimator?

Options:

A.

The process will have a longer runtime because all stages of pipeline need to be refit or retransformed with each model

B.

The process will leak data from the training set to the test set during the evaluation phase

C.

The process will be unable to parallelize tuning due to the distributed nature of pipeline

D.

The process will leak data prep information from the validation sets to the training sets for each model

Questions 21

A data scientist is using the following code block to tune hyperparameters for a machine learning model:

[Image: code block for Question 21]

Which change can they make to the above code block to improve the likelihood of a more accurate model?

Options:

A.

Increase num_evals to 100

B.

Change fmin() to fmax()

C.

Change SparkTrials() to Trials()

D.

Change tpe.suggest to random.suggest

Questions 22

A new data scientist has started working on an existing machine learning project. The project is a scheduled Job that retrains every day. The project currently exists in a Repo in Databricks. The data scientist has been tasked with improving the feature engineering of the pipeline’s preprocessing stage. The data scientist wants to make necessary updates to the code that can be easily adopted into the project without changing what is being run each day.

Which approach should the data scientist take to complete this task?

Options:

A.

They can create a new branch in Databricks, commit their changes, and push those changes to the Git provider.

B.

They can clone the notebooks in the repository into a Databricks Workspace folder and make the necessary changes.

C.

They can create a new Git repository, import it into Databricks, and copy and paste the existing code from the original repository before making changes.

D.

They can clone the notebooks in the repository into a new Databricks Repo and make the necessary changes.

Exam Name: Databricks Certified Machine Learning Associate Exam
Last Update: Jun 21, 2025
Questions: 74
