Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 Databricks Certified Associate Developer for Apache Spark 3.5-Python Questions and Answers

Questions 4

A data engineer is working on the DataFrame:

(Referring to the table image: it has columns Id, Name, count, and timestamp.)

Which code fragment should the engineer use to extract the unique values in the Name column into an alphabetically ordered list?

Options:

df.select("Name").orderBy(df["Name"].asc())

df.select("Name").distinct().orderBy(df["Name"])

df.select("Name").distinct()

df.select("Name").distinct().orderBy(df["Name"].desc())

Questions 5

Which command overwrites an existing JSON file when writing a DataFrame?

Options:

df.write.mode("overwrite").json("path/to/file")

df.write.overwrite.json("path/to/file")

df.write.json("path/to/file", overwrite=True)

df.write.format("json").save("path/to/file", mode="overwrite")

Questions 6

A data engineer replaces the exact percentile() function with approx_percentile() to improve performance, but the results are drifting too far from expected values.

Which change should be made to solve the issue?

Options:

Decrease the first value of the percentage parameter to increase the accuracy of the percentile ranges

Decrease the value of the accuracy parameter in order to decrease the memory usage but also improve the accuracy

Increase the last value of the percentage parameter to increase the accuracy of the percentile ranges

Increase the value of the accuracy parameter in order to increase the memory usage but also improve the accuracy

Questions 7

Which configuration can be enabled to optimize the conversion between Pandas and PySpark DataFrames using Apache Arrow?

Options:

spark.conf.set("spark.pandas.arrow.enabled", "true")

spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

spark.conf.set("spark.sql.execution.arrow.enabled", "true")

spark.conf.set("spark.sql.arrow.pandas.enabled", "true")

Questions 8

A Data Analyst needs to retrieve employees with 5 or more years of tenure.

Which code snippet filters and shows the list?

Options:

employees_df.filter(employees_df.tenure >= 5).show()

employees_df.where(employees_df.tenure >= 5)

filter(employees_df.tenure >= 5)

employees_df.filter(employees_df.tenure >= 5).collect()

Questions 9

A data engineer is reviewing a Spark application that applies several transformations to a DataFrame but notices that the job does not start executing immediately.

Which two characteristics of Apache Spark's execution model explain this behavior?

Choose 2 answers:

Options:

The Spark engine requires manual intervention to start executing transformations.

Only actions trigger the execution of the transformation pipeline.

Transformations are executed immediately to build the lineage graph.

The Spark engine optimizes the execution plan during the transformations, causing delays.

Transformations are evaluated lazily.

Questions 10

A data engineer observes that an upstream streaming source sends duplicate records, where duplicates share the same key and have at most a 30-minute difference in event_timestamp. The engineer adds:

dropDuplicatesWithinWatermark("event_timestamp", "30 minutes")

What is the result?

Options:

It is not able to handle deduplication in this scenario

It removes duplicates that arrive within the 30-minute window specified by the watermark

It removes all duplicates regardless of when they arrive

It accepts watermarks in seconds and the code results in an error

Questions 11

A Spark application is experiencing performance issues in client mode because the driver is resource-constrained.

How should this issue be resolved?

Options:

Add more executor instances to the cluster

Increase the driver memory on the client machine

Switch the deployment mode to cluster mode

Switch the deployment mode to local mode

Questions 12

A data engineer is working on a Streaming DataFrame streaming_df with the given streaming data:

(Referring to the table image.)

Which operation is supported with streaming_df?

Options:

streaming_df.select(countDistinct("Name"))

streaming_df.groupby("Id").count()

streaming_df.orderBy("timestamp").limit(4)

streaming_df.filter(col("count") < 30).show()

Answer: B

Comprehensive and Detailed Explanation:

In Structured Streaming, only a limited subset of operations is supported because the data is unbounded. Operations such as sorting (orderBy) and distinct aggregation (countDistinct) require a full view of the dataset, which a stream cannot provide; sorting, for example, is supported only after an aggregation in complete output mode.

Review of Each Option:

A. select(countDistinct("Name"))

Not allowed. A distinct aggregation like countDistinct() requires the full dataset and is not supported on streaming DataFrames.

Reference: Databricks Structured Streaming Guide, Unsupported Operations.

B. groupby("Id").count()

Supported. Streaming aggregations over a key (like groupBy("Id")) are supported. Spark maintains intermediate state for each key.

Reference: Databricks Docs, Aggregations in Structured Streaming (https://docs.databricks.com/structured-streaming/aggregation.html)

C. orderBy("timestamp").limit(4)

Not allowed. Sorting and limiting require a full view of the stream (which is infinite), so this is unsupported in streaming DataFrames.

Reference: Spark Structured Streaming, Unsupported Operations (ordering without watermark/window not allowed).

D. filter(col("count") < 30).show()

Not allowed. show() is a blocking operation used for debugging batch DataFrames; it is not allowed on streaming DataFrames.

Reference: Structured Streaming Programming Guide, output operations like show() are not supported.

Reference Extract from Official Guide:

"Operations like orderBy, limit, show, and countDistinct are not supported in Structured Streaming because they require the full dataset to compute a result. Use groupBy(...).agg(...) instead for incremental aggregations." (Databricks Structured Streaming Programming Guide)

Questions 13

The following code fragment results in an error:

(Referring to the code image.)

Which code fragment should be used instead?

(Referring to the options image.)

Options:

Questions 14

You have:

DataFrame A: 128 GB of transactions

DataFrame B: 1 GB user lookup table

Which strategy is correct for broadcasting?

Options:

DataFrame B should be broadcasted because it is smaller and will eliminate the need for shuffling itself

DataFrame B should be broadcasted because it is smaller and will eliminate the need for shuffling DataFrame A

DataFrame A should be broadcasted because it is larger and will eliminate the need for shuffling DataFrame B

DataFrame A should be broadcasted because it is smaller and will eliminate the need for shuffling itself

Questions 15

A Spark developer is building an app to monitor task performance. They need to track the maximum task processing time per worker node and consolidate it on the driver for analysis.

Which technique should be used?

Options:

Use an RDD action like reduce() to compute the maximum time

Use an accumulator to record the maximum time on the driver

Broadcast a variable to share the maximum time among workers

Configure the Spark UI to automatically collect maximum times

Questions 16

A data engineer wants to create a Streaming DataFrame that reads from a Kafka topic called feed.

Which code fragment should be inserted in line 5 to meet the requirement?

Code context:

spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "host1:port1,host2:port2") \
    .[LINE5] \
    .load()

Options:

.option("subscribe", "feed")

.option("subscribe.topic", "feed")

.option("kafka.topic", "feed")

.option("topic", "feed")

Questions 17

A data scientist is working on a project that requires processing large amounts of structured data, performing SQL queries, and applying machine learning algorithms. The data scientist is considering using Apache Spark for this task.

Which combination of Apache Spark modules should the data scientist use in this scenario?

Options:

Spark DataFrames, Structured Streaming, and GraphX

Spark SQL, Pandas API on Spark, and Structured Streaming

Spark Streaming, GraphX, and Pandas API on Spark

Spark DataFrames, Spark SQL, and MLlib

Questions 18

A data engineer wants to write a Spark job that creates a new managed table. If the table already exists, the job should fail and not modify anything.

Which save mode and method should be used?

Options:

saveAsTable with mode ErrorIfExists

saveAsTable with mode Overwrite

save with mode Ignore

save with mode ErrorIfExists

Questions 19

Given a CSV file with the content:

(Referring to the CSV content image.)

And the following code:

from pyspark.sql.types import *

schema = StructType([
    StructField("name", StringType()),
    StructField("age", IntegerType())
])

spark.read.schema(schema).csv(path).collect()

What is the resulting output?

Options:

[Row(name='bambi'), Row(name='alladin', age=20)]

[Row(name='alladin', age=20)]

[Row(name='bambi', age=None), Row(name='alladin', age=20)]

The code throws an error due to a schema mismatch.

Questions 20

What is the behavior of the function date_sub(start, days) if a negative value is passed into the days parameter?

Options:

The same start date will be returned

An error message of an invalid parameter will be returned

The number of days specified will be added to the start date

The number of days specified will be removed from the start date

Questions 21

A Spark application developer wants to identify which operations cause shuffling, leading to a new stage in the Spark execution plan.

Which operation results in a shuffle and a new stage?

Options:

DataFrame.groupBy().agg()

DataFrame.filter()

DataFrame.withColumn()

DataFrame.select()

Questions 22

A data engineer is building a Structured Streaming pipeline and wants the pipeline to recover from failures or intentional shutdowns by continuing where the pipeline left off.

How can this be achieved?

Options:

By configuring the option checkpointLocation during readStream

By configuring the option recoveryLocation during the SparkSession initialization

By configuring the option recoveryLocation during writeStream

By configuring the option checkpointLocation during writeStream

Questions 23

An engineer notices a significant increase in the job execution time during the execution of a Spark job. After some investigation, the engineer decides to check the logs produced by the Executors.

How should the engineer retrieve the Executor logs to diagnose performance issues in the Spark application?

Options:

Locate the executor logs on the Spark master node, typically under the /tmp directory.

Use the command spark-submit with the --verbose flag to print the logs to the console.

Use the Spark UI to select the stage and view the executor logs directly from the stages tab.

Fetch the logs by running a Spark job with the spark-sql CLI tool.

Questions 24

What is a feature of Spark Connect?

Options:

It supports DataStreamReader, DataStreamWriter, StreamingQuery, and Streaming APIs

Supports DataFrame, Functions, Column, SparkContext PySpark APIs

It supports only PySpark applications

It has built-in authentication

Questions 25

What is the benefit of Adaptive Query Execution (AQE)?

Options:

It allows Spark to optimize the query plan before execution but does not adapt during runtime.

It enables the adjustment of the query plan during runtime, handling skewed data, optimizing join strategies, and improving overall query performance.

It optimizes query execution by parallelizing tasks and does not adjust strategies based on runtime metrics like data skew.

It automatically distributes tasks across nodes in the clusters and does not perform runtime adjustments to the query plan.

Exam Code: Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5

Exam Name: Databricks Certified Associate Developer for Apache Spark 3.5-Python

Last Update: Jul 1, 2025

Questions: 85
