
Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 Databricks Certified Associate Developer for Apache Spark 3.5 – Python Questions and Answers

Questions 4

Given a DataFrame df that has 10 partitions, after running the code:

result = df.coalesce(20)

How many partitions will the result DataFrame have?

Options:

A.

10

B.

Same number as the cluster executors

C.

1

D.

20

Questions 5

A Spark developer wants to improve the performance of an existing PySpark UDF that runs a hash function that is not available in the standard Spark functions library. The existing UDF code is:


import hashlib

import pyspark.sql.functions as sf

from pyspark.sql.types import StringType

def shake_256(raw):
    return hashlib.shake_256(raw.encode()).hexdigest(20)

shake_256_udf = sf.udf(shake_256, StringType())

The developer wants to replace this existing UDF with a Pandas UDF to improve performance. The developer changes the definition of shake_256_udf to this:

shake_256_udf = sf.pandas_udf(shake_256, StringType())

However, the developer receives the error:

What should the signature of the shake_256() function be changed to in order to fix this error?

Options:

A.

def shake_256(df: pd.Series) -> str:

B.

def shake_256(df: Iterator[pd.Series]) -> Iterator[pd.Series]:

C.

def shake_256(raw: str) -> str:

D.

def shake_256(df: pd.Series) -> pd.Series:
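
As a rough, Spark-free illustration of what the UDF computes, the sketch below applies the same hashlib call to a single string and then to a batch of strings. The helper name shake_256_batch and the sample inputs are hypothetical, and a plain Python list stands in for the pandas Series that a Pandas UDF would receive.

```python
import hashlib


def shake_256(raw: str) -> str:
    # Same call as in the question: 20 bytes of SHAKE-256 output, hex-encoded.
    return hashlib.shake_256(raw.encode()).hexdigest(20)


def shake_256_batch(values: list[str]) -> list[str]:
    # Hypothetical batch helper: one digest per input, order preserved.
    return [shake_256(v) for v in values]


digests = shake_256_batch(["alpha", "beta", "alpha"])
for d in digests:
    print(d)
```

Each digest is 40 hexadecimal characters long (20 bytes), and equal inputs produce equal digests.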

Questions 6

43 of 55.

An organization has been running a Spark application in production and is considering disabling the Spark History Server to reduce resource usage.

What will be the impact of disabling the Spark History Server in production?

Options:

A.

Prevention of driver log accumulation during long-running jobs

B.

Improved job execution speed due to reduced logging overhead

C.

Loss of access to past job logs and reduced debugging capability for completed jobs

D.

Enhanced executor performance due to reduced log size

Questions 7

A data engineer wants to create an external table from a JSON file located at /data/input.json with the following requirements:

Create an external table named users

Automatically infer schema

Merge records with differing schemas

Which code snippet should the engineer use?

Options:

A.

CREATE TABLE users USING json OPTIONS (path '/data/input.json')

B.

CREATE EXTERNAL TABLE users USING json OPTIONS (path '/data/input.json')

C.

CREATE EXTERNAL TABLE users USING json OPTIONS (path '/data/input.json', mergeSchema 'true')

D.

CREATE EXTERNAL TABLE users USING json OPTIONS (path '/data/input.json', schemaMerge 'true')

Questions 8

An engineer notices a significant increase in the job execution time during the execution of a Spark job. After some investigation, the engineer decides to check the logs produced by the Executors.

How should the engineer retrieve the Executor logs to diagnose performance issues in the Spark application?

Options:

A.

Locate the executor logs on the Spark master node, typically under the /tmp directory.

B.

Use the command spark-submit with the --verbose flag to print the logs to the console.

C.

Use the Spark UI to select the stage and view the executor logs directly from the stages tab.

D.

Fetch the logs by running a Spark job with the spark-sql CLI tool.

Questions 9

A data engineer is building an Apache Spark™ Structured Streaming application to process a stream of JSON events in real time. The engineer wants the application to be fault-tolerant and resume processing from the last successfully processed record in case of a failure. To achieve this, the data engineer decides to implement checkpoints.

Which code snippet should the data engineer use?

Options:

A.

query = streaming_df.writeStream \
    .format("console") \
    .option("checkpoint", "/path/to/checkpoint") \
    .outputMode("append") \
    .start()

B.

query = streaming_df.writeStream \
    .format("console") \
    .outputMode("append") \
    .option("checkpointLocation", "/path/to/checkpoint") \
    .start()

C.

query = streaming_df.writeStream \
    .format("console") \
    .outputMode("complete") \
    .start()

D.

query = streaming_df.writeStream \
    .format("console") \
    .outputMode("append") \
    .start()

Questions 10

25 of 55.

A Data Analyst is working on employees_df and needs to add a new column where a 10% tax is calculated on the salary.

Additionally, the DataFrame contains the column age, which is not needed.

Which code fragment adds the tax column and removes the age column?

Options:

A.

employees_df = employees_df.withColumn("tax", col("salary") * 0.1).drop("age")

B.

employees_df = employees_df.withColumn("tax", lit(0.1)).drop("age")

C.

employees_df = employees_df.dropField("age").withColumn("tax", col("salary") * 0.1)

D.

employees_df = employees_df.withColumn("tax", col("salary") + 0.1).drop("age")
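
The arithmetic behind a 10% tax column can be sketched without Spark. The sample rows and the helper name add_tax_drop_age below are hypothetical, and plain dictionaries stand in for DataFrame rows; this illustrates the intended result, not the PySpark API.

```python
# Hypothetical sample rows standing in for employees_df.
employees = [
    {"name": "Ana", "age": 34, "salary": 5000.0},
    {"name": "Raj", "age": 41, "salary": 7200.0},
]


def add_tax_drop_age(row: dict) -> dict:
    # Copy every field except "age", then add tax as 10% of salary.
    out = {key: value for key, value in row.items() if key != "age"}
    out["tax"] = row["salary"] * 0.1
    return out


result = [add_tax_drop_age(row) for row in employees]
for row in result:
    print(row)
```

Note that the tax is a product of the salary (salary * 0.1), not a constant and not a sum.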

Questions 11

30 of 55.

A data engineer is working on a num_df DataFrame and has a Python UDF defined as:

def cube_func(val):
    return val * val * val

Which code fragment registers and uses this UDF as a Spark SQL function to work with the DataFrame num_df?

Options:

A.

spark.udf.register("cube_func", cube_func)

num_df.selectExpr("cube_func(num)").show()

B.

num_df.select(cube_func("num")).show()

C.

spark.createDataFrame(cube_func("num")).show()

D.

num_df.register("cube_func").select("num").show()

Questions 12

What is the risk associated with converting a large Pandas API on Spark DataFrame back to a Pandas DataFrame?

Options:

A.

The conversion will automatically distribute the data across worker nodes

B.

The operation will fail if the Pandas DataFrame exceeds 1000 rows

C.

Data will be lost during conversion

D.

The operation will load all data into the driver's memory, potentially causing memory overflow

Questions 13

A data engineer is building a Structured Streaming pipeline and wants the pipeline to recover from failures or intentional shutdowns by continuing where the pipeline left off.

How can this be achieved?

Options:

A.

By configuring the option checkpointLocation during readStream

B.

By configuring the option recoveryLocation during the SparkSession initialization

C.

By configuring the option recoveryLocation during writeStream

D.

By configuring the option checkpointLocation during writeStream

Questions 14

9 of 55.

Given the code fragment:

import pyspark.pandas as ps

pdf = ps.DataFrame(data)

Which method is used to convert a Pandas API on Spark DataFrame (pyspark.pandas.DataFrame) into a standard PySpark DataFrame (pyspark.sql.DataFrame)?

Options:

A.

pdf.to_pandas()

B.

pdf.to_spark()

C.

pdf.to_dataframe()

D.

pdf.spark()

Questions 15

A developer is running Spark SQL queries and notices underutilization of resources. Executors are idle, and the number of tasks per stage is low.

What should the developer do to improve cluster utilization?

Options:

A.

Increase the value of spark.sql.shuffle.partitions

B.

Reduce the value of spark.sql.shuffle.partitions

C.

Increase the size of the dataset to create more partitions

D.

Enable dynamic resource allocation to scale resources as needed

Questions 16

40 of 55.

A developer wants to refactor older Spark code to take advantage of built-in functions introduced in Spark 3.5.

The original code:

from pyspark.sql import functions as F

min_price = 110.50

result_df = prices_df.filter(F.col("price") > min_price).agg(F.count("*"))

Which code block should the developer use to refactor the code?

Options:

A.

result_df = prices_df.filter(F.col("price") > F.lit(min_price)).agg(F.count("*"))

B.

result_df = prices_df.where(F.lit("price") > min_price).groupBy().count()

C.

result_df = prices_df.withColumn("valid_price", when(col("price") > F.lit(min_price), True))

D.

result_df = prices_df.filter(F.lit(min_price) > F.col("price")).count()

Questions 17

Given the code:


df = spark.read.csv("large_dataset.csv")

filtered_df = df.filter(col("error_column").contains("error"))

mapped_df = filtered_df.select(split(col("timestamp"), " ").getItem(0).alias("date"), lit(1).alias("count"))

reduced_df = mapped_df.groupBy("date").sum("count")

reduced_df.count()

reduced_df.show()

At which point will Spark actually begin processing the data?

Options:

A.

When the filter transformation is applied

B.

When the count action is applied

C.

When the groupBy transformation is applied

D.

When the show action is applied

Questions 18

A DataFrame df has columns name, age, and salary. The developer needs to sort the DataFrame by age in ascending order and salary in descending order.

Which code snippet meets the requirement of the developer?

Options:

A.

df.orderBy(col("age").asc(), col("salary").asc()).show()

B.

df.sort("age", "salary", ascending=[True, True]).show()

C.

df.sort("age", "salary", ascending=[False, True]).show()

D.

df.orderBy("age", "salary", ascending=[True, False]).show()

Questions 19

41 of 55.

A data engineer is working on the DataFrame df1 and wants the Name with the highest count to appear first (descending order by count), followed by the next highest, and so on.

The DataFrame has columns:

id | Name | count | timestamp

---------------------------------

1 | USA | 10

2 | India | 20

3 | England | 50

4 | India | 50

5 | France | 20

6 | India | 10

7 | USA | 30

8 | USA | 40

Which code fragment should the engineer use to sort the data in the Name and count columns?

Options:

A.

df1.orderBy(col("count").desc(), col("Name").asc())

B.

df1.sort("Name", "count")

C.

df1.orderBy("Name", "count")

D.

df1.orderBy(col("Name").desc(), col("count").asc())
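
To make the requested ordering concrete, here is a plain-Python sketch that sorts the rows from the table above by count in descending order. Breaking ties by Name in ascending order is an assumption about the intended result, and the tuple layout is only illustrative; this is not the Spark API.

```python
# Rows copied from the table above as (id, Name, count); the timestamp
# column is omitted because the table shows no values for it.
rows = [
    (1, "USA", 10),
    (2, "India", 20),
    (3, "England", 50),
    (4, "India", 50),
    (5, "France", 20),
    (6, "India", 10),
    (7, "USA", 30),
    (8, "USA", 40),
]

# Negating count gives descending order on count; Name then breaks ties
# in ascending order (the tie-break is an assumption).
ordered = sorted(rows, key=lambda r: (-r[2], r[1]))

for row in ordered:
    print(row)
```

With this key, the two rows with count 50 come first (England before India), and the two rows with count 10 come last.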

Questions 20

A data engineer wants to create a Streaming DataFrame that reads from a Kafka topic called feed.


Which code fragment should be inserted in line 5 to meet the requirement?

Code context:

spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "host1:port1,host2:port2") \
    .[LINE 5] \
    .load()

Options:

A.

.option("subscribe", "feed")

B.

.option("subscribe.topic", "feed")

C.

.option("kafka.topic", "feed")

D.

.option("topic", "feed")

Questions 21

A Data Analyst needs to retrieve employees with 5 or more years of tenure.

Which code snippet filters and shows the list?

Options:

A.

employees_df.filter(employees_df.tenure >= 5).show()

B.

employees_df.where(employees_df.tenure >= 5)

C.

filter(employees_df.tenure >= 5)

D.

employees_df.filter(employees_df.tenure >= 5).collect()

Questions 22

A data engineer writes the following code to join two DataFrames df1 and df2:

df1 = spark.read.csv("sales_data.csv") # ~10 GB

df2 = spark.read.csv("product_data.csv") # ~8 MB

result = df1.join(df2, df1.product_id == df2.product_id)


Which join strategy will Spark use?

Options:

A.

Shuffle join, because AQE is not enabled, and Spark uses a static query plan

B.

Broadcast join, as df2 is smaller than the default broadcast threshold

C.

Shuffle join, as the size difference between df1 and df2 is too large for a broadcast join to work efficiently

D.

Shuffle join because no broadcast hints were provided
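
The size comparison at the heart of this question can be sketched as simple arithmetic. The default value of spark.sql.autoBroadcastJoinThreshold is 10 MB; the helper name fits_broadcast_threshold is hypothetical, and treating "~8 MB" and "~10 GB" as exact byte counts is an assumption, since Spark works from its own size estimates rather than the file size on disk.

```python
# Default of spark.sql.autoBroadcastJoinThreshold (10 MB), in bytes.
DEFAULT_BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024


def fits_broadcast_threshold(
    size_bytes: int,
    threshold_bytes: int = DEFAULT_BROADCAST_THRESHOLD_BYTES,
) -> bool:
    # Hypothetical helper: True when the estimated size is within the threshold.
    return size_bytes <= threshold_bytes


product_data_bytes = 8 * 1024 * 1024        # "~8 MB", taken as exact for illustration
sales_data_bytes = 10 * 1024 * 1024 * 1024  # "~10 GB", taken as exact for illustration

print(fits_broadcast_threshold(product_data_bytes))
print(fits_broadcast_threshold(sales_data_bytes))
```

Only one side of the join needs to fit under the threshold for a broadcast to be considered.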

Questions 23

24 of 55.

Which code should be used to display the schema of the Parquet file stored in the location events.parquet?

Options:

A.

spark.sql("SELECT * FROM events.parquet").show()

B.

spark.read.format("parquet").load("events.parquet").show()

C.

spark.read.parquet("events.parquet").printSchema()

D.

spark.sql("SELECT schema FROM events.parquet").show()

Questions 24

A data scientist is working on a project that requires processing large amounts of structured data, performing SQL queries, and applying machine learning algorithms. The data scientist is considering using Apache Spark for this task.

Which combination of Apache Spark modules should the data scientist use in this scenario?

Options:

A.

Spark DataFrames, Structured Streaming, and GraphX

B.

Spark SQL, Pandas API on Spark, and Structured Streaming

C.

Spark Streaming, GraphX, and Pandas API on Spark

D.

Spark DataFrames, Spark SQL, and MLlib

Questions 25

A data engineer replaces the exact percentile() function with approx_percentile() to improve performance, but the results are drifting too far from expected values.

Which change should be made to solve the issue?


Options:

A.

Decrease the first value of the percentage parameter to increase the accuracy of the percentile ranges

B.

Decrease the value of the accuracy parameter in order to decrease the memory usage but also improve the accuracy

C.

Increase the last value of the percentage parameter to increase the accuracy of the percentile ranges

D.

Increase the value of the accuracy parameter in order to increase the memory usage but also improve the accuracy
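
For reference, the Spark SQL documentation describes 1.0/accuracy as the relative error of approx_percentile, with a default accuracy of 10000. The sketch below only works through that arithmetic; the helper name relative_error and the sample accuracy values are hypothetical.

```python
def relative_error(accuracy: int) -> float:
    # Relative error of the approximation, per the documented 1.0/accuracy rule.
    if accuracy <= 0:
        raise ValueError("accuracy must be a positive integer")
    return 1.0 / accuracy


# Sample accuracy values chosen for illustration; 10_000 is the documented default.
for accuracy in (100, 10_000, 1_000_000):
    print(accuracy, relative_error(accuracy))
```

A larger accuracy value gives a smaller relative error, at the cost of more memory for the approximation.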

Questions 26

8 of 55.

A data scientist at a large e-commerce company needs to process and analyze 2 TB of daily customer transaction data. The company wants to implement real-time fraud detection and personalized product recommendations.

Currently, the company uses a traditional relational database system, which struggles with the increasing data volume and velocity.

Which feature of Apache Spark effectively addresses this challenge?

Options:

A.

Ability to process small datasets efficiently

B.

In-memory computation and parallel processing capabilities

C.

Support for SQL queries on structured data

D.

Built-in machine learning libraries

Questions 27

What is the relationship between jobs, stages, and tasks during execution in Apache Spark?

Options:

A.

A job contains multiple stages, and each stage contains multiple tasks.

B.

A job contains multiple tasks, and each task contains multiple stages.

C.

A stage contains multiple jobs, and each job contains multiple tasks.

D.

A stage contains multiple tasks, and each task contains multiple jobs.

Questions 28

35 of 55.

A data engineer is building a Structured Streaming pipeline and wants it to recover from failures or intentional shutdowns by continuing where it left off.

How can this be achieved?

Options:

A.

By configuring the option recoveryLocation during SparkSession initialization.

B.

By configuring the option checkpointLocation during readStream.

C.

By configuring the option checkpointLocation during writeStream.

D.

By configuring the option recoveryLocation during writeStream.

Questions 29

A developer is working with a pandas DataFrame containing user behavior data from a web application.

Which approach should be used for executing a groupBy operation in parallel across all workers in Apache Spark 3.5?

Options:

A.

Use the applyInPandas API:

df.groupby("user_id").applyInPandas(mean_func, schema="user_id long, value double").show()

B.

Use the mapInPandas API:

df.mapInPandas(mean_func, schema="user_id long, value double").show()

C.

Use a regular Spark UDF:

from pyspark.sql.functions import mean

df.groupBy("user_id").agg(mean("value")).show()

D.

Use a Pandas UDF:

@pandas_udf("double")
def mean_func(value: pd.Series) -> float:
    return value.mean()

df.groupby("user_id").agg(mean_func(df["value"])).show()

Questions 30

A data scientist wants each record in the DataFrame to contain:

The entire contents of a file

The full file path

The first attempt at the code does read the text files, but each record contains a single line: the files are read line by line rather than as the full text of each file.

Code:

corpus = spark.read.text("/datasets/raw_txt/*") \
    .select('*', '_metadata.file_path')

Which change will ensure one record per file?

Options:

A.

Add the option wholetext=True to the text() function

B.

Add the option lineSep='\n' to the text() function

C.

Add the option wholetext=False to the text() function

D.

Add the option lineSep="," to the text() function

Questions 31

17 of 55.

A data engineer has noticed that upgrading the Spark version in their applications from Spark 3.0 to Spark 3.5 has improved the runtime of some scheduled Spark applications.

Looking further, the data engineer realizes that Adaptive Query Execution (AQE) is now enabled.

Which operation should AQE be implementing to automatically improve the Spark application performance?

Options:

A.

Dynamically switching join strategies

B.

Collecting persistent table statistics and storing them in the metastore for future use

C.

Improving the performance of single-stage Spark jobs

D.

Optimizing the layout of Delta files on disk

Questions 32

42 of 55.

A developer needs to write the output of a complex chain of Spark transformations to a Parquet table called events.liveLatest.

Consumers of this table query it frequently with filters on both year and month of the event_ts column (a timestamp).

The current code:

from pyspark.sql import functions as F

final = df.withColumn("event_year", F.year("event_ts")) \
    .withColumn("event_month", F.month("event_ts")) \
    .bucketBy(42, ["event_year", "event_month"]) \
    .saveAsTable("events.liveLatest")

However, consumers report poor query performance.

Which change will enable efficient querying by year and month?

Options:

A.

Replace .bucketBy() with .partitionBy("event_year", "event_month")

B.

Change the bucket count (42) to a lower number

C.

Add .sortBy() after .bucketBy()

D.

Replace .bucketBy() with .partitionBy("event_year") only

Questions 33

29 of 55.

A Spark application is experiencing performance issues in client mode due to the driver being resource-constrained.

How should this issue be resolved?

Options:

A.

Switch the deployment mode to cluster mode.

B.

Add more executor instances to the cluster.

C.

Increase the driver memory on the client machine.

D.

Switch the deployment mode to local mode.

Questions 34

Given the code fragment:


import pyspark.pandas as ps

psdf = ps.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

Which method is used to convert a Pandas API on Spark DataFrame (pyspark.pandas.DataFrame) into a standard PySpark DataFrame (pyspark.sql.DataFrame)?

Options:

A.

psdf.to_spark()

B.

psdf.to_pyspark()

C.

psdf.to_pandas()

D.

psdf.to_dataframe()

Questions 35

An MLOps engineer is building a Pandas UDF that applies a language model that translates English strings into Spanish. The initial code is loading the model on every call to the UDF, which is hurting the performance of the data pipeline.

The initial code is:


def in_spanish_inner(df: pd.Series) -> pd.Series:
    model = get_translation_model(target_lang='es')
    return df.apply(model)

in_spanish = sf.pandas_udf(in_spanish_inner, StringType())

How can the MLOps engineer change this code to reduce how many times the language model is loaded?

Options:

A.

Convert the Pandas UDF to a PySpark UDF

B.

Convert the Pandas UDF from a Series → Series UDF to a Series → Scalar UDF

C.

Run the in_spanish_inner() function in a mapInPandas() function call

D.

Convert the Pandas UDF from a Series → Series UDF to an Iterator[Series] → Iterator[Series] UDF

Questions 36

13 of 55.

A developer needs to produce a Python dictionary using data stored in a small Parquet table, which looks like this:

region_id | region_name

-----------------------

10 | North

12 | East

14 | West

The resulting Python dictionary must contain a mapping of region_id to region_name, containing the smallest 3 region_id values.

Which code fragment meets the requirements?

Options:

A.

regions_dict = dict(regions.take(3))

B.

regions_dict = regions.select("region_id", "region_name").take(3)

C.

regions_dict = dict(regions.select("region_id", "region_name").rdd.collect())

D.

regions_dict = dict(regions.orderBy("region_id").limit(3).rdd.map(lambda x: (x.region_id, x.region_name)).collect())
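
The shape of the required result can be sketched in plain Python. The rows below reuse the three values from the table above plus one hypothetical extra row, (16, "South"), added only to show the effect of keeping the three smallest region_id values; tuples stand in for Spark rows, so this is not the PySpark API.

```python
# (region_id, region_name) pairs in arbitrary order.
# (16, "South") is a hypothetical extra row, not part of the original table.
rows = [(14, "West"), (10, "North"), (16, "South"), (12, "East")]

# Order by region_id, keep the three smallest, then build the mapping.
smallest_three = sorted(rows, key=lambda r: r[0])[:3]
regions_dict = dict(smallest_three)

print(regions_dict)
```

Sorting before truncating is what guarantees the smallest ids are kept; taking the first three rows of unordered data would not.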

Questions 37

49 of 55.

In the code block below, aggDF contains aggregations on a streaming DataFrame:

aggDF.writeStream \
    .format("console") \
    .outputMode("???") \
    .start()

Which output mode at line 3 ensures that the entire result table is written to the console during each trigger execution?

Options:

A.

AGGREGATE

B.

COMPLETE

C.

REPLACE

D.

APPEND

Questions 38

A Spark engineer must select an appropriate deployment mode for the Spark jobs.

What is the benefit of using cluster mode in Apache Spark™?

Options:

A.

In cluster mode, resources are allocated from a resource manager on the cluster, enabling better performance and scalability for large jobs

B.

In cluster mode, the driver is responsible for executing all tasks locally without distributing them across the worker nodes.

C.

In cluster mode, the driver runs on the client machine, which can limit the application's ability to handle large datasets efficiently.

D.

In cluster mode, the driver program runs on one of the worker nodes, allowing the application to fully utilize the distributed resources of the cluster.

Questions 39

In the code block below, aggDF contains aggregations on a streaming DataFrame:


Which output mode at line 3 ensures that the entire result table is written to the console during each trigger execution?

Options:

A.

complete

B.

append

C.

replace

D.

aggregate

Questions 40

A data engineer needs to write a DataFrame df to a Parquet file, partitioned by the column country, and overwrite any existing data at the destination path.

Which code should the data engineer use to accomplish this task in Apache Spark?

Options:

A.

df.write.mode("overwrite").partitionBy("country").parquet("/data/output")

B.

df.write.mode("append").partitionBy("country").parquet("/data/output")

C.

df.write.mode("overwrite").parquet("/data/output")

D.

df.write.partitionBy("country").parquet("/data/output")

Exam Name: Databricks Certified Associate Developer for Apache Spark 3.5 – Python
Last Update: May 21, 2026
Questions: 136
