
Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 Databricks Certified Associate Developer for Apache Spark 3.5 – Python Questions and Answers

Question 4

Which Spark configuration controls the number of tasks that can run in parallel on an executor?

Options:

A.

spark.executor.cores

B.

spark.task.maxFailures

C.

spark.executor.memory

D.

spark.sql.shuffle.partitions
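The relationship behind the correct option can be sketched with plain arithmetic: the number of tasks an executor can run concurrently is its core count (spark.executor.cores) divided by the cores each task claims (spark.task.cpus, default 1). The values below are hypothetical, not Spark API calls.

```python
# Illustrative arithmetic only, not a Spark API call: task slots per executor
# equal the executor's cores divided by the cores each task claims.
executor_cores = 4      # hypothetical value of spark.executor.cores
task_cpus = 1           # default value of spark.task.cpus
parallel_tasks_per_executor = executor_cores // task_cpus
```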

Question 5

A data engineer is streaming data from Kafka and requires:

Minimal latency

Exactly-once processing guarantees

Which trigger mode should be used?

Options:

A.

.trigger(processingTime='1 second')

B.

.trigger(continuous=True)

C.

.trigger(continuous='1 second')

D.

.trigger(availableNow=True)

Question 6

A data engineer is asked to build an ingestion pipeline for a set of Parquet files delivered by an upstream team on a nightly basis. The data is stored in a directory structure with a base path of "/path/events/data". The upstream team drops daily data into the underlying subdirectories following the convention year/month/day.

A few examples of the directory structure are:

[Image: examples of the directory structure]

Which of the following code snippets will read all the data within the directory structure?

Options:

A.

df = spark.read.option("inferSchema", "true").parquet("/path/events/data/")

B.

df = spark.read.option("recursiveFileLookup", "true").parquet("/path/events/data/")

C.

df = spark.read.parquet("/path/events/data/*")

D.

df = spark.read.parquet("/path/events/data/")

Question 7

An engineer wants to join two DataFrames df1 and df2 on the respective employee_id and emp_id columns:

df1: employee_id INT, name STRING

df2: emp_id INT, department STRING

The engineer uses:

result = df1.join(df2, df1.employee_id == df2.emp_id, how='inner')

What is the behaviour of the code snippet?

Options:

A.

The code fails to execute because the column names employee_id and emp_id do not match automatically

B.

The code fails to execute because it must use on='employee_id' to specify the join column explicitly

C.

The code fails to execute because PySpark does not support joining DataFrames with a different structure

D.

The code works as expected because the join condition explicitly matches employee_id from df1 with emp_id from df2
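The join-condition semantics can be mimicked in plain Python: an explicit equality predicate between two differently named keys pairs up matching rows, exactly as df1.employee_id == df2.emp_id does. The sample rows below are hypothetical.

```python
# Plain-Python sketch of an inner join on differently named key columns.
df1_rows = [(1, "Alice"), (2, "Bob")]            # (employee_id, name)
df2_rows = [(1, "Engineering"), (3, "Finance")]  # (emp_id, department)

result = [
    (emp_id, name, dept)
    for emp_id, name in df1_rows
    for other_id, dept in df2_rows
    if emp_id == other_id   # mirrors df1.employee_id == df2.emp_id
]
```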

Question 8

A Data Analyst needs to retrieve employees with 5 or more years of tenure.

Which code snippet filters and shows the list?

Options:

A.

employees_df.filter(employees_df.tenure >= 5).show()

B.

employees_df.where(employees_df.tenure >= 5)

C.

filter(employees_df.tenure >= 5)

D.

employees_df.filter(employees_df.tenure >= 5).collect()

Question 9

The following code fragment results in an error:

[Image: code fragment that results in an error]

Which code fragment should be used instead?

Options:

A)

[Image: code fragment A]

B)

[Image: code fragment B]

C)

[Image: code fragment C]

D)

[Image: code fragment D]

Question 10

A data analyst wants to add a column named date derived from a timestamp column.

Which code snippet meets this requirement?

Options:

A.

dates_df.withColumn("date", f.unix_timestamp("timestamp")).show()

B.

dates_df.withColumn("date", f.to_date("timestamp")).show()

C.

dates_df.withColumn("date", f.date_format("timestamp", "yyyy-MM-dd")).show()

D.

dates_df.withColumn("date", f.from_unixtime("timestamp")).show()

Question 11

A data scientist is working on a project that requires processing large amounts of structured data, performing SQL queries, and applying machine learning algorithms. The data scientist is considering using Apache Spark for this task.

Which combination of Apache Spark modules should the data scientist use in this scenario?

Options:

A.

Spark DataFrames, Structured Streaming, and GraphX

B.

Spark SQL, Pandas API on Spark, and Structured Streaming

C.

Spark Streaming, GraphX, and Pandas API on Spark

D.

Spark DataFrames, Spark SQL, and MLlib

Question 12

A data engineer is working on a real-time analytics pipeline using Spark Structured Streaming.

They want the system to process incoming data in micro-batches at a fixed interval of 5 seconds.

Which code snippet fulfills this requirement?

Options:

A.

query = df.writeStream \
    .outputMode("append") \
    .trigger(processingTime="5 seconds") \
    .start()

B.

query = df.writeStream \
    .outputMode("append") \
    .trigger(continuous="5 seconds") \
    .start()

C.

query = df.writeStream \
    .outputMode("append") \
    .trigger(once=True) \
    .start()

D.

query = df.writeStream \
    .outputMode("append") \
    .start()

Question 13

A data engineer is working on a num_df DataFrame and has a Python UDF defined as:

def cube_func(val):
    return val * val * val

Which code fragment registers and uses this UDF as a Spark SQL function to work with the DataFrame num_df?

Options:

A.

spark.udf.register("cube_func", cube_func)
num_df.selectExpr("cube_func(num)").show()

B.

num_df.select(cube_func("num")).show()

C.

spark.createDataFrame(cube_func("num")).show()

D.

num_df.register("cube_func").select("num").show()
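The registration pattern in option A can be sketched without Spark: registering a UDF amounts to storing the Python function in a registry under a SQL name and looking it up when an expression is evaluated. The registry and sample data below are hypothetical stand-ins.

```python
def cube_func(val):
    return val * val * val

# Plain-Python stand-in for spark.udf.register: keep functions in a dict
# keyed by their SQL name, then look one up to "evaluate" an expression.
sql_functions = {}
sql_functions["cube_func"] = cube_func

nums = [1, 2, 3]  # hypothetical contents of num_df's "num" column
cubed = [sql_functions["cube_func"](n) for n in nums]
```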

Question 14

A developer notices that all the post-shuffle partitions in a dataset are smaller than the value set for spark.sql.adaptive.maxShuffledHashJoinLocalMapThreshold.

Which type of join will Adaptive Query Execution (AQE) choose in this case?

Options:

A.

A Cartesian join

B.

A shuffled hash join

C.

A broadcast nested loop join

D.

A sort-merge join

Question 15

Given this code:

[Image: start of the streaming query]

.withWatermark("event_time", "10 minutes")
.groupBy(window("event_time", "15 minutes"))
.count()

What happens to data that arrives after the watermark threshold?

Options:

A.

Records that arrive later than the watermark threshold (10 minutes) will automatically be included in the aggregation if they fall within the 15-minute window.

B.

Any data arriving more than 10 minutes after the watermark threshold will be ignored and not included in the aggregation.

C.

Data arriving more than 10 minutes after the latest watermark will still be included in the aggregation but will be placed into the next window.

D.

The watermark ensures that late data arriving within 10 minutes of the latest event_time will be processed and included in the windowed aggregation.
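The watermark rule can be illustrated in plain Python: events older than (maximum event time seen so far minus the delay) are dropped from the aggregation. The timestamps below are hypothetical; this is a sketch of the semantics, not Spark's implementation.

```python
from datetime import datetime, timedelta

# Sketch of the watermark rule: an event is dropped once it falls behind
# (max event time seen - watermark delay). Timestamps are hypothetical.
delay = timedelta(minutes=10)
events = [
    datetime(2024, 1, 1, 12, 0),
    datetime(2024, 1, 1, 12, 30),   # advances the watermark to 12:20
    datetime(2024, 1, 1, 12, 5),    # more than 10 minutes late -> dropped
]

max_event_time = datetime.min
accepted = []
for t in events:
    max_event_time = max(max_event_time, t)
    watermark = max_event_time - delay
    if t >= watermark:
        accepted.append(t)
```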

Question 16

Which Spark configuration controls the number of tasks that can run in parallel on the executor?

Options:

A.

spark.executor.cores

B.

spark.task.maxFailures

C.

spark.driver.cores

D.

spark.executor.memory

Question 17

What is the benefit of Adaptive Query Execution (AQE)?

Options:

A.

It allows Spark to optimize the query plan before execution but does not adapt during runtime.

B.

It automatically distributes tasks across nodes in the clusters and does not perform runtime adjustments to the query plan.

C.

It optimizes query execution by parallelizing tasks and does not adjust strategies based on runtime metrics like data skew.

D.

It enables the adjustment of the query plan during runtime, handling skewed data, optimizing join strategies, and improving overall query performance.

Question 18

A developer needs to produce a Python dictionary using data stored in a small Parquet table, which looks like this:

[Image: contents of the Parquet table]

The resulting Python dictionary must contain a mapping of region -> region_id for the smallest 3 region_id values.

Which code fragment meets the requirements?


Options:

A.

regions = dict(
    regions_df
    .select('region', 'region_id')
    .sort('region_id')
    .take(3)
)

B.

regions = dict(
    regions_df
    .select('region_id', 'region')
    .sort('region_id')
    .take(3)
)

C.

regions = dict(
    regions_df
    .select('region_id', 'region')
    .limit(3)
    .collect()
)

D.

regions = dict(
    regions_df
    .select('region', 'region_id')
    .sort('region_id')
    .take(3)
)
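The crux of this question is that dict() over two-element rows maps the first element to the second, so the select order decides which column becomes the key. A plain-Python sketch with hypothetical rows standing in for the Parquet table:

```python
# Hypothetical (region, region_id) rows standing in for the Parquet table.
rows = [("AFRICA", 0), ("AMERICA", 1), ("ASIA", 2), ("EUROPE", 3)]

# Analogue of .select('region', 'region_id').sort('region_id').take(3):
smallest_three = sorted(rows, key=lambda r: r[1])[:3]

# dict() maps each row's first element (region) to its second (region_id).
regions = dict(smallest_three)
```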

Question 19

A data engineer has written the following code to join two DataFrames df1 and df2:

df1 = spark.read.csv("sales_data.csv")
df2 = spark.read.csv("product_data.csv")

df_joined = df1.join(df2, df1.product_id == df2.product_id)

The DataFrame df1 contains ~10 GB of sales data, and df2 contains ~8 MB of product data.

Which join strategy will Spark use?

Options:

A.

Shuffle join, as the size difference between df1 and df2 is too large for a broadcast join to work efficiently.

B.

Shuffle join, because AQE is not enabled, and Spark uses a static query plan.

C.

Shuffle join because no broadcast hints were provided.

D.

Broadcast join, as df2 is smaller than the default broadcast threshold.
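The decision comes down to a size comparison: Spark's default spark.sql.autoBroadcastJoinThreshold is 10 MB, and a side smaller than that is broadcast automatically. A minimal arithmetic sketch using the sizes from the question:

```python
# Default spark.sql.autoBroadcastJoinThreshold is 10 MB; a table below it
# is broadcast automatically. Sizes taken from the question.
broadcast_threshold = 10 * 1024 * 1024   # 10 MB default
df2_size = 8 * 1024 * 1024               # ~8 MB of product data
strategy = "broadcast join" if df2_size <= broadcast_threshold else "shuffle join"
```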

Question 20

An engineer has two DataFrames — df1 (small) and df2 (large). To optimize the join, the engineer uses a broadcast join:

from pyspark.sql.functions import broadcast

df_result = df2.join(broadcast(df1), on="id", how="inner")

What is the purpose of using broadcast() in this scenario?

Options:

A.

It increases the partition size for df1 and df2.

B.

It ensures that the join happens only when the id values are identical.

C.

It reduces the number of shuffle operations by replicating the smaller DataFrame to all nodes.

D.

It filters the id values before performing the join.
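The mechanics behind the correct option can be sketched in plain Python: a broadcast hash join turns the small side into a hash map that every node can probe locally, so the large side never needs to be shuffled. Row values below are hypothetical.

```python
# Plain-Python sketch of a broadcast hash join: build a hash map from the
# small side, probe it row by row on the large side. Data is hypothetical.
small = [(1, "a"), (2, "b")]               # df1: (id, value)
large = [(1, "x"), (2, "y"), (3, "z")]     # df2: (id, payload)

broadcast_map = dict(small)                 # replicated to every "executor"
joined = [
    (row_id, payload, broadcast_map[row_id])
    for row_id, payload in large
    if row_id in broadcast_map              # inner join on "id"
]
```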

Question 21

A developer is running Spark SQL queries and notices underutilization of resources. Executors are idle, and the number of tasks per stage is low.

What should the developer do to improve cluster utilization?

Options:

A.

Increase the value of spark.sql.shuffle.partitions

B.

Reduce the value of spark.sql.shuffle.partitions

C.

Increase the size of the dataset to create more partitions

D.

Enable dynamic resource allocation to scale resources as needed

Question 22

A data engineer writes the following code to join two DataFrames df1 and df2:

df1 = spark.read.csv("sales_data.csv")     # ~10 GB
df2 = spark.read.csv("product_data.csv")   # ~8 MB

result = df1.join(df2, df1.product_id == df2.product_id)

Which join strategy will Spark use?

Options:

A.

Shuffle join, because AQE is not enabled, and Spark uses a static query plan

B.

Broadcast join, as df2 is smaller than the default broadcast threshold

C.

Shuffle join, as the size difference between df1 and df2 is too large for a broadcast join to work efficiently

D.

Shuffle join because no broadcast hints were provided

Question 23

What is the benefit of Adaptive Query Execution (AQE)?

Options:

A.

It allows Spark to optimize the query plan before execution but does not adapt during runtime.

B.

It enables the adjustment of the query plan during runtime, handling skewed data, optimizing join strategies, and improving overall query performance.

C.

It optimizes query execution by parallelizing tasks and does not adjust strategies based on runtime metrics like data skew.

D.

It automatically distributes tasks across nodes in the clusters and does not perform runtime adjustments to the query plan.

Question 24

Which UDF implementation calculates the length of strings in a Spark DataFrame?

Options:

A.

df.withColumn("length", spark.udf("len", StringType()))

B.

df.select(length(col("stringColumn")).alias("length"))

C.

spark.udf.register("stringLength", lambda s: len(s))

D.

df.withColumn("length", udf(lambda s: len(s), StringType()))
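The built-in length() function avoids Python UDF serialization overhead; per row it behaves like Python's len(). A plain-Python sketch with hypothetical sample strings:

```python
# Per-row behaviour of the built-in length() function is just len().
# Sample strings are hypothetical.
values = ["spark", "databricks", ""]
lengths = [len(s) for s in values]
```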

Question 25

An engineer has a large ORC file located at /file/test_data.orc and wants to read only specific columns to reduce memory usage.

Which code fragment will select only the columns col1 and col2 during the reading process?

Options:

A.

spark.read.orc("/file/test_data.orc").filter("col1 = 'value' ").select("col2")

B.

spark.read.format("orc").select("col1", "col2").load("/file/test_data.orc")

C.

spark.read.orc("/file/test_data.orc").selected("col1", "col2")

D.

spark.read.format("orc").load("/file/test_data.orc").select("col1", "col2")

Question 26

A data scientist is working with a Spark DataFrame called customerDF that contains customer information. The DataFrame has a column named email with customer email addresses. The data scientist needs to split this column into username and domain parts.

Which code snippet splits the email column into username and domain columns?

Options:

A.

customerDF.select(
    col("email").substr(0, 5).alias("username"),
    col("email").substr(-5).alias("domain")
)

B.

customerDF.withColumn("username", split(col("email"), "@").getItem(0)) \
    .withColumn("domain", split(col("email"), "@").getItem(1))

C.

customerDF.withColumn("username", substring_index(col("email"), "@", 1)) \
    .withColumn("domain", substring_index(col("email"), "@", -1))

D.

customerDF.select(
    regexp_replace(col("email"), "@", "").alias("username"),
    regexp_replace(col("email"), "@", "").alias("domain")
)
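Per row, both split(col, "@") with getItem() and substring_index(col, "@", ±1) behave like Python's str.split on "@". A plain-Python sketch with hypothetical addresses:

```python
# Per-row semantics of split(col("email"), "@").getItem(0)/getItem(1),
# sketched with Python's str.split. Addresses are hypothetical.
emails = ["alice@example.com", "bob@test.org"]

usernames = [e.split("@")[0] for e in emails]   # getItem(0)
domains = [e.split("@")[1] for e in emails]     # getItem(1)
```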

Question 27

A data scientist is working on a large dataset in Apache Spark using PySpark. The data scientist has a DataFrame df with columns user_id, product_id, and purchase_amount and needs to perform some operations on this data efficiently.

Which sequence of operations results in transformations that require a shuffle followed by transformations that do not?

Options:

A.

df.filter(df.purchase_amount > 100).groupBy("user_id").sum("purchase_amount")

B.

df.withColumn("discount", df.purchase_amount * 0.1).select("discount")

C.

df.withColumn("purchase_date", current_date()).where("total_purchase > 50")

D.

df.groupBy("user_id").agg(sum("purchase_amount").alias("total_purchase")).repartition(10)

Question 28

A data scientist at a financial services company is working with a Spark DataFrame containing transaction records. The DataFrame has millions of rows and includes columns for transaction_id, account_number, transaction_amount, and timestamp. Due to an issue with the source system, some transactions were accidentally recorded multiple times with identical information across all fields. The data scientist needs to remove rows with duplicates across all fields to ensure accurate financial reporting.

Which approach should the data scientist use to deduplicate the orders using PySpark?

Options:

A.

df = df.dropDuplicates()

B.

df = df.groupBy("transaction_id").agg(F.first("account_number"), F.first("transaction_amount"), F.first("timestamp"))

C.

df = df.filter(F.col("transaction_id").isNotNull())

D.

df = df.dropDuplicates(["transaction_amount"])
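dropDuplicates() with no column list keeps exactly one copy of each fully identical row. Its semantics can be sketched in plain Python with hypothetical transaction tuples:

```python
# Plain-Python semantics of dropDuplicates() across all fields: keep the
# first copy of each fully identical row. Rows are hypothetical
# (transaction_id, account_number, transaction_amount, timestamp) tuples.
rows = [
    (1, "A-100", 50.0, "2024-01-01T10:00"),
    (1, "A-100", 50.0, "2024-01-01T10:00"),   # exact duplicate
    (2, "A-200", 75.0, "2024-01-01T11:00"),
]

seen = set()
deduped = []
for row in rows:
    if row not in seen:
        seen.add(row)
        deduped.append(row)
```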

Question 29

A developer wants to test Spark Connect with an existing Spark application.

What are the two alternative ways the developer can start a local Spark Connect server without changing their existing application code? (Choose 2 answers)

Options:

A.

Execute their pyspark shell with the option --remote "https://localhost"

B.

Execute their pyspark shell with the option --remote "sc://localhost"

C.

Set the environment variable SPARK_REMOTE="sc://localhost" before starting the pyspark shell

D.

Add .remote("sc://localhost") to their SparkSession.builder calls in their Spark code

E.

Ensure the Spark property spark.connect.grpc.binding.port is set to 15002 in the application code

Question 30

What is the relationship between jobs, stages, and tasks during execution in Apache Spark?

Options:

A.

A job contains multiple tasks, and each task contains multiple stages.

B.

A stage contains multiple jobs, and each job contains multiple tasks.

C.

A stage contains multiple tasks, and each task contains multiple jobs.

D.

A job contains multiple stages, and each stage contains multiple tasks.

Question 31

Given a CSV file with the content:

[Image: contents of the CSV file]

And the following code:

from pyspark.sql.types import *

schema = StructType([
    StructField("name", StringType()),
    StructField("age", IntegerType())
])

spark.read.schema(schema).csv(path).collect()

What is the resulting output?

Options:

A.

[Row(name='bambi'), Row(name='alladin', age=20)]

B.

[Row(name='alladin', age=20)]

C.

[Row(name='bambi', age=None), Row(name='alladin', age=20)]

D.

The code throws an error due to a schema mismatch.

Question 32

A Data Analyst is working on the DataFrame sensor_df, which contains two columns:

Which code fragment returns a DataFrame that splits the record column into separate columns and has one array item per row?


Options:

A.

exploded_df = sensor_df.withColumn("record_exploded", explode("record"))
exploded_df = exploded_df.select("record_datetime", "sensor_id", "status", "health")

B.

exploded_df = exploded_df.select(
    "record_datetime",
    "record_exploded.sensor_id",
    "record_exploded.status",
    "record_exploded.health"
)
exploded_df = sensor_df.withColumn("record_exploded", explode("record"))

C.

exploded_df = exploded_df.select(
    "record_datetime",
    "record_exploded.sensor_id",
    "record_exploded.status",
    "record_exploded.health"
)
exploded_df = sensor_df.withColumn("record_exploded", explode("record"))

D.

exploded_df = exploded_df.select("record_datetime", "record_exploded")

Question 33

A developer wants to refactor some older Spark code to leverage built-in functions introduced in Spark 3.5.0. The existing code performs array manipulations manually. Which of the following code snippets utilizes new built-in functions in Spark 3.5.0 for array operations?

[Image: existing code performing manual array manipulation]

A)

Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 Question 33

B)

Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 Question 33

C)

Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 Question 33

D)

Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 Question 33

Options:

A.

result_df = prices_df \
    .withColumn("valid_price", F.when(F.col("spot_price") > F.lit(min_price), 1).otherwise(0))

B.

result_df = prices_df \
    .agg(F.count_if(F.col("spot_price") >= F.lit(min_price)))

C.

result_df = prices_df \
    .agg(F.min("spot_price"), F.max("spot_price"))

D.

result_df = prices_df \
    .agg(F.count("spot_price").alias("spot_price")) \
    .filter(F.col("spot_price") > F.lit("min_price"))
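F.count_if, added in Spark 3.5, counts the rows where a boolean condition holds; in plain Python that is a sum over a conditional generator. The prices and threshold below are hypothetical.

```python
# Plain-Python analogue of F.count_if(condition): count rows where the
# predicate is true. Prices and threshold are hypothetical.
min_price = 10
spot_prices = [5, 12, 10, 30]
count_if_result = sum(1 for p in spot_prices if p >= min_price)
```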

Question 34

Given the code:

df = spark.read.csv("large_dataset.csv")
filtered_df = df.filter(col("error_column").contains("error"))
mapped_df = filtered_df.select(split(col("timestamp"), " ").getItem(0).alias("date"), lit(1).alias("count"))
reduced_df = mapped_df.groupBy("date").sum("count")
reduced_df.count()
reduced_df.show()

At which point will Spark actually begin processing the data?

Options:

A.

When the filter transformation is applied

B.

When the count action is applied

C.

When the groupBy transformation is applied

D.

When the show action is applied
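Spark's lazy evaluation can be sketched with Python generators: building the pipeline of transformations runs nothing, and only consuming it (the analogue of an action) touches the data. The log lines below are hypothetical.

```python
# Lazy-evaluation sketch: transformations build a pipeline, the "action"
# triggers the work. Log lines are hypothetical.
processed = []

def read_lines():
    for line in ["2024-01-01 error disk", "2024-01-02 ok", "2024-01-02 error net"]:
        processed.append(line)          # records that a line was actually read
        yield line

# "Transformation": a generator expression runs nothing yet.
filtered = (l for l in read_lines() if "error" in l)
assert processed == []                  # no data has been touched

# "Action": consuming the pipeline finally reads the data.
dates = [l.split(" ")[0] for l in filtered]
```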

Question 35

A data engineer is running a batch processing job on a Spark cluster with the following configuration:

10 worker nodes

16 CPU cores per worker node

64 GB RAM per node

The data engineer wants to allocate four executors per node, each executor using four cores.

What is the total number of CPU cores used by the application?

Options:

A.

160

B.

64

C.

80

D.

40
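The calculation is a straight product of the figures given in the question:

```python
# Total cores = worker nodes x executors per node x cores per executor,
# using the values stated in the question.
worker_nodes = 10
executors_per_node = 4
cores_per_executor = 4
total_cores = worker_nodes * executors_per_node * cores_per_executor
```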

Question 36

What is the difference between df.cache() and df.persist() in Spark DataFrame?

Options:

A.

Both functions perform the same operation. The persist() function provides improved performance as its default storage level is DISK_ONLY.

B.

persist() — Persists the DataFrame with the default storage level (MEMORY_AND_DISK_DESER), and cache() — Can be used to set different storage levels.

C.

Both cache() and persist() can be used to set the default storage level (MEMORY_AND_DISK_DESER).

D.

cache() — Persists the DataFrame with the default storage level (MEMORY_AND_DISK_DESER), and persist() — Can be used to set different storage levels to persist the contents of the DataFrame.

Question 37

Given the code fragment:

import pyspark.pandas as ps

psdf = ps.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

Which method is used to convert a Pandas API on Spark DataFrame (pyspark.pandas.DataFrame) into a standard PySpark DataFrame (pyspark.sql.DataFrame)?

Options:

A.

psdf.to_spark()

B.

psdf.to_pyspark()

C.

psdf.to_pandas()

D.

psdf.to_dataframe()

Question 38

A data engineer needs to add all the rows from one table to all the rows from another, but not all the columns in the first table exist in the second table.

The error message is:

AnalysisException: UNION can only be performed on tables with the same number of columns.

The existing code is:

au_df.union(nz_df)

The DataFrame au_df has one extra column that does not exist in the DataFrame nz_df, but otherwise both DataFrames have the same column names and data types.

What should the data engineer fix in the code to ensure the combined DataFrame can be produced as expected?

Options:

A.

df = au_df.unionByName(nz_df, allowMissingColumns=True)

B.

df = au_df.unionAll(nz_df)

C.

df = au_df.unionByName(nz_df, allowMissingColumns=False)

D.

df = au_df.union(nz_df, allowMissingColumns=True)
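The behaviour of unionByName with allowMissingColumns=True can be sketched in plain Python: columns are matched by name and any column missing from one side is filled with null (None). The dict rows below are hypothetical stand-ins for DataFrame rows.

```python
# Plain-Python sketch of unionByName(..., allowMissingColumns=True):
# match columns by name, fill missing ones with None. Rows are hypothetical.
au_rows = [{"name": "Kea", "country": "AU", "state": "NSW"}]
nz_rows = [{"name": "Tui", "country": "NZ"}]        # no "state" column

all_columns = ["name", "country", "state"]
combined = [
    {c: row.get(c) for c in all_columns}            # missing column -> None
    for row in au_rows + nz_rows
]
```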

Question 39

A DataFrame df has columns name, age, and salary. The developer needs to sort the DataFrame by age in ascending order and salary in descending order.

Which code snippet meets the requirement of the developer?

Options:

A.

df.orderBy(col("age").asc(), col("salary").asc()).show()

B.

df.sort("age", "salary", ascending=[True, True]).show()

C.

df.sort("age", "salary", ascending=[False, True]).show()

D.

df.orderBy("age", "salary", ascending=[True, False]).show()
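The mixed sort direction of ascending=[True, False] can be mimicked in plain Python with a tuple key that negates the descending numeric column. The rows below are hypothetical.

```python
# Analogue of orderBy("age", "salary", ascending=[True, False]): sort on a
# tuple key with salary negated for descending order. Rows are hypothetical.
rows = [("a", 30, 50), ("b", 25, 70), ("c", 25, 90)]   # (name, age, salary)
ordered = sorted(rows, key=lambda r: (r[1], -r[2]))
```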

Question 40

A data engineer observes that the upstream streaming source feeds the event table frequently and sends duplicate records. Upon analyzing the current production table, the data engineer found that the time difference in the event_timestamp column of the duplicate records is, at most, 30 minutes.

To remove the duplicates, the engineer adds the code:

df = df.withWatermark("event_timestamp", "30 minutes")

What is the result?

Options:

A.

It removes all duplicates regardless of when they arrive.

B.

It accepts watermarks in seconds and the code results in an error.

C.

It removes duplicates that arrive within the 30-minute window specified by the watermark.

D.

It is not able to handle deduplication in this scenario.

Exam Name: Databricks Certified Associate Developer for Apache Spark 3.5 – Python
Last Update: Feb 20, 2026
Questions: 136
