Pre-Summer Sale Limited Time 65% Discount Offer - Ends in 0d 00h 00m 00s - Coupon code: pass65

Databricks-Certified-Professional-Data-Engineer Databricks Certified Data Engineer Professional Exam Questions and Answers

Questions 4

A data engineer is creating a data ingestion pipeline to understand where customers are taking their rented bicycles during use. The engineer noticed that, over time, data being transmitted from the bicycle sensors fail to include key details like latitude and longitude. Downstream analysts need both the clean records and the quarantined records available for separate processing.

The data engineer already has this code:

import dlt

from pyspark.sql.functions import expr

rules = {

" valid_lat " : " (lat IS NOT NULL) " ,

" valid_long " : " (long IS NOT NULL) "

}

quarantine_rules = " NOT({}) " .format( " AND " .join(rules.values()))

@dlt.view

def raw_trips_data():

return spark.readStream.table( " ride_and_go.telemetry.trips " )

How should the data engineer meet the requirements to capture good and bad data?

Options:

A.

@dlt.table(name= " trips_data_quarantine " )

def trips_data_quarantine():

return (

spark.readStream.table( " raw_trips_data " )

.filter(expr(quarantine_rules))

)

B.

@dlt.view

@dlt.expect_or_drop( " lat_long_present " , " (lat IS NOT NULL AND long IS NOT NULL) " )

def trips_data_quarantine():

return spark.readStream.table( " ride_and_go.telemetry.trips " )

C.

@dlt.table

@dlt.expect_all_or_drop(rules)

def trips_data_quarantine():

return spark.readStream.table( " raw_trips_data " )

D.

@dlt.table(partition_cols=[ " is_quarantined " , ])

@dlt.expect_all(rules)

def trips_data_quarantine():

return (

spark.readStream.table( " raw_trips_data " )

.withColumn( " is_quarantined " , expr(quarantine_rules))

)

Buy Now
Questions 5

A junior data engineer has manually configured a series of jobs using the Databricks Jobs UI. Upon reviewing their work, the engineer realizes that they are listed as the " Owner " for each job. They attempt to transfer " Owner " privileges to the " DevOps " group, but cannot successfully accomplish this task.

Which statement explains what is preventing this privilege transfer?

Options:

A.

Databricks jobs must have exactly one owner; " Owner " privileges cannot be assigned to a group.

B.

The creator of a Databricks job will always have " Owner " privileges; this configuration cannot be changed.

C.

Other than the default " admins " group, only individual users can be granted privileges on jobs.

D.

A user can only transfer job ownership to a group if they are also a member of that group.

E.

Only workspace administrators can grant " Owner " privileges to a group.

Buy Now
Questions 6

Which statement regarding stream-static joins and static Delta tables is correct?

Options:

A.

Each microbatch of a stream-static join will use the most recent version of the static Delta table as of each microbatch.

B.

Each microbatch of a stream-static join will use the most recent version of the static Delta table as of the job ' s initialization.

C.

The checkpoint directory will be used to track state information for the unique keys present in the join.

D.

Stream-static joins cannot use static Delta tables because of consistency issues.

E.

The checkpoint directory will be used to track updates to the static Delta table.

Buy Now
Questions 7

A nightly job ingests data into a Delta Lake table using the following code:

Databricks-Certified-Professional-Data-Engineer Question 7

The next step in the pipeline requires a function that returns an object that can be used to manipulate new records that have not yet been processed to the next table in the pipeline.

Which code snippet completes this function definition?

def new_records():

Options:

A.

return spark.readStream.table( " bronze " )

B.

return spark.readStream.load( " bronze " )

C.

7

D.

return spark.read.option( " readChangeFeed " , " true " ).table ( " bronze " )

E.

7

Buy Now
Questions 8

A data engineer wants to create a cluster using the Databricks CLI for a big ETL pipeline. The cluster should have five workers , one driver of type i3.xlarge, and should use the ' 14.3.x-scala2.12 ' runtime.

Which command should the data engineer use?

Options:

A.

databricks clusters create 14.3.x-scala2.12 --num-workers 5 --node-type-id i3.xlarge --cluster-name DataEngineer_cluster

B.

databricks clusters add 14.3.x-scala2.12 --num-workers 5 --node-type-id i3.xlarge --cluster-name Data Engineer_cluster

C.

databricks compute add 14.3.x-scala2.12 --num-workers 5 --node-type-id i3.xlarge --cluster-name Data Engineer_cluster

D.

databricks compute create 14.3.x-scala2.12 --num-workers 5 --node-type-id i3.xlarge --cluster-name Data Engineer_cluster

Buy Now
Questions 9

A table named user_ltv is being used to create a view that will be used by data analysts on various teams. Users in the workspace are configured into groups, which are used for setting up data access using ACLs.

The user_ltv table has the following schema:

email STRING, age INT, ltv INT

The following view definition is executed:

Databricks-Certified-Professional-Data-Engineer Question 9

An analyst who is not a member of the marketing group executes the following query:

SELECT * FROM email_ltv

Which statement describes the results returned by this query?

Options:

A.

Three columns will be returned, but one column will be named " redacted " and contain only null values.

B.

Only the email and itv columns will be returned; the email column will contain all null values.

C.

The email and ltv columns will be returned with the values in user itv.

D.

The email, age. and ltv columns will be returned with the values in user ltv.

E.

Only the email and ltv columns will be returned; the email column will contain the string " REDACTED " in each row.

Buy Now
Questions 10

A data engineer needs to implement column masking for a sensitive column in a Unity Catalog-managed table. The masking logic must dynamically check if users belong to specific groups defined in a separate table (group_access) that maps groups to allowed departments.

Which approach should the engineer use to efficiently enforce this requirement?

Options:

A.

Create a UDF that hardcodes allowed groups and apply it as a column mask.

B.

Create a view without selecting the sensitive column.

C.

Apply a column mask that references the group_access mapping table in its UDF.

D.

Use a row filter to restrict access based on the user’s group.

Buy Now
Questions 11

A developer has successfully configured credential for Databricks Repos and cloned a remote Git repository. Hey don not have privileges to make changes to the main branch, which is the only branch currently visible in their workspace.

Use Response to pull changes from the remote Git repository commit and push changes to a branch that appeared as a changes were pulled.

Options:

A.

Use Repos to merge all differences and make a pull request back to the remote repository.

B.

Use repos to merge all difference and make a pull request back to the remote repository.

C.

Use Repos to create a new branch commit all changes and push changes to the remote Git repertory.

D.

Use repos to create a fork of the remote repository commit all changes and make a pull request on the source repository

Buy Now
Questions 12

When scheduling Structured Streaming jobs for production, which configuration automatically recovers from query failures and keeps costs low?

Options:

A.

Cluster: New Job Cluster;

Retries: Unlimited;

Maximum Concurrent Runs: Unlimited

B.

Cluster: New Job Cluster;

Retries: None;

Maximum Concurrent Runs: 1

C.

Cluster: Existing All-Purpose Cluster;

Retries: Unlimited;

Maximum Concurrent Runs: 1

D.

Cluster: New Job Cluster;

Retries: Unlimited;

Maximum Concurrent Runs: 1

E.

Cluster: Existing All-Purpose Cluster;

Retries: None;

Maximum Concurrent Runs: 1

Buy Now
Questions 13

A data engineer wants to ingest a large collection of image files (JPEG and PNG) from cloud object storage into a Unity Catalog–managed table for analysis and visualization.

Which two configurations and practices are recommended to incrementally ingest these images into the table? (Choose 2 answers)

Options:

A.

Move files to a volume and read with SQL editor.

B.

Use Auto Loader and set cloudFiles.format to " BINARYFILE " .

C.

Use Auto Loader and set cloudFiles.format to " TEXT " .

D.

Use Auto Loader and set cloudFiles.format to " IMAGE " .

E.

Use the pathGlobFilter option to select only image files (e.g., " *.jpg,*.png " ).

Buy Now
Questions 14

Each configuration below is identical to the extent that each cluster has 400 GB total of RAM, 160 total cores and only one Executor per VM.

Given a job with at least one wide transformation, which of the following cluster configurations will result in maximum performance?

Options:

A.

• Total VMs; 1

• 400 GB per Executor

• 160 Cores / Executor

B.

• Total VMs: 8

• 50 GB per Executor

• 20 Cores / Executor

C.

• Total VMs: 4

• 100 GB per Executor

• 40 Cores/Executor

D.

• Total VMs:2

• 200 GB per Executor

• 80 Cores / Executor

Buy Now
Questions 15

A data engineer is running a groupBy aggregation on a massive user activity log grouped by user_id. A few users have millions of records, causing task skew and long runtimes.

Which technique will fix the skew in this aggregation?

Options:

A.

Use salting by adding a random prefix to skewed keys before aggregation, then aggregate again after removing the prefix.

B.

Increase the Spark driver memory and retry.

C.

Use reduceByKey instead of groupBy to avoid shuffles.

D.

Filter out the skewed users before the aggregation.

Buy Now
Questions 16

The data engineering team has configured a Databricks SQL query and alert to monitor the values in a Delta Lake table. The recent_sensor_recordings table contains an identifying sensor_id alongside the timestamp and temperature for the most recent 5 minutes of recordings.

The below query is used to create the alert:

Databricks-Certified-Professional-Data-Engineer Question 16

The query is set to refresh each minute and always completes in less than 10 seconds. The alert is set to trigger when mean (temperature) > 120 . Notifications are triggered to be sent at most every 1 minute.

If this alert raises notifications for 3 consecutive minutes and then stops, which statement must be true?

Options:

A.

The total average temperature across all sensors exceeded 120 on three consecutive executions of the query

B.

The recent_sensor_recordings table was unresponsive for three consecutive runs of the query

C.

The source query failed to update properly for three consecutive minutes and then restarted

D.

The maximum temperature recording for at least one sensor exceeded 120 on three consecutive executions of the query

E.

The average temperature recordings for at least one sensor exceeded 120 on three consecutive executions of the query

Buy Now
Questions 17

A data architect has heard about lake ' s built-in versioning and time travel capabilities. For auditing purposes they have a requirement to maintain a full of all valid street addresses as they appear in the customers table.

The architect is interested in implementing a Type 1 table, overwriting existing records with new values and relying on Delta Lake time travel to support long-term auditing. A data engineer on the project feels that a Type 2 table will provide better performance and scalability.

Which piece of information is critical to this decision?

Options:

A.

Delta Lake time travel does not scale well in cost or latency to provide a long-term versioning solution.

B.

Delta Lake time travel cannot be used to query previous versions of these tables because Type 1 changes modify data files in place.

C.

Shallow clones can be combined with Type 1 tables to accelerate historic queries for long-term versioning.

D.

Data corruption can occur if a query fails in a partially completed state because Type 2 tables requires

Setting multiple fields in a single update.

Buy Now
Questions 18

A data engineer is designing a system to process batch patient encounter data stored in an S3 bucket, creating a Delta table (patient_encounters) with columns encounter_id, patient_id, encounter_date, diagnosis_code, and treatment_cost. The table is queried frequently by patient_id and encounter_date, requiring fast performance. Fine-grained access controls must be enforced. The engineer wants to minimize maintenance and boost performance.

How should the data engineer create the patient_encounters table?

Options:

A.

Create an external table in Unity Catalog, specifying an S3 location for the data files. Enable predictive optimization through table properties, and configure Unity Catalog permissions for access controls.

B.

Create a managed table in Unity Catalog . Configure Unity Catalog permissions for access controls, and rely on predictive optimization to enhance query performance and simplify maintenance.

C.

Create a managed table in Unity Catalog. Configure Unity Catalog permissions for access controls, schedule jobs to run OPTIMIZE and VACUUM commands daily to achieve best performance.

D.

Create a managed table in Hive Metastore. Configure Hive Metastore permissions for access controls, and rely on predictive optimization to enhance query performance and simplify maintenance.

Buy Now
Questions 19

A table in the Lakehouse named customer_churn_params is used in churn prediction by the machine learning team. The table contains information about customers derived from a number of upstream sources. Currently, the data engineering team populates this table nightly by overwriting the table with the current valid values derived from upstream data sources.

The churn prediction model used by the ML team is fairly stable in production. The team is only interested in making predictions on records that have changed in the past 24 hours.

Which approach would simplify the identification of these changed records?

Options:

A.

Apply the churn model to all rows in the customer_churn_params table, but implement logic to perform an upsert into the predictions table that ignores rows where predictions have not changed.

B.

Convert the batch job to a Structured Streaming job using the complete output mode; configure a Structured Streaming job to read from the customer_churn_params table and incrementally predict against the churn model.

C.

Calculate the difference between the previous model predictions and the current customer_churn_params on a key identifying unique customers before making new predictions; only make predictions on those customers not in the previous predictions.

D.

Modify the overwrite logic to include a field populated by calling spark.sql.functions.current_timestamp() as data are being written; use this field to identify records written on a particular date.

E.

Replace the current overwrite logic with a merge statement to modify only those records that have changed; write logic to make predictions on the changed records identified by the change data feed.

Buy Now
Questions 20

A junior data engineer is working to implement logic for a Lakehouse table named silver_device_recordings . The source data contains 100 unique fields in a highly nested JSON structure.

The silver_device_recordings table will be used downstream to power several production monitoring dashboards and a production model. At present, 45 of the 100 fields are being used in at least one of these applications.

The data engineer is trying to determine the best approach for dealing with schema declaration given the highly-nested structure of the data and the numerous fields.

Which of the following accurately presents information about Delta Lake and Databricks that may impact their decision-making process?

Options:

A.

The Tungsten encoding used by Databricks is optimized for storing string data; newly-added native support for querying JSON strings means that string types are always most efficient.

B.

Because Delta Lake uses Parquet for data storage, data types can be easily evolved by just modifying file footer information in place.

C.

Human labor in writing code is the largest cost associated with data engineering workloads; as such, automating table declaration logic should be a priority in all migration workloads.

D.

Because Databricks will infer schema using types that allow all observed data to be processed, setting types manually provides greater assurance of data quality enforcement.

E.

Schema inference and evolution on .Databricks ensure that inferred types will always accurately match the data types used by downstream systems.

Buy Now
Questions 21

A data engineer is configuring a Databricks Asset Bundle to deploy a job with granular permissions. The requirements are:

• Grant the data-engineers group CAN_MANAGE access to the job.

• Ensure the auditors’ group can view the job but not modify/run it.

• Avoid granting unintended permissions to other users/groups.

How should the data engineer deploy the job while meeting the requirements?

Options:

A.

resources:

jobs:

my-job:

name: data-pipeline

tasks: [...]

job_clusters: [...]

permissions:

- group_name: data-engineers

level: CAN_MANAGE

- group_name: auditors

level: CAN_VIEW

- group_name: admin-team

level: IS_OWNER

B.

resources:

jobs:

my-job:

name: data-pipeline

tasks: [...]

job: [...]

permissions:

- group_name: data-engineers

level: CAN_MANAGE

permissions:

- group_name: auditors

level: CAN_VIEW

C.

permissions:

- group_name: data-engineers

level: CAN_MANAGE

- group_name: auditors

level: CAN_VIEW

resources:

jobs:

my-job:

name: data-pipeline

tasks: [...]

job_clusters: [...]

D.

resources:

jobs:

my-job:

name: data-pipeline

tasks: [...]

job_clusters: [...]

permissions:

- group_name: data-engineers

level: CAN_MANAGE

- group_name: auditors

level: CAN_VIEW

Buy Now
Questions 22

A data engineer, while designing a Pandas UDF to process financial time-series data with complex calculations that require maintaining state across rows within each stock symbol group, must ensure the function is efficient and scalable. Which approach will solve the problem with minimum overhead while preserving data integrity?

Options:

A.

Use a scalar_iter Pandas UDF with iterator-based processing, implementing state management through persistent storage (Delta tables) that gets updated after each batch to maintain continuity across iterator chunks.

B.

Use a scalar Pandas UDF that processes the entire dataset at once, implementing custom partitioning logic within the UDF to group by stock symbol and maintain state using global variables shared across all executor processes.

C.

Use applyInPandas on a Spark DataFrame so that each stock symbol group is received as a pandas DataFrame, allowing processing within each group while maintaining state variables local to each group’s processing function.

D.

Use a grouped-aggregate Pandas UDF that processes each stock symbol group independently, maintaining state through intermediate aggregation results that get passed between successive UDF calls via broadcast variables.

Buy Now
Questions 23

A CHECK constraint has been successfully added to the Delta table named activity_details using the following logic:

Databricks-Certified-Professional-Data-Engineer Question 23

A batch job is attempting to insert new records to the table, including a record where latitude = 45.50 and longitude = 212.67.

Which statement describes the outcome of this batch insert?

Options:

A.

The write will fail when the violating record is reached; any records previously processed will be recorded to the target table.

B.

The write will fail completely because of the constraint violation and no records will be inserted into the target table.

C.

The write will insert all records except those that violate the table constraints; the violating records will be recorded to a quarantine table.

D.

The write will include all records in the target table; any violations will be indicated in the boolean column named valid_coordinates.

E.

The write will insert all records except those that violate the table constraints; the violating records will be reported in a warning log.

Buy Now
Questions 24

The Databricks CLI is used to trigger a run of an existing job by passing the job_id parameter. The response indicating the job run request was submitted successfully includes a field run_id. Which statement describes what the number alongside this field represents?

Options:

A.

The job_id and number of times the job has been run are concatenated and returned.

B.

The globally unique ID of the newly triggered run.

C.

The job_id is returned in this field.

D.

The number of times the job definition has been run in this workspace.

Buy Now
Questions 25

A Structured Streaming job deployed to production has been experiencing delays during peak hours of the day. At present, during normal execution, each microbatch of data is processed in less than 3 seconds. During peak hours of the day, execution time for each microbatch becomes very inconsistent, sometimes exceeding 30 seconds. The streaming write is currently configured with a trigger interval of 10 seconds.

Holding all other variables constant and assuming records need to be processed in less than 10 seconds, which adjustment will meet the requirement?

Options:

A.

Decrease the trigger interval to 5 seconds; triggering batches more frequently allows idle executors to begin processing the next batch while longer running tasks from previous batches finish.

B.

Increase the trigger interval to 30 seconds; setting the trigger interval near the maximum execution time observed for each batch is always best practice to ensure no records are dropped.

C.

The trigger interval cannot be modified without modifying the checkpoint directory; to maintain the current stream state, increase the number of shuffle partitions to maximize parallelism.

D.

Use the trigger once option and configure a Databricks job to execute the query every 10 seconds; this ensures all backlogged records are processed with each batch.

E.

Decrease the trigger interval to 5 seconds; triggering batches more frequently may prevent records from backing up and large batches from causing spill.

Buy Now
Questions 26

A table named user_ltv is being used to create a view that will be used by data analysis on various teams. Users in the workspace are configured into groups, which are used for setting up data access using ACLs.

The user_ltv table has the following schema:

Databricks-Certified-Professional-Data-Engineer Question 26

An analyze who is not a member of the auditing group executing the following query:

Databricks-Certified-Professional-Data-Engineer Question 26

Which result will be returned by this query?

Options:

A.

All columns will be displayed normally for those records that have an age greater than 18; records not meeting this condition will be omitted.

B.

All columns will be displayed normally for those records that have an age greater than 17; records not meeting this condition will be omitted.

C.

All age values less than 18 will be returned as null values all other columns will be returned with the values in user_ltv.

D.

All records from all columns will be displayed with the values in user_ltv.

Buy Now
Questions 27

The data engineering team has configured a job to process customer requests to be forgotten (have their data deleted). All user data that needs to be deleted is stored in Delta Lake tables using default table settings.

The team has decided to process all deletions from the previous week as a batch job at 1am each Sunday. The total duration of this job is less than one hour. Every Monday at 3am, a batch job executes a series of VACUUM commands on all Delta Lake tables throughout the organization.

The compliance officer has recently learned about Delta Lake ' s time travel functionality. They are concerned that this might allow continued access to deleted data.

Assuming all delete logic is correctly implemented, which statement correctly addresses this concern?

Options:

A.

Because the vacuum command permanently deletes all files containing deleted records, deleted records may be accessible with time travel for around 24 hours.

B.

Because the default data retention threshold is 24 hours, data files containing deleted records will be retained until the vacuum job is run the following day.

C.

Because Delta Lake time travel provides full access to the entire history of a table, deleted records can always be recreated by users with full admin privileges.

D.

Because Delta Lake ' s delete statements have ACID guarantees, deleted records will be permanently purged from all storage systems as soon as a delete job completes.

E.

Because the default data retention threshold is 7 days, data files containing deleted records will be retained until the vacuum job is run 8 days later.

Buy Now
Questions 28

The data architect has mandated that all tables in the Lakehouse should be configured as external (also known as " unmanaged " ) Delta Lake tables.

Which approach will ensure that this requirement is met?

Options:

A.

When a database is being created, make sure that the LOCATION keyword is used.

B.

When configuring an external data warehouse for all table storage, leverage Databricks for all ELT.

C.

When data is saved to a table, make sure that a full file path is specified alongside the Delta format.

D.

When tables are created, make sure that the EXTERNAL keyword is used in the CREATE TABLE statement.

E.

When the workspace is being configured, make sure that external cloud object storage has been mounted.

Buy Now
Questions 29

A data engineer has configured their Databricks Asset Bundle with multiple targets in databricks.yml and deployed it to the production workspace. Now, to validate the deployment, they need to invoke a job named my_project_job specifically within the prod target context. Assuming the job is already deployed, they need to trigger its execution while ensuring the target-specific configuration is respected.

Which command will trigger the job execution?

Options:

A.

databricks execute my_project_job -e prod

B.

databricks job run my_project_job --env prod

C.

databricks run my_project_job -t prod

D.

databricks bundle run my_project_job -t prod

Buy Now
Questions 30

The DevOps team has configured a production workload as a collection of notebooks scheduled to run daily using the Jobs UI. A new data engineering hire is onboarding to the team and has requested access to one of these notebooks to review the production logic.

What are the maximum notebook permissions that can be granted to the user without allowing accidental changes to production code or data?

Options:

A.

Can Manage

B.

Can Edit

C.

No permissions

D.

Can Read

E.

Can Run

Buy Now
Questions 31

A Databricks job has been configured with 3 tasks, each of which is a Databricks notebook. Task A does not depend on other tasks. Tasks B and C run in parallel, with each having a serial dependency on Task A.

If task A fails during a scheduled run, which statement describes the results of this run?

Options:

A.

Because all tasks are managed as a dependency graph, no changes will be committed to the Lakehouse until all tasks have successfully been completed.

B.

Tasks B and C will attempt to run as configured; any changes made in task A will be rolled back due to task failure.

C.

Unless all tasks complete successfully, no changes will be committed to the Lakehouse; because task A failed, all commits will be rolled back automatically.

D.

Tasks B and C will be skipped; some logic expressed in task A may have been committed before task failure.

E.

Tasks B and C will be skipped; task A will not commit any changes because of stage failure.

Buy Now
Questions 32

A Delta Lake table was created with the below query:

Databricks-Certified-Professional-Data-Engineer Question 32

Consider the following query:

DROP TABLE prod.sales_by_store -

If this statement is executed by a workspace admin, which result will occur?

Options:

A.

Nothing will occur until a COMMIT command is executed.

B.

The table will be removed from the catalog but the data will remain in storage.

C.

The table will be removed from the catalog and the data will be deleted.

D.

An error will occur because Delta Lake prevents the deletion of production data.

E.

Data will be marked as deleted but still recoverable with Time Travel.

Buy Now
Questions 33

The downstream consumers of a Delta Lake table have been complaining about data quality issues impacting performance in their applications. Specifically, they have complained that invalid latitude and longitude values in the activity_details table have been breaking their ability to use other geolocation processes.

A junior engineer has written the following code to add CHECK constraints to the Delta Lake table:

Databricks-Certified-Professional-Data-Engineer Question 33

A senior engineer has confirmed the above logic is correct and the valid ranges for latitude and longitude are provided, but the code fails when executed.

Which statement explains the cause of this failure?

Options:

A.

Because another team uses this table to support a frequently running application, two-phase locking is preventing the operation from committing.

B.

The activity details table already exists; CHECK constraints can only be added during initial table creation.

C.

The activity details table already contains records that violate the constraints; all existing data must pass CHECK constraints in order to add them to an existing table.

D.

The activity details table already contains records; CHECK constraints can only be added prior to inserting values into a table.

E.

The current table schema does not contain the field valid coordinates; schema evolution will need to be enabled before altering the table to add a constraint.

Buy Now
Questions 34

A data engineer is designing a pipeline in Databricks that processes records from a Kafka stream where late-arriving data is common.

Which approach should the data engineer use?

Options:

A.

Implement a custom solution using Databricks Jobs to periodically reprocess all historical data.

B.

Use batch processing and overwrite the entire output table each time to ensure late data is incorporated correctly.

C.

Use an Auto CDC pipeline with batch tables to simplify late data handling.

D.

Use a watermark to specify the allowed lateness to accommodate records that arrive after their expected window, ensuring correct aggregation and state management.

Buy Now
Questions 35

An external object storage container has been mounted to the location /mnt/finance_eda_bucket .

The following logic was executed to create a database for the finance team:

Databricks-Certified-Professional-Data-Engineer Question 35

After the database was successfully created and permissions configured, a member of the finance team runs the following code:

Databricks-Certified-Professional-Data-Engineer Question 35

If all users on the finance team are members of the finance group, which statement describes how the tx_sales table will be created?

Options:

A.

A logical table will persist the query plan to the Hive Metastore in the Databricks control plane.

B.

An external table will be created in the storage container mounted to /mnt/finance eda bucket.

C.

A logical table will persist the physical plan to the Hive Metastore in the Databricks control plane.

D.

An managed table will be created in the storage container mounted to /mnt/finance eda bucket.

E.

A managed table will be created in the DBFS root storage container.

Buy Now
Questions 36

A platform engineer is creating catalogs and schemas for the development team to use.

The engineer has created an initial catalog, catalog_A, and initial schema, schema_A. The engineer has also granted USE CATALOG, USE

SCHEMA, and CREATE TABLE to the development team so that the engineer can begin populating the schema with new tables.

Despite being owner of the catalog and schema, the engineer noticed that they do not have access to the underlying tables in Schema_A.

What explains the engineer ' s lack of access to the underlying tables?

Options:

A.

The platform engineer needs to execute a REFRESH statement as the table permissions did not automatically update for owners.

B.

Users granted with USE CATALOG can modify the owner ' s permissions to downstream tables.

C.

The owner of the schema does not automatically have permission to tables within the schema, but can grant them to themselves at any point.

D.

Permissions explicitly given by the table creator are the only way the Platform Engineer could access the underlying tables in their

schema.

Buy Now
Questions 37

A data engineer is tasked with ensuring that a Delta table in Databricks continuously retains deleted files for 15 days (instead of the default 7 days), in order to permanently comply with the organization’s data retention policy.

Which code snippet correctly sets this retention period for deleted files?

Options:

A.

spark.sql( " ALTER TABLE my_table SET TBLPROPERTIES ( ' delta.deletedFileRetentionDuration ' = ' interval 15 days ' ) " )

B.

from delta.tables import *

deltaTable = DeltaTable.forPath(spark, " /mnt/data/my_table " )

deltaTable.deletedFileRetentionDuration = " interval 15 days "

C.

spark.sql( " VACUUM my_table RETAIN 15 HOURS " )

D.

spark.conf.set( " spark.databricks.delta.deletedFileRetentionDuration " , " 15 days " )

Buy Now
Questions 38

A data engineer is configuring a pipeline that will potentially see late-arriving, duplicate records.

In addition to de-duplicating records within the batch, which of the following approaches allows the data engineer to deduplicate data against previously processed records as it is inserted into a Delta table?

Options:

A.

Set the configuration delta.deduplicate = true.

B.

VACUUM the Delta table after each batch completes.

C.

Perform an insert-only merge with a matching condition on a unique key.

D.

Perform a full outer join on a unique key and overwrite existing data.

E.

Rely on Delta Lake schema enforcement to prevent duplicate records.

Buy Now
Questions 39

The DevOps team has configured a production workload as a collection of notebooks scheduled to run daily using the Jobs Ul. A new data engineering hire is onboarding to the team and has requested access to one of these notebooks to review the production logic.

What are the maximum notebook permissions that can be granted to the user without allowing accidental changes to production code or data?

Options:

A.

Can manage

B.

Can edit

C.

Can run

D.

Can Read

Buy Now
Questions 40

A junior data engineer has been asked to develop a streaming data pipeline with a grouped aggregation using DataFrame df. The pipeline needs to calculate the average humidity and average temperature for each non-overlapping five-minute interval. Incremental state information should be maintained for 10 minutes for late-arriving data.

Streaming DataFrame df has the following schema:

" device_id INT, event_time TIMESTAMP, temp FLOAT, humidity FLOAT "

Code block:

Databricks-Certified-Professional-Data-Engineer Question 40

Choose the response that correctly fills in the blank within the code block to complete this task.

Options:

A.

withWatermark( " event_time " , " 10 minutes " )

B.

awaitArrival( " event_time " , " 10 minutes " )

C.

await( " event_time + ‘10 minutes ' " )

D.

slidingWindow( " event_time " , " 10 minutes " )

E.

delayWrite( " event_time " , " 10 minutes " )

Buy Now
Questions 41

Which statement describes Delta Lake optimized writes?

Options:

A.

A shuffle occurs prior to writing to try to group data together resulting in fewer files instead of each executor writing multiple files based on directory partitions.

B.

Optimized writes logical partitions instead of directory partitions partition boundaries are only represented in metadata fewer small files are written.

C.

An asynchronous job runs after the write completes to detect if files could be further compacted; yes, an OPTIMIZE job is executed toward a default of 1 GB.

D.

Before a job cluster terminates, OPTIMIZE is executed on all tables modified during the most recent job.

Buy Now
Questions 42

A facilities-monitoring team is building a near-real-time PowerBI dashboard off the Delta table device_readings:

Columns:

    device_id (STRING, unique sensor ID)

    event_ts (TIMESTAMP, ingestion timestamp UTC)

    temperature_c (DOUBLE, temperature in °C)

Requirement:

    For each sensor, generate one row per non-overlapping 5-minute interval , offset by 2 minutes (e.g., 00:02–00:07, 00:07–00:12, …).

    Each row must include interval start, interval end, and average temperature in that slice.

    Downstream BI tools (e.g., Power BI) must use the interval timestamps to plot time-series bars.

Options:

Options:

A.

WITH buckets AS (

SELECT device_id,

window(event_ts, ' 5 minutes ' , ' 2 minutes ' , ' 5 minutes ' ) AS win,

temperature_c

FROM device_readings

)

SELECT device_id,

win.start AS bucket_start,

win.end AS bucket_end,

AVG(temperature_c) AS avg_temp_5m

FROM buckets

GROUP BY device_id, win

ORDER BY device_id, bucket_start;

B.

SELECT device_id,

event_ts,

AVG(temperature_c) OVER (

PARTITION BY device_id

ORDER BY event_ts

RANGE BETWEEN INTERVAL 5 MINUTES PRECEDING AND CURRENT ROW

) AS avg_temp_5m

FROM device_readings

WINDOW w AS (window(event_ts, ' 5 minutes ' , ' 2 minutes ' ));

C.

SELECT device_id,

date_trunc( ' minute ' , event_ts - INTERVAL 2 MINUTES) + INTERVAL 2 MINUTES AS bucket_start,

date_trunc( ' minute ' , event_ts - INTERVAL 2 MINUTES) + INTERVAL 7 MINUTES AS bucket_end,

AVG(temperature_c) AS avg_temp_5m

FROM device_readings

GROUP BY device_id, date_trunc( ' minute ' , event_ts - INTERVAL 2 MINUTES)

ORDER BY device_id, bucket_start;

D.

SELECT device_id,

window.start AS bucket_start,

window.end AS bucket_end,

AVG(temperature_c) AS avg_temp_5m

FROM device_readings

GROUP BY device_id, window(event_ts, ' 5 minutes ' , ' 5 minutes ' , ' 2 minutes ' )

ORDER BY device_id, bucket_start;

Buy Now
Questions 43

A security analytics pipeline must enrich billions of raw connection logs with geolocation data. The join hinges on finding which IPv4 range each event’s address falls into.

Table 1: network_events (≈ 5 billion rows)

event_id ip_int

42 3232235777

Table 2: ip_ranges (≈ 2 million rows)

start_ip_int end_ip_int country

3232235520 3232236031 US

The query is currently very slow:

SELECT n.event_id, n.ip_int, r.country

FROM network_events n

JOIN ip_ranges r

ON n.ip_int BETWEEN r.start_ip_int AND r.end_ip_int;

Question:

Which change will most dramatically accelerate the query while preserving its logic?

Options:

A.

Increase spark.sql.shuffle.partitions from 200 to 10000.

B.

Add a range-join hint /*+ RANGE_JOIN(r, 65536) */.

C.

Force a sort-merge join with /*+ MERGE(r) */.

D.

Add a broadcast hint: /*+ BROADCAST(r) */ for ip_ranges.

Buy Now
Questions 44

A data team is automating a daily multi-task ETL pipeline in Databricks. The pipeline includes a notebook for ingesting raw data, a Python wheel task for data transformation, and a SQL query to update aggregates. They want to trigger the pipeline programmatically and see previous runs in the GUI. They need to ensure tasks are retried on failure and stakeholders are notified by email if any task fails.

Which two approaches will meet these requirements? (Choose 2 answers)

Options:

A.

Use the REST API endpoint /jobs/runs/submit to trigger each task individually as separate job runs and implement retries using custom logic in the orchestrator.

B.

Create a multi-task job using the UI, Databricks Asset Bundles (DABs), or the Jobs REST API (/jobs/create) with notebook, Python wheel, and SQL tasks. Configure task-level retries and email notifications in the job definition.

C.

Trigger the job programmatically using the Databricks Jobs REST API (/jobs/run-now), the CLI (databricks jobs run-now), or one of the Databricks SDKs.

D.

Create a single orchestrator notebook that calls each step with dbutils.notebook.run(), defining a job for that notebook and configuring retries and notifications at the notebook level.

E.

Use Databricks Asset Bundles (DABs) to deploy the workflow, then trigger individual tasks directly by referencing each task’s notebook or script path in the workspace.

Buy Now
Questions 45

The data science team has requested assistance in accelerating queries on free form text from user reviews. The data is currently stored in Parquet with the below schema:

item_id INT, user_id INT, review_id INT, rating FLOAT, review STRING

The review column contains the full text of the review left by the user. Specifically, the data science team is looking to identify if any of 30 key words exist in this field.

A junior data engineer suggests converting this data to Delta Lake will improve query performance.

Which response to the junior data engineer s suggestion is correct?

Options:

A.

Delta Lake statistics are not optimized for free text fields with high cardinality.

B.

Text data cannot be stored with Delta Lake.

C.

ZORDER ON review will need to be run to see performance gains.

D.

The Delta log creates a term matrix for free text fields to support selective filtering.

E.

Delta Lake statistics are only collected on the first 4 columns in a table.

Buy Now
Questions 46

A data architect is designing a Databricks solution to efficiently process data for different business requirements.

In which scenario should a data engineer use a materialized view compared to a streaming table ?

Options:

A.

Implementing a CDC (Change Data Capture) pipeline that needs to detect and respond to database changes within seconds.

B.

Ingesting data from Apache Kafka topics with sub-second processing requirements for immediate alerting.

C.

Precomputing complex aggregations and joins from multiple large tables to accelerate BI dashboard performance.

D.

Processing high-volume, continuous clickstream data from a website to monitor user behavior in real-time.

Buy Now
Questions 47

Which Python variable contains a list of directories to be searched when trying to locate required modules?

Options:

A.

importlib.resource path

B.

,sys.path

C.

os-path

D.

pypi.path

E.

pylib.source

Buy Now
Questions 48

While reviewing a query ' s execution in the Databricks Query Profiler, a data engineer observes that the Top Operators panel shows a Sort operator with high Time Spent and Memory Peak metrics. The Spark UI also reports frequent data spilling .

How should the data engineer address this issue?

Options:

A.

Switch to a broadcast join to reduce memory usage.

B.

Repartition the DataFrame to a single partition before sorting.

C.

Convert the sort operation to a filter operation.

D.

Increase the number of shuffle partitions to better distribute data.

Buy Now
Questions 49

A Databricks job has been configured with 3 tasks, each of which is a Databricks notebook. Task A does not depend on other tasks. Tasks B and C run in parallel, with each having a serial dependency on task A.

If tasks A and B complete successfully but task C fails during a scheduled run, which statement describes the resulting state?

Options:

A.

All logic expressed in the notebook associated with tasks A and B will have been successfully completed; some operations in task C may have completed successfully.

B.

All logic expressed in the notebook associated with tasks A and B will have been successfully completed; any changes made in task C will be rolled back due to task failure.

C.

All logic expressed in the notebook associated with task A will have been successfully completed; tasks B and C will not commit any changes because of stage failure.

D.

Because all tasks are managed as a dependency graph, no changes will be committed to the Lakehouse until ail tasks have successfully been completed.

E.

Unless all tasks complete successfully, no changes will be committed to the Lakehouse; because task C failed, all commits will be rolled back automatically.

Buy Now
Questions 50

Which statement describes the correct use of pyspark.sql.functions.broadcast?

Options:

A.

It marks a column as having low enough cardinality to properly map distinct values to available partitions, allowing a broadcast join.

B.

It marks a column as small enough to store in memory on all executors, allowing a broadcast join.

C.

It caches a copy of the indicated table on attached storage volumes for all active clusters within a Databricks workspace.

D.

It marks a DataFrame as small enough to store in memory on all executors, allowing a broadcast join.

E.

It caches a copy of the indicated table on all nodes in the cluster for use in all future queries during the cluster lifetime.

Buy Now
Questions 51

The Databricks workspace administrator has configured interactive clusters for each of the data engineering groups. To control costs, clusters are set to terminate after 30 minutes of inactivity. Each user should be able to execute workloads against their assigned clusters at any time of the day.

Assuming users have been added to a workspace but not granted any permissions, which of the following describes the minimal permissions a user would need to start and attach to an already configured cluster.

Options:

A.

" Can Manage " privileges on the required cluster

B.

Workspace Admin privileges, cluster creation allowed. " Can Attach To " privileges on the required cluster

C.

Cluster creation allowed. " Can Attach To " privileges on the required cluster

D.

" Can Restart " privileges on the required cluster

E.

Cluster creation allowed. " Can Restart " privileges on the required cluster

Buy Now
Questions 52

A Delta Lake table representing metadata about content posts from users has the following schema:

user_id LONG, post_text STRING, post_id STRING, longitude FLOAT, latitude FLOAT, post_time TIMESTAMP, date DATE

This table is partitioned by the date column. A query is run with the following filter:

longitude < 20 and longitude > -20

Which statement describes how data will be filtered?

Options:

A.

Statistics in the Delta Log will be used to identify partitions that might Include files in the filtered range.

B.

No file skipping will occur because the optimizer does not know the relationship between the partition column and the longitude.

C.

The Delta Engine will use row-level statistics in the transaction log to identify the flies that meet the filter criteria.

D.

Statistics in the Delta Log will be used to identify data files that might include records in the filtered range.

E.

The Delta Engine will scan the parquet file footers to identify each row that meets the filter criteria.

Buy Now
Questions 53

A junior data engineer has been asked to develop a streaming data pipeline with a grouped aggregation using DataFrame df. The pipeline needs to calculate the average humidity and average temperature for each non-overlapping five-minute interval. Events are recorded once per minute per device.

df has the following schema: device_id INT, event_time TIMESTAMP, temp FLOAT, humidity FLOAT

Code block:

df.withWatermark( " event_time " , " 10 minutes " )

.groupBy(

________,

" device_id "

)

.agg(

avg( " temp " ).alias( " avg_temp " ),

avg( " humidity " ).alias( " avg_humidity " )

)

.writeStream

.format( " delta " )

.saveAsTable( " sensor_avg " )

Which line of code correctly fills in the blank within the code block to complete this task?

Options:

A.

window( " event_time " , " 5 minutes " ).alias( " time " )

B.

to_interval( " event_time " , " 5 minutes " ).alias( " time " )

C.

" event_time "

D.

lag( " event_time " , " 5 minutes " ).alias( " time " )

Buy Now
Questions 54

A data engineer is creating a daily reporting job. There are two reporting notebooks—one for weekdays and one for weekends. An “if/else condition” task is configured as {{job.start_time.is_weekday}} == true to route the job to either the weekday or weekend notebook tasks. The same job would be used across multiple time zones.

Which action should a senior data engineer take upon reviewing the job to merge or reject the pull request?

Options:

A.

Reject, as the {{job.start_time.is_weekday}} is for the UTC timezone .

B.

Reject, as the {{job.start_time.is_weekday}} is not a valid value reference.

C.

Merge, as the job configuration looks good.

D.

Reject, as they should use {{job.trigger_time.is_weekday}} instead.

Buy Now
Questions 55

The view updates represents an incremental batch of all newly ingested data to be inserted or updated in the customers table.

The following logic is used to process these records.

Databricks-Certified-Professional-Data-Engineer Question 55

Which statement describes this implementation?

Options:

A.

The customers table is implemented as a Type 3 table; old values are maintained as a new column alongside the current value.

B.

The customers table is implemented as a Type 2 table; old values are maintained but marked as no longer current and new values are inserted.

C.

The customers table is implemented as a Type 0 table; all writes are append only with no changes to existing values.

D.

The customers table is implemented as a Type 1 table; old values are overwritten by new values and no history is maintained.

E.

The customers table is implemented as a Type 2 table; old values are overwritten and new customers are appended.

Buy Now
Questions 56

Given the following PySpark code snippet in a Databricks notebook:

filtered_df = spark.read.format( " delta " ).load( " /mnt/data/large_table " ) \

.filter( " event_date > ' 2024-01-01 ' " )

filtered_df.count()

The data engineer notices from the Query Profiler that the scan operator for filtered_df is reading almost all files, despite the filter being applied.

What is the probable reason for poor data skipping?

Options:

A.

The Delta table lacks optimization that enables dynamic file pruning.

B.

The filter is executed only after the full data scan, preventing data skipping.

C.

The event_date column is outside the table’s partitioning and Z-ordering scheme.

D.

The filter condition involves a data type excluded from data skipping support.

Buy Now
Questions 57

The following table consists of items found in user carts within an e-commerce website.

Databricks-Certified-Professional-Data-Engineer Question 57

The following MERGE statement is used to update this table using an updates view, with schema evaluation enabled on this table.

Databricks-Certified-Professional-Data-Engineer Question 57

How would the following update be handled?

Options:

A.

The update is moved to separate ' ' restored ' ' column because it is missing a column expected in the target schema.

B.

The new restored field is added to the target schema, and dynamically read as NULL for existing unmatched records.

C.

The update throws an error because changes to existing columns in the target schema are not supported.

D.

The new nested field is added to the target schema, and files underlying existing records are updated to include NULL values for the new field.

Buy Now
Questions 58

A data engineer is designing a Lakeflow Spark Declarative Pipeline to process streaming order data. The pipeline uses Auto Loader to ingest data and must enforce data quality by ensuring customer_id is not null and amount is greater than zero. Invalid records should be dropped. Which Lakeflow Spark Declarative Pipelines configuration implements this requirement using Python?

Options:

A.

@dlt.table

def silver_orders():

return dlt.read_stream( " bronze_orders " ) \

.expect_or_drop( " valid_customer " , " customer_id IS NOT NULL " ) \

.expect_or_drop( " valid_amount " , " amount > 0 " )

B.

@dlt.table

def silver_orders():

return dlt.read_stream( " bronze_orders " ) \

.expect( " valid_customer " , " customer_id IS NOT NULL " ) \

.expect( " valid_amount " , " amount > 0 " )

C.

@dlt.table

@dlt.expect( " valid_customer " , " customer_id IS NOT NULL " )

@dlt.expect( " valid_amount " , " amount > 0 " )

def silver_orders():

return dlt.read_stream( " bronze_orders " )

D.

@dlt.table

@dlt.expect_or_drop( " valid_customer " , " customer_id IS NOT NULL " )

@dlt.expect_or_drop( " valid_amount " , " amount > 0 " )

def silver_orders():

return dlt.read_stream( " bronze_orders " )

Buy Now
Questions 59

A data engineer is testing a collection of mathematical functions, one of which calculates the area under a curve as described by another function.

Which kind of the test does the above line exemplify?

Options:

A.

Integration

B.

Unit

C.

Manual

D.

functional

Buy Now
Questions 60

A data engineer is configuring Delta Sharing for a Databricks-to-Databricks scenario to optimize read performance. The recipient needs to perform time travel queries and streaming reads on shared sales data.

Which configuration will provide the optimal performance while enabling these capabilities?

Options:

A.

Share tables WITH HISTORY , ensure tables don’t have partitioning enabled, and enable CDF before sharing.

B.

Share tables WITHOUT HISTORY and enable partitioning for better query performance.

C.

Share the entire schema WITHOUT HISTORY and rely on recipient-side caching for performance.

D.

Use the open sharing protocol instead of Databricks-to-Databricks sharing for better performance.

Buy Now
Exam Name: Databricks Certified Data Engineer Professional Exam
Last Update: Apr 17, 2026
Questions: 195

PDF + Testing Engine

$63.52  $181.49

Testing Engine

$50.57  $144.49
buy now Databricks-Certified-Professional-Data-Engineer testing engine

PDF (Q&A)

$43.57  $124.49
buy now Databricks-Certified-Professional-Data-Engineer pdf