Databricks Databricks-Certified-Professional-Data-Engineer Dumps - 100% Cover Real Exam Questions (Updated 220 Questions)
NEW QUESTION 109
A Delta Live Tables pipeline includes two datasets defined using STREAMING LIVE TABLE.
Three datasets are defined against Delta Lake table sources using LIVE TABLE. The pipeline is configured to
run in Development mode using the Triggered Pipeline Mode.
Assuming previously unprocessed data exists and all definitions are valid, what is the expected outcome after
clicking Start to update the pipeline?
- A. All datasets will be updated continuously and the pipeline will not shut down. The compute resources will persist with the pipeline
- B. All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will be deployed for the update and terminated when the pipeline is stopped
- C. All datasets will be updated once and the pipeline will shut down. The compute resources will persist to allow for additional testing
- D. All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will persist after the pipeline is stopped to allow for additional testing
- E. All datasets will be updated once and the pipeline will shut down. The compute resources will be terminated
Answer: C
NEW QUESTION 110
What are the advantages of feature hashing?
- A. Easily reverse engineer vectors to determine which original feature mapped to a vector location
- B. One less pass through the training data
- C. Requires less memory
Answer: B,C
Explanation:
SGD-based classifiers avoid the need to predetermine vector size by simply picking a reasonable size and
shoehorning the training data into vectors of that size. This approach is known as feature hashing. The
shoehorning is done by picking one or more locations by using a hash of the name of the variable for
continuous variables or a hash of the variable name and the category name or word for categorical, text-like, or
word-like data.
This hashed feature approach has the distinct advantage of requiring less memory and one less pass through
the training data, but it can make it much harder to reverse engineer vectors to determine which original
feature mapped to a vector location. This is because multiple features may hash to the same location. With
large vectors or with multiple locations per feature, this isn't a problem for accuracy but it can make it hard to
understand what a classifier is doing.
An additional benefit of feature hashing is that the unknown and unbounded vocabularies typical of word-like
variables aren't a problem.
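The shoehorning described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `hash_index` and `hash_features`, the vector size of 16, and the use of MD5 as the hash are all choices made for this example, not part of any particular classifier library.

```python
import hashlib


def hash_index(key: str, size: int) -> int:
    """Map a string key to a stable vector position in [0, size)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % size


def hash_features(record: dict, size: int = 16) -> list:
    """Encode a record into a fixed-size vector without a vocabulary pass."""
    vector = [0.0] * size
    for name, value in record.items():
        if isinstance(value, (int, float)):
            # Continuous variable: hash the variable name, add the value.
            vector[hash_index(name, size)] += float(value)
        else:
            # Categorical or word-like variable: hash name plus category.
            vector[hash_index(f"{name}={value}", size)] += 1.0
    return vector


example = hash_features({"age": 42, "country": "NZ", "word": "delta"})
print(len(example))
```

Because two keys may land on the same position, the vector alone does not tell you which original feature produced a given value, which is the reverse-engineering drawback mentioned above.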
NEW QUESTION 111
What is the best way to query external csv files located on DBFS Storage to inspect the data using SQL?
- A. SELECT * FROM 'dbfs:/location/csv_files/' USING CSV
- B. SELECT * FROM 'dbfs:/location/csv_files/' FORMAT = 'CSV'
- C. SELECT CSV. * from 'dbfs:/location/csv_files/'
- D. SELECT * FROM CSV. 'dbfs:/location/csv_files/'
- E. You cannot query external files directly, use COPY INTO to load the data into a table first
Answer: D
Explanation:
The answer is SELECT * FROM CSV.`dbfs:/location/csv_files/` (the path is enclosed in backticks).
You can query external files in storage directly using the syntax below:
SELECT * FROM format.`/Location`
where format is one of CSV, JSON, PARQUET, TEXT
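As a small illustration of the syntax above, the following Python sketch builds such a statement as a string. The helper name `file_query` is hypothetical, and the list of formats is just the one given above; running the statement would require a Databricks `spark` session, which is assumed and not used here.

```python
def file_query(fmt: str, path: str) -> str:
    """Build a SELECT statement that reads files directly by format and path."""
    allowed = {"csv", "json", "parquet", "text"}
    if fmt.lower() not in allowed:
        raise ValueError(f"unsupported format: {fmt}")
    # The path is wrapped in backticks, not single quotes.
    return f"SELECT * FROM {fmt.lower()}.`{path}`"


query = file_query("CSV", "dbfs:/location/csv_files/")
print(query)
# In a Databricks notebook the string could then be passed to spark.sql(query).
```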
NEW QUESTION 112
Which of the following SQL statements can substitute Python variables into Databricks SQL code when the notebook is set to SQL mode?
%python
table_name = "sales"
schema_name = "bronze"

%sql
SELECT * FROM ____________________
- A. SELECT * FROM f{schema_name.table_name}
- B. SELECT * FROM schema_name.table_name
- C. SELECT * FROM {schem_name.table_name}
- D. SELECT * FROM ${schema_name}.${table_name}
Answer: D
Explanation:
The answer is SELECT * FROM ${schema_name}.${table_name}
%python
table_name = "sales"
schema_name = "bronze"
%sql
SELECT * FROM ${schema_name}.${table_name}
${variable_name} -> refers to a Python variable in Databricks SQL code
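For comparison, here is a minimal sketch of building the same statement entirely inside a Python cell with an f-string. This is a different technique from the `${...}` substitution the answer describes. The `spark.sql` call is left commented out because it assumes the `spark` session object that Databricks notebooks provide.

```python
table_name = "sales"
schema_name = "bronze"

# Build the fully qualified name in Python, then embed it in the SQL text.
query = f"SELECT * FROM {schema_name}.{table_name}"
print(query)

# Assumption: `spark` is the session object Databricks provides in notebooks.
# df = spark.sql(query)
```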
NEW QUESTION 113
The research team has put together a funnel analysis query to monitor the customer traffic on the e-commerce platform. The query takes about 30 minutes to run on a small SQL endpoint cluster with max scaling set to 1 cluster. What steps can be taken to improve the performance of the query?
- A. They can increase the maximum bound of the SQL endpoint's scaling range anywhere between 1 and 100 to review the performance and select the size that meets the required SLA.
- B. They can turn off the Auto Stop feature for the SQL endpoint to more than 30 mins.
- C. They can turn on the Serverless feature for the SQL endpoint.
- D. They can turn on the Serverless feature for the SQL endpoint and change the Spot Instance Policy from "Cost optimized" to "Reliability Optimized."
- E. They can increase the cluster size anywhere from X small to 3XL to review the performance and select the size that meets the required SLA.
Answer: E
Explanation:
The answer is: they can increase the cluster size anywhere from 2X-Small to 4X-Large (scale up) to review the performance and select the size that meets the required SLA. When the goal is to improve the performance of a single query, additional memory and additional worker nodes mean that more tasks can run in the cluster, which improves the performance of that query.
The question tests whether you know how to scale a SQL endpoint (SQL warehouse). Look for cue words that tell you whether the queries run sequentially or concurrently. If the queries run sequentially, scale up (increase the cluster size from 2X-Small toward 4X-Large). If the queries run concurrently or there are more users, scale out (add more clusters).
SQL Endpoint (SQL Warehouse) overview:
1. A SQL warehouse should have at least one cluster.
2. A cluster comprises one driver node and one or many worker nodes.
3. The number of worker nodes in a cluster is determined by the size of the cluster (2X-Small -> 1 worker, X-Small -> 2 workers, ... up to 4X-Large -> 128 workers). Changing this size is called scale up.
4. A single cluster, irrespective of cluster size (2X-Small to 4X-Large), can run only 10 queries at any given time. If a user submits 20 queries all at once to a warehouse with 3X-Large cluster size and cluster scaling (min 1, max 1), 10 queries start running while the remaining 10 wait in a queue for the first 10 to finish.
5. Increasing the warehouse cluster size can improve the performance of a query. For example, if a query runs for 1 minute on a 2X-Small warehouse, it may run in 30 seconds on an X-Small warehouse. This is because 2X-Small has 1 worker node and X-Small has 2 worker nodes, so more of the query's tasks run in parallel and it finishes faster. (Note: this is an ideal-case example. How query performance scales depends on many factors and is not always linear.)
6. A warehouse can have more than one cluster; this is called scale out. If a warehouse is configured with X-Small cluster size and cluster scaling (min 1, max 2), Databricks spins up an additional cluster when it detects queries waiting in the queue. If a user submits 20 queries to that warehouse, 10 queries start running and the remaining 10 are held in the queue; Databricks automatically starts the second cluster and redirects the 10 queued queries to it.
7. A single query will not span more than one cluster. Once a query is submitted to a cluster, it remains on that cluster until execution finishes, irrespective of how many clusters are available to scale.
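The queueing behavior in points 4 and 6 can be sketched as simple arithmetic. This is a toy model only: the function name `schedule` is hypothetical, it uses the 10-queries-per-cluster limit stated above, and it describes the state after any scale-out has completed rather than real warehouse scheduling.

```python
QUERIES_PER_CLUSTER = 10  # per-cluster concurrency limit stated in point 4


def schedule(submitted: int, max_clusters: int) -> dict:
    """Split submitted queries into running and queued for a warehouse."""
    capacity = max_clusters * QUERIES_PER_CLUSTER
    running = min(submitted, capacity)
    # Clusters needed: ceiling division, at least 1, capped by the scaling limit.
    needed = -(-submitted // QUERIES_PER_CLUSTER)
    clusters_used = max(1, min(max_clusters, needed))
    return {"clusters": clusters_used, "running": running, "queued": submitted - running}


print(schedule(20, max_clusters=1))
print(schedule(20, max_clusters=2))
```

With scaling fixed at one cluster, half of the 20 queries wait; with a maximum of two clusters, all 20 run once the second cluster has started.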
Scale up -> Increase the size of the SQL endpoint by changing the cluster size from 2X-Small up to 4X-Large. When you are trying to improve the performance of a single query, additional memory, worker nodes, and cores let more tasks run in the cluster, which ultimately improves performance.
During warehouse creation or afterwards, you can change the warehouse size (2X-Small to 4X-Large) to improve query performance, and you can raise the maximum of the scaling range to add more clusters to a SQL endpoint (SQL warehouse), which is scale out. If you are changing an existing warehouse, you may have to restart the warehouse for the changes to take effect.
NEW QUESTION 114
Which of the following commands can be used to drop a managed Delta table and the underlying files in storage?
- A. DROP TABLE table_name
- B. Use DROP TABLE table_name command and manually delete files using command dbutils.fs.rm("/path",True)
- C. DROP TABLE table_name CASCADE
- D. DROP TABLE table_name INCLUDE_FILES
- E. DROP TABLE table and run VACUUM command
Answer: A
Explanation:
The answer is DROP TABLE table_name.
When a managed table is dropped, the table definition is dropped from the metastore, and everything including data, metadata, and history is also dropped from storage.
NEW QUESTION 115
A Delta Live Tables pipeline can be scheduled to run in two different modes. What are these two modes?
- A. Triggered, Incremental
- B. Triggered, Continuous
- C. Continuous, Incremental
- D. Once, Continuous
- E. Once, Incremental
Answer: B
Explanation:
The answer is Triggered, Continuous
https://docs.microsoft.com/en-us/azure/databricks/data-engineering/delta-live-tables/delta-live-tables-concepts#-
* Triggered pipelines update each table with whatever data is currently available and then stop the cluster running the pipeline. Delta Live Tables automatically analyzes the dependencies between your tables and starts by computing those that read from external sources. Tables within the pipeline are updated after their dependent data sources have been updated.
* Continuous pipelines update tables continuously as input data changes. Once an update is started, it continues to run until manually stopped. Continuous pipelines require an always-running cluster but ensure that downstream consumers have the most up-to-date data.
NEW QUESTION 116
The Delta Live Tables Pipeline is configured to run in Development mode using the Triggered Pipeline Mode.
What is the expected outcome after clicking Start to update the pipeline?
- A. All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will be deployed for the update and terminated when the pipeline is stopped
- B. All datasets will be updated once and the pipeline will shut down. The compute resources will persist to allow for additional development and testing
- C. All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will persist after the pipeline is stopped to allow for additional development and testing
- D. All datasets will be updated continuously and the pipeline will not shut down. The compute resources will persist with the pipeline
- E. All datasets will be updated once and the pipeline will shut down. The compute resources will be terminated
Answer: B
Explanation:
The answer is B: all datasets will be updated once and the pipeline will shut down. The compute resources will persist to allow for additional development and testing.
DLT pipelines support two modes, Development and Production. You can switch between the two based on the stage of your development and deployment lifecycle.
Development and production modes
When you run your pipeline in development mode, the Delta Live Tables system:
* Reuses a cluster to avoid the overhead of restarts.
* Disables pipeline retries so you can immediately detect and fix errors.
In production mode, the Delta Live Tables system:
* Restarts the cluster for specific recoverable errors, including memory leaks and stale credentials.
* Retries execution in the event of specific errors, for example, a failure to start a cluster.
Use the buttons in the Pipelines UI to switch between development and production modes. By default, pipelines run in development mode.
Switching between development and production modes only controls cluster and pipeline execution behavior.
Storage locations must be configured as part of pipeline settings and are not affected when switching between modes.
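The combination of pipeline mode and development/production mode can be summarized with a small Python sketch. The `continuous` and `development` keys are assumed here to correspond to the two switches in the pipeline settings; the pipeline name and the helper `expected_behavior` are hypothetical, and the returned text only restates the behavior described in this document.

```python
def expected_behavior(settings: dict) -> str:
    """Describe an update from the `continuous` and `development` flags."""
    if settings.get("continuous", False):
        run = "tables update continuously until the pipeline is stopped"
    else:
        run = "each table updates once, then the update stops"
    if settings.get("development", False):
        compute = "cluster is kept and reused to avoid restarts"
    else:
        compute = "cluster is stopped when the update ends"
    return f"{run}; {compute}"


pipeline_settings = {
    "name": "example_pipeline",  # hypothetical pipeline name
    "continuous": False,         # Triggered pipeline mode
    "development": True,         # Development mode
}
print(expected_behavior(pipeline_settings))
```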
For additional DLT concepts, see the link below:
https://docs.databricks.com/data-engineering/delta-live-tables/delta-live-tables-concepts.html#delta-live-tables-c
NEW QUESTION 117
You need to drop the customers database along with its associated tables and data. All of the tables inside the database are managed tables. Which of the following SQL commands will help you accomplish this?
- A. DROP DATABASE customers FORCE
- B. DROP DELTA DATABASE customers
- C. DROP DATABASE customers INCLUDE
- D. DROP DATABASE customers CASCADE
- E. All the tables must be dropped first before dropping database
Answer: D
Explanation:
The answer is DROP DATABASE customers CASCADE
Dropping a database with the CASCADE option drops all of its tables. Since all of the tables inside the database are managed tables, we do not need to perform any additional steps to clean up the data in storage.
NEW QUESTION 118
Which of the following approaches can the data engineer use to obtain a version-controllable configuration of the Job's schedule and configuration?
- A. They can link the Job to notebooks that are a part of a Databricks Repo.
- B. They can download the JSON equivalent of the job from the Job's page.
- C. They can submit the Job once on an all-purpose cluster.
- D. They can download the XML description of the Job from the Job's page
- E. They can submit the Job once on a Job cluster.
Answer: B
NEW QUESTION 119
Which of the following SQL statements can be used to query a table while eliminating duplicate rows from the query results?
- A. SELECT * FROM table_name GROUP BY * HAVING COUNT(*) < 1
- B. SELECT * FROM table_name GROUP BY * HAVING COUNT(*) > 1
- C. SELECT DISTINCT * FROM table_name HAVING COUNT(*) > 1
- D. SELECT DISTINCT * FROM table_name
- E. SELECT DISTINCT_ROWS (*) FROM table_name
Answer: D
Explanation:
The answer is SELECT DISTINCT * FROM table_name
NEW QUESTION 120
In order to use Unity Catalog features, which of the following steps needs to be taken on managed/external tables in the Databricks workspace?
- A. Copy data from workspace to unity catalog
- B. Upgrade workspace to Unity catalog
- C. Upgrade to DBR version 15.0
- D. Migrate/upgrade objects in workspace managed/external tables/view to unity catalog
- E. Enable unity catalog feature in workspace settings
Answer: D
Explanation:
See "Upgrade tables and views to Unity Catalog" (Azure Databricks, Microsoft Docs):
* Managed table: upgrade a managed table to Unity Catalog
* External table: upgrade an external table to Unity Catalog
NEW QUESTION 121
Which of the below SQL commands create a Global temporary view?
- A. CREATE OR REPLACE VIEW view_name
  AS SELECT * FROM table_name
- B. CREATE OR REPLACE LOCAL VIEW view_name
  AS SELECT * FROM table_name
- C. CREATE OR REPLACE GLOBAL TEMPORARY VIEW view_name
  AS SELECT * FROM table_name
- D. CREATE OR REPLACE TEMPORARY VIEW view_name
  AS SELECT * FROM table_name
- E. CREATE OR REPLACE LOCAL TEMPORARY VIEW view_name
  AS SELECT * FROM table_name
Answer: C
Explanation:
CREATE OR REPLACE GLOBAL TEMPORARY VIEW view_name
AS SELECT * FROM table_name
There are two types of temporary views that can be created: local and global.
* A local (session-scoped) temporary view is only available within a Spark session, so another notebook on the same cluster cannot access it. If a notebook is detached and reattached, the local temporary view is lost.
* A global temporary view is available to all the notebooks on the cluster, but if the cluster restarts, the global temporary view is lost.
NEW QUESTION 122
What is the probability that the total of two dice will be greater than 8, given that the first die is a 6?
- A. 2/3
- B. 2/6
- C. 1/6
- D. 1/3
Answer: A
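The answer can be checked by enumerating the outcomes. The sketch below assumes two fair six-sided dice and applies the usual definition of conditional probability, P(A | B) = P(A and B) / P(B); the variable names are chosen for this example only.

```python
from fractions import Fraction
from itertools import product

rolls = list(product(range(1, 7), repeat=2))  # all 36 ordered outcomes
given = [r for r in rolls if r[0] == 6]       # first die shows 6
both = [r for r in given if sum(r) > 8]       # ...and the total exceeds 8

# P(total > 8 | first = 6) = P(both) / P(given)
p = Fraction(len(both), len(rolls)) / Fraction(len(given), len(rolls))
print(p)
```

With the first die fixed at 6, the second die must show 3, 4, 5, or 6, which is 4 of 6 outcomes.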
NEW QUESTION 123
A data engineer has three notebooks in an ELT pipeline. The notebooks need to be executed in a specific order
for the pipeline to complete successfully. The data engineer would like to use Delta Live Tables to manage this
process.
Which of the following steps must the data engineer take as part of implementing this pipeline using Delta
Live Tables?
- A. They need to create a Delta Live Tables pipeline from the Jobs page
- B. They need to refactor their notebook to use SQL and CREATE LIVE TABLE keyword
- C. They need to create a Delta Live Tables pipeline from the Data page
- D. They need to create a Delta Live tables pipeline from the Compute page
- E. They need to refactor their notebook to use Python and the dlt library
Answer: A
NEW QUESTION 124
If E1 and E2 are two events, how do you represent the conditional probability that E2 occurs given that
E1 has occurred?
- A. P(E2)/P(E1+E2)
- B. P(E2)/P(E1)
- C. P(E1+E2)/P(E1)
- D. P(E1)/P(E2)
Answer: C
NEW QUESTION 125
You are asked to create a model to predict the total number of monthly subscribers for a specific magazine.
You are provided with 1 year's worth of subscription and payment data, user demographic data, and 10 years
worth of content of the magazine (articles and pictures). Which algorithm is the most appropriate for building
a predictive model for subscribers?
- A. Logistic regression
- B. TF-IDF
- C. Linear regression
- D. Decision trees
Answer: C
NEW QUESTION 126
How do you access or use tables in the unity catalog?
- A. schema_name.catalog_name.table_name
- B. catalog_name.table_name
- C. schema_name.table_name
- D. catalog_name.database_name.schema_name.table_name
- E. catalog_name.schema_name.table_name
Answer: E
Explanation:
The answer is catalog_name.schema_name.table_name
Note: Database and schema are analogous; the terms are used interchangeably in Unity Catalog.
FYI: a catalog is registered under a metastore. By default every workspace has a default metastore called hive_metastore; with Unity Catalog you have the ability to create metastores and share them across multiple workspaces.
NEW QUESTION 127
Which of the following Structured Streaming queries is performing a hop from a Bronze table to a Silver
table?
- A.
(spark.table("sales")
    .withColumn("avgPrice", col("sales") / col("units"))
    .writeStream
    .option("checkpointLocation", checkpointPath)
    .outputMode("append")
    .table("cleanedSales")
)
- B.
(spark.table("sales")
    .groupBy("store")
    .agg(sum("sales"))
    .writeStream
    .option("checkpointLocation", checkpointPath)
    .outputMode("complete")
    .table("aggregatedSales")
)
- C.
(spark.read.load(rawSalesLocation)
    .writeStream
    .option("checkpointLocation", checkpointPath)
    .outputMode("append")
    .table("uncleanedSales")
)
- D.
(spark.readStream.load(rawSalesLocation)
    .writeStream
    .option("checkpointLocation", checkpointPath)
    .outputMode("append")
    .table("uncleanedSales")
)
- E.
(spark.table("sales")
    .agg(sum("sales"),
         sum("units"))
    .writeStream
    .option("checkpointLocation", checkpointPath)
    .outputMode("complete")
    .table("aggregatedSales")
)
Answer: A
NEW QUESTION 128
A notebook accepts an optional input parameter that is assigned to a Python variable called department. You want to control the flow of the code using this parameter: if the department variable is present, execute the code, and if no department value is passed, skip the code execution. How do you achieve this using Python?
- A.
if department is not None:
    #Execute code
else:
    pass
- B.
if (department is not None)
    #Execute code
else
    pass
- C.
if department is not None:
    #Execute code
then:
    pass
- D.
if department is not None:
    #Execute code
end:
    pass
- E.
if department is None:
    #Execute code
else:
    pass
Answer: A
Explanation:
The answer is,
if department is not None:
    #Execute code
else:
    pass
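A runnable version of the pattern above might look like the following sketch. The function name `run_for` and the returned strings are placeholders invented for this example; in a real notebook the body of the `if` branch would hold the department-specific code.

```python
def run_for(department=None):
    """Run the department-specific step only when a value was passed."""
    if department is not None:
        # Placeholder for the real work the notebook would do.
        return f"processing {department}"
    else:
        pass  # no department supplied, so the step is skipped
    return "skipped"


print(run_for("finance"))
print(run_for())
```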
NEW QUESTION 129
Which of the following statements is incorrect when choosing between a lakehouse and a data warehouse?
- A. Lakehouse can be accessed through various APIs including but not limited to Python/R/SQL
- B. Lakehouse can have special indexes and caching which are optimized for Machine learning
- C. Lakehouse uses standard data formats like Parquet.
- D. In traditional data warehouses, storage and compute are coupled.
- E. Lakehouse cannot serve low query latency with high reliability for BI workloads, only suitable for batch workloads.
Answer: E
Explanation:
The answer is Lakehouse cannot serve low query latency with high reliability for BI workloads, only suitable for batch workloads.
Lakehouse can replace traditional warehouses by leveraging storage and compute optimizations like caching to serve them with low query latency with high reliability.
Focus on comparisons between Spark Cache vs Delta Cache.
https://docs.databricks.com/delta/optimizations/delta-cache.html
What Is a Lakehouse? - The Databricks Blog
NEW QUESTION 130
A data engineer is using a Databricks SQL query to monitor the performance of an ELT job. The ELT job is triggered by a specific number of input records being ready to process. The Databricks SQL query returns the number of minutes since the job's most recent runtime. Which of the following approaches can enable the data engineering team to be notified if the ELT job has not been run in an hour?
- A. This type of alert is not possible in Databricks
- B. They can set up an Alert for the accompanying dashboard to notify them if the returned value is greater than 60.
- C. They can set up an Alert for the query to notify them if the returned value is greater than 60.
- D. They can set up an Alert for the accompanying dashboard to notify when it has not refreshed in 60 minutes.
- E. They can set up an Alert for the query to notify when the ELT job fails.
Answer: C
Explanation:
The answer is, They can set up an Alert for the query to notify them if the returned value is greater than 60.
The important thing to note here is that an alert can only be set up on a query, not on a dashboard. The query returns a value, and that value determines whether the alert is triggered.
NEW QUESTION 131
You are trying to calculate the total sales made by all the employees by parsing a complex struct data type that stores employee and sales data. How would you approach this in SQL?
Table definition: batchId INT, performance ARRAY<STRUCT<employeeId: BIGINT, sales: INT>>, insertDate TIMESTAMP
Sample data of the performance column:
[
{ "employeeId":1234,
"sales" : 10000},

{ "employeeId":3232,
"sales" : 30000}
]
Calculate the total sales made by all the employees.
Sample data with create table syntax for the data:
create or replace table sales as
select 1 as batchId ,
from_json('[{ "employeeId":1234,"sales" : 10000 },{ "employeeId":3232,"sales" : 30000 }]',
  'ARRAY<STRUCT<employeeId: BIGINT, sales: INT>>') as performance,
  current_timestamp() as insertDate
union all
select 2 as batchId ,
  from_json('[{ "employeeId":1235,"sales" : 10500 },{ "employeeId":3233,"sales" : 32000 }]',
  'ARRAY<STRUCT<employeeId: BIGINT, sales: INT>>') as performance,
  current_timestamp() as insertDate
- A. select reduce(flatten(collect_list(performance:sales)), 0, (x, y) -> x + y)
  as total_sales from sales
- B. select aggregate(flatten(collect_list(performance.sales)), 0, (x, y) -> x + y)
  as total_sales from sales
- C. SELECT SUM(SLICE (performance, sales)) FROM employee
- D. WITH CTE as (SELECT EXPLODE (performance) FROM table_name)
  SELECT SUM (performance.sales) FROM CTE
- E. WITH CTE as (SELECT FLATTEN (performance) FROM table_name)
  SELECT SUM (sales) FROM CTE
Answer: B
Explanation:
The answer is
select aggregate(flatten(collect_list(performance.sales)), 0, (x, y) -> x + y)
as total_sales from sales
A nested struct can be queried using the . notation: performance.sales gives you access to all the sales values in the performance column.
Note: option A is wrong because it uses performance:sales, not performance.sales. The ":" notation is only used when referring to JSON data, but here we are dealing with a struct data type. For the exam, make sure you understand whether you are dealing with JSON data or struct data.
Other solutions:
We can also use reduce instead of aggregate:
select reduce(flatten(collect_list(performance.sales)), 0, (x, y) -> x + y) as total_sales from sales
We can also use explode and sum instead of any higher-order functions:
with cte as (
  select
    explode(flatten(collect_list(performance.sales))) sales from sales
)
select
  sum(sales) from cte
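To make the flatten-then-fold logic concrete, here is a plain-Python analogue run on the same sample values. It is only a sketch of what the SQL expression computes, not Spark code; the variable names are chosen for this example.

```python
from functools import reduce
from itertools import chain

# Same values as the sample `sales` table: one list of structs per row.
performance_rows = [
    [{"employeeId": 1234, "sales": 10000}, {"employeeId": 3232, "sales": 30000}],
    [{"employeeId": 1235, "sales": 10500}, {"employeeId": 3233, "sales": 32000}],
]

# performance.sales: pull the sales field out of every struct in each array.
sales_per_row = [[item["sales"] for item in row] for row in performance_rows]
# collect_list + flatten: gather the per-row arrays and merge them into one.
flat_sales = list(chain.from_iterable(sales_per_row))
# aggregate(..., 0, (x, y) -> x + y): fold the array into a single total.
total_sales = reduce(lambda x, y: x + y, flat_sales, 0)
print(total_sales)
```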
NEW QUESTION 132
......