
DATABRICKS-CERTIFIED-ASSOCIATE-DEVELOPER-FOR-APACHE-SPARK Online Practice Questions and Answers

Question 4

Which of the following is a viable way to improve Spark's performance when dealing with large amounts of data, given that there is only a single application running on the cluster?

A. Increase values for the properties spark.default.parallelism and spark.sql.shuffle.partitions

B. Decrease values for the properties spark.default.parallelism and spark.sql.partitions

C. Increase values for the properties spark.sql.parallelism and spark.sql.partitions

D. Increase values for the properties spark.sql.parallelism and spark.sql.shuffle.partitions

E. Increase values for the properties spark.dynamicAllocation.maxExecutors, spark.default.parallelism, and spark.sql.shuffle.partitions
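
For reference, a minimal sketch of how these properties are typically set (the value 400 is illustrative only; spark.default.parallelism is generally fixed when the application is launched, for example via spark-submit --conf, while spark.sql.shuffle.partitions can be changed at runtime):

spark.conf.set("spark.sql.shuffle.partitions", 400)  # partitions created when shuffling for joins and aggregations; illustrative value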

Question 5

The code block displayed below contains multiple errors. The code block should return a DataFrame that contains only columns transactionId, predError, value and storeId of DataFrame transactionsDf. Find the errors.

Code block:

transactionsDf.select([col(productId), col(f)])

Sample of transactionsDf:

+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|   f|
+-------------+---------+-----+-------+---------+----+
|            1|        3|    4|     25|        1|null|
|            2|        6|    7|      2|        2|null|
|            3|        3| null|     25|        3|null|
+-------------+---------+-----+-------+---------+----+

A. The column names should be listed directly as arguments to the operator and not as a list.

B. The select operator should be replaced by a drop operator, the column names should be listed directly as arguments to the operator and not as a list, and all column names should be expressed as strings without being wrapped in a col() operator.

C. The select operator should be replaced by a drop operator.

D. The column names should be listed directly as arguments to the operator and not as a list, and, following the pattern of how column names are expressed in the code block, columns productId and f should be replaced by transactionId, predError, value and storeId.

E. The select operator should be replaced by a drop operator, the column names should be listed directly as arguments to the operator and not as a list, and all col() operators should be removed.
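
For reference, two working PySpark ways to end up with only those four columns (assuming transactionsDf has exactly the six columns shown in the sample above):

transactionsDf.select("transactionId", "predError", "value", "storeId")  # keep the wanted columns
transactionsDf.drop("productId", "f")                                    # or drop the unwanted ones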

Question 6

In which order should the code blocks shown below be run to read a JSON file from location jsonPath into a DataFrame and return only the rows that do not have value 3 in column productId?

1. importedDf.createOrReplaceTempView("importedDf")

2. spark.sql("SELECT * FROM importedDf WHERE productId != 3")

3. spark.sql("FILTER * FROM importedDf WHERE productId != 3")

4. importedDf = spark.read.option("format", "json").path(jsonPath)

5. importedDf = spark.read.json(jsonPath)

A. 4, 1, 2

B. 5, 1, 3

C. 5, 2

D. 4, 1, 3

E. 5, 1, 2
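
For reference, a minimal sketch of the read-then-query pattern these blocks combine (jsonPath is assumed to point to an existing JSON file):

importedDf = spark.read.json(jsonPath)              # read the JSON file into a DataFrame
importedDf.createOrReplaceTempView("importedDf")    # register it as a temporary view for SQL
spark.sql("SELECT * FROM importedDf WHERE productId != 3")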

Question 7

Which of the following code blocks shuffles DataFrame transactionsDf, which has 8 partitions, so that it has 10 partitions?

A. transactionsDf.repartition(transactionsDf.getNumPartitions()+2)

B. transactionsDf.repartition(transactionsDf.rdd.getNumPartitions()+2)

C. transactionsDf.coalesce(10)

D. transactionsDf.coalesce(transactionsDf.getNumPartitions()+2)

E. transactionsDf.repartition(transactionsDf._partitions+2)
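
For reference, the current partition count of a DataFrame is read through its underlying RDD, and repartition always performs a full shuffle to the requested number of partitions; a minimal sketch:

current = transactionsDf.rdd.getNumPartitions()            # e.g. 8
transactionsDf = transactionsDf.repartition(current + 2)   # shuffles into 10 partitions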

Question 8

Which of the following statements about Spark's configuration properties is incorrect?

A. The maximum number of tasks that an executor can process at the same time is controlled by the spark.task.cpus property.

B. The maximum number of tasks that an executor can process at the same time is controlled by the spark.executor.cores property.

C. The default value for spark.sql.autoBroadcastJoinThreshold is 10MB.

D. The default number of partitions to use when shuffling data for joins or aggregations is 300.

E. The default number of partitions returned from certain transformations can be controlled by the spark.default.parallelism property.
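
For reference, configuration values can be inspected at runtime through the conf interface; a minimal sketch (the comments reflect the defaults documented for Spark 3.x):

print(spark.conf.get("spark.sql.shuffle.partitions"))          # 200 shuffle partitions by default
print(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))  # 10 MB broadcast threshold by default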

Question 9

Which of the following code blocks creates a new one-column, two-row DataFrame dfDates with column date of type timestamp?

A.
dfDates = spark.createDataFrame(["23/01/2022 11:28:12","24/01/2022 10:58:34"], ["date"])
dfDates = dfDates.withColumn("date", to_timestamp("dd/MM/yyyy HH:mm:ss", "date"))

B.
dfDates = spark.createDataFrame([("23/01/2022 11:28:12",),("24/01/2022 10:58:34",)], ["date"])
dfDates = dfDates.withColumnRenamed("date", to_timestamp("date", "yyyy-MM-ddHH:mm:ss"))

C.
dfDates = spark.createDataFrame([("23/01/2022 11:28:12",),("24/01/2022 10:58:34",)], ["date"])
dfDates = dfDates.withColumn("date", to_timestamp("date", "dd/MM/yyyy HH:mm:ss"))

D.
dfDates = spark.createDataFrame(["23/01/2022 11:28:12","24/01/2022 10:58:34"], ["date"])
dfDates = dfDates.withColumnRenamed("date", to_datetime("date", "yyyy-MM-ddHH:mm:ss"))

E.
dfDates = spark.createDataFrame([("23/01/2022 11:28:12",),("24/01/2022 10:58:34",)], ["date"])
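
For reference, to_timestamp(column, format) from pyspark.sql.functions parses a string column using the given pattern; a minimal sketch using the date format shown above:

from pyspark.sql.functions import to_timestamp

dfDates = spark.createDataFrame([("23/01/2022 11:28:12",), ("24/01/2022 10:58:34",)], ["date"])
dfDates = dfDates.withColumn("date", to_timestamp("date", "dd/MM/yyyy HH:mm:ss"))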

Question 10

Which of the following code blocks reduces a DataFrame from 12 to 6 partitions and performs a full shuffle?

A. DataFrame.repartition(12)

B. DataFrame.coalesce(6).shuffle()

C. DataFrame.coalesce(6)

D. DataFrame.coalesce(6, shuffle=True)

E. DataFrame.repartition(6)
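
For reference, coalesce(n) only merges existing partitions, avoids a full shuffle and cannot increase the partition count, while repartition(n) always triggers a full shuffle; a minimal sketch with a placeholder DataFrame df:

reduced = df.coalesce(6)        # narrow: no full shuffle
reshuffled = df.repartition(6)  # wide: full shuffle into exactly 6 partitions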

Question 11

The code block displayed below contains an error. The code block should configure Spark so that DataFrames up to a size of 20 MB will be broadcast to all worker nodes when performing a join.

Find the error.

Code block:

spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 20)

A. Spark will only broadcast DataFrames that are much smaller than the default value.

B. The correct option to write configurations is through spark.config and not spark.conf.

C. Spark will only apply the limit to threshold joins and not to other joins.

D. The passed limit has the wrong variable type.

E. The command is evaluated lazily and needs to be followed by an action.
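
For reference, spark.sql.autoBroadcastJoinThreshold is specified in bytes; a minimal sketch of setting it to 20 MB:

spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 20 * 1024 * 1024)  # 20 MB expressed in bytes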

Question 12

The code block displayed below contains an error. When the code block below has executed, it should have divided DataFrame transactionsDf into 14 parts, based on columns storeId and transactionDate (in this order). Find the error.

Code block:

transactionsDf.coalesce(14, ("storeId", "transactionDate"))

A. The parentheses around the column names need to be removed and .select() needs to be appended to the code block.

B. Operator coalesce needs to be replaced by repartition, the parentheses around the column names need to be removed, and .count() needs to be appended to the code block.

C. Operator coalesce needs to be replaced by repartition, the parentheses around the column names need to be removed, and .select() needs to be appended to the code block.

D. Operator coalesce needs to be replaced by repartition and the parentheses around the column names need to be replaced by square brackets.

E. Operator coalesce needs to be replaced by repartition.
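
For reference, repartition accepts a target partition count followed by the partitioning columns; a minimal sketch:

transactionsDf.repartition(14, "storeId", "transactionDate")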

Question 13

The code block displayed below contains an error. The code block should trigger Spark to cache DataFrame transactionsDf in executor memory where available, writing to disk where insufficient executor memory is available, in a fault-tolerant way. Find the error.

Code block:

transactionsDf.persist(StorageLevel.MEMORY_AND_DISK)

A. Caching is not supported in Spark, data are always recomputed.

B. Data caching capabilities can be accessed through the spark object, but not through the DataFrame API.

C. The storage level is inappropriate for fault-tolerant storage.

D. The code block uses the wrong operator for caching.

E. The DataFrameWriter needs to be invoked.
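
For reference, the replicated storage levels (the _2 variants) store each cached partition on two cluster nodes; a minimal sketch:

from pyspark import StorageLevel

transactionsDf.persist(StorageLevel.MEMORY_AND_DISK_2)  # memory first, spill to disk, replicated on two nodes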

Question 14

The code block displayed below contains an error. The code block should return a DataFrame in which column predErrorAdded contains the results of Python function add_2_if_geq_3 as applied to numeric and nullable column predError in DataFrame transactionsDf.

Find the error.

Code block:

def add_2_if_geq_3(x):
    if x is None:
        return x
    elif x >= 3:
        return x+2
    return x

add_2_if_geq_3_udf = udf(add_2_if_geq_3)

transactionsDf.withColumnRenamed("predErrorAdded", add_2_if_geq_3_udf(col("predError")))

A. The operator used for adding the column does not add column predErrorAdded to the DataFrame.

B. Instead of col("predError"), the actual DataFrame with the column needs to be passed, like so transactionsDf.predError.

C. The udf() method does not declare a return type.

D. UDFs are only available through the SQL API, but not in the Python API as shown in the code block.

E. The Python function is unable to handle null values, resulting in the code block crashing on execution.
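
For reference, a Python UDF registered with udf() is normally applied through withColumn; a minimal sketch reusing the function defined above:

from pyspark.sql.functions import udf, col

add_2_if_geq_3_udf = udf(add_2_if_geq_3)
transactionsDf.withColumn("predErrorAdded", add_2_if_geq_3_udf(col("predError")))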

Question 15

Which of the following code blocks returns a copy of DataFrame transactionsDf in which column productId has been renamed to productNumber?

A. transactionsDf.withColumnRenamed("productId", "productNumber")

B. transactionsDf.withColumn("productId", "productNumber")

C. transactionsDf.withColumnRenamed("productNumber", "productId")

D. transactionsDf.withColumnRenamed(col(productId), col(productNumber))

E. transactionsDf.withColumnRenamed(productId, productNumber)
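
For reference, withColumnRenamed(existing, new) returns a new DataFrame with the column renamed and leaves the original DataFrame unchanged; a minimal sketch:

renamed = transactionsDf.withColumnRenamed("productId", "productNumber")
print(renamed.columns)  # productId now appears as productNumber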

Question 16

Which of the following code blocks prints out in how many rows the expression Inc. appears in the string-type column supplier of DataFrame itemsDf?

A.
counter = 0

for index, row in itemsDf.iterrows():
    if 'Inc.' in row['supplier']:
        counter = counter + 1

print(counter)

B.
counter = 0

def count(x):
    if 'Inc.' in x['supplier']:
        counter = counter + 1

itemsDf.foreach(count)
print(counter)

C. print(itemsDf.foreach(lambda x: 'Inc.' in x))

D. print(itemsDf.foreach(lambda x: 'Inc.' in x).sum())

E.
accum=sc.accumulator(0)

def check_if_inc_in_supplier(row):
    if 'Inc.' in row['supplier']:
        accum.add(1)

itemsDf.foreach(check_if_inc_in_supplier)
print(accum.value)

Question 17

Which of the following is the idea behind dynamic partition pruning in Spark?

A. Dynamic partition pruning is intended to skip over the data you do not need in the results of a query.

B. Dynamic partition pruning concatenates columns of similar data types to optimize join performance.

C. Dynamic partition pruning performs wide transformations on disk instead of in memory.

D. Dynamic partition pruning reoptimizes physical plans based on data types and broadcast variables.

E. Dynamic partition pruning reoptimizes query plans based on runtime statistics collected during query execution.
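
For reference, dynamic partition pruning is governed by an optimizer flag that is enabled by default in Spark 3.x; a minimal sketch:

spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")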

Question 18

In which order should the code blocks shown below be run to return the number of records that are not empty in column value in the DataFrame resulting from an inner join of DataFrame transactionsDf and itemsDf on columns productId and itemId, respectively?

1. .filter(~isnull(col('value')))

2. .count()

3. transactionsDf.join(itemsDf, col("transactionsDf.productId")==col("itemsDf.itemId"))

4. transactionsDf.join(itemsDf, transactionsDf.productId==itemsDf.itemId, how='inner')

5. .filter(col('value').isnotnull())

6. .sum(col('value'))

A. 4, 1, 2

B. 3, 1, 6

C. 3, 1, 2

D. 3, 5, 2

E. 4, 6
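
For reference, a minimal sketch of chaining an inner join, a null filter and a count in PySpark (isnull comes from pyspark.sql.functions):

from pyspark.sql.functions import col, isnull

(transactionsDf
 .join(itemsDf, transactionsDf.productId == itemsDf.itemId, how='inner')
 .filter(~isnull(col('value')))
 .count())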

Exam Name: Databricks Certified Associate Developer for Apache Spark 3.0
Last Update: Mar 14, 2025
Questions: 180 Q&As
