DATABRICKS-MACHINE-LEARNING-ASSOCIATE Online Practice Questions and Answers

Questions 4

A data scientist wants to use Spark ML to one-hot encode the categorical features in their PySpark DataFrame features_df. A list of the names of the string columns is assigned to the input_columns variable.

They have developed this code block to accomplish this task:

The code block is returning an error.

Which of the following adjustments does the data scientist need to make to accomplish this task?

A. They need to specify the method parameter to the OneHotEncoder.

B. They need to remove the line with the fit operation.

C. They need to use StringIndexer prior to one-hot encoding the features.

D. They need to use VectorAssembler prior to one-hot encoding the features.

Browse 74 Q&As
Questions 5

Which of the following statements describes a Spark ML estimator?

A. An estimator is a hyperparameter grid that can be used to train a model

B. An estimator chains multiple algorithms together to specify an ML workflow

C. An estimator is a trained ML model which turns a DataFrame with features into a DataFrame with predictions

D. An estimator is an algorithm which can be fit on a DataFrame to produce a Transformer

E. An estimator is an evaluation tool to assess the quality of a model

Questions 6

A data scientist uses 3-fold cross-validation when optimizing model hyperparameters for a regression problem. The following root-mean-squared-error values are calculated on each of the validation folds:

1. 10.0
2. 12.0
3. 17.0

Which of the following values represents the overall cross-validation root-mean-squared error?

A. 13.0

B. 17.0

C. 12.0

D. 39.0

E. 10.0
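
As a sketch of the arithmetic only, assuming the overall cross-validation metric is taken as the unweighted mean of the per-fold metrics (Spark ML's CrossValidator reports such an average in avgMetrics), the fold values above can be combined in plain Python. The variable names are illustrative.

```python
from statistics import mean

# Per-fold validation RMSE values listed in the question
fold_rmse = [10.0, 12.0, 17.0]

# Assumption: the overall score is the unweighted mean across folds
cv_rmse = mean(fold_rmse)
print(cv_rmse)
```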

Questions 7

A data scientist uses 3-fold cross-validation and the following hyperparameter grid when optimizing model hyperparameters via grid search for a classification problem:

Hyperparameter 1: [2, 5, 10]
Hyperparameter 2: [50, 100]

Which of the following represents the number of machine learning models that can be trained in parallel during this process?

A. 3

B. 5

C. 6

D. 18
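
The counts involved can be enumerated in plain Python. This is a minimal sketch, assuming grid search tries every combination of the two hyperparameters and fits one model per combination per fold. How many of those fits actually run at the same time depends on the tuning setup, which this sketch does not model. The variable names are illustrative.

```python
from itertools import product

hyperparameter_1 = [2, 5, 10]
hyperparameter_2 = [50, 100]
num_folds = 3

# Every combination of the two hyperparameter lists
grid = list(product(hyperparameter_1, hyperparameter_2))

# One fit per combination per fold
total_fits = len(grid) * num_folds
print(len(grid), total_fits)
```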

Questions 8

A data scientist wants to use Spark ML to impute missing values in their PySpark DataFrame features_df. They want to replace missing values in all numeric columns in features_df with each respective numeric column's median value.

They have developed the following code block to accomplish this task:

The code block is not accomplishing the task.

Which of the following reasons describes why the code block is not accomplishing the imputation task?

A. It does not impute both the training and test data sets.

B. The inputCols and outputCols need to be exactly the same.

C. The fit method needs to be called instead of transform.

D. It does not fit the imputer on the data to create an ImputerModel.
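
The code block referenced above is not included on this page. As a library-free sketch of the two-step pattern behind median imputation (first learn the medians, then apply them), the example below uses hypothetical helpers fit_medians and impute. These are not Spark ML APIs; in Spark ML, calling fit on an Imputer returns an ImputerModel, and that model's transform fills in the values.

```python
from statistics import median

def fit_medians(rows, columns):
    """Learn one median per column, skipping missing (None) values."""
    return {
        c: median(r[c] for r in rows if r[c] is not None)
        for c in columns
    }

def impute(rows, medians):
    """Return new rows with missing values replaced by the learned medians."""
    return [
        {c: (medians[c] if c in medians and v is None else v) for c, v in r.items()}
        for r in rows
    ]

# Made-up rows standing in for features_df
rows = [
    {"age": 30.0, "income": 50.0},
    {"age": None, "income": 70.0},
    {"age": 40.0, "income": None},
    {"age": 50.0, "income": 90.0},
]

medians = fit_medians(rows, ["age", "income"])  # "fit" step
imputed = impute(rows, medians)                 # "transform" step
```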

Questions 9

A data scientist has a Spark DataFrame spark_df. They want to create a new Spark DataFrame that contains only the rows from spark_df where the value in column price is greater than 0.

Which of the following code blocks will accomplish this task?

A. spark_df[spark_df["price"] > 0]

B. spark_df.filter(col("price") > 0)

C. SELECT * FROM spark_df WHERE price > 0

D. spark_df.loc[spark_df["price"] > 0,:]

E. spark_df.loc[:,spark_df["price"] > 0]

Questions 10

A data scientist wants to explore the Spark DataFrame spark_df. They want visual histograms displaying the distribution of numeric features to be included in the exploration.

Which of the following lines of code can the data scientist run to accomplish the task?

A. spark_df.describe()

B. dbutils.data(spark_df).summarize()

C. This task cannot be accomplished in a single line of code.

D. spark_df.summary()

E. dbutils.data.summarize(spark_df)

Questions 11

An organization is developing a feature repository and is electing to one-hot encode all categorical feature variables. A data scientist suggests that the categorical feature variables should not be one-hot encoded within the feature repository.

Which of the following explanations justifies this suggestion?

A. One-hot encoding is a potentially problematic categorical variable strategy for some machine learning algorithms.

B. One-hot encoding is dependent on the target variable's values which differ for each application.

C. One-hot encoding is computationally intensive and should only be performed on small samples of training sets for individual machine learning problems.

D. One-hot encoding is not a common strategy for representing categorical feature variables numerically.

Questions 12

The implementation of linear regression in Spark ML first attempts to solve the linear regression problem using matrix decomposition, but this method does not scale well to large datasets with a large number of variables.

Which of the following approaches does Spark ML use to distribute the training of a linear regression model for large data?

A. Logistic regression

B. Singular value decomposition

C. Iterative optimization

D. Least-squares method

Questions 13

A data scientist has produced three new models for a single machine learning problem. In the past, the solution used just one model. All four models have nearly the same prediction latency, but a machine learning engineer suggests that the new solution will be less time efficient during inference.

In which situation will the machine learning engineer be correct?

A. When the new solution requires if-else logic determining which model to use to compute each prediction

B. When the new solution's models have an average latency that is larger than the size of the original model

C. When the new solution requires the use of fewer feature variables than the original model

D. When the new solution requires that each model computes a prediction for every record

E. When the new solution's models have an average size that is larger than the size of the original model

Questions 14

Which of the following machine learning algorithms typically uses bagging?

A. Gradient boosted trees

B. K-means

C. Random forest

D. Linear regression

E. Decision tree
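
Bagging (bootstrap aggregating) fits each base model on a bootstrap sample of the training data, drawn with replacement, and then combines the models' predictions. The sketch below shows only that mechanism: the "model" is deliberately trivial (a sample mean), and every name in it is illustrative rather than taken from any library.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]

def bagged_mean_predictor(targets, n_models=5, seed=0):
    """Fit n_models trivial 'models' (sample means) on bootstrap samples, then average them."""
    rng = random.Random(seed)
    fitted = []
    for _ in range(n_models):
        sample = bootstrap_sample(targets, rng)
        fitted.append(sum(sample) / len(sample))
    # Aggregate step: average the individual models' predictions
    return sum(fitted) / len(fitted)

targets = [1.0, 2.0, 3.0, 4.0, 5.0]
prediction = bagged_mean_predictor(targets)
```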

Questions 15

A data scientist has been given an incomplete notebook from the data engineering team. The notebook uses a Spark DataFrame spark_df on which the data scientist needs to perform further feature engineering. Unfortunately, the data scientist has not yet learned the PySpark DataFrame API.

Which of the following blocks of code can the data scientist run to be able to use the pandas API on Spark?

A. import pyspark.pandas as ps
   df = ps.DataFrame(spark_df)

B. import pyspark.pandas as ps
   df = ps.to_pandas(spark_df)

C. spark_df.to_pandas()

D. import pandas as pd
   df = pd.DataFrame(spark_df)

Questions 16

A data scientist has developed a linear regression model using Spark ML and computed the predictions in a Spark DataFrame preds_df with the following schema:

prediction DOUBLE
actual DOUBLE

Which of the following code blocks can be used to compute the root mean-squared-error of the model according to the data in preds_df and assign it to the rmse variable?

A. Option A

B. Option B

C. Option C

D. Option D
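
The four code options are not reproduced on this page. For reference, the metric itself is the square root of the mean squared difference between prediction and actual. Below is a plain-Python sketch on made-up rows (the values are illustrative and not from the source). In Spark ML, RegressionEvaluator with metricName="rmse" computes the same quantity on a DataFrame.

```python
import math

# Hypothetical (prediction, actual) pairs standing in for rows of preds_df
rows = [(3.0, 1.0), (5.0, 5.0), (2.0, 4.0), (7.0, 7.0)]

# Squared difference between prediction and actual for each row
squared_errors = [(prediction - actual) ** 2 for prediction, actual in rows]

# Root of the mean squared error
rmse = math.sqrt(sum(squared_errors) / len(squared_errors))
print(rmse)
```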

Questions 17

A machine learning engineer is trying to perform batch model inference. They want to get predictions using the linear regression model saved at the path model_uri for the DataFrame batch_df.

batch_df has the following schema:

customer_id STRING

The machine learning engineer runs the following code block to perform inference on batch_df using the linear regression model at model_uri:

In which situation will the machine learning engineer's code block perform the desired inference?

A. When the Feature Store feature set was logged with the model at model_uri

B. When all of the features used by the model at model_uri are in a Spark DataFrame in the PySpark

C. When the model at model_uri only uses customer_id as a feature

D. This code block will not perform the desired inference in any situation.

E. When all of the features used by the model at model_uri are in a single Feature Store table

Questions 18

A team is developing guidelines on when to use various evaluation metrics for classification problems. The team needs to provide input on when to use the F1 score over accuracy.

Which of the following suggestions should the team include in their guidelines?

A. The F1 score should be utilized over accuracy when the number of actual positive cases is identical to the number of actual negative cases.

B. The F1 score should be utilized over accuracy when there are greater than two classes in the target variable.

C. The F1 score should be utilized over accuracy when there is significant imbalance between positive and negative classes and avoiding false negatives is a priority.

D. The F1 score should be utilized over accuracy when identifying true positives and true negatives are equally important to the business problem.
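
To make the two metrics concrete, the sketch below computes accuracy and F1 from confusion-matrix counts. The counts are hypothetical, chosen to represent a heavily imbalanced data set, and the function names are illustrative.

```python
def accuracy(tp, fp, fn, tn):
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall; true negatives play no part."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical data: 10 actual positives, 990 actual negatives.
# The classifier finds 1 of the 10 positives and raises no false alarms.
tp, fp, fn, tn = 1, 0, 9, 990

acc = accuracy(tp, fp, fn, tn)
f1 = f1_score(tp, fp, fn)
print(round(acc, 3), round(f1, 3))
```

With these counts the accuracy is dominated by the 990 true negatives, while the F1 score reflects the nine missed positives.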

Exam Name: Databricks Certified Machine Learning Associate
Last Update: Mar 16, 2025
Questions: 74 Q&As
