A Summary of Machine Learning Hyperparameter Tuning (PySpark ML)

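An important task in ML is model selection: using data to find the best model or parameters for a given task. This is also called tuning. Tuning can be done for an individual Estimator, such as LogisticRegression, or for an entire Pipeline that includes multiple algorithms, featurization, and other steps. Users can tune an entire Pipeline at once rather than tuning each element of the Pipeline separately.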

MLlib supports model selection using tools such as CrossValidator and TrainValidationSplit. These tools require the following items:

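Estimator: the algorithm or Pipeline to tune
Set of ParamMaps: the parameters to choose from, sometimes called the "parameter grid" to search over
Evaluator: the metric that measures how well a fitted Model does on held-out test data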

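These model selection tools work as follows:

They split the input data into separate training and test datasets.

For each (training, test) pair, they iterate through the set of ParamMaps:

For each ParamMap, they fit the Estimator using those parameters, get the fitted Model, and evaluate the Model's performance using the Evaluator.

They select the Model produced by the best-performing set of parameters.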
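The evaluation loop described here can be sketched in plain Python, without Spark. This is only an illustration of the control flow, not MLlib's implementation: the toy estimator `fit_shrunk_mean`, the evaluator `neg_mse`, and the helper `evaluate_grid` are hypothetical names introduced for this example.

```python
from statistics import mean

def fit_shrunk_mean(train, reg):
    """Toy 'Estimator': predicts a constant, the training mean shrunk toward 0 by reg."""
    center = mean(train) / (1.0 + reg)
    return lambda: center  # the fitted 'Model'

def neg_mse(model, test):
    """Toy 'Evaluator': negative mean squared error, so larger is better."""
    pred = model()
    return -mean((y - pred) ** 2 for y in test)

def evaluate_grid(splits, param_maps):
    """Return metrics[i][j]: the metric for split i and parameter setting j."""
    metrics = []
    for train, test in splits:            # each (training, test) pair
        row = []
        for params in param_maps:         # each ParamMap
            model = fit_shrunk_mean(train, **params)
            row.append(neg_mse(model, test))
        metrics.append(row)
    return metrics

# Two hand-made (training, test) pairs and two parameter settings.
splits = [([1.0, 2.0, 3.0], [2.0]), ([2.0, 2.0, 2.0], [1.0, 3.0])]
param_maps = [{"reg": 0.0}, {"reg": 1.0}]
metrics = evaluate_grid(splits, param_maps)
```

Each row of `metrics` belongs to one (training, test) pair and each column to one parameter setting, which is the table a tuning tool needs before it can compare settings.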

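To help construct the parameter grid, users can use ParamGridBuilder. By default, the parameter settings in the grid are evaluated serially. Parameter evaluation can be done in parallel by setting parallelism to 2 or more (a value of 1 is serial) before running model selection with CrossValidator or TrainValidationSplit. The value of parallelism should be chosen carefully to maximize parallelism without exceeding cluster resources, and larger values do not always improve performance. Generally speaking, a value of up to 10 should be sufficient for most clusters.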
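As a rough mental model, a parameter grid is the Cartesian product of the candidate values, and the parallelism setting bounds how many settings are evaluated at once. The sketch below imitates both ideas in plain Python with a thread pool. The helpers `build_grid` and `score` are hypothetical stand-ins, and the scoring formula is made up purely so the example runs.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def build_grid(grid):
    """Expand {parameter name: candidate values} into a list of parameter settings."""
    names = list(grid)
    return [dict(zip(names, combo)) for combo in product(*(grid[n] for n in names))]

def score(params):
    """Stand-in for 'fit and evaluate one parameter setting' (made-up formula)."""
    return -abs(params["numFeatures"] - 100) - params["regParam"]

param_maps = build_grid({"numFeatures": [10, 100, 1000], "regParam": [0.1, 0.01]})

# parallelism = 1 would evaluate the settings one after another;
# with 2, up to two settings are evaluated at the same time.
parallelism = 2
with ThreadPoolExecutor(max_workers=parallelism) as pool:
    scores = list(pool.map(score, param_maps))

best = param_maps[scores.index(max(scores))]
```

Here 3 values times 2 values gives 6 settings. Raising `parallelism` only helps while there are idle resources to run the extra fits, which is why a larger value does not always improve performance.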

Cross-Validation

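CrossValidator begins by splitting the dataset into a set of folds, which are used as separate training and test datasets. For example, with k=3 folds, CrossValidator generates 3 (training, test) dataset pairs, each of which uses 2/3 of the data for training and 1/3 for testing. To evaluate a particular ParamMap, CrossValidator computes the average evaluation metric over the 3 Models produced by fitting the Estimator on the 3 different (training, test) dataset pairs.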
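The 2/3 versus 1/3 arithmetic can be checked with a small plain-Python sketch. It assigns rows to folds by striding, which is a simplification chosen to keep the example deterministic (Spark assigns rows to folds randomly). The helper `k_fold_pairs` is a hypothetical name.

```python
def k_fold_pairs(rows, k):
    """Yield (train, test) pairs; fold i is the test set, the other folds are training."""
    folds = [rows[i::k] for i in range(k)]   # k roughly equal folds
    for i in range(k):
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield train, folds[i]

rows = list(range(12))
pairs = list(k_fold_pairs(rows, 3))
sizes = [(len(train), len(test)) for train, test in pairs]
```

With 12 rows and k=3, every pair trains on 8 rows and tests on 4, and each row is used for testing exactly once across the three pairs.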

After identifying the best ParamMap, CrossValidator finally re-fits the Estimator using the best ParamMap and the entire dataset.
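The selection step can be sketched as averaging each setting's metric over the folds and taking the best average. The metric values below are hypothetical numbers invented for illustration, and `select_best` is a hypothetical helper, not an MLlib function.

```python
def select_best(param_maps, fold_metrics, larger_is_better=True):
    """Average each setting's metric over the folds and return the winning setting."""
    k = len(fold_metrics)
    avg = [sum(fold[j] for fold in fold_metrics) / k for j in range(len(param_maps))]
    pick = max if larger_is_better else min
    best_index = avg.index(pick(avg))
    return param_maps[best_index], avg

param_maps = [{"regParam": 0.1}, {"regParam": 0.01}]
# Hypothetical metric values: one row per fold, one column per setting.
fold_metrics = [[0.80, 0.90], [0.70, 0.60], [0.90, 0.75]]
best_params, avg = select_best(param_maps, fold_metrics)
```

Note that the second setting wins on the first fold but loses on average, which is the situation averaging over folds is meant to handle. The `larger_is_better` flag matters because some metrics, such as an error measure, should be minimized instead. The final re-fit would then train once more on all rows using `best_params`.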

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.sql import SparkSession

# Reuse the active SparkSession (for example in the pyspark shell) or create one.
spark = SparkSession.builder.getOrCreate()

# Prepare labeled training documents.
training = spark.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    (2, "spark f g h", 1.0),
    (3, "hadoop mapreduce", 0.0),
    (4, "b spark who", 1.0),
    (5, "g d a y", 0.0),
    (6, "spark fly", 1.0),
    (7, "was mapreduce", 0.0),
    (8, "e spark program", 1.0),
    (9, "a e c l", 0.0),
    (10, "spark compile", 1.0),
    (11, "hadoop software", 0.0)
], ["id", "text", "label"])

# Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

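# We now treat the Pipeline as an Estimator and wrap it in a CrossValidator instance.
# This lets us jointly choose parameters for all Pipeline stages.
# A CrossValidator requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.
# We use a ParamGridBuilder to construct a grid of parameters to search over.
# With 3 values for hashingTF.numFeatures and 2 values for lr.regParam,
# this grid has 3 x 2 = 6 parameter settings for CrossValidator to choose from.
paramGrid = (
    ParamGridBuilder()
    .addGrid(hashingTF.numFeatures, [10, 100, 1000])
    .addGrid(lr.regParam, [0.1, 0.01])
    .build()
)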

crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=BinaryClassificationEvaluator(),
                          numFolds=2)  # use 3+ folds in practice

# Run cross-validation, and choose the best set of parameters.
cvModel = crossval.fit(training)

# Prepare test documents, which are unlabeled.
test = spark.createDataFrame([
    (4, "spark i j k"),
    (5, "l m n"),
    (6, "mapreduce spark"),
    (7, "apache hadoop")
], ["id", "text"])

# Make predictions on test documents. cvModel uses the best model found (lrModel).
prediction = cvModel.transform(test)
selected = prediction.select("id", "text", "probability", "prediction")
for row in selected.collect():
    print(row)

Train-Validation Split

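In addition to CrossValidator, Spark also offers TrainValidationSplit for hyperparameter tuning. TrainValidationSplit evaluates each combination of parameters only once, as opposed to k times in the case of CrossValidator. It is therefore less expensive, but it will not produce reliable results when the training dataset is not sufficiently large.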
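The cost difference is simple arithmetic: the number of fits during the search is the grid size times the number of (training, test) pairs, plus one final re-fit with the best parameters. A minimal sketch, using a hypothetical helper `number_of_fits` and the 6-setting grid from the cross-validation example as an assumed input:

```python
def number_of_fits(grid_size, num_folds=1):
    """Fits performed during the search, plus one final re-fit with the best parameters."""
    return grid_size * num_folds + 1

cv_fits = number_of_fits(grid_size=6, num_folds=3)   # CrossValidator with k = 3
tvs_fits = number_of_fits(grid_size=6)               # TrainValidationSplit: a single pair
```

Under these assumptions CrossValidator performs 19 fits and TrainValidationSplit performs 7. When the estimator is a Pipeline, each of those fits covers every stage of the Pipeline.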

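Unlike CrossValidator, TrainValidationSplit creates a single (training, test) dataset pair. It splits the dataset into these two parts using the trainRatio parameter. For example, with trainRatio=0.75, TrainValidationSplit generates a training and test dataset pair in which 75% of the data is used for training and 25% for validation.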
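A single ratio-based split can be imitated in plain Python by assigning each row to the training set with probability equal to the ratio. This is a sketch under that assumption, and `train_validation_split` is a hypothetical helper. Because the assignment is random per row, the resulting proportions are approximate, not exact.

```python
import random

def train_validation_split(rows, train_ratio, seed=None):
    """Randomly assign each row to training with probability train_ratio."""
    rng = random.Random(seed)
    train, validation = [], []
    for row in rows:
        (train if rng.random() < train_ratio else validation).append(row)
    return train, validation

rows = list(range(1000))
train, validation = train_validation_split(rows, train_ratio=0.75, seed=42)
```

Every row ends up in exactly one of the two parts, and roughly three quarters of them land in `train`. Fixing the seed makes the split reproducible between runs.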

Like CrossValidator, TrainValidationSplit finally fits the Estimator using the best ParamMap and the entire dataset.

from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.regression import LinearRegression
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit
from pyspark.sql import SparkSession

# Reuse the active SparkSession or create one.
spark = SparkSession.builder.getOrCreate()

# Prepare training and test data.
data = (
    spark.read.format("libsvm")
    .load("data/mllib/sample_linear_regression_data.txt")
)
train, test = data.randomSplit([0.9, 0.1], seed=12345)

lr = LinearRegression(maxIter=10)

# We use a ParamGridBuilder to construct a grid of parameters to search over.
# TrainValidationSplit will try all combinations of values and determine the best model using the evaluator.
paramGrid = (
    ParamGridBuilder()
    .addGrid(lr.regParam, [0.1, 0.01])
    .addGrid(lr.fitIntercept, [False, True])
    .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])
    .build()
)

# In this case the estimator is simply the linear regression.
# A TrainValidationSplit requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.
tvs = TrainValidationSplit(estimator=lr,
                           estimatorParamMaps=paramGrid,
                           evaluator=RegressionEvaluator(),
                           # 80% of the data will be used for training, 20% for validation.
                           trainRatio=0.8)

# Run TrainValidationSplit, and choose the best set of parameters.
model = tvs.fit(train)

# Make predictions on test data. model is the model with the best-performing combination of parameters.
model.transform(test).select("features", "label", "prediction").show()
