Automated Machine Learning (AutoML) Search#
Background#
Machine Learning#
Machine learning (ML) is the process of constructing a mathematical model of a system based on a sample dataset collected from that system.
One of the main goals of training an ML model is to teach the model to separate the signal present in the data from the noise inherent in the system and in the data collection process. If this is done well, the model can then be used to make accurate predictions about the system when presented with new, similar data. Additionally, introspecting on an ML model can reveal key information about the system being modeled, such as which inputs, and which transformations of those inputs, are most useful to the ML model for learning the signal in the data and are therefore the most predictive.
There are a variety of types of ML problems. Supervised learning describes the case where the collected data contains an output value to be modeled and a set of inputs with which to train the model. EvalML focuses on training supervised learning models.
EvalML supports three common supervised ML problem types. The first is regression, where the target value to be modeled is a continuous numeric value. Next are binary and multiclass classification, where the target value to be modeled consists of two or more discrete values or categories. Which supervised ML problem type is most appropriate depends on domain expertise and on how the model will be evaluated and used.
EvalML is currently building support for supervised time series problems: time series regression, time series binary classification, and time series multiclass classification. While we've added some features to tackle these kinds of problems, this functionality is still under active development, so please be mindful of that before use.
AutoML and Search#
AutoML is the process of automating the construction, training, and evaluation of ML models. Given data and some configuration, AutoML searches for the most effective and accurate ML model or models to fit the dataset. During the search, AutoML will explore different combinations of model type, model parameters, and model architecture.
An effective AutoML solution offers several advantages over constructing and tuning ML models by hand. AutoML can assist with many of the difficult aspects of ML, such as avoiding overfitting and underfitting, handling imbalanced data, detecting data leakage and other potential issues with the problem setup, and automatically applying best-practice data cleaning, feature engineering, feature selection, and various modeling techniques. AutoML can also leverage search algorithms to optimally sweep the hyperparameter search space, resulting in model performance that would be difficult to achieve by manual training.
AutoML in EvalML#
EvalML supports all of the above and more.
In its simplest usage, the AutoML search interface requires only the input data, the target data, and a problem_type specifying what kind of supervised ML problem to model.
** Graphing methods, like verbose AutoMLSearch, on Jupyter Notebook and Jupyter Lab require ipywidgets to be installed.
** If graphing on Jupyter Lab, jupyterlab-plotly is required. To download this, make sure you have npm installed.
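For instance, a minimal sketch (assuming X and y hold the features and target loaded in the cell below) could be as simple as:
automl = evalml.automl.AutoMLSearch(X_train=X, y_train=y, problem_type="binary")
automl.search()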
[1]:
import evalml
from evalml.utils import infer_feature_types
X, y = evalml.demos.load_fraud(n_rows=650)
Number of Features
Boolean 1
Categorical 6
Numeric 5
Number of training examples: 650
Targets
False 86.31%
True 13.69%
Name: count, dtype: object
/home/docs/checkouts/readthedocs.org/user_builds/feature-labs-inc-evalml/envs/stable/lib/python3.9/site-packages/woodwork/type_sys/utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
pd.to_datetime(
To provide data to EvalML, it is recommended that you initialize a Woodwork accessor on your data. This allows you to easily control how EvalML will treat each of your features before training a model.
EvalML also accepts pandas input, and will run type inference on top of the input pandas data. If you'd like to change the types inferred by EvalML, you can use the infer_feature_types utility method, which takes pandas or numpy input and converts it to a Woodwork data structure. The feature_types parameter can be used to specify what types specific columns should be.
Feature types such as Natural Language must be specified in this way, otherwise Woodwork will infer them as the Unknown type and drop them during AutoMLSearch.
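For instance, a sketch marking a hypothetical free-text column (here called "notes", purely for illustration) as natural language so it is kept, assuming Woodwork's NaturalLanguage logical type name:
X = infer_feature_types(X, feature_types={"notes": "NaturalLanguage"})  # "notes" is a hypothetical column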
In the example below, we reformat a couple of features so the model can consume them easily, and then specify that the provider, a column that would otherwise be inferred as containing natural language, is a categorical column.
[2]:
X.ww["expiration_date"] = X["expiration_date"].apply(
lambda x: "20{}-01-{}".format(x.split("/")[1], x.split("/")[0])
)
X = infer_feature_types(
X,
feature_types={
"store_id": "categorical",
"expiration_date": "datetime",
"lat": "categorical",
"lng": "categorical",
"provider": "categorical",
},
)
To validate the results of the pipeline creation and optimization process, we will save part of our data as a holdout set.
[3]:
X_train, X_holdout, y_train, y_holdout = evalml.preprocessing.split_data(
    X, y, problem_type="binary", test_size=0.2
)
Data Checks#
Before calling AutoMLSearch.search, we should run some sanity checks on our data to ensure that the input data will not run into common issues before starting a potentially time-consuming search. EvalML has various data checks that make this easy. Each data check returns a collection of warnings and errors if it detects potential issues with the input data. This allows users to inspect their data and avoid confusing errors that could arise during the search process. You can learn about the details of each of the available data checks through our data checks guide.
Here, we will run the DefaultDataChecks class, which contains a series of data checks that are generally useful.
[4]:
from evalml.data_checks import DefaultDataChecks
data_checks = DefaultDataChecks("binary", "log loss binary")
data_checks.validate(X_train, y_train)
[4]:
[]
Since there were no warnings or errors returned, we can safely continue with the search process.
Holdout Set for Pipeline Ranking#
If the holdout_set_size parameter is set and the input dataset contains more than 500 rows, AutoMLSearch will create a holdout set from a holdout_set_size fraction of the training data. Alternatively, a holdout set can be specified manually with the X_holdout and y_holdout parameters of AutoMLSearch(). In this example, the holdout set created above will be used by the AutoML search.
During the AutoML search process, the mean of the objective scores across all cross-validation folds (shown in the "mean_cv_score" column in the pipeline rankings) is calculated. This score is passed to the AutoML search tuner to further optimize the hyperparameters of the next batch of pipelines.
Afterwards, the pipeline is fitted on the entire training dataset and scored on this new holdout set. This score is represented in the "ranking_score" column of the pipeline rankings board and is used to rank pipeline performance.
If the dataset has fewer than 500 rows or holdout_set_size=0 (the default), the "mean_cv_score" will be used as the ranking_score instead.
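As a sketch, letting AutoMLSearch carve out its own holdout set instead of passing one in manually (assuming holdout_set_size is given as a fraction of the training data) might look like:
automl_auto_holdout = evalml.automl.AutoMLSearch(
    X_train=X,
    y_train=y,
    problem_type="binary",
    holdout_set_size=0.1,  # reserve 10% of the training data for ranking
)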
[5]:
automl = evalml.automl.AutoMLSearch(
    X_train=X_train,
    y_train=y_train,
    X_holdout=X_holdout,
    y_holdout=y_holdout,
    problem_type="binary",
    verbose=True,
)
automl.search(interactive_plot=False)
AutoMLSearch will use the holdout set to score and rank pipelines.
Removing columns ['currency'] because they are of 'Unknown' type
Using default limit of max_batches=2.
*****************************
* Beginning pipeline search *
*****************************
Optimizing for Log Loss Binary.
Lower score is better.
Using SequentialEngine to train and score pipelines.
Searching up to 2 batches for a total of None pipelines.
Allowed model families:
Evaluating Baseline Pipeline: Mode Baseline Binary Classification Pipeline
Mode Baseline Binary Classification Pipeline:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 4.921
Starting holdout set scoring
Finished holdout set scoring - Log Loss Binary: 4.991
*****************************
* Evaluating Batch Number 1 *
*****************************
Random Forest Classifier w/ Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + One Hot Encoder + Oversampler + RF Classifier Select From Model:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.253
Starting holdout set scoring
Finished holdout set scoring - Log Loss Binary: 0.212
*****************************
* Evaluating Batch Number 2 *
*****************************
LightGBM Classifier w/ Label Encoder + Select Columns By Type Transformer + Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + Select Columns Transformer + Select Columns Transformer + Label Encoder + Imputer + One Hot Encoder + Oversampler:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.300
Starting holdout set scoring
Finished holdout set scoring - Log Loss Binary: 0.161
Extra Trees Classifier w/ Label Encoder + Select Columns By Type Transformer + Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + Select Columns Transformer + Select Columns Transformer + Label Encoder + Imputer + One Hot Encoder + Oversampler:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.355
Starting holdout set scoring
Finished holdout set scoring - Log Loss Binary: 0.345
Elastic Net Classifier w/ Label Encoder + Select Columns By Type Transformer + Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + Standard Scaler + Select Columns Transformer + Select Columns Transformer + Label Encoder + Imputer + One Hot Encoder + Standard Scaler + Oversampler:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.375
Starting holdout set scoring
Finished holdout set scoring - Log Loss Binary: 0.401
XGBoost Classifier w/ Label Encoder + Select Columns By Type Transformer + Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + Select Columns Transformer + Select Columns Transformer + Label Encoder + Imputer + One Hot Encoder + Oversampler:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.260
Starting holdout set scoring
Finished holdout set scoring - Log Loss Binary: 0.165
Logistic Regression Classifier w/ Label Encoder + Select Columns By Type Transformer + Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + Standard Scaler + Select Columns Transformer + Select Columns Transformer + Label Encoder + Imputer + One Hot Encoder + Standard Scaler + Oversampler:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.374
Starting holdout set scoring
Finished holdout set scoring - Log Loss Binary: 0.403
Search finished after 29.44 seconds
Best pipeline: LightGBM Classifier w/ Label Encoder + Select Columns By Type Transformer + Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + Select Columns Transformer + Select Columns Transformer + Label Encoder + Imputer + One Hot Encoder + Oversampler
Best pipeline Log Loss Binary: 0.160955
[5]:
{1: {'Random Forest Classifier w/ Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + One Hot Encoder + Oversampler + RF Classifier Select From Model': 5.261218070983887,
'Total time of batch': 5.392166376113892},
2: {'LightGBM Classifier w/ Label Encoder + Select Columns By Type Transformer + Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + Select Columns Transformer + Select Columns Transformer + Label Encoder + Imputer + One Hot Encoder + Oversampler': 3.3102989196777344,
'Extra Trees Classifier w/ Label Encoder + Select Columns By Type Transformer + Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + Select Columns Transformer + Select Columns Transformer + Label Encoder + Imputer + One Hot Encoder + Oversampler': 5.022824048995972,
'Elastic Net Classifier w/ Label Encoder + Select Columns By Type Transformer + Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + Standard Scaler + Select Columns Transformer + Select Columns Transformer + Label Encoder + Imputer + One Hot Encoder + Standard Scaler + Oversampler': 4.690482139587402,
'XGBoost Classifier w/ Label Encoder + Select Columns By Type Transformer + Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + Select Columns Transformer + Select Columns Transformer + Label Encoder + Imputer + One Hot Encoder + Oversampler': 3.5465054512023926,
'Logistic Regression Classifier w/ Label Encoder + Select Columns By Type Transformer + Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + Standard Scaler + Select Columns Transformer + Select Columns Transformer + Label Encoder + Imputer + One Hot Encoder + Standard Scaler + Oversampler': 5.883444547653198,
'Total time of batch': 23.247490644454956}}
With the verbose argument set to True, the AutoML search will log its progress, reporting each pipeline and parameter set evaluated during the search. The search iteration plot shown during an AutoML search tracks the current pipeline's validation score (grey points) against the best pipeline validation score so far (blue line).
There are several mechanisms for controlling the AutoML search time. One way is to set the max_batches parameter, which controls the maximum number of rounds of AutoML to evaluate, where each round may train and score a variable number of pipelines. Another way is to set the max_iterations parameter, which controls the maximum number of candidate models to evaluate during AutoML. By default, AutoML will search a single batch. The first pipeline to be evaluated is always a baseline model representing a trivial solution.
The AutoML interface supports a variety of other parameters. For a comprehensive list, please refer to the API Reference.
We also provide a standalone search method which does all of the above in a single line, and returns the AutoMLSearch instance and data check results. If there were data check errors, AutoML will not be run and no AutoMLSearch instance will be returned.
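As a sketch, that one-liner (using the top-level evalml.automl.search function, whose calling convention mirrors search_iterative, shown later in this guide) might look like:
from evalml.automl import search

# Runs the default data checks and then AutoMLSearch in one call. If the data
# checks return errors, no search is run and automl will be None.
automl, data_check_results = search(X_train, y_train, problem_type="binary")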
Detecting Problem Type#
EvalML includes a simple method, detect_problem_type, to help determine the problem type given the target data.
This function can return the predicted problem type as a ProblemType enum, choosing from ProblemType.BINARY, ProblemType.MULTICLASS, and ProblemType.REGRESSION. If the target data is invalid (for instance, when there is only 1 unique label), the function will throw an error instead.
[6]:
import pandas as pd
from evalml.problem_types import detect_problem_type
y_binary = pd.Series([0, 1, 1, 0, 1, 1])
detect_problem_type(y_binary)
[6]:
<ProblemTypes.BINARY: 'binary'>
Objective parameter#
AutoMLSearch takes in an objective parameter to determine which objective to optimize for. By default, this parameter is set to auto, which allows AutoML to choose LogLossBinary for binary classification problems, LogLossMulticlass for multiclass classification problems, and R2 for regression problems.
It should be noted that the objective parameter is only used in ranking and helping choose which pipelines to iterate over; it is not used to optimize each individual pipeline during fitting.
To get the default objective for each problem type, you can use the get_default_primary_search_objective function.
[7]:
from evalml.automl import get_default_primary_search_objective
binary_objective = get_default_primary_search_objective("binary")
multiclass_objective = get_default_primary_search_objective("multiclass")
regression_objective = get_default_primary_search_objective("regression")
print(binary_objective.name)
print(multiclass_objective.name)
print(regression_objective.name)
Log Loss Binary
Log Loss Multiclass
R2
Using custom pipelines#
EvalML's AutoML algorithm generates a set of pipelines to search with. To provide a custom set instead, set allowed_component_graphs to a dictionary of custom component graphs. AutoMLSearch will use these to generate Pipeline instances. Note: this will prevent AutoML from generating other pipelines to search over.
[8]:
from evalml.pipelines import MulticlassClassificationPipeline
automl_custom = evalml.automl.AutoMLSearch(
    X_train=X_train,
    y_train=y_train,
    problem_type="multiclass",
    verbose=True,
    allowed_component_graphs={
        "My_pipeline": ["Simple Imputer", "Random Forest Classifier"],
        "My_other_pipeline": ["One Hot Encoder", "Random Forest Classifier"],
    },
)
AutoMLSearch will use mean CV score to rank pipelines.
Removing columns ['currency'] because they are of 'Unknown' type
Using default limit of max_batches=2.
Callback functions#
AutoMLSearch supports several callback functions, which can be specified as parameters when initializing an AutoMLSearch object. They are:
start_iteration_callback
add_result_callback
error_callback
Start Iteration Callback#
Users can set start_iteration_callback to specify what function is called before each pipeline training iteration. This callback function must take three positional parameters: the pipeline class, the pipeline parameters, and the AutoMLSearch object.
[9]:
## start_iteration_callback example function
def start_iteration_callback_example(pipeline_class, pipeline_params, automl_obj):
    print("Training pipeline with the following parameters:", pipeline_params)
Add Result Callback#
Users can set add_result_callback to specify what function is called after each pipeline training iteration. This callback function must take three positional parameters: a dictionary containing the training results for the new pipeline, an untrained_pipeline containing the parameters used during training, and the AutoMLSearch object.
[10]:
## add_result_callback example function
def add_result_callback_example(pipeline_results_dict, untrained_pipeline, automl_obj):
    print(
        "Results for trained pipeline with the following parameters:",
        pipeline_results_dict,
    )
Error Callback#
Users can set error_callback to specify what function is called when search() errors and raises an Exception. This callback function takes three positional parameters: the Exception raised, the traceback, and the AutoMLSearch object. It must also accept kwargs, so that AutoMLSearch is able to pass along any other parameters used by default.
EvalML defines several error callback functions, which can be found under evalml.automl.callbacks. They are:
silent_error_callback
raise_error_callback
log_and_save_error_callback
raise_and_save_error_callback
log_error_callback (the default used when error_callback is None)
[11]:
# error_callback example; this is implemented in the evalml library
def raise_error_callback(exception, traceback, automl, **kwargs):
    """Raises the exception thrown by the AutoMLSearch object. Also logs the exception as an error."""
    logger.error(f"AutoMLSearch raised a fatal exception: {str(exception)}")
    logger.error("\n".join(traceback))
    raise exception
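As a usage sketch, the example callbacks defined above would be wired into AutoMLSearch like so:
automl_with_callbacks = evalml.automl.AutoMLSearch(
    X_train=X_train,
    y_train=y_train,
    problem_type="binary",
    start_iteration_callback=start_iteration_callback_example,
    add_result_callback=add_result_callback_example,
    error_callback=raise_error_callback,
)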
View Rankings#
A summary of all the pipelines built can be returned as a pandas DataFrame, sorted by the validation score.
For AutoML searches completed with a holdout set, the validation score is the holdout score of the pipeline fitted using the entire training dataset.
For AutoML searches completed without a holdout set, the validation score is the average score across all cross-validation folds.
[12]:
automl.rankings
[12]:
| | id | pipeline_name | search_order | ranking_score | holdout_score | mean_cv_score | standard_deviation_cv_score | percent_better_than_baseline | high_variance_cv | parameters |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | LightGBM Classifier w/ Label Encoder + Select ... | 2 | 0.160955 | 0.160955 | 0.299971 | 0.206176 | 93.904575 | False | {'Label Encoder': {'positive_label': None}, 'N... |
| 1 | 5 | XGBoost Classifier w/ Label Encoder + Select C... | 5 | 0.165049 | 0.165049 | 0.260214 | 0.148578 | 94.712440 | False | {'Label Encoder': {'positive_label': None}, 'N... |
| 2 | 1 | Random Forest Classifier w/ Label Encoder + Dr... | 1 | 0.212356 | 0.212356 | 0.253361 | 0.045592 | 94.851683 | False | {'Label Encoder': {'positive_label': None}, 'D... |
| 3 | 3 | Extra Trees Classifier w/ Label Encoder + Sele... | 3 | 0.344941 | 0.344941 | 0.354802 | 0.028195 | 92.790401 | False | {'Label Encoder': {'positive_label': None}, 'N... |
| 4 | 4 | Elastic Net Classifier w/ Label Encoder + Sele... | 4 | 0.401431 | 0.401431 | 0.374964 | 0.045768 | 92.380720 | False | {'Label Encoder': {'positive_label': None}, 'N... |
| 5 | 6 | Logistic Regression Classifier w/ Label Encode... | 6 | 0.403180 | 0.403180 | 0.374419 | 0.045999 | 92.391778 | False | {'Label Encoder': {'positive_label': None}, 'N... |
| 6 | 0 | Mode Baseline Binary Classification Pipeline | 0 | 4.990660 | 4.990660 | 4.921248 | 0.112910 | 0.000000 | False | {'Label Encoder': {'positive_label': None}, 'B... |
Recommendation Score#
If you would like a more robust evaluation of the performance of your models, EvalML provides a recommendation score in addition to the selected objective. The recommendation score is a weighted average of several default objectives for your problem type, normalized and scaled so that the final score can be interpreted as a percentage from 0 to 100. This weighted score provides a more holistic understanding of model performance, and prioritizes model generalizability over a single objective that may not completely serve your use case.
[13]:
automl.get_recommendation_scores(use_pipeline_names=True)
[13]:
{'Baseline Classifier': 25.0,
'Random Forest Classifier': 92.70765199012199,
'LightGBM Classifier': 91.29441485901573,
'Extra Trees Classifier': 75.89652715271133,
'Elastic Net Classifier': 64.71178271206128,
'XGBoost Classifier': 90.70346746264624,
'Logistic Regression Classifier': 64.65342052603951}
[14]:
automl.get_recommendation_scores(priority="F1", use_pipeline_names=True)
[14]:
{'Baseline Classifier': 16.666666666666664,
'Random Forest Classifier': 92.10813162977828,
'LightGBM Classifier': 90.02960990601049,
'Extra Trees Classifier': 68.7795029502924,
'Elastic Net Classifier': 53.58295554298061,
'XGBoost Classifier': 89.63564497509749,
'Logistic Regression Classifier': 53.5440474189661}
To see which objectives are included in the recommendation score, you can use:
[15]:
evalml.objectives.get_default_recommendation_objectives("binary")
[15]:
{'AUC', 'Balanced Accuracy Binary', 'F1', 'Log Loss Binary'}
If you would like AutoMLSearch to automatically rank your pipelines by this recommendation score, you can set use_recommendation=True when initializing AutoMLSearch.
[16]:
automl_recommendation = evalml.automl.AutoMLSearch(
    X_train=X_train,
    y_train=y_train,
    X_holdout=X_holdout,
    y_holdout=y_holdout,
    problem_type="binary",
    use_recommendation=True,
)
automl_recommendation.search(interactive_plot=False)
automl_recommendation.rankings[
    [
        "id",
        "pipeline_name",
        "search_order",
        "recommendation_score",
        "holdout_score",
        "mean_cv_score",
    ]
]
[16]:
| | id | pipeline_name | search_order | recommendation_score | holdout_score | mean_cv_score |
|---|---|---|---|---|---|---|
| 0 | 1 | Random Forest Classifier w/ Label Encoder + Dr... | 1 | 92.707652 | 0.212356 | 0.253361 |
| 1 | 2 | LightGBM Classifier w/ Label Encoder + Select ... | 2 | 91.294415 | 0.160955 | 0.299971 |
| 2 | 5 | XGBoost Classifier w/ Label Encoder + Select C... | 5 | 90.703467 | 0.165049 | 0.260214 |
| 3 | 3 | Extra Trees Classifier w/ Label Encoder + Sele... | 3 | 75.896527 | 0.344941 | 0.354802 |
| 4 | 4 | Elastic Net Classifier w/ Label Encoder + Sele... | 4 | 64.711783 | 0.401431 | 0.374964 |
| 5 | 6 | Logistic Regression Classifier w/ Label Encode... | 6 | 64.653421 | 0.403180 | 0.374419 |
| 6 | 0 | Mode Baseline Binary Classification Pipeline | 0 | 25.000000 | 4.990660 | 4.921248 |
There is a helper function on the AutoMLSearch object that can help you understand how the recommendation score was calculated. It displays the raw objective scores included in the score's calculation. Here, we look at the pipeline with id=3, the Extra Trees pipeline.
[17]:
automl_recommendation.get_recommendation_score_breakdown(3)
[17]:
{'F1': 0.5454545454545454,
'AUC': 0.8363095238095238,
'Log Loss Binary': 0.34494076832526754,
'Balanced Accuracy Binary': 0.7232142857142857}
Describe Pipeline#
Each pipeline is given an id. We can get more information about any particular pipeline using that id. Here, we will get more information about the pipeline with id = 1.
[18]:
automl.describe_pipeline(1)
**************************************************************************************************************************************************************************
* Random Forest Classifier w/ Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + One Hot Encoder + Oversampler + RF Classifier Select From Model *
**************************************************************************************************************************************************************************
Problem Type: binary
Model Family: Random Forest
Pipeline Steps
==============
1. Label Encoder
* positive_label : None
2. Drop Columns Transformer
* columns : ['currency']
3. DateTime Featurizer
* features_to_extract : ['year', 'month', 'day_of_week', 'hour']
* encode_as_categories : False
* time_index : None
4. Imputer
* categorical_impute_strategy : most_frequent
* numeric_impute_strategy : mean
* boolean_impute_strategy : most_frequent
* categorical_fill_value : None
* numeric_fill_value : None
* boolean_fill_value : None
5. One Hot Encoder
* top_n : 10
* features_to_encode : None
* categories : None
* drop : if_binary
* handle_unknown : ignore
* handle_missing : error
6. Oversampler
* sampling_ratio : 0.25
* k_neighbors_default : 5
* n_jobs : -1
* sampling_ratio_dict : None
* categorical_features : [3, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49]
* k_neighbors : 5
7. RF Classifier Select From Model
* number_features : None
* n_estimators : 10
* max_depth : None
* percent_features : 0.5
* threshold : median
* n_jobs : -1
8. Random Forest Classifier
* n_estimators : 100
* max_depth : 6
* n_jobs : -1
Training
========
Training for binary problems.
Total training time (including CV): 5.2 seconds
Cross Validation
----------------
Log Loss Binary MCC Binary Gini AUC Precision F1 Balanced Accuracy Binary Accuracy Binary # Training # Validation
0 0.241 0.768 0.842 0.921 0.895 0.791 0.848 0.948 346 174
1 0.304 0.524 0.535 0.768 1.000 0.467 0.652 0.908 347 173
2 0.215 0.875 0.849 0.924 1.000 0.884 0.896 0.971 347 173
mean 0.253 0.723 0.742 0.871 0.965 0.714 0.799 0.942 - -
std 0.046 0.180 0.179 0.090 0.061 0.219 0.129 0.032 - -
coef of var 0.180 0.249 0.242 0.103 0.063 0.307 0.162 0.034 - -
Get Pipeline#
We can also get the object of any pipeline via its id:
[19]:
pipeline = automl.get_pipeline(1)
print(pipeline.name)
print(pipeline.parameters)
Random Forest Classifier w/ Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + One Hot Encoder + Oversampler + RF Classifier Select From Model
{'Label Encoder': {'positive_label': None}, 'Drop Columns Transformer': {'columns': ['currency']}, 'DateTime Featurizer': {'features_to_extract': ['year', 'month', 'day_of_week', 'hour'], 'encode_as_categories': False, 'time_index': None}, 'Imputer': {'categorical_impute_strategy': 'most_frequent', 'numeric_impute_strategy': 'mean', 'boolean_impute_strategy': 'most_frequent', 'categorical_fill_value': None, 'numeric_fill_value': None, 'boolean_fill_value': None}, 'One Hot Encoder': {'top_n': 10, 'features_to_encode': None, 'categories': None, 'drop': 'if_binary', 'handle_unknown': 'ignore', 'handle_missing': 'error'}, 'Oversampler': {'sampling_ratio': 0.25, 'k_neighbors_default': 5, 'n_jobs': -1, 'sampling_ratio_dict': None, 'categorical_features': [3, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], 'k_neighbors': 5}, 'RF Classifier Select From Model': {'number_features': None, 'n_estimators': 10, 'max_depth': None, 'percent_features': 0.5, 'threshold': 'median', 'n_jobs': -1}, 'Random Forest Classifier': {'n_estimators': 100, 'max_depth': 6, 'n_jobs': -1}}
Get best pipeline#
If you specifically want to get the best pipeline, there is a convenient accessor for that. The pipeline returned is already fitted on the input X, y data that we passed to AutoMLSearch. To turn off this default behavior, set train_best_pipeline=False when initializing AutoMLSearch.
[20]:
best_pipeline = automl.best_pipeline
print(best_pipeline.name)
print(best_pipeline.parameters)
best_pipeline.predict(X_train)
LightGBM Classifier w/ Label Encoder + Select Columns By Type Transformer + Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + Select Columns Transformer + Select Columns Transformer + Label Encoder + Imputer + One Hot Encoder + Oversampler
{'Label Encoder': {'positive_label': None}, 'Numeric Pipeline - Select Columns By Type Transformer': {'column_types': ['category', 'EmailAddress', 'URL'], 'exclude': True}, 'Numeric Pipeline - Label Encoder': {'positive_label': None}, 'Numeric Pipeline - Drop Columns Transformer': {'columns': ['currency']}, 'Numeric Pipeline - DateTime Featurizer': {'features_to_extract': ['year', 'month', 'day_of_week', 'hour'], 'encode_as_categories': False, 'time_index': None}, 'Numeric Pipeline - Imputer': {'categorical_impute_strategy': 'most_frequent', 'numeric_impute_strategy': 'mean', 'boolean_impute_strategy': 'most_frequent', 'categorical_fill_value': None, 'numeric_fill_value': None, 'boolean_fill_value': None}, 'Numeric Pipeline - Select Columns Transformer': {'columns': ['card_id', 'store_id', 'amount', 'customer_present', 'lat', 'lng', 'datetime_month', 'datetime_day_of_week', 'datetime_hour']}, 'Categorical Pipeline - Select Columns Transformer': {'columns': ['expiration_date', 'provider', 'region', 'country']}, 'Categorical Pipeline - Label Encoder': {'positive_label': None}, 'Categorical Pipeline - Imputer': {'categorical_impute_strategy': 'most_frequent', 'numeric_impute_strategy': 'mean', 'boolean_impute_strategy': 'most_frequent', 'categorical_fill_value': None, 'numeric_fill_value': None, 'boolean_fill_value': None}, 'Categorical Pipeline - One Hot Encoder': {'top_n': 10, 'features_to_encode': None, 'categories': None, 'drop': 'if_binary', 'handle_unknown': 'ignore', 'handle_missing': 'error'}, 'Oversampler': {'sampling_ratio': 0.25, 'k_neighbors_default': 5, 'n_jobs': -1, 'sampling_ratio_dict': None, 'categorical_features': [3, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48], 'k_neighbors': 5}, 'LightGBM Classifier': {'boosting_type': 'gbdt', 'learning_rate': 0.1, 'n_estimators': 100, 'max_depth': 0, 'num_leaves': 31, 'min_child_samples': 20, 'n_jobs': -1, 'bagging_freq': 0, 'bagging_fraction': 0.9, 'verbose': -1}}
[20]:
id
144 False
253 True
221 False
432 False
384 False
...
128 False
98 False
472 False
642 False
494 False
Name: fraud, Length: 520, dtype: bool
Training and Scoring Multiple Pipelines using AutoMLSearch#
AutoMLSearch will automatically fit the best pipeline on the entire training data. It also provides an easy API for training and scoring other pipelines.
If you'd like to train one or more pipelines on the entire training data, you can use the train_pipelines method.
Similarly, if you'd like to score one or more pipelines on a particular dataset, you can use the score_pipelines method.
[21]:
trained_pipelines = automl.train_pipelines([automl.get_pipeline(i) for i in [0, 1, 2]])
trained_pipelines
[21]:
{'Mode Baseline Binary Classification Pipeline': pipeline = BinaryClassificationPipeline(component_graph={'Label Encoder': ['Label Encoder', 'X', 'y'], 'Baseline Classifier': ['Baseline Classifier', 'Label Encoder.x', 'Label Encoder.y']}, parameters={'Label Encoder':{'positive_label': None}, 'Baseline Classifier':{'strategy': 'mode'}}, custom_name='Mode Baseline Binary Classification Pipeline', random_seed=0),
'Random Forest Classifier w/ Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + One Hot Encoder + Oversampler + RF Classifier Select From Model': pipeline = BinaryClassificationPipeline(component_graph={'Label Encoder': ['Label Encoder', 'X', 'y'], 'Drop Columns Transformer': ['Drop Columns Transformer', 'X', 'Label Encoder.y'], 'DateTime Featurizer': ['DateTime Featurizer', 'Drop Columns Transformer.x', 'Label Encoder.y'], 'Imputer': ['Imputer', 'DateTime Featurizer.x', 'Label Encoder.y'], 'One Hot Encoder': ['One Hot Encoder', 'Imputer.x', 'Label Encoder.y'], 'Oversampler': ['Oversampler', 'One Hot Encoder.x', 'Label Encoder.y'], 'RF Classifier Select From Model': ['RF Classifier Select From Model', 'Oversampler.x', 'Oversampler.y'], 'Random Forest Classifier': ['Random Forest Classifier', 'RF Classifier Select From Model.x', 'Oversampler.y']}, parameters={'Label Encoder':{'positive_label': None}, 'Drop Columns Transformer':{'columns': ['currency']}, 'DateTime Featurizer':{'features_to_extract': ['year', 'month', 'day_of_week', 'hour'], 'encode_as_categories': False, 'time_index': None}, 'Imputer':{'categorical_impute_strategy': 'most_frequent', 'numeric_impute_strategy': 'mean', 'boolean_impute_strategy': 'most_frequent', 'categorical_fill_value': None, 'numeric_fill_value': None, 'boolean_fill_value': None}, 'One Hot Encoder':{'top_n': 10, 'features_to_encode': None, 'categories': None, 'drop': 'if_binary', 'handle_unknown': 'ignore', 'handle_missing': 'error'}, 'Oversampler':{'sampling_ratio': 0.25, 'k_neighbors_default': 5, 'n_jobs': -1, 'sampling_ratio_dict': None, 'categorical_features': [3, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], 'k_neighbors': 5}, 'RF Classifier Select From Model':{'number_features': None, 'n_estimators': 10, 'max_depth': None, 'percent_features': 0.5, 'threshold': 'median', 'n_jobs': -1}, 'Random Forest Classifier':{'n_estimators': 100, 'max_depth': 6, 'n_jobs': -1}}, random_seed=0),
'LightGBM Classifier w/ Label Encoder + Select Columns By Type Transformer + Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + Select Columns Transformer + Select Columns Transformer + Label Encoder + Imputer + One Hot Encoder + Oversampler': pipeline = BinaryClassificationPipeline(component_graph={'Label Encoder': ['Label Encoder', 'X', 'y'], 'Numeric Pipeline - Select Columns By Type Transformer': ['Select Columns By Type Transformer', 'X', 'Label Encoder.y'], 'Numeric Pipeline - Label Encoder': ['Label Encoder', 'Numeric Pipeline - Select Columns By Type Transformer.x', 'Label Encoder.y'], 'Numeric Pipeline - Drop Columns Transformer': ['Drop Columns Transformer', 'Numeric Pipeline - Select Columns By Type Transformer.x', 'Numeric Pipeline - Label Encoder.y'], 'Numeric Pipeline - DateTime Featurizer': ['DateTime Featurizer', 'Numeric Pipeline - Drop Columns Transformer.x', 'Numeric Pipeline - Label Encoder.y'], 'Numeric Pipeline - Imputer': ['Imputer', 'Numeric Pipeline - DateTime Featurizer.x', 'Numeric Pipeline - Label Encoder.y'], 'Numeric Pipeline - Select Columns Transformer': ['Select Columns Transformer', 'Numeric Pipeline - Imputer.x', 'Numeric Pipeline - Label Encoder.y'], 'Categorical Pipeline - Select Columns Transformer': ['Select Columns Transformer', 'X', 'Label Encoder.y'], 'Categorical Pipeline - Label Encoder': ['Label Encoder', 'Categorical Pipeline - Select Columns Transformer.x', 'Label Encoder.y'], 'Categorical Pipeline - Imputer': ['Imputer', 'Categorical Pipeline - Select Columns Transformer.x', 'Categorical Pipeline - Label Encoder.y'], 'Categorical Pipeline - One Hot Encoder': ['One Hot Encoder', 'Categorical Pipeline - Imputer.x', 'Categorical Pipeline - Label Encoder.y'], 'Oversampler': ['Oversampler', 'Numeric Pipeline - Select Columns Transformer.x', 'Categorical Pipeline - One Hot Encoder.x', 'Categorical Pipeline - Label Encoder.y'], 'LightGBM Classifier': ['LightGBM Classifier', 'Oversampler.x', 'Oversampler.y']}, parameters={'Label Encoder':{'positive_label': None}, 'Numeric Pipeline - Select Columns By Type Transformer':{'column_types': ['category', 'EmailAddress', 'URL'], 'exclude': True}, 'Numeric Pipeline - Label Encoder':{'positive_label': None}, 'Numeric Pipeline - Drop Columns Transformer':{'columns': ['currency']}, 'Numeric Pipeline - DateTime Featurizer':{'features_to_extract': ['year', 'month', 'day_of_week', 'hour'], 'encode_as_categories': False, 'time_index': None}, 'Numeric Pipeline - Imputer':{'categorical_impute_strategy': 'most_frequent', 'numeric_impute_strategy': 'mean', 'boolean_impute_strategy': 'most_frequent', 'categorical_fill_value': None, 'numeric_fill_value': None, 'boolean_fill_value': None}, 'Numeric Pipeline - Select Columns Transformer':{'columns': ['card_id', 'store_id', 'amount', 'customer_present', 'lat', 'lng', 'datetime_month', 'datetime_day_of_week', 'datetime_hour']}, 'Categorical Pipeline - Select Columns Transformer':{'columns': ['expiration_date', 'provider', 'region', 'country']}, 'Categorical Pipeline - Label Encoder':{'positive_label': None}, 'Categorical Pipeline - Imputer':{'categorical_impute_strategy': 'most_frequent', 'numeric_impute_strategy': 'mean', 'boolean_impute_strategy': 'most_frequent', 'categorical_fill_value': None, 'numeric_fill_value': None, 'boolean_fill_value': None}, 'Categorical Pipeline - One Hot Encoder':{'top_n': 10, 'features_to_encode': None, 'categories': None, 'drop': 'if_binary', 'handle_unknown': 'ignore', 'handle_missing': 'error'}, 
'Oversampler':{'sampling_ratio': 0.25, 'k_neighbors_default': 5, 'n_jobs': -1, 'sampling_ratio_dict': None, 'categorical_features': [3, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48], 'k_neighbors': 5}, 'LightGBM Classifier':{'boosting_type': 'gbdt', 'learning_rate': 0.1, 'n_estimators': 100, 'max_depth': 0, 'num_leaves': 31, 'min_child_samples': 20, 'n_jobs': -1, 'bagging_freq': 0, 'bagging_fraction': 0.9, 'verbose': -1}}, random_seed=0)}
[22]:
pipeline_holdout_scores = automl.score_pipelines(
    [trained_pipelines[name] for name in trained_pipelines.keys()],
    X_holdout,
    y_holdout,
    ["Accuracy Binary", "F1", "AUC"],
)
pipeline_holdout_scores
pipeline_holdout_scores
[22]:
{'Mode Baseline Binary Classification Pipeline': OrderedDict([('Accuracy Binary',
0.8615384615384616),
('F1', 0.0),
('AUC', 0.5)]),
'Random Forest Classifier w/ Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + One Hot Encoder + Oversampler + RF Classifier Select From Model': OrderedDict([('Accuracy Binary',
0.9769230769230769),
('F1', 0.9090909090909091),
('AUC', 0.9250992063492064)]),
'LightGBM Classifier w/ Label Encoder + Select Columns By Type Transformer + Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + Select Columns Transformer + Select Columns Transformer + Label Encoder + Imputer + One Hot Encoder + Oversampler': OrderedDict([('Accuracy Binary',
0.9692307692307692),
('F1', 0.875),
('AUC', 0.9201388888888888)])}
Saving AutoMLSearch and pipelines from AutoMLSearch#
There are two ways to save the results of AutoMLSearch.
You can save the AutoMLSearch object itself by calling .save(<filepath>). This will allow you to save the AutoMLSearch state and reload all pipelines from it.
If you want to save a pipeline from AutoMLSearch for future use, pipeline classes themselves have a .save(<filepath>) method.
[23]:
# saving the entire automl search
automl.save("automl.cloudpickle")
automl2 = evalml.automl.AutoMLSearch.load("automl.cloudpickle")
# saving the best pipeline using .save()
best_pipeline.save("pipeline.cloudpickle")
best_pipeline_copy = evalml.pipelines.PipelineBase.load("pipeline.cloudpickle")
Limiting the AutoML Search Space#
The AutoML search algorithm first trains each component in the pipeline with its default values. After the first iteration, it then tweaks the parameters of these components using the predefined hyperparameter ranges that these components have. To limit the search over certain hyperparameter ranges, you can specify a search_parameters argument in the AutoMLSearch parameters. These parameters will limit the hyperparameter search space or the pipeline parameter space.
Hyperparameter ranges can be found through the API reference for each component. Parameters must be specified as dictionaries, and the associated values must be skopt.space Real, Integer, or Categorical objects in order to set hyperparameter ranges.
If, however, you'd like to specify certain values for the initial batch of the AutoML search algorithm, you can use the search_parameters argument with non-skopt.space objects. This will set the initial batch's component parameters to the values passed through this argument.
[24]:
from evalml import AutoMLSearch
from evalml.demos import load_fraud
from skopt.space import Categorical
from evalml.model_family import ModelFamily
import woodwork as ww
X, y = load_fraud(n_rows=1000)
# example of setting parameter to just one value
search_parameters = {"Imputer": {"numeric_impute_strategy": "mean"}}
# limit the numeric impute strategy to include only `median` and `most_frequent`
# `mean` is the default value for this argument, but it doesn't need to be included in the specified hyperparameter range for this to work
search_parameters = {
    "Imputer": {"numeric_impute_strategy": Categorical(["median", "most_frequent"])}
}
# using this custom hyperparameter means that our Imputer components in these pipelines will only search through
# 'median' and 'most_frequent' strategies for 'numeric_impute_strategy'
automl_constrained = AutoMLSearch(
    X_train=X,
    y_train=y,
    problem_type="binary",
    search_parameters=search_parameters,
    verbose=True,
)
Number of Features
Boolean 1
Categorical 6
Numeric 5
Number of training examples: 1000
Targets
False 85.90%
True 14.10%
Name: count, dtype: object
AutoMLSearch will use mean CV score to rank pipelines.
Using default limit of max_batches=2.
/home/docs/checkouts/readthedocs.org/user_builds/feature-labs-inc-evalml/envs/stable/lib/python3.9/site-packages/woodwork/type_sys/utils.py:33: UserWarning:
Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
A skopt.space Integer, Real, or Categorical will set the hyperparameter space to be explored during search. All other values will set the pipeline parameters directly. Setting pipeline parameters directly defines the initialization parameters that the pipeline starts with during the first batch of AutoMLSearch. The hyperparameter range, on the other hand, defines the space of possible new parameter values that the tuner chooses from.
Let's walk through some examples to explain this. For instance,
search_parameters = {'Imputer': {
    'numeric_impute_strategy': 'mean'
}}
then in the initial search, the algorithm will use mean as the impute strategy in batch 1. However, since Imputer.numeric_impute_strategy has a valid hyperparameter range, the algorithm can and will change this value if it suggests a different strategy. To limit the search to mean for the entire duration, the skopt.space form is needed:
search_parameters = {'Imputer': {
    'numeric_impute_strategy': Categorical(['mean'])
}}
However, if a value has no associated hyperparameter range, the algorithm will use this value as the only parameter. For instance,
search_parameters = {'Label Encoder': {
    'positive_label': True
}}
Since Label Encoder.positive_label has no associated hyperparameter range, the algorithm will use this parameter for the entire duration of the search.
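Putting the two behaviors side by side, a sketch of a combined search_parameters dictionary:
from skopt.space import Categorical

search_parameters = {
    # skopt.space object: restricts the hyperparameter range, so the tuner
    # can only ever choose "mean" for the entire search
    "Imputer": {"numeric_impute_strategy": Categorical(["mean"])},
    # plain value with no associated hyperparameter range: used as the only
    # value for the entire search
    "Label Encoder": {"positive_label": True},
}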
Imbalanced Data#
The AutoML search algorithm now has functionality to handle imbalanced data during classification! AutoMLSearch provides two additional parameters, sampler_method and sampler_balanced_ratio, that allow you to let AutoMLSearch know whether to sample imbalanced data, and how to do so. sampler_method takes in Undersampler, Oversampler, auto, or None as the sampler to use, and sampler_balanced_ratio specifies the minority/majority ratio that you want to sample to. Details on the Undersampler and Oversampler components can be found in the documentation.
This can be used for imbalanced datasets, like the fraud dataset, which has a minority:majority ratio of < 0.2.
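For instance, a sketch forcing undersampling rather than the automatic choice:
automl_undersampled = AutoMLSearch(
    X_train=X,
    y_train=y,
    problem_type="binary",
    sampler_method="Undersampler",  # one of "Undersampler", "Oversampler", "auto", or None
    sampler_balanced_ratio=0.25,  # sample toward a 0.25 minority/majority ratio
)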
[25]:
automl_auto = AutoMLSearch(
    X_train=X, y_train=y, problem_type="binary", automl_algorithm="iterative"
)
automl_auto.allowed_pipelines[-1]
[25]:
pipeline = BinaryClassificationPipeline(component_graph={'Label Encoder': ['Label Encoder', 'X', 'y'], 'DateTime Featurizer': ['DateTime Featurizer', 'X', 'Label Encoder.y'], 'Imputer': ['Imputer', 'DateTime Featurizer.x', 'Label Encoder.y'], 'One Hot Encoder': ['One Hot Encoder', 'Imputer.x', 'Label Encoder.y'], 'Oversampler': ['Oversampler', 'One Hot Encoder.x', 'Label Encoder.y'], 'Extra Trees Classifier': ['Extra Trees Classifier', 'Oversampler.x', 'Oversampler.y']}, parameters={'Label Encoder':{'positive_label': None}, 'DateTime Featurizer':{'features_to_extract': ['year', 'month', 'day_of_week', 'hour'], 'encode_as_categories': False, 'time_index': None}, 'Imputer':{'categorical_impute_strategy': 'most_frequent', 'numeric_impute_strategy': 'mean', 'boolean_impute_strategy': 'most_frequent', 'categorical_fill_value': None, 'numeric_fill_value': None, 'boolean_fill_value': None}, 'One Hot Encoder':{'top_n': 10, 'features_to_encode': None, 'categories': None, 'drop': 'if_binary', 'handle_unknown': 'ignore', 'handle_missing': 'error'}, 'Oversampler':{'sampling_ratio': 0.25, 'k_neighbors_default': 5, 'n_jobs': -1, 'sampling_ratio_dict': None}, 'Extra Trees Classifier':{'n_estimators': 100, 'max_features': 'sqrt', 'max_depth': 6, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_jobs': -1}}, random_seed=0)
The Oversampler is chosen as the default sampling component here, since sampler_balanced_ratio = 0.25. If you specified a lower ratio, say sampler_balanced_ratio = 0.1, then no sampling component would be added here. This is because if a ratio of 0.1 is considered balanced, then a ratio of 0.2 is also balanced.
The Oversampler uses SMOTE under the hood, and automatically selects whether to use SMOTE, SMOTEN, or SMOTENC based on the data it receives.
[26]:
automl_auto_ratio = AutoMLSearch(
    X_train=X,
    y_train=y,
    problem_type="binary",
    sampler_balanced_ratio=0.1,
    automl_algorithm="iterative",
)
automl_auto_ratio.allowed_pipelines[-1]
[26]:
pipeline = BinaryClassificationPipeline(component_graph={'Label Encoder': ['Label Encoder', 'X', 'y'], 'DateTime Featurizer': ['DateTime Featurizer', 'X', 'Label Encoder.y'], 'Imputer': ['Imputer', 'DateTime Featurizer.x', 'Label Encoder.y'], 'One Hot Encoder': ['One Hot Encoder', 'Imputer.x', 'Label Encoder.y'], 'Extra Trees Classifier': ['Extra Trees Classifier', 'One Hot Encoder.x', 'Label Encoder.y']}, parameters={'Label Encoder':{'positive_label': None}, 'DateTime Featurizer':{'features_to_extract': ['year', 'month', 'day_of_week', 'hour'], 'encode_as_categories': False, 'time_index': None}, 'Imputer':{'categorical_impute_strategy': 'most_frequent', 'numeric_impute_strategy': 'mean', 'boolean_impute_strategy': 'most_frequent', 'categorical_fill_value': None, 'numeric_fill_value': None, 'boolean_fill_value': None}, 'One Hot Encoder':{'top_n': 10, 'features_to_encode': None, 'categories': None, 'drop': 'if_binary', 'handle_unknown': 'ignore', 'handle_missing': 'error'}, 'Extra Trees Classifier':{'n_estimators': 100, 'max_features': 'sqrt', 'max_depth': 6, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_jobs': -1}}, random_seed=0)
Additionally, you can add more fine-grained sampling ratios by passing in a sampling_ratio_dict in the pipeline parameters. For this dictionary, AutoMLSearch requires the keys to be integer values from 0 to n-1 for the classes, and the values to be the sampler_balanced_ratio associated with each target. This dictionary overrides the AutoML argument sampler_balanced_ratio. Below, you can see the scenario for the Oversampler component on this dataset. Note that the logic for Undersamplers is included in the commented-out section.
[27]:
# In this case, the majority class is the negative class
# for the oversampler, we don't want to oversample this class, so class 0 (majority) will have a ratio of 1 to itself
# for the minority class 1, we want to oversample it to have a minority/majority ratio of 0.5, which means we want the minority to have 1/2 the samples of the majority
sampler_ratio_dict = {0: 1, 1: 0.5}
search_parameters = {"Oversampler": {"sampler_balanced_ratio": sampler_ratio_dict}}
automl_auto_ratio_dict = AutoMLSearch(
    X_train=X,
    y_train=y,
    problem_type="binary",
    search_parameters=search_parameters,
    automl_algorithm="iterative",
)
automl_auto_ratio_dict.allowed_pipelines[-1]

# Undersampler case
# we don't want to undersample this class, so class 1 (minority) will have a ratio of 1 to itself
# for the majority class 0, we want to undersample it to have a minority/majority ratio of 0.5, which means we want the majority to have 2x the samples of the minority
# sampler_ratio_dict = {0: 0.5, 1: 1}
# search_parameters = {"Undersampler": {"sampler_balanced_ratio": sampler_ratio_dict}}
# automl_auto_ratio_dict = AutoMLSearch(X_train=X, y_train=y, problem_type='binary', search_parameters=search_parameters)
[27]:
pipeline = BinaryClassificationPipeline(component_graph={'Label Encoder': ['Label Encoder', 'X', 'y'], 'DateTime Featurizer': ['DateTime Featurizer', 'X', 'Label Encoder.y'], 'Imputer': ['Imputer', 'DateTime Featurizer.x', 'Label Encoder.y'], 'One Hot Encoder': ['One Hot Encoder', 'Imputer.x', 'Label Encoder.y'], 'Oversampler': ['Oversampler', 'One Hot Encoder.x', 'Label Encoder.y'], 'Extra Trees Classifier': ['Extra Trees Classifier', 'Oversampler.x', 'Oversampler.y']}, parameters={'Label Encoder':{'positive_label': None}, 'DateTime Featurizer':{'features_to_extract': ['year', 'month', 'day_of_week', 'hour'], 'encode_as_categories': False, 'time_index': None}, 'Imputer':{'categorical_impute_strategy': 'most_frequent', 'numeric_impute_strategy': 'mean', 'boolean_impute_strategy': 'most_frequent', 'categorical_fill_value': None, 'numeric_fill_value': None, 'boolean_fill_value': None}, 'One Hot Encoder':{'top_n': 10, 'features_to_encode': None, 'categories': None, 'drop': 'if_binary', 'handle_unknown': 'ignore', 'handle_missing': 'error'}, 'Oversampler':{'sampling_ratio': 0.25, 'k_neighbors_default': 5, 'n_jobs': -1, 'sampling_ratio_dict': None, 'sampler_balanced_ratio': {0: 1, 1: 0.5}}, 'Extra Trees Classifier':{'n_estimators': 100, 'max_features': 'sqrt', 'max_depth': 6, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_jobs': -1}}, random_seed=0)
Adding ensemble methods to AutoML#
Stacking#
Stacking is an ensemble machine learning algorithm that involves training a model to best combine the predictions from multiple base learning algorithms. First, each base learning algorithm is trained using the given data. Then, the combining algorithm or meta-learner is trained on the predictions made by those base learning algorithms to make a final prediction.
AutoML enables stacking using the ensembling flag during initialization; this is set to False by default. How ensembling runs is defined by the AutoML algorithm you choose. In IterativeAlgorithm, the stacking ensemble pipeline runs in its own batch after a whole cycle of training has occurred (that is, after each allowed pipeline trains for one batch). Note that this means a large number of iterations may need to run before the stacking ensemble runs. It is also important to note that only the first CV fold is calculated for stacking ensembles, because the model internally uses CV folds. See the AutoML Algorithms section below for how DefaultAlgorithm runs ensembling. Please note that ensembling is currently unavailable for time series problems.
[28]:
X, y = evalml.demos.load_breast_cancer()
automl_with_ensembling = AutoMLSearch(
    X_train=X,
    y_train=y,
    problem_type="binary",
    allowed_model_families=[ModelFamily.LINEAR_MODEL],
    max_batches=4,
    ensembling=True,
    automl_algorithm="iterative",
    verbose=True,
)
automl_with_ensembling.search(interactive_plot=False)
Number of Features
Numeric 30
Number of training examples: 569
Targets
benign 62.74%
malignant 37.26%
Name: count, dtype: object
AutoMLSearch will use mean CV score to rank pipelines.
Generating pipelines to search over...
Ensembling will run every 3 batches.
2 pipelines ready for search.
*****************************
* Beginning pipeline search *
*****************************
Optimizing for Log Loss Binary.
Lower score is better.
Using SequentialEngine to train and score pipelines.
Searching up to 4 batches for a total of 14 pipelines.
Allowed model families: linear_model, linear_model
Evaluating Baseline Pipeline: Mode Baseline Binary Classification Pipeline
/home/docs/checkouts/readthedocs.org/user_builds/feature-labs-inc-evalml/envs/stable/lib/python3.9/site-packages/woodwork/type_sys/utils.py:33: UserWarning:
Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
Mode Baseline Binary Classification Pipeline:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 13.429
*****************************
* Evaluating Batch Number 1 *
*****************************
Elastic Net Classifier w/ Label Encoder + Imputer + Standard Scaler:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.077
Logistic Regression Classifier w/ Label Encoder + Imputer + Standard Scaler:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.077
*****************************
* Evaluating Batch Number 2 *
*****************************
Logistic Regression Classifier w/ Label Encoder + Imputer + Standard Scaler:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.090
Logistic Regression Classifier w/ Label Encoder + Imputer + Standard Scaler:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.085
Logistic Regression Classifier w/ Label Encoder + Imputer + Standard Scaler:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.081
Logistic Regression Classifier w/ Label Encoder + Imputer + Standard Scaler:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.097
Logistic Regression Classifier w/ Label Encoder + Imputer + Standard Scaler:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.093
*****************************
* Evaluating Batch Number 3 *
*****************************
Elastic Net Classifier w/ Label Encoder + Imputer + Standard Scaler:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.075
Elastic Net Classifier w/ Label Encoder + Imputer + Standard Scaler:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.076
Elastic Net Classifier w/ Label Encoder + Imputer + Standard Scaler:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.075
Elastic Net Classifier w/ Label Encoder + Imputer + Standard Scaler:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.079
Elastic Net Classifier w/ Label Encoder + Imputer + Standard Scaler:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.075
*****************************
* Evaluating Batch Number 4 *
*****************************
Stacked Ensemble Classification Pipeline:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.103
Search finished after 19.55 seconds
Best pipeline: Elastic Net Classifier w/ Label Encoder + Imputer + Standard Scaler
Best pipeline Log Loss Binary: 0.075391
[28]:
{1: {'Elastic Net Classifier w/ Label Encoder + Imputer + Standard Scaler': 1.3139324188232422,
'Logistic Regression Classifier w/ Label Encoder + Imputer + Standard Scaler': 1.4192306995391846,
'Total time of batch': 2.940805196762085},
2: {'Logistic Regression Classifier w/ Label Encoder + Imputer + Standard Scaler': 1.4436862468719482,
'Total time of batch': 7.696403741836548},
3: {'Elastic Net Classifier w/ Label Encoder + Imputer + Standard Scaler': 1.2711994647979736,
'Total time of batch': 7.111368894577026},
4: {'Stacked Ensemble Classification Pipeline': 1.1113545894622803,
'Total time of batch': 1.2259869575500488}}
By calling .describe(), we can view more information about the best pipeline. Note that in this run the Elastic Net pipeline outperformed the stacked ensemble, so it is the best pipeline here.
[29]:
automl_with_ensembling.best_pipeline.describe()
***********************************************************************
* Elastic Net Classifier w/ Label Encoder + Imputer + Standard Scaler *
***********************************************************************
Problem Type: binary
Model Family: Linear
Number of features: 30
Pipeline Steps
==============
1. Label Encoder
* positive_label : None
2. Imputer
* categorical_impute_strategy : most_frequent
* numeric_impute_strategy : knn
* boolean_impute_strategy : most_frequent
* categorical_fill_value : None
* numeric_fill_value : None
* boolean_fill_value : None
3. Standard Scaler
4. Elastic Net Classifier
* penalty : elasticnet
* C : 8.474044870453413
* l1_ratio : 0.6235636967859725
* n_jobs : -1
* multi_class : auto
* solver : saga
AutoML Algorithms#
EvalML currently has two algorithms available for users to choose from. Below, we will run through how each algorithm works and how to access them through AutoMLSearch and the top-level search methods.
IterativeAlgorithm#
IterativeAlgorithm#
IterativeAlgorithm is the first AutoML algorithm created in EvalML and can be accessed with the search_iterative method or by specifying AutoMLSearch(automl_algorithm='iterative'). The algorithm works as follows:
Every batch (after the initial baseline model) contains pipelines of all of the available estimators for the specified problem type
Pipelines contain the preprocessing (imputing, encoding, etc.) needed for machine learning, but no feature selection is applied
Ensembling can be turned on by passing in the ensembling=True parameter, and runs after a whole cycle of training has occurred (each allowed pipeline trains for one batch)
[30]:
import evalml
X, y = evalml.demos.load_fraud(n_rows=250)
Number of Features
Boolean 1
Categorical 6
Numeric 5
Number of training examples: 250
Targets
False 88.40%
True 11.60%
Name: count, dtype: object
/home/docs/checkouts/readthedocs.org/user_builds/feature-labs-inc-evalml/envs/stable/lib/python3.9/site-packages/woodwork/type_sys/utils.py:33: UserWarning:
Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
[31]:
from evalml.automl import search_iterative
# top level search method will run `AutoMLSearch` with `IterativeAlgorithm` as well as apply our default data checks
auto_iterative, messages_iterative = search_iterative(X, y, problem_type="binary")
[32]:
from evalml import AutoMLSearch
auto_iterative = AutoMLSearch(
X_train=X,
y_train=y,
problem_type="binary",
automl_algorithm="iterative",
verbose=True,
)
auto_iterative.search(interactive_plot=False)
AutoMLSearch will use mean CV score to rank pipelines.
Removing columns ['currency', 'expiration_date'] because they are of 'Unknown' type
Generating pipelines to search over...
6 pipelines ready for search.
Using default limit of max_batches=1.
*****************************
* Beginning pipeline search *
*****************************
Optimizing for Log Loss Binary.
Lower score is better.
Using SequentialEngine to train and score pipelines.
Searching up to 1 batches for a total of None pipelines.
Allowed model families: linear_model, linear_model, xgboost, lightgbm, random_forest, extra_trees
Evaluating Baseline Pipeline: Mode Baseline Binary Classification Pipeline
Mode Baseline Binary Classification Pipeline:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 4.181
*****************************
* Evaluating Batch Number 1 *
*****************************
Elastic Net Classifier w/ Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + One Hot Encoder + Oversampler + Standard Scaler:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.426
Logistic Regression Classifier w/ Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + One Hot Encoder + Oversampler + Standard Scaler:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.425
XGBoost Classifier w/ Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + One Hot Encoder + Oversampler:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.275
LightGBM Classifier w/ Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + One Hot Encoder + Oversampler:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.325
Random Forest Classifier w/ Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + One Hot Encoder + Oversampler:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.290
Extra Trees Classifier w/ Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + One Hot Encoder + Oversampler:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.350
Search finished after 17.86 seconds
Best pipeline: XGBoost Classifier w/ Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + One Hot Encoder + Oversampler
Best pipeline Log Loss Binary: 0.275292
[32]:
{1: {'Elastic Net Classifier w/ Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + One Hot Encoder + Oversampler + Standard Scaler': 2.609773635864258,
'Logistic Regression Classifier w/ Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + One Hot Encoder + Oversampler + Standard Scaler': 3.1346724033355713,
'XGBoost Classifier w/ Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + One Hot Encoder + Oversampler': 2.61729097366333,
'LightGBM Classifier w/ Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + One Hot Encoder + Oversampler': 2.143817901611328,
'Random Forest Classifier w/ Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + One Hot Encoder + Oversampler': 3.1884095668792725,
'Extra Trees Classifier w/ Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + One Hot Encoder + Oversampler': 3.067530870437622,
'Total time of batch': 17.37200093269348}}
DefaultAlgorithm#
DefaultAlgorithm was designed to do three main things:
Abstract out more parameters and decisions from the user.
Perform deeper tuning for high performing pipelines.
Create a platform to introduce feature selection as well as other potential techniques/heuristics for AutoML.
DefaultAlgorithm does this by creating the concept of two modes: fast and long, where fast is a subset of long. The algorithm runs as follows:
1. Run naive pipelines:
   a. a random forest pipeline with the default preprocessing pipeline
2. Run the same pipelines, this time with feature selection. Subsequent pipelines will use the selected features with a SelectedColumns transformer.
3. Run all pipelines with preprocessing components:
   a. scan the rest of the estimators (IterativeAlgorithm batch 1).
4. First ensembling run
Fast mode ends here. Begin long mode.
6. Run the top 3 estimators:
   a. Generate 50 random parameter sets. Run all 150 in one batch.
7. Second ensembling run
8. Repeat steps 8a and 8b indefinitely until the time specified in AutoMLSearch is reached:
   a. For each of the previous top 3 estimators, sample 10 parameters from the tuner. Run all 30 in one batch.
   b. Run ensembling
Consequently, it is recommended to run DefaultAlgorithm via the top-level search() method. This allows users to kick off a search with just the mode parameter: fast mode is recommended for users who want a quick impression of how EvalML pipelines perform on their problem, while long mode is reserved for a deeper exploration of high performing pipelines. If finer control over the AutoML parameters is needed, AutoMLSearch can also be used with automl_algorithm='default', which defaults to fast mode. In that case, however, ensembling is governed by the ensembling flag (if ensembling=False, the ensembling batches described above are skipped). Users are welcome to choose max_batches according to the algorithm above (or other stopping criteria), but should be aware that results may not be optimal if the search does not run for the full length of fast mode. Note that the allowed_model_families and excluded_model_families parameters are only applied to the non-naive batches in the default algorithm. If you want them applied to all estimators, use the iterative algorithm by specifying automl_algorithm='iterative'.
[33]:
from evalml.automl import search
# top level search method will run `AutoMLSearch` with `DefaultAlgorithm` as well as apply our default data checks
auto_default, messages_default = search(X, y, problem_type="binary", mode="fast")
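For a deeper exploration, a minimal sketch of long mode (not executed here) simply changes the mode argument:
from evalml.automl import search

# Long mode continues past the fast-mode batches to tune the top
# estimators and run additional ensembling; expect a much longer runtime.
auto_long, messages_long = search(X, y, problem_type="binary", mode="long")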
[34]:
from evalml import AutoMLSearch
auto_default = AutoMLSearch(
X_train=X,
y_train=y,
problem_type="binary",
automl_algorithm="default",
ensembling=True,
verbose=True,
)
auto_default.search(interactive_plot=False)
AutoMLSearch will use mean CV score to rank pipelines.
Removing columns ['currency', 'expiration_date'] because they are of 'Unknown' type
Using default limit of max_batches=3.
*****************************
* Beginning pipeline search *
*****************************
Optimizing for Log Loss Binary.
Lower score is better.
Using SequentialEngine to train and score pipelines.
Searching up to 3 batches for a total of None pipelines.
Allowed model families:
Evaluating Baseline Pipeline: Mode Baseline Binary Classification Pipeline
Mode Baseline Binary Classification Pipeline:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 4.181
*****************************
* Evaluating Batch Number 1 *
*****************************
Random Forest Classifier w/ Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + One Hot Encoder + Oversampler + RF Classifier Select From Model:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.270
*****************************
* Evaluating Batch Number 2 *
*****************************
LightGBM Classifier w/ Label Encoder + Select Columns By Type Transformer + Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + Select Columns Transformer + Select Columns Transformer + Label Encoder + Imputer + One Hot Encoder + Oversampler:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.325
Extra Trees Classifier w/ Label Encoder + Select Columns By Type Transformer + Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + Select Columns Transformer + Select Columns Transformer + Label Encoder + Imputer + One Hot Encoder + Oversampler:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.349
Elastic Net Classifier w/ Label Encoder + Select Columns By Type Transformer + Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + Standard Scaler + Select Columns Transformer + Select Columns Transformer + Label Encoder + Imputer + One Hot Encoder + Standard Scaler + Oversampler:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.418
XGBoost Classifier w/ Label Encoder + Select Columns By Type Transformer + Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + Select Columns Transformer + Select Columns Transformer + Label Encoder + Imputer + One Hot Encoder + Oversampler:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.275
Logistic Regression Classifier w/ Label Encoder + Select Columns By Type Transformer + Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + Standard Scaler + Select Columns Transformer + Select Columns Transformer + Label Encoder + Imputer + One Hot Encoder + Standard Scaler + Oversampler:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.418
*****************************
* Evaluating Batch Number 3 *
*****************************
Stacked Ensemble Classification Pipeline:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.237
Search finished after 29.29 seconds
Best pipeline: Stacked Ensemble Classification Pipeline
Best pipeline Log Loss Binary: 0.236982
[34]:
{1: {'Random Forest Classifier w/ Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + One Hot Encoder + Oversampler + RF Classifier Select From Model': 3.8598780632019043,
'Total time of batch': 3.989288568496704},
2: {'LightGBM Classifier w/ Label Encoder + Select Columns By Type Transformer + Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + Select Columns Transformer + Select Columns Transformer + Label Encoder + Imputer + One Hot Encoder + Oversampler': 2.4690487384796143,
'Extra Trees Classifier w/ Label Encoder + Select Columns By Type Transformer + Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + Select Columns Transformer + Select Columns Transformer + Label Encoder + Imputer + One Hot Encoder + Oversampler': 2.97183895111084,
'Elastic Net Classifier w/ Label Encoder + Select Columns By Type Transformer + Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + Standard Scaler + Select Columns Transformer + Select Columns Transformer + Label Encoder + Imputer + One Hot Encoder + Standard Scaler + Oversampler': 3.183461904525757,
'XGBoost Classifier w/ Label Encoder + Select Columns By Type Transformer + Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + Select Columns Transformer + Select Columns Transformer + Label Encoder + Imputer + One Hot Encoder + Oversampler': 2.9495766162872314,
'Logistic Regression Classifier w/ Label Encoder + Select Columns By Type Transformer + Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + Standard Scaler + Select Columns Transformer + Select Columns Transformer + Label Encoder + Imputer + One Hot Encoder + Standard Scaler + Oversampler': 3.3269076347351074,
'Total time of batch': 15.692172765731812},
3: {'Stacked Ensemble Classification Pipeline': 8.96763300895691,
'Total time of batch': 9.111598491668701}}
Pipeline differences#
Through the search output above, we can see the pipeline differences between IterativeAlgorithm and DefaultAlgorithm. This is because DefaultAlgorithm performs feature selection with new components such as RFRegressorSelectFromModel and other column selectors, and uses a new pipeline structure that handles feature selection separately for categorical and non-categorical features.
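Before graphing them, a quick way to see the structural difference is to print the two pipelines' names; a minimal sketch using the same pipeline ids as the cells below:
# The DefaultAlgorithm pipeline name lists the extra column selectors and
# feature selection components that the iterative pipeline lacks.
print(auto_iterative.get_pipeline(4).name)
print(auto_default.get_pipeline(6).name)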
[35]:
auto_iterative.get_pipeline(4).graph()
[35]:
[36]:
auto_default.get_pipeline(6).graph()
[36]:
Access raw results#
The AutoMLSearch class records detailed results information under the results field, including information about the cross-validation scoring and parameters.
[37]:
import pprint
pp = pprint.PrettyPrinter(indent=0, width=100, depth=3, compact=True, sort_dicts=False)
pp.pprint(automl.results)
{'pipeline_results': {0: {'id': 0,
'pipeline_name': 'Mode Baseline Binary Classification Pipeline',
'pipeline_class': <class 'evalml.pipelines.binary_classification_pipeline.BinaryClassificationPipeline'>,
'pipeline_summary': 'Baseline Classifier w/ Label Encoder',
'parameters': {...},
'mean_cv_score': 4.921248270190403,
'standard_deviation_cv_score': 0.11291020093698304,
'high_variance_cv': False,
'training_time': 0.6287055015563965,
'cv_data': [...],
'percent_better_than_baseline_all_objectives': {...},
'percent_better_than_baseline': 0,
'ranking_score': 4.990659700031606,
'ranking_additional_objectives': {...},
'holdout_score': 4.990659700031606},
1: {'id': 1,
'pipeline_name': 'Random Forest Classifier w/ Label Encoder + Drop Columns '
'Transformer + DateTime Featurizer + Imputer + One Hot '
'Encoder + Oversampler + RF Classifier Select From Model',
'pipeline_class': <class 'evalml.pipelines.binary_classification_pipeline.BinaryClassificationPipeline'>,
'pipeline_summary': 'Random Forest Classifier w/ Label Encoder + Drop '
'Columns Transformer + DateTime Featurizer + Imputer + '
'One Hot Encoder + Oversampler + RF Classifier Select '
'From Model',
'parameters': {...},
'mean_cv_score': 0.2533614735001717,
'standard_deviation_cv_score': 0.045592460569307824,
'high_variance_cv': False,
'training_time': 5.226076364517212,
'cv_data': [...],
'percent_better_than_baseline_all_objectives': {...},
'percent_better_than_baseline': 94.85168275222236,
'ranking_score': 0.21235607618427976,
'ranking_additional_objectives': {...},
'holdout_score': 0.21235607618427976},
2: {'id': 2,
'pipeline_name': 'LightGBM Classifier w/ Label Encoder + Select Columns By '
'Type Transformer + Label Encoder + Drop Columns '
'Transformer + DateTime Featurizer + Imputer + Select '
'Columns Transformer + Select Columns Transformer + Label '
'Encoder + Imputer + One Hot Encoder + Oversampler',
'pipeline_class': <class 'evalml.pipelines.binary_classification_pipeline.BinaryClassificationPipeline'>,
'pipeline_summary': 'LightGBM Classifier w/ Label Encoder + Select Columns '
'By Type Transformer + Label Encoder + Drop Columns '
'Transformer + DateTime Featurizer + Imputer + Select '
'Columns Transformer + Select Columns Transformer + '
'Label Encoder + Imputer + One Hot Encoder + '
'Oversampler',
'parameters': {...},
'mean_cv_score': 0.2999710030621828,
'standard_deviation_cv_score': 0.2061756997312182,
'high_variance_cv': False,
'training_time': 3.292562246322632,
'cv_data': [...],
'percent_better_than_baseline_all_objectives': {...},
'percent_better_than_baseline': 93.90457488440069,
'ranking_score': 0.1609546813582899,
'ranking_additional_objectives': {...},
'holdout_score': 0.1609546813582899},
3: {'id': 3,
'pipeline_name': 'Extra Trees Classifier w/ Label Encoder + Select Columns '
'By Type Transformer + Label Encoder + Drop Columns '
'Transformer + DateTime Featurizer + Imputer + Select '
'Columns Transformer + Select Columns Transformer + Label '
'Encoder + Imputer + One Hot Encoder + Oversampler',
'pipeline_class': <class 'evalml.pipelines.binary_classification_pipeline.BinaryClassificationPipeline'>,
'pipeline_summary': 'Extra Trees Classifier w/ Label Encoder + Select '
'Columns By Type Transformer + Label Encoder + Drop '
'Columns Transformer + DateTime Featurizer + Imputer + '
'Select Columns Transformer + Select Columns '
'Transformer + Label Encoder + Imputer + One Hot '
'Encoder + Oversampler',
'parameters': {...},
'mean_cv_score': 0.3548022504410096,
'standard_deviation_cv_score': 0.02819477750638557,
'high_variance_cv': False,
'training_time': 5.005589723587036,
'cv_data': [...],
'percent_better_than_baseline_all_objectives': {...},
'percent_better_than_baseline': 92.79040131768677,
'ranking_score': 0.34494076832526754,
'ranking_additional_objectives': {...},
'holdout_score': 0.34494076832526754},
4: {'id': 4,
'pipeline_name': 'Elastic Net Classifier w/ Label Encoder + Select Columns '
'By Type Transformer + Label Encoder + Drop Columns '
'Transformer + DateTime Featurizer + Imputer + Standard '
'Scaler + Select Columns Transformer + Select Columns '
'Transformer + Label Encoder + Imputer + One Hot Encoder + '
'Standard Scaler + Oversampler',
'pipeline_class': <class 'evalml.pipelines.binary_classification_pipeline.BinaryClassificationPipeline'>,
'pipeline_summary': 'Elastic Net Classifier w/ Label Encoder + Select '
'Columns By Type Transformer + Label Encoder + Drop '
'Columns Transformer + DateTime Featurizer + Imputer + '
'Standard Scaler + Select Columns Transformer + Select '
'Columns Transformer + Label Encoder + Imputer + One '
'Hot Encoder + Standard Scaler + Oversampler',
'parameters': {...},
'mean_cv_score': 0.3749636652298016,
'standard_deviation_cv_score': 0.04576837668046441,
'high_variance_cv': False,
'training_time': 4.673047780990601,
'cv_data': [...],
'percent_better_than_baseline_all_objectives': {...},
'percent_better_than_baseline': 92.38072040581496,
'ranking_score': 0.4014310266366239,
'ranking_additional_objectives': {...},
'holdout_score': 0.4014310266366239},
5: {'id': 5,
'pipeline_name': 'XGBoost Classifier w/ Label Encoder + Select Columns By '
'Type Transformer + Label Encoder + Drop Columns '
'Transformer + DateTime Featurizer + Imputer + Select '
'Columns Transformer + Select Columns Transformer + Label '
'Encoder + Imputer + One Hot Encoder + Oversampler',
'pipeline_class': <class 'evalml.pipelines.binary_classification_pipeline.BinaryClassificationPipeline'>,
'pipeline_summary': 'XGBoost Classifier w/ Label Encoder + Select Columns '
'By Type Transformer + Label Encoder + Drop Columns '
'Transformer + DateTime Featurizer + Imputer + Select '
'Columns Transformer + Select Columns Transformer + '
'Label Encoder + Imputer + One Hot Encoder + '
'Oversampler',
'parameters': {...},
'mean_cv_score': 0.2602139590213454,
'standard_deviation_cv_score': 0.14857758702664728,
'high_variance_cv': False,
'training_time': 3.5291311740875244,
'cv_data': [...],
'percent_better_than_baseline_all_objectives': {...},
'percent_better_than_baseline': 94.71243991900296,
'ranking_score': 0.16504919593086245,
'ranking_additional_objectives': {...},
'holdout_score': 0.16504919593086245},
6: {'id': 6,
'pipeline_name': 'Logistic Regression Classifier w/ Label Encoder + Select '
'Columns By Type Transformer + Label Encoder + Drop '
'Columns Transformer + DateTime Featurizer + Imputer + '
'Standard Scaler + Select Columns Transformer + Select '
'Columns Transformer + Label Encoder + Imputer + One Hot '
'Encoder + Standard Scaler + Oversampler',
'pipeline_class': <class 'evalml.pipelines.binary_classification_pipeline.BinaryClassificationPipeline'>,
'pipeline_summary': 'Logistic Regression Classifier w/ Label Encoder + '
'Select Columns By Type Transformer + Label Encoder + '
'Drop Columns Transformer + DateTime Featurizer + '
'Imputer + Standard Scaler + Select Columns Transformer '
'+ Select Columns Transformer + Label Encoder + Imputer '
'+ One Hot Encoder + Standard Scaler + Oversampler',
'parameters': {...},
'mean_cv_score': 0.3744194787645956,
'standard_deviation_cv_score': 0.04599907860856567,
'high_variance_cv': False,
'training_time': 5.866044521331787,
'cv_data': [...],
'percent_better_than_baseline_all_objectives': {...},
'percent_better_than_baseline': 92.39177830079056,
'ranking_score': 0.40317955644858056,
'ranking_additional_objectives': {...},
'holdout_score': 0.40317955644858056}},
'search_order': [0, 1, 2, 3, 4, 5, 6]}
If any errors occur, such as those that can arise in the Iterative Algorithm example above, we can examine them more closely via the errors field. There is one dictionary entry per failed pipeline fold, and each entry contains the pipeline parameters along with the error thrown and its full traceback.
[38]:
auto_iterative.errors
[38]:
{}
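No pipelines failed in this run, so the dictionary above is empty. As a minimal sketch (the key names below are assumptions for illustration, not the library's documented schema), a populated errors dictionary could be inspected like this:
# Walk the errors dictionary: one entry per failed pipeline fold, each with
# the pipeline parameters and the raised error plus its traceback.
for entry_name, fold_info in auto_iterative.errors.items():
    print(entry_name)
    print(fold_info.get("parameters"))  # hypothetical key name
    print(fold_info.get("traceback"))  # hypothetical key name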
Parallel AutoML#
By default, all pipelines in an AutoML batch are evaluated in serial. Pipelines can be evaluated in parallel to improve performance during AutoML search. This is accomplished by a futures-style submission and evaluation of the pipelines in a batch. As of this writing, the pipelines use a threaded model for concurrent evaluation. This is similar to the currently implemented n_jobs parameter in the estimators, which uses an increased number of threads to train and score estimators.
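For reference, a minimal sketch of that estimator-level n_jobs knob (the value 2 is purely illustrative):
from evalml.pipelines.components import RandomForestClassifier

# n_jobs controls intra-estimator parallelism and is independent of the
# engine-level parallelism described in this section.
rf = RandomForestClassifier(n_jobs=2)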
Quick Start#
To quickly use some parallelism to enhance the pipeline searching, a string can be passed to AutoMLSearch during initialization to set up a parallel engine and client within the AutoMLSearch object. The current options are "cf_threaded", "cf_process", "dask_threaded" and "dask_process", which indicate the futures backend to use and whether to use thread-level or process-level parallelism.
[39]:
automl_cf_threaded = AutoMLSearch(
X_train=X,
y_train=y,
problem_type="binary",
allowed_model_families=[ModelFamily.LINEAR_MODEL],
engine="cf_threaded",
)
automl_cf_threaded.search(interactive_plot=False)
automl_cf_threaded.close_engine()
Parallelism with Concurrent Futures#
The EngineBase class is robust and extensible enough to support futures-like implementations from a variety of libraries. The CFEngine extends the EngineBase to use the native Python concurrent.futures library. The CFEngine supports both thread- and process-level parallelism; the type of parallelism is chosen by passing either a ThreadPoolExecutor or a ProcessPoolExecutor. If either executor is passed a max_workers parameter, it sets the number of processes or threads spawned. If not, the default number of processes will equal the number of processors available, and the number of threads will be set to five times the number of processors available.
Here, the CFEngine is constructed with a ThreadPoolExecutor capped at four worker threads.
[40]:
from concurrent.futures import ThreadPoolExecutor
from evalml.automl.engine.cf_engine import CFEngine, CFClient
cf_engine = CFEngine(CFClient(ThreadPoolExecutor(max_workers=4)))
automl_cf_threaded = AutoMLSearch(
X_train=X,
y_train=y,
problem_type="binary",
allowed_model_families=[ModelFamily.LINEAR_MODEL],
engine=cf_engine,
)
automl_cf_threaded.search(interactive_plot=False)
automl_cf_threaded.close_engine()
Note: the cell demonstrating process-level parallelism is shown as Markdown due to incompatibility with our ReadTheDocs build. It can be run successfully locally.
from concurrent.futures import ProcessPoolExecutor

# Repeat the process but using process-level parallelism
cf_engine = CFEngine(CFClient(ProcessPoolExecutor(max_workers=2)))
automl_cf_process = AutoMLSearch(
    X_train=X,
    y_train=y,
    problem_type="binary",
    engine=cf_engine,
)
automl_cf_process.search(interactive_plot=False)
automl_cf_process.close_engine()
Parallelism with Dask#
Thread- or process-level parallelism can be explicitly invoked for the DaskEngine (as well as the CFEngine). processes can be set to True and the number of processes set using n_workers. If processes is set to False, then the resulting parallelism will be threaded and n_workers will represent the threads used. Examples of both follow.
[41]:
from dask.distributed import LocalCluster
from evalml.automl.engine import DaskEngine
dask_engine_p2 = DaskEngine(cluster=LocalCluster(processes=True, n_workers=2))
automl_dask_p2 = AutoMLSearch(
X_train=X,
y_train=y,
problem_type="binary",
allowed_model_families=[ModelFamily.LINEAR_MODEL],
engine=dask_engine_p2,
)
automl_dask_p2.search(interactive_plot=False)
# Explicitly shutdown the automl object's LocalCluster
automl_dask_p2.close_engine()
/home/docs/checkouts/readthedocs.org/user_builds/feature-labs-inc-evalml/envs/stable/lib/python3.9/site-packages/sklearn/ensemble/_base.py:168: UserWarning: Inside a Dask worker with daemon=True, setting n_jobs=1.
Possible work-arounds:
- dask.config.set({'distributed.worker.daemon': False})
- set the environment variable DASK_DISTRIBUTED__WORKER__DAEMON=False
before creating your Dask cluster.
n_jobs = min(effective_n_jobs(n_jobs), n_estimators)
/home/docs/checkouts/readthedocs.org/user_builds/feature-labs-inc-evalml/envs/stable/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py:1288: UserWarning: Inside a Dask worker with daemon=True, setting n_jobs=1.
Possible work-arounds:
- dask.config.set({'distributed.worker.daemon': False})
- set the environment variable DASK_DISTRIBUTED__WORKER__DAEMON=False
before creating your Dask cluster.
and effective_n_jobs(self.n_jobs) == 1
/home/docs/checkouts/readthedocs.org/user_builds/feature-labs-inc-evalml/envs/stable/lib/python3.9/site-packages/joblib/parallel.py:1359: UserWarning: Inside a Dask worker with daemon=True, setting n_jobs=1.
Possible work-arounds:
- dask.config.set({'distributed.worker.daemon': False})
- set the environment variable DASK_DISTRIBUTED__WORKER__DAEMON=False
before creating your Dask cluster.
n_jobs = self._backend.configure(n_jobs=self.n_jobs, parallel=self,
2024-06-06 17:58:48,265 - distributed.scheduler - ERROR - Removing worker 'tcp://127.0.0.1:43003' caused the cluster to lose scattered data, which can't be recovered: {'DataFrame-bdd6ad7396f7ef2eafd41b9d359b74d1', 'Series-0bfec7d74a2607cc9a57ac596e573e01'} (stimulus_id='handle-worker-cleanup-1717696728.2653542')
[42]:
dask_engine_t4 = DaskEngine(cluster=LocalCluster(processes=False, n_workers=4))
automl_dask_t4 = AutoMLSearch(
X_train=X,
y_train=y,
problem_type="binary",
allowed_model_families=[ModelFamily.LINEAR_MODEL],
engine=dask_engine_t4,
)
automl_dask_t4.search(interactive_plot=False)
automl_dask_t4.close_engine()
2024-06-06 17:59:05,117 - distributed.scheduler - ERROR - Removing worker 'inproc://172.17.0.2/5729/5' caused the cluster to lose scattered data, which can't be recovered: {'DataFrame-bdd6ad7396f7ef2eafd41b9d359b74d1', 'Series-0bfec7d74a2607cc9a57ac596e573e01'} (stimulus_id='handle-worker-cleanup-1717696745.1171188')
As we can see, a significant performance gain comes simply from using an engine other than the default SequentialEngine: in this run, roughly a 1.5x speedup with two Dask processes and better than a 2x speedup with threaded concurrent futures. Exact gains will vary with hardware, data size, and the pipelines searched.
[43]:
print("Sequential search duration: %s" % str(automl.search_duration))
print(
"Concurrent futures (threaded) search duration: %s"
% str(automl_cf_threaded.search_duration)
)
print("Dask (two processes) search duration: %s" % str(automl_dask_p2.search_duration))
print("Dask (four threads)search duration: %s" % str(automl_dask_t4.search_duration))
Sequential search duration: 29.43521237373352
Concurrent futures (threaded) search duration: 12.752341270446777
Dask (two processes) search duration: 18.946487426757812
Dask (four threads) search duration: 14.855291366577148
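As a minimal sketch, the speedups quoted above can be computed directly from the search_duration attributes of the searches in this section:
# Relative speedup of each parallel engine over the sequential baseline.
baseline = automl.search_duration
for name, run in [
    ("cf_threaded", automl_cf_threaded),
    ("dask_p2", automl_dask_p2),
    ("dask_t4", automl_dask_t4),
]:
    print(f"{name}: {baseline / run.search_duration:.2f}x")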