Automated Machine Learning (AutoML) Search#
Background#
Machine Learning#
Machine learning (ML) is the process of constructing a mathematical model of a system based on a sample dataset collected from that system.
One of the main goals of training an ML model is to teach the model to separate the signal present in the data from the noise inherent in the system and in the data collection process. If this is done well, the model can then be used to make accurate predictions about the system when presented with new, similar data. Additionally, introspecting on an ML model can reveal key information about the system being modeled, such as which inputs, and which transformations of those inputs, are most useful to the ML model for learning the signal in the data and are therefore the most predictive.
There are a variety of types of ML problems. Supervised learning describes the case where the collected data contains an output value to be modeled and a set of inputs with which to train the model. EvalML focuses on training supervised learning models.
EvalML supports three common supervised ML problem types. The first is regression, where the target value to be modeled is a continuous numeric value. Next are binary and multiclass classification, where the target value to be modeled consists of two or more discrete values or categories. Which supervised ML problem type is most appropriate depends on domain expertise and on how the model will be evaluated and used.
EvalML is currently building support for supervised time series problems: time series regression, time series binary classification, and time series multiclass classification. While we've added some features to tackle these kinds of problems, this functionality is still under active development, so please be mindful of that before use.
AutoML and Search#
AutoML is the process of automating the construction, training, and evaluation of ML models. Given data and some configuration, AutoML searches for the most effective and accurate ML model or models to fit the dataset. During the search, AutoML will explore different combinations of model type, model parameters, and model architecture.
An effective AutoML solution offers several advantages over constructing and tuning ML models by hand. AutoML can assist with many of the difficult aspects of ML, such as avoiding overfitting and underfitting, handling imbalanced data, detecting data leakage and other potential issues with the problem setup, and automatically applying best-practice data cleaning, feature engineering, feature selection, and various modeling techniques. AutoML can also leverage search algorithms to optimally sweep the hyperparameter search space, resulting in model performance that would be difficult to achieve by manual training.
AutoML in EvalML#
EvalML supports all of the above and more.
In its simplest usage, the AutoML search interface requires only the input data, the target data, and a problem_type specifying what kind of supervised ML problem to model.
** Graphing methods, like verbose AutoMLSearch, on Jupyter Notebook and Jupyter Lab require ipywidgets to be installed.
** If graphing on Jupyter Lab, jupyterlab-plotly is required. To download this, make sure you have npm installed.
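For instance, a minimal sketch (assuming X and y hold the features and target loaded in the cell below) could be as simple as:
automl = evalml.automl.AutoMLSearch(X_train=X, y_train=y, problem_type="binary")
automl.search()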
[1]:
import evalml
from evalml.utils import infer_feature_types
X, y = evalml.demos.load_fraud(n_rows=650)
Number of Features
Boolean 1
Categorical 6
Numeric 5
Number of training examples: 650
Targets
False 86.31%
True 13.69%
Name: count, dtype: object
/home/docs/checkouts/readthedocs.org/user_builds/feature-labs-inc-evalml/envs/stable/lib/python3.9/site-packages/woodwork/type_sys/utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
pd.to_datetime(
To provide data to EvalML, it is recommended that you initialize a Woodwork accessor on your data. This allows you to easily control how EvalML will treat each of your features before training a model.
EvalML also accepts pandas input, and will run type inference on top of the input pandas data. If you'd like to change the types inferred by EvalML, you can use the infer_feature_types utility method, which takes pandas or numpy input and converts it to a Woodwork data structure. The feature_types parameter can be used to specify what types specific columns should be.
Feature types such as Natural Language must be specified in this way, otherwise Woodwork will infer them as the Unknown type and drop them during AutoMLSearch.
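For instance, a sketch marking a hypothetical free-text column (here called "notes", purely for illustration) as natural language so it is kept, assuming Woodwork's NaturalLanguage logical type name:
X = infer_feature_types(X, feature_types={"notes": "NaturalLanguage"})  # "notes" is a hypothetical column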
In the example below, we reformat a couple of features so the model can consume them easily, and then specify that the provider, a column that would otherwise be inferred as containing natural language, is a categorical column.
[2]:
X.ww["expiration_date"] = X["expiration_date"].apply(
lambda x: "20{}-01-{}".format(x.split("/")[1], x.split("/")[0])
)
X = infer_feature_types(
X,
feature_types={
"store_id": "categorical",
"expiration_date": "datetime",
"lat": "categorical",
"lng": "categorical",
"provider": "categorical",
},
)
To validate the results of the pipeline creation and optimization process, we will save part of our data as a holdout set.
[3]:
X_train, X_holdout, y_train, y_holdout = evalml.preprocessing.split_data(
    X, y, problem_type="binary", test_size=0.2
)
Data Checks#
Before calling AutoMLSearch.search, we should run some sanity checks on our data to ensure that the input data will not run into common issues before starting a potentially time-consuming search. EvalML has various data checks that make this easy. Each data check returns a collection of warnings and errors if it detects potential issues with the input data. This allows users to inspect their data and avoid confusing errors that could arise during the search process. You can learn about the details of each of the available data checks through our data checks guide.
Here, we will run the DefaultDataChecks class, which contains a series of data checks that are generally useful.
[4]:
from evalml.data_checks import DefaultDataChecks
data_checks = DefaultDataChecks("binary", "log loss binary")
data_checks.validate(X_train, y_train)
[4]:
[]
Since there were no warnings or errors returned, we can safely continue with the search process.
Holdout Set for Pipeline Ranking#
If the holdout_set_size parameter is set and the input dataset contains more than 500 rows, AutoMLSearch will create a holdout set from a holdout_set_size fraction of the training data. Alternatively, a holdout set can be specified manually with the X_holdout and y_holdout parameters of AutoMLSearch(). In this example, the holdout set created above will be used by the AutoML search.
During the AutoML search process, the mean of the objective scores across all cross-validation folds (shown in the "mean_cv_score" column in the pipeline rankings) is calculated. This score is passed to the AutoML search tuner to further optimize the hyperparameters of the next batch of pipelines.
Afterwards, the pipeline is fitted on the entire training dataset and scored on this new holdout set. This score is represented in the "ranking_score" column of the pipeline rankings board and is used to rank pipeline performance.
If the dataset has fewer than 500 rows or holdout_set_size=0 (the default), the "mean_cv_score" will be used as the ranking_score instead.
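As a sketch, letting AutoMLSearch carve out its own holdout set instead of passing one in manually (assuming holdout_set_size is given as a fraction of the training data) might look like:
automl_auto_holdout = evalml.automl.AutoMLSearch(
    X_train=X,
    y_train=y,
    problem_type="binary",
    holdout_set_size=0.1,  # reserve 10% of the training data for ranking
)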
[5]:
automl = evalml.automl.AutoMLSearch(
    X_train=X_train,
    y_train=y_train,
    X_holdout=X_holdout,
    y_holdout=y_holdout,
    problem_type="binary",
    verbose=True,
)
automl.search(interactive_plot=False)
AutoMLSearch will use the holdout set to score and rank pipelines.
Removing columns ['currency'] because they are of 'Unknown' type
Using default limit of max_batches=2.
*****************************
* Beginning pipeline search *
*****************************
Optimizing for Log Loss Binary.
Lower score is better.
Using SequentialEngine to train and score pipelines.
Searching up to 2 batches for a total of None pipelines.
Allowed model families:
Evaluating Baseline Pipeline: Mode Baseline Binary Classification Pipeline
Mode Baseline Binary Classification Pipeline:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 4.921
Starting holdout set scoring
Finished holdout set scoring - Log Loss Binary: 4.991
*****************************
* Evaluating Batch Number 1 *
*****************************
Random Forest Classifier w/ Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + One Hot Encoder + Oversampler + RF Classifier Select From Model:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.253
Starting holdout set scoring
Finished holdout set scoring - Log Loss Binary: 0.212
*****************************
* Evaluating Batch Number 2 *
*****************************
LightGBM Classifier w/ Label Encoder + Select Columns By Type Transformer + Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + Select Columns Transformer + Select Columns Transformer + Label Encoder + Imputer + One Hot Encoder + Oversampler:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.300
Starting holdout set scoring
Finished holdout set scoring - Log Loss Binary: 0.161
Extra Trees Classifier w/ Label Encoder + Select Columns By Type Transformer + Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + Select Columns Transformer + Select Columns Transformer + Label Encoder + Imputer + One Hot Encoder + Oversampler:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.355
Starting holdout set scoring
Finished holdout set scoring - Log Loss Binary: 0.345
Elastic Net Classifier w/ Label Encoder + Select Columns By Type Transformer + Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + Standard Scaler + Select Columns Transformer + Select Columns Transformer + Label Encoder + Imputer + One Hot Encoder + Standard Scaler + Oversampler:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.375
Starting holdout set scoring
Finished holdout set scoring - Log Loss Binary: 0.401
XGBoost Classifier w/ Label Encoder + Select Columns By Type Transformer + Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + Select Columns Transformer + Select Columns Transformer + Label Encoder + Imputer + One Hot Encoder + Oversampler:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.260
Starting holdout set scoring
Finished holdout set scoring - Log Loss Binary: 0.165
Logistic Regression Classifier w/ Label Encoder + Select Columns By Type Transformer + Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + Standard Scaler + Select Columns Transformer + Select Columns Transformer + Label Encoder + Imputer + One Hot Encoder + Standard Scaler + Oversampler:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.374
Starting holdout set scoring
Finished holdout set scoring - Log Loss Binary: 0.403
Search finished after 29.44 seconds
Best pipeline: LightGBM Classifier w/ Label Encoder + Select Columns By Type Transformer + Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + Select Columns Transformer + Select Columns Transformer + Label Encoder + Imputer + One Hot Encoder + Oversampler
Best pipeline Log Loss Binary: 0.160955
[5]:
{1: {'Random Forest Classifier w/ Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + One Hot Encoder + Oversampler + RF Classifier Select From Model': 5.261218070983887,
'Total time of batch': 5.392166376113892},
2: {'LightGBM Classifier w/ Label Encoder + Select Columns By Type Transformer + Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + Select Columns Transformer + Select Columns Transformer + Label Encoder + Imputer + One Hot Encoder + Oversampler': 3.3102989196777344,
'Extra Trees Classifier w/ Label Encoder + Select Columns By Type Transformer + Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + Select Columns Transformer + Select Columns Transformer + Label Encoder + Imputer + One Hot Encoder + Oversampler': 5.022824048995972,
'Elastic Net Classifier w/ Label Encoder + Select Columns By Type Transformer + Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + Standard Scaler + Select Columns Transformer + Select Columns Transformer + Label Encoder + Imputer + One Hot Encoder + Standard Scaler + Oversampler': 4.690482139587402,
'XGBoost Classifier w/ Label Encoder + Select Columns By Type Transformer + Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + Select Columns Transformer + Select Columns Transformer + Label Encoder + Imputer + One Hot Encoder + Oversampler': 3.5465054512023926,
'Logistic Regression Classifier w/ Label Encoder + Select Columns By Type Transformer + Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + Standard Scaler + Select Columns Transformer + Select Columns Transformer + Label Encoder + Imputer + One Hot Encoder + Standard Scaler + Oversampler': 5.883444547653198,
'Total time of batch': 23.247490644454956}}
With the verbose argument set to True, the AutoML search will log its progress, reporting each pipeline and parameter set evaluated during the search. The search iteration plot shown during an AutoML search tracks the current pipeline's validation score (grey points) against the best pipeline validation score so far (blue line).
There are several mechanisms for controlling the AutoML search time. One way is to set the max_batches parameter, which controls the maximum number of rounds of AutoML to evaluate, where each round may train and score a variable number of pipelines. Another way is to set the max_iterations parameter, which controls the maximum number of candidate models to evaluate during AutoML. By default, AutoML will search a single batch. The first pipeline to be evaluated is always a baseline model representing a trivial solution.
The AutoML interface supports a variety of other parameters. For a comprehensive list, please refer to the API Reference.
We also provide a standalone search method which does all of the above in a single line, and returns the AutoMLSearch instance and data check results. If there were data check errors, AutoML will not be run and no AutoMLSearch instance will be returned.
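As a sketch, that one-liner (using the top-level evalml.automl.search function, whose calling convention mirrors search_iterative, shown later in this guide) might look like:
from evalml.automl import search

# Runs the default data checks and then AutoMLSearch in one call. If the data
# checks return errors, no search is run and automl will be None.
automl, data_check_results = search(X_train, y_train, problem_type="binary")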
Detecting Problem Type#
EvalML includes a simple method, detect_problem_type, to help determine the problem type given the target data.
This function can return the predicted problem type as a ProblemType enum, choosing from ProblemType.BINARY, ProblemType.MULTICLASS, and ProblemType.REGRESSION. If the target data is invalid (for instance, when there is only 1 unique label), the function will throw an error instead.
[6]:
import pandas as pd
from evalml.problem_types import detect_problem_type
y_binary = pd.Series([0, 1, 1, 0, 1, 1])
detect_problem_type(y_binary)
[6]:
<ProblemTypes.BINARY: 'binary'>
Objective parameter#
AutoMLSearch takes in an objective parameter to determine which objective to optimize for. By default, this parameter is set to auto, which allows AutoML to choose LogLossBinary for binary classification problems, LogLossMulticlass for multiclass classification problems, and R2 for regression problems.
It should be noted that the objective parameter is only used in ranking and helping choose which pipelines to iterate over; it is not used to optimize each individual pipeline during fitting.
To get the default objective for each problem type, you can use the get_default_primary_search_objective function.
[7]:
from evalml.automl import get_default_primary_search_objective
binary_objective = get_default_primary_search_objective("binary")
multiclass_objective = get_default_primary_search_objective("multiclass")
regression_objective = get_default_primary_search_objective("regression")
print(binary_objective.name)
print(multiclass_objective.name)
print(regression_objective.name)
Log Loss Binary
Log Loss Multiclass
R2
Using custom pipelines#
EvalML's AutoML algorithm generates a set of pipelines to search with. To provide a custom set instead, set allowed_component_graphs to a dictionary of custom component graphs. AutoMLSearch will use these to generate Pipeline instances. Note: this will prevent AutoML from generating other pipelines to search over.
[8]:
from evalml.pipelines import MulticlassClassificationPipeline
automl_custom = evalml.automl.AutoMLSearch(
    X_train=X_train,
    y_train=y_train,
    problem_type="multiclass",
    verbose=True,
    allowed_component_graphs={
        "My_pipeline": ["Simple Imputer", "Random Forest Classifier"],
        "My_other_pipeline": ["One Hot Encoder", "Random Forest Classifier"],
    },
)
AutoMLSearch will use mean CV score to rank pipelines.
Removing columns ['currency'] because they are of 'Unknown' type
Using default limit of max_batches=2.
Callback functions#
AutoMLSearch supports several callback functions, which can be specified as parameters when initializing an AutoMLSearch object. They are:
start_iteration_callback
add_result_callback
error_callback
Start Iteration Callback#
Users can set start_iteration_callback to specify what function is called before each pipeline training iteration. This callback function must take three positional parameters: the pipeline class, the pipeline parameters, and the AutoMLSearch object.
[9]:
## start_iteration_callback example function
def start_iteration_callback_example(pipeline_class, pipeline_params, automl_obj):
    print("Training pipeline with the following parameters:", pipeline_params)
Add Result Callback#
Users can set add_result_callback to specify what function is called after each pipeline training iteration. This callback function must take three positional parameters: a dictionary containing the training results for the new pipeline, an untrained_pipeline containing the parameters used during training, and the AutoMLSearch object.
[10]:
## add_result_callback example function
def add_result_callback_example(pipeline_results_dict, untrained_pipeline, automl_obj):
    print(
        "Results for trained pipeline with the following parameters:",
        pipeline_results_dict,
    )
Error Callback#
Users can set error_callback to specify what function is called when search() errors and raises an Exception. This callback function takes three positional parameters: the Exception raised, the traceback, and the AutoMLSearch object. It must also accept kwargs, so that AutoMLSearch is able to pass along any other parameters used by default.
EvalML defines several error callback functions, which can be found under evalml.automl.callbacks. They are:
silent_error_callback
raise_error_callback
log_and_save_error_callback
raise_and_save_error_callback
log_error_callback (the default used when error_callback is None)
[11]:
# error_callback example; this is implemented in the evalml library
def raise_error_callback(exception, traceback, automl, **kwargs):
    """Raises the exception thrown by the AutoMLSearch object. Also logs the exception as an error."""
    logger.error(f"AutoMLSearch raised a fatal exception: {str(exception)}")
    logger.error("\n".join(traceback))
    raise exception
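As a usage sketch, the example callbacks defined above would be wired into AutoMLSearch like so:
automl_with_callbacks = evalml.automl.AutoMLSearch(
    X_train=X_train,
    y_train=y_train,
    problem_type="binary",
    start_iteration_callback=start_iteration_callback_example,
    add_result_callback=add_result_callback_example,
    error_callback=raise_error_callback,
)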
View Rankings#
A summary of all the pipelines built can be returned as a pandas DataFrame, sorted by the validation score.
For AutoML searches completed with a holdout set, the validation score is the holdout score of the pipeline fitted using the entire training dataset.
For AutoML searches completed without a holdout set, the validation score is the average score across all cross-validation folds.
[12]:
automl.rankings
[12]:
| | id | pipeline_name | search_order | ranking_score | holdout_score | mean_cv_score | standard_deviation_cv_score | percent_better_than_baseline | high_variance_cv | parameters |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | LightGBM Classifier w/ Label Encoder + Select ... | 2 | 0.160955 | 0.160955 | 0.299971 | 0.206176 | 93.904575 | False | {'Label Encoder': {'positive_label': None}, 'N... |
| 1 | 5 | XGBoost Classifier w/ Label Encoder + Select C... | 5 | 0.165049 | 0.165049 | 0.260214 | 0.148578 | 94.712440 | False | {'Label Encoder': {'positive_label': None}, 'N... |
| 2 | 1 | Random Forest Classifier w/ Label Encoder + Dr... | 1 | 0.212356 | 0.212356 | 0.253361 | 0.045592 | 94.851683 | False | {'Label Encoder': {'positive_label': None}, 'D... |
| 3 | 3 | Extra Trees Classifier w/ Label Encoder + Sele... | 3 | 0.344941 | 0.344941 | 0.354802 | 0.028195 | 92.790401 | False | {'Label Encoder': {'positive_label': None}, 'N... |
| 4 | 4 | Elastic Net Classifier w/ Label Encoder + Sele... | 4 | 0.401431 | 0.401431 | 0.374964 | 0.045768 | 92.380720 | False | {'Label Encoder': {'positive_label': None}, 'N... |
| 5 | 6 | Logistic Regression Classifier w/ Label Encode... | 6 | 0.403180 | 0.403180 | 0.374419 | 0.045999 | 92.391778 | False | {'Label Encoder': {'positive_label': None}, 'N... |
| 6 | 0 | Mode Baseline Binary Classification Pipeline | 0 | 4.990660 | 4.990660 | 4.921248 | 0.112910 | 0.000000 | False | {'Label Encoder': {'positive_label': None}, 'B... |
Recommendation Score#
If you would like a more robust evaluation of the performance of your models, EvalML provides a recommendation score in addition to the selected objective. The recommendation score is a weighted average of several default objectives for your problem type, normalized and scaled so that the final score can be interpreted as a percentage from 0 to 100. This weighted score provides a more holistic understanding of model performance, and prioritizes model generalizability over a single objective that may not completely serve your use case.
[13]:
automl.get_recommendation_scores(use_pipeline_names=True)
[13]:
{'Baseline Classifier': 25.0,
'Random Forest Classifier': 92.70765199012199,
'LightGBM Classifier': 91.29441485901573,
'Extra Trees Classifier': 75.89652715271133,
'Elastic Net Classifier': 64.71178271206128,
'XGBoost Classifier': 90.70346746264624,
'Logistic Regression Classifier': 64.65342052603951}
[14]:
automl.get_recommendation_scores(priority="F1", use_pipeline_names=True)
[14]:
{'Baseline Classifier': 16.666666666666664,
'Random Forest Classifier': 92.10813162977828,
'LightGBM Classifier': 90.02960990601049,
'Extra Trees Classifier': 68.7795029502924,
'Elastic Net Classifier': 53.58295554298061,
'XGBoost Classifier': 89.63564497509749,
'Logistic Regression Classifier': 53.5440474189661}
To see which objectives are included in the recommendation score, you can use:
[15]:
evalml.objectives.get_default_recommendation_objectives("binary")
[15]:
{'AUC', 'Balanced Accuracy Binary', 'F1', 'Log Loss Binary'}
If you would like AutoMLSearch to automatically rank your pipelines by this recommendation score, you can set use_recommendation=True when initializing AutoMLSearch.
[16]:
automl_recommendation = evalml.automl.AutoMLSearch(
    X_train=X_train,
    y_train=y_train,
    X_holdout=X_holdout,
    y_holdout=y_holdout,
    problem_type="binary",
    use_recommendation=True,
)
automl_recommendation.search(interactive_plot=False)
automl_recommendation.rankings[
    [
        "id",
        "pipeline_name",
        "search_order",
        "recommendation_score",
        "holdout_score",
        "mean_cv_score",
    ]
]
[16]:
| | id | pipeline_name | search_order | recommendation_score | holdout_score | mean_cv_score |
|---|---|---|---|---|---|---|
| 0 | 1 | Random Forest Classifier w/ Label Encoder + Dr... | 1 | 92.707652 | 0.212356 | 0.253361 |
| 1 | 2 | LightGBM Classifier w/ Label Encoder + Select ... | 2 | 91.294415 | 0.160955 | 0.299971 |
| 2 | 5 | XGBoost Classifier w/ Label Encoder + Select C... | 5 | 90.703467 | 0.165049 | 0.260214 |
| 3 | 3 | Extra Trees Classifier w/ Label Encoder + Sele... | 3 | 75.896527 | 0.344941 | 0.354802 |
| 4 | 4 | Elastic Net Classifier w/ Label Encoder + Sele... | 4 | 64.711783 | 0.401431 | 0.374964 |
| 5 | 6 | Logistic Regression Classifier w/ Label Encode... | 6 | 64.653421 | 0.403180 | 0.374419 |
| 6 | 0 | Mode Baseline Binary Classification Pipeline | 0 | 25.000000 | 4.990660 | 4.921248 |
There is a helper function on the AutoMLSearch object that can help you understand how the recommendation score was calculated. It displays the raw objective scores included in the score's calculation. Here, we look at the pipeline with id=3, the Extra Trees pipeline.
[17]:
automl_recommendation.get_recommendation_score_breakdown(3)
[17]:
{'F1': 0.5454545454545454,
'AUC': 0.8363095238095238,
'Log Loss Binary': 0.34494076832526754,
'Balanced Accuracy Binary': 0.7232142857142857}
Describe Pipeline#
Each pipeline is given an id. We can get more information about any particular pipeline using that id. Here, we will get more information about the pipeline with id = 1.
[18]:
automl.describe_pipeline(1)
**************************************************************************************************************************************************************************
* Random Forest Classifier w/ Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + One Hot Encoder + Oversampler + RF Classifier Select From Model *
**************************************************************************************************************************************************************************
Problem Type: binary
Model Family: Random Forest
Pipeline Steps
==============
1. Label Encoder
* positive_label : None
2. Drop Columns Transformer
* columns : ['currency']
3. DateTime Featurizer
* features_to_extract : ['year', 'month', 'day_of_week', 'hour']
* encode_as_categories : False
* time_index : None
4. Imputer
* categorical_impute_strategy : most_frequent
* numeric_impute_strategy : mean
* boolean_impute_strategy : most_frequent
* categorical_fill_value : None
* numeric_fill_value : None
* boolean_fill_value : None
5. One Hot Encoder
* top_n : 10
* features_to_encode : None
* categories : None
* drop : if_binary
* handle_unknown : ignore
* handle_missing : error
6. Oversampler
* sampling_ratio : 0.25
* k_neighbors_default : 5
* n_jobs : -1
* sampling_ratio_dict : None
* categorical_features : [3, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49]
* k_neighbors : 5
7. RF Classifier Select From Model
* number_features : None
* n_estimators : 10
* max_depth : None
* percent_features : 0.5
* threshold : median
* n_jobs : -1
8. Random Forest Classifier
* n_estimators : 100
* max_depth : 6
* n_jobs : -1
Training
========
Training for binary problems.
Total training time (including CV): 5.2 seconds
Cross Validation
----------------
Log Loss Binary MCC Binary Gini AUC Precision F1 Balanced Accuracy Binary Accuracy Binary # Training # Validation
0 0.241 0.768 0.842 0.921 0.895 0.791 0.848 0.948 346 174
1 0.304 0.524 0.535 0.768 1.000 0.467 0.652 0.908 347 173
2 0.215 0.875 0.849 0.924 1.000 0.884 0.896 0.971 347 173
mean 0.253 0.723 0.742 0.871 0.965 0.714 0.799 0.942 - -
std 0.046 0.180 0.179 0.090 0.061 0.219 0.129 0.032 - -
coef of var 0.180 0.249 0.242 0.103 0.063 0.307 0.162 0.034 - -
Get Pipeline#
We can also get the object of any pipeline via its id:
[19]:
pipeline = automl.get_pipeline(1)
print(pipeline.name)
print(pipeline.parameters)
Random Forest Classifier w/ Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + One Hot Encoder + Oversampler + RF Classifier Select From Model
{'Label Encoder': {'positive_label': None}, 'Drop Columns Transformer': {'columns': ['currency']}, 'DateTime Featurizer': {'features_to_extract': ['year', 'month', 'day_of_week', 'hour'], 'encode_as_categories': False, 'time_index': None}, 'Imputer': {'categorical_impute_strategy': 'most_frequent', 'numeric_impute_strategy': 'mean', 'boolean_impute_strategy': 'most_frequent', 'categorical_fill_value': None, 'numeric_fill_value': None, 'boolean_fill_value': None}, 'One Hot Encoder': {'top_n': 10, 'features_to_encode': None, 'categories': None, 'drop': 'if_binary', 'handle_unknown': 'ignore', 'handle_missing': 'error'}, 'Oversampler': {'sampling_ratio': 0.25, 'k_neighbors_default': 5, 'n_jobs': -1, 'sampling_ratio_dict': None, 'categorical_features': [3, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], 'k_neighbors': 5}, 'RF Classifier Select From Model': {'number_features': None, 'n_estimators': 10, 'max_depth': None, 'percent_features': 0.5, 'threshold': 'median', 'n_jobs': -1}, 'Random Forest Classifier': {'n_estimators': 100, 'max_depth': 6, 'n_jobs': -1}}
Get best pipeline#
If you specifically want to get the best pipeline, there is a convenient accessor for that. The pipeline returned is already fitted on the input X, y data that we passed to AutoMLSearch. To turn off this default behavior, set train_best_pipeline=False when initializing AutoMLSearch.
[20]:
best_pipeline = automl.best_pipeline
print(best_pipeline.name)
print(best_pipeline.parameters)
best_pipeline.predict(X_train)
LightGBM Classifier w/ Label Encoder + Select Columns By Type Transformer + Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + Select Columns Transformer + Select Columns Transformer + Label Encoder + Imputer + One Hot Encoder + Oversampler
{'Label Encoder': {'positive_label': None}, 'Numeric Pipeline - Select Columns By Type Transformer': {'column_types': ['category', 'EmailAddress', 'URL'], 'exclude': True}, 'Numeric Pipeline - Label Encoder': {'positive_label': None}, 'Numeric Pipeline - Drop Columns Transformer': {'columns': ['currency']}, 'Numeric Pipeline - DateTime Featurizer': {'features_to_extract': ['year', 'month', 'day_of_week', 'hour'], 'encode_as_categories': False, 'time_index': None}, 'Numeric Pipeline - Imputer': {'categorical_impute_strategy': 'most_frequent', 'numeric_impute_strategy': 'mean', 'boolean_impute_strategy': 'most_frequent', 'categorical_fill_value': None, 'numeric_fill_value': None, 'boolean_fill_value': None}, 'Numeric Pipeline - Select Columns Transformer': {'columns': ['card_id', 'store_id', 'amount', 'customer_present', 'lat', 'lng', 'datetime_month', 'datetime_day_of_week', 'datetime_hour']}, 'Categorical Pipeline - Select Columns Transformer': {'columns': ['expiration_date', 'provider', 'region', 'country']}, 'Categorical Pipeline - Label Encoder': {'positive_label': None}, 'Categorical Pipeline - Imputer': {'categorical_impute_strategy': 'most_frequent', 'numeric_impute_strategy': 'mean', 'boolean_impute_strategy': 'most_frequent', 'categorical_fill_value': None, 'numeric_fill_value': None, 'boolean_fill_value': None}, 'Categorical Pipeline - One Hot Encoder': {'top_n': 10, 'features_to_encode': None, 'categories': None, 'drop': 'if_binary', 'handle_unknown': 'ignore', 'handle_missing': 'error'}, 'Oversampler': {'sampling_ratio': 0.25, 'k_neighbors_default': 5, 'n_jobs': -1, 'sampling_ratio_dict': None, 'categorical_features': [3, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48], 'k_neighbors': 5}, 'LightGBM Classifier': {'boosting_type': 'gbdt', 'learning_rate': 0.1, 'n_estimators': 100, 'max_depth': 0, 'num_leaves': 31, 'min_child_samples': 20, 'n_jobs': -1, 'bagging_freq': 0, 'bagging_fraction': 0.9, 'verbose': -1}}
[20]:
id
144 False
253 True
221 False
432 False
384 False
...
128 False
98 False
472 False
642 False
494 False
Name: fraud, Length: 520, dtype: bool
Training and Scoring Multiple Pipelines using AutoMLSearch#
AutoMLSearch will automatically fit the best pipeline on the entire training data. It also provides an easy API for training and scoring other pipelines.
If you'd like to train one or more pipelines on the entire training data, you can use the train_pipelines method.
Similarly, if you'd like to score one or more pipelines on a particular dataset, you can use the score_pipelines method.
[21]:
trained_pipelines = automl.train_pipelines([automl.get_pipeline(i) for i in [0, 1, 2]])
trained_pipelines
[21]:
{'Mode Baseline Binary Classification Pipeline': pipeline = BinaryClassificationPipeline(component_graph={'Label Encoder': ['Label Encoder', 'X', 'y'], 'Baseline Classifier': ['Baseline Classifier', 'Label Encoder.x', 'Label Encoder.y']}, parameters={'Label Encoder':{'positive_label': None}, 'Baseline Classifier':{'strategy': 'mode'}}, custom_name='Mode Baseline Binary Classification Pipeline', random_seed=0),
'Random Forest Classifier w/ Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + One Hot Encoder + Oversampler + RF Classifier Select From Model': pipeline = BinaryClassificationPipeline(component_graph={'Label Encoder': ['Label Encoder', 'X', 'y'], 'Drop Columns Transformer': ['Drop Columns Transformer', 'X', 'Label Encoder.y'], 'DateTime Featurizer': ['DateTime Featurizer', 'Drop Columns Transformer.x', 'Label Encoder.y'], 'Imputer': ['Imputer', 'DateTime Featurizer.x', 'Label Encoder.y'], 'One Hot Encoder': ['One Hot Encoder', 'Imputer.x', 'Label Encoder.y'], 'Oversampler': ['Oversampler', 'One Hot Encoder.x', 'Label Encoder.y'], 'RF Classifier Select From Model': ['RF Classifier Select From Model', 'Oversampler.x', 'Oversampler.y'], 'Random Forest Classifier': ['Random Forest Classifier', 'RF Classifier Select From Model.x', 'Oversampler.y']}, parameters={'Label Encoder':{'positive_label': None}, 'Drop Columns Transformer':{'columns': ['currency']}, 'DateTime Featurizer':{'features_to_extract': ['year', 'month', 'day_of_week', 'hour'], 'encode_as_categories': False, 'time_index': None}, 'Imputer':{'categorical_impute_strategy': 'most_frequent', 'numeric_impute_strategy': 'mean', 'boolean_impute_strategy': 'most_frequent', 'categorical_fill_value': None, 'numeric_fill_value': None, 'boolean_fill_value': None}, 'One Hot Encoder':{'top_n': 10, 'features_to_encode': None, 'categories': None, 'drop': 'if_binary', 'handle_unknown': 'ignore', 'handle_missing': 'error'}, 'Oversampler':{'sampling_ratio': 0.25, 'k_neighbors_default': 5, 'n_jobs': -1, 'sampling_ratio_dict': None, 'categorical_features': [3, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], 'k_neighbors': 5}, 'RF Classifier Select From Model':{'number_features': None, 'n_estimators': 10, 'max_depth': None, 'percent_features': 0.5, 'threshold': 'median', 'n_jobs': -1}, 'Random Forest Classifier':{'n_estimators': 100, 'max_depth': 6, 'n_jobs': -1}}, random_seed=0),
'LightGBM Classifier w/ Label Encoder + Select Columns By Type Transformer + Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + Select Columns Transformer + Select Columns Transformer + Label Encoder + Imputer + One Hot Encoder + Oversampler': pipeline = BinaryClassificationPipeline(component_graph={'Label Encoder': ['Label Encoder', 'X', 'y'], 'Numeric Pipeline - Select Columns By Type Transformer': ['Select Columns By Type Transformer', 'X', 'Label Encoder.y'], 'Numeric Pipeline - Label Encoder': ['Label Encoder', 'Numeric Pipeline - Select Columns By Type Transformer.x', 'Label Encoder.y'], 'Numeric Pipeline - Drop Columns Transformer': ['Drop Columns Transformer', 'Numeric Pipeline - Select Columns By Type Transformer.x', 'Numeric Pipeline - Label Encoder.y'], 'Numeric Pipeline - DateTime Featurizer': ['DateTime Featurizer', 'Numeric Pipeline - Drop Columns Transformer.x', 'Numeric Pipeline - Label Encoder.y'], 'Numeric Pipeline - Imputer': ['Imputer', 'Numeric Pipeline - DateTime Featurizer.x', 'Numeric Pipeline - Label Encoder.y'], 'Numeric Pipeline - Select Columns Transformer': ['Select Columns Transformer', 'Numeric Pipeline - Imputer.x', 'Numeric Pipeline - Label Encoder.y'], 'Categorical Pipeline - Select Columns Transformer': ['Select Columns Transformer', 'X', 'Label Encoder.y'], 'Categorical Pipeline - Label Encoder': ['Label Encoder', 'Categorical Pipeline - Select Columns Transformer.x', 'Label Encoder.y'], 'Categorical Pipeline - Imputer': ['Imputer', 'Categorical Pipeline - Select Columns Transformer.x', 'Categorical Pipeline - Label Encoder.y'], 'Categorical Pipeline - One Hot Encoder': ['One Hot Encoder', 'Categorical Pipeline - Imputer.x', 'Categorical Pipeline - Label Encoder.y'], 'Oversampler': ['Oversampler', 'Numeric Pipeline - Select Columns Transformer.x', 'Categorical Pipeline - One Hot Encoder.x', 'Categorical Pipeline - Label Encoder.y'], 'LightGBM Classifier': ['LightGBM Classifier', 'Oversampler.x', 'Oversampler.y']}, parameters={'Label Encoder':{'positive_label': None}, 'Numeric Pipeline - Select Columns By Type Transformer':{'column_types': ['category', 'EmailAddress', 'URL'], 'exclude': True}, 'Numeric Pipeline - Label Encoder':{'positive_label': None}, 'Numeric Pipeline - Drop Columns Transformer':{'columns': ['currency']}, 'Numeric Pipeline - DateTime Featurizer':{'features_to_extract': ['year', 'month', 'day_of_week', 'hour'], 'encode_as_categories': False, 'time_index': None}, 'Numeric Pipeline - Imputer':{'categorical_impute_strategy': 'most_frequent', 'numeric_impute_strategy': 'mean', 'boolean_impute_strategy': 'most_frequent', 'categorical_fill_value': None, 'numeric_fill_value': None, 'boolean_fill_value': None}, 'Numeric Pipeline - Select Columns Transformer':{'columns': ['card_id', 'store_id', 'amount', 'customer_present', 'lat', 'lng', 'datetime_month', 'datetime_day_of_week', 'datetime_hour']}, 'Categorical Pipeline - Select Columns Transformer':{'columns': ['expiration_date', 'provider', 'region', 'country']}, 'Categorical Pipeline - Label Encoder':{'positive_label': None}, 'Categorical Pipeline - Imputer':{'categorical_impute_strategy': 'most_frequent', 'numeric_impute_strategy': 'mean', 'boolean_impute_strategy': 'most_frequent', 'categorical_fill_value': None, 'numeric_fill_value': None, 'boolean_fill_value': None}, 'Categorical Pipeline - One Hot Encoder':{'top_n': 10, 'features_to_encode': None, 'categories': None, 'drop': 'if_binary', 'handle_unknown': 'ignore', 'handle_missing': 'error'}, 
'Oversampler':{'sampling_ratio': 0.25, 'k_neighbors_default': 5, 'n_jobs': -1, 'sampling_ratio_dict': None, 'categorical_features': [3, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48], 'k_neighbors': 5}, 'LightGBM Classifier':{'boosting_type': 'gbdt', 'learning_rate': 0.1, 'n_estimators': 100, 'max_depth': 0, 'num_leaves': 31, 'min_child_samples': 20, 'n_jobs': -1, 'bagging_freq': 0, 'bagging_fraction': 0.9, 'verbose': -1}}, random_seed=0)}
[22]:
pipeline_holdout_scores = automl.score_pipelines(
    [trained_pipelines[name] for name in trained_pipelines.keys()],
    X_holdout,
    y_holdout,
    ["Accuracy Binary", "F1", "AUC"],
)
pipeline_holdout_scores
pipeline_holdout_scores
[22]:
{'Mode Baseline Binary Classification Pipeline': OrderedDict([('Accuracy Binary',
0.8615384615384616),
('F1', 0.0),
('AUC', 0.5)]),
'Random Forest Classifier w/ Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + One Hot Encoder + Oversampler + RF Classifier Select From Model': OrderedDict([('Accuracy Binary',
0.9769230769230769),
('F1', 0.9090909090909091),
('AUC', 0.9250992063492064)]),
'LightGBM Classifier w/ Label Encoder + Select Columns By Type Transformer + Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + Select Columns Transformer + Select Columns Transformer + Label Encoder + Imputer + One Hot Encoder + Oversampler': OrderedDict([('Accuracy Binary',
0.9692307692307692),
('F1', 0.875),
('AUC', 0.9201388888888888)])}
Saving AutoMLSearch and pipelines from AutoMLSearch#
There are two ways to save the results of AutoMLSearch.
You can save the AutoMLSearch object itself by calling .save(<filepath>). This will allow you to save the AutoMLSearch state and reload all pipelines from it.
If you want to save a pipeline from AutoMLSearch for future use, pipeline classes themselves have a .save(<filepath>) method.
[23]:
# saving the entire automl search
automl.save("automl.cloudpickle")
automl2 = evalml.automl.AutoMLSearch.load("automl.cloudpickle")
# saving the best pipeline using .save()
best_pipeline.save("pipeline.cloudpickle")
best_pipeline_copy = evalml.pipelines.PipelineBase.load("pipeline.cloudpickle")
Limiting the AutoML Search Space#
The AutoML search algorithm first trains each component in the pipeline with its default values. After the first iteration, it then tweaks the parameters of these components using the predefined hyperparameter ranges that these components have. To limit the search over certain hyperparameter ranges, you can specify a search_parameters argument in the AutoMLSearch parameters. These parameters will limit the hyperparameter search space or the pipeline parameter space.
Hyperparameter ranges can be found through the API reference for each component. Parameters must be specified as dictionaries, and the associated values must be skopt.space Real, Integer, or Categorical objects in order to set hyperparameter ranges.
If, however, you'd like to specify certain values for the initial batch of the AutoML search algorithm, you can use the search_parameters argument with non-skopt.space objects. This will set the initial batch's component parameters to the values passed through this argument.
[24]:
from evalml import AutoMLSearch
from evalml.demos import load_fraud
from skopt.space import Categorical
from evalml.model_family import ModelFamily
import woodwork as ww
X, y = load_fraud(n_rows=1000)
# example of setting parameter to just one value
search_parameters = {"Imputer": {"numeric_impute_strategy": "mean"}}
# limit the numeric impute strategy to include only `median` and `most_frequent`
# `mean` is the default value for this argument, but it doesn't need to be included in the specified hyperparameter range for this to work
search_parameters = {
    "Imputer": {"numeric_impute_strategy": Categorical(["median", "most_frequent"])}
}
# using this custom hyperparameter means that our Imputer components in these pipelines will only search through
# 'median' and 'most_frequent' strategies for 'numeric_impute_strategy'
automl_constrained = AutoMLSearch(
    X_train=X,
    y_train=y,
    problem_type="binary",
    search_parameters=search_parameters,
    verbose=True,
)
Number of Features
Boolean 1
Categorical 6
Numeric 5
Number of training examples: 1000
Targets
False 85.90%
True 14.10%
Name: count, dtype: object
AutoMLSearch will use mean CV score to rank pipelines.
Using default limit of max_batches=2.
/home/docs/checkouts/readthedocs.org/user_builds/feature-labs-inc-evalml/envs/stable/lib/python3.9/site-packages/woodwork/type_sys/utils.py:33: UserWarning:
Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
A skopt.space Integer, Real, or Categorical will set the hyperparameter space to be explored during search. All other values will set the pipeline parameters directly. Setting pipeline parameters directly defines the initialization parameters that the pipeline starts with during the first batch of AutoMLSearch. The hyperparameter range, on the other hand, defines the space of possible new parameter values that the tuner chooses from.
Let's walk through some examples to explain this. For instance,
search_parameters = {'Imputer': {
    'numeric_impute_strategy': 'mean'
}}
then in the initial search, the algorithm will use mean as the impute strategy in batch 1. However, since Imputer.numeric_impute_strategy has a valid hyperparameter range, the algorithm can and will change this value if it suggests a different strategy. To limit the search to mean for the entire duration, the skopt.space form is needed:
search_parameters = {'Imputer': {
    'numeric_impute_strategy': Categorical(['mean'])
}}
However, if a value has no associated hyperparameter range, the algorithm will use this value as the only parameter. For instance,
search_parameters = {'Label Encoder': {
    'positive_label': True
}}
Since Label Encoder.positive_label has no associated hyperparameter range, the algorithm will use this parameter for the entire duration of the search.
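Putting the two behaviors side by side, a sketch of a combined search_parameters dictionary:
from skopt.space import Categorical

search_parameters = {
    # skopt.space object: restricts the hyperparameter range, so the tuner
    # can only ever choose "mean" for the entire search
    "Imputer": {"numeric_impute_strategy": Categorical(["mean"])},
    # plain value with no associated hyperparameter range: used as the only
    # value for the entire search
    "Label Encoder": {"positive_label": True},
}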
Imbalanced Data#
The AutoML search algorithm now has functionality to handle imbalanced data during classification! AutoMLSearch provides two additional parameters, sampler_method and sampler_balanced_ratio, that allow you to let AutoMLSearch know whether to sample imbalanced data, and how to do so. sampler_method takes in Undersampler, Oversampler, auto, or None as the sampler to use, and sampler_balanced_ratio specifies the minority/majority ratio that you want to sample to. Details on the Undersampler and Oversampler components can be found in the documentation.
This can be used for imbalanced datasets, like the fraud dataset, which has a minority:majority ratio of < 0.2.
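For instance, a sketch forcing undersampling rather than the automatic choice:
automl_undersampled = AutoMLSearch(
    X_train=X,
    y_train=y,
    problem_type="binary",
    sampler_method="Undersampler",  # one of "Undersampler", "Oversampler", "auto", or None
    sampler_balanced_ratio=0.25,  # sample toward a 0.25 minority/majority ratio
)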
[25]:
automl_auto = AutoMLSearch(
    X_train=X, y_train=y, problem_type="binary", automl_algorithm="iterative"
)
automl_auto.allowed_pipelines[-1]
[25]:
pipeline = BinaryClassificationPipeline(component_graph={'Label Encoder': ['Label Encoder', 'X', 'y'], 'DateTime Featurizer': ['DateTime Featurizer', 'X', 'Label Encoder.y'], 'Imputer': ['Imputer', 'DateTime Featurizer.x', 'Label Encoder.y'], 'One Hot Encoder': ['One Hot Encoder', 'Imputer.x', 'Label Encoder.y'], 'Oversampler': ['Oversampler', 'One Hot Encoder.x', 'Label Encoder.y'], 'Extra Trees Classifier': ['Extra Trees Classifier', 'Oversampler.x', 'Oversampler.y']}, parameters={'Label Encoder':{'positive_label': None}, 'DateTime Featurizer':{'features_to_extract': ['year', 'month', 'day_of_week', 'hour'], 'encode_as_categories': False, 'time_index': None}, 'Imputer':{'categorical_impute_strategy': 'most_frequent', 'numeric_impute_strategy': 'mean', 'boolean_impute_strategy': 'most_frequent', 'categorical_fill_value': None, 'numeric_fill_value': None, 'boolean_fill_value': None}, 'One Hot Encoder':{'top_n': 10, 'features_to_encode': None, 'categories': None, 'drop': 'if_binary', 'handle_unknown': 'ignore', 'handle_missing': 'error'}, 'Oversampler':{'sampling_ratio': 0.25, 'k_neighbors_default': 5, 'n_jobs': -1, 'sampling_ratio_dict': None}, 'Extra Trees Classifier':{'n_estimators': 100, 'max_features': 'sqrt', 'max_depth': 6, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_jobs': -1}}, random_seed=0)
The Oversampler is chosen as the default sampling component here, since sampler_balanced_ratio = 0.25. If you specified a lower ratio, say sampler_balanced_ratio = 0.1, then no sampling component would be added here. This is because if a ratio of 0.1 is considered balanced, then a ratio of 0.2 is also balanced.
The Oversampler uses SMOTE under the hood, and automatically selects whether to use SMOTE, SMOTEN, or SMOTENC based on the data it receives.
[26]:
automl_auto_ratio = AutoMLSearch(
    X_train=X,
    y_train=y,
    problem_type="binary",
    sampler_balanced_ratio=0.1,
    automl_algorithm="iterative",
)
automl_auto_ratio.allowed_pipelines[-1]
[26]:
pipeline = BinaryClassificationPipeline(component_graph={'Label Encoder': ['Label Encoder', 'X', 'y'], 'DateTime Featurizer': ['DateTime Featurizer', 'X', 'Label Encoder.y'], 'Imputer': ['Imputer', 'DateTime Featurizer.x', 'Label Encoder.y'], 'One Hot Encoder': ['One Hot Encoder', 'Imputer.x', 'Label Encoder.y'], 'Extra Trees Classifier': ['Extra Trees Classifier', 'One Hot Encoder.x', 'Label Encoder.y']}, parameters={'Label Encoder':{'positive_label': None}, 'DateTime Featurizer':{'features_to_extract': ['year', 'month', 'day_of_week', 'hour'], 'encode_as_categories': False, 'time_index': None}, 'Imputer':{'categorical_impute_strategy': 'most_frequent', 'numeric_impute_strategy': 'mean', 'boolean_impute_strategy': 'most_frequent', 'categorical_fill_value': None, 'numeric_fill_value': None, 'boolean_fill_value': None}, 'One Hot Encoder':{'top_n': 10, 'features_to_encode': None, 'categories': None, 'drop': 'if_binary', 'handle_unknown': 'ignore', 'handle_missing': 'error'}, 'Extra Trees Classifier':{'n_estimators': 100, 'max_features': 'sqrt', 'max_depth': 6, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_jobs': -1}}, random_seed=0)
Additionally, you can add more fine-grained sampling ratios by passing in a sampling_ratio_dict in the pipeline parameters. For this dictionary, AutoMLSearch requires the keys to be integer values from 0 to n-1 for the classes, and the values to be the sampler_balanced_ratio associated with each target. This dictionary overrides the AutoML argument sampler_balanced_ratio. Below, you can see the scenario for the Oversampler component on this dataset. Note that the logic for Undersamplers is included in the commented-out section.
[27]:
# In this case, the majority class is the negative class
# for the oversampler, we don't want to oversample this class, so class 0 (majority) will have a ratio of 1 to itself
# for the minority class 1, we want to oversample it to have a minority/majority ratio of 0.5, which means we want the minority to have 1/2 the samples of the majority
sampler_ratio_dict = {0: 1, 1: 0.5}
search_parameters = {"Oversampler": {"sampler_balanced_ratio": sampler_ratio_dict}}
automl_auto_ratio_dict = AutoMLSearch(
    X_train=X,
    y_train=y,
    problem_type="binary",
    search_parameters=search_parameters,
    automl_algorithm="iterative",
)
automl_auto_ratio_dict.allowed_pipelines[-1]

# Undersampler case
# we don't want to undersample this class, so class 1 (minority) will have a ratio of 1 to itself
# for the majority class 0, we want to undersample it to have a minority/majority ratio of 0.5, which means we want the majority to have 2x the samples of the minority
# sampler_ratio_dict = {0: 0.5, 1: 1}
# search_parameters = {"Undersampler": {"sampler_balanced_ratio": sampler_ratio_dict}}
# automl_auto_ratio_dict = AutoMLSearch(X_train=X, y_train=y, problem_type='binary', search_parameters=search_parameters)
[27]:
pipeline = BinaryClassificationPipeline(component_graph={'Label Encoder': ['Label Encoder', 'X', 'y'], 'DateTime Featurizer': ['DateTime Featurizer', 'X', 'Label Encoder.y'], 'Imputer': ['Imputer', 'DateTime Featurizer.x', 'Label Encoder.y'], 'One Hot Encoder': ['One Hot Encoder', 'Imputer.x', 'Label Encoder.y'], 'Oversampler': ['Oversampler', 'One Hot Encoder.x', 'Label Encoder.y'], 'Extra Trees Classifier': ['Extra Trees Classifier', 'Oversampler.x', 'Oversampler.y']}, parameters={'Label Encoder':{'positive_label': None}, 'DateTime Featurizer':{'features_to_extract': ['year', 'month', 'day_of_week', 'hour'], 'encode_as_categories': False, 'time_index': None}, 'Imputer':{'categorical_impute_strategy': 'most_frequent', 'numeric_impute_strategy': 'mean', 'boolean_impute_strategy': 'most_frequent', 'categorical_fill_value': None, 'numeric_fill_value': None, 'boolean_fill_value': None}, 'One Hot Encoder':{'top_n': 10, 'features_to_encode': None, 'categories': None, 'drop': 'if_binary', 'handle_unknown': 'ignore', 'handle_missing': 'error'}, 'Oversampler':{'sampling_ratio': 0.25, 'k_neighbors_default': 5, 'n_jobs': -1, 'sampling_ratio_dict': None, 'sampler_balanced_ratio': {0: 1, 1: 0.5}}, 'Extra Trees Classifier':{'n_estimators': 100, 'max_features': 'sqrt', 'max_depth': 6, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_jobs': -1}}, random_seed=0)
Adding ensemble methods to AutoML#
Stacking#
Stacking is an ensemble machine learning algorithm that involves training a model to best combine the predictions from multiple base learning algorithms. First, each base learning algorithm is trained using the given data. Then, the combining algorithm or meta-learner is trained on the predictions made by those base learning algorithms to make a final prediction.
AutoML enables stacking using the ensembling flag during initialization; this is set to False by default. How ensembling runs is defined by the AutoML algorithm you choose. In IterativeAlgorithm, the stacking ensemble pipeline runs in its own batch after a whole cycle of training has occurred (that is, after each allowed pipeline trains for one batch). Note that this means a large number of iterations may need to run before the stacking ensemble runs. It is also important to note that only the first CV fold is calculated for stacking ensembles, because the model internally uses CV folds. See the AutoML Algorithms section below for how DefaultAlgorithm runs ensembling. Please note that ensembling is currently unavailable for time series problems.
[28]:
X, y = evalml.demos.load_breast_cancer()
automl_with_ensembling = AutoMLSearch(
    X_train=X,
    y_train=y,
    problem_type="binary",
    allowed_model_families=[ModelFamily.LINEAR_MODEL],
    max_batches=4,
    ensembling=True,
    automl_algorithm="iterative",
    verbose=True,
)
automl_with_ensembling.search(interactive_plot=False)
Number of Features
Numeric 30
Number of training examples: 569
Targets
benign 62.74%
malignant 37.26%
Name: count, dtype: object
AutoMLSearch will use mean CV score to rank pipelines.
Generating pipelines to search over...
Ensembling will run every 3 batches.
2 pipelines ready for search.
*****************************
* Beginning pipeline search *
*****************************
Optimizing for Log Loss Binary.
Lower score is better.
Using SequentialEngine to train and score pipelines.
Searching up to 4 batches for a total of 14 pipelines.
Allowed model families: linear_model, linear_model
Evaluating Baseline Pipeline: Mode Baseline Binary Classification Pipeline
/home/docs/checkouts/readthedocs.org/user_builds/feature-labs-inc-evalml/envs/stable/lib/python3.9/site-packages/woodwork/type_sys/utils.py:33: UserWarning:
Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
Mode Baseline Binary Classification Pipeline:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 13.429
*****************************
* Evaluating Batch Number 1 *
*****************************
Elastic Net Classifier w/ Label Encoder + Imputer + Standard Scaler:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.077
Logistic Regression Classifier w/ Label Encoder + Imputer + Standard Scaler:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.077
*****************************
* Evaluating Batch Number 2 *
*****************************
Logistic Regression Classifier w/ Label Encoder + Imputer + Standard Scaler:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.090
Logistic Regression Classifier w/ Label Encoder + Imputer + Standard Scaler:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.085
Logistic Regression Classifier w/ Label Encoder + Imputer + Standard Scaler:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.081
Logistic Regression Classifier w/ Label Encoder + Imputer + Standard Scaler:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.097
Logistic Regression Classifier w/ Label Encoder + Imputer + Standard Scaler:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.093
*****************************
* Evaluating Batch Number 3 *
*****************************
Elastic Net Classifier w/ Label Encoder + Imputer + Standard Scaler:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.075
Elastic Net Classifier w/ Label Encoder + Imputer + Standard Scaler:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.076
Elastic Net Classifier w/ Label Encoder + Imputer + Standard Scaler:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.075
Elastic Net Classifier w/ Label Encoder + Imputer + Standard Scaler:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.079
Elastic Net Classifier w/ Label Encoder + Imputer + Standard Scaler:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.075
*****************************
* Evaluating Batch Number 4 *
*****************************
Stacked Ensemble Classification Pipeline:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.103
Search finished after 19.55 seconds
Best pipeline: Elastic Net Classifier w/ Label Encoder + Imputer + Standard Scaler
Best pipeline Log Loss Binary: 0.075391
[28]:
{1: {'Elastic Net Classifier w/ Label Encoder + Imputer + Standard Scaler': 1.3139324188232422,
'Logistic Regression Classifier w/ Label Encoder + Imputer + Standard Scaler': 1.4192306995391846,
'Total time of batch': 2.940805196762085},
2: {'Logistic Regression Classifier w/ Label Encoder + Imputer + Standard Scaler': 1.4436862468719482,
'Total time of batch': 7.696403741836548},
3: {'Elastic Net Classifier w/ Label Encoder + Imputer + Standard Scaler': 1.2711994647979736,
'Total time of batch': 7.111368894577026},
4: {'Stacked Ensemble Classification Pipeline': 1.1113545894622803,
'Total time of batch': 1.2259869575500488}}
By calling .describe(), we can view more information about the best pipeline. Note that in this run the Elastic Net pipeline outperformed the stacked ensemble, so it is the best pipeline here.
[29]:
automl_with_ensembling.best_pipeline.describe()
***********************************************************************
* Elastic Net Classifier w/ Label Encoder + Imputer + Standard Scaler *
***********************************************************************
Problem Type: binary
Model Family: Linear
Number of features: 30
Pipeline Steps
==============
1. Label Encoder
* positive_label : None
2. Imputer
* categorical_impute_strategy : most_frequent
* numeric_impute_strategy : knn
* boolean_impute_strategy : most_frequent
* categorical_fill_value : None
* numeric_fill_value : None
* boolean_fill_value : None
3. Standard Scaler
4. Elastic Net Classifier
* penalty : elasticnet
* C : 8.474044870453413
* l1_ratio : 0.6235636967859725
* n_jobs : -1
* multi_class : auto
* solver : saga
AutoML Algorithms#
EvalML currently has two algorithms available for users to choose from. Below, we will run through how each algorithm works and how to access them through AutoMLSearch and the top-level search methods.
IterativeAlgorithm#
IterativeAlgorithm#
IterativeAlgorithm is the first AutoML algorithm created in EvalML and can be accessed with the search_iterative method or by specifying AutoMLSearch(automl_algorithm='iterative'). The algorithm works as follows:
Every batch (after the initial baseline model) contains pipelines of all of the available estimators for the specified problem type
Pipelines contain the preprocessing (imputing, encoding, etc.) needed for machine learning, but no feature selection is applied
Ensembling can be turned on by passing in the ensembling=True parameter, and runs after a whole cycle of training has occurred (each allowed pipeline trains for one batch)
[30]:
import evalml
X, y = evalml.demos.load_fraud(n_rows=250)
Number of Features
Boolean 1
Categorical 6
Numeric 5
Number of training examples: 250
Targets
False 88.40%
True 11.60%
Name: count, dtype: object
/home/docs/checkouts/readthedocs.org/user_builds/feature-labs-inc-evalml/envs/stable/lib/python3.9/site-packages/woodwork/type_sys/utils.py:33: UserWarning:
Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
[31]:
from evalml.automl import search_iterative
# top level search method will run `AutoMLSearch` with `IterativeAlgorithm` as well as apply our default data checks
auto_iterative, messages_iterative = search_iterative(X, y, problem_type="binary")
[32]:
from evalml import AutoMLSearch
auto_iterative = AutoMLSearch(
X_train=X,
y_train=y,
problem_type="binary",
automl_algorithm="iterative",
verbose=True,
)
auto_iterative.search(interactive_plot=False)
AutoMLSearch will use mean CV score to rank pipelines.
Removing columns ['currency', 'expiration_date'] because they are of 'Unknown' type
Generating pipelines to search over...
6 pipelines ready for search.
Using default limit of max_batches=1.
*****************************
* Beginning pipeline search *
*****************************
Optimizing for Log Loss Binary.
Lower score is better.
Using SequentialEngine to train and score pipelines.
Searching up to 1 batches for a total of None pipelines.
Allowed model families: linear_model, linear_model, xgboost, lightgbm, random_forest, extra_trees
Evaluating Baseline Pipeline: Mode Baseline Binary Classification Pipeline
Mode Baseline Binary Classification Pipeline:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 4.181
*****************************
* Evaluating Batch Number 1 *
*****************************
Elastic Net Classifier w/ Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + One Hot Encoder + Oversampler + Standard Scaler:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.426
Logistic Regression Classifier w/ Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + One Hot Encoder + Oversampler + Standard Scaler:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.425
XGBoost Classifier w/ Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + One Hot Encoder + Oversampler:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.275
LightGBM Classifier w/ Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + One Hot Encoder + Oversampler:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.325
Random Forest Classifier w/ Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + One Hot Encoder + Oversampler:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.290
Extra Trees Classifier w/ Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + One Hot Encoder + Oversampler:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.350
Search finished after 17.86 seconds
Best pipeline: XGBoost Classifier w/ Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + One Hot Encoder + Oversampler
Best pipeline Log Loss Binary: 0.275292
[32]:
{1: {'Elastic Net Classifier w/ Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + One Hot Encoder + Oversampler + Standard Scaler': 2.609773635864258,
'Logistic Regression Classifier w/ Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + One Hot Encoder + Oversampler + Standard Scaler': 3.1346724033355713,
'XGBoost Classifier w/ Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + One Hot Encoder + Oversampler': 2.61729097366333,
'LightGBM Classifier w/ Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + One Hot Encoder + Oversampler': 2.143817901611328,
'Random Forest Classifier w/ Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + One Hot Encoder + Oversampler': 3.1884095668792725,
'Extra Trees Classifier w/ Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + One Hot Encoder + Oversampler': 3.067530870437622,
'Total time of batch': 17.37200093269348}}
DefaultAlgorithm#
DefaultAlgorithm was designed to do three main things:
Abstract out more parameters and decisions from the user.
Perform deeper tuning for high performing pipelines.
Create a platform to introduce feature selection as well as other potential techniques/heuristics for AutoML.
DefaultAlgorithm does this by creating the concept of two modes: fast and long, where fast is a subset of long. The algorithm runs as follows:
1. Run naive pipelines:
   a. a random forest pipeline with the default preprocessing pipeline
2. Run the same pipelines, this time with feature selection. Subsequent pipelines will use the selected features with a SelectedColumns transformer.
3. Run all pipelines with preprocessing components:
   a. scan the rest of the estimators (IterativeAlgorithm batch 1).
4. First ensembling run
Fast mode ends here. Begin long mode.
6. Run the top 3 estimators:
   a. Generate 50 random parameter sets. Run all 150 in one batch.
7. Second ensembling run
8. Repeat steps 8a and 8b indefinitely until the time specified in AutoMLSearch is reached:
   a. For each of the previous top 3 estimators, sample 10 parameters from the tuner. Run all 30 in one batch.
   b. Run ensembling
Consequently, it is recommended to run DefaultAlgorithm via the top-level search() method. This allows users to kick off a search with just the mode parameter: fast mode is recommended for users who want a quick impression of how EvalML pipelines perform on their problem, while long mode is reserved for a deeper exploration of high performing pipelines. If finer control over the AutoML parameters is needed, AutoMLSearch can also be used with automl_algorithm='default', which defaults to fast mode. In that case, however, ensembling is governed by the ensembling flag (if ensembling=False, the ensembling batches described above are skipped). Users are welcome to choose max_batches according to the algorithm above (or other stopping criteria), but should be aware that results may not be optimal if the search does not run for the full length of fast mode. Note that the allowed_model_families and excluded_model_families parameters are only applied to the non-naive batches in the default algorithm. If you want them applied to all estimators, use the iterative algorithm by specifying automl_algorithm='iterative'.
[33]:
from evalml.automl import search
# top level search method will run `AutoMLSearch` with `DefaultAlgorithm` as well as apply our default data checks
auto_default, messages_default = search(X, y, problem_type="binary", mode="fast")
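For a deeper exploration, a minimal sketch of long mode (not executed here) simply changes the mode argument:
from evalml.automl import search

# Long mode continues past the fast-mode batches to tune the top
# estimators and run additional ensembling; expect a much longer runtime.
auto_long, messages_long = search(X, y, problem_type="binary", mode="long")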
[34]:
from evalml import AutoMLSearch
auto_default = AutoMLSearch(
X_train=X,
y_train=y,
problem_type="binary",
automl_algorithm="default",
ensembling=True,
verbose=True,
)
auto_default.search(interactive_plot=False)
AutoMLSearch will use mean CV score to rank pipelines.
Removing columns ['currency', 'expiration_date'] because they are of 'Unknown' type
Using default limit of max_batches=3.
*****************************
* Beginning pipeline search *
*****************************
Optimizing for Log Loss Binary.
Lower score is better.
Using SequentialEngine to train and score pipelines.
Searching up to 3 batches for a total of None pipelines.
Allowed model families:
Evaluating Baseline Pipeline: Mode Baseline Binary Classification Pipeline
Mode Baseline Binary Classification Pipeline:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 4.181
*****************************
* Evaluating Batch Number 1 *
*****************************
Random Forest Classifier w/ Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + One Hot Encoder + Oversampler + RF Classifier Select From Model:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.270
*****************************
* Evaluating Batch Number 2 *
*****************************
LightGBM Classifier w/ Label Encoder + Select Columns By Type Transformer + Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + Select Columns Transformer + Select Columns Transformer + Label Encoder + Imputer + One Hot Encoder + Oversampler:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.325
Extra Trees Classifier w/ Label Encoder + Select Columns By Type Transformer + Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + Select Columns Transformer + Select Columns Transformer + Label Encoder + Imputer + One Hot Encoder + Oversampler:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.349
Elastic Net Classifier w/ Label Encoder + Select Columns By Type Transformer + Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + Standard Scaler + Select Columns Transformer + Select Columns Transformer + Label Encoder + Imputer + One Hot Encoder + Standard Scaler + Oversampler:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.418
XGBoost Classifier w/ Label Encoder + Select Columns By Type Transformer + Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + Select Columns Transformer + Select Columns Transformer + Label Encoder + Imputer + One Hot Encoder + Oversampler:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.275
Logistic Regression Classifier w/ Label Encoder + Select Columns By Type Transformer + Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + Standard Scaler + Select Columns Transformer + Select Columns Transformer + Label Encoder + Imputer + One Hot Encoder + Standard Scaler + Oversampler:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.418
*****************************
* Evaluating Batch Number 3 *
*****************************
Stacked Ensemble Classification Pipeline:
Starting cross validation
Finished cross validation - mean Log Loss Binary: 0.237
Search finished after 29.29 seconds
Best pipeline: Stacked Ensemble Classification Pipeline
Best pipeline Log Loss Binary: 0.236982
[34]:
{1: {'Random Forest Classifier w/ Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + One Hot Encoder + Oversampler + RF Classifier Select From Model': 3.8598780632019043,
'Total time of batch': 3.989288568496704},
2: {'LightGBM Classifier w/ Label Encoder + Select Columns By Type Transformer + Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + Select Columns Transformer + Select Columns Transformer + Label Encoder + Imputer + One Hot Encoder + Oversampler': 2.4690487384796143,
'Extra Trees Classifier w/ Label Encoder + Select Columns By Type Transformer + Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + Select Columns Transformer + Select Columns Transformer + Label Encoder + Imputer + One Hot Encoder + Oversampler': 2.97183895111084,
'Elastic Net Classifier w/ Label Encoder + Select Columns By Type Transformer + Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + Standard Scaler + Select Columns Transformer + Select Columns Transformer + Label Encoder + Imputer + One Hot Encoder + Standard Scaler + Oversampler': 3.183461904525757,
'XGBoost Classifier w/ Label Encoder + Select Columns By Type Transformer + Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + Select Columns Transformer + Select Columns Transformer + Label Encoder + Imputer + One Hot Encoder + Oversampler': 2.9495766162872314,
'Logistic Regression Classifier w/ Label Encoder + Select Columns By Type Transformer + Label Encoder + Drop Columns Transformer + DateTime Featurizer + Imputer + Standard Scaler + Select Columns Transformer + Select Columns Transformer + Label Encoder + Imputer + One Hot Encoder + Standard Scaler + Oversampler': 3.3269076347351074,
'Total time of batch': 15.692172765731812},
3: {'Stacked Ensemble Classification Pipeline': 8.96763300895691,
'Total time of batch': 9.111598491668701}}
Pipeline differences#
Through the search output above, we can see the pipeline differences between IterativeAlgorithm and DefaultAlgorithm. This is because DefaultAlgorithm performs feature selection with new components such as RFRegressorSelectFromModel and other column selectors, and uses a new pipeline structure that handles feature selection separately for categorical and non-categorical features.
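Before graphing them, a quick way to see the structural difference is to print the two pipelines' names; a minimal sketch using the same pipeline ids as the cells below:
# The DefaultAlgorithm pipeline name lists the extra column selectors and
# feature selection components that the iterative pipeline lacks.
print(auto_iterative.get_pipeline(4).name)
print(auto_default.get_pipeline(6).name)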
[35]:
auto_iterative.get_pipeline(4).graph()
[35]:
[36]:
auto_default.get_pipeline(6).graph()
[36]:
Access raw results#
The AutoMLSearch class records detailed results information under the results field, including information about the cross-validation scoring and parameters.
[37]:
import pprint
pp = pprint.PrettyPrinter(indent=0, width=100, depth=3, compact=True, sort_dicts=False)
pp.pprint(automl.results)
{'pipeline_results': {0: {'id': 0,
'pipeline_name': 'Mode Baseline Binary Classification Pipeline',
'pipeline_class': <class 'evalml.pipelines.binary_classification_pipeline.BinaryClassificationPipeline'>,
'pipeline_summary': 'Baseline Classifier w/ Label Encoder',
'parameters': {...},
'mean_cv_score': 4.921248270190403,
'standard_deviation_cv_score': 0.11291020093698304,
'high_variance_cv': False,
'training_time': 0.6287055015563965,
'cv_data': [...],
'percent_better_than_baseline_all_objectives': {...},
'percent_better_than_baseline': 0,
'ranking_score': 4.990659700031606,
'ranking_additional_objectives': {...},
'holdout_score': 4.990659700031606},
1: {'id': 1,
'pipeline_name': 'Random Forest Classifier w/ Label Encoder + Drop Columns '
'Transformer + DateTime Featurizer + Imputer + One Hot '
'Encoder + Oversampler + RF Classifier Select From Model',
'pipeline_class': <class 'evalml.pipelines.binary_classification_pipeline.BinaryClassificationPipeline'>,
'pipeline_summary': 'Random Forest Classifier w/ Label Encoder + Drop '
'Columns Transformer + DateTime Featurizer + Imputer + '
'One Hot Encoder + Oversampler + RF Classifier Select '
'From Model',
'parameters': {...},
'mean_cv_score': 0.2533614735001717,
'standard_deviation_cv_score': 0.045592460569307824,
'high_variance_cv': False,
'training_time': 5.226076364517212,
'cv_data': [...],
'percent_better_than_baseline_all_objectives': {...},
'percent_better_than_baseline': 94.85168275222236,
'ranking_score': 0.21235607618427976,
'ranking_additional_objectives': {...},
'holdout_score': 0.21235607618427976},
2: {'id': 2,
'pipeline_name': 'LightGBM Classifier w/ Label Encoder + Select Columns By '
'Type Transformer + Label Encoder + Drop Columns '
'Transformer + DateTime Featurizer + Imputer + Select '
'Columns Transformer + Select Columns Transformer + Label '
'Encoder + Imputer + One Hot Encoder + Oversampler',
'pipeline_class': <class 'evalml.pipelines.binary_classification_pipeline.BinaryClassificationPipeline'>,
'pipeline_summary': 'LightGBM Classifier w/ Label Encoder + Select Columns '
'By Type Transformer + Label Encoder + Drop Columns '
'Transformer + DateTime Featurizer + Imputer + Select '
'Columns Transformer + Select Columns Transformer + '
'Label Encoder + Imputer + One Hot Encoder + '
'Oversampler',
'parameters': {...},
'mean_cv_score': 0.2999710030621828,
'standard_deviation_cv_score': 0.2061756997312182,
'high_variance_cv': False,
'training_time': 3.292562246322632,
'cv_data': [...],
'percent_better_than_baseline_all_objectives': {...},
'percent_better_than_baseline': 93.90457488440069,
'ranking_score': 0.1609546813582899,
'ranking_additional_objectives': {...},
'holdout_score': 0.1609546813582899},
3: {'id': 3,
'pipeline_name': 'Extra Trees Classifier w/ Label Encoder + Select Columns '
'By Type Transformer + Label Encoder + Drop Columns '
'Transformer + DateTime Featurizer + Imputer + Select '
'Columns Transformer + Select Columns Transformer + Label '
'Encoder + Imputer + One Hot Encoder + Oversampler',
'pipeline_class': <class 'evalml.pipelines.binary_classification_pipeline.BinaryClassificationPipeline'>,
'pipeline_summary': 'Extra Trees Classifier w/ Label Encoder + Select '
'Columns By Type Transformer + Label Encoder + Drop '
'Columns Transformer + DateTime Featurizer + Imputer + '
'Select Columns Transformer + Select Columns '
'Transformer + Label Encoder + Imputer + One Hot '
'Encoder + Oversampler',
'parameters': {...},
'mean_cv_score': 0.3548022504410096,
'standard_deviation_cv_score': 0.02819477750638557,
'high_variance_cv': False,
'training_time': 5.005589723587036,
'cv_data': [...],
'percent_better_than_baseline_all_objectives': {...},
'percent_better_than_baseline': 92.79040131768677,
'ranking_score': 0.34494076832526754,
'ranking_additional_objectives': {...},
'holdout_score': 0.34494076832526754},
4: {'id': 4,
'pipeline_name': 'Elastic Net Classifier w/ Label Encoder + Select Columns '
'By Type Transformer + Label Encoder + Drop Columns '
'Transformer + DateTime Featurizer + Imputer + Standard '
'Scaler + Select Columns Transformer + Select Columns '
'Transformer + Label Encoder + Imputer + One Hot Encoder + '
'Standard Scaler + Oversampler',
'pipeline_class': <class 'evalml.pipelines.binary_classification_pipeline.BinaryClassificationPipeline'>,
'pipeline_summary': 'Elastic Net Classifier w/ Label Encoder + Select '
'Columns By Type Transformer + Label Encoder + Drop '
'Columns Transformer + DateTime Featurizer + Imputer + '
'Standard Scaler + Select Columns Transformer + Select '
'Columns Transformer + Label Encoder + Imputer + One '
'Hot Encoder + Standard Scaler + Oversampler',
'parameters': {...},
'mean_cv_score': 0.3749636652298016,
'standard_deviation_cv_score': 0.04576837668046441,
'high_variance_cv': False,
'training_time': 4.673047780990601,
'cv_data': [...],
'percent_better_than_baseline_all_objectives': {...},
'percent_better_than_baseline': 92.38072040581496,
'ranking_score': 0.4014310266366239,
'ranking_additional_objectives': {...},
'holdout_score': 0.4014310266366239},
5: {'id': 5,
'pipeline_name': 'XGBoost Classifier w/ Label Encoder + Select Columns By '
'Type Transformer + Label Encoder + Drop Columns '
'Transformer + DateTime Featurizer + Imputer + Select '
'Columns Transformer + Select Columns Transformer + Label '
'Encoder + Imputer + One Hot Encoder + Oversampler',
'pipeline_class': <class 'evalml.pipelines.binary_classification_pipeline.BinaryClassificationPipeline'>,
'pipeline_summary': 'XGBoost Classifier w/ Label Encoder + Select Columns '
'By Type Transformer + Label Encoder + Drop Columns '
'Transformer + DateTime Featurizer + Imputer + Select '
'Columns Transformer + Select Columns Transformer + '
'Label Encoder + Imputer + One Hot Encoder + '
'Oversampler',
'parameters': {...},
'mean_cv_score': 0.2602139590213454,
'standard_deviation_cv_score': 0.14857758702664728,
'high_variance_cv': False,
'training_time': 3.5291311740875244,
'cv_data': [...],
'percent_better_than_baseline_all_objectives': {...},
'percent_better_than_baseline': 94.71243991900296,
'ranking_score': 0.16504919593086245,
'ranking_additional_objectives': {...},
'holdout_score': 0.16504919593086245},
6: {'id': 6,
'pipeline_name': 'Logistic Regression Classifier w/ Label Encoder + Select '
'Columns By Type Transformer + Label Encoder + Drop '
'Columns Transformer + DateTime Featurizer + Imputer + '
'Standard Scaler + Select Columns Transformer + Select '
'Columns Transformer + Label Encoder + Imputer + One Hot '
'Encoder + Standard Scaler + Oversampler',
'pipeline_class': <class 'evalml.pipelines.binary_classification_pipeline.BinaryClassificationPipeline'>,
'pipeline_summary': 'Logistic Regression Classifier w/ Label Encoder + '
'Select Columns By Type Transformer + Label Encoder + '
'Drop Columns Transformer + DateTime Featurizer + '
'Imputer + Standard Scaler + Select Columns Transformer '
'+ Select Columns Transformer + Label Encoder + Imputer '
'+ One Hot Encoder + Standard Scaler + Oversampler',
'parameters': {...},
'mean_cv_score': 0.3744194787645956,
'standard_deviation_cv_score': 0.04599907860856567,
'high_variance_cv': False,
'training_time': 5.866044521331787,
'cv_data': [...],
'percent_better_than_baseline_all_objectives': {...},
'percent_better_than_baseline': 92.39177830079056,
'ranking_score': 0.40317955644858056,
'ranking_additional_objectives': {...},
'holdout_score': 0.40317955644858056}},
'search_order': [0, 1, 2, 3, 4, 5, 6]}
If any errors occur, such as those that can arise in the Iterative Algorithm example above, we can examine them more closely via the errors field. There is one dictionary entry per failed pipeline fold, and each entry contains the pipeline parameters along with the error thrown and its full traceback.
[38]:
auto_iterative.errors
[38]:
{}
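No pipelines failed in this run, so the dictionary above is empty. As a minimal sketch (the key names below are assumptions for illustration, not the library's documented schema), a populated errors dictionary could be inspected like this:
# Walk the errors dictionary: one entry per failed pipeline fold, each with
# the pipeline parameters and the raised error plus its traceback.
for entry_name, fold_info in auto_iterative.errors.items():
    print(entry_name)
    print(fold_info.get("parameters"))  # hypothetical key name
    print(fold_info.get("traceback"))  # hypothetical key name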
Parallel AutoML#
By default, all pipelines in an AutoML batch are evaluated in serial. Pipelines can be evaluated in parallel to improve performance during AutoML search. This is accomplished by a futures-style submission and evaluation of the pipelines in a batch. As of this writing, the pipelines use a threaded model for concurrent evaluation. This is similar to the currently implemented n_jobs parameter in the estimators, which uses an increased number of threads to train and score estimators.
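For reference, a minimal sketch of that estimator-level n_jobs knob (the value 2 is purely illustrative):
from evalml.pipelines.components import RandomForestClassifier

# n_jobs controls intra-estimator parallelism and is independent of the
# engine-level parallelism described in this section.
rf = RandomForestClassifier(n_jobs=2)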
Quick Start#
To quickly use some parallelism to enhance the pipeline searching, a string can be passed to AutoMLSearch during initialization to set up a parallel engine and client within the AutoMLSearch object. The current options are "cf_threaded", "cf_process", "dask_threaded" and "dask_process", which indicate the futures backend to use and whether to use thread-level or process-level parallelism.
[39]:
automl_cf_threaded = AutoMLSearch(
X_train=X,
y_train=y,
problem_type="binary",
allowed_model_families=[ModelFamily.LINEAR_MODEL],
engine="cf_threaded",
)
automl_cf_threaded.search(interactive_plot=False)
automl_cf_threaded.close_engine()
Parallelism with Concurrent Futures#
The EngineBase class is robust and extensible enough to support futures-like implementations from a variety of libraries. The CFEngine extends the EngineBase to use the native Python concurrent.futures library. The CFEngine supports both thread- and process-level parallelism; the type of parallelism is chosen by passing either a ThreadPoolExecutor or a ProcessPoolExecutor. If either executor is passed a max_workers parameter, it sets the number of processes or threads spawned. If not, the default number of processes will equal the number of processors available, and the number of threads will be set to five times the number of processors available.
Here, the CFEngine is constructed with a ThreadPoolExecutor capped at four worker threads.
[40]:
from concurrent.futures import ThreadPoolExecutor
from evalml.automl.engine.cf_engine import CFEngine, CFClient
cf_engine = CFEngine(CFClient(ThreadPoolExecutor(max_workers=4)))
automl_cf_threaded = AutoMLSearch(
X_train=X,
y_train=y,
problem_type="binary",
allowed_model_families=[ModelFamily.LINEAR_MODEL],
engine=cf_engine,
)
automl_cf_threaded.search(interactive_plot=False)
automl_cf_threaded.close_engine()
Note: the cell demonstrating process-level parallelism is shown as Markdown due to incompatibility with our ReadTheDocs build. It can be run successfully locally.
from concurrent.futures import ProcessPoolExecutor

# Repeat the process but using process-level parallelism
cf_engine = CFEngine(CFClient(ProcessPoolExecutor(max_workers=2)))
automl_cf_process = AutoMLSearch(
    X_train=X,
    y_train=y,
    problem_type="binary",
    engine=cf_engine,
)
automl_cf_process.search(interactive_plot=False)
automl_cf_process.close_engine()
Parallelism with Dask#
Thread- or process-level parallelism can be explicitly invoked for the DaskEngine (as well as the CFEngine). processes can be set to True and the number of processes set using n_workers. If processes is set to False, then the resulting parallelism will be threaded and n_workers will represent the threads used. Examples of both follow.
[41]:
from dask.distributed import LocalCluster
from evalml.automl.engine import DaskEngine
dask_engine_p2 = DaskEngine(cluster=LocalCluster(processes=True, n_workers=2))
automl_dask_p2 = AutoMLSearch(
X_train=X,
y_train=y,
problem_type="binary",
allowed_model_families=[ModelFamily.LINEAR_MODEL],
engine=dask_engine_p2,
)
automl_dask_p2.search(interactive_plot=False)
# Explicitly shutdown the automl object's LocalCluster
automl_dask_p2.close_engine()
/home/docs/checkouts/readthedocs.org/user_builds/feature-labs-inc-evalml/envs/stable/lib/python3.9/site-packages/sklearn/ensemble/_base.py:168: UserWarning: Inside a Dask worker with daemon=True, setting n_jobs=1.
Possible work-arounds:
- dask.config.set({'distributed.worker.daemon': False})
- set the environment variable DASK_DISTRIBUTED__WORKER__DAEMON=False
before creating your Dask cluster.
n_jobs = min(effective_n_jobs(n_jobs), n_estimators)
/home/docs/checkouts/readthedocs.org/user_builds/feature-labs-inc-evalml/envs/stable/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py:1288: UserWarning: Inside a Dask worker with daemon=True, setting n_jobs=1.
Possible work-arounds:
- dask.config.set({'distributed.worker.daemon': False})
- set the environment variable DASK_DISTRIBUTED__WORKER__DAEMON=False
before creating your Dask cluster.
and effective_n_jobs(self.n_jobs) == 1
/home/docs/checkouts/readthedocs.org/user_builds/feature-labs-inc-evalml/envs/stable/lib/python3.9/site-packages/joblib/parallel.py:1359: UserWarning: Inside a Dask worker with daemon=True, setting n_jobs=1.
Possible work-arounds:
- dask.config.set({'distributed.worker.daemon': False})
- set the environment variable DASK_DISTRIBUTED__WORKER__DAEMON=False
before creating your Dask cluster.
n_jobs = self._backend.configure(n_jobs=self.n_jobs, parallel=self,
2024-06-06 17:58:48,265 - distributed.scheduler - ERROR - Removing worker 'tcp://127.0.0.1:43003' caused the cluster to lose scattered data, which can't be recovered: {'DataFrame-bdd6ad7396f7ef2eafd41b9d359b74d1', 'Series-0bfec7d74a2607cc9a57ac596e573e01'} (stimulus_id='handle-worker-cleanup-1717696728.2653542')
[42]:
dask_engine_t4 = DaskEngine(cluster=LocalCluster(processes=False, n_workers=4))
automl_dask_t4 = AutoMLSearch(
X_train=X,
y_train=y,
problem_type="binary",
allowed_model_families=[ModelFamily.LINEAR_MODEL],
engine=dask_engine_t4,
)
automl_dask_t4.search(interactive_plot=False)
automl_dask_t4.close_engine()
2024-06-06 17:59:05,117 - distributed.scheduler - ERROR - Removing worker 'inproc://172.17.0.2/5729/5' caused the cluster to lose scattered data, which can't be recovered: {'DataFrame-bdd6ad7396f7ef2eafd41b9d359b74d1', 'Series-0bfec7d74a2607cc9a57ac596e573e01'} (stimulus_id='handle-worker-cleanup-1717696745.1171188')
As we can see, a significant performance gain comes simply from using an engine other than the default SequentialEngine: in this run, roughly a 1.5x speedup with two Dask processes and better than a 2x speedup with threaded concurrent futures. Exact gains will vary with hardware, data size, and the pipelines searched.
[43]:
print("Sequential search duration: %s" % str(automl.search_duration))
print(
"Concurrent futures (threaded) search duration: %s"
% str(automl_cf_threaded.search_duration)
)
print("Dask (two processes) search duration: %s" % str(automl_dask_p2.search_duration))
print("Dask (four threads)search duration: %s" % str(automl_dask_t4.search_duration))
Sequential search duration: 29.43521237373352
Concurrent futures (threaded) search duration: 12.752341270446777
Dask (two processes) search duration: 18.946487426757812
Dask (four threads) search duration: 14.855291366577148
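As a minimal sketch, the speedups quoted above can be computed directly from the search_duration attributes of the searches in this section:
# Relative speedup of each parallel engine over the sequential baseline.
baseline = automl.search_duration
for name, run in [
    ("cf_threaded", automl_cf_threaded),
    ("dask_p2", automl_dask_p2),
    ("dask_t4", automl_dask_t4),
]:
    print(f"{name}: {baseline / run.search_duration:.2f}x")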