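Pipelines#

EvalML pipelines represent a sequence of operations to be applied to data, where each operation is either a data transformation or a machine learning modeling algorithm.

A pipeline holds a combination of one or more components, which are applied to new input data in sequence.

Each component and pipeline supports a set of parameters that configure its behavior. The AutoML search process seeks the combination of pipeline structure and pipeline parameters that performs best on the data.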

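Defining a Pipeline Instance#

Pipeline instances can be instantiated using any of the following classes:

  • RegressionPipeline

  • BinaryClassificationPipeline

  • MulticlassClassificationPipeline

  • TimeSeriesRegressionPipeline

  • TimeSeriesBinaryClassificationPipeline

  • TimeSeriesMulticlassClassificationPipeline

The class you want to use depends on your problem type. The only required input for instantiating a pipeline is component_graph, which can be a ComponentGraph instance, a list, or a dictionary containing a sequence of components to be fit and evaluated.

A component_graph list is the default representation. It describes a linear order of transforming components, with an estimator as the final component. A component_graph dictionary is used to represent a non-linear graph of components, where each key is a unique name for a component and each value is a list with the component's class as the first element and any parents of the component as the following element(s). In both component_graph formats, custom components are provided as a reference to the component class, while components defined in EvalML can be provided either as a string name or as a reference to the component class.

If you choose to provide a ComponentGraph instance and want to set custom parameters for your pipeline, set them through the pipeline initialization rather than through ComponentGraph.instantiate().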

[1]:
from evalml.pipelines import MulticlassClassificationPipeline, ComponentGraph

component_graph_as_list = ["Imputer", "Random Forest Classifier"]
MulticlassClassificationPipeline(component_graph=component_graph_as_list)
[1]:
pipeline = MulticlassClassificationPipeline(component_graph={'Imputer': ['Imputer', 'X', 'y'], 'Random Forest Classifier': ['Random Forest Classifier', 'Imputer.x', 'y']}, parameters={'Imputer':{'categorical_impute_strategy': 'most_frequent', 'numeric_impute_strategy': 'mean', 'boolean_impute_strategy': 'most_frequent', 'categorical_fill_value': None, 'numeric_fill_value': None, 'boolean_fill_value': None}, 'Random Forest Classifier':{'n_estimators': 100, 'max_depth': 6, 'n_jobs': -1}}, random_seed=0)
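The output above shows that the list form is expanded into the dictionary form, with each component reading the features produced by the one before it. The following is a minimal sketch of that expansion in plain Python. The helper name linear_graph_to_dict is hypothetical and is not part of EvalML; it only mirrors the pattern visible in the printed representation.

```python
def linear_graph_to_dict(component_names):
    """Expand a linear list of component names into the dictionary form.

    Hypothetical helper for illustration only; not an EvalML function.
    """
    graph = {}
    previous = None
    for name in component_names:
        # The first component reads the raw features "X"; every later
        # component reads the transformed features of its predecessor.
        features = "X" if previous is None else f"{previous}.x"
        graph[name] = [name, features, "y"]
        previous = name
    return graph


print(linear_graph_to_dict(["Imputer", "Random Forest Classifier"]))
```

The sketch assumes every component receives the original target "y", which matches the linear examples on this page but not graphs that include a label encoder.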
[2]:
component_graph_as_dict = {
    "Imputer": ["Imputer", "X", "y"],
    "Encoder": ["One Hot Encoder", "Imputer.x", "y"],
    "Random Forest Clf": ["Random Forest Classifier", "Encoder.x", "y"],
    "Elastic Net Clf": ["Elastic Net Classifier", "Encoder.x", "y"],
    "Final Estimator": [
        "Logistic Regression Classifier",
        "Random Forest Clf.x",
        "Elastic Net Clf.x",
        "y",
    ],
}

MulticlassClassificationPipeline(component_graph=component_graph_as_dict)
[2]:
pipeline = MulticlassClassificationPipeline(component_graph={'Imputer': ['Imputer', 'X', 'y'], 'Encoder': ['One Hot Encoder', 'Imputer.x', 'y'], 'Random Forest Clf': ['Random Forest Classifier', 'Encoder.x', 'y'], 'Elastic Net Clf': ['Elastic Net Classifier', 'Encoder.x', 'y'], 'Final Estimator': ['Logistic Regression Classifier', 'Random Forest Clf.x', 'Elastic Net Clf.x', 'y']}, parameters={'Imputer':{'categorical_impute_strategy': 'most_frequent', 'numeric_impute_strategy': 'mean', 'boolean_impute_strategy': 'most_frequent', 'categorical_fill_value': None, 'numeric_fill_value': None, 'boolean_fill_value': None}, 'Encoder':{'top_n': 10, 'features_to_encode': None, 'categories': None, 'drop': 'if_binary', 'handle_unknown': 'ignore', 'handle_missing': 'error'}, 'Random Forest Clf':{'n_estimators': 100, 'max_depth': 6, 'n_jobs': -1}, 'Elastic Net Clf':{'penalty': 'elasticnet', 'C': 1.0, 'l1_ratio': 0.15, 'n_jobs': -1, 'multi_class': 'auto', 'solver': 'saga'}, 'Final Estimator':{'penalty': 'l2', 'C': 1.0, 'n_jobs': -1, 'multi_class': 'auto', 'solver': 'lbfgs'}}, random_seed=0)
[3]:
cg = ComponentGraph(component_graph_as_dict)

# set parameters in the pipeline rather than through cg.instantiate()
MulticlassClassificationPipeline(component_graph=cg, parameters={})
[3]:
pipeline = MulticlassClassificationPipeline(component_graph={'Imputer': ['Imputer', 'X', 'y'], 'Encoder': ['One Hot Encoder', 'Imputer.x', 'y'], 'Random Forest Clf': ['Random Forest Classifier', 'Encoder.x', 'y'], 'Elastic Net Clf': ['Elastic Net Classifier', 'Encoder.x', 'y'], 'Final Estimator': ['Logistic Regression Classifier', 'Random Forest Clf.x', 'Elastic Net Clf.x', 'y']}, parameters={'Imputer':{'categorical_impute_strategy': 'most_frequent', 'numeric_impute_strategy': 'mean', 'boolean_impute_strategy': 'most_frequent', 'categorical_fill_value': None, 'numeric_fill_value': None, 'boolean_fill_value': None}, 'Encoder':{'top_n': 10, 'features_to_encode': None, 'categories': None, 'drop': 'if_binary', 'handle_unknown': 'ignore', 'handle_missing': 'error'}, 'Random Forest Clf':{'n_estimators': 100, 'max_depth': 6, 'n_jobs': -1}, 'Elastic Net Clf':{'penalty': 'elasticnet', 'C': 1.0, 'l1_ratio': 0.15, 'n_jobs': -1, 'multi_class': 'auto', 'solver': 'saga'}, 'Final Estimator':{'penalty': 'l2', 'C': 1.0, 'n_jobs': -1, 'multi_class': 'auto', 'solver': 'lbfgs'}}, random_seed=0)
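To make the parent notation concrete, the sketch below derives an evaluation order for the non-linear graph from its edge strings, using the standard-library graphlib module (Python 3.9+). It is an illustration of how the dictionary encodes dependencies, not EvalML's own implementation; the helper parents_of is hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

component_graph_as_dict = {
    "Imputer": ["Imputer", "X", "y"],
    "Encoder": ["One Hot Encoder", "Imputer.x", "y"],
    "Random Forest Clf": ["Random Forest Classifier", "Encoder.x", "y"],
    "Elastic Net Clf": ["Elastic Net Classifier", "Encoder.x", "y"],
    "Final Estimator": [
        "Logistic Regression Classifier",
        "Random Forest Clf.x",
        "Elastic Net Clf.x",
        "y",
    ],
}


def parents_of(inputs):
    """Return the component names referenced by "<name>.x" or "<name>.y" edges."""
    parents = set()
    for edge in inputs:
        if edge in ("X", "y"):
            continue  # raw pipeline inputs, not produced by a component
        parents.add(edge.rsplit(".", 1)[0])
    return parents


# Map each component to the set of components it depends on.
dependencies = {
    name: parents_of(spec[1:]) for name, spec in component_graph_as_dict.items()
}
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

Any valid order places "Imputer" before "Encoder", and both classifiers before "Final Estimator"; the relative order of the two classifiers is not constrained by the graph.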

If you're using your own custom components, you can refer to them like so:

[4]:
from evalml.pipelines.components import Transformer


class NewTransformer(Transformer):
    name = "New Transformer"
    hyperparameter_ranges = {"parameter_1": ["a", "b", "c"]}

    def __init__(self, parameter_1=1, random_seed=0):
        parameters = {"parameter_1": parameter_1}
        super().__init__(parameters=parameters, random_seed=random_seed)

    def transform(self, X, y=None):
        # Your code here!
        return X


MulticlassClassificationPipeline([NewTransformer, "Random Forest Classifier"])
[4]:
pipeline = MulticlassClassificationPipeline(component_graph={'New Transformer': [NewTransformer, 'X', 'y'], 'Random Forest Classifier': ['Random Forest Classifier', 'New Transformer.x', 'y']}, parameters={'New Transformer':{'parameter_1': 1}, 'Random Forest Classifier':{'n_estimators': 100, 'max_depth': 6, 'n_jobs': -1}}, random_seed=0)

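Pipeline Usage#

All pipelines define the following methods:

  • fit fits each component on the provided training data, in order.

  • predict computes the predictions of the component graph on the provided data.

  • score computes the value of an objective on the provided data.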

[5]:
from evalml.demos import load_wine

X, y = load_wine()

pipeline = MulticlassClassificationPipeline(
    component_graph={
        "Label Encoder": ["Label Encoder", "X", "y"],
        "Imputer": ["Imputer", "X", "Label Encoder.y"],
        "Random Forest Classifier": [
            "Random Forest Classifier",
            "Imputer.x",
            "Label Encoder.y",
        ],
    }
)
pipeline.fit(X, y)
print(pipeline.predict(X))
print(pipeline.score(X, y, objectives=["log loss multiclass"]))
         Number of Features
Numeric                  13

Number of training examples: 178
Targets
class_1    39.89%
class_0    33.15%
class_2    26.97%
Name: count, dtype: object
0      class_0
1      class_0
2      class_0
3      class_0
4      class_0
        ...
173    class_2
174    class_2
175    class_2
176    class_2
177    class_2
Length: 178, dtype: category
Categories (3, object): ['class_0', 'class_1', 'class_2']
OrderedDict([('Log Loss Multiclass', 0.04132737017536072)])

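Custom Name#

By default, a pipeline's name is created from the component graph that makes up the pipeline. For example, a pipeline with an imputer, a one-hot encoder, and a logistic regression classifier has the default name "Logistic Regression Classifier w/ Imputer + One Hot Encoder".

If you'd like to override the pipeline's name attribute, you can set the custom_name parameter when initializing the pipeline, like so: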

[6]:
component_graph = ["Imputer", "One Hot Encoder", "Logistic Regression Classifier"]
pipeline = MulticlassClassificationPipeline(component_graph)
print("Pipeline with default name:", pipeline.name)


pipeline_with_name = MulticlassClassificationPipeline(
    component_graph, custom_name="My cool custom pipeline"
)
print("Pipeline with custom name:", pipeline_with_name.name)
Pipeline with default name: Logistic Regression Classifier w/ Imputer + One Hot Encoder
Pipeline with custom name: My cool custom pipeline
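The default name printed above follows a visible pattern: the estimator name, then "w/", then the preceding components joined by " + ". The sketch below reproduces that pattern for linear component lists. The function default_pipeline_name is hypothetical, and its handling of a pipeline with no transformers is an assumption rather than documented EvalML behavior.

```python
def default_pipeline_name(component_names):
    """Build a name in the style of the default names shown on this page.

    Hypothetical helper for illustration only; not an EvalML function.
    """
    *transformers, estimator = component_names
    if not transformers:
        # Assumption: with no preceding components, use the estimator name alone.
        return estimator
    return f"{estimator} w/ {' + '.join(transformers)}"


print(
    default_pipeline_name(
        ["Imputer", "One Hot Encoder", "Logistic Regression Classifier"]
    )
)
```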

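Pipeline Parameters#

You can also pass in custom parameters using the parameters argument; they are then used when instantiating each component in component_graph. The parameters dictionary must be a two-level dictionary whose key-value pairs are a component name and the corresponding component parameters dictionary. Each component parameters dictionary consists of (parameter name, parameter value) key-value pairs.

An example is shown below. The available parameters for each component are listed in the EvalML API reference.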

[7]:
parameters = {
    "Imputer": {
        "categorical_impute_strategy": "most_frequent",
        "numeric_impute_strategy": "median",
    },
    "Logistic Regression Classifier": {
        "penalty": "l2",
        "C": 1.0,
    },
}
component_graph = [
    "Imputer",
    "One Hot Encoder",
    "Standard Scaler",
    "Logistic Regression Classifier",
]
MulticlassClassificationPipeline(component_graph=component_graph, parameters=parameters)
[7]:
pipeline = MulticlassClassificationPipeline(component_graph={'Imputer': ['Imputer', 'X', 'y'], 'One Hot Encoder': ['One Hot Encoder', 'Imputer.x', 'y'], 'Standard Scaler': ['Standard Scaler', 'One Hot Encoder.x', 'y'], 'Logistic Regression Classifier': ['Logistic Regression Classifier', 'Standard Scaler.x', 'y']}, parameters={'Imputer':{'categorical_impute_strategy': 'most_frequent', 'numeric_impute_strategy': 'median', 'boolean_impute_strategy': 'most_frequent', 'categorical_fill_value': None, 'numeric_fill_value': None, 'boolean_fill_value': None}, 'One Hot Encoder':{'top_n': 10, 'features_to_encode': None, 'categories': None, 'drop': 'if_binary', 'handle_unknown': 'ignore', 'handle_missing': 'error'}, 'Logistic Regression Classifier':{'penalty': 'l2', 'C': 1.0, 'n_jobs': -1, 'multi_class': 'auto', 'solver': 'lbfgs'}}, random_seed=0)
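The printed pipeline above suggests that user-supplied parameters override component defaults key by key, while unspecified parameters keep their defaults. A minimal sketch of that two-level merge follows. The function resolve_parameters is hypothetical, and the default values are copied from the outputs on this page rather than read from EvalML.

```python
# Defaults as they appear in the printed pipelines on this page.
DEFAULTS = {
    "Imputer": {
        "categorical_impute_strategy": "most_frequent",
        "numeric_impute_strategy": "mean",
        "boolean_impute_strategy": "most_frequent",
    },
    "Logistic Regression Classifier": {"penalty": "l2", "C": 1.0, "n_jobs": -1},
}


def resolve_parameters(defaults, overrides):
    """Merge a two-level overrides dictionary on top of two-level defaults.

    Hypothetical helper for illustration only; not an EvalML function.
    """
    resolved = {}
    for component, default_params in defaults.items():
        # Later keys win, so user-supplied values replace the defaults.
        resolved[component] = {**default_params, **overrides.get(component, {})}
    return resolved


user_parameters = {"Imputer": {"numeric_impute_strategy": "median"}}
print(resolve_parameters(DEFAULTS, user_parameters)["Imputer"])
```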

Pipeline Description#

You can call .graph() to see each component and its parameters. Each component takes in data and feeds it to the next.

[8]:
component_graph = [
    "Imputer",
    "One Hot Encoder",
    "Standard Scaler",
    "Logistic Regression Classifier",
]
pipeline = MulticlassClassificationPipeline(
    component_graph=component_graph, parameters=parameters
)
pipeline.graph()
[8]:
../_images/user_guide_pipelines_14_0.svg
[9]:
component_graph_as_dict = {
    "Imputer": ["Imputer", "X", "y"],
    "Encoder": ["One Hot Encoder", "Imputer.x", "y"],
    "Random Forest Clf": ["Random Forest Classifier", "Encoder.x", "y"],
    "Elastic Net Clf": ["Elastic Net Classifier", "Encoder.x", "y"],
    "Final Estimator": [
        "Logistic Regression Classifier",
        "Random Forest Clf.x",
        "Elastic Net Clf.x",
        "y",
    ],
}

nonlinear_pipeline = MulticlassClassificationPipeline(
    component_graph=component_graph_as_dict
)
nonlinear_pipeline.graph()
[9]:
../_images/user_guide_pipelines_15_0.svg

You can see a textual representation of the pipeline by calling .describe().

[10]:
pipeline.describe()
*********************************************************************************
* Logistic Regression Classifier w/ Imputer + One Hot Encoder + Standard Scaler *
*********************************************************************************

Problem Type: multiclass
Model Family: Linear

Pipeline Steps
==============
1. Imputer
         * categorical_impute_strategy : most_frequent
         * numeric_impute_strategy : median
         * boolean_impute_strategy : most_frequent
         * categorical_fill_value : None
         * numeric_fill_value : None
         * boolean_fill_value : None
2. One Hot Encoder
         * top_n : 10
         * features_to_encode : None
         * categories : None
         * drop : if_binary
         * handle_unknown : ignore
         * handle_missing : error
3. Standard Scaler
4. Logistic Regression Classifier
         * penalty : l2
         * C : 1.0
         * n_jobs : -1
         * multi_class : auto
         * solver : lbfgs
[11]:
nonlinear_pipeline.describe()
*******************************************************************************************************************
* Logistic Regression Classifier w/ Imputer + One Hot Encoder + Random Forest Classifier + Elastic Net Classifier *
*******************************************************************************************************************

Problem Type: multiclass
Model Family: Linear

Pipeline Steps
==============
1. Imputer
         * categorical_impute_strategy : most_frequent
         * numeric_impute_strategy : mean
         * boolean_impute_strategy : most_frequent
         * categorical_fill_value : None
         * numeric_fill_value : None
         * boolean_fill_value : None
2. One Hot Encoder
         * top_n : 10
         * features_to_encode : None
         * categories : None
         * drop : if_binary
         * handle_unknown : ignore
         * handle_missing : error
3. Random Forest Classifier
         * n_estimators : 100
         * max_depth : 6
         * n_jobs : -1
4. Elastic Net Classifier
         * penalty : elasticnet
         * C : 1.0
         * l1_ratio : 0.15
         * n_jobs : -1
         * multi_class : auto
         * solver : saga
5. Logistic Regression Classifier
         * penalty : l2
         * C : 1.0
         * n_jobs : -1
         * multi_class : auto
         * solver : lbfgs

Component Graph#

You can use pipeline.get_component(name) and provide the component name to access any component (see the API reference).

[12]:
pipeline.get_component("Imputer")
[12]:
Imputer(categorical_impute_strategy='most_frequent', numeric_impute_strategy='median', boolean_impute_strategy='most_frequent', categorical_fill_value=None, numeric_fill_value=None, boolean_fill_value=None)
[13]:
nonlinear_pipeline.get_component("Elastic Net Clf")
[13]:
ElasticNetClassifier(penalty='elasticnet', C=1.0, l1_ratio=0.15, n_jobs=-1, multi_class='auto', solver='saga')

Alternatively, you can index directly into the pipeline to get a component:

[14]:
first_component = pipeline[0]
print(first_component.name)
Imputer
[15]:
nonlinear_pipeline["Final Estimator"]
[15]:
LogisticRegressionClassifier(penalty='l2', C=1.0, n_jobs=-1, multi_class='auto', solver='lbfgs')

Pipeline Estimator#

EvalML enforces that the last component of a linear pipeline is an estimator. You can access this estimator directly through pipeline.estimator.

[16]:
pipeline.estimator
[16]:
LogisticRegressionClassifier(penalty='l2', C=1.0, n_jobs=-1, multi_class='auto', solver='lbfgs')

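Input Feature Names#

After a pipeline is fitted, you can access its input_feature_names attribute to obtain a dictionary containing the list of feature names passed to each component of the pipeline. This is especially useful for debugging (for example, finding where a feature was dropped) or for detecting unexpected behavior.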

[17]:
pipeline = MulticlassClassificationPipeline(["Imputer", "Random Forest Classifier"])
pipeline.fit(X, y)
pipeline.input_feature_names
[17]:
{'Imputer': ['alcohol',
  'malic_acid',
  'ash',
  'alcalinity_of_ash',
  'magnesium',
  'total_phenols',
  'flavanoids',
  'nonflavanoid_phenols',
  'proanthocyanins',
  'color_intensity',
  'hue',
  'od280/od315_of_diluted_wines',
  'proline'],
 'Random Forest Classifier': ['alcohol',
  'malic_acid',
  'ash',
  'alcalinity_of_ash',
  'magnesium',
  'total_phenols',
  'flavanoids',
  'nonflavanoid_phenols',
  'proanthocyanins',
  'color_intensity',
  'hue',
  'od280/od315_of_diluted_wines',
  'proline']}
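As one way to use this dictionary for debugging, the sketch below compares the feature lists of consecutive components and reports features that entered one component but did not reach the next. It assumes a linear pipeline whose dictionary keys are in pipeline order. The helper name and the example component and feature names are hypothetical.

```python
def features_not_passed_on(input_feature_names):
    """For each component, list features it received that the next one did not.

    Hypothetical helper for illustration only; assumes a linear pipeline and
    a dictionary ordered from first to last component.
    """
    names = list(input_feature_names)
    report = {}
    for upstream, downstream in zip(names, names[1:]):
        reached = set(input_feature_names[downstream])
        report[upstream] = [
            feature
            for feature in input_feature_names[upstream]
            if feature not in reached
        ]
    return report


# Hypothetical example in the same shape as pipeline.input_feature_names.
example = {
    "Drop Columns Transformer": ["alcohol", "ash", "hue"],
    "Imputer": ["alcohol", "hue"],
    "Random Forest Classifier": ["alcohol", "hue"],
}
print(features_not_passed_on(example))
```

Note that a feature can also disappear because a component renamed or encoded it, so an entry in the report is a prompt to look closer, not proof of a bug.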

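Binary Classification Pipeline Thresholds#

For binary classification pipelines, you can choose to tune the decision boundary threshold, which the pipeline uses to distinguish positive predictions from negative ones. If no threshold is set, the default boundary is 0.5, meaning that predictions with a probability >= 0.5 are classified as the positive class and all other predictions as the negative class.

You can use the binary classification pipeline's optimize_threshold method to choose the best threshold for an objective, or you can set the threshold manually. EvalML's AutoMLSearch uses optimize_thresholds by default for binary problems, with F1 as the default objective to optimize. You can turn this off by passing optimize_thresholds=False, or change the objective used by changing the objective or alternate_thresholding_objective arguments.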

[18]:
from evalml.demos import load_breast_cancer
from evalml.pipelines import BinaryClassificationPipeline

X, y = load_breast_cancer()
X_to_predict = X.tail(10)

bcp = BinaryClassificationPipeline(
    {
        "Imputer": ["Imputer", "X", "y"],
        "Label Encoder": ["Label Encoder", "Imputer.x", "y"],
        "RFC": ["Random Forest Classifier", "Imputer.x", "Label Encoder.y"],
    }
)
bcp.fit(X, y)

predict_proba = bcp.predict_proba(X_to_predict)
predict_proba
         Number of Features
Numeric                  30

Number of training examples: 569
Targets
benign       62.74%
malignant    37.26%
Name: count, dtype: object
[18]:
benign malignant
559 0.925711 0.074289
560 0.939512 0.060488
561 0.991177 0.008823
562 0.010155 0.989845
563 0.000155 0.999845
564 0.000100 0.999900
565 0.000155 0.999845
566 0.011528 0.988472
567 0.000155 0.999845
568 0.994452 0.005548
[19]:
# view the current threshold
print("The threshold is {}".format(bcp.threshold))

# view the first few predictions
print(bcp.predict(X_to_predict))
The threshold is None
559       benign
560       benign
561       benign
562    malignant
563    malignant
564    malignant
565    malignant
566    malignant
567    malignant
568       benign
dtype: category
Categories (2, object): ['benign', 'malignant']

Note that the default threshold above is None, which means the pipeline uses 0.5 as its threshold.

You can also set the threshold manually:

[20]:
# you can manually set the threshold
bcp.threshold = 0.99
# view the threshold
print("The threshold is {}".format(bcp.threshold))

# view the first few predictions
print(bcp.predict(X_to_predict))
The threshold is 0.99
559       benign
560       benign
561       benign
562       benign
563    malignant
564    malignant
565    malignant
566       benign
567    malignant
568       benign
Name: malignant, dtype: category
Categories (2, object): ['benign', 'malignant']
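The effect of the threshold can be checked by hand against the predict_proba output shown earlier. The sketch below applies a cutoff to the "malignant" probabilities from that table. The function predict_with_threshold is hypothetical, and the use of >= at the boundary follows the description on this page rather than EvalML's source.

```python
# Positive-class ("malignant") probabilities copied from the table above.
malignant_proba = {
    559: 0.074289, 560: 0.060488, 561: 0.008823, 562: 0.989845,
    563: 0.999845, 564: 0.999900, 565: 0.999845, 566: 0.988472,
    567: 0.999845, 568: 0.005548,
}


def predict_with_threshold(positive_proba, threshold=None):
    """Label each row by comparing its positive-class probability to a cutoff.

    Hypothetical helper for illustration only; not an EvalML function.
    """
    # A threshold of None falls back to the default boundary of 0.5.
    cutoff = 0.5 if threshold is None else threshold
    return {
        index: "malignant" if proba >= cutoff else "benign"
        for index, proba in positive_proba.items()
    }


print(predict_with_threshold(malignant_proba))
print(predict_with_threshold(malignant_proba, threshold=0.99))
```

With the stricter 0.99 cutoff, rows 562 and 566 fall just below the boundary and flip to "benign", which matches the predictions printed above.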

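However, the best way to set the threshold is with the pipeline's optimize_threshold method. This method takes the predicted probabilities, the true values, and the objective to optimize, and finds the threshold that maximizes the value of that objective.

This method is best used with validation data: optimizing on training data can lead to overfitting, and optimizing on test data introduces strong bias.

Below is a walkthrough of threshold tuning using the F1 objective.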

[21]:
from evalml.objectives import F1

# get predictions for positive class only
predict_proba = predict_proba.iloc[:, -1]
bcp.optimize_threshold(X_to_predict, y.tail(10), predict_proba, F1())

print("The new threshold is {}".format(bcp.threshold))

# view the first few predictions
print(bcp.predict(X_to_predict))
The new threshold is 0.4912626081108861
559       benign
560       benign
561       benign
562    malignant
563    malignant
564    malignant
565    malignant
566    malignant
567    malignant
568       benign
Name: malignant, dtype: category
Categories (2, object): ['benign', 'malignant']
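To show what "finding the threshold that maximizes the objective" means, here is a brute-force sketch that scans a grid of candidate thresholds and keeps the one with the highest F1. This is a deliberately simple stand-in and should not be read as the search strategy EvalML uses internally. The functions and the validation data are hypothetical.

```python
def f1_score(y_true, y_pred):
    """Compute F1 for boolean labels (True is the positive class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(y_true, positive_proba, steps=100):
    """Scan thresholds 1/steps, 2/steps, ... and return the best (threshold, F1)."""
    best, best_score = 0.5, -1.0
    for i in range(1, steps):
        threshold = i / steps
        y_pred = [proba >= threshold for proba in positive_proba]
        score = f1_score(y_true, y_pred)
        if score > best_score:  # keep the first threshold reaching the best score
            best, best_score = threshold, score
    return best, best_score


# Hypothetical validation labels and positive-class probabilities.
y_val = [False, False, True, True, False, True]
proba_val = [0.10, 0.35, 0.40, 0.80, 0.30, 0.65]
threshold, score = best_threshold(y_val, proba_val)
print(threshold, score)
```

Because the hypothetical classes are perfectly separable, any cutoff between the highest negative probability and the lowest positive probability achieves an F1 of 1.0.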

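Grabbing rows near the decision boundary#

For binary classification problems, you can also look at the rows closest to the decision boundary by using rows_of_interest. This method returns the indices of interest, which can then be used to obtain the subset of the data closest to the decision boundary. This can help with further analysis of the model and give you a better understanding of which rows the model may be struggling with.

rows_of_interest takes an epsilon parameter, which defaults to 0.1 and determines which rows are returned. The returned rows are those whose probability of being in the positive class falls within the range threshold +- epsilon. Increase epsilon to get more rows and decrease it to get fewer.

Below is a walkthrough of using rows_of_interest, building on the previous pipeline, which has already been thresholded.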

[22]:
from evalml.pipelines.utils import rows_of_interest

indices = rows_of_interest(bcp, X, y, types="all")
X.iloc[indices].head()
[22]:
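mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity mean concave points mean symmetry mean fractal dimension ... worst radius worst texture worst perimeter worst area worst smoothness worst compactness worst concavity worst concave points worst symmetry worst fractal dimension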
40 13.44 21.58 86.18 563.0 0.08162 0.06031 0.03110 0.02031 0.1784 0.05587 ... 15.93 30.25 102.5 787.9 0.1094 0.20430 0.2085 0.1112 0.2994 0.07146
297 11.76 18.14 75.00 431.1 0.09968 0.05914 0.02685 0.03515 0.1619 0.06287 ... 13.36 23.39 85.1 553.6 0.1137 0.07974 0.0612 0.0716 0.1978 0.06915

2 rows × 30 columns

You can look at the probabilities for these rows to see how close they are to the new pipeline threshold. X is used here for brevity.

[23]:
pred_proba = bcp.predict_proba(X)
pos_value_proba = pred_proba.iloc[:, -1]
pos_value_proba.iloc[indices].head()
[23]:
40     0.448465
297    0.565925
Name: malignant, dtype: float64
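The selection rule described above, keeping rows whose positive-class probability lies within threshold +- epsilon, can be sketched in a few lines of plain Python. The function rows_near_threshold is hypothetical and only illustrates the rule; it does not reproduce the types argument or any ordering that rows_of_interest may apply. The probability list mixes the two values printed above with made-up ones.

```python
def rows_near_threshold(positive_proba, threshold=0.5, epsilon=0.1):
    """Return positions whose probability lies within threshold +- epsilon.

    Hypothetical helper for illustration only; not an EvalML function.
    """
    return [
        position
        for position, proba in enumerate(positive_proba)
        if abs(proba - threshold) <= epsilon
    ]


# 0.448465 and 0.565925 come from the output above; the rest are made up.
proba = [0.02, 0.448465, 0.97, 0.565925, 0.70]
tuned_threshold = 0.4912626081108861  # threshold found earlier on this page

print(rows_near_threshold(proba, tuned_threshold))
print(rows_near_threshold(proba, tuned_threshold, epsilon=0.25))
```

Widening epsilon from 0.1 to 0.25 pulls in the row at 0.70, illustrating how a larger epsilon returns more rows.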

Saving and Loading Pipelines#

You can save and load trained or untrained pipeline instances using Python's pickle format, like so:

[24]:
import pickle

pipeline_to_pickle = BinaryClassificationPipeline(
    ["Imputer", "Random Forest Classifier"]
)

with open("pipeline.pkl", "wb") as f:
    pickle.dump(pipeline_to_pickle, f)

pickled_pipeline = None
with open("pipeline.pkl", "rb") as f:
    pickled_pipeline = pickle.load(f)

assert pickled_pipeline == pipeline_to_pickle
pickled_pipeline.fit(X, y)
[24]:
pipeline = BinaryClassificationPipeline(component_graph={'Imputer': ['Imputer', 'X', 'y'], 'Random Forest Classifier': ['Random Forest Classifier', 'Imputer.x', 'y']}, parameters={'Imputer':{'categorical_impute_strategy': 'most_frequent', 'numeric_impute_strategy': 'mean', 'boolean_impute_strategy': 'most_frequent', 'categorical_fill_value': None, 'numeric_fill_value': None, 'boolean_fill_value': None}, 'Random Forest Classifier':{'n_estimators': 100, 'max_depth': 6, 'n_jobs': -1}}, random_seed=0)

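Generate Code#

Once you have instantiated a pipeline, you can generate a string of Python code that recreates it, which can then be saved and run elsewhere with EvalML. generate_pipeline_code requires a pipeline instance as input. It can also handle custom components, but it does not return the code required to define those components. Note that any external libraries used in creating the pipeline instance must also be imported in order to execute the returned code.

Code generation is not yet supported for nonlinear pipelines.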

[25]:
from evalml.pipelines.utils import generate_pipeline_code
from evalml.pipelines import BinaryClassificationPipeline
import pandas as pd
from evalml.utils import infer_feature_types
from skopt.space import Integer


class MyDropNullColumns(Transformer):
    """Transformer to drop features whose percentage of NaN values exceeds a specified threshold"""

    name = "My Drop Null Columns Transformer"
    hyperparameter_ranges = {}

    def __init__(self, pct_null_threshold=1.0, random_seed=0, **kwargs):
        """Initializes a transformer to drop features whose percentage of NaN values exceeds a specified threshold.

        Args:
            pct_null_threshold(float): The percentage of NaN values in an input feature to drop.
                Must be a value between [0, 1] inclusive. If equal to 0.0, will drop columns with any null values.
                If equal to 1.0, will drop columns with all null values. Defaults to 1.0.
        """
        if pct_null_threshold < 0 or pct_null_threshold > 1:
            raise ValueError(
                "pct_null_threshold must be a float between 0 and 1, inclusive."
            )
        parameters = {"pct_null_threshold": pct_null_threshold}
        parameters.update(kwargs)

        self._cols_to_drop = None
        super().__init__(
            parameters=parameters, component_obj=None, random_seed=random_seed
        )

    def fit(self, X, y=None):
        pct_null_threshold = self.parameters["pct_null_threshold"]
        X = infer_feature_types(X)
        percent_null = X.isnull().mean()
        if pct_null_threshold == 0.0:
            null_cols = percent_null[percent_null > 0]
        else:
            null_cols = percent_null[percent_null >= pct_null_threshold]
        self._cols_to_drop = list(null_cols.index)
        return self

    def transform(self, X, y=None):
        """Transforms data X by dropping columns that exceed the threshold of null values.
        Args:
            X (pd.DataFrame): Data to transform
            y (pd.Series, optional): Targets
        Returns:
            pd.DataFrame: Transformed X
        """

        X = infer_feature_types(X)
        return X.drop(columns=self._cols_to_drop)


pipeline_instance = BinaryClassificationPipeline(
    [
        "Imputer",
        MyDropNullColumns,
        "DateTime Featurizer",
        "Natural Language Featurizer",
        "One Hot Encoder",
        "Random Forest Classifier",
    ],
    custom_name="Pipeline with Custom Component",
    random_seed=20,
)

code = generate_pipeline_code(pipeline_instance)
print(code)

# This string can then be pasted into a separate window and run, although since the pipeline has custom component `MyDropNullColumns`,
#      the code for that component must also be included
from evalml.demos import load_fraud

X, y = load_fraud(1000)
exec(code)
pipeline.fit(X, y)
from evalml.pipelines.binary_classification_pipeline import BinaryClassificationPipeline

pipeline = BinaryClassificationPipeline(
    component_graph={
        "Imputer": ["Imputer", "X", "y"],
        "My Drop Null Columns Transformer": [MyDropNullColumns, "Imputer.x", "y"],
        "DateTime Featurizer": [
            "DateTime Featurizer",
            "My Drop Null Columns Transformer.x",
            "y",
        ],
        "Natural Language Featurizer": [
            "Natural Language Featurizer",
            "DateTime Featurizer.x",
            "y",
        ],
        "One Hot Encoder": ["One Hot Encoder", "Natural Language Featurizer.x", "y"],
        "Random Forest Classifier": [
            "Random Forest Classifier",
            "One Hot Encoder.x",
            "y",
        ],
    },
    parameters={
        "Imputer": {
            "categorical_impute_strategy": "most_frequent",
            "numeric_impute_strategy": "mean",
            "boolean_impute_strategy": "most_frequent",
            "categorical_fill_value": None,
            "numeric_fill_value": None,
            "boolean_fill_value": None,
        },
        "My Drop Null Columns Transformer": {"pct_null_threshold": 1.0},
        "DateTime Featurizer": {
            "features_to_extract": ["year", "month", "day_of_week", "hour"],
            "encode_as_categories": False,
            "time_index": None,
        },
        "One Hot Encoder": {
            "top_n": 10,
            "features_to_encode": None,
            "categories": None,
            "drop": "if_binary",
            "handle_unknown": "ignore",
            "handle_missing": "error",
        },
        "Random Forest Classifier": {"n_estimators": 100, "max_depth": 6, "n_jobs": -1},
    },
    custom_name="Pipeline with Custom Component",
    random_seed=20,
)

             Number of Features
Boolean                       1
Categorical                   6
Numeric                       5

Number of training examples: 1000
Targets
False    85.90%
True     14.10%
Name: count, dtype: object
[25]:
pipeline = BinaryClassificationPipeline(component_graph={'Imputer': ['Imputer', 'X', 'y'], 'My Drop Null Columns Transformer': [MyDropNullColumns, 'Imputer.x', 'y'], 'DateTime Featurizer': ['DateTime Featurizer', 'My Drop Null Columns Transformer.x', 'y'], 'Natural Language Featurizer': ['Natural Language Featurizer', 'DateTime Featurizer.x', 'y'], 'One Hot Encoder': ['One Hot Encoder', 'Natural Language Featurizer.x', 'y'], 'Random Forest Classifier': ['Random Forest Classifier', 'One Hot Encoder.x', 'y']}, parameters={'Imputer':{'categorical_impute_strategy': 'most_frequent', 'numeric_impute_strategy': 'mean', 'boolean_impute_strategy': 'most_frequent', 'categorical_fill_value': None, 'numeric_fill_value': None, 'boolean_fill_value': None}, 'My Drop Null Columns Transformer':{'pct_null_threshold': 1.0}, 'DateTime Featurizer':{'features_to_extract': ['year', 'month', 'day_of_week', 'hour'], 'encode_as_categories': False, 'time_index': None}, 'One Hot Encoder':{'top_n': 10, 'features_to_encode': None, 'categories': None, 'drop': 'if_binary', 'handle_unknown': 'ignore', 'handle_missing': 'error'}, 'Random Forest Classifier':{'n_estimators': 100, 'max_depth': 6, 'n_jobs': -1}}, custom_name='Pipeline with Custom Component', random_seed=20)
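Because the generated code refers to custom components by name without defining them, those names must exist wherever the string is executed. The sketch below shows this with Python's built-in exec and an explicit namespace. The class MyCustomComponent and the code string are hypothetical stand-ins, used so the example runs without EvalML installed.

```python
class MyCustomComponent:
    """Hypothetical stand-in for a user-defined component class."""


# Hypothetical stand-in for the string returned by generate_pipeline_code.
code = "pipeline = {'component': MyCustomComponent}"

# Supplying the custom class in the namespace lets the generated code resolve it.
namespace = {"MyCustomComponent": MyCustomComponent}
exec(code, namespace)
pipeline = namespace["pipeline"]
print(pipeline)

# Without the definition, executing the same string fails with a NameError.
try:
    exec(code, {})
except NameError as error:
    print("Missing definition:", error)
```

Only execute generated code strings that come from a source you trust, since exec runs arbitrary Python.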