utils

Utility functions useful in AutoML.

Module Contents

Functions

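`check_all_pipeline_names_unique`	Checks whether all the pipeline names are unique.
`get_best_sampler_for_data`	Returns the name of the sampler component to use for AutoMLSearch.
`get_default_primary_search_objective`	Gets the default primary search objective for a problem type.
`get_pipelines_from_component_graphs`	Returns the pipelines created from the passed component graphs, based on the specified problem type.
`get_threshold_tuning_info`	For a given automl config and pipeline, determines the threshold tuning objective and whether the training data must be split further to tune the threshold properly.
`make_data_splitter`	Given the training data and ML problem parameters, computes the data splitting method to use during AutoML search.
`resplit_training_data`	Further splits the training data for a given pipeline. Binary pipelines need this in order to tune the threshold properly.
`tune_binary_threshold`	Tunes the threshold of a binary pipeline to the X and y thresholding data.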

Attributes Summary

AutoMLConfig

Contents

evalml.automl.utils.AutoMLConfig

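evalml.automl.utils.check_all_pipeline_names_unique(pipelines)

Checks whether all the pipeline names are unique.

Parameters: pipelines (list[PipelineBase]) – List of pipelines whose names are checked for uniqueness.
Raises: ValueError – If any pipeline name is duplicated.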
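
The uniqueness check described above can be illustrated with a small standard-library sketch. It works on plain name strings rather than on pipeline objects, and the helper names `find_duplicate_names` and `check_names_unique` are hypothetical, not part of evalml.

```python
from collections import Counter


def find_duplicate_names(names):
    """Return the names that appear more than once, in first-seen order."""
    counts = Counter(names)
    return [name for name, count in counts.items() if count > 1]


def check_names_unique(names):
    """Raise ValueError if any name in `names` is repeated (hypothetical helper)."""
    duplicates = find_duplicate_names(names)
    if duplicates:
        raise ValueError(f"All pipeline names must be unique. Duplicates: {duplicates}")


# Two distinct names: the check passes silently.
check_names_unique(["Random Forest", "Elastic Net"])
```

With real pipelines, the list of names would presumably be built from each pipeline's name before running a check of this kind.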

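evalml.automl.utils.get_best_sampler_for_data(X, y, sampler_method, sampler_balanced_ratio)

Returns the name of the sampler component to use for AutoMLSearch.

Parameters

X (pd.DataFrame) – The input feature data.
y (pd.Series) – The input target data.
sampler_method (str) – The sampler_type parameter passed to AutoMLSearch.
sampler_balanced_ratio (float) – The minority:majority target ratio that is considered balanced, or the ratio to which the classes should be balanced.

Returns

The string name of the sampling component to use, or None if no sampler is needed.

Return type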

str, None
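
To make the minority:majority ratio concrete, here is a minimal sketch that computes it for a target column and compares it with a balance threshold. It only illustrates the ratio; it does not reproduce how evalml selects a sampler. The helper names and the 0.25 default are assumptions made for this example.

```python
from collections import Counter


def minority_majority_ratio(y):
    """Return (size of the smallest class) / (size of the largest class)."""
    counts = Counter(y).values()
    return min(counts) / max(counts)


def needs_sampler(y, sampler_balanced_ratio=0.25):
    """True when the classes are less balanced than the given ratio.

    The 0.25 default is an assumption chosen for this illustration.
    """
    return minority_majority_ratio(y) < sampler_balanced_ratio


y = [0] * 90 + [1] * 10
print(minority_majority_ratio(y))  # 10 / 90
print(needs_sampler(y))            # True, because 10 / 90 is below 0.25
```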

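evalml.automl.utils.get_default_primary_search_objective(problem_type)

Gets the default primary search objective for a problem type.

Parameters: problem_type (str or ProblemType) – The problem type of interest.
Returns: An instance of the primary objective for the problem type.
Return type: ObjectiveBase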

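evalml.automl.utils.get_pipelines_from_component_graphs(component_graphs_dict, problem_type, parameters=None, random_seed=0)

Returns the pipelines created from the passed component graphs, based on the specified problem type.

Parameters

component_graphs_dict (dict) – The dictionary of component graphs.
problem_type (str or ProblemType) – The problem type for which pipelines will be created.
parameters (dict) – Pipeline-level parameters that should be passed to the proposed pipelines. Defaults to None.
random_seed (int) – The random seed. Defaults to 0.

Returns

A list of pipelines built from the passed component graphs.

Return type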

list

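evalml.automl.utils.get_threshold_tuning_info(automl_config, pipeline)

For a given automl config and pipeline, determines the threshold tuning objective and whether the training data must be split further to tune the threshold properly.

Can also be used after an automl search has run to determine whether the full training data was used to train the pipeline.

Parameters

automl_config (AutoMLConfig) – The AutoMLSearch configuration object. Used to determine the threshold tuning objective and whether the data needs resplitting.
pipeline (Pipeline) – The pipeline instance to threshold.

Returns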

threshold_tuning_objective, data_needs_resplitting (str, bool)

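evalml.automl.utils.make_data_splitter(X, y, problem_type, problem_configuration=None, n_splits=3, shuffle=True, random_seed=0)

Given the training data and ML problem parameters, computes the data splitting method to use during AutoML search.

Parameters

X (pd.DataFrame) – The input training data of shape [n_samples, n_features].
y (pd.Series) – The target training data of length [n_samples].
problem_type (ProblemType) – The type of machine learning problem.
problem_configuration (dict, None) – Additional parameters needed to configure the search. For example, in time series problems, values should be passed in for the time_index, gap, and max_delay variables. Defaults to None.
n_splits (int, None) – The number of cross-validation (CV) splits, if applicable. Defaults to 3.
shuffle (bool) – Whether to shuffle the data before splitting, if applicable. Defaults to True.
random_seed (int) – The seed for the random number generator. Defaults to 0.

Returns

The data splitting method.

Return type

sklearn.model_selection.BaseCrossValidator

Raises

ValueError – If problem_configuration is not given for a time series problem.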
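
As a rough illustration of what a cross-validation splitter produces, the sketch below generates shuffled k-fold index pairs with the standard library, reusing the `n_splits`, `shuffle`, and `random_seed` names from the signature above. It covers plain k-fold only; the splitter that `make_data_splitter` returns may use a different strategy depending on the problem type, and the function name `k_fold_indices` is hypothetical.

```python
import random


def k_fold_indices(n_samples, n_splits=3, shuffle=True, random_seed=0):
    """Yield (train_indices, test_indices) pairs for plain k-fold CV."""
    indices = list(range(n_samples))
    if shuffle:
        # A seeded generator keeps the folds reproducible across calls.
        random.Random(random_seed).shuffle(indices)
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [
        n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
        for i in range(n_splits)
    ]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


for train, test in k_fold_indices(10):
    print(len(train), len(test))
```

For 10 samples and 3 splits, the test folds have sizes 4, 3, and 3, and every sample lands in exactly one test fold.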

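evalml.automl.utils.resplit_training_data(pipeline, X_train, y_train)

Further splits the training data for a given pipeline. Binary pipelines need this in order to tune the threshold properly.

Can be used after an automl search has run to recreate the data that was used to train a pipeline.

Parameters

pipeline (PipelineBase) – The pipeline whose training data is being split.
X_train (pd.DataFrame or np.ndarray) – Training data of shape [n_samples, n_features].
y_train (pd.Series or np.ndarray) – Training target data of length [n_samples].

Returns

The feature and target data, each split into a training set and a threshold tuning set.

Return type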

pd.DataFrame, pd.DataFrame, pd.Series, pd.Series
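
The idea of holding out part of the training data for threshold tuning can be sketched with plain Python lists. The function name, the 20% hold-out fraction, and the seeded shuffle are all assumptions made for this illustration; the sketch does not reproduce how evalml performs the split.

```python
import random


def split_for_threshold_tuning(X_train, y_train, tuning_fraction=0.2, random_seed=0):
    """Split rows into a training part and a threshold-tuning part.

    Returns X_train_part, X_tuning, y_train_part, y_tuning, mirroring the
    order of the four return values listed above.
    """
    n = len(X_train)
    indices = list(range(n))
    random.Random(random_seed).shuffle(indices)
    n_tuning = max(1, int(round(n * tuning_fraction)))
    tuning_idx, train_idx = indices[:n_tuning], indices[n_tuning:]
    # Index features and targets with the same positions so rows stay aligned.
    X_tr = [X_train[i] for i in train_idx]
    X_tune = [X_train[i] for i in tuning_idx]
    y_tr = [y_train[i] for i in train_idx]
    y_tune = [y_train[i] for i in tuning_idx]
    return X_tr, X_tune, y_tr, y_tune


X = [[i] for i in range(10)]
y = [i % 2 for i in range(10)]
X_tr, X_tune, y_tr, y_tune = split_for_threshold_tuning(X, y)
print(len(X_tr), len(X_tune))  # 8 training rows and 2 tuning rows
```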

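evalml.automl.utils.tune_binary_threshold(pipeline, objective, problem_type, X_threshold_tuning, y_threshold_tuning, X=None, y=None)

Tunes the threshold of a binary pipeline to the X and y thresholding data.

Parameters

pipeline (Pipeline) – The pipeline instance to threshold.
objective (ObjectiveBase) – The objective to tune with. If it is not tuneable and best_pipeline is True, F1 is used.
problem_type (ProblemType) – The problem type of the pipeline.
X_threshold_tuning (pd.DataFrame) – Features to tune the pipeline to.
y_threshold_tuning (pd.Series) – Target data to tune the pipeline to.
X (pd.DataFrame) – Features the pipeline was trained on (used for time series binary problems). Defaults to None.
y (pd.Series) – Target the pipeline was trained on (used for time series binary problems). Defaults to None.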
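
Threshold tuning can be pictured as a search for the decision threshold that maximizes an objective on held-out data. The sketch below runs a simple grid search over F1 on predicted probabilities. The grid search, the candidate grid, and the names `f1_score` and `tune_threshold` are assumptions made for illustration; they are not the optimization routine that evalml or its objectives use.

```python
def f1_score(y_true, y_pred):
    """F1 for binary labels in {0, 1}; returns 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def tune_threshold(y_true, probabilities, candidates=None):
    """Return (best_threshold, best_f1) over a grid of candidate thresholds."""
    if candidates is None:
        candidates = [i / 100 for i in range(1, 100)]
    best_threshold, best_score = 0.5, -1.0
    for threshold in candidates:
        y_pred = [1 if p >= threshold else 0 for p in probabilities]
        score = f1_score(y_true, y_pred)
        if score > best_score:  # strict comparison keeps the lowest best threshold
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


y_tuning = [0, 0, 0, 1, 0, 1, 1, 1]
probabilities = [0.05, 0.2, 0.3, 0.35, 0.4, 0.6, 0.7, 0.9]
threshold, score = tune_threshold(y_tuning, probabilities)
print(threshold, score)
```

On this toy data, the default threshold of 0.5 misses the positive example scored 0.35, so lowering the threshold trades one false positive for one recovered true positive and raises F1.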