utils

Utility methods for EvalML components.

Module Contents

Classes Summary

`WrappedSKClassifier`	Scikit-learn classifier wrapper class.
`WrappedSKRegressor`	Scikit-learn regressor wrapper class.

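Functions

`all_components`	Get all available components.
`allowed_model_families`	List the model types allowed for a particular problem type.
`convert_bool_to_double`	Convert all boolean columns in a DataFrame to doubles. If include_ints is True, also convert all integer columns to doubles.
`estimator_unable_to_handle_nans`	Check whether the provided estimator class is unable to handle NaN values as input.
`generate_component_code`	Create and return a string containing the Python imports and code required to run the EvalML component.
`get_estimators`	Return the estimators allowed for a particular problem type.
`get_prediction_intevals_for_tree_regressors`	Find the prediction intervals for tree-based regressors.
`handle_component_class`	Standardize input from a string name to a ComponentBase subclass if necessary.
`handle_float_categories_for_catboost`	Update input data to be compatible with CatBoost estimators.
`make_balancing_dictionary`	Create a dictionary for oversampler components. Finds the ratio of each class to the majority class; classes whose ratio is below sampling_ratio are oversampled, and all other classes are left unchanged.
`match_indices`	Match the indices of the passed DataFrame with those of the passed Series.
`scikit_learn_wrapped_estimator`	Wrap an EvalML object as a scikit-learn estimator.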

Contents

evalml.pipelines.components.utils.all_components()

Get all available components.

evalml.pipelines.components.utils.allowed_model_families(problem_type)

List the model types allowed for a particular problem type.

Parameters: problem_type (ProblemTypes or str) – ProblemTypes enum or string.
Returns: A list of model families.
Return type: list[ModelFamily]

evalml.pipelines.components.utils.convert_bool_to_double(data: pandas.DataFrame, include_ints: bool = False) → pandas.DataFrame

Convert all boolean columns in a DataFrame to doubles. If include_ints is True, also convert all integer columns to doubles.

Parameters

data (pd.DataFrame) – Input DataFrame.
include_ints (bool) – If True, also convert all integer columns to doubles. Defaults to False.

Returns

The input DataFrame with all boolean-valued columns converted to doubles.

Return type

pd.DataFrame

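evalml.pipelines.components.utils.estimator_unable_to_handle_nans(estimator_class)

Check whether the provided estimator class is unable to handle NaN values as input.

Parameters: estimator_class (Estimator) – Estimator class.
Raises: ValueError – If the estimator is not a valid estimator class.
Returns: True if the estimator class is unable to handle NaN values, False otherwise.
Return type: bool

evalml.pipelines.components.utils.generate_component_code(element)

Create and return a string containing the Python imports and code required to run the EvalML component.

Parameters: element (component instance) – The component instance to generate Python code for.
Returns: String representation of Python code that can be run separately to recreate the component instance. Does not include code for custom component implementations.
Raises: ValueError – If the input element is not a component instance.

Examples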

>>> from evalml.pipelines.components.estimators.regressors.decision_tree_regressor import DecisionTreeRegressor
>>> assert generate_component_code(DecisionTreeRegressor()) == "from evalml.pipelines.components.estimators.regressors.decision_tree_regressor import DecisionTreeRegressor\n\ndecisionTreeRegressor = DecisionTreeRegressor(**{'criterion': 'squared_error', 'max_features': 'sqrt', 'max_depth': 6, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0})"
...
>>> from evalml.pipelines.components.transformers.imputers.simple_imputer import SimpleImputer
>>> assert generate_component_code(SimpleImputer()) == "from evalml.pipelines.components.transformers.imputers.simple_imputer import SimpleImputer\n\nsimpleImputer = SimpleImputer(**{'impute_strategy': 'most_frequent', 'fill_value': None})"

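evalml.pipelines.components.utils.get_estimators(problem_type, model_families=None, excluded_model_families=None)

Return the estimators allowed for a particular problem type.

The results can optionally be filtered by a list of model types.

Parameters

problem_type (ProblemTypes or str) – Problem type to filter for.
model_families (list(str, ModelFamily)) – Model families to filter for.
excluded_model_families (list(str, ModelFamily)) – A list of model families to exclude from the results.

Returns

A list of estimator subclasses.

Return type

list[class]

Raises

TypeError – If the model_families parameter is not a list.
RuntimeError – If a model family is not valid for the problem type.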

evalml.pipelines.components.utils.get_prediction_intevals_for_tree_regressors(X: pandas.DataFrame, predictions: pandas.Series, coverage: List[float], estimators: List[evalml.pipelines.components.estimators.estimator.Estimator]) → Dict[str, pandas.Series]

Find the prediction intervals for tree-based regressors.

Parameters

X (pd.DataFrame) – Data of shape [n_samples, n_features].
predictions (pd.Series) – Predictions from the regressor.
coverage (list[float]) – A list of floats between 0 and 1, each giving a coverage level for which the upper and lower bounds of the prediction interval should be calculated.
estimators (list) – Collection of fitted sub-estimators.

Returns

Prediction intervals, with keys of the form {coverage}_lower or {coverage}_upper.

Return type

dict
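
The description above does not state how the bounds are derived. As a minimal sketch of one common approach, and not necessarily what EvalML does, the spread of the sub-estimators' predictions can be turned into a normal-approximation interval around each prediction. The helper name `sketch_prediction_intervals` and its list-based inputs are hypothetical; only the `{coverage}_lower` / `{coverage}_upper` key format comes from the documentation.

```python
from statistics import NormalDist, pstdev


def sketch_prediction_intervals(predictions, per_estimator_predictions, coverage):
    """Normal-approximation intervals from the spread of sub-estimator predictions.

    predictions: one ensemble prediction per sample.
    per_estimator_predictions: for each sample, the predictions of every
        fitted sub-estimator (hypothetical input shape, for illustration).
    coverage: floats strictly between 0 and 1.
    """
    # Standard deviation of the sub-estimators' predictions for each sample.
    spreads = [pstdev(sample_preds) for sample_preds in per_estimator_predictions]
    intervals = {}
    for level in coverage:
        # Two-sided z-score, roughly 1.96 for 95% coverage.
        z = NormalDist().inv_cdf((1 + level) / 2)
        intervals[f"{level}_lower"] = [p - z * s for p, s in zip(predictions, spreads)]
        intervals[f"{level}_upper"] = [p + z * s for p, s in zip(predictions, spreads)]
    return intervals


result = sketch_prediction_intervals(
    predictions=[10.0, 20.0],
    per_estimator_predictions=[[9.0, 10.0, 11.0], [20.0, 20.0, 20.0]],
    coverage=[0.95],
)
print(sorted(result))  # ['0.95_lower', '0.95_upper']
```

When every sub-estimator agrees on a sample, as for the second sample here, the spread is zero and both bounds collapse onto the prediction.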

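evalml.pipelines.components.utils.handle_component_class(component_class)

Standardize input from a string name to a ComponentBase subclass if necessary.

If a str is provided, attempts to look up the ComponentBase class by that name and returns that class. Otherwise, if a ComponentBase subclass or a component instance is provided, it is returned without modification.

Parameters

component_class (str, ComponentBase) – Input to be standardized.

Returns

ComponentBase

Raises

ValueError – If the input is not a valid component class.
MissingComponentError – If the component cannot be found.

Examples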

>>> from evalml.pipelines.components.estimators.regressors.decision_tree_regressor import DecisionTreeRegressor
>>> handle_component_class(DecisionTreeRegressor)
<class 'evalml.pipelines.components.estimators.regressors.decision_tree_regressor.DecisionTreeRegressor'>
>>> handle_component_class("Random Forest Regressor")
<class 'evalml.pipelines.components.estimators.regressors.rf_regressor.RandomForestRegressor'>

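evalml.pipelines.components.utils.handle_float_categories_for_catboost(X)

Update input data to be compatible with CatBoost estimators.

CatBoost cannot handle data in X that has the Woodwork Categorical logical type with floating-point categories. This utility determines whether the floating-point categories can be converted to integers without truncating any data and, if so, converts them to int64 categories. It will not attempt to convert values that are truly floating point.

Parameters

X (pd.DataFrame) – Input data for CatBoost that has Woodwork initialized.

Returns

Input data with exactly the same Woodwork typing information as the original, but with any floating-point categories converted to int64 when possible.

Return type

DataFrame

Raises

ValueError – If the numeric categories are actual floats that cannot be converted to integers without truncating data.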
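
The truncation check described above can be illustrated without Woodwork or CatBoost. The following is only a sketch of the idea on a plain list of category values; `sketch_float_categories_to_int` is a hypothetical helper, not part of EvalML.

```python
def sketch_float_categories_to_int(categories):
    """Convert float category values to ints only when nothing is truncated."""
    converted = []
    for value in categories:
        if not float(value).is_integer():
            # A true float such as 2.5 has no lossless integer representation.
            raise ValueError(
                f"Category {value!r} cannot be converted to an integer without truncation."
            )
        converted.append(int(value))
    return converted


print(sketch_float_categories_to_int([0.0, 1.0, 2.0]))  # [0, 1, 2]
```

Passing a list such as `[1.0, 2.5]` raises ValueError, mirroring the behaviour listed under Raises.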

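evalml.pipelines.components.utils.make_balancing_dictionary(y, sampling_ratio)

Create a dictionary for oversampler components. Finds the ratio of each class to the majority class. If the ratio is smaller than sampling_ratio, the class is oversampled; otherwise it is not sampled at all and its data is left as is.

Parameters

y (pd.Series) – Target data.
sampling_ratio (float) – The balanced ratio we want the samples to meet.

Returns

Dictionary whose keys are the classes and whose values are the number of samples for each class that satisfies sampling_ratio.

Return type

dict

Raises

ValueError – If the sampling ratio is not in the range (0, 1] or the target is empty.

Examples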

>>> import pandas as pd
>>> y = pd.Series([1] * 4 + [2] * 8 + [3])
>>> assert make_balancing_dictionary(y, 0.5) == {2: 8, 1: 4, 3: 4}
>>> assert make_balancing_dictionary(y, 0.9) == {2: 8, 1: 7, 3: 7}
>>> assert make_balancing_dictionary(y, 0.1) == {2: 8, 1: 4, 3: 1}
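
The rule behind these results can be sketched in plain Python. This is an illustrative reimplementation under stated assumptions, not the library's code: `sketch_balancing_dictionary` is a hypothetical name, it takes a plain list instead of a pd.Series, and it assumes the oversampling target is the majority count times sampling_ratio, truncated to an integer, which is what reproduces the 7 in the 0.9 example above.

```python
from collections import Counter


def sketch_balancing_dictionary(y, sampling_ratio):
    """Return the sample count each class should have after oversampling."""
    if sampling_ratio <= 0 or sampling_ratio > 1:
        raise ValueError("sampling_ratio must be in the range (0, 1]")
    if len(y) == 0:
        raise ValueError("target data must not be empty")
    counts = Counter(y)
    majority_count = max(counts.values())
    # Assumed target for classes below the ratio: truncate toward zero.
    target_count = int(majority_count * sampling_ratio)
    balanced = {}
    for label, count in counts.items():
        if count / majority_count < sampling_ratio:
            balanced[label] = target_count  # oversample up to the target
        else:
            balanced[label] = count  # already balanced enough, leave as is
    return balanced


y = [1] * 4 + [2] * 8 + [3]
print(sketch_balancing_dictionary(y, 0.5))  # {1: 4, 2: 8, 3: 4}
```

With a ratio of 0.1, class 3 has a ratio of 1/8 = 0.125 to the majority class, which is not below 0.1, so it keeps its single sample.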

evalml.pipelines.components.utils.match_indices(X: pandas.DataFrame, y: pandas.Series) → Tuple[pandas.DataFrame, Union[pandas.Series, pandas.DataFrame]]

Match the indices of the passed DataFrame with those of the passed Series.

Parameters

X (pd.DataFrame) – DataFrame to match indices from.
y (pd.Series) – Series to match indices to.

Returns

The DataFrame and Series with matching indices.

Return type

Tuple(pd.DataFrame, pd.Series)

evalml.pipelines.components.utils.scikit_learn_wrapped_estimator(evalml_obj)

Wrap an EvalML object as a scikit-learn estimator.

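class evalml.pipelines.components.utils.WrappedSKClassifier(pipeline)

Scikit-learn classifier wrapper class.

Methods

`fit`	Fit the component to data.
`get_metadata_routing`	Get metadata routing of this object.
`get_params`	Get parameters for this estimator.
`predict`	Make predictions using selected features.
`predict_proba`	Make probability estimates for labels.
`score`	Return the mean accuracy on the given test data and labels.
`set_params`	Set the parameters of this estimator.

fit(self, X, y)

Fit the component to data.

Parameters

X (pd.DataFrame or np.ndarray) – The input training data of shape [n_samples, n_features].
y (pd.Series, optional) – The target training data of length [n_samples].

Returns

self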

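get_metadata_routing(self)

Get metadata routing of this object.

See the User Guide for how the routing mechanism works.

Returns: routing – A MetadataRequest encapsulating routing information.
Return type: MetadataRequest

get_params(self, deep=True)

Get parameters for this estimator.

Parameters: deep (bool, default=True) – If True, return the parameters for this estimator and for contained subobjects that are estimators.
Returns: params – Parameter names mapped to their values.
Return type: dict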

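predict(self, X)

Make predictions using selected features.

Parameters: X (pd.DataFrame) – Features.
Returns: Predicted values.
Return type: np.ndarray

predict_proba(self, X)

Make probability estimates for labels.

Parameters: X (pd.DataFrame) – Features.
Returns: Probability estimates.
Return type: np.ndarray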

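score(self, X, y, sample_weight=None)

Return the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy, which is a harsh metric because it requires that the entire label set of each sample be predicted correctly.

Parameters

X (array-like of shape (n_samples, n_features)) – Test samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True labels for X.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

Returns

score – Mean accuracy of self.predict(X) with respect to y.

Return type

float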
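
To make the metric concrete, here is a minimal pure-Python sketch of weighted mean accuracy. `sketch_accuracy` is a hypothetical helper written for illustration; it is not the scikit-learn or EvalML implementation, and it represents multi-label targets as tuples so that a sample counts as correct only when its whole label set matches.

```python
def sketch_accuracy(y_true, y_pred, sample_weight=None):
    """Weighted fraction of samples whose prediction matches exactly."""
    if sample_weight is None:
        sample_weight = [1.0] * len(y_true)
    # A sample contributes its weight only when the full label (or label set) matches.
    correct = sum(w for t, p, w in zip(y_true, y_pred, sample_weight) if t == p)
    return correct / sum(sample_weight)


print(sketch_accuracy([0, 1, 1, 0], [0, 1, 0, 0]))  # 0.75
# Multi-label: one wrong label in the second sample makes the whole sample wrong.
print(sketch_accuracy([(1, 0), (1, 1)], [(1, 0), (1, 0)]))  # 0.5
```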

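set_params(self, **params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter>, so that each component of a nested object can be updated.

Parameters: **params (dict) – Estimator parameters.
Returns: self – Estimator instance.
Return type: estimator instance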
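
The <component>__<parameter> naming convention can be illustrated with a short sketch that groups a flat parameter dictionary by component. `sketch_split_nested_params` and the parameter names used below are hypothetical and chosen only for illustration; the real set_params performs the routing internally.

```python
def sketch_split_nested_params(params):
    """Group flat parameters into top-level and per-component dictionaries."""
    top_level, nested = {}, {}
    for key, value in params.items():
        # Split on the first double underscore only.
        component, delimiter, parameter = key.partition("__")
        if delimiter:
            nested.setdefault(component, {})[parameter] = value
        else:
            top_level[key] = value
    return top_level, nested


top_level, nested = sketch_split_nested_params(
    {"verbose": True, "imputer__strategy": "median", "estimator__max_depth": 4}
)
print(top_level)  # {'verbose': True}
print(nested)     # {'imputer': {'strategy': 'median'}, 'estimator': {'max_depth': 4}}
```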

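class evalml.pipelines.components.utils.WrappedSKRegressor(pipeline)

Scikit-learn regressor wrapper class.

Methods

`fit`	Fit the component to data.
`get_metadata_routing`	Get metadata routing of this object.
`get_params`	Get parameters for this estimator.
`predict`	Make predictions using selected features.
`score`	Return the coefficient of determination of the prediction.
`set_params`	Set the parameters of this estimator.

fit(self, X, y)

Fit the component to data.

Parameters

X (pd.DataFrame or np.ndarray) – The input training data of shape [n_samples, n_features].
y (pd.Series, optional) – The target training data of length [n_samples].

Returns

self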

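get_metadata_routing(self)

Get metadata routing of this object.

See the User Guide for how the routing mechanism works.

Returns: routing – A MetadataRequest encapsulating routing information.
Return type: MetadataRequest

get_params(self, deep=True)

Get parameters for this estimator.

Parameters: deep (bool, default=True) – If True, return the parameters for this estimator and for contained subobjects that are estimators.
Returns: params – Parameter names mapped to their values.
Return type: dict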

predict(self, X)

Make predictions using selected features.

Parameters: X (pd.DataFrame) – Features.
Returns: Predicted values.
Return type: np.ndarray

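score(self, X, y, sample_weight=None)

Return the coefficient of determination of the prediction.

The coefficient of determination \(R^2\) is defined as \((1 - \frac{u}{v})\), where \(u\) is the residual sum of squares ((y_true - y_pred) ** 2).sum() and \(v\) is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0, and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get an \(R^2\) score of 0.0.

Parameters

X (array-like of shape (n_samples, n_features)) – Test samples. For some estimators this may instead be a precomputed kernel matrix or a list of generic objects with shape (n_samples, n_samples_fitted), where n_samples_fitted is the number of samples used in fitting the estimator.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True values for X.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

Returns

score – \(R^2\) of self.predict(X) with respect to y.

Return type

float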
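
The \(R^2\) definition given for this method can be checked by hand with a few lines of plain Python. `sketch_r2_score` is a hypothetical, unweighted, single-output helper written for illustration, and it assumes y_true is not constant (otherwise \(v\) is zero).

```python
def sketch_r2_score(y_true, y_pred):
    """Compute 1 - u / v for a single-output regression target."""
    mean_true = sum(y_true) / len(y_true)
    u = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    v = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    return 1 - u / v


y_true = [1.0, 2.0, 3.0, 4.0]
print(sketch_r2_score(y_true, y_true))                # 1.0, a perfect fit
print(sketch_r2_score(y_true, [2.5, 2.5, 2.5, 2.5]))  # 0.0, always predicting the mean
print(sketch_r2_score(y_true, [4.0, 3.0, 2.0, 1.0]))  # -3.0, worse than the mean
```

The three calls cover the cases named in the description: the best possible score, the constant model, and a model bad enough to score below zero.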

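Notes

The \(R^2\) score used when calling score on a regressor uses multioutput='uniform_average' from scikit-learn version 0.23 onward, to stay consistent with the default value of r2_score(). This affects the score method of all multioutput regressors (except for MultiOutputRegressor).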

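set_params(self, **params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter>, so that each component of a nested object can be updated.

Parameters: **params (dict) – Estimator parameters.
Returns: self – Estimator instance.
Return type: estimator instance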