gen_utils

General utility methods.

Module Contents

Classes Summary

classproperty

Allows a function to be accessed as a class-level property.

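Functions

`are_datasets_separated_by_gap_time_index`	Determines whether the training and test datasets are separated by gap units, using the time_index.
`are_ts_parameters_valid_for_split`	Validates that the time series parameters in problem_configuration are compatible with the split sizes.
`contains_all_ts_parameters`	Validates that the problem configuration contains all required keys.
`convert_to_seconds`	Converts a string describing a length of time to its length in seconds.
`deprecate_arg`	Helper that raises a warning when a deprecated argument is used.
`drop_rows_with_nans`	Drops rows that contain any NaNs from all of the given DataFrames or Series.
`get_importable_subclasses`	Gets the importable subclasses of a base class. Used to dynamically list all of our estimators, transformers, components and pipelines.
`get_random_seed`	Given a numpy.random.RandomState object, generates an int representing a seed value for another random number generator. If given an int, returns that int.
`get_random_state`	Generates a numpy.random.RandomState instance using seed.
`get_time_index`	Determines the column in the given data that should be used as the time index.
`import_or_raise`	Attempts to import the requested library by name. If the import fails, raises an ImportError or a warning.
`is_all_numeric`	Checks whether the given DataFrame contains only numeric values.
`jupyter_check`	Gets whether the code is running in an IPython environment (such as Jupyter Notebook or Jupyter Lab).
`pad_with_nans`	Pads the beginning of the data with num_to_pad rows of NaNs.
`safe_repr`	Converts the given value into a string that can safely be used for repr.
`save_plot`	Saves fig to filepath if specified, or to a default location if not.
`validate_holdout_datasets`	Validates that the holdout datasets match our expectations.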

Attributes Summary

`logger`
`SEED_BOUNDS`

Contents

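evalml.utils.gen_utils.are_datasets_separated_by_gap_time_index(train, test, pipeline_params, freq=None)

Determines whether the training and test datasets are separated by gap units, using the time_index.

This is true when users predict on unseen data, but not during cross-validation, where the target is known.

Parameters

train (pd.DataFrame) – Training data.
test (pd.DataFrame) – Data of shape [n_samples, n_features].
pipeline_params (dict) – Dictionary of time series parameters.
freq (str) – Frequency of the time index.

Returns

True if the difference in time units is equal to gap + 1.

Return type

bool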
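
The "gap + 1" condition can be illustrated with a minimal standard-library sketch. It is not the evalml implementation: the helper name `separated_by_gap` and its arguments are hypothetical, and it compares only the last training timestamp with the first test timestamp.

```python
from datetime import datetime, timedelta


def separated_by_gap(last_train_time, first_test_time, unit, gap):
    """Return True if the test data starts exactly gap + 1 units after training ends.

    Hypothetical helper for illustration only; `unit` plays the role of `freq`.
    """
    units_apart = (first_test_time - last_train_time) / unit
    return units_apart == gap + 1


last_train = datetime(2024, 1, 10)
one_day = timedelta(days=1)

# gap = 0: the test data must start on the very next day.
print(separated_by_gap(last_train, datetime(2024, 1, 11), one_day, gap=0))  # True
# gap = 2: two days are skipped, so the test data starts three days later.
print(separated_by_gap(last_train, datetime(2024, 1, 13), one_day, gap=2))  # True
print(separated_by_gap(last_train, datetime(2024, 1, 12), one_day, gap=2))  # False
```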

evalml.utils.gen_utils.are_ts_parameters_valid_for_split(gap, max_delay, forecast_horizon, n_obs, n_splits)

Validates that the time series parameters in problem_configuration are compatible with the split sizes.

Parameters

gap (int) – The gap value.
max_delay (int) – The max_delay value.
forecast_horizon (int) – The forecast_horizon value.
n_obs (int) – Number of observations in the dataset.
n_splits (int) – Number of cross-validation splits.

Returns

TsParameterValidationResult – a named tuple with four fields:
is_valid (bool) – True if the parameters are valid.
msg (str) – The error message to display. Empty if is_valid is True.
smallest_split_size (int) – Smallest split size given n_obs and n_splits.
max_window_size (int) – Maximum window size given gap, max_delay and forecast_horizon.
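
The shape of the returned named tuple can be sketched as follows. The two formulas and the validity rule are assumptions made for illustration (the page does not state how evalml computes either size), and `SplitCheck` and `check_ts_parameters` are hypothetical names.

```python
from collections import namedtuple

# Hypothetical stand-in for TsParameterValidationResult, with the same four fields.
SplitCheck = namedtuple(
    "SplitCheck", ["is_valid", "msg", "smallest_split_size", "max_window_size"]
)


def check_ts_parameters(gap, max_delay, forecast_horizon, n_obs, n_splits):
    # Assumption: the smallest split holds n_obs // (n_splits + 1) rows.
    smallest_split_size = n_obs // (n_splits + 1)
    # Assumption: the window spans gap + max_delay + forecast_horizon rows.
    max_window_size = gap + max_delay + forecast_horizon
    msg = ""
    # Assumption: a split must be strictly larger than the window to be usable.
    if smallest_split_size <= max_window_size:
        msg = (
            f"Smallest split size {smallest_split_size} is not larger than "
            f"the window size {max_window_size}."
        )
    return SplitCheck(not msg, msg, smallest_split_size, max_window_size)


ok = check_ts_parameters(gap=1, max_delay=2, forecast_horizon=3, n_obs=100, n_splits=3)
print(ok.is_valid, ok.smallest_split_size, ok.max_window_size)  # True 25 6

bad = check_ts_parameters(gap=1, max_delay=10, forecast_horizon=5, n_obs=30, n_splits=4)
print(bad.is_valid, bad.msg)
```

Returning the two computed sizes alongside the flag lets a caller report why a configuration was rejected without recomputing them.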

class evalml.utils.gen_utils.classproperty(func)

Allows a function to be accessed as a class-level property.

Example:

class LogisticRegressionBinaryPipeline(PipelineBase):
    component_graph = ['Simple Imputer', 'Logistic Regression Classifier']

    @classproperty
    def summary(cls):
        summary = ""
        for component in cls.component_graph:
            component = handle_component_class(component)
            summary += component.name + " + "
        return summary

assert LogisticRegressionBinaryPipeline.summary == "Simple Imputer + Logistic Regression Classifier + "
assert LogisticRegressionBinaryPipeline().summary == "Simple Imputer + Logistic Regression Classifier + "
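
One common way to get this behavior is a small descriptor whose `__get__` calls the wrapped function with the owning class. The sketch below is a self-contained illustration and is not claimed to be the evalml source; `classproperty_sketch` and `DemoPipeline` are hypothetical names.

```python
class classproperty_sketch:
    """Minimal descriptor that evaluates the wrapped function against the owning class."""

    def __init__(self, func):
        self.func = func

    def __get__(self, instance, owner):
        # `owner` is the class, whether access happens on the class or on an instance.
        return self.func(owner)


class DemoPipeline:
    component_graph = ["Simple Imputer", "Logistic Regression Classifier"]

    @classproperty_sketch
    def summary(cls):
        return " + ".join(cls.component_graph)


print(DemoPipeline.summary)    # Simple Imputer + Logistic Regression Classifier
print(DemoPipeline().summary)  # Simple Imputer + Logistic Regression Classifier
```

Because the descriptor defines only `__get__`, the value is computed on every access and works without instantiating the class.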

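evalml.utils.gen_utils.contains_all_ts_parameters(problem_configuration)

Validates that the problem configuration contains all required keys.

Parameters

problem_configuration (dict) – Problem configuration.

Returns

True if the configuration contains all parameters. If False, msg is a non-empty string containing the error message.

Return type

bool, str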
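
The (bool, str) return convention can be sketched as below. The set of required keys is an assumption pieced together from the parameters named elsewhere on this page (time_index, gap, max_delay, forecast_horizon), and `has_all_ts_parameters` is a hypothetical name.

```python
# Assumption: these four keys are the required ones.
REQUIRED_KEYS = ("time_index", "gap", "max_delay", "forecast_horizon")


def has_all_ts_parameters(problem_configuration):
    """Return (is_valid, msg); msg is empty when every required key is present."""
    config = problem_configuration or {}
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        return False, "problem_configuration is missing: " + ", ".join(missing)
    return True, ""


print(has_all_ts_parameters(
    {"time_index": "date", "gap": 0, "max_delay": 2, "forecast_horizon": 1}
))  # (True, '')
print(has_all_ts_parameters({"gap": 0}))
# (False, 'problem_configuration is missing: time_index, max_delay, forecast_horizon')
```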

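evalml.utils.gen_utils.convert_to_seconds(input_str)

Converts a string describing a length of time to its length in seconds.

Parameters: input_str (str) – The string to be parsed and converted to seconds.
Returns: The length of time described by input_str, in seconds.
Raises: AssertionError – If an invalid unit is used.

Examples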

>>> assert convert_to_seconds("10 hr") == 36000.0
>>> assert convert_to_seconds("30 minutes") == 1800.0
>>> assert convert_to_seconds("2.5 min") == 150.0
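
A parser with this behavior can be sketched with a lookup table of unit names. This is an illustration, not the evalml source: `to_seconds` is a hypothetical name and the accepted unit spellings are assumptions chosen to cover the examples above.

```python
# Assumption: these spellings are accepted; plural forms are handled below.
UNIT_SECONDS = {
    "s": 1, "sec": 1, "second": 1,
    "m": 60, "min": 60, "minute": 60,
    "h": 3600, "hr": 3600, "hour": 3600,
}


def to_seconds(input_str):
    """Convert a string such as "2.5 min" into a float number of seconds."""
    value, unit = input_str.split()
    # Strip a plural "s" ("minutes" -> "minute") but keep the bare unit "s".
    if len(unit) > 1 and unit.endswith("s"):
        unit = unit[:-1]
    if unit not in UNIT_SECONDS:
        raise AssertionError(f"Invalid unit: {unit!r}")
    return float(value) * UNIT_SECONDS[unit]


print(to_seconds("10 hr"))       # 36000.0
print(to_seconds("30 minutes"))  # 1800.0
print(to_seconds("2.5 min"))     # 150.0
```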

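evalml.utils.gen_utils.deprecate_arg(old_arg, new_arg, old_value, new_value)

Helper that raises a warning when a deprecated argument is used.

Parameters

old_arg (str) – Name of the old/deprecated argument.
new_arg (str) – Name of the new argument.
old_value (Any) – The value the user passed in for the old argument.
new_value (Any) – The value the user passed in for the new argument.

Returns

old_value if it is not None, otherwise new_value.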
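
The return rule above can be sketched with the standard `warnings` module. The helper name `pick_arg`, the warning category, the message text and the argument names in the usage lines are all assumptions for illustration.

```python
import warnings


def pick_arg(old_arg, new_arg, old_value, new_value):
    """Warn if the deprecated argument was supplied, then return the value to use."""
    if old_value is not None:
        # Assumption: DeprecationWarning is used here; the real category may differ.
        warnings.warn(
            f"Argument '{old_arg}' is deprecated; use '{new_arg}' instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return old_value
    return new_value


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    value = pick_arg("random_state", "random_seed", old_value=7, new_value=0)

print(value)        # 7
print(len(caught))  # 1
```

A caller would typically use the result to overwrite the new argument, so the rest of the function only ever reads one name.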

evalml.utils.gen_utils.drop_rows_with_nans(*pd_data)

Drops rows that contain any NaNs from all of the given DataFrames or Series.

Parameters: *pd_data – A sequence of pd.Series, pd.DataFrame or None.
Returns: A list of pd.DataFrame, pd.Series or None.

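evalml.utils.gen_utils.get_importable_subclasses(base_class, used_in_automl=True)

Gets the importable subclasses of a base class. Used to dynamically list all of our estimators, transformers, components and pipelines.

Parameters

base_class (abc.ABCMeta) – Base class to find all of the subclasses for.
used_in_automl – Not all components, pipelines and estimators are used in the automl search. If True, only the subclasses used in the search are included, which excludes classes related to the ExtraTrees, ElasticNet and Baseline estimators.

Returns

List of subclasses.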

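evalml.utils.gen_utils.get_random_seed(random_state, min_bound=SEED_BOUNDS.min_bound, max_bound=SEED_BOUNDS.max_bound)

Given a numpy.random.RandomState object, generates an int representing a seed value for another random number generator. If given an int, returns that int.

To protect against invalid input to a particular library's random number generator, if an int value is provided and it is outside the range "[min_bound, max_bound)", the value is projected into the range between min_bound (inclusive) and max_bound (exclusive) using modular arithmetic.

Parameters

random_state (int, numpy.random.RandomState) – Random state.
min_bound (None, int) – If not the default value of None, used as the minimum bound (inclusive) when generating the seed. Must be less than max_bound.
max_bound (None, int) – If not the default value of None, used as the maximum bound (exclusive) when generating the seed. Must be greater than min_bound.

Returns

Seed for a random number generator.

Return type

int

Raises

ValueError – If the bounds are invalid.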
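
The modular projection for integer input can be sketched as follows. Only the integer branch is shown, `project_seed` is a hypothetical name, and the exact formula is an assumption consistent with the description above rather than a copy of the evalml source.

```python
def project_seed(seed, min_bound, max_bound):
    """Map an integer seed into [min_bound, max_bound) using modular arithmetic."""
    if min_bound >= max_bound:
        raise ValueError("min_bound must be less than max_bound.")
    if seed < min_bound or seed >= max_bound:
        # Shift to a zero-based range, wrap around, then shift back.
        return (seed - min_bound) % (max_bound - min_bound) + min_bound
    return seed


print(project_seed(5, 0, 10))    # 5  (already in range, returned unchanged)
print(project_seed(12, 0, 10))   # 2
print(project_seed(-1, 0, 10))   # 9
print(project_seed(25, 10, 20))  # 15
```

Python's `%` always returns a non-negative result for a positive divisor, which is why negative seeds wrap into the range without special handling.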

evalml.utils.gen_utils.get_random_state(seed)

Generates a numpy.random.RandomState instance using seed.

Parameters: seed (None, int, np.random.RandomState object) – Seed used to generate the numpy.random.RandomState. Must be between SEED_BOUNDS.min_bound and SEED_BOUNDS.max_bound, inclusive.
Raises: ValueError – If the input seed is not within the acceptable range.
Returns: A numpy.random.RandomState instance.

evalml.utils.gen_utils.get_time_index(X: pandas.DataFrame, y: pandas.Series, time_index_name: str)

Determines the column in the given data that should be used as the time index.

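evalml.utils.gen_utils.import_or_raise(library, error_msg=None, warning=False)

Attempts to import the requested library by name. If the import fails, raises an ImportError or a warning.

Parameters

library (str) – The name of the library.
error_msg (str) – Error message to return if the import fails.
warning (bool) – If True, import_or_raise issues a warning instead of raising an ImportError. Defaults to False.

Returns

The library, if importing succeeded.

Raises

ImportError – If attempting to import the library fails because the library is not installed.
Exception – If importing the library fails.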
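
The import-or-warn pattern can be sketched with `importlib`. This is an illustration only: `try_import` is a hypothetical name, the default message is invented, and returning None after a warning is an assumption the page does not confirm.

```python
import importlib
import warnings


def try_import(library, error_msg=None, warning=False):
    """Import `library` by name; on failure either warn or raise ImportError."""
    try:
        return importlib.import_module(library)
    except ImportError as err:
        msg = error_msg or f"Missing optional dependency '{library}'."
        if warning:
            warnings.warn(msg)
            return None  # Assumption: nothing is returned after a warning.
        raise ImportError(msg) from err


json_module = try_import("json")
print(json_module.dumps({"ok": True}))  # {"ok": true}
```

This keeps optional dependencies out of module import time: the import is attempted only when the feature that needs it is actually used.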

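evalml.utils.gen_utils.is_all_numeric(df)

Checks whether the given DataFrame contains only numeric values.

Parameters: df (pd.DataFrame) – The DataFrame whose data types are checked.
Returns: True if all the columns are numeric and have no missing values, False otherwise.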

evalml.utils.gen_utils.jupyter_check()

Gets whether the code is running in an IPython environment (such as Jupyter Notebook or Jupyter Lab).

Returns: True if running in IPython, False otherwise.
Return type: bool
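
One way to perform such a check is to ask IPython for the active shell. The sketch below is an assumption about the approach, not the evalml source; `running_in_ipython` is a hypothetical name, and the check degrades to False when IPython is not installed.

```python
def running_in_ipython():
    """Return True if an IPython shell (for example a Jupyter kernel) is active."""
    try:
        from IPython import get_ipython
    except ImportError:
        # IPython is not installed, so the code cannot be running inside it.
        return False
    # get_ipython() returns None when called outside an IPython session.
    return get_ipython() is not None


print(running_in_ipython())
```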

evalml.utils.gen_utils.logger

evalml.utils.gen_utils.pad_with_nans(pd_data, num_to_pad)

Pads the beginning of the data with num_to_pad rows of NaNs.

Parameters

pd_data (pd.DataFrame or pd.Series) – Data to pad.
num_to_pad (int) – Number of NaN rows to pad.

Returns

pd.DataFrame or pd.Series

evalml.utils.gen_utils.safe_repr(value)

Converts the given value into a string that can safely be used for repr.

Parameters: value – The item to convert.
Returns: String representation of the value.
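
The motivation is that `repr(float("nan"))` is `nan` and `repr(float("inf"))` is `inf`, neither of which is a valid Python expression. A minimal sketch of a safer conversion follows; `safe_repr_sketch` is a hypothetical name, and the replacement strings (including the assumption that NumPy is available as `np` wherever the string is later evaluated) are illustrative choices, not confirmed evalml output.

```python
import math


def safe_repr_sketch(value):
    """Return a string that evaluates back to `value`, even for NaN and infinities."""
    if isinstance(value, float):
        if math.isnan(value):
            return "np.nan"  # Assumption: the consumer has `import numpy as np`.
        if math.isinf(value):
            return f"float('{value!r}')"
    return repr(value)


print(safe_repr_sketch(float("nan")))   # np.nan
print(safe_repr_sketch(float("-inf")))  # float('-inf')
print(safe_repr_sketch(0.5))            # 0.5
print(safe_repr_sketch("mean"))         # 'mean'
```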

evalml.utils.gen_utils.save_plot(fig, filepath=None, format='png', interactive=False, return_filepath=False)

Saves fig to filepath if specified, or to a default location if not.

Parameters

fig (Figure) – Figure to be saved.
filepath (str or Path, optional) – Location to save the file. The default file name is "test_plot".
format (str) – Extension for the figure to be saved as. Ignored if interactive is True and fig is of type plotly.Figure. Defaults to 'png'.
interactive (bool, optional) – If True and fig is of type plotly.Figure, saves the fig as interactive instead of static, and format is set to 'html'. Defaults to False.
return_filepath (bool, optional) – Whether to return the final filepath the image is saved to. Defaults to False.

Returns

String representing the final filepath the image was saved to if return_filepath is set to True. Defaults to None.

evalml.utils.gen_utils.SEED_BOUNDS

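evalml.utils.gen_utils.validate_holdout_datasets(X, X_train, pipeline_params)

Validates that the holdout datasets match our expectations.

This function is run before calling predict in a time series pipeline. It verifies that X (the holdout set) is gap units away from the training set and that its length is less than or equal to forecast_horizon.

Parameters

X (pd.DataFrame) – Data of shape [n_samples, n_features].
X_train (pd.DataFrame) – Training data.
pipeline_params (dict) – Dictionary of time series parameters containing gap, forecast_horizon and time_index.

Returns

TSHoldoutValidationResult – a named tuple with three fields:
is_valid (bool) – True if the holdout data is valid.
error_messages (list) – List of error messages to display. Empty if is_valid is True.
error_codes (list) – List of error codes to display. Empty if is_valid is True.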