no_variance_data_check

Data check that checks whether the target or any of the features have no variance.

Module Contents

Classes Summary

NoVarianceDataCheck

Check if the target or any of the features have no variance.

Contents

class evalml.data_checks.no_variance_data_check.NoVarianceDataCheck(count_nan_as_value=False)

Check if the target or any of the features have no variance.

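Parameters

count_nan_as_value (bool) – If True, missing values are counted as their own unique value. Additionally, if True, a DataCheckWarning is returned instead of an error when the feature has mostly missing data and only one unique value. Defaults to False.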
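The effect of this flag can be illustrated with plain pandas. The sketch below is not EvalML's implementation; it assumes the check comes down to counting unique values per column, and uses `Series.nunique` with `dropna` to mimic the two settings.

```python
import pandas as pd

# A column with one real value and several missing entries.
column = pd.Series([2, 2, 2, 2, None, None, None, None])

# Missing values ignored: comparable to count_nan_as_value=False (assumption).
unique_ignoring_nan = column.nunique(dropna=True)

# Missing values counted as one extra value: comparable to
# count_nan_as_value=True (assumption).
unique_counting_nan = column.nunique(dropna=False)

print(unique_ignoring_nan, unique_counting_nan)  # 1 2
```

With missing values ignored the column has a single unique value; counting them raises the total to two.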

Methods

name

Returns a name describing the data check.

validate

Check if the target or any of the features have no variance (only 1 unique value).

name(cls)

Returns a name describing the data check.

validate(self, X, y=None)

Check if the target or any of the features have no variance (only 1 unique value).

Parameters
  • X (pd.DataFrame, np.ndarray) – The input features.

  • y (pd.Series, np.ndarray) – Optional. The target data.

Returns

A list of dictionaries, one per warning or error, for each feature or target with no variance.

Return type

list[dict]

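Examples

>>> import pandas as pd
>>> from evalml.data_checks.no_variance_data_check import NoVarianceDataCheck

Columns or target data with only one unique value are flagged with a NO_VARIANCE message.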

>>> X = pd.DataFrame([2, 2, 2, 2, 2, 2, 2, 2], columns=["First_Column"])
>>> y = pd.Series([1, 1, 1, 1, 1, 1, 1, 1])
...
>>> novar_dc = NoVarianceDataCheck()
>>> assert novar_dc.validate(X, y) == [
...     {
...         "message": "'First_Column' has 1 unique value.",
...         "data_check_name": "NoVarianceDataCheck",
...         "level": "warning",
...         "details": {"columns": ["First_Column"], "rows": None},
...         "code": "NO_VARIANCE",
...         "action_options": [
...             {
...                 "code": "DROP_COL",
...                 "data_check_name": "NoVarianceDataCheck",
...                 "parameters": {},
...                 "metadata": {"columns": ["First_Column"], "rows": None}
...             },
...         ]
...     },
...     {
...         "message": "Y has 1 unique value.",
...         "data_check_name": "NoVarianceDataCheck",
...         "level": "warning",
...         "details": {"columns": ["Y"], "rows": None},
...         "code": "NO_VARIANCE",
...         "action_options": []
...     }
... ]
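The `action_options` entries can be read programmatically to decide which columns to remove. The following is a minimal sketch that works only on the message structure shown above; `columns_to_drop` is a hypothetical helper, not part of EvalML.

```python
def columns_to_drop(messages):
    """Return column names listed by any DROP_COL action option, in order.

    Hypothetical helper: it relies only on the dictionary layout shown
    in the example output, not on any EvalML API.
    """
    columns = []
    for message in messages:
        for action in message.get("action_options", []):
            if action.get("code") != "DROP_COL":
                continue
            for column in action.get("metadata", {}).get("columns") or []:
                if column not in columns:
                    columns.append(column)
    return columns


# Messages copied from the example output above.
messages = [
    {
        "message": "'First_Column' has 1 unique value.",
        "data_check_name": "NoVarianceDataCheck",
        "level": "warning",
        "details": {"columns": ["First_Column"], "rows": None},
        "code": "NO_VARIANCE",
        "action_options": [
            {
                "code": "DROP_COL",
                "data_check_name": "NoVarianceDataCheck",
                "parameters": {},
                "metadata": {"columns": ["First_Column"], "rows": None},
            },
        ],
    },
    {
        "message": "Y has 1 unique value.",
        "data_check_name": "NoVarianceDataCheck",
        "level": "warning",
        "details": {"columns": ["Y"], "rows": None},
        "code": "NO_VARIANCE",
        "action_options": [],
    },
]

droppable = columns_to_drop(messages)
print(droppable)  # ['First_Column']
```

The target message carries no action options, so only the feature column is returned.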

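By default, NaNs are not counted as distinct values. In the first example below, there are still two distinct values besides None, so nothing is flagged. In the second, the target is entirely null, so it has no distinct values.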

>>> X["First_Column"] = [2, 2, 2, 3, 3, 3, None, None]
>>> y = pd.Series([1, 1, 1, 2, 2, 2, None, None])
>>> assert novar_dc.validate(X, y) == []
...
...
>>> y = pd.Series([None] * 8)
>>> assert novar_dc.validate(X, y) == [
...     {
...         "message": "Y has 0 unique values.",
...         "data_check_name": "NoVarianceDataCheck",
...         "level": "warning",
...         "details": {"columns": ["Y"], "rows": None},
...         "code": "NO_VARIANCE_ZERO_UNIQUE",
...         "action_options": []
...     }
... ]
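The mapping from unique-value counts to the message codes seen so far can be sketched with pandas alone. This is an approximation of the default behaviour (NaN not counted), not EvalML's code; `variance_code` is a hypothetical name introduced for illustration.

```python
import pandas as pd


def variance_code(values):
    """Return the message code suggested by the number of non-null unique values.

    Hypothetical sketch of the default behaviour (NaN ignored); the code
    strings are the ones that appear in the example outputs on this page.
    """
    n_unique = pd.Series(values).nunique(dropna=True)
    if n_unique == 0:
        return "NO_VARIANCE_ZERO_UNIQUE"
    if n_unique == 1:
        return "NO_VARIANCE"
    return None  # two or more distinct values: nothing to report


two_values = variance_code([1, 1, 1, 2, 2, 2, None, None])
all_null = variance_code([None] * 7)
one_value = variance_code([1, 1, 1, 1, None, None, None, None])
print(two_values, all_null, one_value)
```

An entirely null series yields the zero-unique code, a series with one non-null value yields `NO_VARIANCE`, and anything with two or more distinct values passes.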

Because None is not counted as a distinct value by default, X and y each contain only one unique value.

>>> X["First_Column"] = [2, 2, 2, 2, None, None, None, None]
>>> y = pd.Series([1, 1, 1, 1, None, None, None, None])
>>> assert novar_dc.validate(X, y) == [
...     {
...         "message": "'First_Column' has 1 unique value.",
...         "data_check_name": "NoVarianceDataCheck",
...         "level": "warning",
...         "details": {"columns": ["First_Column"], "rows": None},
...         "code": "NO_VARIANCE",
...         "action_options": [
...             {
...                 "code": "DROP_COL",
...                 "data_check_name": "NoVarianceDataCheck",
...                 "parameters": {},
...                 "metadata": {"columns": ["First_Column"], "rows": None}
...             },
...         ]
...     },
...     {
...         "message": "Y has 1 unique value.",
...         "data_check_name": "NoVarianceDataCheck",
...         "level": "warning",
...         "details": {"columns": ["Y"], "rows": None},
...         "code": "NO_VARIANCE",
...         "action_options": []
...     }
... ]

If count_nan_as_value is set to True, NaNs are counted as a unique value. When a column reaches two unique values only because NaN is counted, a warning is issued so that the user can encode the nulls.

>>> novar_dc = NoVarianceDataCheck(count_nan_as_value=True)
>>> assert novar_dc.validate(X, y) == [
...     {
...         "message": "'First_Column' has two unique values including nulls. Consider encoding the nulls for this column to be useful for machine learning.",
...         "data_check_name": "NoVarianceDataCheck",
...         "level": "warning",
...         "details": {"columns": ["First_Column"], "rows": None},
...         "code": "NO_VARIANCE_WITH_NULL",
...         "action_options": [
...             {
...                 "code": "DROP_COL",
...                 "data_check_name": "NoVarianceDataCheck",
...                 "parameters": {},
...                 "metadata": {"columns": ["First_Column"], "rows": None}
...             },
...         ]
...     },
...     {
...         "message": "Y has two unique values including nulls. Consider encoding the nulls for this column to be useful for machine learning.",
...         "data_check_name": "NoVarianceDataCheck",
...         "level": "warning",
...         "details": {"columns": ["Y"], "rows": None},
...         "code": "NO_VARIANCE_WITH_NULL",
...         "action_options": []
...     }
... ]
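One way to act on the "consider encoding the nulls" message is to make missingness explicit before modelling. The sketch below uses pandas only; the indicator column name `First_Column_is_null` and the sentinel `-1` are arbitrary choices made for illustration, not EvalML conventions.

```python
import pandas as pd

X = pd.DataFrame({"First_Column": [2, 2, 2, 2, None, None, None, None]})

# Record where values were missing as an explicit boolean feature.
# The column name is an assumption chosen for this example.
X["First_Column_is_null"] = X["First_Column"].isna()

# Replace the nulls with a sentinel so the column holds two real values.
# -1 is an arbitrary placeholder; pick one that cannot occur in the data.
X["First_Column"] = X["First_Column"].fillna(-1)

print(X["First_Column"].nunique())  # 2
```

After encoding, the column has two non-null unique values, so its variance no longer depends on how NaN is counted.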