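no_variance_data_check
Data check that checks whether the target or any of the features have no variance.
Module Contents
Classes Summary
NoVarianceDataCheck | Check whether the target or any of the features have no variance.
Contents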
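- class evalml.data_checks.no_variance_data_check.NoVarianceDataCheck(count_nan_as_value=False)
Check whether the target or any of the features have no variance.
- Parameters
count_nan_as_value (bool) – If True, missing values are counted as their own unique value. In addition, if True, a DataCheckWarning is returned instead of an error when the feature has mostly missing data and only one unique value. Defaults to False.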
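The counting rule behind `count_nan_as_value` can be pictured with pandas' own `Series.nunique`, whose `dropna` flag decides whether missing values count as a distinct value. This is only an illustration of the rule, not the library's implementation; the variable names are invented for the example.

```python
import pandas as pd

# A column with one real value and several missing entries.
column = pd.Series([2, 2, 2, 2, None, None, None, None])

# Missing values ignored: analogous to count_nan_as_value=False.
unique_without_nan = column.nunique(dropna=True)

# Missing values counted as one extra value: analogous to count_nan_as_value=True.
unique_with_nan = column.nunique(dropna=False)

print(unique_without_nan, unique_with_nan)  # 1 2
```

With the default setting the column looks constant (one unique value); counting NaN raises the count to two, which is the situation the parameter description refers to.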
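Methods
- name(cls)
Returns a name describing the data check.
- validate(self, X, y=None)
Check whether the target or any of the features have no variance (only 1 unique value).
- Parameters
X (pd.DataFrame, np.ndarray) – The input features.
y (pd.Series, np.ndarray) – Optional, the target data.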
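- Returns
A dict of warnings/errors corresponding to features or target with no variance. Note that the examples on this page compare the result against a list of such dictionaries, one per finding.
- Return type
dict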
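Each finding in this page's examples carries an `action_options` list, and the feature findings suggest a `DROP_COL` action. A minimal sketch of consuming that structure follows. The `results` value is a hand-written sample in the shape shown in the examples, and `columns_to_drop` is a hypothetical helper, not part of evalml.

```python
# Hand-written sample in the shape shown in this page's examples;
# in practice it would come from NoVarianceDataCheck().validate(X, y).
results = [
    {
        "message": "'First_Column' has 1 unique value.",
        "data_check_name": "NoVarianceDataCheck",
        "level": "warning",
        "details": {"columns": ["First_Column"], "rows": None},
        "code": "NO_VARIANCE",
        "action_options": [
            {
                "code": "DROP_COL",
                "data_check_name": "NoVarianceDataCheck",
                "parameters": {},
                "metadata": {"columns": ["First_Column"], "rows": None},
            },
        ],
    },
    {
        "message": "Y has 1 unique value.",
        "data_check_name": "NoVarianceDataCheck",
        "level": "warning",
        "details": {"columns": ["Y"], "rows": None},
        "code": "NO_VARIANCE",
        "action_options": [],
    },
]


def columns_to_drop(findings):
    """Collect column names from every DROP_COL action option (hypothetical helper)."""
    columns = []
    for finding in findings:
        for action in finding["action_options"]:
            if action["code"] == "DROP_COL":
                for column in action["metadata"]["columns"]:
                    if column not in columns:  # keep first-seen order, no duplicates
                        columns.append(column)
    return columns


print(columns_to_drop(results))  # ['First_Column']
```

The target finding contributes nothing because its `action_options` list is empty in the examples.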
Examples
>>> import pandas as pd
>>> from evalml.data_checks.no_variance_data_check import NoVarianceDataCheck
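Columns or target data that have only a single unique value are reported; in this example the entries carry the level "warning".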
>>> X = pd.DataFrame([2, 2, 2, 2, 2, 2, 2, 2], columns=["First_Column"])
>>> y = pd.Series([1, 1, 1, 1, 1, 1, 1, 1])
>>> novar_dc = NoVarianceDataCheck()
>>> assert novar_dc.validate(X, y) == [
...     {
...         "message": "'First_Column' has 1 unique value.",
...         "data_check_name": "NoVarianceDataCheck",
...         "level": "warning",
...         "details": {"columns": ["First_Column"], "rows": None},
...         "code": "NO_VARIANCE",
...         "action_options": [
...             {
...                 "code": "DROP_COL",
...                 "data_check_name": "NoVarianceDataCheck",
...                 "parameters": {},
...                 "metadata": {"columns": ["First_Column"], "rows": None}
...             },
...         ]
...     },
...     {
...         "message": "Y has 1 unique value.",
...         "data_check_name": "NoVarianceDataCheck",
...         "level": "warning",
...         "details": {"columns": ["Y"], "rows": None},
...         "code": "NO_VARIANCE",
...         "action_options": []
...     }
... ]
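By default, NaNs are not counted as distinct values. In the first example below, there are still two distinct values besides None, so nothing is reported. In the second, the target is entirely null, so it has no distinct values.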
>>> X["First_Column"] = [2, 2, 2, 3, 3, 3, None, None]
>>> y = pd.Series([1, 1, 1, 2, 2, 2, None, None])
>>> assert novar_dc.validate(X, y) == []
>>> y = pd.Series([None] * 7)
>>> assert novar_dc.validate(X, y) == [
...     {
...         "message": "Y has 0 unique values.",
...         "data_check_name": "NoVarianceDataCheck",
...         "level": "warning",
...         "details": {"columns": ["Y"], "rows": None},
...         "code": "NO_VARIANCE_ZERO_UNIQUE",
...         "action_options": []
...     }
... ]
Because None is not counted as a distinct value by default, X and y each have only one unique value here.
>>> X["First_Column"] = [2, 2, 2, 2, None, None, None, None]
>>> y = pd.Series([1, 1, 1, 1, None, None, None, None])
>>> assert novar_dc.validate(X, y) == [
...     {
...         "message": "'First_Column' has 1 unique value.",
...         "data_check_name": "NoVarianceDataCheck",
...         "level": "warning",
...         "details": {"columns": ["First_Column"], "rows": None},
...         "code": "NO_VARIANCE",
...         "action_options": [
...             {
...                 "code": "DROP_COL",
...                 "data_check_name": "NoVarianceDataCheck",
...                 "parameters": {},
...                 "metadata": {"columns": ["First_Column"], "rows": None}
...             },
...         ]
...     },
...     {
...         "message": "Y has 1 unique value.",
...         "data_check_name": "NoVarianceDataCheck",
...         "level": "warning",
...         "details": {"columns": ["Y"], "rows": None},
...         "code": "NO_VARIANCE",
...         "action_options": []
...     }
... ]
If count_nan_as_value is set to True, NaNs are counted as unique values. When a column has enough unique values only because count_nan_as_value is True, a warning is raised so that the user can encode the nulls.
>>> novar_dc = NoVarianceDataCheck(count_nan_as_value=True)
>>> assert novar_dc.validate(X, y) == [
...     {
...         "message": "'First_Column' has two unique values including nulls. Consider encoding the nulls for this column to be useful for machine learning.",
...         "data_check_name": "NoVarianceDataCheck",
...         "level": "warning",
...         "details": {"columns": ["First_Column"], "rows": None},
...         "code": "NO_VARIANCE_WITH_NULL",
...         "action_options": [
...             {
...                 "code": "DROP_COL",
...                 "data_check_name": "NoVarianceDataCheck",
...                 "parameters": {},
...                 "metadata": {"columns": ["First_Column"], "rows": None}
...             },
...         ]
...     },
...     {
...         "message": "Y has two unique values including nulls. Consider encoding the nulls for this column to be useful for machine learning.",
...         "data_check_name": "NoVarianceDataCheck",
...         "level": "warning",
...         "details": {"columns": ["Y"], "rows": None},
...         "code": "NO_VARIANCE_WITH_NULL",
...         "action_options": []
...     }
... ]
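The examples on this page use three result codes: `NO_VARIANCE_ZERO_UNIQUE`, `NO_VARIANCE`, and `NO_VARIANCE_WITH_NULL`. The sketch below shows one way the mapping from a column's values to those codes could be read off the examples. It is a simplified reconstruction in plain Python under that assumption, not evalml's code; `is_missing` and `classify_variance` are hypothetical names.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing (assumption for this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def classify_variance(values, count_nan_as_value=False):
    """Return the code the examples suggest, or None when nothing is flagged.

    Hypothetical helper: a simplified reading of the examples, not evalml code.
    """
    present = {v for v in values if not is_missing(v)}
    has_missing = any(is_missing(v) for v in values)

    # Missing values add one extra "unique value" only when asked to.
    n_unique = len(present) + (1 if count_nan_as_value and has_missing else 0)

    if n_unique == 0:
        return "NO_VARIANCE_ZERO_UNIQUE"
    if n_unique == 1:
        return "NO_VARIANCE"
    if n_unique == 2 and count_nan_as_value and has_missing:
        # Two values only because nulls were counted as one of them.
        return "NO_VARIANCE_WITH_NULL"
    return None


half_missing = [2, 2, 2, 2, None, None, None, None]
print(classify_variance(half_missing))                           # NO_VARIANCE
print(classify_variance(half_missing, count_nan_as_value=True))  # NO_VARIANCE_WITH_NULL
```

Run against the inputs used in the examples, the sketch yields the same codes as the documented results; behavior on inputs the examples do not cover is a guess.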