id_columns_data_check

Data check that checks if any of the features are likely to be ID columns.

Module Contents

Classes Summary

IDColumnsDataCheck

Check if any of the features are likely to be ID columns.

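Contents

evalml.data_checks.id_columns_data_check.IDColumnsDataCheck(id_threshold=1.0, exclude_time_index=True)

Check if any of the features are likely to be ID columns.

Parameters
  • id_threshold (float) – The probability threshold at or above which a column is considered an ID column. Defaults to 1.0.

  • exclude_time_index (bool) – If True, the column set as the time index is excluded from the data check. Defaults to True.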

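Methods

name

Returns a name describing the data check.

validate

Check if any of the features are likely to be ID columns. Currently performs a number of simple checks.

name(cls)

Returns a name describing the data check.

validate(self, X, y=None)

Check if any of the features are likely to be ID columns. Currently performs a number of simple checks.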

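Checks performed are:

  • the column name is "id"

  • the column name ends in "_id"

  • the column contains all unique values (and is of categorical or integer type)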
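The three checks above can be approximated in plain pandas. The following is a minimal sketch, not EvalML's implementation: the helper names `id_column_scores` and `likely_id_columns` are hypothetical, and the scores (0.95 for a single signal, 1.0 when the name and uniqueness signals agree) are assumptions chosen only to be consistent with the examples on this page.

```python
import pandas as pd


def id_column_scores(df: pd.DataFrame) -> dict:
    """Hypothetical helper: score how likely each column is to be an ID column."""
    scores = {}
    for col in df.columns:
        name = str(col).lower()
        # Name-based signal: the column is called "id" or ends in "_id".
        name_hit = name == "id" or name.endswith("_id")
        series = df[col]
        # Uniqueness only counts for integer or categorical columns.
        type_ok = pd.api.types.is_integer_dtype(series) or isinstance(
            series.dtype, pd.CategoricalDtype
        )
        unique_hit = type_ok and series.nunique() == len(series)
        if name_hit and unique_hit:
            scores[col] = 1.0  # assumed score when both signals agree
        elif name_hit or unique_hit:
            scores[col] = 0.95  # assumed score for a single signal
    return scores


def likely_id_columns(df: pd.DataFrame, id_threshold: float = 1.0) -> list:
    """Hypothetical helper: columns whose score meets the threshold."""
    return [c for c, s in id_column_scores(df).items() if s >= id_threshold]


customers = pd.DataFrame({
    "profits": [25, 15, 15, 31, 19],
    "customer_id": [123, 124, 125, 126, 127],
    "Sales": [10, 42, 31, 51, 61],
})
ranks = pd.DataFrame({
    "humidity": ["high", "very high", "low", "low", "high"],
    "Country_Rank": [1, 2, 3, 4, 5],
    "Sales": ["very high", "high", "high", "medium", "very low"],
})

print(likely_id_columns(customers))                # ['customer_id']
print(likely_id_columns(ranks))                    # []
print(likely_id_columns(ranks, id_threshold=0.95)) # ['Country_Rank']
```

Under these assumed scores, a column that is merely unique (such as "Country_Rank") is only reported once the threshold is lowered below 1.0, which mirrors the behaviour described in the examples.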

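Parameters
  • X (pd.DataFrame, np.ndarray) – The input features to check.

  • y (pd.Series) – The target. Defaults to None. Ignored.

Returns

A dictionary of features with column name or index and their probability of being ID columns. In the examples on this page, the result compares equal to a list of warning dictionaries.

Return type

dict

Examples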

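>>> import pandas as pd
>>> from evalml.data_checks import IDColumnsDataCheck

A column whose name ends in "_id" and whose values are all unique is likely to be an ID column.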

>>> df = pd.DataFrame({
...     "profits": [25, 15, 15, 31, 19],
...     "customer_id": [123, 124, 125, 126, 127],
...     "Sales": [10, 42, 31, 51, 61]
... })
...
>>> id_col_check = IDColumnsDataCheck()
>>> assert id_col_check.validate(df) == [
...     {
...         "message": "Columns 'customer_id' are 100.0% or more likely to be an ID column",
...         "data_check_name": "IDColumnsDataCheck",
...         "level": "warning",
...         "code": "HAS_ID_COLUMN",
...         "details": {"columns": ["customer_id"], "rows": None},
...         "action_options": [
...             {
...                 "code": "DROP_COL",
...                 "data_check_name": "IDColumnsDataCheck",
...                 "parameters": {},
...                 "metadata": {"columns": ["customer_id"], "rows": None}
...             }
...         ]
...     }
... ]

A column named "ID" whose values are all unique will also be identified as an ID column.

>>> df = df.rename(columns={"customer_id": "ID"})
>>> id_col_check = IDColumnsDataCheck()
>>> assert id_col_check.validate(df) == [
...     {
...         "message": "Columns 'ID' are 100.0% or more likely to be an ID column",
...         "data_check_name": "IDColumnsDataCheck",
...         "level": "warning",
...         "code": "HAS_ID_COLUMN",
...         "details": {"columns": ["ID"], "rows": None},
...         "action_options": [
...             {
...                 "code": "DROP_COL",
...                 "data_check_name": "IDColumnsDataCheck",
...                 "parameters": {},
...                 "metadata": {"columns": ["ID"], "rows": None}
...             }
...         ]
...     }
... ]

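Although all of its values are unique, "Country_Rank" is not identified as an ID column, because id_threshold defaults to 1.0 and the column's name does not indicate that it is an ID.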

>>> df = pd.DataFrame({
...    "humidity": ["high", "very high", "low", "low", "high"],
...    "Country_Rank": [1, 2, 3, 4, 5],
...    "Sales": ["very high", "high", "high", "medium", "very low"]
... })
...
>>> id_col_check = IDColumnsDataCheck()
>>> assert id_col_check.validate(df) == []

However, lowering the threshold causes this column to be identified as an ID.

>>> id_col_check = IDColumnsDataCheck(id_threshold=0.95)
>>> assert id_col_check.validate(df) == [
...     {
...         "message": "Columns 'Country_Rank' are 95.0% or more likely to be an ID column",
...         "data_check_name": "IDColumnsDataCheck",
...         "level": "warning",
...         "details": {"columns": ["Country_Rank"], "rows": None},
...         "code": "HAS_ID_COLUMN",
...         "action_options": [
...             {
...                 "code": "DROP_COL",
...                 "data_check_name": "IDColumnsDataCheck",
...                 "parameters": {},
...                 "metadata": {"columns": ["Country_Rank"], "rows": None}
...             }
...         ]
...     }
... ]

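If the first column of the dataframe contains all unique values and is named "ID" or has a name ending in "_id", it is probably the primary key. The other ID columns should be dropped.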

>>> df = pd.DataFrame({
...     "sales_id": [0, 1, 2, 3, 4],
...     "customer_id": [123, 124, 125, 126, 127],
...     "Sales": [10, 42, 31, 51, 61]
... })
...
>>> id_col_check = IDColumnsDataCheck()
>>> assert id_col_check.validate(df) == [
...     {
...         "message": "The first column 'sales_id' is likely to be the primary key",
...         "data_check_name": "IDColumnsDataCheck",
...         "level": "warning",
...         "code": "HAS_ID_FIRST_COLUMN",
...         "details": {"columns": ["sales_id"], "rows": None},
...         "action_options": [
...             {
...                 "code": "SET_FIRST_COL_ID",
...                 "data_check_name": "IDColumnsDataCheck",
...                 "parameters": {},
...                 "metadata": {"columns": ["sales_id"], "rows": None}
...             }
...         ]
...     },
...     {
...         "message": "Columns 'customer_id' are 100.0% or more likely to be an ID column",
...         "data_check_name": "IDColumnsDataCheck",
...         "level": "warning",
...         "code": "HAS_ID_COLUMN",
...         "details": {"columns": ["customer_id"], "rows": None},
...         "action_options": [
...             {
...                 "code": "DROP_COL",
...                 "data_check_name": "IDColumnsDataCheck",
...                 "parameters": {},
...                 "metadata": {"columns": ["customer_id"], "rows": None}
...             }
...         ]
...     }
... ]
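
The warning dictionaries in the examples above can be post-processed with plain Python. The sketch below collects the columns named by "DROP_COL" action options; `columns_to_drop` is a hypothetical helper, not part of EvalML, and it assumes only the result structure shown in the last example.

```python
def columns_to_drop(warnings):
    """Hypothetical helper: collect columns listed under DROP_COL action options."""
    columns = []
    for warning in warnings:
        for action in warning.get("action_options", []):
            if action.get("code") != "DROP_COL":
                continue
            # "columns" may be None in the metadata, so fall back to an empty list.
            for col in action.get("metadata", {}).get("columns") or []:
                if col not in columns:
                    columns.append(col)
    return columns


# Result structure copied from the last example above.
results = [
    {
        "message": "The first column 'sales_id' is likely to be the primary key",
        "data_check_name": "IDColumnsDataCheck",
        "level": "warning",
        "code": "HAS_ID_FIRST_COLUMN",
        "details": {"columns": ["sales_id"], "rows": None},
        "action_options": [
            {
                "code": "SET_FIRST_COL_ID",
                "data_check_name": "IDColumnsDataCheck",
                "parameters": {},
                "metadata": {"columns": ["sales_id"], "rows": None},
            }
        ],
    },
    {
        "message": "Columns 'customer_id' are 100.0% or more likely to be an ID column",
        "data_check_name": "IDColumnsDataCheck",
        "level": "warning",
        "code": "HAS_ID_COLUMN",
        "details": {"columns": ["customer_id"], "rows": None},
        "action_options": [
            {
                "code": "DROP_COL",
                "data_check_name": "IDColumnsDataCheck",
                "parameters": {},
                "metadata": {"columns": ["customer_id"], "rows": None},
            }
        ],
    },
]

print(columns_to_drop(results))  # ['customer_id']
```

The resulting list could then be passed to pandas, for example `df.drop(columns=columns_to_drop(results))`, which removes "customer_id" while leaving the likely primary key "sales_id" in place.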