Understanding Data Check Actions#

EvalML streamlines the creation and implementation of machine learning models for tabular data. One of the many features it offers is data checks, which help determine the health of our data before we train a model on it. These data checks have actions associated with them, which will be shown in this notebook. In our default data checks, we have the following checks:

  • NullDataCheck: checks whether the rows or columns are null or highly null

  • IDColumnsDataCheck: checks for columns that could be ID columns

  • TargetLeakageDataCheck: checks if any of the input features have a high association with the target

  • InvalidTargetDataCheck: checks if there are null or other invalid values in the target

  • NoVarianceDataCheck: checks if either the target or any of the features have no variance

EvalML has additional data checks, which can be seen here, with usage examples here. Below, we will walk through the usage of EvalML's default data checks and actions.
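For reference, the checks listed above can also be run directly, outside of AutoML, through the DefaultDataChecks collection. The following is a minimal sketch of that usage (it assumes a feature matrix X and target y are already loaded; the rest of this notebook instead uses search_iterative, which runs these checks for us):

from evalml.data_checks import DefaultDataChecks

# Collect the default checks for a binary problem and validate X and y.
data_checks = DefaultDataChecks(problem_type="binary", objective="log loss binary")
messages = data_checks.validate(X, y)  # a list of warning/error dictionaries
for message in messages:
    print(message["level"], "-", message["message"])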

First, we import the necessary dependencies to demonstrate these checks.

[1]:
import woodwork as ww
import pandas as pd
from evalml import AutoMLSearch
from evalml.demos import load_fraud
from evalml.preprocessing import split_data

Let's take a look at the input feature data. EvalML uses the Woodwork library to represent this data. The demo data that EvalML returns is a Woodwork DataTable and DataColumn.

[2]:
X, y = load_fraud(n_rows=1500)
X.head()
             Number of Features
Boolean                       1
Categorical                   6
Numeric                       5

Number of training examples: 1500
Targets
False    86.60%
True     13.40%
Name: count, dtype: object
[2]:
card_id store_id datetime amount currency customer_present expiration_date provider lat lng region country
id
0 32261 8516 2019-01-01 00:12:26 24900 CUC True 08/24 Mastercard 38.58894 -89.99038 Fairview Heights US
1 16434 8516 2019-01-01 09:42:03 15789 MYR False 11/21 Discover 38.58894 -89.99038 Fairview Heights US
2 23468 8516 2019-04-17 08:17:01 1883 AUD False 09/27 Discover 38.58894 -89.99038 Fairview Heights US
3 14364 8516 2019-01-30 11:54:30 82120 KRW True 09/20 JCB 16 digit 38.58894 -89.99038 Fairview Heights US
4 29407 8516 2019-05-01 17:59:36 25745 MUR True 09/22 American Express 38.58894 -89.99038 Fairview Heights US

Adding noise and unclean data#

This data is already clean and compatible with EvalML's AutoMLSearch. In order to demonstrate the EvalML default data checks, we will add the following:

  • A column of mostly null values (<0.5% non-null)

  • A column with low/no variance

  • A row of null values

  • A missing target value

We will add the first two columns to the whole dataset, and the last two items to the training data only. Note: these represent only some of the scenarios that the EvalML default data checks can catch.

[3]:
# add a column with no variance in the data
X["no_variance"] = [1 for _ in range(X.shape[0])]

# add a column with >99.5% null values
X["mostly_nulls"] = [None] * (X.shape[0] - 5) + [i for i in range(5)]

# since we changed the data, let's reinitialize the woodwork datatable
X.ww.init()
# let's split some training and validation data
X_train, X_valid, y_train, y_valid = split_data(X, y, problem_type="binary")
[4]:
# make row 1 all nan values
X_train.iloc[1] = [None] * X_train.shape[1]

# make one of the target values null
y_train[990] = None

X_train.ww.init()
y_train = ww.init_series(y_train, logical_type="Categorical")
# Let's take another look at the new X_train data
X_train
[4]:
card_id store_id datetime amount currency customer_present expiration_date provider lat lng region country no_variance mostly_nulls
id
872 15492 2868 2019-08-03 02:50:04 80719 HNL True 08/27 American Express 5.47090 100.24529 Batu Feringgi MY 1 <NA>
1477 <NA> <NA> NaT <NA> NaN <NA> NaN NaN NaN NaN NaN NaN <NA> <NA>
158 22440 6813 2019-07-12 11:07:25 1849 SEK True 09/20 American Express 26.26490 81.54855 Jais IN 1 <NA>
808 8096 8096 2019-06-11 21:33:36 41358 MOP True 04/29 VISA 13 digit 59.37722 28.19028 Narva EE 1 <NA>
336 33270 1529 2019-03-23 21:44:00 32594 CUC False 04/22 Mastercard 51.39323 0.47713 Strood GB 1 <NA>
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
339 8484 5358 2019-01-10 07:47:28 89503 GMD False 11/24 Maestro 47.30997 8.52462 Adliswil CH 1 <NA>
1383 17565 3929 2019-01-15 01:11:02 14264 DKK True 06/20 VISA 13 digit 50.72043 11.34046 Rudolstadt DE 1 <NA>
893 108 44 2019-05-17 00:53:39 93218 SLL True 12/24 JCB 16 digit 15.72892 120.57224 Burgos PH 1 <NA>
385 29983 152 2019-06-09 06:50:29 41105 RWF False 07/20 JCB 16 digit -6.80000 39.25000 Magomeni TZ 1 <NA>
1074 26197 4927 2019-05-22 15:57:27 50481 MNT False 05/26 JCB 15 digit 41.00510 -73.78458 Scarsdale US 1 <NA>

1200 rows × 14 columns

If we call AutoMLSearch.search() on this data, the search will fail due to the columns and issues we've added above. Note: we use a try/except here to catch the ValueError that AutoMLSearch raises.

[5]:
automl = AutoMLSearch(X_train=X_train, y_train=y_train, problem_type="binary")
try:
    automl.search()
except ValueError as e:
    # to make the error message more distinct
    print("=" * 80, "\n")
    print("Search errored out! Message received is: {}".format(e))
    print("=" * 80, "\n")
================================================================================

Search errored out! Message received is: Input y contains NaN.
================================================================================

We can use the search_iterative() function provided by EvalML to determine what potential health issues our data has. This search_iterative function is a public method available through evalml.automl and is distinct from the search function of the AutoMLSearch class in EvalML. search_iterative() runs the default data checks on the data and, if there are no errors, automatically runs AutoMLSearch.search().

[6]:
from evalml.automl import search_iterative

automl, messages = search_iterative(X_train, y_train, problem_type="binary")
automl, messages
[6]:
(None,
 [{'message': '1 out of 1200 rows are 95.0% or more null',
   'data_check_name': 'NullDataCheck',
   'level': 'warning',
   'details': {'columns': None,
    'rows': [1477],
    'pct_null_cols': id
    1477    1.0
    dtype: float64},
   'code': 'HIGHLY_NULL_ROWS',
   'action_options': [{'code': 'DROP_ROWS',
     'data_check_name': 'NullDataCheck',
     'metadata': {'columns': None, 'rows': [1477]},
     'parameters': {}}]},
  {'message': "Column(s) 'mostly_nulls' are 95.0% or more null",
   'data_check_name': 'NullDataCheck',
   'level': 'warning',
   'details': {'columns': ['mostly_nulls'],
    'rows': None,
    'pct_null_rows': {'mostly_nulls': 0.9966666666666667}},
   'code': 'HIGHLY_NULL_COLS',
   'action_options': [{'code': 'DROP_COL',
     'data_check_name': 'NullDataCheck',
     'metadata': {'columns': ['mostly_nulls'], 'rows': None},
     'parameters': {}}]},
  {'message': '1 row(s) (0.08333333333333334%) of target values are null',
   'data_check_name': 'InvalidTargetDataCheck',
   'level': 'error',
   'details': {'columns': None,
    'rows': [990],
    'num_null_rows': 1,
    'pct_null_rows': 0.08333333333333334},
   'code': 'TARGET_HAS_NULL',
   'action_options': [{'code': 'DROP_ROWS',
     'data_check_name': 'InvalidTargetDataCheck',
     'metadata': {'columns': None, 'rows': [990], 'is_target': True},
     'parameters': {}}]},
  {'message': "'no_variance' has 1 unique value.",
   'data_check_name': 'NoVarianceDataCheck',
   'level': 'warning',
   'details': {'columns': ['no_variance'], 'rows': None},
   'code': 'NO_VARIANCE',
   'action_options': [{'code': 'DROP_COL',
     'data_check_name': 'NoVarianceDataCheck',
     'metadata': {'columns': ['no_variance'], 'rows': None},
     'parameters': {}}]}])

The return value of the search_iterative function above is a tuple. The first element is the AutoMLSearch object if it ran (and None otherwise), and the second element is a list of dictionaries describing the potential warnings and errors that the default data checks found in the X and y data passed in. In this list, warnings are suggestions from the data checks that are useful to address in order to improve the search results, but they will not break AutoMLSearch. Errors, on the other hand, indicate issues that will break AutoMLSearch and need to be addressed by the user.
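To make this split concrete, here is a small sketch that partitions the messages list returned above by level:

# Separate the data check messages by severity.
warnings_found = [m for m in messages if m["level"] == "warning"]
errors_found = [m for m in messages if m["level"] == "error"]
print("{} warning(s), {} error(s)".format(len(warnings_found), len(errors_found)))
# With the data above, this prints: 3 warning(s), 1 error(s)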

Above, we can see that there were errors, so the search did not run automatically.

Addressing warnings and errors#

We can automatically address the warnings and errors returned by search_iterative by using make_pipeline_from_data_check_output, a utility method that creates a pipeline that will automatically clean up our data. We just need to pass this method the messages generated from running DataCheck.validate() along with our problem type.

[7]:
from evalml.pipelines.utils import make_pipeline_from_data_check_output

actions_pipeline = make_pipeline_from_data_check_output("binary", messages)
actions_pipeline.fit(X_train, y_train)
X_train_cleaned, y_train_cleaned = actions_pipeline.transform(X_train, y_train)
print(
    "The new length of X_train is {} and y_train is {}".format(
        len(X_train_cleaned), len(y_train_cleaned)
    )
)
The new length of X_train is 1198 and y_train is 1198

Now, we can run search_iterative to completion.

[8]:
results_cleaned = search_iterative(
    X_train_cleaned, y_train_cleaned, problem_type="binary"
)

Note that this time we get an AutoMLSearch object returned as the first element of the tuple. We can use and inspect this AutoMLSearch object as needed.

[9]:
automl_object = results_cleaned[0]
automl_object.rankings
[9]:
   id                                      pipeline_name  search_order  ranking_score  mean_cv_score  standard_deviation_cv_score  percent_better_than_baseline  high_variance_cv                                         parameters
0   1  Random Forest Classifier w/ Label Encoder + Da...             1       0.240358       0.240358                     0.010962                     95.037942             False  {'Label Encoder': {'positive_label': None}, 'D...
1   0       Mode Baseline Binary Classification Pipeline             0       4.843912       4.843912                     0.049015                      0.000000             False  {'Label Encoder': {'positive_label': None}, 'B...

If we inspect the second element of the tuple, we can see that no warnings or errors are detected anymore!

[10]:
data_check_results = results_cleaned[1]
data_check_results
[10]:
[]

Only addressing data check errors#

Previously, we used make_pipeline_from_data_check_output to address all of the warnings and errors returned by search_iterative. We will now show how we can manually address the errors to allow AutoMLSearch to run, and how ignoring the warnings may come at the expense of performance.

We can first print out the errors to make them easier to read, and then we will create new features and targets from the original training data.

[11]:
errors = [message for message in messages if message["level"] == "error"]
errors
[11]:
[{'message': '1 row(s) (0.08333333333333334%) of target values are null',
  'data_check_name': 'InvalidTargetDataCheck',
  'level': 'error',
  'details': {'columns': None,
   'rows': [990],
   'num_null_rows': 1,
   'pct_null_rows': 0.08333333333333334},
  'code': 'TARGET_HAS_NULL',
  'action_options': [{'code': 'DROP_ROWS',
    'data_check_name': 'InvalidTargetDataCheck',
    'metadata': {'columns': None, 'rows': [990], 'is_target': True},
    'parameters': {}}]}]
[12]:
# copy the DataTables to new variables
X_train_no_errors = X_train.copy()
y_train_no_errors = y_train.copy()

# We address the errors by looking at the resulting dictionary errors listed

# let's address the `TARGET_HAS_NULL` error
y_train_no_errors.fillna(False, inplace=True)

# let's reinitialize the Woodwork DataTable
X_train_no_errors.ww.init()
X_train_no_errors.head()
[12]:
card_id store_id datetime amount currency customer_present expiration_date provider lat lng region country no_variance mostly_nulls
id
872 15492 2868 2019-08-03 02:50:04 80719 HNL True 08/27 American Express 5.47090 100.24529 Batu Feringgi MY 1 <NA>
1477 <NA> <NA> NaT <NA> NaN <NA> NaN NaN NaN NaN NaN NaN <NA> <NA>
158 22440 6813 2019-07-12 11:07:25 1849 SEK True 09/20 American Express 26.26490 81.54855 Jais IN 1 <NA>
808 8096 8096 2019-06-11 21:33:36 41358 MOP True 04/29 VISA 13 digit 59.37722 28.19028 Narva EE 1 <NA>
336 33270 1529 2019-03-23 21:44:00 32594 CUC False 04/22 Mastercard 51.39323 0.47713 Strood GB 1 <NA>

Now, we can run search on X_train_no_errors and y_train_no_errors. Note that the search here doesn't fail since we addressed the errors, but warnings still remain in the returned tuple. This search allows the mostly_nulls column to stay in the features during the search.

[13]:
results_no_errors = search_iterative(
    X_train_no_errors, y_train_no_errors, problem_type="binary"
)
results_no_errors
[13]:
(<evalml.automl.automl_search.AutoMLSearch at 0x7fa1282e54f0>,
 [{'message': '1 out of 1200 rows are 95.0% or more null',
   'data_check_name': 'NullDataCheck',
   'level': 'warning',
   'details': {'columns': None,
    'rows': [1477],
    'pct_null_cols': id
    1477    1.0
    dtype: float64},
   'code': 'HIGHLY_NULL_ROWS',
   'action_options': [{'code': 'DROP_ROWS',
     'data_check_name': 'NullDataCheck',
     'metadata': {'columns': None, 'rows': [1477]},
     'parameters': {}}]},
  {'message': "Column(s) 'mostly_nulls' are 95.0% or more null",
   'data_check_name': 'NullDataCheck',
   'level': 'warning',
   'details': {'columns': ['mostly_nulls'],
    'rows': None,
    'pct_null_rows': {'mostly_nulls': 0.9966666666666667}},
   'code': 'HIGHLY_NULL_COLS',
   'action_options': [{'code': 'DROP_COL',
     'data_check_name': 'NullDataCheck',
     'metadata': {'columns': ['mostly_nulls'], 'rows': None},
     'parameters': {}}]},
  {'message': "'no_variance' has 1 unique value.",
   'data_check_name': 'NoVarianceDataCheck',
   'level': 'warning',
   'details': {'columns': ['no_variance'], 'rows': None},
   'code': 'NO_VARIANCE',
   'action_options': [{'code': 'DROP_COL',
     'data_check_name': 'NoVarianceDataCheck',
     'metadata': {'columns': ['no_variance'], 'rows': None},
     'parameters': {}}]}])
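To gauge the cost of leaving the warnings unaddressed, here is a hedged sketch comparing the two searches above (the column name follows the rankings table shown earlier; exact scores will vary from run to run):

# Compare the best mean cross-validation scores of the two searches.
# Lower is better for the default binary objective (log loss).
best_cleaned = results_cleaned[0].rankings["mean_cv_score"].iloc[0]
best_no_errors = results_no_errors[0].rankings["mean_cv_score"].iloc[0]
print("Warnings addressed:", best_cleaned)
print("Warnings ignored:  ", best_no_errors)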