Understanding Data Check Actions#

EvalML streamlines the creation and implementation of machine learning models for tabular data. One of the many features it offers is data checks, which help determine the health of our data before we train a model on it. These data checks have actions associated with them, which will be shown in this notebook. In our default data checks, we have the following checks:

  • NullDataCheck: checks whether the rows or columns are null or highly null

  • IDColumnsDataCheck: checks for columns that could be ID columns

  • TargetLeakageDataCheck: checks if any of the input features have a high association with the target

  • InvalidTargetDataCheck: checks if there are null or other invalid values in the target

  • NoVarianceDataCheck: checks if either the target or any of the features have no variance

EvalML has additional data checks, which can be seen here, with usage examples here. Below, we will walk through the usage of EvalML's default data checks and actions.
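For reference, the checks listed above can also be run directly, outside of AutoML, through the DefaultDataChecks collection. The following is a minimal sketch of that usage (it assumes a feature matrix X and target y are already loaded; the rest of this notebook instead uses search_iterative, which runs these checks for us):

from evalml.data_checks import DefaultDataChecks

# Collect the default checks for a binary problem and validate X and y.
data_checks = DefaultDataChecks(problem_type="binary", objective="log loss binary")
messages = data_checks.validate(X, y)  # a list of warning/error dictionaries
for message in messages:
    print(message["level"], "-", message["message"])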

First, we import the necessary dependencies to demonstrate these checks.

[1]:
import woodwork as ww
import pandas as pd
from evalml import AutoMLSearch
from evalml.demos import load_fraud
from evalml.preprocessing import split_data

Let's take a look at the input feature data. EvalML uses the Woodwork library to represent this data. The demo data that EvalML returns is a Woodwork DataTable and DataColumn.

[2]:
X, y = load_fraud(n_rows=1500)
X.head()
             Number of Features
Boolean                       1
Categorical                   6
Numeric                       5

Number of training examples: 1500
Targets
False    86.60%
True     13.40%
Name: count, dtype: object
[2]:
card_id store_id datetime amount currency customer_present expiration_date provider lat lng region country
id
0 32261 8516 2019-01-01 00:12:26 24900 CUC True 08/24 Mastercard 38.58894 -89.99038 Fairview Heights US
1 16434 8516 2019-01-01 09:42:03 15789 MYR False 11/21 Discover 38.58894 -89.99038 Fairview Heights US
2 23468 8516 2019-04-17 08:17:01 1883 AUD False 09/27 Discover 38.58894 -89.99038 Fairview Heights US
3 14364 8516 2019-01-30 11:54:30 82120 KRW True 09/20 JCB 16 digit 38.58894 -89.99038 Fairview Heights US
4 29407 8516 2019-05-01 17:59:36 25745 MUR True 09/22 American Express 38.58894 -89.99038 Fairview Heights US

Adding noise and unclean data#

This data is already clean and compatible with EvalML's AutoMLSearch. In order to demonstrate the EvalML default data checks, we will add the following:

  • A column of mostly null values (<0.5% non-null)

  • A column with low/no variance

  • A row of null values

  • A missing target value

We will add the first two columns to the whole dataset, and the last two items to the training data only. Note: these represent only some of the scenarios that the EvalML default data checks can catch.

[3]:
# add a column with no variance in the data
X["no_variance"] = [1 for _ in range(X.shape[0])]

# add a column with >99.5% null values
X["mostly_nulls"] = [None] * (X.shape[0] - 5) + [i for i in range(5)]

# since we changed the data, let's reinitialize the woodwork datatable
X.ww.init()
# let's split some training and validation data
X_train, X_valid, y_train, y_valid = split_data(X, y, problem_type="binary")
[4]:
# make row 1 all nan values
X_train.iloc[1] = [None] * X_train.shape[1]

# make one of the target values null
y_train[990] = None

X_train.ww.init()
y_train = ww.init_series(y_train, logical_type="Categorical")
# Let's take another look at the new X_train data
X_train
[4]:
card_id store_id datetime amount currency customer_present expiration_date provider lat lng region country no_variance mostly_nulls
id
872 15492 2868 2019-08-03 02:50:04 80719 HNL True 08/27 American Express 5.47090 100.24529 Batu Feringgi MY 1 <NA>
1477 <NA> <NA> NaT <NA> NaN <NA> NaN NaN NaN NaN NaN NaN <NA> <NA>
158 22440 6813 2019-07-12 11:07:25 1849 SEK True 09/20 American Express 26.26490 81.54855 Jais IN 1 <NA>
808 8096 8096 2019-06-11 21:33:36 41358 MOP True 04/29 VISA 13 digit 59.37722 28.19028 Narva EE 1 <NA>
336 33270 1529 2019-03-23 21:44:00 32594 CUC False 04/22 Mastercard 51.39323 0.47713 Strood GB 1 <NA>
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
339 8484 5358 2019-01-10 07:47:28 89503 GMD False 11/24 Maestro 47.30997 8.52462 Adliswil CH 1 <NA>
1383 17565 3929 2019-01-15 01:11:02 14264 DKK True 06/20 VISA 13 digit 50.72043 11.34046 Rudolstadt DE 1 <NA>
893 108 44 2019-05-17 00:53:39 93218 SLL True 12/24 JCB 16 digit 15.72892 120.57224 Burgos PH 1 <NA>
385 29983 152 2019-06-09 06:50:29 41105 RWF False 07/20 JCB 16 digit -6.80000 39.25000 Magomeni TZ 1 <NA>
1074 26197 4927 2019-05-22 15:57:27 50481 MNT False 05/26 JCB 15 digit 41.00510 -73.78458 Scarsdale US 1 <NA>

1200 rows × 14 columns

If we call AutoMLSearch.search() on this data, the search will fail due to the columns and issues we've added above. Note: we use a try/except here to catch the ValueError that AutoMLSearch raises.

[5]:
automl = AutoMLSearch(X_train=X_train, y_train=y_train, problem_type="binary")
try:
    automl.search()
except ValueError as e:
    # to make the error message more distinct
    print("=" * 80, "\n")
    print("Search errored out! Message received is: {}".format(e))
    print("=" * 80, "\n")
================================================================================

Search errored out! Message received is: Input y contains NaN.
================================================================================

We can use the search_iterative() function provided by EvalML to determine what potential health issues our data has. This search_iterative function is a public method available through evalml.automl and is distinct from the search function of the AutoMLSearch class in EvalML. search_iterative() runs the default data checks on the data and, if there are no errors, automatically runs AutoMLSearch.search().

[6]:
from evalml.automl import search_iterative

automl, messages = search_iterative(X_train, y_train, problem_type="binary")
automl, messages
[6]:
(None,
 [{'message': '1 out of 1200 rows are 95.0% or more null',
   'data_check_name': 'NullDataCheck',
   'level': 'warning',
   'details': {'columns': None,
    'rows': [1477],
    'pct_null_cols': id
    1477    1.0
    dtype: float64},
   'code': 'HIGHLY_NULL_ROWS',
   'action_options': [{'code': 'DROP_ROWS',
     'data_check_name': 'NullDataCheck',
     'metadata': {'columns': None, 'rows': [1477]},
     'parameters': {}}]},
  {'message': "Column(s) 'mostly_nulls' are 95.0% or more null",
   'data_check_name': 'NullDataCheck',
   'level': 'warning',
   'details': {'columns': ['mostly_nulls'],
    'rows': None,
    'pct_null_rows': {'mostly_nulls': 0.9966666666666667}},
   'code': 'HIGHLY_NULL_COLS',
   'action_options': [{'code': 'DROP_COL',
     'data_check_name': 'NullDataCheck',
     'metadata': {'columns': ['mostly_nulls'], 'rows': None},
     'parameters': {}}]},
  {'message': '1 row(s) (0.08333333333333334%) of target values are null',
   'data_check_name': 'InvalidTargetDataCheck',
   'level': 'error',
   'details': {'columns': None,
    'rows': [990],
    'num_null_rows': 1,
    'pct_null_rows': 0.08333333333333334},
   'code': 'TARGET_HAS_NULL',
   'action_options': [{'code': 'DROP_ROWS',
     'data_check_name': 'InvalidTargetDataCheck',
     'metadata': {'columns': None, 'rows': [990], 'is_target': True},
     'parameters': {}}]},
  {'message': "'no_variance' has 1 unique value.",
   'data_check_name': 'NoVarianceDataCheck',
   'level': 'warning',
   'details': {'columns': ['no_variance'], 'rows': None},
   'code': 'NO_VARIANCE',
   'action_options': [{'code': 'DROP_COL',
     'data_check_name': 'NoVarianceDataCheck',
     'metadata': {'columns': ['no_variance'], 'rows': None},
     'parameters': {}}]}])

The return value of the search_iterative function above is a tuple. The first element is the AutoMLSearch object if it ran (and None otherwise), and the second element is a list of dictionaries describing the potential warnings and errors that the default data checks found in the X and y data passed in. In this list, warnings are suggestions from the data checks that are useful to address in order to improve the search results, but they will not break AutoMLSearch. Errors, on the other hand, indicate issues that will break AutoMLSearch and need to be addressed by the user.
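To make this split concrete, here is a small sketch that partitions the messages list returned above by level:

# Separate the data check messages by severity.
warnings_found = [m for m in messages if m["level"] == "warning"]
errors_found = [m for m in messages if m["level"] == "error"]
print("{} warning(s), {} error(s)".format(len(warnings_found), len(errors_found)))
# With the data above, this prints: 3 warning(s), 1 error(s)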

Above, we can see that there were errors, so the search did not run automatically.

Addressing warnings and errors#

We can automatically address the warnings and errors returned by search_iterative by using make_pipeline_from_data_check_output, a utility method that creates a pipeline that will automatically clean up our data. We just need to pass this method the messages generated from running DataCheck.validate() along with our problem type.

[7]:
from evalml.pipelines.utils import make_pipeline_from_data_check_output

actions_pipeline = make_pipeline_from_data_check_output("binary", messages)
actions_pipeline.fit(X_train, y_train)
X_train_cleaned, y_train_cleaned = actions_pipeline.transform(X_train, y_train)
print(
    "The new length of X_train is {} and y_train is {}".format(
        len(X_train_cleaned), len(y_train_cleaned)
    )
)
The new length of X_train is 1198 and y_train is 1198

Now, we can run search_iterative to completion.

[8]:
results_cleaned = search_iterative(
    X_train_cleaned, y_train_cleaned, problem_type="binary"
)

Note that this time we get an AutoMLSearch object returned as the first element of the tuple. We can use and inspect this AutoMLSearch object as needed.

[9]:
automl_object = results_cleaned[0]
automl_object.rankings
[9]:
   id                                      pipeline_name  search_order  ranking_score  mean_cv_score  standard_deviation_cv_score  percent_better_than_baseline  high_variance_cv                                         parameters
0   1  Random Forest Classifier w/ Label Encoder + Da...             1       0.240358       0.240358                     0.010962                     95.037942             False  {'Label Encoder': {'positive_label': None}, 'D...
1   0       Mode Baseline Binary Classification Pipeline             0       4.843912       4.843912                     0.049015                      0.000000             False  {'Label Encoder': {'positive_label': None}, 'B...

If we inspect the second element of the tuple, we can see that no warnings or errors are detected anymore!

[10]:
data_check_results = results_cleaned[1]
data_check_results
[10]:
[]

Only addressing data check errors#

Previously, we used make_pipeline_from_data_check_output to address all of the warnings and errors returned by search_iterative. We will now show how we can manually address the errors to allow AutoMLSearch to run, and how ignoring the warnings may come at the expense of performance.

We can first print out the errors to make them easier to read, and then we will create new features and targets from the original training data.

[11]:
errors = [message for message in messages if message["level"] == "error"]
errors
[11]:
[{'message': '1 row(s) (0.08333333333333334%) of target values are null',
  'data_check_name': 'InvalidTargetDataCheck',
  'level': 'error',
  'details': {'columns': None,
   'rows': [990],
   'num_null_rows': 1,
   'pct_null_rows': 0.08333333333333334},
  'code': 'TARGET_HAS_NULL',
  'action_options': [{'code': 'DROP_ROWS',
    'data_check_name': 'InvalidTargetDataCheck',
    'metadata': {'columns': None, 'rows': [990], 'is_target': True},
    'parameters': {}}]}]
[12]:
# copy the DataTables to new variables
X_train_no_errors = X_train.copy()
y_train_no_errors = y_train.copy()

# We address the errors by looking at the resulting dictionary errors listed

# let's address the `TARGET_HAS_NULL` error
y_train_no_errors.fillna(False, inplace=True)

# let's reinitialize the Woodwork DataTable
X_train_no_errors.ww.init()
X_train_no_errors.head()
[12]:
card_id store_id datetime amount currency customer_present expiration_date provider lat lng region country no_variance mostly_nulls
id
872 15492 2868 2019-08-03 02:50:04 80719 HNL True 08/27 American Express 5.47090 100.24529 Batu Feringgi MY 1 <NA>
1477 <NA> <NA> NaT <NA> NaN <NA> NaN NaN NaN NaN NaN NaN <NA> <NA>
158 22440 6813 2019-07-12 11:07:25 1849 SEK True 09/20 American Express 26.26490 81.54855 Jais IN 1 <NA>
808 8096 8096 2019-06-11 21:33:36 41358 MOP True 04/29 VISA 13 digit 59.37722 28.19028 Narva EE 1 <NA>
336 33270 1529 2019-03-23 21:44:00 32594 CUC False 04/22 Mastercard 51.39323 0.47713 Strood GB 1 <NA>

Now, we can run search on X_train_no_errors and y_train_no_errors. Note that the search here doesn't fail since we addressed the errors, but warnings still remain in the returned tuple. This search allows the mostly_nulls column to stay in the features during the search.

[13]:
results_no_errors = search_iterative(
    X_train_no_errors, y_train_no_errors, problem_type="binary"
)
results_no_errors
[13]:
(<evalml.automl.automl_search.AutoMLSearch at 0x7fa1282e54f0>,
 [{'message': '1 out of 1200 rows are 95.0% or more null',
   'data_check_name': 'NullDataCheck',
   'level': 'warning',
   'details': {'columns': None,
    'rows': [1477],
    'pct_null_cols': id
    1477    1.0
    dtype: float64},
   'code': 'HIGHLY_NULL_ROWS',
   'action_options': [{'code': 'DROP_ROWS',
     'data_check_name': 'NullDataCheck',
     'metadata': {'columns': None, 'rows': [1477]},
     'parameters': {}}]},
  {'message': "Column(s) 'mostly_nulls' are 95.0% or more null",
   'data_check_name': 'NullDataCheck',
   'level': 'warning',
   'details': {'columns': ['mostly_nulls'],
    'rows': None,
    'pct_null_rows': {'mostly_nulls': 0.9966666666666667}},
   'code': 'HIGHLY_NULL_COLS',
   'action_options': [{'code': 'DROP_COL',
     'data_check_name': 'NullDataCheck',
     'metadata': {'columns': ['mostly_nulls'], 'rows': None},
     'parameters': {}}]},
  {'message': "'no_variance' has 1 unique value.",
   'data_check_name': 'NoVarianceDataCheck',
   'level': 'warning',
   'details': {'columns': ['no_variance'], 'rows': None},
   'code': 'NO_VARIANCE',
   'action_options': [{'code': 'DROP_COL',
     'data_check_name': 'NoVarianceDataCheck',
     'metadata': {'columns': ['no_variance'], 'rows': None},
     'parameters': {}}]}])
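To gauge the cost of leaving the warnings unaddressed, here is a hedged sketch comparing the two searches above (the column name follows the rankings table shown earlier; exact scores will vary from run to run):

# Compare the best mean cross-validation scores of the two searches.
# Lower is better for the default binary objective (log loss).
best_cleaned = results_cleaned[0].rankings["mean_cv_score"].iloc[0]
best_no_errors = results_no_errors[0].rankings["mean_cv_score"].iloc[0]
print("Warnings addressed:", best_cleaned)
print("Warnings ignored:  ", best_no_errors)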