datetime_format_data_check#
此数据检查用于检查 datetime 列是否具有等间隔,并且单调递增或递减,以便支持时间序列估计器。
模块内容#
类摘要#
检查 datetime 列是否具有等间隔,并且单调递增或递减,以便支持时间序列估计器。 |
内容#
- class evalml.data_checks.datetime_format_data_check.DateTimeFormatDataCheck(datetime_column='index', nan_duplicate_threshold=0.75, series_id=None)[源]#
检查 datetime 列是否具有等间隔,并且单调递增或递减,以便支持时间序列估计器。
如果用于多序列问题,则专门处理堆叠数据集。
- 参数
datetime_column (str, int) – datetime 列的名称。如果 datetime 值在索引中,则传入“index”。
nan_duplicate_threshold (float) – 在返回 DATETIME_NO_FREQUENCY_INFERRED 而非 DATETIME_HAS_UNEVEN_INTERVALS 之前,datetime_column 中必须非重复或非 nan 的值所占的百分比。例如,如果设置为 0.80,则 datetime_column 中只有 20% 的值可以是重复的或 nan。默认为 0.75。
series_id (str) – 多序列的 series_id 列的名称。默认为 None。
方法
- name(cls)#
返回描述数据检查的名称。
- validate(self, X, y)[源]#
检查目标数据是否具有等间隔并且单调递增。
如果数据不是 datetime 类型、不是递增的、包含冗余或缺失行、包含无效值 (NaN 或 None),或者包含与假定频率不一致的值,将返回 DataCheckError(s)。
如果用于多序列问题,则专门处理堆叠数据集。
- 参数
X (pd.DataFrame, np.ndarray) – 特征。
y (pd.Series, np.ndarray) – 目标数据。
- 返回值
如果在 datetime 列中发现不等间隔,则返回 DataCheckErrors 列表。
- 返回类型
dict (DataCheckError)
示例
>>> import pandas as pd
列“dates”包含两组日期:一组是每日频率,一组是每小时频率,一组是每月频率。
>>> X = pd.DataFrame(pd.date_range("2015-01-01", periods=2).append(pd.date_range("2015-01-08", periods=2, freq="H").append(pd.date_range("2016-03-02", periods=2, freq="M"))), columns=["dates"]) >>> y = pd.Series([0, 1, 0, 1, 1, 0]) >>> datetime_format_dc = DateTimeFormatDataCheck(datetime_column="dates") >>> assert datetime_format_dc.validate(X, y) == [ ... { ... "message": "No frequency could be detected in column 'dates', possibly due to uneven intervals or too many duplicate/missing values.", ... "data_check_name": "DateTimeFormatDataCheck", ... "level": "error", ... "code": "DATETIME_NO_FREQUENCY_INFERRED", ... "details": {"columns": None, "rows": None}, ... "action_options": [] ... } ... ]
列“dates”在值中存在间隙,这意味着缺少许多日期。
>>> X = pd.DataFrame(pd.date_range("2021-01-01", periods=9).append(pd.date_range("2021-01-31", periods=50)), columns=["dates"]) >>> y = pd.Series([0, 1, 0, 1, 1, 0, 0, 0, 1, 0]) >>> ww_payload = infer_frequency(X["dates"], debug=True, window_length=5, threshold=0.8) >>> datetime_format_dc = DateTimeFormatDataCheck(datetime_column="dates") >>> assert datetime_format_dc.validate(X, y) == [ ... { ... "message": "Column 'dates' has datetime values missing between start and end date.", ... "data_check_name": "DateTimeFormatDataCheck", ... "level": "error", ... "code": "DATETIME_IS_MISSING_VALUES", ... "details": {"columns": None, "rows": None}, ... "action_options": [] ... }, ... { ... "message": "A frequency was detected in column 'dates', but there are faulty datetime values that need to be addressed.", ... "data_check_name": "DateTimeFormatDataCheck", ... "level": "error", ... "code": "DATETIME_HAS_UNEVEN_INTERVALS", ... "details": {'columns': None, 'rows': None}, ... "action_options": [ ... { ... 'code': 'REGULARIZE_AND_IMPUTE_DATASET', ... 'data_check_name': 'DateTimeFormatDataCheck', ... 'metadata': { ... 'columns': None, ... 'is_target': True, ... 'rows': None ... }, ... 'parameters': { ... 'time_index': { ... 'default_value': 'dates', ... 'parameter_type': 'global', ... 'type': 'str' ... }, ... 'frequency_payload': { ... 'default_value': ww_payload, ... 'parameter_type': 'global', ... 'type': 'tuple' ... } ... } ... } ... ] ... } ... ]
列“dates”末尾附加了一个重复的日期 2021-01-09,这被认为是冗余的,将引发错误。
>>> X = pd.DataFrame(pd.date_range("2021-01-01", periods=9).append(pd.date_range("2021-01-09", periods=1)), columns=["dates"]) >>> y = pd.Series([0, 1, 0, 1, 1, 0, 0, 0, 1, 0]) >>> ww_payload = infer_frequency(X["dates"], debug=True, window_length=5, threshold=0.8) >>> datetime_format_dc = DateTimeFormatDataCheck(datetime_column="dates") >>> assert datetime_format_dc.validate(X, y) == [ ... { ... "message": "Column 'dates' has more than one row with the same datetime value.", ... "data_check_name": "DateTimeFormatDataCheck", ... "level": "error", ... "code": "DATETIME_HAS_REDUNDANT_ROW", ... "details": {"columns": None, "rows": None}, ... "action_options": [] ... }, ... { ... "message": "A frequency was detected in column 'dates', but there are faulty datetime values that need to be addressed.", ... "data_check_name": "DateTimeFormatDataCheck", ... "level": "error", ... "code": "DATETIME_HAS_UNEVEN_INTERVALS", ... "details": {'columns': None, 'rows': None}, ... "action_options": [ ... { ... 'code': 'REGULARIZE_AND_IMPUTE_DATASET', ... 'data_check_name': 'DateTimeFormatDataCheck', ... 'metadata': { ... 'columns': None, ... 'is_target': True, ... 'rows': None ... }, ... 'parameters': { ... 'time_index': { ... 'default_value': 'dates', ... 'parameter_type': 'global', ... 'type': 'str' ... }, ... 'frequency_payload': { ... 'default_value': ww_payload, ... 'parameter_type': 'global', ... 'type': 'tuple' ... } ... } ... } ... ] ... } ... ]
列“Weeks”包含一个不遵循每周模式的日期,这被认为是不对齐的。
>>> X = pd.DataFrame(pd.date_range("2021-01-01", freq="W", periods=12).append(pd.date_range("2021-03-22", periods=1)), columns=["Weeks"]) >>> ww_payload = infer_frequency(X["Weeks"], debug=True, window_length=5, threshold=0.8) >>> datetime_format_dc = DateTimeFormatDataCheck(datetime_column="Weeks") >>> assert datetime_format_dc.validate(X, y) == [ ... { ... "message": "Column 'Weeks' has datetime values that do not align with the inferred frequency.", ... "data_check_name": "DateTimeFormatDataCheck", ... "level": "error", ... "details": {"columns": None, "rows": None}, ... "code": "DATETIME_HAS_MISALIGNED_VALUES", ... "action_options": [] ... }, ... { ... "message": "A frequency was detected in column 'Weeks', but there are faulty datetime values that need to be addressed.", ... "data_check_name": "DateTimeFormatDataCheck", ... "level": "error", ... "code": "DATETIME_HAS_UNEVEN_INTERVALS", ... "details": {'columns': None, 'rows': None}, ... "action_options": [ ... { ... 'code': 'REGULARIZE_AND_IMPUTE_DATASET', ... 'data_check_name': 'DateTimeFormatDataCheck', ... 'metadata': { ... 'columns': None, ... 'is_target': True, ... 'rows': None ... }, ... 'parameters': { ... 'time_index': { ... 'default_value': 'Weeks', ... 'parameter_type': 'global', ... 'type': 'str' ... }, ... 'frequency_payload': { ... 'default_value': ww_payload, ... 'parameter_type': 'global', ... 'type': 'tuple' ... } ... } ... } ... ] ... } ... ]
列“Weeks”传入了整数而不是 datetime 数据,这将引发错误。
>>> X = pd.DataFrame([1, 2, 3, 4], columns=["Weeks"]) >>> y = pd.Series([0] * 4) >>> datetime_format_dc = DateTimeFormatDataCheck(datetime_column="Weeks") >>> assert datetime_format_dc.validate(X, y) == [ ... { ... "message": "Datetime information could not be found in the data, or was not in a supported datetime format.", ... "data_check_name": "DateTimeFormatDataCheck", ... "level": "error", ... "details": {"columns": None, "rows": None}, ... "code": "DATETIME_INFORMATION_NOT_FOUND", ... "action_options": [] ... } ... ]
然而,将相同的整数数据转换为 datetime 是有效的。
>>> X = pd.DataFrame(pd.to_datetime([1, 2, 3, 4]), columns=["Weeks"]) >>> datetime_format_dc = DateTimeFormatDataCheck(datetime_column="Weeks") >>> assert datetime_format_dc.validate(X, y) == []
>>> X = pd.DataFrame(pd.date_range("2021-01-01", freq="W", periods=10), columns=["Weeks"]) >>> datetime_format_dc = DateTimeFormatDataCheck(datetime_column="Weeks") >>> assert datetime_format_dc.validate(X, y) == []
虽然传入的数据是 datetime 类型,但时间序列要求 datetime_column 中的 datetime 信息单调递增(升序)。
>>> X = X.iloc[::-1] >>> datetime_format_dc = DateTimeFormatDataCheck(datetime_column="Weeks") >>> assert datetime_format_dc.validate(X, y) == [ ... { ... "message": "Datetime values must be sorted in ascending order.", ... "data_check_name": "DateTimeFormatDataCheck", ... "level": "error", ... "details": {"columns": None, "rows": None}, ... "code": "DATETIME_IS_NOT_MONOTONIC", ... "action_options": [] ... } ... ]
列“index”中的第一个值被 NaT 替换,这将在此数据检查中引发错误。
>>> dates = [["2-1-21", "3-1-21"], ... ["2-2-21", "3-2-21"], ... ["2-3-21", "3-3-21"], ... ["2-4-21", "3-4-21"], ... ["2-5-21", "3-5-21"], ... ["2-6-21", "3-6-21"], ... ["2-7-21", "3-7-21"], ... ["2-8-21", "3-8-21"], ... ["2-9-21", "3-9-21"], ... ["2-10-21", "3-10-21"], ... ["2-11-21", "3-11-21"], ... ["2-12-21", "3-12-21"]] >>> dates[0][0] = None >>> df = pd.DataFrame(dates, columns=["days", "days2"]) >>> ww_payload = infer_frequency(pd.to_datetime(df["days"]), debug=True, window_length=5, threshold=0.8) >>> datetime_format_dc = DateTimeFormatDataCheck(datetime_column="days") >>> assert datetime_format_dc.validate(df, y) == [ ... { ... "message": "Input datetime column 'days' contains NaN values. Please impute NaN values or drop these rows.", ... "data_check_name": "DateTimeFormatDataCheck", ... "level": "error", ... "details": {"columns": None, "rows": None}, ... "code": "DATETIME_HAS_NAN", ... "action_options": [] ... }, ... { ... "message": "A frequency was detected in column 'days', but there are faulty datetime values that need to be addressed.", ... "data_check_name": "DateTimeFormatDataCheck", ... "level": "error", ... "code": "DATETIME_HAS_UNEVEN_INTERVALS", ... "details": {'columns': None, 'rows': None}, ... "action_options": [ ... { ... 'code': 'REGULARIZE_AND_IMPUTE_DATASET', ... 'data_check_name': 'DateTimeFormatDataCheck', ... 'metadata': { ... 'columns': None, ... 'is_target': True, ... 'rows': None ... }, ... 'parameters': { ... 'time_index': { ... 'default_value': 'days', ... 'parameter_type': 'global', ... 'type': 'str' ... }, ... 'frequency_payload': { ... 'default_value': ww_payload, ... 'parameter_type': 'global', ... 'type': 'tuple' ... } ... } ... } ... ] ... } ... ]
对于多序列,数据检查将遍历每个序列并对其执行类似于单序列情况的检查 为了表示数据检查正在检查多序列,将 series_id 列的名称传递给数据检查。
>>> X = pd.DataFrame( ... { ... "date": pd.date_range("2021-01-01", periods=15).repeat(2), ... "series_id": pd.Series(list(range(2)) * 15, dtype="str") ... } ... ) >>> X = X.drop([15]) >>> dc = DateTimeFormatDataCheck(datetime_column="date", series_id="series_id") >>> ww_payload_expected_series1 = infer_frequency((X[X["series_id"] == "1"]["date"].reset_index(drop=True)), debug=True, window_length=4, threshold=0.4) >>> xd = dc.validate(X,y) >>> assert dc.validate(X, y) == [ ... { ... "message": "Column 'date' for series '1' has datetime values missing between start and end date.", ... "data_check_name": "DateTimeFormatDataCheck", ... "level": "error", ... "details": {"columns": None, "rows": None}, ... "code": "DATETIME_IS_MISSING_VALUES", ... "action_options": [] ... }, ... { ... "message": "A frequency was detected in column 'date' for series '1', but there are faulty datetime values that need to be addressed.", ... "data_check_name": "DateTimeFormatDataCheck", ... "level": "error", ... "code": "DATETIME_HAS_UNEVEN_INTERVALS", ... "details": {'columns': None, 'rows': None}, ... "action_options": [ ... { ... 'code': 'REGULARIZE_AND_IMPUTE_DATASET', ... 'data_check_name': 'DateTimeFormatDataCheck', ... 'metadata': { ... 'columns': None, ... 'is_target': True, ... 'rows': None ... }, ... 'parameters': { ... 'time_index': { ... 'default_value': 'date', ... 'parameter_type': 'global', ... 'type': 'str' ... }, ... 'frequency_payload': { ... 'default_value': ww_payload_expected_series1, ... 'parameter_type': 'global', ... 'type': 'tuple' ... } ... } ... } ... ] ... } ... ]