datetime_format_data_check#

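Data check that checks whether the datetime column has equal intervals and is monotonically increasing or decreasing, so that it can be supported by time series estimators.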

Module Contents#

Classes Summary#

DateTimeFormatDataCheck

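Checks whether the datetime column has equal intervals and is monotonically increasing or decreasing, so that it can be supported by time series estimators.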

Contents#

class evalml.data_checks.datetime_format_data_check.DateTimeFormatDataCheck(datetime_column='index', nan_duplicate_threshold=0.75, series_id=None)[source]#

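Checks whether the datetime column has equal intervals and is monotonically increasing or decreasing, so that it can be supported by time series estimators.

When used for a multiseries problem, it operates specifically on stacked datasets.

Parameters
  • datetime_column (str, int) – The name of the datetime column. If the datetime values are in the index, pass "index".

  • nan_duplicate_threshold (float) – The percentage of values in the datetime_column that must be neither duplicate nor nan before DATETIME_NO_FREQUENCY_INFERRED is returned instead of DATETIME_HAS_UNEVEN_INTERVALS. For example, if set to 0.80, only 20% of the values in the datetime_column can be duplicate or nan. Defaults to 0.75.

  • series_id (str) – The name of the series_id column for multiseries. Defaults to None.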
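
To make the nan_duplicate_threshold arithmetic concrete, the following minimal sketch computes the share of values that are neither missing nor repeats of an earlier value and compares it with the threshold. It uses only the standard library. `usable_fraction` is a hypothetical helper written for illustration, and both the way duplicates are counted and the non-strict comparison are assumptions; EvalML's internal computation may differ.

```python
from datetime import date

def usable_fraction(values):
    """Return the share of entries that are neither missing (None)
    nor repeats of an earlier entry. Hypothetical helper, not EvalML API."""
    seen = set()
    usable = 0
    for value in values:
        if value is None or value in seen:
            continue
        seen.add(value)
        usable += 1
    return usable / len(values) if values else 0.0

# Ten entries: eight distinct dates, one duplicate, and one missing value.
dates = [date(2021, 1, d) for d in range(1, 9)] + [date(2021, 1, 8), None]
fraction = usable_fraction(dates)  # 8 usable entries out of 10

# Assumed comparison: the column qualifies when the usable share
# reaches the threshold.
meets_default = fraction >= 0.75  # default threshold
meets_strict = fraction >= 0.9    # a stricter, illustrative threshold
```

Under these assumptions the column clears the default threshold of 0.75 but would fall short of a threshold of 0.9.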

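Methods

name

Returns a name describing the data check.

validate

Checks whether the datetime column has equal intervals and is monotonically increasing.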

name(cls)#

Returns a name describing the data check.

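validate(self, X, y)[source]#

Checks whether the datetime column has equal intervals and is monotonically increasing.

Returns DataCheckError(s) if the data is not a datetime type, is not increasing, has redundant or missing rows, contains invalid (NaN or None) values, or has values that do not align with the assumed frequency.

When used for a multiseries problem, it operates specifically on stacked datasets.

Parameters
  • X (pd.DataFrame, np.ndarray) – Features.

  • y (pd.Series, np.ndarray) – Target data.

Returns

List of DataCheckErrors if unequal intervals are found in the datetime column.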

Return type

dict (DataCheckError)
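
Because each result is a plain dictionary with fields such as "code", "level", and "action_options" (as the examples on this page show), callers can summarize a result list with ordinary Python. The sketch below is illustrative only: `error_codes` and `suggested_actions` are hypothetical helpers, and `results` is a hand-written stand-in trimmed to a few fields rather than real output from validate.

```python
def error_codes(results):
    """Collect the 'code' field from each data check result dictionary."""
    return [result["code"] for result in results]

def suggested_actions(results):
    """Collect the codes of every action option attached to the results."""
    return [
        option["code"]
        for result in results
        for option in result["action_options"]
    ]

# Hand-written stand-in for a validate() return value (fields trimmed).
results = [
    {
        "code": "DATETIME_IS_MISSING_VALUES",
        "level": "error",
        "action_options": [],
    },
    {
        "code": "DATETIME_HAS_UNEVEN_INTERVALS",
        "level": "error",
        "action_options": [{"code": "REGULARIZE_AND_IMPUTE_DATASET"}],
    },
]

codes = error_codes(results)
actions = suggested_actions(results)
```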

Examples

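>>> import pandas as pd
>>> from woodwork.statistics_utils import infer_frequency
>>> from evalml.data_checks import DateTimeFormatDataCheck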

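The column "dates" contains three sets of dates: one at a daily frequency, one at an hourly frequency, and one at a monthly frequency. No single frequency can be inferred from this mixture.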

>>> X = pd.DataFrame(pd.date_range("2015-01-01", periods=2).append(pd.date_range("2015-01-08", periods=2, freq="H").append(pd.date_range("2016-03-02", periods=2, freq="M"))), columns=["dates"])
>>> y = pd.Series([0, 1, 0, 1, 1, 0])
>>> datetime_format_dc = DateTimeFormatDataCheck(datetime_column="dates")
>>> assert datetime_format_dc.validate(X, y) == [
...     {
...         "message": "No frequency could be detected in column 'dates', possibly due to uneven intervals or too many duplicate/missing values.",
...         "data_check_name": "DateTimeFormatDataCheck",
...         "level": "error",
...         "code": "DATETIME_NO_FREQUENCY_INFERRED",
...         "details": {"columns": None, "rows": None},
...         "action_options": []
...      }
... ]

The column "dates" has a gap in its values, which means that many dates are missing.

>>> X = pd.DataFrame(pd.date_range("2021-01-01", periods=9).append(pd.date_range("2021-01-31", periods=50)), columns=["dates"])
>>> y = pd.Series([0, 1, 0, 1, 1, 0, 0, 0, 1, 0])
>>> ww_payload = infer_frequency(X["dates"], debug=True, window_length=5, threshold=0.8)
>>> datetime_format_dc = DateTimeFormatDataCheck(datetime_column="dates")
>>> assert datetime_format_dc.validate(X, y) == [
...     {
...         "message": "Column 'dates' has datetime values missing between start and end date.",
...         "data_check_name": "DateTimeFormatDataCheck",
...         "level": "error",
...         "code": "DATETIME_IS_MISSING_VALUES",
...         "details": {"columns": None, "rows": None},
...         "action_options": []
...      },
...     {
...         "message": "A frequency was detected in column 'dates', but there are faulty datetime values that need to be addressed.",
...         "data_check_name": "DateTimeFormatDataCheck",
...         "level": "error",
...         "code": "DATETIME_HAS_UNEVEN_INTERVALS",
...         "details": {'columns': None, 'rows': None},
...         "action_options": [
...             {
...                 'code': 'REGULARIZE_AND_IMPUTE_DATASET',
...                 'data_check_name': 'DateTimeFormatDataCheck',
...                 'metadata': {
...                         'columns': None,
...                         'is_target': True,
...                         'rows': None
...                 },
...                 'parameters': {
...                         'time_index': {
...                             'default_value': 'dates',
...                             'parameter_type': 'global',
...                             'type': 'str'
...                         },
...                         'frequency_payload': {
...                             'default_value': ww_payload,
...                             'parameter_type': 'global',
...                             'type': 'tuple'
...                         }
...                 }
...             }
...         ]
...     }
... ]
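
To see which dates the gap in this example covers, the following standard-library sketch lists the calendar days absent between the first and last observed date. It assumes a daily frequency and uses a hypothetical helper, `missing_days`; it is not how EvalML or Woodwork detects missing values internally.

```python
from datetime import date, timedelta

def missing_days(observed):
    """Return the calendar days between the first and last observed date
    that do not appear in the input. Assumes a daily frequency."""
    present = set(observed)
    start, end = min(observed), max(observed)
    span = (end - start).days
    candidates = (start + timedelta(days=i) for i in range(span + 1))
    return [day for day in candidates if day not in present]

# Mirror the example: 9 days from 2021-01-01, then 50 days from 2021-01-31.
first_block = [date(2021, 1, 1) + timedelta(days=i) for i in range(9)]
second_block = [date(2021, 1, 31) + timedelta(days=i) for i in range(50)]

gaps = missing_days(first_block + second_block)
```

Here `gaps` runs from 2021-01-10 through 2021-01-30, the 21 days that fall between the two blocks.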

The column "dates" has a repeat of the date 2021-01-09 appended to the end, which is considered redundant and will raise an error.

>>> X = pd.DataFrame(pd.date_range("2021-01-01", periods=9).append(pd.date_range("2021-01-09", periods=1)), columns=["dates"])
>>> y = pd.Series([0, 1, 0, 1, 1, 0, 0, 0, 1, 0])
>>> ww_payload = infer_frequency(X["dates"], debug=True, window_length=5, threshold=0.8)
>>> datetime_format_dc = DateTimeFormatDataCheck(datetime_column="dates")
>>> assert datetime_format_dc.validate(X, y) == [
...     {
...         "message": "Column 'dates' has more than one row with the same datetime value.",
...         "data_check_name": "DateTimeFormatDataCheck",
...         "level": "error",
...         "code": "DATETIME_HAS_REDUNDANT_ROW",
...         "details": {"columns": None, "rows": None},
...         "action_options": []
...      },
...     {
...         "message": "A frequency was detected in column 'dates', but there are faulty datetime values that need to be addressed.",
...         "data_check_name": "DateTimeFormatDataCheck",
...         "level": "error",
...         "code": "DATETIME_HAS_UNEVEN_INTERVALS",
...         "details": {'columns': None, 'rows': None},
...         "action_options": [
...             {
...                 'code': 'REGULARIZE_AND_IMPUTE_DATASET',
...                 'data_check_name': 'DateTimeFormatDataCheck',
...                 'metadata': {
...                         'columns': None,
...                         'is_target': True,
...                         'rows': None
...                 },
...                 'parameters': {
...                         'time_index': {
...                             'default_value': 'dates',
...                             'parameter_type': 'global',
...                             'type': 'str'
...                         },
...                         'frequency_payload': {
...                             'default_value': ww_payload,
...                             'parameter_type': 'global',
...                             'type': 'tuple'
...                         }
...                 }
...             }
...         ]
...     }
... ]
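
A redundant row is simply a datetime value that occurs more than once. As a rough illustration using only the standard library, the hypothetical helper `repeated_values` below finds such values in a list that mirrors this example; it is a sketch of the idea, not EvalML's implementation.

```python
from collections import Counter
from datetime import date, timedelta

def repeated_values(values):
    """Return, in sorted order, every value that occurs more than once."""
    counts = Counter(values)
    return sorted(value for value, count in counts.items() if count > 1)

# Nine consecutive days starting 2021-01-01, plus a second copy of 2021-01-09.
days = [date(2021, 1, 1) + timedelta(days=i) for i in range(9)]
days.append(date(2021, 1, 9))

repeats = repeated_values(days)
```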

The column "Weeks" has a date that does not follow the weekly pattern, which is considered misaligned.

>>> X = pd.DataFrame(pd.date_range("2021-01-01", freq="W", periods=12).append(pd.date_range("2021-03-22", periods=1)), columns=["Weeks"])
>>> ww_payload = infer_frequency(X["Weeks"], debug=True, window_length=5, threshold=0.8)
>>> datetime_format_dc = DateTimeFormatDataCheck(datetime_column="Weeks")
>>> assert datetime_format_dc.validate(X, y) == [
...     {
...         "message": "Column 'Weeks' has datetime values that do not align with the inferred frequency.",
...         "data_check_name": "DateTimeFormatDataCheck",
...         "level": "error",
...         "details": {"columns": None, "rows": None},
...         "code": "DATETIME_HAS_MISALIGNED_VALUES",
...         "action_options": []
...      },
...     {
...         "message": "A frequency was detected in column 'Weeks', but there are faulty datetime values that need to be addressed.",
...         "data_check_name": "DateTimeFormatDataCheck",
...         "level": "error",
...         "code": "DATETIME_HAS_UNEVEN_INTERVALS",
...         "details": {'columns': None, 'rows': None},
...         "action_options": [
...             {
...                 'code': 'REGULARIZE_AND_IMPUTE_DATASET',
...                 'data_check_name': 'DateTimeFormatDataCheck',
...                 'metadata': {
...                         'columns': None,
...                         'is_target': True,
...                         'rows': None
...                 },
...                 'parameters': {
...                         'time_index': {
...                             'default_value': 'Weeks',
...                             'parameter_type': 'global',
...                             'type': 'str'
...                         },
...                         'frequency_payload': {
...                             'default_value': ww_payload,
...                             'parameter_type': 'global',
...                             'type': 'tuple'
...                         }
...                 }
...             }
...         ]
...     }
... ]

The column "Weeks" is passed integers instead of datetime data, which will raise an error.

>>> X = pd.DataFrame([1, 2, 3, 4], columns=["Weeks"])
>>> y = pd.Series([0] * 4)
>>> datetime_format_dc = DateTimeFormatDataCheck(datetime_column="Weeks")
>>> assert datetime_format_dc.validate(X, y) == [
...     {
...         "message": "Datetime information could not be found in the data, or was not in a supported datetime format.",
...         "data_check_name": "DateTimeFormatDataCheck",
...         "level": "error",
...         "details": {"columns": None, "rows": None},
...         "code": "DATETIME_INFORMATION_NOT_FOUND",
...         "action_options": []
...      }
... ]

Converting that same integer data to datetime, however, is valid.

>>> X = pd.DataFrame(pd.to_datetime([1, 2, 3, 4]), columns=["Weeks"])
>>> datetime_format_dc = DateTimeFormatDataCheck(datetime_column="Weeks")
>>> assert datetime_format_dc.validate(X, y) == []
>>> X = pd.DataFrame(pd.date_range("2021-01-01", freq="W", periods=10), columns=["Weeks"])
>>> datetime_format_dc = DateTimeFormatDataCheck(datetime_column="Weeks")
>>> assert datetime_format_dc.validate(X, y) == []

Although the data passed in is of datetime type, time series requires the datetime information in datetime_column to be monotonically increasing (ascending).

>>> X = X.iloc[::-1]
>>> datetime_format_dc = DateTimeFormatDataCheck(datetime_column="Weeks")
>>> assert datetime_format_dc.validate(X, y) == [
...     {
...         "message": "Datetime values must be sorted in ascending order.",
...         "data_check_name": "DateTimeFormatDataCheck",
...         "level": "error",
...         "details": {"columns": None, "rows": None},
...         "code": "DATETIME_IS_NOT_MONOTONIC",
...         "action_options": []
...      }
... ]
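
The ordering requirement can be illustrated without pandas. In this minimal sketch, `is_monotonic_increasing` is a hypothetical helper that accepts equal neighbours (duplicates are reported by a separate error code), and sorting the reversed values restores an acceptable order.

```python
from datetime import date, timedelta

def is_monotonic_increasing(values):
    """True when each value is greater than or equal to the one before it."""
    return all(earlier <= later for earlier, later in zip(values, values[1:]))

# Ten weekly dates, starting on Sunday 2021-01-03.
weeks = [date(2021, 1, 3) + timedelta(weeks=i) for i in range(10)]

# Reversing the order breaks the ascending requirement...
reversed_weeks = weeks[::-1]

# ...and sorting the values restores it.
restored = sorted(reversed_weeks)
```

With a DataFrame, the analogous fix would be to sort the rows by the datetime column before running the data check.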

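The first value in the column "days" is replaced with None, which becomes NaT once the column is converted to datetime and will raise an error in this data check.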

>>> dates = [["2-1-21", "3-1-21"],
...         ["2-2-21", "3-2-21"],
...         ["2-3-21", "3-3-21"],
...         ["2-4-21", "3-4-21"],
...         ["2-5-21", "3-5-21"],
...         ["2-6-21", "3-6-21"],
...         ["2-7-21", "3-7-21"],
...         ["2-8-21", "3-8-21"],
...         ["2-9-21", "3-9-21"],
...         ["2-10-21", "3-10-21"],
...         ["2-11-21", "3-11-21"],
...         ["2-12-21", "3-12-21"]]
>>> dates[0][0] = None
>>> df = pd.DataFrame(dates, columns=["days", "days2"])
>>> ww_payload = infer_frequency(pd.to_datetime(df["days"]), debug=True, window_length=5, threshold=0.8)
>>> datetime_format_dc = DateTimeFormatDataCheck(datetime_column="days")
>>> assert datetime_format_dc.validate(df, y) == [
...     {
...         "message": "Input datetime column 'days' contains NaN values. Please impute NaN values or drop these rows.",
...         "data_check_name": "DateTimeFormatDataCheck",
...         "level": "error",
...         "details": {"columns": None, "rows": None},
...         "code": "DATETIME_HAS_NAN",
...         "action_options": []
...      },
...     {
...         "message": "A frequency was detected in column 'days', but there are faulty datetime values that need to be addressed.",
...         "data_check_name": "DateTimeFormatDataCheck",
...         "level": "error",
...         "code": "DATETIME_HAS_UNEVEN_INTERVALS",
...         "details": {'columns': None, 'rows': None},
...         "action_options": [
...             {
...                 'code': 'REGULARIZE_AND_IMPUTE_DATASET',
...                 'data_check_name': 'DateTimeFormatDataCheck',
...                 'metadata': {
...                         'columns': None,
...                         'is_target': True,
...                         'rows': None
...                 },
...                 'parameters': {
...                         'time_index': {
...                             'default_value': 'days',
...                             'parameter_type': 'global',
...                             'type': 'str'
...                         },
...                         'frequency_payload': {
...                             'default_value': ww_payload,
...                             'parameter_type': 'global',
...                             'type': 'tuple'
...                         }
...                 }
...             }
...         ]
...     }
... ]

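For multiseries, the data check goes through each series and performs checks on it similar to the single series case. To indicate that the data check is operating on a multiseries, pass the name of the series_id column to the data check.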

>>> X = pd.DataFrame(
...     {
...         "date": pd.date_range("2021-01-01", periods=15).repeat(2),
...         "series_id": pd.Series(list(range(2)) * 15, dtype="str")
...     }
... )
>>> X = X.drop([15])
>>> dc = DateTimeFormatDataCheck(datetime_column="date", series_id="series_id")
>>> ww_payload_expected_series1 = infer_frequency((X[X["series_id"] == "1"]["date"].reset_index(drop=True)), debug=True, window_length=4, threshold=0.4)
>>> assert dc.validate(X, y) == [
...     {
...         "message": "Column 'date' for series '1' has datetime values missing between start and end date.",
...         "data_check_name": "DateTimeFormatDataCheck",
...         "level": "error",
...         "details": {"columns": None, "rows": None},
...         "code": "DATETIME_IS_MISSING_VALUES",
...         "action_options": []
...      },
...     {
...         "message": "A frequency was detected in column 'date' for series '1', but there are faulty datetime values that need to be addressed.",
...         "data_check_name": "DateTimeFormatDataCheck",
...         "level": "error",
...         "code": "DATETIME_HAS_UNEVEN_INTERVALS",
...         "details": {'columns': None, 'rows': None},
...         "action_options": [
...             {
...                 'code': 'REGULARIZE_AND_IMPUTE_DATASET',
...                 'data_check_name': 'DateTimeFormatDataCheck',
...                 'metadata': {
...                         'columns': None,
...                         'is_target': True,
...                         'rows': None
...                 },
...                 'parameters': {
...                         'time_index': {
...                             'default_value': 'date',
...                             'parameter_type': 'global',
...                             'type': 'str'
...                         },
...                         'frequency_payload': {
...                             'default_value': ww_payload_expected_series1,
...                             'parameter_type': 'global',
...                             'type': 'tuple'
...                         }
...                 }
...             }
...         ]
...     }
... ]
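
The per-series behaviour described for stacked data can be pictured as "split the rows by series_id, then check each group on its own". The standard-library sketch below follows that idea for the dataset in this example. `split_by_series` and `has_gap` are hypothetical helpers, and a daily frequency is assumed; EvalML's actual multiseries handling may differ.

```python
from collections import defaultdict
from datetime import date, timedelta

def split_by_series(rows):
    """Group (series_id, day) pairs from a stacked dataset by series_id."""
    groups = defaultdict(list)
    for series_id, day in rows:
        groups[series_id].append(day)
    return dict(groups)

def has_gap(days):
    """True if any two consecutive sorted days are more than one day apart."""
    ordered = sorted(days)
    return any((later - earlier).days > 1
               for earlier, later in zip(ordered, ordered[1:]))

# Stacked layout as in the example: 15 days, each repeated for series "0" and "1".
rows = [
    (str(series), date(2021, 1, 1) + timedelta(days=day))
    for day in range(15)
    for series in range(2)
]

# Drop the row at position 15, which belongs to series "1" (2021-01-08).
del rows[15]

gaps_by_series = {
    series_id: has_gap(days)
    for series_id, days in split_by_series(rows).items()
}
```

Only series "1" is reported as having a gap, which matches the series named in the error messages above.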