Using Text Data with EvalML#

In this demo, we will show you how to use EvalML to build models which use text data.

[1]:
import evalml
from evalml import AutoMLSearch

Dataset#

We will be utilizing a dataset of SMS text messages, some of which are categorized as spam, and others which are not ("ham"). This dataset is originally from Kaggle, but has been modified to produce a more even distribution of spam to ham.

[2]:
from urllib.request import urlopen
import pandas as pd

input_data = urlopen(
    "https://featurelabs-static.s3.amazonaws.com/spam_text_messages_modified.csv"
)
data = pd.read_csv(input_data)[:750]

X = data.drop(["Category"], axis=1)
y = data["Category"]

display(X.head())
Message
0 Free entry in 2 a wkly comp to win FA Cup fina...
1 FreeMsg Hey there darling it's been 3 week's n...
2 WINNER!! As a valued network customer you have...
3 Had your mobile 11 months or more? U R entitle...
4 SIX chances to win CASH! From 100 to 20,000 po...

The distribution of spam to ham in this data is roughly 3:2, so any machine learning model must achieve better than 60% accuracy in order to outperform a trivial baseline model that simply classifies every message as the majority class, spam.

[3]:
y.value_counts(normalize=True)
[3]:
Category
spam    0.593333
ham     0.406667
Name: proportion, dtype: float64
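
As a quick sanity check on that claim (a minimal sketch, not part of the original notebook): a baseline that always predicts the majority class scores exactly the majority class's proportion in accuracy.

# Sketch: the accuracy of an always-predict-the-majority-class baseline
majority_class = y.value_counts().idxmax()
baseline_accuracy = (y == majority_class).mean()
print(f"Baseline predicts '{majority_class}' everywhere; accuracy: {baseline_accuracy:.3f}")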

To properly utilize Woodwork's 'Natural Language' typing, we need to pass this argument in during initialization. Otherwise, the column will be treated as the 'Unknown' type and dropped during search.

[4]:
X.ww.init(logical_types={"Message": "NaturalLanguage"})
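
To see why the override matters, here is a small sketch (assuming the same data frame loaded above; X_default is a hypothetical fresh copy): with default Woodwork initialization, a column of free-form strings like this one is inferred as 'Unknown' rather than 'NaturalLanguage', per the note above.

# Sketch: default Woodwork inference on a fresh copy of the same column
X_default = data.drop(["Category"], axis=1)
X_default.ww.init()
print(X_default.ww.logical_types)  # expected (per the note above): {'Message': Unknown}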

Search for best pipeline#

In order to validate the results of the pipeline creation and optimization process, we will save some of our data as a holdout set.

[5]:
X_train, X_holdout, y_train, y_holdout = evalml.preprocessing.split_data(
    X, y, problem_type="binary", test_size=0.2, random_seed=0
)
/home/docs/checkouts/readthedocs.org/user_builds/feature-labs-inc-evalml/envs/stable/lib/python3.9/site-packages/woodwork/type_sys/utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
  pd.to_datetime(
/home/docs/checkouts/readthedocs.org/user_builds/feature-labs-inc-evalml/envs/stable/lib/python3.9/site-packages/woodwork/type_sys/utils.py:33: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
  pd.to_datetime(

EvalML uses Woodwork to automatically detect which columns are text columns, so you can run search normally, just as you would if there were no text data. We can print out the logical type of the Message column and assert that it is indeed inferred as a natural language column.

[6]:
X_train.ww
[6]:
        Physical Type     Logical Type Semantic Tag(s)
Message        string  NaturalLanguage              []

Because the spam/ham labels are binary, we will use AutoMLSearch(X_train=X_train, y_train=y_train, problem_type='binary'). When we call .search(), the search for the best pipeline will begin.

[7]:
automl = AutoMLSearch(
    X_train=X_train,
    y_train=y_train,
    problem_type="binary",
    max_batches=1,
    optimize_thresholds=True,
    verbose=True,
)

automl.search(interactive_plot=False)
AutoMLSearch will use mean CV score to rank pipelines.

*****************************
* Beginning pipeline search *
*****************************

Optimizing for Log Loss Binary.
Lower score is better.

Using SequentialEngine to train and score pipelines.
Searching up to 1 batches for a total of None pipelines.
Allowed model families:

Evaluating Baseline Pipeline: Mode Baseline Binary Classification Pipeline
Mode Baseline Binary Classification Pipeline:
        Starting cross validation
        Finished cross validation - mean Log Loss Binary: 14.658

*****************************
* Evaluating Batch Number 1 *
*****************************

Random Forest Classifier w/ Label Encoder + Natural Language Featurizer + Imputer + RF Classifier Select From Model:
        Starting cross validation
        Finished cross validation - mean Log Loss Binary: 0.249

Search finished after 6.66 seconds
Best pipeline: Random Forest Classifier w/ Label Encoder + Natural Language Featurizer + Imputer + RF Classifier Select From Model
Best pipeline Log Loss Binary: 0.248763
[7]:
{1: {'Random Forest Classifier w/ Label Encoder + Natural Language Featurizer + Imputer + RF Classifier Select From Model': 5.933701276779175,
  'Total time of batch': 6.061897277832031}}

View rankings and select pipeline#

Once the fitting process is done, we can see all of the pipelines that were searched.

[8]:
automl.rankings
[8]:
   id                                      pipeline_name  search_order  ranking_score  mean_cv_score  standard_deviation_cv_score  percent_better_than_baseline  high_variance_cv                                         parameters
0   1  Random Forest Classifier w/ Label Encoder + Na...             1       0.248763       0.248763                     0.056686                     98.302858             False  {'Label Encoder': {'positive_label': None}, 'I...
1   0      Mode Baseline Binary Classification Pipeline              0      14.657752      14.657752                     0.104049                      0.000000             False  {'Label Encoder': {'positive_label': None}, 'B...

To select the best pipeline we can call automl.best_pipeline.

[9]:
best_pipeline = automl.best_pipeline

Describe pipeline#

You can get more details about any pipeline, including how it performed on other objective functions.

[10]:
automl.describe_pipeline(automl.rankings.iloc[0]["id"])
***********************************************************************************************************************
* Random Forest Classifier w/ Label Encoder + Natural Language Featurizer + Imputer + RF Classifier Select From Model *
***********************************************************************************************************************

Problem Type: binary
Model Family: Random Forest

Pipeline Steps
==============
1. Label Encoder
         * positive_label : None
2. Natural Language Featurizer
3. Imputer
         * categorical_impute_strategy : most_frequent
         * numeric_impute_strategy : mean
         * boolean_impute_strategy : most_frequent
         * categorical_fill_value : None
         * numeric_fill_value : None
         * boolean_fill_value : None
4. RF Classifier Select From Model
         * number_features : None
         * n_estimators : 10
         * max_depth : None
         * percent_features : 0.5
         * threshold : median
         * n_jobs : -1
5. Random Forest Classifier
         * n_estimators : 100
         * max_depth : 6
         * n_jobs : -1

Training
========
Training for binary problems.
Total training time (including CV): 5.9 seconds

Cross Validation
----------------
             Log Loss Binary  MCC Binary  Gini   AUC  Precision    F1  Balanced Accuracy Binary  Accuracy Binary # Training # Validation
0                      0.251       0.793 0.917 0.958      0.930 0.868                     0.886            0.900        400          200
1                      0.191       0.844 0.964 0.982      0.934 0.904                     0.917            0.925        400          200
2                      0.304       0.782 0.900 0.950      0.886 0.870                     0.889            0.895        400          200
mean                   0.249       0.806 0.927 0.963      0.917 0.881                     0.897            0.907          -            -
std                    0.057       0.033 0.033 0.017      0.027 0.020                     0.017            0.016          -            -
coef of var            0.228       0.041 0.036 0.017      0.029 0.023                     0.019            0.018          -            -
[11]:
best_pipeline.graph()
[11]:
../_images/demos_text_input_21_0.svg

Notice above that the first step in the pipeline is the Natural Language Featurizer. AutoMLSearch uses the Woodwork accessor to recognize that 'Message' is a text column, and converts this text into numerical values that can be handled by the estimator.

Score on holdout#

Now, we can score the pipeline on holdout data using the ranking objectives for binary classification problems.

[12]:
scores = best_pipeline.score(
    X_holdout, y_holdout, objectives=evalml.objectives.get_ranking_objectives("binary")
)
print(f'Accuracy Binary: {scores["Accuracy Binary"]}')
Accuracy Binary: 0.9333333333333333

As you can see, this model performs relatively well on this dataset, even on unseen data.
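
As an illustrative follow-up (a sketch, not part of the original notebook), the fitted pipeline can also produce per-message predictions on the holdout set directly:

# Sketch: per-message predictions from the best pipeline
predictions = best_pipeline.predict(X_holdout)
print(predictions.value_counts())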

What does the Natural Language Featurizer do?#

Machine learning models cannot handle non-numeric data. Any text must be broken down into numeric features that provide useful information about that text. The Natural Language Featurizer first normalizes your text by removing any punctuation and other non-alphanumeric characters and converting any capital letters to lowercase. From there, it passes the text into featuretools' nlp_primitives dfs search to create several useful features that replace the original column in your dataset: Diversity Score, Mean Characters per Word, Polarity Score, LSA (Latent Semantic Analysis), Number of Characters, and Number of Words.

Diversity Score is the ratio of unique words to total words.

Mean Characters per Word is the average number of letters in each word.

Polarity Score is a prediction of how "polarized" the text is, on a scale from -1 (extremely negative) to 1 (extremely positive).

Latent Semantic Analysis (LSA) is an abstract representation of how important each word is with respect to the entire text, reduced down to two values per text. While each of the other text features produces a single column, this feature adds two columns to your data, LSA(column_name)[0] and LSA(column_name)[1].

Number of Characters is the number of characters in the text.

Number of Words is the number of words in the text.
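
To make the two simpler definitions concrete, here is a hand computation of Diversity Score and Mean Characters per Word on a toy message (a sketch only; the exact tokenization and normalization used by nlp_primitives may differ slightly):

# Sketch: computing two of the features by hand on a toy message
text = "free entry in a wkly comp to win win win"
words = text.split()
diversity_score = len(set(words)) / len(words)                 # unique words / total words
mean_chars_per_word = sum(len(w) for w in words) / len(words)  # letters per word, averaged
print(diversity_score)       # 8 unique words out of 10 -> 0.8
print(mean_chars_per_word)   # 31 characters across 10 words -> 3.1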

Let's see what this looks like with our spam/ham example.

[13]:
best_pipeline.input_feature_names
[13]:
{'Label Encoder': ['Message'],
 'Natural Language Featurizer': ['Message'],
 'Imputer': ['DIVERSITY_SCORE(Message)',
  'MEAN_CHARACTERS_PER_WORD(Message)',
  'NUM_CHARACTERS(Message)',
  'NUM_WORDS(Message)',
  'POLARITY_SCORE(Message)',
  'LSA(Message)[0]',
  'LSA(Message)[1]'],
 'RF Classifier Select From Model': ['DIVERSITY_SCORE(Message)',
  'MEAN_CHARACTERS_PER_WORD(Message)',
  'NUM_CHARACTERS(Message)',
  'NUM_WORDS(Message)',
  'POLARITY_SCORE(Message)',
  'LSA(Message)[0]',
  'LSA(Message)[1]'],
 'Random Forest Classifier': ['DIVERSITY_SCORE(Message)',
  'MEAN_CHARACTERS_PER_WORD(Message)',
  'NUM_CHARACTERS(Message)',
  'LSA(Message)[0]']}

Here, the Natural Language Featurizer takes in a single 'Message' column, but the next component in the pipeline, the Imputer, receives seven input columns. These seven columns are the result of featurizing the text-type 'Message' column. Notice also that the RF Classifier Select From Model component keeps only a subset of these features (four, in this run) before the estimator. Most importantly, it is these featurized columns, not the raw text, that end up being passed to the estimator.

This process will not affect any non-text columns in the dataset. If the dataset contains multiple text columns, each one is broken down into these seven feature columns independently.

A closer look at the features#

Rather than just checking the new column names, let's examine the output of this component directly. We can see this by running the component on its own.

[14]:
natural_language_featurizer = evalml.pipelines.components.NaturalLanguageFeaturizer()
X_featurized = natural_language_featurizer.fit_transform(X_train)

Now we can compare the input data to the output from the Natural Language Featurizer:

[15]:
X_train.head()
[15]:
Message
296 Sunshine Hols. To claim ur med holiday send a ...
652 Yup ü not comin :-(
526 Hello hun how ru? Its here by the way. Im good...
571 I tagged MY friends that you seemed to count a...
472 What happened to our yo date?
[16]:
X_featurized.head()
[16]:
     DIVERSITY_SCORE(Message)  MEAN_CHARACTERS_PER_WORD(Message)  NUM_CHARACTERS(Message)  NUM_WORDS(Message)  POLARITY_SCORE(Message)  LSA(Message)[0]  LSA(Message)[1]
296 1.0 4.344828 154.0 29.0 0.003 0.150556 -0.072443
652 1.0 3.000000 16.0 4.0 0.000 0.017340 -0.005411
526 1.0 3.363636 143.0 33.0 0.162 0.169954 0.022670
571 0.8 4.083333 60.0 12.0 0.681 0.144713 0.036799
472 1.0 3.833333 28.0 6.0 0.000 0.109373 -0.042754

These numeric values now represent important information about the original text that the estimator at the end of the pipeline can successfully use to make predictions.

Why encode text this way?#

To demonstrate the importance of text-specific modeling, let's train a model with the same dataset, but without letting AutoMLSearch detect the text column. We can do this by explicitly setting the data type of the 'Message' column in Woodwork to Categorical, using the utility method infer_feature_types.

[17]:
from evalml.utils import infer_feature_types

X = infer_feature_types(X, {"Message": "Categorical"})
X_train, X_holdout, y_train, y_holdout = evalml.preprocessing.split_data(
    X, y, problem_type="binary", test_size=0.2, random_seed=0
)
/home/docs/checkouts/readthedocs.org/user_builds/feature-labs-inc-evalml/envs/stable/lib/python3.9/site-packages/woodwork/type_sys/utils.py:33: UserWarning:

Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.

/home/docs/checkouts/readthedocs.org/user_builds/feature-labs-inc-evalml/envs/stable/lib/python3.9/site-packages/woodwork/type_sys/utils.py:33: UserWarning:

Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.

[18]:
automl_no_text = AutoMLSearch(
    X_train=X_train,
    y_train=y_train,
    problem_type="binary",
    max_batches=1,
    optimize_thresholds=True,
    verbose=True,
)

automl_no_text.search(interactive_plot=False)
AutoMLSearch will use mean CV score to rank pipelines.

*****************************
* Beginning pipeline search *
*****************************

Optimizing for Log Loss Binary.
Lower score is better.

Using SequentialEngine to train and score pipelines.
Searching up to 1 batches for a total of None pipelines.
Allowed model families:

Evaluating Baseline Pipeline: Mode Baseline Binary Classification Pipeline
Mode Baseline Binary Classification Pipeline:
        Starting cross validation
        Finished cross validation - mean Log Loss Binary: 14.658

*****************************
* Evaluating Batch Number 1 *
*****************************

Random Forest Classifier w/ Label Encoder + Natural Language Featurizer + Imputer + RF Classifier Select From Model:
        Starting cross validation
        Finished cross validation - mean Log Loss Binary: 0.249

Search finished after 4.98 seconds
Best pipeline: Random Forest Classifier w/ Label Encoder + Natural Language Featurizer + Imputer + RF Classifier Select From Model
Best pipeline Log Loss Binary: 0.248763
[18]:
{1: {'Random Forest Classifier w/ Label Encoder + Natural Language Featurizer + Imputer + RF Classifier Select From Model': 4.418203592300415,
  'Total time of batch': 4.546222925186157}}

Like before, we can look at the rankings and pick the best pipeline.

[19]:
automl_no_text.rankings
[19]:
   id                                      pipeline_name  search_order  ranking_score  mean_cv_score  standard_deviation_cv_score  percent_better_than_baseline  high_variance_cv                                         parameters
0   1  Random Forest Classifier w/ Label Encoder + Na...             1       0.248763       0.248763                     0.056686                     98.302858             False  {'Label Encoder': {'positive_label': None}, 'I...
1   0      Mode Baseline Binary Classification Pipeline              0      14.657752      14.657752                     0.104049                      0.000000             False  {'Label Encoder': {'positive_label': None}, 'B...
[20]:
best_pipeline_no_text = automl_no_text.best_pipeline

Here, changing the data type of the text column removed the Natural Language Featurizer from the pipeline.

[21]:
best_pipeline_no_text.graph()
[21]:
../_images/demos_text_input_43_0.svg
[22]:
automl_no_text.describe_pipeline(automl_no_text.rankings.iloc[0]["id"])
***********************************************************************************************************************
* Random Forest Classifier w/ Label Encoder + Natural Language Featurizer + Imputer + RF Classifier Select From Model *
***********************************************************************************************************************

Problem Type: binary
Model Family: Random Forest

Pipeline Steps
==============
1. Label Encoder
         * positive_label : None
2. Natural Language Featurizer
3. Imputer
         * categorical_impute_strategy : most_frequent
         * numeric_impute_strategy : mean
         * boolean_impute_strategy : most_frequent
         * categorical_fill_value : None
         * numeric_fill_value : None
         * boolean_fill_value : None
4. RF Classifier Select From Model
         * number_features : None
         * n_estimators : 10
         * max_depth : None
         * percent_features : 0.5
         * threshold : median
         * n_jobs : -1
5. Random Forest Classifier
         * n_estimators : 100
         * max_depth : 6
         * n_jobs : -1

Training
========
Training for binary problems.
Total training time (including CV): 4.4 seconds

Cross Validation
----------------
             Log Loss Binary  MCC Binary  Gini   AUC  Precision    F1  Balanced Accuracy Binary  Accuracy Binary # Training # Validation
0                      0.251       0.793 0.917 0.958      0.930 0.868                     0.886            0.900        400          200
1                      0.191       0.844 0.964 0.982      0.934 0.904                     0.917            0.925        400          200
2                      0.304       0.782 0.900 0.950      0.886 0.870                     0.889            0.895        400          200
mean                   0.249       0.806 0.927 0.963      0.917 0.881                     0.897            0.907          -            -
std                    0.057       0.033 0.033 0.017      0.027 0.020                     0.017            0.016          -            -
coef of var            0.228       0.041 0.036 0.017      0.029 0.023                     0.019            0.018          -            -
[23]:
# get standard performance metrics on holdout data
scores = best_pipeline_no_text.score(
    X_holdout, y_holdout, objectives=evalml.objectives.get_ranking_objectives("binary")
)
print(f'Accuracy Binary: {scores["Accuracy Binary"]}')
Accuracy Binary: 0.9333333333333333

Without the Natural Language Featurizer, the 'Message' column was treated as a categorical column, and so the conversion of this text to numerical values happened in the One Hot Encoder. The best pipeline encoded the top 10 most frequent "categories" of these texts, meaning that only 10 text messages were one-hot encoded and all the rest were dropped. Clearly, this removes almost all of the information from the dataset, as we can see that best_pipeline_no_text performs very similarly to randomly guessing "ham" in every case.
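
To see concretely why the categorical treatment throws information away (a sketch, not part of the original notebook): nearly every message is unique, so a top-10 one-hot encoding covers almost none of the rows, and every other message becomes an all-zero vector.

# Sketch: how little of the data a top-10 category encoding actually covers
n_unique = X_train["Message"].nunique()
print(f"{n_unique} unique messages out of {len(X_train)} rows")
top_10_counts = X_train["Message"].value_counts().head(10)
print(f"Rows covered by the 10 most frequent messages: {top_10_counts.sum()}")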