Keras binary classification: is a movie review positive or negative?

The environment used is Colab.

Defining the task

Given a set of movie reviews, determine which are positive and which are negative.

Data collection

We use the IMDB dataset.

# Load the dataset (requires a network connection)
from tensorflow.keras.datasets import imdb
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000) # num_words=10000 keeps only the 10000 most frequent words in the data

Data visualization

Get a feel for the shape of the data.

# Inspect the data
train_data.shape, train_labels.shape, test_data.shape, test_labels.shape, train_data.dtype
# ((25000,), (25000,), (25000,), (25000,), dtype('O'))

type(train_data), type(train_data[0]), train_data.ndim
# (numpy.ndarray, list, 1)

Let's see what the data in the dataset actually looks like, and what its labels are.

# Look at the first 10 word indices of the 1st review, and its label (positive or negative)

train_data[0][:10] # the words of each review have been converted to integers
# [1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65]

train_labels[0]
# 1

# The largest word index that appears in the data
max([max(item) for item in train_data]) # take the largest index within each review, then the largest of those; with num_words=10000 this is at most 9999
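If instead you want the length of the longest review, which is a different question, a one-liner will do (reusing train_data from above; this check is not in the original post):

max(len(item) for item in train_data) # longest review, counted in word indices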

The reviews in the dataset are encoded as integers; let's decode one back into readable text.

word_index = imdb.get_word_index() # word_index is a dict whose keys are words and whose values are the corresponding integers
word_index["hello"] # the integer for the word "hello" is 4822

reverse_word_index = dict([(value, key) for (key, value) in word_index.items()]) # a new dict mapping each integer back to its word
reverse_word_index[4822] # 'hello'

decoded_review = " ".join([reverse_word_index.get(i - 3, "?") for i in train_data[0]]) # decode the first training review into readable text; we subtract 3 because the indices in the dataset are offset by 3 relative to word_index (0, 1, and 2 are reserved for padding, start-of-sequence, and unknown)
decoded_review

The decoded text:

? this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert ? is an amazing actor and now the same being director ? father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for ? and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also ? to the two little boy's that played the ? of norman and paul they were just brilliant children are often left out of the ? list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you think the whole story was so lovely because it was true and was someone's life after all that was shared with us all

This is clearly a positive review, and its label is "1". So now we know: a label of "1" means the review is positive, and "0" means it is negative.

Data labeling

Already labeled.

Data cleaning

No cleaning needed.

Choosing an evaluation protocol

Establish a simple baseline (which the model should beat)

If we assigned labels to reviews at random, we would be correct 50% of the time, so the model's accuracy must exceed 50%.
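As a quick sanity check (not part of the original post), we can measure what a random classifier actually scores against the test labels loaded earlier:

import numpy as np

rng = np.random.default_rng(0)
random_preds = rng.integers(0, 2, size=len(test_labels)) # coin-flip classifier
print((random_preds == test_labels).mean()) # ~0.5, the bar the model must clear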

Choosing a metric

We use accuracy as the evaluation metric: the fraction of reviews classified correctly.

How is it applied? After the model is built, at the compilation step, set the metrics argument of model.compile() to ["accuracy"].

Data preprocessing

Vectorizing and normalizing the data

Multi-hot encode the lists of integers, turning them into vectors of 0s and 1s. Each review becomes a 10000-dimensional vector in which the element at every word's index (the integer that word maps to) is set to 1 if the word appears in the review.

import numpy as np

def vectorize_sequences(sequences, dimension=10000):
  results = np.zeros((len(sequences), dimension)) # create an all-zeros matrix
  for i, sequence in enumerate(sequences):
    for j in sequence:
      results[i, j] = 1. # set the element for every word index that appears
  return results

x_train = vectorize_sequences(train_data) # vectorize the training data
x_test = vectorize_sequences(test_data) # vectorize the test data

y_train = np.asarray(train_labels).astype("float32") # vectorize the labels
y_test = np.asarray(test_labels).astype("float32")

x_train.shape, x_test.shape, x_train.ndim
# ((25000, 10000), (25000, 10000), 2)

x_train[0] # what the 1st review looks like now
# array([0., 1., 1., ..., 0., 0., 0.])

y_train.shape, y_test.shape
# ((25000,), (25000,))

Handling missing values

There are no missing values to handle.

Splitting the data: training, validation, and test sets

Set aside part of the training set as a validation set.

x_val = x_train[:10000]
partial_x_train = x_train[10000:]

y_val = y_train[:10000]
partial_y_train = y_train[10000:]
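As an aside (not in the original post), model.fit can also carve out a validation set for you via its validation_split argument; note that it takes the last fraction of the arrays, without shuffling, so the explicit slice above is clearer when order matters:

# Sketch only; assumes the model compiled in the next section
# history = model.fit(x_train, y_train, epochs=20, batch_size=512, validation_split=0.4)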

Building a first model

Feature selection (filter out uninformative features; develop new features)

Here the features are simply the encoded words of each review.

Choosing an architecture

Two intermediate layers with 16 units each; a third layer outputs a scalar prediction for the review's sentiment (positive or negative).

Layer 1: a Dense layer, units=16 (the dimensionality of its "representation space"), activation relu.

Layer 2: a Dense layer, units=16, activation relu.

Layer 3: a Dense layer, units=1, activation sigmoid. Its output is a probability between 0 and 1 that the target equals "1", i.e. that the review is positive.
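To make that last point concrete, the sigmoid squashes any real-valued score into (0, 1). A tiny numpy illustration (not from the original post):

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

print(sigmoid(-2.0), sigmoid(0.0), sigmoid(2.0)) # ~0.119, 0.5, ~0.881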

Training configuration (loss function, batch size, learning rate)

Optimizer

We use rmsprop.

Loss function

We use binary cross-entropy, binary_crossentropy.
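For a single prediction p with true label y, binary cross-entropy is -(y·log(p) + (1-y)·log(1-p)). A minimal numpy version (an illustration, not from the original post):

import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    y_pred = np.clip(y_pred, eps, 1 - eps) # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# A confident correct prediction costs little; a confident wrong one costs a lot
print(binary_crossentropy(np.array([1.0]), np.array([0.9]))) # ~0.105
print(binary_crossentropy(np.array([1.0]), np.array([0.1]))) # ~2.303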

Number of epochs

We train for 20 epochs.

Batch size

The batch size is set to 512.

Model-building code

from tensorflow import keras
from tensorflow.keras import layers

# Build the model
model = keras.Sequential([
  layers.Dense(16, activation="relu"),
  layers.Dense(16, activation="relu"),
  layers.Dense(1, activation="sigmoid")
])

# Compile the model
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])

Fitting the model

history = model.fit(partial_x_train, partial_y_train, epochs=20, batch_size=512, validation_data=(x_val, y_val))

Output:

Epoch 1/20 30/30 [==============================] - 3s 77ms/step - loss: 0.5066 - accuracy: 0.7891 - val_loss: 0.3772 - val_accuracy: 0.8726 
Epoch 2/20 30/30 [==============================] - 1s 40ms/step - loss: 0.3095 - accuracy: 0.8973 - val_loss: 0.3270 - val_accuracy: 0.8708 
Epoch 3/20 30/30 [==============================] - 1s 41ms/step - loss: 0.2321 - accuracy: 0.9209 - val_loss: 0.2788 - val_accuracy: 0.8913 
Epoch 4/20 30/30 [==============================] - 1s 44ms/step - loss: 0.1899 - accuracy: 0.9361 - val_loss: 0.2989 - val_accuracy: 0.8797 
Epoch 5/20 30/30 [==============================] - 1s 49ms/step - loss: 0.1579 - accuracy: 0.9477 - val_loss: 0.2835 - val_accuracy: 0.8862 
Epoch 6/20 30/30 [==============================] - 2s 58ms/step - loss: 0.1358 - accuracy: 0.9569 - val_loss: 0.3212 - val_accuracy: 0.8725 
Epoch 7/20 30/30 [==============================] - 2s 65ms/step - loss: 0.1184 - accuracy: 0.9631 - val_loss: 0.3048 - val_accuracy: 0.8793 
Epoch 8/20 30/30 [==============================] - 2s 61ms/step - loss: 0.1000 - accuracy: 0.9695 - val_loss: 0.3149 - val_accuracy: 0.8811 
Epoch 9/20 30/30 [==============================] - 1s 41ms/step - loss: 0.0861 - accuracy: 0.9749 - val_loss: 0.3366 - val_accuracy: 0.8834 
Epoch 10/20 30/30 [==============================] - 2s 56ms/step - loss: 0.0746 - accuracy: 0.9780 - val_loss: 0.3619 - val_accuracy: 0.8708 
Epoch 11/20 30/30 [==============================] - 2s 55ms/step - loss: 0.0660 - accuracy: 0.9827 - val_loss: 0.3676 - val_accuracy: 0.8783 
Epoch 12/20 30/30 [==============================] - 2s 52ms/step - loss: 0.0575 - accuracy: 0.9855 - val_loss: 0.3926 - val_accuracy: 0.8737 
Epoch 13/20 30/30 [==============================] - 1s 44ms/step - loss: 0.0465 - accuracy: 0.9896 - val_loss: 0.4083 - val_accuracy: 0.8760 
Epoch 14/20 30/30 [==============================] - 1s 40ms/step - loss: 0.0418 - accuracy: 0.9895 - val_loss: 0.4366 - val_accuracy: 0.8732 
Epoch 15/20 30/30 [==============================] - 1s 40ms/step - loss: 0.0373 - accuracy: 0.9906 - val_loss: 0.4559 - val_accuracy: 0.8735 
Epoch 16/20 30/30 [==============================] - 2s 73ms/step - loss: 0.0281 - accuracy: 0.9953 - val_loss: 0.4732 - val_accuracy: 0.8741 
Epoch 17/20 30/30 [==============================] - 2s 53ms/step - loss: 0.0269 - accuracy: 0.9951 - val_loss: 0.5042 - val_accuracy: 0.8730 
Epoch 18/20 30/30 [==============================] - 2s 54ms/step - loss: 0.0247 - accuracy: 0.9955 - val_loss: 0.5203 - val_accuracy: 0.8721 
Epoch 19/20 30/30 [==============================] - 2s 54ms/step - loss: 0.0180 - accuracy: 0.9980 - val_loss: 0.5411 - val_accuracy: 0.8714 
Epoch 20/20 30/30 [==============================] - 1s 43ms/step - loss: 0.0154 - accuracy: 0.9983 - val_loss: 0.5657 - val_accuracy: 0.8704

After 20 epochs of training, accuracy on the training set reaches 0.9983, but accuracy on the validation set is only 0.8704.

history_dict = history.history # a dict whose keys are metric names and whose values are lists of that metric's per-epoch values

history_dict.keys()
# dict_keys(['loss', 'accuracy', 'val_loss', 'val_accuracy'])

Visualizing the fit

Plotting training and validation loss

import matplotlib.pyplot as plt

history_dict = history.history

loss_values = history_dict["loss"]
val_loss_values = history_dict["val_loss"]

epochs = range(1, len(loss_values) + 1)

plt.plot(epochs, loss_values, "bo", label="Training loss") # "bo" means blue dots
plt.plot(epochs, val_loss_values, "b", label="Validation loss") # "b" means a solid blue line

plt.title("Training and validation loss")
plt.xlabel("Epochs")
plt.ylabel("Loss")

plt.legend() # add a legend to the plot
plt.show()

Plotting training and validation accuracy

# import matplotlib.pyplot as plt
# history_dict = history.history

plt.clf() # clear the figure

acc = history_dict["accuracy"]
val_acc = history_dict["val_accuracy"]

# epochs = range(1, len(loss_values) + 1)

plt.plot(epochs, acc, "bo", label="Training acc") # "bo" means blue dots
plt.plot(epochs, val_acc, "b", label="Validation acc") # "b" means a solid blue line

plt.title("Training and validation accuracy")
plt.xlabel("Epochs")
plt.ylabel("Accuracy")

plt.legend() # add a legend to the plot
plt.show()

The plots show that the model starts to overfit after about the 4th epoch.
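Rather than eyeballing the plot, we can read the best epoch straight from the history (a quick check, not from the original post); for the run logged above, val_loss bottoms out at epoch 3, close to the 4 epochs chosen below:

import numpy as np

best_epoch = int(np.argmin(history_dict["val_loss"])) + 1 # epochs are 1-indexed
print(best_epoch)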

Improving the model

Stop training after 4 epochs

Since the model starts overfitting after about the 4th epoch, retrain it for just 4 epochs and then evaluate it again.

# Rebuild and recompile the model so these 4 epochs start from fresh weights
model = keras.Sequential([layers.Dense(16, activation="relu"), layers.Dense(16, activation="relu"), layers.Dense(1, activation="sigmoid")])
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])
# Note: this time we do not hold out a validation split; the full training set is used
history = model.fit(x_train, y_train, epochs=4, batch_size=512)

Evaluating the model

We chose accuracy as the evaluation metric, so we judge the model by its average accuracy over the entire test set.

test_loss, test_acc = model.evaluate(x_test, y_test)
print(f"Test accuracy: {test_acc}")
# 782/782 [==============================] - 2s 3ms/step - loss: 0.4298 - accuracy: 0.8632
# Test accuracy: 0.8631600141525269

The model reaches an accuracy of 0.8632 on the test set, which leaves room for further improvement.

Using the model for prediction

Given a new review, encode it into a vector of 0s and 1s exactly as above, then run model.predict() on it.

predictions = model.predict(x_test)

predic_result = [0 if item < 0.5 else 1 for item in predictions[:10]] # threshold the probabilities at 0.5
predic_result
# predicted: [0, 1, 0, 1, 1, 1, 1, 0, 1, 1]

list(y_test[0:10].astype(int))
# actual:    [0, 1, 1, 0, 1, 1, 1, 0, 0, 1]

Deploying the model

Not deployed.