Environment: Colab.
Defining the task
Given a set of movie reviews, determine which are positive and which are negative.
Data collection
Use the IMDB dataset.
Data visualization
Get a feel for the shape of the data.
Look at what the samples in the dataset actually look like and what their labels are.
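The loading and inspection steps can be sketched as follows. This is a minimal sketch that assumes the dataset is loaded through the Keras IMDB loader, and that num_words=10000 is used so that word indices fit the 10,000-dimensional vectors used later in this walkthrough. The variable names are illustrative.

```python
from tensorflow.keras.datasets import imdb

# Keep only the 10,000 most frequent words (assumed value); rarer words
# are replaced by the loader's out-of-vocabulary index.
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)

# Each sample is a variable-length list of integer word indices.
print(train_data.shape, test_data.shape)
print(train_data[0][:10])   # first ten word indices of the first review
print(train_labels[0])      # its label, 0 or 1
```

The first call downloads the dataset, so it needs network access.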
The reviews in the dataset are encoded as integers; decode one back into readable text.
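A minimal sketch of the decoding step, assuming the Keras IMDB loader with its default settings, which shifts word indices by 3 and reserves 0, 1, and 2 for padding, start of sequence, and unknown words. The helper name decode_review and the toy word index are illustrative; in Colab the real mapping would come from imdb.get_word_index().

```python
def decode_review(encoded_review, word_index, index_offset=3):
    """Turn a list of integer word indices back into a space-separated string."""
    # Invert the word -> index mapping so indices can be looked up.
    reverse_word_index = {index: word for word, index in word_index.items()}
    # Reserved and unknown indices have no entry and decode to "?".
    return " ".join(
        reverse_word_index.get(i - index_offset, "?") for i in encoded_review
    )

# Toy mapping for illustration only; the real one is imdb.get_word_index().
toy_word_index = {"this": 1, "film": 2, "was": 3, "brilliant": 4}
print(decode_review([1, 4, 5, 6, 2, 7], toy_word_index))  # ? this film was ? brilliant
```

The leading "?" comes from the start-of-sequence marker, which is why decoded reviews begin with a question mark.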
Decoded text:
? this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert ? is an amazing actor and now the same being director ? father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for ? and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also ? to the two little boy's that played the ? of norman and paul they were just brilliant children are often left out of the ? list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you think the whole story was so lovely because it was true and was someone's life after all that was shared with us all
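This is clearly a positive review. Its label is 1, which tells us that a label of 1 marks a positive review and a label of 0 marks a negative one.
Data labeling
Already labeled.
Data cleaning
No cleaning needed.
Choosing how to evaluate the model
Establish a simple baseline (the model should beat it)
If each review is assigned a class at random, the chance of being correct is 50%, so the model's classification accuracy should exceed 50%.
Choosing the metric
Accuracy is used as the evaluation metric here. Accuracy is the fraction of reviews that are classified correctly.
How to use it: once the model is built, set the metrics argument of model.compile() to ["accuracy"] at compile time.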
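To make the metric concrete, here is a small sketch of how accuracy can be computed by hand from predicted probabilities. It assumes a decision threshold of 0.5; the function name and the four example values are made up for illustration.

```python
import numpy as np

def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded prediction equals the true label."""
    y_true = np.asarray(y_true)
    # Probabilities above the threshold count as class 1 (positive).
    y_pred = (np.asarray(y_prob) > threshold).astype(y_true.dtype)
    return float(np.mean(y_pred == y_true))

# Four hypothetical reviews: the third one is misclassified.
print(binary_accuracy([1, 0, 1, 0], [0.9, 0.2, 0.4, 0.1]))  # 0.75
```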
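Data preprocessing
Vectorization and normalization
Multi-hot encode the lists of word indices, turning each one into a vector of 0s and 1s. Each review becomes a 10,000-dimensional vector: if a word appears in the review, the element at that word's index (the integer the word maps to) is set to 1, and all other elements stay 0.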
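The encoding described above can be sketched with NumPy. The function name vectorize_sequences is illustrative, and the tiny example uses dimension=5 so the result is easy to read; the walkthrough itself uses 10,000.

```python
import numpy as np

def vectorize_sequences(sequences, dimension=10000):
    """Multi-hot encode lists of word indices into a (num_samples, dimension) array."""
    results = np.zeros((len(sequences), dimension), dtype="float32")
    for row, sequence in enumerate(sequences):
        # Set every listed position to 1; repeated indices are harmless.
        results[row, sequence] = 1.0
    return results

example = vectorize_sequences([[1, 3, 3], [0, 4]], dimension=5)
print(example)
```

Note that this encoding records only whether a word occurs, not how often or in what order.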
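Handling missing values
There are no missing values to handle.
Data splitting: training, validation, and test sets
Hold out part of the training set as a validation set.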
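A minimal sketch of the hold-out split. The split size is not stated in the text; the training log further down shows 30 steps per epoch at a batch size of 512, which is consistent with about 15,000 training samples, so holding out 10,000 of the 25,000 training reviews is an inference, not a confirmed value. The helper name and the small stand-in arrays are illustrative.

```python
import numpy as np

def split_validation(x, y, num_val):
    """Hold out the first num_val samples for validation; train on the rest."""
    x_val, partial_x = x[:num_val], x[num_val:]
    y_val, partial_y = y[:num_val], y[num_val:]
    return (partial_x, partial_y), (x_val, y_val)

# Small stand-in arrays; for the IMDB data, num_val=10000 is the assumed value.
x = np.arange(20).reshape(10, 2)
y = np.arange(10) % 2
(partial_x, partial_y), (x_val, y_val) = split_validation(x, y, num_val=4)
print(partial_x.shape, x_val.shape)  # (6, 2) (4, 2)
```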
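Building the first model
Feature selection (filter out uninformative features; engineer new ones)
In this example, the features are simply the encoded words of each review.
Choosing the architecture
Two intermediate layers with 16 units each. A third layer outputs a scalar prediction representing the sentiment class of a review (positive or negative).
Layer 1: Dense layer; the dimensionality of the representation space, units, is 16, and the activation function, activation, is relu.
Layer 2: Dense layer; units is 16 and activation is relu.
Layer 3: Dense layer; units is 1 and activation is sigmoid. The output is a probability between 0 and 1: the likelihood that the sample's target is 1, i.e. that the review is positive.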
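Training configuration (loss function, batch size, learning rate)
Optimizer (optimizer)
rmsprop is used here.
Loss function (loss)
The binary cross-entropy loss, binary_crossentropy, is used here.
Number of epochs
Train for 20 epochs.
Batch size
The batch size is set to 512.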
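For intuition about the binary cross-entropy loss named in this configuration, here is a hand-written NumPy version. It is a sketch of the standard formula, averaged over samples, and not the Keras implementation; the clipping constant eps is an assumption added to avoid log(0).

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    y_true = np.asarray(y_true, dtype="float64")
    # Clip probabilities away from 0 and 1 so the logarithms stay finite.
    y_prob = np.clip(np.asarray(y_prob, dtype="float64"), eps, 1.0 - eps)
    losses = -(y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob))
    return float(np.mean(losses))

# Confident and correct predictions give a small loss ...
print(binary_crossentropy([1, 0], [0.9, 0.1]))  # about 0.105
# ... while confident and wrong predictions are penalized heavily.
print(binary_crossentropy([1, 0], [0.1, 0.9]))  # about 2.303
```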
Model construction code
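The architecture and training configuration described above can be sketched with the Keras Sequential API. The function name build_model and the explicit keras.Input layer are choices made for this sketch; the layer sizes, activations, optimizer, loss, and metric are the ones listed in the text.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_model(input_dim=10000):
    """Two 16-unit relu layers followed by a single sigmoid output unit."""
    model = keras.Sequential([
        keras.Input(shape=(input_dim,)),          # one multi-hot vector per review
        layers.Dense(16, activation="relu"),
        layers.Dense(16, activation="relu"),
        layers.Dense(1, activation="sigmoid"),    # probability that the review is positive
    ])
    model.compile(
        optimizer="rmsprop",
        loss="binary_crossentropy",
        metrics=["accuracy"],
    )
    return model

model = build_model()
model.summary()
```

Wrapping construction in a function makes it easy to create a fresh, untrained copy later.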
Fitting the model
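A minimal sketch of the fitting call with a validation set. To keep the block runnable on its own, it uses random stand-in data and a tiny single-layer stand-in model; in the walkthrough the real model and the vectorized IMDB split would be passed in, with the default epochs=20 and batch_size=512. The helper name fit_with_validation is illustrative.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def fit_with_validation(model, x_train, y_train, x_val, y_val,
                        epochs=20, batch_size=512):
    """Train the model and return per-epoch metrics as a dict of lists."""
    history = model.fit(
        x_train, y_train,
        epochs=epochs,
        batch_size=batch_size,
        validation_data=(x_val, y_val),  # evaluated at the end of every epoch
    )
    return history.history

# Stand-in data and model, for illustration only.
rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=(200, 50)).astype("float32")
y = rng.integers(0, 2, size=(200,)).astype("float32")
stand_in_model = keras.Sequential([
    keras.Input(shape=(50,)),
    layers.Dense(1, activation="sigmoid"),
])
stand_in_model.compile(optimizer="rmsprop", loss="binary_crossentropy",
                       metrics=["accuracy"])

history_dict = fit_with_validation(stand_in_model, x[50:], y[50:], x[:50], y[:50],
                                   epochs=2, batch_size=32)
print(sorted(history_dict))  # ['accuracy', 'loss', 'val_accuracy', 'val_loss']
```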
Output:
Epoch 1/20 30/30 [==============================] - 3s 77ms/step - loss: 0.5066 - accuracy: 0.7891 - val_loss: 0.3772 - val_accuracy: 0.8726
Epoch 2/20 30/30 [==============================] - 1s 40ms/step - loss: 0.3095 - accuracy: 0.8973 - val_loss: 0.3270 - val_accuracy: 0.8708
Epoch 3/20 30/30 [==============================] - 1s 41ms/step - loss: 0.2321 - accuracy: 0.9209 - val_loss: 0.2788 - val_accuracy: 0.8913
Epoch 4/20 30/30 [==============================] - 1s 44ms/step - loss: 0.1899 - accuracy: 0.9361 - val_loss: 0.2989 - val_accuracy: 0.8797
Epoch 5/20 30/30 [==============================] - 1s 49ms/step - loss: 0.1579 - accuracy: 0.9477 - val_loss: 0.2835 - val_accuracy: 0.8862
Epoch 6/20 30/30 [==============================] - 2s 58ms/step - loss: 0.1358 - accuracy: 0.9569 - val_loss: 0.3212 - val_accuracy: 0.8725
Epoch 7/20 30/30 [==============================] - 2s 65ms/step - loss: 0.1184 - accuracy: 0.9631 - val_loss: 0.3048 - val_accuracy: 0.8793
Epoch 8/20 30/30 [==============================] - 2s 61ms/step - loss: 0.1000 - accuracy: 0.9695 - val_loss: 0.3149 - val_accuracy: 0.8811
Epoch 9/20 30/30 [==============================] - 1s 41ms/step - loss: 0.0861 - accuracy: 0.9749 - val_loss: 0.3366 - val_accuracy: 0.8834
Epoch 10/20 30/30 [==============================] - 2s 56ms/step - loss: 0.0746 - accuracy: 0.9780 - val_loss: 0.3619 - val_accuracy: 0.8708
Epoch 11/20 30/30 [==============================] - 2s 55ms/step - loss: 0.0660 - accuracy: 0.9827 - val_loss: 0.3676 - val_accuracy: 0.8783
Epoch 12/20 30/30 [==============================] - 2s 52ms/step - loss: 0.0575 - accuracy: 0.9855 - val_loss: 0.3926 - val_accuracy: 0.8737
Epoch 13/20 30/30 [==============================] - 1s 44ms/step - loss: 0.0465 - accuracy: 0.9896 - val_loss: 0.4083 - val_accuracy: 0.8760
Epoch 14/20 30/30 [==============================] - 1s 40ms/step - loss: 0.0418 - accuracy: 0.9895 - val_loss: 0.4366 - val_accuracy: 0.8732
Epoch 15/20 30/30 [==============================] - 1s 40ms/step - loss: 0.0373 - accuracy: 0.9906 - val_loss: 0.4559 - val_accuracy: 0.8735
Epoch 16/20 30/30 [==============================] - 2s 73ms/step - loss: 0.0281 - accuracy: 0.9953 - val_loss: 0.4732 - val_accuracy: 0.8741
Epoch 17/20 30/30 [==============================] - 2s 53ms/step - loss: 0.0269 - accuracy: 0.9951 - val_loss: 0.5042 - val_accuracy: 0.8730
Epoch 18/20 30/30 [==============================] - 2s 54ms/step - loss: 0.0247 - accuracy: 0.9955 - val_loss: 0.5203 - val_accuracy: 0.8721
Epoch 19/20 30/30 [==============================] - 2s 54ms/step - loss: 0.0180 - accuracy: 0.9980 - val_loss: 0.5411 - val_accuracy: 0.8714
Epoch 20/20 30/30 [==============================] - 1s 43ms/step - loss: 0.0154 - accuracy: 0.9983 - val_loss: 0.5657 - val_accuracy: 0.8704
After 20 epochs, accuracy on the training set reaches 0.9983, but accuracy on the validation set is only 0.8704.
Visualizing the fit
Plot training and validation loss
Plot training and validation accuracy
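Both plots can be sketched with one matplotlib helper. The function name plot_history and the styling are choices made for this sketch; the numbers in the example dict are the first five epochs copied from the training log above, and in practice the dict would be the history attribute of the object returned by model.fit().

```python
import matplotlib.pyplot as plt

def plot_history(history_dict):
    """Plot loss and accuracy curves for training and validation, side by side."""
    epochs = range(1, len(history_dict["loss"]) + 1)
    fig, (ax_loss, ax_acc) = plt.subplots(1, 2, figsize=(12, 4))

    ax_loss.plot(epochs, history_dict["loss"], "bo", label="Training loss")
    ax_loss.plot(epochs, history_dict["val_loss"], "b", label="Validation loss")
    ax_loss.set_title("Training and validation loss")
    ax_loss.set_xlabel("Epochs")
    ax_loss.set_ylabel("Loss")
    ax_loss.legend()

    ax_acc.plot(epochs, history_dict["accuracy"], "bo", label="Training accuracy")
    ax_acc.plot(epochs, history_dict["val_accuracy"], "b", label="Validation accuracy")
    ax_acc.set_title("Training and validation accuracy")
    ax_acc.set_xlabel("Epochs")
    ax_acc.set_ylabel("Accuracy")
    ax_acc.legend()
    return fig

# First five epochs from the training log.
history_dict = {
    "loss": [0.5066, 0.3095, 0.2321, 0.1899, 0.1579],
    "val_loss": [0.3772, 0.3270, 0.2788, 0.2989, 0.2835],
    "accuracy": [0.7891, 0.8973, 0.9209, 0.9361, 0.9477],
    "val_accuracy": [0.8726, 0.8708, 0.8913, 0.8797, 0.8862],
}
fig = plot_history(history_dict)
```

When run as a script instead of in a notebook, call plt.show() afterwards to display the figure.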
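The plots show that the model starts to overfit after roughly the fourth epoch: validation loss stops improving while training loss keeps falling.
Improving the model
Stop training after 4 epochs
Because the model starts to overfit after the fourth epoch, train it for only 4 epochs and then evaluate it again.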
Evaluating the model
Accuracy was chosen earlier as the evaluation metric, so the model is judged by its average accuracy over the entire test set.
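The shortened training run and the test-set evaluation can be sketched together. This assumes a freshly built, untrained model is passed in, since continuing to train the 20-epoch model would not undo the overfitting; whether the original run retrained on the full training set is not stated, so treat that as an open choice. The data and the tiny model below are random stand-ins so the block runs on its own.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def train_and_evaluate(model, x_train, y_train, x_test, y_test,
                       epochs=4, batch_size=512):
    """Fit an untrained, compiled model for a few epochs, then score it on test data."""
    model.fit(x_train, y_train, epochs=epochs, batch_size=batch_size, verbose=0)
    # With metrics=["accuracy"], evaluate() returns [loss, accuracy].
    test_loss, test_accuracy = model.evaluate(x_test, y_test, verbose=0)
    return test_loss, test_accuracy

# Stand-in data and model, for illustration only.
rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=(300, 50)).astype("float32")
y = rng.integers(0, 2, size=(300,)).astype("float32")
fresh_model = keras.Sequential([
    keras.Input(shape=(50,)),
    layers.Dense(1, activation="sigmoid"),
])
fresh_model.compile(optimizer="rmsprop", loss="binary_crossentropy",
                    metrics=["accuracy"])

test_loss, test_accuracy = train_and_evaluate(
    fresh_model, x[:200], y[:200], x[200:], y[200:])
print(f"test loss {test_loss:.4f}, test accuracy {test_accuracy:.4f}")
```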
The model's accuracy on the test set is 0.8632, which shows that it still has room for improvement.
Using the model for prediction
Given a new review, encode it into a vector of 0s and 1s in the same way as above, then call model.predict() on it.
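A minimal sketch of the prediction step, assuming the review has already been converted to integer word indices with the same vocabulary as the training data. The helper name is illustrative, and the model here is an untrained stand-in, so its outputs are meaningless; only the shape and range of the result are of interest.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def predict_positive_probability(model, encoded_reviews, dimension=10000):
    """Multi-hot encode integer-encoded reviews and return P(positive) for each."""
    x = np.zeros((len(encoded_reviews), dimension), dtype="float32")
    for row, review in enumerate(encoded_reviews):
        x[row, review] = 1.0
    # predict() returns shape (num_reviews, 1); flatten to one value per review.
    return model.predict(x, verbose=0).reshape(-1)

# Untrained stand-in model, for illustration only.
stand_in_model = keras.Sequential([
    keras.Input(shape=(10000,)),
    layers.Dense(1, activation="sigmoid"),
])
probabilities = predict_positive_probability(
    stand_in_model, [[1, 14, 22, 16], [1, 48, 9]])
print(probabilities.shape)  # (2,)
```

A value above 0.5 would be read as a positive review and a value below 0.5 as a negative one.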
Deploying the model
Not deployed.