2017-07-29
CVPR2017

Terminology

Because a DNN uses many layers sequentially to map from the input space to the output space, the flow of solving a problem can be defined as the relationship between features from two layers.

Gramian Matrix

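A Gramian matrix is generated by computing the inner products of a set of feature vectors. It captures the directional relationship between each pair of vectors and can be used to represent texture features.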
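To make the definition concrete, here is a minimal sketch in plain Python. The function name `gram_matrix` and the toy vectors are made up for illustration; they are not taken from the paper.

```python
def gram_matrix(vectors):
    """Return the Gram matrix G with G[i][j] = <vectors[i], vectors[j]>."""
    return [[sum(a * b for a, b in zip(u, v)) for v in vectors] for u in vectors]


# Three toy 2-D feature vectors (illustrative values only).
features = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
G = gram_matrix(features)
for row in G:
    print(row)
```

The entry for the first two vectors is 0 because they are orthogonal, which is the "directional relationship" the matrix records. The diagonal holds each vector's squared length.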

FSP matrix

The extracted feature maps from two layers are used to generate the flow of solution procedure (FSP) matrix. The student DNN is trained to make its FSP matrix similar to that of the teacher DNN.
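A rough sketch of how such a matrix could be computed, assuming the usual definition (not spelled out in these notes): the two feature maps share the same spatial size h x w, and entry (i, j) is the inner product of channel i of the first map with channel j of the second, averaged over all spatial positions. The name `fsp_matrix` and the toy inputs are hypothetical.

```python
def fsp_matrix(f1, f2):
    """FSP matrix between two feature maps with the same spatial size.

    f1 is an h x w x m nested list, f2 is an h x w x n nested list.
    Returns an m x n matrix G with
    G[i][j] = sum over (s, t) of f1[s][t][i] * f2[s][t][j] / (h * w).
    """
    h, w = len(f1), len(f1[0])
    if len(f2) != h or len(f2[0]) != w:
        raise ValueError("feature maps must share the same spatial size")
    m, n = len(f1[0][0]), len(f2[0][0])
    g = [[0.0] * n for _ in range(m)]
    for s in range(h):
        for t in range(w):
            for i in range(m):
                for j in range(n):
                    g[i][j] += f1[s][t][i] * f2[s][t][j]
    # Average over the h * w spatial positions.
    return [[v / (h * w) for v in row] for row in g]


# Toy 1 x 2 feature maps: 2 channels in the first layer, 3 in the second.
f1 = [[[1.0, 2.0], [3.0, 4.0]]]
f2 = [[[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]]
G = fsp_matrix(f1, f2)
print(G)
```

Under this definition the result has shape m x n, so its size depends only on the channel counts of the two layers, not on the spatial resolution.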

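Distilled Knowledge

If we regard the input of a DNN as a question and its output as the answer, then the features generated in between can be seen as intermediate results of the answering process. A teacher teaching a student works in a similar way: what the student most needs to learn is the method for solving a class of problems, not the solution to one particular problem. Therefore, what gets modeled is generally the procedure for solving the problem.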

Model

Stage 1: Learning the FSP matrix

Weights of the student and teacher networks: Ws, Wt
1: Ws = arg min_{Ws} L_FSP(Wt, Ws)  # i.e., minimize the loss that makes the student's FSP matrices match the teacher's
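The objective in stage 1 can be sketched as a distance between paired FSP matrices. The sketch below assumes a squared L2 (Frobenius) distance with an optional weight per matrix pair, evaluated for a single input sample. The name `fsp_loss`, the `weights` parameter, and the toy matrices are illustrative assumptions, not the author's code.

```python
def fsp_loss(teacher_fsps, student_fsps, weights=None):
    """Weighted sum of squared distances between paired FSP matrices.

    teacher_fsps and student_fsps are lists of matrices (nested lists),
    one per pair of layers; paired matrices must have the same shape.
    """
    if len(teacher_fsps) != len(student_fsps):
        raise ValueError("need the same number of teacher and student FSP matrices")
    if weights is None:
        weights = [1.0] * len(teacher_fsps)  # assumption: equal weight per pair
    total = 0.0
    for lam, g_t, g_s in zip(weights, teacher_fsps, student_fsps):
        sq_dist = sum(
            (a - b) ** 2
            for row_t, row_s in zip(g_t, g_s)
            for a, b in zip(row_t, row_s)
        )
        total += lam * sq_dist
    return total


# Toy example with a single pair of 2 x 2 FSP matrices.
teacher = [[[1.0, 0.0], [0.0, 1.0]]]
student = [[[0.5, 0.0], [0.0, 2.0]]]
loss = fsp_loss(teacher, student)
print(loss)
```

In practice this value would be averaged over a mini-batch and minimized with respect to Ws while Wt stays fixed.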


Stage 2: Training for the original task

1: Ws = arg min_{Ws} L_ori(Ws)  # e.g., for a classification task, softmax cross-entropy can serve as the task loss
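As an example of what L_ori might look like for classification, here is a minimal softmax cross-entropy for one sample, written in plain Python. The function name and the toy logits are assumptions for illustration; a real training loop would use the loss provided by the deep learning framework.

```python
import math


def softmax_cross_entropy(logits, label):
    """Cross-entropy of softmax(logits) against an integer class label."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    # -log(softmax(logits)[label]) = log_sum_exp - logits[label]
    return log_sum_exp - logits[label]


# Uniform logits over 3 classes: the loss equals log(3) for any label.
uniform_loss = softmax_cross_entropy([0.0, 0.0, 0.0], 1)
print(uniform_loss)
```

In stage 2 the student starts from the weights found in stage 1 and minimizes only this task loss.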


Experiments

Fast Optimization

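The symbol * indicates that each network was trained for 21,000 iterations; the original network was trained for 63,000 iterations. The double-plus symbol (++) indicates that the teacher network was trained for a further 21,000 iterations on top of the preceding 64,000 iterations. The dagger symbol (+-) indicates that in stage 1 the student network learned randomly shuffled FSP matrices. Student*+- indicates that the student network was trained for 21,000 iterations in stage 1 and 21,000 iterations in stage 2.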

As both teacher networks and student networks are of the same architecture, one can also transfer knowledge by directly copying weights. FSP is less restrictive than copying the weights and allows for better diversity and ensemble performance.


Network Minimization

Because the student DNN and teacher DNN had the same number of channels, the sizes of the FSP matrices were the same. By minimizing the distance between the FSP matrices of the student network and teacher network, we found a good initial weight for the student network. Then, the student network was trained to solve the main task.


Evaluation Metric

Recognition rates
