Softmax + Cross-Entropy

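Consider the generalized softmax function defined below, where $z_{i}$ are the logits that produce the distribution $q$ and $T$ is the temperature, a concept borrowed from the Boltzmann distribution in statistical mechanics. It is easy to show that as $T$ tends to 0 the softmax output converges to a one-hot vector (provided the largest logit is unique), while as $T$ tends to infinity the output becomes "softer" and approaches the uniform distribution.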

$q_{i}=\frac{\exp \left(z_{i} / T\right)}{\sum_{j} \exp \left(z_{j} / T\right)}$
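The effect of the temperature can be seen in a small numerical sketch. It uses only the Python standard library; the function name `softmax_with_temperature` and the example logits are illustrative assumptions, not part of the original text.

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Return q_i = exp(z_i / T) / sum_j exp(z_j / T) for a list of logits."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtracting the max avoids overflow and leaves q unchanged
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

z = [2.0, 1.0, 0.1]  # hypothetical logits, chosen only for illustration

# A small T pushes the output toward one-hot; a large T pushes it toward uniform.
for T in (0.05, 1.0, 10.0, 1000.0):
    q = softmax_with_temperature(z, T)
    print(T, [round(qi, 4) for qi in q])
```

With a small $T$ almost all of the probability mass lands on the largest logit, and with a very large $T$ the three probabilities are nearly equal.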

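In knowledge distillation, therefore, the new model can be trained with a fairly high $T$ so that the softmax distributions are soft enough, and the new model's softmax output is trained to approximate that of the original model. Once training is finished, the normal temperature $T=1$ is used for prediction. Denote the distribution produced by the new model by $q$ and the distribution produced by the original model by $p$, and let $v_{i}$ be the logits of $p$. (The derivation below also covers ordinary training from scratch on a dataset: setting $T$ to 1 and taking $p$ to be a one-hot vector recovers the usual softmax + cross-entropy loss.)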

The loss function to be minimized is $C$:

$C=-p^{\top} \log q$
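As a minimal sketch of this loss, the snippet below evaluates $C=-p^{\top} \log q$ both for soft targets at a raised temperature and for the one-hot, $T=1$ special case. The logits `v` and `z`, the value `T = 4.0`, and the helper names are hypothetical values chosen only for illustration.

```python
import math

def softmax(logits, T=1.0):
    """Temperature softmax: q_i = exp(z_i / T) / sum_j exp(z_j / T)."""
    scaled = [x / T for x in logits]
    m = max(scaled)  # numerical stability; does not change the result
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(p, q):
    """C = -sum_i p_i * log(q_i); entries with p_i == 0 contribute nothing."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0.0)

T = 4.0                   # assumed distillation temperature
v = [5.0, 2.0, -1.0]      # hypothetical logits of the original model
z = [3.0, 1.0, 0.5]       # hypothetical logits of the new model

p = softmax(v, T)         # soft targets from the original model
q = softmax(z, T)         # soft predictions from the new model
distill_loss = cross_entropy(p, q)

# Special case: T = 1 and a one-hot p give the usual classification loss.
hard_loss = cross_entropy([1.0, 0.0, 0.0], softmax(z, 1.0))

print(distill_loss, hard_loss)
```

In the one-hot case the sum collapses to a single term, the negative log-probability of the true class.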

Next we compute the partial derivative of $C$ with respect to $z$. By the chain rule:

$\frac{\partial C}{\partial z}=\frac{\partial q}{\partial z} \frac{\partial C}{\partial q}$

$p$ is the softmax output of the original model and does not depend on $q$, so it is treated as a constant:

$\frac{\partial C}{\partial q_{i}}=-\frac{p_{i}}{q_{i}}$

$\frac{\partial C}{\partial q}$ is an $n$-dimensional vector:

$\frac{\partial C}{\partial q}=\left[\begin{array}{c}{-\frac{p_{1}}{q_{1}}} \\ {-\frac{p_{2}}{q_{2}}} \\ {\vdots} \\ {-\frac{p_{n}}{q_{n}}}\end{array}\right]$

$\frac{\partial q}{\partial z}$ is an $n \times n$ square matrix. Writing $Z=\sum_{k} \exp \left(z_{k} / T\right)$, the quotient rule gives the partial derivative of $q_{i}$ with respect to $z_{j}$:

$\frac{\partial q_{i}}{\partial z_{j}}=\frac{1}{Z^{2}}\left(Z \frac{\partial \exp \left(z_{i} / T\right)}{\partial z_{j}}-\exp \left(z_{i} / T\right)\left[\frac{\partial Z}{\partial z_{j}}\right]\right)$

The bracketed term on the right expands to

$\frac{\partial Z}{\partial z_{j}}=\frac{1}{T} \exp \left(z_{j} / T\right)$

Substituting this into the expression above and expanding the parentheses gives:

$\begin{aligned} \frac{\partial q_{i}}{\partial z_{j}} &=\frac{1}{Z} \frac{\partial \exp \left(z_{i} / T\right)}{\partial z_{j}}-\frac{1}{T Z^{2}} \exp \left(z_{i} / T\right) \exp \left(z_{j} / T\right) \\ &=\frac{1}{Z} \frac{\partial \exp \left(z_{i} / T\right)}{\partial z_{j}}-\frac{1}{T} \frac{\exp \left(z_{i} / T\right)}{Z} \frac{\exp \left(z_{j} / T\right)}{Z} \\ &=\frac{1}{Z}\left[\frac{\partial \exp \left(z_{i} / T\right)}{\partial z_{j}}\right]-\frac{1}{T} q_{i} q_{j} \end{aligned}$

The bracketed term on the left is handled case by case:

$\frac{\partial \exp \left(z_{i} / T\right)}{\partial z_{j}}=\left\{\begin{array}{ll}{\frac{1}{T} \exp \left(z_{i} / T\right),} & {\text { if } i=j} \\ {0,} & {\text { if } i \neq j}\end{array}\right.$

Substituting into the expression above gives:

$\begin{aligned} \frac{\partial q_{i}}{\partial z_{j}} &=\left\{\begin{array}{ll}{\frac{1}{T}\left(\frac{\exp \left(z_{i} / T\right)}{Z}-q_{i} q_{j}\right),} & {\text { if } i=j} \\ {-\frac{1}{T} q_{i} q_{j},} & {\text { if } i \neq j}\end{array}\right. \\ &=\left\{\begin{array}{ll}{\frac{1}{T}\left(q_{i}-q_{i} q_{j}\right),} & {\text { if } i=j} \\ {-\frac{1}{T} q_{i} q_{j},} & {\text { if } i \neq j}\end{array}\right.\end{aligned}$

So $\partial q / \partial z$, which is a symmetric matrix, equals:

$\frac{\partial q}{\partial z}=\frac{1}{T}\left[\begin{array}{cccc}{q_{1}-q_{1}^{2}} & {-q_{1} q_{2}} & {\cdots} & {-q_{1} q_{n}} \\ {-q_{2} q_{1}} & {q_{2}-q_{2}^{2}} & {\cdots} & {-q_{2} q_{n}} \\ {\vdots} & {\vdots} & {\ddots} & {\vdots} \\ {-q_{n} q_{1}} & {-q_{n} q_{2}} & {\cdots} & {q_{n}-q_{n}^{2}}\end{array}\right]$
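The closed form above can be sanity-checked against central finite differences. The following is only a sketch in plain Python; the helper names, the logits, and the temperature are assumptions made for illustration.

```python
import math

def softmax(logits, T=1.0):
    """Temperature softmax: q_i = exp(z_i / T) / sum_j exp(z_j / T)."""
    scaled = [x / T for x in logits]
    m = max(scaled)  # numerical stability; does not change the result
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(logits, T=1.0):
    """Analytic Jacobian: J[i][j] = dq_i/dz_j = (q_i * [i == j] - q_i * q_j) / T."""
    q = softmax(logits, T)
    n = len(q)
    return [[((q[i] if i == j else 0.0) - q[i] * q[j]) / T for j in range(n)]
            for i in range(n)]

def numerical_jacobian(logits, T=1.0, eps=1e-6):
    """Central-difference approximation of the same Jacobian."""
    n = len(logits)
    J = [[0.0] * n for _ in range(n)]
    for j in range(n):
        up = list(logits)
        down = list(logits)
        up[j] += eps
        down[j] -= eps
        q_up, q_down = softmax(up, T), softmax(down, T)
        for i in range(n):
            J[i][j] = (q_up[i] - q_down[i]) / (2 * eps)
    return J

z = [2.0, 1.0, 0.1]  # hypothetical logits
T = 3.0              # hypothetical temperature
J = softmax_jacobian(z, T)
J_num = numerical_jacobian(z, T)
max_err = max(abs(J[i][j] - J_num[i][j]) for i in range(3) for j in range(3))
print(max_err)  # should be tiny if the closed form is right
```

Each row of the Jacobian sums to zero, which reflects the constraint that the entries of $q$ always sum to 1.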

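This is why the derivative of the softmax function with respect to its input takes the following form (the case $T=1$), where $g(\cdot)$ is the softmax function, $\mathbf{x}$ is the input vector of dimension $d$, and $\hat{\mathbf{y}}=g(\mathbf{x})$ is its output.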

$\frac{\partial g(\mathbf{x})}{\partial \mathbf{x}}=\operatorname{diag}(\hat{\mathbf{y}})-\hat{\mathbf{y}} \hat{\mathbf{y}}^{\top} \quad \in \mathbb{R}^{d \times d}$

$\frac{\partial g(\mathbf{x})}{\partial \mathbf{x}}=\left[\begin{array}{cccc}{\hat{y}_{1}} & {0} & {\cdots} & {0} \\ {0} & {\hat{y}_{2}} & {\cdots} & {0} \\ {\vdots} & {\vdots} & {\ddots} & {\vdots} \\ {0} & {0} & {\cdots} & {\hat{y}_{d}}\end{array}\right]-\left[\begin{array}{cccc}{\hat{y}_{1}^{2}} & {\hat{y}_{1} \hat{y}_{2}} & {\cdots} & {\hat{y}_{1} \hat{y}_{d}} \\ {\hat{y}_{2} \hat{y}_{1}} & {\hat{y}_{2}^{2}} & {\cdots} & {\hat{y}_{2} \hat{y}_{d}} \\ {\vdots} & {\vdots} & {\ddots} & {\vdots} \\ {\hat{y}_{d} \hat{y}_{1}} & {\hat{y}_{d} \hat{y}_{2}} & {\cdots} & {\hat{y}_{d}^{2}}\end{array}\right]$

Returning to our problem and continuing the derivation, we obtain:

$\begin{aligned} \frac{\partial C}{\partial z} &=\frac{1}{T}\left[\begin{array}{cccc}{q_{1}-q_{1}^{2}} & {-q_{1} q_{2}} & {\cdots} & {-q_{1} q_{n}} \\ {-q_{2} q_{1}} & {q_{2}-q_{2}^{2}} & {\cdots} & {-q_{2} q_{n}} \\ {\vdots} & {\vdots} & {\ddots} & {\vdots} \\ {-q_{n} q_{1}} & {-q_{n} q_{2}} & {\cdots} & {q_{n}-q_{n}^{2}}\end{array}\right]\left[\begin{array}{c}{-\frac{p_{1}}{q_{1}}} \\ {-\frac{p_{2}}{q_{2}}} \\ {\vdots} \\ {-\frac{p_{n}}{q_{n}}}\end{array}\right] \\ &=\frac{1}{T}\left[\begin{array}{c}{-p_{1}+q_{1} \sum_{k} p_{k}} \\ {-p_{2}+q_{2} \sum_{k} p_{k}} \\ {\vdots} \\ {-p_{n}+q_{n} \sum_{k} p_{k}}\end{array}\right] \\ &=\frac{1}{T}\left[\begin{array}{c}{-p_{1}+q_{1}} \\ {-p_{2}+q_{2}} \\ {\vdots} \\ {-p_{n}+q_{n}}\end{array}\right] \\ &=\frac{1}{T}(q-p) \end{aligned}$

The third line uses $\sum_{k} p_{k}=1$, since $p$ is a probability distribution.

Therefore:

$\frac{\partial C}{\partial z_{i}}=\frac{1}{T}\left(q_{i}-p_{i}\right)$
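The final result can be verified numerically by comparing $\frac{1}{T}(q-p)$ with a central-difference gradient of the loss. This is a rough check under assumed inputs: the logits, the temperature, and all helper names below are hypothetical.

```python
import math

def softmax(logits, T=1.0):
    """Temperature softmax: q_i = exp(z_i / T) / sum_j exp(z_j / T)."""
    scaled = [x / T for x in logits]
    m = max(scaled)  # numerical stability; does not change the result
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def loss(p, z, T):
    """C = -p^T log q, with q = softmax(z / T)."""
    q = softmax(z, T)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

def analytic_grad(p, z, T):
    """Closed form derived above: dC/dz_i = (q_i - p_i) / T."""
    q = softmax(z, T)
    return [(qi - pi) / T for qi, pi in zip(q, p)]

def numerical_grad(p, z, T, eps=1e-6):
    """Central-difference approximation of dC/dz."""
    grad = []
    for i in range(len(z)):
        up = list(z)
        down = list(z)
        up[i] += eps
        down[i] -= eps
        grad.append((loss(p, up, T) - loss(p, down, T)) / (2 * eps))
    return grad

T = 4.0                            # assumed temperature
p = softmax([5.0, 2.0, -1.0], T)   # soft targets from hypothetical logits v
z = [3.0, 1.0, 0.5]                # hypothetical logits of the new model

g_a = analytic_grad(p, z, T)
g_n = numerical_grad(p, z, T)
max_err = max(abs(a - b) for a, b in zip(g_a, g_n))
print(g_a)
print(max_err)  # should be tiny if the derivation holds
```

Setting `T = 1.0` and passing a one-hot `p` to `analytic_grad` gives the familiar `q - p` gradient of ordinary softmax + cross-entropy training.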

Reference: https://zhuanlan.zhihu.com/p/90049906