Knowledge Distillation Survey: Code Collection
First, average pooling is used to bring the feature dimensions into agreement; an MSE loss then measures the gap between the two feature maps.
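A minimal sketch of that recipe, assuming teacher and student features already have matching channel counts (the helper name hint_loss is mine, not from any paper's code):

import torch
import torch.nn.functional as F

def hint_loss(f_s, f_t):
    # Align spatial resolution: pool the larger map down to the smaller one.
    s_H, t_H = f_s.shape[2], f_t.shape[2]
    if s_H > t_H:
        f_s = F.adaptive_avg_pool2d(f_s, (t_H, t_H))
    elif s_H < t_H:
        f_t = F.adaptive_avg_pool2d(f_t, (s_H, s_H))
    # Measure the remaining gap with mean-squared error.
    return F.mse_loss(f_s, f_t)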
4. SP: Similarity-Preserving. Paper: Similarity-Preserving Knowledge Distillation
URL:
Venue: ICCV 2019
SP is a relation-based knowledge distillation method. The idea is similarity-preserving knowledge: inputs that produce similar activations in the teacher network should also produce similar activations in the student network. Concretely, the teacher's and the student's feature maps are each flattened and multiplied with their own transpose (an inner product), yielding a bs x bs similarity matrix per network; a mean-squared error then measures the distance between the two similarity matrices.
The final loss is shown below, where G denotes the bs x bs similarity matrix.
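The loss formula did not survive extraction; reconstructed from the paper and the code below (b is the batch size, the tilde denotes row-wise L2 normalization, and $\|\cdot\|_F$ the Frobenius norm):

$$G^{(s)} = f^{(s)} {f^{(s)}}^{\top}, \qquad \mathcal{L}_{SP} = \frac{1}{b^2}\,\big\|\tilde{G}^{(s)} - \tilde{G}^{(t)}\big\|_F^2$$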
Implementation:
# Common imports for all snippets in this post.
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F


class Similarity(nn.Module):
    """Similarity-Preserving Knowledge Distillation, ICCV2019, verified by original author"""
    def __init__(self):
        super(Similarity, self).__init__()

    def forward(self, g_s, g_t):
        # g_s / g_t are lists of feature maps taken from several layers.
        return [self.similarity_loss(f_s, f_t) for f_s, f_t in zip(g_s, g_t)]

    def similarity_loss(self, f_s, f_t):
        bsz = f_s.shape[0]
        f_s = f_s.view(bsz, -1)
        f_t = f_t.view(bsz, -1)

        # bs x bs similarity (Gram) matrices, row-normalized.
        G_s = torch.mm(f_s, torch.t(f_s))
        # G_s = G_s / G_s.norm(2)
        G_s = torch.nn.functional.normalize(G_s)
        G_t = torch.mm(f_t, torch.t(f_t))
        # G_t = G_t / G_t.norm(2)
        G_t = torch.nn.functional.normalize(G_t)

        G_diff = G_t - G_s
        loss = (G_diff * G_diff).view(-1, 1).sum(0) / (bsz * bsz)
        return loss

5. CC: Correlation Congruence. Paper: Correlation Congruence for Knowledge Distillation
URL:
Venue: ICCV 2019
CC is also a relation-based knowledge distillation method. It argues that the student should not only minimize the gap between the teacher's and the student's embeddings of individual samples, but also learn the correlations between pairs of samples; the correlation congruence term measures the Euclidean distance between the teacher's and the student's correlation matrices.
The overall loss is as follows:
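The overall-loss image is missing; as far as I can reconstruct it from the paper, it combines the usual classification and KD terms with the correlation-congruence term, roughly:

$$\mathcal{L} = \alpha \mathcal{L}_{CE} + (1-\alpha)\,\mathcal{L}_{KD} + \beta\, \mathcal{L}_{CC}, \qquad \mathcal{L}_{CC} = \frac{1}{b^2}\,\big\|\psi(F^{(t)}) - \psi(F^{(s)})\big\|_2^2$$

where $\psi(\cdot)$ builds the b x b correlation matrix of a batch of embeddings.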
Implementation:
class Correlation(nn.Module):
    """Similarity-preserving loss. My original own reimplementation
    based on the paper before emailing the original authors."""
    def __init__(self):
        super(Correlation, self).__init__()

    def forward(self, f_s, f_t):
        return self.similarity_loss(f_s, f_t)

    def similarity_loss(self, f_s, f_t):
        bsz = f_s.shape[0]
        f_s = f_s.view(bsz, -1)
        f_t = f_t.view(bsz, -1)

        # bs x bs correlation (Gram) matrices, each scaled by its L2 norm.
        G_s = torch.mm(f_s, torch.t(f_s))
        G_s = G_s / G_s.norm(2)
        G_t = torch.mm(f_t, torch.t(f_t))
        G_t = G_t / G_t.norm(2)

        G_diff = G_t - G_s
        loss = (G_diff * G_diff).view(-1, 1).sum(0) / (bsz * bsz)
        return loss

6. VID: Variational Information Distillation. Paper: Variational Information Distillation for Knowledge Transfer
URL:
Venue: CVPR 2019
VID uses mutual information to measure the dependence between the teacher and the student. Mutual information quantifies how much two variables depend on each other: the larger its value, the stronger the dependence. It is computed as I(t; s) = H(t) - H(t|s).
That is, mutual information equals the teacher's entropy minus the teacher's conditional entropy given the student. The objective is to maximize mutual information: the larger it is, the smaller H(t|s), meaning that once the student is known there is little uncertainty left about the teacher, which indicates that the student has learned the teacher's knowledge sufficiently.
The overall loss is as follows:
Since p(t|s) is intractable, a variational distribution q(t|s) is used to approximate the true distribution.
Here q(t|s) is modeled as a Gaussian with learnable variance (the log_scale term in the code below):
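Putting the pieces above together (reconstructed from the paper; the notation matches the code below, where $\mu(\cdot)$ is the regressor and $\sigma$ comes from log_scale through a softplus):

$$I(t;s) = H(t) - H(t|s) \ \geq\ H(t) + \mathbb{E}_{t,s}\left[\log q(t|s)\right]$$

$$-\log q(t|s) = \sum_{c} \left( \log \sigma_c + \frac{(t_c - \mu_c(s))^2}{2\sigma_c^2} \right) + \text{const}$$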
Implementation:
class VIDLoss(nn.Module):
    """Variational Information Distillation for Knowledge Transfer (CVPR 2019),
    code from author: """
    def __init__(self,
                 num_input_channels,
                 num_mid_channel,
                 num_target_channels,
                 init_pred_var=5.0,
                 eps=1e-5):
        super(VIDLoss, self).__init__()

        def conv1x1(in_channels, out_channels, stride=1):
            return nn.Conv2d(
                in_channels, out_channels,
                kernel_size=1, padding=0,
                bias=False, stride=stride)

        # Variational regressor mu(s): predicts the teacher feature from the student's.
        self.regressor = nn.Sequential(
            conv1x1(num_input_channels, num_mid_channel),
            nn.ReLU(),
            conv1x1(num_mid_channel, num_mid_channel),
            nn.ReLU(),
            conv1x1(num_mid_channel, num_target_channels),
        )
        # Learnable per-channel variance, parameterized through a softplus.
        self.log_scale = torch.nn.Parameter(
            np.log(np.exp(init_pred_var - eps) - 1.0) * torch.ones(num_target_channels)
        )
        self.eps = eps

    def forward(self, input, target):
        # pool for dimension match
        s_H, t_H = input.shape[2], target.shape[2]
        if s_H > t_H:
            input = F.adaptive_avg_pool2d(input, (t_H, t_H))
        elif s_H < t_H:
            target = F.adaptive_avg_pool2d(target, (s_H, s_H))
        else:
            pass
        pred_mean = self.regressor(input)
        pred_var = torch.log(1.0 + torch.exp(self.log_scale)) + self.eps
        pred_var = pred_var.view(1, -1, 1, 1)
        # Negative log-likelihood of the teacher feature under the Gaussian q(t|s).
        neg_log_prob = 0.5 * (
            (pred_mean - target)**2 / pred_var + torch.log(pred_var)
        )
        loss = torch.mean(neg_log_prob)
        return loss

7. RKD: Relational Knowledge Distillation. Paper: Relational Knowledge Distillation
URL:
Venue: CVPR 2019
RKD is likewise a relation-based knowledge distillation method. It proposes two loss functions, a second-order distance loss and a third-order angle loss, given below.
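Reconstructed from the paper (with $\ell_\delta$ the Huber/smooth-L1 loss and $\mu$ the mean pairwise distance within the batch):

Distance-wise: $\psi_D(x_i, x_j) = \frac{1}{\mu}\lVert x_i - x_j\rVert_2$, $\quad \mathcal{L}_{RKD\text{-}D} = \sum_{(i,j)} \ell_\delta\big(\psi_D(s_i, s_j),\, \psi_D(t_i, t_j)\big)$

Angle-wise: $\psi_A(x_i, x_j, x_k) = \cos \angle x_i x_j x_k$, $\quad \mathcal{L}_{RKD\text{-}A} = \sum_{(i,j,k)} \ell_\delta\big(\psi_A(s_i, s_j, s_k),\, \psi_A(t_i, t_j, t_k)\big)$

The total loss is $w_d\, \mathcal{L}_{RKD\text{-}D} + w_a\, \mathcal{L}_{RKD\text{-}A}$.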
Implementation of the distance-wise and angle-wise losses:
class RKDLoss(nn.Module):
    """Relational Knowledge Distillation, CVPR2019"""
    def __init__(self, w_d=25, w_a=50):
        super(RKDLoss, self).__init__()
        self.w_d = w_d  # weight of the distance-wise term
        self.w_a = w_a  # weight of the angle-wise term

    def forward(self, f_s, f_t):
        student = f_s.view(f_s.shape[0], -1)
        teacher = f_t.view(f_t.shape[0], -1)

        # RKD distance loss: match normalized pairwise distances.
        with torch.no_grad():
            t_d = self.pdist(teacher, squared=False)
            mean_td = t_d[t_d > 0].mean()
            t_d = t_d / mean_td

        d = self.pdist(student, squared=False)
        mean_d = d[d > 0].mean()
        d = d / mean_d

        loss_d = F.smooth_l1_loss(d, t_d)

        # RKD angle loss: match cosines of angles formed by sample triplets.
        with torch.no_grad():
            td = (teacher.unsqueeze(0) - teacher.unsqueeze(1))
            norm_td = F.normalize(td, p=2, dim=2)
            t_angle = torch.bmm(norm_td, norm_td.transpose(1, 2)).view(-1)

        sd = (student.unsqueeze(0) - student.unsqueeze(1))
        norm_sd = F.normalize(sd, p=2, dim=2)
        s_angle = torch.bmm(norm_sd, norm_sd.transpose(1, 2)).view(-1)

        loss_a = F.smooth_l1_loss(s_angle, t_angle)

        loss = self.w_d * loss_d + self.w_a * loss_a
        return loss

    @staticmethod
    def pdist(e, squared=False, eps=1e-12):
        e_square = e.pow(2).sum(dim=1)
        prod = e @ e.t()
        res = (e_square.unsqueeze(1) + e_square.unsqueeze(0) - 2 * prod).clamp(min=eps)
        if not squared:
            res = res.sqrt()
        res = res.clone()
        res[range(len(e)), range(len(e))] = 0
        return res

8. PKT: Probabilistic Knowledge Transfer. Paper: Probabilistic Knowledge Transfer for deep representation learning
URL:
Venue: CoRR 2018
PKT proposes a probabilistic knowledge transfer method, introducing mutual information into the modeling. The method has several attractive properties: it supports cross-modal knowledge transfer, is agnostic to the task type, and can incorporate hand-crafted features into the network.
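A sketch of the modeling, as far as it can be reconstructed from the code below: pairwise cosine similarities within the batch are shifted to $[0,1]$ and row-normalized into conditional probabilities, and the student's distribution is matched to the teacher's with a KL divergence:

$$K(x_i,x_j)=\frac{1+\cos(x_i,x_j)}{2}, \qquad p_{ij} = \frac{K(x_i, x_j)}{\sum_{j'} K(x_i, x_{j'})}, \qquad \mathcal{L}_{PKT} = \sum_{i,j} p^{(t)}_{ij} \log \frac{p^{(t)}_{ij}}{p^{(s)}_{ij}}$$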
Implementation:
class PKT(nn.Module):
    """Probabilistic Knowledge Transfer for deep representation learning
    Code from author: _kt"""
    def __init__(self):
        super(PKT, self).__init__()

    def forward(self, f_s, f_t):
        return self.cosine_similarity_loss(f_s, f_t)

    @staticmethod
    def cosine_similarity_loss(output_net, target_net, eps=0.0000001):
        # Normalize each vector by its norm
        output_net_norm = torch.sqrt(torch.sum(output_net ** 2, dim=1, keepdim=True))
        output_net = output_net / (output_net_norm + eps)
        output_net[output_net != output_net] = 0  # replace NaNs with 0

        target_net_norm = torch.sqrt(torch.sum(target_net ** 2, dim=1, keepdim=True))
        target_net = target_net / (target_net_norm + eps)
        target_net[target_net != target_net] = 0

        # Calculate the cosine similarity
        model_similarity = torch.mm(output_net, output_net.transpose(0, 1))
        target_similarity = torch.mm(target_net, target_net.transpose(0, 1))

        # Scale cosine similarity to 0..1
        model_similarity = (model_similarity + 1.0) / 2.0
        target_similarity = (target_similarity + 1.0) / 2.0

        # Transform them into probabilities
        model_similarity = model_similarity / torch.sum(model_similarity, dim=1, keepdim=True)
        target_similarity = target_similarity / torch.sum(target_similarity, dim=1, keepdim=True)

        # Calculate the KL-divergence
        loss = torch.mean(target_similarity * torch.log((target_similarity + eps) / (model_similarity + eps)))

        return loss

9. AB: Activation Boundaries. Paper: Knowledge Transfer via Distillation of Activation Boundaries Formed by Hidden Neurons
URL:
Venue: AAAI 2019
Goal: make the activation boundaries of the student network's neurons match the teacher's as closely as possible. The so-called activation boundary is the separating hyperplane (for ReLU-style activation functions) that determines whether a neuron is activated or deactivated. AB proposes an activation-transfer loss that aligns the separating boundaries of the student and the teacher.
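The paper's alternative (margin) loss, reconstructed to match the code below: with margin $m$, student pre-activation $s$ and teacher pre-activation $t$, each element contributes

$$\ell(s,t) = (s+m)^2\, \mathbb{1}\left[s > -m,\ t \le 0\right] \;+\; (s-m)^2\, \mathbb{1}\left[s \le m,\ t > 0\right]$$

i.e. the student is pushed below $-m$ where the teacher neuron is inactive and above $+m$ where it is active.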
Implementation:
class ABLoss(nn.Module):
    """Knowledge Transfer via Distillation of Activation Boundaries
    Formed by Hidden Neurons
    code: _distillation
    """
    def __init__(self, feat_num, margin=1.0):
        super(ABLoss, self).__init__()
        # Geometric weights: deeper feature pairs get larger weights.
        self.w = [2**(i-feat_num+1) for i in range(feat_num)]
        self.margin = margin

    def forward(self, g_s, g_t):
        bsz = g_s[0].shape[0]
        losses = [self.criterion_alternative_l2(s, t) for s, t in zip(g_s, g_t)]
        losses = [w * l for w, l in zip(self.w, losses)]
        # loss = sum(losses) / bsz
        # loss = loss / 1000 * 3
        losses = [l / bsz for l in losses]
        losses = [l / 1000 * 3 for l in losses]
        return losses

    def criterion_alternative_l2(self, source, target):
        # Penalize the student where its pre-activation disagrees with the
        # teacher's activation/deactivation decision, with a margin.
        loss = ((source + self.margin) ** 2 * ((source > -self.margin) & (target <= 0)).float()
                + (source - self.margin) ** 2 * ((source <= self.margin) & (target > 0)).float())
        return torch.abs(loss).sum()

10. FT: Factor Transfer. Paper: Paraphrasing Complex Network: Network Compression via Factor Transfer
URL:
Venue: NeurIPS 2018
FT proposes the factor-transfer method. The so-called factor is obtained by running the model's final feature map through an encode/decode step (the paraphraser), extracting a factor representation; the teacher's factor is then used to guide the student's factor.
The FT loss is computed as:
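The formula image is missing; the loss should be an $\ell_p$ distance between normalized factors,

$$\mathcal{L}_{FT} = \left\| \frac{F^{(t)}}{\lVert F^{(t)}\rVert_2} - \frac{F^{(s)}}{\lVert F^{(s)}\rVert_2} \right\|_p$$

with the caveat that in the paper the factors come from a trained paraphraser/translator, while the simplified implementation below builds them from a channel-wise power mean of the feature map.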
Implementation:
class FactorTransfer(nn.Module):
    """Paraphrasing Complex Network: Network Compression via Factor Transfer, NeurIPS 2018"""
    def __init__(self, p1=2, p2=1):
        super(FactorTransfer, self).__init__()
        self.p1 = p1  # power applied when building the factor
        self.p2 = p2  # norm used to compare factors (1 -> L1, otherwise L^p2)

    def forward(self, f_s, f_t):
        return self.factor_loss(f_s, f_t)

    def factor_loss(self, f_s, f_t):
        s_H, t_H = f_s.shape[2], f_t.shape[2]
        if s_H > t_H:
            f_s = F.adaptive_avg_pool2d(f_s, (t_H, t_H))
        elif s_H < t_H:
            f_t = F.adaptive_avg_pool2d(f_t, (s_H, s_H))
        else:
            pass
        if self.p2 == 1:
            return (self.factor(f_s) - self.factor(f_t)).abs().mean()
        else:
            return (self.factor(f_s) - self.factor(f_t)).pow(self.p2).mean()

    def factor(self, f):
        # Simplified factor: channel-wise power mean, flattened and L2-normalized.
        return F.normalize(f.pow(self.p1).mean(1).view(f.size(0), -1))

11. FSP: Flow of Solution Procedure. Paper: A Gift from Knowledge Distillation: Fast Optimization, Network Minimization and Transfer Learning
URL:_cvpr_2017/papers/Yim_A_Gift_From_CVPR_2017_paper.pdf
Venue: CVPR 2017
FSP argues that teaching the student the relations between the features produced by different layers works better than teaching it the teacher's outputs directly.
It defines the FSP matrix, a Gram matrix between two feature layers inside a network, to capture the "flow of solution procedure", summarizing how the teacher solves the problem.
An L2 loss constrains the student's FSP matrices to match the teacher's.
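Reconstructed from the paper: for two layers with features $F^1 \in \mathbb{R}^{h \times w \times m}$ and $F^2 \in \mathbb{R}^{h \times w \times n}$, the FSP matrix and the loss are

$$G_{i,j}(x;W) = \sum_{s=1}^{h \times w} \frac{F^1_{s,i}(x;W)\, F^2_{s,j}(x;W)}{h \times w}, \qquad \mathcal{L}_{FSP} = \frac{1}{N}\sum_x \sum_{k} \lambda_k \left\| G^{(t)}_k(x) - G^{(s)}_k(x) \right\|_2^2$$

(the implementation below simply averages the squared differences over all matrix entries).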
Implementation:
class FSP(nn.Module):
    """A Gift from Knowledge Distillation:
    Fast Optimization, Network Minimization and Transfer Learning"""
    def __init__(self, s_shapes, t_shapes):
        super(FSP, self).__init__()
        assert len(s_shapes) == len(t_shapes), 'unequal length of feat list'
        s_c = [s[1] for s in s_shapes]
        t_c = [t[1] for t in t_shapes]
        if np.any(np.asarray(s_c) != np.asarray(t_c)):
            raise ValueError('num of channels not equal (error in FSP)')

    def forward(self, g_s, g_t):
        s_fsp = self.compute_fsp(g_s)
        t_fsp = self.compute_fsp(g_t)
        loss_group = [self.compute_loss(s, t) for s, t in zip(s_fsp, t_fsp)]
        return loss_group

    @staticmethod
    def compute_loss(s, t):
        return (s - t).pow(2).mean()

    @staticmethod
    def compute_fsp(g):
        # FSP matrix between each pair of consecutive feature maps.
        fsp_list = []
        for i in range(len(g) - 1):
            bot, top = g[i], g[i + 1]
            b_H, t_H = bot.shape[2], top.shape[2]
            if b_H > t_H:
                bot = F.adaptive_avg_pool2d(bot, (t_H, t_H))
            elif b_H < t_H:
                top = F.adaptive_avg_pool2d(top, (b_H, b_H))
            else:
                pass
            bot = bot.unsqueeze(1)
            top = top.unsqueeze(2)
            bot = bot.view(bot.shape[0], bot.shape[1], bot.shape[2], -1)
            top = top.view(top.shape[0], top.shape[1], top.shape[2], -1)

            fsp = (bot * top).mean(-1)
            fsp_list.append(fsp)
        return fsp_list

12. NST: Neuron Selectivity Transfer. Paper: Like What You Like: Knowledge Distill via Neuron Selectivity Transfer
URL:
Venue: CoRR 2017
NST uses a new loss function that minimizes the Maximum Mean Discrepancy (MMD) between the teacher and the student; what gets aligned is the distribution of neuron-selectivity patterns across the two networks.
Applying the kernel trick (e.g., with a polynomial kernel) and expanding gives:
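The expanded form should be the standard squared-MMD estimate between the two sets of (normalized) neuron activation patterns $\{t^i\}$ and $\{s^j\}$:

$$\mathcal{L}_{MMD^2} = \frac{1}{C_T^2}\sum_{i,i'} k(t^i, t^{i'}) + \frac{1}{C_S^2}\sum_{j,j'} k(s^j, s^{j'}) - \frac{2}{C_T C_S}\sum_{i,j} k(t^i, s^j)$$

where the polynomial kernel used below is $k(x,y) = (x^\top y + c)^d$ with $d=2$, $c=0$.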
The paper actually covers three kernels: linear, polynomial, and Gaussian. The implementation here only provides the polynomial kernel, because the polynomial variant can be combined with vanilla KD, which gives better overall results.
Implementation:
class NSTLoss(nn.Module):
    """like what you like: knowledge distill via neuron selectivity transfer"""
    def __init__(self):
        super(NSTLoss, self).__init__()
        pass

    def forward(self, g_s, g_t):
        return [self.nst_loss(f_s, f_t) for f_s, f_t in zip(g_s, g_t)]

    def nst_loss(self, f_s, f_t):
        s_H, t_H = f_s.shape[2], f_t.shape[2]
        if s_H > t_H:
            f_s = F.adaptive_avg_pool2d(f_s, (t_H, t_H))
        elif s_H < t_H:
            f_t = F.adaptive_avg_pool2d(f_t, (s_H, s_H))
        else:
            pass

        # Each channel's spatial activation map is one "neuron selectivity pattern".
        f_s = f_s.view(f_s.shape[0], f_s.shape[1], -1)
        f_s = F.normalize(f_s, dim=2)
        f_t = f_t.view(f_t.shape[0], f_t.shape[1], -1)
        f_t = F.normalize(f_t, dim=2)

        # set full_loss as False to avoid unnecessary computation
        full_loss = True
        if full_loss:
            return (self.poly_kernel(f_t, f_t).mean().detach()
                    + self.poly_kernel(f_s, f_s).mean()
                    - 2 * self.poly_kernel(f_s, f_t).mean())
        else:
            return self.poly_kernel(f_s, f_s).mean() - 2 * self.poly_kernel(f_s, f_t).mean()

    def poly_kernel(self, a, b):
        a = a.unsqueeze(1)
        b = b.unsqueeze(2)
        res = (a * b).sum(-1).pow(2)
        return res

13. CRD: Contrastive Representation Distillation. Paper: Contrastive Representation Distillation
URL:
Venue: ICLR 2020
CRD introduces contrastive learning into knowledge distillation. The goal is to learn a representation that pulls the teacher's and the student's embeddings of a positive pair (the same input) close together, while pushing apart the embeddings of negative pairs (different inputs).
The contrastive learning objective is formulated as follows:
The overall distillation loss is:
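As far as I can reconstruct from the paper, a critic $h(T,S)\in[0,1]$ estimates whether a teacher/student pair comes from the same input, and the objective (the code's Eq. (18)) maximizes

$$\mathbb{E}_{q(T,S|C=1)}\left[\log h(T,S)\right] + N\,\mathbb{E}_{q(T,S|C=0)}\left[\log\left(1 - h(T,S)\right)\right]$$

where $C=1$ marks a congruent pair, $C=0$ an incongruent one, and $N$ is the number of negatives per positive.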
Implementation:
# Numerical-stability constant used by ContrastLoss (module-level in the original repo).
eps = 1e-7


class ContrastLoss(nn.Module):
    """
    contrastive loss, corresponding to Eq (18)
    """
    def __init__(self, n_data):
        super(ContrastLoss, self).__init__()
        self.n_data = n_data

    def forward(self, x):
        bsz = x.shape[0]
        m = x.size(1) - 1

        # noise distribution
        Pn = 1 / float(self.n_data)

        # loss for positive pair
        P_pos = x.select(1, 0)
        log_D1 = torch.div(P_pos, P_pos.add(m * Pn + eps)).log_()

        # loss for K negative pair
        P_neg = x.narrow(1, 1, m)
        log_D0 = torch.div(P_neg.clone().fill_(m * Pn), P_neg.add(m * Pn + eps)).log_()

        loss = - (log_D1.sum(0) + log_D0.view(-1, 1).sum(0)) / bsz

        return loss


class CRDLoss(nn.Module):
    """CRD Loss function includes two symmetric parts:
    (a) using teacher as anchor, choose positive and negatives over the student side
    (b) using student as anchor, choose positive and negatives over the teacher side

    Args:
        opt.s_dim: the dimension of student's feature
        opt.t_dim: the dimension of teacher's feature
        opt.feat_dim: the dimension of the projection space
        opt.nce_k: number of negatives paired with each positive
        opt.nce_t: the temperature
        opt.nce_m: the momentum for updating the memory buffer
        opt.n_data: the number of samples in the training set, therefore the memory buffer is: opt.n_data x opt.feat_dim
    """
    def __init__(self, opt):
        super(CRDLoss, self).__init__()
        # Embed and ContrastMemory are defined elsewhere in the author's CRD codebase.
        self.embed_s = Embed(opt.s_dim, opt.feat_dim)
        self.embed_t = Embed(opt.t_dim, opt.feat_dim)
        self.contrast = ContrastMemory(opt.feat_dim, opt.n_data, opt.nce_k, opt.nce_t, opt.nce_m)
        self.criterion_t = ContrastLoss(opt.n_data)
        self.criterion_s = ContrastLoss(opt.n_data)

    def forward(self, f_s, f_t, idx, contrast_idx=None):
        """
        Args:
            f_s: the feature of student network, size [batch_size, s_dim]
            f_t: the feature of teacher network, size [batch_size, t_dim]
            idx: the indices of these positive samples in the dataset, size [batch_size]
            contrast_idx: the indices of negative samples, size [batch_size, nce_k]

        Returns:
            The contrastive loss
        """
        f_s = self.embed_s(f_s)
        f_t = self.embed_t(f_t)
        out_s, out_t = self.contrast(f_s, f_t, idx, contrast_idx)
        s_loss = self.criterion_s(out_s)
        t_loss = self.criterion_t(out_t)
        loss = s_loss + t_loss
        return loss

14. Overhaul. Paper: A Comprehensive Overhaul of Feature Distillation
URL:_ICCV_2019/papers/
Venue: ICCV 2019
For the teacher transform, the paper proposes a margin-ReLU activation function; for the student transform, a 1x1 convolution. The distillation feature position is chosen pre-ReLU, and for the distance function the paper proposes a partial L2 loss that skips positions where the teacher is inactive and the student already lies below it. The overall implementation:
class OFD(nn.Module):
    '''
    A Comprehensive Overhaul of Feature Distillation
    _ICCV_2019/papers/
    Heo_A_Comprehensive_Overhaul_of_Feature_Distillation_ICCV_2019_paper.pdf
    '''
    def __init__(self, in_channels, out_channels):
        super(OFD, self).__init__()
        # Student transform: 1x1 conv + BN maps student features to the teacher's shape.
        self.connector = nn.Sequential(*[
            nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=1, padding=0, bias=False),
            nn.BatchNorm2d(out_channels)
        ])

        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
                if m.bias is not None:
                    nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)

    def forward(self, fm_s, fm_t):
        # Teacher transform: margin ReLU, i.e. max(x, margin) with a negative margin.
        margin = self.get_margin(fm_t)
        fm_t = torch.max(fm_t, margin)
        fm_s = self.connector(fm_s)

        # Partial L2: ignore positions where the teacher is inactive
        # and the student is already below the teacher.
        mask = 1.0 - ((fm_s <= fm_t) & (fm_t <= 0.0)).float()
        loss = torch.mean((fm_s - fm_t)**2 * mask)

        return loss

    def get_margin(self, fm, eps=1e-6):
        # Per-channel expectation of the teacher's negative responses.
        mask = (fm < 0.0).float()
        masked_fm = fm * mask
        margin = masked_fm.sum(dim=(0,2,3), keepdim=True) / (mask.sum(dim=(0,2,3), keepdim=True)+eps)
        return margin