（测试版）使用 FX 构建卷积/批归一化融合器 - pytorch tutorials中文文档

PyTorch 入门指南

学习基础知识

快速入门

张量

数据集与数据加载器

变换操作

构建神经网络

自动微分与 torch.autograd

优化模型参数

保存和加载模型

PyTorch 自定义操作符

学习 PyTorch

PyTorch 深度学习实战：60 分钟快速入门教程

通过示例学习 PyTorch

torch.nn 究竟是什么？

从零开始的自然语言处理

使用 TensorBoard 可视化模型、数据和训练过程。

关于在 PyTorch 中使用非阻塞和 pin_memory() 的良好实践指南

图像和视频

TorchVision 目标检测微调教程

计算机视觉中的迁移学习教程

对抗样本生成

DCGAN教程

空间变换网络教程

优化视觉变压器模型以进行部署

使用 PyTorch 和 TIAToolbox 进行全-slide 图像分类

音频

音频输入输出

音频重采样

音频数据增强

音频特征提取

音频特征增强

音频数据集

基于 Wav2Vec2 的语音识别技术

基于Tacotron2的文本转语音系统

使用 Wav2Vec2 进行强制对齐

后端

ONNX 入门

强化学习

强化学习（DQN）教程

强化学习（PPO）与 TorchRL 教程

训练一个玩马里奥的游戏代理，使用强化学习方法。

Pendulum：用TorchRL编写环境和转换

在生产环境中部署 PyTorch 模型

ONNX 入门

通过 Flask 框架使用 REST API 在 Python 中部署 PyTorch

TorchScript简介

在 C++ 中加载 TorchScript 模型

（可选）将 PyTorch 模型导出为 ONNX，并使用 ONNX Runtime 进行运行。

在 Raspberry Pi 4 上实现实时推理（30 帧/秒！）

Profiling PyTorch

_profiling您的PyTorch模块_

Holistic Trace 分析介绍

使用整体痕迹分析的痕迹差异追踪或者更自然一些：基于整体痕迹分析的痕迹差异追踪

代码变换与FX

（测试版）在FX中构建卷积和批量归一化的融合器

（测试版）使用FX 构建简单的CPU性能剖析工具

前端API

(beta) PyTorch 中的 Channels Last 内存格式

前向模式自动微分（ Beta 版）

雅可比矩阵、海森矩阵、HVP、VHP 等：组合函数变换

模型集成

per-样本梯度

使用 PyTorch 的 C++ 前端

TorchScript中的动态并行计算

C++ 前端的自动微分

扩展 PyTorch

PyTorch 自定义操作符

Python 自定义运算符

自定义 C++ 和 CUDA 操作符

双反向传播与自定义函数

使用自定义函数将卷积和批量归一化融合在一起

自定义 C++ 和 CUDA 扩展

使用自定义 C++ 操作符扩展 TorchScript

使用自定义 C++ 类扩展 TorchScript

在 C++ 中注册一个调度操作符

在 C++ 中扩展调度器以支持新的后端

通过PrivateUse1简化新后端集成

模型优化

_profiling您的PyTorch模块_

使用 TensorBoard 的 PyTorch 分析器

使用 Ray Tune 进行超参数调优

优化视觉变压器模型以进行部署

参数化教程

剪枝教程

（测试版）LSTM 单词语言模型的动态量化

（测试版）BERT的动态量化

（测试版）计算机视觉中的量化迁移学习教程

（测试版）PyTorch 中的静态量化（带 Eager 模式）

从基础知识出发，掌握 PyTorch 在英特尔 CPU 上的性能

从基础知识出发，掌握 PyTorch 在英特尔 CPU 上的性能（第二部分）

入门 - 使用 nvFuser 加速您的脚本

使用 Ax 进行多目标神经架构搜索

torch.compile 介绍

编译的自动微分：为 torch.compile 捕获更大范围的反向图

Inductor CPU 后端调试与性能分析

（测试版）使用缩放点积注意力（SDPA）实现高性能变压器

知识蒸馏教程

并行和分布式训练

分布式和并行训练教程

PyTorch 分布式概述

PyTorch 分布式数据并行 - 视频教程

单机模型并行的最佳实践

分布式数据并行入门

使用 PyTorch 编写分布式应用程序

开始使用全 shards 数据并行 (FSDP)

使用全数据并行（FSDP）进行高级模型训练

Libuv TCPStore 后端简介

使用张量并行（TP）进行大规模变压器模型训练

分布式管道并行简介

使用 C++ 扩展自定义进程组后端

分布式RPC框架入门

使用分布式远程过程调用框架实现参数服务器

使用异步执行来实现批处理 RPC 处理

结合分布式数据并行和分布式远程过程调用框架

使用 Join 上下文管理器进行输入不均匀的分布式训练

边缘端的 ExecuTorch

导出到 ExecuTorch 教程

在 C++ 中运行 ExecuTorch 模型教程

使用 ExecuTorch 开发者工具进行模型性能分析

构建 ExecuTorch iOS 演示应用

构建一个 ExecuTorch Android 演示应用

将模型降级为委托

推荐系统

TorchRec 入门

探索 TorchRec 分片功能

多模态

TorchMultimodal教程：微调FLAVA

(测试版) 在 FX 中构建卷积/批归一化融合器 作者: Horace He 在本教程中，我们将使用 FX，一个用于 PyTorch 可组合函数变换的工具包，来完成以下任务： 在数据依赖关系中找到卷积/批归一化的模式。 对于在步骤1中找到的模式，将批归一化统计量折叠到卷积权重中。 请注意，此优化仅适用于处于推理模式下的模型（即 mode.eval()）。 我们将构建位于此处的融合器：https://github.com/pytorch/pytorch/blob/orig/release/1.8/torch/fx/experimental/fuser.py 首先，让我们先导入一些必要的库（稍后我们将在代码中使用这些库）。 fromtypingimport Type, Dict, Any, Tuple, Iterable importcopy importtorch.fxasfx importtorch importtorch.nnasnn 在本教程中，我们将创建一个包含卷积和批量归一化层的模型。需要注意的是，这个模型包含了一些复杂的组件——部分卷积/批量归一化模式被隐藏在 Sequential 中，并且其中一个 BatchNorm 被包裹在另一个 Module 中。 classWrappedBatchNorm(nn.Module): def__init__(self): super().__init__() self.mod = nn.BatchNorm2d(1) defforward(self, x): return self.mod(x) classM(nn.Module): def__init__(self): super().__init__() self.conv1 = nn.Conv2d(1, 1, 1) self.bn1 = nn.BatchNorm2d(1) self.conv2 = nn.Conv2d(1, 1, 1) self.nested = nn.Sequential( nn.BatchNorm2d(1), nn.Conv2d(1, 1, 1), ) self.wrapped = WrappedBatchNorm() defforward(self, x): x = self.conv1(x) x = self.bn1(x) x = self.conv2(x) x = self.nested(x) x = self.wrapped(x) return x model = M() model.eval() 卷积与批归一化的融合 在 PyTorch 中尝试自动融合卷积和批归一化的主要挑战之一是 PyTorch 没有提供一种简便的方式来访问计算图。FX 通过符号化跟踪实际调用的操作解决了这个问题，使我们能够通过前向调用、嵌套在 Sequential 模块中的计算或用户自定义模块中的计算来追踪这些操作。 traced_model = torch.fx.symbolic_trace(model) print(traced_model.graph) 这为我们提供了模型的图形表示。请注意，顺序模块中隐藏的模块以及被包装的模块都已内联到图形中。这是默认的抽象级别，但可以通过传递编写器进行配置。更多信息可以在 FX 概述 中找到。 卷积与批归一化的融合 与其他一些融合不同，卷积与批归一化的融合不需要引入新的操作符。相反，由于推理过程中的批归一化只包含逐点的加法和乘法操作，这些操作可以被“烘焙”到前一个卷积的权重中。这使得我们能够完全从模型中移除批归一化！更多详细信息请阅读https://nenadmarkus.com/p/fusing-batchnorm-and-conv/。出于清晰考虑，这里的代码复制自https://github.com/pytorch/pytorch/blob/orig/release/1.8/torch/nn/utils/fusion.py。 deffuse_conv_bn_eval(conv, bn): """ Given a conv Module `A` and an batch_norm module `B`, returns a conv module `C` such that C(x) == B(A(x)) in inference mode. """ assert(not (conv.training or bn.training)), "Fusion only for eval!" fused_conv = copy.deepcopy(conv) fused_conv.weight, fused_conv.bias = \ fuse_conv_bn_weights(fused_conv.weight, fused_conv.bias, bn.running_mean, bn.running_var, bn.eps, bn.weight, bn.bias) return fused_conv deffuse_conv_bn_weights(conv_w, conv_b, bn_rm, bn_rv, bn_eps, bn_w, bn_b): if conv_b is None: conv_b = torch.zeros_like(bn_rm) if bn_w is None: bn_w = torch.ones_like(bn_rm) if bn_b is None: bn_b = torch.zeros_like(bn_rm) bn_var_rsqrt = torch.rsqrt(bn_rv + bn_eps) conv_w = conv_w * (bn_w * bn_var_rsqrt).reshape([-1] + [1] * (len(conv_w.shape) - 1)) conv_b = (conv_b - bn_rm) * bn_var_rsqrt * bn_w + bn_b return torch.nn.Parameter(conv_w), torch.nn.Parameter(conv_b) FX 融合过程 既然我们已经有了计算图以及融合卷积和批量归一化的方法，剩下的就是遍历 FX 图并应用所需的融合操作。 def_parent_name(target : str) -> Tuple[str, str]: """ Splits a ``qualname`` into parent path and last atom. For example, `foo.bar.baz` -> (`foo.bar`, `baz`) """ *parent, name = target.rsplit('.', 1) return parent[0] if parent else '', name defreplace_node_module(node: fx.Node, modules: Dict[str, Any], new_module: torch.nn.Module): assert(isinstance(node.target, str)) parent_name, name = _parent_name(node.target) setattr(modules[parent_name], name, new_module) deffuse(model: torch.nn.Module) -> torch.nn.Module: model = copy.deepcopy(model) # The first step of most FX passes is to symbolically trace our model to # obtain a `GraphModule`. This is a representation of our original model # that is functionally identical to our original model, except that we now # also have a graph representation of our forward pass. fx_model: fx.GraphModule = fx.symbolic_trace(model) modules = dict(fx_model.named_modules()) # The primary representation for working with FX are the `Graph` and the # `Node`. Each `GraphModule` has a `Graph` associated with it - this # `Graph` is also what generates `GraphModule.code`. # The `Graph` itself is represented as a list of `Node` objects. Thus, to # iterate through all of the operations in our graph, we iterate over each # `Node` in our `Graph`. for node in fx_model.graph.nodes: # The FX IR contains several types of nodes, which generally represent # call sites to modules, functions, or methods. The type of node is # determined by `Node.op`. if node.op != 'call_module': # If our current node isn't calling a Module then we can ignore it. continue # For call sites, `Node.target` represents the module/function/method # that's being called. Here, we check `Node.target` to see if it's a # batch norm module, and then check `Node.args[0].target` to see if the # input `Node` is a convolution. if type(modules[node.target]) is nn.BatchNorm2d and type(modules[node.args[0].target]) is nn.Conv2d: if len(node.args[0].users) > 1: # Output of conv is used by other nodes continue conv = modules[node.args[0].target] bn = modules[node.target] fused_conv = fuse_conv_bn_eval(conv, bn) replace_node_module(node.args[0], modules, fused_conv) # As we've folded the batch nor into the conv, we need to replace all uses # of the batch norm with the conv. node.replace_all_uses_with(node.args[0]) # Now that all uses of the batch norm have been replaced, we can # safely remove the batch norm. fx_model.graph.erase_node(node) fx_model.graph.lint() # After we've modified our graph, we need to recompile our graph in order # to keep the generated code in sync. fx_model.recompile() return fx_model 为了演示目的，我们在此进行了一些简化，例如仅匹配2D卷积。查看 https://github.com/pytorch/pytorch/blob/master/torch/fx/experimental/fuser.py 以获取更实用的方法。 测试我们的融合过程 我们现在可以在初始的玩具模型上运行这个融合过程，并验证我们的结果是否一致。此外，我们可以打印出融合后的模型代码，并确认不再存在批量归一化层。 fused_model = fuse(model) print(fused_model.code) inp = torch.randn(5, 1, 1, 1) torch.testing.assert_allclose(fused_model(inp), model(inp)) 我们的 Fusion 在 ResNet18 上的性能基准测试 我们可以在更大的模型（如 ResNet18）上测试我们的融合优化，看看这一优化能在多大程度上提升推理性能。 importtorchvision.modelsasmodels importtime rn18 = models.resnet18() rn18.eval() inp = torch.randn(10, 3, 224, 224) output = rn18(inp) defbenchmark(model, iters=20): for _ in range(10): model(inp) begin = time.time() for _ in range(iters): model(inp) return str(time.time()-begin) fused_rn18 = fuse(rn18) print("Unfused time: ", benchmark(rn18)) print("Fused time: ", benchmark(fused_rn18)) 正如我们之前所见，FX转换的输出是（“可torchscript化的”）PyTorch代码，我们可以轻松地使用jit.script对输出进行处理，以尝试进一步提升性能。通过这种方式，我们的FX模型转换与TorchScript无缝结合，没有任何问题。 jit_rn18 = torch.jit.script(fused_rn18) print("jit time: ", benchmark(jit_rn18)) ############ # Conclusion # ---------- # As we can see, using FX we can easily write static graph transformations on # PyTorch code. # # Since FX is still in beta, we would be happy to hear any # feedback you have about using it. Please feel free to use the # PyTorch Forums (https://discuss.pytorch.org/) and the issue tracker # (https://github.com/pytorch/pytorch/issues) to provide any feedback # you might have. 下载 Python 源代码: fx_conv_bn_fuser.py 下载 Jupyter 笔记本: fx_conv_bn_fuser.ipynb

本页目录