Quantization

Warning

Quantization is in beta and subject to change.

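Introduction to Quantization

Quantization refers to techniques for performing computations and storing tensors at bit widths lower than floating point precision. A quantized model executes some or all of its operations on tensors with reduced precision rather than full-precision floating point values. This allows a more compact model representation and the use of high-performance vectorized operations on many hardware platforms. PyTorch supports INT8 quantization, which, compared with a typical FP32 model, reduces model size by 4x and memory bandwidth requirements by 4x. Hardware support for INT8 computation is typically 2 to 4 times faster than FP32 computation. Quantization is primarily a technique for speeding up inference, and only the forward pass is supported for quantized operators.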
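
As a rough, back-of-the-envelope illustration of the 4x size figure, the following sketch compares the raw storage of FP32 and INT8 values using only the Python standard library. The parameter count is a made-up number, and the scale and zero_point that a real quantized tensor also stores are ignored here.

```python
from array import array

num_params = 1_000  # hypothetical parameter count, chosen only for illustration

fp32_weights = array("f", [0.0] * num_params)  # C float, 4 bytes per element
int8_weights = array("b", [0] * num_params)    # signed char, 1 byte per element

fp32_bytes = fp32_weights.itemsize * len(fp32_weights)
int8_bytes = int8_weights.itemsize * len(int8_weights)

print(fp32_bytes, int8_bytes, fp32_bytes / int8_bytes)
```

The same ratio applies to the bytes that must be moved from memory per weight, which is where the bandwidth reduction comes from.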

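PyTorch supports multiple approaches to quantizing a deep learning model. In most cases the model is trained in FP32 and then converted to INT8. In addition, PyTorch supports quantization aware training, which models quantization errors in both the forward and backward passes using fake-quantization modules. Note that the entire computation is carried out in floating point. At the end of quantization aware training, PyTorch provides conversion functions to convert the trained model into lower precision.

At a lower level, PyTorch provides a way to represent quantized tensors and perform operations with them. They can be used to directly construct models that perform all or part of the computation in lower precision. Higher-level APIs are also provided that incorporate typical workflows for converting an FP32 model to lower precision with minimal accuracy loss.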

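Quantization API Summary

PyTorch provides three different modes of quantization: Eager Mode Quantization, FX Graph Mode Quantization (maintenance) and PyTorch 2 Export Quantization.

Eager Mode Quantization is a beta feature. Users need to perform fusion and specify where quantization and dequantization happen manually, and it supports only modules, not functionals.

FX Graph Mode Quantization is an automated quantization workflow in PyTorch. It is currently a prototype feature and is in maintenance mode now that PyTorch 2 Export Quantization exists. It improves upon Eager Mode Quantization by adding support for functionals and automating the quantization process, although users might need to refactor the model to make it compatible with FX Graph Mode Quantization (symbolically traceable with torch.fx). Note that FX Graph Mode Quantization is not expected to work on arbitrary models, since a model might not be symbolically traceable. We will integrate it into domain libraries such as torchvision, and users will be able to quantize models similar to the ones in supported domain libraries. For arbitrary models we provide general guidelines, but to actually make it work users need to be familiar with torch.fx, especially with how to make a model symbolically traceable.

PyTorch 2 Export Quantization is the new full graph mode quantization workflow, first released as a prototype feature in PyTorch 2.1. With PyTorch 2 we are moving to a better solution for full program capture (torch.export), since it can capture a higher percentage of models (88.8% on 14K models) than torch.fx.symbolic_trace (72.7% on 14K models), the program capture solution used by FX Graph Mode Quantization. torch.export still has limitations around some Python constructs and requires user involvement to support dynamism in the exported model, but overall it is an improvement over the previous program capture solution. PyTorch 2 Export Quantization is built for models captured by torch.export, with the flexibility and productivity of both modeling users and backend developers in mind. The main features are: (1) a programmable API for configuring how a model is quantized that can scale to many more use cases; (2) a simplified UX for modeling users and backend developers, since they only need to interact with a single object (Quantizer) to express their intention about how to quantize a model and what the backend supports; (3) an optional reference quantized model representation that can represent quantized computation with integer operations and maps closer to the quantized computation that happens in hardware.

New users of quantization are encouraged to try PyTorch 2 Export Quantization first. If it does not work well, they can try Eager Mode Quantization.

The following table compares Eager Mode Quantization, FX Graph Mode Quantization and PyTorch 2 Export Quantization: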

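	Eager Mode Quantization	FX Graph Mode Quantization	PyTorch 2 Export Quantization
Release Status	beta	prototype (maintenance)	prototype
Operator Fusion	Manual	Automatic	Automatic
Quant/DeQuant Placement	Manual	Automatic	Automatic
Quantizing Modules	Supported	Supported	Supported
Quantizing Functionals/Torch Ops	Manual	Automatic	Supported
Support for Customization	Limited Support	Fully Supported	Fully Supported
Quantization Mode Support	Post Training Quantization: Static, Dynamic, Weight Only; Quantization Aware Training: Static	Post Training Quantization: Static, Dynamic, Weight Only; Quantization Aware Training: Static	Defined by Backend Specific Quantizer
Input/Output Model Type	`torch.nn.Module`	`torch.nn.Module` (may need some refactors to make the model compatible with FX Graph Mode Quantization)	`torch.fx.GraphModule` (captured by `torch.export`)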

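There are three types of quantization supported:

- dynamic quantization (weights quantized, with activations read/stored in floating point and quantized for compute)
- static quantization (weights quantized, activations quantized, calibration required post training)
- static quantization aware training (weights quantized, activations quantized, quantization numerics modeled during training)

Please see our Introduction to Quantization on PyTorch blog post for a more comprehensive overview of the tradeoffs between these quantization types.

Operator coverage varies between dynamic and static quantization and is captured in the table below.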

	Static Quantization	Dynamic Quantization
nn.Linear	Y	Y
nn.Conv1d/2d/3d	Y	N
nn.LSTM	Y (through custom modules)	Y
nn.GRU	N	Y
nn.RNNCell	N	Y
nn.GRUCell	N	Y
nn.LSTMCell	N	Y
nn.EmbeddingBag	Y (activations are in fp32)	Y
nn.Embedding	Y	Y
nn.MultiheadAttention	Y (through custom modules)	Not supported
Activations	Broadly supported	Unchanged, computations stay in fp32

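Eager Mode Quantization

For a general introduction to the quantization flow, including the different types of quantization, please take a look at General Quantization Flow.

Post Training Dynamic Quantization

This is the simplest form of quantization to apply: the weights are quantized ahead of time, while the activations are dynamically quantized during inference. It is used in situations where model execution time is dominated by loading weights from memory rather than by computing the matrix multiplications, which is true for LSTM and Transformer type models with small batch sizes.

Diagram: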

# original model
# all tensors and computations are in floating point
previous_layer_fp32 -- linear_fp32 -- activation_fp32 -- next_layer_fp32
                 /
linear_weight_fp32

# dynamically quantized model
# linear and LSTM weights are in int8
previous_layer_fp32 -- linear_int8_w_fp32_inp -- activation_fp32 -- next_layer_fp32
                     /
   linear_weight_int8

PTDQ API Example:

import torch

# define a floating point model
class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(4, 4)

    def forward(self, x):
        x = self.fc(x)
        return x

# create a model instance
model_fp32 = M()
# create a quantized model instance
model_int8 = torch.ao.quantization.quantize_dynamic(
    model_fp32,  # the original model
    {torch.nn.Linear},  # a set of layers to dynamically quantize
    dtype=torch.qint8)  # the target dtype for quantized weights

# run the model
input_fp32 = torch.randn(4, 4, 4, 4)
res = model_int8(input_fp32)

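To learn more about dynamic quantization, please see our dynamic quantization tutorial.

Post Training Static Quantization

Post Training Static Quantization (PTQ static) quantizes the weights and activations of the model and fuses activations into preceding layers where possible. It requires calibration with a representative dataset to determine optimal quantization parameters for activations. It is typically used when both memory bandwidth and compute savings are important, with CNNs being a typical use case.

You may need to modify the model before applying post training static quantization. Please see Model Preparation for Eager Mode Static Quantization.

Diagram: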

# original model
# all tensors and computations are in floating point
previous_layer_fp32 -- linear_fp32 -- activation_fp32 -- next_layer_fp32
                    /
    linear_weight_fp32

# statically quantized model
# weights and activations are in int8
previous_layer_int8 -- linear_with_activation_int8 -- next_layer_int8
                    /
  linear_weight_int8

PTSQ API Example:

import torch

# define a floating point model where some layers could be statically quantized
class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # QuantStub converts tensors from floating point to quantized
        self.quant = torch.ao.quantization.QuantStub()
        self.conv = torch.nn.Conv2d(1, 1, 1)
        self.relu = torch.nn.ReLU()
        # DeQuantStub converts tensors from quantized to floating point
        self.dequant = torch.ao.quantization.DeQuantStub()

    def forward(self, x):
        # manually specify where tensors will be converted from floating
        # point to quantized in the quantized model
        x = self.quant(x)
        x = self.conv(x)
        x = self.relu(x)
        # manually specify where tensors will be converted from quantized
        # to floating point in the quantized model
        x = self.dequant(x)
        return x

# create a model instance
model_fp32 = M()

# model must be set to eval mode for static quantization logic to work
model_fp32.eval()

# attach a global qconfig, which contains information about what kind
# of observers to attach. Use 'x86' for server inference and 'qnnpack'
# for mobile inference. Other quantization configurations such as selecting
# symmetric or asymmetric quantization and MinMax or L2Norm calibration techniques
# can be specified here.
# Note: the old 'fbgemm' is still available but 'x86' is the recommended default
# for server inference.
# model_fp32.qconfig = torch.ao.quantization.get_default_qconfig('fbgemm')
model_fp32.qconfig = torch.ao.quantization.get_default_qconfig('x86')

# Fuse the activations to preceding layers, where applicable.
# This needs to be done manually depending on the model architecture.
# Common fusions include `conv + relu` and `conv + batchnorm + relu`
model_fp32_fused = torch.ao.quantization.fuse_modules(model_fp32, [['conv', 'relu']])

# Prepare the model for static quantization. This inserts observers in
# the model that will observe activation tensors during calibration.
model_fp32_prepared = torch.ao.quantization.prepare(model_fp32_fused)

# calibrate the prepared model to determine quantization parameters for activations
# in a real world setting, the calibration would be done with a representative dataset
input_fp32 = torch.randn(4, 1, 4, 4)
model_fp32_prepared(input_fp32)

# Convert the observed model to a quantized model. This does several things:
# quantizes the weights, computes and stores the scale and bias value to be
# used with each activation tensor, and replaces key operators with quantized
# implementations.
model_int8 = torch.ao.quantization.convert(model_fp32_prepared)

# run the model, relevant calculations will happen in int8
res = model_int8(input_fp32)

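To learn more about static quantization, please see the static quantization tutorial.

Quantization Aware Training for Static Quantization

Quantization Aware Training (QAT) models the effects of quantization during training, allowing for higher accuracy compared to other quantization methods. QAT can be done for static, dynamic or weight only quantization. During training, all calculations are done in floating point, with fake_quant modules modeling the effects of quantization by clamping and rounding to simulate the effects of INT8. After model conversion, weights and activations are quantized, and activations are fused into the preceding layer where possible. It is commonly used with CNNs and yields higher accuracy compared to static quantization.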
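
To make the "clamping and rounding" idea concrete, here is a minimal pure-Python sketch of what a fake-quantize step does to a single value: it snaps the value to the integer grid, clamps it to the INT8 range, and immediately maps it back to floating point. The function name and the scale/zero_point values are assumptions made for this illustration; this is not the implementation of PyTorch's fake_quant modules.

```python
def fake_quantize(x, scale, zero_point, quant_min=-128, quant_max=127):
    """Quantize to the integer grid, clamp, then map straight back to float."""
    q = round(x / scale) + zero_point
    q = max(quant_min, min(quant_max, q))   # clamping models INT8 saturation
    return (q - zero_point) * scale         # result is still a float

scale, zero_point = 0.5, 0   # hypothetical quantization parameters
values = [-1.0, 0.26, 100.0]

simulated = [fake_quantize(v, scale, zero_point) for v in values]
print(simulated)
```

The second value shows the rounding error and the third shows saturation at quant_max; during QAT the training loss sees both effects even though every tensor remains in floating point.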

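You may need to modify the model before applying post training static quantization. Please see Model Preparation for Eager Mode Static Quantization.

Diagram: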

# original model
# all tensors and computations are in floating point
previous_layer_fp32 -- linear_fp32 -- activation_fp32 -- next_layer_fp32
                      /
    linear_weight_fp32

# model with fake_quants for modeling quantization numerics during training
previous_layer_fp32 -- fq -- linear_fp32 -- activation_fp32 -- fq -- next_layer_fp32
                           /
   linear_weight_fp32 -- fq

# quantized model
# weights and activations are in int8
previous_layer_int8 -- linear_with_activation_int8 -- next_layer_int8
                     /
   linear_weight_int8

QAT API Example:

import torch

# define a floating point model where some layers could benefit from QAT
class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # QuantStub converts tensors from floating point to quantized
        self.quant = torch.ao.quantization.QuantStub()
        self.conv = torch.nn.Conv2d(1, 1, 1)
        self.bn = torch.nn.BatchNorm2d(1)
        self.relu = torch.nn.ReLU()
        # DeQuantStub converts tensors from quantized to floating point
        self.dequant = torch.ao.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.conv(x)
        x = self.bn(x)
        x = self.relu(x)
        x = self.dequant(x)
        return x

# create a model instance
model_fp32 = M()

# model must be set to eval for fusion to work
model_fp32.eval()

# attach a global qconfig, which contains information about what kind
# of observers to attach. Use 'x86' for server inference and 'qnnpack'
# for mobile inference. Other quantization configurations such as selecting
# symmetric or asymmetric quantization and MinMax or L2Norm calibration techniques
# can be specified here.
# Note: the old 'fbgemm' is still available but 'x86' is the recommended default
# for server inference.
# model_fp32.qconfig = torch.ao.quantization.get_default_qconfig('fbgemm')
model_fp32.qconfig = torch.ao.quantization.get_default_qat_qconfig('x86')

# fuse the activations to preceding layers, where applicable
# this needs to be done manually depending on the model architecture
model_fp32_fused = torch.ao.quantization.fuse_modules(model_fp32,
    [['conv', 'bn', 'relu']])

# Prepare the model for QAT. This inserts observers and fake_quants in
# the model that will observe weight and activation tensors during calibration.
# The model needs to be set to train mode for QAT logic to work.
model_fp32_prepared = torch.ao.quantization.prepare_qat(model_fp32_fused.train())

# run the training loop (not shown)
training_loop(model_fp32_prepared)

# Convert the observed model to a quantized model. This does several things:
# quantizes the weights, computes and stores the scale and bias value to be
# used with each activation tensor, fuses modules where appropriate,
# and replaces key operators with quantized implementations.
model_fp32_prepared.eval()
model_int8 = torch.ao.quantization.convert(model_fp32_prepared)

# run the model, relevant calculations will happen in int8
input_fp32 = torch.randn(4, 1, 4, 4)
res = model_int8(input_fp32)

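To learn more about quantization aware training, please see the QAT tutorial.

Model Preparation for Eager Mode Static Quantization

It is currently necessary to modify the model definition before Eager Mode Quantization. This is because quantization currently works on a module by module basis. Specifically, for all quantization techniques, the user needs to:

- Convert any operations that require output requantization (and thus have additional parameters) from functionals to module form (for example, using torch.nn.ReLU instead of torch.nn.functional.relu).
- Specify which parts of the model need to be quantized, either by assigning .qconfig attributes on submodules or by specifying qconfig_mapping. For example, setting model.conv1.qconfig = None means that the model.conv1 layer will not be quantized, and setting model.linear1.qconfig = custom_qconfig means that the quantization settings for model.linear1 will use custom_qconfig instead of the global qconfig.

For static quantization techniques which quantize activations, the user additionally needs to do the following:

- Specify where activations are quantized and de-quantized. This is done using QuantStub and DeQuantStub modules.
- Use FloatFunctional to wrap tensor operations that require special handling for quantization into modules. Examples are operations like add and cat, which require special handling to determine output quantization parameters.
- Fuse modules: combine operations/modules into a single module to obtain higher accuracy and performance. This is done using the fuse_modules() API, which takes in lists of modules to be fused. The following fusions are currently supported: [Conv, Relu], [Conv, BatchNorm], [Conv, BatchNorm, Relu], [Linear, Relu]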
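
As a small sketch of the first and fourth points, the model below uses torch.nn.ReLU instead of the functional form and wraps a residual addition in FloatFunctional. The class name Residual and the layer sizes are made up for this illustration, and the model is only run in floating point here (no prepare or convert step), where FloatFunctional.add behaves like an ordinary addition.

```python
import torch
from torch.ao.nn.quantized import FloatFunctional

class Residual(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(1, 1, 1)
        self.relu = torch.nn.ReLU()         # module form instead of F.relu
        self.skip_add = FloatFunctional()   # module form of `x + y`

    def forward(self, x):
        y = self.relu(self.conv(x))
        # `x + y` would be invisible to eager mode quantization;
        # routing it through a module gives the add its own observer later
        return self.skip_add.add(x, y)

model = Residual().eval()
x = torch.randn(2, 1, 4, 4)
out = model(x)
print(out.shape)
```

Because the addition is now a named submodule, eager mode quantization has a place to attach an observer and, after conversion, output quantization parameters for the add.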

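(Prototype - maintenance mode) FX Graph Mode Quantization

There are multiple quantization types in post training quantization (weight only, dynamic and static), and the configuration is done through qconfig_mapping (an argument of the prepare_fx function).

FXPTQ API Example: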

import torch
from torch.ao.quantization import (
  get_default_qconfig_mapping,
  get_default_qat_qconfig_mapping,
  QConfigMapping,
)
import torch.ao.quantization.quantize_fx as quantize_fx
import copy

model_fp = UserModel()

#
# post training dynamic/weight_only quantization
#

# we need to deepcopy if we still want to keep model_fp unchanged after quantization since quantization apis change the input model
model_to_quantize = copy.deepcopy(model_fp)
model_to_quantize.eval()
qconfig_mapping = QConfigMapping().set_global(torch.ao.quantization.default_dynamic_qconfig)
# a tuple of one or more example inputs is needed to trace the model
example_inputs = (input_fp32,)
# prepare
model_prepared = quantize_fx.prepare_fx(model_to_quantize, qconfig_mapping, example_inputs)
# no calibration needed when we only have dynamic/weight_only quantization
# quantize
model_quantized = quantize_fx.convert_fx(model_prepared)

#
# post training static quantization
#

model_to_quantize = copy.deepcopy(model_fp)
qconfig_mapping = get_default_qconfig_mapping("qnnpack")
model_to_quantize.eval()
# prepare
model_prepared = quantize_fx.prepare_fx(model_to_quantize, qconfig_mapping, example_inputs)
# calibrate (not shown)
# quantize
model_quantized = quantize_fx.convert_fx(model_prepared)

#
# quantization aware training for static quantization
#

model_to_quantize = copy.deepcopy(model_fp)
qconfig_mapping = get_default_qat_qconfig_mapping("qnnpack")
model_to_quantize.train()
# prepare
model_prepared = quantize_fx.prepare_qat_fx(model_to_quantize, qconfig_mapping, example_inputs)
# training loop (not shown)
# quantize
model_quantized = quantize_fx.convert_fx(model_prepared)

#
# fusion
#
model_to_quantize = copy.deepcopy(model_fp)
model_fused = quantize_fx.fuse_fx(model_to_quantize)

Please see the FX Graph Mode Quantization tutorials to learn more.

(Prototype) PyTorch 2 Export Quantization

API Example:

import torch
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
from torch._export import capture_pre_autograd_graph
from torch.ao.quantization.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(5, 10)

    def forward(self, x):
        return self.linear(x)

# initialize a floating point model
float_model = M().eval()

# define calibration function
def calibrate(model, data_loader):
    model.eval()
    with torch.no_grad():
        for image, target in data_loader:
            model(image)

# Step 1. program capture
# NOTE: this API will be updated to torch.export API in the future, but the captured
# result should mostly stay the same
example_inputs = (torch.randn(1, 5),)
m = capture_pre_autograd_graph(float_model, example_inputs)
# we get a model with aten ops

# Step 2. quantization
# backend developer will write their own Quantizer and expose methods to allow
# users to express how they
# want the model to be quantized
quantizer = XNNPACKQuantizer().set_global(get_symmetric_quantization_config())
# or prepare_qat_pt2e for Quantization Aware Training
m = prepare_pt2e(m, quantizer)

# run calibration
# calibrate(m, sample_inference_data)
m = convert_pt2e(m)

# Step 3. lowering
# lower to target backend

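Please follow these tutorials to get started on PyTorch 2 Export Quantization.

Modeling Users:

Backend Developers (please check out all Modeling Users docs as well):

How to Write a Quantizer for PyTorch 2 Export Quantization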

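Quantization Stack

Quantization is the process of converting a floating point model to a quantized model. At a high level, the quantization stack can therefore be split into two parts: 1) the building blocks or abstractions for a quantized model, and 2) the building blocks or abstractions for the quantization flow that converts a floating point model to a quantized model.

Quantized Model

Quantized Tensor

In order to do quantization in PyTorch, we need to be able to represent quantized data in tensors. A quantized tensor stores quantized data (represented as int8, uint8 or int32) along with quantization parameters such as scale and zero_point. Quantized tensors allow for many useful operations that make quantized arithmetic easy, in addition to allowing serialization of data in a quantized format.

PyTorch supports both per tensor and per channel symmetric and asymmetric quantization. Per tensor means that all the values within the tensor are quantized the same way, with the same quantization parameters. Per channel means that, for one dimension (typically the channel dimension of the tensor), the values in each slice along that dimension are quantized with different quantization parameters. This reduces the error in converting tensors to quantized values, since outlier values only impact the channel they are in rather than the entire tensor.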
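
The effect of an outlier can be seen in a few lines of plain Python. This is a simplified sketch that assumes symmetric quantization (zero_point = 0) with scales derived from the maximum absolute value; the helper names and the data are invented for the illustration and do not correspond to a PyTorch API.

```python
def quantize_dequantize(values, scale, quant_min=-127, quant_max=127):
    """Symmetric quantization (zero_point = 0) followed by dequantization."""
    restored = []
    for v in values:
        q = max(quant_min, min(quant_max, round(v / scale)))
        restored.append(q * scale)
    return restored

def max_abs_error(original, restored):
    return max(abs(a - b) for a, b in zip(original, restored))

# Two hypothetical channels; the second one contains a large outlier.
channels = [[0.01, -0.02, 0.03], [1.0, -50.0, 25.0]]

# Per tensor: one scale derived from the largest magnitude in the whole tensor.
tensor_scale = max(abs(v) for ch in channels for v in ch) / 127
per_tensor_err = max_abs_error(
    channels[0], quantize_dequantize(channels[0], tensor_scale))

# Per channel: channel 0 gets a scale derived from its own values only.
channel_scale = max(abs(v) for v in channels[0]) / 127
per_channel_err = max_abs_error(
    channels[0], quantize_dequantize(channels[0], channel_scale))

print(per_tensor_err, per_channel_err)
```

With a single shared scale, every value in the small channel rounds to zero, so the whole channel is lost; with its own scale the same channel is reproduced with only a small rounding error.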

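The mapping is performed by converting the floating point tensors using:

Q(x, scale, zero_point) = round(x / scale + zero_point)

Note that we ensure that zero in floating point is represented with no error after quantization, thereby ensuring that operations like padding do not cause additional quantization error.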
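
A worked example of the affine mapping round(x / scale + zero_point), written as a pure-Python sketch. The helper names and the scale and zero_point values are assumptions chosen for illustration; clamping to the quint8 range [0, 255] is added because a real quantized dtype has a finite range.

```python
def quantize(x, scale, zero_point, quant_min=0, quant_max=255):
    q = round(x / scale + zero_point)
    return max(quant_min, min(quant_max, q))

def dequantize(q, scale, zero_point):
    return (q - zero_point) * scale

scale, zero_point = 0.1, 128   # hypothetical quint8-style parameters

# Floating point zero lands exactly on zero_point and comes back as exactly 0.0
q_zero = quantize(0.0, scale, zero_point)
print(q_zero, dequantize(q_zero, scale, zero_point))

# Any other value is recovered only up to a rounding error of at most scale / 2
q = quantize(1.234, scale, zero_point)
print(q, dequantize(q, scale, zero_point))
```

Because zero_point is itself an integer in the quantized range, zero is always on the grid, which is why zero padding adds no quantization error.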

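Here are a few key attributes of a quantized tensor:

QScheme (torch.qscheme): an enum that specifies the way the tensor is quantized
- torch.per_tensor_affine
- torch.per_tensor_symmetric
- torch.per_channel_affine
- torch.per_channel_symmetric
dtype (torch.dtype): data type of the quantized tensor
- torch.quint8
- torch.qint8
- torch.qint32
- torch.float16
quantization parameters (vary based on QScheme): parameters for the chosen way of quantization
- torch.per_tensor_affine has the quantization parameters
  - scale (float)
  - zero_point (int)
- torch.per_channel_affine has the quantization parameters
  - per_channel_scales (list of float)
  - per_channel_zero_points (list of int)
  - axis (int)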

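Quantize and Dequantize

The input and output of a model are floating point tensors, but activations in the quantized model are quantized, so we need operators to convert between floating point and quantized tensors.

Quantize (float -> quantized)
- torch.quantize_per_tensor(x, scale, zero_point, dtype)
- torch.quantize_per_channel(x, scales, zero_points, axis, dtype)
- torch.quantize_per_tensor_dynamic(x, dtype, reduce_range)
- to(torch.float16)
Dequantize (quantized -> float)
- quantized_tensor.dequantize() - calling dequantize on a torch.float16 tensor will convert the tensor back to torch.float
- torch.dequantize(x)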
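
A short round trip through torch.quantize_per_tensor and dequantize may help here. The scale and zero_point below are arbitrary values picked so that every input lands exactly on the quantization grid; real values would come from an observer.

```python
import torch

x = torch.tensor([-1.0, 0.0, 1.0, 2.0])

# arguments: input, scale, zero_point, dtype (values chosen for illustration)
xq = torch.quantize_per_tensor(x, 0.5, 10, torch.quint8)

print(xq.int_repr())                      # the underlying uint8 values
print(xq.q_scale(), xq.q_zero_point())    # the stored quantization parameters

x_restored = xq.dequantize()              # equivalent to torch.dequantize(xq)
print(x_restored)
```

int_repr() exposes the stored integers (x / scale + zero_point), while dequantize() applies the inverse mapping to return an ordinary float tensor.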

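Quantized Operators/Modules

- A quantized operator is an operator that takes quantized tensors as inputs and outputs a quantized tensor.
- Quantized modules are PyTorch modules that perform quantized operations. They are typically defined for weighted operations such as linear and conv.

Quantized Engine

When a quantized model is executed, the qengine (torch.backends.quantized.engine) specifies which backend is used for execution. It is important to ensure that the qengine is compatible with the quantized model in terms of the value range of quantized activations and weights.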
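
A quick way to inspect the qengine on a given machine is sketched below. Which engines appear depends on how the local PyTorch build was compiled, so the printed list will differ between platforms; the choice of preferring 'x86' here is only an example.

```python
import torch

# Engines compiled into this particular PyTorch build (varies by platform)
supported = torch.backends.quantized.supported_engines
print("supported engines:", supported)
print("current engine:", torch.backends.quantized.engine)

# Select 'x86' when it is available; otherwise keep the build's default
if "x86" in supported:
    torch.backends.quantized.engine = "x86"

print("engine in use:", torch.backends.quantized.engine)
```

Checking supported_engines before assigning avoids an error on builds that do not include the requested engine.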

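Quantization Flow

Observer and FakeQuantize

Observers are PyTorch modules used to:
- collect tensor statistics, such as the min and max values of the tensors passing through the observer
- calculate quantization parameters based on the collected tensor statistics
FakeQuantize modules are PyTorch modules used to:
- simulate quantization (performing quantize/dequantize) for a tensor in the network
- calculate quantization parameters based on the statistics collected from an observer, or alternatively learn the quantization parameters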
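
The two observer responsibilities can be seen with MinMaxObserver from torch.ao.quantization. This is a minimal sketch: the input tensors are made up, and in a real flow the observer would be inserted by prepare rather than called by hand.

```python
import torch
from torch.ao.quantization import MinMaxObserver

observer = MinMaxObserver(dtype=torch.quint8, qscheme=torch.per_tensor_affine)

# "Calibration": the observer passes data through and records running min/max
observer(torch.tensor([-1.0, 0.5, 3.0]))
observer(torch.tensor([0.0, 2.0]))

# Quantization parameters derived from the collected statistics
scale, zero_point = observer.calculate_qparams()

print(observer.min_val.item(), observer.max_val.item())
print(scale.item(), zero_point.item())
```

For the observed range [-1, 3] and the 256 levels of quint8, the scale is (3 - (-1)) / 255 and the zero_point is the integer that floating point zero maps to.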

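QConfig

QConfig is a namedtuple of Observer or FakeQuantize module classes that can be configured with qscheme, dtype, etc. It is used to configure how an operator should be observed.
- Quantization configuration for an operator/module
  - different types of Observer/FakeQuantize
  - dtype
  - qscheme
  - quant_min/quant_max: can be used to simulate lower precision tensors
- Currently supports configuration for activation and weight
- Input/weight/output observers are inserted based on the qconfig configured for a given operator or module

General Quantization Flow

In general, the flow is the following:

prepare
- insert Observer/FakeQuantize modules based on the user-specified qconfig
calibrate/train (depending on post training quantization or quantization aware training)
- allow Observers to collect statistics or FakeQuantize modules to learn the quantization parameters
convert
- convert a calibrated/trained model to a quantized model

There are different modes of quantization, and they can be classified in two ways:

In terms of where we apply the quantization flow, we have:

- Post Training Quantization (apply quantization after training; quantization parameters are calculated based on sample calibration data)
- Quantization Aware Training (simulate quantization during training so that the quantization parameters can be learned together with the model using training data)

In terms of how we quantize the operators, we can have:

- Weight Only Quantization (only the weight is statically quantized)
- Dynamic Quantization (the weight is statically quantized, the activation is dynamically quantized)
- Static Quantization (both weight and activations are statically quantized)

We can mix different ways of quantizing operators in the same quantization flow. For example, we can have post training quantization that has both statically and dynamically quantized operators.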

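Quantization Support Matrix

Quantization Mode Support

	Quantization Mode		Dataset Requirement	Works Best For	Accuracy	Notes
Post Training Quantization	Dynamic/Weight Only Quantization	activation dynamically quantized (fp16, int8) or not quantized, weight statically quantized (fp16, int8, int4)	None	LSTM, MLP, Embedding, Transformer	good	Easy to use, close to static quantization when performance is compute or memory bound due to weights
	Static Quantization	activation and weights statically quantized (int8)	calibration dataset	CNN	good	Provides best performance, may have a big impact on accuracy, good for hardware that only supports int8 computation
Quantization Aware Training	Dynamic Quantization	activation and weight are fake quantized	fine-tuning dataset	MLP, Embedding	best	Limited support for now
	Static Quantization	activation and weight are fake quantized	fine-tuning dataset	CNN, MLP, Embedding	best	Typically used when static quantization leads to bad accuracy, and used to close the accuracy gap

Please see our Introduction to Quantization on PyTorch blog post for a more comprehensive overview of the tradeoffs between these quantization types.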

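Quantization Flow Support

PyTorch provides two modes of quantization: Eager Mode Quantization and FX Graph Mode Quantization.

Eager Mode Quantization is a beta feature. Users need to perform fusion and specify where quantization and dequantization happen manually, and it supports only modules, not functionals.

FX Graph Mode Quantization is an automated quantization framework in PyTorch, and it is currently a prototype feature. It improves upon Eager Mode Quantization by adding support for functionals and automating the quantization process. However, users might need to refactor the model to make it compatible with FX Graph Mode Quantization (symbolically traceable with torch.fx). Note that FX Graph Mode Quantization is not expected to work on arbitrary models, since a model might not be symbolically traceable. We will integrate it into domain libraries such as torchvision, and users will be able to quantize models similar to the ones in supported domain libraries with FX Graph Mode Quantization. For arbitrary models we provide general guidelines, but to actually make it work users might need to be familiar with torch.fx, especially with how to make a model symbolically traceable.

New users of quantization are encouraged to try FX Graph Mode Quantization first. If it does not work, they may follow the guidelines for using FX Graph Mode Quantization or fall back to Eager Mode Quantization.

The following table compares Eager Mode Quantization and FX Graph Mode Quantization:

	Eager Mode Quantization	FX Graph Mode Quantization
Release Status	beta	prototype
Operator Fusion	Manual	Automatic
Quant/DeQuant Placement	Manual	Automatic
Quantizing Modules	Supported	Supported
Quantizing Functionals/Torch Ops	Manual	Automatic
Support for Customization	Limited Support	Fully Supported
Quantization Mode Support	Post Training Quantization: Static, Dynamic, Weight Only; Quantization Aware Training: Static	Post Training Quantization: Static, Dynamic, Weight Only; Quantization Aware Training: Static
Input/Output Model Type	`torch.nn.Module`	`torch.nn.Module` (may need some refactors to make the model compatible with FX Graph Mode Quantization)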

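Backend/Hardware Support

Hardware	Kernel Library	Eager Mode Quantization	FX Graph Mode Quantization	Quantization Mode Support
server CPU	fbgemm/onednn	Supported	Supported	All Supported
mobile CPU	qnnpack/xnnpack	Supported	Supported	All Supported
server GPU	TensorRT (early prototype)	Not supported, since it requires a graph	Supported	Static Quantization

Today, PyTorch supports the following backends for running quantized operators efficiently:

- x86 CPUs with AVX2 support or higher (without AVX2 some operations have inefficient implementations), via x86 optimized by fbgemm and onednn (see the details in the RFC)
- ARM CPUs (typically found in mobile/embedded devices), via qnnpack
- (early prototype) support for NVidia GPU via TensorRT through fx2trt (to be open sourced)

Note for native CPU backends

We expose both x86 and qnnpack with the same native PyTorch quantized operators, so we need an additional flag to distinguish between them. The corresponding implementation of x86 and qnnpack is chosen automatically based on the PyTorch build mode, though users have the option to override this by setting torch.backends.quantized.engine to x86 or qnnpack.

When preparing a quantized model, it is necessary to ensure that the qconfig and the engine used for quantized computations match the backend on which the model will be executed. The qconfig controls the type of observers used during the quantization passes. The qengine controls whether the x86 or qnnpack specific packing function is used when packing weights for linear and convolution functions and modules. For example:

Default settings for x86: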

# set the qconfig for PTQ
# Note: the old 'fbgemm' is still available but 'x86' is the recommended default on x86 CPUs
qconfig = torch.ao.quantization.get_default_qconfig('x86')
# or, set the qconfig for QAT
qconfig = torch.ao.quantization.get_default_qat_qconfig('x86')
# set the qengine to control weight packing
torch.backends.quantized.engine = 'x86'

Default settings for qnnpack:

# set the qconfig for PTQ
qconfig = torch.ao.quantization.get_default_qconfig('qnnpack')
# or, set the qconfig for QAT
qconfig = torch.ao.quantization.get_default_qat_qconfig('qnnpack')
# set the qengine to control weight packing
torch.backends.quantized.engine = 'qnnpack'

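Operator Support

Operator coverage varies between dynamic and static quantization and is captured in the table below. Note that for FX Graph Mode Quantization, the corresponding functionals are also supported.

	Static Quantization	Dynamic Quantization
nn.Linear	Y	Y
nn.Conv1d/2d/3d	Y	N
nn.LSTM	N	Y
nn.GRU	N	Y
nn.RNNCell	N	Y
nn.GRUCell	N	Y
nn.LSTMCell	N	Y
nn.EmbeddingBag	Y (activations are in fp32)	Y
nn.Embedding	Y	Y
nn.MultiheadAttention	Not supported	Not supported
Activations	Broadly supported	Unchanged, computations stay in fp32

Note: this will be updated with some information generated from the native backend_config_dict soon.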

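Quantization API Reference

The Quantization API Reference contains documentation of quantization APIs, such as quantization passes, quantized tensor operations, and supported quantized modules and functions.

Quantization Backend Configuration

The Quantization Backend Configuration contains documentation on how to configure the quantization workflows for various backends.

Quantization Accuracy Debugging

The Quantization Accuracy Debugging contains documentation on how to debug quantization accuracy.

Quantization Customization

While default implementations of observers to select the scale factor and bias based on observed tensor data are provided, developers can provide their own quantization functions. Quantization can be applied selectively to different parts of the model, or configured differently for different parts of the model.

We also provide support for per channel quantization for conv1d(), conv2d(), conv3d() and linear().

Quantization workflows work by adding (e.g. adding observers as a .observer submodule) or replacing (e.g. converting nn.Conv2d to nn.quantized.Conv2d) submodules in the model's module hierarchy. This means that the model stays a regular nn.Module-based instance throughout the process and thus can work with the rest of the PyTorch APIs.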

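Quantization Custom Module API

Both Eager mode and FX graph mode quantization APIs provide a hook for the user to specify a module quantized in a custom way, with user defined logic for observation and quantization. The user needs to specify:

1. The Python type of the source fp32 module (existing in the model).
2. The Python type of the observed module (provided by the user). This module needs to define a from_float function which defines how the observed module is created from the original fp32 module.
3. The Python type of the quantized module (provided by the user). This module needs to define a from_observed function which defines how the quantized module is created from the observed module.
4. A configuration describing (1), (2) and (3) above, passed to the quantization APIs.

The framework will then do the following:

1. During the prepare module swaps, it will convert every module of the type specified in (1) to the type specified in (2), using the from_float function of the class in (2).
2. During the convert module swaps, it will convert every module of the type specified in (2) to the type specified in (3), using the from_observed function of the class in (3).

Currently, there is a requirement that ObservedCustomModule has a single tensor output, and an observer will be added by the framework (not by the user) on that output. The observer will be stored under the activation_post_process key as an attribute of the custom module instance. These restrictions may be relaxed in the future.

Custom API Example: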

import torch
import torch.ao.nn.quantized as nnq
from torch.ao.quantization import QConfigMapping
import torch.ao.quantization.quantize_fx

# original fp32 module to replace
class CustomModule(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(3, 3)

    def forward(self, x):
        return self.linear(x)

# custom observed module, provided by user
class ObservedCustomModule(torch.nn.Module):
    def __init__(self, linear):
        super().__init__()
        self.linear = linear

    def forward(self, x):
        return self.linear(x)

    @classmethod
    def from_float(cls, float_module):
        assert hasattr(float_module, 'qconfig')
        observed = cls(float_module.linear)
        observed.qconfig = float_module.qconfig
        return observed

# custom quantized module, provided by user
class StaticQuantCustomModule(torch.nn.Module):
    def __init__(self, linear):
        super().__init__()
        self.linear = linear

    def forward(self, x):
        return self.linear(x)

    @classmethod
    def from_observed(cls, observed_module):
        assert hasattr(observed_module, 'qconfig')
        assert hasattr(observed_module, 'activation_post_process')
        observed_module.linear.activation_post_process = \
            observed_module.activation_post_process
        quantized = cls(nnq.Linear.from_float(observed_module.linear))
        return quantized

#
# example API call (Eager mode quantization)
#

m = torch.nn.Sequential(CustomModule()).eval()
prepare_custom_config_dict = {
    "float_to_observed_custom_module_class": {
        CustomModule: ObservedCustomModule
    }
}
convert_custom_config_dict = {
    "observed_to_quantized_custom_module_class": {
        ObservedCustomModule: StaticQuantCustomModule
    }
}
m.qconfig = torch.ao.quantization.default_qconfig
mp = torch.ao.quantization.prepare(
    m, prepare_custom_config_dict=prepare_custom_config_dict)
# calibration (not shown)
mq = torch.ao.quantization.convert(
    mp, convert_custom_config_dict=convert_custom_config_dict)
#
# example API call (FX graph mode quantization)
#
m = torch.nn.Sequential(CustomModule()).eval()
qconfig_mapping = QConfigMapping().set_global(torch.ao.quantization.default_qconfig)
prepare_custom_config_dict = {
    "float_to_observed_custom_module_class": {
        "static": {
            CustomModule: ObservedCustomModule,
        }
    }
}
convert_custom_config_dict = {
    "observed_to_quantized_custom_module_class": {
        "static": {
            ObservedCustomModule: StaticQuantCustomModule,
        }
    }
}
mp = torch.ao.quantization.quantize_fx.prepare_fx(
    m, qconfig_mapping, (torch.randn(3, 3),), prepare_custom_config=prepare_custom_config_dict)
# calibration (not shown)
mq = torch.ao.quantization.quantize_fx.convert_fx(
    mp, convert_custom_config=convert_custom_config_dict)

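Best Practices

1. If you are using the x86 backend, 7 bits should be used instead of 8 bits. Make sure you reduce the range for quant_min and quant_max: if dtype is torch.quint8, set a custom quant_min of 0 and quant_max of 127 (255 / 2); if dtype is torch.qint8, set a custom quant_min of -64 (-128 / 2) and quant_max of 63 (127 / 2). These values are already set correctly if you call torch.ao.quantization.get_default_qconfig(backend) or torch.ao.quantization.get_default_qat_qconfig(backend) to get the default qconfig for the x86 or qnnpack backend.

2. If the onednn backend is selected, 8 bits will be used for activations in the default qconfig mapping torch.ao.quantization.get_default_qconfig_mapping('onednn') and the default qconfig torch.ao.quantization.get_default_qconfig('onednn'). This setting is recommended on CPUs with Vector Neural Network Instruction (VNNI) support. On CPUs without VNNI support, set reduce_range to True on the activation's observer to get better accuracy.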
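
For readers who build a qconfig by hand rather than using the default helpers, the reduced ranges from point 1 might be expressed roughly as follows. This is a sketch under assumptions: the choice of MinMaxObserver for both activation and weight, and of a per-tensor symmetric scheme for the weight, is made for illustration and is not the default qconfig of any backend.

```python
import torch
from torch.ao.quantization import MinMaxObserver, QConfig

# Hand-written qconfig: activations limited to [0, 127], weights to [-64, 63]
reduced_range_qconfig = QConfig(
    activation=MinMaxObserver.with_args(
        dtype=torch.quint8, quant_min=0, quant_max=127),
    weight=MinMaxObserver.with_args(
        dtype=torch.qint8, qscheme=torch.per_tensor_symmetric,
        quant_min=-64, quant_max=63),
)

# QConfig stores observer factories; calling them creates observer instances
act_observer = reduced_range_qconfig.activation()
weight_observer = reduced_range_qconfig.weight()

print(act_observer.quant_min, act_observer.quant_max)
print(weight_observer.quant_min, weight_observer.quant_max)
```

A qconfig like this would be assigned to model.qconfig before calling prepare, in the same place where the earlier examples use get_default_qconfig.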

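Frequently Asked Questions

How can I do quantized inference on GPU?

We don't have official GPU support yet, but this is an area of active development.

Where can I get ONNX support for my quantized model?

If you get errors exporting the model (using APIs under torch.onnx), you may open an issue in the PyTorch repository. Prefix the issue title with [ONNX] and tag the issue as module: onnx.

If you encounter issues with ONNX Runtime, open an issue at GitHub - microsoft/onnxruntime.

How can I use quantization with LSTMs?

LSTM is supported through our custom module API in both Eager mode and FX graph mode quantization. Examples can be found at Eager mode: pytorch/test_quantized_op.py TestQuantizedOps.test_custom_module_lstm, FX graph mode: pytorch/test_quantize_fx.py TestQuantizeFx.test_static_lstm

Common Errors

Passing a non-quantized Tensor into a quantized kernel

If you see an error similar to:

RuntimeError: Could not run 'quantized::some_operator' with arguments from the 'CPU' backend...

This means that you are trying to pass a non-quantized Tensor to a quantized kernel. A common workaround is to use torch.ao.quantization.QuantStub to quantize the tensor. This needs to be done manually in Eager mode quantization. An e2e example: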

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.ao.quantization.QuantStub()
        self.conv = torch.nn.Conv2d(1, 1, 1)

    def forward(self, x):
        # during the convert step, this will be replaced with a
        # `quantize_per_tensor` call
        x = self.quant(x)
        x = self.conv(x)
        return x

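Passing a quantized Tensor into a non-quantized kernel

If you see an error similar to:

RuntimeError: Could not run 'aten::thnn_conv2d_forward' with arguments from the 'QuantizedCPU' backend.

This means that you are trying to pass a quantized Tensor to a non-quantized kernel. A common workaround is to use torch.ao.quantization.DeQuantStub to dequantize the tensor. This needs to be done manually in Eager mode quantization. An e2e example: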

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.ao.quantization.QuantStub()
        self.conv1 = torch.nn.Conv2d(1, 1, 1)
        # this module will not be quantized (see `qconfig = None` logic below)
        self.conv2 = torch.nn.Conv2d(1, 1, 1)
        self.dequant = torch.ao.quantization.DeQuantStub()

    def forward(self, x):
        # during the convert step, this will be replaced with a
        # `quantize_per_tensor` call
        x = self.quant(x)
        x = self.conv1(x)
        # during the convert step, this will be replaced with a
        # `dequantize` call
        x = self.dequant(x)
        x = self.conv2(x)
        return x

m = M()
m.qconfig = some_qconfig
# turn off quantization for conv2
m.conv2.qconfig = None

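Saving and Loading Quantized Models

When calling torch.load on a quantized model, if you see an error like:

AttributeError: 'LinearPackedParams' object has no attribute '_modules'

This is because directly saving and loading a quantized model using torch.save and torch.load is not supported. To save/load quantized models, the following ways can be used:

Saving/Loading the quantized model state_dict

An example: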

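import io

import torch
import torch.nn as nn
from torch.ao.quantization import default_qconfig
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx

class M(torch.nn.Module):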
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(5, 5)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.linear(x)
        x = self.relu(x)
        return x

m = M().eval()
# prepare_fx needs a tuple of example inputs to trace the model
example_inputs = (torch.rand(5, 5),)
prepare_orig = prepare_fx(m, {'' : default_qconfig}, example_inputs)
prepare_orig(*example_inputs)
quantized_orig = convert_fx(prepare_orig)

# Save/load using state_dict
b = io.BytesIO()
torch.save(quantized_orig.state_dict(), b)

m2 = M().eval()
prepared = prepare_fx(m2, {'' : default_qconfig}, example_inputs)
quantized = convert_fx(prepared)
b.seek(0)
quantized.load_state_dict(torch.load(b))

Saving/Loading scripted quantized models using torch.jit.save and torch.jit.load

An example:

# Note: using the same model M from previous example
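m = M().eval()
# prepare_fx needs a tuple of example inputs to trace the model
example_inputs = (torch.rand(5, 5),)
prepare_orig = prepare_fx(m, {'' : default_qconfig}, example_inputs)
prepare_orig(*example_inputs)
quantized_orig = convert_fx(prepare_orig)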

# save/load using scripted model
scripted = torch.jit.script(quantized_orig)
b = io.BytesIO()
torch.jit.save(scripted, b)
b.seek(0)
scripted_quantized = torch.jit.load(b)

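Symbolic Trace Error when using FX Graph Mode Quantization

Symbolic traceability is a requirement for (Prototype - maintenance mode) FX Graph Mode Quantization, so if you pass a PyTorch model that is not symbolically traceable to torch.ao.quantization.quantize_fx.prepare_fx or torch.ao.quantization.quantize_fx.prepare_qat_fx, you might see an error like the following:

torch.fx.proxy.TraceError: symbolically traced variables cannot be used as inputs to control flow

Please take a look at Limitations of Symbolic Tracing and use the User Guide on Using FX Graph Mode Quantization to work around the problem.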
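
The error can be reproduced without any quantization code by calling torch.fx.symbolic_trace directly. The two toy modules below are invented for this illustration; replacing the Python `if` with torch.where is one possible rewrite for this particular pattern, not necessarily the approach the user guide recommends for every model.

```python
import torch
import torch.fx

class DataDependent(torch.nn.Module):
    def forward(self, x):
        # A Python `if` on a tensor value cannot be symbolically traced
        if x.sum() > 0:
            return x
        return -x

class Traceable(torch.nn.Module):
    def forward(self, x):
        # The same logic expressed as a tensor op, which tracing can record
        return torch.where(x.sum() > 0, x, -x)

try:
    torch.fx.symbolic_trace(DataDependent())
    trace_failed = False
except torch.fx.proxy.TraceError as err:
    trace_failed = True
    print("TraceError:", err)

traced = torch.fx.symbolic_trace(Traceable())
t = torch.tensor([-1.0, -2.0])
result = traced(t)
print(result)
```

Once the model traces cleanly with torch.fx.symbolic_trace, the same model can be passed to prepare_fx or prepare_qat_fx.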