(beta) Dynamic Quantization on BERT

To get the most out of this tutorial, we suggest using the Colab version, which lets you experiment with the material presented below.

Author: Jianyu Huang

Reviewed by: Raghuraman Krishnamoorthi

Edited by: Jessica Lin

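Introduction

In this tutorial, we apply dynamic quantization to a BERT model, closely following the BERT model from the HuggingFace Transformers examples. With this step-by-step journey, we would like to demonstrate how to convert a well-known state-of-the-art model like BERT into a dynamically quantized model.

BERT (Bidirectional Encoder Representations from Transformers) is a method of pre-training language representations that achieves state-of-the-art accuracy on many popular Natural Language Processing (NLP) tasks, such as question answering and text classification. The original paper is listed as reference [1].
Dynamic quantization support in PyTorch converts a float model to a quantized model with static int8 or float16 data types for the weights and dynamic quantization for the activations. The activations are quantized dynamically (per batch) to int8 when the weights are quantized to int8. In PyTorch, the torch.quantization.quantize_dynamic API replaces specified modules with dynamic weight-only quantized versions and outputs the quantized model.
We demonstrate the accuracy and inference performance results on the Microsoft Research Paraphrase Corpus (MRPC) task in the General Language Understanding Evaluation benchmark (GLUE). MRPC (Dolan and Brockett, 2005) is a corpus of sentence pairs automatically extracted from online news sources, with human annotations of whether the sentences in each pair are semantically equivalent. Because the classes are imbalanced (68% positive, 32% negative), we follow common practice and report the F1 score. MRPC is a common NLP task for language pair classification, as shown below.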

../_images/bert.png

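1. Setup

1.1 Install PyTorch and HuggingFace Transformers

To start this tutorial, first follow the installation instructions in the PyTorch and HuggingFace GitHub repos. In addition, we install the scikit-learn package, as we will reuse its built-in F1 score calculation helper function.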

pip install scikit-learn
pip install transformers==4.29.2

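Because we will be using beta features of PyTorch, we recommend installing the latest versions of torch and torchvision. The most recent local installation instructions are on the PyTorch website. For example, to install on Mac: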

yes y | pip uninstall torch torchvision
yes y | pip install --pre torch -f https://download.pytorch.org/whl/nightly/cu101/torch_nightly.html

1.2 Import the necessary modules

In this step we import the Python modules needed for the tutorial.

import logging
import numpy as np
import os
import random
import sys
import time
import torch

from argparse import Namespace
from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler,
                              TensorDataset)
# DistributedSampler is referenced by evaluate() when local_rank != -1
from torch.utils.data.distributed import DistributedSampler
from tqdm import tqdm
from transformers import (BertConfig, BertForSequenceClassification, BertTokenizer,)
from transformers import glue_compute_metrics as compute_metrics
from transformers import glue_output_modes as output_modes
from transformers import glue_processors as processors
from transformers import glue_convert_examples_to_features as convert_examples_to_features

# Setup logging
logger = logging.getLogger(__name__)
logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s -   %(message)s',
                    datefmt = '%m/%d/%Y %H:%M:%S',
                    level = logging.WARN)

logging.getLogger("transformers.modeling_utils").setLevel(
   logging.WARN)  # Reduce logging

print(torch.__version__)

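We set the number of threads to 1 to compare single-thread performance between FP32 and INT8. At the end of the tutorial, users can set a different number of threads by building PyTorch with the right parallel backend.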

torch.set_num_threads(1)
print(torch.__config__.parallel_info())

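1.3 Learn about the helper functions

The helper functions are built into the transformers library. We mainly use the following two: one converts the text examples into feature vectors, and the other measures the F1 score of the predicted result.

The glue_convert_examples_to_features function converts the texts into input features:

Tokenize the input sequences;
Insert [CLS] at the beginning;
Insert [SEP] between the first and second sentences, and at the end;
Generate token type IDs to indicate whether a token belongs to the first sequence or the second sequence.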
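The layout produced by the steps above can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: the helper name build_pair_features is hypothetical, whitespace splitting stands in for BERT's WordPiece tokenizer, and the padding and attention mask that the real helper also produces are omitted.

```python
def build_pair_features(tokens_a, tokens_b):
    """Lay out a sentence pair as [CLS] A [SEP] B [SEP] with segment ids."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A, and the first [SEP];
    # segment 1 covers sentence B and the final [SEP].
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, token_type_ids


# Toy whitespace "tokenizer" for illustration only (assumption, not WordPiece).
tokens, token_type_ids = build_pair_features(
    "he said hello".split(), "he greeted us".split()
)
for token, type_id in zip(tokens, token_type_ids):
    print(f"{token:10s} {type_id}")
```

In the real pipeline the tokens are additionally mapped to vocabulary ids and padded to max_seq_length.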

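The glue_compute_metrics function includes the metric that computes the F1 score, which can be interpreted as the harmonic mean of precision and recall; it reaches its best value at 1 and its worst at 0. Precision and recall contribute equally to the F1 score.

The equation for the F1 score is: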

\[F1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}\]
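As a small worked example of the formula, the sketch below computes F1 from confusion-matrix counts. The function name f1_from_counts and the counts themselves are hypothetical and are not MRPC results; in the tutorial the score comes from glue_compute_metrics.

```python
def f1_from_counts(true_pos, false_pos, false_neg):
    """F1 from confusion-matrix counts; returns 0.0 when there are no true positives."""
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: precision = 90/100 = 0.9, recall = 90/120 = 0.75.
f1 = f1_from_counts(true_pos=90, false_pos=10, false_neg=30)
print(f"F1 = {f1:.4f}")
```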

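1.4 Download the dataset

Before running the MRPC task, we download the GLUE data by running the download_glue_data.py script and unpack it into the glue_data directory.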

python download_glue_data.py --data_dir='glue_data' --tasks='MRPC'

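2. Fine-tune the BERT model

The spirit of BERT is to pre-train language representations and then fine-tune the deep bidirectional representations on a wide range of tasks with minimal task-dependent parameters, achieving state-of-the-art results. In this tutorial, we focus on fine-tuning the pre-trained BERT model to classify semantically equivalent sentence pairs on the MRPC task.

To fine-tune the pre-trained BERT model (the bert-base-uncased model in HuggingFace transformers) for the MRPC task, you can follow the command in the examples: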

export GLUE_DIR=./glue_data
export TASK_NAME=MRPC
export OUT_DIR=./$TASK_NAME/
python ./run_glue.py \
    --model_type bert \
    --model_name_or_path bert-base-uncased \
    --task_name $TASK_NAME \
    --do_train \
    --do_eval \
    --do_lower_case \
    --data_dir $GLUE_DIR/$TASK_NAME \
    --max_seq_length 128 \
    --per_gpu_eval_batch_size=8   \
    --per_gpu_train_batch_size=8   \
    --learning_rate 2e-5 \
    --num_train_epochs 3.0 \
    --save_steps 100000 \
    --output_dir $OUT_DIR

We provide a fine-tuned BERT model for the MRPC task for download. To save time, you can download the model file (~400 MB) directly into your local folder $OUT_DIR.

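2.1 Set global configurations

Here we set the global configurations for evaluating the fine-tuned BERT model before and after dynamic quantization.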

configs = Namespace()

# The output directory for the fine-tuned model, $OUT_DIR.
configs.output_dir = "./MRPC/"

# The data directory for the MRPC task in the GLUE benchmark, $GLUE_DIR/$TASK_NAME.
configs.data_dir = "./glue_data/MRPC"

# The model name or path for the pre-trained model.
configs.model_name_or_path = "bert-base-uncased"
# The maximum length of an input sequence
configs.max_seq_length = 128

# Prepare GLUE task.
configs.task_name = "MRPC".lower()
configs.processor = processors[configs.task_name]()
configs.output_mode = output_modes[configs.task_name]
configs.label_list = configs.processor.get_labels()
configs.model_type = "bert".lower()
configs.do_lower_case = True

# Set the device, batch size, topology, and caching flags.
configs.device = "cpu"
configs.per_gpu_eval_batch_size = 8
configs.n_gpu = 0
configs.local_rank = -1
configs.overwrite_cache = False


# Set random seed for reproducibility.
def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
set_seed(42)

2.2 Load the fine-tuned BERT model

We load the tokenizer and the fine-tuned BERT sequence classifier model (FP32) from configs.output_dir.

tokenizer = BertTokenizer.from_pretrained(
    configs.output_dir, do_lower_case=configs.do_lower_case)

model = BertForSequenceClassification.from_pretrained(configs.output_dir)
model.to(configs.device)

2.3 Define the tokenize and evaluation functions

We reuse the tokenize and evaluation functions from HuggingFace.

# coding=utf-8
# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

def evaluate(args, model, tokenizer, prefix=""):
    # Loop to handle MNLI double evaluation (matched, mis-matched)
    eval_task_names = ("mnli", "mnli-mm") if args.task_name == "mnli" else (args.task_name,)
    eval_outputs_dirs = (args.output_dir, args.output_dir + '-MM') if args.task_name == "mnli" else (args.output_dir,)

    results = {}
    for eval_task, eval_output_dir in zip(eval_task_names, eval_outputs_dirs):
        eval_dataset = load_and_cache_examples(args, eval_task, tokenizer, evaluate=True)

        if not os.path.exists(eval_output_dir) and args.local_rank in [-1, 0]:
            os.makedirs(eval_output_dir)

        args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu)
        # Note that DistributedSampler samples randomly
        eval_sampler = SequentialSampler(eval_dataset) if args.local_rank == -1 else DistributedSampler(eval_dataset)
        eval_dataloader = DataLoader(eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size)

        # multi-gpu eval
        if args.n_gpu > 1:
            model = torch.nn.DataParallel(model)

        # Eval!
        logger.info("***** Running evaluation {} *****".format(prefix))
        logger.info("  Num examples = %d", len(eval_dataset))
        logger.info("  Batch size = %d", args.eval_batch_size)
        eval_loss = 0.0
        nb_eval_steps = 0
        preds = None
        out_label_ids = None
        for batch in tqdm(eval_dataloader, desc="Evaluating"):
            model.eval()
            batch = tuple(t.to(args.device) for t in batch)

            with torch.no_grad():
                inputs = {'input_ids':      batch[0],
                          'attention_mask': batch[1],
                          'labels':         batch[3]}
                if args.model_type != 'distilbert':
                    inputs['token_type_ids'] = batch[2] if args.model_type in ['bert', 'xlnet'] else None  # XLM, DistilBERT and RoBERTa don't use segment_ids
                outputs = model(**inputs)
                tmp_eval_loss, logits = outputs[:2]

                eval_loss += tmp_eval_loss.mean().item()
            nb_eval_steps += 1
            if preds is None:
                preds = logits.detach().cpu().numpy()
                out_label_ids = inputs['labels'].detach().cpu().numpy()
            else:
                preds = np.append(preds, logits.detach().cpu().numpy(), axis=0)
                out_label_ids = np.append(out_label_ids, inputs['labels'].detach().cpu().numpy(), axis=0)

        eval_loss = eval_loss / nb_eval_steps
        if args.output_mode == "classification":
            preds = np.argmax(preds, axis=1)
        elif args.output_mode == "regression":
            preds = np.squeeze(preds)
        result = compute_metrics(eval_task, preds, out_label_ids)
        results.update(result)

        output_eval_file = os.path.join(eval_output_dir, prefix, "eval_results.txt")
        with open(output_eval_file, "w") as writer:
            logger.info("***** Eval results {} *****".format(prefix))
            for key in sorted(result.keys()):
                logger.info("  %s = %s", key, str(result[key]))
                writer.write("%s = %s\n" % (key, str(result[key])))

    return results


def load_and_cache_examples(args, task, tokenizer, evaluate=False):
    if args.local_rank not in [-1, 0] and not evaluate:
        torch.distributed.barrier()  # Make sure only the first process in distributed training process the dataset, and the others will use the cache

    processor = processors[task]()
    output_mode = output_modes[task]
    # Load data features from cache or dataset file
    cached_features_file = os.path.join(args.data_dir, 'cached_{}_{}_{}_{}'.format(
        'dev' if evaluate else 'train',
        list(filter(None, args.model_name_or_path.split('/'))).pop(),
        str(args.max_seq_length),
        str(task)))
    if os.path.exists(cached_features_file) and not args.overwrite_cache:
        logger.info("Loading features from cached file %s", cached_features_file)
        features = torch.load(cached_features_file)
    else:
        logger.info("Creating features from dataset file at %s", args.data_dir)
        label_list = processor.get_labels()
        if task in ['mnli', 'mnli-mm'] and args.model_type in ['roberta']:
            # HACK(label indices are swapped in RoBERTa pretrained model)
            label_list[1], label_list[2] = label_list[2], label_list[1]
        examples = processor.get_dev_examples(args.data_dir) if evaluate else processor.get_train_examples(args.data_dir)
        features = convert_examples_to_features(examples,
                                                tokenizer,
                                                label_list=label_list,
                                                max_length=args.max_seq_length,
                                                output_mode=output_mode,
                                                pad_on_left=bool(args.model_type in ['xlnet']),                 # pad on the left for xlnet
                                                pad_token=tokenizer.convert_tokens_to_ids([tokenizer.pad_token])[0],
                                                pad_token_segment_id=4 if args.model_type in ['xlnet'] else 0,
        )
        if args.local_rank in [-1, 0]:
            logger.info("Saving features into cached file %s", cached_features_file)
            torch.save(features, cached_features_file)

    if args.local_rank == 0 and not evaluate:
        torch.distributed.barrier()  # Make sure only the first process in distributed training process the dataset, and the others will use the cache

    # Convert to Tensors and build dataset
    all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
    all_attention_mask = torch.tensor([f.attention_mask for f in features], dtype=torch.long)
    all_token_type_ids = torch.tensor([f.token_type_ids for f in features], dtype=torch.long)
    if output_mode == "classification":
        all_labels = torch.tensor([f.label for f in features], dtype=torch.long)
    elif output_mode == "regression":
        all_labels = torch.tensor([f.label for f in features], dtype=torch.float)

    dataset = TensorDataset(all_input_ids, all_attention_mask, all_token_type_ids, all_labels)
    return dataset
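To see which cache file load_and_cache_examples reads or writes, the naming logic can be pulled out into a standalone sketch. The helper name cached_features_path is hypothetical; the format string and the model-tag extraction mirror the function above, and the arguments are the values set in the global configurations.

```python
import os


def cached_features_path(data_dir, model_name_or_path, max_seq_length, task, evaluate):
    """Rebuild the cache file path the same way load_and_cache_examples does."""
    split = "dev" if evaluate else "train"
    # Last non-empty path component, so "./MRPC/" yields "MRPC".
    model_tag = list(filter(None, model_name_or_path.split("/"))).pop()
    file_name = "cached_{}_{}_{}_{}".format(split, model_tag, str(max_seq_length), str(task))
    return os.path.join(data_dir, file_name)


path = cached_features_path(
    "./glue_data/MRPC", "bert-base-uncased", 128, "mrpc", evaluate=True
)
print(path)
```

Deleting this file, or setting configs.overwrite_cache = True, forces the features to be rebuilt from the dataset files.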

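3. Apply dynamic quantization

We call torch.quantization.quantize_dynamic on the model to apply dynamic quantization to the HuggingFace BERT model. Specifically,

We specify that we want the torch.nn.Linear modules in our model to be quantized;
We specify that we want the weights to be converted to quantized int8 values.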

quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
print(quantized_model)
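To make the "dynamic" part concrete, here is a minimal pure-Python sketch of 8-bit affine quantization in which the scale and zero point are recomputed from the range observed in each batch. It is a simplification for illustration and not PyTorch's actual kernels; the names quantize_affine and dequantize_affine and the sample batches are hypothetical.

```python
def quantize_affine(values, qmin=0, qmax=255):
    """Map floats to integers in [qmin, qmax] using the range observed at call time."""
    lo = min(min(values), 0.0)  # widen the range so 0.0 is exactly representable
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin)
    if scale == 0.0:
        scale = 1.0  # all-zero input: any positive scale works
    zero_point = round(qmin - lo / scale)
    quantized = [min(qmax, max(qmin, round(v / scale) + zero_point)) for v in values]
    return quantized, scale, zero_point


def dequantize_affine(quantized, scale, zero_point):
    """Recover approximate float values from the integer representation."""
    return [(q - zero_point) * scale for q in quantized]


# Two hypothetical activation batches with very different ranges.
batch_small = [-0.5, 0.1, 0.4]
batch_large = [-6.0, 2.0, 10.0]
for batch in (batch_small, batch_large):
    q, scale, zero_point = quantize_affine(batch)
    restored = dequantize_affine(q, scale, zero_point)
    max_err = max(abs(a - b) for a, b in zip(batch, restored))
    print(f"scale={scale:.5f} zero_point={zero_point} max_error={max_err:.5f}")
```

Each batch gets its own scale, so a batch with a narrow range is represented more finely than one with a wide range. Weights, by contrast, are quantized once ahead of time.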

3.1 Check the model size

Let's first check the model size. We can observe a significant reduction in model size (FP32 total size: 438 MB; INT8 total size: 181 MB):

def print_size_of_model(model):
    torch.save(model.state_dict(), "temp.p")
    print('Size (MB):', os.path.getsize("temp.p")/1e6)
    os.remove('temp.p')

print_size_of_model(model)
print_size_of_model(quantized_model)

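The BERT model used in this tutorial (bert-base-uncased) has a vocabulary size V of 30522. With an embedding size of 768, the total size of the word embedding table is about 4 (bytes/FP32) * 30522 * 768 ≈ 94 MB, or roughly 90 MB. With the help of quantization, the size of the non-embedding part of the model is reduced from about 350 MB (FP32 model) to about 90 MB (INT8 model).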
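The size estimate can be checked with a few lines of arithmetic. The sketch below uses only numbers quoted in this section and treats 1 MB as 10^6 bytes, matching print_size_of_model. It assumes the embedding table stays in FP32, which follows from quantizing only torch.nn.Linear modules.

```python
vocab_size = 30522       # bert-base-uncased vocabulary size
embedding_dim = 768      # embedding size
bytes_per_fp32 = 4

embedding_mb = vocab_size * embedding_dim * bytes_per_fp32 / 1e6

fp32_total_mb = 438      # measured total size quoted above
int8_total_mb = 181

# The embedding table is assumed unchanged, so subtract it from both totals.
fp32_rest_mb = fp32_total_mb - embedding_mb
int8_rest_mb = int8_total_mb - embedding_mb

print(f"embedding table:    {embedding_mb:.1f} MB")
print(f"non-embedding FP32: {fp32_rest_mb:.1f} MB")
print(f"non-embedding INT8: {int8_rest_mb:.1f} MB")
```

The non-embedding part shrinks by close to 4x, which is what one would expect when 4-byte weights are replaced by 1-byte weights.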

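3.2 Evaluate the inference accuracy and time

Next, let's compare the inference time and the evaluation accuracy of the original FP32 model and the INT8 model after dynamic quantization.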

def time_model_evaluation(model, configs, tokenizer):
    eval_start_time = time.time()
    result = evaluate(configs, model, tokenizer, prefix="")
    eval_end_time = time.time()
    eval_duration_time = eval_end_time - eval_start_time
    print(result)
    print("Evaluate total time (seconds): {0:.1f}".format(eval_duration_time))

# Evaluate the original FP32 BERT model
time_model_evaluation(model, configs, tokenizer)

# Evaluate the INT8 BERT model after the dynamic quantization
time_model_evaluation(quantized_model, configs, tokenizer)

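Running this locally on a MacBook Pro, inference over all 408 examples in the MRPC dataset takes about 160 seconds without quantization and about 90 seconds with quantization. We summarize the results for running the quantized BERT model inference on a MacBook Pro as follows: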

| Prec | F1 score | Model Size | 1 thread | 4 threads |
| FP32 |  0.9019  |   438 MB   | 160 sec  | 85 sec    |
| INT8 |  0.902   |   181 MB   |  90 sec  | 46 sec    |
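The relative gains implied by the table can be computed directly. The sketch below only restates the table's figures; the dictionary layout and the ratio helper are illustrative choices.

```python
# Figures copied from the results table above.
results = {
    "FP32": {"size_mb": 438, "sec_1_thread": 160, "sec_4_threads": 85},
    "INT8": {"size_mb": 181, "sec_1_thread": 90, "sec_4_threads": 46},
}


def ratio(metric):
    """How many times larger or slower FP32 is than INT8 for a given metric."""
    return results["FP32"][metric] / results["INT8"][metric]


for metric in ("size_mb", "sec_1_thread", "sec_4_threads"):
    print(f"{metric}: FP32 / INT8 = {ratio(metric):.2f}x")
```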

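We have 0.6% lower F1 score accuracy after applying post-training dynamic quantization to the fine-tuned BERT model on the MRPC task. As a comparison, a recent paper [3] (Table 1) achieved 0.8788 by applying post-training dynamic quantization and 0.8956 by applying quantization-aware training. The main difference is that we support asymmetric quantization in PyTorch, while that paper supports symmetric quantization only.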
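One way to see why asymmetric quantization can help is to compare round-trip error on a one-sided range of values. The sketch below is a simplified pure-Python illustration with hypothetical sample values and a hypothetical helper name. It reproduces neither PyTorch's implementation nor the paper's, and it does not by itself explain the accuracy gap quoted above.

```python
def max_roundtrip_error(values, symmetric):
    """Quantize to 8 bits and back; return (scale, worst absolute error).

    Assumes the values are not all zero.
    """
    if symmetric:
        # Signed grid centred on zero: q in [-127, 127], zero point fixed at 0.
        qmin, qmax = -127, 127
        scale = max(abs(v) for v in values) / qmax
        zero_point = 0
    else:
        # Unsigned grid fitted to the observed range: q in [0, 255].
        qmin, qmax = 0, 255
        lo, hi = min(min(values), 0.0), max(max(values), 0.0)
        scale = (hi - lo) / (qmax - qmin)
        zero_point = round(qmin - lo / scale)
    worst = 0.0
    for v in values:
        q = min(qmax, max(qmin, round(v / scale) + zero_point))
        worst = max(worst, abs(v - (q - zero_point) * scale))
    return scale, worst


# Hypothetical non-negative values, e.g. the output of a ReLU-like layer.
activations = [0.0, 0.3, 1.7, 4.1, 6.0]
sym_scale, sym_err = max_roundtrip_error(activations, symmetric=True)
asym_scale, asym_err = max_roundtrip_error(activations, symmetric=False)
print(f"symmetric:  scale={sym_scale:.5f} max_error={sym_err:.5f}")
print(f"asymmetric: scale={asym_scale:.5f} max_error={asym_err:.5f}")
```

For one-sided data the symmetric grid spends half of its levels on negative values that never occur, so its step size is about twice as large.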

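Note that we set the number of threads to 1 for the single-thread comparison in this tutorial. We also support intra-op parallelization for these quantized INT8 operators. Users can set multiple threads with torch.set_num_threads(N), where N is the number of intra-op parallelization threads. One preliminary requirement for enabling intra-op parallelization is to build PyTorch with the right backend, such as OpenMP, Native, or TBB. You can use torch.__config__.parallel_info() to check the parallelization settings. On the same MacBook Pro, using PyTorch with the Native backend for parallelization, we can evaluate the MRPC dataset in about 46 seconds.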

3.3 Serialize the quantized model

We can serialize and save the quantized model for future use by tracing the model and then calling torch.jit.save.

def ids_tensor(shape, vocab_size):
    #  Creates a random int32 tensor of the shape within the vocab size
    return torch.randint(0, vocab_size, size=shape, dtype=torch.int, device='cpu')

input_ids = ids_tensor([8, 128], 2)
token_type_ids = ids_tensor([8, 128], 2)
attention_mask = ids_tensor([8, 128], vocab_size=2)
dummy_input = (input_ids, attention_mask, token_type_ids)
traced_model = torch.jit.trace(quantized_model, dummy_input)
torch.jit.save(traced_model, "bert_traced_eager_quant.pt")

To load the quantized model, we can use torch.jit.load:

loaded_quantized_model = torch.jit.load("bert_traced_eager_quant.pt")

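Conclusion

In this tutorial, we demonstrated how to convert a well-known state-of-the-art NLP model like BERT into a dynamically quantized model. Dynamic quantization can reduce the size of the model while having only a limited impact on accuracy.

Thanks for reading! As always, we welcome any feedback, so please create an issue if you have any questions.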

References

[1] J. Devlin, M. Chang, K. Lee and K. Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2018).

[2] HuggingFace Transformers.

[3] O. Zafrir, G. Boudoukh, P. Izsak and M. Wasserblat (2019). Q8BERT: Quantized 8Bit BERT.