Training with PyTorch

Follow along with the video below or on YouTube.

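Introduction

In past videos, we've discussed and demonstrated:

  • Building models with the neural network layers and functions of the torch.nn module

  • The mechanics of automated gradient computation, which is central to gradient-based model training

  • Using TensorBoard to visualize training progress and other activities

In this video, we'll be adding some new tools to your inventory:

  • We'll get familiar with the dataset and dataloader abstractions, and how they ease the process of feeding data to your model during a training loop

  • We'll discuss specific loss functions and when to use them

  • We'll look at PyTorch optimizers, which implement algorithms to adjust model weights based on the outcome of a loss function

Finally, we'll pull all of these together and see a full PyTorch training loop in action.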

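Dataset and DataLoader

The Dataset and DataLoader classes encapsulate the process of pulling your data from storage and exposing it to your training loop in batches.

The Dataset is responsible for accessing and processing single instances of data.

The DataLoader pulls instances of data from the Dataset (either automatically or with a sampler that you define), collects them in batches, and returns them for consumption by your training loop. The DataLoader works with all kinds of datasets, regardless of the type of data they contain.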
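
To make the division of labor concrete, here is a minimal sketch of a custom Dataset wrapped in a DataLoader. The SquaresDataset class and its contents are hypothetical and exist only for illustration; the only requirement assumed here is that a map-style dataset implements __len__ and __getitem__.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class SquaresDataset(Dataset):
    """Hypothetical toy dataset: item i is the pair (i, i squared)."""

    def __init__(self, size):
        self.size = size

    def __len__(self):
        # The DataLoader uses this to know how many instances exist
        return self.size

    def __getitem__(self, index):
        # Access and process a single instance of data
        x = torch.tensor([float(index)])
        y = torch.tensor([float(index) ** 2])
        return x, y


dataset = SquaresDataset(10)

# shuffle=False keeps the order deterministic for this illustration
loader = DataLoader(dataset, batch_size=4, shuffle=False)

# Each iteration yields one batch: single instances stacked along a new first dimension
batch_shapes = [tuple(inputs.shape) for inputs, targets in loader]
print(batch_shapes)
```

With 10 instances and a batch size of 4, the last batch is smaller than the others. Passing drop_last=True to the DataLoader would discard it instead.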

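For this tutorial, we'll be using the Fashion-MNIST dataset provided by TorchVision. We use torchvision.transforms.Normalize() to zero-center and normalize the distribution of the image tile content, and download both training and validation data splits.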

import torch
import torchvision
import torchvision.transforms as transforms

# PyTorch TensorBoard support
from torch.utils.tensorboard import SummaryWriter
from datetime import datetime


transform = transforms.Compose(
    [transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))])

# Create datasets for training & validation, download if necessary
training_set = torchvision.datasets.FashionMNIST('./data', train=True, transform=transform, download=True)
validation_set = torchvision.datasets.FashionMNIST('./data', train=False, transform=transform, download=True)

# Create data loaders for our datasets; shuffle for training, not for validation
training_loader = torch.utils.data.DataLoader(training_set, batch_size=4, shuffle=True)
validation_loader = torch.utils.data.DataLoader(validation_set, batch_size=4, shuffle=False)

# Class labels
classes = ('T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
        'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle Boot')

# Report split sizes
print('Training set has {} instances'.format(len(training_set)))
print('Validation set has {} instances'.format(len(validation_set)))
Training set has 60000 instances
Validation set has 10000 instances
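
The effect of Normalize((0.5,), (0.5,)) can be checked by hand. The sketch below uses plain tensor arithmetic rather than the transform itself, and the pixel values are made up for illustration. It assumes the input has already been scaled to the range [0, 1], as ToTensor() does for 8-bit images.

```python
import torch

# Stand-in for a ToTensor() result: pixel values scaled to [0, 1]
pixels = torch.tensor([0.0, 0.25, 0.5, 1.0])

mean, std = 0.5, 0.5  # the values passed to Normalize((0.5,), (0.5,))

# Normalize applies (value - mean) / std to every pixel of a channel
normalized = (pixels - mean) / std
print(normalized)

# The plotting helper in this tutorial undoes this with img / 2 + 0.5
restored = normalized / 2 + 0.5
print(torch.allclose(restored, pixels))
```

With these settings the range [0, 1] is mapped onto [-1, 1], which centers the data around zero.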

As always, let's visualize the data as a sanity check:

import matplotlib.pyplot as plt
import numpy as np

# Helper function for inline image display
def matplotlib_imshow(img, one_channel=False):
    if one_channel:
        img = img.mean(dim=0)
    img = img / 2 + 0.5     # unnormalize
    npimg = img.numpy()
    if one_channel:
        plt.imshow(npimg, cmap="Greys")
    else:
        plt.imshow(np.transpose(npimg, (1, 2, 0)))

dataiter = iter(training_loader)
images, labels = next(dataiter)

# Create a grid from the images and show them
img_grid = torchvision.utils.make_grid(images)
matplotlib_imshow(img_grid, one_channel=True)
print('  '.join(classes[labels[j]] for j in range(4)))

Sandal  Sneaker  Coat  Sneaker

The Model

The model we'll use in this example is a variant of LeNet-5. It should be familiar if you've watched the previous videos in this series.

import torch.nn as nn
import torch.nn.functional as F

# PyTorch models inherit from torch.nn.Module
class GarmentClassifier(nn.Module):
    def __init__(self):
        super(GarmentClassifier, self).__init__()
        self.conv1 = nn.Conv2d(1, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 4 * 4, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 4 * 4)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x


model = GarmentClassifier()
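
The 16 * 4 * 4 input size of the first linear layer follows from the image size. The sketch below traces tensor shapes through the convolution and pooling stages, assuming a single 28x28 grayscale input (the size of a Fashion-MNIST image). The layers are rebuilt here with the same settings as above so that the example stands alone, and the ReLU calls are left out because they do not change shapes.

```python
import torch
import torch.nn as nn

# Same layer settings as GarmentClassifier, rebuilt so the sketch stands alone
conv1 = nn.Conv2d(1, 6, 5)
pool = nn.MaxPool2d(2, 2)
conv2 = nn.Conv2d(6, 16, 5)

x = torch.zeros(1, 1, 28, 28)   # one grayscale 28x28 image
shapes = [tuple(x.shape)]

x = pool(conv1(x))              # 28 -> 24 (5x5 conv) -> 12 (2x2 pool)
shapes.append(tuple(x.shape))

x = pool(conv2(x))              # 12 -> 8 (5x5 conv) -> 4 (2x2 pool)
shapes.append(tuple(x.shape))

flat = x.view(-1, 16 * 4 * 4)   # flatten to one row of 256 features per image
shapes.append(tuple(flat.shape))

for s in shapes:
    print(s)
```

If the input resolution changed, the flattened size would change with it and the first linear layer would need to be resized to match.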

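Loss Function

For this example, we'll be using a cross-entropy loss. For demonstration purposes, we'll create batches of dummy output and label values, run them through the loss function, and examine the result.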

loss_fn = torch.nn.CrossEntropyLoss()

# NB: Loss functions expect data in batches, so we're creating batches of 4
# Represents the model's confidence in each of the 10 classes for a given input
dummy_outputs = torch.rand(4, 10)
# Represents the correct class among the 10 being tested
dummy_labels = torch.tensor([1, 5, 3, 7])

print(dummy_outputs)
print(dummy_labels)

loss = loss_fn(dummy_outputs, dummy_labels)
print('Total loss for this batch: {}'.format(loss.item()))
tensor([[0.7026, 0.1489, 0.0065, 0.6841, 0.4166, 0.3980, 0.9849, 0.6701, 0.4601,
         0.8599],
        [0.7461, 0.3920, 0.9978, 0.0354, 0.9843, 0.0312, 0.5989, 0.2888, 0.8170,
         0.4150],
        [0.8408, 0.5368, 0.0059, 0.8931, 0.3942, 0.7349, 0.5500, 0.0074, 0.0554,
         0.1537],
        [0.7282, 0.8755, 0.3649, 0.4566, 0.8796, 0.2390, 0.9865, 0.7549, 0.9105,
         0.5427]])
tensor([1, 5, 3, 7])
Total loss for this batch: 2.428950071334839
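
To see what the loss function computes, the following sketch reproduces cross-entropy by hand and compares it with the built-in result. It assumes the default 'mean' reduction, so the reported value is an average over the batch rather than a sum. The seed is arbitrary and the random values will not match the output above.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)  # arbitrary seed so the run is repeatable

logits = torch.rand(4, 10)
labels = torch.tensor([1, 5, 3, 7])

# Built-in loss, with the default 'mean' reduction over the batch
builtin = torch.nn.CrossEntropyLoss()(logits, labels)

# Manual version: log-softmax over classes, pick the correct class, negate, average
log_probs = F.log_softmax(logits, dim=1)
picked = log_probs[torch.arange(4), labels]
manual = -picked.mean()

print(builtin.item(), manual.item())
print(torch.allclose(builtin, manual))
```

Note that the loss expects raw, unnormalized scores: the softmax is applied inside the loss, which is why the model above returns the output of its last linear layer directly.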

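Optimizer

For this example, we'll be using simple stochastic gradient descent with momentum.

It can be instructive to try some variations on this optimization scheme:

  • Learning rate determines the size of the steps the optimizer takes. What does a different learning rate do to your training results, in terms of accuracy and convergence time?

  • Momentum nudges the optimizer in the direction of strongest gradient over multiple steps. What does changing this value do to your results?

  • Try some different optimization algorithms, such as averaged SGD, Adagrad, or Adam. How do your results differ?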

# Optimizers specified in the torch.optim package
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
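
As a rough illustration of what the learning rate and momentum settings do, the sketch below runs two optimizer steps on a toy loss and repeats the same update by hand. The loss, the starting weights, and the lr and momentum values are all made up for this example, and the hand-written update assumes the optimizer's default settings (no dampening, no Nesterov momentum, no weight decay).

```python
import torch

lr, momentum = 0.1, 0.9   # illustrative values, not the tutorial's settings

w = torch.tensor([1.0, 2.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=lr, momentum=momentum)

# Hand-rolled copy of the same update, tracked without autograd
w_manual = torch.tensor([1.0, 2.0])
velocity = torch.zeros(2)

for step in range(2):
    optimizer.zero_grad()
    loss = (w ** 2).sum()          # toy loss; its gradient is 2 * w
    loss.backward()
    optimizer.step()

    grad = 2 * w_manual
    velocity = momentum * velocity + grad   # accumulate a running direction
    w_manual = w_manual - lr * velocity     # step along it, scaled by lr

print(w.detach(), w_manual)
```

The second step is larger than plain gradient descent would take, because the velocity still carries most of the first gradient.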

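The Training Loop

Below, we have a function that performs one training epoch. It enumerates data from the DataLoader, and on each pass of the loop does the following:

  • Gets a batch of training data from the DataLoader

  • Zeros the optimizer's gradients

  • Performs an inference - that is, gets predictions from the model for an input batch

  • Calculates the loss for that set of predictions vs. the labels on the dataset

  • Calculates the backward gradients over the learning weights

  • Tells the optimizer to perform one learning step - that is, adjust the model's learning weights based on the observed gradients for this batch, according to the optimization algorithm we chose

  • Reports on the loss every 1000 batches

  • Finally, reports the average per-batch loss for the last 1000 batches, for comparison with a validation run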

def train_one_epoch(epoch_index, tb_writer):
    running_loss = 0.
    last_loss = 0.

    # Here, we use enumerate(training_loader) instead of
    # iter(training_loader) so that we can track the batch
    # index and do some intra-epoch reporting
    for i, data in enumerate(training_loader):
        # Every data instance is an input + label pair
        inputs, labels = data

        # Zero your gradients for every batch!
        optimizer.zero_grad()

        # Make predictions for this batch
        outputs = model(inputs)

        # Compute the loss and its gradients
        loss = loss_fn(outputs, labels)
        loss.backward()

        # Adjust learning weights
        optimizer.step()

        # Gather data and report
        running_loss += loss.item()
        if i % 1000 == 999:
            last_loss = running_loss / 1000 # loss per batch
            print('  batch {} loss: {}'.format(i + 1, last_loss))
            tb_x = epoch_index * len(training_loader) + i + 1
            tb_writer.add_scalar('Loss/train', last_loss, tb_x)
            running_loss = 0.

    return last_loss
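
The reason for zeroing gradients on every batch is that backward() adds to any gradient already stored on a tensor. The following sketch shows this on a single made-up weight, calling zero_() on the gradient directly in place of optimizer.zero_grad().

```python
import torch

w = torch.tensor([3.0], requires_grad=True)

# First backward pass: d(w^2)/dw = 2 * w = 6
(w ** 2).sum().backward()
first = w.grad.clone()

# Second backward pass without zeroing: the new gradient is added to the old one
(w ** 2).sum().backward()
accumulated = w.grad.clone()

# Zeroing first gives the gradient of the current pass only
w.grad.zero_()
(w ** 2).sum().backward()
fresh = w.grad.clone()

print(first, accumulated, fresh)
```

Without the reset, each optimizer step would act on the sum of all gradients seen so far instead of the gradient for the current batch.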

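Per-Epoch Activity

There are a couple of things we'll want to do once per epoch:

  • Perform validation by checking our relative loss on a set of data that was not used for training, and report this

  • Save a copy of the model

Here, we'll do our reporting in TensorBoard. This requires going to the command line to start TensorBoard, and opening it in another browser tab.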

# Initializing in a separate cell so we can easily add more epochs to the same run
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
writer = SummaryWriter('runs/fashion_trainer_{}'.format(timestamp))
epoch_number = 0

EPOCHS = 5

best_vloss = 1_000_000.

for epoch in range(EPOCHS):
    print('EPOCH {}:'.format(epoch_number + 1))

    # Make sure gradient tracking is on, and do a pass over the data
    model.train(True)
    avg_loss = train_one_epoch(epoch_number, writer)


    running_vloss = 0.0
    # Set the model to evaluation mode, disabling dropout and using population
    # statistics for batch normalization.
    model.eval()

    # Disable gradient computation and reduce memory consumption.
    with torch.no_grad():
        for i, vdata in enumerate(validation_loader):
            vinputs, vlabels = vdata
            voutputs = model(vinputs)
            vloss = loss_fn(voutputs, vlabels)
            running_vloss += vloss

    avg_vloss = running_vloss / (i + 1)
    print('LOSS train {} valid {}'.format(avg_loss, avg_vloss))

    # Log the running loss averaged per batch
    # for both training and validation
    writer.add_scalars('Training vs. Validation Loss',
                    { 'Training' : avg_loss, 'Validation' : avg_vloss },
                    epoch_number + 1)
    writer.flush()

    # Track best performance, and save the model's state
    if avg_vloss < best_vloss:
        best_vloss = avg_vloss
        model_path = 'model_{}_{}'.format(timestamp, epoch_number)
        torch.save(model.state_dict(), model_path)

    epoch_number += 1
EPOCH 1:
  batch 1000 loss: 1.6334228584356607
  batch 2000 loss: 0.8325267538074403
  batch 3000 loss: 0.7359380583595484
  batch 4000 loss: 0.6198329215242994
  batch 5000 loss: 0.6000315657821484
  batch 6000 loss: 0.555109024874866
  batch 7000 loss: 0.5260250487388112
  batch 8000 loss: 0.4973462742221891
  batch 9000 loss: 0.4781935699362075
  batch 10000 loss: 0.47880298678041433
  batch 11000 loss: 0.45598648857555235
  batch 12000 loss: 0.4327470133750467
  batch 13000 loss: 0.41800182418141046
  batch 14000 loss: 0.4115047634313814
  batch 15000 loss: 0.4211296908891527
LOSS train 0.4211296908891527 valid 0.414460688829422
EPOCH 2:
  batch 1000 loss: 0.3879808729066281
  batch 2000 loss: 0.35912817339546743
  batch 3000 loss: 0.38074520684120944
  batch 4000 loss: 0.3614532373107213
  batch 5000 loss: 0.36850082185724753
  batch 6000 loss: 0.3703581801643886
  batch 7000 loss: 0.38547042514081115
  batch 8000 loss: 0.37846584360170527
  batch 9000 loss: 0.3341486988377292
  batch 10000 loss: 0.3433013284947956
  batch 11000 loss: 0.35607743899174965
  batch 12000 loss: 0.3499939931873523
  batch 13000 loss: 0.33874178926000603
  batch 14000 loss: 0.35130289171106416
  batch 15000 loss: 0.3394507191307202
LOSS train 0.3394507191307202 valid 0.3581162691116333
EPOCH 3:
  batch 1000 loss: 0.3319729989422485
  batch 2000 loss: 0.29558994361863006
  batch 3000 loss: 0.3107374766407593
  batch 4000 loss: 0.3298987646112146
  batch 5000 loss: 0.30858693152241906
  batch 6000 loss: 0.33916381367447684
  batch 7000 loss: 0.3105102765217889
  batch 8000 loss: 0.3011080777524912
  batch 9000 loss: 0.3142058177240979
  batch 10000 loss: 0.31458891937109
  batch 11000 loss: 0.31527258940579483
  batch 12000 loss: 0.31501667268342864
  batch 13000 loss: 0.3011875962628328
  batch 14000 loss: 0.30012811454350596
  batch 15000 loss: 0.31833117976446373
LOSS train 0.31833117976446373 valid 0.3307691514492035
EPOCH 4:
  batch 1000 loss: 0.2786161053752294
  batch 2000 loss: 0.27965198021690596
  batch 3000 loss: 0.28595415444140965
  batch 4000 loss: 0.292985666413857
  batch 5000 loss: 0.3069892351147719
  batch 6000 loss: 0.29902250939945224
  batch 7000 loss: 0.2863366014406201
  batch 8000 loss: 0.2655441066541243
  batch 9000 loss: 0.3045048695363293
  batch 10000 loss: 0.27626545656517554
  batch 11000 loss: 0.2808379335970967
  batch 12000 loss: 0.29241049340573955
  batch 13000 loss: 0.28030834131941446
  batch 14000 loss: 0.2983542350126445
  batch 15000 loss: 0.3009556676162611
LOSS train 0.3009556676162611 valid 0.41686952114105225
EPOCH 5:
  batch 1000 loss: 0.2614263167564495
  batch 2000 loss: 0.2587047562422049
  batch 3000 loss: 0.2642477260621345
  batch 4000 loss: 0.2825975873669813
  batch 5000 loss: 0.26987933717705165
  batch 6000 loss: 0.2759250026817317
  batch 7000 loss: 0.26055969463163275
  batch 8000 loss: 0.29164007206353565
  batch 9000 loss: 0.2893096504513578
  batch 10000 loss: 0.2486029507305684
  batch 11000 loss: 0.2732803234480907
  batch 12000 loss: 0.27927226484491985
  batch 13000 loss: 0.2686819267635074
  batch 14000 loss: 0.24746483912148323
  batch 15000 loss: 0.27903492261294194
LOSS train 0.27903492261294194 valid 0.31206756830215454
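
The validation pass in the loop above relies on model.eval() and torch.no_grad(). The sketch below shows what each one changes, using a hypothetical toy network with a dropout layer. The GarmentClassifier in this tutorial has no dropout or batch normalization, so for it eval() mainly serves as a safe habit.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed

# Hypothetical toy model with a dropout layer (the tutorial's model has none)
net = nn.Sequential(nn.Linear(8, 8), nn.Dropout(p=0.5), nn.Linear(8, 2))
x = torch.ones(1, 8)

net.eval()                      # dropout becomes a no-op
with torch.no_grad():           # no autograd graph is recorded
    a = net(x)
    b = net(x)

eval_is_deterministic = torch.equal(a, b)
tracks_grad_in_no_grad = a.requires_grad

net.train()                     # back to training behaviour
c = net(x)
tracks_grad_in_train = c.requires_grad

print(eval_is_deterministic, tracks_grad_in_no_grad, tracks_grad_in_train)
```

The two settings are independent: eval() changes layer behavior, while no_grad() only stops gradient bookkeeping, which saves memory during validation.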

To load a saved version of the model:

saved_model = GarmentClassifier()
saved_model.load_state_dict(torch.load(PATH))

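Once you've loaded the model, it's ready for whatever you need it for: more training, inference, or analysis.

Note that if your model has constructor parameters that affect model structure, you'll need to provide them and configure the model identically to the state in which it was saved.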
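
The point about constructor parameters can be sketched with a small round trip. TinyNet and its hidden_size argument are hypothetical, and an in-memory buffer stands in for a file on disk so that the example leaves nothing behind.

```python
import io
import torch
import torch.nn as nn


class TinyNet(nn.Module):
    """Hypothetical model whose structure depends on a constructor argument."""

    def __init__(self, hidden_size):
        super().__init__()
        self.fc1 = nn.Linear(4, hidden_size)
        self.fc2 = nn.Linear(hidden_size, 2)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))


original = TinyNet(hidden_size=16)

# An in-memory buffer stands in for a file on disk
buffer = io.BytesIO()
torch.save(original.state_dict(), buffer)

# Same constructor argument: the saved tensors fit the new instance
buffer.seek(0)
restored = TinyNet(hidden_size=16)
restored.load_state_dict(torch.load(buffer))

x = torch.ones(1, 4)
same_output = torch.equal(original(x), restored(x))

# Different constructor argument: the tensor shapes no longer match
buffer.seek(0)
try:
    TinyNet(hidden_size=32).load_state_dict(torch.load(buffer))
    mismatch_raised = False
except RuntimeError:
    mismatch_raised = True

print(same_output, mismatch_raised)
```

A state dict stores only tensors keyed by parameter name, not the class or its arguments, so one option is to save the constructor arguments alongside it.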
