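A Guide on Good Usage of `non_blocking` and `pin_memory()` in PyTorch

Introduction

Transferring data from the CPU to the GPU is a fundamental operation in many PyTorch applications, so it is important to understand the most effective tools and options for moving data between devices. This tutorial examines two key methods for device-to-device data transfer in PyTorch: pin_memory() and to() with the non_blocking=True option.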

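What you will learn

Asynchronous transfers and memory pinning can both speed up tensor transfers from the CPU to the GPU. There are important caveats, however:

Using tensor.pin_memory().to(device, non_blocking=True) can be up to twice as slow as a plain tensor.to(device).
In general, tensor.to(device, non_blocking=True) is an effective way to improve transfer speed.
While cpu_tensor.to("cuda", non_blocking=True).mean() executes correctly, cuda_tensor.to("cpu", non_blocking=True).mean() can produce erroneous results.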

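Preamble

The performance figures reported in this tutorial depend on the system used to build it. Although the conclusions apply across systems, specific observations may vary slightly with the available hardware, especially on older hardware. The primary goal of this tutorial is to provide a theoretical framework for understanding CPU-to-GPU data transfers. Any design decision should nevertheless be tailored to the case at hand and guided by benchmarked throughput measurements and the specific requirements of the task.

import torch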

assert torch.cuda.is_available(), "A cuda device is required to run this tutorial"

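This tutorial requires tensordict. If it is not yet installed in your environment, run the following command in a separate cell:

# Install tensordict with the following command
!pip3 install tensordict

We start by outlining the theory behind these concepts, then move on to concrete tests of these features.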

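Background

Memory management basics

When a CPU tensor is created in PyTorch, its content needs to be placed in memory. The memory in question is a rather complex concept that is worth looking at carefully. We distinguish two types of memory handled by the Memory Management Unit: RAM (for simplicity) and swap space on disk (which may or may not be a hard drive). Together, the available space on disk and in RAM (physical memory) makes up the virtual memory, an abstraction of the total resources available. In short, virtual memory makes the available space larger than what RAM alone provides and creates the illusion that main memory is larger than it actually is.

Under normal circumstances, a regular CPU tensor is pageable: it is divided into blocks called pages that can live anywhere in virtual memory, whether in RAM or on disk. As mentioned earlier, the advantage is that memory appears larger than main memory actually is.

Typically, when a program accesses a page that is not in RAM, a "page fault" occurs and the operating system (OS) brings the page back into RAM ("swap in" or "page in"). In turn, the OS may have to swap out (or "page out") another page to make room for the new one.

In contrast to pageable memory, pinned (or page-locked, or non-pageable) memory is a type of memory that cannot be swapped out to disk. It allows faster and more predictable access times, with the downside that it is a more limited resource than pageable memory (that is, main memory).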

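CUDA and (non-)pageable memory

To understand how CUDA copies a tensor from the CPU to CUDA, consider the following two scenarios:

If the memory is page-locked, the device can access the data directly in main memory. The memory addresses are well defined and fixed, so functions that read this data can be significantly accelerated.
If the memory is pageable, all the pages have to be brought into main memory before being sent to the GPU. This operation can take time and is less predictable than when executed on page-locked tensors.

More precisely, when CUDA sends pageable data from the CPU to the GPU, it must first create a page-locked copy of that data and then perform the transfer.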

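Asynchronous vs. synchronous operations with `non_blocking=True` (CUDA `cudaMemcpyAsync`)

When copying data from a host (for example, the CPU) to a device (for example, a GPU), the CUDA toolkit offers ways to do so synchronously or asynchronously with respect to the host.

In practice, when to() is called, PyTorch always calls cudaMemcpyAsync. If non_blocking=False (the default), a cudaStreamSynchronize is called after each cudaMemcpyAsync, which makes the call to to() blocking in the main thread. If non_blocking=True, no synchronization is triggered and the main thread on the host is not blocked. From the host's perspective, multiple tensors can therefore be sent to the device concurrently, because the thread does not need to wait for one transfer to complete before starting the next.

In general, the transfer is blocking on the device side (even when it is not on the host side): a copy on the device cannot occur while another operation is executing. In some advanced scenarios, however, a copy and a kernel execution can run concurrently on the GPU side. As the following example shows, three requirements must be met: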

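The device must have at least one free DMA (Direct Memory Access) engine. Modern GPU architectures such as Volta, Tesla, or H100 devices typically have more than one DMA engine.

The transfer must be done on a separate, non-default CUDA stream. In PyTorch, CUDA streams can be handled using Stream.

The source data must be in pinned memory.

We demonstrate this by profiling the following script.

import contextlib

from torch.cuda import Stream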


s = Stream()

torch.manual_seed(42)
t1_cpu_pinned = torch.randn(1024**2 * 5, pin_memory=True)
t2_cpu_paged = torch.randn(1024**2 * 5, pin_memory=False)
t3_cuda = torch.randn(1024**2 * 5, device="cuda:0")

assert torch.cuda.is_available()
device = torch.device("cuda", torch.cuda.current_device())


# The function we want to profile
def inner(pinned: bool, streamed: bool):
    with torch.cuda.stream(s) if streamed else contextlib.nullcontext():
        if pinned:
            t1_cuda = t1_cpu_pinned.to(device, non_blocking=True)
        else:
            t2_cuda = t2_cpu_paged.to(device, non_blocking=True)
        t_star_cuda_h2d_event = s.record_event()
    # This operation can be executed during the CPU to GPU copy if and only if the tensor is pinned and the copy is
    #  done in the other stream
    t3_cuda_mul = t3_cuda * t3_cuda * t3_cuda
    t3_cuda_h2d_event = torch.cuda.current_stream().record_event()
    t_star_cuda_h2d_event.synchronize()
    t3_cuda_h2d_event.synchronize()


# Our profiler: profiles the `inner` function and stores the results in a .json file
def benchmark_with_profiler(
    pinned,
    streamed,
) -> None:
    torch._C._profiler._set_cuda_sync_enabled_val(True)
    wait, warmup, active = 1, 1, 2
    num_steps = wait + warmup + active
    rank = 0
    with torch.profiler.profile(
        activities=[
            torch.profiler.ProfilerActivity.CPU,
            torch.profiler.ProfilerActivity.CUDA,
        ],
        schedule=torch.profiler.schedule(
            wait=wait, warmup=warmup, active=active, repeat=1, skip_first=1
        ),
    ) as prof:
        for step_idx in range(1, num_steps + 1):
            inner(streamed=streamed, pinned=pinned)
            if rank is None or rank == 0:
                prof.step()
    prof.export_chrome_trace(f"trace_streamed{int(streamed)}_pinned{int(pinned)}.json")

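Loading these profile traces in Chrome (chrome://tracing) shows the following results. First, let's see what happens when the arithmetic operation on t3_cuda is executed after a pageable tensor is sent to the GPU in the main stream:

benchmark_with_profiler(streamed=False, pinned=False)

Using a pinned tensor does not change the trace much; the two operations are still executed one after the other:

benchmark_with_profiler(streamed=False, pinned=True)

Sending a pageable tensor to the GPU on a separate stream is also a blocking operation:

benchmark_with_profiler(streamed=True, pinned=False)

Only the copy of a pinned tensor on a separate stream overlaps with another CUDA kernel executed on the main stream:

benchmark_with_profiler(streamed=True, pinned=True)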

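A PyTorch perspective

`pin_memory()`

PyTorch makes it possible to create tensors in, or send them to, page-locked memory through the pin_memory() method and through constructor arguments. On a machine where CUDA is initialized, a CPU tensor can be moved to page-locked memory with the pin_memory() method. Importantly, pin_memory() is blocking on the main thread of the host: it waits for the tensor to be copied to page-locked memory before executing the next operation. New tensors can also be created directly in pinned memory by passing pin_memory=True to constructors such as zeros() and ones().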
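As a minimal sketch of the two routes just described, the snippet below pins an existing tensor with pin_memory() and allocates another one directly in pinned memory, then checks both with is_pinned(). The helper name make_pinned_pair is introduced here for illustration only. The sketch assumes a CUDA-capable machine and falls back to ordinary pageable tensors otherwise, so that it still runs on a CPU-only setup.

```python
import torch


def make_pinned_pair(n: int = 1024):
    """Return (pageable tensor, pinned copy of it, tensor allocated pinned)."""
    # Pinning needs a CUDA runtime; fall back to pageable memory without one.
    can_pin = torch.cuda.is_available()
    pageable = torch.randn(n)
    # pin_memory() returns a new tensor whose storage is page-locked.
    pinned_copy = pageable.pin_memory() if can_pin else pageable.clone()
    # Constructors accept pin_memory=True to allocate pinned storage directly.
    pinned_direct = torch.zeros(n, pin_memory=can_pin)
    return pageable, pinned_copy, pinned_direct


pageable, pinned_copy, pinned_direct = make_pinned_pair()
print(pageable.is_pinned(), pinned_copy.is_pinned(), pinned_direct.is_pinned())
```

Note that pin_memory() leaves the original tensor untouched and pageable; only the returned copy lives in page-locked memory.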

Let us check the speed of pinning memory and of sending tensors to CUDA:

import torch
import gc
from torch.utils.benchmark import Timer
import matplotlib.pyplot as plt


def timer(cmd):
    median = (
        Timer(cmd, globals=globals())
        .adaptive_autorange(min_run_time=1.0, max_run_time=20.0)
        .median
        * 1000
    )
    print(f"{cmd}: {median: 4.4f} ms")
    return median


# A tensor in pageable memory
pageable_tensor = torch.randn(1_000_000)

# A tensor in page-locked (pinned) memory
pinned_tensor = torch.randn(1_000_000, pin_memory=True)

# Runtimes:
pageable_to_device = timer("pageable_tensor.to('cuda:0')")
pinned_to_device = timer("pinned_tensor.to('cuda:0')")
pin_mem = timer("pageable_tensor.pin_memory()")
pin_mem_to_device = timer("pageable_tensor.pin_memory().to('cuda:0')")

# Ratios:
r1 = pinned_to_device / pageable_to_device
r2 = pin_mem_to_device / pageable_to_device

# Create a figure with the results
fig, ax = plt.subplots()

xlabels = [0, 1, 2]
bar_labels = [
    "pageable_tensor.to(device) (1x)",
    f"pinned_tensor.to(device) ({r1:4.2f}x)",
    f"pageable_tensor.pin_memory().to(device) ({r2:4.2f}x)"
    f"\npin_memory()={100*pin_mem/pin_mem_to_device:.2f}% of runtime.",
]
values = [pageable_to_device, pinned_to_device, pin_mem_to_device]
colors = ["tab:blue", "tab:red", "tab:orange"]
ax.bar(xlabels, values, label=bar_labels, color=colors)

ax.set_ylabel("Runtime (ms)")
ax.set_title("Device casting runtime (pin-memory)")
ax.set_xticks([])
ax.legend()

plt.show()

# Clear tensors
del pageable_tensor, pinned_tensor
_ = gc.collect()

Device casting runtime (pin-memory)

pageable_tensor.to('cuda:0'):  0.4800 ms
pinned_tensor.to('cuda:0'):  0.3727 ms
pageable_tensor.pin_memory():  0.3967 ms
pageable_tensor.pin_memory().to('cuda:0'):  0.7629 ms

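We can observe that casting a pinned-memory tensor to the GPU is indeed faster than casting a pageable tensor (roughly 22% less time in this run), because under the hood a pageable tensor must be copied to pinned memory before being sent to the GPU.

However, contrary to a somewhat common belief, calling pin_memory() on a pageable tensor before casting it to the GPU does not bring any significant speed-up. On the contrary, this call is usually slower than just executing the transfer. This makes sense, since we are asking Python to perform an operation that CUDA would perform anyway before copying the data from the host to the device.

The PyTorch implementation of pin_memory, which relies on creating a brand-new storage in pinned memory through cudaHostAlloc, could in rare cases be faster than transferring the data in chunks as cudaMemcpy does. Here too, the observation may vary depending on the available hardware, the size of the tensors being sent, or the amount of available RAM.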
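The timings reported above make the cost structure explicit. The short calculation below simply reuses the four medians printed by this run (they will differ on other hardware) to show that pinning and then copying costs roughly the sum of its two parts, and that this sum exceeds the direct pageable transfer.

```python
# Median runtimes (ms) copied from the output above; hardware-specific.
pageable_to_device = 0.4800   # pageable_tensor.to('cuda:0')
pinned_to_device = 0.3727     # pinned_tensor.to('cuda:0')
pin_mem = 0.3967              # pageable_tensor.pin_memory()
pin_mem_to_device = 0.7629    # pageable_tensor.pin_memory().to('cuda:0')

# Pinning then copying should cost about as much as its two steps combined.
sum_of_parts = pin_mem + pinned_to_device
print(f"pin + pinned copy: {sum_of_parts:.4f} ms (measured: {pin_mem_to_device:.4f} ms)")

# Slowdown of the explicit pin-then-copy path relative to a direct transfer.
slowdown = pin_mem_to_device / pageable_to_device
print(f"pin_memory().to() vs to(): {slowdown:.2f}x")

# Share of the combined call spent inside pin_memory().
pin_share = pin_mem / pin_mem_to_device
print(f"pin_memory() share of runtime: {pin_share:.0%}")
```

In this run the explicit pin-then-copy path takes about 1.59 times as long as the direct transfer, with about half of that time spent in pin_memory() itself.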

`non_blocking=True`

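As mentioned earlier, many PyTorch operations can be executed asynchronously with respect to the host through the non_blocking argument.

Here, to measure the benefit of non_blocking accurately, we design a slightly more complex experiment, since we want to assess how fast multiple tensors can be sent to the GPU with and without non_blocking.

# A simple loop that copies all tensors to cuda
def copy_to_device(*tensors):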
    result = []
    for tensor in tensors:
        result.append(tensor.to("cuda:0"))
    return result


# A loop that copies all tensors to cuda asynchronously
def copy_to_device_nonblocking(*tensors):
    result = []
    for tensor in tensors:
        result.append(tensor.to("cuda:0", non_blocking=True))
    # We need to synchronize
    torch.cuda.synchronize()
    return result


# Create a list of tensors
tensors = [torch.randn(1000) for _ in range(1000)]
to_device = timer("copy_to_device(*tensors)")
to_device_nonblocking = timer("copy_to_device_nonblocking(*tensors)")

# Ratio
r1 = to_device_nonblocking / to_device

# Plot the results
fig, ax = plt.subplots()

xlabels = [0, 1]
bar_labels = ["to(device) (1x)", f"to(device, non_blocking=True) ({r1:4.2f}x)"]
colors = ["tab:blue", "tab:red"]
values = [to_device, to_device_nonblocking]

ax.bar(xlabels, values, label=bar_labels, color=colors)

ax.set_ylabel("Runtime (ms)")
ax.set_title("Device casting runtime (non-blocking)")
ax.set_xticks([])
ax.legend()

plt.show()

Device casting runtime (non-blocking)

copy_to_device(*tensors):  28.4692 ms
copy_to_device_nonblocking(*tensors):  22.1842 ms

To get a better sense of what is happening here, let us profile these two functions:

from torch.profiler import profile, ProfilerActivity


def profile_mem(cmd):
    with profile(activities=[ProfilerActivity.CPU]) as prof:
        exec(cmd)
    print(cmd)
    print(prof.key_averages().table(row_limit=10))

Let's first look at the call stack of a regular to(device):

print("Call to `to(device)`", profile_mem("copy_to_device(*tensors)"))

copy_to_device(*tensors)
*------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                     Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
*------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                 aten::to         3.62%       1.247ms       100.00%      34.421ms      34.421us          1000
           aten::_to_copy        13.34%       4.592ms        96.38%      33.174ms      33.174us          1000
      aten::empty_strided        24.88%       8.564ms        24.88%       8.564ms       8.564us          1000
              aten::copy_        18.55%       6.385ms        58.16%      20.019ms      20.019us          1000
          cudaMemcpyAsync        20.39%       7.017ms        20.39%       7.017ms       7.017us          1000
    cudaStreamSynchronize        19.22%       6.617ms        19.22%       6.617ms       6.617us          1000
*------------------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 34.421ms

Call to `to(device)` None

And now the non_blocking version:

print(
    "Call to `to(device, non_blocking=True)`",
    profile_mem("copy_to_device_nonblocking(*tensors)"),
)

copy_to_device_nonblocking(*tensors)
*------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                     Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
*------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                 aten::to         4.90%       1.306ms        99.88%      26.611ms      26.611us          1000
           aten::_to_copy        16.24%       4.326ms        94.98%      25.305ms      25.305us          1000
      aten::empty_strided        31.40%       8.365ms        31.40%       8.365ms       8.365us          1000
              aten::copy_        21.62%       5.761ms        47.34%      12.614ms      12.614us          1000
          cudaMemcpyAsync        25.72%       6.853ms        25.72%       6.853ms       6.853us          1000
    cudaDeviceSynchronize         0.12%      33.074us         0.12%      33.074us      33.074us             1
*------------------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 26.644ms

Call to `to(device, non_blocking=True)` None

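The results are clearly better with non_blocking=True (about 22 ms versus 28 ms in the timings above), because all transfers are launched from the host without waiting on one another and only one synchronization is performed.

The benefit will vary with the number and size of the tensors, as well as with the hardware being used.

Interestingly, the blocking to("cuda") actually performs the same asynchronous device casting operation (cudaMemcpyAsync) as the non_blocking=True version, only with a synchronization point after each copy.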
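To see how much of the difference comes from synchronization alone, the calculation below compares the totals from the two profiler tables above. The numbers are copied from this run and are hardware-specific; the point is only the accounting, not the absolute values.

```python
# Totals (ms) copied from the two profiler tables above; hardware-specific.
blocking_total = 34.421       # Self CPU time total, to(device)
non_blocking_total = 26.644   # Self CPU time total, to(device, non_blocking=True)
stream_sync = 6.617           # 1000 cudaStreamSynchronize calls (blocking run)
device_sync = 0.033074        # single cudaDeviceSynchronize (non-blocking run)

# Host time saved by switching to non_blocking=True.
gap = blocking_total - non_blocking_total
# Host time saved purely by synchronizing once instead of 1000 times.
sync_saving = stream_sync - device_sync

print(f"host time saved: {gap:.3f} ms")
print(f"share explained by fewer synchronizations: {sync_saving / gap:.0%}")
```

In this run, replacing the per-copy synchronizations with a single one accounts for most of the host time saved; the cudaMemcpyAsync cost itself is nearly identical in both tables.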

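Synergies

Now that we have established that transferring a tensor that is already in pinned memory to the GPU is faster than transferring it from pageable memory, and that doing these transfers asynchronously is also faster than doing them synchronously, we can benchmark combinations of these approaches. First, let's write a couple of new functions that call pin_memory and to(device) on each tensor:

def pin_copy_to_device(*tensors):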
    result = []
    for tensor in tensors:
        result.append(tensor.pin_memory().to("cuda:0"))
    return result


def pin_copy_to_device_nonblocking(*tensors):
    result = []
    for tensor in tensors:
        result.append(tensor.pin_memory().to("cuda:0", non_blocking=True))
    # We need to synchronize
    torch.cuda.synchronize()
    return result

The benefits of using pin_memory() are more pronounced for fairly large batches of large tensors:

tensors = [torch.randn(1_000_000) for _ in range(1000)]
page_copy = timer("copy_to_device(*tensors)")
page_copy_nb = timer("copy_to_device_nonblocking(*tensors)")

tensors_pinned = [torch.randn(1_000_000, pin_memory=True) for _ in range(1000)]
pinned_copy = timer("copy_to_device(*tensors_pinned)")
pinned_copy_nb = timer("copy_to_device_nonblocking(*tensors_pinned)")

pin_and_copy = timer("pin_copy_to_device(*tensors)")
pin_and_copy_nb = timer("pin_copy_to_device_nonblocking(*tensors)")

# Plot
strategies = ("pageable copy", "pinned copy", "pin and copy")
blocking = {
    "blocking": [page_copy, pinned_copy, pin_and_copy],
    "non-blocking": [page_copy_nb, pinned_copy_nb, pin_and_copy_nb],
}

x = torch.arange(3)
width = 0.25
multiplier = 0


fig, ax = plt.subplots(layout="constrained")

for attribute, runtimes in blocking.items():
    offset = width * multiplier
    rects = ax.bar(x + offset, runtimes, width, label=attribute)
    ax.bar_label(rects, padding=3, fmt="%.2f")
    multiplier += 1

# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_ylabel("Runtime (ms)")
ax.set_title("Runtime (pin-mem and non-blocking)")
ax.set_xticks([0, 1, 2])
ax.set_xticklabels(strategies)
plt.setp(ax.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor")
ax.legend(loc="upper left", ncols=3)

plt.show()

del tensors, tensors_pinned
_ = gc.collect()

Runtime (pin-mem and non-blocking)

copy_to_device(*tensors):  634.8358 ms
copy_to_device_nonblocking(*tensors):  561.5676 ms
copy_to_device(*tensors_pinned):  372.5108 ms
copy_to_device_nonblocking(*tensors_pinned):  345.2958 ms
pin_copy_to_device(*tensors):  1014.9210 ms
pin_copy_to_device_nonblocking(*tensors):  695.7035 ms
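The six timings are easier to compare once they are normalized. The sketch below expresses each one relative to the blocking pageable copy, using the medians printed by this run (hardware-specific, copied verbatim from the output above).

```python
# Median runtimes (ms) copied from the output above; hardware-specific.
runtimes = {
    "pageable copy": {"blocking": 634.8358, "non-blocking": 561.5676},
    "pinned copy": {"blocking": 372.5108, "non-blocking": 345.2958},
    "pin and copy": {"blocking": 1014.9210, "non-blocking": 695.7035},
}

# Normalize every measurement by the blocking pageable copy.
baseline = runtimes["pageable copy"]["blocking"]
relative = {
    (strategy, mode): value / baseline
    for strategy, modes in runtimes.items()
    for mode, value in modes.items()
}

for (strategy, mode), ratio in relative.items():
    print(f"{strategy:>13} / {mode:<12}: {ratio:.2f}x")

fastest = min(relative, key=relative.get)
slowest = max(relative, key=relative.get)
print("fastest:", fastest, "slowest:", slowest)
```

Read this way, tensors that are already pinned give the fastest transfers, while pinning on the fly from the main thread is slower than the pageable baseline in this run, even when the copy itself is non-blocking.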

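Other copy directions (GPU -> CPU, CPU -> MPS)

Until now, we have assumed that asynchronous copies from the CPU to the GPU are safe. This is generally true because CUDA automatically handles synchronization to ensure that the data being accessed is valid at read time, as long as the tensor is in pageable memory.

In other cases, however, we cannot make the same assumption: when a tensor is placed in pinned memory, mutating the original copy after calling the host-to-device transfer may corrupt the data received on the GPU. Similarly, when a transfer goes in the opposite direction, from GPU to CPU, or from any device that is not a CPU or GPU to any device that is not a CUDA-handled GPU (such as MPS), there is no guarantee that the data read at the destination is valid without explicit synchronization.

In these scenarios, the transfers offer no assurance that the copy will be complete at the time of data access. Consequently, the data on the host might be incomplete or incorrect, effectively rendering it garbage.

Let's first demonstrate this with a pinned-memory tensor: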

DELAY = 100000000
try:
    i = -1
    for i in range(100):
        # Create a tensor in pin-memory
        cpu_tensor = torch.ones(1024, 1024, pin_memory=True)
        torch.cuda.synchronize()
        # Send the tensor to CUDA
        cuda_tensor = cpu_tensor.to("cuda", non_blocking=True)
        torch.cuda._sleep(DELAY)
        # Corrupt the original tensor
        cpu_tensor.zero_()
        assert (cuda_tensor == 1).all()
    print("No test failed with non_blocking and pinned tensor")
except AssertionError:
    print(f"{i}th test failed with non_blocking and pinned tensor. Skipping remaining tests")

1th test failed with non_blocking and pinned tensor. Skipping remaining tests

Using a pageable tensor always works:

i = -1
for i in range(100):
    # Create a tensor in pageable memory
    cpu_tensor = torch.ones(1024, 1024)
    torch.cuda.synchronize()
    # Send the tensor to CUDA
    cuda_tensor = cpu_tensor.to("cuda", non_blocking=True)
    torch.cuda._sleep(DELAY)
    # Corrupt the original tensor
    cpu_tensor.zero_()
    assert (cuda_tensor == 1).all()
print("No test failed with non_blocking and pageable tensor")

No test failed with non_blocking and pageable tensor

Now let's demonstrate that a CUDA-to-CPU transfer also fails to produce reliable outputs without synchronization:

tensor = (
    torch.arange(1, 1_000_000, dtype=torch.double, device="cuda")
    .expand(100, 999999)
    .clone()
)
torch.testing.assert_close(
    tensor.mean(), torch.tensor(500_000, dtype=torch.double, device="cuda")
), tensor.mean()
try:
    i = -1
    for i in range(100):
        cpu_tensor = tensor.to("cpu", non_blocking=True)
        torch.testing.assert_close(
            cpu_tensor.mean(), torch.tensor(500_000, dtype=torch.double)
        )
    print("No test failed with non_blocking")
except AssertionError:
    print(f"{i}th test failed with non_blocking. Skipping remaining tests")
try:
    i = -1
    for i in range(100):
        cpu_tensor = tensor.to("cpu", non_blocking=True)
        torch.cuda.synchronize()
        torch.testing.assert_close(
            cpu_tensor.mean(), torch.tensor(500_000, dtype=torch.double)
        )
    print("No test failed with synchronize")
except AssertionError:
    print(f"One test failed with synchronize: {i}th assertion!")

0th test failed with non_blocking. Skipping remaining tests
No test failed with synchronize

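Generally, asynchronous copies to a device are safe without explicit synchronization only when the target is a CUDA-enabled device and the original tensor is in pageable memory.

In summary, copying data from pageable CPU memory to the GPU is safe when using non_blocking=True. With a pinned source, the original tensor must not be modified until the copy has completed. For any other direction, non_blocking=True can still be used, but the user must make sure that a device synchronization is executed before the data is accessed.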
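One way to follow this rule for device-to-host copies is to wrap the transfer and the synchronization in a single helper. The function below is a minimal sketch under that assumption; the name to_cpu_synced is hypothetical, and the sketch waits on the current stream of the source device, which is where the copy is expected to be queued. On a machine without CUDA it degrades to a plain CPU tensor so that it still runs.

```python
import torch


def to_cpu_synced(tensor: torch.Tensor) -> torch.Tensor:
    """Copy a tensor to the CPU asynchronously, then wait for the copy."""
    out = tensor.to("cpu", non_blocking=True)
    if tensor.is_cuda:
        # Wait for the stream that carries the copy before the host reads `out`.
        torch.cuda.current_stream(tensor.device).synchronize()
    return out


device = "cuda" if torch.cuda.is_available() else "cpu"
source = torch.arange(1, 1_000_000, dtype=torch.double, device=device)
host_copy = to_cpu_synced(source)

# Same check as in the experiment above: the mean of 1..999999 is 500000.
torch.testing.assert_close(
    host_copy.mean(), torch.tensor(500_000, dtype=torch.double)
)
print("device-to-host copy verified")
```

Synchronizing a single stream is less heavy-handed than torch.cuda.synchronize(), which waits for all streams on the device. Either is sufficient to make the host-side read valid.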

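Practical recommendations

We can now wrap up some early recommendations based on our observations:

In general, non_blocking=True provides good throughput, regardless of whether the original tensor is in pinned memory. If the tensor is already in pinned memory, the transfer can be accelerated, but sending it to pinned memory manually from the Python main thread is a blocking operation on the host, and hence cancels out much of the benefit of using non_blocking=True (as CUDA does the pinned-memory transfer anyway).

One might now legitimately ask what use there is for the pin_memory() method. In the following section, we explore how it can be used to accelerate data transfers even further.

Additional considerations

As is well known, PyTorch provides a DataLoader class whose constructor accepts a pin_memory argument. Given our earlier discussion of pin_memory, you might wonder how the DataLoader manages to accelerate data transfers if memory pinning is inherently blocking.

The key lies in the DataLoader's use of a separate thread to handle the transfer of data from pageable to pinned memory, which prevents any blockage in the main thread.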
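A typical training loop combines the two ideas: the loader returns pinned batches and the loop issues non-blocking copies. The sketch below is illustrative only; the dataset shape and batch size are arbitrary choices made for this example. It keeps num_workers=0 for portability, and as far as we are aware the background pinning thread is only used when worker processes are enabled (num_workers > 0), so treat this as a usage pattern rather than a benchmark.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

# Toy dataset (sizes chosen only for illustration).
dataset = TensorDataset(torch.randn(256, 32), torch.randint(0, 10, (256,)))

# With pin_memory=True the loader returns batches that live in pinned memory.
# Pinning is only requested when a CUDA device is present.
loader = DataLoader(dataset, batch_size=64, pin_memory=use_cuda)

num_batches = 0
for features, labels in loader:
    # Pinned source + non_blocking=True: the host does not wait for the copy.
    features = features.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)
    num_batches += 1

print(f"moved {num_batches} batches to {device}")
```

Because the pinned batches are not modified after the copy is issued, this host-to-device pattern does not need an explicit synchronization before the tensors are used on the GPU.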

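To illustrate this, we will use the TensorDict primitive from the library of the same name. When to() is called, the default behavior is to send tensors to the device asynchronously, followed by a single device synchronization (torch.cuda.synchronize() on CUDA devices).

Additionally, TensorDict.to() includes a non_blocking_pin option that starts multiple threads to execute pin_memory() before proceeding with to(device). This approach can further accelerate data transfers, as demonstrated in the following example.

from tensordict import TensorDict
import torch
from torch.utils.benchmark import Timer
import matplotlib.pyplot as plt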

# Create the dataset
td = TensorDict({str(i): torch.randn(1_000_000) for i in range(1000)})

# Runtimes
copy_blocking = timer("td.to('cuda:0', non_blocking=False)")
copy_non_blocking = timer("td.to('cuda:0')")
copy_pin_nb = timer("td.to('cuda:0', non_blocking_pin=True, num_threads=0)")
copy_pin_multithread_nb = timer("td.to('cuda:0', non_blocking_pin=True, num_threads=4)")

# Ratios
r1 = copy_non_blocking / copy_blocking
r2 = copy_pin_nb / copy_blocking
r3 = copy_pin_multithread_nb / copy_blocking

# Figure
fig, ax = plt.subplots()

xlabels = [0, 1, 2, 3]
bar_labels = [
    "Blocking copy (1x)",
    f"Non-blocking copy ({r1:4.2f}x)",
    f"Blocking pin, non-blocking copy ({r2:4.2f}x)",
    f"Non-blocking pin, non-blocking copy ({r3:4.2f}x)",
]
values = [copy_blocking, copy_non_blocking, copy_pin_nb, copy_pin_multithread_nb]
colors = ["tab:blue", "tab:red", "tab:orange", "tab:green"]

ax.bar(xlabels, values, label=bar_labels, color=colors)

ax.set_ylabel("Runtime (ms)")
ax.set_title("Device casting runtime")
ax.set_xticks([])
ax.legend()

plt.show()

Device casting runtime

td.to('cuda:0', non_blocking=False):  636.8030 ms
td.to('cuda:0'):  562.6520 ms
td.to('cuda:0', non_blocking_pin=True, num_threads=0):  709.1188 ms
td.to('cuda:0', non_blocking_pin=True, num_threads=4):  357.9704 ms

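In this example, we are transferring many large tensors from the CPU to the GPU. This scenario is ideal for multithreaded pin_memory(), which can significantly improve performance. However, if the tensors are small, the overhead of multithreading may outweigh the benefits. Similarly, if there are only a few tensors, the advantage of pinning tensors on separate threads becomes limited.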
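The idea behind multithreaded pinning can be sketched with the standard library alone. The function below is not TensorDict's implementation; it is a hypothetical helper (pin_in_threads_then_copy, with an assumed default of four threads) that pins each tensor on a worker thread, issues non-blocking copies, and synchronizes once at the end. Without a CUDA device it falls back to plain copies.

```python
from concurrent.futures import ThreadPoolExecutor

import torch


def pin_in_threads_then_copy(tensors, device, num_threads=4):
    """Pin tensors on worker threads, then issue asynchronous copies."""
    if torch.device(device).type != "cuda" or not torch.cuda.is_available():
        # No CUDA device: there is nothing to pin, so copy directly.
        return [t.to(device) for t in tensors]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # pin_memory() blocks its calling thread, so spread the calls out.
        pinned = list(pool.map(lambda t: t.pin_memory(), tensors))
    result = [t.to(device, non_blocking=True) for t in pinned]
    # One synchronization for the whole batch; the pinned sources stay
    # alive and unmodified until the copies have finished.
    torch.cuda.synchronize()
    return result


device = "cuda:0" if torch.cuda.is_available() else "cpu"
tensors = [torch.randn(1_000) for _ in range(8)]
moved = pin_in_threads_then_copy(tensors, device)
print(len(moved), moved[0].device)
```

Whether this beats a plain non-blocking copy depends on the tensor sizes and the number of free cores, so it should be benchmarked with the timer() helper above before being adopted.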

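As an additional note, while it might seem advantageous to create permanent buffers in pinned memory to stage tensors from pageable memory before transferring them to the GPU, this strategy does not necessarily speed up computation. The inherent bottleneck caused by copying data into pinned memory remains a limiting factor.

Moreover, transferring data that resides on disk (whether in shared memory or in files) to the GPU usually requires an intermediate step of copying the data into pinned memory (located in RAM). Using non_blocking for large data transfers in this context can significantly increase RAM consumption, potentially leading to adverse effects.

In practice, there is no one-size-fits-all solution. The effectiveness of multithreaded pin_memory combined with non_blocking transfers depends on a variety of factors, including the specific system, operating system, hardware, and the nature of the tasks being executed. Here is a list of factors to check when trying to speed up data transfers between CPU and GPU, or when comparing throughput across scenarios: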

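Number of available cores
How many CPU cores are available? Is the system shared with other users or processes that might compete for resources?

Core utilization
Are the CPU cores heavily used by other processes? Does the application perform other CPU-intensive tasks concurrently with data transfers?

Memory utilization
How much pageable and page-locked memory is currently in use? Is there enough free memory to allocate additional pinned memory without affecting system performance? Remember that nothing comes for free: pin_memory, for instance, consumes RAM and may impact other tasks.

CUDA device capabilities
Does the GPU support multiple DMA engines for concurrent data transfers? What are the specific capabilities and limitations of the CUDA device being used?

Number of tensors to be sent
How many tensors are transferred in a typical operation?

Size of the tensors to be sent
What is the size of the tensors being transferred? A few large tensors and many small tensors may not benefit from the same transfer procedure.

System architecture
How does the system's architecture influence data transfer speeds (for example, bus speeds, network latency)?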
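The first items of this checklist can be inspected programmatically. The sketch below uses only the standard library; both function names are hypothetical, and the thread-count rule is an illustrative assumption (never more threads than usable cores or tensors), not a PyTorch or TensorDict recommendation.

```python
import os


def usable_cpu_count() -> int:
    """CPU cores this process may run on (affinity-aware where supported)."""
    if hasattr(os, "sched_getaffinity"):
        # Linux: honors CPU affinity masks, e.g. those set with taskset.
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


def suggest_pin_threads(num_tensors: int, max_threads: int = 4) -> int:
    """Illustrative heuristic: cap threads by cores and by tensor count."""
    return max(1, min(max_threads, usable_cpu_count(), num_tensors))


print("usable cores:", usable_cpu_count())
print("threads for 1000 tensors:", suggest_pin_threads(1000))
print("threads for 1 tensor:", suggest_pin_threads(1))
```

Any value obtained this way is only a starting point for the benchmarks discussed above, since core utilization and memory pressure are not captured by a core count.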

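Additionally, allocating a large number of tensors, or sizable tensors, in pinned memory can monopolize a substantial portion of RAM. This reduces the memory available for other critical operations, such as paging, which can negatively impact the overall performance of an algorithm.

Conclusion

Throughout this tutorial, we have explored several critical factors that influence transfer speeds and memory management when sending tensors from the host to the device. We have learned that using non_blocking=True generally accelerates data transfers, and that pin_memory() can also enhance performance if implemented correctly. However, these techniques require careful design and calibration to be effective.

Remember that profiling your code and keeping an eye on memory consumption are essential to optimize resource usage and achieve the best possible performance.

Additional resources

If you are dealing with issues with memory copies while using CUDA devices, or want to learn more about what was discussed in this tutorial, check the following references: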

