CUDA Stream Sanitizer

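Note

This is a prototype feature. It is at an early stage, intended mainly for gathering feedback and testing, and its components are subject to change.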

Overview

This module introduces CUDA Sanitizer (CSAN), a tool for detecting synchronization errors between kernels that run on different streams.

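It records accesses to tensors in order to determine whether they are synchronized. When it is enabled in a Python program and a possible data race is detected, it prints a detailed warning and the program exits.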
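The idea of recording tensor accesses can be illustrated with a toy model. This is only a sketch of the general technique, not the actual CSAN implementation: the class ToyStreamSanitizer and its methods are hypothetical names, tensors are identified by plain strings, and only writes are tracked.

```python
class ToyStreamSanitizer:
    """Toy model of access tracking across streams (hypothetical, not CSAN itself)."""

    def __init__(self):
        self.seq = 0            # global counter, incremented per recorded kernel
        self.launched = {}      # stream -> sequence number of its latest kernel
        self.synced = {}        # waiter stream -> {other stream: seq it has waited for}
        self.last_write = {}    # tensor id -> (stream, seq) of the most recent write

    def wait_stream(self, waiter, other):
        """Record that `waiter` waits for all work launched so far on `other`."""
        state = self.synced.setdefault(waiter, {})
        state[other] = max(state.get(other, 0), self.launched.get(other, 0))
        # Synchronization is transitive: inherit everything `other` has waited for.
        for stream, seq in self.synced.get(other, {}).items():
            state[stream] = max(state.get(stream, 0), seq)

    def write(self, stream, tensor):
        """Record a kernel on `stream` writing `tensor`; return True on a possible race."""
        self.seq += 1
        race = False
        previous = self.last_write.get(tensor)
        if previous is not None:
            prev_stream, prev_seq = previous
            seen = self.synced.get(stream, {}).get(prev_stream, 0)
            # A race is possible if the earlier write came from another stream
            # and this stream never waited for it.
            race = prev_stream != stream and seen < prev_seq
        self.last_write[tensor] = (stream, self.seq)
        self.launched[stream] = self.seq
        return race


unsynced = ToyStreamSanitizer()
unsynced.write(0, "a")                  # tensor written on stream 0
race_without_wait = unsynced.write(1, "a")

synced = ToyStreamSanitizer()
synced.write(0, "a")
synced.wait_stream(1, 0)                # stream 1 waits for stream 0 first
race_with_wait = synced.write(1, "a")

print(race_without_wait, race_with_wait)
```

In this model a write is flagged only when the previous write happened on a different stream that the current stream has not waited for.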

It can be enabled either by importing this module and calling enable_cuda_sanitizer(), or by exporting the TORCH_CUDA_SANITIZER environment variable.
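A minimal sketch of the programmatic route is shown below. The helper name enable_csan_if_available is hypothetical, and the guards around the import and the CUDA check are an assumption added so that the snippet also runs on machines without PyTorch or without a GPU.

```python
def enable_csan_if_available() -> bool:
    """Enable CSAN when PyTorch with CUDA is present; return whether it was enabled."""
    try:
        import torch
        import torch.cuda._sanitizer as csan
    except ImportError:
        return False  # PyTorch (or the sanitizer module) is not installed here
    if not torch.cuda.is_available():
        return False  # no CUDA device, so there is nothing to sanitize
    csan.enable_cuda_sanitizer()
    return True


# Call this before any CUDA work is launched.
enabled = enable_csan_if_available()
print("CSAN enabled:", enabled)
```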

Usage

Here is an example of a simple synchronization error in PyTorch:

import torch

a = torch.rand(10000, device="cuda")

with torch.cuda.stream(torch.cuda.Stream()):
    torch.mul(a, 5, out=a)

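The tensor a is initialized on the default stream and then modified on a new stream without any synchronization. The two kernels run concurrently on the same tensor, so the second kernel may read uninitialized data before the first one has written it, or the first kernel may overwrite part of the second kernel's result. When this script is run from the command line with: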

TORCH_CUDA_SANITIZER=1 python example_error.py

CSAN prints the following output:

============================
CSAN detected a possible data race on tensor with data pointer 139719969079296
Access by stream 94646435460352 during kernel:
aten::mul.out(Tensor self, Tensor other, *, Tensor(a!) out) -> Tensor(a!)
writing to argument(s) self, out, and to the output
With stack trace:
  File "example_error.py", line 6, in <module>
    torch.mul(a, 5, out=a)
  ...
  File "pytorch/torch/cuda/_sanitizer.py", line 364, in _handle_kernel_launch
    stack_trace = traceback.StackSummary.extract(

Previous access by stream 0 during kernel:
aten::rand(int[] size, *, int? dtype=None, Device? device=None) -> Tensor
writing to the output
With stack trace:
  File "example_error.py", line 3, in <module>
    a = torch.rand(10000, device="cuda")
  ...
  File "pytorch/torch/cuda/_sanitizer.py", line 364, in _handle_kernel_launch
    stack_trace = traceback.StackSummary.extract(

Tensor was allocated with stack trace:
  File "example_error.py", line 3, in <module>
    a = torch.rand(10000, device="cuda")
  ...
  File "pytorch/torch/cuda/_sanitizer.py", line 420, in _handle_memory_allocation
    traceback.StackSummary.extract(

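This gives extensive insight into the origin of the error:

The tensor was incorrectly accessed from streams with IDs 0 (the default stream) and 94646435460352 (the new stream).
The tensor was allocated by the call a = torch.rand(10000, device="cuda").
The faulty accesses were caused by the operators:
- a = torch.rand(10000, device="cuda") on stream 0
- torch.mul(a, 5, out=a) on stream 94646435460352
The error message also displays the schemas of the invoked operators, along with a note showing which of their arguments correspond to the affected tensor.
- In the example, tensor a corresponds to the arguments self and out, and to the output value, of the invoked operator torch.mul.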
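The key fields of such a report can also be pulled out programmatically. The sketch below assumes the report keeps the exact wording shown in the example output; since this is a prototype feature, that wording may change. The function parse_csan_report and the returned dictionary keys are hypothetical names.

```python
import re


def parse_csan_report(report: str) -> dict:
    """Extract the data pointer and the two stream IDs from a CSAN report."""
    pointer = re.search(r"data pointer (\d+)", report)
    current = re.search(r"^Access by stream (\d+)", report, re.MULTILINE)
    previous = re.search(r"^Previous access by stream (\d+)", report, re.MULTILINE)
    return {
        "data_ptr": int(pointer.group(1)) if pointer else None,
        "stream": int(current.group(1)) if current else None,
        "previous_stream": int(previous.group(1)) if previous else None,
    }


# Header lines copied from the example output; stack traces are omitted.
sample = """CSAN detected a possible data race on tensor with data pointer 139719969079296
Access by stream 94646435460352 during kernel:
aten::mul.out(Tensor self, Tensor other, *, Tensor(a!) out) -> Tensor(a!)
Previous access by stream 0 during kernel:
aten::rand(int[] size, *, int? dtype=None, Device? device=None) -> Tensor
"""

info = parse_csan_report(sample)
print(info)
```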

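See also

The supported torch operators and their schemas can be viewed here.

The bug can be fixed by forcing the new stream to wait for the default stream: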

with torch.cuda.stream(torch.cuda.Stream()):
    torch.cuda.current_stream().wait_stream(torch.cuda.default_stream())
    torch.mul(a, 5, out=a)

When the script is run again, no errors are reported.
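One way to check this automatically is to launch the script in a child process with TORCH_CUDA_SANITIZER set and look for the report header on standard error. This is a sketch under assumptions: the helper names are hypothetical, and the marker string is taken from the example output above, so it may differ in other versions.

```python
import os
import subprocess
import sys

# First line of a CSAN report, as shown in the example output.
RACE_MARKER = "CSAN detected a possible data race"


def csan_env(base=None):
    """Return a copy of the environment with the sanitizer switched on."""
    env = dict(os.environ if base is None else base)
    env["TORCH_CUDA_SANITIZER"] = "1"
    return env


def run_under_csan(script_path):
    """Run a Python script in a child process with CSAN enabled."""
    return subprocess.run(
        [sys.executable, script_path],
        env=csan_env(),
        capture_output=True,
        text=True,
    )


def reported_race(stderr_text):
    """Return True if the captured stderr contains a CSAN race report."""
    return RACE_MARKER in stderr_text
```

For example, reported_race(run_under_csan("example_error.py").stderr) would be expected to return True for the original script and False once the wait_stream call has been added.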

API Reference

torch.cuda._sanitizer.enable_cuda_sanitizer()

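Enable CUDA Sanitizer.

The sanitizer will begin to analyze low-level CUDA calls invoked by torch functions for synchronization errors. All data races found are printed to the standard error output, along with stack traces of the suspected causes. For best results, enable the sanitizer as early as possible in the program.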