
Pendulum: 使用 TorchRL 编写环境和转换

作者: Vincent Moens

创建环境(模拟器或与物理控制系统的接口)是强化学习和控制工程中的一个重要组成部分。

TorchRL 提供了一套工具,可以在多种场景中实现这一目标。本教程演示了如何从零开始使用 PyTorch 和 TorchRL 编写一个钟摆模拟器。它受到了 OpenAI-Gym/Farama-Gymnasium 控制库 中 Pendulum-v1 实现的启发。

(图:Pendulum 单摆示意图)

关键学习点:

  • 如何在 TorchRL 中设计环境:

    • 编写规范(输入、观察和奖励);
    • 实现行为:设定种子、重置和步进。
  • 转换环境输入和输出,并编写自定义转换;

  • 如何使用 TensorDict 在代码库中传递任意数据结构。

在此过程中,我们将涉及 TorchRL 的三个关键组件:环境(environments)、变换(transforms)以及模型(策略与价值函数)。

为了展示 TorchRL 环境的能力,我们将设计一个无状态环境。有状态环境会记录最近遇到的物理状态,并依赖这些信息来模拟状态到状态的转换,而无状态环境则在每一步都需要提供当前状态以及所采取的操作。TorchRL 支持这两种类型的环境,但无状态环境更加通用,因此能够涵盖 TorchRL 环境 API 的更多功能。

建模无状态环境使用户能够完全控制模拟器的输入和输出:用户可以在任何阶段重置实验或从外部主动修改动态。然而,这种方法假设我们对任务有一定的控制权,但这并非总是如此:解决无法控制当前状态的问题更具挑战性,但应用范围也更广。

无状态环境的另一个优势是它们可以实现批量执行的过渡模拟。如果后端和实现允许,代数操作可以无缝地在标量、向量或张量上执行。本教程将提供此类示例。

本教程的结构如下:

  • 我们首先将熟悉环境属性:其形状(batch_size)、其方法(主要是 step()、reset() 和 set_seed())以及其规范。

  • 在编写完我们的模拟器之后,我们将演示如何在训练过程中使用转换。

  • 我们将探索 TorchRL API 带来的新途径,包括:转换输入的可能性、模拟的向量化执行以及通过模拟图进行反向传播的可能性。

  • 最后,我们将训练一个简单的策略来解决我们实现的系统。

from collections import defaultdict
from typing import Optional

import numpy as np
import torch
import tqdm
from tensordict import TensorDict, TensorDictBase
from tensordict.nn import TensorDictModule
from torch import nn

from torchrl.data import BoundedTensorSpec, CompositeSpec, UnboundedContinuousTensorSpec
from torchrl.envs import (
    CatTensors,
    EnvBase,
    Transform,
    TransformedEnv,
    UnsqueezeTransform,
)
from torchrl.envs.transforms.transforms import _apply_to_composite
from torchrl.envs.utils import check_env_specs, step_mdp

DEFAULT_X = np.pi
DEFAULT_Y = 1.0

在设计一个新的环境类时,您必须注意以下四点:

  • EnvBase._reset(),用于在(可能是随机的)初始状态下重置模拟器;

  • EnvBase._step(),用于编码状态转换动态;

  • EnvBase._set_seed(),用于实现种子机制;

  • 环境规格。

让我们首先描述当前的问题:我们希望模拟一个简单的单摆,并能够控制施加在其固定点上的扭矩。我们的目标是将单摆放置在向上的位置(按惯例,角度位置为0),并使其在该位置保持静止。为了设计我们的动态系统,我们需要定义两个方程:在施加动作(扭矩)后的运动方程,以及构成我们目标函数的奖励方程。

对于运动方程,我们将根据以下公式更新角速度:

\[\dot{\theta}_{t+1} = \dot{\theta}_t + (3 * g / (2 * L) * \sin(\theta_t) + 3 / (m * L^2) * u) * dt\]

其中 \(\dot{\theta}\) 是角速度,单位为 rad/sec,\(g\) 是重力,\(L\) 是摆长,\(m\) 是质量,\(\theta\) 是角位置,\(u\) 是扭矩。角位置随后根据以下公式更新:

\[\theta_{t+1} = \theta_{t} + \dot{\theta}_{t+1} dt\]

我们将奖励定义为

\[r = -(\theta^2 + 0.1 * \dot{\theta}^2 + 0.001 * u^2)\]

当角度接近 0(摆锤处于向上位置)、角速度接近 0(无运动)且扭矩也为 0 时,该值将最大化。
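
作为一个最小的数值示意(非教程原文,假设采用后文 gen_params 中的默认参数 g=10、m=1、l=1、dt=0.05),下面用标量张量演示一次欧拉积分更新以及对应的奖励计算:

# 假设的默认物理参数(与后文 gen_params 中的取值一致)
g, m, l, dt = 10.0, 1.0, 1.0, 0.05
th, thdot, u = torch.tensor(0.5), torch.tensor(0.0), torch.tensor(1.0)

# 按上面的运动方程做一步欧拉积分
new_thdot = thdot + (3 * g / (2 * l) * th.sin() + 3.0 / (m * l**2) * u) * dt
new_th = th + new_thdot * dt

# 奖励:角度、角速度和扭矩都接近 0 时取最大值(趋近于 0)
reward = -(th**2 + 0.1 * thdot**2 + 0.001 * u**2)
print(new_th.item(), new_thdot.item(), reward.item())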

编码动作的效果: _step()

step 方法是首先要考虑的内容,因为它将编码我们感兴趣的模拟过程。在 TorchRL 中,EnvBase 类有一个 EnvBase.step() 方法,它接收一个带有 "action" 条目的 tensordict.TensorDict 实例,该条目指示要执行的操作。

为了便于从该 tensordict 中读取和写入数据,并确保键与库预期的内容一致,模拟部分已被委托给一个私有的抽象方法 _step(),该方法从 tensordict 中读取输入数据,并将输出数据写入一个新的 tensordict 中。

_step() 方法应执行以下操作:

  1. 读取输入键(例如 "action")并根据这些键执行模拟;

  2. 获取观测值、完成状态和奖励;

  3. 将一组观测值连同奖励和完成状态写入新 TensorDict 中的相应条目。

接下来,step() 方法会将 _step() 的输出合并到输入的 tensordict 中,以确保输入/输出的一致性。

通常情况下,对于有状态的环境,这看起来会像这样:

>>> tensordict = policy(env.reset())
>>> print(tensordict)
TensorDict(
    fields={
        action: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, is_shared=False),
        done: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
        observation: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False)},
    batch_size=torch.Size([]),
    device=cpu,
    is_shared=False)
>>> env.step(tensordict)
>>> print(tensordict)
TensorDict(
    fields={
        action: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, is_shared=False),
        done: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
        next: TensorDict(
            fields={
                done: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
                observation: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                reward: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, is_shared=False)},
            batch_size=torch.Size([]),
            device=cpu,
            is_shared=False),
        observation: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False)},
    batch_size=torch.Size([]),
    device=cpu,
    is_shared=False)

请注意,根 tensordict 并未发生变化,唯一的修改是出现了一个新的 "next" 条目,其中包含了新的信息。
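
下面是一个概念性的小示意(非教程原文,假设 env 是一个已经构建好的 EnvBase 实例):step() 把结果写入 "next",而上文导入的 step_mdp() 会把 "next" 中的条目提升为下一步的根条目:

td = env.reset()
td["action"] = env.action_spec.rand()
td = env.step(td)                    # 根 tensordict 不变,结果写入 td["next", ...]
td = step_mdp(td, keep_other=True)   # 将 "next" 中的条目提升到根部,作为下一步的输入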

在 Pendulum 示例中,我们的 _step() 方法将从输入的 tensordict 中读取相关条目,并计算在 "action" 键所编码的力施加后,摆锤的位置和速度。我们计算摆锤的新角度位置 "new_th",作为前一个位置 "th" 加上新速度 "new_thdot" 在时间间隔 dt 内的结果。

由于我们的目标是将摆锤直立并保持静止,因此对于接近目标位置且速度较低的情况,我们的 cost(负奖励)函数值较低。实际上,我们希望抑制那些远离“直立”位置和/或速度远离 0 的情况。

在我们的示例中,EnvBase._step() 被编码为静态方法,因为我们的环境是无状态的。在有状态的环境中,需要 self 参数,因为状态需要从环境中读取。

def _step(tensordict):
    th, thdot = tensordict["th"], tensordict["thdot"]  # th := theta

    g_force = tensordict["params", "g"]
    mass = tensordict["params", "m"]
    length = tensordict["params", "l"]
    dt = tensordict["params", "dt"]
    u = tensordict["action"].squeeze(-1)
    u = u.clamp(-tensordict["params", "max_torque"], tensordict["params", "max_torque"])
    costs = angle_normalize(th) ** 2 + 0.1 * thdot**2 + 0.001 * (u**2)

    new_thdot = (
        thdot
        + (3 * g_force / (2 * length) * th.sin() + 3.0 / (mass * length**2) * u) * dt
    )
    new_thdot = new_thdot.clamp(
        -tensordict["params", "max_speed"], tensordict["params", "max_speed"]
    )
    new_th = th + new_thdot * dt
    reward = -costs.view(*tensordict.shape, 1)
    done = torch.zeros_like(reward, dtype=torch.bool)
    out = TensorDict(
        {
            "th": new_th,
            "thdot": new_thdot,
            "params": tensordict["params"],
            "reward": reward,
            "done": done,
        },
        tensordict.shape,
    )
    return out


def angle_normalize(x):
    return ((x + torch.pi) % (2 * torch.pi)) - torch.pi
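
可以快速检查一下 angle_normalize 的折返行为(仅作说明,非教程原文):

x = torch.tensor(3 * torch.pi / 2)
print(angle_normalize(x))  # 约为 -pi/2:角度被折返到 [-pi, pi) 区间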

重置模拟器: _reset()

我们需要关注的第二个方法是 _reset() 方法。与 _step() 类似,它应该在输出的 tensordict 中写入观察条目,并可能包含一个 done 状态(如果省略 done 状态,父方法 reset() 会将其填充为 False)。在某些情况下,_reset 方法需要接收调用它的函数传递的命令(例如,在多代理设置中,我们可能希望指示哪些代理需要重置)。这就是为什么 _reset() 方法也期望一个 tensordict 作为输入,尽管它完全可以为空或 None

父类 EnvBase.reset() 会执行一些与 EnvBase.step() 类似的简单检查,例如确保输出的 tensordict 中包含 "done" 状态,并且形状与规格中的预期一致。

对于我们来说,唯一需要考虑的是 EnvBase._reset() 是否包含所有预期的观察值。再次强调,由于我们处理的是一个无状态环境,我们将摆锤的配置传递到名为 "params" 的嵌套 tensordict 中。

在这个示例中,我们没有传递 done 状态,因为这对于 _reset() 并不是强制性的,而且我们的环境是非终止的,因此我们始终期望它为 False

def _reset(self, tensordict):
    if tensordict is None or tensordict.is_empty():
        # if no ``tensordict`` is passed, we generate a single set of hyperparameters
        # Otherwise, we assume that the input ``tensordict`` contains all the relevant
        # parameters to get started.
        tensordict = self.gen_params(batch_size=self.batch_size)

    high_th = torch.tensor(DEFAULT_X, device=self.device)
    high_thdot = torch.tensor(DEFAULT_Y, device=self.device)
    low_th = -high_th
    low_thdot = -high_thdot

    # for non batch-locked environments, the input ``tensordict`` shape dictates the number
    # of simulators run simultaneously. In other contexts, the initial
    # random state's shape will depend upon the environment batch-size instead.
    th = (
        torch.rand(tensordict.shape, generator=self.rng, device=self.device)
        * (high_th - low_th)
        + low_th
    )
    thdot = (
        torch.rand(tensordict.shape, generator=self.rng, device=self.device)
        * (high_thdot - low_thdot)
        + low_thdot
    )
    out = TensorDict(
        {
            "th": th,
            "thdot": thdot,
            "params": tensordict["params"],
        },
        batch_size=tensordict.shape,
    )
    return out

环境元数据: env.*_spec

规范定义了环境的输入和输出域。重要的是,规范必须准确地定义在运行时将接收的张量,因为它们通常用于在多进程和分布式设置中传递环境信息。它们还可以用于实例化延迟定义的神经网络和测试脚本,而无需实际查询环境(例如,在现实世界的物理系统中,查询可能代价高昂)。

在我们的环境中,必须编写四种规范:

  • EnvBase.observation_spec: 这将是一个 CompositeSpec 实例,其中每个键对应一个观测值(CompositeSpec 可以视为一组规范的字典)。

  • EnvBase.action_spec: 它可以是任何类型的规范,但必须与输入 tensordict 中的 "action" 条目相对应;

  • EnvBase.reward_spec: 提供有关奖励空间的信息;

  • EnvBase.done_spec: 提供有关完成标志空间的信息。

TorchRL 的规格被组织在两个通用容器中:input_spec 包含步进函数读取信息的规格(分为包含动作的 action_spec 和包含其余所有内容的 state_spec),以及 output_spec,它编码了步进输出的规格(observation_specreward_specdone_spec)。通常,您不应直接与 output_specinput_spec 交互,而只应与其内容交互:observation_specreward_specdone_specaction_specstate_spec。原因是这些规格在 output_specinput_spec 中以非平凡的方式组织,并且不应直接修改它们。

换句话说,observation_spec 和相关属性是对输出和输入规范容器内容的便捷快捷方式。

TorchRL 提供了多种 TensorSpec 子类 来编码环境的输入和输出特性。
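
下面是一个最小的示意(非教程原文):规范既可以在其定义域内随机采样,也可以校验给定张量是否属于该域:

spec = BoundedTensorSpec(low=-2.0, high=2.0, shape=(1,), dtype=torch.float32)
sample = spec.rand()        # 在 [-2, 2] 范围内随机采样
assert spec.is_in(sample)   # 校验张量是否落在该规范的定义域内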

规格形状

环境规格的前导维度必须与环境的批量大小相匹配。这样做是为了确保环境的每个组件(包括其变换)都能准确表示预期的输入和输出形状。在有状态设置中,这一点应该被准确编码。

对于非批量锁定的环境,如我们示例中的环境(见下文),这一点无关紧要,因为环境的批量大小很可能是空的。

def _make_spec(self, td_params):
    # Under the hood, this will populate self.output_spec["observation"]
    self.observation_spec = CompositeSpec(
        th=BoundedTensorSpec(
            low=-torch.pi,
            high=torch.pi,
            shape=(),
            dtype=torch.float32,
        ),
        thdot=BoundedTensorSpec(
            low=-td_params["params", "max_speed"],
            high=td_params["params", "max_speed"],
            shape=(),
            dtype=torch.float32,
        ),
        # we need to add the ``params`` to the observation specs, as we want
        # to pass it at each step during a rollout
        params=make_composite_from_td(td_params["params"]),
        shape=(),
    )
    # since the environment is stateless, we expect the previous output as input.
    # For this, ``EnvBase`` expects some state_spec to be available
    self.state_spec = self.observation_spec.clone()
    # action-spec will be automatically wrapped in input_spec when
    # `self.action_spec = spec` is called
    self.action_spec = BoundedTensorSpec(
        low=-td_params["params", "max_torque"],
        high=td_params["params", "max_torque"],
        shape=(1,),
        dtype=torch.float32,
    )
    self.reward_spec = UnboundedContinuousTensorSpec(shape=(*td_params.shape, 1))


def make_composite_from_td(td):
    # custom function to convert a ``tensordict`` in a similar spec structure
    # of unbounded values.
    composite = CompositeSpec(
        {
            key: make_composite_from_td(tensor)
            if isinstance(tensor, TensorDictBase)
            else UnboundedContinuousTensorSpec(
                dtype=tensor.dtype, device=tensor.device, shape=tensor.shape
            )
            for key, tensor in td.items()
        },
        shape=td.shape,
    )
    return composite

可复现的实验:种子设置

在初始化实验时,设置环境种子是一个常见操作。EnvBase._set_seed() 的唯一目标是设置所包含模拟器的种子。如果可能的话,此操作不应调用 reset() 或与环境执行进行交互。父方法 EnvBase.set_seed() 包含了一种机制,允许使用不同的伪随机且可复现的种子为多个环境设置种子。

def _set_seed(self, seed: Optional[int]):
    rng = torch.manual_seed(seed)
    self.rng = rng
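
作为一个概念性的小示意(非教程原文,env_a、env_b 为假设的两个环境实例):父方法 set_seed() 会返回下一个可用的种子,便于为多个环境设定互不相同但可复现的种子:

seed = 0
next_seed = env_a.set_seed(seed)   # 返回供下一个环境使用的新种子
env_b.set_seed(next_seed)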

整合内容:EnvBase

我们终于可以将各个部分整合起来,设计我们的环境类。由于需要在环境构建期间执行 specs 的初始化,因此我们必须在 PendulumEnv.__init__() 中调用 _make_spec() 方法。

我们添加了一个静态方法 PendulumEnv.gen_params(),它可以确定性地生成一组在执行期间使用的超参数:

def gen_params(g=10.0, batch_size=None) -> TensorDictBase:
    """Returns a ``tensordict`` containing the physical parameters such as gravitational force and torque or speed limits."""
    if batch_size is None:
        batch_size = []
    td = TensorDict(
        {
            "params": TensorDict(
                {
                    "max_speed": 8,
                    "max_torque": 2.0,
                    "dt": 0.05,
                    "g": g,
                    "m": 1.0,
                    "l": 1.0,
                },
                [],
            )
        },
        [],
    )
    if batch_size:
        td = td.expand(batch_size).contiguous()
    return td

我们通过将同名属性设为 False,把环境定义为非 batch_locked(非批量锁定)。这意味着我们不会强制要求输入的 tensordict 具有与环境匹配的 batch-size。

以下代码将把我们上面编写的部分整合在一起。

class PendulumEnv(EnvBase):
    metadata = {
        "render_modes": ["human", "rgb_array"],
        "render_fps": 30,
    }
    batch_locked = False

    def __init__(self, td_params=None, seed=None, device="cpu"):
        if td_params is None:
            td_params = self.gen_params()

        super().__init__(device=device, batch_size=[])
        self._make_spec(td_params)
        if seed is None:
            seed = torch.empty((), dtype=torch.int64).random_().item()
        self.set_seed(seed)

    # Helpers: _make_spec and gen_params
    gen_params = staticmethod(gen_params)
    _make_spec = _make_spec

    # Mandatory methods: _step, _reset and _set_seed
    _reset = _reset
    _step = staticmethod(_step)
    _set_seed = _set_seed

测试我们的环境

TorchRL 提供了一个简单的函数 check_env_specs(),用于检查(转换后的)环境的输入/输出结构是否与其规范所要求的结构匹配。让我们来试一下:

env = PendulumEnv()
check_env_specs(env)

我们可以查看我们的规范(specs),以获得环境签名的可视化表示:

print("observation_spec:", env.observation_spec)
print("state_spec:", env.state_spec)
print("reward_spec:", env.reward_spec)
observation_spec: Composite(
    th: BoundedContinuous(
        shape=torch.Size([]),
        space=ContinuousBox(
            low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
            high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
        device=cpu,
        dtype=torch.float32,
        domain=continuous),
    thdot: BoundedContinuous(
        shape=torch.Size([]),
        space=ContinuousBox(
            low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
            high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
        device=cpu,
        dtype=torch.float32,
        domain=continuous),
    params: Composite(
        max_speed: UnboundedDiscrete(
            shape=torch.Size([]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.int64, contiguous=True),
                high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.int64, contiguous=True)),
            device=cpu,
            dtype=torch.int64,
            domain=discrete),
        max_torque: UnboundedContinuous(
            shape=torch.Size([]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
                high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
            device=cpu,
            dtype=torch.float32,
            domain=continuous),
        dt: UnboundedContinuous(
            shape=torch.Size([]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
                high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
            device=cpu,
            dtype=torch.float32,
            domain=continuous),
        g: UnboundedContinuous(
            shape=torch.Size([]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
                high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
            device=cpu,
            dtype=torch.float32,
            domain=continuous),
        m: UnboundedContinuous(
            shape=torch.Size([]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
                high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
            device=cpu,
            dtype=torch.float32,
            domain=continuous),
        l: UnboundedContinuous(
            shape=torch.Size([]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
                high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
            device=cpu,
            dtype=torch.float32,
            domain=continuous),
        device=cpu,
        shape=torch.Size([])),
    device=cpu,
    shape=torch.Size([]))
state_spec: Composite(
    th: BoundedContinuous(
        shape=torch.Size([]),
        space=ContinuousBox(
            low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
            high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
        device=cpu,
        dtype=torch.float32,
        domain=continuous),
    thdot: BoundedContinuous(
        shape=torch.Size([]),
        space=ContinuousBox(
            low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
            high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
        device=cpu,
        dtype=torch.float32,
        domain=continuous),
    params: Composite(
        max_speed: UnboundedDiscrete(
            shape=torch.Size([]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.int64, contiguous=True),
                high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.int64, contiguous=True)),
            device=cpu,
            dtype=torch.int64,
            domain=discrete),
        max_torque: UnboundedContinuous(
            shape=torch.Size([]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
                high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
            device=cpu,
            dtype=torch.float32,
            domain=continuous),
        dt: UnboundedContinuous(
            shape=torch.Size([]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
                high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
            device=cpu,
            dtype=torch.float32,
            domain=continuous),
        g: UnboundedContinuous(
            shape=torch.Size([]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
                high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
            device=cpu,
            dtype=torch.float32,
            domain=continuous),
        m: UnboundedContinuous(
            shape=torch.Size([]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
                high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
            device=cpu,
            dtype=torch.float32,
            domain=continuous),
        l: UnboundedContinuous(
            shape=torch.Size([]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
                high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
            device=cpu,
            dtype=torch.float32,
            domain=continuous),
        device=cpu,
        shape=torch.Size([])),
    device=cpu,
    shape=torch.Size([]))
reward_spec: UnboundedContinuous(
    shape=torch.Size([1]),
    space=ContinuousBox(
        low=Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, contiguous=True),
        high=Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, contiguous=True)),
    device=cpu,
    dtype=torch.float32,
    domain=continuous)

我们可以执行几条命令来检查输出结构是否符合预期。

td = env.reset()
print("reset tensordict", td)
reset tensordict TensorDict(
    fields={
        done: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
        params: TensorDict(
            fields={
                dt: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                g: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                l: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                m: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                max_speed: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.int64, is_shared=False),
                max_torque: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False)},
            batch_size=torch.Size([]),
            device=None,
            is_shared=False),
        terminated: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
        th: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
        thdot: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False)},
    batch_size=torch.Size([]),
    device=None,
    is_shared=False)

我们可以运行 env.rand_step() 来从 action_spec 域中随机生成一个动作。由于我们的环境是无状态的,必须传递一个包含超参数和当前状态的 tensordict。在有状态的上下文中,env.rand_step() 也同样适用。

td = env.rand_step(td)
print("random step tensordict", td)
random step tensordict TensorDict(
    fields={
        action: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, is_shared=False),
        done: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
        next: TensorDict(
            fields={
                done: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
                params: TensorDict(
                    fields={
                        dt: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                        g: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                        l: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                        m: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                        max_speed: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.int64, is_shared=False),
                        max_torque: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False)},
                    batch_size=torch.Size([]),
                    device=None,
                    is_shared=False),
                reward: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, is_shared=False),
                terminated: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
                th: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                thdot: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False)},
            batch_size=torch.Size([]),
            device=None,
            is_shared=False),
        params: TensorDict(
            fields={
                dt: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                g: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                l: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                m: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                max_speed: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.int64, is_shared=False),
                max_torque: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False)},
            batch_size=torch.Size([]),
            device=None,
            is_shared=False),
        terminated: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
        th: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
        thdot: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False)},
    batch_size=torch.Size([]),
    device=None,
    is_shared=False)

环境转换

为无状态模拟器编写环境转换比有状态模拟器稍微复杂一些:对于那些需要在下一次迭代中被读取的输出条目,转换必须在下一步调用 meth.step() 之前先应用其逆转换。这是展示 TorchRL 转换所有功能的理想场景!

例如,在以下转换后的环境中,我们对条目 ["th", "thdot"] 进行 unsqueeze 操作,以便能够沿最后一个维度堆叠它们。我们还将其作为 in_keys_inv 传递,以便在下一迭代中作为输入传递时将其压缩回原始形状。

env = TransformedEnv(
    env,
    # ``Unsqueeze`` the observations that we will concatenate
    UnsqueezeTransform(
        dim=-1,
        in_keys=["th", "thdot"],
        in_keys_inv=["th", "thdot"],
    ),
)

编写自定义变换

TorchRL 的变换(transforms)可能无法涵盖在执行环境后需要执行的所有操作。编写一个变换并不需要太多的工作量。与设计环境类似,编写变换包含两个步骤:

  • 正确掌握动力学(正向和逆向);

  • 调整环境规格。

变换可以在两种场景中使用:单独使用时,它可以作为一个 Module;它也可以附加到 TransformedEnv 中使用。该类的结构允许在不同的上下文中自定义行为。
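
例如,下面是一个假设性的最小示意(非教程原文),把一个变换单独当作 Module 使用,直接作用在 tensordict 上:

t = UnsqueezeTransform(dim=-1, in_keys=["th"], out_keys=["th"])
data = TensorDict({"th": torch.randn(())}, [])
data = t(data)              # 调用 forward(),"th" 的形状由 () 变为 (1,)
print(data["th"].shape)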

Transform 的骨架可以总结如下:

class Transform(nn.Module):
    def forward(self, tensordict):
        ...
    def _apply_transform(self, tensordict):
        ...
    def _step(self, tensordict):
        ...
    def _call(self, tensordict):
        ...
    def inv(self, tensordict):
        ...
    def _inv_apply_transform(self, tensordict):
        ...

有三个入口点(forward()_step()inv()),它们都接收 tensordict.TensorDict 实例。前两个最终会遍历 in_keys 指定的键,并对每个键调用 _apply_transform()。如果提供了 Transform.out_keys,结果将被写入这些键指向的条目中(如果没有提供,则 in_keys 将使用转换后的值进行更新)。如果需要执行反向转换,将执行类似的数据流,但会使用 Transform.inv()Transform._inv_apply_transform() 方法,并遍历 in_keys_invout_keys_inv 键列表。下图总结了环境和回放缓冲区的这一流程。

(图:Transform API 流程示意)

在某些情况下,转换无法以单一方式处理键的子集,而是会对父环境执行某些操作或处理整个输入 tensordict。在这些情况下,应重写 _call()forward() 方法,并且可以跳过 _apply_transform() 方法。
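
作为参考,下面给出一个假设性的示意(非教程原文):一个需要同时读取多个条目的变换,直接重写 _call()(以及可选的 forward()),并在 transform_observation_spec() 中登记新条目:

class SumObsTransform(Transform):
    """把 ``th`` 与 ``thdot`` 相加写入 ``th_plus_thdot``(仅作说明用途)。"""

    def __init__(self):
        super().__init__(in_keys=["th", "thdot"], out_keys=["th_plus_thdot"])

    def _call(self, tensordict):
        # 直接对整个 tensordict 操作,而不是逐个键调用 _apply_transform()
        tensordict["th_plus_thdot"] = tensordict["th"] + tensordict["thdot"]
        return tensordict

    # 单独作为 Module 使用时走 forward(),这里与 _call() 相同
    forward = _call

    def _reset(self, tensordict, tensordict_reset):
        return self._call(tensordict_reset)

    def transform_observation_spec(self, observation_spec):
        # 在观测规范中声明新的条目
        observation_spec["th_plus_thdot"] = UnboundedContinuousTensorSpec(
            shape=observation_spec["th"].shape,
            dtype=observation_spec["th"].dtype,
            device=observation_spec["th"].device,
        )
        return observation_spec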

让我们编写新的转换来计算位置角度的 sine 和 cosine 值,因为这些值对于我们学习策略来说比原始角度值更有用:

class SinTransform(Transform):
    def _apply_transform(self, obs: torch.Tensor) -> torch.Tensor:
        return obs.sin()

    # The transform must also modify the data at reset time
    def _reset(
        self, tensordict: TensorDictBase, tensordict_reset: TensorDictBase
    ) -> TensorDictBase:
        return self._call(tensordict_reset)

    # _apply_to_composite will execute the observation spec transform across all
    # in_keys/out_keys pairs and write the result in the observation_spec which
    # is of type ``Composite``
    @_apply_to_composite
    def transform_observation_spec(self, observation_spec):
        return BoundedTensorSpec(
            low=-1,
            high=1,
            shape=observation_spec.shape,
            dtype=observation_spec.dtype,
            device=observation_spec.device,
        )


class CosTransform(Transform):
    def _apply_transform(self, obs: torch.Tensor) -> torch.Tensor:
        return obs.cos()

    # The transform must also modify the data at reset time
    def _reset(
        self, tensordict: TensorDictBase, tensordict_reset: TensorDictBase
    ) -> TensorDictBase:
        return self._call(tensordict_reset)

    # _apply_to_composite will execute the observation spec transform across all
    # in_keys/out_keys pairs and write the result in the observation_spec which
    # is of type ``Composite``
    @_apply_to_composite
    def transform_observation_spec(self, observation_spec):
        return BoundedTensorSpec(
            low=-1,
            high=1,
            shape=observation_spec.shape,
            dtype=observation_spec.dtype,
            device=observation_spec.device,
        )


t_sin = SinTransform(in_keys=["th"], out_keys=["sin"])
t_cos = CosTransform(in_keys=["th"], out_keys=["cos"])
env.append_transform(t_sin)
env.append_transform(t_cos)
TransformedEnv(
    env=PendulumEnv(),
    transform=Compose(
            UnsqueezeTransform(dim=-1, in_keys=['th', 'thdot'], out_keys=['th', 'thdot'], in_keys_inv=['th', 'thdot'], out_keys_inv=['th', 'thdot']),
            SinTransform(keys=['th']),
            CosTransform(keys=['th'])))

将观测值连接到一个“observation”条目上。del_keys=False 确保我们在下一次迭代中保留这些值。

cat_transform = CatTensors(
    in_keys=["sin", "cos", "thdot"], dim=-1, out_key="observation", del_keys=False
)
env.append_transform(cat_transform)
TransformedEnv(
    env=PendulumEnv(),
    transform=Compose(
            UnsqueezeTransform(dim=-1, in_keys=['th', 'thdot'], out_keys=['th', 'thdot'], in_keys_inv=['th', 'thdot'], out_keys_inv=['th', 'thdot']),
            SinTransform(keys=['th']),
            CosTransform(keys=['th']),
            CatTensors(in_keys=['cos', 'sin', 'thdot'], out_key=observation)))

再次确认我们的环境规范与接收到的内容是否匹配:

check_env_specs(env)

执行 rollout

执行一次 rollout 包含一系列简单的步骤:

  • 重置环境

  • 当某些条件未满足时:

    • 根据策略计算动作

    • 根据此动作执行一步

    • 收集数据

    • 推进 MDP 一步

  • 收集数据并返回

这些操作已经被方便地封装在 rollout() 方法中,我们在下面提供了一个简化版本。

def simple_rollout(steps=100):
    # preallocate:
    data = TensorDict({}, [steps])
    # reset
    _data = env.reset()
    for i in range(steps):
        _data["action"] = env.action_spec.rand()
        _data = env.step(_data)
        data[i] = _data
        _data = step_mdp(_data, keep_other=True)
    return data


print("data from rollout:", simple_rollout(100))
data from rollout: TensorDict(
    fields={
        action: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        cos: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        done: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.bool, is_shared=False),
        next: TensorDict(
            fields={
                cos: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                done: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.bool, is_shared=False),
                observation: Tensor(shape=torch.Size([100, 3]), device=cpu, dtype=torch.float32, is_shared=False),
                params: TensorDict(
                    fields={
                        dt: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False),
                        g: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False),
                        l: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False),
                        m: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False),
                        max_speed: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.int64, is_shared=False),
                        max_torque: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False)},
                    batch_size=torch.Size([100]),
                    device=None,
                    is_shared=False),
                reward: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                sin: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                terminated: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.bool, is_shared=False),
                th: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                thdot: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False)},
            batch_size=torch.Size([100]),
            device=None,
            is_shared=False),
        observation: Tensor(shape=torch.Size([100, 3]), device=cpu, dtype=torch.float32, is_shared=False),
        params: TensorDict(
            fields={
                dt: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False),
                g: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False),
                l: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False),
                m: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False),
                max_speed: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.int64, is_shared=False),
                max_torque: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False)},
            batch_size=torch.Size([100]),
            device=None,
            is_shared=False),
        sin: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        terminated: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.bool, is_shared=False),
        th: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        thdot: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False)},
    batch_size=torch.Size([100]),
    device=None,
    is_shared=False)

批量计算

我们教程中最后一个未被探索的部分是 TorchRL 中批量计算的能力。由于我们的环境对输入数据的形状没有任何假设,因此我们可以无缝地在数据批次上执行它。更好的是:对于像我们的 Pendulum 这样的非批量锁定环境,我们可以动态更改批量大小而无需重新创建环境。为此,我们只需生成具有所需形状的参数即可。

batch_size = 10  # number of environments to be executed in batch
td = env.reset(env.gen_params(batch_size=[batch_size]))
print("reset (batch size of 10)", td)
td = env.rand_step(td)
print("rand step (batch size of 10)", td)
reset (batch size of 10) TensorDict(
    fields={
        cos: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        done: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
        observation: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
        params: TensorDict(
            fields={
                dt: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
                g: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
                l: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
                m: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
                max_speed: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.int64, is_shared=False),
                max_torque: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False)},
            batch_size=torch.Size([10]),
            device=None,
            is_shared=False),
        sin: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        terminated: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
        th: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        thdot: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False)},
    batch_size=torch.Size([10]),
    device=None,
    is_shared=False)
rand step (batch size of 10) TensorDict(
    fields={
        action: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        cos: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        done: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
        next: TensorDict(
            fields={
                cos: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                done: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
                observation: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
                params: TensorDict(
                    fields={
                        dt: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
                        g: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
                        l: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
                        m: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
                        max_speed: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.int64, is_shared=False),
                        max_torque: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False)},
                    batch_size=torch.Size([10]),
                    device=None,
                    is_shared=False),
                reward: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                sin: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                terminated: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
                th: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                thdot: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False)},
            batch_size=torch.Size([10]),
            device=None,
            is_shared=False),
        observation: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
        params: TensorDict(
            fields={
                dt: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
                g: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
                l: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
                m: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
                max_speed: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.int64, is_shared=False),
                max_torque: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False)},
            batch_size=torch.Size([10]),
            device=None,
            is_shared=False),
        sin: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        terminated: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
        th: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        thdot: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False)},
    batch_size=torch.Size([10]),
    device=None,
    is_shared=False)

执行带有批量数据的 rollout 操作需要我们重置环境,而不能在 rollout 函数内部进行,因为我们需要动态定义 batch_size,而 rollout() 不支持这一操作:

rollout = env.rollout(
    3,
    auto_reset=False,  # we're executing the reset out of the ``rollout`` call
    tensordict=env.reset(env.gen_params(batch_size=[batch_size])),
)
print("rollout of len 3 (batch size of 10):", rollout)
rollout of len 3 (batch size of 10): TensorDict(
    fields={
        action: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        cos: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        done: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.bool, is_shared=False),
        next: TensorDict(
            fields={
                cos: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                done: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.bool, is_shared=False),
                observation: Tensor(shape=torch.Size([10, 3, 3]), device=cpu, dtype=torch.float32, is_shared=False),
                params: TensorDict(
                    fields={
                        dt: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
                        g: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
                        l: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
                        m: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
                        max_speed: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.int64, is_shared=False),
                        max_torque: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False)},
                    batch_size=torch.Size([10, 3]),
                    device=None,
                    is_shared=False),
                reward: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                sin: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                terminated: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.bool, is_shared=False),
                th: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                thdot: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False)},
            batch_size=torch.Size([10, 3]),
            device=None,
            is_shared=False),
        observation: Tensor(shape=torch.Size([10, 3, 3]), device=cpu, dtype=torch.float32, is_shared=False),
        params: TensorDict(
            fields={
                dt: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
                g: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
                l: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
                m: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
                max_speed: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.int64, is_shared=False),
                max_torque: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False)},
            batch_size=torch.Size([10, 3]),
            device=None,
            is_shared=False),
        sin: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        terminated: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.bool, is_shared=False),
        th: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        thdot: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False)},
    batch_size=torch.Size([10, 3]),
    device=None,
    is_shared=False)

训练一个简单的策略

在这个示例中,我们将把奖励用作可微分的目标(即负的损失)来训练一个简单的策略。我们将利用该动态系统完全可微分的特性,通过轨迹回报进行反向传播,并调整策略的权重以直接最大化该回报。当然,在许多情况下,这些假设(例如系统可微、能够完全访问底层动力学机制)并不成立。

尽管如此,这是一个非常简单的示例,展示了如何在 TorchRL 中使用自定义环境编写训练循环。

让我们首先编写策略网络:

torch.manual_seed(0)
env.set_seed(0)

net = nn.Sequential(
    nn.LazyLinear(64),
    nn.Tanh(),
    nn.LazyLinear(64),
    nn.Tanh(),
    nn.LazyLinear(64),
    nn.Tanh(),
    nn.LazyLinear(1),
)
policy = TensorDictModule(
    net,
    in_keys=["observation"],
    out_keys=["action"],
)

以及我们的优化器:

optim = torch.optim.Adam(policy.parameters(), lr=2e-3)

训练循环

我们将依次进行以下操作:

  • 生成一条轨迹

  • 累加奖励

  • 通过这些操作定义的图进行反向传播

  • 裁剪梯度范数并执行优化步骤

  • 重复上述过程

在训练循环结束时,我们应该得到一个接近 0 的最终奖励,这表明摆杆已经直立并保持静止,符合预期。

batch_size = 32
pbar = tqdm.tqdm(range(20_000 // batch_size))
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optim, 20_000)
logs = defaultdict(list)

for _ in pbar:
    init_td = env.reset(env.gen_params(batch_size=[batch_size]))
    rollout = env.rollout(100, policy, tensordict=init_td, auto_reset=False)
    traj_return = rollout["next", "reward"].mean()
    (-traj_return).backward()
    gn = torch.nn.utils.clip_grad_norm_(net.parameters(), 1.0)
    optim.step()
    optim.zero_grad()
    pbar.set_description(
        f"reward: {traj_return: 4.4f}, "
        f"last reward: {rollout[...,-1]['next','reward'].mean(): 4.4f}, gradient norm: {gn: 4.4}"
    )
    logs["return"].append(traj_return.item())
    logs["last_reward"].append(rollout[..., -1]["next", "reward"].mean().item())
    scheduler.step()


def plot():
    import matplotlib
    from matplotlib import pyplot as plt

    is_ipython = "inline" in matplotlib.get_backend()
    if is_ipython:
        from IPython import display

    with plt.ion():
        plt.figure(figsize=(10, 5))
        plt.subplot(1, 2, 1)
        plt.plot(logs["return"])
        plt.title("returns")
        plt.xlabel("iteration")
        plt.subplot(1, 2, 2)
        plt.plot(logs["last_reward"])
        plt.title("last reward")
        plt.xlabel("iteration")
        if is_ipython:
            display.display(plt.gcf())
            display.clear_output(wait=True)
        plt.show()


plot()

(图:returns 与 last reward 训练曲线)

  0%|          | 0/625 [00:00<?, ?it/s]
reward: -6.0488, last reward: -5.0748, gradient norm:  8.519:   0%|          | 0/625 [00:00<?, ?it/s]
reward: -6.0488, last reward: -5.0748, gradient norm:  8.519:   0%|          | 1/625 [00:00<02:15,  4.60it/s]
reward: -7.0499, last reward: -7.4472, gradient norm:  5.073:   0%|          | 1/625 [00:00<02:15,  4.60it/s]
reward: -7.0499, last reward: -7.4472, gradient norm:  5.073:   0%|          | 2/625 [00:00<02:16,  4.57it/s]
reward: -7.0685, last reward: -7.0408, gradient norm:  5.552:   0%|          | 2/625 [00:00<02:16,  4.57it/s]
reward: -7.0685, last reward: -7.0408, gradient norm:  5.552:   0%|          | 3/625 [00:00<02:16,  4.56it/s]
reward: -6.5154, last reward: -5.9086, gradient norm:  2.527:   0%|          | 3/625 [00:00<02:16,  4.56it/s]
reward: -6.5154, last reward: -5.9086, gradient norm:  2.527:   1%|          | 4/625 [00:00<02:15,  4.58it/s]
reward: -6.2006, last reward: -5.9385, gradient norm:  8.155:   1%|          | 4/625 [00:01<02:15,  4.58it/s]
reward: -6.2006, last reward: -5.9385, gradient norm:  8.155:   1%|          | 5/625 [00:01<02:15,  4.58it/s]
reward: -6.2568, last reward: -5.4981, gradient norm:  6.223:   1%|          | 5/625 [00:01<02:15,  4.58it/s]
reward: -6.2568, last reward: -5.4981, gradient norm:  6.223:   1%|          | 6/625 [00:01<02:15,  4.57it/s]
reward: -5.8929, last reward: -8.4491, gradient norm:  4.581:   1%|          | 6/625 [00:01<02:15,  4.57it/s]
reward: -5.8929, last reward: -8.4491, gradient norm:  4.581:   1%|1         | 7/625 [00:01<02:14,  4.58it/s]
reward: -6.3233, last reward: -9.0664, gradient norm:  7.596:   1%|1         | 7/625 [00:01<02:14,  4.58it/s]
reward: -6.3233, last reward: -9.0664, gradient norm:  7.596:   1%|1         | 8/625 [00:01<02:14,  4.59it/s]
reward: -6.1021, last reward: -9.5263, gradient norm:  0.9579:   1%|1         | 8/625 [00:01<02:14,  4.59it/s]
reward: -6.1021, last reward: -9.5263, gradient norm:  0.9579:   1%|1         | 9/625 [00:01<02:14,  4.59it/s]
reward: -6.5807, last reward: -8.8075, gradient norm:  3.212:   1%|1         | 9/625 [00:02<02:14,  4.59it/s]
reward: -6.5807, last reward: -8.8075, gradient norm:  3.212:   2%|1         | 10/625 [00:02<02:14,  4.59it/s]
reward: -6.2009, last reward: -8.5525, gradient norm:  2.914:   2%|1         | 10/625 [00:02<02:14,  4.59it/s]
reward: -6.2009, last reward: -8.5525, gradient norm:  2.914:   2%|1         | 11/625 [00:02<02:13,  4.59it/s]
reward: -6.2894, last reward: -8.0115, gradient norm:  52.06:   2%|1         | 11/625 [00:02<02:13,  4.59it/s]
reward: -6.2894, last reward: -8.0115, gradient norm:  52.06:   2%|1         | 12/625 [00:02<02:13,  4.60it/s]
reward: -6.0977, last reward: -6.1845, gradient norm:  18.09:   2%|1         | 12/625 [00:02<02:13,  4.60it/s]
reward: -6.0977, last reward: -6.1845, gradient norm:  18.09:   2%|2         | 13/625 [00:02<02:13,  4.60it/s]
reward: -6.1830, last reward: -7.4858, gradient norm:  5.233:   2%|2         | 13/625 [00:03<02:13,  4.60it/s]
reward: -6.1830, last reward: -7.4858, gradient norm:  5.233:   2%|2         | 14/625 [00:03<02:12,  4.60it/s]
reward: -6.2863, last reward: -5.0297, gradient norm:  1.464:   2%|2         | 14/625 [00:03<02:12,  4.60it/s]
reward: -6.2863, last reward: -5.0297, gradient norm:  1.464:   2%|2         | 15/625 [00:03<02:12,  4.60it/s]
reward: -6.4617, last reward: -5.5997, gradient norm:  2.904:   2%|2         | 15/625 [00:03<02:12,  4.60it/s]
reward: -6.4617, last reward: -5.5997, gradient norm:  2.904:   3%|2         | 16/625 [00:03<02:12,  4.60it/s]
reward: -6.1647, last reward: -6.0777, gradient norm:  4.901:   3%|2         | 16/625 [00:03<02:12,  4.60it/s]
reward: -6.1647, last reward: -6.0777, gradient norm:  4.901:   3%|2         | 17/625 [00:03<02:12,  4.60it/s]
reward: -6.4709, last reward: -6.6813, gradient norm:  0.8317:   3%|2         | 17/625 [00:03<02:12,  4.60it/s]
reward: -6.4709, last reward: -6.6813, gradient norm:  0.8317:   3%|2         | 18/625 [00:03<02:11,  4.60it/s]
reward: -6.3221, last reward: -6.5554, gradient norm:  1.276:   3%|2         | 18/625 [00:04<02:11,  4.60it/s]
reward: -6.3221, last reward: -6.5554, gradient norm:  1.276:   3%|3         | 19/625 [00:04<02:12,  4.57it/s]
reward: -6.3353, last reward: -7.9999, gradient norm:  4.701:   3%|3         | 19/625 [00:04<02:12,  4.57it/s]
reward: -6.3353, last reward: -7.9999, gradient norm:  4.701:   3%|3         | 20/625 [00:04<02:13,  4.55it/s]
reward: -5.8570, last reward: -7.6656, gradient norm:  5.463:   3%|3         | 20/625 [00:04<02:13,  4.55it/s]
reward: -5.8570, last reward: -7.6656, gradient norm:  5.463:   3%|3         | 21/625 [00:04<02:13,  4.54it/s]
reward: -5.7779, last reward: -6.6911, gradient norm:  6.875:   3%|3         | 21/625 [00:04<02:13,  4.54it/s]
reward: -5.7779, last reward: -6.6911, gradient norm:  6.875:   4%|3         | 22/625 [00:04<02:13,  4.53it/s]
reward: -6.0796, last reward: -5.7082, gradient norm:  5.308:   4%|3         | 22/625 [00:05<02:13,  4.53it/s]
reward: -6.0796, last reward: -5.7082, gradient norm:  5.308:   4%|3         | 23/625 [00:05<02:13,  4.52it/s]
reward: -6.0421, last reward: -6.1496, gradient norm:  12.4:   4%|3         | 23/625 [00:05<02:13,  4.52it/s]
reward: -6.0421, last reward: -6.1496, gradient norm:  12.4:   4%|3         | 24/625 [00:05<02:13,  4.52it/s]
reward: -5.5037, last reward: -5.1755, gradient norm:  22.62:   4%|3         | 24/625 [00:05<02:13,  4.52it/s]
reward: -5.5037, last reward: -5.1755, gradient norm:  22.62:   4%|4         | 25/625 [00:05<02:12,  4.52it/s]
reward: -5.5029, last reward: -4.9454, gradient norm:  3.665:   4%|4         | 25/625 [00:05<02:12,  4.52it/s]
reward: -5.5029, last reward: -4.9454, gradient norm:  3.665:   4%|4         | 26/625 [00:05<02:12,  4.52it/s]
reward: -5.9330, last reward: -6.2118, gradient norm:  5.444:   4%|4         | 26/625 [00:05<02:12,  4.52it/s]
reward: -5.9330, last reward: -6.2118, gradient norm:  5.444:   4%|4         | 27/625 [00:05<02:11,  4.55it/s]
reward: -6.0995, last reward: -6.6294, gradient norm:  11.69:   4%|4         | 27/625 [00:06<02:11,  4.55it/s]
reward: -6.0995, last reward: -6.6294, gradient norm:  11.69:   4%|4         | 28/625 [00:06<02:10,  4.56it/s]
reward: -6.3146, last reward: -7.2909, gradient norm:  5.461:   4%|4         | 28/625 [00:06<02:10,  4.56it/s]
reward: -6.3146, last reward: -7.2909, gradient norm:  5.461:   5%|4         | 29/625 [00:06<02:10,  4.57it/s]
reward: -5.9720, last reward: -6.1298, gradient norm:  19.91:   5%|4         | 29/625 [00:06<02:10,  4.57it/s]
reward: -5.9720, last reward: -6.1298, gradient norm:  19.91:   5%|4         | 30/625 [00:06<02:10,  4.56it/s]
reward: -5.9923, last reward: -7.0345, gradient norm:  3.464:   5%|4         | 30/625 [00:06<02:10,  4.56it/s]
reward: -5.9923, last reward: -7.0345, gradient norm:  3.464:   5%|4         | 31/625 [00:06<02:09,  4.57it/s]
reward: -5.3438, last reward: -4.3688, gradient norm:  2.424:   5%|4         | 31/625 [00:06<02:09,  4.57it/s]
reward: -5.3438, last reward: -4.3688, gradient norm:  2.424:   5%|5         | 32/625 [00:07<02:09,  4.59it/s]
reward: -5.6953, last reward: -4.5233, gradient norm:  3.411:   5%|5         | 32/625 [00:07<02:09,  4.59it/s]
reward: -5.6953, last reward: -4.5233, gradient norm:  3.411:   5%|5         | 33/625 [00:07<02:09,  4.58it/s]
reward: -5.4288, last reward: -2.8011, gradient norm:  10.82:   5%|5         | 33/625 [00:07<02:09,  4.58it/s]
reward: -5.4288, last reward: -2.8011, gradient norm:  10.82:   5%|5         | 34/625 [00:07<02:09,  4.55it/s]
reward: -5.5329, last reward: -4.2677, gradient norm:  15.71:   5%|5         | 34/625 [00:07<02:09,  4.55it/s]
reward: -5.5329, last reward: -4.2677, gradient norm:  15.71:   6%|5         | 35/625 [00:07<02:10,  4.53it/s]
reward: -5.6969, last reward: -3.7010, gradient norm:  1.376:   6%|5         | 35/625 [00:07<02:10,  4.53it/s]
reward: -5.6969, last reward: -3.7010, gradient norm:  1.376:   6%|5         | 36/625 [00:07<02:09,  4.55it/s]
reward: -5.9352, last reward: -4.7707, gradient norm:  15.49:   6%|5         | 36/625 [00:08<02:09,  4.55it/s]
reward: -5.9352, last reward: -4.7707, gradient norm:  15.49:   6%|5         | 37/625 [00:08<02:08,  4.57it/s]
reward: -5.6178, last reward: -4.5646, gradient norm:  3.348:   6%|5         | 37/625 [00:08<02:08,  4.57it/s]
reward: -5.6178, last reward: -4.5646, gradient norm:  3.348:   6%|6         | 38/625 [00:08<02:08,  4.58it/s]
reward: -5.7304, last reward: -3.9407, gradient norm:  4.942:   6%|6         | 38/625 [00:08<02:08,  4.58it/s]
reward: -5.7304, last reward: -3.9407, gradient norm:  4.942:   6%|6         | 39/625 [00:08<02:07,  4.58it/s]
reward: -5.3882, last reward: -3.7604, gradient norm:  9.85:   6%|6         | 39/625 [00:08<02:07,  4.58it/s]
reward: -5.3882, last reward: -3.7604, gradient norm:  9.85:   6%|6         | 40/625 [00:08<02:07,  4.60it/s]
reward: -5.3507, last reward: -2.8928, gradient norm:  1.258:   6%|6         | 40/625 [00:08<02:07,  4.60it/s]
reward: -5.3507, last reward: -2.8928, gradient norm:  1.258:   7%|6         | 41/625 [00:08<02:07,  4.58it/s]
reward: -5.6978, last reward: -4.4641, gradient norm:  4.549:   7%|6         | 41/625 [00:09<02:07,  4.58it/s]
reward: -5.6978, last reward: -4.4641, gradient norm:  4.549:   7%|6         | 42/625 [00:09<02:07,  4.58it/s]
reward: -5.5263, last reward: -3.6047, gradient norm:  2.544:   7%|6         | 42/625 [00:09<02:07,  4.58it/s]
reward: -5.5263, last reward: -3.6047, gradient norm:  2.544:   7%|6         | 43/625 [00:09<02:06,  4.58it/s]
reward: -5.5005, last reward: -4.4136, gradient norm:  11.49:   7%|6         | 43/625 [00:09<02:06,  4.58it/s]
reward: -5.4262, last reward: -3.6363, gradient norm:  2.382:   8%|8         | 50/625 [00:10<02:05,  4.57it/s]
reward: -4.9786, last reward: -3.2894, gradient norm:  32.73:  12%|#2        | 75/625 [00:16<01:59,  4.61it/s]
reward: -4.8741, last reward: -4.2230, gradient norm:  6.19:  16%|#6        | 100/625 [00:21<01:54,  4.58it/s]
reward: -4.2481, last reward: -7.0591, gradient norm:  11.85:  20%|##        | 125/625 [00:27<01:48,  4.62it/s]
reward: -4.9300, last reward: -4.7193, gradient norm:  8.563:  24%|##4       | 150/625 [00:32<01:43,  4.59it/s]
reward: -5.1807, last reward: -6.4375, gradient norm:  18.48:  28%|##8       | 175/625 [00:38<01:37,  4.61it/s]
reward: -3.8054, last reward: -2.3504, gradient norm:  5.557:  32%|###2      | 200/625 [00:43<01:32,  4.60it/s]
reward: -3.7342, last reward: -2.2396, gradient norm:  7.995:  36%|###6      | 225/625 [00:49<01:26,  4.60it/s]
reward: -4.5570, last reward: -7.0475, gradient norm:  22.45:  40%|####      | 250/625 [00:54<01:21,  4.60it/s]
reward: -3.2778, last reward: -3.4122, gradient norm:  28.52:  44%|####4     | 275/625 [00:59<01:15,  4.61it/s]
reward: -2.7475, last reward: -1.4190, gradient norm:  21.66:  48%|####8     | 300/625 [01:05<01:10,  4.61it/s]
reward: -1.9833, last reward: -0.1339, gradient norm:  4.402:  52%|#####2    | 325/625 [01:10<01:05,  4.62it/s]
reward: -2.4742, last reward: -0.1797, gradient norm:  47.32:  53%|#####2    | 330/625 [01:11<01:03,  4.61it/s]
reward: -2.0144, last reward: -0.0085, gradient norm:  4.791:  53%|#####2    | 330/625 [01:12<01:03,  4.61it/s]
reward: -2.0144, last reward: -0.0085, gradient norm:  4.791:  53%|#####2    | 331/625 [01:12<01:03,  4.61it/s]
reward: -1.8284, last reward: -0.0428, gradient norm:  12.29:  53%|#####2    | 331/625 [01:12<01:03,  4.61it/s]
reward: -1.8284, last reward: -0.0428, gradient norm:  12.29:  53%|#####3    | 332/625 [01:12<01:03,  4.62it/s]
reward: -2.5229, last reward: -0.0098, gradient norm:  0.7365:  53%|#####3    | 332/625 [01:12<01:03,  4.62it/s]
reward: -2.5229, last reward: -0.0098, gradient norm:  0.7365:  53%|#####3    | 333/625 [01:12<01:03,  4.62it/s]
reward: -2.4566, last reward: -0.0781, gradient norm:  2.086:  53%|#####3    | 333/625 [01:12<01:03,  4.62it/s]
reward: -2.4566, last reward: -0.0781, gradient norm:  2.086:  53%|#####3    | 334/625 [01:12<01:02,  4.62it/s]
reward: -2.3355, last reward: -0.0230, gradient norm:  1.311:  53%|#####3    | 334/625 [01:12<01:02,  4.62it/s]
reward: -2.3355, last reward: -0.0230, gradient norm:  1.311:  54%|#####3    | 335/625 [01:12<01:02,  4.62it/s]
reward: -1.9346, last reward: -0.0423, gradient norm:  1.076:  54%|#####3    | 335/625 [01:13<01:02,  4.62it/s]
reward: -1.9346, last reward: -0.0423, gradient norm:  1.076:  54%|#####3    | 336/625 [01:13<01:02,  4.62it/s]
reward: -2.3711, last reward: -0.1335, gradient norm:  0.6855:  54%|#####3    | 336/625 [01:13<01:02,  4.62it/s]
reward: -2.3711, last reward: -0.1335, gradient norm:  0.6855:  54%|#####3    | 337/625 [01:13<01:02,  4.62it/s]
reward: -2.0304, last reward: -0.0023, gradient norm:  0.8459:  54%|#####3    | 337/625 [01:13<01:02,  4.62it/s]
reward: -2.0304, last reward: -0.0023, gradient norm:  0.8459:  54%|#####4    | 338/625 [01:13<01:02,  4.62it/s]
reward: -1.9998, last reward: -0.4399, gradient norm:  13.1:  54%|#####4    | 338/625 [01:13<01:02,  4.62it/s]
reward: -1.9998, last reward: -0.4399, gradient norm:  13.1:  54%|#####4    | 339/625 [01:13<01:01,  4.62it/s]
reward: -2.2303, last reward: -2.1346, gradient norm:  45.99:  54%|#####4    | 339/625 [01:13<01:01,  4.62it/s]
reward: -2.2303, last reward: -2.1346, gradient norm:  45.99:  54%|#####4    | 340/625 [01:13<01:01,  4.61it/s]
reward: -2.2915, last reward: -1.7116, gradient norm:  40.34:  54%|#####4    | 340/625 [01:14<01:01,  4.61it/s]
reward: -2.2915, last reward: -1.7116, gradient norm:  40.34:  55%|#####4    | 341/625 [01:14<01:01,  4.61it/s]
reward: -2.5560, last reward: -0.0487, gradient norm:  1.195:  55%|#####4    | 341/625 [01:14<01:01,  4.61it/s]
reward: -2.5560, last reward: -0.0487, gradient norm:  1.195:  55%|#####4    | 342/625 [01:14<01:01,  4.61it/s]
reward: -2.5119, last reward: -0.0358, gradient norm:  1.061:  55%|#####4    | 342/625 [01:14<01:01,  4.61it/s]
reward: -2.5119, last reward: -0.0358, gradient norm:  1.061:  55%|#####4    | 343/625 [01:14<01:01,  4.61it/s]
reward: -2.3305, last reward: -0.3705, gradient norm:  1.957:  55%|#####4    | 343/625 [01:14<01:01,  4.61it/s]
reward: -2.3305, last reward: -0.3705, gradient norm:  1.957:  55%|#####5    | 344/625 [01:14<01:00,  4.62it/s]
reward: -2.6068, last reward: -0.2112, gradient norm:  13.83:  55%|#####5    | 344/625 [01:15<01:00,  4.62it/s]
reward: -2.6068, last reward: -0.2112, gradient norm:  13.83:  55%|#####5    | 345/625 [01:15<01:00,  4.61it/s]
reward: -2.5731, last reward: -1.8455, gradient norm:  66.75:  55%|#####5    | 345/625 [01:15<01:00,  4.61it/s]
reward: -2.5731, last reward: -1.8455, gradient norm:  66.75:  55%|#####5    | 346/625 [01:15<01:00,  4.61it/s]
reward: -2.3897, last reward: -0.0376, gradient norm:  1.608:  55%|#####5    | 346/625 [01:15<01:00,  4.61it/s]
reward: -2.3897, last reward: -0.0376, gradient norm:  1.608:  56%|#####5    | 347/625 [01:15<01:00,  4.60it/s]
reward: -2.2264, last reward: -0.0434, gradient norm:  2.012:  56%|#####5    | 347/625 [01:15<01:00,  4.60it/s]
reward: -2.2264, last reward: -0.0434, gradient norm:  2.012:  56%|#####5    | 348/625 [01:15<01:00,  4.61it/s]
reward: -2.1300, last reward: -0.1215, gradient norm:  2.557:  56%|#####5    | 348/625 [01:15<01:00,  4.61it/s]
reward: -2.1300, last reward: -0.1215, gradient norm:  2.557:  56%|#####5    | 349/625 [01:15<00:59,  4.61it/s]
reward: -2.0968, last reward: -0.0885, gradient norm:  3.389:  56%|#####5    | 349/625 [01:16<00:59,  4.61it/s]
reward: -2.0968, last reward: -0.0885, gradient norm:  3.389:  56%|#####6    | 350/625 [01:16<00:59,  4.61it/s]
reward: -2.1348, last reward: -0.0073, gradient norm:  0.5052:  56%|#####6    | 350/625 [01:16<00:59,  4.61it/s]
reward: -2.1348, last reward: -0.0073, gradient norm:  0.5052:  56%|#####6    | 351/625 [01:16<00:59,  4.61it/s]
reward: -2.4184, last reward: -3.2817, gradient norm:  108.6:  56%|#####6    | 351/625 [01:16<00:59,  4.61it/s]
reward: -2.4184, last reward: -3.2817, gradient norm:  108.6:  56%|#####6    | 352/625 [01:16<00:59,  4.61it/s]
reward: -2.3774, last reward: -1.8887, gradient norm:  54.07:  56%|#####6    | 352/625 [01:16<00:59,  4.61it/s]
reward: -2.3774, last reward: -1.8887, gradient norm:  54.07:  56%|#####6    | 353/625 [01:16<00:59,  4.60it/s]
reward: -2.4779, last reward: -0.1009, gradient norm:  10.91:  56%|#####6    | 353/625 [01:17<00:59,  4.60it/s]
reward: -2.4779, last reward: -0.1009, gradient norm:  10.91:  57%|#####6    | 354/625 [01:17<00:58,  4.60it/s]
reward: -2.2588, last reward: -0.0604, gradient norm:  2.599:  57%|#####6    | 354/625 [01:17<00:58,  4.60it/s]
reward: -2.2588, last reward: -0.0604, gradient norm:  2.599:  57%|#####6    | 355/625 [01:17<00:58,  4.60it/s]
reward: -2.4486, last reward: -0.1176, gradient norm:  3.656:  57%|#####6    | 355/625 [01:17<00:58,  4.60it/s]
reward: -2.4486, last reward: -0.1176, gradient norm:  3.656:  57%|#####6    | 356/625 [01:17<00:58,  4.60it/s]
reward: -2.2436, last reward: -0.0668, gradient norm:  2.724:  57%|#####6    | 356/625 [01:17<00:58,  4.60it/s]
reward: -2.2436, last reward: -0.0668, gradient norm:  2.724:  57%|#####7    | 357/625 [01:17<00:58,  4.60it/s]
reward: -1.8849, last reward: -0.0012, gradient norm:  5.326:  57%|#####7    | 357/625 [01:17<00:58,  4.60it/s]
reward: -1.8849, last reward: -0.0012, gradient norm:  5.326:  57%|#####7    | 358/625 [01:17<00:58,  4.59it/s]
reward: -2.7511, last reward: -0.8804, gradient norm:  13.6:  57%|#####7    | 358/625 [01:18<00:58,  4.59it/s]
reward: -2.7511, last reward: -0.8804, gradient norm:  13.6:  57%|#####7    | 359/625 [01:18<00:58,  4.59it/s]
reward: -2.8870, last reward: -3.6728, gradient norm:  33.56:  57%|#####7    | 359/625 [01:18<00:58,  4.59it/s]
reward: -2.8870, last reward: -3.6728, gradient norm:  33.56:  58%|#####7    | 360/625 [01:18<00:57,  4.59it/s]
reward: -2.8841, last reward: -2.5508, gradient norm:  30.93:  58%|#####7    | 360/625 [01:18<00:57,  4.59it/s]
reward: -2.8841, last reward: -2.5508, gradient norm:  30.93:  58%|#####7    | 361/625 [01:18<00:57,  4.60it/s]
reward: -2.5242, last reward: -1.0268, gradient norm:  33.15:  58%|#####7    | 361/625 [01:18<00:57,  4.60it/s]
reward: -2.5242, last reward: -1.0268, gradient norm:  33.15:  58%|#####7    | 362/625 [01:18<00:57,  4.60it/s]
reward: -2.3232, last reward: -0.0013, gradient norm:  0.6185:  58%|#####7    | 362/625 [01:18<00:57,  4.60it/s]
reward: -2.3232, last reward: -0.0013, gradient norm:  0.6185:  58%|#####8    | 363/625 [01:18<00:56,  4.60it/s]
reward: -2.1378, last reward: -0.0204, gradient norm:  1.337:  58%|#####8    | 363/625 [01:19<00:56,  4.60it/s]
reward: -2.1378, last reward: -0.0204, gradient norm:  1.337:  58%|#####8    | 364/625 [01:19<00:56,  4.60it/s]
reward: -2.2677, last reward: -0.0355, gradient norm:  1.685:  58%|#####8    | 364/625 [01:19<00:56,  4.60it/s]
reward: -2.2677, last reward: -0.0355, gradient norm:  1.685:  58%|#####8    | 365/625 [01:19<00:56,  4.60it/s]
reward: -2.4884, last reward: -0.0231, gradient norm:  1.213:  58%|#####8    | 365/625 [01:19<00:56,  4.60it/s]
reward: -2.4884, last reward: -0.0231, gradient norm:  1.213:  59%|#####8    | 366/625 [01:19<00:56,  4.60it/s]
reward: -2.0770, last reward: -0.0014, gradient norm:  0.6793:  59%|#####8    | 366/625 [01:19<00:56,  4.60it/s]
reward: -2.0770, last reward: -0.0014, gradient norm:  0.6793:  59%|#####8    | 367/625 [01:19<00:55,  4.61it/s]
reward: -1.9834, last reward: -0.0349, gradient norm:  1.863:  59%|#####8    | 367/625 [01:20<00:55,  4.61it/s]
reward: -1.9834, last reward: -0.0349, gradient norm:  1.863:  59%|#####8    | 368/625 [01:20<00:55,  4.61it/s]
reward: -2.6709, last reward: -0.1416, gradient norm:  5.462:  59%|#####8    | 368/625 [01:20<00:55,  4.61it/s]
reward: -2.6709, last reward: -0.1416, gradient norm:  5.462:  59%|#####9    | 369/625 [01:20<00:55,  4.61it/s]
reward: -2.5199, last reward: -3.9790, gradient norm:  47.67:  59%|#####9    | 369/625 [01:20<00:55,  4.61it/s]
reward: -2.5199, last reward: -3.9790, gradient norm:  47.67:  59%|#####9    | 370/625 [01:20<00:55,  4.60it/s]
reward: -2.9401, last reward: -3.7802, gradient norm:  32.47:  59%|#####9    | 370/625 [01:20<00:55,  4.60it/s]
reward: -2.9401, last reward: -3.7802, gradient norm:  32.47:  59%|#####9    | 371/625 [01:20<00:55,  4.61it/s]
reward: -2.6723, last reward: -3.6507, gradient norm:  45.1:  59%|#####9    | 371/625 [01:20<00:55,  4.61it/s]
reward: -2.6723, last reward: -3.6507, gradient norm:  45.1:  60%|#####9    | 372/625 [01:20<00:54,  4.61it/s]
reward: -2.2678, last reward: -0.6201, gradient norm:  32.94:  60%|#####9    | 372/625 [01:21<00:54,  4.61it/s]
reward: -2.2678, last reward: -0.6201, gradient norm:  32.94:  60%|#####9    | 373/625 [01:21<00:54,  4.61it/s]
reward: -2.2184, last reward: -0.0075, gradient norm:  0.7385:  60%|#####9    | 373/625 [01:21<00:54,  4.61it/s]
reward: -2.2184, last reward: -0.0075, gradient norm:  0.7385:  60%|#####9    | 374/625 [01:21<00:54,  4.61it/s]
reward: -2.6344, last reward: -0.0576, gradient norm:  1.617:  60%|#####9    | 374/625 [01:21<00:54,  4.61it/s]
reward: -2.6344, last reward: -0.0576, gradient norm:  1.617:  60%|######    | 375/625 [01:21<00:54,  4.59it/s]
reward: -1.9945, last reward: -0.0772, gradient norm:  2.567:  60%|######    | 375/625 [01:21<00:54,  4.59it/s]
reward: -1.9945, last reward: -0.0772, gradient norm:  2.567:  60%|######    | 376/625 [01:21<00:54,  4.60it/s]
reward: -1.7576, last reward: -0.0398, gradient norm:  1.961:  60%|######    | 376/625 [01:22<00:54,  4.60it/s]
reward: -1.7576, last reward: -0.0398, gradient norm:  1.961:  60%|######    | 377/625 [01:22<00:53,  4.60it/s]
reward: -2.3396, last reward: -0.0022, gradient norm:  1.094:  60%|######    | 377/625 [01:22<00:53,  4.60it/s]
reward: -2.3396, last reward: -0.0022, gradient norm:  1.094:  60%|######    | 378/625 [01:22<00:53,  4.61it/s]
reward: -2.3073, last reward: -0.4018, gradient norm:  29.23:  60%|######    | 378/625 [01:22<00:53,  4.61it/s]
reward: -2.3073, last reward: -0.4018, gradient norm:  29.23:  61%|######    | 379/625 [01:22<00:53,  4.61it/s]
reward: -2.3313, last reward: -1.1869, gradient norm:  38.62:  61%|######    | 379/625 [01:22<00:53,  4.61it/s]
reward: -2.3313, last reward: -1.1869, gradient norm:  38.62:  61%|######    | 380/625 [01:22<00:53,  4.60it/s]
reward: -2.0481, last reward: -0.1117, gradient norm:  5.321:  61%|######    | 380/625 [01:22<00:53,  4.60it/s]
reward: -2.0481, last reward: -0.1117, gradient norm:  5.321:  61%|######    | 381/625 [01:22<00:53,  4.60it/s]
reward: -1.6823, last reward: -0.0001, gradient norm:  1.981:  61%|######    | 381/625 [01:23<00:53,  4.60it/s]
reward: -1.6823, last reward: -0.0001, gradient norm:  1.981:  61%|######1   | 382/625 [01:23<00:52,  4.60it/s]
reward: -1.8305, last reward: -0.0210, gradient norm:  1.228:  61%|######1   | 382/625 [01:23<00:52,  4.60it/s]
reward: -1.8305, last reward: -0.0210, gradient norm:  1.228:  61%|######1   | 383/625 [01:23<00:52,  4.60it/s]
reward: -1.4908, last reward: -0.0272, gradient norm:  1.538:  61%|######1   | 383/625 [01:23<00:52,  4.60it/s]
reward: -1.4908, last reward: -0.0272, gradient norm:  1.538:  61%|######1   | 384/625 [01:23<00:52,  4.61it/s]
reward: -2.3267, last reward: -0.0111, gradient norm:  0.7965:  61%|######1   | 384/625 [01:23<00:52,  4.61it/s]
reward: -2.3267, last reward: -0.0111, gradient norm:  0.7965:  62%|######1   | 385/625 [01:23<00:52,  4.61it/s]
reward: -2.1796, last reward: -0.0039, gradient norm:  0.5396:  62%|######1   | 385/625 [01:23<00:52,  4.61it/s]
reward: -2.1796, last reward: -0.0039, gradient norm:  0.5396:  62%|######1   | 386/625 [01:23<00:51,  4.61it/s]
reward: -2.3757, last reward: -0.0490, gradient norm:  2.237:  62%|######1   | 386/625 [01:24<00:51,  4.61it/s]
reward: -2.3757, last reward: -0.0490, gradient norm:  2.237:  62%|######1   | 387/625 [01:24<00:51,  4.61it/s]
reward: -2.1394, last reward: -0.4187, gradient norm:  52.11:  62%|######1   | 387/625 [01:24<00:51,  4.61it/s]
reward: -2.1394, last reward: -0.4187, gradient norm:  52.11:  62%|######2   | 388/625 [01:24<00:51,  4.61it/s]
reward: -2.2986, last reward: -0.0038, gradient norm:  0.7954:  62%|######2   | 388/625 [01:24<00:51,  4.61it/s]
reward: -2.2986, last reward: -0.0038, gradient norm:  0.7954:  62%|######2   | 389/625 [01:24<00:51,  4.61it/s]
reward: -2.1274, last reward: -0.0063, gradient norm:  0.813:  62%|######2   | 389/625 [01:24<00:51,  4.61it/s]
reward: -2.1274, last reward: -0.0063, gradient norm:  0.813:  62%|######2   | 390/625 [01:24<00:51,  4.61it/s]
reward: -1.8706, last reward: -0.0114, gradient norm:  3.325:  62%|######2   | 390/625 [01:25<00:51,  4.61it/s]
reward: -1.8706, last reward: -0.0114, gradient norm:  3.325:  63%|######2   | 391/625 [01:25<00:50,  4.61it/s]
reward: -1.6922, last reward: -0.0004, gradient norm:  0.2423:  63%|######2   | 391/625 [01:25<00:50,  4.61it/s]
reward: -1.6922, last reward: -0.0004, gradient norm:  0.2423:  63%|######2   | 392/625 [01:25<00:50,  4.61it/s]
reward: -1.9115, last reward: -0.2602, gradient norm:  2.599:  63%|######2   | 392/625 [01:25<00:50,  4.61it/s]
reward: -1.9115, last reward: -0.2602, gradient norm:  2.599:  63%|######2   | 393/625 [01:25<00:50,  4.62it/s]
reward: -2.2449, last reward: -0.0783, gradient norm:  5.199:  63%|######2   | 393/625 [01:25<00:50,  4.62it/s]
reward: -2.2449, last reward: -0.0783, gradient norm:  5.199:  63%|######3   | 394/625 [01:25<00:50,  4.62it/s]
reward: -2.0631, last reward: -0.0057, gradient norm:  0.7444:  63%|######3   | 394/625 [01:25<00:50,  4.62it/s]
reward: -2.0631, last reward: -0.0057, gradient norm:  0.7444:  63%|######3   | 395/625 [01:25<00:49,  4.62it/s]
reward: -2.3339, last reward: -0.0167, gradient norm:  1.39:  63%|######3   | 395/625 [01:26<00:49,  4.62it/s]
reward: -2.3339, last reward: -0.0167, gradient norm:  1.39:  63%|######3   | 396/625 [01:26<00:49,  4.61it/s]
reward: -2.4806, last reward: -0.0023, gradient norm:  2.317:  63%|######3   | 396/625 [01:26<00:49,  4.61it/s]
reward: -2.4806, last reward: -0.0023, gradient norm:  2.317:  64%|######3   | 397/625 [01:26<00:49,  4.61it/s]
reward: -2.4171, last reward: -0.1438, gradient norm:  5.067:  64%|######3   | 397/625 [01:26<00:49,  4.61it/s]
reward: -2.4171, last reward: -0.1438, gradient norm:  5.067:  64%|######3   | 398/625 [01:26<00:49,  4.62it/s]
reward: -2.2618, last reward: -0.5809, gradient norm:  20.39:  64%|######3   | 398/625 [01:26<00:49,  4.62it/s]
reward: -2.2618, last reward: -0.5809, gradient norm:  20.39:  64%|######3   | 399/625 [01:26<00:48,  4.62it/s]
reward: -2.0115, last reward: -0.0054, gradient norm:  0.3364:  64%|######3   | 399/625 [01:27<00:48,  4.62it/s]
reward: -2.0115, last reward: -0.0054, gradient norm:  0.3364:  64%|######4   | 400/625 [01:27<00:48,  4.61it/s]
reward: -1.8733, last reward: -0.0184, gradient norm:  2.275:  64%|######4   | 400/625 [01:27<00:48,  4.61it/s]
reward: -1.8733, last reward: -0.0184, gradient norm:  2.275:  64%|######4   | 401/625 [01:27<00:48,  4.61it/s]
reward: -1.9137, last reward: -0.0113, gradient norm:  1.025:  64%|######4   | 401/625 [01:27<00:48,  4.61it/s]
reward: -1.9137, last reward: -0.0113, gradient norm:  1.025:  64%|######4   | 402/625 [01:27<00:48,  4.62it/s]
reward: -2.0386, last reward: -0.0625, gradient norm:  2.763:  64%|######4   | 402/625 [01:27<00:48,  4.62it/s]
reward: -2.0386, last reward: -0.0625, gradient norm:  2.763:  64%|######4   | 403/625 [01:27<00:48,  4.62it/s]
reward: -2.1332, last reward: -0.0582, gradient norm:  0.7816:  64%|######4   | 403/625 [01:27<00:48,  4.62it/s]
reward: -2.1332, last reward: -0.0582, gradient norm:  0.7816:  65%|######4   | 404/625 [01:27<00:48,  4.58it/s]
reward: -1.8341, last reward: -0.0941, gradient norm:  5.854:  65%|######4   | 404/625 [01:28<00:48,  4.58it/s]
reward: -1.8341, last reward: -0.0941, gradient norm:  5.854:  65%|######4   | 405/625 [01:28<00:48,  4.56it/s]
reward: -1.8615, last reward: -0.0968, gradient norm:  4.588:  65%|######4   | 405/625 [01:28<00:48,  4.56it/s]
reward: -1.8615, last reward: -0.0968, gradient norm:  4.588:  65%|######4   | 406/625 [01:28<00:48,  4.54it/s]
reward: -2.0981, last reward: -0.3849, gradient norm:  6.008:  65%|######4   | 406/625 [01:28<00:48,  4.54it/s]
reward: -2.0981, last reward: -0.3849, gradient norm:  6.008:  65%|######5   | 407/625 [01:28<00:48,  4.53it/s]
reward: -1.9395, last reward: -0.0765, gradient norm:  4.055:  65%|######5   | 407/625 [01:28<00:48,  4.53it/s]
reward: -1.9395, last reward: -0.0765, gradient norm:  4.055:  65%|######5   | 408/625 [01:28<00:47,  4.55it/s]
reward: -2.2685, last reward: -0.2235, gradient norm:  1.688:  65%|######5   | 408/625 [01:28<00:47,  4.55it/s]
reward: -2.2685, last reward: -0.2235, gradient norm:  1.688:  65%|######5   | 409/625 [01:28<00:47,  4.55it/s]
reward: -2.3052, last reward: -1.4249, gradient norm:  25.99:  65%|######5   | 409/625 [01:29<00:47,  4.55it/s]
reward: -2.3052, last reward: -1.4249, gradient norm:  25.99:  66%|######5   | 410/625 [01:29<00:47,  4.56it/s]
reward: -2.6806, last reward: -1.6383, gradient norm:  30.59:  66%|######5   | 410/625 [01:29<00:47,  4.56it/s]
reward: -2.6806, last reward: -1.6383, gradient norm:  30.59:  66%|######5   | 411/625 [01:29<00:46,  4.58it/s]
reward: -2.3721, last reward: -2.9981, gradient norm:  74.37:  66%|######5   | 411/625 [01:29<00:46,  4.58it/s]
reward: -2.3721, last reward: -2.9981, gradient norm:  74.37:  66%|######5   | 412/625 [01:29<00:46,  4.59it/s]
reward: -2.1862, last reward: -0.0063, gradient norm:  1.822:  66%|######5   | 412/625 [01:29<00:46,  4.59it/s]
reward: -2.1862, last reward: -0.0063, gradient norm:  1.822:  66%|######6   | 413/625 [01:29<00:46,  4.60it/s]
reward: -1.9811, last reward: -0.0171, gradient norm:  1.013:  66%|######6   | 413/625 [01:30<00:46,  4.60it/s]
reward: -1.9811, last reward: -0.0171, gradient norm:  1.013:  66%|######6   | 414/625 [01:30<00:45,  4.60it/s]
reward: -2.0252, last reward: -0.0049, gradient norm:  0.6205:  66%|######6   | 414/625 [01:30<00:45,  4.60it/s]
reward: -2.0252, last reward: -0.0049, gradient norm:  0.6205:  66%|######6   | 415/625 [01:30<00:45,  4.60it/s]
reward: -2.1108, last reward: -0.4921, gradient norm:  23.74:  66%|######6   | 415/625 [01:30<00:45,  4.60it/s]
reward: -2.1108, last reward: -0.4921, gradient norm:  23.74:  67%|######6   | 416/625 [01:30<00:45,  4.60it/s]
reward: -1.9142, last reward: -0.8130, gradient norm:  52.65:  67%|######6   | 416/625 [01:30<00:45,  4.60it/s]
reward: -1.9142, last reward: -0.8130, gradient norm:  52.65:  67%|######6   | 417/625 [01:30<00:45,  4.61it/s]
reward: -2.1725, last reward: -0.0036, gradient norm:  0.3196:  67%|######6   | 417/625 [01:30<00:45,  4.61it/s]
reward: -2.1725, last reward: -0.0036, gradient norm:  0.3196:  67%|######6   | 418/625 [01:30<00:44,  4.62it/s]
reward: -1.7795, last reward: -0.0242, gradient norm:  1.799:  67%|######6   | 418/625 [01:31<00:44,  4.62it/s]
reward: -1.7795, last reward: -0.0242, gradient norm:  1.799:  67%|######7   | 419/625 [01:31<00:44,  4.61it/s]
reward: -1.7737, last reward: -0.0138, gradient norm:  1.39:  67%|######7   | 419/625 [01:31<00:44,  4.61it/s]
reward: -1.7737, last reward: -0.0138, gradient norm:  1.39:  67%|######7   | 420/625 [01:31<00:44,  4.61it/s]
reward: -2.1462, last reward: -0.0053, gradient norm:  0.47:  67%|######7   | 420/625 [01:31<00:44,  4.61it/s]
reward: -2.1462, last reward: -0.0053, gradient norm:  0.47:  67%|######7   | 421/625 [01:31<00:44,  4.62it/s]
reward: -1.9226, last reward: -0.6139, gradient norm:  40.3:  67%|######7   | 421/625 [01:31<00:44,  4.62it/s]
reward: -1.9226, last reward: -0.6139, gradient norm:  40.3:  68%|######7   | 422/625 [01:31<00:44,  4.59it/s]
reward: -1.9889, last reward: -0.0403, gradient norm:  1.112:  68%|######7   | 422/625 [01:32<00:44,  4.59it/s]
reward: -1.9889, last reward: -0.0403, gradient norm:  1.112:  68%|######7   | 423/625 [01:32<00:43,  4.59it/s]
reward: -1.6194, last reward: -0.0032, gradient norm:  0.79:  68%|######7   | 423/625 [01:32<00:43,  4.59it/s]
reward: -1.6194, last reward: -0.0032, gradient norm:  0.79:  68%|######7   | 424/625 [01:32<00:43,  4.60it/s]
reward: -2.3989, last reward: -0.0104, gradient norm:  1.134:  68%|######7   | 424/625 [01:32<00:43,  4.60it/s]
reward: -2.3989, last reward: -0.0104, gradient norm:  1.134:  68%|######8   | 425/625 [01:32<00:43,  4.61it/s]
reward: -1.9960, last reward: -0.0009, gradient norm:  0.6009:  68%|######8   | 425/625 [01:32<00:43,  4.61it/s]
reward: -1.9960, last reward: -0.0009, gradient norm:  0.6009:  68%|######8   | 426/625 [01:32<00:43,  4.61it/s]
reward: -2.2697, last reward: -0.0914, gradient norm:  2.905:  68%|######8   | 426/625 [01:32<00:43,  4.61it/s]
reward: -2.2697, last reward: -0.0914, gradient norm:  2.905:  68%|######8   | 427/625 [01:32<00:42,  4.61it/s]
reward: -2.4256, last reward: -0.1114, gradient norm:  2.102:  68%|######8   | 427/625 [01:33<00:42,  4.61it/s]
reward: -2.4256, last reward: -0.1114, gradient norm:  2.102:  68%|######8   | 428/625 [01:33<00:42,  4.61it/s]
reward: -1.9862, last reward: -0.1932, gradient norm:  22.44:  68%|######8   | 428/625 [01:33<00:42,  4.61it/s]
reward: -1.9862, last reward: -0.1932, gradient norm:  22.44:  69%|######8   | 429/625 [01:33<00:42,  4.61it/s]
reward: -2.0637, last reward: -0.0623, gradient norm:  3.082:  69%|######8   | 429/625 [01:33<00:42,  4.61it/s]
reward: -2.0637, last reward: -0.0623, gradient norm:  3.082:  69%|######8   | 430/625 [01:33<00:42,  4.61it/s]
reward: -1.9906, last reward: -0.2031, gradient norm:  5.5:  69%|######8   | 430/625 [01:33<00:42,  4.61it/s]
reward: -1.9906, last reward: -0.2031, gradient norm:  5.5:  69%|######8   | 431/625 [01:33<00:42,  4.61it/s]
reward: -1.9948, last reward: -0.0895, gradient norm:  3.456:  69%|######8   | 431/625 [01:33<00:42,  4.61it/s]
reward: -1.9948, last reward: -0.0895, gradient norm:  3.456:  69%|######9   | 432/625 [01:33<00:41,  4.62it/s]
reward: -2.1970, last reward: -0.0256, gradient norm:  1.593:  69%|######9   | 432/625 [01:34<00:41,  4.62it/s]
reward: -2.1970, last reward: -0.0256, gradient norm:  1.593:  69%|######9   | 433/625 [01:34<00:41,  4.62it/s]
reward: -2.4231, last reward: -0.0449, gradient norm:  3.644:  69%|######9   | 433/625 [01:34<00:41,  4.62it/s]
reward: -2.4231, last reward: -0.0449, gradient norm:  3.644:  69%|######9   | 434/625 [01:34<00:41,  4.61it/s]
reward: -2.1039, last reward: -3.1973, gradient norm:  87.37:  69%|######9   | 434/625 [01:34<00:41,  4.61it/s]
reward: -2.1039, last reward: -3.1973, gradient norm:  87.37:  70%|######9   | 435/625 [01:34<00:41,  4.61it/s]
reward: -2.4561, last reward: -0.1225, gradient norm:  6.119:  70%|######9   | 435/625 [01:34<00:41,  4.61it/s]
reward: -2.4561, last reward: -0.1225, gradient norm:  6.119:  70%|######9   | 436/625 [01:34<00:40,  4.61it/s]
reward: -2.0211, last reward: -0.2125, gradient norm:  2.94:  70%|######9   | 436/625 [01:35<00:40,  4.61it/s]
reward: -2.0211, last reward: -0.2125, gradient norm:  2.94:  70%|######9   | 437/625 [01:35<00:40,  4.61it/s]
reward: -2.3866, last reward: -0.0050, gradient norm:  0.7202:  70%|######9   | 437/625 [01:35<00:40,  4.61it/s]
reward: -2.3866, last reward: -0.0050, gradient norm:  0.7202:  70%|#######   | 438/625 [01:35<00:40,  4.61it/s]
reward: -1.6388, last reward: -0.0072, gradient norm:  0.8657:  70%|#######   | 438/625 [01:35<00:40,  4.61it/s]
reward: -1.6388, last reward: -0.0072, gradient norm:  0.8657:  70%|#######   | 439/625 [01:35<00:40,  4.61it/s]
reward: -2.1187, last reward: -0.0015, gradient norm:  0.5116:  70%|#######   | 439/625 [01:35<00:40,  4.61it/s]
reward: -2.1187, last reward: -0.0015, gradient norm:  0.5116:  70%|#######   | 440/625 [01:35<00:40,  4.61it/s]
reward: -2.0432, last reward: -0.0025, gradient norm:  0.7809:  70%|#######   | 440/625 [01:35<00:40,  4.61it/s]
reward: -2.0432, last reward: -0.0025, gradient norm:  0.7809:  71%|#######   | 441/625 [01:35<00:39,  4.60it/s]
reward: -2.1925, last reward: -0.0103, gradient norm:  2.83:  71%|#######   | 441/625 [01:36<00:39,  4.60it/s]
reward: -2.1925, last reward: -0.0103, gradient norm:  2.83:  71%|#######   | 442/625 [01:36<00:39,  4.60it/s]
reward: -1.9570, last reward: -0.0002, gradient norm:  0.35:  71%|#######   | 442/625 [01:36<00:39,  4.60it/s]
reward: -1.9570, last reward: -0.0002, gradient norm:  0.35:  71%|#######   | 443/625 [01:36<00:39,  4.61it/s]
reward: -2.0871, last reward: -0.0022, gradient norm:  0.5601:  71%|#######   | 443/625 [01:36<00:39,  4.61it/s]
reward: -2.0871, last reward: -0.0022, gradient norm:  0.5601:  71%|#######1  | 444/625 [01:36<00:39,  4.61it/s]
reward: -2.0165, last reward: -0.0047, gradient norm:  0.6061:  71%|#######1  | 444/625 [01:36<00:39,  4.61it/s]
reward: -2.0165, last reward: -0.0047, gradient norm:  0.6061:  71%|#######1  | 445/625 [01:36<00:39,  4.61it/s]
reward: -2.2746, last reward: -0.0027, gradient norm:  0.7887:  71%|#######1  | 445/625 [01:37<00:39,  4.61it/s]
reward: -2.2746, last reward: -0.0027, gradient norm:  0.7887:  71%|#######1  | 446/625 [01:37<00:38,  4.62it/s]
reward: -2.1835, last reward: -0.0035, gradient norm:  0.855:  71%|#######1  | 446/625 [01:37<00:38,  4.62it/s]
reward: -2.1835, last reward: -0.0035, gradient norm:  0.855:  72%|#######1  | 447/625 [01:37<00:38,  4.62it/s]
reward: -1.8420, last reward: -0.0103, gradient norm:  1.548:  72%|#######1  | 447/625 [01:37<00:38,  4.62it/s]
reward: -1.8420, last reward: -0.0103, gradient norm:  1.548:  72%|#######1  | 448/625 [01:37<00:38,  4.62it/s]
reward: -2.2653, last reward: -0.0126, gradient norm:  0.9736:  72%|#######1  | 448/625 [01:37<00:38,  4.62it/s]
reward: -2.2653, last reward: -0.0126, gradient norm:  0.9736:  72%|#######1  | 449/625 [01:37<00:38,  4.62it/s]
reward: -2.0594, last reward: -0.0119, gradient norm:  0.6196:  72%|#######1  | 449/625 [01:37<00:38,  4.62it/s]
reward: -2.0594, last reward: -0.0119, gradient norm:  0.6196:  72%|#######2  | 450/625 [01:37<00:37,  4.62it/s]
reward: -2.4509, last reward: -0.0373, gradient norm:  11.44:  72%|#######2  | 450/625 [01:38<00:37,  4.62it/s]
reward: -2.4509, last reward: -0.0373, gradient norm:  11.44:  72%|#######2  | 451/625 [01:38<00:37,  4.61it/s]
reward: -2.2528, last reward: -0.0620, gradient norm:  3.992:  72%|#######2  | 451/625 [01:38<00:37,  4.61it/s]
reward: -2.2528, last reward: -0.0620, gradient norm:  3.992:  72%|#######2  | 452/625 [01:38<00:37,  4.61it/s]
reward: -1.6898, last reward: -0.3235, gradient norm:  6.687:  72%|#######2  | 452/625 [01:38<00:37,  4.61it/s]
reward: -1.6898, last reward: -0.3235, gradient norm:  6.687:  72%|#######2  | 453/625 [01:38<00:37,  4.61it/s]
reward: -1.5879, last reward: -0.0905, gradient norm:  2.84:  72%|#######2  | 453/625 [01:38<00:37,  4.61it/s]
reward: -1.5879, last reward: -0.0905, gradient norm:  2.84:  73%|#######2  | 454/625 [01:38<00:37,  4.61it/s]
reward: -1.8406, last reward: -0.0694, gradient norm:  2.288:  73%|#######2  | 454/625 [01:38<00:37,  4.61it/s]
reward: -1.8406, last reward: -0.0694, gradient norm:  2.288:  73%|#######2  | 455/625 [01:38<00:36,  4.61it/s]
reward: -1.8259, last reward: -0.0235, gradient norm:  1.304:  73%|#######2  | 455/625 [01:39<00:36,  4.61it/s]
reward: -1.8259, last reward: -0.0235, gradient norm:  1.304:  73%|#######2  | 456/625 [01:39<00:36,  4.61it/s]
reward: -1.8500, last reward: -0.0024, gradient norm:  1.416:  73%|#######2  | 456/625 [01:39<00:36,  4.61it/s]
reward: -1.8500, last reward: -0.0024, gradient norm:  1.416:  73%|#######3  | 457/625 [01:39<00:36,  4.62it/s]
reward: -1.9649, last reward: -0.4054, gradient norm:  39.3:  73%|#######3  | 457/625 [01:39<00:36,  4.62it/s]
reward: -1.9649, last reward: -0.4054, gradient norm:  39.3:  73%|#######3  | 458/625 [01:39<00:36,  4.62it/s]
reward: -2.2027, last reward: -0.0894, gradient norm:  4.275:  73%|#######3  | 458/625 [01:39<00:36,  4.62it/s]
reward: -2.2027, last reward: -0.0894, gradient norm:  4.275:  73%|#######3  | 459/625 [01:39<00:35,  4.62it/s]
reward: -1.5966, last reward: -0.0113, gradient norm:  1.368:  73%|#######3  | 459/625 [01:40<00:35,  4.62it/s]
reward: -1.5966, last reward: -0.0113, gradient norm:  1.368:  74%|#######3  | 460/625 [01:40<00:35,  4.62it/s]
reward: -1.6942, last reward: -0.0016, gradient norm:  0.4254:  74%|#######3  | 460/625 [01:40<00:35,  4.62it/s]
reward: -1.6942, last reward: -0.0016, gradient norm:  0.4254:  74%|#######3  | 461/625 [01:40<00:35,  4.62it/s]
reward: -1.6703, last reward: -0.0145, gradient norm:  2.142:  74%|#######3  | 461/625 [01:40<00:35,  4.62it/s]
reward: -1.6703, last reward: -0.0145, gradient norm:  2.142:  74%|#######3  | 462/625 [01:40<00:35,  4.62it/s]
reward: -1.8124, last reward: -0.0218, gradient norm:  0.9196:  74%|#######3  | 462/625 [01:40<00:35,  4.62it/s]
reward: -1.8124, last reward: -0.0218, gradient norm:  0.9196:  74%|#######4  | 463/625 [01:40<00:34,  4.63it/s]
reward: -1.8657, last reward: -0.0188, gradient norm:  0.8986:  74%|#######4  | 463/625 [01:40<00:34,  4.63it/s]
reward: -1.8657, last reward: -0.0188, gradient norm:  0.8986:  74%|#######4  | 464/625 [01:40<00:34,  4.63it/s]
reward: -2.0884, last reward: -0.0084, gradient norm:  0.5624:  74%|#######4  | 464/625 [01:41<00:34,  4.63it/s]
reward: -2.0884, last reward: -0.0084, gradient norm:  0.5624:  74%|#######4  | 465/625 [01:41<00:34,  4.62it/s]
reward: -1.8862, last reward: -0.0006, gradient norm:  0.5384:  74%|#######4  | 465/625 [01:41<00:34,  4.62it/s]
reward: -1.8862, last reward: -0.0006, gradient norm:  0.5384:  75%|#######4  | 466/625 [01:41<00:34,  4.62it/s]
reward: -2.1973, last reward: -0.0022, gradient norm:  0.5837:  75%|#######4  | 466/625 [01:41<00:34,  4.62it/s]
reward: -2.1973, last reward: -0.0022, gradient norm:  0.5837:  75%|#######4  | 467/625 [01:41<00:34,  4.62it/s]
reward: -1.8954, last reward: -0.0101, gradient norm:  0.6751:  75%|#######4  | 467/625 [01:41<00:34,  4.62it/s]
reward: -1.8954, last reward: -0.0101, gradient norm:  0.6751:  75%|#######4  | 468/625 [01:41<00:33,  4.62it/s]
reward: -1.8063, last reward: -0.0122, gradient norm:  0.9635:  75%|#######4  | 468/625 [01:41<00:33,  4.62it/s]
reward: -1.8063, last reward: -0.0122, gradient norm:  0.9635:  75%|#######5  | 469/625 [01:41<00:33,  4.62it/s]
reward: -2.0692, last reward: -0.0027, gradient norm:  0.4216:  75%|#######5  | 469/625 [01:42<00:33,  4.62it/s]
reward: -2.0692, last reward: -0.0027, gradient norm:  0.4216:  75%|#######5  | 470/625 [01:42<00:33,  4.62it/s]
reward: -2.1227, last reward: -0.0586, gradient norm:  3.162e+03:  75%|#######5  | 470/625 [01:42<00:33,  4.62it/s]
reward: -2.1227, last reward: -0.0586, gradient norm:  3.162e+03:  75%|#######5  | 471/625 [01:42<00:33,  4.61it/s]
reward: -1.9690, last reward: -0.0074, gradient norm:  0.4166:  75%|#######5  | 471/625 [01:42<00:33,  4.61it/s]
reward: -1.9690, last reward: -0.0074, gradient norm:  0.4166:  76%|#######5  | 472/625 [01:42<00:33,  4.61it/s]
reward: -2.6324, last reward: -0.0119, gradient norm:  1.345:  76%|#######5  | 472/625 [01:42<00:33,  4.61it/s]
reward: -2.6324, last reward: -0.0119, gradient norm:  1.345:  76%|#######5  | 473/625 [01:42<00:32,  4.61it/s]
reward: -2.0778, last reward: -0.0098, gradient norm:  1.166:  76%|#######5  | 473/625 [01:43<00:32,  4.61it/s]
reward: -2.0778, last reward: -0.0098, gradient norm:  1.166:  76%|#######5  | 474/625 [01:43<00:32,  4.62it/s]
reward: -1.8548, last reward: -0.0017, gradient norm:  0.4408:  76%|#######5  | 474/625 [01:43<00:32,  4.62it/s]
reward: -1.8548, last reward: -0.0017, gradient norm:  0.4408:  76%|#######6  | 475/625 [01:43<00:32,  4.62it/s]
reward: -1.8125, last reward: -0.0003, gradient norm:  0.1515:  76%|#######6  | 475/625 [01:43<00:32,  4.62it/s]
reward: -1.8125, last reward: -0.0003, gradient norm:  0.1515:  76%|#######6  | 476/625 [01:43<00:32,  4.62it/s]
reward: -2.2733, last reward: -0.0044, gradient norm:  0.2836:  76%|#######6  | 476/625 [01:43<00:32,  4.62it/s]
reward: -2.2733, last reward: -0.0044, gradient norm:  0.2836:  76%|#######6  | 477/625 [01:43<00:32,  4.61it/s]
reward: -1.7497, last reward: -0.0149, gradient norm:  0.7681:  76%|#######6  | 477/625 [01:43<00:32,  4.61it/s]
reward: -1.7497, last reward: -0.0149, gradient norm:  0.7681:  76%|#######6  | 478/625 [01:43<00:31,  4.61it/s]
reward: -1.8547, last reward: -0.0105, gradient norm:  0.7212:  76%|#######6  | 478/625 [01:44<00:31,  4.61it/s]
reward: -1.8547, last reward: -0.0105, gradient norm:  0.7212:  77%|#######6  | 479/625 [01:44<00:31,  4.61it/s]
reward: -1.9848, last reward: -0.0019, gradient norm:  0.6498:  77%|#######6  | 479/625 [01:44<00:31,  4.61it/s]
reward: -1.9848, last reward: -0.0019, gradient norm:  0.6498:  77%|#######6  | 480/625 [01:44<00:31,  4.61it/s]
reward: -2.1987, last reward: -0.0011, gradient norm:  0.5473:  77%|#######6  | 480/625 [01:44<00:31,  4.61it/s]
reward: -2.1987, last reward: -0.0011, gradient norm:  0.5473:  77%|#######6  | 481/625 [01:44<00:31,  4.61it/s]
reward: -1.8991, last reward: -0.0033, gradient norm:  0.6091:  77%|#######6  | 481/625 [01:44<00:31,  4.61it/s]
reward: -1.8991, last reward: -0.0033, gradient norm:  0.6091:  77%|#######7  | 482/625 [01:44<00:30,  4.61it/s]
reward: -1.9189, last reward: -0.0032, gradient norm:  0.5771:  77%|#######7  | 482/625 [01:45<00:30,  4.61it/s]
reward: -1.9189, last reward: -0.0032, gradient norm:  0.5771:  77%|#######7  | 483/625 [01:45<00:30,  4.61it/s]
reward: -1.6781, last reward: -0.0004, gradient norm:  0.7542:  77%|#######7  | 483/625 [01:45<00:30,  4.61it/s]
reward: -1.6781, last reward: -0.0004, gradient norm:  0.7542:  77%|#######7  | 484/625 [01:45<00:30,  4.61it/s]
reward: -1.5959, last reward: -0.0064, gradient norm:  0.4295:  77%|#######7  | 484/625 [01:45<00:30,  4.61it/s]
reward: -1.5959, last reward: -0.0064, gradient norm:  0.4295:  78%|#######7  | 485/625 [01:45<00:30,  4.62it/s]
reward: -2.2547, last reward: -0.0103, gradient norm:  0.4641:  78%|#######7  | 485/625 [01:45<00:30,  4.62it/s]
reward: -2.2547, last reward: -0.0103, gradient norm:  0.4641:  78%|#######7  | 486/625 [01:45<00:30,  4.62it/s]
reward: -2.1509, last reward: -0.0636, gradient norm:  6.547:  78%|#######7  | 486/625 [01:45<00:30,  4.62it/s]
reward: -2.1509, last reward: -0.0636, gradient norm:  6.547:  78%|#######7  | 487/625 [01:45<00:29,  4.63it/s]
reward: -2.0972, last reward: -0.0065, gradient norm:  0.2593:  78%|#######7  | 487/625 [01:46<00:29,  4.63it/s]
reward: -2.0972, last reward: -0.0065, gradient norm:  0.2593:  78%|#######8  | 488/625 [01:46<00:29,  4.63it/s]
reward: -2.1694, last reward: -0.0083, gradient norm:  0.5759:  78%|#######8  | 488/625 [01:46<00:29,  4.63it/s]
reward: -2.1694, last reward: -0.0083, gradient norm:  0.5759:  78%|#######8  | 489/625 [01:46<00:29,  4.63it/s]
reward: -2.0493, last reward: -0.0021, gradient norm:  0.7805:  78%|#######8  | 489/625 [01:46<00:29,  4.63it/s]
reward: -2.0493, last reward: -0.0021, gradient norm:  0.7805:  78%|#######8  | 490/625 [01:46<00:29,  4.63it/s]
reward: -2.0950, last reward: -0.0021, gradient norm:  0.497:  78%|#######8  | 490/625 [01:46<00:29,  4.63it/s]
reward: -2.0950, last reward: -0.0021, gradient norm:  0.497:  79%|#######8  | 491/625 [01:46<00:28,  4.63it/s]
reward: -1.9717, last reward: -0.0012, gradient norm:  0.3672:  79%|#######8  | 491/625 [01:46<00:28,  4.63it/s]
reward: -1.9717, last reward: -0.0012, gradient norm:  0.3672:  79%|#######8  | 492/625 [01:46<00:28,  4.63it/s]
reward: -2.0207, last reward: -0.0009, gradient norm:  0.331:  79%|#######8  | 492/625 [01:47<00:28,  4.63it/s]
reward: -2.0207, last reward: -0.0009, gradient norm:  0.331:  79%|#######8  | 493/625 [01:47<00:28,  4.62it/s]
reward: -1.8266, last reward: -0.0069, gradient norm:  0.5365:  79%|#######8  | 493/625 [01:47<00:28,  4.62it/s]
reward: -1.8266, last reward: -0.0069, gradient norm:  0.5365:  79%|#######9  | 494/625 [01:47<00:28,  4.62it/s]
reward: -2.2623, last reward: -0.0065, gradient norm:  0.5078:  79%|#######9  | 494/625 [01:47<00:28,  4.62it/s]
reward: -2.2623, last reward: -0.0065, gradient norm:  0.5078:  79%|#######9  | 495/625 [01:47<00:28,  4.62it/s]
reward: -2.0230, last reward: -0.0027, gradient norm:  0.4545:  79%|#######9  | 495/625 [01:47<00:28,  4.62it/s]
reward: -2.0230, last reward: -0.0027, gradient norm:  0.4545:  79%|#######9  | 496/625 [01:47<00:27,  4.62it/s]
reward: -1.6047, last reward: -0.0000, gradient norm:  0.09636:  79%|#######9  | 496/625 [01:48<00:27,  4.62it/s]
reward: -1.6047, last reward: -0.0000, gradient norm:  0.09636:  80%|#######9  | 497/625 [01:48<00:27,  4.62it/s]
reward: -1.8754, last reward: -0.0010, gradient norm:  0.2:  80%|#######9  | 497/625 [01:48<00:27,  4.62it/s]
reward: -1.8754, last reward: -0.0010, gradient norm:  0.2:  80%|#######9  | 498/625 [01:48<00:27,  4.62it/s]
reward: -2.6216, last reward: -0.0031, gradient norm:  0.8269:  80%|#######9  | 498/625 [01:48<00:27,  4.62it/s]
reward: -2.6216, last reward: -0.0031, gradient norm:  0.8269:  80%|#######9  | 499/625 [01:48<00:27,  4.62it/s]
reward: -1.7361, last reward: -0.0023, gradient norm:  0.4082:  80%|#######9  | 499/625 [01:48<00:27,  4.62it/s]
reward: -1.7361, last reward: -0.0023, gradient norm:  0.4082:  80%|########  | 500/625 [01:48<00:27,  4.62it/s]
reward: -1.6642, last reward: -0.0006, gradient norm:  0.2284:  80%|########  | 500/625 [01:48<00:27,  4.62it/s]
reward: -1.6642, last reward: -0.0006, gradient norm:  0.2284:  80%|########  | 501/625 [01:48<00:26,  4.62it/s]
reward: -1.9130, last reward: -0.0008, gradient norm:  0.3031:  80%|########  | 501/625 [01:49<00:26,  4.62it/s]
reward: -1.9130, last reward: -0.0008, gradient norm:  0.3031:  80%|########  | 502/625 [01:49<00:26,  4.62it/s]
reward: -2.2944, last reward: -0.0035, gradient norm:  0.2986:  80%|########  | 502/625 [01:49<00:26,  4.62it/s]
reward: -2.2944, last reward: -0.0035, gradient norm:  0.2986:  80%|########  | 503/625 [01:49<00:26,  4.62it/s]
reward: -1.7624, last reward: -0.0056, gradient norm:  0.3858:  80%|########  | 503/625 [01:49<00:26,  4.62it/s]
reward: -1.7624, last reward: -0.0056, gradient norm:  0.3858:  81%|########  | 504/625 [01:49<00:26,  4.62it/s]
reward: -2.0890, last reward: -0.0042, gradient norm:  0.38:  81%|########  | 504/625 [01:49<00:26,  4.62it/s]
reward: -2.0890, last reward: -0.0042, gradient norm:  0.38:  81%|########  | 505/625 [01:49<00:25,  4.62it/s]
reward: -1.7505, last reward: -0.0017, gradient norm:  0.2157:  81%|########  | 505/625 [01:50<00:25,  4.62it/s]
reward: -1.7505, last reward: -0.0017, gradient norm:  0.2157:  81%|########  | 506/625 [01:50<00:25,  4.62it/s]
reward: -1.8394, last reward: -0.0013, gradient norm:  0.3413:  81%|########  | 506/625 [01:50<00:25,  4.62it/s]
reward: -1.8394, last reward: -0.0013, gradient norm:  0.3413:  81%|########1 | 507/625 [01:50<00:25,  4.61it/s]
reward: -1.9609, last reward: -0.0041, gradient norm:  0.6905:  81%|########1 | 507/625 [01:50<00:25,  4.61it/s]
reward: -1.9609, last reward: -0.0041, gradient norm:  0.6905:  81%|########1 | 508/625 [01:50<00:25,  4.59it/s]
reward: -1.8467, last reward: -0.0011, gradient norm:  0.4409:  81%|########1 | 508/625 [01:50<00:25,  4.59it/s]
reward: -1.8467, last reward: -0.0011, gradient norm:  0.4409:  81%|########1 | 509/625 [01:50<00:25,  4.59it/s]
reward: -2.0252, last reward: -0.0021, gradient norm:  0.213:  81%|########1 | 509/625 [01:50<00:25,  4.59it/s]
reward: -2.0252, last reward: -0.0021, gradient norm:  0.213:  82%|########1 | 510/625 [01:50<00:24,  4.60it/s]
reward: -1.8128, last reward: -0.0073, gradient norm:  0.3559:  82%|########1 | 510/625 [01:51<00:24,  4.60it/s]
reward: -1.8128, last reward: -0.0073, gradient norm:  0.3559:  82%|########1 | 511/625 [01:51<00:24,  4.61it/s]
reward: -2.1479, last reward: -0.0264, gradient norm:  3.68:  82%|########1 | 511/625 [01:51<00:24,  4.61it/s]
reward: -2.1479, last reward: -0.0264, gradient norm:  3.68:  82%|########1 | 512/625 [01:51<00:24,  4.61it/s]
reward: -2.1589, last reward: -0.0025, gradient norm:  5.566:  82%|########1 | 512/625 [01:51<00:24,  4.61it/s]
reward: -2.1589, last reward: -0.0025, gradient norm:  5.566:  82%|########2 | 513/625 [01:51<00:24,  4.61it/s]
reward: -2.2756, last reward: -0.0046, gradient norm:  0.5266:  82%|########2 | 513/625 [01:51<00:24,  4.61it/s]
reward: -2.2756, last reward: -0.0046, gradient norm:  0.5266:  82%|########2 | 514/625 [01:51<00:24,  4.62it/s]
reward: -1.9873, last reward: -0.0112, gradient norm:  0.9314:  82%|########2 | 514/625 [01:51<00:24,  4.62it/s]
reward: -1.9873, last reward: -0.0112, gradient norm:  0.9314:  82%|########2 | 515/625 [01:51<00:23,  4.62it/s]
reward: -2.3791, last reward: -0.0721, gradient norm:  1.14:  82%|########2 | 515/625 [01:52<00:23,  4.62it/s]
reward: -2.3791, last reward: -0.0721, gradient norm:  1.14:  83%|########2 | 516/625 [01:52<00:23,  4.62it/s]
reward: -2.4580, last reward: -0.0758, gradient norm:  0.6114:  83%|########2 | 516/625 [01:52<00:23,  4.62it/s]
reward: -2.4580, last reward: -0.0758, gradient norm:  0.6114:  83%|########2 | 517/625 [01:52<00:23,  4.62it/s]
reward: -1.9748, last reward: -0.0001, gradient norm:  0.2431:  83%|########2 | 517/625 [01:52<00:23,  4.62it/s]
reward: -1.9748, last reward: -0.0001, gradient norm:  0.2431:  83%|########2 | 518/625 [01:52<00:23,  4.62it/s]
reward: -2.1958, last reward: -0.0044, gradient norm:  0.5553:  83%|########2 | 518/625 [01:52<00:23,  4.62it/s]
reward: -2.1958, last reward: -0.0044, gradient norm:  0.5553:  83%|########3 | 519/625 [01:52<00:22,  4.62it/s]
reward: -1.8924, last reward: -0.0097, gradient norm:  17.34:  83%|########3 | 519/625 [01:53<00:22,  4.62it/s]
reward: -1.8924, last reward: -0.0097, gradient norm:  17.34:  83%|########3 | 520/625 [01:53<00:22,  4.62it/s]
reward: -2.3737, last reward: -0.0234, gradient norm:  1.899:  83%|########3 | 520/625 [01:53<00:22,  4.62it/s]
reward: -2.3737, last reward: -0.0234, gradient norm:  1.899:  83%|########3 | 521/625 [01:53<00:22,  4.62it/s]
reward: -1.9125, last reward: -0.0063, gradient norm:  0.4623:  83%|########3 | 521/625 [01:53<00:22,  4.62it/s]
reward: -1.9125, last reward: -0.0063, gradient norm:  0.4623:  84%|########3 | 522/625 [01:53<00:22,  4.62it/s]
reward: -2.3230, last reward: -0.0589, gradient norm:  0.3784:  84%|########3 | 522/625 [01:53<00:22,  4.62it/s]
reward: -2.3230, last reward: -0.0589, gradient norm:  0.3784:  84%|########3 | 523/625 [01:53<00:22,  4.62it/s]
reward: -1.9482, last reward: -0.0051, gradient norm:  1.105:  84%|########3 | 523/625 [01:53<00:22,  4.62it/s]
reward: -1.9482, last reward: -0.0051, gradient norm:  1.105:  84%|########3 | 524/625 [01:53<00:21,  4.62it/s]
reward: -2.1979, last reward: -0.0045, gradient norm:  0.6401:  84%|########3 | 524/625 [01:54<00:21,  4.62it/s]
reward: -2.1979, last reward: -0.0045, gradient norm:  0.6401:  84%|########4 | 525/625 [01:54<00:21,  4.62it/s]
reward: -2.1588, last reward: -0.0048, gradient norm:  0.6255:  84%|########4 | 525/625 [01:54<00:21,  4.62it/s]
reward: -2.1588, last reward: -0.0048, gradient norm:  0.6255:  84%|########4 | 526/625 [01:54<00:21,  4.62it/s]
reward: -1.6084, last reward: -0.0010, gradient norm:  0.3477:  84%|########4 | 526/625 [01:54<00:21,  4.62it/s]
reward: -1.6084, last reward: -0.0010, gradient norm:  0.3477:  84%|########4 | 527/625 [01:54<00:21,  4.62it/s]
reward: -2.1475, last reward: -0.0209, gradient norm:  0.3456:  84%|########4 | 527/625 [01:54<00:21,  4.62it/s]
reward: -2.1475, last reward: -0.0209, gradient norm:  0.3456:  84%|########4 | 528/625 [01:54<00:20,  4.62it/s]
reward: -1.7611, last reward: -0.1040, gradient norm:  18.52:  84%|########4 | 528/625 [01:54<00:20,  4.62it/s]
reward: -1.7611, last reward: -0.1040, gradient norm:  18.52:  85%|########4 | 529/625 [01:54<00:20,  4.62it/s]
reward: -2.0099, last reward: -0.0173, gradient norm:  1.643:  85%|########4 | 529/625 [01:55<00:20,  4.62it/s]
reward: -2.0099, last reward: -0.0173, gradient norm:  1.643:  85%|########4 | 530/625 [01:55<00:20,  4.62it/s]
reward: -2.8189, last reward: -1.4358, gradient norm:  46.61:  85%|########4 | 530/625 [01:55<00:20,  4.62it/s]
reward: -2.8189, last reward: -1.4358, gradient norm:  46.61:  85%|########4 | 531/625 [01:55<00:20,  4.62it/s]
reward: -2.9897, last reward: -2.4869, gradient norm:  51.23:  85%|########4 | 531/625 [01:55<00:20,  4.62it/s]
reward: -2.9897, last reward: -2.4869, gradient norm:  51.23:  85%|########5 | 532/625 [01:55<00:20,  4.62it/s]
reward: -2.1548, last reward: -0.9751, gradient norm:  72.21:  85%|########5 | 532/625 [01:55<00:20,  4.62it/s]
reward: -2.1548, last reward: -0.9751, gradient norm:  72.21:  85%|########5 | 533/625 [01:55<00:19,  4.63it/s]
reward: -1.6362, last reward: -0.0022, gradient norm:  0.7495:  85%|########5 | 533/625 [01:56<00:19,  4.63it/s]
reward: -1.6362, last reward: -0.0022, gradient norm:  0.7495:  85%|########5 | 534/625 [01:56<00:19,  4.63it/s]
reward: -2.1749, last reward: -0.0105, gradient norm:  0.9513:  85%|########5 | 534/625 [01:56<00:19,  4.63it/s]
reward: -2.1749, last reward: -0.0105, gradient norm:  0.9513:  86%|########5 | 535/625 [01:56<00:19,  4.62it/s]
reward: -1.7708, last reward: -0.0371, gradient norm:  1.432:  86%|########5 | 535/625 [01:56<00:19,  4.62it/s]
reward: -1.7708, last reward: -0.0371, gradient norm:  1.432:  86%|########5 | 536/625 [01:56<00:19,  4.62it/s]
reward: -2.2649, last reward: -0.0437, gradient norm:  2.327:  86%|########5 | 536/625 [01:56<00:19,  4.62it/s]
reward: -2.2649, last reward: -0.0437, gradient norm:  2.327:  86%|########5 | 537/625 [01:56<00:19,  4.62it/s]
reward: -2.5491, last reward: -0.0276, gradient norm:  1.246:  86%|########5 | 537/625 [01:56<00:19,  4.62it/s]
reward: -2.5491, last reward: -0.0276, gradient norm:  1.246:  86%|########6 | 538/625 [01:56<00:18,  4.62it/s]
reward: -2.6426, last reward: -0.7294, gradient norm:  1.078e+03:  86%|########6 | 538/625 [01:57<00:18,  4.62it/s]
reward: -2.6426, last reward: -0.7294, gradient norm:  1.078e+03:  86%|########6 | 539/625 [01:57<00:18,  4.62it/s]
reward: -1.9928, last reward: -0.0003, gradient norm:  1.576:  86%|########6 | 539/625 [01:57<00:18,  4.62it/s]
reward: -1.9928, last reward: -0.0003, gradient norm:  1.576:  86%|########6 | 540/625 [01:57<00:18,  4.62it/s]
reward: -1.7937, last reward: -0.0124, gradient norm:  0.9664:  86%|########6 | 540/625 [01:57<00:18,  4.62it/s]
reward: -1.7937, last reward: -0.0124, gradient norm:  0.9664:  87%|########6 | 541/625 [01:57<00:18,  4.62it/s]
reward: -2.3342, last reward: -0.0204, gradient norm:  1.81:  87%|########6 | 541/625 [01:57<00:18,  4.62it/s]
reward: -2.3342, last reward: -0.0204, gradient norm:  1.81:  87%|########6 | 542/625 [01:57<00:17,  4.62it/s]
reward: -2.2046, last reward: -0.0122, gradient norm:  1.004:  87%|########6 | 542/625 [01:58<00:17,  4.62it/s]
reward: -2.2046, last reward: -0.0122, gradient norm:  1.004:  87%|########6 | 543/625 [01:58<00:17,  4.62it/s]
reward: -2.0000, last reward: -0.0014, gradient norm:  0.5496:  87%|########6 | 543/625 [01:58<00:17,  4.62it/s]
reward: -2.0000, last reward: -0.0014, gradient norm:  0.5496:  87%|########7 | 544/625 [01:58<00:17,  4.62it/s]
reward: -2.0956, last reward: -0.0059, gradient norm:  1.425:  87%|########7 | 544/625 [01:58<00:17,  4.62it/s]
reward: -2.0956, last reward: -0.0059, gradient norm:  1.425:  87%|########7 | 545/625 [01:58<00:17,  4.62it/s]
reward: -2.9028, last reward: -0.5843, gradient norm:  21.12:  87%|########7 | 545/625 [01:58<00:17,  4.62it/s]
reward: -2.9028, last reward: -0.5843, gradient norm:  21.12:  87%|########7 | 546/625 [01:58<00:17,  4.62it/s]
reward: -2.0674, last reward: -0.0178, gradient norm:  0.797:  87%|########7 | 546/625 [01:58<00:17,  4.62it/s]
reward: -2.0674, last reward: -0.0178, gradient norm:  0.797:  88%|########7 | 547/625 [01:58<00:16,  4.62it/s]
reward: -2.2815, last reward: -0.0599, gradient norm:  1.227:  88%|########7 | 547/625 [01:59<00:16,  4.62it/s]
reward: -2.2815, last reward: -0.0599, gradient norm:  1.227:  88%|########7 | 548/625 [01:59<00:16,  4.62it/s]
reward: -3.1587, last reward: -0.9276, gradient norm:  20.56:  88%|########7 | 548/625 [01:59<00:16,  4.62it/s]
reward: -3.1587, last reward: -0.9276, gradient norm:  20.56:  88%|########7 | 549/625 [01:59<00:16,  4.62it/s]
reward: -3.8228, last reward: -2.9229, gradient norm:  308.2:  88%|########7 | 549/625 [01:59<00:16,  4.62it/s]
reward: -3.8228, last reward: -2.9229, gradient norm:  308.2:  88%|########8 | 550/625 [01:59<00:16,  4.62it/s]
reward: -1.6164, last reward: -0.0120, gradient norm:  2.259:  88%|########8 | 550/625 [01:59<00:16,  4.62it/s]
reward: -1.6164, last reward: -0.0120, gradient norm:  2.259:  88%|########8 | 551/625 [01:59<00:15,  4.63it/s]
reward: -1.6850, last reward: -0.0227, gradient norm:  0.9167:  88%|########8 | 551/625 [01:59<00:15,  4.63it/s]
reward: -2.1253, last reward: -0.0001, gradient norm:  0.6622: 100%|##########| 625/625 [02:15<00:00,  4.60it/s]

Summary

In this tutorial, we learned how to code a stateless environment from scratch. We touched upon the following topics (a minimal, illustrative sketch of the overall pattern follows the list):

  • the four essential components that need to be taken care of when coding an environment (step, reset, seeding, and building the specs), and how these methods and classes interact with the TensorDict class;

  • how to test that an environment is correctly coded using check_env_specs();

  • how to append transforms to a stateless environment and how to write custom transformations;

  • how to train a policy on a fully differentiable simulator.

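To tie these points together, the snippet below sketches the overall pattern in one place: a toy EnvBase subclass with its specs, _reset, _step and _set_seed methods, validated with check_env_specs(). This is a minimal sketch rather than the pendulum environment built in this tutorial; the class name ToyEnv, the single "obs" entry, the trivial dynamics and the quadratic cost are assumptions made for brevity, and the spec constructors follow the ones imported at the top of this tutorial.

import torch
from tensordict import TensorDict
from torchrl.data import BoundedTensorSpec, CompositeSpec, UnboundedContinuousTensorSpec
from torchrl.envs import EnvBase
from torchrl.envs.utils import check_env_specs


class ToyEnv(EnvBase):
    """A deliberately tiny environment illustrating the four components."""

    def __init__(self, seed=None, device="cpu"):
        super().__init__(device=device, batch_size=[])
        # 1. Specs: declare the shapes and bounds of inputs and outputs.
        self.observation_spec = CompositeSpec(
            obs=UnboundedContinuousTensorSpec(shape=(1,)), shape=()
        )
        self.state_spec = self.observation_spec.clone()
        self.action_spec = BoundedTensorSpec(low=-1.0, high=1.0, shape=(1,))
        self.reward_spec = UnboundedContinuousTensorSpec(shape=(1,))
        if seed is None:
            seed = torch.empty((), dtype=torch.int64).random_().item()
        self.set_seed(seed)

    # 2. Reset: produce a (possibly random) initial observation.
    def _reset(self, tensordict):
        obs = torch.zeros(1, device=self.device)
        return TensorDict({"obs": obs}, batch_size=[])

    # 3. Step: read the action, apply the dynamics, write obs/reward/done.
    def _step(self, tensordict):
        action = tensordict["action"]
        obs = tensordict["obs"] + action              # toy dynamics
        reward = -obs.pow(2).sum(-1, keepdim=True)    # quadratic cost
        done = torch.zeros_like(reward, dtype=torch.bool)
        return TensorDict(
            {"obs": obs, "reward": reward, "done": done}, batch_size=[]
        )

    # 4. Seeding: make the environment reproducible.
    def _set_seed(self, seed):
        self.rng = torch.manual_seed(seed)


env = ToyEnv()
check_env_specs(env)  # raises if the specs and the _reset/_step outputs disagree

Because _step is written with differentiable PyTorch operations, gradients can flow from the reward back through the dynamics to the action, which is what made the direct, backpropagation-through-simulation training loop above possible.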