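Pendulum: Writing your environment and transforms with TorchRL
Author: Vincent Moens
Creating an environment (a simulator or an interface to a physical control system) is an integral part of reinforcement learning and control engineering.
TorchRL provides a set of tools to do this in multiple contexts. This tutorial demonstrates how to use PyTorch and TorchRL to code a pendulum simulator from the ground up. It is inspired by the Pendulum-v1 implementation from the OpenAI-Gym/Farama-Gymnasium control library.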
[Figure: simple pendulum]
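Key learnings:
- How to design an environment in TorchRL:
  - Writing specs (input, observation and reward);
  - Implementing behavior: initialization, reset and step.
- Transforming your environment inputs and outputs, and writing your own transforms;
- How to use TensorDict to carry arbitrary data structures through the codebase.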
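In the process, we will touch three crucial components of TorchRL: environments, transforms, and models (policy and value function).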
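To give a sense of what can be achieved with TorchRL's environments, we will design a stateless environment. While stateful environments keep track of the latest physical state encountered and rely on it to simulate the state-to-state transition, stateless environments expect the current state to be provided to them at each step, along with the action undertaken. TorchRL supports both types of environments, but stateless environments are more generic and hence cover a broader range of features of the environment API in TorchRL.
Modeling stateless environments gives users full control over the inputs and outputs of the simulator: one can reset an experiment at any stage or actively modify the dynamics from the outside. However, it assumes that we have some control over the task, which may not always be the case: solving a problem where we cannot control the current state is more challenging but has a much wider set of applications.
Another advantage of stateless environments is that they can enable batched execution of transition simulations. If the backend and the implementation allow it, an algebraic operation can be executed seamlessly on scalars, vectors, or tensors. This tutorial gives such examples.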
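To make the stateful/stateless distinction concrete, here is a minimal sketch in plain Python. The names `stateless_step` and `StatefulCounter` and the toy "add the action to the position" dynamics are hypothetical and unrelated to TorchRL's API; they only illustrate who owns the state.

```python
def stateless_step(state: dict, action: float) -> dict:
    """Return the next state from an explicit current state and action."""
    return {"position": state["position"] + action}


class StatefulCounter:
    """Keeps the latest state internally; callers only supply the action."""

    def __init__(self, position: float = 0.0):
        self.position = position

    def step(self, action: float) -> float:
        self.position += action
        return self.position


# Stateless: the caller owns the state and may rewind or edit it at will.
s0 = {"position": 0.0}
s1 = stateless_step(s0, 1.0)
s2 = stateless_step(s1, 2.0)
replayed = stateless_step(s0, 5.0)  # branch again from the initial state

# Stateful: the object remembers where it is.
env = StatefulCounter()
env.step(1.0)
final_position = env.step(2.0)
```

In the stateless version, restarting from `s0` needs no special reset logic, which is the kind of outside control described above.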
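This tutorial will be structured as follows:
- We will first get acquainted with the environment properties: its shape (batch_size), its methods (mainly step(), reset() and set_seed()) and finally its specs.
- After having coded our simulator, we will demonstrate how it can be used during training with transforms.
- We will explore new avenues that follow from TorchRL's API, including: the possibility of transforming inputs, the vectorized execution of the simulation and the possibility of backpropagating through the simulation graph.
- Finally, we will train a simple policy to solve the system we implemented.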
from collections import defaultdict
from typing import Optional

import numpy as np
import torch
import tqdm
from tensordict import TensorDict, TensorDictBase
from tensordict.nn import TensorDictModule
from torch import nn
from torchrl.data import BoundedTensorSpec, CompositeSpec, UnboundedContinuousTensorSpec
from torchrl.envs import (
    CatTensors,
    EnvBase,
    Transform,
    TransformedEnv,
    UnsqueezeTransform,
)
from torchrl.envs.transforms.transforms import _apply_to_composite
from torchrl.envs.utils import check_env_specs, step_mdp

DEFAULT_X = np.pi
DEFAULT_Y = 1.0
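There are four things you must take care of when designing a new environment class:
- EnvBase._reset(), which codes for the resetting of the simulator at a (potentially random) initial state;
- EnvBase._step(), which codes for the state transition dynamics;
- EnvBase._set_seed(), which implements the seeding mechanism;
- the environment specs.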
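Let us first describe the problem at hand: we would like to model a simple pendulum over which we can control the torque applied on its fixed point. Our goal is to place the pendulum in the upward position (angular position at 0 by convention) and have it stand still in that position. To design our dynamic system, we need to define two equations: the motion equation following an action (the torque applied) and the reward equation that will constitute our objective function.
For the motion equation, we will update the angular velocity following:
\[\dot{\theta}_{t+1} = \dot{\theta}_t + (3 * g / (2 * L) * \sin(\theta_t) + 3 / (m * L^2) * u) * dt\]
where \(\dot{\theta}\) is the angular velocity in rad/sec, \(g\) is the gravitational force, \(L\) is the pendulum length, \(m\) is its mass, \(\theta\) is its angular position and \(u\) is the torque. The angular position is then updated according to
\[\theta_{t+1} = \theta_{t} + \dot{\theta}_{t+1} dt\]
We define our reward as
\[r = -(\theta^2 + 0.1 * \dot{\theta}^2 + 0.001 * u^2)\]
which will be maximized when the angle is close to 0 (pendulum in upward position), the angular velocity is close to 0 (no motion) and the torque is 0 too.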
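As a quick numerical check of the equations above, the following plain-Python sketch applies one update and evaluates the reward. The function names are hypothetical, and the parameter defaults (g = 10, m = 1, L = 1, dt = 0.05) are assumed here only so that the numbers can be worked out by hand; clamping of torque and speed is left out.

```python
import math


def pendulum_update(theta, theta_dot, u, g=10.0, m=1.0, L=1.0, dt=0.05):
    """One step of the two update equations (no clamping)."""
    new_theta_dot = theta_dot + (
        3 * g / (2 * L) * math.sin(theta) + 3 / (m * L**2) * u
    ) * dt
    new_theta = theta + new_theta_dot * dt
    return new_theta, new_theta_dot


def reward(theta, theta_dot, u):
    return -(theta**2 + 0.1 * theta_dot**2 + 0.001 * u**2)


# Upright and at rest, with a torque of 2 applied for one step.
new_theta, new_theta_dot = pendulum_update(theta=0.0, theta_dot=0.0, u=2.0)

best = reward(0.0, 0.0, 0.0)         # upright, still, no torque
hanging = reward(math.pi, 0.0, 0.0)  # pointing down, still, no torque
```

Starting upright and at rest, the gravity term vanishes because sin(0) = 0, so the velocity change comes from the torque alone: 3 * 2 * 0.05 = 0.3 rad/sec, and the angle moves by 0.3 * 0.05 = 0.015 rad.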
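Coding the effect of an action: _step()
The step method is the first thing to consider, as it will encode the simulation that is of interest to us. In TorchRL, the EnvBase class has an EnvBase.step() method that receives a tensordict.TensorDict instance with an "action" entry indicating what action is to be taken.
To facilitate the reading and writing from that tensordict and to make sure that the keys are consistent with what is expected from the library, the simulation part has been delegated to a private abstract method _step() which reads input data from a tensordict and writes a new tensordict with the output data.
The _step() method should do the following:
- Read the input keys (such as "action") and execute the simulation based on these;
- Retrieve observations, done state and reward;
- Write the set of observation values along with the reward and done state at the corresponding entries in a new TensorDict.
Next, the step() method will merge the output of _step() into the input tensordict to enforce input/output consistency.
Typically, for stateful environments, this will look like this: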
>>> tensordict = policy(env.reset())
>>> print(tensordict)
TensorDict(
fields={
action: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, is_shared=False),
done: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
observation: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False)},
batch_size=torch.Size([]),
device=cpu,
is_shared=False)
>>> env.step(tensordict)
>>> print(tensordict)
TensorDict(
fields={
action: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, is_shared=False),
done: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
next: TensorDict(
fields={
done: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
observation: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
reward: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, is_shared=False)},
batch_size=torch.Size([]),
device=cpu,
is_shared=False),
observation: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False)},
batch_size=torch.Size([]),
device=cpu,
is_shared=False)
Notice that the root tensordict has not changed: the only modification is the appearance of a new "next" entry that contains the new information.
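The division of labor between the private and public step methods can be imitated with plain dictionaries. This is only a rough sketch: `toy_private_step` and `toy_step` are hypothetical stand-ins, and the real EnvBase.step() performs additional checks that are not reproduced here.

```python
def toy_private_step(td: dict) -> dict:
    """Stand-in for ``_step``: reads inputs, returns only the new data."""
    new_obs = td["observation"] + td["action"]
    return {"observation": new_obs, "reward": -abs(new_obs), "done": False}


def toy_step(td: dict) -> dict:
    """Stand-in for ``step``: nests the result under "next" in the input."""
    td["next"] = toy_private_step(td)
    return td


td = {"observation": 1.0, "action": 0.5, "done": False}
result = toy_step(td)
```

The root entries ("observation", "action", "done") keep their values, and everything computed by the simulation is found under `result["next"]`.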
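In the Pendulum example, our _step() method will read the relevant entries from the input tensordict and compute the position and velocity of the pendulum after the force encoded by the "action" key has been applied to it. We compute the new angular position of the pendulum "new_th" as the previous position "th" plus the new velocity "new_thdot" multiplied by the time interval dt.
Since our goal is to turn the pendulum up and keep it still in that position, our cost (negative reward) function is lower for positions close to the target and for low speeds. Indeed, we want to discourage positions that are far from being "upward" and/or speeds that are far from 0.
In our example, EnvBase._step() is encoded as a static method since our environment is stateless. In stateful settings, the self argument is needed as the state needs to be read from the environment.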
def _step(tensordict):
    th, thdot = tensordict["th"], tensordict["thdot"]  # th := theta

    g_force = tensordict["params", "g"]
    mass = tensordict["params", "m"]
    length = tensordict["params", "l"]
    dt = tensordict["params", "dt"]
    u = tensordict["action"].squeeze(-1)
    # keep the torque within [-max_torque, max_torque]
    u = u.clamp(-tensordict["params", "max_torque"], tensordict["params", "max_torque"])
    costs = angle_normalize(th) ** 2 + 0.1 * thdot**2 + 0.001 * (u**2)

    new_thdot = (
        thdot
        + (3 * g_force / (2 * length) * th.sin() + 3.0 / (mass * length**2) * u) * dt
    )
    # keep the angular velocity within [-max_speed, max_speed]
    new_thdot = new_thdot.clamp(
        -tensordict["params", "max_speed"], tensordict["params", "max_speed"]
    )
    new_th = th + new_thdot * dt
    reward = -costs.view(*tensordict.shape, 1)
    done = torch.zeros_like(reward, dtype=torch.bool)
    out = TensorDict(
        {
            "th": new_th,
            "thdot": new_thdot,
            "params": tensordict["params"],
            "reward": reward,
            "done": done,
        },
        tensordict.shape,
    )
    return out


def angle_normalize(x):
    return ((x + torch.pi) % (2 * torch.pi)) - torch.pi
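The angle_normalize helper wraps any angle into the interval [-pi, pi), so that the cost depends on the pendulum's direction rather than on how many turns it has accumulated. The sketch below re-implements the same formula with the standard math module (an assumption made only to keep the example free of dependencies) and evaluates it on a few angles.

```python
import math


def angle_normalize(x: float) -> float:
    return ((x + math.pi) % (2 * math.pi)) - math.pi


three_quarter_turn = angle_normalize(3 * math.pi / 2)  # same direction as -pi/2
negative_wrap = angle_normalize(-3 * math.pi / 2)      # same direction as +pi/2
full_turn = angle_normalize(2 * math.pi)               # same direction as 0
```

Without this wrapping, a pendulum that has spun a full turn and come back upright would still be charged a cost of roughly (2 * pi)^2.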
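Resetting the simulator: _reset()
The second method we need to care about is the _reset() method. Like _step(), it should write the observation entries and possibly a done state in the tensordict it outputs (if the done state is omitted, it will be filled as False by the parent method reset()). In some contexts, the _reset method needs to receive a command from the function that called it (for example, in multi-agent settings we may want to indicate which agents need to be reset). This is why the _reset() method also expects a tensordict as input, although it may perfectly well be empty or None.
The parent EnvBase.reset() does some simple checks like EnvBase.step() does, such as making sure that a "done" state is returned in the output tensordict and that the shapes match what is expected from the specs.
For us, the only important thing to consider is whether EnvBase._reset() contains all the expected observations. Once more, since we are working with a stateless environment, we pass the configuration of the pendulum in a nested tensordict named "params".
In this example, we do not pass a done state as this is not mandatory for _reset() and our environment is non-terminating, so we always expect it to be False.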
def _reset(self, tensordict):
    if tensordict is None or tensordict.is_empty():
        # if no ``tensordict`` is passed, we generate a single set of hyperparameters
        # Otherwise, we assume that the input ``tensordict`` contains all the relevant
        # parameters to get started.
        tensordict = self.gen_params(batch_size=self.batch_size)

    high_th = torch.tensor(DEFAULT_X, device=self.device)
    high_thdot = torch.tensor(DEFAULT_Y, device=self.device)
    low_th = -high_th
    low_thdot = -high_thdot

    # for non batch-locked environments, the input ``tensordict`` shape dictates the number
    # of simulators run simultaneously. In other contexts, the initial
    # random state's shape will depend upon the environment batch-size instead.
    th = (
        torch.rand(tensordict.shape, generator=self.rng, device=self.device)
        * (high_th - low_th)
        + low_th
    )
    thdot = (
        torch.rand(tensordict.shape, generator=self.rng, device=self.device)
        * (high_thdot - low_thdot)
        + low_thdot
    )
    out = TensorDict(
        {
            "th": th,
            "thdot": thdot,
            "params": tensordict["params"],
        },
        batch_size=tensordict.shape,
    )
    return out
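Environment metadata: env.*_spec
The specs define the input and output domain of the environment. It is important that the specs accurately define the tensors that will be received at runtime, as they are often used to carry information about environments in multiprocessing and distributed settings. They can also be used to instantiate lazily defined neural networks and test scripts without actually querying the environment (which can be costly with real-world physical systems, for instance).
There are four specs that we must code in our environment:
- EnvBase.observation_spec: this will be a CompositeSpec instance where each key is an observation (a CompositeSpec can be viewed as a dictionary of specs);
- EnvBase.action_spec: it can be any type of spec, but it must correspond to the "action" entry in the input tensordict;
- EnvBase.reward_spec: provides information about the reward space;
- EnvBase.done_spec: provides information about the space of the done flag.
TorchRL specs are organized in two general containers: input_spec, which contains the specs of the information that the step function reads (divided between action_spec containing the action and state_spec containing all the rest), and output_spec, which encodes the specs that the step outputs (observation_spec, reward_spec and done_spec). In general, you should not interact directly with output_spec and input_spec but only with their content: observation_spec, reward_spec, done_spec, action_spec and state_spec. The reason is that the specs are organized in a non-trivial way within output_spec and input_spec, and neither of these should be directly modified.
In other words, the observation_spec and related properties are convenient shortcuts to the content of the output and input spec containers.
TorchRL offers multiple TensorSpec subclasses to encode the environment's input and output characteristics.
Specs shape
The leading dimensions of the environment specs must match the environment batch size. This is done to enforce that every component of an environment (including its transforms) has an accurate representation of the expected input and output shapes. This is something that should be accurately coded in stateful settings.
For non batch-locked environments, such as the one in our example (see below), this is irrelevant as the environment batch size will most likely be empty.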
def _make_spec(self, td_params):
    # Under the hood, this will populate self.output_spec["observation"]
    self.observation_spec = CompositeSpec(
        th=BoundedTensorSpec(
            low=-torch.pi,
            high=torch.pi,
            shape=(),
            dtype=torch.float32,
        ),
        thdot=BoundedTensorSpec(
            low=-td_params["params", "max_speed"],
            high=td_params["params", "max_speed"],
            shape=(),
            dtype=torch.float32,
        ),
        # we need to add the ``params`` to the observation specs, as we want
        # to pass it at each step during a rollout
        params=make_composite_from_td(td_params["params"]),
        shape=(),
    )
    # since the environment is stateless, we expect the previous output as input.
    # For this, ``EnvBase`` expects some state_spec to be available
    self.state_spec = self.observation_spec.clone()
    # action-spec will be automatically wrapped in input_spec when
    # ``self.action_spec = spec`` is set
    self.action_spec = BoundedTensorSpec(
        low=-td_params["params", "max_torque"],
        high=td_params["params", "max_torque"],
        shape=(1,),
        dtype=torch.float32,
    )
    self.reward_spec = UnboundedContinuousTensorSpec(shape=(*td_params.shape, 1))


def make_composite_from_td(td):
    # custom function to convert a ``tensordict`` into a similar spec structure
    # of unbounded values.
    composite = CompositeSpec(
        {
            key: make_composite_from_td(tensor)
            if isinstance(tensor, TensorDictBase)
            else UnboundedContinuousTensorSpec(
                dtype=tensor.dtype, device=tensor.device, shape=tensor.shape
            )
            for key, tensor in td.items()
        },
        shape=td.shape,
    )
    return composite
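Reproducible experiments: seeding
Seeding an environment is a common operation when initializing an experiment. The only goal of EnvBase._set_seed() is to set the seed of the contained simulator. If possible, this operation should not call reset() or interact with the environment execution. The parent EnvBase.set_seed() method incorporates a mechanism that allows seeding multiple environments with different pseudo-random and reproducible seeds.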
def _set_seed(self, seed: Optional[int]):
    rng = torch.manual_seed(seed)
    self.rng = rng
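The effect of seeding on the random initial state can be checked in isolation. The sketch below is an illustration rather than part of the environment: `sample_initial_angles` is a hypothetical helper, and it uses a dedicated torch.Generator instead of the global generator returned by torch.manual_seed. It draws angles with the same `rand * (high - low) + low` pattern used in _reset().

```python
import torch


def sample_initial_angles(seed: int, n: int = 5) -> torch.Tensor:
    """Draw ``n`` angles uniformly between -pi and pi with a seeded generator."""
    rng = torch.Generator().manual_seed(seed)  # dedicated generator (assumption of this sketch)
    low, high = -torch.pi, torch.pi
    return torch.rand(n, generator=rng) * (high - low) + low


first = sample_initial_angles(seed=0)
second = sample_initial_angles(seed=0)  # same seed: identical draws
other = sample_initial_angles(seed=1)   # different seed: different draws
```

Two runs with the same seed produce the same initial angles, which is what makes an experiment repeatable.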
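Wrapping things together: the EnvBase class
We can finally put together the pieces and design our environment class. The specs initialization needs to be performed during the environment construction, so we must take care of calling the _make_spec() method within PendulumEnv.__init__().
We add a static method PendulumEnv.gen_params() which deterministically generates a set of hyperparameters to be used during execution: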
def gen_params(g=10.0, batch_size=None) -> TensorDictBase:
    """Returns a ``tensordict`` containing the physical parameters such as gravitational force and torque or speed limits."""
    if batch_size is None:
        batch_size = []
    td = TensorDict(
        {
            "params": TensorDict(
                {
                    "max_speed": 8,
                    "max_torque": 2.0,
                    "dt": 0.05,
                    "g": g,
                    "m": 1.0,
                    "l": 1.0,
                },
                [],
            )
        },
        [],
    )
    if batch_size:
        td = td.expand(batch_size).contiguous()
    return td
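We define the environment as non-batch_locked by setting the attribute of the same name to False. This means that we will not enforce the input tensordict to have a batch size that matches the one of the environment.
The following code will just put together the pieces we have coded above.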
class PendulumEnv(EnvBase):
    metadata = {
        "render_modes": ["human", "rgb_array"],
        "render_fps": 30,
    }
    batch_locked = False

    def __init__(self, td_params=None, seed=None, device="cpu"):
        if td_params is None:
            td_params = self.gen_params()

        super().__init__(device=device, batch_size=[])
        self._make_spec(td_params)
        if seed is None:
            seed = torch.empty((), dtype=torch.int64).random_().item()
        self.set_seed(seed)

    # Helpers: _make_spec and gen_params
    gen_params = staticmethod(gen_params)
    _make_spec = _make_spec

    # Mandatory methods: _step, _reset and _set_seed
    _reset = _reset
    _step = staticmethod(_step)
    _set_seed = _set_seed
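Testing our environment
TorchRL provides a simple function check_env_specs() to check that a (transformed) environment has an input/output structure that matches the one dictated by its specs. Let us try it out: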
env = PendulumEnv()
check_env_specs(env)
We can have a look at our specs to get a visual representation of the environment signature:
print("observation_spec:", env.observation_spec)
print("state_spec:", env.state_spec)
print("reward_spec:", env.reward_spec)
observation_spec: Composite(
th: BoundedContinuous(
shape=torch.Size([]),
space=ContinuousBox(
low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
device=cpu,
dtype=torch.float32,
domain=continuous),
thdot: BoundedContinuous(
shape=torch.Size([]),
space=ContinuousBox(
low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
device=cpu,
dtype=torch.float32,
domain=continuous),
params: Composite(
max_speed: UnboundedDiscrete(
shape=torch.Size([]),
space=ContinuousBox(
low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.int64, contiguous=True),
high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.int64, contiguous=True)),
device=cpu,
dtype=torch.int64,
domain=discrete),
max_torque: UnboundedContinuous(
shape=torch.Size([]),
space=ContinuousBox(
low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
device=cpu,
dtype=torch.float32,
domain=continuous),
dt: UnboundedContinuous(
shape=torch.Size([]),
space=ContinuousBox(
low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
device=cpu,
dtype=torch.float32,
domain=continuous),
g: UnboundedContinuous(
shape=torch.Size([]),
space=ContinuousBox(
low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
device=cpu,
dtype=torch.float32,
domain=continuous),
m: UnboundedContinuous(
shape=torch.Size([]),
space=ContinuousBox(
low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
device=cpu,
dtype=torch.float32,
domain=continuous),
l: UnboundedContinuous(
shape=torch.Size([]),
space=ContinuousBox(
low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
device=cpu,
dtype=torch.float32,
domain=continuous),
device=cpu,
shape=torch.Size([])),
device=cpu,
shape=torch.Size([]))
state_spec: Composite(
th: BoundedContinuous(
shape=torch.Size([]),
space=ContinuousBox(
low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
device=cpu,
dtype=torch.float32,
domain=continuous),
thdot: BoundedContinuous(
shape=torch.Size([]),
space=ContinuousBox(
low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
device=cpu,
dtype=torch.float32,
domain=continuous),
params: Composite(
max_speed: UnboundedDiscrete(
shape=torch.Size([]),
space=ContinuousBox(
low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.int64, contiguous=True),
high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.int64, contiguous=True)),
device=cpu,
dtype=torch.int64,
domain=discrete),
max_torque: UnboundedContinuous(
shape=torch.Size([]),
space=ContinuousBox(
low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
device=cpu,
dtype=torch.float32,
domain=continuous),
dt: UnboundedContinuous(
shape=torch.Size([]),
space=ContinuousBox(
low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
device=cpu,
dtype=torch.float32,
domain=continuous),
g: UnboundedContinuous(
shape=torch.Size([]),
space=ContinuousBox(
low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
device=cpu,
dtype=torch.float32,
domain=continuous),
m: UnboundedContinuous(
shape=torch.Size([]),
space=ContinuousBox(
low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
device=cpu,
dtype=torch.float32,
domain=continuous),
l: UnboundedContinuous(
shape=torch.Size([]),
space=ContinuousBox(
low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
device=cpu,
dtype=torch.float32,
domain=continuous),
device=cpu,
shape=torch.Size([])),
device=cpu,
shape=torch.Size([]))
reward_spec: UnboundedContinuous(
shape=torch.Size([1]),
space=ContinuousBox(
low=Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, contiguous=True),
high=Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, contiguous=True)),
device=cpu,
dtype=torch.float32,
domain=continuous)
We can also execute a couple of commands to check that the output structure matches what is expected.
td = env.reset()
print("reset tensordict", td)
reset tensordict TensorDict(
fields={
done: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
params: TensorDict(
fields={
dt: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
g: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
l: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
m: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
max_speed: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.int64, is_shared=False),
max_torque: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False)},
batch_size=torch.Size([]),
device=None,
is_shared=False),
terminated: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
th: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
thdot: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False)},
batch_size=torch.Size([]),
device=None,
is_shared=False)
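We can run env.rand_step() to generate an action randomly from the action_spec domain. A tensordict containing the hyperparameters and the current state must be passed since our environment is stateless. In stateful contexts, env.rand_step() works perfectly too.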
td = env.rand_step(td)
print("random step tensordict", td)
random step tensordict TensorDict(
fields={
action: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, is_shared=False),
done: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
next: TensorDict(
fields={
done: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
params: TensorDict(
fields={
dt: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
g: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
l: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
m: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
max_speed: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.int64, is_shared=False),
max_torque: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False)},
batch_size=torch.Size([]),
device=None,
is_shared=False),
reward: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, is_shared=False),
terminated: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
th: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
thdot: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False)},
batch_size=torch.Size([]),
device=None,
is_shared=False),
params: TensorDict(
fields={
dt: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
g: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
l: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
m: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
max_speed: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.int64, is_shared=False),
max_torque: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False)},
batch_size=torch.Size([]),
device=None,
is_shared=False),
terminated: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
th: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
thdot: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False)},
batch_size=torch.Size([]),
device=None,
is_shared=False)
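Transforming an environment
Writing environment transforms for stateless simulators is slightly more complicated than for stateful ones: transforming an output entry that needs to be read at the following iteration requires applying the inverse transform before calling meth.step() at the next step. This is an ideal scenario to showcase all the features of TorchRL's transforms!
For instance, in the following transformed environment we unsqueeze the entries ["th", "thdot"] to be able to stack them along the last dimension. We also pass them as in_keys_inv to squeeze them back to their original shape once they are passed as input in the next iteration.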
env = TransformedEnv(
    env,
    # ``Unsqueeze`` the observations that we will concatenate
    UnsqueezeTransform(
        dim=-1,
        in_keys=["th", "thdot"],
        in_keys_inv=["th", "thdot"],
    ),
)
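The shape bookkeeping behind this forward/inverse pair can be reproduced with plain tensors. The sketch below does not use UnsqueezeTransform itself; it only applies torch.unsqueeze and torch.squeeze by hand to show what the forward and inverse directions are expected to do to the shapes.

```python
import torch

th = torch.tensor([0.1, 0.2, 0.3])     # batch of 3 angles, shape [3]
thdot = torch.tensor([1.0, 2.0, 3.0])  # batch of 3 velocities, shape [3]

# Forward direction: add a trailing dimension so entries can be concatenated.
th_out, thdot_out = th.unsqueeze(-1), thdot.unsqueeze(-1)  # shape [3, 1]
stacked = torch.cat([th_out, thdot_out], dim=-1)           # shape [3, 2]

# Inverse direction: restore the shape the simulator expects as input.
th_back = th_out.squeeze(-1)                               # shape [3]
```

Because the stateless simulator reads "th" and "thdot" back at the next step, the inverse direction is what keeps the shapes seen by _step() unchanged.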
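Writing custom transforms
TorchRL's transforms may not cover all the operations one wants to execute after an environment has been executed. Writing a transform does not require much effort. As for the environment design, there are two steps in writing a transform:
- Getting the dynamics right (forward and inverse);
- Adapting the environment specs.
A transform can be used in two settings: on its own, it can be used as a Module. It can also be used appended to a TransformedEnv. The structure of the class allows the behavior to be customized in the different contexts.
A Transform skeleton can be summarized as follows: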
class Transform(nn.Module):
    def forward(self, tensordict):
        ...

    def _apply_transform(self, tensordict):
        ...

    def _step(self, tensordict):
        ...

    def _call(self, tensordict):
        ...

    def inv(self, tensordict):
        ...

    def _inv_apply_transform(self, tensordict):
        ...
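There are three entry points (forward(), _step() and inv()) which all receive tensordict.TensorDict instances. The first two will eventually go through the keys indicated by in_keys and call _apply_transform() on each of these. The results will be written in the entries pointed to by Transform.out_keys if provided (if not, the in_keys will be updated with the transformed values). If inverse transforms need to be executed, a similar data flow will be executed but with the Transform.inv() and Transform._inv_apply_transform() methods and across the in_keys_inv and out_keys_inv lists of keys. The following figure summarizes this flow for environments and replay buffers.
[Figure: Transform API]
In some cases, a transform will not work on a subset of keys in a unitary manner, but will execute some operation on the parent environment or work with the entire input tensordict. In those cases, the _call() and forward() methods should be re-written, and the _apply_transform() method can be skipped.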
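The key-by-key data flow described above can be imitated with a few lines of plain Python. `ToyTransform` and `ToySin` are hypothetical classes operating on ordinary dictionaries; they are not TorchRL's Transform and only mirror the in_keys/out_keys convention.

```python
import math


class ToyTransform:
    """Dict-based imitation of the in_keys/out_keys data flow."""

    def __init__(self, in_keys, out_keys=None):
        self.in_keys = list(in_keys)
        # Without out_keys, results overwrite the entries they were read from.
        self.out_keys = list(out_keys) if out_keys is not None else list(in_keys)

    def _apply_transform(self, value):
        raise NotImplementedError

    def forward(self, data: dict) -> dict:
        for in_key, out_key in zip(self.in_keys, self.out_keys):
            data[out_key] = self._apply_transform(data[in_key])
        return data


class ToySin(ToyTransform):
    def _apply_transform(self, value):
        return math.sin(value)


data = ToySin(in_keys=["th"], out_keys=["sin"]).forward({"th": math.pi / 2})
overwritten = ToySin(in_keys=["th"]).forward({"th": 0.0})
```

A subclass only has to define the per-entry operation; the loop over keys stays in the base class, which is the design that the skeleton above suggests.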
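Let us code new transforms that will compute the sine and cosine values of the position angle, as these values are more useful to us to learn a policy than the raw angle value: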
class SinTransform(Transform):
    def _apply_transform(self, obs: torch.Tensor) -> torch.Tensor:
        return obs.sin()

    # The transform must also modify the data at reset time
    def _reset(
        self, tensordict: TensorDictBase, tensordict_reset: TensorDictBase
    ) -> TensorDictBase:
        return self._call(tensordict_reset)

    # _apply_to_composite will execute the observation spec transform across all
    # in_keys/out_keys pairs and write the result in the observation_spec which
    # is of type ``Composite``
    @_apply_to_composite
    def transform_observation_spec(self, observation_spec):
        return BoundedTensorSpec(
            low=-1,
            high=1,
            shape=observation_spec.shape,
            dtype=observation_spec.dtype,
            device=observation_spec.device,
        )


class CosTransform(Transform):
    def _apply_transform(self, obs: torch.Tensor) -> torch.Tensor:
        return obs.cos()

    # The transform must also modify the data at reset time
    def _reset(
        self, tensordict: TensorDictBase, tensordict_reset: TensorDictBase
    ) -> TensorDictBase:
        return self._call(tensordict_reset)

    # _apply_to_composite will execute the observation spec transform across all
    # in_keys/out_keys pairs and write the result in the observation_spec which
    # is of type ``Composite``
    @_apply_to_composite
    def transform_observation_spec(self, observation_spec):
        return BoundedTensorSpec(
            low=-1,
            high=1,
            shape=observation_spec.shape,
            dtype=observation_spec.dtype,
            device=observation_spec.device,
        )


t_sin = SinTransform(in_keys=["th"], out_keys=["sin"])
t_cos = CosTransform(in_keys=["th"], out_keys=["cos"])
env.append_transform(t_sin)
env.append_transform(t_cos)
TransformedEnv(
env=PendulumEnv(),
transform=Compose(
UnsqueezeTransform(dim=-1, in_keys=['th', 'thdot'], out_keys=['th', 'thdot'], in_keys_inv=['th', 'thdot'], out_keys_inv=['th', 'thdot']),
SinTransform(keys=['th']),
CosTransform(keys=['th'])))
Concatenates the observations onto an "observation" entry. del_keys=False ensures that we keep these values for the next iteration.
cat_transform = CatTensors(
    in_keys=["sin", "cos", "thdot"], dim=-1, out_key="observation", del_keys=False
)
env.append_transform(cat_transform)
TransformedEnv(
env=PendulumEnv(),
transform=Compose(
UnsqueezeTransform(dim=-1, in_keys=['th', 'thdot'], out_keys=['th', 'thdot'], in_keys_inv=['th', 'thdot'], out_keys_inv=['th', 'thdot']),
SinTransform(keys=['th']),
CosTransform(keys=['th']),
CatTensors(in_keys=['cos', 'sin', 'thdot'], out_key=observation)))
Once more, let us check that our environment specs match what is received:
check_env_specs(env)
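Executing a rollout
Executing a rollout is a succession of simple steps:
- reset the environment
- while some condition is not met:
  - compute an action given a policy
  - execute a step given this action
  - collect the data
  - make an MDP step
- gather the data and return
These operations have been conveniently wrapped in the rollout() method, from which we provide a simplified version here below.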
def simple_rollout(steps=100):
    # preallocate:
    data = TensorDict({}, [steps])
    # reset
    _data = env.reset()
    for i in range(steps):
        _data["action"] = env.action_spec.rand()
        _data = env.step(_data)
        data[i] = _data
        _data = step_mdp(_data, keep_other=True)
    return data


print("data from rollout:", simple_rollout(100))
data from rollout: TensorDict(
fields={
action: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False),
cos: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False),
done: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.bool, is_shared=False),
next: TensorDict(
fields={
cos: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False),
done: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.bool, is_shared=False),
observation: Tensor(shape=torch.Size([100, 3]), device=cpu, dtype=torch.float32, is_shared=False),
params: TensorDict(
fields={
dt: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False),
g: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False),
l: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False),
m: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False),
max_speed: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.int64, is_shared=False),
max_torque: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False)},
batch_size=torch.Size([100]),
device=None,
is_shared=False),
reward: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False),
sin: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False),
terminated: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.bool, is_shared=False),
th: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False),
thdot: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False)},
batch_size=torch.Size([100]),
device=None,
is_shared=False),
observation: Tensor(shape=torch.Size([100, 3]), device=cpu, dtype=torch.float32, is_shared=False),
params: TensorDict(
fields={
dt: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False),
g: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False),
l: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False),
m: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False),
max_speed: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.int64, is_shared=False),
max_torque: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False)},
batch_size=torch.Size([100]),
device=None,
is_shared=False),
sin: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False),
terminated: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.bool, is_shared=False),
th: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False),
thdot: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False)},
batch_size=torch.Size([100]),
device=None,
is_shared=False)
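The "MDP step" in the loop above moves the content of "next" to the root, so that the next iteration starts from the new state. The dictionary-based sketch below approximates that idea. `toy_step_mdp` is a hypothetical helper, and the choice of dropping the action and the reward while keeping other root entries reflects an assumption about the default behavior of step_mdp with keep_other=True, not a reproduction of its implementation.

```python
def toy_step_mdp(data: dict) -> dict:
    # Keep root entries other than the action and the nested "next" record...
    out = {k: v for k, v in data.items() if k not in ("action", "next")}
    # ...then overwrite them with the freshly computed values, minus the reward.
    out.update({k: v for k, v in data["next"].items() if k != "reward"})
    return out


transition = {
    "th": 0.0,
    "params": {"dt": 0.05},
    "action": 2.0,
    "next": {"th": 0.015, "reward": -0.004, "done": False},
}
new_root = toy_step_mdp(transition)
```

After this step, `new_root` has the same layout as the output of a reset, which is why it can be fed straight back into the next call to step().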
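Batching computations
The last unexplored end of our tutorial is the ability that we have to batch computations in TorchRL. Because our environment does not make any assumptions regarding the input data shape, we can seamlessly execute it over batches of data. Even better: for non-batch-locked environments such as our Pendulum, we can change the batch size on the fly without recreating the environment. To do this, we just generate parameters with the desired shape.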
batch_size = 10 # number of environments to be executed in batch
td = env.reset(env.gen_params(batch_size=[batch_size]))
print("reset (batch size of 10)", td)
td = env.rand_step(td)
print("rand step (batch size of 10)", td)
reset (batch size of 10) TensorDict(
fields={
cos: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
done: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
observation: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
params: TensorDict(
fields={
dt: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
g: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
l: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
m: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
max_speed: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.int64, is_shared=False),
max_torque: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False)},
batch_size=torch.Size([10]),
device=None,
is_shared=False),
sin: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
terminated: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
th: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
thdot: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False)},
batch_size=torch.Size([10]),
device=None,
is_shared=False)
rand step (batch size of 10) TensorDict(
fields={
action: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
cos: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
done: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
next: TensorDict(
fields={
cos: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
done: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
observation: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
params: TensorDict(
fields={
dt: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
g: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
l: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
m: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
max_speed: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.int64, is_shared=False),
max_torque: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False)},
batch_size=torch.Size([10]),
device=None,
is_shared=False),
reward: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
sin: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
terminated: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
th: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
thdot: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False)},
batch_size=torch.Size([10]),
device=None,
is_shared=False),
observation: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
params: TensorDict(
fields={
dt: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
g: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
l: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
m: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
max_speed: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.int64, is_shared=False),
max_torque: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False)},
batch_size=torch.Size([10]),
device=None,
is_shared=False),
sin: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
terminated: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
th: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
thdot: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False)},
batch_size=torch.Size([10]),
device=None,
is_shared=False)
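The reason batching comes for free is that the dynamics are written only with elementwise tensor operations. The sketch below isolates the angular-velocity update as a standalone function (a hypothetical helper with assumed defaults g = 10, m = 1, length = 1, dt = 0.05) and runs the same code on a single simulation and on a batch of ten.

```python
import torch


def angular_velocity_update(th, thdot, u, g=10.0, m=1.0, length=1.0, dt=0.05):
    """Angular-velocity update written only with elementwise tensor operations."""
    return thdot + (3 * g / (2 * length) * th.sin() + 3.0 / (m * length**2) * u) * dt


# A single simulation: every input is a 0-dimensional tensor.
single = angular_velocity_update(
    torch.tensor(0.0), torch.tensor(0.0), torch.tensor(2.0)
)

# Ten simulations at once: the same code, with a leading batch dimension.
batch = angular_velocity_update(
    torch.zeros(10), torch.zeros(10), torch.full((10,), 2.0)
)
```

No loop over the batch is needed: the leading dimension is carried through every operation, which is the property that a non-batch-locked environment relies on.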
Executing a rollout with a batch of data requires us to reset the environment outside of the rollout function, since we need to define the batch_size dynamically and this is not supported by rollout():
rollout = env.rollout(
    3,
    auto_reset=False,  # we're executing the reset out of the ``rollout`` call
    tensordict=env.reset(env.gen_params(batch_size=[batch_size])),
)
print("rollout of len 3 (batch size of 10):", rollout)
rollout of len 3 (batch size of 10): TensorDict(
fields={
action: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
cos: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
done: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.bool, is_shared=False),
next: TensorDict(
fields={
cos: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
done: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.bool, is_shared=False),
observation: Tensor(shape=torch.Size([10, 3, 3]), device=cpu, dtype=torch.float32, is_shared=False),
params: TensorDict(
fields={
dt: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
g: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
l: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
m: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
max_speed: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.int64, is_shared=False),
max_torque: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False)},
batch_size=torch.Size([10, 3]),
device=None,
is_shared=False),
reward: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
sin: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
terminated: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.bool, is_shared=False),
th: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
thdot: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False)},
batch_size=torch.Size([10, 3]),
device=None,
is_shared=False),
observation: Tensor(shape=torch.Size([10, 3, 3]), device=cpu, dtype=torch.float32, is_shared=False),
params: TensorDict(
fields={
dt: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
g: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
l: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
m: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
max_speed: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.int64, is_shared=False),
max_torque: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False)},
batch_size=torch.Size([10, 3]),
device=None,
is_shared=False),
sin: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
terminated: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.bool, is_shared=False),
th: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
thdot: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False)},
batch_size=torch.Size([10, 3]),
device=None,
is_shared=False)
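To make the shapes printed above easier to read, here is a dependency-free sketch of the same pattern: the caller resets a batch first, then a stateless rollout stacks the steps along a new time dimension, giving a `[batch, time]` layout. Plain Python lists stand in for tensors. The names `PARAMS`, `reset`, `step` and `rollout` are hypothetical and are not TorchRL APIs, the constants are assumed Pendulum-v1-style defaults, and the zero-torque policy is only there to keep the example deterministic.

```python
import math

# Assumed Pendulum-v1-style constants (hypothetical, not read from gen_params).
PARAMS = {"g": 10.0, "m": 1.0, "l": 1.0, "dt": 0.05}


def reset(batch_size):
    """Build one initial state per batch element (fixed angles, zero velocity)."""
    return [
        {"th": math.pi * (i + 1) / (batch_size + 1), "thdot": 0.0}
        for i in range(batch_size)
    ]


def step(state, u, p=PARAMS):
    """Stateless transition: the current state is an explicit input."""
    thdot = state["thdot"] + (
        3 * p["g"] / (2 * p["l"]) * math.sin(state["th"])
        + 3.0 / (p["m"] * p["l"] ** 2) * u
    ) * p["dt"]
    th = state["th"] + thdot * p["dt"]
    return {"th": th, "thdot": thdot}


def rollout(init_states, n_steps, policy):
    """Roll out every batch element; the reset happened outside this function."""
    trajectories = []
    for state in init_states:
        traj = []
        for _ in range(n_steps):
            state = step(state, policy(state))
            traj.append(state)  # this sketch stores only the post-step state
        trajectories.append(traj)
    return trajectories  # nested list indexed as [batch][time]


batch = reset(batch_size=10)  # the batch size is chosen here, before the rollout
trajs = rollout(batch, n_steps=3, policy=lambda state: 0.0)
print(len(trajs), len(trajs[0]))
```

The two printed lengths play the role of `batch_size=torch.Size([10, 3])` in the TorchRL output: the leading dimension comes from the reset, the trailing one from the rollout length.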
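Training a simple policy
In this example, we will train a simple policy using the reward as a differentiable objective (that is, as a negative loss). We take advantage of the fact that our dynamic system is fully differentiable to backpropagate through the trajectory return and adjust the weights of the policy to maximize this value directly. Of course, in many settings the assumptions we make here do not hold, such as a differentiable system and full access to the underlying mechanics.
Still, this is a very simple example that shows how a training loop can be written with a custom environment in TorchRL.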
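As a rough illustration of what "maximizing the trajectory return directly" means, the sketch below computes the gradient of a short rollout's mean reward with respect to the weights of a linear policy. It is not the tutorial's method: central finite differences stand in for autograd, the policy `u = w[0] * th + w[1] * thdot` is hypothetical, the cost `th**2 + 0.1 * thdot**2 + 0.001 * u**2` is assumed from the Pendulum-v1 convention, and angle normalization, speed clamping and torque clamping are left out for brevity.

```python
import math

# Assumed Pendulum-v1-style constants (hypothetical for this sketch).
G, M, L, DT = 10.0, 1.0, 1.0, 0.05


def traj_return(w, th0, thdot0, n_steps=10):
    """Mean reward of a rollout under the linear policy u = w[0]*th + w[1]*thdot."""
    th, thdot, total = th0, thdot0, 0.0
    for _ in range(n_steps):
        u = w[0] * th + w[1] * thdot
        # Reward is the negative cost, evaluated on the pre-step state and action.
        total += -(th ** 2 + 0.1 * thdot ** 2 + 0.001 * u ** 2)
        thdot = thdot + (3 * G / (2 * L) * math.sin(th) + 3.0 / (M * L ** 2) * u) * DT
        th = th + thdot * DT
    return total / n_steps


def grad_fd(w, th0, thdot0, eps=1e-5):
    """Central finite-difference gradient of the return w.r.t. the policy weights."""
    grads = []
    for i in range(len(w)):
        up, down = list(w), list(w)
        up[i] += eps
        down[i] -= eps
        grads.append(
            (traj_return(up, th0, thdot0) - traj_return(down, th0, thdot0)) / (2 * eps)
        )
    return grads


w = [0.0, 0.0]
grad = grad_fd(w, th0=0.1, thdot0=0.0)
# One plain gradient-ascent step on the return (the tutorial uses Adam instead).
w_new = [wi + 0.1 * gi for wi, gi in zip(w, grad)]
print("gradient:", grad, "updated weights:", w_new)
```

Starting slightly off the upright position, both gradient components come out negative: pushing against the angle and against the velocity is what raises the return, which is the behaviour the trained network is expected to discover.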
Let us first write the policy network:
torch.manual_seed(0)
env.set_seed(0)

net = nn.Sequential(
    nn.LazyLinear(64),
    nn.Tanh(),
    nn.LazyLinear(64),
    nn.Tanh(),
    nn.LazyLinear(64),
    nn.Tanh(),
    nn.LazyLinear(1),
)
policy = TensorDictModule(
    net,
    in_keys=["observation"],
    out_keys=["action"],
)
And our optimizer:
optim = torch.optim.Adam(policy.parameters(), lr=2e-3)
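Training loop
We will successively:
- generate a trajectory;
- aggregate the rewards (the code below takes their mean);
- backpropagate through the graph defined by these operations;
- clip the gradient norm and make an optimization step;
- repeat.
At the end of the training loop, we should have a final reward close to 0, which indicates that the pendulum is upright and still, as desired.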
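One detail of the clipping step is worth spelling out, because the `gradient norm` values logged during training are often far above the threshold of 1.0 (52.06 or 400.1, for instance). This is expected: `torch.nn.utils.clip_grad_norm_` returns the total norm measured before clipping. The sketch below reproduces that arithmetic on plain Python lists; `clip_grad_norm` is a hypothetical helper rather than the PyTorch function, and the `1e-6` stabilizer is an assumption modelled on PyTorch's implementation.

```python
import math


def clip_grad_norm(grads, max_norm, eps=1e-6):
    """Rescale a flat list of gradients so that its L2 norm is at most max_norm.

    Returns the rescaled gradients and the norm measured *before* clipping.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    # The scaling factor is capped at 1.0, so small gradients are left untouched.
    clip_coef = min(1.0, max_norm / (total_norm + eps))
    return [g * clip_coef for g in grads], total_norm


large_clipped, large_norm = clip_grad_norm([3.0, 4.0], max_norm=1.0)
small_clipped, small_norm = clip_grad_norm([0.3, 0.4], max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in large_clipped))

print("reported norm:", large_norm, "norm after clipping:", norm_after)
print("small gradients unchanged:", small_clipped == [0.3, 0.4])
```

The reported norm is therefore a diagnostic of how large the raw gradient was, not of the update that the optimizer actually applies.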
batch_size = 32
pbar = tqdm.tqdm(range(20_000 // batch_size))
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optim, 20_000)
logs = defaultdict(list)

for _ in pbar:
    # Reset outside of ``rollout`` so that the batch size can be set dynamically.
    init_td = env.reset(env.gen_params(batch_size=[batch_size]))
    rollout = env.rollout(100, policy, tensordict=init_td, auto_reset=False)
    # The mean reward over the batch and the time steps is the objective.
    traj_return = rollout["next", "reward"].mean()
    (-traj_return).backward()
    # ``gn`` is the gradient norm measured before clipping.
    gn = torch.nn.utils.clip_grad_norm_(net.parameters(), 1.0)
    optim.step()
    optim.zero_grad()
    pbar.set_description(
        f"reward: {traj_return: 4.4f}, "
        f"last reward: {rollout[...,-1]['next','reward'].mean(): 4.4f}, gradient norm: {gn: 4.4}"
    )
    logs["return"].append(traj_return.item())
    logs["last_reward"].append(rollout[..., -1]["next", "reward"].mean().item())
    scheduler.step()
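A note on the learning-rate schedule used in the loop: the scheduler is stepped once per iteration, that is `20_000 // 32 = 625` times, while its period argument is `20_000`. The sketch below evaluates the standard closed-form cosine annealing formula to show what that implies. `cosine_annealing_lr` is a hypothetical helper, not the PyTorch class, and it assumes a minimum learning rate of 0 and that nothing else modifies the learning rate.

```python
import math


def cosine_annealing_lr(step, base_lr, t_max, eta_min=0.0):
    """Closed-form cosine annealing: base_lr at step 0, eta_min at step t_max."""
    return eta_min + (base_lr - eta_min) * (1 + math.cos(math.pi * step / t_max)) / 2


base_lr = 2e-3
n_updates = 20_000 // 32  # number of scheduler steps taken by the loop

# Period of 20_000 with only 625 steps: the schedule barely leaves its start.
lr_end_as_written = cosine_annealing_lr(n_updates, base_lr, t_max=20_000)
# For comparison only: a period equal to the number of updates decays to zero.
lr_end_matched = cosine_annealing_lr(n_updates, base_lr, t_max=n_updates)

print(f"final lr with t_max=20_000: {lr_end_as_written:.6f}")
print(f"final lr with t_max={n_updates}: {lr_end_matched:.6f}")
```

Under these assumptions the learning rate stays within one percent of its initial value for the whole run, so the training behaves much like constant-rate Adam. Whether that is intended is not stated in the tutorial.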
def plot():
    import matplotlib
    from matplotlib import pyplot as plt

    is_ipython = "inline" in matplotlib.get_backend()
    if is_ipython:
        from IPython import display

    with plt.ion():
        plt.figure(figsize=(10, 5))
        plt.subplot(1, 2, 1)
        plt.plot(logs["return"])
        plt.title("returns")
        plt.xlabel("iteration")
        plt.subplot(1, 2, 2)
        plt.plot(logs["last_reward"])
        plt.title("last reward")
        plt.xlabel("iteration")
        if is_ipython:
            display.display(plt.gcf())
            display.clear_output(wait=True)
        plt.show()


plot()
0%| | 0/625 [00:00<?, ?it/s]
reward: -6.0488, last reward: -5.0748, gradient norm: 8.519: 0%| | 1/625 [00:00<02:15, 4.60it/s]
reward: -7.0499, last reward: -7.4472, gradient norm: 5.073: 0%| | 2/625 [00:00<02:16, 4.57it/s]
reward: -7.0685, last reward: -7.0408, gradient norm: 5.552: 0%| | 3/625 [00:00<02:16, 4.56it/s]
reward: -6.5154, last reward: -5.9086, gradient norm: 2.527: 1%| | 4/625 [00:00<02:15, 4.58it/s]
reward: -6.2006, last reward: -5.9385, gradient norm: 8.155: 1%| | 5/625 [00:01<02:15, 4.58it/s]
reward: -6.2568, last reward: -5.4981, gradient norm: 6.223: 1%| | 6/625 [00:01<02:15, 4.57it/s]
reward: -5.8929, last reward: -8.4491, gradient norm: 4.581: 1%|1 | 7/625 [00:01<02:14, 4.58it/s]
reward: -6.3233, last reward: -9.0664, gradient norm: 7.596: 1%|1 | 8/625 [00:01<02:14, 4.59it/s]
reward: -6.1021, last reward: -9.5263, gradient norm: 0.9579: 1%|1 | 9/625 [00:01<02:14, 4.59it/s]
reward: -6.5807, last reward: -8.8075, gradient norm: 3.212: 2%|1 | 10/625 [00:02<02:14, 4.59it/s]
reward: -6.2009, last reward: -8.5525, gradient norm: 2.914: 2%|1 | 11/625 [00:02<02:13, 4.59it/s]
reward: -6.2894, last reward: -8.0115, gradient norm: 52.06: 2%|1 | 12/625 [00:02<02:13, 4.60it/s]
reward: -6.0977, last reward: -6.1845, gradient norm: 18.09: 2%|2 | 13/625 [00:02<02:13, 4.60it/s]
reward: -6.1830, last reward: -7.4858, gradient norm: 5.233: 2%|2 | 14/625 [00:03<02:12, 4.60it/s]
reward: -6.2863, last reward: -5.0297, gradient norm: 1.464: 2%|2 | 15/625 [00:03<02:12, 4.60it/s]
reward: -6.4617, last reward: -5.5997, gradient norm: 2.904: 3%|2 | 16/625 [00:03<02:12, 4.60it/s]
reward: -6.1647, last reward: -6.0777, gradient norm: 4.901: 3%|2 | 17/625 [00:03<02:12, 4.60it/s]
reward: -6.4709, last reward: -6.6813, gradient norm: 0.8317: 3%|2 | 18/625 [00:03<02:11, 4.60it/s]
reward: -6.3221, last reward: -6.5554, gradient norm: 1.276: 3%|3 | 19/625 [00:04<02:12, 4.57it/s]
reward: -6.3353, last reward: -7.9999, gradient norm: 4.701: 3%|3 | 20/625 [00:04<02:13, 4.55it/s]
reward: -5.8570, last reward: -7.6656, gradient norm: 5.463: 3%|3 | 21/625 [00:04<02:13, 4.54it/s]
reward: -5.7779, last reward: -6.6911, gradient norm: 6.875: 4%|3 | 22/625 [00:04<02:13, 4.53it/s]
reward: -6.0796, last reward: -5.7082, gradient norm: 5.308: 4%|3 | 23/625 [00:05<02:13, 4.52it/s]
reward: -6.0421, last reward: -6.1496, gradient norm: 12.4: 4%|3 | 24/625 [00:05<02:13, 4.52it/s]
reward: -5.5037, last reward: -5.1755, gradient norm: 22.62: 4%|4 | 25/625 [00:05<02:12, 4.52it/s]
reward: -5.5029, last reward: -4.9454, gradient norm: 3.665: 4%|4 | 26/625 [00:05<02:12, 4.52it/s]
reward: -5.9330, last reward: -6.2118, gradient norm: 5.444: 4%|4 | 27/625 [00:05<02:11, 4.55it/s]
reward: -6.0995, last reward: -6.6294, gradient norm: 11.69: 4%|4 | 28/625 [00:06<02:10, 4.56it/s]
reward: -6.3146, last reward: -7.2909, gradient norm: 5.461: 5%|4 | 29/625 [00:06<02:10, 4.57it/s]
reward: -5.9720, last reward: -6.1298, gradient norm: 19.91: 5%|4 | 30/625 [00:06<02:10, 4.56it/s]
reward: -5.9923, last reward: -7.0345, gradient norm: 3.464: 5%|4 | 31/625 [00:06<02:09, 4.57it/s]
reward: -5.3438, last reward: -4.3688, gradient norm: 2.424: 5%|5 | 32/625 [00:07<02:09, 4.59it/s]
reward: -5.6953, last reward: -4.5233, gradient norm: 3.411: 5%|5 | 33/625 [00:07<02:09, 4.58it/s]
reward: -5.4288, last reward: -2.8011, gradient norm: 10.82: 5%|5 | 34/625 [00:07<02:09, 4.55it/s]
reward: -5.5329, last reward: -4.2677, gradient norm: 15.71: 6%|5 | 35/625 [00:07<02:10, 4.53it/s]
reward: -5.6969, last reward: -3.7010, gradient norm: 1.376: 6%|5 | 36/625 [00:07<02:09, 4.55it/s]
reward: -5.9352, last reward: -4.7707, gradient norm: 15.49: 6%|5 | 37/625 [00:08<02:08, 4.57it/s]
reward: -5.6178, last reward: -4.5646, gradient norm: 3.348: 6%|6 | 38/625 [00:08<02:08, 4.58it/s]
reward: -5.7304, last reward: -3.9407, gradient norm: 4.942: 6%|6 | 39/625 [00:08<02:07, 4.58it/s]
reward: -5.3882, last reward: -3.7604, gradient norm: 9.85: 6%|6 | 40/625 [00:08<02:07, 4.60it/s]
reward: -5.3507, last reward: -2.8928, gradient norm: 1.258: 7%|6 | 41/625 [00:08<02:07, 4.58it/s]
reward: -5.6978, last reward: -4.4641, gradient norm: 4.549: 7%|6 | 42/625 [00:09<02:07, 4.58it/s]
reward: -5.5263, last reward: -3.6047, gradient norm: 2.544: 7%|6 | 43/625 [00:09<02:06, 4.58it/s]
reward: -5.5005, last reward: -4.4136, gradient norm: 11.49: 7%|7 | 44/625 [00:09<02:06, 4.58it/s]
reward: -5.2993, last reward: -6.3222, gradient norm: 32.53: 7%|7 | 45/625 [00:09<02:06, 4.58it/s]
reward: -5.4046, last reward: -5.7314, gradient norm: 7.275: 7%|7 | 46/625 [00:10<02:06, 4.58it/s]
reward: -5.6331, last reward: -4.9318, gradient norm: 6.961: 8%|7 | 47/625 [00:10<02:05, 4.59it/s]
reward: -4.8331, last reward: -4.1604, gradient norm: 26.26: 8%|7 | 48/625 [00:10<02:06, 4.58it/s]
reward: -5.4099, last reward: -4.4761, gradient norm: 8.125: 8%|7 | 49/625 [00:10<02:05, 4.58it/s]
reward: -5.4262, last reward: -3.6363, gradient norm: 2.382: 8%|8 | 50/625 [00:10<02:05, 4.57it/s]
reward: -5.3593, last reward: -5.7377, gradient norm: 22.62: 8%|8 | 51/625 [00:11<02:05, 4.56it/s]
reward: -5.2847, last reward: -3.3443, gradient norm: 2.867: 8%|8 | 52/625 [00:11<02:05, 4.56it/s]
reward: -5.3592, last reward: -6.4760, gradient norm: 8.441: 8%|8 | 53/625 [00:11<02:05, 4.55it/s]
reward: -5.9950, last reward: -10.8021, gradient norm: 11.77: 9%|8 | 54/625 [00:11<02:05, 4.56it/s]
reward: -6.3528, last reward: -7.1214, gradient norm: 7.708: 9%|8 | 55/625 [00:12<02:04, 4.57it/s]
reward: -6.4023, last reward: -7.3583, gradient norm: 9.041: 9%|8 | 56/625 [00:12<02:04, 4.56it/s]
reward: -6.3801, last reward: -7.0310, gradient norm: 120.1: 9%|9 | 57/625 [00:12<02:04, 4.56it/s]
reward: -6.4244, last reward: -6.2039, gradient norm: 15.48: 9%|9 | 58/625 [00:12<02:04, 4.56it/s]
reward: -6.4850, last reward: -6.8748, gradient norm: 4.706: 9%|9 | 59/625 [00:12<02:04, 4.56it/s]
reward: -6.4897, last reward: -5.9210, gradient norm: 11.63: 10%|9 | 60/625 [00:13<02:03, 4.56it/s]
reward: -6.2299, last reward: -7.8964, gradient norm: 13.35: 10%|9 | 61/625 [00:13<02:03, 4.56it/s]
reward: -6.0832, last reward: -9.3934, gradient norm: 4.456: 10%|9 | 62/625 [00:13<02:03, 4.56it/s]
reward: -5.8971, last reward: -10.2933, gradient norm: 10.74: 10%|# | 63/625 [00:13<02:03, 4.56it/s]
reward: -5.3377, last reward: -4.6996, gradient norm: 23.29: 10%|# | 64/625 [00:14<02:02, 4.57it/s]
reward: -5.2274, last reward: -2.8916, gradient norm: 4.098: 10%|# | 65/625 [00:14<02:02, 4.58it/s]
reward: -5.2660, last reward: -4.9110, gradient norm: 12.28: 11%|# | 66/625 [00:14<02:01, 4.59it/s]
reward: -5.4503, last reward: -5.6956, gradient norm: 12.22: 11%|# | 67/625 [00:14<02:01, 4.59it/s]
reward: -5.9172, last reward: -5.4026, gradient norm: 7.946: 11%|# | 68/625 [00:14<02:01, 4.60it/s]
reward: -5.9229, last reward: -4.5205, gradient norm: 6.294: 11%|#1 | 69/625 [00:15<02:00, 4.60it/s]
reward: -5.8872, last reward: -5.6637, gradient norm: 8.019: 11%|#1 | 70/625 [00:15<02:00, 4.59it/s]
reward: -5.9281, last reward: -4.2082, gradient norm: 5.724: 11%|#1 | 71/625 [00:15<02:00, 4.59it/s]
reward: -5.8561, last reward: -5.6574, gradient norm: 8.357: 12%|#1 | 72/625 [00:15<02:00, 4.59it/s]
reward: -5.4138, last reward: -4.5230, gradient norm: 7.385: 12%|#1 | 73/625 [00:15<02:00, 4.60it/s]
reward: -5.4065, last reward: -5.5642, gradient norm: 9.921: 12%|#1 | 74/625 [00:16<01:59, 4.60it/s]
reward: -4.9786, last reward: -3.2894, gradient norm: 32.73: 12%|#2 | 75/625 [00:16<01:59, 4.61it/s]
reward: -5.4129, last reward: -7.5831, gradient norm: 9.266: 12%|#2 | 76/625 [00:16<01:59, 4.61it/s]
reward: -5.7723, last reward: -7.4152, gradient norm: 5.608: 12%|#2 | 77/625 [00:16<01:58, 4.61it/s]
reward: -6.1604, last reward: -8.0898, gradient norm: 4.389: 12%|#2 | 78/625 [00:17<01:58, 4.61it/s]
reward: -6.5155, last reward: -5.5376, gradient norm: 36.34: 13%|#2 | 79/625 [00:17<01:58, 4.61it/s]
reward: -6.5616, last reward: -6.4094, gradient norm: 8.283: 13%|#2 | 80/625 [00:17<01:58, 4.61it/s]
reward: -6.5333, last reward: -7.4803, gradient norm: 5.895: 13%|#2 | 81/625 [00:17<01:58, 4.59it/s]
reward: -6.6566, last reward: -5.2588, gradient norm: 7.662: 13%|#3 | 82/625 [00:17<01:58, 4.60it/s]
reward: -6.4732, last reward: -6.7503, gradient norm: 6.068: 13%|#3 | 83/625 [00:18<01:57, 4.59it/s]
reward: -6.0714, last reward: -7.3370, gradient norm: 8.059: 13%|#3 | 84/625 [00:18<01:57, 4.59it/s]
reward: -5.8612, last reward: -6.1915, gradient norm: 9.3: 14%|#3 | 85/625 [00:18<01:57, 4.59it/s]
reward: -5.3855, last reward: -5.0349, gradient norm: 15.2: 14%|#3 | 86/625 [00:18<01:57, 4.59it/s]
reward: -4.9644, last reward: -3.4538, gradient norm: 3.445: 14%|#3 | 87/625 [00:19<01:57, 4.59it/s]
reward: -5.0392, last reward: -4.4080, gradient norm: 11.45: 14%|#4 | 88/625 [00:19<01:56, 4.60it/s]
reward: -5.1648, last reward: -5.9599, gradient norm: 143.4: 14%|#4 | 89/625 [00:19<01:56, 4.59it/s]
reward: -5.4284, last reward: -5.5946, gradient norm: 10.3: 14%|#4 | 90/625 [00:19<01:56, 4.60it/s]
reward: -5.2590, last reward: -5.9181, gradient norm: 11.15: 15%|#4 | 91/625 [00:19<01:56, 4.60it/s]
reward: -5.4621, last reward: -5.9075, gradient norm: 8.674: 15%|#4 | 92/625 [00:20<01:55, 4.60it/s]
reward: -5.1772, last reward: -4.9444, gradient norm: 8.351: 15%|#4 | 93/625 [00:20<01:56, 4.58it/s]
reward: -4.9391, last reward: -4.5595, gradient norm: 8.1: 15%|#5 | 94/625 [00:20<01:55, 4.59it/s]
reward: -4.8673, last reward: -4.6240, gradient norm: 14.43: 15%|#5 | 95/625 [00:20<01:55, 4.59it/s]
reward: -4.5919, last reward: -5.0018, gradient norm: 26.09: 15%|#5 | 96/625 [00:20<01:55, 4.60it/s]
reward: -5.1071, last reward: -3.9127, gradient norm: 2.251: 16%|#5 | 97/625 [00:21<01:54, 4.60it/s]
reward: -4.9799, last reward: -5.3131, gradient norm: 19.65: 16%|#5 | 98/625 [00:21<01:54, 4.60it/s]
reward: -4.9612, last reward: -3.9705, gradient norm: 12.55: 16%|#5 | 99/625 [00:21<01:54, 4.61it/s]
reward: -4.8741, last reward: -4.2230, gradient norm: 6.19: 16%|#6 | 100/625 [00:21<01:54, 4.58it/s]
reward: -5.0972, last reward: -5.0337, gradient norm: 11.86: 16%|#6 | 101/625 [00:22<01:54, 4.57it/s]
reward: -5.0350, last reward: -5.0654, gradient norm: 10.83: 16%|#6 | 102/625 [00:22<01:54, 4.58it/s]
reward: -5.2441, last reward: -4.4596, gradient norm: 7.362: 16%|#6 | 103/625 [00:22<01:53, 4.59it/s]
reward: -5.1664, last reward: -5.4362, gradient norm: 8.171: 17%|#6 | 104/625 [00:22<01:53, 4.59it/s]
reward: -5.4041, last reward: -5.6907, gradient norm: 7.77: 17%|#6 | 105/625 [00:22<01:53, 4.59it/s]
reward: -5.4664, last reward: -6.2760, gradient norm: 11.19: 17%|#6 | 106/625 [00:23<01:52, 4.59it/s]
reward: -5.0299, last reward: -3.9712, gradient norm: 9.349: 17%|#7 | 107/625 [00:23<01:52, 4.60it/s]
reward: -4.3332, last reward: -2.4479, gradient norm: 5.772: 17%|#7 | 108/625 [00:23<01:52, 4.60it/s]
reward: -4.4357, last reward: -2.9591, gradient norm: 4.543: 17%|#7 | 109/625 [00:23<01:52, 4.60it/s]
reward: -4.6216, last reward: -3.1353, gradient norm: 4.692: 18%|#7 | 110/625 [00:24<01:52, 4.59it/s]
reward: -4.6261, last reward: -3.7086, gradient norm: 4.496: 18%|#7 | 111/625 [00:24<01:51, 4.59it/s]
reward: -4.7758, last reward: -5.9818, gradient norm: 21.71: 18%|#7 | 112/625 [00:24<01:51, 4.60it/s]
reward: -4.7772, last reward: -7.5055, gradient norm: 62.86: 18%|#8 | 113/625 [00:24<01:51, 4.60it/s]
reward: -4.5840, last reward: -5.3180, gradient norm: 18.74: 18%|#8 | 114/625 [00:24<01:51, 4.60it/s]
reward: -4.2976, last reward: -3.2083, gradient norm: 10.63: 18%|#8 | 115/625 [00:25<01:51, 4.59it/s]
reward: -4.5275, last reward: -3.6873, gradient norm: 15.65: 19%|#8 | 116/625 [00:25<01:50, 4.59it/s]
reward: -4.4107, last reward: -3.1624, gradient norm: 19.7: 19%|#8 | 117/625 [00:25<01:50, 4.59it/s]
reward: -4.6372, last reward: -3.2571, gradient norm: 15.83: 19%|#8 | 118/625 [00:25<01:50, 4.60it/s]
reward: -4.4039, last reward: -4.4428, gradient norm: 13.06: 19%|#9 | 119/625 [00:25<01:49, 4.60it/s]
reward: -4.4728, last reward: -3.5628, gradient norm: 12.04: 19%|#9 | 120/625 [00:26<01:49, 4.60it/s]
reward: -4.6767, last reward: -5.2466, gradient norm: 6.522: 19%|#9 | 121/625 [00:26<01:49, 4.60it/s]
reward: -4.5873, last reward: -6.5072, gradient norm: 19.21: 20%|#9 | 122/625 [00:26<01:49, 4.61it/s]
reward: -4.6548, last reward: -6.3766, gradient norm: 5.692: 20%|#9 | 123/625 [00:26<01:48, 4.61it/s]
reward: -4.5134, last reward: -7.1955, gradient norm: 11.11: 20%|#9 | 124/625 [00:27<01:48, 4.62it/s]
reward: -4.2481, last reward: -7.0591, gradient norm: 11.85: 20%|## | 125/625 [00:27<01:48, 4.62it/s]
reward: -4.4500, last reward: -5.3368, gradient norm: 10.19: 20%|## | 126/625 [00:27<01:48, 4.62it/s]
reward: -3.9708, last reward: -2.7059, gradient norm: 42.81: 20%|## | 127/625 [00:27<01:47, 4.61it/s]
reward: -4.3031, last reward: -3.2534, gradient norm: 4.843: 20%|## | 128/625 [00:27<01:47, 4.62it/s]
reward: -4.3327, last reward: -4.6193, gradient norm: 20.96: 21%|## | 129/625 [00:28<01:47, 4.62it/s]
reward: -4.4831, last reward: -4.1172, gradient norm: 24.81: 21%|## | 130/625 [00:28<01:47, 4.62it/s]
reward: -4.2593, last reward: -4.4219, gradient norm: 5.962: 21%|## | 131/625 [00:28<01:47, 4.60it/s]
reward: -4.4800, last reward: -3.8380, gradient norm: 2.899: 21%|##1 | 132/625 [00:28<01:46, 4.61it/s]
reward: -4.2721, last reward: -4.9048, gradient norm: 7.166: 21%|##1 | 133/625 [00:29<01:46, 4.60it/s]
reward: -4.2419, last reward: -4.5248, gradient norm: 25.93: 21%|##1 | 134/625 [00:29<01:46, 4.61it/s]
reward: -4.2139, last reward: -4.4278, gradient norm: 20.26: 22%|##1 | 135/625 [00:29<01:46, 4.62it/s]
reward: -4.0690, last reward: -2.5140, gradient norm: 22.5: 22%|##1 | 136/625 [00:29<01:46, 4.61it/s]
reward: -4.1140, last reward: -3.7402, gradient norm: 11.11: 22%|##1 | 137/625 [00:29<01:45, 4.61it/s]
reward: -4.5356, last reward: -5.1636, gradient norm: 400.1: 22%|##2 | 138/625 [00:30<01:45, 4.60it/s]
reward: -5.0671, last reward: -5.8798, gradient norm: 13.34: 22%|##2 | 139/625 [00:30<01:45, 4.60it/s]
reward: -4.8918, last reward: -6.3298, gradient norm: 7.307: 22%|##2 | 140/625 [00:30<01:45, 4.61it/s]
reward: -5.1779, last reward: -4.1915, gradient norm: 11.43: 23%|##2 | 141/625 [00:30<01:45, 4.61it/s]
reward: -5.1771, last reward: -4.3624, gradient norm: 6.936: 23%|##2 | 142/625 [00:30<01:44, 4.60it/s]
reward: -5.1683, last reward: -3.4810, gradient norm: 13.29: 23%|##2 | 143/625 [00:31<01:44, 4.60it/s]
reward: -4.9373, last reward: -5.4435, gradient norm: 19.33: 23%|##3 | 144/625 [00:31<01:44, 4.60it/s]
reward: -4.4396, last reward: -4.8092, gradient norm: 118.9: 23%|##3 | 145/625 [00:31<01:44, 4.60it/s]
reward: -4.3911, last reward: -8.2572, gradient norm: 15.04: 23%|##3 | 146/625 [00:31<01:44, 4.60it/s]
reward: -4.4212, last reward: -3.0260, gradient norm: 26.01: 23%|##3 | 146/625 [00:32<01:44, 4.60it/s]
reward: -4.4212, last reward: -3.0260, gradient norm: 26.01: 24%|##3 | 147/625 [00:32<01:43, 4.60it/s]
reward: -4.0939, last reward: -4.6478, gradient norm: 9.605: 24%|##3 | 147/625 [00:32<01:43, 4.60it/s]
reward: -4.0939, last reward: -4.6478, gradient norm: 9.605: 24%|##3 | 148/625 [00:32<01:43, 4.60it/s]
reward: -4.6606, last reward: -4.7289, gradient norm: 11.19: 24%|##3 | 148/625 [00:32<01:43, 4.60it/s]
reward: -4.6606, last reward: -4.7289, gradient norm: 11.19: 24%|##3 | 149/625 [00:32<01:43, 4.60it/s]
reward: -4.9300, last reward: -4.7193, gradient norm: 8.563: 24%|##3 | 149/625 [00:32<01:43, 4.60it/s]
reward: -4.9300, last reward: -4.7193, gradient norm: 8.563: 24%|##4 | 150/625 [00:32<01:43, 4.59it/s]
reward: -5.1166, last reward: -4.8514, gradient norm: 8.384: 24%|##4 | 150/625 [00:32<01:43, 4.59it/s]
reward: -5.1166, last reward: -4.8514, gradient norm: 8.384: 24%|##4 | 151/625 [00:32<01:43, 4.59it/s]
reward: -4.9108, last reward: -5.0672, gradient norm: 9.292: 24%|##4 | 151/625 [00:33<01:43, 4.59it/s]
reward: -4.9108, last reward: -5.0672, gradient norm: 9.292: 24%|##4 | 152/625 [00:33<01:45, 4.50it/s]
reward: -4.8591, last reward: -4.3768, gradient norm: 9.72: 24%|##4 | 152/625 [00:33<01:45, 4.50it/s]
reward: -4.8591, last reward: -4.3768, gradient norm: 9.72: 24%|##4 | 153/625 [00:33<01:44, 4.52it/s]
reward: -4.2721, last reward: -3.9976, gradient norm: 10.37: 24%|##4 | 153/625 [00:33<01:44, 4.52it/s]
reward: -4.2721, last reward: -3.9976, gradient norm: 10.37: 25%|##4 | 154/625 [00:33<01:43, 4.55it/s]
reward: -4.0576, last reward: -2.0067, gradient norm: 8.935: 25%|##4 | 154/625 [00:33<01:43, 4.55it/s]
reward: -4.0576, last reward: -2.0067, gradient norm: 8.935: 25%|##4 | 155/625 [00:33<01:42, 4.56it/s]
reward: -4.4199, last reward: -5.1722, gradient norm: 18.7: 25%|##4 | 155/625 [00:34<01:42, 4.56it/s]
reward: -4.4199, last reward: -5.1722, gradient norm: 18.7: 25%|##4 | 156/625 [00:34<01:42, 4.57it/s]
reward: -4.8310, last reward: -7.3466, gradient norm: 28.52: 25%|##4 | 156/625 [00:34<01:42, 4.57it/s]
reward: -4.8310, last reward: -7.3466, gradient norm: 28.52: 25%|##5 | 157/625 [00:34<01:42, 4.58it/s]
reward: -4.8631, last reward: -6.2492, gradient norm: 89.17: 25%|##5 | 157/625 [00:34<01:42, 4.58it/s]
reward: -4.8631, last reward: -6.2492, gradient norm: 89.17: 25%|##5 | 158/625 [00:34<01:41, 4.60it/s]
reward: -4.8763, last reward: -6.1277, gradient norm: 24.43: 25%|##5 | 158/625 [00:34<01:41, 4.60it/s]
reward: -4.8763, last reward: -6.1277, gradient norm: 24.43: 25%|##5 | 159/625 [00:34<01:41, 4.59it/s]
reward: -4.5562, last reward: -5.7446, gradient norm: 23.35: 25%|##5 | 159/625 [00:34<01:41, 4.59it/s]
reward: -4.5562, last reward: -5.7446, gradient norm: 23.35: 26%|##5 | 160/625 [00:34<01:41, 4.60it/s]
reward: -4.1082, last reward: -4.9830, gradient norm: 22.14: 26%|##5 | 160/625 [00:35<01:41, 4.60it/s]
reward: -4.1082, last reward: -4.9830, gradient norm: 22.14: 26%|##5 | 161/625 [00:35<01:40, 4.59it/s]
reward: -4.0946, last reward: -2.5229, gradient norm: 10.47: 26%|##5 | 161/625 [00:35<01:40, 4.59it/s]
reward: -4.0946, last reward: -2.5229, gradient norm: 10.47: 26%|##5 | 162/625 [00:35<01:40, 4.60it/s]
reward: -4.4574, last reward: -4.6900, gradient norm: 112.6: 26%|##5 | 162/625 [00:35<01:40, 4.60it/s]
reward: -4.4574, last reward: -4.6900, gradient norm: 112.6: 26%|##6 | 163/625 [00:35<01:40, 4.60it/s]
reward: -5.2229, last reward: -4.0318, gradient norm: 6.482: 26%|##6 | 163/625 [00:35<01:40, 4.60it/s]
reward: -5.2229, last reward: -4.0318, gradient norm: 6.482: 26%|##6 | 164/625 [00:35<01:40, 4.60it/s]
reward: -5.0543, last reward: -4.0817, gradient norm: 5.761: 26%|##6 | 164/625 [00:35<01:40, 4.60it/s]
reward: -5.0543, last reward: -4.0817, gradient norm: 5.761: 26%|##6 | 165/625 [00:35<01:40, 4.60it/s]
reward: -5.2809, last reward: -4.5118, gradient norm: 5.366: 26%|##6 | 165/625 [00:36<01:40, 4.60it/s]
reward: -5.2809, last reward: -4.5118, gradient norm: 5.366: 27%|##6 | 166/625 [00:36<01:39, 4.60it/s]
reward: -5.1142, last reward: -4.5635, gradient norm: 5.04: 27%|##6 | 166/625 [00:36<01:39, 4.60it/s]
reward: -5.1142, last reward: -4.5635, gradient norm: 5.04: 27%|##6 | 167/625 [00:36<01:39, 4.61it/s]
reward: -5.1949, last reward: -4.2327, gradient norm: 4.982: 27%|##6 | 167/625 [00:36<01:39, 4.61it/s]
reward: -5.1949, last reward: -4.2327, gradient norm: 4.982: 27%|##6 | 168/625 [00:36<01:39, 4.60it/s]
reward: -5.0967, last reward: -5.0387, gradient norm: 7.457: 27%|##6 | 168/625 [00:36<01:39, 4.60it/s]
reward: -5.0967, last reward: -5.0387, gradient norm: 7.457: 27%|##7 | 169/625 [00:36<01:39, 4.60it/s]
reward: -5.0782, last reward: -5.2150, gradient norm: 10.54: 27%|##7 | 169/625 [00:37<01:39, 4.60it/s]
reward: -5.0782, last reward: -5.2150, gradient norm: 10.54: 27%|##7 | 170/625 [00:37<01:38, 4.60it/s]
reward: -4.5222, last reward: -4.3725, gradient norm: 22.63: 27%|##7 | 170/625 [00:37<01:38, 4.60it/s]
reward: -4.5222, last reward: -4.3725, gradient norm: 22.63: 27%|##7 | 171/625 [00:37<01:38, 4.60it/s]
reward: -3.9288, last reward: -3.9837, gradient norm: 83.59: 27%|##7 | 171/625 [00:37<01:38, 4.60it/s]
reward: -3.9288, last reward: -3.9837, gradient norm: 83.59: 28%|##7 | 172/625 [00:37<01:38, 4.60it/s]
reward: -4.1416, last reward: -4.1099, gradient norm: 30.57: 28%|##7 | 172/625 [00:37<01:38, 4.60it/s]
reward: -4.1416, last reward: -4.1099, gradient norm: 30.57: 28%|##7 | 173/625 [00:37<01:38, 4.60it/s]
reward: -4.8620, last reward: -6.8475, gradient norm: 18.91: 28%|##7 | 173/625 [00:37<01:38, 4.60it/s]
reward: -4.8620, last reward: -6.8475, gradient norm: 18.91: 28%|##7 | 174/625 [00:37<01:38, 4.60it/s]
reward: -5.1807, last reward: -6.4375, gradient norm: 18.48: 28%|##7 | 174/625 [00:38<01:38, 4.60it/s]
reward: -5.1807, last reward: -6.4375, gradient norm: 18.48: 28%|##8 | 175/625 [00:38<01:37, 4.61it/s]
reward: -5.1148, last reward: -5.0645, gradient norm: 14.36: 28%|##8 | 175/625 [00:38<01:37, 4.61it/s]
reward: -5.1148, last reward: -5.0645, gradient norm: 14.36: 28%|##8 | 176/625 [00:38<01:37, 4.61it/s]
reward: -5.2751, last reward: -4.8313, gradient norm: 15.32: 28%|##8 | 176/625 [00:38<01:37, 4.61it/s]
reward: -5.2751, last reward: -4.8313, gradient norm: 15.32: 28%|##8 | 177/625 [00:38<01:37, 4.61it/s]
reward: -4.9286, last reward: -6.9770, gradient norm: 24.75: 28%|##8 | 177/625 [00:38<01:37, 4.61it/s]
reward: -4.9286, last reward: -6.9770, gradient norm: 24.75: 28%|##8 | 178/625 [00:38<01:36, 4.61it/s]
reward: -4.5735, last reward: -5.2837, gradient norm: 15.2: 28%|##8 | 178/625 [00:39<01:36, 4.61it/s]
reward: -4.5735, last reward: -5.2837, gradient norm: 15.2: 29%|##8 | 179/625 [00:39<01:36, 4.61it/s]
reward: -4.2926, last reward: -1.9489, gradient norm: 18.24: 29%|##8 | 179/625 [00:39<01:36, 4.61it/s]
reward: -4.2926, last reward: -1.9489, gradient norm: 18.24: 29%|##8 | 180/625 [00:39<01:36, 4.61it/s]
reward: -4.1507, last reward: -3.5593, gradient norm: 37.66: 29%|##8 | 180/625 [00:39<01:36, 4.61it/s]
reward: -4.1507, last reward: -3.5593, gradient norm: 37.66: 29%|##8 | 181/625 [00:39<01:36, 4.62it/s]
reward: -3.8724, last reward: -4.3567, gradient norm: 16.67: 29%|##8 | 181/625 [00:39<01:36, 4.62it/s]
reward: -3.8724, last reward: -4.3567, gradient norm: 16.67: 29%|##9 | 182/625 [00:39<01:35, 4.62it/s]
reward: -4.3574, last reward: -3.6140, gradient norm: 13.96: 29%|##9 | 182/625 [00:39<01:35, 4.62it/s]
reward: -4.3574, last reward: -3.6140, gradient norm: 13.96: 29%|##9 | 183/625 [00:39<01:35, 4.62it/s]
reward: -4.7895, last reward: -6.2518, gradient norm: 14.74: 29%|##9 | 183/625 [00:40<01:35, 4.62it/s]
reward: -4.7895, last reward: -6.2518, gradient norm: 14.74: 29%|##9 | 184/625 [00:40<01:35, 4.62it/s]
reward: -4.6146, last reward: -5.6969, gradient norm: 11.45: 29%|##9 | 184/625 [00:40<01:35, 4.62it/s]
reward: -4.6146, last reward: -5.6969, gradient norm: 11.45: 30%|##9 | 185/625 [00:40<01:35, 4.62it/s]
reward: -4.8776, last reward: -5.7358, gradient norm: 13.16: 30%|##9 | 185/625 [00:40<01:35, 4.62it/s]
reward: -4.8776, last reward: -5.7358, gradient norm: 13.16: 30%|##9 | 186/625 [00:40<01:35, 4.61it/s]
reward: -4.3722, last reward: -4.8428, gradient norm: 23.57: 30%|##9 | 186/625 [00:40<01:35, 4.61it/s]
reward: -4.3722, last reward: -4.8428, gradient norm: 23.57: 30%|##9 | 187/625 [00:40<01:35, 4.61it/s]
reward: -4.2656, last reward: -3.7955, gradient norm: 54.67: 30%|##9 | 187/625 [00:40<01:35, 4.61it/s]
reward: -4.2656, last reward: -3.7955, gradient norm: 54.67: 30%|### | 188/625 [00:40<01:34, 4.61it/s]
reward: -4.0092, last reward: -1.7106, gradient norm: 7.829: 30%|### | 188/625 [00:41<01:34, 4.61it/s]
reward: -4.0092, last reward: -1.7106, gradient norm: 7.829: 30%|### | 189/625 [00:41<01:34, 4.61it/s]
reward: -4.2264, last reward: -3.6919, gradient norm: 16.17: 30%|### | 189/625 [00:41<01:34, 4.61it/s]
reward: -4.2264, last reward: -3.6919, gradient norm: 16.17: 30%|### | 190/625 [00:41<01:34, 4.61it/s]
reward: -4.1438, last reward: -2.1362, gradient norm: 19.43: 30%|### | 190/625 [00:41<01:34, 4.61it/s]
reward: -4.1438, last reward: -2.1362, gradient norm: 19.43: 31%|### | 191/625 [00:41<01:34, 4.61it/s]
reward: -4.0618, last reward: -2.8217, gradient norm: 73.63: 31%|### | 191/625 [00:41<01:34, 4.61it/s]
reward: -4.0618, last reward: -2.8217, gradient norm: 73.63: 31%|### | 192/625 [00:41<01:33, 4.61it/s]
reward: -3.9420, last reward: -3.6765, gradient norm: 34.1: 31%|### | 192/625 [00:42<01:33, 4.61it/s]
reward: -3.9420, last reward: -3.6765, gradient norm: 34.1: 31%|### | 193/625 [00:42<01:33, 4.61it/s]
reward: -3.7745, last reward: -4.0709, gradient norm: 26.48: 31%|### | 193/625 [00:42<01:33, 4.61it/s]
reward: -3.7745, last reward: -4.0709, gradient norm: 26.48: 31%|###1 | 194/625 [00:42<01:33, 4.61it/s]
reward: -3.9478, last reward: -2.6867, gradient norm: 22.82: 31%|###1 | 194/625 [00:42<01:33, 4.61it/s]
reward: -3.9478, last reward: -2.6867, gradient norm: 22.82: 31%|###1 | 195/625 [00:42<01:33, 4.61it/s]
reward: -3.6507, last reward: -2.6225, gradient norm: 37.44: 31%|###1 | 195/625 [00:42<01:33, 4.61it/s]
reward: -3.6507, last reward: -2.6225, gradient norm: 37.44: 31%|###1 | 196/625 [00:42<01:33, 4.61it/s]
reward: -4.2244, last reward: -3.2195, gradient norm: 10.71: 31%|###1 | 196/625 [00:42<01:33, 4.61it/s]
reward: -4.2244, last reward: -3.2195, gradient norm: 10.71: 32%|###1 | 197/625 [00:42<01:33, 4.60it/s]
reward: -4.5385, last reward: -3.9263, gradient norm: 31.03: 32%|###1 | 197/625 [00:43<01:33, 4.60it/s]
reward: -4.5385, last reward: -3.9263, gradient norm: 31.03: 32%|###1 | 198/625 [00:43<01:32, 4.60it/s]
reward: -4.1878, last reward: -3.2374, gradient norm: 34.35: 32%|###1 | 198/625 [00:43<01:32, 4.60it/s]
reward: -4.1878, last reward: -3.2374, gradient norm: 34.35: 32%|###1 | 199/625 [00:43<01:32, 4.60it/s]
reward: -3.8054, last reward: -2.3504, gradient norm: 5.557: 32%|###1 | 199/625 [00:43<01:32, 4.60it/s]
reward: -3.8054, last reward: -2.3504, gradient norm: 5.557: 32%|###2 | 200/625 [00:43<01:32, 4.60it/s]
reward: -4.0766, last reward: -4.6825, gradient norm: 38.72: 32%|###2 | 200/625 [00:43<01:32, 4.60it/s]
reward: -4.0766, last reward: -4.6825, gradient norm: 38.72: 32%|###2 | 201/625 [00:43<01:32, 4.60it/s]
reward: -4.2011, last reward: -5.8393, gradient norm: 21.06: 32%|###2 | 201/625 [00:44<01:32, 4.60it/s]
reward: -4.2011, last reward: -5.8393, gradient norm: 21.06: 32%|###2 | 202/625 [00:44<01:31, 4.61it/s]
reward: -4.0803, last reward: -3.7815, gradient norm: 10.6: 32%|###2 | 202/625 [00:44<01:31, 4.61it/s]
reward: -4.0803, last reward: -3.7815, gradient norm: 10.6: 32%|###2 | 203/625 [00:44<01:31, 4.61it/s]
reward: -3.8363, last reward: -3.2460, gradient norm: 32.57: 32%|###2 | 203/625 [00:44<01:31, 4.61it/s]
reward: -3.8363, last reward: -3.2460, gradient norm: 32.57: 33%|###2 | 204/625 [00:44<01:31, 4.61it/s]
reward: -3.8643, last reward: -3.2191, gradient norm: 8.593: 33%|###2 | 204/625 [00:44<01:31, 4.61it/s]
reward: -3.8643, last reward: -3.2191, gradient norm: 8.593: 33%|###2 | 205/625 [00:44<01:31, 4.61it/s]
reward: -4.0773, last reward: -5.1343, gradient norm: 14.49: 33%|###2 | 205/625 [00:44<01:31, 4.61it/s]
reward: -4.0773, last reward: -5.1343, gradient norm: 14.49: 33%|###2 | 206/625 [00:44<01:30, 4.61it/s]
reward: -4.1400, last reward: -5.8657, gradient norm: 17.05: 33%|###2 | 206/625 [00:45<01:30, 4.61it/s]
reward: -4.1400, last reward: -5.8657, gradient norm: 17.05: 33%|###3 | 207/625 [00:45<01:30, 4.60it/s]
reward: -3.9304, last reward: -2.7584, gradient norm: 33.25: 33%|###3 | 207/625 [00:45<01:30, 4.60it/s]
reward: -3.9304, last reward: -2.7584, gradient norm: 33.25: 33%|###3 | 208/625 [00:45<01:30, 4.60it/s]
reward: -3.8752, last reward: -4.2307, gradient norm: 10.76: 33%|###3 | 208/625 [00:45<01:30, 4.60it/s]
reward: -3.8752, last reward: -4.2307, gradient norm: 10.76: 33%|###3 | 209/625 [00:45<01:30, 4.61it/s]
reward: -3.5250, last reward: -1.4869, gradient norm: 40.8: 33%|###3 | 209/625 [00:45<01:30, 4.61it/s]
reward: -3.5250, last reward: -1.4869, gradient norm: 40.8: 34%|###3 | 210/625 [00:45<01:29, 4.61it/s]
reward: -3.7837, last reward: -2.5762, gradient norm: 193.3: 34%|###3 | 210/625 [00:45<01:29, 4.61it/s]
reward: -3.7837, last reward: -2.5762, gradient norm: 193.3: 34%|###3 | 211/625 [00:45<01:29, 4.61it/s]
reward: -3.6661, last reward: -1.8600, gradient norm: 136.5: 34%|###3 | 211/625 [00:46<01:29, 4.61it/s]
reward: -3.6661, last reward: -1.8600, gradient norm: 136.5: 34%|###3 | 212/625 [00:46<01:29, 4.61it/s]
reward: -4.2502, last reward: -3.1752, gradient norm: 21.44: 34%|###3 | 212/625 [00:46<01:29, 4.61it/s]
reward: -4.2502, last reward: -3.1752, gradient norm: 21.44: 34%|###4 | 213/625 [00:46<01:29, 4.61it/s]
reward: -4.3075, last reward: -2.8871, gradient norm: 30.65: 34%|###4 | 213/625 [00:46<01:29, 4.61it/s]
reward: -4.3075, last reward: -2.8871, gradient norm: 30.65: 34%|###4 | 214/625 [00:46<01:29, 4.61it/s]
reward: -3.9406, last reward: -2.8090, gradient norm: 20.18: 34%|###4 | 214/625 [00:46<01:29, 4.61it/s]
reward: -3.9406, last reward: -2.8090, gradient norm: 20.18: 34%|###4 | 215/625 [00:46<01:28, 4.61it/s]
reward: -3.6291, last reward: -2.8923, gradient norm: 7.876: 34%|###4 | 215/625 [00:47<01:28, 4.61it/s]
reward: -3.6291, last reward: -2.8923, gradient norm: 7.876: 35%|###4 | 216/625 [00:47<01:28, 4.60it/s]
reward: -3.5112, last reward: -3.9504, gradient norm: 3.21e+03: 35%|###4 | 216/625 [00:47<01:28, 4.60it/s]
reward: -3.5112, last reward: -3.9504, gradient norm: 3.21e+03: 35%|###4 | 217/625 [00:47<01:28, 4.60it/s]
reward: -3.7431, last reward: -2.7880, gradient norm: 13.73: 35%|###4 | 217/625 [00:47<01:28, 4.60it/s]
reward: -3.7431, last reward: -2.7880, gradient norm: 13.73: 35%|###4 | 218/625 [00:47<01:28, 4.60it/s]
reward: -3.4463, last reward: -4.5432, gradient norm: 32.37: 35%|###4 | 218/625 [00:47<01:28, 4.60it/s]
reward: -3.4463, last reward: -4.5432, gradient norm: 32.37: 35%|###5 | 219/625 [00:47<01:28, 4.59it/s]
reward: -3.3793, last reward: -3.3313, gradient norm: 60.63: 35%|###5 | 219/625 [00:47<01:28, 4.59it/s]
reward: -3.3793, last reward: -3.3313, gradient norm: 60.63: 35%|###5 | 220/625 [00:47<01:28, 4.59it/s]
reward: -3.8843, last reward: -3.0369, gradient norm: 5.065: 35%|###5 | 220/625 [00:48<01:28, 4.59it/s]
reward: -3.8843, last reward: -3.0369, gradient norm: 5.065: 35%|###5 | 221/625 [00:48<01:28, 4.59it/s]
reward: -3.4828, last reward: -3.8391, gradient norm: 59.85: 35%|###5 | 221/625 [00:48<01:28, 4.59it/s]
reward: -3.4828, last reward: -3.8391, gradient norm: 59.85: 36%|###5 | 222/625 [00:48<01:27, 4.59it/s]
reward: -3.6265, last reward: -4.2913, gradient norm: 8.947: 36%|###5 | 222/625 [00:48<01:27, 4.59it/s]
reward: -3.6265, last reward: -4.2913, gradient norm: 8.947: 36%|###5 | 223/625 [00:48<01:27, 4.59it/s]
reward: -3.5541, last reward: -4.1252, gradient norm: 255.9: 36%|###5 | 223/625 [00:48<01:27, 4.59it/s]
reward: -3.5541, last reward: -4.1252, gradient norm: 255.9: 36%|###5 | 224/625 [00:48<01:27, 4.60it/s]
reward: -3.7342, last reward: -2.2396, gradient norm: 7.995: 36%|###5 | 224/625 [00:49<01:27, 4.60it/s]
reward: -3.7342, last reward: -2.2396, gradient norm: 7.995: 36%|###6 | 225/625 [00:49<01:26, 4.60it/s]
reward: -3.5936, last reward: -4.1924, gradient norm: 59.49: 36%|###6 | 225/625 [00:49<01:26, 4.60it/s]
reward: -3.5936, last reward: -4.1924, gradient norm: 59.49: 36%|###6 | 226/625 [00:49<01:26, 4.59it/s]
reward: -3.9975, last reward: -4.2045, gradient norm: 21.77: 36%|###6 | 226/625 [00:49<01:26, 4.59it/s]
reward: -3.9975, last reward: -4.2045, gradient norm: 21.77: 36%|###6 | 227/625 [00:49<01:26, 4.59it/s]
reward: -3.8367, last reward: -1.9540, gradient norm: 32.26: 36%|###6 | 227/625 [00:49<01:26, 4.59it/s]
reward: -3.8367, last reward: -1.9540, gradient norm: 32.26: 36%|###6 | 228/625 [00:49<01:26, 4.60it/s]
reward: -3.7259, last reward: -3.6743, gradient norm: 28.62: 36%|###6 | 228/625 [00:49<01:26, 4.60it/s]
reward: -3.7259, last reward: -3.6743, gradient norm: 28.62: 37%|###6 | 229/625 [00:49<01:26, 4.60it/s]
reward: -3.4827, last reward: -3.7528, gradient norm: 64.85: 37%|###6 | 229/625 [00:50<01:26, 4.60it/s]
reward: -3.4827, last reward: -3.7528, gradient norm: 64.85: 37%|###6 | 230/625 [00:50<01:25, 4.60it/s]
reward: -3.7361, last reward: -3.8756, gradient norm: 24.69: 37%|###6 | 230/625 [00:50<01:25, 4.60it/s]
reward: -3.7361, last reward: -3.8756, gradient norm: 24.69: 37%|###6 | 231/625 [00:50<01:25, 4.61it/s]
reward: -3.7646, last reward: -3.1116, gradient norm: 14.25: 37%|###6 | 231/625 [00:50<01:25, 4.61it/s]
reward: -3.7646, last reward: -3.1116, gradient norm: 14.25: 37%|###7 | 232/625 [00:50<01:25, 4.61it/s]
reward: -3.5426, last reward: -2.8385, gradient norm: 34.07: 37%|###7 | 232/625 [00:50<01:25, 4.61it/s]
reward: -3.5426, last reward: -2.8385, gradient norm: 34.07: 37%|###7 | 233/625 [00:50<01:25, 4.61it/s]
reward: -3.5662, last reward: -1.8585, gradient norm: 11.26: 37%|###7 | 233/625 [00:50<01:25, 4.61it/s]
reward: -3.5662, last reward: -1.8585, gradient norm: 11.26: 37%|###7 | 234/625 [00:50<01:25, 4.60it/s]
reward: -3.8234, last reward: -2.7930, gradient norm: 32.18: 37%|###7 | 234/625 [00:51<01:25, 4.60it/s]
reward: -3.8234, last reward: -2.7930, gradient norm: 32.18: 38%|###7 | 235/625 [00:51<01:24, 4.60it/s]
reward: -4.2648, last reward: -4.9309, gradient norm: 24.83: 38%|###7 | 235/625 [00:51<01:24, 4.60it/s]
reward: -4.2648, last reward: -4.9309, gradient norm: 24.83: 38%|###7 | 236/625 [00:51<01:24, 4.60it/s]
reward: -4.2039, last reward: -3.6817, gradient norm: 19.24: 38%|###7 | 236/625 [00:51<01:24, 4.60it/s]
reward: -4.2039, last reward: -3.6817, gradient norm: 19.24: 38%|###7 | 237/625 [00:51<01:24, 4.59it/s]
reward: -4.0943, last reward: -3.1533, gradient norm: 145.1: 38%|###7 | 237/625 [00:51<01:24, 4.59it/s]
reward: -4.0943, last reward: -3.1533, gradient norm: 145.1: 38%|###8 | 238/625 [00:51<01:24, 4.59it/s]
reward: -4.3045, last reward: -3.0483, gradient norm: 20.89: 38%|###8 | 238/625 [00:52<01:24, 4.59it/s]
reward: -4.3045, last reward: -3.0483, gradient norm: 20.89: 38%|###8 | 239/625 [00:52<01:24, 4.59it/s]
reward: -4.4128, last reward: -5.2528, gradient norm: 24.97: 38%|###8 | 239/625 [00:52<01:24, 4.59it/s]
reward: -4.4128, last reward: -5.2528, gradient norm: 24.97: 38%|###8 | 240/625 [00:52<01:24, 4.58it/s]
reward: -4.6415, last reward: -8.0201, gradient norm: 26.74: 38%|###8 | 240/625 [00:52<01:24, 4.58it/s]
reward: -4.6415, last reward: -8.0201, gradient norm: 26.74: 39%|###8 | 241/625 [00:52<01:23, 4.57it/s]
reward: -4.4437, last reward: -5.4365, gradient norm: 132.7: 39%|###8 | 241/625 [00:52<01:23, 4.57it/s]
reward: -4.4437, last reward: -5.4365, gradient norm: 132.7: 39%|###8 | 242/625 [00:52<01:23, 4.58it/s]
reward: -4.0358, last reward: -3.4943, gradient norm: 11.46: 39%|###8 | 242/625 [00:52<01:23, 4.58it/s]
reward: -4.0358, last reward: -3.4943, gradient norm: 11.46: 39%|###8 | 243/625 [00:52<01:23, 4.59it/s]
reward: -4.1272, last reward: -3.5003, gradient norm: 68.09: 39%|###8 | 243/625 [00:53<01:23, 4.59it/s]
reward: -4.1272, last reward: -3.5003, gradient norm: 68.09: 39%|###9 | 244/625 [00:53<01:22, 4.59it/s]
reward: -4.1180, last reward: -4.2637, gradient norm: 39.25: 39%|###9 | 244/625 [00:53<01:22, 4.59it/s]
reward: -4.1180, last reward: -4.2637, gradient norm: 39.25: 39%|###9 | 245/625 [00:53<01:22, 4.60it/s]
reward: -4.7197, last reward: -3.0873, gradient norm: 12.2: 39%|###9 | 245/625 [00:53<01:22, 4.60it/s]
reward: -4.7197, last reward: -3.0873, gradient norm: 12.2: 39%|###9 | 246/625 [00:53<01:22, 4.60it/s]
reward: -4.2917, last reward: -3.6656, gradient norm: 17.17: 39%|###9 | 246/625 [00:53<01:22, 4.60it/s]
reward: -4.2917, last reward: -3.6656, gradient norm: 17.17: 40%|###9 | 247/625 [00:53<01:22, 4.59it/s]
reward: -4.0160, last reward: -3.0738, gradient norm: 43.07: 40%|###9 | 247/625 [00:54<01:22, 4.59it/s]
reward: -4.0160, last reward: -3.0738, gradient norm: 43.07: 40%|###9 | 248/625 [00:54<01:22, 4.59it/s]
reward: -4.3689, last reward: -4.0120, gradient norm: 11.81: 40%|###9 | 248/625 [00:54<01:22, 4.59it/s]
reward: -4.3689, last reward: -4.0120, gradient norm: 11.81: 40%|###9 | 249/625 [00:54<01:21, 4.59it/s]
reward: -4.5570, last reward: -7.0475, gradient norm: 22.45: 40%|###9 | 249/625 [00:54<01:21, 4.59it/s]
reward: -4.5570, last reward: -7.0475, gradient norm: 22.45: 40%|#### | 250/625 [00:54<01:21, 4.60it/s]
reward: -4.4423, last reward: -5.2220, gradient norm: 18.4: 40%|#### | 250/625 [00:54<01:21, 4.60it/s]
reward: -4.4423, last reward: -5.2220, gradient norm: 18.4: 40%|#### | 251/625 [00:54<01:21, 4.60it/s]
reward: -4.2118, last reward: -4.6803, gradient norm: 15.86: 40%|#### | 251/625 [00:54<01:21, 4.60it/s]
reward: -4.2118, last reward: -4.6803, gradient norm: 15.86: 40%|#### | 252/625 [00:54<01:21, 4.60it/s]
reward: -4.1465, last reward: -3.7214, gradient norm: 25.93: 40%|#### | 252/625 [00:55<01:21, 4.60it/s]
reward: -4.1465, last reward: -3.7214, gradient norm: 25.93: 40%|#### | 253/625 [00:55<01:20, 4.61it/s]
reward: -3.8801, last reward: -2.7034, gradient norm: 103.6: 40%|#### | 253/625 [00:55<01:20, 4.61it/s]
reward: -3.8801, last reward: -2.7034, gradient norm: 103.6: 41%|#### | 254/625 [00:55<01:20, 4.61it/s]
reward: -3.9136, last reward: -4.4076, gradient norm: 17.63: 41%|#### | 254/625 [00:55<01:20, 4.61it/s]
reward: -3.9136, last reward: -4.4076, gradient norm: 17.63: 41%|#### | 255/625 [00:55<01:20, 4.61it/s]
reward: -3.7589, last reward: -4.5013, gradient norm: 143.3: 41%|#### | 255/625 [00:55<01:20, 4.61it/s]
reward: -3.7589, last reward: -4.5013, gradient norm: 143.3: 41%|#### | 256/625 [00:55<01:20, 4.61it/s]
reward: -3.8150, last reward: -3.2241, gradient norm: 113.9: 41%|#### | 256/625 [00:55<01:20, 4.61it/s]
reward: -3.8150, last reward: -3.2241, gradient norm: 113.9: 41%|####1 | 257/625 [00:55<01:19, 4.61it/s]
reward: -4.0753, last reward: -3.8081, gradient norm: 14.8: 41%|####1 | 257/625 [00:56<01:19, 4.61it/s]
reward: -4.0753, last reward: -3.8081, gradient norm: 14.8: 41%|####1 | 258/625 [00:56<01:19, 4.61it/s]
reward: -4.1951, last reward: -4.8314, gradient norm: 27.63: 41%|####1 | 258/625 [00:56<01:19, 4.61it/s]
reward: -4.1951, last reward: -4.8314, gradient norm: 27.63: 41%|####1 | 259/625 [00:56<01:19, 4.60it/s]
reward: -4.0038, last reward: -2.5333, gradient norm: 42.85: 41%|####1 | 259/625 [00:56<01:19, 4.60it/s]
reward: -4.0038, last reward: -2.5333, gradient norm: 42.85: 42%|####1 | 260/625 [00:56<01:19, 4.59it/s]
reward: -4.0889, last reward: -2.4616, gradient norm: 13.78: 42%|####1 | 260/625 [00:56<01:19, 4.59it/s]
reward: -4.0889, last reward: -2.4616, gradient norm: 13.78: 42%|####1 | 261/625 [00:56<01:19, 4.59it/s]
reward: -4.0655, last reward: -2.6873, gradient norm: 10.98: 42%|####1 | 261/625 [00:57<01:19, 4.59it/s]
reward: -4.0655, last reward: -2.6873, gradient norm: 10.98: 42%|####1 | 262/625 [00:57<01:19, 4.59it/s]
reward: -3.8333, last reward: -1.9476, gradient norm: 13.47: 42%|####1 | 262/625 [00:57<01:19, 4.59it/s]
reward: -3.8333, last reward: -1.9476, gradient norm: 13.47: 42%|####2 | 263/625 [00:57<01:18, 4.60it/s]
reward: -3.7554, last reward: -4.3798, gradient norm: 41.76: 42%|####2 | 263/625 [00:57<01:18, 4.60it/s]
reward: -3.7554, last reward: -4.3798, gradient norm: 41.76: 42%|####2 | 264/625 [00:57<01:18, 4.60it/s]
reward: -3.3717, last reward: -2.3947, gradient norm: 6.529: 42%|####2 | 264/625 [00:57<01:18, 4.60it/s]
reward: -3.3717, last reward: -2.3947, gradient norm: 6.529: 42%|####2 | 265/625 [00:57<01:18, 4.60it/s]
reward: -4.3060, last reward: -4.6495, gradient norm: 11.24: 42%|####2 | 265/625 [00:57<01:18, 4.60it/s]
reward: -4.3060, last reward: -4.6495, gradient norm: 11.24: 43%|####2 | 266/625 [00:57<01:18, 4.58it/s]
reward: -4.7467, last reward: -5.8889, gradient norm: 12.35: 43%|####2 | 266/625 [00:58<01:18, 4.58it/s]
reward: -4.7467, last reward: -5.8889, gradient norm: 12.35: 43%|####2 | 267/625 [00:58<01:18, 4.57it/s]
reward: -4.9281, last reward: -4.8457, gradient norm: 6.591: 43%|####2 | 267/625 [00:58<01:18, 4.57it/s]
reward: -4.9281, last reward: -4.8457, gradient norm: 6.591: 43%|####2 | 268/625 [00:58<01:18, 4.58it/s]
reward: -4.7137, last reward: -4.0536, gradient norm: 5.771: 43%|####2 | 268/625 [00:58<01:18, 4.58it/s]
reward: -4.7137, last reward: -4.0536, gradient norm: 5.771: 43%|####3 | 269/625 [00:58<01:17, 4.58it/s]
reward: -4.7197, last reward: -4.1651, gradient norm: 5.388: 43%|####3 | 269/625 [00:58<01:17, 4.58it/s]
reward: -4.7197, last reward: -4.1651, gradient norm: 5.388: 43%|####3 | 270/625 [00:58<01:17, 4.58it/s]
reward: -4.8246, last reward: -5.5709, gradient norm: 8.281: 43%|####3 | 270/625 [00:59<01:17, 4.58it/s]
reward: -4.8246, last reward: -5.5709, gradient norm: 8.281: 43%|####3 | 271/625 [00:59<01:16, 4.60it/s]
reward: -4.7502, last reward: -5.0521, gradient norm: 9.032: 43%|####3 | 271/625 [00:59<01:16, 4.60it/s]
reward: -4.7502, last reward: -5.0521, gradient norm: 9.032: 44%|####3 | 272/625 [00:59<01:16, 4.61it/s]
reward: -4.5475, last reward: -4.7253, gradient norm: 21.18: 44%|####3 | 272/625 [00:59<01:16, 4.61it/s]
reward: -4.5475, last reward: -4.7253, gradient norm: 21.18: 44%|####3 | 273/625 [00:59<01:16, 4.61it/s]
reward: -4.2856, last reward: -3.7130, gradient norm: 13.53: 44%|####3 | 273/625 [00:59<01:16, 4.61it/s]
reward: -4.2856, last reward: -3.7130, gradient norm: 13.53: 44%|####3 | 274/625 [00:59<01:16, 4.61it/s]
reward: -3.2778, last reward: -3.4122, gradient norm: 28.52: 44%|####3 | 274/625 [00:59<01:16, 4.61it/s]
reward: -3.2778, last reward: -3.4122, gradient norm: 28.52: 44%|####4 | 275/625 [00:59<01:15, 4.61it/s]
reward: -3.8368, last reward: -2.1841, gradient norm: 2.07: 44%|####4 | 275/625 [01:00<01:15, 4.61it/s]
reward: -3.8368, last reward: -2.1841, gradient norm: 2.07: 44%|####4 | 276/625 [01:00<01:15, 4.61it/s]
reward: -3.9622, last reward: -3.1603, gradient norm: 1.003e+03: 44%|####4 | 276/625 [01:00<01:15, 4.61it/s]
reward: -3.9622, last reward: -3.1603, gradient norm: 1.003e+03: 44%|####4 | 277/625 [01:00<01:15, 4.61it/s]
reward: -4.0247, last reward: -2.9830, gradient norm: 8.346: 44%|####4 | 277/625 [01:00<01:15, 4.61it/s]
reward: -4.0247, last reward: -2.9830, gradient norm: 8.346: 44%|####4 | 278/625 [01:00<01:15, 4.61it/s]
reward: -4.2238, last reward: -4.6418, gradient norm: 14.55: 44%|####4 | 278/625 [01:00<01:15, 4.61it/s]
reward: -4.2238, last reward: -4.6418, gradient norm: 14.55: 45%|####4 | 279/625 [01:00<01:15, 4.61it/s]
reward: -4.0626, last reward: -4.2538, gradient norm: 17.88: 45%|####4 | 279/625 [01:00<01:15, 4.61it/s]
reward: -4.0626, last reward: -4.2538, gradient norm: 17.88: 45%|####4 | 280/625 [01:00<01:14, 4.60it/s]
reward: -4.0149, last reward: -3.7380, gradient norm: 13.13: 45%|####4 | 280/625 [01:01<01:14, 4.60it/s]
reward: -4.0149, last reward: -3.7380, gradient norm: 13.13: 45%|####4 | 281/625 [01:01<01:15, 4.57it/s]
reward: -2.7475, last reward: -1.4190, gradient norm: 21.66: 48%|####8 | 300/625 [01:05<01:10, 4.61it/s]
reward: -1.9833, last reward: -0.1339, gradient norm: 4.402: 52%|#####2 | 325/625 [01:10<01:05, 4.62it/s]
reward: -2.0968, last reward: -0.0885, gradient norm: 3.389: 56%|#####6 | 350/625 [01:16<00:59, 4.61it/s]
reward: -2.6344, last reward: -0.0576, gradient norm: 1.617: 60%|###### | 375/625 [01:21<00:54, 4.59it/s]
reward: -2.0115, last reward: -0.0054, gradient norm: 0.3364: 64%|######4 | 400/625 [01:27<00:48, 4.61it/s]
reward: -2.3989, last reward: -0.0104, gradient norm: 1.134: 68%|######8 | 425/625 [01:32<00:43, 4.61it/s]
reward: -2.0594, last reward: -0.0119, gradient norm: 0.6196: 72%|#######2 | 450/625 [01:37<00:37, 4.62it/s]
reward: -2.4509, last reward: -0.0373, gradient norm: 11.44: 72%|#######2 | 450/625 [01:38<00:37, 4.62it/s]
reward: -2.4509, last reward: -0.0373, gradient norm: 11.44: 72%|#######2 | 451/625 [01:38<00:37, 4.61it/s]
reward: -2.2528, last reward: -0.0620, gradient norm: 3.992: 72%|#######2 | 451/625 [01:38<00:37, 4.61it/s]
reward: -2.2528, last reward: -0.0620, gradient norm: 3.992: 72%|#######2 | 452/625 [01:38<00:37, 4.61it/s]
reward: -1.6898, last reward: -0.3235, gradient norm: 6.687: 72%|#######2 | 452/625 [01:38<00:37, 4.61it/s]
reward: -1.6898, last reward: -0.3235, gradient norm: 6.687: 72%|#######2 | 453/625 [01:38<00:37, 4.61it/s]
reward: -1.5879, last reward: -0.0905, gradient norm: 2.84: 72%|#######2 | 453/625 [01:38<00:37, 4.61it/s]
reward: -1.5879, last reward: -0.0905, gradient norm: 2.84: 73%|#######2 | 454/625 [01:38<00:37, 4.61it/s]
reward: -1.8406, last reward: -0.0694, gradient norm: 2.288: 73%|#######2 | 454/625 [01:38<00:37, 4.61it/s]
reward: -1.8406, last reward: -0.0694, gradient norm: 2.288: 73%|#######2 | 455/625 [01:38<00:36, 4.61it/s]
reward: -1.8259, last reward: -0.0235, gradient norm: 1.304: 73%|#######2 | 455/625 [01:39<00:36, 4.61it/s]
reward: -1.8259, last reward: -0.0235, gradient norm: 1.304: 73%|#######2 | 456/625 [01:39<00:36, 4.61it/s]
reward: -1.8500, last reward: -0.0024, gradient norm: 1.416: 73%|#######2 | 456/625 [01:39<00:36, 4.61it/s]
reward: -1.8500, last reward: -0.0024, gradient norm: 1.416: 73%|#######3 | 457/625 [01:39<00:36, 4.62it/s]
reward: -1.9649, last reward: -0.4054, gradient norm: 39.3: 73%|#######3 | 457/625 [01:39<00:36, 4.62it/s]
reward: -1.9649, last reward: -0.4054, gradient norm: 39.3: 73%|#######3 | 458/625 [01:39<00:36, 4.62it/s]
reward: -2.2027, last reward: -0.0894, gradient norm: 4.275: 73%|#######3 | 458/625 [01:39<00:36, 4.62it/s]
reward: -2.2027, last reward: -0.0894, gradient norm: 4.275: 73%|#######3 | 459/625 [01:39<00:35, 4.62it/s]
reward: -1.5966, last reward: -0.0113, gradient norm: 1.368: 73%|#######3 | 459/625 [01:40<00:35, 4.62it/s]
reward: -1.5966, last reward: -0.0113, gradient norm: 1.368: 74%|#######3 | 460/625 [01:40<00:35, 4.62it/s]
reward: -1.6942, last reward: -0.0016, gradient norm: 0.4254: 74%|#######3 | 460/625 [01:40<00:35, 4.62it/s]
reward: -1.6942, last reward: -0.0016, gradient norm: 0.4254: 74%|#######3 | 461/625 [01:40<00:35, 4.62it/s]
reward: -1.6703, last reward: -0.0145, gradient norm: 2.142: 74%|#######3 | 461/625 [01:40<00:35, 4.62it/s]
reward: -1.6703, last reward: -0.0145, gradient norm: 2.142: 74%|#######3 | 462/625 [01:40<00:35, 4.62it/s]
reward: -1.8124, last reward: -0.0218, gradient norm: 0.9196: 74%|#######3 | 462/625 [01:40<00:35, 4.62it/s]
reward: -1.8124, last reward: -0.0218, gradient norm: 0.9196: 74%|#######4 | 463/625 [01:40<00:34, 4.63it/s]
reward: -1.8657, last reward: -0.0188, gradient norm: 0.8986: 74%|#######4 | 463/625 [01:40<00:34, 4.63it/s]
reward: -1.8657, last reward: -0.0188, gradient norm: 0.8986: 74%|#######4 | 464/625 [01:40<00:34, 4.63it/s]
reward: -2.0884, last reward: -0.0084, gradient norm: 0.5624: 74%|#######4 | 464/625 [01:41<00:34, 4.63it/s]
reward: -2.0884, last reward: -0.0084, gradient norm: 0.5624: 74%|#######4 | 465/625 [01:41<00:34, 4.62it/s]
reward: -1.8862, last reward: -0.0006, gradient norm: 0.5384: 74%|#######4 | 465/625 [01:41<00:34, 4.62it/s]
reward: -1.8862, last reward: -0.0006, gradient norm: 0.5384: 75%|#######4 | 466/625 [01:41<00:34, 4.62it/s]
reward: -2.1973, last reward: -0.0022, gradient norm: 0.5837: 75%|#######4 | 466/625 [01:41<00:34, 4.62it/s]
reward: -2.1973, last reward: -0.0022, gradient norm: 0.5837: 75%|#######4 | 467/625 [01:41<00:34, 4.62it/s]
reward: -1.8954, last reward: -0.0101, gradient norm: 0.6751: 75%|#######4 | 467/625 [01:41<00:34, 4.62it/s]
reward: -1.8954, last reward: -0.0101, gradient norm: 0.6751: 75%|#######4 | 468/625 [01:41<00:33, 4.62it/s]
reward: -1.8063, last reward: -0.0122, gradient norm: 0.9635: 75%|#######4 | 468/625 [01:41<00:33, 4.62it/s]
reward: -1.8063, last reward: -0.0122, gradient norm: 0.9635: 75%|#######5 | 469/625 [01:41<00:33, 4.62it/s]
reward: -2.0692, last reward: -0.0027, gradient norm: 0.4216: 75%|#######5 | 469/625 [01:42<00:33, 4.62it/s]
reward: -2.0692, last reward: -0.0027, gradient norm: 0.4216: 75%|#######5 | 470/625 [01:42<00:33, 4.62it/s]
reward: -2.1227, last reward: -0.0586, gradient norm: 3.162e+03: 75%|#######5 | 470/625 [01:42<00:33, 4.62it/s]
reward: -2.1227, last reward: -0.0586, gradient norm: 3.162e+03: 75%|#######5 | 471/625 [01:42<00:33, 4.61it/s]
reward: -1.9690, last reward: -0.0074, gradient norm: 0.4166: 75%|#######5 | 471/625 [01:42<00:33, 4.61it/s]
reward: -1.9690, last reward: -0.0074, gradient norm: 0.4166: 76%|#######5 | 472/625 [01:42<00:33, 4.61it/s]
reward: -2.6324, last reward: -0.0119, gradient norm: 1.345: 76%|#######5 | 472/625 [01:42<00:33, 4.61it/s]
reward: -2.6324, last reward: -0.0119, gradient norm: 1.345: 76%|#######5 | 473/625 [01:42<00:32, 4.61it/s]
reward: -2.0778, last reward: -0.0098, gradient norm: 1.166: 76%|#######5 | 473/625 [01:43<00:32, 4.61it/s]
reward: -2.0778, last reward: -0.0098, gradient norm: 1.166: 76%|#######5 | 474/625 [01:43<00:32, 4.62it/s]
reward: -1.8548, last reward: -0.0017, gradient norm: 0.4408: 76%|#######5 | 474/625 [01:43<00:32, 4.62it/s]
reward: -1.8548, last reward: -0.0017, gradient norm: 0.4408: 76%|#######6 | 475/625 [01:43<00:32, 4.62it/s]
reward: -1.8125, last reward: -0.0003, gradient norm: 0.1515: 76%|#######6 | 475/625 [01:43<00:32, 4.62it/s]
reward: -1.8125, last reward: -0.0003, gradient norm: 0.1515: 76%|#######6 | 476/625 [01:43<00:32, 4.62it/s]
reward: -2.2733, last reward: -0.0044, gradient norm: 0.2836: 76%|#######6 | 476/625 [01:43<00:32, 4.62it/s]
reward: -2.2733, last reward: -0.0044, gradient norm: 0.2836: 76%|#######6 | 477/625 [01:43<00:32, 4.61it/s]
reward: -1.7497, last reward: -0.0149, gradient norm: 0.7681: 76%|#######6 | 477/625 [01:43<00:32, 4.61it/s]
reward: -1.7497, last reward: -0.0149, gradient norm: 0.7681: 76%|#######6 | 478/625 [01:43<00:31, 4.61it/s]
reward: -1.8547, last reward: -0.0105, gradient norm: 0.7212: 76%|#######6 | 478/625 [01:44<00:31, 4.61it/s]
reward: -1.8547, last reward: -0.0105, gradient norm: 0.7212: 77%|#######6 | 479/625 [01:44<00:31, 4.61it/s]
reward: -1.9848, last reward: -0.0019, gradient norm: 0.6498: 77%|#######6 | 479/625 [01:44<00:31, 4.61it/s]
reward: -1.9848, last reward: -0.0019, gradient norm: 0.6498: 77%|#######6 | 480/625 [01:44<00:31, 4.61it/s]
reward: -2.1987, last reward: -0.0011, gradient norm: 0.5473: 77%|#######6 | 480/625 [01:44<00:31, 4.61it/s]
reward: -2.1987, last reward: -0.0011, gradient norm: 0.5473: 77%|#######6 | 481/625 [01:44<00:31, 4.61it/s]
reward: -1.8991, last reward: -0.0033, gradient norm: 0.6091: 77%|#######6 | 481/625 [01:44<00:31, 4.61it/s]
reward: -1.8991, last reward: -0.0033, gradient norm: 0.6091: 77%|#######7 | 482/625 [01:44<00:30, 4.61it/s]
reward: -1.9189, last reward: -0.0032, gradient norm: 0.5771: 77%|#######7 | 482/625 [01:45<00:30, 4.61it/s]
reward: -1.9189, last reward: -0.0032, gradient norm: 0.5771: 77%|#######7 | 483/625 [01:45<00:30, 4.61it/s]
reward: -1.6781, last reward: -0.0004, gradient norm: 0.7542: 77%|#######7 | 483/625 [01:45<00:30, 4.61it/s]
reward: -1.6781, last reward: -0.0004, gradient norm: 0.7542: 77%|#######7 | 484/625 [01:45<00:30, 4.61it/s]
reward: -1.5959, last reward: -0.0064, gradient norm: 0.4295: 77%|#######7 | 484/625 [01:45<00:30, 4.61it/s]
reward: -1.5959, last reward: -0.0064, gradient norm: 0.4295: 78%|#######7 | 485/625 [01:45<00:30, 4.62it/s]
reward: -2.2547, last reward: -0.0103, gradient norm: 0.4641: 78%|#######7 | 485/625 [01:45<00:30, 4.62it/s]
reward: -2.2547, last reward: -0.0103, gradient norm: 0.4641: 78%|#######7 | 486/625 [01:45<00:30, 4.62it/s]
reward: -2.1509, last reward: -0.0636, gradient norm: 6.547: 78%|#######7 | 486/625 [01:45<00:30, 4.62it/s]
reward: -2.1509, last reward: -0.0636, gradient norm: 6.547: 78%|#######7 | 487/625 [01:45<00:29, 4.63it/s]
reward: -2.0972, last reward: -0.0065, gradient norm: 0.2593: 78%|#######7 | 487/625 [01:46<00:29, 4.63it/s]
reward: -2.0972, last reward: -0.0065, gradient norm: 0.2593: 78%|#######8 | 488/625 [01:46<00:29, 4.63it/s]
reward: -2.1694, last reward: -0.0083, gradient norm: 0.5759: 78%|#######8 | 488/625 [01:46<00:29, 4.63it/s]
reward: -2.1694, last reward: -0.0083, gradient norm: 0.5759: 78%|#######8 | 489/625 [01:46<00:29, 4.63it/s]
reward: -2.0493, last reward: -0.0021, gradient norm: 0.7805: 78%|#######8 | 489/625 [01:46<00:29, 4.63it/s]
reward: -2.0493, last reward: -0.0021, gradient norm: 0.7805: 78%|#######8 | 490/625 [01:46<00:29, 4.63it/s]
reward: -2.0950, last reward: -0.0021, gradient norm: 0.497: 78%|#######8 | 490/625 [01:46<00:29, 4.63it/s]
reward: -2.0950, last reward: -0.0021, gradient norm: 0.497: 79%|#######8 | 491/625 [01:46<00:28, 4.63it/s]
reward: -1.9717, last reward: -0.0012, gradient norm: 0.3672: 79%|#######8 | 491/625 [01:46<00:28, 4.63it/s]
reward: -1.9717, last reward: -0.0012, gradient norm: 0.3672: 79%|#######8 | 492/625 [01:46<00:28, 4.63it/s]
reward: -2.0207, last reward: -0.0009, gradient norm: 0.331: 79%|#######8 | 492/625 [01:47<00:28, 4.63it/s]
reward: -2.0207, last reward: -0.0009, gradient norm: 0.331: 79%|#######8 | 493/625 [01:47<00:28, 4.62it/s]
reward: -1.8266, last reward: -0.0069, gradient norm: 0.5365: 79%|#######8 | 493/625 [01:47<00:28, 4.62it/s]
reward: -1.8266, last reward: -0.0069, gradient norm: 0.5365: 79%|#######9 | 494/625 [01:47<00:28, 4.62it/s]
reward: -2.2623, last reward: -0.0065, gradient norm: 0.5078: 79%|#######9 | 494/625 [01:47<00:28, 4.62it/s]
reward: -2.2623, last reward: -0.0065, gradient norm: 0.5078: 79%|#######9 | 495/625 [01:47<00:28, 4.62it/s]
reward: -2.0230, last reward: -0.0027, gradient norm: 0.4545: 79%|#######9 | 495/625 [01:47<00:28, 4.62it/s]
reward: -2.0230, last reward: -0.0027, gradient norm: 0.4545: 79%|#######9 | 496/625 [01:47<00:27, 4.62it/s]
reward: -1.6047, last reward: -0.0000, gradient norm: 0.09636: 79%|#######9 | 496/625 [01:48<00:27, 4.62it/s]
reward: -1.6047, last reward: -0.0000, gradient norm: 0.09636: 80%|#######9 | 497/625 [01:48<00:27, 4.62it/s]
reward: -1.8754, last reward: -0.0010, gradient norm: 0.2: 80%|#######9 | 497/625 [01:48<00:27, 4.62it/s]
reward: -1.8754, last reward: -0.0010, gradient norm: 0.2: 80%|#######9 | 498/625 [01:48<00:27, 4.62it/s]
reward: -2.6216, last reward: -0.0031, gradient norm: 0.8269: 80%|#######9 | 498/625 [01:48<00:27, 4.62it/s]
reward: -2.6216, last reward: -0.0031, gradient norm: 0.8269: 80%|#######9 | 499/625 [01:48<00:27, 4.62it/s]
reward: -1.7361, last reward: -0.0023, gradient norm: 0.4082: 80%|#######9 | 499/625 [01:48<00:27, 4.62it/s]
reward: -1.7361, last reward: -0.0023, gradient norm: 0.4082: 80%|######## | 500/625 [01:48<00:27, 4.62it/s]
reward: -1.6642, last reward: -0.0006, gradient norm: 0.2284: 80%|######## | 500/625 [01:48<00:27, 4.62it/s]
reward: -1.6642, last reward: -0.0006, gradient norm: 0.2284: 80%|######## | 501/625 [01:48<00:26, 4.62it/s]
reward: -1.9130, last reward: -0.0008, gradient norm: 0.3031: 80%|######## | 501/625 [01:49<00:26, 4.62it/s]
reward: -1.9130, last reward: -0.0008, gradient norm: 0.3031: 80%|######## | 502/625 [01:49<00:26, 4.62it/s]
reward: -2.2944, last reward: -0.0035, gradient norm: 0.2986: 80%|######## | 502/625 [01:49<00:26, 4.62it/s]
reward: -2.2944, last reward: -0.0035, gradient norm: 0.2986: 80%|######## | 503/625 [01:49<00:26, 4.62it/s]
reward: -1.7624, last reward: -0.0056, gradient norm: 0.3858: 80%|######## | 503/625 [01:49<00:26, 4.62it/s]
reward: -1.7624, last reward: -0.0056, gradient norm: 0.3858: 81%|######## | 504/625 [01:49<00:26, 4.62it/s]
reward: -2.0890, last reward: -0.0042, gradient norm: 0.38: 81%|######## | 504/625 [01:49<00:26, 4.62it/s]
reward: -2.0890, last reward: -0.0042, gradient norm: 0.38: 81%|######## | 505/625 [01:49<00:25, 4.62it/s]
reward: -1.7505, last reward: -0.0017, gradient norm: 0.2157: 81%|######## | 505/625 [01:50<00:25, 4.62it/s]
reward: -1.7505, last reward: -0.0017, gradient norm: 0.2157: 81%|######## | 506/625 [01:50<00:25, 4.62it/s]
reward: -1.8394, last reward: -0.0013, gradient norm: 0.3413: 81%|######## | 506/625 [01:50<00:25, 4.62it/s]
reward: -1.8394, last reward: -0.0013, gradient norm: 0.3413: 81%|########1 | 507/625 [01:50<00:25, 4.61it/s]
reward: -1.9609, last reward: -0.0041, gradient norm: 0.6905: 81%|########1 | 507/625 [01:50<00:25, 4.61it/s]
reward: -1.9609, last reward: -0.0041, gradient norm: 0.6905: 81%|########1 | 508/625 [01:50<00:25, 4.59it/s]
reward: -1.8467, last reward: -0.0011, gradient norm: 0.4409: 81%|########1 | 508/625 [01:50<00:25, 4.59it/s]
reward: -1.8467, last reward: -0.0011, gradient norm: 0.4409: 81%|########1 | 509/625 [01:50<00:25, 4.59it/s]
reward: -2.0252, last reward: -0.0021, gradient norm: 0.213: 81%|########1 | 509/625 [01:50<00:25, 4.59it/s]
reward: -2.0252, last reward: -0.0021, gradient norm: 0.213: 82%|########1 | 510/625 [01:50<00:24, 4.60it/s]
reward: -1.8128, last reward: -0.0073, gradient norm: 0.3559: 82%|########1 | 510/625 [01:51<00:24, 4.60it/s]
reward: -1.8128, last reward: -0.0073, gradient norm: 0.3559: 82%|########1 | 511/625 [01:51<00:24, 4.61it/s]
reward: -2.1479, last reward: -0.0264, gradient norm: 3.68: 82%|########1 | 511/625 [01:51<00:24, 4.61it/s]
reward: -2.1479, last reward: -0.0264, gradient norm: 3.68: 82%|########1 | 512/625 [01:51<00:24, 4.61it/s]
reward: -2.1589, last reward: -0.0025, gradient norm: 5.566: 82%|########1 | 512/625 [01:51<00:24, 4.61it/s]
reward: -2.1589, last reward: -0.0025, gradient norm: 5.566: 82%|########2 | 513/625 [01:51<00:24, 4.61it/s]
reward: -2.2756, last reward: -0.0046, gradient norm: 0.5266: 82%|########2 | 513/625 [01:51<00:24, 4.61it/s]
reward: -2.2756, last reward: -0.0046, gradient norm: 0.5266: 82%|########2 | 514/625 [01:51<00:24, 4.62it/s]
reward: -1.9873, last reward: -0.0112, gradient norm: 0.9314: 82%|########2 | 514/625 [01:51<00:24, 4.62it/s]
reward: -1.9873, last reward: -0.0112, gradient norm: 0.9314: 82%|########2 | 515/625 [01:51<00:23, 4.62it/s]
reward: -2.3791, last reward: -0.0721, gradient norm: 1.14: 82%|########2 | 515/625 [01:52<00:23, 4.62it/s]
reward: -2.3791, last reward: -0.0721, gradient norm: 1.14: 83%|########2 | 516/625 [01:52<00:23, 4.62it/s]
reward: -2.4580, last reward: -0.0758, gradient norm: 0.6114: 83%|########2 | 516/625 [01:52<00:23, 4.62it/s]
reward: -2.4580, last reward: -0.0758, gradient norm: 0.6114: 83%|########2 | 517/625 [01:52<00:23, 4.62it/s]
reward: -1.9748, last reward: -0.0001, gradient norm: 0.2431: 83%|########2 | 517/625 [01:52<00:23, 4.62it/s]
reward: -1.9748, last reward: -0.0001, gradient norm: 0.2431: 83%|########2 | 518/625 [01:52<00:23, 4.62it/s]
reward: -2.1958, last reward: -0.0044, gradient norm: 0.5553: 83%|########2 | 518/625 [01:52<00:23, 4.62it/s]
reward: -2.1958, last reward: -0.0044, gradient norm: 0.5553: 83%|########3 | 519/625 [01:52<00:22, 4.62it/s]
reward: -1.8924, last reward: -0.0097, gradient norm: 17.34: 83%|########3 | 519/625 [01:53<00:22, 4.62it/s]
reward: -1.8924, last reward: -0.0097, gradient norm: 17.34: 83%|########3 | 520/625 [01:53<00:22, 4.62it/s]
reward: -2.3737, last reward: -0.0234, gradient norm: 1.899: 83%|########3 | 520/625 [01:53<00:22, 4.62it/s]
reward: -2.3737, last reward: -0.0234, gradient norm: 1.899: 83%|########3 | 521/625 [01:53<00:22, 4.62it/s]
reward: -1.9125, last reward: -0.0063, gradient norm: 0.4623: 83%|########3 | 521/625 [01:53<00:22, 4.62it/s]
reward: -1.9125, last reward: -0.0063, gradient norm: 0.4623: 84%|########3 | 522/625 [01:53<00:22, 4.62it/s]
reward: -2.3230, last reward: -0.0589, gradient norm: 0.3784: 84%|########3 | 522/625 [01:53<00:22, 4.62it/s]
reward: -2.3230, last reward: -0.0589, gradient norm: 0.3784: 84%|########3 | 523/625 [01:53<00:22, 4.62it/s]
reward: -1.9482, last reward: -0.0051, gradient norm: 1.105: 84%|########3 | 523/625 [01:53<00:22, 4.62it/s]
reward: -1.9482, last reward: -0.0051, gradient norm: 1.105: 84%|########3 | 524/625 [01:53<00:21, 4.62it/s]
reward: -2.1979, last reward: -0.0045, gradient norm: 0.6401: 84%|########3 | 524/625 [01:54<00:21, 4.62it/s]
reward: -2.1979, last reward: -0.0045, gradient norm: 0.6401: 84%|########4 | 525/625 [01:54<00:21, 4.62it/s]
reward: -2.1588, last reward: -0.0048, gradient norm: 0.6255: 84%|########4 | 525/625 [01:54<00:21, 4.62it/s]
reward: -2.1588, last reward: -0.0048, gradient norm: 0.6255: 84%|########4 | 526/625 [01:54<00:21, 4.62it/s]
reward: -1.6084, last reward: -0.0010, gradient norm: 0.3477: 84%|########4 | 526/625 [01:54<00:21, 4.62it/s]
reward: -1.6084, last reward: -0.0010, gradient norm: 0.3477: 84%|########4 | 527/625 [01:54<00:21, 4.62it/s]
reward: -2.1475, last reward: -0.0209, gradient norm: 0.3456: 84%|########4 | 527/625 [01:54<00:21, 4.62it/s]
reward: -2.1475, last reward: -0.0209, gradient norm: 0.3456: 84%|########4 | 528/625 [01:54<00:20, 4.62it/s]
reward: -1.7611, last reward: -0.1040, gradient norm: 18.52: 84%|########4 | 528/625 [01:54<00:20, 4.62it/s]
reward: -1.7611, last reward: -0.1040, gradient norm: 18.52: 85%|########4 | 529/625 [01:54<00:20, 4.62it/s]
reward: -2.0099, last reward: -0.0173, gradient norm: 1.643: 85%|########4 | 529/625 [01:55<00:20, 4.62it/s]
reward: -2.0099, last reward: -0.0173, gradient norm: 1.643: 85%|########4 | 530/625 [01:55<00:20, 4.62it/s]
reward: -2.8189, last reward: -1.4358, gradient norm: 46.61: 85%|########4 | 530/625 [01:55<00:20, 4.62it/s]
reward: -2.8189, last reward: -1.4358, gradient norm: 46.61: 85%|########4 | 531/625 [01:55<00:20, 4.62it/s]
reward: -2.9897, last reward: -2.4869, gradient norm: 51.23: 85%|########4 | 531/625 [01:55<00:20, 4.62it/s]
reward: -2.9897, last reward: -2.4869, gradient norm: 51.23: 85%|########5 | 532/625 [01:55<00:20, 4.62it/s]
reward: -2.1548, last reward: -0.9751, gradient norm: 72.21: 85%|########5 | 532/625 [01:55<00:20, 4.62it/s]
reward: -2.1548, last reward: -0.9751, gradient norm: 72.21: 85%|########5 | 533/625 [01:55<00:19, 4.63it/s]
reward: -1.6362, last reward: -0.0022, gradient norm: 0.7495: 85%|########5 | 533/625 [01:56<00:19, 4.63it/s]
reward: -1.6362, last reward: -0.0022, gradient norm: 0.7495: 85%|########5 | 534/625 [01:56<00:19, 4.63it/s]
reward: -2.1749, last reward: -0.0105, gradient norm: 0.9513: 85%|########5 | 534/625 [01:56<00:19, 4.63it/s]
reward: -2.1749, last reward: -0.0105, gradient norm: 0.9513: 86%|########5 | 535/625 [01:56<00:19, 4.62it/s]
reward: -1.7708, last reward: -0.0371, gradient norm: 1.432: 86%|########5 | 535/625 [01:56<00:19, 4.62it/s]
reward: -1.7708, last reward: -0.0371, gradient norm: 1.432: 86%|########5 | 536/625 [01:56<00:19, 4.62it/s]
reward: -2.2649, last reward: -0.0437, gradient norm: 2.327: 86%|########5 | 536/625 [01:56<00:19, 4.62it/s]
reward: -2.2649, last reward: -0.0437, gradient norm: 2.327: 86%|########5 | 537/625 [01:56<00:19, 4.62it/s]
reward: -2.5491, last reward: -0.0276, gradient norm: 1.246: 86%|########5 | 537/625 [01:56<00:19, 4.62it/s]
reward: -2.5491, last reward: -0.0276, gradient norm: 1.246: 86%|########6 | 538/625 [01:56<00:18, 4.62it/s]
reward: -2.6426, last reward: -0.7294, gradient norm: 1.078e+03: 86%|########6 | 538/625 [01:57<00:18, 4.62it/s]
reward: -2.6426, last reward: -0.7294, gradient norm: 1.078e+03: 86%|########6 | 539/625 [01:57<00:18, 4.62it/s]
reward: -1.9928, last reward: -0.0003, gradient norm: 1.576: 86%|########6 | 539/625 [01:57<00:18, 4.62it/s]
reward: -1.9928, last reward: -0.0003, gradient norm: 1.576: 86%|########6 | 540/625 [01:57<00:18, 4.62it/s]
reward: -1.7937, last reward: -0.0124, gradient norm: 0.9664: 86%|########6 | 540/625 [01:57<00:18, 4.62it/s]
reward: -1.7937, last reward: -0.0124, gradient norm: 0.9664: 87%|########6 | 541/625 [01:57<00:18, 4.62it/s]
reward: -2.3342, last reward: -0.0204, gradient norm: 1.81: 87%|########6 | 541/625 [01:57<00:18, 4.62it/s]
reward: -2.3342, last reward: -0.0204, gradient norm: 1.81: 87%|########6 | 542/625 [01:57<00:17, 4.62it/s]
reward: -2.2046, last reward: -0.0122, gradient norm: 1.004: 87%|########6 | 542/625 [01:58<00:17, 4.62it/s]
reward: -2.2046, last reward: -0.0122, gradient norm: 1.004: 87%|########6 | 543/625 [01:58<00:17, 4.62it/s]
reward: -2.0000, last reward: -0.0014, gradient norm: 0.5496: 87%|########6 | 543/625 [01:58<00:17, 4.62it/s]
reward: -2.0000, last reward: -0.0014, gradient norm: 0.5496: 87%|########7 | 544/625 [01:58<00:17, 4.62it/s]
reward: -2.0956, last reward: -0.0059, gradient norm: 1.425: 87%|########7 | 544/625 [01:58<00:17, 4.62it/s]
reward: -2.0956, last reward: -0.0059, gradient norm: 1.425: 87%|########7 | 545/625 [01:58<00:17, 4.62it/s]
reward: -2.9028, last reward: -0.5843, gradient norm: 21.12: 87%|########7 | 545/625 [01:58<00:17, 4.62it/s]
reward: -2.9028, last reward: -0.5843, gradient norm: 21.12: 87%|########7 | 546/625 [01:58<00:17, 4.62it/s]
reward: -2.0674, last reward: -0.0178, gradient norm: 0.797: 87%|########7 | 546/625 [01:58<00:17, 4.62it/s]
reward: -2.0674, last reward: -0.0178, gradient norm: 0.797: 88%|########7 | 547/625 [01:58<00:16, 4.62it/s]
reward: -2.2815, last reward: -0.0599, gradient norm: 1.227: 88%|########7 | 547/625 [01:59<00:16, 4.62it/s]
reward: -2.2815, last reward: -0.0599, gradient norm: 1.227: 88%|########7 | 548/625 [01:59<00:16, 4.62it/s]
reward: -3.1587, last reward: -0.9276, gradient norm: 20.56: 88%|########7 | 548/625 [01:59<00:16, 4.62it/s]
reward: -3.1587, last reward: -0.9276, gradient norm: 20.56: 88%|########7 | 549/625 [01:59<00:16, 4.62it/s]
reward: -3.8228, last reward: -2.9229, gradient norm: 308.2: 88%|########7 | 549/625 [01:59<00:16, 4.62it/s]
reward: -3.8228, last reward: -2.9229, gradient norm: 308.2: 88%|########8 | 550/625 [01:59<00:16, 4.62it/s]
reward: -1.6164, last reward: -0.0120, gradient norm: 2.259: 88%|########8 | 550/625 [01:59<00:16, 4.62it/s]
reward: -1.6164, last reward: -0.0120, gradient norm: 2.259: 88%|########8 | 551/625 [01:59<00:15, 4.63it/s]
reward: -1.6850, last reward: -0.0227, gradient norm: 0.9167: 88%|########8 | 551/625 [01:59<00:15, 4.63it/s]
reward: -1.6850, last reward: -0.0227, gradient norm: 0.9167: 88%|########8 | 552/625 [01:59<00:15, 4.62it/s]
reward: -2.3092, last reward: -0.0670, gradient norm: 0.9177: 88%|########8 | 552/625 [02:00<00:15, 4.62it/s]
reward: -2.3092, last reward: -0.0670, gradient norm: 0.9177: 88%|########8 | 553/625 [02:00<00:15, 4.63it/s]
reward: -2.1599, last reward: -0.0043, gradient norm: 1.195: 88%|########8 | 553/625 [02:00<00:15, 4.63it/s]
reward: -2.1599, last reward: -0.0043, gradient norm: 1.195: 89%|########8 | 554/625 [02:00<00:15, 4.63it/s]
reward: -2.4672, last reward: -0.0057, gradient norm: 0.6367: 89%|########8 | 554/625 [02:00<00:15, 4.63it/s]
reward: -2.4672, last reward: -0.0057, gradient norm: 0.6367: 89%|########8 | 555/625 [02:00<00:15, 4.63it/s]
reward: -2.3657, last reward: -0.1970, gradient norm: 4.202: 89%|########8 | 555/625 [02:00<00:15, 4.63it/s]
reward: -2.3657, last reward: -0.1970, gradient norm: 4.202: 89%|########8 | 556/625 [02:00<00:14, 4.63it/s]
reward: -2.6694, last reward: -0.1215, gradient norm: 1.324: 89%|########8 | 556/625 [02:01<00:14, 4.63it/s]
reward: -2.6694, last reward: -0.1215, gradient norm: 1.324: 89%|########9 | 557/625 [02:01<00:14, 4.63it/s]
reward: -2.2622, last reward: -0.0372, gradient norm: 0.4841: 89%|########9 | 557/625 [02:01<00:14, 4.63it/s]
reward: -2.2622, last reward: -0.0372, gradient norm: 0.4841: 89%|########9 | 558/625 [02:01<00:14, 4.63it/s]
reward: -2.2707, last reward: -0.0058, gradient norm: 5.757: 89%|########9 | 558/625 [02:01<00:14, 4.63it/s]
reward: -2.2707, last reward: -0.0058, gradient norm: 5.757: 89%|########9 | 559/625 [02:01<00:14, 4.62it/s]
reward: -2.2267, last reward: -0.0014, gradient norm: 0.5415: 89%|########9 | 559/625 [02:01<00:14, 4.62it/s]
reward: -2.2267, last reward: -0.0014, gradient norm: 0.5415: 90%|########9 | 560/625 [02:01<00:14, 4.60it/s]
reward: -2.4556, last reward: -0.0163, gradient norm: 1.146: 90%|########9 | 560/625 [02:01<00:14, 4.60it/s]
reward: -2.4556, last reward: -0.0163, gradient norm: 1.146: 90%|########9 | 561/625 [02:01<00:13, 4.59it/s]
reward: -2.1839, last reward: -0.0809, gradient norm: 0.6262: 90%|########9 | 561/625 [02:02<00:13, 4.59it/s]
reward: -2.1839, last reward: -0.0809, gradient norm: 0.6262: 90%|########9 | 562/625 [02:02<00:13, 4.59it/s]
reward: -2.0278, last reward: -0.0018, gradient norm: 1.327: 90%|########9 | 562/625 [02:02<00:13, 4.59it/s]
reward: -2.0278, last reward: -0.0018, gradient norm: 1.327: 90%|######### | 563/625 [02:02<00:13, 4.57it/s]
reward: -2.1112, last reward: -0.0011, gradient norm: 0.354: 90%|######### | 563/625 [02:02<00:13, 4.57it/s]
reward: -2.1112, last reward: -0.0011, gradient norm: 0.354: 90%|######### | 564/625 [02:02<00:13, 4.56it/s]
reward: -2.6155, last reward: -0.0004, gradient norm: 2.008: 90%|######### | 564/625 [02:02<00:13, 4.56it/s]
reward: -2.6155, last reward: -0.0004, gradient norm: 2.008: 90%|######### | 565/625 [02:02<00:13, 4.55it/s]
reward: -3.1427, last reward: -0.3582, gradient norm: 7.624: 90%|######### | 565/625 [02:03<00:13, 4.55it/s]
reward: -3.1427, last reward: -0.3582, gradient norm: 7.624: 91%|######### | 566/625 [02:03<00:12, 4.55it/s]
reward: -2.7870, last reward: -0.9490, gradient norm: 18.26: 91%|######### | 566/625 [02:03<00:12, 4.55it/s]
reward: -2.7870, last reward: -0.9490, gradient norm: 18.26: 91%|######### | 567/625 [02:03<00:12, 4.56it/s]
reward: -3.0439, last reward: -0.8796, gradient norm: 29.89: 91%|######### | 567/625 [02:03<00:12, 4.56it/s]
reward: -3.0439, last reward: -0.8796, gradient norm: 29.89: 91%|######### | 568/625 [02:03<00:12, 4.58it/s]
reward: -2.8026, last reward: -0.2720, gradient norm: 8.612: 91%|######### | 568/625 [02:03<00:12, 4.58it/s]
reward: -2.8026, last reward: -0.2720, gradient norm: 8.612: 91%|#########1| 569/625 [02:03<00:12, 4.60it/s]
reward: -2.3147, last reward: -0.8486, gradient norm: 41.13: 91%|#########1| 569/625 [02:03<00:12, 4.60it/s]
reward: -2.3147, last reward: -0.8486, gradient norm: 41.13: 91%|#########1| 570/625 [02:03<00:11, 4.61it/s]
reward: -1.7917, last reward: -0.0129, gradient norm: 2.365: 91%|#########1| 570/625 [02:04<00:11, 4.61it/s]
reward: -1.7917, last reward: -0.0129, gradient norm: 2.365: 91%|#########1| 571/625 [02:04<00:11, 4.62it/s]
reward: -1.9553, last reward: -0.0020, gradient norm: 0.6871: 91%|#########1| 571/625 [02:04<00:11, 4.62it/s]
reward: -1.9553, last reward: -0.0020, gradient norm: 0.6871: 92%|#########1| 572/625 [02:04<00:11, 4.62it/s]
reward: -2.3132, last reward: -0.0159, gradient norm: 0.8646: 92%|#########1| 572/625 [02:04<00:11, 4.62it/s]
reward: -2.3132, last reward: -0.0159, gradient norm: 0.8646: 92%|#########1| 573/625 [02:04<00:11, 4.60it/s]
reward: -1.5320, last reward: -0.0269, gradient norm: 1.02: 92%|#########1| 573/625 [02:04<00:11, 4.60it/s]
reward: -1.5320, last reward: -0.0269, gradient norm: 1.02: 92%|#########1| 574/625 [02:04<00:11, 4.60it/s]
reward: -2.2955, last reward: -0.0245, gradient norm: 1.267: 92%|#########1| 574/625 [02:04<00:11, 4.60it/s]
reward: -2.2955, last reward: -0.0245, gradient norm: 1.267: 92%|#########2| 575/625 [02:04<00:10, 4.58it/s]
reward: -2.3347, last reward: -0.0179, gradient norm: 1.528: 92%|#########2| 575/625 [02:05<00:10, 4.58it/s]
reward: -2.3347, last reward: -0.0179, gradient norm: 1.528: 92%|#########2| 576/625 [02:05<00:10, 4.57it/s]
reward: -1.9718, last reward: -0.1629, gradient norm: 8.804: 92%|#########2| 576/625 [02:05<00:10, 4.57it/s]
reward: -1.9718, last reward: -0.1629, gradient norm: 8.804: 92%|#########2| 577/625 [02:05<00:10, 4.55it/s]
reward: -2.4164, last reward: -0.0070, gradient norm: 0.4335: 92%|#########2| 577/625 [02:05<00:10, 4.55it/s]
reward: -2.4164, last reward: -0.0070, gradient norm: 0.4335: 92%|#########2| 578/625 [02:05<00:10, 4.55it/s]
reward: -2.2993, last reward: -0.0011, gradient norm: 1.371: 92%|#########2| 578/625 [02:05<00:10, 4.55it/s]
reward: -2.2993, last reward: -0.0011, gradient norm: 1.371: 93%|#########2| 579/625 [02:05<00:10, 4.56it/s]
reward: -3.3049, last reward: -0.9063, gradient norm: 34.23: 93%|#########2| 579/625 [02:06<00:10, 4.56it/s]
reward: -2.1253, last reward: -0.0001, gradient norm: 0.6622: 100%|##########| 625/625 [02:15<00:00, 4.60it/s]
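Conclusion
In this tutorial, we learned how to write a stateless environment from scratch. We covered the following topics:
- The four essential components to handle when writing an environment (step, reset, seeding, and building the specs), and how these methods and classes interact with the TensorDict class;
- How to use check_env_specs() to test that an environment is coded correctly;
- How to append transforms to a stateless environment and how to write custom transforms;
- How to train a policy on a fully differentiable simulator.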
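As a compact recap of what "stateless" means here, the sketch below implements the two update equations given at the start of the tutorial as a pure function over plain Python floats. It is only an illustration, not the tutorial's TensorDict-based environment: the function names `pendulum_step` and `rollout` are hypothetical, and the default values for `g`, `length`, `mass`, and `dt` are assumptions chosen for the example.

```python
import math


def pendulum_step(theta, theta_dot, torque, g=10.0, length=1.0, mass=1.0, dt=0.05):
    """One stateless transition: the caller supplies the full current state.

    Default parameter values are illustrative assumptions.
    """
    # Angular velocity update:
    # theta_dot' = theta_dot + (3g/(2L) * sin(theta) + 3/(m L^2) * u) * dt
    new_theta_dot = theta_dot + (
        3 * g / (2 * length) * math.sin(theta)
        + 3.0 / (mass * length**2) * torque
    ) * dt
    # Angular position update uses the new velocity:
    # theta' = theta + theta_dot' * dt
    new_theta = theta + new_theta_dot * dt
    return new_theta, new_theta_dot


def rollout(theta, theta_dot, torques, **params):
    """Apply a sequence of torques, threading the state through explicitly."""
    trajectory = [(theta, theta_dot)]
    for u in torques:
        theta, theta_dot = pendulum_step(theta, theta_dot, u, **params)
        trajectory.append((theta, theta_dot))
    return trajectory
```

Because the function holds no internal state, calling it twice with the same inputs returns the same result, and a rollout can be restarted from any intermediate state simply by passing that state back in.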