TOPPO: Rethinking PPO for Multi-Task Reinforcement Learning with Critic Balancing

Under review

We diagnose critic-side gradient ill-conditioning as a previously overlooked bottleneck of PPO in multi-task reinforcement learning, where tail tasks stall while easy tasks dominate value-function updates.

We design Critic Balancing for PPO, which combines:

  • per-task PopArt value normalization,
  • pre-activation LayerNorm in the critic body, and
  • per-side gradient combiners (PCGrad / CAGrad / FairGrad) chosen independently for actor and critic,

to recondition gradients without enlarging the model.

On Meta-World+ MT50, the resulting algorithm surpasses published SAC-family and ARS-family baselines on both mean success and worst-k tail-task success, while using up to 22.7× fewer parameters and substantially fewer environment steps.

Yuanpeng Li, Gefei Lin, Annie Qu, Rui Miao. Under review. arXiv:2605.11473