Non-local Policy Optimization via
Diversity-regularized Collaborative Exploration
Zhenghao Peng, Hao Sun, Bolei Zhou
The Chinese University of Hong Kong

Working together in a team towards a common goal makes life easier. However, in most existing Reinforcement Learning (RL) algorithms, only a single agent, or a global agent with several replicas, explores the environment and learns to solve the task. Due to its initialization and previous experience, the agent usually limits its exploration to a small region of the state-action space, as illustrated by the light area in the figure above. We call this the local exploration problem.

We address the local exploration problem with a new policy optimization framework called Diversity-regularized Collaborative Exploration (DiCE). DiCE combines Collaborative Exploration (CE), which maintains a team of agents and shares knowledge among them, with Diversity Regularization (DR), which directs each agent's exploration and maintains diversity across the team. We implement DiCE in both on-policy and off-policy settings and compare it with baselines such as PPO and SAC. Experimental results show that DiCE outperforms both on-policy and off-policy baselines in most of the MuJoCo locomotion benchmarks.
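To make the two ingredients concrete, here is a minimal sketch in NumPy. It is not the paper's implementation: the policies are stand-in deterministic linear maps, and the names `act`, `diversity_bonus`, `team`, and `shared_states` are all illustrative. It only shows the two ideas: a pooled batch of states that every agent learns from (CE), and a per-agent diversity measure computed against teammates' actions on those shared states (DR).

```python
import numpy as np

rng = np.random.default_rng(0)

def act(theta, s):
    """Stand-in deterministic linear policy: a = theta @ s."""
    return theta @ s

def diversity_bonus(thetas, i, states):
    """Diversity Regularization (DR), sketched: the mean action
    distance between agent i and each teammate over a batch of
    states. DiCE would use this signal to regularize agent i's
    policy update so the team stays diverse."""
    dists = []
    for j, theta in enumerate(thetas):
        if j == i:
            continue
        d = np.mean([np.linalg.norm(act(thetas[i], s) - act(theta, s))
                     for s in states])
        dists.append(d)
    return float(np.mean(dists))

# Collaborative Exploration (CE), sketched: the team pools its
# experience, so each agent is evaluated and updated on states
# collected by everyone, not only its own rollouts.
team = [rng.normal(size=(2, 3)) for _ in range(3)]
shared_states = [rng.normal(size=3) for _ in range(8)]

for i in range(len(team)):
    print(f"agent {i}: diversity bonus = {diversity_bonus(team, i, shared_states):.3f}")
```

In the actual framework the diversity signal is learned by a critic and fused with the task objective during policy optimization; this sketch only makes the team/diversity structure explicit.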


Overall Performance
We implement the DiCE framework in both on-policy and off-policy settings and compare it with two on-policy baselines (PPO and A2C), one off-policy baseline (SAC), and one diversity-encouraging baseline (TNB). We train our agents on five locomotion tasks in the MuJoCo simulator.
On-policy Setting
As shown in the following figures, our method achieves better results than the PPO and A2C baselines in all five tasks. The table above shows that DiCE-PPO achieves a substantial improvement over the baselines in four environments, while in Hopper-v3 PPO and TNB achieve higher scores than DiCE. In Hopper-v3, PPO collapses after prolonged training, whereas DiCE maintains its performance until the end of training, which demonstrates DiCE's training stability.
Off-policy Setting
As shown in the table and the following figures, in the off-policy setting DiCE-SAC outperforms the SAC baseline in Hopper-v3 and Humanoid-v3 with faster convergence, and achieves comparable performance in HalfCheetah-v3 and Walker2d-v3. In Ant-v3, DiCE-SAC fails to make progress compared to SAC. This might be because Ant-v3 imposes looser constraints on actions and has a larger action space, so the structure of the diversity is more complex than in the other environments, making the diversity critic harder to learn. We make a similar observation for on-policy DiCE when utilizing a diversity value network (please refer to the ablation study in the paper).
The performance improvements brought by DiCE in both on-policy and off-policy settings show the generality of our framework.
Reference
@article{peng2020noniocal,
  title={Non-local Policy Optimization via Diversity-regularized Collaborative Exploration},
  author={Peng, Zhenghao and Sun, Hao and Zhou, Bolei},
  journal={arXiv preprint arXiv:2006.07781},
  year={2020}
}

Video on Bilibili
Video on YouTube