Efficient Learning of Safe Driving Policy via Human-AI Copilot Optimization

International Conference on Learning Representations (ICLR) 2022

Quanyi Li1*,  Zhenghao Peng2*,   Bolei Zhou3 
1Centre for Perceptual and Interactive Intelligence, 2The Chinese University of Hong Kong, 3University of California, Los Angeles
Webpage | Code | Video | Talk | Paper | Poster
Human-AI Copilot Optimization (HACO)

Fig. 1 Framework

We develop an efficient Human-AI Copilot Optimization method (HACO), which incorporates a human into the Reinforcement Learning (RL) training loop to boost learning efficiency and ensure safety. Sidestepping the need for complex reward engineering, HACO injects human knowledge, extracted from partial demonstrations, into a proxy value function via an offline RL technique. In addition, entropy regularization and intervention minimization are used to encourage exploration and save human budget, respectively. Comprehensive experiments show the superior sample efficiency and safety guarantee of the proposed method.
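To make this concrete, below is a minimal, hypothetical sketch of one HACO-style update step in PyTorch. The helper objects (policy, proxy_q, the batch keys) and the exact form of the intervention-minimization term are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn.functional as F

def haco_update(batch, policy, proxy_q, q_optim, pi_optim,
                alpha=0.2, intervention_weight=1.0):
    # Unpack a batch of copilot transitions; "intervention" is a 0/1
    # mask marking steps where the human took over (names assumed).
    s = batch["obs"]
    a_agent = batch["agent_action"]
    a_human = batch["human_action"]
    intervened = batch["intervention"]

    # Proxy value learning (reward-free): on intervened steps, push the
    # proxy Q of the human's action above that of the agent's action,
    # injecting knowledge from the partial demonstration.
    q_loss = (intervened * (proxy_q(s, a_agent) - proxy_q(s, a_human))).mean()
    q_optim.zero_grad()
    q_loss.backward()
    q_optim.step()

    # Policy improvement: maximize proxy value plus entropy (alpha term),
    # and, as a schematic stand-in for intervention minimization, nudge
    # the policy toward the human's action on intervened steps so that
    # takeovers become less likely. The paper estimates this cost
    # differently; this term is only illustrative.
    new_a, log_prob = policy.sample(s)
    mismatch = 1.0 - F.cosine_similarity(new_a, a_human, dim=-1)
    pi_loss = (alpha * log_prob - proxy_q(s, new_a)).mean() \
              + intervention_weight * (intervened * mismatch).mean()
    pi_optim.zero_grad()
    pi_loss.backward()
    pi_optim.step()
    return q_loss.item(), pi_loss.item()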

Experiment Result

Fig. 2 Learning Dynamics
Experiments conducted in the MetaDrive simulator show efficient training as well as low safety violation, compared with RL, offline RL, imitation learning, and human-in-the-loop baselines. Results are reported in Table 1. Compared to the other data-hungry baselines, HACO uses fewer transitions and achieves the highest success rate. Moreover, under human protection, HACO incurs only 30.14 total safety violations over the whole training process, two orders of magnitude fewer than the RL baselines, even though HACO accesses neither the cost nor the reward signal. Another concern in the human-in-the-loop paradigm is the expensive human budget; as shown in Fig. 2B, the amount of human intervention decreases as training proceeds.
Table 1 Comparison results
Talk

We summarize our core technical contribution in this talk.

Demo Video

HACO is tested in the MetaDrive simulator, which is efficient and can generate diverse scenarios. Here, we provide the full training process of HACO and compare it with the RL, IL, and offline RL baselines. HACO achieves superior sample efficiency with a safety guarantee and outperforms all baselines.
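For reference, a minimal sketch of rolling out an agent in MetaDrive with human takeover enabled is given below. It assumes MetaDrive's Gym-style interface; config and info keys may differ across versions.

from metadrive import MetaDriveEnv

env = MetaDriveEnv(dict(
    use_render=True,      # open a window so the human can supervise
    manual_control=True,  # let the human take over via keyboard or wheel
))

obs = env.reset()  # newer (Gymnasium) versions return (obs, info)
for _ in range(1000):
    action = env.action_space.sample()  # stand-in for the learning agent
    obs, reward, done, info = env.step(action)
    # In the copilot setting, takeover flags are exposed through info
    # (the exact key name, e.g. info["takeover"], is an assumption).
    if done:
        obs = env.reset()
env.close()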

Benchmark HACO on CARLA

Furthermore, we benchmark HACO on the CARLA simulator, where the agent takes a semantic top-down view as observation. Equipped with a 3-layer convolutional neural network, the HACO agent learns not only the feature extractor but also the driving policy, with human involvement, within 10 minutes.
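As an illustration, a 3-layer convolutional encoder of this kind might look as follows in PyTorch. The channel counts, kernel sizes, and 84x84 input resolution are assumptions for the sketch, not the paper's exact architecture.

import torch
import torch.nn as nn

class BEVEncoder(nn.Module):
    """3-layer CNN over a semantic top-down (bird's-eye-view) observation."""
    def __init__(self, in_channels=5, feature_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        with torch.no_grad():  # infer the flattened size from a dummy input
            n = self.conv(torch.zeros(1, in_channels, 84, 84)).shape[1]
        self.head = nn.Linear(n, feature_dim)

    def forward(self, x):
        return self.head(self.conv(x))

# Policy and proxy value heads are trained end-to-end on top of this
# encoder, so features and driving policy are learned jointly.
features = BEVEncoder()(torch.zeros(1, 5, 84, 84))  # -> shape (1, 256)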

Reference
@inproceedings{
    li2022efficient,
    title={Efficient Learning of Safe Driving Policy via Human-AI Copilot Optimization},
    author={Quanyi Li and Zhenghao Peng and Bolei Zhou},
    booktitle={International Conference on Learning Representations},
    year={2022},
    url={https://openreview.net/forum?id=0cgU-BZp2ky}
}