Learning-based methods have achieved strong performance for quadrupedal locomotion. However, several challenges prevent quadrupeds from learning helpful indoor skills that require interaction with environments and humans: the lack of end-effectors for manipulation, limited semantic understanding when using only simulation data, and low traversability and reachability in indoor environments. We present a system for quadrupedal mobile manipulation in indoor environments. It uses a front-mounted gripper for object manipulation, a low-level controller trained in simulation using egocentric depth for agile skills like climbing and whole-body tilting, and pre-trained vision-language models (VLMs) with a third-person fisheye camera and an egocentric RGB camera for semantic understanding and command generation. We evaluate our system in two unseen environments without any real-world data collection or training. Our system generalizes zero-shot to these environments and completes tasks, such as following a user's command to fetch a randomly placed stuffed toy after climbing over a queen-sized bed, with a 60% success rate.
We welcome everyone to use the name 'DoggyBot' for their open-source quadruped projects. We hope that our open-source project can serve as a starting point and inspire a series of projects focused on legged locomotion and real-world interaction, all under the 'xxx DoggyBot' name.
We thank Unitree Robotics for hardware and firmware support. We appreciate the initial brainstorming, valuable discussions, and constructive feedback from Ziwen Zhuang and Xin Duan. We appreciate the long-term hardware and code support from, and discussions with, Huy Ha and Yihuai Gao. We appreciate the help with experiments from Ziang Cao, Tian-Ao Ren, and Hang Dong. We also appreciate discussions with Wenhao Yu and Erwin Coumans. This project is supported by the AI Institute and ONR grant N00014-21-1-2685. Zipeng Fu is supported by the Pierre and Christine Lamond Fellowship.
@inproceedings{wu2024helpful,
author = {Wu, Qi and Fu, Zipeng and Cheng, Xuxin and Wang, Xiaolong and Finn, Chelsea},
title = {Helpful DoggyBot: Open-World Object Fetching using Legged Robots and Vision-Language Models},
booktitle = {arXiv},
year = {2024},
}