GRID: Scene-Graph-based Instruction-driven Robotic Task Planning


Zhe Ni†, Xiao-Xin Deng†, Cong Tai†,
Xin-Yue Zhu, Qinghongbing Xie, Weihang Huang, Xiang Wu, Long Zeng∗

Tsinghua University

In this paper, we propose a novel approach called Graph-based Robotic Instruction Decomposer (GRID), which leverages scene graphs instead of images to perceive global scene information and iteratively plan subtasks for a given instruction. Our method encodes object attributes and relationships in graphs through an LLM and Graph Attention Networks, integrating instruction features to predict subtasks consisting of pre-defined robot actions and target objects in the scene graph. This strategy enables robots to acquire wide-ranging semantic knowledge of the environment from the scene graph. To train and evaluate GRID, we establish a dataset construction pipeline to generate synthetic datasets for graph-based robotic task planning. Experiments show that our method outperforms GPT-4 by over 25.4% in subtask accuracy and 43.6% in task accuracy.
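To make the iterative planning concrete, the minimal Python sketch below shows the predict-execute loop implied by the abstract. The names GridPlanner-style planner, predict_subtask, observe, execute, and the terminal action "finish" are illustrative assumptions, not the released interface.

# Minimal sketch of the iterative (action)-(object) planning loop.
# The planner/robot interfaces and the "finish" action are assumptions
# for illustration, not the released GRID API.
def plan_and_execute(planner, robot, instruction):
    """Repeatedly predict the next subtask and execute it until the task ends."""
    while True:
        # Perception yields the current scene graph and robot graph.
        scene_graph, robot_graph = robot.observe()
        action, target = planner.predict_subtask(instruction, scene_graph, robot_graph)
        if action == "finish":           # assumed terminal action
            return
        # Executing the subtask changes the scene, so the next prediction
        # naturally reflects real-time changes or human interference.
        robot.execute(action, target)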

Wide receptive field

We introduce scene graphs to promote instruction-driven robotic task planning, exploiting the graphs' ability to capture wide-perspective, semantically rich knowledge of the environment.

Accurate and powerful

We design a novel GAT-based network named GRID, which takes an instruction, a robot graph, and a scene graph as inputs and outperforms GPT-4 by over 43.6% in task accuracy.

Supporting dataset

We build a synthetic dataset construction pipeline to generate scene-graph datasets for instruction-driven robotic task planning (a sample sketch follows these highlights).

Lightweight network

The GRID network is lightweight and can be deployed locally without relying on large cloud models.
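As a concrete illustration of the graph inputs and the dataset mentioned in the highlights above, the Python literal below sketches what one training sample could look like. The field names and values are assumptions for illustration, not the released schema.

# Illustrative (not official) layout of one dataset sample: a scene graph,
# a robot graph, an instruction, and the ground-truth (action)-(object) subtask.
sample = {
    "scene_graph": {
        "nodes": [
            {"id": 0, "name": "table", "attributes": ["wooden"]},
            {"id": 1, "name": "cup", "attributes": ["red", "empty"]},
        ],
        "edges": [
            {"src": 1, "dst": 0, "relation": "on"},   # the cup is on the table
        ],
    },
    "robot_graph": {
        "location": "kitchen",   # where the robot currently is
        "near": [1],             # ids of nearby objects
        "grasped": [],           # ids of objects held by the gripper
    },
    "instruction": "Bring me the red cup.",
    "subtask": {"action": "grasp", "object": 1},      # next (action)-(object) pair
}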

Video 1:


Video 2:

Unlike previous works, we utilize scene graphs to represent the environment for robotic task planning. In a scene graph, objects and their relationships within a scene are structured as graph nodes and edges. To provide a precise portrayal of the robot's status, we separate the robot from the scene graph into a robot graph, which contains the robot's location, nearby objects, and grasped objects. We introduce GRID, a lightweight transformer-based model designed for deployment on offline embodied agents. As illustrated, the model takes the instruction, scene graph, and robot graph as inputs and determines the subtask for the robot to execute. In GRID, the instruction and both graphs are mapped into a unified latent space by a shared-weight LLM encoder named INSTRUCTOR. The encoded graph nodes and their relationships are then refined by graph attention network (GAT) modules. A cross-attention-based feature enhancer integrates the GAT outputs with the encoded instruction, and the enhanced features are fed into a transformer-based task decoder to yield the robot subtask. Compared with a holistic sentence form, expressing the subtask as an (action)-(object) pair requires far less enumeration to cover diverse object categories. GRID plans subtasks iteratively, enabling it to respond to real-time scene changes and human interference and thereby correct unfinished tasks.
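The pipeline above can be summarized as a single forward pass. The sketch below assumes PyTorch and the GATConv layer from torch_geometric; the text-encoder interface (encode_text), hidden size, attention head counts, layer depth, and the two output heads are illustrative assumptions, not the exact GRID implementation.

import torch
import torch.nn as nn
from torch_geometric.nn import GATConv  # graph attention network layer

class GridSketch(nn.Module):
    """Loose sketch of the GRID pipeline described above. Layer sizes, the
    encoder interface, and the output heads are assumptions for illustration."""

    def __init__(self, encode_text, d=768, n_actions=10):
        super().__init__()
        self.encode_text = encode_text              # shared-weight LLM encoder (e.g. INSTRUCTOR)
        self.gat_scene = GATConv(d, d)              # refine scene-graph node features
        self.gat_robot = GATConv(d, d)              # refine robot-graph node features
        self.enhancer = nn.MultiheadAttention(d, num_heads=8, batch_first=True)
        layer = nn.TransformerDecoderLayer(d, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.action_head = nn.Linear(d, n_actions)  # classify the pre-defined action
        self.object_head = nn.Linear(d, 1)          # score each graph node as the target

    def forward(self, instruction, scene_nodes, scene_edges, robot_nodes, robot_edges):
        # Map the instruction and both graphs into one latent space.
        instr = self.encode_text([instruction])                          # (1, d)
        s = self.gat_scene(self.encode_text(scene_nodes), scene_edges)   # (Ns, d)
        r = self.gat_robot(self.encode_text(robot_nodes), robot_edges)   # (Nr, d)
        graph = torch.cat([s, r], dim=0).unsqueeze(0)                    # (1, N, d)
        # Cross-attention-based feature enhancer: graph queries attend to the instruction.
        enhanced, _ = self.enhancer(graph, instr.unsqueeze(0), instr.unsqueeze(0))
        out = self.decoder(enhanced, instr.unsqueeze(0))                 # (1, N, d)
        action_logits = self.action_head(out.mean(dim=1))                # which action
        object_scores = self.object_head(out).squeeze(-1)                # which node
        return action_logits, object_scores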

Network Structure:


Feature Enhancer & Task Decoder:


Simulation system setup:


Real-robot deployment system setup:


Generating and updating the scene graph in Unity:


We compare our model with GPT-4 and validate the effectiveness of two key modules through ablation studies. We also explore the generalization ability of the model across different datasets. Finally, we demonstrate our approach both in a simulation environment and in the real world.

Comparison between GRID and LLM-As-Planner on datasets with different numbers of objects in each scene:


Ablation experiment:


Response from the GPT-4 API:


Instruction-subtask examples:


Collaborating institutions

We thank the following institutions for their support and assistance.


If you use the data or code, please cite:

@article{ni2024grid,
  title   = {GRID: Scene-Graph-based Instruction-driven Robotic Task Planning},
  author  = {Zhe Ni and Xiaoxin Deng and Cong Tai and Xinyue Zhu and Qinghongbing Xie and Weihang Huang and Xiang Wu and Long Zeng},
  journal = {arXiv preprint arXiv:2309.07726},
  year    = {2024}
}

Full paper