GRID: Scene-Graph-based Instruction-driven Robotic Task Planning


Zhe Ni†, Xiao-Xin Deng†, Cong Tai†,
Xin-Yue Zhu, Qinghongbing Xie, Weihang Huang, Xiang Wu, Long Zeng∗

Tsinghua University

In this paper, we propose a novel approach called Graph-based Robotic Instruction Decomposer (GRID), which leverages scene graphs instead of images to perceive global scene information and iteratively plan subtasks for a given instruction. Our method encodes object attributes and relationships in graphs through an LLM and Graph Attention Networks, integrating instruction features to predict subtasks consisting of pre-defined robot actions and target objects in the scene graph. This strategy enables robots to acquire wide-ranging semantic knowledge of the environment from the scene graph. To train and evaluate GRID, we establish a dataset construction pipeline to generate synthetic datasets for graph-based robotic task planning. Experiments show that our method outperforms GPT-4 by over 25.4% in subtask accuracy and 43.6% in task accuracy.
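To make the iterative planning concrete, the minimal Python sketch below shows the predict-execute loop implied by the abstract. The names GridPlanner-style planner, predict_subtask, observe, execute, and the terminal action "finish" are illustrative assumptions, not the released interface.

# Minimal sketch of the iterative (action)-(object) planning loop.
# The planner/robot interfaces and the "finish" action are assumptions
# for illustration, not the released GRID API.
def plan_and_execute(planner, robot, instruction):
    """Repeatedly predict the next subtask and execute it until the task ends."""
    while True:
        # Perception yields the current scene graph and robot graph.
        scene_graph, robot_graph = robot.observe()
        action, target = planner.predict_subtask(instruction, scene_graph, robot_graph)
        if action == "finish":           # assumed terminal action
            return
        # Executing the subtask changes the scene, so the next prediction
        # naturally reflects real-time changes or human interference.
        robot.execute(action, target)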

Wide receptive field

We introduce scene graphs to promote instruction-driven robotic task planning, exploiting the graphs' ability to capture wide-perspective, semantically rich knowledge of the environment.

Accurate and powerful

We design a novel GAT-based network named GRID, which takes an instruction, a robot graph, and a scene graph as inputs and outperforms GPT-4 by over 43.6% in task accuracy.

Supporting dataset

We build a synthetic dataset construction pipeline to generate scene-graph datasets for instruction-driven robotic task planning (a sample sketch follows these highlights).

Lightweight network

The GRID network is lightweight and can be deployed locally without relying on large cloud models.
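As a concrete illustration of the graph inputs and the dataset mentioned in the highlights above, the Python literal below sketches what one training sample could look like. The field names and values are assumptions for illustration, not the released schema.

# Illustrative (not official) layout of one dataset sample: a scene graph,
# a robot graph, an instruction, and the ground-truth (action)-(object) subtask.
sample = {
    "scene_graph": {
        "nodes": [
            {"id": 0, "name": "table", "attributes": ["wooden"]},
            {"id": 1, "name": "cup", "attributes": ["red", "empty"]},
        ],
        "edges": [
            {"src": 1, "dst": 0, "relation": "on"},   # the cup is on the table
        ],
    },
    "robot_graph": {
        "location": "kitchen",   # where the robot currently is
        "near": [1],             # ids of nearby objects
        "grasped": [],           # ids of objects held by the gripper
    },
    "instruction": "Bring me the red cup.",
    "subtask": {"action": "grasp", "object": 1},      # next (action)-(object) pair
}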

Video 1:


Video 2:

Unlike previous works, we utilize scene graphs to represent the environment for robotic task planning. In a scene graph, objects and their relationships within a scene are structured as graph nodes and edges. To provide a precise portrayal of the robot's status, we separate the robot from the scene graph into a robot graph, which contains the robot's location, nearby objects, and grasped objects. We introduce GRID, a lightweight transformer-based model designed for deployment on offline embodied agents. As illustrated, the model takes the instruction, scene graph, and robot graph as inputs and determines the subtask for the robot to execute. In GRID, the instruction and both graphs are mapped into a unified latent space by a shared-weight LLM encoder named INSTRUCTOR. The encoded graph nodes and their relationships are then refined by graph attention network (GAT) modules. A cross-attention-based feature enhancer integrates the GAT outputs with the encoded instruction, and the enhanced features are fed into a transformer-based task decoder to yield the robot subtask. Compared with a holistic sentence form, expressing the subtask as an (action)-(object) pair requires far less enumeration to cover diverse object categories. GRID plans subtasks iteratively, enabling it to respond to real-time scene changes and human interference and thereby correct unfinished tasks.
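The pipeline above can be summarized as a single forward pass. The sketch below assumes PyTorch and the GATConv layer from torch_geometric; the text-encoder interface (encode_text), hidden size, attention head counts, layer depth, and the two output heads are illustrative assumptions, not the exact GRID implementation.

import torch
import torch.nn as nn
from torch_geometric.nn import GATConv  # graph attention network layer

class GridSketch(nn.Module):
    """Loose sketch of the GRID pipeline described above. Layer sizes, the
    encoder interface, and the output heads are assumptions for illustration."""

    def __init__(self, encode_text, d=768, n_actions=10):
        super().__init__()
        self.encode_text = encode_text              # shared-weight LLM encoder (e.g. INSTRUCTOR)
        self.gat_scene = GATConv(d, d)              # refine scene-graph node features
        self.gat_robot = GATConv(d, d)              # refine robot-graph node features
        self.enhancer = nn.MultiheadAttention(d, num_heads=8, batch_first=True)
        layer = nn.TransformerDecoderLayer(d, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.action_head = nn.Linear(d, n_actions)  # classify the pre-defined action
        self.object_head = nn.Linear(d, 1)          # score each graph node as the target

    def forward(self, instruction, scene_nodes, scene_edges, robot_nodes, robot_edges):
        # Map the instruction and both graphs into one latent space.
        instr = self.encode_text([instruction])                          # (1, d)
        s = self.gat_scene(self.encode_text(scene_nodes), scene_edges)   # (Ns, d)
        r = self.gat_robot(self.encode_text(robot_nodes), robot_edges)   # (Nr, d)
        graph = torch.cat([s, r], dim=0).unsqueeze(0)                    # (1, N, d)
        # Cross-attention-based feature enhancer: graph queries attend to the instruction.
        enhanced, _ = self.enhancer(graph, instr.unsqueeze(0), instr.unsqueeze(0))
        out = self.decoder(enhanced, instr.unsqueeze(0))                 # (1, N, d)
        action_logits = self.action_head(out.mean(dim=1))                # which action
        object_scores = self.object_head(out).squeeze(-1)                # which node
        return action_logits, object_scores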

Network Structure:


Feature Enhancer & Task Decoder:


Simulation system setup:


Real-robot deployment system setup:


Generating and updating the scene graph in Unity:


We compare our model with GPT-4 and validate the effectiveness of two key modules through ablation studies. We also explore the generalization ability of the model across different datasets. Finally, we demonstrate our approach both in a simulation environment and in the real world.

Comparison between GRID and LLM-As-Planner on datasets with different numbers of objects in each scene:


Ablation experiment:


Response from the GPT-4 API:


Instruction-subtask examples:


Collaborating institutions

We thank the following institutions for their support and assistance.


If you use the data or code, please cite:

@article{ni2024grid,
  title   = {GRID: Scene-Graph-based Instruction-driven Robotic Task Planning},
  author  = {Zhe Ni and Xiaoxin Deng and Cong Tai and Xinyue Zhu and Qinghongbing Xie and Weihang Huang and Xiang Wu and Long Zeng},
  journal = {arXiv preprint arXiv:2309.07726},
  year    = {2024}
}

Full paper