Introduction

This project was completed as part of the Deep Learning for Robotics course at Georgia Tech.

While humans naturally follow an act-see-reason-act paradigm, current robot planning approaches often cannot incorporate feedback from their actions or recover from failures. This kind of long-horizon task planning is largely missing in robots.

In this project, we propose a novel approach that combines large language models (LLMs) with a visual reasoning system (VRS) inspired by the human visual cortex.

Goals

Consider a discretized system of n steps from a start state s_0 to a goal state g(L). We consider the problem of embodied task planning, where an agent must generate a sequence of n discrete low-level actions a_k in order to accomplish a high-level natural-language instruction L given in English, such as "Place all fruits in the bowl".

At the start of every action a_k, the agent has access to an observation o_k of the world state s_k in the form of an RGB image. The agent must find a policy π that, given an instruction L, maps observations to actions that drive the state toward s_n = g(L).
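A minimal sketch of this closed-loop formulation in Python, with placeholder callables for the environment and policy (these names are illustrative and do not come from our actual codebase):

```python
from typing import Any, Callable

def run_episode(
    instruction: str,                        # high-level instruction L
    observe: Callable[[], Any],              # returns an RGB observation o_k of state s_k
    policy: Callable[[str, Any], Any],       # pi: (L, o_k) -> low-level action a_k
    execute: Callable[[Any], None],          # applies a_k in the environment
    goal_reached: Callable[[str], bool],     # checks whether the state matches g(L)
    max_steps: int = 20,
) -> bool:
    """Roll out the policy until the goal g(L) is reached or the step budget runs out."""
    for _ in range(max_steps):
        o_k = observe()
        a_k = policy(instruction, o_k)
        execute(a_k)
        if goal_reached(instruction):
            return True
    return False
```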

The system was built on the Ravens simulator, performing pick-and-place tabletop tasks. The goal was to have the robotic manipulator perform tasks such as stacking blocks on top of each other or placing blocks in bowls.

Work Done

Our system consists of 4 parts:
1. Controller - Central hub that coordinates actions
2. Actor - Executes low-level actions
3. VRS - Visual reasoning module that senses the scene
4. Planner - Generates code-plans that break the task down

The controller was a locally executed script that sequenced each action using predetermined checks. The actor was taken directly from CLIPORT, which combines CLIP encodings with Transporter networks to execute actions from language embeddings. The code-plan was generated by an LLM, which broke the larger task down into simple steps, while the multimodal OFA model served as the visual reasoning module.
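As a rough illustration of how the controller ties these four modules together, here is a simplified sketch; `plan_steps`, `act`, and `verify` are hypothetical stand-ins for the planner, actor, and VRS interfaces, not the real CLIPORT or OFA APIs.

```python
from typing import Callable, List

def controller(
    task: str,
    plan_steps: Callable[[str], List[str]],  # Planner: LLM breaks the task into single-step sub-tasks
    act: Callable[[str], None],              # Actor: CLIPORT-style execution of one sub-task
    verify: Callable[[str], bool],           # VRS: OFA-style check of the outcome from an RGB image
    max_retries: int = 20,
) -> bool:
    """Execute each sub-task, retrying on failed visual checks, up to a retry budget."""
    for step in plan_steps(task):
        for _ in range(max_retries + 1):
            act(step)                        # execute the sub-task
            if verify(step):                 # visual feedback: did it actually succeed?
                break                        # move on to the next sub-task
        else:
            return False                     # retry budget exhausted -> task counted as a failure
    return True
```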

My task was fine-tuning the OFA module. Despite its high accuracy on the VQAv2 benchmark, it struggled on our data with complex spatial questions involving multiple objects.

Generating training data was a challenge, as we needed a way to caption each image with multiple questions and corresponding answers. We decided to use the CLEVR dataset generation pipeline, which is built on top of the Blender API. It generates random images with various spatial configurations along with corresponding question-answer pairs for training.
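To give a flavor of the template-based generation, here is a simplified sketch; the scene dictionary format and the single template below are illustrative assumptions, not the actual CLEVR/Blender pipeline code.

```python
import random

def generate_qa(scene: dict, num_questions: int = 5) -> list:
    """Fill a question template with attributes sampled from the scene and
    answer it from the scene's ground-truth relations (not the rendered image)."""
    qa_pairs = []
    for _ in range(num_questions):
        block = random.choice(scene["blocks"])
        bowl = random.choice(scene["bowls"])
        question = f"Is there a {block['color']} block in the {bowl['color']} bowl?"
        answer = "yes" if any(
            b["color"] == block["color"] and b.get("in_bowl") == bowl["color"]
            for b in scene["blocks"]
        ) else "no"
        qa_pairs.append({"question": question, "answer": answer})
    return qa_pairs

# Example scene: two blocks and two bowls, with one block already placed in a bowl.
scene = {
    "blocks": [{"color": "red", "in_bowl": "blue"}, {"color": "green", "in_bowl": None}],
    "bowls": [{"color": "blue"}, {"color": "yellow"}],
}
print(generate_qa(scene))
```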


Results

We generated over 1,000 images and 5,000+ corresponding question-answer pairs by modifying the pipeline for our use case. This involved many combinations of block colors, bowl colors, stack sizes, and spatial placements. The questions were also varied, drawing on 5-6 general question templates that challenged the model to generalize.

We fine-tuned the model on a weighted mixture of this dataset and VQAv2. The resulting model reached about 50% accuracy on our tasks, comparable to its benchmark accuracy, up from 0% before fine-tuning.
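One common way to implement such a weighted mixture is a per-sample weighted sampler. The sketch below uses dummy tensors in place of the two datasets and an illustrative 50/50 weighting, so it shows the sampling idea rather than our exact training setup.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset, WeightedRandomSampler

# Dummy stand-ins for our generated QA data and VQAv2 (real training used the actual datasets).
ours = TensorDataset(torch.zeros(100, 3))
vqa = TensorDataset(torch.ones(1000, 3))

mix = ConcatDataset([ours, vqa])
# Per-sample weights so each dataset contributes ~50% of every batch, regardless of its size.
weights = [0.5 / len(ours)] * len(ours) + [0.5 / len(vqa)] * len(vqa)
sampler = WeightedRandomSampler(weights, num_samples=len(mix), replacement=True)
loader = DataLoader(mix, batch_size=32, sampler=sampler)

for (batch,) in loader:
    pass  # a fine-tuning step on the mixed batch would go here
```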

For the final part, we combined this visual feedback with our code-plan module and compared the results against a baseline without it. We tested three different tasks over 15 experiment iterations each, counting more than 20 retries as a failure. We observed a considerable increase in success rate with our method, since the robot uses visual feedback to correct its wrong actions.

By following a cyclic three-stage process of breaking tasks down into single-step sub-tasks, executing these sub-tasks while generating feedback, and dynamically re-planning when necessary, our method enables robots to recover swiftly from failures and successfully accomplish long-horizon tasks.
