Leverage Large Language Models for Complex Robot Manipulation - Center of Excellence in Data Science and Artificial Intelligence

PI Researcher: Chenliang Xu
Company Partner: Corning

In robot manipulation, adapting to dynamic environments with flexible task specifications is challenging. Language-based vision manipulation systems offer a solution by linking language instructions to visual data and generating actions. However, current approaches often develop vision models and action policies separately, leading to poor integration. To address this, we propose ACTLLM, a method that unifies visual interpretation and policy learning using large language models (LLMs). By generating structured scene descriptions and incorporating an action consistency loss, ACTLLM is expected to enhance the fusion of visual and policy elements, facilitating the efficient execution of complex tasks within a multi-turn visual dialogue framework.