OpenNav: Open-World Outdoor Navigation with Multimodal Large Language Models

University of Toronto

Key Features

  • Open-World Outdoor Navigation (OpenNav) is the first framework to enable zero-shot outdoor navigation by directly generating trajectories with an MLLM through a single task-agnostic prompt, without pre-trained skills, motion primitives, or in-context examples, allowing navigation with open-set instructions and objects (a minimal prompt sketch follows this list).
  • Multi-Expert System for Robust Scene Comprehension. By integrating state-of-the-art MLLMs with an open-vocabulary perception system (OVPS), OpenNav enhances environmental perception granularity, ensuring accurate interpretation of free-form language instructions while maintaining robustness against detector misdetections.
  • Turning Language Instructions into Robot Actions (Trajectory Level). OpenNav combines the reasoning, code generation, and function-calling abilities of MLLMs with classical planning techniques, harnessing the benefits of both human-like reasoning and geometry-compliant trajectory synthesis.
  • Evaluation in Autonomous Vehicle Datasets (AVDs): We validate OpenNav's performance using AVDs, offering a new approach to studying embodied intelligence with rich, labeled data from real-world navigation tasks.
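Below is a minimal sketch of what driving zero-shot trajectory generation from a single task-agnostic prompt could look like. The prompt text, the `query_mllm` wrapper, and the waypoint output format are illustrative assumptions, not OpenNav's released prompt or API.

```python
import ast

# Minimal sketch (not OpenNav's code): one task-agnostic prompt, no in-context
# examples, no pre-trained skills. `query_mllm` is a hypothetical wrapper
# around any multimodal chat API and must be filled in by the user.

TASK_AGNOSTIC_PROMPT = """\
You are a navigation agent for an outdoor mobile robot.
You receive a bird's-eye-view semantic map, an occupancy map, and a free-form
instruction. Reason about the instruction and the scene, then output a coarse
trajectory as a Python list of (x, y) waypoints in map coordinates.
Do not assume any predefined skills or motion primitives.
"""

def query_mllm(system_prompt: str, instruction: str, images: list[bytes]) -> str:
    """Hypothetical MLLM call; replace with your provider's multimodal API."""
    raise NotImplementedError

def generate_coarse_trajectory(instruction: str,
                               bev_images: list[bytes]) -> list[tuple[float, float]]:
    reply = query_mllm(TASK_AGNOSTIC_PROMPT, instruction, bev_images)
    # The prompt asks the MLLM to answer with a literal Python list of waypoints.
    return ast.literal_eval(reply)
```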
OpenNav Architecture

Overview of OpenNav. Given the posed RGB-LiDAR observation of the environment and an open-set free-form language instruction, 1) we leverage task-agnostic prompts to enable zero-shot generalization and adaptability to varied instructions; 2) the MLLM generates code, which interacts with the OVPS, to produce open-set multimodal scene perception outputs and a 2D bird's-eye-view (BEV) value map (consisting of a semantic map and an occupancy map) grounded in the operating environment; 3) the MLLM synthesizes a human-like coarse trajectory based on the instruction, scene understanding, and its reasoning capabilities. The generated BEV value map then serves as the objective function for the motion planner, which refines the trajectory to ensure geometry-compliant navigation. Please see the next figure for the detailed pipeline, inputs, and outputs of the OVPS.
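To make the data flow above concrete, here is an illustrative orchestration sketch under the assumptions of this overview. The `ovps`, `mllm`, and `planner` objects and their method names are placeholders for the corresponding OpenNav components, not its actual interfaces.

```python
import numpy as np

def run_opennav(rgb, lidar_points, instruction, mllm, ovps, planner):
    """Illustrative control flow; the arguments are assumed objects exposing
    the methods used below."""
    # 1) Open-vocabulary perception: detections, masks, captions, BEV maps.
    percepts = ovps.perceive(rgb, lidar_points)

    # 2) BEV value map: hard occupancy cost combined with an
    #    instruction-conditioned semantic cost.
    occupancy = percepts["occupancy_map"]       # HxW, 1 = obstacle
    semantic = percepts["semantic_cost_map"]    # HxW, set per instruction
    value_map = np.maximum(occupancy * 1e3, semantic)

    # 3) The MLLM proposes a coarse, human-like trajectory from the scene summary.
    coarse_waypoints = mllm.propose_trajectory(instruction, percepts["scene_summary"])

    # 4) A classical motion planner refines the coarse trajectory, using the
    #    value map as its objective, to obtain a geometry-compliant path.
    return planner.refine(coarse_waypoints, value_map)
```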

OVPS Architecture

Overview of the Open-Vocabulary Perception System (OVPS). OVPS sequentially performs detection, segmentation, and object caption generation. Combined with 3D point clouds, the system generates 1) multimodal observations for VLN, including text and image prompts for MLLMs, and 2) a 3D reconstructed map, as well as 2D occupancy and semantic maps for trajectory refinement.
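The following is a simplified sketch of how segmented RGB observations and LiDAR points could be fused into the 2D occupancy and semantic maps described above. It assumes points are already expressed in a ground-aligned robot frame (x forward, y left, z up) with precomputed image projections, and uses a crude height threshold for occupancy; the open-vocabulary detector, segmenter, and captioner themselves are left abstract and may differ from the models OpenNav uses.

```python
import numpy as np

def build_bev_maps(points_xyz, pixel_uv, masks, grid_res=0.2, grid_size=200,
                   obstacle_height=0.3):
    """points_xyz: (N, 3) LiDAR points in a ground-aligned frame;
    pixel_uv: (N, 2) image projections of those points;
    masks: list of HxW boolean instance masks from the open-vocab segmenter."""
    occupancy = np.zeros((grid_size, grid_size), np.uint8)
    semantic = np.full((grid_size, grid_size), -1, np.int32)   # -1 = unlabeled
    img_h, img_w = masks[0].shape if masks else (0, 0)

    for (x, y, z), (u, v) in zip(points_xyz, pixel_uv):
        gx = int(x / grid_res)                      # forward axis
        gy = int(y / grid_res) + grid_size // 2     # lateral axis, centered
        if not (0 <= gx < grid_size and 0 <= gy < grid_size):
            continue
        if z > obstacle_height:                     # crude height-based occupancy
            occupancy[gx, gy] = 1
        u, v = int(u), int(v)
        if 0 <= v < img_h and 0 <= u < img_w:
            for i, mask in enumerate(masks):        # label cell with the first
                if mask[v, u]:                      # instance mask it falls in
                    semantic[gx, gy] = i
                    break
    return occupancy, semantic
```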

OpenNav Demo

Given a free-form language instruction and sensor observations, OpenNav generates a dense sequence of instruction-following, scene-compliant robot waypoints in a zero-shot manner for open-world navigation, handling open-set objects and open-set instructions without relying on in-context examples or pre-trained skills.

OpenNav Demo

Selected examples demonstrating how OpenNav uses value maps to generate task-aligned and geometry-compliant navigation trajectories. Left: user-specified tasks and trajectories generated by OpenNav; Right: value maps illustrating trajectories generated by different algorithms for the given tasks.
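To illustrate how a value map can serve as the objective for trajectory refinement, here is a toy sketch: the cost of a waypoint sequence is the accumulated value-map cost along it, and a simple local search nudges intermediate waypoints to reduce that cost. This stand-in local search is only for illustration and is not the motion planner OpenNav actually employs.

```python
import numpy as np

def path_cost(waypoints, value_map, grid_res=0.2):
    """Sum the value map along linearly interpolated segments of the path.
    Waypoints are assumed to be in the same map frame as the value map."""
    cost = 0.0
    for (x0, y0), (x1, y1) in zip(waypoints[:-1], waypoints[1:]):
        for t in np.linspace(0.0, 1.0, 20):
            gx = int(np.clip((x0 + t * (x1 - x0)) / grid_res, 0, value_map.shape[0] - 1))
            gy = int(np.clip((y0 + t * (y1 - y0)) / grid_res, 0, value_map.shape[1] - 1))
            cost += value_map[gx, gy]
    return cost

def refine(waypoints, value_map, step=0.2, iters=50):
    """Greedy local search over intermediate waypoints; start and goal stay fixed."""
    wps = [list(w) for w in waypoints]
    for _ in range(iters):
        for i in range(1, len(wps) - 1):
            best = path_cost(wps, value_map)
            for dx, dy in [(step, 0), (-step, 0), (0, step), (0, -step)]:
                wps[i][0] += dx; wps[i][1] += dy
                if path_cost(wps, value_map) < best:
                    best = path_cost(wps, value_map)   # keep the improving move
                else:
                    wps[i][0] -= dx; wps[i][1] -= dy   # revert
    return [tuple(w) for w in wps]
```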

OpenNav Demo

This demo illustrates how language instructions update the value map, via the occupancy map and semantic map, to generate instruction-aligned trajectories. In the first task, the instruction is to move straight for 20 meters, so flat road areas receive a lower cost. In the second task, the instruction requires avoiding white-lettered areas on the ground, so these regions are assigned a higher cost. In the third task, the instruction specifies avoiding shaded sidewalk areas, which are likewise assigned a higher cost.
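A toy sketch of this instruction-conditioned value-map update is shown below: semantic classes mentioned in the instruction receive extra cost (or a small reward), and obstacles from the occupancy map are near-forbidden. The class names, weights, and map regions here are made up for illustration.

```python
import numpy as np

def build_value_map(occupancy, semantic, class_names, instruction_weights,
                    base_cost=1.0):
    """occupancy: HxW in {0, 1}; semantic: HxW class ids (-1 = unlabeled);
    instruction_weights: {class_name: extra_cost}, e.g. chosen by the MLLM."""
    value = np.full(occupancy.shape, base_cost, dtype=np.float32)
    value[occupancy == 1] = 1e3                    # obstacles are near-forbidden
    for cls_id, name in enumerate(class_names):
        if name in instruction_weights:
            value[semantic == cls_id] += instruction_weights[name]
    return value

# "Avoid the white-lettered areas on the ground": that class is penalized,
# while flat road keeps a comparatively low cost (hypothetical regions/weights).
occupancy = np.zeros((100, 100), np.uint8)
semantic = np.full((100, 100), -1, np.int32)
semantic[40:60, 20:40] = 0                         # white road lettering
semantic[:, 50:] = 1                               # flat road
value_map = build_value_map(
    occupancy, semantic,
    class_names=["white road lettering", "flat road"],
    instruction_weights={"white road lettering": 50.0, "flat road": -0.5},
)
```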