VLM-empowered Multi-mode System for Efficient and Safe Planetary Navigation

Sinuo Cheng1, Ruyi Zhou1, Wenhao Feng1, Huaiguang Yang1, Haibo Gao1, Zongquan Deng1, *Liang Ding1

Abstract

The increasingly complex and diverse environments of planetary exploration require more adaptable and flexible rover navigation strategies. In this study, we propose a VLM-empowered multi-mode system that achieves efficient yet safe autonomous navigation for planetary rovers. A Vision-Language Model (VLM) parses scene information from image inputs to achieve a human-level understanding of terrain complexity. Based on the complexity classification, the system switches to the most suitable navigation mode, composed of perception, mapping, and planning modules designed for a specific terrain type, to traverse the terrain ahead until it reaches the next waypoint. By integrating the local navigation system with a map server and a global waypoint generation module, the rover is equipped to handle long-distance navigation tasks in complex scenarios. The navigation system is evaluated in various simulation environments. Compared to the single-mode conservative navigation method, our multi-mode system improves time and energy efficiency in a long-distance traversal with varied types of obstacles, enhancing efficiency by 79.5% while maintaining its avoidance capabilities against terrain hazards to guarantee rover safety.

Video

System Description

The local navigation system utilizes a VLM terrain classifier and three navigation methods tailored to different terrains: flat, rocky, and challenging. Terrain complexity is determined from RGB images by analyzing slope and rock distribution. Three distinct navigation strategies are designed and adopted, and a closed-loop navigation system is established that dynamically adapts to different terrains.
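The switching step described above can be illustrated with a minimal sketch. It assumes the VLM classifier returns a free-text answer containing one of the three terrain labels. The names `Terrain`, `NavigationMode`, `parse_label`, and `select_mode` are hypothetical, and the fallback to the conservative mode for unrecognized answers is an assumption of this sketch, not a documented behavior of the system.

```python
from dataclasses import dataclass
from enum import Enum


class Terrain(Enum):
    FLAT = "flat"
    ROCKY = "rocky"
    CHALLENGING = "challenging"


@dataclass(frozen=True)
class NavigationMode:
    name: str
    summary: str  # short description taken from the mode descriptions


# One navigation mode per terrain class, as described on this page.
MODES = {
    Terrain.FLAT: NavigationMode(
        "efficient", "no onboard perception, smooth path"),
    Terrain.ROCKY: NavigationMode(
        "safe", "rock detection, local obstacle map, lower speed"),
    Terrain.CHALLENGING: NavigationMode(
        "conservative", "elevation mapping, 2.5D costmap, A* planning"),
}


def parse_label(vlm_answer: str) -> Terrain:
    """Map a free-text classifier answer to a terrain class.

    Assumption: answers that match no label fall back to CHALLENGING,
    so that the most cautious mode is used when the answer is unclear.
    """
    text = vlm_answer.strip().lower()
    for terrain in Terrain:
        if terrain.value in text:
            return terrain
    return Terrain.CHALLENGING


def select_mode(vlm_answer: str) -> NavigationMode:
    """Pick the navigation mode for the terrain ahead."""
    return MODES[parse_label(vlm_answer)]
```

In a closed loop, `select_mode` would be called once per waypoint segment with the classifier's answer for the current camera image, and the returned mode would run until the next waypoint is reached.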

Multi-mode Planetary Navigation Framework.

Efficient Mode

For flat terrain, we introduce an efficient mode that skips onboard perception and complex planning and simply generates a smooth path, maximizing efficiency.

Safe Mode

For rocky terrain, the safe mode performs rock detection to construct a local obstacle map in real time and plans a path around the obstacles at a lower speed.

Conservative Mode

For challenging terrain, we use elevation mapping to generate a 2.5D costmap. A* planning over the costmap, combined with a conservative speed, ensures safe traversal.
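As an illustration of A* planning over a costmap, the following is a generic sketch and not the system's actual planner. It assumes a 2D grid in which each cell holds a non-negative traversal cost and `math.inf` marks an untraversable cell, 4-connected motion, and a step cost of one plus the cost of the entered cell. The function name `astar` and the grid encoding are hypothetical.

```python
import heapq
import math


def astar(costmap, start, goal):
    """Return a list of (row, col) cells from start to goal, or None.

    costmap[r][c] is a non-negative traversal cost; math.inf means blocked.
    Each move costs 1 plus the cost of the cell being entered, so the
    Manhattan distance never overestimates the remaining cost.
    """
    rows, cols = len(costmap), len(costmap[0])

    def heuristic(cell):
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(heuristic(start), 0.0, start)]
    best_g = {start: 0.0}
    parent = {}

    while open_heap:
        _, g, cell = heapq.heappop(open_heap)
        if cell == goal:
            # Walk the parent links back to the start.
            path = [cell]
            while cell in parent:
                cell = parent[cell]
                path.append(cell)
            return path[::-1]
        if g > best_g.get(cell, math.inf):
            continue  # stale heap entry, a cheaper route was found already
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if not (0 <= nr < rows and 0 <= nc < cols):
                continue
            cell_cost = costmap[nr][nc]
            if math.isinf(cell_cost):
                continue  # blocked cell
            new_g = g + 1.0 + cell_cost
            if new_g < best_g.get(nxt, math.inf):
                best_g[nxt] = new_g
                parent[nxt] = cell
                heapq.heappush(open_heap, (new_g + heuristic(nxt), new_g, nxt))
    return None
```

With this encoding, high-cost cells such as steep slopes are avoided whenever a cheaper detour exists, while blocked cells are never entered.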

Results

Terrain Classification

The classification results show that the VLM approach outperforms the geometry-based method on rocky and challenging terrain (95% vs. 90% and 100% vs. 85% average accuracy, respectively), especially in ambiguous scenarios. Both methods reach 100% on flat terrain.

| Terrain Type | Geometry-based: Avg. Rock Grid Num. | Geometry-based: Avg. Slope Value | Geometry-based: Avg. Slope Variance | VLM: Avg. Rock Complexity | VLM: Avg. Slope Complexity | Avg. Accuracy (Geometry) | Avg. Accuracy (VLM) |
| Flat         | 0      | 2.6815 | 4.67     | 0.045 | 0.095 | 100% | 100% |
| Rocky        | 404.65 | 5.5095 | 141.3865 | 0.59  | 0.2   | 90%  | 95%  |
| Challenging  | 444.35 | 30.118 | 273.3555 | 0.435 | 0.695 | 85%  | 100% |

Classification Scenes.

Single-mode Traversal

The efficient mode minimizes travel time in obstacle-free areas but fails in complex environments due to a lack of obstacle detection. The safe mode avoids obstacles effectively but misinterprets terrain features as hazards in challenging landscapes. The conservative mode, though less efficient, ensures successful navigation across all terrains.

Multi-mode Planetary Navigation Framework.

Panels: Efficient Mode Failure, Safe Mode Failure, Conservative Mode.

Multi-mode Traversal

In complex environments, the multi-mode system adapts to terrain complexity by switching to the corresponding mode. This reduces traversal time to 55.7% of that of single-mode conservative navigation (602.6 vs. 1081.7) over a nearly identical distance, improving efficiency without compromising safety.

|          | Single-mode: Total | Multi-mode: Total | Multi-mode: Efficient | Multi-mode: Safe | Multi-mode: Conservative |
| Time     | 1081.7 | 602.6 | 144.9 | 158.6 | 299.1 |
| Distance | 413.7  | 411.8 | 215.1 | 94.7  | 102   |

Multi-mode.
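The headline figures can be checked against the traversal table with a short calculation. This sketch assumes the reported 79.5% efficiency gain is the ratio of single-mode to multi-mode traversal time minus one, a reading that is consistent with the numbers but is not stated explicitly. The table gives no units, so the per-mode speeds below are in distance units per time unit. The variable names are illustrative.

```python
# Values copied from the traversal table.
single_time = 1081.7
multi_time = {"efficient": 144.9, "safe": 158.6, "conservative": 299.1}
multi_dist = {"efficient": 215.1, "safe": 94.7, "conservative": 102.0}

# Per-mode times add up to the multi-mode total.
multi_total_time = sum(multi_time.values())

# Multi-mode traversal time as a fraction of the single-mode time.
time_ratio = multi_total_time / single_time

# Relative gain, under the assumed reading described above.
efficiency_gain = single_time / multi_total_time - 1

# Average speed in each mode (distance divided by time).
avg_speed = {mode: multi_dist[mode] / multi_time[mode] for mode in multi_time}

print(f"multi-mode total time: {multi_total_time:.1f}")
print(f"time ratio: {time_ratio:.1%}")
print(f"efficiency gain: {efficiency_gain:.1%}")
for mode, speed in avg_speed.items():
    print(f"{mode}: {speed:.2f}")
```

The per-mode speeds decrease from efficient to safe to conservative, which matches the intended trade-off between speed and caution in the three modes.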