VLM-Empowered Multi-Mode System for Efficient and Safe Planetary Navigation

The increasingly complex and diverse planetary exploration environment requires more adaptable and flexible rover navigation strategies. In this study, we propose a VLM-empowered multi-mode system to achieve efficient yet safe autonomous navigation for planetary rovers. A Vision-Language Model (VLM) parses scene information from image inputs to achieve a human-level understanding of terrain complexity. Based on the complexity classification, the system switches to the most suitable navigation mode, composed of perception, mapping, and planning modules designed for different terrain types, to traverse the terrain ahead before reaching the next waypoint. By integrating the local navigation system with a map server and a global waypoint generation module, the rover is equipped to handle long-distance navigation tasks in complex scenarios. The navigation system is evaluated in various simulation environments. Compared to the single-mode conservative navigation method, our multi-mode system improves time and energy efficiency in a long-distance traversal with varied types of obstacles, enhancing efficiency by 79.5% while maintaining its avoidance capabilities against terrain hazards to guarantee rover safety.

The local navigation system utilizes a VLM terrain classifier and three navigation methods tailored to different terrains: flat, rocky, and challenging. Terrain complexity is determined from RGB images by analyzing slope and rock distribution. Three distinct navigation strategies are designed and adopted, and a closed-loop navigation system is established that dynamically adapts to different terrains.
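
The terrain-to-mode switching described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the names `Terrain`, `Mode`, `select_mode`, and `run_segments` are hypothetical, the classifier is passed in as a plain callable standing in for the VLM, and the fallback to the conservative mode for unrecognized labels is an assumption.

```python
from enum import Enum
from typing import Callable, Iterable, List


class Terrain(Enum):
    FLAT = "flat"
    ROCKY = "rocky"
    CHALLENGING = "challenging"


class Mode(Enum):
    EFFICIENT = "efficient"
    SAFE = "safe"
    CONSERVATIVE = "conservative"


# Terrain-to-mode mapping taken from the description in the text.
MODE_FOR_TERRAIN = {
    Terrain.FLAT: Mode.EFFICIENT,
    Terrain.ROCKY: Mode.SAFE,
    Terrain.CHALLENGING: Mode.CONSERVATIVE,
}


def select_mode(label: str) -> Mode:
    """Map a classifier label to a navigation mode.

    Assumption: any label that is not one of the three terrain types
    falls back to the conservative mode.
    """
    try:
        return MODE_FOR_TERRAIN[Terrain(label.strip().lower())]
    except ValueError:
        return Mode.CONSERVATIVE


def run_segments(images: Iterable[object],
                 classify: Callable[[object], str]) -> List[Mode]:
    """Classify the terrain ahead of each waypoint and record the chosen mode.

    `classify` is a hypothetical stand-in for the VLM terrain classifier.
    """
    return [select_mode(classify(image)) for image in images]
```

Keeping the mapping in a single table makes the closed loop easy to audit: the classifier is queried once per segment, and the selected mode stays active until the next waypoint.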

Efficient Mode

For flat terrain, we introduce an efficient mode, which omits onboard perception and complex planning and generates a smooth path for efficiency.

Safe Mode

For rocky terrain, safe mode performs rock detection to construct a local obstacle map in real time and plans a path around obstacles at a lower speed.
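
As a rough illustration of how rock detections might be turned into a local obstacle map, the sketch below rasterizes circular rock footprints into a rover-centered occupancy grid. Everything here is an assumption made for the example: the function name `build_obstacle_map`, the `(x, y, radius)` detection format in meters, the grid size, the resolution, and the inflation margin are not taken from the source.

```python
from typing import List, Sequence, Tuple

Rock = Tuple[float, float, float]  # hypothetical format: (x, y, radius) in meters


def build_obstacle_map(rocks: Sequence[Rock],
                       size_m: float = 10.0,
                       resolution_m: float = 0.5,
                       inflation_m: float = 0.25) -> List[List[bool]]:
    """Return a square occupancy grid centered on the rover.

    A cell is marked occupied when its center lies within a rock's radius
    plus an inflation margin (an assumed safety buffer).
    """
    n = int(round(size_m / resolution_m))
    half = size_m / 2.0
    grid = [[False] * n for _ in range(n)]
    for row in range(n):
        for col in range(n):
            # Cell center expressed in the rover-centered frame.
            cx = (col + 0.5) * resolution_m - half
            cy = (row + 0.5) * resolution_m - half
            for rx, ry, radius in rocks:
                reach = radius + inflation_m
                if (cx - rx) ** 2 + (cy - ry) ** 2 <= reach ** 2:
                    grid[row][col] = True
                    break
    return grid
```

A local planner could then treat occupied cells as untraversable while the rover moves at reduced speed.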

Conservative Mode

For challenging terrain, we utilize elevation mapping to generate a 2.5D costmap. A* planning over the costmap, combined with a conservative speed, ensures safe traversal.
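
For readers unfamiliar with planning over a costmap, here is a minimal A* sketch on a 2D grid of traversal costs. It is a generic textbook variant, not the authors' planner: the 4-connected neighborhood, the `1 + cost` step cost, the `lethal` threshold, and the function name `astar` are all assumptions chosen for the example.

```python
import heapq
import math
from typing import Dict, List, Optional, Sequence, Tuple

Cell = Tuple[int, int]  # (row, col)


def astar(costmap: Sequence[Sequence[float]], start: Cell, goal: Cell,
          lethal: float = 1.0) -> Optional[List[Cell]]:
    """Plan a 4-connected path; cells with cost >= lethal are never entered.

    Each step costs 1 plus the cost of the cell being entered, so the
    Manhattan distance is an admissible heuristic for non-negative costs.
    """
    rows, cols = len(costmap), len(costmap[0])

    def heuristic(cell: Cell) -> float:
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(heuristic(start), 0.0, start)]
    best_g: Dict[Cell, float] = {start: 0.0}
    parent: Dict[Cell, Cell] = {}

    while open_heap:
        _, g, cell = heapq.heappop(open_heap)
        if cell == goal:
            # Walk the parent links back to the start.
            path = [cell]
            while cell in parent:
                cell = parent[cell]
                path.append(cell)
            return path[::-1]
        if g > best_g.get(cell, math.inf):
            continue  # stale heap entry, a cheaper route was already found
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if not (0 <= nr < rows and 0 <= nc < cols):
                continue
            cost = costmap[nr][nc]
            if cost >= lethal:
                continue
            new_g = g + 1.0 + cost
            if new_g < best_g.get(nxt, math.inf):
                best_g[nxt] = new_g
                parent[nxt] = cell
                heapq.heappush(open_heap, (new_g + heuristic(nxt), new_g, nxt))
    return None  # goal unreachable
```

In a setup like the one described, the cost of each cell would be derived from the elevation map (for example from slope or roughness), so the planner trades path length against terrain difficulty.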

The classification results show that the VLM approach outperforms the geometry-based method on rocky and challenging terrain, especially in ambiguous scenarios.

Terrain Type	Geometry: Avg. Rock Grid Num.	Geometry: Avg. Slope Value	Geometry: Avg. Slope Variance	VLM: Avg. Rock Complexity	VLM: Avg. Slope Complexity	Avg. Accuracy (Geometry)	Avg. Accuracy (VLM)
Flat	0	2.6815	4.67	0.045	0.095	100%	100%
Rocky	404.65	5.5095	141.3865	0.59	0.2	90%	95%
Challenging	444.35	30.118	273.3555	0.435	0.695	85%	100%

The efficient mode minimizes travel time in obstacle-free areas but fails in complex environments due to a lack of obstacle detection. The safe mode avoids obstacles effectively but misinterprets terrain features as hazards in challenging landscapes. The conservative mode, though less efficient, ensures successful navigation across all terrains.

Efficient Mode Failure

In complex environments, the multi-mode system adapts dynamically to terrain complexity by switching to the corresponding mode, reducing traversal time to 55.7% of that of single-mode conservative navigation and improving efficiency without compromising safety.

	Single-mode (Total)	Multi-mode (Total)	Multi-mode (Efficient)	Multi-mode (Safe)	Multi-mode (Conservative)
Time	1081.7	602.6	144.9	158.6	299.1
Distance	413.7	411.8	215.1	94.7	102
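
The reported percentages can be cross-checked against the traversal table with a few lines of arithmetic. The variable names are illustrative, and reading the 79.5% efficiency gain as the ratio of single-mode to multi-mode time minus one is an interpretation of the numbers, not a definition given in the source. The table does not state units, so speeds are in distance units per time unit.

```python
# Totals from the traversal table.
single_time, multi_time = 1081.7, 602.6

# Multi-mode time as a fraction of single-mode time (about 55.7%).
time_ratio = multi_time / single_time

# Relative gain in distance covered per unit time (about 79.5%),
# assuming near-equal path lengths in both runs.
speedup = single_time / multi_time - 1.0

# Per-mode (time, distance) pairs from the multi-mode run.
per_mode = {
    "efficient": (144.9, 215.1),
    "safe": (158.6, 94.7),
    "conservative": (299.1, 102.0),
}

# Average speed in each mode, in distance units per time unit.
avg_speed = {mode: dist / time for mode, (time, dist) in per_mode.items()}

print(f"time ratio: {time_ratio:.1%}, speedup: {speedup:.1%}")
for mode, speed in avg_speed.items():
    print(f"{mode}: {speed:.2f}")
```

The per-mode times and distances sum to the multi-mode totals, and the efficient mode covers more than half of the distance in under a quarter of the total time.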
