Testing our method in real-world environments.
Multi-floor and single-floor navigation with open-vocabulary target objects.
Deployable service and delivery robots struggle to reach object goals in multi-floor buildings, as existing systems fail due to single-floor assumptions and the need for offline, globally consistent maps. Multi-floor environments pose unique challenges, including cross-floor transitions and vertical spatial reasoning, especially when navigating unknown buildings. Object-Goal Navigation benchmarks such as HM3D and MP3D also capture this multi-floor reality, yet current methods lack support for online, floor-aware navigation. To bridge this gap, we propose ASCENT, an online framework for Zero-Shot Object-Goal Navigation that enables robots to operate without pre-built maps or retraining on new object categories. It introduces: (1) a Multi-Floor Abstraction module that dynamically constructs hierarchical representations with stair-aware obstacle mapping and cross-floor topology modeling, and (2) a Coarse-to-Fine Reasoning module that combines frontier ranking with LLM-driven contextual analysis for multi-floor navigation decisions. We evaluate ASCENT on the HM3D and MP3D benchmarks, where it outperforms state-of-the-art zero-shot approaches, and demonstrate real-world deployment on a quadruped robot.
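The coarse-to-fine decision in (2) can be pictured as a cheap ranking pass over candidate frontiers followed by an LLM query over the survivors. The sketch below is only an illustration of that idea, not the paper's implementation; `Frontier`, `score_frontier`, and `query_llm` are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Frontier:
    """A candidate exploration target on some floor (illustrative)."""
    floor_id: int
    position: tuple            # (x, y) cell in the floor's BEV map
    nearby_objects: List[str]  # detected object labels around the frontier

def score_frontier(f: Frontier, goal: str) -> float:
    """Coarse stage: a cheap heuristic score, e.g. how many nearby
    detections match the goal category (hypothetical scoring rule)."""
    return sum(1.0 for obj in f.nearby_objects if obj == goal) \
        + 0.1 * len(f.nearby_objects)

def choose_frontier(frontiers: List[Frontier], goal: str, query_llm) -> Frontier:
    """Keep the top-k frontiers by the coarse score, then ask an LLM to
    reason over their floor-level context (fine stage)."""
    top_k = sorted(frontiers, key=lambda f: score_frontier(f, goal), reverse=True)[:5]
    prompt = f"Goal: {goal}. Candidates: " + "; ".join(
        f"floor {f.floor_id}, near {f.nearby_objects}" for f in top_k)
    idx = query_llm(prompt)  # assumed to return an index into top_k
    return top_k[idx]
```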
Fig: Motivation of ASCENT. Unlike prior approaches that fail in multi-floor scenarios, our method enables online multi-floor navigation. By reasoning across floors, our policy succeeds in locating the goal and demonstrates a meaningful step forward in Zero-Shot Object-Goal Navigation.
Fig: Framework overview of ASCENT. The system takes RGB-D inputs (top-left) and outputs navigation actions (bottom-right). The Multi-Floor Abstraction module (top) builds intra-floor BEV maps and models inter-floor connectivity. The Coarse-to-Fine Reasoning module (bottom) uses an LLM for contextual reasoning across floors. Together, the two modules enable floor-aware, Zero-Shot Object-Goal Navigation.
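One way to make the hierarchical representation concrete is a set of per-floor bird's-eye-view grids linked by a floor-transition graph. The following is a minimal sketch under that assumption; `FloorMap`, `MultiFloorMap`, and `add_stair_link` are invented for illustration, and the grid size and resolution are arbitrary.

```python
import numpy as np
import networkx as nx

class FloorMap:
    """Intra-floor bird's-eye-view layers for a single floor (sketch)."""
    def __init__(self, size: int = 480, resolution_m: float = 0.05):
        self.resolution_m = resolution_m
        self.obstacles = np.zeros((size, size), dtype=bool)  # occupancy layer
        self.explored = np.zeros((size, size), dtype=bool)   # visited cells
        self.stairs = np.zeros((size, size), dtype=bool)     # stair-aware layer

class MultiFloorMap:
    """Per-floor BEV maps plus a cross-floor topology graph."""
    def __init__(self):
        self.floors = {}            # floor_id -> FloorMap
        self.topology = nx.Graph()  # nodes: floor ids, edges: stair links

    def get_floor(self, floor_id: int) -> FloorMap:
        if floor_id not in self.floors:
            self.floors[floor_id] = FloorMap()
            self.topology.add_node(floor_id)
        return self.floors[floor_id]

    def add_stair_link(self, lower: int, upper: int, cell: tuple):
        """Record a traversable stair connecting two floors at a BEV cell."""
        self.get_floor(lower)
        self.get_floor(upper)
        self.topology.add_edge(lower, upper, stair_cell=cell)
```

A graph of stair links lets a planner run an ordinary shortest-path search over floors before committing to a frontier on any single floor.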
HM3D and MP3D datasets. Metrics: SR (Success Rate) / SPL (Success weighted by Path Length). In the table, ✓ indicates that a method uses a vision or language model of the corresponding type; a dash indicates none is used.
| Setting | Method | Venue | Vision | Language | HM3D SR | HM3D SPL | MP3D SR | MP3D SPL |
|---|---|---|---|---|---|---|---|---|
| Learning-Based | | | | | | | | |
| Single-Floor | SemExp | NeurIPS'20 | - | - | 37.9 | 18.8 | 36.0 | 14.4 |
| | Aux | ICCV'21 | - | - | - | - | 30.3 | 10.8 |
| | PONI | CVPR'22 | - | - | - | - | 31.8 | 12.1 |
| | Habitat-Web | CVPR'22 | - | - | 41.5 | 16.0 | 35.4 | 10.2 |
| | RIM | IROS'23 | - | - | 57.8 | 27.2 | 50.3 | 17.0 |
| Multi-Floor | PIRLNav | CVPR'23 | - | - | 64.1 | 27.1 | - | - |
| | XGX | ICRA'24 | - | - | 72.9 | 35.7 | - | - |
| Zero-Shot | | | | | | | | |
| Single-Floor | ZSON | NeurIPS'22 | CLIP | - | 25.5 | 12.6 | 15.3 | 4.8 |
| | L3MVN | IROS'23 | - | ✓ | 50.4 | 23.1 | 34.9 | 14.5 |
| | SemUtil | RSS'23 | - | BERT | 54.0 | 24.9 | - | - |
| | CoW | CVPR'23 | CLIP | - | 32.0 | 18.1 | - | - |
| | ESC | ICML'23 | - | ✓ | 39.2 | 22.3 | 28.7 | 14.2 |
| | PSL | ECCV'24 | CLIP | - | 42.4 | 19.2 | - | - |
| | VoroNav | ICML'24 | BLIP | ✓ | 42.0 | 26.0 | - | - |
| | PixNav | ICRA'24 | LLaMA-Adapter | ✓ | 37.9 | 20.5 | - | - |
| | Trihelper | IROS'24 | ✓ | ✓ | 56.5 | 25.3 | - | - |
| | VLFM | ICRA'24 | BLIP-2 | - | 52.5 | 30.4 | 36.4 | 17.5 |
| | GAMap | NeurIPS'24 | CLIP | ✓ | 53.1 | 26.0 | - | - |
| | SG-Nav | NeurIPS'24 | LLaVA | ✓ | 54.0 | 24.9 | 40.2 | 16.0 |
| | InstructNav | CoRL'24 | - | ✓ | 58.0 | 20.9 | - | - |
| | UniGoal | CVPR'25 | LLaVA | LLaMA-2 | 54.0 | 24.9 | 41.0 | 16.4 |
| Multi-Floor | MFNP | ICRA'25 | ✓ | ✓ | 58.3 | 26.7 | 41.1 | 15.4 |
| | Ours | - | BLIP-2 | ✓ | 65.4 | 33.5 | 44.5 | 15.5 |
Tab: Quantitative Results on the OGN Task. This table compares supervised and zero-shot methods on the Object-Goal Navigation task on the HM3D and MP3D datasets, using Success Rate (SR) and Success weighted by Path Length (SPL), and highlights the state-of-the-art performance of our approach in open-vocabulary, multi-floor navigation scenarios.
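For reference, the SPL numbers above follow the standard definition used in embodied navigation (Anderson et al., 2018): success weighted by the ratio of the shortest-path length to the length of the path actually taken, averaged over episodes. A minimal sketch of that computation:

```python
def spl(successes, shortest_lengths, taken_lengths):
    """Success weighted by Path Length (Anderson et al., 2018).

    successes        -- list of 0/1 success flags per episode
    shortest_lengths -- geodesic distance from start to goal per episode
    taken_lengths    -- length of the path the agent actually traversed
    """
    total = 0.0
    for s, l, p in zip(successes, shortest_lengths, taken_lengths):
        total += s * (l / max(p, l))
    return total / len(successes)

# Example: two episodes, one success along a near-optimal path, one failure.
print(spl([1, 0], [5.0, 8.0], [6.0, 12.0]))  # ≈ 0.4167
```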