Stairway to Success:
Zero-Shot Floor-Aware Object-Goal Navigation
via LLM-Driven Coarse-to-Fine Exploration

Zeying Gong1       Rong Li1       Tianshuai Hu2       Ronghe Qiu1       Lingdong Kong3      
Lingfeng Zhang1       Yiyi Ding1       Leying Zhang4       Junwei Liang1,2,✉

1 The Hong Kong University of Science and Technology (Guangzhou)     2 The Hong Kong University of Science and Technology    
3 National University of Singapore     4 Sun Yat-sen University

  Overview Video


  Real-World Demonstrations

Testing our method in real-world environments.

  Multi-Floor Navigation

  Target Object: Potted Plant (Upstairs Navigation), 5x Speed

  Target Object: Truck (Downstairs Navigation), 5x Speed

  Single-Floor Navigation

  Target Object: Chair, 5x Speed



  Simulation Demonstrations

Multi-floor and single-floor navigation with open-vocabulary target objects.

  Multi-Floor Navigation

  Target Object: Bed

  Target Object: Table

  Target Object: Couch

  Target Object: Chair

  Single-Floor Navigation

  Target Object: Fireplace

  Target Object: Nightstand


  Abstract

Object-Goal Navigation (OGN) remains challenging in real-world, multi-floor environments and under open-vocabulary object descriptions. We observe that most episodes in widely used benchmarks such as HM3D and MP3D involve multi-floor buildings, with many requiring explicit floor transitions. However, existing methods are often limited to single-floor settings or predefined object categories. To address these limitations, we tackle two key challenges: (1) efficient cross-level planning and (2) zero-shot object-goal navigation (ZS-OGN), where agents must interpret novel object descriptions without prior exposure. We propose ASCENT, a framework that combines a Multi-Floor Spatial Abstraction module for hierarchical semantic mapping and a Coarse-to-Fine Frontier Reasoning module leveraging Large Language Models (LLMs) for context-aware exploration, without requiring additional training on new object semantics or locomotion data. Our method outperforms state-of-the-art ZS-OGN approaches on HM3D and MP3D benchmarks while enabling efficient multi-floor navigation. We further validate its practicality through real-world deployment on a quadruped robot, achieving successful object exploration across unseen floors.
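To make the multi-floor abstraction concrete, below is a minimal sketch of one plausible data structure: per-floor BEV grids linked by detected stair transitions. The class and field names (FloorMap, MultiFloorMap, stairs, reachable_floors) are illustrative assumptions, not identifiers from the released code.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class FloorMap:
    """Hypothetical single-floor BEV map: occupancy plus per-cell semantic scores."""
    floor_id: int
    occupancy: np.ndarray        # (H, W) grid: 0 free, 1 obstacle, -1 unknown
    semantic_value: np.ndarray   # (H, W) image-text matching score per cell
    frontiers: list = field(default_factory=list)  # unexplored boundary cells (row, col)

@dataclass
class MultiFloorMap:
    """Hypothetical multi-floor abstraction: per-floor maps linked by stair transitions."""
    floors: dict = field(default_factory=dict)  # floor_id -> FloorMap
    stairs: list = field(default_factory=list)  # (floor_a, cell_a, floor_b, cell_b)

    def add_stair_link(self, floor_a, cell_a, floor_b, cell_b):
        # Record a traversable connection between two floors (e.g., a detected staircase).
        self.stairs.append((floor_a, cell_a, floor_b, cell_b))

    def reachable_floors(self, floor_id):
        # Floors directly connected to the given floor via known stairs.
        out = set()
        for fa, _, fb, _ in self.stairs:
            if fa == floor_id:
                out.add(fb)
            if fb == floor_id:
                out.add(fa)
        return out
```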


  Motivation

Fig: Motivation & Objective. Our method enables robotic navigation in unexplored multi-floor environments under a zero-shot object-goal setting, leveraging prior knowledge and coarse-to-fine reasoning to prioritize likely target locations. Unlike previous approaches, which struggle with floor-aware planning, our method handles cross-level transitions efficiently and locates objects across floors, marking a meaningful step forward for zero-shot object-goal navigation.


  The ASCENT Framework

Fig: Overview of the ASCENT framework. The system takes RGB-D and GPS+Compass inputs (top-left), and uses a pretrained navigation policy (bottom-left) to output actions at each timestep. The Multi-Floor Spatial Abstraction module (top-right) builds single-floor BEV maps and models inter-floor connectivity, enabling cross-level navigation. The Coarse-to-Fine Frontier Reasoning module (bottom-right) selects top-k frontiers based on image-text matching scores and uses an LLM for contextual reasoning across floors, achieving efficient zero-shot, floor-aware navigation.
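As a rough illustration of the coarse-to-fine frontier reasoning described above, the sketch below first ranks frontiers by an image-text matching score and keeps the top-k, then asks an LLM to choose among the shortlisted candidates across floors. The callables itm_score and llm_choose and the frontier dictionary fields are hypothetical placeholders, not the paper's actual interfaces.

```python
def select_goal_frontier(frontiers, goal_text, itm_score, llm_choose, k=5):
    """Coarse-to-fine frontier selection (illustrative sketch, not the released code).

    frontiers  : list of dicts like {"floor": int, "cell": (r, c), "view": image}
    itm_score  : callable(image, text) -> float, e.g. an image-text matching head
    llm_choose : callable(prompt) -> int, index of the frontier picked by the LLM
    """
    # Coarse stage: score every frontier view against the goal description
    # and keep only the top-k candidates.
    scored = sorted(
        ((itm_score(f["view"], goal_text), f) for f in frontiers),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]

    # Fine stage: let the LLM reason over the shortlisted candidates,
    # including which floor each one lies on.
    lines = [
        f"{i}: floor {f['floor']}, cell {f['cell']}, match score {s:.2f}"
        for i, (s, f) in enumerate(scored)
    ]
    prompt = (
        f"Target object: {goal_text}.\n"
        "Candidate frontiers (index: floor, location, image-text score):\n"
        + "\n".join(lines)
        + "\nReply with the index of the most promising frontier."
    )
    return scored[llm_choose(prompt)][1]
```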


  Quantitative Results on the OGN Task

HM3D and MP3D datasets. Metrics: SR (Success Rate) / SPL (Success weighted by Path Length).


Setting   | Floors       | Method      | Venue      | Vision            | Language   | HM3D SR | HM3D SPL | MP3D SR | MP3D SPL
----------|--------------|-------------|------------|-------------------|------------|---------|----------|---------|---------
Close-Set | Single-Floor | SemExp      | NeurIPS'20 | -                 | -          | 37.9    | 18.8     | 36.0    | 14.4
Close-Set | Single-Floor | Aux         | ICCV'21    | -                 | -          | -       | -        | 30.3    | 10.8
Close-Set | Single-Floor | PONI        | CVPR'22    | -                 | -          | -       | -        | 31.8    | 12.1
Close-Set | Single-Floor | Habitat-Web | CVPR'22    | -                 | -          | 41.5    | 16.0     | 35.4    | 10.2
Close-Set | Single-Floor | RIM         | IROS'23    | -                 | -          | 57.8    | 27.2     | 50.3    | 17.0
Close-Set | Multi-Floor  | PIRLNav     | CVPR'23    | -                 | -          | 64.1    | 27.1     | -       | -
Close-Set | Multi-Floor  | XGX         | ICRA'24    | -                 | -          | 72.9    | 35.7     | -       | -
Open-Set  | Single-Floor | ZSON        | NeurIPS'22 | CLIP              | -          | 25.5    | 12.6     | 15.3    | 4.8
Open-Set  | Single-Floor | L3MVN       | IROS'23    | -                 | GPT-2      | 50.4    | 23.1     | 34.9    | 14.5
Open-Set  | Single-Floor | SemUtil     | RSS'23     | -                 | BERT       | 54.0    | 24.9     | -       | -
Open-Set  | Single-Floor | CoW         | CVPR'23    | CLIP              | -          | 32.0    | 18.1     | -       | -
Open-Set  | Single-Floor | ESC         | ICML'23    | -                 | GPT-3.5    | 39.2    | 22.3     | 28.7    | 14.2
Open-Set  | Single-Floor | PSL         | ECCV'24    | CLIP              | -          | 42.4    | 19.2     | -       | -
Open-Set  | Single-Floor | VoroNav     | ICML'24    | BLIP              | GPT-3.5    | 42.0    | 26.0     | -       | -
Open-Set  | Single-Floor | PixNav      | ICRA'24    | LLaMA-Adapter     | GPT-4      | 37.9    | 20.5     | -       | -
Open-Set  | Single-Floor | TriHelper   | IROS'24    | Qwen-VL-Chat-Int4 | GPT-2      | 56.5    | 25.3     | -       | -
Open-Set  | Single-Floor | VLFM        | ICRA'24    | BLIP-2            | -          | 52.5    | 30.4     | 36.4    | 17.5
Open-Set  | Single-Floor | GAMap       | NeurIPS'24 | CLIP              | GPT-4V     | 53.1    | 26.0     | -       | -
Open-Set  | Single-Floor | SG-Nav      | NeurIPS'24 | LLaVA             | GPT-4      | 54.0    | 24.9     | 40.2    | 16.0
Open-Set  | Single-Floor | InstructNav | CoRL'24    | -                 | GPT-4V     | 58.0    | 20.9     | -       | -
Open-Set  | Single-Floor | UniGoal     | CVPR'25    | LLaVA             | LLaMA-2    | 54.0    | 24.9     | 41.0    | 16.4
Open-Set  | Multi-Floor  | MFNP        | ICRA'25    | Qwen-VL-Chat      | Qwen2-7B   | 58.3    | 26.7     | 41.1    | 15.4
Open-Set  | Multi-Floor  | Ours        | -          | BLIP-2            | Qwen2.5-7B | 65.4    | 33.5     | 44.5    | 15.5

Tab: Quantitative Results on the OGN Task. Quantitative comparison on the HM3D and MP3D datasets for Object-Goal Navigation, contrasting close-set (supervised) and open-set (zero-shot) methods on Success Rate (SR) and Success weighted by Path Length (SPL), and highlighting the state-of-the-art performance of our approach in open-vocabulary and multi-floor navigation.
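For reference, SPL weights each successful episode by how close the agent's path is to the shortest possible path (Anderson et al., 2018). A minimal reference implementation of that standard definition:

```python
def spl(successes, shortest_lengths, path_lengths):
    """Success weighted by Path Length:
    SPL = (1/N) * sum_i S_i * l_i / max(p_i, l_i),
    where S_i is the success indicator (0 or 1), l_i the shortest-path length
    to the goal, and p_i the length of the path the agent actually took."""
    terms = [
        s * l / max(p, l)
        for s, l, p in zip(successes, shortest_lengths, path_lengths)
    ]
    return sum(terms) / len(terms)
```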



  Cross-Floor Cases


  Case 1: Stair Ascending

After exploring the current floor, the agent makes a multi-floor decision to ascend the stairs and finds the goal on a higher floor.

  Case 2: Stair Descending

The agent infers the target is on a lower floor, chooses to descend, and successfully navigates downstairs.

  Case 3: Stairwell Reasoning

Even when it starts mid-stair or in an ambiguous area, the agent still infers and commits to multi-floor actions.

  Case 4: Floor Revisiting

Even after an incorrect decision, the agent can return to a previously visited floor and still complete the navigation successfully.
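The floor-transition behavior shown in these cases could be sketched, under the same illustrative assumptions as the MultiFloorMap example above, as a decision that fires once the current floor's frontiers are exhausted. The function and argument names here (decide_floor_transition, llm_choose) are hypothetical, not the paper's API.

```python
def decide_floor_transition(goal_text, current_floor, multi_floor_map, llm_choose):
    """Illustrative sketch of a floor-transition decision (not the released code):
    when the current floor has no promising frontiers left, ask the LLM whether
    the target is more likely on a floor reachable via known stairs."""
    options = sorted(multi_floor_map.reachable_floors(current_floor))
    if not options:
        return current_floor  # no known stairs yet; keep exploring this floor

    prompt = (
        f"Target object: {goal_text}.\n"
        f"The agent has exhausted the frontiers on floor {current_floor}.\n"
        f"Floors reachable via known stairs: {options}.\n"
        "Reply with the 0-based index of the floor in this list that should "
        "be explored next."
    )
    return options[llm_choose(prompt)]
```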