Stairway to Success:
An Online Floor-Aware Zero-Shot Object-Goal
Navigation Framework via LLM-Driven Coarse-to-Fine Exploration

Zeying Gong1       Rong Li1       Tianshuai Hu2       Ronghe Qiu1       Lingdong Kong3      
Lingfeng Zhang4       Guoyang Zhao1       Yiyi Ding1       Junwei Liang1,2,✉

1 The Hong Kong University of Science and Technology (Guangzhou)     2 The Hong Kong University of Science and Technology    
3 National University of Singapore     4 Tsinghua University

  Overview Video


  Real-World Demonstrations

Testing our method in real-world environments.

  Multi-Floor Navigation

  Target Object: Potted Plant (Upstairs Navigation), 5x Speed

  Target Object: Truck (Downstairs Navigation), 5x Speed

  Single-Floor Navigation

  Target Object: Chair, 5x Speed



  Simulation Demonstrations

Multi-floor and single-floor navigation with open-vocabulary target objects.

  Multi-Floor Navigation

  Target Object: Bed

  Target Object: Table

  Target Object: Couch

  Target Object: Chair

  Single-Floor Navigation

  Target Object: Fireplace

  Target Object: Nightstand


  Abstract

Deployable service and delivery robots struggle to reach object goals in multi-floor buildings: existing systems fail because they assume single-floor environments and require offline, globally consistent maps. Multi-floor environments pose unique challenges, including cross-floor transitions and vertical spatial reasoning, especially when navigating unknown buildings. Object-Goal Navigation benchmarks such as HM3D and MP3D capture this multi-floor reality, yet current methods lack support for online, floor-aware navigation. To bridge this gap, we propose ASCENT, an online framework for Zero-Shot Object-Goal Navigation that enables robots to operate without pre-built maps or retraining on new object categories. It introduces: (1) a Multi-Floor Abstraction module that dynamically constructs hierarchical representations with stair-aware obstacle mapping and cross-floor topology modeling, and (2) a Coarse-to-Fine Reasoning module that combines frontier ranking with LLM-driven contextual analysis for multi-floor navigation decisions. We evaluate ASCENT on the HM3D and MP3D benchmarks, where it outperforms state-of-the-art zero-shot approaches, and demonstrate real-world deployment on a quadruped robot.
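
As a concrete illustration of the hierarchical representation the abstract describes (per-floor BEV maps plus cross-floor topology), the sketch below shows one plausible data layout. All class and field names here are our own hypothetical stand-ins, not the released ASCENT code:

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class FloorMap:
    """Intra-floor state: a bird's-eye-view (BEV) occupancy grid."""
    level: int                    # floor index relative to the start floor
    occupancy: np.ndarray         # 2D grid: 0 free, 1 obstacle, -1 unknown
    stair_mask: np.ndarray        # cells classified as stairs, kept out of the
                                  # obstacle layer so stairs stay traversable
    frontiers: list = field(default_factory=list)   # unexplored boundaries

@dataclass
class StairLink:
    """Inter-floor topology edge: a traversable stair between two floors."""
    from_level: int
    to_level: int
    entry_xy: tuple               # stair entry waypoint on the source floor

class MultiFloorMap:
    """Hierarchical map: one FloorMap per discovered floor, connected by
    StairLink edges found during exploration."""
    def __init__(self):
        self.floors: dict = {}    # level -> FloorMap
        self.links: list = []     # StairLink edges
```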


  Motivation

Fig: Motivation of ASCENT. Unlike prior approaches that fail in multi-floor scenarios, our method enables online multi-floor navigation. By reasoning across floors, our policy succeeds in locating the goal and demonstrates a meaningful step forward in Zero-Shot Object-Goal Navigation.


  The ASCENT Framework

Fig: Framework overview of ASCENT. The system takes RGB-D inputs (top-left) and outputs navigation actions (bottom-right). The Multi-Floor Abstraction module (top) builds intra-floor BEV maps and models inter-floor connectivity. The Coarse-to-Fine Reasoning module (bottom) uses an LLM for contextual reasoning across floors. Together, these modules enable floor-aware, Zero-Shot Object-Goal Navigation.
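
A minimal sketch of how such a coarse-to-fine loop could be wired, assuming the hypothetical MultiFloorMap layout above. The scoring heuristic, frontier attributes, prompt strings, and the `query_llm` / `pick_stair_link` helpers are all illustrative assumptions, not the paper's implementation:

```python
def rank_frontiers(floor, goal_label, detections):
    """Coarse stage: score intra-floor frontiers with cheap cues
    (frontier size, travel distance, nearby detected categories)."""
    scored = []
    for frontier in floor.frontiers:
        score = frontier.size - 0.1 * frontier.distance_to_agent
        if goal_label in detections.get(frontier.region_id, []):
            score += 10.0  # boost frontiers near goal-related detections
        scored.append((score, frontier))
    return [f for _, f in sorted(scored, key=lambda pair: -pair[0])]

def pick_stair_link(mmap, current_level, choice):
    """Turn the LLM's verbal choice ('upstairs'/'downstairs') into a
    concrete stair-entry waypoint from the cross-floor topology."""
    direction = 1 if "up" in choice.lower() else -1
    for link in mmap.links:
        if link.from_level == current_level and link.to_level == current_level + direction:
            return link.entry_xy
    return None  # no known stair that way; fall back to exploration

def decide_next_waypoint(mmap, current_level, goal_label, detections, query_llm):
    """Fine stage: keep exploring this floor, or commit to a stair transition."""
    candidates = rank_frontiers(mmap.floors[current_level], goal_label, detections)
    if not candidates:
        # Current floor exhausted: reason about vertical structure instead.
        prompt = (
            f"Goal object: {goal_label}. Explored floors: {sorted(mmap.floors)}. "
            f"Stairs: {[(l.from_level, l.to_level) for l in mmap.links]}. "
            "Answer 'upstairs' or 'downstairs' (revisiting floors is allowed)."
        )
        return pick_stair_link(mmap, current_level, query_llm(prompt))
    # Otherwise let the LLM adjudicate among the top-ranked frontiers.
    prompt = f"Goal: {goal_label}. Candidate frontiers: {candidates[:3]}. Pick one."
    return query_llm(prompt)
```

The design point this sketch captures is that cheap geometric frontier ranking filters candidates before any LLM call, and the LLM is only asked a cross-floor question once the current floor is exhausted.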


  Quantitative Results on the OGN Task

Results on the HM3D and MP3D datasets. Metrics: SR (Success Rate) / SPL (Success weighted by Path Length).
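
For reference, these follow the standard embodied-navigation definitions: over $N$ evaluation episodes, with binary success indicator $S_i$, shortest-path length $\ell_i$, and the agent's actual path length $p_i$,

```latex
\mathrm{SR}  = \frac{1}{N}\sum_{i=1}^{N} S_i,
\qquad
\mathrm{SPL} = \frac{1}{N}\sum_{i=1}^{N} S_i \,\frac{\ell_i}{\max(p_i,\ \ell_i)}
```

so SPL discounts each success by how much longer the agent's path was than the optimal one.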


| Setting | Floors | Method | Venue | Vision | Language | HM3D SR | HM3D SPL | MP3D SR | MP3D SPL |
|---|---|---|---|---|---|---|---|---|---|
| Learning-Based | Single-Floor | SemExp | NeurIPS'20 | - | - | 37.9 | 18.8 | 36.0 | 14.4 |
| Learning-Based | Single-Floor | Aux | ICCV'21 | - | - | - | - | 30.3 | 10.8 |
| Learning-Based | Single-Floor | PONI | CVPR'22 | - | - | - | - | 31.8 | 12.1 |
| Learning-Based | Single-Floor | Habitat-Web | CVPR'22 | - | - | 41.5 | 16.0 | 35.4 | 10.2 |
| Learning-Based | Single-Floor | RIM | IROS'23 | - | - | 57.8 | 27.2 | 50.3 | 17.0 |
| Learning-Based | Multi-Floor | PIRLNav | CVPR'23 | - | - | 64.1 | 27.1 | - | - |
| Learning-Based | Multi-Floor | XGX | ICRA'24 | - | - | 72.9 | 35.7 | - | - |
| Zero-Shot | Single-Floor | ZSON | NeurIPS'22 | CLIP | - | 25.5 | 12.6 | 15.3 | 4.8 |
| Zero-Shot | Single-Floor | L3MVN | IROS'23 | - | GPT-2 | 50.4 | 23.1 | 34.9 | 14.5 |
| Zero-Shot | Single-Floor | SemUtil | RSS'23 | - | BERT | 54.0 | 24.9 | - | - |
| Zero-Shot | Single-Floor | CoW | CVPR'23 | CLIP | - | 32.0 | 18.1 | - | - |
| Zero-Shot | Single-Floor | ESC | ICML'23 | - | GPT-3.5 | 39.2 | 22.3 | 28.7 | 14.2 |
| Zero-Shot | Single-Floor | PSL | ECCV'24 | CLIP | - | 42.4 | 19.2 | - | - |
| Zero-Shot | Single-Floor | VoroNav | ICML'24 | BLIP | GPT-3.5 | 42.0 | 26.0 | - | - |
| Zero-Shot | Single-Floor | PixNav | ICRA'24 | LLaMA-Adapter | GPT-4 | 37.9 | 20.5 | - | - |
| Zero-Shot | Single-Floor | TriHelper | IROS'24 | Qwen-VL-Chat-Int4 | GPT-2 | 56.5 | 25.3 | - | - |
| Zero-Shot | Single-Floor | VLFM | ICRA'24 | BLIP-2 | - | 52.5 | 30.4 | 36.4 | 17.5 |
| Zero-Shot | Single-Floor | GAMap | NeurIPS'24 | CLIP | GPT-4V | 53.1 | 26.0 | - | - |
| Zero-Shot | Single-Floor | SG-Nav | NeurIPS'24 | LLaVA | GPT-4 | 54.0 | 24.9 | 40.2 | 16.0 |
| Zero-Shot | Single-Floor | InstructNav | CoRL'24 | - | GPT-4V | 58.0 | 20.9 | - | - |
| Zero-Shot | Single-Floor | UniGoal | CVPR'25 | LLaVA | LLaMA-2 | 54.0 | 24.9 | 41.0 | 16.4 |
| Zero-Shot | Multi-Floor | MFNP | ICRA'25 | Qwen-VL-Chat | Qwen2-7B | 58.3 | 26.7 | 41.1 | 15.4 |
| Zero-Shot | Multi-Floor | Ours | – | BLIP-2 | Qwen2.5-7B | 65.4 | 33.5 | 44.5 | 15.5 |

Tab: Quantitative results on the Object-Goal Navigation (OGN) task on the HM3D and MP3D datasets. The table contrasts learning-based and zero-shot methods on Success Rate (SR) and Success weighted by Path Length (SPL), highlighting the state-of-the-art performance of our approach in open-vocabulary and multi-floor navigation scenarios.



  Cross-Floor Cases


  Case 1: Stair Ascending

After traversing the current floor, the agent makes a multi-floor decision to ascend stairs and successfully finds the goal on a higher floor.

  Case 2: Stair Descending

The agent infers the target is on a lower floor, chooses to descend, and successfully navigates downstairs.

  Case 3: Stairwell Reasoning

Even when starting mid-stair or in otherwise ambiguous areas, the agent infers the appropriate multi-floor action and commits to it.

  Case 4: Floor Revisiting

Even after incorrect decisions, the robot can revisit previous floors and successfully complete navigation.