We evaluate RING directly on 4 unseen robots (Stretch RE-1, Stretch RE-1 (factory config), LoCoBot, and Unitree Go1) in a real-world apartment (layout shown below), without any further adaptation or real-world-specific finetuning. Here we present a trajectory for each robot embodiment, showing both the agent's RGB input and a third-person perspective. All videos are sped up by up to 20x for ease of viewing.
[We use black boxes to hide individuals' identities when they appear in the frame.]
In this experiment, we use the off-the-shelf camera mounted on the Stretch RE-1 (an Intel RealSense D435 with a vertical field of view of $69^{\circ}$ and a resolution of $720 \times 1280$). RING effectively explores the room and finds the apple despite the narrow field of view.
For this embodiment, following SPOC, we use 2 fixed Intel RealSense D455 cameras, each with a vertical field of view of $59^{\circ}$ and a resolution of $1280 \times 720$. RING explores multiple rooms while avoiding obstacles to find the mug.
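To make the two camera setups above concrete, here is a minimal sketch of how they could be written down as structured configs; the class and field names are illustrative assumptions, while the numbers are taken from the descriptions above.

```python
# Illustrative sketch only (not the authors' code): the two real-robot camera
# setups described above, written as structured configs. Class and field names
# are assumptions for this example; the numbers come from the text.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class CameraConfig:
    model: str
    vertical_fov_deg: float
    resolution: Tuple[int, int]  # as listed in the text above

@dataclass
class RobotCameras:
    robot: str
    cameras: List[CameraConfig]

# Stretch RE-1 with its factory-mounted camera (narrow field of view).
stretch_factory = RobotCameras(
    robot="Stretch RE-1 (factory config)",
    cameras=[CameraConfig("Intel RealSense D435", 69.0, (720, 1280))],
)

# Stretch RE-1 with two fixed cameras, following the SPOC setup.
stretch_spoc = RobotCameras(
    robot="Stretch RE-1",
    cameras=[
        CameraConfig("Intel RealSense D455", 59.0, (1280, 720)),
        CameraConfig("Intel RealSense D455", 59.0, (1280, 720)),
    ],
)
```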
We use the Unitree Go1 as another embodiment with a lower height. The robot navigates to the toilet from the living room. Note that our low-level controller is not perfect, causing the robot to drift to the left when moving forward. Despite this limitation, RING shows robustness by correcting the robot's trajectory with RotateRight actions.
The robot starts in a bedroom and performs 3 full 360-degree rotations to scan the room. It then navigates to the corridor and finds the basketball. The exhaustive exploration of the bedroom is due to a bias in simulation, where the basketball is often placed in the bedroom.
Both robots go to the kitchen to find the trashcan. In the left video, Stretch goes around the table in the corridor to avoid collisions. In the right video, when the table completely blocks the way, RING controls the Go1 robot to walk under the table because of its lower height.
This example shows a failure caused by an embodiment limitation. The Go1 robot has a low camera placement with a limited field of view, which leads the policy to incorrectly identify a leather bench at the foot of the bed as a sofa. (Please refer to the robot's first-person view to observe the recognition error.)
To further show RING's generalization capability to novel embodiments, we evaluate it as a navigation assistant with humans as new, unseen embodiments.
We asked 5 participants to navigate in a real-world kitchen area, following the policy’s output actions on their phones.
Each individual has unique characteristics, including step size, height, rotation angles, and camera-holding posture. Each person navigates to three different objects (Mug, Apple, Houseplant), resulting in a total of 15 trajectories.
Below we show two qualitative trajectories navigated by two participants using RING.
Find a houseplant.
Find an apple.
The RING policy shows embodiment-adaptive behavior.
In simulation, both the Stretch RE-1 and the Unitree A1 start from the same pose behind the bed. The quadruped moves directly under the bed because of its lower height, while the Stretch RE-1 goes around it.
Our third agent matches the Stretch RE-1's height but has a camera positioned as low as the quadruped's. Initially, it behaves as though it had a lower body and attempts to go under the bed, but after colliding it adjusts and maneuvers around the bed, similar to the Stretch RE-1.
This shows that RING implicitly infers embodiment parameters from visual observations and transition dynamics and dynamically adjusts its navigation strategy accordingly, without access to any privileged information about its current body. This embodiment-adaptive adjustment of the navigation strategy is an interesting emergent behavior that would not have been possible without training across an exhaustive space of embodiments at scale.
With the growing diversity of robots used in real-world applications, there remains a need for a policy that can operate across a wide range of embodiments and transfer, in a zero- or few-shot manner, to novel robots. We introduce RING, a generalist policy for indoor visual navigation that learns from a broad spectrum of embodiments and is trained exclusively in simulation, without any use of real robot embodiments.
We augment the AI2-THOR simulator with the ability to instantiate robot embodiments with controllable configurations, varying across body size, rotation pivot point, and camera configuration. We model the agent's body as an invisible collider box in AI2-THOR. Each agent can have 1 or 2 RGB cameras placed at a random pose within the collider box, and the parameters of both the body and the cameras are sampled randomly from specified ranges. We also modify the planner used to generate expert trajectories to account for the diversity of embodiments.
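As a rough illustration of this randomization, the sketch below samples a hypothetical embodiment configuration; all parameter names and ranges are assumptions made for this example and are not the actual values used in training.

```python
# Illustrative sketch of embodiment randomization (parameter names and ranges
# are assumptions for this example, not the values used in the paper).
import random

def sample_embodiment(rng: random.Random) -> dict:
    # Body modeled as an invisible collider box; dimensions in meters.
    body = {
        "width":  rng.uniform(0.2, 0.6),
        "depth":  rng.uniform(0.2, 0.6),
        "height": rng.uniform(0.3, 1.5),
        # Rotation pivot point expressed relative to the collider box center.
        "pivot_offset": (rng.uniform(-0.1, 0.1), 0.0, rng.uniform(-0.1, 0.1)),
    }
    # One or two RGB cameras placed at random poses within the collider box.
    cameras = []
    for _ in range(rng.choice([1, 2])):
        cameras.append({
            "position": (  # a point inside the collider box
                rng.uniform(-body["width"] / 2, body["width"] / 2),
                rng.uniform(0.1, body["height"]),
                rng.uniform(-body["depth"] / 2, body["depth"] / 2),
            ),
            "pitch_deg": rng.uniform(-30.0, 30.0),
            "vertical_fov_deg": rng.uniform(40.0, 80.0),
        })
    return {"body": body, "cameras": cameras}

# Example: sample a new random embodiment for each training episode.
rng = random.Random(0)
embodiment = sample_embodiment(rng)
```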
We first pretrain our policy on expert trajectories collected from randomized embodiments, then finetune it with on-policy RL, again using randomized embodiments in the AI2-THOR simulator. RL finetuning is particularly important for the policy to learn, through trial and error, to navigate a diverse set of embodiments. The architecture is outlined in the figure below.
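The sketch below shows only the two-stage structure of this recipe in schematic form; it uses a toy MLP policy and a simplified policy-gradient update in place of the actual transformer policy and on-policy RL algorithm, so every name and number in it is an assumption.

```python
# Minimal two-stage training skeleton (illustrative only; not the actual
# architecture or RL algorithm used by RING).
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_ACTIONS, OBS_DIM = 6, 128  # assumed sizes for this sketch
policy = nn.Sequential(nn.Linear(OBS_DIM, 256), nn.ReLU(), nn.Linear(256, NUM_ACTIONS))
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)

# Stage 1: imitation pretraining on expert trajectories from randomized embodiments.
def imitation_step(obs: torch.Tensor, expert_actions: torch.Tensor) -> float:
    loss = F.cross_entropy(policy(obs), expert_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Stage 2: on-policy RL finetuning in simulation, again with randomized embodiments.
def rl_step(obs: torch.Tensor, actions: torch.Tensor, returns: torch.Tensor) -> float:
    log_probs = F.log_softmax(policy(obs), dim=-1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = -(chosen * returns).mean()  # simple policy-gradient surrogate
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Dummy usage with random tensors standing in for real rollouts.
obs = torch.randn(32, OBS_DIM)
imitation_step(obs, torch.randint(0, NUM_ACTIONS, (32,)))
rl_step(obs, torch.randint(0, NUM_ACTIONS, (32,)), torch.randn(32))
```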
Our experiments show that RING operates effectively across a wide range of embodiments, including real robots (Stretch RE-1, LoCoBot, and Unitree Go1) and human evaluations in which RING acts as a navigation assistant, despite being trained exclusively in simulation without any direct exposure to real robot embodiments.
We perform zero-shot evaluation of all policies on four robot embodiments in simulation: Stretch RE-1 (with 1 or 2 cameras), LoCoBot, and Unitree A1. RING shows strong generalization across all embodiments despite not being trained on any of them, achieving an average absolute improvement of 16.7% in Success Rate.
RING transfers zero-shot to the real world without any finetuning. Gray numbers correspond to policies evaluated on the same embodiment they were trained on. RING shows consistent performance across all embodiments, achieving an average success rate of 78.9% across 4 real-world robots.
We asked 5 participants to navigate in a real-world kitchen area, following the policy's output actions on their phones. Each individual has unique characteristics, including step size, height, rotation angles, and camera-holding posture. Each person navigates to three different objects (Mug, Apple, Houseplant), resulting in a total of 15 trajectories. RING shows much better generalization to the human embodiment than the FLaRe baseline trained on Stretch RE-1.