IEEE Robotics and Automation Letters (RA-L) 2026

Pixel2Catch:
Multi-Agent Sim-to-Real Transfer for Agile Manipulation with a Single RGB Camera

Dongguk University
*Corresponding author
Pixel2Catch teaser figure

Pixel2Catch infers object motion from pixel-level features in image space, with no 3D position estimate required. Policies trained in simulation transfer directly to the real robot without fine-tuning.

Abstract

To catch a thrown object, a robot must be able to perceive the object's motion and generate control actions in a timely manner. Rather than explicitly estimating the object's 3D position, this work focuses on a novel approach that recognizes object motion using pixel-level visual information extracted from a single RGB image. Such visual cues capture changes in the object's position and scale, allowing the policy to reason about the object's motion.

Furthermore, to achieve stable learning in a high-DoF system composed of a robot arm equipped with a multi-fingered hand, we design a heterogeneous multi-agent reinforcement learning framework that defines the arm and hand as independent agents with distinct roles. Each agent is trained cooperatively using role-specific observations and rewards, and the learned policies are successfully transferred from simulation to the real world.

Overview Video

Overview of Pixel2Catch: method, training in simulation, and real-world results.

Method

Heterogeneous Multi-Agent Framework

The robot arm and the multi-fingered hand are modeled as independent agents with role-specific observations and rewards. The arm agent (π_arm) focuses on positioning the end-effector to reach the thrown object, while the hand agent (π_hand) concentrates on forming stable grasps during the catching phase. Both policies are trained cooperatively with MAPPO under the Centralized Training with Decentralized Execution (CTDE) paradigm.
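As a rough illustration of the CTDE split described above, the following sketch uses tiny linear stand-ins rather than real networks. The class names, all dimensions, and the choice of one critic per agent are assumptions made for the example and are not taken from the paper. The clipped surrogate loss is the standard PPO form that MAPPO builds on.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, chosen only for illustration.
ARM_OBS, ARM_ACT = 20, 7
HAND_OBS, HAND_ACT = 24, 16
PRIVILEGED = 6


class LinearActor:
    """Stand-in for a policy network: Gaussian with mean W @ obs, unit std."""

    def __init__(self, obs_dim, act_dim):
        self.W = rng.normal(scale=0.1, size=(act_dim, obs_dim))

    def act(self, obs):
        mean = self.W @ obs
        return mean + rng.standard_normal(mean.shape)


class CentralCritic:
    """Stand-in for a value network that sees the joint state."""

    def __init__(self, state_dim):
        self.w = rng.normal(scale=0.1, size=state_dim)

    def value(self, state):
        return float(self.w @ state)


def ppo_clip_loss(ratio, advantage, eps=0.2):
    """PPO clipped surrogate loss (the negated clipped objective)."""
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return float(-np.mean(np.minimum(ratio * advantage, clipped * advantage)))


pi_arm = LinearActor(ARM_OBS, ARM_ACT)
pi_hand = LinearActor(HAND_OBS, HAND_ACT)

# One critic per agent is an assumption, motivated by the role-specific rewards.
state_dim = ARM_OBS + HAND_OBS + PRIVILEGED
critic_arm = CentralCritic(state_dim)
critic_hand = CentralCritic(state_dim)

arm_obs = rng.standard_normal(ARM_OBS)
hand_obs = rng.standard_normal(HAND_OBS)
privileged = rng.standard_normal(PRIVILEGED)

# Decentralized execution: each actor sees only its own observation.
arm_action = pi_arm.act(arm_obs)
hand_action = pi_hand.act(hand_obs)

# Centralized training: critics see both observations plus privileged state.
joint_state = np.concatenate([arm_obs, hand_obs, privileged])
v_arm = critic_arm.value(joint_state)
v_hand = critic_hand.value(joint_state)
```

Only the two actors would be needed at deployment time. The critics and the privileged state exist only to shape the training signal.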

System overview
Pipeline of the system and experimental setup. Each policy operates on selected observations from two consecutive timesteps. Privileged information is used only during value network training. A single RGB camera is mounted 0.5 m behind and 2.2 m above the robot.
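Feeding a policy observations from two consecutive timesteps can be sketched with a two-slot buffer. The class name, the [previous, current] ordering, and the choice to duplicate the first observation at reset are assumptions for illustration. The page does not specify these details.

```python
from collections import deque

import numpy as np


class TwoStepObservation:
    """Keeps the last two observations and returns them concatenated."""

    def __init__(self):
        self.buffer = deque(maxlen=2)

    def reset(self, first_obs):
        first = np.asarray(first_obs, dtype=np.float32)
        self.buffer.clear()
        # Assumption: duplicate the first observation so the input size is fixed.
        self.buffer.extend([first, first])
        return self.stacked()

    def push(self, obs):
        # Appending to a full deque drops the oldest observation.
        self.buffer.append(np.asarray(obs, dtype=np.float32))
        return self.stacked()

    def stacked(self):
        # Order is [previous, current].
        return np.concatenate(list(self.buffer))


history = TwoStepObservation()
print(history.reset([1.0, 2.0]))  # previous and current are identical at reset
print(history.push([3.0, 4.0]))   # previous = [1, 2], current = [3, 4]
```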

Pixel-Level Features from a Single RGB Image

Instead of explicit 3D position estimation, we extract pixel-level features from a single RGB image: the center coordinates (cx, cy), width w, height h, and their temporal differences. The center motion encodes the object's apparent direction; the width/height scale change encodes relative distance. SAM2 is used for robust object segmentation under varying lighting and backgrounds.
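The feature definition above can be sketched as follows, assuming the segmentation step yields a binary mask per frame. Taking (cx, cy, w, h) from the tight bounding box of the mask, working in raw pixel units, and returning zeros when the object is not visible are all assumptions of this sketch. The function names are hypothetical.

```python
import numpy as np


def bbox_features(mask):
    """Return (cx, cy, w, h) of the tight bounding box around a binary mask."""
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        # Assumption: report zeros when the object is not visible.
        return np.zeros(4, dtype=np.float32)
    x_min, x_max = xs.min(), xs.max()
    y_min, y_max = ys.min(), ys.max()
    w = float(x_max - x_min + 1)
    h = float(y_max - y_min + 1)
    cx = x_min + w / 2.0
    cy = y_min + h / 2.0
    return np.array([cx, cy, w, h], dtype=np.float32)


def pixel_features(prev_mask, curr_mask):
    """Current (cx, cy, w, h) followed by their differences from the previous frame."""
    prev = bbox_features(prev_mask)
    curr = bbox_features(curr_mask)
    return np.concatenate([curr, curr - prev])


# Synthetic example: the object moves right and appears larger.
prev_mask = np.zeros((100, 100), dtype=bool)
prev_mask[40:50, 10:20] = True
curr_mask = np.zeros((100, 100), dtype=bool)
curr_mask[38:52, 28:42] = True

print(pixel_features(prev_mask, curr_mask))
```

In this synthetic case the center difference is positive along x and zero along y, and both size differences are positive. This is the pattern the text associates with an object moving across the image while approaching the camera.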

Pixel-level features
Pixel-level features extracted from simulation and real-world RGB frames. The same feature definition transfers across the sim-to-real gap.

Simulation Results

Training curves
Tracking and success rates over training, averaged over 3 seeds. Pixel-level features substantially outperform position-only baselines.

Performance on Seen and Unseen Objects (averaged over 3 seeds)

Metric     Method        Seen objects     Unseen objects
T.R. (%)   w/o PF        12.13 ± 1.27     12.11 ± 1.90
T.R. (%)   Only-WH        8.03 ± 0.60      6.11 ± 0.35
T.R. (%)   Only-Center   87.07 ± 1.62     87.50 ± 1.53
T.R. (%)   S-A RL        78.20 ± 1.32     75.83 ± 2.00
T.R. (%)   3D Pos        93.33 ± 0.55     91.83 ± 1.16
T.R. (%)   Proposed      89.97 ± 0.21     89.28 ± 0.79
S.R. (%)   w/o PF         8.93 ± 1.04      7.89 ± 1.46
S.R. (%)   Only-WH        5.53 ± 0.51      4.00 ± 0.17
S.R. (%)   Only-Center   81.27 ± 0.76     80.72 ± 0.38
S.R. (%)   S-A RL        63.50 ± 0.85     65.44 ± 2.25
S.R. (%)   3D Pos        90.43 ± 1.27     89.10 ± 1.39
S.R. (%)   Proposed      84.13 ± 0.50     84.83 ± 1.17

T.R. = Tracking Rate, S.R. = Success Rate. 3D Pos attains the highest rates, but it uses oracle 3D coordinates that are available only in simulation. The Proposed model comes within 3.4 percentage points of it in tracking rate and within 6.3 points in success rate while relying solely on pixel-level visual cues from RGB.
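The distance between the Proposed model and the 3D Pos oracle can be checked directly from the table means. The snippet below copies those means from the simulation table. The dictionary layout and the function name are illustrative choices only.

```python
# Means copied from the simulation table (percent), as (seen, unseen).
sim_means = {
    "T.R.": {"3D Pos": (93.33, 91.83), "Proposed": (89.97, 89.28)},
    "S.R.": {"3D Pos": (90.43, 89.10), "Proposed": (84.13, 84.83)},
}


def gap_to_oracle(results):
    """Percentage-point gap between 3D Pos and Proposed, per metric and split."""
    gaps = {}
    for metric, methods in results.items():
        oracle, proposed = methods["3D Pos"], methods["Proposed"]
        gaps[metric] = tuple(round(o - p, 2) for o, p in zip(oracle, proposed))
    return gaps


for metric, (seen, unseen) in gap_to_oracle(sim_means).items():
    print(f"{metric}: seen gap = {seen:.2f}, unseen gap = {unseen:.2f}")
```

This compares means only. The per-seed spreads in the table are not taken into account.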

Real-World Experiments

Real-world catching sequences
Real-world catching sequences for objects with different geometries. The mono-cam view shows the RGB scene captured by the installed camera; objects are segmented using SAM2 and pixel-level features are provided to the policy as input. Policies are transferred from simulation to the real robot without any fine-tuning.

Real-World Demonstrations

The proposed Pixel2Catch policy is transferred directly from simulation to the real robot without fine-tuning, catching three differently shaped objects thrown by a human.

Cube

L-block

Triangle

Real-World Performance (360 trials per policy)

Metric     Method        Cube             L-block          Triangle
T.R. (%)   Only-WH        0.0 ± 0.0        0.0 ± 0.0        0.0 ± 0.0
T.R. (%)   Only-Center   44.73 ± 7.80     44.80 ± 7.87     47.47 ± 5.15
T.R. (%)   S-A RL        64.67 ± 9.01     57.27 ± 8.76     61.40 ± 8.22
T.R. (%)   3D Pos        70.83 ± 5.69     65.83 ± 10.67    70.00 ± 6.09
T.R. (%)   Proposed      77.93 ± 6.56     76.60 ± 4.15     74.67 ± 7.67
S.R. (%)   Only-WH        0.0 ± 0.0        0.0 ± 0.0        0.0 ± 0.0
S.R. (%)   Only-Center   15.27 ± 7.33     10.60 ± 8.37     17.27 ± 3.52
S.R. (%)   S-A RL        31.27 ± 5.55     34.00 ± 8.63     30.00 ± 6.67
S.R. (%)   3D Pos        45.83 ± 5.00     44.17 ± 4.19     35.83 ± 5.69
S.R. (%)   Proposed      60.60 ± 4.30     51.27 ± 6.60     46.60 ± 4.77

Results are averaged over 360 trials per policy (3 objects × 2 throwers × 2 backgrounds × 30 trials). The Proposed model achieves the highest tracking and success rates on all three objects in real-world deployment.
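The trial count and the margin over 3D Pos can be reproduced from the numbers above. The object names and table means come from this page. The thrower and background labels are placeholders, since the page does not name them.

```python
from itertools import product

objects = ["Cube", "L-block", "Triangle"]
throwers = ["thrower_1", "thrower_2"]           # hypothetical labels
backgrounds = ["background_1", "background_2"]  # hypothetical labels
TRIALS_PER_CONDITION = 30

# Every (object, thrower, background) combination is one test condition.
conditions = list(product(objects, throwers, backgrounds))
total_trials = len(conditions) * TRIALS_PER_CONDITION

# Means copied from the real-world table (percent).
real = {
    "T.R.": {
        "3D Pos": {"Cube": 70.83, "L-block": 65.83, "Triangle": 70.00},
        "Proposed": {"Cube": 77.93, "L-block": 76.60, "Triangle": 74.67},
    },
    "S.R.": {
        "3D Pos": {"Cube": 45.83, "L-block": 44.17, "Triangle": 35.83},
        "Proposed": {"Cube": 60.60, "L-block": 51.27, "Triangle": 46.60},
    },
}

# Percentage-point margin of Proposed over 3D Pos, per metric and object.
margin = {
    metric: {
        obj: round(rows["Proposed"][obj] - rows["3D Pos"][obj], 2)
        for obj in objects
    }
    for metric, rows in real.items()
}

print(len(conditions), "conditions,", total_trials, "trials per policy")
for metric, per_object in margin.items():
    print(metric, per_object)
```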

BibTeX

@article{kim2026pixel2catch,
  title   = {Pixel2Catch: Multi-Agent Sim-to-Real Transfer for Agile Manipulation
             with a Single RGB Camera},
  author  = {Kim, Seongyong and Cho, Junhyeon and Lee, Kang-Won and Lim, Soo-Chul},
  journal = {IEEE Robotics and Automation Letters},
  year    = {2026},
  note    = {Accepted, to appear}
}