The LongCat team has released LARYBench, a benchmark that evaluates how well AI models learn action representations from video, rather than measuring only the downstream performance of robot policies. The benchmark addresses the shortage of robot training data by drawing on abundant human and robot videos and translating visual signals into latent action representations. The team found that general vision foundation models outperform specialized embodied latent action models, with significant performance gaps in several evaluations. By isolating the evaluation of latent actions from overall policy performance, LARYBench allows a clearer analysis of how well a model understands and can reproduce physical movements.

Qi Lv: Qi Lv is a researcher at Meituan’s LongCat team contributing to LARYBench. The benchmark evaluates models on high-level semantic actions and low-level robot dynamics using curated video datasets. Lv’s involvement supports advancing scalable action learning from unlabeled videos.
DINOv3: DINOv3 is Meta AI’s self-supervised vision foundation model for scalable feature learning. It achieves top performance on LARYBench regression tasks, showing that its latent features align well with robot actions. The model leverages large-scale pre-training for dense visual understanding.
V-JEPA 2: V-JEPA 2 is Meta AI’s self-supervised video world model that predicts future video representations from past ones. It leads LARYBench in action classification by extracting strong latent actions without explicit motion supervision. The model enables better understanding of physical dynamics for robotics.
LongCat: LongCat is Meituan’s AI research team specializing in open-source models and benchmarks for embodied AI, vision-language-action systems, and multimodal understanding. They developed and released LARYBench, a benchmark that isolates and evaluates latent action representations from video for semantic action classification and robotic control regression. The release demonstrates how general vision models capture action knowledge better than specialized embodied models.
Dujun Nie: Dujun Nie is a researcher at Meituan’s LongCat team. He is the lead author of the LARYBench paper, which introduces a framework to assess latent actions extracted from large-scale human and robot videos. His contributions focus on decoupling representation quality from policy performance in vision-to-action alignment.
Jun Kuang: Jun Kuang is a researcher at Meituan’s LongCat team involved in LARYBench creation. The project introduces pipelines for probing latent actions in classification and regression tasks. Kuang helps demonstrate latent spaces’ superiority over pixel spaces for robot control.
Xiaoyu Li: Xiaoyu Li is a researcher at Meituan’s LongCat team who worked on the LARYBench paper. The benchmark uses data from diverse embodiments to test vision models’ action encoding. Li’s work highlights emergent action features from self-supervised visual pre-training.
Xuezhi Cao: Xuezhi Cao is a researcher at Meituan’s LongCat team contributing to LARYBench. The benchmark provides standardized evaluation for latent representations across action categories and robot platforms. Cao’s efforts aid in shifting robotics toward general visual foundations.
LAPA-DINOv2: LAPA-DINOv2 is a general latent action model that uses DINOv2 as its backbone with self-supervised action training. Released by LongCat as part of the LARYBench ablations, it outperforms embodied latent action models (LAMs) on diverse data. It demonstrates the benefits of broad pre-training for vision-to-action generalization.
Xunliang Cai: Xunliang Cai is a researcher at Meituan’s LongCat team who worked on LARYBench development. The benchmark reveals general models’ edge in action semantics and control alignment. Cai supports open-sourcing datasets and code for community research.
Fengjiao Chen: Fengjiao Chen is a researcher and project leader at Meituan’s LongCat team. She co-authored the LARYBench paper and oversees development of benchmarks for latent action representations in robotics. Her work emphasizes the advantages of general visual pre-training for physical control tasks.

Key Insight: General vision foundation models trained without action supervision outperform specialized embodied latent action models.
Benchmark Innovation: LARYBench decouples latent action evaluation from policy performance, testing semantic understanding and physical control separately.
Representation Advantage: Latent feature spaces provide better alignment to robotic action spaces than pixel reconstruction approaches.
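
The decoupled evaluation described above, with a classification probe for semantic actions and a regression probe for robot control, can be illustrated with a minimal sketch. It is not the LARYBench implementation: this summary does not specify the benchmark's probe heads, datasets, or metrics. The sketch assumes frozen feature vectors from some encoder are already available and substitutes synthetic stand-ins for them; the helper names (fit_ridge, make_split, run_probes), the feature dimension, and the ridge-regression probe are all hypothetical choices made for illustration.

```python
import random

def fit_ridge(X, Y, lam=1e-3):
    """Closed-form ridge regression: W = (X^T X + lam*I)^-1 X^T Y."""
    d, k = len(X[0]), len(Y[0])
    # Normal matrix A (d x d) and right-hand side B (d x k).
    A = [[sum(x[i] * x[j] for x in X) + (lam if i == j else 0.0)
          for j in range(d)] for i in range(d)]
    B = [[sum(x[i] * y[j] for x, y in zip(X, Y)) for j in range(k)]
         for i in range(d)]
    # Gauss-Jordan elimination with partial pivoting on [A | B].
    M = [A[i] + B[i] for i in range(d)]
    for col in range(d):
        piv = max(range(col, d), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(d):
            if r != col:
                f = M[r][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    return [row[d:] for row in M]  # d x k weight matrix

def predict(X, W):
    """Apply the linear probe W to every feature vector in X."""
    return [[sum(x[i] * W[i][j] for i in range(len(x)))
             for j in range(len(W[0]))] for x in X]

def make_split(rng, n):
    """Synthetic stand-in for frozen features and labels (not LARYBench data)."""
    feats, actions, labels = [], [], []
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in range(4)]
        feats.append(z + [1.0])  # constant 1.0 acts as a bias term
        # Hypothetical 2-D continuous action: linear in z plus small noise.
        actions.append([0.5 * z[0] - 0.2 * z[3] + rng.gauss(0.0, 0.01),
                        0.3 * z[1] + 0.1 * z[2] + rng.gauss(0.0, 0.01)])
        # Hypothetical semantic class: index of the largest of z[0..2].
        labels.append(max(range(3), key=lambda c: z[c]))
    return feats, actions, labels

def run_probes(seed=0, n_train=300, n_test=200):
    rng = random.Random(seed)
    Xtr, Atr, Ltr = make_split(rng, n_train)
    Xte, Ate, Lte = make_split(rng, n_test)
    # Classification probe: ridge regression onto one-hot targets, then argmax.
    onehot = [[1.0 if c == l else 0.0 for c in range(3)] for l in Ltr]
    scores = predict(Xte, fit_ridge(Xtr, onehot))
    pred = [max(range(3), key=lambda c: s[c]) for s in scores]
    accuracy = sum(p == l for p, l in zip(pred, Lte)) / n_test
    # Regression probe: ridge regression onto continuous actions, scored by MSE.
    A_hat = predict(Xte, fit_ridge(Xtr, Atr))
    mse = sum((p - t) ** 2 for ph, th in zip(A_hat, Ate)
              for p, t in zip(ph, th)) / (2 * n_test)
    return accuracy, mse

accuracy, mse = run_probes()
print(f"classification accuracy: {accuracy:.3f}")
print(f"regression MSE: {mse:.5f}")
```

Because the encoder stays frozen and only a small linear head is fit, both scores reflect the quality of the representation itself rather than the quality of a downstream policy, which is the separation the benchmark is described as making.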