Updated on 2025.11.21
Usage instructions: here
Manipulation
| Publish Date | Title | Authors | arXiv ID | Code |
|---|---|---|---|---|
| 2025-11-20 | Dexterity from Smart Lenses: Multi-Fingered Robot Manipulation with In-the-Wild Human Demonstrations | Homanga Bharadhwaj Team | 2511.16661 | null |
| 2025-11-20 | InternData-A1: Pioneering High-Fidelity Synthetic Data for Pre-training Generalist Policy | Jiangmiao Pang Team | 2511.16651 | null |
| 2025-11-20 | Toward Artificial Palpation: Representation Learning of Touch on Soft Bodies | Aviv Tamar Team | 2511.16596 | null |
| 2025-11-20 | Green Resilience of Cyber-Physical Systems: Doctoral Dissertation | Diaeddin Rimawi Team | 2511.16593 | null |
| 2025-11-20 | VLA-Pruner: Temporal-Aware Dual-Level Visual Token Pruning for Efficient Vision-Language-Action Inference | Bo Zhao Team | 2511.16449 | null |
| 2025-11-20 | Graph Neural Networks for Surgical Scene Segmentation | Danail Stoyanov Team | 2511.16430 | null |
| 2025-11-20 | LAOF: Robust Latent Action Learning with Optical Flow Constraints | Wei Li Team | 2511.16407 | link |
| 2025-11-20 | Homogeneous Proportional-Integral-Derivative Controller in Mobile Robotic Manipulators | Andrey Polyakov Team | 2511.16406 | null |
| 2025-11-20 | Robot Metacognition: Decision Making with Confidence for Tool Invention | Pablo Lanillos Team | 2511.16390 | null |
| 2025-11-20 | Beyond Generative AI: World Models for Clinical Prediction, Counterfactuals, and Planning | Mohammad Yaqub Team | 2511.16333 | null |
| 2025-11-20 | Safe and Optimal Variable Impedance Control via Certified Reinforcement Learning | Ravi Prakash Team | 2511.16330 | null |
| 2025-11-20 | InEKFormer: A Hybrid State Estimator for Humanoid Robots | Frank Kirchner Team | 2511.16306 | null |
| 2025-11-20 | DynaMimicGen: A Data Generation Framework for Robot Learning of Dynamic Tasks | Anna Valente Team | 2511.16223 | null |
| 2025-11-20 | When Alignment Fails: Multimodal Adversarial Attacks on Vision-Language-Action Models | Yaochu Jin Team | 2511.16203 | null |
| 2025-11-20 | Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight | Zhijie Deng Team | 2511.16175 | null |
| 2025-11-20 | EvoVLA: Self-Evolving Vision-Language-Action Model | Hao Tang Team | 2511.16166 | null |
| 2025-11-20 | MagBotSim: Physics-Based Simulation and Reinforcement Learning Environments for Magnetic Robotics | Klaus Neumann Team | 2511.16158 | null |
| 2025-11-20 | Real-Time 3D Object Detection with Inference-Aligned Learning | Nan Xue Team | 2511.16140 | null |
| 2025-11-20 | Bi-AQUA: Bilateral Control-Based Imitation Learning for Underwater Robot Arms via Lighting-Aware Action Chunking with Transformers | Yuki Uranishi Team | 2511.16050 | null |
| 2025-11-20 | PushingBots: Collaborative Pushing via Neural Accelerated Combinatorial Hybrid Optimization | Meng Guo Team | 2511.15995 | null |
| 2025-11-19 | Optimus-Q: Utilizing Federated Learning in Adaptive Robots for Intelligent Nuclear Power Plant Operations through Quantum Cryptography | Sajedul Talukder Team | 2511.15614 | null |
| 2025-11-19 | SRPO: Self-Referential Policy Optimization for Vision-Language-Action Models | Xipeng Qiu Team | 2511.15605 | null |
| 2025-11-19 | Learning from Mistakes: Loss-Aware Memory Enhanced Continual Learning for LiDAR Place Recognition | Tiantian Feng Team | 2511.15597 | null |
| 2025-11-19 | NMPC-based Motion Planning with Adaptive Weighting for Dynamic Object Interception | Steven Liu Team | 2511.15532 | null |
| 2025-11-19 | Decentralized Gaussian Process Classification and an Application in Subsea Robotics | James McMahon Team | 2511.15529 | null |
| 2025-11-19 | Theoretical Closed-loop Stability Bounds for Dynamical System Coupled with Diffusion Policies | François Ferland Team | 2511.15520 | null |
| 2025-11-19 | IPR-1: Interactive Physical Reasoner | Yong-Lu Li Team | 2511.15407 | null |
| 2025-11-19 | Platform-Agnostic Reinforcement Learning Framework for Safe Exploration of Cluttered Environments with Graph Attention | George Nikolakopoulos Team | 2511.15358 | null |
| 2025-11-19 | Adversarial Attack on Black-Box Multi-Agent by Adaptive Perturbation | Fanjiang Xu Team | 2511.15292 | null |
| 2025-11-19 | Path Planning through Multi-Agent Reinforcement Learning in Dynamic Environments | Moharram Challenger Team | 2511.15284 | null |
| 2025-11-19 | Look, Zoom, Understand: The Robotic Eyeball for Embodied Perception | Wenzhao Lian Team | 2511.15279 | null |
| 2025-11-19 | Behavior Trees vs Executable Ontologies: a Comparative Analysis of Robot Control Paradigms | Alexander Boldachev Team | 2511.15274 | null |
| 2025-11-19 | Symmetry-Breaking in Multi-Agent Navigation: Winding Number-Aware MPC with a Learned Topological Strategy | Tadashi Kozuno Team | 2511.15239 | null |
| 2025-11-19 | Efficient Transformer-Integrated Deep Neural Architectures for Robust EEG Decoding of Complex Visual Imagery | Byoung-Hee Kwon Team | 2511.15218 | null |
| 2025-11-19 | VIRAL: Visual Sim-to-Real at Scale for Humanoid Loco-Manipulation | Yuke Zhu Team | 2511.15200 | link |
| 2025-11-19 | Eq.Bot: Enhance Robotic Manipulation Learning via Group Equivariant Canonicalization | Zhenzhou Shao Team | 2511.15194 | null |
| 2025-11-19 | Learning Depth from Past Selves: Self-Evolution Contrast for Robust Depth Estimation | Yong Huang Team | 2511.15167 | null |
| 2025-11-19 | An Alignment-Based Approach to Learning Motions from Demonstrations | Julie A Shah Team | 2511.14988 | null |
| 2025-11-18 | Automated laboratory x-ray diffractometer and fluorescence spectrometer for high-throughput materials characterization | Todd C. Hufnagel Team | 2511.14905 | link |
| 2025-11-19 | $\pi^{*}_{0.6}$: a VLA That Learns From Experience | Zhiyuan Zhou Team | 2511.14759 | null |
| 2025-11-18 | HMC: Learning Heterogeneous Meta-Control for Contact-Rich Loco-Manipulation | Xiaolong Wang Team | 2511.14756 | null |
| 2025-11-18 | Masked IRL: LLM-Guided Reward Disambiguation from Demonstrations and Language | Andreea Bobu Team | 2511.14565 | null |
| 2025-11-18 | A Neuro-Symbolic Framework for Reasoning under Perceptual Uncertainty: Bridging Continuous Perception and Discrete Symbolic Planning | Shengwen Yu Team | 2511.14533 | null |
| 2025-11-18 | Achieving Safe Control Online through Integration of Harmonic Control Lyapunov-Barrier Functions with Unsafe Object-Centric Action Policies | Matthias Scheutz Team | 2511.14434 | null |
| 2025-11-18 | Self-Supervised Multisensory Pretraining for Contact-Rich Robot Reinforcement Learning | Georgia Chalvatzaki Team | 2511.14427 | null |
| 2025-11-18 | Continuous Vision-Language-Action Co-Learning with Semantic-Physical Alignment for Behavioral Cloning | Hongpeng Wang Team | 2511.14396 | link |
| 2025-11-18 | MA-SLAM: Active SLAM in Large-Scale Unknown Environment using Map Aware Deep Reinforcement Learning | Yi Jiang Team | 2511.14330 | null |
| 2025-11-18 | NeuralBoneReg: A Novel Self-Supervised Method for Robust and Accurate Multi-Modal Bone Surface Registration | Philipp Fürnstahl Team | 2511.14286 | null |
| 2025-11-18 | Towards Deploying VLA without Fine-Tuning: Plug-and-Play Inference-Time VLA Policy Steering via Embodied Evolutionary Diffusion | Fei Chen Team | 2511.14178 | null |
| 2025-11-18 | RoboTidy: A 3D Gaussian Splatting Household Tidying Benchmark for Embodied Navigation and Action | Jiayu Chen Team | 2511.14161 | null |
| 2025-11-18 | AsyncVLA: Asynchronous Flow Matching for Vision-Language-Action Models | Biqing Qi Team | 2511.14148 | null |
| 2025-11-17 | From Power to Precision: Learning Fine-grained Dexterity for Multi-fingered Robotic Hands | Xiaolong Wang Team | 2511.13710 | link |
| 2025-11-17 | OpenRoboCare: A Multimodal Multi-Task Expert Demonstration Dataset for Robot Caregiving | Tapomayukh Bhattacharjee Team | 2511.13707 | null |
| 2025-11-17 | PhysX-Anything: Simulation-Ready Physical 3D Assets from Single Image | Ziwei Liu Team | 2511.13648 | link |
| 2025-11-17 | Contact-Safe Reinforcement Learning with ProMP Reparameterization and Energy Awareness | Luis Figueredo Team | 2511.13459 | null |
| 2025-11-17 | ZeroDexGrasp: Zero-Shot Task-Oriented Dexterous Grasp Synthesis with Prompt-Based Multi-Stage Semantic Reasoning | Ruizhen Hu Team | 2511.13327 | null |
| 2025-11-17 | EL3DD: Extended Latent 3D Diffusion for Language Conditioned Multitask Manipulation | Sven Behnke Team | 2511.13312 | null |
| 2025-11-17 | Robust Control Design Using a Hybrid-Gain Finite-Time Sliding-Mode Controller | Fernando A. C. C. Fontes Team | 2511.13260 | null |
| 2025-11-17 | Difficulty-Aware Label-Guided Denoising for Monocular 3D Object Detection | Dongbo Min Team | 2511.13195 | null |
| 2025-11-17 | Orientation-Free Neural Network-Based Bias Estimation for Low-Cost Stationary Accelerometers | Itzik Klein Team | 2511.13071 | null |
| 2025-11-17 | Learning Branching Policies for MILPs with Proximal Policy Optimization | Amal El Fallah Seghrouchni Team | 2511.12986 | null |
| 2025-11-17 | ArtiWorld: LLM-Driven Articulation of 3D Objects in Scenes | Feng Zheng Team | 2511.12977 | null |
| 2025-11-17 | DiffuDepGrasp: Diffusion-based Depth Noise Modeling Empowers Sim2Real Robotic Grasping | Dongbin Zhao Team | 2511.12912 | null |
| 2025-11-17 | Uni-Hand: Universal Hand Motion Forecasting in Egocentric Views | Hesheng Wang Team | 2511.12878 | null |
| 2025-11-17 | Structured Imitation Learning of Interactive Policies through Inverse Games | Todd Murphey Team | 2511.12848 | link |
| 2025-11-17 | Mapping fNIRS Signals to Agent Performance: Toward Reinforcement Learning from Neural Feedback | Jivko Sinapov Team | 2511.12844 | null |
| 2025-11-16 | Scalable Multi-Objective and Meta Reinforcement Learning via Gradient Estimation | Hongyang R. Zhang Team | 2511.12779 | null |
| 2025-11-16 | Task-Aware Morphology Optimization of Planar Manipulators via Reinforcement Learning | Sohom Chakrabarty Team | 2511.12650 | null |
| 2025-11-16 | Botany Meets Robotics in Alpine Scree Monitoring | Manolo Garabini Team | 2511.12526 | null |
| 2025-11-16 | RoboAfford++: A Generative AI-Enhanced Dataset for Multimodal Affordance Learning in Robotic Manipulation and Navigation | Long Chen Team | 2511.12436 | null |
| 2025-11-16 | VLA-R: Vision-Language Action Retrieval toward Open-World End-to-End Autonomous Driving | David Hyunchul Shim Team | 2511.12405 | null |
| 2025-11-14 | Volumetric Ergodic Control | Todd Murphey Team | 2511.11533 | null |
| 2025-11-14 | Terrain Costmap Generation via Scaled Preference Conditioning | Joydeep Biswas Team | 2511.11529 | null |
| 2025-11-14 | Scalable Policy Evaluation with Video World Models | Lin Yen-Chen Team | 2511.11520 | null |
| 2025-11-14 | Collaborative Representation Learning for Alignment of Tactile, Language, and Vision Modalities | Jingyuan Chen Team | 2511.11512 | null |
| 2025-11-14 | Rethinking Progression of Memory State in Robotic Manipulation: An Object-Centric Perspective | Ngan Le Team | 2511.11478 | null |
| 2025-11-14 | Simulating an Autonomous System in CARLA using ROS 2 | Mohamed Al-Musleh Team | 2511.11310 | null |
| 2025-11-14 | Experiences from Benchmarking Vision-Language-Action Models for Robotic Manipulation | Xi Zheng Team | 2511.11298 | null |
| 2025-11-14 | Sashimi-Bot: Autonomous Tri-manual Advanced Manipulation and Cutting of Deformable Objects | Ekrem Misimi Team | 2511.11223 | null |
| 2025-11-14 | Humanoid Whole-Body Badminton via Multi-Stage Reinforcement Learning | Xiaoyu Ren Team | 2511.11218 | null |
| 2025-11-14 | One-to-N Backdoor Attack in 3D Point Cloud via Spherical Trigger | Chongxia Wang Team | 2511.11210 | null |
| 2025-11-14 | Viper-F1: Fast and Fine-Grained Multimodal Understanding with Cross-Modal State-Space Modulation | Debesh Jha Team | 2511.11177 | null |
| 2025-11-14 | Phys-Liquid: A Physics-Informed Dataset for Estimating 3D Geometry and Volume of Transparent Deformable Liquids | Tian Xia Team | 2511.11077 | link |
| 2025-11-14 | AdaptPNP: Integrating Prehensile and Non-Prehensile Skills for Adaptive Robotic Manipulation | Lin Shao Team | 2511.11052 | null |
| 2025-11-14 | Autonomous Vehicle Path Planning by Searching With Differentiable Simulation | Luc Van Gool Team | 2511.11043 | null |
| 2025-11-14 | Dexterous Manipulation Transfer via Progressive Kinematic-Dynamic Alignment | Yi Sun Team | 2511.10987 | null |
| 2025-11-14 | Collaborative Multi-Robot Non-Prehensile Manipulation via Flow-Matching Co-Generation | Jiaoyang Li Team | 2511.10874 | null |
| 2025-11-14 | WetExplorer: Automating Wetland Greenhouse-Gas Surveys with an Autonomous Mobile Robot | Xuping Zhang Team | 2511.10864 | null |
| 2025-11-13 | SURFACEBENCH: Can Self-Evolving LLMs Find the Equations of 3D Scientific Surfaces? | Chandan K. Reddy Team | 2511.10833 | null |
| 2025-11-13 | Expert Consensus-based Video-Based Assessment Tool for Workflow Analysis in Minimally Invasive Colorectal Surgery: Development and Validation of ColoWorkflow | Nicolas Padoy Team | 2511.10766 | null |
| 2025-11-13 | Attentive Feature Aggregation or: How Policies Learn to Stop Worrying about Robustness and Attend to Task-Relevant Visual Cues | Chris Xiaoxuan Lu Team | 2511.10762 | null |
| 2025-11-13 | Robot Crash Course: Learning Soft and Stylized Falling | Moritz Bächer Team | 2511.10635 | null |
| 2025-11-13 | OmniVGGT: Omni-Modality Driven Visual Geometry Grounded | Ziwei Liu Team | 2511.10560 | link |
| 2025-11-13 | SemanticVLA: Semantic-Aligned Sparsification and Enhancement for Efficient Robotic Manipulation | Liqiang Nie Team | 2511.10518 | link |
| 2025-11-13 | RoboBenchMart: Benchmarking Robots in Retail Environment | Vlad Shakhuro Team | 2511.10276 | null |
| 2025-11-13 | Learning a Thousand Tasks in a Day | Edward Johns Team | 2511.10110 | link |
| 2025-11-13 | Opinion: Towards Unified Expressive Policy Optimization for Robust Robot Learning | Xiaocong Li Team | 2511.10087 | null |
| 2025-11-13 | Physics-informed Machine Learning for Static Friction Modeling in Robotic Manipulators Based on Kolmogorov-Arnold Networks | Yinghua Liu Team | 2511.10079 | null |
| 2025-11-13 | Efficient Verification and Falsification of ReLU Neural Barrier Certificates | Bai Xue Team | 2511.10015 | null |
| 2025-11-13 | Audio-VLA: Adding Contact Audio Perception to Vision-Language-Action Model for Robotic Manipulation | Changbo Wang Team | 2511.09958 | null |
| 2025-11-13 | A Study on Enhancing the Generalization Ability of Visuomotor Policies via Data Augmentation | Hanwen Wang Team | 2511.09932 | null |
| 2025-11-13 | Harnessing Bounded-Support Evolution Strategies for Policy Refinement | Fabio Ramos Team | 2511.09923 | null |
| 2025-11-13 | Evolving Rules: Imitation and Best Response Learning in Cournot Oligopoly | Boyu Zhang Team | 2511.09839 | null |
| 2025-11-13 | Provably Safe Stein Variational Clarity-Aware Informative Planning | Dimitra Panagou Team | 2511.09836 | link |
| 2025-11-12 | A Robust Task-Level Control Architecture for Learned Dynamical Systems | Naira Hovakimyan Team | 2511.09790 | null |
| 2025-11-12 | Out-of-Distribution Generalization with a SPARC: Racing 100 Unseen Vehicles with a Single Policy | Peter R. Wurman Team | 2511.09737 | link |
| 2025-11-12 | Baby Sophia: A Developmental Approach to Self-Exploration through Self-Touch and Hand Regard | Katerina Pastra Team | 2511.09727 | null |
| 2025-11-12 | SEBA: Sample-Efficient Black-Box Attacks on Visual Reinforcement Learning | Haibo Hu Team | 2511.09681 | null |
| 2025-11-12 | Statistically Consistent Approximate Model Predictive Control | Melanie N. Zeilinger Team | 2511.09661 | null |
| 2025-11-12 | IFG: Internet-Scale Guidance for Functional Grasping Generation | Deepak Pathak Team | 2511.09558 | link |
| 2025-11-12 | SpatialActor: Exploring Disentangled Spatial Representations for Robust Robotic Manipulation | Gao Huang Team | 2511.09555 | link |
| 2025-11-10 | Lightning Grasp: High Performance Procedural Grasp Synthesis with Contact Fields | Pieter Abbeel Team | 2511.07418 | link |
| 2025-11-10 | Robot Learning from a Physical World Model | Yue Wang Team | 2511.07416 | link |
| 2025-11-10 | Enabling Off-Policy Imitation Learning with Deep Actor Critic Stabilization | Shalabh Bhatnagar Team | 2511.07288 | null |
| 2025-11-10 | SlotVLA: Towards Modeling of Object-Relation Representations in Robotic Manipulation | Ngan Le Team | 2511.06754 | null |
| 2025-11-10 | Physically-Grounded Goal Imagination: Physics-Informed Variational Autoencoder for Self-Supervised Reinforcement Learning | Nam Pham Hai Team | 2511.06745 | null |
| 2025-11-10 | Rapidly Learning Soft Robot Control via Implicit Time-Stepping | Dezhong Tong Team | 2511.06667 | link |
| 2025-11-09 | Real Garment Benchmark (RGBench): A Comprehensive Benchmark for Robotic Garment Manipulation featuring a High-Fidelity Scalable Simulator | Ruigang Yang Team | 2511.06434 | null |
| 2025-11-09 | ExpReS-VLA: Specializing Vision-Language-Action Models Through Experience Replay and Retrieval | Jeff Ichnowski Team | 2511.06202 | null |
| 2025-11-08 | Exploring Category-level Articulated Object Pose Tracking on SE(3) Manifolds | Jun Liu Team | 2511.05996 | null |
| 2025-11-08 | Gentle Manipulation Policy Learning via Demonstrations from VLM Planned Atomic Skills | Renjing Xu Team | 2511.05855 | null |
| 2025-11-08 | VLAD-Grasp: Zero-shot Grasp Detection via Vision-Language Models | Aniket Bera Team | 2511.05791 | null |
| 2025-11-07 | VLM-driven Skill Selection for Robotic Assembly Tasks | Chang-Hyun Kim Team | 2511.05680 | null |
| 2025-11-07 | EveryDayVLA: A Vision-Language-Action Model for Affordable Robotic Manipulation | Samuel Dickerson Team | 2511.05397 | null |
| 2025-11-07 | ETHOS: A Robotic Encountered-Type Haptic Display for Social Interaction in Virtual Reality | Matthew K. X. J. Pan Team | 2511.05379 | null |
| 2025-11-07 | Force-Safe Environment Maps and Real-Time Detection for Soft Robot Manipulators | Andrew P. Sabelhaus Team | 2511.05307 | null |
| 2025-11-07 | Context-aware Learned Mesh-based Simulation via Trajectory-Level Meta-Learning | Gerhard Neumann Team | 2511.05234 | null |
| 2025-11-07 | Let Me Show You: Learning by Retrieving from Egocentric Video for Robotic Manipulation | Feifei Feng Team | 2511.05199 | null |
| 2025-11-07 | Follow-Me in Micro-Mobility with End-to-End Imitation Learning | Jorge Peña Queralta Team | 2511.05158 | null |
| 2025-11-07 | TAPOM: Task-Space Topology-Guided Motion Planning for Manipulating Elongated Object in Cluttered Environments | Yijiang Huang Team | 2511.05052 | null |
| 2025-11-07 | MoE-DP: An MoE-Enhanced Diffusion Policy for Robust Long-Horizon Robotic Manipulation with Skill Decomposition and Failure Recovery | Huazhe Xu Team | 2511.05007 | null |
| 2025-11-06 | Isaac Lab: A GPU-Accelerated Simulation Framework for Multi-Modal Robot Learning | Gavriel State Team | 2511.04831 | link |
| 2025-11-06 | Unified Multimodal Diffusion Forcing for Forceful Manipulation | Dmitry Berenson Team | 2511.04812 | link |
| 2025-11-06 | ReGen: Generative Robot Simulation via Inverse Design | Daniela Rus Team | 2511.04769 | null |
| 2025-11-06 | X-Diffusion: Training Diffusion Policies on Cross-Embodiment Human Demonstrations | Kushal Kedia Team | 2511.04671 | null |
| 2025-11-06 | Real-to-Sim Robot Policy Evaluation with Gaussian Splatting Simulation of Soft-Body Interactions | Yunzhu Li Team | 2511.04665 | link |
| 2025-11-06 | ForeRobo: Unlocking Infinite Simulation Data for 3D Goal-driven Robotic Manipulation | Chunsheng Liu Team | 2511.04381 | null |
| 2025-11-06 | GraSP-VLA: Graph-based Symbolic Action Representation for Long-Horizon Planning with VLA Policies | Cédric Buche Team | 2511.04357 | null |
| 2025-11-06 | Learning Vision-Driven Reactive Soccer Skills for Humanoid Robots | Mingguo Zhao Team | 2511.03996 | link |
| 2025-11-05 | Investigating Robot Control Policy Learning for Autonomous X-ray-guided Spine Procedures | Mathias Unberath Team | 2511.03882 | null |
| 2025-11-05 | Going Beyond Expert Performance via Deep Implicit Imitation Reinforcement Learning | Georgios Chalkiadakis Team | 2511.03616 | null |
| 2025-11-05 | Imitation Learning in the Deep Learning Era: A Novel Taxonomy and Recent Advances | Georgios Chalkiadakis Team | 2511.03565 | null |
| 2025-11-05 | Development of the Bioinspired Tendon-Driven DexHand 021 with Proprioceptive Compliance Control | Sheng Yi Team | 2511.03481 | null |
| 2025-11-05 | Learning-based Cooperative Robotic Paper Wrapping: A Unified Control Policy with Residual Force Control | Kensuke Harada Team | 2511.03181 | null |
| 2025-11-05 | Learning Natural and Robust Hexapod Locomotion over Complex Terrains via Motion Priors based on Deep Reinforcement Learning | Feng Gao Team | 2511.03167 | null |
| 2025-11-05 | ISC-Perception: A Hybrid Computer Vision Dataset for Object Detection in Novel Steel Assembly | Debra F. Laefer Team | 2511.03098 | null |
| 2025-11-04 | 3D Cal: An Open-Source Software Library for Calibrating Tactile Sensors | Gregory Reardon Team | 2511.03078 | null |
| 2025-11-04 | Audience Amplified: Virtual Audiences in Asynchronously Performed AR Theater | Tobias Höllerer Team | 2511.02807 | null |
| 2025-11-04 | Dexterous Robotic Piano Playing at Scale | Dieter Büchler Team | 2511.02504 | null |
| 2025-11-04 | LACY: A Vision-Language Model-based Language-Action Cycle for Self-Improving Robotic Manipulation | Changhyun Choi Team | 2511.02239 | link |
| 2025-10-31 | A Step Toward World Models: A Survey on Robotic Manipulation | Heng Tao Shen Team | 2511.02097 | null |
| 2025-11-03 | TRACE: Textual Reasoning for Affordance Coordinate Extraction | Matthew S. Brown Team | 2511.01999 | null |
| 2025-11-01 | iFlyBot-VLA Technical Report | Jia Pan Team | 2511.01914 | null |
| 2025-11-03 | SE(3)-PoseFlow: Estimating 6D Pose Distributions for Uncertainty-Aware Robotic Manipulation | Georgia Chalvatzaki Team | 2511.01501 | null |
| 2025-11-03 | RobustVLA: Robustness-Aware Reinforcement Post-Training for Vision-Language-Action Models | Donglin Wang Team | 2511.01331 | null |
| 2025-11-03 | Improving Needle Penetration via Precise Rotational Insertion Using Iterative Learning Control | Tsu-Chin Tsao Team | 2511.01256 | null |
| 2025-11-03 | Embodiment Transfer Learning for Vision-Language-Action Models | Yaxin Peng Team | 2511.01224 | null |
| 2025-11-02 | Deployable Vision-driven UAV River Navigation via Human-in-the-loop Preference Alignment | Nina Mahmoudian Team | 2511.01083 | null |
| 2025-11-02 | GauDP: Reinventing Multi-Agent Collaboration through Gaussian-Image Synergy in Diffusion Policies | Ruimao Zhang Team | 2511.00998 | link |
| 2025-11-01 | Improving Robustness to Out-of-Distribution States in Imitation Learning via Deep Koopman-Boosted Diffusion Policy | Zhongliang Jiang Team | 2511.00555 | null |
| 2025-10-31 | EgoMI: Learning Active Vision and Whole-Body Manipulation from Egocentric Human Demonstrations | Philipp Wu Team | 2511.00153 | null |
| 2025-10-31 | Whole-Body Proprioceptive Morphing: A Modular Soft Gripper for Robust Cross-Scale Grasping | Xiaonan Huang Team | 2510.27666 | null |
| 2025-10-31 | Toward Accurate Long-Horizon Robotic Manipulation: Language-to-Action with Foundation Models via Scene Graphs | Shinkyu Park Team | 2510.27558 | null |
| 2025-10-31 | When AI Trading Agents Compete: Adverse Selection of Meta-Orders by Reinforcement Learning-Based Market Making | Nick Firoozye Team | 2510.27334 | null |
| 2025-10-31 | Learning Generalizable Visuomotor Policy through Dynamics-Alignment | Jungwoo Lee Team | 2510.27114 | null |
| 2025-10-30 | Hybrid Consistency Policy: Decoupling Multi-Modal Diversity and Real-Time Efficiency in Robotic Manipulation | Qiaojun Yu Team | 2510.26670 | null |
| 2025-10-31 | An Impulse Control Approach to Market Making in a Hawkes LOB Market | Philip Treleaven Team | 2510.26438 | null |
| 2025-10-30 | Human-in-the-loop Online Rejection Sampling for Robotic Manipulation | Yansong Tang Team | 2510.26406 | null |
| 2025-10-30 | Beyond Imitation: Constraint-Aware Trajectory Generation with Flow Matching For End-to-End Autonomous Driving | Yandan Luo Team | 2510.26292 | null |
| 2025-10-30 | Learning to Manage Investment Portfolios beyond Simple Utility Functions | J. Doyne Farmer Team | 2510.26165 | null |
| 2025-10-28 | A Humanoid Visual-Tactile-Action Dataset for Contact-Rich Manipulation | Kyung-Joong Kim Team | 2510.25725 | null |
| 2025-10-29 | Sim-to-Real Gentle Manipulation of Deformable and Fragile Objects with Stress-Guided Reinforcement Learning | Florian T. Pokorny Team | 2510.25405 | null |
| 2025-10-29 | SynHLMA: Synthesizing Hand Language Manipulation for Articulated Object with Discrete Human Object Interaction Representation | Dan Guo Team | 2510.25268 | null |
| 2025-10-29 | Time-Optimal Transport of Loosely Placed Liquid Filled Cups along Prescribed Paths | Andreas Mueller Team | 2510.25255 | null |
| 2025-10-29 | Hybrid Vision Servoing with Deep Alignment and GRU-Based Occlusion Recovery | Jongseong Brad Choi Team | 2510.25233 | null |
| 2025-10-29 | Learning Spatial-Aware Manipulation Ordering | Jian Pu Team | 2510.25138 | null |
| 2025-10-29 | NanoVLA: Routing Decoupled Vision-Language Understanding for Nano-sized Generalist Robotic Policies | Jinghui Lu Team | 2510.25122 | null |
| 2025-10-28 | Fare: Failure Resilience in Learned Visual Navigation Control | David Hsu Team | 2510.24680 | null |
| 2025-10-28 | Advancing site-specific disease and pest management in precision agriculture: From reasoning-driven foundation models to adaptive, feedback-based learning | Arnold W. Schumann Team | 2510.24650 | null |
| 2025-10-28 | DynaRend: Learning 3D Dynamics via Masked Future Rendering for Robotic Manipulation | Gang Hua Team | 2510.24261 | null |
| 2025-10-28 | Manipulate as Human: Learning Task-oriented Manipulation Skills by Adversarial Motion Priors | Yue Gao Team | 2510.24257 | null |
| 2025-10-28 | Blindfolded Experts Generalize Better: Insights from Robotic Manipulation and Videogames | Aviv Tamar Team | 2510.24194 | null |
| 2025-10-28 | PFEA: An LLM-based High-Level Natural Language Planning and Feedback Embodied Agent for Human-Centered AI | Philip Dames Team | 2510.24109 | null |
| 2025-10-28 | ZTRS: Zero-Imitation End-to-end Autonomous Driving with Trajectory Scoring | Jose M. Alvarez Team | 2510.24108 | null |
| 2025-10-28 | Learning Parameterized Skills from Demonstrations | George Konidaris Team | 2510.24095 | null |
| 2025-10-28 | Language-Conditioned Representations and Mixture-of-Experts Policy for Robust Multi-Task Robotic Manipulation | Jiashuo Bai Team | 2510.24055 | null |
| 2025-10-27 | Adaptive Keyframe Selection for Scalable 3D Scene Reconstruction in Dynamic Environments | Giuseppe Loianno Team | 2510.23928 | null |
| 2025-10-29 | RoboOmni: Proactive Robot Manipulation in Omni-modal Context | Xipeng Qiu Team | 2510.23763 | null |
| 2025-10-27 | RobotArena $\infty$: Scalable Robot Benchmarking via Real-to-Sim Translation | Katerina Fragkiadaki Team | 2510.23571 | link |
| 2025-10-27 | Optimal Dimensioning of Elastic-Link Manipulators regarding Lifetime Estimation | Andreas Mueller Team | 2510.23234 | null |
| 2025-10-27 | Workspace Registration and Collision Detection for Industrial Robotics Applications | Andreas Mueller Team | 2510.23227 | null |
| 2025-10-27 | Finding 3D Scene Analogies with Multimodal Foundation Models | Young Min Kim Team | 2510.23184 | null |
| 2025-10-27 | ManiDP: Manipulability-Aware Diffusion Policy for Posture-Dependent Bimanual Manipulation | Fei Chen Team | 2510.23016 | null |
| 2025-10-26 | Learning Neural Observer-Predictor Models for Limb-level Sampling-based Locomotion Planning | Guoquan Huang Team | 2510.22789 | null |
| 2025-10-26 | Edge Collaborative Gaussian Splatting with Integrated Rendering and Communication | Chengzhong Xu Team | 2510.22718 | null |
| 2025-10-26 | FastVLM: Self-Speculative Decoding for Fast Vision-Language Model Inference | Manjesh Kumar Hanawal Team | 2510.22641 | null |
| 2025-10-25 | A Novel Multi-Timescale Stability-Preserving Hierarchical Reinforcement Learning Controller Framework for Adaptive Control in High-Dimensional Dynamical Systems | Benyamin Safizadeh Team | 2510.22420 | null |
| 2025-10-25 | ACG: Action Coherence Guidance for Flow-based VLA models | Jaegul Choo Team | 2510.22201 | null |
| 2025-10-25 | RaycastGrasp: Eye-Gaze Interaction with Wearable Devices for Robotic Manipulation | Yang Ye Team | 2510.22113 | null |
| 2025-10-24 | Two-Steps Diffusion Policy for Robotic Manipulation via Genetic Denoising | Yinchuan Li Team | 2510.21991 | null |
| 2025-10-27 | On Uncertainty Calibration for Equivariant Functions | Robin Walters Team | 2510.21691 | link |
| 2025-10-24 | Enhancing Tactile-based Reinforcement Learning for Robotic Control | Sethu Vijayakumar Team | 2510.21609 | null |
| 2025-10-24 | Scalable Vision-Language-Action Model Pretraining for Robotic Manipulation with Real-Life Human Activity Videos | Baining Guo Team | 2510.21571 | link |
| 2025-10-24 | Learning Neural Control Barrier Functions from Expert Demonstrations using Inverse Constraint Learning | Hussein Sibai Team | 2510.21560 | null |
| 2025-10-24 | Generalizable Hierarchical Skill Learning via Object-Centric Representation | Robert Platt Team | 2510.21121 | null |
| 2025-10-23 | BioDet: Boosting Industrial Object Detection with Image Preprocessing Strategies | Benjamin Busam Team | 2510.21000 | null |
| 2025-10-23 | SutureBot: A Precision Framework & Benchmark For Autonomous End-to-End Suturing | Axel Krieger Team | 2510.20965 | null |
| 2025-10-23 | GSWorld: Closed-Loop Photo-Realistic Simulation Suite for Robotic Manipulation | Xiaolong Wang Team | 2510.20813 | null |
| 2025-10-23 | FieldGen: From Teleoperated Pre-Manipulation Trajectories to Field-Guided Data Generation | Yao Mu Team | 2510.20774 | link |
| 2025-10-23 | A Parameter-Linear Formulation of the Optimal Path Following Problem for Robotic Manipulator | Andreas Mueller Team | 2510.20496 | null |
| 2025-10-23 | Dual Control Reference Generation for Optimal Pick-and-Place Execution under Payload Uncertainty | Tom Lefebvre Team | 2510.20483 | null |
| 2025-10-23 | PointMapPolicy: Structured Point Cloud Processing for Multi-Modal Imitation Learning | Gerhard Neumann Team | 2510.20406 | null |
| 2025-10-23 | NeuralTouch: Neural Descriptors for Precise Sim-to-Real Tactile Robot Control | Nathan F. Lepora Team | 2510.20390 | null |
| 2025-10-23 | MemER: Scaling Up Memory for Robot Control via Experience Retrieval | Chelsea Finn Team | 2510.20328 | link |
| 2025-10-22 | Approximate Model Predictive Control for Microgrid Energy Management via Imitation Learning | Bart De Schutter Team | 2510.20040 | null |
| 2025-10-22 | Seed3D 1.0: From Images to High-Fidelity Simulation-Ready 3D Assets | Xuanmeng Zhang Team | 2510.19944 | link |
| 2025-10-25 | Using Non-Expert Data to Robustify Imitation Learning via Offline Reinforcement Learning | Abhishek Gupta Team | 2510.19495 | null |
| 2025-10-22 | Seeing Across Views: Benchmarking Spatial Reasoning of Vision-Language Models in Robotic Scenes | Baining Guo Team | 2510.19400 | link |
| 2025-10-22 | Using Temperature Sampling to Effectively Train Robot Learning Policies on Imbalanced Datasets | Bernadette Bucher Team | 2510.19373 | null |
| 2025-10-22 | Imitation Learning Policy based on Multi-Step Consistent Integration Shortcut Model | Jie Zhao Team | 2510.19356 | null |
| 2025-10-22 | Unified Reinforcement and Imitation Learning for Vision-Language Models | Yueh-Hua Wu Team | 2510.19307 | link |
| 2025-10-22 | TARMAC: A Taxonomy for Robot Manipulation in Chemistry | Jihong Zhu Team | 2510.19289 | null |
| 2025-10-21 | A Cross-Environment and Cross-Embodiment Path Planning Framework via a Conditional Diffusion Model | Homayoun Najjaran Team | 2510.19128 | null |
| 2025-10-21 | Efficient Model-Based Reinforcement Learning for Robot Control via Online Learning | Marco Hutter Team | 2510.18518 | null |
| 2025-10-23 | MoTVLA: A Vision-Language-Action Model with Unified Fast-Slow Reasoning | Heng Yang Team | 2510.18337 | null |
| 2025-10-21 | MoMaGen: Generating Demonstrations under Soft and Hard Constraints for Multi-Step Bimanual Mobile Manipulation | Li Fei-Fei Team | 2510.18316 | null |
| 2025-10-20 | Quality Over Quantity: Curating Contact-Based Robot Datasets Improves Learning | Ian Abraham Team | 2510.18137 | null |
| 2025-10-20 | R2BC: Multi-Agent Imitation Learning from Single-Agent Demonstrations | Daniel S. Brown Team | 2510.18085 | null |
| 2025-10-20 | SPACeR: Self-Play Anchoring with Centralized Reference Models | Wei Zhan Team | 2510.18060 | link |
| 2025-10-20 | RESample: A Robust Data Augmentation Framework via Exploratory Sampling for Robotic Manipulation | Ziwei Wang Team | 2510.17640 | null |
| 2025-10-20 | Learned Inertial Odometry for Cycling Based on Mixture of Experts Algorithm | Xiaoji Niu Team | 2510.17604 | null |
| 2025-10-20 | Plasma Shape Control via Zero-shot Generative Reinforcement Learning | Wulyu Zhong Team | 2510.17531 | null |
| 2025-10-20 | A Generalization of Input-Output Linearization via Dynamic Switching Between Melds of Output Functions | Antonio Franchi Team | 2510.17448 | null |
| 2025-10-22 | OmniVIC: A Self-Improving Variable Impedance Controller with Vision-Language In-Context Learning for Safe Robotic Manipulation | Arash Ajoudani Team | 2510.17150 | link |
| 2025-10-20 | Decentralized Real-Time Planning for Multi-UAV Cooperative Manipulation via Imitation Learning | Sihao Sun Team | 2510.17143 | null |
| 2025-10-20 | Learning to Design Soft Hands using Reward Models | Sha Yi Team | 2510.17086 | null |
| 2025-10-19 | End-to-end Listen, Look, Speak and Act | Chao Zhang Team | 2510.16756 | null |
| 2025-10-18 | MoS-VLA: A Vision-Language-Action Model with One-Shot Skill Adaptation | Ufuk Topcu Team | 2510.16617 | null |
| 2025-10-18 | Buzz, Choose, Forget: A Meta-Bandit Framework for Bee-Like Decision Making | Jean-Michel Loubes Team | 2510.16462 | null |
| 2025-10-18 | Learning to Optimize Edge Robotics: A Fast Integrated Perception-Motion-Communication Approach | Chengzhong Xu Team | 2510.16424 | null |
| 2025-10-17 | DeGrip: A Compact Cable-driven Robotic Gripper for Desktop Disassembly | Minghui Zheng Team | 2510.16231 | null |
| 2025-10-17 | DexCanvas: Bridging Human Demonstrations and Robot Learning for Dexterous Manipulation | Yiwen Lu Team | 2510.15786 | null |
| 2025-10-22 | VO-DP: Semantic-Geometric Adaptive Diffusion Policy for Vision-Only Robotic Manipulation | Bin He Team | 2510.15530 | null |
| 2025-10-17 | Exploring Conditions for Diffusion models in Robotic Control | Taekyung Kim Team | 2510.15510 | link |
| 2025-10-17 | Perfect Prediction or Plenty of Proposals? What Matters Most in Planning for Autonomous Driving | Joschka Boedecker Team | 2510.15505 | null |
| 2025-10-17 | Learning to Answer from Correct Demonstrations | Nathan Srebro Team | 2510.15464 | null |
| 2025-10-17 | GaussGym: An open-source real-to-sim framework for learning locomotion from pixels | Pieter Abbeel Team | 2510.15352 | null |
| 2025-10-16 | RM-RL: Role-Model Reinforcement Learning for Precise Robot Manipulation | Jianfei Yang Team | 2510.15189 | null |
| 2025-10-18 | VT-Refine: Learning Bimanual Assembly with Visuo-Tactile Feedback via Simulation Fine-Tuning | Yunzhu Li Team | 2510.14930 | link |
| 2025-10-16 | SADCHER: Scheduling using Attention-based Dynamic Coalitions of Heterogeneous Robots in Real-Time | Javier Alonso-Mora Team | 2510.14851 | link |
| 2025-10-16 | RL-100: Performant Robotic Manipulation with Real-World Reinforcement Learning | Huazhe Xu Team | 2510.14830 | link |
| 2025-10-16 | Open TeleDex: A Hardware-Agnostic Teleoperation System for Imitation Learning based Dexterous Manipulation | Shan An Team | 2510.14771 | null |
| 2025-10-16 | Accelerated Multi-Modal Motion Planning Using Context-Conditioned Diffusion Models | Wilm Decré Team | 2510.14615 | null |
| 2025-10-16 | Restoring Noisy Demonstration for Imitation Learning With Diffusion Models | Shao-Hua Sun Team | 2510.14467 | null |
| 2025-10-16 | Expertise need not monopolize: Action-Specialized Mixture of Experts for Vision-Language-Action Learning | Yao Mu Team | 2510.14300 | null |
| 2025-10-15 | ViTacGen: Robotic Pushing with Vision-to-Touch Generation | Shan Luo Team | 2510.14117 | null |
| 2025-10-15 | Optimistic Reinforcement Learning-Based Skill Insertions for Task and Motion Planning | Bram Vanderborght Team | 2510.14065 | null |
| 2025-10-17 | CausalVerse: Benchmarking Causal Representation Learning with Configurable High-Fidelity Simulations | Kun Zhang Team | 2510.14049 | null |
| 2025-10-15 | LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models | Xipeng Qiu Team | 2510.13626 | null |
| 2025-10-15 | Efficient Force and Stiffness Prediction in Robotic Produce Handling with a Piezoresistive Pressure Sensor | Xiaobo Tan Team | 2510.13616 | link |
| 2025-10-15 | Active Tactile Exploration for Rigid Body Pose and Shape Estimation | Michael Posa Team | 2510.13595 | null |
| 2025-10-15 | Tactile-Conditioned Diffusion Policy for Force-Aware Robotic Manipulation | Jan Peters Team | 2510.13324 | null |
| 2025-10-15 | Model-agnostic Adversarial Attack and Defense for Vision-Language-Action Models | Jingfeng Zhang Team | 2510.13237 | null |
| 2025-10-15 | Beyond Static LLM Policies: Imitation-Enhanced Reinforcement Learning for Recommendation | Sen Wang Team | 2510.13229 | null |
| 2025-10-15 | VLA-0: Building State-of-the-Art VLAs with Zero Modification | Fabio Ramos Team | 2510.13054 | null |
| 2025-10-14 | Development of a Linear Guide-Rail Testbed for Physically Emulating ISAM Operations | Christopher Petersen Team | 2510.13005 | null |
| 2025-10-14 | Actron3D: Learning Actionable Neural Functions from Videos for Transferable Robotic Manipulation | Stefan Leutenegger Team | 2510.12971 | null |
| 2025-10-14 | Learning to Grasp Anything by Playing with Random Toys | Roei Herzig Team | 2510.12866 | null |
| 2025-10-14 | CoIRL-AD: Collaborative-Competitive Imitation-Reinforcement Learning in Latent World Models for Autonomous Driving | Jiangtao Gong Team | 2510.12560 | null |
| 2025-10-14 | Automated Behavior Planning for Fruit Tree Pruning via Redundant Robot Manipulators: Addressing the Behavior Planning Challenge | Bram Vanderborght Team | 2510.12509 | null |
| 2025-10-14 | Fast Visuomotor Policy for Robotic Manipulation | Wenqiang Zhang Team | 2510.12483 | null |
| 2025-10-14 | Robot Learning: A Tutorial | Michel Aractingi Team | 2510.12403 | null |
| 2025-10-14 | Improving Generative Behavior Cloning via Self-Guidance and Adaptive Chunking | Eunhyeok Park Team | 2510.12392 | null |
| 2025-10-14 | Learning Social Navigation from Positive and Negative Demonstrations and Rule-Based Specifications | Sungjoon Choi Team | 2510.12215 | link |
| 2025-10-13 | Phys2Real: Fusing VLM Priors with Interactive Online Adaptation for Uncertainty-Aware Sim-to-Real Manipulation | Mac Schwager Team | 2510.11689 | null |
| 2025-10-14 | ManiAgent: An Agentic Framework for General Robotic Manipulation | Xudong Liu Team | 2510.11660 | null |
| 2025-10-13 | HiMaCon: Discovering Hierarchical Manipulation Concepts from Unlabeled Multi-Modal Data | Yanchao Yang Team | 2510.11321 | null |
| 2025-10-13 | FOSSIL: Harnessing Feedback on Suboptimal Samples for Data-Efficient Generalisation with Imitation Learning for Embodied Vision-and-Language Tasks | Alessandro Suglia Team | 2510.11307 | null |
| 2025-10-13 | DemoHLM: From One Demonstration to Generalizable Humanoid Loco-Manipulation | Zongqing Lu Team | 2510.11258 | null |
| 2025-10-13 | Flow Matching-Based Autonomous Driving Planning with Advanced Interactive Behavior Modeling | Jingjing Liu Team | 2510.11083 | null |
| 2025-10-13 | Towards a Unified Understanding of Robot Manipulation: A Comprehensive Survey | Badong Chen Team | 2510.10903 | null |
| 2025-10-12 | High-Fidelity Simulated Data Generation for Real-World Zero-Shot Robotic Manipulation Learning with Gaussian Splatting | Hua Zou Team | 2510.10637 | null |
| 2025-10-12 | Population-Coded Spiking Neural Networks for High-Dimensional Robotic Control | Jeethu Sreenivas Amuthan Team | 2510.10516 | null |
| 2025-10-12 | Data-driven simulator of multi-animal behavior with unknown dynamics via offline and online reinforcement learning | Yoshinobu Kawahara Team | 2510.10451 | null |
| 2025-10-11 | X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model | Xianyuan Zhan Team | 2510.10274 | null |
| 2025-10-11 | A3RNN: Bi-directional Fusion of Bottom-up and Top-down Process for Developmental Visual Attention in Robots | Tetsuya Ogata Team | 2510.10221 | null |
| 2025-10-11 | UF-RNN: Real-Time Adaptive Motion Generation Using Uncertainty-Driven Foresight Prediction | Tetsuya Ogata Team | 2510.10217 | null |
| 2025-10-15 | Ctrl-World: A Controllable Generative World Model for Robot Manipulation | Chelsea Finn Team | 2510.10125 | null |
| 2025-10-10 | VITA-VLA: Efficiently Teaching Vision-Language Models to Act via Action Expert Distillation | Caifeng Shan Team | 2510.09607 | link |
| 2025-10-13 | Guiding Energy-Efficient Locomotion through Impact Mitigation Rewards | Alireza Ramezani Team | 2510.09543 | null |
| 2025-10-10 | Autonomous Soft Robotic Guidewire Navigation via Imitation Learning | Axel Krieger Team | 2510.09497 | null |
| 2025-10-13 | Near-Optimal Second-Order Guarantees for Model-Based Adversarial Imitation Learning | Weitong Zhang Team | 2510.09487 | null |
| 2025-10-13 | Failure Prediction at Runtime for Generative Robot Policies | Angela P. Schoellig Team | 2510.09459 | link |
| 2025-10-10 | Rate optimal learning of equilibria from data | Giorgia Ramponi Team | 2510.09325 | null |
| 2025-10-10 | Glovity: Learning Dexterous Contact-Rich Manipulation via Spatial Wrench Feedback Teleoperation System | Pai Zheng Team | 2510.09229 | null |
| 2025-10-10 | FM-IRL: Flow-Matching for Reward Modeling and Policy Regularization in Reinforcement Learning | Ivor Tsang Team | 2510.09222 | null |
| 2025-10-10 | When a Robot is More Capable than a Human: Learning from Constrained Demonstrators | Erdem Bıyık Team | 2510.09096 | null |
| 2025-10-10 | iMoWM: Taming Interactive Multi-Modal World Model for Robotic Manipulation | Ziwei Wang Team | 2510.09036 | null |
| 2025-10-09 | Humanoid Everyday: A Comprehensive Robotic Dataset for Open-World Humanoid Manipulation | Yue Wang Team | 2510.08807 | null |
| 2025-10-09 | Geometry-aware Policy Imitation | Sylvain Calinon Team | 2510.08787 | null |
| 2025-10-09 | Point and Go: Intuitive Reference Frame Reallocation in Mode Switching for Assistive Robotics | M. Jagersand Team | 2510.08753 | null |
| 2025-10-09 | Agent Learning via Early Experience | Yifan Wu Team | 2510.08558 | null |
| 2025-10-09 | R2RGEN: Real-to-Real 3D Data Generation for Spatially Generalized Manipulation | Jiwen Lu Team | 2510.08547 | link |
| 2025-10-09 | Unlocking 3D Affordance Segmentation with 2D Semantic Knowledge | Wei Shen Team | 2510.08316 | null |
| 2025-10-09 | FastUMI-100K: Advancing Data-driven Robotic Manipulation with a Large-scale UMI-style Dataset | Xuelong Li Team | 2510.08022 | null |
| 2025-10-09 | DM1: MeanFlow with Dispersive Regularization for 1-Step Robotic Manipulation | Weibing Li Team | 2510.07865 | link |
| 2025-10-09 | Trajectory Conditioned Cross-embodiment Skill Transfer | Bin Zhao Team | 2510.07773 | null |
| 2025-10-11 | Differentiable Particle Optimization for Fast Sequential Manipulation | Zachary Kingston Team | 2510.07674 | null |
| 2025-10-08 | WristWorld: Generating Wrist-Views via 4D World Models for Robotic Manipulation | Shanghang Zhang Team | 2510.07313 | null |
| 2025-10-09 | TIGeR: Tool-Integrated Geometric Reasoning in Vision-Language Models for Robotics | Shanghang Zhang Team | 2510.07181 | null |
| 2025-10-08 | DecompGAIL: Learning Realistic Traffic Behaviors with Decomposed Multi-Agent Generative Adversarial Imitation Learning | Chen Lv Team | 2510.06913 | null |
| 2025-10-07 | Webscale-RL: Automated Data Pipeline for Scaling RL Data to Pretraining Levels | Weiran Yao Team | 2510.06499 | null |
| 2025-10-07 | EmbodiedCoder: Parameterized Embodied Mobile Manipulation via Modern Coding Model | Zhaoxiang Zhang Team | 2510.06207 | link |
| 2025-10-07 | Differentiable Model Predictive Control on the GPU | Thomas Lew Team | 2510.06179 | null |
| 2025-10-07 | Towards Autonomous Tape Handling for Robotic Wound Redressing | Michael Yip Team | 2510.06127 | null |
| 2025-10-07 | Learning to Crawl: Latent Model-Based Reinforcement Learning for Soft Robotic Adaptive Locomotion | Robin Chhabra Team | 2510.05957 | null |
| 2025-10-07 | VCoT-Grasp: Grasp Foundation Models with Visual Chain-of-Thought Reasoning for Language-driven Grasp Generation | Badong Chen Team | 2510.05827 | null |
| 2025-10-07 | DeLTa: Demonstration and Language-Guided Novel Transparent Object Manipulation | Kuk-Jin Yoon Team | 2510.05662 | link |
| 2025-10-07 | Teaching Machines to Speak Using Articulatory Control | Gopala Anumanchipalli Team | 2510.05619 | null |
| 2025-10-07 | Correlation-Aware Dual-View Pose and Velocity Estimation for Dynamic Robotic Manipulation | Farrokh Janabi-Sharifi Team | 2510.05536 | null |
| 2025-10-06 | VER: Vision Expert Transformer for Robot Learning via Foundation Distillation and Dynamic Routing | Masayoshi Tomizuka Team | 2510.05213 | null |
| 2025-10-06 | Curiosity-Driven Co-Development of Action and Language in Robots Through Self-Exploration | Jun Tani Team | 2510.05013 | null |
| 2025-10-06 | Hands-Free Heritage: Automated 3D Scanning for Cultural Heritage Digitization | Arianna Traviglia Team | 2510.04781 | null |
| 2025-10-06 | MobRT: A Digital Twin-Based Framework for Scalable Learning in Mobile Manipulation | Wenjie Song Team | 2510.04592 | null |
| 2025-10-05 | Reliable and Scalable Robot Policy Evaluation with Imperfect Simulators | Anirudha Majumdar Team | 2510.04354 | null |
| 2025-10-05 | RAP: 3D Rasterization Augmented End-to-End Planning | Alexandre Alahi Team | 2510.04333 | null |
| 2025-10-04 | NoTVLA: Narrowing of Dense Action Trajectories for Generalizable Robot Manipulation | Chunhua Shen Team | 2510.03895 | null |
| 2025-10-04 | EmbodiSwap for Zero-Shot Robot Imitation Learning | Yiannis Aloimonos Team | 2510.03706 | link |
| 2025-10-04 | Dissecting Larval Zebrafish Hunting using Deep Reinforcement Learning Trained RNN Agents | Kanaka Rajan Team | 2510.03699 | null |
| 2025-10-04 | Learning to Act Through Contact: A Unified View of Multi-Task Robot Learning | Majid Khadiv Team | 2510.03599 | null |
| 2025-10-03 | Warm-Starting Optimization-Based Motion Planning for Robotic Manipulators via Point Cloud-Conditioned Flow Matching | Xiao Liang Team | 2510.03460 | null |
| 2025-10-03 | Mask2IV: Interaction-Centric Video Generation via Mask Trajectories | Laura Sevilla-Lara Team | 2510.03135 | link |
| 2025-10-03 | Learning Stability Certificate for Robotics in Real-World Environments | Zhe Shen Team | 2510.03123 | null |
| 2025-10-06 | Distributional Inverse Reinforcement Learning | Anqi Wu Team | 2510.03013 | null |
| 2025-10-03 | Action Deviation-Aware Inference for Low-Latency Wireless Robots | Seong-Lyun Kim Team | 2510.02851 | null |
| 2025-10-03 | Flow with the Force Field: Learning 3D Compliant Flow Matching Policies from Force and Demonstration-Guided Simulation Data | Nadia Figueroa Team | 2510.02738 | null |
| 2025-10-02 | A Recipe for Efficient Sim-to-Real Transfer in Manipulation with Online Imitation-Pretrained World Models | Hao Su Team | 2510.02538 | null |
| 2025-10-02 | U-LAG: Uncertainty-Aware, Lag-Adaptive Goal Retargeting for Robotic Manipulation | Anujith Muraleedharan Team | 2510.02526 | null |
| 2025-10-02 | Beyond Imitation: Recovering Dense Rewards from Demonstrations | Gholamreza Haffari Team | 2510.02493 | null |
| 2025-10-02 | ARMADA: Autonomous Online Failure Detection and Human Shared Control Empower Scalable Real-world Deployment and Adaptation | Cewu Lu Team | 2510.02298 | null |
| 2025-10-02 | Do You Know Where Your Camera Is? View-Invariant Policy Learning with Camera Conditioning | Matthew R. Walter Team | 2510.02268 | null |
| 2025-10-02 | GRACE: A Language Model Framework for Explainable Inverse Reinforcement Learning | Bogdan Mazoure Team | 2510.02180 | null |
| 2025-10-02 | Fine-Tuning Flow Matching via Maximum Likelihood Estimation of Reconstructions | Shihua Li Team | 2510.02081 | null |
| 2025-10-02 | Contrastive Representation Regularization for Vision-Language-Action Models | Jinwoo Shin Team | 2510.01711 | null |
| 2025-10-02 | Symskill: Symbol and Skill Co-Invention for Data-Efficient and Real-Time Long-Horizon Manipulation | Nadia Figueroa Team | 2510.01661 | link |
| 2025-10-02 | FailSafe: Reasoning and Recovery from Failures in Vision-Language-Action Models | Bihan Wen Team | 2510.01642 | link |
| 2025-10-02 | MIMIC: Integrating Diverse Personality Traits for Better Game Testing Using Large Language Model | Lili Wei Team | 2510.01635 | null |
| 2025-10-02 | ActiveUMI: Robotic Manipulation with Active Perception from Robot-Free Human Demonstrations | Yi Xu Team | 2510.01607 | link |
| 2025-10-02 | MiniBEE: A New Form Factor for Compact Bimanual Dexterity | Matei Ciocarlie Team | 2510.01603 | null |
| 2025-10-02 | Predictive Preference Learning from Human Interventions | Bolei Zhou Team | 2510.01545 | link |
| 2025-10-02 | Information Seeking for Robust Decision Making under Partial Observability | Tsung-Wei Ke Team | 2510.01531 | link |
| 2025-10-01 | Online Hierarchical Policy Learning using Physics Priors for Robot Navigation in Unknown Environments | Ahmed H. Qureshi Team | 2510.01519 | null |
| 2025-10-01 | Density-Ratio Weighted Behavioral Cloning: Learning Control Policies from Corrupted Datasets | Ali Baheri Team | 2510.01479 | null |
| 2025-10-01 | AFFORD2ACT: Affordance-Guided Automatic Keypoint Selection for Generalizable and Lightweight Robotic Manipulation | Pratap Tokekar Team | 2510.01433 | null |
| 2025-10-01 | How Well do Diffusion Policies Learn Kinematic Constraint Manifolds? | Russ Tedrake Team | 2510.01404 | link |
| 2025-10-01 | Temporal Score Rescaling for Temperature Sampling in Diffusion and Flow Models | Shubham Tulsiani Team | 2510.01184 | null |
| 2025-10-01 | Prometheus: Universal, Open-Source Mocap-Based Teleoperation System with Force Feedback for Dataset Collection in Robot Learning | D. Tsetserukou Team | 2510.01023 | null |
| 2025-10-01 | On Discovering Algorithms for Adversarial Imitation Learning | Pradeep Varakantham Team | 2510.00922 | null |
| 2025-10-01 | TubeDAgger: Reducing the Number of Expert Interventions with Stochastic Reach-Tubes | Sophie A. Neubauer Team | 2510.00906 | null |
| 2025-09-30 | MLA: A Multisensory Language-Action Model for Multimodal Understanding and Forecasting in Robotic Manipulation | Shanghang Zhang Team | 2509.26642 | null |
| 2025-09-30 | Learning from Hallucinating Critical Points for Navigation in Dynamic Environments | Xuesu Xiao Team | 2509.26513 | null |
| 2025-09-30 | Anomaly detection for generic failure monitoring in robotic assembly, screwing and manipulation | Kevin Haninger Team | 2509.26308 | null |
| 2025-09-30 | Noise-Guided Transport for Imitation Learning | Alexandros Kalousis Team | 2509.26294 | null |
| 2025-09-30 | Reinforced Embodied Planning with Verifiable Reward for Real-World Robotic Manipulation | Hao Chen Team | 2509.25852 | null |
| 2025-10-01 | Act to See, See to Act: Diffusion-Driven Perception-Action Interplay for Adaptive Policies | Li Cheng Team | 2509.25822 | null |
| 2025-09-30 | Point-It-Out: Benchmarking Embodied Reasoning for Vision Language Models in Multi-Stage Visual Grounding | Jiaojiao Fan Team | 2509.25794 | null |
| 2025-09-30 | SAC Flow: Sample-Efficient Reinforcement Learning of Flow-Based Policies via Velocity-Reparameterized Sequential Modeling | Wenbo Ding Team | 2509.25756 | null |
| 2025-09-30 | Best of Sim and Real: Decoupled Visuomotor Manipulation via Learning Control in Simulation and Perception in Real | Yang Gao Team | 2509.25747 | null |
| 2025-09-29 | Boolean Satisfiability via Imitation Learning | Xiangyu Xu Team | 2509.25411 | null |
| 2025-09-29 | Parallel Heuristic Search as Inference for Actor-Critic Reinforcement Learning Models | Maxim Likhachev Team | 2509.25402 | null |
| 2025-09-29 | SARM: Stage-Aware Reward Modeling for Long Horizon Robot Manipulation | Philipp Wu Team | 2509.25358 | null |
| 2025-09-29 | SRMP: Search-Based Robot Motion Planning Library | Maxim Likhachev Team | 2509.25352 | null |
| 2025-10-01 | Curriculum Imitation Learning of Distributed Multi-Robot Policies | Eduardo Montijano Team | 2509.25097 | null |
| 2025-09-29 | Annotation-Free One-Shot Imitation Learning for Multi-Step Manipulation Tasks | Ruchi Choudhary Team | 2509.24972 | null |
| 2025-09-29 | MSG: Multi-Stream Generative Policies for Sample-Efficient Robotic Manipulation | Abhinav Valada Team | 2509.24956 | null |
| 2025-09-29 | World-Env: Leveraging World Model as a Virtual Environment for VLA Post-Training | Qing Zhang Team | 2509.24948 | null |
| 2025-09-29 | From Code to Action: Hierarchical Learning of Diffusion-VLM Policies | Daniel Dijkman Team | 2509.24917 | null |
| 2025-09-29 | Quantifying Generalisation in Imitation Learning | Odinaldo Rodrigues Team | 2509.24784 | null |
| 2025-09-29 | IA-VLA: Input Augmentation for Vision-Language-Action models in settings with semantically complex tasks | Ville Kyrki Team | 2509.24768 | null |
| 2025-09-29 | Stabilizing Humanoid Robot Trajectory Generation via Physics-Informed Learning and Control-Informed Steering | Daniele Pucci Team | 2509.24697 | null |
| 2025-09-29 | CEDex: Cross-Embodiment Dexterous Grasp Generation at Scale from Human-like Contact Representations | Shan Luo Team | 2509.24661 | null |
| 2025-09-29 | U-DiT Policy: U-shaped Diffusion Transformers for Robotic Manipulation | Zhongxue Gan Team | 2509.24579 | null |
| 2025-09-29 | Unlocking the Potential of Soft Actor-Critic for Imitation Learning | Frank Kirchner Team | 2509.24539 | null |
| 2025-09-29 | Learning to Sample: Reinforcement Learning-Guided Sampling for Autonomous Vehicle Motion Planning | Johannes Betz Team | 2509.24313 | null |
| 2025-09-29 | FreeAction: Training-Free Techniques for Enhanced Fidelity of Trajectory-to-Video Generation | Minsu Cho Team | 2509.24241 | null |
| 2025-09-29 | ViReSkill: Vision-Grounded Replanning with Skill Memory for LLM-Based Planning in Lifelong Robot Learning | Yang You Team | 2509.24219 | null |
| 2025-09-29 | Preference-Based Long-Horizon Robotic Stacking with Multimodal Large Language Models | Sethu Vijayakumar Team | 2509.24163 | null |
| 2025-09-29 | Memory Transfer Planning: LLM-driven Context-Aware Code Adaptation for Robot Manipulation | Yang You Team | 2509.24160 | null |
| 2025-09-28 | Mash, Spread, Slice! Learning to Manipulate Object States via Visual Spatial Progress | Kristen Grauman Team | 2509.24129 | null |
| 2025-09-28 | DexFlyWheel: A Scalable and Self-improving Data Generation Framework for Dexterous Manipulation | Yuanpei Chen Team | 2509.23829 | null |
| 2025-09-28 | Control Your Robot: A Unified System for Robot Control and Policy Deployment | Bingshan Hu Team | 2509.23823 | link |
| 2025-09-30 | Sequence Pathfinder for Multi-Agent Pickup and Delivery in the Warehouse | Ying Wen Team | 2509.23778 | null |
| 2025-09-26 | Pixel Motion Diffusion is What We Need for Robot Control | Michael S. Ryoo Team | 2509.22652 | null |
| 2025-09-26 | VLA-Reasoner: Empowering Vision-Language-Action Models with Reasoning via Online Monte Carlo Tree Search | Ziwei Wang Team | 2509.22643 | null |
| 2025-09-26 | Learn the Ropes, Then Trust the Wins: Self-imitation with Progressive Exploration for Agentic Reinforcement Learning | Xing Sun Team | 2509.22601 | null |
| 2025-09-26 | EgoDemoGen: Novel Egocentric Demonstration Generation Enables Viewpoint-Robust Manipulation | Liang Wang Team | 2509.22578 | null |
| 2025-09-26 | Learning to Ball: Composing Policies for Long-Horizon Basketball Moves | C. Karen Liu Team | 2509.22442 | link |
| 2025-09-26 | EMMA: Generalizing Real-World Robot Manipulation via Generative Visual Transfer | Guan Huang Team | 2509.22407 | null |
| 2025-09-26 | ReLAM: Learning Anticipation Model for Rewarding Visual Robotic Manipulation | Yang Yu Team | 2509.22402 | null |
| 2025-09-26 | RoboView-Bias: Benchmarking Visual Bias in Embodied Agents for Robotic Manipulation | Shuchao Pang Team | 2509.22356 | null |
| 2025-09-26 | DemoGrasp: Universal Dexterous Grasping from a Single Demonstration | Zongqing Lu Team | 2509.22149 | null |
| 2025-09-26 | Action-aware Dynamic Pruning for Efficient Vision-Language-Action Manipulation | Chang Xu Team | 2509.22093 | null |
| 2025-09-26 | Teaching Transformers to Solve Combinatorial Problems through Efficient Trial & Error | Christos Tzamos Team | 2509.22023 | null |
| 2025-09-26 | WAVE: Worm Gear-based Adaptive Variable Elasticity for Decoupling Actuators from External Forces | Kazutoshi Tanaka Team | 2509.21878 | null |
| 2025-09-26 | Learning Multi-Skill Legged Locomotion Using Conditional Adversarial Motion Priors | Qinchuan Li Team | 2509.21810 | null |
| 2025-09-26 | The Turkish Ice Cream Robot: Examining Playful Deception in Social Human-Robot Interactions | Matthew Pan Team | 2509.21776 | link |
| 2025-09-25 | Generating Stable Placements via Physics-guided Diffusion Models | Jonathan Kelly Team | 2509.21664 | null |
| 2025-09-25 | Inverse Reinforcement Learning Using Just Classification and a Few Regressions | Aurélien Bibaut Team | 2509.21172 | null |
| 2025-09-25 | ImaginationPolicy: Towards Generalizable, Precise and Reliable End-to-End Policy for Robotic Manipulation | Kui Jia Team | 2509.20841 | link |
| 2025-09-25 | Joint Flow Trajectory Optimization For Feasible Robot Motion Generation from Video Demonstrations | Weiming Zhi Team | 2509.20703 | null |
| 2025-09-24 | Large Pre-Trained Models for Bimanual Manipulation in 3D | David Meger Team | 2509.20579 | null |
| 2025-09-24 | Selective Progress-Aware Querying for Human-in-the-Loop Reinforcement Learning | Anamika J H Team | 2509.20541 | null |
| 2025-09-26 | mindmap: Spatial Memory in Deep Feature Maps for 3D Action Policies | Shiwei Sheng Team | 2509.20297 | null |
| 2025-09-24 | Discrete Diffusion for Reflective Vision-Language-Action Models in Autonomous Driving | Xianpeng Lang Team | 2509.20109 | null |
| 2025-09-24 | LLM Trainer: Automated Robotic Data Generating via Demonstration Augmentation using LLMs | Amir Barati Farimani Team | 2509.20070 | null |
| 2025-09-25 | Generalist Robot Manipulation beyond Action Labeled Data | Danda Pani Paudel Team | 2509.19958 | null |
| 2025-09-24 | SAGE:State-Aware Guided End-to-End Policy for Multi-Stage Sequential Tasks via Hidden Markov Decision Process | JingYuan Wang Team | 2509.19853 | null |
| 2025-09-24 | TopoCut: Learning Multi-Step Cutting with Spectral Rewards and Discrete Diffusion Policies | Animesh Garg Team | 2509.19712 | null |
| 2025-09-24 | RoboSSM: Scalable In-context Imitation Learning via State-Space Models | Peter Stone Team | 2509.19658 | null |
| 2025-09-23 | EgoBridge: Domain Adaptation for Generalizable Imitation from Egocentric Human Data | Danfei Xu Team | 2509.19626 | null |
| 2025-09-23 | From Space to Time: Enabling Adaptive Safety with Learned Value Functions via Disturbance Recasting | Sylvia L. Herbert Team | 2509.19597 | null |
| 2025-09-23 | Agentic Scene Policies: Unifying Space, Semantics, and Affordances for Robot Action | Liam Paull Team | 2509.19571 | link |
| 2025-09-23 | Score the Steps, Not Just the Goal: VLM-Based Subgoal Evaluation for Robotic Manipulation | Chi-Guhn Lee Team | 2509.19524 | null |
| 2025-09-23 | Self-evolved Imitation Learning in Simulated World | Zhihe Lu Team | 2509.19460 | null |
| 2025-09-23 | ROPA: Synthetic Robot Pose Generation for RGB-D Bimanual Data Augmentation | Daniel Seita Team | 2509.19454 | null |
| 2025-09-23 | SOE: Sample-Efficient Robot Policy Self-Improvement via On-Manifold Exploration | Cewu Lu Team | 2509.19292 | null |
| 2025-09-23 | Imitation-Guided Bimanual Planning for Stable Manipulation under Changing External Forces | Arash Ajoudani Team | 2509.19261 | null |
| 2025-09-23 | FUNCanon: Learning Pose-Aware Action Primitives via Functional Object Canonicalization for Generalizable Robotic Manipulation | Jianwei Zhang Team | 2509.19102 | link |
| 2025-09-23 | World4RL: Diffusion World Models for Policy Refinement with Reinforcement Learning for Robotic Manipulation | Dongbin Zhao Team | 2509.19080 | null |
| 2025-09-23 | ManipForce: Force-Guided Policy Learning with Frequency-Aware Representation for Contact-Rich Manipulation | Kyoobin Lee Team | 2509.19047 | null |
| 2025-09-23 | Eva-VLA: Evaluating Vision-Language-Action Models’ Robustness Under Real-World Physical Variations | Wen Yao Team | 2509.18953 | null |
| 2025-09-23 | Bi-VLA: Bilateral Control-Based Imitation Learning via Vision-Language Fusion for Action Generation | Thanpimon Buamanee Team | 2509.18865 | null |
| 2025-09-23 | DexSkin: High-Coverage Conformable Robotic Skin for Learning Contact-Rich Manipulation | Jiajun Wu Team | 2509.18830 | null |
| 2025-09-23 | VGGT-DP: Generalizable Robot Control via Vision Foundation Models | Zhi Wang Team | 2509.18778 | null |
| 2025-09-23 | MV-UMI: A Scalable Multi-View Interface for Cross-Embodiment Learning | Fares Abu-Dakka Team | 2509.18757 | link |
| 2025-09-23 | Learning Obstacle Avoidance using Double DQN for Quadcopter Navigation | Sanket Gujar Team | 2509.18734 | null |
| 2025-09-23 | 3D Flow Diffusion Policy: Visuomotor Policy Learning via Generating Flow in 3D Space | Kyoobin Lee Team | 2509.18676 | null |
| 2025-09-24 | Do You Need Proprioceptive States in Visuomotor Policies? | Yang Gao Team | 2509.18644 | link |
| 2025-09-23 | Generalizable Domain Adaptation for Sim-and-Real Policy Co-Training | Danfei Xu Team | 2509.18631 | null |
| 2025-09-23 | SINGER: An Onboard Generalist Vision-Language Navigation Policy for Drones | Mac Schwager Team | 2509.18610 | null |
| 2025-09-23 | Growing with Your Embodied Agent: A Human-in-the-Loop Lifelong Code Generation Framework for Long-Horizon Manipulation Skills | Alois Knoll Team | 2509.18597 | null |
| 2025-09-23 | A scaling law for large-deformation contact in soft materials | Huajian Gao Team | 2509.18581 | null |
| 2025-09-22 | Robotic Skill Diversification via Active Mutation of Reward Functions in Reinforcement Learning During a Liquid Pouring Task | Luka Peternel Team | 2509.18463 | null |
| 2025-09-22 | Learning Geometry-Aware Nonprehensile Pushing and Pulling with Dexterous Hands | Daniel Seita Team | 2509.18455 | null |
| 2025-09-22 | PrioriTouch: Adapting to User Contact Preferences for Whole-Arm Physical Human-Robot Interaction | Tapomayukh Bhattacharjee Team | 2509.18447 | null |
| 2025-09-22 | ByteWrist: A Parallel Robotic Wrist Enabling Flexible and Anthropomorphic Motion for Confined Spaces | Zeyu Ren Team | 2509.18084 | link |
| 2025-09-22 | Prepare Before You Act: Learning From Humans to Rearrange Initial States | Dylan P. Losey Team | 2509.18043 | null |
| 2025-09-22 | FinFlowRL: An Imitation-Reinforcement Learning Framework for Adaptive Stochastic Control in Finance | Ruixun Zhang Team | 2509.17964 | null |
| 2025-09-22 | ComposableNav: Instruction-Following Navigation in Dynamic Environments via Composable Diffusion | Joydeep Biswas Team | 2509.17941 | link |
| 2025-09-22 | DriveDPO: Policy Learning via Safety DPO For End-to-End Autonomous Driving | Zhaoxiang Zhang Team | 2509.17940 | null |
| 2025-09-23 | RoboSeek: You Need to Interact with Your Objects | Yatong Han Team | 2509.17783 | null |
| 2025-09-22 | MotionTrans: Human VR Data Enable Motion-Level Learning for Robotic Manipulation Policies | Yang Gao Team | 2509.17759 | null |
| 2025-09-22 | EigenSafe: A Spectral Framework for Learning-Based Stochastic Safety Filtering | H. Jin Kim Team | 2509.17750 | null |
| 2025-09-22 | DINOv3-Diffusion Policy: Self-Supervised Large Visual Model for Visuomotor Diffusion Policy Learning | Zidong Chen Team | 2509.17684 | null |
| 2025-09-22 | Learning Dexterous Manipulation with Quantized Hand State | Cewu Lu Team | 2509.17450 | null |
| 2025-09-22 | Fast Trajectory Planner with a Reinforcement Learning-based Controller for Robotic Manipulators | Hamidreza Kasaei Team | 2509.17381 | link |
| 2025-09-21 | Scalable Multi Agent Diffusion Policies for Coverage Control | Alejandro Ribeiro Team | 2509.17244 | null |
| 2025-09-21 | Ratatouille: Imitation Learning Ingredients for Real-world Social Robot Navigation | Timothy D. Barfoot Team | 2509.17204 | null |
| 2025-09-21 | MAST: Multi-Agent Spatial Transformer for Learning to Collaborate | Alejandro Ribeiro Team | 2509.17195 | null |
| 2025-09-21 | Imagine2Act: Leveraging Object-Action Motion Consistency from Imagined Goals for Robotic Manipulation | Hao Dong Team | 2509.17125 | null |
| 2025-09-21 | RoboManipBaselines: A Unified Framework for Imitation Learning in Robotic Manipulation across Real and Simulated Environments | Yukiyasu Domae Team | 2509.17057 | null |
| 2025-09-21 | FILIC: Dual-Loop Force-Guided Imitation Learning with Impedance Torque Control for Contact-Rich Manipulation Tasks | Guyue Zhou Team | 2509.17053 | null |
| 2025-09-21 | Generalized Momenta-Based Koopman Formalism for Robust Control of Euler-Lagrangian Systems | Jishnu Keshavan Team | 2509.17010 | null |
| 2025-09-21 | End2Race: Efficient End-to-End Imitation Learning for Real-Time F1Tenth Racing | Henry X. Liu Team | 2509.16894 | null |
| 2025-09-20 | Robot Learning with Sparsity and Scarcity | Jingxi Xu Team | 2509.16834 | null |
| 2025-09-19 | Efficient Detection of Objects Near a Robot Manipulator via Miniature Time-of-Flight Sensors | Michael Gleicher Team | 2509.16122 | null |
| 2025-09-19 | I-FailSense: Towards General Robotic Failure Detection with Vision-Language Models | Mohamed Chetouani Team | 2509.16072 | null |
| 2025-09-19 | Compose by Focus: Scene Graph-based Atomic Skills | Heng Yang Team | 2509.16053 | null |
| 2025-09-19 | Learning Safety for Obstacle Avoidance via Control Barrier Functions | Calin A. Belta Team | 2509.16037 | null |
| 2025-09-19 | Improving Robotic Manipulation with Efficient Geometry-Aware Vision Encoder | Ian Reid Team | 2509.15880 | link |
| 2025-09-19 | All-Electric Heavy-Duty Robotic Manipulator: Actuator Configuration Optimization and Sensorless Control | Jouni Mattila Team | 2509.15778 | null |
| 2025-09-19 | GP3: A 3D Geometry-Aware Policy with Multi-View Images for Robotic Manipulation | Deli Zhao Team | 2509.15733 | null |
| 2025-09-19 | Imagination at Inference: Synthesizing In-Hand Views for Robust Visuomotor Policy Inference | Yoshihiko Nakamura Team | 2509.15717 | null |
| 2025-09-18 | Implicit Kinodynamic Motion Retargeting for Human-to-humanoid Imitation Learning | Haodong Zhang Team | 2509.15443 | null |
| 2025-09-18 | RynnVLA-001: Using Human Demonstrations to Improve Robot Manipulation | Xin Li Team | 2509.15212 | link |
| 2025-09-18 | Self-Improving Embodied Foundation Models | Igor Mordatch Team | 2509.15155 | null |
| 2025-09-18 | A Nonlinear Scaling-based Design of Control Lyapunov-barrier Function for Relative Degree 2 Case and its Application to Safe Feedback Linearization | Gyunghoon Park Team | 2509.15071 | null |
| 2025-09-18 | Reinforcement Learning Agent for a 2D Shooter Game | Hamza A. A. Gardi Team | 2509.15042 | null |
| 2025-09-19 | Affordance-Based Disambiguation of Surgical Instructions for Collaborative Robot-Assisted Surgery | Yasuhisa Hasegawa Team | 2509.14967 | null |
| 2025-09-18 | Robot Control Stack: A Lean Ecosystem for Robot Learning at Scale | Florian Walter Team | 2509.14932 | null |
| 2025-09-18 | exUMI: Extensible Robot Teaching System with Action-aware Task-agnostic Tactile Representation | Yong-Lu Li Team | 2509.14688 | null |
| 2025-09-18 | SimCoachCorpus: A naturalistic dataset with language and trajectories for embodied teaching | Guy Rosman Team | 2509.14548 | null |
| 2025-09-18 | Learning to Pick: A Visuomotor Policy for Clustered Strawberry Picking | Chen Peng Team | 2509.14530 | null |
| 2025-09-17 | Learning Discrete Abstractions for Visual Rearrangement Tasks Using Vision-Guided Graph Coloring | Constantinos Chamzas Team | 2509.14460 | null |
| 2025-09-17 | LeVR: A Modular VR Teleoperation Framework for Imitation Learning in Dexterous Manipulation | Han Liu Team | 2509.14349 | null |
| 2025-09-17 | MIMIC-D: Multi-modal Imitation for MultI-agent Coordination with Decentralized Diffusion Policies | Negar Mehr Team | 2509.14159 | null |
| 2025-09-17 | SeqVLA: Sequential Task Execution for Long-Horizon Manipulation with Completion-Aware Vision-Language-Action Model | Yiming Feng Team | 2509.14138 | null |
| 2025-09-17 | PhysicalAgent: Towards General Cognitive Robotics with Foundation World Models | Dzmitry Tsetserukou Team | 2509.13903 | null |
| 2025-09-17 | Dual-Actor Fine-Tuning of VLA Models: A Talk-and-Tweak Human-in-the-Loop Approach | Yangwei You Team | 2509.13774 | null |
| 2025-09-17 | Motion Adaptation Across Users and Tasks for Exoskeletons via Meta-Learning | Houcheng Li Team | 2509.13736 | null |
| 2025-09-17 | Reinforcement Learning for Robotic Insertion of Flexible Cables in Industrial Settings | Changjoo Nam Team | 2509.13731 | null |
| 2025-09-17 | HGACNet: Hierarchical Graph Attention Network for Cross-Modal Point Cloud Completion | I-Ming Chen Team | 2509.13692 | null |
| 2025-09-16 | TreeIRL: Safe Urban Driving with Tree Search and Inverse Reinforcement Learning | Yunqing Hu Team | 2509.13579 | null |
| 2025-09-18 | StageACT: Stage-Conditioned Imitation for Robust Humanoid Door Opening | Shayegan Omidshafiei Team | 2509.13200 | null |
| 2025-09-16 | A Design Co-Pilot for Task-Tailored Manipulators | Matthias Althoff Team | 2509.13077 | null |
| 2025-09-16 | Deep Learning for Model-Free Prediction of Thermal States of Robot Joint Motors | Eric Guiffo Kaigom Team | 2509.12739 | null |
| 2025-09-16 | Safety filtering of robotic manipulation under environment uncertainty: a computational approach | Martin Servin Team | 2509.12674 | null |
| 2025-09-16 | ActiveVLN: Towards Active Exploration via Multi-Turn RL in Vision-and-Language Navigation | Feng Zheng Team | 2509.12618 | null |
| 2025-09-16 | Robust Online Residual Refinement via Koopman-Guided Dynamics Modeling | Donglin Wang Team | 2509.12562 | null |
| 2025-09-16 | Pre-trained Visual Representations Generalize Where it Matters in Model-Based Reinforcement Learning | Sebastian W. Pattinson Team | 2509.12531 | null |
| 2025-09-15 | Geometric Red-Teaming for Robotic Manipulation | Zackory Erickson Team | 2509.12379 | null |
| 2025-09-15 | Deceptive Risk Minimization: Out-of-Distribution Generalization by Deceiving Distribution Shift Detectors | Anirudha Majumdar Team | 2509.12081 | null |
| 2025-09-15 | Imitation Learning as Return Distribution Matching | Alberto Maria Metelli Team | 2509.12026 | null |
| 2025-09-15 | Gesture-Based Robot Control Integrating Mm-wave Radar and Behavior Trees | Stephan Sigg Team | 2509.12008 | null |
| 2025-09-15 | Learning to Generate 4D LiDAR Sequences | Wei Tsang Ooi Team | 2509.11959 | link |
| 2025-09-15 | Learning Representations in Video Game Agents with Supervised Contrastive Imitation Learning | Tim Bradley Team | 2509.11880 | null |
| 2025-09-15 | Tenma: Robust Cross-Embodiment Robot Manipulation with Diffusion Transformer | Luhui Hu Team | 2509.11865 | null |
| 2025-09-17 | TrajBooster: Boosting Humanoid Whole-Body Manipulation via Trajectory-Centric Learning | Donglin Wang Team | 2509.11839 | null |
| 2025-09-15 | Inference-stage Adaptation-projection Strategy Adapts Diffusion Policy to Cross-manipulators Scenarios | Alois Knoll Team | 2509.11621 | null |
| 2025-09-15 | RAPTOR: A Foundation Policy for Quadrotor Control | Giuseppe Loianno Team | 2509.11481 | null |
| 2025-09-17 | Enhancing Generalization in Vision-Language-Action Models by Preserving Pretrained Representations | Xuanlin Li Team | 2509.11417 | link |
| 2025-09-14 | ActivePose: Active 6D Object Pose Estimation and Tracking for Robotic Manipulation | Yizhao Wang Team | 2509.11364 | null |
| 2025-09-14 | MEMBOT: Memory-Based Robot in Intermittent POMDP | Eyan Noronha Team | 2509.11225 | null |
| 2025-09-14 | SAMP: Spatial Anchor-based Motion Policy for Collision-Aware Robotic Manipulators | Jun Ma Team | 2509.11185 | null |
| 2025-09-14 | ManiVID-3D: Generalizable View-Invariant Reinforcement Learning for Robotic Manipulation via Disentangled 3D Representations | Jun Ma Team | 2509.11125 | null |
| 2025-09-16 | FEWT: Improving Humanoid Robot Perception with Frequency-Enhanced Wavelet-based Transformers | Zhigong Song Team | 2509.11109 | null |
| 2025-09-14 | End-to-End Visual Autonomous Parking via Control-Aided Attention | Chen Feng Team | 2509.11090 | null |
| 2025-09-14 | FragmentGPT: A Unified GPT Model for Fragment Growing, Linking, and Merging in Molecular Design | Rick Stevens Team | 2509.11044 | null |
| 2025-09-13 | ImMimic: Cross-Domain Imitation from Human Videos via Mapping and Interpolation | Danfei Xu Team | 2509.10952 | null |
| 2025-09-11 | Self-Augmented Robot Trajectory: Efficient Imitation Learning via Safe Self-augmentation with Demonstrator-annotated Precision | Yukiyasu Domae Team | 2509.09893 | null |
| 2025-09-11 | Off Policy Lyapunov Stability in Reinforcement Learning | Daniela Constantinescu Team | 2509.09863 | null |
| 2025-09-11 | MimicDroid: In-Context Learning for Humanoid Robot Manipulation from Human Play Videos | Yuke Zhu Team | 2509.09769 | null |
| 2025-09-11 | SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning | Ning Ding Team | 2509.09674 | null |
| 2025-09-11 | Dexplore: Scalable Neural Control for Dexterous Manipulation from Reference-Scoped Exploration | Wei Yang Team | 2509.09671 | null |
| 2025-09-11 | A Neuromorphic Incipient Slip Detection System using Papillae Morphology | Benjamin Ward-Cherrier Team | 2509.09546 | null |
| 2025-09-11 | KoopMotion: Learning Almost Divergence Free Koopman Flow Fields for Motion Planning | M. Ani Hsieh Team | 2509.09074 | null |
| 2025-09-11 | Joint Model-based Model-free Diffusion for Planning with Constraints | Shreyas Kousik Team | 2509.08775 | null |
| 2025-09-10 | SocialNav-SUB: Benchmarking VLMs for Scene Understanding in Social Robot Navigation | Peter Stone Team | 2509.08757 | link |
| 2025-09-10 | PegasusFlow: Parallel Rolling-Denoising Score Sampling for Robot Diffusion Planner Flow Matching | Liang Ding Team | 2509.08435 | null |
| 2025-09-10 | Grasp Like Humans: Learning Generalizable Multi-Fingered Grasping from Human Proprioceptive Sensorimotor Integration | Huimin Lu Team | 2509.08354 | null |
| 2025-09-10 | Input-gated Bilateral Teleoperation: An Easy-to-implement Force Feedback Teleoperation Method for Low-cost Hardware | Tetsuya Ogata Team | 2509.08226 | null |
| 2025-09-09 | TA-VLA: Elucidating the Design Space of Torque-aware Vision-Language-Action Models | Hao Zhao Team | 2509.07962 | link |
| 2025-09-09 | Graph-Fused Vision-Language-Action for Policy Reasoning in Multi-Arm Robotic Manipulation | Yingbai Hu Team | 2509.07957 | null |
| 2025-09-09 | RaC: Robot Learning for Long-Horizon Tasks by Scaling Recovery and Correction | Aviral Kumar Team | 2509.07953 | null |
| 2025-09-09 | Text2Touch: Tactile In-Hand Manipulation with LLM-Designed Reward Functions | Nathan F. Lepora Team | 2509.07445 | null |
| 2025-09-08 | Quantum Machine Learning and Grover’s Algorithm for Quantum Optimization of Robotic Manipulators | Howard Li Team | 2509.07216 | null |
| 2025-09-08 | Design of Input-Output Observers for a Population of Systems with Bounded Frequency-Domain Variation using $DK$-iteration | James Richard Forbes Team | 2509.07201 | null |
| 2025-09-08 | First Plan Then Evaluate: Use a Vectorized Motion Planner for Grasping | Tucker Hermans Team | 2509.07162 | null |
| 2025-09-08 | Deep Reactive Policy: Learning Reactive Manipulator Motion Planning for Dynamic Environments | Deepak Pathak Team | 2509.06953 | null |
| 2025-09-10 | LLaDA-VLA: Vision Language Diffusion Action Models | Xiaoyan Sun Team | 2509.06932 | null |
| 2025-09-08 | Cortex-Synth: Differentiable Topology-Aware 3D Skeleton Synthesis with Hierarchical Graph Attention | Mohamed Zayaan S Team | 2509.06705 | null |
| 2025-09-08 | Group Effect Enhanced Generative Adversarial Imitation Learning for Individual Travel Behavior Modeling under Incentives | Zhenliang Ma Team | 2509.06656 | null |
| 2025-09-08 | Musculoskeletal simulation of limb movement biomechanics in Drosophila melanogaster | Pavan Ramdya Team | 2509.06426 | null |
| 2025-09-07 | O$^3$Afford: One-Shot 3D Object-to-Object Affordance Grounding for Generalizable Robotic Manipulation | Yen-Ling Kuo Team | 2509.06233 | link |
| 2025-09-07 | Robotic Manipulation Framework Based on Semantic Keypoints for Packing Shoes of Different Sizes, Shapes, and Softness | Zhendong Dai Team | 2509.06048 | link |
| 2025-09-06 | TeleopLab: Accessible and Intuitive Teleoperation of a Robotic Manipulator for Remote Labs | John Liu Team | 2509.05547 | null |
| 2025-09-05 | OpenEgo: A Large-Scale Multimodal Egocentric Dataset for Dexterous Manipulation | Yu Xiang Team | 2509.05513 | null |
| 2025-09-04 | Long-Horizon Visual Imitation Learning via Plan and Code Reflection | Yunde Jia Team | 2509.05368 | null |
| 2025-09-08 | Sticker-TTS: Learn to Utilize Historical Experience with a Sticker-driven Test-Time Scaling Framework | Ji-Rong Wen Team | 2509.05007 | null |
| 2025-09-05 | Imitation Learning Based on Disentangled Representation Learning of Behavioral Characteristics | Toshiaki Tsuji Team | 2509.04737 | null |
| 2025-09-04 | Surformer v2: A Multimodal Classifier for Surface Understanding from Touch and Vision | Noorbakhsh Amiri Golilarz Team | 2509.04658 | null |
| 2025-09-04 | Planning from Point Clouds over Continuous Actions for Multi-object Rearrangement | David Held Team | 2509.04645 | link |
| 2025-09-04 | Action Chunking with Transformers for Image-Based Spacecraft Guidance and Control | Richard Linares Team | 2509.04628 | null |
| 2025-09-04 | In-Context Policy Adaptation via Cross-Domain Skill Diffusion | Honguk Woo Team | 2509.04535 | null |
| 2025-09-04 | EMMA: Scaling Mobile Manipulation via Egocentric Human Data | Danfei Xu Team | 2509.04443 | null |
| 2025-09-04 | Balancing Signal and Variance: Adaptive Offline RL Post-Training for VLA Flow Models | Donglin Wang Team | 2509.04063 | null |
| 2025-09-04 | FPC-VLA: A Vision-Language-Action Framework with a Supervisor for Failure Prediction and Correction | Jingtai Liu Team | 2509.04018 | null |
| 2025-09-04 | Weakly-Supervised Learning of Dense Functional Correspondences | Jiajun Wu Team | 2509.03893 | link |
| 2025-09-05 | Learning Multi-Stage Pick-and-Place with a Legged Mobile Manipulator | Wei Xu Team | 2509.03859 | null |
| 2025-09-03 | The Role of Embodiment in Intuitive Whole-Body Teleoperation for Mobile Manipulation | Georgia Chalvatzaki Team | 2509.03222 | null |
| 2025-09-03 | Autonomous Learning From Success and Failure: Goal-Conditioned Supervised Learning with Negative Feedback | Daniel A. Braun Team | 2509.03206 | null |
| 2025-09-03 | Forbal: Force Balanced 2-5 Degree of Freedom Robot Manipulator Built from a Five Bar Linkage | Matteo Bottin Team | 2509.03119 | null |
| 2025-09-02 | Generalizable Skill Learning for Construction Robots with Crowdsourced Natural Language Instructions, Composable Skills Standardization, and Large Language Model | Carol C. Menassa Team | 2509.02876 | null |
| 2025-09-02 | Power Grid Control with Graph-Based Distributed Reinforcement Learning | Marcello Restelli Team | 2509.02861 | null |
| 2025-09-04 | Plan Verification for LLM-Based Embodied Task Completion Agents | Gokhan Tur Team | 2509.02761 | null |
| 2025-09-02 | Manipulation as in Simulation: Enabling Accurate Geometry Perception in Robots | Bingyi Kang Team | 2509.02530 | link |
| 2025-09-02 | U-ARM: Ultra low-cost general teleoperation interface for robot manipulation | Bo Zhao Team | 2509.02437 | null |
| 2025-09-05 | Align-Then-stEer: Adapting the Vision-Language Action Models through Unified Latent Guidance | Xuelong Li Team | 2509.02055 | null |
| 2025-09-01 | ManiFlow: A General Robot Manipulation Policy via Consistency Flow Training | Dieter Fox Team | 2509.01819 | null |
| 2025-09-01 | Non-conflicting Energy Minimization in Reinforcement Learning based Robot Control | Stefan Lee Team | 2509.01765 | null |
| 2025-09-01 | Fail2Progress: Learning from Real-World Robot Failures with Stein Variational Inference | Tucker Hermans Team | 2509.01746 | null |
| 2025-09-01 | Articulated Object Estimation in the Wild | Abhinav Valada Team | 2509.01708 | null |
| 2025-09-01 | Data Retrieval with Importance Weights for Few-Shot Imitation Learning | Joey Hejna Team | 2509.01657 | null |
| 2025-09-01 | Disentangled Multi-Context Meta-Learning: Unlocking robust and Generalized Task Learning | Seongil Hong Team | 2509.01297 | null |
| 2025-08-31 | One-Step Model Predictive Path Integral for Manipulator Motion Planning Using Configuration Space Distance Fields | Kenji Kawashima Team | 2509.00836 | null |
| 2025-08-31 | An Effective Trajectory Planning and an Optimized Path Planning for a 6-Degree-of-Freedom Robot Manipulator | Masahiko Mikawa Team | 2509.00828 | null |
| 2025-08-31 | Inverse Kinematics for a 6-Degree-of-Freedom Robot Manipulator Using Comprehensive Gröbner Systems | Masahiko Mikawa Team | 2509.00823 | null |
| 2025-08-30 | Learning Dolly-In Filming From Demonstration Using a Ground-Based Robot | Wenbin Li Team | 2509.00574 | null |
| 2025-08-30 | NeuralSVCD for Efficient Swept Volume Collision Detection | Beomjoon Kim Team | 2509.00499 | null |
| 2025-08-29 | Can a mobile robot learn from a pedestrian model to prevent the sidewalk salsa? | David Abbink Team | 2508.21690 | null |
| 2025-08-29 | Robust Convex Model Predictive Control with collision avoidance guarantees for robot manipulators | Thomas B. Schön Team | 2508.21677 | null |
| 2025-08-29 | Learning Agile Gate Traversal via Analytical Optimal Policy Gradient | Lin Zhao Team | 2508.21592 | null |
| 2025-08-29 | Estimated Informed Anytime Search for Sampling-Based Planning via Adaptive Sampler | Alois Knoll Team | 2508.21549 | null |
| 2025-08-29 | Few-Shot Neuro-Symbolic Imitation Learning for Long-Horizon Planning and Acting | Matthias Scheutz Team | 2508.21501 | null |
| 2025-08-29 | RoboInspector: Unveiling the Unreliability of Policy Code for LLM-enabled Robotic Manipulation | Yuanchao Shu Team | 2508.21378 | null |
| 2025-08-29 | Dynamics-Compliant Trajectory Diffusion for Super-Nominal Payload Manipulation | Alessandro Roncone Team | 2508.21375 | null |
| 2025-08-29 | Learning to Assemble the Soma Cube with Legal-Action Masked DQN and Safe ZYZ Regrasp on a Doosan M0609 | Sawoong Kim Team | 2508.21272 | null |
| 2025-08-28 | Learning on the Fly: Rapid Policy Adaptation via Differentiable Simulation | Davide Scaramuzza Team | 2508.21065 | null |
| 2025-08-28 | Rapid Mismatch Estimation via Neural Network Informed Variational Inference | Nadia Figueroa Team | 2508.21007 | link |
| 2025-08-29 | UltraTac: Integrated Ultrasound-Augmented Visuotactile Sensor for Enhanced Robotic Perception | Wenbo Ding Team | 2508.20982 | null |
| 2025-08-28 | Deep Fuzzy Optimization for Batch-Size and Nearest Neighbors in Optimal Robot Motion Planning | Alois Knoll Team | 2508.20884 | null |
| 2025-08-28 | Learning Primitive Embodied World Models: Towards Scalable Robotic Learning | Qinying Gu Team | 2508.20840 | null |
| 2025-08-28 | Non-expert to Expert Motion Translation Using Generative Adversarial Networks | Seiichiro Katsura Team | 2508.20740 | null |
| 2025-08-28 | SimShear: Sim-to-Real Shear-based Tactile Servoing | Nathan F. Lepora Team | 2508.20561 | null |
| 2025-08-31 | HERMES: Human-to-Robot Embodied Learning from Multi-Source Motion Data for Mobile Dexterous Manipulation | Huazhe Xu Team | 2508.20085 | null |
| 2025-08-28 | Long-VLA: Unleashing Long-Horizon Capability of Vision Language Action Model for Robot Manipulation | Donglin Wang Team | 2508.19958 | link |
| 2025-08-28 | Ego-centric Predictive Model Conditioned on Hand Trajectories | Mike Zheng Shou Team | 2508.19852 | null |
| 2025-08-27 | APT*: Asymptotically Optimal Motion Planning via Adaptively Prolated Elliptical R-Nearest Neighbors | Alois Knoll Team | 2508.19790 | null |
| 2025-08-27 | Impedance Primitive-augmented Hierarchical Reinforcement Learning for Sequential Tasks | Jens Kober Team | 2508.19607 | null |
| 2025-08-26 | Gentle Object Retraction in Dense Clutter Using Multimodal Force Sensing and Imitation Learning | Mark Cutkosky Team | 2508.19476 | null |
| 2025-08-26 | LaVA-Man: Learning Visual Action Representations for Robot Manipulation | Changjae Oh Team | 2508.19391 | null |
| 2025-08-26 | Inference of Human-derived Specifications of Object Placement via Demonstration | Julie A Shah Team | 2508.19367 | null |
| 2025-08-26 | MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation | Gao Huang Team | 2508.19236 | link |
| 2025-08-26 | LSD-3D: Large-Scale 3D Driving Scene Generation with Geometry Grounding | Felix Heide Team | 2508.19204 | link |
| 2025-08-27 | AutoRing: Imitation Learning–based Autonomous Intraocular Foreign Body Removal Manipulation with Eye Surgical Robot | Jian Wu Team | 2508.19191 | null |
| 2025-08-28 | From Tabula Rasa to Emergent Abilities: Discovering Robot Skills via Real-World Unsupervised Quality-Diversity | Antoine Cully Team | 2508.19172 | null |
| 2025-08-26 | Playstyle and Artificial Intelligence: An Initial Blueprint Through the Lens of Video Games | Chiu-Chou Lin Team | 2508.19152 | null |
| 2025-08-26 | AS2FM: Enabling Statistical Model Checking of ROS 2 Systems for Robust Autonomy | Matteo Morelli Team | 2508.18820 | null |
| 2025-08-26 | HyperTASR: Hypernetwork-Driven Task-Aware Scene Representations for Robust Manipulation | Yanchao Yang Team | 2508.18802 | null |
| 2025-08-26 | Deep Sensorimotor Control by Imitating Predictive Models of Human Motion | Antonio Loquercio Team | 2508.18691 | link |
| 2025-08-26 | Integration of Robot and Scene Kinematics for Sequential Mobile Manipulation Planning | Song-Chun Zhu Team | 2508.18627 | null |
| 2025-08-25 | PneuGelSight: Soft Robotic Vision-Based Proprioception and Tactile Sensing | Wenzhen Yuan Team | 2508.18443 | null |
| 2025-08-25 | Maintenance automation: methods for robotics manipulation planning and execution | Alexander Verl Team | 2508.18399 | null |
| 2025-08-26 | FlowVLA: Thinking in Motion with a Visual Chain of Thought | Haoang Li Team | 2508.18269 | null |
| 2025-08-25 | No Need to Look! Locating and Grasping Objects by a Robot Arm Covered with Sensitive Skin | Matej Hoffmann Team | 2508.17986 | null |
| 2025-08-25 | SEBVS: Synthetic Event-based Visual Servoing for Robot Navigation and Manipulation | Bharatesh Chakravarthi Team | 2508.17643 | null |
| 2025-08-25 | GWM: Towards Scalable Gaussian World Models for Robotic Manipulation | Siyuan Huang Team | 2508.17600 | link |
| 2025-08-24 | LodeStar: Long-horizon Dexterity via Synthetic Data Augmentation from Human Demonstrations | Hao Su Team | 2508.17547 | null |
| 2025-08-24 | Variational Shape Inference for Grasp Diffusion on SE(3) | Aniket Bera Team | 2508.17482 | null |
| 2025-08-24 | ReviBranch: Deep Reinforcement Learning for Branch-and-Bound with Revived Trajectories | Jiaping Xiao Team | 2508.17452 | null |
| 2025-08-24 | Robotic Manipulation via Imitation Learning: Taxonomy, Evolution, Benchmark, and Challenges | Liming Chen Team | 2508.17449 | null |
| 2025-08-24 | OVITA: Open-Vocabulary Interpretable Trajectory Adaptations | Ravi Prakash Team | 2508.17260 | link |
| 2025-08-24 | 4D Visual Pre-training for Robot Learning | Huazhe Xu Team | 2508.17230 | null |
| 2025-08-21 | UnPose: Uncertainty-Guided Diffusion Priors for Zero-Shot Pose Estimation | Binbin Xu Team | 2508.15972 | link |
| 2025-08-21 | Spatial Policy: Guiding Visuomotor Robotic Manipulation with Spatial-Aware Modeling and Reasoning | Wenwu Zhu Team | 2508.15874 | null |
| 2025-08-21 | Search-Based Credit Assignment for Offline Preference-Based Reinforcement Learning | Houqiang Li Team | 2508.15327 | null |
| 2025-08-20 | A Vision-Based Shared-Control Teleoperation Scheme for Controlling the Robotic Arm of a Four-Legged Robot | Marcelo Becker Team | 2508.14994 | null |
| 2025-08-19 | Learning to Drive Ethically: Embedding Moral Reasoning into Autonomous Driving | Ostap Okhrin Team | 2508.14926 | null |
| 2025-08-20 | FBI: Learning Dexterous In-hand Manipulation with Dynamic Visuotactile Shortcut Policy | Cewu Lu Team | 2508.14441 | null |
| 2025-08-20 | Offline Imitation Learning upon Arbitrary Demonstrations by Pre-Training Dynamics Representations | Na Li Team | 2508.14383 | null |
| 2025-08-20 | Action-Constrained Imitation Learning | Ping-Chun Hsieh Team | 2508.14379 | null |
| 2025-08-20 | Learning Point Cloud Representations with Pose Continuity for Depth-Based Category-Level 6D Object Pose Estimation | Ioannis Stamos Team | 2508.14358 | null |
| 2025-08-19 | Train Once, Deploy Anywhere: Realize Data-Efficient Dynamic Object Manipulation | Hengshuang Zhao Team | 2508.14042 | null |
| 2025-08-19 | Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation | Jianye Hao Team | 2508.13998 | null |
| 2025-08-19 | Toward Deployable Multi-Robot Collaboration via a Symbolically-Guided Decision Transformer | Paul Asunda Team | 2508.13877 | null |
| 2025-08-18 | Decoding Communications with Partial Information | Peter McBurney Team | 2508.13326 | null |
| 2025-08-18 | Precise Action-to-Video Generation Through Visual Action Prompts | Ruizhen Hu Team | 2508.13104 | link |
| 2025-08-18 | Grounding Actions in Camera Space: Observation-Centric Vision-Language-Action Policy | Zhi Hou Team | 2508.13103 | null |
| 2025-08-18 | Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey | Liqiang Nie Team | 2508.13073 | link |
| 2025-08-18 | PROD: Palpative Reconstruction of Deformable Objects through Elastostatic Signed Distance Functions | Hamza El-Kebir Team | 2508.12554 | null |
| 2025-08-17 | EgoLoc: A Generalizable Solution for Temporal Interaction Localization in Egocentric Videos | Hesheng Wang Team | 2508.12349 | null |
| 2025-08-17 | Bimanual Robot-Assisted Dressing: A Spherical Coordinate-Based Strategy for Tight-Fitting Garments | Jihong Zhu Team | 2508.12274 | null |
| 2025-08-17 | Robot Trains Robot: Automatic Real-World Policy Adaptation and Learning for Humanoids | Shuran Song Team | 2508.12252 | null |
| 2025-08-16 | Belief-Conditioned One-Step Diffusion: Real-Time Trajectory Planning with Just-Enough Sensing | Melkior Ornik Team | 2508.12166 | null |
| 2025-08-16 | OASIS: Real-Time Opti-Acoustic Sensing for Intervention Systems in Unstructured Environments | Richard Camilli Team | 2508.12071 | null |
| 2025-08-16 | Fully Spiking Actor-Critic Neural Network for Robotic Manipulation | Guanghui Sun Team | 2508.12038 | null |
| 2025-08-16 | OmniD: Generalizable Robot Manipulation Policy via Image-Based BEV Representation | Xiaozhu Ju Team | 2508.11898 | null |
| 2025-08-15 | Limitation Learning: Catching Adverse Dialog with GAIL | Rahul Zalkikar Team | 2508.11767 | null |
| 2025-08-15 | MultiPark: Multimodal Parking Transformer with Next-Segment Prediction | Tong Qin Team | 2508.11537 | null |
| 2025-08-15 | Learning Differentiable Reachability Maps for Optimization-based Humanoid Motion Generation | Fumio Kanehiro Team | 2508.11275 | null |
| 2025-08-15 | Multi-Group Equivariant Augmentation for Reinforcement Learning in Robot Manipulation | Kwok Wai Samuel Au Team | 2508.11204 | null |
| 2025-08-15 | Actor-Critic for Continuous Action Chunks: A Reinforcement Learning Framework for Long-Horizon Robotic Manipulation with Sparse Reward | Yu-Gang Jiang Team | 2508.11143 | null |
| 2025-08-14 | Robot Policy Evaluation for Sim-to-Real Transfer: A Benchmarking Perspective | Fabio Ramos Team | 2508.11117 | null |
| 2025-08-14 | GenFlowRL: Shaping Rewards with Generative Object-Centric Flow in Visual Reinforcement Learning | Ruohan Gao Team | 2508.11049 | null |
| 2025-08-14 | 3D FlowMatch Actor: Unified 3D Policy for Single- and Dual-Arm Manipulation | Katerina Fragkiadaki Team | 2508.11002 | null |
| 2025-08-15 | KDPE: A Kernel Density Estimation Strategy for Diffusion Policy Trajectory Selection | Lorenzo Natale Team | 2508.10511 | null |
| 2025-08-14 | Large Model Empowered Embodied AI: A Survey on Decision-Making and Embodied Learning | Ping Kuang Team | 2508.10399 | null |
| 2025-08-14 | Leveraging OS-Level Primitives for Robotic Action Management | Haibo Chen Team | 2508.10259 | null |
| 2025-08-13 | Masquerade: Learning from In-the-wild Human Videos using Data-Editing | Jeannette Bohg Team | 2508.09976 | link |
| 2025-08-13 | Toward Human-Robot Teaming: Learning Handover Behaviors from 3D Scenes | Changjae Oh Team | 2508.09855 | null |
| 2025-08-13 | Physical Autoregressive Model for Robotic Manipulation without Action Pretraining | Guangrun Wang Team | 2508.09822 | null |
| 2025-08-13 | Immersive Teleoperation of Beyond-Human-Scale Robotic Manipulators: Challenges and Future Directions | Jouni Mattila Team | 2508.09700 | null |
| 2025-08-13 | CaRoBio: 3D Cable Routing with a Bio-inspired Gripper Fingernail | Fumin Zhang Team | 2508.09558 | null |
| 2025-08-13 | Reactive Model Predictive Contouring Control for Robot Manipulators | Jaeheung Park Team | 2508.09502 | null |
| 2025-08-13 | DAgger Diffusion Navigation: DAgger Boosted Diffusion Policy for Vision-Language Navigation | Liqiang Nie Team | 2508.09444 | null |
| 2025-08-13 | GeoVLA: Empowering 3D Representations in Vision-Language-Action Models | Jiale Cao Team | 2508.09071 | link |
| 2025-08-12 | Unsupervised Skill Discovery as Exploration for Learning Agile Locomotion | Sehoon Ha Team | 2508.08982 | null |
| 2025-08-12 | Reducing Cognitive Load in Multi-Agent Reinforcement Learning for Mathematical Problem Solving: Decoupling Reasoning and Code Generation | Yang Li Team | 2508.08882 | null |
| 2025-08-12 | Visual Prompting for Robotic Manipulation with Annotation-Guided Pick-and-Place Using ACT | Yukiyasu Domae Team | 2508.08748 | null |
| 2025-08-12 | Towards Safe Imitation Learning via Potential Field-Guided Flow Matching | Yoshihiko Nakamura Team | 2508.08707 | null |
| 2025-08-12 | OmniVTLA: Vision-Tactile-Language-Action Model with Semantic-Aligned Tactile Sensing | Hengdi Zhang Team | 2508.08706 | null |
| 2025-08-11 | ReconDreamer-RL: Enhancing Reinforcement Learning via Diffusion-based Scene Reconstruction | Wenjun Mei Team | 2508.08170 | null |
| 2025-08-11 | AimBot: A Simple Auxiliary Visual Cue to Enhance Spatial Awareness of Visuomotor Policies | Joyce Chai Team | 2508.08113 | null |
| 2025-08-13 | AgentWorld: An Interactive Simulation Platform for Scene Construction and Mobile Robotic Manipulation | Lei Han Team | 2508.07770 | null |
| 2025-08-11 | GraphCoT-VLA: A 3D Spatial-Aware Reasoning Vision-Language-Action Model for Robotic Manipulation with Ambiguous Instructions | Hong Zhang Team | 2508.07650 | null |
| 2025-08-11 | AR-VRM: Imitating Human Motions for Visual Robot Manipulation with Analogical Reasoning | Yang Liu Team | 2508.07626 | null |
| 2025-08-10 | Collision-Free Trajectory Planning and control of Robotic Manipulator using Energy-Based Artificial Potential Field (E-APF) | Manoranjan Sinha Team | 2508.07323 | null |
| 2025-08-10 | Multimodal Spiking Neural Network for Space Robotic Manipulation | Guanghui Sun Team | 2508.07287 | null |
| 2025-08-09 | DexFruit: Dexterous Manipulation and Gaussian Splatting Inspection of Fruit | Monroe Kennedy III Team | 2508.07118 | null |
| 2025-08-09 | From Imitation to Optimization: A Comparative Study of Offline Learning for Autonomous Driving | Antonio Guillen-Perez Team | 2508.07029 | null |
| 2025-08-09 | Manipulator for people with limited abilities | Arkady Yuschenko Team | 2508.06969 | null |
| 2025-08-09 | Learning a Vision-Based Footstep Planner for Hierarchical Walking Control | Michael Posa Team | 2508.06779 | null |
| 2025-08-08 | Towards Balanced Behavior Cloning from Imbalanced Datasets | Dylan P. Losey Team | 2508.06319 | null |
| 2025-08-08 | Surrogate-Enhanced Modeling and Adaptive Modular Control of All-Electric Heavy-Duty Robotic Manipulators | Jouni Mattila Team | 2508.06313 | null |
| 2025-08-08 | ADPro: a Test-time Adaptive Diffusion Policy for Robot Manipulation via Manifold and Initial Noise Constraints | Liming Chen Team | 2508.06266 | null |
| 2025-08-08 | Incremental Language Understanding for Online Motion Planning of Robot Manipulators | Matthias Scheutz Team | 2508.06095 | null |
| 2025-08-08 | Society of Mind Meets Real-Time Strategy: A Hierarchical Multi-Agent Framework for Strategic Reasoning | Jonghyun Choi Team | 2508.06042 | null |
| 2025-08-08 | PASG: A Closed-Loop Framework for Automated Geometric Primitive Extraction and Semantic Anchoring in Robotic Manipulation | Yao Mu Team | 2508.05976 | null |
| 2025-08-07 | Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation | Guanghui Ren Team | 2508.05635 | link |
| 2025-08-07 | Towards Generalizable Safety in Crowd Navigation via Conformal Uncertainty Handling | Jiachen Li Team | 2508.05634 | link |
| 2025-08-07 | Robust adaptive fuzzy sliding mode control for trajectory tracking of cylindrical manipulator | Nga Nguyen Thi Team | 2508.05584 | null |
| 2025-08-07 | Do Robots Really Need Anthropomorphic Hands? | Nicolás Navarro-Guerrero Team | 2508.05415 | null |
| 2025-08-07 | Real-Time Iteration Scheme for Diffusion Policy | Danica Kragic Team | 2508.05396 | null |
| 2025-08-07 | ASkDAgger: Active Skill-level Data Aggregation for Interactive Imitation Learning | Jens Kober Team | 2508.05310 | null |
| 2025-08-07 | Learning to See and Act: Task-Aware View Planning for Robotic Manipulation | Liang Lin Team | 2508.05186 | link |
| 2025-08-07 | Cognitive Duality for Adaptive Web Agents | Zheng Hu Team | 2508.05081 | null |
| 2025-08-07 | Analyzing the Impact of Multimodal Perception on Sample Complexity and Optimization Landscapes in Imitation Learning | Temitope Lukman Adebanjo Team | 2508.05077 | null |
| 2025-08-06 | INTENTION: Inferring Tendencies of Humanoid Robot Motion Through Interactive Intuition and Grounded VLM | Nikos Tsagarakis Team | 2508.04931 | link |
| 2025-08-06 | Optimization of sliding control parameters for a 3-dof robot arm using genetic algorithm (GA) | Le Tieu Nien Team | 2508.04009 | null |
| 2025-08-05 | Constraint-Preserving Data Generation for Visuomotor Policy Learning | Jeannette Bohg Team | 2508.03944 | link |
| 2025-08-05 | DiWA: Diffusion Policy Adaptation with World Models | Abhinav Valada Team | 2508.03645 | null |
| 2025-08-05 | ActionSink: Toward Precise Robot Manipulation with Dynamic Integration of Action Flow | Xiaodan Liang Team | 2508.03218 | null |
| 2025-08-05 | Safety-Aware Imitation Learning via MPC-Guided Disturbance Injection | Somil Bansal Team | 2508.03129 | null |
| 2025-08-07 | Hand-Eye Autonomous Delivery: Learning Humanoid Navigation, Locomotion and Reaching | C. Karen Liu Team | 2508.03068 | null |
| 2025-08-05 | Aerobatic maneuvers in insect-scale flapping-wing aerial robots via deep-learned robust tube model predictive control | YuFeng Chen Team | 2508.03043 | null |
| 2025-08-04 | Learning User Interaction Forces using Vision for a Soft Finger Exosuit | Thomas George Thuruthel Team | 2508.02870 | null |
| 2025-08-04 | Manip4Care: Robotic Manipulation of Human Limbs for Solving Assistive Tasks | Ahmed H. Qureshi Team | 2508.02649 | null |
| 2025-08-04 | D2PPO: Diffusion Policy Policy Optimization with Dispersive Loss | Haitao Wang Team | 2508.02644 | null |
| 2025-08-01 | On-Device Diffusion Transformer Policy for Efficient Robot Manipulation | Dong Xu Team | 2508.00697 | null |
| 2025-08-01 | HannesImitation: Grasping with the Hannes Prosthetic Hand via Imitation Learning | Lorenzo Natale Team | 2508.00491 | null |
| 2025-08-01 | Energy Efficient Trajectory Control and Resource Allocation in Multi-UAV-assisted MEC via Deep Reinforcement Learning | Dusit Niyato Team | 2508.00261 | null |
| 2025-07-31 | RAGNet: Large-scale Reasoning-based Affordance Segmentation Benchmark towards General Grasping | Jianbing Shen Team | 2507.23734 | link |
| 2025-07-31 | villa-X: Enhancing Latent Action Modeling in Vision-Language-Action Models | Jiang Bian Team | 2507.23682 | link |
| 2025-08-01 | H-RDT: Human Manipulation Enhanced Bimanual Robotic Manipulation | Jun Zhu Team | 2507.23523 | null |
| 2025-07-31 | Policy Learning from Large Vision-Language Model Feedback without Reward Modeling | Chang D. Yoo Team | 2507.23391 | null |
| 2025-07-30 | In-between Motion Generation Based Multi-Style Quadruped Robot Locomotion | Peng Lu Team | 2507.23053 | null |
| 2025-07-30 | Improving Generalization Ability of Robotic Imitation Learning by Resolving Causal Confusion in Observations | Brendan Tidd Team | 2507.22380 | null |
| 2025-07-29 | RL from Teacher-Model Refinement: Gradual Imitation Learning for Machine Translation | Pengcheng He Team | 2507.22219 | null |
| 2025-07-29 | A Nonlinear MPC Framework for Loco-Manipulation of Quadrupedal Robots with Non-Negligible Manipulator Dynamics | Kaveh Akbari Hamed Team | 2507.22042 | null |
| 2025-07-29 | From Seeing to Experiencing: Scaling Navigation Foundation Models with Reinforcement Learning | Bolei Zhou Team | 2507.22028 | null |
| 2025-07-29 | DISCOVERSE: Efficient Robot Simulation in Complex High-Fidelity Environments | Guyue Zhou Team | 2507.21981 | null |
| 2025-07-29 | MoDeSuite: Robot Learning Task Suite for Benchmarking Mobile Manipulation with Deformable Objects | Joni Pajarinen Team | 2507.21796 | null |
| 2025-07-29 | Pretraining a Unified PDDL Domain from Real-World Demonstrations for Generalizable Robot Task Planning | Panpan Cai Team | 2507.21545 | null |
| 2025-07-29 | Model Predictive Adversarial Imitation Learning for Planning from Observation | Byron Boots Team | 2507.21533 | null |
| 2025-07-29 | Retrieve-Augmented Generation for Speeding up Diffusion Policy without Additional Training | Yutaka Matsuo Team | 2507.21452 | null |
| 2025-07-28 | Fluidically Innervated Lattices Make Versatile and Durable Tactile Sensors | Daniela Rus Team | 2507.21225 | null |
| 2025-07-28 | FMimic: Foundation Models are Fine-grained Action Learners from Human Videos | Yufeng Yue Team | 2507.20622 | null |
| 2025-07-28 | Learning Physical Interaction Skills from Human Demonstrations | Kwonjoon Lee Team | 2507.20445 | null |
| 2025-07-23 | ERMV: Editing 4D Robotic Multi-view images to enhance embodied agents | Hesheng Wang Team | 2507.17462 | null |
| 2025-07-23 | Ctx2TrajGen: Traffic Context-Aware Microscale Vehicle Trajectories using Generative Adversarial Imitation Learning | Byeongjoon Noh Team | 2507.17418 | null |
| 2025-07-23 | Confounded Causal Imitation Learning with Instrumental Variables | Zhi Geng Team | 2507.17309 | null |
| 2025-07-23 | Prolonging Tool Life: Learning Skillful Use of General-purpose Tools through Lifespan-guided Reinforcement Learning | Takamitsu Matsubara Team | 2507.17275 | null |
| 2025-07-23 | Towards Human-level Intelligence via Human-like Whole-Body Manipulation | Zhaohui An Team | 2507.17141 | null |
| 2025-07-22 | Evaluating Uncertainty and Quality of Visual Language Action-enabled Robots | Aitor Arrieta Team | 2507.17049 | null |
| 2025-07-19 | Sensor-Space Based Robust Kinematic Control of Redundant Soft Manipulator by Learning | Charlie C. L. Wang Team | 2507.16842 | null |
| 2025-07-22 | ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning | Fu-En Yang Team | 2507.16815 | null |
| 2025-07-22 | Equivariant Goal Conditioned Contrastive Reinforcement Learning | Robert Platt Team | 2507.16139 | null |
| 2025-07-21 | Look, Focus, Act: Efficient and Robust Robot Learning via Human Gaze and Foveated Vision Transformers | Iman Soltani Team | 2507.15833 | null |
| 2025-07-21 | Strong, Accurate, and Low-Cost Robot Manipulator | Donghyun Kim Team | 2507.15693 | null |
| 2025-07-21 | Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos | Zongqing Lu Team | 2507.15597 | null |
| 2025-07-22 | GR-3 Technical Report | Yichu Yang Team | 2507.15493 | null |
| 2025-07-20 | Learning-Based Modeling of a Magnetically Steerable Soft Suction Device for Endoscopic Endonasal Interventions | Eric Diller Team | 2507.15155 | null |
| 2025-07-20 | Reinforcement Learning for Flow-Matching Policies | Somayeh Sojoudi Team | 2507.15073 | null |
| 2025-07-20 | Touch in the Wild: Learning Fine-Grained Manipulation with a Portable Visuo-Tactile Gripper | Yunzhu Li Team | 2507.15062 | null |
| 2025-07-20 | LLM-Enhanced Multi-Agent Reinforcement Learning with Expert Workflow for Real-Time P2P Energy Trading | Lu Zhang Team | 2507.14995 | null |
| 2025-07-20 | Heterogeneous object manipulation on nonlinear soft surface through linear controller | Andres Faiña Team | 2507.14967 | null |
| 2025-07-20 | KGN-Pro: Keypoint-Based Grasp Prediction through Probabilistic 2D-3D Correspondence Learning | Guangyao Zhai Team | 2507.14820 | null |
| 2025-07-19 | BT-TL-DMPs: A Novel Robot TAMP Framework Combining Behavior Tree, Temporal Logic and Dynamical Movement Primitives | Yongchun Fang Team | 2507.14582 | null |
| 2025-07-18 | Improving Low-Cost Teleoperation: Augmenting GELLO with Force | Kai Arulkumaran Team | 2507.13602 | null |
| 2025-07-17 | The Imitation Game: Turing Machine Imitator is Length Generalizable Reasoner | Kai Chen Team | 2507.13332 | null |
| 2025-07-17 | ZipMPC: Compressed Context-Dependent MPC Cost via Imitation Learning | Johannes A. Stork Team | 2507.13088 | null |
| 2025-07-17 | Generalist Bimanual Manipulation via Foundation Video Diffusion Models | Jun Zhu Team | 2507.12898 | null |
| 2025-07-17 | Supervised Fine Tuning on Curated Data is Reinforcement Learning (and can be improved) | Jost Tobias Springenberg Team | 2507.12856 | null |
| 2025-07-17 | DEMONSTRATE: Zero-shot Language to Robotic Control via Multi-task Demonstration Learning | Melanie N. Zeilinger Team | 2507.12855 | null |
| 2025-07-17 | Learning to Predict Mobile Robot Stability in Off-Road Environments | Parikshit Maini Team | 2507.12731 | null |
| 2025-07-18 | EgoVLA: Learning Vision-Language-Action Models from Egocentric Human Videos | Xiaolong Wang Team | 2507.12440 | null |
| 2025-07-16 | The Developments and Challenges towards Dexterous and Embodied Robotic Manipulation: A Survey | Jiming Chen Team | 2507.11840 | null |
| 2025-07-15 | Let’s Think in Two Steps: Mitigating Agreement Bias in MLLMs with Self-Grounded Verification | Zsolt Kira Team | 2507.11662 | null |
| 2025-07-15 | MPC-based Coarse-to-Fine Motion Planning for Robotic Object Transportation in Cluttered Environments | Steven Liu Team | 2507.11211 | null |
| 2025-07-15 | A Robust Controller based on Gaussian Processes for Robotic Manipulators with Unknown Uncertainty | Ruggero Carli Team | 2507.11170 | null |
| 2025-07-15 | Enhancing Autonomous Manipulator Control with Human-in-loop for Uncertain Assembly Environments | Kazuya Yoshida Team | 2507.11006 | null |
| 2025-07-15 | Object-Centric Mobile Manipulation through SAM2-Guided Perception and Imitation Learning | Jun Morimoto Team | 2507.10899 | null |
| 2025-07-14 | Versatile and Generalizable Manipulation via Goal-Conditioned Reinforcement Learning with Grounded Object Detection | Colin Bellinger Team | 2507.10814 | null |
| 2025-07-14 | rt-RISeg: Real-Time Model-Free Robot Interactive Segmentation for Active Instance-Level Object Understanding | Kaiyu Hang Team | 2507.10776 | null |
| 2025-07-14 | A New Dataset and Performance Benchmark for Real-time Spacecraft Segmentation in Onboard Flight Computers | Arko Barman Team | 2507.10775 | null |
| 2025-07-14 | Vision Language Action Models in Robotic Manipulation: A Systematic Review | Irfan Hussain Team | 2507.10672 | null |
| 2025-07-16 | GHPO: Adaptive Guidance for Stable and Efficient LLM Reinforcement Learning | Dandan Tu Team | 2507.10628 | null |
| 2025-07-14 | MP1: Mean Flow Tames Policy Learning in 1-step for Robotic Manipulation | Mengyuan Liu Team | 2507.10543 | null |
| 2025-07-14 | Prompt Informed Reinforcement Learning for Visual Coverage Path Planning | Venkat Margapuri Team | 2507.10284 | null |
| 2025-07-14 | Should We Ever Prefer Decision Transformer for Offline Reinforcement Learning? | Keith Ross Team | 2507.10174 | null |
| 2025-07-16 | MTF-Grasp: A Multi-tier Federated Learning Approach for Robotic Grasping | Monowar Bhuyan Team | 2507.10158 | null |
| 2025-07-13 | Learning to Control Dynamical Agents via Spiking Neural Networks and Metropolis-Hastings Sampling | Ali Al-Zawqari Team | 2507.09540 | null |
| 2025-07-13 | Self-supervised Pretraining for Integrated Prediction and Planning of Automated Vehicles | Keqiang Li Team | 2507.09537 | null |
| 2025-07-13 | SegVec3D: A Method for Vector Embedding of 3D Objects Oriented Towards Robot manipulation | Boyu Wang Team | 2507.09459 | null |
| 2025-07-12 | DAA*: Deep Angular A Star for Image-based Path Planning | Zhiwei Xu Team | 2507.09305 | null |
| 2025-07-15 | Learning and Transferring Better with Depth Information in Visual Reinforcement Learning | Jingdong Zhao Team | 2507.09180 | null |
| 2025-07-12 | PRAG: Procedural Action Generator | Karla Stepanova Team | 2507.09167 | null |
| 2025-07-12 | Towards Human-level Dexterity via Robot Learning | Gagan Khandate Team | 2507.09117 | null |
| 2025-07-11 | Imitation Learning in Continuous Action Spaces: Mitigating Compounding Error without Interaction | Max Simchowitz Team | 2507.09061 | null |
| 2025-07-11 | Behavioral Exploration: Learning to Explore via In-Context Adaptation | Sergey Levine Team | 2507.09041 | null |
| 2025-07-11 | Learning human-to-robot handovers through 3D scene reconstruction | Changjae Oh Team | 2507.08726 | null |
| 2025-07-11 | Learning Robust Motion Skills via Critical Adversarial Attacks for Humanoid Robots | Yue Gao Team | 2507.08303 | null |
| 2025-07-11 | CL3R: 3D Reconstruction and Contrastive Learning for Enhanced Robotic Manipulation Representations | He Wang Team | 2507.08262 | null |
| 2025-07-10 | Imitation Learning for Obstacle Avoidance Using End-to-End CNN-Based Sensor Fusion | Raafat E. Shalaby Team | 2507.08112 | null |
| 2025-07-15 | EXPO: Stable Reinforcement Learning with Expressive Policies | Chelsea Finn Team | 2507.07986 | null |
| 2025-07-15 | Reinforcement Learning with Action Chunking | Sergey Levine Team | 2507.07969 | null |
| 2025-07-09 | Self-Wearing Adaptive Garments via Soft Robotic Unfurling | Allison M. Okamura Team | 2507.07221 | null |
| 2025-07-09 | Hierarchical Reinforcement Learning for Articulated Tool Manipulation with Multifingered Hand | Xinjun Sheng Team | 2507.06822 | null |
| 2025-07-09 | Learning safe, constrained policies via imitation learning: Connection to Probabilistic Inference and a Naive Algorithm | George A. Vouros Team | 2507.06780 | null |
| 2025-07-13 | Spatial-Temporal Aware Visuomotor Diffusion Policy Learning | Yanwei Fu Team | 2507.06710 | null |
| 2025-07-09 | Value from Observations: Towards Large-Scale Imitation Learning via Self-Improvement | Martin Riedmiller Team | 2507.06701 | null |
| 2025-07-09 | Goal-Oriented Skill Abstraction for Offline Multi-Task Reinforcement Learning | Jian Cheng Team | 2507.06628 | null |
| 2025-07-09 | Q-STAC: Q-Guided Stein Variational Model Predictive Actor-Critic | Fabio Ramos Team | 2507.06625 | null |
| 2025-07-09 | Token Bottleneck: One Token to Remember Dynamics | Sangdoo Yun Team | 2507.06543 | null |
| 2025-07-08 | Learning to Evaluate Autonomous Behaviour in Human-Robot Interaction | Alessio Del Bue Team | 2507.06404 | null |
| 2025-07-08 | EC-Flow: Enabling Versatile Robotic Manipulation from Action-Unlabeled Videos via Embodiment-Centric Flow | Liang Wang Team | 2507.06224 | null |
| 2025-07-08 | Is Diversity All You Need for Scalable Robotic Manipulation? | Hongyang Li Team | 2507.06219 | null |
| 2025-07-08 | Fast Bilateral Teleoperation and Imitation Learning Using Sensorless Force Control via Accurate Dynamics Model | Toshiaki Tsuji Team | 2507.06174 | null |
| 2025-07-08 | Learning Agile Tensile Perching for Aerial Robots from Demonstrations | Basaran Bahadir Kocer Team | 2507.06172 | null |
| 2025-07-08 | SCCRUB: Surface Cleaning Compliant Robot Utilizing Bristles | Jeffrey Ian Lipton Team | 2507.06053 | null |
| 2025-07-08 | LeAD: The LLM Enhanced Planning System Converged with End-to-end Autonomous Driving | Jian Sun Team | 2507.05754 | null |
| 2025-07-08 | Hybrid Diffusion Policies with Projective Geometric Algebra for Efficient Robot Manipulation Learning | Daniel Rakita Team | 2507.05695 | null |
| 2025-07-08 | Integrating Diffusion-based Multi-task Learning with Online Reinforcement Learning for Robust Quadruped Robot Control | Bin Liang Team | 2507.05674 | null |
| 2025-07-08 | Stable Tracking-in-the-Loop Control of Cable-Driven Surgical Manipulators under Erroneous Kinematic Chains | Michael C. Yip Team | 2507.05663 | null |
| 2025-07-08 | DreamGrasp: Zero-Shot 3D Multi-Object Reconstruction from Partial-View Images for Robotic Manipulation | Frank Chongwoo Park Team | 2507.05627 | null |
| 2025-07-07 | Gaussian Process-Based Active Exploration Strategies in Vision and Touch | Nadia Figueroa Team | 2507.05522 | null |
| 2025-07-07 | A Careful Examination of Large Behavior Models for Multitask Dexterous Manipulation | Russ Tedrake Team | 2507.05331 | null |
| 2025-07-07 | VOTE: Vision-Language-Action Optimization with Trajectory Ensemble Voting | Yanzhi Wang Team | 2507.05116 | null |
| 2025-07-07 | When Imitation Learning Outperforms Reinforcement Learning in Surgical Action Planning | Sebastien Ourselin Team | 2507.05011 | null |
| 2025-07-07 | Training-free Generation of Temporally Consistent Rewards from VLMs | Jian Tang Team | 2507.04789 | null |
| 2025-07-07 | DRAE: Dynamic Retrieval-Augmented Expert Networks for Lifelong Learning and Task Adaptation in Robotics | Mingsheng Shang Team | 2507.04661 | null |
| 2025-07-07 | PRISM: Pointcloud Reintegrated Inference via Segmentation and Cross-attention for Manipulation | Chee-Meng Chew Team | 2507.04633 | null |
| 2025-07-07 | Learning Robust Stereo Matching in the Wild with Selective Mixture-of-Experts | Junjie Hu Team | 2507.04631 | null |
| 2025-07-06 | VLM-TDP: VLM-guided Trajectory-conditioned Diffusion Policy for Robust Long-Horizon Manipulation | Lei Han Team | 2507.04524 | null |
| 2025-07-06 | DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge | Xin Jin Team | 2507.04447 | null |
| 2025-07-06 | Wavelet Policy: Lifting Scheme for Policy Learning in Long-Horizon Tasks | Yi Fang Team | 2507.04331 | null |
| 2025-07-05 | Are Learning-Based Approaches Ready for Real-World Indoor Navigation? A Case for Imitation Learning | Sebastian Houben Team | 2507.04086 | null |
| 2025-07-05 | Breaking Imitation Bottlenecks: Reinforced Diffusion Powers Diverse Trajectory Generation | Yadan Luo Team | 2507.04049 | null |
| 2025-07-08 | RwoR: Generating Robot Demonstrations from Human Hand Collection for Policy Learning without Robot | Hao Dong Team | 2507.03930 | null |
| 2025-07-05 | DK-RRT: Deep Koopman RRT for Collision-Aware Motion Planning of Space Manipulators in Dynamic Debris Environments | Dezhi Yu Team | 2507.03878 | null |
| 2025-07-04 | Dexterous Teleoperation of 20-DoF ByteDexter Hand via Human Motion Retargeting | Zeyu Ren Team | 2507.03227 | null |
| 2025-07-02 | cVLA: Towards Efficient Camera-Space VLAs | Thomas Brox Team | 2507.02190 | null |
| 2025-07-02 | Towards Bio-Inspired Robotic Trajectory Planning via Self-Supervised RNN | Matthias Kerzel Team | 2507.02171 | null |
| 2025-07-02 | TypeTele: Releasing Dexterity in Teleoperation by Dexterous Manipulation Types | Wei-Shi Zheng Team | 2507.01857 | null |
| 2025-07-02 | S3D: A Spatial Steerable Surgical Drilling Framework for Robotic Spinal Fixation Procedures | Farshid Alambeigi Team | 2507.01779 | null |
| 2025-07-03 | TriVLA: A Triple-System-Based Unified Vision-Language-Action Model for General Robot Control | Yanwei Fu Team | 2507.01424 | null |
| 2025-07-01 | Search-Based Robot Motion Planning With Distance-Based Adaptive Motion Primitives | Bakir Lacevic Team | 2507.01198 | null |
| 2025-07-01 | Imitation Learning for Satellite Attitude Control under Unknown Perturbations | Xiaoli Bai Team | 2507.01161 | null |
| 2025-07-01 | SonoGym: High Performance Simulation for Challenging Surgical Tasks with Robotic Ultrasound | Philipp Fürnstahl Team | 2507.01152 | null |
| 2025-07-01 | Geometry-aware 4D Video Generation for Robot Manipulation | Shuran Song Team | 2507.01099 | null |
| 2025-07-01 | DexWrist: A Robotic Wrist for Constrained and Dynamic Manipulation | Pulkit Agrawal Team | 2507.01008 | null |
| 2025-07-04 | Robotic Manipulation by Imitating Generated Videos Without Physical Demonstrations | Yunzhu Li Team | 2507.00990 | null |
| 2025-07-01 | HumanoidGen: Data Generation for Bimanual Dexterous Manipulation via LLM Reasoning | Chenjia Bai Team | 2507.00833 | null |
| 2025-07-01 | Learning Steerable Imitation Controllers from Unstructured Animal Motions | Stelian Coros Team | 2507.00677 | null |
| 2025-07-01 | RoboEval: Where Robotic Manipulation Meets Structured and Scalable Evaluation | Siddhartha Srinivasa Team | 2507.00435 | null |
| 2025-07-01 | Adapt Your Body: Mitigating Proprioception Shifts in Imitation Learning | Yang Gao Team | 2506.23944 | null |
| 2025-06-30 | World4Omni: A Zero-Shot Framework from Image Generation World Model to Robotic Manipulation | Lin Shao Team | 2506.23919 | null |
| 2025-06-30 | Advancing Learnable Multi-Agent Pathfinding Solvers with Active Fine-Tuning | Alexey Skrynnik Team | 2506.23793 | null |
| 2025-06-30 | PAC Bench: Do Foundation Models Understand Prerequisites for Executing Manipulation Policies? | Ransalu Senanayake Team | 2506.23725 | null |
| 2025-07-04 | ParticleFormer: A 3D Point Cloud World Model for Multi-Object, Multi-Material Robotic Manipulation | Mac Schwager Team | 2506.23126 | null |
| 2025-06-29 | Learning Motion Skills with Adaptive Assistive Curriculum Force in Humanoid Robots | Yue Gao Team | 2506.23125 | null |
| 2025-06-28 | Hierarchical Vision-Language Planning for Multi-Step Humanoid Manipulation | Navid Azizan Team | 2506.22827 | null |
| 2025-06-28 | SPI-BoTER: Error Compensation for Industrial Robots via Sparse Attention Masking and Hybrid Loss with Spatial-Physical Information | Yuqiang Wu Team | 2506.22788 | null |
| 2025-06-28 | Learning Efficient Robotic Garment Manipulation with Standardization | Bin He Team | 2506.22769 | null |
| 2025-06-28 | RoboPearls: Editable Video Simulation for Robot Manipulation | Xiaodan Liang Team | 2506.22756 | null |
| 2025-06-27 | Spherical Pendulum with Quad-Rotor Thrust Vectoring Actuation – A Novel Mechatronics and Control Benchmark Platform | Tsu-Chin Tsao Team | 2506.22410 | null |
| 2025-06-27 | RoboEnvision: A Long-Horizon Video Generation Model for Multi-Task Robot Manipulation | Abhinav Valada Team | 2506.22007 | null |
| 2025-06-26 | Experimental investigation of pose informed reinforcement learning for skid-steered visual navigation | Venkat Krovi Team | 2506.21732 | null |
| 2025-06-24 | Ark: An Open-source Python-based Framework for Robot Learning | Haitham Bou-Ammar Team | 2506.21628 | null |
| 2025-06-24 | FrankenBot: Brain-Morphic Modular Orchestration for Robotic Manipulation with Vision-Language Models | Huiping Zhuang Team | 2506.21627 | null |
| 2025-06-26 | ACTLLM: Action Consistency Tuned Large Language Model | Chenliang Xu Team | 2506.21250 | null |
| 2025-07-02 | World-aware Planning Narratives Enhance Large Vision-Language Model Planner | Xipeng Qiu Team | 2506.21230 | null |
| 2025-06-26 | UAIbot: Beginner-friendly web-based simulator for interactive robotics learning and research | Vinicius Mariano Gonçalves Team | 2506.21178 | null |
| 2025-06-26 | Knowledge-Driven Imitation Learning: Enabling Generalization Across Diverse Conditions | Cewu Lu Team | 2506.21057 | null |
| 2025-06-26 | Parallels Between VLA Model Post-Training and Human Motor Learning: Progress, Challenges, and Trends | Zeng-Guang Hou Team | 2506.20966 | null |
| 2025-06-25 | Learning-Based Distance Estimation for 360° Single-Sensor Setups | Andreas Zell Team | 2506.20586 | null |
| 2025-06-25 | Learn to Position – A Novel Meta Method for Robotic Positioning | Xiaoming Tao Team | 2506.20445 | null |
| 2025-06-25 | Beyond-Expert Performance with Limited Demonstrations: Efficient Imitation Learning with Double Exploration | Quanquan Gu Team | 2506.20307 | null |
| 2025-06-24 | Unified Vision-Language-Action Model | Zhaoxiang Zhang Team | 2506.19850 | null |
| 2025-06-24 | T-Rex: Task-Adaptive Spatial Representation Extraction for Robotic Manipulation with Vision-Language Models | Qingyao Wu Team | 2506.19498 | null |
| 2025-06-24 | Is an object-centric representation beneficial for robotic manipulation ? | Liming Chen Team | 2506.19408 | null |
| 2025-06-24 | Robotic Perception with a Large Tactile-Vision-Language Model for Physical Property Inference | Nutan Chen Team | 2506.19303 | null |
| 2025-06-25 | AnchorDP3: 3D Affordance Guided Sparse Diffusion Policy for Robotic Manipulation | Hui Shen Team | 2506.19269 | null |
| 2025-06-24 | Robust Behavior Cloning Via Global Lipschitz Regularization | Sean B. Andersson Team | 2506.19250 | null |
| 2025-06-23 | CUPID: Curating Data your Robot Loves with Influence Functions | Jeannette Bohg Team | 2506.19121 | null |
| 2025-06-23 | Multimodal Anomaly Detection with a Mixture-of-Experts | Dongheui Lee Team | 2506.19077 | null |
| 2025-06-25 | FORTE: Tactile Force and Slip Sensing on Compliant Fingers for Delicate Manipulation | Lillian Chin Team | 2506.18960 | null |
| 2025-06-23 | RAG-6DPose: Retrieval-Augmented 6D Pose Estimation via Leveraging CAD as Knowledge Base | Xiangyang Xue Team | 2506.18856 | null |
| 2025-06-23 | SViP: Sequencing Bimanual Visuomotor Policies with Object-Centric Motion Primitives | Jia Pan Team | 2506.18825 | null |
| 2025-06-23 | Learning Point Correspondences In Radar 3D Point Clouds For Radar-Inertial Odometry | Jan Steinbrener Team | 2506.18580 | null |
| 2025-06-23 | Robots and Children that Learn Together : Improving Knowledge Retention by Teaching Peer-Like Interactive Robots | Alessandro Di Nuovo Team | 2506.18365 | null |
| 2025-06-23 | Robotic Manipulation of a Rotating Chain with Bottom End Fixed | Quang-Cuong Pham Team | 2506.18355 | null |
| 2025-06-23 | Sharpening the Spear: Adaptive Expert-Guided Adversarial Attack Against DRL-based Autonomous Driving Policies | Xiaolin Chang Team | 2506.18304 | null |
| 2025-06-23 | Learning Approach to Efficient Vision-based Active Tracking of a Flying Target by an Unmanned Aerial Vehicle | Souma Chowdhury Team | 2506.18264 | null |
| 2025-06-22 | RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation | Yao Mu Team | 2506.18088 | null |
| 2025-06-21 | RLRC: Reinforcement Learning-based Recovery for Compressed Vision-Language-Action Models | Xiao Li Team | 2506.17639 | null |
| 2025-06-21 | Imitation Learning for Active Neck Motion Enabling Robot Manipulation beyond the Field of View | Yasuo Kuniyoshi Team | 2506.17624 | null |
| 2025-06-20 | Kinematic Model Optimization via Differentiable Contact Manifold for In-Space Manipulation | Satyandra K. Gupta Team | 2506.17458 | null |
| 2025-06-20 | Monocular One-Shot Metric-Depth Alignment for RGB-Based Robot Grasping | Jingjin Yu Team | 2506.17110 | null |
| 2025-06-24 | Learning Accurate Whole-body Throwing with High-frequency Residual Policy and Pullback Tube Acceleration | Marco Hutter Team | 2506.16986 | null |
| 2025-06-20 | Compliant Residual DAgger: Improving Real-World Contact-Rich Manipulation with Human Corrections | Shuran Song Team | 2506.16685 | null |
| 2025-06-19 | CodeDiffuser: Attention-Enhanced Diffusion Policy via VLM-Generated Code for Instruction Ambiguity | Yunzhu Li Team | 2506.16652 | null |
| 2025-06-19 | Reimagination with Test-time Observation Interventions: Distractor-Robust World Model Predictions for Visual Model Predictive Control | Ran Tian Team | 2506.16565 | null |
| 2025-06-19 | An Optimization-Augmented Control Framework for Single and Coordinated Multi-Arm Robotic Manipulation | Ozgur S. Oguz Team | 2506.16555 | null |
| 2025-06-19 | Human2LocoMan: Learning Versatile Quadrupedal Manipulation with Human Pretraining | Ding Zhao Team | 2506.16475 | null |
| 2025-06-19 | GoalLadder: Incremental Goal Discovery with Vision-Language Models | Shimon Whiteson Team | 2506.16396 | null |
| 2025-06-19 | CapsDT: Diffusion-Transformer for Capsule Robot Manipulation | Hongliang Ren Team | 2506.16263 | null |
| 2025-06-19 | ControlVLA: Few-shot Object-centric Adaptation for Pre-trained Vision-Language-Action Models | Siyuan Huang Team | 2506.16211 | null |
| 2025-06-19 | FlowRAM: Grounding Flow Matching Policy with Region-Aware Mamba Framework for Robotic Manipulation | Wei Tang Team | 2506.16201 | null |
| 2025-06-19 | ViTacFormer: Learning Cross-Modal Representation for Visuo-Tactile Dexterous Manipulation | Jitendra Malik Team | 2506.15953 | null |
| 2025-06-18 | Learning from Planned Data to Improve Robotic Pick-and-Place Planning Efficiency | Kensuke Harada Team | 2506.15920 | null |
| 2025-06-18 | Improving Robotic Manipulation: Techniques for Object Pose Estimation, Accommodating Positional Uncertainty, and Disassembly Tasks from Examples | Viral Rasik Galaiya Team | 2506.15865 | null |
| 2025-06-18 | Vision in Action: Learning Active Perception from Human Demonstrations | Shuran Song Team | 2506.15666 | null |
| 2025-06-18 | Learning Task-Agnostic Skill Bases to Uncover Motor Primitives in Animal Behaviors | Anqi Wu Team | 2506.15190 | null |
| 2025-06-18 | Robust Instant Policy: Leveraging Student’s t-Regression Model for Robust In-context Imitation Learning of Robot Manipulation | Yukiyasu Domae Team | 2506.15157 | null |
| 2025-06-18 | TACT: Humanoid Whole-body Contact Manipulation through Deep Imitation Learning with Tactile Modality | Eiichi Yoshida Team | 2506.15146 | null |
| 2025-06-17 | RobotSmith: Generative Robotic Tool Design for Acquisition of Complex Manipulation Skills | Chuang Gan Team | 2506.14763 | null |
| 2025-06-17 | Tactile Beyond Pixels: Multisensory Touch Representations for Robot Manipulation | Mustafa Mukadam Team | 2506.14754 | null |
| 2025-06-17 | SENIOR: Efficient Query Selection and Preference-Guided Exploration in Preference-based Reinforcement Learning | Shuo Wang Team | 2506.14648 | null |
| 2025-06-17 | Latent Action Diffusion for Cross-Embodiment Manipulation | Robert K. Katzschmann Team | 2506.14608 | null |
| 2025-06-19 | ClutterDexGrasp: A Sim-to-Real System for General Dexterous Grasping in Cluttered Scenes | Hao Dong Team | 2506.14317 | null |
| 2025-06-17 | Steering Robots with Inference-Time Interactions | Yanwei Wang Team | 2506.14287 | null |
| 2025-06-17 | AMPLIFY: Actionless Motion Priors for Robot Learning from Videos | Animesh Garg Team | 2506.14198 | null |
| 2025-06-17 | Non-Overlap-Aware Egocentric Pose Estimation for Collaborative Perception in Connected Autonomy | Peng Gao Team | 2506.14180 | null |
| 2025-06-17 | GAF: Gaussian Action Field as a Dynamic World Model for Robotic Manipulation | Yebin Liu Team | 2506.14135 | null |
| 2025-06-16 | ATK: Automatic Task-driven Keypoint Selection for Robust Policy Learning | Abhishek Gupta Team | 2506.13867 | null |
| 2025-06-16 | Touch begins where vision ends: Generalizable policies for contact-rich manipulation | Raunaq Bhirangi Team | 2506.13762 | null |
| 2025-06-16 | Prompting with the Future: Open-World Model Predictive Control with Interactive Digital Twins | Wei-Chiu Ma Team | 2506.13761 | null |
| 2025-06-16 | What Matters in Learning from Large-Scale Datasets for Robot Manipulation | Danfei Xu Team | 2506.13536 | null |
| 2025-06-16 | A Survey on Imitation Learning for Contact-Rich Tasks in Robotics | Arash Ajoudani Team | 2506.13498 | null |
| 2025-06-16 | Learning Swing-up Maneuvers for a Suspended Aerial Manipulation Platform in a Hierarchical Control Framework | Christian Ott Team | 2506.13478 | null |
| 2025-06-16 | VLM-SFD: VLM-Assisted Siamese Flow Diffusion Framework for Dual-Arm Cooperative Manipulation | Wei Pan Team | 2506.13428 | null |
| 2025-06-15 | SP-VLA: A Joint Model Scheduling and Token Pruning Approach for VLA Model Acceleration | Wenwu Zhu Team | 2506.12723 | null |
| 2025-06-15 | Adapting by Analogy: OOD Generalization of Visuomotor Policies via Functional Correspondence | Andrea Bajcsy Team | 2506.12678 | null |
| 2025-06-15 | Goal-based Self-Adaptive Generative Adversarial Imitation Learning (Goal-SAGAIL) for Multi-goal Robotic Manipulation Tasks | George Vogiatzis Team | 2506.12676 | null |
| 2025-06-14 | AntiGrounding: Lifting Robotic Actions into VLM Representation Space for Decision Making | Qingyao Wu Team | 2506.12374 | null |
| 2025-06-13 | Role of Uncertainty in Model Development and Control Design for a Manufacturing Process | Francis Assadian Team | 2506.12273 | null |
| 2025-06-13 | SAIL: Faster-than-Demonstration Execution of Imitation Learning Policies | Danfei Xu Team | 2506.11948 | null |
| 2025-06-13 | mimic-one: a Scalable Model Recipe for General Purpose Robot Dexterity | Robert K. Katzschmann Team | 2506.11916 | null |
| 2025-06-13 | ExoStart: Efficient learning for dexterous manipulation with sensorized exoskeleton demonstrations | Maria Bauza Villalonga Team | 2506.11775 | null |
| 2025-06-13 | Control Architecture and Design for a Multi-robotic Visual Servoing System in Automated Manufacturing Environment | Rongfei Li Team | 2506.11387 | null |
| 2025-06-12 | Influence Functions for Data Attribution in Linear System Identification and LQR Control | Dongmei Chen Team | 2506.11293 | null |
| 2025-06-12 | Gondola: Grounded Vision Language Planning for Generalizable Robotic Manipulation | Cordelia Schmid Team | 2506.11261 | null |
| 2025-06-12 | Eye, Robot: Learning to Look to Act with a BC-RL Perception-Action Loop | Angjoo Kanazawa Team | 2506.10968 | null |
| 2025-06-12 | GENMANIP: LLM-driven Simulation for Generalizable Instruction-Following Manipulation | Jiangmiao Pang Team | 2506.10966 | null |
| 2025-06-12 | Human-Robot Navigation using Event-based Cameras and Reinforcement Learning | Rodrigo Verschae Team | 2506.10790 | null |
| 2025-06-12 | Demonstrating Multi-Suction Item Picking at Scale via Multi-Modal Learning of Pick Success | Kapil Katyal Team | 2506.10359 | null |
| 2025-06-11 | Innovative Adaptive Imaged Based Visual Servoing Control of 6 DoFs Industrial Robot Manipulators | Francis Assadian Team | 2506.10240 | null |
| 2025-06-11 | One For All: LLM-based Heterogeneous Mission Planning in Precision Agriculture | Stefano Carpin Team | 2506.10106 | null |
| 2025-06-11 | eFlesh: Highly customizable Magnetic Touch Sensing using Cut-Cell Microstructures | Raunaq Bhirangi Team | 2506.09994 | null |
| 2025-06-11 | Chain-of-Action: Trajectory Autoregressive Modeling for Robotic Manipulation | Xiao Ma Team | 2506.09990 | null |
| 2025-06-11 | From Intention to Execution: Probing the Generalization Boundaries of Vision-Language-Action Models | Chen Feng Team | 2506.09930 | null |
| 2025-06-11 | Reinforced Refinement with Self-Aware Expansion for End-to-End Autonomous Driving | Chen Lv Team | 2506.09800 | null |
| 2025-06-11 | CHIP: A multi-sensor dataset for 6D pose estimation of chairs in industrial settings | Davide Boscaini Team | 2506.09699 | null |
| 2025-06-11 | Advances on Affordable Hardware Platforms for Human Demonstration Acquisition in Agricultural Applications | Néstor García Team | 2506.09494 | null |
| 2025-06-11 | DCIRNet: Depth Completion with Iterative Refinement for Dexterous Grasping of Transparent and Reflective Objects | Hong Liu Team | 2506.09491 | null |
| 2025-06-11 | Time-Unified Diffusion Policy with Action Discrimination for Robotic Manipulation | Le Wang Team | 2506.09422 | null |
| 2025-06-11 | Analyzing Key Objectives in Human-to-Robot Retargeting for Dexterous Manipulation | Xiang Li Team | 2506.09384 | null |
| 2025-06-11 | ContextBuddy: AI-Enhanced Contextual Insights for Security Alert Investigation (Applied to Intrusion Detection) | Cecile Paris Team | 2506.09365 | null |
| 2025-06-10 | UAD: Unsupervised Affordance Distillation for Generalization in Robotic Manipulation | Li Fei-Fei Team | 2506.09284 | null |
| 2025-06-10 | Robot-Gated Interactive Imitation Learning with Adaptive Intervention Mechanism | Bolei Zhou Team | 2506.09176 | null |
| 2025-06-10 | FreqPolicy: Efficient Flow-based Visuomotor Policy via Frequency Consistency | Jian Tang Team | 2506.08822 | null |
| 2025-06-10 | Towards Biosignals-Free Autonomous Prosthetic Hand Control via Imitation Learning | Xianta Jiang Team | 2506.08795 | null |
| 2025-06-10 | Bayesian Inverse Physics for Neuro-Symbolic Robot Learning | Frank Kirchner Team | 2506.08756 | null |
| 2025-06-10 | Deep Reinforcement Learning-Based Motion Planning and PDE Control for Flexible Manipulators | Jouni Mattila Team | 2506.08639 | null |
| 2025-06-10 | RoboSwap: A GAN-driven Video Diffusion Framework For Unsupervised Robot Arm Swapping | Gitta Kutyniok Team | 2506.08632 | null |
| 2025-06-10 | Periodic Bipedal Gait Learning Using Reward Composition Based on a Novel Gait Planner for Humanoid Robots | Lijun Zhu Team | 2506.08416 | null |
| 2025-06-11 | HiBerNAC: Hierarchical Brain-emulated Robotic Neural Agent Collective for Disentangling Complex Manipulation | Cong Wang Team | 2506.08296 | null |
| 2025-06-09 | ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving | Xinggang Wang Team | 2506.08052 | null |
| 2025-06-09 | BridgeVLA: Input-Output Alignment for Efficient 3D Manipulation Learning with Vision-Language Models | Tieniu Tan Team | 2506.07961 | null |
| 2025-06-09 | BitVLA: 1-bit Vision-Language-Action Models for Robotics Manipulation | Xilin Chen Team | 2506.07530 | null |
| 2025-06-09 | Reinforcement Learning via Implicit Imitation Guidance | Chelsea Finn Team | 2506.07505 | null |
| 2025-06-09 | RAPID Hand: A Robust, Affordable, Perception-Integrated, Dexterous Manipulation Platform for Generalist Robot Autonomy | Hui Cheng Team | 2506.07490 | null |
| 2025-06-08 | CARoL: Context-aware Adaptation for Robot Learning | Xuan Wang Team | 2506.07006 | null |
| 2025-06-07 | SpikePingpong: High-Frequency Spike Vision-based Robot Learning for Precise Striking in Table Tennis Game | Shanghang Zhang Team | 2506.06690 | null |
| 2025-06-07 | RoboCerebra: A Large-scale Benchmark for Long-horizon Robotic Manipulation Evaluation | Si Liu Team | 2506.06677 | null |
| 2025-06-07 | Self-Adapting Improvement Loops for Robotic Learning | Chen Sun Team | 2506.06658 | null |
| 2025-06-06 | Enhancing Robot Safety via MLLM-Based Semantic Interpretation of Failure Data | Somil Bansal Team | 2506.06570 | null |
| 2025-06-06 | NeSyPack: A Neuro-Symbolic Framework for Bimanual Logistics Packing | Changliu Liu Team | 2506.06567 | null |
| 2025-06-06 | MapleGrasp: Mask-guided Feature Pooling for Language-driven Efficient Robotic Grasping | Farshad Khorrami Team | 2506.06535 | null |
| 2025-06-06 | 3DFlowAction: Learning Cross-Embodiment Manipulation from 3D Flow World Model | Mingkui Tan Team | 2506.06199 | null |
| 2025-06-06 | Bridging Perception and Action: Spatially-Grounded Mid-Level Representations for Robot Generalization | Tingnan Zhang Team | 2506.06196 | null |
| 2025-06-10 | BEAST: Efficient Tokenization of B-Splines Encoded Action Sequences for Imitation Learning | Rudolf Lioutikov Team | 2506.06072 | null |
| 2025-06-06 | Dynamic Mixture of Progressive Parameter-Efficient Expert Library for Lifelong Robot Learning | Ping Luo Team | 2506.05985 | null |
| 2025-06-06 | Optimal Robotic Velcro Peeling with Force Feedback | Volkan Isler Team | 2506.05812 | null |
| 2025-06-06 | Where Do We Look When We Teach? Analyzing Human Gaze Behavior Across Demonstration Devices in Robot Imitation Learning | Hiroshi Bito Team | 2506.05808 | null |
| 2025-06-06 | FlowOE: Imitation Learning with Flow Policy from Ensemble RL Experts for Optimal Execution under Heston Volatility and Concave Market Impacts | Zhi Chen Team | 2506.05755 | null |
| 2025-06-06 | You Only Estimate Once: Unified, One-stage, Real-Time Category-level Articulated Object 6D Pose Estimation for Robotic Grasping | Xiangyang Xue Team | 2506.05719 | null |
| 2025-06-05 | A Smooth Sea Never Made a Skilled $\texttt{SAILOR}$: Robust Imitation via Learning to Search | Gokul Swamy Team | 2506.05294 | null |
| 2025-06-05 | LiPo: A Lightweight Post-optimization Framework for Smoothing Action Chunks Generated by Learned Policies | Suhan Park Team | 2506.05165 | null |
| 2025-06-05 | DemoSpeedup: Accelerating Visuomotor Policies via Entropy-Guided Demonstration Acceleration | Huazhe Xu Team | 2506.05064 | null |
| 2025-06-06 | ArtVIP: Articulated Digital Assets of Visual Realism, Modular Interaction, and Physical Fidelity for Robot Learning | Jian Tang Team | 2506.04941 | null |
| 2025-06-05 | Learning dissection trajectories from expert surgical videos via imitation learning with equivariant diffusion | Qi Dou Team | 2506.04716 | null |
| 2025-06-05 | Advancing Tool-Augmented Large Language Models via Meta-Verification and Reflection Learning | Wanxiang Che Team | 2506.04625 | null |
| 2025-06-04 | SGN-CIRL: Scene Graph-based Navigation with Curriculum, Imitation, and Reinforcement Learning | Aleksandr Panov Team | 2506.04505 | null |
| 2025-06-04 | Object-centric 3D Motion Field for Robot Learning from Human Videos | Pieter Abbeel Team | 2506.04227 | null |
| 2025-06-04 | Splatting Physical Scenes: End-to-End Real-to-Sim from Imperfect Robot Data | Leonard Hasenclever Team | 2506.04120 | null |
| 2025-06-04 | STAR: Learning Diverse Robot Skill Abstractions through Rotation-Augmented Vector Quantization | Liqiang Nie Team | 2506.03863 | link |
| 2025-06-04 | SwitchVLA: Execution-Aware Task Switching for Vision-Language-Action Models | Jian Tang Team | 2506.03574 | null |
| 2025-06-05 | Confidence-Guided Human-AI Collaboration: Reinforcement Learning with Distributional Proxy Value Propagation for Autonomous Driving | Hu Chuan Team | 2506.03568 | link |
| 2025-06-03 | ORV: 4D Occupancy-centric Robot Video Generation | Hao Zhao Team | 2506.03079 | null |
| 2025-06-03 | Geometric Visual Servo Via Optimal Transport | Ashutosh Tiwari Team | 2506.02768 | null |
| 2025-06-03 | Rodrigues Network for Learning Robot Actions | Leonidas Guibas Team | 2506.02618 | null |
| 2025-06-03 | Reachability Weighted Offline Goal-conditioned Resampling | Joni Pajarinen Team | 2506.02577 | null |
| 2025-06-02 | Fast-in-Slow: A Dual-System Foundation Model Unifying Fast Manipulation within Slow Reasoning | Pheng-Ann Heng Team | 2506.01953 | null |
| 2025-06-02 | Feel the Force: Contact-Driven Learning from Humans | Lerrel Pinto Team | 2506.01944 | null |
| 2025-06-02 | Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control | Dahua Lin Team | 2506.01943 | null |
| 2025-06-02 | FreeTacMan: Robot-free Visuo-Tactile Data Collection System for Contact-rich Manipulation | Hongyang Li Team | 2506.01941 | null |
| 2025-06-02 | Learning with pyCub: A New Simulation and Exercise Framework for Humanoid Robotics | Matej Hoffmann Team | 2506.01756 | null |
| 2025-06-02 | Reasoning-Table: Exploring Reinforcement Learning for Table Reasoning | Kang Liu Team | 2506.01710 | link |
| 2025-06-02 | WoMAP: World Models For Embodied Open-Vocabulary Object Localization | Anirudha Majumdar Team | 2506.01600 | null |
| 2025-06-02 | FreqPolicy: Frequency Autoregressive Visuomotor Policy with Continuous Tokens | Yuexin Ma Team | 2506.01583 | null |
| 2025-06-02 | Trajectory First: A Curriculum for Discovering Diverse Policies | Marc Toussaint Team | 2506.01568 | null |
| 2025-06-02 | Variational Adaptive Noise and Dropout towards Stable Recurrent Neural Networks | Shingo Murata Team | 2506.01350 | null |
| 2025-06-01 | OG-VLA: 3D-Aware Vision Language Action Model via Orthographic Image Generation | Valts Blukis Team | 2506.01196 | null |
| 2025-06-01 | HoMeR: Learning In-the-Wild Mobile Manipulation via Hybrid Imitation and Whole-Body Control | Jeannette Bohg Team | 2506.01185 | null |
| 2025-06-01 | Jailbreak-R1: Exploring the Jailbreak Capabilities of LLMs via Reinforcement Learning | Jing Li Team | 2506.00782 | null |
| 2025-05-31 | XYZ-IBD: High-precision Bin-picking Dataset for Object 6D Pose Estimation Capturing Real-world Industrial Complexity | Benjamin Busam Team | 2506.00599 | null |
| 2025-05-31 | Dyna-Think: Synergizing Reasoning, Acting, and World Model Simulation in AI Agents | Zhou Yu Team | 2506.00320 | null |
| 2025-05-30 | 3D Gaussian Splat Vulnerabilities | Polo Chau Team | 2506.00280 | null |
| 2025-05-30 | Bi-Manual Joint Camera Calibration and Scene Representation | Weiming Zhi Team | 2505.24819 | null |
| 2025-05-30 | MagicGripper: A Multimodal Sensor-Integrated Gripper for Contact-Rich Robotic Manipulation | Dandan Zhang Team | 2505.24382 | null |
| 2025-05-30 | Imitation Learning-Based Path Generation for the Complex Assembly of Deformable Objects | Christoffer Sloth Team | 2505.24339 | null |
| 2025-05-30 | SR3D: Unleashing Single-view 3D Reconstruction for Transparent and Specular Object Grasping | Hao Dong Team | 2505.24305 | null |
| 2025-05-30 | Safety-Aware Robust Model Predictive Control for Robotic Arms in Dynamic Environments | Suwoong Lee Team | 2505.24209 | null |
| 2025-05-30 | Learning Gentle Humanoid Locomotion and End-Effector Stabilization Control | Guanya Shi Team | 2505.24198 | null |
| 2025-05-29 | Mobi-$π$: Mobilizing Your Robot Learning Policy | Jeannette Bohg Team | 2505.23692 | null |
| 2025-05-30 | Normalizing Flows are Capable Models for RL | Benjamin Eysenbach Team | 2505.23527 | null |
| 2025-05-29 | Optimization-based Posture Generation for Whole-body Contact Motion by Contact Point Search on the Body Surface | Masayuki Inaba Team | 2505.23501 | null |
| 2025-05-29 | Agentic Robot: A Brain-Inspired Framework for Vision-Language-Action Models in Embodied Agents | Lichao Sun Team | 2505.23450 | null |
| 2025-05-29 | Enhanced DACER Algorithm with High Diffusion Efficiency | Shengbo Eben Li Team | 2505.23426 | null |
| 2025-05-29 | RoboTransfer: Geometry-Consistent Video Diffusion for Robotic Visual Policy Transfer | Zhizhong Su Team | 2505.23171 | null |
| 2025-05-28 | SCIZOR: A Self-Supervised Approach to Data Curation for Large-Scale Imitation Learning | Yuke Zhu Team | 2505.22626 | null |
| 2025-05-28 | Hybrid Learning for Cold-Start-Aware Microservice Scheduling in Dynamic Edge Environments | Weijia Jia Team | 2505.22424 | link |
| 2025-05-28 | Efficient Precision-Scalable Hardware for Microscaling (MX) Processing in Robotics Learning | Marian Verhelst Team | 2505.22404 | null |
| 2025-05-28 | State and Input Constrained Adaptive Tracking Control of Uncertain Euler-Lagrange Systems with Robustness and Feasibility Analysis | Shubhendu Bhasin Team | 2505.22352 | null |
| 2025-05-28 | ForceVLA: Enhancing VLA Models with a Force-aware MoE for Contact-rich Manipulation | Wenqiang Zhang Team | 2505.22159 | null |
| 2025-05-28 | Learning Compositional Behaviors from Demonstration and Language | Jiajun Wu Team | 2505.21981 | null |
| 2025-05-29 | ChatVLA-2: Vision-Language-Action Model with Open-World Embodied Reasoning from Pretrained Knowledge | Yi Xu Team | 2505.21906 | null |
| 2025-05-28 | Streaming Flow Policy: Simplifying diffusion/flow-matching policies by treating action trajectories as flow trajectories | Siddharth Ancha Team | 2505.21851 | null |
| 2025-05-27 | PartInstruct: Part-level Instruction Following for Fine-grained Robot Manipulation | Tianmin Shu Team | 2505.21652 | null |
| 2025-05-30 | Right Side Up? Disentangling Orientation Understanding in MLLMs with Fine-grained Multi-axis Perception Tasks | Bryan A. Plummer Team | 2505.21649 | null |
| 2025-05-27 | CLAMP: Crowdsourcing a LArge-scale in-the-wild haptic dataset with an open-source device for Multimodal robot Perception | Tapomayukh Bhattacharjee Team | 2505.21495 | null |
| 2025-05-27 | EquAct: An SE(3)-Equivariant Multi-Task Transformer for Open-Loop Robotic Manipulation | Robert Platt Team | 2505.21351 | null |
| 2025-05-27 | EgoWalk: A Multimodal Dataset for Robot Navigation in the Wild | Gonzalo Ferrer Team | 2505.21282 | null |
| 2025-05-27 | Learning What to Do and What Not To Do: Offline Imitation from Expert and Undesirable Demonstrations | Tanvi Verma Team | 2505.21182 | null |
| 2025-05-27 | Object-Centric Action-Enhanced Representations for Robot Visuo-Motor Policy Learning | George Retsinas Team | 2505.20962 | null |
| 2025-05-27 | Learning Unified Force and Position Control for Legged Loco-Manipulation | Siyuan Huang Team | 2505.20829 | null |
| 2025-05-27 | Spatial RoboGrasp: Generalized Robotic Grasping Control Policy | Luhui Hu Team | 2505.20814 | null |
| 2025-05-27 | Learning Generalizable Robot Policy with Human Demonstration Video as a Prompt | Jianyu Chen Team | 2505.20795 | null |
| 2025-05-28 | ControlTac: Force- and Position-Controlled Tactile Data Augmentation with a Single Reference Image | Ruohan Gao Team | 2505.20498 | null |
| 2025-05-26 | OSVI-WM: One-Shot Visual Imitation for Unseen Tasks using World-Model-Guided Trajectory Generation | Farshad Khorrami Team | 2505.20425 | null |
| 2025-05-26 | Co-Design of Soft Gripper with Neural Physics | Xiaolong Wang Team | 2505.20404 | null |
| 2025-05-26 | EgoZero: Robot Learning from Smart Glasses | Lerrel Pinto Team | 2505.20290 | null |
| 2025-05-26 | URPlanner: A Universal Paradigm For Collision-Free Robotic Motion Planning Based on Deep Reinforcement Learning | Marcelo H. Ang Jr Team | 2505.20175 | null |
| 2025-05-27 | MineAnyBuild: Benchmarking Spatial Planning for Open-world AI Agents | Xiaodan Liang Team | 2505.20148 | link |
| 2025-05-26 | ReasonPlan: Unified Scene Prediction and Decision Reasoning for Closed-loop Autonomous Driving | Dongbin Zhao Team | 2505.20024 | link |
| 2025-05-26 | Inverse Q-Learning Done Right: Offline Imitation Learning in $Q^π$-Realizable MDPs | Luca Viano Team | 2505.19946 | null |
| 2025-05-26 | TeViR: Text-to-Video Reward with Diffusion Models for Efficient Reinforcement Learning | Dongbin Zhao Team | 2505.19769 | null |
| 2025-05-26 | Extremum Flow Matching for Offline Goal Conditioned Reinforcement Learning | Jean-Baptiste Mouret Team | 2505.19717 | null |
| 2025-05-25 | Structured Reinforcement Learning for Combinatorial Decision-Making | Maximilian Schiffer Team | 2505.19053 | link |
| 2025-05-25 | WorldEval: World Model as Real-World Robot Policies Evaluator | Yi Xu Team | 2505.19017 | null |
| 2025-05-25 | Online Knowledge Distillation with Reward Guidance | Chen Jia Team | 2505.18952 | null |
| 2025-05-24 | Guided by Guardrails: Control Barrier Functions as Safety Instructors for Robotic Learning | Giovanni Beltrame Team | 2505.18858 | null |
| 2025-05-24 | On the Dual-Use Dilemma in Physical Reasoning and Force | Nikolaus Correll Team | 2505.18792 | null |
| 2025-05-24 | VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning | Ziwei Wang Team | 2505.18719 | null |
| 2025-05-24 | MisoDICE: Multi-Agent Imitation from Unlabeled Mixed-Quality Demonstrations | Hong Thanh Nguyen Team | 2505.18595 | null |
| 2025-05-24 | Grounding Bodily Awareness in Visual Representations for Efficient Policy Learning | Zhiyun Lin Team | 2505.18487 | null |
| 2025-05-24 | Canonical Policy: Learning Canonical 3D Representation for Equivariant Policy | Yu She Team | 2505.18474 | null |
| 2025-05-24 | ManiFeel: Benchmarking and Understanding Visuotactile Manipulation Policy Learning | Yu She Team | 2505.18472 | null |
| 2025-05-23 | ProgRM: Build Better GUI Agents with Progress Rewards | Kai Yu Team | 2505.18121 | null |
| 2025-05-23 | Classification of assembly tasks combining multiple primitive actions using Transformers and xLSTMs | Pedro Neto Team | 2505.18012 | null |
| 2025-05-23 | Is Single-View Mesh Reconstruction Ready for Robotics? | Ingmar Posner Team | 2505.17966 | null |
| 2025-05-23 | SynRES: Towards Referring Expression Segmentation in the Wild via Synthetic Data | Donghyun Kim Team | 2505.17695 | null |
| 2025-05-23 | Learning Equilibria from Data: Provably Efficient Multi-Agent Imitation Learning | Giorgia Ramponi Team | 2505.17610 | null |
| 2025-05-23 | Dynamic Manipulation of Deformable Objects in 3D: Simulation, Benchmark and Learning Strategy | Bin Zhao Team | 2505.17434 | null |
| 2025-05-23 | Bootstrapping Imitation Learning for Long-horizon Manipulation via Hierarchical Data Collection Space | Hui Cheng Team | 2505.17389 | null |
| 2025-05-22 | ScanBot: Towards Intelligent Surface Scanning in Embodied Robotic Systems | Farhad Imani Team | 2505.17295 | null |
| 2025-05-22 | CoMo: Learning Continuous Latent Motion from Internet Videos for Scalable Robot Learning | Limin Wang Team | 2505.17006 | null |
| 2025-05-22 | 3D Equivariant Visuomotor Policy Learning via Spherical Projection | Robin Walters Team | 2505.16969 | null |
| 2025-05-22 | Efficient Online RL Fine Tuning with Offline Pre-trained Policy Only | Donglin Wang Team | 2505.16856 | null |
| 2025-05-22 | Find the Fruit: Designing a Zero-Shot Sim2Real Deep RL Planner for Occlusion Aware Plant Manipulation | Soumik Sarkar Team | 2505.16547 | null |
| 2025-05-24 | ManipLVM-R1: Reinforcement Learning for Reasoning in Embodied Manipulation with Large Vision-Language Models | Xiuying Chen Team | 2505.16517 | null |
| 2025-05-22 | Raw2Drive: Reinforcement Learning with Aligned World Models for End-to-End Autonomous Driving (in CARLA v2) | Junchi Yan Team | 2505.16394 | null |
| 2025-05-22 | TacCompress: A Benchmark for Multi-Point Tactile Data Compression in Dexterous Manipulation | Hengdi Zhang Team | 2505.16289 | null |
| 2025-05-22 | SEM: Enhancing Spatial Understanding for Robust Robot Manipulation | Zhizhong Su Team | 2505.16196 | null |
| 2025-05-22 | Tactile-based Reinforcement Learning for Adaptive Grasping under Observation Uncertainties | Yang Ye Team | 2505.16167 | null |
| 2025-05-21 | WaveTouch: Active Tactile Sensing Using Vibro-Feedback for Classification of Variable Stiffness and Infill Density Objects | Bakhtiyar Orazbayev Team | 2505.16062 | null |
| 2025-05-25 | Proactive Hierarchical Control Barrier Function-Based Safety Prioritization in Close Human-Robot Interaction Scenarios | Prashanth Krishnamurthy Team | 2505.16055 | null |
| 2025-05-21 | UAV-Flow Colosseo: A Real-World Benchmark for Flying-on-a-Word UAV Imitation Learning | Si Liu Team | 2505.15725 | null |
| 2025-05-21 | Exploring the Limits of Vision-Language-Action Manipulations in Cross-task Generalization | Junwei Liang Team | 2505.15660 | null |
| 2025-05-21 | FLARE: Robot Learning with Implicit World Modeling | Linxi Fan Team | 2505.15659 | null |
| 2025-05-21 | Robo2VLM: Visual Question Answering from Large-Scale In-the-Wild Robot Manipulation Datasets | Ken Goldberg Team | 2505.15517 | null |
| 2025-05-21 | Guided Policy Optimization under Partial Observability | Zongqing Lu Team | 2505.15418 | link |
| 2025-05-21 | Saliency-Aware Quantized Imitation Learning for Efficient Robotic Control | Jungwook Choi Team | 2505.15304 | null |
| 2025-05-21 | Learning-based Autonomous Oversteer Control and Collision Avoidance | Seung-Hyun Kong Team | 2505.15275 | null |
| 2025-05-21 | Filtering Learning Histories Enhances In-Context Reinforcement Learning | Santiago Paternain Team | 2505.15143 | null |
| 2025-05-21 | Object-Focus Actor for Data-efficient Robot Generalization Dexterous Manipulation | Xiaodong He Team | 2505.15098 | null |
| 2025-05-20 | RoboCulture: A Robotics Platform for Automated Biological Experimentation | Milica Radisic Team | 2505.14941 | null |
| 2025-05-20 | Imitation Learning via Focused Satisficing | Brian Ziebart Team | 2505.14820 | null |
| 2025-05-20 | DORA: Object Affordance-Guided Reinforcement Learning for Dexterous Robotic Manipulation | Jianwei Zhang Team | 2505.14819 | null |
| 2025-05-20 | Vid2World: Crafting Video Diffusion Models to Interactive World Models | Mingsheng Long Team | 2505.14357 | null |
| 2025-05-20 | AutoBio: A Simulation and Benchmark for Robotic Automation in Digital Biology Laboratory | Ping Luo Team | 2505.14030 | null |
| 2025-05-20 | RLVR-World: Training World Models with Reinforcement Learning | Mingsheng Long Team | 2505.13934 | link |
| 2025-05-20 | Time Reversal Symmetry for Efficient Robotic Manipulations in Deep Reinforcement Learning | Yutong Ban Team | 2505.13925 | null |
| 2025-05-20 | Learning to Insert for Constructive Neural Vehicle Routing Solver | Qingfu Zhang Team | 2505.13904 | null |
| 2025-05-20 | Structured Agent Distillation for Large Language Model | Yanzhi Wang Team | 2505.13820 | null |
| 2025-05-21 | Adaptive Diffusion Constrained Sampling for Bimanual Robot Manipulation | Georgia Chalvatzaki Team | 2505.13667 | null |
| 2025-05-19 | TD-GRPC: Temporal Difference Learning with Group Relative Policy Constraint for Humanoid Locomotion | Minh Nhat Vu Team | 2505.13549 | null |
| 2025-05-19 | GraspMolmo: Generalizable Task-Oriented Grasping via Large-Scale Synthetic Data Generation | Rose Hendrix Team | 2505.13441 | null |
| 2025-05-19 | KinTwin: Imitation Learning with Torque and Muscle Driven Biomechanical Models Enables Precise Replication of Able-Bodied and Impaired Movement from Markerless Motion Capture | R. James Cotton Team | 2505.13436 | null |
| 2025-05-19 | TeleOpBench: A Simulator-Centric Benchmark for Dual-Arm Dexterous Teleoperation | Jiangmiao Pang Team | 2505.12748 | null |
| 2025-05-19 | Incentivizing Multimodal Reasoning in Large Models for Direct Robot Manipulation | Chi-Wing Fu Team | 2505.12744 | null |
| 2025-05-19 | Option-aware Temporally Abstracted Value for Offline Goal-Conditioned Reinforcement Learning | Taesup Moon Team | 2505.12737 | null |
| 2025-05-19 | DreamGen: Unlocking Generalization in Robot Learning through Neural Trajectories | Linxi Fan Team | 2505.12705 | null |
| 2025-05-19 | Dribble Master: Learning Agile Humanoid Dribbling Through Legged Locomotion | Qi Wu Team | 2505.12679 | null |
| 2025-05-19 | HIL: Hybrid Imitation Learning of Diverse Parkour Skills from Videos | Xue Bin Peng Team | 2505.12619 | null |
| 2025-05-18 | MTIL: Encoding Full History with Mamba for Temporal Imitation Learning | Zhouping Yin Team | 2505.12410 | link |
| 2025-05-18 | PartDexTOG: Generating Dexterous Task-Oriented Grasping via Language-driven Part Analysis | Zhipong Cai Team | 2505.12294 | null |
| 2025-05-20 | RoboFAC: A Comprehensive Framework for Robotic Failure Analysis and Correction | Bo Zhao Team | 2505.12224 | null |
| 2025-05-20 | Learning Impact-Rich Rotational Maneuvers via Centroidal Velocity Rewards and Sim-to-Real Techniques: A One-Leg Hopper Flip Case Study | Hae-Won Park Team | 2505.12222 | null |
| 2025-05-17 | L2D2: Robot Learning from 2D Drawings | Dylan P. Losey Team | 2505.12072 | null |
| 2025-05-17 | H2R: A Human-to-Robot Data Augmentation for Robot Pre-training from Videos | Shanghang Zhang Team | 2505.11920 | null |
| 2025-05-17 | GLOVER++: Unleashing the Potential of Affordance Learning from Human Behaviors for Robotic Manipulation | Junwei Liang Team | 2505.11865 | null |
| 2025-05-17 | Learning IMU Bias with Diffusion Model | Guoquan Huang Team | 2505.11763 | null |
| 2025-05-16 | Zero-Shot Visual Generalization in Robot Manipulation | Gaurav Sukhatme Team | 2505.11719 | null |
| 2025-05-16 | Employing Laban Shape for Generating Emotionally and Functionally Expressive Trajectories in Robotic Manipulators | Alessandro Roncone Team | 2505.11716 | null |
| 2025-05-16 | EgoDex: Learning Dexterous Manipulation from Large-Scale Egocentric Video | Jian Zhang Team | 2505.11709 | null |
| 2025-05-16 | Grounded Task Axes: Zero-Shot Semantic Skill Generalization via Task-Axis Controllers and Visual Foundation Models | Oliver Kroemer Team | 2505.11680 | null |
| 2025-05-16 | SHIELD: Safety on Humanoids via CBFs In Expectation on Learned Dynamics | Aaron D. Ames Team | 2505.11494 | null |
| 2025-05-16 | Exploiting Radiance Fields for Grasp Generation on Novel Synthetic Views | Todor Stoyanov Team | 2505.11467 | null |
| 2025-05-16 | ReWiND: Language-Guided Rewards Teach Robot Policies without New Demonstrations | Jesse Zhang Team | 2505.10911 | null |
| 2025-05-16 | Counterfactual Behavior Cloning: Offline Imitation Learning from Imperfect Human Demonstrations | Dylan P. Losey Team | 2505.10760 | null |
| 2025-05-15 | Infinigen-Sim: Procedural Generation of Articulated Simulation Assets | Jia Deng Team | 2505.10755 | null |
| 2025-05-15 | Knowledge capture, adaptation and composition (KCAC): A framework for cross-task curriculum learning in robotic manipulation | Yan Jin Team | 2505.10522 | null |
| 2025-05-15 | IN-RIL: Interleaved Reinforcement and Imitation Learning for Policy Fine-Tuning | Junshan Zhang Team | 2505.10442 | null |
| 2025-05-15 | NVSPolicy: Adaptive Novel-View Synthesis for Generalizable Language-Conditioned Policy Learning | Chengyuan Chen Team | 2505.10359 | null |
| 2025-05-15 | SRT-H: A Hierarchical Framework for Autonomous Surgery via Language Conditioned Imitation Learning | Axel Krieger Team | 2505.10251 | null |
| 2025-05-15 | Training People to Reward Robots | Matthew Howard Team | 2505.10151 | null |
| 2025-05-15 | EmbodiedMAE: A Unified 3D Multi-Modal Representation for Robot Manipulation | Jianye Hao Team | 2505.10105 | null |
| 2025-05-15 | FlowDreamer: A RGB-D World Model with Flow-based Motion Representations for Robot Manipulation | Qing Li Team | 2505.10075 | null |
| 2025-05-15 | APEX: Action Priors Enable Efficient Exploration for Skill Imitation on Articulated Robots | Guillaume Sartoretti Team | 2505.10022 | null |
| 2025-05-15 | ImagineBench: Evaluating Reinforcement Learning with Large Language Model Rollouts | Yang Yu Team | 2505.10010 | link |
| 2025-05-16 | PointArena: Probing Multimodal Grounding Through Language-Guided Pointing | Ranjay Krishna Team | 2505.09990 | null |
| 2025-05-15 | Learning Diverse Natural Behaviors for Enhancing the Agility of Quadrupedal Robots | Chunlin Chen Team | 2505.09979 | null |
| 2025-05-14 | Learning Rock Pushability on Rough Planetary Terrain | Cagri Kilic Team | 2505.09833 | null |
| 2025-05-14 | Trailblazer: Learning offroad costmaps for long range planning | Srikanth Saripalli Team | 2505.09739 | null |
| 2025-05-14 | EnerVerse-AC: Envisioning Embodied Environments with Action Condition | Guanghui Ren Team | 2505.09723 | null |
| 2025-05-14 | ManipBench: Benchmarking Vision-Language Models for Low-Level Robot Manipulation | Daniel Seita Team | 2505.09698 | null |
| 2025-05-14 | DataMIL: Selecting Data for Robot Imitation Learning with Datamodels | Roberto Martín-Martín Team | 2505.09603 | null |
| 2025-05-14 | Real2Render2Real: Scaling Robot Data Without Dynamics Simulation or Robot Hardware | Ken Goldberg Team | 2505.09601 | null |
| 2025-05-14 | VTLA: Vision-Tactile-Language-Action Model with Preference Learning for Insertion Manipulation | Shuo Wang Team | 2505.09577 | null |
| 2025-05-14 | Learning Long-Context Diffusion Policies via Past-Token Prediction | Chelsea Finn Team | 2505.09561 | null |
| 2025-05-14 | Distilling Realizable Students from Unrealizable Teachers | Sanjiban Choudhury Team | 2505.09546 | null |
| 2025-05-14 | Exploring Pose-Guided Imitation Learning for Robotic Precise Insertion | Qixin Cao Team | 2505.09424 | null |
| 2025-05-14 | Neural Multivariate Regression: Qualitative Insights from the Unconstrained Feature Model | Keith Ross Team | 2505.09308 | null |
| 2025-05-14 | Latent Theory of Mind: A Decentralized Diffusion Architecture for Cooperative Manipulation | Guillaume Sartoretti Team | 2505.09144 | null |
| 2025-05-14 | FoldNet: Learning Generalizable Closed-Loop Policy for Garment Folding via Keypoint-Driven Asset and Demonstration Synthesis | He Wang Team | 2505.09109 | null |
| 2025-05-14 | Imitation Learning for Adaptive Control of a Virtual Soft Exoglove | Letizia Gionfrida Team | 2505.09099 | null |
| 2025-05-13 | ChicGrasp: Imitation-Learning based Customized Dual-Jaw Gripper Control for Delicate, Irregular Bio-products Manipulation | Dongyi Wang Team | 2505.08986 | null |
| 2025-05-13 | Augmented Reality for RObots (ARRO): Pointing Visuomotor Policies Towards Visual Robustness | Wolfram Burgard Team | 2505.08627 | null |
| 2025-05-13 | Beyond Predefined Actions: Integrating Behavior Trees and Dynamic Movement Primitives for Robot Learning from Demonstration | Todor Stoyanov Team | 2505.08625 | null |
| 2025-05-13 | From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation | Jianye Hao Team | 2505.08548 | null |
| 2025-05-13 | Parameter Estimation using Reinforcement Learning Causal Curiosity: Limits and Challenges | Weisi Guo Team | 2505.08453 | null |
| 2025-05-13 | Adaptive Diffusion Policy Optimization for Robotic Manipulation | Zhuang Yang Team | 2505.08376 | null |
| 2025-05-13 | Learning Like Humans: Advancing LLM Reasoning Capabilities via Adaptive Difficulty Curriculum Learning and Expert-Guided Self-Reformulation | Qianchun Lu Team | 2505.08364 | null |
| 2025-05-13 | Modeling Unseen Environments with Language-guided Composable Causal Components in Reinforcement Learning | Biwei Huang Team | 2505.08361 | null |
| 2025-05-13 | HandCept: A Visual-Inertial Fusion Framework for Accurate Proprioception in Dexterous Hands | Yunhui Liu Team | 2505.08213 | null |
| 2025-05-13 | CLTP: Contrastive Language-Tactile Pre-training for 3D Contact Geometry Understanding | Shuo Wang Team | 2505.08194 | null |
| 2025-05-12 | What Matters for Batch Online Reinforcement Learning in Robotics? | Chelsea Finn Team | 2505.08078 | null |
| 2025-05-12 | H$^{\mathbf{3}}$DP: Triply-Hierarchical Diffusion Policy for Visuomotor Learning | Huazhe Xu Team | 2505.07819 | null |
| 2025-05-12 | Imagine, Verify, Execute: Memory-Guided Agentic Exploration with Vision-Language Models | Jia-Bin Huang Team | 2505.07815 | null |
| 2025-05-12 | Improving Trajectory Stitching with Flow Models | Ioannis Havoutis Team | 2505.07802 | null |
| 2025-05-12 | Guiding Data Collection via Factored Scaling Curves | Anirudha Majumdar Team | 2505.07728 | null |
| 2025-05-12 | GelFusion: Enhancing Robotic Manipulation under Visual Constraints via Visuotactile Fusion | Peng Yin Team | 2505.07455 | null |
| 2025-05-12 | ReinboT: Amplifying Robot Visual-Language Manipulation with Reinforcement Learning | Donglin Wang Team | 2505.07395 | null |
| 2025-05-11 | X-Sim: Cross-Embodiment Learning via Real-to-Sim-to-Real | Sanjiban Choudhury Team | 2505.07096 | null |
| 2025-05-11 | YOPOv2-Tracker: An End-to-End Agile Tracking and Navigation Framework from Perception to Action | Bailing Tian Team | 2505.06923 | null |
| 2025-05-10 | JaxRobotarium: Training and Deploying Multi-Robot Policies in 10 Minutes | Harish Ravichandar Team | 2505.06771 | null |
| 2025-05-10 | Learned IMU Bias Prediction for Invariant Visual Inertial Odometry | Nikolay Atanasov Team | 2505.06748 | null |
| 2025-05-10 | ACORN: Adaptive Contrastive Optimization for Safe and Robust Fine-Grained Robotic Manipulation | Zixian Yue Team | 2505.06628 | null |
| 2025-05-10 | Video-Enhanced Offline Reinforcement Learning: A Model-Based Approach | Xiaokang Yang Team | 2505.06482 | null |
| 2025-05-09 | Adaptive Wiping: Adaptive contact-rich manipulation through few-shot imitation learning with Force-Torque feedback and pre-trained object representations | Gentiane Venture Team | 2505.06451 | null |
| 2025-05-09 | VIN-NBV: A View Introspection Network for Next-Best-View Selection for Resource-Efficient 3D Reconstruction | Roni Sengupta Team | 2505.06219 | null |
| 2025-05-09 | Neuro-Symbolic Concepts | Jiajun Wu Team | 2505.06191 | null |
| 2025-05-07 | Efficient Sensorimotor Learning for Open-world Robot Manipulation | Yifeng Zhu Team | 2505.06136 | null |
| 2025-05-09 | Robot Learning Using Multi-Coordinate Elastic Maps | Reza Azadeh Team | 2505.06092 | null |
| 2025-05-09 | TREND: Tri-teaching for Robust Preference-based Reinforcement Learning with Demonstrations | Abhinav Shrivastava Team | 2505.06079 | null |
| 2025-05-09 | 3D CAVLA: Leveraging Depth and 3D Context to Generalize Vision Language Action Models for Unseen Tasks | Farshad Khorrami Team | 2505.05800 | null |
| 2025-05-09 | Demystifying Diffusion Policies: Action Memorization and Simple Lookup Table Alternatives | Mac Schwager Team | 2505.05787 | null |
| 2025-05-09 | FlowHFT: Flow Policy Induced Optimal High-Frequency Trading under Diverse Market Conditions | Steve Yang Team | 2505.05784 | null |
| 2025-05-08 | CLAM: Continuous Latent Action Models for Robot Learning from Unlabeled Demonstrations | Stephen Tu Team | 2505.04999 | null |
| 2025-05-08 | CubeDAgger: Improved Robustness of Interactive Imitation Learning without Violation of Dynamic Stability | Taisuke Kobayashi Team | 2505.04897 | null |
| 2025-05-08 | D-CODA: Diffusion for Coordinated Dual-Arm Data Augmentation | Daniel Seita Team | 2505.04860 | null |
| 2025-05-07 | Steerable Scene Generation with Post Training and Inference-Time Search | Russ Tedrake Team | 2505.04831 | null |
| 2025-05-07 | Primal-dual algorithm for contextual stochastic combinatorial optimization | Axel Parmentier Team | 2505.04757 | null |
| 2025-05-07 | Merging and Disentangling Views in Visual Reinforcement Learning for Robotic Manipulation | Henrik I. Christensen Team | 2505.04619 | null |
| 2025-05-06 | OpenHelix: A Short Survey, Empirical Analysis, and Open-Source Dual-System VLA Model for Robotic Manipulation | Donglin Wang Team | 2505.03912 | null |
| 2025-05-06 | AMO: Adaptive Motion Optimization for Hyper-Dexterous Humanoid Whole-Body Control | Xiaolong Wang Team | 2505.03738 | null |
| 2025-05-06 | Meta-Optimization and Program Search using Language Models for Task and Motion Planning | Marc Toussaint Team | 2505.03725 | null |
| 2025-05-06 | Ergodic Generative Flows | Yinchuan Li Team | 2505.03561 | null |
| 2025-05-06 | RIFT: Closed-Loop RL Fine-Tuning for Realistic and Controllable Traffic Simulation | Sifa Zheng Team | 2505.03344 | null |
| 2025-05-06 | The Unreasonable Effectiveness of Discrete-Time Gaussian Process Mixtures for Robot Policy Learning | Abhinav Valada Team | 2505.03296 | null |
| 2025-05-05 | Sim2Real Transfer for Vision-Based Grasp Verification | Markus Vincze Team | 2505.03046 | link |
| 2025-05-05 | Zero-shot Sim2Real Transfer for Magnet-Based Tactile Sensor on Insertion Tasks | Jia Deng Team | 2505.02915 | null |
| 2025-05-05 | Re-purposing a modular origami manipulator into an adaptive physical computer for machine learning and robotic perception | Suyi Li Team | 2505.02744 | null |
| 2025-05-05 | Spatiotemporal Non-Uniformity-Aware Online Task Scheduling in Collaborative Edge Computing for Industrial Internet of Things | Bo Lei Team | 2505.02597 | null |
| 2025-05-05 | Automated Hybrid Reward Scheduling via Large Language Models for Robotic Skill Learning | Jianqiang Li Team | 2505.02483 | null |
| 2025-05-05 | MetaScenes: Towards Automated Replica Creation for Real-world 3D Scans | Siyuan Huang Team | 2505.02388 | null |
| 2025-05-04 | Coupled Distributional Random Expert Distillation for World Model Online Imitation Learning | Hao Su Team | 2505.02228 | null |
| 2025-05-04 | CrayonRobo: Object-Centric Prompt-Driven Vision-Language-Action Model for Robotic Manipulation | Hao Dong Team | 2505.02166 | null |
| 2025-05-04 | Interleave-VLA: Enhancing Robot Manipulation with Interleaved Image-Text Instructions | Mingyu Ding Team | 2505.02152 | null |
| 2025-05-03 | Act Natural! Extending Naturalistic Projection to Multimodal Behavior Scenarios | David Fridovich-Keil Team | 2505.01945 | null |
| 2025-05-07 | RoBridge: A Hierarchical Architecture Bridging Cognition and Execution for General Robotic Manipulation | Xiaodan Liang Team | 2505.01709 | null |
| 2025-05-02 | FalconWing: An Open-Source Platform for Ultra-Light Fixed-Wing Aircraft Research | Sayan Mitra Team | 2505.01383 | null |
| 2025-05-06 | Robotic Visual Instruction | Xianzheng Ma Team | 2505.00693 | null |
| 2025-05-01 | Towards Autonomous Micromobility through Scalable Urban Simulation | Bolei Zhou Team | 2505.00690 | null |
| 2025-05-01 | DeCo: Task Decomposition and Skill Composition for Zero-Shot Generalization in Long-Horizon 3D Manipulation | Yang Gao Team | 2505.00527 | null |
| 2025-05-01 | Optimal Interactive Learning on the Job via Facility Location Planning | George Konidaris Team | 2505.00490 | null |
| 2025-04-30 | LLM-based Interactive Imitation Learning for Robotic Manipulation | Stefan Wermter Team | 2504.21769 | null |
| 2025-04-30 | RoboGround: Robotic Manipulation with Grounded Vision-Language Priors | Zhou Zhao Team | 2504.21530 | null |
| 2025-04-30 | Provably-Safe, Online System Identification | Ram Vasudevan Team | 2504.21486 | null |
| 2025-04-29 | TesserAct: Learning 4D Embodied World Models | Chuang Gan Team | 2504.20995 | null |
| 2025-04-29 | XPG-RL: Reinforcement Learning with Explainable Priority Guidance for Efficiency-Boosted Mechanical Search | Elena Shrestha Team | 2504.20969 | null |
| 2025-04-29 | PRISM: Projection-based Reward Integration for Scene-Aware Real-to-Sim-to-Real Transfer with Few Demonstrations | Xuguang Lan Team | 2504.20520 | null |
| 2025-04-29 | SPARK Hand: Scooping-Pinching Adaptive Robotic Hand with Kempe Mechanism for Vertical Passive Grasp in Environmental Constraints | Wenzeng Zhang Team | 2504.20506 | null |
| 2025-04-28 | UTTG: A Universal Teleoperation Approach via Online Trajectory Generation | Hesheng Wang Team | 2504.19736 | null |
| 2025-04-28 | GPA-RAM: Grasp-Pretraining Augmented Robotic Attention Mamba for Spatial Task Learning | Mengyuan Liu Team | 2504.19683 | null |
| 2025-04-27 | PolyTouch: A Robust Multi-Modal Tactile Sensor for Contact-rich Manipulation Using Tactile-Diffusion Policies | Edward Adelson Team | 2504.19341 | null |
| 2025-04-29 | Learned Perceptive Forward Dynamics Model for Safe and Platform-aware Robotic Navigation | Marco Hutter Team | 2504.19322 | link |
| 2025-04-27 | Learning to Drive from a World Model | Yassine Yousfi Team | 2504.19077 | null |
| 2025-04-26 | RoboVerse: Towards a Unified Platform, Dataset and Benchmark for Scalable and Generalizable Robot Learning | Pieter Abbeel Team | 2504.18904 | null |
| 2025-04-26 | Imitation Learning for Autonomous Driving: Insights from Real-World Testing | Tufan Kumbasar Team | 2504.18847 | null |
| 2025-04-26 | Hierarchical Reinforcement Learning in Multi-Goal Spatial Navigation with Autonomous Mobile Robots | Alfredo Weitzenfeld Team | 2504.18794 | null |
| 2025-04-26 | STDArm: Transferring Visuomotor Policies From Static Data Training to Dynamic Robot Manipulation | Yanyong Zhang Team | 2504.18792 | null |
| 2025-04-25 | Generalization Capability for Imitation Learning | Yixiao Wang Team | 2504.18538 | null |
| 2025-04-25 | Instrumentation for Better Demonstrations: A Case Study | Francis wyffels Team | 2504.18481 | null |
| 2025-04-25 | Action Flow Matching for Continual Robot Learning | Lantao Liu Team | 2504.18471 | null |
| 2025-04-25 | Design and Evaluation of a UGV-Based Robotic Platform for Precision Soil Moisture Remote Sensing | George Nikolakopoulos Team | 2504.18284 | null |
| 2025-04-28 | Implementation Analysis of Collaborative Robot Digital Twins in Physics Engines | Hans D. Schotten Team | 2504.18200 | null |
| 2025-04-25 | Offline Learning of Controllable Diverse Behaviors | Ludovic Denoyer Team | 2504.18160 | null |
| 2025-04-24 | CIVIL: Causal and Intuitive Visual Imitation Learning | Dylan P. Losey Team | 2504.17959 | null |
| 2025-04-24 | Collaborating Action by Action: A Multi-agent LLM Framework for Embodied Reasoning | Prithviraj Ammanabrolu Team | 2504.17950 | null |
| 2025-04-24 | Learning Attentive Neural Processes for Planning with Pushing Actions | Nicholas Roy Team | 2504.17924 | null |
| 2025-04-24 | CaRL: Learning Scalable Planning Policies with Simple Rewards | Andreas Geiger Team | 2504.17838 | null |
| 2025-04-23 | Learning Underwater Active Perception in Simulation | Donald G. Dansereau Team | 2504.17817 | null |
| 2025-04-24 | Gripper Keypose and Object Pointflow as Interfaces for Bimanual Robotic Manipulation | Jiangmiao Pang Team | 2504.17784 | null |
| 2025-04-24 | Integrating Learning-Based Manipulation and Physics-Based Locomotion for Whole-Body Badminton Robot Control | Dong Xuan Team | 2504.17771 | null |
| 2025-04-24 | Robotic Grinding Skills Learning Based on Geodesic Length Dynamic Motion Primitives | Han Ding Team | 2504.17216 | null |
| 2025-04-23 | Geometric Formulation of Unified Force-Impedance Control on SE(3) for Robotic Manipulators | Roberto Horowitz Team | 2504.17080 | null |
| 2025-04-23 | A Systematic Approach to Design Real-World Human-in-the-Loop Deep Reinforcement Learning: Salient Features, Challenges and Trade-offs | Younes Zerouali Team | 2504.17006 | null |
| 2025-04-23 | Latent Diffusion Planning for Imitation Learning | Chelsea Finn Team | 2504.16925 | null |
| 2025-04-23 | MOSAIC: A Skill-Centric Algorithmic Framework for Long-Horizon Manipulation Planning | Maxim Likhachev Team | 2504.16738 | null |
| 2025-04-23 | ManipDreamer: Boosting Robotic Manipulation World Model with Action Tree and Visual Guidance | Shanghang Zhang Team | 2504.16464 | null |
| 2025-04-22 | Mass-Adaptive Admittance Control for Robotic Manipulators | Logan E. Beaver Team | 2504.16224 | null |
| 2025-04-22 | $π_{0.5}$: a Vision-Language-Action Model with Open-World Generalization | Ury Zhilinsky Team | 2504.16054 | null |
| 2025-04-22 | SPECI: Skill Prompts based Hierarchical Continual Imitation Learning for Robot Manipulation | Xiangli Nie Team | 2504.15561 | null |
| 2025-04-22 | VibeCheck: Using Active Acoustic Tactile Sensing for Contact-Rich Manipulation | Matei Ciocarlie Team | 2504.15535 | null |
| 2025-04-22 | Few-Shot Vision-Language Action-Incremental Policy Learning | Weili Guan Team | 2504.15517 | null |
| 2025-04-21 | LAPP: Large Language Model Feedback for Preference-Driven Reinforcement Learning | Boyuan Chen Team | 2504.15472 | null |
| 2025-04-23 | Advancing Embodied Intelligence in Robotic-Assisted Endovascular Procedures: A Systematic Review of AI Solutions | Peng Qi Team | 2504.15327 | null |
| 2025-04-21 | Immersive Teleoperation Framework for Locomanipulation Tasks | Dimitrios Kanoulas Team | 2504.15229 | null |
| 2025-04-21 | A Genetic Fuzzy-Enabled Framework on Robotic Manipulation for In-Space Servicing | Kelly Cohen Team | 2504.15226 | null |
| 2025-04-21 | A General Infrastructure and Workflow for Quadrotor Deep Reinforcement Learning and Reality Deployment | Huaping Liu Team | 2504.15129 | null |
| 2025-04-21 | SuFIA-BC: Generating High Quality Demonstration Data for Visuomotor Policy Learning in Surgical Subtasks | Animesh Garg Team | 2504.14857 | null |
| 2025-04-20 | Exposing the Copycat Problem of Imitation-based Planner: A Novel Closed-Loop Simulator, Causal Benchmark and Joint IL-RL Baseline | Hongsheng Li Team | 2504.14709 | null |
| 2025-04-24 | Latent Representations for Visual Proprioception in Inexpensive Robots | Ladislau Bölöni Team | 2504.14634 | null |
| 2025-04-18 | DiffOG: Differentiable Policy Trajectory Optimization with Generalizability | Yu She Team | 2504.13807 | null |
| 2025-04-18 | Imitation Learning with Precisely Labeled Human Demonstrations | Yilong Song Team | 2504.13803 | null |
| 2025-04-21 | SLAM&Render: A Benchmark for the Intersection Between Neural Rendering, Gaussian Splatting and SLAM | Javier Civera Team | 2504.13713 | link |
| 2025-04-18 | Self-Mixing Laser Interferometry: In Search of an Ambient Noise-Resilient Alternative to Acoustic Sensing | Francis wyffels Team | 2504.13711 | null |
| 2025-04-18 | On the Importance of Tactile Sensing for Imitation Learning: A Case Study on Robotic Match Lighting | Jan Peters Team | 2504.13618 | null |
| 2025-04-18 | A Model-Based Approach to Imitation Learning through Multi-Step Predictions | Na Li Team | 2504.13413 | null |
| 2025-04-17 | RoboTwin: Dual-Arm Robot Benchmark with Generative Digital Twins | Ping Luo Team | 2504.13059 | null |
| 2025-04-17 | Adaptive Task Space Non-Singular Terminal Super-Twisting Sliding Mode Control of a 7-DOF Robotic Manipulator | E. Witrant Team | 2504.13056 | null |
| 2025-04-17 | Krysalis Hand: A Lightweight, High-Payload, 18-DoF Anthropomorphic End-Effector for Robotic Learning and Dexterous Manipulation | Iman Soltani Team | 2504.12967 | null |
| 2025-04-17 | TSGS: Improving Gaussian Splatting for Transparent Surface Reconstruction via Normal and De-lighting Priors | Yi Yang Team | 2504.12799 | null |
| 2025-04-17 | Trajectory Adaptation using Large Language Models | Ravi Prakash Team | 2504.12755 | null |
| 2025-04-17 | Embodied Neuromorphic Control Applied on a 7-DOF Robotic Manipulator | Lei Wang Team | 2504.12702 | link |
| 2025-04-21 | A0: An Affordance-Aware Hierarchical Model for General Robotic Manipulation | Xiaodan Liang Team | 2504.12636 | null |
| 2025-04-17 | Crossing the Human-Robot Embodiment Gap with Sim-to-Real RL using One Human Demonstration | Jeannette Bohg Team | 2504.12609 | null |
| 2025-04-16 | Adapting a World Model for Trajectory Following in a 3D Game | Raluca Georgescu Team | 2504.12299 | null |
| 2025-04-16 | Towards Forceful Robotic Foundation Models: a Literature Survey | Nikolaus Correll Team | 2504.11827 | null |
| 2025-04-17 | Toward Aligning Human and Robot Actions via Multi-Modal Demonstration Learning | Fei Liu Team | 2504.11493 | null |
| 2025-04-15 | Next-Future: Sample-Efficient Policy Learning for Robotic-Arm Tasks | Suryansh Kumar Team | 2504.11247 | null |
| 2025-04-17 | CAP-Net: A Unified Network for 6D Pose and Size Estimation of Categorical Articulated Parts from a Single RGB-D Image | Yi Zhu Team | 2504.11230 | null |
| 2025-04-15 | Superfast Configuration-Space Convex Set Computation on GPUs for Online Motion Planning | Daniela Rus Team | 2504.10783 | link |
| 2025-04-14 | Improving In-Context Learning with Reasoning Distillation | Xiang Gao Team | 2504.10647 | null |
| 2025-04-14 | Flying Hand: End-Effector-Centric Framework for Versatile Aerial Manipulation Teleoperation and Policy Learning | Guanya Shi Team | 2504.10334 | null |
| 2025-04-14 | Look-to-Touch: A Vision-Enhanced Proximity and Tactile Sensor for Distance and Geometry Perception in Robotic Manipulation | Guoying Gu Team | 2504.10280 | null |
| 2025-04-14 | Prior Does Matter: Visual Navigation via Denoising Diffusion Bridge Models | Hui Cheng Team | 2504.10041 | link |
| 2025-04-14 | Efficient Task-specific Conditional Diffusion Policies: Shortcut Model Acceleration and SO(3) Optimization | Wei Sui Team | 2504.09927 | null |
| 2025-04-12 | Compliant Explicit Reference Governor for Contact Friendly Robotic Manipulators | Marco M. Nicotra Team | 2504.09188 | null |
| 2025-04-11 | BiFlex: A Passive Bimodal Stiffness Flexible Wrist for Manipulation in Unstructured Environments | Roberto Martín-Martín Team | 2504.08706 | null |
| 2025-04-11 | Diffusion Models for Robotic Manipulation: A Survey | Rania Rayyes Team | 2504.08438 | null |
| 2025-04-10 | Echo: An Open-Source, Low-Cost Teleoperation System with Force Feedback for Dataset Collection in Robot Learning | Dzmitry Tsetserukou Team | 2504.07939 | null |
| 2025-04-10 | TOCALib: Optimal control library with interpolation for bimanual manipulation and obstacles avoidance | Aleksandr Panov Team | 2504.07708 | null |
| 2025-04-10 | Novel Diffusion Models for Multimodal 3D Hand Trajectory Prediction | Hesheng Wang Team | 2504.07375 | link |
| 2025-04-09 | Adaptive Vision-Guided Robotic Arm Control for Precision Pruning in Dynamic Orchard Environments | Manoj Karkee Team | 2504.07309 | null |
| 2025-04-09 | AssistanceZero: Scalably Solving Assistance Games | Anca Dragan Team | 2504.07091 | link |
| 2025-04-09 | Two by Two: Learning Multi-Task Pairwise Objects Assembly for Generalizable Robot Manipulation | Huazhe Xu Team | 2504.06961 | null |
| 2025-04-09 | Developing Modular Grasping and Manipulation Pipeline Infrastructure to Streamline Performance Benchmarking | Holly Yanco Team | 2504.06819 | null |
| 2025-04-09 | Interactive Expressive Motion Generation Using Dynamic Movement Primitives | Kai O. Arras Team | 2504.06735 | null |
| 2025-04-09 | Overcoming Dynamic Environments: A Hybrid Approach to Motion Planning for Manipulators | Gavin Paul Team | 2504.06596 | null |
| 2025-04-09 | CAFE-AD: Cross-Scenario Adaptive Feature Enhancement for Trajectory Planning in Autonomous Driving | Yanyong Zhang Team | 2504.06584 | link |
| 2025-04-09 | OPAL: Encoding Causal Understanding of Physical Systems for Robot Learning | Tyler Fenstermaker Team | 2504.06538 | null |
| 2025-04-08 | ViTaMIn: Learning Contact-Rich Tasks Through Robot-Free Visuo-Tactile Manipulation Interface | Rui Chen Team | 2504.06156 | null |
| 2025-04-08 | MAPLE: Encoding Dexterous Robotic Manipulation Priors Learned From Egocentric Videos | Marc Pollefeys Team | 2504.06084 | null |
| 2025-04-08 | Learning-enhanced electronic skin for tactile sensing on deformable surface based on electrical impedance tomography | Yunjie Yang Team | 2504.05987 | null |
| 2025-04-08 | Stratified Expert Cloning with Adaptive Selection for User Retention in Large-Scale Recommender Systems | Yongqi Liu Team | 2504.05628 | null |
| 2025-04-08 | TW-CRL: Time-Weighted Contrastive Reward Learning for Efficient Inverse Reinforcement Learning | Stephen Xia Team | 2504.05585 | null |
| 2025-04-07 | SPARK-Remote: A Cost-Effective System for Remote Bimanual Robot Teleoperation | Karthik Desingh Team | 2504.05488 | null |
| 2025-04-07 | RobustDexGrasp: Robust Dexterous Grasping of General Objects from Single-view Perception | Jie Song Team | 2504.05287 | null |
| 2025-04-07 | Vision-Language Model Predictive Control for Manipulation Planning and Trajectory Generation | Wei Zhang Team | 2504.05225 | link |
| 2025-04-07 | Wavelet Policy: Imitation Policy Learning in Frequency Domain with Wavelet Transforms | Hongrui Zhu Team | 2504.04991 | null |
| 2025-04-07 | Embodied Perception for Test-time Grasping Detection Adaptation with Knowledge Infusion | Fengyu Zhou Team | 2504.04795 | null |
| 2025-04-06 | Tool-as-Interface: Learning Robot Policies from Human Tool Usage through Imitation Learning | Katherine Driggs-Campbell Team | 2504.04612 | null |
| 2025-04-06 | Diffusion-Based Approximate MPC: Fast and Consistent Imitation of Multi-Modal Action Distributions | Katherine J. Kuchenbecker Team | 2504.04603 | null |
| 2025-04-06 | DexTOG: Learning Task-Oriented Dexterous Grasp with Language | Cewu Lu Team | 2504.04573 | null |
| 2025-04-06 | DexSinGrasp: Learning a Unified Policy for Dexterous Object Singulation and Grasping in Cluttered Environments | Lin Shao Team | 2504.04516 | null |
| 2025-04-06 | Human-Level Competitive Pokémon via Scalable Offline Reinforcement Learning with Transformers | Yuke Zhu Team | 2504.04395 | null |
| 2025-04-05 | ORCA: An Open-Source, Reliable, Cost-Effective, Anthropomorphic Robotic Hand for Uninterrupted Dexterous Task Learning | Robert K. Katzschmann Team | 2504.04259 | null |
| 2025-04-09 | Digital Gene: Learning about the Physical World through Analytic Concepts | Cewu Lu Team | 2504.04170 | null |
| 2025-04-04 | Dexterous Manipulation through Imitation Learning: A Survey | Hong Zhang Team | 2504.03515 | null |
| 2025-04-04 | GraphSeg: Segmented 3D Representations via Graph Edge Addition and Contraction | Weiming Zhi Team | 2504.03129 | null |
| 2025-04-03 | Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets | Abhishek Gupta Team | 2504.02792 | null |
| 2025-04-03 | Multimodal Fusion and Vision-Language Models: A Survey for Robot Vision | Shibiao Xu Team | 2504.02477 | null |
| 2025-04-02 | RoboAct-CLIP: Video-Driven Pre-training of Atomic Action Understanding for Robotics | Qiang Nie Team | 2504.02069 | null |
| 2025-04-02 | Slot-Level Robotic Placement via Visual Imitation from Single Human Video | Arsalan Mousavian Team | 2504.01959 | null |
| 2025-04-02 | Learning with Imperfect Models: When Multi-step Prediction Mitigates Compounding Error | Nikolai Matni Team | 2504.01766 | null |
| 2025-04-02 | TransforMerger: Transformer-based Voice-Gesture Fusion for Robust Human-Robot Communication | Karla Stepanova Team | 2504.01708 | null |
| 2025-04-02 | 8-DoFs Cable Driven Parallel Robots for Bimanual Teleportation | Josie Hughes Team | 2504.01554 | null |
| 2025-04-02 | Bi-LAT: Bilateral Control-Based Imitation Learning via Natural Language and Action Chunking with Transformers | Yuki Uranishi Team | 2504.01301 | null |
| 2025-04-02 | The Social Life of Industrial Arms: How Arousal and Attention Shape Human-Robot Interaction | Matthew K. X. J Pan Team | 2504.01260 | null |
| 2025-04-01 | Energy Weighted Learning Progress Guided Interleaved Multi-Task Learning | Erhan Oztop Team | 2504.00707 | null |
| 2025-04-01 | Learning Bipedal Locomotion on Gear-Driven Humanoid Robot Using Foot-Mounted IMUs | Masaya Kinoshita Team | 2504.00614 | null |
| 2025-04-01 | Think Small, Act Big: Primitive Prompt Learning for Lifelong Robot Manipulation | Dong Wang Team | 2504.00420 | null |
| 2025-03-31 | CBIL: Collective Behavior Imitation Learning for Fish from Real Videos | Taku Komura Team | 2504.00234 | null |
| 2025-04-02 | Sim-and-Real Co-Training: A Simple Recipe for Vision-Based Robotic Manipulation | Yuke Zhu Team | 2503.24361 | null |
| 2025-04-02 | AutoEval: Autonomous Evaluation of Generalist Robot Manipulation Policies in the Real World | Sergey Levine Team | 2503.24278 | link |
| 2025-03-31 | HACTS: a Human-As-Copilot Teleoperation System for Robot Learning | Jian Tang Team | 2503.24070 | null |
| 2025-03-31 | Learning 3D-Gaussian Simulators from RGB Videos | Georg Martius Team | 2503.24009 | null |
| 2025-03-31 | ZeroMimic: Distilling Robotic Manipulation Skills from Web Videos | Dinesh Jayaraman Team | 2503.23877 | link |
| 2025-03-31 | Disambiguate Gripper State in Grasp-Based Tasks: Pseudo-Tactile as Feedback Enables Pure Simulation Learning | Yue Wang Team | 2503.23835 | null |
| 2025-03-30 | Can Visuo-motor Policies Benefit from Random Exploration Data? A Case Study on Stacking | Florian T. Pokorny Team | 2503.23571 | null |
| 2025-08-26 | BEHAVIOR Robot Suite: Streamlining Real-World Whole-Body Manipulation for Everyday Household Activities | Li Fei-Fei Team | 2503.05652 | link |
| 2024-12-17 | TidyBot++: An Open-Source Holonomic Mobile Manipulator for Robot Learning | Jeannette Bohg Team | 2412.10447 | link |
| 2025-01-08 | 3D-ViTac: Learning Fine-Grained Manipulation with Visuo-Tactile Sensing | Yunzhu Li Team | 2410.24091 | null |
| 2024-10-24 | SPIRE: Synergistic Planning, Imitation, and Reinforcement Learning for Long-Horizon Manipulation | Ajay Mandlekar Team | 2410.18065 | null |
| 2024-11-05 | ManiWAV: Learning Robot Manipulation from In-the-Wild Audio-Visual Data | Shuran Song Team | 2406.19464 | link |
| 2023-10-31 | Learning Robot Manipulation from Cross-Morphology Demonstration | Gaurav Sukhatme Team | 2304.03833 | null |
| 2022-11-17 | ToolFlowNet: Robotic Manipulation with Tools via Predicting Tool Flow from Point Clouds | David Held Team | 2211.09006 | link |
| 2022-11-16 | Learning and Retrieval from Prior Data for Skill-based Imitation Learning | Yuke Zhu Team | 2210.11435 | null |
| 2023-03-09 | VIOLA: Imitation Learning for Vision-Based Manipulation with Object Proposal Priors | Yuke Zhu Team | 2210.11339 | null |
| 2022-10-12 | Self-Supervised Learning of Multi-Object Keypoints for Robotic Manipulation | Abhinav Valada Team | 2205.08316 | null |
| 2022-11-21 | R3M: A Universal Visual Representation for Robot Manipulation | Abhinav Gupta Team | 2203.12601 | null |
| 2022-02-07 | BC-Z: Zero-Shot Task Generalization with Robotic Imitation Learning | Chelsea Finn Team | 2202.02005 | null |
| 2021-11-02 | Learning Language-Conditioned Robot Behavior from Offline Data and Crowd-Sourced Annotation | Chelsea Finn Team | 2109.01115 | null |
| 2021-06-11 | Coarse-to-Fine Imitation Learning: Robot Manipulation from a Single Demonstration | Edward Johns Team | 2105.06411 | link |
| 2022-03-09 | Interactive Imitation Learning in State-Space | Jens Kober Team | 2008.00524 | null |
| 2020-05-19 | On-Policy Robot Imitation Learning from a Converging Supervisor | Ken Goldberg Team | 1907.03423 | null |
| 2018-11-08 | RoboTurk: A Crowdsourcing Platform for Robotic Skill Learning through Imitation | Li Fei-Fei Team | 1811.02790 | null |
| 2018-10-09 | Robustness via Retrying: Closed-Loop Robotic Manipulation with Self-Supervised Learning | Chelsea Finn Team | 1810.03043 | null |
| 2017-10-27 | Learning Robotic Manipulation of Granular Media | Sergey Levine Team | 1709.02833 | null |
VLM
| Publish Date | Title | Authors | Code | ||
|---|---|---|---|---|---|
| 2025-11-20 | Learning to Think Fast and Slow for Visual Language Models | Kaiyang Zhou Team | 2511.16670 | null | |
| 2025-11-20 | Video-as-Answer: Predict and Generate Next Video Event with Joint-GRPO | Jing Liao Team | 2511.16669 | link | |
| 2025-11-20 | Cognitive Foundations for Reasoning and Their Manifestation in LLMs | Yulia Tsvetkov Team | 2511.16660 | null | |
| 2025-11-20 | InternData-A1: Pioneering High-Fidelity Synthetic Data for Pre-training Generalist Policy | Jiangmiao Pang Team | 2511.16651 | null | |
| 2025-11-20 | Bridging VLMs and Embodied Intelligence with Deliberate Practice Policy Optimization | Xiaozhu Ju Team | 2511.16602 | null | |
| 2025-11-20 | TimeViper: A Hybrid Mamba-Transformer Vision-Language Model for Efficient Long Video Understanding | Qin Jin Team | 2511.16595 | link | |
| 2025-11-20 | Contrastive vision-language learning with paraphrasing and negation | Artur d’Avila Garcez Team | 2511.16527 | null | |
| 2025-11-20 | MiMo-Embodied: X-Embodied Foundation Model Technical Report | Long Chen Team | 2511.16518 | link | |
| 2025-11-20 | Arctic-Extract Technical Report | Wojciech Jaśkowski Team | 2511.16470 | null | |
| 2025-11-20 | LLaVA$^3$: Representing 3D Scenes like a Cubist Painter to Boost 3D Scene Understanding of VLMs | Loïc Barthe Team | 2511.16454 | null | |
| 2025-11-20 | VLA-Pruner: Temporal-Aware Dual-Level Visual Token Pruning for Efficient Vision-Language-Action Inference | Bo Zhao Team | 2511.16449 | null | |
| 2025-11-20 | Beyond Visual Cues: Leveraging General Semantics as Support for Few-Shot Segmentation | Weifeng Liu Team | 2511.16435 | null | |
| 2025-11-20 | TOFA: Training-Free One-Shot Federated Adaptation for Vision-Language Models | Chaochao Chen Team | 2511.16423 | null | |
| 2025-11-20 | The Shawshank Redemption of Embodied AI: Understanding and Benchmarking Indirect Environmental Jailbreaks | Jianfeng Ma Team | 2511.16347 | null | |
| 2025-11-20 | FT-NCFM: An Influence-Aware Data Distillation Framework for Efficient VLA Models | Mingsheng Shang Team | 2511.16233 | null | |
| 2025-11-20 | Can MLLMs Read the Room? A Multimodal Benchmark for Assessing Deception in Multi-Party Social Interactions | Yoichi Sato Team | 2511.16221 | null | |
| 2025-11-20 | FlipVQA-Miner: Cross-Page Visual Question-Answer Mining from Textbooks | Wentao Zhang Team | 2511.16216 | null | |
| 2025-11-20 | When Alignment Fails: Multimodal Adversarial Attacks on Vision-Language-Action Models | Yaochu Jin Team | 2511.16203 | null | |
| 2025-11-20 | From Performance to Understanding: A Vision for Explainable Automated Algorithm Design | Thomas Bäck Team | 2511.16201 | null | |
| 2025-11-20 | Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight | Zhijie Deng Team | 2511.16175 | null | |
| 2025-11-19 | Think Visually, Reason Textually: Vision-Language Synergy in ARC | Jiaqi Wang Team | 2511.15703 | null | |
| 2025-11-19 | MoDES: Accelerating Mixture-of-Experts Multimodal Large Language Models via Dynamic Expert Skipping | Jun Zhang Team | 2511.15690 | null | |
| 2025-11-19 | Walrus: A Cross-Domain Foundation Model for Continuum Dynamics | Shirley Ho Team | 2511.15684 | null | |
| 2025-11-19 | VisPlay: Self-Evolving Vision-Language Models from Images | Yonghui Yang Team | 2511.15661 | null | |
| 2025-11-19 | Hierarchical Semantic Tree Anchoring for CLIP-Based Class-Incremental Learning | Da-Wei Zhou Team | 2511.15633 | null | |
| 2025-11-19 | The SA-FARI Dataset: Segment Anything in Footage of Animals for Recognition and Identification | Didac Suris Team | 2511.15622 | null | |
| 2025-11-19 | When to Think and When to Look: Uncertainty-Guided Lookback | Chenliang Xu Team | 2511.15613 | null | |
| 2025-11-19 | SRPO: Self-Referential Policy Optimization for Vision-Language-Action Models | Xipeng Qiu Team | 2511.15605 | null | |
| 2025-11-19 | AVATAAR: Agentic Video Answering via Temporal Adaptive Alignment and Reasoning | Chinmay Gondhalekar Team | 2511.15578 | null | |
| 2025-11-19 | Computer-Use Agents as Judges for Generative User Interface | Mike Zheng Shou Team | 2511.15567 | link | |
| 2025-11-19 | Multimodal Evaluation of Russian-language Architectures | Alena Fenogenova Team | 2511.15552 | null | |
| 2025-11-19 | Learning to Expand Images for Efficient Visual Autoregressive Modeling | Tao Huang Team | 2511.15499 | null | |
| 2025-11-19 | SIGMMA: Hierarchical Graph-Based Multi-Scale Multi-modal Contrastive Alignment of Histopathology Image and Spatial Transcriptome | Mohammad Lotfollahi Team | 2511.15464 | null | |
| 2025-11-19 | D4C: Data-free Quantization for Contrastive Language-Image Pre-training Models | Kentaro Yoshioka Team | 2511.15411 | null | |
| 2025-11-19 | Breaking Expert Knowledge Limits: Self-Pruning for Large Language Models | Hao Wang Team | 2511.15390 | null | |
| 2025-11-19 | Zero-Shot Open-Vocabulary Human Motion Grounding with Test-Time Training | Jianfei Yang Team | 2511.15379 | null | |
| 2025-11-19 | C2F-Space: Coarse-to-Fine Space Grounding for Spatial Instructions using Vision-Language Models | Daehyung Park Team | 2511.15333 | null | |
| 2025-11-19 | What Your Features Reveal: Data-Efficient Black-Box Feature Inversion Attack for Split DNNs | Fan Li Team | 2511.15316 | null | |
| 2025-11-19 | Adapt-As-You-Walk Through the Clouds: Training-Free Online Test-Time Adaptation of 3D Vision-Language Foundation Models | Morteza Saberi Team | 2511.15311 | null | |
| 2025-11-19 | Text2Loc++: Generalizing 3D Point Cloud Localization from Natural Language | Daniel Cremers Team | 2511.15308 | null | |
| 2025-11-18 | ARC Is a Vision Problem! | Kaiming He Team | 2511.14761 | link | |
| 2025-11-18 | UniGen-1.5: Enhancing Image Generation and Editing through Reward Unification in Reinforcement Learning | Afshin Dehghan Team | 2511.14760 | null | |
| 2025-11-18 | $\pi^{*}_{0.6}$: a VLA That Learns From Experience | Zhiyuan Zhou Team | 2511.14759 | null | |
| 2025-11-18 | Vision Large Language Models Are Good Noise Handlers in Engagement Analysis | Xiaobai Li Team | 2511.14749 | null | |
| 2025-11-18 | Measuring AI Progress in Drug Discovery: A Reproducible Leaderboard for the Tox21 Challenge | Günter Klambauer Team | 2511.14744 | null | |
| 2025-11-18 | Attention via Synaptic Plasticity is All You Need: A Biologically Inspired Spiking Neuromorphic Transformer | Ankush Kumar Team | 2511.14691 | null | |
| 2025-11-18 | NORA-1.5: A Vision-Language-Action Model Trained using World Model- and Action-based Preference Rewards | Soujanya Poria Team | 2511.14659 | link | |
| 2025-11-18 | Enhancing Agentic Autonomous Scientific Discovery with Vision-Language Model Capabilities | Inigo Zubeldia Team | 2511.14631 | null | |
| 2025-11-18 | Is Your VLM for Autonomous Driving Safety-Ready? A Comprehensive Benchmark for Evaluating External and In-Cabin Risks | Xiaoshuai Hao Team | 2511.14592 | null | |
| 2025-11-18 | OmniZip: Audio-Guided Dynamic Token Compression for Fast Omnimodal Large Language Models | Huan Wang Team | 2511.14582 | link | |
| 2025-11-18 | Task Addition and Weight Disentanglement in Closed-Vocabulary Models | Pascal Frossard Team | 2511.14569 | null | |
| 2025-11-18 | Enhancing End-to-End Autonomous Driving with Risk Semantic Distillation from VLM | Siyuan Cheng Team | 2511.14499 | null | |
| 2025-11-18 | Agentic Video Intelligence: A Flexible Framework for Advanced Video Exploration and Understanding | Min-Ling Zhang Team | 2511.14446 | null | |
| 2025-11-18 | Watchdogs and Oracles: Runtime Verification Meets Large Language Models for Autonomous Systems | Angelo Ferrando Team | 2511.14435 | null | |
| 2025-11-18 | Enhancing LLM-based Autonomous Driving with Modular Traffic Light and Sign Recognition | Abhinav Valada Team | 2511.14391 | null | |
| 2025-11-18 | O3SLM: Open Weight, Open Data, and Open Vocabulary Sketch-Language Model | Anirban Chakraborty Team | 2511.14368 | null | |
| 2025-11-18 | ArchMap: Arch-Flattening and Knowledge-Guided Vision Language Model for Tooth Counting and Structured Dental Understanding | Jionglong Su Team | 2511.14336 | null | |
| 2025-11-18 | When Words Change the Model: Sensitivity of LLMs for Constraint Programming Modelling | Jacopo Mauro Team | 2511.14334 | null | |
| 2025-11-18 | Step by Step Network | Gao Huang Team | 2511.14329 | null | |
| 2025-11-18 | Segmentwise Pruning in Audio-Language Models | Jean-François Bonastre Team | 2511.14293 | null | |
| 2025-11-17 | Scaling Spatial Intelligence with Multimodal Foundation Models | Lei Yang Team | 2511.13719 | link | |
| 2025-11-17 | TiViBench: Benchmarking Think-in-Video Reasoning for Video Generative Models | Ying-Cong Chen Team | 2511.13704 | link | |
| 2025-11-17 | Crossing Borders: A Multimodal Challenge for Indian Poetry Translation and Image Generation | Joseph K J Team | 2511.13689 | null | |
| 2025-11-17 | Training-Free Multi-View Extension of IC-Light for Textual Position-Aware Scene Relighting | Haoji Hu Team | 2511.13684 | null | |
| 2025-11-17 | Part-X-MLLM: Part-aware 3D Multimodal Large Language Model | Chunchao Guo Team | 2511.13647 | null | |
| 2025-11-17 | CacheFlow: Compressive Streaming Memory for Efficient Long-Form Video Understanding | Daivik Patel Team | 2511.13644 | null | |
| 2025-11-17 | CreBench: Human-Aligned Creativity Evaluation from Idea to Process to Product | Jiayi Cen Team | 2511.13626 | null | |
| 2025-11-17 | FreeAskWorld: An Interactive and Closed-Loop Simulator for Human-Centric Embodied AI | Jiangtao Gong Team | 2511.13524 | null | |
| 2025-11-17 | Language-Guided Invariance Probing of Vision-Language Models | Jae Joong Lee Team | 2511.13494 | null | |
| 2025-11-17 | Semantic Document Derendering: SVG Reconstruction via Vision-Language Modeling | Pascal Frossard Team | 2511.13478 | null | |
| 2025-11-17 | Trust in Vision-Language Models: Insights from a Participatory User Workshop | Viola Schiaffonati Team | 2511.13458 | null | |
| 2025-11-17 | Unlocking the Forgery Detection Potential of Vanilla MLLMs: A Novel Training-Free Pipeline | Ziqian Lu Team | 2511.13442 | null | |
| 2025-11-17 | VOPE: Revisiting Hallucination of Vision-Language Models in Voluntary Imagination Task | Xilin Chen Team | 2511.13420 | null | |
| 2025-11-17 | Attention Grounded Enhancement for Visual Document Retrieval | Keping Bi Team | 2511.13415 | null | |
| 2025-11-17 | Descriptor: Distance-Annotated Traffic Perception Question Answering (DTPQA) | Ciaran Eising Team | 2511.13397 | null | |
| 2025-11-17 | Generalized Denoising Diffusion Codebook Models (gDDCM): Tokenizing images using a pre-trained diffusion model | Fei Kong Team | 2511.13387 | null | |
| 2025-11-17 | Moving Pictures of Thought: Extracting Visual Knowledge in Charles S. Peirce’s Manuscripts with Vision-Language Models | Dario Rodighiero Team | 2511.13378 | null | |
| 2025-11-17 | Tab-PET: Graph-Based Positional Encodings for Tabular Transformers | Mehul Motani Team | 2511.13338 | null | |
| 2025-11-17 | TabFlash: Efficient Table Understanding with Progressive Question Conditioning and Token Focusing | Hyunwoo J. Kim Team | 2511.13283 | null | |
| 2025-11-17 | Is your VLM Sky-Ready? A Comprehensive Spatial Intelligence Benchmark for UAV Navigation | Wenbo Ding Team | 2511.13269 | null | |
| 2025-11-14 | DocLens: A Tool-Augmented Multi-Agent Framework for Long Visual Document Understanding | Jinsung Yoon Team | 2511.11552 | null | |
| 2025-11-14 | Bridging Hidden States in Vision-Language Models | Jacob Fein-Ashley Team | 2511.11526 | null | |
| 2025-11-14 | Collaborative Representation Learning for Alignment of Tactile, Language, and Vision Modalities | Jingyuan Chen Team | 2511.11512 | null | |
| 2025-11-14 | PAS: Prelim Attention Score for Detecting Object Hallucinations in Large Vision–Language Models | Manish Bhattarai Team | 2511.11502 | null | |
| 2025-11-14 | Rethinking Progression of Memory State in Robotic Manipulation: An Object-Centric Perspective | Ngan Le Team | 2511.11478 | null | |
| 2025-11-14 | Benchmarking Visual LLMs Resilience to Unanswerable Questions on Visually Rich Documents | Fabrizio Battiloro Team | 2511.11468 | null | |
| 2025-11-14 | VoxTell: Free-Text Promptable Universal 3D Medical Image Segmentation | Klaus Maier-Hein Team | 2511.11450 | null | |
| 2025-11-14 | From Synthetic Scenes to Real Performance: Enhancing Spatial Reasoning in VLMs | Giuseppe Riccardi Team | 2511.11440 | null | |
| 2025-11-14 | VP-Bench: A Comprehensive Benchmark for Visual Prompting in Multimodal Large Language Models | Wenqiang Lei Team | 2511.11438 | null | |
| 2025-11-14 | Comprehension of Multilingual Expressions Referring to Target Objects in Visual Inputs | Bruno Martins Team | 2511.11427 | null | |
| 2025-11-14 | BOFA: Bridge-Layer Orthogonal Low-Rank Fusion for CLIP-Based Class-Incremental Learning | De-Chuan Zhan Team | 2511.11421 | null | |
| 2025-11-14 | Q-Doc: Benchmarking Document Image Quality Assessment Capabilities in Multi-modal Large Language Models | Baoliang Chen Team | 2511.11410 | null | |
| 2025-11-14 | MicroVQA++: High-Quality Microscopy Reasoning Dataset with Weakly Supervised Graphs for Multimodal Large Language Model | Bo Yan Team | 2511.11407 | null | |
| 2025-11-14 | DocSLM: A Small Vision-Language Model for Long Multimodal Document Understanding | Sunando Sengupta Team | 2511.11313 | null | |
| 2025-11-14 | EcoAlign: An Economically Rational Framework for Efficient LVLM Alignment | Hongyi Zhang Team | 2511.11301 | null | |
| 2025-11-14 | AUVIC: Adversarial Unlearning of Visual Concepts for Multi-modal Large Language Models | Volker Tresp Team | 2511.11299 | link | |
| 2025-11-14 | Experiences from Benchmarking Vision-Language-Action Models for Robotic Manipulation | Xi Zheng Team | 2511.11298 | null | |
| 2025-11-14 | GraphPilot: Grounded Scene Graph Conditioning for Language-Based Autonomous Driving | Abhinav Valada Team | 2511.11266 | null | |
| 2025-11-14 | Discovering Meaningful Units with Visually Grounded Semantics from Image Captions | James Henderson Team | 2511.11262 | null | |
| 2025-11-14 | CountSteer: Steering Attention for Object Counting in Diffusion Models | Hyunsoo Cho Team | 2511.11253 | null | |
| 2025-11-13 | Enhancing the Outcome Reward-based RL Training of MLLMs with Self-Consistency Sampling | Jinguo Zhu Team | 2511.10648 | null | |
| 2025-11-13 | Querying Labeled Time Series Data with Scenario Programs | Sanjit A Seshia Team | 2511.10627 | null | |
| 2025-11-13 | Towards Blind and Low-Vision Accessibility of Lightweight VLMs and Custom LLM-Evals | Pawan Goyal Team | 2511.10615 | null | |
| 2025-11-13 | Impact of Layer Norm on Memorization and Generalization in Transformers | Jung-Eun Kim Team | 2511.10566 | null | |
| 2025-11-13 | OmniVGGT: Omni-Modality Driven Visual Geometry Grounded | Ziwei Liu Team | 2511.10560 | link | |
| 2025-11-13 | SemanticVLA: Semantic-Aligned Sparsification and Enhancement for Efficient Robotic Manipulation | Liqiang Nie Team | 2511.10518 | link | |
| 2025-11-13 | LLM-YOLOMS: Large Language Model-based Semantic Interpretation and Fault Diagnosis for Wind Turbine Components | Jianbo Feng Team | 2511.10394 | null | |
| 2025-11-13 | MonkeyOCR v1.5 Technical Report: Unlocking Robust Document Parsing for Complex Patterns | Xiang Bai Team | 2511.10390 | null | |
| 2025-11-13 | Rethinking Visual Information Processing in Multimodal LLMs | Amit Kumar K C Team | 2511.10301 | null | |
| 2025-11-13 | Adaptive Residual-Update Steering for Low-Overhead Hallucination Mitigation in Large Vision Language Models | Pekka Marttinen Team | 2511.10292 | null | |
| 2025-11-13 | PROPA: Toward Process-level Optimization in Visual Reasoning via Reinforcement Learning | Jey Han Lau Team | 2511.10279 | null | |
| 2025-11-13 | Causal-HalBench: Uncovering LVLMs Object Hallucinations Through Causal Intervention | Xiang Wang Team | 2511.10268 | null | |
| 2025-11-13 | Facial-R1: Aligning Reasoning and Recognition for Facial Emotion Analysis | Min Cao Team | 2511.10254 | null | |
| 2025-11-13 | TubeRMC: Tube-conditioned Reconstruction with Mutual Constraints for Weakly-supervised Spatio-Temporal Video Grounding | Beihao Xia Team | 2511.10241 | null | |
| 2025-11-13 | Intelligence Foundation Model: A New Perspective to Approach Artificial General Intelligence | Yao Zhao Team | 2511.10119 | null | |
| 2025-11-13 | MTAttack: Multi-Target Backdoor Attacks against Large Vision-Language Models | Xiao Bai Team | 2511.10098 | null | |
| 2025-11-13 | How does My Model Fail? Automatic Identification and Interpretation of Physical Plausibility Failure Modes with Matryoshka Transcoders | Dianbo Liu Team | 2511.10094 | null | |
| 2025-11-13 | SUGAR: Learning Skeleton Representation with Visual-Motion Knowledge for Action Recognition | Zitong Yu Team | 2511.10091 | null | |
| 2025-11-13 | GridPrune: From “Where to Look” to “What to Select” in Visual Token Pruning for MLLMs | Pengwei Wang Team | 2511.10081 | null | |
| 2025-11-13 | VLF-MSC: Vision-Language Feature-Based Multimodal Semantic Communication System | Joonhyuk Kang Team | 2511.10074 | null | |
| 2025-11-10 | Using Vision Language Models as Closed-Loop Symbolic Planners for Robotic Applications: A Control-Theoretic Perspective | Somil Bansal Team | 2511.07410 | null | |
| 2025-11-10 | CAMP-VQA: Caption-Embedded Multimodal Perception for No-Reference Quality Assessment of Compressed Video | David Bull Team | 2511.07290 | null | |
| 2025-11-10 | Leveraging Text-Driven Semantic Variation for Robust OOD Segmentation | Jaekoo Lee Team | 2511.07238 | null | |
| 2025-11-10 | Federated Learning for Video Violence Detection: Complementary Roles of Lightweight CNNs and Vision-Language Models for Energy-Efficient Use | Rachid Chelouah Team | 2511.07171 | null | |
| 2025-11-10 | ClusterMine: Robust Label-Free Visual Out-Of-Distribution Detection via Concept Mining from Text Corpora | Markus Kollmann Team | 2511.07068 | link | |
| 2025-11-10 | CoLM: Collaborative Large Models via A Client-Server Paradigm | Hongyuan Zhang Team | 2511.06991 | null | |
| 2025-11-10 | RPTS: Tree-Structured Reasoning Process Scoring for Faithful Multimodal Evaluation | Yu Zhang Team | 2511.06899 | null | |
| 2025-11-10 | Flexible Concept Bottleneck Model | Rui Zhang Team | 2511.06678 | null | |
| 2025-11-10 | HiMo-CLIP: Modeling Semantic Hierarchy and Monotonicity in Vision-Language Alignment | Shiguo Lian Team | 2511.06653 | null | |
| 2025-11-10 | NOVO: Bridging LLaVA and SAM with Visual-only Prompts for Reasoning Segmentation | Yeong-Jun Cho Team | 2511.06651 | null | |
| 2025-11-10 | How Do VLAs Effectively Inherit from VLMs? | Jiang Bian Team | 2511.06619 | null | |
| 2025-11-09 | A Low-Rank Method for Vision Language Model Hallucination Mitigation in Autonomous Driving | Xiaopeng Li Team | 2511.06496 | null | |
| 2025-11-09 | Zooming into Comics: Region-Aware RL Improves Fine-Grained Comic Understanding in Vision-Language Models | Sabine Süsstrunk Team | 2511.06490 | null | |
| 2025-11-09 | GazeVLM: A Vision-Language Model for Multi-Task Gaze Understanding | Riad Souissi Team | 2511.06348 | null | |
| 2025-11-09 | ALIGN: A Vision-Language Framework for High-Accuracy Accident Location Inference through Geo-Spatial Neural Reasoning | Moazzem Hossain Team | 2511.06316 | null | |
| 2025-11-09 | TinyChemVL: Advancing Chemical Vision-Language Models via Efficient Visual Token Reduction and Complex Reaction Tasks | Bo Xu Team | 2511.06283 | null | |
| 2025-11-09 | WebVIA: A Web-based Vision-Language Agentic Framework for Interactive and Verifiable UI-to-Code Generation | Jie Tang Team | 2511.06251 | null | |
| 2025-11-09 | Affordance-Guided Coarse-to-Fine Exploration for Base Placement in Open-Vocabulary Mobile Manipulation | Winston H. Hsu Team | 2511.06240 | null | |
| 2025-11-09 | MoRA: Missing Modality Low-Rank Adaptation for Visual Recognition | Vijaykrishnan Narayanan Team | 2511.06225 | null | |
| 2025-11-09 | Scene-Aware Urban Design: A Human-AI Recommendation Framework Using Co-Occurrence Embeddings and Vision-Language Models | Alexander Htet Kyaw Team | 2511.06201 | null | |
| 2025-11-07 | Visual Spatial Tuning | Hengshuang Zhao Team | 2511.05491 | null | |
| 2025-11-07 | Turning Adversaries into Allies: Reversing Typographic Attacks for Multimodal E-Commerce Product Retrieval | Hongda Shen Team | 2511.05325 | null | |
| 2025-11-07 | Towards Mitigating Hallucinations in Large Vision-Language Models by Refining Textual Embeddings | Furong Huang Team | 2511.05017 | null | |
| 2025-11-07 | iFlyBot-VLM Technical Report | Jia Pan Team | 2511.04976 | null | |
| 2025-11-07 | A benchmark multimodal oro-dental dataset for large vision-language models | Muhammad Saqib Team | 2511.04948 | null | |
| 2025-11-06 | Conformalized Non-uniform Sampling Strategies for Accelerated Sampling-based Motion Planning | Yiannis Kantaros Team | 2511.04835 | null | |
| 2025-11-06 | IndicVisionBench: Benchmarking Cultural and Multilingual Understanding in VLMs | Shubham Agarwal Team | 2511.04727 | null | |
| 2025-11-05 | SWAP: Towards Copyright Auditing of Soft Prompts via Sequential Watermarking | Dacheng Tao Team | 2511.04711 | null | |
| 2025-11-06 | SAFe-Copilot: Unified Shared Autonomy Framework | Daniela Rus Team | 2511.04664 | null | |
| 2025-11-06 | Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm | Xipeng Qiu Team | 2511.04570 | null | |
| 2025-11-06 | Evo-1: Lightweight Vision-Language-Action Model with Preserved Semantic Alignment | Bo Zhao Team | 2511.04555 | link | |
| 2025-11-07 | ThaiOCRBench: A Task-Diverse Benchmark for Vision-Language Understanding in Thai | Kunat Pipatanakul Team | 2511.04479 | null | |
| 2025-11-06 | GUI-360: A Comprehensive Dataset and Benchmark for Computer-Using Agents | Dongmei Zhang Team | 2511.04307 | null | |
| 2025-11-07 | On the Brittleness of CLIP Text Encoders | Luca Rossetto Team | 2511.04247 | link | |
| 2025-11-06 | Text to Sketch Generation with Multi-Styles | Lei Xu Team | 2511.04123 | null | |
| 2025-11-05 | Context informs pragmatic interpretation in vision-language models | Michael C. Frank Team | 2511.03908 | null | |
| 2025-11-05 | Contamination Detection for VLMs using Multi-Modal Semantic Perturbation | Yong Jae Lee Team | 2511.03774 | null | |
| 2025-11-05 | GUIDES: Guidance Using Instructor-Distilled Embeddings for Pre-trained Robot Policy Enhancement | Jiachen Li Team | 2511.03400 | null | |
| 2025-11-05 | Decoupling Augmentation Bias in Prompt Learning for Vision-Language Models | Seokju Lee Team | 2511.03367 | null | |
| 2025-11-04 | LEGO-Eval: Towards Fine-Grained Evaluation on Synthesizing 3D Embodied Environments with Tool Augmentation | Jinyoung Yeo Team | 2511.03001 | null | |
| 2025-11-04 | SCALE-VLP: Soft-Weighted Contrastive Volumetric Vision-Language Pre-training with Spatial-Knowledge Semantics | Leonid Sigal Team | 2511.02996 | null | |
| 2025-11-04 | XR-1: Towards Versatile Vision-Language-Action Models via Learning Unified Vision-Motion Representations | Jian Tang Team | 2511.02776 | null | |
| 2025-11-04 | Adapting General-Purpose Foundation Models for X-ray Ptychography in Low-Data Regimes | Yi Jiang Team | 2511.02503 | null | |
| 2025-11-04 | RxnCaption: Reformulating Reaction Diagram Parsing as Visual Prompt Guided Captioning | Conghui He Team | 2511.02384 | null | |
| 2025-11-04 | The Pervasive Blind Spot: Benchmarking VLM Inference Risks on Everyday Personal Videos | Hewu Li Team | 2511.02367 | null | |
| 2025-11-04 | CoCoVa: Chain of Continuous Vision-Language Thought for Latent Space Reasoning | Han Yan Team | 2511.02360 | null | |
| 2025-11-04 | LACY: A Vision-Language Model-based Language-Action Cycle for Self-Improving Robotic Manipulation | Changhyun Choi Team | 2511.02239 | link | |
| 2025-11-04 | Text to Robotic Assembly of Multi Component Objects using 3D Generative AI and Vision Language Models | Randall Davis Team | 2511.02162 | null | |
| 2025-11-03 | Enhancing Multimodal Recommendations with Vision-Language Models and Information-Aware Fusion | Dung D. Le Team | 2511.02113 | null | |
| 2025-11-03 | TRACE: Textual Reasoning for Affordance Coordinate Extraction | Matthew S. Brown Team | 2511.01999 | null | |
| 2025-11-03 | Black-Box Membership Inference Attack for LVLMs via Prior Knowledge-Calibrated Memory Probing | Tao Qi Team | 2511.01952 | null | |
| 2025-11-04 | Dynamic Routing Between Experts: A Data-Efficient Approach to Continual Learning in Vision-Language Models | Mingwei Shen Team | 2511.01831 | null | |
| 2025-11-03 | SciTextures: Collecting and Connecting Visual Patterns, Models, and Code Across Science and Art | Alona Strugatski Team | 2511.01817 | null | |
| 2025-11-03 | GenDexHand: Generative Simulation for Dexterous Hands | Yi Ma Team | 2511.01791 | null | |
| 2025-11-03 | 3EED: Ground Everything Everywhere in 3D | Ziwei Liu Team | 2511.01755 | link | |
| 2025-11-03 | UniLumos: Fast and Unified Image and Video Relighting with Physics-Plausible Feedback | Fan Wang Team | 2511.01678 | null | |
| 2025-11-03 | Vote-in-Context: Turning VLMs into Zero-Shot Rank Fusers | Naeemullah Khan Team | 2511.01617 | null | |
| 2025-11-03 | Analyzing Sustainability Messaging in Large-Scale Corporate Social Media | Marcel Worring Team | 2511.01550 | null | |
| 2025-11-03 | AERMANI-VLM: Structured Prompting and Reasoning for Aerial Manipulation with Vision Language Models | Spandan Roy Team | 2511.01472 | null | |
| 2025-11-03 | HMVLM: Human Motion-Vision-Language Model via MoE LoRA | Shihong Xia Team | 2511.01463 | null | |
| 2025-11-03 | When to Trust the Answer: Question-Aligned Semantic Nearest Neighbor Entropy for Safer Surgical VQA | Mobarak I. Hoque Team | 2511.01458 | null | |
| 2025-10-31 | PETAR: Localized Findings Generation with Mask-Aware Vision-Language Modeling for PET Automated Reporting | Tyler J. Bradshaw Team | 2510.27680 | null | |
| 2025-10-31 | Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning | Jiaqi Wang Team | 2510.27606 | null | |
| 2025-10-31 | From Pixels to Paths: A Multi-Agent Framework for Editable Scientific Illustration | Kaipeng Zhang Team | 2510.27452 | null | |
| 2025-10-31 | Modality Alignment across Trees on Heterogeneous Hyperbolic Manifolds | Mehrtash Harandi Team | 2510.27391 | null | |
| 2025-10-31 | FOCUS: Efficient Keyframe Selection for Long Video Understanding | Yang You Team | 2510.27280 | null | |
| 2025-10-31 | T3: Test-Time Model Merging in VLMs for Zero-Shot Medical Imaging Analysis | Mohammad Yaqub Team | 2510.27265 | null | |
| 2025-10-31 | ECVL-ROUTER: Scenario-Aware Routing for Vision-Language Models | Tengxiang Zhang Team | 2510.27256 | null | |
| 2025-11-03 | Enhancing Spatio-Temporal Zero-shot Action Recognition with Language-driven Description Attributes | Seong-Whan Lee Team | 2510.27255 | null | |
| 2025-10-31 | Generating Accurate and Detailed Captions for High-Resolution Images | Jiyoung Jung Team | 2510.27164 | null | |
| 2025-10-30 | MoME: Mixture of Visual Language Medical Experts for Medical Imaging Segmentation | Xiaohui Xie Team | 2510.26996 | null | |
| 2025-10-30 | MM-OPERA: Benchmarking Open-ended Association Reasoning for Large Vision-Language Models | Ziliang Chen Team | 2510.26937 | null | |
| 2025-10-30 | NaviTrace: Evaluating Embodied Navigation of Vision-Language Models | Jonas Frey Team | 2510.26909 | null | |
| 2025-10-30 | Cognition Envelopes for Bounded AI Reasoning in Autonomous UAS Operations | Jane Cleland-Huang Team | 2510.26905 | null | |
| 2025-10-30 | Do Vision-Language Models Measure Up? Benchmarking Visual Measurement Reading with MeasureBench | Xi Yang Team | 2510.26865 | link | |
| 2025-11-03 | ChartAB: A Benchmark for Chart Grounding & Dense Alignment | Tianyi Zhou Team | 2510.26781 | null | |
| 2025-10-30 | SteerVLM: Robust Model Control through Lightweight Activation Steering for Vision Language Models | Chris Thomas Team | 2510.26769 | null | |
| 2025-10-30 | All You Need for Object Detection: From Pixels, Points, and Prompts to Next-Gen Fusion and Multimodal LLMs/VLMs in Autonomous Vehicles | Abolfazl Razi Team | 2510.26641 | null | |
| 2025-10-30 | Counteracting Matthew Effect in Self-Improvement of LVLMs through Head-Tail Re-balancing | Xuanjing Huang Team | 2510.26474 | null | |
| 2025-11-03 | Representation-Level Counterfactual Calibration for Debiased Zero-Shot Recognition | ShengJun Huang Team | 2510.26466 | null | |
| 2025-10-30 | Towards Fine-Grained Vision-Language Alignment for Few-Shot Anomaly Detection | Chengjie Wang Team | 2510.26464 | null | |
| 2025-10-30 | A-TPT: Angular Diversity Calibration Properties for Test-Time Prompt Tuning of Vision-Language Models | Muhammad Haris Khan Team | 2510.26441 | null | |
| 2025-10-30 | MedSAE: Dissecting MedCLIP Representations with Sparse Autoencoders | Marco Grangetto Team | 2510.26411 | null | |
| 2025-10-30 | Distilling Multilingual Vision-Language Models: When Smaller Models Stay Multilingual | Peerat Limkonchotiwat Team | 2510.26271 | null | |
| 2025-10-30 | Which Way Does Time Flow? A Psychophysics-Grounded Evaluation for Vision-Language Models | Shigeru Kitazawa Team | 2510.26241 | null | |
| 2025-10-30 | MV-MLM: Bridging Multi-View Mammography and Language for Breast Cancer Diagnosis and Risk Prediction | Ali Diba Team | 2510.26151 | null | |
| 2025-10-30 | GUI Knowledge Bench: Revealing the Knowledge Gap Behind VLM Failures in GUI Tasks | Qing Li Team | 2510.26098 | null | |
| 2025-10-30 | Dynamic VLM-Guided Negative Prompting for Diffusion Models | Yoonseok Choi Team | 2510.26052 | null | |
| 2025-10-29 | CAVE: Detecting and Explaining Commonsense Anomalies in Visual Environments | Antoine Bosselut Team | 2510.26006 | null | |
| 2025-10-30 | PairUni: Pairwise Training for Unified Multimodal Language Models | Zhuochen Wang Team | 2510.25682 | null | |
| 2025-10-29 | ALDEN: Reinforcement Learning for Active Navigation and Evidence Gathering in Long Documents | Bela Gipp Team | 2510.25668 | null | |
| 2025-10-29 | Don’t Blind Your VLA: Aligning Visual Representations for OOD Generalization | Aleksandr I. Panov Team | 2510.25616 | null | |
| 2025-10-29 | Using VLM Reasoning to Constrain Task and Motion Planning | Zachary Kingston Team | 2510.25548 | null | |
| 2025-10-29 | Seeing, Signing, and Saying: A Vision-Language Model-Assisted Pipeline for Sign Language Data Acquisition and Curation from Social Media | Josef van Genabith Team | 2510.25413 | null | |
| 2025-10-29 | SoraNav: Adaptive UAV Task-Centric Navigation via Zeroshot VLM Reasoning | Wei Pan Team | 2510.25191 | null | |
| 2025-10-29 | Agentic Moderation: Multi-Agent Design for Safer Vision-Language Models | Usman Naseem Team | 2510.25179 | null | |
| 2025-10-29 | Learning Spatial-Aware Manipulation Ordering | Jian Pu Team | 2510.25138 | null | |
| 2025-10-29 | NanoVLA: Routing Decoupled Vision-Language Understanding for Nano-sized Generalist Robotic Policies | Jinghui Lu Team | 2510.25122 | null | |
| 2025-10-29 | Visual Diversity and Region-aware Prompt Learning for Zero-shot HOI Detection | Hyunwoo J. Kim Team | 2510.25094 | null | |
| 2025-10-29 | DRIP: Dynamic patch Reduction via Interpretable Pooling | Sachin Kumar Team | 2510.25067 | null | |
| 2025-10-28 | Efficient License Plate Recognition via Pseudo-Labeled Supervision with Grounding DINO and YOLOv8 | Ching Yee Suen Team | 2510.25032 | null | |
| 2025-10-28 | SCOUT: A Lightweight Framework for Scenario Coverage Assessment in Autonomous Driving | Mykel J. Kochenderfer Team | 2510.24949 | null | |
| 2025-10-28 | Finding Culture-Sensitive Neurons in Vision-Language Models | Ivan Titov Team | 2510.24942 | null | |
| 2025-10-28 | Advancing site-specific disease and pest management in precision agriculture: From reasoning-driven foundation models to adaptive, feedback-based learning | Arnold W. Schumann Team | 2510.24650 | null | |
| 2025-10-28 | OS-Sentinel: Towards Safety-Enhanced Mobile GUI Agents via Hybrid Validation in Realistic Workflows | Lingpeng Kong Team | 2510.24411 | null | |
| 2025-10-28 | What do vision-language models see in the context? Investigating multimodal in-context learning | Sandra Avila Team | 2510.24331 | null | |
| 2025-10-28 | Few-Shot Remote Sensing Image Scene Classification with CLIP and Prompt Learning | Ivan Kitanovski Team | 2510.24321 | null | |
| 2025-10-28 | ViPER: Empowering the Self-Evolution of Visual Perception Abilities in Vision-Language Model | Rui Yan Team | 2510.24285 | null | |
| 2025-10-28 | Enabling Near-realtime Remote Sensing via Satellite-Ground Collaboration of Large Vision-Language Models | Yue Gao Team | 2510.24242 | null | |
| 2025-10-28 | V-SAT: Video Subtitle Annotation Tool | Vishwanathan Raman Team | 2510.24180 | null | |
| 2025-10-28 | Enhancing Vision-Language Models for Autonomous Driving through Task-Specific Prompting and Spatial Reasoning | Xubo Luo Team | 2510.24152 | null | |
| 2025-10-28 | Compositional Image Synthesis with Inference-Time Scaling | Namhyuk Ahn Team | 2510.24133 | link | |
| 2025-10-28 | HistoLens: An Interactive XAI Toolkit for Verifying and Mitigating Flaws in Vision-Language Models for Histopathology | Vandita Singh Team | 2510.24115 | null | |
| 2025-10-28 | PFEA: An LLM-based High-Level Natural Language Planning and Feedback Embodied Agent for Human-Centered AI | Philip Dames Team | 2510.24109 | null | |
| 2025-10-28 | Enhancing CLIP Robustness via Cross-Modality Alignment | Hanwang Zhang Team | 2510.24038 | null | |
| 2025-10-28 | Mars-Bench: A Benchmark for Evaluating Foundation Models for Mars Science Tasks | Hannah Kerner Team | 2510.24010 | null | |
| 2025-10-28 | Reasoning Visual Language Model for Chest X-Ray Analysis | Daguang Xu Team | 2510.23968 | null | |
| 2025-10-27 | Latent Chain-of-Thought for Visual Reasoning | Zhiqiang Tao Team | 2510.23925 | null | |
| 2025-10-27 | Explainable Detection of AI-Generated Images with Artifact Localization Using Faster-Than-Lies and Vision-Language Models for Edge Devices | Madesh Kuppusamy Team | 2510.23775 | null | |
| 2025-10-27 | RobotArena $\infty$: Scalable Robot Benchmarking via Real-to-Sim Translation | Katerina Fragkiadaki Team | 2510.23571 | link | |
| 2025-10-28 | VOLD: Reasoning Transfer from LLMs to Vision-Language Models via On-Policy Distillation | Cordelia Schmid Team | 2510.23497 | null | |
| 2025-10-27 | On the Faithfulness of Visual Thinking: Measurement and Enhancement | Guisong Xia Team | 2510.23482 | null | |
| 2025-10-27 | A Video Is Not Worth a Thousand Words | Michael Wray Team | 2510.23253 | null | |
| 2025-10-27 | Process Reward Models for Sentence-Level Verification of LVLM Radiology Reports | Curtis P. Langlotz Team | 2510.23217 | null | |
| 2025-10-27 | DecoDINO: 3D Human-Scene Contact Prediction with Semantic Classification | Angelo Broere Team | 2510.23203 | null | |
| 2025-10-27 | Evaluation of Vision-LLMs in Surveillance Video | Jelte P. Mense Team | 2510.23190 | null | |
| 2025-10-27 | Finding 3D Scene Analogies with Multimodal Foundation Models | Young Min Kim Team | 2510.23184 | null | |
| 2025-10-27 | Revisiting Multimodal Positional Encoding in Vision-Language Models | Shuai Bai Team | 2510.23095 | null | |
| 2025-10-27 | Multi-Stage Field Extraction of Financial Documents with OCR and Compact Vision-Language Models | Donald MacDonald Team | 2510.23066 | null | |
| 2025-10-27 | VoMP: Predicting Volumetric Mechanical Property Fields | Maria Shugrina Team | 2510.22975 | link | |
| 2025-10-28 | HyPerNav: Hybrid Perception for Object-Oriented Navigation in Unknown Environment | Zhen Li Team | 2510.22917 | null | |
| 2025-10-26 | Seeing the Unseen: Towards Zero-Shot Inspection for Wind Turbine Blades using Knowledge-Augmented Vision Language Models | Jiong Tang Team | 2510.22868 | null | |
| 2025-10-26 | Semantic-Preserving Cross-Style Visual Reasoning for Robust Multi-Modal Understanding in Large Vision-Language Models | Kaito Tanaka Team | 2510.22838 | null | |
| 2025-10-26 | VEHME: A Vision-Language Model For Evaluating Handwritten Mathematics Expressions | Taehwan Kim Team | 2510.22798 | link | |
| 2025-10-26 | Self-Calibrated Consistency can Fight Back for Adversarial Robustness in Vision-Language Models | Mingkun Xu Team | 2510.22785 | null | |
| 2025-10-26 | MMPersuade: A Dataset and Evaluation Framework for Multimodal Persuasion | Chien-Sheng Wu Team | 2510.22768 | null | |
| 2025-10-26 | Jarvis: Towards Personalized AI Assistant via Personal KV-Cache Retrieval | Wentao Zhang Team | 2510.22765 | null | |
| 2025-10-26 | S-Chain: Structured Visual Chain-of-Thought For Medicine | Anh Totti Nguyen Team | 2510.22728 | null | |
| 2025-10-26 | Atlas Urban Index: A VLM-Based Approach for Spatially and Temporally Calibrated Urban Development Monitoring | Prathamesh Mayekar Team | 2510.22702 | null | |
| 2025-10-24 | A Multimodal Benchmark for Framing of Oil & Gas Advertising and Potential Greenwashing Detection | Peter Henderson Team | 2510.21679 | null | |
| 2025-10-24 | Modest-Align: Data-Efficient Alignment for Vision-Language Models | Zuozhu Liu Team | 2510.21606 | null | |
| 2025-10-24 | Head Pursuit: Probing Attention Specialization in Multimodal Transformers | Alberto Cazzaniga Team | 2510.21518 | null | |
| 2025-10-24 | MoniTor: Exploiting Large Language Models with Instruction for Online Video Anomaly Detection | Jie Qin Team | 2510.21449 | null | |
| 2025-10-24 | Vision Language Models for Dynamic Human Activity Recognition in Healthcare Settings | Fakhri Karray Team | 2510.21424 | null | |
| 2025-10-24 | Bridging the gap to real-world language-grounded visual concept learning | Seunghoon Hong Team | 2510.21412 | null | |
| 2025-10-24 | VL-SAE: Interpreting and Enhancing Vision-Language Alignment with a Unified Concept Set | Shuhui Wang Team | 2510.21323 | null | |
| 2025-10-24 | Memory-Free Continual Learning with Null Space Adaptation for Zero-Shot Vision-Language Models | Taesup Kim Team | 2510.21175 | null | |
| 2025-10-24 | Generalizable Hierarchical Skill Learning via Object-Centric Representation | Robert Platt Team | 2510.21121 | null | |
| 2025-10-24 | SafetyPairs: Isolating Safety Critical Image Features with Counterfactual Image Generation | Joseph Yitan Cheng Team | 2510.21120 | null | |
| 2025-10-24 | MedAlign: A Synergistic Framework of Multimodal Preference Optimization and Federated Meta-Cognitive Reasoning | Dong In Kim Team | 2510.21093 | null | |
| 2025-10-24 | Knowledge-Driven Vision-Language Model for Plexus Detection in Hirschsprung’s Disease | Adrian D. C. Chan Team | 2510.21083 | null | |
| 2025-10-24 | ZING-3D: Zero-shot Incremental 3D Scene Graphs via Vision-Language Models | Jimmy Chiun Team | 2510.21069 | null | |
| 2025-10-23 | 3DReasonKnee: Advancing Grounded Reasoning in Medical Vision Language Models | Pranav Rajpurkar Team | 2510.20967 | null | |
| 2025-10-23 | Small Drafts, Big Verdict: Information-Intensive Visual Reasoning via Speculation | Shengjie Wang Team | 2510.20812 | null | |
| 2025-10-23 | Mixing Importance with Diversity: Joint Optimization for KV Cache Compression in Large Vision-Language Models | Linfeng Zhang Team | 2510.20707 | link | |
| 2025-10-23 | Diagnosing Visual Reasoning: Challenges, Insights, and a Path Forward | Chenliang Xu Team | 2510.20696 | null | |
| 2025-10-23 | Better Tokens for Better 3D: Advancing Vision-Language Modeling in 3D Medical Imaging | Bjoern Menze Team | 2510.20639 | null | |
| 2025-10-23 | Bi-CoG: Bi-Consistency-Guided Self-Training for Vision-Language Models | Lan-Zhe Guo Team | 2510.20477 | null | |
| 2025-10-23 | GhostEI-Bench: Do Mobile Agents Resilience to Environmental Injection in Dynamic On-Device Environments? | Yingchun Wang Team | 2510.20333 | null | |
| 2025-10-23 | Breakdance Video classification in the age of Generative AI | Michelle Munson Team | 2510.20287 | null | |
| 2025-10-23 | Empower Words: DualGround for Structured Phrase and Sentence-Level Temporal Grounding | Sangyoun Lee Team | 2510.20244 | null | |
| 2025-10-23 | Why LVLMs Are More Prone to Hallucinations in Longer Responses: The Role of Context | Sibei Yang Team | 2510.20229 | null | |
| 2025-10-24 | Surfer 2: The Next Generation of Cross-Platform Computer Use Agents | Jevgenij Zubovskij Team | 2510.19949 | null | |
| 2025-10-22 | Semantic World Models | Abhishek Gupta Team | 2510.19818 | null | |
| 2025-10-22 | olmOCR 2: Unit Test Rewards for Document OCR | Kyle Lo Team | 2510.19817 | link | |
| 2025-10-22 | Class-Aware Prototype Learning with Negative Contrast for Test-Time Adaptation of Vision-Language Models | Xuelong Li Team | 2510.19802 | null | |
| 2025-10-22 | MedReason-R1: Learning to Reason for CT Diagnosis with Reinforcement Learning and Local Zoom | Shaohua Kevin Zhou Team | 2510.19626 | link | |
| 2025-10-22 | XBench: A Comprehensive Benchmark for Visual-Language Explanations in Chest Radiography | Mauricio Reyes Team | 2510.19599 | null | |
| 2025-10-22 | Can You Trust What You See? Alpha Channel No-Box Attacks on Video Object Detection | Qiben Yan Team | 2510.19574 | null | |
| 2025-10-22 | A Matter of Time: Revealing the Structure of Time in Vision-Language Models | Matthias Zeppelzauer Team | 2510.19559 | null | |
| 2025-10-22 | [De\|Re]constructing VLMs’ Reasoning in Counting | Giuseppe Riccardi Team | 2510.19555 | null | |
| 2025-10-22 | CARES: Context-Aware Resolution Selector for VLMs | Eli Schwartz Team | 2510.19496 | null | |
| 2025-10-22 | Seeing Across Views: Benchmarking Spatial Reasoning of Vision-Language Models in Robotic Scenes | Baining Guo Team | 2510.19400 | link | |
| 2025-10-22 | A Training-Free Framework for Open-Vocabulary Image Segmentation and Recognition with EfficientNet and CLIP | Wei Yu Chen Team | 2510.19333 | null | |
| 2025-10-22 | Unified Reinforcement and Imitation Learning for Vision-Language Models | Yueh-Hua Wu Team | 2510.19307 | link | |
| 2025-10-22 | Hierarchical DLO Routing with Reinforcement Learning and In-Context Vision-language Models | Changhyun Choi Team | 2510.19268 | null | |
| 2025-10-22 | Preliminary Use of Vision Language Model Driven Extraction of Mouse Behavior Towards Understanding Fear Expression | Evangelos E. Papalexakis Team | 2510.19160 | null | |
| 2025-10-21 | PoSh: Using Scene Graphs To Guide LLMs-as-a-Judge For Detailed Image Descriptions | Kathleen McKeown Team | 2510.19060 | link | |
| 2025-10-21 | Robust Driving QA through Metadata-Grounded Context and Task-Specific Prompts | Hyunjung Shim Team | 2510.19001 | null | |
| 2025-10-21 | DSI-Bench: A Benchmark for Dynamic Spatial Intelligence | Zhou Zhao Team | 2510.18873 | null | |
| 2025-10-21 | FedDEAP: Adaptive Dual-Prompt Tuning for Multi-Domain Federated Learning | Jagath C. Rajapakse Team | 2510.18837 | null | |
| 2025-10-21 | Seg the HAB: Language-Guided Geospatial Algae Bloom Reasoning and Segmentation | Elvis Hsieh Team | 2510.18751 | null | |
| 2025-10-21 | Exploring a Unified Vision-Centric Contrastive Alternatives on Multi-Modal Web Documents | Mike Zheng Shou Team | 2510.18703 | link | |
| 2025-10-21 | Think with 3D: Geometric Imagination Grounded Spatial Reasoning from Limited Views | Ruqi Huang Team | 2510.18632 | null | |
| 2025-10-21 | CUARewardBench: A Benchmark for Evaluating Reward Models on Computer-using Agent | Xing Sun Team | 2510.18596 | null | |
| 2025-10-21 | CovMatch: Cross-Covariance Guided Multimodal Dataset Distillation with Trainable Text Encoder | Hye Won Chung Team | 2510.18583 | null | |
| 2025-10-21 | Zero-Shot Vehicle Model Recognition via Text-Based Retrieval-Augmented Generation | Yan-Ann Chen Team | 2510.18502 | null | |
| 2025-10-21 | StarBench: A Turn-Based RPG Benchmark for Agentic Multimodal Decision-Making and Information Seeking | Donglin Yu Team | 2510.18483 | null | |
| 2025-10-21 | Grounding or Guessing? Visual Signals for Detecting Hallucinations in Sign Language Translation | Cristina España-Bonet Team | 2510.18439 | null | |
| 2025-10-21 | ImageGem: In-the-wild Generative Image Interaction Dataset for Generative Model Personalization | Hongyi Wen Team | 2510.18433 | null | |
| 2025-10-21 | Beyond Single Models: Mitigating Multimodal Hallucinations via Adaptive Token Ensemble Decoding | Xian Wu Team | 2510.18321 | null | |
| 2025-10-21 | StreamingTOM: Streaming Token Compression for Efficient Video Understanding | Huan Wang Team | 2510.18269 | null | |
| 2025-10-21 | UWBench: A Comprehensive Vision-Language Benchmark for Underwater Understanding | Xuelong Li Team | 2510.18262 | null | |
| 2025-10-21 | RadDiagSeg-M: A Vision Language Model for Joint Diagnosis and Multi-Target Segmentation in Radiology | Bjoern Menze Team | 2510.18188 | null | |
| 2025-10-20 | Online In-Context Distillation for Low-Resource Vision Language Models | Karteek Alahari Team | 2510.18117 | null | |
| 2025-10-20 | HouseTour: A Virtual Real Estate A(I)gent | Iro Armeni Team | 2510.18054 | null | |
| 2025-10-20 | SAVANT: Semantic Analysis with Vision-Augmented Anomaly deTection | Johannes Betz Team | 2510.18034 | null | |
| 2025-10-21 | Glyph: Scaling Context Windows via Visual-Text Compression | Minlie Huang Team | 2510.17800 | null | |
| 2025-10-20 | SparseVILA: Decoupling Visual Sparsity for Efficient VLM Inference | Zhijian Liu Team | 2510.17777 | null | |
| 2025-10-20 | Seeing but Not Believing: Probing the Disconnect Between Visual Attention and Answer Correctness in VLMs | Hanghang Tong Team | 2510.17771 | null | |
| 2025-10-20 | VERA-V: Variational Inference Framework for Jailbreaking Vision-Language Models | Ruqi Zhang Team | 2510.17759 | null | |
| 2025-10-20 | Frugal Federated Learning for Violence Detection: A Comparison of LoRA-Tuned VLMs and Personalized CNNs | Rachid Chelouah Team | 2510.17651 | null | |
| 2025-10-20 | SARSteer: Safeguarding Large Audio Language Models via Safe-Ablated Refusal Steering | Li Liu Team | 2510.17633 | null | |
| 2025-10-20 | MIRAGE: Agentic Framework for Multimodal Misinformation Detection with Web-Grounded Reasoning | Adiba Mahbub Proma Team | 2510.17590 | null | |
| 2025-10-20 | Towards Mixed-Modal Retrieval for Universal Retrieval-Augmented Generation | Zhicheng Dou Team | 2510.17354 | null | |
| 2025-10-20 | Disentanglement Beyond Static vs. Dynamic: A Benchmark and Evaluation Framework for Multi-Factor Sequential Representations | Omri Azencot Team | 2510.17313 | null | |
| 2025-10-20 | FineVision: Open Data Is All You Need | Andrés Marafioti Team | 2510.17269 | null | |
| 2025-10-20 | ZSPAPrune: Zero-Shot Prompt-Aware Token Pruning for Vision-Language Models | Guoming Tang Team | 2510.17197 | null | |
| 2025-10-20 | SimpleVSF: VLM-Scoring Fusion for Trajectory Prediction of End-to-End Autonomous Driving | Shaohua Wu Team | 2510.17191 | null | |
| 2025-10-20 | OmniVIC: A Self-Improving Variable Impedance Controller with Vision-Language In-Context Learning for Safe Robotic Manipulation | Arash Ajoudani Team | 2510.17150 | link | |
| 2025-10-20 | Efficient Vision-Language-Action Models for Embodied Manipulation: A Systematic Survey | Jian Cheng Team | 2510.17111 | null | |
| 2025-10-19 | Where, Not What: Compelling Video LLMs to Learn Geometric Causality for 3D-Grounding | Yutong Zhong Team | 2510.17034 | null | |
| 2025-10-19 | Does Visual Grounding Enhance the Understanding of Embodied Knowledge in Large Language Models? | Renfen Hu Team | 2510.16924 | null | |
| 2025-10-19 | VAGEN: Reinforcing World Model Reasoning for Multi-Turn VLM Agents | Manling Li Team | 2510.16907 | null | |
| 2025-10-19 | Uncovering Brain-Like Hierarchical Patterns in Vision-Language Models through fMRI-Based Neural Encoding | Xiaowei He Team | 2510.16870 | null | |
| 2025-10-19 | Region in Context: Text-condition Image editing with Human-like semantic reasoning | Phan Xuan Tan Team | 2510.16772 | null | |
| 2025-10-19 | See or Say Graphs: Agent-Driven Scalable Graph Understanding with Vision-Language Models | Xike Xie Team | 2510.16769 | null | |
| 2025-10-17 | BiomedXPro: Prompt Optimization for Explainable Diagnosis with Biomedical Vision Language Models | Damayanthi Herath Team | 2510.15866 | null | |
| 2025-10-17 | Neuro-Symbolic Spatial Reasoning in Segmentation | Shaogang Gong Team | 2510.15841 | null | |
| 2025-10-17 | Learning to Detect Unknown Jailbreak Attacks in Large Vision-Language Models | Xiting Wang Team | 2510.15430 | null | |
| 2025-10-17 | Fine-Tuning MedGemma for Clinical Captioning to Enhance Multimodal RAG over Malaysia CPGs | Goh Man Fye Team | 2510.15418 | null | |
| 2025-10-17 | Infinity Parser: Layout Aware Reinforcement Learning for Scanned Document Parsing | Yuan Qi Team | 2510.15349 | null | |
| 2025-10-16 | From Pixels to Words – Towards Native Vision-Language Primitives at Scale | Ziwei Liu Team | 2510.14979 | null | |
| 2025-10-16 | Learning an Image Editing Model without Image Editing Pairs | Xun Huang Team | 2510.14978 | link | |
| 2025-10-16 | RDD: Retrieval-Based Demonstration Decomposer for Planner Alignment in Long-Horizon Tasks | Jiachen Li Team | 2510.14968 | null | |
| 2025-10-16 | RoboGPT-R1: Enhancing Robot Planning with Reinforcement Learning | Haoran Li Team | 2510.14828 | null | |
| 2025-10-16 | CoT-PL: Visual Chain-of-Thought Reasoning Meets Pseudo-Labeling for Open-Vocabulary Object Detection | Hyunjung Shim Team | 2510.14792 | null | |
| 2025-10-16 | Free-Grained Hierarchical Recognition | Stella X. Yu Team | 2510.14737 | null | |
| 2025-10-16 | Efficient Video Sampling: Pruning Temporally Redundant Tokens for Faster VLM Inference | Andrew Tao Team | 2510.14624 | null | |
| 2025-10-16 | Talking Points: Describing and Localizing Pixels | Shai Avidan Team | 2510.14583 | null | |
| 2025-10-16 | Exploring Cross-Modal Flows for Few-Shot Learning | Long Chen Team | 2510.14543 | null | |
| 2025-10-17 | PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model | Yanjun Ma Team | 2510.14528 | link | |
| 2025-10-16 | Noise Projection: Closing the Prompt-Agnostic Gap Behind Text-to-Image Misalignment in Diffusion Models | Ziyu Zhao Team | 2510.14526 | null | |
| 2025-10-16 | Hi-Agent: Hierarchical Vision-Language Agents for Mobile Device Control | Yuanchun Shi Team | 2510.14388 | null | |
| 2025-10-16 | Watermarking for Factuality: Guiding Vision-Language Models Toward Truth via Tri-layer Contrastive Decoding | Jinkyu Kim Team | 2510.14304 | link | |
| 2025-10-15 | Efficient Few-Shot Learning in Remote Sensing: Fusing Vision and Vision-Language Models | Miguel Arana-Catania Team | 2510.13993 | null | |
| 2025-10-15 | VisCoP: Visual Probing for Video Domain Adaptation of Vision Language Models | Srijan Das Team | 2510.13808 | null | |
| 2025-10-15 | Generative Universal Verifier as Multimodal Meta-Reasoner | Yujiu Yang Team | 2510.13804 | null | |
| 2025-10-15 | Spatial-DISE: A Unified Benchmark for Evaluating Spatial Reasoning in Vision-Language Models | Xiaowei Huang Team | 2510.13394 | null | |
| 2025-10-15 | DepthVLA: Enhancing Vision-Language-Action Models with Depth-Aware Spatial Reasoning | Hang Zhao Team | 2510.13375 | null | |
| 2025-10-15 | Language as a Label: Zero-Shot Multimodal Classification of Everyday Postures under Data Scarcity | Jubal Chandy Jacob Team | 2510.13364 | null | |
| 2025-10-15 | Improving Visual Recommendation on E-commerce Platforms Using Vision-Language Models | Andre Rusli Team | 2510.13359 | null | |
| 2025-10-15 | Self-Augmented Visual Contrastive Decoding | Vivek Gupta Team | 2510.13315 | null | |
| 2025-10-15 | MMLongCite: A Benchmark for Evaluating Fidelity of Long-Context Vision-Language Models | Min Zhang Team | 2510.13276 | null | |
| 2025-10-15 | Map the Flow: Revealing Hidden Pathways of Information in VideoLLMs | Bohyung Han Team | 2510.13251 | null | |
| 2025-10-15 | What “Not” to Detect: Negation-Aware VLMs via Structured Reasoning and Token Merging | Hyunjung Shim Team | 2510.13232 | null | |
| 2025-10-15 | SHIELD: Classifier-Guided Prompting for Robust and Safer LVLMs | Usman Naseem Team | 2510.13190 | null | |
| 2025-10-15 | DriveCritic: Towards Context-Aware, Human-Aligned Evaluation for Autonomous Driving with Vision-Language Models | Jose M. Alvarez Team | 2510.13108 | null | |
| 2025-10-15 | VLA-0: Building State-of-the-Art VLAs with Zero Modification | Fabio Ramos Team | 2510.13054 | null | |
| 2025-10-14 | SVAG-Bench: A Large-Scale Benchmark for Multi-Instance Spatio-temporal Video Action Grounding | Thomas Seidl Team | 2510.13016 | null | |
| 2025-10-14 | UNCAP: Uncertainty-Guided Planning Using Natural Language Communication for Cooperative Autonomous Vehicles | Ufuk Topcu Team | 2510.12992 | null | |
| 2025-10-14 | Scope: Selective Cross-modal Orchestration of Visual Perception Experts | Perouz Taslakian Team | 2510.12974 | null | |
| 2025-10-14 | Epistemic-aware Vision-Language Foundation Model for Fetal Ultrasound Interpretation | Bo Du Team | 2510.12953 | null | |
| 2025-10-14 | Unifying Vision-Language Latents for Zero-label Image Caption Enhancement | Woo Seong Chung Team | 2510.12931 | null | |
| 2025-10-14 | UniFusion: Vision-Language Model as Unified Encoder in Image Generation | Ajinkya Kale Team | 2510.12789 | link | |
| 2025-10-15 | SAIL-Embedding Technical Report: Omni-modal Embedding Foundation Model | Chao Feng Team | 2510.12709 | null | |
| 2025-10-14 | ERA: Transforming VLMs into Embodied Agents via Embodied Prior Learning and Online Reinforcement Learning | Tong Zhang Team | 2510.12693 | null | |
| 2025-10-14 | VISaGE: Understanding Visual Generics and Exceptions | Emily Allaway Team | 2510.12548 | null | |
| 2025-10-14 | A Review of Longitudinal Radiology Report Generation: Dataset Composition, Methods, and Performance Evaluation | Luping Zhou Team | 2510.12444 | null | |
| 2025-10-14 | Towards General Urban Monitoring with Vision-Language Models: A Review, Evaluation, and a Research Agenda | Nuno F. Rodrigues Team | 2510.12400 | null | |
| 2025-10-14 | Vision Language Models Map Logos to Text via Semantic Entanglement in the Visual Projector | Yiwei Wang Team | 2510.12287 | null | |
| 2025-10-14 | Spatial Forcing: Implicit Spatial Representation Alignment for Vision-language-action Model | Haoang Li Team | 2510.12276 | null | |
| 2025-10-14 | HoneyBee: Data Recipes for Vision-Language Reasoners | Ramakanth Pasunuru Team | 2510.12225 | null | |
| 2025-10-14 | Hierarchical Reasoning with Vision-Language Models for Incident Reports from Dashcam Videos | Yu Yamaguchi Team | 2510.12190 | null | |
| 2025-10-14 | ImageSentinel: Protecting Visual Datasets from Unauthorized Retrieval-Augmented Image Generation | Renjie Wan Team | 2510.12119 | null | |
| 2025-10-13 | Embedding the Teacher: Distilling vLLM Preferences for Scalable Image Retrieval | Vyas Raina Team | 2510.12014 | null | |
| 2025-10-13 | Learning Dynamics of VLM Finetuning | Keze Wang Team | 2510.11978 | null | |
| 2025-10-13 | Evaluating Open-Source Vision-Language Models for Multimodal Sarcasm Detection | Marcos Zampieri Team | 2510.11852 | link | |
| 2025-10-13 | Data or Language Supervision: What Makes CLIP Better than DINO? | Serena Yeung-Levy Team | 2510.11835 | null | |
| 2025-10-13 | CodePlot-CoT: Mathematical Visual Reasoning by Thinking with Code-Driven Images | Xihui Liu Team | 2510.11718 | null | |
| 2025-10-13 | Phys2Real: Fusing VLM Priors with Interactive Online Adaptation for Uncertainty-Aware Sim-to-Real Manipulation | Mac Schwager Team | 2510.11689 | null | |
| 2025-10-13 | EvoCAD: Evolutionary CAD Code Generation with Vision Language Models | Niki van Stein Team | 2510.11631 | null | |
| 2025-10-13 | mmWalk: Towards Multi-modal Multi-view Walking Assistance | Rainer Stiefelhagen Team | 2510.11520 | link | |
| 2025-10-13 | Coupled Degradation Modeling and Fusion: A VLM-Guided Degradation-Coupled Network for Degradation-Aware Infrared and Visible Image Fusion | Guangmang Cui Team | 2510.11456 | null | |
| 2025-10-13 | Template-Based Text-to-Image Alignment for Language Accessibility: A Study on Visualizing Text Simplifications | Yingqiang Gao Team | 2510.11314 | null | |
| 2025-10-13 | When Does Supervised Training Pay Off? The Hidden Economics of Object Detection in the Era of Vision-Language Models | Samer Al-Hamadani Team | 2510.11302 | null | |
| 2025-10-13 | $\Delta\mathrm{Energy}$: Optimizing Energy Change During Vision-Language Alignment Improves both OOD Detection and OOD Generalization | Nanyang Ye Team | 2510.11296 | null | |
| 2025-10-13 | Human Uncertainty-Aware Data Selection and Automatic Labeling in Visual Question Answering | Thomas Seidl Team | 2510.11295 | null | |
| 2025-10-13 | Evaluating Reasoning Faithfulness in Medical Vision-Language Models using Multimodal Perturbations | Keno K. Bressem Team | 2510.11196 | null | |
| 2025-10-13 | BLEnD-Vis: Benchmarking Multimodal Cultural Understanding in Vision Language Models | Roy Ka-Wei Lee Team | 2510.11178 | null | |
| 2025-10-13 | Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning | Zhi Hou Team | 2510.11027 | null | |
| 2025-10-13 | GeoVLMath: Enhancing Geometry Reasoning in Vision-Language Models via Cross-Modal Reward for Auxiliary Line Creation | Jing Zhang Team | 2510.11020 | null | |
| 2025-10-13 | COCO-Tree: Compositional Hierarchical Concept Trees for Enhanced Reasoning in Vision Language Models | Aidong Zhang Team | 2510.11012 | null | |
| 2025-10-13 | Catch-Only-One: Non-Transferable Examples for Model-Specific Authorization | Guangdong Bai Team | 2510.10982 | null | |
| 2025-10-13 | Chart-RVR: Reinforcement Learning with Verifiable Rewards for Explainable Chart Reasoning | Aidong Zhang Team | 2510.10973 | null | |
| 2025-10-13 | IUT-Plug: A Plug-in tool for Interleaved Image-Text Generation | Jing Tang Team | 2510.10969 | null | |
| 2025-10-13 | MC#: Mixture Compressor for Mixture-of-Experts Large Models | Xiaojuan Qi Team | 2510.10962 | null | |
| 2025-10-13 | FG-CLIP 2: A Bilingual Fine-grained Vision-Language Alignment Model | Yuhui Yin Team | 2510.10921 | null | |
| 2025-10-13 | Topological Alignment of Shared Vision-Language Embedding Space | Jae-Hun Jung Team | 2510.10889 | null | |
| 2025-10-10 | StreamingVLM: Real-Time Understanding for Infinite Video Streams | Song Han Team | 2510.09608 | null | |
| 2025-10-10 | VITA-VLA: Efficiently Teaching Vision-Language Models to Act via Action Expert Distillation | Caifeng Shan Team | 2510.09607 | link | |
| 2025-10-10 | Vision Language Models: A Survey of 26K Papers | Fengming Lin Team | 2510.09586 | null | |
| 2025-10-10 | D-TPT: Dimensional Entropy Maximization for Calibrating Test-Time Prompt Tuning in Vision-Language Models | Wonjun Hwang Team | 2510.09473 | null | |
| 2025-10-10 | Boosting Multi-modal Keyphrase Prediction with Dynamic Chain-of-Thought in Vision-Language Models | Jiao Ran Team | 2510.09358 | link | |
| 2025-10-10 | Spotlight on Token Perception for Multimodal Reinforcement Learning | Yu Cheng Team | 2510.09285 | link | |
| 2025-10-10 | Hallucination Filtering in Radiology Vision-Language Models Using Discrete Semantic Entropy | Daniel Truhn Team | 2510.09256 | link | |
| 2025-10-10 | Zero-shot image privacy classification with Vision-Language Models | Andrea Cavallaro Team | 2510.09253 | null | |
| 2025-10-10 | Clear Roads, Clear Vision: Advancements in Multi-Weather Restoration for Smart Transportation | Subrahmanyam Murala Team | 2510.09228 | null | |
| 2025-10-10 | MCMC: Bridging Rendering, Optimization and Generative AI | Wenzel Jakob Team | 2510.09078 | null | |
| 2025-10-10 | On Epistemic Uncertainty of Visual Tokens for Object Hallucinations in Large Vision-Language Models | Se Young Chun Team | 2510.09008 | null | |
| 2025-10-10 | Unleashing Perception-Time Scaling to Multimodal Reasoning Models | Minghui Qiu Team | 2510.08964 | null | |
| 2025-10-10 | PHyCLIP: $\ell_1$-Product of Hyperbolic Factors Unifies Hierarchy and Compositionality in Vision-Language Representation Learning | Takashi Matsubara Team | 2510.08919 | null | |
| 2025-10-09 | CDE: Concept-Driven Exploration for Reinforcement Learning | Joseph Campbell Team | 2510.08851 | null | |
| 2025-10-09 | FOLK: Fast Open-Vocabulary 3D Instance Segmentation via Label-guided Knowledge Distillation | Zhihua Wei Team | 2510.08849 | null | |
| 2025-10-09 | D-CoDe: Scaling Image-Pretrained VLMs to Video via Dynamic Compression and Question Decomposition | Yun Fu Team | 2510.08818 | null | |
| 2025-10-09 | Q-Router: Agentic Video Quality Assessment with Expert Model Routing and Artifact Localization | Zhengzhong Tu Team | 2510.08789 | null | |
| 2025-10-09 | MATRIX: Multimodal Agent Tuning for Robust Tool-Use Reasoning | Salman Khan Team | 2510.08567 | null | |
| 2025-10-09 | SpatialLadder: Progressive Training for Spatial Reasoning in Vision-Language Models | Yueting Zhuang Team | 2510.08531 | link | |
| 2025-10-09 | To Sink or Not to Sink: Visual Information Pathways in Large Vision-Language Models | Leonid Sigal Team | 2510.08510 | link | |
| 2025-10-09 | MoA-VR: A Mixture-of-Agents System Towards All-in-One Video Restoration | Guangtao Zhai Team | 2510.08508 | null | |
| 2025-10-09 | The Visual Iconicity Challenge: Evaluating Vision-Language Models on Sign Language Form-Meaning Mapping | Esam Ghaleb Team | 2510.08482 | null | |
| 2025-10-09 | Looking to Learn: Token-wise Dynamic Gating for Low-Resource Vision-Language Modelling | Paula Buttery Team | 2510.08470 | null | |
| 2025-10-09 | VideoVerse: How Far is Your T2V Generator from a World Model? | Lei Zhang Team | 2510.08398 | null | |
| 2025-10-09 | Evaluating Small Vision-Language Models on Distance-Dependent Traffic Perception | Ciaran Eising Team | 2510.08352 | null | |
| 2025-10-09 | Chain-of-Trigger: An Agentic Backdoor that Paradoxically Enhances Agentic Robustness | Hai Zhao Team | 2510.08238 | null | |
| 2025-10-09 | Approximate Domain Unlearning for Vision-Language Models | Go Irie Team | 2510.08132 | null | |
| 2025-10-09 | CIR-CoT: Towards Interpretable Composed Image Retrieval via End-to-End Chain-of-Thought Reasoning | Rongrong Ji Team | 2510.08003 | null | |
| 2025-10-09 | Executable Analytic Concepts as the Missing Link Between VLM Insight and Precise Manipulation | Jianhua Sun Team | 2510.07975 | null | |
| 2025-10-09 | Effective and Stealthy One-Shot Jailbreaks on Deployed Mobile Vision-Language Agents | Jun Zhu Team | 2510.07809 | null | |
| 2025-10-09 | GTR-Bench: Evaluating Geo-Temporal Reasoning in Vision-Language Models | Long Zeng Team | 2510.07791 | null | |
| 2025-10-09 | IntentionVLA: Generalizable and Efficient Embodied Intention Reasoning for Human-Robot Interaction | Liqiang Nie Team | 2510.07778 | null | |
| 2025-10-09 | Multimodal Safety Evaluation in Generative Agent Social Simulations | Bernard Ghanem Team | 2510.07709 | null | |
| 2025-10-09 | Test-Time Matching: Unlocking Compositional Reasoning in Multimodal Models | Fuzhi Tang Team | 2510.07632 | null | |
| 2025-10-08 | Cross-Modal Attention Guided Unlearning in Vision-Language Models | Xintao Wu Team | 2510.07567 | null | |
| 2025-10-08 | Deploying Tiny LVLM Judges for Real-World Evaluation of Chart Models: Lessons Learned and Best Practices | Jimmy Huang Team | 2510.07545 | null | |
| 2025-10-09 | TIGeR: Tool-Integrated Geometric Reasoning in Vision-Language Models for Robotics | Shanghang Zhang Team | 2510.07181 | null | |
| 2025-10-08 | Few-Shot Adaptation Benchmark for Remote Sensing Vision-Language Models | Benoit Macq Team | 2510.07135 | null | |
| 2025-10-08 | TALENT: Table VQA via Augmented Language-Enhanced Natural-text Transcription | Haoyu Wang Team | 2510.07098 | null | |
| 2025-10-08 | Vision-Language-Action Models for Robotics: A Review Towards Real-World Applications | Yuke Zhu Team | 2510.07077 | link | |
| 2025-10-08 | Unified Molecule Pre-training with Flexible 2D and 3D Modalities: Single and Paired Modality Integration | Yuan Fang Team | 2510.07035 | null | |
| 2025-10-08 | Get RICH or Die Scaling: Profitably Trading Inference Compute for Robustness | Brian Bartoldson Team | 2510.06790 | null | |
| 2025-10-08 | TTRV: Test-Time Reinforcement Learning for Vision Language Models | M. Jehanzeb Mirza Team | 2510.06783 | null | |
| 2025-10-08 | ToolMem: Enhancing Multimodal Agents with Learnable Tool Capability Memory | Zora Zhiruo Wang Team | 2510.06664 | null | |
| 2025-10-08 | VUGEN: Visual Understanding priors for GENeration | Jakob Verbeek Team | 2510.06529 | null | |
| 2025-10-07 | ChainMPQ: Interleaved Text-Image Reasoning Chains for Mitigating Relation Hallucinations | Yujun Cai Team | 2510.06292 | null | |
| 2025-10-06 | Surgeons Are Indian Males and Speech Therapists Are White Females: Auditing Biases in Vision-Language Models for Healthcare Professionals | Beenish Moalla Chaudhry Team | 2510.06280 | null | |
| 2025-10-07 | Reasoning under Vision: Understanding Visual-Spatial Cognition in Vision-Language Models for CAPTCHA | Junfeng Yang Team | 2510.06067 | null | |
| 2025-10-07 | Medical Vision Language Models as Policies for Robotic Surgery | Martin Radfar Team | 2510.06064 | null | |
| 2025-10-07 | Data Factory with Minimal Human Effort Using VLMs | Andrew Markham Team | 2510.05722 | null | |
| 2025-10-07 | Activation-Informed Pareto-Guided Low-Rank Compression for Efficient LLM/VLM | Zheng Zhang Team | 2510.05544 | null | |
| 2025-10-06 | Guided Query Refinement: Multimodal Hybrid Retrieval with Test-Time Optimization | Ariel Gera Team | 2510.05038 | null | |
| 2025-10-06 | Efficient Navigation in Unknown Indoor Environments with Vision-Language Models | J. P. How Team | 2510.04991 | null | |
| 2025-10-06 | ViTs: Teaching Machines to See Time Series Anomalies Like Human Experts | Dan Pei Team | 2510.04710 | null | |
| 2025-10-06 | Conditional Representation Learning for Customized Tasks | Xi Peng Team | 2510.04564 | null | |
| 2025-10-06 | More Than Meets the Eye? Uncovering the Reasoning-Planning Disconnect in Training Vision-Language Driving Models | Jun Luo Team | 2510.04532 | null | |
| 2025-10-06 | VaseVQA-3D: Benchmarking 3D VLMs on Ancient Greek Pottery | Hao Tang Team | 2510.04479 | null | |
| 2025-10-06 | MedCLM: Learning to Localize and Reason via a CoT-Curriculum in Medical Vision-Language Models | Gyeongyeon Hwang Team | 2510.04477 | null | |
| 2025-10-06 | A.I.R.: Enabling Adaptive, Iterative, and Reasoning-based Frame Selection For Video Question Answering | Chen Chen Team | 2510.04428 | null | |
| 2025-10-06 | Your Vision-Language Model Can’t Even Count to 20: Exposing the Failures of VLMs in Compositional Counting | Jiahao Zhang Team | 2510.04401 | null | |
| 2025-10-05 | AgentTypo: Adaptive Typographic Prompt Injection Attacks against Black-box Multimodal Agents | Bin Xiao Team | 2510.04257 | null | |
| 2025-10-05 | ContextVLA: Vision-Language-Action Model with Amortized Multi-Frame Context | Jinwoo Shin Team | 2510.04246 | link | |
| 2025-10-05 | Zoom-In to Sort AI-Generated Images Out | Jianfu Zhang Team | 2510.04225 | null | |
| 2025-10-05 | Automating construction safety inspections using a multi-modal vision-language RAG framework | Daniel Dias-da-Costa Team | 2510.04145 | null | |
| 2025-10-07 | AgriGPT-VL: Agricultural Vision-Language Understanding Suite | Shijian Li Team | 2510.04002 | null | |
| 2025-10-04 | No Tokens Wasted: Leveraging Long Context in Biomedical Vision-Language Models | Serena Yeung-Levy Team | 2510.03978 | null | |
| 2025-10-04 | Zero-Shot Fine-Grained Image Classification Using Large Vision-Language Models | Chris Thomas Team | 2510.03903 | null | |
| 2025-10-04 | Bridge Thinking and Acting: Unleashing Physical Potential of VLM with Generalizable Action Expert | Chunhua Shen Team | 2510.03896 | null | |
| 2025-10-04 | Mirage: Unveiling Hidden Artifacts in Synthetic Images with Large Vision-Language Models | Durga Toshniwal Team | 2510.03840 | null | |
| 2025-10-04 | Person-Centric Annotations of LAION-400M: Auditing Bias and Its Transfer to Models | Zeynep Akata Team | 2510.03721 | null | |
| 2025-10-04 | MonitorVLM: A Vision Language Framework for Safety Violation Detection in Mining Operations | Jingliang Duan Team | 2510.03666 | null | |
| 2025-10-03 | Simulation to Rules: A Dual-VLM Framework for Formal Visual Planning | Yang Zhang Team | 2510.03182 | null | |
| 2025-10-03 | SpineBench: A Clinically Salient, Level-Aware Benchmark Powered by the SpineMed-450k Corpus | Caifeng Shan Team | 2510.03160 | null | |
| 2025-10-03 | Multimodal Carotid Risk Stratification with Large Vision-Language Models: Benchmarking, Fine-Tuning, and Clinical Insights | Konstantina Nikita Team | 2510.02922 | null | |
| 2025-10-03 | Zero-Shot Robustness of Vision Language Models Via Confidence-Aware Weighting | Mostafa Tavassolipour Team | 2510.02913 | null | |
| 2025-10-03 | Med-K2N: Flexible K-to-N Modality Translation for Medical Image Synthesis | Xin Gao Team | 2510.02815 | null | |
| 2025-10-03 | MaskCD: Mitigating LVLM Hallucinations by Image Head Masked Contrastive Decoding | Yujiu Yang Team | 2510.02790 | null | |
| 2025-10-03 | OTR: Synthesizing Overlay Text Dataset for Text Removal | Kota Yamaguchi Team | 2510.02787 | link | |
| 2025-10-03 | Reasoning Riddles: How Explainability Reveals Cognitive Limits in Vision-Language Models | Prahitha Movva Team | 2510.02780 | null | |
| 2025-10-03 | AdaRD-key: Adaptive Relevance-Diversity Keyframe Sampling for Long-form Video understanding | Mohammed Bennamoun Team | 2510.02778 | null | |
| 2025-10-03 | Bayesian Test-time Adaptation for Object Recognition and Detection with Vision-language Models | Zhen Lei Team | 2510.02750 | null | |
| 2025-10-03 | Team Xiaomi EV-AD VLA: Caption-Guided Retrieval System for Cross-Modal Drone Navigation – Technical Report for IROS 2025 RoboSense Challenge Track 4 | Xiaoshuai Hao Team | 2510.02728 | null | |
| 2025-10-03 | ARMs: Adaptive Red-Teaming Agent against Multimodal Models with Plug-and-Play Attacks | Bo Li Team | 2510.02677 | null | |
| 2025-10-02 | Exploring OCR-augmented Generation for Bilingual VQA | Sunho Park Team | 2510.02543 | null | |
| 2025-10-02 | Multimodal Function Vectors for Spatial Relations | Hongjing Lu Team | 2510.02528 | null | |
| 2025-10-02 | From Behavioral Performance to Internal Competence: Interpreting Vision-Language Models with VLM-Lens | Freda Shi Team | 2510.02292 | link | |
| 2025-10-02 | microCLIP: Unsupervised CLIP Adaptation via Coarse-Fine Token Fusion for Fine-Grained Image Classification | Muhammad Haris Khan Team | 2510.02270 | null | |
| 2025-10-02 | Say One Thing, Do Another? Diagnosing Reasoning-Execution Gaps in VLM-Powered Mobile-Use Agents | Zhuosheng Zhang Team | 2510.02204 | null | |
| 2025-10-02 | GeoPurify: A Data-Efficient Geometric Distillation Framework for Open-Vocabulary 3D Segmentation | Heng Tao Shen Team | 2510.02186 | null | |
| 2025-10-02 | Unlocking Vision-Language Models for Video Anomaly Detection via Fine-Grained Prompting | Jing Zhang Team | 2510.02155 | null | |
| 2025-10-02 | Nav-EE: Navigation-Guided Early Exiting for Efficient Vision-Language Models in Autonomous Driving | Chun Jason Xue Team | 2510.01795 | null | |
| 2025-10-02 | Accelerating Attention with Basis Decomposition | Jialin Zhao Team | 2510.01718 | null | |
| 2025-10-02 | Contrastive Representation Regularization for Vision-Language-Action Models | Jinwoo Shin Team | 2510.01711 | null | |
| 2025-10-02 | VaPR – Vision-language Preference alignment for Reasoning | Nanyun Peng Team | 2510.01700 | null | |
| 2025-10-02 | Look Less, Reason More: Rollout-Guided Adaptive Pixel-Space Reasoning | Wentao Zhang Team | 2510.01681 | null | |
| 2025-10-02 | Source-Free Cross-Domain Continual Learning | Kutluyil Dogancay Team | 2510.01649 | null | |
| 2025-10-02 | FailSafe: Reasoning and Recovery from Failures in Vision-Language-Action Models | Bihan Wen Team | 2510.01642 | link | |
| 2025-10-02 | ImageNet-Think-250K: A Large-Scale Synthetic Dataset for Multimodal Reasoning for Vision Language Models | Murali Emani Team | 2510.01582 | null | |
| 2025-10-03 | Understanding Adversarial Transfer: Why Representation-Space Attacks Fail Where Data-Space Attacks Succeed | Sanmi Koyejo Team | 2510.01494 | null | |
| 2025-10-01 | VL-KnG: Visual Scene Understanding for Navigation Goal Identification using Spatiotemporal Knowledge Graphs | Gonzalo Ferrer Team | 2510.01483 | null | |
| 2025-10-01 | Data Selection for Fine-tuning Vision Language Models via Cross Modal Alignment Trajectories | Baharan Mirzasoleiman Team | 2510.01454 | link | |
| 2025-10-01 | GeoSURGE: Geo-localization using Semantic Fusion with Hierarchy of Geographic Embeddings | Rakesh Kumar Team | 2510.01448 | null | |
| 2025-10-01 | VENTURA: Adapting Image Diffusion Models for Unified Task Conditioned Navigation | Amirreza Shaban Team | 2510.01388 | null | |
| 2025-10-01 | Agentic Jigsaw Interaction Learning for Enhancing Visual Perception and Reasoning in Vision-Language Models | Feng Zhao Team | 2510.01304 | null | |
| 2025-10-01 | Dirichlet-Prior Shaping: Guiding Expert Specialization in Upcycled MoEs | Paul Whatmough Team | 2510.01185 | null | |
| 2025-09-30 | MLA: A Multisensory Language-Action Model for Multimodal Understanding and Forecasting in Robotic Manipulation | Shanghang Zhang Team | 2509.26642 | null | |
| 2025-09-30 | Query-Kontext: An Unified Multimodal Model for Image Generation and Editing | Jingdong Wang Team | 2509.26641 | null | |
| 2025-09-30 | Clarification as Supervision: Reinforcement Learning for Vision-Language Interfaces | Ivan Titov Team | 2509.26594 | null | |
| 2025-09-30 | The Invisible Mentor: Inferring User Actions from Screen Recordings to Recommend Better Workflows | Emerson Murphy-Hill Team | 2509.26557 | null | |
| 2025-09-30 | Stable Cinemetrics: Structured Taxonomy and Evaluation for Professional Video Generation | Varun Jampani Team | 2509.26555 | link | |
| 2025-09-30 | Zero-Shot Decentralized Federated Learning | Giovanni Bellitto Team | 2509.26462 | link | |
| 2025-09-30 | SQUARE: Semantic Query-Augmented Fusion and Efficient Batch Reranking for Training-free Zero-Shot Composed Image Retrieval | Huei-Fang Yang Team | 2509.26330 | null | |
| 2025-09-30 | ProfVLM: A Lightweight Video-Language Model for Multi-View Proficiency Estimation | Antonio Liotta Team | 2509.26278 | null | |
| 2025-09-30 | Interpret, prune and distill Donut: towards lightweight VLMs for VQA on document | David Naccache Team | 2509.26235 | null | |
| 2025-09-30 | TSalV360: A Method and Dataset for Text-driven Saliency Detection in 360-Degrees Videos | Vasileios Mezaris Team | 2509.26208 | link | |
| 2025-09-30 | SGS: Segmentation-Guided Scoring for Global Scene Inconsistencies | Xue Li Team | 2509.26039 | null | |
| 2025-10-01 | AgenticIQA: An Agentic Framework for Adaptive and Interpretable Image Quality Assessment | Weisi Lin Team | 2509.26006 | null | |
| 2025-09-30 | Learning Egocentric In-Hand Object Segmentation through Weak Supervision from Human Narrations | Antonino Furnari Team | 2509.26004 | null | |
| 2025-09-30 | Towards Unified Multimodal Misinformation Detection in Social Media: A Benchmark Dataset and Baseline | Zhun Zhong Team | 2509.25991 | null | |
| 2025-09-30 | NuRisk: A Visual Question Answering Dataset for Agent-Level Risk Assessment in Autonomous Driving | Johannes Betz Team | 2509.25944 | null | |
| 2025-09-30 | VLM-FO1: Bridging the Gap Between High-Level Reasoning and Fine-Grained Perception in VLMs | Tiancheng Zhao Team | 2509.25916 | null | |
| 2025-10-01 | LLaVAShield: Safeguarding Multimodal Multi-Turn Dialogues in Vision-Language Models | Yongjun Shen Team | 2509.25896 | null | |
| 2025-09-30 | DeepSketcher: Internalizing Visual Manipulation for Multimodal Reasoning | Jing Zhang Team | 2509.25866 | null | |
| 2025-09-30 | MAPLE: Multi-scale Attribute-enhanced Prompt Learning for Few-shot Whole Slide Image Classification | Daoqiang Zhang Team | 2509.25863 | null | |
| 2025-09-30 | Reinforced Embodied Planning with Verifiable Reward for Real-World Robotic Manipulation | Hao Chen Team | 2509.25852 | null | |
| 2025-09-29 | TemMed-Bench: Evaluating Temporal Medical Image Reasoning in Vision-Language Models | Nanyun Peng Team | 2509.25143 | null | |
| 2025-09-29 | Visual serial processing deficits explain divergences in human and VLM reasoning | Thomas L. Griffiths Team | 2509.25142 | null | |
| 2025-09-29 | GeoVLM-R1: Reinforcement Fine-Tuning for Improved Remote Sensing Reasoning | Salman Khan Team | 2509.25026 | link | |
| 2025-09-29 | World-Env: Leveraging World Model as a Virtual Environment for VLA Post-Training | Qing Zhang Team | 2509.24948 | null | |
| 2025-09-29 | From Code to Action: Hierarchical Learning of Diffusion-VLM Policies | Daniel Dijkman Team | 2509.24917 | null | |
| 2025-09-29 | Training-Free Token Pruning via Zeroth-Order Gradient Estimation in Vision-Language Models | Sungeun Hong Team | 2509.24837 | null | |
| 2025-09-29 | IA-VLA: Input Augmentation for Vision-Language-Action models in settings with semantically complex tasks | Ville Kyrki Team | 2509.24768 | null | |
| 2025-09-29 | IWR-Bench: Can LVLMs reconstruct interactive webpage from a user interaction video? | Botian Shi Team | 2509.24709 | null | |
| 2025-09-29 | Can you SPLICE it together? A Human Curated Benchmark for Probing Visual Reasoning in VLMs | Elia Bruni Team | 2509.24640 | null | |
| 2025-09-30 | Inducing Dyslexia in Vision Language Models | Martin Schrimpf Team | 2509.24597 | null | |
| 2025-09-29 | TokenSwap: Backdoor Attack on the Compositional Understanding of Large Vision-Language Models | Joey Tianyi Zhou Team | 2509.24566 | null | |
| 2025-09-29 | CORE-3D: Context-aware Open-vocabulary Retrieval by Embeddings in 3D | Matin Mirzababaei Team | 2509.24528 | null | |
| 2025-09-29 | PhysiAgent: An Embodied Agent Framework in Physical World | Xianyuan Zhan Team | 2509.24524 | null | |
| 2025-09-29 | GRPO-MA: Multi-Answer Generation in GRPO for Stable and Efficient Chain-of-Thought Training | Hao Dong Team | 2509.24494 | null | |
| 2025-09-29 | Euclid’s Gift: Enhancing Spatial Perception and Reasoning in Vision-Language Models via Geometric Surrogate Tasks | Kai Chen Team | 2509.24473 | null | |
| 2025-09-29 | AXIS: Explainable Time Series Anomaly Detection with Large Language Models | Chen Zhang Team | 2509.24378 | null | |
| 2025-09-29 | SONAR: Semantic-Object Navigation with Aggregated Reasoning through a Cross-Modal Inference Paradigm | Jiankun Wang Team | 2509.24321 | null | |
| 2025-09-30 | FrameThinker: Learning to Think with Long Videos via Multi-Turn Frame Spotlighting | Yu Cheng Team | 2509.24304 | null | |
| 2025-09-29 | ViReSkill: Vision-Grounded Replanning with Skill Memory for LLM-Based Planning in Lifelong Robot Learning | Yang You Team | 2509.24219 | null | |
| 2025-09-29 | Talk in Pieces, See in Whole: Disentangling and Hierarchical Aggregating Representations for Language-based Object Detection | Donghyun Kim Team | 2509.24192 | null | |
| 2025-09-26 | See, Point, Fly: A Learning-Free VLM Framework for Universal Unmanned Aerial Navigation | Yu-Lun Liu Team | 2509.22653 | link | |
| 2025-09-26 | CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning | Dahua Lin Team | 2509.22647 | link | |
| 2025-09-26 | Hierarchical Representation Matching for CLIP-based Class-Incremental Learning | Da-Wei Zhou Team | 2509.22645 | null | |
| 2025-09-26 | WoW: Towards a World omniscient World model Through Embodied Interaction | Jian Tang Team | 2509.22642 | null | |
| 2025-09-26 | SPARK: Synergistic Policy And Reward Co-Evolving Framework | Jiaqi Wang Team | 2509.22624 | link | |
| 2025-09-26 | Color Names in Vision-Language Models | Javier Vazquez-Corral Team | 2509.22524 | null | |
| 2025-09-26 | Guiding Evolution of Artificial Life Using Vision-Language Models | Frederico Wieser Team | 2509.22447 | null | |
| 2025-09-26 | Chimera: Diagnosing Shortcut Learning in Visual-Language Understanding | Mrinmaya Sachan Team | 2509.22437 | link | |
| 2025-09-26 | RAU: Reference-based Anatomical Understanding with Vision Language Models | Shanhui Sun Team | 2509.22404 | null | |
| 2025-09-26 | Zero-Effort Image-to-Music Generation: An Interpretable RAG-based VLM Approach | Zijing Zhou Team | 2509.22378 | null | |
| 2025-09-26 | Rule-Based Reinforcement Learning for Document Image Classification with Vision Language Models | Andreas Fischer Team | 2509.22283 | link | |
| 2025-09-26 | Beyond Classification Accuracy: Neural-MedBench and the Need for Deeper Reasoning Benchmarks | Shangyang Li Team | 2509.22258 | null | |
| 2025-09-26 | A Tale of Two Experts: Cooperative Learning for Source-Free Unsupervised Domain Adaptation | Cheng Deng Team | 2509.22229 | null | |
| 2025-09-26 | Polysemous Language Gaussian Splatting via Matching-based Mask Lifting | Ge Li Team | 2509.22225 | null | |
| 2025-09-26 | Towards Faithful Reasoning in Remote Sensing: A Perceptually-Grounded GeoSpatial Chain-of-Thought for Vision-Language Models | Bo Yang Team | 2509.22221 | null | |
| 2025-09-26 | Actions as Language: Fine-Tuning VLMs into VLAs Without Catastrophic Forgetting | Anirudha Majumdar Team | 2509.22195 | null | |
| 2025-09-26 | MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing | Conghui He Team | 2509.22186 | link | |
| 2025-09-26 | Multilingual Vision-Language Models, A Survey | Jindřich Libovický Team | 2509.22123 | null | |
| 2025-09-26 | Lightweight Structured Multimodal Reasoning for Clinical Scene Understanding in Robotics | Stefan K. Ehrlich Team | 2509.22014 | null | |
| 2025-09-26 | CoFFT: Chain of Foresight-Focus Thought for Visual Language Models | Mike Zheng Shou Team | 2509.22010 | null | |
| 2025-09-25 | Nova: Real-Time Agentic Vision-Language Model Serving with Adaptive Cross-Stage Parallelization | Guihai Chen Team | 2509.21301 | null | |
| 2025-09-25 | DisCoCLIP: A Distributional Compositional Tensor Network Encoder for Vision-Language Understanding | Mehrnoosh Sadrzadeh Team | 2509.21287 | null | |
| 2025-09-25 | Un-Doubling Diffusion: LLM-guided Disambiguation of Homonym Duplication | Alexander Nagaev Team | 2509.21262 | null | |
| 2025-09-25 | Hallucination as an Upper Bound: A New Perspective on Text-to-Image Evaluation | Mohammad Hossein Rohban Team | 2509.21257 | null | |
| 2025-09-25 | Learning to Look: Cognitive Attention Alignment with Vision-Language Models | Nidhi Rastogi Team | 2509.21247 | null | |
| 2025-09-25 | TABLET: A Large-Scale Dataset for Robust Visual Table Understanding | Mirella Lapata Team | 2509.21205 | null | |
| 2025-09-25 | Human-like Navigation in a World Built for Humans | Shenlong Wang Team | 2509.21189 | link | |
| 2025-09-25 | Can Less Precise Be More Reliable? A Systematic Evaluation of Quantization’s Impact on CLIP Beyond Accuracy | Chokri Mraidha Team | 2509.21173 | null | |
| 2025-09-25 | Teaching RL Agents to Act Better: VLM as Action Advisor for Online Reinforcement Learning | Mingyu Hu Team | 2509.21126 | null | |
| 2025-09-25 | Cross-Modal Instructions for Robot Motion Generation | Weiming Zhi Team | 2509.21107 | null | |
| 2025-09-25 | Mammo-CLIP Dissect: A Framework for Analysing Mammography Concepts in Vision-Language Models | Robert Jenssen Team | 2509.21102 | null | |
| 2025-09-25 | SoM-1K: A Thousand-Problem Benchmark Dataset for Strength of Materials | Lu Cheng Team | 2509.21079 | null | |
| 2025-09-25 | Unlocking Financial Insights: An advanced Multimodal Summarization with Multimodal Output Framework for Financial Advisory Videos | Alka Maurya Team | 2509.20961 | null | |
| 2025-09-25 | Decoding the Surgical Scene: A Scoping Review of Scene Graphs in Surgery | M. Ali Nasseri Team | 2509.20941 | null | |
| 2025-09-25 | MTRDrive: Memory-Tool Synergistic Reasoning for Robust Autonomous Driving in Corner Cases | Diange Yang Team | 2509.20843 | null | |
| 2025-09-25 | DAC-LoRA: Dynamic Adversarial Curriculum for Efficient and Robust Few-Shot Adaptation | Ved Umrajkar Team | 2509.20792 | null | |
| 2025-09-25 | Provenance Analysis of Archaeological Artifacts via Multimodal RAG Systems | Ruiliang Liu Team | 2509.20769 | null | |
| 2025-09-25 | Seeing Through Words, Speaking Through Pixels: Deep Representational Alignment Between Vision and Language Models | Meenakshi Khosla Team | 2509.20751 | null | |
| 2025-09-25 | Recov-Vision: Linking Street View Imagery and Vision-Language Models for Post-Disaster Recovery | Ali Mostafavi Team | 2509.20628 | null | |
| 2025-09-24 | InstructVTON: Optimal Auto-Masking and Natural-Language-Guided Interactive Style Control for Inpainting-Based Virtual Try-On | Karim Bouyarmane Team | 2509.20524 | null | |
| 2025-09-24 | A co-evolving agentic AI system for medical imaging analysis | Zhi Huang Team | 2509.20279 | null | |
| 2025-09-24 | Universal Camouflage Attack on Vision-Language Models for Autonomous Driving | Wenqi Ren Team | 2509.20196 | null | |
| 2025-09-24 | EchoBench: Benchmarking Sycophancy in Medical Large Vision-Language Models | Dacheng Tao Team | 2509.20146 | null | |
| 2025-09-24 | A Simple Data Augmentation Strategy for Text-in-Image Scientific VQA | Yova Kementchedjhieva Team | 2509.20119 | null | |
| 2025-09-24 | Discrete Diffusion for Reflective Vision-Language-Action Models in Autonomous Driving | Xianpeng Lang Team | 2509.20109 | null | |
| 2025-09-24 | Queryable 3D Scene Representation: A Multi-Modal Framework for Semantic Reasoning and Robotic Task Planning | Jiajun Liu Team | 2509.20077 | null | |
| 2025-09-25 | OmniScene: Attention-Augmented Multimodal 4D Scene Understanding for Autonomous Driving | Jun Ma Team | 2509.19973 | null | |
| 2025-09-24 | Generalist Robot Manipulation beyond Action Labeled Data | Danda Pani Paudel Team | 2509.19958 | null | |
| 2025-09-24 | Benchmarking Gaslighting Attacks Against Speech Large Language Models | Pan Zhou Team | 2509.19858 | null | |
| 2025-09-24 | CHURRO: Making History Readable with an Open-Weight Large Vision-Language Model for High-Accuracy, Low-Cost Historical Text Recognition | Monica S. Lam Team | 2509.19768 | null | |
| 2025-09-24 | Logics-Parsing Technical Report | Minggang Wu Team | 2509.19760 | null | |
| 2025-09-24 | Formal Safety Verification and Refinement for Generative Motion Planners via Certified Local Stabilization | Glen Chou Team | 2509.19688 | null | |
| 2025-09-24 | Bias in the Picture: Benchmarking VLMs with Social-Cue News Images and LLM-as-Judge Assessment | Shaina Raza Team | 2509.19659 | null | |
| 2025-09-23 | Anatomy of a Feeling: Narrating Embodied Emotions via Large Vision-Language Models | Tianyu Jiang Team | 2509.19595 | null | |
| 2025-09-23 | iFinder: Structured Zero-Shot Vision-Based LLM Grounding for Dash-Cam Video Reasoning | Abhishek Aich Team | 2509.19552 | null | |
| 2025-09-23 | Score the Steps, Not Just the Goal: VLM-Based Subgoal Evaluation for Robotic Manipulation | Chi-Guhn Lee Team | 2509.19524 | null | |
| 2025-09-23 | DRISHTIKON: A Multimodal Multilingual Benchmark for Testing Language Models’ Understanding on Indian Culture | Sriparna Saha Team | 2509.19274 | null | |
| 2025-09-23 | Long Story Short: Disentangling Compositionality and Long-Caption Understanding in VLMs | Yova Kementchedjhieva Team | 2509.19207 | null | |
| 2025-09-23 | Vision-Free Retrieval: Rethinking Multimodal Search with Textual Scene Descriptions | Georgios Tzimiropoulos Team | 2509.19203 | null | |
| 2025-09-23 | Reading Images Like Texts: Sequential Image Understanding in Vision-Language Models | Xiaojie Wang Team | 2509.19191 | null | |
| 2025-09-23 | FUNCanon: Learning Pose-Aware Action Primitives via Functional Object Canonicalization for Generalizable Robotic Manipulation | Jianwei Zhang Team | 2509.19102 | link | |
| 2025-09-23 | ColorBlindnessEval: Can Vision-Language Models Pass Color Blindness Tests? | Jiahao Cui Team | 2509.19070 | null | |
| 2025-09-23 | Pure Vision Language Action (VLA) Models: A Comprehensive Survey | Qingguo Zhou Team | 2509.19012 | null | |
| 2025-09-23 | Unveiling Chain of Step Reasoning for Vision-Language Models with Fine-grained Rewards | Xinlong Wang Team | 2509.19003 | null | |
| 2025-09-23 | No Labels Needed: Zero-Shot Image Classification with Collaborative Self-Learning | Joel Luís Carbonera Team | 2509.18938 | null | |
| 2025-09-23 | How Far are VLMs from Visual Spatial Intelligence? A Benchmark-Driven Perspective | Huchuan Lu Team | 2509.18905 | null | |
| 2025-09-23 | Benchmarking Vision-Language and Multimodal Large Language Models in Zero-shot and Few-shot Scenarios: A study on Christian Iconography | Giovanni Colavizza Team | 2509.18839 | null | |
| 2025-09-23 | Bi-VLM: Pushing Ultra-Low Precision Post-Training Quantization Boundaries in Vision-Language Models | Dinesh Manocha Team | 2509.18763 | null | |
| 2025-09-23 | Knowledge Transfer from Interaction Learning | Shugong Xu Team | 2509.18733 | null | |
| 2025-09-23 | What Makes You Unique? Attribute Prompt Composition for Object Re-Identification | Huchuan Lu Team | 2509.18715 | null | |
| 2025-09-23 | RSVG-ZeroOV: Exploring a Training-Free Framework for Zero-Shot Open-Vocabulary Visual Grounding in Remote Sensing Images | Quan Wang Team | 2509.18711 | null | |
| 2025-09-23 | NaviSense: A Multimodal Assistive Mobile application for Object Retrieval by Persons with Visual Impairment | Vijaykrishnan Narayanan Team | 2509.18672 | null | |
| 2025-09-23 | Learning neuroimaging models from health system-scale data | Todd Hollon Team | 2509.18638 | null | |
| 2025-09-23 | SINGER: An Onboard Generalist Vision-Language Navigation Policy for Drones | Mac Schwager Team | 2509.18610 | null | |
| 2025-09-23 | VLN-Zero: Rapid Exploration and Cache-Enabled Neurosymbolic Vision-Language Planning for Zero-Shot Transfer in Robot Navigation | Ufuk Topcu Team | 2509.18592 | link | |
| 2025-09-22 | Losing the Plot: How VLM responses degrade on imperfect charts | Mahantesh Halappanavar Team | 2509.18425 | null | |
| 2025-09-22 | NeuS-QA: Grounding Long-Form Video Understanding in Temporal Logic and Neuro-Symbolic Reasoning | Sandeep Chinchali Team | 2509.18041 | null | |
| 2025-09-22 | Robust and Resilient Soft Robotic Object Insertion with Compliance-Enabled Contact Formation and Failure Recovery | Yoshitaka Ushiku Team | 2509.17666 | null | |
| 2025-09-22 | SD-VLM: Spatial Measuring and Understanding with Depth-Encoded Vision-Language Models | Jieping Ye Team | 2509.17664 | null | |
| 2025-09-22 | From Benchmarks to Reality: Advancing Visual Anomaly Detection by the VAND 3.0 Challenge | Paula Ramos Team | 2509.17615 | null | |
| 2025-09-22 | COLA: Context-aware Language-driven Test-time Adaptation | Zhihe Lu Team | 2509.17598 | null | |
| 2025-09-22 | Interpreting Attention Heads for Image-to-Text Information Flow in Large Vision-Language Models | Seong Jae Hwang Team | 2509.17588 | null | |
| 2025-09-23 | Visual Instruction Pretraining for Domain-Specific Foundation Models | Jian Yang Team | 2509.17562 | null | |
| 2025-09-22 | ChartHal: A Fine-grained Framework Evaluating Hallucination of Large Vision Language Models in Chart Understanding | Xiaoyu Qin Team | 2509.17481 | null | |
| 2025-09-22 | Training-Free Label Space Alignment for Universal Domain Adaptation | Donghyun Kim Team | 2509.17452 | null | |
| 2025-09-23 | Multi-scale Temporal Prediction via Incremental Generation and Multi-agent Collaboration | Yueming Jin Team | 2509.17429 | null | |
| 2025-09-22 | Vision Language Models Are Not (Yet) Spelling Correctors | Bojun Zhang Team | 2509.17418 | null | |
| 2025-09-22 | Mano Report | Shuo Wang Team | 2509.17336 | null | |
| 2025-09-22 | UIPro: Unleashing Superior Interaction Capability For GUI Agents | Zhaoxiang Zhang Team | 2509.17328 | null | |
| 2025-09-22 | OpenGVL - Benchmarking Visual Temporal Progress for Data Curation | Krzysztof Walas Team | 2509.17321 | null | |
| 2025-09-21 | FlagEval Findings Report: A Preliminary Evaluation of Large Reasoning Models on Automatically Verifiable Textual and Visual Questions | Zhongyuan Wang Team | 2509.17177 | null | |
| 2025-09-21 | MoCLIP-Lite: Efficient Video Recognition by Fusing CLIP with Motion Vectors | Soumyabrata Dev Team | 2509.17084 | null | |
| 2025-09-21 | CardiacCLIP: Video-based CLIP Adaptation for LVEF Prediction in a Few-shot Manner | Xiaomeng Li Team | 2509.17065 | null | |
| 2025-09-21 | AgriDoctor: A Multimodal Intelligent Assistant for Agriculture | Liang Wang Team | 2509.17044 | null | |
| 2025-09-21 | Orchestrate, Generate, Reflect: A VLM-Based Multi-Agent Collaboration Framework for Automated Driving Policy Learning | Jun Ma Team | 2509.17042 | null | |
| 2025-09-21 | When Color-Space Decoupling Meets Diffusion for Adverse-Weather Image Restoration | Jun Li Team | 2509.17024 | null | |
| 2025-09-19 | Robust Vision-Language Models via Tensor Decomposition: A Defense Against Adversarial Attacks | Evangelos E. Papalexakis Team | 2509.16163 | null | |
| 2025-09-19 | Randomized Smoothing Meets Vision-Language Models | Chih-Hong Cheng Team | 2509.16088 | null | |
| 2025-09-19 | I-FailSense: Towards General Robotic Failure Detection with Vision-Language Models | Mohamed Chetouani Team | 2509.16072 | null | |
| 2025-09-19 | Compose by Focus: Scene Graph-based Atomic Skills | Heng Yang Team | 2509.16053 | null | |
| 2025-09-19 | CIDER: A Causal Cure for Brand-Obsessed Text-to-Image Models | Wushao Wen Team | 2509.15803 | null | |
| 2025-09-19 | Vision-Language Models as Differentiable Semantic and Spatial Rewards for Text-to-3D Generation | He Sun Team | 2509.15772 | null | |
| 2025-09-19 | GUI-ReWalk: Massive Data Generation for GUI Agent via Stochastic Exploration and Intent-Aware Reasoning | Zhaojian Li Team | 2509.15738 | null | |
| 2025-09-19 | Training-Free Pyramid Token Pruning for Efficient Large Vision-Language Models via Region, Token, and Instruction-Guided Importance | Xiangyang Xue Team | 2509.15704 | null | |
| 2025-09-19 | ORIC: Benchmarking Object Recognition in Incongruous Context for Large Vision-Language Models | Hao Su Team | 2509.15695 | null | |
| 2025-09-19 | SightSound-R1: Cross-Modal Reasoning Distillation from Vision to Audio Language Models | Nima Mesgarani Team | 2509.15661 | null | |
| 2025-09-19 | PRIMT: Preference-based Reinforcement Learning with Multimodal Feedback and Trajectory Synthesis from Foundation Models | Byung-Cheol Min Team | 2509.15607 | null | |
| 2025-09-18 | SmolRGPT: Efficient Spatial Reasoning for Warehouse Environments with 600M Parameters | Andy Couturier Team | 2509.15490 | null | |
| 2025-09-18 | Comparing Computational Pathology Foundation Models using Representational Similarity Analysis | William Lotter Team | 2509.15482 | null | |
| 2025-09-18 | ORCA: Agentic Reasoning For Hallucination and Adversarial Robustness in Vision-Language Models | Nathaniel D. Bastian Team | 2509.15435 | null | |
| 2025-09-18 | SERVAL: Surprisingly Effective Zero-Shot Visual Document Retrieval Powered by Large Vision and Language Models | Andrew Yates Team | 2509.15432 | null | |
| 2025-09-18 | CoDoL: Conditional Domain Prompt Learning for Out-of-Distribution Generalization | Xin Lin Team | 2509.15330 | null | |
| 2025-09-18 | Calibration-Aware Prompt Learning for Medical Vision-Language Models | Muhammad Haris Khan Team | 2509.15226 | null | |
| 2025-09-19 | ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data | Wenhai Wang Team | 2509.15221 | null | |
| 2025-09-18 | What’s the Best Way to Retrieve Slides? A Comparative Study of Multimodal, Caption-Based, and Hybrid Retrieval Techniques | Grigorios Tsoumakas Team | 2509.15211 | null | |
| 2025-09-18 | MedFact-R1: Towards Factual Medical Reasoning via Pseudo-Label Augmentation | Guodong Ding Team | 2509.15154 | null | |
| 2025-09-18 | Forecasting and Visualizing Air Quality from Sky Images with Vision-Language Models | Yanqing Zhang Team | 2509.15076 | null | |
| 2025-09-18 | QuizRank: Picking Images by Quizzing VLMs | Eytan Adar Team | 2509.15059 | null | |
| 2025-09-18 | PRISM: Product Retrieval In Shopping Carts using Hybrid Matching | Jiajing Chen Team | 2509.14985 | null | |
| 2025-09-18 | EchoVLM: Dynamic Mixture-of-Experts Vision-Language Model for Universal Ultrasound Intelligence | Qinghua Huang Team | 2509.14977 | null | |
| 2025-09-19 | Affordance-Based Disambiguation of Surgical Instructions for Collaborative Robot-Assisted Surgery | Yasuhisa Hasegawa Team | 2509.14967 | null | |
| 2025-09-18 | MARIC: Multi-Agent Reasoning for Image Classification | Seunghyun Lee Team | 2509.14860 | null | |
| 2025-09-18 | V-SEAM: Visual Semantic Editing and Attention Modulating for Causal Interpretability of Vision-Language Models | Ming Jiang Team | 2509.14837 | null | |
| 2025-09-18 | Frame Sampling Strategies Matter: A Benchmark for small vision language models | Mounim A. El Yacoubi Team | 2509.14769 | null | |
| 2025-09-18 | Do Vision-Language Models See Urban Scenes as People Do? An Urban Perception Benchmark | Rashid Mushkani Team | 2509.14574 | null | |
| 2025-09-18 | VisMoDAl: Visual Analytics for Evaluating and Improving Corruption Robustness of Vision-Language Models | Yuxin Ma Team | 2509.14571 | null | |
| 2025-09-17 | CRAFT: Coaching Reinforcement Learning Autonomously using Foundation Models for Multi-Robot Coordination Tasks | Negar Mehr Team | 2509.14380 | null | |
| 2025-09-17 | Cinéaste: A Fine-grained Contextual Movie Question Answering Benchmark | Vishal M. Patel Team | 2509.14227 | null | |
| 2025-09-19 | TGPO: Tree-Guided Preference Optimization for Robust Web Agent Reinforcement Learning | Guitao Cao Team | 2509.14172 | null | |
| 2025-09-17 | VSE-MOT: Multi-Object Tracking in Low-Quality Video Scenes Guided by Visual Semantic Enhancement | Fei Richard Yu Team | 2509.14060 | null | |
| 2025-09-17 | Can Current AI Models Count What We Mean, Not What They See? A Benchmark and Systematic Evaluation | Minh Hoai Team | 2509.13939 | null | |
| 2025-09-17 | Towards Rationale-Answer Alignment of LVLMs via Self-Rationale Calibration | Xiaoqiang Li Team | 2509.13919 | null | |
| 2025-09-17 | EDITS: Enhancing Dataset Distillation with Implicit Textual Semantics | Jielei Wang Team | 2509.13858 | null | |
| 2025-09-17 | Diving into Mitigating Hallucinations from a Vision Perspective for Large Vision-Language Models | Longwen Gao Team | 2509.13836 | null | |
| 2025-09-17 | Iterative Prompt Refinement for Safer Text-to-Image Generation | Byung-Jun Lee Team | 2509.13760 | null | |
| 2025-09-17 | Reinforcement Learning for Robotic Insertion of Flexible Cables in Industrial Settings | Changjoo Nam Team | 2509.13731 | null | |
| 2025-09-17 | DREAM: Domain-aware Reasoning for Efficient Autonomous Underwater Monitoring | Xiaomin Lin Team | 2509.13666 | null | |
| 2025-09-16 | Intelligent Healthcare Imaging Platform: An VLM-Based Framework for Automated Medical Image Analysis and Clinical Report Generation | Samer Al-Hamadani Team | 2509.13590 | null | |
| 2025-09-16 | Using Visual Language Models to Control Bionic Hands: Assessment of Object Perception and Grasp Inference | Cedomir Stefanovic Team | 2509.13572 | null | |
| 2025-09-16 | EdiVal-Agent: An Object-Centric Framework for Automated, Scalable, Fine-Grained Evaluation of Multi-Turn Editing | Mingyuan Zhou Team | 2509.13399 | null | |
| 2025-09-16 | 3D Aware Region Prompted Vision Language Model | Sifei Liu Team | 2509.13317 | link | |
| 2025-09-16 | Image Realness Assessment and Localization with Multimodal Features | Somdyuti Paul Team | 2509.13289 | null | |
| 2025-09-16 | ChartGaze: Enhancing Chart Understanding in LVLMs with Eye-Tracking Guided Attention Refinement | Giuseppe Carenini Team | 2509.13282 | null | |
| 2025-09-16 | RadGame: An AI-Powered Platform for Radiology Education | Pranav Rajpurkar Team | 2509.13270 | null | |
| 2025-09-16 | HERO: Rethinking Visual Token Early Dropping in High-Resolution Large Vision-Language Models | Xiangyang Xue Team | 2509.13067 | null | |
| 2025-09-16 | Perception Before Reasoning: Two-Stage Reinforcement Learning for Visual Reasoning in Vision-Language Models | Jingdong Wang Team | 2509.13031 | null | |
| 2025-09-16 | Cross-Layer Vision Smoothing: Enhancing Visual Understanding via Sustained Focus on Key Objects in Large Vision-Language Models | Chong Feng Team | 2509.12897 | null | |
| 2025-09-16 | Benchmarking and Improving LVLMs on Event Extraction from Multimedia Documents | Haiyang Zhang Team | 2509.12876 | null | |
| 2025-09-16 | Defense-to-Attack: Bypassing Weak Defenses Enables Stronger Jailbreaks in Vision-Language Models | Xingjun Ma Team | 2509.12724 | null | |
| 2025-09-16 | AsyMoE: Leveraging Modal Asymmetry for Enhanced Expert Specialization in Large Vision-Language Models | Jin Huang Team | 2509.12715 | null | |
| 2025-09-15 | Evaluating Robustness of Vision-Language Models Under Noisy Conditions | Alireza Team | 2509.12492 | null | |
| 2025-09-15 | An integrated process for design and control of lunar robotics using AI and simulation | Martin Servin Team | 2509.12367 | null | |
| 2025-09-15 | Open-ended Hierarchical Streaming Video Understanding with Vision Language Models | Seon Joo Kim Team | 2509.12145 | null | |
| 2025-09-15 | Look Again, Think Slowly: Enhancing Visual Reflection in Vision-Language Models | Jiajun Zhang Team | 2509.12132 | null | |
| 2025-09-16 | Embodied Navigation Foundation Model | He Wang Team | 2509.12129 | link | |
| 2025-09-15 | Lost in Embeddings: Information Loss in Vision-Language Models | Anders Søgaard Team | 2509.11986 | null | |
| 2025-09-15 | Spec-LLaVA: Accelerating Vision-Language Models with Dynamic Tree-Based Speculative Decoding | Yijun Chen Team | 2509.11961 | null | |
| 2025-09-15 | Bridging Vision Language Models and Symbolic Grounding for Video Question Answering | Daisy Zhe Wang Team | 2509.11862 | null | |
| 2025-09-15 | Synthetic Captions for Open-Vocabulary Zero-Shot Segmentation | Michael Louis Iuzzolino Team | 2509.11840 | null | |
| 2025-09-15 | SpecVLM: Fast Speculative Decoding in Vision-Language Models | Emad Barsoum Team | 2509.11815 | null | |
| 2025-09-15 | Igniting VLMs toward the Embodied Space | Zach Xu Team | 2509.11766 | null | |
| 2025-09-15 | EMeRALDS: Electronic Medical Record Driven Automated Lung Nodule Detection and Classification in Thoracic CT Images | Syed Muhammad Anwar Team | 2509.11714 | null | |
| 2025-09-15 | How Auxiliary Reasoning Unleashes GUI Grounding in VLMs | Manni Duan Team | 2509.11548 | null | |
| 2025-09-15 | LVLMs are Bad at Overhearing Human Referential Communication | Susan E. Brennan Team | 2509.11514 | null | |
| 2025-09-14 | CEMTM: Contextual Embedding-based Multimodal Topic Modeling | Giuseppe Carenini Team | 2509.11465 | null | |
| 2025-09-14 | Enhancing Generalization in Vision-Language-Action Models by Preserving Pretrained Representations | Xuanlin Li Team | 2509.11417 | link | |
| 2025-09-14 | ActivePose: Active 6D Object Pose Estimation and Tracking for Robotic Manipulation | Yizhao Wang Team | 2509.11364 | null | |
| 2025-09-14 | Mitigating Hallucinations in Large Vision-Language Models by Self-Injecting Hallucinations | Weiming Hu Team | 2509.11287 | null | |
| 2025-09-14 | The System Description of CPS Team for Track on Driving with Language of CVPR 2024 Autonomous Grand Challenge | Dehui Du Team | 2509.11071 | null | |
| 2025-09-14 | ViScratch: Using Large Language Models and Gameplay Videos for Automated Feedback in Scratch | Jialu Zhang Team | 2509.11065 | null | |
| 2025-09-13 | Language-based Color ISP Tuning | Jiro Takatori Team | 2509.10765 | null | |
| 2025-09-12 | TASC: Task-Aware Shared Control for Teleoperated Manipulation | Renaud Detry Team | 2509.10416 | null | |
| 2025-09-12 | Towards Understanding Visual Grounding in Visual Language Models | Eda B. Özyiğit Team | 2509.10345 | null | |
| 2025-09-12 | Detecting Text Manipulation in Images using Vision Language Models | Sébastien Marcel Team | 2509.10278 | link | |
| 2025-09-12 | MagicMirror: A Large-Scale Dataset and Benchmark for Fine-Grained Artifacts Assessment in Text-to-Image Generation | Xiaoming Wei Team | 2509.10260 | null | |
| 2025-09-12 | Towards Reliable and Interpretable Document Question Answering via VLMs | Simone Marinai Team | 2509.10129 | null | |
| 2025-09-12 | VARCO-VISION-2.0 Technical Report | Youngjune Kim Team | 2509.10105 | null | |
| 2025-09-12 | Multimodal Mathematical Reasoning Embedded in Aerial Vehicle Imagery: Benchmarking, Analysis, and Exploration | Wayne Zhang Team | 2509.10059 | null | |
| 2025-09-12 | LaV-CoT: Language-Aware Visual CoT with Multi-Aspect Reward Optimization for Real-World Multilingual VQA | Jianshu Li Team | 2509.10026 | null | |
| 2025-09-11 | How well can LLMs provide planning feedback in grounded environments? | Victor Zhong Team | 2509.09790 | null | |
| 2025-09-11 | FLUX-Reason-6M & PRISM-Bench: A Million-Scale Text-to-Image Reasoning Dataset and Comprehensive Benchmark | Hongsheng Li Team | 2509.09680 | link | |
| 2025-09-11 | Compositional Concept Generalization with Variational Quantum Circuits | Mehrnoosh Sadrzadeh Team | 2509.09541 | null | |
| 2025-09-11 | Decoupling Clinical and Class-Agnostic Features for Reliable Few-Shot Adaptation under Shift | Dwarikanath Mahapatra Team | 2509.09397 | null | |
| 2025-09-11 | VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model | Donglin Wang Team | 2509.09372 | null | |
| 2025-09-11 | Curriculum-Based Multi-Tier Semantic Exploration via Deep Reinforcement Learning | Abderrezzak Debilou Team | 2509.09356 | null | |
| 2025-09-11 | Image Recognition with Vision and Language Embeddings of VLMs | Jiri Matas Team | 2509.09311 | null | |
| 2025-09-11 | Visual Programmability: A Guide for Code-as-Thought in Chart Understanding | Ya Zhang Team | 2509.09286 | null | |
| 2025-09-11 | Towards Better Dental AI: A Multimodal Benchmark and Instruction Dataset for Panoramic X-ray Analysis | Kuo Feng Hung Team | 2509.09254 | null | |
| 2025-09-11 | Bridging the Gap Between Ideal and Real-world Evaluation: Benchmarking AI-Generated Image Detection in Challenging Scenarios | Yao Zhu Team | 2509.09172 | null | |
| 2025-09-11 | Zero-shot Hierarchical Plant Segmentation via Foundation Segmentation Models and Text-to-image Attention | Fumio Okura Team | 2509.09116 | null | |
| 2025-09-10 | COCO-Urdu: A Large-Scale Urdu Image-Caption Dataset with Multimodal Quality Estimation | Umair Hassan Team | 2509.09014 | link | |
| 2025-09-10 | Can Vision-Language Models Solve Visual Math Equations? | Mrinmaya Sachan Team | 2509.09013 | null | |
| 2025-09-10 | Generalized User-Oriented Image Semantic Coding Empowered by Large Vision-Language Model | Vincent W. S. Wong Team | 2509.08913 | null | |
| 2025-09-10 | Recurrence Meets Transformers for Universal Multimodal Retrieval | Rita Cucchiara Team | 2509.08897 | null | |
| 2025-09-10 | RewardDance: Reward Scaling in Visual Generation | Weilin Huang Team | 2509.08826 | null | |
| 2025-09-10 | RoboChemist: Long-Horizon and Safety-Compliant Robotic Chemical Experimentation | Hao Zhao Team | 2509.08820 | link | |
| 2025-09-10 | SocialNav-SUB: Benchmarking VLMs for Scene Understanding in Social Robot Navigation | Peter Stone Team | 2509.08757 | link | |
| 2025-09-10 | TCPO: Thought-Centric Preference Optimization for Effective Embodied Decision-making | Xiu Li Team | 2509.08500 | null | |
| 2025-09-10 | A Structured Review of Underwater Object Detection Challenges and Solutions: From Traditional to Large Vision Language Models | Zhou Ni Team | 2509.08490 | null | |
| 2025-09-11 | Adapting Vision-Language Models for Neutrino Event Classification in High-Energy Physics | Pierre Baldi Team | 2509.08461 | null | |
| 2025-09-10 | Retrieval-Augmented VLMs for Multimodal Melanoma Diagnosis | Charmgil Hong Team | 2509.08338 | null | |
| 2025-09-10 | Interpretable Physics Reasoning and Performance Taxonomy in Vision-Language Models | Monali Deshmukh Team | 2509.08270 | null | |
| 2025-09-10 | Examining Vision Language Models through Multi-dimensional Experiments with Vision and Text Features | Donald E. Brown Team | 2509.08266 | null | |
| 2025-09-10 | Vector embedding of multi-modal texts: a tool for discovery? | Sachith Withana Team | 2509.08216 | null | |
| 2025-09-09 | Privacy Preserving Semantic Communications Using Vision Language Models: A Segmentation and Generation Approach | Qianqian Zhang Team | 2509.08142 | null | |
| 2025-09-09 | Visual-TableQA: Open-Domain Benchmark for Reasoning over Table Images | Marc Haraoui Team | 2509.07966 | null | |
| 2025-09-09 | Data-Efficient Fine-Tuning of Vision-Language Models for Diagnosis of Alzheimer’s Disease | Xiaochen Yang Team | 2509.07613 | null | |
| 2025-09-09 | Fine-Tuning Vision-Language Models for Visual Navigation Assistance | Xi Wang Team | 2509.07488 | null | |
| 2025-09-09 | DepthVision: Robust Vision-Language Understanding through GAN-Based LiDAR-to-RGB Synthesis | Alois C. Knoll Team | 2509.07463 | null | |
| 2025-09-09 | SpecifyUI: Supporting Iterative UI Design Intent Expression through Structured Specifications and Generative AI | Liuqing Chen Team | 2509.07334 | null | |
| 2025-09-10 | LLaDA-VLA: Vision Language Diffusion Action Models | Xiaoyan Sun Team | 2509.06932 | null | |
| 2025-09-08 | D-HUMOR: Dark Humor Understanding via Multimodal Open-ended Reasoning | Nagendra Kumar Team | 2509.06771 | null | |
| 2025-09-08 | Embodied Hazard Mitigation using Vision-Language Models for Autonomous Mobile Robots | Aliasghar Arab Team | 2509.06768 | null | |
| 2025-09-08 | Aligning Large Vision-Language Models by Deep Reinforcement Learning and Direct Preference Optimization | Janis Dalins Team | 2509.06759 | null | |
| 2025-09-08 | Focusing by Contrastive Attention: Enhancing VLMs’ Visual Reasoning | Xueqi Cheng Team | 2509.06461 | null | |
| 2025-09-08 | When Language Model Guides Vision: Grounding DINO for Cattle Muzzle Detection | Muhammad Ashad Kabir Team | 2509.06427 | null | |
| 2025-09-08 | Index-Preserving Lightweight Token Pruning for Efficient Document Understanding in Vision-Language Models | Inyong Yun Team | 2509.06415 | null | |
| 2025-09-08 | Teaching AI Stepwise Diagnostic Reasoning with Report-Guided Chain-of-Thought Learning | Dong Liang Team | 2509.06409 | null | |
| 2025-09-08 | Multi View Slot Attention Using Paraphrased Texts For Face Anti-Spoofing | Ha Young Kim Team | 2509.06336 | null | |
| 2025-09-08 | Spatial Reasoning with Vision-Language Models in Ego-Centric Multi-View Scenes | Mohammad Akbari Team | 2509.06266 | null | |
| 2025-09-07 | PathoHR: Hierarchical Reasoning for Vision-Language Models in Pathology | Hujun Yin Team | 2509.06105 | null | |
| 2025-09-07 | Analysis of Blood Report Images Using General Purpose Vision-Language Models | Hamid Beigy Team | 2509.06033 | null | |
| 2025-09-07 | ZLATTE: A Geometry-Aware, Learning-Free Framework for Language-Driven Trajectory Reshaping in Human-Robot Interaction | Luis Figueredo Team | 2509.06031 | null | |
| 2025-09-07 | Imagining Alternatives: Towards High-Resolution 3D Counterfactual Medical Image Generation via Language Guidance | Tal Arbel Team | 2509.05978 | null | |
| 2025-09-06 | Towards an Automated Framework to Audit Youth Safety on TikTok | Francesco Pierri Team | 2509.05838 | null | |
| 2025-09-06 | Do Vision-Language Models See Visualizations Like Humans? Alignment in Chart Categorization | Torsten Möller Team | 2509.05718 | null | |
| 2025-09-06 | Knowledge-Augmented Vision Language Models for Underwater Bioacoustic Spectrogram Analysis | Kazuhiro Nakadai Team | 2509.05703 | null | |
| 2025-09-05 | VLSM-Ensemble: Ensembling CLIP-based Vision-Language Models for Enhanced Medical Image Segmentation | Noel E. O’Connor Team | 2509.05154 | null | |
| 2025-09-05 | GenAI-based test case generation and execution in SDV platform | Alois Knoll Team | 2509.05112 | null | |
| 2025-09-05 | MM-DREX: Multimodal-Driven Dynamic Routing of LLM Experts for Financial Trading | Fei Wu Team | 2509.05080 | null | |
| 2025-09-05 | Dual-Domain Perspective on Degradation-Aware Fusion: A VLM-Guided Robust Infrared and Visible Image Fusion Framework | Guangmang Cui Team | 2509.05000 | null | |
| 2025-09-05 | SynGen-Vision: Synthetic Data Generation for training industrial vision models | Nitish Bhardwaj Team | 2509.04894 | null | |
| 2025-09-05 | TemporalFlowViz: Parameter-Aware Visual Analytics for Interpreting Scramjet Combustion Evolution | Guihua Shan Team | 2509.04834 | null | |
| 2025-09-05 | FloodVision: Urban Flood Depth Estimation Using Foundation Vision-Language Models and Domain Knowledge Graph | John E. Taylor Team | 2509.04772 | null | |
| 2025-09-05 | Dynamic Group Detection using VLM-augmented Temporal Groupness Graph | Norimichi Ukita Team | 2509.04758 | null | |
| 2025-09-04 | Guideline-Consistent Segmentation via Multi-Agent Refinement | James Davis Team | 2509.04687 | null | |
| 2025-09-04 | TRUST-VL: An Explainable News Assistant for General Multimodal Misinformation Detection | Mong Li Lee Team | 2509.04448 | link | |
| 2025-09-05 | GeoArena: An Open Platform for Benchmarking Large Vision-language Models on WorldWide Image Geolocalization | Yixuan Li Team | 2509.04334 | null | |
| 2025-09-04 | Learning Active Perception via Self-Evolving Preference Optimization for GUI Grounding | Juntao Li Team | 2509.04243 | null | |
| 2025-09-04 | An Automated, Scalable Machine Learning Model Inversion Assessment Pipeline | Nathaniel D. Bastian Team | 2509.04214 | null | |
| 2025-09-04 | Real Time FPGA Based Transformers & VLMs for Vision Tasks: SOTA Designs and Optimizations | Ashiyana Abdul Majeed Team | 2509.04162 | null | |
| 2025-09-04 | Multimodal Feature Fusion Network with Text Difference Enhancement for Remote Sensing Change Detection | C. L. Philip Chen Team | 2509.03961 | null | |
| 2025-09-04 | Attn-Adapter: Attention Is All You Need for Online Few-shot Learner of Vision-Language Model | Hyunseung Choo Team | 2509.03895 | null | |
| 2025-09-04 | Weakly-Supervised Learning of Dense Functional Correspondences | Jiajun Wu Team | 2509.03893 | link | |
| 2025-09-04 | Expedition & Expansion: Leveraging Semantic Representations for Goal-Directed Exploration in Continuous Cellular Automata | Cédric Colas Team | 2509.03863 | null | |
| 2025-09-04 | Measuring How (Not Just Whether) VLMs Build Common Ground | Malihe Alikhani Team | 2509.03805 | null | |
| 2025-09-04 | Causality-guided Prompt Learning for Vision-language Models via Visual Granulation | Qiulei Dong Team | 2509.03803 | null | |
| 2025-09-04 | MedVista3D: Vision-Language Modeling for Reducing Diagnostic Errors in 3D CT Disease Detection, Understanding and Reporting | Xiaofeng Yang Team | 2509.03800 | null | |
| 2025-09-03 | Singular Value Few-shot Adaptation of Vision-Language Models | Yiming Xiao Team | 2509.03740 | null | |
| 2025-09-03 | E-ARMOR: Edge case Assessment and Review of Multilingual Optical Character Recognition | Anupam Purwar Team | 2509.03615 | null | |
| 2025-09-05 | Unveiling the Response of Large Vision-Language Models to Visually Absent Tokens | Eunho Yang Team | 2509.03025 | null | |
| 2025-09-03 | KEPT: Knowledge-Enhanced Prediction of Trajectories from Consecutive Driving Frames with Vision-Language Models | Hong Chen Team | 2509.02966 | null | |
| 2025-09-02 | A-SEA3L-QA: A Fully Automated Self-Evolving, Adversarial Workflow for Arabic Long-Context Question-Answer Generation | Pedro J. Moreno Team | 2509.02864 | null | |
| 2025-09-02 | Challenges in Understanding Modality Conflict in Vision-Language Models | David Jensen Team | 2509.02805 | null | |
| 2025-09-02 | 2nd Place Solution for CVPR2024 E2E Challenge: End-to-End Autonomous Driving Using Vision Language Model | Yi Yang Team | 2509.02659 | null | |
| 2025-08-31 | Radio Astronomy in the Era of Vision-Language Models: Prompt Sensitivity and Adaptation | Slava Voloshynovskiy Team | 2509.02615 | null | |
| 2025-09-02 | Language-Guided Long Horizon Manipulation with LLM-based Planning and Visual Perception | Bin He Team | 2509.02324 | null | |
| 2025-09-02 | RS-OOD: A Vision-Language Augmented Framework for Out-of-Distribution Detection in Remote Sensing | Yao Zhu Team | 2509.02273 | null | |
| 2025-09-02 | E-THER: A PCT-Grounded Dataset for Benchmarking Empathic AI | Syed Afaq Ali Shah Team | 2509.02100 | null | |
| 2025-09-02 | Structure-aware Contrastive Learning for Diagram Understanding of Multimodal Models | Hiroshi Sasaki Team | 2509.01959 | null | |
| 2025-09-02 | RSCC: A Large-Scale Remote Sensing Change Caption Dataset for Disaster Events | Feng Zhang Team | 2509.01907 | null | |
| 2025-09-02 | Automated Wildfire Damage Assessment from Multi view Ground level Imagery Via Vision Language Models | Yiming Xiao Team | 2509.01895 | null | |
| 2025-09-01 | MoTo: A Zero-shot Plug-in Interaction-aware Navigation for General Mobile Manipulation | Haibin Yan Team | 2509.01658 | link | |
| 2025-09-01 | Unified Supervision For Vision-Language Modeling in 3D Computed Tomography | Xueyan Mei Team | 2509.01554 | null | |
| 2025-09-01 | Variation-aware Vision Token Dropping for Faster Large Vision-Language Models | Honggang Chen Team | 2509.01552 | link | |
| 2025-09-01 | Error Notebook-Guided, Training-Free Part Retrieval in 3D CAD Assemblies via Vision-Language Models | Zhiming Tan Team | 2509.01350 | null | |
| 2025-09-03 | Novel Category Discovery with X-Agent Attention for Open-Vocabulary Semantic Segmentation | Yanyun Qu Team | 2509.01275 | null | |
| 2025-09-01 | ReCap: Event-Aware Image Captioning with Article Retrieval and Semantic Gaussian Normalization | Trung-Nghia Le Team | 2509.01259 | null | |
| 2025-09-01 | POINTS-Reader: Distillation-Free Adaptation of Vision-Language Models for Document Conversion | Jie Zhou Team | 2509.01215 | null | |
| 2025-09-01 | Measuring Image-Relation Alignment: Reference-Free Evaluation of VLMs and Synthetic Pre-training for Open-Vocabulary Scene Graph Generation | Akihiro Sugimoto Team | 2509.01209 | null | |
| 2025-08-29 | VoCap: Video Object Captioning and Segmentation from Any Prompt | Cordelia Schmid Team | 2508.21809 | null | |
| 2025-08-29 | CAD2DMD-SET: Synthetic Generation Tool of Digital Measurement Device CAD Model Datasets for fine-tuning Large Vision-Language Models | Rodrigo Ventura Team | 2508.21732 | null | |
| 2025-08-29 | How Well Do Vision–Language Models Understand Cities? A Comparative Study on Spatial Reasoning from Street-View Images | Yoonjin Yoon Team | 2508.21565 | null | |
| 2025-08-29 | HCCM: Hierarchical Cross-Granularity Contrastive and Matching Learning for Natural Language-Guided Drones | Shaozi Li Team | 2508.21539 | null | |
| 2025-08-28 | OneReward: Unified Mask-Guided Image Generation via Multi-Task Human Preference Learning | Xinglong Wu Team | 2508.21066 | link | |
| 2025-08-28 | CogVLA: Cognition-Aligned Vision-Language-Action Model via Instruction-Driven Routing & Sparsification | Liqiang Nie Team | 2508.21046 | link | |
| 2025-08-28 | Learning Primitive Embodied World Models: Towards Scalable Robotic Learning | Qinying Gu Team | 2508.20840 | null | |
| 2025-08-28 | Estimating 2D Keypoints of Surgical Tools Using Vision-Language Models with Low-Rank Adaptation | Binod Bhattarai Team | 2508.20830 | null | |
| 2025-08-28 | Evaluating Compositional Generalisation in VLMs and Diffusion Models | Martha Lewis Team | 2508.20783 | null | |
| 2025-09-02 | Occlusion Robustness of CLIP for Military Vehicle Classification | Hugo J. Kuijf Team | 2508.20760 | null | |
| 2025-08-28 | “Humor, Art, or Misinformation?”: A Multimodal Dataset for Intent-Aware Synthetic Image Detection | Panagiotis C. Petrantonakis Team | 2508.20670 | null | |
| 2025-08-28 | Towards Mechanistic Defenses Against Typographic Attacks in CLIP | Wojciech Samek Team | 2508.20570 | null | |
| 2025-08-28 | MedGR$^2$: Breaking the Data Barrier for Medical Reasoning via Generative Reward Learning | Shangyang Li Team | 2508.20549 | null | |
| 2025-08-28 | MedFoundationHub: A Lightweight and Secure Toolkit for Deploying Medical Vision Language Foundation Models | Yuankai Huo Team | 2508.20345 | null | |
| 2025-08-28 | GUARD: Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics for LLMs | Haohan Wang Team | 2508.20325 | null | |
| 2025-08-27 | A Novel Framework for Automated Explain Vision Model Using Vision-Language Models | Truong Son Hy Team | 2508.20227 | null | |
| 2025-08-27 | Segmentation Assisted Incremental Test Time Adaptation in an Open World | Soma Biswas Team | 2508.20029 | null | |
| 2025-08-27 | SWIRL: A Staged Workflow for Interleaved Reinforcement Learning in Mobile GUI Control | Ping Luo Team | 2508.20018 | null | |
| 2025-08-27 | GLSim: Detecting Object Hallucinations in LVLMs via Global-Local Similarity | Yixuan Li Team | 2508.19972 | null | |
| 2025-08-27 | Assessing the Geolocation Capabilities, Limitations and Societal Risks of Generative Vision-Language Models | Shoaib Ehsan Team | 2508.19967 | null | |
| 2025-08-27 | KRETA: A Benchmark for Korean Reading and Reasoning in Text-Rich VQA Attuned to Diverse Visual Contexts | Hyunjun Eun Team | 2508.19944 | null | |
| 2025-08-28 | NLKI: A lightweight Natural Language Knowledge Integration Framework for Improving Small VLMs in Commonsense VQA Tasks | Somak Aditya Team | 2508.19724 | null | |
| 2025-08-27 | InquireMobile: Teaching VLM-based Mobile Agent to Request Human Assistance via Reinforcement Fine-Tuning | Bo Zheng Team | 2508.19679 | null | |
| 2025-08-27 | Self-Rewarding Vision-Language Model via Reasoning Decomposition | Dong Yu Team | 2508.19652 | null | |
| 2025-08-27 | FakeSV-VLM: Taming VLM for Detecting Fake Short-Video News via Progressive Mixture-Of-Experts Adapter | Zhun Zhong Team | 2508.19639 | null | |
| 2025-08-26 | LaVA-Man: Learning Visual Action Representations for Robot Manipulation | Changjae Oh Team | 2508.19391 | null | |
| 2025-08-26 | Fine-Tuning Vision-Language Models for Neutrino Event Analysis in High-Energy Physics Experiments | Pierre Baldi Team | 2508.19376 | null | |
| 2025-08-26 | AT-CXR: Uncertainty-Aware Agentic Triage for Chest X-rays | Yiyu Shi Team | 2508.19322 | null | |
| 2025-10-01 | Object Detection with Multimodal Large Vision-Language Models: An In-depth Review | Manoj Karkee Team | 2508.19294 | null | |
| 2025-08-26 | Do LVLMs Know What They Know? A Systematic Study of Knowledge Boundary Perception in LVLMs | Keping Bi Team | 2508.19111 | null | |
| 2025-08-26 | ProPy: Building Interactive Prompt Pyramids upon CLIP for Partially Relevant Video Retrieval | Xiaoguang Zhao Team | 2508.19024 | null | |
| 2025-08-26 | Ask Me Again Differently: GRAS for Measuring Bias in Vision Language Models on Gender, Race, Age, and Skin Tone | Amit Sheth Team | 2508.18989 | null | |
| 2025-08-28 | Enhancing Document VQA Models via Retrieval-Augmented Generation | Ernest Valveny Team | 2508.18984 | null | |
| 2025-08-26 | Toward Robust Medical Fairness: Debiased Dual-Modal Alignment via Text-Guided Attribute-Disentangled Prompt Learning for Vision-Language Models | Yong Xia Team | 2508.18886 | null | |
| 2025-08-26 | Hidden Tail: Adversarial Image Causing Stealthy Resource Consumption in Vision-Language Models | Guowen Xu Team | 2508.18805 | null | |
| 2025-08-26 | Rethinking Human-Object Interaction Evaluation for both Vision-Language Models and HOI-Specific Methods | Robby T. Tan Team | 2508.18753 | null | |
| 2025-08-26 | Knowing or Guessing? Robust Medical Visual Question Answering via Joint Consistency and Contrastive Learning | Zuozhu Liu Team | 2508.18687 | null | |
| 2025-08-26 | PRISM: Robust VLM Alignment with Principled Reasoning for Integrated Safety in Multimodality | Chaowei Xiao Team | 2508.18649 | null | |
| 2025-08-25 | CLARIFY: A Specialist-Generalist Framework for Accurate and Lightweight Dermatological Visual Question Answering | Mohammad Ariful Haque Team | 2508.18430 | null | |
| 2025-08-25 | Language-Specific Layer Matters: Efficient Multilingual Enhancement for Large Vision-Language Models | Jingbo Zhu Team | 2508.18381 | null | |
| 2025-08-25 | SafeBimanual: Diffusion-based Trajectory Optimization for Safe Bimanual Manipulation | Ziwei Wang Team | 2508.18268 | link | |
| 2025-08-25 | MMTok: Multimodal Coverage Maximization for Efficient Inference of VLMs | Qi Qian Team | 2508.18264 | link | |
| 2025-08-25 | SEAM: Semantically Equivalent Across Modalities Benchmark for Vision-Language Models | Ashton Anderson Team | 2508.18179 | null | |
| 2025-08-25 | Scene-Aware Vectorized Memory Multi-Agent Framework with Cross-Modal Differentiated Quantization VLMs for Visually Impaired Assistance | Liyong Ren Team | 2508.18177 | null | |
| 2025-08-25 | ArgusCogito: Chain-of-Thought for Cross-Modal Synergy and Omnidirectional Reasoning in Camouflaged Object Segmentation | Ye Li Team | 2508.18050 | null | |
| 2025-08-25 | PerPilot: Personalizing VLM-based Mobile Agents via Memory and Exploration | Zhen Wang Team | 2508.18040 | null | |
| 2025-08-25 | Alternating Training-based Label Smoothing Enhances Prompt Generalization | Yu Zhang Team | 2508.17846 | null | |
| 2025-08-25 | PoRe: Position-Reweighted Visual Token Pruning for Vision Language Models | Dan Zeng Team | 2508.17807 | null | |
| 2025-08-25 | F2RVLM: Boosting Fine-grained Fragment Retrieval for Multi-Modal Long-form Dialogue with Vision Language Model | Jinchao Zhang Team | 2508.17714 | null | |
| 2025-08-25 | Language-Guided Temporal Token Pruning for Efficient VideoLLM Processing | Yogesh Kumar Team | 2508.17686 | null | |
| 2025-08-25 | Hierarchical Vision-Language Learning for Medical Out-of-Distribution Detection | Ruixuan Wang Team | 2508.17667 | null | |
| 2025-08-25 | Dynamic Embedding of Hierarchical Visual Features for Efficient Vision-Language Fine-Tuning | Chunping Qiu Team | 2508.17638 | null | |
| 2025-08-25 | TinyGiantVLM: A Lightweight Vision-Language Architecture for Spatial Reasoning under Resource Constraints | Xuan-Huong Nguyen Team | 2508.17595 | null | |
| 2025-08-25 | MetaGen: A DSL, Database, and Benchmark for VLM-Assisted Metamaterial Generation | Wojciech Matusik Team | 2508.17568 | null | |
| 2025-08-24 | MoE-Inference-Bench: Performance Evaluation of Mixture of Expert Large Language and Vision Models | Venkatram Vishwanath Team | 2508.17467 | null | |
| 2025-08-24 | Multi-Level LVLM Guidance for Untrimmed Video Action Recognition | Yunjie Guo Team | 2508.17442 | null | |
| 2025-08-24 | Constrained Prompt Enhancement for Improving Zero-Shot Generalization of Vision-Language Models | Qinghua Hu Team | 2508.17417 | null | |
| 2025-08-24 | Lightweight Joint Optimization of General-Purpose Vision-Language Models and Retrievers for Medical Diagnosis | Tom Hope Team | 2508.17394 | null | |
| 2025-08-26 | Mind the (Language) Gap: Towards Probing Numerical and Cross-Lingual Limits of LVLMs | Gaurav Harit Team | 2508.17334 | null | |
| 2025-08-24 | Explain Before You Answer: A Survey on Compositional Visual Reasoning | Hamid Rezatofighi Team | 2508.17298 | null | |
| 2025-08-22 | Modular Embedding Recomposition for Incremental Learning | Simone Calderara Team | 2508.16463 | null | |
| 2025-08-22 | Structuring GUI Elements through Vision Language Models: Towards Action Space Generation | Jingdong Chen Team | 2508.16271 | null | |
| 2025-08-22 | RAGSR: Regional Attention Guided Diffusion for Image Super-Resolution | Gui-Song Xia Team | 2508.16158 | null | |
| 2025-08-22 | Beyond Human-prompting: Adaptive Prompt Tuning with Semantic Alignment for Anomaly Detection | Chao-Chun Chen Team | 2508.16157 | null | |
| 2025-08-22 | Prompting with Sign Parameters for Low-resource Sign Language Instruction Generation | Hasan Mahmud Team | 2508.16076 | null | |
| 2025-08-22 | Less Redundancy: Boosting Practicality of Vision Language Model in Walking Assistants | Jinchao Zhang Team | 2508.16070 | null | |
| 2025-08-21 | Glo-VLMs: Leveraging Vision-Language Models for Fine-Grained Diseased Glomerulus Classification | Ruining Deng Team | 2508.15960 | null | |
| 2025-08-21 | Semantic-Aware Ship Detection with Vision-Language Integration | Xiaomeng Huang Team | 2508.15930 | null | |
| 2025-08-21 | VT-LVLM-AR: A Video-Temporal Large Vision-Language Model Adapter for Fine-Grained Action Recognition in Long-Term Videos | Zihan Xu Team | 2508.15903 | null | |
| 2025-08-21 | LLM-empowered Dynamic Prompt Routing for Vision-Language Models Tuning under Long-Tailed Distributions | Yulong Bian Team | 2508.15688 | null | |
| 2025-08-21 | Mind and Motion Aligned: A Joint Evaluation IsaacSim Benchmark for Task Planning and Low-Level Policies in Mobile Manipulation | Alexey K. Kovalev Team | 2508.15663 | null | |
| 2025-08-21 | DesignCLIP: Multimodal Learning with CLIP for Design Patent Understanding | Sourav Medya Team | 2508.15297 | null | |
| 2025-08-21 | Normal and Abnormal Pathology Knowledge-Augmented Vision-Language Model for Anomaly Detection in Pathology Images | Jin Tae Kwak Team | 2508.15256 | null | |
| 2025-08-21 | Pathology-Informed Latent Diffusion Model for Anomaly Detection in Lymph Node Metastasis | Jin Tae Kwak Team | 2508.15236 | null | |
| 2025-08-21 | See it. Say it. Sorted: Agentic System for Compositional Diagram Generation | Ed Li Team | 2508.15222 | null | |
| 2025-08-21 | ContextualLVLM-Agent: A Holistic Framework for Multi-Turn Visually-Grounded Dialogue and Complex Instruction Following | Taeyang Yoon Team | 2508.15164 | null | |
| 2025-08-20 | MoEcho: Exploiting Side-Channel Attacks to Compromise User Privacy in Mixture-of-Experts LLMs | Yunsi Fei Team | 2508.15036 | null | |
| 2025-08-20 | WISE-FUSE: Efficient Whole Slide Image Encoding via Coarse-to-Fine Patch Selection with VLM and LLM Knowledge Fusion | Won-Ki Jeong Team | 2508.14537 | null | |
| 2025-08-19 | Multi-Rationale Explainable Object Recognition via Contrastive Conditional Inference | Simon Gottschalk Team | 2508.14280 | null | |
| 2025-08-19 | CLIPSym: Delving into Symmetry Detection with CLIP | Raymond A. Yeh Team | 2508.14197 | null | |
| 2025-08-19 | LENS: Learning to Segment Anything with Unified Reinforced Reasoning | Xinggang Wang Team | 2508.14153 | link | |
| 2025-08-19 | Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation | Jianye Hao Team | 2508.13998 | null | |
| 2025-08-19 | Mitigating Cross-Image Information Leakage in LVLMs for Multi-Image Tasks | Junsuk Choe Team | 2508.13744 | link | |
| 2025-08-19 | Enhancing Targeted Adversarial Attacks on Large Vision-Language Models through Intermediate Projector Guidance | Bin Xiao Team | 2508.13739 | null | |
| 2025-08-19 | Hierarchical Vision-Language Retrieval of Educational Metaverse Content in Agriculture | Giuseppe Serra Team | 2508.13713 | null | |
| 2025-08-19 | ViExam: Are Vision Language Models Better than Humans on Vietnamese Multimodal Exam Questions? | Daeyoung Kim Team | 2508.13680 | null | |
| 2025-08-19 | Breaking the SFT Plateau: Multimodal Structured Reinforcement Learning for Chart-to-Code Generation | Lin Ma Team | 2508.13587 | null | |
| 2025-08-21 | DictAS: A Framework for Class-Generalizable Few-Shot Anomaly Segmentation via Dictionary Lookup | Guiguang Ding Team | 2508.13560 | link | |
| 2025-08-19 | Evaluating Open-Source Vision Language Models for Facial Emotion Recognition against Traditional Deep Learning Models | Sridevi Bonthu Team | 2508.13524 | null | |
| 2025-08-19 | STER-VLM: Spatio-Temporal With Enhanced Reference Vision-Language Models | Tien-Huy Nguyen Team | 2508.13470 | null | |
| 2025-08-19 | CAST: Counterfactual Labels Improve Instruction Following in Vision-Language-Action Models | Sergey Levine Team | 2508.13446 | null | |
| 2025-08-19 | Structured Prompting and Multi-Agent Knowledge Distillation for Traffic Video Interpretation and Risk Inference | Jidong J. Yang Team | 2508.13439 | null | |
| 2025-08-19 | Mitigating Easy Option Bias in Multiple-Choice Question Answering | Basura Fernando Team | 2508.13428 | null | |
| 2025-08-18 | Prune2Drive: A Plug-and-Play Framework for Accelerating Vision-Language Models in Autonomous Driving | Linfeng Zhang Team | 2508.13305 | null | |
| 2025-08-18 | CardAIc-Agents: A Multimodal Framework with Hierarchical Adaptation for Cardiac Care Support | Jinming Duan Team | 2508.13256 | null | |
| 2025-08-18 | Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey | Liqiang Nie Team | 2508.13073 | link | |
| 2025-08-18 | IntelliCap: Intelligent Guidance for Consistent View Sampling | Shohei Mori Team | 2508.13043 | link | |
| 2025-08-18 | Breaking Reward Collapse: Adaptive Reinforcement for Open-ended Medical Reasoning with Enhanced Semantic Discrimination | Lihua Zhang Team | 2508.12957 | null | |
| 2025-08-18 | RoboRetriever: Single-Camera Robot Object Retrieval via Active and Interactive Perception with Dynamic Scene Graph | Yunquan Sun Team | 2508.12916 | null | |
| 2025-08-18 | Preserve and Sculpt: Manifold-Aligned Fine-tuning of Vision-Language Models for Few-Shot Learning | Ruixuan Wang Team | 2508.12877 | null | |
| 2025-08-18 | Cross-Domain Few-Shot Learning via Multi-View Collaborative Optimization with Vision-Language Models | Ruixuan Wang Team | 2508.12861 | null | |
| 2025-08-18 | HeteroRAG: A Heterogeneous Retrieval-Augmented Generation Framework for Medical Vision Language Tasks | Yu Wang Team | 2508.12778 | null | |
| 2025-08-18 | Drifting Away from Truth: GenAI-Driven News Diversity Challenges LVLM-Based Misinformation Detection | Wei Zhou Team | 2508.12711 | null | |
| 2025-08-18 | WP-CLIP: Leveraging CLIP to Predict Wölfflin’s Principles in Visual Art | Feng Liu Team | 2508.12668 | link | |
| 2025-08-18 | SpotVLM: Cloud-edge Collaborative Real-time VLM based on Context Transfer | Zheng Yang Team | 2508.12638 | null | |
| 2025-08-18 | ViLaD: A Large Vision Language Diffusion Framework for End-to-End Autonomous Driving | Ziran Wang Team | 2508.12603 | null | |
| 2025-08-18 | Multimodal Chain of Continuous Thought for Latent-Space Reasoning in Vision-Language Models | Chris Ngo Team | 2508.12587 | null | |
| 2025-08-18 | REVEAL – Reasoning and Evaluation of Visual Evidence through Aligned Language | Yash Butala Team | 2508.12543 | null | |
| 2025-08-17 | LangVision-LoRA-NAS: Neural Architecture Search for Variable LoRA Rank in Vision Language Models | Venkatram Vishwanath Team | 2508.12512 | null | |
| 2025-08-17 | Standardization of Neuromuscular Reflex Analysis – Role of Fine-Tuned Vision-Language Model Consortium and OpenAI gpt-oss Reasoning LLM Enabled Decision Support System | Kasun De Zoysa Team | 2508.12473 | null | |
| 2025-08-17 | M3PO: Multimodal-Model-Guided Preference Optimization for Visual Instruction Following | Yanfei Qian Team | 2508.12458 | null | |
| 2025-08-17 | X-Ray-CoT: Interpretable Chest X-ray Diagnosis with Vision-Language Models via Chain-of-Thought Reasoning | Shaoqing Tang Team | 2508.12455 | null | |
| 2025-08-17 | LMAD: Integrated End-to-End Vision-Language Model for Explainable Autonomous Driving | Li Zhang Team | 2508.12404 | null | |
| 2025-08-17 | MPCAR: Multi-Perspective Contextual Augmentation for Enhanced Visual Reasoning in Large Vision-Language Models | Xueying Huang Team | 2508.12400 | null | |
| 2025-08-17 | Federated Cross-Modal Style-Aware Prompt Generation | Amit Sethi Team | 2508.12399 | null | |
| 2025-08-15 | Reinforcing Video Reasoning Segmentation to Think Before It Segments | Huchuan Lu Team | 2508.11538 | null | |
| 2025-08-15 | OVSegDT: Segmenting Transformer for Open-Vocabulary Object Goal Navigation | Aleksandr Panov Team | 2508.11479 | null | |
| 2025-08-15 | ImagiDrive: A Unified Imagination-and-Planning Framework for Autonomous Driving | Li Zhang Team | 2508.11428 | null | |
| 2025-08-15 | Semantically Guided Adversarial Testing of Vision Models Using Language Models | Jorge M. Cruz-Duarte Team | 2508.11341 | null | |
| 2025-08-15 | Noise Matters: Optimizing Matching Noise for Diffusion Classifiers | Long Chen Team | 2508.11330 | null | |
| 2025-08-15 | Logic Unseen: Revealing the Logical Blindspots of Vision-Language Models | Tat-Seng Chua Team | 2508.11317 | null | |
| 2025-08-15 | Vision-Language Models display a strong gender bias | Sreedath Panat Team | 2508.11262 | null | |
| 2025-08-15 | Generalized Decoupled Learning for Enhancing Open-Vocabulary Dense Perception | Zhuotao Tian Team | 2508.11256 | null | |
| 2025-08-15 | UAV-VL-R1: Generalizing Vision-Language Models via Supervised Fine-Tuning and Multi-Stage GRPO for UAV Visual Reasoning | Yue Zhang Team | 2508.11196 | null | |
| 2025-08-15 | Fine-Grained VLM Fine-tuning via Latent Hierarchical Adapter Learning | Bin Luo Team | 2508.11176 | null | |
| 2025-08-15 | Better Supervised Fine-tuning for VQA: Integer-Only Loss | Junhui Cui Team | 2508.11170 | null | |
| 2025-08-14 | Utilizing Vision-Language Models as Action Models for Intent Recognition and Assistance | Rustam Stolkin Team | 2508.11093 | null | |
| 2025-08-14 | Are Large Pre-trained Vision Language Models Effective Construction Safety Inspectors? | Zhengbo Zou Team | 2508.11011 | null | |
| 2025-08-14 | Not There Yet: Evaluating Vision Language Models in Simulating the Visual Perception of People with Low Vision | Anhong Guo Team | 2508.10972 | null | |
| 2025-08-14 | AEGIS: Authenticity Evaluation Benchmark for AI-Generated Video Sequences | Joey Tianyi Zhou Team | 2508.10771 | null | |
| 2025-08-14 | From Diagnosis to Improvement: Probing Spatio-Physical Reasoning in Vision Language Models | Wenqi Shao Team | 2508.10770 | null | |
| 2025-08-14 | IADGPT: Unified LVLM for Few-Shot Industrial Anomaly Detection, Localization, and Reasoning via In-Context Learning | Bin Li Team | 2508.10681 | null | |
| 2025-08-14 | AddressVLM: Cross-view Alignment Tuning for Image Address Localization using Large Vision-Language Models | Jieping Ye Team | 2508.10667 | null | |
| 2025-08-14 | SemPT: Semantic Prompt Tuning for Vision-Language Models | Zhenzhong Chen Team | 2508.10645 | null | |
| 2025-08-14 | ChatENV: An Interactive Vision-Language Model for Sensor-Guided Environmental Monitoring and Scenario Simulation | Mohsen Guizani Team | 2508.10635 | null | |
| 2025-08-14 | Retrieval-Augmented Prompt for OOD Detection | Changqing Zhang Team | 2508.10556 | null | |
| 2025-08-14 | DiFaR: Enhancing Multimodal Misinformation Detection with Diverse, Factual, and Relevant Rationales | Zhi Zeng Team | 2508.10444 | null | |
| 2025-08-14 | MM-Food-100K: A 100,000-Sample Multimodal Food Intelligence Dataset with Verifiable Provenance | Yi Zhang Team | 2508.10429 | link | |
| 2025-08-14 | STRIDE-QA: Visual Question Answering Dataset for Spatiotemporal Reasoning in Urban Driving Scenes | Yu Yamaguchi Team | 2508.10427 | link | |
| 2025-08-14 | PQ-DAF: Pose-driven Quality-controlled Data Augmentation for Data-scarce Driver Distraction Detection | Xinghui Song Team | 2508.10397 | null | |
| 2025-08-14 | Contrast Sensitivity Function of Multimodal Vision-Language Models | Valero Laparra Team | 2508.10367 | null | |
| 2025-08-14 | JRDB-Reasoning: A Difficulty-Graded Benchmark for Visual Reasoning in Robotics | Hamid Rezatofighi Team | 2508.10287 | null | |
| 2025-08-14 | MRFD: Multi-Region Fusion Decoding with Self-Consistency for Mitigating Hallucinations in LVLMs | Yujun Cai Team | 2508.10264 | null | |
| 2025-08-13 | Efficient Forward-Only Data Valuation for Pretrained LLMs and VLMs | Xiaoxiao Li Team | 2508.10180 | null | |
| 2025-08-13 | SynSpill: Improved Industrial Spill Detection With Synthetic Data | Shruti Vyas Team | 2508.10171 | null | |
| 2025-08-13 | Interpretable Oracle Bone Script Decipherment through Radical and Pictographic Analysis with LVLMs | Bin Li Team | 2508.10113 | null | |
| 2025-08-13 | LLMC+: Benchmarking Vision-Language Model Compression with a Plug-and-play Toolkit | Wenya Wang Team | 2508.09981 | null | |
| 2025-08-13 | January Food Benchmark (JFB): A Public Benchmark Dataset and Evaluation Suite for Multimodal Food Analysis | Mark Woodward Team | 2508.09966 | null | |
| 2025-08-14 | Prototype-Guided Diffusion: Visual Conditioning without External Memory | Mustapha Lebbah Team | 2508.09922 | null | |
| 2025-08-12 | OpenCUA: Open Foundations for Computer-Use Agents | Tao Yu Team | 2508.09123 | null | |
| 2025-08-12 | Bridging Formal Language with Chain-of-Thought Reasoning to Geometry Problem Solving | Tian Ding Team | 2508.09099 | null | |
| 2025-08-12 | Addressing Bias in VLMs for Glaucoma Detection Without Protected Attribute Supervision | Prashnna Gyawali Team | 2508.09087 | null | |
| 2025-08-13 | GeoVLA: Empowering 3D Representations in Vision-Language-Action Models | Jiale Cao Team | 2508.09071 | link | |
| 2025-08-12 | VLM-3D: End-to-End Vision-Language Models for Open-World 3D Perception | Lei He Team | 2508.09061 | null | |
| 2025-08-12 | MVISU-Bench: Benchmarking Mobile Agents for Real-World Tasks by Multi-App, Vague, Interactive, Single-App and Unethical Instructions | Jin Xu Team | 2508.09057 | null | |
| 2025-08-12 | Rational Inverse Reasoning | Leslie Pack Kaelbling Team | 2508.08983 | null | |
| 2025-08-12 | How Does a Virtual Agent Decide Where to Look? – Symbolic Cognitive Reasoning for Embodied Head Rotation | Hyeongyeop Kang Team | 2508.08930 | null | |
| 2025-08-12 | Safe Semantics, Unsafe Interpretations: Tackling Implicit Reasoning Safety in Large Vision-Language Models | Xuelong Li Team | 2508.08926 | null | |
| 2025-08-12 | 3DFroMLLM: 3D Prototype Generation only from Pretrained Multimodal LLMs | Eddy Ilg Team | 2508.08821 | null | |
| 2025-08-12 | SafeFix: Targeted Model Repair via Controlled Image Generation | Yunhui Guo Team | 2508.08701 | null | |
| 2025-08-12 | STELAR-VISION: Self-Topology-Aware Efficient Learning for Aligned Reasoning in Vision | Marios Savvides Team | 2508.08688 | null | |
| 2025-08-12 | AME: Aligned Manifold Entropy for Robust Vision-Language Distillation | Yuming Ou Team | 2508.08644 | null | |
| 2025-08-13 | Transferable Model-agnostic Vision-Language Model Adaptation for Efficient Weak-to-Strong Generalization | Hyunwoo J. Kim Team | 2508.08604 | null | |
| 2025-08-12 | Superclass-Guided Representation Disentanglement for Spurious Correlation Mitigation | Qi Lei Team | 2508.08570 | null | |
| 2025-08-11 | VISOR: Visual Input-based Steering for Output Redirection in Vision-Language Models | Ravikumar Balakrishnan Team | 2508.08521 | null | |
| 2025-08-11 | Re:Verse – Can Your VLM Read a Manga? | Shruti Vyas Team | 2508.08508 | null | |
| 2025-08-11 | ODYSSEY: Open-World Quadrupeds Exploration and Manipulation for Long-Horizon Tasks | Chunhua Shen Team | 2508.08240 | null | |
| 2025-08-11 | Spatial-ORMLLM: Improve Spatial Relation Understanding in the Operating Room with Multimodal Large Language Model | Shaoliang Peng Team | 2508.08199 | null | |
| 2025-08-11 | BadPromptFL: A Novel Backdoor Threat to Prompt-based Federated Learning in Multimodal Models | Bo Wang Team | 2508.08040 | null | |
| 2025-08-11 | TRIDE: A Text-assisted Radar-Image weather-aware fusion network for Depth Estimation | Robert Wille Team | 2508.08038 | null | |
| 2025-08-11 | TAG: A Simple Yet Effective Temporal-Aware Approach for Zero-Shot Video Temporal Grounding | Jee-Hyong Lee Team | 2508.07925 | null | |
| 2025-08-11 | RSVLM-QA: A Benchmark Dataset for Remote Sensing Vision Language Model-based Question Answering | Mukesh Prasad Team | 2508.07918 | null | |
| 2025-08-11 | CATP: Contextually Adaptive Token Pruning for Efficient and Enhanced Multimodal In-Context Learning | Ruixiang Tang Team | 2508.07871 | null | |
| 2025-08-11 | Effortless Vision-Language Model Specialization in Histopathology without Annotation | Katharina Breininger Team | 2508.07835 | null | |
| 2025-08-11 | MIMIC: Multimodal Inversion for Model Interpretation and Conceptualization | Alexandros Stergiou Team | 2508.07833 | link | |
| 2025-08-11 | Architectural Co-Design for Zero-Shot Anomaly Detection: Decoupling Representation and Dynamically Fusing Features in CLIP | Yueyi Luo Team | 2508.07819 | null | |
| 2025-08-11 | SwarmVLM: VLM-Guided Impedance Control for Autonomous Navigation of Heterogeneous Robots in Dynamic Warehousing | Dzmitry Tsetserukou Team | 2508.07814 | null | |
| 2025-08-11 | Grasp-HGN: Grasping the Unexpected | Gunar Schirner Team | 2508.07648 | null | |
| 2025-08-11 | Breaking Down and Building Up: Mixture of Skill-Based Vision-and-Language Navigation Agents | Parisa Kordjamshidi Team | 2508.07642 | null | |
| 2025-08-11 | InterChart: Benchmarking Visual Reasoning Across Decomposed and Distributed Chart Information | Vivek Gupta Team | 2508.07630 | null | |
| 2025-08-11 | AR-VRM: Imitating Human Motions for Visual Robot Manipulation with Analogical Reasoning | Yang Liu Team | 2508.07626 | null | |
| 2025-08-11 | Adaptive Cache Enhancement for Test-Time Adaptation of Vision-Language Models | Duc Thanh Nguyen Team | 2508.07570 | null | |
| 2025-08-10 | FormCoach: Lift Smarter, Not Harder | Lingjie Liu Team | 2508.07501 | null | |
| 2025-08-10 | Freeze and Reveal: Exposing Modality Bias in Vision-Language Models | Ponnurangam Kumaraguru Team | 2508.07432 | null | |
| 2025-08-10 | AgriVLN: Vision-and-Language Navigation for Agricultural Robots | Xiang Li Team | 2508.07406 | null | |
| 2025-08-10 | Small-Large Collaboration: Training-efficient Concept Personalization for Large VLM using a Meta Personalized Small VLM | Wentao Zhang Team | 2508.07260 | null | |
| 2025-08-08 | Uncertainty-quantified Rollout Policy Adaptation for Unlabelled Cross-domain Temporal Grounding | Kun Shao Team | 2508.06317 | null | |
| 2025-08-08 | Real-Time 3D Vision-Language Embedding Mapping | Elmar Rueckert Team | 2508.06291 | null | |
| 2025-08-08 | InfoCausalQA: Can Models Perform Non-explicit Causal Reasoning Based on Infographic? | Youngjae Yu Team | 2508.06220 | null | |
| 2025-08-08 | VISTAR: A User-Centric and Role-Driven Benchmark for Text-to-Image Evaluation | ChengSheng Deng Team | 2508.06152 | null | |
| 2025-08-08 | Q-CLIP: Unleashing the Power of Vision-Language Models for Video Quality Assessment through Unified Cross-Modal Adaptation | Shaohui Liu Team | 2508.06092 | null | |
| 2025-08-08 | AdaptInfer: Adaptive Token Pruning for Vision-Language Model Inference with Dynamical Text Guidance | Yunhao Liu Team | 2508.06084 | null | |
| 2025-08-08 | Fourier-VLM: Compressing Vision Tokens in the Frequency Domain for Large Vision-Language Models | Zhouhan Lin Team | 2508.06038 | null | |
| 2025-08-08 | More Is Better: A MoE-Based Emotion Recognition Framework with Human Preference Alignment | Zhepeng Wang Team | 2508.06036 | null | |
| 2025-08-08 | Mediator-Guided Multi-Agent Collaboration among Open-Source Models for Medical Decision-Making | Xiaosong Wang Team | 2508.05996 | null | |
| 2025-08-08 | PASG: A Closed-Loop Framework for Automated Geometric Primitive Extraction and Semantic Anchoring in Robotic Manipulation | Yao Mu Team | 2508.05976 | null | |
| 2025-08-07 | HOLODECK 2.0: Vision-Language-Guided 3D World Generation with Editing | Chris Callison-Burch Team | 2508.05899 | null | |
| 2025-08-07 | ETTA: Efficient Test-Time Adaptation for Vision-Language Models through Dynamic Embedding Updates | Ali Cheraghian Team | 2508.05898 | null | |
| 2025-08-07 | Follow-Your-Instruction: A Comprehensive MLLM Agent for World Data Synthesis | Zeyu Wang Team | 2508.05580 | null | |
| 2025-08-07 | Adapting Vision-Language Models Without Labels: A Comprehensive Survey | Olga Fink Team | 2508.05547 | link | |
| 2025-08-07 | Explaining Similarity in Vision-Language Encoders with Weighted Banzhaf Interactions | Przemyslaw Biecek Team | 2508.05430 | null | |
| 2025-08-07 | From Detection to Correction: Backdoor-Resilient Face Recognition via Vision-Language Trigger Detection and Noise-Based Neutralization | Ibrahim Khalil Team | 2508.05409 | null | |
| 2025-08-07 | DeepPHY: Benchmarking Agentic VLMs on Physical Reasoning | Bo Zheng Team | 2508.05405 | null | |
| 2025-08-07 | StructVRM: Aligning Multimodal Reasoning with Structured and Verifiable Reward Models | Youqiang Zhou Team | 2508.05383 | null | |
| 2025-08-07 | Textual Inversion for Efficient Adaptation of Open-Vocabulary Object Detectors Without Forgetting | Hugo Kuijf Team | 2508.05323 | null | |
| 2025-08-07 | Towards Embodied Agentic AI: Review and Classification of LLM- and VLM-Driven Robot Autonomy and Interaction | Jorge Peña Queralta Team | 2508.05294 | null | |
| 2025-08-07 | RegionMed-CLIP: A Region-Aware Multimodal Contrastive Learning Pre-trained Model for Medical Image Understanding | Guiru Liu Team | 2508.05244 | null | |
| 2025-08-07 | Navigating the Trade-off: A Synthesis of Defensive Strategies for Zero-Shot Adversarial Robustness in Vision-Language Models | Jason Sun Team | 2508.05237 | null | |
| 2025-08-07 | ReasoningTrack: Chain-of-Thought Reasoning for Long-term Vision-Language Tracking | Zhipeng Zhang Team | 2508.05221 | null | |
| 2025-08-07 | SPEX: A Vision-Language Model for Land Cover Extraction on Spectral Remote Sensing Images | Liangpei Zhang Team | 2508.05202 | null | |
| 2025-08-07 | Chemist Eye: A Visual Language Model-Powered System for Safety Monitoring and Robot Decision-Making in Self-Driving Laboratories | Andrew I. Cooper Team | 2508.05148 | null | |
| 2025-08-07 | Multimodal Causal-Driven Representation Learning for Generalizable Medical Image Segmentation | Zhen Lei Team | 2508.05008 | null | |
| 2025-08-07 | Attribute Guidance With Inherent Pseudo-label For Occluded Person Re-identification | Haiyang Zhang Team | 2508.04998 | null | |
| 2025-08-07 | Unified modality separation: A vision-language framework for unsupervised domain adaptation | Heng Tao Shen Team | 2508.04987 | null | |
| 2025-08-07 | Accelerating Conditional Prompt Learning via Masked Image Modeling for Vision-Language Models | Hyunseung Choo Team | 2508.04942 | null | |
| 2025-08-06 | INTENTION: Inferring Tendencies of Humanoid Robot Motion Through Interactive Intuition and Grounded VLM | Nikos Tsagarakis Team | 2508.04931 | link | |
| 2025-08-06 | Automated Bug Frame Retrieval from Gameplay Videos Using Vision-Language Models | Cor-Paul Bezemer Team | 2508.04895 | null | |
| 2025-08-06 | Dual-Stream Attention with Multi-Modal Queries for Object Detection in Transportation Applications | Wassim Bouachir Team | 2508.04868 | null | |
| 2025-08-01 | MMRAG-DocQA: A Multi-Modal Retrieval-Augmented Generation Method for Document Question-Answering with Hierarchical Index and Multi-Granularity Retrieval | Chengcheng Mai Team | 2508.00579 | null | |
| 2025-08-01 | Training-Free Class Purification for Open-Vocabulary Semantic Segmentation | Xiaohua Xie Team | 2508.00557 | null | |
| 2025-08-01 | HiPrune: Training-Free Visual Token Pruning via Hierarchical Attention in Vision-Language Models | Bin Chen Team | 2508.00553 | null | |
| 2025-08-01 | Your other Left! Vision-Language Models Fail to Identify Relative Positions in Medical Images | Timo Ropinski Team | 2508.00549 | null | |
| 2025-08-01 | EFlat-LoRA: Efficiently Seeking Flat Minima for Better Generalization in Fine-Tuning Large Language Models and Beyond | Baochang Zhang Team | 2508.00522 | null | |
| 2025-08-01 | When Vision-Language Model (VLM) Meets Beam Prediction: A Multimodal Contrastive Learning Framework | Tony Q. S. Quek Team | 2508.00456 | null | |
| 2025-08-01 | CLIPTime: Time-Aware Multimodal Representation Learning from Images and Text | Petar Durdevic Team | 2508.00447 | null | |
| 2025-08-01 | AutoDebias: Automated Framework for Debiasing Text-to-Image Models | Yang Liu Team | 2508.00445 | null | |
| 2025-08-01 | Sari Sandbox: A Virtual Retail Store Environment for Embodied AI Agents | Rowel O. Atienza Team | 2508.00400 | null | |
| 2025-08-01 | iSafetyBench: A video-language benchmark for safety in industrial environment | Shruti Vyas Team | 2508.00399 | null | |
| 2025-08-01 | Decouple before Align: Visual Disentanglement Enhances Prompt Tuning | Yanfeng Wang Team | 2508.00395 | null | |
| 2025-08-01 | SA-GCS: Semantic-Aware Gaussian Curriculum Scheduling for UAV Vision-Language Navigation | Renxin Zhong Team | 2508.00390 | null | |
| 2025-08-01 | CoRGI: Verified Chain-of-Thought Reasoning with Visual Grounding | Lin Shang Team | 2508.00378 | null | |
| 2025-08-01 | Analyze-Prompt-Reason: A Collaborative Agent-Based Framework for Multi-Image Vision-Language Reasoning | Athanasios Voulodimos Team | 2508.00356 | null | |
| 2025-08-01 | Evaluating the Efficacy of Large Language Models for Generating Fine-Grained Visual Privacy Policies in Homes | Hewu Li Team | 2508.00321 | null | |
| 2025-08-01 | DocTron-Formula: Generalized Formula Recognition in Complex and Structured Scenarios | Lin Ma Team | 2508.00311 | null | |
| 2025-08-01 | Instruction-Grounded Visual Projectors for Continual Learning of Generative Vision-Language Models | Eunwoo Kim Team | 2508.00260 | null | |
| 2025-08-01 | Towards Higher Effective Rank in Parameter-efficient Fine-tuning using Khatri–Rao Product | Ehsan Abbasnejad Team | 2508.00230 | null | |
| 2025-07-31 | On the Risk of Misleading Reports: Diagnosing Textual Biases in Multimodal Clinical AI | Enzo Ferrante Team | 2508.00171 | null | |
| 2025-07-31 | ART: Adaptive Relation Tuning for Generalized Relation Prediction | Stefan Roth Team | 2507.23543 | null | |
| 2025-07-23 | BetterCheck: Towards Safeguarding VLMs for Automotive Perception Systems | Christian Berger Team | 2507.17722 | null | |
| 2025-07-23 | InstructVLA: Vision-Language-Action Instruction Tuning from Understanding to Manipulation | Jiangmiao Pang Team | 2507.17520 | null | |
| 2025-07-23 | Dynamic Scoring with Enhanced Semantics for Training-Free Human-Object Interaction Detection | Elisa Ricci Team | 2507.17456 | null | |
| 2025-07-23 | VLM-Guided Visual Place Recognition for Planet-Scale Geo-Localization | Shoaib Ehsan Team | 2507.17455 | null | |
| 2025-07-23 | Dynamic-DINO: Fine-Grained Mixture of Experts Tuning for Real-time Open-Vocabulary Object Detection | Xi Li Team | 2507.17436 | null | |
| 2025-07-23 | Language-Conditioned Open-Vocabulary Mobile Manipulation with Pretrained Models | Guanghui Sun Team | 2507.17379 | null | |
| 2025-07-23 | RoadBench: A Vision-Language Foundation Model and Benchmark for Road Damage Understanding | Tianyang Wang Team | 2507.17353 | null | |
| 2025-07-23 | HySafe-AI: Hybrid Safety Architectural Analysis Framework for AI Systems: A Case Study | Maria Spence Team | 2507.17118 | null | |
| 2025-07-23 | FedVLM: Scalable Personalized Vision-Language Models through Federated Learning | Habeeb Olufowobi Team | 2507.17088 | null | |
| 2025-07-22 | VL-CLIP: Enhancing Multimodal Recommendations via Visual Grounding and LLM-Augmented CLIP Embeddings | Kannan Achan Team | 2507.17080 | null | |
| 2025-07-22 | Controllable Hybrid Captioner for Improved Long-form Video Understanding | Arun Reddy Team | 2507.17047 | null | |
| 2025-07-22 | Semi-off-Policy Reinforcement Learning for Vision-Language Slow-thinking Reasoning | Kai Chen Team | 2507.16814 | null | |
| 2025-07-22 | Cooling Matters: Benchmarking Large Language Models and Vision-Language Models on Liquid-Cooled Versus Air-Cooled H100 GPU Systems | Arslan Munir Team | 2507.16781 | null | |
| 2025-07-22 | Enhancing Remote Sensing Vision-Language Models Through MLLM and LLM-Based High-Quality Image-Text Dataset Generation | Ke Yang Team | 2507.16716 | null | |
| 2025-07-22 | Experience is the Best Teacher: Grounding VLMs for Robotics through Self-Generated Memory | Marco Hutter Team | 2507.16713 | null | |
| 2025-07-22 | Spatial 3D-LLM: Exploring Spatial Awareness in 3D Vision-Language Models | Chao Zhang Team | 2507.16524 | null | |
| 2025-07-22 | SceneLoom: Communicating Data with Scene Context | Siming Chen Team | 2507.16466 | null | |
| 2025-07-22 | Quality Text, Robust Vision: The Role of Language in Enhancing Visual Robustness of Vision-Language Models | Isao Echizen Team | 2507.16257 | null | |
| 2025-07-22 | SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction | Jiaqi Wang Team | 2507.15852 | null | |
| 2025-07-21 | Can Your Model Separate Yolks with a Water Bottle? Benchmarking Physical Commonsense Understanding in Video Generation Models | Erkut Erdem Team | 2507.15824 | null | |
| 2025-07-23 | Visual-Language Model Knowledge Distillation Method for Image Quality Assessment | Jiarun Song Team | 2507.15680 | null | |
| 2025-07-21 | Smart Eyes for Silent Threats: VLMs and In-Context Learning for THz Imaging | Margret Keuper Team | 2507.15576 | null | |
| 2025-07-21 | HOLa: Zero-Shot HOI Detection with Low-Rank Decomposed VLM Feature Adaptation | Robby T. Tan Team | 2507.15542 | null | |
| 2025-07-21 | Chart-R1: Chain-of-Thought Supervision and Reinforcement for Advanced Chart Reasoner | Lin Ma Team | 2507.15509 | null | |
| 2025-07-21 | One Last Attention for Your Vision-Language Model | Zhiqiang Shen Team | 2507.15480 | null | |
| 2025-07-21 | EgoPrune: Efficient Token Pruning for Egomotion Video Reasoning in Embodied Agent | Xinlei Chen Team | 2507.15428 | null | |
| 2025-07-21 | In-context Learning of Vision Language Models for Detection of Physical and Digital Attacks against Face Recognition Systems | Christoph Busch Team | 2507.15285 | null | |
| 2025-07-21 | VLM-UDMC: VLM-Enhanced Unified Decision-Making and Motion Control for Urban Autonomous Driving | Tong Heng Lee Team | 2507.15266 | null | |
| 2025-07-20 | Survey of GenAI for Automotive Software Development: From Requirements to Executable Code | Alois Knoll Team | 2507.15025 | null | |
| 2025-07-20 | Hierarchical Cross-modal Prompt Learning for Vision-Language Models | Zhenhua Huang Team | 2507.14976 | null | |
| 2025-07-20 | FinChart-Bench: Benchmarking Financial Chart Comprehension in Vision-Language Models | Mengnan Du Team | 2507.14823 | null | |
| 2025-07-19 | IRGPT: Understanding Real-world Infrared Image with Bi-cross-modal Curriculum on Large-scale Benchmark | Ruiheng Zhang Team | 2507.14449 | null | |
| 2025-07-18 | CLIPTTA: Robust Contrastive Vision-Language Test-Time Adaptation | Nicolas Thome Team | 2507.14312 | null | |
| 2025-07-18 | In-Depth and In-Breadth: Pre-training Multimodal Language Models Customized for Comprehensive Chart Understanding | Leonid Sigal Team | 2507.14298 | null | |
| 2025-07-18 | VLA-Mark: A cross modal watermark for large vision-language alignment model | Xuming Hu Team | 2507.14067 | null | |
| 2025-07-18 | EdgeVLA: Efficient Vision-Language-Action Models | Benjamin Bolte Team | 2507.14049 | null | |
| 2025-07-18 | Moodifier: MLLM-Enhanced Emotion-Driven Image Editing | Sharon X. Huang Team | 2507.14024 | null | |
| 2025-07-18 | When Seeing Overrides Knowing: Disentangling Knowledge Conflicts in Vision-Language Models | Alberto Cazzaniga Team | 2507.13868 | null | |
| 2025-07-18 | Teaching Vision-Language Models to Ask: Resolving Ambiguity in Visual Questions | Jiajun Zhang Team | 2507.13773 | null | |
| 2025-07-17 | LoRA-Loop: Closing the Synthetic Replay Cycle for Continual VLM Learning | Margrit Betke Team | 2507.13568 | null | |
| 2025-07-17 | COREVQA: A Crowd Observation and Reasoning Entailment Visual Question Answering Benchmark | Vasu Sharma Team | 2507.13405 | null | |
| 2025-07-17 | VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning | Jiaya Jia Team | 2507.13348 | null | |
| 2025-07-17 | Leveraging Language Prior for Infrared Small Target Detection | Pravendra Singh Team | 2507.13113 | null | |
| 2025-07-17 | GLAD: Generalizable Tuning for Vision-Language Models | Shifeng Chen Team | 2507.13089 | null | |
| 2025-07-17 | Advancing Complex Wide-Area Scene Understanding with Hierarchical Coresets Selection | Changwen Zheng Team | 2507.13061 | null | |
| 2025-07-21 | LaViPlan: Language-Guided Visual Path Planning with RLVR | Hayeon Oh Team | 2507.12911 | null | |
| 2025-07-17 | City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning | Xiaowen Chu Team | 2507.12795 | null | |
| 2025-07-16 | VLMgineer: Vision Language Models as Robotic Toolsmiths | Dinesh Jayaraman Team | 2507.12644 | null | |
| 2025-07-16 | NLI4VolVis: Natural Language Interaction for Volume Visualization via LLM Multi-Agents and Editable 3D Gaussian Splatting | Chaoli Wang Team | 2507.12621 | null | |
| 2025-07-16 | MindJourney: Test-Time Scaling with World Models for Spatial Reasoning | Chuang Gan Team | 2507.12508 | null | |
| 2025-07-16 | ReAL-AD: Towards Human-Like Reasoning in End-to-End Autonomous Driving | Xinge Zhu Team | 2507.12499 | null | |
| 2025-07-15 | Spatially Grounded Explanations in Vision Language Models for Document Visual Question Answering | Dimosthenis Karatzas Team | 2507.12490 | null | |
| 2025-07-20 | PhysX-3D: Physical-Grounded 3D Asset Generation | Ziwei Liu Team | 2507.12465 | null | |
| 2025-07-16 | Describe Anything Model for Visual Question Answering on Text-rich Images | Min Xu Team | 2507.12441 | null | |
| 2025-07-16 | AutoVDC: Automated Vision Data Cleaning Using Vision-Language Models | Sihao Ding Team | 2507.12414 | null | |
| 2025-07-16 | Generate to Ground: Multimodal Text Conditioning Boosts Phrase Grounding in Medical Vision-Language Models | Bernhard Kainz Team | 2507.12236 | null | |
| 2025-07-16 | InstructFLIP: Exploring Unified Vision-Language Model for Face Anti-spoofing | Wen-Huang Cheng Team | 2507.12060 | null | |
| 2025-07-16 | GS-Bias: Global-Spatial Bias Learner for Single-Image Test-Time Adaptation of Vision-Language Models | Rongrong Ji Team | 2507.11969 | null | |
| 2025-07-16 | POLYCHARTQA: Benchmarking Large Vision-Language Models with Multilingual Chart Question Answering | Qin Jin Team | 2507.11939 | null | |
| 2025-07-15 | Seeing the Signs: A Survey of Edge-Deployable OCR Models for Billboard Visibility Analysis | Lihang Ying Team | 2507.11730 | null | |
| 2025-07-18 | How Far Have Medical Vision-Language Models Come? A Comprehensive Benchmarking Study | Rossella Arcucci Team | 2507.11200 | null | |
| 2025-07-15 | Bridging the Gap in Vision Language Models in Identifying Unsafe Concepts Across Modalities | Yang Zhang Team | 2507.11155 | null | |
| 2025-07-15 | Assessing Color Vision Test in Large Vision-language Models | Hongyang Chen Team | 2507.11153 | null | |
| 2025-07-15 | MSA at ImageCLEF 2025 Multimodal Reasoning: Multilingual Multimodal Reasoning With Ensemble Vision Language Models | Hamza Moustafa Team | 2507.11114 | null | |
| 2025-07-15 | Tactical Decision for Multi-UGV Confrontation with a Vision-Language Model-Based Commander | Lei Chen Team | 2507.11079 | null | |
| 2025-07-15 | Bridge Feature Matching and Cross-Modal Alignment with Mutual-filtering for Zero-shot Anomaly Detection | Guanzhong Tian Team | 2507.11003 | null | |
| 2025-07-14 | EmbRACE-3K: Embodied Reasoning and Action in Complex Environments | Xiaojuan Qi Team | 2507.10548 | null | |
| 2025-07-14 | CoralVQA: A Large-Scale Visual Question Answering Dataset for Coral Reef Image Understanding | Yi Wang Team | 2507.10449 | null | |
| 2025-07-14 | Beyond Graph Model: Reliable VLM Fine-Tuning via Random Graph Adapter | Bin Luo Team | 2507.10355 | null | |
| 2025-07-14 | Synthesizing Near-Boundary OOD Samples for Out-of-Distribution Detection | Wenqiang Zhang Team | 2507.10225 | null | |
| 2025-07-14 | BlueGlass: A Framework for Composite AI Safety | Kay-Ulrich Scholl Team | 2507.10106 | null | |
| 2025-07-14 | Foundation Model Driven Robotics: A Comprehensive Review | Ammar Waheed Team | 2507.10087 | null | |
| 2025-07-14 | LayLens: Improving Deepfake Understanding through Simplified Explanations | Abhinav Dhall Team | 2507.10066 | null | |
| 2025-07-14 | CoSMo: A Multimodal Transformer for Page Stream Segmentation in Comic Books | Dimosthenis Karatzas Team | 2507.10053 | null | |
| 2025-07-14 | Text-Driven Causal Representation Learning for Source-Free Domain Generalization | Zhen Lei Team | 2507.09961 | null | |
| 2025-07-13 | NegRefine: Refining Negative Label-Based Zero-Shot OOD Detection | Pulei Xiong Team | 2507.09795 | null | |
| 2025-07-13 | Towards Fine-Grained Adaptation of CLIP via a Self-Trained Alignment Score | Muhammad Haris Khan Team | 2507.09615 | null | |
| 2025-07-13 | Advancing Reliable Test-Time Adaptation of Vision-Language Models under Visual Variations | Guiguang Ding Team | 2507.09500 | null | |
| 2025-07-13 | GLIMPSE: Do Large Vision-Language Models Truly Think With Videos or Just Glimpse at Them? | Huaxiu Yao Team | 2507.09491 | null | |
| 2025-07-12 | Uncertainty-Driven Expert Control: Enhancing the Reliability of Medical Vision-Language Models | Tat-Seng Chua Team | 2507.09209 | null | |
| 2025-07-12 | MCA-LLaVA: Manhattan Causal Attention for Reducing Hallucination in Large Vision-Language Models | Dahan Wang Team | 2507.09184 | null | |
| 2025-07-12 | OPENXRD: A Comprehensive Benchmark and Enhancement Framework for LLM/MLLM XRD Question Answering | Niaz Abdolrahim Team | 2507.09155 | null | |
| 2025-07-12 | RadEyeVideo: Enhancing general-domain Large Vision Language Model for chest X-ray analysis with video representations of eye gaze | Honghan Wu Team | 2507.09097 | null | |
| 2025-07-11 | BlindSight: Harnessing Sparsity for Efficient VLMs | Steven K. Reinhardt Team | 2507.09071 | null | |
| 2025-07-11 | Beyond vividness: Content analysis of induced hallucinations reveals the hidden structure of individual differences in visual imagery | Seana Coulson Team | 2507.09011 | null | |
| 2025-07-11 | VIP: Visual Information Protection through Adversarial Attacks on Vision-Language Models | Olivier Déforges Team | 2507.08982 | null | |
| 2025-07-11 | ByDeWay: Boost Your multimodal LLM with DEpth prompting in a Training-Free Way | Subarna Tripathi Team | 2507.08679 | null | |
| 2025-07-11 | Adaptive Framework for Ambient Intelligence in Rehabilitation Assistance | András Lőrincz Team | 2507.08624 | null | |
| 2025-07-11 | Emergent Natural Language with Communication Games for Improving Image Captioning Capabilities without Additional Data | Ambedkar Dukkipati Team | 2507.08610 | null | |
| 2025-07-11 | BayesTTA: Continual-Temporal Test-Time Adaptation for Vision-Language Models via Gaussian Discriminant Analysis | Hui Xiong Team | 2507.08607 | null | |
| 2025-07-11 | Efficient Deployment of Vision-Language Models on Mobile Devices: A Case Study on OnePlus 13R | Sanidhya Kashyap Team | 2507.08505 | null | |
| 2025-07-11 | LLaPa: A Vision-Language Model Framework for Counterfactual-Aware Procedural Planning | Lei Fan Team | 2507.08496 | null | |
| 2025-07-11 | Multi-modal Mutual-Guidance Conditional Prompt Learning for Vision-Language Models | Jianping Fan Team | 2507.08410 | null | |
| 2025-07-11 | Making VLMs More Robot-Friendly: Self-Critical Distillation of Low-Level Procedural Reasoning | Yejin Choi Team | 2507.08224 | null | |
| 2025-07-10 | CLIP Won’t Learn Object-Attribute Binding from Natural Data and Here is Why | Thomas Brox Team | 2507.07985 | null | |
| 2025-07-10 | Scaling RL to Long Videos | Song Han Team | 2507.07966 | null | |
| 2025-07-10 | SAGE: A Visual Language Model for Anomaly Detection via Fact Enhancement and Entropy-aware Alignment | Lei Fan Team | 2507.07939 | null | |
| 2025-07-10 | MoSE: Skill-by-Skill Mixture-of-Expert Learning for Autonomous Driving | Chao Zhang Team | 2507.07818 | null | |
| 2025-07-10 | Energy-Guided Decoding for Object Hallucination Mitigation | Christopher Zach Team | 2507.07731 | null | |
| 2025-07-10 | One Object, Multiple Lies: A Benchmark for Cross-task Adversarial Attack on Unified Vision-Language Models | Cairong Zhao Team | 2507.07709 | null | |
| 2025-07-10 | Rationale-Enhanced Decoding for Multi-modal Chain-of-Thought | Daiki Chijiwa Team | 2507.07685 | null | |
| 2025-07-11 | ViLU: Learning Vision-Language Uncertainties for Failure Prediction | Nicolas Thome Team | 2507.07620 | null | |
| 2025-07-10 | LOSC: LiDAR Open-voc Segmentation Consolidator | Renaud Marlet Team | 2507.07605 | null | |
| 2025-07-10 | The Synergy Dilemma of Long-CoT SFT and RL: Investigating Post-Training Techniques for Reasoning VLMs | Qun Liu Team | 2507.07562 | null | |
| 2025-07-10 | ArchiveGPT: A human-centered evaluation of using a vision language model for image cataloguing | Markus Huff Team | 2507.07551 | null | |
| 2025-07-11 | Entity Re-identification in Visual Storytelling via Contrastive Reinforcement Learning | David Martins de Matos Team | 2507.07340 | null | |
| 2025-07-09 | ADIEE: Automatic Dataset Creation and Scorer for Instruction-Guided Image Editing Evaluation | Suren Kumar Team | 2507.07317 | null | |
| 2025-07-09 | LangNavBench: Evaluation of Natural Language Understanding in Semantic Navigation | Angel X. Chang Team | 2507.07299 | null | |
| 2025-07-09 | MagiC: Evaluating Multimodal Cognition Toward Grounded Visual Reasoning | Dan Goldwasser Team | 2507.07297 | null | |
| 2025-07-09 | 4KAgent: Agentic Any Image to 4K Super-Resolution | Zhengzhong Tu Team | 2507.07105 | null | |
| 2025-07-14 | Vision-Language-Vision Auto-Encoder: Scalable Knowledge Distillation from Diffusion Models | Junfei Xiao Team | 2507.07104 | link | |
| 2025-07-09 | Evaluating Attribute Confusion in Fashion Text-to-Image Generation | Davide Talon Team | 2507.07079 | null | |
| 2025-07-09 | Free on the Fly: Enhancing Flexibility in Test-Time Adaptation with Online EM | Sibei Yang Team | 2507.06973 | null | |
| 2025-07-09 | CheXPO: Preference Optimization for Chest X-ray VLMs with Counterfactual Rationale | Quan Wang Team | 2507.06959 | null | |
| 2025-07-09 | VisualTrap: A Stealthy Backdoor Attack on GUI Agents via Visual Grounding Manipulation | Tat-Seng Chua Team | 2507.06899 | null | |
| 2025-07-09 | HVI-CIDNet+: Beyond Extreme Darkness for Low-Light Image Enhancement | Yanning Zhang Team | 2507.06814 | null | |
| 2025-07-09 | Finetuning Vision-Language Models as OCR Systems for Low-Resource Languages: A Case Study of Manchu | Donghyeok Choi Team | 2507.06761 | null | |
| 2025-07-09 | Text-promptable Object Counting via Quantity Awareness Enhancement | Li Li Team | 2507.06679 | null | |
| 2025-07-09 | Cross-Modal Dual-Causal Learning for Long-Term Action Recognition | Fan Chao Team | 2507.06603 | null | |
| 2025-07-09 | Bilateral Collaboration with Large Vision-Language Models for Open Vocabulary Human-Object Interaction Detection | Xiangmin Xu Team | 2507.06510 | null | |
| 2025-07-09 | 3D-Generalist: Self-Improving Vision-Language-Action Models for Crafting 3D Worlds | Nick Haber Team | 2507.06484 | null | |
| 2025-07-08 | VisioPath: Vision-Language Enhanced Model Predictive Control for Safe Autonomous Navigation in Mixed Traffic | Andreas A. Malikopoulos Team | 2507.06441 | null | |
| 2025-07-08 | CultureCLIP: Empowering CLIP with Cultural Awareness through Synthetic Images and Contextualized Captions | Yi R. Fung Team | 2507.06210 | null | |
| 2025-07-08 | Enhancing Scientific Visual Question Answering through Multimodal Reasoning and Ensemble Modeling | Naga Harshita Marupaka Team | 2507.06183 | null | |
| 2025-07-10 | Skywork-R1V3 Technical Report | Yahui Zhou Team | 2507.06167 | null | |
| 2025-07-08 | LangMamba: A Language-driven Mamba Framework for Low-dose CT Denoising with Vision-language Models | Hongming Shan Team | 2507.06140 | null | |
| 2025-07-08 | GeoMag: A Vision-Language Model for Pixel-level Fine-Grained Remote Sensing Image Parsing | Hao Liu Team | 2507.05887 | null | |
| 2025-07-08 | Bridging Perception and Language: A Systematic Benchmark for LVLMs’ Understanding of Amodal Completion Reports | Hitomi Yanaka Team | 2507.05799 | null | |
| 2025-07-08 | SPADE: Spatial-Aware Denoising Network for Open-vocabulary Panoptic Scene Graph Generation with Long- and Local-range Context Reasoning | Tao He Team | 2507.05798 | null | |
| 2025-07-08 | A Satellite-Ground Synergistic Large Vision-Language Model System for Earth Observation | Yue Gao Team | 2507.05731 | null | |
| 2025-07-09 | Integrated Structural Prompt Learning for Vision-Language Models | Bin Luo Team | 2507.05677 | null | |
| 2025-07-08 | R-VLM: Region-Aware Vision Language Model for Precise GUI Grounding | Shabnam Ghadar Team | 2507.05673 | null | |
| 2025-07-08 | Dynamic Rank Adaptation for Vision-Language Models | Bin Luo Team | 2507.05668 | null | |
| 2025-07-08 | Structured Task Solving via Modular Embodied Intelligence: A Case Study on Rubik’s Cube | Shenghai Yuan Team | 2507.05607 | null | |
| 2025-07-08 | Rethinking Layered Graphic Design Generation with a Top-Down Approach | Qifeng Chen Team | 2507.05601 | null | |
| 2025-07-08 | PaddleOCR 3.0 Technical Report | Yanjun Ma Team | 2507.05595 | null | |
| 2025-07-07 | Fine-Grained Vision-Language Modeling for Multimodal Training Assistants in Augmented Reality | Junxiao Wang Team | 2507.05515 | null | |
| 2025-07-07 | Llama Nemoretriever Colembed: Top-Performing Text-Image Retrieval Model | Even Oldridge Team | 2507.05513 | null | |
| 2025-07-07 | OpenWorldSAM: Extending SAM2 for Universal Image Segmentation with Language Prompts | Priyadarshini Panda Team | 2507.05427 | null | |
| 2025-07-07 | pFedMMA: Personalized Federated Fine-Tuning with Multi-Modal Adapter for Vision-Language Models | Ramtin Pedarsani Team | 2507.05394 | null | |
| 2025-07-07 | NavigScene: Bridging Local Perception and Global Navigation for Beyond-Visual-Range Autonomous Driving | Cheng Lu Team | 2507.05227 | null | |
| 2025-07-07 | All in One: Visual-Description-Guided Unified Point Cloud Segmentation | Rao Muhammad Anwer Team | 2507.05211 | null | |
| 2025-07-07 | Differential Attention for Multimodal Crisis Event Analysis | Abdullah-Al-Zubaer Imran Team | 2507.05165 | null | |
| 2025-07-07 | INTER: Mitigating Hallucination in Large Vision-Language Models by Interaction Guidance Sampling | Bo Zheng Team | 2507.05056 | null | |
| 2025-07-07 | Adaptation of Multi-modal Representation Models for Multi-task Surgical Computer Vision | Nicolas Padoy Team | 2507.05020 | null | |
| 2025-07-07 | Training-free Generation of Temporally Consistent Rewards from VLMs | Jian Tang Team | 2507.04789 | null | |
| 2025-07-07 | Vision-Language Models Can’t See the Obvious | Sanath Narayan Team | 2507.04741 | null | |
| 2025-07-07 | An analysis of vision-language models for fabric retrieval | Fabio Poiesi Team | 2507.04735 | null | |
| 2025-07-07 | A Visual Leap in CLIP Compositionality Reasoning through Generation of Counterfactual Sets | Jie Zhou Team | 2507.04699 | null | |
| 2025-07-07 | MOSU: Autonomous Long-range Robot Navigation with Multi-modal Scene Understanding | Dinesh Manocha Team | 2507.04686 | null | |
| 2025-07-07 | Identify, Isolate, and Purge: Mitigating Hallucinations in LVLMs via Self-Evolving Distillation | Chang Xu Team | 2507.04680 | null | |
| 2025-07-06 | VLM-TDP: VLM-guided Trajectory-conditioned Diffusion Policy for Robust Long-Horizon Manipulation | Lei Han Team | 2507.04524 | null | |
| 2025-07-08 | FA: Forced Prompt Learning of Vision-Language Models for Out-of-Distribution Detection | Ruixuan Wang Team | 2507.04511 | null | |
| 2025-07-06 | MVL-Loc: Leveraging Vision-Language Model for Generalizable Multi-Scene Camera Relocalization | Changhao Chen Team | 2507.04509 | null | |
| 2025-07-06 | Think Twice Before You Judge: Mixture of Dual Reasoning Experts for Multimodal Sarcasm Detection | Sanasam Ranbir Singh Team | 2507.04458 | null | |
| 2025-07-06 | Multi-Modal Semantic Parsing for the Interpretation of Tombstone Inscriptions | Johan Bos Team | 2507.04377 | null | |
| 2025-07-05 | LVLM-Composer’s Explicit Planning for Image Generation | Amina Grant Team | 2507.04152 | null | |
| 2025-07-05 | Unlocking Compositional Control: Self-Supervision for LVLM-Based Image Generation | Hunter Young Team | 2507.04151 | null | |
| 2025-07-05 | PresentAgent: Multimodal Agent for Presentation Video Generation | Yang Zhao Team | 2507.04036 | null | |
| 2025-07-05 | A Comparative Study of Specialized LLMs as Dense Retrievers | Jiafeng Guo Team | 2507.03958 | null | |
| 2025-07-03 | ArtGS: 3D Gaussian Splatting for Interactive Visual-Physical Modeling and Manipulation of Articulated Objects | Cewu Lu Team | 2507.02600 | null | |
| 2025-07-02 | cVLA: Towards Efficient Camera-Space VLAs | Thomas Brox Team | 2507.02190 | null | |
| 2025-07-02 | Large Language Models for Crash Detection in Video: A Survey of Methods, Datasets, and Challenges | Anuj Sharma Team | 2507.02074 | null | |
| 2025-07-01 | Temporal Chain of Thought: Long-Video Understanding by Thinking in Frames | Cordelia Schmid Team | 2507.02001 | null | |
| 2025-07-02 | How Do Vision-Language Models Process Conflicting Information Across Modalities? | Ellie Pavlick Team | 2507.01790 | null | |
| 2025-07-02 | Facial Emotion Learning with Text-Guided Multiview Fusion via Vision-Language Model for 3D/4D Facial Expression Recognition | Muzammil Behzad Team | 2507.01673 | null | |
| 2025-07-02 | MARVIS: Modality Adaptive Reasoning over VISualizations | Chinmay Hegde Team | 2507.01544 | null | |
| 2025-07-02 | Following the Clues: Experiments on Person Re-ID using Cross-Modal Intelligence | Martin Schramm Team | 2507.01504 | null | |
| 2025-07-02 | BioMARS: A Multi-Agent Robotic System for Autonomous Biological Experiments | Mingzhai Sun Team | 2507.01485 | null | |
| 2025-07-03 | TriVLA: A Triple-System-Based Unified Vision-Language-Action Model for General Robot Control | Yanwei Fu Team | 2507.01424 | null | |
| 2025-07-02 | CaptionSmiths: Flexibly Controlling Language Pattern in Image Captioning | Yoshitaka Ushiku Team | 2507.01409 | null | |
| 2025-07-02 | Long-Tailed Distribution-Aware Router For Mixture-of-Experts in Large Vision-Language Model | Xi Li Team | 2507.01351 | null | |
| 2025-07-02 | AIGVE-MACS: Unified Multi-Aspect Commenting and Scoring Model for AI-Generated Video Evaluation | Jiawei Zhang Team | 2507.01255 | null | |
| 2025-07-02 | GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning | Jie Tang Team | 2507.01006 | null | |
| 2025-07-04 | Robotic Manipulation by Imitating Generated Videos Without Physical Demonstrations | Yunzhu Li Team | 2507.00990 | null | |
| 2025-07-01 | Thinking Beyond Tokens: From Brain-Inspired Intelligence to Cognitive Foundations for Artificial General Intelligence and its Societal Impact | Seyedali Mirjalili Team | 2507.00951 | null | |
| 2025-07-01 | The Age of Sensorial Zero Trust: Why We Can No Longer Trust Our Senses | Fabio Correa Xavier Team | 2507.00907 | null | |
| 2025-07-01 | ONLY: One-Layer Intervention Sufficiently Mitigates Hallucinations in Large Vision-Language Models | Yaqi Xie Team | 2507.00898 | null | |
| 2025-07-01 | GaussianVLM: Scene-centric 3D Vision-Language Models using Language-aligned Gaussian Splats for Embodied Reasoning and Beyond | Luc Van Gool Team | 2507.00886 | null | |
| 2025-07-01 | UPRE: Zero-Shot Domain Adaptation for Object Detection via Unified Prompt and Representation Enhancement | Xiangxiang Chu Team | 2507.00721 | null | |
| 2025-07-01 | Contrasting Cognitive Styles in Vision-Language Models: Holistic Attention in Japanese Versus Analytical Focus in English | Rajesh Sharma Team | 2507.00700 | null | |
| 2025-07-01 | Context-Aware Academic Emotion Dataset and Benchmark | Wenwu Yang Team | 2507.00586 | null | |
| 2025-07-01 | Not All Attention Heads Are What You Need: Refining CLIP’s Image Representation with Attention Ablation | Rong Xiao Team | 2507.00537 | null | |
| 2025-07-01 | Box-QAymo: Box-Referring VQA Dataset for Autonomous Driving | Yadan Luo Team | 2507.00525 | null | |
| 2025-06-30 | EXPERT: An Explainable Image Captioning Evaluation Metric with Structured Explanations | Sungzoon Cho Team | 2506.24016 | null | |
| 2025-06-30 | The Illusion of Progress? A Critical Look at Test-Time Adaptation for Vision-Language Models | Tieniu Tan Team | 2506.24000 | null | |
| 2025-06-30 | GroundingDINO-US-SAM: Text-Prompted Multi-Organ Segmentation in Ultrasound with LoRA-Tuned Vision-Language Models | Hassan Rivaz Team | 2506.23903 | null | |
| 2025-06-30 | A Closer Look at Conditional Prompt Tuning for Vision-Language Models | Heng Tao Shen Team | 2506.23856 | null | |
| 2025-06-30 | Interpretable Zero-Shot Learning with Locally-Aligned Vision-Language Model | Fahad Shahbaz Khan Team | 2506.23822 | null | |
| 2025-06-30 | Visual Textualization for Image Prompted Object Detection | Yan Xu Team | 2506.23785 | null | |
| 2025-06-30 | PAC Bench: Do Foundation Models Understand Prerequisites for Executing Manipulation Policies? | Ransalu Senanayake Team | 2506.23725 | null | |
| 2025-06-30 | On the Domain Robustness of Contrastive Vision-Language Models | Erik Rodner Team | 2506.23663 | null | |
| 2025-06-30 | CAI: Caption-Sensitive Attention Intervention for Mitigating Object Hallucination in Large Vision-Language Models | Bing Qin Team | 2506.23590 | null | |
| 2025-06-30 | A Clinically-Grounded Two-Stage Framework for Renal CT Report Generation | Jie Xu Team | 2506.23584 | null | |
| 2025-07-01 | ZonUI-3B: A Lightweight Vision-Language Model for Cross-Resolution GUI Grounding | ShengJing Yang Team | 2506.23491 | null | |
| 2025-06-30 | Sanitizing Manufacturing Dataset Labels Using Vision-Language Models | Vinh Nguyen Team | 2506.23465 | null | |
| 2025-06-29 | GeoProg3D: Compositional Visual Reasoning for City-Scale 3D Language Fields | Yutaka Matsuo Team | 2506.23352 | null | |
| 2025-06-29 | IR3D-Bench: Evaluating Vision-Language Model Scene Understanding as Agentic Inverse Rendering | Brandon Y. Feng Team | 2506.23329 | null | |
| 2025-07-01 | SurgTPGS: Semantic 3D Surgical Scene Understanding with Text Promptable Gaussian Splatting | Hongliang Ren Team | 2506.23309 | null | |
| 2025-06-29 | Decoding Memes: Benchmarking Narrative Role Classification across Multilingual and Multimodal Models | Tanmoy Chakraborty Team | 2506.23122 | null | |
| 2025-06-29 | MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings | Zhicheng Dou Team | 2506.23115 | null | |
| 2025-06-29 | Empowering Small VLMs to Think with Dynamic Memorization and Exploration | Long Chen Team | 2506.23061 | null | |
| 2025-06-29 | SoMi-ToM: Evaluating Multi-Perspective Theory of Mind in Embodied Social Interactions | Maarten Sap Team | 2506.23046 | null | |
| 2025-06-28 | Revisiting CroPA: A Reproducibility Study and Enhancements for Cross-Prompt Adversarial Transferability in Vision-Language Models | Swadesh Swain Team | 2506.22982 | null | |
| 2025-06-27 | MiCo: Multi-image Contrast for Reinforcement Visual Reasoning | Hengshuang Zhao Team | 2506.22434 | null | |
| 2025-06-27 | Test-Time Consistency in Vision Language Models | Leonid Sigal Team | 2506.22395 | null | |
| 2025-06-27 | Exploiting Vision Language Model for Training-Free 3D Point Cloud OOD Detection via Graph Score Propagation | Xun Xu Team | 2506.22375 | null | |
| 2025-06-27 | Rethinking Visual Token Reduction in LVLMs under Cross-modal Misalignment | Bo Du Team | 2506.22283 | null | |
| 2025-06-27 | COOCO – Common Objects Out-of-Context – Semantic Violation in Scenes: Investigating Multimodal Context in Referential Communication | Albert Gatt Team | 2506.22274 | null | |
| 2025-06-27 | Visual Structures Helps Visual Reasoning: Addressing the Binding Problem in VLMs | Mahdieh Soleymani Baghshah Team | 2506.22146 | null | |
| 2025-06-27 | Universal Retrieval for Multimodal Trajectory Modeling | Dehan Kong Team | 2506.22056 | null | |
| 2025-06-27 | Partial CLIP is Enough: Chimera-Seg for Zero-shot Semantic Segmentation | Daisuke Deguchi Team | 2506.22032 | null | |
| 2025-06-27 | SODA: Out-of-Distribution Detection in Domain-Shifted Point Clouds via Neighborhood Propagation | Xulei Yang Team | 2506.21892 | null | |
| 2025-06-27 | Integrating Multi-Modal Sensors: A Review of Fusion Techniques for Intelligent Vehicles | Matthew J. Barth Team | 2506.21885 | null | |
| 2025-06-27 | Do Vision-Language Models Have Internal World Models? Towards an Atomic Evaluation | Zhiting Hu Team | 2506.21876 | null | |
| 2025-06-27 | On the Feasibility of Poisoning Text-to-Image AI Models via Adversarial Mislabeling | Ben Y. Zhao Team | 2506.21874 | null | |
| 2025-06-27 | Remote Sensing Large Vision-Language Model: Semantic-augmented Multi-level Alignment and Semantic-aware Expert Modeling | Yong Man Ro Team | 2506.21863 | null | |
| 2025-06-27 | Embodied Domain Adaptation for Object Detection | Feras Dayoub Team | 2506.21860 | null | |
| 2025-06-27 | The Cost of Avoiding Backpropagation | Hui Guan Team | 2506.21833 | null | |
| 2025-06-26 | ViStruct: Simulating Expert-Like Reasoning Through Task Decomposition and Visual Attention Cues | Carolina Nobre Team | 2506.21762 | null | |
| 2025-06-26 | Fine-Grained Preference Optimization Improves Spatial Reasoning in VLMs | Ismini Lourentzou Team | 2506.21656 | null | |
| 2025-06-26 | Mitigating Hallucination of Large Vision-Language Models via Dynamic Logits Calibration | Jian Wu Team | 2506.21509 | null | |
| 2025-06-26 | Global and Local Entailment Learning for Natural World Imagery | Nathan Jacobs Team | 2506.21476 | null | |
| 2025-06-26 | Spatial Mental Modeling from Limited Views | Li Fei-Fei Team | 2506.21458 | null | |
| 2025-06-27 | ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models | Ziwei Liu Team | 2506.21356 | null | |
| 2025-06-26 | LLaVA-Pose: Enhancing Human Pose and Action Understanding via Keypoint-Integrated Instruction Tuning | Hayaru Shouno Team | 2506.21317 | null | |
| 2025-06-26 | DrishtiKon: Multi-Granular Visual Grounding for Text-Rich Document Images | Ganesh Ramakrishnan Team | 2506.21316 | null | |
| 2025-06-26 | World-aware Planning Narratives Enhance Large Vision-Language Model Planner | Xipeng Qiu Team | 2506.21230 | null | |
| 2025-06-26 | Personalized Federated Learning via Dual-Prompt Optimization and Cross Fusion | Jian Liang Team | 2506.21144 | null | |
| 2025-06-26 | V2X-REALM: Vision-Language Model-Based Robust End-to-End Cooperative Autonomous Driving with Adaptive Long-Tail Modeling | Bin Ran Team | 2506.21041 | null | |
| 2025-06-26 | Multimodal Prompt Alignment for Facial Expression Recognition | Shutao Li Team | 2506.21017 | null | |
| 2025-06-26 | Style-Aligned Image Composition for Robust Detection of Abnormal Cells in Cytopathology | S Kevin Zhou Team | 2506.21001 | null | |
| 2025-06-26 | TSDASeg: A Two-Stage Model with Direct Alignment for Interactive Point Cloud Segmentation | Yihong Wu Team | 2506.20991 | null | |
| 2025-06-26 | SharpZO: Hybrid Sharpness-Aware Vision Language Model Prompt Tuning via Forward-Only Passes | Zheng Zhang Team | 2506.20990 | null | |
| 2025-06-26 | Parallels Between VLA Model Post-Training and Human Motor Learning: Progress, Challenges, and Trends | Zeng-Guang Hou Team | 2506.20966 | null | |
| 2025-06-26 | E-FreeM2: Efficient Training-Free Multi-Scale and Cross-Modal News Verification via MLLMs | Minh-Son Dao Team | 2506.20944 | null | |
| 2025-06-25 | Leveraging Vision-Language Models to Select Trustworthy Super-Resolution Samples Generated by Diffusion Models | Zafer Dogan Team | 2506.20832 | null | |
| 2025-06-25 | How do Foundation Models Compare to Skeleton-Based Approaches for Gesture Recognition in Human-Robot Interaction? | Bastian Leibe Team | 2506.20795 | null | |
| 2025-06-27 | Shape2Animal: Creative Animal Generation from Natural Silhouettes | Trung-Nghia Le Team | 2506.20616 | null | |
| 2025-06-25 | HRIBench: Benchmarking Vision-Language Models for Real-Time Human Perception in Human-Robot Interaction | Maja Matarić Team | 2506.20566 | null | |
| 2025-06-25 | Med-Art: Diffusion Transformer for 2D Medical Text-to-Image Generation | Morten Rieger Hannemose Team | 2506.20449 | null | |
| 2025-06-25 | CARMA: Context-Aware Situational Grounding of Human-Robot Group Interactions by Combining Vision-Language Models with Object and Action Recognition | Michael Gienger Team | 2506.20373 | null | |
| 2025-06-25 | Mobile-R1: Towards Interactive Reinforcement Learning for VLM-Based Mobile Agent via Task-Level Rewards | Bo Zheng Team | 2506.20332 | null | |
| 2025-06-25 | MIRAGE: A Benchmark for Multimodal Information-Seeking and Reasoning in Agricultural Expert-Guided Conversations | Vikram S. Adve Team | 2506.20100 | null | |
| 2025-06-24 | Unified Vision-Language-Action Model | Zhaoxiang Zhang Team | 2506.19850 | null | |
| 2025-06-24 | Evaluating Compliance with Visualization Guidelines in Diagrams for Scientific Publications Using Large Vision Language Models | Christoph M. Friedrich Team | 2506.19825 | null | |
| 2025-06-24 | CronusVLA: Transferring Latent Motion Across Time for Multi-Frame Prediction in Manipulation | Jiangmiao Pang Team | 2506.19816 | null | |
| 2025-06-24 | UltraAD: Fine-Grained Ultrasound Anomaly Classification via Few-Shot CLIP Adaptation | Zhongliang Jiang Team | 2506.19694 | null | |
| 2025-06-24 | PEVLM: Parallel Encoding for Vision-Language Models | Yong Wu Team | 2506.19651 | null | |
| 2025-06-24 | V2T-CoT: From Vision to Text Chain-of-Thought for Medical Reasoning and Diagnosis | Zuozhu Liu Team | 2506.19610 | null | |
| 2025-06-24 | ChordPrompt: Orchestrating Cross-Modal Prompt Synergy for Multi-Domain Incremental Learning in CLIP | Bokui Chen Team | 2506.19608 | null | |
| 2025-06-24 | Fake or Real, Can Robots Tell? Evaluating Embodied Vision-Language Models on Real and 3D-Printed Objects | Angelo Cangelosi Team | 2506.19579 | null | |
| 2025-06-24 | Visual hallucination detection in large vision-language models via evidential conflict | Liping Jing Team | 2506.19513 | null | |
| 2025-06-24 | T-Rex: Task-Adaptive Spatial Representation Extraction for Robotic Manipulation with Vision-Language Models | Qingyao Wu Team | 2506.19498 | null | |
| 2025-06-24 | Emergence of Text Readability in Vision Language Models | Bohyung Han Team | 2506.19389 | null | |
| 2025-06-24 | Robotic Perception with a Large Tactile-Vision-Language Model for Physical Property Inference | Nutan Chen Team | 2506.19303 | null | |
| 2025-06-24 | Open-Vocabulary Camouflaged Object Segmentation with Cascaded Vision Language Models | Dan Zeng Team | 2506.19300 | null | |
| 2025-06-24 | Da Yu: Towards USV-Based Image Captioning for Waterway Surveillance and Scene Understanding | Hui Xiong Team | 2506.19288 | null | |
| 2025-06-24 | MSR-Align: Policy-Grounded Multimodal Alignment for Safety-Aware Reasoning in Vision-Language Models | Bo Zheng Team | 2506.19257 | null | |
| 2025-06-24 | Scaffolding Dexterous Manipulation with Vision-Language Models | Dorsa Sadigh Team | 2506.19212 | null | |
| 2025-06-23 | Reading Smiles: Proxy Bias in Foundation Models for Facial Emotion Recognition | Bjoern W. Schuller Team | 2506.19079 | null | |
| 2025-06-23 | HAWAII: Hierarchical Visual Knowledge Transfer for Efficient Vision-Language Models | Krzysztof Czarnecki Team | 2506.19072 | null | |
| 2025-06-23 | GLIMPSE: Gradient-Layer Importance Mapping for Prompted Visual Saliency Explanation for Generative LVLMs | Guanxi Shen Team | 2506.18985 | null | |
| 2025-06-23 | VQ-Insight: Teaching VLMs for AI-Generated Video Quality Understanding via Progressive Visual Reinforcement Learning | Jian Zhang Team | 2506.18564 | null | |
| 2025-06-23 | Generalizing Vision-Language Models to Novel Domains: A Comprehensive Survey | Heng Tao Shen Team | 2506.18504 | null | |
| 2025-06-23 | InternSpatial: A Comprehensive Dataset for Spatial Reasoning in Vision-Language Models | Wenhai Wang Team | 2506.18385 | null | |
| 2025-06-23 | Taming Vision-Language Models for Medical Image Analysis: A Comprehensive Review | Jing Qin Team | 2506.18378 | null | |
| 2025-06-23 | Escaping the SpuriVerse: Can Large Vision-Language Models Generalize Beyond Seen Spurious Correlations? | Bill Howe Team | 2506.18322 | null | |
| 2025-06-24 | Referring Expression Instance Retrieval and A Strong End-to-End Baseline | Jinqiao Wang Team | 2506.18246 | null | |
| 2025-06-23 | Drive-R1: Bridging Reasoning and Planning in VLMs for Autonomous Driving with Reinforcement Learning | Xinhai Zhao Team | 2506.18234 | null | |
| 2025-06-22 | See-in-Pairs: Reference Image-Guided Comparative Vision-Language Models for Medical Diagnosis | Xiaoxiao Li Team | 2506.18140 | null | |
| 2025-06-22 | CLGRPO: Reasoning Ability Enhancement for Small VLMs | Zhiwang Zhang Team | 2506.18048 | null | |
| 2025-06-22 | Adapting Vision-Language Models for Evaluating World Models | Sarah Parisot Team | 2506.17967 | null | |
| 2025-06-21 | RoboMonkey: Scaling Test-Time Sampling and Verification for Vision-Language-Action Models | Marco Pavone Team | 2506.17811 | null | |
| 2025-06-21 | MDSAM: Memory-Driven Sparse Attention Matrix for LVLMs Hallucination Mitigation | Xiaochuan Shi Team | 2506.17664 | null | |
| 2025-06-21 | Histopathology Image Report Generation by Vision Language Model with Multimodal In-Context Learning | Yu-Chiang Frank Wang Team | 2506.17645 | null | |
| 2025-06-21 | CLiViS: Unleashing Cognitive Map through Linguistic-Visual Synergy for Embodied Visual Reasoning | Xiaoling Wang Team | 2506.17629 | null | |
| 2025-06-21 | DRAMA-X: A Fine-grained Intent Prediction and Risk Reasoning Benchmark For Driving | Zhengzhong Tu Team | 2506.17590 | null | |
| 2025-06-21 | HalluRNN: Mitigating Hallucinations via Recurrent Cross-Layer Reasoning in Large Vision-Language Models | Tao He Team | 2506.17587 | null | |
| 2025-06-20 | Trustworthy Few-Shot Transfer of Medical VLMs through Split Conformal Prediction | Jose Dolz Team | 2506.17503 | null | |
| 2025-06-20 | Few-Shot, Now for Real: Medical VLMs Adaptation without Balanced Sets or Validation | Ismail Ben Ayed Team | 2506.17500 | null | |
| 2025-06-20 | General-Purpose Robotic Navigation via LVLM-Orchestrated Perception, Reasoning, and Acting | Georgios Georgakis Team | 2506.17462 | null | |
| 2025-06-20 | Aha Moment Revisited: Are VLMs Truly Capable of Self Verification in Inference-time Scaling? | Klara Nahrstedt Team | 2506.17417 | null | |
| 2025-06-20 | VLN-R1: Vision-Language Navigation via Reinforcement Fine-Tuning | Hengshuang Zhao Team | 2506.17221 | null | |
| 2025-06-20 | Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens | Chuang Gan Team | 2506.17218 | null | |
| 2025-06-20 | Do We Need Large VLMs for Spotting Soccer Actions? | Sandeep Chaurasia Team | 2506.17144 | null | |
| 2025-06-20 | Prmpt2Adpt: Prompt-Based Zero-Shot Domain Adaptation for Resource-Constrained Environments | Nathaniel D. Bastian Team | 2506.16994 | null | |
| 2025-06-20 | FOCUS: Unified Vision-Language Modeling for Interactive Editing Driven by Referential Segmentation | Jinqiao Wang Team | 2506.16806 | null | |
| 2025-06-20 | Co-VisiON: Co-Visibility ReasONing on Sparse Image Sets of Indoor Scenes | Chen Feng Team | 2506.16805 | null | |
| 2025-06-20 | Cross-Modal Obfuscation for Jailbreak Attacks on Large Vision-Language Models | Xiaohua Xu Team | 2506.16760 | null | |
| 2025-06-20 | TeSG: Textual Semantic Guidance for Infrared and Visible Image Fusion | Xinbo Gao Team | 2506.16730 | null | |
| 2025-06-20 | V-CASS: Vision-context-aware Expressive Speech Synthesis for Enhancing User Understanding of Videos | Xiaoyu Qin Team | 2506.16716 | null | |
| 2025-06-20 | VLM-Empowered Multi-Mode System for Efficient and Safe Planetary Navigation | Liang Ding Team | 2506.16703 | null | |
| 2025-06-20 | LaVi: Efficient Large Vision-Language Models via Internal Feature Modulation | Jing Liu Team | 2506.16691 | null | |
| 2025-06-19 | CodeDiffuser: Attention-Enhanced Diffusion Policy via VLM-Generated Code for Instruction Ambiguity | Yunzhu Li Team | 2506.16652 | null | |
| 2025-06-19 | History-Augmented Vision-Language Models for Frontier-Based Zero-Shot Object Navigation | Fatemeh Afghah Team | 2506.16623 | null | |
| 2025-06-19 | GoalLadder: Incremental Goal Discovery with Vision-Language Models | Shimon Whiteson Team | 2506.16396 | null | |
| 2025-06-19 | CLIP-MG: Guiding Semantic Attention with Skeletal Pose Features and RGB Data for Micro-Gesture Recognition on the iMiGUE Dataset | Amith Adiraju Team | 2506.16385 | null | |
| 2025-06-19 | FOCoOp: Enhancing Out-of-Distribution Robustness in Federated Prompt Learning for Vision-Language Models | Tat-Seng Chua Team | 2506.16218 | null | |
| 2025-06-19 | AutoV: Learning to Retrieve Visual Prompt for Large Vision-Language Models | Shanghang Zhang Team | 2506.16112 | null | |
| 2025-06-19 | Stepping Out of Similar Semantic Space for Open-Vocabulary Segmentation | Yansong Tang Team | 2506.16058 | null | |
| 2025-06-19 | DualTHOR: A Dual-Arm Humanoid Simulation Platform for Contingency-Aware Planning | Zongqing Lu Team | 2506.16012 | null | |
| 2025-06-18 | VectorEdits: A Dataset and Benchmark for Instruction-Based Editing of Vector Graphics | Michal Štefánik Team | 2506.15903 | null | |
| 2025-06-18 | GenRecal: Generation after Recalibration from Large to Small Vision-Language Models | Yueh-Hua Wu Team | 2506.15681 | null | |
| 2025-06-18 | Dual-Stage Value-Guided Inference with Margin-Based Reward Adjustment for Fast and Faithful VLM Captioning | Imran Razzak Team | 2506.15649 | null | |
| 2025-06-18 | FindingDory: A Benchmark to Evaluate Memory in Embodied Agents | Zsolt Kira Team | 2506.15635 | null | |
| 2025-06-18 | WikiMixQA: A Multimodal Benchmark for Question Answering over Tables and Charts | Rémi Lebret Team | 2506.15594 | link | |
| 2025-06-18 | DiscoSG: Towards Discourse-Level Text Scene Graph Parsing through Iterative Graph Refinement | Zhuang Li Team | 2506.15583 | link | |
| 2025-06-18 | Context-Informed Grounding Supervision | Minjoon Seo Team | 2506.15480 | link | |
| 2025-06-19 | OpenPath: Open-Set Active Learning for Pathology Image Classification via Pre-trained Vision-Language Models | Guotai Wang Team | 2506.15318 | null | |
| 2025-06-18 | MEGC2025: Micro-Expression Grand Challenge on Spot Then Recognize and Visual Question Answering | Adrian K. Davison Team | 2506.15298 | null | |
| 2025-06-18 | ReSeDis: A Dataset for Referring-based Object Search across Large-Scale Image Collections | Shin’ichi Satoh Team | 2506.15180 | null | |
| 2025-06-18 | DyNaVLM: Zero-Shot Vision-Language Navigation System with Dynamic Viewpoints and Self-Refining Graph Memory | Yue Gao Team | 2506.15096 | null | |
| 2025-06-18 | An Empirical Study of Bugs in Data Visualization Libraries | Chengnian Sun Team | 2506.15084 | link | |
| 2025-06-17 | PeRL: Permutation-Enhanced Reinforcement Learning for Interleaved Vision-Language Reasoning | Yeyun Gong Team | 2506.14907 | link | |
| 2025-06-17 | RobotSmith: Generative Robotic Tool Design for Acquisition of Complex Manipulation Skills | Chuang Gan Team | 2506.14763 | null | |
| 2025-06-17 | Casper: Inferring Diverse Intents for Assistive Teleoperation with Vision Language Models | Yuke Zhu Team | 2506.14727 | null | |
| 2025-06-17 | AGENTSAFE: Benchmarking the Safety of Embodied Agents on Hazardous Instructions | Dacheng Tao Team | 2506.14697 | null | |
| 2025-06-17 | Recognition through Reasoning: Reinforcing Image Geo-localization with Large Vision-Language Models | Jiaheng Wei Team | 2506.14674 | null | |
| 2025-06-17 | StreetLens: Enabling Human-Centered AI Agents for Neighborhood Assessment from Street View Imagery | Michelle Pasco Team | 2506.14670 | null | |
| 2025-06-17 | SIRI-Bench: Challenging VLMs’ Spatial Intelligence through Complex Reasoning Tasks | Liang Lin Team | 2506.14512 | null | |
| 2025-06-17 | Can Pretrained Vision-Language Embeddings Alone Guide Robot Navigation? | Soumik Sarkar Team | 2506.14507 | link | |
| 2025-06-17 | Adapting Lightweight Vision Language Models for Radiological Visual Question Answering | Chang Sun Team | 2506.14451 | null | |
| 2025-06-17 | Causally Steered Diffusion for Automated Video Counterfactual Generation | Sotirios A. Tsaftaris Team | 2506.14404 | null | |
| 2025-06-17 | Narrate2Nav: Real-Time Visual Navigation with Implicit Language Reasoning in Human-Centric Environments | Xuesu Xiao Team | 2506.14233 | null | |
| 2025-06-17 | Interpreting Biomedical VLMs on High-Imbalance Out-of-Distributions: An Insight into BiomedCLIP on Radiology | Benjamin Kwan Team | 2506.14136 | null | |
| 2025-06-17 | A Hierarchical Test Platform for Vision Language Model (VLM)-Integrated Real-World Autonomous Driving | Ziran Wang Team | 2506.14100 | null | |
| 2025-06-16 | Disentangling 3D from Large Vision-Language Models for Controlled Portrait Generation | Hyeongwoo Kim Team | 2506.14015 | null | |
| 2025-06-16 | GRaD-Nav++: Vision-Language Model Enabled Visual Drone Navigation with Gaussian Radiance Fields and Differentiable Dynamics | Mac Schwager Team | 2506.14009 | null | |
| 2025-06-16 | Comparison of ConvNeXt and Vision-Language Models for Breast Density Assessment in Screening Mammography | Alejandro Santos-Díaz Team | 2506.13964 | null | |
| 2025-06-16 | HierVL: Semi-Supervised Segmentation leveraging Hierarchical Vision-Language Synergy with Dynamic Text-Spatial Query Alignment | Abdul Bais Team | 2506.13925 | null | |
| 2025-06-16 | Touch begins where vision ends: Generalizable policies for contact-rich manipulation | Raunaq Bhirangi Team | 2506.13762 | null | |
| 2025-06-16 | Prompting with the Future: Open-World Model Predictive Control with Interactive Digital Twins | Wei-Chiu Ma Team | 2506.13761 | null | |
| 2025-06-16 | OTFusion: Bridging Vision-only and Vision-Language Models via Optimal Transport for Transductive Zero-Shot Learning | Yonghang Tai Team | 2506.13723 | null | |
| 2025-06-16 | ROSA: Harnessing Robot States for Vision-Language and Action Alignment | Xiaoyan Sun Team | 2506.13679 | null | |
| 2025-06-16 | DualEdit: Dual Editing for Knowledge Updating in Vision-Language Models | Hanspeter Pfister Team | 2506.13638 | null | |
| 2025-06-16 | VLM-SFD: VLM-Assisted Siamese Flow Diffusion Framework for Dual-Arm Cooperative Manipulation | Wei Pan Team | 2506.13428 | null | |
| 2025-06-16 | Uncertainty-Informed Active Perception for Open Vocabulary Object Goal Navigation | Marija Popović Team | 2506.13367 | null | |
| 2025-06-16 | Anomaly Object Segmentation with Vision-Language Models for Steel Scrap Recycling | Rei Kawakami Team | 2506.13282 | null | |
| 2025-06-16 | Screen Hijack: Visual Poisoning of VLM Agents in Mobile Environments | Ee-Chien Chang Team | 2506.13205 | null | |
| 2025-06-16 | Dynamic Context-oriented Decomposition for Task-aware Low-rank Adaptation with Less Forgetting and Faster Convergence | Bernard Ghanem Team | 2506.13187 | null | |
| 2025-06-16 | GreedyPrune: Retenting Critical Visual Token Set for Large Vision Language Models | Jun Wang Team | 2506.13166 | null | |
| 2025-06-16 | Rethinking Test-Time Scaling for Medical AI: Model and Task-Aware Strategies for LLMs and VLMs | Byung-Hoon Kim Team | 2506.13102 | null | |
| 2025-06-16 | PRISM2: Unlocking Multi-Modal General Pathology AI with Clinical Dialogue | Siqi Liu Team | 2506.13063 | null | |
| 2025-06-17 | HKD4VLM: A Progressive Hybrid Knowledge Distillation Framework for Robust Multimodal Hallucination and Factuality Detection in VLMs | Xuezhi Cao Team | 2506.13038 | null | |
| 2025-06-15 | CAPO: Reinforcing Consistent Reasoning in Medical Decision-Making | Zuozhu Liu Team | 2506.12849 | null | |
| 2025-06-15 | Enhancing Rating-Based Reinforcement Learning to Effectively Leverage Feedback from Large Vision-Language Models | Chang D. Yoo Team | 2506.12822 | null | |
| 2025-06-15 | Native Visual Understanding: Resolving Resolution Dilemmas in Vision-Language Models | Wentao Zhang Team | 2506.12776 | null | |
| 2025-06-15 | NAP-Tuning: Neural Augmented Prompt Tuning for Adversarially Robust Vision-Language Models | Jitao Sang Team | 2506.12706 | null | |
| 2025-06-15 | Evaluating Cell Type Inference in Vision Language Models Under Varying Visual Context | Sandeep Singhal Team | 2506.12683 | null | |
| 2025-06-14 | Not All Tokens and Heads Are Equally Important: Dual-Level Attention Intervention for Hallucination Mitigation | Yuexian Zou Team | 2506.12609 | null | |
| 2025-06-13 | Affogato: Learning Open-Vocabulary Affordance Grounding with Automated Data Generation at Scale | Minsu Cho Team | 2506.12009 | null | |
| 2025-06-13 | How Visual Representations Map to Language Feature Space in Multimodal LLMs | Neel Nanda Team | 2506.11976 | null | |
| 2025-06-13 | Rethinking Multilingual Vision-Language Translation: Dataset, Evaluation, and Adaptation | Kaifu Zhang Team | 2506.11820 | null | |
| 2025-06-13 | MTabVQA: Evaluating Multi-Tabular Reasoning of Language Models in Visual Space | Jan Strich Team | 2506.11684 | null | |
| 2025-06-13 | VLM@school – Evaluation of AI image understanding on German middle school knowledge | Vincent Tischler Team | 2506.11604 | null | |
| 2025-06-16 | EasyARC: Evaluating Vision Language Models on True Visual Reasoning | Aylin Akkus Team | 2506.11595 | null | |
| 2025-06-13 | Foundation Models in Autonomous Driving: A Survey on Scenario Generation and Scenario Analysis | Johannes Betz Team | 2506.11526 | null | |
| 2025-06-13 | Manager: Aggregating Insights from Unimodal Experts in Two-Tower VLMs and MLLMs | Min-Yen Kan Team | 2506.11515 | null | |
| 2025-06-13 | Taming Stable Diffusion for Computed Tomography Blind Super-Resolution | Lichao Mou Team | 2506.11496 | null | |
| 2025-06-13 | On the Natural Robustness of Vision-Language Models Against Visual Perception Attacks in Autonomous Driving | Mert D. Pesé Team | 2506.11472 | null | |
| 2025-06-12 | Poutine: Vision-Language-Trajectory Pre-Training and Reinforcement Learning Post-Training Enable Robust End-to-End Autonomous Driving | Liam Paull Team | 2506.11234 | null | |
| 2025-06-12 | AIR: Zero-shot Generative Model Adaptation with Iterative Refinement | Ngai-Man Cheung Team | 2506.10895 | link | |
| 2025-06-13 | RationalVLA: A Rational Vision-Language-Action Model with Dual System | Haoang Li Team | 2506.10826 | null | |
| 2025-06-12 | Grounded Vision-Language Navigation for UAVs with Open-Vocabulary Goal Understanding | Mir Feroskhan Team | 2506.10756 | null | |
| 2025-06-13 | IQE-CLIP: Instance-aware Query Embedding for Zero-/Few-shot Anomaly Detection in Medical Domain | Yefeng Zheng Team | 2506.10730 | link | |
| 2025-06-12 | GigaVideo-1: Advancing Video Generation via Automatic Feedback with 4 GPU-Hours Fine-Tuning | Guan Huang Team | 2506.10639 | null | |
| 2025-06-12 | Text to Image for Multi-Label Image Recognition with Joint Prompt-Adapter Learning | Yong Liu Team | 2506.10575 | null | |
| 2025-06-12 | LLMs Are Not Yet Ready for Deepfake Image Detection | Kristen Moore Team | 2506.10474 | null | |
| 2025-06-12 | UrbanSense: A Framework for Quantitative Analysis of Urban Streetscapes leveraging Vision Large Language Models | Shuai Lu Team | 2506.10342 | null | |
| 2025-06-12 | Using Vision Language Models to Detect Students’ Academic Emotion through Facial Expressions | Gaowei Chen Team | 2506.10334 | null | |
| 2025-06-12 | HalLoc: Token-level Localization of Hallucinations for Vision Language Models | Gunhee Kim Team | 2506.10286 | null | |
| 2025-06-11 | Q2E: Query-to-Event Decomposition for Zero-Shot Multilingual Text-to-Video Retrieval | Francis Ferraro Team | 2506.10202 | null | |
| 2025-06-11 | Improving Personalized Search with Regularized Low-Rank Parameter Updates | Bryan Russell Team | 2506.10182 | null | |
| 2025-06-11 | A Navigation Framework Utilizing Vision-Language Models | Kaiyu Tang Team | 2506.10172 | null | |
| 2025-06-11 | One Patient, Many Contexts: Scaling Medical AI Through Contextual Intelligence | Marinka Zitnik Team | 2506.10157 | null | |
| 2025-06-11 | ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs | Lijuan Wang Team | 2506.10128 | null | |
| 2025-06-11 | Test-Time Adaptation for Generalizable Task Progress Estimation | Alessandra Russo Team | 2506.10085 | null | |
| 2025-06-11 | Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing | Tieniu Tan Team | 2506.09965 | link | |
| 2025-06-11 | From Intention to Execution: Probing the Generalization Boundaries of Vision-Language-Action Models | Chen Feng Team | 2506.09930 | null | |
| 2025-06-11 | 3D-Aware Vision-Language Models Fine-Tuning with Geometric Distillation | Hyunjung Shim Team | 2506.09883 | link | |
| 2025-06-11 | Adding simple structure at inference improves Vision-Language Compositionality | Gorka Azkune Team | 2506.09691 | link | |
| 2025-06-11 | FedVLMBench: Benchmarking Federated Fine-Tuning of Vision-Language Models | Liangqiong Qu Team | 2506.09638 | null | |
| 2025-06-11 | Revisit What You See: Disclose Language Prior in Vision Tokens for Efficient Guided Decoding of LVLMs | Jaehyung Kim Team | 2506.09522 | link | |
| 2025-06-11 | Provoking Multi-modal Few-Shot LVLM via Exploration-Exploitation In-Context Learning | Jia Li Team | 2506.09473 | null | |
| 2025-06-11 | TOGA: Temporally Grounded Open-Ended Video QA with Weak Supervision | Susmit Jha Team | 2506.09445 | null | |
| 2025-06-11 | DAVSP: Safety Alignment for Large Vision-Language Models via Deep Aligned Visual Safety Prompt | Ge Li Team | 2506.09353 | null | |
| 2025-06-10 | UAD: Unsupervised Affordance Distillation for Generalization in Robotic Manipulation | Li Fei-Fei Team | 2506.09284 | null | |
| 2025-06-10 | MultiNet: An Open-Source Software Toolkit & Benchmark Suite for the Evaluation and Adaptation of Multimodal Action Models | Harshvardhan Sikka Team | 2506.09172 | null | |
| 2025-06-10 | VIKI-R: Coordinating Embodied Multi-Agent Cooperation via Reinforcement Learning | Zhenfei Yin Team | 2506.09049 | null | |
| 2025-06-11 | Same Task, Different Circuits: Disentangling Modality-Specific Mechanisms in VLMs | Yonatan Belinkov Team | 2506.09047 | null | |
| 2025-06-10 | Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better | Jiaqi Wang Team | 2506.09040 | null | |
| 2025-06-10 | Efficient Medical Vision-Language Alignment Through Adapting Masked Vision Models | Liansheng Wang Team | 2506.08990 | null | |
| 2025-06-10 | Socratic-MCTS: Test-Time Visual Reasoning by Asking the Right Questions | Yejin Choi Team | 2506.08927 | null | |
| 2025-06-12 | Video-CoT: A Comprehensive Dataset for Spatiotemporal Understanding of Videos Based on Chain-of-Thought | Shanghang Zhang Team | 2506.08817 | null | |
| 2025-06-10 | Multimodal Representation Alignment for Cross-modal Information Retrieval | Luis A. Leiva Team | 2506.08774 | null | |
| 2025-06-10 | PhyBlock: A Progressive Benchmark for Physical Understanding and Planning via 3D Block Assembly | Xiaodan Liang Team | 2506.08708 | null | |
| 2025-06-10 | VReST: Enhancing Reasoning in Large Vision-Language Models through Tree Search and Self-Reward Mechanism | Weijiang Yu Team | 2506.08691 | null | |
| 2025-06-10 | ATAS: Any-to-Any Self-Distillation for Enhanced Open-Vocabulary Dense Prediction | Taesup Kim Team | 2506.08678 | null | |
| 2025-06-10 | Convergence of Spectral Principal Paths: How Deep Networks Distill Linear Representations from Noisy Inputs | Ang Li Team | 2506.08543 | null | |
| 2025-06-10 | Better Reasoning with Less Data: Enhancing VLMs Through Unified Modality Scoring | Jiaheng Wei Team | 2506.08429 | null | |
| 2025-06-11 | SafeCoT: Improving VLM Safety with Minimal Reasoning | Chaochao Lu Team | 2506.08399 | null | |
| 2025-06-10 | SECOND: Mitigating Perceptual Hallucination in Vision-Language Models via Selective and Contrastive Decoding | Jaeyoung Do Team | 2506.08391 | null | |
| 2025-06-09 | A Good CREPE needs more than just Sugar: Investigating Biases in Compositional Vision-Language Benchmarks | Matthias Bethge Team | 2506.08227 | null | |
| 2025-06-11 | GIQ: Benchmarking 3D Geometric Reasoning of Vision Foundation Models with Simulated and Real Polyhedra | Guha Balakrishnan Team | 2506.08194 | null | |
| 2025-06-09 | Open World Scene Graph Generation using Vision Language Models | Anuj Karpatne Team | 2506.08189 | null | |
| 2025-06-09 | CuRe: Cultural Gaps in the Long Tail of Text-to-Image Systems | Ramya Korlakai Vinayak Team | 2506.08071 | null | |
| 2025-06-10 | Vision Transformers Don’t Need Trained Registers | Yossi Gandelsman Team | 2506.08010 | null | |
| 2025-06-09 | Hidden in plain sight: VLMs overlook their visual representations | Trevor Darrell Team | 2506.08008 | null | |
| 2025-06-09 | BridgeVLA: Input-Output Alignment for Efficient 3D Manipulation Learning with Vision-Language Models | Tieniu Tan Team | 2506.07961 | null | |
| 2025-06-09 | Decoupling the Image Perception and Multimodal Reasoning for Reasoning Segmentation with Digital Twin Representations | Yiqing Shen Team | 2506.07943 | null | |
| 2025-06-09 | Mimicking or Reasoning: Rethinking Multi-Modal In-Context Learning in Vision-Language Models | Zsolt Kira Team | 2506.07936 | null | |
| 2025-06-09 | SAM2Auto: Auto Annotation Using FLASH | Q. M. Jonathan Wu Team | 2506.07850 | null | |
| 2025-06-09 | Image Reconstruction as a Tool for Feature Analysis | Andrey Kuznetsov Team | 2506.07803 | null | |
| 2025-06-09 | Re-ranking Reasoning Context with Tree Search Makes Large Vision-Language Models Stronger | Shiming Xiang Team | 2506.07785 | null | |
| 2025-06-09 | Language-Vision Planner and Executor for Text-to-Visual Reasoning | Ling Liu Team | 2506.07778 | null | |
| 2025-06-10 | ArchiLense: A Framework for Quantitative Analysis of Architectural Styles Based on Vision Large Language Models | Shuai Lu Team | 2506.07739 | null | |
| 2025-06-09 | OpenSplat3D: Open-Vocabulary 3D Instance Segmentation using Gaussian Splatting | Bastian Leibe Team | 2506.07697 | null | |
| 2025-06-09 | Unblocking Fine-Grained Evaluation of Detailed Captions: An Explaining AutoRater and Critic-and-Revise Pipeline | Idan Szpektor Team | 2506.07631 | null | |
| 2025-06-09 | Event-Priori-Based Vision-Language Model for Efficient Visual Understanding | Michele Magno Team | 2506.07627 | null | |
| 2025-06-10 | SAFEFLOW: A Principled Protocol for Trustworthy and Transactional Autonomous Agent Systems | Zhengzhong Tu Team | 2506.07564 | null | |
| 2025-06-10 | GTR-CoT: Graph Traversal as Visual Chain of Thought for Molecular Structure Recognition | Conghui He Team | 2506.07553 | null | |
| 2025-06-09 | Taking Flight with Dialogue: Enabling Natural Language Control for PX4-based Drone Agent | Ting Yang Ling Team | 2506.07509 | null | |
| 2025-06-09 | Genesis: Multimodal Driving Scene Generation with Spatio-Temporal and Cross-Modal Consistency | Xinggang Wang Team | 2506.07497 | null | |
| 2025-06-09 | CoCoA-Mix: Confusion-and-Confidence-Aware Mixture Model for Context Optimization | Hyun Myung Team | 2506.07484 | null | |
| 2025-06-09 | LiteVLM: A Low-Latency Vision-Language Model Inference Pipeline for Resource-Constrained Environments | Josh Park Team | 2506.07416 | null | |
| 2025-06-09 | MrM: Black-Box Membership Inference Attacks against Multimodal RAG Systems | Tao Qi Team | 2506.07399 | null | |
| 2025-06-06 | CoMemo: LVLMs Need Image Context with Image Memory | Jifeng Dai Team | 2506.06279 | null | |
| 2025-06-06 | Movie Facts and Fibs (MF$^2$): A Benchmark for Long Movie Understanding | André F. T. Martins Team | 2506.06275 | null | |
| 2025-06-06 | Challenging Vision-Language Models with Surgical Data: A New Dataset and Broad Benchmarking Study | Lena Maier-Hein Team | 2506.06232 | null | |
| 2025-06-06 | GenIR: Generative Visual Feedback for Mental Image Retrieval | James Davis Team | 2506.06220 | null | |
| 2025-06-06 | STSBench: A Spatio-temporal Scenario Benchmark for Multi-modal Large Language Models in Autonomous Driving | Horst Possegger Team | 2506.06218 | null | |
| 2025-06-06 | WisWheat: A Three-Tiered Vision-Language Dataset for Wheat Management | Zijian Wang Team | 2506.06084 | null | |
| 2025-06-06 | Full Conformal Adaptation of Medical Vision-Language Models | Jose Dolz Team | 2506.06076 | null | |
| 2025-06-06 | BEAST: Efficient Tokenization of B-Splines Encoded Action Sequences for Imitation Learning | Rudolf Lioutikov Team | 2506.06072 | null | |
| 2025-06-06 | MCA-Bench: A Multimodal Benchmark for Evaluating CAPTCHA Robustness Against VLM-based Attacks | Yiren Song Team | 2506.05982 | null | |
| 2025-06-06 | HMVLM: Multistage Reasoning-Enhanced Vision-Language Model for Long-Tailed Driving Scenarios | Weihao Gu Team | 2506.05883 | null | |
| 2025-06-06 | Do Large Vision-Language Models Distinguish between the Actual and Apparent Features of Illusions? | Hitomi Yanaka Team | 2506.05765 | null | |
| 2025-06-06 | MoralCLIP: Contrastive Alignment of Vision-and-Language Representations with Moral Foundations Theory | João Magalhães Team | 2506.05696 | null | |
| 2025-06-06 | DriveAction: A Benchmark for Exploring Human-like Driving Decisions in VLA Models | Xianpeng Lang Team | 2506.05667 | null | |
| 2025-06-05 | MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning | Furong Huang Team | 2506.05523 | null | |
| 2025-06-05 | Degradation-Aware Image Enhancement via Vision-Language Classification | Zibo Meng Team | 2506.05450 | null | |
| 2025-06-09 | Coordinated Robustness Evaluation Framework for Vision-Language Models | Soumyendu Sarkar Team | 2506.05429 | null | |
| 2025-06-06 | Does Your 3D Encoder Really Work? When Pretrain-SFT from 2D VLMs Meets 3D VLMs | Xiaodan Liang Team | 2506.05318 | null | |
| 2025-06-05 | MonkeyOCR: Document Parsing with a Structure-Recognition-Relation Triplet Paradigm | Xiang Bai Team | 2506.05218 | null | |
| 2025-06-05 | Quantifying Cross-Modality Memorization in Vision-Language Models | Chiyuan Zhang Team | 2506.05198 | null | |
| 2025-06-05 | CIVET: Systematic Evaluation of Understanding in VLMs | Giuseppe Riccardi Team | 2506.05146 | null | |
| 2025-06-05 | PixCell: A generative foundation model for digital histopathology images | Dimitris Samaras Team | 2506.05127 | null | |
| 2025-06-05 | A Survey on Vietnamese Document Analysis and Recognition: Challenges and Future Directions | Dung Nguyen Team | 2506.05061 | null | |
| 2025-06-05 | Hierarchical Language Models for Semantic Navigation and Manipulation in an Aerial-Ground Robotic System | Moju Zhao Team | 2506.05020 | null | |
| 2025-06-05 | ConECT Dataset: Overcoming Data Scarcity in Context-Aware E-Commerce MT | Mikołaj Koszowski Team | 2506.04929 | null | |
| 2025-06-05 | SRD: Reinforcement-Learned Semantic Perturbation for Backdoor Defense in VLMs | Dacheng Tao Team | 2506.04743 | null | |
| 2025-06-05 | Robust Few-Shot Vision-Language Model Adaptation | Shu Kong Team | 2506.04713 | null | |
| 2025-06-05 | HoliSafe: Holistic Safety Benchmarking and Modeling with Safety Meta Token for Vision-Language Model | Sung Ju Hwang Team | 2506.04704 | null | |
| 2025-06-05 | SmartAvatar: Text- and Image-Guided Human Avatar Generation with VLM AI Agents | Yu-Wing Tai Team | 2506.04606 | null | |
| 2025-06-05 | MuSciClaims: Multimodal Scientific Claim Verification | Niranjan Balasubramanian Team | 2506.04585 | null | |
| 2025-06-05 | Handle-based Mesh Deformation Guided By Vision Language Model | Aniket Bera Team | 2506.04562 | null | |
| 2025-06-04 | RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics | Shanghang Zhang Team | 2506.04308 | null | |
| 2025-06-04 | Image Editing As Programs with Diffusion Models | Xinchao Wang Team | 2506.04158 | null | |
| 2025-06-04 | Recent Advances in Medical Image Classification | Ngoc Quoc Ly Team | 2506.04129 | null | |
| 2025-06-04 | LaF-GRPO: In-Situ Navigation Instruction Generation for the Visually Impaired via GRPO with LLM-as-Follower Reward | Jing Li Team | 2506.04070 | null | |
| 2025-06-04 | Mitigating Hallucinations in Large Vision-Language Models via Entity-Centric Multimodal Preference Optimization | Min Zhang Team | 2506.04039 | null | |
| 2025-06-04 | Vocabulary-free few-shot learning for Vision-Language Models | Christophe De Vleeschouwer Team | 2506.04005 | null | |
| 2025-06-04 | DiffCAP: Diffusion-based Cumulative Adversarial Purification for Vision Language Models | Anders Holst Team | 2506.03933 | null | |
| 2025-06-04 | Zero-Shot Temporal Interaction Localization for Egocentric Videos | Hesheng Wang Team | 2506.03662 | null | |
| 2025-06-04 | Spatial Understanding from Videos: Structured Prompts Meet Simulation Data | Liqiang Nie Team | 2506.03642 | null | |
| 2025-06-04 | VLMs Can Aggregate Scattered Training Patches | Chaochao Lu Team | 2506.03614 | null | |
| 2025-06-04 | BiMa: Towards Biases Mitigation for Text-Video Retrieval via Scene Element Guidance | Ngan Le Team | 2506.03589 | null | |
| 2025-06-04 | MiMo-VL Technical Report | Bingquan Xia Team | 2506.03569 | null | |
| 2025-06-04 | Target Semantics Clustering via Text Representations for Robust Universal Domain Adaptation | Yixin Zhang Team | 2506.03521 | null | |
| 2025-06-04 | DenseDPO: Fine-Grained Temporal Preference Optimization for Video Diffusion Models | Aliaksandr Siarohin Team | 2506.03517 | null | |
| 2025-06-04 | POLARIS: A High-contrast Polarimetric Imaging Benchmark Dataset for Exoplanetary Disk Representation Learning | Weixin Yao Team | 2506.03511 | link | |
| 2025-06-03 | Toward Reliable VLM: A Fine-Grained Benchmark and Framework for Exposure, Bias, and Inference in Korean Street Views | Hansaem Kim Team | 2506.03371 | null | |
| 2025-06-03 | Robustness in Both Domains: CLIP Needs a Robust Text Encoder | Volkan Cevher Team | 2506.03355 | null | |
| 2025-06-03 | Grounded Vision-Language Interpreter for Integrated Task and Motion Planning | Atsushi Hashimoto Team | 2506.03270 | null | |
| 2025-06-03 | OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models | Li Yi Team | 2506.03135 | null | |
| 2025-06-03 | EgoVLM: Policy Optimization for Egocentric Video Understanding | Linshen Liu Team | 2506.03097 | null | |
| 2025-06-03 | DPO Learning with LLMs-Judge Signal for Computer Use Agents | Phillip Howard Team | 2506.03095 | null | |
| 2025-06-03 | From Flat to Hierarchical: Extracting Sparse Representations with Matching Pursuit | Demba Ba Team | 2506.03093 | null | |
| 2025-06-03 | Text-guided Generation of Efficient Personalized Inspection Plans | Aniket Bera Team | 2506.02917 | null | |
| 2025-06-04 | FlySearch: Exploring how vision-language models explore | Maciej Wołczyk Team | 2506.02896 | null | |
| 2025-06-03 | Surfer-H Meets Holo1: Cost-Efficient Web Agent Powered by Open Weights | Tony Wu Team | 2506.02865 | null | |
| 2025-06-03 | SemVink: Advancing VLMs’ Semantic Understanding of Optical Illusions via Visual Global Thinking | Yiwei Wang Team | 2506.02803 | null | |
| 2025-06-04 | Open-PMC-18M: A High-Fidelity Large Scale Medical Dataset for Multimodal Representation Learning | Arash Afkanpour Team | 2506.02738 | null | |
| 2025-06-03 | Iterative Self-Improvement of Vision Language Models for Image Scoring and Self-Explanation | Toshihiko Yamasaki Team | 2506.02708 | null | |
| 2025-06-03 | Small Aid, Big Leap: Efficient Test-Time Adaptation for Vision-Language Models with AdaptNet | Zhi Wang Team | 2506.02671 | null | |
| 2025-06-03 | Hierarchical Question-Answering for Driving Scene Understanding Using Vision-Language Models | Dong Seog Han Team | 2506.02615 | null | |
| 2025-06-03 | Kernel-based Unsupervised Embedding Alignment for Enhanced Visual Representation in Vision-language Models | Farzan Farnia Team | 2506.02557 | null | |
| 2025-06-03 | Sign Language: Towards Sign Understanding for Robot Autonomy | David Hsu Team | 2506.02556 | null | |
| 2025-06-03 | SurgVLM: A Large Vision-Language Model and Systematic Evaluation Benchmark for Surgical Intelligence | Yueming Jin Team | 2506.02555 | null | |
| 2025-06-03 | Rethinking Post-Unlearning Behavior of Large Vision-Language Models | Kyomin Jung Team | 2506.02541 | null | |
| 2025-06-04 | MemoryOut: Learning Principal Features via Multimodal Sparse Filtering Network for Semi-supervised Video Anomaly Detection | Qingyao Wu Team | 2506.02535 | null | |
| 2025-06-03 | VS-Bench: Evaluating VLMs for Strategic Reasoning and Decision-Making in Multi-Agent Environments | Yu Wang Team | 2506.02387 | null | |
| 2025-06-03 | Auto-Labeling Data for Object Detection | Jason J. Corso Team | 2506.02359 | null | |
| 2025-06-03 | RATE-Nav: Region-Aware Termination Enhancement for Zero-shot Object Navigation with Vision-Language Models | Jianzong Wang Team | 2506.02354 | null | |
| 2025-05-30 | ReasonGen-R1: CoT for Autoregressive Image generation models through SFT and RL | Lili Qiu Team | 2505.24875 | null | |
| 2025-05-30 | ProxyThinker: Test-Time Guidance through Small Visual Reasoners | Vicente Ordonez Team | 2505.24872 | null | |
| 2025-05-30 | GenSpace: Benchmarking Spatially-Aware Image Generation | Zhou Zhao Team | 2505.24870 | null | |
| 2025-05-30 | Time Blindness: Why Video-Language Models Can’t See What Humans Can? | Mohamed Elhoseiny Team | 2505.24867 | null | |
| 2025-05-30 | Conformal Prediction for Zero-Shot Models | Jose Dolz Team | 2505.24693 | null | |
| 2025-05-30 | BIMA: Bijective Maximum Likelihood Learning Approach to Hallucination Prediction and Mitigation in Large Vision-Language Models | Khoa Luu Team | 2505.24649 | null | |
| 2025-05-30 | SARD: A Large-Scale Synthetic Arabic OCR Dataset for Book-Style Text Recognition | Wadii Boulila Team | 2505.24600 | null | |
| 2025-05-30 | AMIA: Automatic Masking and Joint Intention Analysis Makes LVLMs Robust Jailbreak Defenders | Liang Ding Team | 2505.24519 | null | |
| 2025-05-30 | CaMMT: Benchmarking Culturally Aware Multimodal Machine Translation | Thamar Solorio Team | 2505.24456 | null | |
| 2025-05-30 | Advancing Compositional Awareness in CLIP with Efficient Fine-Tuning | Matthias Hein Team | 2505.24424 | null | |
| 2025-05-30 | MMAFFBen: A Multilingual and Multimodal Affective Analysis Benchmark for Evaluating LLMs and VLMs | Sophia Ananiadou Team | 2505.24423 | null | |
| 2025-05-30 | Grid-LOGAT: Grid Based Local and Global Area Transcription for Video Question Answering | Fadoua Ghourabi Team | 2505.24371 | null | |
| 2025-05-30 | KEVER^2: Knowledge-Enhanced Visual Emotion Reasoning and Retrieval | Yong Li Team | 2505.24342 | null | |
| 2025-05-30 | ROAD: Responsibility-Oriented Reward Design for Reinforcement Learning in Autonomous Driving | Songan Zhang Team | 2505.24317 | null | |
| 2025-05-30 | Benchmarking Foundation Models for Zero-Shot Biometric Tasks | Arun Ross Team | 2505.24214 | null | |
| 2025-05-30 | Bootstrapping LLM Robustness for VLM Safety via Reducing the Pretraining Modality Gap | Baharan Mirzasoleiman Team | 2505.24208 | null | |
| 2025-05-30 | DrVD-Bench: Do Vision-Language Models Reason Like Human Doctors in Medical Image Diagnosis? | Xuegong Zhang Team | 2505.24173 | null | |
| 2025-05-30 | CSVQA: A Chinese Multimodal Benchmark for Evaluating STEM Reasoning Capabilities of VLMs | Xuchen Song Team | 2505.24120 | null | |
| 2025-05-29 | mRAG: Elucidating the Design Space of Multi-modal Retrieval-Augmented Generation | Zhengzhong Tu Team | 2505.24073 | null | |
| 2025-05-29 | Multi-RAG: A Multimodal Retrieval-Augmented Generation System for Adaptive Video Understanding | Tinoosh Mohsenin Team | 2505.23990 | null | |
| 2025-05-29 | ZeroGUI: Automating Online GUI Learning at Zero Human Cost | Jifeng Dai Team | 2505.23762 | link | |
| 2025-05-29 | Puzzled by Puzzles: When Vision-Language Models Can’t Take a Hint | David M. Chan Team | 2505.23759 | link | |
| 2025-05-29 | To Trust Or Not To Trust Your Vision-Language Model’s Prediction | Olga Fink Team | 2505.23745 | link | |
| 2025-05-29 | LayerPeeler: Autoregressive Peeling for Layer-wise Image Vectorization | Jing Liao Team | 2505.23740 | null | |
| 2025-05-29 | Knowledge Insulating Vision-Language-Action Models: Train Fast, Run Fast, Generalize Better | Sergey Levine Team | 2505.23705 | null | |
| 2025-05-29 | Grounded Reinforcement Learning for Visual Reasoning | Katerina Fragkiadaki Team | 2505.23678 | null | |
| 2025-05-29 | Uni-MuMER: Unified Multi-Task Fine-Tuning of Vision-Language Model for Handwritten Mathematical Expression Recognition | Liangcai Gao Team | 2505.23566 | null | |
| 2025-05-30 | Qwen Look Again: Guiding Vision-Language Reasoning Models to Re-attention Visual Information | Weiping Li Team | 2505.23558 | link | |
| 2025-05-29 | TRAP: Targeted Redirecting of Agentic Preferences | Gagandeep Singh Team | 2505.23518 | null | |
| 2025-05-29 | VCapsBench: A Large-scale Fine-grained Benchmark for Video Caption Quality Evaluation | Xu-Cheng Yin Team | 2505.23484 | link | |
| 2025-05-29 | Beam-Guided Knowledge Replay for Knowledge-Rich Image Captioning using Vision-Language Model | Muzammil Behzad Team | 2505.23358 | null | |
| 2025-05-29 | LADA: Scalable Label-Specific CLIP Adapter for Continual Learning | Min-Ling Zhang Team | 2505.23271 | link | |
| 2025-05-29 | VLM-RRT: Vision Language Model Guided RRT Search for Autonomous UAV Navigation | Panayiotis Kolios Team | 2505.23267 | null | |
| 2025-05-29 | Disrupting Vision-Language Model-Driven Navigation Services via Adversarial Object Fusion | Tao Xiang Team | 2505.23266 | null | |
| 2025-05-29 | ChartMind: A Comprehensive Benchmark for Complex Real-world Multimodal Chart Question Answering | Lei Wang Team | 2505.23242 | null | |
| 2025-05-29 | PhotoArtAgent: Intelligent Photo Retouching with Language Model-Based Artist Agents | Jinjin Gu Team | 2505.23130 | null | |
| 2025-05-29 | Are Unified Vision-Language Models Necessary: Generalization Across Understanding and Generation | Yu Cheng Team | 2505.23043 | link | |
| 2025-05-29 | An Empirical Study of Federated Prompt Learning for Vision Language Model | Mang Ye Team | 2505.23024 | null | |
| 2025-05-29 | SeG-SR: Integrating Semantic Knowledge into Remote Sensing Image Super-Resolution via Vision-Language Model | Zhenwei Shi Team | 2505.23010 | null | |
| 2025-05-29 | QLIP: A Dynamic Quadtree Vision Prior Enhances MLLM Performance Without Retraining | Muhao Chen Team | 2505.23004 | link | |
| 2025-05-28 | Zero-Shot Vision Encoder Grafting via LLM Surrogates | Tom Goldstein Team | 2505.22664 | link | |
| 2025-05-28 | Training Free Stylized Abstraction | Vishal M. Patel Team | 2505.22663 | null | |
| 2025-05-28 | VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models | Dong Yu Team | 2505.22654 | null | |
| 2025-05-28 | Sherlock: Self-Correcting Reasoning in Vision-Language Models | Ruqi Zhang Team | 2505.22651 | null | |
| 2025-05-28 | Hypothesis Testing in Imaging Inverse Problems | Marcelo Pereyra Team | 2505.22481 | null | |
| 2025-05-28 | Zero-Shot 3D Visual Grounding from Vision-Language Models | Junwei Liang Team | 2505.22429 | null | |
| 2025-05-28 | IKIWISI: An Interactive Visual Pattern Generator for Evaluating the Reliability of Vision-Language Models Without Ground Truth | Syed Masum Billah Team | 2505.22305 | null | |
| 2025-05-28 | Investigating Mechanisms for In-Context Vision Language Binding | Vineet Gandhi Team | 2505.22200 | null | |
| 2025-05-29 | Improving Brain-to-Image Reconstruction via Fine-Grained Text Bridging | Piji Li Team | 2505.22150 | null | |
| 2025-05-28 | 3D Question Answering via only 2D Vision-Language Models | Qianru Sun Team | 2505.22143 | null | |
| 2025-05-28 | Reinforced Reasoning for Embodied Planning | Bo Jin Team | 2505.22050 | null | |
| 2025-05-28 | Balanced Token Pruning: Accelerating Vision Language Models Beyond Local Optimization | Xinlei Chen Team | 2505.22038 | null | |
| 2025-05-28 | Pearl: A Multimodal Culturally-Aware Arabic Instruction Dataset | Muhammad Abdul-Mageed Team | 2505.21979 | null | |
| 2025-05-29 | DORAEMON: Decentralized Ontology-aware Reliable Agent with Enhanced Memory Oriented Navigation | Xin Tan Team | 2505.21969 | null | |
| 2025-05-28 | Seeing the Threat: Vulnerabilities in Vision-Language Models to Adversarial Attack | Usman Naseem Team | 2505.21967 | null | |
| 2025-05-28 | Towards Comprehensive Scene Understanding: Integrating First and Third-Person Views for LVLMs | Byonghyo Shim Team | 2505.21955 | null | |
| 2025-05-28 | Vision-Language-Action Model with Open-World Embodied Reasoning from Pretrained Knowledge | Yi Xu Team | 2505.21906 | null | |
| 2025-05-28 | Test-Time Adaptation of Vision-Language Models for Open-Vocabulary Semantic Segmentation | Christian Desrosiers Team | 2505.21844 | null | |
| 2025-05-27 | MMTBENCH: A Unified Benchmark for Complex Multimodal Table Reasoning | Vivek Gupta Team | 2505.21771 | null | |
| 2025-05-27 | MedBridge: Bridging Foundation Vision-Language Models to Medical Image Diagnosis | Christian Wachinger Team | 2505.21698 | null | |
| 2025-05-27 | ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in Vision-Language Models | Yueting Zhuang Team | 2505.21500 | null | |
| 2025-05-27 | AdInject: Real-World Black-Box Attacks on Web Agents via Advertising Delivery | Qing Wang Team | 2505.21499 | null | |
| 2025-05-27 | Mitigating Hallucination in Large Vision-Language Models via Adaptive Attention Calibration | Ziwei Zhu Team | 2505.21472 | null | |
| 2025-05-27 | ID-Align: RoPE-Conscious Position Remapping for Dynamic High-Resolution Adaptation in Vision-Language Models | Wentao Zhang Team | 2505.21465 | null | |
| 2025-05-27 | LazyVLM: Neuro-Symbolic Approach to Video Analytics | M. Tamer Özsu Team | 2505.21459 | null | |
| 2025-05-27 | DeCAF: Decentralized Consensus-And-Factorization for Low-Rank Adaptation of Foundation Models | Soumik Sarkar Team | 2505.21382 | null | |
| 2025-05-27 | XBOUND: Exploring the Capability Boundaries of Device-Control Agents through Trajectory Tree Exploration | Min Zhang Team | 2505.21279 | null | |
| 2025-05-27 | Interpreting Social Bias in LVLMs via Information Flow Analysis and Multi-Round Dialogue Evaluation | Yutao Yue Team | 2505.21106 | null | |
| 2025-05-27 | DisasterM3: A Remote Sensing Vision-Language Dataset for Disaster Damage Assessment and Response | Naoto Yokoya Team | 2505.21089 | null | |
| 2025-05-27 | LPOI: Listwise Preference Optimization for Vision Language Models | Gunhee Kim Team | 2505.21061 | null | |
| 2025-05-27 | RefAV: Towards Planning-Centric Scenario Mining | Neehar Peri Team | 2505.20981 | null | |
| 2025-05-27 | On VLMs for Diverse Tasks in Multimodal Meme Classification | Jasabanta Patro Team | 2505.20937 | null | |
| 2025-05-27 | A Stereotype Content Analysis on Color-related Social Bias in Large Vision Language Models | Bugeun Kim Team | 2505.20901 | null | |
| 2025-05-27 | AVCD: Mitigating Hallucinations in Audio-Visual Large Language Models through Contrastive Decoding | Joon Son Chung Team | 2505.20862 | null | |
| 2025-05-27 | Rendering-Aware Reinforcement Learning for Vector Graphics Generation | Marco Pedersoli Team | 2505.20793 | null | |
| 2025-05-27 | FM-Planner: Foundation Model Guided Path Planning for Autonomous Drone Navigation | Mir Feroskhan Team | 2505.20783 | null | |
| 2025-05-27 | Jigsaw-Puzzles: From Seeing to Understanding to Reasoning in Vision-Language Models | Yao Yang Team | 2505.20728 | null | |
| 2025-05-27 | ManiTaskGen: A Comprehensive Task Generator for Benchmarking and Improving Vision-Language Agents on Embodied Decision-Making | Hao Su Team | 2505.20726 | null | |
| 2025-05-27 | Automating eHMI Action Design with LLMs for Automated Vehicle Communication | Takeo Igarashi Team | 2505.20711 | null | |
| 2025-05-27 | GIFARC: Synthetic Dataset for Leveraging Human-Intuitive Analogies to Elevate AI Reasoning | Sundong Kim Team | 2505.20672 | null | |
| 2025-05-26 | Seeing is Believing, but How Much? A Comprehensive Analysis of Verbalized Calibration in Vision-Language Models | Naoto Yokoya Team | 2505.20236 | null | |
| 2025-05-26 | Agentic 3D Scene Generation with Spatially Contextualized VLMs | Chi-Keung Tang Team | 2505.20129 | null | |
| 2025-05-26 | MEBench: A Novel Benchmark for Understanding Mutual Exclusivity Bias in Vision-Language Models | James M. Rehg Team | 2505.20122 | null | |
| 2025-05-27 | EmoNet-Face: An Expert-Annotated Benchmark for Synthetic Emotion Recognition | Sören Auer Team | 2505.20033 | null | |
| 2025-05-26 | ViTaPEs: Visuotactile Position Encodings for Cross-Modal Alignment in Multimodal Transformers | Elmar Rückert Team | 2505.20032 | null | |
| 2025-05-26 | Decomposing Complex Visual Comprehension into Atomic Visual Skills for Vision Language Models | Ernest K. Ryu Team | 2505.20021 | null | |
| 2025-05-26 | Can Visual Encoder Learn to See Arrows? | Hiroaki Ozaki Team | 2505.19944 | null | |
| 2025-05-26 | Attention! Your Vision Language Model Could Be Maliciously Manipulated | Shudong Zhang Team | 2505.19911 | null | |
| 2025-05-26 | Underwater Diffusion Attention Network with Contrastive Language-Image Joint Learning for Underwater Image Enhancement | Muzammil Behzad Team | 2505.19895 | null | |
| 2025-05-26 | One Surrogate to Fool Them All: Universal, Transferable, and Targeted Adversarial Attacks with CLIP | Kehuan Zhang Team | 2505.19840 | null | |
| 2025-05-26 | TeViR: Text-to-Video Reward with Diffusion Models for Efficient Reinforcement Learning | Dongbin Zhao Team | 2505.19769 | null | |
| 2025-05-26 | Modeling Beyond MOS: Quality Assessment Models Must Integrate Context, Reasoning, and Multimodality | Alessandro Bruno Team | 2505.19696 | null | |
| 2025-05-26 | Grounding Language with Vision: A Conditional Mutual Information Calibrated Decoding Strategy for Reducing Hallucinations in LVLMs | Shu-Tao Xia Team | 2505.19678 | null | |
| 2025-05-26 | JailBound: Jailbreaking Internal Safety Boundaries of Vision-Language Models | Yingchun Wang Team | 2505.19610 | null | |
| 2025-05-26 | What You Perceive Is What You Conceive: A Cognition-Inspired Framework for Open Vocabulary Image Segmentation | Rongrong Ji Team | 2505.19569 | null | |
| 2025-05-26 | FlowCut: Rethinking Redundancy via Information Flow for Efficient Vision-Language Models | Ruixuan Li Team | 2505.19536 | null | |
| 2025-05-26 | Locality-Aware Zero-Shot Human-Object Interaction Detection | Minsu Cho Team | 2505.19503 | null | |
| 2025-05-26 | Enhancing Visual Reliance in Text Generation: A Bayesian Perspective on Mitigating Hallucination in Large Vision-Language Models | Guoliang Kang Team | 2505.19498 | null | |
| 2025-05-26 | Unveiling the Compositional Ability Gap in Vision-Language Reasoning Model | Yu Cheng Team | 2505.19406 | null | |
| 2025-05-27 | DiffVLA: Vision-Language Guided Diffusion Planning for Autonomous Driving | Hao Zhao Team | 2505.19381 | null | |
| 2025-05-26 | DiSa: Directional Saliency-Aware Prompt Learning for Generalizable Vision-Language Models | Fatemeh Afghah Team | 2505.19373 | null | |
| 2025-05-23 | VideoGameBench: Can Vision-Language Models complete popular video games? | Ofir Press Team | 2505.18134 | null | |
| 2025-05-23 | One RL to See Them All: Visual Triple Unified Reinforcement Learning | Junjie Yan Team | 2505.18129 | null | |
| 2025-05-23 | CXReasonBench: A Benchmark for Evaluating Structured Diagnostic Reasoning in Chest X-rays | Edward Choi Team | 2505.18087 | null | |
| 2025-05-23 | FDBPL: Faster Distillation-Based Prompt Learning for Region-Aware Vision-Language Models Adaptation | Shibiao Xu Team | 2505.18053 | null | |
| 2025-05-23 | Clip4Retrofit: Enabling Real-Time Image Labeling on Edge Devices via Cross-Architecture CLIP Distillation | Bogdan Sorin Coseriu Team | 2505.18039 | null | |
| 2025-05-23 | Few-Shot Learning from Gigapixel Images via Hierarchical Vision-Language Alignment and Modeling | Mun Yong Yi Team | 2505.17982 | null | |
| 2025-05-23 | VLM Models and Automated Grading of Atopic Dermatitis | Hamed Ghodrati Team | 2505.17835 | null | |
| 2025-05-23 | Seeing It or Not? Interpretable Vision-aware Latent Steering to Mitigate Object Hallucinations | Chao Shen Team | 2505.17812 | null | |
| 2025-05-23 | U2-BENCH: Benchmarking Large Vision-Language Models on Ultrasound Understanding | Hongcheng Guo Team | 2505.17779 | null | |
| 2025-05-23 | SafeMVDrive: Multi-view Safety-Critical Driving Video Synthesis in the Real World Domain | Yu Li Team | 2505.17727 | null | |
| 2025-05-23 | Seek-CAD: A Self-refined Generative Modeling for 3D Parametric CAD Using Local Inference via DeepSeek | Xiangdong Zhou Team | 2505.17702 | null | |
| 2025-05-23 | Towards General Continuous Memory for Vision-Language Models | Biwei Huang Team | 2505.17670 | null | |
| 2025-05-23 | EVADE: Multimodal Benchmark for Evasive Content Detection in E-Commerce Applications | Min Yang Team | 2505.17654 | null | |
| 2025-05-23 | HoloLLM: Multisensory Foundation Model for Language-Grounded Human Sensing and Reasoning | Jianfei Yang Team | 2505.17645 | null | |
| 2025-05-23 | Enhancing Large Vision-Language Models with Layout Modality for Table Question Answering on Japanese Annual Securities Reports | Takahiro Omi Team | 2505.17625 | null | |
| 2025-05-23 | CAS-IQA: Teaching Vision-Language Models for Synthetic Angiography Quality Assessment | Zeng-Guang Hou Team | 2505.17619 | null | |
| 2025-05-23 | Decoupled Visual Interpretation and Linguistic Reasoning for Math Problem Solving | Wangmeng Zuo Team | 2505.17609 | null | |
| 2025-05-23 | A Unified Multi-Scale Attention-Based Network for Automatic 3D Segmentation of Lung Parenchyma & Nodules In Thoracic CT Images | Furqan Shaukat Team | 2505.17602 | null | |
| 2025-05-23 | Multimodal Conversation Structure Understanding | David Bamman Team | 2505.17536 | null | |
| 2025-05-23 | Do You Keep an Eye on What I Ask? Mitigating Multimodal Hallucination via Attention-Guided Ensemble Decoding | Sungzoon Cho Team | 2505.17529 | null | |
| 2025-05-22 | Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models | Mike Zheng Shou Team | 2505.16854 | link | |
| 2025-05-23 | LaViDa: A Large Diffusion Language Model for Multimodal Understanding | Aditya Grover Team | 2505.16839 | link | |
| 2025-05-22 | From EduVisBench to EduVisAgent: A Benchmark and Multi-Agent Framework for Pedagogical Visualization | Huaxiu Yao Team | 2505.16832 | link | |
| 2025-05-22 | Perceptual Quality Assessment for Embodied AI | Guangtao Zhai Team | 2505.16815 | link | |
| 2025-05-22 | SOLVE: Synergy of Language-Vision and End-to-End Networks for Autonomous Driving | Hongsheng Li Team | 2505.16805 | null | |
| 2025-05-22 | REOBench: Benchmarking Robustness of Earth Observation Foundation Models | Tianjin Huang Team | 2505.16793 | link | |
| 2025-05-22 | Single Domain Generalization for Few-Shot Counting via Universal Representation Matching | Xinghao Chen Team | 2505.16778 | link | |
| 2025-05-22 | IFEval-Audio: Benchmarking Instruction-Following Capability in Audio-based Large Language Models | AiTi Aw Team | 2505.16774 | link | |
| 2025-05-22 | Self-Rewarding Large Vision-Language Models for Optimizing Prompts in Text-to-Image Generation | Jianbing Shen Team | 2505.16763 | null | |
| 2025-05-22 | SD-MAD: Sign-Driven Few-shot Multi-Anomaly Detection in Medical Images | Mahsa Baktashmotlagh Team | 2505.16659 | null | |
| 2025-05-22 | Point, Detect, Count: Multi-Task Medical Image Understanding with Instruction-Tuned Vision-Language Models | Pål Halvorsen Team | 2505.16647 | null | |
| 2025-05-22 | MEgoHand: Multimodal Egocentric Hand-Object Interaction Motion Generation | Zongqing Lu Team | 2505.16602 | null | |
| 2025-05-22 | ManipLVM-R1: Reinforcement Learning for Reasoning in Embodied Manipulation with Large Vision-Language Models | Xiuying Chen Team | 2505.16517 | null | |
| 2025-05-22 | Implicit Jailbreak Attacks via Cross-Modal Information Concealment on Vision-Language Models | Yaochu Jin Team | 2505.16446 | null | |
| 2025-05-22 | Circle-RoPE: Cone-like Decoupled Rotary Positional Embedding for Large Vision-Language Models | Kai Han Team | 2505.16416 | link | |
| 2025-05-22 | Mitigating Hallucinations in Vision-Language Models through Image-Guided Head Suppression | Souvik Kundu Team | 2505.16411 | link | |
| 2025-05-22 | VL-SAFE: Vision-Language Guided Safety-Aware Reinforcement Learning with World Models for Autonomous Driving | Samuel Labi Team | 2505.16377 | null | |
| 2025-05-22 | MM-MovieDubber: Towards Multi-Modal Learning for Multi-Modal Movie Dubbing | Xinhan Di Team | 2505.16279 | null | |
| 2025-05-22 | When VLMs Meet Image Classification: Test Sets Renovation via Missing Label Identification | Jiaheng Wei Team | 2505.16149 | null | |
| 2025-05-22 | Steering LVLMs via Sparse Autoencoder for Hallucination Mitigation | Junfeng Fang Team | 2505.16146 | null | |
| 2025-05-21 | InstructSAM: A Training-Free Framework for Instruction-Oriented Remote Sensing Object Recognition | Xue Yang Team | 2505.15818 | null | |
| 2025-05-21 | From Grounding to Manipulation: Case Studies of Foundation Model Integration in Embodied Robotic Systems | Soujanya Poria Team | 2505.15685 | null | |
| 2025-05-21 | FragFake: A Dataset for Fine-Grained Detection of Edited Images with Vision Language Models | Qian Wang Team | 2505.15644 | null | |
| 2025-05-21 | Visual Perturbation and Adaptive Hard Negative Contrastive Learning for Compositional Reasoning in Vision-Language Models | Ya Wang Team | 2505.15576 | link | |
| 2025-05-21 | TinyDrive: Multiscale Visual Question Answering with Selective Token Routing for Autonomous Driving | Abdallah Shami Team | 2505.15564 | null | |
| 2025-05-21 | Clapper: Compact Learning and Video Representation in VLMs | Fuzheng Zhang Team | 2505.15529 | null | |
| 2025-05-21 | Robo2VLM: Visual Question Answering from Large-Scale In-the-Wild Robot Manipulation Datasets | Ken Goldberg Team | 2505.15517 | null | |
| 2025-05-21 | Visual Thoughts: A Unified Perspective of Understanding Multimodal Chain-of-Thought | Libo Qin Team | 2505.15510 | null | |
| 2025-05-21 | Prompt Tuning Vision Language Models with Margin Regularizer for Few-Shot Learning under Distribution Shifts | Soma Biswas Team | 2505.15506 | link | |
| 2025-05-21 | Beyond Linearity: Squeeze-and-Recalibrate Blocks for Few-Shot Whole Slide Image Classification | Irwin King Team | 2505.15504 | null | |
| 2025-05-21 | Seeing Through Deception: Uncovering Misleading Creator Intent in Multimodal News with Vision-Language Models | Bryan Hooi Team | 2505.15489 | null | |
| 2025-05-21 | Chain-of-Focus: Adaptive Visual Search and Zooming for Multimodal Reasoning via RL | Qing Li Team | 2505.15436 | null | |
| 2025-05-21 | TimeCausality: Evaluating the Causal Ability in Time Dimension for Vision Language Models | Keze Wang Team | 2505.15435 | null | |
| 2025-05-21 | On the Robustness of Medical Vision-Language Models: Are they Truly Generalizable? | Mohammad Yaqub Team | 2505.15425 | null | |
| 2025-05-21 | Are Vision-Language Models Safe in the Wild? A Meme-Based Benchmark Study | Hwanjo Yu Team | 2505.15389 | null | |
| 2025-05-21 | RAZER: Robust Accelerated Zero-Shot 3D Open-Vocabulary Panoptic Reconstruction with Spatio-Temporal Aggregation | Farshad Khorrami Team | 2505.15373 | null | |
| 2025-05-21 | Better Safe Than Sorry? Overreaction Problem of Vision Language Models in Visual Emergency Recognition | Youngsook Song Team | 2505.15367 | null | |
| 2025-05-21 | AgentThink: A Unified Framework for Tool-Augmented Chain-of-Thought Reasoning in Vision-Language Models for Autonomous Driving | Diange Yang Team | 2505.15298 | null | |
| 2025-05-21 | Blind Spot Navigation: Evolutionary Discovery of Sensitive Semantic Concepts for LVLMs | Zibin Zheng Team | 2505.15265 | null | |
| 2025-05-21 | Fooling the LVLM Judges: Visual Biases in LVLM-Based Evaluation | Kyomin Jung Team | 2505.15249 | null | |
| 2025-05-20 | UniCTokens: Boosting Personalized Understanding and Generation via Unified Concept Tokens | Wentao Zhang Team | 2505.14671 | null | |
| 2025-05-20 | CAD-Coder: An Open-Source Vision-Language Model for Computer-Aided Design Code Generation | Faez Ahmed Team | 2505.14646 | null | |
| 2025-05-20 | Debating for Better Reasoning: An Unsupervised Multimodal Approach | Mirella Lapata Team | 2505.14627 | null | |
| 2025-05-21 | PlanGPT-VL: Enhancing Urban Planning with Domain-Specific Vision-Language Models | Wenjia Zhang Team | 2505.14481 | null | |
| 2025-05-20 | RAVENEA: A Benchmark for Multimodal Retrieval-Augmented Visual Culture Understanding | Serge Belongie Team | 2505.14462 | link | |
| 2025-05-20 | SCAN: Semantic Document Layout Analysis for Textual and Visual Retrieval-Augmented Generation | Masafumi Oyamada Team | 2505.14381 | null | |
| 2025-05-20 | Towards Embodied Cognition in Robots via Spatially Grounded Synthetic Worlds | Agnieszka Wykowska Team | 2505.14366 | null | |
| 2025-05-20 | DeepEyes: Incentivizing “Thinking with Images” via Reinforcement Learning | Xing Yu Team | 2505.14362 | link | |
| 2025-05-20 | Vision-Language Modeling Meets Remote Sensing: Models, Datasets and Perspectives | Gui-Song Xia Team | 2505.14361 | null | |
| 2025-05-20 | Plane Geometry Problem Solving with Multi-modal Reasoning: A Survey | Dongwoo Kim Team | 2505.14340 | null | |
| 2025-05-20 | Aligning Attention Distribution to Information Flow for Hallucination Mitigation in Large Vision-Language Models | Chong Feng Team | 2505.14257 | null | |
| 2025-05-20 | Visual Agentic Reinforcement Fine-Tuning | Jiaqi Wang Team | 2505.14246 | link | |
| 2025-05-20 | VoQA: Visual-only Question Answering | Lei Huang Team | 2505.14227 | null | |
| 2025-05-20 | Breaking Language Barriers or Reinforcing Bias? A Study of Gender and Racial Disparities in Multilingual Contrastive Vision Language Models | Matthew Purver Team | 2505.14160 | null | |
| 2025-05-20 | Building a Stable Planner: An Extended Finite State Machine Based Planning Module for Mobile GUI Agent | Xuming Hu Team | 2505.14141 | null | |
| 2025-05-20 | NOVA: A Benchmark for Anomaly Localization and Clinical Reasoning in Brain MRI | Benedikt Wiestler Team | 2505.14064 | null | |
| 2025-05-20 | ShieldVLM: Safeguarding the Multimodal Implicit Toxicity via Deliberative Reasoning with LVLMs | Minlie Huang Team | 2505.14035 | null | |
| 2025-05-20 | Toward Effective Reinforcement Learning Fine-Tuning for Medical VQA in Vision-Language Models | Yalin Wang Team | 2505.13973 | null | |
| 2025-05-20 | APEX: Empowering LLMs with Physics-Based Task Planning for Real-time Insight | Ambuj Singh Team | 2505.13921 | link | |
| 2025-05-20 | InSpire: Vision-Language-Action Models with Intrinsic Spatial Reasoning | Jingkuan Song Team | 2505.13888 | null | |
| 2025-05-19 | ChartMuseum: Testing Visual Reasoning Capabilities of Large Vision-Language Models | Greg Durrett Team | 2505.13444 | null | |
| 2025-05-19 | G1: Bootstrapping Perception and Reasoning Abilities of Vision-Language Model via Reinforcement Learning | Baobao Chang Team | 2505.13426 | link | |
| 2025-05-19 | Seeing, Saying, Solving: An LLM-to-TL Framework for Cooperative Robots | Shreyas Kousik Team | 2505.13376 | null | |
| 2025-05-20 | Unlabeled Data or Pre-trained Model: Rethinking Semi-Supervised Learning and Pretrain-Finetuning | Lan-Zhe Guo Team | 2505.13317 | null | |
| 2025-05-19 | I’ll believe it when I see it: Images increase misinformation sharing in Vision-Language Models | R. Maria del Rio-Chanona Team | 2505.13302 | link | |
| 2025-05-19 | Computer Vision Models Show Human-Like Sensitivity to Geometric and Topological Concepts | Sashank Varma Team | 2505.13281 | null | |
| 2025-05-19 | From Local Details to Global Context: Advancing Vision-Language Models with Attention-Based Selection | Jian Liang Team | 2505.13233 | link | |
| 2025-05-19 | ViPlan: A Benchmark for Visual Planning with Symbolic Predicates and Vision-Language Models | Pekka Marttinen Team | 2505.13180 | link | |
| 2025-05-19 | Hearing from Silence: Reasoning Audio Descriptions from Silent Videos via Vision-Language Model | Dong Yu Team | 2505.13062 | null | |
| 2025-05-20 | 3D Visual Illusion Depth Estimation | Yunde Jia Team | 2505.13061 | link | |
| 2025-05-19 | MindOmni: Unleashing Reasoning Generation in Vision Language Models with RGPO | Ying Shan Team | 2505.13031 | link | |
| 2025-05-19 | Uniformity First: Uniformity-aware Test-time Adaptation of Vision-language Models against Image Corruption | Tomoki Hamagami Team | 2505.12912 | link | |
| 2025-05-19 | TinyAlign: Boosting Lightweight Vision-Language Models by Mitigating Modal Alignment Bottlenecks | Jin Dong Team | 2505.12884 | null | |
| 2025-05-19 | FlightGPT: Towards Generalizable and Interpretable UAV Vision-and-Language Navigation with Vision-Language Models | Renxin Zhong Team | 2505.12835 | null | |
| 2025-05-19 | VLC Fusion: Vision-Language Conditioned Sensor Fusion for Robust Object Detection | Ransalu Senanayake Team | 2505.12715 | null | |
| 2025-05-19 | TS-VLM: Text-Guided SoftSort Pooling for Vision-Language Models in Multi-View Driving Reasoning | Soodeh Nikan Team | 2505.12670 | null | |
| 2025-05-19 | Predicting Reaction Time to Comprehend Scenes with Foveated Scene Understanding Maps | Miguel P. Eckstein Team | 2505.12660 | null | |
| 2025-05-19 | AutoMat: Enabling Automated Crystal Structure Reconstruction from Microscopy via Agentic Tool Use | Fei Wei Team | 2505.12650 | link | |
| 2025-05-19 | Use as Many Surrogates as You Want: Selective Ensemble Attack to Unleash Transferability without Sacrificing Resource Efficiency | Zhengyu Zhao Team | 2505.12644 | null | |
| 2025-05-19 | Scalable Video-to-Dataset Generation for Cross-Platform Mobile Agents | Honglak Lee Team | 2505.12632 | null | |
| 2025-05-16 | Patho-R1: A Multimodal Reinforcement Learning-Based Pathology Expert Reasoner | Hong Bu Team | 2505.11404 | null | |
| 2025-05-16 | Search-TTA: A Multimodal Test-Time Adaptation Framework for Visual Search in the Wild | Guillaume Sartoretti Team | 2505.11350 | null | |
| 2025-05-16 | Temporally-Grounded Language Generation: A Benchmark for Real-Time Vision-Language Models | Joyce Chai Team | 2505.11326 | null | |
| 2025-05-16 | Sample Efficient Reinforcement Learning via Large Vision Language Model Distillation | Chang D. Yoo Team | 2505.11221 | null | |
| 2025-05-16 | Redundancy-Aware Pretraining of Vision-Language Foundation Models in Remote Sensing | Begüm Demir Team | 2505.11121 | null | |
| 2025-05-16 | CUBIC: Concept Embeddings for Unsupervised Bias Identification using VLMs | Natalia Díaz-Rodríguez Team | 2505.11060 | null | |
| 2025-05-16 | Exploiting the Asymmetric Uncertainty Structure of Pre-trained VLMs on the Unit Hypersphere | Prashant Singh Team | 2505.11029 | null | |
| 2025-05-16 | On DeepSeekMoE: Statistical Benefits of Shared Experts and Normalized Sigmoid Gating | Alessandro Rinaldo Team | 2505.10860 | null | |
| 2025-05-16 | Benchmarking performance, explainability, and evaluation strategies of vision-language models for surgery: Challenges and opportunities | Shan Lin Team | 2505.10764 | null | |
| 2025-05-15 | GeoGrid-Bench: Can Foundation Models Understand Multimodal Gridded Geo-Spatial Data? | Tanwi Mallick Team | 2505.10714 | null | |
| 2025-05-15 | MOSAIC: A Multi-View 2.5D Organ Slice Selector with Cross-Attentional Reasoning for Anatomically-Aware CT Localization in Medical Organ Segmentation | Muzammil Behzad Team | 2505.10672 | null | |
| 2025-05-15 | CLIP Embeddings for AI-Generated Image Detection: A Few-Shot Study with Lightweight Classifier | Ziyang Ou Team | 2505.10664 | null | |
| 2025-05-15 | Mitigate Language Priors in Large Vision-Language Models by Cross-Images Contrastive Decoding | Chong Feng Team | 2505.10634 | null | |
| 2025-05-15 | MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly | Mark Steedman Team | 2505.10610 | null | |
| 2025-05-18 | MASSV: Multimodal Adaptation and Self-Data Distillation for Speculative Decoding of Vision-Language Models | Vithursan Thangarasa Team | 2505.10526 | null | |
| 2025-05-16 | AI Agents vs. Agentic AI: A Conceptual Taxonomy, Applications and Challenges | Manoj Karkee Team | 2505.10468 | null | |
| 2025-05-15 | Vision language models have difficulty recognizing virtual objects | J. G. Trafton Team | 2505.10453 | null | |
| 2025-05-15 | MMRL++: Parameter-Efficient and Interaction-Aware Representation Learning for Vision-Language Models | Xiaodong Gu Team | 2505.10088 | link | |
| 2025-05-15 | AdaptCLIP: Adapting CLIP for Universal Visual Anomaly Detection | Chengjie Wang Team | 2505.09926 | link | |
| 2025-05-14 | Unfettered Forceful Skill Acquisition with Physical Reasoning and Coordinate Frame Labeling | Nikolaus Correll Team | 2505.09731 | null | |
| 2025-05-14 | ManipBench: Benchmarking Vision-Language Models for Low-Level Robot Manipulation | Daniel Seita Team | 2505.09698 | null | |
| 2025-05-14 | LAS: Loss-less ANN-SNN Conversion for Fully Spike-Driven Large Language Models | Yanan Sun Team | 2505.09659 | link | |
| 2025-05-14 | Variational Visual Question Answering | Marcus Rohrbach Team | 2505.09591 | null | |
| 2025-05-14 | VTLA: Vision-Tactile-Language-Action Model with Preference Learning for Insertion Manipulation | Shuo Wang Team | 2505.09577 | null | |
| 2025-05-14 | Flash-VL 2B: Optimizing Vision-Language Model Performance for Ultra-Low Latency and High Throughput | Lin Ma Team | 2505.09498 | null | |
| 2025-05-14 | Unsupervised Multiview Contrastive Language-Image Joint Learning with Pseudo-Labeled Prompts Via Vision-Language Model for 3D/4D Facial Expression Recognition | Muzammil Behzad Team | 2505.09336 | null | |
| 2025-05-14 | MetaUAS: Universal Anomaly Segmentation with One-Prompt Meta-Learning | Bin-Bin Gao Team | 2505.09265 | null | |
| 2025-05-14 | Beyond General Prompts: Automated Prompt Refinement using Contrastive Class Alignment Scores for Disambiguating Objects in Vision-Language Models | Ross Greer Team | 2505.09139 | null | |
| 2025-05-14 | Seeing Beyond the Scene: Enhancing Vision-Language Models with Interactional Reasoning | Qing Li Team | 2505.09118 | null | |
| 2025-05-14 | OpenLKA: An Open Dataset of Lane Keeping Assist from Recent Car Models under Real-world Driving Conditions | Hao Zhou Team | 2505.09092 | link | |
| 2025-05-13 | Prioritizing Image-Related Tokens Enhances Vision-Language Pre-Training | Heng Ji Team | 2505.08971 | link | |
| 2025-05-15 | Behind Maya: Building a Multilingual Vision Language Model | Alham Fikri Aji Team | 2505.08910 | link | |
| 2025-05-12 | Position: Restructuring of Categories and Implementation of Guidelines Essential for VLM Adoption in Healthcare | Imon Banerjee Team | 2505.08818 | null | |
| 2025-05-13 | Extending Large Vision-Language Model for Diverse Interactive Tasks in Autonomous Driving | Xiang Bai Team | 2505.08725 | link | |
| 2025-05-13 | OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning | Yu Cheng Team | 2505.08617 | link | |
| 2025-05-13 | From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation | Jianye Hao Team | 2505.08548 | null | |
| 2025-05-13 | Judging the Judges: Can Large Vision-Language Models Fairly Evaluate Chart Comprehension and Reasoning? | Jimmy Huang Team | 2505.08468 | link | |
| 2025-05-13 | MA-ROESL: Motion-aware Rapid Reward Optimization for Efficient Robot Skill Learning from Single Videos | Wei Zhang Team | 2505.08367 | null | |
| 2025-05-13 | Removing Watermarks with Partial Regeneration using Semantic Information | Michael W. Mahoney Team | 2505.08234 | link | |
| 2025-05-13 | CLTP: Contrastive Language-Tactile Pre-training for 3D Contact Geometry Understanding | Shuo Wang Team | 2505.08194 | null | |
| 2025-05-13 | DSADF: Thinking Fast and Slow for Decision Making | Shufei Zhang Team | 2505.08189 | null | |
| 2025-05-12 | Imagine, Verify, Execute: Memory-Guided Agentic Exploration with Vision-Language Models | Jia-Bin Huang Team | 2505.07815 | null | |
| 2025-05-12 | Reproducibility, Replicability, and Insights into Visual Document Retrieval with Late Interaction | Andrew Yates Team | 2505.07730 | null | |
| 2025-05-12 | Through the Looking Glass: Common Sense Consistency Evaluation of Weird Images | Vasily Konovalov Team | 2505.07704 | null | |
| 2025-05-12 | Beyond CLIP Generalization: Against Forward&Backward Forgetting Adapter for Continual Learning of Vision-Language Models | Yihong Gong Team | 2505.07690 | null | |
| 2025-05-12 | Simple Semi-supervised Knowledge Distillation from Vision-Language Models via Dual-Head Optimization | Sung Ju Hwang Team | 2505.07675 | null | |
| 2025-05-12 | Discrete Visual Tokens of Autoregression, by Diffusion, and for Reasoning | Hanwang Zhang Team | 2505.07538 | null | |
| 2025-05-12 | AI-Enabled Accurate Non-Invasive Assessment of Pulmonary Hypertension Progression via Multi-Modal Echocardiography | Xiaomeng Li Team | 2505.07347 | null | |
| 2025-05-12 | Skywork-VL Reward: An Effective Reward Model for Multimodal Understanding and Reasoning | Yahui Zhou Team | 2505.07263 | null | |
| 2025-05-12 | Incomplete In-context Learning | Yangshijie Zhang Team | 2505.07251 | null | |
| 2025-05-12 | UAV-CodeAgents: Scalable UAV Mission Planning via Multi-Agent ReAct and Vision-Language Reasoning | Dzmitry Tsetserukou Team | 2505.07236 | null | |
| 2025-05-12 | Language-Driven Dual Style Mixing for Single-Domain Generalized Object Detection | Ningjiang Chen Team | 2505.07219 | link | |
| 2025-05-12 | Internet of Agents: Fundamentals, Applications, and Challenges | Dusit Niyato Team | 2505.07176 | null | |
| 2025-05-12 | Critique Before Thinking: Mitigating Hallucination through Rationale-Augmented Instruction Tuning | Weiping Wang Team | 2505.07172 | null | |
| 2025-05-12 | EmoVLM-KD: Fusing Distilled Expertise with Vision-Language Models for Visual Emotion Analysis | Eunil Park Team | 2505.07164 | null | |
| 2025-05-11 | A Vision-Language Foundation Model for Leaf Disease Identification | Luyl-Da Quach Team | 2505.07019 | null | |
| 2025-05-11 | Hallucination-Aware Multimodal Benchmark for Gastrointestinal Image Analysis with Large Vision-Language Models | Binod Bhattarai Team | 2505.07001 | null | |
| 2025-05-11 | UniDiffGrasp: A Unified Framework Integrating VLM Reasoning and VLM-Guided Part Diffusion for Open-Vocabulary Constrained Grasping with Dual Arms | Zhenze Liu Team | 2505.06832 | null | |
| 2025-05-10 | STRIVE: Structured Representation Integrating VLM Reasoning for Efficient Object Navigation | Jean Oh Team | 2505.06729 | null | |
| 2025-05-10 | METOR: A Unified Framework for Mutual Enhancement of Objects and Relationships in Open-vocabulary Video Visual Relationship Detection | Shuo Yang Team | 2505.06663 | link | |
| 2025-05-10 | Integrating Video and Text: A Balanced Approach to Multimodal Summary Generation and Evaluation | Nancy F. Chen Team | 2505.06594 | null | |
| 2025-05-09 | MM-Skin: Enhancing Dermatology Vision-Language Model with an Image-Text Dataset Derived from Textbooks | Bo Yan Team | 2505.06152 | link | |
| 2025-05-09 | Leveraging Vision-Language Models for Visual Grounding and Analysis of Automotive UI | Dominik Bollmann Team | 2505.05895 | null | |
| 2025-05-09 | Describe Anything in Medical Images | Min Xu Team | 2505.05804 | null | |
| 2025-05-09 | 3D CAVLA: Leveraging Depth and 3D Context to Generalize Vision Language Action Models for Unseen Tasks | Farshad Khorrami Team | 2505.05800 | null | |
| 2025-05-08 | Fine-Tuning Video-Text Contrastive Model for Primate Behavior Retrieval from Unlabeled Raw Videos | Nina S. T. Hirata Team | 2505.05681 | null | |
| 2025-05-08 | X-Transfer Attacks: Towards Super Transferable Adversarial Attacks on CLIP | James Bailey Team | 2505.05528 | link | |
| 2025-05-08 | Bring Reason to Vision: Understanding Perception and Reasoning through Model Merging | Junxian He Team | 2505.05464 | link | |
| 2025-05-08 | SITE: towards Spatial Intelligence Thorough Evaluation | Boqing Gong Team | 2505.05456 | null | |
| 2025-05-08 | DSDrive: Distilling Large Language Model for Lightweight End-to-End Autonomous Driving with Unified Reasoning and Planning | Jun Ma Team | 2505.05360 | null | |
| 2025-05-08 | Hearing and Seeing Through CLIP: A Framework for Self-Supervised Sound Source Localization | Joon Son Chung Team | 2505.05343 | link | |
| 2025-05-08 | Mapping User Trust in Vision Language Models: Research Landscape, Challenges, and Prospects | Matteo Matteucci Team | 2505.05318 | null | |
| 2025-05-08 | Biomed-DPT: Dual Modality Prompt Tuning for Biomedical Vision-Language Models | Meng Zhang Team | 2505.05189 | null | |
| 2025-05-08 | OpenworldAUC: Towards Unified Evaluation and Optimization for Open-world Prompt Tuning | Qingming Huang Team | 2505.05180 | link | |
| 2025-05-08 | Probabilistic Embeddings for Frozen Vision-Language Models: Uncertainty Quantification with Gaussian Process Latent Variable Models | Joachim Denzler Team | 2505.05163 | null | |
| 2025-05-08 | CacheFL: Efficient Federated Cache Model Fine-Tuning for Vision-Language Models | Furao Shen Team | 2505.05130 | null | |
| 2025-05-08 | X-Driver: Explainable Autonomous Driving with Vision-Language Models | Zengfeng Zeng Team | 2505.05098 | null | |
| 2025-05-08 | Image-Text Relation Prediction for Multilingual Tweets | Edison Marrese-Taylor Team | 2505.05040 | null | |
| 2025-05-09 | G-FOCUS: Towards a Robust Method for Assessing UI Design Persuasiveness | Youngjae Yu Team | 2505.05026 | null | |
| 2025-05-08 | Split Matching for Inductive Zero-shot Semantic Segmentation | Daisuke Deguchi Team | 2505.05023 | null | |
| 2025-05-08 | LVLM-MPC Collaboration for Autonomous Driving: A Safety-Aware and Task-Scalable Control Architecture | Tatsuya Suzuki Team | 2505.04980 | null | |
| 2025-05-07 | Vision-Language-Action Models: Concepts, Progress, Applications and Challenges | Manoj Karkee Team | 2505.04769 | null | |
| 2025-05-07 | “I Can See Forever!”: Evaluating Real-time VideoLLMs for Assisting Individuals with Visual Impairments | Xinlei He Team | 2505.04488 | null | |
| 2025-05-07 | DeCLIP: Decoupled Learning for Open-Vocabulary Dense Perception | Zhuotao Tian Team | 2505.04410 | link | |
| 2025-05-07 | CM1 – A Dataset for Evaluating Few-Shot Information Extraction with Large Vision Language Models | Gernot A. Fink Team | 2505.04214 | null | |
| 2025-05-07 | R^3-VQA: “Read the Room” by Video Social Reasoning | Lifeng Fan Team | 2505.04147 | null | |
| 2025-05-06 | X-Reasoner: Towards Generalizable Reasoning Across Modalities and Domains | Hoifung Poon Team | 2505.03981 | null | |
| 2025-05-06 | Fill the Gap: Quantifying and Reducing the Modality Gap in Image-Text Representation Learning | Victor Amblard Team | 2505.03703 | null | |
| 2025-05-06 | Distribution-Conditional Generation: From Class Distribution to Creative Generation | Xin Geng Team | 2505.03667 | null | |
| 2025-05-06 | Learning Unknown Spoof Prompts for Generalized Face Anti-Spoofing Using Only Real Face Images | Zhenan Sun Team | 2505.03611 | null | |
| 2025-05-06 | Learning Knowledge-based Prompts for Robust 3D Mask Presentation Attack Detection | Ming-Hsuan Yang Team | 2505.03610 | null | |
| 2025-05-06 | Mitigating Image Captioning Hallucinations in Vision-Language Models | Xi Li Team | 2505.03420 | null | |
| 2025-05-07 | Enhancing Target-unspecific Tasks through a Features Matrix | Jun Yu Team | 2505.03414 | null | |
| 2025-05-06 | Reducing Annotation Burden in Physical Activity Research Using Vision-Language Models | Aiden Doherty Team | 2505.03374 | null | |
| 2025-05-06 | A Vision-Language Model for Focal Liver Lesion Classification | Chen Yen-Wei Team | 2505.03350 | null | |
| 2025-05-06 | From Word to Sentence: A Large-Scale Multi-Instance Dataset for Open-Set Aerial Detection | Rong Xiao Team | 2505.03334 | null | |
| 2025-05-06 | Seeing the Abstract: Translating the Abstract Language for Vision Language Models | Yiming Wang Team | 2505.03242 | link | |
| 2025-05-06 | VLM Q-Learning: Aligning Vision-Language Models for Interactive Decision-Making | Juan Carlos Niebles Team | 2505.03181 | null | |
| 2025-05-06 | Robust Fairness Vision-Language Learning for Medical Image Analysis | Shu Hu Team | 2505.03153 | link | |
| 2025-05-05 | Adversarial Robustness Analysis of Vision-Language Models in Medical Image Segmentation | Manish Dhakal Team | 2505.02971 | null | |
| 2025-05-05 | LISAT: Language-Instructed Segmentation Assistant for Satellite Imagery | David M. Chan Team | 2505.02829 | null | |
| 2025-05-05 | HapticVLM: VLM-Driven Texture Recognition Aimed at Intelligent Haptic Interaction | Dzmitry Tsetserukou Team | 2505.02569 | null | |
| 2025-05-05 | Tevatron 2.0: Unified Document Retrieval Toolkit across Scale, Language, and Modality | Jimmy Lin Team | 2505.02466 | null | |
| 2025-05-05 | Recent Advances in Out-of-Distribution Detection with CLIP-Like Models: A Survey | Songcan Chen Team | 2505.02448 | null | |
| 2025-05-05 | SuperEdit: Rectifying and Facilitating Supervision for Instruction-Based Image Editing | Sijie Zhu Team | 2505.02370 | link | |
| 2025-05-05 | TeDA: Boosting Vision-Language Models for Zero-Shot 3D Object Retrieval via Testing-time Distribution Alignment | Xinwei He Team | 2505.02325 | null | |
| 2025-05-04 | Compositional Image-Text Matching and Retrieval by Grounding Entities | Jana Košecká Team | 2505.02278 | null | |
| 2025-05-04 | Handling Imbalanced Pseudolabels for Vision-Language Models with Concept Alignment and Confusion-Aware Calibrated Margin | Xinyang Chen Team | 2505.02056 | null | |
| 2025-05-04 | A Comprehensive Analysis for Visual Object Hallucination in Large Vision-Language Models | Xinya Du Team | 2505.01958 | null | |
| 2025-05-03 | PhysNav-DG: A Novel Adaptive Framework for Robust VLM-Sensor Fusion in Navigation Applications | Santosh Patapati Team | 2505.01881 | null | |
| 2025-05-03 | Enhancing the Learning Experience: Using Vision-Language Models to Generate Questions for Educational Videos | Anett Hoppe Team | 2505.01790 | null | |
| 2025-05-03 | An LLM-Empowered Low-Resolution Vision System for On-Device Human Behavior Understanding | Guoliang Xing Team | 2505.01743 | null | |
| 2025-05-03 | Vision and Intention Boost Large Language Model in Long-Term Action Anticipation | Yanning Zhang Team | 2505.01713 | null | |
| 2025-05-03 | RoBridge: A Hierarchical Architecture Bridging Cognition and Execution for General Robotic Manipulation | Xiaodan Liang Team | 2505.01709 | null | |
| 2025-05-03 | Topology-Aware CLIP Few-Shot Learning | Dazhi Huang Team | 2505.01694 | null | |
| 2025-05-02 | TEMPURA: Temporal Event Masked Prediction and Understanding for Reasoning in Action | Jenq-Neng Hwang Team | 2505.01583 | null | |
| 2025-05-02 | Grounding Task Assistance with Multimodal Cues from a Single Demonstration | Andrew D. Wilson Team | 2505.01578 | null | |
| 2025-05-02 | Dynamic Robot Tool Use with Vision Language Models | Ahmed H. Qureshi Team | 2505.01399 | null | |
| 2025-05-02 | Evaluating Vision Language Model Adaptations for Radiology Report Generation in Low-Resource Languages | Valerio Guarrasi Team | 2505.01096 | null | |
| 2025-05-02 | Any-to-Any Vision-Language Model for Multimodal X-ray Imaging and Radiological Report Generation | Valerio Guarrasi Team | 2505.01091 | null | |
| 2025-05-02 | Transferable Adversarial Attacks on Black-Box Vision-Language Models | Matt Fredrikson Team | 2505.01050 | null | |
| 2025-04-30 | Entropy Heat-Mapping: Localizing GPT-Based OCR Errors with Sliding-Window Shannon Analysis | Alexei Kaltchenko Team | 2505.00746 | null | |
| 2025-05-01 | Robotic Visual Instruction | Xianzheng Ma Team | 2505.00693 | null | |
| 2025-05-01 | Visual Test-time Scaling for GUI Agent Grounding | Honglak Lee Team | 2505.00684 | null | |
| 2025-05-01 | DeCo: Task Decomposition and Skill Composition for Zero-Shot Generalization in Long-Horizon 3D Manipulation | Yang Gao Team | 2505.00527 | null | |
| 2025-05-01 | LightEMMA: Lightweight End-to-End Multimodal Model for Autonomous Driving | Henry X. Liu Team | 2505.00284 | null | |
| 2025-05-01 | AdCare-VLM: Leveraging Large Vision Language Model (LVLM) to Monitor Long-Term Medication Adherence and Care | Tianming Liu Team | 2505.00275 | null | |
| 2025-04-30 | V3LMA: Visual 3D-enhanced Language Model for Autonomous Driving | Markus Lienkamp Team | 2505.00156 | null | |
| 2025-04-30 | Detecting and Mitigating Hateful Content in Multimodal Memes with Vision-Language Models | Xintao Wu Team | 2505.00150 | null | |
| 2025-04-30 | Investigating Zero-Shot Diagnostic Pathology in Vision-Language Models with Efficient Prompt Design | Mahdi S. Hosseini Team | 2505.00134 | null | |
| 2025-04-30 | Early Exit and Multi Stage Knowledge Distillation in VLMs for Video Summarization | Ganesh Ramakrishnan Team | 2504.21831 | null | |
| 2025-04-30 | Black-Box Visual Prompt Engineering for Mitigating Object Hallucination in Large Vision Language Models | Lin Lee Cheong Team | 2504.21559 | null | |
| 2025-04-30 | RoboGround: Robotic Manipulation with Grounded Vision-Language Priors | Zhou Zhao Team | 2504.21530 | null | |
| 2025-04-30 | Vision-Language Model-Based Semantic-Guided Imaging Biomarker for Early Lung Cancer Detection | William Hsu Team | 2504.21344 | null | |
| 2025-04-29 | MemeBLIP2: A novel lightweight multimodal system to detect harmful memes | Lisha Xu Team | 2504.21226 | null | |
| 2025-04-29 | GLIP-OOD: Zero-Shot Graph OOD Detection with Foundation Model | Yue Zhao Team | 2504.21186 | null | |
| 2025-04-29 | Token-Level Prompt Mixture with Parameter-Free Routing for Federated Domain Generalization | Xiaojun Chang Team | 2504.21063 | null | |
| 2025-04-29 | Real-Time Wayfinding Assistant for Blind and Low-Vision Users | Farhan Sadaf Team | 2504.20976 | null | |
| 2025-04-29 | FedMVP: Federated Multi-modal Visual Prompt Tuning for Vision-Language Models | Elisa Ricci Team | 2504.20860 | null | |
| 2025-04-29 | In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer | Yi Yang Team | 2504.20690 | null | |
| 2025-04-29 | SpaRE: Enhancing Spatial Reasoning in Vision-Language Models with Synthetic Data | Freda Shi Team | 2504.20648 | null | |
| 2025-04-29 | PRISM: Projection-based Reward Integration for Scene-Aware Real-to-Sim-to-Real Transfer with Few Demonstrations | Xuguang Lan Team | 2504.20520 | null | |
| 2025-04-29 | Antidote: A Unified Framework for Mitigating LVLM Hallucinations in Counterfactual Presupposition and Object Perception | Xiaoqiang Li Team | 2504.20468 | null | |
| 2025-04-29 | Plant Disease Detection through Multimodal Large Language Models and Convolutional Neural Networks | Dimitrios K. Nasiopoulos Team | 2504.20419 | null | |
| 2025-04-29 | FiLA-Video: Spatio-Temporal Compression for Fine-Grained Long Video Understanding | Bo Zheng Team | 2504.20384 | null | |
| 2025-04-28 | A Multimodal Pipeline for Clinical Data Extraction: Applying Vision-Language Models to Scans of Transfusion Reaction Reports | Christoph M. Friedrich Team | 2504.20220 | null | |
| 2025-04-28 | Weaving Context Across Images: Improving Vision-Language Models through Focus-Centric Visual Chains | Rui Yan Team | 2504.20199 | null | |
| 2025-04-28 | SpatialReasoner: Towards Explicit and Generalizable 3D Spatial Reasoning | Alan Yuille Team | 2504.20024 | null | |
| 2025-04-28 | EcoWikiRS: Learning Ecological Representation of Satellite Images from Weak Supervision with Species Observations and Wikipedia | Diego Marcos Team | 2504.19742 | null | |
| 2025-04-28 | Contrastive Language-Image Learning with Augmented Textual Prompts for 3D/4D FER Using Vision-Language Model | Guoying Zhao Team | 2504.19739 | null | |
| 2025-04-28 | VCM: Vision Concept Modeling Based on Implicit Contrastive Learning with Vision-Language Instruction Fine-Tuning | Xiaobo Xia Team | 2504.19627 | null | |
| 2025-04-28 | LR-IAD: Mask-Free Industrial Anomaly Detection with Logical Reasoning | Aimin Yang Team | 2504.19524 | null | |
| 2025-04-27 | DeepSPG: Exploring Deep Semantic Prior Guidance for Low-light Image Enhancement with Multimodal Learning | Shini Han Team | 2504.19127 | null | |
| 2025-04-27 | Boosting Single-domain Generalized Object Detection via Vision-Language Knowledge Interaction | Jian Liu Team | 2504.19086 | null | |
| 2025-04-26 | Multi-Resolution Pathology-Language Pre-training Model with Text-Guided Visual Representation | Arif Mahmood Team | 2504.18856 | null | |
| 2025-04-26 | Video CLIP Model for Multi-View Echocardiography Interpretation | Norihiko Takeda Team | 2504.18800 | null | |
| 2025-04-25 | A Review of 3D Object Detection with Vision-Language Models | Manoj Karkee Team | 2504.18738 | null | |
| 2025-04-25 | Proof-of-TBI – Fine-Tuned Vision Language Model Consortium and OpenAI-o3 Reasoning LLM-Based Medical Diagnosis Support System for Mild Traumatic Brain Injury (TBI) Prediction | Donna Broshek Team | 2504.18671 | null | |
| 2025-04-25 | Generalization Capability for Imitation Learning | Yixiao Wang Team | 2504.18538 | null | |
| 2025-04-25 | Fast-Slow Thinking for Large Vision-Language Model Reasoning | Fei Wu Team | 2504.18458 | null | |
| 2025-04-25 | Reason Like a Radiologist: Chain-of-Thought and Reinforcement Learning for Verifiable Report Generation | Guang Yang Team | 2504.18453 | null | |
| 2025-04-25 | Revisiting Data Auditing in Large Vision-Language Models | Zhuosheng Zhang Team | 2504.18349 | null | |
| 2025-04-25 | A Large Vision-Language Model based Environment Perception System for Visually Impaired People | Shiguo Lian Team | 2504.18027 | null | |
| 2025-04-24 | CAMU: Context Augmentation for Meme Understanding | Aditya Joshi Team | 2504.17902 | null | |
| 2025-04-24 | FashionM3: Multimodal, Multitask, and Multiround Fashion Assistant based on Unified Vision-Language Model | Waikeung Wong Team | 2504.17826 | null | |
| 2025-04-25 | Data-Driven Calibration of Prediction Sets in Large Vision-Language Models Based on Inductive Conformal Prediction | Weiyan Wen Team | 2504.17671 | null | |
| 2025-04-24 | SDVPT: Semantic-Driven Visual Prompt Tuning for Open-World Object Counting | Qingming Huang Team | 2504.17395 | null | |
| 2025-04-24 | M-MRE: Extending the Mutual Reinforcement Effect to Multimodal Information Extraction | Tatsunori Mori Team | 2504.17353 | null | |
| 2025-04-24 | DIMT25@ICDAR2025: HW-TSC’s End-to-End Document Image Machine Translation System Leveraging Large Vision-Language Model | Hao Yang Team | 2504.17315 | null | |
| 2025-04-24 | Cracking the Code of Action: a Generative Approach to Affordances for Reinforcement Learning | Khimya Khetarpal Team | 2504.17282 | null | |
| 2025-04-24 | Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery Simulation | Minhyuk Sung Team | 2504.17207 | null | |
| 2025-04-23 | Distilling semantically aware orders for autoregressive image generation | Marco Pedersoli Team | 2504.17069 | null | |
| 2025-04-23 | DyMU: Dynamic Merging and Virtual Unmerging for Efficient VLMs | Ran Xu Team | 2504.17040 | null | |
| 2025-04-24 | V$^2$R-Bench: Holistically Evaluating LVLM Robustness to Fundamental Visual Variations | Yi R. Fung Team | 2504.16727 | null | |
| 2025-04-23 | Streetscape Analysis with Generative AI (SAGAI): Vision-Language Assessment and Mapping of Urban Scenes | Giovanni Fusco Team | 2504.16538 | null | |
| 2025-04-23 | TraveLLaMA: Facilitating Multi-modal Large Language Models to Understand Urban Scenes and Provide Travel Assistance | Jiaya Jia Team | 2504.16505 | null | |
| 2025-04-23 | FrogDogNet: Fourier frequency Retained visual prompt Output Guidance for Domain Generalization of CLIP in Remote Sensing | Biplab Banerjee Team | 2504.16433 | null | |
| 2025-04-22 | CLIP-IT: CLIP-based Pairing for Histology Images Classification | Eric Granger Team | 2504.16181 | null | |
| 2025-04-22 | MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention | Lili Qiu Team | 2504.16083 | null | |
| 2025-04-22 | MR. Video: “MapReduce” is the Principle for Long Video Understanding | Yu-Xiong Wang Team | 2504.16082 | null | |
| 2025-04-22 | Describe Anything: Detailed Localized Image and Video Captioning | Yin Cui Team | 2504.16072 | null | |
| 2025-04-22 | Vision language models are unreliable at trivial spatial cognition | J. Gregory Trafton Team | 2504.16061 | null | |
| 2025-04-22 | Vision-Language Models Are Not Pragmatically Competent in Referring Expression Generation | Joyce Chai Team | 2504.16060 | null | |
| 2025-04-22 | Evaluating Vision Language Models (VLMs) for Radiology: A Comprehensive Analysis | Judy Gichoya Team | 2504.16047 | null | |
| 2025-04-22 | LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale | Mike Zheng Shou Team | 2504.16030 | null | |
| 2025-04-24 | Meta-Entity Driven Triplet Mining for Aligning Medical Vision-Language Models | Tolga Çukur Team | 2504.15929 | null | |
| 2025-04-21 | CAPTURe: Evaluating Spatial Reasoning in Vision Language Models via Occluded Object Counting | Mohit Bansal Team | 2504.15485 | null | |
| 2025-04-21 | Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models | Guilin Liu Team | 2504.15271 | null | |
| 2025-04-21 | KGMEL: Knowledge Graph-Enhanced Multimodal Entity Linking | Kijung Shin Team | 2504.15135 | link | |
| 2025-04-21 | Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: A Comprehensive Evaluation | Serge Belongie Team | 2504.14988 | link | |
| 2025-04-21 | VLM as Policy: Common-Law Content Moderation Framework for Short Video Platform | Kun Gai Team | 2504.14904 | null | |
| 2025-04-21 | Object-Level Verbalized Confidence Calibration in Vision-Language Models via Semantic Perturbation | Yunji Chen Team | 2504.14848 | null | |
| 2025-04-20 | OmniV-Med: Scaling Medical Vision-Language Model for Universal Visual Understanding | Zuozhu Liu Team | 2504.14692 | null | |
| 2025-04-20 | NVSMask3D: Hard Visual Prompting with Camera Pose Interpolation for 3D Open Vocabulary Instance Segmentation | Juho Kannala Team | 2504.14638 | null | |
| 2025-04-20 | LGD: Leveraging Generative Descriptions for Zero-Shot Referring Image Segmentation | Yongsheng Gao Team | 2504.14467 | null | |
| 2025-04-20 | Neglected Risks: The Disturbing Reality of Children’s Images in Datasets and the Urgent Call for Accountability | Sandra Avila Team | 2504.14446 | null | |
| 2025-04-19 | Hydra: An Agentic Reasoning Approach for Enhancing Adversarial Robustness and Mitigating Hallucinations in Vision-Language Models | Nathaniel D. Bastian Team | 2504.14395 | null | |
| 2025-04-19 | How Well Can General Vision-Language Models Learn Medicine By Watching Public Educational Videos? | James Zou Team | 2504.14391 | null | |
| 2025-04-19 | A Multimodal Recaptioning Framework to Account for Perceptual Diversity in Multilingual Vision-Language Modeling | Adriana Kovashka Team | 2504.14359 | null | |
| 2025-04-19 | Diffusion-based Dynamic Contract for Federated AI Agent Construction in Mobile Metaverses | Chau Yuen Team | 2504.14326 | null | |
| 2025-04-19 | Enhancing Multimodal In-Context Learning for Image Classification through Coreset Optimization | Xu Yang Team | 2504.14200 | null | |
| 2025-04-19 | Bayesian Principles Improve Prompt Learning In Vision-Language Models | Mijung Park Team | 2504.14123 | null | |
| 2025-04-19 | PEFT A2Z: Parameter-Efficient Fine-Tuning Survey for Large Language and Vision Models | Ozlem Ozmen Garibay Team | 2504.14117 | null | |
| 2025-04-21 | Analysing the Robustness of Vision-Language-Models to Common Corruptions | Umair Bin Mansoor Team | 2504.13690 | null | |
| 2025-04-18 | EyecareGPT: Boosting Comprehensive Ophthalmology Understanding with Tailored Dataset, Benchmark and Model | Beng Chin Ooi Team | 2504.13650 | link | |
| 2025-04-18 | PV-VLM: A Multimodal Vision-Language Approach Incorporating Sky Images for Intra-Hour Photovoltaic Power Forecasting | Miao Yu Team | 2504.13624 | null | |
| 2025-04-18 | Chain-of-Thought Textual Reasoning for Few-shot Temporal Action Localization | Huadong Ma Team | 2504.13460 | null | |
| 2025-04-18 | Towards a Multi-Agent Vision-Language System for Zero-Shot Novel Hazardous Object Detection for Autonomous Driving Safety | Ross Greer Team | 2504.13399 | null | |
| 2025-04-17 | VLLFL: A Vision-Language Model Based Lightweight Federated Learning Framework for Smart Agriculture | Yanbo Huang Team | 2504.13365 | null | |
| 2025-04-17 | Chain-of-Modality: Learning Manipulation Programs from Multimodal Human Videos with Vision-Language-Models | Jacky Liang Team | 2504.13351 | null | |
| 2025-04-17 | WildFireCan-MMD: A Multimodal dataset for Classification of User-generated Content During Wildfires in Canada | Marzieh Amini Team | 2504.13231 | null | |
| 2025-04-17 | PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding | Christoph Feichtenhofer Team | 2504.13180 | null | |
| 2025-04-17 | Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling | David M. Chan Team | 2504.13169 | link | |
| 2025-04-17 | Low-hallucination Synthetic Captions for Large-Scale Vision-Language Model Pre-training | Zhanhui Kang Team | 2504.13123 | null | |
| 2025-04-17 | Probing and Inducing Combinational Creativity in Vision-Language Models | Zilong Zheng Team | 2504.13120 | null | |
| 2025-04-17 | Object-Driven Narrative in AR: A Scenario-Metaphor Framework with VLM Integration | Yong Hong Kuo Team | 2504.13119 | null | |
| 2025-04-17 | Early Accessibility: Automating Alt-Text Generation for UI Icons During App Development | Christoph Csallner Team | 2504.13069 | null | |
| 2025-04-17 | NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation | Michael Qizhe Shieh Team | 2504.13055 | null | |
| 2025-04-17 | Embodied-R: Collaborative Framework for Activating Embodied Spatial Reasoning in Foundation Models via Reinforcement Learning | Wenwu Zhu Team | 2504.12680 | link | |
| 2025-04-17 | VLMGuard-R1: Proactive Safety Alignment for VLMs via Reasoning-Driven Prompt Optimization | Siheng Chen Team | 2504.12661 | null | |
| 2025-04-16 | Sparsity Outperforms Low-Rank Projections in Few-Shot Adaptation | Éric Granger Team | 2504.12436 | link | |
| 2025-04-16 | FLIP Reasoning Challenge | Roger Wattenhofer Team | 2504.12256 | null | |
| 2025-04-16 | Efficient Contrastive Decoding with Probabilistic Hallucination Detection - Mitigating Hallucinations in Large Vision Language Models - | Hanno Gottschalk Team | 2504.12137 | null | |
| 2025-04-17 | Securing the Skies: A Comprehensive Survey on Anti-UAV Methods, Benchmarking, and Future Directions | Zhi-Qi Cheng Team | 2504.11967 | null | |
| 2025-04-16 | Beyond Words: Augmenting Discriminative Richness via Diffusions in Unsupervised Prompt Learning | Yi Chang Team | 2504.11930 | null | |
| 2025-04-16 | A Visual RAG Pipeline for Few-Shot Fine-Grained Product Classification | Janis Keuper Team | 2504.11838 | null | |
| 2025-04-17 | DVLTA-VQA: Decoupled Vision-Language Modeling with Text-Guided Adaptation for Blind Video Quality Assessment | Moncef Gabbouj Team | 2504.11733 | null | |
| 2025-04-16 | Interpreting the Linear Structure of Vision-language Model Embedding Spaces | Stephanie Gil Team | 2504.11695 | null | |
| 2025-04-16 | VLM-Fuzz: Vision Language Model Assisted Recursive Depth-first Search Exploration for Effective UI Testing of Android Apps | Mariano Ceccato Team | 2504.11675 | null | |
| 2025-04-15 | Co-STAR: Collaborative Curriculum Self-Training with Adaptive Regularization for Source-Free Video Domain Adaptation | Majid Mirmehdi Team | 2504.11669 | null | |
| 2025-04-17 | PATFinger: Prompt-Adapted Transferable Fingerprinting against Unauthorized Multimodal Dataset Usage | Lina Wang Team | 2504.11509 | null | |
| 2025-04-15 | From Gaze to Insight: Bridging Human Visual Attention and Vision Language Model Explanation for Weakly-Supervised Medical Image Segmentation | Jungong Han Team | 2504.11368 | null | |
| 2025-04-17 | UI-E2I-Synth: Advancing GUI Grounding with Large-Scale Instruction Synthesis | Yan Lu Team | 2504.11257 | null | |
| 2025-04-15 | R-TPT: Improving Adversarial Robustness of Vision-Language Models through Test-Time Prompt Tuning | Ran He Team | 2504.11195 | null | |
| 2025-06-30 | Benchmarking Vision Language Models on German Factual Data | Vincent Tischler Team | 2504.11108 | null | |
| 2025-04-16 | Consensus Entropy: Harnessing Multi-VLM Agreement for Self-Verifying and Self-Improving OCR | Gongshen Liu Team | 2504.11101 | null | |
| 2025-04-15 | QAVA: Query-Agnostic Visual Attack to Large Vision-Language Models | Yu Wang Team | 2504.11038 | null | |
| 2025-04-15 | Can Vision-Language Models Understand and Interpret Dynamic Gestures from Pedestrians? Pilot Datasets and Exploration Towards Instructive Nonverbal Commands for Cooperative Autonomous Vehicles | Ross Greer Team | 2504.10873 | null | |
| 2025-04-15 | LVLM_CSP: Accelerating Large Vision Language Models via Clustering, Scattering, and Pruning for Reasoning Segmentation | Mohsen Imani Team | 2504.10854 | null | |
| 2025-04-15 | Enhancing Features in Long-tailed Data Using Large Vision Model | Xuesong Li Team | 2504.10852 | null | |
| 2025-04-14 | ReasonDrive: Efficient Visual Question Answering for Autonomous Vehicles with Reasoning-Enhanced Small Vision-Language Models | Lifeng Zhou Team | 2504.10757 | null | |
| 2025-04-14 | AgMMU: A Comprehensive Agricultural Multimodal Understanding and Reasoning Benchmark | Yu-Xiong Wang Team | 2504.10568 | null | |
| 2025-04-14 | Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding | Jiashi Feng Team | 2504.10465 | null | |
| 2025-04-15 | GUI-R1: A Generalist R1-Style Vision-Language Action Model For GUI Agents | Run Luo Team | 2504.10458 | null | |
| 2025-04-14 | SlowFastVAD: Video Anomaly Detection via Integrating Simple Detector and RAG-Enhanced Vision-Language Model | Yanning Zhang Team | 2504.10320 | null | |
| 2025-04-15 | Breaking the Data Barrier – Building GUI Agents Through Task Generalization | Junxian He Team | 2504.10127 | null | |
| 2025-04-14 | AGO: Adaptive Grounding for Open World 3D Occupancy Prediction | Andreas Zell Team | 2504.10117 | null | |
| 2025-04-14 | CameraBench: Benchmarking Visual Reasoning in MLLMs via Photography | Jun-Cheng Chen Team | 2504.10090 | null | |
| 2025-04-14 | Summarization of Multimodal Presentations with Vision-Language Models: Study of the Effect of Modalities and Structure | Frédéric Dufaux Team | 2504.10049 | null | |
| 2025-04-14 | Aligning Anime Video Generation with Human Feedback | Zuxuan Wu Team | 2504.10044 | null | |
| 2025-04-14 | KeyMPs: One-Shot Vision-Language Guided Motion Generation by Sequencing DMPs for Occlusion-Rich Tasks | Takamitsu Matsubara Team | 2504.10011 | null | |
| 2025-04-14 | GenTe: Generative Real-world Terrains for General Legged Robot Locomotion Control | Xiaoqiang Ji Team | 2504.09997 | null | |
| 2025-04-14 | Resampling Benchmark for Efficient Comprehensive Evaluation of Large Vision-Language Models | Keisuke Ozawa Team | 2504.09979 | null | |
| 2025-04-14 | Can VLMs Assess Similarity Between Graph Visualizations? | Jinwook Seo Team | 2504.09859 | null | |
| 2025-04-14 | VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents | Jun Suzuki Team | 2504.09795 | null | |
| 2025-04-13 | A Survey on Efficient Vision-Language Models | Nirmalya Roy Team | 2504.09724 | null | |
| 2025-04-13 | Metropolis-Hastings Captioning Game: Knowledge Fusion of Vision Language Models via Decentralized Bayesian Inference | Tadahiro Taniguchi Team | 2504.09620 | null | |
| 2025-04-13 | DualPrompt-MedCap: A Dual-Prompt Enhanced Approach for Medical Image Captioning | Mukesh Prasad Team | 2504.09598 | null | |
| 2025-04-15 | Vision-Language Model for Object Detection and Segmentation: A Review and Evaluation | Yunhong Wang Team | 2504.09480 | null | |
| 2025-04-13 | Identity-Aware Vision-Language Model for Explainable Face Forgery Detection | Yu-Gang Jiang Team | 2504.09439 | null | |
| 2025-04-13 | BabyVLM: Data-Efficient Pretraining of VLMs Inspired by Infant Learning | Boqing Gong Team | 2504.09426 | null | |
| 2025-04-12 | PathVLM-R1: A Reinforcement Learning-Driven Reasoning Model for Pathology Visual-Language Tasks | Yang Liu Team | 2504.09258 | null | |
| 2025-04-11 | AstroLLaVA: towards the unification of astronomical data and natural language | Dimitrios Tanoglidis Team | 2504.08583 | null | |
| 2025-04-11 | EO-VLM: VLM-Guided Energy Overload Attacks on Vision Models | Jinwoo Kim Team | 2504.08205 | null | |
| 2025-04-10 | Investigating Vision-Language Model for Point Cloud-based Vehicle Classification | Camille Kamga Team | 2504.08154 | null | |
| 2025-04-10 | The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search | David Ha Team | 2504.08066 | null | |
| 2025-04-10 | VCR-Bench: A Comprehensive Evaluation Framework for Video Chain-of-Thought Reasoning | Feng Zhao Team | 2504.07956 | null | |
| 2025-04-10 | SAMJAM: Zero-Shot Video Scene Graph Generation for Egocentric Kitchen Videos | Yuhao Chen Team | 2504.07867 | null | |
| 2025-04-10 | CollEX – A Multimodal Agentic RAG System Enabling Interactive Exploration of Scientific Collections | Chris Biemann Team | 2504.07643 | null | |
| 2025-04-10 | VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model | Tiancheng Zhao Team | 2504.07615 | link | |
| 2025-04-10 | TokenFocus-VQA: Enhancing Text-to-Image Alignment with Position-Aware Focus and Multi-Perspective Aggregations on LVLMs | Xuezhi Cao Team | 2504.07556 | null | |
| 2025-04-10 | Why We Feel: Breaking Boundaries in Emotional Reasoning with Multimodal Large Language Models | Xian-Sheng Hua Team | 2504.07521 | link | |
| 2025-04-10 | Kimi-VL Technical Report | Ziwei Chen Team | 2504.07491 | link | |
| 2025-04-09 | Perception in Reflection | Vishal M. Patel Team | 2504.07165 | null | |
| 2025-04-09 | Kaleidoscope: In-language Exams for Massively Multilingual Vision Evaluation | Marzieh Fadaee Team | 2504.07072 | null | |
| 2025-04-09 | Are Vision-Language Models Ready for Dietary Assessment? Exploring the Next Frontier in AI-Powered Food Image Recognition | Aythami Morales Team | 2504.06925 | null | |
| 2025-04-09 | MovSAM: A Single-image Moving Object Segmentation Framework Based on Deep Thinking | Hesheng Wang Team | 2504.06863 | null | |
| 2025-04-09 | ZIP: An Efficient Zeroth-order Prompt Tuning for Black-box Vision-Language Models | Namhoon Lee Team | 2504.06838 | null | |
| 2025-04-09 | LVC: A Lightweight Compression Framework for Enhancing VLMs in Long Video Understanding | Bo XU Team | 2504.06835 | null | |
| 2025-04-08 | PromptHMR: Promptable Human Mesh Recovery | Muhammed Kocabas Team | 2504.06397 | null | |
| 2025-04-08 | SemiDAViL: Semi-supervised Domain Adaptation with Vision-Language Guidance for Semantic Segmentation | Zhaozheng Yin Team | 2504.06389 | null | |
| 2025-04-08 | OmniSVG: A Unified Scalable Vector Graphics Generation Model | Yu-Gang Jiang Team | 2504.06263 | null | |
| 2025-04-08 | Latent Multimodal Reconstruction for Misinformation Detection | Panagiotis C. Petrantonakis Team | 2504.06010 | link | |
| 2025-04-08 | Measuring Déjà vu Memorization Efficiently | Kamalika Chaudhuri Team | 2504.05651 | null | |
| 2025-04-08 | A Lightweight Large Vision-language Model for Multimodal Medical Images | Navid Toosy Saidy Team | 2504.05575 | null | |
| 2025-04-10 | ChartQAPro: A More Diverse and Challenging Benchmark for Chart Question Answering | Shafiq Joty Team | 2504.05506 | null | |
| 2025-04-07 | Trust Through Transparency: Explainable Social Navigation for Autonomous Mobile Robots via Vision-Language Models | Aliasghar Arab Team | 2504.05477 | null | |
| 2025-04-07 | Taxonomy-Aware Evaluation of Vision-Language Models | Stella Frank Team | 2504.05457 | null | |
| 2025-04-07 | Probing the Visualization Literacy of Vision Language Models: the Good, the Bad, and the Ugly | Anamaria Crisan Team | 2504.05445 | null | |
| 2025-04-07 | InteractVLM: 3D Interaction Reasoning from 2D Foundational Models | Dimitrios Tzionas Team | 2504.05303 | null | |
| 2025-04-07 | SmolVLM: Redefining small and efficient multimodal models | Thomas Wolf Team | 2504.05299 | null | |
| 2025-04-07 | A Reality Check of Vision-Language Pre-training in Radiology: Have We Progressed Using Text? | Ismail Ben Ayed Team | 2504.05227 | null | |
| 2025-04-07 | Vision-Language Model Predictive Control for Manipulation Planning and Trajectory Generation | Wei Zhang Team | 2504.05225 | null | |
| 2025-04-08 | A Taxonomy of Self-Handover | Katsushi Ikeuchi Team | 2504.04939 | null | |
| 2025-04-07 | SCAM: A Real-World Typographic Robustness Evaluation for Multimodal Foundation Models | Lorenz Hufe Team | 2504.04893 | null | |
| 2025-04-07 | Don’t Lag, RAG: Training-Free Adversarial Detection Using RAG | Ofer Hadar Team | 2504.04858 | null | |
| 2025-04-07 | OCC-MLLM-CoT-Alpha: Towards Multi-stage Occlusion Recognition Based on Large Language Models via 3D-Aware Supervision and Chain-of-Thoughts Guidance | Xinhan Di Team | 2504.04781 | null | |
| 2025-04-07 | Feedback-Enhanced Hallucination-Resistant Vision-Language Model for Real-Time Scene Understanding | Zahir Alsulaimawi Team | 2504.04772 | null | |
| 2025-04-07 | Grounding 3D Object Affordance with Language Instructions, Visual Observations and Interactions | Yue Wang Team | 2504.04744 | null | |
| 2025-04-07 | Enhancing Compositional Reasoning in Vision-Language Models with Synthetic Preference Data | Venkatesh Saligrama Team | 2504.04740 | null | |
| 2025-04-06 | M2IV: Towards Efficient and Fine-grained Multimodal In-Context Learning in Large Vision-Language Models | Ruixiang Tang Team | 2504.04633 | null | |
| 2025-04-06 | Foundation Models for Software Engineering of Cyber-Physical Systems: the Road Ahead | Shaukat Ali Team | 2504.04630 | null | |
| 2025-04-06 | Enhance Then Search: An Augmentation-Search Strategy with Foundation Models for Cross-Domain Few-Shot Object Detection | Xiaomeng Huang Team | 2504.04517 | link | |
| 2025-04-06 | OmniDrive: A Holistic Vision-Language Dataset for Autonomous Driving with Counterfactual Reasoning | Jose M. Alvarez Team | 2504.04348 | null | |
| 2025-04-06 | MedM-VL: What Makes a Good Medical LVLM? | Ji Wu Team | 2504.04323 | null | |
| 2025-04-05 | GROVE: A Generalized Reward for Learning Open-Vocabulary Physical Skill | Siyuan Huang Team | 2504.04191 | null | |
| 2025-04-05 | LATTE: Lightweight Attention-based Traffic Accident Anticipation Engine | Zhenning Li Team | 2504.04103 | null | |
| 2025-04-05 | TARAC: Mitigating Hallucination in LVLMs via Temporal Attention Real-time Accumulative Connection | Xiaohua Xu Team | 2504.04099 | null | |
| 2025-04-04 | VideoComp: Advancing Fine-Grained Compositional and Temporal Alignment in Video-Text Models | Anelia Angelova Team | 2504.03970 | null | |
| 2025-04-04 | Know What You do Not Know: Verbalized Uncertainty Estimation Robustness on Corrupted Images in Vision-Language Models | Matias Valdenegro-Toro Team | 2504.03440 | null | |
| 2025-04-04 | SARLANG-1M: A Benchmark for Vision-Language Modeling in SAR Image Understanding | Naoto Yokoya Team | 2504.03254 | null | |
| 2025-04-04 | Seeing is Believing: Belief-Space Planning with Foundation Models as Uncertainty Estimators | Lawson L. S. Wong Team | 2504.03245 | null | |
| 2025-04-04 | Mamba as a Bridge: Where Vision Foundation Models Meet Vision Language Models for Domain-Generalized Semantic Segmentation | Robby T. Tan Team | 2504.03193 | null | |
| 2025-04-04 | NuScenes-SpatialQA: A Spatial Understanding and Reasoning Benchmark for Vision-Language Models in Autonomous Driving | Zhengzhong Tu Team | 2504.03164 | null | |
| 2025-04-04 | TokenFLEX: Unified VLM Training for Flexible Visual Tokens Inference | Xianpeng Lang Team | 2504.03154 | null | |
| 2025-04-04 | MORAL: A Multimodal Reinforcement Learning Framework for Decision Making in Autonomous Laboratories | Arvind Ramanathan Team | 2504.03153 | null | |
| 2025-04-03 | QID: Efficient Query-Informed ViTs in Data-Scarce Regimes for OCR-free Visual Document Understanding | Bryan Wang Team | 2504.02971 | null | |
| 2025-04-03 | STING-BEE: Towards Vision-Language Model for Real-World X-ray Baggage Security Inspection | Naoufel Werghi Team | 2504.02823 | null | |
| 2025-04-03 | Sparse Autoencoders Learn Monosemantic Features in Vision-Language Models | Zeynep Akata Team | 2504.02821 | null | |
| 2025-04-03 | Systematic Evaluation of Large Vision-Language Models for Surgical Artificial Intelligence | Serena Yeung-Levy Team | 2504.02799 | null | |
| 2025-04-03 | Robot-Led Vision Language Model Wellbeing Assessment of Children | Hatice Gunes Team | 2504.02765 | null | |
| 2025-04-04 | Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme | Pengfei Liu Team | 2504.02587 | null | |
| 2025-04-03 | Multimodal Fusion and Vision-Language Models: A Survey for Robot Vision | Shibiao Xu Team | 2504.02477 | null | |
| 2025-04-03 | Scaling Video-Language Models to 10K Frames via Hierarchical Differential Distillation | Rui Yan Team | 2504.02438 | null | |
| 2025-04-03 | ReuseDroid: A VLM-empowered Android UI Test Migrator Boosted by Active Feedback | Hailong Wang Team | 2504.02357 | null | |
| 2025-04-03 | Large (Vision) Language Models are Unsupervised In-Context Learners | Maria Brbic Team | 2504.02349 | link | |
| 2025-04-03 | Re-thinking Temporal Search for Long-Form Video Understanding | Manling Li Team | 2504.02259 | null | |
| 2025-04-03 | SocialGesture: Delving into Multi-person Gesture Understanding | James M. Rehg Team | 2504.02244 | null | |
| 2025-04-02 | FineLIP: Extending CLIP’s Reach via Fine-Grained Alignment with Longer Text Inputs | Fatima Albreiki Team | 2504.01916 | link | |
| 2025-04-02 | Is Temporal Prompting All We Need For Limited Labeled Action Recognition? | Xiaobo Jin Team | 2504.01890 | null | |
| 2025-04-02 | Prompting Medical Vision-Language Models to Mitigate Diagnosis Bias by Generating Realistic Dermoscopic Images | Abdullah-Al-Zubaer Imran Team | 2504.01838 | link | |
| 2025-04-02 | BlenderGym: Benchmarking Foundational Model Systems for Graphics Editing | Leonidas Guibas Team | 2504.01786 | null | |
| 2025-04-02 | AdPO: Enhancing the Adversarial Robustness of Large Vision-Language Models with Preference Optimization | Linli Xu Team | 2504.01735 | null | |
| 2025-04-02 | Reasoning LLMs for User-Aware Multimodal Conversational Agents | Mohamed Chetouani Team | 2504.01700 | null | |
| 2025-04-02 | CLIP-SLA: Parameter-Efficient CLIP Adaptation for Continuous Sign Language Recognition | Hamzah Luqman Team | 2504.01666 | link | |
| 2025-04-02 | BioAtt: Anatomical Prior Driven Low-Dose CT Denoising | UiHyun Cho Team | 2504.01662 | null | |
| 2025-04-02 | Text Speaks Louder than Vision: ASCII Art Reveals Textual Biases in Vision-Language Models | Ming-Hsuan Yang Team | 2504.01589 | null | |
| 2025-03-25 | Vision-R1: Evolving Human-Free Alignment in Large Vision-Language Models via Vision-Guided Reinforcement Learning | Jinqiao Wang Team | 2503.18013 | link | |
| 2024-12-02 | VARCO-VISION: Expanding Frontiers in Korean Vision-Language Models | Youngjune Kim Team | 2411.19103 | link | |
| 2025-05-19 | Evaluating Vision-Language Models as Evaluators in Path Planning | Ziyu Yao Team | 2411.18711 | null | |
| 2025-03-11 | ScVLM: Enhancing Vision-Language Model for Safety-Critical Event Understanding | Feng Guo Team | 2410.00982 | null | |
| 2024-09-24 | Behavioral Bias of Vision-Language Models: A Behavioral Finance View | Ming-Chang Chiu Team | 2409.15256 | null | |
| 2024-08-01 | Vision-Language Model Based Handwriting Verification | Sargur Srihari Team | 2407.21788 | null | |
| 2025-10-15 | Exploring the Frontier of Vision-Language Models: A Survey of Current Methodologies and Future Directions | Aman Chadha Team | 2404.07214 | null | |
| 2024-05-10 | Multi-Frame, Lightweight & Efficient Vision-Language Models for Question Answering in Autonomous Driving | Mohan Trivedi Team | 2403.19838 | null | |
| 2024-01-17 | Evaluation and Enhancement of Semantic Grounding in Large Vision-Language Models | Jie Yang Team | 2309.04041 | null | |
| 2023-10-13 | Distilling Large Vision-Language Model with Out-of-Distribution Generalizability | Hao Su Team | 2307.03135 | link | |
| 2023-06-16 | LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models | Ping Luo Team | 2306.09265 | null | |
| 2023-11-08 | Investigating the Role of Attribute Context in Vision-Language Models for Object Recognition and Detection | Adriana Kovashka Team | 2303.10093 | null | |
| 2024-04-19 | VLP: A Survey on Vision-Language Pre-training | Bo Xu Team | 2202.09061 | null | |
| 2022-10-07 | Learning to Prompt for Vision-Language Models | Ziwei Liu Team | 2109.01134 | null |
VLA
| Publish Date | Title | Authors | arXiv ID | Code |
|---|---|---|---|---|
| 2025-11-20 | InternData-A1: Pioneering High-Fidelity Synthetic Data for Pre-training Generalist Policy | Jiangmiao Pang Team | 2511.16651 | null |
| 2025-11-20 | VLA-Pruner: Temporal-Aware Dual-Level Visual Token Pruning for Efficient Vision-Language-Action Inference | Bo Zhao Team | 2511.16449 | null |
| 2025-11-20 | FT-NCFM: An Influence-Aware Data Distillation Framework for Efficient VLA Models | Mingsheng Shang Team | 2511.16233 | null |
| 2025-11-20 | When Alignment Fails: Multimodal Adversarial Attacks on Vision-Language-Action Models | Yaochu Jin Team | 2511.16203 | null |
| 2025-11-20 | Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight | Zhijie Deng Team | 2511.16175 | null |
| 2025-11-20 | EvoVLA: Self-Evolving Vision-Language-Action Model | Hao Tang Team | 2511.16166 | null |
| 2025-11-19 | SRPO: Self-Referential Policy Optimization for Vision-Language-Action Models | Xipeng Qiu Team | 2511.15605 | null |
| 2025-11-19 | Zero-Shot Open-Vocabulary Human Motion Grounding with Test-Time Training | Jianfei Yang Team | 2511.15379 | null |
| 2025-11-19 | Look, Zoom, Understand: The Robotic Eyeball for Embodied Perception | Wenzhao Lian Team | 2511.15279 | null |
| 2025-11-19 | Generating Natural-Language Surgical Feedback: From Structured Representation to Domain-Grounded Evaluation | Andrew J. Hung Team | 2511.15159 | null |
| 2025-11-19 | $π^{*}_{0.6}$: a VLA That Learns From Experience | Zhiyuan Zhou Team | 2511.14759 | null |
| 2025-11-18 | NORA-1.5: A Vision-Language-Action Model Trained using World Model- and Action-based Preference Rewards | Soujanya Poria Team | 2511.14659 | link |
| 2025-11-18 | Enhancing End-to-End Autonomous Driving with Risk Semantic Distillation from VLM | Siyuan Cheng Team | 2511.14499 | null |
| 2025-11-18 | Continuous Vision-Language-Action Co-Learning with Semantic-Physical Alignment for Behavioral Cloning | Hongpeng Wang Team | 2511.14396 | link |
| 2025-11-18 | Towards Deploying VLA without Fine-Tuning: Plug-and-Play Inference-Time VLA Policy Steering via Embodied Evolutionary Diffusion | Fei Chen Team | 2511.14178 | null |
| 2025-11-19 | RoboTidy: A 3D Gaussian Splatting Household Tidying Benchmark for Embodied Navigation and Action | Jiayu Chen Team | 2511.14161 | null |
| 2025-11-18 | MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs | Lu Cheng Team | 2511.14159 | null |
| 2025-11-18 | AsyncVLA: Asynchronous Flow Matching for Vision-Language-Action Models | Biqing Qi Team | 2511.14148 | null |
| 2025-11-18 | Multi-view Phase-aware Pedestrian-Vehicle Incident Reasoning Framework with Vision-Language Models | Jidong J. Yang Team | 2511.14120 | null |
| 2025-11-19 | Searching in Space and Time: Unified Memory-Action Loops for Open-World Object Retrieval | Roberto Martín-Martín Team | 2511.14004 | null |
| 2025-11-17 | TiViBench: Benchmarking Think-in-Video Reasoning for Video Generative Models | Ying-Cong Chen Team | 2511.13704 | link |
| 2025-11-17 | Yanyun-3: Enabling Cross-Platform Strategy Game Operation with Vision-Language Models | Yuxiang Sun Team | 2511.12937 | null |
| 2025-11-18 | Uni-Hand: Universal Hand Motion Forecasting in Egocentric Views | Hesheng Wang Team | 2511.12878 | null |
| 2025-11-16 | BridgeEQA: Virtual Embodied Agents for Real Bridge Inspections | Vedhus Hoskere Team | 2511.12676 | null |
| 2025-11-16 | RoboAfford++: A Generative AI-Enhanced Dataset for Multimodal Affordance Learning in Robotic Manipulation and Navigation | Long Chen Team | 2511.12436 | null |
| 2025-11-16 | VLA-R: Vision-Language Action Retrieval toward Open-World End-to-End Autonomous Driving | David Hyunchul Shim Team | 2511.12405 | null |
| 2025-11-15 | AttackVLA: Benchmarking Adversarial and Backdoor Attacks on Vision-Language-Action Models | Yu-Gang Jiang Team | 2511.12149 | null |
| 2025-11-15 | Decoupled Action Head: Confining Task Knowledge to Conditioning Layers | Qi WU Team | 2511.12101 | null |
| 2025-11-18 | Rethinking Progression of Memory State in Robotic Manipulation: An Object-Centric Perspective | Ngan Le Team | 2511.11478 | null |
| 2025-11-14 | EcoAlign: An Economically Rational Framework for Efficient LVLM Alignment | Hongyi Zhang Team | 2511.11301 | null |
| 2025-11-14 | Experiences from Benchmarking Vision-Language-Action Models for Robotic Manipulation | Xi Zheng Team | 2511.11298 | null |
| 2025-11-14 | AdaptPNP: Integrating Prehensile and Non-Prehensile Skills for Adaptive Robotic Manipulation | Lin Shao Team | 2511.11052 | null |
| 2025-11-14 | DEFT-LLM: Disentangled Expert Feature Tuning for Micro-Expression Recognition | Jianqin Yin Team | 2511.10948 | null |
| 2025-11-13 | Towards Blind and Low-Vision Accessibility of Lightweight VLMs and Custom LLM-Evals | Pawan Goyal Team | 2511.10615 | null |
| 2025-11-14 | OmniVGGT: Omni-Modality Driven Visual Geometry Grounded Transformer | Ziwei Liu Team | 2511.10560 | link |
| 2025-11-13 | SemanticVLA: Semantic-Aligned Sparsification and Enhancement for Efficient Robotic Manipulation | Liqiang Nie Team | 2511.10518 | link |
| 2025-11-13 | Facial-R1: Aligning Reasoning and Recognition for Facial Emotion Analysis | Min Cao Team | 2511.10254 | null |
| 2025-11-13 | SUGAR: Learning Skeleton Representation with Visual-Motion Knowledge for Action Recognition | Zitong Yu Team | 2511.10091 | null |
| 2025-11-13 | Phantom Menace: Exploring and Enhancing the Robustness of VLA Models against Physical Sensor Attacks | Wenyuan Xu Team | 2511.10008 | null |
| 2025-11-13 | Audio-VLA: Adding Contact Audio Perception to Vision-Language-Action Model for Robotic Manipulation | Changbo Wang Team | 2511.09958 | null |
| 2025-11-12 | MAP-VLA: Memory-Augmented Prompting for Vision-Language-Action Model in Robotic Manipulation | Ziwei Wang Team | 2511.09516 | null |
| 2025-11-12 | WMPO: World Model-based Policy Optimization for Vision-Language-Action Models | Song Guo Team | 2511.09515 | link |
| 2025-11-12 | Think, Remember, Navigate: Zero-Shot Object-Goal Navigation with VLM-Powered Reasoning | Fatemeh Afghah Team | 2511.08942 | null |
| 2025-11-12 | Lumine: An Open Recipe for Building Generalist Agents in 3D Open Worlds | Guang Shi Team | 2511.08892 | null |
| 2025-11-12 | MirrorLimb: Implementing hand pose acquisition and robot teleoperation based on RealMirror | Tao Shen Team | 2511.08865 | null |
| 2025-11-11 | SONIC: Supersizing Motion Tracking for Natural Humanoid Whole-Body Control | Yuke Zhu Team | 2511.07820 | link |
| 2025-11-11 | ViPRA: Video Prediction for Robot Actions | Deepak Pathak Team | 2511.07732 | link |
| 2025-11-11 | LLM-GROP: Visually Grounded Robot Task and Motion Planning with Large Language Models | Shiqi Zhang Team | 2511.07727 | null |
| 2025-11-10 | How Do VLAs Effectively Inherit from VLMs? | Jiang Bian Team | 2511.06619 | null |
| 2025-11-09 | ExpReS-VLA: Specializing Vision-Language-Action Models Through Experience Replay and Retrieval | Jeff Ichnowski Team | 2511.06202 | null |
| 2025-11-09 | OpenVLN: Open-world aerial Vision-Language Navigation | Yang Cong Team | 2511.06182 | null |
| 2025-11-08 | 10 Open Challenges Steering the Future of Vision-Language-Action Models | David Hsu Team | 2511.05936 | null |
| 2025-11-11 | Causal Tracing of Object Representations in Large Vision Language Models: Mechanistic Interpretability and Hallucination Mitigation | Xiachong Feng Team | 2511.05923 | null |
| 2025-11-07 | Lite VLA: Efficient Vision-Language-Action Control on CPU-Bound Edge Robots | Mrinmoy Sarkar Team | 2511.05642 | null |
| 2025-11-07 | Visual Spatial Tuning | Hengshuang Zhao Team | 2511.05491 | null |
| 2025-11-07 | EveryDayVLA: A Vision-Language-Action Model for Affordable Robotic Manipulation | Samuel Dickerson Team | 2511.05397 | null |
| 2025-11-07 | TwinVLA: Data-Efficient Bimanual Manipulation with Twin Single-Arm Vision-Language-Action Models | Youngwoon Lee Team | 2511.05275 | link |
| 2025-11-06 | Evo-1: Lightweight Vision-Language-Action Model with Preserved Semantic Alignment | Bo Zhao Team | 2511.04555 | link |
| 2025-11-06 | GraSP-VLA: Graph-based Symbolic Action Representation for Long-Horizon Planning with VLA Policies | Cédric Buche Team | 2511.04357 | null |
| 2025-11-04 | TWIST2: Scalable, Portable, and Holistic Humanoid Data Collection System | C. Karen Liu Team | 2511.02832 | link |
| 2025-11-04 | XR-1: Towards Versatile Vision-Language-Action Models via Learning Unified Vision-Motion Representations | Jian Tang Team | 2511.02776 | null |
| 2025-11-01 | iFlyBot-VLA Technical Report | Jia Pan Team | 2511.01914 | null |
| 2025-11-03 | Unified Diffusion VLA: Vision-Language-Action Model via Joint Discrete Denoising Diffusion Process | Haoang Li Team | 2511.01718 | null |
| 2025-11-03 | PixelVLA: Advancing Pixel-level Understanding in Vision-Language-Action Model | Yang Cong Team | 2511.01571 | null |
| 2025-11-03 | RobustVLA: Robustness-Aware Reinforcement Post-Training for Vision-Language-Action Models | Donglin Wang Team | 2511.01331 | null |
| 2025-11-03 | Embodiment Transfer Learning for Vision-Language-Action Models | Yaxin Peng Team | 2511.01224 | null |
| 2025-11-06 | OmniVLA: Physically-Grounded Multimodal VLA with Unified Multi-Sensor Perception for Robotic Manipulation | Lili Qiu Team | 2511.01210 | null |
| 2025-10-31 | End-to-End Dexterous Arm-Hand VLA Policies via Shared Autonomy: VR Teleoperation Augmented by Autonomous Hand VLA Policy for Efficient Data Collection | Zhibin Li Team | 2511.00139 | null |
| 2025-10-30 | Self-Improving Vision-Language-Action Models with Data Generation via Residual RL | Yuke Zhu Team | 2511.00091 | null |
| 2025-10-30 | Alpamayo-R1: Bridging Reasoning and Action Prediction for Generalizable Autonomous Driving in the Long Tail | Marco Pavone Team | 2511.00088 | null |
| 2025-11-04 | Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model | Jinwoo Shin Team | 2510.27607 | null |
| 2025-10-31 | EBT-Policy: Energy Unlocks Emergent Physical Reasoning Capabilities | Luhui Hu Team | 2510.27545 | null |
| 2025-10-30 | RoboOS-NeXT: A Unified Memory-based Framework for Lifelong, Scalable, and Robust Multi-Robot Collaboration | Shanghang Zhang Team | 2510.26536 | null |
| 2025-10-30 | Human-in-the-loop Online Rejection Sampling for Robotic Manipulation | Yansong Tang Team | 2510.26406 | null |
| 2025-10-29 | $π_\texttt{RL}$: Online RL Fine-tuning for Flow-based Vision-Language-Action Models | Chao Yu Team | 2510.25889 | null |
| 2025-10-29 | Robotic Assistant: Completing Collaborative Tasks with Dexterous Vision-Language-Action Models | Robert Katzschmann Team | 2510.25713 | null |
| 2025-10-29 | Don’t Blind Your VLA: Aligning Visual Representations for OOD Generalization | Aleksandr I. Panov Team | 2510.25616 | null |
| 2025-10-29 | NanoVLA: Routing Decoupled Vision-Language Understanding for Nano-sized Generalist Robotic Policies | Jinghui Lu Team | 2510.25122 | null |
| 2025-10-27 | A Survey on Efficient Vision-Language-Action Models | Heng Tao Shen Team | 2510.24795 | null |
| 2025-10-28 | BLM$_1$: A Boundless Large Model for Cross-Space, Cross-Task, and Cross-Embodiment Learning | Heng Tao Shen Team | 2510.24161 | null |
| 2025-11-01 | RoboOmni: Proactive Robot Manipulation in Omni-modal Context | Xipeng Qiu Team | 2510.23763 | null |
| 2025-10-27 | UrbanVLA: A Vision-Language-Action Model for Urban Micromobility | He Wang Team | 2510.23576 | null |
| 2025-10-27 | Dexbotic: Open-Source Vision-Language-Action Toolbox | Ziyu Zhang Team | 2510.23511 | link |
| 2025-10-28 | Evaluation of Vision-LLMs in Surveillance Video | Jelte P. Mense Team | 2510.23190 | null |
| 2025-10-25 | ACG: Action Coherence Guidance for Flow-based VLA models | Jaegul Choo Team | 2510.22201 | null |
| 2025-10-23 | Butter-Bench: Evaluating LLM Controlled Robots for Practical Intelligence | Elias Aronsson Team | 2510.21860 | null |
| 2025-10-21 | VITA-E: Natural Embodied Interaction with Concurrent Seeing, Hearing, Speaking, and Acting | Ran He Team | 2510.21817 | link |
| 2025-10-24 | Scalable Vision-Language-Action Model Pretraining for Robotic Manipulation with Real-Life Human Activity Videos | Baining Guo Team | 2510.21571 | link |
| 2025-10-23 | SutureBot: A Precision Framework & Benchmark For Autonomous End-to-End Suturing | Axel Krieger Team | 2510.20965 | null |
| 2025-10-23 | VAMOS: A Hierarchical Vision-Language-Action Model for Capability-Modulated and Steerable Navigation | Abhishek Gupta Team | 2510.20818 | null |
| 2025-10-23 | MemER: Scaling Up Memory for Robot Control via Experience Retrieval | Chelsea Finn Team | 2510.20328 | link |
| 2025-10-22 | Learning Affordances at Inference-Time for Vision-Language-Action Models | Sergey Levine Team | 2510.19752 | null |
| 2025-10-22 | GigaBrain-0: A World Model-Powered Vision-Language-Action Model | Zheng Zhu Team | 2510.19430 | link |
| 2025-10-22 | Seeing Across Views: Benchmarking Spatial Reasoning of Vision-Language Models in Robotic Scenes | Baining Guo Team | 2510.19400 | link |
| 2025-10-23 | MoTVLA: A Vision-Language-Action Model with Unified Fast-Slow Reasoning | Heng Yang Team | 2510.18337 | null |
| 2025-10-24 | RESample: A Robust Data Augmentation Framework via Exploratory Sampling for Robotic Manipulation | Ziwei Wang Team | 2510.17640 | null |
| 2025-10-20 | From Spatial to Actions: Grounding Vision-Language-Action Model in Spatial Foundation Priors | Pan Zhou Team | 2510.17439 | link |
| 2025-10-20 | Bridging Embodiment Gaps: Deploying Vision-Language-Action Models on Soft Robots | Josie Hughes Team | 2510.17369 | null |
| 2025-10-21 | DiffVLA++: Bridging Cognitive Reasoning and End-to-End Driving through Metric-Guided Alignment | Wang Jijun Team | 2510.17148 | null |
| 2025-10-23 | Efficient Vision-Language-Action Models for Embodied Manipulation: A Systematic Survey | Jian Cheng Team | 2510.17111 | null |
| 2025-10-18 | MoS-VLA: A Vision-Language-Action Model with One-Shot Skill Adaptation | Ufuk Topcu Team | 2510.16617 | null |
| 2025-10-18 | Do What You Say: Steering Vision-Language-Action Models via Runtime Reasoning-Action Alignment Verification | Claudia Pérez-D’Arpino Team | 2510.16281 | null |
| 2025-10-21 | NEBULA: Do We Evaluate Vision-Language-Action Agents Correctly? | Yu Yin Team | 2510.16263 | link |
| 2025-10-17 | Cosmos-Surg-dVRK: World Foundation Model-based Automated Online Evaluation of Surgical Robot Policy Learning | Sean Huver Team | 2510.16240 | null |
| 2025-10-17 | VDRive: Leveraging Reinforced VLA and Diffusion Policy for End-to-end Autonomous Driving | Zufeng Zhang Team | 2510.15446 | null |
| 2025-10-16 | RDD: Retrieval-Based Demonstration Decomposer for Planner Alignment in Long-Horizon Tasks | Jiachen Li Team | 2510.14968 | null |
| 2025-10-17 | From Language to Locomotion: Retargeting-free Humanoid Control via Motion Latent Guidance | Chang Xu Team | 2510.14952 | null |
| 2025-10-16 | VLA^2: Empowering Vision-Language-Action Models with an Agentic Framework for Unseen Concept Manipulation | Donglin Wang Team | 2510.14902 | null |
| 2025-10-16 | QDepth-VLA: Quantized Depth Prediction as Auxiliary Supervision for Vision-Language-Action Models | Haoran Li Team | 2510.14836 | null |
| 2025-10-16 | Expertise need not monopolize: Action-Specialized Mixture of Experts for Vision-Language-Action Learning | Yao Mu Team | 2510.14300 | null |
| 2025-10-15 | InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy | Yangkun Zhu Team | 2510.13778 | null |
| 2025-10-15 | LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models | Xipeng Qiu Team | 2510.13626 | null |
| 2025-10-15 | DepthVLA: Enhancing Vision-Language-Action Models with Depth-Aware Spatial Reasoning | Hang Zhao Team | 2510.13375 | null |
| 2025-10-15 | Model-agnostic Adversarial Attack and Defense for Vision-Language-Action Models | Jingfeng Zhang Team | 2510.13237 | null |
| 2025-10-15 | VLA-0: Building State-of-the-Art VLAs with Zero Modification | Fabio Ramos Team | 2510.13054 | null |
| 2025-10-14 | DriveVLA-W0: World Models Amplify Data Scaling Law in Autonomous Driving | Zhaoxiang Zhang Team | 2510.12796 | null |
| 2025-10-14 | Reflection-Based Task Adaptation for Self-Improving VLA | Hongbin Zha Team | 2510.12710 | null |
| 2025-10-17 | Spatial Forcing: Implicit Spatial Representation Alignment for Vision-language-action Model | Haoang Li Team | 2510.12276 | null |
| 2025-10-14 | ManiAgent: An Agentic Framework for General Robotic Manipulation | Xudong Liu Team | 2510.11660 | null |
| 2025-10-13 | Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning | Zhi Hou Team | 2510.11027 | null |
| 2025-10-14 | RoVer: Robot Reward Model as Test-Time Verifier for Vision-Language-Action Model | Xinyu Wu Team | 2510.10975 | null |
| 2025-10-13 | TabVLA: Targeted Backdoor Attacks on Vision-Language-Action Models | Yu-Gang Jiang Team | 2510.10932 | null |
| 2025-10-11 | X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model | Xianyuan Zhan Team | 2510.10274 | null |
| 2025-10-11 | Dejavu: Post-Deployment Learning for Embodied Agents via Experience Feedback | Hongtao Lu Team | 2510.10181 | null |
| 2025-10-11 | Reinforcement Fine-Tuning of Flow-Matching Policies for Vision-Language-Action Models | Yi Zeng Team | 2510.09976 | null |
| 2025-10-08 | OmniSAT: Compact Action Token, Faster Auto Regression | Changsheng Xu Team | 2510.09667 | null |
| 2025-10-10 | VITA-VLA: Efficiently Teaching Vision-Language Models to Act via Action Expert Distillation | Caifeng Shan Team | 2510.09607 | link |
| 2025-10-10 | PhysToolBench: Benchmarking Physical Tool Understanding for MLLMs | Ying-Cong Chen Team | 2510.09507 | null |
| 2025-10-10 | Goal-oriented Backdoor Attack against Vision-Language-Action Models via Physical Objects | Jingfeng Zhang Team | 2510.09269 | null |
| 2025-10-09 | Don’t Run with Scissors: Pruning Breaks VLA Models but They Can Be Recovered | Shayegan Omidshafiei Team | 2510.08464 | null |
| 2025-10-15 | USIM and U0: A Vision-Language-Action Dataset and Model for General Underwater Robots | Zhengxing Wu Team | 2510.07869 | link |
| 2025-10-09 | IntentionVLA: Generalizable and Efficient Embodied Intention Reasoning for Human-Robot Interaction | Liqiang Nie Team | 2510.07778 | null |
| 2025-10-09 | DEAS: DEtached value learning with Action Sequence for Scalable Offline RL | Yuke Zhu Team | 2510.07730 | link |
| 2025-10-08 | TrackVLA++: Unleashing Reasoning and Memory Capabilities in VLA Models for Embodied Visual Tracking | He Wang Team | 2510.07134 | link |
| 2025-10-08 | Vision-Language-Action Models for Robotics: A Review Towards Real-World Applications | Yuke Zhu Team | 2510.07077 | link |
| 2025-10-08 | Bring the Apple, Not the Sofa: Impact of Irrelevant Context in Embodied AI Commands on VLA Models | Elena Tutubalina Team | 2510.07067 | null |
| 2025-10-08 | RLinf-VLA: A Unified and Efficient Framework for VLA+RL Training | Yu Wang Team | 2510.06710 | link |
| 2025-10-07 | EmbodiedCoder: Parameterized Embodied Mobile Manipulation via Modern Coding Model | Zhaoxiang Zhang Team | 2510.06207 | link |
| 2025-10-07 | Verifier-free Test-Time Sampling for Vision Language Action Models | Jinwoo Shin Team | 2510.05681 | null |
| 2025-10-07 | MetaVLA: Unified Meta Co-training For Efficient Embodied Adaption | Marios Savvides Team | 2510.05580 | null |
| 2025-10-06 | HyperVLA: Efficient Inference in Vision-Language-Action Models via Hypernetworks | Shimon Whiteson Team | 2510.04898 | null |
| 2025-10-05 | ContextVLA: Vision-Language-Action Model with Amortized Multi-Frame Context | Jinwoo Shin Team | 2510.04246 | link |
| 2025-10-05 | SITCOM: Scaling Inference-Time COMpute for VLAs | Esha Pahwa Team | 2510.04041 | null |
| 2025-10-04 | Bridge Thinking and Acting: Unleashing Physical Potential of VLM with Generalizable Action Expert | Chunhua Shen Team | 2510.03896 | null |
| 2025-10-04 | NoTVLA: Narrowing of Dense Action Trajectories for Generalizable Robot Manipulation | Chunhua Shen Team | 2510.03895 | null |
| 2025-10-04 | LIBERO-PRO: Towards Robust and Fair Evaluation of Vision-Language-Action Models Beyond Memorization | Lichao Sun Team | 2510.03827 | null |
| 2025-10-02 | Gemini Robotics 1.5: Pushing the Frontier of Generalist Robots with Advanced Embodied Reasoning, Thinking, and Motion Transfer | Yuxiang Zhou Team | 2510.03342 | null |
| 2025-10-03 | MM-Nav: Multi-View VLA Model for Robust Visual Navigation via Multi-Expert Learning | He Wang Team | 2510.03142 | link |
| 2025-10-02 | Contrastive Representation Regularization for Vision-Language-Action Models | Jinwoo Shin Team | 2510.01711 | null |
| 2025-10-02 | FailSafe: Reasoning and Recovery from Failures in Vision-Language-Action Models | Bihan Wen Team | 2510.01642 | link |
| 2025-10-02 | VLA-R1: Enhancing Reasoning in Vision-Language-Action Models | Zheng Zhu Team | 2510.01623 | null |
| 2025-10-01 | INSIGHT: INference-time Sequence Introspection for Generating Help Triggers in Vision-Language-Action Models | Tesca Fitzgerald Team | 2510.01389 | null |
| 2025-10-01 | Compose Your Policies! Improving Diffusion-based or Flow-based Robot Policies via Test-time Distribution-level Composition | Andrew F. Luo Team | 2510.01068 | link |
| 2025-10-02 | HAMLET: Switch your Vision-Language-Action Model into a History-Aware Policy | Jinwoo Shin Team | 2510.00695 | link |
| 2025-10-01 | Hybrid Training for Vision-Language-Action Models | Daniel Dijkman Team | 2510.00600 | null |
| 2025-10-01 | VLA-RFT: Vision-Language-Action Reinforcement Fine-tuning with Verified Rewards in World Simulators | Weihua Su Team | 2510.00406 | null |
| 2025-09-30 | MLA: A Multisensory Language-Action Model for Multimodal Understanding and Forecasting in Robotic Manipulation | Shanghang Zhang Team | 2509.26642 | null |
| 2025-09-30 | Seeing Space and Motion: Enhancing Latent Actions with Spatial and Dynamic Awareness for VLA | Ruqi Huang Team | 2509.26251 | null |
| 2025-09-30 | MUVLA: Learning to Explore Object Navigation via Map Understanding | Jianye Hao Team | 2509.25966 | null |
| 2025-09-30 | TacRefineNet: Tactile-Only Grasp Refinement Between Arbitrary In-Hand Object Poses | Yangwei You Team | 2509.25746 | null |
| 2025-09-30 | VLA Model Post-Training via Action-Chunked PPO and Self Behavior Cloning | Zeng-Guang Hou Team | 2509.25718 | null |
| 2025-09-30 | dVLA: Diffusion Vision-Language-Action Model with Multimodal Chain-of-Thought | Yi Xu Team | 2509.25681 | null |
| 2025-09-29 | AIRoA MoMa Dataset: A Large-Scale Hierarchical Dataset for Mobile Manipulation | Tetsuya Ogata Team | 2509.25032 | null |
| 2025-09-29 | World-Env: Leveraging World Model as a Virtual Environment for VLA Post-Training | Qing Zhang Team | 2509.24948 | null |
| 2025-09-29 | IA-VLA: Input Augmentation for Vision-Language-Action models in settings with semantically complex tasks | Ville Kyrki Team | 2509.24768 | null |
| 2025-09-29 | Emergent World Representations in OpenVLA | Omar G. Younis Team | 2509.24559 | null |
| 2025-09-29 | PhysiAgent: An Embodied Agent Framework in Physical World | Xianyuan Zhan Team | 2509.24524 | null |
| 2025-09-28 | AutoPrune: Each Complexity Deserves a Pruning Policy | Zhipeng Zhang Team | 2509.23931 | null |
| 2025-09-28 | Control Your Robot: A Unified System for Robot Control and Policy Deployment | Bingshan Hu Team | 2509.23823 | link |
| 2025-09-28 | Focusing on What Matters: Object-Agent-centric Tokenization for Vision Language Action models | Pietro Mazzaglia Team | 2509.23655 | null |
| 2025-09-27 | Leave No Observation Behind: Real-time Correction for VLA Action Chunks | Yusuke Iwasawa Team | 2509.23224 | null |
| 2025-09-27 | Transferring Vision-Language-Action Models to Industry Applications: Architectures, Performance, and Challenges | Zhibo Pang Team | 2509.23121 | null |
| 2025-09-26 | VLA-Reasoner: Empowering Vision-Language-Action Models with Reasoning via Online Monte Carlo Tree Search | Ziwei Wang Team | 2509.22643 | null |
| 2025-09-26 | UnderwaterVLA: Dual-brain Vision-Language-Action architecture for Autonomous Underwater Navigation | Dixia Fan Team | 2509.22441 | null |
| 2025-09-26 | EMMA: Generalizing Real-World Robot Manipulation via Generative Visual Transfer | Guan Huang Team | 2509.22407 | null |
| 2025-09-29 | MimicDreamer: Aligning Human and Robot Demonstrations for Scalable VLA Training | Xingang Wang Team | 2509.22199 | null |
| 2025-09-26 | Actions as Language: Fine-Tuning VLMs into VLAs Without Catastrophic Forgetting | Anirudha Majumdar Team | 2509.22195 | null |
| 2025-09-26 | Action-aware Dynamic Pruning for Efficient Vision-Language-Action Manipulation | Chang Xu Team | 2509.22093 | null |
| 2025-09-26 | Developing Vision-Language-Action Model from Egocentric Videos | Shinsuke Mori Team | 2509.21986 | null |
| 2025-09-20 | KV-Efficient VLA: A Method of Speed up Vision Language Model with RNN-Gated Chunked KV Cache | Long Zhuang Team | 2509.21354 | null |
| 2025-09-25 | RetoVLA: Reusing Register Tokens for Spatial Reasoning in Vision-Language-Action Models | Andrew Jaeyong Choi Team | 2509.21243 | null |
| 2025-09-24 | Discrete Diffusion for Reflective Vision-Language-Action Models in Autonomous Driving | Xianpeng Lang Team | 2509.20109 | null |
| 2025-09-24 | FreezeVLA: Action-Freezing Attacks against Vision-Language-Action Models | Yu-Gang Jiang Team | 2509.19870 | null |
| 2025-09-24 | Beyond Human Demonstrations: Diffusion-Based Reinforcement Learning to Generate Data for VLA Training | Yi Chen Team | 2509.19752 | null |
| 2025-09-23 | Agentic Scene Policies: Unifying Space, Semantics, and Affordances for Robot Action | Liam Paull Team | 2509.19571 | link |
| 2025-09-23 | OmniVLA: An Omni-Modal Vision-Language-Action Model for Robot Navigation | Sergey Levine Team | 2509.19480 | null |
| 2025-09-25 | Pure Vision Language Action (VLA) Models: A Comprehensive Survey | Qingguo Zhou Team | 2509.19012 | null |
| 2025-09-23 | Eva-VLA: Evaluating Vision-Language-Action Models’ Robustness Under Real-World Physical Variations | Wen Yao Team | 2509.18953 | null |
| 2025-09-22 | Latent Action Pretraining Through World Modeling | Ian Reid Team | 2509.18428 | null |
| 2025-09-18 | VLA-LPAF: Lightweight Perspective-Adaptive Fusion for Vision-Language-Action to Enable More Unconstrained Robotic Manipulation | Anzhou Hou Team | 2509.18183 | null |
| 2025-09-19 | CoReVLA: A Dual-Stage End-to-End Autonomous Driving Framework for Long-Tail Scenarios via Collect-and-Refine | Jian Sun Team | 2509.15968 | null |
| 2025-09-19 | A Vision-Language-Action-Critic Model for Robotic Real-World Reinforcement Learning | Jiangmiao Pang Team | 2509.15937 | null |
| 2025-09-18 | RynnVLA-001: Using Human Demonstrations to Improve Robot Manipulation | Xin Li Team | 2509.15212 | link |
| 2025-09-18 | Robot Control Stack: A Lean Ecosystem for Robot Learning at Scale | Florian Walter Team | 2509.14932 | null |
| 2025-09-18 | CollabVLA: Self-Reflective Vision-Language-Action Model Dreaming Together with Human | Huaping Liu Team | 2509.14889 | null |
| 2025-09-18 | RealMirror: A Comprehensive, Open-Source Vision-Language-Action Platform for Embodied AI | Tao Shen Team | 2509.14687 | null |
| 2025-09-18 | Toward Embodiment Equivariant Vision-Language-Action Policy | Yue Wang Team | 2509.14630 | null |
| 2025-09-17 | CLAW: A Vision-Language-Action Framework for Weight-Aware Robotic Grasping | Lifeng Zhou Team | 2509.14143 | null |
| 2025-09-17 | SeqVLA: Sequential Task Execution for Long-Horizon Manipulation with Completion-Aware Vision-Language-Action Model | Yiming Feng Team | 2509.14138 | null |
| 2025-09-22 | GeoAware-VLA: Implicit Geometry Aware Vision-Language-Action Model | Dezhen Song Team | 2509.14117 | null |
| 2025-09-17 | Dual-Actor Fine-Tuning of VLA Models: A Talk-and-Tweak Human-in-the-Loop Approach | Yangwei You Team | 2509.13774 | null |
| 2025-09-17 | AdaThinkDrive: Adaptive Thinking via Reinforcement Learning for Autonomous Driving | Zhi-xin Yang Team | 2509.13769 | null |
| 2025-09-13 | OpenHA: A Series of Open-Source Hierarchical Agentic Models in Minecraft | Yitao Liang Team | 2509.13347 | null |
| 2025-09-21 | The Better You Learn, The Smarter You Prune: Towards Efficient Vision-language-action Models via Differentiable Token Pruning | Xianpeng Lang Team | 2509.12594 | link |
| 2025-09-17 | TrajBooster: Boosting Humanoid Whole-Body Manipulation via Trajectory-Centric Learning | Donglin Wang Team | 2509.11839 | null |
| 2025-09-15 | Cross-Platform Scaling of Vision-Language-Action Models from Edge to Cloud GPUs | Yanzhi Wang Team | 2509.11480 | null |
| 2025-09-17 | Enhancing Generalization in Vision-Language-Action Models by Preserving Pretrained Representations | Xuanlin Li Team | 2509.11417 | link |
| 2025-09-11 | SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning | Ning Ding Team | 2509.09674 | null |
| 2025-09-22 | VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model | Donglin Wang Team | 2509.09372 | link |
| 2025-09-11 | SQAP-VLA: A Synergistic Quantization-Aware Pruning Framework for High-Performance Vision-Language-Action Models | Huanrui Yang Team | 2509.09090 | null |
| 2025-09-10 | RoboChemist: Long-Horizon and Safety-Compliant Robotic Chemical Experimentation | Hao Zhao Team | 2509.08820 | link |
| 2025-09-09 | TA-VLA: Elucidating the Design Space of Torque-aware Vision-Language-Action Models | Hao Zhao Team | 2509.07962 | link |
| 2025-09-09 | Graph-Fused Vision-Language-Action for Policy Reasoning in Multi-Arm Robotic Manipulation | Yingbai Hu Team | 2509.07957 | null |
| 2025-09-09 | F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions | Jiangmiao Pang Team | 2509.06951 | link |
| 2025-09-11 | LLaDA-VLA: Vision Language Diffusion Action Models | Xiaoyan Sun Team | 2509.06932 | null |
| 2025-09-08 | CRISP – Compliant ROS2 Controllers for Learning-Based Manipulation Policies and Teleoperation | Angela P. Schöllig Team | 2509.06819 | null |
| 2025-09-09 | Leveraging Vision-Language Large Models for Interpretable Video Action Recognition with Semantic Tokenization | Surasakdi Siripong Team | 2509.05695 | null |
| 2025-09-06 | SpecPrune-VLA: Accelerating Vision-Language-Action Models via Action-Aware Self-Speculative Pruning | Guohao Dai Team | 2509.05614 | null |
| 2025-09-06 | OccVLA: Vision-Language-Action Model with Implicit 3D Occupancy Supervision | Hang Zhao Team | 2509.05578 | null |
| 2025-09-05 | OpenEgo: A Large-Scale Multimodal Egocentric Dataset for Dexterous Manipulation | Yu Xiang Team | 2509.05513 | null |
| 2025-09-05 | FLOWER: Democratizing Generalist Robot Policies with Efficient Vision-Language-Action Flow Policies | Rudolf Lioutikov Team | 2509.04996 | null |
| 2025-09-04 | Balancing Signal and Variance: Adaptive Offline RL Post-Training for VLA Flow Models | Donglin Wang Team | 2509.04063 | null |
| 2025-09-04 | FPC-VLA: A Vision-Language-Action Framework with a Supervisor for Failure Prediction and Correction | Jingtai Liu Team | 2509.04018 | null |
| 2025-09-03 | ANNIE: Be Careful of Your Robots | Yiming Gan Team | 2509.03383 | null |
| 2025-09-05 | Align-Then-stEer: Adapting the Vision-Language Action Models through Unified Latent Guidance | Xuelong Li Team | 2509.02055 | null |
| 2025-09-02 | AutoDrive-R$^2$: Incentivizing Reasoning and Self-Reflection Capacity for VLA Model in Autonomous Driving | Shuo Li Team | 2509.01944 | null |
| 2025-08-31 | OmniReason: A Temporal-Guided Vision-Language-Action Framework for Autonomous Driving | Jun Ma Team | 2509.00789 | null |
| 2025-08-30 | Galaxea Open-World Dataset and G0 Dual-System VLA Model | Hang Zhao Team | 2509.00576 | link |
| 2025-08-30 | Mechanistic interpretability for steering vision-language-action models | Claire Tomlin Team | 2509.00328 | link |
| 2025-09-09 | EmbodiedOneVision: Interleaved Vision-Text-Action Pretraining for General Robot Control | Dong Wang Team | 2508.21112 | null |
| 2025-10-02 | CogVLA: Cognition-Aligned Vision-Language-Action Model via Instruction-Driven Routing & Sparsification | Liqiang Nie Team | 2508.21046 | link |
| 2025-08-27 | Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies | Ping Luo Team | 2508.20072 | null |
| 2025-08-28 | Long-VLA: Unleashing Long-Horizon Capability of Vision Language Action Model for Robot Manipulation | Donglin Wang Team | 2508.19958 | link |
| 2025-08-28 | Ego-centric Predictive Model Conditioned on Hand Trajectories | Mike Zheng Shou Team | 2508.19852 | null |
| 2025-08-15 | TTF-VLA: Temporal Token Fusion via Pixel-Attention Integration for Vision-Language-Action Models | Huiling Duan Team | 2508.19257 | null |
| 2025-08-26 | MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation | Gao Huang Team | 2508.19236 | link |
| 2025-08-26 | FlowVLA: Thinking in Motion with a Visual Chain of Thought | Haoang Li Team | 2508.18269 | null |
| 2025-09-06 | 4D Visual Pre-training for Robot Learning | Huazhe Xu Team | 2508.17230 | null |
| 2025-08-23 | NinA: Normalizing Flows in Action. Training VLA Models with Normalizing Flows | Vladislav Kurenkov Team | 2508.16845 | null |
| 2025-08-22 | Do What? Teaching Vision-Language-Action Models to Reject the Impossible | David M. Chan Team | 2508.16292 | null |
| 2025-11-13 | Survey of Vision-Language-Action Models for Embodied Manipulation | Dongbin Zhao Team | 2508.15201 | null |
| 2025-08-19 | CAST: Counterfactual Labels Improve Instruction Following in Vision-Language-Action Models | Sergey Levine Team | 2508.13446 | null |
| 2025-08-18 | Grounding Actions in Camera Space: Observation-Centric Vision-Language-Action Policy | Zhi Hou Team | 2508.13103 | null |
| 2025-09-01 | Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey | Liqiang Nie Team | 2508.13073 | link |
| 2025-08-17 | Improving Pre-Trained Vision-Language-Action Policies with Model-Based Search | Glen Berseth Team | 2508.12211 | null |
| 2025-08-16 | Toward General Physical Intelligence for Resilient Agile Manufacturing Automation | Sunny Katyara Team | 2508.11960 | null |
| 2025-08-14 | CorrectNav: Self-Correction Flywheel Empowers Vision-Language-Action Navigation Model | Hao Dong Team | 2508.10416 | null |
| 2025-08-14 | Large Model Empowered Embodied AI: A Survey on Decision-Making and Embodied Learning | Ping Kuang Team | 2508.10399 | null |
| 2025-08-14 | ReconVLA: Reconstructive Vision-Language-Action Model as Effective Robot Perceiver | Haoang Li Team | 2508.10333 | null |
| 2025-08-13 | GeoVLA: Empowering 3D Representations in Vision-Language-Action Models | Jiale Cao Team | 2508.09071 | link |
| 2025-08-12 | Spatial Traces: Enhancing VLA Models with Spatial-Temporal Understanding | Aleksandr I. Panov Team | 2508.09032 | null |
| 2025-08-22 | OmniVTLA: Vision-Tactile-Language-Action Model with Semantic-Aligned Tactile Sensing | Hengdi Zhang Team | 2508.08706 | link |
| 2025-08-14 | Reinforcement Learning in Vision: A Survey | Mike Zheng Shou Team | 2508.08189 | null |
| 2025-08-12 | MolmoAct: Action Reasoning Models that can Reason in Space | Ranjay Krishna Team | 2508.07917 | link |
| 2025-08-13 | AgentWorld: An Interactive Simulation Platform for Scene Construction and Mobile Robotic Manipulation | Lei Han Team | 2508.07770 | null |
| 2025-08-23 | GraphCoT-VLA: A 3D Spatial-Aware Reasoning Vision-Language-Action Model for Robotic Manipulation with Ambiguous Instructions | Hong Zhang Team | 2508.07650 | null |
| 2025-08-15 | IRL-VLA: Training an Vision-Language-Action Policy via Reward World Model | Li Sun Team | 2508.06571 | null |
| 2025-08-06 | Static and Plugged: Make Embodied Evaluation Simple | Guangtao Zhai Team | 2508.06553 | null |
| 2025-08-06 | A tutorial note on collecting simulated data for vision-language-action models | Jingfeng Zhang Team | 2508.06547 | null |
| 2025-08-07 | Information-Theoretic Graph Fusion with Vision-Language-Action Model for Policy Reasoning and Dual Robotic Control | Hamid Reza Karimi Team | 2508.05342 | null |
| 2025-08-14 | Towards Embodied Agentic AI: Review and Classification of LLM- and VLM-Driven Robot Autonomy and Interaction | Jorge Peña Queralta Team | 2508.05294 | null |
| 2025-08-07 | Learning to See and Act: Task-Aware View Planning for Robotic Manipulation | Liang Lin Team | 2508.05186 | link |
| 2025-08-06 | Perceiving and Acting in First-Person: A Dataset and Benchmark for Egocentric Human-Object-Human Interactions | Xiaokang Yang Team | 2508.04681 | link |
| 2025-08-06 | Following Route Instructions using Large Vision-Language Models: A Comparison between Low-level and Panoramic Action Spaces | Pierre Lison Team | 2508.02917 | null |
| 2025-08-04 | MonoDream: Monocular Vision-Language Navigation with Panoramic Dreaming | Zhaoxin Fan Team | 2508.02549 | null |
| 2025-08-04 | CO-RFT: Efficient Fine-Tuning of Vision-Language-Action Models through Chunked Offline Reinforcement Learning | Chunhe Xia Team | 2508.02219 | null |
| 2025-08-04 | FedVLA: Federated Vision-Language-Action Learning with Dual Gating Mixture-of-Experts for Robotic Manipulation | Xiaodong Wang Team | 2508.02190 | null |
| 2025-08-04 | RICL: Adding In-Context Adaptability to Pre-Trained Vision-Language-Action Models | Insup Lee Team | 2508.02062 | null |
| 2025-07-31 | XRoboToolkit: A Cross-Platform Framework for Robot Teleoperation | Ning Yang Team | 2508.00097 | link |
| 2025-07-31 | villa-X: Enhancing Latent Action Modeling in Vision-Language-Action Models | Jiang Bian Team | 2507.23682 | link |
| 2025-07-31 | A Unified Perception-Language-Action Framework for Adaptive Autonomous Driving | Alois Knoll Team | 2507.23540 | null |
| 2025-08-02 | FastDriveVLA: Efficient End-to-End Driving via Plug-and-Play Reconstruction-based Token Pruning | Shanghang Zhang Team | 2507.23318 | null |
| 2025-07-30 | Spec-VLA: Speculative Decoding for Vision-Language-Action Models with Relaxed Acceptance | Derek F. Wong Team | 2507.22424 | null |
| 2025-07-23 | InstructVLA: Vision-Language-Action Instruction Tuning from Understanding to Manipulation | Jiangmiao Pang Team | 2507.17520 | null |
| 2025-07-23 | ERMV: Editing 4D Robotic Multi-view images to enhance embodied agents | Hesheng Wang Team | 2507.17462 | null |
| 2025-07-23 | Confidence Calibration in Vision-Language-Action Models | Richard Zemel Team | 2507.17383 | null |
| 2025-07-29 | VLA-Touch: Enhancing Vision-Language-Action Models with Dual-Level Tactile Feedback | Harold Soh Team | 2507.17294 | null |
| 2025-07-22 | ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning | Fu-En Yang Team | 2507.16815 | link |
| 2025-07-21 | Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos | Zongqing Lu Team | 2507.15597 | null |
| 2025-07-22 | GR-3 Technical Report | Yichu Yang Team | 2507.15493 | link |
| 2025-07-18 | EdgeVLA: Efficient Vision-Language-Action Models | Benjamin Bolte Team | 2507.14049 | null |
| 2025-07-23 | LaViPlan: Language-Guided Visual Path Planning with RLVR | Hayeon Oh Team | 2507.12911 | null |
| 2025-07-17 | AnyPos: Automated Task-Agnostic Actions for Bimanual Manipulation | Jun Zhu Team | 2507.12768 | null |
| 2025-07-18 | EgoVLA: Learning Vision-Language-Action Models from Egocentric Human Videos | Xiaolong Wang Team | 2507.12440 | link |
| 2025-07-14 | Vision Language Action Models in Robotic Manipulation: A Systematic Review | Irfan Hussain Team | 2507.10672 | null |
| 2025-07-12 | Tactile-VLA: Unlocking Vision-Language-Action Model’s Physical Knowledge for Tactile Generalization | Yang Gao Team | 2507.09160 | null |
| 2025-07-09 | 3D-Generalist: Self-Improving Vision-Language-Action Models for Crafting 3D Worlds | Nick Haber Team | 2507.06484 | link |
| 2025-07-07 | NavigScene: Bridging Local Perception and Global Navigation for Beyond-Visual-Range Autonomous Driving | Cheng Lu Team | 2507.05227 | null |
| 2025-10-06 | VOTE: Vision-Language-Action Optimization with Trajectory Ensemble Voting | Yanzhi Wang Team | 2507.05116 | null |
| 2025-07-17 | DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge | Xin Jin Team | 2507.04447 | null |
| 2025-07-06 | Hijacking JARVIS: Benchmarking Mobile GUI Agents against Unprivileged Third Parties | Yunxin Liu Team | 2507.04227 | null |
| 2025-07-03 | DexVLG: Dexterous Vision-Language-Grasp Model at Scale | He Wang Team | 2507.02747 | null |
| 2025-07-02 | cVLA: Towards Efficient Camera-Space VLAs | Thomas Brox Team | 2507.02190 | null |
| 2025-07-03 | A Survey on Vision-Language-Action Models: An Action Tokenization Perspective | Yaodong Yang Team | 2507.01925 | null |
| 2025-07-02 | MoIRA: Modular Instruction Routing Architecture for Multi-Task Robotics | Nadiya Shvai Team | 2507.01843 | null |
| 2025-07-03 | TriVLA: A Triple-System-Based Unified Vision-Language-Action Model for General Robot Control | Yanwei Fu Team | 2507.01424 | null |
| 2025-07-01 | VQ-VLA: Improving Vision-Language-Action Models via Scaling Vector-Quantized Action Tokenizers | Tong He Team | 2507.01016 | null |
| 2025-07-01 | Evo-0: Vision-Language-Action Model with Implicit Spatial Understanding | Bo Zhao Team | 2507.00416 | null |
| 2025-06-30 | A Survey on Vision-Language-Action Models for Autonomous Driving | Lijun Sun Team | 2506.24044 | null |
| 2025-06-27 | 4D-VLA: Spatiotemporal Vision-Language-Action Pretraining with Cross-Scene Calibration | Li Zhang Team | 2506.22242 | null |
| 2025-08-08 | Can Vision Language Models Understand Mimed Actions? | Jonathan May Team | 2506.21586 | null |
| 2025-06-26 | WorldVLA: Towards Autoregressive Action World Model | Hao Chen Team | 2506.21539 | null |
| 2025-06-26 | Parallels Between VLA Model Post-Training and Human Motor Learning: Progress, Challenges, and Trends | Zeng-Guang Hou Team | 2506.20966 | null |
| 2025-06-25 | Unified Vision-Language-Action Model | Zhaoxiang Zhang Team | 2506.19850 | null |
| 2025-06-24 | CronusVLA: Transferring Latent Motion Across Time for Multi-Frame Prediction in Manipulation | Jiangmiao Pang Team | 2506.19816 | null |
| 2025-07-07 | RoboMonkey: Scaling Test-Time Sampling and Verification for Vision-Language-Action Models | Marco Pavone Team | 2506.17811 | null |
| 2025-06-21 | RLRC: Reinforcement Learning-based Recovery for Compressed Vision-Language-Action Models | Xiao Li Team | 2506.17639 | null |
| 2025-06-21 | VLA-OS: Structuring and Dissecting Planning Representations and Paradigms in Vision-Language-Action Models | Lin Shao Team | 2506.17561 | null |
| 2025-06-19 | CapsDT: Diffusion-Transformer for Capsule Robot Manipulation | Hongliang Ren Team | 2506.16263 | null |
| 2025-06-19 | ControlVLA: Few-shot Object-centric Adaptation for Pre-trained Vision-Language-Action Models | Siyuan Huang Team | 2506.16211 | null |
| 2025-06-19 | ClutterDexGrasp: A Sim-to-Real System for General Dexterous Grasping in Cluttered Scenes | Hao Dong Team | 2506.14317 | null |
| 2025-06-16 | GRaD-Nav++: Vision-Language Model Enabled Visual Drone Navigation with Gaussian Radiance Fields and Differentiable Dynamics | Mac Schwager Team | 2506.14009 | null |
| 2025-06-16 | AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning | Jiaqi Ma Team | 2506.13757 | link |
| 2025-06-19 | LeVERB: Humanoid Whole-Body Control with Latent Vision-Language Instruction | Shankar Sastry Team | 2506.13751 | null |
| 2025-06-16 | CEED-VLA: Consistency Vision-Language-Action Model with Early-Exit Decoding | Haoang Li Team | 2506.13725 | null |
| 2025-06-16 | ROSA: Harnessing Robot States for Vision-Language and Action Alignment | Xiaoyan Sun Team | 2506.13679 | null |
| 2025-06-16 | Block-wise Adaptive Caching for Accelerating Diffusion Policy | Zhi Wang Team | 2506.13456 | null |
| 2025-06-19 | A Comprehensive Survey on Continual Learning in Generative Models | Cheng-Lin Liu Team | 2506.13045 | link |
| 2025-06-19 | SP-VLA: A Joint Model Scheduling and Token Pruning Approach for VLA Model Acceleration | Wenwu Zhu Team | 2506.12723 | null |
| 2025-06-13 | RationalVLA: A Rational Vision-Language-Action Model with Dual System | Haoang Li Team | 2506.10826 | null |
| 2025-06-11 | EfficientVLA: Training-Free Acceleration and Compression for Vision-Language-Action Models | Linfeng Zhang Team | 2506.10100 | null |
| 2025-06-11 | SAFE: Multitask Failure Detection for Vision-Language-Action Models | Florian Shkurti Team | 2506.09937 | null |
| 2025-06-11 | From Intention to Execution: Probing the Generalization Boundaries of Vision-Language-Action Models | Chen Feng Team | 2506.09930 | null |
| 2025-06-17 | An Open-Source Software Toolkit & Benchmark Suite for the Evaluation and Adaptation of Multimodal Action Models | Harshvardhan Sikka Team | 2506.09172 | null |
| 2025-06-10 | FreqPolicy: Efficient Flow-based Visuomotor Policy via Frequency Consistency | Jian Tang Team | 2506.08822 | null |
| 2025-06-10 | Hybrid Reasoning for Perception, Explanation, and Autonomous Action in Manufacturing | Sebastian W. Pattinson Team | 2506.08462 | null |
| 2025-06-11 | TGRPO: Fine-tuning Vision-Language-Action Model via Trajectory-wise Group Relative Policy Optimization | Qi Wang Team | 2506.08440 | null |
| 2025-06-11 | HiBerNAC: Hierarchical Brain-emulated Robotic Neural Agent Collective for Disentangling Complex Manipulation | Cong Wang Team | 2506.08296 | null |
| 2025-06-14 | Agentic Surgical AI: Surgeon Style Fingerprinting and Privacy Risk Quantification via Discrete Diffusion in a Vision-Language-Action Framework | Jason H. Moore Team | 2506.08185 | link |
| 2025-06-09 | BridgeVLA: Input-Output Alignment for Efficient 3D Manipulation Learning with Vision-Language Models | Tieniu Tan Team | 2506.07961 | null |
| 2025-06-09 | Fast ECoT: Efficient Embodied Chain-of-Thought via Thoughts Reuse | Chris Xiaoxuan Lu Team | 2506.07639 | null |
| 2025-06-09 | BitVLA: 1-bit Vision-Language-Action Models for Robotics Manipulation | Xilin Chen Team | 2506.07530 | link |
| 2025-06-09 | Real-Time Execution of Action Chunking Flow Policies | Sergey Levine Team | 2506.07339 | null |
| 2025-06-12 | Robotic Policy Learning via Human-assisted Action Preference Optimization | Di Hu Team | 2506.07127 | null |
| 2025-06-07 | RoboCerebra: A Large-scale Benchmark for Long-horizon Robotic Manipulation Evaluation | Si Liu Team | 2506.06677 | null |
| 2025-06-06 | MapleGrasp: Mask-guided Feature Pooling for Language-driven Efficient Robotic Grasping | Farshad Khorrami Team | 2506.06535 | null |
| 2025-06-06 | DriveAction: A Benchmark for Exploring Human-like Driving Decisions in VLA Models | Xianpeng Lang Team | 2506.05667 | null |
| 2025-06-04 | SwitchVLA: Execution-Aware Task Switching for Vision-Language-Action Models | Jian Tang Team | 2506.03574 | null |
| 2025-06-03 | Adversarial Attacks on Robotic Vision Language Action Models | J. Zico Kolter Team | 2506.03350 | link |
| 2025-06-02 | Fast-in-Slow: A Dual-System Foundation Model Unifying Fast Manipulation within Slow Reasoning | Pheng-Ann Heng Team | 2506.01953 | null |
| 2025-06-02 | SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics | Remi Cadene Team | 2506.01844 | link |
| 2025-06-02 | MLA-Trust: Benchmarking Trustworthiness of Multimodal LLM Agents in GUI Environments | Jun Zhu Team | 2506.01616 | null |
| 2025-06-02 | ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding | Huaxiu Yao Team | 2506.01300 | null |
| 2025-06-01 | OG-VLA: 3D-Aware Vision Language Action Model via Orthographic Image Generation | Valts Blukis Team | 2506.01196 | null |
| 2025-05-31 | LoHoVLA: A Unified Vision-Language-Action Model for Long-Horizon Embodied Tasks | Zhijie Deng Team | 2506.00411 | null |
| 2025-05-30 | Towards a Generalizable Bimanual Foundation Policy via Flow-based Video Prediction | Xuelong Li Team | 2505.24156 | null |
| 2025-05-29 | Impromptu VLA: Open Weights and Open Data for Driving Vision-Language-Action Models | Hao Zhao Team | 2505.23757 | link |
| 2025-05-29 | Knowledge Insulating Vision-Language-Action Models: Train Fast, Run Fast, Generalize Better | Sergey Levine Team | 2505.23705 | null |
| 2025-05-29 | Agentic Robot: A Brain-Inspired Framework for Vision-Language-Action Models in Embodied Agents | Lichao Sun Team | 2505.23450 | null |
| 2025-05-29 | TrackVLA: Embodied Visual Tracking in the Wild | He Wang Team | 2505.23189 | null |
| 2025-05-28 | ForceVLA: Enhancing VLA Models with a Force-aware MoE for Contact-rich Manipulation | Wenqiang Zhang Team | 2505.22159 | null |
| 2025-05-29 | ChatVLA-2: Vision-Language-Action Model with Open-World Embodied Reasoning from Pretrained Knowledge | Yi Xu Team | 2505.21906 | null |
| 2025-05-27 | EaqVLA: Encoding-aligned Quantization for Vision-Language-Action Models | Xiang Chen Team | 2505.21567 | null |
| 2025-06-02 | Hume: Introducing System-2 Thinking in Visual-Language-Action Model | Xuelong Li Team | 2505.21432 | null |
| 2025-05-27 | Think Twice, Act Once: Token-Aware Compression and Action Reuse for Efficient Inference in Vision-Language-Action Models | Tao Chen Team | 2505.21200 | null |
| 2025-05-26 | Embodied AI with Foundation Models for Mobile Service Robots: A Systematic Review | Goldie Nejat Team | 2505.20503 | null |
| 2025-05-26 | What Can RL Bring to VLA Generalization? An Empirical Study | Yu Wang Team | 2505.19789 | null |
| 2025-05-26 | RFTF: Reinforcement Fine-tuning for Embodied Agents with Temporal Feedback | Yongtao Wang Team | 2505.19767 | null |
| 2025-05-25 | ReFineVLA: Reasoning-Aware Teacher-Guided Transfer Fine-Tuning | Minh Nhat Vu Team | 2505.19080 | null |
| 2025-05-24 | Genie Centurion: Accelerating Scalable Real-World Robot Training with Human Rewind-and-Refine Guidance | Maoqing Yao Team | 2505.18793 | null |
| 2025-05-24 | VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning | Ziwei Wang Team | 2505.18719 | link |
| 2025-05-22 | ScanBot: Towards Intelligent Surface Scanning in Embodied Robotic Systems | Farhad Imani Team | 2505.17295 | null |
| 2025-05-22 | Interactive Post-Training for Vision-Language-Action Models | Philipp Krähenbühl Team | 2505.17016 | null |
| 2025-05-22 | Perceptual Quality Assessment for Embodied AI | Guangtao Zhai Team | 2505.16815 | link |
| 2025-05-22 | BadVLA: Towards Backdoor Attacks on Vision-Language-Action Models via Objective-Decoupled Optimization | Lichao Sun Team | 2505.16640 | null |
| 2025-05-22 | DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving | Junchi Yan Team | 2505.16278 | null |
| 2025-05-21 | From Grounding to Manipulation: Case Studies of Foundation Model Integration in Embodied Robotic Systems | Soujanya Poria Team | 2505.15685 | link |
| 2025-05-24 | Exploring the Limits of Vision-Language-Action Manipulations in Cross-task Generalization | Junwei Liang Team | 2505.15660 | link |
| 2025-05-21 | FLARE: Robot Learning with Implicit World Modeling | Linxi Fan Team | 2505.15659 | null |
| 2025-05-21 | Saliency-Aware Quantized Imitation Learning for Efficient Robotic Control | Jungwook Choi Team | 2505.15304 | null |
| 2025-05-21 | EndoVLA: Dual-Phase Vision-Language-Action Model for Autonomous Tracking in Endoscopy | Hongliang Ren Team | 2505.15206 | null |
| 2025-05-21 | Object-Focus Actor for Data-efficient Robot Generalization Dexterous Manipulation | Xiaodong He Team | 2505.15098 | null |
| 2025-05-20 | AutoBio: A Simulation and Benchmark for Robotic Automation in Digital Biology Laboratory | Ping Luo Team | 2505.14030 | null |
| 2025-05-22 | InSpire: Vision-Language-Action Models with Intrinsic Spatial Reasoning | Jingkuan Song Team | 2505.13888 | link |
| 2025-05-25 | RoboFAC: A Comprehensive Framework for Robotic Failure Analysis and Correction | Bo Zhao Team | 2505.12224 | null |
| 2025-05-17 | OneTwoVLA: A Unified Vision-Language-Action Model with Adaptive Reasoning | Yang Gao Team | 2505.11917 | null |
| 2025-05-16 | Unveiling the Potential of Vision-Language-Action Models with Open-Ended Multimodal Instructions | Donglin Wang Team | 2505.11214 | null |
| 2025-05-16 | Conditioning Matters: Training Diffusion Policies is Faster Than You Think | Jianye Hao Team | 2505.11123 | null |
| 2025-05-14 | Real2Render2Real: Scaling Robot Data Without Dynamics Simulation or Robot Hardware | Ken Goldberg Team | 2505.09601 | null |
| 2025-05-14 | RT-cache: Efficient Robot Trajectory Retrieval System | Amir Barati Farimani Team | 2505.09040 | null |
| 2025-05-13 | From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation | Jianye Hao Team | 2505.08548 | null |
| 2025-05-17 | Training Strategies for Efficient Embodied Reasoning | Sergey Levine Team | 2505.08243 | null |
| 2025-05-12 | Pixel Motion as Universal Representation for Robot Control | Michael S Ryoo Team | 2505.07817 | null |
| 2025-05-12 | ReinboT: Amplifying Robot Visual-Language Manipulation with Reinforcement Learning | Donglin Wang Team | 2505.07395 | null |
| 2025-05-15 | UniVLA: Learning to Act Anywhere with Task-centric Latent Actions | Hongyang Li Team | 2505.06111 | link |
| 2025-05-09 | 3D CAVLA: Leveraging Depth and 3D Context to Generalize Vision Language Action Models for Unseen Tasks | Farshad Khorrami Team | 2505.05800 | null |
| 2025-05-08 | Benchmarking Vision, Language, & Action Models in Procedurally Generated, Open Ended Action Environments | Harshvardhan Sikka Team | 2505.05540 | link |
| 2025-05-09 | Vision-Language-Action Models: Concepts, Progress, Applications and Challenges | Manoj Karkee Team | 2505.04769 | null |
| 2025-05-06 | OpenHelix: A Short Survey, Empirical Analysis, and Open-Source Dual-System VLA Model for Robotic Manipulation | Donglin Wang Team | 2505.03912 | link |
| 2025-05-16 | Task Reconstruction and Extrapolation for $π_0$ using Text Latent | Quanyi Li Team | 2505.03500 | null |
| 2025-05-06 | GraspVLA: a Grasping Foundation Model Pre-trained on Billion-scale Synthetic Action Data | He Wang Team | 2505.03233 | null |
| 2025-05-06 | Automated Data Curation Using GPS & NLP to Generate Instruction-Action Pairs for Autonomous Vehicle Vision-Language Navigation Datasets | Ross Greer Team | 2505.03174 | null |
| 2025-05-04 | CrayonRobo: Object-Centric Prompt-Driven Vision-Language-Action Model for Robotic Manipulation | Hao Dong Team | 2505.02166 | null |
| 2025-05-04 | Interleave-VLA: Enhancing Robot Manipulation with Interleaved Image-Text Instructions | Mingyu Ding Team | 2505.02152 | null |
| 2025-04-28 | NORA: A Small Open-Sourced Generalist Vision Language Action Model for Embodied Tasks | Soujanya Poria Team | 2504.19854 | null |
| 2025-04-22 | $π_{0.5}$ : a Vision-Language-Action Model with Open-World Generalization | Ury Zhilinsky Team | 2504.16054 | null |
| 2025-04-22 | Few-Shot Vision-Language Action-Incremental Policy Learning | Weili Guan Team | 2504.15517 | null |
| 2025-04-18 | GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents | Xiaobo Xia Team | 2504.10458 | null |
| 2025-04-09 | OPAL: Encoding Causal Understanding of Physical Systems for Robot Learning | Tyler Fenstermaker Team | 2504.06538 | null |
| 2025-04-02 | Grounding Multimodal LLMs to Embodied Agents that Ask for Help with Reinforcement Learning | Roozbeh Mottaghi Team | 2504.00907 | null |
| 2025-03-30 | OpenDriveVLA: Towards End-to-end Autonomous Driving with Large Vision Language Action Model | Alois C. Knoll Team | 2503.23463 | link |
| 2025-03-27 | CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models | Tsung-Yi Lin Team | 2503.22020 | null |
| 2025-04-14 | MoLe-VLA: Dynamic Layer-skipping Vision Language Action Model via Mixture-of-Layers for Efficient Robot Manipulation | Shanghang Zhang Team | 2503.20384 | null |
| 2025-03-25 | Gemini Robotics: Bringing AI into the Physical World | Yuxiang Zhou Team | 2503.20020 | null |
| 2025-03-25 | Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy | Yuntao Chen Team | 2503.19757 | null |
| 2025-03-25 | DataPlatter: Boosting Robotic Manipulation Generalization with Minimal Costly Data | Lin Ma Team | 2503.19516 | null |
| 2025-03-27 | GR00T N1: An Open Foundation Model for Generalist Humanoid Robots | Yuke Zhu Team | 2503.14734 | null |
| 2025-03-15 | ReBot: Scaling Robot Learning with Real-to-Sim-to-Real Robotic Video Synthesis | Mingyu Ding Team | 2503.14526 | null |
| 2025-03-17 | MoManipVLA: Transferring Vision-language-action Models for General Mobile Manipulation | Haibin Yan Team | 2503.13446 | null |
| 2025-03-17 | HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model | Shanghang Zhang Team | 2503.10631 | null |
| 2025-03-13 | SimLingo: Vision-Only Closed-Loop Autonomous Driving with Language-Action Alignment | Oleg Sinavski Team | 2503.09594 | null |
| 2025-03-12 | CombatVLA: An Efficient Vision-Language-Action Model for Combat Tasks in 3D Action Role-Playing Games | Bo Zheng Team | 2503.09527 | null |
| 2025-03-11 | MoRE: Unlocking Scalability in Reinforcement Learning for Quadruped Vision-Language-Action Models | Zongyuan Ge Team | 2503.08007 | null |
| 2025-03-10 | PointVLA: Injecting the 3D World into Vision-Language-Action Models | Yichen Zhu Team | 2503.07511 | null |
| 2025-03-11 | CLAD: Constrained Latent Action Diffusion for Vision-Language Procedure Planning | Andreas Bulling Team | 2503.06637 | null |
| 2025-03-06 | Refined Policy Distillation: From VLA Generalists to RL Experts | Florian Walter Team | 2503.05833 | null |
| 2025-03-06 | VLA Model-Expert Collaboration for Bi-directional Manipulation Learning | Zeng-Guang Hou Team | 2503.04163 | null |
| 2025-03-26 | OTTER: A Vision-Language-Action Model with Text-Aware Visual Feature Extraction | Pieter Abbeel Team | 2503.03734 | null |
| 2025-03-05 | SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Safe Reinforcement Learning | Yaodong Yang Team | 2503.03480 | null |
| 2025-03-04 | Accelerating Vision-Language-Action Model Integrated with Action Chunking via Parallel Decoding | Haoang Li Team | 2503.02310 | null |
| 2025-03-03 | CognitiveDrone: A VLA Model and Evaluation Benchmark for Real-Time Cognitive Task Solving and Reasoning in UAVs | Dzmitry Tsetserukou Team | 2503.01378 | null |
| 2025-10-15 | CoVLA: Comprehensive Vision-Language-Action Dataset for Autonomous Driving | Issei Yamamoto Team | 2408.10845 | link |
| 2024-07-23 | Can VLMs be used on videos for action recognition? LLMs are Visual Reasoning Coordinators | Harsh Lunia Team | 2407.14834 | null |
| 2024-03-15 | 3D-VLA: A 3D Vision-Language-Action Generative World Model | Chuang Gan Team | 2403.09631 | link |
| 2022-07-19 | Zero-Shot Temporal Action Detection via Vision-Language Prompting | Tao Xiang Team | 2207.08184 | link |
| 2022-06-01 | ADAPT: Vision-Language Navigation with Modality-Aligned Action Prompts | Xiaodan Liang Team | 2205.15509 | null |
| 2022-08-16 | A Dataset for Interactive Vision-Language Navigation with Unknown Command Feasibility | Bryan A. Plummer Team | 2202.02312 | null |
| 2017-04-25 | An Analysis of Action Recognition Datasets for Language and Vision Tasks | Frank Keller Team | 1704.07129 | null |
Humanoid
| Publish Date | Title | Authors | arXiv ID | Code |
|---|---|---|---|---|
| 2025-11-20 | InEKFormer: A Hybrid State Estimator for Humanoid Robots | Frank Kirchner Team | 2511.16306 | null |
| 2025-11-19 | VIRAL: Visual Sim-to-Real at Scale for Humanoid Loco-Manipulation | Yuke Zhu Team | 2511.15200 | link |
| 2025-11-18 | HMC: Learning Heterogeneous Meta-Control for Contact-Rich Loco-Manipulation | Xiaolong Wang Team | 2511.14756 | null |
| 2025-11-15 | Learning Adaptive Neural Teleoperation for Humanoid Robots: From Inverse Kinematics to End-to-End Control | Sanjar Atamuradov Team | 2511.12390 | null |
| 2025-11-14 | Humanoid Whole-Body Badminton via Multi-Stage Reinforcement Learning | Xiaoyu Ren Team | 2511.11218 | null |
| 2025-11-13 | DecARt Leg: Design and Evaluation of a Novel Humanoid Robot Leg with Decoupled Actuation for Agile Locomotion | Roman Gorbachev Team | 2511.10021 | null |
| 2025-11-12 | SPIDER: Scalable Physics-Informed Dexterous Retargeting | Francois Hogan Team | 2511.09484 | link |
| 2025-11-12 | Unveiling the Impact of Data and Model Scaling on High-Level Control for Humanoid Robots | Siheng Chen Team | 2511.09241 | null |
| 2025-11-12 | RGMP: Recurrent Geometric-prior Multimodal Policy for Generalizable Humanoid Robot Manipulation | Miao Li Team | 2511.09141 | null |
| 2025-11-10 | Unified Humanoid Fall-Safety Policy from a Few Demonstrations | Stella X. Yu Team | 2511.07407 | null |
| 2025-11-10 | Human-Level Actuation for Humanoids | MD-Nazmus Sunbeam Team | 2511.06796 | null |
| 2025-11-11 | Towards Adaptive Humanoid Control via Multi-Behavior Distillation and Reinforced Fine-Tuning | Chenjia Bai Team | 2511.06371 | null |
| 2025-11-08 | Towards Human-AI-Robot Collaboration and AI-Agent based Digital Twins for Parkinson’s Disease Management: Review and Outlook | Tareq Y. Al-Naffouri Team | 2511.06036 | null |
| 2025-11-06 | ScheduleStream: Temporal Planning with Samplers for GPU-Accelerated Multi-Arm Task and Motion Planning & Scheduling | Fabio Ramos Team | 2511.04758 | link |
| 2025-11-06 | GentleHumanoid: Learning Upper-body Compliance for Contact-rich Human and Object Interaction | C. Karen Liu Team | 2511.04679 | link |
| 2025-11-06 | BFM-Zero: A Promptable Behavioral Foundation Model for Humanoid Control Using Unsupervised Reinforcement Learning | Guanya Shi Team | 2511.04131 | null |
| 2025-11-06 | Learning Vision-Driven Reactive Soccer Skills for Humanoid Robots | Mingguo Zhao Team | 2511.03996 | link |
| 2025-11-05 | OneOcc: Semantic Occupancy Prediction for Legged Robots with a Single Panoramic Camera | Kaiwei Wang Team | 2511.03571 | link |
| 2025-11-04 | TWIST2: Scalable, Portable, and Holistic Humanoid Data Collection System | C. Karen Liu Team | 2511.02832 | link |
| 2025-11-02 | Heuristic Step Planning for Learning Dynamic Bipedal Locomotion: A Comparative Study of Model-Based and Model-Free Approaches | Roman Gorbachev Team | 2511.00840 | null |
| 2025-10-31 | EgoMI: Learning Active Vision and Whole-Body Manipulation from Egocentric Human Demonstrations | Philipp Wu Team | 2511.00153 | null |
| 2025-10-31 | Towards a Multi-Embodied Grasping Agent | Gerhard Neumann Team | 2510.27420 | null |
| 2025-10-30 | Cooperative Task Spaces for Multi-Arm Manipulation Control based on Similarity Transformations | Sylvain Calinon Team | 2510.26362 | null |
| 2025-11-05 | Thor: Towards Human-Level Whole-Body Reactions for Intense Contact-Rich Environments | Shaqi Luo Team | 2510.26280 | null |
| 2025-11-01 | Beyond the Uncanny Valley: A Mixed-Method Investigation of Anthropomorphism in Protective Responses to Robot Abuse | Renkai Ma Team | 2510.26082 | null |
| 2025-10-28 | A Humanoid Visual-Tactile-Action Dataset for Contact-Rich Manipulation | Kyung-Joong Kim Team | 2510.25725 | null |
| 2025-10-27 | Awakening Facial Emotional Expressions in Human-Robot | Jianwei Zhang Team | 2510.23059 | null |
| 2025-11-05 | Toward Humanoid Brain-Body Co-design: Joint Optimization of Control and Morphology for Fall Recovery | Guiliang Liu Team | 2510.22336 | null |
| 2025-10-21 | SLICE: SLO-Driven Scheduling for LLM Inference on Edge Computing Devices | Yueyue Dai Team | 2510.18544 | null |
| 2025-10-20 | Humanoid Goalkeeper: Learning from Position Conditioned Task-Motion Constraints | Jiangmiao Pang Team | 2510.18002 | null |
| 2025-10-20 | SoftMimic: Learning Compliant Whole-body Control from Examples | Pulkit Agrawal Team | 2510.17792 | link |
| 2025-10-19 | CBF-RL: Safety Filtering Reinforcement Learning in Training with Control Barrier Functions | Aaron D. Ames Team | 2510.14959 | null |
| 2025-10-17 | From Language to Locomotion: Retargeting-free Humanoid Control via Motion Latent Guidance | Chang Xu Team | 2510.14952 | null |
| 2025-10-16 | Towards Adaptable Humanoid Control via Adaptive Motion Tracking | Jiangmiao Pang Team | 2510.14454 | null |
| 2025-10-15 | A Modular Object Detection System for Humanoid Robots Using YOLO | Meng Cheng Lau Team | 2510.13625 | null |
| 2025-10-15 | Development of an Intuitive GUI for Non-Expert Teleoperation of Humanoid Robots | Meng Cheng Lau Team | 2510.13594 | null |
| 2025-10-14 | PolygMap: A Perceptive Locomotion Framework for Humanoid Robot Stair Climbing | Yucong Wu Team | 2510.12346 | null |
| 2025-10-13 | Ego-Vision World Model for Humanoid Contact Planning | Koushil Sreenath Team | 2510.11682 | null |
| 2025-10-13 | Simultaneous Calibration of Noise Covariance and Kinematics for State Estimation of Legged Robots via Bi-level Optimization | Xiaobin Xiong Team | 2510.11539 | null |
| 2025-10-13 | Path and Motion Optimization for Efficient Multi-Location Inspection with Humanoid Robots | Yao Su Team | 2510.11401 | null |
| 2025-10-13 | DemoHLM: From One Demonstration to Generalizable Humanoid Loco-Manipulation | Zongqing Lu Team | 2510.11258 | null |
| 2025-10-13 | PhysHSI: Towards a Real-World Generalizable and Natural Humanoid-Scene Interaction System | Jiangmiao Pang Team | 2510.11072 | link |
| 2025-10-12 | Preference-Conditioned Multi-Objective RL for Integrated Command Tracking and Force Compliance in Humanoid Locomotion | Mingguo Zhao Team | 2510.10851 | null |
| 2025-10-11 | It Takes Two: Learning Interactive Whole-Body Control Between Humanoid Robots | Siheng Chen Team | 2510.10206 | null |
| 2025-10-10 | Enhancing Diffusion Policy with Classifier-Free Guidance for Temporal Robotic Tasks | Zhicheng He Team | 2510.09786 | null |
| 2025-10-09 | Humanoid Everyday: A Comprehensive Robotic Dataset for Open-World Humanoid Manipulation | Yue Wang Team | 2510.08807 | null |
| 2025-10-09 | DexMan: Learning Bimanual Dexterous Manipulation from Human and Generated Videos | Tsung-Wei Ke Team | 2510.08475 | link |
| 2025-10-09 | Reliability of Single-Level Equality-Constrained Inverse Optimal Control | Vincent Bonnet Team | 2510.08406 | null |
| 2025-10-15 | Towards Proprioception-Aware Embodied Planning for Dual-Arm Humanoid Robots | Zongqing Lu Team | 2510.07882 | null |
| 2025-10-10 | DPL: Depth-only Perceptive Humanoid Locomotion via Realistic Depth Synthesis and Cross-Attention Terrain Reconstruction | Qiang Zhang Team | 2510.07152 | null |
| 2025-10-07 | A Co-Design Framework for Energy-Aware Monoped Jumping with Detailed Actuator Modeling | Shishir Kolathaya Team | 2510.05923 | null |
| 2025-10-06 | Walking, Rolling, and Beyond: First-Principles and RL Locomotion on a TARS-Inspired Robot | Abhishek Warrier Team | 2510.05001 | null |
| 2025-10-05 | Stability-Aware Retargeting for Humanoid Multi-Contact Teleoperation | Robert Griffin Team | 2510.04353 | null |
| 2025-10-03 | LapSurgie: Humanoid Robots Performing Surgery via Teleoperated Handheld Laparoscopy | Michael C. Yip Team | 2510.03529 | null |
| 2025-10-03 | Embracing Evolution: A Call for Body-Control Co-Design in Embodied Humanoid Robot | Kui Jia Team | 2510.03081 | null |
| 2025-10-03 | HumanoidExo: Scalable Whole-Body Humanoid Manipulation via Wearable Exoskeleton | Yi Xu Team | 2510.03022 | null |
| 2025-10-02 | Retargeting Matters: General Motion Retargeting for Humanoid Motion Tracking | C. Karen Liu Team | 2510.02252 | null |
| 2025-10-02 | Stand Up, NAO! Increasing the Reliability of Stand-Up Motions Through Error Compensation in Position Control | Tim Laue Team | 2510.02129 | null |
| 2025-10-02 | Like Playing a Video Game: Spatial-Temporal Optimization of Foot Trajectories for Controlled Football Kicking in Bipedal Robots | Peng Lu Team | 2510.01843 | null |
| 2025-09-30 | Learning Human Reaching Optimality Principles from Minimal Observation Inverse Reinforcement Learning | Ludovic Righetti Team | 2510.00329 | null |
| 2025-10-08 | OmniRetarget: Interaction-Preserving Data Generation for Humanoid Whole-Body Loco-Manipulation and Scene Interaction | Guanya Shi Team | 2509.26633 | link |
| 2025-09-30 | ISyHand: A Dexterous Multi-finger Robot Hand with an Articulated Palm | Katherine J. Kuchenbecker Team | 2509.26236 | null |
| 2025-09-30 | Evolutionary Continuous Adaptive RL-Powered Co-Design for Humanoid Chin-Up Performance | Frank Kirchner Team | 2509.26082 | null |
| 2025-10-06 | CoTaP: Compliant Task Pipeline and Reinforcement Learning of Its Controller with Compliance Modulation | Yoshihiko Nakamura Team | 2509.25443 | null |
| 2025-09-29 | Stabilizing Humanoid Robot Trajectory Generation via Physics-Informed Learning and Control-Informed Steering | Daniele Pucci Team | 2509.24697 | null |
| 2025-09-29 | Game Theory to Study Cooperation in Human-Robot Mixed Groups: Exploring the Potential of the Public Good Game | Alessandra Sciutti Team | 2509.24530 | null |
| 2025-09-29 | Preference-Based Long-Horizon Robotic Stacking with Multimodal Large Language Models | Sethu Vijayakumar Team | 2509.24163 | null |
| 2025-09-28 | SIG-Chat: Spatial Intent-Guided Conversational Gesture Generation Involving How, When and Where | Chuanchen Luo Team | 2509.23852 | null |
| 2025-09-25 | SEEC: Stable End-Effector Control with Model-Enhanced Residual Learning for Humanoid Loco-Manipulation | Ye Zhao Team | 2509.21231 | null |
| 2025-09-25 | RuN: Residual Policy for Natural Humanoid Locomotion | Yong Liu Team | 2509.20696 | null |
| 2025-09-24 | Large Pre-Trained Models for Bimanual Manipulation in 3D | David Meger Team | 2509.20579 | null |
| 2025-09-24 | VisualMimic: Visual Humanoid Loco-Manipulation via Motion Tracking and Generation | Jiajun Wu Team | 2509.20322 | link |
| 2025-09-25 | HL-IK: A Lightweight Implementation of Human-Like Inverse Kinematics in Humanoid Arms | Houde Liu Team | 2509.20263 | null |
| 2025-09-23 | Chasing Stability: Humanoid Running via Control Lyapunov Function Guided Reinforcement Learning | Aaron D. Ames Team | 2509.19573 | null |
| 2025-09-23 | RoMoCo: Robotic Motion Control Toolbox for Reduced-Order Model-Based Locomotion on Bipedal and Humanoid Robots | Aaron D. Ames Team | 2509.19545 | null |
| 2025-09-25 | Residual Off-Policy RL for Finetuning Behavior Cloning Policies | Anusha Nagabandi Team | 2509.19301 | link |
| 2025-09-27 | HDMI: Learning Interactive Humanoid Whole-Body Control from Human Videos | Guanya Shi Team | 2509.16757 | null |
| 2025-09-20 | KungfuBot2: Learning Versatile Motion Skills for Humanoid Whole-Body Control | Chenjia Bai Team | 2509.16638 | null |
| 2025-09-19 | A Framework for Optimal Ankle Design of Humanoid Robots | Daniele Pucci Team | 2509.16469 | null |
| 2025-09-19 | A Matter of Height: The Impact of a Robotic Object on Human Compliance | Hadas Erel Team | 2509.16032 | null |
| 2025-09-18 | Implicit Kinodynamic Motion Retargeting for Human-to-humanoid Imitation Learning | Haodong Zhang Team | 2509.15443 | null |
| 2025-09-18 | CAD-Driven Co-Design for Flight-Ready Jet-Powered Humanoids | Daniele Pucci Team | 2509.14935 | null |
| 2025-09-18 | RealMirror: A Comprehensive, Open-Source Vision-Language-Action Platform for Embodied AI | Tao Shen Team | 2509.14687 | null |
| 2025-09-23 | Cybersecurity AI: Humanoid Robots as Attack Vectors | Kevin Finisterre Team | 2509.14139 | null |
| 2025-09-17 | The Cybersecurity of a Humanoid Robot | Víctor Mayoral-Vilches Team | 2509.14096 | null |
| 2025-09-17 | Behavior Foundation Model for Humanoid Robots | Jiangmiao Pang Team | 2509.13780 | null |
| 2025-09-17 | FSR-VLN: Fast and Slow Reasoning for Vision-Language Navigation with Hierarchical Multi-modal Scene Graph | Zhizhong Su Team | 2509.13733 | null |
| 2025-09-16 | Embracing Bulky Objects with Humanoid Robots: Whole-Body Manipulation with Reinforcement Learning | Jun Ma Team | 2509.13534 | null |
| 2025-09-18 | StageACT: Stage-Conditioned Imitation for Robust Humanoid Door Opening | Shayegan Omidshafiei Team | 2509.13200 | null |
| 2025-09-14 | Quantum deep reinforcement learning for humanoid robot navigation task | Ahmed Biyabani Team | 2509.11388 | null |
| 2025-09-16 | FEWT: Improving Humanoid Robot Perception with Frequency-Enhanced Wavelet-based Transformers | Zhigong Song Team | 2509.11109 | null |
| 2025-09-16 | Data-fused Model Predictive Control with Guarantees: Application to Flying Humanoid Robots | Daniele Pucci Team | 2509.10353 | null |
| 2025-09-11 | MimicDroid: In-Context Learning for Humanoid Robot Manipulation from Human Play Videos | Yuke Zhu Team | 2509.09769 | null |
| 2025-09-11 | AGILOped: Agile Open-Source Humanoid Robot for Research | Sven Behnke Team | 2509.09364 | null |
| 2025-09-09 | Attribute-based Object Grounding and Robot Grasp Detection with Spatial Reasoning | Changhyun Choi Team | 2509.08126 | null |
| 2025-09-09 | Interactive Shaping of Granular Media Using Reinforcement Learning | Maren Bennewitz Team | 2509.06469 | null |
| 2025-09-06 | Learning to Walk in Costume: Adversarial Motion Priors for Aesthetically Constrained Humanoids | Dennis W. Hong Team | 2509.05581 | null |
| 2025-09-08 | Hierarchical Reduced-Order Model Predictive Control for Robust Locomotion on Humanoid Robots | Aaron D. Ames Team | 2509.04722 | null |
| 2025-09-03 | The Role of Embodiment in Intuitive Whole-Body Teleoperation for Mobile Manipulation | Georgia Chalvatzaki Team | 2509.03222 | null |
| 2025-09-01 | ManiFlow: A General Robot Manipulation Policy via Consistency Flow Training | Dieter Fox Team | 2509.01819 | null |
| 2025-09-04 | HITTER: A HumanoId Table TEnnis Robot via Hierarchical Planning and Learning | S. Shankar Sastry Team | 2508.21043 | null |
| 2025-09-16 | Traversing the Narrow Path: A Two-Stage Reinforcement Learning Framework for Humanoid Beam Walking | Shiwu Zhang Team | 2508.20661 | link |
| 2025-08-26 | HuBE: Cross-Embodiment Human-like Behavior Execution for Humanoid Robots | Guodong Guo Team | 2508.19002 | null |
| 2025-08-21 | PriorFormer: A Transformer for Real-time Monocular 3D Human Pose Estimation with Versatile Geometric Priors | Vincent Bonnet Team | 2508.18238 | null |
| 2025-09-01 | SoK: Cybersecurity Assessment of Humanoid Ecosystem | Yuval Elovici Team | 2508.17481 | null |
| 2025-08-20 | LookOut: Real-World Humanoid Egocentric Navigation | Leonidas J. Guibas Team | 2508.14466 | null |
| 2025-08-18 | Scaling Whole-body Multi-contact Manipulation with Contact Optimization | Sethu Vijayakumar Team | 2508.12980 | null |
| 2025-08-18 | Foundation Model for Skeleton-Based Human Action Understanding | Liang Wang Team | 2508.12586 | link |
| 2025-08-27 | Robot Trains Robot: Automatic Real-World Policy Adaptation and Learning for Humanoids | Shuran Song Team | 2508.12252 | null |
| 2025-08-17 | Humanoid Motion Scripting with Postural Synergies | Oussama Khatib Team | 2508.12184 | null |
| 2025-08-16 | Contact-Rich and Deformable Foot Modeling for Locomotion Control of the Human Musculoskeletal System | Yanan Sui Team | 2508.11885 | null |
| 2025-08-16 | From Screen to Stage: Kid Cosmo, A Life-Like, Torque-Controlled Humanoid for Entertainment Robotics | Dennis W. Hong Team | 2508.11884 | null |
| 2025-08-15 | Anticipatory and Adaptive Footstep Streaming for Teleoperated Bipedal Robots | Robert Griffin Team | 2508.11802 | null |
| 2025-08-15 | A Comparative Study of Floating-Base Space Parameterizations for Agile Whole-Body Motion Planning | Konstantinos Chatzilygeroudis Team | 2508.11520 | null |
| 2025-08-15 | Learning Differentiable Reachability Maps for Optimization-based Humanoid Motion Generation | Fumio Kanehiro Team | 2508.11275 | null |
| 2025-08-15 | Geometry-Aware Predictive Safety Filters on Humanoids: From Poisson Safety Functions to CBF Constrained MPC | Aaron D. Ames Team | 2508.11129 | null |
| 2025-08-14 | MASH: Cooperative-Heterogeneous Multi-Agent Reinforcement Learning for Single Humanoid Robot Locomotion | Yanjie Li Team | 2508.10423 | null |
| 2025-08-13 | GBC: Generalized Behavior-Cloning Framework for Whole-Body Humanoid Imitation | Jun-Guo Lu Team | 2508.09960 | null |
| 2025-08-11 | PCHands: PCA-based Hand Pose Synergy Representation on Manipulators with N-DoF | Lorenzo Natale Team | 2508.07945 | null |
| 2025-08-11 | End-to-End Humanoid Robot Safe and Comfortable Locomotion Policy | Junwei Liang Team | 2508.07611 | null |
| 2025-08-09 | Learning a Vision-Based Footstep Planner for Hierarchical Walking Control | Michael Posa Team | 2508.06779 | null |
| 2025-08-07 | Examining the legibility of humanoid robot arm movements in a pointing task | Igor Farkaš Team | 2508.05104 | null |
| 2025-08-06 | INTENTION: Inferring Tendencies of Humanoid Robot Motion Through Interactive Intuition and Grounded VLM | Nikos Tsagarakis Team | 2508.04931 | link |
| 2025-08-06 | On the causality between affective impact and coordinated human-robot reactions | Kasper Støy Team | 2508.04834 | null |
| 2025-08-06 | Binaural Sound Event Localization and Detection Neural Network based on HRTF Localization Cues for Humanoid Robots | Gyeong-Tae Lee Team | 2508.04333 | null |
| 2025-08-08 | Would you let a humanoid play storytelling with your child? A usability study on LLM-powered narrative Human-Robot Interaction | Agnieszka Wykowska Team | 2508.02505 | null |
| 2025-08-04 | Towards Immersive Human-X Interaction: A Real-Time Framework for Physically Plausible Motion Synthesis | Jingya Wang Team | 2508.02106 | null |
| 2025-08-02 | Coordinated Humanoid Robot Locomotion with Symmetry Equivariant Reinforcement Learning Policy | Yue Gao Team | 2508.01247 | null |
| 2025-08-01 | A Whole-Body Motion Imitation Framework from Human Data for Full-Size Humanoid Robot | Rong Xiong Team | 2508.00362 | null |
| 2025-08-01 | TOP: Time Optimization Policy for Stable and Accurate Standing Manipulation with Humanoid Robots | Rong Xiong Team | 2508.00355 | null |
| 2025-07-31 | CHILD (Controller for Humanoid Imitation and Live Demonstration): a Whole-Body Humanoid Teleoperation System | Joohyung Kim Team | 2508.00162 | null |
| 2025-07-31 | The Monado SLAM Dataset for Egocentric Visual-Inertial Tracking | Taihú Pire Team | 2508.00088 | null |
| 2025-07-28 | Binaural Sound Event Localization and Detection based on HRTF Cues for Humanoid Robots | Yong-Hwa Park Team | 2507.20530 | null |
| 2025-07-28 | LLMs-guided adaptive compensator: Bringing Adaptivity to Automatic Control Systems with Large Language Models | Yusuke Iwasawa Team | 2507.20509 | null |
| 2025-07-29 | Humanoid Occupancy: Enabling A Generalized Multimodal Occupancy Perception System on Humanoid Robots | Qiang Zhang Team | 2507.20217 | null |
| 2025-07-27 | LRR-Bench: Left, Right or Rotate? Vision-Language models Still Struggle With Spatial Understanding Tasks | Xiaoshuang Shi Team | 2507.20174 | null |
| 2025-07-25 | How Age Influences the Interpretation of Emotional Body Language in Humanoid Robots – long paper version | Giuseppe Palestra Team | 2507.19335 | null |
| 2025-07-24 | Experimental Comparison of Whole-Body Control Formulations for Humanoid Robots in Task Acceleration and Task Force Spaces | Christian Ott Team | 2507.18502 | link |
| 2025-07-22 | Humanoid Robot Whole-body Geometric Calibration with Embedded Sensors and a Single Plane | Florent Lamiraux Team | 2507.16369 | null |
| 2025-07-20 | Integrating Reason-Based Moral Decision-Making in the Reinforcement Learning Architecture | Lisa Dargasz Team | 2507.15895 | null |
| 2025-07-21 | EMP: Executable Motion Prior for Humanoid Robot Standing Upper-body Motion Imitation | Rong Xiong Team | 2507.15649 | null |
| 2025-07-16 | Robot Drummer: Learning Rhythmic Skills for Humanoid Drumming | Loris Roveda Team | 2507.11498 | null |
| 2025-07-15 | From Production Logistics to Smart Manufacturing: The Vision for a New RoboCup Industrial League | Shohei Yasuda Team | 2507.11402 | null |
| 2025-07-14 | Physics-Informed Neural Networks with Unscented Kalman Filter for Sensorless Joint Torque Estimation in Humanoid Robots | Daniele Pucci Team | 2507.10105 | null |
| 2025-07-11 | Learning Robust Motion Skills via Critical Adversarial Attacks for Humanoid Robots | Yue Gao Team | 2507.08303 | null |
| 2025-07-10 | UniTracker: Learning Universal Whole-Body Motion Tracker for Humanoid Robots | Weinan Zhang Team | 2507.07356 | null |
| 2025-07-09 | ULC: A Unified and Fine-Grained Controller for Humanoid Loco-Manipulation | Zongwu Xie Team | 2507.06905 | null |
| 2025-07-08 | Learning to Evaluate Autonomous Behaviour in Human-Robot Interaction | Alessio Del Bue Team | 2507.06404 | null |
| 2025-07-05 | Learning Humanoid Arm Motion via Centroidal Momentum Regularized Multi-Agent Reinforcement Learning | Sangbae Kim Team | 2507.04140 | null |
| 2025-07-01 | HumanoidGen: Data Generation for Bimanual Dexterous Manipulation via LLM Reasoning | Chenjia Bai Team | 2507.00833 | null |
| 2025-06-30 | Mechanical Intelligence-Aware Curriculum Reinforcement Learning for Humanoids with Parallel Actuation | Dennis Hong Team | 2507.00273 | null |
| 2025-07-02 | DexH2R: A Benchmark for Dynamic Dexterous Grasping in Human-to-Robot Handover | Yuexin Ma Team | 2506.23152 | null |
| 2025-06-29 | Learning Motion Skills with Adaptive Assistive Curriculum Force in Humanoid Robots | Yue Gao Team | 2506.23125 | null |
| 2025-07-10 | Hierarchical Vision-Language Planning for Multi-Step Humanoid Manipulation | Navid Azizan Team | 2506.22827 | null |
| 2025-06-20 | Unsupervised Discovery of Behavioral Primitives from Sensorimotor Dynamic Functional Connectivity | Matej Hoffmann Team | 2506.22473 | null |
| 2025-07-14 | Ark: An Open-source Python-based Framework for Robot Learning | Haitham Bou-Ammar Team | 2506.21628 | null |
| 2025-07-18 | A Survey of Behavior Foundation Model: Next-Generation Whole-Body Control System of Humanoid Robots | Wenjun Zeng Team | 2506.20487 | null |
| 2025-06-19 | DualTHOR: A Dual-Arm Humanoid Simulation Platform for Contingency-Aware Planning | Zongqing Lu Team | 2506.16012 | link |
| 2025-06-18 | TACT: Humanoid Whole-body Contact Manipulation through Deep Imitation Learning with Tactile Modality | Eiichi Yoshida Team | 2506.15146 | null |
| 2025-06-18 | Booster Gym: An End-to-End Reinforcement Learning Framework for Humanoid Robot Locomotion | Mingguo Zhao Team | 2506.15132 | link |
| 2025-06-17 | GMT: General Motion Tracking for Humanoid Whole-Body Control | Xiaolong Wang Team | 2506.14770 | null |
| 2025-06-17 | Whole-Body Control Framework for Humanoid Robots with Heavy Limbs: A Model-Based Approach | Yun-Hui Liu Team | 2506.14278 | null |
| 2025-06-15 | KungfuBot: Physics-Based Humanoid Whole-Body Control for Learning Highly-Dynamic Skills | Xuelong Li Team | 2506.12851 | null |
| 2025-06-19 | From Experts to a Generalist: Toward General Whole-Body Control for Humanoid Robots | Zongqing Lu Team | 2506.12779 | null |
| 2025-06-15 | RL from Physical Feedback: Aligning Large Motion Models with Humanoid Control | Zongqing Lu Team | 2506.12769 | null |
| 2025-06-14 | Explosive Output to Enhance Jumping Ability: A Variable Reduction Ratio Design Paradigm for Humanoid Robots Knee Joint | Qiang Huang Team | 2506.12314 | null |
| 2025-06-13 | mimic-one: a Scalable Model Recipe for General Purpose Robot Dexterity | Robert K. Katzschmann Team | 2506.11916 | null |
| 2025-06-11 | Exploring EEG Responses during Observation of Actions Performed by Human Actor and Humanoid Robot | Michelle J. Johnson Team | 2506.10170 | null |
| 2025-06-11 | Locomotion on Constrained Footholds via Layered Architectures and Model Predictive Control | Aaron D. Ames Team | 2506.09979 | null |
| 2025-06-11 | Attention-Based Map Encoding for Learning Generalized Legged Locomotion | Marco Hutter Team | 2506.09588 | null |
| 2025-06-11 | Bipedal Balance Control with Whole-body Musculoskeletal Standing and Falling Simulations | Yanan Sui Team | 2506.09383 | null |
| 2025-06-11 | SkillBlender: Towards Versatile Humanoid Whole-Body Loco-Manipulation via Skill Blending | Yue Wang Team | 2506.09366 | link |
| 2025-06-10 | Fast Estimation of Globally Optimal Independent Contact Regions for Robust Grasping and Manipulation | Nancy S. Pollard Team | 2506.08856 | null |
| 2025-06-12 | MoRE: Mixture of Residual Experts for Humanoid Lifelike Gaits Learning on Complex Terrains | Xuelong Li Team | 2506.08840 | null |
| 2025-06-10 | Periodic Bipedal Gait Learning Using Reward Composition Based on a Novel Gait Planner for Humanoid Robots | Lijun Zhu Team | 2506.08416 | null |
| 2025-06-05 | Realizing Text-Driven Motion Generation on NAO Robot: A Reinforcement Learning-Optimized Control Pipeline | Qijun Chen Team | 2506.05117 | link |
| 2025-06-04 | Phase-based Nonlinear Model Predictive Control for Humanoid Walking Stabilization with Single and Double Support Time Adjustments | Jaeheung Park Team | 2506.03856 | null |
| 2025-06-03 | AURA: Agentic Upskilling via Reinforced Abstractions | Dennis Hong Team | 2506.02507 | null |
| 2025-06-02 | Reinforcement Learning with Data Bootstrapping for Dynamic Subgoal Pursuit in Humanoid Robot Navigation | Ayonga Hereid Team | 2506.02206 | null |
| 2025-06-02 | Learning with pyCub: A New Simulation and Exercise Framework for Humanoid Robotics | Matej Hoffmann Team | 2506.01756 | null |
| 2025-06-05 | Hierarchical Intention-Aware Expressive Motion Generation for Humanoid Robots | Chengxu Zhou Team | 2506.01563 | null |
| 2025-06-01 | Humanoid World Models: Open World Foundation Models for Humanoid Robotics | Mohammad Al-Sharman Team | 2506.01182 | null |
| 2025-06-01 | iRonCub 3: The Jet-Powered Flying Humanoid Robot | Daniele Pucci Team | 2506.01125 | null |
| 2025-05-30 | Learning Aerodynamics for the Control of Flying Humanoid Robots | Daniele Pucci Team | 2506.00305 | null |
| 2025-05-30 | Interactive Imitation Learning for Dexterous Robotic Manipulation: Challenges and Perspectives – A Survey | Rania Rayyes Team | 2506.00098 | null |
| 2025-06-05 | SignBot: Learning Human-to-Humanoid Sign Language Interaction | Guiliang Liu Team | 2505.24266 | null |
| 2025-05-30 | Humanoid Loco-Manipulations Pattern Generation and Stabilization Control | Abderrahmane Kheddar Team | 2505.24116 | null |
| 2025-05-29 | Humanoid Loco-manipulation Planning based on Graph Search and Reachability Maps | Abderrahmane Kheddar Team | 2505.23505 | null |
| 2025-05-29 | Centroidal Trajectory Generation and Stabilization based on Preview Control for Humanoid Multi-contact Motion | Fumio Kanehiro Team | 2505.23499 | link |
| 2025-06-01 | FastTD3: Simple, Fast, and Capable Reinforcement Learning for Humanoid Control | Pieter Abbeel Team | 2505.22642 | null |
| 2025-05-27 | Learning Unified Force and Position Control for Legged Loco-Manipulation | Siyuan Huang Team | 2505.20829 | null |
| 2025-05-27 | Gait-Conditioned Reinforcement Learning with Multi-Phase Curriculum for Humanoid Locomotion | Chengxu Zhou Team | 2505.20619 | null |
| 2025-05-26 | Integrating emotional intelligence, memory architecture, and gestures to achieve empathetic humanoid robot interaction in an educational setting | Paul Craig Team | 2505.19803 | null |
| 2025-05-26 | Extremum Flow Matching for Offline Goal Conditioned Reinforcement Learning | Jean-Baptiste Mouret Team | 2505.19717 | null |
| 2025-05-26 | Whole-body Multi-contact Motion Control for Humanoid Robots Based on Distributed Tactile Sensors | Eiichi Yoshida Team | 2505.19580 | link |
| 2025-05-26 | Heavy lifting tasks via haptic teleoperation of a wheeled humanoid | Joao Ramos Team | 2505.19530 | null |
| 2025-05-26 | SMAP: Self-supervised Motion Adaptation for Physically Plausible Humanoid Whole-body Control | Junting Dong Team | 2505.19463 | null |
| 2025-05-25 | Towards Humanoid Robot Autonomy: A Dynamic Architecture Integrating Continuous Thought Machines (CTM) and Model Context Protocol (MCP) | Libo Wang Team | 2505.19339 | link |
| 2025-05-25 | Staircase Recognition and Location Based on Polarization Vision | Zhiying Tan Team | 2505.19026 | null |
| 2025-05-23 | DanceTogether! Identity-Preserving Multi-Person Interactive Video Generation | Ruqi Huang Team | 2505.18078 | null |
| 2025-08-11 | Unified Multi-Rate Model Predictive Control for a Jet-Powered Humanoid Robot | Daniele Pucci Team | 2505.16478 | null |
| 2025-05-19 | TD-GRPC: Temporal Difference Learning with Group Relative Policy Constraint for Humanoid Locomotion | Minh Nhat Vu Team | 2505.13549 | null |
| 2025-05-19 | DreamGen: Unlocking Generalization in Robot Learning through Neural Trajectories | Linxi Fan Team | 2505.12705 | null |
| 2025-05-19 | Dribble Master: Learning Agile Humanoid Dribbling Through Legged Locomotion | Qi Wu Team | 2505.12679 | null |
| 2025-05-16 | Bracing for Impact: Robust Humanoid Push Recovery and Locomotion with Reduced Order Models | Aaron D. Ames Team | 2505.11495 | null |
| 2025-05-16 | X2C: A Dataset Featuring Nuanced Facial Expressions for Realistic Humanoid Imitation | Xiaohan Yu Team | 2505.11146 | link |
| 2025-05-15 | NavDP: Learning Sim-to-Real Navigation Diffusion Policy with Privileged Information Guidance | Jiangmiao Pang Team | 2505.08712 | null |
| 2025-05-13 | Rethink Repeatable Measures of Robot Performance with Statistical Query | Dylan Khor Team | 2505.08216 | null |
| 2025-05-14 | Neural Brain: A Neuroscience-inspired Framework for Embodied Agents | Lin Wang Team | 2505.07634 | link |
| 2025-05-12 | HuB: Learning Extreme Humanoid Balance | Yang Gao Team | 2505.07294 | null |
| 2025-05-11 | Dynamic Safety in Complex Environments: Synthesizing Safety Filters with Poisson’s Equation | Aaron D. Ames Team | 2505.06794 | null |
| 2025-05-10 | JAEGER: Dual-Level Humanoid Whole-Body Controller | Zongqing Lu Team | 2505.06584 | null |
| 2025-05-09 | Let Humanoids Hike! Integrative Skill Development on Complex Trails | Stella X. Yu Team | 2505.06218 | null |
| 2025-05-09 | Safe-EF: Error Feedback for Nonsmooth Constrained Optimization | Ilyas Fatkhullin Team | 2505.06053 | null |
| 2025-05-09 | Human-Robot Collaboration for the Remote Control of Mobile Humanoid Robots with Torso-Arm Coordination | Zhi Li Team | 2505.05773 | null |
| 2025-05-07 | Vision-Language-Action Models: Concepts, Progress, Applications and Challenges | Manoj Karkee Team | 2505.04769 | null |
| 2025-05-06 | AMO: Adaptive Motion Optimization for Hyper-Dexterous Humanoid Whole-Body Control | Xiaolong Wang Team | 2505.03738 | null |
| 2025-05-13 | Visual Imitation Enables Contextual Humanoid Control | Angjoo Kanazawa Team | 2505.03729 | null |
| 2025-05-05 | TWIST: Teleoperated Whole-Body Imitation System | C. Karen Liu Team | 2505.02833 | null |
| 2025-04-30 | LangWBC: Language-directed Humanoid Whole-Body Control via End-to-end Learning | Koushil Sreenath Team | 2504.21738 | null |
| 2025-04-29 | SoccerDiffusion: Toward Learning End-to-End Humanoid Robot Soccer from Gameplay Recordings | Jianwei Zhang Team | 2504.20808 | null |
| 2025-04-27 | Personalized Artificial General Intelligence (AGI) via Neuroscience-Inspired Continuous Learning Systems | Jairaj Singh Shaktawat Team | 2504.20109 | null |
| 2025-04-24 | Demonstrating Berkeley Humanoid Lite: An Open-source, Accessible, and Customizable 3D-printed Humanoid Robot | Koushil Sreenath Team | 2504.17249 | null |
| 2025-04-20 | ExFace: Expressive Facial Control for Humanoid Robots with Diffusion Transformers and Bootstrap Training | Jiahao Chen Team | 2504.14477 | null |
| 2025-04-19 | Adversarial Locomotion and Motion Imitation for Humanoid Policy Learning | Xuelong Li Team | 2504.14305 | null |
| 2025-04-18 | Robust Humanoid Walking on Compliant and Uneven Terrain with Deep Reinforcement Learning | Fumio Kanehiro Team | 2504.13619 | link |
| 2025-04-16 | EmoACT: a Framework to Embed Emotions into Artificial Agents Based on Affect Control Theory | Carmine Tommaso Recchiuto Team | 2504.12125 | null |
| 2025-04-14 | Teacher Motion Priors: Enhancing Robot Locomotion over Challenging Terrain | Zhengtao Zhang Team | 2504.10390 | null |
| 2025-04-14 | PreCi: Pretraining and Continual Improvement of Humanoid Locomotion via Model-Assumption-Based Regularization | Sehoon Ha Team | 2504.09833 | null |
| 2025-04-13 | Embodied Chain of Action Reasoning with Multi-Modal Foundation Model for Humanoid Loco-manipulation | Yi Fang Team | 2504.09532 | null |
| 2025-04-11 | Spectral Normalization for Lipschitz-Constrained Policies on Learning Humanoid Locomotion | Jaeheung Park Team | 2504.08246 | null |
| 2025-04-07 | MotionPRO: Exploring the Role of Pressure in Human MoCap and Beyond | Xun Cao Team | 2504.05046 | null |
| 2025-04-07 | A High-Force Gripper with Embedded Multimodal Sensing for Powerful and Perception Driven Grasping | Nikos G. Tsagarakis Team | 2504.04970 | null |
| 2025-04-06 | Public speech recognition transcripts as a configuring parameter | Christian Licoppe Team | 2504.04488 | null |
| 2025-04-02 | The Social Life of Industrial Arms: How Arousal and Attention Shape Human-Robot Interaction | Matthew K. X. J. Pan Team | 2504.01260 | null |
| 2025-04-01 | Extended Hybrid Zero Dynamics for Bipedal Walking of the Knee-less Robot SLIDER | Petar Kormushev Team | 2504.01165 | null |
| 2025-04-11 | Learning Bipedal Locomotion on Gear-Driven Humanoid Robot Using Foot-Mounted IMUs | Masaya Kinoshita Team | 2504.00614 | null |
| 2025-03-30 | Exploring GPT-4 for Robotic Agent Strategy with Real-Time State Feedback and a Reactive Behaviour Framework | Ysobel Sims Team | 2503.23601 | null |
| 2025-03-28 | Control of Humanoid Robots with Parallel Mechanisms using Kinematic Actuation Models | Nicolas Mansard Team | 2503.22459 | null |
| 2025-03-28 | FLAM: Foundation Model-Based Body Stabilization for Humanoid Locomotion and Manipulation | Debin Zhao Team | 2503.22249 | null |
| 2025-03-27 | OminiAdapt: Learning Cross-Task Invariance for Robust and Environment-Aware Robotic Manipulation | Wanting Li Team | 2503.21257 | null |
| 2025-03-26 | Anti Robot Speciesism | Miklos Sarvary Team | 2503.20842 | null |
| 2025-03-25 | Can Vision-Language Models Answer Face to Face Questions in the Real-World? | Roland Memisevic Team | 2503.19356 | null |
| 2025-03-19 | StyleLoco: Generative Adversarial Distillation for Natural Humanoid Robot Locomotion | Siyuan Huang Team | 2503.15082 | null |
| 2025-03-27 | GR00T N1: An Open Foundation Model for Generalist Humanoid Robots | Yuke Zhu Team | 2503.14734 | null |
| 2025-03-24 | Humanoid Policy ~ Human Policy | Xiaolong Wang Team | 2503.13441 | null |
| 2025-03-17 | Humanoids in Hospitals: A Technical Study of Humanoid Surrogates for Dexterous Medical Interventions | Michael Yip Team | 2503.12725 | null |
| 2025-03-16 | Being-0: A Humanoid Robotic Agent with Vision-Language Models and Modular Skills | Zongqing Lu Team | 2503.12533 | null |
| 2025-03-14 | Fast and Robust Localization for Humanoid Soccer Robot via Iterative Landmark Matching | Dennis W. Hong Team | 2503.11020 | null |
| 2025-03-13 | NIL: No-data Imitation Learning by Leveraging Pre-trained Video Diffusion Models | Michael Black Team | 2503.10626 | null |
| 2025-03-13 | NuExo: A Wearable Exoskeleton Covering all Upper Limb ROM for Outdoor Data Collection and Teleoperation of Humanoid Robots | Huimin Lu Team | 2503.10554 | null |
| 2025-03-12 | Natural Humanoid Robot Locomotion with Generative Motion Prior | Rong Xiong Team | 2503.09015 | null |
| 2025-03-13 | HumanoidPano: Hybrid Spherical Panoramic-LiDAR Cross-Modal Perception for Humanoid Robots | Renjing Xu Team | 2503.09010 | null |
| 2025-03-11 | LiPS: Large-Scale Humanoid Robot Reinforcement Learning with Parallel-Series Structures | Renjing Xu Team | 2503.08349 | null |
| 2025-04-29 | Learning Getting-Up Policies for Real-World Humanoid Robots | Saurabh Gupta Team | 2502.12152 | link |
| 2024-10-17 | Harmon: Whole-Body Motion Generation of Humanoid Robots from Language Descriptions | Yuke Zhu Team | 2410.12773 | link |
| 2023-12-29 | How to Raise a Robot – A Case for Neuro-Symbolic AI in Constrained Task Planning for Humanoid Assistive Robots | Hannes Hartenstein Team | 2312.08820 | link |
| 2022-11-28 | Optimization of Humanoid Robot Designs for Human-Robot Ergonomic Payload Lifting | Daniele Pucci Team | 2211.13503 | null |
| 2022-10-20 | Dialogue system with humanoid robot | Naoki Igo Team | 2210.10151 | null |
| 2021-04-20 | The MIT Humanoid Robot: Design, Motion Planning, and Control For Acrobatic Behaviors | Sangbae Kim Team | 2104.09025 | null |
| 2019-09-24 | Whole-Body Geometric Retargeting for Humanoid Robots | Daniele Pucci Team | 1909.10080 | null |
| 2019-09-06 | NimbRo Robots Winning RoboCup 2018 Humanoid AdultSize Soccer Competitions | Sven Behnke Team | 1909.02385 | null |
| 2018-10-22 | NimbRo-OP2X: Adult-sized Open-source 3D Printed Humanoid Robot | Sven Behnke Team | 1810.08395 | null |
| 2018-10-22 | Online Balanced Motion Generation for Humanoid Robots | Sven Behnke Team | 1810.08388 | null |
| 2018-10-01 | NimbRo-OP2: Grown-up 3D Printed Open Humanoid Platform for Research | Sven Behnke Team | 1809.11144 | null |
| 2018-10-01 | A ROS-based Software Framework for the NimbRo-OP Humanoid Open Platform | Sven Behnke Team | 1809.11051 | null |
| 2017-01-11 | Automatic Gain Tuning of a Momentum Based Balancing Controller for Humanoid Robots | Francesco Nori Team | 1610.02849 | null |
| 2017-07-18 | Walking of the iCub humanoid robot in different scenarios: implementation and performance analysis | Katja Mombaur Team | 1607.08525 | null |
| 2017-01-16 | Walking on Partial Footholds Including Line Contacts with the Humanoid Robot Atlas | Jerry Pratt Team | 1607.08089 | null |
| 2016-07-19 | Design and implementation of computational platform for social-humanoid robot Lumen as an exhibition guide in Electrical Engineering Days 2015 | Ary Setijadi Prihatmanto Team | 1607.04763 | null |
| 2016-11-18 | Gaze Stabilization for Humanoid Robots: a Comprehensive Framework | Lorenzo Natale Team | 1411.3525 | null |
Dexterous
| Publish Date | Title | Authors | arXiv ID | Code |
|---|---|---|---|---|
| 2025-11-20 | Dexterity from Smart Lenses: Multi-Fingered Robot Manipulation with In-the-Wild Human Demonstrations | Homanga Bharadhwaj Team | 2511.16661 | null |
| 2025-11-20 | InternData-A1: Pioneering High-Fidelity Synthetic Data for Pre-training Generalist Policy | Jiangmiao Pang Team | 2511.16651 | null |
| 2025-11-19 | VIRAL: Visual Sim-to-Real at Scale for Humanoid Loco-Manipulation | Yuke Zhu Team | 2511.15200 | link |
| 2025-11-18 | Toward Robust and Harmonious Adaptation for Cross-modal Retrieval | Xi Peng Team | 2511.14416 | null |
| 2025-11-17 | From Power to Precision: Learning Fine-grained Dexterity for Multi-fingered Robotic Hands | Xiaolong Wang Team | 2511.13710 | link |
| 2025-11-17 | ZeroDexGrasp: Zero-Shot Task-Oriented Dexterous Grasp Synthesis with Prompt-Based Multi-Stage Semantic Reasoning | Ruizhen Hu Team | 2511.13327 | null |
| 2025-11-14 | Dexterous Manipulation Transfer via Progressive Kinematic-Dynamic Alignment | Yi Sun Team | 2511.10987 | null |
| 2025-11-13 | Opinion: Towards Unified Expressive Policy Optimization for Robust Robot Learning | Xiaocong Li Team | 2511.10087 | null |
| 2025-11-12 | ScaleADFG: Affordance-based Dexterous Functional Grasping via Scalable Dataset | Peng Wang Team | 2511.09602 | null |
| 2025-11-12 | IFG: Internet-Scale Guidance for Functional Grasping Generation | Deepak Pathak Team | 2511.09558 | link |
| 2025-11-12 | SPIDER: Scalable Physics-Informed Dexterous Retargeting | Francois Hogan Team | 2511.09484 | link |
| 2025-11-12 | RGMP: Recurrent Geometric-prior Multimodal Policy for Generalizable Humanoid Robot Manipulation | Miao Li Team | 2511.09141 | null |
| 2025-11-12 | MirrorLimb: Implementing hand pose acquisition and robot teleoperation based on RealMirror | Tao Shen Team | 2511.08865 | null |
| 2025-11-10 | Lightning Grasp: High Performance Procedural Grasp Synthesis with Contact Fields | Pieter Abbeel Team | 2511.07418 | link |
| 2025-11-09 | Robust Differentiable Collision Detection for General Objects | He Wang Team | 2511.06267 | null |
| 2025-11-08 | Adversarial Game-Theoretic Algorithm for Dexterous Grasp Synthesis | Jeffrey Ichnowski Team | 2511.05809 | null |
| 2025-11-06 | Isaac Lab: A GPU-Accelerated Simulation Framework for Multi-Modal Robot Learning | Gavriel State Team | 2511.04831 | link |
| 2025-11-05 | Dexterous Intramyocardial Needle Ablation (d-INA): Design, Fabrication, and In-Vivo Validation | Yue Chen Team | 2511.03763 | null |
| 2025-11-09 | Development of the Bioinspired Tendon-Driven DexHand 021 with Proprioceptive Compliance Control | Sheng Yi Team | 2511.03481 | null |
| 2025-11-10 | 3D Cal: An Open-Source Software Library for Calibrating Tactile Sensors | Gregory Reardon Team | 2511.03078 | null |
| 2025-11-04 | TWIST2: Scalable, Portable, and Holistic Humanoid Data Collection System | C. Karen Liu Team | 2511.02832 | link |
| 2025-11-04 | Dexterous Robotic Piano Playing at Scale | Dieter Büchler Team | 2511.02504 | null |
| 2025-11-10 | Whole-body motion planning and safety-critical control for aerial manipulation | Jeonghyun Byun Team | 2511.02342 | null |
| 2025-11-03 | GenDexHand: Generative Simulation for Dexterous Hands | Yi Ma Team | 2511.01791 | null |
| 2025-11-09 | Scaling Cross-Embodiment World Models for Dexterous Manipulation | Hao Su Team | 2511.01177 | null |
| 2025-10-31 | End-to-End Dexterous Arm-Hand VLA Policies via Shared Autonomy: VR Teleoperation Augmented by Autonomous Hand VLA Policy for Efficient Data Collection | Zhibin Li Team | 2511.00139 | null |
| 2025-10-31 | Whole-Body Proprioceptive Morphing: A Modular Soft Gripper for Robust Cross-Scale Grasping | Xiaonan Huang Team | 2510.27666 | null |
| 2025-10-30 | SpikeATac: A Multimodal Tactile Finger with Taxelized Dynamic Sensing for Dexterous Manipulation | Matei Ciocarlie Team | 2510.27048 | null |
| 2025-10-28 | A Humanoid Visual-Tactile-Action Dataset for Contact-Rich Manipulation | Kyung-Joong Kim Team | 2510.25725 | null |
| 2025-10-27 | OmniDexGrasp: Generalizable Dexterous Grasping via Foundation Model and Force Feedback | Wei-Shi Zheng Team | 2510.23119 | link |
| 2025-10-24 | Scalable Vision-Language-Action Model Pretraining for Robotic Manipulation with Real-Life Human Activity Videos | Baining Guo Team | 2510.21571 | link |
| 2025-10-23 | SutureBot: A Precision Framework & Benchmark For Autonomous End-to-End Suturing | Axel Krieger Team | 2510.20965 | null |
| 2025-10-21 | RAPID Hand Prototype: Design of an Affordable, Fully-Actuated Biomimetic Hand for Dexterous Teleoperation | Hui Cheng Team | 2510.16931 | null |
| 2025-10-23 | DexCanvas: Bridging Human Demonstrations and Robot Learning for Dexterous Manipulation | Yiwen Lu Team | 2510.15786 | null |
| 2025-10-16 | Open TeleDex: A Hardware-Agnostic Teleoperation System for Imitation Learning based Dexterous Manipulation | Shan An Team | 2510.14771 | null |
| 2025-10-16 | Leveraging Neural Descriptor Fields for Learning Contact-Aware Dynamic Recovery | Dmitry Berenson Team | 2510.14768 | null |
| 2025-10-16 | Spatially anchored Tactile Awareness for Robust Dexterous Manipulation | Kaifeng Zhang Team | 2510.14647 | null |
| 2025-10-16 | Restoring Noisy Demonstration for Imitation Learning With Diffusion Models | Shao-Hua Sun Team | 2510.14467 | null |
| 2025-10-14 | Learning to Grasp Anything by Playing with Random Toys | Roei Herzig Team | 2510.12866 | null |
| 2025-10-14 | T(R,O) Grasp: Efficient Graph Diffusion of Robot-Object Spatial Transformation for Cross-Embodiment Dexterous Grasping | Lin Shao Team | 2510.12724 | null |
| 2025-10-10 | Glovity: Learning Dexterous Contact-Rich Manipulation via Spatial Wrench Feedback Teleoperation System | Pai Zheng Team | 2510.09229 | null |
| 2025-10-10 | PLEXUS Hand: Lightweight Four-Motor Prosthetic Hand Enabling Precision-Lateral Dexterous Manipulation | Kazutoshi Tanaka Team | 2510.09209 | null |
| 2025-10-09 | DexNDM: Closing the Reality Gap for Dexterous In-Hand Rotation via Joint-Wise Neural Dynamics Model | Li Yi Team | 2510.08556 | link |
| 2025-10-09 | DexMan: Learning Bimanual Dexterous Manipulation from Human and Generated Videos | Tsung-Wei Ke Team | 2510.08475 | link |
| 2025-10-08 | AVO: Amortized Value Optimization for Contact Mode Switching in Multi-Finger Manipulation | Dmitry Berenson Team | 2510.07548 | null |
| 2025-10-07 | Cross-Embodiment Dexterous Hand Articulation Generation via Morphology-Aware Learning | Yan Wu Team | 2510.06068 | null |
| 2025-10-06 | A multi-modal tactile fingertip design for robotic hands to enhance dexterous manipulation | Zeynep Temel Team | 2510.05382 | null |
| 2025-10-01 | ISyHand: A Dexterous Multi-finger Robot Hand with an Articulated Palm | Katherine J. Kuchenbecker Team | 2509.26236 | null |
| 2025-09-28 | DexFlyWheel: A Scalable and Self-improving Data Generation Framework for Dexterous Manipulation | Yuanpei Chen Team | 2509.23829 | null |
| 2025-09-26 | DemoGrasp: Universal Dexterous Grasping from a Single Demonstration | Zongqing Lu Team | 2509.22149 | null |
| 2025-09-25 | Residual Off-Policy RL for Finetuning Behavior Cloning Policies | Anusha Nagabandi Team | 2509.19301 | link |
| 2025-09-23 | Lang2Morph: Language-Driven Morphological Design of Robotic Hands | Josie Hughes Team | 2509.18937 | null |
| 2025-10-07 | Learning Geometry-Aware Nonprehensile Pushing and Pulling with Dexterous Hands | Daniel Seita Team | 2509.18455 | null |
| 2025-09-23 | Learning Dexterous Manipulation with Quantized Hand State | Cewu Lu Team | 2509.17450 | null |
| 2025-09-18 | A Novel Task-Driven Diffusion-Based Policy with Affordance Learning for Generalizable Manipulation of Articulated Objects | Yongduan Song Team | 2509.14939 | null |
| 2025-09-18 | Learning to Pick: A Visuomotor Policy for Clustered Strawberry Picking | Chen Peng Team | 2509.14530 | null |
| 2025-09-17 | LeVR: A Modular VR Teleoperation Framework for Imitation Learning in Dexterous Manipulation | Han Liu Team | 2509.14349 | null |
| 2025-09-16 | Gen2Real: Towards Demo-Free Dexterous Manipulation by Harnessing Generated Video | Rui Huang Team | 2509.14178 | null |
| 2025-09-17 | Whole-body Motion Control of an Omnidirectional Wheel-Legged Mobile Manipulator via Contact-Aware Dynamic Optimization | Yiqun Li Team | 2509.14010 | null |
| 2025-10-03 | Beyond Anthropomorphism: Enhancing Grasping and Eliminating a Degree of Freedom by Fusing the Abduction of Digits Four and Five | Robert K. Katzschmann Team | 2509.13074 | null |
| 2025-09-16 | MoiréTac: A Dual-Mode Visuotactile Sensor for Multidimensional Perception Using Moiré Pattern Amplification | Wenbo Ding Team | 2509.12714 | null |
| 2025-09-11 | Dexplore: Scalable Neural Control for Dexterous Manipulation from Reference-Scoped Exploration | Wei Yang Team | 2509.09671 | null |
| 2025-09-10 | Grasp Like Humans: Learning Generalizable Multi-Fingered Grasping from Human Proprioceptive Sensorimotor Integration | Huimin Lu Team | 2509.08354 | null |
| 2025-09-09 | Text2Touch: Tactile In-Hand Manipulation with LLM-Designed Reward Functions | Nathan F. Lepora Team | 2509.07445 | null |
| 2025-09-05 | OpenEgo: A Large-Scale Multimodal Egocentric Dataset for Dexterous Manipulation | Yu Xiang Team | 2509.05513 | null |
| 2025-09-08 | DEXOP: A Device for Robotic Transfer of Dexterous Human Manipulation | Pulkit Agrawal Team | 2509.04441 | link |
| 2025-09-09 | EmbodiedOneVision: Interleaved Vision-Text-Action Pretraining for General Robot Control | Dong Wang Team | 2508.21112 | null |
| 2025-08-31 | HERMES: Human-to-Robot Embodied Learning from Multi-Source Motion Data for Mobile Dexterous Manipulation | Huazhe Xu Team | 2508.20085 | null |
| 2025-08-24 | LodeStar: Long-horizon Dexterity via Synthetic Data Augmentation from Human Demonstrations | Hao Su Team | 2508.17547 | null |
| 2025-08-21 | Exploiting Policy Idling for Dexterous Manipulation | Dushyant Rao Team | 2508.15669 | null |
| 2025-08-20 | GraspQP: Differentiable Optimization of Force Closure for Diverse and Robust Dexterous Grasping | Marco Hutter Team | 2508.15002 | null |
| 2025-08-20 | FBI: Learning Dexterous In-hand Manipulation with Dynamic Visuotactile Shortcut Policy | Cewu Lu Team | 2508.14441 | null |
| 2025-08-17 | Geodesic Tracing-Based Kinematic Integration of Rolling and Sliding Contact on Manifold Meshes for Dexterous In-Hand Manipulation | Nancy S. Pollard Team | 2508.12439 | null |
| 2025-08-15 | Towards Affordance-Aware Robotic Dexterous Grasping with Human-like Priors | Hua Zou Team | 2508.08896 | null |
| 2025-08-22 | OmniVTLA: Vision-Tactile-Language-Action Model with Semantic-Aligned Tactile Sensing | Hengdi Zhang Team | 2508.08706 | link |
| 2025-08-11 | PCHands: PCA-based Hand Pose Synergy Representation on Manipulators with N-DoF | Lorenzo Natale Team | 2508.07945 | null |
| 2025-08-29 | DexFruit: Dexterous Manipulation and Gaussian Splatting Inspection of Fruit | Monroe Kennedy III Team | 2508.07118 | null |
| 2025-08-05 | UniFucGrasp: Human-Hand-Inspired Unified Functional Grasp Annotation Strategy and Dataset for Diverse Dexterous Hands | Yaonan Wang Team | 2508.03339 | link |
| 2025-08-03 | DexReMoE: In-hand Reorientation of General Object via Mixtures of Experts | Yunlong Dong Team | 2508.01695 | null |
| 2025-08-01 | Video Generators are Robot Policies | Carl Vondrick Team | 2508.00795 | null |
| 2025-07-31 | XRoboToolkit: A Cross-Platform Framework for Robot Teleoperation | Ning Yang Team | 2508.00097 | link |
| 2025-09-11 | villa-X: Enhancing Latent Action Modeling in Vision-Language-Action Models | Jiang Bian Team | 2507.23682 | link |
| 2025-07-19 | A 21-DOF Humanoid Dexterous Hand with Hybrid SMA-Motor Actuation: CYJ Hand-0 | Erbao Dong Team | 2507.14538 | null |
| 2025-07-18 | Improving Low-Cost Teleoperation: Augmenting GELLO with Force | Kai Arulkumaran Team | 2507.13602 | null |
| 2025-07-16 | The Developments and Challenges towards Dexterous and Embodied Robotic Manipulation: A Survey | Jiming Chen Team | 2507.11840 | null |
| 2025-07-14 | Demonstrating the Octopi-1.5 Visual-Tactile-Language Model | Harold Soh Team | 2507.09985 | null |
| 2025-07-09 | Hierarchical Reinforcement Learning for Articulated Tool Manipulation with Multifingered Hand | Xinjun Sheng Team | 2507.06822 | null |
| 2025-07-07 | A Careful Examination of Large Behavior Models for Multitask Dexterous Manipulation | Russ Tedrake Team | 2507.05331 | null |
| 2025-07-06 | SimLauncher: Launching Sample-Efficient Real-world Robotic Reinforcement Learning via Simulation Pre-training | Hao Dong Team | 2507.04452 | null |
| 2025-07-03 | DexVLG: Dexterous Vision-Language-Grasp Model at Scale | He Wang Team | 2507.02747 | null |
| 2025-07-03 | TypeTele: Releasing Dexterity in Teleoperation by Dexterous Manipulation Types | Wei-Shi Zheng Team | 2507.01857 | link |
| 2025-07-01 | HumanoidGen: Data Generation for Bimanual Dexterous Manipulation via LLM Reasoning | Chenjia Bai Team | 2507.00833 | link |
| 2025-06-26 | Lightweight Fingernail Haptic Device: Unobstructed Fingerpad Force and Vibration Feedback for Enhanced Virtual Dexterous Manipulation | Shoichi Hasegawa Team | 2506.21417 | null |
| 2025-06-24 | Scaffolding Dexterous Manipulation with Vision-Language Models | Dorsa Sadigh Team | 2506.19212 | null |
| 2025-06-24 | The MOTIF Hand: A Robotic Hand for Multimodal Observations with Thermal, Inertial, and Force Sensors | Daniel Seita Team | 2506.19201 | null |
| 2025-06-21 | VLA-OS: Structuring and Dissecting Planning Representations and Paradigms in Vision-Language-Action Models | Lin Shao Team | 2506.17561 | null |
| 2025-06-20 | Dex1B: Learning with 1B Demonstrations for Dexterous Manipulation | Xiaolong Wang Team | 2506.17198 | link |
| 2025-06-19 | ViTacFormer: Learning Cross-Modal Representation for Visuo-Tactile Dexterous Manipulation | Jitendra Malik Team | 2506.15953 | null |
| 2025-06-17 | Tactile Beyond Pixels: Multisensory Touch Representations for Robot Manipulation | Mustafa Mukadam Team | 2506.14754 | null |
| 2025-06-16 | CEED-VLA: Consistency Vision-Language-Action Model with Early-Exit Decoding | Haoang Li Team | 2506.13725 | null |
| 2025-06-13 | ViTaSCOPE: Visuo-tactile Implicit Representation for In-hand Pose and Extrinsic Contact Estimation | Nima Fazeli Team | 2506.12239 | null |
| 2025-06-13 | ExoStart: Efficient learning for dexterous manipulation with sensorized exoskeleton demonstrations | Maria Bauza Villalonga Team | 2506.11775 | null |
| 2025-06-30 | Adaptive event-triggered robust tracking control of soft robots | Marios M. Polycarpou Team | 2506.09523 | null |
| 2025-06-11 | Analyzing Key Objectives in Human-to-Robot Retargeting for Dexterous Manipulation | Xiang Li Team | 2506.09384 | null |
| 2025-06-09 | TensorTouch: Calibration of Tactile Sensors for High Resolution Stress Tensor and Deformation for Dexterous Manipulation | Monroe Kennedy III Team | 2506.08291 | null |
| 2025-06-09 | RAPID Hand: A Robust, Affordable, Perception-Integrated, Dexterous Manipulation Platform for Generalist Robot Autonomy | Hui Cheng Team | 2506.07490 | null |
| 2025-06-05 | GEX: Democratizing Dexterity with Fully-Actuated Dexterous Hand and Exoskeleton Glove | Zelin Deng Team | 2506.04982 | link |
| 2025-06-06 | ArtVIP: Articulated Digital Assets of Visual Realism, Modular Interaction, and Physical Fidelity for Robot Learning | Jian Tang Team | 2506.04941 | null |
| 2025-06-03 | Reachability Weighted Offline Goal-conditioned Resampling | Joni Pajarinen Team | 2506.02577 | null |
| 2025-05-30 | Interactive Imitation Learning for Dexterous Robotic Manipulation: Challenges and Perspectives – A Survey | Rania Rayyes Team | 2506.00098 | null |
| 2025-05-30 | DexMachina: Functional Retargeting for Bimanual Dexterous Manipulation | Shuran Song Team | 2505.24853 | null |
| 2025-05-28 | ForceVLA: Enhancing VLA Models with a Force-aware MoE for Contact-rich Manipulation | Wenqiang Zhang Team | 2505.22159 | null |
| 2025-10-03 | DexUMI: Using Human Hand as the Universal Manipulation Interface for Dexterous Manipulation | Shuran Song Team | 2505.21864 | null |
| 2025-05-27 | Learning Generalizable Robot Policy with Human Demonstration Video as a Prompt | Jianyu Chen Team | 2505.20795 | null |
| 2025-05-25 | MaskedManipulator: Versatile Whole-Body Control for Loco-Manipulation | Xue Bin Peng Team | 2505.19086 | null |
| 2025-05-24 | Beyond Domain Randomization: Event-Inspired Perception for Visually Robust Adversarial Imitation from Videos | Mario Bijelic Team | 2505.18899 | link |
| 2025-05-24 | DiffusionRL: Efficient Training of Diffusion Policies for Robotic Grasping Using RL-Adapted Large-Scale Datasets | Dzmitry Tsetserukou Team | 2505.18876 | null |
| 2025-05-27 | GenPO: Generative Diffusion Models Meet On-Policy Reinforcement Learning | Ye Shi Team | 2505.18763 | null |
| 2025-05-22 | TacCompress: A Benchmark for Multi-Point Tactile Data Compression in Dexterous Manipulation | Hengdi Zhang Team | 2505.16289 | null |
| 2025-05-21 | Object-Focus Actor for Data-efficient Robot Generalization Dexterous Manipulation | Xiaodong He Team | 2505.15098 | null |
| 2025-05-20 | Adaptive Visuo-Tactile Fusion with Predictive Force Attention for Dexterous Manipulation | Hao Dong Team | 2505.13982 | null |
| 2025-05-19 | Approximating Global Contact-Implicit MPC via Sampling and Local Complementarity | Michael Posa Team | 2505.13350 | null |
| 2025-05-19 | TeleOpBench: A Simulator-Centric Benchmark for Dual-Arm Dexterous Teleoperation | Jiangmiao Pang Team | 2505.12748 | null |
| 2025-05-18 | PartDexTOG: Generating Dexterous Task-Oriented Grasping via Language-driven Part Analysis | Zhipong Cai Team | 2505.12294 | null |
| 2025-05-17 | OneTwoVLA: A Unified Vision-Language-Action Model with Adaptive Reasoning | Yang Gao Team | 2505.11917 | null |
| 2025-05-16 | EgoDex: Learning Dexterous Manipulation from Large-Scale Egocentric Video | Jian Zhang Team | 2505.11709 | null |
| 2025-05-16 | Self-supervised perception for tactile skin covered dexterous hands | Mustafa Mukadam Team | 2505.11420 | null |
| 2025-05-16 | Learning Multimodal AI Algorithms for Amplifying Limited User Input into High-dimensional Control Space | Reza Abiri Team | 2505.11366 | link |
| 2025-05-16 | Estimating Deformable-Rigid Contact Interactions for a Deformable Tool via Learning and Model-Based Optimization | Nima Fazeli Team | 2505.10884 | null |
| 2025-05-15 | SRT-H: A Hierarchical Framework for Autonomous Surgery via Language Conditioned Imitation Learning | Axel Krieger Team | 2505.10251 | null |
| 2025-05-13 | HandCept: A Visual-Inertial Fusion Framework for Accurate Proprioception in Dexterous Hands | Yunhui Liu Team | 2505.08213 | null |
| 2025-05-12 | DexWild: Dexterous Human Interactions for In-the-Wild Robot Policies | Deepak Pathak Team | 2505.07813 | null |
| 2025-05-08 | Morphologically Symmetric Reinforcement Learning for Ambidextrous Bimanual Manipulation | Georgia Chalvatzaki Team | 2505.05287 | null |
| 2025-05-04 | Prompt-responsive Object Retrieval with Memory-augmented Student-Teacher Learning | Sven Behnke Team | 2505.02232 | null |
| 2025-05-04 | KineDex: Learning Tactile-Informed Visuomotor Policies via Kinesthetic Teaching for Dexterous Manipulation | Yang Gao Team | 2505.01974 | null |
| 2025-05-02 | DexFlow: A Unified Approach for Dexterous Hand Pose Retargeting and Interaction | Miao Li Team | 2505.01083 | null |
| 2025-05-02 | DexCtrl: Towards Sim-to-Real Dexterity with Adaptive Controller Learning | Masayoshi Tomizuka Team | 2505.00991 | null |
| 2025-05-01 | Multi-Goal Dexterous Hand Manipulation using Probabilistic Model-based Reinforcement Learning | Yunduan Cui Team | 2504.21585 | null |
| 2025-04-27 | PolyTouch: A Robust Multi-Modal Tactile Sensor for Contact-rich Manipulation Using Tactile-Diffusion Policies | Edward Adelson Team | 2504.19341 | null |
| 2025-04-23 | PP-Tac: Paper Picking Using Tactile Feedback in Dexterous Robotic Hands | Ziyuan Jiao Team | 2504.16649 | null |
| 2025-04-22 | $π_{0.5}$: a Vision-Language-Action Model with Open-World Generalization | Ury Zhilinsky Team | 2504.16054 | null |
| 2025-04-21 | LAPP: Large Language Model Feedback for Preference-Driven Reinforcement Learning | Boyuan Chen Team | 2504.15472 | null |
| 2025-04-21 | SuFIA-BC: Generating High Quality Demonstration Data for Visuomotor Policy Learning in Surgical Subtasks | Animesh Garg Team | 2504.14857 | null |
| 2025-04-20 | BiDexHand: Design and Evaluation of an Open-Source 16-DoF Biomimetic Dexterous Hand | Zhengyang Kris Weng Team | 2504.14712 | null |
| 2025-04-18 | On the Importance of Tactile Sensing for Imitation Learning: A Case Study on Robotic Match Lighting | Jan Peters Team | 2504.13618 | null |
| 2025-04-17 | RUKA: Rethinking the Design of Humanoid Hands with Learning | Lerrel Pinto Team | 2504.13165 | null |
| 2025-04-17 | Adaptive Task Space Non-Singular Terminal Super-Twisting Sliding Mode Control of a 7-DOF Robotic Manipulator | E. Witrant Team | 2504.13056 | null |
| 2025-04-17 | Krysalis Hand: A Lightweight, High-Payload, 18-DoF Anthropomorphic End-Effector for Robotic Learning and Dexterous Manipulation | Iman Soltani Team | 2504.12967 | null |
| 2025-04-22 | Crossing the Human-Robot Embodiment Gap with Sim-to-Real RL using One Human Demonstration | Jeannette Bohg Team | 2504.12609 | null |
| 2025-04-14 | Look-to-Touch: A Vision-Enhanced Proximity and Tactile Sensor for Distance and Geometry Perception in Robotic Manipulation | Guoying Gu Team | 2504.10280 | null |
| 2025-04-08 | Functionally graded keratin facilitates tactile sensing in elephant whiskers | Katherine J. Kuchenbecker Team | 2504.07143 | null |
| 2025-04-08 | ViTaMIn: Learning Contact-Rich Tasks Through Robot-Free Visuo-Tactile Manipulation Interface | Rui Chen Team | 2504.06156 | null |
| 2025-04-06 | DexTOG: Learning Task-Oriented Dexterous Grasp with Language | Cewu Lu Team | 2504.04573 | null |
| 2025-04-06 | DexSinGrasp: Learning a Unified Policy for Dexterous Object Singulation and Grasping in Cluttered Environments | Lin Shao Team | 2504.04516 | null |
| 2025-04-05 | ORCA: An Open-Source, Reliable, Cost-Effective, Anthropomorphic Robotic Hand for Uninterrupted Dexterous Task Learning | Robert K. Katzschmann Team | 2504.04259 | null |
| 2025-09-11 | Dexterous Manipulation through Imitation Learning: A Survey | Hong Zhang Team | 2504.03515 | null |
| 2025-03-29 | Dexterous Non-Prehensile Manipulation for Ungraspable Object via Extrinsic Dexterity | Yuanpei Chen Team | 2503.23120 | null |
| 2025-03-27 | ManipTrans: Efficient Dexterous Bimanual Manipulation Transfer via Residual Learning | Siyuan Huang Team | 2503.21860 | null |
| 2025-03-25 | G-DexGrasp: Generalizable Dexterous Grasping Synthesis Via Part-Aware Prior Retrieval and Prior-Assisted Generation | Ruizhen Hu Team | 2503.19457 | null |
| 2025-03-16 | Being-0: A Humanoid Robotic Agent with Vision-Language Models and Modular Skills | Zongqing Lu Team | 2503.12533 | null |
| 2025-03-14 | Is Your Imitation Learning Policy Better than Mine? Policy Comparison with Near-Optimal Stopping | Haruki Nishimura Team | 2503.10966 | null |
| 2025-03-12 | Sequential Multi-Object Grasping with One Dexterous Hand | Daniel Seita Team | 2503.09078 | null |
| 2025-03-16 | DexGrasp Anything: Towards Universal Robotic Dexterous Grasping with Physics Awareness | Yuexin Ma Team | 2503.08257 | link |
| 2025-03-13 | AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems | Jianchao Zhu Team | 2503.06669 | link |
| 2025-03-08 | ReJSHand: Efficient Real-Time Hand Pose Estimation and Mesh Reconstruction Using Refined Joint and Skeleton Features | Hong Zhang Team | 2503.05995 | link |
| 2025-03-07 | Kaiwu: A Multimodal Manipulation Dataset and Framework for Robot Learning and Human-Robot Interaction | Bin He Team | 2503.05231 | null |
| 2025-03-06 | Dexterous Hand Manipulation via Efficient Imitation-Bootstrapped Online Reinforcement Learning | Xiaodong He Team | 2503.04014 | null |
| 2025-03-05 | LensDFF: Language-enhanced Sparse Feature Distillation for Efficient Few-Shot Dexterous Manipulation | Alois Knoll Team | 2503.03890 | null |
| 2025-03-05 | Selective Tweezing and Immobilization of Colloids for Dexterous Manipulation of Biological Materials | Kimani C. Toussaint Jr Team | 2503.03102 | null |
| 2025-03-03 | TacCap: A Wearable FBG-Based Tactile Sensor for Seamless Human-to-Robot Skill Transfer | Mark R. Cutkosky Team | 2503.01789 | null |
| 2025-03-03 | RoboDexVLM: Visual Language Model-Enabled Task Planning and Motion Control for Dexterous Robot Manipulation | Jun Ma Team | 2503.01616 | null |
| 2025-03-03 | Exo-ViHa: A Cross-Platform Exoskeleton System with Visual and Haptic Feedback for Efficient Dexterous Skill Learning | Wenbo Ding Team | 2503.01543 | null |
| 2025-03-03 | KineSoft: Learning Proprioceptive Manipulation Policies with Soft Robot Hands | Jeffrey Ichnowski Team | 2503.01078 | null |
| 2025-02-27 | Sim-to-Real Reinforcement Learning for Vision-Based Dexterous Manipulation on Humanoids | Yuke Zhu Team | 2502.20396 | null |
| 2025-02-28 | ObjectVLA: End-to-End Open-World Object Manipulation Without Demonstration | Feifei Feng Team | 2502.19250 | null |
| 2025-02-26 | Retrieval Dexterity: Efficient Object Retrieval in Clutters with Dexterous Hand | Yuanpei Chen Team | 2502.18423 | null |
| 2025-02-07 | Dexterous Cable Manipulation: Taxonomy, Multi-Fingered Hand Design, and Long-Horizon Manipulation | Robert B. Fisher Team | 2502.00396 | link |
| 2024-12-23 | Dexterous Manipulation Based on Prior Dexterous Grasp Pose Knowledge | Cewu Lu Team | 2412.15587 | null |
| 2024-08-22 | Tilde: Teleoperation for Dexterous In-Hand Manipulation Learning with a DeltaHand | Oliver Kroemer Team | 2405.18804 | null |
| 2023-12-13 | DEFT: Dexterous Fine-Tuning for Real-World Hand Policies | Deepak Pathak Team | 2310.19797 | link |
| 2023-12-27 | DELTAHANDS: A Synergistic Dexterous Hand Framework Based on Delta Robots | F. Zeynep Temel Team | 2310.05266 | null |
| 2023-10-17 | Sequential Dexterity: Chaining Dexterous Policies for Long-Horizon Manipulation | C. Karen Liu Team | 2309.00987 | null |
| 2023-08-23 | Dexterous Soft Hands Linearize Feedback-Control for In-Hand Manipulation | Oliver Brock Team | 2308.10691 | null |
| 2023-04-20 | Progressive Transfer Learning for Dexterous In-Hand Manipulation with Multi-Fingered Anthropomorphic Hand | Jia Sun Team | 2304.09526 | null |
| 2022-03-25 | Dexterous Imitation Made Easy: A Learning-Based Framework for Efficient Dexterous Manipulation | Lerrel Pinto Team | 2203.13251 | null |
| 2022-05-11 | RBO Hand 3 – A Platform for Soft Dexterous Manipulation | Oliver Brock Team | 2201.10883 | null |
| 2019-01-23 | Learning Dexterous In-Hand Manipulation | Wojciech Zaremba Team | 1808.00177 | null |
| 2018-06-27 | Learning Complex Dexterous Manipulation with Deep Reinforcement Learning and Demonstrations | Sergey Levine Team | 1709.10087 | link |
| 2017-03-21 | Learning Dexterous Manipulation for a Soft Robotic Hand from Human Demonstration | Pieter Abbeel Team | 1603.06348 | null |