SwiftVLA
Chaojun Ni1,2* Cheng Chen3* Xiaofeng Wang1,4* Zheng Zhu1*†
Wenzhao Zheng4 Boyuan Wang1 Tianrun Chen3† Guosheng Zhao1 Haoyun Li1
Zhehao Dong1,2 Qiang Zhang5 Yun Ye1 Yang Wang1 Guan Huang1 Wenjun Mei2†
1 GigaAI
2 Peking University
3 Moxin (Huzhou) Technology Co., Ltd.
4 Tsinghua University
5 X-Humanoid
* Equal contribution.
† Corresponding author.
Abstract: Vision–Language–Action (VLA) models built on pretrained Vision–Language Models (VLMs) show strong potential but are limited in practicality due to their large parameter counts. To mitigate this issue, using a lightweight VLM has been explored, but it compromises spatiotemporal reasoning. Although some methods suggest that incorporating additional 3D inputs can help, they usually rely on large VLMs to fuse 3D and 2D inputs and still lack temporal understanding. Therefore, we propose SwiftVLA, an architecture that enhances a compact model with 4D understanding while preserving design efficiency. Specifically, our approach features a pretrained 4D visual geometry transformer with a temporal cache that incrementally extracts 4D features from 2D images. Then, to enhance the VLM's ability to exploit both 2D images and 4D features, we introduce Fusion Tokens, a set of learnable tokens trained with a future prediction objective to generate unified representations for action generation. Finally, we introduce a mask-and-reconstruct strategy that randomly masks 4D inputs to the VLM and trains the VLA to reconstruct the masked features. This self-reconstruction objective helps the model learn effective 4D representations, allowing the 4D branch to be dropped at inference with minimal performance loss. Extensive experiments in real and simulated environments show that SwiftVLA outperforms lightweight baselines and rivals VLAs up to 7× larger. On edge devices, SwiftVLA achieves comparable performance while being 18× faster than π0 and reducing the memory footprint by 12×.
SwiftVLA is a method that integrates 4D spatiotemporal information into a lightweight vision–language–action (VLA) model at minimal cost.
(1) It extracts 4D features and adopts a mask-and-reconstruct training strategy that distills 4D knowledge into the VLA, enabling comparable performance during inference using only 2D inputs;
(2) It fuses 2D and 4D features in a lightweight VLM via learnable Fusion Tokens, trained with supervision from the robot arm's future end-effector trajectory to produce a unified, action-aware representation;
(3) Extensive experiments in simulation and on real robots demonstrate that SwiftVLA achieves performance comparable to a baseline that is 7× larger, runs 18× faster, and uses 12× less memory than π0 on edge devices.
We find that smaller VLMs, such as SmolVLM-0.5B, perform noticeably worse on tasks that require spatial reasoning. For instance, when answering questions like "What is the color of the leftmost bowl?", these smaller models often struggle to provide accurate answers compared with their larger counterparts. While models like SmolVLA, built on SmolVLM-0.5B, have a clear advantage in inference speed, especially compared with heavier models such as π0 built on PaliGemma-3B, they lag significantly behind in task success rate. This is primarily because real-world manipulation tasks typically require stronger spatiotemporal reasoning and a deeper understanding of scene dynamics, both of which smaller models handle poorly.
Recent works explore integrating 3D and 4D information into VLA models to enhance perception. As shown in Fig. (b), some fuse 3D features with 2D representations inside large VLMs, improving spatial awareness but still relying on heavy models. Others, as in Fig. (c), decouple 3D processing into an additional branch, increasing parameter overhead and making them unsuitable for compact models. In contrast, as shown in Fig. (d), SwiftVLA uses 4D representations as auxiliary inputs and employs a reconstruction objective to learn the spatiotemporal dynamics of 4D features. This enables the model to discard the 4D representations at inference while maintaining performance comparable to using the full 4D inputs.
(a) Using only 2D features as input to the VLM, which results in limited spatiotemporal awareness. (b) Direct fusion approaches combine spatial and 2D features within large VLMs. (c) Decoupled designs that introduce a dedicated spatial branch, causing large parameter overhead. (d) SwiftVLA leverages a pretrained model to extract 4D features and applies a feature reconstruction objective to align 4D and 2D representations. In addition, Fusion Tokens and a future prediction objective are introduced to strengthen cross-modal integration. The 4D inputs and auxiliary heads are removed at inference to maintain efficiency.
In SwiftVLA, we first extract 2D and 4D features from the input images. A lightweight VLM processes the 2D and 4D features together with Fusion Tokens to achieve cross-modal integration. The outputs of the Fusion Tokens are supervised by the robot end-effector's future trajectory. During training, we randomly mask either the 2D or the 4D features, and the action expert is required to reconstruct the masked features while learning to generate actions. We show the attention mask under random masking of the 4D features: in this case, the 4D features are excluded from the VLM attention, and the model must reconstruct them from the remaining inputs.
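To make the masking scheme concrete, the snippet below is a minimal sketch, assuming a block-wise boolean attention mask over text, 2D, 4D, and Fusion Token positions; the token ordering, the helper name build_attention_mask, and the mask convention (True = attention allowed) are illustrative assumptions rather than the exact implementation.

```python
import torch

def build_attention_mask(n_text, n_2d, n_4d, n_fusion, mask_4d=True):
    """Sketch of a block-wise attention mask for training (assumed layout).

    When the 4D features are randomly masked (mask_4d=True), their token
    positions are excluded from attention so that the remaining tokens
    (text, 2D, Fusion Tokens) cannot attend to them; the action expert must
    then reconstruct the 4D features from what is left.
    """
    total = n_text + n_2d + n_4d + n_fusion
    mask = torch.ones(total, total, dtype=torch.bool)  # True = attention allowed
    if mask_4d:
        start, end = n_text + n_2d, n_text + n_2d + n_4d
        mask[:, start:end] = False  # no token attends to the masked 4D tokens
        mask[start:end, :] = False  # masked 4D tokens attend to nothing
    return mask
```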
We use a pretrained 4D Visual Geometry Transformer to extract 4D features from the input images, exploiting both spatial and temporal cues. The extractor employs spatiotemporal attention, where current-frame features interact with a temporal cache through cross-attention to integrate temporal context. The temporal cache is updated with a FIFO strategy, keeping memory and compute bounded. By feeding 4D features from selected views to the VLM and updating the cache with the others, we reduce training costs while improving spatiotemporal reasoning.
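The following is a minimal sketch of how such a FIFO temporal cache with cross-attention could look in PyTorch; the TemporalCache module, its dimensions, and the cache length are assumptions for illustration, not the actual internals of the 4D Visual Geometry Transformer.

```python
from collections import deque
import torch
import torch.nn as nn

class TemporalCache(nn.Module):
    """Sketch of a FIFO temporal cache for the 4D feature extractor.

    Current-frame features attend to cached features from past frames via
    cross-attention, integrating temporal context; the cache is then updated
    first-in-first-out so memory stays bounded. Sizes are illustrative.
    """
    def __init__(self, dim=512, num_heads=8, max_frames=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cache = deque(maxlen=max_frames)  # FIFO: oldest frame is evicted

    def forward(self, curr_feats, update_cache=True):
        # curr_feats: (B, N_tokens, dim) features of the current frame
        if len(self.cache) > 0:
            past = torch.cat(list(self.cache), dim=1)           # (B, T*N, dim)
            fused, _ = self.cross_attn(curr_feats, past, past)  # temporal context
            curr_feats = curr_feats + fused
        if update_cache:
            self.cache.append(curr_feats.detach())  # detach to keep training cheap
        return curr_feats
```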
Rather than relying on heavyweight VLMs to directly fuse 2D and 4D information, we introduce Fusion Tokens, a set of learnable tokens designed to interact with both 2D features and 4D spatiotemporal representations. These tokens are supervised by the future trajectory of the end-effector. Our results demonstrate that, when tasked with future prediction, Fusion Tokens significantly enhance the ability of lightweight VLMs to integrate and interpret both 2D and 4D information.
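Below is a hedged sketch of how Fusion Tokens with a future end-effector trajectory head might be implemented; the FusionTokens class, token count, hidden size, prediction horizon, and the L1 loss are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionTokens(nn.Module):
    """Sketch of learnable Fusion Tokens with a future-trajectory head.

    The tokens are appended to the VLM input so they can attend to both 2D
    and 4D features; their outputs are decoded into the end-effector's future
    trajectory and supervised with an L1 loss (assumed choice).
    """
    def __init__(self, num_tokens=16, dim=576, horizon=16, traj_dim=7):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)
        self.traj_head = nn.Linear(num_tokens * dim, horizon * traj_dim)
        self.horizon, self.traj_dim = horizon, traj_dim

    def append_to(self, vlm_inputs):
        # vlm_inputs: (B, L, dim) embedded text / 2D / 4D tokens
        b = vlm_inputs.shape[0]
        fusion = self.tokens.unsqueeze(0).expand(b, -1, -1)
        return torch.cat([vlm_inputs, fusion], dim=1)

    def future_loss(self, fusion_out, gt_future_traj):
        # fusion_out: (B, num_tokens, dim) VLM outputs at the Fusion Token slots
        pred = self.traj_head(fusion_out.flatten(1))
        pred = pred.view(-1, self.horizon, self.traj_dim)
        return F.l1_loss(pred, gt_future_traj)
```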
We propose a mask-and-reconstruct strategy that enhances the spatial reasoning capabilities of the VLA by incorporating 4D features during training, while maintaining efficiency during inference. This yields a lightweight model suitable for resource-constrained environments without significant performance loss.
During training, we apply random masking to either the 2D or 4D features, requiring the VLA to predict actions based on the remaining modalities while also reconstructing the masked features. This encourages the model to learn geometry-aware representations, which are further refined through reconstruction losses.
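A minimal sketch of one such training step is shown below, assuming a model interface that accepts either modality as None and returns action predictions together with feature reconstructions; the loss terms, their weights, and all attribute names are assumptions for illustration, not the actual training objective.

```python
import random
import torch.nn.functional as F

def training_step(model, batch):
    """Sketch of one mask-and-reconstruct training step (interfaces assumed).

    Either the 2D or the 4D features are randomly masked; the model predicts
    actions from the remaining inputs and reconstructs the masked features.
    """
    feats_2d = model.encode_2d(batch["images"])
    feats_4d = model.encode_4d(batch["images"])
    masked = random.choice(["2d", "4d"])  # mask one modality per step

    out = model(
        text=batch["instruction"],
        feats_2d=None if masked == "2d" else feats_2d,
        feats_4d=None if masked == "4d" else feats_4d,
        state=batch["robot_state"],
    )

    loss_action = F.mse_loss(out.actions, batch["actions"])      # assumed action loss
    target = feats_2d if masked == "2d" else feats_4d
    recon = out.recon_2d if masked == "2d" else out.recon_4d
    loss_recon = F.mse_loss(recon, target.detach())              # reconstruct masked modality
    return loss_action + 0.5 * loss_recon                        # loss weight is an assumption
```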
At inference time, we discard the 4D feature inputs and retain only the 2D feature branch, resulting in a lightweight model with reduced parameters. The 4D feature extractor and reconstruction heads, which were used solely for training supervision, are removed. This compact design enables the model to maintain strong spatial and temporal awareness, while also being computationally efficient and suitable for deployment on real-world robotic platforms.
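For completeness, a corresponding inference sketch under the same assumed interface: only the 2D branch is executed, and the 4D extractor and reconstruction heads are simply not loaded.

```python
import torch

@torch.no_grad()
def infer(model, images, instruction, robot_state):
    """Sketch of inference with the 4D branch dropped (interface assumed)."""
    feats_2d = model.encode_2d(images)  # only the lightweight 2D branch runs
    out = model(text=instruction, feats_2d=feats_2d,
                feats_4d=None, state=robot_state)
    return out.actions
```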
We provide a video that compares SwiftVLA and π0 across multiple tasks. The video consists of the following segments:
@article{swiftvla,
title={SwiftVLA: Unlocking Spatiotemporal Dynamics for Lightweight VLA Models at Minimal Overhead},
author={Chaojun Ni and Cheng Chen and Xiaofeng Wang and Zheng Zhu and Wenzhao Zheng and Boyuan Wang and Tianrun Chen and Guosheng Zhao and Haoyun Li and Zhehao Dong and Qiang Zhang and Yun Ye and Yang Wang and Guan Huang and Wenjun Mei},
journal={arXiv preprint arXiv:2512.00903},
year={2025}
}