R-MeeTo: Faster Vision Mamba is Rebuilt in Minutes Via Merged Token Re-training.

1National University of Singapore,  2Sichuan University,  3Shanghai Jiao Tong University,  4Shenzhen University of Technology,  5Genpact Innovation Center,  6Independent Researcher.
(*: equal contribution, †: corresponding authors)

Faster Vision Mamba is rebuilt in minutes via merged token re-training! We introduce R-MeeTo, the first token merging method for Mamba. When token reduction is applied to Mamba, the heavier performance drop is mainly caused by the loss of key knowledge. R-MeeTo is therefore designed to quickly restore that key knowledge and thereby recover performance.
In the following video, we show the intuition behind our work, where General Knowledge refers to the common patterns shared among tokens, and Specific Knowledge refers to the patterns specific to particular tokens.


Conclusion: Intuition of R-MeeTo: 1) Why is Mamba sensitive to token reduction?
2) Why does R-MeeTo (i.e., merging + re-training) work? Key knowledge loss and recovery.

R-MeeTo

Motivation

Token reduction is a popular technique for improving model efficiency. It has yielded promising results in ViTs, yet its effectiveness in Vim (Vision Mamba) remains unexplored. We therefore run the following preliminary experiments.

Conclusion: Mamba is sensitive to token reduction.

Analyses

From the perspective of knowledge structure, we have the following analyses (X: inputs; Y: outputs):

Conclusion: the distributions of Key Knowledge are different. Token reduction in Mamba is risky.

Building on these insights, we propose R-MeeTo. R-MeeTo is simple and effective, with only two main modules: merging and re-training. Merging reduces the knowledge loss; re-training quickly recovers the knowledge structure of Mamba. In the next video, we show that merging does help.

Conclusion: merging works!

Methodology

The process of our R-MeeTo.
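
To make the merging module more concrete, here is a minimal sketch of similarity-based token merging. This page does not spell out the matching rule, so the sketch assumes a bipartite matching in the style commonly used for token merging in ViTs: tokens are split into two alternating sets, each token in the first set is paired with its most similar token in the second, and the `r` most similar pairs are averaged. The names `cosine` and `merge_tokens`, the alternating split, and the choice to keep the surviving tokens in their original sequence order are all assumptions made for illustration, not the authors' implementation. The re-training step is not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def merge_tokens(tokens, r):
    """Merge r tokens into their most similar partners (hypothetical helper).

    tokens: list of feature vectors in sequence order.
    r: number of tokens to remove by merging.
    Returns a new, shorter list of feature vectors.
    """
    # Assumption: alternate tokens between set A (even) and set B (odd).
    a_idx = list(range(0, len(tokens), 2))
    b_idx = list(range(1, len(tokens), 2))
    r = max(0, min(r, len(a_idx)))
    if r == 0 or not b_idx:
        return [list(t) for t in tokens]

    # For every token in A, find its most similar token in B.
    matches = []
    for i in a_idx:
        best_j = max(b_idx, key=lambda j: cosine(tokens[i], tokens[j]))
        matches.append((cosine(tokens[i], tokens[best_j]), i, best_j))

    # Only the r most similar pairs are merged.
    matches.sort(key=lambda m: m[0], reverse=True)
    sums = {j: list(tokens[j]) for j in b_idx}
    counts = {j: 1 for j in b_idx}
    removed = set()
    for _, i, j in matches[:r]:
        sums[j] = [s + x for s, x in zip(sums[j], tokens[i])]
        counts[j] += 1
        removed.add(i)

    # Rebuild the sequence: drop merged A tokens, average each B token
    # with whatever was merged into it, keep everything else unchanged.
    merged = []
    for k in range(len(tokens)):
        if k in removed:
            continue
        if k in sums:
            merged.append([s / counts[k] for s in sums[k]])
        else:
            merged.append(list(tokens[k]))
    return merged
```

For example, calling `merge_tokens(tokens, r=1)` on a four-token sequence returns three tokens, with the most similar pair replaced by its average. Averaging rather than discarding is what lets a merged token retain part of the information carried by the token it absorbs.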

Evaluations

Building in Minutes!

Table 1. With currently available GPUs, our method builds a faster Mamba in minutes.

With R-MeeTo, a faster Mamba is obtained in minutes with a limited performance drop.


Image Tasks


Table 2. R-MeeTo shows significant improvements on image tasks (i.e., classification on ImageNet-1K).

Image 1. Visualization of R-MeeTo on ImageNet-1K: similar and redundant features are merged.

Video Tasks


Table 3. R-MeeTo shows significant improvements on video tasks (i.e., classification on Kinetics-400).

Image 2. Visualization of R-MeeTo on Kinetics-400: similar and redundant features are merged.

BibTeX

@misc{shi2024faster,
      title={Faster Vision Mamba is Rebuilt in Minutes Via Merged Token Re-training},
      author={Shi, Mingjia and Zhou, Yuhao and Yu, Ruiji and Li, Zekai and Liang, Zhiyuan and Zhao, Xuanlei and
       Peng, Xiaojiang and Rajpurohit, Tanmay and Vedantam, Ramakrishna and
       Zhao, Wangbo and Wang, Kai and You, Yang},
      year={2024},
      eprint={2412.12496},
      archivePrefix={arXiv},
      url={https://arxiv.org/abs/2412.12496},
}