Routers in Vision Mixture of Experts: An Empirical Study
CoRR (2024)
Abstract
Mixture-of-Experts (MoE) models are a promising way to scale up model
capacity without significantly increasing computational cost. A key component
of MoEs is the router, which decides which subset of parameters (experts)
processes which feature embeddings (tokens). In this paper, we present a
comprehensive study of routers in MoEs for computer vision tasks. We introduce
a unified MoE formulation that subsumes different MoEs with two parametric
routing tensors. This formulation covers both sparse MoE, which uses a binary
or hard assignment between experts and tokens, and soft MoE, which uses a soft
assignment between experts and weighted combinations of tokens. Routers for
sparse MoEs can be further grouped into two variants: Token Choice, which
matches experts to each token, and Expert Choice, which matches tokens to each
expert. We conduct head-to-head experiments with 6 different routers, including
existing routers from prior work and new ones we introduce. We show that (i)
many routers originally developed for language modeling can be adapted to
perform strongly in vision tasks, (ii) in sparse MoE, Expert Choice routers
generally outperform Token Choice routers, and (iii) soft MoEs generally
outperform sparse MoEs with a fixed compute budget. These results provide new
insights regarding the crucial role of routers in vision MoE models.
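To make the three routing schemes named in the abstract concrete, the following is a minimal, self-contained sketch in Python/NumPy of how a single routing layer could assign tokens to experts. It is illustrative only, not the paper's implementation: the shapes, the single routing matrix W, the top-k selections via argsort, and the one-slot-per-expert Soft MoE are all simplifying assumptions.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    ex = np.exp(x)
    return ex / ex.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d, n_exp, k = 8, 4, 3, 2          # tokens, embed dim, experts, top-k

tokens = rng.normal(size=(n, d))     # token embeddings
W = rng.normal(size=(d, n_exp))      # learned routing parameters
logits = tokens @ W                  # router logits, shape (n, n_exp)

# Token Choice (sparse MoE): each token selects its top-k experts.
token_choice = np.argsort(-logits, axis=1)[:, :k]     # (n, k) expert ids

# Expert Choice (sparse MoE): each expert selects its top-k tokens.
expert_choice = np.argsort(-logits, axis=0)[:k, :].T  # (n_exp, k) token ids

# Soft MoE: no hard assignment. Each expert processes a softmax-weighted
# combination of all tokens (dispatch), and each token's output is a
# softmax-weighted combination of expert outputs (combine). One slot per
# expert here for brevity; the expert networks themselves are elided
# (identity) in this sketch.
dispatch = softmax(logits, axis=0)   # normalize over tokens
slot_in = dispatch.T @ tokens        # (n_exp, d) mixed slot inputs
slot_out = slot_in                   # experts would transform slot_in here
combine = softmax(logits, axis=1)    # normalize over experts
outputs = combine @ slot_out         # (n, d) per-token outputs

print(token_choice.shape, expert_choice.shape, outputs.shape)

The key contrast is the axis of competition: Token Choice ranks experts per token (rows of the logits), Expert Choice ranks tokens per expert (columns), and Soft MoE replaces hard selection with softmax-weighted mixing along both axes.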