[2603.28768] CRAFT: Cost-aware Expert Replica Allocation with Fine-Grained Layerwise Estimations
https://arxiv.org/abs/2603.28768
This page presents CRAFT, a framework for efficient expert replication in Mixture-of-Experts (MoE) architectures used in large language models. The paper addresses the problem that existing replication schemes often over-replicate experts, which consumes substantial GPU memory and degrades throughput. CRAFT maximizes load balance under a memory constraint by deciding replication at a fine granularity, layer by layer, according to the estimated benefit of each replica. It integrates into existing serving frameworks without additional training or model modifications. Evaluations show an average 1.14× throughput improvement (up to 1.2×) over conventional replication techniques in large-scale deployments.
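To make the idea of benefit-driven, per-layer replication concrete, here is a minimal sketch. It is not CRAFT's algorithm, which this summary does not describe in detail; it is a plain greedy marginal-gain heuristic swapped in for illustration. The function name `allocate_replicas`, the load figures, and the use of a total replica count as a stand-in for the memory constraint are all assumptions.

```python
def allocate_replicas(layer_loads, budget):
    """Greedy sketch (hypothetical, not the paper's method).

    layer_loads: one list per layer of per-expert token loads.
    budget: total number of extra replicas allowed, used here as a
            simple proxy for the GPU memory constraint.
    Returns the replica count per expert, per layer.
    """
    # Every expert starts with exactly one copy.
    replicas = [[1] * len(loads) for loads in layer_loads]

    def estimated_gain(layer):
        """Drop in the layer's peak per-replica load if its hottest expert gets one more copy."""
        loads, reps = layer_loads[layer], replicas[layer]
        hot = max(range(len(loads)), key=lambda e: loads[e] / reps[e])
        before = loads[hot] / reps[hot]
        reps[hot] += 1
        after = max(l / r for l, r in zip(loads, reps))
        reps[hot] -= 1  # undo the trial replica
        return before - after, hot

    for _ in range(budget):
        # Pick the layer where one more replica helps the most.
        best = max(range(len(layer_loads)), key=lambda l: estimated_gain(l)[0])
        gain, hot = estimated_gain(best)
        if gain <= 0:
            break  # no layer benefits, so leave the remaining budget unspent
        replicas[best][hot] += 1
    return replicas


# Layer 0 is heavily skewed, layer 1 is already balanced.
print(allocate_replicas([[100, 10, 10, 10], [30, 30, 30, 30]], budget=2))
# [[3, 1, 1, 1], [1, 1, 1, 1]]
```

In this toy input both replicas go to the skewed layer and the balanced layer gets none, which mirrors the summary's point that replicating uniformly across layers spends memory where it brings no load-balance benefit.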