arxiv.org/abs/2603.28768

[2603.28768] CRAFT: Cost-aware Expert Replica Allocation with Fine-Grained Layerwise Estimations

This is the newest public snapshot for this URL and the best place to start reviewing the page.

Apr 1, 2026, 04:26 AM

Source URL

https://arxiv.org/abs/2603.28768

About this page

This page presents CRAFT, a framework for efficient expert replication in Mixture-of-Experts (MoE) architectures used in large language models. The paper addresses a problem with existing replication schemes: they often over-replicate experts, which consumes substantial GPU memory and degrades throughput. CRAFT maximizes load balance under a memory constraint by deciding replication per layer, based on the estimated benefit of each additional replica. It integrates into existing serving frameworks without additional training or model modifications. In the reported evaluations, CRAFT improves throughput by 1.14× on average (up to 1.2×) over conventional replication techniques in large-scale deployments.
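To make the idea of per-layer, benefit-driven replication concrete, here is a minimal Python sketch. It is not the algorithm from the paper, whose estimator and allocation procedure are not described on this page. It assumes a simple greedy rule: each extra replica goes to the layer where replicating the hottest expert most reduces that layer's bottleneck load, and allocation stops once no replica helps. The function names, the `budget` parameter (a stand-in for the memory constraint), and the load numbers are all hypothetical.

```python
def bottleneck(loads, replicas):
    """Highest per-replica load in one layer (load is split evenly across replicas)."""
    return max(load / reps for load, reps in zip(loads, replicas))


def next_gain(loads, replicas):
    """Estimated benefit of one more replica for the layer's hottest expert.

    Returns (gain, expert_index), where gain is the drop in bottleneck load.
    """
    hot = max(range(len(loads)), key=lambda e: loads[e] / replicas[e])
    trial = list(replicas)
    trial[hot] += 1
    return bottleneck(loads, replicas) - bottleneck(loads, trial), hot


def allocate(layer_loads, budget, min_gain=0.0):
    """Greedily spend up to `budget` extra replicas across layers (assumed rule)."""
    replicas = [[1] * len(loads) for loads in layer_loads]
    for _ in range(budget):
        gains = [next_gain(l, r) for l, r in zip(layer_loads, replicas)]
        layer = max(range(len(gains)), key=lambda i: gains[i][0])
        gain, expert = gains[layer]
        if gain <= min_gain:
            break  # no remaining replica is worth its memory cost
        replicas[layer][expert] += 1
    return replicas


# Hypothetical token counts per expert: layer 0 is skewed, layer 1 is balanced.
layer_loads = [[100, 10, 10, 10], [30, 30, 30, 30]]
print(allocate(layer_loads, budget=2))  # [[3, 1, 1, 1], [1, 1, 1, 1]]
```

In this toy input, both extra replicas go to the skewed layer and the already balanced layer receives none, which is the kind of over-replication a uniform scheme would not avoid.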

Latest save

Apr 1, 2026, 04:26 AM

First save

Apr 1, 2026, 04:26 AM

arxiv.org/abs/2603.28768 web archives are listed here. You can still review the saved screenshot and HTML even if the original page disappears.
