[2603.28768] CRAFT: Cost-aware Expert Replica Allocation with Fine-Grained Layerwise Estimations

https://arxiv.org/abs/2603.28768

Archived April 1, 2026 at 01:26 PM JST

About this page (AI generated)

This page presents CRAFT, a framework for efficient expert replication in Mixture-of-Experts (MoE) architectures used in large language models. The paper addresses the problem that existing replication schemes often over-replicate experts, which consumes substantial GPU memory and degrades throughput. CRAFT maximizes load balance under a memory constraint by deciding replication per layer, at fine granularity, according to the estimated benefit of each replica. It integrates into existing serving frameworks without additional training or model modifications. Evaluations report an average throughput improvement of 1.14× (up to 1.2×) over conventional replication techniques in large-scale deployments.
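To make the idea of benefit-driven, per-layer replication concrete, here is a minimal sketch of a greedy allocator. It is not CRAFT's algorithm, which the summary does not describe. The function names (`bottleneck`, `allocate_replicas`), the use of a replica count as a stand-in for the memory budget, and the choice of "reduction in the hottest per-replica load" as the benefit estimate are all assumptions made for illustration.

```python
from typing import List


def bottleneck(loads: List[float], replicas: List[int]) -> float:
    """Largest per-replica load in one layer (assumes load splits evenly)."""
    return max(load / reps for load, reps in zip(loads, replicas))


def allocate_replicas(layer_loads: List[List[float]], budget: int) -> List[List[int]]:
    """Hypothetical greedy allocator: spend `budget` extra replicas one at a
    time, always on the (layer, expert) pair whose extra replica lowers that
    layer's bottleneck the most. Stops early when no replica helps."""
    # Every expert starts with exactly one copy.
    replicas = [[1] * len(loads) for loads in layer_loads]

    for _ in range(budget):
        best_gain, best_layer, best_expert = 0.0, None, None
        for layer, loads in enumerate(layer_loads):
            reps = replicas[layer]
            # The hottest expert in this layer is the only candidate worth trying.
            hot = max(range(len(loads)), key=lambda e: loads[e] / reps[e])
            before = bottleneck(loads, reps)
            reps[hot] += 1  # tentatively add a replica
            gain = before - bottleneck(loads, reps)
            reps[hot] -= 1  # undo the tentative change
            if gain > best_gain:
                best_gain, best_layer, best_expert = gain, layer, hot
        if best_layer is None:
            break  # no remaining replica improves balance; keep the memory
        replicas[best_layer][best_expert] += 1

    return replicas


# Layer 0 is skewed toward one expert; layer 1 is already balanced.
example = allocate_replicas([[100, 10, 10, 10], [30, 30, 30, 30]], budget=2)
```

In this example both extra replicas go to the overloaded expert in the skewed layer, and the balanced layer receives none. That is the intuition behind replicating per layer rather than uniformly across layers.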
