arxiv.org/abs/2603.28768

This URL has 1 public save, captured on Apr 1, 2026, 04:26 AM.

Latest saved version

[2603.28768] CRAFT: Cost-aware Expert Replica Allocation with Fine-Grained Layerwise Estimations

Source URL

https://arxiv.org/abs/2603.28768

About this page

This page presents CRAFT, a framework for efficient expert replication in Mixture-of-Experts (MoE) architectures used in large language models. The paper addresses a problem with existing replication schemes: they often over-replicate experts, which consumes substantial GPU memory and degrades throughput. CRAFT maximizes load balance under a memory constraint by choosing replicas at fine granularity, layer by layer, according to the estimated benefit of each replica. It integrates into existing serving frameworks without additional training or model modifications. In evaluations on large-scale deployments, CRAFT improves throughput by 1.14× on average (up to 1.2×) over conventional replication techniques.
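The idea of per-layer, benefit-driven replication under a budget can be illustrated with a small greedy sketch. This is not the paper's algorithm, which the summary above does not spell out. It assumes a hypothetical setup in which each layer's cost is its hottest per-replica expert load, memory is counted as a number of extra replicas, and the "estimated benefit" of a replica is the drop in that bottleneck. The names `layer_bottleneck`, `allocate_replicas`, `budget`, and `min_gain` are illustrative only.

```python
from typing import List


def layer_bottleneck(loads: List[float], replicas: List[int]) -> float:
    """Hottest per-replica load in one layer (assumed proxy for layer latency)."""
    return max(load / count for load, count in zip(loads, replicas))


def allocate_replicas(
    layer_loads: List[List[float]], budget: int, min_gain: float = 0.0
) -> List[List[int]]:
    """Greedily spend up to `budget` extra replicas across all layers.

    Each step replicates the hottest expert of the layer where doing so
    lowers the bottleneck the most. Allocation stops early once the best
    available gain is not above `min_gain`, so balanced layers get nothing.
    """
    # Every expert starts with exactly one copy.
    replicas = [[1] * len(loads) for loads in layer_loads]
    for _ in range(budget):
        best = None  # (gain, layer index, expert index)
        for li, loads in enumerate(layer_loads):
            before = layer_bottleneck(loads, replicas[li])
            hot = max(range(len(loads)), key=lambda e: loads[e] / replicas[li][e])
            # Tentatively add a replica, measure the gain, then undo.
            replicas[li][hot] += 1
            gain = before - layer_bottleneck(loads, replicas[li])
            replicas[li][hot] -= 1
            if best is None or gain > best[0]:
                best = (gain, li, hot)
        if best is None or best[0] <= min_gain:
            break  # remaining replicas would cost memory without enough benefit
        _, li, hot = best
        replicas[li][hot] += 1
    return replicas


if __name__ == "__main__":
    # Layer 0 is skewed toward expert 0; layer 1 is already balanced.
    loads = [[100.0, 10.0, 10.0, 10.0], [30.0, 30.0, 30.0, 30.0]]
    print(allocate_replicas(loads, budget=2))
    print(allocate_replicas(loads, budget=2, min_gain=20.0))
```

In this toy model the skewed layer absorbs the replicas and the balanced layer receives none, which is the contrast with uniform, per-layer-identical replication. One limitation of the sketch: when several experts in a layer tie as hottest, a single extra replica yields no gain, so the greedy step skips that layer even though two replicas together would help.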
