deepseek: 2024 DeepSeek-V3 Technical Report (English Edition).pdf
Resource Overview
We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters, of which 37B are activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities.
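The auxiliary-loss-free load-balancing strategy mentioned in the abstract can be pictured with a short sketch. The snippet below is a simplified, framework-free illustration of the general idea (a per-expert bias that steers top-k expert selection and is nudged between steps instead of adding a balance loss), not the paper's implementation; the expert count, top-k value, bias step size, and the helper names `route` and `update_bias` are all assumed for illustration.

```python
# Minimal sketch (assumptions noted above) of bias-adjusted top-k routing:
# the bias is added to an expert's score only when choosing the top-k experts,
# while the gating weights still come from the raw scores, so no auxiliary
# balance loss enters the training objective.
import random

NUM_EXPERTS = 8    # hypothetical, far smaller than the real model
TOP_K = 2          # experts activated per token in this toy example
BIAS_STEP = 0.001  # assumed update speed for the per-expert bias

bias = [0.0] * NUM_EXPERTS  # routing bias, adjusted heuristically, not by gradients


def route(token_scores):
    """Select TOP_K experts using biased scores; gate with the raw scores."""
    biased = [s + b for s, b in zip(token_scores, bias)]
    chosen = sorted(range(NUM_EXPERTS), key=lambda e: biased[e], reverse=True)[:TOP_K]
    total = sum(token_scores[e] for e in chosen)
    return {e: token_scores[e] / total for e in chosen}  # unbiased gating weights


def update_bias(expert_load):
    """Lower the bias of overloaded experts and raise it for underloaded ones."""
    mean_load = sum(expert_load) / NUM_EXPERTS
    for e in range(NUM_EXPERTS):
        if expert_load[e] > mean_load:
            bias[e] -= BIAS_STEP
        else:
            bias[e] += BIAS_STEP


# Toy usage: route a batch of random token scores, then rebalance the biases.
load = [0] * NUM_EXPERTS
for _ in range(1024):
    scores = [random.random() for _ in range(NUM_EXPERTS)]
    for e in route(scores):
        load[e] += 1
update_bias(load)
```

Because the bias only reshapes which experts get picked, the sketch keeps the training loss untouched, which is the point of calling the strategy auxiliary-loss-free; the actual report should be consulted for the precise update rule and scale.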