
State Key Laboratory of Information Photonics and Optical Communications, Beijing University of Posts and Telecommunications, Beijing 100876, China
WANG Dongrou (2001-), female, from Xingtai, Hebei Province. Master's candidate; her main research interest is the distributed training of large models.
LI Xin, Associate Professor. E-mail: xinli@bupt.edu.cn
Received: 2025-02-27; Revised: 2025-05-21; Published in print: 2025-12-10
Wang D R, Li X, Zhao C Y, et al. A Review of Optical Networking Technologies for Distributed Training of Large Models[J]. Study on Optical Communications, 2025(6): 250050. DOI: 10.13756/j.gtxyj.2025.250050.
In recent years, the wide application of ultra-large-scale deep learning models has driven the rapid development of artificial intelligence. However, as model parameter counts have leapt from tens of millions to trillions, a single machine or a single accelerator can no longer meet training demands, and distributed training of large models has emerged in response. Optical networks, with their ultra-high bandwidth and low latency, offer a promising solution for distributed training. This paper first analyzes the parallelization techniques and communication requirements of large-model distributed training and introduces the network topologies commonly used for it. It then examines the application of optical networking to large-model distributed training at two levels: the architectures and optimization algorithms within intelligent computing centers, and those between intelligent computing centers. Finally, it proposes possible directions for future research, aiming to provide scholars with an outline of the field and a reference for further study.
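
To make the communication requirements mentioned in the abstract concrete, the sketch below is a minimal, single-process Python simulation of ring all-reduce, the collective widely used to synchronize gradients in data-parallel training. The worker count, gradient length, and the helper name ring_all_reduce are illustrative assumptions for this sketch, not details taken from the paper.

```python
# Minimal single-process simulation of ring all-reduce, the collective
# commonly used to synchronize gradients in data-parallel training.
# Worker count and gradient length below are illustrative assumptions.
import numpy as np


def ring_all_reduce(grads):
    """Sum equal-length gradient vectors (one per worker) over a logical ring.

    Returns a list in which every worker holds the elementwise sum.
    """
    n = len(grads)
    # Each worker splits its gradient into n chunks.
    chunks = [list(np.array_split(g.astype(float), n)) for g in grads]

    # Phase 1: reduce-scatter. In step s, worker i forwards chunk
    # (i - s) mod n to its right neighbor, which accumulates it. After
    # n - 1 steps, worker i holds the fully reduced chunk (i + 1) mod n.
    for s in range(n - 1):
        for i in range(n):
            c = (i - s) % n
            chunks[(i + 1) % n][c] = chunks[(i + 1) % n][c] + chunks[i][c]

    # Phase 2: all-gather. Each fully reduced chunk circulates around the
    # ring so that every worker ends up with every summed chunk.
    for s in range(n - 1):
        for i in range(n):
            c = (i + 1 - s) % n
            chunks[(i + 1) % n][c] = chunks[i][c].copy()

    return [np.concatenate(c) for c in chunks]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    per_worker = [rng.standard_normal(8) for _ in range(4)]  # 4 workers
    reduced = ring_all_reduce(per_worker)
    expected = np.sum(per_worker, axis=0)
    assert all(np.allclose(r, expected) for r in reduced)
    print("every worker holds the summed gradient:", reduced[0])
```

In this pattern each worker transmits roughly 2(N-1)/N times its gradient volume in total, nearly independent of the ring size, which is one reason the collective pairs well with the high-bandwidth, low-latency optical interconnects surveyed in this paper.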