ASCENT: Collaborative Research: Scaling Distributed AI Systems based on Universal Optical I/O

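Basic information

  • Award number:
    2023468
  • Principal investigator:
  • Amount:
    $325,000
  • Host institution:
  • Host institution country:
    United States
  • Project type:
    Standard Grant
  • Fiscal year:
    2020
  • Funding country:
    United States
  • Start and end dates:
    2020-08-15 to 2023-07-31
  • Project status:
    Completed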

Project abstract

Our society is rapidly becoming reliant on neural-network-based artificial intelligence computation. New algorithms are invented daily, increasing the memory and computational requirements for both inference and training. This explosive growth has created an enormous demand for distributed machine learning (ML) training and inference. Estimates by OpenAI illustrate steady growth in computational requirements of 100x every two years since 2012, which is 50x faster than the rate of computational improvement previously enabled by Moore's Law in the semiconductor industry over the last half-century. This new computation demand has been partly met by the rapid development of hardware accelerators and software stacks to support these specialized computations. Hardware accelerators have provided significant speed-up, but today's training tasks can still take days or even weeks. The reason is that as the number of workers (e.g., compute nodes) increases, the computation time per worker decreases, but the communication requirements between the nodes increase, creating a bottleneck in the interconnect between the compute nodes. Future distributed ML systems will require 1-2 orders of magnitude higher interconnect bandwidth per node, creating a pressing need for entirely new ways to build interconnects for distributed ML systems. This proposal aims to create a new paradigm for scaling distributed ML computation by developing a scalable interconnect solution based on advances in integrated electronic and photonic technology that enable direct node-to-node optical fiber connectivity. The proposed cross-stack, collaborative, multi-disciplinary work will enable the education and training of a unique crop of engineers and scientists who cross the boundaries of machine learning, networking, and electronic-photonic systems and devices, and who are in severe demand. The principal investigators have an established track record of direct engagement with high-school students, providing summer internships at the Berkeley Wireless Research Center and MIT's Women's Technology Program, as well as exemplary undergraduate research activities at Boston University. The educational and outreach activities the PIs have put in place will ensure early exposure and continued training of a new generation of leaders in this field, from K-12, through undergraduate and graduate studies, and continuing workforce education, with a special focus on underrepresented students.

The interconnect has emerged as the key bottleneck in enabling the full potential of distributed ML. Future ML workloads are likely to require tens of Tbps of bandwidth per device. Ubiquitous deployment of logically connected, physically distributed computation across shelf, rack, and row scale can only be enabled by a new universal I/O that enables socket-to-socket communication at the energy, latency, and bandwidth density of in-package interconnects. No such technology currently exists. Silicon-photonics-based optical I/O has the potential to address this critical challenge, but fundamental advances, from chip manufacturing to routing algorithms, are still needed to ensure the scalability of these interconnect systems. To enable high bandwidth density and energy efficiency, dense wavelength division multiplexing must be used. High-efficiency ring-resonator-based modulators and comb laser sources are needed to enable Tbps rates over each fiber connection and socket bandwidth scaling from 10s to 100s of Tbps. New link architectures, like the proposed laser-forwarded coherent link, are needed to enable high-efficiency external centralized comb laser sources with modest (sub-mW) power per wavelength per fiber port. The proposed work will also develop new scheduling algorithms, network architectures, and workload parallelism strategies to leverage the bandwidth density and low latency of the universal optical I/O, to map large AI workloads with massive datasets to a scalable distributed compute system.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
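
The abstract's scaling argument (per-worker compute time shrinks as workers are added, while inter-node communication does not) can be illustrated with a toy model. The sketch below is not taken from the project: it assumes plain data-parallel training whose gradients are synchronized with a ring all-reduce, no overlap between compute and communication, and hypothetical numbers for compute time, gradient size, and link bandwidth. The function names are illustrative only.

```python
def step_times(workers, single_worker_compute_s, gradient_bytes, link_bytes_per_s):
    """Return (compute_s, comm_s) for one training step in a toy data-parallel model."""
    if workers < 1:
        raise ValueError("workers must be >= 1")
    # Assumption: compute divides evenly across workers.
    compute_s = single_worker_compute_s / workers
    # Ring all-reduce: each worker sends and receives 2*(n-1)/n of the gradient bytes.
    comm_s = 2 * (workers - 1) / workers * gradient_bytes / link_bytes_per_s
    return compute_s, comm_s


def comm_fraction(workers, single_worker_compute_s, gradient_bytes, link_bytes_per_s):
    """Fraction of a non-overlapped step that is spent communicating."""
    compute_s, comm_s = step_times(
        workers, single_worker_compute_s, gradient_bytes, link_bytes_per_s
    )
    return comm_s / (compute_s + comm_s)


if __name__ == "__main__":
    # Hypothetical numbers: 8 s of single-worker compute per step,
    # 1 GB of gradients, 10 GB/s of per-node link bandwidth.
    for n in (1, 8, 64, 512):
        frac = comm_fraction(n, 8.0, 1e9, 1e10)
        print(f"{n:4d} workers: {frac:.1%} of step time is communication")
```

Under these assumptions the communication term approaches a constant while the compute term keeps shrinking, so communication dominates at scale. Raising `link_bytes_per_s` by one or two orders of magnitude is the knob that pushes the fraction back down, which is the kind of per-node bandwidth increase the abstract calls for.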

Project outcomes

Journal papers (4)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Emerging Optical Interconnects for AI Systems
  • DOI:
    10.1364/ofc.2022.th1g.1
  • Published:
    2022
  • Journal:
  • Impact factor:
    0
  • Authors:
    Ghobadi, Manya
  • Corresponding author:
    Ghobadi, Manya
TopoOpt: Co-optimizing Network Topology and Parallelization Strategy for Distributed Training Jobs
  • DOI:
  • Published:
    2022-02
  • Journal:
  • Impact factor:
    0
  • Authors:
    Weiyang Wang; Moein Khazraee; Zhizhen Zhong; M. Ghobadi; Zhihao Jia; Dheevatsa Mudigere; Ying Zhang
  • Corresponding author:
    Weiyang Wang; Moein Khazraee; Zhizhen Zhong; M. Ghobadi; Zhihao Jia; Dheevatsa Mudigere; Ying Zhang
Demonstration of WDM-Enabled Ultralow-Energy Photonic Edge Computing
  • DOI:
    10.1364/ofc.2022.th3a.3
  • Published:
    2022
  • Journal:
  • Impact factor:
    0
  • Authors:
    Sludds, Alexander; Hamerly, Ryan; Bandyopadhyay, Saumil; Zhong, Zhizhen; Chen, Zaijun; Bernstein, Liane; Ghobadi, Manya; Englund, Dirk
  • Corresponding author:
    Englund, Dirk
SiP-ML: high-bandwidth optical network interconnects for machine learning training
  • DOI:
    10.1145/3452296.3472900
  • Published:
    2021-08
  • Journal:
  • Impact factor:
    0
  • Authors:
    Mehrdad Khani Shirkoohi; M. Ghobadi; M. Alizadeh; Ziyi Zhu; M. Glick; K. Bergman; A. Vahdat; Benjamin Klenk-Ben
  • Corresponding author:
    Mehrdad Khani Shirkoohi; M. Ghobadi; M. Alizadeh; Ziyi Zhu; M. Glick; K. Bergman; A. Vahdat; Benjamin Klenk-Ben

Other grants by Manya Ghobadi

CAREER: Large-scale Dynamic Reconfigurable Networks
  • Award number:
    2144766
  • Fiscal year:
    2022
  • Funding amount:
    $325,000
  • Project type:
    Continuing Grant
Collaborative Research: CNS Core: Medium: A Stateful Switch Architecture for In-Network Compute
  • Award number:
    2211382
  • Fiscal year:
    2022
  • Funding amount:
    $325,000
  • Project type:
    Standard Grant
Collaborative Research: SHF: Medium: Spatial Multi-Tenant Neural Acceleration for Next Generation Datacenters
  • Award number:
    2107244
  • Fiscal year:
    2021
  • Funding amount:
    $325,000
  • Project type:
    Continuing Grant
Collaborative Research: CNS Core: Small: A Principled Framework for Workload Distribution Techniques in Large-Scale Networks
  • Award number:
    2008624
  • Fiscal year:
    2020
  • Funding amount:
    $325,000
  • Project type:
    Standard Grant

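Similar NSFC grants

Research on the relationships among types of team human capital hierarchical structure, team collaboration processes, and team effectiveness outcomes in the context of digital intelligence
  • Award number:
    72372084
  • Approval year:
    2023
  • Funding amount:
    CNY 400,000
  • Project type:
    General Program
Research on intelligent planning and interactive collaboration for craniomaxillofacial surgical robot-assisted distraction osteogenesis in hemifacial microsomia
  • Award number:
  • Approval year:
    2023
  • Funding amount:
    CNY 300,000
  • Project type:
Research on key technologies for multi-agent manufacturing systems oriented toward autonomous cognition and swarm-intelligence collaboration
  • Award number:
    52305539
  • Approval year:
    2023
  • Funding amount:
    CNY 300,000
  • Project type:
    Young Scientists Fund
Research on integrated methods for multi-collaborative green information sensing and intelligent response decision-making in the large-scale Internet of Things
  • Award number:
    62371149
  • Approval year:
    2023
  • Funding amount:
    CNY 490,000
  • Project type:
    General Program
Research on concurrent charging models and service mechanisms for large-scale sensor networks with multi-UAV collaboration
  • Award number:
    62362017
  • Approval year:
    2023
  • Funding amount:
    CNY 320,000
  • Project type:
    Fund for Less Developed Regions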

Similar overseas grants

Collaborative Research: How faithfully are melt embayments wedded to magma ascent?
  • Award number:
    2221896
  • Fiscal year:
    2022
  • Funding amount:
    $325,000
  • Project type:
    Standard Grant
Collaborative Research: SWIFT: Context-aware Spectrum Coexistence dEsign aNd implemenTation in satellite bands (ASCENT)
  • Award number:
    2245910
  • Fiscal year:
    2022
  • Funding amount:
    $325,000
  • Project type:
    Standard Grant
Achieving Equity through SocioCulturally-informed, Digitally-Enabled Cancer Pain managemeNT (ASCENT) Clinical Trial
  • Award number:
    10539159
  • Fiscal year:
    2022
  • Funding amount:
    $325,000
  • Project type:
Collaborative Research: SWIFT: Context-aware Spectrum Coexistence dEsign aNd implemenTation in satellite bands (ASCENT)
  • Award number:
    2128540
  • Fiscal year:
    2021
  • Funding amount:
    $325,000
  • Project type:
    Standard Grant
Collaborative Research: SWIFT: Context-aware Spectrum Coexistence dEsign aNd implemenTation in satellite bands (ASCENT)
  • Award number:
    2128584
  • Fiscal year:
    2021
  • Funding amount:
    $325,000
  • Project type:
    Standard Grant