NVIDIA, Microsoft team up to build large-scale AI cloud computer


Monday, 21 November, 2022


NVIDIA has announced a multi-year collaboration with Microsoft to build a powerful AI supercomputer that combines Microsoft Azure’s advanced supercomputing infrastructure with NVIDIA GPUs, networking and a full stack of AI software, helping enterprises train, deploy and scale AI, including large, state-of-the-art models. Azure’s cloud-based AI supercomputer includes powerful and scalable ND- and NC-series virtual machines optimised for distributed AI training and inference. Azure is reportedly the first public cloud to incorporate NVIDIA’s advanced AI stack, adding tens of thousands of NVIDIA A100 and H100 GPUs, NVIDIA Quantum-2 400 Gb/s InfiniBand networking and the NVIDIA AI Enterprise software suite to its platform.

As part of the collaboration, NVIDIA will use Azure’s scalable virtual machine instances to research and further accelerate advances in generative AI, an emerging area of AI in which foundation models like Megatron-Turing NLG 530B serve as the basis for unsupervised, self-learning algorithms that create new text, code, digital images, video or audio. The companies will also collaborate to optimise Microsoft’s DeepSpeed deep learning optimisation software. NVIDIA’s full stack of AI workflows and software development kits, optimised for Azure, will be made available to Azure enterprise customers.
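
The article does not include code, but as a rough illustration of how DeepSpeed is typically used, the sketch below wraps a PyTorch model with the DeepSpeed engine. The model class, layer sizes and configuration values are illustrative assumptions, not the Azure-specific optimisations described here.

    # Minimal sketch, assuming a toy model and illustrative config values;
    # not the Azure-tuned DeepSpeed setup described in the article.
    import torch
    import deepspeed

    class TinyModel(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(1024, 1024)

        def forward(self, x):
            return self.layer(x)

    ds_config = {
        "train_batch_size": 32,
        "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
        "fp16": {"enabled": True},           # mixed precision on the GPU
        "zero_optimization": {"stage": 2},   # shard optimizer state and gradients
    }

    model = TinyModel()
    # deepspeed.initialize returns an engine that handles data parallelism,
    # mixed precision and optimizer-state sharding.
    engine, optimizer, _, _ = deepspeed.initialize(
        model=model,
        model_parameters=model.parameters(),
        config=ds_config,
    )

    x = torch.randn(4, 1024, device=engine.device, dtype=torch.half)
    loss = engine(x).sum()
    engine.backward(loss)   # DeepSpeed handles loss scaling and gradient sync
    engine.step()

A script like this would normally be started with the DeepSpeed launcher (for example, deepspeed train_sketch.py), which sets up the distributed environment across the available GPUs.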

Manuvir Das, vice president of enterprise computing at NVIDIA, said AI technology advances, as well as industry adoption, are accelerating, with the breakthrough of foundation models triggering a tidal wave of research, fostering startups and enabling new enterprise applications. “Our collaboration with Microsoft will provide researchers and companies with state-of-the-art AI infrastructure and software to capitalise on the transformative power of AI,” Das said.

Scott Guthrie, executive vice president of the Cloud + AI Group at Microsoft, said that AI is fuelling the next wave of automation across enterprises and industrial computing, enabling organisations to do more with less as they navigate economic uncertainties. “Our collaboration with NVIDIA unlocks the world’s most scalable supercomputer platform, which delivers state-of-the-art AI capabilities for every enterprise on Microsoft Azure,” Guthrie said.

Microsoft Azure’s AI-optimised virtual machine instances are architected with NVIDIA’s advanced data centre GPUs and incorporate NVIDIA Quantum-2 400 Gb/s InfiniBand networking, enabling customers to deploy thousands of GPUs in a single cluster to train large language models, build complex recommender systems and run generative AI at scale. The current Azure instances feature NVIDIA Quantum 200 Gb/s InfiniBand networking with NVIDIA A100 GPUs; future instances will integrate NVIDIA Quantum-2 400 Gb/s InfiniBand networking and NVIDIA H100 GPUs. Combined with Azure’s advanced compute cloud infrastructure, networking and storage, these AI-optimised offerings will provide scalable performance for AI training and deep learning inference workloads of any size. The platform will also support a range of AI applications and services, including Microsoft DeepSpeed and the NVIDIA AI Enterprise software suite.
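
As a hedged sketch of what training across such a GPU cluster typically involves, the snippet below initialises PyTorch distributed training with the NCCL backend, which uses InfiniBand transport when the fabric is available. The layer sizes are illustrative, and the rendezvous environment variables are assumed to be set by a cluster launcher such as torchrun; nothing here is Azure-specific.

    # Minimal sketch of multi-GPU training with PyTorch's NCCL backend,
    # assuming RANK, WORLD_SIZE, MASTER_ADDR and LOCAL_RANK are set by
    # the launcher (e.g. torchrun). Sizes are illustrative only.
    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def main():
        dist.init_process_group(backend="nccl")          # one process per GPU
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        model = torch.nn.Linear(4096, 4096).cuda(local_rank)
        ddp_model = DDP(model, device_ids=[local_rank])  # gradient sync runs over NCCL

        x = torch.randn(8, 4096, device=f"cuda:{local_rank}")
        ddp_model(x).sum().backward()                    # all-reduce across the cluster

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()

Launched with torchrun (for example, torchrun --nnodes=4 --nproc_per_node=8 train.py), each process drives one GPU and gradients are averaged across all nodes over the interconnect.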

Microsoft DeepSpeed will leverage the NVIDIA H100 Transformer Engine to accelerate transformer-based models used for large language models, generative AI and writing computer code, among other applications. The technology applies the H100’s 8-bit floating point (FP8) precision to DeepSpeed, accelerating AI calculations for transformers at twice the throughput of 16-bit operations.
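
The DeepSpeed integration itself is not shown in the article. As a generic, hedged illustration of the H100 FP8 path, the sketch below uses NVIDIA’s Transformer Engine library directly, with an illustrative layer size; it is not Microsoft’s DeepSpeed integration.

    # Minimal sketch of FP8 compute via NVIDIA's Transformer Engine on an
    # H100-class GPU. Illustrative only; not the DeepSpeed integration.
    import torch
    import transformer_engine.pytorch as te

    hidden = 4096                          # illustrative size (FP8 GEMMs want dims divisible by 16)
    layer = te.Linear(hidden, hidden, bias=True).cuda()
    x = torch.randn(32, hidden, device="cuda")

    # fp8_autocast runs supported ops in 8-bit floating point,
    # roughly doubling throughput versus 16-bit operations on H100.
    with te.fp8_autocast(enabled=True):
        y = layer(x)

    y.sum().backward()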

NVIDIA AI Enterprise is certified and supported on Microsoft Azure instances with NVIDIA A100 GPUs; support for Azure instances with NVIDIA H100 GPUs will be added in a future software release. NVIDIA AI Enterprise, which includes the NVIDIA Riva speech AI and NVIDIA Morpheus cybersecurity application frameworks, streamlines the AI workflow, from data processing and AI model training to simulation and large-scale deployment.

Image credit: iStock.com/gorodenkoff
