Fostering AI infrastructure advancements through standardization

17 Oct

Since joining in 2014, Microsoft has actively engaged with the Open Compute Project (OCP) community to apply the benefits of open-source collaboration to hardware, resulting in hyperscale innovation across the industry. At last year’s OCP Global Summit, we introduced Project Caliptra, a new security offering in partnership with industry leaders including Google, AMD, and NVIDIA, and a new modular chassis design (Mt. Shasta) to bring form factor, power, and management interface into one converged design. These built upon many other contributions in the areas of rack-level architecture, security, and systems-level design.

With the rise of generative AI, the computing industry now faces a significant challenge: to evolve our underlying building blocks to meet the increasing infrastructure demands. At this year’s OCP Global Summit, Microsoft will share our latest contributions to supercomputing architecture and hardware intended to support this new era through standardization and innovation.

GPU and accelerator standardization for rapid adoption in hyperscaler fleets

Thanks to the growing number of generative AI applications, datacenters have increasingly adopted graphics processing units (GPUs) and accelerators. The resulting range of new products, each with its own integration requirements, has forced hyperscalers to invest in bespoke processes and tools to adapt various AI hardware for their fleets. To address this challenge, we are pleased to join a collaborative effort with AMD, Google, Meta, and NVIDIA to create OCP standard requirements for GPU management.

Standardization allows suppliers to collaborate seamlessly with hyperscalers, and enables hyperscalers to host hardware from various suppliers in their datacenters within an accelerated timeframe. This new OCP initiative focuses on two models of accelerators and GPU cards: Universal Base Board and Discrete. Initial specifications have been driven by different OCP workgroups focused on GPU firmware update requirements, interfaces, and Reliability, Availability and Serviceability (RAS) requirements for hardware.

This is a pioneering approach that treats compliance as a fundamental catalyst for innovation: the standardized requirements are captured in a common OCP tool that provides acceptance testing for accelerator management in cloud datacenters.
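To make the idea of requirement-driven acceptance testing concrete, here is a minimal sketch of how a shared tool might check a device against a list of standardized requirements. Everything in it is hypothetical and for illustration only: the requirement IDs, the capability names, and the `run_acceptance` function are assumptions and are not taken from any OCP specification or tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirement:
    req_id: str      # hypothetical identifier, not from an OCP spec
    area: str        # workgroup area, e.g. "firmware-update", "interfaces", "RAS"
    capability: str  # capability the device is expected to report


# Hypothetical requirements covering the three areas named in the text.
REQUIREMENTS = [
    Requirement("FW-1", "firmware-update", "signed_firmware_update"),
    Requirement("IF-1", "interfaces", "out_of_band_telemetry"),
    Requirement("RAS-1", "RAS", "error_reporting"),
]


def run_acceptance(device_capabilities, requirements=REQUIREMENTS):
    """Return (passed, failed_ids) for a set of reported capabilities."""
    failed = [r.req_id for r in requirements
              if r.capability not in device_capabilities]
    return (not failed, failed)


if __name__ == "__main__":
    reported = {"signed_firmware_update", "error_reporting"}
    passed, failed = run_acceptance(reported)
    print("passed:", passed, "failed:", failed)
```

The point of the sketch is that the requirement list is shared data: a supplier and every hyperscaler can run the same checks, rather than each fleet maintaining its own bespoke test.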

Optimizing AI performance and efficiency with MX data formats

As AI continues to be applied to every aspect of our lives, the need for more efficient, scalable, and cost-effective AI systems is evident. Meeting it requires optimization across the AI stack, including advancements in narrow-precision AI data formats to address the rapidly growing complexity and requirements of current AI models. Advances in AI hardware such as these narrow-precision formats, together with the associated optimized algorithms, create unprecedented opportunities to address fundamental challenges in maintaining scalable and sustainable AI solutions.

Earlier this year, Microsoft partnered with AMD, Arm, Intel, Meta, NVIDIA and Qualcomm to form the Microscaling Formats (MX) Alliance with the goal of creating and standardizing next-generation 6- and 4-bit data types for AI training and inferencing. Building on years of design space exploration and research at Microsoft, Microscaling technology enables sub-8-bit formats while also enhancing the strength and ease-of-use of existing 8-bit formats such as FP8 and INT8. These advancements also contribute to broader sustainability goals: by improving the energy efficiency of AI in datacenters and on many AI endpoints, they help reduce the environmental impact of AI technologies as demand continues to grow.

The Microscaling Formats (MX) Specification v1.0 released through OCP introduces four common data formats (MXFP8, MXFP6, MXFP4, and MXINT8) that are compatible with current AI stacks, support implementation flexibility across both hardware and software, and enable fine-grained Microscaling at the hardware level. Extensive studies from Microsoft’s AI team confirm that MX formats can be deployed easily for many diverse, real-world cases such as language models, computer vision, and recommender systems. MX technology also enables LLM pre-training at 6- and 4-bit precisions without modifications to conventional training recipes. In addition to the initial specification, a whitepaper and emulation libraries have also been published with more details.
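The core idea behind Microscaling, a small block of narrow elements sharing one scale factor, can be illustrated with a short sketch. The code below is a simplified emulation in the spirit of MXINT8, not the bit-exact encoding from the specification and not the published emulation library. It assumes a block of 32 elements and a shared power-of-two scale stored as an integer exponent; the function names are hypothetical.

```python
import math

BLOCK_SIZE = 32  # assumption: number of elements that share one scale
INT8_MAX = 127   # largest magnitude kept for each 8-bit element


def quantize_block(values):
    """Quantize one block to (shared_exponent, list of int8-range ints)."""
    max_abs = max((abs(v) for v in values), default=0.0)
    if max_abs == 0.0:
        return 0, [0] * len(values)
    # Smallest power-of-two scale such that max_abs / scale fits in int8.
    exponent = math.ceil(math.log2(max_abs / INT8_MAX))
    scale = 2.0 ** exponent
    quantized = [max(-INT8_MAX, min(INT8_MAX, round(v / scale)))
                 for v in values]
    return exponent, quantized


def dequantize_block(exponent, quantized):
    """Recover approximate real values from one quantized block."""
    scale = 2.0 ** exponent
    return [q * scale for q in quantized]


def quantize(values, block_size=BLOCK_SIZE):
    """Split a flat list into blocks and quantize each independently."""
    return [quantize_block(values[i:i + block_size])
            for i in range(0, len(values), block_size)]


def dequantize(blocks):
    """Concatenate the dequantized values of all blocks."""
    out = []
    for exponent, quantized in blocks:
        out.extend(dequantize_block(exponent, quantized))
    return out


if __name__ == "__main__":
    # Two blocks with very different magnitudes get different scales.
    data = [math.sin(i) * (10.0 if i < 32 else 0.01) for i in range(64)]
    blocks = quantize(data)
    restored = dequantize(blocks)
    print("shared exponents:", [e for e, _ in blocks])
    print("max abs error:", max(abs(a - b) for a, b in zip(data, restored)))
```

Because each block carries its own scale, a block of small values is not forced onto the coarse grid needed by a block of large values. That per-block scaling is what lets narrow element types keep useful precision. Under these assumptions, the shared 8-bit exponent adds 8/32 = 0.25 bits of overhead per element.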

OCP-SAFE: Strengthening datacenter security and transparency

Today’s datacenter infrastructure includes a diverse array of processing devices and peripherals that run firmware. Ensuring the security of this firmware is of paramount importance, demanding rigorous verification of the code quality and supply chain provenance. 

To meet the unique security demands of Cloud Service Providers and other market segments, many datacenter providers have opted for in-house or third-party security audits on device firmware. However, this approach often confines security assurances to individual cloud providers. 

To address this challenge, Microsoft and Google collaborated with OCP to introduce the OCP Security Appraisal Framework Enablement (OCP-SAFE). This framework standardizes security requirements and integrates Security Review Providers (SRP) to offer independent assurances, empowering hardware manufacturers to meet security standards across markets while enhancing product quality. 

OCP-SAFE also opens doors for end-users by providing concise assessment results, eliminating barriers to obtaining hardware security assurance. Datacenter operators and consumers alike can use these assessments to make informed deployment decisions about the security of components. Several companies, including AMD and SK hynix, have embraced OCP-SAFE, publishing concise security audits.

For more information on OCP-SAFE review, visit our technical blog.  

We welcome attendees of this year’s OCP Global Summit to visit Microsoft at booth #B7 to explore our latest cloud hardware demonstrations featuring contributions with partners in the OCP community, including:  

  • Virtual Client library for Azure: an open source, standardized library of industry benchmarks and cloud customer workloads from Microsoft.
  • Caliptra 1.0: The newest design for our specification of Caliptra, an open source, reusable silicon IP block for Root of Trust for Measurement (RTM).
  • Shasta Open Rack V3 Modular Chassis: The latest open source modular chassis design for the Shasta Open Rack.
  • QSFPDD 1.6T: A new backwards-compatible form factor specification providing aggregate bandwidth capacity of 1.6 Tbps and mated performance at 224 Gbps using PAM4.

The post Fostering AI infrastructure advancements through standardization appeared first on Azure Blog.