NVIDIA Debuts Ampere Architecture with NVIDIA A100 & DGX A100 – A Game Changer for AI & HPC Workloads

Because of the COVID-19 outbreak, GTC 2020 shifted to an all-virtual conference, and the long-anticipated announcement of NVIDIA’s next-generation GPU architecture was delayed. Today, NVIDIA CEO Jensen Huang took the virtual stage in his keynote to announce NVIDIA’s all-new GPU architecture, Ampere, and the first products built on it. At the center are a new GPU, the NVIDIA A100, and a new system, the NVIDIA DGX™ A100.

Five Key Innovations of the NVIDIA A100 Tensor Core GPU

The NVIDIA A100 GPU is a technical design breakthrough fueled by five key innovations:

  • NVIDIA Ampere architecture — At the heart of A100 is the NVIDIA Ampere GPU architecture, which contains more than 54 billion transistors, making it the world’s largest 7-nanometer processor.
  • Tensor Cores with TF32 — NVIDIA’s widely adopted Tensor Cores are now more flexible, faster and easier to use. Their expanded capabilities include TF32 (TensorFloat-32), a new format for AI that delivers up to 20x the AI compute of FP32 on the previous generation, without any code changes. In addition, Tensor Cores now support FP64, delivering up to 2.5x more compute than the previous generation for HPC applications.
  • Multi-instance GPU — MIG, a new technical feature, enables a single A100 GPU to be partitioned into as many as seven GPU instances so it can deliver varying degrees of compute for jobs of different sizes, providing optimal utilization and maximizing return on investment.
  • Third-generation NVIDIA NVLink — Doubles the high-speed connectivity between GPUs to provide efficient performance scaling in a server.
  • Structural Sparsity — This new efficiency technique harnesses the inherently sparse nature of AI math to double inference compute.
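
The structural sparsity item above can be illustrated with a small sketch. It assumes the 2:4 pattern NVIDIA describes for A100 (at most two nonzero values in every group of four weights) and uses plain magnitude-based selection to decide which values survive. The function name `prune_2_of_4` is hypothetical and this is not NVIDIA's tooling; real workflows typically fine-tune the network after pruning.

```python
from typing import List


def prune_2_of_4(weights: List[float]) -> List[float]:
    """Zero the two smallest-magnitude values in every group of four.

    Illustrative only: assumes the 2:4 fine-grained pattern and simple
    magnitude-based selection.
    """
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    pruned: List[float] = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.0]
sparse_row = prune_2_of_4(row)
print(sparse_row)  # [0.9, 0.0, 0.0, -0.7, 0.0, 0.3, -0.4, 0.0]
```

Because half of every group is guaranteed to be zero, hardware that understands the pattern can skip those multiplications, which is where the "double inference compute" figure comes from.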

NVIDIA DGX A100 Now Available

The NVIDIA A100 Tensor Core GPU delivers the next giant leap in NVIDIA’s accelerated data center platform, providing unmatched acceleration at every scale and enabling innovators to do their life’s work within their lifetime. A100 powers numerous application areas including HPC, genomics, 5G, rendering, deep learning, data analytics, data science, and robotics.

A100-main-2-1024x413.png


NVIDIA today set out a vision for the next generation of computing that shifts the focus of the global information economy from servers to a new class of powerful, flexible data centers.

NVIDIA A100 Specs Compared with Previous Generation Server GPUs

| Product Architecture | Pascal P100 | Volta V100 | NVIDIA A100 |
|---|---|---|---|
| GPU Codename | GP100 | GV100 | GA100 |
| GPU Architecture | NVIDIA Pascal | NVIDIA Volta | NVIDIA Ampere |
| GPU Board Form Factor | SXM2 | SXM2 | SXM4 |
| SMs | 56 | 80 | 108 |
| TPCs | 28 | 40 | 54 |
| FP32 Cores / SM | 64 | 64 | 64 |
| FP32 Cores / GPU | 3584 | 5120 | 6912 |
| FP64 Cores / SM | 32 | 32 | 32 |
| FP64 Cores / GPU | 1792 | 2560 | 3456 |
| INT32 Cores / SM | NA | 64 | 64 |
| INT32 Cores / GPU | NA | 5120 | 6912 |
| Tensor Cores / SM | NA | 8 | 4 (2) |
| Tensor Cores / GPU | NA | 640 | 432 |
| GPU Boost Clock | 1480 MHz | 1530 MHz | 1410 MHz |
| Peak FP16 Tensor TFLOPS with FP16 Accumulate (1) | NA | 125 | 312/624 (3) |
| Peak FP16 Tensor TFLOPS with FP32 Accumulate (1) | NA | 125 | 312/624 (3) |
| Peak BF16 Tensor TFLOPS with FP32 Accumulate (1) | NA | NA | 312/624 (3) |
| Peak TF32 Tensor TFLOPS (1) | NA | NA | 156/312 (3) |
| Peak FP64 Tensor TFLOPS (1) | NA | NA | 19.5 |
| Peak INT8 Tensor TOPS (1) | NA | NA | 624/1248 (3) |
| Peak INT4 Tensor TOPS (1) | NA | NA | 1248/2496 (3) |
| Peak FP16 TFLOPS (1) | 21.2 | 31.4 | 78 |
| Peak BF16 TFLOPS (1) | NA | NA | 39 |
| Peak FP32 TFLOPS (1) | 10.6 | 15.7 | 19.5 |
| Peak FP64 TFLOPS (1) | 5.3 | 7.8 | 9.7 |
| Peak INT32 TOPS (1) | NA | 15.7 | 19.5 |
| Texture Units | 224 | 320 | 432 |
| Memory Interface | 4096-bit HBM2 | 4096-bit HBM2 | 5120-bit HBM2 |
| Memory Size | 16 GB | 32 GB / 16 GB | 40 GB |
| Memory Data Rate | 703 MHz DDR | 877.5 MHz DDR | 1215 MHz DDR |
| Memory Bandwidth | 720 GB/sec | 900 GB/sec | 1.6 TB/sec |
| L2 Cache Size | 4096 KB | 6144 KB | 40960 KB |
| Shared Memory Size / SM | 64 KB | Configurable up to 96 KB | Configurable up to 164 KB |
| Register File Size / SM | 256 KB | 256 KB | 256 KB |
| Register File Size / GPU | 14336 KB | 20480 KB | 27648 KB |
| TDP | 300 Watts | 300 Watts | 400 Watts |
| Transistors | 15.3 billion | 21.1 billion | 54.2 billion |
| GPU Die Size | 610 mm² | 815 mm² | 826 mm² |
| TSMC Manufacturing Process | 16 nm FinFET+ | 12 nm FFN | 7 nm N7 |

1) Peak rates are based on the GPU boost clock.
2) Four Tensor Cores in an A100 SM have 2x the raw FMA computational power of eight Tensor Cores in a GV100 SM.
3) Effective TOPS / TFLOPS using the new Sparsity feature.

source: https://devblogs.nvidia.com/nv...
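
Footnote 1 says the peak rates are based on the GPU boost clock. As a rough cross-check, the sketch below recomputes the non-Tensor peak FP32 and FP64 rates and the memory bandwidth from the table's own figures. It assumes one fused multiply-add (two floating-point operations) per core per clock, and two transfers per memory clock for "DDR". The dictionary and function names are illustrative only.

```python
# Figures copied from the comparison table above.
GPUS = {
    "P100": {"fp32_cores": 3584, "fp64_cores": 1792, "boost_mhz": 1480,
             "bus_bits": 4096, "mem_mhz": 703.0},
    "V100": {"fp32_cores": 5120, "fp64_cores": 2560, "boost_mhz": 1530,
             "bus_bits": 4096, "mem_mhz": 877.5},
    "A100": {"fp32_cores": 6912, "fp64_cores": 3456, "boost_mhz": 1410,
             "bus_bits": 5120, "mem_mhz": 1215.0},
}


def peak_tflops(cores: int, boost_mhz: float) -> float:
    # Assumption: each core retires one FMA (2 FLOPs) per clock.
    return cores * 2 * boost_mhz * 1e6 / 1e12


def mem_bandwidth_gb_s(bus_bits: int, mem_mhz: float) -> float:
    # Assumption: DDR means two transfers per clock; 8 bits per byte.
    return bus_bits / 8 * mem_mhz * 1e6 * 2 / 1e9


for name, g in GPUS.items():
    fp32 = peak_tflops(g["fp32_cores"], g["boost_mhz"])
    fp64 = peak_tflops(g["fp64_cores"], g["boost_mhz"])
    bw = mem_bandwidth_gb_s(g["bus_bits"], g["mem_mhz"])
    print(f"{name}: FP32 {fp32:.1f} TFLOPS, FP64 {fp64:.1f} TFLOPS, {bw:.0f} GB/sec")
```

Under these assumptions the computed FP32 and FP64 rates round to the table's values for all three GPUs. The A100 bandwidth comes out at about 1,555 GB/sec, which the table rounds to 1.6 TB/sec.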

Deep Learning Training Performance With NVIDIA A100

AI models are exploding in complexity as they take on next-level challenges such as accurate conversational AI and deep recommender systems. Training them requires massive compute power and scalability.

NVIDIA A100’s third-generation Tensor Cores with TensorFloat-32 (TF32) precision provide up to 20x higher performance than the prior generation with zero code changes, and an additional 2x boost with automatic mixed precision and FP16. When combined with third-generation NVIDIA® NVLink®, NVIDIA NVSwitch™, PCIe Gen4, NVIDIA Mellanox InfiniBand, and the NVIDIA Magnum IO™ software SDK, A100 can scale to thousands of GPUs. This means that large AI models like BERT can be trained in just 37 minutes on a cluster of 1,024 A100s.
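
To make the TF32 format concrete: it keeps FP32's 8-bit exponent, so the numeric range is unchanged, but stores only a 10-bit mantissa. The sketch below emulates that in software by rounding an FP32 bit pattern to 10 mantissa bits. The helper name `to_tf32` is hypothetical, and round-to-nearest-even is an assumption made here for illustration rather than a statement about the hardware's rounding mode.

```python
import math
import struct


def to_tf32(x: float) -> float:
    """Round x to TF32 precision (8-bit exponent, 10-bit mantissa).

    Sketch only: converts to FP32 first, then rounds away the low
    13 mantissa bits. NaN and infinity are returned unchanged.
    """
    if not math.isfinite(x):
        return x
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    drop = 13  # 23 FP32 mantissa bits minus 10 TF32 mantissa bits
    lsb = (bits >> drop) & 1
    # Round to nearest, ties to even (assumed rounding mode).
    bits += (1 << (drop - 1)) - 1 + lsb
    bits &= ~((1 << drop) - 1) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits))[0]


print(to_tf32(0.1))   # close to 0.1, with roughly three decimal digits kept
print(to_tf32(1e30))  # still finite: the exponent range matches FP32
```

Keeping the FP32 exponent is what allows existing FP32 code to run without changes: values that fit in FP32 stay in range, and only the low-order mantissa bits of the Tensor Core inputs are lost.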

a100-bert-training.png


Deep Learning Inference Performance with NVIDIA A100

A100 introduces groundbreaking new features to optimize inference workloads. It brings unprecedented versatility by accelerating a full range of precisions, from FP32 to FP16 to INT8 and all the way down to INT4. Multi-Instance GPU (MIG) technology allows multiple networks to operate simultaneously on a single A100 GPU for optimal utilization of compute resources. And structural sparsity support delivers up to 2x more performance on top of A100’s other inference performance gains.

a100-bert-large-inference.png


High-Performance Computing with NVIDIA A100

To unlock next-generation discoveries, scientists look to simulations to better understand complex molecules for drug discovery, physics for potential new sources of energy, and atmospheric data to better predict and prepare for extreme weather patterns.

A100 introduces double-precision Tensor Cores, the biggest milestone since the introduction of double-precision computing in GPUs for HPC. They enable researchers to reduce a 10-hour, double-precision simulation running on NVIDIA V100 Tensor Core GPUs to just four hours on A100. HPC applications can also leverage TF32 precision in A100’s Tensor Cores to achieve up to 10x higher throughput for single-precision dense matrix multiply operations.
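
The figures in this paragraph can be related to the peak rates in the comparison table with a little arithmetic. The sketch below assumes the workload is bound entirely by matrix throughput, so the ratio of peak rates is an upper bound rather than a measured application speedup. The constant and function names are illustrative.

```python
# Peak rates in TFLOPS, copied from the comparison table earlier in the article.
V100_FP64 = 7.8
A100_FP64_TENSOR = 19.5
V100_FP32 = 15.7
A100_TF32_DENSE = 156.0
A100_TF32_SPARSE = 312.0


def speedup(new_tflops: float, old_tflops: float) -> float:
    """Ratio of peak rates; an upper bound, not a benchmark result."""
    return new_tflops / old_tflops


fp64_gain = speedup(A100_FP64_TENSOR, V100_FP64)
hours_on_a100 = 10.0 / fp64_gain  # the 10-hour V100 job from the text

print(f"FP64 Tensor Core gain: {fp64_gain:.1f}x")
print(f"10-hour job at that rate: {hours_on_a100:.1f} hours")
print(f"TF32 dense vs V100 FP32: {speedup(A100_TF32_DENSE, V100_FP32):.1f}x")
print(f"TF32 sparse vs V100 FP32: {speedup(A100_TF32_SPARSE, V100_FP32):.1f}x")
```

The 2.5x FP64 ratio lines up with the 10-hour to four-hour example. The dense TF32 ratio of roughly 10x matches the throughput claim for dense matrix multiplies, and the sparse figure is consistent with the "up to 20x" number quoted for AI workloads.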

a100-hpc.png


Other Key Announcements from NVIDIA

  • NVIDIA GPUs will power software applications for accelerating three critical use cases: managing big data, recommender systems and conversational AI.
  • NVIDIA also continues to push forward with its initiatives in AI and robotics.
