
NVIDIA RTX A5500 Benchmark - BERT Large Fine Tuning in TensorFlow 2
Fine Tuning BERT Large on a GPU Workstation
For this post, we measured fine-tuning performance (training and inference) for NVIDIA's BERT implementation in TensorFlow 2 on NVIDIA RTX A5500 GPUs. For testing, we used an Exxact Valence Workstation fitted with 8x RTX A5500 GPUs, each with 24 GB of GPU memory.
The benchmark scripts we used for evaluation were finetune_train_benchmark.sh and finetune_inference_benchmark.sh from the NVIDIA NGC repository BERT for TensorFlow. We made slight modifications to the training benchmark script to obtain the larger batch-size numbers.
The script runs multiple tests on the SQuAD v1.1 dataset using batch sizes 1, 2, 4, 8, 16, and 32. Inference tests were conducted on a single-GPU configuration with BERT Large. In addition, we ran benchmarks both with and without TensorFlow's XLA (see the XLA column in the training results). Other training settings can be viewed in the Appendix section at the end of this blog.
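The NGC scripts expose their own precision and XLA switches; as a generic illustration only (not the scripts' actual mechanism), XLA and mixed precision can be enabled globally in TensorFlow 2 like this:

```python
import tensorflow as tf

# Enable XLA JIT compilation for TensorFlow ops.
tf.config.optimizer.set_jit(True)

# For fp16 runs, automatic mixed precision via a global Keras policy:
# compute in float16 where numerically safe, keep variables in float32.
tf.keras.mixed_precision.set_global_policy("mixed_float16")
```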
Key Points and Observations
- Scenarios that are not typically used in real-world training, such as single-GPU throughput, are illustrated in the table below and provided for reference as an indication of the platform's single-chip throughput.
- In performance comparisons, the RTX A5500 delivered slightly better performance than the RTX A5000.
- For those interested in training BERT Large, a 4x RTX A5500 system may be a great choice to start with, giving the opportunity to add cards as budget and scaling needs increase.
- NOTE: To run these benchmarks, or to fine-tune BERT Large with 4x GPUs, you'll need a system with at least 64 GB of RAM.
Exxact Workstation System Specs:
Nodes | 1 |
Processor / Count | 2x AMD EPYC 7552 |
Total Logical Cores | 48 |
Memory | DDR4 512 GB |
Storage | NVMe 3.84 TB |
OS | Ubuntu 18.04 |
CUDA Version | 11.4 |
BERT Dataset | SQuAD v1.1 |
GPU Benchmark Overview
FP = Floating Point Precision, Seq = Sequence Length, BS = Batch Size
1x RTX A5500 BERT Inference Benchmark (base and large)
Model | Sequence-Length | Batch-size | Precision | Total-Inference-Time | Throughput-Average(sent/sec) | Latency-Average(ms) | Latency-50%(ms) | Latency-90%(ms) | Latency-95%(ms) | Latency-99%(ms) | Latency-100%(ms) |
base | 384 | 1 | fp16 | 14.31 | 183.03 | 5.46 | 5.2 | 6.54 | 6.68 | 6.98 | 7.74 |
base | 384 | 2 | fp16 | 16.04 | 320.54 | 6.24 | 6.23 | 6.62 | 6.7 | 6.95 | 7.49 |
base | 384 | 4 | fp16 | 19.62 | 407.11 | 9.83 | 9.76 | 10.17 | 10.25 | 10.41 | 10.84 |
base | 384 | 8 | fp16 | 26.36 | 482.06 | 16.6 | 16.58 | 16.84 | 16.99 | 17.25 | 17.9 |
base | 384 | 1 | fp32 | 10.89 | 171.98 | 5.81 | 5.71 | 6.67 | 6.8 | 7.11 | 9.5 |
base | 384 | 2 | fp32 | 14.32 | 224.81 | 8.9 | 8.79 | 9.28 | 9.37 | 9.59 | 10.3 |
base | 384 | 4 | fp32 | 19.99 | 274.5 | 14.57 | 14.58 | 14.88 | 14.96 | 15.31 | 19.02 |
base | 384 | 8 | fp32 | 32.76 | 292.2 | 27.38 | 27.52 | 27.76 | 27.81 | 27.96 | 28.27 |
large | 384 | 1 | fp16 | 25.92 | 84.54 | 11.83 | 11.88 | 12.58 | 12.7 | 13.08 | 14.56 |
large | 384 | 2 | fp16 | 32.47 | 114.08 | 17.53 | 17.55 | 18.3 | 18.42 | 18.66 | 21.21 |
large | 384 | 4 | fp16 | 42.87 | 142.13 | 28.14 | 27.89 | 29.06 | 29.19 | 29.5 | 29.99 |
large | 384 | 8 | fp16 | 62.71 | 167.39 | 47.79 | 47.58 | 48.59 | 48.7 | 48.91 | 49.9 |
large | 384 | 1 | fp32 | 22.36 | 70.45 | 14.19 | 14.3 | 14.97 | 15.1 | 15.43 | 16.18 |
large | 384 | 2 | fp32 | 31.69 | 85.03 | 23.52 | 23.65 | 24.27 | 24.42 | 24.67 | 25.63 |
large | 384 | 4 | fp32 | 54.64 | 86.12 | 46.45 | 46.28 | 47.33 | 47.43 | 47.57 | 48.53 |
large | 384 | 8 | fp32 | 91.18 | 96.6 | 82.82 | 82.77 | 83.66 | 83.91 | 84.13 | 85.04 |
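As a sanity check, the reported average throughput is consistent with batch size divided by average latency. A quick script, with values copied from the fp16 "base" rows above:

```python
# Cross-check: throughput (sentences/sec) ~= batch_size / avg_latency (s).
# Latency values (ms) copied from the fp16 "base" rows of the table above.
rows = [
    # (batch_size, avg_latency_ms, reported_throughput)
    (1, 5.46, 183.03),
    (2, 6.24, 320.54),
    (4, 9.83, 407.11),
    (8, 16.6, 482.06),
]
for bs, lat_ms, reported in rows:
    derived = bs / (lat_ms / 1000.0)  # sentences per second
    print(f"BS={bs}: derived {derived:.1f} vs reported {reported}")
```

The derived and reported numbers agree to within about 1%, so throughput here is simply the inverse of per-batch latency scaled by batch size.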
Data Chart
FP = Floating Point Precision, Seq = Sequence Length
Batch Size for all runs below = 8
8x RTX A5500 BERT Training Benchmark (base and large)
Number GPUs | Model | Precision | XLA | Batch | Training Time (sec) | Throughput (sent/sec) |
2 | base | fp16 | TRUE | 8 | 172.39 | 174.57 |
4 | base | fp16 | TRUE | 8 | 177.22 | 340.87 |
6 | base | fp16 | TRUE | 8 | 187.03 | 454.5 |
8 | base | fp16 | TRUE | 8 | 189.97 | 599.73 |
2 | base | fp32 | TRUE | 8 | 161.84 | 123.95 |
4 | base | fp32 | TRUE | 8 | 170.89 | 231.76 |
6 | base | fp32 | TRUE | 8 | 186.79 | 307.47 |
8 | base | fp32 | TRUE | 8 | 190.02 | 404.42 |
2 | base | fp16 | FALSE | 8 | 156.57 | 104.45 |
4 | base | fp16 | FALSE | 8 | 161.1 | 202.13 |
6 | base | fp16 | FALSE | 8 | 168.62 | 285.75 |
8 | base | fp16 | FALSE | 8 | 169.75 | 378.49 |
2 | base | fp32 | FALSE | 8 | 179.42 | 83.01 |
4 | base | fp32 | FALSE | 8 | 186.52 | 159.24 |
6 | base | fp32 | FALSE | 8 | 201.07 | 219.01 |
8 | base | fp32 | FALSE | 8 | 204.11 | 287.95 |
2 | large | fp16 | TRUE | 8 | 398.34 | 63.24 |
4 | large | fp16 | TRUE | 8 | 410.66 | 121.13 |
6 | large | fp16 | TRUE | 8 | 433.85 | 164.81 |
8 | large | fp16 | TRUE | 8 | 438.53 | 216.53 |
2 | large | fp32 | TRUE | 8 | 413.88 | 42.29 |
4 | large | fp32 | TRUE | 8 | 437.6 | 79 |
6 | large | fp32 | TRUE | 8 | 480.33 | 104.99 |
8 | large | fp32 | TRUE | 8 | No Data | No Data |
2 | large | fp16 | FALSE | 8 | 382.36 | 40.47 |
4 | large | fp16 | FALSE | 8 | 385.07 | 81.78 |
6 | large | fp16 | FALSE | 8 | 411.98 | 111.27 |
8 | large | fp16 | FALSE | 8 | 414.62 | 147.47 |
2 | large | fp32 | FALSE | 8 | 471.13 | 30.38 |
4 | large | fp32 | FALSE | 8 | 488.54 | 58.42 |
6 | large | fp32 | FALSE | 8 | 534.49 | 79.32 |
8 | large | fp32 | FALSE | 8 | 538.51 | 104.93 |
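Treating the second results column as throughput in sentences/sec (its values scale roughly linearly with GPU count), multi-GPU scaling efficiency can be estimated relative to the 2-GPU run. Using the BERT Large fp16 (XLA) rows above:

```python
# Scaling efficiency relative to the 2-GPU run,
# BERT Large fp16 (XLA) throughput values from the table above.
throughput = {2: 63.24, 4: 121.13, 6: 164.81, 8: 216.53}  # sent/sec
per_gpu_base = throughput[2] / 2  # per-GPU throughput at 2 GPUs
for n, t in throughput.items():
    eff = t / (n * per_gpu_base)
    print(f"{n} GPUs: {t:.2f} sent/sec, {eff:.0%} scaling efficiency")
```

Efficiency drops gradually (roughly 96% at 4 GPUs down to the mid-80s at 8 GPUs), which is typical for data-parallel fine-tuning at a fixed per-GPU batch size.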
Data Chart
Batch size for all runs in the chart above = 8
NVIDIA RTX A Series GPU Specs
 | NVIDIA RTX A4000 | NVIDIA RTX A4500 | NVIDIA RTX A5000 | NVIDIA RTX A5500 | NVIDIA RTX A6000 |
Architecture | Ampere | Ampere | Ampere | Ampere | Ampere |
GPU Memory | 16 GB GDDR6 | 20 GB GDDR6 | 24 GB GDDR6 | 24 GB GDDR6 | 48 GB GDDR6 |
ECC Memory | Yes | Yes | Yes | Yes | Yes |
CUDA Cores | 6,144 | 7,168 | 8,192 | 10,240 | 10,752 |
Tensor Cores | 192 | 224 | 256 | 320 | 336 |
RT Cores | 48 | 56 | 64 | 80 | 84 |
SP perf | 19.2 TFLOPS | 23.7 TFLOPS | 27.8 TFLOPS | 34.1 TFLOPS | 38.7 TFLOPS |
RT Core perf | 37.4 TFLOPS | 46.2 TFLOPS | 54.2 TFLOPS | 66.6 TFLOPS | 75.6 TFLOPS |
Tensor perf | 153.4 TFLOPS | 189.2 TFLOPS | 222.2 TFLOPS | 272.8 TFLOPS | 309.7 TFLOPS |
Max Power | 140W | 200W | 230W | 230W | 300W |
Graphic bus | PCI-E 4.0 x16 | PCI-E 4.0 x16 | PCI-E 4.0 x16 | PCI-E 4.0 x16 | PCI-E 4.0 x16 |
Connectors | DP 1.4 (4) | DP 1.4 (4) | DP 1.4 (4) | DP 1.4 (4) | DP 1.4 (4) |
Form Factor | Single Slot | Dual Slot | Dual Slot | Dual Slot | Dual Slot |
vGPU Software | No | No | NVIDIA RTX vWS | NVIDIA RTX vWS | NVIDIA RTX vWS |
NVLink | N/A | 2x RTX A4500 | 2x RTX A5000 | 2x RTX A5500 | 2x RTX A6000 |
Power Connector | 1 x 6-pin PCIe | 1 x 8-pin PCIe | 1 x 8-pin PCIe | 1 x 8-pin PCIe | 1 x 8-pin PCIe |
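The "slightly better performance" of the A5500 over the A5000 noted earlier tracks its roughly 1.2x spec-sheet advantage, which can be checked directly from the table above:

```python
# Ratio of key RTX A5500 vs RTX A5000 specs from the table above.
specs = {
    "CUDA cores": (10240, 8192),
    "FP32 TFLOPS": (34.1, 27.8),
    "Tensor TFLOPS": (272.8, 222.2),
}
for name, (a5500, a5000) in specs.items():
    print(f"{name}: {a5500 / a5000:.2f}x")
```

All three ratios come out between 1.22x and 1.25x at the same 230 W power limit, so a modest real-world gain is the expected result.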
