NVIDIA RTX 3080 Ti BERT Large Fine Tuning Benchmarks in TensorFlow
Fine-tuning BERT Large on a GPU Workstation
For this post, we measured fine-tuning performance (training and inference) for the TensorFlow implementation of BERT on NVIDIA GeForce RTX 3080 Ti GPUs. For testing, we used an Exxact Valence Workstation fitted with 4x RTX 3080 Ti GPUs, each with 12GB of GPU memory.
For evaluation we used the benchmark scripts finetune_train_benchmark.sh and finetune_inference_benchmark.sh from the NVIDIA NGC repository BERT for TensorFlow. We made slight modifications to the training benchmark script to reach the larger batch sizes.
The scripts run multiple tests on the SQuAD v1.1 dataset using batch sizes 1, 2, 4, 8, 16, and 32. Inference tests were conducted in a single-GPU configuration on both BERT Base and BERT Large. All benchmarks were run with TensorFlow's XLA compilation enabled.
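For readers who want to reproduce the XLA and fp16 settings outside the benchmark scripts, below is a minimal TensorFlow 2.x sketch. The NGC scripts toggle the same behavior through their own command-line flags, so treat this as an illustration of the settings rather than the exact code path used in these runs.

```python
import tensorflow as tf
from tensorflow.keras import mixed_precision

# Enable XLA JIT compilation globally (the NGC benchmark scripts
# switch this on via their own flags; this is the stock TF 2.x knob).
tf.config.optimizer.set_jit(True)

# The fp16 rows in the tables below correspond to mixed precision:
# compute in float16 on the Tensor Cores, keep variables in float32.
mixed_precision.set_global_policy("mixed_float16")
print(mixed_precision.global_policy())  # mixed_float16
```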
Key Points and Observations
- Single-GPU throughput numbers, while not typical of real-world multi-GPU training, are shown in the table below as a reference for the per-chip throughput of the platform.
- In our performance comparisons, the RTX 3080 Ti delivered 3.3% better performance than the RTX A5000.
- For those interested in training BERT Large, a 2x RTX 3080 Ti system may be a great starting point, leaving room to add cards as budget and scaling needs grow (see the multi-GPU sketch after this list).
- NOTE: To run these benchmarks, or to fine-tune BERT Large with 4x GPUs, you'll need a system with at least 64GB of RAM.
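As a rough illustration of the multi-GPU scaling point above, the sketch below uses stock tf.distribute.MirroredStrategy for data-parallel fine-tuning. This is a stand-in, not the NGC repository's own multi-GPU code path, and build_bert_large_model / squad_dataset are hypothetical placeholders for your model and data pipeline.

```python
import tensorflow as tf

# Data-parallel fine-tuning across all visible GPUs. Adding a second
# (or fourth) RTX 3080 Ti scales the global batch size; the 12GB
# per-GPU memory limit still caps the per-GPU batch size.
strategy = tf.distribute.MirroredStrategy()
print(f"Replicas in sync: {strategy.num_replicas_in_sync}")

PER_GPU_BATCH = 8  # illustrative per-GPU batch size; tune to fit 12GB
global_batch = PER_GPU_BATCH * strategy.num_replicas_in_sync

with strategy.scope():
    model = build_bert_large_model()  # hypothetical helper
    model.compile(optimizer=tf.keras.optimizers.Adam(3e-5),
                  loss="sparse_categorical_crossentropy")

train_ds = squad_dataset().batch(global_batch)  # hypothetical helper
model.fit(train_ds, epochs=2)
```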
Interested in getting faster results?
Learn more about Exxact AI workstations starting at $3,700
Exxact Workstation System Specs:
| Spec | Value |
| --- | --- |
| Nodes | 1 |
| Processor / Count | 2x AMD EPYC 7552 |
| Total Logical Cores | 48 |
| Memory | DDR4 512GB |
| Storage | NVMe 3.84TB |
| OS | Ubuntu 18.04 |
| CUDA Version | 11.2 |
| BERT Dataset | SQuAD v1.1 |
GPU Benchmark Overview
FP = Floating Point Precision, Seq = Sequence Length, BS = Batch Size
1x NVIDIA GeForce RTX 3080 Ti BERT Inference Benchmark (Base and Large)
| Model | Sequence-Length | Batch-size | Precision | Total-Inference-Time | Throughput-Average(sent/sec) | Latency-Average(ms) | Latency-50%(ms) | Latency-90%(ms) | Latency-95%(ms) | Latency-99%(ms) | Latency-100%(ms) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| base | 128 | 1 | fp16 | 23.09 | 182.3 | 8.8 | 5.69 | 6.21 | 6.37 | 6.78 | 5934.1 |
| base | 128 | 1 | fp32 | 21.01 | 179.67 | 8.89 | 5.81 | 6.38 | 6.55 | 6.98 | 5965.55 |
| base | 128 | 2 | fp16 | 20.07 | 381.07 | 8.8 | 5.47 | 6.03 | 6.16 | 6.51 | 6016.77 |
| base | 128 | 2 | fp32 | 20.18 | 379.85 | 8.86 | 5.4 | 6.14 | 6.28 | 6.71 | 6093.34 |
| base | 128 | 4 | fp16 | 27.36 | 669.13 | 12.29 | 5.99 | 6.41 | 6.58 | 6.99 | 6188.94 |
| base | 128 | 4 | fp32 | 27.6 | 671.57 | 12.26 | 5.96 | 6.36 | 6.53 | 7.01 | 6213.83 |
| base | 128 | 8 | fp16 | 35.36 | 949.71 | 14.2 | 8.4 | 9 | 9.2 | 9.79 | 6357 |
| base | 128 | 8 | fp32 | 35.36 | 955.41 | 14.1 | 8.35 | 8.88 | 9.12 | 9.61 | 6340.38 |
| base | 384 | 1 | fp16 | 17.03 | 181.83 | 11.7 | 5.65 | 6.29 | 6.45 | 6.77 | 6461.01 |
| base | 384 | 1 | fp32 | 17.2 | 180.91 | 11.75 | 5.69 | 6.28 | 6.41 | 6.76 | 6484.46 |
| base | 384 | 2 | fp16 | 25.2 | 283.1 | 18.92 | 7.03 | 7.52 | 7.7 | 8.45 | 6918.94 |
| base | 384 | 2 | fp32 | 24.95 | 282.86 | 18.88 | 7.06 | 7.47 | 7.71 | 8.59 | 6907.74 |
| base | 384 | 4 | fp16 | 30.09 | 357.03 | 23.21 | 11.19 | 11.69 | 12 | 12.99 | 7059.7 |
| base | 384 | 4 | fp32 | 30.03 | 358.97 | 23.15 | 11.1 | 11.65 | 11.96 | 13.11 | 7061.75 |
| base | 384 | 8 | fp16 | 41.64 | 411.64 | 32.27 | 19.4 | 19.96 | 20.21 | 21.48 | 7440.31 |
| base | 384 | 8 | fp32 | 41.71 | 411.85 | 32.36 | 19.4 | 19.91 | 20.26 | 21.55 | 7406.49 |
| large | 128 | 1 | fp16 | 41.8 | 108.36 | 15 | 9.61 | 10.66 | 10.89 | 11.4 | 10322.8 |
| large | 128 | 1 | fp32 | 36.63 | 100.6 | 15.74 | 10.36 | 11.26 | 11.56 | 12.16 | 10407.55 |
| large | 128 | 2 | fp16 | 34.84 | 214.7 | 15.53 | 9.12 | 10.09 | 10.27 | 10.89 | 10541.14 |
| large | 128 | 2 | fp32 | 34.93 | 214 | 15.58 | 9.22 | 10.14 | 10.3 | 11.23 | 10575.28 |
| large | 128 | 4 | fp16 | 52.47 | 301.82 | 24.34 | 13.36 | 13.86 | 14.05 | 14.66 | 10740.68 |
| large | 128 | 4 | fp32 | 52.85 | 298.83 | 24.54 | 13.47 | 13.88 | 14.05 | 15.27 | 10685.7 |
| large | 128 | 8 | fp16 | 71.15 | 391.03 | 30.58 | 20.47 | 21.23 | 21.46 | 22.23 | 10741.95 |
| large | 128 | 8 | fp32 | 71.79 | 389 | 30.88 | 20.66 | 21.33 | 21.5 | 22.04 | 10941.51 |
| large | 384 | 1 | fp16 | 33.8 | 75.96 | 24.42 | 13.25 | 13.97 | 14.1 | 15.07 | 11737.02 |
| large | 384 | 1 | fp32 | 33.1 | 78.03 | 23.82 | 12.54 | 13.7 | 13.83 | 14.75 | 11466.12 |
| large | 384 | 2 | fp16 | 52.28 | 97.8 | 42.13 | 20.31 | 21.26 | 21.38 | 22.33 | 12373.92 |
| large | 384 | 2 | fp32 | 52.71 | 97.65 | 42.54 | 20.6 | 21.25 | 21.37 | 22.99 | 12629.24 |
| large | 384 | 4 | fp16 | 65.04 | 124.69 | 53.88 | 32.17 | 32.77 | 32.99 | 34.29 | 12558.83 |
| large | 384 | 4 | fp32 | 65.62 | 124.66 | 54.41 | 32.18 | 32.85 | 33.03 | 33.97 | 12861.03 |
| large | 384 | 8 | fp16 | 96.27 | 139.09 | 81.01 | 57.54 | 58.45 | 58.82 | 60.57 | 13293.83 |
| large | 384 | 8 | fp32 | 96.13 | 139.42 | 80.85 | 57.37 | 58.21 | 58.55 | 60.44 | 13304.43 |
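The percentile columns (Latency-50% through Latency-100%) summarize the distribution of per-batch latencies rather than a single number; the worst-case Latency-100% values sit far above the 99th percentile, which most likely reflects one-time warm-up costs such as XLA compilation of the first batch. The sketch below, using made-up latency samples, shows how such summary statistics are computed:

```python
import numpy as np

# Illustrative per-batch latencies in ms (made-up values, not
# measurements from the runs above).
latencies_ms = np.array([5.7, 5.6, 6.1, 6.4, 5934.1, 5.9, 6.0, 6.2])

print(f"Latency-Average: {latencies_ms.mean():.2f} ms")
for p in (50, 90, 95, 99, 100):
    print(f"Latency-{p}%: {np.percentile(latencies_ms, p):.2f} ms")
# In this made-up sample, the single slow outlier dominates the mean
# and the 100th percentile but barely moves the median.
```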
NVIDIA RTX-30 Series GPUs
| | NVIDIA GeForce RTX 3060 | NVIDIA GeForce RTX 3060 Ti | NVIDIA GeForce RTX 3070 | NVIDIA GeForce RTX 3080 | NVIDIA GeForce RTX 3090 |
| --- | --- | --- | --- | --- | --- |
| NVIDIA CUDA Cores | 3,584 | 4,864 | 5,888 | 8,704 | 10,496 |
| Boost Clock (GHz) | 1.78 | 1.67 | 1.73 | 1.71 | 1.70 |
| Memory Size | 12GB | 8GB | 8GB | 10GB | 24GB |
| Memory Type | GDDR6 | GDDR6 | GDDR6 | GDDR6X | GDDR6X |
| Dimensions | 9.5 x 4.4 inches | 9.5 x 4.4 inches | 9.5 x 4.4 inches | 11.2 x 4.4 inches | 12.3 x 5.4 inches |
| Power Draw | 170W | 200W | 220W | 320W | 350W |
Additional GPU Benchmarks
- NVIDIA RTX A4000 BERT Large Fine Tuning Benchmarks in TensorFlow
- NVIDIA RTX A5000 BERT Large Fine Tuning Benchmarks in TensorFlow
- NVIDIA A5000 Deep Learning Benchmarks for TensorFlow
- NVIDIA A30 Deep Learning Benchmarks for TensorFlow
- NVIDIA RTX A6000 Deep Learning Benchmarks for TensorFlow
- NVIDIA A100 Deep Learning Benchmarks for TensorFlow
Have any questions?
Contact Exxact Today