
NVIDIA RTX A5500 Benchmark - BERT Large Fine Tuning in TensorFlow 2
Fine Tuning BERT Large on a GPU Workstation
For this post, we measured fine-tuning performance (training and inference) for NVIDIA's BERT implementation in TensorFlow 2 on NVIDIA RTX A5500 GPUs. For testing, we used an Exxact Valence Workstation fitted with 8x RTX A5500 GPUs, each with 24 GB of GPU memory.
The benchmark scripts we used for evaluation were finetune_train_benchmark.sh and finetune_inference_benchmark.sh from the NVIDIA NGC repository BERT for TensorFlow. We made slight modifications to the training benchmark script to obtain the larger batch-size numbers.
The script runs multiple tests on the SQuAD v1.1 dataset using batch sizes 1, 2, 4, 8, 16, and 32. Inference tests were conducted on a single-GPU configuration with BERT Large. In addition, we ran benchmarks both with and without TensorFlow's XLA (see the XLA column in the training results). Other training settings can be viewed in the Appendix section at the end of this blog.
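The NGC scripts expose their own precision and XLA switches; as a generic illustration only (not the scripts' actual mechanism), XLA and mixed precision can be enabled globally in TensorFlow 2 like this:

```python
import tensorflow as tf

# Enable XLA JIT compilation for TensorFlow ops.
tf.config.optimizer.set_jit(True)

# For fp16 runs, automatic mixed precision via a global Keras policy:
# compute in float16 where numerically safe, keep variables in float32.
tf.keras.mixed_precision.set_global_policy("mixed_float16")
```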
Key Points and Observations
- Scenarios that are not typically used in real-world training, such as single-GPU throughput, are illustrated in the table below and provided for reference as an indication of the platform's single-chip throughput.
- In performance comparisons, the RTX A5500 delivered slightly better performance than the RTX A5000.
- For those interested in training BERT Large, a 4x RTX A5500 system may be a great choice to start with, giving the opportunity to add cards as budget and scaling needs increase.
- NOTE: To run these benchmarks, or to fine-tune BERT Large with 4x GPUs, you'll need a system with at least 64 GB of RAM.
Exxact Workstation System Specs:
Nodes | 1 |
Processor / Count | 2x AMD EPYC 7552 |
Total Logical Cores | 48 |
Memory | DDR4 512 GB |
Storage | NVMe 3.84 TB |
OS | Ubuntu 18.04 |
CUDA Version | 11.4 |
BERT Dataset | SQuAD v1.1 |
GPU Benchmark Overview
FP = Floating Point Precision, Seq = Sequence Length, BS = Batch Size
1x RTX A5500 BERT Inference Benchmark (base and large)
Model | Sequence-Length | Batch-size | Precision | Total-Inference-Time | Throughput-Average(sent/sec) | Latency-Average(ms) | Latency-50%(ms) | Latency-90%(ms) | Latency-95%(ms) | Latency-99%(ms) | Latency-100%(ms) |
base | 384 | 1 | fp16 | 14.31 | 183.03 | 5.46 | 5.2 | 6.54 | 6.68 | 6.98 | 7.74 |
base | 384 | 2 | fp16 | 16.04 | 320.54 | 6.24 | 6.23 | 6.62 | 6.7 | 6.95 | 7.49 |
base | 384 | 4 | fp16 | 19.62 | 407.11 | 9.83 | 9.76 | 10.17 | 10.25 | 10.41 | 10.84 |
base | 384 | 8 | fp16 | 26.36 | 482.06 | 16.6 | 16.58 | 16.84 | 16.99 | 17.25 | 17.9 |
base | 384 | 1 | fp32 | 10.89 | 171.98 | 5.81 | 5.71 | 6.67 | 6.8 | 7.11 | 9.5 |
base | 384 | 2 | fp32 | 14.32 | 224.81 | 8.9 | 8.79 | 9.28 | 9.37 | 9.59 | 10.3 |
base | 384 | 4 | fp32 | 19.99 | 274.5 | 14.57 | 14.58 | 14.88 | 14.96 | 15.31 | 19.02 |
base | 384 | 8 | fp32 | 32.76 | 292.2 | 27.38 | 27.52 | 27.76 | 27.81 | 27.96 | 28.27 |
large | 384 | 1 | fp16 | 25.92 | 84.54 | 11.83 | 11.88 | 12.58 | 12.7 | 13.08 | 14.56 |
large | 384 | 2 | fp16 | 32.47 | 114.08 | 17.53 | 17.55 | 18.3 | 18.42 | 18.66 | 21.21 |
large | 384 | 4 | fp16 | 42.87 | 142.13 | 28.14 | 27.89 | 29.06 | 29.19 | 29.5 | 29.99 |
large | 384 | 8 | fp16 | 62.71 | 167.39 | 47.79 | 47.58 | 48.59 | 48.7 | 48.91 | 49.9 |
large | 384 | 1 | fp32 | 22.36 | 70.45 | 14.19 | 14.3 | 14.97 | 15.1 | 15.43 | 16.18 |
large | 384 | 2 | fp32 | 31.69 | 85.03 | 23.52 | 23.65 | 24.27 | 24.42 | 24.67 | 25.63 |
large | 384 | 4 | fp32 | 54.64 | 86.12 | 46.45 | 46.28 | 47.33 | 47.43 | 47.57 | 48.53 |
large | 384 | 8 | fp32 | 91.18 | 96.6 | 82.82 | 82.77 | 83.66 | 83.91 | 84.13 | 85.04 |
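As a sanity check, the reported average throughput is consistent with batch size divided by average latency. A quick script, with values copied from the fp16 "base" rows above:

```python
# Cross-check: throughput (sentences/sec) ~= batch_size / avg_latency (s).
# Latency values (ms) copied from the fp16 "base" rows of the table above.
rows = [
    # (batch_size, avg_latency_ms, reported_throughput)
    (1, 5.46, 183.03),
    (2, 6.24, 320.54),
    (4, 9.83, 407.11),
    (8, 16.6, 482.06),
]
for bs, lat_ms, reported in rows:
    derived = bs / (lat_ms / 1000.0)  # sentences per second
    print(f"BS={bs}: derived {derived:.1f} vs reported {reported}")
```

The derived and reported numbers agree to within about 1%, so throughput here is simply the inverse of per-batch latency scaled by batch size.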
Data Chart
FP = Floating Point Precision, Seq = Sequence Length
Batch Size for all runs below = 8
8x RTX A5500 BERT Training Benchmark (base and large)
Number GPUs | Model | Precision | XLA | Batch | Training Time (sec) | Throughput (sent/sec) |
2 | base | fp16 | TRUE | 8 | 172.39 | 174.57 |
4 | base | fp16 | TRUE | 8 | 177.22 | 340.87 |
6 | base | fp16 | TRUE | 8 | 187.03 | 454.5 |
8 | base | fp16 | TRUE | 8 | 189.97 | 599.73 |
2 | base | fp32 | TRUE | 8 | 161.84 | 123.95 |
4 | base | fp32 | TRUE | 8 | 170.89 | 231.76 |
6 | base | fp32 | TRUE | 8 | 186.79 | 307.47 |
8 | base | fp32 | TRUE | 8 | 190.02 | 404.42 |
2 | base | fp16 | FALSE | 8 | 156.57 | 104.45 |
4 | base | fp16 | FALSE | 8 | 161.1 | 202.13 |
6 | base | fp16 | FALSE | 8 | 168.62 | 285.75 |
8 | base | fp16 | FALSE | 8 | 169.75 | 378.49 |
2 | base | fp32 | FALSE | 8 | 179.42 | 83.01 |
4 | base | fp32 | FALSE | 8 | 186.52 | 159.24 |
6 | base | fp32 | FALSE | 8 | 201.07 | 219.01 |
8 | base | fp32 | FALSE | 8 | 204.11 | 287.95 |
2 | large | fp16 | TRUE | 8 | 398.34 | 63.24 |
4 | large | fp16 | TRUE | 8 | 410.66 | 121.13 |
6 | large | fp16 | TRUE | 8 | 433.85 | 164.81 |
8 | large | fp16 | TRUE | 8 | 438.53 | 216.53 |
2 | large | fp32 | TRUE | 8 | 413.88 | 42.29 |
4 | large | fp32 | TRUE | 8 | 437.6 | 79 |
6 | large | fp32 | TRUE | 8 | 480.33 | 104.99 |
8 | large | fp32 | TRUE | 8 | No Data | No Data |
2 | large | fp16 | FALSE | 8 | 382.36 | 40.47 |
4 | large | fp16 | FALSE | 8 | 385.07 | 81.78 |
6 | large | fp16 | FALSE | 8 | 411.98 | 111.27 |
8 | large | fp16 | FALSE | 8 | 414.62 | 147.47 |
2 | large | fp32 | FALSE | 8 | 471.13 | 30.38 |
4 | large | fp32 | FALSE | 8 | 488.54 | 58.42 |
6 | large | fp32 | FALSE | 8 | 534.49 | 79.32 |
8 | large | fp32 | FALSE | 8 | 538.51 | 104.93 |
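Treating the second results column as throughput in sentences/sec (its values scale roughly linearly with GPU count), multi-GPU scaling efficiency can be estimated relative to the 2-GPU run. Using the BERT Large fp16 (XLA) rows above:

```python
# Scaling efficiency relative to the 2-GPU run,
# BERT Large fp16 (XLA) throughput values from the table above.
throughput = {2: 63.24, 4: 121.13, 6: 164.81, 8: 216.53}  # sent/sec
per_gpu_base = throughput[2] / 2  # per-GPU throughput at 2 GPUs
for n, t in throughput.items():
    eff = t / (n * per_gpu_base)
    print(f"{n} GPUs: {t:.2f} sent/sec, {eff:.0%} scaling efficiency")
```

Efficiency drops gradually (roughly 96% at 4 GPUs down to the mid-80s at 8 GPUs), which is typical for data-parallel fine-tuning at a fixed per-GPU batch size.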
Data Chart
Batch size for all runs in the chart above = 8
NVIDIA RTX A Series GPU Specs
 | NVIDIA RTX A4000 | NVIDIA RTX A4500 | NVIDIA RTX A5000 | NVIDIA RTX A5500 | NVIDIA RTX A6000 |
Architecture | Ampere | Ampere | Ampere | Ampere | Ampere |
GPU Memory | 16 GB GDDR6 | 20 GB GDDR6 | 24 GB GDDR6 | 24 GB GDDR6 | 48 GB GDDR6 |
ECC Memory | Yes | Yes | Yes | Yes | Yes |
CUDA Cores | 6,144 | 7,168 | 8,192 | 10,240 | 10,752 |
Tensor Cores | 192 | 224 | 256 | 320 | 336 |
RT Cores | 48 | 56 | 64 | 80 | 84 |
SP perf | 19.2 TFLOPS | 23.7 TFLOPS | 27.8 TFLOPS | 34.1 TFLOPS | 38.7 TFLOPS |
RT Core perf | 37.4 TFLOPS | 46.2 TFLOPS | 54.2 TFLOPS | 66.6 TFLOPS | 75.6 TFLOPS |
Tensor perf | 153.4 TFLOPS | 189.2 TFLOPS | 222.2 TFLOPS | 272.8 TFLOPS | 309.7 TFLOPS |
Max Power | 140W | 200W | 230W | 230W | 300W |
Graphic bus | PCI-E 4.0 x16 | PCI-E 4.0 x16 | PCI-E 4.0 x16 | PCI-E 4.0 x16 | PCI-E 4.0 x16 |
Connectors | DP 1.4 (4) | DP 1.4 (4) | DP 1.4 (4) | DP 1.4 (4) | DP 1.4 (4) |
Form Factor | Single Slot | Dual Slot | Dual Slot | Dual Slot | Dual Slot |
vGPU Software | No | No | NVIDIA RTX vWS | NVIDIA RTX vWS | NVIDIA RTX vWS |
NVLink | N/A | 2x RTX A4500 | 2x RTX A5000 | 2x RTX A5500 | 2x RTX A6000 |
Power Connector | 1 x 6-pin PCIe | 1 x 8-pin PCIe | 1 x 8-pin PCIe | 1 x 8-pin PCIe | 1 x 8-pin PCIe |
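The "slightly better performance" of the A5500 over the A5000 noted earlier tracks its roughly 1.2x spec-sheet advantage, which can be checked directly from the table above:

```python
# Ratio of key RTX A5500 vs RTX A5000 specs from the table above.
specs = {
    "CUDA cores": (10240, 8192),
    "FP32 TFLOPS": (34.1, 27.8),
    "Tensor TFLOPS": (272.8, 222.2),
}
for name, (a5500, a5000) in specs.items():
    print(f"{name}: {a5500 / a5000:.2f}x")
```

All three ratios come out between 1.22x and 1.25x at the same 230 W power limit, so a modest real-world gain is the expected result.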
