NVIDIA RTX A4000: BERT Inferencing and Training Benchmarks in TensorFlow
For this post, we measured fine-tuning performance (training and inference) for NVIDIA's BERT implementation for TensorFlow on RTX A4000 GPUs. For testing we used an Exxact Valence Workstation fitted with 4x RTX A4000 GPUs, each with 16GB of GPU memory.
For evaluation we used the benchmark scripts finetune_train_benchmark.sh and finetune_inference_benchmark.sh from the NVIDIA NGC BERT for TensorFlow repository. We made slight modifications to the training benchmark script to capture the larger batch sizes.
The scripts run multiple tests on the SQuAD v1.1 dataset using batch sizes of 1, 2, 4, and 8. Inference tests were conducted using a single-GPU configuration. In addition, training benchmarks were run both with and without TensorFlow's XLA compiler (the XLA rows in the tables below). Other training settings can be viewed at the end of this blog.
Key Points and Observations
- Single-GPU throughput, while not typical of real-world training scenarios, is included in the tables below as a reference for the per-chip throughput of the platform.
- For those interested in training BERT Large, a 2x RTX A4000 system can be a great starting point, with the option to add cards as budget and scaling needs grow.
- NOTE: To run these benchmarks, or to fine-tune BERT Large on 4x GPUs, you'll need a system with at least 64GB of RAM.
Interested in getting faster results?
Learn more about Exxact AI workstations starting at $3,700
Exxact Workstation System Specs:
Nodes | 1 |
Processor / Count | 2x AMD EPYC 7552 |
Total Logical Cores | 192 |
Memory | DDR4 512GB |
Storage | NVMe 3.84TB |
OS | Ubuntu 18.04 |
CUDA Version | 11.2 |
BERT Dataset | SQuAD v1.1 |
TensorFlow | 2.4.0 |
GPU Benchmark Overview
FP = Floating Point Precision, Seq = Sequence Length, BS = Batch Size
1x RTX A4000 BERT Base and Large Inference Benchmark
Raw Data
Model | Sequence-Length | Batch-Size | Precision | Total-Inference-Time | Throughput-Average (sent/sec) | Latency-Average (ms) | Latency-50% (ms) | Latency-90% (ms) | Latency-95% (ms) | Latency-99% (ms) | Latency-100% (ms) |
---|---|---|---|---|---|---|---|---|---|---|---|
base | 128 | 1 | fp16 | 13.59 | 155.38 | 6.44 | 6.43 | 6.83 | 6.93 | 7.17 | 7.81 |
base | 128 | 1 | fp32 | 12.73 | 128.8 | 7.76 | 7.73 | 8.1 | 8.21 | 8.49 | 11.4 |
base | 128 | 2 | fp16 | 18.25 | 220.84 | 9.06 | 8.93 | 9.42 | 9.5 | 9.71 | 10.52 |
base | 128 | 2 | fp32 | 18.09 | 156.25 | 12.8 | 12.72 | 13.14 | 13.2 | 13.46 | 16.49 |
base | 128 | 4 | fp16 | 24.14 | 268.32 | 14.91 | 14.87 | 15.24 | 15.32 | 15.65 | 29.78 |
base | 128 | 4 | fp32 | 29.36 | 165.49 | 24.17 | 24.21 | 24.54 | 24.62 | 24.74 | 24.93 |
base | 128 | 8 | fp16 | 35.6 | 303.29 | 26.38 | 26.38 | 26.73 | 26.85 | 26.95 | 27.05 |
base | 128 | 8 | fp32 | 49.5 | 181.21 | 44.15 | 44.23 | 44.62 | 44.68 | 44.8 | 44.91 |
base | 384 | 1 | fp16 | 13.38 | 160.37 | 6.24 | 6.11 | 6.65 | 6.7 | 6.97 | 7.38 |
base | 384 | 1 | fp32 | 12.63 | 130.45 | 7.67 | 7.55 | 8.08 | 8.18 | 8.52 | 12.04 |
base | 384 | 2 | fp16 | 18.28 | 221.3 | 9.04 | 8.93 | 9.42 | 9.49 | 9.64 | 10.47 |
base | 384 | 2 | fp32 | 18.28 | 155.08 | 12.9 | 12.88 | 13.2 | 13.3 | 13.45 | 16.3 |
base | 384 | 4 | fp16 | 24.12 | 267.27 | 14.97 | 14.99 | 15.26 | 15.32 | 15.54 | 16.06 |
base | 384 | 4 | fp32 | 29.43 | 165.07 | 24.23 | 24.25 | 24.58 | 24.68 | 24.8 | 26.3 |
base | 384 | 8 | fp16 | 35.74 | 304.75 | 26.25 | 26.28 | 26.73 | 26.83 | 26.98 | 27.19 |
base | 384 | 8 | fp32 | 49.53 | 181.19 | 44.15 | 44.2 | 44.64 | 44.7 | 44.81 | 45.62 |
large | 128 | 1 | fp16 | 29.75 | 62.83 | 15.92 | 15.97 | 16.69 | 16.85 | 17.12 | 17.84 |
large | 128 | 1 | fp32 | 26.85 | 53.15 | 18.82 | 18.73 | 19.71 | 19.86 | 20.27 | 24.23 |
large | 128 | 2 | fp16 | 39.01 | 82.16 | 24.34 | 24.18 | 25.24 | 25.42 | 25.84 | 27.49 |
large | 128 | 2 | fp32 | 42.69 | 57.8 | 34.6 | 34.51 | 35.57 | 35.78 | 36.15 | 38.11 |
large | 128 | 4 | fp16 | 57.45 | 94.04 | 42.54 | 42.46 | 43.59 | 43.79 | 44.15 | 45.14 |
large | 128 | 4 | fp32 | 74.18 | 60.61 | 66 | 66.07 | 67.2 | 67.46 | 67.68 | 67.93 |
large | 128 | 8 | fp16 | 90.61 | 105.54 | 75.8 | 75.85 | 76.86 | 77.1 | 77.4 | 78.4 |
large | 128 | 8 | fp32 | 139.8 | 60.89 | 131.37 | 131.73 | 132.79 | 133.01 | 133.37 | 133.67 |
large | 384 | 1 | fp16 | 29.74 | 62.56 | 15.98 | 16.06 | 16.77 | 16.91 | 17.16 | 17.99 |
large | 384 | 1 | fp32 | 27.12 | 52.4 | 19.09 | 19.03 | 20.04 | 20.22 | 20.57 | 22.28 |
large | 384 | 2 | fp16 | 38.91 | 82.38 | 24.28 | 24.07 | 25.17 | 25.3 | 25.52 | 26.05 |
large | 384 | 2 | fp32 | 42.77 | 58.01 | 34.48 | 34.37 | 35.47 | 35.66 | 36.05 | 36.45 |
large | 384 | 4 | fp16 | 57.33 | 93.92 | 42.59 | 42.55 | 43.53 | 43.72 | 44.1 | 44.65 |
large | 384 | 4 | fp32 | 74.38 | 60.44 | 66.19 | 66.23 | 67.18 | 67.43 | 67.64 | 67.93 |
large | 384 | 8 | fp16 | 90.59 | 105.62 | 75.74 | 75.81 | 76.67 | 76.97 | 77.32 | 78.46 |
large | 384 | 8 | fp32 | 139.75 | 60.93 | 131.29 | 131.52 | 132.72 | 133.05 | 133.57 | 134.62 |
Data Chart
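As a quick sanity check on the raw numbers, average throughput should roughly equal batch size divided by average latency. A minimal sketch in Python, with a few BERT Large, sequence-length-128, FP16 rows from the table above hard-coded for illustration:

```python
# Throughput (sent/sec) should be approximately batch_size / avg_latency.
# Rows: BERT Large, seq len 128, fp16 -- (batch_size, latency_avg_ms, reported_throughput)
rows = [
    (1, 15.92, 62.83),
    (4, 42.54, 94.04),
    (8, 75.80, 105.54),
]
for bs, lat_ms, reported in rows:
    implied = bs / (lat_ms / 1000.0)  # convert ms to seconds
    print(f"BS{bs}: implied {implied:.1f} sent/sec, reported {reported} sent/sec")
```

The implied and reported figures agree to within rounding, which is expected since the benchmark runs requests back-to-back on a single GPU.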
1x RTX A4000 BERT for TensorFlow 2 Fine-Tuning Training Benchmarks
Raw Data
Configuration | Training Time (Hours) | Throughput (sentences/sec) |
---|---|---|
Base FP32, BS1 | 1.07 | 15.42 |
Base FP16, BS1 | 1.23 | 14.34 |
Base FP32, BS2 | 1.45 | 21.34 |
Base FP16, BS2 | 1.5 | 22.3 |
Base FP16 XLA, BS1 | 1.86 | 21.22 |
Base FP16, BS4 | 2.05 | 30.59 |
Base FP16 XLA, BS2 | 2.08 | 35.3 |
Base FP32, BS4 | 2.3 | 25.44 |
Base FP16 XLA, BS4 | 2.4 | 53.32 |
Large FP32, BS1 | 2.7 | 5.82 |
Base FP16 XLA, BS8 | 2.85 | 70.82 |
Large FP16, BS1 | 2.89 | 5.85 |
Base FP16, BS8 | 3.13 | 37.81 |
Large FP16, BS2 | 3.72 | 8.61 |
Base FP32, BS8 | 3.85 | 29.18 |
Large FP32, BS2 | 3.87 | 7.71 |
Large FP16 XLA, BS1 | 4.23 | 8.32 |
Large FP16 XLA, BS2 | 4.77 | 13.5 |
Large FP16, BS4 | 5.23 | 11.57 |
Large FP16 XLA, BS4 | 5.58 | 19.73 |
Large FP32, BS4 | 6.1 | 9.37 |
Large FP16 XLA, BS8 | 7.13 | 25.09 |
Large FP16, BS8 | 8.25 | 13.99 |
Base FP32 XLA, BS1 | 42.4 | 0.75 |
Base FP32 XLA, BS2 | 75 | 1.04 |
Base FP32 XLA, BS4 | 93.35 | 1.07 |
Base FP32 XLA, BS8 | 117.2 | 1.37 |
Data Chart
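One pattern worth pulling out of the table above is the effect of XLA at FP16: at the larger batch sizes it roughly doubles single-card throughput. A small illustrative snippet, with the BS8 throughput values copied from the rows above:

```python
# fp16 training throughput (sentences/sec) at BS8 on 1x RTX A4000,
# with and without XLA, from the table above.
no_xla = {"Base": 37.81, "Large": 13.99}
xla = {"Base": 70.82, "Large": 25.09}
for model in no_xla:
    speedup = xla[model] / no_xla[model]
    print(f"{model} BS8: {speedup:.2f}x throughput with XLA")
```

This works out to about 1.87x for Base and 1.79x for Large; note the FP32 XLA rows show the opposite effect in this run, so the benefit here is specific to mixed precision.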
2x RTX A4000 BERT for TensorFlow 2 Fine-Tuning Training Benchmarks
Raw Data
Configuration | Training Time (Hours) | Throughput (sentences/sec) |
---|---|---|
Base FP32, BS1 | 1.64 | 21.82 |
Base FP16, BS1 | 1.7 | 23.39 |
Base FP32 XLA, BS1 | 1.85 | 26.72 |
Base FP32, BS2 | 2.01 | 33.36 |
Base FP16, BS2 | 2.04 | 35.99 |
Base FP32 XLA, BS2 | 2.12 | 44.35 |
Base FP16 XLA, BS1 | 2.4 | 29.73 |
Base FP16, BS4 | 2.57 | 53.15 |
Base FP16 XLA, BS2 | 2.6 | 51.83 |
Base FP32 XLA, BS4 | 2.62 | 63.18 |
Base FP32, BS4 | 2.84 | 43.9 |
Base FP16 XLA, BS4 | 2.92 | 83 |
Base FP16 XLA, BS8 | 3.38 | 120.19 |
Base FP32 XLA, BS8 | 3.52 | 82.78 |
Base FP16, BS8 | 3.64 | 69.01 |
Large FP16, BS1 | 4.07 | 8.94 |
Large FP32, BS1 | 4.2 | 7.93 |
Base FP32, BS8 | 4.4 | 53.68 |
Large FP16, BS2 | 4.85 | 14.21 |
Large FP32, BS2 | 5.3 | 11.98 |
Large FP16 XLA, BS1 | 5.46 | 11.14 |
Large FP16 XLA, BS2 | 6 | 19.13 |
Large FP16, BS4 | 6.34 | 20.31 |
Large FP16 XLA, BS4 | 6.81 | 30.4 |
Large FP32, BS4 | 7.48 | 15.97 |
Large FP16 XLA, BS8 | 8.29 | 42.3 |
Large FP16, BS8 | 9.31 | 25.94 |
Data Chart
4x RTX A4000 BERT for TensorFlow 2 Fine-Tuning Training Benchmarks
Raw Data
Configuration | Training Time (Hours) | Throughput (sentences/sec) |
---|---|---|
Base FP16, BS1 | 1.79 | 43.52 |
Base FP32 XLA, BS1 | 1.78 | 39.02 |
Base FP32, BS1 | 1.79 | 39.02 |
Base FP16, BS2 | 2.09 | 69.84 |
Base FP32, BS2 | 2.15 | 61.86 |
Base FP32 XLA, BS2 | 2.15 | 61.82 |
Base FP16 XLA, BS1 | 2.49 | 53.84 |
Base FP16, BS4 | 2.63 | 102.85 |
Base FP16 XLA, BS2 | 2.7 | 96.01 |
Base FP32 XLA, BS4 | 2.96 | 83.8 |
Base FP32, BS4 | 2.96 | 83.73 |
Base FP16 XLA, BS4 | 3 | 158.56 |
Base FP16 XLA, BS8 | 3.45 | 233.43 |
Base FP16, BS8 | 3.67 | 137.78 |
Large FP16, BS1 | 4.29 | 16.68 |
Base FP32, BS8 | 4.48 | 105.07 |
Base FP32 XLA, BS8 | 4.49 | 104.6 |
Large FP32, BS1 | 4.61 | 14.12 |
Large FP16, BS2 | 5.06 | 26.85 |
Large FP32, BS2 | 5.64 | 22.15 |
Large FP16 XLA, BS1 | 5.66 | 20.56 |
Large FP16 XLA, BS2 | 6.21 | 35.69 |
Large FP16 XLA, BS8 | 6.52 | 39.21 |
Large FP16 XLA, BS4 | 7.01 | 57.43 |
Large FP32, BS4 | 7.81 | 30.51 |
Large FP16 XLA, BS8 | 8.43 | 82.55 |
Large FP16, BS8 | 9.34 | 52.13 |
Data Chart
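Comparing the three training tables gives a rough sense of multi-GPU scaling. As an illustration, here is the scaling efficiency for BERT Large FP16 at BS8, with throughput values taken from the 1x, 2x, and 4x tables above:

```python
# Training throughput (sentences/sec) for BERT Large fp16, BS8,
# at each GPU count, from the tables above.
throughput = {1: 13.99, 2: 25.94, 4: 52.13}
base = throughput[1]
for gpus, tput in sorted(throughput.items()):
    speedup = tput / base
    efficiency = speedup / gpus * 100  # 100% = perfectly linear scaling
    print(f"{gpus}x GPU: {speedup:.2f}x speedup, {efficiency:.0f}% scaling efficiency")
```

At roughly 93% efficiency from 1x to 4x, scaling is close to linear for this workload, which supports the "start with 2x and add cards later" recommendation above.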
NVIDIA RTX A4000 Series GPUs
GPU Features | NVIDIA RTX A4000 |
---|---|
GPU Memory | 16GB GDDR6 with error-correction code (ECC) |
Display Ports | 4x DisplayPort 1.4 |
Max Power Consumption | 140 W |
Graphics Bus | PCI Express Gen 4 x 16 |
Form Factor | 4.4” (H) x 9.5” (L) Single Slot |
Thermal | Active |
VR Ready | Yes |
Additional GPU Benchmarks
- NVIDIA A5000 Deep Learning Benchmarks for TensorFlow
- NVIDIA A30 Deep Learning Benchmarks for TensorFlow
- NVIDIA RTX A6000 Deep Learning Benchmarks for TensorFlow
- NVIDIA RTX A6000 Benchmarks for RELION Cryo-EM
- NVIDIA A100 Deep Learning Benchmarks for TensorFlow
Have any questions?
Contact Exxact Today