NVIDIA RTX A5000 BERT Large Fine Tuning Benchmarks in TensorFlow
Fine Tuning BERT Large on a GPU Workstation
For this post, we measured fine-tuning performance (training and inference) for the TensorFlow implementation of BERT on NVIDIA RTX A5000 GPUs. For testing, we used an Exxact Valence Workstation fitted with 4x RTX A5000 GPUs, each with 24 GB of GPU memory.
The benchmark scripts we used for evaluation came from the NVIDIA NGC repository BERT for TensorFlow: finetune_train_benchmark.sh and finetune_inference_benchmark.sh.
We made slight modifications to the training benchmark script to get the larger batch size numbers.
The scripts run multiple tests on the SQuAD v1.1 dataset, using batch sizes of 1, 2, 4, 8, 16, and 32 for training and 1, 2, 4, and 8 for inference. We conducted tests on BERT Large using 1, 2, and 4 GPU configurations for training and a single GPU for inference. In addition, we ran all benchmarks with TensorFlow's XLA enabled.
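For reference, below is a minimal sketch of how runs like these can be launched. The container tag and script names are taken from this post; the repository URL, the mount path, and running the scripts without arguments are assumptions, so consult the NVIDIA BERT for TensorFlow README for the exact positional parameters (model size, GPU count, batch size, precision, XLA).

    # Pull and launch the NGC TensorFlow container used for these benchmarks.
    docker pull nvcr.io/nvidia/tensorflow:21.06-tf1-py3
    git clone https://github.com/NVIDIA/DeepLearningExamples.git
    docker run --gpus all -it --rm \
      -v $PWD/DeepLearningExamples/TensorFlow/LanguageModeling/BERT:/workspace/bert \
      nvcr.io/nvidia/tensorflow:21.06-tf1-py3

    # Inside the container: run the fine-tuning benchmark scripts.
    # Positional arguments are omitted here; check the repo README for the exact order.
    cd /workspace/bert
    bash scripts/finetune_train_benchmark.sh        # training benchmark
    bash scripts/finetune_inference_benchmark.sh    # inference benchmark (single GPU)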
Key Points and Observations
- The RTX A5000 performed well, and significantly better than the RTX 6000.
- For training throughput, the 4x configuration really started to shine, scaling well as the batch size increased.
- For those interested in training BERT Large, a 2x RTX A5000 system may be a great choice to start with, giving the opportunity to add additional cards as budget/scaling needs increase.
- NOTE: To run these benchmarks, or to fine-tune BERT Large on 4x GPUs, you'll need a system with at least 64 GB of RAM (a quick way to check is shown below).
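A quick pre-flight check with standard Linux and NVIDIA tools (not part of the benchmark scripts) can confirm the system meets those requirements:

    free -g                                                 # total system memory in GiB; 64 GB or more for the 4x GPU runs
    nvidia-smi --query-gpu=name,memory.total --format=csv   # should show roughly 24 GB per RTX A5000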
Interested in getting faster results?
Learn more about Exxact AI workstations starting at $3,700
Make / Model | Supermicro AS -4124GS-TN |
Nodes | 1 |
Processor / Count | 2x AMD EPYC 7552 |
Total Logical Cores | 48 |
Memory | DDR4 512 GB |
Storage | NVMe 3.84 TB |
OS | CentOS 7 |
CUDA Version | 11.2 |
BERT Dataset | SQuAD v1.1 |
GPU Benchmark Overview
FP = Floating Point Precision, Seq = Sequence Length, BS = Batch Size
Our results were obtained by running the scripts/finetune_inference_benchmark.sh script in the TensorFlow 21.06-tf1-py3 NGC container on 1x RTX A5000 (24 GB) GPU. Performance numbers (throughput in sentences per second and latency in milliseconds) were averaged over 1024 iterations. Latency is computed as the time taken to process one batch, with batches fed to the model one after another (i.e. no pipelining).
1x RTX A5000 BERT Inference Benchmark (Base and Large)
Model | Sequence-Length | Batch-size | Precision | Total-Inference-Time | Throughput-Average(sent/sec) | Latency-Average(ms) | Latency-50%(ms) | Latency-90%(ms) | Latency-95%(ms) | Latency-99%(ms) | Latency-100%(ms) |
base | 128 | 1 | fp16 | 23.01 | 169.35 | 9.11 | 6.09 | 6.61 | 6.8 | 7.35 | 5711.58 |
base | 128 | 1 | fp32 | 21.15 | 179.34 | 8.69 | 5.78 | 6.26 | 6.39 | 6.73 | 5567.95 |
base | 128 | 2 | fp16 | 20.87 | 351.11 | 9.13 | 5.84 | 6.4 | 6.56 | 6.98 | 5805.16 |
base | 128 | 2 | fp32 | 21.02 | 349.67 | 9.14 | 5.86 | 6.39 | 6.54 | 7.01 | 5785.09 |
base | 128 | 4 | fp16 | 26.31 | 680.97 | 11.61 | 5.9 | 6.32 | 6.48 | 7.13 | 5749.98 |
base | 128 | 4 | fp32 | 26.61 | 672.74 | 11.74 | 6.02 | 6.5 | 6.69 | 7.58 | 5745.76 |
base | 128 | 8 | fp16 | 32.91 | 1039.56 | 13.01 | 7.7 | 8.09 | 8.27 | 8.88 | 5855.3 |
base | 128 | 8 | fp32 | 32.9 | 1037.42 | 13.01 | 7.7 | 8.12 | 8.32 | 8.91 | 5855.67 |
base | 384 | 1 | fp16 | 17.5 | 175.96 | 11.7 | 5.77 | 6.28 | 6.44 | 7.72 | 6263.5 |
base | 384 | 1 | fp32 | 17.17 | 176.4 | 11.52 | 5.71 | 6.35 | 6.55 | 7.16 | 6087.21 |
base | 384 | 2 | fp16 | 25.29 | 258.87 | 18.88 | 7.76 | 8.06 | 8.26 | 9.36 | 6676.43 |
base | 384 | 2 | fp32 | 25.02 | 256.01 | 18.82 | 7.79 | 8.22 | 8.37 | 9.72 | 6473.25 |
base | 384 | 4 | fp16 | 29.94 | 338.67 | 23.03 | 11.78 | 12.16 | 12.38 | 13.82 | 6623.03 |
base | 384 | 4 | fp32 | 30.43 | 337.72 | 23.14 | 11.84 | 12.27 | 12.42 | 13.57 | 6802.34 |
base | 384 | 8 | fp16 | 41.07 | 402.89 | 31.8 | 19.83 | 20.29 | 20.54 | 21.64 | 6988.95 |
base | 384 | 8 | fp32 | 41.33 | 401.45 | 31.99 | 19.9 | 20.35 | 20.63 | 21.98 | 7063.42 |
large | 128 | 1 | fp16 | 40.89 | 104.15 | 15.79 | 10.17 | 10.92 | 11.07 | 11.44 | 11112.47 |
large | 128 | 1 | fp32 | 37.26 | 103.92 | 15.83 | 10.25 | 11.09 | 11.26 | 11.79 | 11137.36 |
large | 128 | 2 | fp16 | 37.81 | 193.01 | 16.95 | 10.58 | 11.04 | 11.2 | 12.06 | 11149.47 |
large | 128 | 2 | fp32 | 37.5 | 194.01 | 16.91 | 10.49 | 11 | 11.18 | 12.26 | 11147.67 |
large | 128 | 4 | fp16 | 52.43 | 318.74 | 24.24 | 12.71 | 13.16 | 13.31 | 13.8 | 11354.08 |
large | 128 | 4 | fp32 | 52.81 | 318.7 | 24.25 | 12.74 | 13.19 | 13.38 | 14.04 | 11304.21 |
large | 128 | 8 | fp16 | 68.8 | 432.67 | 29.34 | 18.61 | 19.13 | 19.32 | 19.88 | 11602.32 |
large | 128 | 8 | fp32 | 68.41 | 435.77 | 29.16 | 18.46 | 19.01 | 19.16 | 19.89 | 11517.24 |
large | 384 | 1 | fp16 | 33.8 | 81.64 | 24.05 | 12.45 | 12.81 | 12.96 | 13.95 | 12308.35 |
large | 384 | 1 | fp32 | 33.75 | 81.83 | 24.07 | 12.42 | 12.86 | 13.03 | 13.55 | 12362.84 |
large | 384 | 2 | fp16 | 51.77 | 108.18 | 41.33 | 18.6 | 19.17 | 19.37 | 20.66 | 13082.53 |
large | 384 | 2 | fp32 | 51.5 | 108.37 | 41.29 | 18.62 | 19.13 | 19.27 | 20.16 | 13072.42 |
large | 384 | 4 | fp16 | 66.02 | 127.29 | 54.69 | 31.5 | 32.08 | 32.28 | 33.48 | 13438.17 |
large | 384 | 4 | fp32 | 66.1 | 126.8 | 54.82 | 31.58 | 32.13 | 32.34 | 33.75 | 13485.86 |
large | 384 | 8 | fp16 | 93.68 | 148.62 | 78.67 | 53.86 | 54.45 | 54.66 | 56 | 14180.87 |
large | 384 | 8 | fp32 | 93.9 | 148.74 | 78.75 | 53.77 | 54.52 | 54.66 | 55.87 | 14261.35 |
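One thing worth noting when reading the table: the reported average throughput lines up with the median (Latency-50%) figure rather than with Latency-Average, because the first of the 1024 iterations is far slower (likely XLA compilation and warm-up, visible as the multi-second Latency-100% values) and drags the average up. A quick check against the BERT Large, sequence length 128, batch size 8, fp16 row:

    awk 'BEGIN {
      printf "from median latency:  %.0f sent/sec\n", 8 / (18.61 / 1000)   # ~430, close to the reported 432.67
      printf "from average latency: %.0f sent/sec\n", 8 / (29.34 / 1000)   # ~273, skewed by the slow first iteration
    }'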
1x RTX A5000 BERT LARGE Training Benchmark
GPUs | Precision | Sequence-Length | Total-Training-Time | Batch-size | Performance(sent/sec) |
1 | fp16 | 128 | 12991.12 | 1 | 14.31 |
1 | fp32 | 128 | 12932.53 | 1 | 14.38 |
1 | fp16 | 128 | 128 | 2 | 28.21 |
1 | fp32 | 128 | 6875.02 | 2 | 28.11 |
1 | fp16 | 128 | 3992.3 | 4 | 51.17 |
1 | fp32 | 128 | 3969.82 | 4 | 51.43 |
1 | fp16 | 128 | 2689.2 | 8 | 82.47 |
1 | fp32 | 128 | 2718.08 | 8 | 80.71 |
1 | fp16 | 128 | 2052.13 | 16 | 116.46 |
1 | fp32 | 128 | 2034.82 | 16 | 117.67 |
1 | fp16 | 128 | 1732.66 | 32 | 146.93 |
1 | fp32 | 128 | 1722.01 | 32 | 147.85 |
1 | fp16 | 384 | 14256.77 | 1 | 12.98 |
1 | fp32 | 384 | 14279.8 | 1 | 12.97 |
1 | fp16 | 384 | 9047.55 | 2 | 20.82 |
1 | fp32 | 384 | 9044.82 | 2 | 20.83 |
1 | fp16 | 384 | 6365.77 | 4 | 30.42 |
1 | fp32 | 384 | 6380.79 | 4 | 30.35 |
1 | fp16 | 384 | 4939.75 | 8 | 40.14 |
1 | fp32 | 384 | 4900.55 | 8 | 40.44 |
1 | fp16 | 384 | 4310.77 | 16 | 47.08 |
1 | fp32 | 384 | 4299.06 | 16 | 47.19 |
1 | fp16 | 384 | Did not Run | 32 | Did not Run |
1 | fp32 | 384 | Did not Run | 32 | Did not Run |
2x RTX A5000 BERT LARGE Training Benchmark
GPUs | Precision | Sequence-Length | Total-Training-Time | Batch-size | Performance(sent/sec) |
2 | fp16 | 128 | 11437.17 | 1 | 16.22 |
2 | fp32 | 128 | 11376.72 | 1 | 16.31 |
2 | fp16 | 128 | 6073.78 | 2 | 31.75 |
2 | fp32 | 128 | 6090.61 | 2 | 31.79 |
2 | fp16 | 128 | 3453.41 | 4 | 59.98 |
2 | fp32 | 128 | 3424.23 | 4 | 58.25 |
2 | fp16 | 128 | 2193.27 | 8 | 105.09 |
2 | fp32 | 128 | 2187.78 | 8 | 105.41 |
2 | fp16 | 128 | 1580.9 | 16 | 166.9 |
2 | fp32 | 128 | 1580.07 | 16 | 167.45 |
2 | fp16 | 128 | 1160.76 | 32 | 233.9 |
2 | fp32 | 128 | 1158.72 | 32 | 233.28 |
2 | fp16 | 384 | 12180.88 | 1 | 15.17 |
2 | fp32 | 384 | 12294.78 | 1 | 15.05 |
2 | fp16 | 384 | 7199.95 | 2 | 26.46 |
2 | fp32 | 384 | 7211.68 | 2 | 26.49 |
2 | fp16 | 384 | 4630.27 | 4 | 43.03 |
2 | fp32 | 384 | 4626.98 | 4 | 43.06 |
2 | fp16 | 384 | 3340.13 | 8 | 63.02 |
2 | fp32 | 384 | 3336.61 | 8 | 63.02 |
4x RTX A5000 BERT LARGE Training Benchmark
GPUs | Precision | Sequence-Length | Total-Training-Time | Batch-size | Performance(sent/sec) |
4 | fp16 | 128 | 6721.29 | 1 | 28.53 |
4 | fp32 | 128 | 6720.39 | 1 | 28.53 |
4 | fp16 | 128 | 3657.87 | 2 | 56.11 |
4 | fp32 | 128 | 3643.79 | 2 | 56.34 |
4 | fp16 | 128 | 2161.74 | 4 | 106.77 |
4 | fp32 | 128 | 2163.32 | 4 | 106.72 |
4 | fp16 | 128 | 1448.3 | 8 | 189.41 |
4 | fp32 | 128 | 1447.99 | 8 | 189.19 |
4 | fp16 | 128 | 983.24 | 16 | 306.3 |
4 | fp32 | 128 | 980.91 | 16 | 306.11 |
4 | fp16 | 128 | 815.48 | 32 | 437.08 |
4 | fp32 | 128 | 820.45 | 32 | 436.6 |
4 | fp16 | 384 | 7088 | 1 | 26.99 |
4 | fp32 | 384 | 7109.11 | 1 | 26.91 |
4 | fp16 | 384 | 4220.48 | 2 | 47.77 |
4 | fp32 | 384 | 4227.16 | 2 | 47.69 |
4 | fp16 | 384 | 2777.77 | 4 | 78.55 |
4 | fp32 | 384 | 2775.14 | 4 | 78.72 |
4 | fp16 | 384 | 2047.37 | 8 | 117.43 |
4 | fp32 | 384 | 2039.43 | 8 | 118.1 |
RTX A5000 BERT LARGE Training Benchmark Comparison
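As a rough numeric summary of the comparison, using the fp16, sequence-length 128, batch-size 32 rows from the three training tables above:

    awk 'BEGIN {
      printf "2x RTX A5000 vs 1x: %.2fx (233.90 / 146.93 sent/sec)\n", 233.90 / 146.93   # ~1.59x
      printf "4x RTX A5000 vs 1x: %.2fx (437.08 / 146.93 sent/sec)\n", 437.08 / 146.93   # ~2.97x, near-linear at the largest batch size
    }'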
Have any questions?
Contact Exxact Today