NVIDIA Quadro RTX 8000 Benchmarks
Updated 6/11/2019 with XLA FP32 and XLA FP16 metrics.
For this post, we conducted deep learning performance benchmarks for TensorFlow using the new NVIDIA Quadro RTX 8000 GPUs. Our Exxact Valence Workstation was equipped with 4x Quadro RTX 8000 GPUs, giving the system a massive 192 GB of GPU memory. We ran the standard tf_cnn_benchmarks.py benchmark script (found in the official TensorFlow GitHub repository) on the following networks: ResNet-50, ResNet-152, Inception v3, Inception v4, VGG-16, AlexNet, and NASNet. We compared FP16 to FP32 performance using 'typical' batch sizes (64 in most cases), then incrementally doubled the batch size until we hit an out-of-memory error; a sketch of that sweep follows below. All tests ran on 1, 2, and 4 GPU configurations.
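If you want to replicate the batch-size sweep, here is a minimal sketch of ours (not part of the benchmark suite) that doubles the batch size until a run fails. It assumes tf_cnn_benchmarks.py is in the working directory, and the model and flags are just examples taken from the commands in this post:

```python
# Hypothetical batch-size sweep: double until the benchmark fails; on a
# memory-bound run, the previous size was the largest that fit in GPU memory.
import subprocess

for bs in [64, 128, 256, 512, 1024, 2048]:
    cmd = ["python", "tf_cnn_benchmarks.py", "--num_gpus=4",
           "--batch_size=%d" % bs, "--model=resnet50",
           "--variable_update=parameter_server", "--use_fp16=True"]
    if subprocess.call(cmd) != 0:  # non-zero exit here is most likely an OOM
        print("batch size %d failed; last successful size: %d" % (bs, bs // 2))
        break
```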
Key Points and Observations
- In most scenarios, training with a large batch size delivered notably higher images/sec than smaller batch sizes. This is especially true when scaling to the 4 GPU configuration.
- AlexNet and VGG16 performed better with a smaller batch size on a single GPU, but larger batch sizes performed better on these models when scaling up to 4 GPUs.
- ResNet-50 and ResNet-152 showed massive scaling going from 1 to 2 to 4 GPUs: a mind-blowing 4193.48 images/sec for ResNet-50 and 1621.96 images/sec for ResNet-152 at FP16 with XLA (see the scaling arithmetic after this list).
- FP16 showed impressive images/sec gains over FP32 across most models when using 4 GPUs (AlexNet being the exception).
- With 48 GB of memory, the Quadro RTX 8000 is ideal for training networks that require large batch sizes that would otherwise be limited on lower-end GPUs.
- The Quadro RTX 8000 is an ideal choice for deep learning if you're restricted to a workstation or single-server form factor and want maximum GPU memory.
- Our workstations with Quadro RTX 8000 GPUs can also train state-of-the-art NLP Transformer networks that require large batch sizes for best performance, a popular application in the fast-growing data science market.
- XLA significantly increases images/sec across most models, and the most dramatic gains came with FP16.
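To quantify that multi-GPU scaling, here is the arithmetic for ResNet-50 using the FP16 + XLA numbers reported in the table below (the throughput figures are from this post; the efficiency framing is ours):

```python
# ResNet-50 FP16 + XLA throughput (img/sec) from the benchmark table below.
one_gpu = 1096.32
four_gpu = 4193.48
speedup = four_gpu / one_gpu   # ~3.83x with 4 GPUs
efficiency = speedup / 4.0     # ~96% of perfect linear scaling
print("speedup: %.2fx, scaling efficiency: %.0f%%" % (speedup, efficiency * 100))
```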
Quadro RTX 8000 Deep Learning Benchmark Snapshot (FP16, FP32, XLA on/off)
Quadro RTX 8000 Deep Learning Benchmarks: FP16, XLA
| Model | 1 GPU (img/sec) | 2 GPU (img/sec) | 4 GPU (img/sec) | Batch Size |
|---|---|---|---|---|
| InceptionV4 | 314.95 | 468.11 | 808.72 | 512 |
| NASNET | 406.77 | 787.47 | 1557.53 | 512 |
| ResNet152 | 429.1 | 835.26 | 1621.96 | 512 |
| VGG16 | 530.31 | 1028.79 | 1982.34 | 512 |
| InceptionV3 | 577.05 | 1039.15 | 2025.35 | 512 |
| ResNet50 | 1096.32 | 2158.67 | 4193.48 | 1024 |
Run these benchmarks
Set num_gpus to the number of GPUs you want to test, and model to the desired architecture.

```bash
python tf_cnn_benchmarks.py --data_format=NCHW --batch_size=512 --num_batches=100 --model=inception4 --optimizer=momentum --variable_update=replicated --all_reduce_spec=nccl --use_fp16=True --nodistortions --gradient_repacking=2 --datasets_use_prefetch=True --per_gpu_thread_count=2 --loss_type_to_report=base_loss --compute_lr_on_cpu=True --single_l2_loss_op=True --xla_compile=True --local_parameter_device=gpu --num_gpus=1 --display_every=10
```
Quadro RTX 8000 Deep Learning Benchmarks: FP32, XLA
| Model | 1 GPU (img/sec) | 2 GPU (img/sec) | 4 GPU (img/sec) | Batch Size |
|---|---|---|---|---|
| InceptionV4 | 113.86 | 218.12 | 424.77 | 256 |
| ResNet152 | 150.04 | 287.6 | 549.79 | 256 |
| VGG16 | 163.43 | 319.69 | 604.44 | 512 |
| InceptionV3 | 236.74 | 459.86 | 886.57 | 256 |
| ResNet50 | 372.39 | 719.11 | 1391.74 | 512 |
| NASNET | 407.48 | 788.33 | 1562.55 | 512 |
Run these benchmarks
Set num_gpus to the number of GPUs you want to test, and model to the desired architecture.

```bash
python tf_cnn_benchmarks.py --data_format=NCHW --batch_size=256 --num_batches=100 --model=resnet50 --optimizer=momentum --variable_update=replicated --all_reduce_spec=nccl --nodistortions --gradient_repacking=2 --datasets_use_prefetch=True --per_gpu_thread_count=2 --loss_type_to_report=base_loss --compute_lr_on_cpu=True --single_l2_loss_op=True --xla_compile=True --local_parameter_device=gpu --num_gpus=1 --display_every=10
```
Quadro RTX 8000 Deep Learning Benchmarks: FP32, Batch Size 64
| Model | 1 GPU (img/sec) | 2 GPU (img/sec) | 4 GPU (img/sec) | Batch Size |
|---|---|---|---|---|
| ResNet50 | 314.87 | 590.3 | 952.8 | 64 |
| ResNet152 | 127.71 | 232.42 | 418.44 | 64 |
| InceptionV3 | 207.53 | 386.86 | 655.45 | 64 |
| InceptionV4 | 102.41 | 191.4 | 337.44 | 64 |
| VGG16 | 188.91 | 337.38 | 536.95 | 64 |
| NASNET | 160.42 | 280.07 | 510.15 | 64 |
Run these benchmarks
Set num_gpus to the number of GPUs you want to test, and model to the desired architecture.

```bash
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50 --variable_update=parameter_server
```
Quadro RTX 8000 Deep Learning Benchmarks: FP32, Large Batch Size
| Model | 1 GPU (img/sec) | 2 GPU (img/sec) | 4 GPU (img/sec) | Batch Size |
|---|---|---|---|---|
| ResNet50 | 322.66 | 622.41 | 1213.3 | 512 |
| ResNet152 | 137.12 | 249.58 | 452.77 | 256 |
| InceptionV3 | 216.27 | 412.75 | 716.47 | 256 |
| InceptionV4 | 105.2 | 201.49 | 345.79 | 256 |
| VGG16 | 166.55 | 316.46 | 617 | 512 |
| NASNET | 187.69 | 348.71 | 614 | 512 |
Run these benchmarks
Set num_gpus to the number of GPUs you want to test, model to the desired architecture, and batch_size to the desired mini-batch size.

```bash
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=512 --model=resnet50 --variable_update=parameter_server
```
Quadro RTX 8000 Deep Learning Benchmarks: FP16, Batch Size 64
| Model | 1 GPU (img/sec) | 2 GPU (img/sec) | 4 GPU (img/sec) | Batch Size |
|---|---|---|---|---|
| ResNet50 | 544.16 | 972.89 | 1565.18 | 64 |
| ResNet152 | 246.56 | 412.25 | 672.87 | 64 |
| InceptionV3 | 334.28 | 596.65 | 1029.24 | 64 |
| InceptionV4 | 178.41 | 327.89 | 540.52 | 64 |
| VGG16 | 347.01 | 570.53 | 637.97 | 64 |
| NASNET | 155.44 | 282.78 | 517.06 | 64 |
Run these benchmarks
Set num_gpus to the number of GPUs you want to test, and model to the desired architecture.

```bash
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50 --variable_update=parameter_server --use_fp16=True
```
Quadro RTX 8000 Deep Learning Benchmarks: FP16, Large Batch Size
| Model | 1 GPU (img/sec) | 2 GPU (img/sec) | 4 GPU (img/sec) | Batch Size |
|---|---|---|---|---|
| ResNet50 | 604.76 | 1184.52 | 2338.84 | 1024 |
| ResNet152 | 285.85 | 529.05 | 1062.13 | 512 |
| InceptionV3 | 391.3 | 754.94 | 1471.66 | 512 |
| InceptionV4 | 203.67 | 384.29 | 762.32 | 512 |
| VGG16 | 276.16 | 528.88 | 983.85 | 512 |
| NASNET | 196.52 | 367.6 | 726.85 | 512 |
Run these benchmarks
Set num_gpus to the number of GPUs you want to test, model to the desired architecture, and batch_size to the desired mini-batch size.

```bash
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=1024 --model=resnet50 --variable_update=parameter_server --use_fp16=True
```
Quadro RTX 8000 Deep Learning Benchmarks: AlexNet (FP32, FP16, XLA on/off)

| Configuration | 1 GPU (img/sec) | 2 GPU (img/sec) | 4 GPU (img/sec) | Batch Size |
|---|---|---|---|---|
| AlexNet FP16 (Large Batch) | 5911.6 | 11456.11 | 21828.99 | 8192 |
| AlexNet FP16 (Regular Batch) | 6013.64 | 11275.54 | 14960.97 | 512 |
| AlexNet FP32 (Large Batch) | 2825.61 | 4421.97 | 8482.39 | 8192 |
| AlexNet FP32 (Regular Batch) | 4103.27 | 7814.04 | 10491.22 | 512 |
| AlexNet FP16 XLA | 6787.5 | 13101.07 | 25035.27 | 8192 |
| AlexNet FP32 XLA | 2173.97 | 4144.43 | 8007.66 | 8192 |
Run these deep learning benchmarks
Set num_gpus to the number of GPUs you want to test, and omit the use_fp16 flag to run in FP32. Set batch_size to the desired mini-batch size.

```bash
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=8192 --model=alexnet --variable_update=parameter_server --use_fp16=True
```
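If you want to collect throughput numbers programmatically, you can capture the script's output and look for its summary line (tf_cnn_benchmarks prints a "total images/sec:" line at the end of a run). A minimal sketch of ours, reusing the AlexNet flags above:

```python
# Run one benchmark and extract the "total images/sec:" summary line.
import subprocess

out = subprocess.check_output(
    ["python", "tf_cnn_benchmarks.py", "--num_gpus=4", "--batch_size=8192",
     "--model=alexnet", "--variable_update=parameter_server", "--use_fp16=True"])
for line in out.decode("utf-8").splitlines():
    if "total images/sec" in line:
        print(line.strip())
```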
System Specifications
| Component | Specification |
|---|---|
| System | Exxact Valence Workstation |
| GPU | 4 x NVIDIA Quadro RTX 8000 |
| CPU | Intel Core i7-7820X 3.6 GHz |
| RAM | 32 GB DDR4 |
| SSD | 480 GB SSD |
| HDD (data) | 10 TB HDD |
| OS | Ubuntu 18.04 |
| NVIDIA Driver | 410.79 |
| CUDA Version | 10 |
| Python | 2.7 |
| TensorFlow | 1.14 |
| Docker Image | tensorflow/tensorflow:nightly-gpu |
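To reproduce this environment, the benchmarks can be launched inside the same container. A minimal sketch, assuming the nvidia-docker 2 runtime and that the TensorFlow benchmarks repository is cloned at /path/to/benchmarks (a placeholder path):

```bash
# Launch the benchmark inside the tensorflow/tensorflow:nightly-gpu container
# used for these tests; mount the cloned benchmarks repo into the container.
docker run --runtime=nvidia --rm -v /path/to/benchmarks:/benchmarks \
  tensorflow/tensorflow:nightly-gpu \
  python /benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
  --num_gpus=4 --batch_size=64 --model=resnet50 --variable_update=parameter_server
```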
Training Parameters (non-XLA)

| Parameter | Value |
|---|---|
| Dataset | ImageNet (synthetic) |
| Mode | training |
| SingleSess | False |
| Batch Size | varied |
| Num Batches | 100 |
| Num Epochs | 0.08 |
| Devices | ['/gpu:0']... (varied) |
| NUMA bind | False |
| Data format | NCHW |
| Optimizer | sgd |
| Variables | parameter_server |
Training Parameters (XLA)

| Parameter | Value |
|---|---|
| Dataset | ImageNet (synthetic) |
| Mode | training |
| SingleSess | False |
| Batch Size | varied |
| Num Batches | 100 |
| Num Epochs | 0.08 |
| Devices | ['/gpu:0']... (varied) |
| NUMA bind | False |
| Data format | NCHW |
| Optimizer | momentum |
| Variables | replicated |
| AllReduce | nccl |
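For reference, these XLA-run parameters map directly onto the flags in the XLA commands earlier in this post; a sketch of the correspondence:

```python
# Mapping from the XLA training parameters above to tf_cnn_benchmarks flags
# (all flag names taken from the commands earlier in this post).
XLA_PARAM_TO_FLAG = {
    "Data format: NCHW":     "--data_format=NCHW",
    "Optimizer: momentum":   "--optimizer=momentum",
    "Variables: replicated": "--variable_update=replicated",
    "AllReduce: nccl":       "--all_reduce_spec=nccl",
    "Num Batches: 100":      "--num_batches=100",
    "XLA: on":               "--xla_compile=True",
}
```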
More Deep Learning Benchmarks
- NVIDIA RTX 2080 Ti Deep Learning Benchmarks for TensorFlow: Updated with XLA & FP16
- NVIDIA Quadro RTX 6000 GPU Benchmarks for TensorFlow
- RTX 2080 Ti Deep Learning Performance Benchmarks for TensorFlow
- TITAN RTX Deep Learning Benchmarks 2019
That's it for now! Have any questions? Let us know on social media.
https://www.facebook.com/exxactcorp/
https://twitter.com/Exxactcorp