NVIDIA Quadro RTX 6000 Benchmarks
For this post, we conducted deep learning performance benchmarks for TensorFlow using the new NVIDIA Quadro RTX 6000 GPUs. Our Exxact Valence Workstation was fitted with 4x Quadro RTX 6000s, giving the system 96 GB of GPU memory.
We ran the standard "tf_cnn_benchmarks.py" benchmark script (found in the official TensorFlow GitHub repository) on the following networks: ResNet-50, ResNet-152, Inception v3, Inception v4, VGG-16, AlexNet, and NASNet.
We also compared FP16 to FP32 performance, starting from 'typical' batch sizes (64 in most cases) and incrementally doubling the batch size until we hit an out-of-memory error, as sketched below. We ran the same tests using 1, 2, and 4 GPU configurations. In addition, we ran benchmarks with XLA enabled and observed substantial improvements.
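A minimal sketch of that batch-size sweep, assuming the same tf_cnn_benchmarks.py script (the wrapper loop, starting size, and model choice here are ours for illustration):
# Double the batch size until the run fails, which in practice
# means a GPU out-of-memory error; the last size that succeeds is the maximum.
for bs in 64 128 256 512 1024; do
  python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=$bs \
    --model=resnet50 --variable_update=parameter_server --use_fp16=True || break
done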
Key Points and Observations
- In terms of pure img/sec, the RTX 6000 is on par with the RTX 8000. The two cards use the same Turing TU102 processor but differ in memory capacity (24 GB vs. 48 GB).
- In terms of batch size, however, the RTX 6000 cannot fit the large batch sizes that the RTX 8000 can.
- XLA significantly increases img/sec throughput across most models. This holds for both FP16 and FP32, though the most dramatic gains were seen in FP16. One way to enable XLA outside the benchmark script is sketched below.
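For reference, tf_cnn_benchmarks.py turns XLA on via its --xla_compile=True flag (visible in the commands below). For your own TensorFlow 1.14 scripts, one way to enable XLA auto-clustering without code changes is the TF_XLA_FLAGS environment variable; a minimal sketch, where my_train.py is a hypothetical script name:
# Enable XLA auto-clustering for an arbitrary TF 1.14 training script.
TF_XLA_FLAGS=--tf_xla_auto_jit=2 python my_train.py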
Quadro RTX 6000 Benchmark Snapshot, XLA on/off, FP32, FP16
Quadro RTX 6000 Deep Learning Benchmarks: FP16, Large Batch Size (XLA on)
Model | 1 GPU img/sec | 2 GPU img/sec | 4 GPU img/sec | Batch Size
--- | --- | --- | --- | ---
ResNet50 | 1081.7 | 2129.51 | 4132.64 | 512
ResNet152 | 419.07 | 816.22 | 1547.55 | 256
InceptionV3 | 566.32 | 1018.8 | 1962.5 | 256
InceptionV4 | 313.07 | 454.6 | 792.61 | 256
VGG16 | 535.07 | 1051.86 | 1996.54 | 512
NASNet | 389.39 | 761.65 | 1521 | 256
Run these benchmarks
Set --num_gpus to the number of GPUs to test, and change --model to the desired architecture.
python tf_cnn_benchmarks.py --data_format=NCHW --batch_size=256 --num_batches=100 --model=inception4 --optimizer=momentum --variable_update=replicated --all_reduce_spec=nccl --use_fp16=True --nodistortions --gradient_repacking=2 --datasets_use_prefetch=True --per_gpu_thread_count=2 --loss_type_to_report=base_loss --compute_lr_on_cpu=True --single_l2_loss_op=True --xla_compile=True --local_parameter_device=gpu --num_gpus=1 --display_every=10
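To reproduce the 1-, 2-, and 4-GPU columns in one pass, a simple wrapper loop works (the loop is our illustration; the flags are the ones from the command above):
# Repeat the XLA FP16 InceptionV4 run for each GPU count reported above.
for n in 1 2 4; do
  python tf_cnn_benchmarks.py --data_format=NCHW --batch_size=256 --num_batches=100 \
    --model=inception4 --optimizer=momentum --variable_update=replicated \
    --all_reduce_spec=nccl --use_fp16=True --nodistortions --gradient_repacking=2 \
    --datasets_use_prefetch=True --per_gpu_thread_count=2 --loss_type_to_report=base_loss \
    --compute_lr_on_cpu=True --single_l2_loss_op=True --xla_compile=True \
    --local_parameter_device=gpu --num_gpus=$n --display_every=10
done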
Quadro RTX 6000 Deep Learning Benchmarks: FP16, Batch Size 64 (XLA off)
Model | 1 GPU img/sec | 2 GPU img/sec | 4 GPU img/sec | Batch Size
--- | --- | --- | --- | ---
ResNet50 | 555.66 | 989.02 | 1764.57 | 64
ResNet152 | 251.12 | 404.76 | 687.17 | 64
InceptionV3 | 338.07 | 584.42 | 1064.14 | 64
InceptionV4 | 180.15 | 327.88 | 606.47 | 64
VGG16 | 354.14 | 589.33 | 776.88 | 64
NASNet | 163.96 | 283.55 | 518.32 | 64
Run these benchmarks
Set --num_gpus to the number of GPUs to test, and change --model to the desired architecture.
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50 --variable_update=parameter_server --use_fp16=True
Quadro RTX 6000 Deep Learning Benchmarks: FP16, Large Batch Size (XLA off)
Model | 1 GPU img/sec | 2 GPU img/sec | 4 GPU img/sec | Batch Size
--- | --- | --- | --- | ---
ResNet50 | 637.56 | 1248.54 | 2450.09 | 512
ResNet152 | 281.94 | 519.76 | 1019.9 | 256
InceptionV3 | 391.08 | 737.77 | 1441.87 | 256
InceptionV4 | 203.9 | 392.14 | 763.02 | 256
VGG16 | 277.53 | 534.93 | 997.29 | 512
NASNet | 185.68 | 354.48 | 684.18 | 256
Run these benchmarks
Set --num_gpus to the number of GPUs to test, and change --model to the desired architecture.
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=512 --model=resnet50 --variable_update=parameter_server --use_fp16=True
Quadro RTX 6000 Deep Learning Benchmarks: FP32, Large Batch Size (XLA on)
Model | 1 GPU img/sec | 2 GPU img/sec | 4 GPU img/sec | Batch Size
--- | --- | --- | --- | ---
NASNet | 393.08 | 765.76 | 1526.41 | 256
ResNet50 | 371.39 | 730.33 | 1391.21 | 256
InceptionV3 | 235.37 | 455.26 | 872.55 | 128
VGG16 | 212.84 | 421.94 | 746.47 | 256
ResNet152 | 148.4 | 287.98 | 531.83 | 128
InceptionV4 | 113.85 | 220.51 | 423.12 | 128
Run these benchmarks
Set --num_gpus to the number of GPUs to test, and change --model to the desired architecture.
python tf_cnn_benchmarks.py --data_format=NCHW --batch_size=256 --num_batches=100 --model=resnet50 --optimizer=momentum --variable_update=replicated --all_reduce_spec=nccl --nodistortions --gradient_repacking=2 --datasets_use_prefetch=True --per_gpu_thread_count=2 --loss_type_to_report=base_loss --compute_lr_on_cpu=True --single_l2_loss_op=True --xla_compile=True --local_parameter_device=gpu --num_gpus=1 --display_every=10
Quadro RTX 6000 Deep Learning Benchmarks: FP32, Batch Size 64 (XLA off)
Model | 1 GPU img/sec | 2 GPU img/sec | 4 GPU img/sec | Batch Size
--- | --- | --- | --- | ---
ResNet50 | 314.08 | 589.16 | 1055.65 | 64
ResNet152 | 128.32 | 235.46 | 430.65 | 64
InceptionV3 | 206.29 | 388.36 | 726.53 | 64
InceptionV4 | 101.99 | 192.91 | 370.47 | 64
VGG16 | 192.97 | 347.27 | 549.97 | 64
NASNet | 159.2 | 275.47 | 517.83 | 64
Run these benchmarks
Set --num_gpus to the number of GPUs to test, and change --model to the desired architecture.
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50 --variable_update=parameter_server
Quadro RTX 6000 Deep Learning Benchmarks: FP32, Large Batch Size (XLA off)
Model | 1 GPU img/sec | 2 GPU img/sec | 4 GPU img/sec | Batch Size
--- | --- | --- | --- | ---
ResNet50 | 327.73 | 642.62 | 1247.82 | 256
ResNet152 | 136.5 | 255.47 | 496.64 | 128
InceptionV3* | 215.95 | 266.22 | 429.75 | 256 (*128 when using 1 GPU)
InceptionV4 | 106.34 | 204.33 | 397.83 | 128
VGG16 | 186.79 | 365.06 | 695.92 | 512
NASNet | 181.28 | 343.81 | 669.67 | 256
Run these benchmarks
Set --num_gpus to the number of GPUs to test, and change --model to the desired architecture.
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=256 --model=resnet50 --variable_update=parameter_server
Quadro RTX 6000 Deep Learning Benchmarks: AlexNet (FP16 and FP32, with and without XLA)
Configuration | 1 GPU img/sec | 2 GPU img/sec | 4 GPU img/sec | Batch Size
--- | --- | --- | --- | ---
AlexNet FP16 (Large Batch) | 5893.11 | 11471.56 | 22062.74 | 8192
AlexNet FP16 (Normal Batch) | 6033.55 | 11361.15 | 14933.42 | 512
AlexNet FP32 (Large Batch) | 2869.05 | 5508.14 | 10677.19 | 4096
AlexNet FP32 (Normal Batch) | 4143.62 | 7979.58 | 10632.83 | 512
AlexNet XLA FP32 | 3049.84 | 5895.65 | 11129.32 | 4096
AlexNet XLA FP16 | 6733.82 | 13160.13 | 25290.29 | 8192
Run these deep learning benchmarks
Set --num_gpus to the number of GPUs to test, and omit the --use_fp16 flag to run in FP32 (see the example below).
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=8192 --model=alexnet --variable_update=parameter_server --use_fp16=True
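For example, the FP32 large-batch run is the same command with the FP16 flag omitted and the batch size reduced to 4096, the FP32 maximum from the table above:
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=4096 --model=alexnet --variable_update=parameter_server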
System Specifications:
Component | Specification
--- | ---
System | Exxact Valence Workstation
GPU | 4 x NVIDIA Quadro RTX 6000
CPU | Intel Core i7-7820X, 3.6 GHz
RAM | 32 GB DDR4
SSD | 480 GB SSD
HDD (data) | 10 TB HDD
OS | Ubuntu 18.04
NVIDIA Driver | 418.43
CUDA Version | 10.1
Python | 2.7
TensorFlow | 1.14
Docker Image | tensorflow/tensorflow:nightly-gpu
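A typical way to launch this environment, assuming the nvidia-docker2 runtime that matches the driver 418 era (the exact invocation and the host path here are illustrative):
# Launch the TensorFlow GPU container; Docker 19.03+ could use --gpus all
# instead of the nvidia-docker2 runtime flag.
docker run --runtime=nvidia -it --rm \
  -v /path/to/benchmarks:/workspace/benchmarks \
  tensorflow/tensorflow:nightly-gpu bash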
Training Parameters (non-XLA)
Parameter | Value
--- | ---
Dataset | ImageNet (synthetic)
Mode | training
SingleSess | False
Batch Size | varied
Num Batches | 100
Num Epochs | 0.08
Devices | ['/gpu:0']... (varied)
NUMA bind | False
Data format | NCHW
Optimizer | sgd
Variables | parameter_server
Training Parameters (XLA)
Parameter | Value
--- | ---
Dataset | ImageNet (synthetic)
Mode | training
SingleSess | False
Batch Size | varied
Num Batches | 100
Num Epochs | 0.08
Devices | ['/gpu:0']... (varied)
NUMA bind | False
Data format | NCHW
Optimizer | momentum
Variables | replicated
AllReduce | nccl