Benchmarks

TITAN RTX Benchmarks for Deep Learning in TensorFlow 2019: XLA, FP16, FP32, & NVLink

May 29, 2019
20 min read

NVIDIA Titan RTX Benchmarks

For this blog article, we conducted deep learning performance benchmarks for TensorFlow using NVIDIA TITAN RTX GPUs. Tests were run on an Exxact TITAN Workstation outfitted with 2x TITAN RTX GPUs connected by an NVLink bridge, using the standard "tf_cnn_benchmarks.py" benchmark script found in the official TensorFlow benchmarks repository on GitHub.

We ran tests on the following networks: ResNet-50, ResNet-152, Inception v3, Inception v4, VGG-16, AlexNet, and NASNet. We compared FP16 to FP32 performance and measured throughput with XLA enabled and disabled. The same tests were conducted using 1 and 2 GPU configurations, and each batch size was the largest power of two that fit in memory.
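
The individual commands below can be scripted into a full sweep. Here is a minimal sketch, assuming tf_cnn_benchmarks.py sits in the working directory and that the script's closing "total images/sec:" summary line is what you want to capture:

#!/usr/bin/env bash
# Hypothetical sweep over the models benchmarked in this post (FP16, XLA off).
# The batch size here is a single illustrative value; see the tables below for
# the per-model sizes we actually used.
for model in resnet50 resnet152 inception3 inception4 vgg16 alexnet nasnet; do
  for gpus in 1 2; do
    echo "=== ${model}, ${gpus} GPU(s) ==="
    python tf_cnn_benchmarks.py --num_gpus=${gpus} --batch_size=256 \
      --model=${model} --variable_update=parameter_server --use_fp16=True \
      | grep "total images/sec"
  done
done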

Key Points and Observations

  • The TITAN RTX is an excellent choice if you need large batch sizes for training while keeping costs at a reasonable price point.
  • Performance (img/sec) is comparable to Quadro RTX 6000 benchmark performance in most instances.
  • With this dual-GPU configuration, the workstation ran quietly and stayed cool during training workloads (note: the chassis offers a lot of airflow).
  • Significant gains were made using XLA in most cases, especially in FP16 (see the toggle example after this list).
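
For reference, XLA compilation in tf_cnn_benchmarks is controlled by a single flag; the minimal pair below illustrates the toggle (ResNet-50 is just an example model):

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=256 --model=resnet50 --use_fp16=True
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=256 --model=resnet50 --use_fp16=True --xla_compile=True

Note that our full XLA commands below also switch to --variable_update=replicated with NCCL all-reduce and a momentum optimizer, so the gains in the tables reflect more than the compiler flag alone.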

TITAN RTX Benchmark Snapshot, All Models, XLA on/off, FP32, FP16

[Chart: TITAN RTX benchmark snapshot, all models, XLA on/off, FP32, FP16]


TITAN RTX Deep Learning Benchmarks: FP16 (XLA on)

[Chart and results table: FP16 (XLA on)]

Run these benchmarks

Set --num_gpus to the number of GPUs you want to test and --model to the desired architecture.

python tf_cnn_benchmarks.py --data_format=NCHW --batch_size=256 --num_batches=100 --model=inception4 --optimizer=momentum --variable_update=replicated --all_reduce_spec=nccl --use_fp16=True --nodistortions --gradient_repacking=2 --datasets_use_prefetch=True --per_gpu_thread_count=2 --loss_type_to_report=base_loss --compute_lr_on_cpu=True --single_l2_loss_op=True --xla_compile=True --local_parameter_device=gpu --num_gpus=1 --display_every=10
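
The 2-GPU figures come from the same command with --num_gpus=2 and no other changes; for example:

python tf_cnn_benchmarks.py --data_format=NCHW --batch_size=256 --num_batches=100 --model=inception4 --optimizer=momentum --variable_update=replicated --all_reduce_spec=nccl --use_fp16=True --nodistortions --gradient_repacking=2 --datasets_use_prefetch=True --per_gpu_thread_count=2 --loss_type_to_report=base_loss --compute_lr_on_cpu=True --single_l2_loss_op=True --xla_compile=True --local_parameter_device=gpu --num_gpus=2 --display_every=10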

TITAN RTX Deep Learning Benchmarks: FP16 (XLA off)

[Chart: FP16 (XLA off) results]

Model         1 GPU (img/sec)   2 GPU (img/sec)   Batch Size
InceptionV4   323.93            475.52            256
ResNet152     421.69            836.44            256
VGG16         561.49            1113.41           512
NASNET        400.37            797.99            256
InceptionV3   578.49            1058.66           256
ResNet50      1096.44           2186.79           512



Run these benchmarks

Set --num_gpus to the number of GPUs you want to test and --model to the desired architecture.

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=512 --model=resnet50 --variable_update=parameter_server --use_fp16=True
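
Adjust --batch_size and --model to match the table above; for example, the InceptionV4 row used a batch size of 256:

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=256 --model=inception4 --variable_update=parameter_server --use_fp16=True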

TITAN RTX Deep Learning Benchmarks: FP32 (XLA on)

[Chart: FP32 (XLA on) results]

Model         1 GPU (img/sec)   2 GPU (img/sec)   Batch Size
InceptionV4   207.98            399.16            256
ResNet152     284.87            530.86            256
NASNET        191.09            369.07            256
VGG16         287.10            544.71            512
InceptionV3   397.09            784.24            256
ResNet50      646.13            1287.01           512


Run these benchmarks

Set --num_gpus to the number of GPUs you want to test and --model to the desired architecture.

python tf_cnn_benchmarks.py --data_format=NCHW --batch_size=256 --num_batches=100 --model=resnet50 --optimizer=momentum --variable_update=replicated --all_reduce_spec=nccl --nodistortions --gradient_repacking=2 --datasets_use_prefetch=True --per_gpu_thread_count=2 --loss_type_to_report=base_loss --compute_lr_on_cpu=True --single_l2_loss_op=True --xla_compile=True --local_parameter_device=gpu --num_gpus=1 --display_every=10

TITAN RTX Deep Learning Benchmarks: FP32 (XLA off)

Model         1 GPU (img/sec)   2 GPU (img/sec)   Batch Size
VGG16         219.09            418.78            256
ResNet152     151.76            298.10            256
InceptionV4   116.04            218.20            128
InceptionV3   241.22            477.17            128
ResNet50      382.63            755.26            256
NASNET        402.65            802.59            256


Run these benchmarks

Set --num_gpus to the number of GPUs you want to test and --model to the desired architecture.

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=256 --model=resnet50 --variable_update=parameter_server
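
As before, adjust --batch_size and --model to match the table; the InceptionV4 row, for instance, used a batch size of 128:

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=inception4 --variable_update=parameter_server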

TITAN RTX Deep Learning Benchmarks: AlexNet (FP32, FP16, XLA FP16, XLA FP32)

[Chart: AlexNet results, FP32 and FP16, XLA on/off]

Run these benchmarks

Set --num_gpus to the number of GPUs you want to test and --model to the desired architecture.

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=4096 --model=alexnet --variable_update=parameter_server --use_fp16=True

Run these benchmarks with XLA

To run with XLA, set --num_gpus to the number of GPUs you want to test and --model to the desired architecture.

python tf_cnn_benchmarks.py --data_format=NCHW --batch_size=8192 --num_batches=100 --model=alexnet --optimizer=momentum --variable_update=replicated --all_reduce_spec=nccl --use_fp16=True --nodistortions --gradient_repacking=2 --datasets_use_prefetch=True --per_gpu_thread_count=2 --loss_type_to_report=base_loss --compute_lr_on_cpu=True --single_l2_loss_op=True --xla_compile=True --local_parameter_device=gpu --num_gpus=1 --display_every=10
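
For the XLA FP32 AlexNet runs, drop --use_fp16=True from the command above. We keep --batch_size=8192 here for symmetry, but note this is an assumption: FP32 doubles the per-sample memory footprint, so a smaller power of two may be needed to fit in memory.

python tf_cnn_benchmarks.py --data_format=NCHW --batch_size=8192 --num_batches=100 --model=alexnet --optimizer=momentum --variable_update=replicated --all_reduce_spec=nccl --nodistortions --gradient_repacking=2 --datasets_use_prefetch=True --per_gpu_thread_count=2 --loss_type_to_report=base_loss --compute_lr_on_cpu=True --single_l2_loss_op=True --xla_compile=True --local_parameter_device=gpu --num_gpus=1 --display_every=10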

System Specifications:

System          Exxact TITAN Workstation
GPU             2 x NVIDIA TITAN RTX
CPU             Intel Core i7-7820X 3.6 GHz
RAM             32 GB DDR4
SSD             480 GB SSD
HDD (data)      10 TB HDD
OS              Ubuntu 18.04
NVIDIA Driver   418.43
CUDA Version    10.1
Python          2.7, 3.7
TensorFlow      1.14
Docker Image    tensorflow/tensorflow:nightly-gpu
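
To reproduce this environment, run the benchmarks inside the container listed above. Below is a minimal sketch using nvidia-docker2 (the current NVIDIA container runtime for driver 418 at the time of writing); tf_cnn_benchmarks.py lives in the official TensorFlow benchmarks repository:

docker run --runtime=nvidia -it tensorflow/tensorflow:nightly-gpu bash
# inside the container:
git clone https://github.com/tensorflow/benchmarks.git
cd benchmarks/scripts/tf_cnn_benchmarks
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=256 --model=resnet50 --variable_update=parameter_server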


Training Parameters (non-XLA)

Dataset:       ImageNet (synthetic)
Mode:          training
SingleSess:    False
Batch Size:    varied
Num Batches:   100
Num Epochs:    0.08
Devices:       ['/gpu:0']... (varied)
NUMA bind:     False
Data format:   NCHW
Optimizer:     sgd
Variables:     parameter_server


Training Parameters (XLA)

Dataset:       ImageNet (synthetic)
Mode:          training
SingleSess:    False
Batch Size:    varied
Num Batches:   100
Num Epochs:    0.08
Devices:       ['/gpu:0']... (varied)
NUMA bind:     False
Data format:   NCHW
Optimizer:     momentum
Variables:     replicated
AllReduce:     nccl


More Deep Learning Benchmarks...
