NVIDIA RTX 2080 Ti Benchmarks for Deep Learning with TensorFlow: Updated with XLA & FP16

May 23, 2019

NVIDIA RTX 2080 Ti Benchmarks

For this article, we ran a more extensive set of deep learning performance benchmarks for TensorFlow on NVIDIA GeForce RTX 2080 Ti GPUs. We recently found that XLA (Accelerated Linear Algebra) delivers significant performance gains, and felt it was worth running the numbers again. Our Exxact Valence Workstation was fitted with 4x RTX 2080 Ti GPUs and ran the standard "tf_cnn_benchmarks.py" benchmark script from the official TensorFlow GitHub. We tested the following networks: ResNet50, ResNet152, Inception v3, Inception v4, VGG-16, AlexNet, and NASNet. We also compared FP16 to FP32 performance, with and without the XLA flag, and ran the same tests in 1-, 2-, and 4-GPU configurations. The batch size for each run was the largest power of two that fit into available GPU memory.
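
The batch-size rule (largest power of two that fits in GPU memory) can be sketched as a simple doubling search. This is a minimal illustration, not the procedure used for these benchmarks. `fits_in_memory` is a hypothetical callback you would supply yourself, for example by launching a short run and checking for an out-of-memory error, and the sketch assumes that if a batch size fails, every larger one fails too.

```python
from typing import Callable


def largest_power_of_two_batch(fits_in_memory: Callable[[int], bool],
                               max_batch: int = 4096) -> int:
    """Return the largest power-of-two batch size accepted by fits_in_memory.

    fits_in_memory is a hypothetical probe (an assumption of this sketch).
    Returns 0 if even a batch size of 1 does not fit.
    """
    best = 0
    batch = 1
    # Keep doubling until the probe fails or the upper limit is reached.
    while batch <= max_batch and fits_in_memory(batch):
        best = batch
        batch *= 2
    return best


# Stand-in probe for demonstration only: pretend up to 100 images fit.
def pretend_probe(batch_size: int) -> bool:
    return batch_size <= 100


print(largest_power_of_two_batch(pretend_probe))
```

With the stand-in probe, the search stops at 64, since 128 would exceed the pretend limit of 100.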

Key Points and Observations

  • XLA significantly increases throughput (img/sec) across most models. This holds for both FP16 and FP32, with the most dramatic gains in FP16. For example, ResNet50 on 4 GPUs at FP16 rose from 1836.61 to 2683.34 img/sec, about 46% higher; put the other way, the non-XLA result is about 32% lower.
  • On certain models we ran into errors when benchmarking with XLA (the VGG16 and AlexNet models at FP32).
  • The ResNet models (ResNet50, ResNet152) showed massive improvements using XLA + FP16.
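
Percentages like these depend on which run is taken as the baseline, so it can help to recompute them from the tables. The short sketch below does that for the ResNet50, 4-GPU, FP16 figures reported in this article; the function names are illustrative only.

```python
def percent_gain(baseline: float, improved: float) -> float:
    """Increase of `improved` over `baseline`, as a percentage of baseline."""
    return (improved - baseline) / baseline * 100.0


def percent_reduction(baseline: float, improved: float) -> float:
    """How far `baseline` falls short of `improved`, as a percentage of improved."""
    return (improved - baseline) / improved * 100.0


# ResNet50, 4 GPUs, FP16 (img/sec), taken from the tables in this article.
xla_off = 1836.61
xla_on = 2683.34

print(f"XLA on vs. off: {percent_gain(xla_off, xla_on):.1f}% higher")
print(f"XLA off vs. on: {percent_reduction(xla_off, xla_on):.1f}% lower")
```

The first figure uses the non-XLA run as the baseline and the second uses the XLA run, which is why the two percentages differ for the same pair of measurements.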

GeForce RTX 2080 Ti Benchmark Snapshot, All Models, XLA on/off, FP32, FP16

[Chart: img/sec for all models, XLA on/off, FP32 and FP16]

GeForce RTX 2080 Ti Deep Learning Benchmarks: FP16 (XLA on)

[Chart: FP16 (XLA on), img/sec by model for 1, 2, and 4 GPUs]

Model         1 GPU img/sec   2 GPU img/sec   4 GPU img/sec   Batch Size
InceptionV4   217.87          303.98          521.84          64
ResNet152     290             450.72          849.92          64
VGG16         339.57          505.99          940.04          64
NASNET        342.9           657.15          1298.89         128
InceptionV3   425.15          708.47          1354.54         128
ResNet50      812.24          1386.49         2683.34         128


Run these benchmarks

To run FP16 with XLA, set num_gpus to the number of GPUs you want to test, model to the desired architecture, and batch_size to the value listed for that model in the table above.

python tf_cnn_benchmarks.py --data_format=NCHW --batch_size=64 --num_batches=100 --model=inception4 --optimizer=momentum --variable_update=replicated --all_reduce_spec=nccl --use_fp16=True --nodistortions --gradient_repacking=2 --datasets_use_prefetch=True --per_gpu_thread_count=2 --loss_type_to_report=base_loss --compute_lr_on_cpu=True --single_l2_loss_op=True --xla_compile=True --local_parameter_device=gpu --num_gpus=1 --display_every=10
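
Since the XLA and non-XLA commands in this article differ only in a handful of flags, a sweep over models and GPU counts can be scripted. The sketch below only assembles the command lines; it does not run anything. The flag values are copied from the commands in this article and may not exist in every version of tf_cnn_benchmarks.py, while `build_command` and `XLA_FLAGS` are names made up for this illustration.

```python
from typing import List

# Flags shared by the XLA commands in this article (copied verbatim).
XLA_FLAGS = [
    "--data_format=NCHW", "--num_batches=100", "--optimizer=momentum",
    "--variable_update=replicated", "--all_reduce_spec=nccl",
    "--nodistortions", "--gradient_repacking=2",
    "--datasets_use_prefetch=True", "--per_gpu_thread_count=2",
    "--loss_type_to_report=base_loss", "--compute_lr_on_cpu=True",
    "--single_l2_loss_op=True", "--xla_compile=True",
    "--local_parameter_device=gpu", "--display_every=10",
]


def build_command(model: str, num_gpus: int, batch_size: int,
                  use_fp16: bool, xla: bool) -> List[str]:
    """Assemble (but do not run) a tf_cnn_benchmarks.py command line."""
    cmd = [
        "python", "tf_cnn_benchmarks.py",
        f"--model={model}",
        f"--num_gpus={num_gpus}",
        f"--batch_size={batch_size}",
    ]
    if xla:
        cmd += XLA_FLAGS
    else:
        # The non-XLA commands in this article use the parameter server mode.
        cmd.append("--variable_update=parameter_server")
    if use_fp16:
        cmd.append("--use_fp16=True")
    return cmd


# Print the FP16 + XLA commands for InceptionV4 on 1, 2, and 4 GPUs.
for gpus in (1, 2, 4):
    print(" ".join(build_command("inception4", gpus, 64, use_fp16=True, xla=True)))
```

Each printed line can be pasted into a shell, or the list returned by `build_command` can be passed to `subprocess.run` inside the benchmark environment.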

GeForce RTX 2080 Ti Deep Learning Benchmarks: FP16 (XLA off)

[Chart: FP16 (XLA off), img/sec by model for 1, 2, and 4 GPUs]

Model         1 GPU img/sec   2 GPU img/sec   4 GPU img/sec   Batch Size
InceptionV4   150.59          247.16          497.54          64
ResNet152     209.27          348.8           538.15          64
NASNET        171.78          310.02          577.88          128
VGG16         274.24          419.28          586.96          128
InceptionV3   310.32          569.24          1106.4          128
ResNet50      522.52          959.78          1836.61         128


Run these benchmarks

To run FP16 without XLA, set num_gpus to the number of GPUs you want to test, model to the desired architecture, and batch_size to the value listed for that model in the table above.

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50 --variable_update=parameter_server --use_fp16=True

GeForce RTX 2080 Ti Deep Learning Benchmarks: FP32 (XLA on)

[Chart: FP32 (XLA on), img/sec by model for 1, 2, and 4 GPUs]

Model         1 GPU img/sec   2 GPU img/sec   4 GPU img/sec   Batch Size
VGG16         error           error           error           error
ResNet152     120.23          164.61          305.01          32
InceptionV4   193.85          294.28          557.54          32
InceptionV3   211.24          358.4           694.84          64
ResNet50      326.62          517.55          981.34          64
NASNET        294.21          527.21          1049.83         64


Run these benchmarks

To run FP32 with XLA, set num_gpus to the number of GPUs you want to test, model to the desired architecture, and batch_size to the value listed for that model in the table above.

python tf_cnn_benchmarks.py --data_format=NCHW --batch_size=64 --num_batches=100 --model=resnet50 --optimizer=momentum --variable_update=replicated --all_reduce_spec=nccl --nodistortions --gradient_repacking=2 --datasets_use_prefetch=True --per_gpu_thread_count=2 --loss_type_to_report=base_loss --compute_lr_on_cpu=True --single_l2_loss_op=True --xla_compile=True --local_parameter_device=gpu --num_gpus=1 --display_every=10

GeForce RTX 2080 Ti Deep Learning Benchmarks: FP32 (XLA off)

[Chart: FP32 (XLA off), img/sec by model for 1, 2, and 4 GPUs]

Model         1 GPU img/sec   2 GPU img/sec   4 GPU img/sec   Batch Size
ResNet152     112.33          182.28          266.22          32
InceptionV4   90.34           158.79          296.97          32
VGG16         177.84          248.7           316.2           64
NASNET        151.79          264.01          459.98          64
InceptionV3   195.18          356.1           696.47          64
ResNet50      300.4           551.19          1005.79         64
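
The tables report raw throughput per GPU count. A derived figure that is often useful when reading them is scaling efficiency: the measured multi-GPU throughput divided by the single-GPU throughput times the number of GPUs. The sketch below applies that definition to two rows of the FP32 (XLA off) table above; the function name is illustrative and the metric is not part of the original results.

```python
def scaling_efficiency(single_gpu: float, multi_gpu: float, num_gpus: int) -> float:
    """Fraction of ideal linear scaling achieved (1.0 means perfect scaling)."""
    return multi_gpu / (single_gpu * num_gpus)


# (1 GPU, 2 GPU, 4 GPU) img/sec from the FP32 (XLA off) table.
rows = {
    "ResNet50": (300.4, 551.19, 1005.79),
    "VGG16": (177.84, 248.7, 316.2),
}

for model, (one, two, four) in rows.items():
    eff2 = scaling_efficiency(one, two, 2)
    eff4 = scaling_efficiency(one, four, 4)
    print(f"{model}: 2 GPUs {eff2:.0%} of linear, 4 GPUs {eff4:.0%} of linear")
```

On these numbers ResNet50 stays much closer to linear scaling than VGG16 does in the parameter_server configuration.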


Run these benchmarks

To run FP32 without XLA, set num_gpus to the number of GPUs you want to test, model to the desired architecture, and batch_size to the value listed for that model in the table above.

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50 --variable_update=parameter_server

RTX 2080 Ti Deep Learning Benchmarks: Alexnet (FP32, FP16, XLA FP16, XLA FP32)

[Chart: AlexNet img/sec for FP32, FP16, XLA FP16, and XLA FP32 on 1, 2, and 4 GPUs]

Configuration      1 GPU img/sec   2 GPU img/sec   4 GPU img/sec   Batch Size
Alexnet FP32       2962.5          4861.82         8764.99         2048
Alexnet XLA FP32   error           error           error           error
Alexnet FP16       4979.32         9108.2          13779.41        2048
Alexnet XLA FP16   4945.81         8620.76         16553.55        2048


How to run these benchmarks

Set num_gpus to the number of GPUs you want to test; the model is already set to alexnet. The command below runs FP16; for the FP32 run, omit --use_fp16=True.

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=2048 --model=alexnet --variable_update=parameter_server --use_fp16=True

To run these benchmarks with XLA

Set num_gpus to the number of GPUs you want to test; the model is already set to alexnet.

python tf_cnn_benchmarks.py --data_format=NCHW --batch_size=2048 --num_batches=100 --model=alexnet --optimizer=momentum --variable_update=replicated --all_reduce_spec=nccl --use_fp16=True --nodistortions --gradient_repacking=2 --datasets_use_prefetch=True --per_gpu_thread_count=2 --loss_type_to_report=base_loss --compute_lr_on_cpu=True --single_l2_loss_op=True --xla_compile=True --local_parameter_device=gpu --num_gpus=1 --display_every=10
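
When running many of these commands, it helps to pull the final throughput out of each log automatically. The sketch below assumes the script ends its output with a summary line of the form `total images/sec: <number>`; check this against the output of your own version before relying on it. The function name and the sample text are made up for illustration.

```python
import re
from typing import Optional

# Assumed format of the summary line printed at the end of a run.
_TOTAL_RE = re.compile(r"total images/sec:\s*([0-9]+(?:\.[0-9]+)?)")


def parse_total_images_per_sec(log_text: str) -> Optional[float]:
    """Return the last reported total img/sec in log_text, or None if absent."""
    matches = _TOTAL_RE.findall(log_text)
    return float(matches[-1]) if matches else None


# Made-up sample text for illustration, not a captured log.
sample_log = "total images/sec: 4945.81\n"
print(parse_total_images_per_sec(sample_log))
```

Returning None when no summary line is found makes it easy to flag runs that ended in an error, such as the XLA FP32 runs for AlexNet and VGG16 reported above.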

System Specifications:

System: Exxact Valence Workstation
GPU: 4 x NVIDIA GeForce RTX 2080 Ti
CPU: Intel Core i7-7820X 3.6 GHz
RAM: 32 GB DDR4
SSD: 480 GB SSD
HDD (data): 10 TB HDD
OS: Ubuntu 18.04
NVIDIA Driver: 418.43
CUDA Version: 10.1
Python: 2.7, 3.7
TensorFlow: 1.14
Docker Image: tensorflow/tensorflow:nightly-gpu


Training Parameters (non XLA)

Dataset: Imagenet (synthetic)
Mode: training
SingleSess: False
Batch Size: Varied
Num Batches: 100
Num Epochs: 0.08
Devices: ['/gpu:0']... (varied)
NUMA bind: False
Data format: NCHW
Optimizer: sgd
Variables: parameter_server


Training Parameters (XLA)

Dataset: Imagenet (synthetic)
Mode: training
SingleSess: False
Batch Size: Varied
Num Batches: 100
Num Epochs: 0.08
Devices: ['/gpu:0']... (varied)
NUMA bind: False
Data format: NCHW
Optimizer: momentum
Variables: replicated
AllReduce: nccl


