Deep Learning

PyTorch 1.11.0 Now Available

March 10, 2022

PyTorch is a widely used, open source deep learning platform for writing neural network layers in Python, enabling a seamless workflow from research to production. Based on Torch, PyTorch has become a powerful machine learning framework favored by researchers around the world and fully adopted by Facebook.

The newest stable release of PyTorch, version 1.11.0, has a number of new highlights including TorchData, functorch, Distributed Data Parallel (DDP) static graph optimizations, and more!

PyTorch 1.11.0 Release Notes

  • Highlights
  • Backwards Incompatible Changes
  • Deprecations
  • New Features
  • Improvements
  • Performance
  • Documentation

Highlights

The new PyTorch 1.11.0 release is composed of over 3,300 commits since 1.10, made by 434 contributors. Along with 1.11, the PyTorch team released beta versions of TorchData and functorch. Here's a quick summary:

  • TorchData is a new library for common modular data loading primitives for easily constructing flexible and performant data pipelines. View it on GitHub.
  • functorch, a library that adds composable function transforms to PyTorch, is now available in beta. View it on GitHub.
  • Distributed Data Parallel (DDP) static graph optimizations are now available in stable (see the sketch below).

You can check out the official PyTorch blog post that walks through the new features in more detail.
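
To make the DDP static graph highlight concrete, here is a minimal sketch (not taken from the release notes) that assumes the static_graph constructor flag and spins up a throwaway single-process gloo group just so it can run standalone:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process "gloo" group, used here only so the example runs standalone.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = torch.nn.Linear(10, 10)
# static_graph=True tells DDP that the set of used parameters and the graph
# structure do not change across iterations, enabling additional optimizations.
ddp_model = DDP(model, static_graph=True)

out = ddp_model(torch.randn(2, 10))
out.sum().backward()
dist.destroy_process_group()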

Backwards Incompatible changes

Python API

Fixed python deepcopy to correctly copy all attributes on Tensor objects (#65584)

This change ensures that the deepcopy operation on Tensor properly copies all the attributes (and not just the plain Tensor properties).

1.10.2:
a = torch.rand(2)
a.foo = 3
torch.save(a, "bar")
b = torch.load("bar")
print(b.foo)
# Raises AttributeError: "Tensor" object has no attribute "foo"

1.11.0:
a = torch.rand(2)
a.foo = 3
torch.save(a, "bar")
b = torch.load("bar")
print(b.foo)
# 3

steps argument is no longer optional in torch.linspace and torch.logspace

This argument used to default to 100 in PyTorch 1.10.2, but was deprecated (previously you would see a deprecation warning if you didn’t explicitly pass in steps). In PyTorch 1.11, it is no longer optional.

1.10.2:
# Works, but raises a deprecation warning
# steps defaults to 100
a = torch.linspace(1, 10)
# UserWarning: Not providing a value for linspace's steps is deprecated
# and will throw a runtime error in a future release.
# This warning will appear only once per process.
# (Triggered internally at ../aten/src/ATen/native/RangeFactories.cpp:19)

1.11.0:
# In 1.11, you must specify steps
a = torch.linspace(1, 10, steps=100)

Remove torch.hub.import_module function that was mistakenly public (#67990)

This function is not intended for public use. If you have existing code that relies on it, you can find an equivalent function at torch.hub._import_module.

C++ API

We’ve cleaned up many of the headers in the C++ frontend to only include the subset of aten operators that they actually use (#68247, #68687, #68688, #68714, #68689, #68690, #68697, #68691, #68692, #68693, #69840)

When you #include a header from the C++ frontend, you can no longer assume that every aten operator is transitively included. You can work around this by directly adding #include <ATen/ATen.h> in your file, which will maintain the old behavior of including every aten operator.

Custom implementation for c10::List and c10::Dict move constructors have been removed (#69370)

The semantics have changed from "make the moved-from List/Dict empty" to "keep the moved-from List/Dict unchanged".

1.10.2:
c10::List<std::string> list1({"3", "4"});
c10::List<std::string> list2(std::move(list1));
std::cout << list1.size(); // 0

1.11.0:
c10::List<std::string> list1({"3", "4"});
c10::List<std::string> list2(std::move(list1)); // calls the copy constructor
std::cout << list1.size(); // 2

CUDA

Removed THCeilDiv function and corresponding THC/THCDeviceUtils.cuh header (#65472)

As part of cleaning up TH from the codebase, the THCeilDiv function has been removed. Instead, please use at::ceil_div, and include the corresponding ATen/ceil_div.h header

Removed THCudaCheck (#66391)

You can replace it with C10_CUDA_CHECK, which has been available since at least PyTorch 1.4, so a direct replacement is sufficient even if you support older versions of PyTorch.

Removed THCudaMalloc(), THCudaFree(), THCThrustAllocator.cuh (#65492)

If your extension is using THCThrustAllocator.cuh, please replace it with ATen/cuda/ThrustAllocator.h and corresponding APIs (see examples in this PR).

This PR also removes THCudaMalloc/THCudaFree calls. Please use c10::cuda::CUDACachingAllocator::raw_alloc(size)/raw_delete(ptr), or, preferably, switch to c10::cuda::CUDACachingAllocator::allocate, which manages deallocation. Caching allocator APIs have been available since PyTorch 1.2, so a direct replacement is sufficient even if you support older versions of PyTorch.

Build

Stopped building shared library for AOT Compiler, libaot_compiler.so (#66227)

Building aot_compiler.cpp as a separate library is not necessary, as it’s already included in libtorch.so.
You can update your build system to only dynamically link libtorch.so.

Mobile

Make typing.Union type unsupported for mobile builds (#65556)

typing.Union support was added for TorchScript in 1.10. It was removed specifically for mobile due to its lack of use and the increase in binary size it caused for PyTorch Mobile builds.

Distributed

torch.distributed.rpc: Final Removal of ProcessGroup RPC backend (#67363)

The ProcessGroup RPC backend has been deprecated: in 1.10 it threw an error to help users update their code, and in 1.11 it has been removed completely.

The backend type “PROCESS_GROUP” is now deprecated, e.g.
torch.distributed.rpc.init_rpc("worker0", backend="PROCESS_GROUP", rank=0, world_size=1)
and should be replaced with:
torch.distributed.rpc.init_rpc("worker0", backend="TENSORPIPE", rank=0, world_size=1)

Quantization

Disabled the support for getitem in FX Graph Mode Quantization (#66647)

getitem used to be quantized in FX Graph Mode Quantization, and it is no longer quantized. This won’t break any models but could result in a slight difference in numerics.

1.10.2:
from torch.ao.quantization.quantize_fx import convert_fx, prepare_fx
class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(5, 5)
    def forward(self, x):
        x = self.linear(x)
        y = torch.stack([x], 0)
        return y[0]
m = M().eval()
m = prepare_fx(m, {"": torch.ao.quantization.default_qconfig})
m = convert_fx(m)
print(m)
# prints
# GraphModule(
#   (linear): QuantizedLinear(in_features=5, out_features=5,
#      scale=1.0, zero_point=0, qscheme=torch.per_tensor_affine)
# )
# def forward(self, x):
#     linear_input_scale_0 = self.linear_input_scale_0
#     linear_input_zero_point_0 = self.linear_input_zero_point_0
#     quantize_per_tensor = torch.quantize_per_tensor(x,
#         linear_input_scale_0, linear_input_zero_point_0, torch.quint8)
#     x = linear_input_scale_0 = linear_input_zero_point_0 = None
#     linear = self.linear(quantize_per_tensor)
#     quantize_per_tensor = None
#     stack = torch.stack([linear], 0);  linear = None
#     getitem = stack[0]; stack = None
#     dequantize_2 = getitem.dequantize();  getitem = None
#     return dequantize_2

1.11.0:
from torch.ao.quantization.quantize_fx import convert_fx, prepare_fx
class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(5, 5)
    def forward(self, x):
        x = self.linear(x)
        y = torch.stack([x], 0)
        return y[0]
m = M().eval()
m = prepare_fx(m, {"": torch.ao.quantization.default_qconfig})
m = convert_fx(m)
print(m)
# prints
# GraphModule(
#   (linear): QuantizedLinear(in_features=5, out_features=5, scale=1.0,
#                     zero_point=0, qscheme=torch.per_tensor_affine)
# )
# def forward(self, x):
#     linear_input_scale_0 = self.linear_input_scale_0
#     linear_input_zero_point_0 = self.linear_input_zero_point_0
#     quantize_per_tensor = torch.quantize_per_tensor(x, linear_input_scale_0,
#                      linear_input_zero_point_0, torch.quint8)
#     x = linear_input_scale_0 = linear_input_zero_point_0 = None
#     linear = self.linear(quantize_per_tensor);  quantize_per_tensor = None
#     stack = torch.stack([linear], 0);  linear = None
#     dequantize_2 = stack.dequantize();  stack = None
#     getitem = dequantize_2[0];  dequantize_2 = None
#     return getitem
      

Users should now use fuse_modules for PTQ fusion and fuse_modules_qat for QAT fusion (#69878, #71956)

There are two types of fusion supported by the fuse_modules API: PTQ and QAT fusion. Previously we relied on module.training to decide which mode the user wanted, but this was a misuse of the training attribute, which is not intended for that purpose. This PR removes the dependency on module.training and uses separate APIs to make the fusion requested by the user explicit.

Previously, fuse_modules supported both cases and distinguished PTQ from QAT fusion based on module.training; now fuse_modules only performs PTQ fusion. If you want QAT fusion, you must call fuse_modules_qat instead of fuse_modules; otherwise you will silently get unwanted fusion results (PTQ fusion) or, if the model is in training mode, an error.

Note: Currently it is still enforced that if the model is in eval mode, only PTQ fusion can be used; if the model is in training mode, then only QAT fusion can be used. In the future this constraint will be relaxed.

1.10.2:
import torch
from torch.ao.quantization import fuse_modules
class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(3, 3, 3)
        self.bn = torch.nn.BatchNorm2d(3)
    def forward(self, x):
        return self.bn(self.conv(x))
m = M().train()
m = fuse_modules(m, ["conv", "bn"])
print(type(m.conv))
m = M().eval()
m = fuse_modules(m, ["conv", "bn"])
print(type(m.conv))
<class 'torch.nn.intrinsic.modules.fused.ConvBn2d'>
<class 'torch.nn.modules.conv.Conv2d'>

1.11.0:
import torch
from torch.ao.quantization import fuse_modules, fuse_modules_qat
class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(3, 3, 3)
        self.bn = torch.nn.BatchNorm2d(3)
    def forward(self, x):
        return self.bn(self.conv(x))
m = M().train()
# For Quantization Aware Training, use fuse_modules_qat()
m = fuse_modules_qat(m, ["conv", "bn"])
print(type(m.conv))
m = M().eval()
m = fuse_modules(m, ["conv", "bn"])
print(type(m.conv))
# Result (doesn't change):
<class 'torch.nn.intrinsic.modules.fused.ConvBn2d'>
<class 'torch.nn.modules.conv.Conv2d'>
      

ONNX

Removed f arg from onnx.export_to_pretty_string (#69546)

The arg has always been ignored. Simply remove it from your code.

1.10.2:
torch.onnx.export_to_pretty_string(model, inputs, "file_name")

1.11.0:
torch.onnx.export_to_pretty_string(model, inputs)

Removed use_external_data_format arg from onnx.export (#67809)

The arg has been deprecated and ignored since 1.10. The external data format is now used automatically if and only if the exported file would exceed protocol buffer’s file size limit. Simply remove it from your code.

1.10.2:
torch.onnx.export(model, inputs, f_name, use_external_data_format=True)

1.11.0:
torch.onnx.export(model, inputs, f_name)

Removed example_outputs arg from torch.onnx.export (#67809)

The arg has been deprecated and ignored since 1.10. The provided model is instead executed once to produce example outputs. Simply remove it from your code.

1.10.2:
torch.onnx.export(model, inputs, f_name, example_outputs=(foo,))

1.11.0:
torch.onnx.export(model, inputs, f_name)

Removed enable_onnx_checker arg from onnx.export (#67276)

The arg has been deprecated and ignored since 1.10. The ONNX checker is always enabled. If it fails, onnx.CheckerError will be raised. Users can catch and ignore that exception.

1.10.2:
torch.onnx.export(model, inputs, f_name, enable_onnx_checker=False)

1.11.0:
try:
    torch.onnx.export(model, inputs, f_name)
except torch.onnx.CheckerError:
    pass  # ignore error

Moved and renamed onnx.utils.ONNXCheckerError to onnx.CheckerError (#66644)

Previously the documentation was incorrect and stated ONNXCheckerError was in the onnx module, so this moves the class to the originally intended module and brings the code in line with the documentation. The new name is shorter and less redundant with the module name.

1.10.2:
except torch.onnx.utils.ONNXCheckerError:

1.11.0:
except torch.onnx.CheckerError:

Removed _retain_param_name arg from onnx.export (#67276)

The arg has been deprecated and ignored since 1.10. Param names are now always retained. Simply remove it from your code. If you want to remove param names, you can do so by editing the exported ONNX model.

1.10.2:
# NOTE: No way to get same behavior as _retain_param_name=False.
torch.onnx.export(model, inputs, f_name, _retain_param_name=True)

1.11.0:
torch.onnx.export(model, inputs, f_name)

Deprecations

Python API

Deprecated x.T on tensors of dimension other than 0 or 2 (#64180)

x.T is intended only for tensors with 0 or 2 dimensions. Calling x.T on tensors with a different number of dimensions has been deprecated and will raise an error in a future release.

1.10.2:
a = torch.ones(2, 3, 4)
a.T.size()
# torch.Size([4, 3, 2])

1.11.0:
a = torch.ones(2, 3, 4)
a.T.size()
# UserWarning: The use of `x.T` on tensors of dimension other than 2
# to reverse their shape is deprecated and it will throw an error in a future release.
# Consider `x.mT` to transpose batches of matrices or `x.permute(*torch.arange(x.ndim - 1, -1, -1))`
# to reverse the dimensions of a tensor. (Triggered internally at
# aten/src/ATen/native/TensorShape.cpp:2386.)
# torch.Size([4, 3, 2])

Quantization

torch.ao.quantization.QConfigDynamic is deprecated and will be removed in the next release; please use torch.ao.quantization.QConfig instead (#69875, #69864)

1.10.2:
qconfig = torch.ao.quantization.QConfigDynamic(...)

1.11.0:
qconfig = torch.ao.quantization.QConfig(...)

New features

Python API

  • Added set_deterministic_debug_mode and get_deterministic_debug_mode (#67778, #66233); see the sketch after this list
  • Added n-dimensional Hermitian FFT: torch.fft.ifftn and torch.fft.hfftn (#63890)
  • Added Wishart distribution to torch.distributions (#70377)
  • Preliminary support for the Python Array API standard has been added to the torch and torch.linalg modules. PyTorch implements over 90% of the operators defined by the Python Array API, including the torch.from_dlpack operation for improved DLPack support (#60627)
  • Moved torch.testing from prototype to beta (#69668)
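
Here is a quick, unofficial sketch of two of the items above (the deterministic debug mode helpers and the n-dimensional Hermitian FFT); the shapes and settings are arbitrary:

import torch

# Warn (rather than error) when an op without a deterministic implementation runs.
torch.set_deterministic_debug_mode("warn")
print(torch.get_deterministic_debug_mode())  # 1

# n-dimensional FFT of a signal with Hermitian symmetry in the time domain.
x = torch.randn(4, 8, dtype=torch.complex64)
out = torch.fft.hfftn(x)
print(out.shape)  # torch.Size([4, 14])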

Autograd

  • Added new torch.utils.checkpoint implementation that does not use reentrant autograd (can be toggled with the new use_reentrant flag) (#69508); see the sketch after this list
  • Added batched_grad parameter to autograd.grad to allow batched gradient computation (#65564)
  • Forward mode AD:
  • Linear algebra operation support:
    • Added forward AD support for torch.linalg.{eig, inverse, householder_product, qr} and torch.*_solve (#65546, #67043, #67268, #67837)
    • Added forward and backward AD support for torch.linalg.lstsq (#65054)
    • Added support for a wider range of inputs for linalg.pinv (#66092)
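
Here is a minimal sketch of the non-reentrant checkpoint variant mentioned above, assuming the use_reentrant keyword described in the notes; the module and shapes are arbitrary:

import torch
from torch.utils.checkpoint import checkpoint

layer = torch.nn.Linear(16, 16)
x = torch.randn(2, 16, requires_grad=True)

# The new implementation is selected with use_reentrant=False;
# passing True (the default) keeps the original reentrant behavior.
y = checkpoint(layer, x, use_reentrant=False)
y.sum().backward()
print(x.grad.shape)  # torch.Size([2, 16])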

Build

  • Added FlexiBLAS build support (#64815)
  • Added IS_LINUX and IS_MACOS global vars for cpp extensions building (#69093)
  • Added ARC for iOS CMake builds (#67884)
  • Added support for IBM z14/15 SIMD (#66407)

Complex Numbers

  • Added complex number support to Adagrad and Adadelta optimizers (#66671, #66587)

Dataloader

  • The TorchData library will provide modular data loading primitives for easily constructing flexible and performant data pipelines. A beta release will follow the release of PyTorch Core (https://github.com/pytorch/data)

LinAlg

  • Added an experimental flag that allows specifying a preferred linear algebra library (see the docs here) (#67980)
  • Added the linalg.matrix_exp operation (see the docs here) (#62715)
  • Added the linalg.cross operation (see the docs here) (#63285)
  • Added the linalg.diagonal operation, an alias for torch.diagonal (see the docs here) (#70599)
  • Added the linalg.lu_factor operation (see the docs here) (#66933)
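
A short, illustrative sketch of a few of the new linalg entry points above (values are arbitrary):

import torch

A = torch.randn(3, 3)

# Matrix exponential of a square matrix.
expA = torch.linalg.matrix_exp(A)

# Cross product along the last dimension of two batches of 3-vectors.
a, b = torch.randn(4, 3), torch.randn(4, 3)
c = torch.linalg.cross(a, b)

# LU factorization returning the packed factors and pivots.
LU, pivots = torch.linalg.lu_factor(A)

# linalg.diagonal is an alias for torch.diagonal with linalg-style defaults.
d = torch.linalg.diagonal(A)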

torch.nn

  • Added torch.nn.utils.rnn.{unpack_sequence,unpad_sequence} functions (#66550)
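
A minimal sketch of the new sequence helpers, assuming the unpack_sequence(packed) and unpad_sequence(padded, lengths) signatures:

import torch
from torch.nn.utils.rnn import pack_sequence, pad_sequence, unpack_sequence, unpad_sequence

seqs = [torch.tensor([1, 2, 3]), torch.tensor([4, 5])]

# Round-trip through a PackedSequence and back to a list of tensors.
packed = pack_sequence(seqs, enforce_sorted=False)
print(unpack_sequence(packed))

# Round-trip through a padded batch and back, using the original lengths.
padded = pad_sequence(seqs, batch_first=True)
lengths = torch.tensor([3, 2])
print(unpad_sequence(padded, lengths, batch_first=True))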

Sparse

  • Added torch.sparse.sampled_addmm for CSR Tensors on GPU (#68007)
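
A hedged sketch of torch.sparse.sampled_addmm; since this release implements it for CSR tensors on GPU only, the example guards on CUDA availability:

import torch

if torch.cuda.is_available():
    # A 3x3 CSR identity matrix acting as the sparsity "sampling" pattern.
    crow = torch.tensor([0, 1, 2, 3])
    col = torch.tensor([0, 1, 2])
    vals = torch.ones(3)
    input_csr = torch.sparse_csr_tensor(crow, col, vals, size=(3, 3), device="cuda")

    mat1 = torch.randn(3, 4, device="cuda")
    mat2 = torch.randn(4, 3, device="cuda")
    # out = beta * input + alpha * (mat1 @ mat2), restricted to input's sparsity pattern.
    out = torch.sparse.sampled_addmm(input_csr, mat1, mat2)
    print(out)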

CUDA

  • The Jiterator - enables compiling rarely used CUDA kernels at runtime (#69439)
    • Low precision supported for jiterator (#70157) - enables runtime-compilation of ops on low precision tensors (half and bfloat16)
    • Enable cpu scalar arguments for jiterator (#69861) - enables passing cpu scalars as an argument to the jit-compiled kernels at runtime
    • The Cacherator (#71350) - caches the jit-compiled kernels on disk, so that they can be reused between different processes
    • Added complex support for Jiterator, port sinc to Jiterator (#71577)
    • Jiterates lcm, i0e, i1e, ndtri, erfcx, digamma, trigamma, lgamma (#70663)
    • Jiterates exp2, erfc, erfinv and entr (#71295)
    • Fixes jiterator cache macro include + updates CUDA note with cache variables (#71452)
    • Jiterates polygamma (#71162)
  • Added cuSPARSE descriptors and updated CSR addmm (#60838)
  • Sparse CSR CUDA: added addmv_out (#61407)
  • Added nvidia-smi memory and utilization as native Python API (#69104)
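
For the last item above, a rough sketch assuming the torch.cuda.utilization() and torch.cuda.memory_usage() helpers introduced by that change (they rely on the pynvml package being installed):

import torch

if torch.cuda.is_available():
    # Percent of time the GPU was busy / device memory was being read or written
    # over the last sample period, as reported by NVML (requires pynvml).
    print(torch.cuda.utilization())
    print(torch.cuda.memory_usage())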

Vulkan

  • Added Vulkan support for several torch operators:
  • Added the vulkan_perf_test benchmark binary to benchmark Vulkan ops under various input conditions. (#67230)

Mobile

  • Tracing Based Selective Build (PyTorch Mobile Build Size Reduction) is a new feature that reduces a mobile model’s binary size by only including the operators that the model uses.
    • Build tracer for tracing based workflow (#66267)
    • Used operator.yaml to build LibTorch library (#66237)
    • Unified tracer between internal and external (#64152)
    • Reorganized model tracer dependency (#63421)
    • Added support for the bool and int dtypes in the copy kernel by default when using Tracing Based Selective Build (#69106, #69297)
    • Generic build features for selective build (#67817)
    • Made more classes selective (#67397)
    • Added custom classes to selective build and compatibility APIs (#67004, #66972, #67340)

Distributed

TorchScript

  • Enabled running torch.jit.freeze() and torch.jit.optimize_for_inference on functions that are not forward (#68668, #69367)
  • Enabled torch.jit.freeze to work on for sparse COO tensors (#69614)
  • Enabled torch.jit.script(), torch.jit.freeze() and serialization for tensors in Compressed Sparse Row (CSR) format (#69555)
  • Allowed users to set the fusion strategy for torch.jit.fuser through the now public torch.jit.set_fusion_strategy. (#72937)
  • Enabled Dynamic Shape Fusion For GPU & CPU, configurable via torch.jit.set_fusion_strategy (#72036)
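
A small sketch of the now-public fusion strategy control; the (kind, depth) pairs below are illustrative values rather than recommended settings:

import torch

# Each entry is (fusion kind, number of specializations to attempt):
# "STATIC" fuses for fixed input shapes, "DYNAMIC" generates shape-agnostic fusions.
torch.jit.set_fusion_strategy([("STATIC", 2), ("DYNAMIC", 2)])

@torch.jit.script
def f(x, y):
    return x * y + y

print(f(torch.randn(4), torch.randn(4)))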

Quantization

  • Added bilinear quantized implementation of torch.nn.functional.grid_sample 2d operator (#66879)
  • Added the torch.quantize_per_tensor_dynamic operator (#68004)
  • Added Quantization Aware Training support for torch.nn.Embedding and torch.nn.EmbeddingBag
    • Added basic EmbeddingBag QAT fakeQuant workflow (#65443)
    • Added support for quantization of Embedding{Bag} in dynamic quant APIs (#65674)
    • Eager mode QAT for Embeddings (#66429)
    • Add benchmarks for QAT Embedding+EmbeddingBag (#66560)
    • Supported Embedding QAT via FX API (#69333)
    • Add FX support for QAT EmbeddingBag (#69334)
  • Added support for depthwise quantized torch.nn.Conv3d in qnnpack, for use in quantization
    • Depthwise Conv3d Indirection Buffer Setup (#69311)
    • Depthwise Conv3d Weight Packing (#69312)
    • Depthwise Conv3d mp8x27 (per channel) Neon Kernel (#69313)
    • Depthwise Conv3d mp8x27 (per-channel) Sse2 Kernel (#69314)
    • Tightened Step Height for Indirection Buffers (#70530)
    • Enabled Depthwise Specific Conv3d Kernel for Kernel Size 3x3x3 (#69315)
    • Implemented 3d convolution in qnnpack (#66350)

ONNX

  • Supports opset version 15 (#67805)
  • Supports exporting nn.Module calls as ONNX local functions (#66140, #67803)
  • Supports exporting new ops
  • Added BFloat16 type support (#66788)
  • Supports exporting with Apex O2 (#66700)

Infra (Releng)

  • Added support for ROCm 4.3.1 (#65624)
  • Added support for ROCm 4.5.2 (#71064)
  • Added support for CUDA 11.5 (#69262)
  • Added support for CUDA enabled Bazel builds (#66241)
  • Added support for Python 3.10 (#71132, #71419)

Improvements

Python API

  • NumPy compatibility:
    • Improved torch.searchsorted to be more consistent with NumPy (#66818)
    • Added torch.argwhere to match NumPy (#64257)
    • Added an alias for torch.special.softmax (#62251)
  • Improved torch.Tensor.view(dtype): enable all dtype combinations (#66493)
  • Improved torch.diff by adding support for n greater than 1 (#67260)
  • Improved torch.movedim to handle scalar as no-op (#69537)
  • Improved cartesian_prod: fixed a warning in the docs example (#68753)
  • Improved error messages for max_unpool{}d operators (#67328)
  • torch.distributions
    • Implemented positive-semidefinite constraint in torch.distributions (#71375)
    • Implemented Entropy methods for Binomial and Multinomial distributions (#67609)
    • Implemented support for non-negative constraint in exponential distribution (allowing it to include zero). (#67184)
    • Implemented KL divergence between the Normal and Laplace distributions. (#68807)
  • Improved meta tensor support for operators:
  • Added support for torch.Tensor.real for real-valued tensors (#71718)
  • torch.logaddexp, torch.logaddexp2, torch.remainder: added BFloat16 support on CPU (#63621)
  • torch.bucketize and searchsorted: added Half precision support (#67077)
  • Added new torch.slice_scatter, torch.select_scatter, and torch.diagonal_scatter ops (#64430); see the sketch after this list
  • Made torch.scatter_reduce a public API (#68580, #73125)
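
A short sketch of the new *_scatter ops referenced above; they return a copy of the input with the given slice, row, or diagonal replaced by src instead of mutating in place:

import torch

x = torch.zeros(3, 4)

# Replace rows 0..1 (a slice along dim 0) with src, out-of-place.
src = torch.ones(2, 4)
a = torch.slice_scatter(x, src, dim=0, start=0, end=2)

# Replace a single index along a dimension.
b = torch.select_scatter(x, torch.full((4,), 7.0), dim=0, index=1)

# Replace the main diagonal.
c = torch.diagonal_scatter(x, torch.arange(3, dtype=torch.float))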

C++ API

  • Added C++ API and docs for hfftn (#66127)
  • Added support for MaybeOwned<IValue> (#68157)
  • Added set_to_none option for zero_grad() to C++ API (#68801)
  • Added an environment variable, TORCH_CPP_LOG_LEVEL, that you can use to toggle the log level in the c10 library (#71746)

Autograd

  • Added nesting support for torch.autograd.graph.saved_tensor_hooks (#70932)
  • Delayed all warnings encountered during the backward pass until the end of backward execution (#66235)
  • Added complex autograd support to torch.{col2im,im2col} (#68199)
  • Added new reduce options and autograd support for torch.scatter_reduce (#71788)
  • Added derivatives wrt the second argument for torch.{remainder,fmod} (#69908)
  • Added new strategy flag to autograd.functional.{Jacobian, Hessian} to enable vectorized computation (#67041, #66292); see the sketch after this list
  • Added check_backward_ad flag to torch.autograd.gradcheck to be able to skip backward mode AD checks (#65040)
  • Relaxed forward AD layout check to allow primal and tangent stride to differ when their size is 1 (#66294)
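
A hedged sketch of the new strategy flag on autograd.functional.jacobian mentioned above; forward-mode is typically the better choice when a function has more outputs than inputs:

import torch
from torch.autograd.functional import jacobian

def f(x):
    return (x ** 2).sum(dim=0)

x = torch.randn(3, 5)

# Compute the Jacobian with forward-mode AD (requires the vectorized path).
J = jacobian(f, x, strategy="forward-mode", vectorize=True)
print(J.shape)  # torch.Size([5, 3, 5])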

Build

  • Improved incremental build times of PyTorch core by removing a dependency on native_functions.yaml in many core files (#64499, #66914, #64172, #64171, #66620, #66793, #66913, #66794, #64169, #64173, #64170, #67735)
  • Enabled bazel build without glog and gflags (#70850)
  • Added support for C++ frontend wrapper on Linux (#69094)
  • Added support for dynamic codegen outputs in CMake (#68246)
  • Max CMake version is now used by default with setup.py (#69355)
  • Upgraded oneDNN to v2.3.3 and package oneDNN Graph API together (#63748)
  • Code base should now be -Wno-unused-variable compliant (#66041)
  • Added lazy import for packaging in torch_version (#71345)

Dataloader

  • Support custom Sequence and Mapping for utils.data.default_collate (#68779)
  • Allowed specifying num_samples to RandomSampler when replacement is False (#71568)
  • Fixed the shape-inconsistency warning in utils.data.default_collate (#71065)

ForEach

  • Implemented ForEach L1 & L2 norm (#62646)

LinAlg

  • The linalg.matrix_rank (docs) and linalg.pinv (docs) operations now support specifying absolute and relative tolerances for better handling of singular values (#63102)
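
A short sketch of the new tolerance arguments; atol and rtol control which singular values are treated as zero:

import torch

# Rank-deficient matrix: the second row is a multiple of the first.
A = torch.tensor([[1.0, 2.0], [2.0, 4.0]])

print(torch.linalg.matrix_rank(A, atol=1e-6, rtol=1e-6))  # tensor(1)
pinv = torch.linalg.pinv(A, atol=1e-6, rtol=1e-6)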

torch.nn

  • Added channels_last support for ChannelShuffle (#50247)
  • Added no-batch-dim support for nn.{AdaptiveLogSoftmaxWithLoss, Bilinear, Conv*d, ConvTranspose*d, CrossEntropyLoss, CTCLoss, Fold, FractionalMaxPool3d, GaussianNLLLoss, GRU, GRUCell, InstanceNorm*d, LSTM, LSTMCell, MarginRankingLoss, MultiheadAttention, MultiLabelSoftMarginLoss, RNN, RNNCell, Transformer, TransformerDecoderLayer, TransformerEncoderLayer} (#69054, #69539, #70506, #71055, #70092, #64909, #69732, #69783, #70236, #65323, #71056, #64975, #67176, #70590, #65690, #70977, #70597, #70322, #69291)
  • Added BFloat16 support on CPU to nn.{AdaptiveAvgPool2d, AdaptiveMaxPool2d, AvgPool2d, MaxPool2d} (#56902, #66929, #66927, #56903)
  • Added maximize support to optim.{Adam, AdamW, SGD} (#68164, #70146, #67847, #68733, #71023)
  • F.interpolate: Added nearest-exact mode to fix an off-by-one error in nearest mode (#64501); see the sketch after this list
  • F.interpolate: Added support for anti-aliasing to bilinear and bicubic algorithms (#70930, #68819, #65142, #69318)
  • F.interpolate: Improved error message for invalid shapes (#66417)
  • nn.Conv*d: Accepts 0-sized channel inputs (#66256)
  • nn.LogSigmoid: Used log1p for improved precision (#66441)
  • nn.Module: Added flag for removing duplicates from parameters (#71542)
  • nn.Module: Added register_module alias for registering a sub-module (#65174)
  • nn.ModuleList: Supported concatenation (#70887)
  • nn.MultiheadAttention: Added flag to optionally average output attention weights across heads (#70055)
  • nn.ParameterDict: Supported full set of dict methods (#69403)
  • nn.{RNN, GRU}: Allowed hidden_size to be 0 (#70556)
  • nn.Sequential: Added append method (#71326)
  • nn.Upsample: Exposed recompute_scale_factor (#66419)
  • nn.ZeroPad2d: Added extra_repr for printing purposes (#69206)
  • optim.{ChainedScheduler, SequentialLR}: Added optimizer attribute (#67406, #69817)
  • optim.swa_utils.AveragedModel: Added use_buffers flag for averaging buffers in addition to parameters (#65921, #71763)
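
To make a couple of the items above concrete (the new interpolate modes and the optimizer maximize flag), here is a minimal sketch with arbitrary shapes and hyperparameters:

import torch
import torch.nn.functional as F

x = torch.rand(1, 3, 8, 8)

# "nearest-exact" matches the pixel-center convention used by common image libraries.
up = F.interpolate(x, scale_factor=2, mode="nearest-exact")

# antialias is honored for bilinear/bicubic resampling (mainly useful when downsampling).
down = F.interpolate(x, size=(4, 4), mode="bilinear", align_corners=False, antialias=True)

# maximize=True makes the optimizer ascend the objective instead of descending.
p = torch.nn.Parameter(torch.zeros(3))
opt = torch.optim.SGD([p], lr=0.1, maximize=True)
(p * torch.tensor([1.0, 2.0, 3.0])).sum().backward()
opt.step()  # p moves in the +gradient direction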

torch.fx

  • Improved the customizability of fx.Graph’s code generation function, including support for setting a breakpoint in the generated code (#67139)
  • Supported printing inplace operators in FX (#71887)

Sparse

  • Add CSR support for several operators:
  • Added torch.sparse_coo Layout support to zeros_like (#68108)
  • Added Half, BFloat16, and Complex dtype support for matrix multiplication of two COO Tensors on GPU (#59980)
  • Added support for conversion of CSR to COO Tensor to to_sparse (#66774)
  • Added support for empty COO Tensors to sparse.sum (#71091)

AMD

  • Added sparse mappings for CUDA->HIP translation (#67323)
  • Enabled frexp support for ROCm builds (#67226)
  • Used hipCUB/rocPRIM scan algorithms for large index support (#68487)

CUDA

  • Allows external CUDA streams to be set as current (#66324)
  • Added an option to disable reduced precision reductions for FP16 GEMM (#67946)
  • Improved CUDA memory usage of nanmedian result (#68591)
  • Reduced number of igamma kernel instantiations (#70666)
  • Reduced number of compare kernels by unifying them (#69111)
  • Reduced number of bernoulli tensor tensor kernel instantiations (#70169)
  • Used cub::FutureValue to simplify 64bit indexing split of cub scan (#66711)
  • Added hascuSOLVER flag to Context (#69825)
  • Improved error message from CUDACachingAllocator (#69174)
  • Fixed masked_softmax perf when element_size is not 8 (#70271)
  • Reduced binary size of TensorCompare.cu (#68835)
  • Improved error message for interpolation (#72066)
  • Doesn't compile pow kernels for non-existent case (#70017)

Profiler

  • Added flop count formulas for bmm and baddbmm (#66636)

Vulkan

  • Allowed Vulkan models to return multiple outputs by improving Vulkan tensor lifecycle management to release GPU resources when the tensor is destroyed, instead of being released at the end of every inference (#66477, #66478)
  • Enabled multiple Vulkan models to execute concurrently in parallel threads, by moving components of the Vulkan global context into thread local objects (#67733, #69576)

Mobile

  • Introduced multiple improvements for NNAPI
    • Added converters for torchscript ops quantized::mul and quantized::convtranspose2d to converter (torch.backends._nnapi.prepare.convert_model_to_nnapi) (#63913, #63914)
    • Supported int32 and qint16 type in Torchscript expressions (#70197, #70621)
    • Supported runtime flexible shapes and return shapes (#70334)
  • Improved Model Tracer Coverage and Selective Metal Ops (#68134, #69492, #69328)
  • Introduced multiple improvements for CoreML
    • Fixed error messages (#67410)
    • Assigned computationUnit to executor (#67411)
    • Cleaned up shape information from TensorSpec (#67412)
  • Type Support in Mobile Lite Interpreter
    • Extended type_parser to handle NamedTuple type (#63130, #62612)

Distributed

  • torch.distributed
    • Improvements to error handling in TCPStore’s socket implementation (#68225)
    • Enabled ncclAvg for reductions (#62835)
    • Init dummy NCCL comms in constructor (#65173, #66393)
    • Added pybind trampoline for ProcessGroup and Work (#66338)
    • Setup c10d extension Backend class attr the same way as builtin ones (#66991)
    • Added barrier to ProcessGroup trampoline (#67236)
    • Raised warning when calling collectives on non-member group objects (#67639)
    • Patched bfloat16 support for NCCL (#67843)
    • Fixed c10d TCP store race condition with mutex (#68499)
    • Surfaced ncclUniqueId store broadcast error (#68597)
    • Checks for file existence before invoking cleanup logic in FileStore destructor (#68603)
    • Implemented gather primitive for ProcessGroupNCCL (#66745)
    • Implemented scatter primitive for ProcessGroupNCCL (#70029)
    • Enabled gather_object on NCCL (#71623)
    • Implemented allreduce_coalesced for ProcessGroupNCCL (#62140)
    • Set non-default backend names to lower case (#69400)
    • Added support for deleteKey for FileStore (#69953)
    • Fixed TSAN issue in TCPStore (#69590)
  • DistributedDataParallel
    • Refactored and removed sync_params (#64514)
    • Used named_params and named_buffers explicitly (#65181)
    • Allow await of custom buffer reduction in backward (#64515)
    • Profiling range for bucket copy (#65769)
    • Logs iteration in debug mode (#65770)
  • torch.distributed.rpc
    • Added a timeout argument to RPC shutdown() (#65425)
    • Released GIL during RPC shutdown. (#69586)
    • Updated RPC shutdown() logic to remove process group usage. (#65946)
    • Removal of Process Group dependency for TensorPipe Agent. (#68128)
  • torch.distributed.autograd
    • Made Kineto + distributed a warning rather than an error (#71120)
  • torch.distributed.elastic
    • Added ability to override sys.executable for torch.distributed.run (#66179)

TorchScript

  • Several improvements to NVFuser, an optimization that speeds up JIT graphs operating on CUDA Tensors on NVIDIA GPUs. This includes extending fusion support to normalization and reduction kernels, enabling multiple kernel launches for a single CudaFusionGroup, and adding a graph segmentation cache to the hierarchical caching system. (#63745, #65137)
  • Enabled profile_ivalue to convert dynamic scalars (e.g. reduction axes) into compile-time constants in NVFuser. (#63745, #65137)
  • Added support in torch.jit.trace for tracing already JITted subgraphs (#59949)
  • We now provide full types on graph inputs when tracing graphs that are already JITted (#67424)
  • torch.jit.freeze can now preserve attributes of submodules; previously, it was only possible to prevent inlining of attributes of the top-level module. (#66102)
  • The peephole optimizer, which is used in torch.jit.freeze, now coalesces consecutive calls to torch.concat into a single call (#67000)
  • Added ability for Torch.JIT C dispatch to convert python None into an undefined Tensor (#67793)
  • torch.jit.script now recognizes union of scalars as a JIT NumberType (#66591)
  • No longer adds a tensor in a returned list to the wildcard alias set in AliasDB, allowing for additional optimizations in JIT optimization passes. (#71170)
  • In torch.jit.optimize_for_inference, there is a new graph pass to precompute transposes for linear layers. (#65631, #68024)
  • In torch.jit.freeze, there is a new pass where we concat together multiple linear layers with the same input Tensor (different weight/bias) (#63198, #68024)
  • Added support for normalizing torch.Tensor.__rsub__ in the normalize_ops JIT pass (#65014)

Quantization

  • Quantized op improvements
    • torch.ao.FakeQuantize now supports fp32/fp16 zero_point. (#65836)
    • torch.ops.quantized.add now supports broadcasting (#66049)
    • torch.Tensor.dequantize now supports fp16 + cuda (#67234)
    • Added quantized CPU support for torch.nn.GELU (#69968)
    • torch.nn.quantized.functional.hardsigmoid supports an inplace flag (#65740)
  • Workflow improvements
    • FX graph mode quantization: enable torch.nn.Linear + torch.nn.BatchNorm1d fusion for PTQ (#66484)
    • Added an option in torch.ao.quantization.quantize_fx.convert_fx to accept qconfig_dict to skip quantization (#66878)
    • Added torch.nn.qat.dynamic.modules.Linear module (#67325)
    • Added torch.nn.ConvTranspose{n}d + torch.nn.BatchNorm{n}d fusion support (#70022)
    • Extended torch.ao.quantization.prepare_qat with allow_list argument, to allow custom mapping and custom QAT module (#65119)
    • Added torch.ao.quantization.default_replay_qconfig which allows observer reuse for torch.reshape in FX graph mode quantization (#69249)

ONNX

  • Set ir_version of the exported model based on opset_version. This increases the odds that the exported ONNX model will be usable. Before this change, we were setting the IR version to a hard-coded value which may be higher than what the model consumer supports. (#67803)
  • Preserved op input names when op just passes through the input to the output (#67275)
  • Shape inference improvements:
    • Updated slice process shape to support rank only inference (#66149)
    • Represent symbolic shape as value (#69545)
  • Included op type in exported models’ input and output names (#68976)
  • Supports Conv-BatchNorm fusion inside blocks (#67272)
  • Exported torch.reciprocal to ONNX Reciprocal operator instead of Div(1, x) (#67271)
  • Supports beta!=1 in softplus (#66146)
  • Added warning for inplace updates on tensor.shape in tracing mode (#66142)
  • Supports instance_norm in training mode (#64375)
  • Allow registration of custom symbolics for ops specifying aten namespace (i.e. aten::foo is allowed as well as “foo”). (#67810)
  • Allow registration of custom symbolics for prim namespace (#66139)
  • Supports dynamic inputs for OneHot, bool for Einsum (#66147)

Infra (Releng)

  • Build with BUILD_SPLIT_CUDA for all 11.X Windows builds (#70899)

torch.package

  • Add ability to retrieve the dependency graph via the all_path function (#65602)
  • Add support for pickle v4 (#70642)
  • Add better testing support for Package Exporter (#70641)

Bug fixes

Python API

  • Fixed scalar inputs for aliased binary ops {multiply, subtract, divide} (#65937)
  • Fixed torch.save when saving storages that view same data with different type (#66949)
  • Fixed torch.save error if storages are unallocated (#68787)
  • Fixed k out-of-bounds in torch.kthvalue (cpu kernel) (#68863)
  • Fixed inference_mode decorator: inference_mode(mode=False) used to ignore the mode argument and always enable inference mode. (#68617)
  • Fixed cdist_backward in the case when cdist inputs are not contiguous (#70016)
  • Fixed cdist error message typo (#70178)
  • Fixed scatter for empty indexes (#70662)
  • Fixed torch.{unique, unique_consecutive} out of bound (#71540)
  • Fixed torch.isin in the case when inputs are non-contiguous on CPU (#70659)
  • Fixed hsplit vsplit dsplit crash when section is 0 (#69342)
  • Fixed: torch.gradient ignores dim argument when checking edge_order (#67926)
  • Fixed: TransformedDistribution.icdf should perform validation after applying the inverse transformation rather than before. (#71393)
  • Fixed torch.all and torch.any internal assert error with requires_grad=True (#65714)
  • Fixed torch.logsumexp type promotion: integral inputs are now promoted to floating point (#63393)

C++ API

  • Fixed libtorch at::Tensor::print() linking error (#69615)
  • Avoided UB when indexing into size-0 tensors (#65878)
  • Fixed an ICE when compiling PyTorch from source on MacOS with clang-1300 (#65655)

Autograd

  • Fixed autocast state propagation in the torch.utils.checkpoint API (#71169)
  • Fixed torch.nn.functional.conv_transpose3d backward when grad_out is non-contiguous (#67829)
  • Forward mode AD:
    • Fixed a case where forward AD in-place-over-view silently copies the view (#67816)
    • Fixed deadlock in forward AD for functions that return multiple outputs (#67995)
    • Fixed forward AD codegen for functions that have multiple formulas (#68535)
    • Fixed deadlock when forward and backward AD are used at the same time (#67360)
    • Fixed Tensor.copy_ forward AD to handle broadcasting (#69592)
    • No longer generates a not_implemented error for forward AD when an input with a tangent is passed to a non-differentiable function (#66926)
  • Fixed autograd.Function when non-Tensor argument precedes tensor argument (#71530)
  • Fixed autograd.Function forward AD when forward is a no-op to no longer raise an internal error (#71531)

Build

  • Stopped reporting CPU Capability as AVX512 on machines with AVX512 support but without AVX512 kernels (#66703)
  • Disabled SVE when cross-compiling for M1 (#67114)
  • Added failure if pocketfft is not found and at_mkl is not enabled (#67909)
  • Fixed clang issues when compiling with _GLIBCXX_USE_CXX11_ABI (#72081)

Complex Numbers

  • Fixed torch.autograd.gradcheck to generate valid inputs for forward AD computation for complex functions (#68001)
  • Fixed torch.Tensor.copy_ transpose path for tensors with conjugate or negative bit set (#69026)
  • Fixed torch.Tensor.copy_ behavior for the case when two conjugated or negated tensors of the same dtype (one or both of which are non-contiguous) are copied into each other (#68963)

Dataloader

  • Made ProcessException picklable (#70118)
  • Fixed persistent worker exiting before pin_memory_thread (#71579)

torch.nn

  • nn.AdaptiveAvgPool*d: Throws an error for negative output_size (#70488)
  • nn.Conv1d: Fixed for 1D convolution on MKL-DNN backend (#68166)
  • nn.CrossEntropyLoss: Fixed for usage of weight, ignore_index, and label_smoothing together (#69511)
  • nn.Fold: Checked that block height and width are positive (#69048)
  • nn.LayerNorm: Fixed incorrect result on CUDA when gamma or bias are missing (#69210)
  • nn.LayerNorm: Avoided overflow by doing computation in float for half (#66920)
  • nn.Module: Throws a proper error message from load_state_dict for non-tensor values (#70596)
  • nn.ModuleList: Fixed incorrect return type in __getitem__ (#69083)
  • nn.MultiheadAttention: Used query dtype for mask type (#68077)
  • nn.NLLLoss: Fixed backward computation with negative weights (#64572)
  • nn.{RNN, GRU}: Fixed RNN modules with input shapes containing 0 in CUDA (#71696)
  • nn.utils.rnn.pad_sequence: Fix regression to support tuples for padding (#72436)
  • optim._LrScheduler: Fixed print formatting (#68338)
  • optim.ChainedScheduler: Fixed get_last_lr() (#69112)
  • optim.CosineAnnealingWarmRestarts: Fixed ordering bug when last_epoch > 0 (#64758)
  • optim.SequentialLR: Updated _last_lr on step (#70558)

torch.fx

  • Supported torch.layout as arg (#66048)
  • Specified a default value when possible for placeholders created from concrete_args (#59569)
  • Fixed issue where GraphModule.delete_all_unused_submodules deletes submodules from called leaf modules (#66430)
  • Fixed torch.fx.subgraph_rewriter.replace_pattern mechanism so that multiple one-liner instances of the pattern are captured correctly (#66442)
  • Fixed bug in graph matcher that caused certain nodes to be matched twice (#69238)
  • Ensured node stack trace survives copying (#69368)
  • Fixed to_folder not saving dtype (#69983)
  • Added a default_value arg to fx.Graph.placeholder and fix split_module (#71016)

Sparse

  • Fixed CSR storage access to throw when used (#70072)
  • Fixed multiplication of 0-D sparse tensors (#70749)
  • Fixed result dtype for neg if given sparse Tensor (#68885)

CUDA

  • Fixed CUDA vs CPU consistency for index_put_ when accumulating (#66790)
  • Fixed CUDA vs CPU consistency for index_put_ when accumulating (part 2) (#67189)
  • Fixed error in warning about unsupported GPU (#67900)
  • Disabled TF32 in pinv_jvp and pinv_backward (#67948)
  • Fixed DLPack CUDA stream convention (#67618)
  • Sets device guard in _cudnn_impl functions (#70406)
  • Fixed mem_get_info when querying on a device other than the current device (#69640)

Benchmark

  • Fixed divide-by-zero errors in torch.utils.benchmark.Timer (#70050)

Dispatcher

  • Added explicit OperatorHandle destructor, so that the symbol shows up in windows builds (#70033)

Profiler

  • Fixed race condition in profiler (#65812)
  • Fixed TensorBoard memory profiling (#71417)

Visualization

  • Fixed torch.utils.tensorboard parsing JIT graph incorrectly (#65692)

Vulkan

  • Greatly reduced memory usage of the Vulkan backend by updating the configuration of the Vulkan Memory Allocator (#69088)
  • Addressed several warnings raised by the Vulkan Validation layers:
    • Updated all texture resources to have the same dimensionality (#67647)
    • Added image format qualifier to shader files (#69330)
    • Disabled SPIR-V compiler size optimization (#69331)

Mobile

  • Fixed quantized logistic converter for NNAPI (#70847)
  • Fixed potential crash if MTLCreateSystemDefaultDevice returns nil (#66859)
  • Used full name to look for the promoted prim operator table (#66081)
  • Fixed function name bug in mobile export (#66915)
  • Fixed issues with irange not having a header included in Metal (#66877)
  • Fixed backward compatibility issue for UnionType on mobile in type_parser. (#71341)
  • Fixed forward flatbuffer type handling with dynamic type in flatbuffer_loader. (#71500)
  • Fixed type equalities issue in pytorch_jni_common (#71508)
  • Fixed missing properties to the executor in CoreML (#67737)
  • Fixed memory computation when both constants and data tensors are present in model_dump (#66006)
  • Ensured that functions participating in bundled inputs have their "name" attribute set (#65856)

Distributed

  • torch.distributed
    • Fixed bug on empty GLOO_SOCKET_IFNAME_ENV (#68933)
  • DistributedDataParallel
    • Fixed “Cannot modify in-place due to DDPSink” (#66015)
  • torch.distributed.elastic
    • Fixed scale down bug caused by calling rdzv_handler.shutdown() on premature agent failures (#67749)

TorchScript

  • Fixed a race condition in the JIT interpreter when unpickling source ranges (5525e9a591)
  • Fixed a ref counting loop for CompilationUnit, resulting in memory leaks when class objects were in JIT graphs. (#65442)
  • Fixed bug where output type was discarded after calling SubgraphRewriter in C++ (#65453)
  • Fixed bug where torch.jit.optimize_for_inference did not torch.jit.freeze a module when passed a non-frozen module (#71436)
  • Fixed bug where running module.forward() on a module frozen with torch.jit.freeze ran the wrong graph (#68316)
  • Fixed bug where alias analysis in the JIT optimizer was incorrect for the int[] version of torch.split, resulting in invalid optimizations in various JIT optimization passes (#69745)
  • Fixed places where using torch.autocast together with autodiff (module.backwards()) in a JIT graph had the wrong number of arguments and would error out. (#67648)
  • Forbid propagating gradients through views in JIT graphs as currently it is broken (#67732)
  • Fixed bug where graph input types were incorrect after running torch.jit.trace (#68242)
  • Fixed case where BroadcastMKLDNN breaks the stack invariant by pushing more than 2 tensors to the stack when torch.jit.freeze ops are converted to MKLDNN (#66628)
  • Raised error instead of segfaulting when passing None into torch.jit.Graph.create (#68253)
  • Raised error instead of crashing when a JIT ScriptFunction is pickled with an incompatible Python pickle version. (#69807)
  • Fixed bug where torch.jit.script fails when comments in a function have less indentation than the surrounding code (#70227)
  • Fixed incorrect device type when torch.device is called inside scripted (torch.jit.script) code (#69645)
  • Fixed warning: overloaded virtual function torch::jit::Function::call is only partially overridden in class torch::jit::GraphFunction (4bf1be898d)

Quantization

  • Fixed applying non-zero offset 1 to null pointer in torch.nn.functional.interpolate for quantized tensors (#65570)
  • Doesn't assume bias is a keyword argument to torch.nn.Conv{n}d (#61647, #71426)
  • Made error message when trying to use torch.quantize_per_tensor on non floats more specific (#66050)
  • Quantized torch.nn.Embedding conversion with unsupported dtype: make error message clearer (#66051)
  • Fixed torch.nn.qat.EmbeddingBag from_float error message (#66989)
  • Fixed bug enforcing quant_min <= zero_point <= quant_max for float zeropoint in torch.nn.Embedding QAT (#68852)
  • Fixed scale+zp serialization of torch.nn.quantized.BatchNorm{2|3}d (#70432)
  • Fixed torch.nn.Dropout in FX graph mode quantization (#71043, #71438)
  • Fixed qconfig setting for fused modules in FX graph mode quantization (#71254)
  • Removed assumption number of rows is in 32 bit in fbgemm (#69066)
  • Fixed reduce_range warning when using default observers (#71027)

ONNX

  • Doesn’t create invalid index_select op when constant folding through ONNX Gather with indices rank > 1. Fixes export of some uses of Embedding. (#68493)
  • Shape inference:
    • ConstantMap setters to update existing value instead of emplace, and fix default value of keepdims for Reduce (#67812)
    • Fixed memory leak (#68210)
    • Fixed reshape shape inference regression affecting LSTM (#72532)
  • Fixed inplace fill_ dtype export mismatch (#64580)
  • Fixed remainder (#64578)
  • Fixed reciprocal when input is not floating point (#67808)
  • Fixed new_full and full_like for Python 3.9 (#67806)
  • Fixed reduce ops on binary_cross_entropy_with_logits (#67805)
  • Propagated node metadata across passes (#45256)
  • Ensured outputs don’t have the same name (#66137)
  • Fixed pad with sequence inputs (#64377)
  • Fixed instance_norm with track_running_stats=True (#64375)
  • Fixed all and any with dim arg (#67270)
  • Allows autograd functions (prim::PythonOp) to be exported with OperatorExportTypes.ONNX_FALLTHROUGH (#67273)

torch.package

  • Prevent import race condition that leaves torch.package.PackagePickler with unwanted dispatch table entries. (#71025)

Performance

Python API

  • Speed up pickling for torch.dtype (#65182)
  • Speed up histogram: avoid index_put_ overhead in histogram kernel's inner loop (#67815)
  • Speed up torch.topk with sort for some cases (#68632)
  • Speed up torch.stack: don't unsqueeze every stack arg if possible (#70288)
  • Speed up LayerNorm 4-5% (#71423)
  • Speed up structured kernels: fix some unnecessary refcount bumps (#71140)
  • Speed up indexing functions: release GIL in a few places (#71728)
  • Speed up torch.empty a bit: define check_sizes_nonnegative as inline (#71640)
  • Speed up XLA tensor printing by reducing compilations (#71147)

C++ API

  • Updated c10::SmallVector from LLVM (#69110)
  • Reduced some framework overhead in at::copy_() (#68950)
  • Reduced some overhead in StorageImpl::set_data_ptr (#65432)
  • Improved IValue performance for tuples by inlining tuple storage (#64066)

Autograd

  • Stopped materializing Tensors full of 0s in forward AD when possible (#64837)
  • Rewrote the backward of linalg.lu and linalg.lu_solve to use linalg_solve_triangular (#63569)
  • Updated nn.functional.grid_sample backward to compute input gradient only if required (#66069, #66070)
  • Stopped erroneously saving the output of torch.softplus for backward (#70296)

Complex Numbers

  • Release GIL when assigning to real or imaginary components of a complex tensor (#71747)
  • Restored conjugate and negative bits of a tensor when calling repeat_interleave (#68523)

CUDA

  • Used a better hash table in CUDACachingAllocator (#71667)
  • TopK CUDA Optimization: used multiple block per slice (#71081)
  • Removed sync in Embedding caused by unique (#66091)
  • EmbeddingBackward exclusive_scan thrust->cub (#66566)
  • sort_out_cuda: Used custom kernels to fill index tensors (#66668)
  • masked_scatter: fuse mask count check into one kernel (#66871)
  • Enabled better depthwise conv perf on cudnn 8.2+ (#58749)
  • Improved native layer_norm forward perf (#67977)
  • Improved native layer_norm backward perf (#68238)
  • Fast path for size 0 GPU host malloc (#68532)
  • Alternative implementation of CUDA pinned memory allocator focusing on multi-threaded scalability (#69299)
  • Used legacy unrolled kernel for non-trivial offset calc cases (#71710)
  • Removed call_once from CUDACachingAllocator (#71668)
  • Reworked stat collection in CUDACachingAllocator (#71669)
  • Fixed CUDA LpNormFunctor (#70601)

Dispatcher

  • Made c10::KernelFunction struct smaller, which should reduce some memory usage by the dispatcher (#65618)

torch.fx

  • Made torch.fx.symbolic_trace reuse buffers if they're the same (#66211)

Profiler

Mobile

TorchScript

  • Improved performance of autodiff on small JIT graphs (#71666)
  • Enabled autocasting of tensors between fp16, bfloat16 and fp32 in torchscript models (#63939, #67707)
  • Enabled optimizations in more gradSumToSize cases in the JIT Autograd support (#63941)
  • When unpickling a JIT graph, avoid reading the file from a stream for 0-byte tensor storage (#67787)

Quantization

  • Sped up quantized torch.nn.functional.interpolate for channels last (#66525)
  • Sped up torch.nn.functional.upsample for channels last (#70903)
  • Parallelized computation in torch.quantize_per_tensor_affine and torch.dequantize (#65845)

Documentation

Python API

  • Added docs for torch.adjoint. (#68869)
  • Clarified difference in behavior of empty_strided and as_strided (#64568)
  • Added some missing generated doc entries (torch.select, torch.slice_scatter, torch.diagonal_scatter, torch.select_scatter) (#69030), histogramdd (#68273)
  • Typo and formatting fixes. LinearLR (#67840), torch.any (#65310, #70187), torch.futures (#70630), jit docs (#68557), Tensor.type (#67019), torch.lobpcg (#71464), Tensor.triu(), Tensor.tril(), Tensor.ravel(). (#71057), torch.acosh (#66814), (#70439)
  • General Doc improvements for individual ops. torch.finfo (mention torch.bfloat16) (#68496), torch.quantile interpolation kwarg (#70637), from_dlpack and to_dlpack (#70437), set_printoptions added examples (#68324), index_add (#65806), topk doc (#65938), unique (#66132), chi2 (#67379), torch.histc (#64191), empty and empty_like (#68874), torch.cholesky_inverse (#69069), torch.dsplit (#70557)
  • Changed README getting started link to explicit instructions (#66828)
  • Modernized and clarified docs for torch.tensor and torch.as_tensor (#63308)
  • Improved torchhub docs (#69970)
  • Updated docs for torch.Tensor.real to indicate that it's supported for real tensors (#71962)

C++ API

  • Fixed typos in ATen README (#69170)
  • Mentioned TORCH_SHOW_CPP_STACKTRACES in Contributing.md docs (#64052)
  • Updated link to C++ frontend examples (#66095)
  • Added docs for Visual Studio extension (#63944)
  • Added docs about an issue with compiling C++ extensions with CUDA 11.5 and Windows (#73013)

Autograd

  • Updated docs for forward AD and make them public (#71643, #71159)
  • Updated “Extending PyTorch” doc to cover forward AD (#66962)
  • Fixed broken code syntax in autograd.rst (#69362)
  • Fixed incorrect variable in autograd docs (#70884)
  • Fixed typo in torch.autograd.Function docs that prevented it from compiling (#66754)

Dataloader

  • Added docstring for default_collate and default_convert (#69862)
  • Updated the documentation for AMP with DataParallel (#69218)

torch.nn

  • F.binary_cross_entropy: Updated examples to avoid deprecated calls (#69816)
  • F.linear: Fixed shape docs to indicate no-batch-dim support (#66884)
  • F.max_pool*d: Added functional docs (#63264)
  • F.multilabel_soft_margin_loss: Added reduction args to signature (#70420)
  • nn.AdaptiveLogSoftmaxWithLoss: Fixed typo in log_prob name (#68926)
  • nn.{BatchNorm1d, InstanceNorm1d}: Fixed input shape notation inconsistencies (#71371)
  • nn.CrossEntropyLoss: Corrected typo in formula for class probability targets (#70220)
  • nn.{ELU, Hardshrink, Hardsigmoid, MultiHeadAttention, Softplus, Tanh}: Made first line of docstring readable for overview docs (#70574, #71012, #70987, #71100, #70576, #70577)
  • nn.Flatten: Simplified example code (#67472)
  • nn.{Hardsigmoid, Hardswish, Mish, RReLU, SiLU}: Added activation function images (#65415)
  • nn.KLDivLoss: Fixed rendering of reduction arg (#66583)
  • nn.KLDivLoss: Rewrote docs to clarify math (#67443)
  • nn.MaxUnpool2d: Changed misleading example to better demonstrate output_size usage (#68936)
  • nn.Module: Added note describing required super().__init__() call (#66909)
  • nn.Module: Changed super() usage to Python 3 syntax in example (#65748)
  • nn.Module: Fixed formatting for named_modules() (#70491)
  • nn.NLLLoss: Corrected default value for reduce (#68426)
  • nn.SmoothL1Loss: Clarified equivalence with nn.L1Loss when beta == 0 (#70673)
  • nn.{TransformerDecoderLayer, TransformerEncoderLayer}: Clarified default batch_first=False dimension format (#66574)
  • nn.Upsample: Indicated that align_corners takes effect in bicubic mode (#66756)
  • nn.utils.clip_grad_norm_: Fixed rendering of parameters in error_if_nonfinite arg docs (#69958)
  • optim.Adam: Fixed formatting (#70387)
  • optim.AdamW: Fixed formula (#68587)
  • optim.RAdam: Corrected default value of lr arg (#69186)
  • Removed orphan from cuDNN persistent note (#65160)
  • Updated link to tutorial on defining NN modules (#65534)
  • nn.{AvgPool1d, AdaptiveAvgPool3d, MultiMarginLoss, PairwiseDistance, TripletMarginLoss}, F.{conv3d, conv_transpose3d, fold, linear}: Fix doc formatting regressions from no-batch-dim support (#73014)

torch.fx

  • Fixed for retracing documentation which would break for n-ary operators (#71599)
  • Updated torch.fx.passes.split_module docstring (#65542)
  • Updated fx.rst example outputs (#68043)
  • Added document gotcha about training flag (#68915)
  • Defined get_dot_graph to match documentation (#70541)

Sparse

  • Updated sparse.rst to warn about _values() (#71088)

CUDA

  • Updated Stream wait documentation to reference underlying cudaStreamWaitEvent call (#67973)
  • Documented torch.cuda.ExternalStream, torch.cuda.caching_allocator_alloc and torch.cuda.caching_allocator_delete (#70126)
  • Updated CUDA Graphs docs: Fixed make_graphed_callables example typos (#69379)

Mobile

  • Added user facing documentation for tracing-based selective build mobile interpreter in Android and iOS (#1709)
  • Added recipe for bundled inputs in TorchScript models (#1524)

Distributed

  • DistributedDataParallel
    • DDP doc fix (#71363)
    • Clarified how to check memory saving if using gradient_as_bucket_view (#71483)
  • torch.distributed
    • Updated distributed.rst to show that CUDA send/recv on GPU is supported (#65601)
    • Clarified checkpoint support (#68827)
    • Updated distributed.rst for ProcessGroup Extensions (#71482)
  • torch.distributed.elastic
    • Made --max_restarts explicit in the quickstart and runner docs (#65838)
  • torch.distributed.optim
    • Rendered torch.distributed.optim members (#67885)
  • torch.distributed.rpc
    • Deleted distributed optimizer section from RPC and add reference to namespace docs page (#68068)

TorchScript

  • Added typing.Union to supported types in documentation (#68435)
  • Added documentation to torch.jit.is_tracing() (#67326)
  • Fixed typos in jit_language_reference.rst (#68706)

Quantization

  • Added documentation with quantized model save/load instructions (#69789); see the sketch after this list
  • Updated link to qnnpack in quantization docs (#66226)
  • Improved quantization API docs (#66379)
  • Added quantization docs pages for Numeric Suite (Eager and FX) (#66380)
  • Documented the quantization custom module APIs (#67449)
  • Improved quantization documentation (#68907)
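
For the save/load instructions mentioned above, one common route is to serialize quantized models through TorchScript. A minimal sketch using dynamic quantization (the toy model is made up):

import torch

model = torch.nn.Sequential(torch.nn.Linear(16, 8)).eval()
qmodel = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Script the quantized model and round-trip it through torch.jit.save/load.
scripted = torch.jit.script(qmodel)
torch.jit.save(scripted, "qmodel.pt")
restored = torch.jit.load("qmodel.pt")
print(restored(torch.randn(1, 16)))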

ONNX

  • Improved documentation of the operator_export_type and opset_version args (#69549); see the sketch after this list
  • Fixed documentation for do_constant_folding arg default (#71348)
  • Documented ExportTypes, CheckerError, and unregister_custom_op_symbolic (#68489)
  • Fixed link to ONNX Runtime custom op documentation (#67944)
  • Added section “Discovering all unconvertible ATen ops at once” (#66143)
  • Fixed typos (#66090)
  • Documented work-arounds for indexing export limitations and improved error messages (#64579)
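
As context for the operator_export_type and opset_version entries above, a minimal export call showing both arguments (the model and file name are placeholders):

import torch

model = torch.nn.Linear(4, 2).eval()
dummy = torch.randn(1, 4)

# opset_version selects the ONNX operator set; ONNX_FALLTHROUGH keeps any
# unconvertible ATen ops as custom nodes instead of failing the export.
torch.onnx.export(
    model,
    dummy,
    "linear.onnx",
    opset_version=13,
    operator_export_type=torch.onnx.OperatorExportTypes.ONNX_FALLTHROUGH,
)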

torch.package

  • Added docs describing how to debug torch.package dependencies (#65704)

Download Release

This release has the following assets:

  • pytorch-v1.11.0.tar.gz
  • Source code (zip)
  • Source code (tar.gz)

Visit the release page to download them.


    Have any questions?
    Contact Exxact Today


    Topics

    PyTorch-v1.11.jpg
    Deep Learning

    PyTorch 1.11.0 Now Available

    March 10, 2022145 min read

    PyTorch is a widely used, open source deep learning platform used for easily writing neural network layers in Python enabling a seamless workflow from research to production. Based on Torch, PyTorch has become a powerful machine learning framework favored by esteemed researchers around the world, and now adopted fully by Facebook.

    The newest stable release of PyTorch, version 1.11.0, has a number of new highlights including TorchData, functorch, Distributed Data Parallel (DDP) static graph optimizations, and more!

    PyTorch 1.11.0 Release Notes

    • Highlights
    • Backwards Incompatible Change
    • Deprecations
    • New Features
    • Improvements
    • Performance
    • Documentation

    Highlights

    The new PyTorch 1.11.0 release is composed of over 3,300 commits since 1.10, made by 434 contributors. Along with 1.11, they released beta versions of TorchData and functorch. Here's a quick summary:

    • TorchData is a new library for common modular data loading primitives for easily constructing flexible and performant data pipelines. View it on GitHub.
    • functorch, a library that adds composable function transforms to PyTorch, is now available in beta. View it on GitHub.
    • Distributed Data Parallel (DDP) static graph optimizations available in stable.

    You can check the blogpost that shows the new features here.

    Backwards Incompatible changes

    Python API

    Fixed python deepcopy to correctly copy all attributes on Tensor objects (#65584)

    This change ensures that the deepcopy operation on Tensor properly copies all the attributes (and not just the plain Tensor properties).

    1.10.21.11.0
    a = torch.rand(2)
    a.foo = 3
    torch.save(a, "bar")
    b = torch.load("bar")
    print(b.foo)
    # Raise AttributeError: "Tensor" object has no attribute "foo"
          
    a = torch.rand(2)
    a.foo = 3
    torch.save(a, "bar")
    b = torch.load("bar")
    print(b.foo)
    # 3
          

    steps argument is no longer optional in torch.linspace and torch.logspace

    This argument used to default to 100 in PyTorch 1.10.2, but was deprecated (previously you would see a deprecation warning if you didn’t explicitly pass in steps). In PyTorch 1.11, it is not longer optional.

    1.10.21.11.0
    # Works, but raises a deprecation warning
    # Steps defaults to 100
    a = torch.linspace(1, 10)
    # UserWarning: Not providing a value for linspace's steps is deprecated
    # and will throw a runtime error in a future release.
    # This warning will appear only once per process.
    # (Triggered internally at  ../aten/src/ATen/native/RangeFactories.cpp:19
          
    # In 1.11, you must specify steps
    a = torch.linspace(1, 10, steps=100)
          

    Remove torch.hub.import_module function that was mistakenly public (#67990)

    This function is not intended for public use. If you have existing code that relies on it, you can find an equivalent function at torch.hub._import_module.

    C++ API

    We’ve cleaned up many of the headers in the C++ frontend to only include the subset of aten operators that they actually used (#68247, #68687, #68688, #68714, #68689, #68690, #68697, #68691, #68692, #68693, #69840)

    When you #include a header from the C++ frontend, you can no longer assume that every aten operators are transitively included. You can work around this by directly adding #include <ATen/ATen.h> in your file, which will maintain the old behavior of including every aten operators.

    Custom implementation for c10::List and c10::Dict move constructors have been removed (#69370)

    The semantics have changed from "make the moved-from List/Dict empty" to "keep the moved-from List/Dict unchanged"

    1.10.21.11.0
    c10::List list1({"3", "4"});
    c10::List list2(std::move(list1));
    std::cout << list1.size() // 0
          
    c10::List list1({"3", "4"});
    c10::List list2(std::move(list1)); // calls copy ctr
    std::cout << list1.size() // 2
          

    CUDA

    Removed THCeilDiv function and corresponding THC/THCDeviceUtils.cuh header (#65472)

    As part of cleaning up TH from the codebase, the THCeilDiv function has been removed. Instead, please use at::ceil_div, and include the corresponding ATen/ceil_div.h header

    Removed THCudaCheck (#66391)

    You can replace it with C10_CUDA_CHECK, which has been available since at least PyTorch 1.4, so just replacing is enough even if you support older versions

    Removed THCudaMalloc(), THCudaFree(), THCThrustAllocator.cuh (#65492)

    If your extension is using THCThrustAllocator.cuh, please replace it with ATen/cuda/ThrustAllocator.h and corresponding APIs (see examples in this PR).

    This PR also removes THCudaMalloc/THCudaFree calls. Please use c10::cuda::CUDACachingAllocator::raw_alloc(size)/raw_delete(ptr), or, preferably, switch to c10:cuda::CUDaCachingAllocator::allocate which manages deallocation. Caching allocator APIs are available since PyTorch 1.2, so just replacing it is enough even if you support older versions of PyTorch.

    Build

    Stopped building shared library for AOT Compiler, libaot_compiler.so (#66227)

    Building aot_compiler.cpp as a separate library is not necessary, as it’s already included in libtorch.so.
    You can update your build system to only dynamically link libtorch.so.

    Mobile

    Make typing.Union type unsupported for mobile builds (#65556)

    typing.Union support was added for TorchScript in 1.10. It was removed specifically for mobile due to its lack of use and increase in binary size of PyTorch for Mobile builds.

    Distributed

    torch.distributed.rpc: Final Removal of ProcessGroup RPC backend (#67363)

    ProcessGroup RPC backend is deprecated. In 1.10, it threw an error to help users update their code, and, in 1.11, it is removed completely.

    The backend type “PROCESS_GROUP” is now deprecated, e.g.
    torch.distributed.rpc.init_rpc("worker0", backend="PROCESS_GROUP", rank=0, world_size=1)
    and should be replaced with:
    torch.distributed.rpc.init_rpc("worker0", backend="TENSORPIPE", rank=0, world_size=1)

    Quantization

    Disabled the support for getitem in FX Graph Mode Quantization (#66647)

    getitem used to be quantized in FX Graph Mode Quantization, and it is no longer quantized. This won’t break any models but could result in a slight difference in numerics.

    1.10.21.11.0
    from torch.ao.quantization.quantize_fx import convert_fx, prepare_fx
    class M(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.linear = torch.nn.Linear(5, 5)
        def forward(self, x):
            x = self.linear(x)
            y = torch.stack([x], 0)
            return y[0]
    m = M().eval()
    m = prepare_fx(m, {"": torch.ao.quantization.default_qconfig})
    m = convert_fx(m)
    print(m)
    # prints
    # GraphModule(
    #   (linear): QuantizedLinear(in_features=5, out_features=5,
    #      scale=1.0, zero_point=0, qscheme=torch.per_tensor_affine)
    # )
    # def forward(self, x):
    #     linear_input_scale_0 = self.linear_input_scale_0
    #     linear_input_zero_point_0 = self.linear_input_zero_point_0
    #     quantize_per_tensor = torch.quantize_per_tensor(x,
    #         linear_input_scale_0, linear_input_zero_point_0, torch.quint8)
    #     x = linear_input_scale_0 = linear_input_zero_point_0 = None
    #     linear = self.linear(quantize_per_tensor)
    #     quantize_per_tensor = None
    #     stack = torch.stack([linear], 0);  linear = None
    #     getitem = stack[0]; stack = None
    #     dequantize_2 = getitem.dequantize();  getitem = None
    #     return getitem
          
    from torch.ao.quantization.quantize_fx import convert_fx, prepare_fx
    class M(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.linear = torch.nn.Linear(5, 5)
        def forward(self, x):
            x = self.linear(x)
            y = torch.stack([x], 0)
            return y[0]
    m = M().eval()
    m = prepare_fx(m, {"": torch.ao.quantization.default_qconfig})
    m = convert_fx(m)
    print(m)
    # prints
    # GraphModule(
    #   (linear): QuantizedLinear(in_features=5, out_features=5, scale=1.0,
                        zero_point=0, qscheme=torch.per_tensor_affine)
    # )
    # def forward(self, x):
    #     linear_input_scale_0 = self.linear_input_scale_0
    #     linear_input_zero_point_0 = self.linear_input_zero_point_0
    #     quantize_per_tensor = torch.quantize_per_tensor(x, linear_input_scale_0,
                         linear_input_zero_point_0, torch.quint8)
    #     x = linear_input_scale_0 = linear_input_zero_point_0 = None
    #     linear = self.linear(quantize_per_tensor);  quantize_per_tensor = None
    #     stack = torch.stack([linear], 0);  linear = None
    #     dequantize_2 = stack.dequantize();  stack = None
    #     getitem = dequantize_2[0];  dequantize_2 = None
    #     return getitem
          

    Users should now use fuse_modules for PTQ fusion and fuse_modules_qat for QAT fusion (#69878, #71956)

    There are two types of fusion supported by fuse_modules api: PTQ and QAT fusion. Previously we relied on module.training to decide which mode user wanted, but this was a misuse of the training attribute since that is not the intended purpose. This PR removes the dependency on module.training and uses separate APIs to make the fusion requested by the user explicit.

    Previously, fuse_module used to support both cases and distinguished PTQ/QAT fusion based on module.training, but now fuse_module only supports the PTQ fusion. So, in the case when user wants to do QAT fusion, they need to change the call to fuse_modules_qat, instead of using fuse_modules, otherwise, they would silently get unwanted fusion results (PTQ fusion), or if the model is in training mode, it might result in error.

    Note: Currently it is still enforced that if the model is in eval mode, only PTQ fusion can be used; if the model is in training mode, then only QAT fusion can be used. In the future this constraint will be relaxed.

    1.10.21.11.0
    import torch
    from torch.ao.quantization import fuse_modules
    class M(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.conv = torch.nn.Conv2d(3, 3, 3)
            self.bn = torch.nn.BatchNorm2d(3)
        def forward(self, x):
            return self.bn(self.conv(x))
    m = M().train()
    m = fuse_modules(m, ["conv", "bn"])
    print(type(m.conv))
    m = M().eval()
    m = fuse_modules(m, ["conv", "bn"])
    print(type(m.conv))
    <class 'torch.nn.intrinsic.modules.fused.ConvBn2d'>
    <class 'torch.nn.modules.conv.Conv2d'>
          
    import torch
    from torch.ao.quantization import fuse_modules
    class M(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.conv = torch.nn.Conv2d(3, 3, 3)
            self.bn = torch.nn.BatchNorm2d(3)
        def forward(self, x):
            return self.bn(self.conv(x))
    m = M().train()
    # For Quantization Aware Training, use fuse_modules_qat()
    m = fuse_modules_qat(m, ["conv", "bn"])
    print(type(m.conv))
    m = M().eval()
    m = fuse_modules(m, ["conv", "bn"])
    print(type(m.conv))
    # Result (doesn't change):
    <class 'torch.nn.intrinsic.modules.fused.ConvBn2d'>
    <class 'torch.nn.modules.conv.Conv2d'>
          

    ONNX

    Removed f arg from onnx.export_to_pretty_string (#69546)

    The arg has always been ignored. Simply remove it from your code.

    1.10.21.11.0
    torch.onnx.export_to_pretty_string(model, inputs, "file_name")
          
    torch.onnx.export_to_pretty_string(model, inputs)
          

    Removed use_external_data_format arg from onnx.export (#67809)

    The arg has been deprecated and ignored since 1.10. The external data format is now used automatically if and only if the exported file would exceed protocol buffer’s file size limit. Simply remove it from your code.

    1.10.21.11.0
    torch.onnx.export(model, inputs, f_name, use_external_data_format=True)
          
    torch.onnx.export(model, inputs, f_name)
          

    Removed example_outputs arg from torch.onnx.export (#67809)

    The arg has been deprecated and ignored since 1.10. The provided model is instead executed once to produce example outputs. Simply remove it from your code.

    1.10.21.11.0
    torch.onnx.export(model, inputs, f_name, exaple_outputs=(foo,))
          
    torch.onnx.export(model, inputs, f_name)
          

    Removed enable_onnx_checker arg from onnx.export (#67276)

    The arg has been deprecated and ignored since 1.10. The ONNX checker is always enabled. If it fails, onnx.CheckerError will be raised. Users can catch and ignore that exception.

    1.10.21.11.0
    torch.onnx.export(model, inputs, f_name, enable_onnx_checker=False)
          
    try:
        torch.onnx.export(model, inputs, f_name)
    except torch.onnx.CheckerError:
        pass # ignore error
          

    Moved and renamed onnx.utils.ONNXCheckerError to onnx.CheckerError (#66644)

    Previously the documentation was incorrect and stated ONNXCheckerError was in the onnx module, so this moves the class to the originally intended module and brings the code in line with the documentation. The new name is shorter and less redundant with the module name.

    1.10.21.11.0
    except torch.onnx.utils.ONNXCheckerError:
          
    except torch.onnx.CheckerError:
        

    Removed _retain_param_name arg from onnx.export (#67276)

    The arg has been deprecated and ignored since 1.10. Param names are now always retained. Simply remove it from your code. If you want to remove param names, you can do so by editing the exported ONNX model.

    1.10.21.11.0
    # NOTE: No way to get same behavior as _retain_param_name=False.
    torch.onnx.export(model, inputs, f_name, _retain_param_name=True)
          
    torch.onnx.export(model, inputs, f_name)
        

    Deprecations

    Python API

    Deprecated x.T on tensors of dimension other than 0 or 2 (#64180)

    x.T only accepts tensors with 0 or 2 dimensions. Calling x.T on tensors with a different number of dimensions has been deprecated.

    1.10.21.11.0
    a = torch.ones(2, 3, 4)
    a.T.size()
    # torch.Size([4, 3, 2])
          
    a = torch.ones(2, 3, 4)
    a.T.size()
    # UserWarning: The use of `x.T` on tensors of dimension other than 2
    # to reverse their shape is deprecated and it will throw an error in a future release.
    # Consider `x.mT` to transpose batches of matrices or `x.permute(*torch.arange(x.ndim - 1, -1, -1))`
    # to reverse the dimensions of a tensor. (Triggered internally at 
    # aten/src/ATen/native/TensorShape.cpp:2386.)
    # torch.Size([4, 3, 2])
        

    Quantization

    torch.ao.quantization.QConfigDynamic is deprecated and going to be removed in next the release, please use torch.ao.quantization.QConfig instead (#69875, #69864)

    1.10.21.11.0
    qconfig = torch.ao.quantization.QConfigDynamic(...)
          
    qconfig = torch.ao.quantization.QConfig(...)
        

    New features

    Python API

    • Added set_deterministic_debug_mode and get_deterministic_debug_mode (#67778, #66233)
    • Added n-dimensional Hermitian FFT: torch.fft.ifftn and torch.fft.hfftn (#63890)
    • Added Wishart distribution to torch.distributions (#70377)
    • Preliminary support for the Python Array API standard has been added to the torch and torch.linalg modules. PyTorch implements over 90% of the operators defined by the Python Array API, including the torch.from_dlpack operation for improved DLPack support (#60627)
    • Moved torch.testing from prototype to beta (#69668)

    Autograd

    • Added new torch.utils.checkpoint implementation that does not use reentrant autograd (can be toggled with the new use_reentrant flag) (#69508)
    • Added batched_grad parameter to autograd.grad to allow batched gradient computation (#65564)
    • Forward mode AD:
    • Linear algebra operation support:
      • Added forward AD support for torch.linalg.{eig, inverse, householder_product, qr} and torch.*_solve (#65546, #67043, #67268, #67837)
      • Added forward and backward AD support for torch.linalg.lstsq (#65054)
      • Added support for a wider range of inputs for linalg.pinv (#66092)

    Build

    • Added FlexiBLAS build support (#64815)
    • Added IS_LINUX and IS_MACOS global vars for cpp extensions building (#69093)
    • Added ARC for iOS CMake builds (#67884)
    • Added support for IBM z14/15 SIMD (#66407)

    Complex Numbers

    • Added complex number support to Adagrad and Adadelta optimizers (#66671, #66587)

    Dataloader

    • TorchData library is going to provide modular data loading primitives for easily constructing flexible and performant data pipelines. Beta release will be provided after the release of PyTorch Core (https://github.com/pytorch/data)

    LinAlg

    • Added an experimental flag that allows specifying a preferred linear algebra library (see the docs here) (#67980)
    • Added the linalg.matrix_exp operation (see the docs here) (#62715)
    • Added the linalg.cross operation (see the docs here) (#63285)
    • Added the linalg.diagonal operation, an alias for torch.diagonal (see the docs here) (#70599)
    • Added the linalg.lu_factor operation (see the docs here) (#66933)

    torch.nn

    • Added torch.nn.utils.rnn.{unpack_sequence,unpad_sequence} functions (#66550)

    Sparse

    • Added torch.sparse.sampled_addmm for CSR Tensors on GPU (#68007)

    CUDA

    • The Jiterator - enables compiling rarely used CUDA kernels at runtime (#69439)
      • Low precision supported for jiterator (#70157) - enables runtime-compilation of ops on low precision tensors (half and bfloat16)
      • Enable cpu scalar arguments for jiterator (#69861) - enables passing cpu scalars as an argument to the jit-compiled kernels at runtime
      • The Cacherator (#71350) - caches the jit-compiled kernels on disk, so that they can be reused between different processes
      • Added complex support for Jiterator, port sinc to Jiterator (#71577)
      • Jiterates lcm, i0e, i1e, ndtri, efcx, digamma, trigamma, lgamma (#70663)
      • Jiterates exp2, erfc, erfinv and entr (#71295)
      • Fixes jiterator cache macro include + updates CUDA note with cache variables (#71452)
      • Jiterates polygamma (#71162)
    • Added cuSPARSE descriptors and updated CSR addmm (#60838)
    • Sparse CSR CUDA: added addmv_out (#61407)
    • Added nvidia-smi memory and utilization as native Python API (#69104)

    Vulkan

    • Added Vulkan support for several torch operators:
    • Added the vulkan_perf_test benchmark binary to benchmark Vulkan ops under various input conditions. (#67230)

    Mobile

    • Tracing Based Selective Build (PyTorch Mobile Build Size Reduction) is a new feature that reduces a mobile model’s binary size by only including the operators that the model uses.
      • Build tracer for tracing based workflow (#66267)
      • Used operator.yaml to build LibTorch library (#66237)
      • Unified tracer between internal and external (#64152)
      • Reorganized model tracer dependency (#63421)
      • Added support for the bool and int dtypes in the copy kernel by default when using Tracing Based Selective Build (#69106, #69297)
      • Generic build features for selective build (#67817)
      • Made more classes selective (#67397)
      • Added custom classes to selective build and compatibility APIs (#67004, #66972, #67340)

    Distributed

    TorchScript

    • Enabled running torch.jit.freeze() and torch.jit.optimize_for_inference on functions that are not forward (#68668, #69367)
    • Enabled torch.jit.freeze to work on for sparse COO tensors (#69614)
    • Enabled torch.jit.script(), torch.jit.freeze() and serialization for tensors in Compressed Sparse Row (CSR) format (#69555)
    • Allowed users to set the fusion strategy for torch.jit.fuser through the now public torch.jit.set_fusion_strategy . (#72937)
    • Enabled Dynamic Shape Fusion For GPU & CPU, configurable via torch.jit.set_fusion_strategy (#72036)

    Quantization

    • Added bilinear quantized implementation of torch.nn.functional.grid_sample 2d operator (#66879)
    • Added the torch.quantize_per_tensor_dynamic operator (#68004)
    • Added Quantization Aware Training support for torch.nn.Embedding and torch.nn.EmbeddingBag
      • Added basic EmbeddingBag QAT fakeQuant workflow (#65443)
      • Added support for quantization of Embedding{Bag} in dynamic quant APIs (#65674)
      • Eager mode QAT for Embeddings (#66429)
      • Add benchmarks for QAT Embedding+EmbeddingBag (#66560)
      • Supported Embedding QAT via FX API (#69333)
      • Add FX support for QAT EmbeddingBag (#69334)
    • Added support for depthwise quantized torch.nn.Conv3d in qnnpack, for use in quantization
      • Depthwise Conv3d Indirection Buffer Setup (#69311)
      • Depthwise Conv3d Weight Packing (#69312)
      • Depthwise Conv3d mp8x27 (per channel) Neon Kernel (#69313)
      • Depthwise Conv3d mp8x27 (per-channel) Sse2 Kernel (#69314)
      • Tightened Step Height for Indirection Buffers (#70530)
      • Enabled Depthwise Specific Conv3d Kernel for Kernel Size 3x3x3 (#69315)
      • Implemented 3d convolution in qnnpack (#66350)

    ONNX

    • Supports opset version 15 (#67805)
    • Supports exporting nn.Module calls as ONNX local functions (#66140, #67803)
    • Supports for exporting new ops:
    • Added BFloat16 type support (#66788)
    • Supports exporting with Apex O2 (#66700)

    Infra (Releng)

    • Added support for ROCm 4.3.1 (#65624)
    • Added support for ROCm 4.5.2 (#71064)
    • Added support for CUDA 11.5 (#69262)
    • Added support for CUDA enabled Bazel builds (#66241)
    • Added support for Python 3.10 (#71132, #71419)

    Improvements

    Python API

    • NumPy compatibility:
      • Improved torch.searchsorted to be more consistent with NumPy (#66818)
      • Added torch.argwhere to match NumPy (#64257)
      • Added an alias for torch.special.softmax (#62251)
    • Improved torch.Tensor.view(dtype): enable all dtype combinations (#66493)
    • Improved torch.diff by adding support for n greater than 1 (#67260)
    • Improved torch.movedim to handle scalar as no-op (#69537)
    • Improved cartesian_prod: fixed a warning in the docs example (#68753)
    • Improved error messages for max_unpool{}d operators (#67328)
    • torch.distributions
      • Implemented positive-semidefinite constraint in torch.distributions (#71375)
      • Implemented Entropy methods for Binomial and Multinomial distributions (#67609)
      • Implemented support for non-negative constraint in exponential distribution (allowing it to include zero). (#67184)
      • Implemented kl divergence between normal and laplace distribution. (#68807)
    • Improved meta tensor support for operators:
    • Added support for torch.Tensor.real for real-valued tensors (#71718)
    • torch.logaddexp, torch.logaddexp2, torch.remainder: added BFloat16 support on CPU (#63621)
    • torch.bucketize and searchsorted: added Half precision support (#67077)
    • Added new torch.slice_scatter,torch.select_scatter, torch.diagonal_scatter ops (#64430)
    • Made torch.scatter_reduce a public API (#68580, #73125)

    C++ API

    • Added C++ API and docs for hfftn (#66127)
    • Added support for MaybeOwned<IValue> (#68157)
    • Added set_to_none option for zero_grad() to C++ API (#68801)
    • Added an environment variable, TORCH_CPP_LOG_LEVEL, that you can use to toggle the log level in the c10 library (#71746)

    Autograd

    • Added nesting support for torch.autograd.graph.saved_tensor_hooks (#70932)
    • Delayed all warnings encountered during the backward pass until the end of backward execution (#66235)
    • Added complex autograd support to torch.{col2im,im2col} (#68199)
    • Added new reduce options and autograd support for torch.scatter_reduce (#71788)
    • Added derivatives wrt the second argument for torch.{remainder,fmod} (#69908)
    • Added new strategy flag to autograd.functional.{Jacobian, Hessian} to enable vectorized computation (#67041, #66292)
    • Added check_backward_ad flag to torch.autograd.gradcheck to be able to skip backward mode AD checks (#65040)
    • Relaxed forward AD layout check to allow primal and tangent stride to differ when their size is 1 (#66294)

    Build

    • Improved incremental build times of PyTorch core by removing a dependency on native_functions.yaml in many core files (#64499, #66914, #64172, #64171, #66620, #66793, #66913, #66794, #64169, #64173, #64170, #67735)
    • Enabled bazel build without glog and gflags (#70850)
    • Added support for C++ frontend wrapper on Linux (#69094)
    • Added support for dynamic codegen outputs in CMake (#68246)
    • Max CMake version is now used by default with setup.py (#69355)
    • Upgraded oneDNN to v2.3.3 and package oneDNN Graph API together (#63748)
    • Code base should now be -Wno-unused-variable compliant (#66041)
    • Added lazy import for packaging in torch_version (#71345)

    Dataloader

    • Support custom Sequence and Mapping for utils.data.default_collate (#68779)
    • Allowed specifying num_samples to RandomSampler when replacement is False (#71568)
    • Fixed the warning of shape inconsistency utils.data.default_collate (#71065)

    ForEach

    • Implemented ForEach L1 & L2 norm (#62646)

    LinAlg

    • The linalg.matrix_rank (docs) and linalg.pinv (docs) operations now support specifying absolute and relative tolerances for better handling of singular values (#63102)

    torch.nn

    • Added channels_last support for ChannelShuffle (#50247)
    • Added no-batch-dim support for nn.{AdaptiveLogSoftmaxWithLoss, Bilinear, Conv*d, ConvTranspose*d, CrossEntropyLoss, CTCLoss, Fold, FractionalMaxPool3d, GaussianNLLLoss, GRU, GRUCell, InstanceNorm*d, LSTM, LSTMCell, MarginRankingLoss, MultiheadAttention, MultiLabelSoftMarginLoss, RNN, RNNCell, Transformer, TransformerDecoderLayer, TransformerEncoderLayer} (#69054, #69539, #70506, #71055, #70092, #64909, #69732, #69783, #70236, #65323, #71056, #64975, #67176, #70590, #65690, #70977, #70597, #70322, #69291)
    • Added BFloat16 support on CPU to nn.{AdaptiveAvgPool2d, AdaptiveMaxPool2d, AvgPool2d, MaxPool2d} (#56902, #66929, #66927, #56903)
    • Added maximize support to optim.{Adam, AdamW, SGD} (#68164, #70146, #67847, #68733, #71023)
    • F.interpolate: Add nearest-exact mode to fix off-by-one error in nearest mode (#64501)
    • F.interpolate: Added support for anti-aliasing to bilinear and bicubic algorithms (#70930, #68819, #65142, #69318)
    • F.interpolate: Improved error message for invalid shapes (#66417)
    • nn.Conv*d: Accepts 0-sized channel inputs (#66256)
    • nn.LogSigmoid: Used log1p for improved precision (#66441)
    • nn.Module: Added flag for removing duplicates from parameters (#71542)
    • nn.Module: Added register_module alias for registering a sub-module (#65174)
    • nn.ModuleList: Supported concatenation (#70887)
    • nn.MultiheadAttention: Added flag to optionally average output attention weights across heads (#70055)
    • nn.ParameterDict: Supported full set of dict methods (#69403)
    • nn.{RNN, GRU}: Allowed hidden_size to be 0 (#70556)
    • nn.Sequential: Added append method (#71326)
    • nn.Upsample: Exposed recompute_scale_factor (#66419)
    • nn.ZeroPad2d: Added extra_repr for printing purposes (#69206)
    • optim.{ChainedScheduler, SequentialLR}: Added optimizer attribute (#67406, #69817)
    • optim.swa_utils.AveragedModel: Added use_buffers flag for averaging buffers in addition to parameters (#65921, #71763)

    torch.fx

    • Improved the customizability of fx.Graph’s code generation function, including support for setting a breakpoint in the generated code (#67139)
    • Supported printing inplace operators in FX (#71887)

    Sparse

    • Add CSR support for several operators:
    • Added torch.sparse_coo Layout support to zeros_like (#68108)
    • Added Half, BFloat16, and Complex dtype support for matrix multiplication of two COO Tensors on GPU (#59980)
    • Added support for conversion of CSR to COO Tensor to to_sparse (#66774)
    • Added support for empty COO Tensors to sparse.sum (#71091)

    AMD

    • Added sparse mappings for CUDA->HIP translation (#67323)
    • Enabled frexp support for ROCm builds (#67226)
    • Used hipCUB/rocPRIM scan algorithms for large index support (#68487)

    CUDA

    • Allows external CUDA streams to be set as current (#66324)
    • Added an option to disable reduced precision reductions for FP16 GEMM (#67946)
    • Improved CUDA memory usage of nanmedian result (#68591)
    • Reduced number of igamma kernel instantiations (#70666)
    • Reduced number of compare kernels by unifying them (#69111)
    • Reduced number of bernoulli tensor tensor kernel instantiations (#70169)
    • Used cub::FutureValue to simplify 64bit indexing split of cub scan (#66711)
    • Added hascuSOLVER flag to Context (#69825)
    • Improved error message from CUDACachingAllocator (#69174)
    • Fixed masked_softmax perf for element_size is not 8 (#70271)
    • Reduced binary size of TensorCompare.cu (#68835)
    • Improved error message for interpolation (#72066)
    • Doesn't compile pow kernels for non-existent case (#70017)

    Profiler

    • Added flop count formulas for bmm and baddbmm (#66636)

    Vulkan

    • Allowed Vulkan models to return multiple outputs by improving Vulkan tensor lifecycle management to release GPU resources when the tensor is destroyed, instead of being released at the end of every inference (#66477, #66478)
    • Enabled multiple Vulkan models to execute concurrently in parallel threads, by moving components of the Vulkan global context into thread local objects (#67733, #69576)

    Mobile

    • Introduced multiple improvements for NNAPI
      • Added converters for torchscript ops quantized::mul and quantized::convtranspose2d to converter (torch.backends._nnapi.prepare.convert_model_to_nnapi) (#63913, #63914)
      • Supported int32 and qint16 type in Torchscript expressions (#70197, #70621)
      • Supported runtime flexible shapes and return shapes (#70334)
    • Improved Model Tracer Coverage and Selective Metal Ops (#68134, #69492, #69328)
    • Introduced multiple improvements for CoreML
      • Fixed error messages (#67410)
      • Assigned computationUnit to executor (#67411)
      • Cleaned up shape information from TensorSpec (#67412)
    • Type Support in Mobile Lite Interpreter
      • Extended type_parser to handle NamedTuple type (#63130, #62612)

    Distributed

    • torch.distributed
      • Improvements to error handling in TCPStore’s socket implementation (#68225)
      • Enabled ncclAvg for reductions (#62835)
      • Init dummy NCCL comms in constructor (#65173, #66393)
      • Added pybind trampoline for ProcessGroup and Work (#66338)
      • Setup c10d extension Backend class attr the same way as builtin ones (#66991)
      • Added barrier to ProcessGroup trampoline (#67236)
      • Raised warning when calling collectives on non-member group objects (#67639)
      • Patched bfloat16 support for NCCL (#67843)
      • Fixed c10d TCP store race condition with mutex (#68499)
      • Surfaced ncclUniqueId store broadcast error (#68597)
      • Checks for file existence before invoking cleanup logic in FileStore destructor (#68603)
      • Implemented gather primitive for ProcessGroupNCCL (#66745)
      • Implemented scatter primitive for ProcessGroupNCCL (#70029)
      • Enabled gather_object on NCCL (#71623)
      • Implemented allreduce_coalesced for ProcessGroupNCCL (#62140)
      • Set non-default backend names to lower case (#69400)
      • Added support for deleteKey for FileStore (#69953)
      • Fixed TSAN issue in TCPStore (#69590)
    • DistributedDataParallel
      • Refactored and removed sync_params (#64514)
      • Used named_params and named_buffers explicitly (#65181)
      • Allow await of custom buffer reduction in backward (#64515)
      • Profiling range for bucket copy (#65769)
      • Logs iteration in debug mode (#65770)
    • torch.distributed.rpc
      • Added a timeout argument to RPC shutdown() (#65425)
      • Released GIL during RPC shutdown. (#69586)
      • Updated RPC shutdown() logic to remove process group usage. (#65946)
      • Removal of Process Group dependency for TensorPipe Agent. (#68128)
    • torch.distributed.autograd
      • Made Kineto + distributed a warning rather than an error (#71120)
    • torch.distributed.elastic
      • Added ability to override sys.executable for torch.distributed.run (#66179)

    TorchScript

    • Several improvements to NVFuser, which is an optimization that speeds up all JIT graphs with a CUDA Tensors on Nvidia GPUs. This includes extending fusing support to normalization and reduction kernels, enabling multiple kernel launch for single CudaFusionGroup, and addition of a graph segmentation cache to the hierarchical caching system. (#63745, #65137, #63745, #65137)
    • Enabled profile_ivalue to convert dynamic scalar into compile time constants in NVFuser. (e.g. reduction axes). (#63745, #65137)
    • Added support in torch.jit.trace for tracing already JITted subgraphs(#59949)
    • We now provide full types on graph inputs when tracing graphs that are already JITted(#67424)
    • torch.jit.freeze now can preserve attributes of submodules - previously, it was only possible to prevent inlining of attributes of the top level module.(#66102)
    • The peephole optimizer, which is used in torch.jit.freeze now coalesces consecutive calls to torch.concat into a single call (#67000)
    • Added ability for Torch.JIT C dispatch to convert python None into an undefined Tensor(#67793)
    • torch.jit.script now recognizes union of scalars as a JIT NumberType (#66591)
    • No longer adds a tensor in a returned list to the wildcard alias set in AliasDB, allowing for additional optimizations in JIT optimization passes. (#71170)
    • In torch.jit.optimize_for_inference, there is a new graph pass to precompute transposes for linear layers. (#65631, 68024)
    • In torch.jit.freeze, there is a new pass where we concat together multiple linear layers with same input Tensor (different weight/bias) (#63198, #68024)
    • Added support for normalizing torch.Tensor.__rsub__ in normalize_ops JIT pass(#65014)

    Quantization

    • Quantized op improvements
      • torch.ao.FakeQuantize now supports fp32/fp16 zero_point. (#65836)
      • torch.ops.quantized.add now supports broadcasting (#66049)
      • torch.Tensor.dequantize now supports fp16 + cuda (#67234)
      • Added quantized CPU support for torch.nn.GELU (#69968)
      • torch.nn.quantized.functional.hardsigmoid supports an inplace flag (#65740)
    • Workflow improvements
      • FX graph mode quantization: enable torch.nn.Linear + torch.nn.BatchNorm1d fusion for PTQ (#66484)
      • Added an option in torch.ao.quantization.quantize_fx.convert_fx to accept qconfig_dict to skip quantization (#66878)
      • Added torch.nn.qat.dynamic.modules.Linear module (#67325)
      • Added torch.nn.ConvTranspose{n}d + torch.nn.BatchNorm{n}d fusion support (#70022)
      • Extended torch.ao.quantization.prepare_qat with allow_list argument, to allow custom mapping and custom QAT module (#65119)
      • Added torch.ao.quantization.default_replay_qconfig which allows observer reuse for torch.reshape in FX graph mode quantization (#69249)

    ONNX

    • Set ir_version of the exported model based on opset_version. This increases the odds that the exported ONNX model will be usable. Before this change, we were setting the IR version to a hard-coded value which may be higher than what the model consumer supports. (#67803)
    • Preserved op input names when op just passes through the input to the output (#67275)
    • Shape inference improvements:
      • Updated slice process shape to support rank only inference (#66149)
      • Represent symbolic shape as value (#69545)
    • Included op type in exported models’ input and output names (#68976)
    • Supports Conv-BatchNorm fusion inside blocks (#67272)
    • Exported torch.reciprocal to ONNX Reciprocal operator instead of Div(1, x) (#67271)
    • Supports beta!=1 in softplus (#66146)
    • Added warning for inplace updates on tensor.shape in tracing mode (#66142)
    • Supports instance_norm in training mode (#64375)
    • Allow registration of custom symbolics for ops specifying aten namespace (i.e. aten::foo is allowed as well as “foo”). (#67810)
    • Allow registration of custom symbolics for prim namespace (#66139)
    • Supports dynamic inputs for OneHot, bool for Einsum (#66147)

    Infra (Releng)

    • Build with BUILD_SPLIT_CUDA for all 11.X Windows builds (#70899)

    torch.package

    • Add ability to retrieve the dependency graph via all_path function(#65602)
    • Add support for pickle v4 (#70642)
    • Add better testing support for Package Exporter (#70641)

    Bug fixes

    Python API

    • Fixed scalar inputs for aliased binary ops {multiply, subtract, divide} (#65937)
    • Fixed torch.save when saving storages that view same data with different type (#66949)
    • Fixed torch.save error if storages are unallocated (#68787)
    • Fixed k out-of-bounds in torch.kthvalue (cpu kernel) (#68863)
    • Fixed inference_mode decorator: with inference_mode(mode=False) used to ignore the mode argument and always set inference mode. (#68617)
    • Fixed cdist_backward in the case when cdist inputs are not contiguous (#70016)
    • Fixed cdist error message typo (#70178)
    • Fixed scatter for empty indexes (#70662)
    • Fixed torch.{unique, unique_consecutive} out of bound (#71540)
    • Fixed torch.isin in the case when inputs are non-contiguous on CPU (#70659)
    • Fixed hsplit vsplit dsplit crash when section is 0 (#69342)
    • Fixed: torch.gradient ignores dim argument when checking edge_order (#67926)
    • Fixed: TransformedDistribution.icdf should perform validation after applying the inverse transformation rather than before. (#71393)
    • Fixed torch.all and torch.any internal assert error with requires_grad=True (#65714)
    • Fixed torch.logsumexp type promotion: promote integral inputs to floating for(#63393)

    C++ API

    • Fixed libtorch at::Tensor::print() linking error (#69615)
    • Avoided UB when indexing into size-0 tensors (#65878)
    • Fixed an ICE when compiling PyTorch from source on MacOS with clang-1300 (#65655)

    Autograd

    • Fixed autocast state propagation in the torch.utils.checkpoint API (#71169)
    • Fixed torch.nn.functional.conv_transpose3d backward when grad_out is non-contiguous (#67829)
    • Forward mode AD:
      • Fixed a case where forward AD in-place-over-view silently copies the view (#67816)
      • Fixed deadlock in forward AD for functions that return multiple outputs (#67995)
      • Fixed forward AD codegen for functions that have multiple formulas (#68535)
      • Fixed deadlock when forward and backward AD are used at the same time (#67360)
      • Fixed Tensor.copy_ forward AD to handle broadcasting (#69592)
      • Do not generate not_implemented error for forward AD when input with tangent passed to non-differentiable function (#66926)
    • Fixed autograd.Function when non-Tensor argument precedes tensor argument (#71530)
    • Fixed autograd.Function forward AD when forward is a no-op to no longer raise an internal error (#71531)

    Build

    • Stopped reporting CPU Capability as AVX512 on machines with AVX512 support but without AVX512 kernels (#66703)
    • Disabled SVE when cross-compiling for M1 (#67114)
    • Added failure if pocketfft is not found and at_mkl is not enabled (#67909)
    • Fixed clang issues when compiling with _GLIBCXX_USE_CXX11_ABI (#72081)

    Complex Numbers

    • Fixed torch.autograd.gradcheck to generate valid inputs for forward AD computation for complex functions (#68001)
    • Fixed torch.Tensor.copy_ transpose path for tensors with conjugate or negative bit set (#69026)
    • Fixed torch.Tensor.copy_ behavior for the case when two conjugated or negated tensors of the same dtype (one or both of which are non-contiguous) are copied into each other (#68963)

    Dataloader

    • Made ProcessException picklable (#70118)
    • Fixed persistent worker exiting before pin_memory_thread (#71579)

    torch.nn

    • nn.AdaptiveAvgPool*d: Throws an error for negative output_size (#70488)
    • nn.Conv1d: Fixed for 1D convolution on MKL-DNN backend (#68166)
    • nn.CrossEntropyLoss: Fixed for usage of weight, ignore_index, and label_smoothing together (#69511)
    • nn.Fold: Checked that block height and width are positive (#69048)
    • nn.LayerNorm: Fixed incorrect result on CUDA when gamma or bias are missing (#69210)
    • nn.LayerNorm: Avoided overflow by doing computation in float for half (#66920)
    • nn.Module: Throws a proper error message from load_state_dict for non-tensor values (#70596)
    • nn.ModuleList: Fixed incorrect return type in __getitem__ (#69083)
    • nn.MultiheadAttention: Used query dtype for mask type (#68077)
    • nn.NLLLoss: Fixed backward computation with negative weights (#64572)
    • nn.{RNN, GRU}: Fixed RNN modules with input shapes containing-0 in CUDA (#71696)
    • nn.utils.rnn.pad_sequence: Fix regression to support tuples for padding (#72436)
    • optim._LrScheduler: Fixed print formatting (#68338)
    • optim.ChainedScheduler: Fixed get_last_lr() (#69112)
    • optim.CosineAnnealingWarmRestarts: Fixed ordering bug when last_epoch > 0 (#64758)
    • optim.SequentialLR: Updated _last_lr on step (#70558)

    torch.fx

    • Supported torch.layout as arg (#66048)
    • Specified a default value when possible for placeholders created from concrete_args (#59569)
    • Fixed issue where GraphModule.delete_all_unused_submodules deletes submodules from called leaf modules (#66430)
    • Fixed torch.fx.subgraph_rewriter.replace_pattern mechanism so that multiple one-liner instances of the pattern are captured correctly (#66442)
    • Fixed bug in graph matcher that caused certain nodes to be matched twice (#69238)
    • Ensured node stack trace survives copying (#69368)
    • Fixed to_folder not saving dtype (#69983)
    • Added a default_value arg to fx.Graph.placeholder and fix split_module (#71016)

    Sparse

    • Fixed CSR storage access to throw when used (#70072)
    • Fixed multiplication of 0-D sparse tensors (#70749)
    • Fixed result dtype for neg if given sparse Tensor (#68885)

    CUDA

    • Fixed CUDA vs CPU consistency for index_put_ when accumulating (#66790)
    • Fixed CUDA vs CPU consistency for index_put_ when accumulating (part 2) (#67189)
    • Fixed error in warning about unsupported GPU (#67900)
    • Disabled TF32 in pinv_jvp and pinv_backward (#67948)
    • Fixed DLPack CUDA stream convention (#67618)
    • Sets device guard in _cudnn_impl functions (#70406)
    • Fixed mem_get_info when querying on a device other than the current device (#69640)

    Benchmark

    • Fixed divide-by-zero errors in torch.utils.benchmark.Timer (#70050)

    Dispatcher

    • Added explicit OperatorHandle destructor, so that the symbol shows up in windows builds (#70033)

    Profiler

    • Fixed race condition in profiler (#65812)
    • Fixed TensorBoard memory profiling (#71417)

    Visualization

    • Fixed torch.utils.tensorboard parsing JIT graph incorrectly (#65692)

    Vulkan

    • Greatly reduced memory usage of the Vulkan backend by updating the configuration of the Vulkan Memory Allocator (#69088)
    • Addressed several warnings raised by the Vulkan Validation layers:
      • Updated all texture resources to have the same dimensionality (#67647)
      • Added image format qualifier to shader files (#69330)
      • Disabled SPIR-V compiler size optimization (#69331)

    Mobile

    • Fixed quantized logistic converter for NNAPI (#70847)
    • Fixed potential crash if MTLCreateSystemDefaultDevice returns nil (#66859)
    • Used full name to look for the promoted prim operator table (#66081)
    • Fixed function name bug in mobile export (#66915)
    • Fixed issues with irange not having a header included in Metal (#66877)
    • Fixed backward compatibility issue for UnionType on mobile in type_parser. (#71341)
    • Fixed forward flatbuffer type handling with dynamic type in flatbuffer_loader. (#71500)
    • Fixed type equalities issue in pytorch_jni_common (#71508)
    • Fixed missing properties to the executor in CoreML (#67737)
    • Fixed memory computation when both constants and data tensors are present in model_dump (#66006)
    • Ensured that function participating in bundled inputs have their “name" attribute set (#65856)

    Distributed

    • torch.distributed
      • Fixed bug on empty GLOO_SOCKET_IFNAME_ENV (#68933)
    • DistributedDataParallel
      • Fixed “Cannot modify in-place due to DDPSink” (#66015)
    • torch.distributed.elastic
      • Fixed scale down bug caused by calling rdzv_handler.shutdown() on premature agent failures (#67749)

    TorchScript

    • Fixed a race condition in the JIT interpreter when unpickling source ranges (5525e9a591)
    • Fixed a ref counting loop for CompilationUnit, resulting in memory leaks when class objects were in JIT graphs. (#65442)
    • Fixed bug where output type was discarded after calling SubgraphRewriter in C++ (#65453)
    • Fixed bug where torch.jit.optimize_for_inference did not torch.jit.freeze a module when passed a a non-frozen module (#71436)
    • Fixed bug where running module.forward() on a torch.jit.freeze ed module ran the wrong graph (#68316)
    • Fixed bug where alias analysis in the JIT optimizer was incorrect for the int[] version of torch.split , resulting in invalid optimizations in various JIT optimization passes (#69745)
    • Fixed places where using torch.autocast together with autodiff (module.backwards()) in a JIT graph had the wrong number of arguments and would error out. (#67648)
    • Forbid propagating gradients through views in JIT graphs as currently it is broken (#67732)
    • Fixed bug where graph input types were incorrect after running torch.jit.trace (#68242)
    • Fixed case where BroadcastMKLDNN breaks the stack invariant by pushing more than 2 tensors to the stack for when torch.jit.freeze ops are converted to MKLDNN(#66628)
    • Raised error instead of segfaulting when passing None into torch.jit.Graph.create (#68253)
    • Raised error instead of crashing when a JIT ScriptFunction is pickled with an incompatible Python pickle version.(#69807)
    • Fixed bug where torch.jit.script fails when comments in function has less indent than surrounding code (#70227)
    • Fixed incorrect device type when torch.device is called inside scripted (torch.jit.script) code (#69645)
    • Fixed warning: overloaded virtual function torch::jit::Function::call is only partially overridden in class torch::jit::GraphFunction (4bf1be898d)

    Quantization

    • Fixed applying non-zero offset 1 to null pointer in torch.nn.functional.interpolate for quantized tensors (#65570)
    • Doesn't assume bias is a keyword argument to torch.nn.Conv{n}d (#61647, #71426)
    • Made error message when trying to use torch.quantize_per_tensor on non floats more specific (#66050)
    • Quantized torch.nn.Embedding conversion with unsupported dtype: make error message clearer (#66051)
    • Fixed torch.nn.qat.EmbeddingBag from_float error message (#66989)
    • Fixed bug enforcing quant_min <= zero_point <= quant_max for float zeropoint in torch.nn.Embedding QAT (#68852)
    • Fixed scale+zp serialization of torch.nn.quantized.BatchNorm{2|3}d (#70432)
    • Fixed torch.nn.Dropout in FX graph mode quantization (#71043, #71438)
    • Fixed qconfig setting for fused modules in FX graph mode quantization (#71254)
    • Removed assumption number of rows is in 32 bit in fbgemm (#69066)
    • Fixed reduce_range warning when using default observers (#71027)

    ONNX

    • Doesn’t create invalid index_select op when constant folding through ONNX Gather with indices rank > 1. Fixes export of some uses of Embedding. (#68493)
    • Shape inference:
      • ConstantMap setters to update existing value instead of emplace, and fix default value of keepdims for Reduce (#67812)
      • Fixed memory leak (#68210)
      • Fixed reshape shape inference regression affecting LSTM (#72532)
    • Fixed inplace fill_ dtype export mismatch (#64580)
    • Fixed remainder (#64578)
    • Fixed reciprocal when input is not floating point (#67808)
    • Fixed new_full and full_like for Python 3.9 (#67806)
    • Fixed reduce ops on binary_cross_entropy_with_logits (#67805)
    • Propagated node metadata across passes (#45256)
    • Ensured outputs don’t have the same name (#66137)
    • Fixed pad with sequence inputs (#64377)
    • Fixed instance_norm with track_running_stats=True (#64375)
    • Fixed all and any with dim arg (#67270)
    • Allows autograd functions (prim::PythonOp) to be exported with OperatorExportTypes.ONNX_FALLTHROUGH (#67273)

    torch.package

    • Prevent import race condition that leaves torch.package.PackagePickler with unwanted dispatch table entries. (#71025)

    Performance

    Python API

    • Speed up pickling for torch.dtype (#65182)
    • Speed up histogram: avoid index_put_ overhead in histogram kernel's inner loop (#67815)
    • Speed up torch.topk with sort for some cases (#68632)
    • Speed up torch.stack: don't unsqueeze every stack arg if possible (#70288)
    • Speed up LayerNorm 4-5% (#71423)
    • Speed up structured kernels: fix some unnecessary refcount bumps (#71140)
    • Speed up indexing functions: release GIL in a few places (#71728)
    • Speed up torch.empty a bit: define check_sizes_nonnegative as inline (#71640)
    • Speed up XLA tensor printing by reducing compilations (#71147)

    C++ API

    • Updated c10::SmallVector from LLVM (#69110)
    • Reduced some framework overhead in at::copy_() (#68950)
    • Reduced some overhead in StorageImpl::set_data_ptr (#65432)
    • Improved IValue performance for tuples by inlining tuple storage (#64066)

    Autograd

    • Stopped materializing Tensors full of 0s in forward AD when possible (#64837)
    • Rewrote the backward of linalg.lu and linalg.lu_solve to use linalg_solve_triangular (#63569)
    • Updated nn.functional.grid_sample backward to compute input gradient only if required (#66069, #66070)
    • Stopped erroneously saving the output of torch.softplus for backward (#70296)

    Complex Numbers

    • Release GIL when assigning to real or imaginary components of a complex tensor (#71747)
    • Restored conjugate and negative bits of a tensor when calling repeat_interleave (#68523)

    CUDA

    • Used a better hash table in CUDACachingAllocator (#71667)
    • TopK CUDA Optimization: used multiple block per slice (#71081)
    • Removed sync in Embedding caused by unique (#66091)
    • EmbeddingBackward exclusive_scan thrust->cub (#66566)
    • sort_out_cuda: Used custom kernels to fill index tensors (#66668)
    • masked_scatter: fuse mask count check into one kernel (#66871)
    • Enabled better depthwise conv perf on cudnn 8.2+ (#58749)
    • Improved native layer_norm forward perf (#67977)
    • Improved native layer_norm backward perf (#68238)
    • Fast path for size 0 GPU host malloc (#68532)
    • Alternative implementation of CUDA pinned memory allocator focusing on multi-threaded scalability (#69299)
    • Used legacy unrolled kernel for non-trivial offset calc cases (#71710)
    • Removed call_once from CUDACachingAllocator (#71668)
    • Reworked stat collection in CUDACachingAllocator (#71669)
    • Fixed CUDA LpNormFunctor (#70601)

    Dispatcher

    • Made c10::KernelFunction struct smaller, which should reduce some memory usage by the dispatcher (#65618)

    torch.fx

    • Made torch.fx.symbolic_trace reuse buffers if they're the same (#66211)

    Profiler

    Mobile

    TorchScript

    • Improved performance of autodiff on small JIT graphs (#71666)
    • Enabled autocasting of tensors between fp16, bfloat 16 and fp32 in torchscript models (#63939, #67707)
    • Enables optimizations in more gradSumToSize cases in the JIT Autograd support(#63941)
    • In Unpickling a JIT graph, avoid reading file from a stream for 0 byte tensor storage(#67787)

    Quantization

    • Sped up quantized torch.nn.functional.interpolate for channels last (#66525)
    • Sped up torch.nn.functional.upsample for channels last (#70903)
    • Parallelized computation in torch.quantize_per_tensor_affine and torch.dequantize (#65845)

    Documentation

    Python API

    • Added docs for torch.adjoint. (#68869)
    • Clarified difference in behavior of empty_strided and as_strided (#64568)
    • Added some missing generated doc entries (torch.select, torch.slice_scatter, torch.diagonal_scatter, torch.select_scatter) (#69030), histogramdd (#68273)
    • Typo and formatting fixes. LinearLR (#67840), torch.any (#65310, #70187), torch.futures (#70630), jit docs (#68557), Tensor.type (#67019), torch.lobpcg (#71464), Tensor.triu(), Tensor.tril(), Tensor.ravel(). (#71057), torch.acosh (#66814), (#70439)
    • General Doc improvements for individual ops. torch.finfo (mention torch.bfloat16) (#68496), torch.quantile interpolation kwarg (#70637), from_dlpack and to_dlpack (#70437), set_printoptions added examples (#68324), index_add (#65806), topk doc (#65938), unique (#66132), chi2 (#67379), torch.histc (#64191), empty and empty_like (#68874), torch.cholesky_inverse (#69069), torch.dsplit (#70557)
    • Changed README getting started link to explicit instructions (#66828)
    • Modernized and clarified docs for torch.tensor and torch.as_tensor (#63308)
    • Improved torchhub docs (#69970)
    • Updated docs for torch.Tensor.real to indicate that it's supported for real tensors (#71962)

    C++ API

    • Fixed typos in ATen README (#69170)
    • Mentioned TORCH_SHOW_CPP_STACKTRACES in Contributing.md docs (#64052)
    • Updated link to C++ frontend examples (#66095)
    • Added docs for Visual Studio extension (#63944)
    • Added docs about an issue with compiling C++ extensions with CUDA 11.5 and Windows (#73013)

    Autograd

    • Updated docs for forward AD and make them public (#71643, #71159)
    • Updated “Extending PyTorch” doc to cover forward AD (#66962)
    • Fixed broken code syntax in autograd.rst (#69362)
    • Fixed incorrect variable in autograd docs (#70884)
    • Fixed typo in torch.autograd.Function docs that prevented it from compiling (#66754)

    Dataloader

    • Added docstring for default_collate and default_convert (#69862)
    • Updated the documentation for AMP with DataParallel (#69218)

    torch.nn

    • F.binary_cross_entropy: Updated examples to avoid deprecated calls (#69816)
    • F.linear: Fixed shape docs to indicate no-batch-dim support (#66884)
    • F.max_pool*d: Added functional docs (#63264)
    • F.multilabel_soft_margin_loss: Added reduction args to signature (#70420)
    • nn.AdaptiveLogSoftmaxWithLoss: Fixed typo in log_prob name (#68926)
    • nn.{BatchNorm1d, InstanceNorm1d}: Fixed input shape notation inconsistencies (#71371)
    • nn.CrossEntropyLoss: Corrected typo in formula for class probability targets (#70220)
    • nn.{ELU, Hardshrink, Hardsigmoid, MultiHeadAttention, Softplus, Tanh}: Made first line of docstring readable for overview docs (#70574, #71012, #70987, #71100, #70576, #70577)
    • nn.Flatten: Simplified example code (#67472)
    • nn.{Hardsigmoid, Hardswish, Mish, RReLU, SiLU}: Added activation function images (#65415)
    • nn.KLDivLoss: Fixed rendering of reduction arg (#66583)
    • nn.KLDivLoss: Rewrote docs to clarify math (#67443)
    • nn.MaxUnpool2d: Changed misleading example to better demonstrate output_size usage (#68936)
    • nn.Module: Added note describing the required super().__init__() call (#66909); see the sketch after this list
    • nn.Module: Changed super() usage to Python 3 syntax in example (#65748)
    • nn.Module: Fixed formatting for named_modules() (#70491)
    • nn.NLLLoss: Corrected default value for reduce (#68426)
    • nn.SmoothL1Loss: Clarified equivalence with nn.L1Loss when beta == 0 (#70673)
    • nn.{TransformerDecoderLayer, TransformerEncoderLayer}: Clarified default batch_first=False dimension format (#66574)
    • nn.Upsample: Indicated that align_corners takes effect in bicubic mode (#66756)
    • nn.utils.clip_grad_norm_: Fixed rendering of parameters in error_if_nonfinite arg docs (#69958)
    • optim.Adam: Fixed formatting (#70387)
    • optim.AdamW: Fixed formula (#68587)
    • optim.RAdam: Corrected default value of lr arg (#69186)
    • Removed orphan from cuDNN persistent note (#65160)
    • Updated link to tutorial on defining NN modules (#65534)
    • nn.{AvgPool1d, AdaptiveAvgPool3d, MultiMarginLoss, PairwiseDistance, TripletMarginLoss}, F.{conv3d, conv_transpose3d, fold, linear}: Fixed doc formatting regressions from no-batch-dim support (#73014)
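
    To illustrate the super().__init__() note mentioned above, a minimal custom module; the network itself is arbitrary.

      import torch
      from torch import nn

      class TinyNet(nn.Module):
          def __init__(self):
              super().__init__()      # must run before assigning submodules or parameters
              self.fc = nn.Linear(4, 2)

          def forward(self, x):
              return self.fc(x)

      print(TinyNet()(torch.randn(1, 4)).shape)  # torch.Size([1, 2])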

    torch.fx

    • Fixed retracing documentation, which would break for n-ary operators (#71599)
    • Updated torch.fx.passes.split_module docstring (#65542)
    • Updated fx.rst example outputs (#68043)
    • Documented a gotcha about the training flag (#68915); see the tracing sketch after this list
    • Defined get_dot_graph to match documentation (#70541)
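
    For context on the tracing docs referenced above, a minimal symbolic_trace sketch (the module is made up). The training-flag gotcha is, roughly, that module state such as self.training is captured at trace time rather than checked at run time.

      import torch
      from torch import nn

      class Block(nn.Module):
          def __init__(self):
              super().__init__()
              self.lin = nn.Linear(4, 4)

          def forward(self, x):
              return torch.relu(self.lin(x))

      traced = torch.fx.symbolic_trace(Block())
      print(traced.graph)  # the captured dataflow graph
      print(traced.code)   # the Python generated from that graph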

    Sparse

    • Updated sparse.rst to warn about _values() (#71088)

    CUDA

    • Updated Stream wait documentation to reference the underlying cudaStreamWaitEvent call (#67973); see the sketch after this list
    • Documented torch.cuda.ExternalStream, torch.cuda.caching_allocator_alloc and torch.cuda.caching_allocator_delete (#70126)
    • Updated CUDA Graphs docs: Fixed make_graphed_callables example typos (#69379)
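
    A small sketch of the Stream.wait_event pattern the updated docs describe, assuming a CUDA device is available:

      import torch

      s = torch.cuda.Stream()
      e = torch.cuda.Event()

      a = torch.randn(1024, 1024, device="cuda")
      b = a @ a          # enqueued on the current (default) stream
      e.record()         # record an event after the matmul

      s.wait_event(e)    # maps to cudaStreamWaitEvent under the hood
      with torch.cuda.stream(s):
          c = b.relu()   # only runs once the matmul has finished

      torch.cuda.synchronize()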

    Mobile

    • Added user-facing documentation for the tracing-based selective build mobile interpreter on Android and iOS (#1709)
    • Added recipe for bundled inputs in TorchScript models (#1524)

    Distributed

    • DistributedDataParallel
      • DDP doc fix (#71363)
      • Clarified how to check memory savings when using gradient_as_bucket_view (#71483); a sketch follows this list
    • torch.distributed
      • Updated distributed.rst to show that CUDA send/recv on GPU is supported (#65601)
      • Clarified checkpoint support (#68827)
      • Updated distributed.rst for ProcessGroup Extensions (#71482)
    • torch.distributed.elastic
      • Made --max_restarts explicit in the quickstart and runner docs (#65838)
    • torch.distributed.optim
      • Rendered torch.distributed.optim members (#67885)
    • torch.distributed.rpc
      • Deleted distributed optimizer section from RPC and added a reference to the namespace docs page (#68068)
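
    A rough sketch of the gradient_as_bucket_view option mentioned above, using a single-process gloo group purely for illustration; real jobs launch multiple ranks, e.g. via torchrun.

      import os
      import torch
      import torch.distributed as dist
      from torch.nn.parallel import DistributedDataParallel as DDP

      os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
      os.environ.setdefault("MASTER_PORT", "29500")
      dist.init_process_group("gloo", rank=0, world_size=1)

      # gradient_as_bucket_view lets gradients alias DDP's communication buckets,
      # which is where the documented memory saving comes from
      model = DDP(torch.nn.Linear(8, 8), gradient_as_bucket_view=True)

      model(torch.randn(2, 8)).sum().backward()
      dist.destroy_process_group()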

    TorchScript

    • Added typing.Union to supported types in the documentation (#68435); see the sketch after this list
    • Added documentation to torch.jit.is_tracing() (#67326)
    • Fixed typos in jit_language_reference.rst (#68706)
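
    A minimal sketch of Union support in TorchScript as now covered in the docs; the function itself is made up for illustration.

      import torch
      from typing import Union

      @torch.jit.script
      def count(x: Union[torch.Tensor, int]) -> int:
          # TorchScript refines the Union via isinstance checks
          if isinstance(x, torch.Tensor):
              return x.numel()
          return x

      print(count(torch.ones(2, 3)), count(5))  # 6 5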

    Quantization

    • Added documentation for quantized model save/load instructions (#69789); a sketch follows this list
    • Updated link to QNNPACK in the quantization docs (#66226)
    • Improved quantization API docs (#66379)
    • Added Numeric Suite (Eager and FX) pages to the quantization docs (#66380)
    • Documented the quantization custom module APIs (#67449)
    • Improved quantization documentation (#68907)
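
    A rough sketch of the save/load options those docs describe, using dynamic quantization to keep the example short; file names are made up.

      import torch
      from torch import nn

      model = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 4))
      quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

      # Option 1: save the state_dict; the quantized model definition must be rebuilt on load
      torch.save(quantized.state_dict(), "quantized_state.pt")

      # Option 2: script the quantized model so it can be loaded without the Python class
      torch.jit.save(torch.jit.script(quantized), "quantized_scripted.pt")
      restored = torch.jit.load("quantized_scripted.pt")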

    ONNX

    • Improved documentation of the operator_export_type and opset_version args (#69549); see the export sketch after this list
    • Fixed documentation for do_constant_folding arg default (#71348)
    • Documented ExportTypes, CheckerError, and unregister_custom_op_symbolic (#68489)
    • Fixed link to ONNX Runtime custom op documentation (#67944)
    • Added section “Discovering all unconvertible ATen ops at once” (#66143)
    • Fixed typos (#66090)
    • Documented work-arounds for indexing export limitations, and improved error messages (#64579)
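
    A minimal export sketch showing the two args whose docs were clarified above; the model, shapes, and file name are arbitrary.

      import torch
      from torch import nn

      model = nn.Sequential(nn.Linear(4, 4), nn.ReLU()).eval()
      dummy = torch.randn(1, 4)

      torch.onnx.export(
          model,
          dummy,
          "model.onnx",
          opset_version=13,
          do_constant_folding=True,  # the default; folds constant subgraphs at export time
      )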

    torch.package

    • Added docs describing how to debug torch.package dependencies (#65704); a sketch follows below
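
    A rough sketch of the packaging flow those debugging docs concern; the extern/intern patterns are usually what needs debugging when dependencies are missing, and the path and patterns below are made up.

      import torch
      from torch import nn
      from torch.package import PackageExporter, PackageImporter

      model = nn.Linear(4, 2)

      with PackageExporter("model_pkg.pt") as exporter:
          # extern: resolve torch from the loading environment instead of packaging it
          exporter.extern("torch.**")
          exporter.save_pickle("model", "model.pkl", model)

      restored = PackageImporter("model_pkg.pt").load_pickle("model", "model.pkl")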

      Download Release

      This release has 3 assets:

      • pytorch-v1.11.0.tar.gz
      • Source code (zip)
      • Source code (tar.gz)

      Visit the release page to download them.


      Have any questions?
      Contact Exxact Today

