PyTorch is a widely used, open source deep learning platform that makes it easy to write neural network layers in Python and enables a seamless workflow from research to production. Based on Torch, PyTorch has become a powerful machine learning framework favored by researchers around the world and is now fully adopted by Facebook.
The newest stable release of PyTorch, version 1.11.0, has a number of new highlights including TorchData, functorch, Distributed Data Parallel (DDP) static graph optimizations, and more!
PyTorch 1.11.0 Release Notes
- Highlights
- Backwards Incompatible Changes
- Deprecations
- New Features
- Improvements
- Performance
- Documentation
Highlights
The new PyTorch 1.11.0 release is composed of over 3,300 commits since 1.10, made by 434 contributors. Along with 1.11, the PyTorch team released beta versions of TorchData and functorch. Here's a quick summary:
- TorchData is a new library for common modular data loading primitives for easily constructing flexible and performant data pipelines. View it on GitHub.
- functorch, a library that adds composable function transforms to PyTorch, is now available in beta. View it on GitHub.
- Distributed Data Parallel (DDP) static graph optimizations available in stable.
You can check the blog post that covers the new features here.
Backwards Incompatible Changes
Python API
Fixed python deepcopy
to correctly copy all attributes on Tensor
objects (#65584)
This change ensures that the deepcopy
operation on Tensor properly copies all the attributes (and not just the plain Tensor properties).
1.10.2:

```python
a = torch.rand(2)
a.foo = 3
torch.save(a, "bar")
b = torch.load("bar")
print(b.foo)
# Raises AttributeError: "Tensor" object has no attribute "foo"
```

1.11.0:

```python
a = torch.rand(2)
a.foo = 3
torch.save(a, "bar")
b = torch.load("bar")
print(b.foo)  # 3
```
steps
argument is no longer optional in torch.linspace
and torch.logspace
This argument used to default to 100 in PyTorch 1.10.2, but was deprecated (previously you would see a deprecation warning if you didn’t explicitly pass in steps
). In PyTorch 1.11, it is no longer optional.
1.10.2:

```python
# Works, but raises a deprecation warning
# steps defaults to 100
a = torch.linspace(1, 10)
# UserWarning: Not providing a value for linspace's steps is deprecated
# and will throw a runtime error in a future release.
# This warning will appear only once per process.
# (Triggered internally at ../aten/src/ATen/native/RangeFactories.cpp:19)
```

1.11.0:

```python
# In 1.11, you must specify steps
a = torch.linspace(1, 10, steps=100)
```
Remove torch.hub.import_module
function that was mistakenly public (#67990)
This function is not intended for public use. If you have existing code that relies on it, you can find an equivalent function at torch.hub._import_module
.
C++ API
We’ve cleaned up many of the headers in the C++ frontend to only include the subset of aten
operators that they actually used (#68247, #68687, #68688, #68714, #68689, #68690, #68697, #68691, #68692, #68693, #69840)
When you #include
a header from the C++ frontend, you can no longer assume that every aten
operators are transitively included. You can work around this by directly adding #include <ATen/ATen.h>
in your file, which will maintain the old behavior of including every aten
operators.
Custom implementation for c10::List
and c10::Dict
move constructors have been removed (#69370)
The semantics have changed from "make the moved-from List/Dict empty" to "keep the moved-from List/Dict unchanged".
1.10.2:

```cpp
c10::List list1({"3", "4"});
c10::List list2(std::move(list1));
std::cout << list1.size(); // 0
```

1.11.0:

```cpp
c10::List list1({"3", "4"});
c10::List list2(std::move(list1)); // calls the copy ctor
std::cout << list1.size(); // 2
```
CUDA
Removed THCeilDiv
function and corresponding THC/THCDeviceUtils.cuh
header (#65472)
As part of cleaning up TH
from the codebase, the THCeilDiv
function has been removed. Instead, please use at::ceil_div
, and include the corresponding ATen/ceil_div.h
header.
Removed THCudaCheck
(#66391)
You can replace it with C10_CUDA_CHECK
, which has been available since at least PyTorch 1.4, so a simple replacement is sufficient even if you support older versions of PyTorch.
Removed THCudaMalloc()
, THCudaFree()
, THCThrustAllocator.cuh
(#65492)
If your extension is using THCThrustAllocator.cuh
, please replace it with ATen/cuda/ThrustAllocator.h
and corresponding APIs (see examples in this PR).
This PR also removes THCudaMalloc/THCudaFree
calls. Please use c10::cuda::CUDACachingAllocator::raw_alloc(size)/raw_delete(ptr)
, or, preferably, switch to c10::cuda::CUDACachingAllocator::allocate
which manages deallocation. Caching allocator APIs are available since PyTorch 1.2, so just replacing it is enough even if you support older versions of PyTorch.
Build
Stopped building shared library for AOT Compiler, libaot_compiler.so
(#66227)
Building aot_compiler.cpp
as a separate library is not necessary, as it’s already included in libtorch.so
.
You can update your build system to only dynamically link libtorch.so
.
Mobile
Make typing.Union
type unsupported for mobile builds (#65556)
typing.Union
support was added for TorchScript in 1.10. It was removed specifically for mobile due to its lack of use and increase in binary size of PyTorch for Mobile builds.
Distributed
torch.distributed.rpc
: Final Removal of ProcessGroup RPC backend (#67363)
ProcessGroup RPC backend is deprecated. In 1.10, it threw an error to help users update their code, and, in 1.11, it is removed completely.
The backend type "PROCESS_GROUP" is now deprecated, e.g.:

```python
torch.distributed.rpc.init_rpc("worker0", backend="PROCESS_GROUP", rank=0, world_size=1)
```

and should be replaced with:

```python
torch.distributed.rpc.init_rpc("worker0", backend="TENSORPIPE", rank=0, world_size=1)
```
Quantization
Disabled the support for getitem
in FX Graph Mode Quantization (#66647)
getitem
used to be quantized in FX Graph Mode Quantization
, and it is no longer quantized. This won’t break any models but could result in a slight difference in numerics.
1.10.2:

```python
from torch.ao.quantization.quantize_fx import convert_fx, prepare_fx

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(5, 5)

    def forward(self, x):
        x = self.linear(x)
        y = torch.stack([x], 0)
        return y[0]

m = M().eval()
m = prepare_fx(m, {"": torch.ao.quantization.default_qconfig})
m = convert_fx(m)
print(m)
# prints
# GraphModule(
#   (linear): QuantizedLinear(in_features=5, out_features=5,
#     scale=1.0, zero_point=0, qscheme=torch.per_tensor_affine)
# )
# def forward(self, x):
#     linear_input_scale_0 = self.linear_input_scale_0
#     linear_input_zero_point_0 = self.linear_input_zero_point_0
#     quantize_per_tensor = torch.quantize_per_tensor(x,
#         linear_input_scale_0, linear_input_zero_point_0, torch.quint8)
#     x = linear_input_scale_0 = linear_input_zero_point_0 = None
#     linear = self.linear(quantize_per_tensor)
#     quantize_per_tensor = None
#     stack = torch.stack([linear], 0); linear = None
#     getitem = stack[0]; stack = None
#     dequantize_2 = getitem.dequantize(); getitem = None
#     return getitem
```

1.11.0:

```python
from torch.ao.quantization.quantize_fx import convert_fx, prepare_fx

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(5, 5)

    def forward(self, x):
        x = self.linear(x)
        y = torch.stack([x], 0)
        return y[0]

m = M().eval()
m = prepare_fx(m, {"": torch.ao.quantization.default_qconfig})
m = convert_fx(m)
print(m)
# prints
# GraphModule(
#   (linear): QuantizedLinear(in_features=5, out_features=5, scale=1.0, zero_point=0, qscheme=torch.per_tensor_affine)
# )
# def forward(self, x):
#     linear_input_scale_0 = self.linear_input_scale_0
#     linear_input_zero_point_0 = self.linear_input_zero_point_0
#     quantize_per_tensor = torch.quantize_per_tensor(x, linear_input_scale_0, linear_input_zero_point_0, torch.quint8)
#     x = linear_input_scale_0 = linear_input_zero_point_0 = None
#     linear = self.linear(quantize_per_tensor); quantize_per_tensor = None
#     stack = torch.stack([linear], 0); linear = None
#     dequantize_2 = stack.dequantize(); stack = None
#     getitem = dequantize_2[0]; dequantize_2 = None
#     return getitem
```
Users should now use fuse_modules
for PTQ fusion and fuse_modules_qat
for QAT fusion (#69878, #71956)
There are two types of fusion supported by the fuse_modules API: PTQ and QAT fusion. Previously we relied on module.training to decide which mode the user wanted, but this was a misuse of the training attribute, since that is not its intended purpose. This PR removes the dependency on module.training and uses separate APIs to make the fusion requested by the user explicit.
Previously, fuse_modules supported both cases and distinguished PTQ/QAT fusion based on module.training; now fuse_modules only performs PTQ fusion. Users who want QAT fusion must call fuse_modules_qat instead of fuse_modules; otherwise they will silently get unwanted (PTQ) fusion results, or, if the model is in training mode, an error.
Note: Currently it is still enforced that if the model is in eval mode, only PTQ fusion can be used; if the model is in training mode, then only QAT fusion can be used. In the future this constraint will be relaxed.
1.10.2:

```python
import torch
from torch.ao.quantization import fuse_modules

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(3, 3, 3)
        self.bn = torch.nn.BatchNorm2d(3)

    def forward(self, x):
        return self.bn(self.conv(x))

m = M().train()
m = fuse_modules(m, ["conv", "bn"])
print(type(m.conv))

m = M().eval()
m = fuse_modules(m, ["conv", "bn"])
print(type(m.conv))

# <class 'torch.nn.intrinsic.modules.fused.ConvBn2d'>
# <class 'torch.nn.modules.conv.Conv2d'>
```

1.11.0:

```python
import torch
from torch.ao.quantization import fuse_modules, fuse_modules_qat

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(3, 3, 3)
        self.bn = torch.nn.BatchNorm2d(3)

    def forward(self, x):
        return self.bn(self.conv(x))

m = M().train()
# For Quantization Aware Training, use fuse_modules_qat()
m = fuse_modules_qat(m, ["conv", "bn"])
print(type(m.conv))

m = M().eval()
m = fuse_modules(m, ["conv", "bn"])
print(type(m.conv))

# Result (doesn't change):
# <class 'torch.nn.intrinsic.modules.fused.ConvBn2d'>
# <class 'torch.nn.modules.conv.Conv2d'>
```
ONNX
Removed f
arg from onnx.export_to_pretty_string
(#69546)
The arg has always been ignored. Simply remove it from your code.
1.10.2:

```python
torch.onnx.export_to_pretty_string(model, inputs, "file_name")
```

1.11.0:

```python
torch.onnx.export_to_pretty_string(model, inputs)
```
Removed use_external_data_format
arg from onnx.export
(#67809)
The arg has been deprecated and ignored since 1.10. The external data format is now used automatically if and only if the exported file would exceed protocol buffer’s file size limit. Simply remove it from your code.
1.10.2:

```python
torch.onnx.export(model, inputs, f_name, use_external_data_format=True)
```

1.11.0:

```python
torch.onnx.export(model, inputs, f_name)
```
Removed example_outputs
arg from torch.onnx.export
(#67809)
The arg has been deprecated and ignored since 1.10. The provided model is instead executed once to produce example outputs. Simply remove it from your code.
1.10.2:

```python
torch.onnx.export(model, inputs, f_name, example_outputs=(foo,))
```

1.11.0:

```python
torch.onnx.export(model, inputs, f_name)
```
Removed enable_onnx_checker
arg from onnx.export
(#67276)
The arg has been deprecated and ignored since 1.10. The ONNX checker is always enabled. If it fails, onnx.CheckerError
will be raised. Users can catch and ignore that exception.
1.10.2:

```python
torch.onnx.export(model, inputs, f_name, enable_onnx_checker=False)
```

1.11.0:

```python
try:
    torch.onnx.export(model, inputs, f_name)
except torch.onnx.CheckerError:
    pass  # ignore error
```
Moved and renamed onnx.utils.ONNXCheckerError
to onnx.CheckerError
(#66644)
Previously the documentation was incorrect and stated that ONNXCheckerError was in the torch.onnx module, so this change moves the class to the originally intended module and brings the code in line with the documentation. The new name is shorter and less redundant with the module name.
1.10.2:

```python
except torch.onnx.utils.ONNXCheckerError:
```

1.11.0:

```python
except torch.onnx.CheckerError:
```
Removed _retain_param_name
arg from onnx.export
(#67276)
The arg has been deprecated and ignored since 1.10. Param names are now always retained. Simply remove it from your code. If you want to remove param names, you can do so by editing the exported ONNX model.
1.10.2:

```python
# NOTE: No way to get same behavior as _retain_param_name=False.
torch.onnx.export(model, inputs, f_name, _retain_param_name=True)
```

1.11.0:

```python
torch.onnx.export(model, inputs, f_name)
```
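If you do need parameter names removed, one possible approach (a sketch, not part of the torch.onnx API; the file names and the "param_{i}" naming scheme below are arbitrary) is to rewrite the initializer names with the onnx Python package after export:

```python
import onnx

# Sketch: anonymize parameter names in an already-exported model.
model = onnx.load("model.onnx")
rename = {}
for i, init in enumerate(model.graph.initializer):
    new_name = f"param_{i}"
    rename[init.name] = new_name
    init.name = new_name

# Update every node input (and any graph inputs) that referenced the old names.
for node in model.graph.node:
    for j, name in enumerate(node.input):
        if name in rename:
            node.input[j] = rename[name]
for graph_input in model.graph.input:
    if graph_input.name in rename:
        graph_input.name = rename[graph_input.name]

onnx.save(model, "model_anon.onnx")
```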
Deprecations
Python API
Deprecated x.T
on tensors of dimension other than 0 or 2 (#64180)
x.T is intended only for tensors with 0 or 2 dimensions. Calling x.T on tensors with a different number of dimensions has been deprecated and will raise an error in a future release.
1.10.2:

```python
a = torch.ones(2, 3, 4)
a.T.size()  # torch.Size([4, 3, 2])
```

1.11.0:

```python
a = torch.ones(2, 3, 4)
a.T.size()
# UserWarning: The use of `x.T` on tensors of dimension other than 2
# to reverse their shape is deprecated and it will throw an error in a future release.
# Consider `x.mT` to transpose batches of matrices or
# `x.permute(*torch.arange(x.ndim - 1, -1, -1))` to reverse the dimensions of a tensor.
# (Triggered internally at aten/src/ATen/native/TensorShape.cpp:2386.)
# torch.Size([4, 3, 2])
```
Quantization
torch.ao.quantization.QConfigDynamic
is deprecated and will be removed in the next release; please use torch.ao.quantization.QConfig
instead (#69875, #69864)
1.10.2:

```python
qconfig = torch.ao.quantization.QConfigDynamic(...)
```

1.11.0:

```python
qconfig = torch.ao.quantization.QConfig(...)
```
New features
Python API
- Added set_deterministic_debug_mode and get_deterministic_debug_mode (#67778, #66233)
- Added n-dimensional Hermitian FFT: torch.fft.ifftn and torch.fft.hfftn (#63890)
- Added Wishart distribution to torch.distributions (#70377); a short usage sketch follows this list
- Preliminary support for the Python Array API standard has been added to the torch and torch.linalg modules. PyTorch implements over 90% of the operators defined by the Python Array API, including the torch.from_dlpack operation for improved DLPack support (#60627)
- Moved torch.testing from prototype to beta (#69668)
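As a quick illustration of two of the items above (the deterministic debug mode and the new Wishart distribution), here is a minimal sketch; the particular values are arbitrary:

```python
import torch
from torch.distributions import Wishart

# Ask PyTorch to warn (rather than error) when an op without a
# deterministic implementation is used.
torch.set_deterministic_debug_mode("warn")
print(torch.get_deterministic_debug_mode())  # 1

# Sample a positive-definite matrix from the new Wishart distribution.
w = Wishart(df=torch.tensor(5.0), covariance_matrix=torch.eye(3))
sample = w.sample()
print(sample.shape)        # torch.Size([3, 3])
print(w.log_prob(sample))  # scalar log-density
```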
Autograd
- Added new
torch.utils.checkpoint
implementation that does not use reentrant autograd (can be toggled with the newuse_reentrant
flag) (#69508) - Added
batched_grad
parameter toautograd.grad
to allow batched gradient computation (#65564) - Forward mode AD:
- Added support for most ops (and many of their backwards as well) (#71026, #69956, #70355, #71901, #69908, #69884, #67837, #68566, #69661, #69384, #68631, #70468, #70460, #67820, #70460, #65546, #67043, #67268, #67837, #69727)
- Check the following issue (#71117) to see the list of ops that do not yet support forward AD. Please comment there if you run into any ops that don’t support forward AD that you want prioritized or are missing from that list.
- Added
ctx.save_for_forward
 function to autograd.Function
 (#71569)
- autograd.forward_ad.unpack_dual returns a named tuple instead of a plain tuple (#68062, #68628); see the forward-mode AD sketch after this list
- Linear algebra operation support:
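A minimal sketch of the forward-mode AD workflow referenced above, computing a Jacobian-vector product with torch.autograd.forward_ad (the function torch.sin is just an arbitrary example):

```python
import torch
import torch.autograd.forward_ad as fwAD

primal = torch.randn(3)
tangent = torch.randn(3)  # direction of the directional derivative

with fwAD.dual_level():
    dual_input = fwAD.make_dual(primal, tangent)
    dual_output = torch.sin(dual_input)
    # unpack_dual now returns a named tuple with .primal and .tangent fields
    result = fwAD.unpack_dual(dual_output)
    print(result.primal)   # sin(primal)
    print(result.tangent)  # cos(primal) * tangent, i.e. the JVP
```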
Build
- Added FlexiBLAS build support (#64815)
- Added
IS_LINUX
andIS_MACOS
global vars for cpp extensions building (#69093) - Added ARC for iOS CMake builds (#67884)
- Added support for IBM z14/15 SIMD (#66407)
Complex Numbers
Dataloader
- The TorchData library provides modular data loading primitives for easily constructing flexible and performant data pipelines. A beta release will be provided after the release of PyTorch Core (https://github.com/pytorch/data)
LinAlg
- Added an experimental flag that allows specifying a preferred linear algebra library (see the docs here) (#67980)
- Added the
linalg.matrix_exp
operation (see the docs here) (#62715) - Added the
linalg.cross
operation (see the docs here) (#63285) - Added the
linalg.diagonal
operation, an alias for torch.diagonal (see the docs here) (#70599) - Added the
linalg.lu_factor
 operation (see the docs here) (#66933). A short usage sketch of these new functions follows this list.
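Here is a minimal sketch of a few of the new linalg entry points listed above; the shapes and values are arbitrary:

```python
import torch

A = torch.randn(3, 3)

# Matrix exponential
expA = torch.linalg.matrix_exp(A)

# Batched cross product along the last dimension
u, v = torch.randn(4, 3), torch.randn(4, 3)
c = torch.linalg.cross(u, v, dim=-1)

# LU factorization; the (LU, pivots) pair can be reused to solve systems
LU, pivots = torch.linalg.lu_factor(A)
b = torch.randn(3, 2)
x = torch.lu_solve(b, LU, pivots)  # solves A @ x = b using the factorization
```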
torch.nn
- Added
torch.nn.utils.rnn.{unpack_sequence,unpad_sequence}
functions (#66550)
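For example, unpad_sequence inverts pad_sequence given the original lengths (a minimal sketch with arbitrary shapes):

```python
import torch
from torch.nn.utils.rnn import pad_sequence, unpad_sequence

seqs = [torch.randn(5, 8), torch.randn(3, 8), torch.randn(2, 8)]
lengths = torch.tensor([s.size(0) for s in seqs])

padded = pad_sequence(seqs)                 # shape (5, 3, 8), zero-padded
restored = unpad_sequence(padded, lengths)  # list of the original tensors
assert all(torch.equal(a, b) for a, b in zip(seqs, restored))
```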
Sparse
- Added
torch.sparse.sampled_addmm
for CSR Tensors on GPU (#68007)
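A minimal sketch of sampled_addmm (it requires a CUDA device, since the op is implemented for CSR tensors on GPU in this release; the shapes and values are arbitrary):

```python
import torch

if torch.cuda.is_available():
    # The CSR input supplies both the bias values and the sparsity pattern;
    # the dense product mat1 @ mat2 is only evaluated at those positions.
    crow_indices = torch.tensor([0, 1, 2], device="cuda")
    col_indices = torch.tensor([0, 1], device="cuda")
    values = torch.tensor([1.0, 2.0], device="cuda")
    sparse_bias = torch.sparse_csr_tensor(crow_indices, col_indices, values, size=(2, 2))

    mat1 = torch.randn(2, 3, device="cuda")
    mat2 = torch.randn(3, 2, device="cuda")
    out = torch.sparse.sampled_addmm(sparse_bias, mat1, mat2)  # CSR result
```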
CUDA
- The Jiterator - enables compiling rarely used CUDA kernels at runtime (#69439)
- Low precision supported for jiterator (#70157) - enables runtime-compilation of ops on low precision tensors (half and bfloat16)
- Enable cpu scalar arguments for jiterator (#69861) - enables passing cpu scalars as an argument to the jit-compiled kernels at runtime
- The Cacherator (#71350) - caches the jit-compiled kernels on disk, so that they can be reused between different processes
- Added complex support for Jiterator, port sinc to Jiterator (#71577)
- Jiterates
lcm
,i0e
,i1e
,ndtri
 ,erfcx
,digamma
,trigamma
,lgamma
(#70663) - Jiterates
exp2
,erfc
,erfinv
andentr
(#71295) - Fixes jiterator cache macro include + updates CUDA note with cache variables (#71452)
- Jiterates
polygamma
(#71162)
- Added cuSPARSE descriptors and updated CSR addmm (#60838)
- Sparse CSR CUDA: added
addmv_out
(#61407) - Added nvidia-smi memory and utilization as native Python API (#69104)
Vulkan
- Added Vulkan support for several torch operators:
- Added the
vulkan_perf_test
benchmark binary to benchmark Vulkan ops under various input conditions. (#67230)
Mobile
- Tracing Based Selective Build (PyTorch Mobile Build Size Reduction) is a new feature that reduces a mobile model’s binary size by only including the operators that the model uses.
- Build tracer for tracing based workflow (#66267)
- Used operator.yaml to build LibTorch library (#66237)
- Unified tracer between internal and external (#64152)
- Reorganized model tracer dependency (#63421)
- Added support for the
bool
andint
dtypes in the copy kernel by default when using Tracing Based Selective Build (#69106, #69297) - Generic build features for selective build (#67817)
- Made more classes selective (#67397)
- Added custom classes to selective build and compatibility APIs (#67004, #66972, #67340)
Distributed
FullyShardedDataParallel
- FSDP is a type of data-parallel training, but unlike traditional data-parallel training it shards the model's parameters, gradients, and optimizer states across data-parallel workers and can optionally offload the sharded model parameters to the CPU. This new API helps users scale large model training with minimal code changes when switching from DDP to FSDP. (#63881, #64964, #66578, #66904, #66956, #66957, #67117, #67292, #67249, #67135, #67813, #68308, #68155, #68417, #68776, #69356, #69357, #69358, #70340, #71803, #71804, #70341, #70235, #72084)
DistributedDataParallel
TorchScript
- Enabled running
torch.jit.freeze()
andtorch.jit.optimize_for_inference
on functions that are not forward (#68668, #69367) - Enabled
torch.jit.freeze
to work on for sparse COO tensors (#69614) - Enabled
torch.jit.script()
,torch.jit.freeze()
and serialization for tensors in Compressed Sparse Row (CSR) format (#69555) - Allowed users to set the fusion strategy for
torch.jit.fuser
through the now publictorch.jit.set_fusion_strategy
. (#72937) - Enabled Dynamic Shape Fusion For GPU & CPU, configurable via
torch.jit.set_fusion_strategy
 (#72036); a brief example follows this list
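A minimal sketch of configuring the fusion strategy (the depths below are arbitrary; they control how many shape specializations are compiled before falling back):

```python
import torch

# Try up to 2 static-shape specializations, then up to 2 dynamic-shape ones.
torch.jit.set_fusion_strategy([("STATIC", 2), ("DYNAMIC", 2)])

@torch.jit.script
def fused(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    return torch.relu(x) * y + y
```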
Quantization
- Added bilinear quantized implementation of
torch.nn.functional.grid_sample
2d operator (#66879) - Added the
torch.quantize_per_tensor_dynamic
operator (#68004) - Added Quantization Aware Training support for
torch.nn.Embedding
andtorch.nn.EmbeddingBag
- Added basic EmbeddingBag QAT fakeQuant workflow (#65443)
- Added support for quantization of Embedding{Bag} in dynamic quant APIs (#65674)
- Eager mode QAT for Embeddings (#66429)
- Add benchmarks for QAT Embedding+EmbeddingBag (#66560)
- Supported Embedding QAT via FX API (#69333)
- Add FX support for QAT EmbeddingBag (#69334)
- Added support for depthwise quantized
torch.nn.Conv3d
in qnnpack, for use in quantization- Depthwise Conv3d Indirection Buffer Setup (#69311)
- Depthwise Conv3d Weight Packing (#69312)
- Depthwise Conv3d mp8x27 (per channel) Neon Kernel (#69313)
- Depthwise Conv3d mp8x27 (per-channel) Sse2 Kernel (#69314)
- Tightened Step Height for Indirection Buffers (#70530)
- Enabled Depthwise Specific Conv3d Kernel for Kernel Size 3x3x3 (#69315)
- Implemented 3d convolution in qnnpack (#66350)
ONNX
- Supports opset version 15 (#67805)
- Supports exporting
nn.Module
calls as ONNX local functions (#66140, #67803) - Supports for exporting new ops:
- Added BFloat16 type support (#66788)
- Supports exporting with Apex O2 (#66700)
Infra (Releng)
- Added support for ROCm 4.3.1 (#65624)
- Added support for ROCm 4.5.2 (#71064)
- Added support for CUDA 11.5 (#69262)
- Added support for CUDA enabled Bazel builds (#66241)
- Added support for Python 3.10 (#71132, #71419)
Improvements
Python API
- NumPy compatibility:
- Improved
torch.Tensor.view(dtype)
: enable all dtype combinations (#66493) - Improved
torch.diff
by adding support for n greater than 1 (#67260) - Improved
torch.movedim
to handle scalar as no-op (#69537) - Improved
cartesian_prod
: fixed a warning in the docs example (#68753) - Improved error messages for
max_unpool{}d
operators (#67328) torch.distributions
- Implemented positive-semidefinite constraint in
torch.distributions
(#71375) - Implemented Entropy methods for Binomial and Multinomial distributions (#67609)
- Implemented support for
non-negative
constraint in exponential distribution (allowing it to include zero). (#67184) - Implemented
kl divergence
betweennormal
andlaplace
distribution. (#68807)
- Implemented positive-semidefinite constraint in
- Improved meta tensor support for operators:
- Added support for
torch.Tensor.real
for real-valued tensors (#71718) torch.logaddexp, torch.logaddexp2, torch.remainder
: added BFloat16 support on CPU (#63621)torch.bucketize
andsearchsorted
: added Half precision support (#67077)- Added new
torch.slice_scatter
,torch.select_scatter
,torch.diagonal_scatter
 ops (#64430) (a short example follows this list) - Made
torch.scatter_reduce
a public API (#68580, #73125)
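A minimal sketch of the new out-of-place *_scatter ops mentioned above; they behave like slice/select/diagonal assignment but return a new tensor, which is handy for functional transforms (values are arbitrary):

```python
import torch

base = torch.zeros(3, 4)

# Like base[1] = row, but returns a new tensor
row = torch.ones(4)
a = torch.select_scatter(base, row, dim=0, index=1)

# Like base[:, 1:3] = block, but returns a new tensor
block = torch.full((3, 2), 2.0)
b = torch.slice_scatter(base, block, dim=1, start=1, end=3)

# Write values onto the main diagonal
c = torch.diagonal_scatter(base, torch.arange(3, dtype=torch.float32))
```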
C++ API
- Added C++ API and docs for
hfftn
(#66127) - Added support for
MaybeOwned<IValue>
(#68157) - Added
set_to_none
option forzero_grad()
to C++ API (#68801) - Added an environment variable,
TORCH_CPP_LOG_LEVEL
, that you can use to toggle the log level in the c10 library (#71746)
Autograd
- Added nesting support for
torch.autograd.graph.saved_tensor_hooks
(#70932) - Delayed all warnings encountered during the backward pass until the end of backward execution (#66235)
- Added complex autograd support to
torch.{col2im,im2col}
(#68199) - Added new reduce options and autograd support for
torch.scatter_reduce
(#71788) - Added derivatives wrt the second argument for
torch.{remainder,fmod}
(#69908) - Added new
strategy
flag toautograd.functional.{Jacobian, Hessian}
to enable vectorized computation (#67041, #66292) - Added
check_backward_ad
flag totorch.autograd.gradcheck
to be able to skip backward mode AD checks (#65040) - Relaxed forward AD layout check to allow primal and tangent stride to differ when their size is 1 (#66294)
Build
- Improved incremental build times of PyTorch core by removing a dependency on
native_functions.yaml
in many core files (#64499, #66914, #64172, #64171, #66620, #66793, #66913, #66794, #64169, #64173, #64170, #67735) - Enabled bazel build without glog and gflags (#70850)
- Added support for C++ frontend wrapper on Linux (#69094)
- Added support for dynamic codegen outputs in CMake (#68246)
- Max CMake version is now used by default with setup.py (#69355)
- Upgraded oneDNN to v2.3.3 and package oneDNN Graph API together (#63748)
- Code base should now be
-Wno-unused-variable
compliant (#66041) - Added lazy import for
packaging
intorch_version
(#71345)
Dataloader
- Support custom
Sequence
andMapping
forutils.data.default_collate
(#68779) - Allowed specifying
num_samples
toRandomSampler
whenreplacement
isFalse
(#71568) - Fixed the warning of shape inconsistency
utils.data.default_collate
(#71065)
ForEach
- Implemented
ForEach
L1 & L2 norm (#62646)
LinAlg
- The
linalg.matrix_rank
(docs) andlinalg.pinv
(docs) operations now support specifying absolute and relative tolerances for better handling of singular values (#63102)
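A minimal sketch of the new tolerance arguments (assuming the atol/rtol keyword names documented for these functions; the values here are arbitrary):

```python
import torch

A = torch.randn(5, 3) @ torch.randn(3, 5)  # at most rank 3

# Singular values below max(atol, rtol * largest_singular_value) are treated as zero.
print(torch.linalg.matrix_rank(A, atol=1e-6, rtol=1e-6))
A_pinv = torch.linalg.pinv(A, atol=1e-6, rtol=1e-6)
```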
torch.nn
- Added
channels_last
support forChannelShuffle
(#50247) - Added no-batch-dim support for
nn.{AdaptiveLogSoftmaxWithLoss, Bilinear, Conv*d, ConvTranspose*d, CrossEntropyLoss, CTCLoss, Fold, FractionalMaxPool3d, GaussianNLLLoss, GRU, GRUCell, InstanceNorm*d, LSTM, LSTMCell, MarginRankingLoss, MultiheadAttention, MultiLabelSoftMarginLoss, RNN, RNNCell, Transformer, TransformerDecoderLayer, TransformerEncoderLayer}
(#69054, #69539, #70506, #71055, #70092, #64909, #69732, #69783, #70236, #65323, #71056, #64975, #67176, #70590, #65690, #70977, #70597, #70322, #69291) - Added
BFloat16
support on CPU tonn.{AdaptiveAvgPool2d, AdaptiveMaxPool2d, AvgPool2d, MaxPool2d}
(#56902, #66929, #66927, #56903) - Added
maximize
support tooptim.{Adam, AdamW, SGD}
(#68164, #70146, #67847, #68733, #71023) F.interpolate
: Addnearest-exact
mode to fix off-by-one error innearest
mode (#64501)F.interpolate
 : Added support for anti-aliasing to bilinear and bicubic algorithms (#70930, #68819, #65142, #69318); see the interpolate example after this list. F.interpolate
: Improved error message for invalid shapes (#66417)nn.Conv*d
: Accepts 0-sized channel inputs (#66256)nn.LogSigmoid
: Usedlog1p
for improved precision (#66441)nn.Module
: Added flag for removing duplicates from parameters (#71542)nn.Module
: Addedregister_module
alias for registering a sub-module (#65174)nn.ModuleList
: Supported concatenation (#70887)nn.MultiheadAttention
: Added flag to optionally average output attention weights across heads (#70055)nn.ParameterDict
: Supported full set ofdict
methods (#69403)nn.{RNN, GRU}
: Allowedhidden_size
to be 0 (#70556)nn.Sequential
: Addedappend
method (#71326)nn.Upsample
: Exposedrecompute_scale_factor
(#66419)nn.ZeroPad2d
: Addedextra_repr
for printing purposes (#69206)optim.{ChainedScheduler, SequentialLR}
: Addedoptimizer
attribute (#67406, #69817)optim.swa_utils.AveragedModel
: Addeduse_buffers
flag for averaging buffers in addition to parameters (#65921, #71763)
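A minimal sketch of the F.interpolate additions referenced in the list above (input shape and sizes are arbitrary):

```python
import torch
import torch.nn.functional as F

img = torch.rand(1, 3, 64, 64)

# "nearest-exact" fixes the off-by-one pixel-center convention of "nearest"
small = F.interpolate(img, size=(32, 32), mode="nearest-exact")

# Anti-aliased downscaling for bilinear (also supported for bicubic)
small_aa = F.interpolate(img, size=(32, 32), mode="bilinear",
                         align_corners=False, antialias=True)
```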
torch.fx
- Improved the customizability of
fx.Graph
’s code generation function, including support for setting a breakpoint in the generated code (#67139) - Supported printing inplace operators in FX (#71887)
Sparse
- Add CSR support for several operators:
torch.triangular_solve
,torch.addmv
,torch.addmm
,torch.add
for all arguments on CPU (#62180, #61536, #65606, #64391)torch.triangular_solve
,torch.addmv
,torch.addmm
,torch.add
for all arguments on GPU (#61407, #61858, #63511, #63948)- zero-preserving unary functions (#68123, #69292)
torch.empty
,torch.resize_
,torch.copy_
,torch.randn_like
,torch.clone
(#63509, #63510, #68083, #70581)transpose
(#70582)
- Added torch.sparse_coo Layout support to
zeros_like
(#68108) - Added Half, BFloat16, and Complex dtype support for matrix multiplication of two COO Tensors on GPU (#59980)
- Added support for conversion of CSR to COO Tensor to
to_sparse
(#66774) - Added support for empty COO Tensors to sparse.sum (#71091)
AMD
- Added sparse mappings for CUDA->HIP translation (#67323)
- Enabled frexp support for ROCm builds (#67226)
- Used hipCUB/rocPRIM scan algorithms for large index support (#68487)
CUDA
- Allows external CUDA streams to be set as current (#66324)
- Added an option to disable reduced precision reductions for FP16 GEMM (#67946)
- Improved CUDA memory usage of
nanmedian
result (#68591) - Reduced number of
igamma
kernel instantiations (#70666) - Reduced number of
compare
kernels by unifying them (#69111) - Reduced number of
bernoulli
tensor tensor kernel instantiations (#70169) - Used
cub::FutureValue
to simplify 64bit indexing split of cub scan (#66711) - Added
hascuSOLVER
flag to Context (#69825) - Improved error message from
CUDACachingAllocator
(#69174) - Fixed
masked_softmax
perf for element_size is not 8 (#70271) - Reduced binary size of
TensorCompare.cu
(#68835) - Improved error message for
interpolation
(#72066) - Doesn't compile
pow
kernels for non-existent case (#70017)
Profiler
- Added flop count formulas for
bmm
andbaddbmm
(#66636)
Vulkan
- Allowed Vulkan models to return multiple outputs by improving Vulkan tensor lifecycle management to release GPU resources when the tensor is destroyed, instead of being released at the end of every inference (#66477, #66478)
- Enabled multiple Vulkan models to execute concurrently in parallel threads, by moving components of the Vulkan global context into thread local objects (#67733, #69576)
Mobile
- Introduced multiple improvements for
NNAPI
- Added converters for torchscript ops
quantized::mul
andquantized::convtranspose2d
to converter (torch.backends._nnapi.prepare.convert_model_to_nnapi
) (#63913, #63914) - Supported
int32
andqint16
type in Torchscript expressions (#70197, #70621) - Supported runtime flexible shapes and return shapes (#70334)
- Added converters for torchscript ops
- Improved Model Tracer Coverage and Selective Metal Ops (#68134, #69492, #69328)
- Introduced multiple improvements for
CoreML
- Type Support in Mobile Lite Interpreter
Distributed
torch.distributed
- Improvements to error handling in
TCPStore’
s socket implementation (#68225) - Enabled
ncclAvg
for reductions (#62835) - Init dummy
NCCL
comms in constructor (#65173, #66393) - Added pybind trampoline for
ProcessGroup
andWork
(#66338) - Setup
c10d
extension Backend class attr the same way as builtin ones (#66991) - Added barrier to
ProcessGroup
trampoline (#67236) - Raised warning when calling collectives on non-member group objects (#67639)
- Patched
bfloat16
support for NCCL (#67843) - Fixed
c10d
TCP store race condition with mutex (#68499) - Surfaced
ncclUniqueId
store broadcast error (#68597) - Checks for file existence before invoking cleanup logic in
FileStore
destructor (#68603) - Implemented gather primitive for
ProcessGroupNCCL
(#66745) - Implemented scatter primitive for
ProcessGroupNCCL
(#70029) - Enabled
gather_object
onNCCL
(#71623) - Implemented
allreduce_coalesced
forProcessGroupNCCL
(#62140) - Set non-default backend names to lower case (#69400)
- Added support for
deleteKey
forFileStore
(#69953) - Fixed
TSAN
issue inTCPStore
(#69590)
- Improvements to error handling in
DistributedDataParallel
torch.distributed.rpc
torch.distributed.autograd
- Made Kineto + distributed a warning rather than an error (#71120)
torch.distributed.elastic
- Added ability to override sys.executable for
torch.distributed.run
(#66179)
- Added ability to override sys.executable for
TorchScript
- Several improvements to NVFuser, which is an optimization that speeds up all JIT graphs with a CUDA Tensors on Nvidia GPUs. This includes extending fusing support to normalization and reduction kernels, enabling multiple kernel launch for single
CudaFusionGroup
, and addition of a graph segmentation cache to the hierarchical caching system. (#63745, #65137, #63745, #65137) - Enabled
profile_ivalue
to convert dynamic scalar into compile time constants in NVFuser. (e.g. reduction axes). (#63745, #65137) - Added support in
torch.jit.trace
for tracing already JITted subgraphs(#59949) - We now provide full types on graph inputs when tracing graphs that are already JITted(#67424)
torch.jit.freeze
now can preserve attributes of submodules - previously, it was only possible to prevent inlining of attributes of the top level module.(#66102)- The peephole optimizer, which is used in
torch.jit.freeze
now coalesces consecutive calls totorch.concat
into a single call (#67000) - Added ability for Torch.JIT C dispatch to convert python
None
into an undefined Tensor(#67793) torch.jit.script
now recognizes union of scalars as a JIT NumberType (#66591)- No longer adds a tensor in a returned list to the wildcard alias set in AliasDB, allowing for additional optimizations in JIT optimization passes. (#71170)
- In
torch.jit.optimize_for_inference
, there is a new graph pass to precompute transposes for linear layers. (#65631, 68024) - In
torch.jit.freeze
, there is a new pass where we concat together multiple linear layers with same input Tensor (different weight/bias) (#63198, #68024) - Added support for normalizing
torch.Tensor.__rsub__
innormalize_ops
JIT pass(#65014)
Quantization
- Quantized op improvements
torch.ao.FakeQuantize
now supportsfp32/fp16
zero_point
. (#65836)torch.ops.quantized.add
now supports broadcasting (#66049)torch.Tensor.dequantize
now supports fp16 + cuda (#67234)- Added quantized CPU support for
torch.nn.GELU
(#69968) torch.nn.quantized.functional.hardsigmoid
supports aninplace
flag (#65740)
- Workflow improvements
- FX graph mode quantization: enable
torch.nn.Linear + torch.nn.BatchNorm1d
fusion for PTQ (#66484) - Added an option in
torch.ao.quantization.quantize_fx.convert_fx
to acceptqconfig_dict
to skip quantization (#66878) - Added
torch.nn.qat.dynamic.modules.Linear
module (#67325) - Added
torch.nn.ConvTranspose{n}d + torch.nn.BatchNorm{n}d
fusion support (#70022) - Extended
torch.ao.quantization.prepare_qat
withallow_list
argument, to allow custom mapping and custom QAT module (#65119) - Added
torch.ao.quantization.default_replay_qconfig
which allows observer reuse fortorch.reshape
in FX graph mode quantization (#69249)
- FX graph mode quantization: enable
ONNX
- Set
ir_version
of the exported model based onopset_version
. This increases the odds that the exported ONNX model will be usable. Before this change, we were setting the IR version to a hard-coded value which may be higher than what the model consumer supports. (#67803) - Preserved op input names when op just passes through the input to the output (#67275)
- Shape inference improvements:
- Included op type in exported models’ input and output names (#68976)
- Supports Conv-BatchNorm fusion inside blocks (#67272)
- Exported
torch.reciprocal
to ONNX Reciprocal operator instead ofDiv(1, x)
(#67271) - Supports
beta!=1
in softplus (#66146) - Added warning for inplace updates on
tensor.shape
in tracing mode (#66142) - Supports
instance_norm
in training mode (#64375) - Allow registration of custom symbolics for ops specifying aten namespace (i.e.
aten::foo
is allowed as well as “foo”). (#67810) - Allow registration of custom symbolics for
prim
namespace (#66139) - Supports dynamic inputs for
OneHot
, bool forEinsum
(#66147)
Infra (Releng)
- Build with BUILD_SPLIT_CUDA for all 11.X Windows builds (#70899)
torch.package
- Add ability to retrieve the dependency graph via
all_path
function(#65602) - Add support for pickle v4 (#70642)
- Add better testing support for Package Exporter (#70641)
Bug fixes
Python API
- Fixed scalar inputs for aliased binary ops {
multiply
,subtract
,divide
} (#65937) - Fixed
torch.save
when saving storages that view same data with different type (#66949) - Fixed
torch.save
error if storages are unallocated (#68787) - Fixed
k
out-of-bounds intorch.kthvalue
(cpu kernel) (#68863) - Fixed
inference_mode
decorator:with inference_mode(mode=False)
used to ignore themode
argument and always set inference mode. (#68617) - Fixed
cdist_backward
in the case whencdist
inputs are not contiguous (#70016) - Fixed
cdist
error message typo (#70178) - Fixed
scatter
for empty indexes (#70662) - Fixed
torch.{unique, unique_consecutive}
out of bound (#71540) - Fixed
torch.isin
in the case when inputs are non-contiguous on CPU (#70659) - Fixed
hsplit vsplit dsplit
crash when section is 0 (#69342) - Fixed:
torch.gradient
ignores dim argument when checking edge_order (#67926) - Fixed:
TransformedDistribution.icdf
should perform validation after applying the inverse transformation rather than before. (#71393) - Fixed
torch.all and torch.any
internal assert error with requires_grad=True (#65714) - Fixed
torch.logsumexp
type promotion: promote integral inputs to floating for(#63393)
C++ API
- Fixed libtorch
at::Tensor::print()
linking error (#69615) - Avoided UB when indexing into size-0 tensors (#65878)
- Fixed an ICE when compiling PyTorch from source on MacOS with clang-1300 (#65655)
Autograd
- Fixed autocast state propagation in the
torch.utils.checkpoint
API (#71169) - Fixed
torch.nn.functional.conv_transpose3d
backward when grad_out is non-contiguous (#67829) - Forward mode AD:
- Fixed a case where forward AD in-place-over-view silently copies the view (#67816)
- Fixed deadlock in forward AD for functions that return multiple outputs (#67995)
- Fixed forward AD codegen for functions that have multiple formulas (#68535)
- Fixed deadlock when forward and backward AD are used at the same time (#67360)
- Fixed
Tensor.copy_
forward AD to handle broadcasting (#69592) - Do not generate not_implemented error for forward AD when input with tangent passed to non-differentiable function (#66926)
- Fixed
autograd.Function
when non-Tensor argument precedes tensor argument (#71530) - Fixed
autograd.Function
forward AD when forward is a no-op to no longer raise an internal error (#71531)
Build
- Stopped reporting CPU Capability as AVX512 on machines with AVX512 support but without AVX512 kernels (#66703)
- Disabled SVE when cross-compiling for M1 (#67114)
- Added failure if
pocketfft
is not found andat_mkl
is not enabled (#67909) - Fixed clang issues when compiling with
_GLIBCXX_USE_CXX11_ABI
(#72081)
Complex Numbers
- Fixed
torch.autograd.gradcheck
to generate valid inputs for forward AD computation for complex functions (#68001) - Fixed
torch.Tensor.copy_
transpose path for tensors with conjugate or negative bit set (#69026) - Fixed
torch.Tensor.copy_
behavior for the case when two conjugated or negated tensors of the same dtype (one or both of which are non-contiguous) are copied into each other (#68963)
Dataloader
- Made
ProcessException
picklable (#70118) - Fixed persistent worker exiting before
pin_memory_thread
(#71579)
torch.nn
nn.AdaptiveAvgPool*d
: Throws an error for negativeoutput_size
(#70488)nn.Conv1d
: Fixed for 1D convolution on MKL-DNN backend (#68166)nn.CrossEntropyLoss
: Fixed for usage ofweight
,ignore_index
, andlabel_smoothing
together (#69511)nn.Fold
: Checked that block height and width are positive (#69048)nn.LayerNorm
: Fixed incorrect result on CUDA whengamma
orbias
are missing (#69210)nn.LayerNorm
: Avoided overflow by doing computation infloat
forhalf
(#66920)nn.Module
: Throws a proper error message fromload_state_dict
for non-tensor values (#70596)nn.ModuleList
: Fixed incorrect return type in__getitem__
(#69083)nn.MultiheadAttention
: Used query dtype for mask type (#68077)nn.NLLLoss
: Fixed backward computation with negative weights (#64572)nn.{RNN, GRU}
 : Fixed RNN modules with input shapes containing 0 in CUDA (#71696)nn.utils.rnn.pad_sequence
: Fix regression to support tuples for padding (#72436)optim._LrScheduler
: Fixed print formatting (#68338)optim.ChainedScheduler
: Fixedget_last_lr()
(#69112)optim.CosineAnnealingWarmRestarts
: Fixed ordering bug whenlast_epoch > 0
(#64758)optim.SequentialLR
: Updated_last_lr
on step (#70558)
torch.fx
- Supported
torch.layout
as arg (#66048) - Specified a default value when possible for placeholders created from
concrete_args
(#59569) - Fixed issue where
GraphModule.delete_all_unused_submodules
deletes submodules from called leaf modules (#66430) - Fixed
torch.fx.subgraph_rewriter.replace_pattern
mechanism so that multiple one-liner instances of the pattern are captured correctly (#66442) - Fixed bug in graph matcher that caused certain nodes to be matched twice (#69238)
- Ensured node stack trace survives copying (#69368)
- Fixed
to_folder
not saving dtype (#69983) - Added a
default_value
arg tofx.Graph.placeholder
and fixsplit_module
(#71016)
Sparse
- Fixed CSR storage access to throw when used (#70072)
- Fixed multiplication of 0-D sparse tensors (#70749)
- Fixed result dtype for neg if given sparse Tensor (#68885)
CUDA
- Fixed CUDA vs CPU consistency for index_put_ when accumulating (#66790)
- Fixed CUDA vs CPU consistency for index_put_ when accumulating (part 2) (#67189)
- Fixed error in warning about unsupported GPU (#67900)
- Disabled TF32 in
pinv_jvp
andpinv_backward
(#67948) - Fixed DLPack CUDA stream convention (#67618)
- Sets device guard in
_cudnn_impl
functions (#70406) - Fixed
mem_get_info
when querying on a device other than the current device (#69640)
Benchmark
- Fixed divide-by-zero errors in
torch.utils.benchmark.Timer
(#70050)
Dispatcher
- Added explicit
OperatorHandle
destructor, so that the symbol shows up in windows builds (#70033)
Profiler
Visualization
- Fixed
torch.utils.tensorboard
parsing JIT graph incorrectly (#65692)
Vulkan
- Greatly reduced memory usage of the Vulkan backend by updating the configuration of the Vulkan Memory Allocator (#69088)
- Addressed several warnings raised by the Vulkan Validation layers:
Mobile
- Fixed quantized logistic converter for
NNAPI
(#70847) - Fixed potential crash if
MTLCreateSystemDefaultDevice
returns nil (#66859) - Used full name to look for the promoted prim operator table (#66081)
- Fixed function name bug in mobile export (#66915)
- Fixed issues with
irange
not having a header included inMetal
(#66877) - Fixed backward compatibility issue for UnionType on mobile in
type_parser
. (#71341) - Fixed forward flatbuffer type handling with dynamic type in
flatbuffer_loader
. (#71500) - Fixed type equalities issue in
pytorch_jni_common
(#71508) - Fixed missing properties to the executor in
CoreML
(#67737) - Fixed memory computation when both constants and data tensors are present in model_dump (#66006)
- Ensured that function participating in bundled inputs have their “name" attribute set (#65856)
Distributed
torch.distributed
- Fixed bug on empty
GLOO_SOCKET_IFNAME_ENV
(#68933)
- Fixed bug on empty
DistributedDataParallel
- Fixed “Cannot modify in-place due to DDPSink” (#66015)
torch.distributed.elastic
- Fixed scale down bug caused by calling
rdzv_handler.shutdown()
on premature agent failures (#67749)
- Fixed scale down bug caused by calling
TorchScript
- Fixed a race condition in the JIT interpreter when unpickling source ranges (5525e9a591)
- Fixed a ref counting loop for
CompilationUnit
, resulting in memory leaks when class objects were in JIT graphs. (#65442) - Fixed bug where output type was discarded after calling SubgraphRewriter in C++ (#65453)
- Fixed bug where
torch.jit.optimize_for_inference
did nottorch.jit.freeze
 a module when passed a non-frozen module (#71436) - Fixed bug where running module.forward() on a
torch.jit.freeze
ed module ran the wrong graph (#68316) - Fixed bug where alias analysis in the JIT optimizer was incorrect for the int[] version of
torch.split
, resulting in invalid optimizations in various JIT optimization passes (#69745) - Fixed places where using
torch.autocast
together with autodiff (module.backwards()) in a JIT graph had the wrong number of arguments and would error out. (#67648) - Forbid propagating gradients through views in JIT graphs as currently it is broken (#67732)
- Fixed bug where graph input types were incorrect after running
torch.jit.trace
(#68242) - Fixed case where BroadcastMKLDNN breaks the stack invariant by pushing more than 2 tensors to the stack for when
torch.jit.freeze
ops are converted to MKLDNN(#66628) - Raised error instead of segfaulting when passing None into torch.jit.Graph.create (#68253)
- Raised error instead of crashing when a JIT ScriptFunction is pickled with an incompatible Python
pickle
version.(#69807) - Fixed bug where
torch.jit.script
fails when comments in function has less indent than surrounding code (#70227) - Fixed incorrect device type when torch.device is called inside scripted (
torch.jit.script
) code (#69645) - Fixed warning: overloaded virtual function
torch::jit::Function::call
is only partially overridden in classtorch::jit::GraphFunction
(4bf1be898d)
Quantization
- Fixed applying non-zero offset 1 to null pointer in
torch.nn.functional.interpolate
for quantized tensors (#65570) - Doesn't assume bias is a keyword argument to
torch.nn.Conv{n}d
(#61647, #71426) - Made error message when trying to use
torch.quantize_per_tensor
on non floats more specific (#66050) - Quantized
torch.nn.Embedding
conversion with unsupported dtype: make error message clearer (#66051) - Fixed
torch.nn.qat.EmbeddingBag
from_float error message (#66989) - Fixed bug enforcing quant_min <= zero_point <= quant_max for float zeropoint in
torch.nn.Embedding
QAT (#68852) - Fixed scale+zp serialization of
torch.nn.quantized.BatchNorm{2|3}d
(#70432) - Fixed
torch.nn.Dropout
in FX graph mode quantization (#71043, #71438) - Fixed
qconfig
setting for fused modules in FX graph mode quantization (#71254) - Removed assumption number of rows is in 32 bit in fbgemm (#69066)
- Fixed
reduce_range
warning when using default observers (#71027)
ONNX
- Doesn’t create invalid
index_select
op when constant folding through ONNX Gather with indices rank > 1. Fixes export of some uses of Embedding. (#68493) - Shape inference:
- Fixed inplace
fill_
dtype export mismatch (#64580) - Fixed
remainder
(#64578) - Fixed
reciprocal
when input is not floating point (#67808) - Fixed
new_full
andfull_like
for Python 3.9 (#67806) - Fixed reduce ops on
binary_cross_entropy_with_logits
(#67805) - Propagated node metadata across passes (#45256)
- Ensured outputs don’t have the same name (#66137)
- Fixed
pad
with sequence inputs (#64377) - Fixed
instance_norm
withtrack_running_stats=True
(#64375) - Fixed
all
andany
withdim
arg (#67270) - Allows autograd functions (
prim::PythonOp
) to be exported withOperatorExportTypes.ONNX_FALLTHROUGH
(#67273)
torch.package
- Prevent import race condition that leaves
torch.package.PackagePickler
with unwanted dispatch table entries. (#71025)
Performance
Python API
- Speed up pickling for
torch.dtype
(#65182) - Speed up
histogram
: avoid index_put_ overhead in histogram kernel's inner loop (#67815) - Speed up
torch.topk
with sort for some cases (#68632) - Speed up
torch.stack
: don't unsqueeze every stack arg if possible (#70288) - Speed up
LayerNorm
4-5% (#71423) - Speed up structured kernels: fix some unnecessary refcount bumps (#71140)
- Speed up
indexing
functions: release GIL in a few places (#71728) - Speed up
torch.empty
a bit: define check_sizes_nonnegative as inline (#71640) - Speed up
XLA
tensor printing by reducing compilations (#71147)
C++ API
- Updated
c10::SmallVector
from LLVM (#69110) - Reduced some framework overhead in
at::copy_()
(#68950) - Reduced some overhead in
StorageImpl::set_data_ptr
(#65432) - Improved
IValue
performance for tuples by inlining tuple storage (#64066)
Autograd
- Stopped materializing Tensors full of 0s in forward AD when possible (#64837)
- Rewrote the backward of
linalg.lu
andlinalg.lu_solve
to uselinalg_solve_triangular
(#63569) - Updated
nn.functional.grid_sample
backward to compute input gradient only if required (#66069, #66070) - Stopped erroneously saving the output of
torch.softplus
for backward (#70296)
Complex Numbers
- Release GIL when assigning to real or imaginary components of a complex tensor (#71747)
- Restored conjugate and negative bits of a tensor when calling
repeat_interleave
(#68523)
CUDA
- Used a better hash table in
CUDACachingAllocator
(#71667) TopK
CUDA Optimization: used multiple block per slice (#71081)- Removed sync in
Embedding
caused byunique
(#66091) EmbeddingBackward
exclusive_scan thrust->cub (#66566)sort_out_cuda
: Used custom kernels to fill index tensors (#66668)masked_scatter
: fuse mask count check into one kernel (#66871)- Enabled better depthwise conv perf on cudnn 8.2+ (#58749)
- Improved native
layer_norm
forward perf (#67977) - Improved native
layer_norm
backward perf (#68238) - Fast path for size 0 GPU host malloc (#68532)
- Alternative implementation of CUDA pinned memory allocator focusing on multi-threaded scalability (#69299)
- Used legacy unrolled kernel for non-trivial offset calc cases (#71710)
- Removed
call_once
fromCUDACachingAllocator
(#71668) - Reworked stat collection in
CUDACachingAllocator
(#71669) - Fixed CUDA
LpNormFunctor
(#70601)
Dispatcher
- Made
c10::KernelFunction
struct smaller, which should reduce some memory usage by the dispatcher (#65618)
torch.fx
- Made
torch.fx.symbolic_trace
reuse buffers if they're the same (#66211)
Profiler
Mobile
- Reduced PyTorch Library startup time by 40% for mobile and edge deployments(#65735, #65732, #65939, #66112, #66064, #66131)
- Reduced PyTorch Library heap memory utilization by 40% for mobile and edge deployments(#65732, #66112, #66064, #66131)
- Improved efficiency of IValue and reduced overhead in code paths that use IValue and perform Type Parsing (#65710, #64278, #66717, #65381, #66134, #65951, #70477)
TorchScript
- Improved performance of autodiff on small JIT graphs (#71666)
- Enabled autocasting of tensors between fp16, bfloat 16 and fp32 in torchscript models (#63939, #67707)
- Enabled optimizations in more gradSumToSize cases in the JIT autograd support (#63941)
- When unpickling a JIT graph, avoid reading the file from a stream for 0-byte tensor storage (#67787)
Quantization
- Sped up quantized
torch.nn.functional.interpolate
for channels last (#66525) - Sped up
torch.nn.functional.upsample
for channels last (#70903) - Parallelized computation in
torch.quantize_per_tensor_affine
andtorch.dequantize
(#65845)
Documentation
Python API
- Added docs for
torch.adjoint
. (#68869) - Clarified difference in behavior of
empty_strided
andas_strided
(#64568) - Added some missing generated doc entries (
torch.select
,torch.slice_scatter
,torch.diagonal_scatter
,torch.select_scatter
) (#69030),histogramdd
(#68273) - Typo and formatting fixes.
LinearLR
(#67840),torch.any
(#65310, #70187),torch.futures
(#70630), jit docs (#68557),Tensor.type
(#67019),torch.lobpcg
(#71464),Tensor.triu()
,Tensor.tril()
,Tensor.ravel()
. (#71057),torch.acosh
(#66814), (#70439) - General Doc improvements for individual ops.
torch.finfo
(mentiontorch.bfloat16
) (#68496),torch.quantile
interpolation kwarg (#70637),from_dlpack
andto_dlpack
(#70437),set_printoptions
added examples (#68324),index_add
(#65806), topk doc (#65938),unique
(#66132),chi2
(#67379),torch.histc
(#64191),empty
andempty_like
(#68874),torch.cholesky_inverse
(#69069),torch.dsplit
(#70557) - Changed README getting started link to explicit instructions (#66828)
- Modernized and clarified docs for
torch.tensor
andtorch.as_tensor
(#63308) - Improved
torchhub
docs (#69970) - Updated docs for
torch.Tensor.real
to indicate that it's supported for real tensors (#71962)
C++ API
- Fixed typos in ATen README (#69170)
- Mentioned
TORCH_SHOW_CPP_STACKTRACES
inContributing.md
docs (#64052) - Updated link to C++ frontend examples (#66095)
- Added docs for Visual Studio extension (#63944)
- Added docs about an issue with compiling C++ extensions with CUDA 11.5 and Windows (#73013)
Autograd
- Updated docs for forward AD and make them public (#71643, #71159)
- Updated “Extending PyTorch” doc to cover forward AD (#66962)
- Fixed broken code syntax in autograd.rst (#69362)
- Fixed incorrect variable in autograd docs (#70884)
- Fixed typo in
torch.autograd.Function
docs that prevented it from compiling (#66754)
Dataloader
- Added docstring for
default_collate
anddefault_convert
(#69862) - Updated the documentation for AMP with DataParallel (#69218)
torch.nn
F.binary_cross_entropy
: Updated examples to avoid deprecated calls (#69816)F.linear
: Fixed shape docs to indicate no-batch-dim support (#66884)F.max_pool*d
: Added functional docs (#63264)F.multilabel_soft_margin_loss
: Added reduction args to signature (#70420)nn.AdaptiveLogSoftmaxWithLoss
: Fixed typo inlog_prob
name (#68926)nn.{BatchNorm1d, InstanceNorm1d}
: Fixed input shape notation inconsistencies (#71371)nn.CrossEntropyLoss
: Corrected typo in formula for class probability targets (#70220)nn.{ELU, Hardshrink, Hardsigmoid, MultiHeadAttention, Softplus, Tanh}
: Made first line of docstring readable for overview docs (#70574, #71012, #70987, #71100, #70576, #70577)nn.Flatten
: Simplified example code (#67472)nn.{Hardsigmoid, Hardswish, Mish, RReLU, SiLU}
: Added activation function images (#65415)nn.KLDivLoss
: Fixed rendering ofreduction
arg (#66583)nn.KLDivLoss
: Rewrote docs to clarify math (#67443)nn.MaxUnpool2d
: Changed misleading example to better demonstrateoutput_size
usage (#68936)nn.Module
: Added note describing requiredsuper().__init__()
call (#66909)nn.Module
: Changedsuper()
usage to Python 3 syntax in example (#65748)nn.Module
: Fixed formatting fornamed_modules()
(#70491)nn.NLLLoss
: Corrected default value forreduce
(#68426)nn.SmoothL1Loss
: Clarified equivalence withnn.L1Loss
whenbeta == 0
(#70673)nn.{TransformerDecoderLayer, TransformerEncoderLayer}
: Clarified defaultbatch_first=False
dimension format (#66574)nn.Upsample
: Indicated thatalign_corners
takes effect inbicubic
mode (#66756)nn.utils.clip_grad_norm_
: Fixed rendering ofparameters
inerror_if_nonfinite
arg docs (#69958)optim.Adam
: Fixed formatting (#70387)optim.AdamW
: Fixed formula (#68587)optim.RAdam
: Corrected default value oflr
arg (#69186)- Removed orphan from cuDNN persistent note (#65160)
- Updated link to tutorial on defining NN modules (#65534)
nn.{AvgPool1d, AdaptiveAvgPool3d, MultiMarginLoss, PairwiseDistance, TripletMarginLoss}, F.{conv3d, conv_transpose3d, fold, linear}
 : Fixed doc formatting regressions from no-batch-dim support (#73014)
torch.fx
- Fixed for retracing documentation which would break for n-ary operators (#71599)
- Updated
torch.fx.passes.split_module
docstring (#65542) - Updated
fx.rst
example outputs (#68043) - Added document gotcha about training flag (#68915)
- Defined
 get_dot_graph
to match documentation (#70541)
Sparse
- Updated sparse.rst to warn about _values() (#71088)
CUDA
- Updated Stream
wait
documentation to reference underlyingcudaStreamWaitEvent
call (#67973) - Documented
torch.cuda.ExternalStream
,torch.cuda.caching_allocator_alloc
andtorch.cuda.caching_allocator_delete
(#70126) - Updated
CUDA Graphs
docs: Fixedmake_graphed_callables
example typos (#69379)
Mobile
- Added user facing documentation for tracing-based selective build mobile interpreter in Android and iOS (#1709)
- Added recipe for bundled inputs in TorchScript models (#1524)
Distributed
DistributedDataParallel
torch.distributed
torch.distributed.elastic
- Made --max_restarts explicit in the quickstart and runner docs (#65838)
torch.distributed.optim
- Rendered
torch.distributed.optim
members (#67885)
- Rendered
torch.distributed.rpc
- Deleted distributed optimizer section from RPC and add reference to namespace docs page (#68068)
TorchScript
- Added
typing.Union
to supported types in documentation (#68435) - Added documentation to
torch.jit.is_tracing()
(#67326) - Fixed typos in
jit_language_reference.rst
(#68706)
Quantization
- Added documentation for quantized model save/load instructions (#69789)
- Updated link to qnnpack in quantization doc. (#66226)
- Improved quantization API docs (#66379)
- Quantization docs: add pages for Numeric Suite (Eager and FX) (#66380)
- Documented the quantization custom module APIs (#67449)
- Improved quantization documentation (#68907)
ONNX
- Improved documentation of
operator_export_type
andopset_version
args (#69549) - Fixed documentation for
do_constant_folding
arg default (#71348) - Documented
ExportTypes
,CheckerError
, andunregister_custom_op_symbolic
(#68489) - Fixed link to ONNX Runtime custom op documentation (#67944)
- Added section “Discovering all unconvertible ATen ops at once” (#66143)
- Fixed typos (#66090)
- Documented work-arounds for indexing export limitations, and improve error messages (#64579)
torch.package
- Add some docs describing how to debug
torch.package
dependencies (#65704)
Download Release
This release has the following assets:
- pytorch-v1.11.0.tar.gz
- Source code (zip)
- Source code (tar.gz)
Visit the release page to download them.
Have any questions?
Contact Exxact Today
Removed THCudaMalloc()
, THCudaFree()
, THCThrustAllocator.cuh
(#65492)
If your extension is using THCThrustAllocator.cuh
, please replace it with ATen/cuda/ThrustAllocator.h
and corresponding APIs (see examples in this PR).
This PR also removes THCudaMalloc/THCudaFree
calls. Please use c10::cuda::CUDACachingAllocator::raw_alloc(size)/raw_delete(ptr)
, or, preferably, switch to c10:cuda::CUDaCachingAllocator::allocate
which manages deallocation. Caching allocator APIs are available since PyTorch 1.2, so just replacing it is enough even if you support older versions of PyTorch.
Build
Stopped building shared library for AOT Compiler, libaot_compiler.so
(#66227)
Building aot_compiler.cpp
as a separate library is not necessary, as it’s already included in libtorch.so
.
You can update your build system to only dynamically link libtorch.so
.
Mobile
Make typing.Union
type unsupported for mobile builds (#65556)
typing.Union
support was added for TorchScript in 1.10. It was removed specifically for mobile due to its lack of use and increase in binary size of PyTorch for Mobile builds.
Distributed
torch.distributed.rpc
: Final Removal of ProcessGroup RPC backend (#67363)
ProcessGroup RPC backend is deprecated. In 1.10, it threw an error to help users update their code, and, in 1.11, it is removed completely.
The backend type “PROCESS_GROUP” is now deprecated, e.g.torch.distributed.rpc.init_rpc("worker0", backend="PROCESS_GROUP", rank=0, world_size=1)
and should be replaced with:torch.distributed.rpc.init_rpc("worker0", backend="TENSORPIPE", rank=0, world_size=1)
Quantization
Disabled the support for getitem
in FX Graph Mode Quantization (#66647)
getitem
used to be quantized in FX Graph Mode Quantization
, and it is no longer quantized. This won’t break any models but could result in a slight difference in numerics.
1.10.2 | 1.11.0 |
---|---|
from torch.ao.quantization.quantize_fx import convert_fx, prepare_fx class M(torch.nn.Module): def __init__(self): super().__init__() self.linear = torch.nn.Linear(5, 5) def forward(self, x): x = self.linear(x) y = torch.stack([x], 0) return y[0] m = M().eval() m = prepare_fx(m, {"": torch.ao.quantization.default_qconfig}) m = convert_fx(m) print(m) # prints # GraphModule( # (linear): QuantizedLinear(in_features=5, out_features=5, # scale=1.0, zero_point=0, qscheme=torch.per_tensor_affine) # ) # def forward(self, x): # linear_input_scale_0 = self.linear_input_scale_0 # linear_input_zero_point_0 = self.linear_input_zero_point_0 # quantize_per_tensor = torch.quantize_per_tensor(x, # linear_input_scale_0, linear_input_zero_point_0, torch.quint8) # x = linear_input_scale_0 = linear_input_zero_point_0 = None # linear = self.linear(quantize_per_tensor) # quantize_per_tensor = None # stack = torch.stack([linear], 0); linear = None # getitem = stack[0]; stack = None # dequantize_2 = getitem.dequantize(); getitem = None # return getitem | from torch.ao.quantization.quantize_fx import convert_fx, prepare_fx class M(torch.nn.Module): def __init__(self): super().__init__() self.linear = torch.nn.Linear(5, 5) def forward(self, x): x = self.linear(x) y = torch.stack([x], 0) return y[0] m = M().eval() m = prepare_fx(m, {"": torch.ao.quantization.default_qconfig}) m = convert_fx(m) print(m) # prints # GraphModule( # (linear): QuantizedLinear(in_features=5, out_features=5, scale=1.0, zero_point=0, qscheme=torch.per_tensor_affine) # ) # def forward(self, x): # linear_input_scale_0 = self.linear_input_scale_0 # linear_input_zero_point_0 = self.linear_input_zero_point_0 # quantize_per_tensor = torch.quantize_per_tensor(x, linear_input_scale_0, linear_input_zero_point_0, torch.quint8) # x = linear_input_scale_0 = linear_input_zero_point_0 = None # linear = self.linear(quantize_per_tensor); quantize_per_tensor = None # stack = torch.stack([linear], 0); linear = None # dequantize_2 = stack.dequantize(); stack = None # getitem = dequantize_2[0]; dequantize_2 = None # return getitem |
Users should now use fuse_modules
for PTQ fusion and fuse_modules_qat
for QAT fusion (#69878, #71956)
There are two types of fusion supported by fuse_modules api: PTQ and QAT fusion. Previously we relied on module.training
to decide which mode user wanted, but this was a misuse of the training
attribute since that is not the intended purpose. This PR removes the dependency on module.training
and uses separate APIs to make the fusion requested by the user explicit.
Previously, fuse_module
used to support both cases and distinguished PTQ/QAT fusion based on module.training
, but now fuse_module
only supports the PTQ fusion. So, in the case when user wants to do QAT fusion, they need to change the call to fuse_modules_qat
, instead of using fuse_modules
, otherwise, they would silently get unwanted fusion results (PTQ fusion), or if the model is in training mode, it might result in error.
Note: Currently it is still enforced that if the model is in eval mode, only PTQ fusion can be used; if the model is in training mode, then only QAT fusion can be used. In the future this constraint will be relaxed.
1.10.2 | 1.11.0 |
---|---|
import torch from torch.ao.quantization import fuse_modules class M(torch.nn.Module): def __init__(self): super().__init__() self.conv = torch.nn.Conv2d(3, 3, 3) self.bn = torch.nn.BatchNorm2d(3) def forward(self, x): return self.bn(self.conv(x)) m = M().train() m = fuse_modules(m, ["conv", "bn"]) print(type(m.conv)) m = M().eval() m = fuse_modules(m, ["conv", "bn"]) print(type(m.conv)) <class 'torch.nn.intrinsic.modules.fused.ConvBn2d'> <class 'torch.nn.modules.conv.Conv2d'> | import torch from torch.ao.quantization import fuse_modules class M(torch.nn.Module): def __init__(self): super().__init__() self.conv = torch.nn.Conv2d(3, 3, 3) self.bn = torch.nn.BatchNorm2d(3) def forward(self, x): return self.bn(self.conv(x)) m = M().train() # For Quantization Aware Training, use fuse_modules_qat() m = fuse_modules_qat(m, ["conv", "bn"]) print(type(m.conv)) m = M().eval() m = fuse_modules(m, ["conv", "bn"]) print(type(m.conv)) # Result (doesn't change): <class 'torch.nn.intrinsic.modules.fused.ConvBn2d'> <class 'torch.nn.modules.conv.Conv2d'> |
ONNX
Removed f
arg from onnx.export_to_pretty_string
(#69546)
The arg has always been ignored. Simply remove it from your code.
1.10.2 | 1.11.0 |
---|---|
torch.onnx.export_to_pretty_string(model, inputs, "file_name") | torch.onnx.export_to_pretty_string(model, inputs) |
Removed use_external_data_format
arg from onnx.export
(#67809)
The arg has been deprecated and ignored since 1.10. The external data format is now used automatically if and only if the exported file would exceed protocol buffer’s file size limit. Simply remove it from your code.
1.10.2 | 1.11.0 |
---|---|
torch.onnx.export(model, inputs, f_name, use_external_data_format=True) | torch.onnx.export(model, inputs, f_name) |
Removed example_outputs
arg from torch.onnx.export
(#67809)
The arg has been deprecated and ignored since 1.10. The provided model is instead executed once to produce example outputs. Simply remove it from your code.
1.10.2 | 1.11.0 |
---|---|
torch.onnx.export(model, inputs, f_name, exaple_outputs=(foo,)) | torch.onnx.export(model, inputs, f_name) |
Removed enable_onnx_checker
arg from onnx.export
(#67276)
The arg has been deprecated and ignored since 1.10. The ONNX checker is always enabled. If it fails, onnx.CheckerError
will be raised. Users can catch and ignore that exception.
1.10.2 | 1.11.0 |
---|---|
torch.onnx.export(model, inputs, f_name, enable_onnx_checker=False) | try: torch.onnx.export(model, inputs, f_name) except torch.onnx.CheckerError: pass # ignore error |
Moved and renamed onnx.utils.ONNXCheckerError
to onnx.CheckerError
(#66644)
Previously the documentation was incorrect and stated ONNXCheckerError
was in the onnx
module, so this moves the class to the originally intended module and brings the code in line with the documentation. The new name is shorter and less redundant with the module name.
1.10.2 | 1.11.0 |
---|---|
except torch.onnx.utils.ONNXCheckerError: | except torch.onnx.CheckerError: |
Removed _retain_param_name
arg from onnx.export
(#67276)
The arg has been deprecated and ignored since 1.10. Param names are now always retained. Simply remove it from your code. If you want to remove param names, you can do so by editing the exported ONNX model.
1.10.2 | 1.11.0 |
---|---|
# NOTE: No way to get same behavior as _retain_param_name=False. torch.onnx.export(model, inputs, f_name, _retain_param_name=True) | torch.onnx.export(model, inputs, f_name) |
Deprecations
Python API
Deprecated x.T
on tensors of dimension other than 0 or 2 (#64180)
x.T
only accepts tensors with 0 or 2 dimensions. Calling x.T
on tensors with a different number of dimensions has been deprecated.
1.10.2 | 1.11.0 |
---|---|
a = torch.ones(2, 3, 4) a.T.size() # torch.Size([4, 3, 2]) | a = torch.ones(2, 3, 4) a.T.size() # UserWarning: The use of `x.T` on tensors of dimension other than 2 # to reverse their shape is deprecated and it will throw an error in a future release. # Consider `x.mT` to transpose batches of matrices or `x.permute(*torch.arange(x.ndim - 1, -1, -1))` # to reverse the dimensions of a tensor. (Triggered internally at # aten/src/ATen/native/TensorShape.cpp:2386.) # torch.Size([4, 3, 2]) |
Quantization
torch.ao.quantization.QConfigDynamic
is deprecated and going to be removed in next the release, please use torch.ao.quantization.QConfig
instead (#69875, #69864)
1.10.2 | 1.11.0 |
---|---|
qconfig = torch.ao.quantization.QConfigDynamic(...) | qconfig = torch.ao.quantization.QConfig(...) |
New features
Python API
- Added
set_deterministic_debug_mode
andget_deterministic_debug_mode
(#67778, #66233) - Added n-dimensional Hermitian FFT:
torch.fft.ifftn
andtorch.fft.hfftn
(#63890) - Added
Wishart
distribution totorch.distributions
(#70377) - Preliminary support for the Python Array API standard has been added to the
torch
andtorch.linalg
modules. PyTorch implements over 90% of the operators defined by the Python Array API, including thetorch.from_dlpack
operation for improved DLPack support (#60627) - Moved
torch.testing
from prototype to beta (#69668)
Autograd
- Added new
torch.utils.checkpoint
implementation that does not use reentrant autograd (can be toggled with the newuse_reentrant
flag) (#69508) - Added
batched_grad
parameter toautograd.grad
to allow batched gradient computation (#65564) - Forward mode AD:
- Added support for most ops (and many of their backwards as well) (#71026, #69956, #70355, #71901, #69908, #69884, #67837, #68566, #69661, #69384, #68631, #70468, #70460, #67820, #70460, #65546, #67043, #67268, #67837, #69727)
- Check the following issue (#71117) to see the list of ops that do not yet support forward AD. Please comment there if you run into any ops that don’t support forward AD that you want prioritized or are missing from that list.
- Added
ctx.save_for_forward
function toautograd.Function
(#71569) autograd.forward_ad.unpack_dual
returns a named tuple instead of plain tuple (#68062, #68628)
- Added support for most ops (and many of their backwards as well) (#71026, #69956, #70355, #71901, #69908, #69884, #67837, #68566, #69661, #69384, #68631, #70468, #70460, #67820, #70460, #65546, #67043, #67268, #67837, #69727)
- Linear algebra operation support:
Build
- Added FlexiBLAS build support (#64815)
- Added
IS_LINUX
andIS_MACOS
global vars for cpp extensions building (#69093) - Added ARC for iOS CMake builds (#67884)
- Added support for IBM z14/15 SIMD (#66407)
Complex Numbers
Dataloader
- TorchData library is going to provide modular data loading primitives for easily constructing flexible and performant data pipelines. Beta release will be provided after the release of PyTorch Core (https://github.com/pytorch/data)
LinAlg
- Added an experimental flag that allows specifying a preferred linear algebra library (see the docs here) (#67980)
- Added the
linalg.matrix_exp
operation (see the docs here) (#62715) - Added the
linalg.cross
operation (see the docs here) (#63285) - Added the
linalg.diagonal
operation, an alias for torch.diagonal (see the docs here) (#70599) - Added the
linalg.lu_factor
operation (see the docs here) (#66933)
torch.nn
- Added
torch.nn.utils.rnn.{unpack_sequence,unpad_sequence}
functions (#66550)
Sparse
- Added
torch.sparse.sampled_addmm
for CSR Tensors on GPU (#68007)
CUDA
- The Jiterator - enables compiling rarely used CUDA kernels at runtime (#69439)
- Low precision supported for jiterator (#70157) - enables runtime-compilation of ops on low precision tensors (half and bfloat16)
- Enable cpu scalar arguments for jiterator (#69861) - enables passing cpu scalars as an argument to the jit-compiled kernels at runtime
- The Cacherator (#71350) - caches the jit-compiled kernels on disk, so that they can be reused between different processes
- Added complex support for Jiterator, port sinc to Jiterator (#71577)
- Jiterates
lcm
,i0e
,i1e
,ndtri
,efcx
,digamma
,trigamma
,lgamma
(#70663) - Jiterates
exp2
,erfc
,erfinv
andentr
(#71295) - Fixes jiterator cache macro include + updates CUDA note with cache variables (#71452)
- Jiterates
polygamma
(#71162)
- Added cuSPARSE descriptors and updated CSR addmm (#60838)
- Sparse CSR CUDA: added
addmv_out
(#61407) - Added nvidia-smi memory and utilization as native Python API (#69104)
Vulkan
- Added Vulkan support for several torch operators:
- Added the
vulkan_perf_test
benchmark binary to benchmark Vulkan ops under various input conditions. (#67230)
Mobile
- Tracing Based Selective Build (PyTorch Mobile Build Size Reduction) is a new feature that reduces a mobile model’s binary size by only including the operators that the model uses.
- Build tracer for tracing based workflow (#66267)
- Used operator.yaml to build LibTorch library (#66237)
- Unified tracer between internal and external (#64152)
- Reorganized model tracer dependency (#63421)
- Added support for the
bool
andint
dtypes in the copy kernel by default when using Tracing Based Selective Build (#69106, #69297) - Generic build features for selective build (#67817)
- Made more classes selective (#67397)
- Added custom classes to selective build and compatibility APIs (#67004, #66972, #67340)
Distributed
FullyShardedDataParallel
- FSDP is a type of data-parallel training but unlike traditional data-parallel it shards model’s parameters, gradients and optimizer states across data parallel workers and can optionally offload the sharded model parameters to the CPUs. This new API can help users to scale their large model training with minimal code change when switching from DDP to FSDP. (#63881, #64964, #66578, #66904, #66956, #66957, #67117, #67292, #67249, #67135, #67813, #68308, #68155, #68417, #68776, #69356, #69357, #69358, #70340, #71803, #71804, #70341, #70235, #72084)
DistributedDataParallel
TorchScript
- Enabled running
torch.jit.freeze()
andtorch.jit.optimize_for_inference
on functions that are not forward (#68668, #69367) - Enabled
torch.jit.freeze
to work on for sparse COO tensors (#69614) - Enabled
torch.jit.script()
,torch.jit.freeze()
and serialization for tensors in Compressed Sparse Row (CSR) format (#69555) - Allowed users to set the fusion strategy for
torch.jit.fuser
through the now publictorch.jit.set_fusion_strategy
. (#72937) - Enabled Dynamic Shape Fusion For GPU & CPU, configurable via
torch.jit.set_fusion_strategy
(#72036)
Quantization
- Added bilinear quantized implementation of
torch.nn.functional.grid_sample
2d operator (#66879) - Added the
torch.quantize_per_tensor_dynamic
operator (#68004) - Added Quantization Aware Training support for
torch.nn.Embedding
andtorch.nn.EmbeddingBag
- Added basic EmbeddingBag QAT fakeQuant workflow (#65443)
- Added support for quantization of Embedding{Bag} in dynamic quant APIs (#65674)
- Eager mode QAT for Embeddings (#66429)
- Add benchmarks for QAT Embedding+EmbeddingBag (#66560)
- Supported Embedding QAT via FX API (#69333)
- Add FX support for QAT EmbeddingBag (#69334)
- Added support for depthwise quantized
torch.nn.Conv3d
in qnnpack, for use in quantization- Depthwise Conv3d Indirection Buffer Setup (#69311)
- Depthwise Conv3d Weight Packing (#69312)
- Depthwise Conv3d mp8x27 (per channel) Neon Kernel (#69313)
- Depthwise Conv3d mp8x27 (per-channel) Sse2 Kernel (#69314)
- Tightened Step Height for Indirection Buffers (#70530)
- Enabled Depthwise Specific Conv3d Kernel for Kernel Size 3x3x3 (#69315)
- Implemented 3d convolution in qnnpack (#66350)
ONNX
- Supports opset version 15 (#67805)
- Supports exporting
nn.Module
calls as ONNX local functions (#66140, #67803) - Supports for exporting new ops:
- Added BFloat16 type support (#66788)
- Supports exporting with Apex O2 (#66700)
Infra (Releng)
- Added support for ROCm 4.3.1 (#65624)
- Added support for ROCm 4.5.2 (#71064)
- Added support for CUDA 11.5 (#69262)
- Added support for CUDA enabled Bazel builds (#66241)
- Added support for Python 3.10 (#71132, #71419)
Improvements
Python API
- NumPy compatibility:
- Improved
torch.Tensor.view(dtype)
: enable all dtype combinations (#66493) - Improved
torch.diff
by adding support for n greater than 1 (#67260) - Improved
torch.movedim
to handle scalar as no-op (#69537) - Improved
cartesian_prod
: fixed a warning in the docs example (#68753) - Improved error messages for
max_unpool{}d
operators (#67328) torch.distributions
- Implemented positive-semidefinite constraint in
torch.distributions
(#71375) - Implemented Entropy methods for Binomial and Multinomial distributions (#67609)
- Implemented support for
non-negative
constraint in exponential distribution (allowing it to include zero). (#67184) - Implemented
kl divergence
betweennormal
andlaplace
distribution. (#68807)
- Implemented positive-semidefinite constraint in
- Improved meta tensor support for operators:
- Added support for
torch.Tensor.real
for real-valued tensors (#71718) torch.logaddexp, torch.logaddexp2, torch.remainder
: added BFloat16 support on CPU (#63621)torch.bucketize
andsearchsorted
: added Half precision support (#67077)- Added new
torch.slice_scatter
,torch.select_scatter
,torch.diagonal_scatter
ops (#64430) - Made
torch.scatter_reduce
a public API (#68580, #73125)
C++ API
- Added C++ API and docs for
hfftn
(#66127) - Added support for
MaybeOwned<IValue>
(#68157) - Added
set_to_none
option forzero_grad()
to C++ API (#68801) - Added an environment variable,
TORCH_CPP_LOG_LEVEL
, that you can use to toggle the log level in the c10 library (#71746)
Autograd
- Added nesting support for
torch.autograd.graph.saved_tensor_hooks
(#70932) - Delayed all warnings encountered during the backward pass until the end of backward execution (#66235)
- Added complex autograd support to
torch.{col2im,im2col}
(#68199) - Added new reduce options and autograd support for
torch.scatter_reduce
(#71788) - Added derivatives wrt the second argument for
torch.{remainder,fmod}
(#69908) - Added new
strategy
flag toautograd.functional.{Jacobian, Hessian}
to enable vectorized computation (#67041, #66292) - Added
check_backward_ad
flag totorch.autograd.gradcheck
to be able to skip backward mode AD checks (#65040) - Relaxed forward AD layout check to allow primal and tangent stride to differ when their size is 1 (#66294)
Build
- Improved incremental build times of PyTorch core by removing a dependency on
native_functions.yaml
in many core files (#64499, #66914, #64172, #64171, #66620, #66793, #66913, #66794, #64169, #64173, #64170, #67735) - Enabled bazel build without glog and gflags (#70850)
- Added support for C++ frontend wrapper on Linux (#69094)
- Added support for dynamic codegen outputs in CMake (#68246)
- Max CMake version is now used by default with setup.py (#69355)
- Upgraded oneDNN to v2.3.3 and package oneDNN Graph API together (#63748)
- Code base should now be
-Wno-unused-variable
compliant (#66041) - Added lazy import for
packaging
intorch_version
(#71345)
Dataloader
- Support custom
Sequence
andMapping
forutils.data.default_collate
(#68779) - Allowed specifying
num_samples
toRandomSampler
whenreplacement
isFalse
(#71568) - Fixed the warning of shape inconsistency
utils.data.default_collate
(#71065)
ForEach
- Implemented
ForEach
L1 & L2 norm (#62646)
LinAlg
- The
linalg.matrix_rank
(docs) andlinalg.pinv
(docs) operations now support specifying absolute and relative tolerances for better handling of singular values (#63102)
torch.nn
- Added
channels_last
support forChannelShuffle
(#50247) - Added no-batch-dim support for
nn.{AdaptiveLogSoftmaxWithLoss, Bilinear, Conv*d, ConvTranspose*d, CrossEntropyLoss, CTCLoss, Fold, FractionalMaxPool3d, GaussianNLLLoss, GRU, GRUCell, InstanceNorm*d, LSTM, LSTMCell, MarginRankingLoss, MultiheadAttention, MultiLabelSoftMarginLoss, RNN, RNNCell, Transformer, TransformerDecoderLayer, TransformerEncoderLayer}
(#69054, #69539, #70506, #71055, #70092, #64909, #69732, #69783, #70236, #65323, #71056, #64975, #67176, #70590, #65690, #70977, #70597, #70322, #69291) - Added
BFloat16
support on CPU tonn.{AdaptiveAvgPool2d, AdaptiveMaxPool2d, AvgPool2d, MaxPool2d}
(#56902, #66929, #66927, #56903) - Added
maximize
support tooptim.{Adam, AdamW, SGD}
(#68164, #70146, #67847, #68733, #71023) F.interpolate
: Addnearest-exact
mode to fix off-by-one error innearest
mode (#64501)F.interpolate
: Added support for anti-aliasing to bilinear and bicubic algorithms (#70930, #68819, #65142, #69318)F.interpolate
: Improved error message for invalid shapes (#66417)nn.Conv*d
: Accepts 0-sized channel inputs (#66256)nn.LogSigmoid
: Usedlog1p
for improved precision (#66441)nn.Module
: Added flag for removing duplicates from parameters (#71542)nn.Module
: Addedregister_module
alias for registering a sub-module (#65174)nn.ModuleList
: Supported concatenation (#70887)nn.MultiheadAttention
: Added flag to optionally average output attention weights across heads (#70055)nn.ParameterDict
: Supported full set ofdict
methods (#69403)nn.{RNN, GRU}
: Allowedhidden_size
to be 0 (#70556)nn.Sequential
: Addedappend
method (#71326)nn.Upsample
: Exposedrecompute_scale_factor
(#66419)nn.ZeroPad2d
: Addedextra_repr
for printing purposes (#69206)optim.{ChainedScheduler, SequentialLR}
: Addedoptimizer
attribute (#67406, #69817)optim.swa_utils.AveragedModel
: Addeduse_buffers
flag for averaging buffers in addition to parameters (#65921, #71763)
torch.fx
- Improved the customizability of
fx.Graph
’s code generation function, including support for setting a breakpoint in the generated code (#67139) - Supported printing inplace operators in FX (#71887)
Sparse
- Add CSR support for several operators:
torch.triangular_solve
,torch.addmv
,torch.addmm
,torch.add
for all arguments on CPU (#62180, #61536, #65606, #64391)torch.triangular_solve
,torch.addmv
,torch.addmm
,torch.add
for all arguments on GPU (#61407, #61858, #63511, #63948)- zero-preserving unary functions (#68123, #69292)
torch.empty
,torch.resize_
,torch.copy_
,torch.randn_like
,torch.clone
(#63509, #63510, #68083, #70581)transpose
(#70582)
- Added torch.sparse_coo Layout support to
zeros_like
(#68108) - Added Half, BFloat16, and Complex dtype support for matrix multiplication of two COO Tensors on GPU (#59980)
- Added support for conversion of CSR to COO Tensor to
to_sparse
(#66774) - Added support for empty COO Tensors to sparse.sum (#71091)
AMD
- Added sparse mappings for CUDA->HIP translation (#67323)
- Enabled frexp support for ROCm builds (#67226)
- Used hipCUB/rocPRIM scan algorithms for large index support (#68487)
CUDA
- Allows external CUDA streams to be set as current (#66324)
- Added an option to disable reduced precision reductions for FP16 GEMM (#67946)
- Improved CUDA memory usage of
nanmedian
result (#68591) - Reduced number of
igamma
kernel instantiations (#70666) - Reduced number of
compare
kernels by unifying them (#69111) - Reduced number of
bernoulli
tensor tensor kernel instantiations (#70169) - Used
cub::FutureValue
to simplify 64bit indexing split of cub scan (#66711) - Added
hascuSOLVER
flag to Context (#69825) - Improved error message from
CUDACachingAllocator
(#69174) - Fixed
masked_softmax
perf for element_size is not 8 (#70271) - Reduced binary size of
TensorCompare.cu
(#68835) - Improved error message for
interpolation
(#72066) - Doesn't compile
pow
kernels for non-existent case (#70017)
Profiler
- Added flop count formulas for
bmm
andbaddbmm
(#66636)
Vulkan
- Allowed Vulkan models to return multiple outputs by improving Vulkan tensor lifecycle management to release GPU resources when the tensor is destroyed, instead of being released at the end of every inference (#66477, #66478)
- Enabled multiple Vulkan models to execute concurrently in parallel threads, by moving components of the Vulkan global context into thread local objects (#67733, #69576)
Mobile
- Introduced multiple improvements for
NNAPI
- Added converters for torchscript ops
quantized::mul
andquantized::convtranspose2d
to converter (torch.backends._nnapi.prepare.convert_model_to_nnapi
) (#63913, #63914) - Supported
int32
andqint16
type in Torchscript expressions (#70197, #70621) - Supported runtime flexible shapes and return shapes (#70334)
- Added converters for torchscript ops
- Improved Model Tracer Coverage and Selective Metal Ops (#68134, #69492, #69328)
- Introduced multiple improvements for
CoreML
- Type Support in Mobile Lite Interpreter
Distributed
torch.distributed
- Improvements to error handling in
TCPStore’
s socket implementation (#68225) - Enabled
ncclAvg
for reductions (#62835) - Init dummy
NCCL
comms in constructor (#65173, #66393) - Added pybind trampoline for
ProcessGroup
andWork
(#66338) - Setup
c10d
extension Backend class attr the same way as builtin ones (#66991) - Added barrier to
ProcessGroup
trampoline (#67236) - Raised warning when calling collectives on non-member group objects (#67639)
- Patched
bfloat16
support for NCCL (#67843) - Fixed
c10d
TCP store race condition with mutex (#68499) - Surfaced
ncclUniqueId
store broadcast error (#68597) - Checks for file existence before invoking cleanup logic in
FileStore
destructor (#68603) - Implemented gather primitive for
ProcessGroupNCCL
(#66745) - Implemented scatter primitive for
ProcessGroupNCCL
(#70029) - Enabled
gather_object
onNCCL
(#71623) - Implemented
allreduce_coalesced
forProcessGroupNCCL
(#62140) - Set non-default backend names to lower case (#69400)
- Added support for
deleteKey
forFileStore
(#69953) - Fixed
TSAN
issue inTCPStore
(#69590)
- Improvements to error handling in
DistributedDataParallel
torch.distributed.rpc
torch.distributed.autograd
- Made Kineto + distributed a warning rather than an error (#71120)
torch.distributed.elastic
- Added ability to override sys.executable for
torch.distributed.run
(#66179)
- Added ability to override sys.executable for
TorchScript
- Several improvements to NVFuser, which is an optimization that speeds up all JIT graphs with a CUDA Tensors on Nvidia GPUs. This includes extending fusing support to normalization and reduction kernels, enabling multiple kernel launch for single
CudaFusionGroup
, and addition of a graph segmentation cache to the hierarchical caching system. (#63745, #65137, #63745, #65137) - Enabled
profile_ivalue
to convert dynamic scalar into compile time constants in NVFuser. (e.g. reduction axes). (#63745, #65137) - Added support in
torch.jit.trace
for tracing already JITted subgraphs(#59949) - We now provide full types on graph inputs when tracing graphs that are already JITted(#67424)
torch.jit.freeze
now can preserve attributes of submodules - previously, it was only possible to prevent inlining of attributes of the top level module.(#66102)- The peephole optimizer, which is used in
torch.jit.freeze
now coalesces consecutive calls totorch.concat
into a single call (#67000) - Added ability for Torch.JIT C dispatch to convert python
None
into an undefined Tensor(#67793) torch.jit.script
now recognizes union of scalars as a JIT NumberType (#66591)- No longer adds a tensor in a returned list to the wildcard alias set in AliasDB, allowing for additional optimizations in JIT optimization passes. (#71170)
- In
torch.jit.optimize_for_inference
, there is a new graph pass to precompute transposes for linear layers. (#65631, 68024) - In
torch.jit.freeze
, there is a new pass where we concat together multiple linear layers with same input Tensor (different weight/bias) (#63198, #68024) - Added support for normalizing
torch.Tensor.__rsub__
innormalize_ops
JIT pass(#65014)
Quantization
- Quantized op improvements
torch.ao.FakeQuantize
now supportsfp32/fp16
zero_point
. (#65836)torch.ops.quantized.add
now supports broadcasting (#66049)torch.Tensor.dequantize
now supports fp16 + cuda (#67234)- Added quantized CPU support for
torch.nn.GELU
(#69968) torch.nn.quantized.functional.hardsigmoid
supports aninplace
flag (#65740)
- Workflow improvements
- FX graph mode quantization: enable
torch.nn.Linear + torch.nn.BatchNorm1d
fusion for PTQ (#66484) - Added an option in
torch.ao.quantization.quantize_fx.convert_fx
to acceptqconfig_dict
to skip quantization (#66878) - Added
torch.nn.qat.dynamic.modules.Linear
module (#67325) - Added
torch.nn.ConvTranspose{n}d + torch.nn.BatchNorm{n}d
fusion support (#70022) - Extended
torch.ao.quantization.prepare_qat
withallow_list
argument, to allow custom mapping and custom QAT module (#65119) - Added
torch.ao.quantization.default_replay_qconfig
which allows observer reuse fortorch.reshape
in FX graph mode quantization (#69249)
- FX graph mode quantization: enable
ONNX
- Set
ir_version
of the exported model based onopset_version
. This increases the odds that the exported ONNX model will be usable. Before this change, we were setting the IR version to a hard-coded value which may be higher than what the model consumer supports. (#67803) - Preserved op input names when op just passes through the input to the output (#67275)
- Shape inference improvements:
- Included op type in exported models’ input and output names (#68976)
- Supports Conv-BatchNorm fusion inside blocks (#67272)
- Exported
torch.reciprocal
to ONNX Reciprocal operator instead ofDiv(1, x)
(#67271) - Supports
beta!=1
in softplus (#66146) - Added warning for inplace updates on
tensor.shape
in tracing mode (#66142) - Supports
instance_norm
in training mode (#64375) - Allow registration of custom symbolics for ops specifying aten namespace (i.e.
aten::foo
is allowed as well as “foo”). (#67810) - Allow registration of custom symbolics for
prim
namespace (#66139) - Supports dynamic inputs for
OneHot
, bool forEinsum
(#66147)
Infra (Releng)
- Build with BUILD_SPLIT_CUDA for all 11.X Windows builds (#70899)
torch.package
- Add ability to retrieve the dependency graph via
all_path
function(#65602) - Add support for pickle v4 (#70642)
- Add better testing support for Package Exporter (#70641)
Bug fixes
Python API
- Fixed scalar inputs for aliased binary ops {
multiply
,subtract
,divide
} (#65937) - Fixed
torch.save
when saving storages that view same data with different type (#66949) - Fixed
torch.save
error if storages are unallocated (#68787) - Fixed
k
out-of-bounds intorch.kthvalue
(cpu kernel) (#68863) - Fixed
inference_mode
decorator:with inference_mode(mode=False)
used to ignore themode
argument and always set inference mode. (#68617) - Fixed
cdist_backward
in the case whencdist
inputs are not contiguous (#70016) - Fixed
cdist
error message typo (#70178) - Fixed
scatter
for empty indexes (#70662) - Fixed
torch.{unique, unique_consecutive}
out of bound (#71540) - Fixed
torch.isin
in the case when inputs are non-contiguous on CPU (#70659) - Fixed
hsplit vsplit dsplit
crash when section is 0 (#69342) - Fixed:
torch.gradient
ignores dim argument when checking edge_order (#67926) - Fixed:
TransformedDistribution.icdf
should perform validation after applying the inverse transformation rather than before. (#71393) - Fixed
torch.all and torch.any
internal assert error with requires_grad=True (#65714) - Fixed
torch.logsumexp
type promotion: promote integral inputs to floating for(#63393)
C++ API
- Fixed libtorch
at::Tensor::print()
linking error (#69615) - Avoided UB when indexing into size-0 tensors (#65878)
- Fixed an ICE when compiling PyTorch from source on MacOS with clang-1300 (#65655)
Autograd
- Fixed autocast state propagation in the
torch.utils.checkpoint
API (#71169) - Fixed
torch.nn.functional.conv_transpose3d
backward when grad_out is non-contiguous (#67829) - Forward mode AD:
- Fixed a case where forward AD in-place-over-view silently copies the view (#67816)
- Fixed deadlock in forward AD for functions that return multiple outputs (#67995)
- Fixed forward AD codegen for functions that have multiple formulas (#68535)
- Fixed deadlock when forward and backward AD are used at the same time (#67360)
- Fixed
Tensor.copy_
forward AD to handle broadcasting (#69592) - Do not generate not_implemented error for forward AD when input with tangent passed to non-differentiable function (#66926)
- Fixed
autograd.Function
when non-Tensor argument precedes tensor argument (#71530) - Fixed
autograd.Function
forward AD when forward is a no-op to no longer raise an internal error (#71531)
Build
- Stopped reporting CPU Capability as AVX512 on machines with AVX512 support but without AVX512 kernels (#66703)
- Disabled SVE when cross-compiling for M1 (#67114)
- Added failure if
pocketfft
is not found andat_mkl
is not enabled (#67909) - Fixed clang issues when compiling with
_GLIBCXX_USE_CXX11_ABI
(#72081)
Complex Numbers
- Fixed
torch.autograd.gradcheck
to generate valid inputs for forward AD computation for complex functions (#68001) - Fixed
torch.Tensor.copy_
transpose path for tensors with conjugate or negative bit set (#69026) - Fixed
torch.Tensor.copy_
behavior for the case when two conjugated or negated tensors of the same dtype (one or both of which are non-contiguous) are copied into each other (#68963)
Dataloader
- Made
ProcessException
picklable (#70118) - Fixed persistent worker exiting before
pin_memory_thread
(#71579)
torch.nn
nn.AdaptiveAvgPool*d
: Throws an error for negativeoutput_size
(#70488)nn.Conv1d
: Fixed for 1D convolution on MKL-DNN backend (#68166)nn.CrossEntropyLoss
: Fixed for usage ofweight
,ignore_index
, andlabel_smoothing
together (#69511)nn.Fold
: Checked that block height and width are positive (#69048)nn.LayerNorm
: Fixed incorrect result on CUDA whengamma
orbias
are missing (#69210)nn.LayerNorm
: Avoided overflow by doing computation infloat
forhalf
(#66920)nn.Module
: Throws a proper error message fromload_state_dict
for non-tensor values (#70596)nn.ModuleList
: Fixed incorrect return type in__getitem__
(#69083)nn.MultiheadAttention
: Used query dtype for mask type (#68077)nn.NLLLoss
: Fixed backward computation with negative weights (#64572)nn.{RNN, GRU}
: Fixed RNN modules with input shapes containing-0 in CUDA (#71696)nn.utils.rnn.pad_sequence
: Fix regression to support tuples for padding (#72436)optim._LrScheduler
: Fixed print formatting (#68338)optim.ChainedScheduler
: Fixedget_last_lr()
(#69112)optim.CosineAnnealingWarmRestarts
: Fixed ordering bug whenlast_epoch > 0
(#64758)optim.SequentialLR
: Updated_last_lr
on step (#70558)
torch.fx
- Supported
torch.layout
as arg (#66048) - Specified a default value when possible for placeholders created from
concrete_args
(#59569) - Fixed issue where
GraphModule.delete_all_unused_submodules
deletes submodules from called leaf modules (#66430) - Fixed
torch.fx.subgraph_rewriter.replace_pattern
mechanism so that multiple one-liner instances of the pattern are captured correctly (#66442) - Fixed bug in graph matcher that caused certain nodes to be matched twice (#69238)
- Ensured node stack trace survives copying (#69368)
- Fixed
to_folder
not saving dtype (#69983) - Added a
default_value
arg tofx.Graph.placeholder
and fixsplit_module
(#71016)
Sparse
- Fixed CSR storage access to throw when used (#70072)
- Fixed multiplication of 0-D sparse tensors (#70749)
- Fixed result dtype for neg if given sparse Tensor (#68885)
CUDA
- Fixed CUDA vs CPU consistency for index_put_ when accumulating (#66790)
- Fixed CUDA vs CPU consistency for index_put_ when accumulating (part 2) (#67189)
- Fixed error in warning about unsupported GPU (#67900)
- Disabled TF32 in
pinv_jvp
andpinv_backward
(#67948) - Fixed DLPack CUDA stream convention (#67618)
- Sets device guard in
_cudnn_impl
functions (#70406) - Fixed
mem_get_info
when querying on a device other than the current device (#69640)
Benchmark
- Fixed divide-by-zero errors in
torch.utils.benchmark.Timer
(#70050)
Dispatcher
- Added explicit
OperatorHandle
destructor, so that the symbol shows up in windows builds (#70033)
Profiler
Visualization
- Fixed
torch.utils.tensorboard
parsing JIT graph incorrectly (#65692)
Vulkan
- Greatly reduced memory usage of the Vulkan backend by updating the configuration of the Vulkan Memory Allocator (#69088)
- Addressed several warnings raised by the Vulkan Validation layers:
Mobile
- Fixed quantized logistic converter for
NNAPI
(#70847) - Fixed potential crash if
MTLCreateSystemDefaultDevice
returns nil (#66859) - Used full name to look for the promoted prim operator table (#66081)
- Fixed function name bug in mobile export (#66915)
- Fixed issues with
irange
not having a header included inMetal
(#66877) - Fixed backward compatibility issue for UnionType on mobile in
type_parser
. (#71341) - Fixed forward flatbuffer type handling with dynamic type in
flatbuffer_loader
. (#71500) - Fixed type equalities issue in
pytorch_jni_common
(#71508) - Fixed missing properties to the executor in
CoreML
(#67737) - Fixed memory computation when both constants and data tensors are present in model_dump (#66006)
- Ensured that function participating in bundled inputs have their “name" attribute set (#65856)
Distributed
torch.distributed
- Fixed bug on empty
GLOO_SOCKET_IFNAME_ENV
(#68933)
- Fixed bug on empty
DistributedDataParallel
- Fixed “Cannot modify in-place due to DDPSink” (#66015)
torch.distributed.elastic
- Fixed scale down bug caused by calling
rdzv_handler.shutdown()
on premature agent failures (#67749)
- Fixed scale down bug caused by calling
TorchScript
- Fixed a race condition in the JIT interpreter when unpickling source ranges (5525e9a591)
- Fixed a ref counting loop for
CompilationUnit
, resulting in memory leaks when class objects were in JIT graphs. (#65442) - Fixed bug where output type was discarded after calling SubgraphRewriter in C++ (#65453)
- Fixed bug where
torch.jit.optimize_for_inference
did nottorch.jit.freeze
a module when passed a a non-frozen module (#71436) - Fixed bug where running module.forward() on a
torch.jit.freeze
ed module ran the wrong graph (#68316) - Fixed bug where alias analysis in the JIT optimizer was incorrect for the int[] version of
torch.split
, resulting in invalid optimizations in various JIT optimization passes (#69745) - Fixed places where using
torch.autocast
together with autodiff (module.backwards()) in a JIT graph had the wrong number of arguments and would error out. (#67648) - Forbid propagating gradients through views in JIT graphs as currently it is broken (#67732)
- Fixed bug where graph input types were incorrect after running
torch.jit.trace
(#68242) - Fixed case where BroadcastMKLDNN breaks the stack invariant by pushing more than 2 tensors to the stack for when
torch.jit.freeze
ops are converted to MKLDNN(#66628) - Raised error instead of segfaulting when passing None into torch.jit.Graph.create (#68253)
- Raised error instead of crashing when a JIT ScriptFunction is pickled with an incompatible Python
pickle
version.(#69807) - Fixed bug where
torch.jit.script
fails when comments in function has less indent than surrounding code (#70227) - Fixed incorrect device type when torch.device is called inside scripted (
torch.jit.script
) code (#69645) - Fixed warning: overloaded virtual function
torch::jit::Function::call
is only partially overridden in classtorch::jit::GraphFunction
(4bf1be898d)
Quantization
- Fixed applying non-zero offset 1 to null pointer in
torch.nn.functional.interpolate
for quantized tensors (#65570) - Doesn't assume bias is a keyword argument to
torch.nn.Conv{n}d
(#61647, #71426) - Made error message when trying to use
torch.quantize_per_tensor
on non floats more specific (#66050) - Quantized
torch.nn.Embedding
conversion with unsupported dtype: make error message clearer (#66051) - Fixed
torch.nn.qat.EmbeddingBag
from_float error message (#66989) - Fixed bug enforcing quant_min <= zero_point <= quant_max for float zeropoint in
torch.nn.Embedding
QAT (#68852) - Fixed scale+zp serialization of
torch.nn.quantized.BatchNorm{2|3}d
(#70432) - Fixed
torch.nn.Dropout
in FX graph mode quantization (#71043, #71438) - Fixed
qconfig
setting for fused modules in FX graph mode quantization (#71254) - Removed assumption number of rows is in 32 bit in fbgemm (#69066)
- Fixed
reduce_range
warning when using default observers (#71027)
ONNX
- Doesn’t create invalid
index_select
op when constant folding through ONNX Gather with indices rank > 1. Fixes export of some uses of Embedding. (#68493) - Shape inference:
- Fixed inplace
fill_
dtype export mismatch (#64580) - Fixed
remainder
(#64578) - Fixed
reciprocal
when input is not floating point (#67808) - Fixed
new_full
andfull_like
for Python 3.9 (#67806) - Fixed reduce ops on
binary_cross_entropy_with_logits
(#67805) - Propagated node metadata across passes (#45256)
- Ensured outputs don’t have the same name (#66137)
- Fixed
pad
with sequence inputs (#64377) - Fixed
instance_norm
withtrack_running_stats=True
(#64375) - Fixed
all
andany
withdim
arg (#67270) - Allows autograd functions (
prim::PythonOp
) to be exported withOperatorExportTypes.ONNX_FALLTHROUGH
(#67273)
torch.package
- Prevent import race condition that leaves
torch.package.PackagePickler
with unwanted dispatch table entries. (#71025)
Performance
Python API
- Speed up pickling for
torch.dtype
(#65182) - Speed up
histogram
: avoid index_put_ overhead in histogram kernel's inner loop (#67815) - Speed up
torch.topk
with sort for some cases (#68632) - Speed up
torch.stack
: don't unsqueeze every stack arg if possible (#70288) - Speed up
LayerNorm
4-5% (#71423) - Speed up structured kernels: fix some unnecessary refcount bumps (#71140)
- Speed up
indexing
functions: release GIL in a few places (#71728) - Speed up
torch.empty
a bit: define check_sizes_nonnegative as inline (#71640) - Speed up
XLA
tensor printing by reducing compilations (#71147)
C++ API
- Updated
c10::SmallVector
from LLVM (#69110) - Reduced some framework overhead in
at::copy_()
(#68950) - Reduced some overhead in
StorageImpl::set_data_ptr
(#65432) - Improved
IValue
performance for tuples by inlining tuple storage (#64066)
Autograd
- Stopped materializing Tensors full of 0s in forward AD when possible (#64837)
- Rewrote the backward of
linalg.lu
andlinalg.lu_solve
to uselinalg_solve_triangular
(#63569) - Updated
nn.functional.grid_sample
backward to compute input gradient only if required (#66069, #66070) - Stopped erroneously saving the output of
torch.softplus
for backward (#70296)
Complex Numbers
- Release GIL when assigning to real or imaginary components of a complex tensor (#71747)
- Restored conjugate and negative bits of a tensor when calling
repeat_interleave
(#68523)
CUDA
- Used a better hash table in
CUDACachingAllocator
(#71667) TopK
CUDA Optimization: used multiple block per slice (#71081)- Removed sync in
Embedding
caused byunique
(#66091) EmbeddingBackward
exclusive_scan thrust->cub (#66566)sort_out_cuda
: Used custom kernels to fill index tensors (#66668)masked_scatter
: fuse mask count check into one kernel (#66871)- Enabled better depthwise conv perf on cudnn 8.2+ (#58749)
- Improved native
layer_norm
forward perf (#67977) - Improved native
layer_norm
backward perf (#68238) - Fast path for size 0 GPU host malloc (#68532)
- Alternative implementation of CUDA pinned memory allocator focusing on multi-threaded scalability (#69299)
- Used legacy unrolled kernel for non-trivial offset calc cases (#71710)
- Removed
call_once
fromCUDACachingAllocator
(#71668) - Reworked stat collection in
CUDACachingAllocator
(#71669) - Fixed CUDA
LpNormFunctor
(#70601)
Dispatcher
- Made
c10::KernelFunction
struct smaller, which should reduce some memory usage by the dispatcher (#65618)
torch.fx
- Made
torch.fx.symbolic_trace
reuse buffers if they're the same (#66211)
Profiler
Mobile
- Reduced PyTorch Library startup time by 40% for mobile and edge deployments(#65735, #65732, #65939, #66112, #66064, #66131)
- Reduced PyTorch Library heap memory utilization by 40% for mobile and edge deployments(#65732, #66112, #66064, #66131)
- Improve efficiency of IValue and reduce overhead in code paths that use IValue and perform Type Parsing (#65710, #64278, #66717, #65381, #66134, #65951, #70477)
TorchScript
- Improved performance of autodiff on small JIT graphs (#71666)
- Enabled autocasting of tensors between fp16, bfloat 16 and fp32 in torchscript models (#63939, #67707)
- Enables optimizations in more gradSumToSize cases in the JIT Autograd support(#63941)
- In Unpickling a JIT graph, avoid reading file from a stream for 0 byte tensor storage(#67787)
Quantization
- Sped up quantized
torch.nn.functional.interpolate
for channels last (#66525) - Sped up
torch.nn.functional.upsample
for channels last (#70903) - Parallelized computation in
torch.quantize_per_tensor_affine
andtorch.dequantize
(#65845)
Documentation
Python API
- Added docs for
torch.adjoint
. (#68869) - Clarified difference in behavior of
empty_strided
andas_strided
(#64568) - Added some missing generated doc entries (
torch.select
,torch.slice_scatter
,torch.diagonal_scatter
,torch.select_scatter
) (#69030),histogramdd
(#68273) - Typo and formatting fixes.
LinearLR
(#67840),torch.any
(#65310, #70187),torch.futures
(#70630), jit docs (#68557),Tensor.type
(#67019),torch.lobpcg
(#71464),Tensor.triu()
,Tensor.tril()
,Tensor.ravel()
. (#71057),torch.acosh
(#66814), (#70439) - General Doc improvements for individual ops.
torch.finfo
(mentiontorch.bfloat16
) (#68496),torch.quantile
interpolation kwarg (#70637),from_dlpack
andto_dlpack
(#70437),set_printoptions
added examples (#68324),index_add
(#65806), topk doc (#65938),unique
(#66132),chi2
(#67379),torch.histc
(#64191),empty
andempty_like
(#68874),torch.cholesky_inverse
(#69069),torch.dsplit
(#70557) - Changed README getting started link to explicit instructions (#66828)
- Modernized and clarified docs for
torch.tensor
andtorch.as_tensor
(#63308) - Improved
torchhub
docs (#69970) - Updated docs for
torch.Tensor.real
to indicate that it's supported for real tensors (#71962)
C++ API
- Fixed typos in ATen README (#69170)
- Mentioned
TORCH_SHOW_CPP_STACKTRACES
inContributing.md
docs (#64052) - Updated link to C++ frontend examples (#66095)
- Added docs for Visual Studio extension (#63944)
- Added docs about an issue with compiling C++ extensions with CUDA 11.5 and Windows (#73013)
Autograd
- Updated docs for forward AD and make them public (#71643, #71159)
- Updated “Extending PyTorch” doc to cover forward AD (#66962)
- Fixed broken code syntax in autograd.rst (#69362)
- Fixed incorrect variable in autograd docs (#70884)
- Fixed typo in
torch.autograd.Function
docs that prevented it from compiling (#66754)
Dataloader
- Added docstring for
default_collate
anddefault_convert
(#69862) - Updated the documentation for AMP with DataParallel (#69218)
torch.nn
F.binary_cross_entropy
: Updated examples to avoid deprecated calls (#69816)F.linear
: Fixed shape docs to indicate no-batch-dim support (#66884)F.max_pool*d
: Added functional docs (#63264)F.multilabel_soft_margin_loss
: Added reduction args to signature (#70420)nn.AdaptiveLogSoftmaxWithLoss
: Fixed typo inlog_prob
name (#68926)nn.{BatchNorm1d, InstanceNorm1d}
: Fixed input shape notation inconsistencies (#71371)nn.CrossEntropyLoss
: Corrected typo in formula for class probability targets (#70220)nn.{ELU, Hardshrink, Hardsigmoid, MultiHeadAttention, Softplus, Tanh}
: Made first line of docstring readable for overview docs (#70574, #71012, #70987, #71100, #70576, #70577)nn.Flatten
: Simplified example code (#67472)nn.{Hardsigmoid, Hardswish, Mish, RReLU, SiLU}
: Added activation function images (#65415)nn.KLDivLoss
: Fixed rendering ofreduction
arg (#66583)nn.KLDivLoss
: Rewrote docs to clarify math (#67443)nn.MaxUnpool2d
: Changed misleading example to better demonstrateoutput_size
usage (#68936)nn.Module
: Added note describing requiredsuper().__init__()
call (#66909)nn.Module
: Changedsuper()
usage to Python 3 syntax in example (#65748)nn.Module
: Fixed formatting fornamed_modules()
(#70491)nn.NLLLoss
: Corrected default value forreduce
(#68426)nn.SmoothL1Loss
: Clarified equivalence withnn.L1Loss
whenbeta == 0
(#70673)nn.{TransformerDecoderLayer, TransformerEncoderLayer}
: Clarified defaultbatch_first=False
dimension format (#66574)nn.Upsample
: Indicated thatalign_corners
takes effect inbicubic
mode (#66756)nn.utils.clip_grad_norm_
: Fixed rendering ofparameters
inerror_if_nonfinite
arg docs (#69958)optim.Adam
: Fixed formatting (#70387)optim.AdamW
: Fixed formula (#68587)optim.RAdam
: Corrected default value oflr
arg (#69186)- Removed orphan from cuDNN persistent note (#65160)
- Updated link to tutorial on defining NN modules (#65534)
nn.{AvgPool1d, AdaptiveAvgPool3d, MultiMarginLoss, PairwiseDistance, TripletMarginLoss}, ``F.{conv3d, conv_transpose3d, fold, linear}
: Fix doc formatting regressions from no-batch-dim support (#73014)
torch.fx
- Fixed for retracing documentation which would break for n-ary operators (#71599)
- Updated
torch.fx.passes.split_module
docstring (#65542) - Updated
fx.rst
example outputs (#68043) - Added document gotcha about training flag (#68915)
- Defined
get_dot_``graph
to match documentation (#70541)
Sparse
- Updated sparse.rst to warn about _values() (#71088)
CUDA
- Updated Stream
wait
documentation to reference underlyingcudaStreamWaitEvent
call (#67973) - Documented
torch.cuda.ExternalStream
,torch.cuda.caching_allocator_alloc
andtorch.cuda.caching_allocator_delete
(#70126) - Updated
CUDA Graphs
docs: Fixedmake_graphed_callables
example typos (#69379)
Mobile
- Added user facing documentation for tracing-based selective build mobile interpreter in Android and iOS (#1709)
- Added recipe for bundled inputs in TorchScript models (#1524)
Distributed
DistributedDataParallel
torch.distributed
torch.distributed.elastic
- Made --max_restarts explicit in the quickstart and runner docs (#65838)
torch.distributed.optim
- Rendered
torch.distributed.optim
members (#67885)
- Rendered
torch.distributed.rpc
- Deleted distributed optimizer section from RPC and add reference to namespace docs page (#68068)
TorchScript
- Added
typing.Union
to supported types in documentation (#68435) - Added documentation to
torch.jit.is_tracing()
(#67326) - Fixed typos in
jit_language_reference.rst
(#68706)
Quantization
- Added documentation for quantized model save/load instructions (#69789)
- Updated link to qnnpack in quantization doc. (#66226)
- Improved quantization API docs (#66379)
- Quantization docs: add pages for Numeric Suite (Eager and FX) (#66380)
- Documented the quantization custom module APIs (#67449)
- Improved quantization documentation (#68907)
ONNX
- Improved documentation of
operator_export_type
andopset_version
args (#69549) - Fixed documentation for
do_constant_folding
arg default (#71348) - Documented
ExportTypes
,CheckerError
, andunregister_custom_op_symbolic
(#68489) - Fixed link to ONNX Runtime custom op documentation (#67944)
- Added section “Discovering all unconvertible ATen ops at once” (#66143)
- Fixed typos (#66090)
- Documented work-arounds for indexing export limitations, and improve error messages (#64579)
torch.package
- Add some docs describing how to debug
torch.package
dependencies (#65704)
Download Release
This release has 2 assets:
- pytorch-v1.11.0.tar.gz
- Source code (zip)
- Source code (tar.gz)
Visit the release page to download them.
Have any questions?
Contact Exxact Today