PyTorch is a widely used open source deep learning platform for writing neural network layers in Python, enabling a seamless workflow from research to production. Based on Torch, PyTorch has become a powerful machine learning framework favored by researchers around the world.
Here is the newest PyTorch release v1.4.0 featuring mobile build customization, distributed model parallel training, Java bindings, and many more new features.
Repository: pytorch/pytorch · Tag: v1.4.0 · Commit: 7f73f1d · Released by: nairbv
PyTorch 1.4.0 Release Notes
The PyTorch v1.4.0 release is now available.
The release contains over 1,500 commits and a significant amount of work across existing areas such as JIT, ONNX, Distributed, Performance, and the Eager Frontend, as well as improvements to experimental areas like mobile and quantization. It also contains new experimental features, including RPC-based model parallel distributed training and language bindings for Java (inference only).
PyTorch 1.4 is the last release that supports Python 2. For the C++ API, it is the last release that supports C++11: you should start migrating to Python 3 and building with C++14 to make the future transition from 1.4 to 1.5 easier.
Table of Contents
- Highlights
- Backwards Incompatible Changes
- Python
- JIT
- C++
- New Features
- torch.optim
- Distributed
- RPC [Experimental]
- JIT
- Mobile
- Improvements
- Distributed
- JIT
- Mobile
- Named Tensors
- C++ API
- AMD Support
- ONNX
- Quantization
- Visualization
- Other Improvements
- Bug Fixes
- Distributed
- RPC
- C++ API
- JIT
- Quantization
- Mobile
- Other Bug fixes
- Deprecations
- Performance
Highlights
PyTorch Mobile - Build level customization
Following the experimental release of PyTorch Mobile in the 1.3 release, PyTorch 1.4 adds additional mobile support including the ability to customize build scripts at a fine-grain level. This allows mobile developers to optimize library size by only including the operators used by their models and, in the process, reduce their on-device footprint significantly. Initial results show that, for example, a customized MobileNetV2 is 40% to 50% smaller than the prebuilt PyTorch mobile library. Learn more about how to create your own custom builds, and please engage with the community on the PyTorch forums to provide any feedback you have.
Distributed Model Parallel Training [Experimental]
With the scale of models, such as RoBERTa, continuing to increase into the billions of parameters, model parallel training has become ever more important to help researchers push the limits. This release provides a distributed RPC framework to support distributed model parallel training. It allows for running functions remotely and referencing remote objects without copying the real data around, and provides autograd and optimizer APIs to transparently run backwards and update parameters across RPC boundaries.
To learn more about the APIs and the design of this feature, see the links below:
For the full tutorials, see the links below:
- A full RPC tutorial
- Examples using model parallel training for reinforcement learning and with an LSTM
As always, you can connect with community members and discuss more on the forums.
Java bindings [Experimental]
In addition to supporting Python and C++, this release adds experimental support for Java bindings. Based on the interface developed for Android in PyTorch Mobile, the new bindings allow you to invoke TorchScript models from any Java program. Note that the Java bindings are only available for Linux for this release, and for inference only. We expect support to expand in subsequent releases. See the code snippet below for how to use PyTorch within Java:
Learn more about how to use PyTorch from Java here, and see the full Javadocs API documentation here.
Pruning
Pruning functionalities have been added to PyTorch in the `nn.utils.prune` module. This provides out-of-the-box support for common magnitude-based and random pruning techniques, both structured and unstructured, both layer-wise and global, and it also enables custom pruning from user-provided masks.
To prune a tensor, first select a pruning technique among those available in `nn.utils.prune` (or implement your own by subclassing `BasePruningMethod`).
import torch
from torch.nn.utils import prune
t = torch.rand(2, 5)
p = prune.L1Unstructured(amount=0.7)
pruned_tensor = p.prune(t)
To prune a module, select one of the pruning functions available in `nn.utils.prune` (or implement your own) and specify which module and which parameter within that module pruning should act on.
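from torch import nn
m = nn.Conv2d(3, 1, 2)
# amount is the number of channels to prune along dim; it cannot exceed m.weight.shape[1] == 3
prune.ln_structured(module=m, name='weight', amount=2, n=2, dim=1)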
Pruning reparametrizes the module by turning `weight` (in the example above) from a parameter to an attribute, and replacing it with a new parameter called `weight_orig` (i.e. appending `"_orig"` to the initial parameter name) that stores the unpruned version of the tensor. The pruning mask is stored as a buffer named `weight_mask` (i.e. appending `"_mask"` to the initial parameter name). Pruning is applied prior to each forward pass by recomputing `weight` through a multiplication with the updated mask using PyTorch's `forward_pre_hooks`.
Iterative pruning is seamlessly enabled by repeatedly calling pruning functions on the same parameter (this automatically handles the combination of successive masks by making use of a `PruningContainer` under the hood).
`nn.utils.prune` is easily extensible to support new pruning functions by subclassing the `BasePruningMethod` base class and implementing the `compute_mask` method with the instructions to compute the mask according to the logic of the new pruning technique.
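As a minimal sketch of the extension point described above, the following defines a custom unstructured pruning method that masks out every other weight. The class name `EveryOtherPruning` and the helper `every_other_unstructured` are hypothetical names chosen for illustration; `BasePruningMethod`, `PRUNING_TYPE`, `compute_mask`, and `apply` come from `torch.nn.utils.prune`.

```python
import torch
from torch import nn
from torch.nn.utils import prune


class EveryOtherPruning(prune.BasePruningMethod):
    """Hypothetical pruning method: masks out every other entry of the tensor."""

    # Acts on individual entries rather than on whole channels.
    PRUNING_TYPE = "unstructured"

    def compute_mask(self, t, default_mask):
        # Start from the existing mask so that repeated pruning composes.
        mask = default_mask.clone()
        mask.view(-1)[::2] = 0
        return mask


def every_other_unstructured(module, name):
    """Hypothetical helper: applies EveryOtherPruning to module.<name> in place."""
    EveryOtherPruning.apply(module, name)
    return module


layer = every_other_unstructured(nn.Linear(4, 3), name="weight")

# The unpruned values live in the parameter `weight_orig`;
# the mask is stored as the buffer `weight_mask`.
print(sorted(n for n, _ in layer.named_parameters()))
print(sorted(n for n, _ in layer.named_buffers()))
```

After the call, `layer.weight` is a plain attribute equal to `layer.weight_orig * layer.weight_mask`, which matches the reparametrization described above.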
Backwards Incompatible Changes
Python
`torch.optim`: Using `Scheduler.get_lr()` to obtain the last computed learning rate is no longer supported. To get the last computed learning rate, call `Scheduler.get_last_lr()` instead. (26423)
Learning rate schedulers are now “chainable,” as mentioned in the New Features section below. `Scheduler.get_lr` was sometimes used for monitoring purposes to obtain the current learning rate. But since `Scheduler.get_lr` is also used internally for computing new learning rates, it actually returns a value that is “one step ahead.” To get the last computed learning rate, use `Scheduler.get_last_lr` instead.
Note that `optimizer.param_groups[0]['lr']` was in version 1.3.1, and remains in 1.4.0, a way of getting the current learning rate used in the optimizer.
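The two monitoring options mentioned above can be compared side by side. This is a small sketch, assuming a single dummy parameter and an `ExponentialLR` scheduler with `gamma=0.5` (both chosen only for illustration):

```python
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import ExponentialLR

# A single dummy parameter is enough to drive the optimizer.
params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = SGD(params, lr=0.1)
scheduler = ExponentialLR(optimizer, gamma=0.5)

for _ in range(2):
    optimizer.step()   # step the optimizer before the scheduler
    scheduler.step()

# Last learning rate computed by the scheduler.
last_lr = scheduler.get_last_lr()[0]
# Learning rate currently stored in the optimizer.
current_lr = optimizer.param_groups[0]["lr"]
print(last_lr, current_lr)
```

Both values agree here, which is why either can be used for logging, while `get_lr()` should be left to the scheduler's internals.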
`Tensor.unfold` on a 0-dimensional Tensor now properly returns a 1-dimensional Tensor.
Version 1.3.1 | Version 1.4.0 |
---|---|
`torch.tensor(5).unfold(dimension=0, size=1, step=1)` returns `tensor(5)` | `torch.tensor(5).unfold(dimension=0, size=1, step=1)` returns `tensor([5])` |
`torch.symeig` now returns a 0-element eigenvectors tensor when `eigenvectors=False` (the default).
Version 1.3.1 | Version 1.4.0 |
---|---|
`torch.symeig(torch.randn(3,3)).eigenvectors.shape` returns `torch.Size([3, 3])` | `torch.symeig(torch.randn(3,3)).eigenvectors.shape` returns `torch.Size([0])` |
JIT
- Make `torch.jit.get_trace_graph` private (it is now `torch.jit._get_trace_graph`) (29149)
  - This function was intended only for ONNX integration; use `traced_module.graph` instead, like:
    traced_module = torch.jit.trace(my_module, example_inputs)
    traced_graph = traced_module.graph
- `@property` on `ScriptModule`s has been disabled (28395)
  - Scripted `@property` accesses were silently broken before, where we would evaluate the `get` function once and store that as the attribute permanently. They properly error now; a workaround is to make your `@property` a regular method.
- Custom ops: `torch::jit::RegisterOperators` has been removed, use `torch::RegisterOperators` instead (28229). The usage and behavior should remain the same.
- Remove `torch.jit._register_*` bindings from Python (e.g. `torch.jit._register_attribute`). These were private functions that were not intended to be used. (29499)
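For the `torch.jit.get_trace_graph` change listed above, a self-contained version of the suggested replacement might look as follows. `TinyModule` is a hypothetical module defined only so there is something to trace.

```python
import torch
from torch import nn


class TinyModule(nn.Module):
    """Hypothetical module used only to have something to trace."""

    def forward(self, x):
        return torch.relu(x) + 1


example_inputs = torch.randn(2, 3)
traced_module = torch.jit.trace(TinyModule(), example_inputs)

# Public way to reach the traced graph, replacing the now-private
# torch.jit._get_trace_graph.
traced_graph = traced_module.graph
print(traced_graph)
```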
C++
[C++] The distinction between Tensor and Variable has been eliminated at the C++ level. (28287)
This change simplifies our C++ API and matches previous changes we did at the Python level that merged Tensors and Variables into a single type.
This change is unlikely to affect user code; the most likely exceptions are:
- Argument-dependent lookup for `torch::autograd` may no longer work. This can break because Variable is now defined as an alias for Tensor (`using Variable = Tensor;`). In this case, you must explicitly qualify the calls to `torch::autograd` functions.
- Because `Variable` and `Tensor` are now the same type, code which assumes that they are different types (e.g., for the purposes of templating, or `std::enable_if` checks) will not work until you delete the (now) redundant overload/specialization.
- Some operators may trace differently. If this happens, please file a bug. The most likely situations are:
  - There are now more operations in your trace than before (usually, calls to `aten::empty`)
  - There are now fewer operations in your trace than before (e.g., the trace complains that `"there is no observable dependence"` with the inputs)
[C++] Arguments in `torch::nn::LinearOptions` are renamed to match the Python API. (27382)
- Arguments that are renamed:
  - `in` -> `in_features`
  - `out` -> `out_features`
  - `with_bias` -> `bias`
[C++] Arguments in `torch::nn::Conv{1,2,3}dOptions` are renamed to match the Python API. (28917) (29838)
- Arguments that are renamed:
  - `input_channels` -> `in_channels`
  - `output_channels` -> `out_channels`
  - `with_bias` -> `bias`
[C++] `torch::nn::Conv{1,2,3}dOptions` no longer has the `transposed` argument. (31005)
- If users have `transposed` originally set to `true` in `torch::nn::Conv{1,2,3}dOptions`, they should migrate their code to use `torch::nn::ConvTranspose{1,2,3}d` layers instead.
[C++] All Reduction enums for `torch::nn` layers and functionals are changed to have `torch::KEnumNAME` syntax. (27942, 26837)
- Example: previously, to specify “mean” as the reduction method in a torch::nn layer or functional, we would use `torch::Reduction::Mean`. Now, `torch::Reduction::Mean` has been renamed to the shorter `torch::kMean`.
[C++] `torch::tensor` constructor is improved to match Python API behavior. (28523) (29632) (29066)
- Shape checking fixes
  - Example 1: previously, `torch::tensor({{1}, {2}})` produced a tensor of sizes `{2}`. Now, it produces a tensor of sizes `{2, 1}`.
  - Example 2: previously, `torch::tensor(1.1)` produced a 1-dim tensor. Now it produces a 0-dim tensor.
- Type inference improvements
  - Example 1: previously, C++ `torch::tensor` with a double (e.g. `torch::tensor(1.1)`) or a (nested) braced-init-list of doubles (e.g. `torch::tensor({{1.1, 2.2}})`) produced a tensor with dtype `torch::kDouble`. Now it produces a tensor with dtype `torch::get_default_dtype()`.
  - Example 2: previously, C++ `torch::tensor` with an integer type (e.g. `torch::tensor(1)`) or a (nested) braced-init-list of integer types (e.g. `torch::tensor({{1, 2}})`) produced a tensor with the same dtype. Now it always produces a tensor of dtype `torch::kLong` (aka. `int64_t`).
  - Example 3: previously, when passed a `TensorOptions` without a dtype set to the `torch::tensor` constructor, it always produced a tensor of dtype `torch::get_default_dtype()`. Now it produces a tensor of different dtypes based on the dtype of the braced-init-list and the default dtype.
- Passing a `std::initializer_list` (NOT braced-init-list) to `torch::tensor` will no longer compile, and the user should pass the equivalent braced-init-list to `torch::tensor` instead. For example, write `torch::tensor({1.1, 1.2})` instead.
[C++] Some activation modules’ `forward` functions now take `Tensor` instead of `Tensor&` as input. (28501)
`torch::nn` layers affected: `ELU` / `SELU` / `Hardtanh` / `LeakyReLU` / `ReLU` / `ReLU6` / `RReLU` / `CELU`
This change ensures that the above layers can be used in a `torch::nn::Sequential` module. If your C++ model uses any of the above layers, you must recompile your C++ code with the new libtorch binary.
New Features
torch.optim
Learning rate schedulers (`torch.optim.lr_scheduler`) now support “chaining.” This means that two schedulers can be defined and stepped one after the other to compound their effect; see the example below. Previously, the schedulers would overwrite each other.
>>> import torch
>>> from torch.optim import SGD
>>> from torch.optim.lr_scheduler import ExponentialLR, StepLR
>>>
>>> model = [torch.nn.Parameter(torch.randn(2, 2, requires_grad=True))]
>>> optimizer = SGD(model, 0.1)
>>>
>>> scheduler1 = ExponentialLR(optimizer, gamma=0.9)
>>> scheduler2 = StepLR(optimizer, step_size=3, gamma=0.1)
>>>
>>> for epoch in range(5):
...     print(epoch, scheduler2.get_last_lr()[0])
...     optimizer.step()
...     scheduler1.step()
...     scheduler2.step()
0 0.1
1 0.09000000000000001
2 0.08100000000000002
3 0.00729000000000002
4 0.00656100000000002
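The compounding effect can be checked with plain arithmetic. The sketch below is not PyTorch code; `chained_lr` is a hypothetical helper that assumes both schedulers are stepped once per epoch, so the exponential decay and the step decay simply multiply.

```python
def chained_lr(epoch, base_lr=0.1, exp_gamma=0.9, step_size=3, step_gamma=0.1):
    """Learning rate at the start of `epoch` when both schedulers step once per epoch."""
    # ExponentialLR contributes one factor of exp_gamma per completed step.
    exponential_factor = exp_gamma ** epoch
    # StepLR contributes one factor of step_gamma per `step_size` completed steps.
    step_factor = step_gamma ** (epoch // step_size)
    return base_lr * exponential_factor * step_factor


for epoch in range(5):
    print(epoch, chained_lr(epoch))
```

Up to floating-point rounding, the values match the chained output above: the rate decays by 0.9 every epoch and drops by an additional factor of 0.1 at epoch 3.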
Distributed
- Add `allgather_coalesced` API to `ProcessGroup` (28634, 29059).
- Add `abort` API in `ProcessGroupGloo` Send/Recv Work (29928).
- Add `--no_python` flag to allow using a bash script wrapper in the launch command (29144).
RPC [Experimental]
`torch.distributed.rpc` is a newly introduced package. It contains basic building blocks to run functions remotely in model training and inference, which will be useful for scenarios like distributed model parallel or implementing parameter server frameworks. More specifically, it contains four pillars: RPC, Remote Reference, Distributed Autograd, and Distributed Optimizer. Please refer to the documentation and the tutorial for more details.
- Add `rpc_sync` and `rpc_async` for builtin operators and Python user functions (23228, 23569, 28392).
- Add `remote` and `RRef` for builtin operators and Python user functions (25169, 25499).
- Distributed Autograd - FAST mode backward pass implementation. (27022, 27576).
- Integrate `remote` and `RRef` with distributed autograd (28630, 28656).
- Add a distributed optimizer (29304, 30062).
- Add Python API for `get_gradients()`, a method to retrieve gradients from the distributed autograd context. (28926).
- Support creating local `RRef`s on local values and to-self `remote` calls (28948, 29634).
- Support custom pickler for RPC (30185).
- Add default RPC agent options based on the backend type (30201).
- Add local `shutdown` to `ProcessGroup` agent (30330).
JIT
- `script::Module`: implement more of the nn.Module API (28828)
  - In particular, adds the (optionally recursive) methods that iterate over submodules, parameters, etc.
  - Adds a pybind-like `attr()` method to simplify attribute access.
- Add support for `@staticmethod` on `ScriptModule`s (27163)
- Support Module Containers as Iterables (26465)
- Support Iterables In List Comprehensions (26768)
- Dictionaries now preserve insertion order, and `OrderedDict` is supported (26465)
- Add support for `hasattr()` (29332)
- TorchScript classes can now be callable (26743)
- Add `clone_instance` for `ScriptModule`s (30168)
- Add `torch.memory_format` support to the TorchScript (28544)
- Custom `forward()` is now allowed on container modules (28988)
- Calls to submodules are now preserved in the traced graph (29261)
- Add support for module containers to be used as iterables (28255)
- Make JIT Serialization support arbitrary std::function<> IO (28039)
- Support `layout()` in script (27100)
- Methods and functions are no longer inlined in the serialized file format (26706)
Mobile
- Build level customization
Improvements
Distributed
Improvements
- Add timeout support in `ProcessGroupNCCL` (27224).
- Ensure that DDP wrapped module has parameters that require gradients (25858).
- Making `torch/csrc/cuda` NCCL usage safe for NCCL 2.5 (29014).
- Enable `test_distributed` for ROCm but only with NCCL backend (28814).
RPC Improvements
- Separate out RPC to `rpc_sync` and `rpc_async` APIs (26570).
- Make python user function serialization format to be consistent with builtin operators (27136).
- Clean up distributed autograd context on all participants on exit (27951).
- Improve error handling for distributed autograd engine. (27940).
- Scope pybind11 functions to `torch.distributed.{autograd,rpc}` (27529).
- Lift `rpc_timeout` to `RpcAgent` to make it reusable for other `RpcAgent` implementations. (29341).
- Support sending message to self in `process_group_agent` (29253).
- Properly shutdown RPC even in the case of `clean_shutdown=False`. (29148).
- Ensure `initializedContextIds_` map is cleaned up appropriately in distributed autograd engine. (29787).
- Add hash and equality operators for `WorkerInfo` (29958).
- Add `RpcAgentOptions` struct type to bundle arguments for different `RpcAgent`s (29972).
- Mark timeout `FutureMessage`s and throw exceptions in `ProcessGroupAgent` (29601).
- Re-throw python remote exception when using remote reference to itself (29930).
- By default ignore `RRef` leaks during shutdown (30217).
Documentation
- Add Design doc for Distributed Autograd Engine (29175, 30068, 29927)
- Add Design doc for Remote Reference (30066).
- Add documentation page for `torch.distributed.rpc` (29276, 28030, 29971, 30160, 30050, 30069, 30179, 30218, 30240, 30243, 30259).
MISC
- Add known worker IDs to distributed autograd context (26324).
- Minor tweaks to RPC message API (28326).
- Rename `PythonUDF{Call,Resp}` (27530).
- Use `std::shared_ptr` for `DistAutogradContext` (29770).
- Mark `c10d::~NCCLUtils` as noexcept (29118).
JIT
- Move custom passes to last optimization step (29256)
- Represent the original Python name of a module type the same way in traced and scripted modules. (29912)
- Only print original SourceRange on highlight (29708)
- Error message and ergonomic improvements:
- Show full call stack in TorchScript exception even when calls were inlined. (29911)
- Reduce error context from 10 -> 3 (26765)
- Fix error report highlight for unmatched type annotation (27195)
- Make default string arguments in schemas human readable (27088)
- Print which output didn't have dependence during trace checking. (29047)
- Improvements to save/load and serialization performance:
- Modules can now share JIT types if their implementation is the same, improving save/load performance (26666)
- Improve float pickling speed. (28553)
- Pickler: convert `std::stringstream` cases for improved performance. (29351)
- Buffer to speed Unpickler (27727)
- Buffer in Pickler to improve performance. (27720)
- In `torch::save()` avoid zip compressing small header records. (28180)
- String optimizations related to serialization. (28230)
- Clean up serialized source format (28129)
- API for finding a common ancestor block for a pair of nodes (28864)
- Make inserted child module names unique (27237)
- Better hashing for constant pool (27733)
- Improve error messages when a method or attribute is missing (27110)
- Display original source range in `Node::print` (27524)
- Always use the closure to resolve variable names (27515)
Mobile
- Improve Java API / JNI
- Add the module method to allow explicitly destructing native part (27090).
- Add methods to write image tensor content to buffer (27359).
- Various improvements to Android API (27454, 27455).
- Add support for PyTorch JNI build (29412, 42faf961c8, d22f61432d).
- Various fixes to PyTorch JNI (29350, 29861, 30206, 30207).
- Improve support for older Android NDK
- Improve error message, documentation, debuggability
- Improve support for benchmark and profiling
- Improve build / CI
- Improve Android Gradle build and publishing (26833, 27389, 29262, 29738).
- Misc fixes to the Android test project (27453).
- Improve XCode build script (27358, 28996, 29002).
- Add testing code to iOS CI jobs (27593, 27594, 27784, 30133).
- Misc fixes to the iOS TestApp (27591, 28356, 28809, 29247, 29962, 29963).
- Add support for host build to pytorch_android (27662,27664).
- Add host build Gradle publishing (29749).
- Add mobile build CI with host toolchain (30292).
Named Tensors
- `torch.addcdiv`, `torch.addcmul` Added named tensor support (28975).
- `torch.{ones,zeros,full,rand,randn}_like` Added named tensor support (28981).
- `torch.cdist` Added named tensor support (29129).
- `torch.equal` Added named tensor support (29322).
- Added named tensor support for comparison ops (27162).
- `Tensor.align_to` Fixed error message (27221).
- `Tensor.align_to` Make method-only. (27304).
- `Tensor.align_to` Accept partially named tensors (27308).
- `torch.mean(Tensor, Dimname)` Fixed autograd support (29199).
- `Tensor.unflatten` Fix when dim is a negative integer (#31208) (31432).
- Fix type errors in examples about Named Tensor (27828).
C++ API
New torch::nn modules
- Convolution layers
- Pooling layers
- Loss layers
- torch::nn::HingeEmbeddingLoss / CosineEmbeddingLoss / MultiMarginLoss (27101) (27345) (27424) (27770).
- torch::nn::TripletMarginLoss / SoftMarginLoss / MultiLabelMarginLoss / MarginRankingLoss / MultiLabelSoftMarginLoss (27713, 27956) (27660) (27659) (29000) (27669).
- torch::nn::MSELoss / KLDivLoss / BCELoss / SmoothL1Loss / PoissonNLLLoss / BCEWithLogitsLoss (27156) (28806) (30146) (27661) (28755) (28783).
- torch::nn::NLLLoss / CrossEntropyLoss / CTCLoss (29812) (28654).
- Normalization Layers
- Activation Layers
- torch::nn::ELU / LeakyReLU / SELU / PReLU / ReLU / ReLU6 / RReLU / CELU / GLU (27028) (27059) (27434) (27429) (27435) (27436) (27437) (27487) (29922).
- torch::nn::Sigmoid / LogSigmoid / LogSoftmax / Softmax / Softmax2d / Softplus / Softmin / Softsign / Softshrink / Hardshrink / Hardtanh / Tanh / Threshold (27488) (27060) (27462) (27446) (27509) (27489) (27459) (27535) (27534) (27035) (27537) (27038) (27536) (27538).
- Dropout Layers
- Padding Layers
- Embedding layers
- torch::nn::Embedding / EmbeddingBag (26358).
- Linear layers
- Vision layers
New torch::nn::functional functions
- Convolution functions
- Pooling functions
- Loss functions
- torch::nn::functional::hinge_embedding_loss / multi_margin_loss / multilabel_soft_margin_loss / triplet_margin_loss / soft_margin_loss / margin_ranking_loss (27101) (27424) (27669) (27713) (27660) (29000).
- torch::nn::functional::poisson_nll_loss / nll_loss / cross_entropy / binary_cross_entropy_with_logits (28755) (29812) (28783).
- torch::nn::functional::l1_loss / kl_div / mse_loss / binary_cross_entropy / smooth_l1_loss / ctc_loss (27156) (28806) (30146) (27661) (28654).
- Normalization functions
- Activation functions
- torch::nn::functional::elu / leaky_relu / selu / prelu / relu / relu6 / rrelu / celu / glu / gelu (27028) (27059) (27434) (27429) (27435) (27436) (27437) (27487) (29922) (28433).
- torch::nn::functional::log_sigmoid / log_softmax / softmax / softplus / softmin / softsign / softshrink / hardshrink / tanhshrink / hardtanh / gumbel_softmax / threshold (27060) (27462) (27446) (27489) (27459) (27535) (27534) (27035) (27537) (27038) (28121) (27538).
- Embedding functions
- Linear functions
- Padding functions
- Vision functions
- Distance functions
- torch::nn::functional::pdist (27122).
- Utility functions
AMD Support
- New features integration
- Build/CI
ONNX
In PyTorch 1.4, we have mainly focused on expanding the coverage for ONNX Opset 11, and enabling exporting torchvision models. Most of the torchvision models can be exported to ONNX (Opset 11, with fixed input size), including FasterRCNN, MaskRCNN, and KeypointRCNN. We have also enhanced export support for some tensor indexing scenarios, with more enhancements to come in the next release. In addition, 20+ new PyTorch operators are enabled in ONNX exporter.
Expanding Coverage for ONNX Opset 11
- `torch.sort/torch.topk` are supported in Opset 11 (25739)
- `torch.size/torch.squeeze/torch.unsqueeze/torch.mm/torch.index_fill/torch.index_copy` are supported in Opset 11 (27578)
- `torch.masked_select/torch.masked_scatter` are supported in Opset 11 (25949)
- `torch.arange` is supported in Opset 11 (26875)
- `avg_pool, constant_pad_nd, reflection_pad, replication_pad` Support enhanced in Opset 11 (28225)
- `torch.hardtanh` is supported in Opset 11 (30169)
- Enable ONNX constant folding for opset 11 (29011)
Exporting More Torch Operators/Models to ONNX
- `torch.remainder` is enabled in exporter (24410)
- `torch.unfold` is enabled in exporter (24970)
- `torch.slice/torch.select` with negative index are enabled in exporter (25273, 26549)
- `torch.ones/torch.ones_like/torch.zeros/torch.zeros_like/torch.full/torch.full_like` with default dtype are enabled in exporter (27577)
- `torch.unbind` is enabled in exporter (27247)
- `torch.nn.functional.interpolate` export is enhanced (27179, 27566, 28560, 29489)
- `torch.det` is enabled in exporter (26958)
- `torch.group_norm` is enabled in exporter (27071)
- `torch.meshgrid` is enabled in exporter (26037)
- `torch.randn/torch.randn_like` are enabled in exporter (28470, 29354)
- `torch.weight_norm` is enabled in exporter (28618)
- `torch.scalar_tensor` is enabled in exporter (28713)
- `torch.logdet` is enabled in exporter (29767)
- `torch.batch_norm` 2D with affine=False is enabled in exporter (29458)
- `torch.bitshift` is enabled in exporter (28210)
Enhancing Export/Test Infra
- Use deepcopy inputs in ONNX ORT test cases (27186)
- Return NotImplemented from all binary math ops (27423).
- Disabling ONNX IR v4 semantics for opset 8 or lower (28990)
- Add ONNX tests for torchvision models (30121)
- Keep output type information while exporting ONNX graph (25906)
Quantization
Quantization updates are a mix of bug fixes and feature improvements, with the latter expanding operator coverage and improving performance. We have also made a lot of progress towards enabling graph mode quantization support.
- Feature improvements:
- Enabling intra-op parallelism (26692).
- Enabling inplace relu (28710).
- Quantized Tensor support copy (28612).
- Add quantized torch mean implementation (27675).
- Add quantized avg_pool2d for pytorch mobile (27631).
- Add nn.quantized.Conv3d (29813).
- Adding inplace quantized relu6 (29245).
- Fast histogram observer (29790).
- PackedSequence support for quantized LSTM (29585).
- Improve legacy QuantizedLinear functions to reduce overhead (29773).
- Add support for quantized operator conversion from PT to C2 via ONNX (29694).
- Enable per channel dynamic quantization (30122).
- Scripting support:
Visualization
- Fixed graph visualization: displaying proper names after recent JIT changes (30244)
- Support logging embedding for TensorBoard visualizations to the generic filesystem (27716)
Other Improvements
- `torch.argmax/argmin` Allow half type (28787).
- `torch.cuda.memory_stats / memory_summary` instrumentation for CUDA memory allocator (27361).
- `torch.set_num_threads` Allow calling multiple times with TBB (27190).
- `torch.set_num_threads` Allow calling multiple times in parallel native (27947).
- `torch.logical_xor` Allow non-bool tensors (27248).
- `torch.promote_types` Nicer error message. (27941).
- `torch.batch_norm_elemt` Add an out-variant (27621).
- `torch.lerp` Implement derivative with respect to weight (28219).
- `torch.complex32` Add type promotion support (27929).
- `torch.unique` Support bool tensors (28374).
- `torch.reshape` Improve backward for viewable geometries (28901).
- `torch.lu` Generalized factorization (28608).
- `torch.equal` Add the intra-op parallelism (28810).
- `torch.randint` Accept generator=None (29748).
- `torch.bfloat16` Enabled for cuda (27259).
- `torch.multinomial` Enable for torch.half (29266).
- `nn.RNN` Respect the current stream in cudnn (27026).
- `nn.RNN` Preserve nonlinearity attribute (28058).
- `nn.Linear` Support 0-batch size. (27211).
- `nn.functional.binary_cross_entropy` implement double backwards (26983).
- `nn.AdaptiveAvgPool2d` Add support for NHWC memory format (24396).
- `nn.GELU` Add GELU activation (28944).
- `nn.LayerNorm` Handle batch size of zero (28614).
- `nn.BatchNorm` Add NHWC support on cudnn (23861).
- `nn.BatchNorm2d` support torch.channels_last (28982).
- `nn.BatchNorm2d` Handle empty inputs (30035).
- `nn.LayerNorm` Enable the intra-op parallelism (28464).
- `nn.utils.prune` Add pruning functionality (24076).
- `nn.Sequential` Make iterable (28987).
- `dtype.is_signed` Ability to differentiate signed dtypes (29511).
- `optim.lr_scheduler.MultiplicativeLR` Add new multiplicative learning rate scheduler. (27254).
- `cuda.comm.scatter, gather` Add channel-last support (28077).
- `at::parallel_for` Choose number of OMP threads based on GRAIN_SIZE (26963).
- Return NotImplemented from unsupported tensor arithmetic operators (26507).
- Automatically select proper tqdm submodule (27108).
- Pickle support for sparse tensors (27062).
- Vectorized complex unary and binary op support. (26500).
- Complex support for reduce and linpack ops on CPU (27653).
- Complex support for compare and pointwise ops on CPU (28735).
- Make PyTorch Python 3.8 compatible (29302).
- Buffer python warning to avoid deadlocks (26613).
- Use NNPACK for strided convolutions. (29084).
Bug Fixes
Distributed
- Ensure NCCL error handling code is disabled for NCCL versions < 2.4 (27124).
- Fix segmentation fault in `FileStore` with concurrent accesses. (28812).
- Fix DDP incompatibility issue with `nn.MultiheadAttention` (26826).
RPC
- Add `ProcessGroupAgent` termination detection algorithm (26984).
- Fix pybind11 warnings in Python RPC handler implementation (27284).
- Defer creating `ProcessGroupAgent` listener thread until contexts are initialized (28013).
- Fix Python RPC handler exit crash (27251).
- Fix distributed autograd initialization (29069).
- Always include autograd context id in `rpc_*`/`remote` requests (29781).
- Make `RRefContext` singleton leaky, deal with module destruct order race. (30172).
C++ API Bug Fixes
- at::Tensor::requires_grad_ now supported (26332).
- torch::isfinite now supported (30083).
- torch::nn::modules_ordered_dict is deprecated (28774).
- Add reset_parameters to torch::nn modules (29832).
- Allow passing undefined Tensor to Module::register_parameter (27948).
- Exclude undefined tensors in the result of Module::parameters() / named_parameters() / buffers() / named_buffers() (30626).
- Include hierarchy information in C++ API loading error messages (28499).
- Fix a bug: the C++ L-BFGS optimizer does not work properly if there are one or more registered tensors with no grad in the model (27606).
- Use c10::variant-based enums for Nonlinearity and FanMode (27933). Support for `torch::nn::init::Nonlinearity` and `torch::nn::init::FanMode` will be removed in 1.5.
JIT
- Make dropout properly condition on training. (29436)
- Fix aten::grad to return optional list (29577)
- Fix `torch.arange` dtype
- Fix type sharing on loaded ScriptModules (29826)
- Fix type sharing between traced modules (29583)
- Check for mutable default parameters (29833)
- Fix tracing of autograd functions (29791)
- Check for unrolled loop in break & continue (29474)
- Fix negative string indexing (22700)
- Make jit.trace_module reentrant (29411)
- Fix jit outplace tracing and reapply changes to _like operators. (28839)
- Properly guard against inheritance on TorchScript classes (28407)
- Fix when giving jit format warning about unsupported options (28616)
- Fix handling of function attributes. (28569)
- Fix pushLong() issue in pickler. (28057)
- Fix broken name mangling (27511)
- Fix segfault while printing value type for an error msg in emitListComprehension (27261)
- Fix `toIValue` dict iteration (26856)
- Fix race condition in Function::optimized_graph(). (27012)
- Sanitize module names on legacy import (27764)
- Python None should have its type inferred as NoneType (26665)
- Properly set existing attributes under recursive script (27514)
Quantization
- Skip copy_same_type_transpose_ for quantized tensor (29609).
- Add note that cuda quantization is not supported (27829).
- Rename _intrinsic to intrinsic (27194).
- Better error message for quantized dispatch (28635).
- Update the misleading comments for zero_points and scale in dynamic quant linear module [1/2] (28767).
- Avoid the misleading zero_point and scale [2/2] (28827).
- Add the warning message for API with linear modules (28766).
- Do not insert observers for empty sequential modules (28384).
- Fix the padding issue of quantized average pool operator (28260).
Mobile
Other Bug fixes
- `torch.kthvalue` Fix CUDA shared memory out of bound access in findPattern (28989).
- `torch.save` Fix source files not being saved (28965).
- `torch.load` Fix OSError loading files larger than 2GB. (27125).
- `torch.linspace` clearer error message for negative step sizes. (28274).
- `torch.histc` Add range checks to avoid segfaults (27712).
- `torch.lu` Fix thread local issue on cpu (28546).
- `torch.max_pool2d` Limit tensor size to max CUDA grid size (28931).
- `torch.renorm` Fix a memory leak in CUDA renorm. (29873).
- `torch.index_add` Fix bug in atomicAdd on CUDA for some dtypes (29231).
- `torch.addmm` Fix handling of empty tensors (28613).
- `nn.CTCLoss` Fix incorrect gradient for large target sizes (27460).
- `nn.functional.ctc_loss` Fix incorrect gradient on cudnn (27039).
- `nn.Embedding` Incorrect gradient at padding_idx in cuda kernel. (27731).
- `nn.LayerNorm` Fix an illegal memory access error (28196).
- `nn.Conv2d` handle zero stride (28784).
- `nn.PoissonNLLLoss` Fix incorrect result with `full=True` (28637).
- `nn.AvgPool2d` fix an overflow for 2^31-1 sized inputs (30793).
- `nn.RNNBase` Fix an issue with use of children of RNN third party device types (28562).
- `nn.Upsample` Fix “invalid configuration argument” error (28927).
- `nn.Upsample` Fix a CUDA launch config failure (29016).
- `optim.lr_scheduler.OneCycleLR` Correctly handle div_factor parameter (28217).
- `PackedSequence.to` Ensure all tensors are moved (27245).
- `EventList.total_average` Fix a regression caused by missing iadd (27498).
- `Tensor.record_stream` Ensure stream is recorded for shifted view tensors (27371).
- `torch.hub` Handle branch names containing a slash. (27960).
- Fix error handling in Magma kernels (29003).
- Fix avx for c++14 (28207).
- Fix illegal memory access thread safety issue in sparse CUDA (29426).
Deprecations
Python 2 support is deprecated and will not be supported in the 1.5 release.
torch.optim: Scheduler.step(epoch) is now deprecated; use Scheduler.step() instead. (26432)
For example:
>>> for epoch in range(10):
>>>     optimizer.step()
>>>     scheduler.step(epoch)
DeprecationWarning: The epoch parameter in `scheduler.step()` was not necessary and is being deprecated where possible. Please use `scheduler.step()` to step the scheduler. During the deprecation, if epoch is different from None, the closed form is used instead of the new chainable form, where available. Please open an issue if you are unable to replicate your use case: https://github.com/pytorch/pytorch/issues/new/choose.
  warnings.warn(EPOCH_DEPRECATION_WARNING, DeprecationWarning)
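As a minimal sketch of the recommended form, the loop below calls scheduler.step() without the epoch argument and records any warning that mentions the epoch parameter. The parameter list, the ExponentialLR scheduler, and the gamma value are illustrative assumptions, not taken from the release notes.

```python
import warnings

import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import ExponentialLR

# Hypothetical single-parameter "model", used only to build an optimizer.
params = [torch.nn.Parameter(torch.zeros(2))]
optimizer = SGD(params, lr=0.1)
scheduler = ExponentialLR(optimizer, gamma=0.9)

learning_rates = []
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    for epoch in range(3):
        optimizer.step()
        scheduler.step()  # no epoch argument
        learning_rates.append(scheduler.get_last_lr()[0])

# Collect only warnings that talk about the epoch parameter.
epoch_warnings = [w for w in caught if "epoch" in str(w.message).lower()]
print(learning_rates, len(epoch_warnings))
```

The scheduler keeps its own step counter, so the loop variable does not need to be passed in.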
[C++] C++11 is deprecated and will not be supported in the 1.5 release.
[C++] Tensor::is_variable() has been deprecated. As noted in the Backwards Incompatible Changes section, the distinction between variable and non-variable has been eliminated, so this check is no longer meaningful. Generally, is_variable() will now return true except in some special circumstances (see 29653 for more details). (29653)
[C++] torch::nn::modules_ordered_dict has been deprecated. It is generally no longer necessary and can just be removed. (28774)
torch.jit.quantized API has been deprecated in favor of torch.quantization.quantize_dynamic (28766)
Performance
A benchmark suite is available to easily measure the performance of operators with a range of input shapes. The generated benchmark data fully characterize the performance of operators in terms of execution time. For more details see README.md in the benchmarks/operator_benchmark directory.
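The idea of timing an operator over a range of input shapes can be sketched with the standard library alone. This is not the operator_benchmark suite itself; the helper name time_operator, the shapes, and the repeat count are assumptions made for illustration.

```python
import timeit

import torch


def time_operator(op, shapes, number=10):
    """Return the mean wall-clock time per call of `op` for each input shape."""
    results = {}
    for shape in shapes:
        x = torch.randn(*shape)
        op(x)  # warm-up call so one-time setup cost is not measured
        total = timeit.timeit(lambda: op(x), number=number)
        results[shape] = total / number
    return results


timings = time_operator(torch.relu, [(8, 8), (64, 64), (256, 256)])
for shape, seconds in timings.items():
    print(shape, f"{seconds * 1e6:.1f} us")
```

For real measurements, the suite in benchmarks/operator_benchmark is the better tool, since it controls for more sources of noise than this sketch does.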
- torch.nn.functional.threshold, torch.nn.functional.layer_norm, torch.cdist: Performance of threshold (CPU), layer norm (CUDA) and cdist operations was improved (27155, 27634, 25799).
- torch.Tensor.fill_: Performance for half and bfloat16 types on CPU was improved (28397).
- torch.nn.MaxPool2d: An implementation for the channels_last format was added (24872).
- There is a fast pass reducing the overheads of pointwise operations relying on TensorIterator under certain conditions (contiguous inputs, no broadcast) (29180).
- Overhead of operations with scalars/number literals was reduced (29915).
- In case of type promotion on the GPU, the values are converted on the fly, without explicit casting of the full tensor (30018).
- reorder_dimensions in TensorIterator favors output write locality, improving overall performance when operating on discontiguous tensors (28615).
- Float pickling speed was improved (28553).
- GRAIN_SIZE for intra-op parallelization was unified between TH and ATen operations (28770)
- tensor.numel devirtualized, improving performance (27294)
This release has 2 assets:
- Source code (zip)
- Source code (tar.gz)
Visit the release page to download them.
Highlights
PyTorch Mobile - Build level customization
Following the experimental release of PyTorch Mobile in the 1.3 release, PyTorch 1.4 adds additional mobile support including the ability to customize build scripts at a fine-grain level. This allows mobile developers to optimize library size by only including the operators used by their models and, in the process, reduce their on device footprint significantly. Initial results show that, for example, a customized MobileNetV2 is 40% to 50% smaller than the prebuilt PyTorch mobile library. Learn more about how to create your own custom builds, and please engage with the community on the PyTorch forums to provide any feedback you have.
Distributed Model Parallel Training [Experimental]
With the scale of models, such as RoBERTa, continuing to increase into the billions of parameters, model parallel training has become ever more important to help researchers push the limits. This release provides a distributed RPC framework to support distributed model parallel training. It allows for running functions remotely and referencing remote objects without copying the real data around, and provides autograd and optimizer APIs to transparently run backwards and update parameters across RPC boundaries.
To learn more about the APIs and the design of this feature, see the links below:
For the full tutorials, see the links below:
- A full RPC tutorial
- Examples using model parallel training for reinforcement learning and with an LSTM
As always, you can connect with community members and discuss more on the forums.
Java bindings [Experimental]
In addition to supporting Python and C++, this release adds experimental support for Java bindings. Based on the interface developed for Android in PyTorch Mobile, the new bindings allow you to invoke TorchScript models from any Java program. Note that the Java bindings are only available for Linux for this release, and for inference only. We expect support to expand in subsequent releases. See the code snippet below for how to use PyTorch within Java:
Learn more about how to use PyTorch from Java here, and see the full Javadocs API documentation here.
Pruning
Pruning functionalities have been added to PyTorch in the nn.utils.prune module. This provides out-of-the-box support for common magnitude-based and random pruning techniques, both structured and unstructured, both layer-wise and global, and it also enables custom pruning from user-provided masks.
To prune a tensor, first select a pruning technique among those available in nn.utils.prune (or implement your own by subclassing BasePruningMethod).
import torch
from torch.nn.utils import prune
t = torch.rand(2, 5)
# Zero out the 70% of entries with the smallest absolute value
p = prune.L1Unstructured(amount=0.7)
pruned_tensor = p.prune(t)
To prune a module, select one of the pruning functions available in nn.utils.prune (or implement your own) and specify which module and which parameter within that module pruning should act on.
from torch import nn
m = nn.Conv2d(3, 1, 2)
# Prune 2 of the 3 input channels (dim=1), chosen by smallest L2 norm
prune.ln_structured(module=m, name='weight', amount=2, n=2, dim=1)
Pruning reparametrizes the module by turning weight (in the example above) from a parameter to an attribute, and replacing it with a new parameter called weight_orig (i.e. appending "_orig" to the initial parameter name) that stores the unpruned version of the tensor. The pruning mask is stored as a buffer named weight_mask (i.e. appending "_mask" to the initial parameter name). Pruning is applied prior to each forward pass by recomputing weight through a multiplication with the updated mask using PyTorch's forward_pre_hooks.
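The reparametrization described above can be inspected directly on a pruned module. The following is a small sketch, assuming an nn.Linear layer and 50% L1 unstructured pruning as illustrative choices; it reads the private _forward_pre_hooks dictionary only to show that a hook was registered.

```python
import torch
from torch import nn
from torch.nn.utils import prune

torch.manual_seed(0)
m = nn.Linear(4, 3)  # illustrative module, weight has 12 entries
prune.l1_unstructured(m, name="weight", amount=0.5)

param_names = [name for name, _ in m.named_parameters()]
buffer_names = [name for name, _ in m.named_buffers()]

# weight is now a plain attribute equal to weight_orig * weight_mask
recomputed = m.weight_orig * m.weight_mask
num_pruned = int((m.weight_mask == 0).sum())
num_hooks = len(m._forward_pre_hooks)

print(param_names, buffer_names, num_pruned, num_hooks)
```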
Iterative pruning is seamlessly enabled by repeatedly calling pruning functions on the same parameter (this automatically handles the combination of successive masks by making use of a PruningContainer under the hood).
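A short sketch of iterative pruning follows, assuming a 10x10 linear layer and two arbitrary pruning steps. A fractional amount in the second call is applied to the entries that are still unpruned, so 20 entries are removed first and then half of the remaining 80.

```python
import torch
from torch import nn
from torch.nn.utils import prune

torch.manual_seed(0)
m = nn.Linear(10, 10)  # illustrative module, weight has 100 entries

prune.random_unstructured(m, name="weight", amount=0.2)
zeros_after_first = int((m.weight_mask == 0).sum())

prune.l1_unstructured(m, name="weight", amount=0.5)
zeros_after_second = int((m.weight_mask == 0).sum())

# Locate the hook attached to "weight" (private attribute, inspection only).
hook = next(
    h for h in m._forward_pre_hooks.values()
    if getattr(h, "_tensor_name", None) == "weight"
)
methods_applied = [type(method).__name__ for method in hook]

print(zeros_after_first, zeros_after_second, type(hook).__name__, methods_applied)
```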
nn.utils.prune is easily extensible to support new pruning functions by subclassing the BasePruningMethod base class and implementing the compute_mask method with the instructions to compute the mask according to the logic of the new pruning technique.
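As a sketch of such an extension, the hypothetical class below prunes every other entry of a tensor. The class name EveryOtherPruning and the pruning rule are invented for illustration; only BasePruningMethod, PRUNING_TYPE, compute_mask, and apply come from nn.utils.prune.

```python
import torch
from torch import nn
from torch.nn.utils import prune


class EveryOtherPruning(prune.BasePruningMethod):
    """Hypothetical method: zero out every other entry of the flattened tensor."""

    PRUNING_TYPE = "unstructured"

    def compute_mask(self, t, default_mask):
        mask = default_mask.clone()
        mask.view(-1)[::2] = 0  # entries 0, 2, 4, ... are pruned
        return mask


m = nn.Linear(4, 2)  # illustrative module, weight has 8 entries
EveryOtherPruning.apply(m, "weight")

flat_mask = m.weight_mask.view(-1).tolist()
print(flat_mask)
```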
Backwards Incompatible Changes
Python
torch.optim: It is no longer supported to use Scheduler.get_lr() to obtain the last computed learning rate. To get the last computed learning rate, call Scheduler.get_last_lr() instead. (26423)
Learning rate schedulers are now “chainable,” as mentioned in the New Features section below. Scheduler.get_lr was sometimes used for monitoring purposes to obtain the current learning rate. But since Scheduler.get_lr is also used internally for computing new learning rates, this actually returns a value that is “one step ahead.” To get the last computed learning rate, use Scheduler.get_last_lr instead.
Note that optimizer.param_groups[0]['lr'] was in version 1.3.1, and remains in 1.4.0, a way of getting the current learning rate used in the optimizer.
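The two monitoring options mentioned above can be compared side by side. This is a minimal sketch with an assumed StepLR schedule and a dummy parameter; it only shows that get_last_lr() and optimizer.param_groups[0]['lr'] agree after each step.

```python
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import StepLR

# Hypothetical single-parameter "model", used only to build an optimizer.
params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = SGD(params, lr=0.1)
scheduler = StepLR(optimizer, step_size=2, gamma=0.5)

observed = []
for epoch in range(4):
    optimizer.step()
    scheduler.step()
    observed.append(
        (scheduler.get_last_lr()[0], optimizer.param_groups[0]["lr"])
    )

for from_scheduler, from_optimizer in observed:
    print(from_scheduler, from_optimizer)
```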
Tensor.unfold on a 0-dimensional Tensor now properly returns a 1-dimensional Tensor.
Version 1.3.1 | Version 1.4.0 |
---|---|
>>> torch.tensor(5).unfold(dimension=0, size=1, step=1) tensor(5) | >>> torch.tensor(5).unfold(dimension=0, size=1, step=1) tensor([5]) |
torch.symeig now returns a 0-element eigenvectors tensor when eigenvectors=False (the default).
Version 1.3.1 | Version 1.4.0 |
---|---|
>>> torch.symeig(torch.randn(3,3)).eigenvectors.shape torch.Size([3, 3]) | >>> torch.symeig(torch.randn(3,3)).eigenvectors.shape torch.Size([0]) |
JIT
- Make torch.jit.get_trace_graph private (it is now torch.jit._get_trace_graph) (29149)
  - This function was intended only for ONNX integration; use traced_module.graph instead, like:
    traced_module = torch.jit.trace(my_module, example_inputs)
    traced_graph = traced_module.graph
- @property on ScriptModules has been disabled (28395)
  - Scripted @property accesses were silently broken before, where we would evaluate the get function once and store that as the attribute permanently. They properly error now; a workaround is to make your @property a regular method.
- Custom ops: torch::jit::RegisterOperators has been removed, use torch::RegisterOperators instead (28229). The usage and behavior should remain the same.
- Remove torch.jit._register_* bindings from Python (e.g. torch.jit._register_attribute). These were private functions that were not intended to be used. (29499)
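The @property workaround mentioned in this list can be sketched as follows. The module name Scaler and its factor attribute are hypothetical; the point is only that the value is exposed through a regular method that forward calls.

```python
import torch
from torch import nn


class Scaler(nn.Module):
    """Hypothetical module that exposes a value through a method, not a @property."""

    def __init__(self):
        super().__init__()
        self.factor = 2.0

    # Regular method used in place of a @property
    def scale(self) -> float:
        return self.factor

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.scale()


scripted = torch.jit.script(Scaler())
out = scripted(torch.ones(3))
print(out)
```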
C++
[C++] The distinction between Tensor and Variable has been eliminated at the C++ level. (28287)
This change simplifies our C++ API and matches previous changes we did at the Python level that merged Tensors and Variables into a single type.
This change is unlikely to affect user code; the most likely exceptions are:
- Argument-dependent lookup for torch::autograd may no longer work. This can break because Variable is now defined as an alias for Tensor (using Variable = Tensor;). In this case, you must explicitly qualify the calls to torch::autograd functions.
- Because Variable and Tensor are now the same type, code which assumes that they are different types (e.g., for the purposes of templating, or std::enable_if checks) will not work until you delete the (now) redundant overload/specialization.
- Some operators may trace differently. If this happens, please file a bug. The most likely situations are:
  - There are now more operations in your trace than before (usually, calls to aten::empty)
  - There are now fewer operations in your trace than before (e.g., the trace complains that "there is no observable dependence" with the inputs)
[C++] Arguments in torch::nn::LinearOptions are renamed to match the Python API. (27382)
- Arguments that are renamed:
  - in -> in_features
  - out -> out_features
  - with_bias -> bias
[C++] Arguments in torch::nn::Conv{1,2,3}dOptions are renamed to match the Python API. (28917) (29838)
- Arguments that are renamed:
  - input_channels -> in_channels
  - output_channels -> out_channels
  - with_bias -> bias
[C++] torch::nn::Conv{1,2,3}dOptions no longer has the transposed argument. (31005)
- If users have transposed originally set to true in torch::nn::Conv{1,2,3}dOptions, they should migrate their code to use torch::nn::ConvTranspose{1,2,3}d layers instead.
[C++] All Reduction enums for torch::nn layers and functionals are changed to have torch::KEnumNAME syntax. (27942, 26837)
- Example: previously, to specify “mean” as the reduction method in a torch::nn layer or functional, we would use torch::Reduction::Mean. Now, torch::Reduction::Mean has been renamed to the shorter torch::kMean.
[C++] torch::tensor constructor is improved to match Python API behavior. (28523) (29632) (29066)
- Shape checking fixes
  - Example 1: previously, torch::tensor({{1}, {2}}) produced a tensor of sizes {2}. Now, it produces a tensor of sizes {2, 1}.
  - Example 2: previously, torch::tensor(1.1) produced a 1-dim tensor. Now it produces a 0-dim tensor.
- Type inference improvements
  - Example 1: previously, C++ torch::tensor with a double (e.g. torch::tensor(1.1)) or a (nested) braced-init-list of doubles (e.g. torch::tensor({{1.1, 2.2}})) produced a tensor with dtype torch::kDouble. Now it produces a tensor with dtype torch::get_default_dtype().
  - Example 2: previously, C++ torch::tensor with an integer type (e.g. torch::tensor(1)) or a (nested) braced-init-list of integer types (e.g. torch::tensor({{1, 2}})) produced a tensor with the same dtype. Now it always produces a tensor of dtype torch::kLong (aka. int64_t).
  - Example 3: previously, when passed a TensorOptions without a dtype set to the torch::tensor constructor, it always produced a tensor of dtype torch::get_default_dtype(). Now it produces a tensor of different dtypes based on the dtype of the braced-init-list and the default dtype.
- Passing a std::initializer_list (NOT braced-init-list) to torch::tensor will no longer compile, and the user should pass the equivalent braced-init-list to torch::tensor instead. For example, write torch::tensor({1.1, 1.2}) instead
[C++] Some activation modules’ forward functions now take Tensor instead of Tensor& as input. (28501)
torch::nn layers affected: ELU / SELU / Hardtanh / LeakyReLU / ReLU / ReLU6 / RReLU / CELU
This change ensures that the above layers can be used in a torch::nn::Sequential module. If your C++ model uses any of the above layers, you must recompile your C++ code with the new libtorch binary.
New Features
torch.optim
Learning rate schedulers (torch.optim.lr_scheduler) now support “chaining.” This means that two schedulers can be defined and stepped one after the other to compound their effect; see the example below. Previously, the schedulers would overwrite each other.
>>> import torch
>>> from torch.optim import SGD
>>> from torch.optim.lr_scheduler import ExponentialLR, StepLR
>>>
>>> model = [torch.nn.Parameter(torch.randn(2, 2, requires_grad=True))]
>>> optimizer = SGD(model, 0.1)
>>>
>>> scheduler1 = ExponentialLR(optimizer, gamma=0.9)
>>> scheduler2 = StepLR(optimizer, step_size=3, gamma=0.1)
>>>
>>> for epoch in range(5):
>>>     print(epoch, scheduler2.get_last_lr()[0])
>>>
>>>     optimizer.step()
>>>     scheduler1.step()
>>>     scheduler2.step()
0 0.1
1 0.09000000000000001
2 0.08100000000000002
3 0.00729000000000002
4 0.00656100000000002
Distributed
- Add allgather_coalesced API to ProcessGroup (28634, 29059)
- Add abort API in ProcessGroupGloo Send/Recv Work (29928).
- Add --no_python flag to allow using a bash script wrapper in the launch command (29144).
RPC [Experimental]
torch.distributed.rpc is a newly introduced package. It contains basic building blocks to run functions remotely in model training and inference, which will be useful for scenarios like distributed model parallel or implementing parameter server frameworks. More specifically, it contains four pillars: RPC, Remote Reference, Distributed Autograd, and Distributed Optimizer. Please refer to the documentation and the tutorial for more details.
- Add rpc_sync and rpc_async for builtin operators and Python user functions (23228, 23569, 28392).
- Add remote and RRef for builtin operators and Python user functions (25169, 25499).
- Distributed Autograd - FAST mode backward pass implementation. (27022, 27576).
- Integrate remote and RRef with distributed autograd (28630, 28656).
- Add a distributed optimizer (29304, 30062).
- Add Python API for get_gradients(), a method to retrieve gradients from distributed autograd context. (28926).
- Support creating local RRefs on local values and to-self remote calls (28948, 29634).
- Support custom pickler for RPC (30185).
- Add default RPC agent options based on the backend type (30201).
- Add local shutdown to ProcessGroup agent (30330).
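The basic rpc_sync and rpc_async calls listed above can be tried in a single process that sends requests to itself. This is a sketch under several assumptions: the worker name "worker0", a world size of 1, the loopback address, and the hypothetical helper _free_port are all illustrative, and a real setup would launch one process per worker.

```python
import os
import socket

import torch
import torch.distributed.rpc as rpc


def _free_port() -> int:
    """Hypothetical helper: ask the OS for an unused TCP port."""
    with socket.socket() as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


def run_single_worker_demo():
    # init_rpc reads the rendezvous address from these environment variables.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", str(_free_port()))
    rpc.init_rpc("worker0", rank=0, world_size=1)
    try:
        # Blocking call: run torch.add on "worker0" and wait for the result.
        sync_result = rpc.rpc_sync("worker0", torch.add, args=(torch.ones(2), 1))
        # Non-blocking call: returns a future that is resolved with wait().
        future = rpc.rpc_async("worker0", torch.add, args=(torch.ones(2), 2))
        async_result = future.wait()
    finally:
        rpc.shutdown()
    return sync_result, async_result


sync_result, async_result = run_single_worker_demo()
print(sync_result, async_result)
```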
JIT
- script::Module: implement more of the nn.Module API (28828)
  - In particular, adds the (optionally recursive) methods that iterate over submodules, parameters, etc.
  - Adds a pybind-like attr() method to simplify attribute access.
- Add support for @staticmethod on ScriptModules (27163)
- Support Module Containers as Iterables (26465)
- Support Iterables In List Comprehensions (26768)
- Dictionaries now preserve insertion order, and OrderedDict is supported (26465)
- Add support for hasattr() (29332)
- TorchScript classes can now be callable (26743)
- Add clone_instance for ScriptModules (30168)
- Add torch.memory_format support to the TorchScript (28544)
- Custom forward() is now allowed on container modules (28988)
- Calls to submodules are now preserved in the traced graph (29261)
- Add support for module containers to be used as iterables (28255)
- Make JIT Serialization support arbitrary std::function<> IO (28039)
- Support layout() in script (27100)
- Methods and functions are no longer inlined in the serialized file format (26706)
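Two of the items above, module containers as iterables and iterables in list comprehensions, can be sketched in a few lines. The module name Stack, its layer sizes, and the function doubled are hypothetical examples, not code from the release.

```python
from typing import List

import torch
from torch import nn


class Stack(nn.Module):
    """Hypothetical module that loops over an nn.ModuleList in forward."""

    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(4, 4) for _ in range(3)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:  # module container used as an iterable
            x = torch.relu(layer(x))
        return x


@torch.jit.script
def doubled(values: List[int]) -> List[int]:
    # List comprehension over an iterable inside TorchScript
    return [v * 2 for v in values]


torch.manual_seed(0)
model = Stack()
scripted = torch.jit.script(model)

x = torch.randn(2, 4)
eager_out = model(x)
scripted_out = scripted(x)
doubled_out = doubled([1, 2, 3])
print(doubled_out)
```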
Mobile
- Build level customization
Improvements
Distributed
Improvements
- Add timeout support in ProcessGroupNCCL (27224).
- Ensure that DDP wrapped module has parameters that require gradients (25858).
- Making torch/csrc/cuda NCCL usage safe for NCCL 2.5 (29014).
- Enable test_distributed for ROCm but only with NCCL backend (28814).
RPC Improvements
- Separate out RPC to rpc_sync and rpc_async APIs (26570).
- Make python user function serialization format to be consistent with builtin operators (27136).
- Clean up distributed autograd context on all participants on exit (27951).
- Improve error handling for distributed autograd engine. (27940).
- Scope pybind11 functions to torch.distributed.{autograd,rpc} (27529).
- Lift rpc_timeout to RpcAgent to make it reusable for other RpcAgent implementations. (29341).
- Support sending message to self in process_group_agent (29253).
- Properly shutdown RPC even in the case of clean_shutdown=False. (29148).
- Ensure initializedContextIds_ map is cleaned up appropriately in distributed autograd engine. (29787).
- Add hash and equality operators for WorkerInfo (29958).
- Add RpcAgentOptions struct type to bundle arguments for different RpcAgents (29972).
- Mark timeout FutureMessages and throw exceptions in ProcessGroupAgent (29601).
- Re-throw python remote exception when using remote reference to itself (29930).
- By default ignore RRef leaks during shutdown (30217).
Documentation
- Add Design doc for Distributed Autograd Engine (29175, 30068, 29927)
- Add Design doc for Remote Reference (30066).
- Add documentation page for torch.distributed.rpc (29276, 28030, 29971, 30160, 30050, 30069, 30179, 30218, 30240, 30243, 30259).
MISC
- Add known worker IDs to distributed autograd context (26324).
- Minor tweaks to RPC message API (28326).
- Rename PythonUDF{Call,Resp} (27530).
- Use std::shared_ptr for DistAutogradContext (29770).
- Mark c10d::~NCCLUtils as noexcept (29118).
JIT
- Move custom passes to last optimization step (29256)
- Represent the original Python name of a module type the same way in traced and scripted modules. (29912)
- Only print original SourceRange on highlight (29708)
- Error message and ergonomic improvements:
- Show full call stack in TorchScript exception even when calls were inlined. (29911)
- Reduce error context from 10 -> 3 (26765)
- Fix error report highlight for unmatched type annotation (27195)
- Make default string arguments in schemas human readable (27088)
- Print which output didn't have dependence during trace checking. (29047)
- Improvements to save/load and serialization performance:
- Modules can now share JIT types if their implementation is the same, improving save/load performance (26666)
- Improve float pickling speed. (28553)
- Pickler: convert std::stringstream cases for improved performance. (29351)
- Buffer to speed Unpickler (27727)
- Buffer in Pickler to improve performance. (27720)
- In torch::save() avoid zip compressing small header records. (28180)
- String optimizations related to serialization. (28230)
- Clean up serialized source format (28129)
- API for finding a common ancestor block for a pair of nodes (28864)
- Make inserted child module names unique (27237)
- Better hashing for constant pool (27733)
- Improve error messages when a method or attribute is missing (27110)
- Display original source range in Node::print (27524)
- Always use the closure to resolve variable names (27515)
Mobile
- Improve Java API / JNI
- Add the module method to allow explicitly destructing native part (27090).
- Add methods to write image tensor content to buffer (27359).
- Various improvements to Android API (27454, 27455).
- Add support for PyTorch JNI build (29412, 42faf961c8, d22f61432d).
- Various fixes to PyTorch JNI (29350, 29861, 30206, 30207).
- Improve support for older Android NDK
- Improve error message, documentation, debuggability
- Improve support for benchmark and profiling
- Improve build / CI
- Improve Android Gradle build and publishing (26833, 27389, 29262, 29738).
- Misc fixes to the Android test project (27453).
- Improve XCode build script (27358, 28996, 29002).
- Add testing code to iOS CI jobs (27593, 27594, 27784, 30133).
- Misc fixes to the iOS TestApp (27591, 28356, 28809, 29247, 29962, 29963).
- Add support for host build to pytorch_android (27662,27664).
- Add host build Gradle publishing (29749).
- Add mobile build CI with host toolchain (30292).
Named Tensors
- torch.addcdiv, torch.addcmul: Added named tensor support (28975).
- torch.{ones,zeros,full,rand,randn}_like: Added named tensor support (28981).
- torch.cdist: Added named tensor support (29129).
- torch.equal: Added named tensor support (29322).
- Added named tensor support for comparison ops (27162).
- Tensor.align_to: Fixed error message (27221).
- Tensor.align_to: Make method-only. (27304).
- Tensor.align_to: Accept partially named tensors (27308).
- torch.mean(Tensor, Dimname): Fixed autograd support (29199).
- Tensor.unflatten: Fix when dim is a negative integer (#31208) (31432).
- Fix type errors in examples about Named Tensor (27828).
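A brief sketch of named tensors in use, touching align_to and a comparison op from the list above. The dimension names "N" and "C" and the tensor shape are arbitrary choices for illustration; named tensors are an experimental feature, so PyTorch may print a warning when this runs.

```python
import torch

# Tensor with named dimensions: batch "N" and channels "C"
x = torch.randn(2, 3, names=("N", "C"))

# align_to permutes dimensions by name rather than by position
y = x.align_to("C", "N")

# Comparison ops carry the names of their input over to the result
mask = x > 0

# Reductions can address a dimension by name
per_sample = x.sum("C")

print(y.names, tuple(y.shape), mask.names, per_sample.names)
```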
C++ API
New torch::nn modules
- Convolution layers
- Pooling layers
- Loss layers
- torch::nn::HingeEmbeddingLoss / CosineEmbeddingLoss / MultiMarginLoss (27101) (27345) (27424) (27770).
- torch::nn::TripletMarginLoss / SoftMarginLoss / MultiLabelMarginLoss / MarginRankingLoss / MultiLabelSoftMarginLoss (27713, 27956) (27660) (27659) (29000) (27669).
- torch::nn::MSELoss / KLDivLoss / BCELoss / SmoothL1Loss / PoissonNLLLoss / BCEWithLogitsLoss (27156) (28806) (30146) (27661) (28755) (28783).
- torch::nn::NLLLoss / CrossEntropyLoss / CTCLoss (29812) (28654).
- Normalization Layers
- Activation Layers
- torch::nn::ELU / LeakyReLU / SELU / PReLU / ReLU / ReLU6 / RReLU / CELU / GLU (27028) (27059) (27434) (27429) (27435) (27436) (27437) (27487) (29922).
- torch::nn::Sigmoid / LogSigmoid / LogSoftmax / Softmax / Softmax2d / Softplus / Softmin / Softsign / Softshrink / Hardshrink / Hardtanh / Tanh / Threshold (27488) (27060) (27462) (27446) (27509) (27489) (27459) (27535) (27534) (27035) (27537) (27038) (27536) (27538).
- Dropout Layers
- Padding Layers
- Embedding layers
- torch::nn::Embedding / EmbeddingBag (26358).
- Linear layers
- Vision layers
New torch::nn::functional functions
- Convolution functions
- Pooling functions
- Loss functions
- torch::nn::functional::hinge_embedding_loss / multi_margin_loss / multilabel_soft_margin_loss / triplet_margin_loss / soft_margin_loss / margin_ranking_loss (27101) (27424) (27669) (27713) (27660) (29000).
- torch::nn::functional::poisson_nll_loss / nll_loss / cross_entropy / binary_cross_entropy_with_logits (28755) (29812) (28783).
- torch::nn::functional::l1_loss / kl_div / mse_loss / binary_cross_entropy / smooth_l1_loss / ctc_loss (27156) (28806) (30146) (27661) (28654).
- Normalization functions
- Activation functions
- torch::nn::functional::elu / leaky_relu / selu / prelu / relu / relu6 / rrelu / celu / glu / gelu (27028) (27059) (27434) (27429) (27435) (27436) (27437) (27487) (29922) (28433).
- torch::nn::functional::log_sigmoid / log_softmax / softmax / softplus / softmin / softsign / softshrink / hardshrink / tanhshrink / hardtanh / gumbel_softmax / threshold (27060) (27462) (27446) (27489) (27459) (27535) (27534) (27035) (27537) (27038) (28121) (27538).
- Embedding functions
- Linear functions
- Padding functions
- Vision functions
- Distance functions
- torch::nn::functional::pdist (27122).
- Utility functions
AMD Support
- New features integration
- Build/CI
ONNX
In PyTorch 1.4, we have mainly focused on expanding the coverage for ONNX Opset 11, and enabling exporting torchvision models. Most of the torchvision models can be exported to ONNX (Opset 11, with fixed input size), including FasterRCNN, MaskRCNN, and KeypointRCNN. We have also enhanced export support for some tensor indexing scenarios, with more enhancements to come in the next release. In addition, 20+ new PyTorch operators are enabled in ONNX exporter.
Expanding Coverage for ONNX Opset 11
- torch.sort/torch.topk are supported in Opset 11 (25739)
- torch.size/torch.squeeze/torch.unsqueeze/torch.mm/torch.index_fill/torch.index_copy are supported in Opset 11 (27578)
- torch.masked_select/torch.masked_scatter are supported in Opset 11 (25949)
- torch.arange is supported in Opset 11 (26875)
- avg_pool, constant_pad_nd, reflection_pad, replication_pad: Support enhanced in Opset 11 (28225)
- torch.hardtanh is supported in Opset 11 (30169)
- Enable ONNX constant folding for opset 11 (29011)
Exporting More Torch Operators/Models to ONNX
- torch.remainder is enabled in exporter (24410)
- torch.unfold is enabled in exporter (24970)
- torch.slice/torch.select with negative index are enabled in exporter (25273, 26549)
- torch.ones/torch.ones_like/torch.zeros/torch.zeros_like/torch.full/torch.full_like with default dtype are enabled in exporter (27577)
- torch.unbind is enabled in exporter (27247)
- torch.nn.functional.interpolate export is enhanced (27179, 27566, 28560, 29489)
- torch.det is enabled in exporter (26958)
- torch.group_norm is enabled in exporter (27071)
- torch.meshgrid is enabled in exporter (26037)
- torch.randn/torch.randn_like are enabled in exporter (28470, 29354)
- torch.weight_norm enabled in exporter (28618)
- torch.scalar_tensor is enabled in exporter (28713)
- torch.logdet is enabled in exporter (29767)
- torch.batch_norm 2D with affine=False is enabled in exporter (29458)
- torch.bitshift is enabled in exporter (28210)
Enhancing Export/Test Infra
- Use deepcopy inputs in ONNX ORT test cases (27186)
- Return NotImplemented from all binary math ops (27423).
- Disabling ONNX IR v4 semantics for opset 8 or lower (28990)
- Add ONNX tests for torchvision models (30121)
- Keep output type information while exporting ONNX graph (25906)
Quantization
Quantization updates correspond to a mix of bug-fixes and feature improvements, with feature improvements adding improved operator coverage and performance improvements. We have also made a lot of progress towards enabling graph mode quantization support.
- Feature improvements:
- Enabling intra-op parallelism (26692).
- Enabling inplace relu (28710).
- Quantized Tensor support copy (28612).
- Add quantized torch mean implementation (27675).
- Add quantized avg_pool2d for pytorch mobile (27631).
- Add nn.quantized.Conv3d (29813).
- Adding inplace quantized relu6 (29245).
- Fast histogram observer (29790).
- PackedSequence support for quantized LSTM (29585).
- Improve legacy QuantizedLinear functions to reduce overhead (29773).
- Add support for quantized operator conversion from PT to C2 via ONNX (29694).
- enable per channel dynamic quantization (30122).
- Scripting support:
Visualization
- Fixed graph visualization: displaying proper names after recent JIT changes (30244)
- Support logging embedding for TensorBoard visualizations to the generic filesystem (27716)
Other Improvements
- torch.argmax/argmin: Allow half type (28787).
- torch.cuda.memory_stats / memory_summary: Instrumentation for CUDA memory allocator (27361).
- torch.set_num_threads: Allow calling multiple times with TBB (27190).
- torch.set_num_threads: Allow calling multiple times in parallel native (27947).
- torch.logical_xor: Allow non-bool tensors (27248).
- torch.promote_types: Nicer error message. (27941).
- torch.batch_norm_elemt: Add an out-variant (27621).
- torch.lerp: Implement derivative with respect to weight (28219).
- torch.complex32: Add type promotion support (27929).
- torch.unique: Support bool tensors (28374).
- torch.reshape: Improve backward for viewable geometries (28901).
- torch.lu: Generalized factorization (28608).
- torch.equal: Add the intra-op parallelism (28810).
- torch.randint: Accept generator=None (29748).
- torch.bfloat16: Enabled for cuda (27259).
- torch.multinomial: Enable for torch.half (29266).
- nn.RNN: Respect the current stream in cudnn (27026).
- nn.RNN: Preserve nonlinearity attribute (28058).
- nn.Linear: Support 0-batch size. (27211).
- nn.functional.binary_cross_entropy: Implement double backwards (26983).
- nn.AdaptiveAvgPool2d: Add support for NHWC memory format (24396).
- nn.GELU: Add GELU activation (28944).
- nn.LayerNorm: Handle batch size of zero (28614).
- nn.BatchNorm: Add NHWC support on cudnn (23861).
- nn.BatchNorm2d: Support torch.channels_last (28982).
- nn.BatchNorm2d: Handle empty inputs (30035).
- nn.LayerNorm: Enable the intra-op parallelism (28464).
- nn.utils.prune: Add pruning functionality (24076).
- nn.Sequential: Make iterable (28987).
- dtype.is_signed: Ability to differentiate signed dtypes (29511).
- optim.lr_scheduler.MultiplicativeLR: Add new multiplicative learning rate scheduler. (27254).
- cuda.comm.scatter, gather: Add channel-last support (28077).
- at::parallel_for: Choose number of OMP threads based on GRAIN_SIZE (26963).
- Return NotImplemented from unsupported tensor arithmetic operators (26507).
- Automatically select proper tqdm submodule (27108).
- Pickle support for sparse tensors (27062).
- Vectorized complex unary and binary op support. (26500).
- Complex support for reduce and linpack ops on CPU (27653).
- Complex support for compare and pointwise ops on CPU (28735).
- Make PyTorch Python 3.8 compatible (29302).
- Buffer python warning to avoid deadlocks (26613).
- Use NNPACK for strided convolutions. (29084).
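A few of the improvements listed above can be tried together in a short script. This is a minimal sketch, assuming a PyTorch build of 1.4 or later; the model shape and tensor values are made up for illustration and do not come from the release notes.

```python
import torch
import torch.nn as nn

# Tiny illustrative model that uses nn.GELU, listed above as a new activation.
model = nn.Sequential(nn.Linear(4, 8), nn.GELU(), nn.Linear(8, 2))

# nn.Sequential can be iterated directly to visit its layers in order.
layer_names = [type(layer).__name__ for layer in model]

# nn.Linear accepts an input whose batch dimension is zero.
empty_out = model(torch.empty(0, 4))

# torch.logical_xor accepts non-bool tensors; nonzero entries are treated as True.
xor = torch.logical_xor(torch.tensor([0, 1, 2]), torch.tensor([0, 0, 3]))

print(layer_names)
print(tuple(empty_out.shape))
print(xor.tolist())
```

The zero-batch call is mostly useful as a smoke test for data pipelines that can yield empty batches.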
Bug Fixes
Distributed
- Ensure NCCL error handling code is disabled for NCCL versions < 2.4 (27124).
- Fix segmentation fault in FileStore with concurrent accesses. (28812).
- Fix DDP incompatibility issue with nn.MultiheadAttention (26826).
RPC
- Add ProcessGroupAgent termination detection algorithm (26984).
- Fix pybind11 warnings in Python RPC handler implementation (27284).
- Defer creating ProcessGroupAgent listener thread until contexts are initialized (28013).
- Fix Python RPC handler exit crash (27251).
- Fix distributed autograd initialization (29069).
- Always include autograd context id in rpc_* / remote requests (29781).
- Make RRefContext singleton leaky, deal with module destruct order race. (30172).
C++ API Bug Fixes
- at::Tensor::requires_grad_ now supported (26332).
- torch::isfinite now supported (30083).
- torch::nn::modules_ordered_dict is deprecated (28774).
- Add reset_parameters to torch::nn modules (29832).
- Allow passing undefined Tensor to Module::register_parameter (27948).
- Exclude undefined tensors in the result of Module::parameters() / named_parameters() / buffers() / named_buffers() (30626).
- Include hierarchy information in C++ API loading error messages (28499).
- Fix a bug: the C++ L-BFGS optimizer does not work properly if there are one or more registered tensors with no grad in the model (27606).
- Use c10::variant-based enums for Nonlinearity and FanMode (27933). Support for torch::nn::init::Nonlinearity and torch::nn::init::FanMode will be removed in 1.5.
JIT
- Make dropout properly condition on training. (29436)
- Fix aten::grad to return optional list (29577)
- Fix torch.arange dtype
- Fix type sharing on loaded ScriptModules (29826)
- Fix type sharing between traced modules (29583)
- Check for mutable default parameters (29833)
- Fix tracing of autograd functions (29791)
- Check for unrolled loop in break & continue (29474)
- Fix negative string indexing (22700)
- Make jit.trace_module reentrant (29411)
- Fix jit outplace tracing and reapply changes to _like operators. (28839)
- Properly guard against inheritance on TorchScript classes (28407)
- Fix when giving jit format warning about unsupported options (28616)
- Fix handling of function attributes. (28569)
- Fix pushLong() issue in pickler. (28057)
- Fix broken name mangling (27511)
- Fix segfault while printing value type for an error msg in emitListComprehension (27261)
- Fix toIValue dict iteration (26856)
- Fix race condition in Function::optimized_graph(). (27012)
- Sanitize module names on legacy import (27764)
- Python None should have its type inferred as NoneType (26665)
- Properly set existing attributes under recursive script (27514)
Quantization
- Skip copy_same_type_transpose_ for quantized tensor (29609).
- Add note that cuda quantization is not supported (27829).
- Rename _intrinsic to intrinsic (27194).
- Better error message for quantized dispatch (28635).
- Update the misleading comments for zero_points and scale in dynamic quant linear module [1/2] (28767).
- Avoid the misleading zero_point and scale [2/2] (28827).
- Add the warning message for API with linear modules (28766).
- Do not insert observers for empty sequential modules (28384).
- Fix the padding issue of quantized average pool operator (28260).
Mobile
Other Bug fixes
- torch.kthvalue: Fix CUDA shared memory out of bound access in findPattern (28989).
- torch.save: Fix source files not being saved (28965).
- torch.load: Fix OSError loading files larger than 2GB. (27125).
- torch.linspace: Clearer error message for negative step sizes. (28274).
- torch.histc: Add range checks to avoid segfaults (27712).
- torch.lu: Fix thread local issue on cpu (28546).
- torch.max_pool2d: Limit tensor size to max CUDA grid size (28931).
- torch.renorm: Fix a memory leak in CUDA renorm. (29873).
- torch.index_add: Fix bug in atomicAdd on CUDA for some dtypes (29231).
- torch.addmm: Fix handling of empty tensors (28613).
- nn.CTCLoss: Fix incorrect gradient for large target sizes (27460).
- nn.functional.ctc_loss: Fix incorrect gradient on cudnn (27039).
- nn.Embedding: Incorrect gradient at padding_idx in cuda kernel. (27731).
- nn.LayerNorm: Fix an illegal memory access error (28196).
- nn.Conv2d: Handle zero stride (28784).
- nn.PoissonNLLLoss: Fix incorrect result with full=True (28637).
- nn.AvgPool2d: Fix an overflow for 2^31-1 sized inputs (30793).
- nn.RNNBase: Fix an issue with use of children of RNN third party device types (28562).
- nn.Upsample: Fix “invalid configuration argument” error (28927).
- nn.Upsample: Fix a CUDA launch config failure (29016).
- optim.lr_scheduler.OneCycleLR: Correctly handle div_factor parameter (28217).
- PackedSequence.to: Ensure all tensors are moved (27245).
- EventList.total_average: Fix a regression caused by missing iadd (27498).
- Tensor.record_stream: Ensure stream is recorded for shifted view tensors (27371).
- torch.hub: Handle branch names containing a slash. (27960).
- Fix error handling in Magma kernels (29003).
- Fix avx for c++14 (28207).
- Fix illegal memory access thread safety issue in sparse CUDA (29426).
Deprecations
Python 2 support is deprecated and will not be supported in the 1.5 release.
torch.optim: Scheduler.step(epoch) is now deprecated; use Scheduler.step() instead. (26432)
For example:
>>> for epoch in range(10):
>>>     optimizer.step()
>>>     scheduler.step(epoch)
DeprecationWarning: The epoch parameter in `scheduler.step()` was not necessary and is being deprecated where possible. Please use `scheduler.step()` to step the scheduler. During the deprecation, if epoch is different from None, the closed form is used instead of the new chainable form, where available. Please open an issue if you are unable to replicate your use case: https://github.com/pytorch/pytorch/issues/new/choose.
  warnings.warn(EPOCH_DEPRECATION_WARNING, DeprecationWarning)
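For contrast with the deprecated call, the loop below steps the scheduler without an epoch argument. It is a minimal sketch rather than a recommended training recipe: the model, data, learning rate and the 0.95 decay factor are illustrative assumptions, and MultiplicativeLR is used only because it is the scheduler added in this release.

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import MultiplicativeLR

# Illustrative model and optimizer; none of these values come from the release notes.
model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Multiply the learning rate by a constant factor after every epoch.
scheduler = MultiplicativeLR(optimizer, lr_lambda=lambda epoch: 0.95)

for epoch in range(3):
    optimizer.zero_grad()
    loss = model(torch.randn(8, 4)).pow(2).mean()
    loss.backward()
    optimizer.step()
    scheduler.step()  # no epoch argument: the scheduler tracks the epoch itself

final_lr = optimizer.param_groups[0]["lr"]
print(final_lr)
```

Calling optimizer.step() before scheduler.step() keeps the ordering used in the example above.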
[C++] C++11 is deprecated and will not be supported in the 1.5 release.
[C++] Tensor::is_variable() has been deprecated. As noted in the Backwards Incompatible Changes section, the distinction between variable and non-variable has been eliminated, so this check is no longer meaningful. Generally, is_variable() will now return true except in some special circumstances (see 29653 for more details). (29653)
[C++] torch::nn::modules_ordered_dict has been deprecated. It is generally no longer necessary and can just be removed. (28774)
torch.jit.quantized API has been deprecated in favor of torch.quantization.quantize_dynamic (28766)
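As a rough illustration of the replacement API named in the last item, the sketch below applies torch.quantization.quantize_dynamic to a small model. The layer sizes and input are made-up values, and the example assumes a CPU build with a quantized backend available (for example FBGEMM on x86).

```python
import torch
import torch.nn as nn

# Small float model; the layer sizes are arbitrary illustrative values.
model = nn.Sequential(nn.Linear(8, 4), nn.ReLU(), nn.Linear(4, 2))
model.eval()

# Swap the nn.Linear layers for dynamically quantized versions with int8 weights.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# The quantized model is called like the original one and returns float outputs.
out = quantized(torch.randn(3, 8))
first_layer_module = type(quantized[0]).__module__

print(first_layer_module)
print(tuple(out.shape))
```

Only the module types passed in the set are converted; the ReLU in this sketch is left unchanged.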
Performance
A benchmark suite is available to easily measure the performance of operators with a range of input shapes. The generated benchmark data fully characterize the performance of operators in terms of execution time. For more details see README.md in the benchmarks/operator_benchmark directory.
- torch.nn.functional.threshold, torch.nn.functional.layer_norm, torch.cdist: Performance of threshold (CPU), layer norm (CUDA) and cdist operations was improved (27155, 27634, 25799)
- torch.Tensor.fill_: Performance for half and bfloat16 types on CPU was improved (28397).
- torch.nn.MaxPool2d: An implementation for the channels_last format was added (24872)
- There is a fast path reducing the overheads of pointwise operations relying on TensorIterator under certain conditions (contiguous inputs, no broadcast) (29180).
- Overheads of operations with scalars/number literals were reduced (29915).
- In case of type promotion on the GPU, the values are converted on the fly, without explicit casting of the full tensor (30018).
- reorder_dimensions in TensorIterator favors output write locality, improving overall performance when operating on discontiguous tensors (28615).
- Float pickling speed was improved (28553).
- GRAIN_SIZE for intra-op parallelization was unified between TH and ATen operations (28770)
- tensor.numel devirtualized, improving performance (27294)
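To exercise the channels_last path mentioned for torch.nn.MaxPool2d, an input can be converted to that memory format before pooling. This is a minimal sketch with made-up shapes; it only checks that both layouts produce the same values and makes no claim about the speedup on any particular hardware.

```python
import torch
import torch.nn as nn

# Illustrative NCHW input and the same data laid out as channels_last (NHWC in memory).
x = torch.randn(2, 3, 8, 8)
x_cl = x.contiguous(memory_format=torch.channels_last)

pool = nn.MaxPool2d(kernel_size=2)
y = pool(x)
y_cl = pool(x_cl)

# The memory layout changes how data is stored, not the pooled values.
is_channels_last = x_cl.is_contiguous(memory_format=torch.channels_last)
same_values = torch.allclose(y, y_cl)

print(is_channels_last, same_values, tuple(y_cl.shape))
```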