For more details about these highlighted features, you can look at the release blogpost.
Below are the full release notes for this release.
Due to a bug introduced in CUDA 12.9.1, we are unable to complete full Windows wheel builds with this version, as compilation of torch.segment_reduce() crashes the build. Thus, we provide a wheel without torch.segment_reduce() included in order to sidestep the issue. If you need support for torch.segment_reduce(), please utilize a different version.
Due to binary size limitations, support for sm50 - sm60 architectures with CUDA 12.8 and 12.9 has
been dropped for the 2.8.0 release. If you need support for these architectures, please utilize
CUDA 12.6 instead.
Unsupported inputs to some operations now raise NotImplementedError instead of RuntimeError (#155470)
Please update exception handling logic to reflect this.
In 2.7.0
try:
    torch.nn.Hardshrink()(torch.randint(0, 5, (10,)))
except RuntimeError:
    ...
In 2.8.0
try:
    torch.nn.Hardshrink()(torch.randint(0, 5, (10,)))
except NotImplementedError:
    ...
Added missing in-place on view check to custom autograd.Function (#153094)
In 2.8.0, if a custom autograd.Function mutates a view of a leaf requiring grad, it now properly raises an error. Previously, it would silently leak memory.
class Func(torch.autograd.Function):
    @staticmethod
    def forward(ctx, inp):
        inp.add_(1)
        ctx.mark_dirty(inp)
        return inp

    @staticmethod
    def backward(ctx, gO):
        pass

a = torch.tensor([1.0, 2.0], requires_grad=True)
b = a.view_as(a)
Func.apply(b)
Output:
Version 2.7.0
Runs without error, but leaks memory
Version 2.8.0
RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation
An error is now properly thrown for the out variant of tensordot when called with a requires_grad=True tensor (#150270)
Please avoid passing an out tensor with requires_grad=True as gradients cannot be computed for this tensor.
In 2.7.0
a = torch.empty((4, 2), requires_grad=True)
b = torch.empty((2, 4), requires_grad=True)
c = torch.empty((2, 2), requires_grad=True)
# does not error, but gradients for c cannot be computed
torch.tensordot(a, b, dims=([1], [0]), out=c)
In 2.8.0
a = torch.empty((4, 2), requires_grad=True)
b = torch.empty((2, 4), requires_grad=True)
c = torch.empty((2, 2), requires_grad=True)
torch.tensordot(a, b, dims=([1], [0]), out=c)
# RuntimeError: tensordot(): the 'out' tensor was specified and requires gradients, and
# its shape does not match the expected result. Either remove the 'out' argument, ensure
# it does not require gradients, or make sure its shape matches the expected output.
torch.compile
Specialization of a tensor shape with mark_dynamic applied now correctly errors (#152661)
Prior to 2.8, it was possible for a guard on a symbolic shape to be incorrectly omitted if the symbolic shape evaluation was previously tested with guards suppressed (this often happens within the compiler itself). This has been fixed in 2.8 and usually will just silently "do the right thing" and add the correct guard. However, if the new guard causes a tensor marked with mark_dynamic to become specialized, this can result in an error. One workaround is to use maybe_mark_dynamic instead of mark_dynamic. See the discussion in issue #157921 for more context.
Version 2.7.0
import torch

embed = torch.randn(2, 8192)
x = torch.zeros(8192)
torch._dynamo.mark_dynamic(x, 0)

@torch.compile
def f(embedding_indices, x):
    added_tokens_mask = torch.where(x > 10000, 1, 0)
    ei = torch.narrow(embedding_indices, 1, 0, x.size(0))
    return ei.clone()

f(embed, x)
Version 2.8.0
import torch

embed = torch.randn(2, 8192)
x = torch.zeros(8192)
torch._dynamo.maybe_mark_dynamic(x, 0)

@torch.compile
def f(embedding_indices, x):
    added_tokens_mask = torch.where(x > 10000, 1, 0)
    ei = torch.narrow(embedding_indices, 1, 0, x.size(0))
    return ei.clone()

f(embed, x)

Several config variables related to torch.compile have been renamed or removed
- enable_cpp_framelocals_guard_eval has changed to no longer have any effect (#151008).
- rocm.n_max_profiling_configs is deprecated (#152341); use rocm.ck_max_profiling_configs and rocm.ck_tile_max_profiling_configs instead.
- autotune_fallback_to_aten is deprecated (#154331). Inductor will no longer fall back to ATen. Please add "ATEN" to max_autotune_gemm_backends for the old behavior.
- use_mixed_mm and mixed_mm_choice are deprecated (#152071). Inductor now supports prologue fusion, so there is no need for them.
- descriptive_names = False is deprecated (#151481). Please use one of the other available options: "torch", "original_aten", or "inductor_node".
- custom_op_default_layout_constraint has moved from inductor config to functorch config (#148104). Please reference it via torch._functorch.config.custom_op_default_layout_constraint instead of torch._inductor.config.custom_op_default_layout_constraint.
- emit_current_arch_binary is deprecated (#155768).
- aot_inductor.embed_cubin has been renamed to aot_inductor.embed_kernel_binary (#154412).
- aot_inductor.compile_wrapper_with_O0 has been renamed to compile_wrapper_opt_level (#148714).
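To illustrate a couple of the moved and renamed options above, here is a minimal sketch (assuming a PyTorch 2.8 environment; whether you need to touch these knobs at all depends on your setup):

import torch

# 2.7 and earlier: torch._inductor.config.custom_op_default_layout_constraint
# 2.8: the same knob now lives in the functorch config namespace
print(torch._functorch.config.custom_op_default_layout_constraint)

# Renamed AOTInductor options are set under their new names, e.g.:
torch._inductor.config.aot_inductor.embed_kernel_binary = True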
Stricter alias/mutation checks were added for HigherOrderOperators (e.g. cond), which will explicitly error out if alias/mutation among inputs and outputs is unsupported (#148953, #146658).
For affected HigherOrderOperators, add .clone() to aliased outputs to address this.
Version 2.7.0
import torch

@torch.compile(backend="eager")
def fn(x):
    return torch.cond(x.sum() > 0, lambda x: x, lambda x: x + 1, [x])

fn(torch.ones(3))
Version 2.8.0
import torch

@torch.compile(backend="eager")
def fn(x):
    return torch.cond(x.sum() > 0, lambda x: x.clone(), lambda x: x + 1, [x])

fn(torch.ones(3))
guard_or_x and definitely_x have been consolidated (#152463)
We removed definitely_true / definitely_false and associated APIs, replacing them with guard_or_true / guard_or_false, which offer similar functionality and can be used to achieve the same effect. Please migrate to the latter.
Version 2.7.0
from torch.fx.experimental.symbolic_shapes import definitely_false, definitely_true
...
if definitely_true(x):
    ...
if definitely_false(y):
    ...
Version 2.8.0
from torch.fx.experimental.symbolic_shapes import guard_or_false, guard_or_true
...
if guard_or_false(x):
    ...
# alternatively: if guard_or_false(torch.sym_not(y))
if not guard_or_true(y):
    ...

torch.export
torch.export.export_for_inference has been removed in favor of torch.export.export_for_training().run_decompositions() (#149078)
Version 2.7.0
import torch
...
exported_program = torch.export.export_for_inference(mod, args, kwargs)
Version 2.8.0
import torch
...
exported_program = torch.export.export_for_training(
    mod, args, kwargs
).run_decompositions(decomp_table=decomp_table)

Switched default to strict=False in torch.export.export and export_for_training (#148790, #150941)
This differs from the previous release default of strict=True. To revert to the old default behavior, please explicitly pass strict=True.
Version 2.7.0
import torch

# default behavior is strict=True
torch.export.export(...)
torch.export.export_for_training(...)
Version 2.8.0
import torch

# strict=True must be explicitly passed to get the old behavior
torch.export.export(..., strict=True)
torch.export.export_for_training(..., strict=True)

ONNX
Default opset in torch.onnx.export is now 18 (#156023)
When dynamo=False, th...
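If your deployment target requires the previous opset, a minimal hedged sketch of pinning it explicitly (the model and file name below are placeholders):

import torch

model = torch.nn.Linear(4, 4)
args = (torch.randn(1, 4),)

# Without opset_version, 2.8 defaults to opset 18; pass it explicitly to keep an older opset.
torch.onnx.export(model, args, "model.onnx", opset_version=17)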
This release is meant to fix the following issues (regressions / silent correctness):
Torch.compile
Fix excessive cudagraph re-recording for HF LLM models (#152287)
Fix torch.compile on some HuggingFace models (#151154)
Fix crash due to Exception raised inside torch.autocast (#152503)
Improve Error logging in torch.compile (#149831)
Mark mutable custom operators as cacheable in torch.compile (#151194)
Implement workaround for a graph break with older version einops (#153925)
Fix an issue with tensor.view(dtype).copy_(...) (#151598)
Fix assertion error due to inductor permuting inputs to flex attention (#151959)
Fix performance regression on nanogpt speedrun (#152641)
Fix extra CUDA context created by barrier (#149144)
Fix an issue related to Distributed Fused Adam in Rocm/APEX when using nccl_ub feature (#150010)
Add a workaround random hang in non-blocking API mode in NCCL 2.26 (#154055)
Fix MacOS compilation error with Clang 17 (#151316)
Fix binary kernels produce incorrect results when one of the tensor arguments is from a wrapped scalar on MPS devices (#152997)
Improve PyTorch wheel size after the introduction of 128-bit vectorization (#148320) (#152396)
Fix fmsub function definition (#152075)
Fix Floating point exception in torch.mkldnn_max_pool2d (#151848)
Fix abnormal inference output with XPU:1 device (#153067)
Fix Illegal Instruction Caused by grid_sample on Windows (#152613)
Fix ONNX decomposition does not preserve custom CompositeImplicitAutograd ops (#151826)
Fix error with dynamic linking of libgomp library (#150084)
Fix segfault in profiler with Python 3.13 (#153848)
For more details about these highlighted features, you can look at the release blogpost.
Below are the full release notes for this release.
Some users with 12.2 CUDA driver (535 version) report seeing "CUDA driver error: invalid argument" during NCCL or Symmetric Memory initialization. This issue is currently under investigation, see #150852. If you use PyTorch from source, a known workaround is to rebuild PyTorch with CUDA 12.2 toolkit. Otherwise, you can try upgrading the CUDA driver on your system.
Backwards Incompatible Changes
Dropped support for Triton < 2.2.0. Removed support for CUDA 12.4 and Anaconda in CI/CD.
py_limited_api=True is now built with -DPy_LIMITED_API (#145764)
We formally began respecting the py_limited_api=True kwarg in 2.6 and stopped linking libtorch_python.so when the flag was specified, as libtorch_python.so does not guarantee using APIs from the stable Python limited API. In 2.7, we go further by specifying the -DPy_LIMITED_API flag, which enforces that the extension is buildable with the limited API. As a result of this enforcement, custom extensions that set py_limited_api=True but do not abide by the limited API may fail to build. For an example, see #152243.
This is strictly better behavior, as it is sketchy to claim CPython agnosticism without enforcing it with the flag. If you run into this issue, please ensure that the extension you are building does not use any APIs outside of the Python limited API, e.g., pybind.
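For reference, a minimal setup.py sketch that opts into the limited API (package and file names are placeholders, and the bdist_wheel tag is an assumption about how you want to label the resulting abi3 wheel):

from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CppExtension

setup(
    name="my_ext",  # placeholder package name
    ext_modules=[
        CppExtension(
            "my_ext._C",
            ["csrc/ext.cpp"],      # placeholder source file
            py_limited_api=True,   # in 2.7 this also adds -DPy_LIMITED_API, enforcing the limited API
        )
    ],
    cmdclass={"build_ext": BuildExtension},
    # assumption: tag the wheel as abi3 so one build covers CPython >= 3.9
    options={"bdist_wheel": {"py_limited_api": "cp39"}},
)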
torch.Tensor.new_tensor() now creates the new tensor on the given Tensor's device by default (#144958)
This function previously always created the new Tensor on the "cpu" device and will now use the same device as the current Tensor object. This behavior is now consistent with other .new_* methods.
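A small sketch of the behavior change (the CUDA device is only for illustration):

import torch

x = torch.randn(3, device="cuda")
y = x.new_tensor([1.0, 2.0, 3.0])
# 2.6: y lives on "cpu"; 2.7: y lives on x's device ("cuda"), matching other .new_* methods.
# Pass device="cpu" explicitly if you relied on the old behavior:
y_cpu = x.new_tensor([1.0, 2.0, 3.0], device="cpu")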
With the migration to manylinux_2_28 (AlmaLinux 8 based), we can no longer support OS distros with glibc 2.26. These include the popular Amazon Linux 2 and CentOS 7. (#143423, #146200, #148028, #148135, #148195, #148129)
torch.onnx.dynamo_export now uses the ExportedProgram logic path (#137296)
Users using the torch.onnx.dynamo_export API may see some ExportOptions become unsupported due to an internal switch to use torch.onnx.export(..., dynamo=True): diagnostic_options, fake_context and onnx_registry are removed/ignored by ExportOptions. Only dynamic_shapes is retained.
Users should move to use the dynamo=True option on torch.onnx.export, as torch.onnx.dynamo_export is now deprecated. Leverage the dynamic_shapes argument in torch.onnx.export for specifying dynamic shapes on the model.
Version 2.6.0
torch.onnx.dynamo_export(model, *args, **kwargs)
Version 2.7.0
torch.onnx.export(model, args, kwargs=kwargs, dynamo=True)

Finish deprecation of LRScheduler.print_lr() along with the verbose kwarg to the LRScheduler constructor (#147301)
Both APIs have been deprecated since 2.2. Please use LRScheduler.get_last_lr() to access the learning rate instead.
print_lr and verbose were confusing, not properly documented and were little used, as described in #99270, so we deprecated them in 2.2. Now, we complete the deprecation by removing them completely. To access and print the learning rate of a LRScheduler:
Version 2.6.0
optim = ...
lrsched = torch.optim.lr_scheduler.ReduceLROnPlateau(optim, verbose=True)
# lrsched will internally call print_lr() and print the learning rate
Version 2.7.0
optim = ...
lrsched = torch.optim.lr_scheduler.ReduceLROnPlateau(optim)
print(lrsched.get_last_lr())

libtorch_python.so symbols are now invisible by default on all platforms except Apple (#142214)
Previously, the symbols in libtorch_python.so were exposed with default visibility. We have transitioned to being more intentional about what we expose as public symbols for our python API in C++. After #142214, public symbols will be marked explicitly while everything else will be hidden. Some extensions using private symbols will see linker failures with this change.
Please use torch.export.export instead of capture_pre_autograd_graph to export the model for PyTorch 2 Export Quantization (#139505)
capture_pre_autograd_graph was a temporary API in torch.export. Now that a better, longer-term API (export) is available, we can deprecate it.
Version 2.6.0
from torch._export import capture_pre_autograd_graph
from torch.ao.quantization.quantize_pt2e import prepare_pt2e
from torch.ao.quantization.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)

quantizer = XNNPACKQuantizer().set_global(
    get_symmetric_quantization_config()
)
m = capture_pre_autograd_graph(m, *example_inputs)
m = prepare_pt2e(m, quantizer)
Version 2.7.0
from torch.export import export
from torch.ao.quantization.quantize_pt2e import prepare_pt2e
# please get xnnpack quantizer from executorch (https://github.com/pytorch/executorch/)
from executorch.backends.xnnpack.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)

quantizer = XNNPACKQuantizer().set_global(
    get_symmetric_quantization_config()
)
m = export(m, *example_inputs)
m = prepare_pt2e(m, quantizer)

New interface for torch.fx.passes.graph_transform_observer.GraphTransformObserver to enable node-level provenance tracking (#144277)
We now track a mapping between the nodes in the pre-grad and post-grad graph. See the issue for an example frontend to visualize the transformations. To update your GraphTransformObserver subclasses, instead of overriding on_node_creation and on_node_erase, there are new functions get_node_creation_hook, get_node_erase_hook, get_node_replace_hook and get_deepcopy_hook. These are registered on the GraphModule member of the GraphTransformObserver upon entry and exit of a with block.
Version 2.6.0
class MyPrintObserver(GraphTransformObserver):
    def on_node_creation(self, node: torch.fx.Node):
        print(node)
Version 2.7.0
class MyPrintObserver(GraphTransformObserver):
    def get_node_creation_hook(self):
        def hook(node: torch.fx.Node):
            print(node)
        return hook
torch.ao.quantization.pt2e.graph_utils.get_control_flow_submodules is no longer public (#141612)
We are planning to make all functions under torch.ao.quantization.pt2e.graph_utils private. This update marks get_control_flow_submodules as a private API. If you have to or want to continue using get_control_flow_submodules, please make a private call by using _get_control_flow_submodules.
Example:
Version 2.6:
>>> from torch.ao.quantization.pt2e.graph_utils import get_control_flow_submodules
Version 2.7:
>>> from torch.ao.quantization.pt2e.graph_utils import get_control_flow_submodules
ImportError: cannot import name 'get_control_flow_submodules' from 'torch.ao.quantization.pt2e.graph_utils'
>>> from torch.ao.quantization.pt2e.graph_utils import _get_control_flow_submodules  # Note: Use _get_control_flow_submodules for private access

Deprecations
torch.onnx.dynamo_export is deprecated (#146425, #146639, #146923)
Users should use the dynamo=True option on torch.onnx.export.
Version 2.6.0
torch.onnx.dynamo_export(model, *args, **kwargs)
Version 2.7.0
torch.onnx.export(model, args, kwargs=kwargs, dynamo=True)
XNNPACKQuantizer is deprecated in PyTorch and moved to ExecuTorch, please use it from executorch.backends.xnnpack.quantizer.xnnpack_quantizer instead of torch.ao.quantization.quantizer.xnnpack_quantizer (#144940)
XNNPACKQuantizer is a quantizer for xnnpack that was added into pytorch/pytorch for initial development. Ho...
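In practice the change is an import swap; a minimal sketch (assuming ExecuTorch is installed):

# Deprecated location (still importable in 2.7, slated for removal):
#   from torch.ao.quantization.quantizer.xnnpack_quantizer import XNNPACKQuantizer
# New location in ExecuTorch:
from executorch.backends.xnnpack.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)

quantizer = XNNPACKQuantizer().set_global(get_symmetric_quantization_config())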
We are excited to announce the release of PyTorch® 2.6 (release notes)! This release features multiple improvements for PT2: torch.compile can now be used with Python 3.13; new performance-related knob torch.compiler.set_stance; several AOTInductor enhancements. Besides the PT2 improvements, another highlight is FP16 support on X86 CPUs.
NOTE: Starting with this release we are not going to publish on Conda, please see [Announcement] Deprecating PyTorch’s official Anaconda channel for the details.
For this release the experimental Linux binaries shipped with CUDA 12.6.3 (as well as Linux Aarch64, Linux ROCm 6.2.4, and Linux XPU binaries) are built with CXX11_ABI=1 and are using the Manylinux 2.28 build platform. If you build PyTorch extensions with custom C++ or CUDA extensions, please update these builds to use CXX_ABI=1 as well and report any issues you are seeing. For the next PyTorch 2.7 release we plan to switch all Linux builds to Manylinux 2.28 and CXX11_ABI=1, please see [RFC] PyTorch next wheel build platform: manylinux-2.28 for the details and discussion.
Also in this release, as an important security improvement measure, we have changed the default value for the weights_only parameter of torch.load. This is a backward compatibility-breaking change; please see this forum post for more details.
This release is composed of 3892 commits from 520 contributors since PyTorch 2.5. We want to sincerely thank our dedicated community for your contributions. As always, we encourage you to try these out and report any issues as we improve PyTorch. More information about how to get started with the PyTorch 2-series can be found at our Getting Started page.
Beta: torch.compiler.set_stance; torch.library.triton_op; torch.compile support for Python 3.13; New packaging APIs for AOTInductor; AOTInductor: minifier; AOTInductor: ABI-compatible mode code generation; FP16 support for X86 CPUs
Prototype: Improved PyTorch user experience on Intel GPUs; FlexAttention support on X86 CPU for LLMs; Dim.AUTO; CUTLASS and CK GEMM/CONV Backends for AOTInductor
*To see a full list of public feature submissions click here.
BETA FEATURES
[Beta] torch.compiler.set_stance
This feature enables the user to specify different behaviors ("stances") that torch.compile can take between different invocations of compiled functions. One of the stances, for example, is "eager_on_recompile", which instructs PyTorch to run code eagerly when a recompile is necessary, reusing cached compiled code when possible.
For more information please refer to the set_stance documentation and the Dynamic Compilation Control with torch.compiler.set_stance tutorial.
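A minimal sketch of what the knob looks like in practice (the shapes and stance choice are illustrative):

import torch

@torch.compile
def f(x):
    return x * x

f(torch.randn(4))  # compiles as usual

# Fall back to eager instead of recompiling when a new input shape
# would otherwise trigger a recompile.
torch.compiler.set_stance("eager_on_recompile")
f(torch.randn(8))  # runs eagerly rather than recompiling

torch.compiler.set_stance("default")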
[Beta] torch.library.triton_op
torch.library.triton_op offers a standard way of creating custom operators that are backed by user-defined triton kernels.
When users turn user-defined triton kernels into custom operators, torch.library.triton_op allows torch.compile to peek into the implementation, enabling torch.compile to optimize the triton kernel inside it.
For more information please refer to the triton_op documentation and the Using User-Defined Triton Kernels with torch.compile tutorial.
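A condensed sketch along the lines of the tutorial, assuming a CUDA device and the torch.library.triton_op / wrap_triton APIs described in the 2.6 documentation (the operator name "mylib::add" is a placeholder):

import torch
import triton
import triton.language as tl
from torch.library import triton_op, wrap_triton

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

@triton_op("mylib::add", mutates_args={})
def mylib_add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    # wrap_triton lets torch.compile trace into (and optimize around) the kernel call
    wrap_triton(add_kernel)[(triton.cdiv(n, 1024),)](x, y, out, n, BLOCK_SIZE=1024)
    return out

x = torch.randn(4096, device="cuda")
print(torch.compile(lambda a: mylib_add(a, a))(x))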
[Beta] torch.compile support for Python 3.13
torch.compile previously only supported Python up to version 3.12. Users can now optimize models with torch.compile in Python 3.13.
[Beta] New packaging APIs for AOTInductor
A new package format, “PT2 archive”, has been introduced. This essentially contains a zipfile of all the files that need to be used by AOTInductor, and allows users to send everything needed to other environments. There is also functionality to package multiple models into one artifact, and to store additional metadata inside of the package.
For more details please see the updated torch.export AOTInductor Tutorial for Python runtime.
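A rough sketch of the new packaging flow (the module and the "m.pt2" path are placeholders, and the exact keyword set may differ slightly across versions):

import torch

class M(torch.nn.Module):
    def forward(self, x):
        return torch.relu(x) + 1

ep = torch.export.export(M(), (torch.randn(8),))

# Compile with AOTInductor and bundle everything into a single PT2 archive.
pkg = torch._inductor.aoti_compile_and_package(ep, package_path="m.pt2")

# Later (possibly in another environment): load and run the packaged model.
runner = torch._inductor.aoti_load_package(pkg)
print(runner(torch.randn(8)))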
[Beta] AOTInductor: minifier
If a user encounters an error while using AOTInductor APIs, AOTInductor Minifier allows creation of a minimal nn.Module that reproduces the error.
For more information please see the AOTInductor Minifier documentation.
[Beta] AOTInductor: ABI-compatible mode code generation
AOTInductor-generated model code has a dependency on PyTorch C++ libraries. As PyTorch evolves quickly, it's important to make sure previously AOTInductor-compiled models can continue to run on newer PyTorch versions, i.e. that AOTInductor is backward compatible.
In order to guarantee application binary interface (ABI) backward compatibility, we have carefully defined a set of stable C interfaces in libtorch and made sure AOTInductor generates code that only refers to this specific set of APIs and nothing else in libtorch. We will keep the set of C APIs stable across PyTorch versions and thus provide backward compatibility guarantees for AOTInductor-compiled models.
[Beta] FP16 support for X86 CPUs (both eager and Inductor modes)
Float16 datatype is commonly used for reduced memory usage and faster computation in AI inference and training. CPUs like the recently launched Intel® Xeon® 6 with P-Cores support the Float16 datatype with the native accelerator AMX. Float16 support on X86 CPUs was introduced in PyTorch 2.5 as a prototype feature, and now it has been further improved for both eager mode and Torch.compile + Inductor mode, making it a Beta-level feature with both functionality and performance verified across a broad scope of workloads.
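A minimal sketch of exercising float16 on CPU in both modes (the layer sizes are arbitrary, and performance will depend on AMX support on the host):

import torch

model = torch.nn.Sequential(torch.nn.Linear(256, 256), torch.nn.ReLU()).eval().half()
x = torch.randn(32, 256, dtype=torch.float16)

with torch.no_grad():
    y_eager = model(x)                    # eager float16 on CPU
    y_compiled = torch.compile(model)(x)  # Inductor float16 on CPU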
PROTOTYPE FEATURES
[Prototype] Improved PyTorch user experience on Intel GPUs
PyTorch user experience on Intel GPUs is further improved with simplified installation steps, Windows release binary distribution and expanded coverage of supported GPU models including the latest Intel® Arc™ B-Series discrete graphics. Application developers and researchers seeking to fine-tune, inference and develop with PyTorch models on Intel® Core™ Ultra AI PCs and Intel® Arc™ discrete graphics will now be able to directly install PyTorch with binary releases for Windows, Linux and Windows Subsystem for Linux 2.
For more information regarding Intel GPU support, please refer to Getting Started Guide.
[Prototype] FlexAttention support on X86 CPU for LLMs
FlexAttention was initially introduced in PyTorch 2.5 to provide optimized implementations for Attention variants with a flexible API. In PyTorch 2.6, X86 CPU support for FlexAttention was added through TorchInductor CPP backend. This new feature leverages and extends current CPP template abilities to support...
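For reference, a small sketch of the FlexAttention API this CPU support targets (the score_mod here is just an illustrative relative-position bias):

import torch
from torch.nn.attention.flex_attention import flex_attention

def rel_bias(score, batch, head, q_idx, kv_idx):
    return score + 0.1 * (q_idx - kv_idx)

q, k, v = (torch.randn(1, 4, 128, 64) for _ in range(3))
out = torch.compile(flex_attention)(q, k, v, score_mod=rel_bias)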
PyTorch 2.5.1: bug fix release
This release is meant to fix the following regressions:
Besides the regression fixes, the release includes several documentation updates.
See release tracker #132400 for additional information.
PyTorch 2.5.0 Release, SDPA CuDNN backend, Flex Attention
PyTorch 2.5 Release Notes
We are excited to announce the release of PyTorch® 2.5! This release features a new CuDNN backend for SDPA, enabling speedups by default for users of SDPA on H100s or newer GPUs. As well, regional compilation of torch.compile offers a way to reduce the cold start up time for torch.compile by allowing users to compile a repeated nn.Module (e.g. a transformer layer in LLM) without recompilations. Finally, TorchInductor CPP backend offers solid performance speedup with numerous enhancements like FP16 support, CPP wrapper, AOT-Inductor mode, and max-autotune mode.
This release is composed of 4095 commits from 504 contributors since PyTorch 2.4. We want to sincerely thank our dedicated community for your contributions. As always, we encourage you to try these out and report any issues as we improve 2.5. More information about how to get started with the PyTorch 2-series can be found at our Getting Started page.
As well, please check out our new ecosystem projects releases with TorchRec and TorchFix.
*To see a full list of public feature submissions click here.
BETA FEATURES
[Beta] CuDNN backend for SDPA
The cuDNN "Fused Flash Attention" backend was landed for torch.nn.functional.scaled_dot_product_attention. On NVIDIA H100 GPUs this can provide up to 75% speed-up over FlashAttentionV2. This speedup is enabled by default for all users of SDPA on H100 or newer GPUs.
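The backend is picked automatically, but a hedged sketch of forcing it explicitly (assuming an H100-class GPU and fp16 inputs):

import torch
from torch.nn.attention import SDPBackend, sdpa_kernel

q, k, v = (
    torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16) for _ in range(3)
)
with sdpa_kernel(SDPBackend.CUDNN_ATTENTION):
    out = torch.nn.functional.scaled_dot_product_attention(q, k, v)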
[Beta] Regional compilation without recompilations
Regional compilation without recompilations is available via torch._dynamo.config.inline_inbuilt_nn_modules, which defaults to True in 2.5+. This option allows users to compile a repeated nn.Module (e.g. a transformer layer in LLM) without recompilations. Compared to compiling the full model, this option can result in smaller compilation latencies with 1%-5% performance degradation compared to full model compilation.
See the tutorial for more information.
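A small sketch of the regional-compilation pattern, compiling the repeated block rather than the whole model (layer sizes are arbitrary):

import torch

class Block(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(64, 64)

    def forward(self, x):
        return torch.relu(self.linear(x))

class Model(torch.nn.Module):
    def __init__(self, n_layers=4):
        super().__init__()
        self.blocks = torch.nn.ModuleList(Block() for _ in range(n_layers))
        for block in self.blocks:
            block.compile()  # compile the repeated region; all instances reuse the compiled code

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return x

Model()(torch.randn(2, 64))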
[Beta] TorchInductor CPU backend optimization
This feature advances Inductor's CPU backend optimization, including CPP backend code generation and FX fusions with customized CPU kernels. The Inductor CPU backend supports vectorization of common data types and all Inductor IR operations, along with the static and symbolic shapes. It is compatible with both Linux and Windows OS and supports the default Python wrapper, the CPP wrapper, and AOT-Inductor mode.
Additionally, it extends the max-autotune mode of the GEMM template (prototyped in 2.5), offering further performance gains. The backend supports various FX fusions, lowering to customized kernels such as oneDNN for Linear/Conv operations and SDPA. The Inductor CPU backend consistently achieves performance speedups across three benchmark suites—TorchBench, Hugging Face, and timms—outperforming eager mode in 97.5% of the 193 models tested.
PROTOTYPE FEATURES
[Prototype] FlexAttention
We've introduced a flexible API that enables implementing various attention mechanisms such as Sliding Window, Causal Mask, and PrefixLM with just a few lines of idiomatic PyTorch code. This API leverages torch.compile to generate a fused FlashAttention kernel, which eliminates extra memory allocation and achieves performance comparable to handwritten implementations. Additionally, we automatically generate the backwards pass using PyTorch's autograd machinery. Furthermore, our API can take advantage of sparsity in the attention mask, resulting in significant improvements over standard attention implementations.
For more information and examples, please refer to the official blog post and Attention Gym.
[Prototype] Compiled Autograd
Compiled Autograd is an extension to the PT2 stack allowing the capture of the entire backward pass. Unlike the backward graph traced by AOT dispatcher, Compiled Autograd tracing is deferred until backward execution time, which makes it impervious to forward pass graph breaks, and allows it to record backward hooks into the graph.
Please refer to the tutorial for more information.
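A minimal sketch following the prototype flag used in the tutorial (the flag name and exact usage may change as the feature matures):

import torch

torch._dynamo.config.compiled_autograd = True  # prototype switch, per the tutorial

model = torch.nn.Linear(8, 8)

@torch.compile
def train_step(x):
    loss = model(x).sum()
    loss.backward()  # the backward pass is captured by compiled autograd
    return loss

train_step(torch.randn(4, 8))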
[Prototype] Flight Recorder
Flight recorder is a new debugging tool that helps debug stuck jobs. The tool works by continuously capturing information about collectives as they run. Upon detecting a stuck job, the information can be used to quickly identify misbehaving ranks/machines along with code stack traces.
For more information please refer to the following tutorial.
[Prototype] Max-autotune Support on CPU with GEMM Template
Max-autotune mode for the Inductor CPU backend in torch.compile profiles multiple implementations of operations at compile time and selects the best-performing one. This is particularly beneficial for GEMM-related operations, using a C++ template-based GEMM implementation as an alternative to the ATen-based approach with oneDNN and MKL libraries. We support FP32, BF16, FP16, and INT8 with epilogue fusions for x86 CPUs. We've seen up to 7% geomean speedup on the dynamo benchmark suites and up to 20% boost in next-token latency for LLM inference.
For more information please refer to the tutorial.
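Enabling it is just a compile mode; a minimal sketch:

import torch

model = torch.nn.Linear(1024, 1024).eval()
x = torch.randn(32, 1024)

compiled = torch.compile(model, mode="max-autotune")
with torch.no_grad():
    compiled(x)  # candidate GEMM implementations are profiled at compile time; the fastest is picked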
[Prototype] TorchInductor CPU on Windows
Inductor CPU backend in torch.compile now works on Windows. We support MSVC (cl), clang (clang-cl) and Intel compiler (icx-cl) for Windows inductor currently.
See the tutorial for more details.
[Prototype] FP16 support on CPU path for both eager mode and TorchInductor CPP backend
Float16 is a commonly used reduced floating point type for performance improvement in neural network inference/training. Since this release, float16 for both eager and TorchInductor is supported on the CPU path.
[Prototype] Autoload Device Extension
PyTorch now supports autoloading for out-of-tree device extensions, streamlining integration by eliminating the need for manual imports. This feature, enabled through the torch.backends entrypoint, simplifies usage by ensuring seamless extension loading, while allowing users to disable it via an environment variable if needed.
See the tutorial for more information.
[Prototype] Enhanced Intel GPU support
Intel GPU support enhancements are now available for both Intel® Data Center GPU Max Series and Intel® Client GPUs (Intel® Core™ Ultra processors with built-in Intel® Arc™ graphics and Intel® Arc™ Graphics for dGPU parts), making it easier to accelerate your Machine Learning workflows on Intel GPUs in the PyTorch 2.5 release. We also enabled the initial support of PyTorch on Windows for Intel® Client GPUs in this release.
These features are available through PyTorch preview and nightly binary PIP wheels. For more information regarding Intel GPU support, please refer to documentation.
Backwards Incompatible changes
Distributed
[c10d] Remove Option for ProcessGroup and Expose backend Options to reflect the correct code structure (#132931)
# Users can pass in a basic option when creating an instance of ProcessGroup
base_pg_options = ProcessGroup.Options(backend=str(backend))
base_pg_options._timeout = timeout
pg: ProcessGroup = ProcessGroup(
    store, rank, group_size, base_pg_options
)

# Users then need to create a backend option to create the comm backend (e.g., ProcessGroupNCCL)
pg_options = ProcessGroupNCCL.Options()
backend = ProcessGroupNCCL(
    store, rank, group_size, pg_options
)
# No basic option is passed in when creating an instance of ProcessGroup
pg: ProcessGroup = ProcessGroup(store, rank, group_size)
pg._set_default_backend(...
This release is meant to fix the following issues (regressions / silent correctness):
Breaking Changes:
Release tracker #132400 contains all relevant pull requests related to this release as well as links to related issues.
PyTorch 2.4: Python 3.12, AOTInductor freezing, libuv backend for TCPStore
PyTorch 2.4 Release Notes
We are excited to announce the release of PyTorch® 2.4!
PyTorch 2.4 adds support for the latest version of Python (3.12) for torch.compile. AOTInductor freezing gives developers running AOTInductor more performance-based optimizations by allowing the serialization of MKLDNN weights. As well, a new default TCPStore server backend utilizing libuv has been introduced, which should significantly reduce initialization times for users running large-scale jobs. Finally, a new Python Custom Operator API makes it easier than before to integrate custom kernels into PyTorch, especially for torch.compile.
This release is composed of 3661 commits and 475 contributors since PyTorch 2.3. We want to sincerely thank our
dedicated community for your contributions. As always, we encourage you to try these out and report any issues as we
improve 2.4. More information about how to get started with the PyTorch 2-series can be found at our
Getting Started page.
*To see a full list of public feature submissions click here.
Tracked Regressions
Subproc exception with torch.compile and onnxruntime-training
There is a reported issue (#131070) when using torch.compile if the onnxruntime-training lib is installed. The issue will be fixed (#131194) in v2.4.1. It can be solved locally by setting the environment variable TORCHINDUCTOR_WORKER_START=fork before executing the script.
It was also reported (#130684) that the new version of triton uses cuda features that are not compatible with pre-cuda12 drivers.
In this case, the workaround is to set TRITON_PTXAS_PATH manually as follows (adapt the code according to the local installation path):
TRITON_PTXAS_PATH=/usr/local/lib/python3.10/site-packages/torch/bin/ptxas python script.py

Backwards Incompatible Change
Python frontend
Default ThreadPool size to number of physical cores (#125963)
Changed the default number of threads used for intra-op parallelism from the number of logical cores to the number of
physical cores. This should reduce core oversubscribing when running CPU workload and improve performance.
Previous behavior can be recovered by using torch.set_num_threads to set the number of threads to the desired value.
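For example, a short sketch of restoring the old behavior of using all logical cores:

import os
import torch

# 2.4 defaults intra-op threads to the physical core count; opt back into logical cores if desired.
torch.set_num_threads(os.cpu_count())
print(torch.get_num_threads())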
torch.quasirandom.SobolEngine.draw default dtype handling (#126781)
The default dtype value has been changed from torch.float32 to the current default dtype as given by torch.get_default_dtype() to be consistent with other APIs.
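A short sketch; pass dtype explicitly if you depended on always getting float32:

import torch

torch.set_default_dtype(torch.float64)
eng = torch.quasirandom.SobolEngine(dimension=2)
print(eng.draw(4).dtype)                       # 2.4+: follows torch.get_default_dtype() -> float64
print(eng.draw(4, dtype=torch.float32).dtype)  # pin the old behavior explicitly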
It is no longer possible to subclass torch._C._TensorBase directly (#125558)
This is an internal subclass that a user used to be able to use to create an object that is almost a Tensor in Python, and it was advertised as such in some tutorials. This is not allowed anymore to improve consistency, and all users should subclass torch.Tensor directly.
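The fix is to subclass torch.Tensor instead; a minimal sketch:

import torch

# No longer allowed in 2.4:
#   class MyTensor(torch._C._TensorBase): ...
class MyTensor(torch.Tensor):
    pass

t = torch.ones(2).as_subclass(MyTensor)
print(type(t))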
Non-compositional usage of as_strided followed by mutation under torch.compile will raise an error (#122502)
The torch.compile
flow involves functionalizing any mutations inside the region being compiled. Torch.as_strided is
an existing view op that can be used non-compositionally: meaning when you call x.as_strided(...), as_strided will only
consider the underlying storage size of x, and ignore its current size/stride/storage_offset when creating a new view.
This makes it difficult to safely functionalize mutations on views of as_strided that are created non-compositionally,
so we ban them rather than risking silent correctness issues under torch.compile.
An example of a non-compositional usage of as_strided followed by mutation that we will error on is below. You can avoid
this issue by re-writing your usage of as_strided so that it is compositional (for example: either use a different set
of view ops instead of as_strided, or call as_strided directly on the base tensor instead of an existing view of it).
@torch.compile
def foo(a):
    e = a.diagonal()
    # as_strided is being called on an existing view (e),
    # making it non-compositional. mutations to f under torch.compile
    # are not allowed, as we cannot easily functionalize them safely
    f = e.as_strided((2,), (1,), 0)
    f.add_(1.0)
    return a

We now verify schemas of custom ops at registration time (#124520)
Previously, you could register a custom op through the operator registration APIs, but give it a schema that contained
types unknown to the PyTorch Dispatcher. This behavior came from TorchScript, where “unknown” types were implicitly
treated by the TorchScript interpreter as type variables. However, calling such a custom op through regular pytorch
would result in an error later. As of 2.4, we will raise an error at registration time, when you first register the
custom operator. You can get the old behavior by constructing the schema with allow_typevars=true.
TORCH_LIBRARY(my_ns, m) {
// this now raises an error at registration time: bar/baz are unknown types
m.def("my_ns::foo(bar t) -> baz");
// you can get back the old behavior with the below flag
m.def(torch::schema("my_ns::foo(bar t) -> baz", /*allow_typevars*/ true));
}
Autograd frontend
Delete torch.autograd.function.traceable APIs (#122817)
The torch.autograd.function.traceable(...) API, which sets the is_traceable class attribute
on a torch.autograd.Function class was deprecated in 2.3 and is now being deleted.
This API does not do anything and was only meant for internal purposes.
The following raised a warning in 2.3, and now errors because the API has been deleted:

@torch.autograd.function.traceable
class Func(torch.autograd.Function):
    ...

Release engineering
SparseAdam weird allowance of raw Tensor input (#127081).
Update get_group and add get_all_groups (#128097)
In 2.3 and before, users can do:
mesh_2d = init_device_mesh(
    "cuda", (2, 2), mesh_dim_names=("dp", "tp")
)
mesh_2d.get_group()  # This will return all sub-pgs within the mesh
assert mesh_2d.get_group()[0] == mesh_2d.get_group(0)
assert mesh_2d.get_group()[1] == mesh_2d.get_group(1)
But from 2.4 forward, if users call get_group without passing in the dim, users will get a RuntimeError. Instead, they should use get_all_groups:

mesh_2d = init_device_mesh(
    "cuda", (2, 2), mesh_dim_names=("dp", "tp")
)
mesh_2d.get_group()  # This will throw a RuntimeError
assert mesh_2d.get_all_groups()[0] == mesh_2d.get_group(0)
assert mesh_2d.get_all_groups()[1] == mesh_2d.get_group(1)

Pipelining
Retire torch.distributed.pipeline (#127354)
In 2.3 and before, users can do:
import torch.distributed.pipeline # warning saying that this will be removed and users need to migrate to torch.distributed.pipelining
But from 2.4 forward, if users write the code above, they will get a ModuleNotFoundError. Instead, they should use torch.distributed.pipelining:

import torch.distributed.pipeline  # -> ModuleNotFoundError
import torch.distributed.pipelining

jit
Complete revamp of float/promotion sympy handling (#126905)
ONNX
torch.load with default weights_only=False value (#129239, #129396, #129509).
torch.compile to provide a nice error

This release is meant to fix the following issues (regressions / silent correctness):
Torch.compile: torch.__dynamo (#124634)
Plan failed with a cudnnException warning (#125790)
format_utils executable, which was causing it to run as a no-op (#123407)
device_mesh in 2.3.0 during initialization causing memory spikes (#124780)
FSDP + DTensor with ShardingStrategy.SHARD_GRAD_OP (#123617)
use_orig_params = False and activation checkpointing (#124698) (#126935)
set_model_state_dict errors on compiled module with non-persistent buffer with distributed checkpointing (#125336) (#125337)
Tensor.abs() for complex (#125662)
.pyi files (#124932)
import torch failure when wheel is installed for a single user on Windows (#125684)
tensor.dtype.to_complex() after ~100 calls in ipython kernel (#125154)
Release tracker #125425 contains all relevant pull requests related to this release as well as links to related issues.
PyTorch 2.3: User-Defined Triton Kernels in torch.compile, Tensor Parallelism in Distributed
PyTorch 2.3 Release notes
We are excited to announce the release of PyTorch® 2.3! PyTorch 2.3 offers support for user-defined Triton kernels in torch.compile, allowing for users to migrate their own Triton kernels from eager without experiencing performance complications or graph breaks. As well, Tensor Parallelism improves the experience for training Large Language Models using native PyTorch functions, which has been validated on training runs for 100B parameter models.
This release is composed of 3393 commits and 426 contributors since PyTorch 2.2. We want to sincerely thank our dedicated community for your contributions. As always, we encourage you to try these out and report any issues as we improve 2.3. More information about how to get started with the PyTorch 2-series can be found at our Getting Started page.
Beta: User-defined Triton kernels in torch.compile; Tensor parallelism within PyTorch Distributed; Support for semi-structured sparsity
Prototype: torch.export adds new API to specify dynamic_shapes; Asynchronous checkpoint generation
Performance Improvements: Weight-Only-Quantization introduced into Inductor CPU backend
*To see a full list of public feature submissions click here.
Tracked Regressions
torch.compile on MacOS is considered unstable for 2.3 as there are known cases where it will hang (#124497)
torch.compile imports many unrelated packages when it is invoked (#123954)
This can cause significant first-time slowdown and instability when these packages are not fully compatible with PyTorch within a single process.
torch.compile is not supported on Python 3.12 (#120233)
PyTorch support for Python 3.12 in general is considered experimental. Please use Python version between 3.8 and 3.11 instead. This is an existing issue since PyTorch 2.2.
Backwards Incompatible Changes
Change default torch_function behavior to be disabled when torch_dispatch is defined (#120632)
Defining a subclass with a torch_dispatch entry will now automatically set torch_function to be disabled. This aligns better with all the use cases we've observed for subclasses. The main change of behavior is that the result of the torch_dispatch handler will not go through the default torch_function handler anymore, wrapping it into the current subclass. This allows in particular for your subclass to return a plain Tensor or another subclass from any op.
The original behavior can be recovered by adding the following to your Tensor subclass:
@classmethod
def __torch_function__(cls, func, types, args=(), kwargs=None):
    return super().__torch_function__(func, types, args, kwargs)

ProcessGroupNCCL removes multi-device-per-thread support from C++ level (#119099, #118674)
Remove no_dist and coordinator_rank from public DCP API's (#121317)
As part of an overall effort to simplify our public facing API's for Distributed Checkpointing, we've decided to deprecate usage of the coordinator_rank and no_dist parameters under torch.distributed.checkpoint. In our opinion, these parameters can lead to confusion around the intended effect during API usage, and have limited value to begin with. One concrete example is here, #118337, where there is ambiguity in which Process Group is referenced by the coordinator rank (additional context: #118337). In the case of the no_dist parameter, we consider this an implementation detail which should be hidden from the user. Starting in this release, no_dist is inferred from the initialized state of the process group, assuming the intention is to use collectives if a process group is initialized, and assuming the opposite in the case it is not.
# Version 2.2.2
import torch.distributed.checkpoint as dcp

dcp.save(
    state_dict={"model": model.state_dict()},
    checkpoint_id="path_to_model_checkpoint",
    no_dist=True,
    coordinator_rank=0,
)
# ...
dcp.load(
    state_dict={"model": model.state_dict()},
    checkpoint_id="path_to_model_checkpoint",
    no_dist=True,
    coordinator_rank=0,
)
# Version 2.2.3
# no dist is assumed from pg state, and rank 0 is always coordinator.
import torch.distributed.checkpoint as dcp

dcp.save(
    state_dict={"model": model.state_dict()},
    checkpoint_id="path_to_model_checkpoint",
)
# ...
dcp.load(
    state_dict={"model": model.state_dict()},
    checkpoint_id="path_to_model_checkpoint",
)

Remove deprecated tp_mesh_dim arg (#121432)
Starting from PyTorch 2.3, the parallelize_module API only accepts a DeviceMesh (the tp_mesh_dim argument has been removed). If you have an N-D DeviceMesh for multi-dimensional parallelism, you can use mesh_nd["tp"] to obtain a 1-D DeviceMesh for tensor parallelism, as in the sketch below.
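A hedged sketch of the 2.3-style call (it needs a distributed launch, e.g. torchrun across 4 GPUs, to actually run; the module and the parallelize plan are placeholders):

import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import ColwiseParallel, parallelize_module

model = torch.nn.Sequential(torch.nn.Linear(16, 16))
mesh_2d = init_device_mesh("cuda", (2, 2), mesh_dim_names=("dp", "tp"))

# Pass the 1-D "tp" submesh directly; the removed tp_mesh_dim argument is no longer needed.
parallelize_module(model, mesh_2d["tp"], {"0": ColwiseParallel()})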
Previously, the PT2 Export Quantization flow did not generate quantized weights by default, but instead kept fp32 weights in the quantized model in this pattern: fp32 weight -> q -> dq -> linear. fold_quantize=True is now the default for convert_pt2e, so the quantized model contains quantized weights by default in this pattern: int8 weight -> dq -> linear, and users will see a reduction in the model size.
folded_model = convert_pt2e(model, fold_quantize=True)
non_folded_model = convert_pt2e(model)
folded_model = convert_pt2e(model)
non_folded_model = convert_pt2e(model, fold_quantize=False)

Remove deprecated torch.jit.quantized APIs (#118406)
All functions and classes under torch.jit.quantized will now raise an error if called/instantiated. This API has long been deprecated in favor of torch.ao.nn.quantized.
# torch.jit.quantized APIs
torch.jit.quantized.quantize_rnn_cell_modules
torch.jit.quantized.quantize_rnn_modules
torch.jit.quantized.quantize_linear_modules
torch.jit.quantized.QuantizedLinear
torch.jit.QuantizedLinearFP16
torch.jit.quantized.QuantizedGRU
torch.jit.quantized.QuantizedGRUCell
torch.jit.quantized.QuantizedLSTM
torch.jit.quantized.QuantizedLSTMCell
# Corresponding torch.ao.quantization APIs
torch.ao.nn.quantized.dynamic.RNNCell
torch.ao.quantization.quantize_dynamic APIs
torch.ao.nn.quantized.dynamic.Linear
torch.ao.nn.quantized.dynamic.GRU
torch.ao.nn.quantized.dynamic.GRUCell
torch.ao.nn.quantized.dynamic.LSTM
...