User-Defined Triton Kernels in torch.compile, Tensor Parallelism in Distributed

PyTorch 2.3 Release notes

Highlights

We are excited to announce the release of PyTorch® 2.3! PyTorch 2.3 offers support for user-defined Triton kernels in torch.compile, allowing users to migrate their own Triton kernels from eager mode without experiencing performance regressions or graph breaks. In addition, Tensor Parallelism improves the experience of training Large Language Models using native PyTorch functions, and has been validated on training runs for 100B-parameter models.
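As a quick illustration of the user-defined Triton kernel support, here is a minimal sketch (the kernel and function names are ours, not from the release notes; it assumes Triton is installed and a CUDA device is available): a @triton.jit kernel invoked directly from a torch.compile'd function without a graph break.

import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one contiguous block of elements.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

@torch.compile(fullgraph=True)  # the kernel call no longer forces a graph break
def compiled_add(x, y):
    out = torch.empty_like(x)
    n_elements = out.numel()
    grid = (triton.cdiv(n_elements, 1024),)
    add_kernel[grid](x, y, out, n_elements, BLOCK_SIZE=1024)
    return out

x = torch.randn(4096, device="cuda")
assert torch.allclose(compiled_add(x, x), x + x)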

This release is composed of 3393 commits and 426 contributors since PyTorch 2.2. We want to sincerely thank our dedicated community for your contributions. As always, we encourage you to try these out and report any issues as we improve 2.3. More information about how to get started with the PyTorch 2-series can be found at our Getting Started page.

Stable: (none)
Beta: User-defined Triton kernels in torch.compile; Tensor parallelism within PyTorch Distributed; Support for semi-structured sparsity
Prototype: torch.export adds new API to specify dynamic_shapes; Asynchronous checkpoint generation
Performance Improvements: Weight-Only-Quantization introduced into Inductor CPU backend

*To see a full list of public feature submissions click here.

Tracked Regressions

torch.compile on macOS is considered unstable for 2.3, as there are known cases where it will hang (#124497)

torch.compile imports many unrelated packages when it is invoked (#123954)

This can cause significant first-time slowdown and instability when these packages are not fully compatible with PyTorch within a single process.

torch.compile is not supported on Python 3.12 (#120233)

PyTorch support for Python 3.12 in general is considered experimental. Please use a Python version between 3.8 and 3.11 instead. This is an existing issue since PyTorch 2.2.

Backwards Incompatible Changes

Change default __torch_function__ behavior to be disabled when __torch_dispatch__ is defined (#120632)

Defining a subclass with a __torch_dispatch__ entry will now automatically set __torch_function__ to be disabled. This aligns better with all the use cases we've observed for subclasses. The main behavior change is that the result of the __torch_dispatch__ handler will no longer go through the default __torch_function__ handler, which wrapped it into the current subclass. In particular, this allows your subclass to return a plain Tensor or another subclass from any op.

The original behavior can be recovered by adding the following to your Tensor subclass:

@classmethod
def __torch_function__(cls, func, types, args=(), kwargs=None):
    return super().__torch_function__(func, types, args, kwargs)
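For illustration, here is a hedged sketch of the new default behavior (the subclass name and unwrapping helper are ours): a subclass that defines __torch_dispatch__ and whose ops therefore return plain Tensors instead of being re-wrapped by a default __torch_function__ handler.

import torch
from torch.utils._pytree import tree_map

class PlainResultTensor(torch.Tensor):
    @classmethod
    def __torch_dispatch__(cls, func, types, args=(), kwargs=None):
        # Unwrap subclass arguments to plain Tensors, run the op, and return
        # the result as-is; under the 2.3 default it is not re-wrapped.
        unwrap = lambda t: t.as_subclass(torch.Tensor) if isinstance(t, PlainResultTensor) else t
        return func(*tree_map(unwrap, args), **tree_map(unwrap, kwargs or {}))

x = torch.randn(2).as_subclass(PlainResultTensor)
print(type(x + 1))  # <class 'torch.Tensor'> under the 2.3 default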
ProcessGroupNCCL removes multi-device-per-thread support from the C++ level (#119099, #118674)

Removes no_dist and coordinator_rank from public DCP APIs (#121317)

As part of an overall effort to simplify our public-facing APIs for Distributed Checkpointing, we've decided to deprecate the coordinator_rank and no_dist parameters under torch.distributed.checkpoint. In our opinion, these parameters can lead to confusion about the intended effect during API usage and have limited value to begin with. One concrete example is #118337, where there is ambiguity about which process group is referenced by the coordinator rank. In the case of the no_dist parameter, we consider this an implementation detail that should be hidden from the user. Starting in this release, no_dist is inferred from the initialized state of the process group: collectives are used if a process group is initialized, and skipped if it is not.

2.2:

# Version 2.2.2
import torch.distributed.checkpoint as dcp

dcp.save(
    state_dict={"model": model.state_dict()},
    checkpoint_id="path_to_model_checkpoint",
    no_dist=True,
    coordinator_rank=0,
)
# ...
dcp.load(
    state_dict={"model": model.state_dict()},
    checkpoint_id="path_to_model_checkpoint",
    no_dist=True,
    coordinator_rank=0,
)

2.3:

# Version 2.3.0
# no_dist is inferred from the process group state, and rank 0 is always the coordinator.
import torch.distributed.checkpoint as dcp

dcp.save(
    state_dict={"model": model.state_dict()},
    checkpoint_id="path_to_model_checkpoint",
)
# ...
dcp.load(
    state_dict={"model": model.state_dict()},
    checkpoint_id="path_to_model_checkpoint",
)
Remove deprecated tp_mesh_dim arg (#121432)

Starting from PyTorch 2.3, the parallelize_module API only accepts a DeviceMesh (the tp_mesh_dim argument has been removed). If you have an N-D DeviceMesh for multi-dimensional parallelism, you can use mesh_nd["tp"] to obtain a 1-D DeviceMesh for tensor parallelism.
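A hedged sketch of the new calling convention (the mesh shape, model, and parallelization plan here are illustrative; it assumes torch.distributed is initialized with 8 CUDA ranks):

import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import (
    parallelize_module, ColwiseParallel, RowwiseParallel,
)

# 2-D mesh: 2-way data parallel x 4-way tensor parallel.
mesh_2d = init_device_mesh("cuda", (2, 4), mesh_dim_names=("dp", "tp"))

model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))

# Pass the 1-D "tp" sub-mesh directly; there is no tp_mesh_dim argument anymore.
model = parallelize_module(
    model,
    mesh_2d["tp"],
    {"0": ColwiseParallel(), "2": RowwiseParallel()},
)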

torch.export

Enable fold_quantize by default in PT2 Export Quantization (#118701, #118605, #119425, #117797)

Previously, the PT2 Export Quantization flow did not generate quantized weights by default; the quantized model instead kept fp32 weights in the pattern fp32 weight -> q -> dq -> linear. With fold_quantize=True now the default, convert_pt2e produces a graph with quantized weights in the pattern int8 weight -> dq -> linear, and users will see a reduction in model size.

2.2:

folded_model = convert_pt2e(model, fold_quantize=True)
non_folded_model = convert_pt2e(model)

2.3:

folded_model = convert_pt2e(model)
non_folded_model = convert_pt2e(model, fold_quantize=False)
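For context, here is a hedged sketch of where convert_pt2e sits in the PT2 Export Quantization flow (the model and the choice of XNNPACKQuantizer are illustrative, not prescribed by the release notes):

import torch
from torch._export import capture_pre_autograd_graph
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
from torch.ao.quantization.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer, get_symmetric_quantization_config,
)

model = torch.nn.Sequential(torch.nn.Linear(16, 16)).eval()
example_inputs = (torch.randn(1, 16),)

exported = capture_pre_autograd_graph(model, example_inputs)
quantizer = XNNPACKQuantizer().set_global(get_symmetric_quantization_config())
prepared = prepare_pt2e(exported, quantizer)
prepared(*example_inputs)            # calibration
quantized = convert_pt2e(prepared)   # weights are quantized (folded) by default in 2.3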
Remove deprecated torch.jit.quantized APIs (#118406)

All functions and classes under torch.jit.quantized will now raise an error if called/instantiated. This API has long been deprecated in favor of torch.ao.nn.quantized.

2.2 (removed torch.jit.quantized APIs):

torch.jit.quantized.quantize_rnn_cell_modules
torch.jit.quantized.quantize_rnn_modules
torch.jit.quantized.quantize_linear_modules
torch.jit.quantized.QuantizedLinear
torch.jit.QuantizedLinearFP16
torch.jit.quantized.QuantizedGRU
torch.jit.quantized.QuantizedGRUCell
torch.jit.quantized.QuantizedLSTM
torch.jit.quantized.QuantizedLSTMCell

2.3 (corresponding torch.ao.quantization APIs):

torch.ao.nn.quantized.dynamic.RNNCell
torch.ao.quantization.quantize_dynamic APIs
torch.ao.nn.quantized.dynamic.Linear
torch.ao.nn.quantized.dynamic.GRU
torch.ao.nn.quantized.dynamic.GRUCell
torch.ao.nn.quantized.dynamic.LSTM
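As a hedged migration sketch (the model and layer sizes are illustrative), dynamic quantization of RNN and Linear modules now goes through torch.ao.quantization.quantize_dynamic, which swaps in the torch.ao.nn.quantized.dynamic modules listed above:

import torch
import torch.nn as nn
from torch.ao.quantization import quantize_dynamic

class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(32, 32)
        self.fc = nn.Linear(32, 10)

    def forward(self, x):
        out, _ = self.lstm(x)
        return self.fc(out)

model = TinyModel().eval()
# Replaces e.g. torch.jit.quantized.quantize_rnn_modules / quantize_linear_modules.
qmodel = quantize_dynamic(model, {nn.LSTM, nn.Linear}, dtype=torch.qint8)
# LSTM and Linear are swapped for their torch.ao.nn.quantized.dynamic counterparts.
print(type(qmodel.lstm), type(qmodel.fc))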
Remove deprecated fbgemm operators (#112153)

TorchScript models that were exported with the deprecated torch.jit.quantized API will no longer be loadable, as the required internal operators have been removed. Please re-export your models using the newer torch.ao.quantization API instead.

Other Deprecations

torch.autograd.Function: using the torch.autograd.function.traceable decorator and getting/setting torch.autograd.Function's is_traceable is now deprecated (#121413)

These decorators were previously marked for internal use only. They will be removed in version 2.4.

torch.utils.checkpoint: not passing use_reentrant explicitly to activation checkpoint and checkpoint_sequential is deprecated (#116710)

torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. In version 2.4 we will raise an exception if use_reentrant is not passed. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to the docs for more details on the differences between the two variants.

(Note that this was already deprecated in a previous release. In this version, we improve the deprecation message.)
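A minimal sketch of passing the flag explicitly (the checkpointed function and tensor shapes are illustrative):

import torch
from torch.utils.checkpoint import checkpoint

def block(x):
    return torch.relu(x @ x.t())

x = torch.randn(8, 8, requires_grad=True)
# Pass use_reentrant explicitly; use_reentrant=False is the recommended variant.
out = checkpoint(block, x, use_reentrant=False)
out.sum().backward()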

Deprecated torch.backends.cuda.sdp_kernel and replaced it with torch.nn.attention.sdpa_kernel (#114689)

This PR deprecates torch.backends.cuda.sdp_kernel; users should now use torch.nn.attention.sdpa_kernel instead. The old code will raise the following warning: FutureWarning: torch.backends.cuda.sdp_kernel() is deprecated. In the future, this context manager will be removed. Please see torch.nn.attention.sdpa_kernel() for the new context manager, with updated signature.

2.2:

import torch
from torch.backends.cuda import sdp_kernel

with sdp_kernel(enable_math=False, enable_flash=False, enable_mem_efficient=True):
    torch.nn.functional.scaled_dot_product_attention(...)

2.3:

import torch
from torch.nn.attention import sdpa_kernel, SDPBackend

with sdpa_kernel(backends=[SDPBackend.EFFICIENT_ATTENTION]):
    torch.nn.functional.scaled_dot_product_attention(...)
The remainder of the release notes is organized into the following sections, each broken down by area:

Other Deprecations (continued): Distributed API, Releng

New Features: Autograd API, CUDA, Distributed API, FX, torch.compile (Dynamo, Inductor), torch.export, Linalg, MPS, torch.nn API, Profiler, Python API, Sparse, Vulkan, XPU, Other

Improvements: Autograd API, Composability (Dynamic shapes), CPP API, CUDA, Distributed API, torch.compile (Dynamo, Inductor), torch.export, FX, JIT, NestedTensors, Linalg, MPS, torch.nn API, ONNX, Optimizer, Profiler, Python API, Quantization (PT2 Export Quantization Flow, XNNPACKQuantizer, X86 CPU Inductor Backend, DTypes, Others), Releng, ROCm, Other

Bug Fixes: Autograd API, Composability, CPP API, Distributed API, torch.compile (Dynamo, Inductor), torch.export, FX, JIT, Linalg, MPS, torch.nn API, Nested Tensors, ONNX, Optimizer, Profiler, Python API, Quantization, Releng, Sparse, Other

Performance: Composability, CUDA, Inductor, MPS, Optimizer, Profiler, Python API, ROCm, Other

Documentation: Autograd API, CUDA, Distributed API, FX, torch.compile, Inductor, torch.export, Linalg, torch.nn API, Optimizer, Other
