OneFlow v1.0.0 has been released. We welcome you to install the new version for a better experience.
This version update includes 447 commits and the following highlights:
Released a new interface, compile_from_torch. While sharing parameter memory, this interface converts a PyTorch Module instance into a OneFlow Module instance. It supports direct Eager execution or conversion into a static graph nn.Graph, which can be further accelerated with MLIR compilation. The interface is evolving rapidly and currently supports dynamic-shape compilation, validated on typical models such as ResNet50, Faster RCNN, and Stable Diffusion.
Made a series of optimizations and refactorings to the Eager execution runtime, including unification of system memory pools, integration with CUDA native interfaces, optimization of the instruction scheduling mechanism, introduction of an instruction fusion mechanism, faster Autograd graph construction, optimization of the op inference process, and decoupling of Instruction and Stream.
The static graph distributed physical execution plan supports separate compilation functionality, allowing each process to independently compile its required execution plan, eliminating linear growth of compilation time with GPU scale.
Added a series of functional automatic differentiation interfaces, including jvp, vjp, hvp, vhp, jacobian, and hessian.
Added the Insight module, which visualizes kernel invocations, execution time, speed, and other related information within instrumented profiling intervals.
Updated LiBai (the open-source toolbox for large-scale model training) with native support for fine-tuning and distributed inference of Llama2 and ChatGLM2, covering full fine-tuning, adapter fine-tuning, and LoRA fine-tuning; lm-eval-harness can be used for language model evaluation and validation.
Upgraded OneFlow Serving, adding support for the OneFlow Python backend and the OneFlow Lite backend in addition to the existing OneFlow Cpp backend.
1. The compile_from_torch Interface
While sharing parameter memory, the compile_from_torch interface converts a PyTorch Module instance into a OneFlow Module instance. It supports direct Eager execution or conversion into a static graph nn.Graph, which can be further accelerated with MLIR compilation. (#10404, #10408, #9984, #9754)
Interface Signature and Parameter Introduction:
compile_from_torch(torch_module: torch.nn.Module, *, use_graph=True, options={})
* torch_module: The Torch Module instance to be converted.
* use_graph: Indicates whether to transform into a static graph nn.Graph and utilize MLIR compilation acceleration. The default is True.
* options:
* size: When using a static graph nn.Graph, a hash of the graph corresponding to the input shape is computed and the graph is cached. size indicates the maximum capacity of the static graph cache; when the capacity is exceeded, cached graphs are evicted using an LRU strategy. The default value is 9.
* dynamic: The first dynamic-shape input triggers a full graph compilation. For subsequent inputs with different shapes, if dynamic is True, a shared graph is used to accelerate compilation; if dynamic is False, each new shape is compiled from scratch. The default is True.
* debug: Debug mode and log level settings. -1 disables debug mode, 0 outputs warnings and static graph construction information, 1 additionally outputs graph construction information for each sub-module, 2 additionally outputs progress for each operator, and 3 provides more detailed operator information. The default value is -1.
Example of Usage:
import torch
from torchvision import models
import oneflow
from oneflow.framework.infer_compiler import compile_from_torch
DEVICE = torch.device("cuda")
WEIGHT = models.ResNet50_Weights.DEFAULT
model = models.resnet50(weights=WEIGHT).to(DEVICE)
compile_model = compile_from_torch(model, options={"dynamic": True})
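For illustration, the compiled module can then be called in place of the original model. The following minimal sketch assumes the returned module accepts ordinary torch tensors and uses an ImageNet-sized input; both are illustrative assumptions, not part of the release note:
x = torch.randn(1, 3, 224, 224, device=DEVICE)  # illustrative input shape (assumption)
with torch.no_grad():
    y = compile_model(x)  # with use_graph=True, the first call triggers graph compilation
print(y.shape)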
2. Separated Compilation
The static graph distributed physical execution plan supports separate compilation, allowing each process to independently compile its required execution plan, thereby preventing linear growth of compilation time with GPU scale. The separate compilation feature supports 3D hybrid parallel (data parallelism + model parallelism + pipeline parallelism) scenarios and can be used together with LiBai (the open-source large-scale model training toolbox). To enable this feature, use the command: export ONEFLOW_ENABLE_LAZY_SEPARATE_COMPILE=1. (#9920, #10140, #10141, #10124, #10102)
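As a sketch, the flag can also be set from Python instead of the shell; this assumes the variable is read when the execution plan is compiled, so it must be set before the nn.Graph is built:
import os
os.environ["ONEFLOW_ENABLE_LAZY_SEPARATE_COMPILE"] = "1"  # set before nn.Graph compilation (assumption)
import oneflow as flow
# build and compile the static graph as usual; each rank then compiles only its own plan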
Below are the test results on a 128-card A100-PCIE-40GB device with LiBai on the GPT2 model:
| Parallelism | Separated Compilation Enabled | Execution Plan Compilation Time |
| --- | --- | --- |
| Data Parallelism (DP128 MP1 PP1) | No | Over 20 minutes |
| Data Parallelism (DP128 MP1 PP1) | Yes | 108.21 s |
| 3D Parallelism (DP4 MP4 PP8) | No | 445.16 s |
| 3D Parallelism (DP4 MP4 PP8) | Yes | 82.88 s |

3. Functional Automatic Differentiation Interfaces
A series of functional automatic differentiation-related interfaces have been introduced, including jvp, vjp, hvp, vhp, jacobian, and hessian. (#10412, #10428)
Example of Usage:
import oneflow as flow

# jacobian example
def exp_reducer(x):
    return x.exp().sum(dim=1)

input = flow.rand(2, 2)
jac_rslt = flow.autograd.functional.jacobian(exp_reducer, input)

# vhp example
def pow_reducer(x):
    return x.pow(3).sum()

input = flow.rand(2, 2)
v = flow.ones(2, 2)
vhp_rslt = flow.autograd.functional.vhp(pow_reducer, input, v)
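The remaining interfaces follow the same pattern. Below is a minimal sketch for vjp and hvp, assuming they keep torch.autograd.functional-style signatures and return the function output together with the requested product:
import oneflow as flow

def exp_reducer(x):
    return x.exp().sum(dim=1)

def pow_reducer(x):
    return x.pow(3).sum()

inp = flow.rand(2, 2)

# vjp: vector-Jacobian product; the vector matches the function output shape
out, vjp_rslt = flow.autograd.functional.vjp(exp_reducer, inp, flow.ones(2))

# hvp: Hessian-vector product; the vector matches the input shape
out, hvp_rslt = flow.autograd.functional.hvp(pow_reducer, inp, flow.ones(2, 2))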
4. Insight Module
Introduced a new Insight module, enabling visualization of kernel invocations, execution time, speed, and other related information within instrumented profiling intervals. (#10370)
Usage: for detailed instructions and examples, please refer to https://github.com/Oneflow-Inc/oneflow/tree/master/python/oneflow/utils/insight#usage
5. LiBai Version Update
LiBai (the open-source toolbox for large-scale model training) has been upgraded to version v0.3.0. It now natively supports fine-tuning and distributed inference of the large language models Llama2 and ChatGLM2, covering full fine-tuning, adapter fine-tuning, and LoRA fine-tuning. lm-eval-harness can be used for language model evaluation and validation.
The distributed training and inference support for ChatGLM and Llama2 is as follows:
Example of Usage:
# full finetune
bash tools/train.sh projects/Llama/train_net.py projects/Llama/configs/llama_sft.py 8
# adapter finetune
bash tools/train.sh projects/Llama/adapter/train_net.py projects/Llama/adapter/adapter_sft.py 8
# inference
bash tools/infer.sh projects/Llama/pipeline.py 8
# eval
python projects/Llama/utils/eval_adapter.py
6. Other New Features
Added FFT-related operators. (#10027)
Added zeta operator. (#10189)
Added tril_ operator. (#9996)
Added clone operator. (#9800)
Added frac and frac_ operators. (#9979)
Added exp2 operator. (#9958)
Added rrelu operator. (#9736)
Added lgamma backward operator. (#10177)
Added digamma operator. (#10066)
Added trigamma operator. (#10117)
Added bitwise_not operator. (#9859)
Added squared_relu operator. (#10316)
Added skip_rms_norm operator. (#10036)
Added multi_tensor_amp_grad_scaler related operators. (#10071)
Added bitwise_and, bitwise_or, bitwise_xor operators. (#9842)
Added fused_attention_concat_past_key_value operator. (#9963)
Added fused_multi_head_attention_inference_v2 operator. (#9933)
Added fused_codegeex_qkv_reshape operator. (#9927)
Added fused_apply_rotary_emb operator. (#9914)
Added skip_layer_norm operator. (#9906)
Added groupwise_dequantize, fused_linear_with_groupwise_quantized_weight operators. (#9900)
Added fused_scale_mask_bias_softmax, fused_scale_mask_bias_softmax_grad operators. (#9867)
Added depend operator for describing dependency relationships in the computation graph. (#9807)
Added operators for handling complex data types: real, imag, conj, conj_physical. (#10034, #10281)
Added CPU support for the nms operator. (#10225)
Added support for the cast operator to convert bool to int16 data type. (#10211)
Added support for the arange operator for the fp16 data type. (#10019)
Added support for the adaptive_avg_pool operator for the fp16 data type. (#10004)
Added support for the nonzero operator for the fp16 data type. (#9826)
Added support for the exponential operator for the half data type. (#10005)
Added support for the arg_sort and top_k operators for the half data type. (#10000)
Added support for some basic operators like add, sub, mul, mm, sqrt, div for complex data types. (#10269, #10136, #10284, #10049)
Added support for basic binary operators for discontinuous memory input tensors. (#9986)
Added a virtual jit interface to support mocking of torch for user code that imports but does not actually use the interface. (#10395)
Added the mem_get_info interface to return overall and free memory information for a specified CUDA device. (#10398)
Added the tensor.new interface. (#9881)
Added the tensor.is_cpu interface. (#10172)
Added the tensor.is_view interface. (#10101)
Added the tensor.baddbmm interface. (#9918)
Added interfaces like special.erf, special.erfc, etc. (#9982)
Added the layout and frombuffer interfaces. (#10171)
Added prune-related interfaces. (#9730)
Added the utils.model_zoo interface. (#10183)
Added the get_rng_state and get_rng_state_all interfaces. (#9760)
Added the set_rng_state and set_rng_state_all interfaces (see the RNG state sketch after this list). (#10250)
Added support for the float16 data type. (#9697)
Added support for the char and short data types. (#10086)
Added support for the complex64 and complex128 data types. (#9987)
Integrated Transform Dialect into the MLIR codegen process. (#10224, #10227)
Added code generation support for the matmul operator. (#10283)
Added code generation support for the softmax operator. (#10263, #10272)
Added code generation support for the transform.oneflow.apply_patterns operator. (#10255)
Added support for byte attributes in the MLIR codegen process. (#10276)
Added extra_libs functionality to the mock_torch module, enabling flowvision to mimic torchvision's functionality. (#10223)
Added lazy parameter to the mock_torch module, allowing non-existent interfaces to return a fake object without immediate errors. (#9876)
Added skip_init functionality and introduced the meta device. (#10008)
Introduced the HostMemoryInput mechanism, allowing an operator's specific input to be defined as HostMemoryInput type for accessing data within the kernel's host function body. (#9928)
Added fusion mechanism for nccl logical operations to reduce excessive synchronization overhead in scenarios like ZERO, where too many fragmented nccl calls lead to significant training speed reduction. (#9879)
Introduced a mechanism for re-computation of tensor operations. (#9861)
Added support for backward_hook, register_full_backward_hook, and register_state_dict_pre_hook. (#9837, #9710)
Added content related to the stochastic weight averaging algorithm to the optimizers module. (#9781)
Added DelayVariableOpExecutionPass optimization pass for the computation graph. (#9745)
Added MulCastPattern operator fusion rule. (#9715)
Added the environment variable ONEFLOW_ENABLE_GLOBAL_INPUTS_WITH_INCONSISTENT_PLACEMENT to control whether to automatically place global tensors used by operators through the to_global operation on the largest rank. (#10073)
Added the environment variable ONEFLOW_EAGER_NCCL_USE_COMPUTE_STREAM to control whether nccl and regular computations in eager mode are on the same stream. The default value is false. (#10230)
Added the environment variable VLOG_REMAT to handle dynamic graph recomputation logs and interface with ComputeComplexityFn to estimate op computation time. (#10212)
Added the environment variable ENABLE_ACTOR_DEBUG_LOG to print detailed logs of actor message sending, receiving, and execution on the current rank. (#10081)
Added the environment variable ONEFLOW_RUN_GRAPH_BY_VM to control whether to use the VM to run static graph nn.Graph. (#9884)
Added the environment variable ONEFLOW_DISABLE_MOCK_TORCH to control whether to disable the mock_torch functionality. (#9805)
Added the environment variable ONEFLOW_VM_MULTI_THREAD to control the number of threads used in the VM. (#9698)
Added support for the second-order optimizer lbfgs. (#10265)
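As referenced in the RNG-related items above, here is a minimal sketch of saving and restoring the random number generator state with the new interfaces, assuming they mirror the corresponding PyTorch signatures:
import oneflow as flow

state = flow.get_rng_state()   # capture the current CPU RNG state (assumed torch-like signature)
a = flow.rand(2, 3)
flow.set_rng_state(state)      # restore the captured state
b = flow.rand(2, 3)
print(flow.allclose(a, b))     # True: both draws start from the same RNG state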
1. Eager Runtime
A series of optimizations and refactorings has been implemented for the Eager runtime, including:
Unified system memory pool to manage memory resources across all allocators on the same device. (#8591)
Integration with CUDA native interfaces to accelerate kernel launches. (#8571)
Optimization of instruction scheduling mechanism to reduce system overhead. (#8796)
Introduction of instruction fusion mechanism to accelerate instruction dispatch. (#7399)
Speed improvement in Autograd graph construction. (#8606)
Optimization of op deduction process to accelerate kernel execution. (#8672, #8619, #8662)
Consolidation of redundant concepts within the eager runtime, decoupling Instruction and Stream. (#8583, #8590, #7607)
Users can configure the Eager runtime using various environment variables:
| Environment Variable | Meaning | Default Value |
| --- | --- | --- |
| ONEFLOW_VM_COMPUTE_ON_WORKER_THREAD | Whether to perform computation on worker threads | true |
| ONEFLOW_VM_MULTI_THREAD | Whether to use multi-threaded collaboration for Eager computation | true |
| ONEFLOW_VM_ENABLE_STREAM_WAIT | Whether to use the stream_wait mechanism for dependencies between multiple streams | true |
| ONEFLOW_VM_ENABLE_SCHEDULE_YIELD | Whether to use the yield mechanism to reduce the scheduler thread's busy waiting | true |
| ONEFLOW_EAGER_ENABLE_LOCAL_INFER_CACHE | Whether to cache operator output metadata during computation | true |
| ONEFLOW_VM_WORKER_THREAD_LIMIT | Number of worker threads | 16 |
| ONEFLOW_VM_PENDING_HANDLE_WINDOW_SIZE | Maximum size for fusing vm instructions | 10 |
| ONEFLOW_VM_BLOCKING_DEBUG_INSTRUCTIONS_DISPLAY_LIMIT | Number of unprocessed instructions printed when vm execution times out | 1000 |

2. Upgrade of OneFlow Serving Features
OneFlow Serving has been upgraded to support additional backends, including the OneFlow Python backend and the OneFlow Lite backend, in addition to the existing support for the OneFlow Cpp backend.
For usage instructions, refer to: https://github.com/Oneflow-Inc/serving/blob/main/README.md
3. Other Functionality Improvements
Optimized certain code implementations to accommodate CUDA 12.x. (#10367)
Optimized the glu operator implementation to support bias-less inputs. (#9874)
Optimized pooling operator implementation to support the channels_last parameter. (#10242)
Optimized the flip operator implementation to address memory access inefficiencies when dim = -1. (#10310)
Optimized the bincount operator implementation for accelerated performance. (#10308)
Optimized the index_add operator implementation by dispatching varied logic based on index length to enhance performance for smaller indices. (#9751)
Optimized the topk operator implementation to boost performance when batch size equals 1. (#10009)
Optimized implementations of operators such as conv and arange to facilitate CUDA graph usage. (#9761)
Optimized the upsample operator implementation to include input/output size validation. (#9737)
Optimized the grouped_matmul_bias operator implementation by introducing tensor parallelism sbp derivation rules. (#9934)
Optimized the reshape operator implementation with added nd sbp derivation rules. (#9858)
Optimized error messages and completed test cases for mask_fill and in_top_k operators. (#10062)
Optimized the higher-order differentiation rules for the tanh operator to improve performance under third-order differentiation. (#10188, #10237)
Optimized conv interface implementation to support device and dtype parameters. (#10228)
Optimized conv interface implementation to automatically expand input dimensions. (#9721)
Optimized sum interface implementation to accommodate dtype parameters. (#10204)
Optimized softmax interface implementation to support dtype parameters. (#10069)
Optimized maxpool interface implementation to support 3D input tensors. (#10110)
Optimized ctc_loss interface implementation to align its parameters with the PyTorch interface. (#9887)
Optimized copy interface implementation to support scenarios where input and output have different devices and dtypes. (#9888)
Optimized grad interface implementation to support the allow_unused parameter. (#10251)
Optimized load interface implementation to provide more user-friendly error messages. (#10138)
Optimized fused_matmul_bias operator and interface implementation to support alpha and beta parameters. (#10015)
Optimized normal operator and interface implementation to align behavior with PyTorch. (#10185)
Optimized fused attention operator and interface implementation to allow None for past_key and past_value. (#9977)
Optimized fused_attention operator and interface implementation to add support for variable sequence lengths. (#9991)
Optimized fused_multi_head_attention_inference operator and interface implementation to include attn_bias parameter. (#9853)
Optimized bn-related functor implementation, merging bn_add_relu and bn_relu operations to expedite inference. (#10239)
Optimized MLIR CodeGen-based processes and upgraded LLVM version to 16.0.0. (#9985)
Optimized MLIR codegen-based processes by adding AppendOneFlowStream, MgpuToOneFlowStream, and CastOneFlowInputToSignlessPass passes. (#10149, #10151, #10099)
Optimized MLIR codegen-based processes by linking LibDevice to support NVVM IR conversion to cubin. (#10200)
Optimized MLIR codegen-based processes by utilizing tmpbuffer as MemPool in MLIR. (#10159)
Optimized MLIR codegen-based processes by enabling bufferizable operator dispatch. (#9787)
Optimized MLIR codegen-based processes to expedite ofmempool and related processes. (#10152, #10168, #10184, #10239)
Optimized stacktrace call stack information. (#9912, #9937, #10260, #10161)
Optimized random number generator implementation by adding caching to avoid regeneration with each call. (#10387)
Optimized graph load functionality to support loading the graph onto a new device. (#10335)
Optimized dummy array initialization implementation using fold expressions. (#10271)
Optimized MemoryFormat class organization, exposed to Python layer via cpython to support changing tensor's MemoryFormat using Tensor.to interface. (#10181)
Optimized implementations of stream, device, and vm to support more device types. (#10166)
Optimized error messages for MapAt, adding printing of key values. (#10090)
Optimized OOM error messages to differentiate CUDA and CPU devices and display size. (#9938)
Optimized error messages for CHECK_XX_OR_RETURN macros. (#9921)
Optimized error messages for graph-related issues. (#9821)
Optimized error messages for convolution operator-related issues. (#9707)
Optimized model initialization to minimize additional overhead. (#10088)
Optimized thread manager implementation to accommodate three usage scenarios: unrestricted threads, master as a thread, and n threads. (#10060)
Optimized numpy array release mechanism to release in the main thread to reduce time-consuming GIL requests. (#10050)
Optimized graph save runtime_state_dict implementation to enhance performance and address related issues. (#10016)
Optimized parsing of different calling methods for interfaces like Tensor.foo(*args) using a unified PyParseArgs function. (#9983)
Optimized the implementation of the ArgsTree class to support arbitrary output types and conducted file location migration. (#9846)
Optimized memory allocation mechanism to achieve ordered allocation based on streams. (#9818)
Removed deallocate context. (#10143)
Removed debug compilation mode in graph compilation. (#10145)
Removed unused logic for MemChain merge. (#10097)
Removed default settings for some unused distributed environment variables. (#9803)
Refactored collective boxing implementation under lazy mode. (#10098)
Refactored registration of EagerCclS2S. (#10100)
Refactored implementation of collective_boxing_executor_backend. (#10082)
Refactored implementation of running global nn.graph using VM. (#10048)
Refactored implementation of local to global related interfaces. (#9870)
Refactored operator dispatch dialect implementation in MLIR codegen process. (#9693)
Refactored implementation of random generator and distribution kernels. (#9691)
Refactored implementation of fast_atomic_add operator. (#9680)
Refactored error check related macros in glog. (#10176)
Refactored implementation of random generator. (#10025)
Refactored implementation of some elementwise primitive operations. (#9857)
Refactored code related to device descriptions. (#9791)
Refactored implementation of ParseDeviceString and ParseDeviceNameConf. (#9833)
Refactored implementation of ActorMsg related functionalities, introducing IBVerbsActorMsgWrapper wrapper to reduce the size of ActorMsg. (#9762)
Refactored implementation of save and load interfaces, migrating the method of saving graphs to the _save_graph function, adding some _open* helper classes to differentiate between paths and memory, enabling saving weights to BytesIO in save, and supporting file streaming in load. (#10021)
Refactored implementation of some tensor-related interfaces, migrating code from Python layer to C++ layer. (#9990, #9964)
Upgraded PyBind version used in the project to 2.11.1. (#10391)
Fixed default dynamic linking settings in CMake files to avoid LLVM15 linking errors. (#10373, #10131)
Fixed cast-related bugs in MLIR codegen. (#10105)
Fixed logic handling for cpg attr in Module._apply function. (#10343)
Fixed inheritance issue for DummyModule when attr is mro_entries. (#9976)
Fixed size checking issue for _handle_size_arg in full op. (#9975)
Fixed residual environment variables after launching mock via command line, causing subsequent API mock parameter errors. (#9970)
Fixed inability to exit when two processes encounter exceptions. (#10054)
Fixed bug in grouped quantization sbp derivation. (#10132)
Fixed kMaxInputCount check issue in GroupedMatmulFunctor. (#10322)
Fixed 0-size tensor broadcast issue. (#10186)
Fixed issue where double type attr was not updated when using shared_graph. (#10279)
Fixed data type error in GetItemInScalarTensor. (#10226)
Fixed gradient issue in GroupNorm, calling GroupNormParamGrad only when gamma and beta gradients are required. (#10045)
Fixed error when reading tensors with partial ranks in global mode. (#10056)
Fixed control boundary issues in checkpointing under PP, affecting task graph construction under separate compilation. (#10057)
Fixed bug when using 3D parallelism and enabling activation checkpointing simultaneously. (#10031)
Fixed adaptation bug of AutoMixedPrecision pass on non-CUDA devices and bug related to device combinations in LayerNorm Module. (#10026)
Fixed default value setting issue for reduce parameter in scatter operator. (#10002)
Fixed incomplete disable of some Torch variables in mock.disable, causing lingering references in other globals. (#9989)
Fixed destructor issue in vm::TensorStorage. (#9962)
Fixed offload issue where small tensors were not released from CUDA memory. (#9974)
Fixed occasional segmentation fault in Python stack getter due to thread unsafety. (#9955)
Fixed element lookup issue in set under separate compilation scenario. (#9952)
Aligned qkv and output_layout in fused_multi_head_attention operator. (#9950)
Fixed inconsistency in seed behavior of random series operators between graph and checkpointing. (#9941)
Fixed parameter reload failure issue in Eager mode. (#9935)
Fixed infinite loop issue in specific cases of mock torch lazy functionality. (#9926)
Fixed issue where code in stft_kernel.cu file was not compiled by default. (#9922)
Fixed deadlock and memory allocation errors caused by invalid topological order due to incomplete TaskGraph under separate compilation in order_in_graph. (#9909)
Fixed xrt compilation issue where fmt could not be found. (#9894)
Fixed imbalance in GPU memory allocation among processes during local to global process where sbp is B. (#9852)
Aligned OneFlow and PyTorch behaviors related to the third parameter of CTCLoss. (#9845)
Fixed initialization issues related to thread_global_id and rank_group_scope. (#9841)
Fixed inplace handling errors in dropout operator implementation. (#9808)
Fixed errors in loading non-tensor objects saved by PyTorch in the load function. (#9804)
Fixed conflicts between contiguous memory and GPU memory allocation strategies. (#9786)
Fixed memory allocation issues in EagerBlobObject::ByteSizeOfBlobBody when considering non-contiguous cases. (#9782)
Fixed dtype inference errors in fill_ operator during autocast. (#9776)
Fixed sbp derivation rule issues in fused_glu operator. (#10108)
Fixed issues related to calling nn.Graph.__map_io. (#10084)
Fixed inconsistency between set_grad_mode interface and PyTorch behavior. (#10059)
Fixed an issue related to the map_location parameter in the load interface and added support for passing lambda functions. (#10052)
Fixed stride inference errors after unsqueeze operation in view mode. (#9775)
Fixed problems in conv op with unbatched input and bias, and added support for unbatched input in deconv op. (#9740)
Fixed logic errors in trunc_normal_ implementation. (#9711)
Fixed default value issue in dim parameter of topk operator. (#9703)
Fixed issues where placement of some networks was incorrectly set to CPU during static graph printing. (#9770)
Fixed conflict between include paths of trt_flash_attention and native flash attention. (#9750)
Fixed segmentation fault caused by is_shutting_down and gil in stack getter. (#9681)
Fixed issues related to the separate compilation feature found in distributed unit testing. (#9749)
Fixed memory handling issues in flatten algorithm implementation. (#9746)
Fixed a deadlock issue in the execution flow. (#9738)
Fixed errors in isinstance check for DummyModule. (#10207)
Corrected behavior where default size was erroneously overridden when introducing llvm::SmallVector. (#9932)
Fixed errors in calculating memory size of non-contiguous memory tensors. (#9819)
Fixed issues with calling CHECK_JUST in the TensorStorage destructor function. (#9752)
1. OneFlow compile_from_torch vs PyTorch compile
The backbone parts of the ResNet50 and Faster RCNN models were compiled and executed using the OneFlow compile_from_torch and PyTorch compile interfaces to measure compilation time with inputs of different shapes. The results are shown in the table below:
| Model | Input shape | PyTorch compile | OneFlow compile_from_torch | dynamic | Test timing |
| --- | --- | --- | --- | --- | --- |
| ResNet50 | (1, 3, 512, 512) | 21.328 s | 3.205 s | False | initial compilation and execution |
| ResNet50 | (2, 3, 896, 512) | 14.167 s | 1.523 s | False | continuous compilation and execution |
| ResNet50 | (2, 3, 512, 896) | 13.364 s | 1.402 s | False | continuous compilation and execution |
| ResNet50 | (3, 3, 896, 896) | 15.056 s | 1.539 s | False | continuous compilation and execution |
| ResNet50 | (2, 3, 1024, 896) | 14.167 s | 1.500 s | False | continuous compilation and execution |
| ResNet50 | (2, 3, 896, 1024) | 12.891 s | 1.494 s | False | continuous compilation and execution |
| ResNet50 | (6, 3, 1024, 1024) | 14.859 s | 1.872 s | False | continuous compilation and execution |
| ResNet50 | (1, 3, 512, 512) | 170.446 s | 3.143 s | True | initial compilation and execution |
| ResNet50 | (2, 3, 896, 512) | 185.672 s | 0.851 s | True | continuous compilation and execution |
| ResNet50 | (2, 3, 512, 896) | 0.089 s | 0.836 s | True | continuous compilation and execution |
| ResNet50 | (3, 3, 896, 896) | 0.084 s | 0.980 s | True | continuous compilation and execution |
| ResNet50 | (2, 3, 1024, 896) | 0.077 s | 0.942 s | True | continuous compilation and execution |
| ResNet50 | (2, 3, 896, 1024) | 0.080 s | 0.931 s | True | continuous compilation and execution |
| ResNet50 | (6, 3, 1024, 1024) | 0.084 s | 1.406 s | True | continuous compilation and execution |
| Faster RCNN | (1, 3, 512, 512) | 18.224 s | 5.483 s | False | initial compilation and execution |
| Faster RCNN | (2, 3, 896, 512) | 9.200 s | 3.011 s | False | continuous compilation and execution |
| Faster RCNN | (2, 3, 512, 896) | 9.331 s | 3.025 s | False | continuous compilation and execution |
| Faster RCNN | (3, 3, 896, 896) | 9.301 s | 2.854 s | False | continuous compilation and execution |
| Faster RCNN | (2, 3, 1024, 896) | 9.290 s | 2.805 s | False | continuous compilation and execution |
| Faster RCNN | (2, 3, 896, 1024) | 9.123 s | 2.851 s | False | continuous compilation and execution |
| Faster RCNN | (6, 3, 1024, 1024) | 9.377 s | 3.180 s | False | continuous compilation and execution |
| Faster RCNN | (1, 3, 512, 512) | 25.444 s | 5.430 s | True | initial compilation and execution |
| Faster RCNN | (2, 3, 896, 512) | 25.381 s | 1.899 s | True | continuous compilation and execution |
| Faster RCNN | (2, 3, 512, 896) | 0.116 s | 1.886 s | True | continuous compilation and execution |
| Faster RCNN | (3, 3, 896, 896) | 1.982 s | 1.793 s | True | continuous compilation and execution |
| Faster RCNN | (2, 3, 1024, 896) | 0.114 s | 1.803 s | True | continuous compilation and execution |
| Faster RCNN | (2, 3, 896, 1024) | 0.111 s | 1.778 s | True | continuous compilation and execution |
| Faster RCNN | (6, 3, 1024, 1024) | 0.143 s | 2.110 s | True | continuous compilation and execution |

Using the OneFlow compile_from_torch and PyTorch compile interfaces, the unet section of the Stable Diffusion model was compiled and executed to test the compilation time and execution time with outputs of different shapes. The results are presented in the table below:
| Model | Output shape | PyTorch compile | OneFlow compile_from_torch | dynamic | Test timing |
| --- | --- | --- | --- | --- | --- |
| Stable Diffusion | (2, 512, 512) | 103.701 s | 63.670 s | False | initial compilation and execution |
| Stable Diffusion | (1, 512, 768) | 95.137 s | 53.864 s | False | continuous compilation and execution |
| Stable Diffusion | (2, 768, 512) | 90.259 s | 55.271 s | False | continuous compilation and execution |
| Stable Diffusion | (1, 768, 768) | 90.196 s | 51.590 s | False | continuous compilation and execution |
| Stable Diffusion | (2, 512, 512) | 275.660 s | 57.117 s | True | initial compilation and execution |
| Stable Diffusion | (1, 512, 768) | 345.774 s | 43.752 s | True | continuous compilation and execution |
| Stable Diffusion | (2, 768, 512) | 349.835 s | 47.653 s | True | continuous compilation and execution |
| Stable Diffusion | (1, 768, 768) | 7.224 s | 45.720 s | True | continuous compilation and execution |
| Stable Diffusion | (2, 512, 512) | 4.088 s | 2.831 s | False | subsequent execution |
| Stable Diffusion | (1, 512, 768) | 3.296 s | 2.325 s | False | subsequent execution |
| Stable Diffusion | (2, 768, 512) | 5.594 s | 5.157 s | False | subsequent execution |
| Stable Diffusion | (1, 768, 768) | 4.713 s | 3.557 s | False | subsequent execution |
| Stable Diffusion | (2, 512, 512) | 4.448 s | 2.801 s | True | subsequent execution |
| Stable Diffusion | (1, 512, 768) | 3.201 s | 2.314 s | True | subsequent execution |
| Stable Diffusion | (2, 768, 512) | 6.093 s | 4.166 s | True | subsequent execution |
| Stable Diffusion | (1, 768, 768) | 4.920 s | 3.557 s | True | subsequent execution |

Conclusion: The OneFlow compile_from_torch interface generally has shorter compilation times than the PyTorch compile interface. Additionally, benefiting from the exceptional operator optimizations in the OneFlow framework, it delivers superior execution performance on the Stable Diffusion model.
Note: The tests were conducted on a 3090 GPU with PyTorch v2.1.2 and CUDA 12.2.
2. OneFlow Eager vs PyTorch Eager

| Model | GPU model | Number of GPUs | Macro batch | PyTorch performance (iter/s) | OneFlow performance (iter/s) | Speedup ratio |
| --- | --- | --- | --- | --- | --- | --- |
| ResNet50 | 3090 | 1 | 1 | 31.37 | 38.81 | 23.72% |
| ResNet50 | 3090 | 1 | 2 | 32.06 | 48.45 | 51.12% |
| ResNet50 | 3090 | 2 | 1 | 31.10 | 33.46 | 7.59% |
| ResNet50 | 3090 | 2 | 2 | 31.76 | 34.83 | 9.67% |
| ResNet50 | A100 | 1 | 1 | 24.60 | 46.64 | 89.59% |
| ResNet50 | A100 | 1 | 2 | 25.06 | 49.88 | 99.04% |
| ResNet50 | A100 | 2 | 1 | 25.28 | 39.18 | 54.98% |
| ResNet50 | A100 | 2 | 2 | 24.09 | 32.84 | 36.32% |
| Bert | 3090 | 1 | 1 | 8.93 | 10.41 | 16.57% |
| Bert | 3090 | 1 | 2 | 13.11 | 14.31 | 9.15% |
| Bert | 3090 | 2 | 1 | 6.94 | 8.27 | 19.16% |
| Bert | 3090 | 2 | 2 | 12.19 | 15.58 | 27.81% |
| Bert | A100 | 1 | 1 | 10.45 | 12.72 | 21.72% |
| Bert | A100 | 1 | 2 | 20.24 | 21.57 | 6.57% |
| Bert | A100 | 2 | 1 | 12.63 | 16.09 | 27.39% |
| Bert | A100 | 2 | 2 | 24.86 | 29.84 | 20.03% |

Conclusion: Compared to PyTorch Eager, OneFlow Eager shows significant performance advantages in small-batch scenarios for both the ResNet50 and BERT models.
Note: The tests were conducted using PyTorch v2.1.0 and CUDA 12.1.