
Version 1.0.0

OneFlow v1.0.0 Release Notes

OneFlow v1.0.0 has been released. We welcome you to install the new version for a better experience.

Highlights

This version update includes 447 commits and the following highlights:

New Features

1. compile_from_torch

The compile_from_torch interface converts a PyTorch Module instance into a OneFlow Module instance while sharing parameter memory. The converted module supports direct Eager execution or conversion into a static graph nn.Graph, which can be further accelerated with MLIR compilation. (#10404, #10408, #9984, #9754)

Interface Signature and Parameters:

compile_from_torch(torch_module: torch.nn.Module, *, use_graph=True, options={})
* torch_module: The PyTorch Module instance to be converted.
* use_graph: Whether to convert into a static graph nn.Graph and use MLIR compilation acceleration. The default is True.
* options:
  * size: When using a static graph nn.Graph, a hash of the graph corresponding to each input shape is computed and cached. size is the maximum capacity of this static graph cache; when the capacity is exceeded, graphs are evicted using an LRU strategy. The default value is 9.
  * dynamic: The first input with a dynamic shape triggers a full graph compilation. For subsequent inputs with different shapes, if dynamic is True, a shared graph is used to accelerate compilation; if dynamic is False, each new shape is compiled from scratch. The default is True.
  * debug: Debug mode and log level. -1 disables debug mode; 0 outputs warnings and static graph construction information; 1 additionally outputs construction information for each sub-module; 2 additionally outputs progress for each operator; 3 outputs more detailed operator information. The default value is -1.

Example of Usage:

import torch
from torchvision import models

import oneflow
from oneflow.framework.infer_compiler import compile_from_torch

# Load a pretrained ResNet50 from torchvision and move it to the GPU.
DEVICE = torch.device("cuda")
WEIGHT = models.ResNet50_Weights.DEFAULT
model = models.resnet50(weights=WEIGHT).to(DEVICE)

# Convert the PyTorch module into a OneFlow module; with dynamic=True,
# subsequent inputs with new shapes reuse a shared graph for faster compilation.
compile_model = compile_from_torch(model, options={"dynamic": True})
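Continuing the example, a minimal sketch of invoking the converted module; it assumes the returned instance is called exactly like the original PyTorch module (the input tensor below is purely illustrative):

x = torch.rand(1, 3, 224, 224, device=DEVICE)  # illustrative input
y = compile_model(x)  # the first call with a new input shape triggers compilation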
2. Separated Compilation

The static graph distributed physical execution plan now supports separated compilation, allowing each process to independently compile only the execution plan it needs, so compilation time no longer grows linearly with the number of GPUs. Separated compilation supports 3D hybrid parallelism (data parallelism + model parallelism + pipeline parallelism) and can be used together with LiBai, the open-source large-scale model training toolbox. To enable this feature, set: export ONEFLOW_ENABLE_LAZY_SEPARATE_COMPILE=1. (#9920, #10140, #10141, #10124, #10102)

Below are the test results for the GPT2 model with LiBai on 128 A100-PCIE-40GB GPUs:

| Parallelism | Separated Compilation Enabled | Execution Plan Compilation Time |
| --- | --- | --- |
| Data Parallelism (DP128 MP1 PP1) | No | Over 20 minutes |
| Data Parallelism (DP128 MP1 PP1) | Yes | 108.21 s |
| 3D Parallelism (DP4 MP4 PP8) | No | 445.16 s |
| 3D Parallelism (DP4 MP4 PP8) | Yes | 82.88 s |

3. Functional Automatic Differentiation Interfaces

A series of functional automatic differentiation-related interfaces have been introduced, including jvp, vjp, hvp, vhp, jacobian, and hessian. (#10412, #10428)

Example of Usage:

import oneflow as flow

# jacobian example
def exp_reducer(x):
    return x.exp().sum(dim=1)

input = flow.rand(2, 2)
jac_rslt = flow.autograd.functional.jacobian(exp_reducer, input)

# vhp example
def pow_reducer(x):
    return x.pow(3).sum()

input = flow.rand(2, 2)
v = flow.ones(2, 2)
vhp_rslt = flow.autograd.functional.vhp(pow_reducer, input, v)
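For the remaining interfaces, a minimal sketch of vjp usage; it assumes the calling convention mirrors the corresponding PyTorch interface (returning the function output together with the vector-Jacobian product):

# vjp example (sketch; signature assumed to mirror PyTorch's)
def sum_reducer(x):
    return x.sum(dim=1)

input = flow.rand(2, 2)
v = flow.ones(2)  # must match the shape of sum_reducer's output
output, vjp_rslt = flow.autograd.functional.vjp(sum_reducer, input, v)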
4. Insight Module

Introduced a new Insight module that visualizes kernel invocations, execution time, speed, and related information within instrumented intervals. (#10370)

For usage details and more information, please refer to: https://github.com/Oneflow-Inc/oneflow/tree/master/python/oneflow/utils/insight#usage

5. LiBai Version Update

| Models | 2D (tp+pp) Inference | 3D Parallel Training |
| --- | --- | --- |
| ChatGLM | ✔ | ✔ |
| Llama2 | ✔ | ✔ |

Example of Usage:

# full finetune
bash tools/train.sh projects/Llama/train_net.py projects/Llama/configs/llama_sft.py 8
# adapter finetune
bash tools/train.sh projects/Llama/adapter/train_net.py projects/Llama/adapter/adapter_sft.py 8
# inference
bash tools/infer.sh projects/Llama/pipeline.py 8
# eval
python projects/Llama/utils/eval_adapter.py
6. Other New Features

Improvements

1. Eager Runtime Optimization and Refactoring

A series of optimizations and refactorings has been implemented for the Eager runtime.

Users can configure the Eager runtime through the following environment variables:

| Environment Variable | Meaning | Default Value |
| --- | --- | --- |
| ONEFLOW_VM_COMPUTE_ON_WORKER_THREAD | Whether to perform computation on worker threads | true |
| ONEFLOW_VM_MULTI_THREAD | Whether to use multi-threaded collaboration for Eager computation | true |
| ONEFLOW_VM_ENABLE_STREAM_WAIT | Whether to use the stream_wait mechanism for dependencies between multiple streams | true |
| ONEFLOW_VM_ENABLE_SCHEDULE_YIELD | Whether to use the yield mechanism to reduce the scheduler thread's busy waiting | true |
| ONEFLOW_EAGER_ENABLE_LOCAL_INFER_CACHE | Whether to cache operator output metadata during computation | true |
| ONEFLOW_VM_WORKER_THREAD_LIMIT | Number of worker threads | 16 |
| ONEFLOW_VM_PENDING_HANDLE_WINDOW_SIZE | Maximum size of fused vm instructions | 10 |
| ONEFLOW_VM_BLOCKING_DEBUG_INSTRUCTIONS_DISPLAY_LIMIT | Number of unprocessed instructions printed when vm execution times out | 1000 |
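As an illustration, a minimal sketch of tuning these variables from Python; it assumes, as with most such flags, that they take effect only if set before oneflow is imported (the values below are illustrative, not recommendations):

import os

# Illustrative tuning only: cap worker threads and enlarge the
# instruction-fusion window before the runtime starts.
os.environ["ONEFLOW_VM_WORKER_THREAD_LIMIT"] = "8"
os.environ["ONEFLOW_VM_PENDING_HANDLE_WINDOW_SIZE"] = "20"

import oneflow as flow  # the Eager runtime reads these at startup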

2. Upgrade of OneFlow Serving Features

OneFlow Serving has been upgraded to support additional backends: in addition to the existing OneFlow Cpp backend, it now also supports the OneFlow Python backend and the OneFlow Lite backend.

For usage instructions, refer to: https://github.com/Oneflow-Inc/serving/blob/main/README.md

3. Other Functionality Improvements

Changes and Fixes

1. Functional Changes

2. Bug Fixes

Performance

1. OneFlow compile_from_torch vs PyTorch compile

The backbone parts of the ResNet50 and Faster RCNN models were compiled and executed with the OneFlow compile_from_torch and PyTorch compile interfaces to measure compilation time for inputs of different shapes. The results are shown in the table below:

| Model | Input shape | PyTorch compile | OneFlow compile_from_torch | dynamic | Test timing |
| --- | --- | --- | --- | --- | --- |
| ResNet50 | (1, 3, 512, 512) | 21.328 s | 3.205 s | False | initial compilation and execution |
| ResNet50 | (2, 3, 896, 512) | 14.167 s | 1.523 s | False | continuous compilation and execution |
| ResNet50 | (2, 3, 512, 896) | 13.364 s | 1.402 s | False | continuous compilation and execution |
| ResNet50 | (3, 3, 896, 896) | 15.056 s | 1.539 s | False | continuous compilation and execution |
| ResNet50 | (2, 3, 1024, 896) | 14.167 s | 1.500 s | False | continuous compilation and execution |
| ResNet50 | (2, 3, 896, 1024) | 12.891 s | 1.494 s | False | continuous compilation and execution |
| ResNet50 | (6, 3, 1024, 1024) | 14.859 s | 1.872 s | False | continuous compilation and execution |
| ResNet50 | (1, 3, 512, 512) | 170.446 s | 3.143 s | True | initial compilation and execution |
| ResNet50 | (2, 3, 896, 512) | 185.672 s | 0.851 s | True | continuous compilation and execution |
| ResNet50 | (2, 3, 512, 896) | 0.089 s | 0.836 s | True | continuous compilation and execution |
| ResNet50 | (3, 3, 896, 896) | 0.084 s | 0.980 s | True | continuous compilation and execution |
| ResNet50 | (2, 3, 1024, 896) | 0.077 s | 0.942 s | True | continuous compilation and execution |
| ResNet50 | (2, 3, 896, 1024) | 0.080 s | 0.931 s | True | continuous compilation and execution |
| ResNet50 | (6, 3, 1024, 1024) | 0.084 s | 1.406 s | True | continuous compilation and execution |
| Faster RCNN | (1, 3, 512, 512) | 18.224 s | 5.483 s | False | initial compilation and execution |
| Faster RCNN | (2, 3, 896, 512) | 9.200 s | 3.011 s | False | continuous compilation and execution |
| Faster RCNN | (2, 3, 512, 896) | 9.331 s | 3.025 s | False | continuous compilation and execution |
| Faster RCNN | (3, 3, 896, 896) | 9.301 s | 2.854 s | False | continuous compilation and execution |
| Faster RCNN | (2, 3, 1024, 896) | 9.290 s | 2.805 s | False | continuous compilation and execution |
| Faster RCNN | (2, 3, 896, 1024) | 9.123 s | 2.851 s | False | continuous compilation and execution |
| Faster RCNN | (6, 3, 1024, 1024) | 9.377 s | 3.180 s | False | continuous compilation and execution |
| Faster RCNN | (1, 3, 512, 512) | 25.444 s | 5.430 s | True | initial compilation and execution |
| Faster RCNN | (2, 3, 896, 512) | 25.381 s | 1.899 s | True | continuous compilation and execution |
| Faster RCNN | (2, 3, 512, 896) | 0.116 s | 1.886 s | True | continuous compilation and execution |
| Faster RCNN | (3, 3, 896, 896) | 1.982 s | 1.793 s | True | continuous compilation and execution |
| Faster RCNN | (2, 3, 1024, 896) | 0.114 s | 1.803 s | True | continuous compilation and execution |
| Faster RCNN | (2, 3, 896, 1024) | 0.111 s | 1.778 s | True | continuous compilation and execution |
| Faster RCNN | (6, 3, 1024, 1024) | 0.143 s | 2.110 s | True | continuous compilation and execution |
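For context, a minimal sketch of the kind of timing harness such measurements imply; the timed_call helper below is hypothetical and simply times a single call, which for a previously unseen input shape includes compilation:

import time

import torch
from torchvision import models
from oneflow.framework.infer_compiler import compile_from_torch

def timed_call(fn, x):
    # Time one call; a call with a new input shape includes compilation,
    # while repeated shapes measure execution only.
    torch.cuda.synchronize()
    start = time.perf_counter()
    fn(x)
    torch.cuda.synchronize()
    return time.perf_counter() - start

model = models.resnet50().to("cuda")
compiled = compile_from_torch(model, options={"dynamic": True})
print(timed_call(compiled, torch.rand(1, 3, 512, 512, device="cuda")))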

Using the OneFlow compile_from_torch and PyTorch compile interfaces, the unet section of the Stable Diffusion model was compiled and executed to test the compilation time and execution time with outputs of different shapes. The results are presented in the table below:

| Model | Output shape | PyTorch compile | OneFlow compile_from_torch | dynamic | Test timing |
| --- | --- | --- | --- | --- | --- |
| Stable Diffusion | (2, 512, 512) | 103.701 s | 63.670 s | False | initial compilation and execution |
| Stable Diffusion | (1, 512, 768) | 95.137 s | 53.864 s | False | continuous compilation and execution |
| Stable Diffusion | (2, 768, 512) | 90.259 s | 55.271 s | False | continuous compilation and execution |
| Stable Diffusion | (1, 768, 768) | 90.196 s | 51.590 s | False | continuous compilation and execution |
| Stable Diffusion | (2, 512, 512) | 275.660 s | 57.117 s | True | initial compilation and execution |
| Stable Diffusion | (1, 512, 768) | 345.774 s | 43.752 s | True | continuous compilation and execution |
| Stable Diffusion | (2, 768, 512) | 349.835 s | 47.653 s | True | continuous compilation and execution |
| Stable Diffusion | (1, 768, 768) | 7.224 s | 45.720 s | True | continuous compilation and execution |
| Stable Diffusion | (2, 512, 512) | 4.088 s | 2.831 s | False | subsequent execution |
| Stable Diffusion | (1, 512, 768) | 3.296 s | 2.325 s | False | subsequent execution |
| Stable Diffusion | (2, 768, 512) | 5.594 s | 5.157 s | False | subsequent execution |
| Stable Diffusion | (1, 768, 768) | 4.713 s | 3.557 s | False | subsequent execution |
| Stable Diffusion | (2, 512, 512) | 4.448 s | 2.801 s | True | subsequent execution |
| Stable Diffusion | (1, 512, 768) | 3.201 s | 2.314 s | True | subsequent execution |
| Stable Diffusion | (2, 768, 512) | 6.093 s | 4.166 s | True | subsequent execution |
| Stable Diffusion | (1, 768, 768) | 4.920 s | 3.557 s | True | subsequent execution |

Conclusion: The OneFlow compile_from_torch interface generally compiles faster than the PyTorch compile interface. Additionally, thanks to the extensive operator optimizations in the OneFlow framework, it delivers superior execution performance on the Stable Diffusion model.

Note: The tests were conducted on an RTX 3090 GPU with PyTorch v2.1.2 and CUDA 12.2.

2. OneFlow Eager vs PyTorch Eager

| Model | GPU model | Number of GPUs | Macro batch | PyTorch performance (iter/s) | OneFlow performance (iter/s) | Speedup ratio |
| --- | --- | --- | --- | --- | --- | --- |
| ResNet50 | 3090 | 1 | 1 | 31.37 | 38.81 | 23.72% |
| ResNet50 | 3090 | 1 | 2 | 32.06 | 48.45 | 51.12% |
| ResNet50 | 3090 | 2 | 1 | 31.10 | 33.46 | 7.59% |
| ResNet50 | 3090 | 2 | 2 | 31.76 | 34.83 | 9.67% |
| ResNet50 | A100 | 1 | 1 | 24.60 | 46.64 | 89.59% |
| ResNet50 | A100 | 1 | 2 | 25.06 | 49.88 | 99.04% |
| ResNet50 | A100 | 2 | 1 | 25.28 | 39.18 | 54.98% |
| ResNet50 | A100 | 2 | 2 | 24.09 | 32.84 | 36.32% |
| Bert | 3090 | 1 | 1 | 8.93 | 10.41 | 16.57% |
| Bert | 3090 | 1 | 2 | 13.11 | 14.31 | 9.15% |
| Bert | 3090 | 2 | 1 | 6.94 | 8.27 | 19.16% |
| Bert | 3090 | 2 | 2 | 12.19 | 15.58 | 27.81% |
| Bert | A100 | 1 | 1 | 10.45 | 12.72 | 21.72% |
| Bert | A100 | 1 | 2 | 20.24 | 21.57 | 6.57% |
| Bert | A100 | 2 | 1 | 12.63 | 16.09 | 27.39% |
| Bert | A100 | 2 | 2 | 24.86 | 29.84 | 20.03% |
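As context for the iter/s figures above, a minimal sketch of how eager training throughput might be measured; the tiny model, loss, and optimizer are illustrative stand-ins for the real benchmark models:

import time
import oneflow as flow
import oneflow.nn as nn

# Illustrative setup: a small linear model stands in for ResNet50/BERT.
model = nn.Linear(128, 10).to("cuda")
optimizer = flow.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

x = flow.rand(32, 128, device="cuda")
y = flow.randint(0, 10, (32,), device="cuda")

iters = 100
start = time.perf_counter()
for _ in range(iters):
    loss = loss_fn(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
loss.numpy()  # force pending asynchronous work to finish before stopping the clock
print(f"{iters / (time.perf_counter() - start):.2f} iter/s")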

Conclusion: Compared with PyTorch Eager, OneFlow Eager shows significant performance advantages in small-batch scenarios for both the ResNet50 and BERT models.

Note: The tests were conducted using PyTorch v2.1.0 and CUDA 12.1.
