
🚀 Release v0.3.0

📝 Blog

Our latest blog post shares highlights and progress from recent work; take a look!

✨ Highlights

🏗️ Improved Training Throughput and Scalability via Megatron-Core Backend

In addition to the PyTorch DTensor backend, which seamlessly supports 🤗 Hugging Face models, this release adds a Megatron-Core backend ("Megatron backend") to enable large-scale dense and MoE model training. It provides efficient parallelisms (data, tensor, pipeline, context, expert, and sequence) and distributed optimizers for efficient training, and is our recommendation for RL at large model sizes and compute scale.
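
As a rough sketch of what the parallelism knobs look like, a Megatron training config might include the following. The key names mirror Megatron-Core naming and the example configs, but treat the exact names and values as assumptions to verify against the predefined configs below:

policy:
  megatron_cfg:
    enabled: True
    # Illustrative parallelism settings; tune for your model and cluster
    tensor_model_parallel_size: 2
    pipeline_model_parallel_size: 1
    context_parallel_size: 1
    sequence_parallel: False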

To use the Megatron backend, ensure you have initialized the submodules of NeMo RL:

git submodule update --init --recursive

You can try out the Megatron backend using predefined configs:

# Example 1 GPU 
uv run examples/run_grpo_math.py --config=examples/configs/grpo_math_1B_megatron.yaml

Or by enabling it from the command line:

# Example 1 GPU
uv run examples/run_sft.py policy.megatron_cfg.enabled=True

To learn more about the different backends and their configuration, visit our documentation on Training Backends.

For an FAQ on using the Megatron backend, see this section.

⚡ Context Parallelism and Sequence Packing

Users can now train on longer sequences with improved GPU utilization via Context Parallelism ("CP") and Sequence Packing, supported on both the Megatron-Core and PyTorch DTensor backends.

For the Megatron backend, both Context Parallelism and sequence packing can be enabled together:

policy:
  megatron_cfg:
    context_parallel_size: 2
  sequence_packing:
    enabled: True

The DTensor backend also supports CP and Sequence Packing, but the two cannot yet be used together. Progress on this feature is tracked in #520. There is also a known issue when combining CP with sequence parallelism, tracked in #659. For more information about CP and its current limitations in the DTensor backend, visit our documentation.

policy:
  dtensor_cfg:
    context_parallel_size: 2
  # CP and sequence packing cannot be used together (to enable sequence packing, set context_parallel_size=1)
  sequence_packing:
    enabled: False

We recommend sequence packing to avoid extra padding and accelerate your training run, but if your model cannot use sequence packing (e.g., due to unsupported attention kernel), we recommend using dynamic_batching instead (see config). Dynamic batching is mutually exclusive with sequence packing, so it should be enabled on its own.
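
As a minimal sketch, assuming the dynamic_batching section sits alongside sequence_packing in the policy config as in the example configs, enabling it on its own would look like:

policy:
  sequence_packing:
    enabled: False  # mutually exclusive with dynamic batching
  dynamic_batching:
    enabled: True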

For more details on sequence packing and dynamic batching and how to use them, refer to our design documentation.

💎 Expanded Model Support

🔹 Qwen3 Support

Full support for the Qwen3 model family, with optimized configurations, is available on the Megatron backend.

The Qwen3 dense variants and the smallest MoE variant (Qwen/Qwen3-30B-A3B) are also available on the DTensor backend. If you need full N-D parallelism at the largest scale, we recommend the Megatron backend.
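
For example, a Qwen3 dense run on the Megatron backend could start from the predefined Megatron config and override the model name. The override keys below follow the example configs but are an assumption; check them (and the parallelism settings for larger models) against your config:

# Hypothetical example: Qwen3 dense model on the Megatron backend
uv run examples/run_grpo_math.py \
    --config=examples/configs/grpo_math_1B_megatron.yaml \
    policy.model_name="Qwen/Qwen3-8B" \
    policy.tokenizer.name="Qwen/Qwen3-8B"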

🔹 DeepSeekV3 Support

DeepSeekV3 (671B) is now supported on the Megatron backend. See #591 for more details on how to launch. We are continuing to optimize performance on DeepSeekV3 and other large MoE models, and we hope to land these optimizations in our next release.

🚀 Async vLLM Engine

We have added Async vLLM Engine (v1) support in v0.3, which enables two important features that were not possible before; one of them, faster multi-turn rollouts, is described below.

Async engine can be enabled with the following config change:

  generation:
    backend: "vllm"
    vllm_cfg:
      async_engine: true

With the Async vLLM Engine enabled, multi-turn rollouts are now much faster since we no longer block at each turn waiting for all batch elements to complete.

๐Ÿ“ Non-colocated Generation ("Split Placement")

NeMo RL now supports placing the training backend on a different set of GPUs than the generation backend. This is currently supported in the DTensor backend, with support in the Megatron backend coming soon (#613).

This feature can be useful if you want generation to run on dedicated GPUs, for example so that training and generation do not compete for memory on the same devices or so that each can scale independently.

Non-colocated generation can be enabled with the following config changes:

  generation:
    backend: "vllm"
    colocated:
      # true: generation shares training GPUs
      # false: uses dedicated generation resources
      enabled: false
      # only relevant when enabled is false
      resources:
        gpus_per_node: null # Number of GPUs dedicated to generation when the cluster has a single node (i.e., cluster.num_nodes == 1)
        num_nodes: 1 # Number of nodes dedicated to generation

An example multi-node command is:

# 5 nodes with 8 GPUs each: 4 nodes for training and 1 node for generation
uv run python examples/run_grpo_math.py \
    policy.generation.colocated.enabled=false \
    policy.generation.colocated.resources.num_nodes=1 \
    cluster.num_nodes=5 \
    cluster.gpus_per_node=8
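
A single-node variant is sketched below. With cluster.num_nodes=1, resources.gpus_per_node controls how many GPUs are dedicated to generation (per the config comments above); setting resources.num_nodes=null in this case is an assumption to verify against the documentation:

# Single node with 8 GPUs: 6 GPUs for training, 2 dedicated to generation
uv run python examples/run_grpo_math.py \
    policy.generation.colocated.enabled=false \
    policy.generation.colocated.resources.gpus_per_node=2 \
    policy.generation.colocated.resources.num_nodes=null \
    cluster.num_nodes=1 \
    cluster.gpus_per_node=8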

Non-colocated generation is also an important prerequisite for our continued work on Async RL.

📊 MLflow Integration for Experiment Tracking

NeMo RL now supports MLflow integration for comprehensive experiment tracking and management. This extends our suite of loggers, which already includes TensorBoard and Weights & Biases (wandb).

Enable MLflow tracking in your configuration:

logger:
  mlflow_enabled: true
  mlflow:
    experiment_name: "grpo-dev"
    run_name: "grpo-dev-logger"
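
Once runs are logged, you can browse them with the standard MLflow UI. Where the runs are stored depends on your MLflow tracking URI; the command below assumes the default local ./mlruns store:

mlflow ui --backend-store-uri ./mlruns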

⚡ Performance Optimizations

🚀 Refit Optimizations

Multiple improvements to the refit process (weight updates from training to the generation backend) led to a severalfold speedup. For large MoE models this has a significant effect on end-to-end (E2E) step time: on DeepSeekV3, these optimizations brought refit time down from 850 seconds to 51 seconds (a roughly 16x improvement). The improvements are particularly beneficial for extra-large models with large TP sizes in vLLM.

The core engineering team plans to share the insights and optimization techniques behind this speedup in a blog post, so stay tuned!

📊 vLLM CUDA Graphs

In v0.3, CUDA Graphs are enabled in vLLM by default.
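
If you need to fall back to eager execution (for example, while debugging), vLLM's enforce_eager option disables CUDA Graphs. The sketch below assumes it is exposed through vllm_cfg in the same way as the generation settings shown earlier:

  generation:
    backend: "vllm"
    vllm_cfg:
      enforce_eager: True  # disables CUDA Graphs; the v0.3 default is False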

🚫 FSDP1 Deprecation

NeMo RL has officially removed the original FSDP1 path used for multi-GPU, multi-node training in pure PyTorch. For training in pure PyTorch without the Megatron backend, we now recommend the DTensor path, which uses FSDP2 by default and is strictly better in both functionality and performance.
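
As a minimal sketch (the key names follow the example configs and should be treated as assumptions to verify), selecting the DTensor/FSDP2 path looks like:

policy:
  dtensor_cfg:
    enabled: True            # DTensor path, FSDP2 under the hood
    tensor_parallel_size: 1  # illustrative; tune per model size
  megatron_cfg:
    enabled: False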

For more information on the deprecation and the burn testing done before its removal, see #614.

🛠️ Known Issues

📊 Release Runs

We have provided TensorBoard logs for the release runs to give you a head start on what to expect from our recipes.

To view these TensorBoard logs easily, we've provided a Google Colab notebook that downloads and serves them.

What's Changed

New Contributors

Full Changelog: v0.2.1...v0.3.0

