Collection of scripts demonstrating different optimization and fine-tuning techniques for OpenAI's GPT-OSS models (20B and 120B parameters).
Resources
generate_tp.py - Model with Tensor Parallelism.
generate_flash_attention.py - Model with Flash Attention + Tensor Parallelism.
generate_tp_continuous_batching.py - Model with Flash Attention + Tensor Parallelism and Continuous Batching.
generate_all.py - Model with all optimizations: Expert Parallelism, Tensor Parallelism, Flash Attention.
sft.py - Script for fine-tuning the model using supervised fine-tuning (SFT). Supports both full-parameter training and LoRA training.

All generation scripts support both 20B and 120B models. To switch between model sizes, simply edit the model_path variable at the top of each script:
# Model configuration - uncomment the model size you want to use
model_path = "openai/gpt-oss-120b"  # 120B model (default)
# model_path = "openai/gpt-oss-20b"  # 20B model - uncomment this line and comment the line above
The scripts automatically configure the appropriate device mapping and settings based on the selected model size.
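For reference, the core loading pattern looks roughly like the sketch below. This is a minimal illustration, not the exact contents of the generate_*.py scripts; the prompt and generation settings are placeholders.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "openai/gpt-oss-120b"  # switch to "openai/gpt-oss-20b" for the smaller model

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",   # keep the checkpoint's native precision
    device_map="auto",    # let Accelerate spread layers across the available GPUs
)

inputs = tokenizer("Explain tensor parallelism in one sentence.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))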
First, create a virtual environment using e.g. uv:
uv venv gpt-oss --python 3.11 && source gpt-oss/bin/activate && uv pip install --upgrade pip
Next install PyTorch and Triton kernels:
uv pip install torch==2.8.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/test/cu128
If your hardware supports the MXFP4 quantization format, you can also install Triton kernels for optimized performance:
uv pip install git+https://github.com/triton-lang/triton.git@main#subdirectory=python/triton_kernels
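Whether MXFP4 is usable depends on the GPU architecture. The snippet below is a quick check under the assumption that the MXFP4 Triton kernels target Hopper-class or newer GPUs (CUDA compute capability 9.0+); verify the exact requirement against the Triton kernels documentation.

import torch

# Assumption: MXFP4 kernels require Hopper-class or newer hardware (compute capability >= 9.0).
major, minor = torch.cuda.get_device_capability(0)
print(f"GPU 0 compute capability: {major}.{minor}")
if major >= 9:
    print("MXFP4 Triton kernels should be usable on this GPU.")
else:
    print("MXFP4 likely unsupported; expect a fallback to a dequantized path.")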
Finally, install the remaining dependencies:
uv pip install -r requirements.txt
Important
Before running any script, edit the model_path variable to select your desired model size (20B or 120B).
Run a generation script:
python generate_<script_name>.py
or, for distributed inference (where x is the number of GPUs per node):
torchrun --nproc_per_node=x generate_<script_name>.py
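Under torchrun, the script shards the model across the launched processes. Below is a minimal tensor-parallel sketch, assuming the scripts rely on the transformers tp_plan="auto" path; it is an illustration, not the repo's exact code, and the prompt is a placeholder.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "openai/gpt-oss-20b"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    tp_plan="auto",  # shard the weights across the processes started by torchrun
)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))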
For full-parameter training on a single node with 8 GPUs, run:
# Eager attention
accelerate launch --config_file configs/zero3.yaml sft.py --config configs/sft_full.yaml

# FlashAttention3
accelerate launch --config_file configs/zero3.yaml sft.py --config configs/sft_full.yaml --attn_implementation kernels-community/vllm-flash-attn3
For LoRA training on one GPU, run:
python sft.py --config configs/sft_lora.yaml
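Conceptually, the LoRA path boils down to something like the sketch below (assumptions: sft.py is built on TRL's SFTTrainer; the dataset, LoRA hyperparameters, and output directory are placeholders rather than the values in configs/sft_lora.yaml).

from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("trl-lib/Capybara", split="train")  # placeholder dataset

peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    model="openai/gpt-oss-20b",
    train_dataset=dataset,
    peft_config=peft_config,
    args=SFTConfig(
        output_dir="gpt-oss-20b-lora",  # placeholder output path
        per_device_train_batch_size=1,
        gradient_checkpointing=True,
        num_train_epochs=1,
    ),
)
trainer.train()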
To change the dataset or training hyperparameters, either modify the sft_lora.yaml or sft_full.yaml files, or pass them as command-line arguments, e.g.:
accelerate launch --config_file configs/zero3.yaml \
    sft.py --config configs/sft_full.yaml \
    --dataset_name DATASET_NAME