For more detailed information about these models, we recommend reading the following blog post: https://huggingface.co/blog/welcome-openai-gpt-oss
GPT OSS is a hugely anticipated open-weights release by OpenAI, designed for powerful reasoning, agentic tasks, and versatile developer use cases. It comprises two models: a larger one with 117B parameters (gpt-oss-120b) and a smaller one with 21B parameters (gpt-oss-20b). Both are mixture-of-experts (MoE) models and use a 4-bit quantization scheme (MXFP4), enabling fast inference (thanks to the small number of active parameters, see details below) while keeping resource usage low. The large model fits on a single H100 GPU, while the small one runs within 16 GB of memory and is perfect for consumer hardware and on-device applications.
Overview of Capabilities and Architecture

The following snippet shows simple inference with the 20B model. It runs on 16 GB GPUs when using mxfp4, or on ~48 GB in bfloat16.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
)

messages = [
    {"role": "user", "content": "How many rs are in the word 'strawberry'?"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(generated[0][inputs["input_ids"].shape[-1]:]))

Flash Attention 3
The models use attention sinks, a technique the vLLM team made compatible with Flash Attention 3. We have packaged and integrated their optimized kernel in kernels-community/vllm-flash-attn3. At the time of writing, this super-fast kernel has been tested on Hopper cards with PyTorch 2.7 and 2.8. We expect increased coverage in the coming days. If you run the models on Hopper cards (for example, H100 or H200), you need to pip install --upgrade kernels and add the following lines to your snippet:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
+   # Flash Attention with Sinks
+   attn_implementation="kernels-community/vllm-flash-attn3",
)

messages = [
    {"role": "user", "content": "How many rs are in the word 'strawberry'?"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(generated[0][inputs["input_ids"].shape[-1]:]))
Even though the 120B model fits on a single H100 GPU (using mxfp4), you can also run it easily on multiple GPUs using accelerate or torchrun. Transformers provides a default parallelization plan, and you can leverage optimized attention kernels as well. The following snippet can be run with torchrun --nproc_per_node=4 generate.py on a system with 4 GPUs:
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.distributed import DistributedConfig
import torch

model_path = "openai/gpt-oss-120b"

tokenizer = AutoTokenizer.from_pretrained(model_path, padding_side="left")

device_map = {
    "tp_plan": "auto",  # Enable Tensor Parallelism
}

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",
    attn_implementation="kernels-community/vllm-flash-attn3",
    **device_map,
)

messages = [
    {"role": "user", "content": "Explain how expert parallelism works in large language models."}
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=1000)

# Decode and print
response = tokenizer.decode(outputs[0])
print("Model response:", response.split("<|channel|>final<|message|>")[-1].strip())

Other optimizations
If you have a Hopper GPU or better, we recommend you use mxfp4 for the reasons explained above. If you can additionally use Flash Attention 3, then by all means do enable it!
Tip
If your GPU is not compatible with mxfp4, then we recommend you use MegaBlocks MoE kernels for a nice speed bump. To do so, you just need to adjust your inference code like this:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
+   # Optimize MoE layers with downloadable MegaBlocksMoeMLP
+   use_kernels=True,
)

messages = [
    {"role": "user", "content": "How many rs are in the word 'strawberry'?"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(generated[0][inputs["input_ids"].shape[-1]:]))
Tip
MegaBlocks optimized MoE kernels require the model to run in bfloat16, so memory consumption will be higher than running in mxfp4. We recommend you use mxfp4 if you can, otherwise opt in to MegaBlocks via use_kernels=True. A quick way to see the memory difference on your own hardware is sketched below.
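The following is a minimal, illustrative sketch (not part of the release) that compares memory usage under the bfloat16 + MegaBlocks loading path using the standard get_memory_footprint() helper. Note that loading the 20B model in bfloat16 needs roughly 48 GB of GPU memory.

from transformers import AutoModelForCausalLM

model_id = "openai/gpt-oss-20b"

# MegaBlocks path: the MoE weights are held in bfloat16, so the footprint is large (~48 GB).
model_bf16 = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
    use_kernels=True,
)
print(f"bfloat16 + MegaBlocks: {model_bf16.get_memory_footprint() / 1e9:.1f} GB")

# On hardware that supports the mxfp4 kernels, the default loading path keeps the MoE
# weights quantized and reports a much smaller footprint, in line with the ~16 GB figure above.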
transformers serve

You can use transformers serve to experiment locally with the models, without any other dependencies. You can launch the server with just:
transformers serve
You can then send requests to it using the Responses API:
# responses API
curl -X POST http://localhost:8000/v1/responses \
-H "Content-Type: application/json" \
-d '{"input": [{"role": "system", "content": "hello"}], "temperature": 1.0, "stream": true, "model": "openai/gpt-oss-120b"}'
You can also send requests using the standard Completions API:
# completions API
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"messages": [{"role": "system", "content": "hello"}], "temperature": 1.0, "max_tokens": 1000, "stream": true, "model": "openai/gpt-oss-120b"}'
Command A Vision
Command A Vision is a state-of-the-art multimodal model designed to seamlessly integrate visual and textual information for a wide range of applications. By combining advanced computer vision techniques with natural language processing capabilities, Command A Vision enables users to analyze, understand, and generate insights from both visual and textual data.
The model excels at tasks including image captioning, visual question answering, document understanding, and chart understanding. This makes it a versatile tool for AI practitioners. Its ability to process complex visual and textual inputs makes it useful in settings where text-only representations are imprecise or unavailable, like real-world image understanding and graphics-heavy document processing.
Command A Vision is built upon a robust architecture that leverages the latest advancements in VLMs. It's highly performant and efficient, even when dealing with large-scale datasets. The model's flexibility makes it suitable for a wide range of use cases, from content moderation and image search to medical imaging analysis and robotics.
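Command A Vision is exposed in transformers as an image-text-to-text model. The snippet below is a minimal sketch of how such a model is typically loaded and queried with the multimodal chat template; the checkpoint ID and image URL are illustrative placeholders, so check the model card on the Hub for the released checkpoint name.

from transformers import AutoProcessor, AutoModelForImageTextToText

# Placeholder checkpoint ID for illustration; see the Hub for the released checkpoint.
model_id = "CohereLabs/command-a-vision-07-2025"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(generated[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))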
MM Grounding DINO

The MM Grounding DINO model was proposed in An Open and Comprehensive Pipeline for Unified Object Grounding and Detection by Xiangyu Zhao, Yicheng Chen, Shilin Xu, Xiangtai Li, Xinjiang Wang, Yining Li, Haian Huang.
MM Grounding DINO builds on Grounding DINO by improving the contrastive class head and removing parameter sharing in the decoder, boosting zero-shot detection performance on both COCO (50.6 (+2.2) AP) and LVIS (31.9 (+11.8) val AP and 41.4 (+12.6) minival AP).
You can find all the original MM Grounding DINO checkpoints under the MM Grounding DINO collection. This model also supports LLMDet inference. You can find LLMDet checkpoints under the LLMDet collection.
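The sketch below assumes MM Grounding DINO follows the same zero-shot object detection interface as Grounding DINO in transformers (text queries as lowercase phrases separated by periods). The checkpoint ID is an illustrative placeholder; pick an actual one from the MM Grounding DINO collection.

import torch
import requests
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

# Placeholder checkpoint ID; use a real one from the MM Grounding DINO collection.
model_id = "openmmlab-community/mm_grounding_dino_tiny_o365v1_goldg"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Text queries: lowercase phrases, each ending with a period, as with Grounding DINO.
text = "a cat. a remote control."

inputs = processor(images=image, text=text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

results = processor.post_process_grounded_object_detection(
    outputs,
    input_ids=inputs.input_ids,
    threshold=0.4,
    target_sizes=[(image.height, image.width)],
)[0]

# Boxes are in (x_min, y_min, x_max, y_max) pixel coordinates.
print(results["scores"])
print(results["boxes"])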
Bugfixes and improvements

- FastSpeech2Conformer by @bvantuan in #39689
- classmethod by @zucchini-nlp in #38812
- [CI] Add Eric to comment slow ci by @vasqu in #39601
- QAPipelineTests::test_large_model_course after #39193 by @ydshieh in #39666
- Glm4MoeModelTest::test_torch_compile_for_training by @ydshieh in #39670
- Qwen2AudioForConditionalGeneration.forward() and test_flash_attn_kernels_inference_equivalence by @ebezzam in #39503
- models/__init__.py for typo checking by @hebangwen in #39745
- GemmaIntegrationTest::test_model_2b_bf16_dola again by @ydshieh in #39731
- --gpus all in workflow files by @ydshieh in #39752
- libcst to extras["testing"] in setup.py by @ydshieh in #39761
- main_classes/peft.md by @luckyvickyricky in #39515
- tvp.md to Korean by @Kim-Ju-won in #39578
- tokenizer.md to Korean by @seopp in #39532
- pipeline_gradio.md to Korean by @AhnJoonSung in #39520
- perf_train_gpu_one.md to Korean by @D15M4S in #39552
- how_to_hack_models.md to Korean by @skwh54 in #39536
- run_name when none by @qgallouedec in #39695
- model_results.json by @ydshieh in #39783
- [attn_implementation] remove recursive, allows custom kernels with wrappers by @ArthurZucker in #39823
- plot_keypoint_matching, make visualize_keypoint_matching as a standard by @sbucaille in #39830
- TrackioCallback to work when pynvml is not installed by @qgallouedec in #39851
- is_wandb_available function to verify WandB installation by @qgallouedec in #39875
- sub_configs by @qubvel in #39855
- Tokenizer with PreTrainedTokenizerFast in ContinuousBatchProcessor by @qgallouedec in #39858
- torch.backends.cudnn.allow_tf32 = False for CI by @ydshieh in #39885
- AutoModelForCausalLM and AutoModelForImageTextToText by @qubvel in #39881
- ModernBertForMultipleChoice by @netique in #39232
- [Exaone4] Fixes the attn implementation! by @ArthurZucker in #39906

The following contributors have made significant changes to the library over the last release: