For more detailed information about these models, we recommend reading the following blog post: https://huggingface.co/blog/welcome-openai-gpt-oss
GPT OSS is a hugely anticipated open-weights release by OpenAI, designed for powerful reasoning, agentic tasks, and versatile developer use cases. It comprises two models: a larger one with 117B parameters (gpt-oss-120b) and a smaller one with 21B parameters (gpt-oss-20b). Both are mixture-of-experts (MoE) models and use a 4-bit quantization scheme (MXFP4), enabling fast inference (thanks to the small number of active parameters, see details below) while keeping resource usage low. The large model fits on a single H100 GPU, while the small one runs within 16 GB of memory and is perfect for consumer hardware and on-device applications.
Overview of Capabilities and Architecture

The following snippet shows simple inference with the 20B model. It runs on 16 GB GPUs when using mxfp4, or on ~48 GB in bfloat16.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
)

messages = [
    {"role": "user", "content": "How many rs are in the word 'strawberry'?"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(generated[0][inputs["input_ids"].shape[-1]:]))

Flash Attention 3
The models use attention sinks, a technique the vLLM team made compatible with Flash Attention 3. We have packaged and integrated their optimized kernel in kernels-community/vllm-flash-attn3. At the time of writing, this super-fast kernel has been tested on Hopper cards with PyTorch 2.7 and 2.8. We expect increased coverage in the coming days. If you run the models on Hopper cards (for example, H100 or H200), you need to pip install --upgrade kernels and add the following lines to your snippet:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
+   # Flash Attention with Sinks
+   attn_implementation="kernels-community/vllm-flash-attn3",
)

messages = [
    {"role": "user", "content": "How many rs are in the word 'strawberry'?"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(generated[0][inputs["input_ids"].shape[-1]:]))
Even though the 120B model fits on a single H100 GPU (using mxfp4), you can also run it easily on multiple GPUs using accelerate or torchrun. Transformers provides a default parallelization plan, and you can leverage optimized attention kernels as well. The following snippet can be run with torchrun --nproc_per_node=4 generate.py on a system with 4 GPUs:
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.distributed import DistributedConfig
import torch

model_path = "openai/gpt-oss-120b"

tokenizer = AutoTokenizer.from_pretrained(model_path, padding_side="left")

device_map = {
    "tp_plan": "auto",  # Enable Tensor Parallelism
}

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",
    attn_implementation="kernels-community/vllm-flash-attn3",
    **device_map,
)

messages = [
    {"role": "user", "content": "Explain how expert parallelism works in large language models."}
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=1000)

# Decode and print
response = tokenizer.decode(outputs[0])
print("Model response:", response.split("<|channel|>final<|message|>")[-1].strip())

Other optimizations
If you have a Hopper GPU or better, we recommend you use mxfp4 for the reasons explained above. If you can additionally use Flash Attention 3, then by all means do enable it!
Tip
If your GPU is not compatible with mxfp4, then we recommend you use MegaBlocks MoE kernels for a nice speed bump. To do so, you just need to adjust your inference code like this:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
+   # Optimize MoE layers with downloadable MegaBlocksMoeMLP
+   use_kernels=True,
)

messages = [
    {"role": "user", "content": "How many rs are in the word 'strawberry'?"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(generated[0][inputs["input_ids"].shape[-1]:]))
Tip
MegaBlocks optimized MoE kernels require the model to run in bfloat16, so memory consumption will be higher than running in mxfp4. We recommend you use mxfp4 if you can, otherwise opt in to MegaBlocks via use_kernels=True. A quick way to see the memory difference on your own hardware is sketched below.
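The following is a minimal, illustrative sketch (not part of the release) that compares memory usage under the bfloat16 + MegaBlocks loading path using the standard get_memory_footprint() helper. Note that loading the 20B model in bfloat16 needs roughly 48 GB of GPU memory.

from transformers import AutoModelForCausalLM

model_id = "openai/gpt-oss-20b"

# MegaBlocks path: the MoE weights are held in bfloat16, so the footprint is large (~48 GB).
model_bf16 = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
    use_kernels=True,
)
print(f"bfloat16 + MegaBlocks: {model_bf16.get_memory_footprint() / 1e9:.1f} GB")

# On hardware that supports the mxfp4 kernels, the default loading path keeps the MoE
# weights quantized and reports a much smaller footprint, in line with the ~16 GB figure above.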
transformers serve

You can use transformers serve to experiment locally with the models, without any other dependencies. You can launch the server with just:
transformers serve
You can then send requests to it using the Responses API:
# responses API
curl -X POST http://localhost:8000/v1/responses \
-H "Content-Type: application/json" \
-d '{"input": [{"role": "system", "content": "hello"}], "temperature": 1.0, "stream": true, "model": "openai/gpt-oss-120b"}'
You can also send requests using the standard Completions API:
# completions API
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"messages": [{"role": "system", "content": "hello"}], "temperature": 1.0, "max_tokens": 1000, "stream": true, "model": "openai/gpt-oss-120b"}'
Command A Vision
Command A Vision is a state-of-the-art multimodal model designed to seamlessly integrate visual and textual information for a wide range of applications. By combining advanced computer vision techniques with natural language processing capabilities, Command A Vision enables users to analyze, understand, and generate insights from both visual and textual data.
The model excels at tasks including image captioning, visual question answering, document understanding, and chart understanding. This makes it a versatile tool for AI practitioners. Its ability to process complex visual and textual inputs makes it useful in settings where text-only representations are imprecise or unavailable, like real-world image understanding and graphics-heavy document processing.
Command A Vision is built upon a robust architecture that leverages the latest advancements in VLMs. It's highly performant and efficient, even when dealing with large-scale datasets. The model's flexibility makes it suitable for a wide range of use cases, from content moderation and image search to medical imaging analysis and robotics.
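Command A Vision is exposed in transformers as an image-text-to-text model. The snippet below is a minimal sketch of how such a model is typically loaded and queried with the multimodal chat template; the checkpoint ID and image URL are illustrative placeholders, so check the model card on the Hub for the released checkpoint name.

from transformers import AutoProcessor, AutoModelForImageTextToText

# Placeholder checkpoint ID for illustration; see the Hub for the released checkpoint.
model_id = "CohereLabs/command-a-vision-07-2025"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(generated[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))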
MM Grounding DINO

The MM Grounding DINO model was proposed in An Open and Comprehensive Pipeline for Unified Object Grounding and Detection by Xiangyu Zhao, Yicheng Chen, Shilin Xu, Xiangtai Li, Xinjiang Wang, Yining Li, Haian Huang.
MM Grounding DINO builds on Grounding DINO by improving the contrastive class head and removing parameter sharing in the decoder, boosting zero-shot detection performance on both COCO (50.6 (+2.2) AP) and LVIS (31.9 (+11.8) val AP and 41.4 (+12.6) minival AP).
You can find all the original MM Grounding DINO checkpoints under the MM Grounding DINO collection. This model also supports LLMDet inference. You can find LLMDet checkpoints under the LLMDet collection.
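The sketch below assumes MM Grounding DINO follows the same zero-shot object detection interface as Grounding DINO in transformers (text queries as lowercase phrases separated by periods). The checkpoint ID is an illustrative placeholder; pick an actual one from the MM Grounding DINO collection.

import torch
import requests
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

# Placeholder checkpoint ID; use a real one from the MM Grounding DINO collection.
model_id = "openmmlab-community/mm_grounding_dino_tiny_o365v1_goldg"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Text queries: lowercase phrases, each ending with a period, as with Grounding DINO.
text = "a cat. a remote control."

inputs = processor(images=image, text=text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

results = processor.post_process_grounded_object_detection(
    outputs,
    input_ids=inputs.input_ids,
    threshold=0.4,
    target_sizes=[(image.height, image.width)],
)[0]

# Boxes are in (x_min, y_min, x_max, y_max) pixel coordinates.
print(results["scores"])
print(results["boxes"])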
Bugfixes and improvements

- FastSpeech2Conformer by @bvantuan in #39689
- classmethod by @zucchini-nlp in #38812
- [CI] Add Eric to comment slow ci by @vasqu in #39601
- QAPipelineTests::test_large_model_course after #39193 by @ydshieh in #39666
- Glm4MoeModelTest::test_torch_compile_for_training by @ydshieh in #39670
- Qwen2AudioForConditionalGeneration.forward() and test_flash_attn_kernels_inference_equivalence by @ebezzam in #39503
- models/__init__.py for typo checking by @hebangwen in #39745
- GemmaIntegrationTest::test_model_2b_bf16_dola again by @ydshieh in #39731
- --gpus all in workflow files by @ydshieh in #39752
- libcst to extras["testing"] in setup.py by @ydshieh in #39761
- main_classes/peft.md by @luckyvickyricky in #39515
- tvp.md to Korean by @Kim-Ju-won in #39578
- tokenizer.md to Korean by @seopp in #39532
- pipeline_gradio.md to Korean by @AhnJoonSung in #39520
- perf_train_gpu_one.md to Korean by @D15M4S in #39552
- how_to_hack_models.md to Korean by @skwh54 in #39536
- run_name when none by @qgallouedec in #39695
- model_results.json by @ydshieh in #39783
- [attn_implementation] remove recursive, allows custom kernels with wrappers by @ArthurZucker in #39823
- plot_keypoint_matching, make visualize_keypoint_matching as a standard by @sbucaille in #39830
- TrackioCallback to work when pynvml is not installed by @qgallouedec in #39851
- is_wandb_available function to verify WandB installation by @qgallouedec in #39875
- sub_configs by @qubvel in #39855
- Tokenizer with PreTrainedTokenizerFast in ContinuousBatchProcessor by @qgallouedec in #39858
- torch.backends.cudnn.allow_tf32 = False for CI by @ydshieh in #39885
- AutoModelForCausalLM and AutoModelForImageTextToText by @qubvel in #39881
- ModernBertForMultipleChoice by @netique in #39232
- [Exaone4] Fixes the attn implementation! by @ArthurZucker in #39906

The following contributors have made significant changes to the library over the last release: