How to run Google's new Gemma 3n locally with Dynamic GGUFs on llama.cpp, Ollama, Open WebUI and how to fine-tune with Unsloth!
Google’s Gemma 3n multimodal model handles image, audio, video, and text inputs. Available in 2B and 4B sizes, it supports 140 languages for text and multimodal tasks. You can now run and fine-tune Gemma-3n-E4B and E2B locally using Unsloth.
Fine-tune Gemma 3n with our free Colab notebook
Gemma 3n has a 32K context length and supports 30s audio input, OCR, automatic speech recognition (ASR), and speech translation via prompts.
Running Tutorial | Fine-tuning Tutorial | Fixes + Technical Analysis
Unsloth Gemma 3n (Instruct) uploads with optimal configs:
Dynamic 2.0 GGUF (text only)
Dynamic 4-bit Instruct (to fine-tune)
See all our Gemma 3n uploads, including base models and more formats, in our collection here.
Currently, Gemma 3n is only supported for text inference.
We’ve fixed issues where our GGUFs did not work properly in Ollama (other engines were unaffected). Please redownload them if you use Ollama.
⚙️ Official Recommended Settings
According to the Gemma team, the official recommended settings for inference are:
temperature = 1.0, top_k = 64, top_p = 0.95, min_p = 0.0
Min_P of 0.00 (optional, but 0.01 works well; llama.cpp's default is 0.1)
Repetition penalty of 1.0 (1.0 means disabled in llama.cpp and transformers)
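If you run Gemma 3n through Hugging Face transformers instead of a GGUF, a minimal sketch of passing these settings to generate might look like the below. This assumes a recent transformers version with Gemma 3n and min_p support, and that the text-only model loads via AutoModelForCausalLM - swap in the appropriate class for your setup if it does not.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "unsloth/gemma-3n-E4B-it"  # adjust to the checkpoint you use
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "What is 1+1?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# The Gemma team's recommended sampling settings
outputs = model.generate(
    inputs,
    do_sample=True,
    temperature=1.0,
    top_k=64,
    top_p=0.95,
    min_p=0.0,
    repetition_penalty=1.0,
    max_new_tokens=256,
)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))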
Chat template:
<bos><start_of_turn>user\nHello!<end_of_turn>\n<start_of_turn>model\nHey there!<end_of_turn>\n<start_of_turn>user\nWhat is 1+1?<end_of_turn>\n<start_of_turn>model\n
Chat template with \n newlines rendered (except for the last):
<bos><start_of_turn>user
Hello!<end_of_turn>
<start_of_turn>model
Hey there!<end_of_turn>
<start_of_turn>user
What is 1+1?<end_of_turn>
<start_of_turn>model\n
llama.cpp and other inference engines auto-add a <bos> - DO NOT add TWO <bos> tokens! You should omit the <bos> when prompting the model!
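If you build prompts with the Hugging Face tokenizer instead of by hand, apply_chat_template renders the template above for you (a sketch assuming the unsloth/gemma-3n-E4B-it tokenizer). Depending on the template, the rendered string may already start with <bos>, so keep the warning about double <bos> tokens in mind.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("unsloth/gemma-3n-E4B-it")

messages = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hey there!"},
    {"role": "user", "content": "What is 1+1?"},
]

# Renders the <start_of_turn>/<end_of_turn> markers and appends the final
# "<start_of_turn>model\n" so the model knows it should respond next.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)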
🦙 Tutorial: How to Run Gemma 3n in Ollama
Please re-download the Gemma 3n quants or remove the old ones via Ollama, since there are some bug fixes. You can run the commands below to delete the old file and refresh it:
ollama rm hf.co/unsloth/gemma-3n-E4B-it-GGUF:UD-Q4_K_XL
ollama run hf.co/unsloth/gemma-3n-E4B-it-GGUF:UD-Q4_K_XL
Install ollama if you haven't already!
apt-get update
apt-get install pciutils -y
curl -fsSL https://ollama.com/install.sh | sh
Run the model! Note you can call ollama serve in another terminal if it fails! We include all our fixes and suggested parameters (temperature etc.) in the params file of our Hugging Face upload!
ollama run hf.co/unsloth/gemma-3n-E4B-it-GGUF:UD-Q4_K_XL
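Once the model is running, you can also query Ollama's local REST API instead of the interactive CLI. A minimal sketch, assuming Ollama is serving on its default port 11434:

import requests

# The model tag matches the `ollama run` command above; the prompt is just an example.
response = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "hf.co/unsloth/gemma-3n-E4B-it-GGUF:UD-Q4_K_XL",
        "messages": [{"role": "user", "content": "What is 1+1?"}],
        "stream": False,
    },
)
print(response.json()["message"]["content"])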
📖 Tutorial: How to Run Gemma 3n in llama.cpp
Obtain the latest llama.cpp from GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggerganov/llama.cpp
cmake llama.cpp -B llama.cpp/build \
-DBUILD_SHARED_LIBS=ON -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split llama-mtmd-cli
cp llama.cpp/build/bin/llama-* llama.cpp
If you want to use llama.cpp directly to load models, you can do the below (:UD-Q4_K_XL is the quantization type). You can also download the model via Hugging Face first (see the snippet below). This is similar to ollama run:
./llama.cpp/llama-cli -hf unsloth/gemma-3n-E4B-it-GGUF:UD-Q4_K_XL -ngl 99 --jinja
OR download the model via the snippet below (after installing the required packages with pip install huggingface_hub hf_transfer). You can choose Q4_K_M or other quantized versions (or BF16 for full precision).
# !pip install huggingface_hub hf_transfer
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id = "unsloth/gemma-3n-E4B-it-GGUF",
    local_dir = "unsloth/gemma-3n-E4B-it-GGUF",
    allow_patterns = ["*UD-Q4_K_XL*", "mmproj-BF16.gguf"], # For Q4_K_XL
)
Edit --threads 32 for the number of CPU threads, --ctx-size 32768 for the context length (Gemma 3n supports a 32K context length!), and --n-gpu-layers 99 for how many layers to offload to the GPU. Try lowering it if your GPU runs out of memory, and remove it entirely for CPU-only inference.
./llama.cpp/llama-cli \
--model unsloth/gemma-3n-E4B-it-GGUF/gemma-3n-E4B-it-UD-Q4_K_XL.gguf \
--ctx-size 32768 \
--n-gpu-layers 99 \
--seed 3407 \
--prio 2 \
--temp 1.0 \
--repeat-penalty 1.0 \
--min-p 0.00 \
--top-k 64 \
--top-p 0.95
For non-conversation mode, to test Flappy Bird:
./llama.cpp/llama-cli \
--model unsloth/gemma-3n-E4B-it-GGUF/gemma-3n-E4B-it-UD-Q4_K_XL.gguf \
--ctx-size 32768 \
--n-gpu-layers 99 \
--seed 3407 \
--prio 2 \
--temp 1.0 \
--repeat-penalty 1.0 \
--min-p 0.00 \
--top-k 64 \
--top-p 0.95 \
-no-cnv \
--prompt "<start_of_turn>user\nCreate a Flappy Bird game in Python. You must include these things:\n1. You must use pygame.\n2. The background color should be randomly chosen and is a light shade. Start with a light blue color.\n3. Pressing SPACE multiple times will accelerate the bird.\n4. The bird's shape should be randomly chosen as a square, circle or triangle. The color should be randomly chosen as a dark color.\n5. Place on the bottom some land colored as dark brown or yellow chosen randomly.\n6. Make a score shown on the top right side. Increment if you pass pipes and don't hit them.\n7. Make randomly spaced pipes with enough space. Color them randomly as dark green or light brown or a dark gray shade.\n8. When you lose, show the best score. Make the text inside the screen. Pressing q or Esc will quit the game. Restarting is pressing SPACE again.\nThe final game should be inside a markdown section in Python. Check your code for errors and fix them before the final markdown section.<end_of_turn>\n<start_of_turn>model\n"
Remember to remove <bos> from your prompt since Gemma 3n auto-adds a <bos>!
🦥 Fine-tuning Gemma 3n with Unsloth
Gemma 3n, like Gemma 3, had issues running on Float16 GPUs such as Tesla T4s in Colab. You will encounter NaNs and infinities if you do not patch Gemma 3n for inference or fine-tuning. More information below.
We also found that because Gemma 3n's unique architecture reuses hidden states in the vision encoder, it poses another interesting quirk with gradient checkpointing, described below.
Unsloth is the only framework that works on float16 GPUs for Gemma 3n inference and training. This means Colab notebooks with free Tesla T4 GPUs also work! Overall, Unsloth makes Gemma 3n training 1.5x faster with 50% less VRAM and 4x longer context lengths.
Our free Gemma 3n Colab notebooks default to fine-tuning the text layers. If you want to fine-tune the vision or audio layers too, be aware this will require much more VRAM - beyond the 15GB that free Colab or Kaggle provides. You can still fine-tune all layers, including audio and vision, and Unsloth also lets you fine-tune only specific areas, like just vision. Simply adjust as needed:
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers = False, # False if not finetuning vision layers
    finetune_language_layers = True, # False if not finetuning language layers
    finetune_attention_modules = True, # False if not finetuning attention layers
    finetune_mlp_modules = True, # False if not finetuning MLP layers
)
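For context, the model above is loaded with FastVisionModel before get_peft_model is called. A minimal sketch following our notebook pattern (assuming the unsloth/gemma-3n-E4B-it checkpoint and 4-bit loading; see the Colab notebook for the exact, up-to-date arguments):

from unsloth import FastVisionModel

model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/gemma-3n-E4B-it",
    load_in_4bit = True, # 4-bit QLoRA to fit in free Colab VRAM
    use_gradient_checkpointing = "unsloth", # see the gradient checkpointing note below
)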
We also heard you guys wanted a Vision notebook for Gemma 3 (4B), so here it is:
If you love Kaggle, Google is holding a competition where the best model fine-tuned with Gemma 3n and Unsloth will win a $10K prize! See more here .
Thanks to discussions with Michael from the Ollama team and Xuan from Hugging Face, there were 2 issues we had to fix specifically for GGUFs:
The add_shared_kv_layers parameter was accidentally encoded in float32, which is fine, but becomes slightly complicated to decode on Ollama's side - a simple change to uint32 solves the issue. Pull request addressing this issue.
The per_layer_token_embd layer should be Q8_0 in precision. Anything lower does not function properly and errors out in the Ollama engine, so to reduce issues for our community, we made this layer Q8_0 in all quants - unfortunately, this does use more space.
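If you want to double-check both fixes in a downloaded GGUF yourself, a rough sketch using the gguf Python package is below. The field and tensor name filters are our expectation of how they appear in the file, so treat them as assumptions and adjust if needed.

from gguf import GGUFReader

reader = GGUFReader("gemma-3n-E4B-it-UD-Q4_K_XL.gguf")

# Print the metadata field(s) related to add_shared_kv_layers and their encoded types.
for name, field in reader.fields.items():
    if "shared_kv" in name:
        print(name, field.types)

# Print the quantization type of the per-layer token embedding tensor.
for tensor in reader.tensors:
    if "per_layer_token_embd" in tensor.name:
        print(tensor.name, tensor.tensor_type)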
Gemma 3n, just like Gemma 3, has issues on FP16 GPUs (e.g., Tesla T4s in Colab).
Our previous fixes for Gemma 3 are discussed here. For Gemma 3, we found that activations exceed float16's maximum range of 65504.
Gemma 3n does not have this activation issue, but we still managed to encounter infinities!
To get to the bottom of these infinities, we plotted the absolute maximum weight entries for Gemma 3n and saw the following: the green crosses are the Conv2D weights, whose magnitudes are much larger on average.
Below is a list of Conv2D weights with large magnitudes. Our hypothesis is that during a Conv2D operation, large weights multiply and sum together and, by chance, exceed float16's maximum range of 65504. Bfloat16 is fine, since its maximum range is about 10^38 (you can verify both limits with the quick check after the list).
msfa.ffn.pw_proj.conv.weight
blocks.2.21.attn.key.down_conv.weight
blocks.2.32.pw_exp.conv.weight
blocks.2.30.pw_exp.conv.weight
blocks.2.34.pw_exp.conv.weight
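You can verify both dtype limits directly in PyTorch with a quick check:

import torch

print(torch.finfo(torch.float16).max)   # 65504.0 - easily exceeded by the Conv2D sums
print(torch.finfo(torch.bfloat16).max)  # ~3.39e38 - effectively never exceeded here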
The naive solution is to upcast all Conv2D weights to float32 (if bfloat16 isn't available), but that would increase VRAM usage. To tackle this, we instead use autocast to upcast the weights and inputs to float32 on the fly, so the accumulation is performed in float32 as part of the matrix multiplication itself, without having to permanently store the weights in float32.
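As a rough illustration of the idea (not Unsloth's actual patch), a Conv2D call can be wrapped so the multiply-accumulate runs in float32 and the result is cast back, while the stored weights keep their original dtype:

import torch
import torch.nn.functional as F

def conv2d_fp32_accum(conv: torch.nn.Conv2d, x: torch.Tensor) -> torch.Tensor:
    # Upcast inputs and weights on the fly so the accumulation happens in float32,
    # avoiding float16's 65504 ceiling, then cast the output back to the input dtype.
    out = F.conv2d(
        x.float(),
        conv.weight.float(),
        None if conv.bias is None else conv.bias.float(),
        stride=conv.stride,
        padding=conv.padding,
        dilation=conv.dilation,
        groups=conv.groups,
    )
    return out.to(x.dtype)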
Unsloth is the only framework that enables Gemma 3n inference and training on float16 GPUs, so Colab Notebooks with free Tesla T4s work!
🏁 Gradient Checkpointing issues
We found Gemma 3n's vision encoder to be quite unique as well, since it re-uses hidden states. This unfortunately limits the usage of Unsloth's gradient checkpointing, which could have reduced VRAM usage significantly, since it cannot be applied to the vision encoder.
However, we still managed to leverage Unsloth's automatic compiler to optimize Gemma 3N!
🌵 Large losses during finetuning
We also found that losses are interestingly very large at the start of finetuning (in the range of 6 to 7), but they decrease quickly over time. We theorize there are 2 possible reasons:
There might be some implementation issue, but this is unlikely since inference seems to work.
Multi-modal models always seem to exhibit this behavior - we found Llama 3.2 Vision's loss starts at 3 or 4, Pixtral's at around 8, and Qwen 2.5 VL's at around 4. Because Gemma 3n includes audio as well, that might amplify the starting loss, but this is just a hypothesis. We also found that quantizing Qwen 2.5 VL 72B Instruct yields extremely high perplexity scores of around 30, yet the model interestingly performs fine.
So what is so special about Gemma 3n, you ask? It is based on the Matryoshka Transformer (MatFormer) architecture, meaning that each transformer layer/block embeds/nests FFNs of progressively smaller sizes - think of progressively smaller cups nested inside one another. Training is done so that at inference time you can choose the size you want and get most of the performance of the bigger model.
There is also Per-Layer Embedding, which can be cached to reduce memory usage at inference time. So the 2B model (E2B) is a sub-network inside the 4B (really 5.44B) model, obtained by combining Per-Layer Embedding caching with skipping the audio and vision components to focus solely on text.
The MatFormer architecture is typically trained with exponentially spaced sub-models, i.e. of sizes S, S/2, S/4, S/8, etc. in each of the layers. At training time, inputs are randomly forwarded through one of these sub-blocks, giving every sub-block an equal chance to learn. The advantage is that at inference time, if you want the model to be 1/4 of the original size, you can pick the S/4-sized sub-blocks in each layer.
You can also choose to mix and match, where you pick, say, the S/4-sized sub-block of one layer, the S/2-sized sub-block of another layer, and the S/8-sized sub-block of yet another. In fact, you can change which sub-models you pick based on the input itself if you fancy so. Basically, it's a choose-your-own structure at every layer. So by training a model of one particular size, you create exponentially many models of smaller sizes. No learning goes to waste. Pretty neat, huh?
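As a toy illustration of the nested-FFN idea (not Gemma 3n's actual implementation), here is a sketch of a feed-forward block whose hidden width can be sliced to S, S/2, S/4, ... at inference time:

import torch
import torch.nn as nn

class MatFormerFFN(nn.Module):
    def __init__(self, d_model=512, d_hidden=2048):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)
        self.down = nn.Linear(d_hidden, d_model)

    def forward(self, x, fraction=1.0):
        # The smaller sub-FFNs are nested inside the big one, so picking a
        # fraction just means slicing the first h rows/columns of the weights.
        h = int(self.up.out_features * fraction)  # fraction=0.25 -> the S/4 sub-block
        hidden = torch.relu(x @ self.up.weight[:h].T + self.up.bias[:h])
        return hidden @ self.down.weight[:, :h].T + self.down.bias

x = torch.randn(1, 512)
ffn = MatFormerFFN()
full = ffn(x, fraction=1.0)      # the full S-sized FFN
quarter = ffn(x, fraction=0.25)  # the nested S/4 sub-FFN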