OpenBMB/MiniCPM: MiniCPM4: Ultra-Efficient LLMs on End Devices, achieving a 5x+ speedup on typical end-side chips

MiniCPM Paper | MiniCPM Technical Blog | MiniCPM Knowledge Base | MiniCPM-V Repository | Join our Discord and WeChat Group | Join Us

demo.mp4

Note: more model versions are available here.

MiniCPM 4 is an extremely efficient end-side large language model. It has been optimized at four levels, model architecture, learning algorithms, training data, and inference system, achieving substantial efficiency gains.

On two typical end-side chips, Jetson AGX Orin and RTX 4090, MiniCPM4 processes long texts far faster than models of the same size, and its advantage grows as the text gets longer. On Jetson AGX Orin, MiniCPM4 achieves roughly a 7x generation speedup over Qwen3-8B.

MiniCPM4 is released in two end-side sizes, 8B and 0.5B parameters, each achieving the best performance among models of its scale.

MiniCPM4 is pre-trained with a 32K context length and extended to longer contexts via YaRN. It shows excellent performance on the 128K needle-in-a-haystack task.

BitCPM4 is a ternary quantized model obtained by quantization-aware training (QAT) on the MiniCPM series, achieving effective improvements in both training efficiency and parameter efficiency.

In our tests, BitCPM4 matches mainstream full-precision models of comparable size.

The open-sourced BitCPM4 parameters are provided in fake-quantized form and can be used for inference directly with the Hugging Face Transformers framework, as sketched below.
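For example, a BitCPM4 checkpoint loads like any other Transformers model. A minimal sketch, assuming the Hugging Face model id openbmb/BitCPM4-0.5B (check the model collection linked above for the exact released ids):

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Model id is an assumption; see the MiniCPM model collection for the released BitCPM4 checkpoints.
path = "openbmb/BitCPM4-0.5B"
tokenizer = AutoTokenizer.from_pretrained(path)
# The fake-quantized weights are stored as ordinary floating-point tensors,
# so the standard from_pretrained loading path works unchanged.
model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.bfloat16, device_map="cuda", trust_remote_code=True)

messages = [{"role": "user", "content": "Explain ternary quantization in one paragraph."}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))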

MiniCPM4-Survey is an open-source LLM agent jointly developed by THUNLP, Renmin University of China, and ModelBest. Built on the MiniCPM4-8B base model, it takes a user query as input and autonomously generates trustworthy, long-form survey papers. Key features include:

See here for details.

| Method | Relevance | Coverage | Depth | Novelty | Avg. | Fact Score |
|--------|-----------|----------|-------|---------|------|------------|
| Naive RAG (driven by G2FT) | 3.25 | 2.95 | 3.35 | 2.60 | 3.04 | 43.68 |
| AutoSurvey (driven by G2FT) | 3.10 | 3.25 | 3.15 | 3.15 | 3.16 | 46.56 |
| Webthinker (driven by WTR1-7B) | 3.30 | 3.00 | 2.75 | 2.50 | 2.89 | -- |
| Webthinker (driven by QwQ-32B) | 3.40 | 3.30 | 3.30 | 2.50 | 3.13 | -- |
| OpenAI Deep Research (driven by GPT-4o) | 3.50 | 3.95 | 3.55 | 3.00 | 3.50 | -- |
| MiniCPM4-Survey | 3.45 | 3.70 | 3.85 | 3.00 | 3.50 | 68.73 |
| MiniCPM4-Survey (w/o RL) | 3.55 | 3.35 | 3.30 | 2.25 | 3.11 | 50.24 |

Performance comparison of survey generation systems, evaluated by GPT-4o. "G2FT" stands for Gemini-2.0-Flash-Thinking and "WTR1-7B" for Webthinker-R1-7B. FactScore evaluation is omitted for Webthinker, which does not include a citation function, and for OpenAI Deep Research, which does not provide citations when exporting results. Evaluation details are given in our technical report.

MiniCPM4-MCP is an open-source on-device LLM agent jointly developed by the Tsinghua University Natural Language Processing Lab (THUNLP), Renmin University of China, and ModelBest. Built on MiniCPM4-8B with 8 billion parameters, it can interact with a variety of tools and data resources through the MCP protocol to solve a wide range of real-world tasks. To date, MiniCPM4-MCP supports:

See here for details.

| MCP Server | gpt-4o Func. Name Acc. | gpt-4o Param. Name Acc. | gpt-4o Value Acc. | qwen3 Func. Name Acc. | qwen3 Param. Name Acc. | qwen3 Value Acc. | minicpm4 Func. Name Acc. | minicpm4 Param. Name Acc. | minicpm4 Value Acc. |
|---|---|---|---|---|---|---|---|---|---|
| Airbnb | 89.3 | 67.9 | 53.6 | 92.8 | 60.7 | 50.0 | 96.4 | 67.9 | 50.0 |
| Amap-Maps | 79.8 | 77.5 | 50.0 | 74.4 | 72.0 | 41.0 | 89.3 | 85.7 | 39.9 |
| Arxiv-MCP-Server | 85.7 | 85.7 | 85.7 | 81.8 | 54.5 | 50.0 | 57.1 | 57.1 | 52.4 |
| Calculator | 100.0 | 100.0 | 20.0 | 80.0 | 80.0 | 13.3 | 100.0 | 100.0 | 6.67 |
| Computor-Control-MCP | 90.0 | 90.0 | 90.0 | 90.0 | 90.0 | 90.0 | 90.0 | 90.0 | 86.7 |
| Desktop-Commander | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| Filesystem | 63.5 | 63.5 | 31.3 | 69.7 | 69.7 | 26.0 | 83.3 | 83.3 | 42.7 |
| Github | 92.0 | 80.0 | 58.0 | 80.5 | 50.0 | 27.7 | 62.8 | 25.7 | 17.1 |
| Gaode | 71.1 | 55.6 | 17.8 | 68.8 | 46.6 | 24.4 | 68.9 | 46.7 | 15.6 |
| MCP-Code-Executor | 85.0 | 80.0 | 70.0 | 80.0 | 80.0 | 70.0 | 90.0 | 90.0 | 65.0 |
| MCP-Docx | 95.8 | 86.7 | 67.1 | 94.9 | 81.6 | 60.1 | 95.1 | 86.6 | 76.1 |
| PPT | 72.6 | 49.8 | 40.9 | 85.9 | 50.7 | 37.5 | 91.2 | 72.1 | 56.7 |
| PPTx | 64.2 | 53.7 | 13.4 | 91.0 | 68.6 | 20.9 | 91.0 | 58.2 | 26.9 |
| Simple-Time-Server | 90.0 | 70.0 | 70.0 | 90.0 | 90.0 | 90.0 | 90.0 | 60.0 | 60.0 |
| Slack | 100.0 | 90.0 | 70.0 | 100.0 | 100.0 | 65.0 | 100.0 | 100.0 | 100.0 |
| Whisper | 90.0 | 90.0 | 90.0 | 90.0 | 90.0 | 90.0 | 90.0 | 90.0 | 30.0 |
| Average | 80.2 | 70.2 | 49.1 | 83.5 | 67.7 | 43.8 | 88.3 | 76.1 | 51.2 |

MiniCPM Intel AIPC Client: an on-device LLM client

MiniCPM Intel AIPC Client is an on-device LLM client jointly launched by ModelBest and Intel. Designed for devices with Intel Core Ultra series processors, it aims to give developers, researchers, and AI enthusiasts a low-latency, efficient, and privacy-preserving local LLM experience. Its core features are as follows:

System requirements:

Application download:

Download link

We recommend CPM.cu for MiniCPM4 inference. CPM.cu is a CUDA inference framework developed by ModelBest that combines efficient sparsity, speculative sampling, and quantization, and fully exploits the efficiency advantages of MiniCPM4.

You can install CPM.cu and run inference with the following script:

git clone https://github.com/OpenBMB/CPM.cu.git --recursive
cd CPM.cu
python3 setup.py install

You can run inference and check the model's generation speed with the following commands:

python3 tests/long_prompt_gen.py # generate prompt.txt
python3 tests/test_generate.py --prompt-file prompt.txt

For more details about CPM.cu, please refer to the CPM.cu repository.

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
torch.manual_seed(0)

path = 'openbmb/MiniCPM4-8B'
device = "cuda"
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.bfloat16, device_map=device, trust_remote_code=True)

# User can directly use the chat interface
# responds, history = model.chat(tokenizer, "Write an article about Artificial Intelligence.", temperature=0.7, top_p=0.7)
# print(responds)

# User can also use the generate interface
messages = [
    {"role": "user", "content": "Write an article about Artificial Intelligence."},
]
prompt_text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([prompt_text], return_tensors="pt").to(device)

model_outputs = model.generate(
    **model_inputs,
    max_new_tokens=1024,
    top_p=0.7,
    temperature=0.7
)
output_token_ids = [
    model_outputs[i][len(model_inputs['input_ids'][i]):] for i in range(len(model_inputs['input_ids']))
]

responses = tokenizer.batch_decode(output_token_ids, skip_special_tokens=True)[0]
print(responses)

The model supports InfLLM v2, a sparse attention mechanism for efficient long-sequence inference. To enable it, first install the dependency infllmv2_cuda_impl.

Install it by running the following commands:

git clone -b feature_infer https://github.com/OpenBMB/infllmv2_cuda_impl.git
cd infllmv2_cuda_impl
git submodule update --init --recursive
pip install -e . # or python setup.py install 

To enable InfLLM v2, add a sparse_config field to the config.json file:

{
    ...,
    "sparse_config": {
        "kernel_size": 32,
        "kernel_stride": 16,
        "init_blocks": 1,
        "block_size": 64,
        "window_size": 2048,
        "topk": 64,
        "use_nope": false,
        "dense_len": 8192
    }
}

These parameters control the behavior of InfLLM v2.

MiniCPM4 natively supports context lengths of up to 32,768 tokens. If the total length of a conversation (input plus output) far exceeds this limit, we recommend extending the context with RoPE scaling. We have verified that, with adjusted LongRoPE factors, the model reliably supports contexts of up to 131,072 tokens.

To apply this, adjust the rope_scaling field in the config.json file:

{
    ...,
    "rope_scaling": {
        "rope_type": "longrope", 
        "long_factor": [0.9977997200264581, 1.014658295992452, 1.0349680404997148, 1.059429246056193, 1.0888815016813513, 1.1243301355211495, 1.166977103606075, 1.2182568066927284, 1.2798772354275727, 1.3538666751582975, 1.4426259039919596, 1.5489853358570191, 1.6762658237220625, 1.8283407612492941, 2.0096956085876183, 2.225478927469756, 2.481536379650452, 2.784415934557119, 3.1413289096347365, 3.560047844772632, 4.048719380066383, 4.752651957515948, 5.590913044973868, 6.584005926629993, 7.7532214876576155, 9.119754865903639, 10.704443927019176, 12.524994176518703, 14.59739595363613, 16.93214476166354, 19.53823297353041, 22.417131025031697, 25.568260840911098, 28.991144156566317, 32.68408069090375, 36.65174474170465, 40.90396065611201, 45.4664008671033, 50.37147343433591, 55.6804490772103, 61.470816952306556, 67.8622707390618, 75.00516023410414, 83.11898235973767, 92.50044360202462, 103.57086856690864, 116.9492274587385, 118.16074567836519, 119.18497548708795, 120.04810876261652, 120.77352815196981, 121.38182790207875, 121.89094985353891, 122.31638758099915, 122.6714244963338, 122.9673822552567, 123.21386397019609, 123.41898278254268, 123.58957065488238, 123.73136519024158, 123.84917421274221, 123.94701903496814, 124.02825801299717, 124.09569231686116],
        "short_factor": [0.9977997200264581, 1.014658295992452, 1.0349680404997148, 1.059429246056193, 1.0888815016813513, 1.1243301355211495, 1.166977103606075, 1.2182568066927284, 1.2798772354275727, 1.3538666751582975, 1.4426259039919596, 1.5489853358570191, 1.6762658237220625, 1.8283407612492941, 2.0096956085876183, 2.225478927469756, 2.481536379650452, 2.784415934557119, 3.1413289096347365, 3.560047844772632, 4.048719380066383, 4.752651957515948, 5.590913044973868, 6.584005926629993, 7.7532214876576155, 9.119754865903639, 10.704443927019176, 12.524994176518703, 14.59739595363613, 16.93214476166354, 19.53823297353041, 22.417131025031697, 25.568260840911098, 28.991144156566317, 32.68408069090375, 36.65174474170465, 40.90396065611201, 45.4664008671033, 50.37147343433591, 55.6804490772103, 61.470816952306556, 67.8622707390618, 75.00516023410414, 83.11898235973767, 92.50044360202462, 103.57086856690864, 116.9492274587385, 118.16074567836519, 119.18497548708795, 120.04810876261652, 120.77352815196981, 121.38182790207875, 121.89094985353891, 122.31638758099915, 122.6714244963338, 122.9673822552567, 123.21386397019609, 123.41898278254268, 123.58957065488238, 123.73136519024158, 123.84917421274221, 123.94701903496814, 124.02825801299717, 124.09569231686116],
        "original_max_position_embeddings": 32768
    }
}

Install the latest version of vLLM by referring to the official vLLM repository, then run offline inference with the Python API shown below:

pip install -U vllm \
    --pre \
    --extra-index-url https://wheels.vllm.ai/nightly
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_name = "openbmb/MiniCPM4-8B"
prompt = [{"role": "user", "content": "推荐5个北京的景点。"}]

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
input_text = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)

llm = LLM(
    model=model_name,
    trust_remote_code=True,
    max_num_batched_tokens=32768, 
    dtype="bfloat16", 
    gpu_memory_utilization=0.8, 
)
sampling_params = SamplingParams(top_p=0.7, temperature=0.7, max_tokens=1024, repetition_penalty=1.02)

outputs = llm.generate(prompts=input_text, sampling_params=sampling_params)

print(outputs[0].outputs[0].text)
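# Alternatively, enable speculative decoding with the MiniCPM4 Eagle draft model: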
llm = LLM(
    model=model_name,
    trust_remote_code=True,
    max_num_batched_tokens=32768, 
    dtype="bfloat16", 
    gpu_memory_utilization=0.8, 
    speculative_config={
        "method": "eagle",
        "model": "openbmb/MiniCPM4-8B-Eagle-vLLM",
        "num_speculative_tokens": 2,
        "max_model_len": 32768,
    },
)
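# Or load the Marlin-quantized checkpoint: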
llm = LLM(
    model="openbmb/MiniCPM4-8B-marlin-vLLM",
    trust_remote_code=True,
    max_num_batched_tokens=32768, 
    dtype="bfloat16", 
    gpu_memory_utilization=0.8, 
)
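# Or combine the Marlin-quantized checkpoint with Eagle speculative decoding: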
llm = LLM(
    model="openbmb/MiniCPM4-8B-marlin-vLLM",
    trust_remote_code=True,
    max_num_batched_tokens=32768,
    dtype="bfloat16",
    gpu_memory_utilization=0.8,
    speculative_config={
        "method": "eagle",
        "model": "openbmb/MiniCPM4-8B-marlin-Eagle-vLLM",
        "num_speculative_tokens": 2,
        "max_model_len": 32768,
    },
)

Note: if you use the OpenAI-compatible server provided by vLLM, the chat API sets add_special_tokens to False by default. This drops special tokens such as BOS that are essential for MiniCPM4. To ensure correct behavior, explicitly set extra_body={"add_special_tokens": True} in the API call, as shown below:

import openai

client = openai.Client(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="openbmb/MiniCPM4-8B",
    messages=[
        {"role": "user", "content": "Write an article about Artificial Intelligence."},
    ],
    temperature=0.7,
    max_tokens=1024,
    extra_body={"add_special_tokens": True},  # ensure special tokens such as BOS are added
)

print(response.choices[0].message.content)

Install SGLang from source by referring to the official SGLang repository:

git clone -b openbmb https://github.com/OpenBMB/sglang.git
cd sglang

pip install --upgrade pip
pip install -e "python[all]"
python -m sglang.launch_server --model openbmb/MiniCPM4-8B --trust-remote-code --port 30000 --chat-template chatml
import openai

client = openai.Client(base_url=f"http://localhost:30000/v1", api_key="None")

response = client.chat.completions.create(
    model="openbmb/MiniCPM4-8B",
    messages=[
        {"role": "user", "content": "Write an article about Artificial Intelligence."},
    ],
    temperature=0.7,
    max_tokens=1024,
)
print(response.choices[0].message.content)
python3 -m sglang.launch_server --model-path [model] \
    --speculative_draft_model_path [draft_model] \
    --host 0.0.0.0 --trust-remote-code \
    --speculative-algorithm EAGLE --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2 \
    --mem-fraction 0.5

Model fine-tuning currently supports LLaMA-Factory; see LLaMA-Factory Fine-Tuning for usage.

See details of MiniCPM 3.0

MiniCPM 3.0 is a 4B-parameter language model. Compared with MiniCPM 1.0/2.0, it is more fully featured and substantially stronger overall, matching or surpassing many 7B-9B models on most benchmarks.

| Category | Benchmark | Qwen2-7B-Instruct | GLM-4-9B-Chat | Gemma2-9B-it | Llama3.1-8B-Instruct | GPT-3.5-Turbo-0125 | Phi-3.5-mini-Instruct (3.8B) | MiniCPM3-4B |
|---|---|---|---|---|---|---|---|---|
| English | MMLU | 70.5 | 72.4 | 72.6 | 69.4 | 69.2 | 68.4 | 67.2 |
| English | BBH | 64.9 | 76.3 | 65.2 | 67.8 | 70.3 | 68.6 | 70.2 |
| English | MT-Bench | 8.41 | 8.35 | 7.88 | 8.28 | 8.17 | 8.60 | 8.41 |
| English | IFEVAL (Prompt Strict-Acc.) | 51.0 | 64.5 | 71.9 | 71.5 | 58.8 | 49.4 | 68.4 |
| Chinese | CMMLU | 80.9 | 71.5 | 59.5 | 55.8 | 54.5 | 46.9 | 73.3 |
| Chinese | CEVAL | 77.2 | 75.6 | 56.7 | 55.2 | 52.8 | 46.1 | 73.6 |
| Chinese | AlignBench v1.1 | 7.10 | 6.61 | 7.10 | 5.68 | 5.82 | 5.73 | 6.74 |
| Chinese | FollowBench-zh (SSR) | 63.0 | 56.4 | 57.0 | 50.6 | 64.6 | 58.1 | 66.8 |
| Math | MATH | 49.6 | 50.6 | 46.0 | 51.9 | 41.8 | 46.4 | 46.6 |
| Math | GSM8K | 82.3 | 79.6 | 79.7 | 84.5 | 76.4 | 82.7 | 81.1 |
| Math | MathBench | 63.4 | 59.4 | 45.8 | 54.3 | 48.9 | 54.9 | 65.6 |
| Code | HumanEval+ | 70.1 | 67.1 | 61.6 | 62.8 | 66.5 | 68.9 | 68.3 |
| Code | MBPP+ | 57.1 | 62.2 | 64.3 | 55.3 | 71.4 | 55.8 | 63.2 |
| Code | LiveCodeBench v3 | 22.2 | 20.2 | 19.2 | 20.4 | 24.0 | 19.6 | 22.6 |
| Tool Use | BFCL v2 | 71.6 | 70.1 | 19.2 | 73.3 | 75.4 | 48.4 | 76.0 |
| Overall | Average | 65.3 | 65.0 | 57.9 | 60.8 | 61.0 | 57.2 | 66.3 |

We evaluated the model's tool-calling ability on the Berkeley Function Calling Leaderboard (BFCL). MiniCPM3-4B outperforms several 7B-9B models on this leaderboard and beats GPT-3.5-Turbo-0125.

| Model | Overall Accuracy | AST Summary | Exec Summary | Irrelevance Detection | Relevance Detection |
|---|---|---|---|---|---|
| MiniCPM3-4B | 76.03% | 68.55% | 85.54% | 53.71% | 90.24% |
| Llama3.1-8B-Instruct | 73.28% | 64.61% | 86.48% | 43.12% | 85.37% |
| Qwen2-7B-Instruct | 71.61% | 65.71% | 79.57% | 44.70% | 90.24% |
| GLM-4-9B-Chat | 70.08% | 60.69% | 80.02% | 55.02% | 82.93% |
| Phi-3.5-mini-instruct | 48.44% | 38.89% | 54.04% | 46.78% | 65.85% |
| Gemma2-9B-it | 19.18% | 5.41% | 18.50% | 88.88% | 7.32% |

We ran the needle-in-a-haystack test at a 32k context length; the results are shown in the figure below:

We also propose LLMxMapReduce, which uses a divide-and-conquer strategy and can, in principle, process text of unlimited length. We evaluated the model's long-text capability on InfiniteBench; with the LLMxMapReduce framework, MiniCPM3-4B's average score on this benchmark surpasses strong baselines such as GPT-4 and KimiChat.

| Task | Context Length | Qwen2-70b | Kimi-Chat (2024.06) | GPT-4 (from InfiniteBench) | MiniCPM 3.0 x MR | Qwen2-70b x MR | Llama3-70b x MR |
|---|---|---|---|---|---|---|---|
| Math.Find | 87.9k | 59.71% | 18.57% | 60.00% | 83.43% | 54.29% | 91.43% |
| Retrieve.KV | 89.9k | 29.00% | 69.20% | 89.00% | 93.80% | 98.80% | 98.89% |
| En.Dia | 103.6K | 23.00% | 23.00% | 7.50% | 12.50% | 46.50% | 17.50% |
| Code.Debug | 114.7k | 45.43% | 38.32% | 54.31% | 25.63% | 54.82% | 62.94% |
| Retrieve.Number | 122.4k | 100.00% | 97.45% | 100.00% | 99.32% | 100.00% | 99.79% |
| Retrieve.PassKey | 122.4k | 100.00% | 99.32% | 100.00% | 98.81% | 100.00% | 100.00% |
| En.Sum | 171.5K | 31.85% | 29.94% | 14.73% | 25.89% | 32.39% | 30.63% |
| En.MC | 184.4k | 81.66% | 79.91% | 68.12% | 66.38% | 83.84% | 82.10% |
| En.QA | 192.6k | 21.97% | 18.80% | 22.44% | 28.39% | 23.13% | 34.70% |
| Zh.QA | 2068.6k | 21.40% | 19.84% | 25.96% | 23.66% | 19.10% | N/A |
| avg w/o Zh.QA | / | 51.92% | 52.96% | 55.33% | 59.29% | 64.98% | 68.64% |
| avg | / | 48.86% | 49.65% | 52.39% | 55.55% | 60.39% | N/A |
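The divide-and-conquer idea behind LLMxMapReduce can be illustrated with a short sketch: split the long document into chunks that fit the context window, answer the query over each chunk (map), then merge the partial answers (reduce). This is only a schematic illustration, with a placeholder chat() helper standing in for any of the inference backends in this README; it is not the actual LLMxMapReduce implementation.

# Schematic map-reduce over a long document; chat() is a placeholder for any chat-completion call.
def chat(prompt: str) -> str:
    raise NotImplementedError("plug in a MiniCPM3 inference backend (vLLM, SGLang, Transformers, ...)")

def split_into_chunks(text: str, chunk_chars: int = 8000) -> list[str]:
    # Naive fixed-size chunking; the real framework structures chunks more carefully.
    return [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]

def map_reduce_answer(question: str, document: str) -> str:
    # Map: answer the question against every chunk independently.
    partial = [
        chat(f"Excerpt:\n{chunk}\n\nQuestion: {question}\nAnswer using only this excerpt.")
        for chunk in split_into_chunks(document)
    ]
    # Reduce: merge the partial answers into one final answer.
    merged = "\n".join(f"- {p}" for p in partial)
    return chat(f"Partial answers from different parts of a long document:\n{merged}\n\nQuestion: {question}\nCombine them into a single consistent answer.")

For standard inference with Hugging Face Transformers, MiniCPM3-4B can be used as follows: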
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
torch.manual_seed(0)

path = 'openbmb/MiniCPM3-4B'
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.bfloat16, device_map='cuda', trust_remote_code=True)

responds, history = model.chat(tokenizer, "请写一篇关于人工智能的文章,详细介绍人工智能的未来发展和隐患。", temperature=0.7, top_p=0.7)
print(responds)

Install the latest version of SGLang from source by referring to the official SGLang repository:

python -m sglang.launch_server --model openbmb/MiniCPM3-4B --trust-remote-code --port 30000 --chat-template chatml
from sglang import function, system, user, assistant, gen, set_default_backend, RuntimeEndpoint

@function
def multi_turn_question(s, question_1, question_2):
    s += user(question_1)
    s += assistant(gen("answer_1", max_tokens=1024))
    s += user(question_2)
    s += assistant(gen("answer_2", max_tokens=1024))

set_default_backend(RuntimeEndpoint("http://localhost:30000"))

state = multi_turn_question.run(
    question_1="介绍一下人工智能",
    question_2="写一篇关于它的文章",
)

for m in state.messages():
    print(m["role"], ":", m["content"])

We provide a GGUF version of MiniCPM3 that can be used directly with llama.cpp for inference.
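As one option, the GGUF file can also be loaded from Python via the llama-cpp-python bindings. A brief sketch, assuming a locally downloaded GGUF file (the filename below is illustrative and depends on the quantization variant you choose):

from llama_cpp import Llama  # pip install llama-cpp-python

# Path to a locally downloaded MiniCPM3 GGUF file; the filename is an assumption.
llm = Llama(model_path="./minicpm3-4b.Q4_K_M.gguf", n_ctx=4096)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write an article about Artificial Intelligence."}],
    temperature=0.7,
    max_tokens=512,
)
print(out["choices"][0]["message"]["content"])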

Model fine-tuning currently supports LLaMA-Factory; see LLaMA-Factory Fine-Tuning for usage.

For the advanced features below, our example code uses vLLM for inference.

We provide example code for calling tools with MiniCPM3:

cd demo/minicpm3/function_call
python function_call.py

To launch an inference service capable of tool calling, use the following:

cd demo/minicpm3/function_call
pip install -r requirements.txt
python openai_api_server.py \
    --model openbmb/MiniCPM3-4B \
    --served-model-name MiniCPM3-4B \
    --chat-template chatml.jinja \
    --dtype auto \
    --api-key token-abc123 \
    --tensor-parallel-size 1 \
    --trust-remote-code

Below is a demo of answering a question by calling a search tool:

We provide example code for using a code interpreter with MiniCPM3:

cd demo/minicpm3/code_interpreter
pip install -r requirements.txt
python code_interpreter.py openbmb/MiniCPM3-4B

Below is a demo of generating a QR code with the code interpreter:

See details of MiniCPM 2.0

The MiniCPM 2.0 series upgrades MiniCPM along several dimensions and includes the following model versions:

| Model | avg | avg w/o code&math | passkey | number_string | kv_retrieval | longbook_choice_eng | longbook_qa_chn | longbook_qa_eng | longbook_sum_eng | longdialogue_qa_eng | math_calc | math_find | code_debug | code_run |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LWM-Text-128k | 24.45 | 33.62 | 100 | 97.8 | 0.6 | 28.82 | 15.93 | 14.31 | 9.99 | 1.5 | 0 | 3.43 | 20.05 | 1 |
| Yarn-Mistral-7b-128k | 19.84 | 27.36 | 92.71 | | 0 | 27.95 | 15.49 | 9.55 | 9.06 | 7.5 | 0 | 17.14 | 0.76 | 1.25 |
| Mistral-7B-Instruct-v0.2 (ABF 1000w) | 27.75 | 36.9 | 100 | 78.98 | 3.6 | 37.12 | 11.74 | 17.37 | 21.12 | 9.5 | 0 | 29.43 | 17.51 | 0 |
| Yi-6B-200k | 22.15 | 32.54 | 100 | 94.92 | 0 | 36.68 | 15.07 | 9.2 | 0.92 | 3.5 | 0 | 4.29 | 0.51 | 0.75 |
| chatglm3-6b-128k | 25.58 | 36.57 | 89.93 | 99.66 | 5.2 | 46.29 | 10.7 | 8.38 | 25.91 | 6.5 | 0 | 8 | 5.33 | 1 |
| MiniCPM-2.4B-128k | 27.32 | 37.68 | 98.31 | 99.83 | 9 | 29.69 | 23.06 | 16.33 | 15.73 | 9.5 | 0 | 4.29 | 22.08 | 0 |

| Model | BBH | MMLU | CEval | CMMLU | HumanEval | MBPP† | GSM8K | MATH |
|---|---|---|---|---|---|---|---|---|
| Llama2-34B* | 44.1 | 62.6 | - | - | 22.6 | 33.0 | 42.2 | 6.24 |
| Mistral-7B-Instruct-v0.2 | 39.81 | 60.51 | 42.55 | 41.92 | 36.59 | 39.63 | 40.49 | 4.95 |
| Gemma-7B* | 55.1 | 64.3 | - | - | 32.3 | 44.4 | 46.4 | 24.3 |
| Qwen1.5-7B* | 40.2 | 61 | 74.1 | 73.1 | 36 | 37.4 | 62.5 | 20.3 |
| Deepseek-MoE (16B)* | - | 45.0 | 40.6 | 42.5 | 26.8 | 39.2 | 18.8 | 4.3 |
| MiniCPM-2.4B | 36.87 | 53.46 | 51.13 | 51.07 | 50.00 | 35.93 | 53.83 | 10.24 |
| MiniCPM-MoE-8x2B | 39.22 | 58.90 | 58.11 | 58.80 | 55.49 | 41.68 | 61.56 | 10.52 |

Note: * indicates results taken from the corresponding technical report. † indicates evaluation on the full MBPP set.

Other benchmarks: we report the average accuracy on GSM8K (8-shot), MMLU (5-shot), BBH (3-shot), and AGI-Eval (0-shot).

| Setting | Average Sparsity | Average Performance | Code Generation | Commonsense Reasoning | Reading Comprehension | GSM8K | MMLU | BBH | AGI-Eval |
|---|---|---|---|---|---|---|---|---|---|
| LLaMA2-7B | - | 37.96 | 16.37 | 69.59 | 61.87 | 12.96 | 44.45 | 32.96 | 27.53 |
| ReluLLaMA-7B | 66.98 | 37.62 | 15.85 | 69.64 | 70.54 | 5.84 | 38.64 | 35.07 | 27.73 |
| ProSparse-7B* | 88.11 | 38.31 | 19.47 | 66.29 | 63.33 | 12.74 | 45.21 | 33.59 | 27.55 |
| ProSparse-7B | 89.32 | 38.46 | 19.42 | 66.27 | 63.50 | 12.13 | 45.48 | 34.99 | 27.46 |
| LLaMA2-13B | - | 44.06 | 20.19 | 72.58 | 71.55 | 22.21 | 54.69 | 37.89 | 29.33 |
| ReluLLaMA-13B | 71.56 | 42.74 | 20.19 | 70.44 | 73.29 | 18.50 | 50.58 | 37.97 | 28.22 |
| ProSparse-13B* | 87.97 | 45.07 | 29.03 | 69.75 | 67.54 | 25.40 | 54.78 | 40.20 | 28.76 |
| ProSparse-13B | 88.80 | 44.90 | 28.42 | 69.76 | 66.91 | 26.31 | 54.35 | 39.90 | 28.67 |
| MiniCPM-1B | - | 44.44 | 36.85 | 63.67 | 60.90 | 35.48 | 50.44 | 35.03 | 28.71 |
| MiniCPM-S-1B* | 86.25 | 44.72 | 41.38 | 64.55 | 60.69 | 34.72 | 49.36 | 34.04 | 28.27 |
| MiniCPM-S-1B | 87.89 | 44.72 | 42.04 | 64.37 | 60.73 | 34.57 | 49.51 | 34.08 | 27.77 |

Notes:

  1. The download links for ReluLLaMA-7B and ReluLLaMA-13B are 7B and 13B respectively. "ProSparse-7B*", "ProSparse-13B*", and "MiniCPM-S-1B*" denote the ProSparse versions without activation threshold shifting.
  2. For PIQA, SIQA, HellaSwag, WinoGrande, COPA, BoolQ, LAMBADA, TyDi QA, and AGI-Eval, we select answers according to the perplexity (PPL) of each candidate option, as sketched below. For GSM8K, MMLU, and BBH, we generate the answers directly.
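A minimal sketch of that PPL-based option selection, assuming a causal LM loaded with Transformers (the actual evaluation harness formats prompts and normalizes scores more carefully):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def pick_option(model, tokenizer, question: str, options: list[str]) -> str:
    # Score each candidate by the average negative log-likelihood of "question + option"
    # and return the option with the lowest loss (i.e. lowest perplexity).
    losses = []
    for opt in options:
        ids = tokenizer(question + " " + opt, return_tensors="pt").input_ids.to(model.device)
        with torch.no_grad():
            losses.append(model(ids, labels=ids).loss.item())
    return options[losses.index(min(losses))]

# Example usage (hypothetical model path):
# model = AutoModelForCausalLM.from_pretrained("openbmb/MiniCPM-S-1B-sft", torch_dtype=torch.bfloat16, device_map="cuda", trust_remote_code=True)
# tokenizer = AutoTokenizer.from_pretrained("openbmb/MiniCPM-S-1B-sft")
# print(pick_option(model, tokenizer, "The capital of France is", ["Paris.", "Berlin.", "Rome."]))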

参考 MiniCPM 1.0 中的模型推理部分。

For the MiniCPM-S-1B model, PowerInfer can be used to accelerate inference, as follows:

  1. Make sure your cmake version is 3.17 or above; skip this step if it is already installed.
  # Download the source package
  sudo wget https://cmake.org/files/v3.23/cmake-3.23.0.tar.gz
  # Unpack it
  sudo tar -zxvf cmake-3.23.0.tar.gz
  # Configure the build (enter the unpacked directory first)
  cd cmake-3.23.0
  sudo ./configure
  sudo make -j8
  # Compile and install
  sudo make install
  # Check the installed version
  cmake --version
  # If a version number is printed, the installation succeeded
  # cmake version 3.23.0
  2. Install PowerInfer:
  git clone https://github.com/SJTU-IPADS/PowerInfer
  cd PowerInfer
  pip install -r requirements.txt # install Python helpers' dependencies
  3. Build the CPU version of PowerInfer. If your machine has only CPUs, or you only want to run inference on CPU, run:
  cmake -S . -B build
  cmake --build build --config Release
  4. Build the GPU version of PowerInfer. If your machine has GPUs, run:
  cmake -S . -B build -DLLAMA_CUBLAS=ON
  cmake --build build --config Release
  5. Download the sparse model:
  git clone https://huggingface.co/openbmb/MiniCPM-S-1B-sft-gguf
  # or
  git clone https://modelscope.cn/models/OpenBMB/MiniCPM-S-1B-sft-gguf
  6. Run inference:
  cd PowerInfer
  # Command template: output_token_count is the maximum number of output tokens,
  # thread_num is the number of threads, and prompt is the input prompt string.
  # ./build/bin/main -m /PATH/TO/MODEL -n $output_token_count -t $thread_num -p $prompt
  # Example:
  ./build/bin/main -m /root/ld/ld_model_pretrain/1b-s-minicpm/MiniCPM-S-1B-sft.gguf -n 2048 -t 8 -p '<用户>hello,tell me a story please.<AI>'
See details of MiniCPM 1.0

The MiniCPM-2B language model has 2.4 billion (2.4B) non-embedding parameters and 2.7B parameters in total.

Note: to keep the model general for academic research, MiniCPM-2B has not undergone any identity training. Because part of the training data comes from the open-source ShareGPT corpus, the model may output identity information similar to that of the GPT series.

Comparison with larger models:

| Model | Avg. | Avg. (English) | Avg. (Chinese) | C-Eval | CMMLU | MMLU | HumanEval | MBPP | GSM8K | MATH | BBH | ARC-E | ARC-C | HellaSwag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama2-7B | 35.40 | 36.21 | 31.765 | 32.42 | 31.11 | 44.32 | 12.2 | 27.17 | 13.57 | 1.8 | 33.23 | 75.25 | 42.75 | 75.62* |
| Qwen-7B | 49.46 | 47.19 | 59.655 | 58.96 | 60.35 | 57.65 | 17.07 | 42.15 | 41.24 | 5.34 | 37.75 | 83.42 | 64.76 | 75.32* |
| Deepseek-7B | 39.96 | 39.15 | 43.64 | 42.82 | 44.45 | 47.82 | 20.12 | 41.45 | 15.85 | 1.53 | 33.38 | 74.58* | 42.15* | 75.45* |
| Mistral-7B | 48.97 | 49.96 | 44.54 | 46.12 | 42.96 | 62.69 | 27.44 | 45.2 | 33.13 | 5.0 | 41.06 | 83.92 | 70.73 | 80.43* |
| Llama2-13B | 41.48 | 42.44 | 37.19 | 37.32 | 37.06 | 54.71 | 17.07 | 32.55 | 21.15 | 2.25 | 37.92 | 78.87* | 58.19 | 79.23* |
| MPT-30B | 38.17 | 39.82 | 30.72 | 29.34 | 32.09 | 46.56 | 21.95 | 35.36 | 10.31 | 1.56 | 38.22 | 78.66* | 46.08* | 79.72* |
| Falcon-40B | 43.62 | 44.21 | 40.93 | 40.29 | 41.57 | 53.53 | 24.39 | 36.53 | 22.44 | 1.92 | 36.24 | 81.94* | 57.68 | 83.26* |
| MiniCPM-2B | 52.33 | 52.6 | 51.1 | 51.13 | 51.07 | 53.46 | 50.00 | 47.31 | 53.83 | 10.24 | 36.87 | 85.44 | 68.00 | 68.25 |

Comparison with models of similar size:

| Model | Avg. | Avg. (English) | Avg. (Chinese) | C-Eval | CMMLU | MMLU | HumanEval | MBPP | GSM8K | MATH | BBH | ARC-E | ARC-C | HellaSwag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TinyLlama-1.1B | 25.36 | 25.55 | 24.525 | 25.02 | 24.03 | 24.3 | 6.71 | 19.91 | 2.27 | 0.74 | 28.78 | 60.77* | 28.15* | 58.33* |
| Qwen-1.8B | 34.72 | 31.87 | 47.57 | 49.81 | 45.32 | 43.37 | 7.93 | 17.80 | 19.26 | 2.42 | 29.07 | 63.97* | 43.69 | 59.28* |
| Gemini Nano-3B | - | - | - | - | - | - | - | 27.2 (report) | 22.8 (report) | - | 42.4 (report) | - | - | - |
| StableLM-Zephyr-3B | 43.46 | 46.31 | 30.62 | 30.34 | 30.89 | 45.9 | 35.37 | 31.85 | 52.54 | 12.49 | 37.68 | 73.78 | 55.38 | 71.87* |
| Phi-2-2B | 48.84 | 54.41 | 23.78 | 23.37 | 24.18 | 52.66 | 47.56 | 55.04 | 57.16 | 3.5 | 43.39 | 86.11 | 71.25 | 73.07* |
| MiniCPM-2B | 52.33 | 52.6 | 51.10 | 51.13 | 51.07 | 53.46 | 50.00 | 47.31 | 53.83 | 10.24 | 36.87 | 85.44 | 68.00 | 68.25 |

Comparison with chat models:

| Model | Avg. | Avg. (English) | Avg. (Chinese) | C-Eval | CMMLU | MMLU | HumanEval | MBPP | GSM8K | MATH | BBH | ARC-E | ARC-C | HellaSwag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ChatGLM2-6B | 37.98 | 35.17 | 50.63 | 52.05 | 49.21 | 45.77 | 10.37 | 9.38 | 22.74 | 5.96 | 32.6 | 74.45 | 56.82 | 58.48* |
| Mistral-7B-Instruct-v0.1 | 44.36 | 45.89 | 37.51 | 38.06 | 36.96 | 53.56 | 29.27 | 39.34 | 28.73 | 3.48 | 39.52 | 81.61 | 63.99 | 73.47* |
| Mistral-7B-Instruct-v0.2 | 50.91 | 52.83 | 42.235 | 42.55 | 41.92 | 60.51 | 36.59 | 48.95 | 40.49 | 4.95 | 39.81 | 86.28 | 73.38 | 84.55* |
| Qwen-7B-Chat | 44.93 | 42.05 | 57.9 | 58.57 | 57.23 | 56.03 | 15.85 | 40.52 | 42.23 | 8.3 | 37.34 | 64.44* | 39.25* | 74.52* |
| Yi-6B-Chat | 50.46 | 45.89 | 70.995 | 70.88 | 71.11 | 62.95 | 14.02 | 28.34 | 36.54 | 3.88 | 37.43 | 84.89 | 70.39 | 74.6* |
| Baichuan2-7B-Chat | 44.68 | 42.74 | 53.39 | 53.28 | 53.5 | 53 | 21.34 | 32.32 | 25.25 | 6.32 | 37.46 | 79.63 | 60.15 | 69.23* |
| Deepseek-7B-chat | 49.34 | 49.56 | 48.335 | 46.95 | 49.72 | 51.67 | 40.85 | 48.48 | 48.52 | 4.26 | 35.7 | 76.85 | 63.05 | 76.68* |
| Llama2-7B-Chat | 38.16 | 39.17 | 33.59 | 34.54 | 32.64 | 47.64 | 14.02 | 27.4 | 21.15 | 2.08 | 35.54 | 74.28 | 54.78 | 75.65* |
| MiniCPM-2B | 52.33 | 52.6 | 51.10 | 51.13 | 51.07 | 53.46 | 50.00 | 47.31 | 53.83 | 10.24 | 36.87 | 85.44 | 68.00 | 68.25 |

Comparison after DPO:

| Model | MT-bench |
|---|---|
| GPT-4-turbo | 9.32 |
| GPT-3.5-turbo | 8.39 |
| Mistral-8*7b-Instruct-v0.1 | 8.30 |
| Claude-2.1 | 8.18 |
| Zephyr-7B-beta | 7.34 |
| MiniCPM-2B | 7.25 |
| Vicuna-33B | 7.12 |
| Zephyr-7B-alpha | 6.88 |
| LLaMA-2-70B-chat | 6.86 |
| Mistral-7B-Instruct-v0.1 | 6.84 |
| MPT-34B-instruct | 6.39 |
# generation powered by vllm
python demo/minicpm/vllm_based_demo.py --model_path <vllmcpm_repo_path>
# generation powered by huggingface
python demo/minicpm/hf_based_demo.py --model_path <hf_repo_path>

After installing transformers>=4.36.0 and accelerate, run the following code:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
torch.manual_seed(0)

path = 'openbmb/MiniCPM-2B-dpo-bf16'
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.bfloat16, device_map='cuda', trust_remote_code=True)

responds, history = model.chat(tokenizer, "山东省最高的山是哪座山, 它比黄山高还是矮?差距多少?", temperature=0.5, top_p=0.8, repetition_penalty=1.02)
print(responds)
MiniCPM-2B (Llama Format)

We have converted MiniCPM's weights into a format that Llama code can load directly, so you can try it out:

import torch
from transformers import LlamaTokenizerFast, LlamaForCausalLM
model_path = "openbmb/MiniCPM-2B-dpo-bf16-llama-format"
tokenizer = LlamaTokenizerFast.from_pretrained(model_path)
model = LlamaForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16, device_map='cuda', trust_remote_code=True)

prompt="Now you act like a terminal situated within a beginner's C++ practice repository folder, please provide the output for the command: `ls -l`"
input_ids = tokenizer.encode("<用户>{}<AI>".format(prompt), return_tensors='pt', add_special_tokens=True).cuda()
responds = model.generate(input_ids, temperature=0.3, top_p=0.8, repetition_penalty=1.02, max_length=1024)
responds = tokenizer.decode(responds[0], skip_special_tokens=True)
print(responds)

Install vLLM:

pip install "vllm>=0.4.1"

See here for the specific inference code; a rough sketch is given below.
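As a rough illustration only (the linked example is the maintained version), offline generation with vLLM and the MiniCPM-2B prompt format might look like this:

from vllm import LLM, SamplingParams

# Rough sketch only; see the linked example code for the maintained version.
llm = LLM(model="openbmb/MiniCPM-2B-dpo-bf16", trust_remote_code=True, dtype="bfloat16")
params = SamplingParams(temperature=0.5, top_p=0.8, max_tokens=512)

# MiniCPM-2B uses the <用户>...<AI> prompt format shown elsewhere in this README.
prompt = "<用户>Which is the highest mountain in Shandong Province?<AI>"
outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)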

Install SGLang:

python -m sglang.launch_server --model-path openbmb/MiniCPM-2B-dpo-fp16 --trust-remote-code --port 30000
from sglang import function, gen, set_default_backend, RuntimeEndpoint

@function
def text_qa(s, question):
    s += "<用户>" + question + "<AI>"
    s += gen("answer", max_tokens=1024, temperature=0.7, top_p=0.7)

set_default_backend(RuntimeEndpoint("http://localhost:30000"))

state = text_qa.run(
    question="What is the capital of China?",
)

print(state["answer"])
Inference with llama.cpp, Ollama, fastllm, and mlx_lm

MiniCPM supports inference with llama.cpp, ollama, fastllm, and mlx_lm. Thanks to @runfuture for adapting llama.cpp and ollama.

Please refer to the edge deployment tutorial in the MiniCPM Knowledge Base.

Please refer to the quantization guide in the MiniCPM Knowledge Base.

This project is jointly developed by the following institutions:

@article{minicpm4,
  title={MiniCPM4: Ultra-Efficient LLMs on End Devices},
  author={MiniCPM Team},
  year={2025}
}

@inproceedings{huminicpm,
  title={MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies},
  author={Hu, Shengding and Tu, Yuge and Han, Xu and Cui, Ganqu and He, Chaoqun and Zhao, Weilin and Long, Xiang and Zheng, Zhi and Fang, Yewei and Huang, Yuxiang and others},
  booktitle={First Conference on Language Modeling},
  year={2024}
}
