Showing content from https://github.com/QwenLM/Qwen2-Audio below:

QwenLM/Qwen2-Audio: The official repo of Qwen2-Audio chat & pretrained large audio language model proposed by Alibaba Cloud.

中文 | English

Qwen2-Audio-7B 🤖 | 🤗  | Qwen2-Audio-7B-Instruct 🤖 | 🤗  | Demo 🤖 | 🤗 
📑 Paper    |    📑 Blog    |    💬 WeChat (微信)   |    Discord  

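We introduce Qwen2-Audio, the latest progress of Qwen-Audio: a large-scale audio-language model that accepts various audio signal inputs and either performs audio analysis or responds directly in text to speech instructions. It offers two distinct audio interaction modes: voice chat, in which users can freely engage in voice interactions with Qwen2-Audio without text input, and audio analysis, in which users provide both audio and text instructions for analysis.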

We've released two models of the Qwen2-Audio series: Qwen2-Audio-7B and Qwen2-Audio-7B-Instruct.

The overview of three-stage training process of Qwen2-Audio.

We evaluated Qwen2-Audio's abilities on 13 standard benchmarks as follows:

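| Task | Description | Dataset | Split | Metric |
|---|---|---|---|---|
| ASR | Automatic Speech Recognition | Fleurs | dev / test | WER |
| ASR | Automatic Speech Recognition | Aishell2 | test | WER |
| ASR | Automatic Speech Recognition | Librispeech | dev / test | WER |
| ASR | Automatic Speech Recognition | Common Voice | dev / test | WER |
| S2TT | Speech-to-Text Translation | CoVoST2 | test | BLEU |
| SER | Speech Emotion Recognition | Meld | test | ACC |
| VSC | Vocal Sound Classification | VocalSound | test | ACC |
| AIR-Bench | Chat-Benchmark-Speech | Fisher, SpokenWOZ, IEMOCAP, Common voice | dev / test | GPT-4 Eval |
| AIR-Bench | Chat-Benchmark-Sound | Clotho | dev / test | GPT-4 Eval |
| AIR-Bench | Chat-Benchmark-Music | MusicCaps | dev / test | GPT-4 Eval |
| AIR-Bench | Chat-Benchmark-Mixed-Audio | Common voice, AudioCaps, MusicCaps | dev / test | GPT-4 Eval |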
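For readers unfamiliar with the ASR metric listed above, the following is a minimal sketch of word error rate (WER) as word-level edit distance divided by the reference length. It is an illustration only: the function name `word_error_rate` is hypothetical, and the official evaluation scripts may apply text normalization that this sketch does not.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of three reference words
print(word_error_rate("the cat sat", "the cat"))
```

The value is a fraction; multiply by 100 to express it as a percentage.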

Below is the overall performance:

The details of evaluation are as follows:
(Note: The evaluation results we present are based on the initial model from the original training framework. The scores fluctuated somewhat after the model was converted to the Hugging Face framework. Here we present our complete evaluation results, starting with the initial model results from the paper.)

| Task | Dataset (splits) | Model | Metric | Results |
|---|---|---|---|---|
| ASR | Librispeech (dev-clean / dev-other / test-clean / test-other) | SpeechT5 | WER | 2.1 / 5.5 / 2.4 / 5.8 |
| | | SpeechNet | WER | - / - / 30.7 / - |
| | | SLM-FT | WER | - / - / 2.6 / 5.0 |
| | | SALMONN | WER | - / - / 2.1 / 4.9 |
| | | SpeechVerse | WER | - / - / 2.1 / 4.4 |
| | | Qwen-Audio | WER | 1.8 / 4.0 / 2.0 / 4.2 |
| | | Qwen2-Audio | WER | 1.3 / 3.4 / 1.6 / 3.6 |
| ASR | Common Voice 15 (en / zh / yue / fr) | Whisper-large-v3 | WER | 9.3 / 12.8 / 10.9 / 10.8 |
| | | Qwen2-Audio | WER | 8.6 / 6.9 / 5.9 / 9.6 |
| ASR | Fleurs (zh) | Whisper-large-v3 | WER | 7.7 |
| | | Qwen2-Audio | WER | 7.5 |
| ASR | Aishell2 (Mic / iOS / Android) | MMSpeech-base | WER | 4.5 / 3.9 / 4.0 |
| | | Paraformer-large | WER | - / 2.9 / - |
| | | Qwen-Audio | WER | 3.3 / 3.1 / 3.3 |
| | | Qwen2-Audio | WER | 3.0 / 3.0 / 2.9 |
| S2TT | CoVoST2 (en-de / de-en / en-zh / zh-en) | SALMONN | BLEU | 18.6 / - / 33.1 / - |
| | | SpeechLLaMA | BLEU | - / 27.1 / - / 12.3 |
| | | BLSP | BLEU | 14.1 / - / - / - |
| | | Qwen-Audio | BLEU | 25.1 / 33.9 / 41.5 / 15.7 |
| | | Qwen2-Audio | BLEU | 29.9 / 35.2 / 45.2 / 24.4 |
| S2TT | CoVoST2 (es-en / fr-en / it-en) | SpeechLLaMA | BLEU | 27.9 / 25.2 / 25.9 |
| | | Qwen-Audio | BLEU | 39.7 / 38.5 / 36.0 |
| | | Qwen2-Audio | BLEU | 40.0 / 38.5 / 36.3 |
| SER | Meld | WavLM-large | ACC | 0.542 |
| | | Qwen-Audio | ACC | 0.557 |
| | | Qwen2-Audio | ACC | 0.553 |
| VSC | VocalSound | CLAP | ACC | 0.4945 |
| | | Pengi | ACC | 0.6035 |
| | | Qwen-Audio | ACC | 0.9289 |
| | | Qwen2-Audio | ACC | 0.9392 |
| AIR-Bench | Chat Benchmark (Speech / Sound / Music / Mixed-Audio) | SALMONN | GPT-4 | 6.16 / 6.28 / 5.95 / 6.08 |
| | | BLSP | GPT-4 | 6.17 / 5.55 / 5.08 / 5.33 |
| | | Pandagpt | GPT-4 | 3.58 / 5.46 / 5.06 / 4.25 |
| | | Macaw-LLM | GPT-4 | 0.97 / 1.01 / 0.91 / 1.01 |
| | | SpeechGPT | GPT-4 | 1.57 / 0.95 / 0.95 / 4.13 |
| | | Next-gpt | GPT-4 | 3.86 / 4.76 / 4.18 / 4.13 |
| | | Qwen-Audio | GPT-4 | 6.47 / 6.95 / 5.52 / 6.08 |
| | | Gemini-1.5-pro | GPT-4 | 6.97 / 5.49 / 5.06 / 5.27 |
| | | Qwen2-Audio | GPT-4 | 7.18 / 6.99 / 6.79 / 6.77 |

(The second set of results was obtained after converting the model to Hugging Face.)

| Task | Dataset (splits) | Model | Metric | Results |
|---|---|---|---|---|
| ASR | Librispeech (dev-clean / dev-other / test-clean / test-other) | SpeechT5 | WER | 2.1 / 5.5 / 2.4 / 5.8 |
| | | SpeechNet | WER | - / - / 30.7 / - |
| | | SLM-FT | WER | - / - / 2.6 / 5.0 |
| | | SALMONN | WER | - / - / 2.1 / 4.9 |
| | | SpeechVerse | WER | - / - / 2.1 / 4.4 |
| | | Qwen-Audio | WER | 1.8 / 4.0 / 2.0 / 4.2 |
| | | Qwen2-Audio | WER | 1.7 / 3.6 / 1.7 / 4.0 |
| ASR | Common Voice 15 (en / zh / yue / fr) | Whisper-large-v3 | WER | 9.3 / 12.8 / 10.9 / 10.8 |
| | | Qwen2-Audio | WER | 8.7 / 6.5 / 5.9 / 9.6 |
| ASR | Fleurs (zh) | Whisper-large-v3 | WER | 7.7 |
| | | Qwen2-Audio | WER | 7.0 |
| ASR | Aishell2 (Mic / iOS / Android) | MMSpeech-base | WER | 4.5 / 3.9 / 4.0 |
| | | Paraformer-large | WER | - / 2.9 / - |
| | | Qwen-Audio | WER | 3.3 / 3.1 / 3.3 |
| | | Qwen2-Audio | WER | 3.2 / 3.1 / 2.9 |
| S2TT | CoVoST2 (en-de / de-en / en-zh / zh-en) | SALMONN | BLEU | 18.6 / - / 33.1 / - |
| | | SpeechLLaMA | BLEU | - / 27.1 / - / 12.3 |
| | | BLSP | BLEU | 14.1 / - / - / - |
| | | Qwen-Audio | BLEU | 25.1 / 33.9 / 41.5 / 15.7 |
| | | Qwen2-Audio | BLEU | 29.6 / 33.6 / 45.6 / 24.0 |
| S2TT | CoVoST2 (es-en / fr-en / it-en) | SpeechLLaMA | BLEU | 27.9 / 25.2 / 25.9 |
| | | Qwen-Audio | BLEU | 39.7 / 38.5 / 36.0 |
| | | Qwen2-Audio | BLEU | 38.7 / 37.2 / 35.2 |
| SER | Meld | WavLM-large | ACC | 0.542 |
| | | Qwen-Audio | ACC | 0.557 |
| | | Qwen2-Audio | ACC | 0.535 |
| VSC | VocalSound | CLAP | ACC | 0.4945 |
| | | Pengi | ACC | 0.6035 |
| | | Qwen-Audio | ACC | 0.9289 |
| | | Qwen2-Audio | ACC | 0.9395 |
| AIR-Bench | Chat Benchmark (Speech / Sound / Music / Mixed-Audio) | SALMONN | GPT-4 | 6.16 / 6.28 / 5.95 / 6.08 |
| | | BLSP | GPT-4 | 6.17 / 5.55 / 5.08 / 5.33 |
| | | Pandagpt | GPT-4 | 3.58 / 5.46 / 5.06 / 4.25 |
| | | Macaw-LLM | GPT-4 | 0.97 / 1.01 / 0.91 / 1.01 |
| | | SpeechGPT | GPT-4 | 1.57 / 0.95 / 0.95 / 4.13 |
| | | Next-gpt | GPT-4 | 3.86 / 4.76 / 4.18 / 4.13 |
| | | Qwen-Audio | GPT-4 | 6.47 / 6.95 / 5.52 / 6.08 |
| | | Gemini-1.5-pro | GPT-4 | 6.97 / 5.49 / 5.06 / 5.27 |
| | | Qwen2-Audio | GPT-4 | 7.24 / 6.83 / 6.73 / 6.42 |
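To make the fluctuation between the two sets of results concrete, the sketch below computes the per-split change in Qwen2-Audio's Librispeech WER. The numbers are copied from the two result tables above; the helper name `wer_deltas` is only illustrative.

```python
# Qwen2-Audio WER on Librispeech, copied from the two result tables above
SPLITS = ["dev-clean", "dev-other", "test-clean", "test-other"]
original = [1.3, 3.4, 1.6, 3.6]    # original training framework
converted = [1.7, 3.6, 1.7, 4.0]   # after conversion to Hugging Face


def wer_deltas(before, after):
    """Return after - before for each split, rounded to one decimal place."""
    return [round(a - b, 1) for b, a in zip(before, after)]


deltas = wer_deltas(original, converted)
for split, delta in zip(SPLITS, deltas):
    print(f"{split}: {delta:+.1f}")
```

On these four splits the converted model is 0.1 to 0.4 WER points worse than the original.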

We have provided all evaluation scripts to reproduce our results. Please refer to eval_audio/EVALUATION.md for details.

The code of Qwen2-Audio is included in the latest Hugging Face transformers. We advise you to build from source with the command pip install git+https://github.com/huggingface/transformers; otherwise, you may encounter an error.
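One way to check up front whether the installed transformers build includes Qwen2-Audio is to test for the model class used in the examples in this README. This is a minimal sketch; the helper name `has_qwen2_audio` is hypothetical and not part of any library.

```python
import importlib


def has_qwen2_audio(module_name="transformers"):
    """Return True if the module imports and exposes the Qwen2-Audio model class."""
    try:
        module = importlib.import_module(module_name)
        # Attribute access fails if the class is missing from this build
        getattr(module, "Qwen2AudioForConditionalGeneration")
    except Exception:
        # Covers a missing package as well as a build without Qwen2-Audio
        return False
    return True


if not has_qwen2_audio():
    print("Qwen2-Audio not found; consider: "
          "pip install git+https://github.com/huggingface/transformers")
```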

Below, we provide simple examples showing how to use Qwen2-Audio and Qwen2-Audio-Instruct with 🤗 Transformers. Before running the code, make sure you have set up the environment, meet the above requirements, and have installed the dependent libraries. You can then start with ModelScope or Transformers. Qwen2-Audio models currently perform best with audio clips under 30 seconds.
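Given the 30-second guidance, longer recordings can be cut into shorter pieces before inference. The following is a minimal sketch that splits a sample sequence at fixed boundaries; `split_into_chunks` is a hypothetical helper, and the 16000 Hz rate in the usage line is an assumption (in practice, use `processor.feature_extractor.sampling_rate`). Fixed boundaries may cut through words, so a real pipeline might prefer splitting on silence.

```python
def split_into_chunks(samples, sampling_rate, max_seconds=30.0):
    """Split a 1-D sequence of audio samples into pieces of at most max_seconds."""
    chunk_size = int(sampling_rate * max_seconds)
    if chunk_size <= 0:
        raise ValueError("sampling_rate * max_seconds must be at least one sample")
    # Slicing works for Python lists and NumPy arrays alike
    return [samples[start:start + chunk_size]
            for start in range(0, len(samples), chunk_size)]


# 70 seconds of silence at an assumed 16 kHz sampling rate
chunks = split_into_chunks([0.0] * (16000 * 70), sampling_rate=16000)
print([len(chunk) / 16000 for chunk in chunks])
```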

In the following, we demonstrate how to use Qwen2-Audio-7B-Instruct for inference in both voice chat and audio analysis modes. Note that we use the ChatML format for dialog; this demo shows how to leverage apply_chat_template for this purpose.
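To give a feel for what a ChatML-formatted dialog looks like, here is a simplified, hand-written renderer. It is an illustration only and not the template shipped with the processor: the function `render_chatml` is hypothetical, the real template may add a default system message or label each audio clip, and only the `<|audio_bos|><|AUDIO|><|audio_eos|>` placeholder is taken from the base-model prompt shown in this README. Always use `processor.apply_chat_template` for actual inference.

```python
AUDIO_PLACEHOLDER = "<|audio_bos|><|AUDIO|><|audio_eos|>"


def render_chatml(conversation, add_generation_prompt=True):
    """Render a conversation as ChatML-style text (simplified illustration)."""
    parts = []
    for message in conversation:
        content = message["content"]
        if isinstance(content, str):
            body = content
        else:
            pieces = []
            for ele in content:
                if ele["type"] == "audio":
                    # Each audio clip becomes a placeholder that the processor later fills in
                    pieces.append(AUDIO_PLACEHOLDER + "\n")
                elif ele["type"] == "text":
                    pieces.append(ele["text"])
            body = "".join(pieces)
        parts.append(f"<|im_start|>{message['role']}\n{body}<|im_end|>\n")
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


demo = [{"role": "user", "content": [
    {"type": "audio", "audio_url": "https://example.com/clip.wav"},  # hypothetical URL
    {"type": "text", "text": "What's that sound?"},
]}]
print(render_chatml(demo))
```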

In the voice chat mode, users can freely engage in voice interactions with Qwen2-Audio without text input:

from io import BytesIO
from urllib.request import urlopen
import librosa
from transformers import Qwen2AudioForConditionalGeneration, AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")
model = Qwen2AudioForConditionalGeneration.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct", device_map="auto")

conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/guess_age_gender.wav"},
    ]},
    {"role": "assistant", "content": "Yes, the speaker is female and in her twenties."},
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/translate_to_chinese.wav"},
    ]},
]
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios = []
for message in conversation:
    if isinstance(message["content"], list):
        for ele in message["content"]:
            if ele["type"] == "audio":
                audios.append(librosa.load(
                    BytesIO(urlopen(ele['audio_url']).read()), 
                    sr=processor.feature_extractor.sampling_rate)[0]
                )

inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True)
# Move all input tensors to the model's device
inputs = inputs.to(model.device)

generate_ids = model.generate(**inputs, max_length=256)
# Keep only the newly generated tokens by dropping the prompt prefix
generate_ids = generate_ids[:, inputs.input_ids.size(1):]

response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

In the audio analysis mode, users can provide both audio and text instructions for analysis:

from io import BytesIO
from urllib.request import urlopen
import librosa
from transformers import Qwen2AudioForConditionalGeneration, AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")
model = Qwen2AudioForConditionalGeneration.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct", device_map="auto")

conversation = [
    {'role': 'system', 'content': 'You are a helpful assistant.'}, 
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/glass-breaking-151256.mp3"},
        {"type": "text", "text": "What's that sound?"},
    ]},
    {"role": "assistant", "content": "It is the sound of glass shattering."},
    {"role": "user", "content": [
        {"type": "text", "text": "What can you do when you hear that?"},
    ]},
    {"role": "assistant", "content": "Stay alert and cautious, and check if anyone is hurt or if there is any damage to property."},
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/1272-128104-0000.flac"},
        {"type": "text", "text": "What does the person say?"},
    ]},
]
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios = []
for message in conversation:
    if isinstance(message["content"], list):
        for ele in message["content"]:
            if ele["type"] == "audio":
                audios.append(
                    librosa.load(
                        BytesIO(urlopen(ele['audio_url']).read()), 
                        sr=processor.feature_extractor.sampling_rate)[0]
                )

inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True)
# Move all input tensors to the model's device
inputs = inputs.to(model.device)

generate_ids = model.generate(**inputs, max_length=256)
# Keep only the newly generated tokens by dropping the prompt prefix
generate_ids = generate_ids[:, inputs.input_ids.size(1):]

response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
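The loop that gathers audio clips is the same in each example above, so the URL-collection part can be factored into a small helper. This is a sketch under the assumption that conversations follow the structure used in this README; `collect_audio_urls` is a hypothetical name and the URLs in the usage example are placeholders.

```python
def collect_audio_urls(conversation):
    """Return the audio URLs of a conversation in the order they appear."""
    urls = []
    for message in conversation:
        content = message["content"]
        # Plain-string content (e.g. assistant replies) carries no audio
        if isinstance(content, list):
            urls.extend(ele["audio_url"] for ele in content if ele["type"] == "audio")
    return urls


example = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://example.com/first.wav"},
        {"type": "text", "text": "What's that sound?"},
    ]},
    {"role": "assistant", "content": "It is the sound of glass shattering."},
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://example.com/second.flac"},
    ]},
]
print(collect_audio_urls(example))
```

Each returned URL can then be downloaded and decoded with librosa.load exactly as in the examples above.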

We also support batch inference:

from io import BytesIO
from urllib.request import urlopen
import librosa
from transformers import Qwen2AudioForConditionalGeneration, AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")
model = Qwen2AudioForConditionalGeneration.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct", device_map="auto")

conversation1 = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/glass-breaking-151256.mp3"},
        {"type": "text", "text": "What's that sound?"},
    ]},
    {"role": "assistant", "content": "It is the sound of glass shattering."},
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/f2641_0_throatclearing.wav"},
        {"type": "text", "text": "What can you hear?"},
    ]}
]

conversation2 = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/1272-128104-0000.flac"},
        {"type": "text", "text": "What does the person say?"},
    ]},
]

conversations = [conversation1, conversation2]

text = [processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False) for conversation in conversations]

audios = []
for conversation in conversations:
    for message in conversation:
        if isinstance(message["content"], list):
            for ele in message["content"]:
                if ele["type"] == "audio":
                    audios.append(
                        librosa.load(
                            BytesIO(urlopen(ele['audio_url']).read()), 
                            sr=processor.feature_extractor.sampling_rate)[0]
                    )

inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True)
# Move all input tensors to the model's device
inputs = inputs.to(model.device)

generate_ids = model.generate(**inputs, max_length=256)
# Keep only the newly generated tokens by dropping the prompt prefix
generate_ids = generate_ids[:, inputs.input_ids.size(1):]

response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
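In batch inference, audios is one flat list shared by all conversations, so a missing or extra clip is easy to overlook. The sketch below counts audio placeholders in the rendered prompts and compares the total with the number of clips. It assumes each clip corresponds to exactly one `<|AUDIO|>` token in the rendered text, as in the base-model prompt in this README; `check_audio_alignment` is a hypothetical helper.

```python
AUDIO_TOKEN = "<|AUDIO|>"  # assumed placeholder, one per audio clip


def check_audio_alignment(texts, audios):
    """Raise if the number of audio placeholders differs from the number of clips."""
    expected = sum(text.count(AUDIO_TOKEN) for text in texts)
    if expected != len(audios):
        raise ValueError(
            f"prompts contain {expected} audio placeholders "
            f"but {len(audios)} clips were provided"
        )
    return expected


# Two prompts with two and one placeholders, and three dummy clips
prompts = [
    "<|audio_bos|><|AUDIO|><|audio_eos|>A<|audio_bos|><|AUDIO|><|audio_eos|>B",
    "<|audio_bos|><|AUDIO|><|audio_eos|>C",
]
print(check_audio_alignment(prompts, audios=[[0.0], [0.0], [0.0]]))
```

Calling it just before processor(text=text, audios=audios, ...) surfaces a mismatch early with a readable message.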

Running the Qwen2-Audio pretrained base model is also simple.

from io import BytesIO
from urllib.request import urlopen
import librosa
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

model = Qwen2AudioForConditionalGeneration.from_pretrained("Qwen/Qwen2-Audio-7B", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B", trust_remote_code=True)

prompt = "<|audio_bos|><|AUDIO|><|audio_eos|>Generate the caption in English:"
url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Audio/glass-breaking-151256.mp3"
audio, sr = librosa.load(BytesIO(urlopen(url).read()), sr=processor.feature_extractor.sampling_rate)
inputs = processor(text=prompt, audios=audio, return_tensors="pt")

generated_ids = model.generate(**inputs, max_length=256)
generated_ids = generated_ids[:, inputs.input_ids.size(1):]
response = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

We strongly advise users, especially those in mainland China, to use ModelScope. Its snapshot_download function can help you resolve issues with downloading checkpoints.
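A minimal sketch of fetching a checkpoint with snapshot_download follows. It requires the modelscope package, and the model ID `qwen/Qwen2-Audio-7B-Instruct` is an assumption about the repository name on ModelScope, so verify it on the ModelScope site. The wrapper `download_checkpoint` is a hypothetical helper.

```python
def download_checkpoint(model_id="qwen/Qwen2-Audio-7B-Instruct", cache_dir=None):
    """Download a checkpoint from ModelScope and return its local directory."""
    # Imported lazily so this file still loads without modelscope installed
    from modelscope import snapshot_download  # pip install modelscope

    # model_id above is an assumed repository name, not confirmed by this README
    return snapshot_download(model_id, cache_dir=cache_dir)
```

The returned directory can be passed to from_pretrained in place of the "Qwen/Qwen2-Audio-7B-Instruct" identifier used in the examples above.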

We provide code for users to build a web UI demo. Before you start, make sure you install the following packages:

pip install -r requirements_web_demo.txt

Then run the command below and click on the generated link:

python demo/web_demo_audio.py

More impressive cases will be posted on Qwen's blog.

If you are interested in joining us as a full-time employee or intern, please contact us at qwen_audio@list.alibaba-inc.com.

Check the license of each model inside its HF repo. It is NOT necessary for you to submit a request for commercial usage.

If you find our paper and code useful in your research, please consider giving a star ⭐ and citation 📝 :)

@article{Qwen-Audio,
  title={Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models},
  author={Chu, Yunfei and Xu, Jin and Zhou, Xiaohuan and Yang, Qian and Zhang, Shiliang and Yan, Zhijie and Zhou, Chang and Zhou, Jingren},
  journal={arXiv preprint arXiv:2311.07919},
  year={2023}
}
@article{Qwen2-Audio,
  title={Qwen2-Audio Technical Report},
  author={Chu, Yunfei and Xu, Jin and Yang, Qian and Wei, Haojie and Wei, Xipin and Guo, Zhifang and Leng, Yichong and Lv, Yuanjun and He, Jinzheng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
  journal={arXiv preprint arXiv:2407.10759},
  year={2024}
}

If you would like to leave a message for either our research team or product team, feel free to send an email to qianwen_opensource@alibabacloud.com.
