Paper page - Qwen2.5-Omni Technical Report

Hi,
This is a great model, and I hope this architecture becomes the standard for LLMs! The main obstacle has been the wrappers used by the transformers library.

Specifically, we need to be able to stack preprocessors on top of each other: an image processor, a video processor, and an audio processor, as in the sketch below.
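To illustrate what I mean, here is a minimal sketch of such stacking, assuming standard transformers components; the `OmniProcessor` wrapper class is purely hypothetical, not an existing API.

```python
# Hypothetical sketch: stacking standard preprocessors into one "omni" processor.
# OmniProcessor is an illustration only; no such class exists in transformers.
from transformers import AutoTokenizer, CLIPImageProcessor, WhisperFeatureExtractor

class OmniProcessor:
    def __init__(self, image_processor, audio_extractor, tokenizer):
        self.image_processor = image_processor
        self.audio_extractor = audio_extractor
        self.tokenizer = tokenizer

    def __call__(self, text=None, images=None, audio=None, sampling_rate=16000):
        inputs = {}
        if text is not None:
            # Tokenize the text prompt.
            inputs.update(self.tokenizer(text, return_tensors="pt"))
        if images is not None:
            # Produce pixel_values for the vision tower.
            inputs.update(self.image_processor(images, return_tensors="pt"))
        if audio is not None:
            # Produce input_features for the audio encoder.
            inputs.update(self.audio_extractor(audio, sampling_rate=sampling_rate,
                                               return_tensors="pt"))
        return inputs

processor = OmniProcessor(
    CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14"),
    WhisperFeatureExtractor.from_pretrained("openai/whisper-small"),
    AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1"),
)
```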

This is essentially what you have pioneered here.

I would like to request that you make the model more universal!

The model creation process seems to be locked to the Qwen models only, when in fact it could be unlocked for all Llama-based models, since most models (Mistral, etc.) descend from that architecture. These highly trained models should be usable as the text segment of the omni model, just as Whisper/wav2vec2 is used as the audio component and CLIP/ViT as the imaging component. These parent architectures can be considered standardized, so they should be enabled for the generation of a pretrained configuration, perhaps with additional configuration for the cross-attention.

The Qwen models are part of the LLaVA model family, which already allowed Llama models to be configured for training with an image processor; the LLaVA-NeXT models enabled this feature as well, as did the OneVision models. I would hope the omni models can be just as universal as those models (see the sketch below).
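As a concrete precedent, here is a minimal sketch of that LLaVA-style composition using the existing `LlavaConfig` API, which already pairs a CLIP vision tower with a swappable text backbone (Mistral shown here). A hypothetical omni equivalent would add an audio config (e.g. a Whisper encoder) alongside these.

```python
# A sketch of LLaVA-style composition: a standard vision config plus a
# swappable Llama-family text config, combined into one trainable model.
from transformers import (CLIPVisionConfig, MistralConfig,
                          LlavaConfig, LlavaForConditionalGeneration)

vision_config = CLIPVisionConfig()   # standardized ViT/CLIP vision tower
text_config = MistralConfig()        # any Llama-descendant text backbone
config = LlavaConfig(vision_config=vision_config, text_config=text_config)
model = LlavaForConditionalGeneration(config)  # randomly initialized, ready for training
```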

We still face issues with other providers taking time to enable easy creation of GGUF files, which is what lets external platforms run these quantized models. Will you also release a quantizing library for the vision/audio/omni modalities to solve this issue? (The text-only workflow that exists today is sketched below.)
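For context, this is the text-only GGUF loading path that transformers offers today, as far as I know; the repo and file names below are placeholders for illustration, and nothing comparable exists yet for the vision/audio/omni modalities.

```python
# Loading a quantized GGUF checkpoint directly in transformers (text-only).
# Repo and file names are assumed placeholders, not verified artifacts.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "TheBloke/Mistral-7B-v0.1-GGUF"      # placeholder GGUF repo
gguf_file = "mistral-7b-v0.1.Q4_K_M.gguf"   # placeholder quantized file
tokenizer = AutoTokenizer.from_pretrained(repo, gguf_file=gguf_file)
model = AutoModelForCausalLM.from_pretrained(repo, gguf_file=gguf_file)
```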

Please let us know whether you will make the model truly universal by implementing these ideas; in truth, you have solved what most people have been waiting for!

With a universal configuration we can expect more real-time applications for the AI: sampling webcam input and accepting microphone input, streamed to the model, would be available with the omni models (see the sketch below). The inputs can be processed in a single model instead of the model stack people have been forced to use, which has let GPU providers and model providers skim the cream in sales of units and services.
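As a sketch of the single-model pipeline I mean, here is the multimodal usage pattern from the Qwen2.5-Omni release; class and helper names are taken from the model card (verify against your transformers version), and the file paths stand in for captured webcam/microphone data that a real-time app would stream in chunks.

```python
# One model, mixed inputs: video + audio + text in a single forward pass.
# Names follow the Qwen2.5-Omni model card; paths are placeholders.
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info  # helper shipped with the release

model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B", torch_dtype="auto", device_map="auto"
)
processor = Qwen2_5OmniProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B")

conversation = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "webcam_clip.mp4"},  # captured webcam segment
        {"type": "audio", "audio": "mic_capture.wav"},  # captured microphone segment
        {"type": "text", "text": "What is happening right now?"},
    ],
}]

text = processor.apply_chat_template(conversation, add_generation_prompt=True,
                                     tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=True)
inputs = processor(text=text, audio=audios, images=images, videos=videos,
                   return_tensors="pt", padding=True).to(model.device)
text_ids, audio = model.generate(**inputs, use_audio_in_video=True)
print(processor.batch_decode(text_ids, skip_special_tokens=True)[0])
```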

In fact, a good 7B model can perform as well as a 70B model!

Another note: the models can already produce image output! Because the model was already trained on captions, we should be able to produce a base64 image (as text) as output and convert that base64 string back into the generated image (see the sketch below). I trained my personal Mistral on this successfully, but only the training set was regenerating, so it needs mass training on base64 images to be able to regenerate such images or generalize to new ones.
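For clarity, the round trip I mean is just this minimal sketch; `b64_text` stands in for the model's generated base64 string.

```python
# Decode a base64 string emitted as model text back into an image file.
import base64
import io
from PIL import Image

b64_text = "...model-generated base64 string..."  # placeholder model output
image = Image.open(io.BytesIO(base64.b64decode(b64_text)))
image.save("generated.png")
```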

