Noting that the processing time is considerably shorter than the length of the speech, is it possible to feed the models real-time microphone output? Or does inference run on the complete audio stream rather than sample by sample?
This would greatly reduce latency for voice assistants and the like, since the audio would not need to be fully captured before being fed to the models. Basically the same as I did here with SODA: https://github.com/biemster/gasr, but with an open-source and multilingual model.
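To illustrate what I mean, here is a minimal sketch of chunked pseudo-streaming: capture microphone audio continuously and periodically re-run inference on the buffer accumulated so far. It assumes the openai-whisper Python API (`whisper.load_model` / `model.transcribe` on a 16 kHz float32 numpy array) and the `sounddevice` library for capture; the chunk interval and re-transcription loop are illustrative assumptions, not a real streaming implementation:

```python
# Sketch: capture mic audio with sounddevice, periodically re-run inference
# on the accumulated buffer (chunked pseudo-streaming, not true streaming).
# Assumes the openai-whisper API; any model that accepts a 16 kHz float32
# numpy array would slot in the same way.
import queue

import numpy as np
import sounddevice as sd
import whisper

SAMPLE_RATE = 16_000   # whisper models expect 16 kHz mono audio
CHUNK_MS = 2_000       # how often to re-run inference (illustrative value)

model = whisper.load_model("base")
audio_q: "queue.Queue[np.ndarray]" = queue.Queue()

def on_audio(indata, frames, time_info, status):
    # Runs on the audio thread; just hand the samples off to the main loop.
    audio_q.put(indata[:, 0].copy())

buffer = np.zeros(0, dtype=np.float32)
with sd.InputStream(samplerate=SAMPLE_RATE, channels=1,
                    dtype="float32", callback=on_audio):
    while True:
        sd.sleep(CHUNK_MS)
        while not audio_q.empty():
            buffer = np.concatenate([buffer, audio_q.get()])
        if buffer.size:
            # Re-transcribes everything captured so far on each pass; a real
            # assistant would use VAD to drop finished utterances instead.
            result = model.transcribe(buffer, fp16=False)
            print(result["text"])
```

The obvious cost is that work is redone on every pass, which is what true sample-by-sample streaming would avoid.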