Is my assumption correct that, for transcription, a dedicated ASR (speech-to-text) model is generally a better choice than an LLM that tries to “reason” about the words or sentences and may take unwanted liberties with them?
I would rather have a word be poorly recognized than have the model silently reinterpret, rewrite, or infer something that was not actually said. Or is that view too simplistic?
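To make the concern concrete, here is a minimal sketch of how one might check whether a second pass (for example an LLM "cleanup") changed words relative to the raw ASR output. It uses only Python's standard `difflib`; the function name `flag_rewrites` and the sample sentences are hypothetical illustrations and are not taken from yapsnap.

```python
import difflib


def flag_rewrites(asr_text: str, edited_text: str) -> list[tuple[str, str, str]]:
    """Return (operation, asr_words, edited_words) for each span where the texts differ.

    Hypothetical helper for illustration: it compares whitespace-separated
    words, so punctuation and casing changes also count as differences.
    """
    asr_words = asr_text.split()
    edited_words = edited_text.split()
    matcher = difflib.SequenceMatcher(a=asr_words, b=edited_words, autojunk=False)

    changes = []
    for op, a0, a1, b0, b1 in matcher.get_opcodes():
        if op != "equal":  # keep only replaced, deleted, or inserted spans
            changes.append(
                (op, " ".join(asr_words[a0:a1]), " ".join(edited_words[b0:b1]))
            )
    return changes


# Invented example strings: a raw transcript and a "cleaned" version of it.
raw = "we should meat on tuesday to discuss the the budget"
edited = "we should meet on Tuesday to discuss the budget"

for change in flag_rewrites(raw, edited):
    print(change)
```

A diff like this only shows where the two texts disagree. It cannot say which version matches the audio, so every flagged span would still need to be checked against the recording.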
#yapsnap (#transcription by URL, on the CPU)
https://github.com/kouhxp/yapsnap