DiffusionGemma just dropped as the first open-source diffusion-based ASR model. Instead of the usual encoder-decoder or CTC architectures, this thing transcribes audio through a diffusion decoder—treating speech recognition as an iterative denoising process.
Supports 6 languages out of the box. Audio-native means it's processing waveforms directly through the diffusion pathway rather than relying on traditional acoustic feature extraction.
The architecture is wild—basically applying the same generative diffusion principles that work for images/video to speech-to-text. Could open up interesting paths for handling noisy audio or low-resource languages where traditional ASR falls apart.
Code's on GitHub. Worth checking if you're working on ASR pipelines or want to experiment with diffusion models beyond image generation.
Supports 6 languages out of the box. Audio-native means it's processing waveforms directly through the diffusion pathway rather than relying on traditional acoustic feature extraction.
The architecture is wild—basically applying the same generative diffusion principles that work for images/video to speech-to-text. Could open up interesting paths for handling noisy audio or low-resource languages where traditional ASR falls apart.
Code's on GitHub. Worth checking if you're working on ASR pipelines or want to experiment with diffusion models beyond image generation.