two-pass CTC-Attention rescoring

transform

AudioFeatures, CTCPosteriors -> Transcription

Generate N-best hypotheses with CTC prefix beam search (first pass), then rescore them with a shared multi-head attention decoder (second pass).
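A minimal sketch of the two passes, assuming per-frame CTC log posteriors are already available as plain lists. The names `ctc_prefix_beam_search`, `rescore`, `toy_attention_score`, and `ctc_weight` are hypothetical. The combined score (attention log-likelihood plus a weighted CTC score) is one common choice and is not confirmed by the source. The toy scorer is a lookup table standing in for a real attention decoder, which would score each hypothesis against the encoder output.

```python
import math
from collections import defaultdict

NEG_INF = float("-inf")

def logsumexp(*xs):
    """Numerically stable log(sum(exp(x))) that tolerates -inf inputs."""
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))

def ctc_prefix_beam_search(log_probs, beam_size, blank=0):
    """First pass: return up to beam_size (tokens, ctc_log_prob) pairs, best first.

    log_probs is a list of frames; each frame lists per-token log posteriors.
    """
    # prefix -> (log prob of paths ending in blank, ... ending in non-blank)
    beams = {(): (0.0, NEG_INF)}
    for frame in log_probs:
        nxt = defaultdict(lambda: (NEG_INF, NEG_INF))
        for prefix, (pb, pnb) in beams.items():
            for tok, lp in enumerate(frame):
                if tok == blank:
                    b, nb = nxt[prefix]
                    nxt[prefix] = (logsumexp(b, pb + lp, pnb + lp), nb)
                    continue
                ext = prefix + (tok,)
                if prefix and tok == prefix[-1]:
                    # A repeated token collapses into the same prefix...
                    b, nb = nxt[prefix]
                    nxt[prefix] = (b, logsumexp(nb, pnb + lp))
                    # ...unless a blank separated the two occurrences.
                    if pb > NEG_INF:
                        b, nb = nxt[ext]
                        nxt[ext] = (b, logsumexp(nb, pb + lp))
                else:
                    b, nb = nxt[ext]
                    nxt[ext] = (b, logsumexp(nb, pb + lp, pnb + lp))
        ranked = sorted(nxt.items(), key=lambda kv: logsumexp(*kv[1]), reverse=True)
        beams = dict(ranked[:beam_size])
    return [(list(p), logsumexp(*s)) for p, s in beams.items()]

def rescore(nbest, attention_score, ctc_weight=0.5):
    """Second pass: return the tokens with the best combined score."""
    def combined(item):
        tokens, ctc_lp = item
        return attention_score(tokens) + ctc_weight * ctc_lp
    return max(nbest, key=combined)[0]

# Toy input: two identical frames over the vocabulary {0: blank, 1: "a", 2: "b"}.
frame = [math.log(0.4), math.log(0.35), math.log(0.25)]
nbest = ctc_prefix_beam_search([frame, frame], beam_size=3)

# Hypothetical stand-in for the attention decoder: it strongly prefers [2].
toy_scores = {(1,): math.log(0.05), (2,): math.log(0.9)}

def toy_attention_score(tokens):
    return toy_scores.get(tuple(tokens), math.log(0.05))

best = rescore(nbest, toy_attention_score)
```

In this toy case the first pass ranks `[1]` highest, and the second pass overrides it with `[2]` because the stand-in scorer assigns that hypothesis a much higher likelihood. Only the N-best list reaches the second pass, so its cost scales with `beam_size` and not with the full search space.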

Problem it solves

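Streaming CTC decoders suffer from local search errors, while full-sequence attention decoders are too slow for real-time streaming. Rescoring only a short N-best list with the attention decoder limits the second pass to a handful of candidates.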

Consumes

AudioFeatures, CTCPosteriors

Emits

Transcription

Distilled from 1 source

The real projects this mechanism was found in. Attribution is the point — this is how the best teams actually do it.