Speculative decoding accelerates autoregressive speech generation by letting a fast draft model propose tokens that a larger target model verifies. However, for speech LLMs that generate acoustic tokens, exact token matching is overly restrictive: many discrete tokens are acoustically or semantically interchangeable, reducing acceptance rates and limiting speedups. We introduce Principled Coarse-Graining (PCG), which verifies proposals at the level of Acoustic Similarity Groups (ASGs) derived from the target model's embedding space. By splitting each token's probability mass across the overlapping groups that contain it, we define an overlap-aware coarse-grained distribution and perform rejection sampling on the resulting group variable. This yields an exactness guarantee at the group level while allowing the accepted draft token to stand in for any member of the group in practice. On LibriTTS, PCG increases acceptance and throughput relative to standard speculative decoding and prior speech-specific relaxations while maintaining intelligibility and speaker similarity. These results suggest acoustically aware, group-level acceptance as a simple and general way to accelerate speech token generation while preserving speech quality.
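To make the group-level acceptance rule concrete, the sketch below follows the description above under stated assumptions: each token's probability mass is split uniformly across the overlapping ASGs that contain it, the rejection test compares the resulting coarse-grained draft and target distributions, and a rejected draft is replaced by resampling a group from the residual distribution and then a member token from the target distribution restricted to that group. All function and variable names (`coarse_grain`, `token_groups`, the uniform split, and the residual/resampling details) are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of overlap-aware, group-level speculative acceptance.
# Assumes uniform mass-splitting across overlapping groups; names are illustrative.
import numpy as np

def coarse_grain(p, token_groups, num_groups):
    """Split each token's probability mass evenly across the ASGs that contain it,
    yielding an overlap-aware distribution over groups."""
    g = np.zeros(num_groups)
    for tok, prob in enumerate(p):
        groups = token_groups[tok]          # list of ASG ids containing this token
        for gid in groups:
            g[gid] += prob / len(groups)    # uniform split across overlapping groups
    return g

def group_level_accept(draft_token, q, p, token_groups, num_groups, rng):
    """Accept or reject a draft token by comparing coarse-grained (group-level)
    draft and target distributions; resample a group and a member on rejection."""
    Q = coarse_grain(q, token_groups, num_groups)   # draft distribution over groups
    P = coarse_grain(p, token_groups, num_groups)   # target distribution over groups

    # Attribute the draft token to one of its groups (uniform split -> uniform choice).
    gid = rng.choice(token_groups[draft_token])

    # Standard speculative rejection test, but on the group variable.
    if rng.random() < min(1.0, P[gid] / max(Q[gid], 1e-12)):
        return draft_token                          # accepted token stands in for its group

    # Rejected: sample a group from the residual max(P - Q, 0), then a member
    # token from the target distribution restricted to that group.
    residual = np.maximum(P - Q, 0.0)
    residual /= residual.sum()
    new_gid = rng.choice(num_groups, p=residual)
    members = [t for t, gs in enumerate(token_groups) if new_gid in gs]
    member_probs = np.array([p[t] for t in members])
    member_probs /= member_probs.sum()
    return members[rng.choice(len(members), p=member_probs)]
```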
