Monday, April 13, 2026

A Theoretical Framework for Acoustic Neighbor Embeddings


This paper provides a theoretical framework for interpreting acoustic neighbor embeddings, which are representations of the phonetic content of variable-width audio or text in a fixed-dimensional embedding space. A probabilistic interpretation of the distances between embeddings is proposed, based on a general quantitative definition of phonetic similarity between words. This gives us a framework for understanding and applying the embeddings in a principled manner. Theoretical and empirical evidence is presented to support an approximation of uniform cluster-wise isotropy, which allows the distances to be reduced to simple Euclidean distances. Four experiments that validate the framework and demonstrate how it can be applied to diverse problems are described. Nearest-neighbor search between audio and text embeddings can give isolated word classification accuracy identical to that of finite state transducers (FSTs) for vocabularies as large as 500k. In out-of-vocabulary word recovery, embedding distances give accuracy that differs by 0.5 percentage points from that of phone edit distances, and in English dialect clustering they produce clustering hierarchies identical to those derived from human listening experiments. The theoretical framework also allows us to use the embeddings to predict the expected confusion of device wake-up words. All source code and pretrained models are provided.
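The nearest-neighbor search under plain Euclidean distance mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the paper's released code: the function name `nearest_word`, the three-word vocabulary, and the 3-dimensional vectors are all made up for illustration, and real embeddings would come from trained audio and text encoders.

```python
import math

def nearest_word(audio_embedding, text_embeddings):
    """Return (word, distance) for the text embedding closest to the
    audio embedding under Euclidean distance.

    Hypothetical helper for illustration; not part of the paper's code.
    """
    best_word, best_dist = None, math.inf
    for word, vec in text_embeddings.items():
        d = math.dist(audio_embedding, vec)  # Euclidean distance
        if d < best_dist:
            best_word, best_dist = word, d
    return best_word, best_dist

# Toy vectors invented for this example (assumption, not real data).
toy_text_embeddings = {
    "cat": (0.9, 0.1, 0.0),
    "cap": (0.8, 0.2, 0.1),
    "dog": (-0.7, 0.6, 0.3),
}
toy_audio_embedding = (0.88, 0.12, 0.02)

word, dist = nearest_word(toy_audio_embedding, toy_text_embeddings)
print(word, round(dist, 4))
```

With these toy vectors the audio embedding lies closest to "cat", with the phonetically similar "cap" next and "dog" far away. A linear scan like this is only practical for small vocabularies; for a vocabulary on the order of 500k words, an indexed or approximate nearest-neighbor search would likely be needed.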
