Humans possess an extraordinary ability to localize sound sources and interpret their environment using auditory cues, a phenomenon termed spatial hearing. This capability enables tasks such as identifying speakers in noisy settings or navigating complex environments. Emulating such auditory spatial perception is crucial for enhancing the immersive experience in technologies like augmented reality (AR) and virtual reality (VR). However, the transition from monaural (single-channel) to binaural (two-channel) audio synthesis—which captures spatial auditory effects—faces significant challenges, particularly due to the limited availability of multi-channel and positional audio data.
Traditional mono-to-binaural synthesis approaches often rely on digital signal processing (DSP) frameworks. These methods model auditory effects using components such as the head-related transfer function (HRTF), room impulse response (RIR), and ambient noise, typically treated as linear time-invariant (LTI) systems. Although DSP-based techniques are well-established and can generate realistic audio experiences, they fail to account for the nonlinear acoustic wave effects inherent in real-world sound propagation.
Supervised learning models have emerged as an alternative to DSP, leveraging neural networks to synthesize binaural audio. However, such models face two major limitations: First, the scarcity of position-annotated binaural datasets and second, susceptibility to overfitting to specific acoustic environments, speaker characteristics, and training datasets. The need for specialized equipment for data collection further constraints these approaches, making supervised methods costly and less practical.
To address these challenges, researchers from Google have proposed ZeroBAS, a zero-shot neural method for mono-to-binaural speech synthesis that does not rely on binaural training data. This innovative approach employs parameter-free geometric time warping (GTW) and amplitude scaling (AS) techniques based on source position. These initial binaural signals are further refined using a pretrained denoising vocoder, yielding perceptually realistic binaural audio. Remarkably, ZeroBAS generalizes effectively across diverse room conditions, as demonstrated using the newly introduced TUT Mono-to-Binaural dataset, and achieves performance comparable to, or even better than, state-of-the-art supervised methods on out-of-distribution data.
The ZeroBAS framework comprises a three-stage architecture as follows:
In stage 1, Geometric time warping (GTW) transforms the monaural input into two channels (left and right) by simulating interaural time differences (ITD) based on the relative positions of the sound source and listener’s ears. GTW computes the time delays for the left and right ear channels. The warped signals are then interpolated linearly to generate initial binaural channels.
In stage 2, Amplitude scaling (AS) enhances the spatial realism of the warped signals by simulating the interaural level difference (ILD) based on the inverse-square law. As human perception of sound spatiality relies on both ITD and ILD, with the latter dominant for high-frequency sounds. Using the Euclidean distances of source from both ears and , the amplitudes are scaled.
In stage 3, involves an iterative refinement of the warped and scaled signals using a pretrained denoising vocoder, WaveFit. This vocoder leverages log-mel spectrogram features and denoising diffusion probabilistic models (DDPMs) to generate clean binaural waveforms. By iteratively applying the vocoder, the system mitigates acoustic artifacts and ensures high-quality binaural audio output.
Coming to evaluations, ZeroBAS was evaluated on two datasets (results in Table 1 and 2): the Binaural Speech dataset and the newly introduced TUT Mono-to-Binaural dataset. The latter was designed to test the generalization capabilities of mono-to-binaural synthesis methods in diverse acoustic environments. In objective evaluations, ZeroBAS demonstrated significant improvements over DSP baselines and approached the performance of supervised methods despite not being trained on binaural data. Notably, ZeroBAS achieved superior results on the out-of-distribution TUT dataset, highlighting its robustness across varied conditions.
Subjective evaluations further confirmed the efficacy of ZeroBAS. Mean Opinion Score (MOS) assessments showed that human listeners rated ZeroBAS’s outputs as slightly more natural than those of supervised methods. In MUSHRA evaluations, ZeroBAS achieved comparable spatial quality to supervised models, with listeners unable to discern statistically significant differences.
Even though this method is quite remarkable, it does have some limitations. ZeroBAS struggles to directly process phase information because the vocoder lacks positional conditioning, and it relies on general models instead of environment-specific ones. Despite these constraints, its ability to generalize effectively highlights the potential of zero-shot learning in binaural audio synthesis.
In conclusion, ZeroBAS offers a fascinating, room-agnostic approach to binaural speech synthesis that achieves perceptual quality comparable to supervised methods without requiring binaural training data. Its robust performance across diverse acoustic environments makes it a promising candidate for real-world applications in AR, VR, and immersive audio systems.
Check out the Paper and Details. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. Don’t Forget to join our 65k+ ML SubReddit.
 Recommend Open-Source Platform: Parlant is a framework that transforms how AI agents make decisions in customer-facing scenarios. (Promoted)
 The post Google AI Introduces ZeroBAS: A Neural Method to Synthesize Binaural Audio from Monaural Audio Recordings and Positional Information without Training on Any Binaural Data appeared first on MarkTechPost.