Music generation using deep learning involves training models to create compositions that imitate the patterns and structures of existing music, typically using architectures such as RNNs, LSTM networks, and transformers. This research explores a different approach: generating musical audio with a non-autoregressive, transformer-based model that responds to musical context. The new paradigm emphasizes listening and responding, unlike existing models that rely on abstract conditioning such as text descriptions. The study incorporates recent advancements in the field and discusses the improvements made to the architecture.
Researchers from SAMI, ByteDance Inc. introduce a non-autoregressive, transformer-based model that listens and responds to musical context, leveraging the publicly available Encodec checkpoint released for the MusicGen model. Evaluation employs standard metrics and a music information retrieval descriptor approach, including Fréchet Audio Distance (FAD) and Music Information Retrieval Descriptor Distance (MIRDD). The resulting model demonstrates competitive audio quality and robust musical alignment with its context, validated through objective metrics and subjective MOS tests.
The research situates itself within recent strides in end-to-end musical audio generation through deep learning, which borrow techniques from image and language processing. It emphasizes the challenge of generating stems that align with existing music and critiques existing models that rely on abstract conditioning. In response, it proposes a training paradigm using a non-autoregressive, transformer-based architecture in which models respond directly to musical context. It introduces two conditioning sources and frames the task as a conditional generation problem, evaluated with objective metrics, music information retrieval descriptors, and listening tests.
The method utilizes a non-autoregressive, transformer-based model for music generation, incorporating a residual vector quantizer in a separate audio encoding model. It combines multiple audio channels into a single sequence element by concatenating their embeddings. Training employs a masking procedure, and classifier-free guidance is applied during token sampling to strengthen alignment with the audio context. Objective metrics, including Fréchet Audio Distance (FAD) and Music Information Retrieval Descriptor Distance (MIRDD), assess model performance by generating example outputs and comparing them with real stems.
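To make these two sampling ideas concrete, below is a minimal, illustrative NumPy sketch of confidence-based iterative unmasking combined with multi-source classifier-free guidance. This is not the authors' implementation: the function names, the MaskGIT-style unmasking schedule, and the toy `score_fn` interface are all assumptions for illustration. In a real system, `score_fn` would run the transformer once per conditioning source (plus an unconditional pass) and combine the resulting logits with `cfg_logits`.

```python
import numpy as np

def cfg_logits(uncond, conds, scales):
    """Multi-source classifier-free guidance (sketch): start from the
    unconditional logits and push toward each conditional distribution
    by its own guidance scale."""
    out = uncond.copy()
    for cond, w in zip(conds, scales):
        out += w * (cond - uncond)
    return out

def iterative_decode(score_fn, seq_len, steps=8):
    """Confidence-based iterative unmasking (MaskGIT-style sketch):
    each step re-predicts all masked positions, then commits only the
    most confident predictions, leaving the rest masked for later."""
    MASK = -1
    tokens = np.full(seq_len, MASK)
    for step in range(steps):
        logits = score_fn(tokens)                     # (seq_len, vocab)
        probs = np.exp(logits - logits.max(-1, keepdims=True))
        probs /= probs.sum(-1, keepdims=True)
        picks = probs.argmax(-1)                      # best token per slot
        conf = probs.max(-1)                          # its probability
        masked = tokens == MASK
        if not masked.any():
            break
        # commit a growing fraction of the remaining masked positions
        n_keep = max(1, int(masked.sum() * (step + 1) / steps))
        order = np.argsort(-(conf * masked))          # confident masked first
        tokens[order[:n_keep]] = picks[order[:n_keep]]
    # fill any positions still masked with the final predictions
    masked = tokens == MASK
    tokens[masked] = picks[masked]
    return tokens
```

The guidance step generalizes ordinary classifier-free guidance: with a single conditioning source and scale 1.0 it simply returns the conditional logits, while larger scales or additional sources (e.g. the audio context and a category embedding) pull the distribution further from the unconditional baseline.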
The study evaluates the trained models using standard metrics and a music information retrieval descriptor approach, including FAD and MIRDD. Comparison with real stems indicates that the models achieve audio quality comparable to state-of-the-art text-conditioned models and demonstrate strong musical coherence with context. A Mean Opinion Score test involving participants with music training further validates the model's ability to produce plausible musical outcomes. MIRDD, which assesses the distributional alignment of generated and real stems, provides a measure of musical coherence and alignment.
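Fréchet Audio Distance itself is simple to state: fit a Gaussian to embeddings of real audio and of generated audio (commonly VGGish features), then compute the Fréchet distance between the two Gaussians. The sketch below assumes the embedding model exists upstream and only shows the distance computation; the helper name and the eigenvalue-based trace trick are illustrative choices, not the paper's code.

```python
import numpy as np

def frechet_audio_distance(emb_real, emb_fake):
    """Fréchet distance between Gaussians fitted to two embedding sets:
    ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^{1/2})."""
    mu1, mu2 = emb_real.mean(0), emb_fake.mean(0)
    s1 = np.cov(emb_real, rowvar=False)
    s2 = np.cov(emb_fake, rowvar=False)
    diff = mu1 - mu2
    # Tr((S1 S2)^{1/2}) via eigenvalues of S1 @ S2: the product of two
    # PSD matrices has nonnegative real eigenvalues, so we can take
    # real square roots directly instead of a full matrix square root.
    eigvals = np.linalg.eigvals(s1 @ s2)
    covmean_trace = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(s1) + np.trace(s2)
                 - 2.0 * covmean_trace)
```

Lower is better: identical embedding distributions give a distance near zero, while a shift in mean or covariance between real and generated audio drives the score up.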
In conclusion, the research can be summarized in the following points:
The research proposes a new training approach for generative models that can respond to musical context.
The approach introduces a non-autoregressive language model with a transformer backbone and two novel improvements: multi-source classifier-free guidance and causal bias during iterative decoding.
The models achieve state-of-the-art audio quality by training on open-source and proprietary datasets.
Standard metrics and a music information retrieval descriptor approach (FAD and MIRDD) validate this audio quality.
A Mean Opinion Score test confirms the model’s capability to generate realistic musical outcomes.
Check out the Paper and Project. All credit for this research goes to the researchers of this project.
The post ByteDance AI Research Introduces StemGen: An End-to-End Music Generation Deep Learning Model Trained to Listen to Musical Context and Respond Appropriately appeared first on MarkTechPost.