How to create a generative audio model?

Generative AI has risen to remarkable prominence, driven by notable advancements like ChatGPT. Whether it’s GPT-4, which accepts both text and image inputs, or Midjourney, renowned for its visual outputs, generative AI tools have begun to assert their dominance in the tech sphere while driving positive transformations for businesses.

From creating compelling content to streamlining workflows, generative AI stands to transform how we work, play, and communicate with the world around us. That is why everyone, from investors and policymakers to developers and end users, is talking about generative AI. Amid this far-reaching influence across multiple domains, generative audio models stand out as a particularly fascinating innovation. These models can generate a wide spectrum of sounds, from musical compositions and instrumental melodies to human speech, ambient noise, and lifelike environmental sounds.

Advancements in generative audio models have reached impressive levels, as demonstrated by the viral release of “Heart on My Sleeve,” a song that uses AI to clone the voices of Drake and The Weeknd. In addition, generative audio models have been used to create AI-generated versions of popular artists like Ariana Grande, Juice WRLD, and XXXTENTACION, resulting in millions of views on platforms like TikTok. These events showcase both the technical progress and the growing popularity of generative audio models.

This article takes a deep dive into generative audio models: you will learn about their advantages and how they work, and then walk through a TensorFlow implementation of the WaveNet generative neural network architecture, which is used for text-to-speech and general audio generation.

Generative AI models and their types

Generative models learn the hidden patterns and relationships in their training data and use that knowledge to produce new data that resembles it.

There are several types of generative models, including Generative Adversarial Networks (GANs), diffusion models such as Stable Diffusion, autoregressive models, Variational Autoencoders (VAEs), and Convolutional GANs (CGANs).

Generative Adversarial Networks (GANs)

GANs are composed of two main neural networks—a generator network and a discriminator network. The generator takes random noise as input and generates synthetic samples, while the discriminator tries to distinguish between real and generated samples. Through an adversarial training process, the generator learns to produce samples that are increasingly similar to the real data, while the discriminator becomes more accurate in distinguishing between real and generated samples. GANs have successfully generated realistic images, videos, and even audio.

Diffusion models

Diffusion models, of which Stable Diffusion is a well-known example, rely on a diffusion process: during training, noise is gradually added to real data, and the model learns to reverse that corruption. Generation starts from pure noise and removes a little of it at each iteration, moving the sample toward the target distribution. By modeling the conditional distribution of the data at each step, the model learns to generate realistic and sharp outputs.

Autoregressive models

Autoregressive models generate sequential data, such as text, time series, or audio, one step at a time. They model a conditional probability distribution in which the probability of the next element depends on the elements generated so far. Popular autoregressive models include PixelCNN for image generation and WaveNet for audio generation.
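
Written out, this conditional structure is the chain-rule factorization of the joint probability of a sequence. In the notation assumed here (not taken from the article), x_1, ..., x_T are the elements of the sequence, for example the T samples of a waveform:

```latex
p(x_1, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1})
```

Each factor is what the model predicts at one step: a distribution over the next element given everything before it.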

Variational Autoencoders (VAEs)

VAEs are generative models that aim to learn the underlying distribution of the training data and generate new samples from that distribution. They consist of an encoder network that maps input data into a lower-dimensional latent space and a decoder network that reconstructs the original data from the latent space. By sampling points from the latent space, VAEs can generate new samples with similar characteristics to the training data.

Convolutional GANs

Convolutional GANs are generative models that use Convolutional Neural Networks (CNNs) for both the generator and the discriminator. CGANs learn the relationships between the different parts of an image or video, making them well suited to generating realistic, high-quality images and videos.

What is a generative audio model?

Simply put, generative audio models use machine learning to generate new sounds based on existing data. This data can take the form of audio recordings, musical scores, speech, sound effects, and even environmental sounds. Once trained, the models can generate new audio content that is unique and original, making them a robust tool for creating immersive and engaging audio experiences.

Several types of generative audio models are available, such as autoregressive models, Variational Autoencoders, and Generative Adversarial Networks.

Each type of generative audio model has its own strengths, and the choice depends on the specific application and the available data. Generative audio models accept different types of prompts to generate audio content. Common examples include text prompts, MIDI data, existing audio recordings, environmental data, and real-time user input. The more data and information the model has to learn from, the more sophisticated and nuanced the generated audio can be.

Real-world applications of generative audio models

Generative audio models have a wide range of applications. They can be used to create music, sound effects, and voices for various media projects such as films, video games, and virtual reality experiences. Let us look into the practical applications of generative audio models, one by one:

  1. Music composition and generation: Generative audio models can be used to compose original music pieces or generate musical accompaniment. These models can learn patterns and styles from existing compositions and create new melodies, harmonies, and rhythms.
  2. Sound synthesis and effects creation: Generative audio models enable the synthesis of realistic and unique sounds, including instruments, environmental effects, and abstract soundscapes. These models can generate sounds that mimic real-world audio or create entirely novel audio experiences.
  3. Voice cloning and speech synthesis: With generative audio models, it is possible to clone a person’s voice and generate speech that sounds like them. This technology has applications in voice assistants, audiobook narration, and voice-over production.
  4. Audio restoration and enhancement: Generative audio models can help restore and enhance audio recordings by reducing noise, removing artifacts, and improving overall sound quality. They can be particularly useful in audio restoration for archival purposes.
  5. Interactive audio experiences: Generative audio models can create dynamic and interactive audio experiences, such as generating adaptive soundtracks for video games or virtual reality environments. These models can respond to user inputs or environmental changes, enhancing immersion and engagement.
  6. Personalized audio content: By leveraging generative audio models, personalized audio content can be created based on individual preferences. This can include personalized playlists, ambient soundscapes, or even AI-generated podcasts tailored to users’ interests.
  7. Creative sound design: Generative audio models offer creative sound design possibilities by generating unique and unconventional sounds for multimedia projects, films, advertisements, and artistic expressions.
  8. Audio transcription and captioning: Generative audio models can aid in automatic speech-to-text transcription and audio captioning, benefiting accessibility in various media formats, including videos, podcasts, and live events.
  9. Data sonification: Generative audio models can convert complex data patterns into auditory representations, allowing researchers and analysts to explore and understand data through sound. This has applications in data visualization, scientific research, and exploratory data analysis.

Generative audio models can also be used in fields such as Natural Language Processing (NLP), speech recognition, and audio data analysis. They can create unique, personalized sounds and modify existing audio recordings by changing their pitch, tempo, or other characteristics, which makes for immersive and engaging audio experiences. Additionally, generative audio models can be used for ambient noise generation and even to create remixes and mashups of existing songs and soundscapes within the creative sound industry.

Benefits of generative audio models

Generative audio models offer several benefits:

  • Creativity: Generative audio models can create completely new sounds and music compositions, supporting all kinds of creative and experimental ventures.
  • Efficiency: Generative models can produce high-quality audio content in less time than humans, which can speed up production processes and save resources.
  • Accessibility: Generative audio models make sound generation possible for all, especially people who may not have a musical background or technical skills to create music using traditional tools.
  • Personalization: Generative audio models can be trained on individual preferences, allowing for a personalized output that caters to individual needs.
  • Innovation: Generative audio models can be used to create unique soundscapes and sound effects that can be utilized in video games and movies.
  • Preservation: Generative audio models can be used to restore and enhance old or imperfect audio recordings.

How do generative audio models work?

Like any other AI model, generative audio models are trained on large data sets to produce new audio. The training process varies from model to model depending on the model’s architecture. Let us understand how this typically works by taking the example of two different models: WaveNet and GANs.


WaveNet

WaveNet is a deep neural network-based generative audio model developed by Google DeepMind. It uses dilated convolutions to generate high-quality audio by conditioning on previous audio samples. The model can generate natural-sounding speech and music, with use cases such as speech synthesis, audio super-resolution, and audio style transfer.

WaveNet generates high-quality audio waveforms as follows:

  • Waveform sampling: The model takes an input waveform, which is typically a sequence of audio samples, and processes it through a series of convolutional layers.
  • Dilated convolution: WaveNet uses dilated convolutional layers to capture long-range dependencies in the audio waveform. The dilation factor determines the size of the receptive field of the convolutional layer, which allows the model to capture patterns that occur over long time scales.
  • Autoregressive model: WaveNet is an autoregressive model, which means that it generates audio samples one at a time, conditioned on previous audio samples. The model predicts the probability distribution of the next sample given the previous samples.
  • Sampling strategy: WaveNet’s final softmax layer outputs a categorical distribution over the quantized amplitude values, and each new sample is drawn from that distribution. Sampling, rather than always picking the most likely value, keeps the generated audio diverse and natural-sounding.
  • Training: WaveNet is trained by maximum likelihood estimation, which maximizes the likelihood of the training data given the model parameters. In practice, the model is trained to predict the next audio sample in the sequence given the previous samples.
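
The autoregressive loop and the sampling step above can be sketched in a few lines of plain Python. This is a minimal illustration, not WaveNet itself: `toy_next_sample_logits` is a hypothetical stand-in for a trained network, and the choice of 256 amplitude classes is an assumption made for the example.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_next_sample_logits(history, num_classes):
    """Hypothetical stand-in for a trained network.

    It simply prefers amplitude classes close to the previous sample,
    which is enough to demonstrate the generation loop.
    """
    previous = history[-1]
    return [-abs(c - previous) for c in range(num_classes)]


def generate(num_samples, num_classes=256, seed=0):
    """Generate samples one at a time, each conditioned on the past."""
    rng = random.Random(seed)
    samples = [num_classes // 2]  # start from the middle amplitude class
    for _ in range(num_samples):
        probs = softmax(toy_next_sample_logits(samples, num_classes))
        # Draw the next class from the predicted categorical distribution
        # instead of always taking the most likely one.
        next_class = rng.choices(range(num_classes), weights=probs, k=1)[0]
        samples.append(next_class)
    return samples[1:]


waveform = generate(num_samples=16)
print(waveform)
```

Because every new sample needs the previous ones, generation is inherently sequential, which is why naive WaveNet generation is slow.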

Generative Adversarial Networks (GANs)

A Generative Adversarial Network (GAN) is a generative model consisting of two neural networks: a generator network that generates audio samples and a discriminator network that evaluates whether the samples are real or fake.

Here is how GANs work:

  • Architecture: The generator network takes a random noise vector as input and produces an audio sample. The discriminator network takes an audio sample, either real or generated, and produces a binary output indicating whether it judges the sample to be real or fake.
  • Training: During training, the generator produces a set of audio samples from random noise, and the discriminator is trained to correctly classify these samples as generated and samples from the dataset as real. At the same time, the generator is trained to create audio samples that the discriminator accepts as real. Both objectives are typically expressed as a binary cross-entropy loss between the discriminator’s output and a label: the true label when updating the discriminator, and the “real” label when updating the generator.
  • Adversarial loss: The adversarial loss reflects how far the distribution of generated audio samples is from the true distribution of audio samples. Training alternates between updating the generator so that it produces more realistic audio and updating the discriminator so that it better differentiates between real and generated audio.
  • Audio Applications: GANs have been used in various audio applications, including music generation, audio style transfer, and audio restoration. In music generation, the generator network is trained to produce new music samples based on existing samples. In audio style transfer, the generator network is trained to transfer the style of one audio sample onto another audio sample. In audio restoration, the generator network is trained to remove noise or distortions from audio samples.
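
The two training objectives described above can be made concrete with a small sketch. The discriminator scores below are made-up numbers standing in for the output of a real discriminator network, and the function names are chosen for this example only.

```python
import math


def binary_cross_entropy(prediction, label, eps=1e-12):
    """BCE for one discriminator output in (0, 1) and a 0/1 label."""
    prediction = min(max(prediction, eps), 1.0 - eps)  # avoid log(0)
    return -(label * math.log(prediction)
             + (1 - label) * math.log(1.0 - prediction))


def discriminator_loss(real_scores, fake_scores):
    """The discriminator wants real -> 1 and generated -> 0."""
    losses = [binary_cross_entropy(s, 1) for s in real_scores]
    losses += [binary_cross_entropy(s, 0) for s in fake_scores]
    return sum(losses) / len(losses)


def generator_loss(fake_scores):
    """The generator wants its samples to be scored as real (label 1)."""
    losses = [binary_cross_entropy(s, 1) for s in fake_scores]
    return sum(losses) / len(losses)


# Hypothetical discriminator outputs: probability that a clip is real.
real_scores = [0.9, 0.8]
fake_scores = [0.2, 0.1]

d_loss = discriminator_loss(real_scores, fake_scores)
g_loss = generator_loss(fake_scores)
print(round(d_loss, 4), round(g_loss, 4))
```

With these scores the discriminator is doing well, so its loss is low and the generator's loss is high; as the generator improves, the fake scores rise and the balance shifts.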

How to create a generative audio model?

This section walks through a TensorFlow implementation of the WaveNet generative neural network architecture for audio generation. The WaveNet architecture directly generates a raw audio waveform and shows excellent results in text-to-speech and general audio generation. The network models the conditional probability of the next sample in the audio waveform, given all previous samples and possibly additional parameters.

After an audio preprocessing step, the input waveform is quantized to a fixed integer range. The integer amplitudes are then one-hot encoded to produce a tensor of shape (num_samples, num_channels).
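
As an illustration of this quantization step, the sketch below uses mu-law companding with 256 levels, the scheme described in the WaveNet paper. That choice, the function names, and the toy input values are assumptions for this example, not details confirmed by the text above.

```python
import math


def mu_law_encode(sample, quantization_channels=256):
    """Map an amplitude in [-1, 1] to an integer in [0, channels - 1]."""
    mu = quantization_channels - 1
    sample = min(max(sample, -1.0), 1.0)
    # Compress the amplitude logarithmically, then rescale to integers.
    magnitude = math.log1p(mu * abs(sample)) / math.log1p(mu)
    signal = math.copysign(magnitude, sample)
    return int((signal + 1) / 2 * mu + 0.5)


def one_hot(index, depth):
    """Return a list of zeros with a single 1 at position `index`."""
    return [1 if i == index else 0 for i in range(depth)]


waveform = [-1.0, -0.5, 0.0, 0.5, 1.0]  # toy input, not real audio
quantized = [mu_law_encode(s) for s in waveform]
encoded = [one_hot(q, 256) for q in quantized]

print(quantized)
print(len(encoded), len(encoded[0]))  # num_samples, num_channels
```

The logarithmic compression spends more of the 256 levels on quiet amplitudes, where small differences are more audible.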

A convolutional layer that only accesses the current and previous inputs then reduces the channel dimension.

The core of the network is constructed as a stack of causal dilated layers, each of which is a dilated convolution (convolution with holes), which only accesses the current and past audio samples.
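
To make "causal" and "dilated" concrete, here is a minimal pure-Python sketch of a single causal dilated convolution, plus the receptive field of a stack of such layers. The filter width of 2 and the doubling dilation schedule are assumptions for the example; the real values come from the model configuration.

```python
def causal_dilated_conv(signal, weights, dilation):
    """1-D causal convolution with the given dilation.

    Output at time t combines signal[t], signal[t - dilation],
    signal[t - 2 * dilation], ... Positions before the start of the
    signal are treated as zeros, so no future sample is ever read.
    """
    output = []
    for t in range(len(signal)):
        total = 0.0
        for k, w in enumerate(weights):
            index = t - k * dilation
            if index >= 0:
                total += w * signal[index]
        output.append(total)
    return output


def receptive_field(dilations, filter_width=2):
    """Number of input samples that influence one output sample."""
    return (filter_width - 1) * sum(dilations) + 1


# Hypothetical dilation schedule: doubling from 1 to 512.
dilations = [2 ** i for i in range(10)]

signal = [0.0] * 8
signal[0] = 1.0  # unit impulse at t = 0
response = causal_dilated_conv(signal, weights=[0.5, 0.5], dilation=2)
print(response)
print(receptive_field(dilations))
```

The impulse shows up at t = 0 and again at t = 2, never earlier than its input. With ten layers doubling from 1 to 512, one output sample depends on 1024 input samples, about 64 ms of audio at 16 kHz, while using only ten layers.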

The outputs of all layers are combined and extended back to the original number of channels by a series of dense postprocessing layers, followed by a softmax function to transform the outputs into a categorical distribution.

The loss function is the cross-entropy between the output for each timestep and the input at the next timestep.
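
The "output at each timestep versus input at the next timestep" pairing can be sketched as follows. The logits are hand-written toy values with 4 amplitude classes instead of a real network's output, and the function names are chosen for this example.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_step_cross_entropy(logits_per_step, quantized_input):
    """Average cross-entropy between the prediction at step t and the
    input at step t + 1.

    logits_per_step[t] holds the network output after seeing inputs 0..t.
    The final step has no target, so it is left out of the loss.
    """
    losses = []
    for t in range(len(quantized_input) - 1):
        probs = softmax(logits_per_step[t])
        target = quantized_input[t + 1]
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)


# Toy example with 4 amplitude classes instead of 256.
quantized_input = [0, 1, 2, 3]
confident = [[0, 9, 0, 0], [0, 0, 9, 0], [0, 0, 0, 9], [0, 0, 0, 0]]
uniform = [[0, 0, 0, 0]] * 4

confident_loss = next_step_cross_entropy(confident, quantized_input)
uniform_loss = next_step_cross_entropy(uniform, quantized_input)
print(confident_loss, uniform_loss)
```

A model that puts most of its probability on the correct next class gets a loss near zero, while a model that guesses uniformly gets log(4) here, which is the baseline any trained model should beat.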

In this repository, the network implementation can be found in


TensorFlow needs to be installed before running the training script. Code is tested on TensorFlow version 1.0.1 for Python 2.7 and Python 3.5.

In addition, librosa must be installed for reading and writing audio.

To install the required python packages, run

pip install -r requirements.txt

For GPU support, you can run the following

pip install -r requirements_gpu.txt

Before running the above, replace “tensorflow-gpu” with “tensorflow” in the requirements file, since the “tensorflow-gpu” package has been removed, and verify that CUDA is installed on your system.

Training the network

You can use any corpus containing .wav files. We’ve mainly used the VCTK corpus (around 10.4GB) so far.

In order to train the network, execute

python --data_dir=corpus

The above command trains the network, where corpus is a directory containing .wav files. The script will recursively collect all .wav files in the directory.

You can see documentation on each of the training settings by running:

python --help

You can find the configuration of the model parameters in wavenet_params.json. These need to stay the same between training and generation.

Global conditioning

Global conditioning means modifying the model so that the id of one of a set of mutually exclusive categories is specified during training and during generation of the .wav file. In the case of VCTK, this id is the integer id of the speaker, of which there are over a hundred. This allows (indeed requires) a speaker id to be specified at generation time to select which speaker the model should mimic. For more details, see the paper or the source code.

Training with global conditioning

The instructions above for training refer to training without global conditioning. To train with global conditioning, specify command-line arguments as follows:

python --data_dir=corpus --gc_channels=32

The --gc_channels argument does two things:

  • It tells the script that it should build a model that includes global conditioning.
  • It specifies the size of the embedding vector that is looked up based on the id of the speaker.

The global conditioning logic is currently “hard-wired” to the VCTK corpus, in that it expects to be able to determine the speaker id from the file-naming pattern used in VCTK, but it can easily be modified.
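
The two ideas involved, reading a speaker id from a file name and looking up an embedding vector for it, can be sketched as below. The file-name pattern (p<speaker id>_<utterance id>.wav), the example path, and the function names are assumptions for illustration; the repository's own code may differ.

```python
import random
import re

# Assumed VCTK-style file name: p<speaker id>_<utterance id>.wav
FILE_PATTERN = re.compile(r"p([0-9]+)_([0-9]+)\.wav")


def speaker_id_from_path(path):
    """Extract the integer speaker id, or None if the name does not match."""
    match = FILE_PATTERN.search(path)
    return int(match.group(1)) if match else None


def make_embedding_table(gc_cardinality, gc_channels, seed=0):
    """One randomly initialised vector per possible speaker id.

    In a real model these vectors are trainable parameters.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(gc_channels)]
            for _ in range(gc_cardinality)]


table = make_embedding_table(gc_cardinality=377, gc_channels=32)

speaker = speaker_id_from_path("corpus/p280/p280_001.wav")  # hypothetical path
embedding = table[speaker]  # the vector the network is conditioned on
print(speaker, len(embedding))
```

Adapting the model to another corpus would mainly mean changing how the category id is derived from each file, and sizing the table to cover the largest id.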

Generating audio

Example output generated by @jyegerlehner based on speaker 280 from the VCTK corpus.

You can use the generation script to generate audio using a previously trained model.

Generating without global conditioning

Run the following:

python --samples 16000 logdir/train/2017-02-13T16-45-34/model.ckpt-80000

In the above, logdir/train/2017-02-13T16-45-34/model.ckpt-80000 needs to be a path to a previously saved model (without extension). The --samples parameter specifies how many audio samples you would like to generate (16000 corresponds to 1 second by default).

The generated waveform can be played back using TensorBoard, or stored as a .wav file by using the --wav_out_path parameter:

python --wav_out_path=generated.wav --samples 16000 logdir/train/2017-02-13T16-45-34/model.ckpt-80000

Passing --save_every in addition to --wav_out_path will save the in-progress wav file every n samples.

python --wav_out_path=generated.wav --save_every 2000 --samples 16000 logdir/train/2017-02-13T16-45-34/model.ckpt-80000

Fast generation is enabled by default. It uses the implementation from the Fast Wavenet repository, which explains how it works. This reduces the time needed to generate samples to a few minutes.

To disable fast generation, run:

python --samples 16000 logdir/train/2017-02-13T16-45-34/model.ckpt-80000 --fast_generation=false

Generating with global conditioning

Generate from a model incorporating global conditioning as follows:

python --samples 16000  --wav_out_path speaker311.wav --gc_channels=32 --gc_cardinality=377 --gc_id=311 logdir/train/2017-02-13T16-45-34/model.ckpt-80000

In the above command:

The parameter “--gc_channels=32” sets the dimensionality of the embedding vector to 32; it must match the value used during training.

“--gc_cardinality=377” is required because 376 is the highest speaker ID in the VCTK corpus, so 377 distinct IDs (0 to 376) are possible. If a different corpus is used, this number should match the value automatically determined and displayed by the training script during training.

“--gc_id=311” specifies the ID of the speaker, here speaker 311, for which a sample will be generated.

Running tests

Install the test requirements

pip install -r requirements_test.txt

Run the test suite


Missing features

Currently there is no local conditioning on extra information which would allow context stacks or controlling what speech is generated.

Evaluation, refining and deployment

After training, you will need to evaluate the performance of your generative audio model. This could involve listening to the generated audio samples, computing metrics such as the signal-to-noise ratio, or running user studies to evaluate the quality of the audio. Based on your evaluation, you may need to refine the model’s architecture or training parameters. Once you’re satisfied with the performance of your generative audio model, you can deploy it to a production environment. This typically involves integrating it into an existing audio application or creating a new application specifically for the model.
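
As one example of such a metric, the sketch below computes a signal-to-noise ratio in decibels. It assumes a clean reference signal is available to compare against, which is the case for tasks like restoration or reconstruction but not for free-form generation; the sine tone and its scaled copy are toy data.

```python
import math


def signal_to_noise_ratio_db(reference, estimate):
    """SNR in decibels, treating (estimate - reference) as the noise."""
    if len(reference) != len(estimate):
        raise ValueError("signals must have the same length")
    signal_power = sum(r * r for r in reference)
    noise_power = sum((e - r) ** 2 for r, e in zip(reference, estimate))
    if noise_power == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(signal_power / noise_power)


# Toy data: a 440 Hz tone sampled at 16 kHz and a slightly scaled copy.
sample_rate = 16000
reference = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(1600)]
estimate = [0.9 * r for r in reference]

snr = signal_to_noise_ratio_db(reference, estimate)
print(round(snr, 2))
```

A higher value means the output is closer to the reference. Numerical metrics like this complement listening tests rather than replace them, since perceived quality does not always track SNR.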

Some notable generative audio models

There are multiple audio models with different capabilities. Let us look into the most prominent ones:


Whisper

Whisper is a general-purpose speech recognition model developed by OpenAI and trained on a diverse audio dataset. It is a multi-task model capable of performing tasks like multilingual speech recognition, speech translation, and language identification. The Whisper v2-large model is accessible via OpenAI’s API under the name “whisper-1”.


Jukebox

Jukebox is a neural network model developed by OpenAI that can generate music, including rudimentary singing, in various genres and artistic styles. OpenAI has made the Jukebox model’s weights and code publicly available, along with a tool that allows users to explore and listen to the generated music samples. The model uses a hierarchical VQ-VAE (Vector Quantized Variational Autoencoder) to compress music into tokens and generate high-quality, extended compositions, surpassing the limitations of previous models that focused on shorter music clips.


MuseNet

MuseNet is a deep neural network created by OpenAI that can generate 4-minute musical compositions using 10 different instruments, and it can combine various styles ranging from country to Mozart to the Beatles. Instead of being explicitly programmed with musical rules, MuseNet has learned harmony, rhythm, and style patterns by analyzing hundreds of thousands of MIDI files and predicting the next token in the sequence. Similar to GPT-2, MuseNet uses unsupervised learning and a large-scale transformer model to generate coherent musical compositions. In the default mode, users can listen to pre-generated random samples, while in advanced mode, users can directly interact with the model to create entirely new musical pieces, although the completions may take longer.


DeepJ

DeepJ is a deep learning model introduced in a research paper on generating music with tunable parameters. It is an end-to-end generative model that can compose music based on a specific mixture of composer styles. It incorporates novel techniques for learning musical style and dynamics, showcasing the ability to control the style of the generated music. In addition to its composition capabilities, DeepJ serves as a virtual DJ, capable of composing and mixing music from different genres. The DeepJ website demonstrates its near real-time music generation capabilities, allowing users to experience its creative potential. DeepJ is also compatible with Amazon Echo, enabling users to play the music it composes in real time through Alexa skills.

Generative audio models: The future

The future of generative audio models is incredibly promising, with exciting possibilities on the horizon. These advancements will revolutionize various aspects of our lives:

One key area of development is achieving unprecedented levels of realism. Generative audio models will continue to refine their output quality, aiming for audio that is virtually indistinguishable from human-generated content. This will lead to more immersive and authentic audio experiences.

Creative expression will be greatly enhanced through generative AI. Artists, musicians, and content creators will have access to powerful tools that enable them to explore new frontiers of creativity. From generating original compositions to crafting unique soundscapes, generative AI will provide innovative avenues for artistic expression.

Personalized audio experiences will become the norm. Generative AI models will enable individuals to customize their audio content based on their preferences and specific contexts. Personalized music playlists, tailored soundtracks for movies or games, and even AI-generated voices that match desired characteristics will be possible.

Generative models will also pave the way for cross-modal fusion. This means seamlessly integrating audio with other sensory modalities such as visuals or haptic feedback. As a result, multisensory experiences that engage multiple senses simultaneously will become more prevalent.

Furthermore, generative audio models will continue to evolve and improve, becoming more accessible and user-friendly. This will enable a broader range of individuals to harness the power of these models for various applications, from entertainment to education and beyond.

Overall, the future of generative audio models holds tremendous promise, reshaping how we experience, create, and interact with audio content in ways that were previously unimaginable.

Summing up

Creating a generative audio model is an iterative process involving several steps, including data collection, preprocessing, model design, and training. These models find applications in the fields of entertainment, advertising, and gaming and assist with various audio-related tasks such as music generation, speech synthesis, and sound effects creation. The field is continuously evolving, with advancements to improve the quality and versatility of the generated audio. Given the growing popularity and demand for customized audio content, generative audio models can potentially bring significant changes to the audio industry.

Want to leverage a generative audio model to step up your business? Contact LeewayHertz today for robust generative audio model-based solutions.


Author’s Bio


Akash Takyar

CEO LeewayHertz
Akash Takyar is the founder and CEO of LeewayHertz. With a proven track record of conceptualizing and architecting 100+ user-centric and scalable solutions for startups and enterprises, he brings a deep understanding of both technical and user experience aspects.
Akash's ability to build enterprise-grade technology solutions has garnered the trust of over 30 Fortune 500 companies, including Siemens, 3M, P&G, and Hershey's. Akash is an early adopter of new technology, a passionate technology enthusiast, and an investor in AI and IoT startups.
