
Exploring diffusion models: A comprehensive guide to key concepts and applications

In a seminal 2015 paper titled “Deep Unsupervised Learning using Nonequilibrium Thermodynamics,” Sohl-Dickstein et al. first introduced diffusion models to deep learning. In 2019, Song et al. published “Generative Modeling by Estimating Gradients of the Data Distribution,” which built on the same principle with a different, score-based approach. The development and training of diffusion models gained real momentum in 2020 with Ho et al.’s paper “Denoising Diffusion Probabilistic Models,” which has since become widely influential. Despite their relatively recent inception, diffusion models have quickly gained prominence and are now recognized as a vital component of machine learning.

Diffusion models are a new class of deep generative models that have broken the long-standing dominance of Generative Adversarial Networks (GANs) in the challenging task of image synthesis across a variety of domains, from computer vision and natural language processing to temporal data modeling and multi-modal modeling. They have also proven versatile and effective in areas such as computational chemistry and medical image reconstruction.

Diffusion models work on the core principle of creating data comparable to the inputs they were trained on. They function by corrupting training data with successively added Gaussian noise and then learning to recover the data by reversing this noising process. In this article, we look at some of the technical underpinnings of diffusion models, focusing first on their key concepts and image generation techniques, followed by a comparison with GANs, and finally their training and applications.

What are diffusion models?

Diffusion models are a type of probabilistic generative model that transforms noise into a representative data sample. They work by adding noise to the training data and then learning to retrieve the data by reversing the noising process. Training involves iteratively denoising the input data and updating the model’s parameters to learn the underlying probability distribution and improve the quality of generated samples.

Diffusion models take inspiration from thermodynamics, where gas molecules diffuse from high-density to low-density areas. This concept of increasing entropy has an analogue in information theory: the loss of information due to noise. By building a learning model that can understand this systematic decay of information, it becomes possible to reverse the process and recover the data from the noise. Similar to VAEs, diffusion models optimize an objective function by projecting data onto a latent space and then recovering it back to the initial state. However, instead of learning the data distribution directly, diffusion models use a Markov chain to model a series of noise distributions and decode data by undoing the noise hierarchically.
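A minimal NumPy sketch of this hierarchical noising follows; the variance schedule `betas` and the step count are illustrative assumptions, not values from any particular paper:

```python
import numpy as np

def forward_noising(x0, betas, rng):
    """Run the forward Markov chain: repeatedly mix the sample with Gaussian
    noise so it drifts toward a standard normal distribution."""
    trajectory = [x0]
    x = x0
    for beta in betas:
        noise = rng.standard_normal(x.shape)
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * noise
        trajectory.append(x)
    return trajectory

# Toy example: a 4-dimensional "data point" noised over 100 steps with an
# assumed linearly increasing variance schedule.
betas = np.linspace(1e-4, 0.5, 100)
traj = forward_noising(np.ones(4), betas, np.random.default_rng(0))
```

After enough steps, the sample is statistically indistinguishable from pure Gaussian noise, which is exactly the state the learned reverse process starts from.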

Variants of diffusion models

Diffusion models can be categorized into three main variants: Denoising Diffusion Probabilistic Models (DDPMs), Score-based Generative Models (SGMs), and Stochastic Differential Equations (Score SDEs). Each formulation represents a distinct approach to modeling and generating data using diffusion processes.

DDPMs: A DDPM makes use of two Markov chains: a forward chain that perturbs data into noise, and a reverse chain that converts noise back into data. The former is typically hand-designed to transform any data distribution into a simple prior (e.g., a standard Gaussian), while the latter reverses the former by learning transition kernels parameterized by deep neural networks. New data points are generated by first sampling a random vector from the prior distribution, followed by ancestral sampling through the reverse Markov chain. DDPMs are commonly used to eliminate noise from visual data and have shown outstanding results in a variety of image denoising applications; they are also used in image and video processing pipelines to improve visual quality.
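The hand-designed forward chain has a convenient closed form: the distribution of the noised sample at any step t depends only on the original data point. A sketch, assuming a constant variance schedule for illustration:

```python
import numpy as np

def q_sample(x0, t, betas, rng):
    """Jump straight to step t of the forward chain using the closed form
    q(x_t | x_0) = N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I),
    where alpha_bar_t is the cumulative product of (1 - beta) up to step t."""
    alpha_bar = np.cumprod(1.0 - betas)[t]
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise

# With a long schedule, late steps are almost pure noise:
betas = np.full(1000, 1e-2)
x_late = q_sample(np.ones(3), 999, betas, np.random.default_rng(0))
```

This shortcut is what makes training efficient: a random timestep can be noised in one operation instead of simulating the whole chain.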

SGMs: The key idea of SGMs is to perturb data with a sequence of Gaussian noise of increasing intensity and jointly estimate the score functions of all the noisy data distributions by training a deep neural network conditioned on the noise level (called a Noise-Conditional Score Network, or NCSN). Samples are generated by chaining the score functions at decreasing noise levels with score-based sampling approaches, including stochastic differential equations, ordinary differential equations, and their various combinations. Training and sampling are completely decoupled in this formulation, so one can use a multitude of sampling techniques after the score functions have been estimated. SGMs create fresh samples from a target distribution by learning a score function that estimates the gradient of the log density of that distribution, and they have shown capabilities similar to GANs in producing high-quality images and videos.
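The score estimation can be sketched as a simple regression known as denoising score matching; `score_fn` stands in for the noise-conditional network, and the oracle below is only exact for toy data concentrated at the origin:

```python
import numpy as np

def dsm_loss(score_fn, x, sigma, rng):
    """Denoising score matching at a single noise level sigma.
    For the Gaussian perturbation kernel N(x_t; x, sigma^2 I), the
    regression target is the kernel's score: -(x_t - x) / sigma^2."""
    noise = rng.standard_normal(x.shape)
    x_t = x + sigma * noise
    target = -(x_t - x) / sigma ** 2
    pred = score_fn(x_t, sigma)
    return np.mean((pred - target) ** 2)

# Oracle score model, exact only for toy data concentrated at the origin:
oracle = lambda x_t, sigma: -x_t / sigma ** 2
loss = dsm_loss(oracle, np.zeros(8), 0.5, np.random.default_rng(0))
```

For the oracle on this toy data the loss is zero; a real NCSN minimizes the same objective averaged over many noise levels.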

SDEs: DDPMs and SGMs can be further generalized to the case of infinitely many steps or noise levels, where the perturbation and denoising processes are solutions to stochastic differential equations (SDEs). This formulation is called Score SDE, as it leverages SDEs for noise perturbation and sample generation, and its denoising process requires estimating the score functions of the noisy data distributions. Beyond generative modeling, SDEs are used to model fluctuations in quantum physics and by financial professionals to price derivatives.
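In this continuous-time view, the forward perturbation and the generative reverse process form a pair of equations, where \(\mathbf{w}\) and \(\bar{\mathbf{w}}\) are standard Wiener processes running forward and backward in time:

```latex
% Forward SDE: data is gradually diffused by a drift f and a noise scale g
\mathrm{d}\mathbf{x} = f(\mathbf{x}, t)\,\mathrm{d}t + g(t)\,\mathrm{d}\mathbf{w}

% Reverse-time SDE: generation runs the process backwards, guided by the
% score \nabla_{\mathbf{x}} \log p_t(\mathbf{x}) of the noisy data distribution
\mathrm{d}\mathbf{x} = \left[ f(\mathbf{x}, t) - g(t)^2 \nabla_{\mathbf{x}} \log p_t(\mathbf{x}) \right]\mathrm{d}t + g(t)\,\mathrm{d}\bar{\mathbf{w}}
```

The only learned quantity is the score term, which is why estimating score functions of the noisy data distributions is the central task of this formulation.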

How do diffusion models work? A detailed overview of the iterative process at work

The iterative process of diffusion models in AI is a fundamental aspect of their functioning, involving multiple iterations or steps to generate high-quality output. To understand this process, let’s delve deeper into how diffusion models work.

Diffusion models are generative models that aim to capture the underlying distribution of a given dataset. They learn to generate new samples that resemble the training data by iteratively refining their output. The process starts with an initial input or “noise” sample, which is passed through the model. The model then applies probabilistic transformations to iteratively update the sample, making it more closely resemble the desired output.

During each iteration, the diffusion model produces latent variables that serve as intermediate representations of the data, capturing essential features and patterns present in the training set. These latent variables are then fed back into the model, allowing it to refine and enhance the generated output further. This feedback loop between the model and its latent variables enables the diffusion model to progressively improve the quality of the generated samples.

The iterative process typically involves applying reversible transformations to the latent variables, which help maintain the statistical properties of the data distribution. By applying these transformations, the model can update the latent variables while preserving the key characteristics of the data. As a result, the generated samples become more coherent, realistic, and representative of the training data distribution.
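The iterative refinement loop can be sketched as DDPM-style ancestral sampling; `denoise_fn` is a hypothetical trained noise-prediction network, and the schedule is an illustrative assumption:

```python
import numpy as np

def sample(denoise_fn, shape, betas, rng):
    """DDPM-style ancestral sampling: start from pure noise and apply the
    learned reverse transition once per step, from t = T-1 down to 0.
    denoise_fn(x, t) predicts the noise component of x (hypothetical model)."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    x = rng.standard_normal(shape)  # initial "noise" sample
    for t in range(len(betas) - 1, -1, -1):
        eps = denoise_fn(x, t)
        # Posterior mean of the reverse transition
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])
        # Add fresh noise on every step except the last
        x = mean + (np.sqrt(betas[t]) * rng.standard_normal(shape) if t > 0 else 0.0)
    return x
```

Each pass through the loop is one of the "iterations" discussed above: the sample is nudged slightly closer to the data distribution rather than generated in a single jump.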


The significance of the iterative process in diffusion models

The significance of the iterative process in diffusion models lies in its ability to generate high-quality output that closely resembles the training data. Through multiple iterations, the model learns to capture complex patterns, dependencies, and statistical properties of the data distribution. By iteratively refining the generated samples, diffusion models can overcome initial noise and improve the fidelity and accuracy of the output.

The iterative process allows diffusion models to capture fine-grained details, subtle correlations, and higher-order dependencies that exist in the training data. By repeatedly updating the latent variables and refining the generated samples, the model gradually aligns its output distribution with the target data distribution. This iterative refinement ensures that the generated samples become increasingly realistic and indistinguishable from the real data.

Moreover, the iterative process enables diffusion models to handle a wide range of data modalities, such as images, text, audio, and more. The model can learn the specific characteristics of each modality by iteratively adapting its generative process. This flexibility makes diffusion models suitable for diverse applications in various domains, including image synthesis, text generation, and data augmentation.

The iterative feedback loop is a vital part of both the training and the functioning of a diffusion model

During training, the model optimizes its parameters to minimize a loss function that quantifies the discrepancy between the generated samples and the training data. The iterative steps in the training process allow the model to gradually refine its generative capabilities, improving the quality and coherence of the output.

Once the diffusion model is trained, the iterative process continues to be a crucial aspect of its functioning during the generation phase. When generating new samples, the model starts with an initial noise sample and iteratively refines it by updating the latent variables. The model’s ability to generate high-quality samples relies on the iterative feedback loop, where the latent variables guide the refinement process.
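The training-time counterpart of this loop can be sketched with the widely used simplified DDPM objective, which regresses a noise prediction onto the true noise; `model` is a hypothetical noise-prediction network:

```python
import numpy as np

def ddpm_training_loss(model, x0, betas, rng):
    """One step of the simplified DDPM objective: pick a random timestep,
    noise the clean sample x0 in closed form, and regress the model's
    noise prediction onto the true noise. `model(x_t, t)` is hypothetical."""
    alpha_bar = np.cumprod(1.0 - betas)
    t = rng.integers(len(betas))
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return np.mean((model(x_t, t) - eps) ** 2)

loss = ddpm_training_loss(lambda x, t: np.zeros_like(x),
                          np.ones(4), np.linspace(1e-4, 0.02, 100),
                          np.random.default_rng(0))
```

Minimizing this mean squared error over many samples and timesteps is what teaches the network the reverse transitions used at generation time.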

Overall, the iterative process in diffusion models plays a vital role in their ability to generate realistic and high-quality output. By iteratively refining the generated samples based on the feedback loop between the model and the latent variables, diffusion models can capture complex data distributions and produce output that closely resembles the training data.

Diffusion models vs. GANs

Diffusion models have gained popularity in recent years as they offer several advantages over GANs. One of the most significant is the stability of the training and generation process, which stems from the iterative nature of the diffusion process. Unlike GANs, where the generator has to go from pure noise to a finished image in a single step, diffusion models operate in a much more controlled and steady manner, using iterative refinement to gradually improve the quality of the generated image.

Compared to GANs, diffusion models require only a single model for training and generation, making them less complex and more efficient. Additionally, diffusion models can handle a wide range of data types, including images, audio, and text. This flexibility has enabled researchers to explore various applications of diffusion models, including text-to-image generation and image inpainting.

How are diffusion models used in image generation?

Diffusion models are designed to learn the underlying patterns and structures in an image dataset and then use this knowledge to generate new, synthetic data samples. In image generation, the goal is to learn the visual patterns and styles that characterize a set of images and then create new images that are similar in style and content.

Unconditional image generation is a type of generative modeling where the model is tasked with generating images from random noise vectors. The idea behind this approach is that by providing the model with random noise, it is forced to learn the patterns and structures that are common across all images in the dataset. This means that the model can generate completely new and unique images that do not necessarily correspond to any specific image in the dataset.

Conditional image generation, on the other hand, involves providing the model with additional information or conditioning variables that guide the image generation process. For example, we could provide a textual description of the desired image, such as “a red apple on a white plate,” or a class label specifying the category of object to generate, such as “car” or “dog.” By conditioning the generation process on this extra information, the model can produce images tailored to specific requirements or preferences, which is useful in many applications, such as image synthesis, style transfer, and image editing. Both techniques are important in generative modeling: unconditional generation lets the model create completely new and unique images, while conditional generation produces images matched to a given specification.
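One common way such conditioning is applied at sampling time is classifier-free guidance, sketched below. The blending weight `w` is a user-chosen knob; both noise predictions would come from the same network run with and without the prompt:

```python
import numpy as np

def guided_eps(eps_cond, eps_uncond, w):
    """Blend conditional and unconditional noise predictions.
    w = 1 recovers the conditional prediction; w > 1 pushes samples
    harder toward the condition (e.g. the text prompt)."""
    return eps_uncond + w * (eps_cond - eps_uncond)

# With w = 2, the estimate is pushed past the conditional prediction:
stronger = guided_eps(np.array([1.0]), np.array([0.0]), 2.0)  # -> [2.0]
```

Larger weights typically make outputs match the prompt more closely at some cost in diversity.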

Examples of diffusion models used for image generation

Diffusion models have gained popularity for image generation tasks due to their ability to generate high-quality, diverse, and realistic images. Examples include DALL-E 2 by OpenAI, Imagen by Google, Stable Diffusion by Stability AI, and Midjourney.

DALL-E 2

DALL-E 2 was launched by OpenAI in April 2022. It builds on OpenAI’s previous groundbreaking work on GLIDE, CLIP, and DALL-E to create original, realistic images and art from text descriptions. DALL-E 2 generates more realistic and accurate images with 4x greater resolution than its predecessor.


Imagen

Google’s diffusion-based image generation model, Imagen, harnesses the capabilities of large transformer language models to comprehend text while relying on the strengths of diffusion models to generate high-fidelity images. Imagen consists of three diffusion models chained together:

  • A base diffusion model that generates a 64×64 resolution image.
  • A super-resolution diffusion model that upsamples the image to 256×256 resolution.
  • A final super-resolution model that upsamples the image to 1024×1024 resolution.

Stable Diffusion

Created by Stability AI, Stable Diffusion builds upon the concept of “High-Resolution Image Synthesis with Latent Diffusion Models” by Rombach et al. It is the only diffusion-based image generation model in this list that is entirely open-source.

The complete architecture of Stable Diffusion consists of three models:

  • A text encoder that converts text prompts into computer-readable embedding vectors.
  • A U-Net, the diffusion model responsible for generating images.
  • A variational autoencoder (VAE) consisting of an encoder and a decoder; the encoder compresses the image into a lower-dimensional latent space in which the U-Net diffusion model operates, and the decoder reconstructs the image generated by the diffusion model back to its original size.
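The three components above can be sketched as a single generation flow; every function name here is a hypothetical stand-in for the real component, and the latent shape is illustrative:

```python
import numpy as np

def generate(prompt, text_encoder, unet, vae_decoder, steps, rng):
    """Latent diffusion flow: encode the prompt, iteratively denoise a
    random latent with the U-Net, then decode the latent to pixel space.
    All four callables are hypothetical stand-ins for the real components."""
    cond = text_encoder(prompt)                # prompt -> embedding vectors
    latent = rng.standard_normal((4, 64, 64))  # small latent, not a full image
    for t in reversed(range(steps)):
        latent = unet(latent, t, cond)         # one denoising step
    return vae_decoder(latent)                 # reconstruct full-size output
```

Running the diffusion loop on the compact latent rather than on full-resolution pixels is what makes this architecture comparatively cheap.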


Midjourney

Midjourney is one of the many AI image generators to have emerged recently. Unlike DALL-E 2 and some of its other competitors, Midjourney produces more dream-like, painterly visuals, appealing to those working in science-fiction literature or artwork that calls for a gothic feel. Where other AI generators lean toward photorealism, Midjourney behaves more like a painting tool. It aims to offer higher image quality, more diverse outputs, a wider stylistic range, support for seamless textures, wider aspect ratios, better image prompting, and greater dynamic range.

How to train diffusion models?

Training is critical in diffusion models as it is the process through which the model learns to generate new samples that closely resemble the training data. By optimizing the model parameters to maximize the likelihood of the observed data, the model can learn the underlying patterns and structure in the data, and generate new samples that capture the same characteristics. The training process enables the model to generalize to new data and perform well on tasks such as image, audio, or text generation. The quality and efficiency of the training process can significantly impact the performance of the model, making it essential to carefully tune hyperparameters and apply regularization techniques to prevent overfitting.

Data gathering: Data gathering is a critical stage in training a diffusion model. The data used to train the model must accurately represent the target domain and its variability for the model to produce the desired results.

Data pre-processing: After collecting the data, it must be cleaned and pre-processed to guarantee that it can be used to train a diffusion model. This might include handling missing or duplicate records, dealing with outliers, or converting the data into a training-ready format.

Data transformation: The next step is data transformation. The data may be normalized or scaled to ensure that all variables fall within similar ranges. The type of transformation used is determined by the specific needs of the diffusion model being trained as well as the nature of the data.
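For image data, a typical transformation is rescaling 8-bit pixel values into [-1, 1], the range a Gaussian-noise diffusion process usually assumes for its clean data; a minimal sketch:

```python
import numpy as np

def to_model_range(images_uint8):
    """Map 8-bit pixel values [0, 255] into [-1, 1], the range a
    Gaussian-noise diffusion process typically assumes for clean data."""
    return images_uint8.astype(np.float32) / 127.5 - 1.0

to_model_range(np.array([0, 255], dtype=np.uint8))  # -> [-1.0, 1.0]
```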

Division into training and test sets: The training set is used to train the model, while the test set is used to evaluate its performance. It is critical that both sets accurately represent the data as a whole and are not biased toward specific conditions.

Comparison of diffusion model variants: DDPMs, score-based generative models, and Score SDEs, introduced earlier, are the best-known formulations. DDPMs are a natural fit for image and video denoising and generation, SGMs decouple training from sampling and allow flexible sampling schemes, and Score SDEs generalize both to continuous time. The variant chosen is determined by the application’s specific needs, which might range from the size of the model to network architectural complexity or the sort of data being modeled.

Selection criteria: When choosing a diffusion model for training, consider the model’s accuracy, computational efficiency, and interpretability. It may also be necessary to evaluate the availability of data and the simplicity with which the model may be integrated into an existing system.

Model hyperparameters: The hyperparameters that impact and govern the behavior of a diffusion model are determined by the application’s unique requirements and the type of data being used. To guarantee that the model performs at its best, the hyperparameters must be properly tuned.

Establishing the model parameters: This stage comprises setting the hyperparameters outlined above, as well as any additional model parameters required by the kind of diffusion model being utilized. It is critical to tune the model parameters properly so that the model can learn the underlying structure of the data without overfitting.

After the data has been divided and the model parameters have been determined, the next step is to train the model. The training procedure usually entails repeatedly iterating over the training set and adjusting the model parameters based on the model’s performance on the training set.
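That procedure can be sketched as a plain SGD skeleton; `grad_fn` is a hypothetical function returning the loss and its gradient for one sample, and the flat-array parameters are an illustrative simplification:

```python
import numpy as np

def train(params, grad_fn, data, betas, epochs, lr, rng):
    """Skeleton of the training loop described above: iterate over the
    training set, compute the loss gradient for each sample, and take a
    plain SGD step. `grad_fn` is a hypothetical (loss, gradient) function."""
    for _ in range(epochs):
        rng.shuffle(data)                 # visit samples in random order
        for x0 in data:
            loss, grad = grad_fn(params, x0, betas, rng)
            params -= lr * grad           # gradient descent update
    return params
```

Real implementations swap the SGD step for an optimizer such as Adam and batch the samples, but the repeated iterate-and-adjust structure is the same.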

Applications of diffusion models

Diffusion models have multidimensional applications catering to diverse industries like gaming, architecture, interior design, and healthcare. They can be used to generate videos, 3D models, and human motion, modify existing images, and restore images.

Text to videos: One of the significant applications of diffusion models is generating videos directly with text prompts. By extending the concept of text-to-image to videos, one can use diffusion models to generate videos from stories, songs, poems, etc. The model can learn the underlying patterns and structures of the video content and generate videos that match the given text prompt.

Text to 3D: In the paper “Dreamfusion,” the authors used NeRFs (Neural Radiance Fields) along with a trained 2D text-to-image diffusion model to perform text-to-3D synthesis. This technique is useful in generating 3D models from textual descriptions, which can be used in various industries like architecture, interior design, and gaming.

Text to motion: Text to motion is another exciting application of diffusion models, where the models are used to generate simple human motion. For instance, the “Human Motion Diffusion Model” can learn human motion and generate a variety of motions like walking, running, and jumping from textual descriptions.

Image to image: Image-to-image (Img2Img) is a technique used to modify existing images. The technique can transform an existing image to a target domain using a text prompt. For example, we can generate a new image with the same content as an existing image but with some transformations. The text prompt provides a textual description of the transformation we want.

Image inpainting: Image inpainting is a technique used to restore images by removing unwanted objects or replacing them with other objects/textures/designs. To perform image inpainting, the user first draws a mask around the object or pixels that need to be altered. After creating the mask, the user can tell the model how it should alter the masked pixels. Diffusion models can be used to generate high-quality images in this context.
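A common way to enforce the mask during sampling is to blend, at each denoising step, the model's output inside the mask with the (appropriately noised) known pixels outside it; a sketch of that single constraint step, with all inputs illustrative:

```python
import numpy as np

def inpaint_step(x_t, known_noised, mask):
    """One masking step during sampling: keep the model's pixels only where
    mask == 1 (the region to regenerate) and paste back the appropriately
    noised known pixels everywhere else."""
    return mask * x_t + (1.0 - mask) * known_noised
```

Repeating this blend at every step keeps the untouched region faithful to the original image while the masked region is freely regenerated.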

Image outpainting: Image outpainting, also known as infinity painting, is a process in which the diffusion model adds details outside/beyond the original image. This technique can be used to extend the original image by utilizing parts of the original image and adding newly generated pixels or reference pixels to bring new textures and concepts using the text prompt.

Research processes: Diffusion models also find application in the study of brain processes, cognitive functions, and the intricate pathways involved in human decision-making. By simulating cognitive processes using the neural foundations of diffusion models, neuroscience researchers can unlock profound insights into the underlying mechanisms at work. These discoveries hold immense potential for advancing the diagnosis and treatment of neurological disorders, ultimately leading to improved patient care and well-being.


The potential of diffusion models is truly remarkable, and we are only scratching the surface of what they can do. These models are expanding rapidly and opening up new opportunities for art, business, and society at large. However, embracing this technology and its capabilities is essential for unlocking its full potential. Businesses need to take action and start implementing diffusion models to keep up with the rapidly changing landscape of technology. By doing so, they can unlock previously untapped levels of productivity and creativity, giving them an edge in their respective industries.

The possibilities for innovation and advancement within the realm of diffusion models are endless, and the time to start exploring them is now. Diffusion models have the potential to redefine the way we live, work, and interact with technology, and we can’t wait to see what the future holds. As we continue to push the boundaries of what is possible, we hope that this guide serves as a valuable resource for those looking to explore the capabilities of diffusion models and the world of AI more broadly.

Stay ahead of the curve and explore the possibilities of diffusion models to position your business at the forefront of innovation. Contact LeewayHertz’s AI experts to build your subsequent diffusion model-powered solution tailored to your needs.


Author’s Bio

Akash Takyar
CEO LeewayHertz
Akash Takyar is the founder and CEO of LeewayHertz. The experience of building over 100 platforms for startups and enterprises allows Akash to rapidly architect and design solutions that are scalable and beautiful.
Akash's ability to build enterprise-grade technology solutions has attracted over 30 Fortune 500 companies, including Siemens, 3M, P&G and Hershey’s.
Akash is an early adopter of new technology, a passionate technology enthusiast, and an investor in AI and IoT startups.

Start a conversation by filling out the form

Once you let us know your requirement, our technical expert will schedule a call and discuss your idea in detail following the signing of an NDA.
All information will be kept confidential.