A comprehensive guide on foundation models

Artificial intelligence has seen a remarkable surge in adoption, driven by technological breakthroughs and ever-increasing access to vast amounts of data. AI systems can now learn from enormous datasets comprising thousands or even millions of examples, which fuels their ability to comprehend and interpret spoken or written language. From recommendation systems in e-commerce apps that suggest items based on our preferences to virtual assistants that simplify everyday tasks, we rely heavily on AI in our daily lives. According to Grand View Research, the global artificial intelligence market was worth USD 136.55 billion in 2022 and is estimated to grow at a CAGR of 37.3% between 2023 and 2030.

However, building and deploying new AI systems can be time-consuming and resource-heavy. Developing a new system requires a sizeable dataset that is well-labeled and specific to the task at hand. If such a dataset does not exist, significant effort and time are required to find and label suitable images, text, or graphs for the dataset. This is where the concept of foundation models steps in, empowering AI systems to leverage pre-existing knowledge and significantly mitigate the demand for extensive training. Foundation models have been instrumental in advancing AI by serving as powerful building blocks for creative output generation. Popularized by the Stanford Institute for Human-centered Artificial Intelligence, foundation models have shown great potential in areas like imagery and language. Examples like GPT-3, BERT, and DALL-E 2 have demonstrated their capabilities by generating complex images and essays based on short prompts, even if they were not explicitly trained to execute such tasks.

Foundation models are a groundbreaking development in Natural Language Processing (NLP). They serve as the base on which various language models are built, providing a solid foundation for understanding and generating high-quality text.

Now, let’s dive deeper into foundation models and understand their operational mechanics, architecture, capabilities and applications.

What are foundation models?

Foundation models are pre-trained models that can be fine-tuned on specific tasks or domains. These highly adaptable and high-performing models find applications across diverse domains, including Natural Language Processing (NLP), computer vision, and multimodal tasks. They play a key role in driving innovation and advancements in artificial intelligence. They are large-scale models, often with billions of parameters, that can generate output in various forms, such as text, images and even code. Unlike traditional models that rely on labeled data for training (supervised learning), foundation models are trained largely on unlabeled data using deep neural networks. This form of unsupervised learning, often called self-supervised learning, lets the models discover patterns and structures in the data without manual annotation.

Delving deeper into their functioning, foundation models are pre-trained on a massive and diverse dataset, such as images or text. The pretraining process involves exposing the model to vast amounts of data, enabling it to learn patterns and features from the data. This makes foundation models extremely powerful and versatile, allowing for seamless utilization across a wide range of use cases with minimal training data. After pretraining, the model is ready to be fine-tuned.

AI foundation models use deep neural networks, which are loosely inspired by the way the human brain works, to handle complex tasks such as generating code or tackling intricate mathematical problems. This power stems from their ability to match patterns, which is vital for AI applications. For instance, a deep neural network can analyze millions of images, learn to associate a particular word (say, “cat”) with the images it describes, and, with additional data and examples, further improve its ability to recognize, visualize, and predict image components. As a result, the model’s scope expands as it analyzes more complex patterns and correlations. However, creating a foundation model in AI is a demanding and expensive process that involves large amounts of computational resources and expertise. Instead, developers can use existing foundation models as the base for building task-specific models through adaptation. The term “foundation” highlights how important stability, safety, and security are for models that so many applications are built on.

The need for foundation models

Pretrained models have become ubiquitous in machine learning, particularly for text and image tasks. Earlier, these models were trained on large labeled datasets, which enabled them to generalize well to new tasks. However, this approach limited the models’ ability to leverage the vast amounts of available unlabeled data. To overcome this, researchers developed foundation models, which are capable of making use of unlabeled data.

Foundation models, such as BERT and GPT, utilize the transformer architecture, which employs self-attention to weigh the importance of different input elements. Introduced in 2017, transformers have largely replaced traditional recurrent neural networks in language tasks, offering advantages like parallel processing and, in models such as BERT, bidirectional context for better language understanding.

The transformer architecture

Transformers are a type of neural network architecture that has witnessed a remarkable surge in popularity within the field of artificial intelligence. The development of transformers was intended to address the issue of sequence transduction, which includes tasks such as neural machine translation, speech recognition, text-to-speech conversion, and more. Transformers utilize an encoder-decoder architecture based on attention layers. The transformer model has proven to be highly effective in capturing dependencies between elements in a sequence and has significantly advanced the field of natural language processing. Let’s now understand in detail how this encoder-decoder architecture works:


Encoder

The encoder in a transformer model plays a crucial role in enabling the model to comprehend and encode the input sequence, facilitating effective understanding and subsequent processing of the text data. The encoder architecture consists of several components and processes that contribute to its functionality:


Input embedding

Computers do not comprehend language in its textual form but rather in the form of numerical data such as vectors or matrices. Therefore, it is necessary to convert words into vectors to enable machine processing. This is where the concept of an embedding space comes into play. An embedding space works like a dictionary in which words with similar meanings are clustered close to one another. Each word is assigned a particular vector in the embedding space according to its meaning, which is how words are converted into vectors for machine processing.
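
To make the idea of an embedding space concrete, here is a minimal Python sketch. The three-dimensional vectors in `EMBEDDINGS` are invented for illustration (real models learn vectors with hundreds or thousands of dimensions), and cosine similarity is just one common way to measure closeness, assumed here rather than taken from the article.

```python
import math

# Hypothetical 3-dimensional embedding table; the values are invented for illustration.
EMBEDDINGS = {
    "cat": [0.90, 0.80, 0.10],
    "dog": [0.85, 0.75, 0.20],
    "car": [0.10, 0.20, 0.95],
}

def embed(sentence):
    """Map each known word in a sentence to its vector."""
    return [EMBEDDINGS[word] for word in sentence.lower().split()]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; closer to 1 means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Words with related meanings sit closer together than unrelated ones.
cat_dog = cosine_similarity(EMBEDDINGS["cat"], EMBEDDINGS["dog"])
cat_car = cosine_similarity(EMBEDDINGS["cat"], EMBEDDINGS["car"])
print(cat_dog > cat_car)
```

In this toy table, "cat" and "dog" point in nearly the same direction, while "car" points elsewhere, which is the clustering behavior described above.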

Positional encoding

Positional encoders address the issue of word ambiguity by providing context based on the position of words within a sentence. Each word is converted into an embedding and augmented with a positional embedding. These combined embeddings create a context vector, which is then processed by the encoder block. The positional encoders enable the model to understand word meanings in different sentence contexts, enhancing its ability to capture relationships and dependencies in the input sequence.
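
The article does not say which positional scheme is used, so the sketch below assumes the sinusoidal encoding from the original transformer design. The function names and the sample vectors are hypothetical; the point is only that the same word embedding yields a different input vector at each position.

```python
import math

def positional_encoding(position, d_model):
    """Sinusoidal encoding: even indices use sine, odd indices use cosine."""
    encoding = []
    for i in range(d_model):
        # Pairs of dimensions (2k, 2k+1) share the same frequency.
        angle = position / (10000 ** ((2 * (i // 2)) / d_model))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding

def add_position(word_embeddings):
    """Add the positional encoding to each word embedding, element by element."""
    d_model = len(word_embeddings[0])
    return [
        [e + p for e, p in zip(embedding, positional_encoding(pos, d_model))]
        for pos, embedding in enumerate(word_embeddings)
    ]

# The same word vector at two different positions ends up with different inputs.
same_word_twice = [[0.5, 0.5, 0.5, 0.5], [0.5, 0.5, 0.5, 0.5]]
encoded = add_position(same_word_twice)
print(encoded[0] != encoded[1])
```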

Multi-head attention

The central aspect of the transformer architecture is its “self-attention” mechanism, which determines the importance of each word relative to the other words in a sentence. This is accomplished by generating an attention vector for each word, which captures the contextual relationships between the words in the sentence.

One limitation of a single self-attention computation is that it tends to give the highest weight to the word itself rather than to its interactions with other words. To overcome this issue, multiple attention vectors are calculated for each word and combined, for example through a weighted average, to compute the final attention vector for that word. This enables the model to capture the relevant contextual relationships between the words in the sentence. The use of multiple attention vectors is referred to as the multi-head attention block.
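
The following is a simplified sketch of self-attention and a multi-head variant in plain Python. It makes two assumptions that go beyond the text: queries, keys, and values are the input vectors themselves (real transformers apply learned projection matrices), and the heads are combined by concatenation, as in the original transformer design, rather than by an explicit weighted average.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product attention with Q = K = V = vectors (no learned weights)."""
    scale = math.sqrt(len(vectors[0]))
    outputs = []
    for query in vectors:
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in vectors]
        weights = softmax(scores)
        # Each output is a weighted average of all the value vectors.
        outputs.append([
            sum(w * value[d] for w, value in zip(weights, vectors))
            for d in range(len(query))
        ])
    return outputs

def multi_head_attention(vectors, num_heads):
    """Split each vector into num_heads slices, attend per slice, then concatenate."""
    head_size = len(vectors[0]) // num_heads
    combined = [[] for _ in vectors]
    for h in range(num_heads):
        head_input = [v[h * head_size:(h + 1) * head_size] for v in vectors]
        for row, head_output in zip(combined, self_attention(head_input)):
            row.extend(head_output)
    return combined

# Hypothetical embeddings for a three-word sentence.
sentence = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0], [1.0, 1.0, 0.0, 0.0]]
attended = multi_head_attention(sentence, num_heads=2)
```

Each head sees a different slice of every word vector, so different heads can pick up different relationships between the same words.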

Feed-forward network

The second step involves the feed-forward neural network applied to each attention vector. The goal is to transform the attention vectors into a format that the subsequent encoder or decoder layers can process. The feed-forward network processes the attention vectors individually, one at a time. Unlike Recurrent Neural Networks (RNNs), the attention vectors are independent of each other. As a result, parallelization can be utilized, which can significantly improve processing speed and efficiency.

With the ability to process attention vectors independently and in parallel, we can pass all the words into the encoder block at the same time, resulting in a set of encoded vectors for each word, which can be computed simultaneously.

Partner with LeewayHertz for robust LLM-based solutions!

Leveraging our hands-on experience and knowledge of AI foundation models, we develop custom foundation model-based solutions tailored to your business needs.


Decoder

In a transformer model, the decoder plays a crucial role in generating the output sequence based on the encoded input representation. While the encoder focuses on understanding the input sequence, the decoder focuses on generating the target sequence by attending to the encoded information.

For example, let’s consider training a language translator from English to French. We provide an English sentence and the corresponding French translation to train the model. The English sentence is processed through the encoder block, while the French sentence is processed through the decoder block.

Like the encoder block, the decoder block also includes an embedding layer and a positional encoder component, which convert the words of the decoder’s input sentence (here, the French sentence) into corresponding vectors.

The decoder architecture consists of several components and processes that contribute to its functionality, such as:

Masked multi-head attention

In the masked multi-head attention process, each word in a sentence dynamically interacts with the surrounding words to uncover the intricate relationships between them. The term “masked” refers to hiding future words from the model, so that it cannot look ahead during this process.

Let’s consider the example of the language translator for better understanding. The input sentence in French is processed through the self-attention mechanism. The self-attention mechanism is a fundamental component used in both the encoder and decoder blocks. It enables the model to capture the relationships and dependencies between different words within a sentence or sequence.

To elaborate on the learning mechanism, during the training process, the model first predicts the French translation of each English word using its previous results. These predicted translations are then compared with the actual French translation (which we provide as input to the decoder block). Based on the comparison, the model updates its matrix values, allowing it to improve its predictions with each iteration. This iterative learning process continues until the model accurately translates the input sentences.

During training, we must hide (or mask) the next French word from the input sequence. This ensures that the model learns to predict each French word based solely on its corresponding English and previously predicted French words without knowledge of the next French word. By masking the next French word, we prevent the model from simply memorizing the input-output pairs and encourage it to learn the underlying patterns and relationships between the two languages. When processing the French sentence in the decoder block, we can only use the previously generated French words to predict the next word. We can’t use any information from future French words as that information is unavailable during inference.

To achieve this, we mask all future French words by transforming them into 0s. This is done by creating a mask matrix with the same shape as the attention matrix, with values of 1 where the corresponding French word is available and 0 where it is masked. During the attention operation, we use this mask matrix to zero out the attention weights for the masked French words, ensuring they don’t contribute to predicting the next word.
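
The mask matrix described above can be sketched as follows. The raw scores are invented for illustration, and `masked_softmax` is a hypothetical helper: it gives masked positions a weight of exactly 0, which has the same effect as the common practice of setting their scores to negative infinity before the softmax.

```python
import math

def causal_mask(length):
    """1 where position i may look at position j (j <= i), 0 for future positions."""
    return [[1 if j <= i else 0 for j in range(length)] for i in range(length)]

def masked_softmax(scores, mask_row):
    """Softmax over allowed positions only; masked positions get weight 0."""
    allowed = [s for s, m in zip(scores, mask_row) if m]
    peak = max(allowed)
    exps = [math.exp(s - peak) if m else 0.0 for s, m in zip(scores, mask_row)]
    total = sum(exps)
    return [e / total for e in exps]

mask = causal_mask(3)
# Hypothetical raw attention scores for a three-word French sentence.
scores = [[0.2, 0.9, 0.4], [0.5, 0.1, 0.8], [0.3, 0.7, 0.6]]
weights = [masked_softmax(row, mask_row) for row, mask_row in zip(scores, mask)]

for row in weights:
    print([round(w, 3) for w in row])
```

The first word can attend only to itself, the second to the first two words, and so on, so no row ever places weight on a future word.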

Multi-head attention block

The next step involves passing the attention vectors obtained from the previous layer and the encoded vectors from the encoder block through another multi-head attention block. This step is called the encoder-decoder attention block, where the results from the encoder block are used. The encoder-decoder attention block is where the main mapping between English and French words happens. The attention vectors generated in this block capture the contextual relationship between the words in the English sentence and those in the corresponding French sentence. This is important for generating accurate translations, as the model needs to comprehend the relationship between the words in both languages to translate them accurately.

Feed-forward network

The feed-forward unit is typically a simple two-layer neural network with a ReLU activation function in between. It is applied to each attention vector independently, allowing for parallelization. This layer helps transform the attention vectors into a form easily digestible by the next layer of the model.
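
As a rough illustration of this two-layer unit, the sketch below applies a linear layer, a ReLU, and a second linear layer to each attention vector separately. The weight matrices `W1` and `W2` and the biases are made-up values; in a trained model they are learned.

```python
def relu(values):
    """Replace negative values with zero."""
    return [max(0.0, v) for v in values]

def linear(vector, weights, bias):
    """Compute weights @ vector + bias; weights has one row per output value."""
    return [sum(w * x for w, x in zip(row, vector)) + b for row, b in zip(weights, bias)]

def feed_forward(vector, w1, b1, w2, b2):
    """Two linear layers with a ReLU in between, applied to a single attention vector."""
    return linear(relu(linear(vector, w1, b1)), w2, b2)

# Hypothetical weights: expand 2 -> 3 dimensions, then project back 3 -> 2.
W1 = [[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]]
B1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0, 1.0], [0.0, 2.0, 0.0]]
B2 = [0.1, -0.1]

attention_vectors = [[1.0, 2.0], [3.0, 1.0]]
# Each vector is transformed independently, so this loop could run in parallel.
outputs = [feed_forward(v, W1, B1, W2, B2) for v in attention_vectors]
```

Because no vector depends on the result for another vector, the list comprehension could be replaced by a batched matrix multiplication, which is how the parallelism is realized in practice.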

The output of the decoder then passes through a linear layer, and the linear layer’s output is passed through a softmax layer, producing a probability distribution over the possible output words in French. Each word is assigned a probability reflecting how likely it is to be the correct translation at that position. The word with the highest probability is then selected as the translated word for that position in the sentence. Repeating this process for each position produces the translated French sentence.
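
The final selection step can be sketched in a few lines of Python. The four-word vocabulary and the scores in `logits` are hypothetical stand-ins for the linear layer's output at one position; picking the highest-probability word like this is often called greedy decoding.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical French vocabulary and linear-layer scores for one output position.
VOCAB = ["le", "chat", "dort", "chien"]
logits = [0.5, 2.5, 0.1, 1.0]

probabilities = softmax(logits)
# Select the word with the highest probability for this position.
best_index = max(range(len(VOCAB)), key=lambda i: probabilities[i])
predicted_word = VOCAB[best_index]
print(predicted_word)
```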

Types of foundation models and how they work

AI foundation models can broadly be categorized into two types: LLMs and diffusion models. Let’s discuss the two in detail:


Large language models

LLMs, or Large Language Models, are machine learning models that utilize deep learning techniques to process and generate natural language. They are trained on vast amounts of textual data and can perform various language-related tasks, such as language translation, text summarization, and question-answering. Transformer-based LLMs have received significant attention owing to their remarkable performance on various natural language processing tasks. Popular examples of LLMs include GPT-3, BERT, and RoBERTa. Let us understand the working of LLMs in detail:


Pretraining

During pretraining, LLMs are trained on vast and diverse datasets that contain a wide range of text, such as books, articles, and websites. Pretraining aims to enable the model to learn language patterns, including grammar, syntax, and semantics.

Pretraining is typically achieved through unsupervised learning, and LLMs can be trained in various ways during this stage. For example, OpenAI’s GPT models are trained to predict subsequent words in a partially complete sentence, while Google’s BERT is trained using a technique called masked language modeling, where the model is required to guess the randomly blanked words in a sentence.
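
The two pretraining objectives differ mainly in how training examples are cut from raw text. The sketch below shows one plausible way to build such examples from a single sentence; the function names, the whitespace tokenization, and the `[MASK]` placeholder and mask rate are simplifying assumptions for illustration, not the exact procedures used by GPT or BERT.

```python
import random

def next_word_examples(tokens):
    """GPT-style: every prefix of the sentence is paired with the word that follows."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def masked_example(tokens, mask_rate=0.15, seed=0):
    """BERT-style: blank out random words; the hidden words become the targets."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for position, token in enumerate(tokens):
        if rng.random() < mask_rate:
            masked.append("[MASK]")
            targets[position] = token
        else:
            masked.append(token)
    return masked, targets

tokens = "the cat sat on the mat".split()
causal_pairs = next_word_examples(tokens)
masked_tokens, hidden_words = masked_example(tokens, mask_rate=0.3, seed=1)

print(causal_pairs[0])
print(masked_tokens, hidden_words)
```

In both cases the labels come from the text itself, which is why no manual annotation is needed.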

During pretraining, the LLM regularly updates the weights of its parameters to minimize the prediction error, which enables it to learn to generate coherent and contextually relevant text.


Fine-tuning

After pretraining, the LLM is fine-tuned on a smaller, task-specific dataset using supervised learning. During fine-tuning, the model is provided with labeled examples of the desired output for a specific task.

The fine-tuning process allows the model to adapt its pre-trained knowledge to the specific requirements of the target task, such as translation, summarization, sentiment analysis, and more. This adaptation is achieved by adjusting the model’s parameters through techniques such as gradient descent and backpropagation, which optimize the model’s performance on the task.
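
A heavily simplified picture of supervised fine-tuning is sketched below. It assumes the pretrained model is frozen and already turns each sentence into a small feature vector (the `features` values are invented), and it trains only a logistic-regression head on top with plain gradient descent. Real fine-tuning usually updates many more parameters through backpropagation, so treat this as an illustration of the update rule rather than of any specific model.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fine_tune(features, labels, learning_rate=0.5, epochs=200):
    """Fit a logistic-regression head with plain gradient descent on labeled examples."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(features, labels):
            prediction = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            error = prediction - y  # gradient of the cross-entropy loss w.r.t. the logit
            for i, xi in enumerate(x):
                grad_w[i] += error * xi
            grad_b += error
        # Step against the average gradient to reduce the loss.
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias

def predict(x, weights, bias):
    return 1 if sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) >= 0.5 else 0

# Hypothetical frozen features from a pretrained model, with sentiment labels (1 = positive).
features = [[0.9, 0.1], [0.8, 0.3], [0.2, 0.9], [0.1, 0.7]]
labels = [1, 1, 0, 0]
weights, bias = fine_tune(features, labels)
predictions = [predict(x, weights, bias) for x in features]
```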

Pretraining and fine-tuning are highly effective in building LLMs that can perform a wide range of NLP tasks with state-of-the-art accuracy.

In-context learning

In-context learning refers to the ability of a language model to learn and perform a task based on a few examples or specific context, even if it was not explicitly trained for that task. It implies that the model can generalize its knowledge from the provided examples to similar scenarios without the need for retraining or additional labeled data.

For example, GPT-3, a language model, can accurately determine the sentiment (positive or negative) of a new sentence after being shown several sentences with known sentiments in its prompt. This capability does not come from updating the model’s parameters; instead, the model adapts its behavior by conditioning its predictions on the given context. In other words, the model can leverage the contextual information it is given, together with what it learned during pretraining, to make informed predictions or decisions without requiring explicit training for each specific task.

In this context, the model appears to have learned something new without undergoing additional training because it can apply its existing knowledge and adapt it to the specific task or context at hand. This phenomenon of in-context learning allows the model to exhibit a level of flexibility and generalization, enabling it to perform tasks beyond its initial training scope.
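
In practice, in-context learning usually means packing a few labeled examples and the new input into a single prompt. The helper below is a hypothetical sketch of that formatting step only; the `Sentence:`/`Sentiment:` layout is an assumed convention, and sending the prompt to an actual model is left out because it depends on the provider's API.

```python
def build_few_shot_prompt(examples, query):
    """Format labeled examples plus one unlabeled query as a single text prompt."""
    lines = [f"Sentence: {text}\nSentiment: {label}\n" for text, label in examples]
    lines.append(f"Sentence: {query}\nSentiment:")
    return "\n".join(lines)

# Hypothetical demonstrations; no additional training takes place when they are used.
examples = [
    ("I loved this film.", "positive"),
    ("The service was terrible.", "negative"),
]
prompt = build_few_shot_prompt(examples, "What a wonderful day!")
print(prompt)
```

The model is expected to continue the text after the final `Sentiment:` with a label, using the earlier examples as its only task description.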

This is the basic working of an LLM. The process may vary depending on the type of LLM.

Diffusion models

Diffusion models are generative models used to generate data similar to the data they were trained on. These models work by adding Gaussian noise to the training data and then learning to reverse this noising process to recover the original data. Diffusion models have shown promising results in various applications, including image and speech synthesis, and are known for generating high-quality samples with fine details. Popular examples of diffusion models include DALL-E 2, Imagen, and GLIDE.

In essence, diffusion models are a type of probabilistic generative model that learn the underlying structure of a dataset by modeling how data points diffuse through the latent space. The diffusion process describes how a data point moves through the latent space over time. In a diffusion model, the process is modeled using a Markov chain, where the Markov chain’s current state represents the data point’s current location in the latent space. The diffusion process is usually defined as a series of stochastic transformations that gradually spread out the data points in the latent space. Neural networks often parameterize these transformations and may depend on additional input, such as the noise level in the data.
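
One way to picture this Markov chain is the noising process sketched below. It assumes the variance-preserving update commonly used in DDPM-style models, where each step scales the previous state and adds a little Gaussian noise; the linear schedule in `betas`, the step count, and the sample data point are arbitrary choices for illustration.

```python
import math
import random

def forward_diffusion(x0, betas, seed=0):
    """Apply x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * noise, step by step."""
    rng = random.Random(seed)
    trajectory = [list(x0)]
    current = list(x0)
    for beta in betas:
        # Each new state depends only on the previous one: a Markov chain.
        current = [
            math.sqrt(1.0 - beta) * value + math.sqrt(beta) * rng.gauss(0.0, 1.0)
            for value in current
        ]
        trajectory.append(current)
    return trajectory

# Hypothetical linear noise schedule with 100 steps.
betas = [0.0001 + (0.05 - 0.0001) * t / 99 for t in range(100)]
data_point = [1.0, -1.0, 0.5]
trajectory = forward_diffusion(data_point, betas)

# Fraction of the original signal that survives after all steps.
signal_kept = math.prod(math.sqrt(1.0 - b) for b in betas)
print(round(signal_kept, 3))
```

As the steps accumulate, less and less of the original data point survives, which is the gradual spreading-out that the model later learns to undo.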

Once the diffusion process is defined, the diffusion model is trained using variational inference. The goal of variational inference is to maximize the log-likelihood of the training data with respect to the model parameters, which in practice is done by maximizing a tractable lower bound on it. This involves introducing a variational distribution over the latent variables and minimizing the Kullback-Leibler (KL) divergence between this distribution and the true posterior distribution. In the case of diffusion models, the variational distribution is often defined as a normal distribution with mean and variance parameters that are functions of the observed data. The KL divergence between the variational and true posterior distributions can be estimated using Monte Carlo methods such as importance sampling or Markov Chain Monte Carlo (MCMC).
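
To give a feel for the KL divergence between normal distributions, the sketch below compares the closed-form value for two univariate Gaussians with a simple Monte Carlo estimate. This is a toy one-dimensional case with made-up means and standard deviations, and it uses plain sampling from the first distribution rather than importance sampling or MCMC.

```python
import math
import random

def kl_gaussians(mean_q, std_q, mean_p, std_p):
    """Closed-form KL(q || p) for two univariate normal distributions."""
    return (
        math.log(std_p / std_q)
        + (std_q ** 2 + (mean_q - mean_p) ** 2) / (2.0 * std_p ** 2)
        - 0.5
    )

def log_normal_pdf(x, mean, std):
    """Log density of a normal distribution at x."""
    return -0.5 * math.log(2.0 * math.pi * std ** 2) - (x - mean) ** 2 / (2.0 * std ** 2)

def kl_monte_carlo(mean_q, std_q, mean_p, std_p, samples=20000, seed=0):
    """Estimate KL(q || p) as the average of log q(x) - log p(x) over draws x ~ q."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        x = rng.gauss(mean_q, std_q)
        total += log_normal_pdf(x, mean_q, std_q) - log_normal_pdf(x, mean_p, std_p)
    return total / samples

exact = kl_gaussians(0.0, 1.0, 1.0, 1.0)
estimate = kl_monte_carlo(0.0, 1.0, 1.0, 1.0)
print(exact, round(estimate, 2))
```

The estimate approaches the exact value as the number of samples grows, which is why sampling-based estimates are usable when no closed form is available.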

During training, the diffusion process is applied to the training data in reverse order, starting from a fully blurred or corrupted version of the original data and then iteratively applying the inverse transformations to estimate the true data. By optimizing the variational parameters and the diffusion process parameters jointly, the model learns to denoise.

After training, the diffusion model can be used for various tasks such as denoising, inpainting, super-resolution, and image generation. For image generation, the model takes a random noise image as input and then applies the learned denoising (reverse) process step by step to generate a new image that follows the same distribution as the training data.

Capabilities of the foundation model

Foundation models have diverse capabilities that can be utilized to power various applications. From processing different modalities to affecting the physical world and interacting with humans, these capabilities are wide-ranging. Let’s discuss them below:

Natural language processing

The most impactful feature of AI foundation models in NLP is not their generation abilities but their remarkable generality and adaptability. A single foundation model can be adapted in various ways to accomplish multiple linguistic tasks. NLP encompasses various tasks, including sentence or document classification, sequence labeling, span relation classification, and text generation. Generating general-purpose language was widely considered a challenging and practically unattainable task in NLP and was believed to require the accomplishment of other linguistic sub-tasks. However, the advent of foundation models in AI that are predominantly trained to generate language has marked a significant change in the role of language generation in NLP.

Highly coherent foundation models can be trained with a simple language generation objective, such as predicting the next word in a sentence. These generative models have become the primary means for language-based machine learning, including the analysis and understanding tasks once considered prerequisites for generation. Before the emergence of foundation models, researchers focused on tackling each complex NLP task separately. Now, many of these tasks can be handled at an almost-human level using publicly released foundation models.

Visual comprehension

Computer vision has been one of the key areas that have driven the adoption of deep learning in AI. Deep learning models have demonstrated remarkable success in image classification, object detection, and other standard tasks in computer vision, thanks to their ability to learn complex representations directly from raw data. However, traditional deep learning models require large annotated datasets for training, which can be time-consuming and expensive.

In recent years, there has been a growing interest in pretraining deep learning models on web-scale raw data rather than curated datasets. These foundation models have shown great promise in computer vision, enabling knowledge transfer to various downstream tasks. Training on multimodal and embodied data has paved the way for progress in more challenging areas, such as 3D geometric and physical understanding and commonsense reasoning. The emergence of foundation models in AI has facilitated acquiring a contextual understanding of the visual world using large quantities of raw data. With the use of self-supervision techniques in foundation models, there has been significant progress in traditional computer vision tasks like image classification, object detection, and visual synthesis. Foundation models in AI have enabled training without explicit annotations and with massive amounts of diverse and raw visual data, resulting in competitive performance.


Robotics

In robotics, foundation models could address task specification and task learning challenges, which are more complex than those in other domains like NLP. Unlike in NLP, where many problems can be formulated as “text-in, text-out” tasks, robotic tasks often have unique input-output formats that require specialized models. As such, there is a need for AI foundation models that can generalize across different tasks, environments, and robot embodiments. In artificial intelligence, there has been a shift towards developing more advanced models that can not only perform specific tasks but also learn from their experiences, adapt to new situations, and reason about the physical world in a way that generalizes to different tasks and contexts. This means that instead of building separate models for each task or situation, a more general model can be created that can be applied to various scenarios.

Foundation models for task specification need to understand and interpret various forms of human communication, including natural language, gestures, demonstrations, and other forms of feedback. They should also be able to incorporate prior knowledge about the task and environment and adapt to changes in task requirements and environmental conditions. This requires the development of models that can reason about task structure, incorporate probabilistic inference to deal with uncertainty, and learn from few or even single demonstrations. Developing foundation models for task specification is a critical step toward enabling robots to operate in various environments and perform various tasks without requiring significant manual engineering or reprogramming for each new setting. Robotic foundation models can leverage self-supervised learning to acquire knowledge from interaction data, enabling quick adaptation to new tasks and environments, thus reducing the need for extensive new data. This approach enhances robots’ learning efficiency and consistency.

Due to the huge combinatorial search space involved, AI has faced long-standing challenges in solving reasoning and search problems, such as theorem proving and program synthesis. Humans, by contrast, possess an innate capacity for intuitive reasoning in most mathematical domains and the ability to transfer knowledge across tasks to facilitate efficient adaptation and abstract reasoning. Recent AI research has shown that deep neural networks can effectively guide the search in these problems. However, these approaches still have limitations in transferring knowledge and reasoning abstractly. Foundation models could help address these challenges: as multi-purpose models with strong generative and multimodal capabilities, they can potentially control the inherent combinatorial explosion in search and enable more efficient adaptation and abstract reasoning.

Thus, combining deep neural networks and AI foundation models holds great potential for advancing the state of the art in reasoning and search problems.

Human engagement

Foundation models in AI can provide a starting point for developers who have little experience in creating AI applications. By using pre-trained models, developers can leverage the knowledge already captured in the model, allowing them to create more effective and efficient applications.

Regarding user experience, AI foundation models can help improve the interaction between humans and AI by focusing on human agency and reflecting user values. For example, a foundation model could provide a chatbot application’s underlying language processing capabilities while incorporating user-specific information and preferences to provide a more personalized experience. This leads to more natural and engaging interactions, improving the user experience.

Components of foundation models

Foundation models in AI are based on a few important components, which are discussed below:


Data

Data is a fundamental component of foundation models. It serves as the raw material upon which these models are trained, enabling them to learn language patterns, context, and semantics. The data’s quality, diversity, and size play a crucial role in determining the performance and capabilities of foundation models.

Training a foundation model involves exposing it to vast amounts of text data from various sources. This data can include books, articles, websites, social media posts, and more. The larger and more diverse the dataset, the better the model’s ability to understand and generate human-like language.

Data serves multiple purposes in the training of foundation models:

  1. Language understanding: By training on a wide range of texts, the model learns a language’s statistical patterns and linguistic structures. This enables it to comprehend the meanings, context, and relationships between words and phrases.
  2. Contextual knowledge: Data provides the model with a wealth of knowledge about different domains, topics, and cultural nuances. This contextual knowledge helps the model generate responses and make predictions that align with human understanding.
  3. Generalization: Exposure to diverse data enables the model to generalize its knowledge beyond the specific examples it has seen during training. It learns to infer patterns and apply them to unseen or similar situations, allowing it to perform well on a variety of language-related tasks.
  4. Bias and fairness: Data plays a crucial role in addressing bias and fairness in foundation models. The quality and diversity of the training data help mitigate biases and ensure that the model’s responses are more inclusive and representative of diverse perspectives.

It is essential to note that the data used for training foundation models should be carefully curated and preprocessed to maintain quality and mitigate potential biases. Pre-training often involves large-scale unsupervised learning, where the model learns to predict the next word in a sentence based on the context provided by the preceding words. This process helps the model acquire a general understanding of language.
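
The idea of learning to predict the next word from unlabeled text can be shown with a deliberately tiny stand-in: a count-based bigram model. It is not a neural network and looks at only one preceding word rather than the full context, and the three-sentence corpus is invented, but it illustrates how raw text alone supplies the training signal.

```python
from collections import Counter, defaultdict

def train_bigram_model(corpus):
    """Count, for every word, which words follow it in the unlabeled text."""
    followers = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for current, following in zip(words, words[1:]):
            followers[current][following] += 1
    return followers

def predict_next(model, word):
    """Return the most frequent follower of a word, or None if the word is unseen."""
    if word not in model:
        return None
    return model[word].most_common(1)[0][0]

# Hypothetical miniature corpus standing in for web-scale text.
corpus = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "a dog sat on the rug",
]
model = train_bigram_model(corpus)
print(predict_next(model, "sat"))
```

A larger and more varied corpus would give the counts more to work with, which mirrors the point that data size and diversity shape what the model can learn.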


Modality

Modality is another important aspect of foundation models. It refers to the different types of input or output that a model can handle, such as text, images, speech, or a combination of these modalities. The ability to handle diverse modalities allows foundation models to understand and generate content in various formats, enhancing their versatility and applicability across different domains.

Here’s how modality functions as a core component of foundation models:

  1. Input modality: Foundation models can accept inputs in different modalities. For example, they can process text inputs to understand the meaning and context of written language. They can also handle image inputs, analyzing visual features and extracting information from images. Additionally, models may support speech inputs, enabling them to transcribe spoken language or perform tasks like voice recognition.
  2. Output modality: Foundation models can generate output in various modalities. They can generate coherent text responses, providing detailed answers, creative writing, or natural language conversation. They can also produce image-based outputs, generating images based on given prompts or completing missing parts of an image. Furthermore, models can generate speech outputs, synthesizing natural-sounding speech in different languages or voices.
  3. Multimodal fusion: Foundation models can effectively combine information from multiple modalities. They can process and analyze inputs that contain both text and images or text and speech, enabling them to understand complex scenarios that involve different forms of data. By fusing information from multiple modalities, the models can provide richer and more comprehensive responses or generate content that incorporates multiple modalities.
  4. Transfer learning: The inclusion of multiple modalities in foundation models allows for transfer learning between different tasks and modalities. Knowledge acquired from one modality can be transferred and applied to another. For example, a model trained on text-to-speech synthesis can leverage that knowledge to improve its performance in other speech-related tasks. This transferability enhances the efficiency and effectiveness of the models across various domains and applications.

By incorporating different modalities, foundation models can handle a wide range of inputs and generate outputs in diverse formats. This flexibility makes them valuable tools for various applications, such as natural language processing, computer vision, speech recognition, multimodal understanding, and more. The modality component broadens the scope of foundation models, enabling them to process and generate content across multiple domains and modalities, advancing the capabilities of AI systems.


Training

Training is the process of optimizing a model’s parameters to enable it to learn and improve its performance on specific tasks. It is a critical step in building robust and effective foundation models.

Here’s how training functions as a core component of foundation models:

  1. Dataset preparation: The training process begins with the preparation of a suitable dataset. The dataset should be diverse, representative of the target domain, and appropriately labeled or annotated, depending on the task. This dataset serves as the training input, allowing the model to learn from the examples and patterns present in the data.
  2. Model architecture: Before training, the model’s architecture needs to be defined. This includes determining the structure, layers, and connections of the neural network that constitutes the foundation model. The architecture defines the model’s capacity to learn and represent information.
  3. Backpropagation and optimization: During training, the model is exposed to the training dataset, and its parameters are adjusted iteratively to minimize the difference between the model’s predictions and the target values (ground truth labels, or targets derived from the data itself in self-supervised training). Backpropagation computes the gradient of the loss with respect to each parameter, and an optimizer such as gradient descent uses those gradients to update the parameters. The goal is to find a set of parameters that minimizes the overall loss or error of the model.
  4. Iterative learning: Training occurs over multiple iterations or epochs. In each iteration, the model is presented with batches of training data, and the parameters are updated based on the computed gradients. By repeatedly exposing the model to the dataset and updating its parameters, the model gradually learns to generalize patterns, understand relationships, and improve its performance on the target task.
  5. Regularization and hyperparameter tuning: To prevent overfitting and improve generalization, regularization techniques such as dropout, weight decay, or batch normalization may be applied during training. Hyperparameters, including learning rate, batch size, and regularization strength, are tuned to find the optimal settings for training the model.
  6. Evaluation and validation: Throughout training, the model’s performance is evaluated using separate validation datasets or metrics. This helps assess its generalization ability and prevent overfitting. The evaluation results guide the decision-making process, such as early stopping criteria or further adjustments to the training process.
  7. Transfer learning: Foundation models can also benefit from transfer learning, where pre-trained models on large-scale datasets are fine-tuned or used as starting points for specific tasks. Transfer learning leverages the knowledge and representations learned from one task or dataset to improve performance on related tasks with smaller or domain-specific datasets.

Training is an iterative and resource-intensive process that involves adjusting model parameters, optimizing performance, and fine-tuning to achieve the desired task-specific outcomes. The quality of the training data, the architecture design, and the training techniques employed significantly influence the effectiveness and capabilities of foundation models.
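The steps above can be sketched on a deliberately tiny problem. The example below is not how a foundation model is trained in practice; it fits a one-feature linear model with mini-batch gradient descent, weight decay, and early stopping on a validation set, purely to make the loop concrete. The function names, the synthetic data (y = 3x + 1 plus noise), and every hyperparameter value are assumptions chosen for illustration.

```python
import random

def make_data(n, rng):
    """Synthetic (x, y) pairs from y = 3x + 1 plus Gaussian noise (assumed toy task)."""
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    return [(x, 3.0 * x + 1.0 + rng.gauss(0, 0.1)) for x in xs]

def mse(data, w, b):
    """Mean squared error of the model w*x + b on a dataset."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def train_model(train, val, lr=0.05, weight_decay=1e-4,
                batch_size=8, max_epochs=200, patience=10, seed=0):
    """Mini-batch gradient descent with weight decay and early stopping."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    best = (float("inf"), w, b)          # (validation loss, w, b)
    stale_epochs = 0

    for _ in range(max_epochs):          # iterative learning over epochs
        data = train[:]
        rng.shuffle(data)
        for i in range(0, len(data), batch_size):
            batch = data[i:i + batch_size]
            # Gradients of the batch MSE with respect to w and b.
            grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
            grad_w += weight_decay * w   # regularization: L2 penalty on the weight
            w -= lr * grad_w             # gradient descent update
            b -= lr * grad_b

        val_loss = mse(val, w, b)        # evaluation on held-out data
        if val_loss < best[0] - 1e-9:
            best = (val_loss, w, b)
            stale_epochs = 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:  # early stopping
                break
    return best

rng = random.Random(42)
train_set, val_set = make_data(64, rng), make_data(32, rng)
val_loss, w, b = train_model(train_set, val_set)
print(f"val_loss={val_loss:.4f} w={w:.2f} b={b:.2f}")
```

The learned `w` and `b` should land close to the values used to generate the data. Real foundation-model training follows the same loop in outline but with billions of parameters, automatic differentiation, and distributed hardware.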


Adaptation

Adapting foundation models in AI is crucial for effective deployment in various applications. There are different approaches to adaptation, including priming, fine-tuning, and updating. Priming involves providing additional input, such as a prompt, to condition the model for a specific task. Fine-tuning involves updating some or all of the model parameters to reflect new information, while updating involves modifying the model’s architecture to better fit the specific application.

Adaptation can be used in various settings, such as domain-specific tasks, where a foundation model is specialized to a particular domain, improving its accuracy. It can also be used for test-time data removal, where the model is adapted to work without access to specific data. Another use case is editing the model behavior on specific inputs, such as increasing or decreasing the sentiment of a text.

Several factors determine the suitability of a particular adaptation procedure for a given application, including the available data, the desired level of accuracy, and the task requirements. Additionally, ethical considerations, such as potential biases and fairness, must be considered.

A long-term goal for future research in foundation model adaptation is to develop more efficient and effective techniques that can quickly adapt a foundation model to new tasks and domains while addressing ethical considerations. The adaptation process should not degrade the generalization capabilities of the model, which are essential for its long-term usefulness.

Many methods are available for adapting foundation models, making it challenging for practitioners to choose the best one for their specific problem or computing environment. Three factors help with this decision: the computing budget, the availability of data, and access to the foundation model’s gradients.

Evaluation metrics

Evaluation is an important part of working with machine learning models because it helps us understand how well a model performs and how we can improve it. It is even more important for foundation models, which are very large and complex, because they are generally used for many different tasks. However, evaluating foundation models comes with its own challenges: they need to be adapted to perform specific tasks, they can acquire unexpected skills, and their performance is difficult to document because they can be used for many different applications.

To frame the evaluation of foundation models, we identify two types of evaluation that stem from the abstraction of these models: intrinsic evaluation of the foundation model, which is inherently independent of any particular task because foundation models are task-agnostic, and extrinsic evaluation of task-specific models, which depends on both the foundation model and the method of adaptation.

Intrinsic evaluation: One approach to evaluating foundation models in AI is to assess them on the task associated with their training objective. For instance, a language model trained to predict the next word given the preceding context can be evaluated by measuring the probabilities it assigns to words given their context in held-out test data (for example, perplexity on language modeling benchmarks such as LAMBADA). While this approach has shown promise in NLP, it has two primary limitations. Firstly, it lacks generality, as foundation models trained using different, incompatible objectives cannot be easily compared or understood within a consistent framework. Secondly, evaluation based on the training objective relies on a proxy relationship to be meaningful. Although this proxy relationship has worked well in some contexts, it is likely to break down when assessing the diverse capabilities of foundation models, their behavior in different domains, and factors beyond in-domain accuracy. Hence, complementary approaches will be necessary to evaluate foundation models effectively.
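To make the perplexity measure mentioned above concrete, here is a minimal sketch. Perplexity is the exponential of the average negative log-probability that the model assigns to each held-out token, so lower is better. The probability lists below are made-up values for illustration, not outputs of any real model.

```python
import math

def perplexity(token_probs):
    """Exponential of the mean negative log-probability per token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)

# Hypothetical probabilities assigned to the correct next token at each position.
uniform_guess = [0.25, 0.25, 0.25, 0.25]   # like guessing among 4 equally likely words
confident = [0.9, 0.8, 0.95, 0.85]         # a model that usually predicts well

print(perplexity(uniform_guess))  # approximately 4.0
print(perplexity(confident))      # lower than the uniform case
```

A perplexity of about 4 for the first list matches the intuition that the model is as uncertain as if it were choosing uniformly among four options at every step.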

Extrinsic evaluation: When evaluating AI models, it’s important to measure their performance on specific tasks, but comparing models can be tricky because different models may require different resources to achieve the same level of accuracy. This is especially true for AI foundation models, which are the basis for many task-specific models. One proposal to address this issue is to track the resources used for pre-training the foundation models. Another is to account for all the resources used to adapt the foundation model to specific tasks, not just the data used for adaptation. This would help identify the most effective adaptation methods and ensure fair model comparisons.

To evaluate task-specific models, we must consider the resources used to adapt the foundation models to these tasks. This involves understanding all the data used to inform adaptation and access requirements for different adaptation methods. While current evaluations may provide some understanding of specific models, they do not necessarily help us understand how different adaptation methods perform or which one to choose in a given context.


Systems

Foundation models are so large that they often require custom software systems like Megatron, DeepSpeed, or Mesh Transformer JAX to train efficiently at scale. These systems are built on top of standard frameworks like PyTorch, TensorFlow, and JAX and rely on several innovations to optimize training, such as additional parallelization dimensions like pipeline parallelism and state-sharding optimizers that reduce memory usage. JIT compilers optimize the computation graph, and optimized libraries like cuDNN and NCCL accelerate computation and communication.

Even with these innovations, training large-scale models is time-consuming and expensive because of the computational power required. For example, the GPT-3 paper reports that training the 175-billion-parameter GPT-3 model took roughly 3,640 petaFLOP/s-days of compute. As a result, training such models requires access to specialized hardware, such as high-end GPUs or TPUs, which can be expensive and difficult to obtain.

Moreover, once these large models are trained, they can be difficult to deploy and maintain in production applications due to their size and complexity. Performing inference with these models can be computationally intensive and may require specialized hardware. Debugging and monitoring these models can also be challenging, particularly in complex systems with many interacting components.

Addressing these challenges will require careful co-design across algorithms, models, software, and hardware systems, as well as new interfaces for programming and deploying ML applications.
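The scale of the GPT-3 compute figure mentioned above can be sanity-checked with a back-of-the-envelope calculation. The sketch below assumes the commonly used rule of thumb that training a dense transformer costs about 6 FLOPs per parameter per training token, together with the 175 billion parameters and roughly 300 billion training tokens reported for GPT-3. The function name is hypothetical, and the result is an estimate rather than a measurement.

```python
def training_petaflop_s_days(num_parameters, num_tokens):
    """Estimate training compute with the ~6 * N * D FLOPs rule of thumb (an approximation)."""
    total_flops = 6 * num_parameters * num_tokens
    flops_per_pf_day = 1e15 * 86_400   # one petaFLOP/s sustained for 24 hours
    return total_flops / flops_per_pf_day

# Assumed inputs: 175e9 parameters and 300e9 training tokens for GPT-3.
gpt3_estimate = training_petaflop_s_days(175e9, 300e9)
print(round(gpt3_estimate))  # about 3646 petaFLOP/s-days
```

On a hypothetical cluster sustaining one petaFLOP/s, that estimate corresponds to roughly ten years of continuous computation, which is why large-scale parallel hardware is needed.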

Using foundation models for downstream tasks

We have covered the components of foundation models in AI and their significance. Now, we will explore how to use these models for downstream tasks after training them. There are two approaches to achieving this: fine-tuning and prompting.

The crucial distinction between fine-tuning and prompting is that fine-tuning modifies the model while prompting alters how the model is utilized by providing input prompts or instructions to guide its output generation.


Fine-tuning

Fine-tuning is a technique where a pre-trained foundation model is further trained on a smaller, task-specific dataset. The aim is to adapt the pre-trained model’s parameters to the specific task. This process often involves adding a few task-specific layers to the pre-trained model and then training either the entire network or only part of it on the new dataset. To fine-tune a foundation model, we need a task-specific dataset of labeled examples. For example, if we use a pre-trained foundation model for sentiment analysis on a specific type of product review, we would need a dataset of reviews for that product type, each labeled with its sentiment.

The advantage of fine-tuning is that it significantly reduces the amount of labeled data required to train a high-quality model for a specific task. Since the pre-trained model has already learned general features from a large dataset, a fine-tuned model can perform well with far fewer labeled examples. However, it’s important to note that fine-tuning changes the pre-trained model’s parameters, which may affect its performance on other tasks. Additionally, fine-tuning requires access to a labeled dataset, which may not always be available or may be costly to obtain.
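As a rough illustration of adding a task-specific layer on top of a pre-trained model, the sketch below trains a small logistic-regression head on sentiment labels while keeping the "pre-trained" part frozen. This is the lightweight variant sometimes called linear probing, not full fine-tuning of every parameter. The frozen word vectors are a hand-written stand-in for a real pre-trained model, and all names, values, and example reviews are hypothetical.

```python
import math

# Stand-in for a pre-trained model: frozen 2-d word vectors (made-up values).
FROZEN_EMBEDDINGS = {
    "great": (1.0, 0.0), "love": (0.9, 0.1), "excellent": (1.0, 0.0),
    "bad": (0.0, 1.0), "broken": (0.1, 0.9), "terrible": (0.0, 1.0),
}

def embed(text):
    """Average the frozen vectors of known words; unknown words are skipped."""
    vecs = [FROZEN_EMBEDDINGS[w] for w in text.lower().split() if w in FROZEN_EMBEDDINGS]
    if not vecs:
        return (0.0, 0.0)
    return tuple(sum(v[i] for v in vecs) / len(vecs) for i in range(2))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_head(examples, lr=0.5, epochs=200):
    """Train only the new head (w, b); the embeddings are never updated."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for text, label in examples:
            x = embed(text)
            p = sigmoid(w[0] * x[0] + w[1] * x[1] + b)
            err = p - label                      # gradient of log loss w.r.t. the logit
            w = [w[i] - lr * err * x[i] for i in range(2)]
            b -= lr * err
    return w, b

def predict_positive(text, head):
    """Probability that the review is positive under the trained head."""
    w, b = head
    x = embed(text)
    return sigmoid(w[0] * x[0] + w[1] * x[1] + b)

# A small labeled dataset: 1 = positive, 0 = negative.
labeled_reviews = [
    ("great phone love it", 1),
    ("terrible battery", 0),
    ("excellent screen", 1),
    ("arrived broken", 0),
]
head = train_head(labeled_reviews)
print(predict_positive("love this excellent case", head))  # above 0.5
print(predict_positive("bad and broken", head))            # below 0.5
```

Note that the word "bad" never appears in the labeled reviews, yet the second prediction still comes out negative because the frozen vectors already place it near other negative words. That is the sense in which pre-trained features reduce the amount of labeled data needed.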


Prompting

Prompting is a technique for using a pre-trained foundation model to solve a specific task without any further training. Instead of fine-tuning the model on a new dataset, prompting involves providing an instruction, or a few task-specific examples or cues, as part of the input, which helps the model understand the context and make predictions accordingly. For example, with language models, prompting can mean providing a few task-related examples and asking the model to fill in a blank or predict the next tokens in a sentence. The model uses the examples to infer the task and generate the appropriate output.
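A few-shot prompt of the kind described above is ultimately just a carefully formatted string. The sketch below assembles one for a sentiment task; the function name, the "Review:" and "Sentiment:" labels, and the example reviews are assumptions made for illustration, and no model is actually called.

```python
def build_few_shot_prompt(examples, query,
                          instruction="Classify the sentiment of each review."):
    """Format an instruction, worked examples, and an unanswered query as one prompt."""
    lines = [instruction, ""]
    for text, label in examples:
        lines.append(f"Review: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    # The final item is left incomplete so the model fills in the answer.
    lines.append(f"Review: {query}")
    lines.append("Sentiment:")
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    examples=[
        ("The battery lasts all day.", "positive"),
        ("It stopped working after a week.", "negative"),
    ],
    query="Setup was quick and painless.",
)
print(prompt)
```

The resulting text would be sent to a language model as its input, and the model’s continuation after the final "Sentiment:" would be read as the prediction. The model’s parameters are not changed at any point.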

Applications of foundation models across industries

Some applications of AI foundation models are:


Healthcare

Using foundation model-powered solutions in healthcare can improve efficiency and accuracy for healthcare providers by reducing time spent editing Electronic Health Records (EHRs) and helping to prevent medical errors. These solutions can help providers create patient visit summaries, retrieve relevant cases and literature, and suggest lab tests, diagnoses, treatments, and discharges. Additionally, AI foundation models can be adapted to help surgical robots perform procedures more accurately.

Foundation model-based solutions can also serve as an interface to patients, providing relevant information about clinical appointments, answering patient questions about preventive care, and providing explanatory medical information. They can also support assistive-care robots for patients. Furthermore, solutions based on these models can serve as an interface to the general public, answering questions about public health and pandemic prevention. It is crucial to note that such an interface must be factually accurate to earn public trust in medical advice.


Biomedicine

Foundation models have also positively impacted biomedical research, particularly drug discovery and disease understanding. This is crucial because biomedical discovery is a complex and expensive process that requires significant human resources and experimental time. Foundation models can facilitate this process by drawing on existing data and published findings, accelerating the discovery of new drugs and treatments. Additionally, foundation models can integrate diverse data modalities in medicine, enabling the investigation of biomedical concepts at multiple scales and from multiple knowledge sources. This can lead to biomedical discoveries that are difficult to obtain using single-modality data. Overall, foundation models enable knowledge transfer across modalities and are helpful for generative tasks in biomedical research, such as generating experimental protocols and designing candidate molecules based on existing data.


Law

Foundation model-powered solutions can enhance legal businesses by automating and optimizing various aspects of legal operations. These solutions can streamline legal research and document analysis, accelerating the extraction of relevant information from vast volumes of legal documents. Virtual legal assistants and chatbots powered by foundation models can provide accessible legal guidance, improving client interactions and support. Predictive analytics based on historical legal data help lawyers make informed decisions and develop effective strategies. Contract generation and review are simplified through automated drafting and risk identification. Additionally, foundation models can support compliance and regulatory analysis, helping firms adhere to complex legal frameworks. By leveraging these technologies, legal businesses can increase efficiency, improve decision-making, and deliver enhanced services to clients while freeing legal professionals to focus on high-value tasks.


Education

Foundation model-powered solutions have the potential to reshape education businesses by offering personalized learning experiences, intelligent tutoring systems, automated content generation, and advanced data analytics. These solutions can tailor education to individual students, providing personalized recommendations and adaptive assessments. They can assist in creating high-quality educational resources, automate tasks like essay grading, and facilitate language learning. Additionally, foundation models enable educators to gain valuable insights from educational data, supporting data-informed decision-making. Virtual assistants and chatbots powered by these models can provide round-the-clock support to students. Together, these solutions can enhance the learning process, improve educational outcomes, and help educators deliver personalized and data-driven instruction.


Conclusion

Foundation models represent a significant breakthrough in artificial intelligence. While they have made a remarkable impact in natural language processing, they also show promise in other domains, such as computer vision, speech recognition, and reinforcement learning. Their potential is vast, and we can expect AI foundation models to transform how AI is used in various fields, like biomedical research and education. By removing the need to train a separate model from scratch for each task, foundation models can significantly reduce the time, cost, and resources involved, leading to faster and more efficient deployment of AI systems. We can expect even more compelling uses of AI foundation models as the technology develops.

Leverage the power of generative AI with our robust generative AI solutions based on foundation models fine-tuned to your business needs. Contact LeewayHertz for all your consultancy and development needs.

Author’s Bio

Akash Takyar

CEO LeewayHertz
Akash Takyar is the founder and CEO of LeewayHertz. The experience of building more than 100 platforms for startups and enterprises allows Akash to rapidly architect and design solutions that are scalable and beautiful.
Akash's ability to build enterprise-grade technology solutions has attracted over 30 Fortune 500 companies, including Siemens, 3M, P&G and Hershey’s.
Akash is an early adopter of new technology, a passionate technology enthusiast, and an investor in AI and IoT startups.
