
Fine-tuning Llama 2: Domain adaptation of a pre-trained model


In the dynamic and ever-evolving field of generative AI, a profound sense of competition has taken root, fueled by a relentless quest for innovation and excellence. The introduction of GPT by OpenAI has prompted various businesses to work on creating their own Large Language Models (LLMs). However, creating such sophisticated algorithms is like navigating through a maze of complexities. It demands exhaustive research, massive amounts of relevant data, and the ability to overcome numerous other challenges. Further, the substantial computational power required for these tasks remains a significant hurdle for many.

Amidst this fiercely competitive landscape, where industry heavyweights like OpenAI and Google have already etched their indelible marks, a new contender, Meta, entered the arena with their open-source LLM, Llama, with a goal of democratizing AI. They subsequently upgraded it to Llama 2, which was trained on 40% more data than its predecessor. While all large language models exhibit remarkable efficiency, their adaptability to handle domain-specific inquiries, such as those related to a business’s financial performance or inventory status, may be constrained. To empower these models with domain-specific competence and elevate their precision, a refinement process called fine-tuning is implemented. In this article, we will talk about fine-tuning Llama 2, a model that has opened up new avenues for innovation, research, and commercial applications. Fine-tuning is often well worth the effort: it can yield cost savings, keep confidential data under your control, and even enable a model to surpass renowned models like GPT-4 on specialized tasks.

So, let’s dive deeper into the article and explore the transformative power of Llama 2 in redefining the boundaries of artificial intelligence, creating endless possibilities for businesses.

What is Llama 2?

Meta’s recent unveiling of the Llama 2 suite signifies an important milestone in the evolution of LLMs. Launched in mid-July 2023, Llama 2 emerges as a versatile series of both pre-trained and fine-tuned models, characterized by its diverse parameter configurations of 7B, 13B, and 70B. The release was accompanied by a comprehensive paper detailing the intricacies of its design, training, and implementation, offering invaluable insights into the advancements made in the AI sector.

At the core of Llama 2’s development was an expansive training regimen built upon a staggering 2 trillion tokens—marking a 40% increase from previous endeavors. Sophisticated architectural interventions such as the grouped-query attention (GQA) mechanism further amplified this rigorous training. Particularly in the 70B model, GQA expedites inference, ensuring optimal performance without compromising speed. Furthermore, the model boasts a default context window of 4096 tokens, a significant advancement from previous iterations and a testament to its enhanced capability to handle complex contextual information.

Architecturally, Llama 2 distinguishes itself from its peers through several innovative attributes. It leverages RMSNorm normalization, SwiGLU activation, and rotary positional embeddings (RoPE) to further enhance its data processing prowess. The use of the AdamW optimizer with a cosine learning rate schedule, a weight decay of 0.1, and gradient clipping underscores Meta’s commitment to refining even the most nuanced aspects of model development.

Yet, the true innovation of Llama 2 lies not merely in its architectural and training advancements but in its fine-tuning strategies. Meta has judiciously prioritized quality over quantity in its Supervised Fine-Tuning (SFT) phase, a decision inspired by numerous studies indicating the superior model performance achieved through high-quality data. Complementing this is the Reinforcement Learning with Human Feedback (RLHF) stage, meticulously designed to calibrate the model in line with user preferences. Using a comparative approach where annotators evaluate model outputs, the RLHF process refines Llama 2 to accentuate helpfulness and safety in its responses.

Furthermore, Llama 2’s commercial adaptability is evident in its open-source and commercial character, facilitating ease of use and expansion. It’s not merely a static tool; it’s a dynamic solution optimized for dialogue use cases, as seen in the Llama-2-chat versions available on the Hugging Face platform. While the models differ in parameter size, their consistent optimization for both speed and accuracy underscores their adaptability to diverse operational demands.

Overall, Llama 2, as a member of the Llama family of LLMs, not only aligns with the technical prowess of contemporaries like GPT-3 and PaLM 2 but also introduces several groundbreaking innovations. Its optimized transformer architecture, rigorous training, fine-tuning procedures, and open-source accessibility position it as a formidable asset in the AI landscape, promising a future of more accurate, efficient, and user-aligned AI solutions.

Why use Llama 2?

In today’s AI-driven landscape, responsibility and accountability take center stage. Meta’s Llama 2 is evidence of this heightened focus on creating AI solutions that are transparent, accountable, and open to scrutiny. This section delves into why Llama 2’s approach is pivotal in reshaping our understanding and expectations of AI models.

Open source: The bedrock of transparency

Most LLMs, such as OpenAI’s GPT-3 and GPT-4, Google’s PaLM and PaLM 2, and Anthropic’s Claude, have predominantly been closed source. This limited accessibility restricts the broader research community from fully understanding these models’ intricacies and decision-making processes. Llama 2 stands in stark contrast. Being open source enables anyone with relevant technical expertise not just to access but also to dissect, understand, and potentially modify the model. By enabling people to peruse the research paper detailing Llama 2’s development and training and even download the model for personal or business use, Meta is championing an era of transparency in AI.

Ensuring safety through red-teaming

Safety in AI is paramount, and Llama 2’s development process reflects this priority. Adversarial prompts were generated through intensive red-teaming exercises, conducted both internally and by commissioned third parties, to inform the model’s safety fine-tuning. These rigorous processes are not just a one-time effort; they signify Meta’s ongoing commitment to refining model safety iteratively. The intention is clear: ensuring Llama 2 is robust against unforeseen challenges.

Transparent reporting: An insight into model evaluation

The research paper reflects Meta’s commitment to transparent reporting, detailing the challenges encountered during the development of Llama 2. By highlighting known issues and outlining the steps taken to mitigate them – and those planned for future iterations – Meta is providing an open playbook on the model’s strengths and areas for improvement.

Empowering developers: “Responsible use guide” and “Acceptable use policy”

With great power comes great responsibility. Acknowledging LLMs’ vast potential and inherent risks, Meta has devised a “Responsible Use Guide” to steer developers towards best practices in AI development and safety evaluations. Complementing this is an “Acceptable Use Policy,” which defines boundaries for ensuring the responsible use of the model.

Engaging the global community

Meta recognizes the collective intelligence of the global community. Introducing initiatives such as the Open Innovation AI Research Community invites academic researchers to share insights and research on the responsible development of LLMs. Furthermore, the Llama Impact Challenge is a call to action for public, non-profit, and for-profit entities to harness Llama 2 in addressing critical global challenges like environmental conservation and education.

Launch your project with LeewayHertz

We specialize in fine-tuning pre-trained LLMs to ensure they offer domain-specific responses tailored to your unique business requirements. For the specifics you’re looking for, contact us today!

Why does Llama 2 matter in the AI landscape?

The global AI community has long awaited a shift from commercial monopolization towards open-source research and experimentation. Meta’s Llama 2 heralds this change. By offering an open-source AI, Meta ensures a credible alternative to closed-source AI. It democratizes AI, allowing other companies to develop AI-powered applications under their control, bypassing the commercial constraints of tech giants like Apple, Google, and Amazon.

Llama 2 is not just a technological marvel; it’s a statement on the importance of responsibility, transparency, and collaboration in AI. It embodies a future where AI development prioritizes societal benefits, open dialogue, and ethical considerations.

How does Llama 2 work?

Llama 2, a state-of-the-art language model, has been built using sophisticated training techniques to understand and generate human-like text. To comprehend its operations, one must delve into its data sources, training methodologies, and potential applications.

Data sources and neural network training

Llama 2’s foundational strength is attributed to its extensive training on a staggering 2 trillion tokens. These tokens were sourced from publicly accessible repositories, including:

  • Common Crawl: An expansive archive encompassing billions of web pages.
  • Wikipedia: The free encyclopedia offering a wealth of knowledge on myriad topics.
  • Project Gutenberg: A treasure trove of public domain books.

Each token, be it a word or a semantic fragment, empowers Llama 2 to discern the meaning behind the text. For instance, if the model consistently encounters “Apple” and “iPhone” together, it infers the inherent relationship between these terms, distinguishing it from other related terms such as “apple” and “fruit.”
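To make the notion of tokens concrete, the short snippet below is a minimal sketch using the Hugging Face transformers library and the openly hosted NousResearch/Llama-2-7b-chat-hf checkpoint referenced later in this article; it shows how a sentence is split into the sub-word tokens the model actually learns from.

# Illustrative sketch: inspect how Llama 2's tokenizer splits text into tokens.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("NousResearch/Llama-2-7b-chat-hf")

text = "Apple released a new iPhone this year."
tokens = tokenizer.tokenize(text)   # sub-word pieces the model actually sees
ids = tokenizer.encode(text)        # the integer ids fed to the network

print(tokens)
print(ids)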

Ensuring quality and mitigating bias

Given the vastness and diversity of the internet, training a model solely on such data can inadvertently introduce biases or produce inappropriate content. Acknowledging this, the developers of Llama 2 incorporated additional training mechanisms:

  • Reinforcement Learning with Human Feedback (RLHF): This technique involves human testers who evaluate multiple AI-generated responses. Their feedback is instrumental in guiding the model towards generating more relevant and appropriate content.

Adaptation for conversational context

Llama 2’s chat versions were meticulously fine-tuned using specific data sets to enhance conversational prowess. This ensures that when engaged in a dialogue, Llama 2 responds naturally, simulating human interaction.
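For reference, the chat-tuned variants expect dialogue to be wrapped in a specific prompt template. The sketch below reproduces the commonly documented Llama 2 chat format (a system prompt plus a single user turn); multi-turn conversations simply repeat the [INST] ... [/INST] pattern.

# Sketch of the Llama 2 chat prompt template: a system prompt wrapped in
# <<SYS>> tags, followed by the user's message inside [INST] ... [/INST].
system_prompt = "You are a helpful, concise assistant."
user_message = "Summarize our Q3 inventory status in two sentences."

prompt = (
    "<s>[INST] <<SYS>>\n"
    f"{system_prompt}\n"
    "<</SYS>>\n\n"
    f"{user_message} [/INST]"
)
print(prompt)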

Customization and fine-tuning

One of Llama 2’s defining features is its adaptability. Organizations can mold it to resonate with their unique brand voice. For instance, if a firm wishes to produce summaries reflecting its distinct style, Llama 2 can be trained on numerous examples to achieve this. Similarly, the model can be fine-tuned for customer support optimization using FAQs and chat logs, allowing it to respond precisely to user queries.

Llama 2’s robustness and adaptability are products of its comprehensive training and fine-tuning methodologies. Its ability to assimilate vast data, combined with human feedback mechanisms and customization options, positions it at the forefront of the language model domain.

A thorough analysis of Llama 2 in comparison to other leading LLMs

The advancement of AI, especially in the domain of large language models, has been nothing short of extraordinary. This is prominently demonstrated by Llama 2, an LLM designed with adaptability in mind to empower developers and researchers to explore new horizons and create innovative applications. Here, we explore the outcomes of some experiments carried out to evaluate how Llama 2 compares to giants like OpenAI’s GPT and Google’s PaLM.

  • Creative aptitude: Llama 2 was prompted to simulate a sarcasm-laden dialogue on space exploration; the resultant discourse, although impressive, trailed slightly behind ChatGPT’s. When compared with Google’s Bard, Llama 2 showcased superior flair. Thus, while ChatGPT remains the frontrunner in creative engagements, Llama 2 holds a commendable position amongst its peers.
  • Programming capabilities: Llama 2 was pitted against ChatGPT and Bard in a coding challenge. The task? To develop functional applications ranging from a basic to-do list to a Tetris game. Although ChatGPT mastered each challenge, Llama 2, akin to Bard, efficiently crafted the to-do list and an authentication system, stumbling only on the Tetris game.
  • Mathematical proficiency: Llama 2’s prowess in solving algebraic and logical math problems was noteworthy, particularly when compared to Bard. However, ChatGPT’s mathematical proficiency remained unmatched. Remarkably, Llama 2 excelled in certain problems where its predecessors, in their early stages, had faltered.
  • Reasoning and commonsense: A facet that remains a challenge for many AI models is commonsense reasoning. ChatGPT unsurprisingly led the pack. The contest for the second spot was neck and neck between Bard and Llama 2, with Bard slightly edging ahead.

Llama 2, though an impressive foundational model, still has room for growth compared to certain other specialized, fine-tuned models on the market. Foundational models like Llama 2 are designed with versatility and future adaptability at their core, unlike fine-tuned models optimized for domain-specific expertise. Given its nascent stage and its ‘foundational’ nature, the potential avenues for Llama 2’s evolution are promising.

What does fine-tuning an LLM mean?


When discussing the fine-tuning of LLMs, it’s crucial to recognize that such practices extend beyond language models. Fine-tuning can be applied across various machine learning models based on different use cases.

Machine learning models are trained to identify patterns within given datasets. For instance, a Convolutional Neural Network (CNN) designed to detect cars in urban areas would be highly proficient in that domain due to training on relevant images. Yet, when faced with detecting trucks on highways, its efficacy might decrease due to unfamiliarity with that data distribution. Rather than starting from scratch with a new training dataset, fine-tuning allows for adjustments to be made to the model to accommodate new data types.
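As a minimal illustration of this idea outside language models, the sketch below takes an ImageNet pre-trained classifier, freezes its feature extractor, and retrains only a new output layer for a hypothetical two-class car-versus-truck task; the dataset itself is omitted.

# Fine-tuning sketch: reuse a pre-trained CNN and retrain only its new head.
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1")  # backbone pre-trained on ImageNet

# Freeze the pre-trained feature extractor so its weights stay fixed
for param in model.parameters():
    param.requires_grad = False

# Replace the classification head for the new task (e.g., car vs. truck)
model.fc = nn.Linear(model.fc.in_features, 2)

# Only the new head's parameters are handed to the optimizer
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)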

Several advanced LLMs are available, including GPT-3, Bloom, BERT, T5, and XLNet. GPT-3, for instance, is a commercial model with 175 billion parameters, making it adept at a wide range of natural language processing tasks. BERT, conversely, is a more accessible open-source model excelling in understanding contextual word relationships. The choice between models like GPT-3 and BERT largely depends on the specific task at hand, be it text generation or text classification.

Techniques for LLM fine-tuning

The process of fine-tuning LLMs is intricate, with varying techniques ideal for specific applications. Sometimes, the goal is to train a model to suit a novel task.

Imagine having a pre-trained LLM skilled in text generation, but you want it to perform sentiment analysis. This will entail remodeling the model with subtle architectural tweaks before diving into the fine-tuning phase.

In such a context, you will primarily harness the numeric vectors called embeddings generated by the LLM’s transformer component. These embeddings carry detailed features of the given input.

Certain LLMs directly produce these embeddings, whereas others, such as the GPT series, use these embeddings for token or text generation. During adaptation, the LLM’s embedding layer gets linked to a classification system, typically a set of fully connected layers translating embeddings into class probabilities. The emphasis here lies in training the classification segment using model-driven embeddings.

While the LLM’s attention layers generally remain unchanged—offering computational efficiency—the classifier requires a supervised learning dataset with text instances and their respective classifications.
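As an illustrative sketch of this setup – using BERT, mentioned earlier, as the frozen backbone and a two-class sentiment head as a stand-in for the new task – the snippet below keeps the pre-trained weights untouched and trains only a small classification layer on top of pooled embeddings.

# Sketch: a small classification head on top of a frozen pre-trained model.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

backbone = AutoModel.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Keep the pre-trained (attention) layers frozen
for param in backbone.parameters():
    param.requires_grad = False

# Trainable head mapping pooled embeddings to class probabilities
classifier = nn.Linear(backbone.config.hidden_size, 2)  # e.g., positive / negative

inputs = tokenizer("The service was excellent!", return_tensors="pt")
with torch.no_grad():
    embeddings = backbone(**inputs).last_hidden_state.mean(dim=1)  # mean-pooled features

logits = classifier(embeddings)   # only this layer's weights get updated
print(logits.softmax(dim=-1))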

The magnitude of your fine-tuning data relies on task intricacy and classifier specifics. Yet, occasions demand a deeper adjustment, requiring unlocking attention layers for a full-blown fine-tuning project.

It’s worth noting that this intensive process is also dependent on the model size. Besides, there exist strategies to streamline costs related to fine-tuning. Let’s delve deeper and explore some prominent fine-tuning techniques.

 

  • Unsupervised versus supervised fine-tuning (SFT): Sometimes, there’s a need to refresh the LLM’s knowledge base without necessarily changing its behavior. If, for instance, you intend to adapt the model to medical terminologies or a novel language, an expansive, unstructured dataset suffices: the goal is to immerse the model in a sea of tokens representative of the new domain or anticipated input types, and leveraging vast unstructured datasets scales well thanks to unsupervised or self-supervised methodologies. However, there are cases where merely updating the model’s information reservoir falls short and the LLM’s behavior itself needs an overhaul. This calls for a supervised fine-tuning (SFT) dataset, complete with prompts and expected outcomes. In short, you can choose between unsupervised pre-training on ample unstructured data and supervised fine-tuning on labeled data for a specific task; the latter is pivotal for models like ChatGPT, which are designed to be highly responsive to user directives.

 

  • Reinforcement Learning from Human Feedback (RLHF): To elevate SFT, some practitioners employ reinforcement learning from human feedback, a complex procedure that, at present, only well-resourced organizations have the capacity to run. While RLHF techniques vary, they all emphasize human-guided LLM training: human reviewers assess the model’s outputs for certain prompts, guiding the model toward desired results. Take ChatGPT by OpenAI as an RLHF benchmark: human feedback aids in developing a reward model mirroring human preferences, and the LLM then undergoes rigorous reinforcement learning to optimize its outputs against this reward signal.

 

  • Parameter-efficient Fine-tuning (PEFT): An emerging field within LLM fine-tuning, PEFT tries to minimize the resources spent on updating model parameters by limiting which parameters are altered. One such method gaining traction is Low-rank Adaptation (LoRA). The essence of LoRA is that only a small number of task-specific adjustments are needed for downstream tasks, so a compact low-rank matrix can capture the task-specific nuances. Implementing LoRA means training this compact matrix rather than the entire set of the LLM’s parameters. Once trained, the LoRA weights can either be merged into the primary LLM or applied during inference. Adopting techniques like LoRA can reduce fine-tuning expenditure considerably while enabling the storage of numerous fine-tuned adapters ready for integration during LLM operations.

 

  • Reinforcement Learning from AI Feedback (RLAIF)

Fine-tuning a Large Language Model (LLM) using Reinforcement Learning from AI Feedback (RLAIF) involves a structured process that ensures the model’s behavior aligns with a set of predefined principles or guidelines, often encapsulated in a Constitution. Here’s an overview of the steps involved in fine-tuning an LLM using RLAIF:

Define the Constitution

  • Constitution creation: Begin by defining the Constitution, a document or set of guidelines that outlines the principles, ethics, and behavioral norms that the AI model should adhere to. This Constitution will guide the AI Feedback Model in generating preferences.

Set up the AI feedback model

  • Model selection: Choose or develop an AI feedback model capable of understanding and applying the principles outlined in the Constitution.
  • Model training (if necessary): If the AI feedback model isn’t pre-trained, you might need to train it to interpret the Constitution and evaluate responses based on it. This could involve supervised learning, using a dataset where responses are annotated based on their alignment with constitutional principles.

Generate feedback data

  • Feedback generation: Use the AI feedback model to evaluate pairs of prompt/response instances. For each pair, the model assigns a preference score, indicating which response aligns better with the principles in the Constitution.

Train the Preference Model (PM)

  • Data preparation: Organize the AI-generated feedback into a dataset suitable for training the Preference Model (PM).
  • Preference model training: Train the model on this dataset. It learns to predict the preferred response to a given prompt based on the feedback scores provided by the AI feedback model.

Fine-tune the LLM

  • Integration with reinforcement learning: Integrate the trained preference model into a reinforcement learning framework. In this setup, the preference model provides the reward signal based on how well a response from the LLM aligns with the constitutional principles.
  • LLM fine-tuning: Fine-tune the LLM using this reinforcement learning setup. The LLM generates responses to prompts, and the responses are evaluated by the PM. The LLM then adjusts its parameters to maximize the reward signal, effectively learning to produce responses that better align with the constitutional principles.

Evaluation and iteration

  • Model evaluation: After fine-tuning, evaluate the LLM’s performance to ensure it aligns with the desired principles and effectively handles a variety of prompts.
  • Feedback loop: If the performance is not satisfactory or if there’s room for improvement, you might need to iterate over the process. This could involve refining the Constitution, adjusting the AI feedback model, retraining the preference model, or further fine-tuning the LLM.

Deployment and monitoring

  • Deployment: Once the fine-tuning process meets the performance and ethical standards, deploy the model.
  • Continuous monitoring: Regularly monitor the model’s performance and behavior to ensure it continues to align with the constitutional principles, adapting to new data and evolving requirements.

Fine-tuning an LLM using RLAIF is a complex process that involves careful design, consistent evaluation, and ongoing adjustment to ensure that the model’s behavior aligns with human values and ethical standards. It’s a dynamic process that benefits from continuous monitoring and iterative improvement.
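There is no single standard library for RLAIF, so the sketch below is purely schematic: it illustrates the feedback-generation step only, and score_against_constitution is a hypothetical placeholder for a real AI feedback model queried with the Constitution.

# Schematic RLAIF feedback generation (not a full training pipeline).
CONSTITUTION = "Responses must be helpful, honest, and avoid harmful content."

def score_against_constitution(prompt: str, response: str) -> float:
    """Hypothetical stand-in: a real implementation would query a feedback LLM
    with the Constitution and return how well the response adheres to it."""
    return float(len(response.strip()) > 0)  # trivial placeholder score

def build_preference_pair(prompt: str, response_a: str, response_b: str) -> dict:
    score_a = score_against_constitution(prompt, response_a)
    score_b = score_against_constitution(prompt, response_b)
    return {
        "prompt": prompt,
        "chosen": response_a if score_a >= score_b else response_b,
        "rejected": response_b if score_a >= score_b else response_a,
    }

# Records like these ("prompt"/"chosen"/"rejected") are what the preference model
# is trained on before the LLM is fine-tuned with reinforcement learning.
print(build_preference_pair("Explain photosynthesis.", "It converts light into chemical energy.", ""))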


How can we perform fine-tuning on Llama 2?

PEFT approaches – LoRA and QLoRA

Parameter-efficient Fine-tuning (PEFT) presents an effective approach to fine-tuning LLMs. Distinct from traditional methods that mandate extensive parameter updates, PEFT focuses on refining a select subset of parameters, minimizing computational demands and expediting the training process. By gauging the significance of individual parameters based on their influence on the overall model, PEFT prioritizes those with maximal impact. Consequently, only these pivotal parameters undergo adjustments during the fine-tuning phase, while others remain static. Such a strategy curtails computational and temporal overheads and paves the way for swift model iteration and deployment. As PEFT emerges as a frontrunner in optimization techniques, it’s vital to recognize that it remains a dynamic field, with continuous research ushering in nuanced variations and enhancements. The choice of PEFT application will invariably depend on specific research goals and practical contexts.

PEFT is an innovative approach that effectively reduces RAM and storage demands. It achieves this by primarily refining a select set of parameters while maintaining the majority in their original state. PEFT’s strength lies in its ability to foster robust generalization even when datasets are of limited volume. Moreover, it augments the model’s reusability and transferability. Small model checkpoints, derived from PEFT, seamlessly integrate with the foundational model, promoting versatile fine-tuning across diverse scenarios by incorporating PEFT-specific parameters. A salient feature is the preservation of insights from the pre-training phase, ensuring the model remains resilient to extensive memory loss or catastrophic forgetting.

Prominent PEFT strategies emphasize the integrity of the pre-trained base, introducing supplementary layers or parameters termed “Adapters.” Through a process dubbed “adapter-tuning,” these layers are integrated with the foundational model, with tuning efforts concentrated on the novel layers alone. A notable challenge with this model is the heightened latency during the inference stage, potentially hampering efficiency in various contexts.

Parameter-efficient fine-tuning has become a pivotal area of focus within AI, and there are myriad techniques to achieve this. Among these, the Low-rank Adaptation (LoRA) and its enhanced counterpart, QLoRA, are distinguished for their effectiveness.

Low-rank Adaptation (LoRA)


LoRA introduces an innovative paradigm in model fine-tuning, offering a modular method adept at domain-specific tasks and transferring learning capabilities. The intrinsic beauty of LoRA lies in its ability to be executed using minimal resources while being memory-conservative.

A closer examination of the LoRA technique reveals the following steps and intricacies:

  • Pre-trained parameter preservation: The original neural network’s foundational parameters (W) remain unaltered during the adaptation process.
  • Inclusion of new parameters: Accompanying this original setup, supplementary networks (denoted as WA and WB) are embedded. These networks champion the use of low-rank matrices. The dimensionality of these matrices (d×r and r×d) is purposefully diminished compared to the original network’s dimensions. Here, ‘d’ symbolizes the original vector’s dimension, and ‘r’ denotes the low rank. Notably, a smaller ‘r’ accelerates training, although it may require a fine balance to maintain optimal performance.
  • Matrix product calculation: The low-rank matrices WA and WB are multiplied together, and the resulting full-dimensional weight update is combined with the original weights to produce the model’s outputs.
  • Loss function computation: The loss function is discerned by contrasting the derived results against expected outputs. Traditional backpropagation methods are then harnessed to calibrate the WA and WB weights.

LoRA’s appeal lies in its economical memory footprint and modest infrastructure demands. For instance, given a 512×512 parameter matrix in a typical feed-forward network (equivalent to 262,144 parameters), by leveraging a LoRA adapter with a rank of 2, only 2,048 parameters (512×2 for each of WA and WB) undergo domain-specific data training. This streamlined process significantly elevates computational efficiency.
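A toy sketch of this arithmetic, assuming a single 512×512 linear layer and a LoRA rank of 2, is shown below; it is illustrative only and not how the peft library organizes its internals.

# Toy LoRA layer for a single 512x512 weight matrix with rank r = 2.
import torch
import torch.nn as nn

d, r = 512, 2
W = nn.Linear(d, d, bias=False)             # frozen pre-trained weights (262,144 params)
W.weight.requires_grad = False

A = nn.Parameter(torch.randn(d, r) * 0.01)  # low-rank matrix "WA" (512 x 2)
B = nn.Parameter(torch.zeros(r, d))         # low-rank matrix "WB" (2 x 512)

def lora_forward(x: torch.Tensor) -> torch.Tensor:
    # Frozen output plus the trainable low-rank update x @ (A @ B)
    return W(x) + x @ (A @ B)

print(f"Frozen parameters:    {W.weight.numel():,}")       # 262,144
print(f"Trainable parameters: {A.numel() + B.numel():,}")  # 2,048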

An exceptional facet of LoRA is its modular design. The trained adapter can be retained as an independent entity, serving as a modular component for specific domains. Furthermore, LoRA adeptly sidesteps catastrophic forgetting by abstaining from modifying the foundational weights.

Further developments: QLoRA

To further accentuate the effectiveness of LoRA, QLoRA has been introduced as an augmented technique, promising enhanced optimization and performance. This advanced method builds upon the foundational principles of LoRA, optimizing it for even more intricate tasks.


QLoRA builds upon LoRA to further optimize efficiency by converting the weight values of the original network from higher-precision formats, such as float32, to compact 4-bit representations. This conversion reduces memory usage and accelerates computational speeds.

QLoRA introduces three primary enhancements over LoRA, establishing it as a leading method in PEFT.

1. 4-bit NF4 quantization

Using the 4-bit NormalFloat (NF4) data type is a strategic move to decrease storage requirements. This process is divided into three phases:

  • Normalization & quantization: Here, weights are rescaled to zero mean and unit variance. Given that a 4-bit data format can hold just 16 distinct values, each weight is mapped to the closest of these 16 levels based on its relative position. For example, an FP32 weight of value 0.2121 would be stored as its nearest 4-bit equivalent, not the exact value.
  • Dequantization: This is the reverse process. When the weights are needed for computation, the quantized values are restored to their near-original form.
  • Double quantization: This phase pushes memory optimization further by quantizing the quantization constants themselves. Grouping these constants and applying an 8-bit quantization shrinks their overhead from roughly 0.5 bits to about 0.127 bits per parameter – on the order of 125,000 bits for a model with 1 million parameters, as the short calculation below illustrates.
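Here is that back-of-the-envelope calculation – a sketch assuming the block sizes reported for QLoRA (one 32-bit constant per block of 64 weights, with those constants re-quantized to 8 bits in blocks of 256).

# Overhead of the quantization constants, per 1 million parameters.
params = 1_000_000

single_quant_bits = params * (32 / 64)                   # ~0.5 bits per parameter
double_quant_bits = params * (8 / 64 + 32 / (64 * 256))  # ~0.127 bits per parameter

print(f"Constants without double quantization: {single_quant_bits:,.0f} bits")
print(f"Constants with double quantization:    {double_quant_bits:,.0f} bits")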

2. Unified memory paging

Together with the quantization methods, QLoRA leverages NVIDIA’s unified memory capabilities. This feature facilitates smooth transfers between GPU and CPU memory. This is particularly useful during memory-intensive operations or unexpected GPU demand spikes, ensuring no memory overflow.

While both LoRA and QLoRA are at the forefront of PEFT, QLoRA’s advanced techniques offer superior efficiency and optimization.

Fine-tuning the Llama 2 model with QLoRA

Let’s delve into the process of fine-tuning the Llama 2 model, which features a massive 7 billion parameters. We will harness the computational power of a T4 GPU, backed by high RAM, available on Google Colab at a rate of 2.21 credits per hour. It’s worth noting that the T4 comes equipped with 16 GB of VRAM. Now, when you consider the weight of Llama 2-7b (7 billion parameters equating to 14 GB in FP16 format), the VRAM is stretched almost to its limit. This scenario doesn’t even factor in additional overheads such as optimizer states, gradients, and forward activations. The implication is clear: traditional fine-tuning won’t work here. We need to apply parameter-efficient fine-tuning techniques, such as LoRA or QLoRA.
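A quick back-of-the-envelope check of those numbers – weights only, ignoring optimizer states, gradients, and activations – makes the constraint obvious.

# Rough size of the Llama 2-7b weights alone (excludes optimizer states,
# gradients, and forward activations, which add substantially more).
params = 7_000_000_000

fp16_gb = params * 2 / 1e9    # 2 bytes per parameter -> ~14 GB
int4_gb = params * 0.5 / 1e9  # 4 bits per parameter  -> ~3.5 GB

print(f"FP16 weights:  ~{fp16_gb:.1f} GB (nearly all of a 16 GB T4)")
print(f"4-bit weights: ~{int4_gb:.1f} GB (leaves headroom for fine-tuning)")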

One way to significantly cut down on VRAM usage is by fine-tuning the model using 4-bit precision. This makes QLoRA an apt choice. Fortunately, the Hugging Face ecosystem is equipped with libraries like transformers, accelerate, peft, trl, and bitsandbytes to facilitate this. Our step-by-step code is inspired by the contributions of Younes Belkada on GitHub. We initiate the process by installing and activating these libraries.

!pip install -q accelerate==0.21.0 peft==0.4.0 bitsandbytes==0.40.2 transformers==4.31.0 trl==0.4.7

import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer

Let’s delve into the adjustable parameters in this context. We will begin by loading the llama-2-7b-chat-hf model, commonly referred to as the chat model. Our aim is to train this model using the dataset mlabonne/guanaco-llama2-1k, which comprises 1,000 samples. Upon completion, the resulting fine-tuned model will be termed llama-2-7b-miniguanaco. For those curious about the origin and creation of this dataset, a detailed notebook is available for review. However, do note that customization is possible. The Hugging Face Hub boasts a plethora of valuable datasets, including the notable databricks/databricks-dolly-15k.

In employing QLoRA, we will set the rank at 64, coupled with a scaling parameter of 16. Our approach involves loading the Llama 2 model directly in 4-bit precision, specifically employing the NF4 type, and then training it over a single epoch. For insights into other associated parameters, you are encouraged to explore the TrainingArguments, PeftModel, and SFTTrainer documentation.

# The model that you want to train from the Hugging Face hub
model_name = "NousResearch/Llama-2-7b-chat-hf"

# The instruction dataset to use
dataset_name = "mlabonne/guanaco-llama2-1k"

# Fine-tuned model name
new_model = "llama-2-7b-miniguanaco"

################################################################################
# QLoRA parameters
################################################################################

# LoRA attention dimension
lora_r = 64

# Alpha parameter for LoRA scaling
lora_alpha = 16

# Dropout probability for LoRA layers
lora_dropout = 0.1

################################################################################
# bitsandbytes parameters
################################################################################

# Activate 4-bit precision base model loading
use_4bit = True

# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16"

# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"

# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False

################################################################################
# TrainingArguments parameters
################################################################################

# Output directory where the model predictions and checkpoints will be stored
output_dir = "./results"

# Number of training epochs
num_train_epochs = 1

# Enable fp16/bf16 training (set bf16 to True with an A100)
fp16 = False
bf16 = False

# Batch size per GPU for training
per_device_train_batch_size = 4

# Batch size per GPU for evaluation
per_device_eval_batch_size = 4

# Number of update steps to accumulate the gradients for
gradient_accumulation_steps = 1

# Enable gradient checkpointing
gradient_checkpointing = True

# Maximum gradient norm (gradient clipping)
max_grad_norm = 0.3

# Initial learning rate (AdamW optimizer)
learning_rate = 2e-4

# Weight decay to apply to all layers except bias/LayerNorm weights
weight_decay = 0.001

# Optimizer to use
optim = "paged_adamw_32bit"

# Learning rate schedule (constant works slightly better than cosine here)
lr_scheduler_type = "constant"

# Number of training steps (overrides num_train_epochs)
max_steps = -1

# Ratio of steps for a linear warmup (from 0 to learning rate)
warmup_ratio = 0.03

# Group sequences into batches with same length
# Saves memory and speeds up training considerably
group_by_length = True

# Save checkpoint every X update steps
save_steps = 25

# Log every X update steps
logging_steps = 25

################################################################################
# SFT parameters
################################################################################

# Maximum sequence length to use
max_seq_length = None

# Pack multiple short examples in the same input sequence to increase efficiency
packing = False

# Load the entire model on GPU 0
device_map = {"": 0}

Let’s commence the fine-tuning process, integrating various components for this task.


Initially, we will source the previously defined dataset. It’s pertinent to note that our dataset is already refined; however, under typical circumstances, this step would entail reshaping prompts, filtering out inconsistent text, amalgamating multiple datasets, and so forth.
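For illustration, a typical reshaping step might look like the sketch below; the instruction and response column names are hypothetical, since the guanaco-llama2-1k dataset used here already ships with a single pre-formatted text field.

# Hypothetical preprocessing sketch: map a raw instruction dataset into the
# single "text" field (in Llama 2 prompt format) that SFTTrainer expects.
def to_llama2_prompt(example):
    return {
        "text": f"<s>[INST] {example['instruction']} [/INST] {example['response']} </s>"
    }

# raw_dataset = load_dataset("your-raw-dataset", split="train")     # placeholder name
# dataset = raw_dataset.map(to_llama2_prompt)                       # reshape prompts
# dataset = dataset.filter(lambda ex: len(ex["text"].strip()) > 0)  # drop empty rows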

Subsequently, we will set up bitsandbytes to facilitate 4-bit quantization.

Following this, we will instantiate the Llama 2 model in 4-bit precision on a GPU, aligning it with the appropriate tokenizer.

To conclude our preparations, we will initialize the configurations for QLoRA, outline the standard training parameters, and forward all these settings to the SFTTrainer. With everything in place, the training journey begins!

# Load dataset (you can process it here)
dataset = load_dataset(dataset_name, split="train")

# Load tokenizer and model with QLoRA configuration
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

# Check GPU compatibility with bfloat16
if compute_dtype == torch.float16 and use_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("=" * 80)
        print("Your GPU supports bfloat16: accelerate training with bf16=True")
        print("=" * 80)

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map
)
model.config.use_cache = False
model.config.pretraining_tp = 1

# Load LLaMA tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token  # reuse the EOS token for padding
tokenizer.padding_side = "right"

# Load LoRA configuration
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
)

# Set training parameters
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type,
    report_to="tensorboard"
)

# Set supervised fine-tuning parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,
)

# Train model
trainer.train()

# Save trained model
trainer.model.save_pretrained(new_model)


The duration of the training process can vary significantly based on your dataset’s size. In this instance, it was completed in under an hour using a T4 GPU. To review the progress visually, one can refer to the plots available in TensorBoard.

%load_ext tensorboard
%tensorboard --logdir results/runs

Let’s confirm that the model is operating as expected. While a comprehensive evaluation would be ideal, we can utilize the text generation pipeline for preliminary assessments by posing questions such as, “What is a large language model?” It’s crucial to adjust the input to align with Llama 2’s prompt structure.

# Ignore warnings
logging.set_verbosity(logging.CRITICAL)

# Run text generation pipeline with our fine-tuned model
prompt = "What is a large language model?"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(f"[INST] {prompt} [/INST]")
print(result[0]['generated_text'])

The model provides the subsequent answer:

/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py:1270: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation )
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
[INST] What is a large language model? [/INST] A large language model is a type of artificial intelligence (AI) model that is trained on a large dataset of text to generate human-like language outputs. It is designed to be able to understand and generate text in a way that is similar to human language, and can be used for a variety of applications such as chatbots, language translation, and text summarization.

Large language models are typically trained using deep learning techniques, such as recurrent neural networks (RNNs) or transformer models, and are often based on pre-trained models such as BERT or RoBERTa. These models are trained on large datasets of text, such as books, articles, or websites, and are designed to learn the patterns and structures of language.

Some examples of large language models include:

* BERT (Bidirectional Encoder Representations from Transformers):

Drawing from our observations, the coherence demonstrated by a model encompassing merely 7 billion parameters is quite impressive. Feel free to experiment further by posing more complex questions, perhaps drawing from datasets like BigBench-Hard. Historically, the Guanaco dataset has been pivotal in crafting top-tier models. To achieve this, consider training a Llama 2 model utilizing the mlabonne/guanaco-llama2 dataset.

So, how do we save our refined llama-2-7b-miniguanaco model? The key lies in merging the LoRA weights with the base model. Presently, there is no direct, seamless way to do this: we need to reload the base model in FP16 precision and use the peft library to merge the weights. Unfortunately, this approach can run into VRAM issues, even after the GPU memory has been cleared, so it may be best to restart the notebook, re-run the first three cells, and then execute the next one.

# Reload model in FP16 and merge it with LoRA weights
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map=device_map,
)
model = PeftModel.from_pretrained(base_model, new_model)
model = model.merge_and_unload()

# Reload tokenizer to save it
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token  # reuse the EOS token for padding
tokenizer.padding_side = "right"

Having successfully merged the weights and reloaded the tokenizer, we can now push everything to the Hugging Face Hub to preserve our model.

!huggingface-cli login

model.push_to_hub(new_model, use_temp_dir=False)
tokenizer.push_to_hub(new_model, use_temp_dir=False)

This model is now ready for inference and can be accessed and loaded from the Hub just as you would with any other Llama 2 model.
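For completeness, a minimal sketch of loading the pushed model back from the Hub for inference is shown below; replace your-username with the account the model was uploaded to.

# Sketch: reload the fine-tuned model from the Hub and run inference.
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

repo_id = "your-username/llama-2-7b-miniguanaco"  # placeholder account name
model = AutoModelForCausalLM.from_pretrained(repo_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(repo_id)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_length=200)
print(pipe("[INST] What is a large language model? [/INST]")[0]["generated_text"])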

Challenges in fine-tuning Llama 2

Navigating the fine-tuning process

Fine-tuning LLMs like Llama 2 presents a unique set of complexities, differing from standard text-to-text model adaptations. Even with supportive libraries like Hugging Face’s transformers and trl, the process remains intricate for enterprise applications. Key challenges include:

  • Absence of a standard interface to set prompt and task descriptors and to adjust datasets in alignment with these parameters.
  • The multitude of training parameters that necessitate manual configuration tailored to specific datasets.
  • The onus of establishing, managing, and scaling a robust infrastructure for distributed fine-tuning. Achieving optimal performance with a model of around 7B parameters is challenging, especially given GPU memory constraints, and understanding and deploying distributed training effectively demands deep-rooted expertise in the subject.

Securing computational assets

LLMs, by nature, are voracious consumers of computational resources. Their memory, power, and time demands are lofty, constraining entities lacking these resources. This disparity can act as a barrier to universalizing the fine-tuning process.

Streamlining distributed model training

The sheer size of LLMs like Llama 2 makes it impractical to house them on a single GPU, barring a few like the A100s. This necessitates a shift from standard data-parallel training to model-parallel or pipeline-parallel training, whereby model weights are spread across multiple GPU instances. Open-source tools such as DeepSpeed facilitate this, but mastering their vast array of configurable parameters can be daunting. Incorrect configurations can lead to memory overflow on CPUs or GPUs, or to suboptimal GPU utilization due to unwarranted offloading, elevating training costs.

How does LeewayHertz help in building Llama 2 model-powered solutions?

LeewayHertz, a seasoned AI development company, offers expert solutions in fine-tuning the Llama 2 model to build custom solutions aligned with specific organizational needs and objectives. Here is how we can help you:

Strategic consulting

Our consulting process begins by deeply understanding your organization’s goals, challenges, and competitive landscape. We then recommend the most appropriate Llama 2 model-powered solution tailored to your specific needs. Finally, we develop a comprehensive implementation strategy, ensuring the solution aligns perfectly with your objectives and positions your organization for success in the rapidly evolving tech landscape.

Data engineering for Llama 2

With precise data engineering, we transform your organization’s valuable data into a powerful asset for the development of highly effective Llama 2 model-powered solutions. Our skilled developers carefully prepare your proprietary data, making sure it meets the necessary standards for fine-tuning the Llama 2 model, thus optimizing its performance to the fullest potential.

Fine-tuning expertise in Llama 2

We fine-tune the Llama 2 model with your proprietary data for domain-specific performance and build a customized solution around it. This approach ensures the solution delivers accurate and meaningful responses within your unique context.

Custom Llama 2 solutions

We ensure innovation, efficiency, and a competitive edge with our expertly developed Llama 2 model-powered solutions. Whether you need chatbots for personalized customer interactions, intelligent content generators, or context-aware recommendation systems, our Llama 2 model-powered applications are meticulously crafted to enhance your organization’s capabilities in the dynamic AI landscape.

Seamless integration of Llama 2

We ensure that the Llama 2 model-powered solutions we develop seamlessly align with your existing processes. Our approach involves analyzing your workflows, identifying key integration points, and developing a customized integration strategy. This minimizes disruptions while maximizing the benefits of our solutions, facilitating a smooth transition for your organization into a more efficient, AI-enhanced operational environment.

Continuous evolution: Upgrades and maintenance

We keep your Llama 2 model-powered application up-to-date and performance-optimized with our comprehensive upgrade and maintenance services. We diligently monitor emerging trends, security updates, and advancements in AI technology, ensuring your application stays competitive and secure in the rapidly evolving tech landscape.

Endnote

This article discussed the intricacies of fine-tuning the Llama 2 7B model leveraging a Colab notebook. We laid the foundational understanding of LLM training and the nuances of fine-tuning, shedding light on the significance of instruction datasets. We effectively adapted the Llama 2 model in our practical section, ensuring compatibility with its intrinsic prompt templates and tailored parameters.

When incorporated into platforms like LangChain, these refined models emerge as potent alternatives to offerings like the OpenAI API. It’s imperative to recognize that instruction datasets stand paramount in the evolving landscape of language models. The efficacy of your model is intrinsically tied to the quality of its training data. As you embark on this journey, prioritizing high-caliber datasets becomes crucial. Navigating the complexities of models like Llama 2 may appear challenging, but the rewards are substantial with diligent application and a clear roadmap. Harnessing the prowess of these advanced LLMs for targeted tasks can enhance applications, ushering in a new era of linguistic computing.

Don’t let pre-trained models limit your vision. Our extensive development experience and LLM fine-tuning expertise enable us to build robust custom LLMs tailored to businesses’ specific needs. Contact our AI experts today and harness the limitless power of LLMs!


Author’s Bio

 

Akash Takyar
CEO, LeewayHertz
Akash Takyar is the founder and CEO of LeewayHertz. The experience of building over 100 platforms for startups and enterprises allows Akash to rapidly architect and design solutions that are scalable and beautiful.
Akash's ability to build enterprise-grade technology solutions has attracted over 30 Fortune 500 companies, including Siemens, 3M, P&G and Hershey’s.
Akash is an early adopter of new technology, a passionate technology enthusiast, and an investor in AI and IoT startups.
