When we think about Large Language Models or any other generative models, the first hardware that comes to mind is the GPU. Without GPUs, many advancements in generative AI, machine learning, deep learning, and data science would have been impossible. Fifteen years ago, gamers were the ones most excited about the latest GPU technologies; today, data scientists and machine learning engineers follow the news in this field just as closely, although gamers and ML practitioners usually look at two different classes of GPUs and graphics cards.
Hardware and Software Setup
Before loading any LLM model or training dataset, we need to find out what hardware and software we need for such a process.
As mentioned, we used the NVIDIA GeForce RTX 3090 GPU because it has one of the largest memory capacities (24 GB) among consumer GPUs (for reference, the 4090 model has the same memory size). It is based on the Ampere architecture, the same architecture used by the well-known A100 GPUs. You can see more about the GeForce RTX 3090 GPU specifications here.
After all the tests, we believe 24 GB is the minimum amount of GPU memory needed for working with LLMs with billions of parameters.
In addition to the graphics card, we need to make sure that our PC has a good ventilation system. During fine-tuning, the GPU temperature rises quickly, and its own fans are not enough to keep it cool. A higher GPU temperature can throttle GPU performance, and the process will take much longer.
In addition to hardware, there are some software considerations worth mentioning here. First of all, if you are a Windows user, we have a piece of bad news for you: some libraries and tools only work on Linux. Specifically, bitsandbytes, which is frequently used for model quantization, is not Windows-friendly. Some people have made wrappers for Windows (for example, here), but they have their pros and cons. So our advice is either to install Linux on WSL or, as we did, set up a dual-boot system and fully switch to Linux while working on LLMs.
Also, you need to install PyTorch and a compatible CUDA version. Our recommendation is to install CUDA 12.3 (link). Then go to the PyTorch installation page (https://pytorch.org/) and, based on your system, CUDA version, and package manager, download and install the correct PyTorch build.
Note: If you are using CUDA 12.3, you might need to add or configure the BNB_CUDA_VERSION and LD_LIBRARY_PATH environment variables in your .bashrc file. Here is an example for your reference:
export BNB_CUDA_VERSION=123
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/<YOUR-USER-DIR>/local/cuda-12.3
And finally, you need to install the following packages on your system. We recommend creating a new virtual environment (venv) to avoid conflicts with other packages installed on your system. For your reference, here are the package versions that we successfully used, followed by a short snippet to verify the setup:
torch==2.1.2
transformers==4.36.2
datasets==2.16.1
bitsandbytes==0.42.0
peft==0.7.1
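Once these packages are installed, a minimal check like the following (a sketch; the exact version strings on your machine will differ) confirms that PyTorch was built with CUDA support and can see the RTX 3090:
import torch
import bitsandbytes as bnb  # should import without CUDA setup errors on Linux

print(torch.__version__, torch.version.cuda)   # PyTorch build and the CUDA version it was built against
print(torch.cuda.is_available())               # should print True
print(torch.cuda.get_device_name(0))           # should mention the GeForce RTX 3090
print(round(torch.cuda.get_device_properties(0).total_memory / 1024**3, 1), "GB")  # ~24 GB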
Technical Background
Now that you have all the hardware and software ready for working with LLMs on your system, it is better to have a very brief review of the technical concepts that you will face in the next sections.
Large language models comprise millions or billions of parameters. Usually, we use pre-trained models that have been trained on billions, and sometimes trillions, of tokens over a long training process that often costs millions of dollars. Each model parameter takes 32 bits (4 bytes) of memory to load, so as a rule of thumb, every 1 billion parameters requires about 4 GB of memory just to load the model. One technique for using less memory when loading (and later running inference on or training) a model is "quantization". In this technique, we reduce the precision of the model weights from 32-bit full precision (fp32) to 16-bit (fp16 or bfloat16), 8-bit (int8), or even less (read more). As you can imagine, by reducing the precision of the model weights we can load larger models into a limited amount of memory, but this comes at the cost of reduced model performance. However, some studies suggest that the performance difference between fp32 and bfloat16 is insignificant, and many famous models (including Llama 2) were pre-trained in bfloat16.
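As a back-of-the-envelope illustration of this rule of thumb, the following sketch estimates the memory needed just to hold the weights of a 7-billion-parameter model at different precisions (the 7e9 figure and the per-parameter byte counts are illustrative; activations, optimizer states, and CUDA overhead are ignored):
# Rough weight-only memory estimate for a 7B-parameter model at different precisions
params = 7e9

for name, bytes_per_param in [("fp32", 4), ("fp16/bf16", 2), ("int8", 1), ("4-bit (nf4)", 0.5)]:
    gb = params * bytes_per_param / 1024**3
    print(f"{name:>12}: ~{gb:.1f} GB just to load the weights")
# Roughly: fp32 ~26 GB, fp16 ~13 GB, int8 ~6.5 GB, 4-bit ~3.3 GB (weights only)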
Quantization is a technique that we must use when fine-tuning or running inference with a large language model on a single GPU with 24 GB of memory. Later, you will see that the bitsandbytes library helps us achieve model quantization.
Because pre-training is so expensive, most people (even those with significant hardware resources and budget) prefer to use a pre-trained model and only fine-tune it for their specific use case. Still, full fine-tuning can be overwhelming with limited resources (such as a single GPU). This is where Parameter-Efficient Fine-Tuning (PEFT), which only updates a limited subset of model parameters, becomes more realistic, requiring far fewer compute resources.
Among the different PEFT techniques, LoRA (Low-Rank Adaptation) is very popular because of its computational efficiency. In this technique, we freeze all the original model weights and instead train small low-rank matrices that are added to specific layers of the Transformer architecture (read more). In many cases, fine-tuning an LLM with LoRA updates only about 0.5% of the model's weights.
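To make the idea concrete, here is a minimal, self-contained sketch of the LoRA update itself (plain PyTorch, not the peft implementation; the layer size, rank, and alpha values are illustrative):
import torch

d, r = 4096, 16          # hidden size of one layer, LoRA rank (illustrative values)
alpha = 64               # LoRA scaling factor

W = torch.randn(d, d)                # frozen pre-trained weight (never updated)
A = torch.randn(r, d) * 0.01         # trainable low-rank factor
B = torch.zeros(d, r)                # trainable, zero-initialized so the update starts at 0

x = torch.randn(1, d)                # a token's hidden state
h = x @ W.T + (alpha / r) * (x @ A.T @ B.T)  # original path plus the low-rank LoRA path

# Trainable parameters: 2*d*r = 131,072 vs. d*d = 16,777,216 for the full matrix (<1%)
print(2 * d * r, d * d)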
QLoRA is a variation of LoRA that combines LoRA with the quantization concept explained above. Specifically, in our QLoRA implementation, we will use nf4 (4-bit Normal Float) for fine-tuning the model. QLoRA is very helpful in our case study of fine-tuning a large model on a single consumer GPU.
Coding Time
Set up your Python environment
Create the following requirements.txt file:
torch
accelerate @ git+https://github.com/huggingface/accelerate.git
bitsandbytes
datasets==2.13.1
transformers @ git+https://github.com/huggingface/transformers.git
peft @ git+https://github.com/huggingface/peft.git
trl @ git+https://github.com/lvwerra/trl.git
scipy
Then install the requirements and import the libraries:
pip install -r requirements.txt
import argparse
import os
from functools import partial

import bitsandbytes as bnb
import torch
from datasets import load_dataset
from peft import AutoPeftModelForCausalLM, LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
    set_seed,
)
Download LLaMA 2 model
As mentioned before, LLaMA 2 models come in different sizes: 7B, 13B, and 70B parameters. Your choice will be influenced by your computational resources; larger models require more memory, processing power, and training time.
To download the model you have been granted access to, make sure you are logged in to the Hugging Face model hub. As mentioned in the requirements step, you need to use the huggingface-cli login command.
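If you prefer to authenticate from Python (for example, inside a notebook) instead of the shell, the huggingface_hub library provides an equivalent login() helper; the token below is only a placeholder:
from huggingface_hub import login

# Paste a Hugging Face access token with read permission (placeholder shown here)
login(token="hf_xxxxxxxxxxxxxxxxxxxx")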
The following function will help us to download the model and its tokenizer. It requires a bitsandbytes configuration that we will define later.
def load_model(model_name, bnb_config):
    n_gpus = torch.cuda.device_count()
    max_memory = f'{40960}MB'

    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map="auto",  # dispatch the model efficiently on the available resources
        max_memory={i: max_memory for i in range(n_gpus)},
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_auth_token=True)

    # Needed for the LLaMA tokenizer
    tokenizer.pad_token = tokenizer.eos_token

    return model, tokenizer
Download a Dataset
There are many datasets that can help you fine-tune your model. You can even use your own dataset!
In this tutorial, we are going to download and use the Databricks Dolly 15k dataset, which contains more than 15,000 prompt/response pairs. It was crafted by more than 5,000 Databricks employees during March and April of 2023.
This dataset is designed specifically for fine-tuning large language models. Released under the CC BY-SA 3.0 license, it can be used, modified, and extended by any individual or company, even for commercial applications. So it’s a perfect fit for our use case!
However, like most datasets, this one has its limitations. Indeed, pay attention to the following points:
- It consists of content collected from the public internet, which means it may contain objectionable, incorrect, or biased content and typos, which could influence the behavior of models fine-tuned using this dataset.
- Since the dataset has been created for Databricks by their own employees, it’s worth noting that the dataset reflects the interests and semantic choices of Databricks employees, which may not be representative of the global population at large.
- We only have access to the train split of the dataset, which is its largest subset.
# Load the databricks dataset from Hugging Face
from datasets import load_dataset
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")
Explore dataset
Once the dataset is downloaded, we can take a look at it to understand what it contains:
print(f'Number of prompts: {len(dataset)}')
print(f'Column names are: {dataset.column_names}')
*** OUTPUT ***
Number of prompts: 15011
Column names are: ['instruction', 'context', 'response', 'category']
As we can see, each sample is a dictionary that contains the following fields (see the snippet right after this list for a printed example):
- An instruction: what could be entered by the user, such as a question
- A context: information that helps interpret the sample
- A response: the answer to the instruction
- A category: classifies the sample as Open Q&A, Closed Q&A, Extract information from Wikipedia, Summarize information from Wikipedia, Brainstorming, Classification, or Creative writing
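As a quick sanity check, we can print the first record to see these fields with real content (long string values are truncated for readability):
# Print one sample to see the four fields in practice
sample = dataset[0]
for key, value in sample.items():
    print(f"{key}: {str(value)[:100]}")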
Pre-processing dataset
Instruction fine-tuning is a common technique used to adapt a base LLM for a specific downstream use case.
It will help us format our prompts as follows:
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
Sea or Mountain
### Response:
I believe mountains are more attractive, but the ocean has its own beauty, and this tropical weather definitely has its charm! So 50/50.
### End
To delimit each prompt part by hashtags, we can use the following function:
def create_prompt_formats(sample):
    """
    Format the various fields of the sample ('instruction', 'context', 'response'),
    then concatenate them using two newline characters.
    :param sample: Sample dictionary
    """
    INTRO_BLURB = "Below is an instruction that describes a task. Write a response that appropriately completes the request."
    INSTRUCTION_KEY = "### Instruction:"
    INPUT_KEY = "Input:"
    RESPONSE_KEY = "### Response:"
    END_KEY = "### End"

    blurb = f"{INTRO_BLURB}"
    instruction = f"{INSTRUCTION_KEY}\n{sample['instruction']}"
    input_context = f"{INPUT_KEY}\n{sample['context']}" if sample["context"] else None
    response = f"{RESPONSE_KEY}\n{sample['response']}"
    end = f"{END_KEY}"

    parts = [part for part in [blurb, instruction, input_context, response, end] if part]
    formatted_prompt = "\n\n".join(parts)
    sample["text"] = formatted_prompt

    return sample
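To verify the formatting, we can apply this function to a single sample and print the generated text field (a quick check, not part of the training pipeline itself):
# Format one sample and inspect the resulting prompt
example = create_prompt_formats(dataset[0])
print(example["text"])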
Now, we will use the model tokenizer to turn these prompts into token sequences.
The goal is to create input sequences of uniform length (which is suitable for fine-tuning the language model because it maximizes efficiency and minimizes computational overhead) that do not exceed the model's maximum token limit.
# SOURCE https://github.com/databrickslabs/dolly/blob/master/training/trainer.py
def get_max_length(model):
    conf = model.config
    max_length = None
    for length_setting in ["n_positions", "max_position_embeddings", "seq_length"]:
        max_length = getattr(conf, length_setting, None)
        if max_length:
            print(f"Found max length: {max_length}")
            break
    if not max_length:
        max_length = 1024
        print(f"Using default max length: {max_length}")
    return max_length
def preprocess_batch(batch, tokenizer, max_length):
"""
Tokenizing a batch
"""
return tokenizer(
batch["text"],
max_length=max_length,
truncation=True,
)
# SOURCE https://github.com/databrickslabs/dolly/blob/master/training/trainer.py
def preprocess_dataset(tokenizer: AutoTokenizer, max_length: int, seed, dataset):
    """Format and tokenize the dataset so it is ready for training.
    :param tokenizer (AutoTokenizer): Model tokenizer
    :param max_length (int): Maximum number of tokens to emit from the tokenizer
    :param seed: Seed used to shuffle the dataset
    :param dataset: The raw dataset to preprocess
    """
# Add prompt to each sample
print("Preprocessing dataset...")
dataset = dataset.map(create_prompt_formats)#, batched=True)
    # Apply preprocessing to each batch of the dataset and remove the 'instruction', 'context', 'response', and 'category' fields
_preprocessing_function = partial(preprocess_batch, max_length=max_length, tokenizer=tokenizer)
dataset = dataset.map(
_preprocessing_function,
batched=True,
remove_columns=["instruction", "context", "response", "text", "category"],
)
# Filter out samples that have input_ids exceeding max_length
dataset = dataset.filter(lambda sample: len(sample["input_ids"]) < max_length)
# Shuffle dataset
dataset = dataset.shuffle(seed=seed)
return dataset
With these functions, our dataset will be ready for fine-tuning!
Create a bitsandbytes configuration
This will allow us to load our LLM in 4-bit precision. This way, we can divide the memory usage roughly by 4 and load the model on smaller devices. We choose the bfloat16 compute data type and nested quantization for memory-saving purposes.
def create_bnb_config():
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
)
return bnb_config
To leverage the LoRA method, we need to wrap the model as a PeftModel.
To do this, we need to implement a LoRA configuration:
def create_peft_config(modules):
"""
Create Parameter-Efficient Fine-Tuning config for your model
:param modules: Names of the modules to apply Lora to
"""
config = LoraConfig(
r=16, # dimension of the updated matrices
lora_alpha=64, # parameter for scaling
target_modules=modules,
lora_dropout=0.1, # dropout probability for layers
bias="none",
task_type="CAUSAL_LM",
)
return config
The previous function needs the names of the target modules so it knows which matrices to update. The following function will retrieve them for our model:
# SOURCE https://github.com/artidoro/qlora/blob/main/qlora.py
def find_all_linear_names(model):
cls = bnb.nn.Linear4bit #if args.bits == 4 else (bnb.nn.Linear8bitLt if args.bits == 8 else torch.nn.Linear)
lora_module_names = set()
for name, module in model.named_modules():
if isinstance(module, cls):
names = name.split('.')
lora_module_names.add(names[0] if len(names) == 1 else names[-1])
if 'lm_head' in lora_module_names: # needed for 16-bit
lora_module_names.remove('lm_head')
return list(lora_module_names)
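As a quick check once the model is loaded (as in the Train section below), calling this helper should list the 4-bit linear layers. For Llama-2-style architectures we would expect projection names such as q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, and down_proj, though the exact names depend on the model:
# List the linear modules that will receive LoRA adapters (names vary by architecture)
print(find_all_linear_names(model))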
Once everything is set up and the base model is prepared, we can use the print_trainable_parameters() helper function to see how many trainable parameters are in the model.
def print_trainable_parameters(model, use_4bit=False):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        num_params = param.numel()
        # if using DS Zero 3 and the weights are initialized empty
        if num_params == 0 and hasattr(param, "ds_numel"):
            num_params = param.ds_numel
        all_param += num_params
        if param.requires_grad:
            trainable_params += num_params
    if use_4bit:
        trainable_params = int(trainable_params / 2)
    print(
        f"all params: {all_param:,d} || trainable params: {trainable_params:,d} || trainable%: {100 * trainable_params / all_param}"
    )
We expect the LoRA-wrapped model to have far fewer trainable parameters than the original one, since only the low-rank adapter matrices are updated during fine-tuning.
Train
Now that everything is ready, we can pre-process our dataset and load our model using the configurations we defined:
# Load the model from Hugging Face with the user's token and the bitsandbytes config
model_name = "meta-llama/Llama-2-7b-hf"

bnb_config = create_bnb_config()

model, tokenizer = load_model(model_name, bnb_config)

## Preprocess dataset
seed = 42  # any fixed value; used to shuffle the dataset reproducibly
max_length = get_max_length(model)
dataset = preprocess_dataset(tokenizer, max_length, seed, dataset)
Then, we can run our fine-tuning process:
def train(model, tokenizer, dataset, output_dir):
# Apply preprocessing to the model to prepare it by
# 1 - Enabling gradient checkpointing to reduce memory usage during fine-tuning
model.gradient_checkpointing_enable()
# 2 - Using the prepare_model_for_kbit_training method from PEFT
model = prepare_model_for_kbit_training(model)
# Get lora module names
modules = find_all_linear_names(model)
# Create PEFT config for these modules and wrap the model to PEFT
peft_config = create_peft_config(modules)
model = get_peft_model(model, peft_config)
# Print information about the percentage of trainable parameters
print_trainable_parameters(model)
# Training parameters
trainer = Trainer(
model=model,
train_dataset=dataset,
args=TrainingArguments(
per_device_train_batch_size=1,
gradient_accumulation_steps=4,
warmup_steps=2,
max_steps=20,
learning_rate=2e-4,
fp16=True,
logging_steps=1,
output_dir="outputs",
optim="paged_adamw_8bit",
),
data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
)
model.config.use_cache = False # re-enable for inference to speed up predictions for similar inputs
### SOURCE https://github.com/artidoro/qlora/blob/main/qlora.py
# Verifying the datatypes before training
dtypes = {}
for _, p in model.named_parameters():
dtype = p.dtype
if dtype not in dtypes: dtypes[dtype] = 0
dtypes[dtype] += p.numel()
total = 0
for k, v in dtypes.items(): total+= v
for k, v in dtypes.items():
print(k, v, v/total)
do_train = True
# Launch training
print("Training...")
if do_train:
train_result = trainer.train()
metrics = train_result.metrics
trainer.log_metrics("train", metrics)
trainer.save_metrics("train", metrics)
trainer.save_state()
print(metrics)
###
# Saving model
print("Saving last checkpoint of the model...")
os.makedirs(output_dir, exist_ok=True)
trainer.model.save_pretrained(output_dir)
# Free memory for merging weights
del model
del trainer
torch.cuda.empty_cache()
output_dir = "results/llama2/final_checkpoint"
train(model, tokenizer, dataset, output_dir)
If you prefer to specify a number of epochs (how many times the entire training dataset is passed through the model) instead of a number of training steps (forward and backward passes with one batch of data), you can replace the max_steps argument with num_train_epochs.
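For example, the TrainingArguments block inside train() could be adjusted as follows (a sketch; the remaining arguments stay unchanged):
# Sketch: train for one full epoch instead of a fixed number of steps
args = TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    warmup_steps=2,
    num_train_epochs=1,   # replaces max_steps=20
    learning_rate=2e-4,
    fp16=True,
    logging_steps=1,
    output_dir="outputs",
    optim="paged_adamw_8bit",
)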
To later load and use the model for inference, we used the trainer.model.save_pretrained(output_dir) call, which saves the fine-tuned adapter weights and their configuration.
Unfortunately, it is possible that the latest weights are not the best ones. To solve this problem, you can use an EarlyStoppingCallback from transformers during your fine-tuning. This will let you regularly evaluate your model on a validation set, if you have one, and keep only the best weights.
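A possible setup is sketched below, assuming you have split off a validation set (here called eval_dataset, which is not created in this tutorial):
from transformers import EarlyStoppingCallback

# Sketch: evaluate regularly, keep the best checkpoint, and stop when eval loss stops improving
trainer = Trainer(
    model=model,
    train_dataset=dataset,
    eval_dataset=eval_dataset,          # assumed: a held-out validation split
    args=TrainingArguments(
        output_dir="outputs",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        num_train_epochs=1,
        evaluation_strategy="steps",
        eval_steps=50,
        save_strategy="steps",
        save_steps=50,
        load_best_model_at_end=True,
        metric_for_best_model="eval_loss",
        greater_is_better=False,
    ),
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)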
Merge weights
Once we have our fine-tuned weights, we can build our fine-tuned model and save it to a new directory, with its associated tokenizer. By performing these steps, we can have a memory-efficient fine-tuned model and tokenizer ready for inference!
model = AutoPeftModelForCausalLM.from_pretrained(output_dir, device_map="auto", torch_dtype=torch.bfloat16)
model = model.merge_and_unload()
output_merged_dir = "results/llama2/final_merged_checkpoint"
os.makedirs(output_merged_dir, exist_ok=True)
model.save_pretrained(output_merged_dir, safe_serialization=True)
# save tokenizer for easy inference
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.save_pretrained(output_merged_dir)
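With the merged model on disk, a quick way to test inference is to reload it and generate a response for a prompt written in the same format used during fine-tuning. The snippet below is a sketch (the instruction text is arbitrary):
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch: load the merged model and run a quick generation test
model = AutoModelForCausalLM.from_pretrained(output_merged_dir, device_map="auto", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(output_merged_dir)

prompt = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\nWhat is a large language model?\n\n### Response:\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))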
Transformer Model
First of all, we chose the Mistral 7B model (mistralai/Mistral-7B-v0.1) for this experiment. The Mistral 7B model, developed by Mistral AI, is an open-source LLM released in September 2023 (link to the paper). In many respects, this model outperforms well-known models such as Llama 2 (see the following charts).
Image from Mistral Release Blog Post (https://mistral.ai/news/announcing-mistral-7b/)
Dataset
Also, we used the Databricks databricks-dolly-15k dataset (under the CC BY-SA 3.0 license) for fine-tuning (read more). We used a small subset (1,000 rows) of this data to reduce the fine-tuning time, as this is a proof of concept.
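A subset like this can be taken either at load time or with the datasets select() method; a sketch of both options (assuming the same databricks/databricks-dolly-15k dataset) is shown below:
from datasets import load_dataset

# Option 1: slice the split while loading
small_dataset = load_dataset("databricks/databricks-dolly-15k", split="train[:1000]")

# Option 2: load everything, shuffle, then keep 1,000 rows
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")
small_dataset = dataset.shuffle(seed=42).select(range(1000))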
Configurations
At model loading time, we used the following quantization config to overcome the GPU memory limitations we were facing.
quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
)
This quantization config is vital for fine-tuning a model on a single GPU, as it defines a low-precision storage data type, nf4 (4-bit Normal Float), and a computation data type, bfloat16. In practice, this means that whenever a QLoRA weight tensor is used, we dequantize the tensor to bfloat16 and then perform the matrix multiplication in 16-bit (read more in the original paper).
Also, as mentioned before, we are using LoRA in conjunction with quantization, known as QLoRA, to overcome the memory limitations. Here is our LoRA config:
lora_config = LoraConfig(
r=16,
lora_alpha=64,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj"],
bias="none",
lora_dropout=0.05,
task_type="CAUSAL_LM",
)
For our LoRA configuration, we used a rank of 16. It is advised to set the rank between 4 and 16 to get a good trade-off between reducing the number of trainable parameters and maintaining model performance. Finally, we applied LoRA to a subset of the linear layers in our Mistral 7B transformer model.
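Putting the two configs together, loading Mistral 7B for QLoRA fine-tuning follows the same pattern as the Llama 2 code earlier. The sketch below assumes the quantization_config and lora_config objects defined above and that you have access to the model on Hugging Face:
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import get_peft_model, prepare_model_for_kbit_training

model_name = "mistralai/Mistral-7B-v0.1"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,  # the BitsAndBytesConfig defined above
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)    # the LoraConfig defined above
model.print_trainable_parameters()            # peft's built-in helper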
Training and Monitoring
Using our single graphics card, we could complete 4 epochs of training (1,000 steps). For us, one of the purposes of such a test, training an LLM on a single local GPU, is to monitor the hardware resources without any restrictions. One of the simplest tools for monitoring the GPU during training is the NVIDIA System Management Interface (SMI). Simply open a terminal and type the following on the command line:
nvidia-smi
or, for continuous monitoring that refreshes every second, use:
nvidia-smi -l 1
As a result, you will see the memory usage of each process on your GPU. In the following SMI view, we had just loaded the model, and it took about 5 GB of memory (thanks to quantization). Also, as you can see, the model is loaded by the Anaconda3 Python (Jupyter notebook) process.
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08 Driver Version: 545.23.08 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3090 On | 00000000:29:00.0 On | N/A |
| 30% 37C P8 33W / 350W | 5346MiB / 24576MiB | 5% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1610 G /usr/lib/xorg/Xorg 179MiB |
| 0 N/A N/A 1820 G /usr/bin/gnome-shell 41MiB |
| 0 N/A N/A 108004 G ...2023.3.3/host-linux-x64/nsys-ui.bin 8MiB |
| 0 N/A N/A 168032 G ...seed-version=20240110-180219.406000 117MiB |
| 0 N/A N/A 327503 C /home/***/anaconda3/bin/python 4880MiB |
+---------------------------------------------------------------------------------------+
And here (the following snapshot) is the memory state about 30 steps into the training process. As you can see, the GPU memory in use is now about 15 GB.
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08 Driver Version: 545.23.08 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3090 On | 00000000:29:00.0 On | N/A |
| 30% 57C P2 341W / 350W | 15054MiB / 24576MiB | 100% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1610 G /usr/lib/xorg/Xorg 179MiB |
| 0 N/A N/A 1820 G /usr/bin/gnome-shell 40MiB |
| 0 N/A N/A 108004 G ...2023.3.3/host-linux-x64/nsys-ui.bin 8MiB |
| 0 N/A N/A 168032 G ...seed-version=20240110-180219.406000 182MiB |
| 0 N/A N/A 327503 C /home/***/anaconda3/bin/python 14524MiB |
+---------------------------------------------------------------------------------------+
Although SMI is a simple tool for monitoring GPU memory usage, there are more advanced monitoring tools that provide more detailed information. One of these is the PyTorch memory snapshot feature, which you can read more about in this interesting article.
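For programmatic monitoring from inside the training script, rather than a separate terminal, PyTorch's own memory counters can also be logged; a minimal sketch:
import torch

def log_gpu_memory(tag=""):
    # Memory currently held by tensors vs. reserved by the CUDA caching allocator
    allocated = torch.cuda.memory_allocated() / 1024**3
    reserved = torch.cuda.memory_reserved() / 1024**3
    peak = torch.cuda.max_memory_allocated() / 1024**3
    print(f"[{tag}] allocated: {allocated:.1f} GB | reserved: {reserved:.1f} GB | peak: {peak:.1f} GB")

log_gpu_memory("after model load")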
Summary
In this article, we showed that it is possible to fine-tune a large language model such as Mistral 7B on a single 24 GB GPU (such as the NVIDIA GeForce RTX 3090). However, as discussed in detail, specific PEFT techniques like QLoRA are necessary. Also, the batch size matters, and we might need a much longer training time simply because of our limited resources.