Fine-Tuning LLMs on a Single Consumer Graphics Card


When we think about Large Language Models or any other generative models, the first piece of hardware that comes to mind is the GPU. Without GPUs, many advancements in generative AI, machine learning, deep learning, and data science would have been impossible. Fifteen years ago, gamers were the ones eagerly following the latest GPU technologies; today, data scientists and machine learning engineers follow the news in this field just as closely, although gamers and ML practitioners are usually looking at two different kinds of GPUs and graphics cards.

Hardware and Software Setup

Before loading any LLM model or training dataset, we need to find out what hardware and software we need for such a process.

As mentioned, we used the NVIDIA GeForce RTX 3090 GPU because, at 24 GB, it has one of the largest memory capacities among consumer GPUs (the 4090 has the same memory size). It is based on the Ampere architecture, the same architecture used by the well-known A100 GPUs. You can see more about the GeForce RTX 3090 specifications here.

After all our tests, we believe 24 GB is the minimum amount of GPU memory needed to work with LLMs that have billions of parameters.

In addition to the graphics card, we need to make sure our PC has good ventilation. During fine-tuning, the GPU temperature rises quickly, and its fans alone are not enough to keep it cool. Higher GPU temperatures can throttle performance, and the process will take much longer.

In addition to hardware, there are some software considerations worth mentioning here. First of all, if you are a Windows user, we have a piece of bad news for you: some libraries and tools only work on Linux. Specifically, bitsandbytes, which is used frequently for model quantization, is not Windows-friendly. Some people have made wrappers for Windows (for example, here), but they have their pros and cons. So our advice is either to install Linux on WSL or, as we did, set up a dual-boot system and fully switch to Linux while you are working on LLMs.

Also, you need to install PyTorch and a compatible CUDA version. Our recommendation is to install CUDA 12.3 (link). Then go to the PyTorch installation page (https://pytorch.org/) and, based on your system, CUDA version, and package manager, download and install the correct PyTorch build.

Note: If you are using CUDA 12.3, you might need to add or configure the BNB_CUDA_VERSION and LD_LIBRARY_PATH environment variables in your .bashrc file. Here is an example for your reference:

export BNB_CUDA_VERSION=123
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/<YOUR-USER-DIR>/local/cuda-12.3

And finally, you need to install the following packages on your system. We recommend creating a new virtual environment (venv) to avoid conflicts with other packages installed on your system. For your reference, here are the package versions that we successfully used:

torch==2.1.2
transformers==4.36.2
datasets==2.16.1
bitsandbytes==0.42.0
peft==0.7.1
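Before going any further, it is worth running a quick sanity check (a minimal sketch) to confirm that PyTorch sees your GPU and that the libraries import cleanly:

import torch
import bitsandbytes as bnb
import transformers, datasets, peft

# Confirm the GPU is visible and report the installed versions
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"Device: {torch.cuda.get_device_name(0)}")
print(f"torch {torch.__version__}, transformers {transformers.__version__}, "
      f"datasets {datasets.__version__}, peft {peft.__version__}, bitsandbytes {bnb.__version__}")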

Technical Background

Now that you have all the hardware and software ready for working with LLMs on your system, it is worth briefly reviewing the technical concepts you will encounter in the next sections.

Large language models comprise millions or billions of parameters. Usually, we use pre-trained models that have been trained on billions, and sometimes trillions, of tokens over a long training process that often costs millions of dollars. Each model parameter takes 32 bits (4 bytes) of memory to load. As a rule of thumb, every 1 billion parameters require about 4 GB of memory just to load the model. One technique for using less memory when loading (and later running inference with or training) a model is "quantization". In this technique, we reduce the precision of the model weights from 32-bit full precision (fp32) to 16-bit (fp16 or bfloat16), 8-bit (int8), or even less (read more). As you can imagine, by reducing the precision of the model weights we can load larger models into a limited memory, but this comes at the cost of reduced model performance. However, some studies suggest that the performance difference between fp32 and bfloat16 is insignificant, and many famous models (including Llama 2) were pre-trained in bfloat16.
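As a rough illustration of this rule of thumb, here is the weight-only memory footprint of a 7-billion-parameter model at different precisions (activations, optimizer state, and framework overhead come on top of this):

# Back-of-the-envelope memory needed just to hold the weights of a 7B-parameter model
params = 7e9
for name, bytes_per_param in [("fp32", 4), ("fp16/bf16", 2), ("int8", 1), ("4-bit", 0.5)]:
    print(f"{name:10s}: ~{params * bytes_per_param / 1024**3:.1f} GB")
# fp32 ~26 GB, fp16/bf16 ~13 GB, int8 ~6.5 GB, 4-bit ~3.3 GB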

Quantization is a technique that we must use when we are fine-tuning or inferencing a large language model on a single GPU with 24GB of memory. Later you will see that the bitsandbytes library can help us achieve model quantization.

Given the enormous cost of pre-training, most people (even those with significant hardware resources and budget) prefer to take a pre-trained model and only fine-tune it for their specific use case. Still, even full fine-tuning can be overwhelming with limited resources (such as a single GPU). This is why Parameter-Efficient Fine-Tuning (PEFT), which only updates a limited subset of model parameters, is a more realistic option that requires far less compute.

Among the different PEFT techniques, LoRA (Low-Rank Adaptation) is very popular because of its computational efficiency. In this technique, we freeze all the original model weights and instead train small low-rank matrices that are added to specific layers of the Transformer architecture (read more). In many cases, fine-tuning an LLM with LoRA updates only about 0.5% of the model weights.
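To see why the trainable fraction is so small, consider a single linear layer. The numbers below are illustrative (a hidden size typical of a 7B model and a rank of 16), not tied to any specific checkpoint:

# For a d_in x d_out linear layer, LoRA adds two low-rank matrices: A (r x d_in) and B (d_out x r)
d_in = d_out = 4096   # illustrative hidden size
r = 16                # LoRA rank
full = d_in * d_out
lora = r * (d_in + d_out)
print(f"full weight matrix: {full:,} params, LoRA update: {lora:,} params "
      f"({100 * lora / full:.2f}% of the layer)")
# ~0.78% for this layer; across a whole model the trainable share is typically well under 1%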

QLoRA is a variation of LoRA that combines LoRA with the quantization concept explained above. Specifically, in our QLoRA implementation, we will use nf4 (Normal Float 4) for fine-tuning the model. QLoRA is especially helpful in our case study of fine-tuning a large model on a single consumer GPU.

Coding Time

Set up your Python environment

Create the following requirements.txt file:

torch
accelerate @ git+https://github.com/huggingface/accelerate.git
bitsandbytes
datasets==2.13.1
transformers @ git+https://github.com/huggingface/transformers.git
peft @ git+https://github.com/huggingface/peft.git
trl @ git+https://github.com/lvwerra/trl.git
scipy

Then install the requirements and import the libraries you will need:

pip install -r requirements.txt

import argparse
import os
from functools import partial

import bitsandbytes as bnb
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, AutoPeftModelForCausalLM
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed, Trainer, TrainingArguments, \
    BitsAndBytesConfig, DataCollatorForLanguageModeling

Download LLaMA 2 model

As mentioned before, LLaMA 2 models come in different sizes: 7B, 13B, and 70B parameters. Your choice will be influenced by your computational resources, since larger models require more memory, processing power, and training time.

To download the model you have been granted access to, make sure you are logged in to the Hugging Face model hub. As mentioned in the requirements step, you need to use the huggingface-cli login command.
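If you prefer to log in from a notebook instead of the command line, the huggingface_hub library offers an equivalent programmatic login (it prompts for the access token from your Hugging Face account settings):

from huggingface_hub import login

# Equivalent to `huggingface-cli login`; prompts for your Hugging Face access token
login()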

The following function will help us to download the model and its tokenizer. It requires a bitsandbytes configuration that we will define later.

def load_model(model_name, bnb_config):
    n_gpus = torch.cuda.device_count()
    max_memory = f'{40960}MB'

    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map="auto",  # dispatch the model efficiently on the available resources
        max_memory={i: max_memory for i in range(n_gpus)},
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_auth_token=True)

    # Needed for LLaMA tokenizer
    tokenizer.pad_token = tokenizer.eos_token

    return model, tokenizer

Download a Dataset

There are many datasets that can help you fine-tune your model. You can even use your own dataset!

In this tutorial, we are going to download and use the Databricks Dolly 15k dataset, which contains 15,000 prompt/response pairs. It was crafted by over 5,000 Databricks employees during March and April of 2023.

This dataset is designed specifically for fine-tuning large language models. Released under the CC BY-SA 3.0 license, it can be used, modified, and extended by any individual or company, even for commercial applications. So it’s a perfect fit for our use case!

However, like most datasets, this one has its limitations. Indeed, pay attention to the following points:

  • It consists of content collected from the public internet, which means it may contain objectionable, incorrect, or biased content and typos, which could influence the behavior of models fine-tuned on this dataset.
  • Since the dataset has been created for Databricks by their own employees, it’s worth noting that the dataset reflects the interests and semantic choices of Databricks employees, which may not be representative of the global population at large.
  • We only have access to the train split of the dataset, which is its largest subset.

# Load the databricks dataset from Hugging Face
from datasets import load_dataset

dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

Explore dataset

Once the dataset is downloaded, we can take a look at it to understand what it contains:

print(f'Number of prompts: {len(dataset)}')
print(f'Column names are: {dataset.column_names}')

*** OUTPUT ***
Number of prompts: 15011
Column names are: ['instruction', 'context', 'response', 'category']

As we can see, each sample is a dictionary that contains:

  • An instruction: what could be entered by the user, such as a question
  • A context: helps to interpret the sample
  • A response: the answer to the instruction
  • A category: classifies the sample as Open Q&A, Closed Q&A, Extract information from Wikipedia, Summarize information from Wikipedia, Brainstorming, Classification, or Creative writing
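For a concrete look at one of these records, you can print it directly (the exact content shown depends, of course, on the sample):

# Inspect the first record's raw fields
print(dataset[0])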

Pre-processing dataset

Instruction fine-tuning is a common technique used to fine-tune a base LLM for a specific downstream use-case.

It will help us to format our prompts as follows:

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Sea or Mountain

### Response:
I believe Mountain are more attractive but Ocean has it's own beauty and this tropical weather definitely turn you on! SO 50% 50%

### End

To delimit each prompt part by hashtags, we can use the following function:

def create_prompt_formats(sample):
    """
    Format the various fields of the sample ('instruction', 'context', 'response'),
    then concatenate them using two newline characters.
    :param sample: Sample dictionary
    """
    INTRO_BLURB = "Below is an instruction that describes a task. Write a response that appropriately completes the request."
    INSTRUCTION_KEY = "### Instruction:"
    INPUT_KEY = "Input:"
    RESPONSE_KEY = "### Response:"
    END_KEY = "### End"

    blurb = f"{INTRO_BLURB}"
    instruction = f"{INSTRUCTION_KEY}\n{sample['instruction']}"
    input_context = f"{INPUT_KEY}\n{sample['context']}" if sample["context"] else None
    response = f"{RESPONSE_KEY}\n{sample['response']}"
    end = f"{END_KEY}"

    parts = [part for part in [blurb, instruction, input_context, response, end] if part]
    formatted_prompt = "\n\n".join(parts)
    sample["text"] = formatted_prompt

    return sample
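As a quick check, you can apply this function to a single sample and print the assembled prompt before mapping it over the whole dataset:

# Preview the formatted prompt for the first sample
print(create_prompt_formats(dataset[0])["text"])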

Now, we will use our model tokenizer to process these prompts into tokenized ones.

The goal is to create input sequences of uniform length (which is suitable for fine-tuning the language model because it maximizes efficiency and minimizes computational overhead) that do not exceed the model's maximum token limit.

# SOURCE https://github.com/databrickslabs/dolly/blob/master/training/trainer.py
def get_max_length(model):
    conf = model.config
    max_length = None
    for length_setting in ["n_positions", "max_position_embeddings", "seq_length"]:
        max_length = getattr(model.config, length_setting, None)
        if max_length:
            print(f"Found max length: {max_length}")
            break
    if not max_length:
        max_length = 1024
        print(f"Using default max length: {max_length}")
    return max_length

def preprocess_batch(batch, tokenizer, max_length):
    """
    Tokenizing a batch
    """
    return tokenizer(
        batch["text"],
        max_length=max_length,
        truncation=True,
    )

# SOURCE https://github.com/databrickslabs/dolly/blob/master/training/trainer.py
def preprocess_dataset(tokenizer: AutoTokenizer, max_length: int, seed, dataset):
    """
    Format and tokenize the dataset so it is ready for training.
    :param tokenizer (AutoTokenizer): Model tokenizer
    :param max_length (int): Maximum number of tokens to emit from the tokenizer
    """
    # Add prompt to each sample
    print("Preprocessing dataset...")
    dataset = dataset.map(create_prompt_formats)  # , batched=True)

    # Apply preprocessing to each batch of the dataset and remove the
    # 'instruction', 'context', 'response', and 'category' fields
    _preprocessing_function = partial(preprocess_batch, max_length=max_length, tokenizer=tokenizer)
    dataset = dataset.map(
        _preprocessing_function,
        batched=True,
        remove_columns=["instruction", "context", "response", "text", "category"],
    )

    # Filter out samples that have input_ids exceeding max_length
    dataset = dataset.filter(lambda sample: len(sample["input_ids"]) < max_length)

    # Shuffle dataset
    dataset = dataset.shuffle(seed=seed)

    return dataset

With these functions, our dataset will be ready for fine-tuning!

Create a bitsandbytes configuration

This will allow us to load our LLM in 4 bits. This way, we can divide the memory used by a factor of about four and load the model on smaller devices. We choose the bfloat16 compute data type and nested quantization for memory-saving purposes.

def create_bnb_config():
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    return bnb_config

To leverage the LoRA method, we need to wrap the model as a PeftModel.

To do this, we need to define a LoRA configuration:

def create_peft_config(modules):
    """
    Create a Parameter-Efficient Fine-Tuning config for your model
    :param modules: Names of the modules to apply LoRA to
    """
    config = LoraConfig(
        r=16,  # dimension of the updated matrices
        lora_alpha=64,  # parameter for scaling
        target_modules=modules,
        lora_dropout=0.1,  # dropout probability for layers
        bias="none",
        task_type="CAUSAL_LM",
    )
    return config

The previous function needs the names of the target modules whose matrices should be updated. The following function will retrieve them for our model:

# SOURCE https://github.com/artidoro/qlora/blob/main/qlora.py
def find_all_linear_names(model):
    cls = bnb.nn.Linear4bit  # if args.bits == 4 else (bnb.nn.Linear8bitLt if args.bits == 8 else torch.nn.Linear)
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, cls):
            names = name.split('.')
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])

    if 'lm_head' in lora_module_names:  # needed for 16-bit
        lora_module_names.remove('lm_head')
    return list(lora_module_names)
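For a LLaMA-style model loaded in 4-bit, this helper typically returns the attention and MLP projection layers. For illustration only (the exact list depends on the architecture and how the model was loaded):

modules = find_all_linear_names(model)
print(modules)
# Typically something like: ['q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj']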

Once everything is set up and the base model is prepared, we can use the print_trainable_parameters() helper function to see how many trainable parameters are in the model.

def print_trainable_parameters(model, use_4bit=False):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        num_params = param.numel()
        # if using DS Zero 3 and the weights are initialized empty
        if num_params == 0 and hasattr(param, "ds_numel"):
            num_params = param.ds_numel
        all_param += num_params
        if param.requires_grad:
            trainable_params += num_params
    if use_4bit:
        trainable_params /= 2
    print(
        f"all params: {all_param:,d} || trainable params: {trainable_params:,} || trainable%: {100 * trainable_params / all_param}"
    )

We expect the LoRA model to have far fewer trainable parameters than the original one, since only the low-rank adapter matrices are trained while the base weights stay frozen.

Train

Now that everything is ready, we can pre-process our dataset and load our model using the set configurations:

# Load model from HF with user's token and with bitsandbytes config
model_name = "meta-llama/Llama-2-7b-hf"
bnb_config = create_bnb_config()
model, tokenizer = load_model(model_name, bnb_config)

## Preprocess dataset
seed = 42  # any fixed seed, used to shuffle the dataset reproducibly
max_length = get_max_length(model)
dataset = preprocess_dataset(tokenizer, max_length, seed, dataset)

Then, we can run our fine-tuning process:

def train(model, tokenizer, dataset, output_dir):
    # Apply preprocessing to the model to prepare it by
    # 1 - Enabling gradient checkpointing to reduce memory usage during fine-tuning
    model.gradient_checkpointing_enable()

    # 2 - Using the prepare_model_for_kbit_training method from PEFT
    model = prepare_model_for_kbit_training(model)

    # Get LoRA module names
    modules = find_all_linear_names(model)

    # Create PEFT config for these modules and wrap the model to PEFT
    peft_config = create_peft_config(modules)
    model = get_peft_model(model, peft_config)

    # Print information about the percentage of trainable parameters
    print_trainable_parameters(model)

    # Training parameters
    trainer = Trainer(
        model=model,
        train_dataset=dataset,
        args=TrainingArguments(
            per_device_train_batch_size=1,
            gradient_accumulation_steps=4,
            warmup_steps=2,
            max_steps=20,
            learning_rate=2e-4,
            fp16=True,
            logging_steps=1,
            output_dir="outputs",
            optim="paged_adamw_8bit",
        ),
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
    )

    model.config.use_cache = False  # re-enable for inference to speed up predictions for similar inputs

    ### SOURCE https://github.com/artidoro/qlora/blob/main/qlora.py
    # Verifying the datatypes before training
    dtypes = {}
    for _, p in model.named_parameters():
        dtype = p.dtype
        if dtype not in dtypes: dtypes[dtype] = 0
        dtypes[dtype] += p.numel()
    total = 0
    for k, v in dtypes.items(): total += v
    for k, v in dtypes.items():
        print(k, v, v / total)

    do_train = True

    # Launch training
    print("Training...")

    if do_train:
        train_result = trainer.train()
        metrics = train_result.metrics
        trainer.log_metrics("train", metrics)
        trainer.save_metrics("train", metrics)
        trainer.save_state()
        print(metrics)

    ###

    # Saving model
    print("Saving last checkpoint of the model...")
    os.makedirs(output_dir, exist_ok=True)
    trainer.model.save_pretrained(output_dir)

    # Free memory for merging weights
    del model
    del trainer
    torch.cuda.empty_cache()


output_dir = "results/llama2/final_checkpoint"
train(model, tokenizer, dataset, output_dir)

If you prefer to specify a number of epochs (how many times the entire training dataset is passed through the model) instead of a number of training steps (forward and backward passes with one batch of data), you can replace the max_steps argument with num_train_epochs.
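For example, a sketch of the same training arguments expressed in epochs could look like this (the values are illustrative, not tuned):

args = TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    warmup_steps=2,
    num_train_epochs=1,      # one full pass over the training set instead of a fixed max_steps
    learning_rate=2e-4,
    fp16=True,
    logging_steps=1,
    output_dir="outputs",
    optim="paged_adamw_8bit",
)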

To later load and use the model for inference, we used trainer.model.save_pretrained(output_dir), which saves the fine-tuned adapter weights and their configuration (the tokenizer is saved separately in the merging step below).

Unfortunately, it is possible that the latest weights are not the best. To solve this problem, you can add an EarlyStoppingCallback from transformers during your fine-tuning. This lets you evaluate your model regularly on a validation set, if you have one, and keep only the best weights.
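A minimal sketch of such a setup, assuming you have carved out a validation split (for example with dataset.train_test_split), could look like this:

from transformers import EarlyStoppingCallback

# Assumption: train_dataset and eval_dataset come from a split such as dataset.train_test_split(test_size=0.1)
trainer = Trainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    args=TrainingArguments(
        output_dir="outputs",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        fp16=True,
        evaluation_strategy="steps",      # evaluate on the validation set at regular intervals
        eval_steps=50,
        save_strategy="steps",
        save_steps=50,
        load_best_model_at_end=True,      # keep the checkpoint with the best eval_loss
        metric_for_best_model="eval_loss",
        greater_is_better=False,
        optim="paged_adamw_8bit",
    ),
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)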

Merge weights

Once we have our fine-tuned weights, we can build our fine-tuned model and save it to a new directory, with its associated tokenizer. By performing these steps, we can have a memory-efficient fine-tuned model and tokenizer ready for inference!

model = AutoPeftModelForCausalLM.from_pretrained(output_dir, device_map="auto", torch_dtype=torch.bfloat16)
model = model.merge_and_unload()

output_merged_dir = "results/llama2/final_merged_checkpoint"
os.makedirs(output_merged_dir, exist_ok=True)
model.save_pretrained(output_merged_dir, safe_serialization=True)

# save tokenizer for easy inference
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.save_pretrained(output_merged_dir)
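To sanity-check the merged model, a minimal generation sketch could look like this (the prompt is hypothetical and simply mirrors the training format):

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(output_merged_dir, device_map="auto", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(output_merged_dir)

prompt = "### Instruction:\nWhat is a GPU?\n\n### Response:\n"  # hypothetical example prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))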

Transformer Model

We chose the Mistral 7B model (mistralai/Mistral-7B-v0.1) for this test. Mistral 7B, developed by Mistral AI, is an open-source LLM released in September 2023 (link to the paper). In many respects, this model outperforms well-known models such as Llama 2 (see the charts below, from the Mistral release blog post).

Image from Mistral Release Blog Post (https://mistral.ai/news/announcing-mistral-7b/)

Dataset

We again used the Databricks databricks-dolly-15k dataset (under the CC BY-SA 3.0 license) for fine-tuning (read more). We used a small subset (1,000 rows) of this data to reduce the fine-tuning time and keep this a proof of concept.

Configurations

At model loading time, we used the following quantization config to overcome the GPU memory limitations we were facing.

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

This quantization config is vital for fine-tuning a model on a single GPU, as it specifies a low-precision storage data type, nf4 (4-bit Normal Float), and a computation data type, bfloat16. In practice, this means that whenever a QLoRA weight tensor is used, we dequantize the tensor to bfloat16 and then perform the matrix multiplication in 16-bit (read more in the original paper).

Also, as mentioned before, we are using LoRA in conjunction with quantization, known as QLoRA, to overcome the memory limitations. Here is our LoRA configuration:

lora_config = LoraConfig(
    r=16,
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj"],
    bias="none",
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

For our LoRA configuration, we used a rank of 16. It is advised to set the rank between 4 and 16 to get a good trade-off between reducing the number of trainable parameters and preserving model performance. Finally, we applied LoRA to a subset of the linear layers in our Mistral 7B transformer model.

Training and Monitoring

Using our single graphics card, we could complete 4 epochs of training (1,000 steps). One of the purposes of running such a test, training an LLM on a single local GPU, is to monitor the hardware resources without any restrictions. One of the simplest tools for monitoring the GPU during training is the NVIDIA System Management Interface (SMI). Simply open a terminal and in your command line type:

nvidia-smi

or, for continuous monitoring and updating (refreshing every second), use:

nvidia-smi -l 1

As a result, you will see the memory usage of each process on your GPU. In the following SMI view, we had just loaded the model, and it took about 5 GB of memory (thanks to quantization). Also, as you can see, the model is loaded by the Anaconda3 Python (Jupyter notebook) process.

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08 Driver Version: 545.23.08 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3090 On | 00000000:29:00.0 On | N/A |
| 30% 37C P8 33W / 350W | 5346MiB / 24576MiB | 5% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1610 G /usr/lib/xorg/Xorg 179MiB |
| 0 N/A N/A 1820 G /usr/bin/gnome-shell 41MiB |
| 0 N/A N/A 108004 G ...2023.3.3/host-linux-x64/nsys-ui.bin 8MiB |
| 0 N/A N/A 168032 G ...seed-version=20240110-180219.406000 117MiB |
| 0 N/A N/A 327503 C /home/***/anaconda3/bin/python 4880MiB |
+---------------------------------------------------------------------------------------+

And here (the following snapshot) is the memory state about 30 steps into the training process. As you can see, the GPU memory in use is now about 15 GB.

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08 Driver Version: 545.23.08 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3090 On | 00000000:29:00.0 On | N/A |
| 30% 57C P2 341W / 350W | 15054MiB / 24576MiB | 100% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1610 G /usr/lib/xorg/Xorg 179MiB |
| 0 N/A N/A 1820 G /usr/bin/gnome-shell 40MiB |
| 0 N/A N/A 108004 G ...2023.3.3/host-linux-x64/nsys-ui.bin 8MiB |
| 0 N/A N/A 168032 G ...seed-version=20240110-180219.406000 182MiB |
| 0 N/A N/A 327503 C /home/***/anaconda3/bin/python 14524MiB |
+---------------------------------------------------------------------------------------+

Although SMI is a simple tool for monitoring GPU memory usage, there are more advanced monitoring tools that provide more detailed information. One of these is the PyTorch Memory Snapshot, which you can read more about in this interesting article.
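If you prefer to track memory from inside the training script rather than from a separate terminal, PyTorch also exposes simple counters; a minimal sketch:

import torch

# Current and peak GPU memory seen by this PyTorch process (other processes are not counted)
print(f"allocated:      {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
print(f"reserved:       {torch.cuda.memory_reserved() / 1024**3:.2f} GB")
print(f"peak allocated: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")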

Summary

In this article, we showed that it is possible to fine-tune a large language model such as Mistral 7B on a single 24 GB GPU (such as the NVIDIA GeForce RTX 3090). However, as discussed in detail, special PEFT techniques such as QLoRA are necessary. The training batch size also matters, and we may need a much longer training time simply because of our limited resources.