Road to Efficient LLMs 2: QLoRA

Introduction

Previously, we discussed Low-Rank Adapters (LoRA) as a method for efficiently fine-tuning large language models (LLMs). In this post, we discuss QLoRA, a quantization method that builds on LoRA to make LLM fine-tuning even more memory-efficient. QLoRA reduces the memory required to fine-tune a 65 billion parameter LLM from over 780GB to less than 48GB of GPU memory. It does this by quantizing the pretrained LLM weights to 4-bit precision before fine-tuning. Despite this drastic memory reduction, QLoRA matches the performance of regular 16-bit fine-tuning in the paper's experiments. In this article, we explore the theory behind QLoRA to understand its fundamentals.

Weights Quantization

Before we dive into QLoRA, let’s first look at weight quantization. Weight quantization reduces the memory footprint of neural network models by converting the weights from high-precision data types like 32-bit floats (FP32) to lower-precision types like 8-bit integers (INT8).

Each FP32 weight takes up 4 bytes of memory. By quantizing to Int8, which uses just 1 byte per weight, we can reduce the memory usage by 4x.

For example, fine-tuning a 70 billion parameter model with full FP32 precision requires about 280GB of memory just for the model weights, excluding things like gradients and optimizer states. Using 8-bit quantization could shrink this to around 70GB - making large model fine-tuning much more feasible to run on standard GPUs.
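The arithmetic above can be checked with a few lines of plain Python. This is only a back-of-the-envelope sketch: the helper name `weight_memory_gb` is made up for illustration, 1 GB is taken as $10^9$ bytes, and only the weights are counted (no gradients, optimizer states, or activations).

```python
# Bytes needed to store a single weight at each precision.
BYTES_PER_WEIGHT = {"FP32": 4.0, "FP16/BF16": 2.0, "INT8": 1.0, "4-bit": 0.5}

def weight_memory_gb(num_params: int, bytes_per_weight: float) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_weight / 1e9

num_params = 70_000_000_000  # the 70B model from the example above
for dtype, nbytes in BYTES_PER_WEIGHT.items():
    print(f"{dtype:>9}: {weight_memory_gb(num_params, nbytes):6.1f} GB")
```

The 4-bit row is included because that is the precision QLoRA targets: it halves the INT8 footprint again.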

Quantization

Let’s go through a simple example to understand how quantization converts weights from higher to lower precision. We’ll quantize a tensor from 32-bit floating point (FP32) to 8-bit integers (INT8). An 8-bit integer has $2^8 = 256$ possible values covering [-128, 127]; the absmax scheme below uses the symmetric range [-127, 127]. The conversion is as follows:

\[\mathbf{X}^{\mathrm{Int8}}=\operatorname{round}\left(\frac{127}{\operatorname{absmax}\left(\mathbf{X}^{\mathrm{FP} 32}\right)} \mathbf{X}^{\mathrm{FP} 32}\right)=\operatorname{round}\left(c^{\mathrm{FP} 32} \cdot \mathbf{X}^{\mathrm{FP} 32}\right),\]

where $\mathbf{X}^{\mathrm{FP} 32}$ is the original input tensor in FP32 and ${\operatorname{absmax}\left(\mathbf{X}^{\mathrm{FP} 32}\right)}$ is the absolute maximum value in the tensor $\mathbf{X}^{\mathrm{FP} 32}$. The output $\mathbf{X}^{\mathrm{Int8}}$ is the INT8 representation of the original tensor. We also define

\[\frac{127}{\operatorname{absmax}\left(\mathbf{X}^{\mathrm{FP} 32}\right)}=c^{\mathrm{FP} 32}\]

which is the quantization constant needed for the dequantization operation. Now, let’s walk through an example with an implementation. Given the tensor below:

import torch
tensor = torch.tensor([0.32, -1.76, 0.025, -1.22])

To convert our data into the INT8 format, we can create a function based on the equation we discussed earlier.

def quantize(X_FP32):
    # Get the absolute maximum value in tensor X_FP32
    abs_max = torch.max(torch.abs(X_FP32))
    # Define the quantization constant C
    c_FP32 = 127 / abs_max
    # Round the multiplication result to nearest value
    X_Int8 = torch.round(c_FP32 * X_FP32)
    return X_Int8, c_FP32

And we will get the following result:

quantized_tensor, c = quantize(tensor)

print("Original Tensor:", tensor)
print("Quantized Tensor:", quantized_tensor)
# Output
Original Tensor: tensor([ 0.3200, -1.7600,  0.0250, -1.2200])
Quantized Tensor: tensor([  23., -127.,    2.,  -88.])

From our results, we’ve effectively converted the FP32 tensor into its INT8 counterpart. This means we can now store our tensor values as INT8 [23,-127,2,-88], allowing us to save on memory compared to the original FP32 storage.
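One detail worth noting: the `quantize` function above returns a float tensor that merely holds integer values, so on its own it does not shrink anything. To actually store one byte per weight, the result has to be cast to an 8-bit integer dtype. A minimal sketch of that extra step (the variable names here are illustrative):

```python
import torch

tensor = torch.tensor([0.32, -1.76, 0.025, -1.22])

# Same absmax quantization as above, followed by a cast to a real 8-bit dtype.
c = 127 / torch.max(torch.abs(tensor))
quantized_int8 = torch.round(c * tensor).to(torch.int8)

# element_size() reports the number of bytes used per element.
fp32_bytes = tensor.element_size() * tensor.numel()
int8_bytes = quantized_int8.element_size() * quantized_int8.numel()

print(quantized_int8)
print("FP32 bytes:", fp32_bytes, "| INT8 bytes:", int8_bytes)
```

The four FP32 values occupy 16 bytes, while their INT8 counterparts occupy 4, which is the 4x saving discussed earlier.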

Dequantization

The quantized weights are what we store, and that is where the memory saving comes from. During fine-tuning, however, we usually need to dequantize them back to a higher-precision data type before using them in computation. Dequantization cannot fully undo quantization: the full-precision values were rounded or truncated to fit within the quantized range, which introduces approximation errors. These errors can impact the model’s performance (accuracy) if left unaddressed.

During the fine-tuning process, backpropagation is used to compute gradients that indicate how much each weight in the network should change to minimize the loss. These gradient computations rely on the exact values and functions used during the forward pass. If the forward pass uses quantized values, the gradient calculations can be less accurate or even unstable, especially when using very low-bit quantization. Thus, to obtain meaningful and stable gradients, the weights (and sometimes activations) are dequantized for the forward and backward passes.

To perform dequantization, we can use the following equation:

\[\operatorname{dequant}\left(c^{\mathrm{FP} 32}, \mathbf{X}^{\mathrm{Int8}}\right)=\frac{\mathbf{X}^{\mathrm{Int8}}}{c^{\mathrm{FP} 32}}=\mathbf{X}^{\mathrm{FP} 32}\]

Let’s see how it works in code:

def dequantize(c_FP32, X_Int8):
    X_FP32_dequant = X_Int8 / c_FP32
    return X_FP32_dequant

Continuing from the previous implementation, we can see what the dequantized values look like:

tensor = torch.tensor([0.32, -1.76, 0.025, -1.22])

quantized_tensor, c = quantize(tensor)
dequantized_tensor = dequantize(c, quantized_tensor)

print("Original Tensor:", tensor)
print("Quantized Tensor:", quantized_tensor)
print("Dequantized Tensor:", dequantized_tensor)
print("c:", c)

# Output
Original Tensor: tensor([ 0.3200, -1.7600,  0.0250, -1.2200])
Quantized Tensor: tensor([  23., -127.,    2.,  -88.])
Dequantized Tensor: tensor([ 0.3187, -1.7600,  0.0277, -1.2195])
c: tensor(72.1591)

When we look closely, we see that the dequantized tensor isn’t an exact match to the original tensor. It is, however, back on the original scale, which the raw quantized integers are not. Take, for instance, the original value of 0.32. After dequantization, we get 0.3187, a slight difference of 0.0013. It’s crucial to minimize such discrepancies. Even though they seem small, they can add up as data moves through different layers of a model. This cumulative error is a key reason why naive fine-tuning with low-bit precision weights can fall short of the accuracy of full-precision fine-tuning.
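As a rough sanity check on the size of these errors (ignoring floating-point rounding), note that rounding to the nearest integer moves a scaled value by at most 0.5. After dividing by the quantization constant, the per-element error is therefore bounded by:

\[\left|x-\frac{\operatorname{round}\left(c^{\mathrm{FP} 32} \cdot x\right)}{c^{\mathrm{FP} 32}}\right| \leq \frac{0.5}{c^{\mathrm{FP} 32}}=\frac{\operatorname{absmax}\left(\mathbf{X}^{\mathrm{FP} 32}\right)}{254}\]

For the tensor above, this gives $1.76 / 254 \approx 0.0069$, and the observed errors (0.0013, 0, 0.0027, and 0.0005) all sit below that bound. The bound grows with the absolute maximum, which is why a single large value in the tensor hurts the precision of every other value.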

Note: the quantization constant is kept in the original tensor’s data type (FP32 here) so that we can use it for dequantization.

Blockwise-quantization

Sometimes, our set of weights can have unusually high or low values, known as outliers. These outliers can distort the range of values we’re trying to quantize. Let’s take an example to understand this better:

Suppose we introduce an outlier, say 100.1, to our previous tensor:

tensor = torch.tensor([0.32, -1.76, 0.025, -1.22,100.1])

quantized_tensor, c = quantize(tensor)
dequantized_tensor = dequantize(c, quantized_tensor)

print("Original Tensor:", tensor)
print("Quantized Tensor:", quantized_tensor)
print("Dequantized Tensor:", dequantized_tensor)
print("c:", c)

# Output
Original Tensor: tensor([ 3.2000e-01, -1.7600e+00,  2.5000e-02, -1.2200e+00,  1.0010e+02])
Quantized Tensor: tensor([  0.,  -2.,   0.,  -2., 127.])
Dequantized Tensor: tensor([  0.0000,  -1.5764,   0.0000,  -1.5764, 100.1000])
c: tensor(1.2687)

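Here, the outlier has caused significant errors. Because the range is stretched to cover 100.1, most of the quantization levels (i.e., the discrete values to which we are mapping) fall in the empty gap between the small values and the outlier, and only a handful remain for the range where the other weights actually lie. As a result, 0.32 and 0.025 both get mapped to 0, and -1.76 and -1.22 both get mapped to -2, which is not ideal.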

To address this, we can divide our tensor into smaller ‘blocks’ and apply quantization to each block separately. This way, an outlier in one block doesn’t affect the values in another block.

So, if we divide our tensor into $n$ blocks, we’ll also have $n$ unique quantization constants.
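The idea can be sketched as follows. This is an illustrative version rather than the implementation used by any particular library: the function names are made up, the tensor length is assumed to be a multiple of the block size, and an extra value (0.5) is appended to the outlier tensor so that it splits evenly into two blocks of three.

```python
import torch

def quantize_blockwise(x, block_size):
    """Absmax-quantize each block of `block_size` values independently."""
    blocks = x.reshape(-1, block_size)                # one row per block
    abs_max = blocks.abs().amax(dim=1, keepdim=True)  # per-block absolute maximum
    c = 127 / abs_max                                 # one constant per block
    return torch.round(c * blocks), c

def dequantize_blockwise(c, x_q):
    """Divide each block by its own constant and flatten back to 1-D."""
    return (x_q / c).flatten()

# Outlier tensor from above plus one padding value (assumption: 0.5).
tensor = torch.tensor([0.32, -1.76, 0.025, -1.22, 100.1, 0.5])

q, c = quantize_blockwise(tensor, block_size=3)
deq = dequantize_blockwise(c, q)

print("Quantized blocks:\n", q)
print("Constants per block:\n", c)
print("Dequantized:", deq)
```

The first block no longer sees the outlier, so 0.32 and 0.025 keep the same resolution as in the original example. Only the block that actually contains 100.1 is still quantized coarsely.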

QLoRA

QLoRA builds on these quantization ideas and adapts them to LLMs. It introduces a new data type, the 4-bit NormalFloat (NF4), which allows model weights to be quantized down to 4-bit precision, further reducing memory needs. In addition, QLoRA applies a second round of quantization to the quantization constants themselves, a technique called double quantization.

NF4

NF4 is a new data type that builds upon the concept of quantile quantization. At its core, quantile quantization is an enhanced form of the standard quantization process, with the key difference being the use of quantiles to set levels during quantization. If that sounds a bit tricky, don’t fret! We’ll break it down with an example shortly.

In our earlier discussion of INT8, we saw how high-precision data can be represented with integers in the range [-127, 127]. NF4, by comparison, is a 4-bit representation, so we are limited to $2^4$ (or 16) distinct values. This is considerably fewer than the 256 values available with INT8. A natural question arises: how do we determine which 16 values best represent the myriad potential weight values of a model?

According to the paper:

Since pretrained neural network weights usually have a zero-centered normal distribution with standard deviation σ, we can transform all weights to a single fixed distribution by scaling σ such that the distribution fits exactly into the range of our data type. For our data type, we set the arbitrary range [−1, 1]. As such, both the quantiles for the data type and the neural network weights need to be normalized into this range.

Given that most pretrained neural network weights adhere to this distribution, it becomes clear that we can segment the normal distribution into 16 distinct levels (or values). To achieve this, we can employ the quantile function associated with a normal distribution to pinpoint the 16 exact quantiles. Let’s dive into the specifics of how we can derive these 16 values from the normal distribution to represent our model’s weights more effectively.

Start from an offset value. The original implementation uses offset=0.9677083. Two considerations are at play here:

First, the authors want quantiles that have equal area to the left and to the right of each quantile. This means they cannot start from the 0 or 1 quantile of the normal distribution, since those correspond to negative and positive infinity.
Second, a symmetric k-bit quantization would have no exact representation for the value zero. To fix this and ensure an exact zero representation, an asymmetric data type is introduced. This is done by considering quantiles for the positive and negative values separately and then merging them, ensuring only one zero value. We create $2^{k-1}$ values for the negative part and $2^{k-1}+1$ for the positive part and remove the overlapping value. With $k=4$, that is 8 negative-side and 9 positive-side values, each set including zero, so dropping the duplicate zero leaves 16 values.

Note: In many cases, input data or feature maps are padded with zeros to maintain a specific spatial dimension or to ensure the alignment of data. An exact representation of zero ensures that this padding is represented accurately and does not introduce artifacts in the computations. Without an exact zero representation, zeros and values close to zero would be mapped to small non-zero values during quantization. This can introduce inaccuracies in the computations, especially when these small values get multiplied with other large values.

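offset = 0.9677083
# 8 evenly spaced probabilities for the positive side (the end point 0.5 is dropped)
torch.linspace(offset, 0.5, 9)[:-1]
# 7 evenly spaced probabilities for the negative side (the end point 0.5 is dropped)
torch.linspace(offset, 0.5, 8)[:-1]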

# Output for positive
tensor([0.9677, 0.9092, 0.8508, 0.7923, 0.7339, 0.6754, 0.6169, 0.5585])

# Output for negative
tensor([0.9677, 0.9009, 0.8341, 0.7673, 0.7004, 0.6336, 0.5668])

The breakdown for asymmetric quantization is as follows:

There are 8 values for the positive side. There are 7 values for the negative side. Together, this sums up to 15 values. The missing value is 0, which we will deliberately incorporate later. This intentional addition of 0 is in line with our previous discussions about the importance of having an exact representation for zero as one of the quantiles.

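Apply the quantile function to get the corresponding quantiles.

from scipy.stats import norm
# v1: the 8 positive quantiles; v3: the 7 negative quantiles (mirrored)
v1 = norm.ppf(torch.linspace(offset, 0.5, 9)[:-1]).tolist()
v3 = (-norm.ppf(torch.linspace(offset, 0.5, 8)[:-1])).tolist()

After adding the zero, sorting all 16 values, and dividing them by the largest one so that they span [-1, 1], we will get the following output: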

tensor([-1.0000, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911,  0.0000])
tensor([0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0000])

These values match the NF4 values listed in the paper.

Note: the complete code is taken from the official repo of QLoRA.
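For readers who want to reproduce the 16 values end to end, here is a dependency-free sketch of the same steps. It is not the official code: it swaps `scipy.stats.norm.ppf` for the standard library's `statistics.NormalDist().inv_cdf`, and the helper names `linspace` and `nf4_levels` are invented for this example.

```python
from statistics import NormalDist

def linspace(start, stop, num):
    """Evenly spaced values from start to stop, inclusive (like torch.linspace)."""
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]

def nf4_levels(offset=0.9677083):
    ppf = NormalDist().inv_cdf  # quantile function of the standard normal
    # 8 positive quantiles; the last probability (0.5, i.e. quantile 0) is dropped.
    positive = [ppf(p) for p in linspace(offset, 0.5, 9)[:-1]]
    # 7 negative quantiles, obtained by mirroring the positive side.
    negative = [-ppf(p) for p in linspace(offset, 0.5, 8)[:-1]]
    # Add the exact zero, sort, and rescale so the largest value becomes 1.
    levels = sorted(positive + [0.0] + negative)
    largest = max(levels)
    return [v / largest for v in levels]

levels = nf4_levels()
print(len(levels))
print([round(v, 4) for v in levels])
```

Because the first probability on both sides is the offset itself, the smallest and largest quantiles have the same magnitude, so dividing by the maximum maps them exactly to -1 and 1.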

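To apply quantization with NF4, we need only make minor adjustments to the earlier quantization process: each block is divided by its absolute maximum so that its values fall in [-1, 1], every value is mapped to the nearest of the 16 NF4 levels, and only the 4-bit index of that level is stored. Dequantization looks the level up again and multiplies it by the block's absolute maximum. The following figure illustrates the complete quantization and dequantization procedure when utilizing NF4.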
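In code, the adjustment amounts to replacing "round to the nearest integer" with "pick the nearest NF4 level". The following is a simplified sketch for a single block, not the optimized kernel a real library would use. The function names are illustrative, the levels are the rounded values listed above, and the indices are kept as ordinary integers, whereas a real implementation would pack two 4-bit indices into each byte.

```python
import torch

# The 16 NF4 levels listed above (rounded to 4 decimals).
NF4_LEVELS = torch.tensor([
    -1.0000, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0000,
     0.0796,  0.1609,  0.2461,  0.3379,  0.4407,  0.5626,  0.7230, 1.0000,
])

def quantize_nf4(x):
    """Return the index (0..15) of the nearest NF4 level for each value in one block."""
    abs_max = x.abs().max()    # quantization constant of the block
    normalized = x / abs_max   # values now lie in [-1, 1]
    # Distance from every value to every level, then pick the closest level.
    distances = (normalized.unsqueeze(-1) - NF4_LEVELS).abs()
    return distances.argmin(dim=-1), abs_max

def dequantize_nf4(indices, abs_max):
    """Look up each level and scale it back by the block's absolute maximum."""
    return NF4_LEVELS[indices] * abs_max

tensor = torch.tensor([0.32, -1.76, 0.025, -1.22])
indices, abs_max = quantize_nf4(tensor)

print("Indices:", indices)
print("Dequantized:", dequantize_nf4(indices, abs_max))
```

With only 16 levels the reconstruction is coarser than in the INT8 example: 0.32 lands on the level 0.1609 and comes back as roughly 0.283. This is the price of 4-bit storage, and it is why the placement of the levels matters so much.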

Double Quantization

For dequantizing stored weights, it’s essential to also store the quantization constant, represented as $c^{\mathrm{FP} 32}$. Given that blockwise quantization is employed, there will be $n$ quantization constants retained in their original data type. In the context of expansive models, like LLMs, this translates to a substantial number of quantization constants that must be stored, leading to increased memory overhead.

To mitigate this memory burden, the “Double Quantization” strategy is put forth. In this approach, the first round of quantization constants (expressed as $c^{\mathrm{FP} 32}$) undergoes another quantization process. This dual-step procedure yields:

Quantized quantization constants: $c_2^{\mathrm{FP} 8}$

Second-level quantization constants: $c_1^{\mathrm{FP} 32}$.

Opting for an 8-bit representation is strategic, as no noticeable performance drop was detected.
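To get a feel for how much this saves, we can count the extra bits that the constants add per model weight. The sketch below assumes the block sizes used by QLoRA (64 weights per first-level constant and 256 first-level constants per second-level constant), and the helper name `constant_overhead_bits` is made up for illustration.

```python
def constant_overhead_bits(block_size, bits_per_constant):
    """Extra bits per model weight needed to store one constant per block."""
    return bits_per_constant / block_size

# Without double quantization: one FP32 constant per block of 64 weights.
single = constant_overhead_bits(block_size=64, bits_per_constant=32)

# With double quantization: one 8-bit constant per 64 weights, plus one FP32
# second-level constant per 256 first-level constants (i.e. per 64 * 256 weights).
double = (constant_overhead_bits(64, 8)
          + constant_overhead_bits(64 * 256, 32))

num_params = 65_000_000_000  # the 65B model from the introduction
saved_gb = (single - double) * num_params / 8 / 1e9

print(f"overhead without double quantization: {single:.3f} bits per weight")
print(f"overhead with double quantization:    {double:.3f} bits per weight")
print(f"saved on a 65B model:                 {saved_gb:.2f} GB")
```

Under these assumptions the overhead drops from 0.5 to about 0.127 bits per weight, which works out to roughly 3 GB for a 65 billion parameter model.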

Combine Everything

Until now, our discussion has primarily revolved around the Quantization aspect of QLoRA. But you might wonder, where does the LoRA component fit into the picture?

Note: If LoRA is a new term for you, I recommend going through my earlier blog post for a comprehensive understanding.

In essence, the process involves quantizing the pre-trained model weights using NF4. After this, only the LoRA weights undergo fine-tuning, while the pre-trained weights remain unchanged or “frozen”.

For a clearer picture, consider the forward pass of QLoRA for just one linear layer in the quantized base model paired with a single LoRA adapter:

\[\mathbf{Y}^{\mathrm{BF} 16}=\mathbf{X}^{\mathrm{BF} 16} \text { doubleDequant }\left(c_1^{\mathrm{FP} 32}, c_2^{\mathrm{k}-\mathrm{bit}}, \mathbf{W}^{\mathrm{NF} 4}\right)+\mathbf{X}^{\mathrm{BF} 16} \mathbf{L}_1^{\mathrm{BF} 16} \mathbf{L}_2^{\mathrm{BF} 16},\]

Here, $\mathbf{W}$ denotes the weight of the pretrained model, and $\mathbf{L}_1^{\mathrm{BF} 16} \mathbf{L}_2^{\mathrm{BF}16}$ represents the LoRA adapters. The function, doubleDequant $(\cdot)$, or double dequantization, is defined as:

\[\operatorname{doubleDequant}\left(c_1^{\mathrm{FP} 32}, c_2^{\mathrm{k}-\mathrm{bit}}, \mathbf{W}^{\mathrm{k}-\mathrm{bit}}\right)=\operatorname{dequant}\left(\operatorname{dequant}\left(c_1^{\mathrm{FP} 32}, c_2^{\mathrm{k}-\mathrm{bit}}\right), \mathbf{W}^{4 \mathrm{bit}}\right)=\mathbf{W}^{\mathrm{BF} 16},\]

For this model, NF4 is employed for $\mathbf{W}$ and FP8 is used for $c_2$. Additionally, the block size designated for $\mathbf{W}$ is 64, while for $c_2$, it stands at 256.
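To make the forward pass concrete, here is a toy sketch of a single linear layer. It takes several liberties that should be read as assumptions rather than as the paper's implementation: FP32 is used instead of BF16 so that it runs on any CPU, INT8 absmax quantization stands in for FP8 in the second quantization, the mean-centering that the paper applies to the constants before the second quantization is omitted, and each block stores its absolute maximum directly (so dequantization multiplies instead of dividing). All function names are illustrative.

```python
import torch

torch.manual_seed(0)

# The 16 NF4 levels from the NF4 section (rounded to 4 decimals).
NF4_LEVELS = torch.tensor([
    -1.0000, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0000,
     0.0796,  0.1609,  0.2461,  0.3379,  0.4407,  0.5626,  0.7230, 1.0000,
])

def quantize_weights(w, block_size=64):
    """First quantization: NF4 indices plus one absmax constant per block."""
    blocks = w.reshape(-1, block_size)
    absmax = blocks.abs().amax(dim=1, keepdim=True)
    idx = ((blocks / absmax).unsqueeze(-1) - NF4_LEVELS).abs().argmin(dim=-1)
    return idx, absmax.flatten()

def quantize_constants(c, block_size=256):
    """Second quantization: 8-bit absmax over blocks of constants (INT8 stand-in for FP8)."""
    blocks = c.reshape(-1, block_size)
    c1 = blocks.abs().amax(dim=1, keepdim=True) / 127   # second-level constants (FP32)
    c2 = torch.round(blocks / c1).to(torch.int8)        # quantized first-level constants
    return c1, c2

def double_dequant(c1, c2, w_idx, shape):
    absmax = (c2.float() * c1).reshape(-1, 1)           # inner dequant: recover the constants
    return (NF4_LEVELS[w_idx] * absmax).reshape(shape)  # outer dequant: recover the weights

d_in, d_out, rank = 256, 64, 8
W = torch.randn(d_in, d_out) * 0.02   # stand-in for a pretrained weight matrix
w_idx, c = quantize_weights(W)        # frozen base weights, stored as NF4 indices
c1, c2 = quantize_constants(c)

# LoRA adapters: the only trainable parameters. L2 starts at zero so that the
# adapter initially leaves the layer's output unchanged.
L1 = (torch.randn(d_in, rank) * 0.01).requires_grad_()
L2 = torch.zeros(rank, d_out, requires_grad=True)

X = torch.randn(4, d_in)
W_deq = double_dequant(c1, c2, w_idx, W.shape)
Y = X @ W_deq + X @ L1 @ L2           # base path + LoRA path

Y.sum().backward()
print(Y.shape, L1.grad.shape, L2.grad.shape)
```

Gradients flow only into `L1` and `L2`. The dequantized base weights are recomputed from their 4-bit form whenever they are needed and are never updated, which is what keeps the memory footprint small.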

Conclusion

In today’s blog, we’ve explored a method to streamline the fine-tuning of LLMs. The strategy revolves around quantizing the pre-trained weights using a novel 4-bit representation and integrating LoRA adapters across layers for the fine-tuning process. This approach significantly trims down memory demands, making it feasible to fine-tune LLMs on far more modest hardware, down to consumer-grade GPUs for smaller models. Notably, QLoRA has become a popular choice for fine-tuning, since the paper reports performance comparable to regular 16-bit fine-tuning. In our upcoming content, we’ll dive into the practicalities of fine-tuning LLMs with QLoRA. Stay tuned!

References

Paper: