Reducing resource requirements: Hands-on LLM quantization and post-quantization performance assessment for dummies
- Intro
- Post training quantization VS quantization aware training
- Precision to memory usage prediction
- Rule of thumb: accuracy VS resource requirements balance
- Running quantized model inference
- Automatic post-quantization accuracy evaluation
Intro
In the LLM context, quantization is a technique used to reduce the precision of the model’s parameters (weights and biases) in order to reduce the model’s memory footprint.
While it has a lot of advantages, the main ones are that it allows us to reduce the hardware requirements for the model we use (or use a bigger model on the same hardware) and to reduce inference cost (or increase inference speed).
Similar to Effective LLM fine-tuning for dummies, this post is aimed at professionals (such as software engineers and architects) who aren’t specialized in the AI and data science domains but are required to gain sufficient knowledge to enable and lead features and projects in these domains. The content is therefore simplified and covers the main subjects without getting into too many details, while maintaining a hands-on approach.
Post training quantization VS quantization aware training
Typically, the quantization process can be applied in one of two phases: during training (or fine-tuning), or during inference.
Quantization during training is usually referred to as QAT (quantization aware training), while quantization applied to an already trained model, for inference, is referred to as PTQ (post training quantization).
QAT is a computationally expensive process but might result in better model accuracy than the cheaper, simpler PTQ.
With that being said, depending on the data and the precision after quantization, the reduction in model accuracy is often negligible.
In this post, we’ll do some hands-on PTQ.
Precision to memory usage prediction
As a generic rule of thumb, to predict the approximate memory requirement of a specific model, multiply the number of parameters by the size of the weight precision in bytes.
Let’s take falcon-7b with single-precision 32-bit float weights as an example.
A 32-bit float takes up 4 bytes, so if we multiply the number of bytes by the number of parameters, we get: 4 × 7e9 = 2.8e10 = 28,000,000,000 bytes ≈ 28 GB.
Based on the same logic, reducing the weights to half-precision 16-bit floats, which take up 2 bytes each, will require approximately 14 GB of memory.
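To make that calculation easy to reuse, here’s a minimal back-of-the-envelope sketch in Python; the helper name and parameter count are illustrative assumptions, not part of any library.

```python
# Rough memory estimate: number of parameters x bytes per parameter.
# Illustrative helper, not taken from any library.
def estimate_memory_gb(num_params: float, bits_per_param: int) -> float:
    bytes_per_param = bits_per_param / 8
    return num_params * bytes_per_param / 1e9  # decimal GB

falcon_7b_params = 7e9
print(estimate_memory_gb(falcon_7b_params, 32))  # ~28.0 GB, single-precision float
print(estimate_memory_gb(falcon_7b_params, 16))  # ~14.0 GB, half-precision float
```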
Rule of thumb: accuracy VS resource requirements balance
The optimal balance between weight precision and accuracy, like a lot of things in the LLM ecosystem, depends on the use case and the data.
As a rule of thumb, quantizing to int 8-bit results in hardly noticeable accuracy degradation while reducing memory requirements by 50% or 75%, depending on whether the original parameters were 16-bit or 32-bit floats, respectively.
With that being said, more aggressive precision reductions such as int4, float4, or even double quantization do exist and are definitely worth testing to see whether they are good enough for your data and use case.
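To make the 50% and 75% figures concrete, here’s a tiny sketch of the underlying arithmetic (an illustrative snippet, not a library function):

```python
# Memory saved by quantizing to a lower precision, relative to the original one.
def memory_reduction_pct(original_bits: int, quantized_bits: int) -> float:
    return (1 - quantized_bits / original_bits) * 100

print(memory_reduction_pct(16, 8))  # 50.0 -> half-precision float 16-bit to int 8-bit
print(memory_reduction_pct(32, 8))  # 75.0 -> single-precision float 32-bit to int 8-bit
```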
Running quantized model inference
Similar to the Effective LLM fine-tuning for dummies post, I will use LIT-GPT as the tool due to its simplicity.
Running quantized inference with LIT-GPT is pretty straightforward, for example:
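As a rough sketch (the script path and flag names below follow the LIT-GPT repository and may differ between versions, so treat them as assumptions and verify them against the repo):

```bash
# Illustrative LIT-GPT inference call loading the weights in 16-bit precision.
# Script path and flag names may vary between LIT-GPT versions; check the repo docs.
python generate/base.py \
  --checkpoint_dir checkpoints/tiiuae/falcon-7b \
  --precision bf16-true \
  --prompt "Summarize the following text: ..."
```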
This command will run inference with 16-bit precision parameters.
As explained above, there are additional options such as int4, float4, etc.
For the full list of options, check the LIT-GPT quantization GitHub page.
Automatic post-quantization accuracy evaluation
It’s a good idea to manually evaluate a handpicked set of examples before and after quantization to assess how the quantized model performs.
But if we want to perform a more thorough assessment on a larger test set, or to run it as a step in one of our CI/CD pipelines, an automatic tool will be required.
One of these tools is evaluate, which can be found on Hugging Face (or GitHub).
Evaluate provides 53 easy-to-use evaluation tools which we can leverage to test post-quantization model performance.
Pre and post quantization summarization quality example using evaluate
For the sake of the example, let’s assume that our use case is to test text summarization capabilities of a language model pre- and post-quantization.
One of the multiple metrics we can use to compare the quality of the summary is ROUGE (Recall-Oriented Understudy for Gisting Evaluation).
We can run the evaluation module twice, first with the pre-quantization inference output and then with the post-quantization output, and compare the scores.
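As a minimal sketch, assuming a handful of placeholder strings stand in for the reference summary and the model outputs produced before and after quantization, the evaluate ROUGE metric can be used like this:

```python
# pip install evaluate rouge_score
import evaluate

# Placeholder data: a reference summary plus the model outputs generated
# before and after quantization (illustrative strings only).
references = ["The report describes a sharp rise in renewable energy adoption."]
pre_quant_predictions = ["The report notes a sharp rise in renewable energy adoption."]
post_quant_predictions = ["The report says renewable energy adoption rose sharply."]

rouge = evaluate.load("rouge")

pre_scores = rouge.compute(predictions=pre_quant_predictions, references=references)
post_scores = rouge.compute(predictions=post_quant_predictions, references=references)

print("pre-quantization :", pre_scores)
print("post-quantization:", post_scores)
```

Comparable ROUGE scores between the two runs suggest the summaries kept similar wording after quantization.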
As explained above, a similar ROUGE score does not guarantee good post-quantization accuracy, since it only measures overlapping terms and phrases, but it’s enough to demonstrate the ease of use of the evaluate package.