Original article excerpt
Server-side extracted preview paragraphs from the original source.
We’re on a journey to advance and democratize artificial intelligence through open source and open science.
If you want to fine-tune an open model on your own data, you are probably interested in so-called parameter-efficient fine-tuning, in short PEFT. This term describes techniques that significantly reduce the memory requirement to fine-tune a model. Although there are dozens of these techniques, almost everyone chooses one called “LoRA”. In this blog post, we explore whether LoRA is really the best choice, what tools are available to make an informed decision, and how you can benefit from extending your horizon beyond LoRA.
There are countless open models available, but they often aren't quite good enough for your use case. Prompting may help, but it usually isn't enough. Rather than training a new model from scratch, you should consider fine-tuning an existing one.
Fine-tuning, however, is memory-hungry: you generally need enough memory to fit the whole model several times over. Quantization reduces a model's memory footprint, but quantized models can't be fine-tuned directly. So a set of techniques emerged to cut the memory needed for fine-tuning, called "parameter-efficient fine-tuning", or PEFT.
With PEFT, you can fine-tune a model using only a fraction of that memory and even fine-tune quantized models. It offers other advantages, such as tiny checkpoint sizes, greater resistance to catastrophic forgetting, and the ability to serve multiple fine-tunes from the same base model.
At Hugging Face, we develop the PEFT library, which implements many PEFT techniques behind a unified API and integrates well with the ecosystem, for example Transformers and Diffusers. It also supports multiple quantization methods, enabling further accessibility in parameter-efficient fine-tuning. PEFT provides a good starting point, whether you want to fine-tune on your own data or you're researching a new PEFT method.
One parameter-efficient fine-tuning technique that emerged early and proved to be quite effective is called “Low Rank Adaptation”, or short “LoRA”. It works by adding a handful of parameters on top of the base model, freezing the base model weights, and only training those few parameters.