What is LoRA, and why does everyone use it?
A short, picture-led primer on Low-Rank Adaptation — the trick that makes fine-tuning huge models tractable on a single GPU.
You have a huge pretrained neural network. Hundreds of millions of parameters. It’s almost right for your task — but not quite. You’d like to teach it something new: a new language, a new style, a new domain. And ideally without renting a rack of A100s.
That’s the problem LoRA solves. Short for Low-Rank Adaptation, introduced in a 2021 Microsoft paper, now the default approach for fine-tuning large models in basically every open-source toolkit. The Norwegian voice for CosyVoice 3 I just released? Trained with LoRA on a single RTX 3090. The full model has ~500 million parameters; the part I actually trained is ~13 million. That ratio is why this kind of work is possible at all on a single consumer GPU.
The core idea
Skip to the punchline: instead of changing the model’s weights directly, LoRA learns a small addition to those weights, expressed as the product of two skinny matrices.
If a weight matrix $W$ somewhere inside the model has shape $d \times k$ (say $4096 \times 4096$), the naive thing to do during fine-tuning is to update all $d \cdot k$ of those numbers. That’s about 16 million parameters per layer, and a real model has dozens of layers.
LoRA parameterizes the change $\Delta W$ as a product of two thin matrices:
$$\Delta W = BA, \qquad B \in \mathbb{R}^{d \times r}, \quad A \in \mathbb{R}^{r \times k}$$
where the rank $r$ is small, typically 8, 16, or 24.
When you multiply $B$ and $A$ together you get something the same shape as $W$, but parameterized by only $r(d + k)$ numbers instead of $d \cdot k$. For $r = 8$ on a $4096 \times 4096$ matrix that’s about 65 thousand numbers instead of 16 million, roughly 250× fewer.
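The parameter arithmetic above can be checked in a few lines. This is a minimal sketch; the function names are made up for illustration, and the 4096 × 4096 shape is the example size used in the text.

```python
def full_params(d: int, k: int) -> int:
    """Number of entries in a dense d x k weight matrix."""
    return d * k


def lora_params(d: int, k: int, r: int) -> int:
    """Number of entries in B (d x r) plus A (r x k)."""
    return r * (d + k)


d, k = 4096, 4096
full = full_params(d, k)

# Compare the trainable-parameter count for the ranks mentioned in the text.
for r in (8, 16, 24):
    low = lora_params(d, k, r)
    print(f"r={r:>2}: {low:>8,} trainable vs {full:,} ({full / low:.0f}x fewer)")
```

For rank 8 this gives 65,536 trainable numbers against 16,777,216, a factor of exactly 256, which is where the "roughly 250×" comes from.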
During training the pretrained weight $W$ stays frozen and only the LoRA matrices $A$ and $B$ receive gradient updates. During inference you can either keep them separate as $Wx + BAx$, or merge them once into a new effective weight $W' = W + BA$ and run the model normally.
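The two inference options give the same output, which is easy to confirm numerically. The sketch below uses NumPy with toy sizes and random values chosen only for illustration. It follows the formulas in the text as written and leaves out details the text does not cover, such as initialization or any scaling factor an implementation might apply.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 6, 5, 2                 # toy sizes, for illustration only

W = rng.normal(size=(d, k))       # frozen pretrained weight
B = rng.normal(size=(d, r))       # trainable LoRA factor
A = rng.normal(size=(r, k))       # trainable LoRA factor
x = rng.normal(size=k)            # one input vector

# Option 1: keep the adapter separate and add its contribution at run time.
y_separate = W @ x + B @ (A @ x)

# Option 2: fold the adapter into the weight once, then run as usual.
W_merged = W + B @ A
y_merged = W_merged @ x

print(np.allclose(y_separate, y_merged))
```

Keeping the factors separate makes it cheap to swap adapters; merging removes the extra matrix products at inference time.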
A worked example: rank-1 on a tiny matrix
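The numbers above are easier to believe if you watch them unroll once. Say $W$ is a $4 \times 4$ matrix: 16 numbers if we were to fine-tune it naively. Instead we parameterize a rank-1 LoRA ($r = 1$):
$$B = \begin{pmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \end{pmatrix}, \qquad A = \begin{pmatrix} a_1 & a_2 & a_3 & a_4 \end{pmatrix}$$
That’s 8 trainable numbers in total. The product $BA$ is an outer product:
$$BA = \begin{pmatrix} b_1 a_1 & b_1 a_2 & b_1 a_3 & b_1 a_4 \\ b_2 a_1 & b_2 a_2 & b_2 a_3 & b_2 a_4 \\ b_3 a_1 & b_3 a_2 & b_3 a_3 & b_3 a_4 \\ b_4 a_1 & b_4 a_2 & b_4 a_3 & b_4 a_4 \end{pmatrix}$$
16 numbers out, the same shape as $W$, but every row is a scalar multiple of $A$. The scalar for row $i$ is just $b_i$. That’s what rank 1 means: the whole matrix lies on a single line through the origin in row-space. 8 parameters encode 16 values, but those 16 values are tied together, so you can’t set them independently.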
Crank up $r$ and you give the LoRA more independent directions to move along. With $r = 2$ you get the sum of two such rank-1 matrices; with $r = 24$ (the rank I used for the CosyVoice run) you get 24. Each extra unit of rank costs more parameters (one more column of $B$ and one more row of $A$); in return you get more expressive capacity.
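The rank-1 structure and the "sum of rank-1 matrices" view can both be verified with NumPy. In this sketch the entries of `B` and `A` are arbitrary illustrative values, not numbers from any trained model.

```python
import numpy as np

# Rank-1 case: a 4 x 1 column times a 1 x 4 row (arbitrary example values).
B = np.array([[1.0], [2.0], [0.5], [-1.0]])
A = np.array([[3.0, 0.0, 1.0, 2.0]])

delta_W = B @ A                              # 4 x 4 outer product
print(delta_W)
print("rank:", np.linalg.matrix_rank(delta_W))

# Every row i is the row vector A scaled by B[i].
rows_are_multiples = all(
    np.allclose(delta_W[i], B[i, 0] * A[0]) for i in range(4)
)
print("rows are multiples of A:", rows_are_multiples)

# Rank-2 case: the product equals the sum of two rank-1 outer products.
rng = np.random.default_rng(0)
B2 = rng.normal(size=(4, 2))
A2 = rng.normal(size=(2, 4))
delta_W2 = B2 @ A2
sum_of_outer = np.outer(B2[:, 0], A2[0]) + np.outer(B2[:, 1], A2[1])
print("sum of outer products matches:", np.allclose(delta_W2, sum_of_outer))
```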
Why does it work?
The empirical observation behind LoRA: when you fine-tune a pretrained model, the update to its weights turns out to be highly structured. Even though, in principle, fine-tuning could change every single parameter, in practice the change behaves as if it has a much lower rank than the original weights. You’re not exploring all 16 million dimensions of the update space — you’re moving along a small handful of meaningful directions.
So you bake that low-rank assumption directly into the parameterization. It isn’t perfect for every task, but it’s surprisingly close to full fine-tuning quality for a tiny fraction of the parameters.
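One way to see what "behaves as if it has a much lower rank" means is to look at singular values. The sketch below is purely synthetic: it builds a made-up update matrix from a rank-4 signal plus small noise (sizes and noise level are arbitrary assumptions), then measures how much of the squared norm sits in the top few directions. It illustrates the idea only and says nothing about any particular real model.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, true_rank = 64, 64, 4       # toy sizes, hypothetical

# Synthetic "fine-tuning update": a rank-4 signal plus small dense noise.
signal = rng.normal(size=(d, true_rank)) @ rng.normal(size=(true_rank, k))
noise = 0.01 * rng.normal(size=(d, k))
delta_W = signal + noise

# Singular values, largest first, and the cumulative share of squared norm.
s = np.linalg.svd(delta_W, compute_uv=False)
energy = np.cumsum(s**2) / np.sum(s**2)

print("largest singular values:", np.round(s[:6], 2))
print(f"share held by top {true_rank} directions: {energy[true_rank - 1]:.4f}")
```

When nearly all of the squared norm sits in a handful of directions, a product $BA$ with small $r$ can represent the update with little loss.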
What this gets you in practice
The savings stack up:
- Memory. You only need gradients and optimizer state for the LoRA matrices, not the whole model. On a half-billion-parameter model with r = 24, that’s the difference between needing 40+ GB of VRAM and fitting comfortably on a 24 GB consumer card.
- Training speed. Fewer parameters → fewer gradient computations → faster steps.
- Tiny output files. A LoRA adapter for a 500M model is around 50 MB instead of 2 GB. You can share them, swap them, and stack several of them on the same base model.
- Plugin-style fine-tuning. Because the original weights are untouched, you can keep one base model in memory and swap LoRAs in and out depending on the task — different voices, different domains, different languages.
The cost is real but usually small: you’re giving up some capacity. There are tasks where full fine-tuning meaningfully outperforms LoRA, especially when the target distribution is far from what the model was originally pretrained on. For most adaptation tasks — including teaching a multilingual TTS model a new language — the trade looks great.
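The file-size figures in the list above follow from simple arithmetic. A rough sketch, assuming weights are stored as 32-bit floats (an assumption; half precision would halve both figures) and ignoring file-format overhead:

```python
BYTES_PER_PARAM = 4  # assumption: float32 storage


def size_mb(num_params: int, bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Approximate size in megabytes (10**6 bytes), ignoring format overhead."""
    return num_params * bytes_per_param / 1e6


base_params = 500_000_000      # "~500 million", the full-model figure
adapter_params = 13_200_000    # "~13.2 million", the trainable figure

print(f"base model: {size_mb(base_params):,.0f} MB")
print(f"adapter:    {size_mb(adapter_params):,.1f} MB")
```

Under that assumption the base model comes to 2,000 MB and the adapter to 52.8 MB, in line with the "around 50 MB instead of 2 GB" quoted above.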
What I used for CosyVoice 3
For the Norwegian Bokmål LoRA on Fun-CosyVoice3-0.5B-2512:
- Target: the Qwen2-0.5B language-model frontend. The flow-matcher decoder downstream is left untouched.
- Rank: r = 24.
- Coverage: applied to all 24 transformer blocks of the LLM.
- Trainable parameters: ~13.2 million out of ~500 million total — about 2.6 % of the full model.
That’s the part of the model that learns to map Norwegian text to semantic speech tokens. The decoder already knows how to render those tokens into audio; what needed teaching was the upstream “how does Norwegian sound.”
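The trainable-parameter figure can be sanity-checked with the $r(d + k)$ formula. The list above does not say which projections inside each block carry an adapter, so the sketch below makes two loudly labeled assumptions: the layer dimensions are those of the public Qwen2-0.5B configuration as commonly published (worth verifying against the actual checkpoint), and the target set is every linear projection in each block. The dictionary keys are the usual Hugging Face module names, used here only as labels.

```python
# Assumed Qwen2-0.5B dimensions; verify against the checkpoint you use.
HIDDEN = 896
INTERMEDIATE = 4864
NUM_LAYERS = 24
KV_DIM = 128      # assumed: 2 key/value heads x 64 dims per head
RANK = 24

# Hypothetical target set: all seven linear projections per block,
# given as (in_features, out_features).
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, r: int) -> int:
    """Entries in the two LoRA factors for one linear layer."""
    return r * (in_features + out_features)


per_block = sum(lora_params(i, o, RANK) for i, o in PROJECTIONS.values())
total = per_block * NUM_LAYERS

print(f"per block: {per_block:,}")
print(f"total:     {total:,} ({total / 500_000_000:.2%} of ~500M)")
```

Under these assumptions the total is 13,197,312, which lines up with the ~13.2 million quoted above. That agreement is suggestive, not proof of the actual target set.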