What is LoRA, and why does everyone use it?

You have a huge pretrained neural network. Hundreds of millions of parameters. It’s almost right for your task — but not quite. You’d like to teach it something new: a new language, a new style, a new domain. And ideally without renting a rack of A100s.

That’s the problem LoRA solves. It’s short for Low-Rank Adaptation, was introduced in a 2021 Microsoft paper, and is now the default approach for fine-tuning large models in basically every open-source toolkit. The Norwegian voice for CosyVoice 3 I just released? Trained with LoRA on a single RTX 3090. The full model has ~500 million parameters; the part I actually trained is ~13 million. That ratio is why this kind of work is possible at all on a single consumer GPU.

The core idea

Skip to the punchline: instead of changing the model’s weights directly, LoRA learns a small addition to those weights, expressed as the product of two skinny matrices.

If a weight matrix $W$ somewhere inside the model has shape $d \times k$ (say $4096 \times 4096$), the naive thing to do during fine-tuning is to update all $d \cdot k$ of those numbers. That’s about 16 million parameters for that one matrix, and a real model has several such matrices in each of its dozens of layers.

LoRA parameterizes the change as a product of two thin matrices:

$$\Delta W = B \cdot A, \qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k}$$

where the rank $r$ is small — typically 8, 16, or 24.

When you multiply $B$ and $A$ together you get something the same shape as $W$, but parameterized by only $r \cdot (d + k)$ numbers instead of $d \cdot k$. For $r = 8$ on a $4096 \times 4096$ matrix that’s $8 \cdot (4096 + 4096) = 65{,}536$ numbers instead of about 16.8 million, which is 256× fewer.
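The arithmetic is easy to check in a few lines. This is only a sketch; `lora_param_count` is a helper name made up for this post, not something from a library.

```python
def lora_param_count(d: int, k: int, r: int) -> int:
    """Trainable numbers in B (d x r) plus A (r x k)."""
    return r * (d + k)


d, k, r = 4096, 4096, 8
full = d * k                      # every entry of W
lora = lora_param_count(d, k, r)  # entries of B and A together

print(full, lora, full // lora)   # 16777216 65536 256
```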

During training $W$ stays frozen and only $A$ and $B$ receive gradient updates. During inference you can either keep them separate as $y = Wx + BAx$, or merge them once into a new effective weight $W' = W + BA$ and run the model normally.
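Here is a minimal NumPy sketch of both halves of that paragraph: one hand-written gradient step that touches only $A$ and $B$, then a check that the merged and unmerged forms give the same output. The sizes, learning rate, squared-error loss, and target are all made up for illustration. The initialization (random $A$, zero $B$) follows the LoRA paper, but the 0.1 scale is an arbitrary choice for this toy.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 6, 5, 2

W = rng.normal(size=(d, k))        # pretrained weight, frozen
A = 0.1 * rng.normal(size=(r, k))  # small random init (scale chosen for this toy)
B = np.zeros((d, r))               # zero init, so BA = 0 before training

x = rng.normal(size=k)             # one input vector
target = rng.normal(size=d)        # made-up regression target


def forward(W, A, B, x):
    # Unmerged form y = Wx + B(Ax); the d x k product BA is never built.
    return W @ x + B @ (A @ x)


def loss(W, A, B, x, target):
    return 0.5 * np.sum((forward(W, A, B, x) - target) ** 2)


W_frozen = W.copy()
loss_before = loss(W, A, B, x, target)

# One plain gradient-descent step on A and B only; W is never updated.
lr = 0.1
err = forward(W, A, B, x) - target  # dL/dy for the squared-error loss
grad_B = np.outer(err, A @ x)       # dL/dB = err (Ax)^T
grad_A = np.outer(B.T @ err, x)     # dL/dA = (B^T err) x^T, zero while B is zero
B = B - lr * grad_B
A = A - lr * grad_A

loss_after = loss(W, A, B, x, target)

# Inference: the separate adapter and the merged weight give the same output.
W_merged = W + B @ A
y_separate = forward(W, A, B, x)
y_merged = W_merged @ x

print(loss_before, loss_after, np.allclose(y_separate, y_merged))
```

Because $B$ starts at zero, the very first step only moves $B$; $A$ starts receiving a non-zero gradient from the second step on. Real training would use an autodiff framework and a proper optimizer rather than hand-derived gradients.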

A worked example: rank-1 on a tiny matrix

The numbers above are easier to believe if you watch them unroll once. Say $W$ is a $4 \times 4$ matrix, 16 numbers if we were to fine-tune it naively. Instead we parameterize a rank-1 LoRA ($r = 1$):

$$B = \begin{bmatrix} 2 \\ 1 \\ 0 \\ -1 \end{bmatrix}, \qquad A = \begin{bmatrix} 1 & 0 & -1 & 2 \end{bmatrix}$$

That’s 8 trainable numbers in total. The product $BA$ is an outer product:

$$\Delta W = B \cdot A = \begin{bmatrix} 2 & 0 & -2 & 4 \\ 1 & 0 & -1 & 2 \\ 0 & 0 & 0 & 0 \\ -1 & 0 & 1 & -2 \end{bmatrix}$$

16 numbers out, the same shape as $W$, but every row is a scalar multiple of $A$. The scalar for row $i$ is just $B_i$. That’s what rank 1 means: every row lies on the same line through the origin, so the row space is one-dimensional. 8 parameters encode 16 values, but those 16 values are tied together, so you can’t set them independently.
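The same example in NumPy, if you want to check the outer product and the rank yourself:

```python
import numpy as np

B = np.array([[2], [1], [0], [-1]])  # shape (4, 1)
A = np.array([[1, 0, -1, 2]])        # shape (1, 4)

delta_W = B @ A                      # outer product, shape (4, 4)
print(delta_W)

# Every row i is B[i] times the single row of A.
rows_are_multiples = all(
    np.array_equal(delta_W[i], B[i, 0] * A[0]) for i in range(4)
)
print(rows_are_multiples)                # True
print(np.linalg.matrix_rank(delta_W))    # 1
```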

Crank $r$ up and you give the LoRA more independent directions to move along. With $r = 2$ you get the sum of two such rank-1 matrices; with $r = 24$ (the rank I used for the CosyVoice run) you get a sum of 24. Each extra unit of rank costs $(d + k)$ more parameters; in return you get more expressive capacity.
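The "sum of rank-1 matrices" reading can be checked directly: column $j$ of $B$ times row $j$ of $A$ gives one rank-1 piece, and adding the pieces up reproduces $BA$. A small sketch with random factors (the sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 4, 4, 2
B = rng.normal(size=(d, r))
A = rng.normal(size=(r, k))

# One outer product per unit of rank: column j of B times row j of A.
pieces = [np.outer(B[:, j], A[j, :]) for j in range(r)]
sum_of_rank1 = sum(pieces)

print(np.allclose(sum_of_rank1, B @ A))  # True
rank = np.linalg.matrix_rank(B @ A)      # never more than r
print(rank)
```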

Why does it work?

The empirical observation behind LoRA: when you fine-tune a pretrained model, the update to its weights turns out to be highly structured. Even though, in principle, fine-tuning could change every single parameter, in practice the change behaves as if it has a much lower rank than the original weights. You’re not exploring all 16 million dimensions of the update space — you’re moving along a small handful of meaningful directions.

So you bake that low-rank assumption directly into the parameterization. It isn’t perfect for every task, but it’s surprisingly close to full fine-tuning quality for a tiny fraction of the parameters.
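If you have a weight matrix before and after full fine-tuning, one way to probe this is to look at the singular values of the difference and ask how much of its squared norm the top $r$ of them carry. The sketch below does that on synthetic data: the "update" is built as a rank-4 signal plus small noise, so it shows how the measurement works and says nothing about real models. `energy_in_top_r` is a name made up here.

```python
import numpy as np


def energy_in_top_r(delta_W: np.ndarray, r: int) -> float:
    """Fraction of the squared Frobenius norm held by the top-r singular values."""
    s = np.linalg.svd(delta_W, compute_uv=False)  # sorted in descending order
    return float(np.sum(s[:r] ** 2) / np.sum(s ** 2))


rng = np.random.default_rng(0)
d, k, true_rank = 64, 64, 4

# Synthetic stand-in for a fine-tuning update: low-rank signal plus small noise.
structured = rng.normal(size=(d, true_rank)) @ rng.normal(size=(true_rank, k))
structured += 0.01 * rng.normal(size=(d, k))

# For contrast: a dense random matrix with no low-rank structure.
unstructured = rng.normal(size=(d, k))

frac_structured = energy_in_top_r(structured, 4)
frac_unstructured = energy_in_top_r(unstructured, 4)
print(frac_structured, frac_unstructured)
```

On the structured matrix almost all of the energy sits in the first four singular values; on the dense random one it is spread across all 64.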

What this gets you in practice

The savings stack up:

Memory. You only need gradients and optimizer state for the LoRA matrices, not the whole model. On a half-billion-parameter model with r = 24, that’s the difference between needing 40+ GB of VRAM and fitting comfortably on a 24 GB consumer card.
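Training speed. No weight gradients or optimizer updates for the frozen weights means faster steps, although the backward pass still has to flow through the frozen layers to reach the adapters.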
Tiny output files. A LoRA adapter for a 500M model is around 50 MB instead of 2 GB. You can share them, swap them, and stack several of them on the same base model.
Plugin-style fine-tuning. Because the original weights are untouched, you can keep one base model in memory and swap LoRAs in and out depending on the task — different voices, different domains, different languages.
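Two of those points are easy to make concrete. The sketch below first redoes the file-size arithmetic, assuming 4 bytes per value (fp32; a different dtype would change the numbers), and then shows one frozen base weight being shared by two adapters. The adapter names and sizes are hypothetical.

```python
import numpy as np

# Storage, using the parameter counts quoted in this post and assuming
# 4 bytes per value (fp32).
BYTES_PER_PARAM = 4
adapter_mb = 13_200_000 * BYTES_PER_PARAM / 1e6  # about 52.8 MB
base_gb = 500_000_000 * BYTES_PER_PARAM / 1e9    # 2.0 GB

# Swapping: one frozen base weight shared by two made-up adapters.
rng = np.random.default_rng(0)
d, k, r = 8, 8, 2
W = rng.normal(size=(d, k))
W_original = W.copy()
adapters = {
    "voice_a": (rng.normal(size=(d, r)), rng.normal(size=(r, k))),
    "voice_b": (rng.normal(size=(d, r)), rng.normal(size=(r, k))),
}


def run(W, adapter, x):
    B, A = adapter
    return W @ x + B @ (A @ x)  # the base weight is read, never written


x = rng.normal(size=k)
y_a = run(W, adapters["voice_a"], x)
y_b = run(W, adapters["voice_b"], x)

print(adapter_mb, base_gb, np.array_equal(W, W_original))
```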

The cost is real but usually small: you’re giving up some capacity. There are tasks where full fine-tuning meaningfully outperforms LoRA, especially when the target distribution is far from what the model was originally pretrained on. For most adaptation tasks — including teaching a multilingual TTS model a new language — the trade looks great.

What I used for CosyVoice 3

For the Norwegian Bokmål LoRA on Fun-CosyVoice3-0.5B-2512:

Target: the Qwen2-0.5B language-model frontend. The flow-matcher decoder downstream is left untouched.
Rank: r = 24.
Coverage: applied to all 24 transformer blocks of the LLM.
Trainable parameters: ~13.2 million out of ~500 million total — about 2.6 % of the full model.

That’s the part of the model that learns to map Norwegian text to semantic speech tokens. The decoder already knows how to render those tokens into audio; what needed teaching was the upstream “how does Norwegian sound.”
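The ~13.2 million figure can be sanity-checked with the $r \cdot (d + k)$ formula. The sketch below rests on two assumptions that are not stated above: the layer dimensions are taken from the public Qwen2-0.5B configuration, and the adapters are assumed to sit on all seven linear projections in each block. Treat it as one plausible way to arrive at the number, not as a description of the actual training setup.

```python
# Assumed Qwen2-0.5B dimensions (public model config, not from this post).
hidden = 896
intermediate = 4864
kv_dim = 128     # 2 key/value heads x 64 dims per head
num_blocks = 24
r = 24

# Assumed target modules: (in_features, out_features) per linear projection.
projections = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, intermediate),
    "up_proj": (hidden, intermediate),
    "down_proj": (intermediate, hidden),
}

# Each adapted projection adds r * (in + out) trainable numbers.
per_block = sum(r * (n_in + n_out) for n_in, n_out in projections.values())
total = per_block * num_blocks

print(per_block, total)  # 549888 13197312
```

Under those assumptions the total comes out at about 13.2 million, in line with the count quoted above.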