What is LoRA, and why does everyone use it?

A short, picture-led primer on Low-Rank Adaptation — the trick that makes fine-tuning huge models tractable on a single GPU.

lora ml primer fine-tuning

You have a huge pretrained neural network. Hundreds of millions of parameters. It’s almost right for your task — but not quite. You’d like to teach it something new: a new language, a new style, a new domain. And ideally without renting a rack of A100s.

That’s the problem LoRA solves. LoRA is short for Low-Rank Adaptation. It was introduced in a 2021 Microsoft paper and is now the default approach for fine-tuning large models in basically every open-source toolkit. The Norwegian voice for CosyVoice 3 I just released? Trained with LoRA on a single RTX 3090. The full model has ~500 million parameters; the part I actually trained is ~13 million. That ratio is why this kind of work is possible at all on a single consumer GPU.

The core idea

Skip to the punchline: instead of changing the model’s weights directly, LoRA learns a small addition to those weights, expressed as the product of two skinny matrices.

[Figure: frozen W (d × k) + trainable B (d × r) · A (r × k) = effective weight W + BA]

If a weight matrix $W$ somewhere inside the model has shape $d \times k$, say $4096 \times 4096$, the naive thing to do during fine-tuning is to update all $d \cdot k$ of those numbers. That’s about 16.8 million parameters for that one matrix, and a real model has several such matrices in each of dozens of layers.

LoRA parameterizes the change as a product of two thin matrices:

$$\Delta W = B \cdot A, \qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k}$$

where the rank $r$ is small, typically 8, 16, or 24.

When you multiply $B$ and $A$ together you get something the same shape as $W$, but parameterized by only $r \cdot (d + k)$ numbers instead of $d \cdot k$. For $r = 8$ on a $4096 \times 4096$ matrix that’s 65,536 numbers instead of about 16.8 million: 256× fewer.
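The arithmetic is easy to check by hand or in a few lines of Python. This is only a sketch of the counting formula above; the function names are made up for illustration.

```python
def full_param_count(d: int, k: int) -> int:
    """Numbers updated when fine-tuning a d x k weight matrix directly."""
    return d * k


def lora_param_count(d: int, k: int, r: int) -> int:
    """Numbers in B (d x r) plus A (r x k)."""
    return r * (d + k)


d, k, r = 4096, 4096, 8
full = full_param_count(d, k)
lora = lora_param_count(d, k, r)
print(full, lora, full // lora)  # 16777216 65536 256
```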

During training $W$ stays frozen and only $A$ and $B$ receive gradient updates. During inference you can either keep them separate as $y = Wx + BAx$, or merge them once into a new effective weight $W' = W + BA$ and run the model normally.
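The two inference options give the same output, which a small NumPy sketch can confirm. The shapes and random values here are arbitrary stand-ins chosen for illustration, not anything from a real model.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 6, 5, 2  # toy sizes, chosen only for the demo

W = rng.normal(size=(d, k))  # frozen pretrained weight
B = rng.normal(size=(d, r))  # LoRA factor, d x r
A = rng.normal(size=(r, k))  # LoRA factor, r x k
x = rng.normal(size=k)       # one input vector

# Option 1: keep the adapter separate, y = Wx + BAx.
# Computing B @ (A @ x) never forms the full d x k product BA.
y_separate = W @ x + B @ (A @ x)

# Option 2: merge once into W' = W + BA, then run as usual.
W_merged = W + B @ A
y_merged = W_merged @ x

outputs_match = np.allclose(y_separate, y_merged)
print(outputs_match)
```

Keeping the factors separate makes it cheap to swap adapters; merging removes the extra matrix multiplies at inference time.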

A worked example: rank-1 on a tiny matrix

The numbers above are easier to believe if you watch them unroll once. Say $W$ is a $4 \times 4$ matrix: 16 numbers if we were to fine-tune it naively. Instead we parameterize a rank-1 LoRA ($r = 1$):

$$B = \begin{bmatrix} 2 \\ 1 \\ 0 \\ -1 \end{bmatrix}, \qquad A = \begin{bmatrix} 1 & 0 & -1 & 2 \end{bmatrix}$$

That’s 8 trainable numbers in total. The product $BA$ is an outer product:

$$\Delta W = B \cdot A = \begin{bmatrix} 2 & 0 & -2 & 4 \\ 1 & 0 & -1 & 2 \\ 0 & 0 & 0 & 0 \\ -1 & 0 & 1 & -2 \end{bmatrix}$$

16 numbers out, the same shape as $W$, but every row is a scalar multiple of $A$. The scalar for row $i$ is just $B_i$. That’s what rank 1 means: every row of the matrix lies on the same line through the origin in row-space. 8 parameters encode 16 values, but those 16 values are tied together, so you can’t set them independently.
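The same worked example in NumPy, using the values of B and A given above. It is just a check of the outer product and its rank.

```python
import numpy as np

B = np.array([[2], [1], [0], [-1]])  # shape (4, 1)
A = np.array([[1, 0, -1, 2]])        # shape (1, 4)

delta_W = B @ A                      # outer product, shape (4, 4)
rank = np.linalg.matrix_rank(delta_W)

print(delta_W)
print(rank)  # 1

# Every row is the corresponding entry of B times the single row of A.
rows_are_multiples = all(
    np.array_equal(delta_W[i], B[i, 0] * A[0]) for i in range(4)
)
```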

Crank $r$ up and you give the LoRA more independent directions to move along. With $r = 2$ you get the sum of two such rank-1 matrices; with $r = 24$ (the rank I used for the CosyVoice run) you get the sum of 24. Each extra rank costs $(d + k)$ more parameters; in return you get more expressive capacity.
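Written out, with $b_i$ the $i$-th column of $B$ and $a_i^\top$ the $i$-th row of $A$ (notation introduced here only for illustration), the product splits into one outer product per rank:

$$BA = \sum_{i=1}^{r} b_i a_i^\top$$

Each term is a rank-1 matrix like the one in the worked example and carries $d + k$ parameters, which is where the $r \cdot (d + k)$ total comes from.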

Why does it work?

The empirical observation behind LoRA: when you fine-tune a pretrained model, the update to its weights turns out to be highly structured. Even though, in principle, fine-tuning could change every single parameter, in practice the change behaves as if it has a much lower rank than the original weights. You’re not exploring all 16 million dimensions of the update space — you’re moving along a small handful of meaningful directions.

So you bake that low-rank assumption directly into the parameterization. It isn’t perfect for every task, but it’s surprisingly close to full fine-tuning quality for a tiny fraction of the parameters.
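One way to look for this structure yourself, assuming you have the same weight matrix before and after fine-tuning, is to inspect the singular values of the difference. The sketch below uses synthetic matrices, not real checkpoints, and the update is deliberately built with rank 4, so it only illustrates the measurement and says nothing about real models.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, true_rank = 64, 48, 4  # toy sizes; the update is rank 4 by construction

W_before = rng.normal(size=(d, k))
delta = rng.normal(size=(d, true_rank)) @ rng.normal(size=(true_rank, k))
W_after = W_before + delta

# Singular values of the update, largest first.
singular_values = np.linalg.svd(W_after - W_before, compute_uv=False)

# Count the singular values that are not numerically zero.
tol = singular_values[0] * 1e-8
effective_rank = int(np.sum(singular_values > tol))
print(effective_rank)
```

On a real fine-tune the trailing singular values would not drop to exactly zero, so you would look at how quickly they decay rather than apply a hard threshold.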

What this gets you in practice

[Figure: full fine-tune, ~500M params, vs. LoRA (r = 24), ~13M params; about 2.6 % of the model is actually trained]

The savings stack up:

  • Memory. You only need gradients and optimizer state for the LoRA matrices, not the whole model. On a half-billion-parameter model with r = 24, that’s the difference between needing 40+ GB of VRAM and fitting comfortably on a 24 GB consumer card.
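  • Training speed. Fewer trainable parameters → fewer weight-gradient computations and optimizer updates → faster steps. The backward pass still runs through the frozen layers, so the speedup is smaller than the parameter ratio.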
  • Tiny output files. A LoRA adapter for a 500M model is around 50 MB instead of 2 GB. You can share them, swap them, or stack several of them on the same base model.
  • Plugin-style fine-tuning. Because the original weights are untouched, you can keep one base model in memory and swap LoRAs in and out depending on the task — different voices, different domains, different languages.
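The file sizes in the list follow directly from the parameter counts. A rough sketch, assuming 4 bytes per parameter (fp32 storage, which is my assumption here) and ignoring file-format overhead:

```python
BYTES_PER_PARAM = 4  # assumption: weights stored as 32-bit floats


def size_mb(num_params: int, bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Approximate size on disk in megabytes (10**6 bytes)."""
    return num_params * bytes_per_param / 1e6


base_mb = size_mb(500_000_000)    # full model, ~500M parameters
adapter_mb = size_mb(13_200_000)  # LoRA adapter, ~13.2M parameters

print(f"base model: {base_mb:.0f} MB, adapter: {adapter_mb:.1f} MB")
```

Storing in 16-bit precision would halve both numbers but leave the ratio unchanged.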

The cost is real but usually small: you’re giving up some capacity. There are tasks where full fine-tuning meaningfully outperforms LoRA, especially when the target distribution is far from what the model was originally pretrained on. For most adaptation tasks — including teaching a multilingual TTS model a new language — the trade looks great.

What I used for CosyVoice 3

For the Norwegian Bokmål LoRA on Fun-CosyVoice3-0.5B-2512:

  • Target: the Qwen2-0.5B language-model frontend. The flow-matcher decoder downstream is left untouched.
  • Rank: r = 24.
  • Coverage: applied to all 24 transformer blocks of the LLM.
  • Trainable parameters: ~13.2 million out of ~500 million total — about 2.6 % of the full model.

That’s the part of the model that learns to map Norwegian text to semantic speech tokens. The decoder already knows how to render those tokens into audio; what needed teaching was the upstream “how does Norwegian sound.”
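The ~13.2 million figure can be sanity-checked with the $r \cdot (d + k)$ formula. The sketch below rests on two assumptions that this post does not state. The first is the layer dimensions, taken from the published Qwen2-0.5B configuration as I understand it: hidden size 896, MLP size 4864, 2 key/value heads of size 64. The second is that LoRA is attached to all seven linear projections in each block.

```python
# Assumed Qwen2-0.5B dimensions (not stated in this post).
HIDDEN = 896      # model width
MLP = 4864        # feed-forward inner width
KV_DIM = 2 * 64   # 2 key/value heads of size 64 (grouped-query attention)
NUM_BLOCKS = 24   # transformer blocks, as listed above
RANK = 24         # LoRA rank, as listed above

# Assumed LoRA targets: (input width, output width) of each projection.
projections = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

# Each adapted matrix adds r * (d + k) trainable parameters.
per_block = sum(RANK * (d_in + d_out) for d_in, d_out in projections.values())
total = per_block * NUM_BLOCKS

print(per_block, total)  # 549888 13197312
```

Under those assumptions the total lands at about 13.2 million, in line with the number above.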