~/organiccode.net

A Norwegian voice, trained on a gaming PC

There's a hole in the Norwegian open-source TTS landscape, and I want to ship a game through it. So I fine-tuned CosyVoice 3 on ~458 hours of Bokmål.

Samples

Generated and reference audio clips for five voices: Atle Antonsen, Kari Nessa Nordtun, Håvard Grønlie, Jonas Gahr Støre, and Barack Obama.

Full release on Hugging Face

Why?

I’ve been looking around for Norwegian text-to-speech models, on and off, for a while. When I started making Learnloop, a set of learning games for my kids, I started looking properly. Having text read aloud is essential when you can’t read yet, and it turns out there is no modern open-source text-to-speech model for Norwegian. Paying voice actors for a hobby project is out. Sending money to big American tech companies for it doesn’t sit right either. This is low-hanging fruit we should already have! So, time to start training.

And that’s what I did. Today I pushed the first release, AlexKjes/cosyvoice3-norwegian-lora: a LoRA fine-tune of Fun-CosyVoice3-0.5B-2512 (Alibaba’s CosyVoice 3 base) on ~458 hours of Norwegian Bokmål. The released checkpoint is step 20,880, after about 50 hours of training on an RTX 3090.

Why CosyVoice 3

Before this I was running F5-TTS on the same stack. F5 is fantastic at English voice cloning and decent at Norwegian after fine-tuning, but it has one structural limitation that I kept running into: its prosody comes entirely from the reference clip. The model doesn’t understand what it’s saying. Give it a question with a monotone reference voice and you get a monotone question back. For game dialogue — where the same voice actor needs to deliver calm exposition, an excited turn, a quiet aside — that’s not enough.

CosyVoice 3 comes from a different lineage, Alibaba’s CosyVoice family, and is a much more interesting piece of architecture. It’s a two-stage model: a Qwen2-0.5B LLM frontend reads the text and emits semantic speech tokens, then a flow-matching decoder turns those tokens into audio. The upstream team’s pitch is that CV3 is “designed for zero-shot multilingual speech synthesis in the wild,” with state-of-the-art numbers on “content consistency, speaker similarity, and prosody naturalness.” That maps cleanly to the two things F5 couldn’t give me:

  • Real semantic prosody. Because the first stage is an actual language model, it intones questions as questions and puts stress on the right word in a sentence without needing the reference clip to demonstrate it. That alone is the difference between “a robotic narrator” and “something that sounds like dialogue.”
  • Prompt-directed synthesis. Upstream describes it as supporting “various instructions such as languages, dialects, emotions, speed, volume, etc.” In practice that means you can tell the model how to read a line — “si dette stille og ettertenksomt”, “med entusiasme” — and it listens. For interactive game characters that’s exactly the steering knob I want.

It also supports bi-streaming inference — text streaming in, audio streaming out — with a claimed end-to-end latency as low as 150 ms. I haven’t measured my own numbers yet, but even an order of magnitude off would still be the difference between a noticeable pause and something that feels like real speech.
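Measuring that number mostly comes down to timing the first chunk a streaming generator yields. Below is a minimal sketch, not tied to CosyVoice: `time_to_first_chunk` and `fake_stream` are hypothetical names, and the fake generator only stands in for a streaming TTS call.

```python
import time
from typing import Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def time_to_first_chunk(stream: Iterable[T]) -> Tuple[float, Iterator[T]]:
    """Return (seconds until the first chunk arrived, iterator over all chunks)."""
    it = iter(stream)
    start = time.perf_counter()
    first = next(it)  # blocks until the generator yields its first chunk
    latency = time.perf_counter() - start

    def replay() -> Iterator[T]:
        # Hand the already-consumed first chunk back, then the rest.
        yield first
        yield from it

    return latency, replay()


def fake_stream(n_chunks: int = 3, delay_s: float = 0.05):
    """Hypothetical stand-in for a streaming TTS generator."""
    for i in range(n_chunks):
        time.sleep(delay_s)
        yield f"chunk-{i}"


latency, chunks = time_to_first_chunk(fake_stream())
print(f"time to first chunk: {latency * 1000:.0f} ms")
print(list(chunks))
```

Wrapping the real inference generator the same way would give a first-chunk latency that can be compared against the claimed 150 ms, provided the model is already loaded and warmed up before the timer starts.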

Training: LoRA, not full fine-tuning

Even though CosyVoice 3 is “only” half a billion parameters by modern LLM standards, training all of them on a single RTX 3090 is a non-starter — the optimizer state alone wouldn’t fit. So this is a LoRA fine-tune: instead of updating the model’s weights directly, I’m training a small low-rank “delta” that gets added on top of the frozen original. About 13 million trainable parameters out of ~500 million — roughly 2.6 % of the model.
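To make the “delta on top of a frozen weight” idea concrete, here is a toy sketch in plain Python. The sizes, the `alpha` scaling factor, and the helper names are illustrative assumptions and say nothing about the actual training configuration; the point is only that the adapter path and the merged weight give the same output.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b, scale=1.0):
    """Element-wise a + scale * b."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


random.seed(0)
d_out, d_in, rank, alpha = 6, 8, 2, 4  # toy sizes, chosen for illustration

# Frozen pretrained weight W (d_out x d_in); never updated during training.
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
# Trainable factors: A (rank x d_in) starts random, B (d_out x rank) starts at
# zero, so the delta B @ A is zero before training and the model is unchanged.
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
B = [[0.0] * rank for _ in range(d_out)]


def lora_forward(x_col, W, A, B):
    """y = W x + (alpha / rank) * B (A x), with x as a column vector."""
    base = matmul(W, x_col)
    delta = matmul(B, matmul(A, x_col))
    return add(base, delta, scale=alpha / rank)


x = [[random.gauss(0, 1)] for _ in range(d_in)]
assert lora_forward(x, W, A, B) == matmul(W, x)  # zero delta at initialisation

# Pretend training moved B; merging the delta into W gives the same output.
B = [[random.gauss(0, 0.1) for _ in range(rank)] for _ in range(d_out)]
W_merged = add(W, matmul(B, A), scale=alpha / rank)
y_adapter = lora_forward(x, W, A, B)
y_merged = matmul(W_merged, x)
print(max(abs(p[0] - q[0]) for p, q in zip(y_adapter, y_merged)))  # float noise only

full, low_rank = d_out * d_in, rank * (d_in + d_out)
print(full, low_rank)  # 48 28
```

Only `A` and `B` would receive gradients, which is why the optimizer state shrinks along with the trainable parameter count.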

The LoRA targets the Qwen2-0.5B LLM frontend at rank 24, applied to all 24 transformer blocks. The flow-matching decoder downstream is left alone; it already knows how to render speech tokens into clean audio, and the part that needed teaching was the upstream “how does Norwegian sound.” Total wall time: ~50 hours on a single 3090.
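The “about 13 million” figure can be sanity-checked with a few lines of arithmetic. This is a back-of-the-envelope sketch under two assumptions that are not stated above: the layer shapes are taken from the public Qwen2-0.5B config, and adapters are attached to all seven linear projections in every block.

```python
# Assumed shapes, taken from the public Qwen2-0.5B config; not stated in the post.
HIDDEN = 896
MLP = 4864
KV_DIM = 2 * 64   # 2 key/value heads of dimension 64
N_BLOCKS = 24
RANK = 24

# (in_features, out_features) for each linear layer in one transformer block.
# Targeting all seven projections is an assumption made for this estimate.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """A is (rank x in), B is (out x rank): rank * (in + out) values in total."""
    return rank * (in_features + out_features)


per_block = sum(lora_params(i, o, RANK) for i, o in PROJECTIONS.values())
total = per_block * N_BLOCKS
print(per_block, total)                 # 549888 13197312
print(f"{total / 500e6:.1%} of ~500M")  # 2.6% of ~500M
```

Under these assumptions the count lands at roughly 13.2 million, which is consistent with the figure quoted above, though a different choice of target modules could produce a similar total.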

If you’ve never run into LoRA before, I wrote a short illustrated primer: What is LoRA, and why does everyone use it? Two diagrams, no math beyond matrix shapes.

The dataset

~458 hours of Norwegian Bokmål, from two open corpora published by the AI Lab at the National Library of Norway:

Source         Clips      Hours   License
NbAiLab/NST    ~219,000   ~540    Apache 2.0
NbAiLab/NPSC   ~32,000    ~140    CC-0

The data pipeline runs on the home cluster and has been chugging away for most of a year. Every clip goes through the same treatment: Demucs strips music and background noise, nb-whisper-large does the first transcription pass, pyannote 3.1 keeps only single-speaker segments, and a second pass through nb-wav2vec2-1b-bokmaal produces CTC labels that the trainer actually consumes. Length and word-confidence filters trim the long tail of clips that are too short, too noisy, or where the transcription disagrees with itself.
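The final filtering step could look roughly like the sketch below. Everything here is hypothetical: the `Clip` record, the field names, and all four thresholds are made up for illustration and are not the values the pipeline uses.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Clip:
    path: str
    duration_s: float
    word_confidences: List[float]  # one score per transcribed word, in [0, 1]


# Hypothetical thresholds, for illustration only.
MIN_DURATION_S = 1.0
MAX_DURATION_S = 30.0
MIN_MEAN_CONFIDENCE = 0.85
MIN_WORD_CONFIDENCE = 0.50


def keep(clip: Clip) -> bool:
    """Keep clips of reasonable length whose transcription looks trustworthy."""
    if not (MIN_DURATION_S <= clip.duration_s <= MAX_DURATION_S):
        return False
    if not clip.word_confidences:
        return False  # nothing was transcribed
    mean_conf = sum(clip.word_confidences) / len(clip.word_confidences)
    return (
        mean_conf >= MIN_MEAN_CONFIDENCE
        and min(clip.word_confidences) >= MIN_WORD_CONFIDENCE
    )


clips = [
    Clip("a.wav", 4.2, [0.97, 0.93, 0.99]),   # fine
    Clip("b.wav", 0.4, [0.98]),               # too short
    Clip("c.wav", 6.0, [0.95, 0.30, 0.96]),   # one very uncertain word
    Clip("d.wav", 45.0, [0.90, 0.92]),        # too long
]
kept = [c.path for c in clips if keep(c)]
print(kept)  # ['a.wav']
```

Checking the single worst word as well as the mean catches clips where one badly misheard word would otherwise be averaged away.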

NST is the close-mic backbone; NPSC adds parliamentary speech with more natural rhythm and turn-taking. Both are openly licensed, so the resulting model is unencumbered on the data side.

Try it

It’s live at https://huggingface.co/AlexKjes/cosyvoice3-norwegian-lora. The release is CC BY-NC 4.0 for now — I want to evaluate the model properly before relaxing the license.

Install and run

The LoRA is just a checkpoint that sits on top of the upstream CosyVoice 3 toolkit, so the setup is “clone CosyVoice, fetch the base model, load my weights on top.”

git clone --recursive https://github.com/FunAudioLLM/CosyVoice.git
cd CosyVoice
git submodule update --init --recursive

conda create -n cosyvoice -y python=3.10
conda activate cosyvoice
pip install -r requirements.txt

Then pull the base model and the Norwegian LoRA checkpoint:

from huggingface_hub import snapshot_download

snapshot_download(
    "FunAudioLLM/Fun-CosyVoice3-0.5B-2512",
    local_dir="pretrained_models/Fun-CosyVoice3-0.5B",
)
snapshot_download(
    "AlexKjes/cosyvoice3-norwegian-lora",
    local_dir="pretrained_models/cosyvoice3-norwegian-lora",
)

Load the base, merge the LoRA’s EMA weights into the Qwen2 frontend, and run zero-shot inference:

import torch
from cosyvoice.cli.cosyvoice import CosyVoice3

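# Load the frozen base model from the directory downloaded above.
cosy = CosyVoice3("pretrained_models/Fun-CosyVoice3-0.5B", fp16=True)

# weights_only=False unpickles arbitrary objects, so only load checkpoints you trust.
state = torch.load(
    "pretrained_models/cosyvoice3-norwegian-lora/model_20880_ema.pt",
    map_location="cpu",
    weights_only=False,
)
# Drop the training bookkeeping entries so only tensors reach load_state_dict.
state = {k: v for k, v in state.items() if k not in ("step", "epoch")}
# strict=False tolerates keys that do not line up exactly with the base LLM.
cosy.model.llm.load_state_dict(state, strict=False)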

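# ref_transcript (the text spoken in the reference clip) and ref_audio_16k
# (that clip as 16 kHz audio) are not defined above; supply your own.
audio_chunks = []
for chunk in cosy.inference_zero_shot(
    tts_text="Norsk talesyntese skal være tilgjengelig for alle.",
    prompt_text="You are a helpful assistant.<|endofprompt|>" + ref_transcript,
    prompt_speech_16k=ref_audio_16k,
):
    # Each yielded chunk carries one piece of the synthesized waveform.
    audio_chunks.append(chunk["tts_speech"])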

The "You are a helpful assistant.<|endofprompt|>" prefix on prompt_text is what tells the Qwen2 frontend it’s in inference mode. Without it the model drifts.
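Because the prefix is easy to forget, or to add twice, it can help to build the prompt in one place. A small sketch follows; `build_prompt_text` and `SYSTEM_PREFIX` are hypothetical names and not part of the CosyVoice toolkit.

```python
SYSTEM_PREFIX = "You are a helpful assistant.<|endofprompt|>"


def build_prompt_text(ref_transcript: str) -> str:
    """Prepend the prefix once, leaving already-prefixed transcripts alone."""
    ref_transcript = ref_transcript.strip()
    if ref_transcript.startswith(SYSTEM_PREFIX):
        return ref_transcript
    return SYSTEM_PREFIX + ref_transcript


print(build_prompt_text("Hei, og velkommen tilbake."))
```

The result can be passed as `prompt_text` in the call above, in place of the manual string concatenation.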

What’s next

  • Round-trip eval on the curated test set (TTS → nb-whisper-medium → WER). The numbers go into the same TensorBoard the loss curve is in, so I can see quality and convergence side by side.
  • Wire it into Learnloop and hear it in actual game context — the only test that ultimately matters.
  • Better coverage of Nynorsk and dialects. Both are underrepresented in the current dataset, the model knows it, and you can hear it. The data pipeline has room to grow.
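For the round-trip eval in the first item, the scoring half is just word error rate between the input text and the ASR transcript of the synthesized audio. Here is a minimal sketch using word-level edit distance; the hard-coded `transcript` stands in for real ASR output, and the only text normalization assumed is lowercasing and whitespace splitting.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


reference = "norsk talesyntese skal være tilgjengelig for alle"
transcript = "norsk talesyntese skal være tilgjengelige for alle"  # stand-in ASR output
print(f"{wer(reference, transcript):.3f}")  # 0.143
```

A real evaluation would also want punctuation stripped and numbers normalized before scoring, since otherwise formatting differences count as errors.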

Slowly, but it’s moving.