STEP 01
Collect conversation pairs
Each training example is one JSON object on its own line, which makes the file JSONL. The schema follows the chat-messages format used by OpenAI and Hugging Face: a messages array of turns, each with a role and content.
// One example per line
{"messages":[{"role":"user","content":"What is LoRA?"},{"role":"assistant","content":"Low-Rank Adaptation..."}]}
Aim for 100–1000+ high-quality pairs. Diversity matters more than volume.
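A quick check before training can catch malformed lines early. The sketch below is one way to serialize and validate examples in the messages format described above; the helper names (validate_example, to_jsonl_line) and the exact rules (allowed roles, assistant turn last) are assumptions for illustration, not requirements of any library.

```python
import json

VALID_ROLES = {"system", "user", "assistant"}  # assumed set of allowed roles


def validate_example(line: str) -> list:
    """Parse one JSONL line and check it has the messages/role/content shape."""
    record = json.loads(line)  # raises json.JSONDecodeError on malformed JSON
    messages = record["messages"]
    if not isinstance(messages, list) or not messages:
        raise ValueError("'messages' must be a non-empty list")
    for turn in messages:
        if turn.get("role") not in VALID_ROLES:
            raise ValueError(f"unexpected role: {turn.get('role')!r}")
        content = turn.get("content")
        if not isinstance(content, str) or not content.strip():
            raise ValueError("each turn needs non-empty string content")
    if messages[-1]["role"] != "assistant":
        raise ValueError("the last turn should be the assistant reply")
    return messages


def to_jsonl_line(messages: list) -> str:
    """Serialize one example onto a single line (json.dumps escapes newlines)."""
    return json.dumps({"messages": messages}, ensure_ascii=False)


example = [
    {"role": "user", "content": "What is LoRA?"},
    {"role": "assistant", "content": "Low-Rank Adaptation..."},
]
line = to_jsonl_line(example)
validate_example(line)  # passes silently; raises on a bad example
```

Running validate_example over every line of the file and reporting the failing line numbers is usually enough to keep a dataset clean.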
STEP 02
LoRA fine-tune in PyTorch
LoRA (Low-Rank Adaptation) freezes the base weights and trains small adapter matrices. You can fine-tune a 1–3B parameter model on a single consumer GPU.
# Using PEFT + Transformers
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B")
config = LoraConfig(r=8, lora_alpha=16,
    target_modules=["q_proj","v_proj"],
    task_type="CAUSAL_LM")
model = get_peft_model(model, config)
# ... train on your .jsonl ...
model.save_pretrained("./adapter")
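To see why the adapter is small, it helps to count its parameters. For a linear layer with input size d_in and output size d_out, LoRA adds two matrices of rank r, for r * (d_in + d_out) extra weights. The sketch below applies that formula to q_proj and v_proj; the layer shapes are assumptions about Qwen2.5-1.5B and should be checked against the model's config.json.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Weights LoRA adds to one linear layer: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


# Assumed shapes for Qwen2.5-1.5B; verify against the model's config.json.
HIDDEN = 1536   # hidden_size, also the q_proj output size
KV_DIM = 256    # v_proj output size (num_key_value_heads * head_dim)
LAYERS = 28     # num_hidden_layers
R = 8           # rank, matching r=8 in the LoraConfig

per_layer = lora_params(HIDDEN, HIDDEN, R) + lora_params(HIDDEN, KV_DIM, R)
total = per_layer * LAYERS
print(f"{total:,} trainable adapter parameters")  # prints 1,089,536 under these assumptions
```

On a real PEFT model, model.print_trainable_parameters() reports the actual figure, which is the number to trust.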
STEP 03
Merge & convert to GGUF
GGUF (GPT-Generated Unified Format) is llama.cpp's binary model format: a single, mmap-friendly file that holds the weights (optionally quantized), the tokenizer, and metadata. Merge the LoRA adapter into the base model, then convert.
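# Merge adapter into base
model = model.merge_and_unload()
model.save_pretrained("./merged")
# The converter also needs the tokenizer files in ./merged
from transformers import AutoTokenizer
AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B").save_pretrained("./merged")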
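# Convert with llama.cpp (the converter has no q4_k_m outtype, so export f16 first)
$ python convert_hf_to_gguf.py ./merged \
--outfile model-f16.gguf --outtype f16
# Then quantize with llama.cpp's llama-quantize tool
$ llama-quantize model-f16.gguf model.gguf Q4_K_M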
q4_k_m ≈ 4-bit, balanced size/quality. Try q8_0 for higher fidelity.
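The size side of that trade-off can be estimated from bits per weight. The sketch below is back-of-the-envelope arithmetic only: the q8_0 figure follows from its block layout, while the q4_k_m figure and the 1.5B parameter count are rough assumptions, and real files also carry metadata and some tensors stored at other precisions.

```python
# Approximate bits per weight. q8_0 stores 32 int8 values plus one fp16 scale
# per block (34 bytes / 32 weights = 8.5 bits). The q4_k_m value is a rough
# average and an assumption here, since that scheme mixes types per tensor.
BITS_PER_WEIGHT = {"f16": 16.0, "q8_0": 8.5, "q4_k_m": 4.85}


def estimate_size_gb(n_params: float, outtype: str) -> float:
    """Estimated file size in GB (10^9 bytes), ignoring metadata overhead."""
    return n_params * BITS_PER_WEIGHT[outtype] / 8 / 1e9


N_PARAMS = 1.5e9  # assumed round figure for a 1.5B-parameter model

for outtype in BITS_PER_WEIGHT:
    print(f"{outtype:>7}: ~{estimate_size_gb(N_PARAMS, outtype):.2f} GB")
```

Under these assumptions the 4-bit file is roughly a third the size of the f16 one, which is why it is a common default for local inference.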
STEP 04
Serve with Ollama Modelfile
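A Modelfile tells Ollama how to load your GGUF: base file, chat template, system prompt, sampling parameters. The TEMPLATE and stop token must match the chat format used during fine-tuning; the <|system|>/<|user|>/<|assistant|> tags below are one such format, while Qwen2.5's own tokenizer template uses ChatML markers (<|im_start|>, <|im_end|>).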
# Modelfile
FROM ./model.gguf
TEMPLATE """{{ if .System }}<|system|>{{ .System }}
{{ end }}<|user|>{{ .Prompt }}
<|assistant|>"""
SYSTEM "You are a helpful assistant."
PARAMETER temperature 0.7
PARAMETER stop "<|user|>"
# Then:
$ ollama create my-slm -f Modelfile
$ ollama run my-slm
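Once the model is created, it can also be queried from code through Ollama's local HTTP API. The sketch below uses only the standard library and assumes the server is running at its default address with the model name my-slm from the commands above; the helper names build_chat_request and chat are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default local address


def build_chat_request(model: str, user_text: str) -> urllib.request.Request:
    """Build a non-streaming chat request for a single user message."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "stream": False,  # ask for one JSON response instead of a stream
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model: str, user_text: str) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(model, user_text)) as resp:
        return json.loads(resp.read())["message"]["content"]
```

With the server running, chat("my-slm", "What is LoRA?") should return the model's answer as a string; the SYSTEM prompt and PARAMETER values from the Modelfile apply unless the request overrides them.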