
# Fine-tuning the Model

It is recommended to use a CPU with high single-core performance for fine-tuning; otherwise you may hit a CPU bottleneck (cause not yet identified; PRs to fix it are welcome).
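If you suspect you're hitting it, watch per-core utilization while a run is in progress; one busy core with an underused GPU is the signature. A minimal check, assuming a Linux machine with the sysstat package installed (not something this repo ships):

```bash
# Per-core CPU utilization, refreshed every second.
# One core pinned near 100% while the GPU sits idle points to this bottleneck.
mpstat -P ALL 1
```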

## Before You Begin: Environment Setup

It’s simple, don’t worry.

```bash
git clone https://github.com/qqqqqf-q/Qing-Digital-Self.git --depth 1
cd Qing-Digital-Self
```

Create a virtual environment:

```bash
python3 -m venv venv
```

Activate the virtual environment:

- On Linux/macOS:

  ```bash
  source venv/bin/activate
  ```

- On Windows:

  ```bash
  .\venv\Scripts\activate
  ```
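To confirm the venv is actually active, check which interpreter the shell resolves; the path should point inside the venv directory (a quick sanity check, nothing repo-specific):

```bash
# Expect a path ending in venv/bin/python (use `where python` on Windows)
which python
```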

Install dependencies:

```bash
pip install -r requirements.txt
```

PS: I spent a very long time debugging dependency issues at this step, and I still don't know why I ran into so many weird problems. But this requirements file is my own tested version, so it should be stable.
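Once the install finishes, a quick check that PyTorch can see the GPU can save a failed run later; this is a minimal sketch, assuming requirements.txt pulls in a CUDA-enabled torch:

```bash
# Prints the torch version and whether a CUDA device is visible
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```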


If you need the matching Unsloth + Torch build, Unsloth provides an auto-install script; run the following:

```bash
wget -qO- https://raw.githubusercontent.com/unslothai/unsloth/main/unsloth/_auto_install.py | python -
```

It will output a pip command — copy it and run it in your shell. For example:

```bash
pip install --upgrade pip && pip install "unsloth[cu126-ampere-torch270] @ git+https://github.com/unslothai/unsloth.git"
```
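After that install, it's worth confirming the package imports cleanly before launching a long job (a plain import check, nothing repo-specific):

```bash
# Fails fast here instead of partway into a training run
python -c "import unsloth; print('unsloth OK')"
```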

## Now the Actual Fine-tuning

For a quick test you can leave all parameters empty, since defaults are provided. Note that by default it seems to load with 8-bit quantization (this still needs fixing).

Run the fine-tuning script:

```bash
python run_finetune.py
```
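If you'd rather not rely on that default, `--load_precision` lets you pin the quantization explicitly (int8 / int4 / fp16; see the table below). A sketch of a 4-bit QLoRA-style load, with illustrative values:

```bash
python run_finetune.py \
  --use_qlora true \
  --load_precision int4 \
  --data_path training_data.jsonl
```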
| Parameter | Type | Default | Description |
|---|---|---|---|
| `--repo_id` | str | `'Qwen/Qwen3-30B-A3B-Instruct-2507'` | HF repository ID |
| `--local_dir` | str | `'qwen3-30b-a3b-instruct'` | Local model directory |
| `--use_unsloth` | str | `'false'` | Whether to use Unsloth |
| `--use_qlora` | str | `'true'` | Whether to use QLoRA |
| `--data_path` | str | `'training_data.jsonl'` | Training data path |
| `--eval_data_path` | str | None | Evaluation data path |
| `--max_samples` | str | None | Maximum number of training samples |
| `--max_eval_samples` | str | None | Maximum number of evaluation samples |
| `--model_max_length` | str | `'2048'` | Maximum sequence length |
| `--output_dir` | str | `'finetune/models/qwen3-30b-a3b-qlora'` | Output directory |
| `--seed` | str | `'42'` | Random seed |
| `--per_device_train_batch_size` | str | `'1'` | Per-device training batch size |
| `--per_device_eval_batch_size` | str | `'1'` | Per-device evaluation batch size |
| `--gradient_accumulation_steps` | str | `'16'` | Gradient accumulation steps |
| `--learning_rate` | str | `'2e-4'` | Learning rate |
| `--num_train_epochs` | str | `'3'` | Number of training epochs |
| `--max_steps` | str | `'-1'` | Maximum steps (-1 means unlimited) |
| `--lora_r` | str | `'16'` | LoRA rank |
| `--lora_alpha` | str | `'32'` | LoRA alpha |
| `--lora_dropout` | str | `'0.05'` | LoRA dropout rate |
| `--target_modules` | str | (too long; check the file) | LoRA target modules |
| `--weight_decay` | str | `'0.0'` | Weight decay |
| `--moe_enable` | str | `'false'` | Whether to enable MoE injection logic |
| `--moe_lora_scope` | str | `'expert_only'` | LoRA injection scope |
| `--moe_expert_patterns` | str | (too long to include here; check the file) | Expert linear layer patterns |
| `--moe_router_patterns` | str | (markdown would mangle it; check the file) | Router/gating linear layer patterns |
| `--moe_max_experts_lora` | str | `'-1'` | Max number of LoRA experts per layer |
| `--moe_dry_run` | str | `'false'` | Whether to do a dry run only |
| `--load_precision` | str | `'fp16'` | Model load precision: int8 / int4 / fp16 |
| `--logging_steps` | str | `'1'` | Logging interval (steps) |
| `--eval_steps` | str | `'50'` | Evaluation interval (steps) |
| `--save_steps` | str | `'200'` | Model save interval (steps) |
| `--save_total_limit` | str | `'2'` | Maximum number of saved checkpoints |
| `--warmup_ratio` | str | `'0.05'` | Learning rate warmup ratio |
| `--lr_scheduler_type` | str | `'cosine'` | Learning rate scheduler type |
| `--resume_from_checkpoint` | str | None | Path to a checkpoint to resume training from |
| `--no-gradient_checkpointing` | flag | False | Disable gradient checkpointing (enabled by passing the flag) |
| `--no-merge_and_save` | flag | False | Do not merge and save the model (enabled by passing the flag) |
| `--fp16` | str | `'true'` | Whether to use fp16 |
| `--optim` | str | `'adamw_torch_fused'` | Optimizer name |
| `--dataloader_pin_memory` | str | `'false'` | Whether to pin DataLoader memory |
| `--dataloader_num_workers` | str | `'0'` | Number of DataLoader workers |
| `--dataloader_prefetch_factor` | str | `'2'` | DataLoader prefetch factor |
| `--use_flash_attention_2` | str | `'false'` | Use FlashAttention2 (no effect with Unsloth) |
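Before committing to a long run, a capped smoke test is a cheap way to validate the data path and memory headroom; this uses only the flags documented above, with illustrative values:

```bash
# Train on at most 100 examples for 20 steps, logging every step
python run_finetune.py \
  --data_path training_data.jsonl \
  --max_samples 100 \
  --max_steps 20 \
  --logging_steps 1
```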

The parameters are still fairly complex; if in doubt, it's best to consult an AI for help. Below is an example of fine-tuning qwen3-8b-base on an RTX 4090:

```bash
python3 run_finetune.py \
  --output_dir /root/autodl-fs/qwen3-8b-qing-v4 \
  --local_dir qwen3-8b-base \
  --data_path ./training_data_ruozhi.jsonl \
  --eval_data_path ./training_data_ruozhi_eval.jsonl \
  --use_qlora true \
  --lora_dropout 0.05 \
  --num_train_epochs 8 \
  --per_device_train_batch_size 4 \
  --per_device_eval_batch_size 4 \
  --gradient_accumulation_steps 8 \
  --learning_rate 2e-5 \
  --lr_scheduler_type cosine \
  --logging_steps 5 \
  --eval_steps 40 \
  --save_steps 200 \
  --warmup_ratio 0.05 \
  --dataloader_num_workers 16 \
  --fp16 true \
  --use_unsloth true \
  --no-gradient_checkpointing \
  --dataloader_prefetch_factor 4
```
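If a long run dies partway through (OOM, a rented GPU getting reclaimed), checkpoints saved every `--save_steps` can be picked up with `--resume_from_checkpoint`; pass the same training flags as the original run. The checkpoint path below is hypothetical; use a real directory from your `--output_dir`:

```bash
# checkpoint-400 is a placeholder name; list your output_dir to find the real one
python3 run_finetune.py \
  --output_dir /root/autodl-fs/qwen3-8b-qing-v4 \
  --resume_from_checkpoint /root/autodl-fs/qwen3-8b-qing-v4/checkpoint-400
```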

## Evaluation Set Not Working

- Check that `--eval_data_path` is correct (a quick file check is sketched below).
- Ensure the evaluation data format matches the training data.
- Look for console output saying "no evaluation data path provided".
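Assuming the eval file is JSONL like the training data (one JSON object per line), two commands rule out most path and format problems; the filename is taken from the example above:

```bash
# Confirm the file exists and count how many examples it holds
wc -l ./training_data_ruozhi_eval.jsonl

# Confirm the first line parses as valid JSON
head -n 1 ./training_data_ruozhi_eval.jsonl | python -m json.tool
```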

## GPU Out of Memory

- Reduce `--per_device_eval_batch_size`.
- Reduce `--max_eval_samples`.
- Increase the `--eval_steps` interval (all three are combined in the sketch below).
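The three mitigations stack, and watching GPU memory while the run starts confirms whether they helped. Values here are illustrative:

```bash
# Apply all three mitigations at once
python run_finetune.py \
  --per_device_eval_batch_size 1 \
  --max_eval_samples 200 \
  --eval_steps 100

# In another shell: GPU memory usage, refreshed every second
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1
```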