Compile llama.cpp

The three steps that follow (conversion, quantization, and testing) all require a compiled llama.cpp.

bash
git clone https://github.com/ggerganov/llama.cpp --depth 1
cd llama.cpp
mkdir build && cd build
cmake .. -DLLAMA_BUILD_EXAMPLES=ON -DLLAMA_NATIVE=ON
cmake --build . --config Release
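
Once the build finishes, it can help to confirm that the binaries used in the later steps actually exist. A minimal sketch (the helper name check_llama_build is mine; the binary names are the ones used below):

```shell
# check_llama_build: report whether the llama.cpp binaries used later in
# this guide are present and executable in a given build directory.
# (Helper name is hypothetical; extend the binary list as needed.)
check_llama_build() {
  build_dir="$1"
  for bin in llama-quantize llama-run; do
    if [ -x "$build_dir/bin/$bin" ]; then
      echo "found $bin"
    else
      echo "missing $bin"
    fi
  done
}

# Usage (from inside the llama.cpp checkout):
#   check_llama_build build
```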

Convert HuggingFace Weights to GGUF

Command format:

bash
python3 ./convert_hf_to_gguf.py <HF_model_path> --outfile <output_GGUF_path> --outtype <precision_type>
  • <HF_model_path>: Path to the HuggingFace format model directory (usually after fine-tuning or downloading).
  • <output_GGUF_path>: Path where the converted .gguf model will be saved.
  • <precision_type>: Precision type — 'f32', 'f16', 'bf16', 'q8_0', 'tq1_0', 'tq2_0', 'auto'.

Example:

bash
python3 convert_hf_to_gguf.py /root/autodl-tmp/finetune/models/qwen3-8b-qlora/merged --outfile /root/autodl-fs/qwen3-8b-fp16-agent.gguf --outtype f16
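
A quick sanity check on the converted file: every GGUF file begins with the 4-byte ASCII magic "GGUF", so a one-line check can catch a truncated or failed conversion (the helper name is_gguf is mine):

```shell
# is_gguf: succeed if the file begins with the ASCII magic "GGUF",
# the 4-byte signature at the start of every GGUF file.
is_gguf() {
  [ "$(head -c 4 "$1")" = "GGUF" ]
}

# Usage:
#   is_gguf /root/autodl-fs/qwen3-8b-fp16-agent.gguf && echo "looks like GGUF"
```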

Quantize the Model

Command format:

bash
./build/bin/llama-quantize <input_GGUF_path> <output_GGUF_path> <quantization_level>
  • <input_GGUF_path>: Path to the unquantized .gguf file.
  • <output_GGUF_path>: Path where the quantized .gguf file will be saved.
  • <quantization_level>: The quantization scheme, e.g. Q4_0, Q4_K_M, or Q8_0, chosen to balance model size, inference speed, and output quality for your hardware.

Example:

bash
./build/bin/llama-quantize \
  /root/autodl-fs/qwen3-8b-fp16-agent.gguf \
  /root/autodl-fs/qwen3-8b-q8_0-agent.gguf \
  Q8_0
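
To see what quantization bought you, compare the file sizes before and after. A small sketch (the helper name compare_sizes is mine; the paths in the usage comment are the example files above):

```shell
# compare_sizes: print the byte size of each file given, so the
# unquantized and quantized GGUF files can be compared side by side.
compare_sizes() {
  for f in "$@"; do
    printf '%s\t%s\n' "$(wc -c < "$f" | tr -d ' ')" "$f"
  done
}

# Usage:
#   compare_sizes /root/autodl-fs/qwen3-8b-fp16-agent.gguf \
#                 /root/autodl-fs/qwen3-8b-q8_0-agent.gguf
```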

Run a Model Test

Command format:

bash
./build/bin/llama-run <GGUF_model_path>
  • <GGUF_model_path>: Path to the GGUF model you want to test (either original or quantized).

Example:

bash
./build/bin/llama-run /root/autodl-fs/qwen3-8b-fp16-agent.gguf

High-Speed File Download from the Server

You can download directly from your server provider's persistent storage without powering on the instance, which avoids paying for compute time.

Alternatively, download over SFTP with lftp:

Command format:

bash
lftp -u {username},{password} -p {port} sftp://{server_address} -e "set xfer:clobber true; pget -n {threads} {server_file_path} -o {local_file_name_or_path}; bye"
  • pget: Downloads the file in parallel segments.
  • -n: Number of parallel connections (64+ is recommended; 256 can be faster still on a fast link).

Example:

bash
lftp -u root,askdjiwhakjd -p 27391 sftp://yourserver.com -e "set xfer:clobber true; pget -n 256 /root/autodl-fs/qwen3-8b-fp16-agent.gguf -o qwen3-8b-fp16-agent.gguf; bye"
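
After a large parallel download, it is worth checking integrity: compute a SHA-256 digest on the server, then compare it against the local copy. A minimal sketch (the helper name verify_sha256 is mine; sha256sum is the standard coreutils tool):

```shell
# verify_sha256: compare a file against an expected SHA-256 digest and
# print "match" or "mismatch".
verify_sha256() {
  file="$1"
  expected="$2"
  actual=$(sha256sum "$file" | cut -d' ' -f1)
  if [ "$actual" = "$expected" ]; then
    echo "match"
  else
    echo "mismatch"
  fi
}

# Usage: run `sha256sum <file>` on the server first, then locally:
#   verify_sha256 qwen3-8b-fp16-agent.gguf <digest_from_server>
```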