gguf · llama.cpp · k-quants · local inference

GGUF and llama.cpp: the weight format for local inference

What GGUF is, why it replaced the older GGML format, how to read suffixes such as Q4_K_M, and why llama.cpp became the reference engine for running models on CPUs and on mixed CPU/GPU setups.

Osservatorio Evolutivo · 2 min read

Abstract (EN)

GGUF is the file format that made local inference portable. Born in the llama.cpp project as the successor to GGML, it packs a model's quantized weights together with its metadata and tokenizer into a single file that can be loaded on CPUs and on mixed CPU/GPU setups. This article explains what GGUF stores, how to decode quantization suffixes such as Q4_K_M, and why llama.cpp became the reference engine that tools like Ollama and LM Studio build on. The goal is to let readers pick the right GGUF variant for their hardware with confidence, reading the file name as a compact spec of size and quality.
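As a rough illustration of reading the file name as a compact spec, the sketch below splits a suffix such as Q4_K_M into its parts. It is not part of llama.cpp: the function name `parse_quant_suffix`, the regular expression, and the example file names are assumptions made for illustration, and only the most common suffix shapes (`Q4_K_M`, `Q6_K`, `Q8_0`) are covered. The bit count is the nominal width in the name, not the exact average bits per weight.

```python
import re

# Hypothetical helper, not a llama.cpp API.
# Covers common suffix shapes only: Q4_K_M, Q5_K_S, Q6_K, Q8_0, Q4_1.
_QUANT_RE = re.compile(
    r"Q(?P<bits>\d)_(?:(?P<k>K)(?:_(?P<size>[SML]))?|(?P<legacy>[01]))",
    re.IGNORECASE,
)

# In k-quant names the trailing letter marks the size/quality mix.
_SIZE_NAMES = {"S": "small", "M": "medium", "L": "large"}


def parse_quant_suffix(filename):
    """Return the quantization parts found in a GGUF file name, or None."""
    match = _QUANT_RE.search(filename)
    if match is None:
        return None
    size = match.group("size")
    return {
        "label": match.group(0).upper(),         # e.g. "Q4_K_M"
        "bits": int(match.group("bits")),        # nominal bit width
        "k_quant": match.group("k") is not None, # True for the K family
        "variant": _SIZE_NAMES[size.upper()] if size else None,
    }


if __name__ == "__main__":
    # Example file names are made up for the demonstration.
    for name in ("model-7b.Q4_K_M.gguf", "model-7b.Q8_0.gguf", "model-7b.f16.gguf"):
        print(name, "->", parse_quant_suffix(name))
```

Lower nominal bit widths generally mean smaller files and lower fidelity, so a helper like this can serve as a first filter when matching a variant to available memory.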
