
Quantization: fitting a model into your GPU

Why 4-bit weights take up a quarter of the space of 16-bit ones, what differs between GPTQ, AWQ and bitsandbytes, and how to read the trade-off between memory and quality.

Osservatorio Evolutivo · 3 min read

Abstract (EN)

Quantization is the single most important technique for running large language models on modest hardware. By storing weights in fewer bits (typically 8 or 4 instead of 16) it cuts memory footprint roughly in proportion, at some cost in accuracy. This article explains the core idea, contrasts the main post-training methods a practitioner meets (bitsandbytes, GPTQ, AWQ), and gives a rule of thumb for the memory a quantized model needs. We frame quantization as an explicit quality-versus-memory trade, measurable through perplexity, rather than a free lunch, and point to where the trade tends to break down at very low bit widths.
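The proportionality claim can be checked with simple arithmetic. The sketch below is a minimal illustration, not the article's own rule of thumb: the function name `weight_memory_gib` and the 7-billion-parameter example are assumptions chosen for illustration, and the estimate covers the weights only. Real quantized formats also store per-group scales and zero-points, and inference needs extra memory for activations and the KV cache, so actual usage will be somewhat higher.

```python
def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GiB.

    Hypothetical helper: ignores quantization metadata (scales,
    zero-points), activations, and the KV cache.
    """
    bytes_total = n_params * bits_per_weight / 8  # 8 bits per byte
    return bytes_total / 2**30                    # bytes -> GiB


# Assumed example: a model with 7 billion parameters.
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gib(7e9, bits):.1f} GiB")
# 16-bit: 13.0 GiB
#  8-bit: 6.5 GiB
#  4-bit: 3.3 GiB
```

Under these assumptions the 4-bit figure is exactly a quarter of the 16-bit one, which is the sense in which memory shrinks "roughly in proportion" to bit width.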

