Quantization: fitting a model into your GPU
Why 4-bit weights take up a quarter of the space of 16-bit ones, what differs between GPTQ, AWQ, and bitsandbytes, and how to read the trade-off between memory and quality.
Abstract
Quantization is the single most important technique for running large language models on modest hardware. By storing weights in fewer bits (typically 8 or 4 instead of 16), it cuts the memory footprint roughly in proportion, at some cost in accuracy. This article explains the core idea, contrasts the main post-training methods a practitioner meets (bitsandbytes, GPTQ, AWQ), and gives a rule of thumb for the memory a quantized model needs. We frame quantization as an explicit quality-versus-memory trade-off, measurable through perplexity, rather than a free lunch, and point to where the trade-off tends to break down at very low bit widths.
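As a rough illustration of the "roughly in proportion" claim, the following Python sketch estimates the storage needed for the weights alone. The function name `weight_memory_gib` and the 7-billion-parameter model size are illustrative assumptions, not the article's own rule of thumb; real quantized formats also store scales and zero points, and inference needs additional memory for activations and the KV cache, so treat the figures as a lower bound.

```python
def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GiB.

    Assumption: every parameter is stored at exactly `bits_per_weight`
    bits, with no overhead for quantization scales or zero points.
    """
    total_bytes = n_params * bits_per_weight / 8  # 8 bits per byte
    return total_bytes / 2**30                    # bytes -> GiB


if __name__ == "__main__":
    n_params = 7e9  # hypothetical 7-billion-parameter model
    for bits in (16, 8, 4):
        print(f"{bits:>2}-bit weights: {weight_memory_gib(n_params, bits):.1f} GiB")
```

Under these assumptions the 4-bit figure is exactly one quarter of the 16-bit figure, which is the proportionality the abstract refers to.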
Sources