Quantization

Quantization compresses a large language model by storing its weights at lower numerical precision, typically dropping from 16-bit floats to 8-bit or 4-bit integers.

Bottom line
A quantized model uses 2x to 4x less memory and usually runs faster, and most modern methods lose only a few percent of benchmark quality. It is what brings 70B models within reach of one or two consumer GPUs instead of a multi-GPU server.
What quantization actually is

Quantization is the process of reducing the numerical precision of a model's weights, and sometimes its activations, so the model takes less memory and computes faster. A trained large language model ships in 16-bit floating point (FP16 or BF16) by default. Each weight is two bytes. A 70B parameter model in FP16 therefore weighs roughly 140 GB on disk and needs that much VRAM to load, well past what any single consumer GPU offers in 2026.

Drop those weights to 8-bit integers and the model halves in size. Drop them to 4-bit and it shrinks to about a quarter, around 35 to 40 GB for that same 70B model once the quantization scales are included. That is still more than the 24 GB on an RTX 4090, but it fits across two such cards or on a single 48 GB workstation GPU. The idea is the same one you use when rounding 3.14159 to 3.14: you trade a little precision for a much smaller number to carry around.
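The sizes above follow from one multiplication: parameter count times bits per weight. The sketch below is a back-of-the-envelope helper, not a measurement; the function name is made up for illustration, and it ignores the extra bytes real 4-bit files spend on scales and higher-precision layers.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# A nominal 70B-parameter model at three precisions.
for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_memory_gb(70e9, bits):.0f} GB")
# FP16: 140 GB
# INT8: 70 GB
# INT4: 35 GB
```

The 35 GB figure is the floor for pure 4-bit storage; published 4-bit checkpoints land somewhat above it.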

There are two practical regimes. The first is weight-only quantization, the most common setup for inference. Weights live on disk in low precision, get loaded that way, and are dequantized on the fly to FP16 (or matrix-multiplied directly in low precision) when the model runs. Activations, the intermediate tensors flowing through the network, stay at 16-bit precision. The second regime is weight-and-activation quantization, where the activations are also quantized so the matrix multiplications themselves run in low precision. This is harder to get right because activations change with every input and tend to contain large outliers, but it is used in low-precision training and in high-throughput serving where every byte of memory bandwidth counts.
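The weight-only regime can be sketched in a few lines. This is a minimal illustration that assumes symmetric absmax scaling with one scale per weight row; the function names are hypothetical, and production kernels fuse the dequantization into the matrix multiply rather than looping in Python.

```python
def quantize_row_int8(row):
    """Symmetric absmax quantization of one weight row to int8 codes plus a scale."""
    absmax = max(abs(w) for w in row)
    scale = absmax / 127 if absmax > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in row]
    return codes, scale


def matvec_weight_only(q_rows, x):
    """Dequantize each row on the fly and multiply by full-precision activations."""
    out = []
    for codes, scale in q_rows:
        # Integer codes are rescaled once per row; x is never quantized.
        out.append(scale * sum(c * xi for c, xi in zip(codes, x)))
    return out


weights = [[0.12, -0.50, 0.33], [1.20, 0.05, -0.70]]
x = [1.0, 2.0, -1.0]  # activations stay in floating point

q_rows = [quantize_row_int8(r) for r in weights]
exact = [sum(w * xi for w, xi in zip(r, x)) for r in weights]
approx = matvec_weight_only(q_rows, x)
print("exact: ", exact)
print("approx:", approx)
```

Only the stored weights shrink here; the output is still computed against unquantized activations, which is why this regime is the easier of the two to get right.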

The bit-width landscape: FP16, FP8, INT8, INT4

There are five formats worth knowing.

FP16 / BF16 (16-bit) is the unquantized baseline. BF16 trades a few bits of mantissa for a wider exponent range and is what most LLMs train in.
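The mantissa-for-exponent trade can be made concrete from the bit layouts alone: FP16 uses 5 exponent bits and 10 mantissa bits, BF16 uses 8 and 7. The sketch below assumes the usual IEEE-style convention that the top exponent code is reserved for infinities and NaN, which holds for both formats; the helper names are illustrative.

```python
def ieee_like_max(exp_bits: int, man_bits: int) -> float:
    """Largest finite value of an IEEE-style binary float with these field widths."""
    bias = 2 ** (exp_bits - 1) - 1
    max_exp = (2 ** exp_bits - 2) - bias  # top exponent code is reserved for inf/NaN
    return (2 - 2.0 ** -man_bits) * 2.0 ** max_exp


def epsilon(man_bits: int) -> float:
    """Spacing between 1.0 and the next representable value."""
    return 2.0 ** -man_bits


for name, e, m in [("FP16", 5, 10), ("BF16", 8, 7)]:
    print(f"{name}: max={ieee_like_max(e, m):.4g}  epsilon={epsilon(m):.4g}")
```

FP16 tops out at 65504, while BF16 reaches the same order of magnitude as FP32 at the cost of coarser steps between neighbouring values.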

FP8 (8-bit float) keeps the floating-point structure but uses one byte per weight. It has a wider dynamic range than INT8, which helps when activation outliers blow up your quantization scale. NVIDIA's Hopper and Blackwell GPUs have native FP8 matrix units, so it has become the default high-throughput inference format in production serving.

INT8 (8-bit integer) spaces its 256 values uniformly across the range. The classic bitsandbytes 8-bit format costs almost nothing in quality on most models, and INT8 is supported by nearly every inference runtime.

INT4 (4-bit) is the aggressive end. Each weight gets 16 possible values. Smart algorithms use group-wise scales and outlier handling so this still works. The two production methods you will see are GPTQ (post-training, layer-wise reconstruction using second-order information) from arXiv:2210.17323, and AWQ (activation-aware weight quantization) from arXiv:2306.00978, which protects the small percentage of weights that activations care about most. Bitsandbytes also offers 4-bit NF4, and HQQ provides a calibration-free path.
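To show what group-wise scales buy you, here is a minimal sketch of plain round-to-nearest 4-bit quantization with one scale and minimum per group. It is not GPTQ or AWQ, both of which choose the rounding far more carefully; the function names and the tiny group size of 4 are assumptions for readability, and real kernels typically use much larger groups.

```python
def quantize_group_int4(group):
    """Asymmetric 4-bit quantization of one group: codes in 0..15, a scale, a minimum."""
    lo, hi = min(group), max(group)
    scale = (hi - lo) / 15 if hi > lo else 1.0
    codes = [max(0, min(15, round((w - lo) / scale))) for w in group]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Map 4-bit codes back to approximate floating-point weights."""
    return [lo + c * scale for c in codes]


def quantize_groupwise(weights, group_size=4):
    """Split a flat weight list into groups, each with its own scale and minimum."""
    return [quantize_group_int4(weights[i:i + group_size])
            for i in range(0, len(weights), group_size)]


# The second group contains large values; the first contains only small ones.
weights = [0.01, -0.02, 0.03, 0.00, 2.50, -1.75, 0.90, 0.10]
groups = quantize_groupwise(weights, group_size=4)
restored = [w for g in groups for w in dequantize_group(*g)]
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(groups)
print("max abs error:", max_err)
```

Because each group has its own scale, the large values in the second group do not stretch the grid used for the small values in the first, which is the whole point of grouping.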

INT2 and below is research territory. Quality drops measurably and consistently at 2 bits.

File formats matter as much as bit width. For consumer-hardware inference you want GGUF, llama.cpp's single-file format that bundles tokenizer, architecture, and weights together. The popular tier is Q4_K_M, which averages closer to 4.8 bits per weight than 4, because every block of weights carries its own scale and a few sensitive tensors are kept at higher precision (optionally guided by importance-matrix calibration). It lands a 7B model around 4.1 GB on disk. For server-side inference with vLLM or TGI you want safetensors files carrying AWQ or GPTQ metadata.

Why quantization matters for AI chatbots

Quantization is the technique that decoupled "can run open-source LLMs" from "owns an eight-GPU server." A Llama 3 70B model in Q4_K_M GGUF is roughly 40 GB, so it fits across two 24 GB RTX 4090s with room left for a small KV cache, or on a single card with part of the model offloaded to system RAM at a large cost in speed. The same model in FP16 would need six of those cards for the weights alone. That is the difference between a hobbyist project and a research budget.

For a chatbot operator the practical impact is in three places. Memory drops 2x to 4x, which means you can serve a larger model or fit a longer context window into the same VRAM (the KV cache often dominates memory at long contexts, and it can be quantized too). Throughput rises because memory bandwidth is the bottleneck in single-batch decoding, and smaller weights move faster. Cost per token drops correspondingly: a quantized model on a cheaper GPU often serves the same load as a full-precision model on a bigger one.
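Both the KV cache claim and the bandwidth claim reduce to simple arithmetic. The sketch below is a rough estimate, not a benchmark: the layer count, KV head count, and head dimension are assumed values in the neighbourhood of a 70B model with grouped-query attention, the 1 TB/s bandwidth is hypothetical, and the function names are invented for this example.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem, batch=1):
    """Keys plus values for every layer, KV head, and cached token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem * batch


def decode_tokens_per_s_upper_bound(weight_bytes, bandwidth_bytes_per_s):
    """Batch-size-1 decoding reads every weight once per token, so bandwidth caps speed."""
    return bandwidth_bytes_per_s / weight_bytes


# Assumed model shape (illustrative, not taken from any specific checkpoint).
cfg = dict(n_layers=80, n_kv_heads=8, head_dim=128)
fp16_cache = kv_cache_bytes(seq_len=8192, bytes_per_elem=2, **cfg)
int8_cache = kv_cache_bytes(seq_len=8192, bytes_per_elem=1, **cfg)
print(f"KV cache at 8192 tokens: {fp16_cache / 2**30:.2f} GiB FP16, "
      f"{int8_cache / 2**30:.2f} GiB INT8")

bandwidth = 1.0e12  # hypothetical 1 TB/s of memory bandwidth
fp16_bound = decode_tokens_per_s_upper_bound(140e9, bandwidth)
int4_bound = decode_tokens_per_s_upper_bound(35e9, bandwidth)
print(f"decode ceiling: {fp16_bound:.1f} tok/s FP16, {int4_bound:.1f} tok/s 4-bit")
```

The ceiling scales inversely with weight size, so a 4x smaller model has a 4x higher bandwidth-bound decode rate; real throughput sits below that ceiling and depends on the kernels.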

ChatRaj uses provider-hosted LLMs in production, so most operators never touch quantization directly. But operators self-hosting open-source models almost always quantize, and anyone benchmarking "Llama 3 70B for our chatbot" is implicitly benchmarking a quantized version. Knowing the format and bit width matters: a Q4_K_M result tells you very little about how that model performs at FP16, and vice versa.

Quantization is sometimes confused with fine-tuning. They are unrelated operations. Fine-tuning updates the weights to a new distribution; quantization compresses whatever weights exist. The crossover is QLoRA, which fine-tunes a quantized base model by training small adapter weights in higher precision on top of it.
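The QLoRA idea fits in one formula: the output is the frozen quantized layer plus a small low-rank correction, y = dequant(W_q) x + (alpha / r) B (A x). The sketch below illustrates that forward pass only, under simplifying assumptions: integer codes with a single scale stand in for the NF4 format QLoRA actually uses, and all names and values are made up for the example.

```python
def matvec(m, v):
    """Plain matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def qlora_forward(base_codes, base_scale, lora_a, lora_b, x, alpha_over_r=1.0):
    """Frozen quantized base output plus a higher-precision low-rank adapter output."""
    base_out = [base_scale * v for v in matvec(base_codes, x)]  # dequantized base
    low_rank = matvec(lora_b, matvec(lora_a, x))                # B (A x)
    return [b + alpha_over_r * l for b, l in zip(base_out, low_rank)]


base_codes = [[3, -7, 1], [0, 5, -2]]  # frozen low-precision weights (2 x 3)
base_scale = 0.1
lora_a = [[0.5, -0.5, 1.0]]            # r x d_in with rank r = 1
lora_b = [[0.0], [0.0]]                # d_out x r, zero-initialised
x = [1.0, 2.0, 3.0]

# With B at zero the adapter contributes nothing, so training starts from the base model.
y0 = qlora_forward(base_codes, base_scale, lora_a, lora_b, x)

# After some (hypothetical) training only A and B have changed; the codes are untouched.
trained_b = [[0.2], [-0.1]]
y1 = qlora_forward(base_codes, base_scale, lora_a, trained_b, x)
print(y0, y1)
```

The quantized codes are read-only throughout, which is why the two operations stay distinct even when they are combined.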

Quality cost: how much do you lose at 4-bit?

The short answer: surprisingly little. Modern 4-bit GPTQ and AWQ methods retain roughly 95 to 99 percent of FP16 quality on standard benchmarks like MMLU and HellaSwag. Community measurements on Llama-family models report WikiText-2 perplexity increases of 0.1 to 0.3 points relative to FP16 at Q4_K_M, a degradation that is statistically real but imperceptible in chat or instruction-following.
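For readers unfamiliar with the metric, perplexity is the exponential of the average negative log-likelihood per token, so lower is better and a perfect model scores 1. The sketch below shows the computation and how a small absolute gap translates into a relative one; the 5.0 and 5.2 figures are placeholders chosen for the example, not measurements of any model.

```python
import math


def perplexity(token_log_probs):
    """exp of the average negative log-likelihood (natural log) per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def relative_increase(ppl_base, ppl_quant):
    """Fractional perplexity increase of the quantized model over the baseline."""
    return (ppl_quant - ppl_base) / ppl_base


# A model that assigns probability 1/4 to every token has perplexity 4.
uniform = [math.log(0.25)] * 8
print(perplexity(uniform))

# Placeholder numbers: a 0.2-point bump on a baseline of 5.0 is a 4 percent increase.
print(f"{relative_increase(5.0, 5.2):.1%}")
```

The same absolute bump matters more on a model whose baseline perplexity is already low, so compare relative changes when looking across model sizes.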

Where you do see drops is in long-tail reasoning and in non-English languages, both of which lean more heavily on the weights that quantization rounds away. Coding benchmarks tend to be slightly more sensitive than general knowledge ones. If your chatbot is a customer support deflection layer answering FAQ-style questions in English, you will not notice the difference between Q4_K_M and FP16. If you are running code generation in a less-resourced language, run the benchmark before you ship.

INT2 is where quality breaks. Going from 4 bits to 2 bits roughly doubles the perplexity gap, and the failure modes become qualitatively different: the model starts producing fluent but factually drifting answers rather than slightly-worse versions of the right one. For now, 4-bit is the practical floor for production chatbots.
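A toy experiment gives some intuition for why 2 bits is so much harsher than 4. The sketch below measures weight reconstruction error, not perplexity or answer quality, for naive round-to-nearest quantization with a single scale over synthetic Gaussian weights. The setup is an assumption made for illustration; production 2-bit methods use groups and smarter rounding and do better than this.

```python
import random


def rtn_error(weights, bits):
    """Mean squared error of symmetric round-to-nearest quantization at a bit width."""
    levels = 2 ** (bits - 1) - 1  # 127 for 8-bit, 7 for 4-bit, 1 for 2-bit
    scale = max(abs(w) for w in weights) / levels
    err = 0.0
    for w in weights:
        q = max(-levels, min(levels, round(w / scale)))
        err += (w - q * scale) ** 2
    return err / len(weights)


rng = random.Random(0)  # fixed seed so the run is repeatable
weights = [rng.gauss(0.0, 1.0) for _ in range(4096)]

for bits in (8, 4, 3, 2):
    print(f"{bits}-bit MSE: {rtn_error(weights, bits):.5f}")
```

At 2 bits the symmetric grid has only three usable values, so most weights collapse to zero and the error approaches the variance of the weights themselves.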

FAQ

Common Quantization questions

Quantization does not noticeably hurt quality for chat. Modern 4-bit schemes like AWQ and Q4_K_M lose only a few benchmark points, usually 1 to 3 percent on MMLU and a WikiText-2 perplexity bump of 0.1 to 0.3 points. In conversational use the difference is imperceptible. Where you do see drops is long-tail reasoning, code generation, and non-English languages, so benchmark on your real workload before shipping.
