Does 4-bit quantization hurt model quality?

Not noticeably for chat. Modern methods like AWQ and Q4_K_M lose only a little on benchmarks, usually 1 to 3 percent on MMLU and a WikiText-2 perplexity bump of 0.1 to 0.3 points. In conversational use the difference is imperceptible. Where you do see drops is long-tail reasoning, code generation, and non-English languages, so benchmark on your real workload before shipping.

What is the difference between GPTQ and AWQ?

Both are post-training, 4-bit, weight-only methods, but they differ in how they decide where precision matters. GPTQ uses approximate second-order information to quantize each layer's weights column by column, adjusting the not-yet-quantized weights to compensate for the error introduced so far. AWQ is activation-aware: it identifies the small fraction of weight channels that activations care about most and protects those by rescaling them before quantization. AWQ tends to be slightly better at very low bit counts, and its authors reported roughly a 1.45x inference speedup over GPTQ with their own kernels.

Is FP8 better than INT8?

Different tradeoffs. FP8 has a wider dynamic range, which is forgiving when activation outliers would otherwise blow up an INT8 quantization scale. INT8 has more uniform precision across its range and is supported on more hardware. Both work in production in 2026. On Hopper and Blackwell GPUs with native FP8 matrix units, FP8 is now the default high-throughput inference format.
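To make the dynamic-range point concrete, here is a minimal sketch in plain Python. It is a simulation under stated assumptions, not how any runtime implements either format: `fake_int8` uses a single absmax scale for the whole tensor, and `fake_fp8_e4m3` only mimics E4M3 rounding (3 stored mantissa bits, clamp at 448) while ignoring subnormals, NaN encoding, and the per-tensor scale that real FP8 pipelines usually add. The function names and sample values are illustrative.

```python
import math

def fake_int8(values):
    """Round-trip through symmetric INT8 with one absmax scale per tensor."""
    scale = max(abs(v) for v in values) / 127
    return [round(v / scale) * scale for v in values]

def fake_fp8_e4m3(value):
    """Simplified E4M3 rounding: keep 4 significant bits, clamp to +/-448.

    Assumption: subnormals and NaN handling are ignored.
    """
    if value == 0:
        return 0.0
    mantissa, exponent = math.frexp(value)   # value = mantissa * 2**exponent
    mantissa = round(mantissa * 16) / 16     # 1 implicit + 3 stored bits
    return max(-448.0, min(448.0, math.ldexp(mantissa, exponent)))

values = [0.05, -0.03, 0.12, 60.0]           # one large outlier
int8_values = fake_int8(values)
fp8_values = [fake_fp8_e4m3(v) for v in values]

for v, i8, f8 in zip(values, int8_values, fp8_values):
    print(f"{v:>8}: int8={i8:.4f}  fp8={f8:.4f}")
```

With the outlier at 60.0 the INT8 step is about 0.47, so the three small values collapse to zero, while the simulated FP8 rounding keeps each of them within a few percent of its original value.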

Can I quantize an embedding model?

Yes, and it usually works very well. Embedding models tend to quantize cleanly because retrieval is robust to small per-dimension errors. The dot product or cosine similarity that drives vector search averages out per-coordinate quantization noise, so you can drop an embedding model to INT8 or even INT4 and still get retrieval quality close to the full-precision baseline.
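The averaging argument can be checked with a small experiment. The sketch below is an assumption-laden toy: the vectors are random Gaussians rather than real embeddings, the 256-dimension size is arbitrary, and `int_roundtrip` is a simple per-vector absmax quantizer, not the scheme any particular vector database uses.

```python
import math
import random

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def int_roundtrip(vec, bits=8):
    """Quantize to signed integers with one absmax scale, then dequantize."""
    qmax = 2 ** (bits - 1) - 1               # 127 for INT8
    scale = max(abs(v) for v in vec) / qmax
    return [round(v / scale) * scale for v in vec]

rng = random.Random(0)                        # fixed seed for repeatability
query = [rng.gauss(0, 1) for _ in range(256)]
doc = [q + rng.gauss(0, 0.5) for q in query]  # a "relevant" document vector

cos_full = cosine(query, doc)
cos_int8 = cosine(int_roundtrip(query), int_roundtrip(doc))
print(f"full precision: {cos_full:.4f}")
print(f"INT8 round-trip: {cos_int8:.4f}")
```

Passing `bits=4` to `int_roundtrip` gives a feel for how much more the score moves at INT4; real embedding models should still be checked on a retrieval benchmark rather than on synthetic vectors.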

What is the GGUF format?

GGUF (GPT-Generated Unified Format) is llama.cpp's single-file format for quantized models. It bundles the tokenizer, architecture metadata, and quantized weights into one portable file, which makes it the de facto consumer format. Q4_K_M is the popular tier, averaging roughly 4.5 to 5 bits per weight using block-wise scales, optionally tuned with an importance matrix. Most local inference tools (LM Studio, Ollama, text-generation-webui) accept GGUF directly.

What is Model Quantization? (8-bit, 4-bit, FP8 Explained)

What quantization actually is

Quantization is the process of reducing the numerical precision of a model's weights, and sometimes its activations, so the model takes less memory and computes faster. A trained large language model ships in 16-bit floating point (FP16 or BF16) by default. Each weight is two bytes. A 70B parameter model in FP16 therefore weighs roughly 140 GB on disk and needs that much VRAM to load, well past what any single consumer GPU offers in 2026.

Drop those weights to 8-bit integers and the model halves in size. Drop them to 4-bit and it shrinks to a quarter, around 35 to 40 GB for that same 70B model, which fits across two 24 GB consumer cards or on a single 48 GB workstation card. The arithmetic is the same idea you use when rounding 3.14159 to 3.14: you trade a tiny bit of precision for a much smaller number to carry around.
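The size arithmetic is simple enough to sketch. The helper name `weight_memory_gb` is illustrative, and the calculation covers raw weights only, using decimal gigabytes.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"70B at {label}: {weight_memory_gb(70e9, bits):.0f} GB")
```

Real 4-bit files come out somewhat larger than the raw 35 GB because they also store per-group scales and keep a few tensors at higher precision, which is where the 35 to 40 GB range comes from.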

There are two practical regimes. The first is weight-only quantization, the most common setup for inference. Weights live on disk in low precision, get loaded that way, and are dequantized on the fly to FP16 (or matrix-multiplied directly in low precision) when the model runs. Activations, the intermediate tensors flowing through the network, stay at full precision. The second regime is weight-and-activation quantization, where both weights and activations are held in low precision. This is harder to get right because activations have a wider dynamic range, but it is used heavily in training and in high-throughput serving where every byte of memory bandwidth counts.
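A minimal sketch of the weight-only regime, assuming the simplest possible scheme (symmetric absmax INT8 with one scale for the whole weight vector). The function names and values are illustrative; production kernels fuse the dequantize step into the matrix multiply rather than materializing FP16 weights.

```python
def quantize_int8(weights):
    """Map floats to integer codes in [-127, 127] using one absmax scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid zero scale
    codes = [round(w / scale) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate float weights from integer codes."""
    return [c * scale for c in codes]

def dot(weights, activations):
    return sum(w * a for w, a in zip(weights, activations))

weights = [0.42, -1.30, 0.07, 0.88]
activations = [1.5, -0.25, 2.0, 0.5]        # activations stay full precision

codes, scale = quantize_int8(weights)        # what gets stored
y_reference = dot(weights, activations)
y_quantized = dot(dequantize(codes, scale), activations)

print("stored codes:", codes)
print(f"reference={y_reference:.4f} quantized={y_quantized:.4f}")
```

Only `codes` and `scale` need to be stored and moved through memory; the float weights exist just long enough to do the multiply.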

The bit-width landscape: FP16, FP8, INT8, INT4

There are five formats worth knowing.

FP16 / BF16 (16-bit) is the unquantized baseline. BF16 trades a few bits of mantissa for a wider exponent range and is what most LLMs train in.

FP8 (8-bit float) keeps the floating-point structure but uses one byte per weight. It has a wider dynamic range than INT8, which helps when activation outliers blow up your quantization scale. NVIDIA's Hopper and Blackwell GPUs have native FP8 matrix units, so it has become the default high-throughput inference format in production serving.

INT8 (8-bit integer) has uniform precision across its range. The classic bitsandbytes 8-bit format is essentially free quality-wise on most models, and INT8 is the format that nearly every inference runtime supports.

INT4 (4-bit) is the aggressive end. Each weight gets 16 possible values. Smart algorithms use group-wise scales and outlier handling so this still works. The two production methods you will see are GPTQ (post-training, layer-wise reconstruction using second-order information, arXiv:2210.17323) and AWQ (activation-aware weight quantization, arXiv:2306.00978), which protects the small percentage of weights that activations care about most. The bitsandbytes library also offers 4-bit NF4, and HQQ provides a calibration-free path.

INT2 and below is research territory. Quality drops measurably and consistently at 2 bits.
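The group-wise scales mentioned for INT4 are the main reason 16 levels can be enough. The sketch below compares one scale for the whole tensor against one scale per group. It is a toy under stated assumptions: the group size of 4 is far smaller than the sizes real methods use (often 64 or 128), the symmetric scheme here uses only 15 of the 16 codes, and `quantize_groupwise` is an illustrative name, not an API from GPTQ or AWQ.

```python
def quantize_groupwise(weights, bits=4, group_size=4):
    """Round-trip weights through symmetric integers, one absmax scale per group."""
    qmax = 2 ** (bits - 1) - 1               # 7 for 4-bit: codes -7..7
    out = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        scale = max(abs(w) for w in group) / qmax or 1.0
        out.extend(round(w / scale) * scale for w in group)
    return out

def mean_abs_error(original, approx):
    return sum(abs(a - b) for a, b in zip(original, approx)) / len(original)

# Four small weights followed by four large ones.
weights = [0.02, -0.05, 0.03, 0.01, 0.9, -1.2, 0.7, 1.1]

per_tensor = quantize_groupwise(weights, group_size=len(weights))
per_group = quantize_groupwise(weights, group_size=4)

print(f"one scale:       {mean_abs_error(weights, per_tensor):.4f}")
print(f"scale per group: {mean_abs_error(weights, per_group):.4f}")
```

With a single scale the four small weights all round to zero; giving them their own scale preserves them, at the cost of storing one extra scale per group.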

Formats matter as much as bit width. For consumer-hardware inference you want GGUF, llama.cpp's single-file format that bundles tokenizer, architecture, and weights together. The popular tier is Q4_K_M, which uses roughly 4.5 to 5 bits per weight on average, with block-wise scales that can optionally be tuned with an importance matrix, and lands a 7B model around 4.1 GB on disk. For server-side inference with vLLM or TGI you want safetensors files carrying AWQ or GPTQ metadata.
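Because GGUF is a single self-describing file, its header can be inspected with a few lines of code. The sketch below assumes the header layout used by GGUF version 2 and later as I understand it (4-byte magic `GGUF`, then a little-endian uint32 version, uint64 tensor count, and uint64 metadata key-value count); treat that layout as an assumption and check it against the llama.cpp specification. The function name and the synthetic header values are illustrative.

```python
import struct

HEADER_SIZE = 4 + 4 + 8 + 8  # magic + version + tensor count + metadata count

def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size start of a GGUF file (assumed v2+ layout)."""
    if len(data) < HEADER_SIZE:
        raise ValueError("not enough bytes for a GGUF header")
    if data[:4] != b"GGUF":
        raise ValueError("missing GGUF magic bytes")
    version, tensor_count, kv_count = struct.unpack_from("<IQQ", data, 4)
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }

# Synthetic header for demonstration; the counts are made up.
sample = b"GGUF" + struct.pack("<IQQ", 3, 291, 24)
header = read_gguf_header(sample)
print(header)
```

For a real file, passing the first 24 bytes read in binary mode to `read_gguf_header` should be enough; the tokenizer and architecture details live in the metadata entries that follow the header.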

Why quantization matters for AI chatbots

Quantization is the technique that decoupled "can run open-source LLMs" from "owns an eight-GPU server." A Llama 3 70B model in Q4_K_M GGUF weighs roughly 40 GB, so it fits across a pair of 24 GB RTX 4090s with room left for a small KV cache, or on a single card with part of the model offloaded to CPU RAM. The same model in FP16 would need at least six of those cards for the weights alone. That is the difference between a hobbyist project and a research budget.

For a chatbot operator the practical impact is in three places. Memory drops 2x to 4x, which means you can serve a larger model or fit a longer context window into the same VRAM (the KV cache often dominates memory at long contexts, and it can be quantized too). Throughput rises because memory bandwidth is the bottleneck in single-batch decoding, and smaller weights move faster. Cost per token drops correspondingly: a quantized model on a cheaper GPU often serves the same load as a full-precision model on a bigger one.
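The throughput claim follows from a back-of-the-envelope bound: in single-batch decoding each generated token has to stream every weight from memory once, so tokens per second cannot exceed bandwidth divided by model size. The sketch below works that out. The 1000 GB/s bandwidth figure is a hypothetical round number, not a measurement of any specific GPU, and the bound ignores KV cache reads, compute time, and kernel overhead.

```python
def weight_gb(num_params, bits_per_weight):
    """Raw weight size in decimal gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9

def max_decode_tokens_per_s(num_params, bits_per_weight, bandwidth_gb_per_s):
    """Upper bound for batch-size-1 decoding when memory bandwidth is the limit."""
    return bandwidth_gb_per_s / weight_gb(num_params, bits_per_weight)

ASSUMED_BANDWIDTH_GB_S = 1000.0  # hypothetical memory bandwidth

for label, bits in [("FP16", 16), ("INT8", 8), ("4-bit", 4)]:
    rate = max_decode_tokens_per_s(70e9, bits, ASSUMED_BANDWIDTH_GB_S)
    print(f"70B at {label}: at most {rate:.1f} tokens/s")
```

The absolute numbers depend entirely on the assumed bandwidth; the useful part is the ratio, which tracks the 2x to 4x memory reduction.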

ChatRaj uses provider-hosted LLMs in production, so most operators never touch quantization directly. But operators self-hosting open-source models almost always quantize, and anyone benchmarking "Llama 3 70B for our chatbot" is implicitly benchmarking a quantized version. Knowing the format and bit width matters: a Q4_K_M result tells you very little about how that model performs at FP16, and vice versa.

Quantization is sometimes confused with fine-tuning. They are unrelated operations. Fine-tuning updates the weights to a new distribution; quantization compresses whatever weights exist. The crossover is QLoRA, which fine-tunes a quantized base model by training small adapter weights in higher precision on top of it.
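The QLoRA idea can be sketched as a frozen, quantized base matrix plus a small trainable low-rank update added on top. This is a structural toy under several assumptions: the uniform 4-bit rounding in `fake_quantize_rows` stands in for the NF4 data type that QLoRA actually uses, the scaling factor applied to the adapter is omitted, no training loop is shown, and all names and values are illustrative.

```python
def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def fake_quantize_rows(matrix, bits=4):
    """Round-trip each row through symmetric integers (stand-in for NF4)."""
    qmax = 2 ** (bits - 1) - 1
    out = []
    for row in matrix:
        scale = max(abs(w) for w in row) / qmax or 1.0
        out.append([round(w / scale) * scale for w in row])
    return out

base_weights = [[0.8, -0.3, 0.1], [0.2, 0.5, -0.9]]
frozen_base = fake_quantize_rows(base_weights)   # never updated

lora_a = [[0.1, -0.2, 0.3]]     # rank x in_features, trainable
lora_b_init = [[0.0], [0.0]]    # out_features x rank, starts at zero

def forward(x, lora_b):
    """Quantized base output plus the higher-precision adapter correction."""
    base_out = matvec(frozen_base, x)
    adapter_out = matvec(lora_b, matvec(lora_a, x))
    return [b + a for b, a in zip(base_out, adapter_out)]

x = [1.0, 2.0, -1.0]
before_training = forward(x, lora_b_init)
after_training = forward(x, [[0.5], [-0.5]])     # pretend B was updated
print(before_training, after_training)
```

Because the second adapter matrix starts at zero, the model initially behaves exactly like the quantized base; fine-tuning only ever changes the two small adapter matrices.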

Quality cost: how much do you lose at 4-bit?

The short answer: surprisingly little. Modern 4-bit GPTQ and AWQ methods retain roughly 95 to 99 percent of FP16 quality on standard benchmarks like MMLU and HellaSwag. Community measurements on Llama-family models report WikiText-2 perplexity increases of 0.1 to 0.3 points relative to FP16 at Q4_K_M, a degradation that is statistically real but imperceptible in chat or instruction-following.
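For readers unfamiliar with the metric: perplexity is the exponential of the average negative log-likelihood per token, so lower is better. The sketch below shows the definition and converts an absolute bump into a relative one. The baseline value of 5.50 is hypothetical and chosen only to illustrate the arithmetic; it is not a measured score for any model.

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood (natural log) per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def relative_increase(baseline_ppl, quantized_ppl):
    return (quantized_ppl - baseline_ppl) / baseline_ppl

# A model that assigns probability 0.5 to every token has perplexity 2.
print(perplexity([math.log(0.5)] * 4))

HYPOTHETICAL_BASELINE = 5.50
for bump in (0.1, 0.3):
    change = relative_increase(HYPOTHETICAL_BASELINE, HYPOTHETICAL_BASELINE + bump)
    print(f"+{bump} on {HYPOTHETICAL_BASELINE}: {change:.1%} higher")
```

The same absolute bump is a larger relative change on a model with a low baseline perplexity, which is one reason to compare quantized and full-precision runs on the same model and dataset.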

Where you do see drops is in long-tail reasoning and in non-English languages, both of which lean more heavily on the weights that quantization rounds away. Coding benchmarks tend to be slightly more sensitive than general-knowledge ones. If your chatbot is a customer support deflection layer answering FAQ-style questions in English, you will not notice the difference between Q4_K_M and FP16. If you are running code generation or serving a lower-resource language, run the benchmark before you ship.

INT2 is where quality breaks. Going from 4 bits to 2 bits roughly doubles the perplexity gap, and the failure modes become qualitatively different: the model starts producing fluent but factually drifting answers rather than slightly worse versions of the right one. For now, 4-bit is the practical floor for production chatbots.
