Quantization is a technique for reducing the numerical precision of a model's weights and activations, for example from 32-bit floating point to 8-bit integers. Lower precision enables faster inference, lower memory usage, and better performance on edge devices, making quantization crucial for deploying AI models in resource-constrained environments and an active area of research in deep learning optimization.
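As a minimal sketch of the idea, the snippet below shows symmetric per-tensor int8 quantization with NumPy: the largest weight magnitude is mapped to 127, every value is rounded to the nearest 8-bit integer, and an approximate float tensor can be recovered by multiplying back by the scale. The function names here are illustrative, not from any particular library.

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric per-tensor quantization of float32 weights to int8."""
    scale = np.abs(weights).max() / 127.0  # map the largest magnitude to 127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float32 tensor from int8 values."""
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.2, 0.03, 0.9], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
# w_hat approximates w, stored in 8 bits per value instead of 32
```

Each value is now stored in a quarter of the memory, at the cost of a rounding error of at most half the scale per element; real systems refine this with per-channel scales, zero points, or quantization-aware training.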
Stories
17 stories tagged with quantization