The Potential of Higher Parameter, Lower Precision Language Models

As the field of large language models continues its rapid advancement, a key area meriting further research is the balance between model parameter count and numerical precision. While the field has traditionally relied on 32-bit or 16-bit floating point representations, recent work has shown that models can be quantized to 8-bit or even lower precision with little loss in quality.

What makes this shift toward lower precision so intriguing is that the memory savings allow vastly more parameters to fit within the same memory budget. An 8-bit quantized model with 14 billion parameters occupies roughly the same memory as a 16-bit model with just 7 billion parameters. Counterintuitively, evidence suggests these higher parameter count, lower precision architectures can outperform their higher precision, lower parameter counterparts on language benchmarks.
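
To make the arithmetic concrete, here is a rough sketch in plain Python. It counts weight storage only; activations, optimizer state, and the KV cache are deliberately ignored, so treat the numbers as a lower bound rather than a deployment estimate.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in gigabytes (decimal GB), weights only."""
    return num_params * bits_per_param / 8 / 1e9

print(weight_memory_gb(14e9, 8))   # 14.0 -> ~14 GB for a 14B model at 8 bits
print(weight_memory_gb(7e9, 16))   # 14.0 -> ~14 GB for a 7B model at 16 bits
```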

The potential upsides of this approach are multifaceted:

  1. Scalability and Capacity: For large language model deployments, the limiting factor is often accelerator or GPU memory rather than raw compute power. By quantizing to lower precision, we can fit models with drastically higher parameter counts and capacity within constrained memory budgets.

  2. Hardware Efficiency: Lower precision models can better exploit specialized accelerators such as Google’s Tensor Processing Units (TPUs) and similar chips, which devote substantial silicon to integer arithmetic. This could unlock a new level of computational performance and efficiency (a minimal quantization sketch follows this list).

  3. Representation Power: With their increased parameter counts, lower precision models gain the ability to capture more nuance, complexity, and rich representations from their training data.

  4. Deployment Flexibility: Lower precision arithmetic has far wider hardware support, from high-end data center chips to mobile processors. This democratizes running large language models across a range of devices and compute environments.
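
To make the hardware-efficiency point a bit more concrete, here is a minimal sketch of symmetric per-tensor int8 quantization in NumPy. The scaling scheme, clipping range, and toy tensor are illustrative assumptions, not a description of how any particular production model is quantized.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: w is approximated by scale * q."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Map the int8 codes back to float32 for comparison."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4)).astype(np.float32)
q, scale = quantize_int8(w)
print("max abs error:", np.abs(w - dequantize(q, scale)).max())
```

Real deployments typically quantize per channel (or per group) rather than per tensor, which keeps the error much smaller, but the basic idea of storing integers plus a scale is the same.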

While open research questions remain around optimal parameter counts, precision thresholds where catastrophic performance degradation sets in, efficient quantization pipelines, and quantization-aware training strategies, the potential upsides make this direction an urgent priority.
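As one illustration of what quantization-aware training can look like, the sketch below simulates 8-bit weights in the forward pass while letting gradients flow at full precision via a straight-through estimator. It uses PyTorch and a toy linear layer; the per-tensor symmetric scheme and the layer sizes are assumptions for illustration, not a prescribed recipe.

```python
import torch

def fake_quant(w: torch.Tensor, bits: int = 8) -> torch.Tensor:
    """Simulate low-precision weights in the forward pass while keeping
    full-precision gradients (straight-through estimator)."""
    qmax = 2 ** (bits - 1) - 1                # e.g. 127 for 8 bits
    scale = w.abs().max() / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax, qmax) * scale
    # Forward uses the quantized weights; backward treats the rounding as identity.
    return w + (w_q - w).detach()

# Toy usage: a linear layer whose weights are fake-quantized each step.
layer = torch.nn.Linear(16, 16)
x = torch.randn(4, 16)
out = torch.nn.functional.linear(x, fake_quant(layer.weight), layer.bias)
out.sum().backward()                          # gradients still reach layer.weight
print(layer.weight.grad.shape)
```

This sketch only covers the weight path; how activations are quantized and at what granularity are exactly the kinds of open questions mentioned above.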

As a research community, we should dedicate significant resources to innovations in quantization, sparse compression, and the algorithmic and hardware breakthroughs that maximize language model capacity within reasonable computational constraints. Only by continuing to push the boundaries of low-precision, high-parameter-count architectures can we enable the next generation of large language model capabilities. I propose the creation of a 14B 8-bit model, trained natively at 8 bits rather than quantized down from a 16-bit model. Given the proper resources, I believe such a model could outperform many 7B 16-bit models while running on the same hardware, and that it is worth further investigation.

The path forward lies at the intersection of novel low-precision training techniques, efficient low-precision arithmetic kernels and algorithms, and the specialized AI hardware and accelerators designed for low-bit arithmetic. With a concerted research effort, we may soon unlock a new frontier of large language model performance without being tethered to high-precision computation.
