Popular technique to make AI more efficient has drawbacks

A popular technique for making AI run more efficiently is starting to hit its limits: quantization, one of the most widely used methods for shrinking AI models, is beginning to show cracks.

Quantization comes in several forms, but most of them reduce the number of bits needed to represent information, which lets AI models run far more efficiently. Think of the difference between saying “noon” and “12:00:01.004”: both are correct, and one is far more precise, but the simpler answer is usually good enough. In AI, quantization lowers the precision of a model’s parameters, the internal variables it uses to make predictions. With fewer bits per parameter, models become less resource-intensive, cheaper, and faster to run. However, recent research points to a tradeoff that may limit quantization’s long-term effectiveness.
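
To give a concrete sense of what “fewer bits” means in practice, here is a minimal sketch of symmetric 8-bit weight quantization using NumPy. The layer size, scale handling, and function names are illustrative assumptions, not the pipeline of any particular model or library.

```python
# Minimal sketch of symmetric 8-bit quantization (illustrative only).
import numpy as np

def quantize_int8(weights: np.ndarray) -> tuple[np.ndarray, float]:
    """Map float32 weights onto 8-bit integers plus a single scale factor."""
    scale = np.abs(weights).max() / 127.0          # largest value maps to 127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights; a small rounding error remains."""
    return q.astype(np.float32) * scale

weights = np.random.randn(4096).astype(np.float32)   # stand-in for one layer
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

print("bytes before:", weights.nbytes)                # 16384 (float32)
print("bytes after :", q.nbytes)                      # 4096  (int8)
print("mean abs error:", np.abs(weights - restored).mean())
```

The storage drops by a factor of four, and the price is a small rounding error on every weight; the research discussed below is essentially about when that error stops being small.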

A study by researchers at Harvard, Stanford, and other institutions found that quantized models tend to degrade when their original, unquantized versions were trained for long periods on very large datasets.

In other words, shrinking a large model through quantization is not always the best option; past a certain point, it may be better to train a smaller model from the start. That is a problem for AI companies that train very large models for better performance and then quantize them to cut costs. The effects are already observable in some cases: according to one developer, Meta’s Llama 3 model is more sensitive to quantization than most. The economics matter because training is often the smaller expense compared with inference, the cost of running a model to produce results. Google’s Gemini model reportedly cost $191 million to train, but mundane work such as answering search queries with a model of that class could run into the billions of dollars annually.
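
A back-of-envelope calculation shows how aggregate inference can dwarf a one-time training bill. The per-query cost and query volume below are hypothetical assumptions, not figures from the article; only the $191 million training estimate comes from it.

```python
# Hypothetical numbers (per-query cost and query volume are assumptions)
# to show why aggregate inference can dwarf a one-time training bill.
training_cost = 191_000_000        # reported one-time training cost, dollars
cost_per_query = 0.003             # assumed inference cost per query, dollars
queries_per_day = 2_000_000_000    # assumed daily query volume

annual_inference = cost_per_query * queries_per_day * 365
print(f"annual inference: ${annual_inference / 1e9:.1f}B "
      f"vs training: ${training_cost / 1e6:.0f}M")
```

Under these assumptions, a year of inference costs roughly ten times the training run, which is why labs lean so heavily on quantization to shave per-query costs.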

Even so, large AI labs keep training ever-bigger models, betting that more data and more computational resources will yield better AI. Meta’s Llama 3 was trained on 15 trillion tokens, a huge jump from the 2 trillion used for Llama 2. Yet even this level of scaling is returning diminishing results, and some of the newest, largest models have yet to deliver as promised.

Can quantization be made more efficient?

The researchers hypothesize that training models in “low precision” from the start can make them more robust. Precision here refers to how many bits are used to represent the numbers inside the model. Most of today’s models are trained at 16-bit precision and then quantized to 8-bit. There is also growing interest in even lower-precision formats, such as Nvidia’s FP4, which uses 4-bit precision to save memory and energy. But going too low, below roughly 7-bit precision, causes a noticeable drop in quality unless the model is very large. The study underscores the complexity of AI and its limitations: quantization is an invaluable tool, but it is not one-size-fits-all, and models require a careful balance between efficiency and accuracy. “There’s simply no way to keep cutting costs without it hurting performance,” Tanishq Kumar, one of the study’s authors, explains.
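
The sketch below extends the earlier example to different bit widths, to illustrate why error grows as precision falls toward 4 bits. It quantizes to signed integer grids rather than floating-point formats like FP4, and the weight values are random stand-ins, so treat it as a rough illustration of the trend, not a reproduction of the study.

```python
# Illustrative sketch (not the study's code): the same symmetric quantization
# applied at several bit widths, showing how rounding error grows as the
# precision drops. Weights are random stand-ins for real model parameters.
import numpy as np

def quantize(weights: np.ndarray, bits: int) -> np.ndarray:
    """Round onto a signed integer grid of the given bit width, then restore."""
    levels = 2 ** (bits - 1) - 1                 # e.g. 127 for 8-bit, 7 for 4-bit
    scale = np.abs(weights).max() / levels
    q = np.clip(np.round(weights / scale), -levels, levels)
    return q * scale                             # dequantized approximation

weights = np.random.randn(100_000).astype(np.float32)
for bits in (16, 8, 6, 4):
    err = np.abs(weights - quantize(weights, bits)).mean()
    print(f"{bits}-bit grid -> mean abs error {err:.5f}")
```

The error is negligible at 16 bits, still small at 8, and climbs sharply by 4, which mirrors the researchers’ warning about pushing below roughly 7-bit precision.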

One path forward is to spend more training time and effort on smaller models built from highly curated, high-quality datasets. Another is to develop new architectures designed for stable low-precision training.

Quantization remains a valuable tool for making AI more efficient, but its limitations are becoming apparent. The way forward lies in smarter training methods and continued innovation, so that quality is not sacrificed in the pursuit of cheaper AI.