New AI Training Method Slashes Memory Needs by 18x

Researchers introduce QZO, a technique that drastically reduces GPU memory usage for fine-tuning large language models.

A new research paper details Quantized Zeroth-order Optimization (QZO), a method designed to make fine-tuning large language models far more memory-efficient. This approach combines model quantization with zeroth-order optimization, potentially lowering the barrier for adapting powerful AI models.


By Sarah Kline

September 15, 2025

4 min read


Key Facts

  • QZO (Quantized Zeroth-order Optimization) is a new method for fine-tuning neural networks.
  • It aims to minimize memory usage across model weights, gradients, and optimizer states.
  • QZO can reduce total memory cost by more than 18x compared to full-parameter 16-bit fine-tuning.
  • It uses zeroth-order optimization to approximate gradients and model quantization (e.g., bfloat16 to int4) for weights.
  • The method addresses the precision gap by perturbing the continuous quantization scale for gradient estimation.

Why You Care

Ever feel like your computer just can’t keep up with the latest AI models? The sheer memory demands of fine-tuning are a real barrier. A new technique could change that. Researchers have unveiled a method that dramatically cuts the memory needed to adapt AI models, making efficient AI development more accessible for everyone.

What Actually Happened

Researchers have introduced a novel method called Quantized Zeroth-order Optimization (QZO). This technique aims to minimize memory usage across model weights, gradients, and optimizer states, according to the announcement. It tackles a significant bottleneck: adapting large language models (LLMs) to specific tasks typically requires immense amounts of GPU memory.

The core idea behind QZO is twofold. First, it eliminates the need to store gradients and optimizer states by using zeroth-order optimization, which approximates gradients by slightly perturbing the weights and comparing losses across forward passes, with no backward pass required (see the sketch below). Second, QZO employs model quantization, converting large data types such as bfloat16 into smaller ones such as int4. This step significantly reduces the memory footprint of the model weights themselves.
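This estimator family is not unique to QZO; zeroth-order fine-tuning methods generally rely on a two-point (SPSA-style) estimate. A minimal PyTorch sketch follows; the function name, hyperparameters, and the `loss_fn` closure are illustrative assumptions, not the paper's code.

```python
import torch

def zo_grad_estimate(loss_fn, params, eps=1e-3, seed=0):
    """Two-point (SPSA-style) zeroth-order gradient estimate.

    Perturbs all parameters along a shared random direction z and uses
    two forward passes to estimate the directional derivative; no
    backward pass is run, so no gradient tensors are stored.
    """
    torch.manual_seed(seed)
    zs = [torch.randn_like(p) for p in params]

    for p, z in zip(params, zs):        # theta -> theta + eps * z
        p.data.add_(eps * z)
    loss_plus = float(loss_fn())

    for p, z in zip(params, zs):        # -> theta - eps * z
        p.data.sub_(2 * eps * z)
    loss_minus = float(loss_fn())

    for p, z in zip(params, zs):        # restore the original theta
        p.data.add_(eps * z)

    # The scalar directional derivative times the direction gives the
    # gradient estimate. Memory-efficient variants (e.g. MeZO)
    # regenerate z from the seed rather than storing it.
    proj = (loss_plus - loss_minus) / (2 * eps)
    return [proj * z for z in zs]
```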

However, directly applying zeroth-order optimization to quantized weights is problematic: there is a precision gap between the discrete weights and the continuous gradients. The team revealed that QZO solves this by perturbing the continuous quantization scale rather than the discrete weights, which allows for accurate gradient estimation. They also use a directional derivative clipping method to stabilize training. A sketch of both ideas follows.
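Here is how one step of that scale-perturbation scheme might look in code. Everything below (the names, the symmetric `W = scale * Q` dequantization assumed inside `loss_fn`, the learning rate, and the clipping threshold) is an illustrative assumption based on the paper's description, not its actual implementation.

```python
import torch

def qzo_step(loss_fn, scales, eps=1e-3, lr=1e-5, clip=5.0, seed=0):
    """One QZO-style update, sketched.

    The int4 weight tensors stay frozen and discrete; only the
    continuous per-group quantization scales are perturbed and updated.
    `loss_fn` is assumed to dequantize internally (W = scale * Q), so
    perturbing the scales perturbs the effective weights.
    """
    torch.manual_seed(seed)
    zs = [torch.randn_like(s) for s in scales]

    for s, z in zip(scales, zs):        # scales + eps * z
        s.data.add_(eps * z)
    loss_plus = float(loss_fn())

    for s, z in zip(scales, zs):        # scales - eps * z
        s.data.sub_(2 * eps * z)
    loss_minus = float(loss_fn())

    for s, z in zip(scales, zs):        # restore the original scales
        s.data.add_(eps * z)

    # Estimated directional derivative, clipped to stabilize training
    # (the clipping threshold here is an arbitrary placeholder).
    proj = (loss_plus - loss_minus) / (2 * eps)
    proj = max(-clip, min(clip, proj))

    for s, z in zip(scales, zs):        # SGD-style step on the scales
        s.data.sub_(lr * proj * z)
```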

Why This Matters to You

This development is crucial for anyone working with or interested in large language models. If you’ve ever tried to fine-tune an LLM, you know the memory struggle is real. QZO offers a practical approach to this widespread problem.

Key Benefits of QZO:

  • Reduced Memory Footprint: Decreases total memory cost significantly.
  • Enhanced Accessibility: Makes fine-tuning AI models more feasible.
  • Faster Iteration: Potentially speeds up the development cycle for AI applications.
  • Broader Application: Allows models to be fine-tuned on more modest hardware.

Imagine you’re a developer trying to customize a large language model for a niche application. Previously, you might have needed access to expensive, high-end GPUs. With QZO, the memory requirements drop dramatically, which could let you achieve the same goals on more affordable hardware. How might this impact your next AI project?

As the paper states, “Compared to full-parameter fine-tuning in 16 bits, QZO can reduce the total memory cost by more than 18x.” This is a substantial reduction. It opens up new possibilities for researchers and developers alike. Your ability to experiment and innovate with AI could be greatly enhanced.

The Surprising Finding

The most striking aspect of this research is the sheer scale of the memory reduction. The study finds that QZO can cut total memory cost by more than 18 times compared with traditional 16-bit full-parameter fine-tuning. That challenges the common assumption that AI models will always demand ever-increasing hardware resources.
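To see where a factor of this size can come from, consider one plausible back-of-envelope accounting. The byte counts, optimizer choice, and quantization group size below are assumptions for illustration; the paper's own bookkeeping may differ.

```python
n_params = 7e9  # a 7B-parameter model, for illustration

# Full-parameter 16-bit fine-tuning with Adam: bf16 weights (2 bytes),
# bf16 gradients (2 bytes), and fp32 Adam moments m and v (4 + 4 bytes).
full_ft_bytes = n_params * (2 + 2 + 4 + 4)   # ~84 GB

# QZO: int4 weights (0.5 bytes) plus a small overhead for per-group
# quantization scales; gradients and optimizer states are never stored.
qzo_bytes = n_params * 0.5 * 1.0625          # ~3.7 GB at group size 64

print(f"~{full_ft_bytes / qzo_bytes:.0f}x reduction")
```

Under these particular assumptions the ratio lands above 20x; the exact factor depends on how optimizer states and scale overheads are counted, which is why the headline figure is stated as "more than 18x."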

Many in the AI community expected memory-efficiency improvements to be incremental, but this research shows a significant leap. It suggests that clever algorithmic design can overcome hardware limitations. This is particularly surprising given the complexity of combining quantization with zeroth-order optimization: the precision gap between discrete and continuous values was a known hurdle, and the researchers’ approach of perturbing the continuous quantization scale is an elegant way to bypass it. It offers a new path forward for memory-constrained AI development.

What Happens Next

This new method, QZO, is orthogonal to existing post-training quantization methods, which means it can potentially be combined with other techniques. We can expect further research and integration efforts in the coming months, and the team suggests the approach could be widely adopted.

For example, imagine a small startup wanting to fine-tune a large AI model for a specific industry. It might not have access to a supercomputer, but QZO could allow it to perform the task on more modest cloud infrastructure, saving significant costs. The paper's subject listings indicate that QZO is applicable across machine learning domains, including Computation and Language (cs.CL) and Computer Vision and Pattern Recognition (cs.CV).

Expect to see early adopters begin experimenting with QZO in the next 6-12 months. Developers should consider how this memory-saving technique could affect their project timelines and budgets. The industry implications are clear: more efficient AI training could democratize access to AI and accelerate the pace of innovation. This is a crucial step toward making AI more accessible and sustainable.
