LLM Training Stability: Is AdamW the Right Tool for the Job?
New research questions the theoretical reliability of AdamW, the industry-standard AI optimizer, when faced with the extreme heavy-tailed noise common in large-scale model training.
TL;DR
- Researchers are questioning if AdamW, the standard tool for training AI, is theoretically sound for the "heavy-tailed" noise seen in modern LLM pretraining.
- Emerging sign-based optimizers like Lion and Muon may offer better stability because their updates depend mainly on the direction of the gradient, which blunts the effect of extreme spikes.
Background
To train an AI model, you show it data, calculate the error, and nudge the model's internal weights to reduce that error. This nudge is handled by an algorithm called an optimizer. For years, AdamW has been the dominant choice, and it is used to train most major Large Language Models (LLMs). However, most of the theory behind AdamW assumes that the noise in training stays within a predictable range. In reality, training data is messy and often produces extreme outliers, which mathematicians call heavy-tailed noise.
What happened
A new theoretical inquiry highlights a growing disconnect between how we train AI and why it actually works [^1]. AdamW, which stands for Adam with Decoupled Weight Decay, was introduced to fix issues with how older optimizers handled regularization [^2]. It became the industry standard because it demonstrated superior performance in empirical tests. However, most of the mathematical proofs justifying AdamW assume a "finite-variance" regime. This means the noise or errors the model encounters during training are expected to stay within a relatively predictable range, similar to a standard bell curve where extreme outliers are vanishingly rare.
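One common way the optimization literature states the two regimes is sketched below. The notation is an assumption for illustration and is not taken from the cited work: g(x) is the stochastic gradient, ∇f(x) the true gradient, σ a noise scale, and p a tail parameter.

```latex
% Finite-variance assumption used in most classical analyses:
\mathbb{E}\left[\lVert g(x) - \nabla f(x) \rVert^{2}\right] \le \sigma^{2}

% Heavy-tailed relaxation: only a lower-order moment is bounded,
% so the variance itself may be infinite when p < 2:
\mathbb{E}\left[\lVert g(x) - \nabla f(x) \rVert^{p}\right] \le \sigma^{p},
\qquad p \in (1, 2]
```

Setting p = 2 recovers the finite-variance case, so the heavy-tailed condition is strictly weaker and covers noise whose extreme values are far more frequent.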
The problem is that empirical evidence from massive LLM pretraining runs shows that stochastic gradient noise, the "static" in the learning process, is typically heavy-tailed [^3]. In a heavy-tailed distribution, extreme events or "shocks" are much more likely than a standard model would predict. When an optimizer like AdamW hits one of these massive gradient spikes, its internal math can become unstable. AdamW works by tracking a "moving average" of the gradients and of their squares. A single massive outlier can skew those averages so drastically that the model "forgets" what it learned or experiences a catastrophic loss spike, where the error jumps sharply and the run may not recover.
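The moving-average mechanism described above can be sketched in a few lines. This is a minimal scalar illustration of the standard Adam update rule (weight decay omitted), not the setup from the cited research; the function name `adam_steps`, the gradient streams, and the spike size of 1000 are all hypothetical choices.

```python
import math

def adam_steps(grads, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the per-step Adam update sizes for a scalar gradient stream."""
    m = v = 0.0
    updates = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g        # moving average of gradients
        v = beta2 * v + (1 - beta2) * g * g    # moving average of squared gradients
        m_hat = m / (1 - beta1 ** t)           # bias correction
        v_hat = v / (1 - beta2 ** t)
        updates.append(lr * m_hat / (math.sqrt(v_hat) + eps))
    return updates

# Hypothetical streams: identical except for one outlier at step 101.
clean = [1.0] * 200
spiked = [1.0] * 100 + [1000.0] + [1.0] * 99

u_clean = adam_steps(clean)
u_spiked = adam_steps(spiked)

print("final update, clean stream: ", u_clean[-1])
print("final update, spiked stream:", u_spiked[-1])
```

In this toy run the outlier inflates the squared-gradient average, which decays slowly because `beta2` is close to 1. Ninety-nine steps later the update on the spiked stream is still more than an order of magnitude smaller than on the clean one, which is one way a single shock can distort training long after it has passed.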
Researchers are now looking at alternative "sign-based" optimizers such as Lion and Muon. Unlike AdamW, whose update depends on the magnitude of the gradient (how big the error is) through its running averages, sign-based optimizers often care primarily about the direction of the error [^1]. This makes them inherently more resistant to outliers. If a gradient is 100 times larger than normal, a sign-based optimizer treats it the same as a gradient that is only 2 times larger, as long as they point in the same direction. Recent work suggests these newer methods achieve sharper convergence rates under heavy-tailed conditions, which puts AdamW's theoretical standing in question. The core of the issue is whether the "adaptive" nature of AdamW, the very thing that made it famous, is actually a liability when the training data is as chaotic as the modern internet.
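The 2x versus 100x comparison can be checked directly. The sketch below uses a plain sign update (signSGD style) as a stand-in; it is not the full Lion or Muon algorithm, and the helper names and gradient values are hypothetical.

```python
def sign(x: float) -> float:
    """Return -1.0, 0.0, or 1.0 depending on the sign of x."""
    return float((x > 0) - (x < 0))

def sgd_update(grad, lr=0.01):
    """Magnitude-sensitive update: the step scales with the gradient."""
    return [-lr * g for g in grad]

def sign_update(grad, lr=0.01):
    """Sign-based update: every coordinate moves by exactly lr."""
    return [-lr * sign(g) for g in grad]

# Hypothetical gradient, then the same direction scaled by 2 and by 100.
base = [0.5, -0.2, 0.1]
g2 = [2 * g for g in base]
g100 = [100 * g for g in base]

print("sgd  2x:  ", sgd_update(g2))
print("sgd  100x:", sgd_update(g100))
print("sign 2x:  ", sign_update(g2))
print("sign 100x:", sign_update(g100))
```

The sign-based steps are identical for both scalings, while the magnitude-sensitive steps differ by a factor of 50. The trade-off is that a sign update also discards magnitude information that is useful when the noise is well behaved.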
Why it matters
If the mathematical foundation of our most popular optimizer is shaky, then our ability to scale AI is limited by expensive trial and error rather than engineering precision. Training a frontier model can cost tens of millions of dollars in electricity and hardware time. If a training run crashes halfway through because the optimizer could not handle a heavy-tailed spike, that is a massive waste of resources. Understanding the theory behind these failures allows engineers to build more resilient systems that do not require constant human monitoring during the months-long training process. This is not just a theoretical concern; "loss spikes" are a well-known headache at AI labs such as Google, Meta, and OpenAI.
Furthermore, this shift toward sign-based optimizers could lead to more efficient training. If we can use optimizers that are mathematically tuned for the actual noise found in web-scale data, we may be able to reach stronger models with less hardware. That matters most for smaller organizations that cannot afford to restart a failed run. It moves AI from an art form of "vibe-based" tuning toward a rigorous branch of statistical engineering. By adopting methods that handle noise more gracefully than AdamW, the industry could reduce the carbon footprint of AI training while also increasing the reliability of the resulting models.
Practical example
Imagine you are teaching a student to shoot a basketball. Most of the time, their misses are small, such as hitting the front of the rim or the backboard. You use a standard correction method (like AdamW) to give them small adjustments. But suddenly, for one shot, a gust of wind catches the ball and it flies into the parking lot. In this simplified picture, a correction that reacts to the size of the miss tells the student: "You missed by 50 feet, so next time, aim 50 feet in the opposite direction." This massive over-correction ruins their form and they forget how to shoot.
A sign-based optimizer like Lion would instead say: "You missed to the left. Just aim a little more to the right next time." It ignores the fact that the ball went 50 feet away; it only cares that the direction was wrong. By ignoring the extreme outlier, the student stays on track.