Granular AI: Shrinking Models by Replacing Submodules
New research demonstrates that replacing specific sub-components of an AI model, rather than entire layers, leads to significantly better performance in compressed Large Language Models.
TL;DR
* Researchers have introduced a method to shrink Large Language Models by replacing specific sub-components instead of removing entire architectural layers.
* This granular approach allows for more precise compression, enabling models to remain highly capable while significantly reducing their memory and computational requirements.

## Background
Large Language Models (LLMs) like GPT-4 or Llama 3 are computationally expensive. To run these models on consumer hardware, developers use compression techniques to reduce their size. Two common approaches are "pruning" and "replacement," where parts of the model identified as redundant are either deleted or swapped for simpler mathematical functions. Traditionally, this has been done at the "layer" level: the large, contiguous blocks that make up a transformer's architecture. However, removing a whole layer is a blunt instrument that often degrades the model's reasoning abilities [^2].

## What happened
A new study titled "From Layers to Submodules" suggests that the current approach to model compression is far too restrictive [^1]. Most existing methods follow two rigid rules: they must replace an entire layer at once, and they must select layers that are right next to each other. The researchers argue that redundancy in AI models is not neatly organized in contiguous blocks. Instead, it is scattered across smaller units called submodules.

A standard transformer layer consists of two primary submodules: the Multi-Head Attention (MHA) mechanism, which lets each token draw on information from the other tokens, and the Feed-Forward Network (FFN), which transforms each token's representation on its own. Under layer-level methods, shrinking a model means removing both the MHA and the FFN of a given layer together. The new research demonstrates that it is much more effective to look at these submodules individually.
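
The contrast between the two selection rules can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Submodule` type, both helper functions, and the redundancy scores are hypothetical, and the way the study actually measures redundancy is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Submodule:
    layer: int
    kind: str  # "mha" or "ffn"


def best_contiguous_layers(scores, num_layers, span):
    """Layer-level rule: pick the contiguous run of `span` whole layers
    whose submodules have the highest total redundancy."""
    def block(start):
        return [Submodule(l, k)
                for l in range(start, start + span)
                for k in ("mha", "ffn")]

    starts = range(num_layers - span + 1)
    best = max(starts, key=lambda s: sum(scores[m] for m in block(s)))
    return block(best)


def best_submodules(scores, budget):
    """Submodule-level rule: pick the `budget` most redundant submodules,
    regardless of which layer they sit in."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return sorted(ranked[:budget], key=lambda m: (m.layer, m.kind))


# Hypothetical redundancy scores (higher = easier to replace) for 4 layers.
scores = {
    Submodule(0, "mha"): 10, Submodule(0, "ffn"): 20,
    Submodule(1, "mha"): 90, Submodule(1, "ffn"): 10,
    Submodule(2, "mha"): 10, Submodule(2, "ffn"): 80,
    Submodule(3, "mha"): 10, Submodule(3, "ffn"): 70,
}

# Same budget in both cases: two submodules, i.e. one layer's worth.
layer_pick = best_contiguous_layers(scores, num_layers=4, span=1)
sub_pick = best_submodules(scores, budget=2)

print(layer_pick, sum(scores[m] for m in layer_pick))
print(sub_pick, sum(scores[m] for m in sub_pick))
```

With these made-up scores, the layer-level rule has to take all of Layer 1, including its low-redundancy FFN, while the submodule-level rule takes the MHA of Layer 1 and the FFN of Layer 2.
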
For instance, a model might have a very important attention submodule in Layer 5 but a highly redundant Feed-Forward Network in that same layer. By replacing only the redundant FFN, the model keeps what the attention submodule contributes while still saving space [^1].

The researchers tested this by moving from "full-layer granularity" to "submodule granularity." They found that the most redundant parts of a model are often non-contiguous. This means that instead of cutting out Layers 10 through 15, it might be better to cut the attention submodule of Layer 2, the FFN of Layer 8, and the attention submodule of Layer 20. When they applied this surgical approach, they were able to compress models to a much smaller size while retaining more accuracy than traditional layer-level methods. The process uses replacement-based compression, where each identified submodule is replaced by a smaller, fitted module that mimics the original's output with fewer parameters. Because the replacement is fitted to the original's output, the rest of the network sees inputs close to what it saw before [^1].

## Why it matters
The transition from layer-level to submodule-level compression is a significant shift in how we optimize AI. As models grow to hundreds of billions of parameters, the coarse method of removing entire layers becomes increasingly inefficient. It is the difference between using a chainsaw and a scalpel. By targeting submodules, we can preserve the subtle nuances of a model's logic that are often lost during aggressive pruning [^2].

This precision has direct implications for edge AI, meaning powerful models running locally on smartphones, laptops, and IoT devices. If we can shrink a 70-billion-parameter model down to the size of a 7-billion-parameter model without losing its sophisticated reasoning, we reduce the need for expensive, centralized cloud servers.
This improves data privacy, as sensitive information never needs to leave the user's device, and it lowers the carbon footprint associated with the massive energy demands of AI data centers.

Furthermore, this research provides a clearer map of which parts of a model matter. By identifying which specific submodules are redundant, researchers can better understand which parts of the transformer architecture are doing the heavy lifting. This feedback loop could influence how the next generation of models is trained from scratch. Instead of building massive, uniform layers, engineers might start designing models with varying submodule densities, leading to naturally more efficient architectures that require less post-training compression.

## Practical example
Imagine you are trying to lighten a heavy backpack for a long hike. The old way of lightening the bag would be to remove entire categories of items. You might decide to leave behind all your cooking gear. While this makes the bag much lighter, you now have no way to boil water or heat food, which is a major loss of functionality and safety.

The submodule way is more strategic. Instead of throwing out all the cooking gear, you look at individual items. You keep the lightweight stove (the important attention submodule) but replace the heavy cast-iron skillet with a small titanium pot (replacing a redundant submodule with a fitted version). You then realize you have two identical flashlights in different pockets, so you remove one. In the end, your backpack is just as light as if you had thrown out the whole cooking kit, but you still have every capability you started with, just in a more efficient, streamlined form.
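
To put rough numbers on the skillet-for-pot swap, here is a small back-of-the-envelope calculation. It assumes a classic two-matrix FFN and a low-rank linear bottleneck as the fitted replacement. Both the replacement's form and the sizes used are illustrative assumptions, not details taken from the study.

```python
def ffn_params(d_model, d_ff):
    """Weights in a two-matrix FFN (d_model -> d_ff -> d_model), biases ignored."""
    return 2 * d_model * d_ff


def low_rank_params(d_model, rank):
    """Weights in a linear bottleneck (d_model -> rank -> d_model), biases ignored."""
    return 2 * d_model * rank


# Illustrative sizes only; not taken from the study.
d_model, d_ff, rank = 4096, 16384, 256

original = ffn_params(d_model, d_ff)
replacement = low_rank_params(d_model, rank)

print(f"original FFN weights:   {original:,}")
print(f"replacement weights:    {replacement:,}")
print(f"weights saved per swap: {original - replacement:,}")
```

Under these assumptions, the replacement holds 64 times fewer weights than the FFN it stands in for. How well such a small module can reproduce the original's output depends on how redundant that submodule really was.
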