The technological landscape of artificial intelligence is undergoing a quiet revolution, driven by the Mixture of Experts (MoE) architecture. Instead of engaging every neuron of a giant model to answer each query, MoE models use a router that selects and activates only a few of many smaller specialized subnetworks, the 'experts.' In Mistral AI's Mixtral 8x7B, for example, only 2 of 8 experts are engaged for each output token, so roughly 13 billion of the model's 47 billion total parameters are used per step. This lets the model rival the quality of Llama 2 70B while running about 6 times faster at inference.
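The "13 billion of 47 billion" figure can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes the publicly reported Mixtral 8x7B configuration (32 layers, hidden size 4096, FFN size 14336, 3 weight matrices per SwiGLU expert); the ~1.6B figure for the shared weights (embeddings, attention, norms) is our rough assumption, not an official number.

```python
# Back-of-the-envelope check of Mixtral-style active vs. total parameters.
# Assumed config: 32 layers, hidden 4096, FFN 14336, 3 matrices per expert.
# SHARED (embeddings, attention, norms) is a rough assumption, ~1.6B.
LAYERS, HIDDEN, FFN = 32, 4096, 14336
NUM_EXPERTS, TOP_K = 8, 2
SHARED = 1.6e9

per_expert = LAYERS * 3 * HIDDEN * FFN      # FFN weights of one expert, all layers
total = SHARED + NUM_EXPERTS * per_expert   # every parameter that must be stored
active = SHARED + TOP_K * per_expert        # parameters actually touched per token

print(f"total ≈ {total / 1e9:.1f}B, active ≈ {active / 1e9:.1f}B")
# → total ≈ 46.7B, active ≈ 12.9B
```

The estimate lands almost exactly on the published numbers: only the two selected experts contribute FFN compute, while the attention and embedding weights are shared by every token.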
Before MoE, the development of language models was hitting a scaling ceiling: every additional parameter linearly increased compute and training costs. Traditional dense models such as GPT-3 use all 175 billion of their parameters to generate every single token, making them extremely expensive to scale and operate. MoE addresses this by offering a path to models with trillions of parameters that remain economically viable to train and, critically, to run every day.
Technically, MoE is integrated into standard transformer blocks, replacing the dense feed-forward layer with a set of experts and a router. The router, often implemented via a simple linear layer with a softmax function, determines which experts are most relevant for the current input token. Key innovations in recent years, such as Google's GShard and Switch Transformer, have addressed issues of training instability and uneven expert load. Modern implementations, as in Mixtral, use 'sparse' activation, where the top-2 experts are selected for each token, ensuring a balance between quality and speed.
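The mechanics described above (a linear router, a softmax over expert scores, and top-2 sparse activation) can be illustrated with a minimal pure-Python sketch. The dimensions, the random router weights, and the toy "experts" (which just scale their input) are illustrative placeholders, not a real implementation:

```python
import math
import random

random.seed(0)

HIDDEN = 4        # toy token dimension
NUM_EXPERTS = 8   # Mixtral-style expert count
TOP_K = 2         # sparse activation: experts kept per token

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

# Hypothetical router: one row of HIDDEN weights per expert (a linear layer).
router_w = [[random.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(NUM_EXPERTS)]

# Toy "experts": expert i simply scales the token vector by (i + 1).
experts = [lambda x, f=i + 1: [v * f for v in x] for i in range(NUM_EXPERTS)]

def moe_layer(token):
    # Router: linear projection to one logit per expert, then softmax.
    logits = [sum(w * v for w, v in zip(row, token)) for row in router_w]
    probs = softmax(logits)
    # Sparse activation: keep only the top-2 experts for this token.
    top = sorted(range(NUM_EXPERTS), key=lambda i: probs[i], reverse=True)[:TOP_K]
    norm = sum(probs[i] for i in top)  # renormalize the kept weights
    out = [0.0] * HIDDEN
    for i in top:
        y = experts[i](token)
        out = [o + (probs[i] / norm) * v for o, v in zip(out, y)]
    return out, top

output, chosen = moe_layer([0.5, -0.2, 0.1, 0.9])
print(chosen)  # indices of the two experts selected for this token
```

In a real transformer this block replaces the dense feed-forward layer, the experts are full FFNs, and an auxiliary load-balancing loss (as in Switch Transformer) discourages the router from collapsing onto a few favorite experts.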
Although OpenAI does not disclose details of GPT-4's internal structure, numerous leaks and independent analyses suggest it is an MoE model, which would explain its efficiency and quality despite a reportedly enormous total parameter count. French startup Mistral AI has made MoE its competitive advantage by openly releasing Mixtral 8x7B, sparking a wave of enthusiasm in the open-source community. Major players, including Google and xAI, are actively researching and deploying the architecture, increasingly treating it as the standard for the next generation of LLMs.
For the industry, MoE means democratizing access to giant models. Lower inference costs make it possible to deploy powerful AI on consumer hardware and in commercial applications with limited budgets. For end users, this translates into faster, cheaper, and higher-quality services, from chatbots to programming tools. The approach does bring a new challenge, however: an MoE model needs enough GPU memory to hold all of its parameters, even the experts that sit idle for a given token, which complicates distribution and deployment.
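The memory-versus-compute asymmetry is easy to quantify. A rough sketch, assuming Mixtral-like figures (47B total parameters, ~13B active per token) and 16-bit weights:

```python
# Rough sketch: why an MoE model needs memory for ALL of its parameters
# even though each token only exercises a fraction of them.
total_params = 47e9    # assumed Mixtral-like total parameter count
active_params = 13e9   # assumed parameters used per token (2 of 8 experts)
bytes_per_param = 2    # fp16/bf16 weights

vram_gb = total_params * bytes_per_param / 1e9  # GB just to hold the weights
compute_share = active_params / total_params    # fraction exercised per token

print(f"{vram_gb:.0f} GB of weights, but only {compute_share:.0%} used per token")
# → 94 GB of weights, but only 28% used per token
```

So while the per-token compute matches a ~13B dense model, the memory footprint matches the full 47B model, before accounting for the KV cache and activations. This is exactly why quantization and multi-GPU sharding dominate discussions of running Mixtral locally.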
The prospects for MoE development lie in increasing the number of experts, improving routing algorithms, and creating hybrid models. The next logical step will be models with thousands of experts and smarter selection systems, possibly involving small language models for the routing itself. The question of effectively pre-training such sparse architectures and managing the 'narrow specialization' of experts remains open. Nevertheless, it is clear that MoE is not a temporary trend but a fundamental shift that will define the architecture of the largest AI systems for years to come, bringing us closer to creating more competent, accessible, and environmentally friendly artificial intelligence models.