Today, a team of researchers introduced Token Superposition Training (TST), a modification to the standard large language model (LLM) pretraining loop that achieves a 2-3× wall-clock speedup without altering the model’s architecture, optimizer, tokenizer, or training data. TST runs in two phases: an early superposition phase that trains on contiguous bags of tokens, followed by standard next-token prediction. Its two mechanisms, input averaging and output multi-hot cross-entropy, contribute independent efficiency gains, and TST stacks with existing techniques such as sparse attention and MoE routing. This release is the first in a series of research updates from the Nous pretraining group aimed at advancing pretraining methodology.
Nous: Nous Research is an open-source AI research organization dedicated to developing advanced language models and efficiency techniques. Its pretraining group released Token Superposition Training (TST), a drop-in modification to LLM pretraining that uses token bagging in the early phase to improve convergence. The release is the first in a series of upcoming research announcements from the group.
@bloc97_: Bowen Peng (@bloc97_) is an AI researcher at Nous Research focused on language model pretraining innovations. He co-led the development and validation of Token Superposition Training (TST) across dense and MoE architectures. His work emphasizes simple changes that enhance training dynamics without architectural alterations.
@gigant_theo: Théo Gigant (@gigant_theo) is a research scientist at Nous Research with a background in multimodal deep learning from his PhD at Université Paris-Saclay and CentraleSupélec. He co-led the Token Superposition Training (TST) initiative for efficient LLM pretraining. His contributions highlight practical ablation studies confirming independent efficiency gains.
@theemozilla: Emozilla (@theemozilla) is an AI researcher and co-founder/CTO of Nous Research. He co-led the Token Superposition Training (TST) project, which advances pretraining by decoupling training speed from inference architecture. His perspectives emphasize broad compatibility with emerging techniques like sparse attention and MoE routing.
Token Superposition Training: Token Superposition Training (TST) is a pretraining method that, during an initial phase before standard next-token prediction, averages contiguous input token embeddings and predicts bags of output tokens with a multi-hot cross-entropy loss. It combines two compatible mechanisms: input-side averaging, which acts as a regularizer or a coarse pre-pretraining stage, and output-side multi-token prediction. Because TST separates training efficiency from inference architecture, it integrates easily with other optimizations.
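The summary above does not include the authors' implementation, but a minimal sketch of the two mechanisms might look like the following, assuming a fixed bag size, PyTorch-style embeddings, and a uniform soft-target form of the multi-hot cross-entropy; all function names, shapes, and the exact loss form are illustrative assumptions rather than the published method.

```python
# Illustrative sketch of TST's two mechanisms; names, shapes, and the exact
# loss form are assumptions, not the authors' released implementation.
import torch
import torch.nn.functional as F

def bag_inputs(token_embeds: torch.Tensor, bag_size: int) -> torch.Tensor:
    # Average contiguous bags of input embeddings: (B, T, D) -> (B, T // bag_size, D).
    # The averaged bags shorten the sequence the transformer processes, which is
    # one plausible source of the wall-clock speedup during the early phase.
    b, t, d = token_embeds.shape
    return token_embeds.reshape(b, t // bag_size, bag_size, d).mean(dim=2)

def bag_targets(token_ids: torch.Tensor, bag_size: int, vocab_size: int) -> torch.Tensor:
    # Multi-hot targets: each bagged position is labeled with the set of tokens
    # in the corresponding bag rather than with a single next token.
    b, t = token_ids.shape
    bags = token_ids.reshape(b, t // bag_size, bag_size)
    multi_hot = torch.zeros(b, t // bag_size, vocab_size, device=token_ids.device)
    return multi_hot.scatter_(2, bags, 1.0)

def superposition_loss(logits: torch.Tensor, multi_hot: torch.Tensor) -> torch.Tensor:
    # One plausible "multi-hot cross-entropy": spread the target mass uniformly
    # over the tokens present in each bag, then take cross-entropy as usual.
    soft_targets = multi_hot / multi_hot.sum(dim=-1, keepdim=True)
    return -(soft_targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
```

The actual bag size, loss weighting, and target construction would come from the TST write-up; this sketch only illustrates how input averaging and a multi-hot output loss can coexist in one training step.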
Stackability: TST integrates cleanly with sparse attention, MoE routing, alternative tokenizers, and optimizer tweaks.
Training Phases: TST splits pretraining into an early superposition phase trained on token bags and a subsequent standard next-token prediction phase, so the model transitions smoothly to the usual objective (see the schedule sketch after this list).
Independent Ablations: Input averaging and output multi-hot cross-entropy each provide distinct efficiency benefits that combine additively.
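As a rough illustration of the phase schedule described above, a pretraining loop could switch from the bagged superposition objective to standard next-token prediction after a fixed fraction of steps. The 50% cut-over, bag size, and surrounding names (model, model_trunk, embed, data_loader, optimizer, total_steps, vocab_size) are placeholders for this sketch, and the helpers reuse the functions shown earlier.

```python
# Hypothetical two-phase schedule; the cut-over fraction and bag size are
# assumptions for illustration, not values reported by the authors.
SUPERPOSITION_FRACTION = 0.5
BAG_SIZE = 4

for step, batch in enumerate(data_loader):
    optimizer.zero_grad()
    if step < SUPERPOSITION_FRACTION * total_steps:
        # Phase 1: superposition training on contiguous token bags.
        bagged = bag_inputs(embed(batch.input_ids), BAG_SIZE)
        targets = bag_targets(batch.target_ids, BAG_SIZE, vocab_size)
        loss = superposition_loss(model_trunk(bagged), targets)
    else:
        # Phase 2: standard next-token prediction with ordinary cross-entropy.
        logits = model(batch.input_ids)
        loss = F.cross_entropy(logits.flatten(0, 1), batch.target_ids.flatten())
    loss.backward()
    optimizer.step()
```

Because only the input bagging and the loss change between phases, the same model, optimizer, and data pipeline are used throughout, consistent with the claim that TST leaves architecture, optimizer, tokenizer, and data untouched.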
