Nous Research unveils Lighthouse Attention, boosting training speed by 1.4-1.7× at 98K context

Nous Research today announced Lighthouse Attention, a hierarchical attention mechanism for long-context pre-training that delivers 1.4 to 1.7 times faster wall-clock pre-training at 98K context than standard attention. The method pools queries, keys, and values symmetrically into a multi-level pyramid, which lets the forward and backward passes run approximately 17 times faster without custom sparse attention kernels. The release fits a broader push among leading AI labs and open-source communities to extend transformer context windows, address the long-sequence bottlenecks of frontier language models, and keep benefiting from ongoing optimizations in established libraries such as FlashAttention.
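To make the "multi-level pyramid" concrete, here is a minimal pure-Python sketch of symmetric pooling. It is an illustration under stated assumptions, not the Lighthouse implementation: the announcement does not specify the pooling operator, the pooling factor, or the number of levels, so mean pooling with a factor of 2 and the helper names `mean_pool` and `build_pyramid` are hypothetical choices.

```python
from typing import List

Vector = List[float]
Sequence = List[Vector]


def mean_pool(seq: Sequence, factor: int = 2) -> Sequence:
    """Average consecutive groups of `factor` vectors (the last group may be shorter).

    Assumption: mean pooling is used only for illustration; the source does not
    name the pooling operator.
    """
    pooled = []
    for start in range(0, len(seq), factor):
        group = seq[start:start + factor]
        dim = len(group[0])
        pooled.append([sum(vec[d] for vec in group) / len(group) for d in range(dim)])
    return pooled


def build_pyramid(seq: Sequence, levels: int, factor: int = 2) -> List[Sequence]:
    """Return [level 0 (full resolution), level 1, ...], each about `factor` times shorter."""
    pyramid = [seq]
    for _ in range(levels):
        pyramid.append(mean_pool(pyramid[-1], factor))
    return pyramid


# "Symmetric" here means the same pooling is applied to queries, keys, and values.
q = [[float(i), 1.0] for i in range(8)]
k = [[1.0, float(i)] for i in range(8)]
v = [[float(i), float(i)] for i in range(8)]
q_pyr, k_pyr, v_pyr = (build_pyramid(x, levels=3) for x in (q, k, v))

print([len(level) for level in q_pyr])  # sequence length at each pyramid level
```

Each level halves the sequence length, so the coarse levels are cheap to score even when the full-resolution sequence is very long.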

bloc97: bloc97 is a machine learning and systems researcher known for work on efficient attention mechanisms and large language model training techniques, often focusing on practical performance improvements on modern GPUs. In this announcement, bloc97 is identified as one of the leads on the Lighthouse Attention project, helping design and validate the hierarchical selection approach and its high-throughput implementation.
Mozilla: In this context, “Mozilla” refers to the researcher using the handle @theemozilla, associated with work on large language models and attention architectures rather than the Mozilla Corporation behind Firefox. The announcement credits this individual as one of the leads on Lighthouse Attention, contributing to the algorithmic design and experimental evaluation of long-context pre‑training with hierarchical selection.
Subho Ghosh: Subho Ghosh is an AI researcher and engineer who works on large-scale transformer training, attention algorithms, and high-performance PyTorch-based tooling. The announcement names Subho Ghosh as a lead author and maintainer of the Lighthouse Attention paper and GitHub repository, which provide the reference implementation and training configurations used to demonstrate the method.
Lighthouse Attention: Lighthouse Attention is a selection-based hierarchical attention mechanism that makes training large language models with very long contexts more efficient. It wraps standard scaled dot-product or FlashAttention kernels with a sparse, pyramid-style selection layer. The announcement presents it as a training-time method that significantly accelerates long-context pre-training while preserving the model's ability to use full dense attention at inference, and it ships with an open-source implementation and a paper detailing its design and benchmarks.
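The idea of wrapping a dense kernel with a selection layer can be sketched in a few lines of pure Python. This is a toy under loud assumptions, not the method from the paper: scoring blocks by their mean key, the `block` and `top_k` parameters, and the function names `select_blocks` and `selected_attention` are all hypothetical, causal masking is omitted, and `dense_attention` merely stands in for a scaled dot-product or FlashAttention kernel.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def dense_attention(q: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Plain softmax attention for one query (stand-in for a dense kernel)."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, key) * scale for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * val[d] for w, val in zip(weights, values)) / total for d in range(dim)]


def select_blocks(q: Vector, keys: List[Vector], block: int, top_k: int) -> List[int]:
    """Score each key block by its mean key and return token indices of the top_k blocks.

    Assumption: mean-key scoring is a hypothetical selection rule for illustration.
    """
    n_blocks = math.ceil(len(keys) / block)
    scored = []
    for b in range(n_blocks):
        chunk = keys[b * block:(b + 1) * block]
        mean_key = [sum(key[d] for key in chunk) / len(chunk) for d in range(len(q))]
        scored.append((dot(q, mean_key), b))
    kept = sorted(b for _, b in sorted(scored, reverse=True)[:top_k])
    return [i for b in kept for i in range(b * block, min((b + 1) * block, len(keys)))]


def selected_attention(q, keys, values, block=2, top_k=2):
    """Gather the selected tokens, then run ordinary dense attention on that subset."""
    idx = select_blocks(q, keys, block, top_k)
    return dense_attention(q, [keys[i] for i in idx], [values[i] for i in idx])


query = [1.0, 0.0]
keys = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [-1.0, 0.0], [-1.0, 0.0]]
values = [[float(i)] for i in range(6)]

sparse_out = selected_attention(query, keys, values, block=2, top_k=1)
full_out = selected_attention(query, keys, values, block=2, top_k=3)
```

Because the selection step only gathers tokens, the attention call itself stays dense. That is why a design of this kind can reuse existing kernels during training and fall back to full dense attention at inference, which in this toy corresponds to selecting every block.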

```json
{
  "Long-context_trend": "Leading AI labs and open-source communities are increasingly focused on extending transformer context windows through architectural changes, aiming to address the challenges of long-sequence modeling in language models.",
  "Open_source_tooling": "In recent times, there has been a notable emphasis on releasing long-context training techniques as reproducible open-source patches compatible with popular frameworks like PyTorch, facilitating experimentation on multi-GPU clusters.",
  "Efficient_attention_research": "Current research prioritizes attention mechanisms that remain compatible with standard dense kernels, allowing models to leverage ongoing optimizations in libraries such as FlashAttention and cuDNN without the need for custom sparse kernels."
}
```