Nous Research just dropped Lighthouse Attention, and it's a beast for long-context training.

The numbers: 17x faster attention at 512K context on a single B200, and a 1.4-1.7x end-to-end training speedup on 98K sequences.

The problem with vanilla attention? Quadratic complexity murders your compute as context grows: every token attends to every other token, so cost explodes at scale.
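To see how brutal the quadratic term is, here's a back-of-the-envelope sketch (the `2 * n^2 * d` FLOP estimate and `d_head=128` are illustrative assumptions, not numbers from Nous):

```python
def attn_flops(n_tokens: int, d_head: int = 128) -> float:
    """Rough FLOPs for the attention score/weighting matmuls: ~2 * n^2 * d per head.
    Illustrative estimate only; ignores projections, layers, and heads."""
    return 2 * (n_tokens ** 2) * d_head

short = attn_flops(32_000)    # 32K context
long = attn_flops(512_000)    # 512K context
print(long / short)           # 16x the tokens -> 256x the attention compute
```

Grow the context 16x and the attention matmuls alone cost 256x more. That's the wall Lighthouse is built to avoid.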

Lighthouse flips the script:

• Hierarchical scan of compressed text summaries

• Smart scoring to cherry-pick the important chunks

• Feed only the relevant pieces to FlashAttention

• Zero custom CUDA kernels needed

• No extra training objectives
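The core idea, scoring compressed chunk summaries and handing only the winners to dense attention, can be sketched in a few lines. This is a toy flat version, not Nous's hierarchical implementation; `select_chunks`, the dot-product scoring, and `top_k` are all my hypothetical stand-ins:

```python
def select_chunks(query, chunk_summaries, top_k=4):
    """Score each compressed chunk summary against the query, keep the top-k.
    Toy sketch: Lighthouse uses a hierarchical scan, not this flat pass."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    scores = [(dot(summary, query), i) for i, summary in enumerate(chunk_summaries)]
    scores.sort(reverse=True)
    # Keep document order so the surviving chunks feed into attention cleanly.
    return sorted(i for _, i in scores[:top_k])

query = [1.0, 0.0, 1.0]
summaries = [
    [0.1, 0.2, 0.1],  # chunk 0: low relevance
    [0.9, 0.0, 0.8],  # chunk 1: high relevance
    [0.0, 0.5, 0.0],  # chunk 2: orthogonal to the query
    [0.7, 0.1, 0.9],  # chunk 3: high relevance
]
print(select_chunks(query, summaries, top_k=2))  # [1, 3]
```

Only the selected chunks ever hit FlashAttention, which is why no custom CUDA kernel is needed: the sparsity lives in the selection step, and the attention itself stays dense over a much smaller input.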

The killer feature? They solved the "lazy reading" problem. Most sparse attention methods wreck a model's ability to do dense reasoning. Nous trains 95%+ of steps with sparse attention, then runs a short dense-attention phase at the end to recalibrate.
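That schedule is trivial to express. A minimal sketch, assuming the dense phase is simply the last ~5% of steps (the post only says "95%+ sparse, short dense phase at the end"; the exact split is my assumption):

```python
def use_dense_attention(step: int, total_steps: int, dense_frac: float = 0.05) -> bool:
    """Sparse attention for most of training, dense attention for the final stretch.
    dense_frac is a hypothetical knob standing in for Nous's actual schedule."""
    return step >= int(total_steps * (1 - dense_frac))

# In a 1000-step run, steps 0-949 train sparse; steps 950-999 recalibrate dense.
dense_steps = sum(use_dense_attention(s, 1000) for s in range(1000))
print(dense_steps)  # 50
```

The cheap part (sparse) dominates wall-clock time, while the short dense tail restores the model's ability to read everything when it has to.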

Tested on 530M-parameter models with 50B tokens. Result? Matches or beats full-attention baselines while slashing training time.

This isn't just academic flexing: it's production-ready infrastructure for anyone building long-context AI agents or RAG systems. No more choosing between context length and your AWS bill.

Lighthouse is open source. If you're training anything past 32K context, check it out.