Information from 苏格 (@suge)

GLM-5.2 gives open-weight coding models a real 1M-token context window. The hard part is serving that full window on the hardware many teams already run in production: Hopper.
We quantized GLM-5.2-FP8 into W4AFP8 and validated it on a single 8×H200 node with SGLang. The checkpoint cuts weight memory from 755 GB to 368 GB, freeing 387 GB of HBM for the 1M-token KV cache and runtime headroom.
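The memory arithmetic behind those figures can be checked with a short sketch. The weight sizes are the ones quoted above. The 141 GB per-GPU capacity is the published H200 HBM3e figure. Even sharding of weights across the eight GPUs is an assumption made here for illustration, and the `budget` helper is hypothetical, not part of SGLang or the checkpoint tooling.

```python
# Back-of-the-envelope HBM budget for a single 8xH200 node.
# Weight sizes come from the text above; 141 GB is the published H200 capacity.
# Assumption: weights are sharded evenly across all GPUs.

NUM_GPUS = 8
HBM_PER_GPU_GB = 141

WEIGHTS_GB = {"FP8": 755, "W4AFP8": 368}


def budget(weights_gb: float) -> dict:
    """Return node-level and per-GPU memory left after loading the weights."""
    node_hbm = NUM_GPUS * HBM_PER_GPU_GB
    free_node = node_hbm - weights_gb
    return {
        "node_hbm_gb": node_hbm,
        "weights_per_gpu_gb": weights_gb / NUM_GPUS,
        "free_node_gb": free_node,
        "free_per_gpu_gb": free_node / NUM_GPUS,
    }


if __name__ == "__main__":
    for name, size in WEIGHTS_GB.items():
        b = budget(size)
        print(
            f"{name:7s} weights={size} GB  free={b['free_node_gb']} GB "
            f"({b['free_per_gpu_gb']:.1f} GB per GPU)"
        )
    freed = WEIGHTS_GB["FP8"] - WEIGHTS_GB["W4AFP8"]
    print(f"freed by quantization: {freed} GB")
```

Under these assumptions, the FP8 checkpoint leaves about 373 GB across the node (roughly 46.6 GB per GPU), while W4AFP8 leaves about 760 GB (95 GB per GPU). Both figures are upper bounds on what remains for the KV cache, since CUDA graphs, runtime buffers, and serving overhead also come out of that remainder.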
Why this matters
GLM-5.2 already solved the model side of long context: sparse attention, IndexShare, MTP speculative decoding, tool use, reasoning, and a 1,048,576-token window. Deployment is the remaining problem. Serving a 1M-token window takes HBM for the model weights, the KV cache, CUDA graphs, runtime buffers, and serving overhead, all at the same time.
The official FP8 checkpoint is the right general serving baseline. On Hopper, that baseline leaves much less memory slack once you push toward the full context window. W4AFP8 changes the memory budget without changing the model family, tokenizer, API shape, or GLM-5.2 behavior.