Xiaomi MiMo launches UltraSpeed inference mode: single node breaks 1,000 tokens/s
The Xiaomi MiMo team, working with TileRT, a group that builds AI compiler optimization systems, has launched the MiMo-V2.5-Pro-UltraSpeed inference mode. With it, a trillion-parameter MoE model generates more than 1,000 tokens/s on a standard 8-GPU node, peaking at around 1,200 tokens/s. This marks the first time a 1T-parameter model has exceeded 1,000 tokens/s on off-the-shelf hardware, without specialized chips.
Why it matters: Generation at this speed on standard GPU servers lowers inference costs for large models and makes real-time inference practical for trillion-parameter MoE models, which should accelerate the deployment of AI applications.
#Xiaomi #MiMo #AI #Inference #LargeModel