Is This The FASTEST AI Model In The World?!! (Xiaomi MiMo V2.5 Pro UltraSpeed)

Better Stack · 8:54 · v1.1

Overview

This is a short solo explainer from Andress at Better Stack, walking through the engineering behind Xiaomi's MiMo V2.5 Pro UltraSpeed — a one-trillion-parameter AI model developed with systems partner TileRT that claims to exceed 1,000 tokens per second on standard server hardware. The episode covers the three technical optimisations that enable the speed, then tests the model live on a series of coding tasks.

Bottom Line

The episode is useful for developers and AI enthusiasts who want a quick, accessible explanation of how extreme inference speed can be achieved without custom hardware. It requires light attention — the technical content is explained plainly, and the live tests give a concrete sense of the model's real-world behaviour. It is mainly worth the time for people already following AI model releases; it is too brief and surface-level to serve as a deep technical reference.

What Was Discussed

Speed in context. The episode opens by framing the model's claimed speed against current frontier models. GPT-4.5 and Claude Opus are described as running at roughly 50–60 tokens per second. MiMo V2.5 Pro UltraSpeed reportedly exceeds 1,000 tokens per second, peaking at 3,451 tokens per second during one test, on a single server with eight standard GPUs rather than specialised hardware.

The three engineering layers. Xiaomi and TileRT describe their approach as attacking latency from three directions simultaneously.

The first is selective FP4 quantisation. Compressing a trillion-parameter model to 4-bit precision reduces the memory bandwidth required during inference, but typically degrades output quality. To address this, they used quantisation-aware training (QAT) and kept the model's core routing layers at higher precision, aiming to preserve accuracy while easing memory pressure.
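
As a rough illustration of the selective idea, the Python sketch below rounds weights to a 4-bit grid while leaving any layer tagged as a router untouched. It is not Xiaomi's or TileRT's code: the E2M1 value grid, the per-tensor scaling, the layer names, and the 16-bit baseline used in the memory comparison are all assumptions, and it applies plain post-training rounding rather than the quantisation-aware training described in the episode.

```python
# Magnitudes representable by a 4-bit E2M1 float (1 sign, 2 exponent, 1 mantissa bit).
# Assumption: the episode does not say which FP4 layout is actually used.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_fp4(weights):
    """Round each weight to the nearest FP4 value after per-tensor scaling."""
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)
    scale = max_abs / FP4_GRID[-1]  # map the largest magnitude onto 6.0
    out = []
    for w in weights:
        mag = min(FP4_GRID, key=lambda g: abs(abs(w) / scale - g))
        out.append((mag if w >= 0 else -mag) * scale)
    return out


def quantize_model(layers, keep_high_precision=("router",)):
    """Quantise every layer except those whose name contains a protected tag."""
    result = {}
    for name, weights in layers.items():
        if any(tag in name for tag in keep_high_precision):
            result[name] = list(weights)  # left at original precision
        else:
            result[name] = quantize_fp4(weights)
    return result


def weight_bytes(n_params, bits_per_param):
    """Storage needed for the weights alone, in bytes."""
    return n_params * bits_per_param / 8


# Hypothetical two-layer "model": one expert layer, one routing layer.
toy_layers = {"expert_0": [3.0, -1.2, 0.2, 0.7], "router": [0.123, -0.456]}
toy_quantized = quantize_model(toy_layers)
print(toy_quantized["expert_0"])  # [3.0, -1.0, 0.25, 0.75]
print(toy_quantized["router"])    # unchanged: [0.123, -0.456]
print(weight_bytes(10**12, 4) / 1e9, "GB at 4 bits")
print(weight_bytes(10**12, 16) / 1e9, "GB at 16 bits")
```

Under these assumptions, the weights of a one-trillion-parameter model occupy about 500 GB at 4 bits against about 2 TB at 16 bits, which is the kind of reduction in memory traffic the technique is after.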

The second is DFlash speculative decoding. Standard speculative decoding uses a small draft model to guess one token at a time, which the main model then verifies. DFlash instead predicts an entire block of tokens in a single parallel forward pass. In coding tasks, the main model reportedly accepts an average of 6.3 out of every 8 predicted tokens, allowing the system to advance in large steps rather than incrementally.
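
The block-then-verify loop can be sketched in a few lines of Python. This is a toy model of block speculative decoding in general, not DFlash itself: `toy_target`, `toy_draft`, and the block size of 8 are hypothetical stand-ins, the "models" are simple functions over integer tokens, and the target's single parallel verification pass is emulated with a loop.

```python
def verify_block(draft_tokens, target_tokens):
    """Count how many leading draft tokens match what the target would emit."""
    accepted = 0
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            break
        accepted += 1
    return accepted


def speculative_generate(target_next, draft_block, prompt, n_tokens, block_size=8):
    """Greedy block speculative decoding over toy token lists.

    target_next(seq) returns the target model's next token.
    draft_block(seq, k) returns k proposed tokens in one call.
    Returns the generated tokens and the number of target verification passes.
    """
    seq = list(prompt)
    target_passes = 0
    while len(seq) - len(prompt) < n_tokens:
        proposal = draft_block(seq, block_size)
        # A real system scores every position in one forward pass;
        # here that pass is emulated position by position.
        expected, ctx = [], list(seq)
        for tok in proposal:
            expected.append(target_next(ctx))
            ctx.append(tok)
        target_passes += 1
        n_ok = verify_block(proposal, expected)
        seq.extend(proposal[:n_ok])
        if n_ok < len(proposal):
            seq.append(expected[n_ok])  # target's own token at the first mismatch
    return seq[len(prompt):len(prompt) + n_tokens], target_passes


def toy_target(seq):
    """Hypothetical target model: always counts upward by one."""
    return seq[-1] + 1


def toy_draft(seq, k):
    """Hypothetical draft model: counts upward but gets multiples of 7 wrong."""
    cur, block = seq[-1], []
    for _ in range(k):
        cur += 1
        block.append(-1 if cur % 7 == 0 else cur)
    return block


tokens, passes = speculative_generate(toy_target, toy_draft, [0], 20)
print(tokens)  # same sequence plain greedy decoding would produce: 1..20
print(passes)  # 3 target passes instead of 20
```

The output is identical to what the target model would have produced on its own; only the number of target passes changes. With the reported average of 6.3 accepted tokens per block of 8, each verification pass would advance the sequence by roughly six tokens rather than one.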

The third is TileRT's persistent engine kernel. At high throughput, standard GPUs lose time clearing state between operations — pauses measured in microseconds that compound at scale. TileRT's solution is a kernel that remains permanently loaded inside the GPU. Using a technique called warp specialisation, different sections of the hardware handle data movement, computation, and communication simultaneously, keeping the pipeline continuously active.
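
A loose CPU-side analogy may help here. The Python sketch below is not GPU code and does not reflect TileRT's implementation: threads stand in for specialised warps, queues stand in for on-chip buffers, and every name is hypothetical. It shows only the structural idea that workers are started once, each keeps a single role (data movement, computation, or communication), and work flows between them without being torn down and relaunched per operation.

```python
import queue
import threading

STOP = object()  # sentinel that tells each worker to shut down


def stage(fn, inbox, outbox):
    """Long-lived worker: started once, then loops until it sees STOP."""
    while True:
        item = inbox.get()
        if item is STOP:
            outbox.put(STOP)  # pass the shutdown signal downstream
            return
        outbox.put(fn(item))


def run_pipeline(tiles):
    """Push tiles through three persistent, role-specialised workers."""
    q_load, q_compute, q_send, q_done = (queue.Queue() for _ in range(4))
    roles = [
        (lambda t: list(t), q_load, q_compute),                # "data movement"
        (lambda t: sum(x * x for x in t), q_compute, q_send),  # "computation"
        (lambda r: ("sent", r), q_send, q_done),               # "communication"
    ]
    workers = [threading.Thread(target=stage, args=r) for r in roles]
    for w in workers:
        w.start()
    for t in tiles:
        q_load.put(t)
    q_load.put(STOP)

    results = []
    while True:
        item = q_done.get()
        if item is STOP:
            break
        results.append(item)
    for w in workers:
        w.join()
    return results


print(run_pipeline([(1, 2), (3, 4)]))  # [('sent', 5), ('sent', 25)]
```

Because of Python's global interpreter lock these threads do not truly compute in parallel, so the sketch conveys the pipeline shape rather than any speed-up.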

Live coding tests. The model was tested on four tasks. It solved a hard LeetCode problem very quickly, though the presenter notes the question may have been in the training data. It built a personal finance dashboard in a single HTML file in 65 seconds, with most functionality working. An attempt to generate a maths explainer page with ten concepts caused the model to freeze twice; reducing the scope to five concepts produced a result in 75 seconds, though content beyond the first three sections was broken or empty. Finally, it produced a functional Subway Surfers clone in Three.js in 50 seconds, which the presenter describes as genuinely playable after two follow-up prompts.

Notable Points

DFlash acceptance rate as a practical metric. The claim that the main model accepts 6.3 out of 8 DFlash-predicted tokens during coding tasks is a specific and measurable figure. If accurate, it suggests the draft predictions are highly aligned with the main model's behaviour for structured code generation — which would be a meaningful engineering result, not just a speed claim.

Context instability under load. The model froze twice when asked to generate a ten-concept maths explainer, and the incomplete output on the five-concept version suggests it may drop context during long reasoning phases. The presenter flags this limitation directly, noting that the model is not yet comparable to Claude Opus or GPT-4.5 in output reliability.

Peak speed versus sustained speed. The 3,451 tokens-per-second figure was observed on a short, structured task (LeetCode) that may have been in training data. Sustained speeds across longer generative tasks were lower — around 500–700 tokens per second for reasoning and approximately 1,000 for output — which is still fast but meaningfully different from the headline number.
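
To see how much the distinction matters, the gap can be put into seconds. The short Python sketch below uses the rates quoted in the episode, but the workload (6,000 reasoning tokens followed by 4,000 output tokens) is a hypothetical example chosen only to make the arithmetic concrete, and 600 and 55 tokens per second are midpoints of the quoted ranges.

```python
def generation_seconds(reasoning_tokens, output_tokens, reasoning_tps, output_tps):
    """Wall-clock estimate when reasoning and output run at different rates."""
    return reasoning_tokens / reasoning_tps + output_tokens / output_tps


# Hypothetical workload, not a figure from the episode.
REASONING_TOKENS = 6_000
OUTPUT_TOKENS = 4_000

sustained = generation_seconds(REASONING_TOKENS, OUTPUT_TOKENS, 600, 1_000)
headline = generation_seconds(REASONING_TOKENS, OUTPUT_TOKENS, 3_451, 3_451)
frontier = generation_seconds(REASONING_TOKENS, OUTPUT_TOKENS, 55, 55)

print(f"sustained rates: {sustained:.1f} s")
print(f"headline peak:   {headline:.1f} s")
print(f"50-60 tok/s:     {frontier:.1f} s")
```

On this assumed workload the sustained rates give about 14 seconds, the headline peak would give under 3 seconds, and a model at 50–60 tokens per second would take around three minutes.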

Trillion-parameter scale on commodity hardware. The episode's core claim — that a one-trillion-parameter model can run at these speeds on eight standard GPUs — is notable because scale of this kind has typically required specialised infrastructure. The engineering choices described are framed as the mechanism that makes this possible, though the episode does not provide independent verification.
