Is This The FASTEST AI Model In The World?!! (Xiaomi MiMo V2.5 Pro UltraSpeed)

Better Stack · 8:54 · en

Transcript

0:00

Holy cow. Xiaomi, you know, the Chinese

company that makes phones just made an

AI model which might be the fastest in

the world. It's called Xiaomi MiMo V2.5

UltraSpeed, and it is truly

mind-blowing. In today's video, we'll

take a look at this model, see how it

works, and I actually managed to get

early access to this model, so we'll

also test it out with some interesting

examples to see how fast it actually is.

It's going to be a lot of fun. So, let's

dive into it.

Before we look under the hood of this

model, let's see what massive

differences we're actually dealing with

here. So, on frontier models like GPT-

5.5 or Claude 4 Opus, you're often

waiting through massive reasoning lags,

scraping by at roughly 50 or 60 tokens

per second. Now, that's not bad, but

it's kind of slow. But Xiaomi's new MiMo

UltraSpeed model is clocking in at over

1,000 tokens per second. And what's even

crazier is the fact that this model is

also massive in size. It's a

one-trillion-parameter mixture-of-experts

model. So you might be thinking, okay,

they're probably using some kind of

super advanced custom hardware setup for

this. Well, actually not quite. Xiaomi

teamed up with their systems partner

TileRT and they achieved this by using

just a single standard server with eight

commodity GPUs. But if that's not the

answer, then that begs the question, how

do you force a trillion-parameter model

to spit out text at microsecond speeds

on standard hardware? Well, they came

up with something they call extreme

model-system co-design. They attacked

the latency bottleneck from three

different angles simultaneously. First,

they optimized memory bandwidth. Moving

a trillion parameters through GPU memory

during the text generation phase creates

massive traffic jams. To fix this, Xiaomi

used MXFP4 quantization. But because

4-bit compression can normally make an

AI less accurate, they used quantization-

aware training, or QAT, and they kept the

core routing layers at a higher

precision. This alleviated the memory

pressure while keeping the model's

intelligence nearly identical to the

uncompressed version. Second, they

fundamentally changed the way the model

predicts words. So, standard speculative

decoding works by having a tiny draft

model guess a few words ahead and then

the massive main model checks the math.
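
A minimal sketch of the draft-then-verify loop just described, in Python. It is a toy under stated assumptions, not anyone's production code: `toy_draft` and `toy_target` are hypothetical stand-ins for the small and large models, and verification here is simple greedy exact-match, whereas real systems typically use a probabilistic acceptance rule and batch the checks into one forward pass.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft_next: NextToken,
                     target_next: NextToken, k: int) -> List[int]:
    """Run one draft-then-verify round and return the tokens it accepts."""
    # Draft: the small model guesses k tokens, one after another.
    drafted: List[int] = []
    for _ in range(k):
        drafted.append(draft_next(prefix + drafted))

    # Verify: the large model checks the guesses in order. Written as a loop
    # for clarity; a real system would score all k positions in one pass.
    accepted: List[int] = []
    for guess in drafted:
        correct = target_next(prefix + accepted)
        if guess != correct:
            accepted.append(correct)  # fix the first wrong guess, then stop
            return accepted
        accepted.append(guess)

    # Every guess matched, so the verify pass also yields one extra token.
    accepted.append(target_next(prefix + accepted))
    return accepted


def generate(prompt: List[int], draft_next: NextToken, target_next: NextToken,
             max_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Generate max_new tokens; also return how many verify rounds it took."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < max_new:
        out.extend(speculative_step(out, draft_next, target_next, k))
        rounds += 1
    return out[: len(prompt) + max_new], rounds


def toy_target(ctx: List[int]) -> int:
    """Hypothetical 'large model': a deterministic function of the context."""
    return (sum(ctx) + len(ctx)) % 10


def toy_draft(ctx: List[int]) -> int:
    """Hypothetical 'draft model': agrees with the target most of the time."""
    guess = toy_target(ctx)
    return (guess + 1) % 10 if len(ctx) % 4 == 3 else guess


if __name__ == "__main__":
    tokens, rounds = generate([1, 2, 3], toy_draft, toy_target, max_new=20)
    print(f"generated {len(tokens) - 3} tokens in {rounds} verify rounds")
```

The point of the design is that the output is identical to what the large model would have produced on its own, while the number of large-model rounds drops whenever the draft guesses well.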

But Xiaomi did something different

here with what they call DFlash. Instead

of guessing one token at a time, it

predicts an entire block of hidden

tokens all at once in one parallel forward

pass. And through testing, they

discovered that when you use it for

coding tasks, the main model actually

keeps an average of 6.3 out of every

eight tokens that DFlash guesses. So, it

essentially lets the model take massive

eight-token leaps forward at a time

instead of taking baby steps. And third,

they used a special engine, which

solves a really annoying hardware

bottleneck. So when you're pushing a

thousand tokens a second, standard GPUs

actually can't keep up with the

instruction logic. Normally a GPU

launches a math operation, finishes it,

clears out the memory, and then waits to

launch the next one. And even though

these pauses only last microseconds, they

completely kill your momentum. To fix

that, TileRT built a persistent engine

kernel that just sits inside the GPU and

never leaves. They used a trick called

warp specialization to assign permanent

roles to different parts of the

hardware. While one section is moving

data, another is running the math and a

third is handling communication all at

the exact same time. So, the pipeline

literally never stops moving. And this

is so interesting because I just did a

video on diffusion Gemma, which is also

super fast, but it tackles the same

problem in a very different way. So

check out that video if you're

interested. And that, my friends, is how

Xiaomi gets to 1,000-token-per-second

speeds, allegedly. But now, let's

actually test it out and see if this

promise holds up. So for my first test,

I decided to take one of LeetCode's

hard questions and run it by the model.

And it was blazingly fast. How wild is

that? Plus, as we can see here, it

peaked at 3,451

tokens per second, which is absolutely

insane. Now, there might be a

possibility that this LeetCode question

was part of the model's training data.

So, as impressive as this looks, it's

probably not a fair comparison. So,

let's move on to something more

sophisticated. Next, I asked it to build

a simple UI personal finance dashboard

in one single HTML file with no external

libraries and nothing too fancy. And in

this test, we could now actually see how

insanely performant it is. It was

averaging about 700 tokens per second

for the reasoning part and about 1,000

tokens per second for the output

operations. And it took the model just

65 seconds to complete the task. And I

think the result is pretty good. Albeit

some of the buttons are not working and

some of the actions are broken, but the

design as a whole is pretty good. I

mean, not bad for a one-minute task. So

then I decided to challenge the model to

build something even more sophisticated.
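
A quick aside on the numbers so far, since a peak like 3,451 tokens per second and averages of around 700 to 1,000 measure different things. The sketch below, in Python, shows one way to compute both from a streaming log. The function name `throughput_stats` and the sample data are made up for illustration; they are not measurements from the video.

```python
from typing import List, Tuple


def throughput_stats(samples: List[Tuple[float, int]]) -> Tuple[float, float]:
    """Return (average, peak) tokens per second.

    samples holds (timestamp_in_seconds, cumulative_token_count) pairs,
    sorted by time. The peak is the fastest single interval between two
    consecutive samples; the average covers the whole run.
    """
    (t_first, n_first), (t_last, n_last) = samples[0], samples[-1]
    average = (n_last - n_first) / (t_last - t_first)

    peak = 0.0
    for (t_a, n_a), (t_b, n_b) in zip(samples, samples[1:]):
        peak = max(peak, (n_b - n_a) / (t_b - t_a))
    return average, peak


if __name__ == "__main__":
    # Hypothetical log: one sample per second with a short burst in the middle.
    log = [(0.0, 0), (1.0, 600), (2.0, 1300), (3.0, 4500), (4.0, 5000)]
    average, peak = throughput_stats(log)
    print(f"average: {average:.0f} tok/s, peak: {peak:.0f} tok/s")
```

A short burst can lift the peak far above the sustained rate, so the average is usually the fairer number to compare across models.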

I prompted it to build a Khan Academy-style

math explainer web page showcasing

10 popular math concepts to see how

complex a website we can actually

produce here. And this is where things

started getting a bit rough. I tried

this test twice and both times after

about two or three minutes, the model

just stopped generating and completely

froze. So I assumed that with this task,

I hit the model's context limit or maybe

Xiaomi has put a rate limiter of some

sort. So then I decided to simplify the

task a bit by asking it to design a web

page with only five mathematical

concepts. And this time it finally

worked. It managed to finish the task in

75 seconds. And the output is actually

quite nice. And the first three

mathematical concept widgets are

actually functional. But everything past

that point is broken, nonfunctional or

empty. So I don't know what exactly

happened here. Maybe the model dropped

some of its context during the reasoning

phase. But nonetheless, I think this is

a pretty good result, especially taking

into consideration that we were

averaging 500 tokens per second during

the reasoning phase. And for my last

test, I decided to do something a little

bit more fun. I just simply prompted

this very short sentence to build a

Subway Surfer clone using Three.js. And it

actually managed to build a fully

functional Subway Surfer clone in just

50 seconds. Now, that is crazy. I do

have to say that although it is

functional, as you can see here, it

doesn't include any obstacles or coins

or anything like that. So, it's kind of

boring. So, I then decided to give it a

follow-up prompt to fix these minor

issues. And after two passes, it managed

to successfully add some coins and some

obstacles. And honestly, when I was

testing it, this was a flawless demo.

The functionality was there. Everything

was working. It was even saving my high

score after every round. So, this

particular demo really surprised me in a

very positive way. I'm sure nowadays we

can all build Subway Surfer clones with

other models as well, but the fact that

I could get a working prototype, which

is not completely terrible, and which is

actually fun to play, and all of that in

just 50 seconds with some follow-up

prompts. That is pretty impressive. So,

as we all saw in the tests, the model

managed to reach a record speed of more

than 3,000 tokens per second. So, this

is indeed the absolute fastest model

I've ever seen. And as far as the

outputs go, I mean, yeah, sure, some of

them are broken, some of them are

half-baked. Surely this is no Claude Opus

or GPT-5.5, but I'm sure that Xiaomi's

models will definitely keep improving

over time. So, it's going to be very

interesting to see what they come up

with in the future. So, there you have

it, folks. That is MiMo V2.5 UltraSpeed

in a nutshell. So, what do you think

about this model? Are you impressed,

disappointed, indifferent? Let us know

in the comments section down below. And

folks, if you like these types of

technical breakdowns, please let me know

by smashing that like button underneath

the video. And also, don't forget to

subscribe to our channel. This has been

Andress from Better Stack, and I will

see you in the next videos.